This will be an academic talk, as announced. I will try to bring some of the research I did during my PhD into the real world. We are going to talk about the security of software distribution, and I'm going to propose a security feature that builds on top of the signatures we have today and also on the reproducible builds that we already have to a very large degree. I am going to highlight a few points where I think infrastructure changes are required to accommodate this system, and I would also appreciate any feedback you might have. I'm going to give a few motivating examples of why we should care about this. In software distribution we already do have cryptographic signatures, but I've put up a few examples of recent attacks involving the distribution of software, where people who presumably thought they knew what they were doing had grave problems with software distribution. For example, the Juniper backdoors, pretty famous: Juniper discovered two backdoors in their code and nobody really knew where they were coming from. Another example would be Chrome extension developers who got their credentials phished and subsequently their extensions backdoored. Or, as another example, a signed update to a piece of banking software actually included malware and infected several banks. I hope this is motivation for us to consider these kinds of attacks to be possible and to prepare ourselves. I have two main goals in the system I am going to propose. The first is to relax trust in the archive. In particular, what I want to achieve is a level of security even if the archive is compromised, and the specific thing I am going to do is to detect targeted backdoors. That means backdoors that are distributed only to a subset of the population, and what we can achieve is to force the attacker to deliver the malware to everybody, thereby greatly decreasing their degree of stealth and increasing their risk of detection. This would work to our advantage. The second goal is forensic auditability, which overlaps with the first one to a surprising degree in technical terms, in terms of implementation. So, what I want to ensure is that we have inspectable source code for every binary. We do of course have the source code available for our packages, but only for the most recent version; everything else is a best effort by the code archiving services. The mapping between source and binary can, to a large extent, be verified once we have reproducible builds. I want to make sure that we can identify the maintainer responsible for the distribution of a particular package, and the system is also interested in providing attribution of where something went wrong, so that we are not in a situation where we notice something went wrong but don't really know where we have to look in order to find the problem, but instead have a specific and secured indication of where a compromise was coming from. Let's quickly recap how our software distribution works. We have the maintainers who upload their code to the archive. The archive has access to a signing key which signs the releases, that is, metadata covering all the actual binary packages. These are distributed over the mirror network, from where the apt clients will download the package metadata, that means the hash sums for the packages, their dependencies and so on, as well as the actual packages themselves. This central architecture has an important advantage, namely that the mirror network need not be trusted, right?
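To make that chain of trust concrete, here is a rough sketch (not the real apt code) of what the signed release gives us today: the signature covers the Release file, the Release file pins the Packages index by hash, and the Packages index pins every .deb by hash. The file paths are purely illustrative, and I assume the signature on the Release file has already been checked, for example with gpgv.

```python
# Rough sketch of today's chain of trust, not the real apt code.
# Assumes the OpenPGP signature on the Release file was already verified
# (e.g. with gpgv); the file paths below are illustrative only.
import hashlib
import re

def sha256_of(path):
    """Hex SHA-256 of a file on disk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def release_sha256_entries(release_path):
    """Parse the SHA256 section of a Release file into {name: hexdigest}."""
    entries, in_sha256 = {}, False
    with open(release_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("SHA256:"):
                in_sha256 = True
            elif in_sha256 and line.startswith(" "):
                m = re.match(r" ([0-9a-f]{64})\s+\d+\s+(\S+)", line)
                if m:
                    entries[m.group(2)] = m.group(1)
            else:
                in_sha256 = False
    return entries

# The signed Release file pins the Packages index ...
entries = release_sha256_entries("dists/stretch/Release")
assert sha256_of("main/binary-amd64/Packages") == entries["main/binary-amd64/Packages"]
# ... and the Packages index in turn carries a SHA256 field for every .deb,
# so a mirror cannot change any file without breaking a hash somewhere.
```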
We have the signature that covers all the contents of binary and source packages and the metadata, so the mirror network need not be trusted. On the other hand, it makes the archive and the signing key a very interesting target for attackers, because this central point controls all the signing operations. So this is a place where we need to be particularly careful and perhaps even do better than cryptographic signatures. This is where the main focus of this talk will be, although I will also consider the uploaders to some extent. We want to achieve two things: resistance against key compromise and targeted backdoors, and better support for auditing in case things go wrong. The approach that we choose is to make sure that everybody runs exactly the same software, or at least the parts of it they choose to install. If we think about that for a moment, this gives us a number of advantages. For example, all the analysis that's done on a piece of software immediately carries over to all other users of that software, right? Because if we haven't made sure that everybody installs the same software, they might not have exactly the same version, but perhaps some backdoored version. This also ensures that we cannot suffer targeted backdoors, by increasing the detection risk for attackers, and we also want to have cryptographic proof of where something went wrong. Now, to look at some pictures, I will present the data structure that we use in order to achieve these goals. The data structure is a hash tree, a Merkle tree, which is a data structure that operates over a list. So we have a list of these squares here, which represent the list items. In our case, the files containing the package metadata, that is, the dependencies and the hash sums of the packages, and also the source packages themselves, are going to be elements in this list. The tree works as follows. It uses a cryptographic hash function, which is a collision-resistant compressing function, and the labels of the inner nodes of the tree are computed as the hashes of their children. Ok? Once we have computed the root hash, the root label, we have fixed all the elements, and none of the elements can be changed without changing the root hash. We can exploit this in order to efficiently prove the following two properties. First of all, we can efficiently prove the inclusion of a given element in the list. If we know the tree root, this works as follows. Let's make a quick example: we see the third list item is marked with an X, and if I know the tree root, then the server operating the tree structure only needs to give me the three grey-marked labels, the three marked node values, and then I can recompute the root hash and be convinced that this element actually was contained in the list. The second property is that we can also efficiently verify the append-only operation of the list. So we can have a log server operating this kind of structure, and the log server need not be trusted; it's not going to be a trusted third party, but rather its operation can be verified from the outside. So, what does this design look like? The theoretical foundation is called a transparency overlay, and in our system it looks like this: we have the archive as per usual, we have a log server, and the archive will submit package metadata, the Release file, the Packages file containing dependencies and so on, and the source code into this log server.
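To make the inclusion proof concrete, here is a minimal sketch of how a client could recompute the root hash from one leaf and the handful of sibling labels (the grey nodes) the log server hands out. I'm assuming the Certificate Transparency hashing convention, a 0x00 prefix for leaves and 0x01 for inner nodes; the actual encoding of entries and proofs in this system may differ.

```python
# Minimal sketch of Merkle-tree inclusion-proof verification, following the
# Certificate Transparency hashing convention (0x00 for leaves, 0x01 for
# inner nodes). The real protocol for this system may encode things differently.
import hashlib

def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def verify_inclusion(leaf: bytes, index: int, tree_size: int,
                     proof: list, root: bytes) -> bool:
    """Recompute the root from one leaf plus its audit path.

    `proof` is the list of sibling labels (the grey nodes in the picture),
    ordered from the leaf level upwards, as handed out by the log server.
    """
    if index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    h = leaf_hash(leaf)
    for sibling in proof:
        if sn == 0:
            return False                 # proof is longer than the tree is deep
        if fn % 2 == 1 or fn == sn:
            h = node_hash(sibling, h)    # sibling sits to our left
            if fn % 2 == 0:
                while fn % 2 == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            h = node_hash(h, sibling)    # sibling sits to our right
        fn >>= 1
        sn >>= 1
    return sn == 0 and h == root
```

The point is that the proof contains only on the order of log2(n) hashes, so even for a log with hundreds of thousands of entries the client recomputes a couple of dozen hashes instead of downloading the whole list.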
The apt client will be augmented with an auditor component, and this auditor component is responsible for verifying the correct log operation as well as the inclusion of the downloaded release in the log. This is the mechanism by which we will be able to make sure that everybody is running the exact same version of the software they installed. A third component is the monitor. The monitor is also necessary to verify log operation and also to inspect the elements that are contained in the log. The monitor would then be run by individuals or groups of individuals that want to make sure of certain properties of the log. Alright, let's quickly recap. We have added this log server, which can prove two properties efficiently to the outside world, and we have the auditor and monitor components. The auditor is added to the apt client, and the monitor does additional investigative tasks. Now, in order to make this system work, I need to make a few assumptions. The archive will need to handle log submission and the distribution of certain log data structures; these are usually very small things given to the archive in response to a submission. Then I'm assuming a very consistent release frequency. The archive is responsible for distributing reproducible binaries in my architecture. I'm assuming that the buildinfo files are covered by the release file; I treat them as additional source metadata, so whenever the source package or the buildinfo file changes, I expect an increase in the binary version number. I also assume source-only uploads, and one additional thing: the keyring source package is treated by the archive as authoritative, and this keyring must have the special property that it is operated append-only, so that we can go back in time and see which keys were authorized at different points in time. The log server is a standalone server component that at the moment speaks an HTTP-based protocol. Probably one would want to have more than one, but we are going to have, I think, a much easier time running log servers than, for example, the Certificate Transparency people, because we only have one source of write access, namely the archive, so we can easily schedule the write access, and you can have read-only frontends that aren't quite as critical. The auditor component would need to be integrated into the apt client or library. It needs to do things like cryptographic verification, understand a few more file formats, and have some more network access. Parts of the proofs could probably also be distributed over the mirror network, so we need not necessarily do everything in direct communication with the log server. So, this covers the archive, the auditor and the log server. The monitoring servers have a few functions that are necessary for the verification of the log itself, meaning that they verify the append-only operation of the log, and they will also likely want to exchange the tree roots with other monitors and perhaps some auditors. The important verification functions over the log contents are validating the metadata of the Release, Packages and Sources files, namely making sure that these are complete, that the sources are available, that the versions are incremented correctly and so on. That's necessary to make sure that a compromised archive can't carry out certain attacks. Also in this category is the fact that we depend on a fixed release frequency, and monitors will also be verifying the upload ACL, meaning which keys are authorized to upload. Monitors would also be verifying reproducible builds in this scenario.
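As a small illustration of these metadata checks, here is a sketch of one monitor pass that compares two consecutive snapshots of a Packages file taken from the log and flags packages whose version went backwards or that silently disappeared. It assumes python-apt for Debian version comparison; the file names and the notion of a "snapshot" are simplified placeholders.

```python
# Sketch of one monitor check over the log contents: compare two consecutive
# snapshots of a Packages file and flag versions that go backwards or packages
# that vanish. Uses python-apt for Debian version comparison; the file names
# are placeholders for whatever the log hands back.
import apt_pkg

apt_pkg.init()  # initialise apt's version-comparison machinery

def parse_packages(path):
    """Map package name -> version from an (uncompressed) Packages file."""
    versions, name = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("Package: "):
                name = line.split(": ", 1)[1].strip()
            elif line.startswith("Version: ") and name is not None:
                versions[name] = line.split(": ", 1)[1].strip()
    return versions

def check_monotonic(old_path, new_path):
    """Yield (package, old_version, new_version) for suspicious transitions."""
    old, new = parse_packages(old_path), parse_packages(new_path)
    for pkg, old_ver in old.items():
        if pkg not in new:
            yield pkg, old_ver, None                 # package vanished
        elif apt_pkg.version_compare(new[pkg], old_ver) < 0:
            yield pkg, old_ver, new[pkg]             # version went backwards

# A monitor would run this over every pair of consecutive log entries
# and investigate (or publish) whatever it finds:
for pkg, old_ver, new_ver in check_monotonic("Packages.42", "Packages.43"):
    print(f"suspicious: {pkg} {old_ver} -> {new_ver}")
```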
Those are the monitoring functions, and I think that many different people and groups in Debian could get some benefit out of these monitoring functions in order to verify that everything worked correctly. We should note that all these verifications are completely independent of the existing infrastructure, because they happen on the client side. So we don't depend on the existing infrastructure working correctly and sending out notifications, or on notifications not being suppressed; this can be done completely on the client side using the data provided by the log server. For example, maintainers could verify that the code they uploaded builds reproducibly using the corresponding buildinfo file, or they could check which uploads were done using their key and which of their packages were modified, perhaps by other people. The keyring maintainers or the account managers could be looking at the keyring: which keys are in the keyring and which uploads were done using which keys. And the archive, last but not least, has an additional verification step available to make sure all the metadata was produced correctly and to know what things happened during the production of a given release. This thing actually exists. Well, I have programmed prototypes for all these components, meaning nothing that would be ready for production, but enough to show that it actually works. I've taken two years of Debian stretch releases and fed them into the system. This resulted in a tree size of 270,000 elements, and the storage required was about 400 GB, where almost all of that is source packages. I would say that it's eminently feasible to do this. The monitor functions run rather cheaply, and a monitor does not necessarily need to keep a complete copy of the log in all cases. But I did notice some unexpected events in the package metadata: I have observed sources missing and version increments missing where I think there should be a version increment, so I will be looking more closely into these cases. If anybody is interested in the theoretical side of this, these would be the immediate pointers I can give. The first paper is the theoretical and mathematical foundation, and the other ones are applications of similar transparency work, but with different goals. Summarizing, we can introduce a system to detect targeted backdoors, even under compromise of the archive. We need to add a bit more infrastructure and change how some things are done. We can also improve auditability, that is, what we can securely identify when things go wrong. In particular, we can make sure that for every binary, we can get the source code that was used to produce the binary and then identify the responsible maintainer. There's one class of attacks I have left out for today; if anybody wants to talk about that, we can do so too. And now, I'm interested in your questions and feedback. [Applause] [Q] Did you already test reproducibility, and how do you deal with the problem of packages that are not reproducible? I mean, do you just not integrate some of them into the log? [A] For now, the implementation of my monitor functions hasn't covered reproducibility. I think the first step to do so would be to have a blacklist of packages that are known not to build reproducibly and then try to get on with it. [Q] Two questions. You say "authenticating metadata and code". Does this mean signing, or what exactly is "authenticating"? [A] At which point? [Q] It was… back, where the tree is. Yes, yes. The tree before that. [A] Ok. This authentication here doesn't quite mean a signature.
It means that if I know the value of the root of the hash tree, then I can be assured that a given element is included if I'm told the values of the three grey-marked inner nodes here. And that works by recomputing the hash tree. [Q] Ok, I think I have to defer this to after the talk. [A] Yeah, I can explain it then. [Q] Another question would be, so, the detection of targeted backdoors. You mean at the stage of signing the archive, or which backdoors? [A] The scenario would be that the signing key of the archive is used to create an additional release file which covers a manipulated software version, and this software version and signature are only shown to the victim population and not to the general population. This means that the malicious software would only be observed by the victim and not by everybody else. My goal is to force the attacker to distribute the malicious software to the whole world in order to increase the chance that they're going to be detected, and thereby perhaps deter the attack from the beginning. [Q] Great talk. Great ideas as well. I really liked your slide on your assumptions, ??? honest about them, like "yeah, we assume ???". I wouldn't underestimate how difficult it would be to make some of these changes. I mean, even ones that look simple, like source-only uploads. Everyone wants them, right? [A] Yes, sure, we have to start somewhere, and I hope that if people are convinced that this is a great idea and we should do this, then we get some more impetus for these things that everybody wants, like source-only uploads. [Q] Thank you, yeah, and it would be really pretty good to base this stuff on ??? effort ??? build on the same choices. Thank you. [A] Yeah, so I'm interested in any kind of feedback. If you think it's a great idea, or think there are some problems I might have missed, or that it might get difficult to implement, please come to me in case you have anything. [Applause]