This will be an academic talk
as announced.
I will try to bring some of the research
I did during my PhD into the real world.
We are going to talk about the security
of software distribution and
I'm going to propose a security feature
that adds on top of
the signatures we already have today
and also the reproducible builds that
we already have to a very large degree.
I am going to highlight a few points where
I think infrastructure changes are required
to accommodate this system and I would
also appreciate any feedback
you might have.
I'm going to give a bit of motivation for
why we should care about this.
For the security of software distribution
we already have
cryptographic signatures.
I've put up a few examples of
recent attacks that involved
the distribution of software, where
people who presumably knew
what they were doing still had
grave problems with software distribution.
For example, the Juniper backdoors,
which are pretty famous:
Juniper discovered two backdoors
in their code and
nobody really knew where they
came from.
Another example would be
the Chrome extension developers
who got their credentials phished and
subsequently their extensions backdoored.
Or, another example: a signed update to
a banking software actually included
malware and infected several banks.
I hope this is motivation for us to
consider these kinds of attacks
to be possible and to prepare ourselves.
I have two main goals in the system
I am going to propose.
The first is to relax trust in the archive.
In particular, what I want to achieve is
a level of security even if
the archive is compromised and
the specific thing I am going to do is
to detect targeted backdoors.
That means backdoors that are distributed
only to a subset of the population and
what we can achieve is to force
the attacker to deliver the malware
to everybody, thereby greatly decreasing
their stealth and increasing
their risk of being detected.
This would work to our advantage.
The second goal is forensic auditability,
which overlaps to a surprising degree
with the first one
in terms of implementation.
So, what I want to ensure is that we have
inspectable source code for every binary.
Of course we do have the source code
available for our packages, but
only for the most recent version;
everything else is a best effort
by the code archiving services.
The mapping between source and binary
can, to a large extent, be verified
once we have reproducible builds.
I want to make sure that we can identify
the maintainer responsible for distributing
a particular package, and the system
should also provide
attribution of where something went wrong,
so that we are not in a situation where we
notice something went wrong but
don't really know where we have to look
in order to find the problem,
but instead have a specific and
secured indication of
where a compromise
was coming from.
Let's quickly recap how our software
distribution works.
We have the maintainers who upload
their code to the archive.
The archive has access to a signing key
which signs the releases,
or more precisely the release metadata
covering all the binary packages.
These are distributed over
the mirror network
from where the apt clients will download
the package metadata.
That means the hash sums for the packages,
their dependencies and so on
as well as the actual packages themselves.
This central architecture has an important
advantage,
namely that the mirror network need not
be trusted, right?
We have the signature that covers all
the contents of binary and source packages
and the metadata, so the mirror network
need not be trusted.
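To make that hash chain concrete, here is a minimal Python sketch of how trust flows from the signed release metadata down to an individual package. The file contents are toy stand-ins invented for this example; the real Release and Packages files are signed and carry many more fields.

    import hashlib

    def sha256_hex(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # Toy stand-ins for the real files, invented for this sketch.
    deb_bytes = b"contents of hello_2.10-2_amd64.deb"
    packages_bytes = ("Package: hello\nVersion: 2.10-2\nSHA256: %s\n"
                      % sha256_hex(deb_bytes)).encode()
    release_bytes = ("Suite: stretch\nSHA256:\n %s main/binary-amd64/Packages\n"
                     % sha256_hex(packages_bytes)).encode()

    # The archive signature covers only release_bytes; trust in the Packages
    # index and in the .deb is carried purely by hashes, which is why the
    # mirrors that serve these files need not be trusted.
    assert sha256_hex(packages_bytes) in release_bytes.decode()
    assert sha256_hex(deb_bytes) in packages_bytes.decode()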
On the other hand, it makes the archive and
its signing key a very interesting target
for attackers because this central point
controls all the signing operations.
So this is a place where we need to be
particularly careful and perhaps
even do better than
cryptographic signatures.
This is where the main focus of this talk
will be, although I will also consider
the uploaders to some extent.
We want to achieve two things:
resistance against key compromise and
targeted backdoors and
to get some better support for auditing
in case things go wrong.
The approach that we choose to do this is
to make sure that everybody runs
exactly the same software,
or at least the parts of it they choose
to install.
If we think about that for a moment,
this gives us a number of advantages.
For example, all the analysis that's done
on a piece of software immediately
carries over to all other users of
the software, right?
Because if we haven't made sure that
everybody installs the same software,
they might not have exactly
the same version but perhaps
some backdoored version.
This also protects us against
targeted backdoors by increasing
the attackers' detection risk,
and we also want to have cryptographic
proof of where something went wrong.
Now, to look at some pictures,
I will present the data structure that
we use in order to achieve these goals.
The data structure is a hash tree,
a Merkle tree which is
a data structure that operates over a list.
So we have a list of these squares here
which represent the list items.
In our case, the elements of this list
are going to be the files containing
the package metadata, that is
the dependencies and the hash sums of
the packages,
and also the source packages themselves.
The tree works as follows.
It uses a cryptographic hash function,
which is a collision-resistant compressing
function,
and the labels of the inner nodes
of the tree are computed as
the hashes of their children. Ok?
Once we have computed the root hash,
the root label,
we have fixed all the elements and
none of the elements can be changed
without changing the root hash.
We can exploit this in order to
efficiently prove
the two following properties for elements.
First of all, we can efficiently prove
the inclusion of a given element
in the list.
If we know the tree root label,
this works as follows.
Let's take a quick example: we see that
the third list item is marked with an X,
and if I know the tree root, then
the server operating the tree structure
only needs to give me the three grey
marked labels,
the three marked node values. Then
I can recompute the root hash and
be convinced that this element actually
is contained in the list.
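As a minimal sketch of this structure, here is some Python that builds such a hash tree over a list, produces the grey sibling labels for one element, and recomputes the root from them. The leaf and node prefixes follow the Certificate Transparency convention (RFC 6962); the function names and the eight toy list items are assumptions made for this example only.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def leaf_hash(leaf: bytes) -> bytes:
        # Domain separation between leaves and inner nodes (as in RFC 6962).
        return h(b"\x00" + leaf)

    def node_hash(left: bytes, right: bytes) -> bytes:
        return h(b"\x01" + left + right)

    def merkle_root(leaves):
        """Label of the root node over the whole list."""
        if len(leaves) == 1:
            return leaf_hash(leaves[0])
        k = 1
        while 2 * k < len(leaves):   # largest power of two below len(leaves)
            k *= 2
        return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))

    def inclusion_proof(index, leaves):
        """The sibling labels needed to recompute the root from one leaf."""
        if len(leaves) == 1:
            return []
        k = 1
        while 2 * k < len(leaves):
            k *= 2
        if index < k:
            return inclusion_proof(index, leaves[:k]) + [("R", merkle_root(leaves[k:]))]
        return inclusion_proof(index - k, leaves[k:]) + [("L", merkle_root(leaves[:k]))]

    def verify_inclusion(leaf, proof, root):
        """Recompute the root from the leaf and the proof; compare to the known root."""
        node = leaf_hash(leaf)
        for side, sibling in proof:
            node = node_hash(sibling, node) if side == "L" else node_hash(node, sibling)
        return node == root

    # Example: the third list item (index 2) out of eight; the proof consists
    # of exactly three labels, the grey nodes from the picture.
    leaves = [("item-%d" % i).encode() for i in range(8)]
    root = merkle_root(leaves)
    proof = inclusion_proof(2, leaves)
    assert len(proof) == 3 and verify_inclusion(leaves[2], proof, root)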
The second property is that we can also
efficiently verify the append-only operation
of the list.
So we can have a log server operating
this kind of structure, and
the log server need not be trusted:
it's not going to be a trusted third party,
but rather its operation can be
verified from the outside.
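To illustrate what the append-only property states, here is a naive check reusing merkle_root and the toy list from the sketch above. A real log server proves this with a logarithmic-size consistency proof (as in Certificate Transparency) rather than by handing out both full lists; the naive version below only shows what such a proof establishes.

    def is_append_only(old_leaves, old_root, new_leaves, new_root):
        """The new list extends the old one, and both roots are honest."""
        return (new_leaves[:len(old_leaves)] == old_leaves
                and merkle_root(old_leaves) == old_root
                and merkle_root(new_leaves) == new_root)

    new_leaves = leaves + [b"item-8", b"item-9"]
    assert is_append_only(leaves, root, new_leaves, merkle_root(new_leaves))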
So, what does this design look like?
The theoretical foundation is called
a transparency overlay and
in our system it looks like this:
We have the archive as per usual,
we have a log server and the archive will
submit package metadata, the release file,
the packages file containing dependencies
and so on and the source code
into this log server.
The apt client will be augmented with
an auditor component and
this auditor component is responsible for
verifying the correct log operation
as well as the inclusion of the downloaded
release into the log.
This is the mechanism by which we will be
able to make sure that everybody is running
the exact same version of the software
they installed.
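A rough sketch of what that auditor step inside apt could look like, reusing verify_inclusion and the toy tree from the Merkle sketch above. The function name and arguments are assumptions for illustration, not the prototype's actual interface; in the real design the inclusion path and current tree root would arrive over the log server's protocol, and the append-only check against the previously trusted root would run as well.

    def audit_release(release_bytes, inclusion_path, log_root):
        """Refuse to proceed unless the downloaded Release file is in the log."""
        if not verify_inclusion(release_bytes, inclusion_path, log_root):
            raise RuntimeError("Release file is not included in the transparency log")
        # Here the auditor would also check that log_root is an append-only
        # extension of the tree root it trusted last time, and only then let
        # apt continue with the usual signature and hash verification.
        return log_root   # remember this as the new trusted tree root

    # Example with the toy tree: pretend list item 2 is our Release file.
    audit_release(leaves[2], inclusion_proof(2, leaves), root)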
A third component is the monitor.
The monitor is also needed to verify
log operation and to inspect
the elements that are contained in the log.
Monitors would be run by groups
of individuals, or by individuals, that want
to make sure of certain properties of the log.
Alright, let's quickly recap.
We have added this log server, which
can prove two properties efficiently
to the outside world.
And we have the auditor and monitor
components.
The auditor is added to the apt client
and the monitor does
additional investigative tasks.
Now, in order to make this system work,
we need to…
I need to make a few assumptions.
The archive will need to handle
log submission and the distribution of
certain log data structures.
These are usually very small things
given back to the archive
in response to a submission.
Then I'm assuming a very consistent
release frequency.
The archive is responsible for distributing
reproducible binaries
in my architecture.
I'm assuming that the buildinfo files are
covered by the release file.
I treat them as additional source metadata,
so whenever the source package or
the buildinfo file changes, I expect
an increase in the binary version number.
I also assume source-only uploads, and
one additional thing:
the keyring source package is treated
by the archive as authoritative, and
this keyring must have
the special property that
it is operated append-only, so that
we can go back in time and see
which keys were authorized at different
points in time.
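To illustrate what an append-only keyring buys us, here is a tiny sketch: given a dated history of keyring changes, we can reconstruct which keys were authorized at any point in time. The event format and the key names are invented for this example.

    from datetime import date

    # Hypothetical, dated history of keyring changes, oldest first.
    keyring_events = [
        (date(2016, 1, 10), "add",    "KEY-A"),
        (date(2016, 6,  2), "add",    "KEY-B"),
        (date(2017, 3, 15), "remove", "KEY-A"),
    ]

    def authorized_keys(at):
        """Set of keys that were authorized on the given date."""
        keys = set()
        for when, action, key in keyring_events:
            if when > at:
                break
            if action == "add":
                keys.add(key)
            else:
                keys.discard(key)
        return keys

    assert authorized_keys(date(2016, 12, 31)) == {"KEY-A", "KEY-B"}
    assert authorized_keys(date(2017, 12, 31)) == {"KEY-B"}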
The log server is a standalone server
component that, at the moment, speaks
an HTTP-based protocol.
Probably one would want to have
more than one, but
we are going to have, I think,
a much easier time running log servers
than, for example, the Certificate
Transparency people,
because we only have one source
of write access,
namely the archive, so we can easily
schedule the write access,
and we can have read-only frontends that
aren't quite as critical.
The auditor component would need to be
integrated into the apt client or library.
It needs a few things: cryptographic
verification,
understanding a bit more of the file formats,
and some more network access.
Parts of the proofs could also
probably be distributed over
the mirror network, so we need
not necessarily do everything
via direct communication with
the log server.
So, this covers archive auditor and
log server.
The monitoring servers have a few functions
that are necessary for the verification of
the log itself, meaning that they verify
the append-only operation of the log
and they will also likely want to exchange
the tree roots with perhaps
other monitors and some auditors.
The important verification functions
on the log contents are validating
the metadata of the Release, Packages and
Sources files,
namely making sure that these are complete,
that the sources are available,
that the versions are incremented
correctly and so on.
And that's necessary to make sure that
a compromised archive can't do
certain attacks.
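One concrete monitor rule of this kind could look like the following sketch, which flags version downgrades between two consecutive snapshots of a Packages index taken from the log. The dictionary format is an assumption for this example, and real Debian version ordering would use python-apt or dpkg --compare-versions instead of the plain string comparison used here.

    def find_downgrades(old_index, new_index):
        """old_index/new_index map package name -> version string."""
        problems = []
        for name, old_version in old_index.items():
            new_version = new_index.get(name)
            if new_version is not None and new_version < old_version:
                problems.append((name, old_version, new_version))
        return problems

    # Example: a monitor would raise an alert about hello being downgraded.
    old_index = {"hello": "2.10-2", "bash": "5.0-4"}
    new_index = {"hello": "2.10-1", "bash": "5.0-4"}
    assert find_downgrades(old_index, new_index) == [("hello", "2.10-2", "2.10-1")]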
Also in this category is the fact that
we depend on a fixed release frequency
and monitors will also be verifying
the upload ACL,
meaning which keys are authorized to
upload.
Monitors also would be verifying
reproducible builds in this scenario.
That's the monitoring functions and
I think that many different people and
groups in Debian could get some benefits
out of these monitoring functions
in order to verify that everything
worked correctly.
We should note that all these verifications
are completely independent of
the existing infrastructure, because
they happen on the client side.
So we don't depend on the existing
infrastructure working correctly and
sending notifications, and no notifications
can be suppressed.
This can be done completely
on the client side using
the data provided by the log server.
For example, maintainers could verify that
the code they uploaded builds reproducibly
using the corresponding buildinfo files,
or they could run checks on
which uploads were done using their key
and which of their packages were perhaps
modified by other people.
The keyring maintainers or account
managers could be looking at
the keyring: which keys are in the keyring
and which uploads were done
using which keys.
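As a sketch of such a maintainer-side check, the snippet below lists the uploads recorded in the log under a given key fingerprint. The entry format and the fingerprints are entirely made up for illustration; they are not the prototype's actual data model.

    def uploads_by_key(entries, fingerprint):
        """(source, version) pairs of uploads signed with the given key."""
        return [(e["source"], e["version"])
                for e in entries
                if e.get("key_fingerprint") == fingerprint]

    # Hypothetical log entries, one per accepted upload.
    entries = [
        {"source": "hello", "version": "2.10-2", "key_fingerprint": "AAAA1111"},
        {"source": "bash",  "version": "5.0-4",  "key_fingerprint": "BBBB2222"},
    ]
    assert uploads_by_key(entries, "AAAA1111") == [("hello", "2.10-2")]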
And the archive, last but not least, has
an additional verification step available
to make sure all the metadata was produced
correctly and to know whether
weird things happened during the production
of a given release.
This thing actually exists:
I have programmed prototypes
for all these components,
meaning nothing that would be ready
to deploy,
but enough to show that it actually works.
I've used two years of Debian Stretch
releases and fed them into the system.
This resulted in a tree size of
270,000 elements, and
the storage required was about 400GB,
where almost all of that is
source packages.
I would say that it's eminently feasible
to do this.
The monitor functions run rather cheaply.
A monitor does not necessarily need to keep
a complete copy of the log in all cases.
What I did notice were some unexpected events
in the package metadata:
I have observed missing sources and
missing version increments where
I think there should be a version increment,
so I'll be looking more closely
into these cases.
If anybody is interested in
the theoretical side of this,
these would be the immediate pointers
I can give.
The first paper is the theoretical and
mathematical foundation and
the other ones are applications of
similar transparency work, but
with different goals.
Summarizing, we can introduce a system
to detect targeted backdoors,
even under compromise of the archive.
We need to add a bit more infrastructure
and change how some things are done.
We can also improve auditability:
what we can securely identify when
things go wrong.
In particular, we can make sure that for
every binary, we can get
the source code that was used to produce
the binary
and then identify
the responsible maintainer.
There's one class of attacks I have left
out for today,
if anybody wants to talk about that, we
can do so too.
And now, I'm interested in your questions
and feedback.
[Applause]
[Q] Did you already test the reproducibility,
and how do you deal with
the problem of non-reproducible packages?
I mean, do you just not integrate some
into the log?
[A] For now, the implementation of
my monitor functions hasn't covered
reproducibility.
I think the first step to do so would be
to have a blacklist of packages
that are known not to be built reproducibly
and then try to get on with it.
[Q] Two questions.
You say "authenticating metadata and
code".
This means signing or what is it exactly,
"authenticating"?
[A] At which point?
[Q] It was… back. Where the tree is.
Yes, yes. The tree before that.
[A] Ok.
This authentication here doesn't quite mean
a signature.
It means that if I know the value of the root
of the hash tree, then
I can be assured that a given element
is included if I'm told
the value of the three gray marked
inner nodes here.
And that works by recomputing
the hash tree.
[Q] Ok, I think I have to defer this
to after the talk.
[A] Yeah, I can explain.
[Q] Another question would be,
so, detection of targeted backdoors.
You mean at the stage of signing archive
or which backdoors?
[A] The scenario would be that
the signing key of the archive is
used to create an additional release file
which covers
a manipulated software version.
And this software version and signature is
only shown to the victim population
and not to the general population.
This means that the malicious software
would only be observed by the victim
and not by everybody else.
My goal is to force the attacker to
distribute the malicious software
to the whole world in order to increase
the chance that they're going to be
detected, thereby perhaps deterring
the attack from the beginning.
[Q] Great talk. Great ideas as well.
I really liked your slide on
your assumptions
???
honest about them like
"yeah we assume ???"
I wouldn't underestimate how difficult
it would be to make
some of these changes.
I mean, even ones that look simple, like
source-only uploads.
Everyone wants them, right?
[A] Yes, sure, we have to start somewhere,
and I hope that if people are convinced that
this is a great idea and we should do this,
then we get some more impetus
for these things that everybody wants,
like source-only uploads.
[Q] Thank you, yeah, and it would be really
pretty good to base this stuff
on the reproducible builds effort because
it builds on the same choices.
Thank you.
[A] Yeah, so I'm interested in any kind
of feedback.
If you think it's a great idea, or think
there are some problems I might have missed,
or that it might get difficult to implement,
please come talk to me in case you have
anything.
[Applause]