This will be an academic talk, as announced. I will try to bring some of the research I did during my PhD into the real world. We are going to talk about the security of software distribution, and I'm going to propose a security feature that builds on top of the signatures we have today and also on the reproducible builds that we already have to a very large degree. I am going to highlight a few points where I think infrastructure changes are required to accommodate this system, and I would also appreciate any feedback you might have. I'm going to give a few motivating examples of why we should care about this. In software distribution we already do have cryptographic signatures, but I've put up a few examples of recent attacks involving the distribution of software, where people who presumably thought they knew what they were doing had grave problems with software distribution. For example, the Juniper backdoors, pretty famous: Juniper discovered two backdoors in their code and nobody really knew where they were coming from. Another example would be Chrome extension developers who got their credentials phished and subsequently their extensions backdoored. Or, as another example, a signed update to a piece of banking software actually included malware and infected several banks. I hope this is motivation for us to consider these kinds of attacks to be possible and to prepare ourselves. I have two main goals in the system I am going to propose. The first is to relax trust in the archive. In particular, what I want to achieve is a level of security even if the archive is compromised, and the specific thing I am going to do is to detect targeted backdoors. That means backdoors that are distributed only to a subset of the population, and what we can achieve is to force the attacker to deliver the malware to everybody, thereby greatly decreasing their degree of stealth and increasing their risk of detection. This would work to our advantage. The second goal is forensic auditability, which overlaps with the first one to a surprising degree in technical terms, in terms of implementation. So, what I want to ensure is that we have inspectable source code for every binary. We do of course have the source code available for our packages, but only for the most recent version; everything else is a best effort by the code archiving services. The mapping between source and binary can, to a large extent, be verified once we have reproducible builds. I want to make sure that we can identify the maintainer responsible for the distribution of a particular package, and the system is also interested in providing attribution of where something went wrong, so that we are not in a situation where we notice something went wrong but don't really know where we have to look in order to find the problem, but instead have a specific and secured indication of where a compromise was coming from. Let's quickly recap how our software distribution works. We have the maintainers who upload their code to the archive. The archive has access to a signing key which signs the releases, that is, metadata covering all the actual binary packages. These are distributed over the mirror network, from where the apt clients will download the package metadata, that means the hash sums for the packages, their dependencies and so on, as well as the actual packages themselves. This central architecture has an important advantage, namely that the mirror network need not be trusted, right?
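To make that chain of trust concrete, here is a rough sketch (not the real apt code) of what the signed release gives us today: the signature covers the Release file, the Release file pins the Packages index by hash, and the Packages index pins every .deb by hash. The file paths are purely illustrative, and I assume the signature on the Release file has already been checked, for example with gpgv.

```python
# Rough sketch of today's chain of trust, not the real apt code.
# Assumes the OpenPGP signature on the Release file was already verified
# (e.g. with gpgv); the file paths below are illustrative only.
import hashlib
import re

def sha256_of(path):
    """Hex SHA-256 of a file on disk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def release_sha256_entries(release_path):
    """Parse the SHA256 section of a Release file into {name: hexdigest}."""
    entries, in_sha256 = {}, False
    with open(release_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("SHA256:"):
                in_sha256 = True
            elif in_sha256 and line.startswith(" "):
                m = re.match(r" ([0-9a-f]{64})\s+\d+\s+(\S+)", line)
                if m:
                    entries[m.group(2)] = m.group(1)
            else:
                in_sha256 = False
    return entries

# The signed Release file pins the Packages index ...
entries = release_sha256_entries("dists/stretch/Release")
assert sha256_of("main/binary-amd64/Packages") == entries["main/binary-amd64/Packages"]
# ... and the Packages index in turn carries a SHA256 field for every .deb,
# so a mirror cannot change any file without breaking a hash somewhere.
```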
We have the signature that covers all the contents of binary and source packages and the metadata, so the mirror network need not be trusted. On the other hand, it makes the archive and the signing key a very interesting target for attackers, because this central point controls all the signing operations. So this is a place where we need to be particularly careful and perhaps even do better than cryptographic signatures. This is where the main focus of this talk will be, although I will also consider the uploaders to some extent. We want to achieve two things: resistance against key compromise and targeted backdoors, and better support for auditing in case things go wrong. The approach that we choose is to make sure that everybody runs exactly the same software, or at least the parts of it they choose to install. If we think about that for a moment, this gives us a number of advantages. For example, all the analysis that's done on a piece of software immediately carries over to all other users of that software, right? Because if we haven't made sure that everybody installs the same software, they might not have exactly the same version, but perhaps some backdoored version. This also ensures that we cannot suffer targeted backdoors, by increasing the detection risk for attackers, and we also want to have cryptographic proof of where something went wrong. Now, to look at some pictures, I will present the data structure that we use in order to achieve these goals. The data structure is a hash tree, a Merkle tree, which is a data structure that operates over a list. So we have a list of these squares here, which represent the list items. In our case, the files containing the package metadata, that is, the dependencies and the hash sums of the packages, and also the source packages themselves, are going to be elements in this list. The tree works as follows. It uses a cryptographic hash function, which is a collision-resistant compressing function, and the labels of the inner nodes of the tree are computed as the hashes of their children. Ok? Once we have computed the root hash, the root label, we have fixed all the elements, and none of the elements can be changed without changing the root hash. We can exploit this in order to efficiently prove the following two properties. First of all, we can efficiently prove the inclusion of a given element in the list. If we know the tree root, this works as follows. Let's make a quick example: we see the third list item is marked with an X, and if I know the tree root, then the server operating the tree structure only needs to give me the three grey-marked labels, the three marked node values, and then I can recompute the root hash and be convinced that this element actually was contained in the list. The second property is that we can also efficiently verify the append-only operation of the list. So we can have a log server operating this kind of structure, and the log server need not be trusted; it's not going to be a trusted third party, but rather its operation can be verified from the outside. So, what does this design look like? The theoretical foundation is called a transparency overlay, and in our system it looks like this: we have the archive as per usual, we have a log server, and the archive will submit package metadata, the Release file, the Packages file containing dependencies and so on, and the source code into this log server.
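To make the inclusion proof concrete, here is a minimal sketch of how a client could recompute the root hash from one leaf and the handful of sibling labels (the grey nodes) the log server hands out. I'm assuming the Certificate Transparency hashing convention, a 0x00 prefix for leaves and 0x01 for inner nodes; the actual encoding of entries and proofs in this system may differ.

```python
# Minimal sketch of Merkle-tree inclusion-proof verification, following the
# Certificate Transparency hashing convention (0x00 for leaves, 0x01 for
# inner nodes). The real protocol for this system may encode things differently.
import hashlib

def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def verify_inclusion(leaf: bytes, index: int, tree_size: int,
                     proof: list, root: bytes) -> bool:
    """Recompute the root from one leaf plus its audit path.

    `proof` is the list of sibling labels (the grey nodes in the picture),
    ordered from the leaf level upwards, as handed out by the log server.
    """
    if index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    h = leaf_hash(leaf)
    for sibling in proof:
        if sn == 0:
            return False                 # proof is longer than the tree is deep
        if fn % 2 == 1 or fn == sn:
            h = node_hash(sibling, h)    # sibling sits to our left
            if fn % 2 == 0:
                while fn % 2 == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            h = node_hash(h, sibling)    # sibling sits to our right
        fn >>= 1
        sn >>= 1
    return sn == 0 and h == root
```

The point is that the proof contains only on the order of log2(n) hashes, so even for a log with hundreds of thousands of entries the client recomputes a couple of dozen hashes instead of downloading the whole list.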
The apt client will be augmented with an auditor component, and this auditor component is responsible for verifying the correct log operation as well as the inclusion of the downloaded release in the log. This is the mechanism by which we will be able to make sure that everybody is running the exact same version of the software they installed. A third component is the monitor. The monitor is also necessary to verify log operation and also to inspect the elements that are contained in the log. The monitor would then be run by individuals or groups of individuals that want to make sure of certain properties of the log. Alright, let's quickly recap. We have added this log server, which can prove two properties efficiently to the outside world, and we have the auditor and monitor components. The auditor is added to the apt client, and the monitor does additional investigative tasks. Now, in order to make this system work, I need to make a few assumptions. The archive will need to handle log submission and the distribution of certain log data structures; these are usually very small things given to the archive in response to a submission. Then I'm assuming a very consistent release frequency. The archive is responsible for distributing reproducible binaries in my architecture. I'm assuming that the buildinfo files are covered by the release file; I treat them as additional source metadata, so whenever the source package or the buildinfo file changes, I expect an increase in the binary version number. I also assume source-only uploads, and one additional thing: the keyring source package is treated by the archive as authoritative, and this keyring must have the special property that it is operated append-only, so that we can go back in time and see which keys were authorized at different points in time. The log server is a standalone server component that at the moment speaks an HTTP-based protocol. Probably one would want to have more than one, but we are going to have, I think, a much easier time running log servers than, for example, the Certificate Transparency people, because we only have one source of write access, namely the archive, so we can easily schedule the write access, and you can have read-only frontends that aren't quite as critical. The auditor component would need to be integrated into the apt client or library. It needs to do things like cryptographic verification, understand a few more file formats, and have some more network access. Parts of the proofs could probably also be distributed over the mirror network, so we need not necessarily do everything in direct communication with the log server. So, this covers the archive, the auditor and the log server. The monitoring servers have a few functions that are necessary for the verification of the log itself, meaning that they verify the append-only operation of the log, and they will also likely want to exchange the tree roots with other monitors and perhaps some auditors. The important verification functions over the log contents are validating the metadata of the Release, Packages and Sources files, namely making sure that these are complete, that the sources are available, that the versions are incremented correctly and so on. That's necessary to make sure that a compromised archive can't carry out certain attacks. Also in this category is the fact that we depend on a fixed release frequency, and monitors will also be verifying the upload ACL, meaning which keys are authorized to upload. Monitors would also be verifying reproducible builds in this scenario.
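As a small illustration of these metadata checks, here is a sketch of one monitor pass that compares two consecutive snapshots of a Packages file taken from the log and flags packages whose version went backwards or that silently disappeared. It assumes python-apt for Debian version comparison; the file names and the notion of a "snapshot" are simplified placeholders.

```python
# Sketch of one monitor check over the log contents: compare two consecutive
# snapshots of a Packages file and flag versions that go backwards or packages
# that vanish. Uses python-apt for Debian version comparison; the file names
# are placeholders for whatever the log hands back.
import apt_pkg

apt_pkg.init()  # initialise apt's version-comparison machinery

def parse_packages(path):
    """Map package name -> version from an (uncompressed) Packages file."""
    versions, name = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("Package: "):
                name = line.split(": ", 1)[1].strip()
            elif line.startswith("Version: ") and name is not None:
                versions[name] = line.split(": ", 1)[1].strip()
    return versions

def check_monotonic(old_path, new_path):
    """Yield (package, old_version, new_version) for suspicious transitions."""
    old, new = parse_packages(old_path), parse_packages(new_path)
    for pkg, old_ver in old.items():
        if pkg not in new:
            yield pkg, old_ver, None                 # package vanished
        elif apt_pkg.version_compare(new[pkg], old_ver) < 0:
            yield pkg, old_ver, new[pkg]             # version went backwards

# A monitor would run this over every pair of consecutive log entries
# and investigate (or publish) whatever it finds:
for pkg, old_ver, new_ver in check_monotonic("Packages.42", "Packages.43"):
    print(f"suspicious: {pkg} {old_ver} -> {new_ver}")
```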
Those are the monitoring functions, and I think that many different people and groups in Debian could get some benefit out of these monitoring functions in order to verify that everything worked correctly. We should note that all these verifications are completely independent of the existing infrastructure, because they happen on the client side. So we don't depend on the existing infrastructure working correctly and sending out notifications, or on notifications not being suppressed; this can be done completely on the client side using the data provided by the log server. For example, maintainers could verify that the code they uploaded builds reproducibly using the corresponding buildinfo file, or they could check which uploads were done using their key and which of their packages were modified, perhaps by other people. The keyring maintainers or the account managers could be looking at the keyring: which keys are in the keyring and which uploads were done using which keys. And the archive, last but not least, has an additional verification step available to make sure all the metadata was produced correctly and to know what things happened during the production of a given release. This thing actually exists. Well, I have programmed prototypes for all these components, meaning nothing that would be ready for production, but enough to show that it actually works. I've taken two years of Debian stretch releases and fed them into the system. This resulted in a tree size of 270,000 elements, and the storage required was about 400 GB, where almost all of that is source packages. I would say that it's eminently feasible to do this. The monitor functions run rather cheaply, and a monitor does not necessarily need to keep a complete copy of the log in all cases. But I did notice some unexpected events in the package metadata: I have observed sources missing and version increments missing where I think there should be a version increment, so I will be looking more closely into these cases. If anybody is interested in the theoretical side of this, these would be the immediate pointers I can give. The first paper is the theoretical and mathematical foundation, and the other ones are applications of similar transparency work, but with different goals. Summarizing, we can introduce a system to detect targeted backdoors, even under compromise of the archive. We need to add a bit more infrastructure and change how some things are done. We can also improve auditability, that is, what we can securely identify when things go wrong. In particular, we can make sure that for every binary, we can get the source code that was used to produce the binary and then identify the responsible maintainer. There's one class of attacks I have left out for today; if anybody wants to talk about that, we can do so too. And now, I'm interested in your questions and feedback. [Applause] [Q] Did you already test reproducibility, and how do you deal with the problem of packages that are not reproducible? I mean, do you just not integrate some of them into the log? [A] For now, the implementation of my monitor functions hasn't covered reproducibility. I think the first step to do so would be to have a blacklist of packages that are known not to build reproducibly and then try to get on with it. [Q] Two questions. You say "authenticating metadata and code". Does this mean signing, or what exactly is "authenticating"? [A] At which point? [Q] It was… back, where the tree is. Yes, yes. The tree before that. [A] Ok. This authentication here doesn't quite mean a signature.
It means that if I know the value of the root of the hash tree, then I can be assured that a given element is included if I'm told the values of the three grey-marked inner nodes here. And that works by recomputing the hash tree. [Q] Ok, I think I have to defer this to after the talk. [A] Yeah, I can explain it then. [Q] Another question would be, so, the detection of targeted backdoors. You mean at the stage of signing the archive, or which backdoors? [A] The scenario would be that the signing key of the archive is used to create an additional release file which covers a manipulated software version, and this software version and signature are only shown to the victim population and not to the general population. This means that the malicious software would only be observed by the victim and not by everybody else. My goal is to force the attacker to distribute the malicious software to the whole world in order to increase the chance that they're going to be detected, and thereby perhaps deter the attack from the beginning. [Q] Great talk. Great ideas as well. I really liked your slide on your assumptions, ??? honest about them, like "yeah, we assume ???". I wouldn't underestimate how difficult it would be to make some of these changes. I mean, even ones that look simple, like source-only uploads. Everyone wants them, right? [A] Yes, sure, we have to start somewhere, and I hope that if people are convinced that this is a great idea and we should do this, then we get some more impetus for these things that everybody wants, like source-only uploads. [Q] Thank you, yeah, and it would be really pretty good to base this stuff on ??? effort ??? build on the same choices. Thank you. [A] Yeah, so I'm interested in any kind of feedback. If you think it's a great idea, or think there are some problems I might have missed, or that it might get difficult to implement, please come to me in case you have anything. [Applause]