I'm here today to talk to you about
diffoscope

and how you can use it as a better diff

or for Quality Assurance, etc., things
like that.

Moin!

Apparently that's like a North German
thing to say "welcome".

North German, north Denmark, Scandinavia,
that kind of thing, I'm told.

People are shaking their head, so I'm
going to assume that's true.

This is my first PC, an IBM 5155.

Sometimes, when you rebooted it, it would
launch into, it would somehow revert

from booting from the hard disk to booting
from a BASIC ROM,

as in the programming language ROM.

It was on my motherboard for some reason.

So, randomly, you'd just get a chance to
program in BASIC and then,

sometimes you wouldn't, I don't know why,
but… yeah.

It's quite fun with this kind of clicky
keyboard, and that folded in

and it was this kind of big desk thing.

Anyway…

This is my first Debian.

At the time it was already old.

What's this one? Is this Slink? 2.2?
Yeah.

And this is when we had US and non-US,
so that's really dating if you remember that.

This is my first contribution to Debian,
19th December 2006,

sending a patch to LilyPond which is kind
of interesting

and the response was "Oh yeah, rock on,
many thanks. I'll upload this and

it'll be landing to Etch".

And this was super motivating because
Etch was just coming out and it was like

"Great, I've got, like, one tiny one-line patch
in a release. This is super cool."

Thomas' response was super motivating.

So, after that, like that Christmas
basically spent ???

Debian webpages and stuff.

Very well timed.

That's kind of a good…

You know, someone sends a patch, be like
"Cool, thanks"

Like a little notice in the changelog.

It was, you know, so stupid but…
Yeah, do that kind of thing.

So, moving on.

Why diffoscope?
Why did we write diffoscope?

What's the background here?

It comes from reproducible builds.

The very quick outline is that once you
get the source code for free software,

you download the source code for nginx
or whatever,

pretty much everyone just runs binaries
on their servers or their systems.

You know, "apt install bla", "yum install",
whatever.

Android Play Store, whatever.

Can you actually trust whether these two
things correspond with each other?

You've gotten the source code, it looks
alright, and then you install this binary,

yeah…

Who generated that? Can you trust that
process?

Can you trust who generated it?

Even if you could trust them, could you
trust them not to be exploited? Etc.

This is a big problem because you can
exploit a build farm and then

obviously exploit all of that, you know,
a trojan into the build farm,

so every single binary that comes out
is compromised.

Kind of problematic.

You could also target individual developers'
machines,

so I could go off to, say, your machine,
add a backdoor to it,

so every binary that you give to friends
and things like that,

are compromised in some way, stealing
your bitcoins or whatever.

I can also turn up at your door
and blackmail you into producing

software that has compromises or extra
features, shall we say,

that don't exist in the source code.

So what will happen there is that you'd
release your source

and the binaries you produce have
this sort of backdoor that, you know,

someone is forcing you into producing.

So, you don't want to do that.

Anyway

enough of that.

What you do for reproducible builds is you
ensure that every time you build

a piece of software, you get an identical
result.

Multiple people then compare their builds
and check whether they all get

the same results

and this means that an attacker must
either have infected everyone

at the same time, or they haven't
infected anyone.

The point here is that you have to ensure
that builds have identical results.
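
The comparison step can be sketched in a few lines of Python. This is only an illustration: the builder names and the `builds_agree` helper are made up here, and real projects compare published checksums rather than in-memory blobs.

```python
import hashlib
from typing import Mapping


def artifact_digest(blob: bytes) -> str:
    """Hash one build result so it can be compared without shipping the file."""
    return hashlib.sha256(blob).hexdigest()


def builds_agree(artifacts: Mapping[str, bytes]) -> bool:
    """True only if every independent builder produced bit-identical output."""
    digests = {artifact_digest(blob) for blob in artifacts.values()}
    return len(digests) == 1


# Hypothetical builders; the byte strings stand in for real binaries.
honest = b"same bytes from every builder"
builds = {"alice": honest, "bob": honest, "carol": honest}
tampered = dict(builds, bob=honest + b" plus a backdoor")

print(builds_agree(builds))
print(builds_agree(tampered))
```

If a single builder is compromised, its digest no longer matches the others, which is the whole point of comparing.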

Ok, great.

So, we started the reproducible builds
project, etc.

And we build 2 debs.

Oh, I'm sorry about the colors there.

You probably can't see that.

That says "sha1sum a.deb b.deb".

Anyway, we're comparing the sha1sums
of 2 binary Debian files.

So, these two files differ.

Ok, they're not reproducible.

Why is that?

So we run a diff on them.

Yeah…

So, what can we learn from this?

Well, not very much, visibly they're
compressed so

as soon as there's one change, the changes
just cascade

because that's how compression works.
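
A small sketch of that cascade, using Python's zlib as a stand-in compressor (an assumption for illustration; the effect is the same in kind for the gzip or xz members inside a .deb). One byte is changed in the input and the two compressed streams are compared position by position.

```python
import zlib

# 200 similar lines, so the compressor has plenty of redundancy to exploit.
original = "".join(
    f"entry {i}: built by the same toolchain\n" for i in range(200)
).encode()
# Change exactly one byte at the very start of the input ("e" becomes "E").
changed = b"E" + original[1:]

packed_a = zlib.compress(original)
packed_b = zlib.compress(changed)


def differing_positions(x: bytes, y: bytes) -> int:
    """Count positions where two byte strings disagree, including length overhang."""
    mismatches = sum(1 for p, q in zip(x, y) if p != q)
    return mismatches + abs(len(x) - len(y))


print("plain bytes that differ:     ", differing_positions(original, changed))
print("compressed bytes that differ:", differing_positions(packed_a, packed_b))
```

The plain inputs differ in one position; the compressed streams typically disagree in many more, which is why diffing them directly tells you so little.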

I guess we know it's a deb, probably an ar
format file, not very useful.

Ok, great so we're gonna have a look in

We'll do a binary diff and ok, well…

Again, that's not really telling us
very much

with the diff there.

Ok, great.

??? one level in

"ar x" is on the new maintainer thing,
"how you unpack a deb"

Everyone remembers this, right?

You unpack a.deb with "ar x" and you
do that to b.deb

and then we diff the results of that.
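
For reference, the "ar x" step can be mimicked in Python, since a .deb is a plain ar archive. This is a simplified sketch: it builds two fake archives in memory (the payloads are placeholders, not real package contents), it ignores long-name extensions, and `ar_member`, `ar_members` and `fake_deb` are hypothetical helper names.

```python
from typing import Dict

AR_MAGIC = b"!<arch>\n"


def ar_member(name: str, payload: bytes) -> bytes:
    """Build one ar member: 60-byte header, payload, padding to an even size."""
    field = name + "/"  # GNU-style names end with a slash
    header = (
        f"{field:<16}"          # file name
        f"{0:<12}"              # modification time
        f"{0:<6}{0:<6}"         # owner and group ids
        f"{'100644':<8}"        # file mode, octal
        f"{len(payload):<10}"   # payload size in bytes, decimal
    ).encode("ascii") + b"`\n"
    return header + payload + (b"\n" if len(payload) % 2 else b"")


def ar_members(blob: bytes) -> Dict[str, bytes]:
    """Return {member name: payload} for a simple ar archive held in memory."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members, pos = {}, len(AR_MAGIC)
    while pos + 60 <= len(blob):
        header = blob[pos:pos + 60]
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        start = pos + 60
        members[name] = blob[start:start + size]
        pos = start + size + (size % 2)  # members start on even offsets
    return members


def fake_deb(control: bytes) -> bytes:
    """Assemble a deb-shaped archive with placeholder member contents."""
    return (
        AR_MAGIC
        + ar_member("debian-binary", b"2.0\n")
        + ar_member("control.tar.xz", control)
        + ar_member("data.tar.xz", b"same data")
    )


a = ar_members(fake_deb(b"Version: 1.0-1\n"))
b = ar_members(fake_deb(b"Version: 1.0-2\n"))
changed = sorted(name for name in a if a[name] != b.get(name))
print(changed)
```

This only tells you which member differs; you still have to unpack that member and repeat, which is the manual loop the talk walks through.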

Ok, so…yeah, 7zip.

Ok, compressed content, not very useful.

Ok, so let's unpack the control.tar inside
these debs.

And then we run diff on that.

Still not really telling us anything useful
about how to make this package reproducible.

So let's unpack the tar.xz into the tar.

Inside that tar, there's a file called
md5sums and we start to see some differences

between some files in these two debs.

??? meaningful, so now
we have some idea that

it has something to do with this
usr/bin/pmixer binary.

Ok, interesting.

We'll unzip that and then we do a diff on
pmixer itself.

Now we're back into just binary
"gobbledygook" mode.

This isn't very helpful and this is taking
quite a while

and if I remember correctly, Debian has
a lot of packages.

So this might take a little while.

So, basically, ??? mean

I should build a better diff.

That's not quite true, this is actually…

It was Lunar that started this project

and it was called debbindiff, because
we wanted to diff

binary Debian packages.

So this is the initial commit, 2014.

"The version is successfully able to report
differences in two .changes files.

Not with much interesting details,
but it's a start."

And it was a start.

Fast forwarding… Oh, sorry about these
colors,

I don't know if we can do anything about
the lights?

Yeah?

No?

All right, whatever…

Basically, we're diffoscoping on…

It works kind of like diff does normally,

you give it two files, it outputs
a unified diff.

So "diffoscope a b", one file contains
the word "foo", one contains the word "bar".

Nothing actually out of the ordinary.

It's sort of colored by default, so that's
why you can't see it, but whatever.

It supports archive formats, so if you
give it two tar files,

if we then tar up our "a" file and
our "b" file into a a.tar and b.tar

and then run diffoscope on those tar files

we get this kind of, like, hierarchy here.

So it's saying that there are differences
between these files,

in the file list they have different time
stamps, because I made them

at different times,

and here are the contents, so we got
"foo" there and "bar" there.

So we can see the difference between them.

Well, I can, I don't know if you can,
you get the slide there.

If we gzip these tar files and then run
diffoscope on those gzip things,

it'll say "ok, what we've done is unpack it
first, and here's the metadata

about the gzip process",

and inside that are a.tar and b.tar
from the previous slides.

And then the "a" file and the "b" file.

So, it's really going two levels deep
into this tar.gz file.
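
The idea of descending through layers can be sketched with the Python standard library. This is not how diffoscope is implemented; it is a toy that only knows gzip and tar, builds its inputs in memory, and uses one member name ("file") for both archives to keep the report short.

```python
import difflib
import gzip
import io
import tarfile
from typing import Dict, List


def make_tar_gz(member: str, text: str) -> bytes:
    """Build an in-memory .tar.gz with one text file (stand-in for a.tar.gz)."""
    payload = text.encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(member)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return gzip.compress(buffer.getvalue())


def tar_files(blob: bytes) -> Dict[str, bytes]:
    """Read every regular file out of an uncompressed tar held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def compare(a: bytes, b: bytes, label: str = "input", depth: int = 0) -> List[str]:
    """Return indented report lines, descending through gzip and tar layers."""
    if a == b:
        return []
    pad = "  " * depth
    report = [f"{pad}{label} differs"]
    if a[:2] == b[:2] == b"\x1f\x8b":  # both start with the gzip magic bytes
        inner_a, inner_b = gzip.decompress(a), gzip.decompress(b)
        return report + compare(inner_a, inner_b, "gzip content", depth + 1)
    if a[257:262] == b[257:262] == b"ustar":  # both carry the tar header magic
        files_a, files_b = tar_files(a), tar_files(b)
        for name in sorted(files_a.keys() | files_b.keys()):
            report += compare(
                files_a.get(name, b""), files_b.get(name, b""), name, depth + 1
            )
        return report
    # Leaf: fall back to an ordinary unified diff of the text.
    diff = difflib.unified_diff(
        a.decode(errors="replace").splitlines(),
        b.decode(errors="replace").splitlines(),
        lineterm="",
    )
    return report + [f"{pad}  {line}" for line in diff]


report = compare(make_tar_gz("file", "foo\n"), make_tar_gz("file", "bar\n"))
print("\n".join(report))
```

Each level of indentation in the report corresponds to one layer that was unpacked, with the plain text diff at the bottom.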

That's pretty cool.

And it's completely recursive, I think
it will actually blow out after, I think,

1000 [levels].

[light is turned down for the audience
to see the slides]

I'll just bump back a bit, just in case.

[Applause]

Thank you.

So that's the a and b files.

We've tarred them up and so I see
the hierarchy of the foo and bar file layer.

I've gzipped them, so this is a gzip layer.

Here's the tar layer and then there's
the files themselves.

This is from a real .deb from the archive.

Inside this .deb, there's a data.tar.xz
and in that xz file there's a data.tar

and inside that tar file, there's a file
called aff and inside that

there's a version string that is different.

And that looks like a build date so we
probably know that if we went back

to the source package, we could very
quickly work out,

with a very quick grep, work out
where this file is being generated from,

the de_DE.aff file and then ???
probably quite obvious

that it's using the current build time
and then we can just patch that, fix it etc.

This has gone from two rather obscure
binary .debs all the way to the fix

probably in about 5 minutes, and you can
probably send the patch in that time

because it'd be quite quick.

Without diffoscope here, without this sort
of recursive unpacking,

you'd be just completely lost, you'd be
there with "ar x" all day

and working out which files are different
and trying to use xxd

and this kind of nonsense.

diffoscope's got some other things as well

if you try to do reproducible packages
and things are varying just on

the line ordering, we detect whether
a file differs only in the line ordering.

So, here's file "a", "These lines are in
order".

File "b" has "These order are in lines".

It's very difficult to say, actually,
it's like one of these tongue twisters.

Run diffoscope on these two and it says
it's got ordering differences only.
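
A check of this kind is easy to sketch. The following is one simple way to do it in Python, assuming one word per line as on the slide; the `classify` function and its return strings are illustrative, not diffoscope's own code or wording.

```python
def classify(a: str, b: str) -> str:
    """Say whether two texts are identical, reordered, or genuinely different."""
    lines_a, lines_b = a.splitlines(), b.splitlines()
    if lines_a == lines_b:
        return "identical"
    if sorted(lines_a) == sorted(lines_b):
        return "ordering differences only"
    return "content differs"


file_a = "These\nlines\nare\nin\norder\n"
file_b = "These\norder\nare\nin\nlines\n"

print(classify(file_a, file_b))
print(classify(file_a, "These\nlines\nare\nout\nof\norder\n"))
```

Sorting both sides before comparing is enough to separate "same lines, different order" from a real content change.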

That's interesting, so you probably need
to sort,

you go all the way back to the source code,
work out very quickly,

if you know it's just ordering differences
you just kind of know

what the output's gonna be, you can
search for order in ???

and you get the right files,

add a sort in the right
place, BAM! send the patch off,

everything is great.

Oh, and send it to upstream as well
because you're good.

It supports a lot more things.

We've been showing the terminal
text output here.

It's got an HTML output mode, which is
really useful for the hierarchical thing

when it gets a bit more complicated.

Instead of being laid on top of each other
like a unified diff,

you get the diff on the left and the right
and you get sort of a nested

thing inside with colors and lines and
you can link this and various things in it

including bits of metadata here, other
bits here, what command you used.

That's the HTML output.

We also support a lot of file formats,
it's not just on text,

it's about all of these, so let's quickly
run through some of them.

You give it two Android APK files which
are kind of like zips, but magic.

It'll know how to compare them.

There's like a Manifest file that needs
decoding.

It supports Berkeley DB databases,

Word documents, that's a Word document
with "a" and that's a Word document with "b"

and it'll correctly do that.

If you run that through diff normally,
that ??? be a binary mess,

so completely useless.

E-books, there's epub, it also supports
mobi.

So if you give it two epub files, it'll say
"They just differ in this date".

Brilliant.

Normally that will be completely useless
diff binary ???

So you can be like "epub date, ok", grep
the source code for that,

make a patch really quickly.

Mono binaries, git repositories, why not?

Gnumeric spreadsheets, ISO images.

Oh yeah, ISO images is really cool.

So, it'll basically unpack the ISO, then
inside that there might be a squashfs image

then it'll completely go down to that and
work out any differences

between the two contents in the ISO file,
including any metadata.

This is on the squashfs metadata headers,
I think.

But say inside that ISO, there was a file
that was a pdf, and inside that pdf was

a ??? which varied,

it will basically go all the way down
and say "yeah, it's actually here,

in this ??? that the data differs."

And that means you can just go again
all the way back to the source

and say "ok, cool, we know how to fix
this quite quickly"

And this is really valuable in getting
the recent Tails distribution reproducible

so their ISOs are reproducible.

If you build one and I build one, we get
the exact same one

and that's kind of useful for something
like Tails where you would probably want to

of all, there's a lot of projects that you
might want to compromise,

you might want to go after that one,
because of the kind of people that are using it.

We support comparing images, so this is
using ???

and then just running that through diff.

That is a Linux penguin and that is
something else,

I can't remember now. Oh, FT.

It supports images.

It supports JSON and pretty print,
so if you give it two JSON files

one with key/value… it'll do a nice
diff of them.

It will pretty print it first, before
doing the diff, so it'll actually give you

something clean, otherwise I don't know
if you've ever diffed

two very long JSON lines, if they differ
in the middle, you just get

a huge long unified diff, but here it's
like "oh, just ??? things have changed"
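
The pretty-print-then-diff idea can be sketched with Python's json and difflib modules. The sample documents and the `json_diff` helper are invented for illustration; diffoscope's own normalisation may differ in detail.

```python
import difflib
import json
from typing import List


def json_diff(a: str, b: str) -> List[str]:
    """Re-indent both documents, then run an ordinary line-based unified diff."""
    pretty_a = json.dumps(json.loads(a), indent=4).splitlines()
    pretty_b = json.dumps(json.loads(b), indent=4).splitlines()
    return list(
        difflib.unified_diff(pretty_a, pretty_b, "a.json", "b.json", lineterm="")
    )


# Two single-line documents that differ in one value in the middle.
doc_a = '{"name": "diffoscope", "version": 50, "formats": ["tar", "gzip"]}'
doc_b = '{"name": "diffoscope", "version": 80, "formats": ["tar", "gzip"]}'

result = json_diff(doc_a, doc_b)
print("\n".join(result))
```

Because each key ends up on its own line, the diff shows only the changed value instead of two long lines that differ somewhere in the middle.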

OpenDocument text formats,
Ogg audio files, because why not.

tcpdump capture files, that's actually
quite useful.

PDFs. That PDF says "Hello World" and
this PDF says "Hello sick sad world",

I don't know why I used that particular text
in the demo.

Again, run that through normal diff
program… garbage.

XML documents. Again, it'll pretty print
them so it's nice, actually nice to read.

If you want to get started on diffoscope,
the very easiest and quickest way to do it is to

fire up a web browser, try.diffoscope.org,
select your files, press Compare

and it'll upload them and run diffoscope
with all the support for all the file formats

in the cloud for you and give you a nice
HTML page that you can then link to people

So that's the very quickest way to get
started.

The next quickest way is to install
trydiffoscope and then you run that

on two files and it'll basically do
the same thing,

run it in the same cloud service as
try.diffoscope.org

but it'll give you the result on the
command line or

if you pass the webbrowser option, it will
give you a URL or load your web browser,

I can't remember exactly which, with
the same results.

This is 1kB of Python, nothing basically.

That's the next easiest way.

But you can then install diffoscope itself
on your own machine.

I recommend not installing recommends
because all of those file formats

might drag in extra things like
the whole of TeX,

I think the whole of OpenOffice, whole
of Mono, whole Java…

Android, yeah, quite big.

I think there's another big one I can't
think of.

They're all optional, and they all say
"By the way, I support TeX documents

or whatever, Mono, whatever.

But you need to install this package and
then you get full pretty printed support",

And it'll tell you that when it's missing.

So, if you just start with
--install-recommends disabled,

run it on your file, and if it says
"please install this package", you can then

install them as you go along, as you want

rather than installing everything.

And then you just pass ??? files
and then works as before

How can you improve your own
quality assurance and Debian packaging

with diffoscope?

The biggest value here is not
necessarily for reproducible builds.

It's basically for seeing where you
do want to have a diff, or are expecting a diff,

and you are expecting a particular type
of diff in a particular way,

you can basically see those changes.

And if you build two debs normally and
... I'll try to demo in a second

You build a deb without a patch applied and
then build a deb with the patch applied

you can ??? run a diff on the source package

But that's not very useful because the
binaries are what's going to end up on

people's machines. But if you run a diff on
the binary itself: did my change actually

hit the binary? I think really ...
No..

I'll just run through a very live demo, of
course, so it's gonna fail ...

Checkout some .... We'll get this 
libnetx-java

We just build that once

Let's say we are on the security team and

want to apply a patch, and we want to be
really sure because we are about to push it out

to all our users

First we will make a changelog

Closing a bug

Find some java file to change

Let's pretend we have a real patch

Let's replace that equals equals,
say that was the fix

So that's the patch from upstream

Upstream blast patch

When we build this what we wanna see is
just that change in the file

we don't wanna see any nonsense changes of
extended dump but we also definitely want

to see that change, because if our binaries,
for security reasons, don't have that change

then we aren't fixing people's machines,
they will issue a DSA ??? installed ???

And you should do proper testing as well
at multiple levels

I will build that again

So we wanna diff the original one 0 5,

We wanna diff that one with a fake 
security one

You see on the progress bar 100%,
1, there are differences (there should be

differences).
Let's see what those differences are

in our web browser, it's a nice HTML output

Let's have a look.
Are we seeing what we wanna see?

There are some changes in the data tar, we
kind of expect that

What's changed in our control file?
Well, the version changed, we wanted that

to change. Perfect

And its changed to ???
That's what we wanna see

No other changes here so there was no 
weird control or in magic going on

In our data tar the color of the timestamp
changes, we will ignore those for now

The changelog has changed, well I hope so
because I have changed that entry

Here is where we're going to start seeing it.
We are going to see the change in the

jar file, which is the Java class, Java
compiled archive format

We are seeing some meaningless timestamp
changes but we can ignore those,

let's pretend, because it's just
metadata maybe

Ok, part of a class, so if you can see here
it's basically a decompilation of the

Java file itself and it's basically saying
"oh, it used to say ifnull and now ifnonnull"

So these are the actual Java
bytecode instructions,

and what is really ??? here
is that nothing else has changed

We were just expecting that change between
the two opcodes, ifnull and ifnonnull,

which is good because it hasn't made
any other code changes but also, crucially, we can

see that it has actually made a change
to the code.

For example, it wasn't using some cached
version or something like that

This is really useful

And just running a naive diff wouldn't
give that of course, because it would just

come with binary garbage
And just seeing the diff had changed again

??? have told you anything, because all of the
change would have changed as well

So it's like, well, yes, it's different

The meaningful change there, it's
what actually fixes the flaw

??? but we know it's there

That's kind of ??? 
Shipping this deb out, I'd be quite

confident, that this seemed like the
actual bug

I'd be quite confident pushing that out
because it's a very minimal amount of changes

you wanna do that for security reasons

So this was the live demo

The other one is seeing no changes
at all, so you can build once

if you build a reproducible

You can build once change your compiler
or change some other part of your toolchain

Build it again and if you got the exact same
results, well great, that's what you intended

You wanna see no changes when you change
some part of it

And that is really useful, if there were
changes diffoscope will highlight them

and show exactly why they had changed,
maybe some compiler optimizations,

maybe some other things as well

So you can use it in both ways, when you
expect changes and when you don't expect

changes, and if those match the expectations
diffoscope will tell you exactly why

It's all ??? when other companies
are doing security releases

naming no names whatsoever,
but they like to release patches as you

know just a new firmware for your router

Very large file system images,
you basically have no idea what changed

between these two files; again, you run them
through diff, completely useless

You can start to unpack them with
squashfs and blah blah blah

But they're probably sort of concatenated
cpio archives, so that's nonsense

But diffoscope would just chew through those
and give you what the differences actually

are between these two files, and say
they changed this, they've removed or

added some GPL-licensed code or something
kind of interesting

So it's very useful for diffing those kinds of
binary blobs that come from various people

So the current state of diffoscope,
the development is up and down

It started around May 2014, something like that
A bunch of work here, that's idle I think

These are just for DebConfs basically

Anyway it's going up and down, it's kind
of interesting

??? a lot of reproducible builds projects
of course, so every time we do a build

on the ??? reproducible builds or
testing framework if we run diffoscope

on the result, if it's reproducible it
just says, hey, the file is the same

But if not, we publish the diffoscopes of
all your packages that are unreproducible

so you can just go there and be like
what's the difference between these two things

I invested a lot of work optimizing
diffoscope, ??? rather perverse n-squared

loops inside it. So I managed to cut down
some of the time here, cut down here

There's been quite a few performance
enhancements over the past ...

these are the git tags, this is version 80
and this is version 50, I just ran the same

benchmark across them all

So this shows when I have introduced some
rather stupid code, embarrassing, but whatever

???

There's work being done right now
on parallel processing, there's been

quite a few attempts before, but adding it
is kind of interesting and difficult

Luckily we have an outreach student
Liliana, is she in the room? Is she hiding?

She's here and she'll be talking tomorrow
about her work on parallel processing in

diffoscope and that will be amazing because
a lot of it is I/O bound or waiting for external

processes; with multiple-CPU machines,
you might as well just run in parallel:

while I'm waiting for the result
for a PDF to be unpacked I may as well

be running on another CPU, I think we are
going to see some real performance wins

as we get that parallel processing merged and
working and ???
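
Why waiting on external tools parallelises well can be shown with a toy example. This is a generic Python sketch using a thread pool, not the approach taken in diffoscope's parallel processing work; `slow_unpack` simulates an external tool with a short sleep and the member names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_unpack(name: str) -> str:
    """Pretend to wait on an external unpacker; the sleep stands in for that wait."""
    time.sleep(0.05)
    return f"{name}: unpacked"


members = ["manual.pdf", "data.tar.xz", "app.jar", "image.png"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() keeps results in input order even though the waits overlap.
    results = list(pool.map(slow_unpack, members))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Since each task spends its time waiting rather than computing, the waits overlap and the total time tends towards that of the slowest task rather than the sum of all of them.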

You can check out our website diffoscope.org
recently migrated to Salsa .... yeeaahhh

And everything that's reproducible is now
on Salsa, it's kind of cool

That's quite recent...
???

Thank you very much, danke schön

You got any questions?
About diffoscope?

Thank you very much !

[Applause]

Q: A buzzword question, can you diff container
image formats?

A: Depends which ones. So if they are just
directories, then yes, because it's just a directory

Do you have one particularly in mind? Like Docker?

Yes, there's Docker and then there's
OCI, I believe is the standard one

And that could make it buzzword compliant

Ah ok, we were all about buzzwords

Probably diffoscope blockchain as well

And then run diffoscope on containers and
see the difference between updates of your

container images

BAM ... solved
Where do I invest?

I wasn't aware that OCI ... that's how it's
called? No, it doesn't support that right now

But it wouldn't be too difficult, presuming
there are tools to unpack it, and as soon as

we have a tool to unpack it, it can then
just go to that, there is an open wishlist

bug too for Docker containers, to the
point where I think it would be really

nice if you could just give it, say, two
image names or whatever the noun is

So you can say "please diff these two
docker images that are available" and

it can look at your local thing and do 
a diff on them, currently it's not

supported, but there is an open wishlist
bug.

Q: Shouldn't any company that releases
binaries, be interested in supporting

diffoscope and using it?

A1: Basically when companies release binaries they are not interested in users seeing differences...

A2: Yes, I'm surprised that actually the
docker bug was only opened two months ago

and there hasn't been more interest in diffing
container images, but if you'd like to open

one for OCI that would be very appreciated,
and we can get on to that, that would be

great.

I was looking at the page for OCI, it says
it's based on Docker basically, so

you'd get OCI for free once you
sort it out for Docker, if you're lucky

The OCI image format, they based it
on Docker images

Ok we will sort that out, and it seems like
we're using Docker more and more

on Debian

Any other questions?

Q: Out of curiosity, which ??? are you using
inside? Are you using some bio-informatics

algorithm to diff trees efficiently?

A: No, it's really naive, all it does is run
normal diff, the normal diff tools, but

it will try to identify files and unpack
first, so it uses the file utility identifier

thing that says it's a PDF, and tries to
unpack it first, it doesn't do any clever

matching. The clever matching that it does
do is fuzzy matching, so if you just

rename a directory between two inside a
container, it will say, yeah, there's a

massive fuzzy match between these
two files, and things like that. So that's

kind of useful, but apart from that, nothing clever,
which is kind of what you want, because

if it's too clever it would start to be a little
opaque ...

I personally like dumb tools.
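
The "identify first, then unpack" step can be sketched as a lookup on leading magic bytes. The answer above refers to the file utility for identification; the hand-rolled table and `identify` function below are only a stand-in to show the principle, and cover just a handful of formats.

```python
import gzip
import lzma

# (leading bytes, description), checked in order.
MAGIC = [
    (b"!<arch>\n", "ar archive (for example a .deb)"),
    (b"\x1f\x8b", "gzip"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip (also jar, apk, epub)"),
    (b"%PDF-", "pdf"),
]


def identify(blob: bytes) -> str:
    """Guess a container format from its first bytes, else fall back to plain diff."""
    for magic, kind in MAGIC:
        if blob.startswith(magic):
            return kind
    return "unknown, fall back to a plain diff"


samples = {
    "a.gz": gzip.compress(b"foo"),
    "a.xz": lzma.compress(b"foo"),
    "a.pdf": b"%PDF-1.4 placeholder",
    "a.txt": b"foo\n",
}
for name, blob in samples.items():
    print(name, "->", identify(blob))
```

Once the type is known, the tool can pick the matching unpacker and hand the results to an ordinary diff, which keeps the overall behaviour easy to follow.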

Q: So one question to you is whether,
if you wanna do a release to stable or

something like that, you can ask for the
debdiff, I'm wondering if anyone

I mean I remember doing that myself
I've been submitting diffoscope output

as well, because it's just more readable and
useful. So I'm not sure if anyone has any

objection to people asking for those.

I'll propose that to the release team
see what they say

Thank you very much,
are there any other questions?

No further questions? Then let's thank
Chris again!

[Applause]