I'm here today to talk to you about
diffoscope
and how you can use it as a better diff
or for Quality Assurance, etc., things
like that.
Moin!
Apparently that's like a North German
thing to say "welcome".
North German, north Denmark, Scandinavia,
that kind of thing, I'm told.
People are shaking their head, so I'm
going to assume that's true.
This is my first PC, an IBM 5155.
Sometimes, when you rebooted it, it would
launch into, it would somehow revert
from booting from the hard disk to booting
from a BASIC ROM,
as in the programming language ROM.
It was on my motherboard for some reason.
So, randomly, you just get a chance to
program in BASIC and then,
sometimes you wouldn't, I don't know why,
but… yeah.
It's quite fun with this kind of clicky
keyboard, and that folded in
and it was this kind of big desk thing.
Anyway…
This is my first Debian.
At the time it was already old.
What's this one? Is this Slink? 2.2?
Yeah.
And this is when we had US and non-US,
so that's really dating if you remember that.
This is my first contribution to Debian,
19th December 2006,
sending a patch to lilypond which is kind
of interesting
and the response was "Oh yeah, rock on,
many thanks. I'll upload this and
it'll be landing to Etch".
And this was super motivating because
Etch was just coming out and it was like
"Great, I've got a tiny one-line patch
in a release. This is super cool."
Thomas' response was super motivating.
So, after that, like that Christmas
basically spent ???
Debian webpages and stuff.
Very well timed.
That's kind of a good…
You know, someone sends a patch, be like
"Cool, thanks"
Like a little notice in the changelog.
It was, you know, so stupid but…
Yeah, do that kind of thing.
So, moving on.
Why diffoscope?
Why did we write diffoscope?
What's the background here?
It comes from reproducible builds.
The very quick outline is that once you
get the source code for free software,
you download the source code for nginx
or whatever,
pretty much everyone just runs binaries
on their servers or their systems.
You know, "apt install bla", "yum install",
whatever.
Android Playstore, whatever.
Can you actually trust whether these two
things correspond with each other?
You've gotten the source code, it looks
alright, and then you install this binary,
yeah…
Who generated that? Can you trust that
process?
Can you trust who generated it?
Even if you could trust them, could you
trust them not to be exploited? Etc.
This is a big problem because you can
exploit a build farm and then
obviously exploit all of that, you know,
a trojan into the build farm,
so every single binary that comes out
is compromised.
Kind of problematic.
You could also target individual developers'
machines,
so I could go off to, say, your machine,
add a backdoor to it,
so every binary that you give to friends
and things like that,
are compromised in some way, stealing
your bitcoins or whatever.
I can also ???
and blackmail you into producing
software that has compromises or extra
features, shall we say,
that don't exist in the source code.
So what will happen there is that you'd
release your source
and the binaries you produce have
this sort of backdoor that, you know,
someone is forcing you into producing.
So, you don't want to do that.
Anyway
enough of that.
What you do for reproducible builds is you
ensure that every time you build
a piece of software, you get an identical
result.
Multiple people then compare their builds
and check whether they all get
the same results
and this means that an attacker must
either have infected everyone
at the same time, or they haven't
infected anyone.
The point here is that you have to ensure
that builds have identical results.
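As a rough illustration of that comparison step, here is a minimal Python sketch. The builder names and the helper functions are made up for this example, and SHA-256 is just one reasonable choice of hash; this is not taken from any particular project's tooling.

```python
import hashlib
from collections import defaultdict


def group_by_digest(artifacts):
    """Group builder names by the SHA-256 digest of what each one built.

    `artifacts` maps a (hypothetical) builder name to the bytes it produced.
    """
    groups = defaultdict(list)
    for builder, data in artifacts.items():
        groups[hashlib.sha256(data).hexdigest()].append(builder)
    return dict(groups)


def is_reproducible(artifacts):
    """True when every builder produced bit-for-bit identical output."""
    return len(group_by_digest(artifacts)) == 1


if __name__ == "__main__":
    builds = {
        "alice": b"identical package bytes",
        "bob": b"identical package bytes",
        "carol": b"identical package bytes + backdoor",
    }
    print(is_reproducible(builds))  # False: carol's build disagrees
    for digest, builders in group_by_digest(builds).items():
        print(digest[:12], sorted(builders))
```

If the groups disagree, either the build is not reproducible yet or somebody's output has been tampered with; the hash alone cannot say which, which is where a better diff comes in.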
Ok, great.
So, we started the reproducible builds
project, etc.
And we build 2 debs.
Oh, I'm sorry about the colors there.
You probably can't see that.
That says "sha1sum a.deb b.deb".
Anyway, we're comparing the sha1sums
of 2 binary Debian files.
So, these two files differ.
Ok, they're not reproducible.
Why is that?
So we run a diff on them.
Yeah…
So, what can we learn from this?
Well, not very much. Visibly they're
compressed, so
as soon as there's one change, we'll see
the changes just cascade after it
because that's how compression works.
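To see that cascading effect for yourself, here is a small Python sketch using zlib. The sample input is invented, and zlib stands in for whatever compressor the package actually used; the exact count printed for the compressed data will depend on the input, so none is claimed here.

```python
import zlib


def differing_positions(a, b):
    """Count byte positions where a and b differ, plus any length difference."""
    common = sum(x != y for x, y in zip(a, b))
    return common + abs(len(a) - len(b))


if __name__ == "__main__":
    # Invented sample data: 200 similar lines, then change a single byte.
    original = b"".join(b"line %d of some build output\n" % i for i in range(200))
    changed = original.replace(b"line 5 of", b"line X of", 1)

    print("raw bytes differing:       ", differing_positions(original, changed))
    print("compressed bytes differing:", differing_positions(
        zlib.compress(original), zlib.compress(changed)))
```

The raw inputs differ in exactly one byte, while the compressed streams will usually differ in many more places, which is why a plain diff of the compressed files says so little.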
I guess we know it's a deb ???
format file, not very useful.
Ok, great so we're gonna have a look in
We'll do a binary diff and ok, well…
Again, that's not really telling us
very much
with the diff there.
Ok, great.
???
"ar x" is on the new maintainer thing,
"how you unpack a deb"
Everyone remembers this, right?
You unpack a.deb with "ar x" and you
do that to b.deb
and then we diff the results of that.
Ok, so…yeah, 7zip.
Ok, compressed content, not very useful.
Ok, so let's unpack the control.tar inside
these debs.
And then we run diff on that.
Still not really telling anything useful
about how to make this package reproducible
So let's unpack the tar.xz into the tar.
Inside that tar, there's a file called
md5sums and we start to see some differences
between some files in these two debs.
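That md5sums step can be sketched in a few lines of Python. This assumes the usual "digest, whitespace, path" layout per line; the digests and the second path below are placeholders, not values from a real package.

```python
def parse_md5sums(text):
    """Parse md5sums-style text ("<digest>  <path>" per line) into {path: digest}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(None, 1)
        entries[path] = digest
    return entries


def changed_paths(md5sums_a, md5sums_b):
    """Return sorted paths that are missing on one side or whose digests differ."""
    a, b = parse_md5sums(md5sums_a), parse_md5sums(md5sums_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


if __name__ == "__main__":
    # Placeholder digests and an invented second path, for illustration only.
    first = "1111  usr/bin/pmixer\n2222  usr/share/doc/pmixer/copyright\n"
    second = "9999  usr/bin/pmixer\n2222  usr/share/doc/pmixer/copyright\n"
    print(changed_paths(first, second))  # ['usr/bin/pmixer']
```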
??? meaningful, so now
we have some idea that
it has something to do with this
usr/bin/pmixer binary.
Ok, interesting.
We'll unzip that and then we do a diff on
pmixer itself.
Now we're back into just binary
??? mode
This isn't very helpful and this is taking
quite a while
and if I remember correctly, Debian has
a lot of packages.
So this might take a little while.
So, basically, ??? meme
I should build a better diff.
That's not quite true, this is actually…
It was Lunar that started this project
and it was called debbindiff, because
we wanted to diff
binary Debian packages.
So this is the initial commit, 2014.
"The version is successfully able to report
differences in two .changes files.
Not with much interesting details,
but it's a start."
And it was a start.
Fast forwarding… Oh, sorry about these
colors,
I don't know if we can do anything about
the lights?
Yeah?
No?
Alright, well…
Basically, we're diffoscoping on…
It works kind of like diff does normally,
you give it two files, it outputs
a unified diff.
So "diffoscope a b", one file contains
the word "foo", one contains the word "bar".
Nothing actually out of the ordinary.
It's sort of colored by default, so that's
why you can't see it, but whatever.
It supports archive formats, so if you
give it two tar files,
if we then tar up our "a" file and
our "b" file into a.tar and b.tar
and then run diffoscope on those tar files
we get this kind of, like, hierarchy here.
So it's saying that there are differences
between these files,
in the file list they have different time
stamps, because I made them
at different times,
and here are the contents, so we got
"foo" there and "bar" there.
So we can see the difference between them.
Well, I can, I don't know if you can,
you get the slide there.
If we gzip these tar files and then run
diffoscope on those gzip things,
it'll say "ok, what we've done is unpack it
first, and here's the metadata
about the gzip process",
and inside that are a.tar and b.tar
from the previous slides.
And then the "a" file and the "b" file.
So, it's really going two levels deep
into this tar.gz file.
That's pretty cool.
And it's completely recursive, I think
it will actually blow out after, I think,
1000 [levels].
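The recursive idea can be sketched as a toy in Python. This is not diffoscope's code: it only knows gzip and uncompressed tar, the depth cap is arbitrary, and the function names are invented for this example.

```python
import difflib
import gzip
import io
import tarfile

MAX_DEPTH = 10  # arbitrary cap for this sketch


def _tar_members(data):
    """Return {name: bytes} for regular files if `data` is a tar archive, else None."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}
    except tarfile.TarError:
        return None


def compare(name, a, b, depth=0):
    """Return indented report lines describing how bytes `a` and `b` differ."""
    pad = "  " * depth
    if a == b:
        return []
    if depth >= MAX_DEPTH:
        return [pad + name + ": differs (depth limit reached)"]
    if a[:2] == b[:2] == b"\x1f\x8b":  # gzip magic number on both sides
        inner = compare("(gzip contents)", gzip.decompress(a), gzip.decompress(b), depth + 1)
        return [pad + name] + (inner or [pad + "  only gzip metadata differs"])
    members_a, members_b = _tar_members(a), _tar_members(b)
    if members_a is not None and members_b is not None:
        report = [pad + name]
        for member in sorted(members_a.keys() | members_b.keys()):
            report += compare(member, members_a.get(member, b""),
                              members_b.get(member, b""), depth + 1)
        return report if len(report) > 1 else [pad + name + ": only tar metadata differs"]
    # Fall back to a plain text diff at the leaves.
    diff = difflib.unified_diff(
        a.decode(errors="replace").splitlines(),
        b.decode(errors="replace").splitlines(),
        "a/" + name, "b/" + name, lineterm="")
    return [pad + name] + [pad + "  " + line for line in diff]


def make_tar(filename, content, mtime):
    """Build a one-file tar archive in memory (helper for the demo)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(filename)
        info.size = len(content)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


if __name__ == "__main__":
    a = gzip.compress(make_tar("file", b"foo\n", mtime=1))
    b = gzip.compress(make_tar("file", b"bar\n", mtime=2))
    print("\n".join(compare("x.tar.gz", a, b)))
```

Each container format gets peeled off in turn, and only the innermost difference is shown as a normal unified diff, indented under the layers it was found in.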
[light is turned down for the audience
to see the slides]
I'll just bump back a bit, just in case.
[Applause]
Thank you.
So that's the a and b files.
We've tarred them up and so I see
the hierarchy of foo and bar file layer.
I've gzipped them, so this is a gzip layer.
Here's the tar layer and then there's
the files themselves.
This is from a real .deb from the archive.
Inside this .deb, there's a data.tar.xz
and in that xz file there's a data.tar
and inside that tar file, there's a file
called aff and inside that
there's a version string that is different.
And that looks like a build date so we
probably know that if we went back
to the source package, we could very
quickly work out,
with a very quick grep, work out
where this file is being generated from,
the de_DE.aff file and then ???
probably quite obvious
that it's using the current build time
and then we can just patch that, fix it etc.
This has gone from two rather obscure
binary .debs all the way to the fix
probably in about 5 minutes, and you can
probably send the patch in that time
because it'd be quite quick.
Without diffoscope here, without this sort
of recursive unpacking,
you'd be just completely lost, you'd be
there with "ar x" all day
and working out which files are different
and trying to use xxd
and this kind of nonsense.
diffoscope's got some other things as well
if you try to do reproducible packages
and things are varying just on
the line ordering, we detect whether
a file differs only in the line ordering.
So, here's file "a", "These lines are in
order".
File "b" has "These order are in lines".
It's very difficult to say, actually,
it's like one of these tongue twisters.
Run diffoscope on these two and it says
it's got ordering differences only.
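One simple way to detect that case, sketched in Python: the lines are equal once sorted, but not as given. This assumes the slide had one word per line, and it is only an illustration of the idea, not necessarily how diffoscope implements the check.

```python
def differs_only_in_line_order(a, b):
    """True when a and b contain the same lines but in a different order."""
    lines_a, lines_b = a.splitlines(), b.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)


if __name__ == "__main__":
    a = "These\nlines\nare\nin\norder\n"
    b = "These\norder\nare\nin\nlines\n"
    print(differs_only_in_line_order(a, b))  # True
    print(differs_only_in_line_order(a, a))  # False: nothing differs at all
    print(differs_only_in_line_order(a, "Something\nelse\n"))  # False
```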
That's interesting, so you probably need
to sort,
you go all the way back to the source code,
work out very quickly,
if you know it's just ordering differences
you just kind of know
what the output's gonna be, you can
search for order in ???
and you get the right files,
??? sort in the right place,
BAM, send a patch off (???),
everything is great.
Oh, and send it to upstream as well
because you're good.
It supports a lot more things.
We've been showing the terminal
text output here.
It's got an HTML output mode, which is
really useful in the hierarchical thing
when it gets a bit more complicated.
Instead of being laid on top of each other
like a unified diff,
you get the diff on the left and the right
and you get sort of a nested
thing inside with colors and lines and
you can link this and various things in it
including bits of metadata here, other
bits here, what command you used.
That's the HTML output.
We also support a lot of file formats,
it's not just on text,
it's about all of these, so let's quickly
run through some of them.
You give it two Android apk files which
are kind of like zips, but magic.
It'll know how to compare them.
There's like a Manifest file that needs
decoding.
It supports Berkeley DB databases,
Word documents, that's a Word document
with "a" and that's a Word document with "b"
and it'll correctly do that.
If you run that through diff normally,
that ??? be a binary mess,
so completely useless.
E-books, there's epub, it also supports
mobi.
So if you give it two epub files, it'll say
"They just differ in this date".
Brilliant.
Normally that will be completely useless
diff binary ???
So you can be like "epub date, ok", grep
the source code for that,
make a patch really quickly.
Mono binaries, git repositories, why not?
Gnumeric spreadsheets, ISO images.
Oh yeah, ISO images is really cool.
So, it'll basically unpack the ISO, then
inside that there might be a squashfs image
then it'll completely go down to that and
work out any differences
between the two contents in the ISO file,
including any metadata.
This is on the squashfs metadata headers,
I think.
But say inside that ISO, there was a file
that was a pdf, and inside that pdf was
a ??? which varied,
it will basically go all the way down
and say "yeah, it's actually here,
in this ??? that the data differs."
And that means you can just go again
all the way back to the source
and say "ok, cool, we know how to fix
this quite quickly"
And this is really valuable in getting
the recent Tails distribution reproducible
so their ISOs are reproducible.
If you build one and I build one, we get
the exact same one
and that's kind of useful for something
like Tails where,
of all the projects that you
might want to compromise,
you might want to go after that one,
because of the kind of people that are using it.
We support comparing images, so this is
using ???
and then just running that through diff.
That is a Linux penguin and that is
something else,
I can't remember now. Oh, FT.
It supports images.
It supports JSON and pretty print,
so if you give it two JSON files
one with key/value… it'll do a nice
diff of them.
It will pretty print it first, before
doing the diff, so it'll actually give you
something clean, otherwise I don't know
if you've ever diffed
two very long JSON lines, if they differ
in the middle, you just get
a huge long unified diff, but here it's
like "oh, just ??? things have changed"
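The pretty-print-then-diff trick is easy to try with Python's standard library. This is a sketch of the idea only; the sample documents are invented, and sorting the keys is a choice made here rather than something stated in the talk.

```python
import difflib
import json


def pretty_json_diff(a, b):
    """Unified diff of two JSON documents after re-indenting them."""
    def pretty(text):
        # sort_keys is a choice made for this sketch so key order adds no noise.
        return json.dumps(json.loads(text), indent=4, sort_keys=True).splitlines()
    return list(difflib.unified_diff(pretty(a), pretty(b),
                                     "a.json", "b.json", lineterm=""))


if __name__ == "__main__":
    a = '{"name": "demo", "version": 1, "tags": ["x", "y"]}'
    b = '{"name": "demo", "version": 2, "tags": ["x", "y"]}'
    print("\n".join(pretty_json_diff(a, b)))
```

Because each key ends up on its own line, the diff narrows down to the one value that changed instead of one enormous line on each side.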
OpenDocument text formats,
Ogg audio files, because why not.
tcpdump capture files, that's actually
quite useful.
PDFs. That PDF says "Hello World" and
this PDF says "Hello sick sad world",
I don't know why. ???
in the demo.
Again, run that through normal diff
program… garbage.
XML documents. Again, it'll pretty print
them so it's nice, actually nice to read.
If you want to get started on diffoscope,
the very easiest and quickest way to do that is
fire up a web browser, try.diffoscope.org,
select your files, press Compare
and it'll upload them and run diffoscope
with all the support for all the file formats
in the cloud for you and give you a nice
HTML page that you can then link to people
So that's the very quickest way to get
started.
The next quickest way is to install
trydiffoscope and then you run that
on two files and it'll basically do
the same thing,
run it in the same cloud service as
try.diffoscope.org
but it'll give you the result on the
command line or
if you pass the webbrowser option, it will
give you a URL or load your web browser,
I can't remember exactly which, with
the same results.
This is 1kB of Python, nothing basically.
That's the next easiest way.
But you can then install diffoscope itself
on your own machine.
I recommend not installing recommends
because all of those file formats
might drag in extra things about
the whole of TeX,
I think the whole of OpenOffice, whole
of Mono, whole Java…
Android, yeah, quite big.
I think there's another big one I can't
think of.
They're all optional, and they all say
"By the way, I support TeX documents
or whatever, Mono, whatever.
But you need to install this package and
then you get full pretty printed support",
And it'll tell you that when it's missing.
So, if you just start with recommends
disabled (apt's --no-install-recommends)
and run it on your file, if it says
"please install this package", you can then
install them as you go along, as you want,
rather than installing everything.
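The "tell you what's missing" behaviour can be imitated with a few lines of Python. The command and package names below are placeholders invented for this sketch, and this is not diffoscope's own implementation; it just shows the general shape of checking for optional helpers at run time.

```python
import shutil

# Hypothetical mapping from an external helper command to the package that
# would provide it; these names are placeholders for illustration only.
OPTIONAL_TOOLS = {
    "example-pdf-dumper": "example-pdf-tools",
    "example-apk-decoder": "example-android-tools",
}


def missing_tools(tools, which=shutil.which):
    """Return {command: package} for every command not found on PATH."""
    return {cmd: pkg for cmd, pkg in tools.items() if which(cmd) is None}


if __name__ == "__main__":
    for cmd, pkg in missing_tools(OPTIONAL_TOOLS).items():
        print(f"'{cmd}' not found; install '{pkg}' for fuller output")
```

The `which` parameter is only there so the lookup can be swapped out when trying the function without touching the real PATH.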
And then ??? and then works as before.
You can improve all your own quality
assurance and Debian packaging
with diffoscope.
The biggest value here is not
necessarily for reproducible builds.
It's for basically just seeing, where you
do want to have a diff or are expecting a diff,
and you are expecting a particular type
of diff in a particular way,
that you can basically see those changes.
And if you build two debs normally and
... i'll try to demo in a second
You build a deb with a patch applied you
can ??? see a diff on the source package
But that's not very useful because the
binaries are going to end up on
people's machines. But if you run a diff on
the binary itself, did that change and
really hit the binary, I think really ...
No..
I'll just run through a very live demo, of
course, so it's gonna fail ...
Checkout some .... We'll get this
libnetx-java
We just build that once
Let's say we are on the security team and
want to apply a patch, and we want to be
really sure because we are about to push it out
to all our users
First we will make a changelog
Closing a bug
Find some java file to change
Let's pretend we have a real patch
Let's replace that equals equals,
say that was the fix
So that's the patch from upstream
Upstream blast patch
When we build this what we wanna see is
just that change in the file
we don't wanna see any nonsense changes of
extended ??? but we also definitely want
to see that change, 'cause if our binary,
for security reasons, doesn't have that change