I'm here today to talk to you about
diffoscope
and how you can use it as a better diff
or for Quality Assurance, etc., things
like that.
Moin!
Apparently that's like a North German
thing to say "welcome".
North German, north Denmark, Scandinavia,
that kind of thing, I'm told.
People are shaking their head, so I'm
going to assume that's true.
This is my first PC, an IBM 5155.
Sometimes, when you rebooted it, it would
launch into, it would somehow revert
from booting from the hard disk to booting
from a BASIC ROM,
as in the programming language ROM.
It was on my motherboard for some reason.
So, randomly, you just get a chance to
program in BASIC and then,
sometimes you wouldn't, I don't know why,
but… yeah.
It's quite fun with this kind of clicky
keyboard, and that folded in
and it was this kind of big desk thing.
Anyway…
This is my first Debian.
At the time it was already old.
What's this one? Is this Slink? 2.2?
Yeah.
And this is when we had US and non-US,
so that's really dating if you remember that.
This is my first contribution to Debian,
19th December 2006,
sending a patch to lilypond which is kind
of interesting
and the response was "Oh yeah, rock on,
many thanks. I'll upload this and
it'll be landing to Etch".
And this was super motivating because
Etch was just coming out and it was like
"Great, I've got like one line of tiny patch
in a release. This is super cool."
Thomas' response was super motivating.
So, after that, like that Christmas
basically spent ???
Debian webpages and stuff.
Very well timed.
That's kind of a good…
You know, someone sends a patch, be like
"Cool, thanks"
Like a little notice in the changelog.
It was, you know, so stupid but…
Yeah, do that kind of thing.
So, moving on.
Why diffoscope?
Why did we write diffoscope?
What's the background here?
It comes from reproducible builds.
The very quick outline is that once you
get the source code for free software,
you download the source code for nginx
or whatever,
pretty much everyone just runs binaries
on their servers or their systems.
You know, "apt install bla", "yum install",
whatever.
Android Playstore, whatever.
Can you actually trust whether these two
things correspond with each other?
You've gotten the source code, it looks
alright, and then you install this binary,
yeah…
Who generated that? Can you trust that
process?
Can you trust who generated it?
Even if you could trust them, could you
trust them not to be exploited? Etc.
This is a big problem because you can
exploit a build farm and then
obviously exploit all of that, you know,
a trojan into the build farm,
so every single binary that comes out
is compromised.
Kind of problematic.
You could also target individual developers'
machines,
so I could go off to, say, your machine,
add a backdoor to it,
so every binary that you give to friends
and things like that,
are compromised in some way, stealing
your bitcoins or whatever.
I can also turn up at your door
and blackmail you into producing
software that has compromises or extra
features, shall we say,
that don't exist in the source code.
So what will happen there is that you'd
release your source
and the binaries you produce have
this sort of backdoor that, you know,
someone is forcing you into producing.
So, you don't want to do that.
Anyway
enough of that.
What you do for reproducible builds is you
ensure that every time you build
a piece of software, you get an identical
result.
Multiple people then compare their builds
and check whether they all get
the same results
and this means that an attacker must
either have infected everyone
at the same time, or they haven't
infected anyone.
The point here is that you have to ensure
that builds have identical results.
Ok, great.
So, we started the reproducible builds
project, etc.
And we build 2 debs.
Oh, I'm sorry about the colors there.
You probably can't see that.
That says "sha1sum a.deb b.deb".
Anyway, we're comparing the sha1sums
of 2 binary Debian files.
So, these two files differ.
Ok, they're not reproducible.
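The check on that slide can be sketched in a few lines of Python. This is only an illustration of comparing two builds by checksum; the helper names `sha1_of` and `same_build` are made up for the example and are not part of any reproducible builds tooling.

```python
import hashlib


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(path_a, path_b):
    """True when both files are bit-for-bit identical, judged by SHA-1."""
    return sha1_of(path_a) == sha1_of(path_b)
```

Calling `same_build("a.deb", "b.deb")` answers the yes/no question only; it says nothing about why two builds differ, which is the gap the rest of the talk is about.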
Why is that?
So we run a diff on them.
Yeah…
So, what can we learn from this?
Well, not very much, visibly they're
compressed so
as soon as we see one change, we'll see
they would just cascade changes
because that's how compression works.
I guess we know it's a deb, probably an ar
format file, not very useful.
Ok, great so we're gonna have a look in
We'll do a binary diff and ok, well…
Again, that's not really telling us
very much
with the diff there.
Ok, great.
??? one level in
"ar x" is on the new maintainer thing,
"how you unpack a deb"
Everyone remembers this, right?
You unpack a.deb with "ar x" and you
do that to b.deb
and then we diff the results of that.
Ok, so…yeah, 7zip.
Ok, compressed content, not very useful.
Ok, so let's unpack the control.tar inside
these debs.
And then we run diff on that.
Still not really telling us anything useful
about how to make this package reproducible.
So let's unpack the tar.xz into the tar.
Inside that tar, there's a file called
md5sums and we start to see some differences
between some files in these two debs.
??? meaningful, so now
we have some idea that
it has something to do with this
usr/bin/pmixer binary.
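That md5sums step could be sketched like this, assuming the usual "checksum, whitespace, path" layout of a md5sums control file. The function names are hypothetical, chosen just for this example.

```python
def parse_md5sums(text):
    """Map each path listed in a md5sums file to its checksum."""
    sums = {}
    for line in text.splitlines():
        if line.strip():
            checksum, path = line.split(None, 1)
            sums[path.strip()] = checksum
    return sums


def changed_paths(md5sums_a, md5sums_b):
    """Return the sorted paths whose checksum differs or that exist on one side only."""
    sums_a = parse_md5sums(md5sums_a)
    sums_b = parse_md5sums(md5sums_b)
    return sorted(
        path
        for path in sums_a.keys() | sums_b.keys()
        if sums_a.get(path) != sums_b.get(path)
    )
```

It only narrows things down to a file name, which is exactly where the manual approach runs out of steam.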
Ok, interesting.
We'll unzip that and then we do a diff on
pmixer itself.
Now we're back into just binary
"gobbledygook" mode
This isn't very helpful and this is taking
quite a while
and if I remember correctly, Debian has
a lot of packages.
So this might take a little while.
So, basically, ??? mean
I should build a better diff.
That's not quite true, this is actually…
It was Lunar who started this project
and it was called debbindiff, because
we wanted to diff
binary Debian packages.
So this is the initial commit, 2014.
"The version is successfully able to report
differences in two .changes files.
Not with much interesting details,
but it's a start."
And it was a start.
Fast forwarding… Oh, sorry about these
colors,
I don't know if we can do anything about
the lights?
Yeah?
No?
All right, whatever…
Basically, we're diffoscoping on…
It works kind of like diff does normally,
you give it two files, it outputs
a unified diff.
So "diffoscope a b", one file contains
the word "foo", one contains the word "bar".
Nothing actually out of the ordinary.
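For the plain-text case, roughly the same output can be had from Python's standard library. This is a sketch using `difflib`, not diffoscope's own code; the `unified` helper is a name invented here.

```python
import difflib


def unified(text_a, text_b, name_a="a", name_b="b"):
    """Return a unified diff of two strings, or an empty string if they match."""
    return "".join(
        difflib.unified_diff(
            text_a.splitlines(keepends=True),
            text_b.splitlines(keepends=True),
            fromfile=name_a,
            tofile=name_b,
        )
    )


print(unified("foo\n", "bar\n"))
```

The printed diff has a `-foo` line and a `+bar` line under the usual `---`/`+++` headers.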
It's sort of colored by default, so that's
why you can't see it, but whatever.
It supports archive formats, so if you
give it two tar files,
if we then tar up our "a" file and
our "b" file into a a.tar and b.tar
and then run diffoscope on those tar files
we get this kind of, like, hierarchy here.
So it's saying that there are differences
between these files,
in the file list they have different time
stamps, because I made them
at different times,
and here are the contents, so we got
"foo" there and "bar" there.
So we can see the difference between them.
Well, I can, I don't know if you can,
you get the slide there.
If we gzip these tar files and then run
diffoscope on those gzip things,
it'll say "ok, what we've done is unpack it
first, and here's the metadata
about the gzip process",
and inside that are a.tar and b.tar
from the previous slides.
And then the "a" file and the "b" file.
So, it's really going two levels deep
into this tar.gz file.
That's pretty cool.
And it's completely recursive, I think
it will actually blow out after, I think,
1000 [levels].
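The recursive idea can be sketched in Python for just two container types, gzip and tar. This is a toy under stated assumptions: members are paired by position so that "a" in a.tar lines up with "b" in b.tar, metadata such as timestamps is not reported, and the names `unpack` and `compare` are invented for the example rather than taken from diffoscope.

```python
import gzip
import io
import tarfile


def unpack(data):
    """Return {member_name: bytes} for a gzip or tar container, else None."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return {"(gzip content)": gzip.decompress(data)}
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
            return {
                member.name: archive.extractfile(member).read()
                for member in archive.getmembers()
                if member.isfile()
            }
    except tarfile.TarError:
        return None  # not a tar archive: treat as a leaf


def compare(data_a, data_b, path="", depth=0, max_depth=1000):
    """Yield (path, leaf_a, leaf_b) for each differing leaf, descending into containers."""
    if data_a == data_b:
        return
    inner_a, inner_b = unpack(data_a), unpack(data_b)
    if inner_a is None or inner_b is None or depth >= max_depth:
        yield (path, data_a, data_b)
        return
    # Pair members by position (a simplification for this sketch).
    for (name_a, child_a), (_, child_b) in zip(inner_a.items(), inner_b.items()):
        yield from compare(child_a, child_b, f"{path}/{name_a}", depth + 1, max_depth)
```

Given two gzipped tar files whose single members contain "foo" and "bar", `list(compare(a, b))` reports one differing leaf two levels down, which is the shape of the hierarchy on the slide.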
[light is turned down for the audience
to see the slides]
I'll just bump back a bit, just in case.
[Applause]
Thank you.
So that's the a and b files.
We've tarred them up and so I see
the hierarchy of foo and bar file layer.
I've gzipped them, so this is a gzip layer.
Here's the tar layer and then there's
the files themselves.
This is from a real .deb from the archive.
Inside this .deb, there's a data.tar.xz
and in that xz file there's a data.tar
and inside that tar file, there's a file
called aff and inside that
there's a version string that is different.
And that looks like a build date so we
probably know that if we went back
to the source package, we could very
quickly work out,
with a very quick grep, work out
where this file is being generated from,
the de_DE.aff file and then ???
probably quite obvious
that it's using the current build time
and then we can just patch that, fix it etc.
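One common shape for that kind of patch, sketched in Python: take the timestamp from the SOURCE_DATE_EPOCH environment variable when it is set, and only fall back to the clock otherwise. This is an assumption about what such a fix could look like, not the actual patch for this package; `build_timestamp` is a hypothetical helper.

```python
import os
import time


def build_timestamp(environ=os.environ):
    """Return a build date string, preferring SOURCE_DATE_EPOCH over the current time."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(epoch) if epoch is not None else int(time.time())
    # Format in UTC so the result does not depend on the build machine's timezone.
    return time.strftime("%Y-%m-%d", time.gmtime(seconds))
```

With the variable set to the same value on two builders, both builds embed the same date.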
This has gone from two rather obscure
binary .debs all the way to the fix
probably in about 5 minutes, and you can
probably send the patch in that time
because it'd be quite quick.
Without diffoscope here, without this sort
of recursive unpacking,
you'd be just completely lost, you'd be
there with "ar x" all day
and working out which files are different
and trying to use xxd
and this kind of nonsense.
diffoscope's got some other things as well
if you try to do reproducible packages
and things are varying just on
the line ordering, we detect whether
a file differs only in the line ordering.
So, here's file "a", "These lines are in
order".
File "b" has "These order are in lines".
It's very difficult to say, actually,
it's like one of these tongue twisters.
Run diffoscope on these two and it says
it's got ordering differences only.
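A check like that is small enough to sketch in Python. The function below is an illustration of the idea, not diffoscope's implementation, and its name is invented here.

```python
def differs_only_in_line_order(text_a, text_b):
    """True when both texts hold the same lines, just in a different order."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)
```

If it returns True, the likely fix is a missing sort somewhere in whatever generates the file.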
That's interesting, so you probably need
to sort,
you go all the way back to the source code,
work out very quickly,
if you know it's just ordering differences
you just kind of know
what the output's gonna be, you can
search for order in ???
and you get the right files,
add a sort in the right
place, BAM! Send the patch off,
everything is great.
Oh, and send it to upstream as well
because you're good.
It supports a lot more things.
We've been showing the terminal
text output here.
It's got an HTML output mode, which is
really useful in the hierarchical thing
when it gets a bit more complicated.
Instead of being laid on top of each other
like a unified diff,
you get the diff on the left and the right
and you get sort of a nested
thing inside with colors and lines and
you can link this and various things in it
including bits of metadata here, other
bits here, what command you used.
That's the HTML output.
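If you want to drive that from a script, a minimal Python wrapper might look like this. It assumes diffoscope is on the PATH and relies on its `--html` option and on its exit status being 0 when the inputs are the same and 1 when they differ; the two function names are made up for the example.

```python
import shutil
import subprocess


def diffoscope_command(path_a, path_b, html_report=None):
    """Build the argument list for a diffoscope run, optionally writing an HTML report."""
    command = ["diffoscope"]
    if html_report is not None:
        command += ["--html", html_report]
    return command + [path_a, path_b]


def run_diffoscope(path_a, path_b, html_report=None):
    """Run diffoscope if it is installed; return its exit status, or None if missing."""
    if shutil.which("diffoscope") is None:
        return None
    return subprocess.run(diffoscope_command(path_a, path_b, html_report)).returncode
```

For example, `run_diffoscope("a.deb", "b.deb", "report.html")` would leave the report next to you to open in a browser.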
We also support a lot of file formats,
it's not just on text,
it's about all of these, so let's quickly
run through some of them.
You give it two Android apk files which
are kind of like zips, but magic.
It'll know how to compare them.
There's like a Manifest file that needs
decoding.
It supports Berkeley DB databases,
Word documents, that's a Word document
with "a" and that's a Word document with "b"
and it'll correctly do that.
If you run that through diff normally,
that would be a binary mess,
so completely useless.
E-books, there's epub, it also supports
mobi.
So if you give it two epub files, it'll say
"They just differ in this date".
Brilliant.
Normally that will be completely useless
diff binary ???
So you can be like "epub date, ok", grep
the source code for that,
make a patch really quickly.
Mono binaries, git repositories, why not?
Gnumeric spreadsheets, ISO images.
Oh yeah, ISO images is really cool.
So, it'll basically unpack the ISO, then
inside that there might be a squashfs image
then it'll completely go down to that and
work out any differences
between the two contents in the ISO file,
including any metadata.
This is on the squashfs metadata headers,
I think.
But say inside that ISO, there was a file
that was a pdf, and inside that pdf was
a ??? which varied,
it will basically go all the way down
and say "yeah, it's actually here,
in this ??? that the data differs."
And that means you can just go again
all the way back to the source
and say "ok, cool, we know how to fix
this quite quickly"
And this is really valuable in getting
the recent Tails distribution reproducible
so their ISOs are reproducible.
If you build one and I build one, we get
the exact same one
and that's kind of useful for something
like Tails where you would probably want to
of all, there's a lot of projects that you
might want to compromise,
you might want to go after that one,
because of the kind of people that are using it.
We support comparing images, so this is
using ???
and then just running that through diff.
That is a linux penguin and that is
something else,
I can't remember now. Oh, FT.
It supports images.
It supports JSON and pretty print,
so if you give it two JSON files
one with key/value… it'll do a nice
diff of them.
It will pretty print it first, before
doing the diff, so it'll actually give you
something clean, otherwise I don't know
if you've ever diffed
two very long JSON lines, if they differ
in the middle, you just get
a huge long unified diff, but here it's
like "oh, just ??? things have changed"
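The pretty-print-then-diff trick can be sketched with the standard library. The indentation width and the key sorting below are choices made for this example and may not match what diffoscope does; `json_diff` is a hypothetical name.

```python
import difflib
import json


def json_diff(raw_a, raw_b):
    """Pretty-print two JSON documents, then return a unified diff as a list of lines."""
    pretty_a = json.dumps(json.loads(raw_a), indent=4, sort_keys=True)
    pretty_b = json.dumps(json.loads(raw_b), indent=4, sort_keys=True)
    return list(
        difflib.unified_diff(
            pretty_a.splitlines(),
            pretty_b.splitlines(),
            "a.json",
            "b.json",
            lineterm="",
        )
    )
```

Two one-line documents that differ in a single value then produce a diff that touches a single pretty-printed line instead of one enormous line.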
OpenDocument text formats,
Ogg audio files, because why not.
tcpdump capture files, that's actually
quite useful.
PDFs. That PDF says "Hello World" and
this PDF says "Hello sick sad world",
I don't know why, that particular text
in the demo.
Again, run that through normal diff
program… garbage.
XML documents. Again, it'll pretty print
them so it's nice, actually nice to read.
If you want to get started on diffoscope,
the very easiest and quickest way to do is
fire up a web browser, try.diffoscope.org,
select your files, press Compare
and it'll upload them and run diffoscope
with all the support for all the file formats
in the cloud for you and give you a nice
HTML page that you can then link to people
So that's the very quickest way to get
started.
The next quickest way is to install
trydiffoscope and then you run that
on two files and it'll basically do
the same thing,
run it in the same cloud service as
try.diffoscope.org
but it'll give you the result on the
command line or
if you pass the webbrowser option, it will
give you a URL or load your webbrowser,
I can't remember exactly which, with
the same results.
This is 1kB of Python, nothing basically.
That's the next easiest way.
But you can then install diffoscope itself
on your own machine.
I recommend not installing recommends
because all of those file formats
might drag in extra things about
the whole of TeX,
I think the whole of OpenOffice, whole
of Mono, whole Java…
Android, yeah, quite big.
I think there's another big one I can't
think of.
They're all optional, and they all say
"By the way, I support TeX documents
or whatever, Mono, whatever.
But you need to install this package and
then you get full pretty printed support",
And it'll tell you that when it's missing.
So, if you just start with
--install-recommends disabled,
run it on your files, and if it says
"please install this package", you can then
install them as you go along, as you want,
rather than installing everything.
And then you just pass ??? files
and then it works as before.
How can you improve all your own
quality assurance and Debian packaging
with diffoscope?
The biggest value here is not
necessarily for reproducible builds.
It's for basically just seeing where you
do want to have a diff or are expecting a diff,
and you are expecting a particular type
of diff in a particular way,
you can basically see those changes.
And if you build two debs normally and
... i'll try to demo in a second
You build a deb without a patch applied and
then build a deb with the patch applied,
you can ??? run a diff on the source package.
But that's not very useful because the
binaries are going to end up on
people's machines. But if you run a diff on
the binary itself, did my change actually
hit the binary? I think really ...
No..
I'll just run through a very live demo, of
course, so it's gonna fail ...
Checkout some .... We'll get this
libnetx-java
We just build that once
Let's say we are on the security team and
want to apply a patch, and we want to be
really sure because we are about to push it out
to all our users
First we will make a changelog
Closing a bug
Find some java file to change
Let's pretend we have a real patch
Let's replace that equals equals,
say that was the fix
So that's the patch from upstream
Upstream blast patch
When we build this, what we wanna see is
just that change in the file.
We don't wanna see any nonsense changes
of extended dump, but we also definitely want
to see that change, because if our binaries
for security reasons don't have that change
then we aren't fixing people's machines,
they will issue a DSA ??? installed ???
And you should do proper testing as well
at multiple levels
I will build that again
So we wanna diff the original one 0 5,
We wanna diff that one with a fake
security one
You see on the progress bar 100%
1 - there are differences (there should be
differences)
Let's see what those differences are
in our web browser, it's a nice HTML output
Let's have a look.
Are we seeing what we wanna see?
There are some changes in the data tar, we
kind of expect that
What's changed in our control file?
Well, the version changed, we wanted that
to change. Perfect
And it's changed to ???
That's what we wanna see
No other changes here so there was no
weird control or in magic going on
In our data tar the color of the timestamp
changes, we will ignore those for now
The changelog has changed, well I hope so
because I have changed that entry
Here is where we're going to start seeing
We are going to see the change in the
jar file, which is the Java class, Java
compiled archive format
We are seeing some meaningless timestamp
changes but we can ignore those
lets pretend because its just
metadata maybe
Ok, part of a class, so if you can see here
it's basically a decompilation of the
Java file itself and it's basically saying
"oh, I used to say ifnull and now ifnonnull"
So these are the actual Java
byte code instructions.
And what is really ??? here
is that nothing else has changed.
We were just expecting that change between
the two opcodes, ifnull and ifnonnull,
which is good because it hasn't made
any other code changes, but also, crucially, we can
see that it has actually made a change
to the code.
For example, it wasn't using some cached
version or something like that.
This is really useful
And just running a naive diff wouldn't
give that of course, because it would just
come with binary garbage
And just seeing the diff had changed again
??? be told you anything, because all of the
change would have changed as well
So it's like, well, yes, it's different.
The meaningful change there is
what actually fixes the "flaw"
??? but we know it's there
That's kind of ???
Shipping this deb out, I'd be quite
confident that this fixed the
actual bug.
I'd be quite confident pushing that out
because it's a very minimal amount of changes.
You wanna do that for security reasons.
So this was the live demo
The other one is seeing no changes
at all, so you can build once
if you build a reproducible
You can build once change your compiler
or change some other part of your toolchain
Build it again and if you got the exact same
results, well great, that's what you intended
You wanna see no changes when you change
some part of it
And that is really useful, if there were
changes diffoscope will highlight them
and show exactly why they had changed,
maybe some compiler optimizations,
maybe some other things as well
So you can use it in both ways, when you
expect changes and when you don't expect
changes, and if those match the expectations
diffoscope will tell you exactly why
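Both directions of that check can be sketched as one small Python function. It assumes you already have a mapping of path to checksum for each build (for instance from the md5sums files); `check_changes` and its return shape are invented for this example.

```python
def check_changes(files_a, files_b, expected):
    """Compare two {path: checksum} mappings against the set of paths expected to change."""
    changed = {
        path
        for path in files_a.keys() | files_b.keys()
        if files_a.get(path) != files_b.get(path)
    }
    expected = set(expected)
    return {
        "unexpected": sorted(changed - expected),  # changed, but should not have
        "missing": sorted(expected - changed),     # should have changed, but did not
    }
```

Passing an empty `expected` covers the "I changed my toolchain and want no differences" case; passing the patched file covers the security-fix case. Either way, a clean result has both lists empty.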
It's all ??? when other companies
are doing security releases
naming no names whatsoever,
but they like to release patches as you
know just a new firmware for your router
Very large file system images,
you basically have no idea what changed
between these two files, again you run them
through diff, completely useless
You can start to unpack them with
squashfs and blah blah blah
But they're probably sort of concatenated
cpio archives, so that's nonsense
But diffoscope would just chew through those
and give you actually what the differences
are between these two files, and say
they changed this, they've removed or
added some GPL-licensed code or something,
kind of interesting.
So it's very useful for diffing those kinds of
binary blobs that come from various people.
So the current state of diffoscope,
the development is up and down
It started around May 2014 something like that
A bunch of work here, that's idle I think
These are just for DebConfs basically
Anyway, it's going up and down, it's kind
of interesting
??? a lot of reproducible builds projects
of course, so every time we do a build
on the ??? reproducible builds or
testing framework if we run diffoscope
on the result, if it's reproducible it
just says, hey, the file is the same
But if not, we publish the diffoscopes of
all your packages that are unreproducible
so you can just go there and be like
what's the difference between these two things
I invested a lot of work optimizing
diffoscope, ??? rather perverse n-squared
loops inside it. So I managed to cut down
some of the time here, cut down here
There have been quite a few performance
enhancements over the past ...
these are the git tags, this is version 80
and this is version 50, I just ran the same
benchmark across them all
So this shows when I have introduced some
rather stupid code, embarrassing, but whatever
???
There's work being done right now,
on parallel processing, there's been
quite a few attempts before, but adding it
it's kind of interesting and difficult
Luckily we have a ??? student Liliana,
is she in the room? Is she hiding?
She's here and she'll be talking tomorrow
about her work on parallel processing in
diffoscope and that will be amazing because
a lot of it is IO bound or waiting for external
processes. With multiple-CPU machines,
you might as well just parallelize:
while I'm standing waiting for the result
for a PDF to be unpacked, I may as well
be running on another CPU. I think we are
going to see some real performance wins
as we get that parallel processing merged and
working and ???
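The general idea of overlapping waits on external processes can be sketched with a thread pool. This is not how the diffoscope work is implemented, just an illustration of the principle; `run_all` and the worker count are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, workers=4):
    """Run several external commands concurrently; return their stdout in input order."""

    def run_one(command):
        # Each thread mostly waits on its child process, so threads overlap well here.
        return subprocess.run(command, capture_output=True, text=True).stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

While one command is busy unpacking a PDF, the others can already be running on other CPUs, and the results still come back in the order the commands were given.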
You can check out our website diffoscope.org
recently migrated to Salsa .... yeeaahhh
And everything ??? reproducible is now on
Salsa, it's kind of cool
That's quite recent...
Thank you very much, Danke schön
You got any questions?
About diffoscope?
Thank you very much !
Q: A buzzword question, can you diff container
image formats?
A: Depends which ones. So if they are just
a directory, then yes, because it's just a directory.
Do you have one particularly in mind? Like Docker?
Yes, there's Docker and then there's
OCI, which I believe is the standard one.
And that could make it buzzword compliant.
Ah ok, we are all about buzzwords.
Probably diffoscope blockchain as well.
And then run diffoscope on containers and
see the difference between updates of your
container images
BAM ... solved
Where do I invest?
I wasn't aware that OCI ... is that what it's
called? No, it doesn't support that right now.
But it wouldn't be too difficult, presuming there
are tools to unpack it, and as soon as we have
a tool to unpack it, it can then just go
to that. There is a wishlist bug
for Docker containers, to the point where
I think it would be really nice if you
could just give it, say, two image names
or whatever the noun is
So you can say "please diff these two
docker images that are available" and
it can look at your local thing and do
a diff on them, currently it's not
supported, but there is an open wishlist
bug.
Q: Shouldn't any company that releases
binaries, be interested in supporting
diffoscope and using it?
A1: Basically when companies release binaries they are not interested in users seeing differences...
A2: Yes, I'm surprised that actually the
Docker bug was only opened two months ago
and there hasn't been more interest in diffing
container images, but if you'd like to open
one for OCI that would be very appreciated,
and we can get on to that, that would be
great.
I was looking at the page for OCI, it says
it's based on Docker basically, so
you get OCI for free once you've
sorted it out for Docker, if you're lucky.
The OCI image format, they wrote it based
on Docker images.
Ok, we will sort that out, and it seems like
we're using Docker more and more
in Debian.
Any other questions?
Q: Out of curiosity, which ??? are you using
inside? Are you using some bio-informatics
on ??? to diff trees efficiently?
A: No, it's really naive, all it does is run
normal diff, the normal diff tools, but
it will try to identify files and unpack them
first, so it uses the file utility identifier
thing that says it's a PDF, and tries to
unpack it first. It doesn't do any clever
matching. The clever matching that it does
do is fuzzy matching, so if you just
rename a directory between two inside a
container, it will say, yeah, there's a
massive match between these two files,
and things like that. So that's kind of
useful. ??? it's not that clever, which
is kind of what you want, because if it's
too clever it would start to be a little
opaque ...
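To give a feel for the fuzzy matching, here is a Python sketch that pairs up renamed files by content similarity. It swaps in `difflib.SequenceMatcher` purely for illustration; diffoscope's own matching works differently, and the `fuzzy_pairs` name and the 0.8 threshold are assumptions of this example.

```python
import difflib


def similarity(text_a, text_b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()


def fuzzy_pairs(only_in_a, only_in_b, threshold=0.8):
    """Pair files that exist under different names but have similar content.

    Both arguments map a file name to its text content.
    """
    pairs = []
    for name_a, content_a in only_in_a.items():
        scored = [
            (similarity(content_a, content_b), name_b)
            for name_b, content_b in only_in_b.items()
        ]
        if scored:
            score, name_b = max(scored)
            if score >= threshold:
                pairs.append((name_a, name_b))
    return pairs
```

Once two names are paired like this, the ordinary diff can run on the pair instead of reporting one file as removed and another as added.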
I personally like dumb tools.
Q: So one question to you is whether,
if you wanna do a release to stable or
something like that, you can ask for the
debdiff, I'm wondering if anyone...
I mean, I remember doing that myself.
I've been submitting diffoscope output
as well, because it's just more readable and
useful, so I'm not sure if anyone has any
objection to people asking for those.
I'll propose that to the release team,
see what they say.
Thank you very much,
any further questions?
[Applause]