-
Not Synced
I'm here today to talk to you about
diffoscope
-
Not Synced
and how you can use it as a better diff
-
Not Synced
or for Quality Assurance, etc., things
like that.
-
Not Synced
Moin!
-
Not Synced
Apparently that's like a North German
thing to say "welcome".
-
Not Synced
North German, northern Denmark, Scandinavia,
that kind of thing, I'm told.
-
Not Synced
People are shaking their head, so I'm
going to assume that's true.
-
Not Synced
This is my first PC, an IBM 5155.
-
Not Synced
Sometimes, when you rebooted it, it would
launch into, it would somehow revert
-
Not Synced
from booting from the hard disk to booting
from a BASIC ROM,
-
Not Synced
as in the programming language ROM.
-
Not Synced
It was on my motherboard for some reason.
-
Not Synced
So, randomly, you just get a chance to
program in BASIC and then,
-
Not Synced
sometimes you wouldn't, I don't know why,
but… yeah.
-
Not Synced
It's quite fun with this kind of clicky
keyboard, and that folded in
-
Not Synced
and it was this kind of big desk thing.
-
Not Synced
Anyway…
-
Not Synced
This is my first Debian.
-
Not Synced
At the time it was already old.
-
Not Synced
What's this one? Is this Slink? 2.2?
Yeah.
-
Not Synced
And this is when we had US and non-US,
so that's really dating if you remember that.
-
Not Synced
This is my first contribution to Debian,
19th December 2006,
-
Not Synced
sending a patch to LilyPond which is kind
of interesting
-
Not Synced
and the response was "Oh yeah, rock on,
many thanks. I'll upload this and
-
Not Synced
it'll be landing to Etch".
-
Not Synced
And this was super motivating because
Etch was just coming out and it was like
-
Not Synced
"Great, I've got a tiny one-line patch
in a release. This is super cool."
-
Not Synced
Thomas' response was super motivating.
-
Not Synced
So, after that, like that Christmas
basically spent ???
-
Not Synced
Debian webpages and stuff.
-
Not Synced
Very well timed.
-
Not Synced
That's kind of a good…
-
Not Synced
You know, someone sends a patch, be like
"Cool, thanks"
-
Not Synced
Like a little notice in the changelog.
-
Not Synced
It was, you know, so stupid but…
Yeah, do that kind of thing.
-
Not Synced
So, moving on.
-
Not Synced
Why diffoscope?
Why did we write diffoscope?
-
Not Synced
What's the background here?
-
Not Synced
It comes from reproducible builds.
-
Not Synced
The very quick outline is that once you
get the source code for free software,
-
Not Synced
you download the source code for nginx
or whatever,
-
Not Synced
pretty much everyone just runs binaries
on their servers or their systems.
-
Not Synced
You know, "apt install bla", "yum install",
whatever.
-
Not Synced
Android Play Store, whatever.
-
Not Synced
Can you actually trust whether these two
things correspond with each other?
-
Not Synced
You've gotten the source code, it looks
alright, and then you install this binary,
-
Not Synced
yeah…
-
Not Synced
Who generated that? Can you trust that
process?
-
Not Synced
Can you trust who generated it?
-
Not Synced
Even if you could trust them, could you
trust them not to be exploited? Etc.
-
Not Synced
This is a big problem because you can
exploit a build farm and then
-
Not Synced
obviously exploit all of that, you know,
a trojan into the build farm,
-
Not Synced
so every single binary that comes out
is compromised.
-
Not Synced
Kind of problematic.
-
Not Synced
You could also target individual developers'
machines,
-
Not Synced
so I could go off to, say, your machine,
add a backdoor to it,
-
Not Synced
so every binary that you give to friends
and things like that,
-
Not Synced
are compromised in some way, stealing
your bitcoins or whatever.
-
Not Synced
I can also ???
and blackmail you into producing
-
Not Synced
software that has compromises or extra
features, shall we say,
-
Not Synced
that don't exist in the source code.
-
Not Synced
So what will happen there is that you'd
release your source
-
Not Synced
and the binaries you produce have
this sort of backdoor that, you know,
-
Not Synced
someone is forcing you into producing.
-
Not Synced
So, you don't want to do that.
-
Not Synced
Anyway
-
Not Synced
enough of that.
-
Not Synced
What you do for reproducible builds is you
ensure that every time you build
-
Not Synced
a piece of software, you get an identical
result.
-
Not Synced
Multiple people then compare their builds
and check whether they all get
-
Not Synced
the same results
-
Not Synced
and this means that an attacker must
either have infected everyone
-
Not Synced
at the same time, or they haven't
infected anyone.
-
Not Synced
The point here is that you have to ensure
that builds have identical results.
-
Not Synced
Ok, great.
-
Not Synced
So, we started the reproducible builds
project, etc.
-
Not Synced
And we build 2 debs.
-
Not Synced
Oh, I'm sorry about the colors there.
-
Not Synced
You probably can't see that.
-
Not Synced
That says "sha1sum a.deb b.deb".
-
Not Synced
Anyway, we're comparing the sha1sums
of 2 binary Debian files.
-
Not Synced
So, these two files differ.
-
Not Synced
Ok, they're not reproducible.
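
As a rough Python sketch of that checksum comparison (the function names `sha1_of` and `same_build` are made up for illustration and are not part of any Debian tool):

```python
import hashlib


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(first, second):
    """True when both files have the same digest, i.e. the build reproduced."""
    return sha1_of(first) == sha1_of(second)
```

Calling `same_build("a.deb", "b.deb")` would return False for the two packages on the slide. It only says that they differ, not where or why.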
-
Not Synced
Why is that?
-
Not Synced
So we run a diff on them.
-
Not Synced
Yeah…
-
Not Synced
So, what can we learn from this?
-
Not Synced
Well, not very much, visibly they're
compressed so
-
Not Synced
as soon as there's one change, we'll see
the changes just cascade
-
Not Synced
because that's how compression works.
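
A small Python experiment can illustrate the cascade. This is only a sketch: it uses zlib rather than the xz compression found in real .debs, and the input text is invented.

```python
import zlib

# Two inputs that differ in exactly one byte at the very start.
original = b"".join(b"line %d of some build log\n" % i for i in range(200))
changed = b"L" + original[1:]

packed_a = zlib.compress(original)
packed_b = zlib.compress(changed)


def differing_bytes(left, right):
    """Count positions where two byte strings differ, plus any length gap."""
    mismatches = sum(x != y for x, y in zip(left, right))
    return mismatches + abs(len(left) - len(right))


print("uncompressed bytes that differ:", differing_bytes(original, changed))
print("compressed bytes that differ:  ", differing_bytes(packed_a, packed_b))
```

The second number is typically larger than the first, which is why a plain diff of compressed files says so little about the underlying change.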
-
Not Synced
I guess we know it's a deb ???
format file, not very useful.
-
Not Synced
Ok, great so we're gonna have a look in
-
Not Synced
We'll do a binary diff and ok, well…
-
Not Synced
Again, that's not really telling us
very much
-
Not Synced
with the diff there.
-
Not Synced
Ok, great.
-
Not Synced
???
-
Not Synced
"ar x" is on the new maintainer thing,
"how you unpack a deb"
-
Not Synced
Everyone remembers this, right?
-
Not Synced
You unpack a.deb with "ar x" and you
do that to b.deb
-
Not Synced
and then we diff the results of that.
-
Not Synced
Ok, so…yeah, 7zip.
-
Not Synced
Ok, compressed content, not very useful.
-
Not Synced
Ok, so let's unpack the control.tar inside
these debs.
-
Not Synced
And then we run diff on that.
-
Not Synced
Still not really telling us anything useful
about how to make this package reproducible.
-
Not Synced
So let's unpack the tar.xz into the tar.
-
Not Synced
Inside that tar, there's a file called
md5sums and we start to see some differences
-
Not Synced
between some files in these two debs.
-
Not Synced
??? meaningful, so now
we have some idea that
-
Not Synced
it has something to do with this
usr/bin/pmixer binary.
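
That manual md5sums comparison could be sketched in Python roughly like this. The checksums and the second path are invented placeholders, and `parse_md5sums` and `changed_paths` are hypothetical helper names.

```python
def parse_md5sums(text):
    """Map each path to its checksum from 'checksum  path' lines."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, path = line.split(None, 1)
        sums[path.strip()] = checksum
    return sums


def changed_paths(first_text, second_text):
    """Return the sorted paths whose checksum differs or exists on one side only."""
    first, second = parse_md5sums(first_text), parse_md5sums(second_text)
    return sorted(
        path
        for path in first.keys() | second.keys()
        if first.get(path) != second.get(path)
    )


# Placeholder data: not the real checksums from the talk.
a_sums = (
    "11111111111111111111111111111111  usr/bin/pmixer\n"
    "22222222222222222222222222222222  usr/share/doc/pmixer/copyright\n"
)
b_sums = (
    "33333333333333333333333333333333  usr/bin/pmixer\n"
    "22222222222222222222222222222222  usr/share/doc/pmixer/copyright\n"
)
print(changed_paths(a_sums, b_sums))
```

This narrows things down to a file name, but it still says nothing about what changed inside that file.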
-
Not Synced
Ok, interesting.
-
Not Synced
We'll unzip that and then we do a diff on
pmixer itself.
-
Not Synced
Now we're back into just binary
??? mode
-
Not Synced
This isn't very helpful and this is taking
quite a while
-
Not Synced
and if I remember correctly, Debian has
a lot of packages.
-
Not Synced
So this might take a little while.
-
Not Synced
So, basically, ??? meme
-
Not Synced
I should build a better diff.
-
Not Synced
That's not quite true, this is actually…
-
Not Synced
It was Lunar that started this project
-
Not Synced
and it was called debbindiff, because
we wanted to diff
-
Not Synced
binary Debian packages.
-
Not Synced
So this is the initial commit, 2014.
-
Not Synced
"The version is successfully able to report
differences in two .changes files.
-
Not Synced
Not with much interesting details,
but it's a start."
-
Not Synced
And it was a start.
-
Not Synced
Fast forwarding… Oh, sorry about these
colors,
-
Not Synced
I don't know if we can do anything about
the lights?
-
Not Synced
Yeah?
-
Not Synced
No?
-
Not Synced
Alright, well…
-
Not Synced
Basically, we're diffoscoping on…
-
Not Synced
It works kind of like diff does normally,
-
Not Synced
you give it two files, it outputs
a unified diff.
-
Not Synced
So "diffoscope a b", one file contains
the word "foo", one contains the word "bar".
-
Not Synced
Nothing actually out of the ordinary.
-
Not Synced
It's sort of colored by default, so that's
why you can't see it, but whatever.
-
Not Synced
It supports archive formats, so if you
give it two tar files,
-
Not Synced
if we then tar up our "a" file and
our "b" file into an a.tar and a b.tar
-
Not Synced
and then run diffoscope on those tar files
-
Not Synced
we get this kind of, like, hierarchy here.
-
Not Synced
So it's saying that there are differences
between these files,
-
Not Synced
in the file list they have different
timestamps, because I made them
-
Not Synced
at different times,
-
Not Synced
and here are the contents, so we got
"foo" there and "bar" there.
-
Not Synced
So we can see the difference between them.
-
Not Synced
Well, I can, I don't know if you can,
you get the slide there.
-
Not Synced
If we gzip these tar files and then run
diffoscope on those gzip things,
-
Not Synced
it'll say "ok, what we've done is unpack it
first, and here's the metadata
-
Not Synced
about the gzip process",
-
Not Synced
and inside that are a.tar and b.tar
from the previous slides.
-
Not Synced
And then the "a" file and the "b" file.
-
Not Synced
So, it's really going two levels deep
into this tar.gz file.
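
The idea of descending through layers can be sketched in a few lines of Python. This is an illustration using only the standard library, not diffoscope's actual implementation. The helper names are made up, and for simplicity both archives hold one member with the same assumed name, `data.txt`.

```python
import difflib
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"


def make_tar_gz(name, content):
    """Build an in-memory .tar.gz holding one file (demo input only)."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as archive:
        info = tarfile.TarInfo(name)
        info.size = len(content)
        archive.addfile(info, io.BytesIO(content))
    return gzip.compress(raw.getvalue())


def tar_members(data):
    """Return {name: bytes} for regular files if data is a tar, else None."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
            return {m.name: archive.extractfile(m).read()
                    for m in archive.getmembers() if m.isfile()}
    except tarfile.ReadError:
        return None


def compare(left, right, label, depth=0):
    """Return indented report lines, recursing into gzip and tar layers."""
    if left == right:
        return []
    report = ["  " * depth + label]
    if left[:2] == GZIP_MAGIC and right[:2] == GZIP_MAGIC:
        return report + compare(gzip.decompress(left), gzip.decompress(right),
                                "gzip content", depth + 1)
    files_l, files_r = tar_members(left), tar_members(right)
    if files_l is not None and files_r is not None:
        for name in sorted(files_l.keys() | files_r.keys()):
            report += compare(files_l.get(name, b""), files_r.get(name, b""),
                              name, depth + 1)
        return report
    # Innermost layer: fall back to a plain text diff.
    diff = difflib.unified_diff(left.decode(errors="replace").splitlines(),
                                right.decode(errors="replace").splitlines(),
                                lineterm="")
    return report + ["  " * (depth + 1) + line for line in diff]


report = compare(make_tar_gz("data.txt", b"foo\n"),
                 make_tar_gz("data.txt", b"bar\n"),
                 "a.tar.gz vs b.tar.gz")
print("\n".join(report))
```

Each level of indentation in the printed report corresponds to one container that was opened, which is the same shape as the hierarchy on the slide.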
-
Not Synced
That's pretty cool.
-
Not Synced
And it's completely recursive, I think
it will actually blow out after, I think,
-
Not Synced
1000 [levels].
-
Not Synced
[light is turned down for the audience
to see the slides]
-
Not Synced
I'll just bump back a bit, just in case.
-
Not Synced
[Applause]
-
Not Synced
Thank you.
-
Not Synced
So that's the a and b files.
-
Not Synced
We've tarred them up and so I see
the hierarchy of foo and bar file layer.
-
Not Synced
I've gzipped them, so this is a gzip layer.
-
Not Synced
Here's the tar layer and then there's
the files themselves.
-
Not Synced
This is from a real .deb from the archive.
-
Not Synced
Inside this .deb, there's a data.tar.xz
and in that xz file there's a data.tar
-
Not Synced
and inside that tar file, there's a file
called aff and inside that
-
Not Synced
there's a version string that is different.
-
Not Synced
And that looks like a build date so we
probably know that if we went back
-
Not Synced
to the source package, we could very
quickly work out,
-
Not Synced
with a very quick grep, work out
where this file is being generated from,
-
Not Synced
the de_DE.aff file and then ???
probably quite obvious
-
Not Synced
that it's using the current build time
and then we can just patch that, fix it etc.
-
Not Synced
This has gone from two rather obscure
binary .debs all the way to the fix
-
Not Synced
probably in about 5 minutes, and you can
probably send the patch in that time
-
Not Synced
because it'd be quite quick.
-
Not Synced
Without diffoscope here, without this sort
of recursive unpacking,
-
Not Synced
you'd be just completely lost, you'd be
there with "ar x" all day
-
Not Synced
and working out which files are different
and trying to use xxd
-
Not Synced
and this kind of nonsense.
-
Not Synced
diffoscope's got some other things as well
-
Not Synced
if you try to do reproducible packages
and things are varying just on
-
Not Synced
the line ordering, we detect whether
a file differs only in the line ordering.
-
Not Synced
So, here's file "a", "These lines are in
order".
-
Not Synced
File "b" has "These order are in lines".
-
Not Synced
It's very difficult to say, actually,
it's like one of these tongue twisters.
-
Not Synced
Run diffoscope on these two and it says
it's got ordering differences only.
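
One simple way to express that check in Python is shown below. It is a sketch of the idea rather than diffoscope's own code, the function name is made up, and it assumes the slide's files hold one word per line.

```python
def differs_only_in_line_order(first, second):
    """True when both texts hold the same lines but in a different order."""
    lines_a, lines_b = first.splitlines(), second.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)


file_a = "These\nlines\nare\nin\norder\n"
file_b = "These\norder\nare\nin\nlines\n"
print(differs_only_in_line_order(file_a, file_b))
```

Sorting both sides before comparing is also a hint at the usual fix: sort the output in the build step that generates the file.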
-
Not Synced
That's interesting, so you probably need
to sort,
-
Not Synced
you go all the way back to the source code,
work out very quickly,
-
Not Synced
if you know it's just ordering differences
you just kind of know
-
Not Synced
what the output's gonna be, you can
search for order in ???
-
Not Synced
and you get the right files,
-
Not Synced
??? sort in the right place,
BAM, send it patch of (???),
-
Not Synced
everything is great.
-
Not Synced
Oh, and send it to upstream as well
because you're good.
-
Not Synced
It supports a lot more things.
-
Not Synced
We've been showing the terminal
text output here.
-
Not Synced
It's got an HTML output mode, which is
really useful in the hierarchical thing
-
Not Synced
when it gets a bit more complicated.
-
Not Synced
Instead of being laid on top of each other
like a unified diff,
-
Not Synced
you get the diff on the left and the right
and you get sort of a nested
-
Not Synced
thing inside with colors and lines and
you can link this and various things in it
-
Not Synced
including bits of metadata here, other
bits here, what command you used.
-
Not Synced
That's the HTML output.
-
Not Synced
We also support a lot of file formats,
it's not just on text,
-
Not Synced
it's about all of these, so let's quickly
run through some of them.
-
Not Synced
You give it two Android apk files which
are kind of like zips, but magic.
-
Not Synced
It'll know how to compare them.
-
Not Synced
There's like a Manifest file that needs
decoding.
-
Not Synced
It supports Berkeley DB databases,
-
Not Synced
Word documents, that's a Word document
with "a" and that's a Word document with "b"
-
Not Synced
and it'll correctly do that.
-
Not Synced
If you run that through diff normally,
that ??? be a binary mess,
-
Not Synced
so completely useless.
-
Not Synced
E-books, there's epub, it also supports
mobi.
-
Not Synced
So if you give it two epub files, it'll say
"They just differ in this date".
-
Not Synced
Brilliant.
-
Not Synced
Normally that will be completely useless
diff binary ???
-
Not Synced
So you can be like "epub date, ok", grep
the source code for that,
-
Not Synced
make a patch really quickly.
-
Not Synced
Mono binaries, git repositories, why not?
-
Not Synced
Gnumeric spreadsheets, ISO images.
-
Not Synced
Oh yeah, ISO images is really cool.
-
Not Synced
So, it'll basically unpack the ISO, then
inside that there might be a squashfs image
-
Not Synced
then it'll completely go down to that and
work out any differences
-
Not Synced
between the two contents in the ISO file,
including any metadata.
-
Not Synced
This is on the squashfs metadata headers,
I think.
-
Not Synced
But say inside that ISO, there was a file
that was a pdf, and inside that pdf was
-
Not Synced
a ??? which varied,
-
Not Synced
it will basically go all the way down
and say "yeah, it's actually here,
-
Not Synced
in this ??? that the data differs."
-
Not Synced
And that means you can just go again
all the way back to the source
-
Not Synced
and say "ok, cool, we know how to fix
this quite quickly"
-
Not Synced
And this is really valuable in getting
the recent Tails distribution reproducible
-
Not Synced
so their ISOs are reproducible.
-
Not Synced
If you build one and I build one, we get
the exact same one
-
Not Synced
and that's kind of useful for something
like Tails where you would probably want to
-
Not Synced
of all, there's a lot of projects that you
might want to compromise,
-
Not Synced
you might want to go after that one,
because of the kind of people that are using it.
-
Not Synced
We support comparing images, so this is
using ???
-
Not Synced
and then just running that through diff.
-
Not Synced
That is a Linux penguin and that is
something else,
-
Not Synced
I can't remember now. Oh, FT.
-
Not Synced
It supports images.
-
Not Synced
It supports JSON and pretty print,
so if you give it two JSON files
-
Not Synced
one with key/value… it'll do a nice
diff of them.
-
Not Synced
It will pretty print it first, before
doing the diff, so it'll actually give you
-
Not Synced
something clean, otherwise I don't know
if you've ever diffed
-
Not Synced
two very long JSON lines, if they differ
in the middle, you just get
-
Not Synced
a huge long unified diff, but here it's
like "oh, just ??? things have changed"
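
The pretty-print-then-diff idea looks roughly like this in Python. It is a sketch with the standard library only. Sorting the keys is a choice made for this example and not a claim about what diffoscope does, and the sample JSON documents are invented.

```python
import difflib
import json


def pretty_json_diff(first, second):
    """Unified diff of two JSON documents after normalising their layout."""
    def pretty(text):
        return json.dumps(json.loads(text), indent=4, sort_keys=True).splitlines()
    return list(difflib.unified_diff(pretty(first), pretty(second),
                                     "a.json", "b.json", lineterm=""))


# Two one-line documents that differ in a single value.
doc_a = '{"name": "diffoscope", "level": 1, "tags": ["a", "b"]}'
doc_b = '{"name": "diffoscope", "level": 2, "tags": ["a", "b"]}'
print("\n".join(pretty_json_diff(doc_a, doc_b)))
```

Because each key ends up on its own line, only the changed key shows up as a removed and an added line instead of one huge changed line.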
-
Not Synced
OpenDocument text formats,
Ogg audio files, because why not.
-
Not Synced
tcpdump capture files, that's actually
quite useful.
-
Not Synced
PDFs. That PDF says "Hello World" and
this PDF says "Hello sick sad world",
-
Not Synced
I don't know why. ???
in the demo.
-
Not Synced
Again, run that through normal diff
program… garbage.
-
Not Synced
XML documents. Again, it'll pretty print
them so it's nice, actually nice to read.
-
Not Synced
If you want to get started on diffoscope,
the very easiest and quickest way to do that is
-
Not Synced
fire up a web browser, try.diffoscope.org,
select your files, press Compare
-
Not Synced
and it'll upload them and run diffoscope
with all the support for all the file formats
-
Not Synced
in the cloud for you and give you a nice
HTML page that you can then link to people
-
Not Synced
So that's the very quickest way to get
started.
-
Not Synced
The next quickest way is to install
trydiffoscope and then you run that
-
Not Synced
on two files and it'll basically do
the same thing,
-
Not Synced
run it in the same cloud service as
try.diffoscope.org
-
Not Synced
but it'll give you the result on the
command line or
-
Not Synced
if you pass the webbrowser option, it will
give you a URL or load your web browser,
-
Not Synced
I can't remember exactly which, with
the same results.
-
Not Synced
This is 1kB of Python, nothing basically.
-
Not Synced
That's the next easiest way.
-
Not Synced
But you can then install diffoscope itself
on your own machine.
-
Not Synced
I recommend not installing recommends
because all of those file formats
-
Not Synced
might drag in extra things about
the whole of TeX,
-
Not Synced
I think the whole of OpenOffice, whole
of Mono, whole Java…
-
Not Synced
Android, yeah, quite big.
-
Not Synced
I think there's another big one I can't
think of.
-
Not Synced
They're all optional, and they all say
"By the way, I support TeX documents
-
Not Synced
or whatever, Mono, whatever.
-
Not Synced
But you need to install this package and
then you get full pretty printed support",
-
Not Synced
And it'll tell you that when it's missing.
-
Not Synced
So, if you just start with
--install-recommends disabled,
-
Not Synced
run it on your file, and if it says
"please install this package", you can then
-
Not Synced
install them as you go along, as you want
-
Not Synced
rather than installing everything.
-
Not Synced
And then ??? and then works as before
-
Not Synced
You can improve all your own quality
assurance and Debian packaging
-
Not Synced
with diffoscope.
-
Not Synced
The biggest value here is not
necessarily for reproducible builds.
-
Not Synced
It's for basically just seeing where you
do want to have a diff or are expecting a diff
-
Not Synced
and you are expecting a particular type
of diff in a particular way,
-
Not Synced
you can basically see those changes
-
Not Synced
And if you build two debs normally and
... I'll try to demo in a second
-
Not Synced
You build a deb with a patch applied you
can ??? see a diff on the source package
-
Not Synced
But that's not very useful because the
binaries are going to end up on
-
Not Synced
people's machines. But if you run a diff on
the binary itself, did that change and
-
Not Synced
really hit the binary, I think really ...
No..
-
Not Synced
I'll just run through a very live demo, of
course, so it's gonna fail ...
-
Not Synced
Checkout some .... We'll get this
libnetx-java
-
Not Synced
We just build that once
-
Not Synced
Let's say we are on the security team and
-
Not Synced
want to apply a patch, and we want to be
really sure because we are about to push it out
-
Not Synced
to all our users
-
Not Synced
First we will make a changelog
-
Not Synced
Closing a bug
-
Not Synced
Find some java file to change
-
Not Synced
Let's pretend we have a real patch
-
Not Synced
Let's replace that equals equals,
say that was the fix
-
Not Synced
So that's the patch from upstream
-
Not Synced
Upstream blast patch
-
Not Synced
When we build this what we wanna see is
just that change in the file
-
Not Synced
we don't wanna see any nonsense changes of
extended ??? but we also definitely want
-
Not Synced
to see that change, 'cause if our binaries,
for security reasons, don't have that change
-
Not Synced
then we aren't fixing people's machines,
they will issue a DSA ??? installed, saying
-
Not Synced
And you should do proper testing as well
at multiple levels
-
Not Synced
I will build that again
-
Not Synced
So we wanna diff the original one
-
Not Synced
We wanna diff that one with a fake
security one
-
Not Synced
You see on the progress bar 100%
1 - there are differences (there should be
-
Not Synced
differences)
Let's see what those differences are
-
Not Synced
in our web browser, it's a nice HTML output
-
Not Synced
Let's have a look.
Are we seeing what we wanna see?
-
Not Synced
There are some changes in the data tar, we
kind of expect that
-
Not Synced
What's changed in our control file?
Well, the version changed, we wanted that
-
Not Synced
to change. Perfect
-
Not Synced
And it's changed to ???
That's what we wanna see
-
Not Synced
No other changes here so there was no
weird control or in magic going on
-
Not Synced
In our data tar the color of the timestamp
changes, we will ignore these for now
-
Not Synced
The changelog has changed, well I hope so
because I have changed that entry
-
Not Synced
Here is where we're going to start seeing...
We are going to see the change in the
-
Not Synced
jar file which is the java class, java
compile archive format
-
Not Synced
We are seeing some meaningless timestamp
changes but we can ignore those
-
Not Synced
??? 'cause it's just metadata maybe
-
Not Synced
Ok part of a class, so if you can see here
it's basically a de-compilation of the
-
Not Synced
Java file itself and it's basically saying
"oh, it used to say ifnull and now ifnonnull"
-
Not Synced
So these are the actual Java
bytecode instructions and what's really
-
Not Synced
??? here
is that nothing else has changed
-
Not Synced
We were just expecting that change between
the two opcodes, ifnull and ifnonnull,
-
Not Synced
which is good 'cause it hasn't made
any other code changes, but also, crucially, we can
-
Not Synced
see that it has actually made a change
to the code.
-
Not Synced
For example it wasn't ???
This is really useful
-
Not Synced
And just running a naive diff wouldn't
give that of course, because it would just
-
Not Synced
come out with binary garbage
And just seeing the diff change again