I'm here today to talk to you about diffoscope and how you can use it as a better diff or for quality assurance, etc., things like that. Moin! Apparently that's like a North German thing to say "welcome". North German, north Denmark, Scandinavia, that kind of thing, I'm told. People are shaking their heads, so I'm going to assume that's true. This is my first PC, an IBM 5155. Sometimes, when you rebooted it, it would somehow revert from booting from the hard disk to booting from a BASIC ROM, as in the programming language ROM. It was on my motherboard for some reason. So, randomly, you'd just get a chance to program in BASIC and then, sometimes you wouldn't, I don't know why, but… yeah. It's quite fun with this kind of clicky keyboard, and that folded in and it was this kind of big desk thing. Anyway… This is my first Debian. At the time it was already old. What's this one? Is this Slink? 2.2? Yeah. And this is when we had US and non-US, so that's really dating if you remember that. This is my first contribution to Debian, 19th December 2006, sending a patch to lilypond, which is kind of interesting, and the response was "Oh yeah, rock on, many thanks. I'll upload this and it'll be landing to Etch". And this was super motivating because Etch was just coming out and it was like "Great, I've got, like, one line of tiny patch in a release. This is super cool." Thomas' response was super motivating. So, after that, like that Christmas basically spent ??? Debian webpages and stuff. Very well timed. That's kind of a good… You know, someone sends a patch, be like "Cool, thanks". Like a little notice in the changelog. It was, you know, so stupid but… Yeah, do that kind of thing. So, moving on. Why diffoscope? Why did we write diffoscope? What's the background here? It comes from reproducible builds.
The very quick outline is that, while you can get the source code for free software, you download the source code for nginx or whatever, pretty much everyone just runs binaries on their servers or their systems. You know, "apt install bla", "yum install", whatever. Android Play Store, whatever. Can you actually trust whether these two things correspond with each other? You've gotten the source code, it looks alright, and then you install this binary, yeah… Who generated that? Can you trust that process? Can you trust who generated it? Even if you could trust them, could you trust them not to be exploited? Etc. This is a big problem because you can exploit a build farm and then obviously exploit all of that, you know, put a trojan into the build farm, so every single binary that comes out is compromised. Kind of problematic. You could also target individual developers' machines, so I could go off to, say, your machine, add a backdoor to it, so every binary that you give to friends and things like that is compromised in some way, stealing your bitcoins or whatever. I can also ??? and blackmail you into producing software that has compromises or extra features, shall we say, that don't exist in the source code. So what will happen there is that you'd release your source and the binaries you produce have this sort of backdoor that, you know, someone is forcing you into producing. So, you don't want to do that. Anyway, enough of that. What you do for reproducible builds is you ensure that every time you build a piece of software, you get an identical result. Multiple people then compare their builds and check whether they all get the same results, and this means that an attacker must either have infected everyone at the same time, or they haven't infected anyone. The point here is that you have to ensure that builds have identical results. Ok, great. So, we started the reproducible builds project, etc. And we build 2 debs. Oh, I'm sorry about the colors there.
You probably can't see that. That says "sha1sum a.deb b.deb". Anyway, we're comparing the sha1sums of 2 binary Debian files. So, these two files differ. Ok, they're not reproducible. Why is that? So we run a diff on them. Yeah… So, what can we learn from this? Well, not very much. Visibly they're compressed, so as soon as we see one change, we'll see the changes just cascade, because that's how compression works. I guess we know it's a deb ??? format file, not very useful. Ok, great, so we're gonna have a look in… We'll do a binary diff and ok, well… Again, that's not really telling us very much with the diff there. Ok, great. ??? "ar x" is on the new maintainer thing, "how you unpack a deb". Everyone remembers this, right? You unpack a.deb with "ar x" and you do that to b.deb and then we diff the results of that. Ok, so… yeah, 7zip. Ok, compressed content, not very useful. Ok, so let's unpack the control.tar inside these debs. And then we run diff on that. Still not really telling us anything useful about how to make this package reproducible. So let's unpack the tar.xz into the tar. Inside that tar, there's a file called md5sums and we start to see some differences between some files in these two debs. ??? meaningful, so now we have some idea that it has something to do with this usr/bin/pmixer binary. Ok, interesting. We'll unzip that and then we do a diff on pmixer itself. Now we're back into just binary ??? mode. This isn't very helpful and this is taking quite a while and, if I remember correctly, Debian has a lot of packages. So this might take a little while. So, basically, ??? meme, I should build a better diff. That's not quite true, this is actually… It was Lunar that started this project and it was called debbindiff, because we wanted to diff binary Debian packages. So this is the initial commit, 2014. "The version is successfully able to report differences in two .changes files. Not with much interesting details, but it's a start." And it was a start.
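The manual procedure above (compare checksums, unpack one layer, compare again) is mechanical enough to sketch in code. What follows is only an illustration of that idea in standard-library Python, not how debbindiff or diffoscope is implemented: the names `differing_paths`, `members` and `is_tar` are made up for this example, it hashes with SHA-256 rather than the sha1sum shown on the slide, it only descends into tar archives (plain, gzip, bzip2 or xz), and it does not understand the ar container of a .deb.

```python
import hashlib
import io
import tarfile

TOP = "<top level>"  # label used when the difference is in the outermost blob


def sha256(data: bytes) -> str:
    """Hex digest used to decide quickly whether two blobs are identical."""
    return hashlib.sha256(data).hexdigest()


def is_tar(data: bytes) -> bool:
    """True if the bytes open as a tar archive (plain, gzip, bzip2 or xz)."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*"):
            return True
    except (tarfile.TarError, OSError, EOFError):
        return False


def members(data: bytes) -> dict:
    """Map each regular file inside a tar archive to its contents."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def differing_paths(a: bytes, b: bytes, prefix: str = "") -> list:
    """Return the nested paths at which two blobs differ."""
    label = prefix or TOP
    if sha256(a) == sha256(b):
        return []  # identical, nothing to report
    if not (is_tar(a) and is_tar(b)):
        return [label]  # opaque content: report it and stop descending
    left, right = members(a), members(b)
    found = []
    for name in sorted(left.keys() | right.keys()):
        path = f"{prefix}/{name}" if prefix else name
        if name not in left or name not in right:
            found.append(path)  # present on one side only
        else:
            found.extend(differing_paths(left[name], right[name], path))
    if not found:
        # Same members but different archive bytes, e.g. timestamps.
        found.append(f"{label} (archive metadata)")
    return found
```

Called on the bytes of two tarballs, this returns paths such as `data.tar/usr/bin/pmixer`, which is the piece of information the manual walk-through took several steps to reach. It stops at the first opaque file instead of producing a readable diff of its contents.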
Fast forwarding… Oh, sorry about these colors, I don't know if we can do anything about the lights? Yeah? No? Alright, well… Basically, we're diffoscoping on… It works kind of like diff does normally: you give it two files, it outputs a unified diff. So "diffoscope a b", one file contains the word "foo", one contains the word "bar". Nothing actually out of the ordinary. It's sort of colored by default, so that's why you can't see it, but whatever. It supports archive formats, so if we then tar up our "a" file and our "b" file into an a.tar and a b.tar and then run diffoscope on those tar files, we get this kind of, like, hierarchy here. So it's saying that there are differences between these files: in the file list they have different timestamps, because I made them at different times, and here are the contents, so we've got "foo" there and "bar" there. So we can see the difference between them. Well, I can, I don't know if you can, you get the slide there. If we gzip these tar files and then run diffoscope on those gzip things, it'll say "ok, what we've done is unpack it first, and here's the metadata about the gzip process", and inside that are a.tar and b.tar from the previous slides. And then the "a" file and the "b" file. So, it's really going two levels deep into this tar.gz file. That's pretty cool. And it's completely recursive, I think it will actually blow out after, I think, 1000 [levels]. [light is turned down for the audience to see the slides] I'll just bump back a bit, just in case. [Applause] Thank you. So that's the a and b files. We've tarred them up, and so I see the hierarchy of the foo and bar file layer. I've gzipped them, so this is a gzip layer. Here's the tar layer and then there's the files themselves. This is from a real .deb from the archive.
Inside this .deb, there's a data.tar.xz and in that xz file there's a data.tar and inside that tar file, there's a file called aff and inside that there's a version string that is different. And that looks like a build date, so we probably know that if we went back to the source package, we could very quickly work out, with a very quick grep, where this file is being generated from, the de_DE.aff file, and then ??? probably quite obvious that it's using the current build time and then we can just patch that, fix it, etc. This has gone from two rather obscure binary .debs all the way to the fix probably in about 5 minutes, and you can probably send the patch in that time because it'd be quite quick. Without diffoscope here, without this sort of recursive unpacking, you'd be just completely lost, you'd be there with "ar x" all day, working out which files are different and trying to use xxd and this kind of nonsense. diffoscope's got some other things as well. If you try to do reproducible packages and things are varying just on the line ordering, we detect whether a file differs only in the line ordering. So, here's file "a", "These lines are in order". File "b" has "These order are in lines". It's very difficult to say, actually, it's like one of these tongue twisters. Run diffoscope on these two and it says it's got ordering differences only. That's interesting, so you probably need to sort. You go all the way back to the source code, work out very quickly, if you know it's just ordering differences you just kind of know what the output's gonna be, you can search for order in ??? and you get the right files, ??? sort in the right place, BAM, send it patch of (???), everything is great. Oh, and send it to upstream as well because you're good. It supports a lot more things. We've been showing the terminal text output here. It's got an HTML output mode, which is really useful in the hierarchical thing when it gets a bit more complicated.
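Going back to the ordering-only check described above: one simple way to express "these two files contain the same lines, just in a different order" is to compare the lines as multisets. This is a minimal sketch under that assumption, with a made-up function name; it is not taken from diffoscope's source and says nothing about how diffoscope decides this internally.

```python
from collections import Counter


def only_line_order_differs(a: str, b: str) -> bool:
    """True when a and b hold the same lines (with the same counts) in a different order."""
    left, right = a.splitlines(), b.splitlines()
    # Different sequences, but identical multisets of lines.
    return left != right and Counter(left) == Counter(right)


file_a = "These\nlines\nare\nin\norder\n"
file_b = "These\norder\nare\nin\nlines\n"
print(only_line_order_differs(file_a, file_b))
```

Using `Counter` rather than `set` keeps duplicate lines significant, so a file that merely repeats a line is not mistaken for a reordering.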
Instead of being laid on top of each other like a unified diff, you get the diff on the left and the right and you get sort of a nested thing inside with colors and lines, and you can link this and various things in it, including bits of metadata here, other bits here, what command you used. That's the HTML output. We also support a lot of file formats, it's not just on text, it's about all of these, so let's quickly run through some of them. You give it two Android apk files, which are kind of like zips, but magic. It'll know how to compare them. There's like a Manifest file that needs decoding. It supports Berkeley DB databases, Word documents, that's a Word document with "a" and that's a Word document with "b" and it'll correctly do that. If you run that through diff normally, that ??? be a binary mess, so completely useless. E-books, there's epub, it also supports mobi. So if you give it two epub files, it'll say "They just differ in this date". Brilliant. Normally that will be completely useless diff binary ???. So you can be like "epub date, ok", grep the source code for that, make a patch really quickly. Mono binaries, git repositories, why not? Gnumeric spreadsheets, ISO images. Oh yeah, ISO images is really cool. So, it'll basically unpack the ISO, then inside that there might be a squashfs image, then it'll completely go down to that and work out any differences between the two contents in the ISO file, including any metadata. This is on the squashfs metadata headers, I think. But say inside that ISO there was a file that was a PDF, and inside that PDF was a ??? which varied, it will basically go all the way down and say "yeah, it's actually here, in this ??? that the data differs." And that means you can just go again all the way back to the source and say "ok, cool, we know how to fix this quite quickly". And this was really valuable in getting the recent Tails distribution reproducible, so their ISOs are reproducible.
If you build one and I build one, we get the exact same one, and that's kind of useful for something like Tails where, of all the projects that you might want to compromise, you might want to go after that one, because of the kind of people that are using it. We support comparing images, so this is using ??? and then just running that through diff. That is a Linux penguin and that is something else, I can't remember now. Oh, FT. It supports images. It supports JSON and pretty printing, so if you give it two JSON files, one with key/value… it'll do a nice diff of them. It will pretty print it first, before doing the diff, so it'll actually give you something clean. Otherwise, I don't know if you've ever diffed two very long JSON lines, if they differ in the middle, you just get a huge long unified diff, but here it's like "oh, just ??? things have changed". OpenDocument text formats, Ogg audio files, because why not. tcpdump capture files, that's actually quite useful. PDFs. That PDF says "Hello World" and this PDF says "Hello sick sad world", I don't know why. ??? in the demo. Again, run that through a normal diff program… garbage. XML documents. Again, it'll pretty print them so it's nice, actually nice to read. If you want to get started on diffoscope, the very easiest and quickest way is to fire up a web browser, go to try.diffoscope.org, select your files, press Compare, and it'll upload them and run diffoscope with all the support for all the file formats in the cloud for you and give you a nice HTML page that you can then link to people. So that's the very quickest way to get started.
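The pretty-print-then-diff idea mentioned above for JSON is easy to try with the Python standard library. This is a rough sketch of the general technique only, with a made-up function name and placeholder file labels; the indentation width is an arbitrary choice here and nothing in it is claimed to match diffoscope's own JSON handling.

```python
import difflib
import json


def pretty_json_diff(a: str, b: str) -> str:
    """Unified diff of two JSON documents after re-indenting both."""
    # One key or list item per line, so a small change touches few lines.
    left = json.dumps(json.loads(a), indent=4).splitlines()
    right = json.dumps(json.loads(b), indent=4).splitlines()
    diff = difflib.unified_diff(left, right, "a.json", "b.json", lineterm="")
    return "\n".join(diff)


one = '{"name": "demo", "version": 1, "deps": ["x", "y", "z"]}'
two = '{"name": "demo", "version": 2, "deps": ["x", "y", "z"]}'
print(pretty_json_diff(one, two))
```

Diffing the two single-line inputs directly would mark the whole line as changed; after re-indenting, only the `"version"` line shows up as removed and added.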
The next quickest way is to install trydiffoscope and then you run that on two files and it'll basically do the same thing, run it in the same cloud service as try.diffoscope.org, but it'll give you the result on the command line or, if you pass the webbrowser option, it will give you a URL or load your web browser, I can't remember exactly which, with the same results. This is 1kB of Python, nothing basically. That's the next easiest way. But you can then install diffoscope itself on your own machine. I recommend not installing recommends, because all of those file formats might drag in extra things: about the whole of TeX, I think the whole of OpenOffice, the whole of Mono, the whole of Java… Android, yeah, quite big. I think there's another big one I can't think of. They're all optional, and they all say "By the way, I support TeX documents or whatever, Mono, whatever. But you need to install this package and then you get full pretty printed support". And it'll tell you that when it's missing. So, if you just start with --install-recommends disabled, run it on your file, and if it says "please install this package", you can then install them as you go along, as you want, rather than installing everything. And then ??? and then it works as before. You can improve all your own quality assurance and Debian packaging with diffoscope. The biggest value here is not necessarily for reproducible builds. It's for basically just seeing, where you do want to have a diff, or you're expecting a diff, and you're expecting a particular type of diff in a particular way, you can basically see those changes. And if you build two debs normally and… I'll try to demo in a second. You build a deb with a patch applied, you can ??? see a diff on the source package. But that's not very useful because the binaries are going to end up on people's machines. But if you run a diff on the binary itself, did that change really hit the binary, I think really… No…
I'll just run through a very live demo, of course, so it's gonna fail… Check out some… We'll get this libnetx-java. We just build that once. Let's say we are on the security team and want to apply a patch, and we want to be really sure because we are to push it out to all our ???. First we will make a change in log