I'm here today to talk to you about diffoscope and how you can use it as a better diff, for quality assurance, and things like that. Moin! Apparently that's a North German way to say "welcome". North Germany, northern Denmark, Scandinavia, that kind of area, I'm told. People are shaking their heads, so I'm going to assume that's true.

This is my first PC, an IBM 5155. Sometimes, when you rebooted it, it would somehow revert from booting from the hard disk to booting from a BASIC ROM, as in the programming language. It was on the motherboard for some reason. So, randomly, you'd get a chance to program in BASIC, and then sometimes you wouldn't. I don't know why, but… yeah. It was quite fun, with this kind of clicky keyboard that folded in, and it was this kind of big desk thing. Anyway…

This is my first Debian. At the time it was already old. What's this one? Is this Slink? 2.2? Yeah. And this is when we had US and non-US, so that's really dating it, if you remember that. This is my first contribution to Debian, 19th December 2006, sending a patch to lilypond, which is kind of interesting, and the response was "Oh yeah, rock on, many thanks. I'll upload this and it'll be landing in Etch". And this was super motivating because Etch was just coming out and it was like "Great, I've got a tiny one-line patch in a release. This is super cool." Thomas' response was super motivating. So, after that, that Christmas was basically spent ??? Debian webpages and stuff. Very well timed. That's kind of a good… You know, when someone sends a patch, be like "Cool, thanks", with a little notice in the changelog. It was, you know, so stupid but… Yeah, do that kind of thing.

So, moving on. Why diffoscope? Why did we write diffoscope? What's the background here? It comes from reproducible builds.
The very quick outline is that, although you can get the source code for free software, you can download the source code for nginx or whatever, pretty much everyone just runs binaries on their servers or their systems. You know, "apt install bla", "yum install", whatever. Android Play Store, whatever. Can you actually trust that these two things correspond with each other? You've got the source code, it looks all right, and then you install this binary, yeah… Who generated that? Can you trust that process? Can you trust who generated it? Even if you could trust them, could you trust them not to be exploited? Etc.

This is a big problem because you can exploit a build farm, you know, put a trojan into the build farm, so every single binary that comes out is compromised. Kind of problematic. You could also target individual developers' machines, so I could go off to, say, your machine and add a backdoor to it, so every binary that you give to friends and things like that is compromised in some way, stealing your bitcoins or whatever. I can also turn up at your door and blackmail you into producing software that has compromises, or extra features, shall we say, that don't exist in the source code. So what would happen there is that you'd release your source and the binaries you produce have this sort of backdoor that, you know, someone is forcing you into producing. So, you don't want that. Anyway, enough of that.

What you do for reproducible builds is you ensure that every time you build a piece of software, you get an identical result. Multiple people then compare their builds and check whether they all get the same results, and this means that an attacker must either have infected everyone at the same time, or they haven't infected anyone. The point here is that you have to ensure that builds have identical results. OK, great. So, we started the Reproducible Builds project, etc. And we build two debs. Oh, I'm sorry about the colors there.
You probably can't see that. That says "sha1sum a.deb b.deb". Anyway, we're comparing the SHA-1 sums of two binary Debian packages. So, these two files differ. OK, they're not reproducible. Why is that? So we run diff on them. Yeah… So, what can we learn from this? Well, not very much. Evidently they're compressed, so as soon as there is one change, we just see changes cascade, because that's how compression works. I guess we know it's a deb, probably an ar-format file; not very useful. OK, great, so we're going to have a look in… We'll do a binary diff and, OK, well… Again, that's not really telling us very much. OK, great.

??? one level in. "ar x" is from the new maintainer thing, "how you unpack a deb". Everyone remembers this, right? You unpack a.deb with "ar x" and you do that to b.deb and then we diff the results of that. OK, so… yeah, 7zip. OK, compressed content, not very useful. OK, so let's unpack the control.tar inside these debs. And then we run diff on that. Still not really telling us anything useful about how to make this package reproducible. So let's unpack the tar.xz into the tar. Inside that tar, there's a file called md5sums and we start to see some differences between some files in these two debs. ??? meaningful, so now we have some idea that it has something to do with this usr/bin/pmixer binary. OK, interesting. We'll unzip that and then we do a diff on pmixer itself. Now we're back in binary gobbledygook mode. This isn't very helpful and this is taking quite a while and, if I remember correctly, Debian has a lot of packages. So this might take a little while.

So, basically, ??? mean I should build a better diff. That's not quite true, this is actually… It was Lunar who started this project, and it was called debbindiff, because we wanted to diff binary Debian packages. So this is the initial commit, 2014. "The version is successfully able to report differences in two .changes files.
Not with much interesting details, but it's a start." And it was a start. Fast forwarding… Oh, sorry about these colors, I don't know if we can do anything about the lights? Yeah? No? All right, whatever…

Basically, running diffoscope… It works kind of like diff does normally: you give it two files, it outputs a unified diff. So "diffoscope a b": one file contains the word "foo", one contains the word "bar". Nothing out of the ordinary. It's colored by default, so that's why you can't see it, but whatever.

It supports archive formats, so if we tar up our "a" file and our "b" file into a.tar and b.tar and then run diffoscope on those tar files, we get this kind of hierarchy here. So it's saying that there are differences between these files: in the file list they have different timestamps, because I made them at different times, and here are the contents, so we've got "foo" there and "bar" there. So we can see the difference between them. Well, I can; I don't know if you can, you get the slide there. If we gzip these tar files and then run diffoscope on those gzip files, it'll say "OK, what we've done is unpack it first, and here's the metadata about the gzip process", and inside that are a.tar and b.tar from the previous slides. And then the "a" file and the "b" file. So, it's really going two levels deep into this tar.gz file. That's pretty cool. And it's completely recursive; I think it will actually blow out after, I think, 1000 [levels].

[light is turned down for the audience to see the slides] I'll just bump back a bit, just in case. [Applause] Thank you. So that's the a and b files. We've tarred them up, and so you see the hierarchy of the foo and bar file layer. I've gzipped them, so this is a gzip layer. Here's the tar layer and then there are the files themselves. This is from a real .deb from the archive.
Inside this .deb, there's a data.tar.xz, and in that xz file there's a data.tar, and inside that tar file there's an aff file, and inside that there's a version string that is different. And that looks like a build date, so we probably know that if we went back to the source package, we could very quickly, with a quick grep, work out where this file is being generated from, the de_DE.aff file, and then ??? probably quite obvious that it's using the current build time, and then we can just patch that, fix it, etc. This has gone from two rather obscure binary .debs all the way to the fix in probably about 5 minutes, and you can probably send the patch in that time because it'd be quite quick. Without diffoscope here, without this sort of recursive unpacking, you'd be completely lost; you'd be there with "ar x" all day, working out which files are different and trying to use xxd and this kind of nonsense.

diffoscope's got some other things as well. If you're trying to do reproducible packages and things are varying just in the line ordering, we detect whether a file differs only in the line ordering. So, here's file "a", "These lines are in order". File "b" has "These order are in lines". It's very difficult to say, actually, it's like one of these tongue twisters. Run diffoscope on these two and it says there are ordering differences only. That's interesting, so you probably need to sort. You go all the way back to the source code and work it out very quickly; if you know it's just ordering differences, you kind of know what the output's going to be, you can search for order in ??? and you get the right files, you add a sort in the right place, BAM! Send the patch off, everything is great. Oh, and send it upstream as well, because you're good.

It supports a lot more things. We've been showing the terminal text output here. It's got an HTML output mode, which is really useful for the hierarchical view when things get a bit more complicated.
Instead of being laid on top of each other like a unified diff, you get the diff on the left and the right, and you get a sort of nested view inside with colors and lines, and you can link to this and to various things in it, including bits of metadata here, other bits here, and what command was used. That's the HTML output.

We also support a lot of file formats. It's not just for text, it's for all of these, so let's quickly run through some of them. You give it two Android apk files, which are kind of like zips, but magic. It'll know how to compare them. There's a Manifest file that needs decoding. It supports Berkeley DB databases, and Word documents: that's a Word document with "a" and that's a Word document with "b", and it'll correctly compare them. If you run that through diff normally, that ??? be a binary mess, so completely useless. E-books: there's epub, and it also supports mobi. So if you give it two epub files, it'll say "They just differ in this date". Brilliant. Normally that would be a completely useless binary diff ??? So you can be like "epub date, OK", grep the source code for that, and make a patch really quickly. Mono binaries, git repositories, why not? Gnumeric spreadsheets, ISO images.

Oh yeah, ISO images are really cool. So, it'll basically unpack the ISO, then inside that there might be a squashfs image, and it'll go all the way down into that and work out any differences between the contents of the two ISO files, including any metadata. This is the squashfs metadata headers, I think. But say inside that ISO there was a file that was a PDF, and inside that PDF was a ??? which varied; it will basically go all the way down and say "yeah, it's actually here, in this ???, that the data differs." And that means you can again go all the way back to the source and say "OK, cool, we know how to fix this quite quickly". And this was really valuable in getting the recent Tails distribution reproducible, so their ISOs are reproducible.
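The layer-by-layer descent described above (ISO, then squashfs, then the files inside; or earlier, tar.gz, then tar, then the files) can be sketched in a few lines. This is only an illustration under stated assumptions, not diffoscope's actual code: it handles just gzip and uncompressed tar via the Python standard library, it ignores archive metadata such as timestamps, and the names `compare`, `_tar_members` and `make_tgz` are made up for this example.

```python
import difflib
import gzip
import io
import tarfile


def _tar_members(data: bytes):
    """Return {member name: content} for regular files, or None if data is not a tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            return {
                member.name: tar.extractfile(member).read()
                for member in tar.getmembers()
                if member.isfile()
            }
    except tarfile.TarError:
        return None


def compare(name: str, a: bytes, b: bytes, depth: int = 0) -> list:
    """Recursively compare two blobs, unpacking known containers first."""
    if a == b:
        return []
    indent = "  " * depth
    report = [f"{indent}{name} differs"]

    # gzip layer: both sides start with the gzip magic bytes.
    if a[:2] == b"\x1f\x8b" and b[:2] == b"\x1f\x8b":
        inner = compare(name + " (gunzipped)", gzip.decompress(a),
                        gzip.decompress(b), depth + 1)
        return report + inner

    # tar layer: compare member by member (a missing member counts as empty).
    members_a, members_b = _tar_members(a), _tar_members(b)
    if members_a is not None and members_b is not None:
        for member in sorted(set(members_a) | set(members_b)):
            report += compare(member, members_a.get(member, b""),
                              members_b.get(member, b""), depth + 1)
        return report

    # Leaf: fall back to an ordinary line-based unified diff.
    lines_a = a.decode("utf-8", "replace").splitlines()
    lines_b = b.decode("utf-8", "replace").splitlines()
    for line in difflib.unified_diff(lines_a, lines_b, lineterm="", n=0):
        report.append(f"{indent}  {line}")
    return report


def make_tgz(filename: str, content: bytes) -> bytes:
    """Hypothetical helper: build an in-memory .tar.gz holding one file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(filename)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return gzip.compress(buffer.getvalue())


if __name__ == "__main__":
    left = make_tgz("greeting", b"foo\n")
    right = make_tgz("greeting", b"bar\n")
    print("\n".join(compare("test.tar.gz", left, right)))
```

The point of the sketch is the shape of the recursion: identify the container, unpack it, and only run a plain diff once nothing is left to unpack, so the report points at the innermost file that actually differs.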
If you build one and I build one, we get the exact same one, and that's kind of useful for something like Tails where, of all the projects that you might want to compromise, you might want to go after that one, because of the kind of people that are using it.

We support comparing images, so this is using ??? and then just running that through diff. That is a Linux penguin and that is something else, I can't remember now. Oh, FT. It supports images. It supports JSON with pretty printing, so if you give it two JSON files, one with key/value… it'll do a nice diff of them. It will pretty print them first, before doing the diff, so it'll actually give you something clean. Otherwise, I don't know if you've ever diffed two very long JSON lines: if they differ in the middle, you just get a huge long unified diff, but here it's like "oh, just ??? things have changed". OpenDocument text formats, Ogg audio files, because why not. tcpdump capture files, that's actually quite useful. PDFs. That PDF says "Hello World" and this PDF says "Hello sick sad world"; I don't know why that particular text is in the demo. Again, run that through a normal diff program… garbage. XML documents. Again, it'll pretty print them, so it's actually nice to read.

If you want to get started with diffoscope, the very easiest and quickest way is to fire up a web browser, go to try.diffoscope.org, select your files and press Compare, and it'll upload them and run diffoscope, with support for all the file formats, in the cloud for you, and give you a nice HTML page that you can then link people to. So that's the very quickest way to get started.
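Going back to the JSON comparison mentioned above, the "pretty print first, then diff" idea is easy to try with the Python standard library. This is a minimal sketch and not diffoscope's implementation; the function name `pretty_json_diff` and the labels `a.json` and `b.json` are assumptions made for the example. Keys are deliberately left in their original order, on the assumption that key ordering is itself something you would want to notice when chasing reproducibility problems.

```python
import difflib
import json


def pretty_json_diff(a_text: str, b_text: str) -> list:
    """Unified diff of two JSON documents after normalising their layout."""

    def pretty(text: str) -> list:
        # Re-serialise with one element per line; key order is preserved.
        return json.dumps(json.loads(text), indent=4).splitlines()

    return list(
        difflib.unified_diff(
            pretty(a_text), pretty(b_text), "a.json", "b.json", lineterm=""
        )
    )


if __name__ == "__main__":
    a = '{"name": "demo", "version": 1, "tags": ["x", "y"]}'
    b = '{"name": "demo", "version": 2, "tags": ["x", "y"]}'
    print("\n".join(pretty_json_diff(a, b)))
```

Run on two single-line documents that differ in one value, the diff touches only the line holding that value rather than the whole document.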
The next quickest way is to install trydiffoscope, and then you run that on two files and it'll basically do the same thing, running it in the same cloud service as try.diffoscope.org, but it'll give you the result on the command line, or, if you pass the web browser option, it will give you a URL or load your web browser, I can't remember exactly which, with the same results. This is 1 kB of Python, basically nothing. That's the next easiest way.

But you can then install diffoscope itself on your own machine. I recommend not installing the recommends, because all of those file formats might drag in extra things like the whole of TeX, I think the whole of OpenOffice, the whole of Mono, the whole of Java… Android, yeah, quite big. I think there's another big one I can't think of. They're all optional, and they all say "By the way, I support TeX documents or whatever, Mono, whatever. But you need to install this package and then you get full pretty-printed support". And it'll tell you when something is missing. So, if you just start with recommends disabled (--no-install-recommends) and run it on your files, and it says "please install this package", you can then install them as you go along, as you want, rather than installing everything. And then you just pass ??? files and it works as before.

How can you improve your own quality assurance and Debian packaging with diffoscope? The biggest value here is not necessarily for reproducible builds. It's basically for cases where you do want a diff, where you are expecting a particular type of diff in a particular place, and you can see exactly those changes. And if you build two debs normally and... I'll try to demo this in a second. You build a deb without a patch applied and then build a deb with the patch applied. You can ??? run a diff on the source package. But that's not very useful, because the binaries are what end up on people's machines. But if you run a diff on the binary itself: did my change actually hit the binary?
I think really... No. I'll just run through a very live demo, so of course it's going to fail... Check out some... We'll get this libnetx-java. We just build that once. Let's say we are on the security team and want to apply a patch, and we want to be really sure, because we are about to push it out to all our users. First we will make a changelog entry, closing a bug. Find some Java file to change. Let's pretend we have a real patch. Let's replace that equals-equals; say that was the fix. So that's the patch from upstream. Upstream blast patch.

When we build this, what we want to see is just that change in the file. We don't want to see any nonsense changes of extended dump, but we also definitely want to see that change, because if the binary we ship for security reasons doesn't have that change, then we aren't fixing people's machines; they will issue a DSA ??? installed ??? And you should do proper testing as well, at multiple levels. I will build that again.

So we want to diff the original one, 0.5, with the fake security one. You see the progress bar, 100%, 1: there are differences (there should be differences). Let's see what those differences are in our web browser; it's a nice HTML output. Let's have a look. Are we seeing what we want to see? There are some changes in the data tar; we kind of expect that. What's changed in our control file? Well, the version changed; we wanted that to change. Perfect. And it's changed to ???
That's what we want to see. No other changes here, so there was no weird control magic going on. In our data tar, the timestamps change; we will ignore those for now. The changelog has changed; well, I hope so, because I changed that entry. Here is where we're going to start seeing… We are going to see the change in the jar file, which is the Java class, the compiled Java archive format. We are seeing some meaningless timestamp changes, but we can ignore those, let's pretend, because it's just metadata, maybe.

OK, part of a class. So, if you can see here, it's basically a decompilation of the Java class file itself, and it's basically saying "oh, it used to say ifnull and now it says ifnonnull". So these are the actual Java bytecode instructions. And what is really ??? here is that nothing else has changed. We were just expecting that change between the two opcodes, ifnull to ifnonnull, which is good, because it hasn't made any other code changes, but also, crucially, we can see that it has actually made a change to the code. For example, it wasn't using some cached version or something like that. This is really useful. And just running a naive diff wouldn't give you that, of course, because it would just come out as binary garbage. And just seeing that the diff had changed, again, ??? have told you anything, because all of it would have changed as well. So it's like, well, yes, it's different. The meaningful change there is what actually fixes the flaw ??? but we know it's there. That's kind of ???
Shipping this deb out, I'd be quite confident that this fixed the actual bug. I'd be quite confident pushing that out because it's a very minimal amount of changes, and you want that for security reasons. So this was the live demo.

The other case is seeing no changes at all. If your build is reproducible, you can build once, change your compiler or change some other part of your toolchain, and build it again, and if you got the exact same results, well, great, that's what you intended. You want to see no changes when you change some part of it. And that is really useful: if there were changes, diffoscope will highlight them and show exactly why they changed, maybe some compiler optimizations, maybe some other things as well. So you can use it in both ways, when you expect changes and when you don't expect changes, and if those don't match your expectations, diffoscope will tell you exactly why.

It's all ??? when other companies are doing security releases, naming no names whatsoever, but they like to release patches as, you know, just a new firmware for your router. These are very large file system images, and you basically have no idea what changed between the two files; again, if you run them through diff, it's completely useless. You can start to unpack them with squashfs and blah blah blah. But they're probably some sort of concatenated cpio archives, so that's nonsense. But diffoscope would just chew through those and tell you what the differences between these two files actually are, and say they changed this, or they've removed or added some GPL-licensed code, or something kind of interesting. So it's very useful for diffing those kinds of binary blobs that come from various people.

So, the current state of diffoscope. The development is up and down. It started around May 2014, something like that. A bunch of work here, then it's idle, I think. These are just the DebConfs, basically. Anyway, it's going up and down, it's kind of interesting ???
a lot of reproducible builds projects, of course, so every time we do a build on the ??? Reproducible Builds testing framework, we run diffoscope on the result. If it's reproducible it just says "hey, the files are the same". But if not, we publish the diffoscope output for all your packages that are unreproducible, so you can just go there and see what the difference is between these two things.

I invested a lot of work optimizing diffoscope, ??? rather perverse n-squared loops inside it. So I managed to cut down some of the time here, cut it down here. There have been quite a few performance enhancements over the past... these are the git tags, this is version 80 and this is version 50. I just ran the same benchmark across them all. So this shows when I introduced some rather stupid code, embarrassing, but whatever ???

There's work being done right now on parallel processing. There have been quite a few attempts before, but adding it is kind of interesting and difficult. Luckily we have an outreach student, Liliana. Is she in the room? Is she hiding? She's here, and she'll be talking tomorrow about her work on parallel processing in diffoscope, and that will be amazing, because a lot of it is I/O bound or waiting for external processes. With multi-CPU machines, while I'm sitting waiting for a PDF to be unpacked, I may as well be running on another CPU. I think we are going to see some real performance wins as we get that parallel processing merged and working and ???

You can check out our website, diffoscope.org. We recently migrated to Salsa... yeeaahhh. And everything that's reproducible is now on Salsa, it's kind of cool. That's quite recent... ??? Thank you very much, danke schön. Have you got any questions? About diffoscope? Thank you very much! [Applause]

Q: A buzzword question: can you diff container image formats?
A: Depends which ones.
So if they are just directories, then yes, because it's just a directory. Do you have one particularly in mind? Like Docker? Yes, there's Docker and then there's OCI, I believe, which is the standard one. And that would make it buzzword compliant. Ah, OK, we're all about buzzwords. Probably diffoscope blockchain as well. And then run diffoscope on containers and see the difference between updates of your container images. BAM... solved. Where do I invest? I wasn't aware that OCI... that's how it's called? No, it doesn't support that right now. But it wouldn't be too difficult, presuming there are tools to unpack it; as soon as we have a tool to unpack it, it can then just go into that. There is an open wishlist bug for Docker containers, to the point where I think it would be really nice if you could just give it, say, two image names, or whatever the noun is. So you can say "please diff these two Docker images that are available" and it can look at your local store and do a diff on them. Currently it's not supported, but there is an open wishlist bug.

Q: Shouldn't any company that releases binaries be interested in supporting diffoscope and using it?
A1: Basically, when companies release binaries they are not interested in users seeing differences...
A2: Yes, I'm surprised that the Docker bug was actually only opened two months ago and there hasn't been more interest in diffing container images, but if you'd like to open one for OCI, that would be very much appreciated, and we can get on to that, that would be great. I was looking at the page for OCI; it says it's based on Docker, basically, so you'd get OCI for free once you sort it out for Docker, if you're lucky. The OCI image format, they based it on Docker images. OK, we will sort that out, and it seems like we're using Docker more and more in Debian. Any other questions?

Q: Out of curiosity, which ??? are you using inside? Are you using some bio-informatics algorithm to diff trees efficiently?
A: No, it's really naive. All it does is run normal diff, the normal diff tools, but it will try to identify files and unpack them first. So it uses the file utility, the identifier thing, that says it's a PDF, and tries to unpack it first; it doesn't do any clever matching. The clever matching that it does do is fuzzy matching, so if you just rename a directory between the two, inside a container, it will say "yeah, there's a massive fuzzy match between these two files", and things like that. So that's kind of useful, but apart from that it's not clever, which is kind of what you want, because if it were too clever it would start to be a little opaque... I personally like dumb tools.

Q: So one question to you is whether, if you want to do a release to stable or something like that, where you can be asked for the debdiff, I'm wondering if anyone... I mean, I remember doing that myself. I've been submitting diffoscope output as well, because it's just more readable and useful. So I'm not sure if anyone has any objection to people asking for those. I'll propose that to the release team and see what they say.

Thank you very much. Are there any other questions? No further questions? Then let's thank Chris again! [Applause]
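As a footnote to the fuzzy-matching answer above: the idea of pairing up a file that was renamed between two trees can be sketched with the Python standard library. This is not a description of diffoscope's internals. The sketch swaps in `difflib.SequenceMatcher` as a simple similarity measure, and the function name `match_renamed` and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def match_renamed(old: dict, new: dict, threshold: float = 0.8) -> list:
    """Pair files that exist on only one side by similarity of their content.

    `old` and `new` map file names to content. Returns a list of
    (old_name, new_name, ratio) tuples for pairs at or above the threshold.
    """
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    matches = []
    for old_name in only_old:
        best_name, best_ratio = None, 0.0
        for new_name in only_new:
            ratio = SequenceMatcher(None, old[old_name], new[new_name]).ratio()
            if ratio > best_ratio:
                best_name, best_ratio = new_name, ratio
        if best_name is not None and best_ratio >= threshold:
            matches.append((old_name, best_name, best_ratio))
            only_new.remove(best_name)  # each new file is paired at most once
    return matches


if __name__ == "__main__":
    before = {"docs/readme.txt": "hello world\nsecond line\n", "bin/tool": "xyz"}
    after = {"doc/readme.txt": "hello world\nsecond line!\n", "bin/tool": "xyz"}
    for old_name, new_name, ratio in match_renamed(before, after):
        print(f"{old_name} -> {new_name} (similarity {ratio:.2f})")
```

Once two names are paired like this, they can be diffed against each other instead of being reported as one deleted file and one unrelated new file.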