I'm here today to talk to you about diffoscope and how you can use it as a better diff, for quality assurance, and things like that.
Moin! Apparently that's a North German way of saying "welcome". North Germany, northern Denmark, Scandinavia, that kind of area, I'm told. People are shaking their heads, so I'm going to assume that's true.
This is my first PC, an IBM 5155. Sometimes, when you rebooted it, it would somehow revert from booting from the hard disk to booting from a BASIC ROM, as in the programming language. It was on my motherboard for some reason. So, randomly, you'd get a chance to program in BASIC, and sometimes you wouldn't. I don't know why, but… yeah. It was quite fun, with this kind of clicky keyboard that folded in, and it was this kind of big desk thing. Anyway…
This is my first Debian. At the time it was already old. What's this one? Is this Slink? 2.2? Yeah. And this is when we had US and non-US, so that's really dating it, if you remember that.
This is my first contribution to Debian, 19th December 2006, sending a patch to lilypond, which is kind of interesting, and the response was "Oh yeah, rock on, many thanks. I'll upload this and it'll be landing in Etch". And this was super motivating because Etch was just coming out and it was like "Great, I've got a tiny one-line patch in a release. This is super cool." Thomas' response was super motivating. So, after that, that Christmas was basically spent ??? Debian webpages and stuff. Very well timed. That's kind of a good… You know, when someone sends a patch, be like "Cool, thanks". Like a little notice in the changelog. It was, you know, so stupid but… Yeah, do that kind of thing.
So, moving on. Why diffoscope? Why did we write diffoscope? What's the background here? It comes from reproducible builds.
The very quick outline is this: while you can get the source code for free software, you download the source code for nginx or whatever, pretty much everyone just runs binaries on their servers or their systems. You know, "apt install bla", "yum install", whatever. Android Play Store, whatever. Can you actually trust that these two things correspond to each other? You've got the source code, it looks alright, and then you install this binary, yeah… Who generated that? Can you trust that process? Can you trust who generated it? Even if you could trust them, could you trust them not to have been exploited? Etc.
This is a big problem because someone can exploit a build farm, you know, get a trojan into the build farm, so that every single binary that comes out is compromised. Kind of problematic. You could also target individual developers' machines, so I could go off to, say, your machine and add a backdoor to it, so that every binary you give to friends and so on is compromised in some way, stealing your bitcoins or whatever. I can also ??? and blackmail you into producing software that has compromises, or extra features, shall we say, that don't exist in the source code. What would happen there is that you'd release your source, and the binaries you produce would have this sort of backdoor that someone is forcing you to include. So, you don't want that. Anyway, enough of that.
What you do for reproducible builds is you ensure that every time you build a piece of software, you get an identical result. Multiple people then compare their builds and check whether they all get the same result, and this means that an attacker must either have infected everyone at the same time, or they haven't infected anyone. The point here is that you have to ensure that builds have identical results.
Ok, great. So, we started the reproducible builds project, etc. And we build two debs. Oh, I'm sorry about the colors there.
You probably can't see that. That says "sha1sum a.deb b.deb". Anyway, we're comparing the sha1sums of two Debian binary packages. So, these two files differ. Ok, they're not reproducible. Why is that? So we run a diff on them. Yeah… So, what can we learn from this? Well, not very much. Evidently they're compressed, so as soon as there's one change we just see cascading changes, because that's how compression works. I guess we know it's a deb ??? format file, not very useful.
Ok, great, so we're going to have a look in… We'll do a binary diff and, ok, well… Again, the diff there isn't really telling us very much. Ok, great. ??? "ar x" is in the new maintainer thing, "how you unpack a deb". Everyone remembers this, right? You unpack a.deb with "ar x", you do the same to b.deb, and then we diff the results of that. Ok, so… yeah, 7zip. Ok, compressed content, not very useful.
Ok, so let's unpack the control.tar inside these debs. And then we run diff on that. It's still not really telling us anything useful about how to make this package reproducible. So let's unpack the tar.xz into the tar. Inside that tar, there's a file called md5sums and we start to see some differences between some files in these two debs. ??? meaningful, so now we have some idea that it has something to do with this usr/bin/pmixer binary. Ok, interesting. We'll unzip that and then we do a diff on pmixer itself. Now we're back into just binary ??? mode.
This isn't very helpful, and this is taking quite a while, and if I remember correctly, Debian has a lot of packages. So this might take a little while. So, basically, ??? meme I should build a better diff.
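The manual routine just described (compare, unpack one layer, compare again) can be automated in a few lines. What follows is only a rough sketch of that idea and not how diffoscope itself is implemented. The function names (read_ar, read_tar, unpack, compare) are made up for this illustration, and it assumes the plain ar layout used by .deb files plus tar archives that Python's standard tarfile module can open.

```python
import io
import tarfile

AR_MAGIC = b"!<arch>\n"


def read_ar(data):
    """Return {member name: bytes} for a plain ar archive, as used by .deb."""
    members = {}
    pos = len(AR_MAGIC)
    while pos + 60 <= len(data):
        header = data[pos:pos + 60]
        # Bytes 0-15 hold the name, bytes 48-57 the size in decimal.
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        pos += 60
        members[name] = data[pos:pos + size]
        pos += size + (size % 2)  # member data is padded to an even length
    return members


def read_tar(data):
    """Return {member name: bytes} for regular files in a tar (xz, gz, ...)."""
    members = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for info in tar:
            if info.isfile():
                members[info.name] = tar.extractfile(info).read()
    return members


def unpack(data):
    """Return the members of a container, or None if data is a leaf file."""
    if data.startswith(AR_MAGIC):
        return read_ar(data)
    try:
        return read_tar(data)
    except (tarfile.TarError, EOFError, OSError):
        return None


def compare(a, b, path="deb"):
    """Return the paths of the innermost members that differ between a and b."""
    if a == b:
        return []
    inner_a, inner_b = unpack(a), unpack(b)
    if inner_a is None or inner_b is None:
        return [path]  # cannot unpack further: this is where the change is
    found = []
    for name in sorted(inner_a.keys() | inner_b.keys()):
        child = f"{path}/{name}"
        if name not in inner_a or name not in inner_b:
            found.append(child + " (present on one side only)")
        else:
            found.extend(compare(inner_a[name], inner_b[name], child))
    # Same members but different bytes: archive metadata or compression changed.
    return found or [path + " (container metadata)"]
```

With two real packages on disk you would read both files as bytes and call compare(a_bytes, b_bytes, "a.deb vs b.deb"). The result is a list of paths such as the pmixer binary, which is the point where the manual walk ran out of steam too: this sketch only tells you where the difference is, not what it means.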