1 00:00:07,466 --> 00:00:10,156 I'm here today to talk to you about diffoscope 2 00:00:10,156 --> 00:00:13,190 and how you can use it as a better diff 3 00:00:14,063 --> 00:00:16,166 or for Quality Assurance, etc., things like that. 4 00:00:19,789 --> 00:00:20,810 Moin! 5 00:00:20,815 --> 00:00:24,409 Apparently that's like a North German thing to say "welcome". 6 00:00:25,938 --> 00:00:29,898 North German, north Denmark, Scandinavia, that kind of thing, I'm told. 7 00:00:31,836 --> 00:00:34,197 People are shaking their head, so I'm going to assume that's true. 8 00:00:37,306 --> 00:00:40,425 This is my first PC, an IBM 5155. 9 00:00:41,623 --> 00:00:46,441 Sometimes, when you rebooted it, it would launch into, it would somehow revert 10 00:00:46,688 --> 00:00:50,971 from booting from the hard disk to booting from a BASIC ROM, 11 00:00:51,359 --> 00:00:52,959 as in the programming language ROM. 12 00:00:53,017 --> 00:00:54,320 It was on my motherboard for some reason. 13 00:00:54,912 --> 00:00:57,691 So, randomly, you'd just get a chance to program in BASIC and then, 14 00:00:57,957 --> 00:01:00,456 sometimes you wouldn't, I don't know why, but… yeah. 15 00:01:00,718 --> 00:01:05,173 It's quite fun with this kind of clicky keyboard, and that folded in 16 00:01:05,519 --> 00:01:07,058 and it was this kind of big desk thing. 17 00:01:07,058 --> 00:01:08,014 Anyway… 18 00:01:09,067 --> 00:01:10,187 This is my first Debian. 19 00:01:10,500 --> 00:01:11,837 At the time it was already old. 20 00:01:12,890 --> 00:01:15,908 What's this one? Is this Slink? 2.2? Yeah. 21 00:01:17,077 --> 00:01:22,043 And this is when we had US and non-US, so that's really dating if you remember that. 22 00:01:23,522 --> 00:01:28,393 This is my first contribution to Debian, 19th December 2006, 23 00:01:28,803 --> 00:01:33,738 sending a patch to lilypond which is kind of interesting 24 00:01:34,155 --> 00:01:37,205 and the response was "Oh yeah, rock on, many thanks.
I'll upload this and 25 00:01:37,440 --> 00:01:38,723 it'll be landing in Etch". 26 00:01:39,007 --> 00:01:43,408 And this was super motivating because Etch was just coming out and it was like 27 00:01:43,602 --> 00:01:48,732 "Great, I've got, like, one line of tiny patch in a release. This is super cool." 28 00:01:49,118 --> 00:01:52,687 Thomas' response was super motivating. 29 00:01:52,993 --> 00:01:56,450 So, after that, like that Christmas basically spent ??? 30 00:01:56,675 --> 00:01:59,754 Debian webpages and stuff. 31 00:02:00,327 --> 00:02:01,568 Very well timed. 32 00:02:02,234 --> 00:02:03,566 That's kind of a good… 33 00:02:04,301 --> 00:02:07,379 You know, someone sends a patch, be like "Cool, thanks" 34 00:02:07,849 --> 00:02:09,434 Like a little notice in the changelog. 35 00:02:09,807 --> 00:02:14,344 It was, you know, so stupid but… Yeah, do that kind of thing. 36 00:02:15,558 --> 00:02:17,249 So, moving on. 37 00:02:17,641 --> 00:02:20,276 Why diffoscope? Why did we write diffoscope? 38 00:02:20,552 --> 00:02:21,880 What's the background here? 39 00:02:22,184 --> 00:02:24,575 It comes from reproducible builds. 40 00:02:24,911 --> 00:02:28,983 The very quick outline is that once you get the source code for free software, 41 00:02:29,208 --> 00:02:31,505 you download the source code for nginx or whatever, 42 00:02:31,998 --> 00:02:35,844 pretty much everyone just runs binaries on their servers or their systems. 43 00:02:36,110 --> 00:02:39,119 You know, "apt install bla", "yum install", whatever. 44 00:02:40,531 --> 00:02:41,535 Android Play Store, whatever. 45 00:02:42,479 --> 00:02:46,176 Can you actually trust whether these two things correspond with each other? 46 00:02:46,470 --> 00:02:49,926 You've gotten the source code, it looks alright, and then you install this binary, 47 00:02:50,847 --> 00:02:51,821 yeah… 48 00:02:52,459 --> 00:02:55,861 Who generated that? Can you trust that process?
49 00:02:56,275 --> 00:02:57,430 Can you trust who generated it? 50 00:02:58,351 --> 00:03:01,493 Even if you could trust them, could you trust them not to be exploited? Etc. 51 00:03:02,295 --> 00:03:04,765 This is a big problem because you can exploit a build farm and then 52 00:03:05,160 --> 00:03:09,895 obviously exploit all of that, you know, put a trojan into the build farm, 53 00:03:10,097 --> 00:03:13,290 so every single binary that comes out is compromised. 54 00:03:13,708 --> 00:03:14,792 Kind of problematic. 55 00:03:15,060 --> 00:03:17,686 You could also target individual developers' machines, 56 00:03:17,937 --> 00:03:21,288 so I could go off to, say, your machine, add a backdoor to it, 57 00:03:21,578 --> 00:03:25,241 so every binary that you give to friends and things like that 58 00:03:26,935 --> 00:03:30,485 is compromised in some way, stealing your bitcoins or whatever. 59 00:03:31,802 --> 00:03:36,127 I can also turn up at your door and blackmail you into producing 60 00:03:38,522 --> 00:03:42,997 software that has compromises or extra features, shall we say, 61 00:03:43,472 --> 00:03:44,783 that don't exist in the source code. 62 00:03:45,133 --> 00:03:47,885 So what will happen there is that you'd release your source 63 00:03:48,093 --> 00:03:51,968 and the binaries you produce have this sort of backdoor that, you know, 64 00:03:52,435 --> 00:03:55,127 someone is forcing you into producing. 65 00:03:55,464 --> 00:03:56,679 So, you don't want to do that. 66 00:03:56,856 --> 00:03:57,505 Anyway 67 00:03:58,197 --> 00:03:59,228 enough of that. 68 00:03:59,228 --> 00:04:03,211 What you do for reproducible builds is you ensure that every time you build 69 00:04:03,467 --> 00:04:05,773 a piece of software, you get an identical result.
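A minimal sketch of what "identical result" means in practice: two independently built artifacts count as the same only if their bytes, and therefore their cryptographic digests, match. The function names and the choice of SHA-256 are assumptions made for illustration; the talk itself compares sha1sums on the command line.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(path_a, path_b):
    """True only if both build artifacts are bit-for-bit identical."""
    return file_digest(path_a) == file_digest(path_b)
```

A mismatch only says that the builds differ, not where or why, which is the gap the rest of the talk is about.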
70 00:04:06,916 --> 00:04:10,885 Multiple people then compare their builds and check whether they all get 71 00:04:07,074 --> 00:04:11,068 the same results 72 00:04:11,068 --> 00:04:15,626 and this means that an attacker must either have infected everyone 73 00:04:15,626 --> 00:04:17,726 at the same time, or they haven't infected anyone. 74 00:04:20,673 --> 00:04:24,058 The point here is that you have to ensure that builds have identical results. 75 00:04:24,173 --> 00:04:25,163 Ok, great. 76 00:04:28,003 --> 00:04:32,539 So, we started the reproducible builds project, etc. 77 00:04:33,470 --> 00:04:34,744 And we build 2 debs. 78 00:04:35,112 --> 00:04:36,537 Oh, I'm sorry about the colors there. 79 00:04:38,067 --> 00:04:38,965 You probably can't see that. 80 00:04:39,349 --> 00:04:42,485 That says "sha1sum a.deb b.deb". 81 00:04:46,128 --> 00:04:50,775 Anyway, we're comparing the sha1sums of 2 binary Debian files. 82 00:04:51,424 --> 00:04:53,922 So, these two files differ. 83 00:04:54,222 --> 00:04:55,612 Ok, they're not reproducible. 84 00:04:56,807 --> 00:04:57,527 Why is that? 85 00:04:57,873 --> 00:04:59,656 So we run a diff on them. 86 00:05:00,140 --> 00:05:00,637 Yeah… 87 00:05:01,340 --> 00:05:04,093 So, what can we learn from this? 88 00:05:04,418 --> 00:05:08,508 Well, not very much, visibly they're compressed so 89 00:05:08,947 --> 00:05:13,012 as soon as we see one change, we'll just see cascading changes 90 00:05:13,362 --> 00:05:14,866 because that's how compression works. 91 00:05:16,241 --> 00:05:23,983 I guess we know it's a deb, probably an ar format file, not very useful. 92 00:05:24,193 --> 00:05:26,005 Ok, great so we're gonna have a look in 93 00:05:26,492 --> 00:05:29,919 We'll do a binary diff and ok, well… 94 00:05:30,923 --> 00:05:32,790 Again, that's not really telling us very much 95 00:05:34,413 --> 00:05:36,515 with the diff there. 96 00:05:37,206 --> 00:05:38,426 Ok, great. 97 00:05:39,417 --> 00:05:40,427 ???
one level in 98 00:05:40,513 --> 00:05:44,834 "ar x" is in the new maintainer thing, "how you unpack a deb" 99 00:05:44,858 --> 00:05:46,215 Everyone remembers this, right? 100 00:05:48,196 --> 00:05:51,167 You unpack a.deb with "ar x" and you do that to b.deb 101 00:05:51,599 --> 00:05:53,606 and then we diff the results of that. 102 00:05:54,099 --> 00:05:57,824 Ok, so…yeah, 7zip. 103 00:05:58,948 --> 00:06:01,329 Ok, compressed content, not very useful. 104 00:06:01,897 --> 00:06:07,898 Ok, so let's unpack the control.tar inside these debs. 105 00:06:08,725 --> 00:06:10,145 And then we run diff on that. 106 00:06:12,693 --> 00:06:16,850 Still not really telling us anything useful about how to make this package reproducible. 107 00:06:17,487 --> 00:06:20,345 So let's unpack the tar.xz into the tar. 108 00:06:22,463 --> 00:06:28,348 Inside that tar, there's a file called md5sums and we start to see some differences 109 00:06:28,768 --> 00:06:33,370 between some files in these two debs. 110 00:06:33,640 --> 00:06:36,527 ??? meaningful, so now we have some idea that 111 00:06:36,855 --> 00:06:39,101 it has something to do with this usr/bin/pmixer binary. 112 00:06:39,682 --> 00:06:40,653 Ok, interesting. 113 00:06:41,989 --> 00:06:45,015 We'll unzip that and then we do a diff on pmixer itself. 114 00:06:45,914 --> 00:06:48,600 Now we're back into just binary "gobbledygook" mode. 115 00:06:49,002 --> 00:06:51,736 This isn't very helpful and this is taking quite a while 116 00:06:52,399 --> 00:06:54,663 and if I remember correctly, Debian has a lot of packages. 117 00:06:55,182 --> 00:06:56,784 So this might take a little while. 118 00:06:57,601 --> 00:07:00,415 So, basically, ??? mean 119 00:07:00,782 --> 00:07:02,008 I should build a better diff.
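The manual loop walked through above (unpack one layer, diff, unpack the next layer, diff again) is mechanical enough to sketch in code. What follows is only an illustration built on Python's standard library, not diffoscope's implementation: it handles gzip, xz and tar layers only (no ar, so not a whole .deb), ignores archive metadata such as timestamps, and the names `read_tar` and `compare` are hypothetical.

```python
import difflib
import gzip
import io
import lzma
import tarfile

# Magic bytes of the compressed formats this sketch knows how to peel off.
DECOMPRESSORS = {
    b"\x1f\x8b": ("gzip", gzip.decompress),
    b"\xfd7zXZ\x00": ("xz", lzma.decompress),
}


def read_tar(data):
    """Return {member name: bytes} for regular files, or None if data is not a tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            return {m.name: tar.extractfile(m).read()
                    for m in tar.getmembers() if m.isfile()}
    except tarfile.ReadError:
        return None


def compare(name, a, b, depth=0):
    """Return indented report lines describing where two byte strings differ."""
    pad = "  " * depth
    if a == b:
        return []
    report = [f"{pad}{name} differs"]
    # Compressed layer: decompress both sides and recurse one level deeper.
    for magic, (label, decompress) in DECOMPRESSORS.items():
        if a.startswith(magic) and b.startswith(magic):
            inner = f"{name} ({label} content)"
            return report + compare(inner, decompress(a), decompress(b), depth + 1)
    # Archive layer: pair up members by name and recurse into each pair.
    tar_a, tar_b = read_tar(a), read_tar(b)
    if tar_a is not None and tar_b is not None:
        for member in sorted(set(tar_a) | set(tar_b)):
            if member not in tar_a or member not in tar_b:
                report.append(f"{pad}  {member} only in one archive")
            else:
                report += compare(member, tar_a[member], tar_b[member], depth + 1)
        return report
    # Leaf: show a unified diff for text, a plain note for binary data.
    try:
        lines_a = a.decode("utf-8").splitlines()
        lines_b = b.decode("utf-8").splitlines()
    except UnicodeDecodeError:
        return report + [f"{pad}  (binary content differs)"]
    diff = difflib.unified_diff(lines_a, lines_b, "a", "b", lineterm="")
    return report + [f"{pad}  {line}" for line in diff]
```

Calling `compare("pkg.tar.gz", bytes_a, bytes_b)` on two gzipped tarballs that each hold one text file returns a nested report ending in an ordinary unified diff of that file.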
120 00:07:03,703 --> 00:07:05,194 That's not quite true, this is actually… 121 00:07:05,783 --> 00:07:07,472 It was Lunar that started this project 122 00:07:07,801 --> 00:07:10,670 and it was called debbindiff, because we wanted to diff 123 00:07:11,093 --> 00:07:12,264 binary Debian packages. 124 00:07:13,474 --> 00:07:15,040 So this is the initial commit, 2014. 125 00:07:16,962 --> 00:07:20,100 "The version is successfully able to report differences in two .changes files. 126 00:07:20,100 --> 00:07:22,343 Not with much interesting details, but it's a start." 127 00:07:22,762 --> 00:07:23,806 And it was a start. 128 00:07:27,581 --> 00:07:29,918 Fast forwarding… Oh, sorry about these colors, 129 00:07:30,307 --> 00:07:31,872 I don't know if we can do anything about the lights? 130 00:07:34,713 --> 00:07:35,363 Yeah? 131 00:07:37,830 --> 00:07:38,080 No? 132 00:07:42,124 --> 00:07:42,974 All right, whatever… 133 00:07:43,700 --> 00:07:46,410 Basically, we're diffoscoping on… 134 00:07:47,546 --> 00:07:49,595 It works kind of like diff does normally, 135 00:07:49,981 --> 00:07:51,995 you give it two files, it outputs a unified diff. 136 00:07:52,699 --> 00:07:59,427 So "diffoscope a b", one file contains the word "foo", one contains the word "bar". 137 00:08:01,241 --> 00:08:03,340 Nothing actually out of the ordinary. 138 00:08:03,974 --> 00:08:07,670 It's sort of colored by default, so that's why you can't see it, but whatever. 139 00:08:10,432 --> 00:08:14,667 It supports archive formats, so if you give it two tar files, 140 00:08:15,413 --> 00:08:22,263 if we then tar up our "a" file and our "b" file into a.tar and b.tar 141 00:08:23,206 --> 00:08:25,374 and then run diffoscope on those tar files 142 00:08:26,197 --> 00:08:28,395 we get this kind of, like, hierarchy here.
143 00:08:28,742 --> 00:08:32,006 So it's saying that there are differences between these files, 144 00:08:32,513 --> 00:08:37,735 in the file list they have different time stamps, because I made them 145 00:08:38,161 --> 00:08:39,535 at different times, 146 00:08:39,848 --> 00:08:42,575 and here are the contents, so we got "foo" there and "bar" there. 147 00:08:43,296 --> 00:08:44,781 So we can see the difference between them. 148 00:08:45,566 --> 00:08:48,373 Well, I can, I don't know if you can, you get the slide there. 149 00:08:49,311 --> 00:08:53,551 If we gzip these tar files and then run diffoscope on those gzip things, 150 00:08:53,888 --> 00:08:59,230 it'll say "ok, what we've done is unpack it first, and here's the metadata 151 00:08:59,622 --> 00:09:01,653 about the gzip process", 152 00:09:02,107 --> 00:09:05,941 and inside that are a.tar and b.tar from the previous slides. 153 00:09:07,673 --> 00:09:09,085 And then the "a" file and the "b" file. 154 00:09:09,365 --> 00:09:15,303 So, it's really going two levels deep into this tar.gz file. 155 00:09:16,162 --> 00:09:17,042 That's pretty cool. 156 00:09:17,291 --> 00:09:20,772 And it's completely recursive, I think it will actually blow out after, I think, 157 00:09:20,993 --> 00:09:21,697 1000 [levels]. 158 00:09:23,119 --> 00:09:25,233 [light is turned down for the audience to see the slides] 159 00:09:30,195 --> 00:09:32,065 I'll just bump back a bit, just in case. 160 00:09:35,203 --> 00:09:37,055 [Applause] 161 00:09:37,806 --> 00:09:38,662 Thank you. 162 00:09:39,907 --> 00:09:43,462 So that's the a and b files. 163 00:09:43,884 --> 00:09:48,077 We've tarred them up and so I see the hierarchy of foo and bar file layer. 164 00:09:48,472 --> 00:09:52,012 I've gzipped them, so this is a gzip layer. 165 00:09:52,399 --> 00:09:54,661 Here's the tar layer and then there's the files themselves. 166 00:09:57,315 --> 00:09:59,252 This is from a real .deb from the archive.
167 00:10:00,637 --> 00:10:06,542 Inside this .deb, there's a data.tar.xz and in that xz file there's a data.tar 168 00:10:07,294 --> 00:10:11,081 and inside that tar file, there's a file called aff and inside that 169 00:10:11,648 --> 00:10:13,892 there's a version string that is different. 170 00:10:14,174 --> 00:10:17,527 And that looks like a build date so we probably know that if we went back 171 00:10:17,753 --> 00:10:22,748 to the source package, we could very quickly work out, 172 00:10:22,922 --> 00:10:26,582 with a very quick grep, work out where this file is being generated from, 173 00:10:26,582 --> 00:10:31,536 the de_DE.aff file and then ??? probably quite obvious 174 00:10:32,285 --> 00:10:37,311 that it's using the current build time and then we can just patch that, fix it etc. 175 00:10:38,362 --> 00:10:45,681 This has gone from two rather obscure binary .debs all the way to the fix 176 00:10:46,040 --> 00:10:51,683 probably in about 5 minutes, and you can probably send the patch in that time 177 00:10:52,098 --> 00:10:53,086 because it'd be quite quick. 178 00:10:53,860 --> 00:10:57,482 Without diffoscope here, without this sort of recursive unpacking, 179 00:10:58,351 --> 00:11:03,380 you'd be just completely lost, you'd be there with "ar x" all day 180 00:11:03,762 --> 00:11:07,109 and working out which files are different and trying to use xxd 181 00:11:07,859 --> 00:11:09,410 and this kind of nonsense. 182 00:11:10,612 --> 00:11:12,875 diffoscope's got some other things as well, 183 00:11:13,277 --> 00:11:17,116 if you try to do reproducible packages and things are varying just on 184 00:11:17,381 --> 00:11:22,408 the line ordering, we detect whether a file differs only in the line ordering. 185 00:11:22,660 --> 00:11:26,178 So, here's file "a", "These lines are in order". 186 00:11:27,155 --> 00:11:30,108 File "b" has "These order are in lines".
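The ordering-only check being described here can be approximated in a couple of lines. This is a sketch of the idea rather than diffoscope's actual code, and the function name is hypothetical: two texts differ only in line ordering when they are not equal as written but become equal once their lines are sorted.

```python
def only_ordering_differs(text_a, text_b):
    """True if both texts hold the same lines, just in a different order."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)
```

With one word per line, the two example files from the slide would be reported as an ordering difference, which points at a missing sort somewhere in the build.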
187 00:11:30,630 --> 00:11:34,864 It's very difficult to say, actually, it's like one of these tongue twisters. 188 00:11:35,305 --> 00:11:38,862 Run diffoscope on these two and it says it's got ordering differences only. 189 00:11:39,210 --> 00:11:41,295 That's interesting, so you probably need to sort, 190 00:11:41,592 --> 00:11:45,076 you go all the way back to the source code, work out very quickly, 191 00:11:45,389 --> 00:11:48,381 if you know it's just ordering differences you just kind of know 192 00:11:48,672 --> 00:11:52,762 what the output's gonna be, you can search for order in ??? 193 00:11:53,166 --> 00:11:54,648 and you get the right files, 194 00:11:54,928 --> 00:11:57,803 add a sort in the right place, BAM! send the patch off, 195 00:11:57,889 --> 00:11:59,280 everything is great. 196 00:11:59,280 --> 00:12:02,720 Oh, and send it to upstream as well because you're good. 197 00:12:03,041 --> 00:12:04,707 It supports a lot more things. 198 00:12:05,509 --> 00:12:08,611 We've been showing the terminal text output here. 199 00:12:10,978 --> 00:12:15,950 It's got an HTML output mode, which is really useful in the hierarchical thing 200 00:12:16,139 --> 00:12:17,359 when it gets a bit more complicated. 201 00:12:19,397 --> 00:12:21,766 Instead of being laid on top of each other like a unified diff, 202 00:12:22,312 --> 00:12:26,811 you get the diff on the left and the right and you get sort of a nested 203 00:12:27,075 --> 00:12:32,372 thing inside with colors and lines and you can link this and various things in it 204 00:12:32,728 --> 00:12:37,547 including bits of metadata here, other bits here, what command you used. 205 00:12:38,951 --> 00:12:40,392 That's the HTML output. 206 00:12:40,659 --> 00:12:43,960 We also support a lot of file formats, it's not just on text, 207 00:12:45,635 --> 00:12:48,958 it's about all of these, so let's quickly run through some of them.
208 00:12:49,298 --> 00:12:54,503 You give it two Android apk files which are kind of like zips, but magic. 209 00:12:55,163 --> 00:12:58,211 It'll know how to compare them. 210 00:12:58,570 --> 00:13:01,026 There's like a Manifest file that needs decoding. 211 00:13:01,617 --> 00:13:03,761 It supports Berkeley DB databases, 212 00:13:04,098 --> 00:13:08,247 Word documents, that's a Word document with "a" and that's a Word document with "b" 213 00:13:08,715 --> 00:13:10,359 and it'll correctly do that. 214 00:13:10,583 --> 00:13:14,311 If you run that through diff normally, that ??? be a binary mess, 215 00:13:14,932 --> 00:13:16,188 so completely useless. 216 00:13:17,503 --> 00:13:20,118 E-books, there's epub, it also supports mobi. 217 00:13:20,563 --> 00:13:25,958 So if you give it two epub files, it'll say "They just differ in this date". 218 00:13:26,463 --> 00:13:27,350 Brilliant. 219 00:13:28,177 --> 00:13:30,557 Normally that will be completely useless diff binary ??? 220 00:13:30,794 --> 00:13:35,624 So you can be like "epub date, ok", grep the source code for that, 221 00:13:36,427 --> 00:13:38,350 make a patch really quickly. 222 00:13:39,594 --> 00:13:42,786 Mono binaries, git repositories, why not? 223 00:13:43,693 --> 00:13:46,222 Gnumeric spreadsheets, ISO images. 224 00:13:46,454 --> 00:13:47,883 Oh yeah, ISO images are really cool. 225 00:13:48,359 --> 00:13:55,044 So, it'll basically unpack the ISO, then inside that there might be a squashfs image 226 00:13:55,378 --> 00:14:01,549 then it'll completely go down to that and work out any differences 227 00:14:01,746 --> 00:14:06,065 between the two contents in the ISO file, including any metadata. 228 00:14:06,432 --> 00:14:10,607 This is on the squashfs metadata headers, I think. 229 00:14:11,634 --> 00:14:19,251 But say inside that ISO, there was a file that was a pdf, and inside that pdf was 230 00:14:19,572 --> 00:14:23,048 a ???
which varied, 231 00:14:23,285 --> 00:14:26,653 it will basically go all the way down and say "yeah, it's actually here, 232 00:14:26,909 --> 00:14:28,446 in this ??? that the data differs." 233 00:14:28,866 --> 00:14:32,355 And that means you can just go again all the way back to the source 234 00:14:32,646 --> 00:14:35,555 and say "ok, cool, we know how to fix this quite quickly" 235 00:14:36,076 --> 00:14:39,600 And this was really valuable in getting the recent Tails distribution reproducible 236 00:14:39,973 --> 00:14:43,387 so their ISOs are reproducible. 237 00:14:43,829 --> 00:14:46,873 If you build one and I build one, we get the exact same one 238 00:14:47,241 --> 00:14:51,389 and that's kind of useful for something like Tails where, you know, 239 00:14:51,828 --> 00:14:54,966 of all the projects that you might want to compromise, 240 00:14:55,450 --> 00:14:58,792 you might want to go after that one, because of the kind of people that are using it. 241 00:15:01,734 --> 00:15:10,009 We support comparing images, so this is using ??? 242 00:15:12,043 --> 00:15:13,714 and then just running that through diff. 243 00:15:16,092 --> 00:15:20,272 That is a Linux penguin and that is something else, 244 00:15:20,627 --> 00:15:23,629 I can't remember now. Oh, FT. 245 00:15:24,819 --> 00:15:25,801 It supports images. 246 00:15:27,044 --> 00:15:33,009 It supports JSON and pretty print, so if you give it two JSON files 247 00:15:33,485 --> 00:15:36,657 one with key/value… it'll do a nice diff of them. 248 00:15:38,042 --> 00:15:43,432 It will pretty print it first, before doing the diff, so it'll actually give you 249 00:15:43,634 --> 00:15:46,236 something clean, otherwise I don't know if you've ever diffed 250 00:15:46,978 --> 00:15:50,344 two very long JSON lines, if they differ in the middle, you just get 251 00:15:50,525 --> 00:15:54,737 a huge long unified diff, but here it's like "oh, just ???
things have changed" 252 00:15:58,875 --> 00:16:04,052 OpenDocument text formats, Ogg audio files, because why not. 253 00:16:05,148 --> 00:16:08,251 tcpdump capture files, that's actually quite useful. 254 00:16:09,019 --> 00:16:17,540 PDFs. That PDF says "Hello World" and this PDF says "Hello sick sad world", 255 00:16:17,995 --> 00:16:23,356 I don't know why, that particular text in the demo. 256 00:16:23,852 --> 00:16:27,058 Again, run that through a normal diff program… garbage. 257 00:16:28,212 --> 00:16:34,074 XML documents. Again, it'll pretty print them so it's nice, actually nice to read. 258 00:16:36,117 --> 00:16:41,809 If you want to get started on diffoscope, the very easiest and quickest way to do it is 259 00:16:42,212 --> 00:16:47,678 fire up a web browser, try.diffoscope.org, select your files, press Compare 260 00:16:48,470 --> 00:16:54,883 and it'll upload them and run diffoscope with all the support for all the file formats 261 00:16:55,226 --> 00:16:59,096 in the cloud for you and give you a nice HTML page that you can then link to people. 262 00:16:59,423 --> 00:17:01,107 So that's the very quickest way to get started. 263 00:17:02,360 --> 00:17:06,884 The next quickest way is to install trydiffoscope and then you run that 264 00:17:07,165 --> 00:17:09,751 on two files and it'll basically do the same thing, 265 00:17:10,018 --> 00:17:12,312 run it in the same cloud service as try.diffoscope.org 266 00:17:12,877 --> 00:17:16,672 but it'll give you the result on the command line or 267 00:17:16,981 --> 00:17:22,010 if you pass the webbrowser option, it will give you a URL or load your web browser, 268 00:17:22,228 --> 00:17:24,951 I can't remember exactly which, with the same results. 269 00:17:25,122 --> 00:17:29,574 This is 1kB of Python, nothing basically. 270 00:17:31,226 --> 00:17:33,120 That's the next easiest way. 271 00:17:34,262 --> 00:17:36,622 But you can then install diffoscope itself on your own machine.
272 00:17:37,631 --> 00:17:42,824 I recommend not installing recommends because all of those file formats 273 00:17:43,208 --> 00:17:46,560 might drag in extra things, about the whole of TeX, 274 00:17:46,820 --> 00:17:52,178 I think the whole of OpenOffice, whole of Mono, whole of Java… 275 00:17:57,263 --> 00:17:58,403 Android, yeah, quite big. 276 00:18:01,941 --> 00:18:03,489 I think there's another big one I can't think of. 277 00:18:04,554 --> 00:18:11,185 They're all optional, and they all say "By the way, I support TeX documents 278 00:18:12,046 --> 00:18:13,281 or whatever, Mono, whatever. 279 00:18:13,740 --> 00:18:18,954 But you need to install this package and then you get full pretty printed support", 280 00:18:19,846 --> 00:18:21,433 And it'll tell you that when it's missing. 281 00:18:21,791 --> 00:18:25,168 So, if you just start with --install-recommends disabled, 282 00:18:26,427 --> 00:18:29,107 run it on your file, if it says "please install this package", you can then 283 00:18:29,335 --> 00:18:31,239 install them as you go along, as you want 284 00:18:31,722 --> 00:18:34,319 rather than installing everything. 285 00:18:34,630 --> 00:18:38,333 And then you just pass ??? files and then it works as before. 286 00:18:41,978 --> 00:18:45,869 How can you improve all your own quality assurance and Debian packaging 287 00:18:45,959 --> 00:18:46,713 with diffoscope? 288 00:18:47,582 --> 00:18:50,974 The biggest value here is not necessarily for reproducible builds. 289 00:18:51,771 --> 00:18:56,406 It's for basically just seeing where you do want to have a diff or are expecting a diff 290 00:18:57,078 --> 00:19:00,368 and you are expecting a particular type of diff in a particular way, 291 00:19:00,903 --> 00:19:02,307 you can basically see those changes. 292 00:19:03,539 --> 00:19:12,151 And if you build two debs normally and ...
I'll try to demo in a second 293 00:19:12,403 --> 00:19:16,239 You build a deb without a patch applied and then build a deb with the patch applied 294 00:19:16,792 --> 00:19:19,791 you can ??? run a diff on the source package 295 00:19:20,742 --> 00:19:24,455 But that's not very useful because the binaries are going to end up on 296 00:19:24,695 --> 00:19:30,698 people's machines. But if you run a diff on the binary itself, did my change actually 297 00:19:31,150 --> 00:19:33,205 hit the binary? I think really ... No.. 298 00:19:36,118 --> 00:19:39,093 I'll just run through a very live demo of course, so it's gonna fail ... 299 00:20:03,706 --> 00:20:07,376 Check out some .... We'll get this libnetx-java 300 00:20:11,041 --> 00:20:12,160 We just build that once 301 00:20:16,188 --> 00:20:19,258 Let's say we are on the security team and 302 00:20:19,475 --> 00:20:22,701 want to apply a patch, and we want to be really sure because we are about to push it out 303 00:20:22,888 --> 00:20:24,044 to all our users 304 00:20:25,046 --> 00:20:28,612 First we will make a changelog 305 00:20:38,445 --> 00:20:39,284 Closing a bug 306 00:20:48,105 --> 00:20:54,949 Find some Java file to change 307 00:20:55,688 --> 00:20:56,798 Let's pretend we have a real patch 308 00:21:06,374 --> 00:21:10,650 Let's replace that equals equals, say that was the fix 309 00:21:14,033 --> 00:21:15,512 So that's the patch from upstream 310 00:21:15,884 --> 00:21:16,966 Upstream blast patch 311 00:21:23,505 --> 00:21:26,637 When we build this what we wanna see is just that change in the file 312 00:21:27,141 --> 00:21:32,116 we don't wanna see any nonsense changes or anything unintended but we also definitely want 313 00:21:32,293 --> 00:21:37,129 to see that change, cause if our binary for security reasons doesn't have that change 314 00:21:37,129 --> 00:21:42,270 then we aren't fixing people's machines, they will issue a DSA ??? installed ???
315 00:21:44,685 --> 00:21:48,766 And you should do proper testing as well at multiple levels 316 00:21:52,763 --> 00:21:53,799 I will build that again 317 00:22:23,976 --> 00:22:29,717 So we wanna diff the original one 0 5, 318 00:22:30,432 --> 00:22:36,212 We wanna diff that one with a fake security one 319 00:22:37,608 --> 00:22:43,481 You see on the progress bar 100% 1- there are differences (there should be 320 00:22:43,681 --> 00:22:46,304 differences) Let's see what those differences are 321 00:22:48,418 --> 00:22:51,828 in our web browser, it's a nice HTML output 322 00:23:01,180 --> 00:23:03,888 Let's have a look. Are we seeing what we wanna see? 323 00:23:07,147 --> 00:23:11,151 There are some changes in the data tar, we kind of expect that 324 00:23:14,447 --> 00:23:18,389 What's changed in our control file? Well the version changed, we wanted that 325 00:23:18,565 --> 00:23:19,656 to change. Perfect 326 00:23:20,535 --> 00:23:24,294 And it's changed to ??? That's what we wanna see 327 00:23:24,744 --> 00:23:28,370 No other changes here so there was no weird control or in magic going on 328 00:23:32,297 --> 00:23:38,421 In our data tar the color of the timestamp changes, we will ignore those for now 329 00:23:40,996 --> 00:23:44,944 The changelog has changed, well I hope so because I have changed that entry 330 00:23:48,820 --> 00:23:51,793 Here is where we are going to start seeing... We are going to see the change in the 331 00:23:52,016 --> 00:23:59,455 jar file which is the Java class, Java compiled archive format 332 00:24:00,442 --> 00:24:05,931 We are seeing some meaningless timestamp changes but we can ignore those 333 00:24:06,973 --> 00:24:08,923 let's pretend because it's just metadata maybe 334 00:24:16,429 --> 00:24:24,131 Ok part of a class, so if you can see here it's basically a decompilation of the 335 00:24:24,633 --> 00:24:31,500 Java file itself and it's basically saying "oh, it used to say ifnull and now ifnonnull" 336 00:24:31,796 --> 00:24:35,567 So
these are the actual Java bytecode instructions and what's really 337 00:24:35,965 --> 00:24:39,241 And what is really ??? here is that nothing else has changed 338 00:24:39,627 --> 00:24:44,717 We were just expecting that change between the two opcodes, of ifnull versus ifnonnull 339 00:24:45,554 --> 00:24:49,557 which is good cause it's like it hasn't made any other code changes but also crucially we can 340 00:24:49,725 --> 00:24:52,076 see that it has actually made a change to the code. 341 00:24:55,060 --> 00:24:58,072 For example it wasn't using some cached version or something like that 342 00:24:58,338 --> 00:24:59,505 This is really useful 343 00:25:00,326 --> 00:25:05,038 And just running a naive diff wouldn't give that of course, because it would just 344 00:25:05,223 --> 00:25:08,341 come with binary garbage. And just seeing the diff had changed again 345 00:25:08,627 --> 00:25:12,604 ??? have told you anything, because all of the change would have changed as well 346 00:25:12,802 --> 00:25:15,886 So it's like, well yes, it's different 347 00:25:16,028 --> 00:25:19,161 The meaningful change there is what actually fixes the flaw 348 00:25:19,597 --> 00:25:21,020 ??? but we know it's there 349 00:25:22,945 --> 00:25:27,448 That's kind of ???
Shipping this deb out I'd be quite 350 00:25:27,687 --> 00:25:30,004 confident, that this seemed like the actual bug 351 00:25:31,151 --> 00:25:34,721 I'd be quite confident pushing that out because it's a very minimal amount of changes 352 00:25:35,218 --> 00:25:36,750 you wanna do that for security reasons 353 00:25:37,285 --> 00:25:40,111 So this was the live demo 354 00:25:43,038 --> 00:25:48,108 The other one is seeing no changes at all, so you can build once 355 00:25:48,108 --> 00:25:49,894 if you build a reproducible 356 00:25:50,491 --> 00:25:54,753 You can build once, change your compiler or change some other part of your toolchain 357 00:25:55,982 --> 00:26:02,267 Build it again and if you got the exact same results, well great, that's what you intended 358 00:26:02,534 --> 00:26:04,595 You wanna see no changes when you change some part of it 359 00:26:08,127 --> 00:26:11,928 And that is really useful, if there were changes diffoscope will highlight them 360 00:26:12,271 --> 00:26:15,993 and show exactly why they had changed, maybe some compiler optimizations, 361 00:26:16,393 --> 00:26:17,565 maybe some other things as well 362 00:26:19,056 --> 00:26:22,603 So you can use it in both ways, when you expect changes and when you don't expect 363 00:26:22,789 --> 00:26:26,926 changes, and if those don't match the expectations diffoscope will tell you exactly why 364 00:26:29,922 --> 00:26:34,355 It's all ???
when other companies are doing security releases 365 00:26:35,111 --> 00:26:41,184 naming no names whatsoever, but they like to release patches as you 366 00:26:41,697 --> 00:26:44,618 know just a new firmware for your router 367 00:26:46,674 --> 00:26:50,629 Very large file system images, you basically have no idea what changed 368 00:26:51,034 --> 00:26:55,037 between these two files, again you run through diff, completely useless 369 00:26:55,419 --> 00:26:59,496 You can start to unpack them with squashfs and blah blah blah 370 00:27:01,143 --> 00:27:05,753 But they're probably sort of concatenated cpio archives, so that's nonsense 371 00:27:07,223 --> 00:27:11,913 But diffoscope would just chew through those and give you actually what the difference 372 00:27:11,913 --> 00:27:15,197 is between these two files, and say they changed this, they've removed or 373 00:27:15,596 --> 00:27:19,260 added some GPL license code or something kind of interesting 374 00:27:24,293 --> 00:27:31,212 So it's very useful for diffing those kind of binary blobs that come from various people 375 00:27:33,013 --> 00:27:36,983 So the current state of diffoscope, the development is up and down 376 00:27:41,148 --> 00:27:51,343 It started around May 2014 something like that. A bunch of work here, that's idle I think 377 00:27:55,239 --> 00:27:56,841 These are just for DebConfs basically 378 00:28:09,157 --> 00:28:12,343 Anyway it's going up and down, it's kind of interesting 379 00:28:14,939 --> 00:28:19,296 ??? a lot of reproducible builds projects of course, so every time we do a build 380 00:28:19,621 --> 00:28:25,064 on the ???
reproducible builds testing framework, we run diffoscope 381 00:28:25,303 --> 00:28:29,834 on the result; if it's reproducible it just says, hey, the file is the same 382 00:28:31,208 --> 00:28:36,767 But if not, we publish the diffoscope output for all the packages that are unreproducible 383 00:28:37,092 --> 00:28:40,870 so you can just go there and be like, what's the difference between these two things 384 00:28:53,762 --> 00:29:02,115 I invested a lot of work optimizing diffoscope, ??? rather perverse n-squared 385 00:29:02,465 --> 00:29:07,556 loops inside it. So I managed to cut down some of the time here, cut down here 386 00:29:11,063 --> 00:29:14,012 There have been quite a few performance enhancements over the past ... 387 00:29:16,395 --> 00:29:21,240 These are the Git tags; this is version 80 and this is version 50. I just ran the same 388 00:29:22,147 --> 00:29:23,363 benchmark across them all 389 00:29:24,705 --> 00:29:35,180 So it shows when I have introduced some rather stupid code, embarrassing, but whatever 390 00:29:35,703 --> 00:29:36,424 ??? 391 00:29:37,482 --> 00:29:40,522 There's work being done right now on parallel processing; there have been 392 00:29:40,923 --> 00:29:46,344 quite a few attempts before, but adding it is kind of interesting and difficult 393 00:29:47,033 --> 00:29:51,898 Luckily we have an outreach student, Liliana. Is she in the room? Is she hiding?
394 00:29:53,069 --> 00:29:57,225 She's here and she'll be talking tomorrow about her work on parallel processing in 395 00:29:57,520 --> 00:30:02,162 diffoscope, and that will be amazing because a lot of it is I/O-bound or waiting for external 396 00:30:02,388 --> 00:30:06,635 processes; with multi-CPU machines, you might as well just play well 397 00:30:07,012 --> 00:30:11,631 While I'm standing waiting for the result of a PDF being unpacked, I may as well 398 00:30:11,913 --> 00:30:16,859 be running on another CPU. I think we are going to see some real performance wins 399 00:30:17,512 --> 00:30:22,810 as we get that parallel processing merged and working and ??? 400 00:30:24,189 --> 00:30:29,544 You can check out our website, diffoscope.org; we recently migrated to Salsa ... yeah 401 00:30:33,375 --> 00:30:37,771 And everything that's reproducible is now on Salsa, it's kind of cool 402 00:30:38,732 --> 00:30:42,450 That's quite recent... ??? 403 00:30:44,620 --> 00:30:45,876 Thank you very much, danke schön 404 00:30:46,560 --> 00:30:48,733 You got any questions? About diffoscope? 405 00:30:51,659 --> 00:30:53,558 Thank you very much! 406 00:30:53,558 --> 00:30:57,761 [Applause] 407 00:30:59,888 --> 00:31:02,954 Q: A buzzword question: can you diff container image formats? 408 00:31:04,943 --> 00:31:14,617 A: Depends which ones. So if they are just directories, then yes, because it's just a directory 409 00:31:15,139 --> 00:31:17,224 Do you have one particularly in mind? Like Docker?
410 00:31:19,068 --> 00:31:25,487 Yes, there's Docker and then there's OCI, which I believe is the standard one 411 00:31:26,669 --> 00:31:30,506 And that could make it buzzword-compliant 412 00:31:31,286 --> 00:31:33,028 Ah OK, we were all about buzzwords 413 00:31:34,334 --> 00:31:37,411 Probably diffoscope blockchain as well 414 00:31:38,249 --> 00:31:42,059 And then run diffoscope on containers and see the difference between updates of your 415 00:31:42,059 --> 00:31:43,395 container images 416 00:31:43,620 --> 00:31:46,219 BAM ... solved. Where do I invest? 417 00:31:48,231 --> 00:31:56,645 I wasn't aware of OCI ... is that what it's called? No, it doesn't support that right now 418 00:31:58,347 --> 00:32:02,025 But it wouldn't be too difficult, presuming there are tools to unpack it, and as soon as 419 00:32:02,297 --> 00:32:07,761 we have a tool to unpack it, it can then just go to that. There is an open wishlist 420 00:32:08,177 --> 00:32:15,402 bug for Docker containers too, to the point where I think it would be really 421 00:32:15,668 --> 00:32:19,338 nice if you could just give it, say, two image names or whatever the noun is 422 00:32:19,835 --> 00:32:24,083 So you can say "please diff these two Docker images that are available" and 423 00:32:24,274 --> 00:32:28,753 it can look at your local thing and do a diff on them. Currently it's not 424 00:32:29,008 --> 00:32:31,077 supported, but there is an open wishlist bug. 425 00:32:32,345 --> 00:32:36,860 Q: Shouldn't any company that releases binaries be interested in supporting 426 00:32:37,183 --> 00:32:38,544 diffoscope and using it? 427 00:32:51,541 --> 00:32:58,413 A1: Basically, when companies release binaries they are not interested in users seeing differences...
428 00:33:01,874 --> 00:33:10,299 A2: Yes, I'm actually surprised that the Docker bug was only opened two months ago 429 00:33:10,776 --> 00:33:17,144 and that there hasn't been more interest in diffing container images, but if you'd like to open 430 00:33:17,561 --> 00:33:24,460 one for OCI that would be very much appreciated, and we can get on to that, that would be 431 00:33:24,677 --> 00:33:25,573 great. 432 00:33:30,038 --> 00:33:35,465 I was looking at the page for OCI; it says it's based on Docker basically, so 433 00:33:35,655 --> 00:33:40,500 you get OCI for free once you've sorted it out for Docker, if you're lucky 434 00:33:48,166 --> 00:33:51,646 The OCI image format, that grew out of Docker images 435 00:33:55,429 --> 00:34:00,232 OK, we will sort that out, and it seems like we're using Docker more and more 436 00:34:00,279 --> 00:34:01,451 on Debian 437 00:34:07,484 --> 00:34:09,216 Any other questions? 438 00:34:20,886 --> 00:34:29,297 Q: Out of curiosity, which ??? are you using inside? Are you using some bioinformatics 439 00:34:30,447 --> 00:34:33,332 algorithm to diff trees efficiently? 440 00:34:34,200 --> 00:34:46,781 A: No, it's really naive. All it does is run normal diff, the normal diff tools, but 441 00:34:47,126 --> 00:34:59,242 it will try to identify files and unpack them first, so it uses the file utility, the identifier 442 00:34:59,716 --> 00:35:06,547 thing that says it's a PDF, and tries to unpack it first. It doesn't do any clever 443 00:35:07,415 --> 00:35:12,056 matching. The clever matching that it does do is fuzzy matching, so if you just 444 00:35:12,293 --> 00:35:18,567 rename a directory inside a container between the two, it will say, yeah, there's a 445 00:35:18,812 --> 00:35:23,981 massive fuzzy match between these two files, and things like that.
So that's 446 00:35:24,241 --> 00:35:31,110 kind of useful, but apart from that it's not very clever, which is kind of what you want, because 447 00:35:31,292 --> 00:35:34,308 if it's too clever it would start to be a little opaque ... 448 00:35:37,749 --> 00:35:40,046 I personally like dumb tools. 449 00:35:43,916 --> 00:35:51,411 Q: So one question to you is whether, if you wanna do a release to stable or 450 00:35:51,565 --> 00:35:58,973 something like that, you can ask for the debdiff. I'm wondering if anyone 451 00:35:59,174 --> 00:36:03,914 I mean, I remember doing that myself; I've been submitting diffoscope output 452 00:36:04,119 --> 00:36:09,516 as well, because it's just more readable and useful. So I'm not sure if anyone has any 453 00:36:09,692 --> 00:36:12,741 objection to people asking for those. 454 00:36:22,179 --> 00:36:24,752 I'll propose that to the release team, see what they say 455 00:36:26,024 --> 00:36:28,950 Thank you very much, are there any other questions? 456 00:36:32,634 --> 00:36:36,787 No further questions? Then let's thank Chris again! 457 00:36:37,137 --> 00:36:41,940 [Applause]
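Editor's note: the answer to the question about diff algorithms describes a deliberately simple approach: identify the file type, unpack it, and only then run an ordinary diff. The sketch below illustrates that idea only and is not diffoscope's code. The function names `identify`, `unpack` and `compare` are hypothetical, magic-byte checks stand in for the `file` utility mentioned in the talk, only zip and gzip are handled, and the fuzzy matching of renamed files is left out.

```python
import difflib
import gzip
import io
import zipfile


def identify(data: bytes) -> str:
    """Guess a container type from magic bytes (assumed stand-in for `file`)."""
    if data.startswith(b"PK\x03\x04"):
        return "zip"
    if data.startswith(b"\x1f\x8b"):
        return "gzip"
    return "plain"


def unpack(kind: str, data: bytes) -> dict:
    """Return {member name: member bytes} for a recognised container."""
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {n: archive.read(n) for n in archive.namelist()}
    return {"(gzip content)": gzip.decompress(data)}


def compare(name: str, a: bytes, b: bytes) -> list:
    """Unpack both inputs recursively, then fall back to a plain line diff."""
    if a == b:
        return []
    kind = identify(a)
    if kind != "plain" and kind == identify(b):
        left, right = unpack(kind, a), unpack(kind, b)
        report = []
        for member in sorted(left.keys() | right.keys()):
            report += compare(f"{name}/{member}",
                              left.get(member, b""), right.get(member, b""))
        # Same members but different bytes: the difference is in the wrapper.
        return report or [f"{name}: containers differ outside member contents"]
    # Leaf: no clever matching, just an ordinary unified diff.
    old = a.decode("utf-8", "replace").splitlines()
    new = b.decode("utf-8", "replace").splitlines()
    return list(difflib.unified_diff(old, new, f"a/{name}", f"b/{name}",
                                     lineterm=""))
```

Calling `compare("firmware.zip", old_bytes, new_bytes)` on two in-memory archives returns unified-diff lines labelled with the path of the nested member that changed, which is the "dumb but transparent" behaviour the speaker says he prefers.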