English subtitles

← https:/.../diffoscope.webm

Showing Revision 25, created 06/15/2018 by ruipb.

  1. I'm here today to talk to you about
    diffoscope
  2. and how you can use it as a better diff
  3. or for Quality Assurance, etc., things
    like that.
  4. Moin!
  5. Apparently that's like a North German
    thing to say "welcome".
  6. North German, north Denmark, Scandinavia,
    that kind of thing, I'm told.
  7. People are shaking their head, so I'm
    going to assume that's true.
  8. This is my first PC, an IBM 5155.
  9. Sometimes, when you rebooted it, it would
    launch into, it would somehow revert
  10. from booting from the hard disk to booting
    from a BASIC ROM,
  11. as in the programming language ROM.
  12. It was on my motherboard for some reason.
  13. So, randomly, you just get a chance to
    program in BASIC and then,
  14. sometimes you wouldn't, I don't know why,
    but… yeah.
  15. It's quite fun with this kind of clicky
    keyboard, and that folded in
  16. and it was this kind of big desk thing.
  17. Anyway…
  18. This is my first Debian.
  19. At the time it was already old.
  20. What's this one? Is this Slink? 2.2?
    Yeah.
  21. And this is when we had US and non-US,
    so that's really dating if you remember that.
  22. This is my first contribution to Debian,
    19th December 2006,
  23. sending a patch to lilypond which is kind
    of interesting
  24. and the response was "Oh yeah, rock on,
    many thanks. I'll upload this and
  25. it'll be landing to Etch".
  26. And this was super motivating because
    Etch was just coming out and it was like
  27. "Great, I've got like one line of tiny patch
    in a release. This is super cool."
  28. Thomas' response was super motivating.
  29. So, after that, like that Christmas
    basically spent ???
  30. Debian webpages and stuff.
  31. Very well timed.
  32. That's kind of a good…
  33. You know, someone sends a patch, be like
    "Cool, thanks"
  34. Like a little notice in the changelog.
  35. It was, you know, so stupid but…
    Yeah, do that kind of thing.
  36. So, moving on.
  37. Why diffoscope?
    Why did we write diffoscope?
  38. What's the background here?
  39. It comes from reproducible builds.
  40. The very quick outline is that once you
    get the source code for free software,
  41. you download the source code for nginx
    or whatever,
  42. pretty much everyone just runs binaries
    on their servers or their systems.
  43. You know, "apt install bla", "yum install",
    whatever.
  44. Android Playstore, whatever.
  45. Can you actually trust whether these two
    things correspond with each other?
  46. You've gotten the source code, it looks
    alright, and then you install this binary,
  47. yeah…
  48. Who generated that? Can you trust that
    process?
  49. Can you trust who generated it?
  50. Even if you could trust them, could you
    trust them not to be exploited? Etc.
  51. This is a big problem because you can
    exploit a build farm and then
  52. obviously exploit all of that, you know,
    a trojan into the build farm,
  53. so every single binary that comes out
    is compromised.
  54. Kind of problematic.
  55. You could also target individual developers'
    machines,
  56. so I could go off to, say, your machine,
    add a backdoor to it,
  57. so every binary that you give to friends
    and things like that,
  58. is compromised in some way, stealing
    your bitcoins or whatever.
  59. I can also turn up at your door
    and blackmail you into producing
  60. software that has compromises or extra
    features, shall we say,
  61. that don't exist in the source code.
  62. So what will happen there is that you'd
    release your source
  63. and the binaries you produce have
    this sort of backdoor that, you know,
  64. someone is forcing you into producing.
  65. So, you don't want to do that.
  66. Anyway
  67. enough of that.
  68. What you do for reproducible builds is you
    ensure that every time you build
  69. a piece of software, you get an identical
    result.
  70. Multiple people then compare their builds
    and check whether they all get
  71. the same results
  72. and this means that an attacker must
    either have infected everyone
  73. at the same time, or they haven't
    infected anyone.
  74. The point here is that you have to ensure
    that builds have identical results.
  75. Ok, great.
  76. So, we started the reproducible builds
    project, etc.
  77. And we build 2 debs.
  78. Oh, I'm sorry about the colors there.
  79. You probably can't see that.
  80. That says "sha1sum a.deb b.deb".
  81. Anyway, we're comparing the sha1sums
    of 2 binary Debian files.
  82. So, these two files differ.
  83. Ok, they're not reproducible.
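As a rough illustration of the comparison step described above, here is a minimal Python sketch that hashes several independently built artifacts and checks that they all agree. The function names `file_digest` and `builds_match` are hypothetical and are not part of any Debian or reproducible-builds tooling; the talk itself uses `sha1sum` on the command line.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def builds_match(paths, algorithm="sha256"):
    """True only if every independently built artifact has the same digest."""
    digests = {file_digest(p, algorithm) for p in paths}
    return len(digests) == 1
```

A mismatch only says that the builds differ; it says nothing about where or why, which is the gap the rest of the talk addresses.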
  84. Why is that?
  85. So we run a diff on them.
  86. Yeah…
  87. So, what can we learn from this?
  88. Well, not very much, visibly they're
    compressed so
  89. as soon as we see one change, we'll see
    they would just cascade changes
  90. because that's how compression works.
  91. I guess we know it's a deb, probably an ar
    format file, not very useful.
  92. Ok, great so we're gonna have a look in
  93. We'll do a binary diff and ok, well…
  94. Again, that's not really telling us
    very much
  95. with the diff there.
  96. Ok, great.
  97. ??? one level in
  98. "ar x" is on the new maintainer thing,
    "how you unpack a deb"
  99. Everyone remembers this, right?
  100. You unpack a.deb with "ar x" and you
    do that to b.deb
  101. and then we diff the results of that.
  102. Ok, so…yeah, 7zip.
  103. Ok, compressed content, not very useful.
  104. Ok, so let's unpack the control.tar inside
    these debs.
  105. And then we run diff on that.
  106. Still not really telling anything useful
    about how to make this package reproducible
  107. So let's unpack the tar.xz into the tar.
  108. Inside that tar, there's a file called
    md5sums and we start to see some differences
  109. between some files in these two debs.
  110. ??? meaningful, so now
    we have some idea that
  111. it has something to do with this
    usr/bin/pmixer binary.
  112. Ok, interesting.
  113. We'll unzip that and then we do a diff on
    pmixer itself.
  114. Now we're back into just binary
    "globgoly" mode
  115. This isn't very helpful and this is taking
    quite a while
  116. and if I remember correctly, Debian has
    a lot of packages.
  117. So this might take a little while.
  118. So, basically, ??? mean
  119. I should build a better diff.
  120. That's not quite true, this is actually…
  121. It was Lunar that started this project
  122. and it was called debbindiff, because
    we wanted to diff
  123. binary Debian packages.
  124. So this is the initial commit, 2014.
  125. "The version is successfully able to report
    differences in two .changes files.
  126. Not with much interesting details,
    but it's a start."
  127. And it was a start.
  128. Fast forwarding… Oh, sorry about these
    colors,
  129. I don't know if we can do anything about
    the lights?
  130. Yeah?
  131. No?
  132. All right, whatever…
  133. Basically, we're diffoscoping on…
  134. It works kind of like diff does normally,
  135. you give it two files, it outputs
    a unified diff.
  136. So "diffoscope a b", one file contains
    the word "foo", one contains the word "bar".
  137. Nothing actually out of the ordinary.
  138. It's sort of colored by default, so that's
    why you can't see it, but whatever.
  139. It supports archive formats, so if you
    give it two tar files,
  140. if we then tar up our "a" file and
    our "b" file into a a.tar and b.tar
  141. and then run diffoscope on those tar files
  142. we get this kind of, like, hierarchy here.
  143. So it's saying that there are differences
    between these files,
  144. in the file list they have different time
    stamps, because I made them
  145. at different times,
  146. and here are the contents, so we got
    "foo" there and "bar" there.
  147. So we can see the difference between them.
  148. Well, I can, I don't know if you can,
    you get the slide there.
  149. If we gzip these tar files and then run
    diffoscope on those gzip things,
  150. it'll say "ok, what we've done is unpack it
    first, and here's the metadata
  151. about the gzip process",
  152. and inside that are a.tar and b.tar
    from the previous slides.
  153. And then the "a" file and the "b" file.
  154. So, it's really going two levels deep
    into this tar.gz file.
  155. That's pretty cool.
  156. And it's completely recursive, I think
    it will actually blow out after, I think,
  157. 1000 [levels].
  158. [light is turned down for the audience
    to see the slides]
  159. I'll just bump back a bit, just in case.
  160. [Applause]
  161. Thank you.
  162. So that's the a and b files.
  163. We've tarred them up and so I see
    the hierarchy of foo and bar file layer.
  164. I've gzipped them, so this is a gzip layer.
  165. Here's the tar layer and then there's
    the files themselves.
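The layered report described above (gzip layer, then tar layer, then the files) can be imitated with a small Python sketch. This is not how diffoscope is implemented; it only assumes that gzip data starts with the magic bytes `1f 8b`, that tar archives can be opened with the standard `tarfile` module, and that anything else is treated as text. The names `read_tar`, `text_diff` and `recursive_diff` are hypothetical, and corrupt gzip input is not handled.

```python
import difflib
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"


def read_tar(data):
    """Return {member name: bytes} for regular files, or None if data is not a tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            return {m.name: tar.extractfile(m).read()
                    for m in tar.getmembers() if m.isfile()}
    except tarfile.TarError:
        return None


def text_diff(a, b, name):
    """Unified diff of two byte strings, or a one-line note if they are not UTF-8."""
    try:
        a_lines = a.decode("utf-8").splitlines()
        b_lines = b.decode("utf-8").splitlines()
    except UnicodeDecodeError:
        return [f"{name}: binary files differ"]
    return list(difflib.unified_diff(a_lines, b_lines, name, name, lineterm=""))


def recursive_diff(a, b, name, depth=0, max_depth=10):
    """Return indented report lines describing how byte strings a and b differ."""
    if a == b:
        return []
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}{name}: differs (depth limit reached)"]
    if a[:2] == GZIP_MAGIC and b[:2] == GZIP_MAGIC:
        # Unpack the gzip layer and compare what is inside.
        inner = recursive_diff(gzip.decompress(a), gzip.decompress(b),
                               f"{name} (decompressed)", depth + 1, max_depth)
        return [f"{pad}{name}: gzip layer"] + (
            inner or [f"{pad}  only the gzip wrapper differs"])
    tar_a, tar_b = read_tar(a), read_tar(b)
    if tar_a is not None and tar_b is not None:
        report = [f"{pad}{name}: tar layer"]
        for member in sorted(set(tar_a) | set(tar_b)):
            if member not in tar_a or member not in tar_b:
                report.append(f"{pad}  {member}: only in one archive")
            else:
                report += recursive_diff(tar_a[member], tar_b[member],
                                         member, depth + 1, max_depth)
        if len(report) == 1:
            report.append(f"{pad}  only tar metadata differs (for example timestamps)")
        return report
    # Fallback: plain text comparison, indented to match the current layer.
    return [pad + line for line in text_diff(a, b, name)]
```

Calling `recursive_diff` on the bytes of two `.tar.gz` files returns a list of lines that can be printed one per line, with deeper layers indented further.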
  166. This is from a real .deb from the archive.
  167. Inside this .deb, there's a data.tar.xz
    and in that xz file there's a data.tar
  168. and inside that tar file, there's a file
    called aff and inside that
  169. there's a version string that is different.
  170. And that looks like a build date so we
    probably know that if we went back
  171. to the source package, we could very
    quickly work out,
  172. with a very quick grep, work out
    where this file is being generated from,
  173. the de_DE.aff file and then ???
    probably quite obvious
  174. that it's using the current build time
    and then we can just patch that, fix it etc.
  175. This has gone from two rather obscure
    binary .debs all the way to the fix
  176. probably in about 5 minutes, and you can
    probably send the patch in that time
  177. because it'd be quite quick.
  178. Without diffoscope here, without this sort
    of recursive unpacking,
  179. you'd be just completely lost, you'd be
    there with ar x all day
  180. and working out which files are different
    and trying to use xxd
  181. and this kind of nonsense.
  182. diffoscope's got some other things as well
  183. if you try to do reproducible packages
    and things are varying just on
  184. the line ordering, we detect whether
    a file differs only in the line ordering.
  185. So, here's file "a", "These lines are in
    order".
  186. File "b" has "These order are in lines".
  187. It's very difficult to say, actually,
    it's like one of these tongue twisters.
  188. Run diffoscope on these two and it says
    it's got ordering differences only.
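A check of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not diffoscope's own code, and the function name `ordering_differences_only` is hypothetical: two texts differ only in ordering when their lines are different as sequences but identical once sorted.

```python
def ordering_differences_only(a_text, b_text):
    """True if the two texts contain the same lines, just in a different order."""
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    return a_lines != b_lines and sorted(a_lines) == sorted(b_lines)
```

With one word per line, the "These lines are in order" and "These order are in lines" files from the slide would be reported as differing in ordering only.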
  189. That's interesting, so you probably need
    to sort,
  190. you go all the way back to the source code,
    work out very quickly,
  191. if you know it's just ordering differences
    you just kind of know
  192. what the output's gonna be, you can
    search for order in ???
  193. and you get the right files,
  194. add a sorted or sort in the right
    place, BAM! send the patch off,
  195. everything is great.
  196. Oh, and send it to upstream as well
    because you're good.
  197. It supports a lot more things.
  198. We've been showing the terminal
    text output here.
  199. It's got an HTML output mode, which is
    really useful in the hierarchical thing
  200. when it gets a bit more complicated.
  201. Instead of being laid on top of each other
    like a unified diff,
  202. you get the diff on the left and the right
    and you get sort of a nested
  203. thing inside with colors and lines and
    you can link this and various things in it
  204. including bits of metadata here, other
    bits here, what command you used.
  205. That's the HTML output.
  206. We also support a lot of file formats,
    it's not just on text,
  207. it's about all of these, so let's quickly
    run through some of them.
  208. You give it two Android apk files which
    are kind of like zips, but magic.
  209. It'll know how to compare them.
  210. There's like a Manifest file that needs
    decoding.
  211. It supports Berkeley DB databases,
  212. Word documents, that's a Word document
    with "a" and that's a Word document with "b"
  213. and it'll correctly do that.
  214. If you run that through diff normally,
    that ??? be a binary mess,
  215. so completely useless.
  216. E-books, there's epub, it also supports
    mobi.
  217. So if you give it two epub files, it'll say
    "They just differ in this date".
  218. Brilliant.
  219. Normally that will be completely useless
    diff binary ???
  220. So you can be like "epub date, ok", grep
    the source code for that,
  221. make a patch really quickly.
  222. Mono binaries, git repositories, why not?
  223. Gnumeric spreadsheets, ISO images.
  224. Oh yeah, ISO images is really cool.
  225. So, it'll basically unpack the ISO, then
    inside that there might be a squashfs image
  226. then it'll completely go down to that and
    work out any differences
  227. between the two contents in the ISO file,
    including any metadata.
  228. This is on the squashfs metadata headers,
    I think.
  229. But say inside that ISO, there was a file
    that was a pdf, and inside that pdf was
  230. a ??? which varied,
  231. it will basically go all the way down
    and say "yeah, it's actually here,
  232. in this ??? that the data differs."
  233. And that means you can just go again
    all the way back to the source
  234. and say "ok, cool, we know how to fix
    this quite quickly"
  235. And this is really valuable in getting
    the recent Tails distribution reproducible
  236. so their ISOs are reproducible.
  237. If you build one and I build one, we get
    the exact same one
  238. and that's kind of useful for something
    like Tails where you would probably want to
  239. of all, there's a lot of projects that you
    might want to compromise,
  240. you might want to go after that one,
    because of the kind of people that are using it.
  241. We support comparing images, so this is
    using ???
  242. and then just running that through diff.
  243. That is a linux penguin and that is
    something else,
  244. I can't remember now. Oh, FT.
  245. It supports images.
  246. It supports JSON and pretty print,
    so if you give it two JSON files
  247. one with key/value… it'll do a nice
    diff of them.
  248. It will pretty print it first, before
    doing the diff, so it'll actually give you
  249. something clean, otherwise I don't know
    if you've ever diffed
  250. two very long JSON lines, if they differ
    in the middle, you just get
  251. a huge long unified diff, but here it's
    like "oh, just ??? things have changed"
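The pretty-print-then-diff approach can be sketched with Python's standard library. This is an illustration under assumptions, not diffoscope's implementation: the function name `pretty_json_diff` is hypothetical, and sorting the keys is an extra normalisation step chosen here so that key order alone does not show up as a change.

```python
import difflib
import json


def pretty_json_diff(a_text, b_text):
    """Pretty-print two JSON documents, then return a unified diff of the result."""
    a = json.dumps(json.loads(a_text), indent=4, sort_keys=True).splitlines()
    b = json.dumps(json.loads(b_text), indent=4, sort_keys=True).splitlines()
    return list(difflib.unified_diff(a, b, "a.json", "b.json", lineterm=""))
```

Because each key ends up on its own line, a change in the middle of a very long single-line JSON file shows up as one changed line instead of one huge changed line.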
  252. OpenDocument text formats,
    Ogg audio files, because why not.
  253. tcpdump capture files, that's actually
    quite useful.
  254. PDFs. That PDF says "Hello World" and
    this PDF says "Hello sick sad world",
  255. I don't know why, that particular text
    in the demo.
  256. Again, run that through normal diff
    program… garbage.
  257. XML documents. Again, it'll pretty print
    them so it's nice, actually nice to read.
  258. If you want to get started on diffoscope,
    the very easiest and quickest way to do is
  259. fire up a web browser, try.diffoscope.org,
    select your files, press Compare
  260. and it'll upload them and run diffoscope
    with all the support for all the file formats
  261. in the cloud for you and give you a nice
    HTML page that you can then link to people
  262. So that's the very quickest way to get
    started.
  263. The next quickest way is to install
    trydiffoscope and then you run that
  264. on two files and it'll basically do
    the same thing,
  265. run it in the same cloud service as
    trydiffoscope
  266. but it'll give you the result on the
    command line or
  267. if you pass the webbrowser option, it will
    give you a URL or load your web browser,
  268. I can't remember exactly which, with
    the same results.
  269. This is 1kB of Python, nothing basically.
  270. That's the next easiest way.
  271. But you can then install diffoscope itself
    on your own machine.
  272. I recommend not installing recommends
    because all of those file formats
  273. might drag in extra things about
    the whole of TeX,
  274. I think the whole of OpenOffice, whole
    of Mono, whole Java…
  275. Android, yeah, quite big.
  276. I think there's another big one I can't
    think of.
  277. They're all optional, and they all say
    "By the way, I support TeX documents
  278. or whatever, Mono, whatever.
  279. But you need to install this package and
    then you get full pretty printed support",
  280. And it'll tell you that when it's missing.
  281. So, if you just start with
    --install-recommends disabled,
  282. right on your file, if it says
    "please install this package, you can then
  283. install them as you go along, as you want"
  284. rather than installing everything.
  285. And then you just pass ??? files
    and then works as before
  286. How can you improve your own
    quality assurance and Debian packaging
  287. with diffoscope?
  288. The biggest value here is not
    necessarily for reproducible builds.
  289. It's for basically just seeing where you
    do want to have a diff, or are expecting a diff,
  290. and you are expecting a particular type
    of diff in a particular way,
  291. you can basically see those changes.
  292. And if you build two debs normally and
    ... i'll try to demo in a second
  293. You build a deb without a patch applied and
    then build a deb with the patch applied,
  294. you can ??? run a diff on the source package.
  295. But that's not very useful because the
    binaries are going to end up on
  296. people's machines. But if you run a diff on
    the binary itself, did my change actually
  297. hit the binary? I think really ...
    No..
  298. I'll just run through a very live demo, of
    course, so it's gonna fail ...
  299. Checkout some .... We'll get this
    libnetx-java
  300. We just build that once
  301. Lets say we are on security team and
  302. want to apply a patch, and we want to be
    really sure because we are to push it out
  303. to all our users
  304. First we will make a changelog
  305. Closing a bug
  306. Find some java file to change
  307. Let's pretend we have a real patch
  308. Let's replace that equals equals,
    say that was the fix
  309. So that's the patch from upstream
  310. Upstream blast patch
  311. When we build this, what we wanna see is
    just that change in the file.
  312. We don't wanna see any nonsense changes of
    extended dump, but we also definitely want
  313. to see that change, 'cause if our binary,
    for security reasons, doesn't have that change
  314. then we aren't fixing people's machines,
    they will issue a DSA ??? installed ???
  315. And you should do proper testing as well
    at multiple levels
  316. I will build that again
  317. So we wanna diff the original one 0 5,
  318. We wanna diff that one with a fake
    security one
  319. You see on the progress bar 100%
    1- there are differences (there should be
  320. differences)
    Let's see what those differences are
  321. in our web browser, it's a nice HTML output
  322. Let's have a look.
    Are we seeing what we wanna see?
  323. There are some changes in the data tar, we
    kind of expect that
  324. What's changed in our control file?
    Well, the version changed, we wanted that
  325. to change. Perfect
  326. And it's changed to ???
    That's what we wanna see
  327. No other changes here so there was no
    weird control or in magic going on
  328. In our data tar the color of the timestamp
    changes, we will ignore those for now
  329. The changelog has changed, well I hope so
    because I have changed that entry
  330. Here is where we're going to start seeing.
    We are going to see the change in the
  331. jar file, which is the Java class, Java
    compiled archive format
  332. We are seeing some meaningless timestamp
    changes but we can ignore those
  333. let's pretend, because it's just
    metadata maybe
  334. Ok part of a class, so if you can see here
    it's basically a decompilation of the
  335. Java file itself and it's basically saying
    "oh, I used to say ifnull and now ifnonnull"
  336. So these are the actual Java
    bytecode instructions
  337. and what is really ??? here
    is that nothing else has changed
  338. We were just expecting that change between
    the two opcodes, of ifnull and ifnonnull,
  339. which is good 'cause it's like it hasn't made
    any other code changes but also, crucially, we can
  340. see that it has actually made a change
    to the code.
  341. For example, it wasn't using some cached
    version or something like that
  342. This is really useful
  343. And just running a naive diff wouldn't
    give that of course, because it would just
  344. come with binary garbage
    And just seeing the diff had changed again
  345. ??? be told you anything, because all of the
    change would have changed as well
  346. So it's like, well, yes, it's different
  347. The meaningful change there is
    what actually fixes the flaw
  348. ??? but we know it's there
  349. That's kind of ???
    Shifting this deb out I'll be quite
  350. confident, that this seemed like the
    actual bug
  351. I'd be quite confident pushing that out
    because it's a very minimal amount of changes
  352. you wanna do that for security reasons
  353. So this was the live demo
  354. The other one is seeing no changes
    at all, so you can build once
  355. if you build a reproducible
  356. You can build once change your compiler
    or change some other part of your toolchain
  357. Build it again and if you got the exact same
    results, well great, that's what you intended
  358. You wanna see no changes when you change
    some part of it
  359. And that is really useful, if there were
    changes diffoscope will highlight them
  360. and show exactly why they had changed,
    maybe some compiler optimizations,
  361. maybe some other things as well
  362. So you can use it in both ways, when you
    expect changes and when you don't expect
  363. changes, and if those match the expectations
    diffoscope will tell you exactly why
  364. It's all ??? when other companies
    are doing security releases
  365. naming no names whatsoever,
    but they like to release patches as you
  366. know just a new firmware for your router
  367. Very large file system images,
    you basically have no idea what changed
  368. between these two files, again you run
    through diff completely useless
  369. You can start to unpack them with
    squashfs and blah blah blah
  370. But they're probably sort of concatenated
    cpio archives, so that's nonsense
  371. But diffoscope would just chew through those
    and give you what the differences actually
  372. are between these two files, and say
    they changed this, they've removed or
  373. added some GPL-licensed code or something
    kind of interesting
  374. So it's very useful for diffing those kinds of
    binary blobs that come from various people
  375. So the current state of diffoscope,
    the development is up and down
  376. It started around May 2014 something like that
    A bunch of work here, that's idle I think
  377. These are just for DebConfs basically
  378. Anyway it's going up and down, it's kind
    of interesting
  379. ??? a lot of reproducible builds projects
    of course, so every time we do a build
  380. on the ??? reproducible builds or
    testing framework if we run diffoscope
  381. on the result, if it's reproducible it
    just says, hey, the file is the same
  382. But if not, we publish the diffoscopes of
    all your packages that are unreproducible
  383. so you can just go there and be like,
    what's the difference between these two things
  384. I invested a lot of work optimizing
    diffoscope, ??? rather perverse n-squared
  385. loops inside it. So I managed to cut down
    some of the time here, cut down here
  386. There have been quite a few performance
    enhancements over the past ...
  387. these are the git tags, this is version 80
    and this is version 50, I just ran the same
  388. benchmark across them all
  389. So this shows when I have introduced some
    rather stupid code, embarrassing, but whatever
  390. ???
  391. There's work being done right now,
    on parallel processing, there's been
  392. quite a few attempts before, but adding it
    is kind of interesting and difficult
  393. Luckily we have an outreach student
    Liliana, is she in the room? Is she hiding?
  394. She's here and she'll be talking tomorrow
    about her work on parallel processing in
  395. diffoscope and that will be amazing because
    a lot of it is IO bound or waiting for external
  396. processes. With multiple cpu machines,
    you might as well just play well
  397. while I'm waiting for the result,
    for a pdf to be unpacked, I may as well
  398. be running on another cpu, I think we are
    going to see some real performance wins
  399. as we get that parallel processing merged and
    working and ???
  400. You can check out our website diffoscope.org
    recently migrated to Salsa .... yeeaahhh
  401. And everything that's reproducible is now
    on Salsa, it's kind of cool
  402. That's quite recent...
    ???
  403. Thank you very much, danke schön
  404. You got any questions?
    About diffoscope?
  405. Thank you very much !
  406. [Applause]
  407. Q: A buzzword question, can you diff container
    image formats?
  408. A: Depends which ones. So if they are just
    directories, then yes, because it's just a directory
  409. Do you have one in particular in mind? Like docker?
  410. Yes, there's docker and then there's
    OCI, I believe, which is the standard one
  411. And that could make it buzzword compliant
  412. Ah ok, we were all about buzzwords
  413. Probably diffoscope blockchain as well
  414. And then run diffoscope on containers and
    see the difference between updates of your
  415. container images
  416. BAM ... solved
    Where do I invest?
  417. I wasn't aware that OCI ... that's how it's
    called? No, it doesn't support that right now
  418. But it wouldn't be too difficult, presuming
    there are tools to unpack it, and as soon
  419. as we have a tool to unpack it, it can then
    just go to that, there is an open wishlist
  420. bug tool box for docker containers to the
    point where I think it would be really
  421. nice if you could just give it, say, two
    image names or whatever the noun is
  422. So you can say "please diff these two
    docker images that are available" and
  423. it can look at your local thing and do
    a diff on them, currently it's not
  424. supported, but there is an open wishlist
    bug.
  425. Q: Shouldn't any company that releases
    binaries, be interested in supporting
  426. diffoscope and using it?
  427. A1: Basically when companies release binaries they are not interested in users seeing differences...
  428. A2: Yes, I'm surprised that actually the
    docker bug was only opened two months ago
  429. and there hasn't been more interest in diffing
    container images, but if you'd like to open
  430. one for OCI that would be very appreciated,
    and we can get on to that, that would be
  431. great.
  432. I was looking at the page for OCI, it says
    it's based on docker basically, so
  433. once you get OCI for free, you would
    sort it out for docker, if you're lucky
  434. The OCI image formaters, they wrote out
    on docker images
  435. Ok we will sort that out, and it seems like
    we're using docker more and more
  436. in Debian
  437. Any other questions?
  438. Q: Out of curiosity, which ??? are you using
    inside? Are you using some bio-informatics
  439. algorithm to diff trees efficiently?
  440. A: No, it's really naive, all it does is run
    normal diff, the normal diff tools, but
  441. it will try to identify files and unpack
    first, so use the file utility identifier
  442. thing that says it's a pdf, and try to
    unpack it first, it doesn't do any clever
  443. matching. The clever matching that it does
    do is fuzzy matching as well, so if you just
  444. rename a directory between two inside a
    container, it will say, yeah, there's a
  445. massive fuzzy match between these
    two files, and things like that. So that's
  446. kind of useful, but apart from that it's not clever,
    which is kind of what you want, because
  447. if it's too clever it would start to be a little
    opaque ...
  448. I personally like dumb tools.
  449. Q: So one question to you is whether,
    if you wanna do a release to stable or
  450. something like that, you can ask for the
    debdiff, I'm wondering if anyone
  451. I mean I remember doing that myself
    I've been submitting diffoscope output
  452. as well, because it's just more readable and
    useful. So I'm not sure if anyone has any
  453. objection to people asking for those.
  454. I'll propose that to the release team
    see what they say
  455. Thank you very much,
    are there any other questions?
  456. No further questions? Then let's thank
    Chris again !
  457. [Applause]