Stretching out for trustworthy reproducible builds creating bit by bit identical binaries

  • 0:01 - 0:02
    Welcome and good morning
  • 0:04 - 0:07
    This is the reproducible builds team,
    talking about
  • 0:07 - 0:10
    "Stretching out towards trustworthy
    computing"
  • 0:12 - 0:20
    [Applause]
  • 0:22 - 0:26
    We're 4 on stage, but actually this is a
    team effort.
  • 0:26 - 0:31
    All these people listed here have
    contributed to the project at one point.
  • 0:31 - 0:33
    The 4 of us, that's
  • 0:33 - 0:34
    Lunar − me
  • 0:34 - 0:35
    there's Dhole,
  • 0:35 - 0:36
    Chris Lamb − lamby
  • 0:36 - 0:38
    and Holger.
  • 0:39 - 0:43
    But actually, this is DebConf and so a lot
    more of us have been or are
  • 0:43 - 0:47
    currently here and so, if you want to
    thank anybody that is working on this
  • 0:47 - 0:49
    you need to actually thank all of
    these folks
  • 0:49 - 0:51
    'cause, yay.
  • 0:51 - 0:56
    [Applause]
  • 0:57 - 1:00
    [Holger] The people in blue are here.
  • 1:04 - 1:06
    [Lunar] Let's get started.
  • 1:06 - 1:08
    Quick recap on what we're talking
    about.
  • 1:08 - 1:11
    We have software, it's made from source.
  • 1:11 - 1:15
    Source is readable by humans or at least
    a good amount of humans.
  • 1:15 - 1:17
    In this room it's good.
  • 1:17 - 1:24
    Binary, readable by computer and some
    tiny fraction of humanity.
  • 1:24 - 1:30
    Going from source to binary is called
    build, or like building or compiling
  • 1:30 - 1:33
    and we're doing free software and
    free software is awesome because
  • 1:33 - 1:38
    we can actually run these binaries like
    we want
  • 1:38 - 1:44
    We can actually study the software, how
    it's been made by studying the source
  • 1:44 - 1:49
    and by studying the source we can assess
    that it does what it's supposed to do
  • 1:49 - 1:51
    and not something else that does not
  • 1:51 - 1:56
    have malware, or trojans or security bugs
  • 1:56 - 2:01
    So we have the binary that can be used,
    fine.
  • 2:01 - 2:04
    We have the source that can be verified.
  • 2:04 - 2:10
    Problem is that right now, the only way we
    know that a binary that we get…
  • 2:10 - 2:16
    We have to trust a website or a Debian
    repository that says
  • 2:16 - 2:18
    "Well, this binary has been made with this
    source"
  • 2:18 - 2:23
    But there's no way we can actually prove
    that.
  • 2:23 - 2:27
    This is actually a problem that has been
    well explained by
  • 2:27 - 2:34
    Mike Perry and Seth Schoen at the 31c3
    in Hamburg last December.
  • 2:34 - 2:41
    For example, Seth Schoen made a proof of
    concept exploit for the Linux kernel
  • 2:41 - 2:52
    that when GCC was called, the kernel would
    without modifying anything on the disk
  • 2:52 - 2:59
    when the kernel detects that GCC is going
    to read a C file, it will insert some
  • 2:59 - 3:06
    extra lines of code, and these lines of
    code can be a very bad thing
  • 3:06 - 3:09
    in the case of 31c3 talk I was just
    recalling.
  • 3:09 - 3:18
    Actually, you can even have developers
    who are in very good faith, who have
  • 3:18 - 3:21
    totally secure dev machines, or they
    thought they have,
  • 3:21 - 3:24
    who have reviewed all their source code
    for any bugs
  • 3:24 - 3:31
    and we would still get totally owned as
    soon as their computer gets compromised
  • 3:31 - 3:34
    or one of the build daemons from Debian
    gets compromised for example.
  • 3:34 - 3:41
    This is not, like, hypothetical threats
    here we're discussing
  • 3:41 - 3:46
    A couple of months after Seth and Mike's
    talk at 31c3,
  • 3:46 - 3:49
    the Intercept revealed from the Snowden
    leaks
  • 3:49 - 3:56
    that at a CIA conference in 2012, one
    of the talks that happened
  • 3:56 - 3:59
    was about a project called Strawhorse.
  • 3:59 - 4:05
    Strawhorse is about modifying Apple Xcode,
    which is the development environment
  • 4:05 - 4:09
    for Mac OS X and iOS applications
  • 4:09 - 4:11
    and well, they were modifying Xcode so
    it would produce,
  • 4:11 - 4:13
    without the developer knowing,
  • 4:13 - 4:23
    binaries with trojans, malware,
    watermarked binaries, lots of bad things.
  • 4:23 - 4:25
    So, solution:
  • 4:25 - 4:29
    enable anyone to reproduce identical
    binary packages from a given source.
  • 4:29 - 4:35
    Because if using a source, using the same
    environment,
  • 4:35 - 4:40
    multiple people on different computers, on
    different networks, at different times,
  • 4:40 - 4:43
    can all get the same thing
    from the same source
  • 4:43 - 4:45
    all the same binary, byte for byte,
  • 4:45 - 4:47
    then there's a good chance that…
  • 4:47 - 4:55
    Well, everybody could be owned,
    but let's be more joyful and say that
  • 4:55 - 4:59
    probably, if everybody gets the same
    result, there was actually no problem
  • 4:59 - 5:01
    and everybody is safe.
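
The idea of several people building the same source and checking that they all got the same bytes can be sketched as follows. This is a minimal illustration, not the project's tooling: the function names and the builder names in the usage note are made up, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a build artifact as a hex string."""
    return hashlib.sha256(data).hexdigest()


def builds_agree(artifacts: dict[str, bytes]) -> bool:
    """True when every independent builder produced byte-identical output."""
    digests = {sha256_hex(blob) for blob in artifacts.values()}
    return len(digests) == 1


def dissenting_builders(artifacts: dict[str, bytes]) -> list[str]:
    """Name the builders whose output differs from the most common digest."""
    by_builder = {name: sha256_hex(blob) for name, blob in artifacts.items()}
    majority, _count = Counter(by_builder.values()).most_common(1)[0]
    return sorted(name for name, digest in by_builder.items() if digest != majority)
```

Called with something like `{"alice": deb_a, "bob": deb_b, "carol": deb_c}`, `builds_agree` only says whether there is full agreement; `dissenting_builders` points at whose machine (or whose toolchain) deserves a closer look. Agreement does not prove that nobody was compromised, only that everybody got the same result.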
  • 5:02 - 5:04
    We call that solution
    "reproducible builds"
  • 5:07 - 5:08
    Yay.
  • 5:08 - 5:11
    [Applause]
  • 5:13 - 5:15
    Actually, it's not only about security.
  • 5:15 - 5:19
    For Debian, we have, if you're doing
    "Multi-arch: same" packages,
  • 5:19 - 5:25
    well, they must have the same bytes even if
    they are built for different architectures,
  • 5:25 - 5:28
    the files in the package.
  • 5:28 - 5:34
    Debug packages, you can create at a later
    time, if you forgot to have debug packages
  • 5:34 - 5:36
    in the first place,
  • 5:36 - 5:42
    you can pass the no-strip option later and
    because the package is reproducible,
  • 5:42 - 5:47
    you will get the debug symbols that work
    for software that has been shipped already
  • 5:47 - 5:50
    We do early detection of FTBFS that way
  • 5:50 - 5:54
    because if we try pretty quickly
    to reproduce a build,
  • 5:54 - 5:55
    then it has to work.
  • 5:55 - 5:58
    It's useful for build profiles.
  • 5:58 - 6:02
    We can get smaller .deb deltas,
  • 6:02 - 6:05
    because from one version to the next we
    might have the same content.
  • 6:05 - 6:09
    We can do validation of cross-builds,
  • 6:09 - 6:12
    Helmut Grohne can talk to you about that.
  • 6:12 - 6:17
    And also, Niels Thykier told me that
  • 6:17 - 6:21
    he was very interested in reproducible
    builds because it would enable him to
  • 6:21 - 6:24
    test debhelper better, because
  • 6:24 - 6:29
    if the package builds reproducibly,
    then he makes a change to debhelper,
  • 6:29 - 6:32
    then he can rebuild
  • 6:32 - 6:36
    the same version of a package with a newer
    debhelper and see what has changed
  • 6:36 - 6:40
    and this change can be isolated to only
    what he has worked on debhelper
  • 6:40 - 6:42
    for example.
  • 6:43 - 6:45
    And, oh my.
  • 6:45 - 6:48
    The whole world is watching us.
  • 6:48 - 6:56
    Since two years or a year and a half ago,
    everybody I meet in security conference,
  • 6:56 - 6:59
    in hacker conference, in free software
    conference is like
  • 6:59 - 7:01
    "Oh you're working on that,
    that's awesome."
  • 7:01 - 7:09
    And, I mean, I've been the one doing quite
    a lot of talks, and everybody comes to me
  • 7:09 - 7:11
    and I'm like "Wow wow, this is way bigger",
  • 7:11 - 7:16
    but we're actually leading the field here.
  • 7:16 - 7:19
    Yay Debian.
  • 7:19 - 7:26
    [Applause]
  • 7:26 - 7:29
    [Holger] So, we are not the only ones
    leading the field,
  • 7:29 - 7:33
    Bitcoin and Tor made their software
    reproducible before us,
  • 7:33 - 7:37
    Coreboot also succeeded, if you build
    Coreboot without any payload,
  • 7:37 - 7:39
    that's 100% reproducible.
  • 7:39 - 7:44
    FreeBSD has a page on their wiki since
    2013
  • 7:44 - 7:49
    saying there are 5 reproducibility issues
    in their base system.
  • 7:49 - 7:52
    We're at the moment trying to
    confirm this.
  • 7:52 - 7:57
    On jenkins.debian.net, I've also set up
    now tests for FreeBSD, NetBSD,
  • 7:57 - 7:59
    Coreboot and OpenWrt.
  • 7:59 - 8:03
    So if you go to
    reproducible.debian.net/
  • 8:03 - 8:05
    you get that tested.
  • 8:05 - 8:08
    And there's more in the pipeline.
  • 8:08 - 8:11
    There are other projects interested
    as well.
  • 8:11 - 8:15
    NetBSD also has a variable MKREPRO
    which you can set
  • 8:15 - 8:17
    and that builds reproducibly.
  • 8:17 - 8:20
    Though they think "I'm keeping some
    timestamps, it's fine, and then
  • 8:20 - 8:22
    filtering them out later".
  • 8:22 - 8:23
    We disagree.
  • 8:23 - 8:28
    So this is how Debian looks like,
    Debian Sid,
  • 8:28 - 8:30
    but this is a lie.
  • 8:30 - 8:32
    This is not the truth.
  • 8:32 - 8:34
    This is just our test setup.
  • 8:34 - 8:36
    Sid is not like this.
  • 8:36 - 8:40
    For Sid, it's all orange, there's zero
    reproducibility in Sid today.
  • 8:40 - 8:44
    But we'll talk now and in the following
    round table,
  • 8:44 - 8:47
    it's to actually make Sid reproducible.
  • 8:47 - 8:52
    The current status is
  • 8:52 - 8:58
    we're working on this in Debian since
    two years ago.
  • 8:58 - 9:02
    We have weekly reports about our project
    now since May
  • 9:02 - 9:07
    and we've given several talks, especially
    in the last year
  • 9:07 - 9:11
    and all these talks, presentation, also
    other stuff is linked in the wiki.
  • 9:11 - 9:15
    There's a page with information about
    Debian, these BSDs,
  • 9:15 - 9:19
    other Linuxes, upstream softwares
    all on this wiki.
  • 9:23 - 9:27
    Since DebConf14, which is merely
    a year ago,
  • 9:27 - 9:29
    we've made quite some changes.
  • 9:29 - 9:33
    We have introduced
    strip-nondeterminism
  • 9:33 - 9:39
    which is called by dh at the end
    of the build of the package
  • 9:39 - 9:45
    and will normalize some things
    which Chris will explain later
  • 9:45 - 9:50
    We have decided on a fixed build path
  • 9:50 - 9:54
    because the build path is leaked
    in the binaries and several things
  • 9:54 - 9:57
    We didn't find a way yet to make
    the build path arbitrary.
  • 9:57 - 10:03
    We designed a way to record the build
    environment
  • 10:03 - 10:08
    because to rebuild, you need to recreate
    the build environment.
  • 10:08 - 10:12
    We set up this Jenkins setup.
  • 10:12 - 10:17
    We wrote diffoscope which used to be
    called debbindiff
  • 10:17 - 10:21
    which shows differences between two
    packages or two directories or
  • 10:21 - 10:24
    two filesystems by now.
  • 10:24 - 10:31
    There's SOURCE_DATE_EPOCH, which is a way
    that the tools expose
  • 10:31 - 10:34
    the last modification of the source.
  • 10:34 - 10:37
    Because the build date, people want to
    include the build date
  • 10:37 - 10:39
    because they think this is a
    meaningful indication:
  • 10:39 - 10:42
    when a build was done,
    which software used.
  • 10:42 - 10:46
    But if the build always recreates
    the same results
  • 10:46 - 10:47
    the build date becomes meaningless
  • 10:47 - 10:51
    and the really interesting thing is
    the latest modification of the source.
  • 10:52 - 10:56
    We have written patches for the tools
  • 10:58 - 11:04
    [Lunar] strip-nondeterminism:
    is Andrew Ayer in the audience?
  • 11:04 - 11:06
    Yay! He did it!
  • 11:06 - 11:12
    It's written in Perl because we didn't
    want to have a new build dependency
  • 11:12 - 11:14
    in all Debian packages.
  • 11:14 - 11:18
    Basically it takes anything and tries
    to normalize it as much as it can
  • 11:18 - 11:27
    replacing timestamps or file permissions
    or removing some issues.
  • 11:27 - 11:31
    It's working very well on many formats,
    it's meant to be extensible
  • 11:31 - 11:38
    so we can actually add more things and
    it's run by dh at the end of the process, as Holger said.
  • 11:38 - 11:45
    The .buildinfo is currently a proposal
    we have not yet totally agreed
  • 11:45 - 11:49
    but we are generating them as part
    of the test we have
  • 11:49 - 11:57
    and basically it's a new control file that
    will tie the sources, the generated binary
  • 11:57 - 12:01
    the packages that were used to build this
    binary and their version.
  • 12:01 - 12:09
    The idea is that we can use this file to
    reinstall all the specific versions from snapshot
  • 12:09 - 12:17
    So we recreate the same build environment
    then we can just start the build from that source
  • 12:17 - 12:21
    that was mentioned and see if the binary
    that has been generated matches.
  • 12:23 - 12:28
    What it looks like for now, you see there is
    a source binary, the build path
  • 12:28 - 12:34
    because currently we don't have any good
    post-processing tool for buildpaths
  • 12:34 - 12:41
    in ELF and DWARF binaries, we just decided
    to specify the build path so when we do
  • 12:41 - 12:45
    a later rebuild we use that path and be safe.
  • 12:45 - 12:52
    The source is .dsc, the binary is .deb and
    a list of packages with the versions.
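
To make the shape of such a record concrete, here is a small sketch of the information it ties together. The class, the field names and the rendered layout are illustrative assumptions only; the actual .buildinfo format was still under discussion at the time of the talk and is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class BuildRecord:
    """Illustrative stand-in for the facts a .buildinfo file ties together."""

    source: str             # the .dsc that was built
    binary: str             # the .deb that came out
    binary_sha256: str      # checksum a rebuilder must match
    build_path: str         # recorded so a rebuild can use the same path
    installed: dict[str, str] = field(default_factory=dict)  # package -> version

    def render(self) -> str:
        """Serialise with packages sorted, so the record itself is deterministic."""
        lines = [
            f"Source: {self.source}",
            f"Binary: {self.binary}",
            f"Binary-Sha256: {self.binary_sha256}",
            f"Build-Path: {self.build_path}",
            "Installed-Packages:",
        ]
        lines += [
            f" {name} (= {version})"
            for name, version in sorted(self.installed.items())
        ]
        return "\n".join(lines) + "\n"
```

A rebuilder would read the package list, install exactly those versions (for Debian, from the snapshot archive), build in the recorded path, and compare the checksum of the result with the recorded one.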
  • 12:53 - 13:02
    We currently use the base-files version
    to know which Debian release is to be used
  • 13:02 - 13:04
    as the basis.
  • 13:11 - 13:18
    [Holger] The general procedure for testing is:
    we build the source, we save the results,
  • 13:18 - 13:23
    we modify the environment and we build
    it again and compare the results.
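
The procedure (build, save, vary the environment, build again, compare) can be sketched with a toy "build" that is just a Python function. Everything here is an assumption for illustration: the two environments, their values and the two example builds are invented, and the real setup runs full package builds rather than function calls.

```python
import hashlib
from typing import Callable, Mapping

Build = Callable[[Mapping[str, str]], bytes]

# Two deliberately different environments (hypothetical values).
FIRST_ENV = {"HOSTNAME": "build-a", "USER": "user1", "TZ": "Etc/GMT+12", "LANG": "C"}
SECOND_ENV = {"HOSTNAME": "build-b", "USER": "user2", "TZ": "Etc/GMT-14", "LANG": "fr_CH.UTF-8"}


def is_reproducible(build: Build) -> bool:
    """Run the build under both environments and compare the output digests."""
    first = hashlib.sha256(build(FIRST_ENV)).hexdigest()   # first build, result saved
    second = hashlib.sha256(build(SECOND_ENV)).hexdigest() # second build, varied environment
    return first == second


def leaky_build(env: Mapping[str, str]) -> bytes:
    """A build that records the hostname, so its output varies."""
    return f"built on {env['HOSTNAME']}\n".encode()


def clean_build(env: Mapping[str, str]) -> bytes:
    """A build whose output depends only on its source."""
    return b"hello world\n"
```

Here `is_reproducible(clean_build)` holds while `is_reproducible(leaky_build)` does not. The more dimensions the two environments differ in, the more kinds of leaks a single pair of builds can reveal.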
  • 13:23 - 13:32
    That started as a shell script last year which I
    put on jenkins and then it exploded a bit
  • 13:32 - 13:36
    and now we have 67 jenkins jobs running on
    7 hosts.
  • 13:36 - 13:45
    Since last week we have 4 armhf small boards
    where we will be able to test armhf,
  • 13:45 - 13:46
    but very slowly.
  • 13:46 - 13:49
    We have two new amd64 build nodes.
  • 13:49 - 13:53
    The code is now split into Python and bash
    scripts.
  • 13:53 - 13:59
    For all the other distro testing there's a
    lot of bash code now which is mostly
  • 13:59 - 14:05
    boilerplate and it's 5 lines or something
    to build FreeBSD and 5 lines to build NetBSD
  • 14:05 - 14:09
    but there's 100 lines of boilerplate around it, so it's
    really not that much code.
  • 14:09 - 14:13
    We do test Testing, Unstable and Experimental.
  • 14:13 - 14:16
    For arm we only start with Unstable.
  • 14:16 - 14:22
    We do like hardware so if you have hardware
    to donate to us, that would be great,
  • 14:22 - 14:25
    we need ssh and then root basically.
  • 14:27 - 14:34
    We are testing Coreboot, OpenWrt and the
    BSD's, soon I will also set up a Fedora test
  • 14:34 - 14:40
    I don't want to test all the 20,000 Fedora
    packages but just 200 or something:
  • 14:40 - 14:44
    the base system of Fedora to examine how
    rpm works
  • 14:44 - 14:48
    to get really the whole Free Software world
    reproducible.
  • 14:48 - 14:53
    This is all run on ProfitBricks hardware
    since 2012, so thanks to ProfitBricks.
  • 14:57 - 15:00
    This is the variations we do for Debian.
  • 15:02 - 15:07
    It's the hostname, username, timezone,
    locale.
  • 15:07 - 15:14
    Chris will explain what modifications
    this causes, variances...
  • 15:14 - 15:19
    We are not testing at the moment differences
    in date so the date is always the same
  • 15:19 - 15:20
    the time is a bit different.
  • 15:20 - 15:26
    [Lunar] Well almost! Because we cheat with
    the timezone, we use one timezone that is
  • 15:26 - 15:32
    GMT-14 and then GMT+12 so it's more than
    24 hours apart.
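
The timezone trick is easy to check with a little arithmetic. The sketch below uses fixed UTC offsets instead of named zones, on the assumption that the zone names follow the POSIX convention, where the sign is inverted ("GMT-14" is 14 hours ahead of UTC, "GMT+12" is 12 hours behind). The example instant is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# POSIX-style names invert the sign: "GMT-14" means UTC+14, "GMT+12" means UTC-12.
AHEAD = timezone(timedelta(hours=14))
BEHIND = timezone(timedelta(hours=-12))

# One single instant in time, chosen arbitrarily for the example.
instant = datetime(2015, 8, 15, 11, 0, tzinfo=timezone.utc)

# The same instant, seen as a calendar date by each of the two builds.
date_ahead = instant.astimezone(AHEAD).date()
date_behind = instant.astimezone(BEHIND).date()

# Distance between the two wall clocks, in hours.
spread_hours = (AHEAD.utcoffset(None) - BEHIND.utcoffset(None)) / timedelta(hours=1)
```

Because the two wall clocks are 26 hours apart, two builds started at the same moment never see the same local date, so anything that embeds a local date shows up as a difference even though the system date itself is not varied.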
  • 15:33 - 15:36
    [Holger] On the first of the month we
    sometimes find new bugs where there's
  • 15:36 - 15:38
    packages which record the month.
  • 15:41 - 15:44
    We don't have variations of the CPU type
    at the moment.
  • 15:46 - 15:51
    Both time and CPU type variations, we'll
    have them about one or two weeks
  • 15:51 - 15:54
    the nodes are being prepared at the moment.
  • 15:54 - 16:01
    Then we will test all the meaningful
    variations we could think of.
  • 16:01 - 16:05
    There will be probably some packages which
    build differently according to the number of
  • 16:05 - 16:11
    CD drives attached or whatever
    things, but those will be found by you.
  • 16:12 - 16:17
    [Lunar] We are doing all these tests because
    we want that when you rebuild a package on
  • 16:17 - 16:22
    your machine, even if anything is different from
    the build daemons in Debian, you get
  • 16:22 - 16:23
    the same results.
  • 16:23 - 16:30
    We use this to detect these problems early,
    before you actually get a false positive that we have
  • 16:30 - 16:34
    to investigate when someone rebuilds a
    package on their machine.
  • 16:37 - 16:43
    To understand the difference that we found
    from one build to the other.
  • 16:43 - 16:51
    It started also as a 10-line shell script
    and then it felt okayish
  • 16:51 - 16:52
    and so Python!
  • 16:52 - 16:58
    And now it's a lot of code and it actually
    grew way beyond a Debian package.
  • 16:58 - 17:03
    We changed the name, it was called debbindiff
    but it's absolutely not tied to Debian anymore.
  • 17:03 - 17:07
    It's called diffoscope, thanks to Jocelyn
    for the name.
  • 17:07 - 17:12
    Basically what it does: it tries to get to
    the bottom of what is different between
  • 17:12 - 17:14
    two archives or directories.
  • 17:14 - 17:22
    Because it's not useful to compare bytes that
    are compressed by gzip or xz, that will not
  • 17:22 - 17:27
    lead you to understand what is different
    you need to uncompress and look at
  • 17:27 - 17:33
    uncompressed data, and if the thing actually
    compressed is a tarball, you might actually
  • 17:33 - 17:35
    want to compare the files inside the tarball.
  • 17:35 - 17:42
    If there is a PDF inside this archive, you
    don't want to compare the bytes of the PDF
  • 17:42 - 17:44
    you want to compare the text of the PDF.
  • 17:44 - 17:50
    So this is basically what diffoscope does,
    it tries to transform anything that is
  • 17:50 - 17:57
    a container and compare things in this
    container and if they can be transformed into
  • 17:57 - 18:01
    a human readable form it will try to do
    that, and compare these human readable forms.
  • 18:01 - 18:05
    And if it doesn't find any difference but
    there are still differences in the binary,
  • 18:05 - 18:07
    it will fall back to binary comparison.
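
The "unpack containers, compare readable forms, fall back to binary" strategy can be sketched in a few lines. This is not diffoscope's code or API: it is a toy that only knows gzip streams, uncompressed tar archives and UTF-8 text, and the function names are invented for the illustration.

```python
import difflib
import gzip
import io
import tarfile


def _tar_members(blob: bytes) -> dict[str, bytes]:
    """Map regular-file member names of an uncompressed tar to their contents."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def compare(a: bytes, b: bytes, name: str = "input") -> list[str]:
    """Return human-oriented difference lines; an empty list means identical."""
    if a == b:
        return []
    # Container: gzip. Compare what is inside, not the compressed bytes.
    if a[:2] == b[:2] == b"\x1f\x8b":
        inner = compare(gzip.decompress(a), gzip.decompress(b), name + " (gunzipped)")
        return inner or [f"{name}: gzip metadata differs, content identical"]
    # Container: tar. Compare member by member, recursively.
    try:
        files_a, files_b = _tar_members(a), _tar_members(b)
    except tarfile.TarError:
        pass  # not a tar archive, keep going
    else:
        report = []
        for member in sorted(files_a.keys() | files_b.keys()):
            if member not in files_a or member not in files_b:
                report.append(f"{name}/{member}: only in one archive")
            else:
                report += compare(files_a[member], files_b[member], f"{name}/{member}")
        return report or [f"{name}: tar metadata differs, file contents identical"]
    # Human-readable form: text gets a line-based diff.
    try:
        text_a, text_b = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError:
        return [f"{name}: binary content differs"]  # last resort
    return list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), name, name, lineterm=""))
```

Two gzip files that wrap the same text but carry different header timestamps come back as "gzip metadata differs, content identical", which is exactly the kind of answer a byte-level diff of the compressed files cannot give.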
  • 18:08 - 18:13
    Try it, extend it; it's Python, it's modular,
    it's great.
  • 18:13 - 18:23
    It already supports squashfs, ISO, rpm,
    gettext .mo files and so many different things.
  • 18:23 - 18:30
    You can have HTML output like that,
    so this is what is displayed on many
  • 18:30 - 18:34
    examples we've shown so far, and also
    to make it easier for copy paste
  • 18:34 - 18:38
    and post processing we have the text output.
  • 18:38 - 18:43
    You can also use it to review packages before
    uploading them to Debian.
  • 18:43 - 18:49
    It does fuzzy matching, so even if the
    directory is different in the archive it will
  • 18:49 - 18:52
    find it like git does.
  • 18:52 - 18:59
    It has grown way beyond just reproducible
    builds. A useful tool.
  • 19:01 - 19:07
    [Dhole] In order to solve timestamp issues, we are
    proposing the SOURCE_DATE_EPOCH variable.
  • 19:07 - 19:12
    This is because most of the times having
    the build date embedded in a package
  • 19:12 - 19:16
    is not useful for the user, because you could
    take a really old package and build it today
  • 19:16 - 19:19
    and that date would not be useful.
  • 19:19 - 19:26
    We are standardizing a replacement for build
    dates so that tools can use it.
  • 19:26 - 19:32
    When this value is set, the tool instead of
    embedding the current date, it will embed
  • 19:32 - 19:38
    the date taken from SOURCE_DATE_EPOCH which
    will contain a Unix epoch timestamp.
  • 19:38 - 19:43
    This is a general solution we are trying to
    standardize so that not only Debian uses it,
  • 19:43 - 19:48
    but other Free Software projects and
    distributions and in the case of Debian,
  • 19:48 - 19:52
    we set this variable to the latest Debian
    changelog entry timestamp.
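
From a tool's point of view, honouring the variable takes only a few lines. The sketch below is an assumption about how a documentation generator might do it, not code from any of the patched packages; the function names are invented, and the fallback to the current time reflects the behaviour described above for when the variable is unset.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ) -> int:
    """Use SOURCE_DATE_EPOCH (seconds since the Unix epoch) when it is set."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())  # old behaviour: embed the current time


def build_date_string(environ=os.environ) -> str:
    """Format the timestamp in UTC so the local timezone cannot leak in."""
    stamp = datetime.fromtimestamp(build_timestamp(environ), tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d")
```

Formatting in UTC matters as much as reading the variable: a tool that takes the fixed timestamp but renders it in local time would still produce different output in different timezones.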
  • 19:55 - 20:01
    We have already been sending patches to
    different packages, mostly it's documentation
  • 20:01 - 20:06
    generation. So here's a list of bugs that
    we have opened which have been closed
  • 20:06 - 20:12
    and merged; so it's help2man, epydoc,
    ghostscript, texi2html and sphinx.
  • 20:12 - 20:19
    We are both sending these patches to Debian
    and upstream so all the distributions can
  • 20:19 - 20:28
    use them, and we have also been sending
    patches to other packages which are still
  • 20:28 - 20:32
    open, so we encourage you to take a look
    at these packages if you are the maintainer
  • 20:32 - 20:35
    and merge the patch.
  • 20:36 - 20:41
    [Lunar] Thanks to Daniel Kahn Gillmor and
    Ximin Luo for pushing this proposal forward.
  • 20:41 - 20:46
    And also lots of these patches have been
    written by Akira and Dhole as part of their
  • 20:46 - 20:49
    Google Summer of Code, and you work really
    great.
  • 20:52 - 20:57
    [Applause]
  • 21:03 - 21:08
    [Dhole] The gcc patch is: gcc uses two
    macros which are __DATE__ and __TIME__
  • 21:08 - 21:14
    which embed the timestamp and I wrote a
    patch so that if SOURCE_DATE_EPOCH is set
  • 21:14 - 21:19
    instead of adding the current time, it takes
    the time from that variable.
  • 21:19 - 21:26
    I sent this patch to gcc, it's still there
    forgotten with many other patches
  • 21:26 - 21:29
    but hopefully at some point they will
    realize that this is interesting and they
  • 21:29 - 21:30
    will merge it.
  • 21:39 - 21:46
    [Lamby] Hey. Let's very quickly run you
    through some really simple ways
  • 21:46 - 21:50
    of fixing packages. The details don't
    necessarily matter, it's just to give you an idea
  • 21:50 - 21:56
    of what needs to be changed and basically
    to point out that it's not rocket science.
  • 21:56 - 21:58
    So you can just come in and jump in.
  • 21:58 - 22:08
    For example gzip, it's a very old tool
    and they decided to add timestamps when
  • 22:08 - 22:12
    you generate it, but it's an easy fix, you
    just add the -n flag.
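
The same fix exists for code that compresses through a library instead of calling the command-line tool. As an illustration in Python (the helper name is invented), the standard `gzip` module lets you pin the header timestamp and leave out the file name, which is roughly what the `-n` flag does:

```python
import gzip
import io


def gzip_deterministic(data: bytes) -> bytes:
    """Compress data without recording a file name or a timestamp in the header."""
    buffer = io.BytesIO()
    # filename="" keeps the original name out of the header,
    # mtime=0 zeroes the 4-byte timestamp field.
    with gzip.GzipFile(filename="", fileobj=buffer, mode="wb", mtime=0) as stream:
        stream.write(data)
    return buffer.getvalue()
```

Bytes 4 to 7 of a gzip stream hold the modification time, so with this helper they are always zero and two runs over the same input give identical output.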
  • 22:12 - 22:20
    Some other things easy to change: some
    Python stuff had tag_date=True, which
  • 22:20 - 22:25
    I don't know if you can see it but adds a
    timestamp to eggs. You just change it to
  • 22:25 - 22:26
    False to get rid of it.
  • 22:26 - 22:34
    Static libraries, they are just ar archives
    so the same format as .deb, and you
  • 22:34 - 22:38
    can just use binutils or strip-nondeterminism
    tool.
  • 22:38 - 22:44
    PNG has timestamps for some reason, you can
    get rid of them, that's ImageMagick and it's
  • 22:44 - 22:49
    a bit ugly, but also strip-nondeterminism
    gets rid of it.
  • 22:49 - 22:55
    Tarballs are quite interesting, they will
    by default capture user and group
  • 22:55 - 22:58
    you just pass --owner=root bla bla bla...
  • 22:58 - 23:05
    Ordering, this is interesting as well, it
    will usually use file system ordering
  • 23:05 - 23:11
    which is completely non-deterministic. So
    you need to sort with LC_ALL=C.
  • 23:15 - 23:19
    [Lunar] Think about the locale! Because
    sorting order varies from one locale to the next.
  • 23:23 - 23:28
    [Lamby] They also take timestamps, again
    you can set --mtime or you can muck around
  • 23:28 - 23:31
    with find/xargs/touch bla bla...
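
The three tarball fixes (fixed owner, sorted members, fixed mtime) can be put together in one sketch. It uses Python's `tarfile` module rather than the `tar` command line, purely as an illustration; the function name and the choice of mode `0o644` are assumptions.

```python
import io
import tarfile


def deterministic_tar(files: dict[str, bytes], mtime: int) -> bytes:
    """Build a tar archive whose bytes depend only on names, contents and mtime."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Python sorts str by code point, independent of the locale,
        # which plays the role of sorting with LC_ALL=C.
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime                 # same effect as a fixed --mtime
            info.mode = 0o644
            info.uid = info.gid = 0            # do not capture the building user
            info.uname = info.gname = "root"
            tar.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()
```

A natural value for `mtime` is the source's last modification time, for example the timestamp of the latest changelog entry, so that the archive still carries a meaningful date.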
  • 23:31 - 23:37
    Lots of other files have timestamps: Erlang
    files for no reason, even upstream don't
  • 23:37 - 23:40
    know why they added a timestamp.
  • 23:42 - 23:49
    We have now a patch for SOURCE_DATE_EPOCH,
    which I think landed a couple days ago.
  • 23:50 - 23:57
    Here's an interesting one, not necessarily
    the current build timestamp, so this is a
  • 23:57 - 24:05
    timezone dependent date which Ruby loads
    and then saves incorrectly as your local time.
  • 24:05 - 24:07
    This gets mangled, so that's patching.
  • 24:07 - 24:15
    I'm going from changing individual packages
    to more toolchain things as you can see.
  • 24:15 - 24:21
    Upstream configure scripts, you can maybe
    see the top that it just uses hostname
  • 24:21 - 24:26
    for no reason. Sometimes you can override
    it in debian/rules just by exporting something
  • 24:26 - 24:32
    or passing a variable to dh_auto_build or
    whatever. That's just a little bit more
  • 24:32 - 24:34
    involved, you have to look at it more
    carefully.
  • 24:34 - 24:40
    Perl hash order, a lot of Perl uses
    Data::Dumper to just output a bunch of stuff which
  • 24:40 - 24:47
    is just not deterministic. So often just
    setting Sortkeys, but sometimes it's
  • 24:47 - 24:48
    a completely different solution.
  • 24:48 - 24:53
    Header files, so you can maybe see that
    they are using the timestamp essentially
  • 24:53 - 24:59
    as a unique identifier, you probably have
    to start rewriting these into something saner
  • 24:59 - 25:04
    because this is a wrong use of timestamp
    anyway.
  • 25:04 - 25:13
    More Makefiles, the deeper they timestamp
    in the upstream package the more you have
  • 25:13 - 25:15
    to start patching, so these kind of start
    sucking a little.
  • 25:15 - 25:21
    We've made a lot of toolchain changes, some
    already mentioned, some of them already
  • 25:21 - 25:25
    merged, see more in this link. Again,
    details don't matter, just check it out
  • 25:25 - 25:30
    it isn't crazy, it's just working out
    what's different.
  • 25:30 - 25:35
    In terms of the work done we've sent these
    many patches: two patches a day,
  • 25:35 - 25:38
    which is not too bad, on average.
  • 25:40 - 25:46
    [Applause]
  • 25:48 - 25:51
    [Holger] I can't clap because I sent three
    or something like that
  • 25:53 - 25:54
    [Lamby] Holger does three per day.
  • 25:55 - 26:00
    And this doesn't count other bugs we found
    in the process of building packages, like
  • 26:00 - 26:01
    fail to build.
  • 26:01 - 26:08
    This is blue the ones that are open and
    orange are done.
  • 26:08 - 26:14
    You can see that someone went a bit crazy
    in February filing bugs and eventually they
  • 26:14 - 26:17
    were being fixed; slowly.
  • 26:18 - 26:24
    [Holger] And actually we filed more bugs
    because the fail to build from source bugs
  • 26:24 - 26:29
    are excluded, I think we filed 300 FTBFS
    in the last two or three months.
  • 26:31 - 26:34
    [Lamby] And those include fail to build
    because of reproducibility things as well
  • 26:34 - 26:36
    but we haven't split them up.
  • 26:40 - 26:47
    [Lunar] What's left to be done because
    Holger said "the graph is a lie".
  • 26:47 - 26:58
    The main thing that is blocking a lot of
    work is dpkg. Right now the output of dpkg
  • 26:58 - 27:09
    will be not deterministic 100% of the time,
    because of timestamps and at least the
  • 27:09 - 27:15
    file ordering. We also have a patch that
    creates these .buildinfo files that we've
  • 27:15 - 27:22
    shown that works. It's not submitted yet
    to dpkg because we need to agree on the
  • 27:22 - 27:27
    format. At least we have ftpmaster or
    maybe dpkg, well we have a lot of people
  • 27:27 - 27:30
    and that's what we are going to do the
    next hour.
  • 27:30 - 27:39
    Debhelper also has a few changes; the make
    mtimes, debhelper might also not be the
  • 27:39 - 27:43
    best place, maybe we want that in dpkg.
  • 27:43 - 27:48
    I've been trying to put patches in tar so
    we can make it easier. It's complicated to
  • 27:48 - 27:54
    see where's the best place but so far we've
    been doing our tests with this frame and it works.
  • 27:54 - 28:00
    [Holger] In our repository we have these
    packages with these bugs fixed so when
  • 28:00 - 28:04
    you want to test reproducibility issues on
    your own machine you need to use the
  • 28:04 - 28:07
    repository which has these patches applied
    at the moment.
  • 28:07 - 28:10
    In pure sid you cannot create reproducible
    packages.
  • 28:10 - 28:18
    [Lunar] I heard that the SOURCE_DATE_EPOCH
    patch is in git already, so it's going to happen.
  • 28:18 - 28:27
    cdbs also needed to export SOURCE_DATE_EPOCH
    and we are starting to do more infrastructure
  • 28:27 - 28:34
    work: Josch mainly and Akira on sbuild,
    because we wanted to have this
  • 28:34 - 28:40
    srebuild script, where you give it a
    buildinfo and it will do the rebuild and
  • 28:40 - 28:47
    it needs changes in build daemon for the
    build path and also a couple of changes in
  • 28:47 - 28:49
    sbuild itself.
  • 28:49 - 28:53
    [Holger] And the script is not ready yet,
    this "Finish" means it uses our repository
  • 28:53 - 28:57
    at the moment, we need to change it to only
    use Sid and snapshot.
  • 28:57 - 29:02
    [Lunar] So there is the buildd issue that
    we need to discuss
  • 29:02 - 29:09
    and we also need to see how we could include
    or not, or somewhere give this buildinfo
  • 29:09 - 29:13
    control file to the world so they can
    rebuild the packages, so it's not yet
  • 29:13 - 29:14
    clear where's the best place to store
    them.
  • 29:14 - 29:21
    Because adding 22,000 files, some
    people get cranky of this idea.
  • 29:21 - 29:26
    [Holger] It's more than 22,000 files, it's
    22,000 source packages multiplied by
  • 29:26 - 29:30
    10 architectures; but there's a lot of
    arch builds so that's probably 100,000
  • 29:30 - 29:38
    buildinfo files, multiplied by Stretch and
    Sid, so it's 200,000 files or more on
  • 29:38 - 29:40
    the file servers and on the mirrors we
    would like to have it.
  • 29:40 - 29:44
    That's the same amount of files which are
    currently there. The mirror operators are
  • 29:44 - 29:49
    currently not happy, they will not take it,
    so our current idea is just concatenate
  • 29:49 - 29:55
    all these files into one file that's 140 MB
    uncompressed, 40 MB compressed.
  • 29:55 - 29:56
    That's easier to handle.
  • 29:56 - 30:00
    And then probably have a service
    buildinfo.debian.org where you can
  • 30:00 - 30:03
    download individual buildinfo files if you
    need them.
  • 30:04 - 30:10
    [Lunar] And so when we will be done with
    all that we can maybe add a final patch
  • 30:10 - 30:16
    it would be to Debian policy, mandating
    Debian packages be reproducible.
  • 30:20 - 30:23
    [Applause]
  • 30:24 - 30:31
    I can say again that the dream of mine is
    that we would stop uploading .deb when
  • 30:31 - 30:38
    we upload a package, but instead just upload
    the hash of the binary, have the buildd
  • 30:38 - 30:43
    build again this package and only if these
    two match they can enter the archive.
  • 30:43 - 30:48
    So we are sure that at least the two
    machines, the developer machine and the
  • 30:48 - 30:51
    build daemon agree that they've built the
    same thing.
  • 30:51 - 30:55
    [Applause]
  • 30:58 - 31:03
    [Holger] I share this dream but I think
    having this in policy as a "must" requirement is
  • 31:03 - 31:16
    sadly something only for Stretch + 1, but
    I'm curious if we had fixed dpkg and
  • 31:16 - 31:22
    debhelper now, would you think we should
    upgrade all these wishlist bugs to important now?
  • 31:23 - 31:26
    [Audience] Yes!
  • 31:31 - 31:34
    [Holger] We'll talk about this later soon.
  • 31:34 - 31:37
    [Lunar] But before that we actually have
    work to do.
  • 31:40 - 31:44
    [Dhole] In order to fix your package, the
    first thing you can do is go to
  • 31:44 - 31:51
    reproducible.debian.net/package, and you
    can use the web interface where you can see
  • 31:51 - 31:56
    notes on the package, we have tags to
    identify different issues that make packages
  • 31:56 - 31:59
    not reproducible, with links to the wiki
    about how to solve them.
  • 32:05 - 32:09
    [Holger] When you see this, you want to
    click on this debbindiff link.
  • 32:09 - 32:12
    It's still called debbindiff not diffoscope,
    this will show all the differences,
  • 32:12 - 32:17
    if there is a note. If the package is
    unreproducible and there's no note
  • 32:17 - 32:21
    it will automatically display the
    debbindiff, and if your package is fine
  • 32:21 - 32:23
    there's a sun here.
  • 32:30 - 32:34
    [Dhole] You can also see an entry in the
    tracker, stating if your package is
  • 32:34 - 32:35
    reproducible or not.
  • 32:39 - 32:46
    You can also find information in DDPO and
    DMD. You can find tips on the wiki it's
  • 32:46 - 32:54
    ReproducibleBuilds wiki, we are working on
    a Howto to have detailed steps on different
  • 32:54 - 33:01
    issues and how to solve them. Lunar gave
    a talk at CCCamp where there's many issues
  • 33:01 - 33:05
    really well explained and the solutions for
    them.
  • 33:05 - 33:11
    You can also come to our irc channel which
    is #debian-reproducible and ask for help
  • 33:11 - 33:13
    or go to the mailing-list.
  • 33:14 - 33:21
    In order to test locally if your package is
    reproducible right now we are using a
  • 33:21 - 33:29
    script that uses pbuilder in a custom
    configuration, you need to set up our
  • 33:29 - 33:35
    reproducible repository. In the Howto in
    the wiki there's the steps on how to set up
  • 33:35 - 33:39
    the chroot and everything, it's documented
    in the wiki.
  • 33:39 - 33:44
    Diffoscope is in unstable and today it's
    going in Stretch.
  • 33:44 - 33:54
    We plan to add these scripts to rebuild
    packages in different settings in devscripts
  • 33:54 - 34:04
    once dpkg is good, and we welcome you
    tomorrow to the hacking session from
  • 34:04 - 34:07
    2 to 7 in Stockholm room.
  • 34:10 - 34:15
    [Lunar] That's for fixing your packages,
    please do that. If you want to have even
  • 34:15 - 34:19
    more fun than testing your own package, join
    us!
  • 34:19 - 34:25
    This is the past year of my life, it has
    been awesome because the team has been
  • 34:25 - 34:32
    so great, it's been friendly atmosphere, lots of
    new understanding so many things you didn't
  • 34:32 - 34:39
    want to learn about that you had to learn
    about, and basically it feels very good to
  • 34:39 - 34:46
    be part of this actual changing the world
    thing. It's just software but it has some
  • 34:46 - 34:51
    profound effect. I've been told that the
    work we've been doing is being tossed
  • 34:51 - 34:58
    around in Cisco and Google and Facebook;
    all these big dot com companies bla bla,
  • 34:58 - 35:02
    they actually want to do that as well even
    though they are not doing Free Software,
  • 35:02 - 35:03
    which I find weird, but whatever.
  • 35:03 - 35:10
    So what do we do? We review packages, we
    have these notes when we actually try to
  • 35:10 - 35:13
    identify, so when the maintainer comes
    they don't have to think too much about
  • 35:13 - 35:19
    the problem and just fix it. We try to
    identify common trends so when many
  • 35:19 - 35:24
    packages have the same problem we make an
    entry and explain and maybe think about fixes
  • 35:24 - 35:27
    that could apply to the whole archive.
  • 35:27 - 35:34
    We work on this reproducible.debian.net
    jenkins setup, the scripts.
  • 35:34 - 35:41
    We hack on the diffoscope tool, we make
    strip-nondeterminism better, we propose
  • 35:41 - 35:45
    changes for the toolchains when there are
    needs, some need a lot of patches,
  • 35:45 - 35:59
    most of the bugs we have reported on
    individual packages have patches.
  • 36:01 - 36:04
    [Holger] Bugs have patches
    [Lunar] Yes!
  • 36:04 - 36:09
    And also we are actually writing some more
    general documentation from the
  • 36:09 - 36:15
    understanding of these things we have been
    having, we are preparing a reproducible
  • 36:15 - 36:22
    builds Howto to explain to the Free Software
    world how they can do it so it's about some
  • 36:22 - 36:27
    of what Chris explained but also more
    general consideration on what if you're
  • 36:27 - 36:29
    not Debian and you want your thing
    reproducible when you distribute as an
  • 36:29 - 36:36
    independent vendor. So we want to work on
    reference documentation so the whole world
  • 36:36 - 36:37
    can actually do that.
  • 36:39 - 36:43
    We do a lot of talks as you've seen and
    it's been fun, and with all these
  • 36:43 - 36:49
    presentations we've made so far it's all
    in git. And everybody is free to take one
  • 36:49 - 36:53
    of these slide decks and run with it
    somewhere, translate it...
  • 36:57 - 36:59
    Questions?
  • 37:01 - 37:04
    We have to run with the microphone, because
    there's no mic anymore.
  • 37:14 - 37:17
    [Question] I just wanted to make two quick
    comments: so first of all diffoscope is
  • 37:17 - 37:22
    really awesome, not only for reproducibility
    but also for example if you change your
  • 37:22 - 37:27
    debian/rules in some way and want to see if
    the package is the same afterwards because
  • 37:27 - 37:32
    you just cleaned up a bit, that's really
    awesome for that, so thank you.
  • 37:32 - 37:37
    And also I think the work you're doing now
    is something that in 20 years time we're
  • 37:37 - 37:41
    going to look back towards it and think,
    well, of course builds should be
  • 37:41 - 37:44
    reproducible, so thank you very much for
    your work!
  • 37:45 - 37:49
    [Applause]
  • 37:52 - 38:03
    [Question] When reproducibility becomes
    part of the Debian policy, will there be a
  • 38:03 - 38:06
    lintian --reproducible?
  • 38:09 - 38:12
    [Holger] I don't think lintian can detect
    that because lintian works on the source
  • 38:12 - 38:15
    package and you need to build the package
    for this.
  • 38:16 - 38:21
    [Lamby] Things that could be detected by
    lintian from a static analysis point of view,
  • 38:21 - 38:26
    yeah I'm sure, like looking for gzip
    without -n for example, but that wouldn't
  • 38:26 - 38:29
    be conclusive from lintian point of view.
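Why gzip without -n matters can be seen from the gzip header, which stores a modification time in bytes 4 to 8. The following is a minimal sketch using Python's standard gzip module rather than the gzip command line tool; the helper name is made up for illustration, and passing mtime=0 plays the role that -n plays for the timestamp.

```python
import gzip
import io


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data in memory, writing the given mtime into the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"same input every time\n"

# Two builds at different times: same payload, different archive bytes.
early = gzip_bytes(payload, mtime=1)
late = gzip_bytes(payload, mtime=2)
print(early == late)

# Pinning the header timestamp makes the output identical across builds.
fixed_a = gzip_bytes(payload, mtime=0)
fixed_b = gzip_bytes(payload, mtime=0)
print(fixed_a == fixed_b)

# The header timestamp is a little-endian integer in bytes 4..8.
print(int.from_bytes(fixed_a[4:8], "little"))
```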
  • 38:29 - 38:33
    [Lunar] One thing that I really wanted to
    add to diffoscope at some point - the code is made
  • 38:33 - 38:38
    the way that it's possible - it's to have
    hints so when it actually looks up
  • 38:38 - 38:44
    differences between two packages then you
    can have an idea, suggest you: hey you need
  • 38:44 - 38:50
    to remove that timestamp, or you should
    sort these keys. It's not done yet, but if
  • 38:50 - 38:53
    anybody wants to do patches it's totally
    doable.
  • 38:58 - 39:04
    [Question] Thank you for the work, have
    you thought about reproducible images?
  • 39:05 - 39:06
    [Holger] It's on the todo list.
  • 39:08 - 39:15
    Before images we need reproducible package
    installation, and then we need reproducible
  • 39:15 - 39:20
    images like squashfs has some things which
    are not reproducible, but the package
  • 39:20 - 39:23
    installation is not reproducible at the
    moment because apt installs packages in
  • 39:23 - 39:28
    arbitrary order and then the post-inst
    create for example users which get
  • 39:28 - 39:33
    user-ids in the order the packages are
    installed, so for that to fix either apt
  • 39:33 - 39:39
    needs a way to install in a deterministic
    order, but it's on the todo list file.
  • 39:40 - 39:44
    [Lunar] Pabs started a wiki page a couple
    of months ago that is called reproducible
  • 39:44 - 39:50
    install. This is very important if we want
    tools like Tails to actually be reproducible
  • 39:50 - 39:55
    so some people will work on that, we do
    want to work on that.
  • 39:55 - 39:59
    [Lamby] It's quite a deep problem for
    example d-i will install different stuff
  • 39:59 - 40:02
    depending on your hardware, so that's
    immediately not reproducible.
  • 40:02 - 40:05
    It'd be great.
  • 40:07 - 40:10
    [Question] I've been working on a couple
    of my packages to get them reproducible
  • 40:10 - 40:16
    build, but I was often wondering if I
    should fix it in my package or actually
  • 40:16 - 40:23
    that it should be fixed higher up and I
    guess I've been adding some fixes to my
  • 40:23 - 40:28
    packages which may in the future even not
    be needed anymore and then it's just
  • 40:28 - 40:31
    unnecessary code as well.
  • 40:31 - 40:36
    So how do you see where things should be
    fixed and how should we as package
  • 40:36 - 40:38
    maintainers go about with this?
  • 40:38 - 40:44
    [Holger] There's many things which there's
    the easy fix to whatever: set the timezone in
  • 40:44 - 40:51
    debhelper or better in dpkg to UTC, but
    that will not fix the upstream bugs, so
  • 40:51 - 40:57
    actually it's better not to fix, set the
    timezone or other things deterministically
  • 40:57 - 41:01
    in these tools but rather have them fixed
    upstream, that's what we want.
  • 41:01 - 41:07
    Some things we will need to fix them in
    dpkg to get a meaningful result but
  • 41:07 - 41:12
    basically we want rather these distributions
    which just build from source which don't have
  • 41:12 - 41:15
    debian/rules and they just build with
    upstream Makefiles, we want the fixes
  • 41:15 - 41:17
    to land there.
  • 41:18 - 41:22
    [Lunar] We've been experimenting for two
    and this is a lot of trials and errors,
  • 41:22 - 41:26
    trying something, see how it fails, or
    maybe we can do better than that and
  • 41:26 - 41:30
    changing. And I know this can be frustrating
    at some point because you do changes
  • 41:30 - 41:36
    and they all become unneeded, but in the
    end this is how we make stuff that matters.
  • 41:36 - 41:41
    And we move forward, it's not because we're
    trying to make the big picture at once,
  • 41:41 - 41:46
    and I know in Debian we sometimes try to do
    that, so we experiment and learn from it.
  • 41:47 - 41:52
    [Question] An example that I'm now looking
    into is actually the documentation is built
  • 41:52 - 41:58
    for this package by looking in all the files
    and generating but, for instance the
  • 41:58 - 42:06
    index file is sorted, but I guess upstream
    would say: well, if you set some ordering
  • 42:06 - 42:11
    in your LC parameters you want this page
    to be ordered as you want, instead of forcing
  • 42:11 - 42:16
    it in the sort, so I'm really wondering:
    should I now upstream this or should
  • 42:16 - 42:19
    I just fix it in my rules because that's
    the logical place?
  • 42:21 - 42:29
    [Lunar] Both. No, there's no good answer,
    I'm quite a strong proponent of the idea
  • 42:29 - 42:35
    that if you use a computer you should be
    able to talk and have the computer talk to
  • 42:35 - 42:41
    you in the language that you choose, so if
    people want to have gcc error messages
  • 42:41 - 42:45
    in German, they should have it.
  • 42:45 - 42:51
    But local sorting, this is the kind of
    LC_ALL that can be very local and that
  • 42:51 - 42:54
    you can do for just one tool, it's fine to
    do that.
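Setting LC_ALL for just one tool can be sketched as follows. The example assumes a POSIX `sort` program is available on the PATH; the helper name is made up, and only the environment of that single child process is changed, so the rest of the build keeps the user's locale.

```python
import os
import subprocess


def sort_in_c_locale(lines):
    """Sort lines with the external `sort` tool, forcing LC_ALL=C for this call only."""
    env = dict(os.environ, LC_ALL="C")  # copy of the environment, not a global change
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# In the C locale the order is by byte value: uppercase before lowercase.
print(sort_in_c_locale(["b", "A", "a", "B"]))
```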
  • 42:56 - 42:59
    [Question] Do you have ideas on making
    sources reproducible? Like upstreams
  • 42:59 - 43:04
    calling make dist, or this infamous
    autogen.sh files?
  • 43:07 - 43:12
    [Lunar] I don't think that anybody in the
    team has looked into that yet, source
  • 43:12 - 43:23
    files are easy to analyze way more than
    binary packages so, it would still be great
  • 43:23 - 43:30
    to have easier ways; you have source
    tarballs be byte for byte identical,
  • 43:30 - 43:37
    but it's not as much of an issue as it is for
    binaries. If people want to look in that
  • 43:37 - 43:39
    they should.
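What byte for byte identical source tarballs would require can be sketched with Python's tarfile module. This is an illustration under assumptions, not what `make dist`, git archive or pristine-tar do: the function name and file contents are hypothetical, and the idea is simply to fix member order, timestamps and ownership so that they no longer depend on the machine that creates the archive.

```python
import io
import tarfile


def deterministic_tar(files: dict, mtime: int = 0) -> bytes:
    """Build an uncompressed tar archive whose bytes depend only on names and contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # fixed timestamp
            info.mode = 0o644  # fixed permissions
            info.uid = info.gid = 0  # fixed ownership
            info.uname = info.gname = "root"
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Same files, listed in a different order, as two machines might see them.
first = deterministic_tar({"README": b"hello\n", "Makefile": b"all:\n"})
second = deterministic_tar({"Makefile": b"all:\n", "README": b"hello\n"})
print(first == second)
```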
  • 43:44 - 43:49
    [Question] Do you know a way to make git
    archive build something reproducible?
  • 43:50 - 43:52
    [Lunar] Well pristine-tar
  • 43:52 - 43:53
    [Question] Yes, but without it.
  • 43:54 - 43:58
    [Holger] There's one tool. You want to use
    a new one? Then write it.
  • 44:02 - 44:05
    Why not use that tool which does the job?
  • 44:06 - 44:08
    pristine-tar does it.
  • 44:11 - 44:17
    [Lunar] This is for source and so that's
    another issue than what we are actually
  • 44:17 - 44:18
    currently working on.
  • 44:22 - 44:26
    [Holger] You're welcome to join the team and
    extend our scope to sources.
  • 44:29 - 44:31
    [Lunar] How many questions, two?
  • 44:33 - 44:36
    Two more questions, two or three.
  • 44:44 - 44:52
    [Question] So if there is a couple of other
    environment variables that could be set
  • 44:52 - 44:59
    in the environment to increase
    reproducibility, where to put them?
  • 44:59 - 45:08
    In the rules file? Or in the generic build
    environment of all packages, or where
  • 45:08 - 45:10
    should these things be placed?
  • 45:13 - 45:20
    [Lamby] It'd be nice if upstream fixed it,
    so if we just change it in debian/rules
  • 45:20 - 45:29
    that's just only helping us, so often take
    it upstream, would be the ideal solution.
  • 45:29 - 45:31
    Are you referring to something else?
  • 45:32 - 45:40
    [Question] For example many hashmaps have
    randomized data in the hash function, so if
  • 45:40 - 45:47
    you have some code that relies on hash
    order, at least some implementations of
  • 45:47 - 45:57
    hash functions are letting them be seeded
    rather than using something random for
  • 45:57 - 46:03
    a build thing, but you want the randomness
    in your hash functions for normal users
  • 46:03 - 46:11
    because else your hashmaps get open
    to attacks.
  • 46:12 - 46:14
    [Lamby] Correct, yes.
  • 46:16 - 46:22
    [Lunar] In these cases we send patches
    adding sort everywhere for the keys and
  • 46:22 - 46:28
    it's solved. For very few cases, for Perl for
    example you can set an environment
  • 46:28 - 46:33
    variable and some maintainers prefer to do
    that. But usually we try to push these
  • 46:33 - 46:36
    changes upstream, because they are simple
    enough and they like it.
  • 46:36 - 46:39
    Actually it makes testing easier for them.
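The "add sort for the keys" kind of patch can be sketched in Python, where string hashing is also randomized per process, so iterating over a set of strings may give a different order on each run. The function name and package names are illustrative only. Python has an environment variable for this too, PYTHONHASHSEED, but sorting at the point of output fixes the result without touching the runtime's randomization.

```python
def render_index(names) -> str:
    """Write one name per line, in sorted order rather than hash order."""
    return "".join(f"{name}\n" for name in sorted(names))


# Sets of strings have no guaranteed iteration order between runs.
index_a = render_index({"zlib", "apt", "dpkg"})
index_b = render_index({"dpkg", "zlib", "apt"})

print(index_a == index_b)
print(index_a, end="")
```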
  • 46:42 - 46:45
    There was one in the back, there.
  • 46:53 - 46:56
    [Lunar] That's the last question
  • 46:56 - 47:00
    [Question] Follow up question to what we
    had here before.
  • 47:00 - 47:10
    You showed an open bug report against gcc
    to support SOURCE_DATE_EPOCH to cover
  • 47:10 - 47:20
    the mdate and mtime timestamps, so I have
    patches to patch them out in my packages.
  • 47:20 - 47:24
    Should I remove those patches and if so,
    when?
  • 47:26 - 47:30
    [Lunar] Have you seen any more emails
    from the gcc maintainers?
  • 47:34 - 47:40
    [Dhole] The mail is forgotten, I guess we
    should ping it again, and see if they
  • 47:40 - 47:50
    reply, because what I read from the gcc
    website is that only the replies from
  • 47:50 - 47:55
    maintainers are the ones that matter, and
    I think no maintainer replied to the
  • 47:55 - 47:58
    message, so we should ping again.
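For a tool that embeds a build date, honouring SOURCE_DATE_EPOCH could look roughly like this. It is a sketch of the general convention (use the variable as seconds since the Unix epoch when it is set, otherwise fall back to the current time), not the gcc patch discussed here; the function name is made up for the example.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ) -> datetime:
    """Return the timestamp to embed in build output, preferring SOURCE_DATE_EPOCH."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(raw) if raw is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# With the variable set, every rebuild embeds the same date.
pinned = build_timestamp({"SOURCE_DATE_EPOCH": "1440000000"})
print(pinned.isoformat())

# Without it, the result follows the clock and differs between builds.
print(build_timestamp({}).isoformat())
```

Passing the environment in as a parameter keeps the function easy to test; a real tool would normally just read os.environ.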
  • 47:59 - 48:03
    [Question] That was just an example, my
    question was more general.
  • 48:03 - 48:09
    At which time should I remove my patches
    to fix things which were fixed higher up
  • 48:09 - 48:12
    in the toolchain? Or should I just leave
    them in there?
  • 48:13 - 48:14
    [Holger] Once they are in Sid.
  • 48:15 - 48:17
    [Question] Ok thanks!
  • 48:18 - 48:20
    [Lunar] Ok, I guess we're out of time.
  • 48:20 - 48:22
    Thank you for listening.
  • 48:23 - 48:26
    [Applause]
  • 48:26 - 48:29
    [Lunar] Fix your packages!
Title:
Stretching out for trustworthy reproducible builds creating bit by bit identical binaries
Description:

Talk about reproducible builds given by Lunar, Holger, Lamby and Dhole during DebConf15.

Video Language:
English
Team:
Debconf
Project:
2015_debconf15
