Software Heritage - Preserving the Free Software Commons

  • 0:05 - 0:07
    Hi, thank you.
  • 0:08 - 0:11
    I'm Nicolas Dandrimont and I will indeed
    be talking to you about
  • 0:11 - 0:13
    Software Heritage.
  • 0:13 - 0:15
    I'm a software engineer for this project.
  • 0:16 - 0:18
    I've been working on it for 3 years now.
  • 0:18 - 0:22
    And we'll see what this thing is all about.
  • 0:24 - 0:39
    [Mic not working]
  • 0:39 - 0:41
    I guess the batteries are out.
  • 0:50 - 0:52
    So, let's try that again.
  • 0:52 - 0:55
    So, we all know, we've been doing
    free software for a while,
  • 0:56 - 1:00
    that software source code is something
    special.
  • 1:01 - 1:02
    Why is that?
  • 1:03 - 1:10
    As Harold Abelson has said in SICP, his
    textbook on programming,
  • 1:10 - 1:19
    programs are meant to be read by people
    and then incidentally for machines to execute.
  • 1:20 - 1:26
    Basically, what software source code
    provides us is a way inside
  • 1:26 - 1:29
    the mind of the designer of the program.
  • 1:29 - 1:38
    For instance, you can have,
    you can get inside very crazy algorithms
  • 1:38 - 1:47
    that can do very fast reverse square roots
    for 3D, that kind of stuff
  • 1:47 - 1:50
    Like in the Quake 2 source code.
  • 1:50 - 1:55
    You can also get inside the algorithms
    that are underpinning the internet,
  • 1:55 - 2:00
    for instance seeing the net queue
    algorithm in the Linux kernel.
  • 2:04 - 2:10
    What we are building as the free software
    community is the free software commons.
  • 2:11 - 2:19
    Basically, the commons is all the cultural
    and social and natural resources
  • 2:19 - 2:22
    that we share and that everyone
    has access to.
  • 2:22 - 2:26
    More specifically, the software commons
    is what we are building
  • 2:26 - 2:32
    with software that is open and that is
    available for all to use, to modify,
  • 2:32 - 2:35
    to execute, to distribute.
  • 2:37 - 2:45
    We know that those commons are a really
    critical part of our commons.
  • 2:46 - 2:48
    Who's taking care of it?
  • 2:50 - 2:52
    The software is fragile.
  • 2:52 - 2:54
    Like all digital information, you can lose
    software.
  • 2:56 - 3:02
    People can decide to shut down hosting
    spaces because of business decisions.
  • 3:03 - 3:09
    People can hack into software hosting
    platforms and remove the code maliciously
  • 3:09 - 3:11
    or just inadvertently.
  • 3:13 - 3:18
    And, of course, for the obsolete stuff,
    there's rot.
  • 3:18 - 3:25
    If you don't care about the data, then
    it rots and it decays and you lose it.
  • 3:26 - 3:31
    So, where is the archive we go to
    when something is lost,
  • 3:31 - 3:34
    when GitLab goes away, when Github
    goes away.
  • 3:34 - 3:36
    Where do we go?
  • 3:37 - 3:41
    Finally, there's one last thing that we
    noticed, it's that
  • 3:41 - 3:49
    there's a lot of teams that work on
    research on software
  • 3:49 - 3:54
    and there's no real big infrastructure
    for research on code.
  • 3:57 - 4:02
    There's tons of critical issues around
    code: safety, security, verification, proofs.
  • 4:04 - 4:08
    Nobody's doing this at a very large scale.
  • 4:08 - 4:12
    If you want to see the stars, you go to
    the Atacama desert and
  • 4:12 - 4:14
    you point a telescope at the sky.
  • 4:14 - 4:18
    Where is the telescope for source code?
  • 4:18 - 4:21
    That's what Software Heritage wants to be.
  • 4:22 - 4:28
    What we do is we collect, we preserve
    and we share all the software
  • 4:28 - 4:30
    that is publicly available.
  • 4:31 - 4:36
    Why do we do that? We do that to
    preserve the past, to enhance the present
  • 4:36 - 4:38
    and to prepare for the future.
  • 4:40 - 4:45
    What we're building is a base infrastructure
    that can be used
  • 4:45 - 4:50
    for cultural heritage, for industry,
    for research and for education purposes.
  • 4:51 - 4:53
    How do we do it? We do it with an open
    approach.
  • 4:53 - 4:57
    Every single line of code that we write
    is free software.
  • 4:59 - 5:05
    We do it transparently, everything that
    we do, we do it in the open,
  • 5:05 - 5:09
    be that on a mailing list or on
    our issue tracker.
  • 5:10 - 5:16
    And we strive to do it for the very long
    haul, so we do it with replication in mind
  • 5:16 - 5:22
    so that no single entity has full control
    over the data that we collect.
  • 5:23 - 5:27
    And we do it in a non-profit fashion
    so that we avoid
  • 5:27 - 5:33
    business-driven decisions impacting
    the project.
  • 5:35 - 5:39
    So, what do we do concretely?
  • 5:39 - 5:43
    We do archiving of version control systems.
  • 5:43 - 5:45
    What does that mean?
  • 5:46 - 5:49
    It means we archive file contents, so
    source code, files.
  • 5:49 - 5:56
    We archive revisions, which means all the
    metadata of the history of the projects,
  • 5:56 - 6:03
    we try to download it and we put it inside
    a common data model that is
  • 6:03 - 6:07
    shared across all the archive.
  • 6:09 - 6:14
    We archive releases of the software,
    releases that have been tagged
  • 6:14 - 6:18
    in a version control system as well as
    releases that we can find as tarballs
  • 6:18 - 6:24
    because sometimes… boof, views of
    this source code differ.
  • 6:28 - 6:32
    Of course, we archive where and when
    we've seen the data that we've collected.
  • 6:33 - 6:40
    All of this, we put inside a canonical,
    VCS-agnostic, data model.
  • 6:42 - 6:47
    If you have a Debian package, with its
    history, if you have a git repository,
  • 6:47 - 6:50
    if you have a subversion repository, if
    you have a mercurial repository,
  • 6:50 - 6:54
    it all looks the same and you can work
    on it with the same tools.
  • 6:55 - 7:01
    What we don't do is archive what's around
    the software, for instance
  • 7:01 - 7:06
    the bug tracking systems or the homepages
    or the wikis or the mailing lists.
  • 7:07 - 7:11
    There are some projects that work
    in this space, for instance
  • 7:11 - 7:16
    the Internet Archive does a lot of
    very good work around archiving the web.
  • 7:18 - 7:24
    Our goal is not to replace them, but to
    work with them and be able to do
  • 7:24 - 7:29
    linking across all the archives that exist.
  • 7:30 - 7:35
    We can, for instance for the mailing lists
    there's the gmane project
  • 7:35 - 7:39
    that does a lot of archiving of free
    software mailing lists.
  • 7:40 - 7:48
    So our long term vision is to play a part
    in a semantic Wikipedia of software,
  • 7:48 - 7:54
    a Wikidata of software where we can
    hyperlink all the archives that exist
  • 7:54 - 7:57
    and do stuff in the area.
  • 8:01 - 8:03
    Quick tour of our infrastructure.
  • 8:03 - 8:10
    Basically, all the way to the right is
    our archive.
  • 8:11 - 8:17
    Our archive consists of a huge graph
    of all the metadata about
  • 8:17 - 8:25
    the files, the directories, the revisions,
    the commits and the releases and
  • 8:25 - 8:28
    all the projects that are on top
    of the graph.
  • 8:29 - 8:34
    We separate the file storage into another
    object storage because of
  • 8:34 - 8:42
    the size discrepancy: we have lots and lots
    of file contents that we need to store
  • 8:42 - 8:46
    so we do that outside of the database
    that is used to store the graph.
  • 8:49 - 8:54
    Basically, what we archive is a set of
    software origins that are
  • 8:54 - 8:59
    git repositories, mercurial repositories,
    etc. etc.
  • 9:00 - 9:05
    All those origins are loaded on a
    regular schedule.
  • 9:07 - 9:13
    If there is a very active software origin,
    we're gonna archive it more often
  • 9:13 - 9:18
    than stale things that don't get
    a lot of updates.
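The adaptive schedule described here can be pictured with a minimal sketch. It is not Software Heritage's actual scheduler: the function name `next_interval`, the factor of 2 and the two bounds are assumptions made purely for illustration. The idea is to visit an origin more often when the last visit found changes, and to back off when it did not.

```python
from datetime import timedelta

# Hypothetical bounds, chosen only for this illustration.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    If the last visit found new data, halve the delay (visit more often);
    otherwise double it (visit less often). The result is clamped to
    [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# An origin that changed once, then went quiet for three visits.
interval = timedelta(days=1)
for changed in (True, False, False, False):
    interval = next_interval(interval, changed)
    print(changed, interval)
```

A multiplicative back-off like this keeps the number of visits to stale origins small without ever dropping them entirely.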
  • 9:20 - 9:24
    What we do to get the list of software
    origins that we archive.
  • 9:25 - 9:31
    We have a bunch of listers that can
    scroll through the list of repositories,
  • 9:31 - 9:34
    for instance on Github or other
    hosting platforms.
  • 9:35 - 9:42
    We have code that can read Debian archive
    metadata to make a list of the packages
  • 9:42 - 9:49
    that are inside this archive and can be
    archived, etc.
  • 9:50 - 9:53
    All of this is done on a regular basis.
  • 9:54 - 9:57
    We are currently working on some kind
    of push mechanism so that
  • 9:57 - 10:01
    people or other systems can notify us
    of updates.
  • 10:03 - 10:10
    Our goal is not to do real time archiving,
    we're really in it for the long run
  • 10:10 - 10:16
    but we still want to be able to prioritize
    stuff that people tell us is
  • 10:16 - 10:18
    important to archive.
  • 10:20 - 10:24
    The Internet Archive has a "save now"
    button and we want to implement
  • 10:24 - 10:26
    something along those lines as well,
  • 10:26 - 10:32
    so if we know that some software project
    is in danger for a reason or another,
  • 10:32 - 10:34
    then we can prioritize archiving it.
  • 10:36 - 10:40
    So this is the basic structure of a revision
    in the software heritage archive.
  • 10:42 - 10:45
    You'll see that it's very similar to
    a git commit.
  • 10:48 - 10:54
    The format of the metadata is pretty much
    what you'll find in a git commit
  • 10:54 - 10:59
    with some extensions that you don't
    see here because this is from a git commit
  • 11:01 - 11:10
    So basically what we do is we take the
    identifier of the directory
  • 11:10 - 11:16
    that the revision points to, we take the
    identifier of the parent of the revision
  • 11:16 - 11:19
    so we can keep track of the history
  • 11:19 - 11:25
    and then we add some metadata,
    authorship and commitership information
  • 11:25 - 11:29
    and the revision message and then we take
    a hash of this,
  • 11:29 - 11:37
    it makes an identifier that's probably
    unique, very very probably unique.
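As a rough sketch of the computation just described, the snippet below builds a git-style commit manifest and hashes it the way git hashes a commit object (SHA1 over a `commit <length>\0` header followed by the manifest). The function name `revision_id` and the sample values are assumptions, and the extensions to the git format mentioned in the talk are not modelled here.

```python
import hashlib
from typing import Sequence


def revision_id(directory: str, parents: Sequence[str], author: str,
                committer: str, message: str) -> str:
    """Compute a git-style identifier for a revision.

    `directory` is the identifier of the root directory, `parents` the
    identifiers of the parent revisions; `author` and `committer` are
    already-formatted "Name <email> timestamp timezone" strings.
    """
    lines = [f"tree {directory}"]
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    manifest = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    header = b"commit %d\0" % len(manifest)
    return hashlib.sha1(header + manifest).hexdigest()


# Placeholder values, for illustration only.
who = "Jane Doe <jane@example.org> 1500000000 +0200"
rev = revision_id("a" * 40, ["b" * 40], who, who, "Fix typo\n")
print(rev)
```

Because the parent identifier is part of the hashed manifest, changing anything in the history changes every identifier that comes after it.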
  • 11:40 - 11:47
    Using those identifiers, we can retrace
    all the origins, all the history of
  • 11:47 - 11:52
    development of the project and we can
    deduplicate across all the archive.
  • 11:52 - 11:59
    All the identifiers are intrinsic, which
    means that we compute them
  • 11:59 - 12:04
    from the contents of the things that
    we are archiving, which means that
  • 12:04 - 12:11
    we can deduplicate very efficiently
    across all the data that we archive.
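To make the link between intrinsic identifiers and deduplication concrete, here is a toy content-addressed store. It is a sketch under assumptions (the class name `DedupStore` and the choice of SHA256 are illustrative, not the archive's implementation): because the identifier is computed from the content alone, the same file found in two repositories maps to the same key and is stored once.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Toy content-addressed store: identical contents are stored once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def add(self, content: bytes) -> str:
        # The identifier depends only on the content (it is intrinsic),
        # not on where or when the content was found.
        obj_id = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(obj_id, content)
        return obj_id


store = DedupStore()
# The same licence file seen in two unrelated repositories...
id_a = store.add(b"GNU GENERAL PUBLIC LICENSE\n")
id_b = store.add(b"GNU GENERAL PUBLIC LICENSE\n")
# ...and one file that only exists in the second repository.
store.add(b"int main(void) { return 0; }\n")
print(id_a == id_b, len(store.blobs))  # True 2
```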
  • 12:12 - 12:14
    How much data do we archive?
  • 12:17 - 12:18
    A bit.
  • 12:19 - 12:24
    So, we have passed the billion revision
    mark a few weeks ago.
  • 12:25 - 12:30
    This graph is a bit old, but anyway,
    you have a live graph on our website.
  • 12:31 - 12:36
    That's more than 4.5 billion unique
    source code files.
  • 12:38 - 12:45
    We don't actually discriminate between
    what we would consider is source code
  • 12:45 - 12:48
    and what upstream developers consider
    as source code,
  • 12:48 - 12:52
    so everything that's in a git repository,
    we consider as source code
  • 12:52 - 12:55
    if it's below a size threshold.
  • 12:56 - 13:00
    A billion revisions across 80 million
    projects.
  • 13:01 - 13:03
    What do we archive?
  • 13:03 - 13:05
    We archive Github, we archive Debian.
  • 13:07 - 13:12
    So, Debian we run the archival process
    every day, every day we get the new packages
  • 13:12 - 13:14
    that have been uploaded in the archive.
  • 13:14 - 13:21
    Github, we try to keep up, we are currently
    working on some performance improvements,
  • 13:21 - 13:25
    some scalability improvements to make sure
    that we can keep up
  • 13:25 - 13:27
    with the development on GitHub.
  • 13:29 - 13:40
    We have archived as a one-off thing the
    former contents of Gitorious and Google Code
  • 13:41 - 13:47
    which are two prominent code hosting
    spaces that closed recently
  • 13:48 - 13:54
    and we've been working on archiving
    the contents of Bitbucket
  • 13:54 - 14:00
    which is kind of a challenge because
    the API is a bit buggy and
  • 14:00 - 14:03
    Atlassian isn't too interested
    in fixing it.
  • 14:06 - 14:17
    In concrete storage terms, we have 175TB
    of blobs, so the files take 175TB
  • 14:17 - 14:20
    and a kind of big database, 6TB.
  • 14:21 - 14:28
    The database only contains the graph of
    the metadata for the archive
  • 14:28 - 14:35
    which is basically a graph with 8 billion nodes
    and 70 billion edges.
  • 14:35 - 14:37
    And of course it's growing daily.
  • 14:38 - 14:43
    We are pretty sure this is the richest public
    source code archive that's available now
  • 14:43 - 14:45
    and it keeps growing.
  • 14:46 - 14:49
    So how do we actually…
  • 14:49 - 14:53
    What kind of stack do we use to store
    all this?
  • 14:55 - 14:57
    We use Debian, of course.
  • 14:58 - 15:03
    All our deployment recipes are in Puppet
    in public repositories.
  • 15:04 - 15:08
    We've started using Ceph
    for the blob storage.
  • 15:09 - 15:14
    We use PostgreSQL for the metadata storage
    with some of the standard tools that
  • 15:15 - 15:18
    live around PostgreSQL for backups
    and replication.
  • 15:20 - 15:28
    We use a standard Python stack for
    scheduling of jobs
  • 15:28 - 15:35
    and for web interface stuff, basically
    psycopg2 for the low level stuff,
  • 15:35 - 15:38
    Django for the web stuff
  • 15:38 - 15:44
    and Celery for the scheduling of jobs.
  • 15:45 - 15:50
    In house, we've written an ad hoc
    object storage system which has
  • 15:50 - 15:53
    a bunch of backends that you can use.
  • 15:54 - 16:03
    Basically, we are agnostic between a UNIX
    filesystem, Azure, Ceph, or tons of…
  • 16:03 - 16:07
    It's a really simple object storage system
    where you can just put an object,
  • 16:07 - 16:10
    get an object, put a bunch of objects,
    get a bunch of objects.
  • 16:12 - 16:18
    We've implemented removal but we don't
    really use it yet.
  • 16:20 - 16:25
    All the data model implementation,
    all the listers, the loaders, the schedulers
  • 16:25 - 16:29
    everything has been written by us,
    it's a pile of Python code.
  • 16:32 - 16:36
    So, basically 20 Python packages and
    around 30 Puppet modules
  • 16:36 - 16:42
    to deploy all that and we've done everything
    under a copyleft license,
  • 16:42 - 16:46
    GPLv3 for the backend and AGPLv3
    for the frontend.
  • 16:47 - 16:57
    Even if people try and make their own
    Software Heritage using our code,
  • 16:57 - 17:00
    they have to publish their changes.
  • 17:02 - 17:11
    Hardware-wise, we run for now everything
    on a few hypervisors in house and
  • 17:11 - 17:19
    our main storage is currently still
    on a very high density, very slow,
  • 17:19 - 17:28
    very bulky storage array, but we've
    started to migrate all this thing
  • 17:28 - 17:33
    into a Ceph storage cluster which
    we're gonna grow as we need
  • 17:33 - 17:35
    in the next few months.
  • 17:36 - 17:44
    We've also been granted by Microsoft
    sponsorship, ??? sponsorship
  • 17:44 - 17:46
    for their cloud services.
  • 17:46 - 17:52
    We've started putting mirrors of everything
    in their infrastructure as well
  • 17:52 - 18:00
    which means full object storage mirror,
    so 170TB of stuff mirrored on Azure
  • 18:00 - 18:02
    as well as a database mirror for the graph.
  • 18:04 - 18:09
    And we're also doing all the content
    indexing and all the things that need
  • 18:09 - 18:12
    scalability on Azure now.
  • 18:17 - 18:22
    Finally, at the University of Bologna,
    we have a backend storage for the download
  • 18:22 - 18:29
    so currently our main storage is
    quite slow so if you want to download
  • 18:29 - 18:35
    a bundle of things that we've archived,
    then we actually keep a cache of
  • 18:35 - 18:40
    what we've done so that it doesn't take
    a million years to download stuff.
  • 18:42 - 18:46
    We do our development in a classic free
    and open source software way,
  • 18:46 - 18:52
    so we talk on our mailing list, on IRC,
    on a forge.
  • 18:53 - 18:57
    Everything is in English, everything is
    public, there is more information
  • 18:57 - 19:01
    on our website if you want to actually
    have a look and see what we do.
  • 19:04 - 19:10
    So, all that is very interesting but how
    do we actually look into it?
  • 19:12 - 19:16
    One of the ways that you can browse,
    that you can use the archive
  • 19:16 - 19:19
    is using a REST API.
  • 19:19 - 19:25
    Basically, this API allows you to do
    pointwise browsing of the archive
  • 19:25 - 19:29
    so you can go and follow the links
    in a graph,
  • 19:29 - 19:38
    which is very slow but gives you a pretty
    much full access of the data.
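Pointwise browsing amounts to a graph traversal in which every step is one request. The sketch below illustrates that with a stand-in `fetch` function backed by a dictionary instead of real HTTP calls; the function name `walk_history`, the `"parents"` field and the sample identifiers are assumptions for illustration, not the documented response format of the API.

```python
from collections import deque
from typing import Callable, Dict, List


def walk_history(start: str, fetch: Callable[[str], Dict]) -> List[str]:
    """Visit a revision and all its ancestors, one `fetch` call per node."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        rev_id = queue.popleft()
        order.append(rev_id)
        # In real use, this would be one HTTP round trip per revision,
        # which is why pointwise browsing is slow on large histories.
        for parent in fetch(rev_id)["parents"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


# A tiny fake history with one merge: c3 has two parents.
FAKE_API = {
    "c3": {"parents": ["c2a", "c2b"]},
    "c2a": {"parents": ["c1"]},
    "c2b": {"parents": ["c1"]},
    "c1": {"parents": []},
}

print(walk_history("c3", FAKE_API.__getitem__))
# ['c3', 'c2a', 'c2b', 'c1']
```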
  • 19:38 - 19:45
    There's an index for the API that you can
    look at, but that's not really convenient,
  • 19:45 - 19:48
    so we also have a web user interface.
  • 19:49 - 19:56
    It's in preview right now, we're gonna do
    a full launch in the month of June.
  • 19:58 - 20:01
    If you go to
    https://archive.softwareheritage.org/browse/
  • 20:02 - 20:10
    with the given credentials, you can
    have a look and see what's going on.
  • 20:10 - 20:19
    Basically, we have a web interface that
    allows you to look at
  • 20:19 - 20:26
    what origins we have downloaded, when
    we have downloaded the origins
  • 20:26 - 20:35
    with a kind of graph view of how often
    we visited the origins
  • 20:35 - 20:38
    and a calendar view of when we have
    visited the origins.
  • 20:39 - 20:44
    And then, inside the visits, you can
    actually browse the contents
  • 20:44 - 20:45
    that we've archived.
  • 20:45 - 20:50
    So, for instance, this is the Python
    repository as of May 2017
  • 20:50 - 20:55
    and you can have the list of files,
    then drill down,
  • 20:55 - 20:58
    it should be pretty intuitive.
  • 20:59 - 21:03
    If you look at the history of a project,
    you can see the differences
  • 21:03 - 21:05
    between two revisions of a project.
  • 21:07 - 21:12
    Oh no, that's the syntax highlighting,
    but anyway the diffs arrive right after.
  • 21:14 - 21:16
    So, yeah, pretty cool stuff.
  • 21:17 - 21:22
    I should be able to do a demo as well,
    it should work.
  • 21:31 - 21:32
    I'm gonna zoom in.
  • 21:45 - 21:49
    So this is the main archive, you can see
    some statistics about the objects
  • 21:49 - 21:51
    that we've downloaded.
  • 21:51 - 21:57
    When you zoom in, you get some kind of
    overflows, because…
  • 21:57 - 21:59
    Yeah, why would you do that.
  • 21:59 - 22:04
    If you want to browse, we can try to find
    an origin.
  • 22:07 - 22:09
    "glibc".
  • 22:13 - 22:17
    So there's lots and lots of, like, random
    Github forks of things…
  • 22:19 - 22:26
    We don't discriminate and we don't really
    filter what we download.
  • 22:27 - 22:34
    We are looking into doing some relevance
    kind of sorting of the results, here.
  • 22:36 - 22:38
    Next.
  • 22:40 - 22:42
    Xilinx, why not.
  • 22:43 - 22:49
    So, this was last downloaded
    on August 3rd 2016,
  • 22:49 - 22:50
    so it's probably a dead repository,
  • 22:53 - 22:55
    but yeah, you can see a bunch of source
    code,
  • 22:57 - 23:01
    you can read the README of the glibc.
  • 23:04 - 23:08
    If we go back to a more interesting origin
  • 23:08 - 23:10
    here's the repository for git.
  • 23:11 - 23:17
    I've deliberately selected an old visit
    of the repo so that we can see
  • 23:17 - 23:19
    what was going on then.
  • 23:23 - 23:31
    If I look at the calendar view, you can see
    that we've had some issues actually
  • 23:31 - 23:33
    updating this, but anyway.
  • 23:38 - 23:46
    If I look at the last visit, then we can
    actually browse the contents,
  • 23:47 - 23:49
    you can get syntax highlighting as well.
  • 23:50 - 23:54
    This is a big big file with lots of comments
  • 24:02 - 24:05
    Let's see the actual source code…
  • 24:07 - 24:10
    Anyway, so, that's the browsing interface.
  • 24:10 - 24:15
    We can also now get back what we've
    archived and download it,
  • 24:15 - 24:19
    which is kind of something that you might
    want to do
  • 24:19 - 24:24
    if a repository is lost, you can actually
    download it
  • 24:24 - 24:26
    and get the source code back again.
  • 24:27 - 24:28
    How we do that.
  • 24:29 - 24:35
    If you go on the top right of this browsing
    interface, you have actions and download
  • 24:35 - 24:40
    and you can download the directory that
    you are currently looking at.
  • 24:41 - 24:46
    It's an asynchronous process, which means
    that if there is a lot of load,
  • 24:46 - 24:51
    then it's gonna take some time to
    actually be able to download the content.
  • 24:52 - 24:56
    So you can put in your email address so we
    can notify you when the download is ready.
  • 24:57 - 25:03
    I'm gonna try my luck and say just "ok"
    and it's gonna appear at some point
  • 25:03 - 25:08
    in the list of things that I've requested.
  • 25:11 - 25:20
    I've already requested some things that
    we can actually get and open as a tarball.
  • 25:31 - 25:35
    Yeah, I think that's the thing that I was
    actually looking at,
  • 25:35 - 25:38
    which is this revision of the git
    source code
  • 25:40 - 25:42
    and then I can open it
  • 25:44 - 25:47
    Yay, emacs, that's what you want.
  • 25:47 - 25:48
    Yay, source code.
  • 25:51 - 25:54
    This seems to work.
  • 25:58 - 26:03
    And then, of course, if you want to
    actually script what you're doing,
  • 26:03 - 26:07
    there's an API that allows you to do
    the downloads as well, so you can.
  • 26:11 - 26:18
    The source code is deduplicated a lot,
    which means that for one single repository
  • 26:18 - 26:24
    you get tons of files that we have to
    collect if you want to actually download
  • 26:24 - 26:26
    an archive of a directory.
  • 26:30 - 26:38
    It takes a while but we have an asynchronous
    API so you can POST
  • 26:38 - 26:44
    the identifier of a revision to this URL
    and then get status updates
  • 26:44 - 26:49
    and at some point, it will tell you that
    the… here
  • 26:50 - 26:53
    The status will tell you that the object
    is available.
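The request-then-poll pattern described here can be sketched as follows. The real URL is not given in the talk, so the HTTP calls are replaced by two injected callables; the names `request_bundle`, `post` and `poll`, the `"status"` and `"fetch_url"` fields, and the `"pending"`, `"done"` and `"failed"` values are all assumptions made for this illustration.

```python
import time
from typing import Callable, Dict


def request_bundle(revision_id: str,
                   post: Callable[[str], Dict[str, str]],
                   poll: Callable[[str], Dict[str, str]],
                   delay: float = 1.0,
                   max_polls: int = 60) -> str:
    """Ask for a bundle, then poll until it is ready; return its URL."""
    status = post(revision_id)  # submit the request
    polls = 0
    while True:
        if status["status"] == "done":
            return status["fetch_url"]
        if status["status"] == "failed":
            raise RuntimeError(f"cooking {revision_id} failed")
        if polls >= max_polls:
            raise TimeoutError(f"{revision_id} not ready after {polls} polls")
        time.sleep(delay)
        status = poll(revision_id)  # ask for a status update
        polls += 1


class FakeVault:
    """Stand-in for the remote service: ready after a few status checks."""

    def __init__(self, polls_needed: int) -> None:
        self.remaining = polls_needed

    def post(self, revision_id: str) -> Dict[str, str]:
        return {"status": "pending"}

    def poll(self, revision_id: str) -> Dict[str, str]:
        self.remaining -= 1
        if self.remaining <= 0:
            url = f"https://example.org/bundles/{revision_id}.tar.gz"
            return {"status": "done", "fetch_url": url}
        return {"status": "pending"}


vault = FakeVault(polls_needed=3)
print(request_bundle("abc123", vault.post, vault.poll, delay=0))
```

Injecting `post` and `poll` keeps the polling logic testable without network access; in real use they would wrap HTTP requests to the archive.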
  • 26:53 - 26:59
    You can download it and you can even
    download the full history of a project
  • 26:59 - 27:04
    and get that as a git-fast-export archive
    that you can reimport into
  • 27:04 - 27:06
    a new git repository.
  • 27:06 - 27:13
    So any kind of VCS that we've imported,
    you can export as a git repository
  • 27:13 - 27:18
    and reimport on your machine.
  • 27:19 - 27:23
    How to get involved in the project?
  • 27:24 - 27:29
    We have a lot of features that we're
    interested in, lots of them are now
  • 27:29 - 27:31
    in early access or have been done.
  • 27:32 - 27:36
    There's some stuff that we would like
    help with.
  • 27:38 - 27:40
    This is some stuff that we're working on:
  • 27:41 - 27:43
    provenance information, you have a content
  • 27:43 - 27:45
    you want to know which repository
    it comes from,
  • 27:46 - 27:48
    that's something we're working on.
  • 27:48 - 27:55
    Full text search, the end goal is to be
    able even to trace
  • 27:55 - 28:01
    the source of snippets of code that have
    been copied from one project to another.
  • 28:01 - 28:06
    That's something that we can look into
    with the wealth of information that
  • 28:06 - 28:08
    we have inside the archive.
  • 28:09 - 28:11
    There's a lot of things that,
  • 28:11 - 28:12
    I mean…
  • 28:12 - 28:15
    There's a lot of things that people want
    to do with the archive.
  • 28:15 - 28:20
    Our goal is to enable people to do things,
    to do interesting things
  • 28:20 - 28:22
    with a lot of source code.
  • 28:24 - 28:27
    If you have an idea of what you want to do
    with such an archive,
  • 28:27 - 28:30
    please you can come talk to us
  • 28:30 - 28:35
    and we'll be happy to help you help us.
  • 28:38 - 28:44
    What we want to do is to diversify
    the sources of things that we archive.
  • 28:44 - 28:51
    Currently, we have good support for git,
    we have OK support for subversion
  • 28:51 - 28:53
    and mercurial.
  • 28:54 - 28:59
    If your project of choice is in another
    version control system,
  • 28:59 - 29:01
    we are gonna miss it.
  • 29:02 - 29:06
    So people can contribute in this area.
  • 29:10 - 29:18
    For the listing part, we have coverage of
    Debian, we have coverage of GitHub,
  • 29:18 - 29:26
    if your code is somewhere else, we won't
    see it, so we need people to contribute
  • 29:26 - 29:30
    stuff that can list for instance GitLab
    instances,
  • 29:32 - 29:36
    and then we can integrate that in our
    infrastructure and actually have
  • 29:37 - 29:41
    people be able to archive their GitLab
    instances.
  • 29:42 - 29:49
    And of course, we need to spread
    the word, make the project sustainable.
  • 29:49 - 30:01
    We have a few sponsors now, Microsoft,
    Nokia, Huawei, Github has joined as a sponsor
  • 30:02 - 30:06
    The University of Bologna, of course Inria
    is sponsoring.
  • 30:07 - 30:12
    But we need to keep spreading the word
    and keep the project sustainable.
  • 30:13 - 30:18
    And, of course, we need to save endangered
    source code.
  • 30:18 - 30:23
    For that, we have a suggestion box on
    the wiki that you can add things to.
  • 30:24 - 30:30
    For instance, we have in the back of
    our minds archiving SourceForge,
  • 30:30 - 30:36
    because we know that this isn't very
    sustainable and that it's at risk of being
  • 30:36 - 30:39
    taken down at some point.
  • 30:42 - 30:48
    If you want to join us, we also have
    some job openings that are available.
  • 30:49 - 30:56
    For now it's in Paris, so if you want to
    consider coming to work with us in Paris,
  • 30:56 - 30:58
    you can look into that.
  • 31:01 - 31:03
    That's Software Heritage.
  • 31:03 - 31:05
    We are building a reference archive of
    all the free software
  • 31:05 - 31:07
    that's ever been written
  • 31:07 - 31:11
    in an international, open, non-profit and
    mutualised infrastructure
  • 31:12 - 31:18
    that we have opened up to everyone,
    all users, vendors, developers can use it.
  • 31:20 - 31:26
    The idea is to be at the service of
    the community and for society
  • 31:26 - 31:28
    as a whole.
  • 31:28 - 31:33
    So if you want to join us, you can look at
    our website, you can look at our code.
  • 31:35 - 31:38
    You can also talk to me, so if you have
    any questions,
  • 31:38 - 31:42
    I think we have 10, 12 minutes for questions.
  • 31:46 - 31:52
    [Applause]
  • 31:52 - 31:53
    Do you have questions?
  • 31:57 - 32:01
    [Q] How do you protect the archive
    against stuff that you don't want to
  • 32:01 - 32:02
    have in the archive.
  • 32:02 - 32:07
    I'm thinking of stuff that is
    copyright-protected and that GitHub will also
  • 32:07 - 32:09
    delete after a while.
  • 32:10 - 32:16
    Worse, if I would misuse the archive
    as my private backup
  • 32:16 - 32:20
    and store encrypted blocks on GitHub
    and you will eventually back them up
  • 32:20 - 32:21
    for me.
  • 32:25 - 32:27
    [A] There's, I think, two sides of the
    question.
  • 32:27 - 32:29
    The first side is
  • 32:29 - 32:34
    Do we really archive only stuff that is
    free software and
  • 32:34 - 32:41
    that we can redistribute and how do we
    manage, for instance,
  • 32:41 - 32:43
    copyright takedown stuff.
  • 32:46 - 32:52
    Currently, most of the infrastructure
    of the project is under French law.
  • 32:53 - 33:00
    There's a defined process to do
    copyright takedown in the French legal system.
  • 33:02 - 33:09
    We would be really annoyed to have to
    take down content from the archive
  • 33:12 - 33:20
    What we do, however, is to mirror public
    information that is publicly available.
  • 33:21 - 33:27
    Of course I'm not a lawyer for the project,
    so I can't really…
  • 33:30 - 33:33
    I'm not 100% sure of what I'm about to say
    but
  • 33:33 - 33:39
    what I know is that in the current French
    legislation status,
  • 33:40 - 33:43
    if the source of the data is still available
  • 33:43 - 33:47
    so for instance if the data is still on
    Github, then you need to have
  • 33:47 - 33:50
    Github take it down before we have to
    take it down.
  • 33:57 - 34:02
    We're not currently filtering content for
    misuse of the archive,
  • 34:02 - 34:06
    so the only thing that we do is put
    a limit on the size of the files
  • 34:06 - 34:08
    that are archived in Software Heritage.
  • 34:10 - 34:12
    The limit is pretty high, like 100MB.
  • 34:15 - 34:21
    We can't really decide ourselves
  • 34:21 - 34:24
    what is source code,
    what is not source code
  • 34:24 - 34:31
    because for instance if your project is
    a cryptography library,
  • 34:31 - 34:34
    you might want to have some encrypted
    blocks of data that are stored
  • 34:34 - 34:38
    in your source code repository as
    test fixtures.
  • 34:39 - 34:44
    And then, you need them to build the code
    and to make sure that it works.
  • 34:45 - 34:49
    So, how would that be any different than
    your encrypted backup on Github?
  • 34:49 - 34:56
    How could we, Software Heritage,
    distinguish between proper use and misuse
  • 34:56 - 34:59
    of the resources.
  • 35:00 - 35:05
    I guess our long term goal is to not have
    to care about misuse because
  • 35:05 - 35:07
    it's gonna be a drop in the ocean.
  • 35:09 - 35:11
    We're gonna have so much…
  • 35:12 - 35:15
    We want to have enough space and
    enough resources
  • 35:15 - 35:20
    that we don't really need to ask ourselves
    this question, basically.
  • 35:21 - 35:22
    Thanks.
  • 35:26 - 35:28
    Other questions?
  • 35:34 - 35:39
    [Q] Have you looked at some form of
    authentication to provide additional
  • 35:39 - 35:46
    assurance that the archived source code
    hasn't been modified or tampered with
  • 35:46 - 35:48
    in some form?
  • 35:51 - 35:56
    [A] First of all, all the identifiers for
    the objects that are inside the archive
  • 35:56 - 36:01
    are cryptographic hashes of the contents
    that we've archived.
  • 36:02 - 36:07
    So, for files, for instance, we take
    the SHA1, the SHA256,
  • 36:07 - 36:16
    one of the BLAKE hashes and the git
    modified SHA1 of the file,
  • 36:17 - 36:20
    and we use that in the manifest for
    the directories.
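The four checksums listed in this answer can be computed with Python's standard `hashlib`. This is a minimal sketch: the function name and dictionary keys are assumptions, and since the talk only says "one of the BLAKE hashes", the choice of BLAKE2s with a 256-bit digest is also an assumption. The "git modified SHA1" is the SHA1 of the content prefixed with git's `blob <length>\0` header.

```python
import hashlib
from typing import Dict


def content_hashes(data: bytes) -> Dict[str, str]:
    """Compute several checksums of one file content."""
    git_header = b"blob %d\0" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: BLAKE2s, 32-byte digest.
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
        # Same value as `git hash-object` would print for this content.
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


for name, value in content_hashes(b"hello world\n").items():
    print(f"{name:11} {value}")
```

Keeping several independent hashes per file means that a collision in any single algorithm can be detected by comparing the others.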
  • 36:20 - 36:26
    So the directories, the directory identifiers
    are a hash of the manifest
  • 36:26 - 36:30
    of the list of files that are inside
    the directory, etc.
  • 36:31 - 36:39
    So, recursively, you can make sure that
    the data that we give back to you
  • 36:39 - 36:48
    has at least not been altered by a bitflip
    or anything.
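The recursive check described in this answer is the Merkle tree idea: a directory's identifier is a hash over the names and identifiers of its entries, so any change in a file propagates up to the root. The sketch below uses a simplified manifest format and SHA256; it is not git's tree encoding nor the archive's exact scheme, and the names `object_id` and `Tree` are illustrative.

```python
import hashlib
from typing import Dict, Union

# A directory maps entry names to file contents or to sub-directories.
Tree = Dict[str, Union[bytes, "Tree"]]


def object_id(node: Union[bytes, Tree]) -> str:
    """Identifier of a file (hash of its bytes) or of a directory
    (hash of the sorted list of its entry names and identifiers)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file\0" + node).hexdigest()
    manifest = "".join(
        f"{name}\0{object_id(child)}\n"
        for name, child in sorted(node.items())
    )
    return hashlib.sha256(b"dir\0" + manifest.encode("utf-8")).hexdigest()


original = {"README": b"hello\n", "src": {"main.c": b"int main(void);\n"}}
received = {"README": b"hello\n", "src": {"main.c": b"int main(void);\n"}}
corrupted = {"README": b"hello\n", "src": {"main.c": b"int main(void):\n"}}

root = object_id(original)
print(object_id(received) == root)   # True: the data is intact
print(object_id(corrupted) == root)  # False: one byte changed deep inside
```

Checking a single root identifier is therefore enough to detect an alteration anywhere below it.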
  • 36:49 - 36:53
    We regularly run a scrub of the data
    that we have in the archive,
  • 36:53 - 36:57
    so we make sure that there's no rot
    inside our archive.
  • 36:59 - 37:05
    We've not looked into, basically,
    attestation of…
  • 37:09 - 37:14
    for instance, making sure that the code
    that we've downloaded…
  • 37:21 - 37:26
    I mean, we're not doing anything more
    than taking a picture of the data
  • 37:26 - 37:34
    and we say "We've computed this hash.
    Maybe the code that's been presented
  • 37:34 - 37:39
    by GitHub to Software Heritage is different
    than what you've uploaded to GitHub,
  • 37:39 - 37:40
    we can't tell."
  • 37:44 - 37:49
    In the case of git, you can always use
    the identifiers of the objects
  • 37:49 - 37:52
    that you've pushed so you have
    the commit hash,
  • 37:52 - 37:57
    which is itself a cryptographic identifier
    of the contents of the commit.
  • 37:59 - 38:02
    In turn, if the commit is signed, then
    the signature is still stored
  • 38:02 - 38:11
    in the Software Heritage metadata and
    you can reproduce the original git object
  • 38:11 - 38:15
    and check the signature, but we've not
    done anything specific for Software Heritage
  • 38:15 - 38:17
    in this area.
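    The commit check mentioned above can be sketched like this: git identifies any object by the SHA1 of a short header followed by the object's body, so anyone holding the reproduced commit object can recompute the hash and compare it with the one they pushed. The commit body below is made up for illustration (author, dates and message are hypothetical); a signed commit would additionally carry a `gpgsig` header inside the body, which this sketch does not verify.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA1 over "<type> <size>\\0" followed by the object body, as git does."""
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical commit pointing at git's well-known empty tree.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1526800000 +0200\n"
    b"committer A U Thor <author@example.com> 1526800000 +0200\n"
    b"\n"
    b"Initial commit\n"
)

if __name__ == "__main__":
    # Compare this value with the commit hash you recorded when pushing.
    print(git_object_id("commit", commit_body))
```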
  • 38:18 - 38:20
    Does that answer your question?
  • 38:20 - 38:20
    Cool.
  • 38:25 - 38:26
    Other questions?
  • 38:27 - 38:29
    There's one in front.
  • 38:31 - 38:34
    [Q] It's partially a question, partially a
    comment.
  • 38:34 - 38:40
    Your initial idea was to have a telescope,
    or something like this for source code.
  • 38:40 - 38:43
    For now, for me, it looks a little bit
    more like a microscope,
  • 38:43 - 38:47
    so you can focus on one thing, but that's
    not much.
  • 38:47 - 38:51
    So have you sorted things about how to
    analyze an entire ecosystem
  • 38:51 - 38:52
    or something like this?
  • 38:52 - 38:57
    For example, now we have Django 2, which is
    Python 3 only, so it would be interesting to
  • 38:57 - 39:01
    look at all Django modules to see when
    they start moving to this Django.
  • 39:01 - 39:07
    So we would need to start analyzing
    thousands or millions of files, but then
  • 39:07 - 39:11
    we would need some SQL-like, or some
    MapReduce jobs
  • 39:11 - 39:12
    or something like this for this.
  • 39:13 - 39:14
    [A] Yes.
  • 39:14 - 39:15
    So, we've started…
  • 39:16 - 39:22
    The two initiators of the project, Roberto
    Di Cosmo and Stefano Zacchiroli
  • 39:22 - 39:27
    are both researchers in computer science
    so they have a strong background in
  • 39:27 - 39:35
    actually mining software repositories and
    doing some large-scale analysis
  • 39:35 - 39:36
    on source code.
  • 39:38 - 39:45
    We've been talking with research groups
    whose main goal is to do analysis on
  • 39:45 - 39:48
    large-scale source code archives.
  • 39:50 - 39:58
    One of the first mirrors outside of our
    control of the archive
  • 39:58 - 39:59
    will be in Grenoble (France).
  • 39:59 - 40:06
    There's a few teams that work on
    actually doing large-scale research
  • 40:06 - 40:09
    on source code over there,
  • 40:09 - 40:11
    so that's what the mirror will be
    used for.
  • 40:13 - 40:17
    We've also been looking at what
    the Google open source team does.
  • 40:18 - 40:23
    They have this big repository with all
    the code that Google uses
  • 40:23 - 40:29
    and they've started to push back,
    like do large-scale analysis of
  • 40:29 - 40:38
    security vulnerabilities, issues with
    static and dynamic analysis
  • 40:38 - 40:42
    of the code and they've started pushing
    their fixes upstream.
  • 40:43 - 40:47
    That's something that we want to enable
    users to do,
  • 40:47 - 40:51
    that's not something that we want to do
    ourselves, but we want to make sure
  • 40:51 - 40:53
    that people can do it using our archive.
  • 40:55 - 40:59
    So we'd be happy to work with people
    who already do that so that
  • 40:59 - 41:05
    they can use their knowledge and their
    tools inside our archive.
  • 41:07 - 41:09
    Does that answer your question?
  • 41:10 - 41:11
    Cool.
  • 41:15 - 41:17
    Any more questions?
  • 41:19 - 41:22
    No? Then thank you very much, Nicolas.
  • 41:22 - 41:23
    Thank you.
  • 41:23 - 41:26
    [Applause]
Title:
Software Heritage - Preserving the Free Software Commons
Description:

Talk given by Nicolas Dandrimont at Minidebconf Hamburg 2018
https://meetings-archive.debian.net/pub/debian-meetings/2018/miniconf-hamburg/2018-05-20/software_heritage.webm

Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
41:31

English subtitles

Incomplete
