    Hi, thank you.
    I'm Nicolas Dandrimont and I will indeed
    be talking to you about
    Software Heritage.
    I'm a software engineer for this project.
    I've been working on it for 3 years now.
    And we'll see what this thing is all about.
    [Mic not working]
    I guess the batteries are out.
    So, let's try that again.
    So, we all know, we've been doing
    free software for a while,
    that software source code is something
    Why is that?
    As Harold Abelson has said in SICP, his
    textbook on programming,
    programs are meant to be read by people
    and then incidentally for machines to execute.
    Basically, what software source code
    provides us is a way inside
    the mind of the designer of the program.
    For instance, you can have,
    you can get inside very crazy algorithms
    that can do very fast reverse square roots
    for 3D, that kind of stuff
    Like in the Quake 2 source code.
    You can also get inside the algorithms
    that are underpinning the internet,
    for instance seeing the net queue
    algorithm in the Linux kernel.
    What we are building as the free software
    community is the free software commons.
    Basically, the commons is all the cultural
    and social and natural resources
    that we share and that everyone
    has access to.
    More specifically, the software commons
    is what we are building
    with software that is open and that is
    available for all to use, to modify,
    to execute, to distribute.
    We know that the software commons are a really
    critical part of our commons.
    Who's taking care of it?
    The software is fragile.
    Like all digital information, you can lose it.
    People can decide to shut down hosting
    spaces because of business decisions.
    People can hack into software hosting
    platforms and remove the code maliciously
    or just inadvertently.
    there's rot.
    If you don't care about the data, then
    it rots and it decays and you lose it.
    So, where is the archive we go to
    when something is lost,
    when GitLab goes away, when GitHub
    goes away.
    Where do we go?
    Finally, there's one last thing that we
    noticed, it's that
    there's a lot of teams that work on
    research on software
    and there's no real big infrastructure
    for research on code.
    There's tons of critical issues around
    code: safety, security, verification, proofs.
    Nobody's doing this at a very large scale.
    If you want to see the stars, you go to
    the Atacama desert and
    you point a telescope at the sky.
    Where is the telescope for source code?
    That's what Software Heritage wants to be.
    What we do is we collect, we preserve
    and we share all the software
    that is publicly available.
    Why do we do that? We do that to
    preserve the past, to enhance the present
    and to prepare for the future.
    What we're building is a base infrastructure
    that can be used
    for cultural heritage, for industry,
    for research and for education purposes.
    How do we do it? We do it with an open
    Every single line of code that we write
    is free software.
    We do it transparently, everything that
    we do, we do it in the open,
    be that on a mailing list or on
    our issue tracker.
    And we strive to do it for the very long
    haul, so we do it with replication in mind
    so that no single entity has full control
    over the data that we collect.
    And we do it in a non-profit fashion
    so that we avoid
    business-driven decisions impacting
    the project.
    So, what do we do concretely?
    We do archiving of version control systems.
    source code, files.
    metadata of the history of the projects,
    a common data model that is
    releases that have been tagged
    releases that we can find as tarballs
    this source code differ.
    we've seen the data that we've collected.
    VCS-agnostic, data model.
    history, if you have a git repository,
    you have a Mercurial repository,
    on it with the same tools.
    the software, for instance
    or the wikis or the mailing lists.
    in this space, for instance
    really good work around archiving the web.
    work with them and be able to do
    We can, for instance for the mailing lists
    that does a lot of archiving of free
    in a semantic Wikipedia of software,
    hyperlink all the archives that exist
    our archive.
    of all the metadata about
    the commits and the releases and
    of the graph.
    object storage because of
    of file contents that we need to store
    that is used to store the graph.
    software origins that are
    etc. etc.
    regular schedule.
    we're gonna archive it more often
    a lot of updates.
    origins that we archive.
    scroll through the list of repositories,
    hosting platforms.
    metadata to make a list of the packages
    archived, etc.
    of push mechanism so that
    of updates.
    we're really in it for the long run
    stuff that people tell us is
    important to archive.
    The Internet Archive has a "save now"
    button and we want to implement
    something along those lines as well,
    is in danger for a reason or another,
    in the Software Heritage archive.
    a git commit.
    what you'll find in a git commit
    see here because this is from a git commit
    identifier of the directory
    identifier of the parent of the revision
    authorship and committership information
    a hash of this,
    unique, very very probably unique.
    all the origins, all the history of
    deduplicate across all the archive.
    means that we compute them
    we are archiving, which means that
    across all the data that we archive.
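The idea of an identifier that is a hash of the object itself, so that identical content is stored only once across the whole archive, can be sketched as follows. This is a minimal illustration, assuming Git's SHA-1 convention for file contents; the names `git_blob_id` and `DedupStore` are hypothetical and are not taken from the Software Heritage code base.

```python
import hashlib
from typing import Dict


def git_blob_id(data: bytes) -> str:
    """Return the Git-style SHA-1 identifier of a file's contents."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class DedupStore:
    """Toy store (hypothetical): keeps one copy per distinct content."""

    def __init__(self) -> None:
        self._by_id: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        obj_id = git_blob_id(data)
        # Identical content hashes to the same id, so it is stored only once.
        self._by_id.setdefault(obj_id, data)
        return obj_id

    def __len__(self) -> int:
        return len(self._by_id)


store = DedupStore()
first = store.add(b"int main(void) { return 0; }\n")   # seen in one origin
second = store.add(b"int main(void) { return 0; }\n")  # same file elsewhere
print(first == second, len(store))
```

Because the identifier depends only on the bytes, the same file found in two unrelated repositories maps to a single entry.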
    mark a few weeks ago.
    you have a live graph on our website.
    source code files.
    what we would consider is source code
    as source code,
    we consider as source code
    every day, every day we get the new packages
    working on some performance improvements,
    that we can keep up
    the former content of Gitorious and Google Code
    spaces that closed recently
    the contents of Bitbucket
    the API is a bit buggy and
    in fixing it.
    of blobs, so the files take 175TB
    the metadata for the archive
    70 billion edges graph.
    source code archive that's available now
    all this?
    in public repositories.
    for the blob storage.
    we use some of the standard tools that
    and replication.
    scheduling of jobs
    psycopg2 for the low level stuff,
    object storage system which has
    filesystem, Azure, Ceph, or tons of…
    where you can just put an object,
    get a bunch of objects.
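A put/get interface of the kind mentioned here could look roughly like the sketch below. It rests on stated assumptions: the class and method names (`InMemoryObjectStorage`, `put`, `get`, `get_batch`) are hypothetical, an in-memory dict stands in for a real backend such as a filesystem, and a plain SHA-1 of the content is used as the key purely for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Optional, Tuple


class InMemoryObjectStorage:
    """Hypothetical content-addressed object storage backed by a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store an object and return its identifier."""
        obj_id = hashlib.sha1(content).hexdigest()
        # Storing the same content twice is a no-op.
        self._objects.setdefault(obj_id, content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return one object; raises KeyError if it is unknown."""
        return self._objects[obj_id]

    def get_batch(
        self, obj_ids: Iterable[str]
    ) -> Iterator[Tuple[str, Optional[bytes]]]:
        """Yield (id, content) pairs; content is None for unknown ids."""
        for obj_id in obj_ids:
            yield obj_id, self._objects.get(obj_id)


storage = InMemoryObjectStorage()
readme_id = storage.put(b"# README\n")
print(storage.get(readme_id))
```

Keeping the interface this small is what would let different backends be swapped in behind it without changing the callers.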
    really use it yet.
    all the listers, the loaders, the schedulers
    it's a pile of Python code.
    around 30 Puppet modules
    as a copyleft license,
    for the frontend.
    Software Heritage using our code,
    on a few hypervisors in house and
    on a very high density, very slow,
    started to migrate all this thing
    we're gonna grow as we need
    sponsorship, ??? sponsorship
    in their infrastructure as well
    so 170TB of stuff mirrored on Azure
    indexing and all the things that need
    we have a backend storage for the download
    quite slow so if you want to download
    then we actually keep a cache of
    a million years to download stuff.
    and open source software way,
    so we talk on our mailing list, on IRC
