
Software Heritage - Preserving the Free Software Commons

Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project; I've been working on it for 3 years now. And we'll see what this thing is all about.

[Mic not working]

I guess the batteries are out. So, let's try that again.
So, we all know, having been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people, and only incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program.

For instance, you can get inside very crazy algorithms that can do very fast inverse square roots for 3D, that kind of stuff, like in the Quake 2 source code. You can also get inside the algorithms that are underpinning the internet, for instance the network queueing algorithms in the Linux kernel.
What we are building as the free software community is the free software commons. Basically, the commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that those commons are a really critical part of our commons. So who's taking care of them?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code, maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot: if you don't care about the data, it rots and decays and you lose it.

So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?
Finally, there's one last thing that we noticed: there are a lot of teams that do research on software, and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale.

If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.
What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do it to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. And we strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.
So, what do we do concretely? We archive version control systems. What does that mean? It means we archive file contents, so source code files. We archive revisions, which means all the metadata of the history of the projects: we download it and we put it inside a common data model that is shared across the whole archive. We archive releases of the software: releases that have been tagged in a version control system, as well as releases that we can find as tarballs, because sometimes views of this source code differ. Of course, we archive where and when we've seen the data that we've collected.

All of this, we put inside a canonical, VCS-agnostic data model. If you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.
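The VCS-agnostic data model described above can be sketched as a couple of record types. This is only an illustration of the idea; the field names here are assumptions, not Software Heritage's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Revision:
    """One node in the VCS-agnostic history graph: the same record
    shape can describe a git commit, a Mercurial changeset or a
    Debian package upload."""
    directory: str      # intrinsic id of the source tree
    parents: List[str]  # intrinsic ids of parent revisions
    author: str
    committer: str
    message: str
    date: str           # when the change was made

@dataclass
class Origin:
    """Where a revision was observed."""
    url: str            # e.g. a clone URL or a package source
    vcs_type: str       # "git", "hg", "svn", "deb", ...

# A git commit and a Debian upload end up as the same kind of record:
r = Revision(directory="d1", parents=[], author="a", committer="a",
             message="initial import", date="2018-05-19")
```

Because every loader normalizes into the same shape, tools written against this model work identically regardless of which VCS the data came from.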
What we don't archive is what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are some projects that work in this space; for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and to be able to do linking across all the archives that exist. For the mailing lists, for instance, there's the Gmane project, which does a lot of archiving of free software mailing lists.

So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and do stuff in that area.
Quick tour of our infrastructure. Basically, all the way to the right is our archive. It consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, with all the projects on top of the graph. We separate the file storage out into a separate object storage because of the size discrepancy: we have lots and lots of file contents that we need to store, so we do that outside the database that is used to store the graph.

Basically, what we archive is a set of software origins: git repositories, mercurial repositories, etc. All those origins are loaded on a regular schedule: if there is a very active software origin, we're going to archive it more often than stale things that don't get a lot of updates.
How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates.
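The lister idea can be sketched as a generator that pages through a hosting platform's repository listing and feeds origin URLs to the scheduler. This is a hypothetical sketch: the page-fetching callback and the "clone_url" field are made up for illustration, not GitHub's or Software Heritage's actual API.

```python
def list_origins(fetch_page):
    """Walk a paginated repository listing and yield origin URLs.
    `fetch_page(n)` is assumed to return a (possibly empty) list of
    dicts with a "clone_url" key."""
    page = 0
    while True:
        repos = fetch_page(page)
        if not repos:          # an empty page means we've seen everything
            return
        for repo in repos:
            yield repo["clone_url"]
        page += 1

# Usage with a fake two-page listing:
pages = [[{"clone_url": "https://example.org/a.git"}],
         [{"clone_url": "https://example.org/b.git"}]]
origins = list(list_origins(lambda n: pages[n] if n < len(pages) else []))
# origins now holds both repository URLs, in listing order
```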
Our goal is not to do real time archiving; we're really in it for the long run. But we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button, and we want to implement something along those lines as well, so that if we know that some software project is in danger for one reason or another, we can prioritize archiving it.
So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this one is from a git commit. So basically what we do is take the identifier of the directory that the revision points to, and the identifier of the parent of the revision, so we can keep track of the history; then we add some metadata, authorship and committership information and the revision message, and we take a hash of all this, which makes an identifier that is very, very probably unique.

Using those identifiers, we can retrace all the origins, all the history of development of the project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which means that we can deduplicate very efficiently across all the data that we archive.
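The intrinsic-identifier scheme just described works like git's own object hashing: the identifier is a cryptographic hash of the object's content, so identical contents always map to the same identifier. A minimal sketch of git-style blob hashing shows why deduplication falls out for free:

```python
import hashlib

def git_object_id(obj_type: bytes, payload: bytes) -> str:
    """Git-style intrinsic identifier: SHA-1 over a small header
    (object type and payload length) followed by the payload itself."""
    header = obj_type + b" " + str(len(payload)).encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

# Identical file contents get identical identifiers, no matter which
# repository or hosting platform they were collected from:
a = git_object_id(b"blob", b"hello world\n")
b = git_object_id(b"blob", b"hello world\n")
assert a == b  # this is what makes archive-wide deduplication cheap
```

Because the identifier depends only on the content, the archive never needs a central registry to detect duplicates: computing the hash is the lookup key.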
How much data do we archive? A bit. We passed the billion-revision mark a few weeks ago. This graph is a bit old, but anyway, there's a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code: everything that's in a git repository, we consider source code, if it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian, we run the archival process every day; every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we are currently working on some performance improvements, some scalability improvements, to make sure that we can keep up with the development on GitHub. We have archived, as a one-off thing, the former contents of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically a graph of 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive that's available now, and it keeps growing.
So what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for the scheduling of jobs and for the web interface: basically psycopg2 for the low level stuff, Django for the web stuff and Celery for the scheduling of jobs.

In house, we've written an ad hoc object storage system which has a bunch of backends that you can use: basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal, but we don't really use it yet.
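The put/get interface described above can be sketched as a tiny content-addressed store. This is an in-memory stand-in with illustrative method names, not Software Heritage's actual objstorage API:

```python
import hashlib

class InMemoryObjStorage:
    """Minimal content-addressed object store: objects are keyed by
    the hash of their content, so storing the same bytes twice
    deduplicates to a single entry."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        self._objects[obj_id] = content
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def put_batch(self, contents):
        return [self.put(c) for c in contents]

    def get_batch(self, obj_ids):
        return [self.get(i) for i in obj_ids]
```

Swapping the dict for a filesystem directory, a Ceph pool or an Azure container keeps the same four operations, which is what makes the backends interchangeable.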
All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us; it's a pile of Python code. So basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've done everything under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend. So even if people try to make their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, we run everything for now on a few hypervisors in house, and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all this into a Ceph storage cluster, which we're going to grow as needed in the next few months. We've also been granted sponsorship by Microsoft, ??? sponsorship, for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now. Finally, at the University of Bologna, we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things that we've archived, we keep a cache of what we've done so that it doesn't take a million years to download stuff.

We do our development in a classic free and open source software way, so we talk on our mailing list, on IRC
Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
41:31
