Software Heritage - Preserving the Free Software Commons

  • Not Synced
    Hi, thank you.
  • Not Synced
    I'm Nicolas Dandrimont and I will indeed
    be talking to you about
  • Not Synced
    Software Heritage.
  • Not Synced
    I'm a software engineer for this project.
  • Not Synced
    I've been working on it for 3 years now.
  • Not Synced
    And we'll see what this thing is all about.
  • Not Synced
    [Mic not working]
  • Not Synced
    I guess the batteries are out.
  • Not Synced
    So, let's try that again.
  • Not Synced
    So, we all know, we've been doing
    free software for a while,
  • Not Synced
    that software source code is something
    special.
  • Not Synced
    Why is that?
  • Not Synced
    As Harold Abelson has said in SICP, his
    textbook on programming,
  • Not Synced
    programs are meant to be read by people
    and then incidentally for machines to execute.
  • Not Synced
    Basically, what software source code
    provides us is a way inside
  • Not Synced
    the mind of the designer of the program.
  • Not Synced
    For instance, you can have,
    you can get inside very crazy algorithms
  • Not Synced
    that can do very fast reverse square roots
    for 3D, that kind of stuff
  • Not Synced
    Like in the Quake 2 source code.
  • Not Synced
    You can also get inside the algorithms
    that are underpinning the internet,
  • Not Synced
    for instance seeing the net queue
    algorithm in the Linux kernel.
  • Not Synced
    What we are building as the free software
    community is the free software commons.
  • Not Synced
    Basically, the commons is all the cultural
    and social and natural resources
  • Not Synced
    that we share and that everyone
    has access to.
  • Not Synced
    More specifically, the software commons
    is what we are building
  • Not Synced
    with software that is open and that is
    available for all to use, to modify,
  • Not Synced
    to execute, to distribute.
  • Not Synced
    We know that those commons are a really
    critical part of our commons.
  • Not Synced
    Who's taking care of it?
  • Not Synced
    The software is fragile.
  • Not Synced
    Like all digital information, you can lose
    software.
  • Not Synced
    People can decide to shut down hosting
    spaces because of business decisions.
  • Not Synced
    People can hack into software hosting
    platforms and remove the code maliciously
  • Not Synced
    or just inadvertently.
  • Not Synced
    And, of course, for the obsolete stuff,
    there's rot.
  • Not Synced
    If you don't care about the data, then
    it rots and it decays and you lose it.
  • Not Synced
    So, where is the archive we go to
    when something is lost,
  • Not Synced
    when GitLab goes away, when GitHub
    goes away.
  • Not Synced
    Where do we go?
  • Not Synced
    Finally, there's one last thing that we
    noticed, it's that
  • Not Synced
    there's a lot of teams that work on
    research on software
  • Not Synced
    and there's no real big infrastructure
    for research on code.
  • Not Synced
    There's tons of critical issues around
    code: safety, security, verification, proofs.
  • Not Synced
    Nobody's doing this at a very large scale.
  • Not Synced
    If you want to see the stars, you go
    to the Atacama desert and
  • Not Synced
    you point a telescope at the sky.
  • Not Synced
    Where is the telescope for source code?
  • Not Synced
    That's what Software Heritage wants to be.
  • Not Synced
    What we do is we collect, we preserve
    and we share all the software
  • Not Synced
    that is publicly available.
  • Not Synced
    Why do we do that? We do that to
    preserve the past, to enhance the present
  • Not Synced
    and to prepare for the future.
  • Not Synced
    What we're building is a base infrastructure
    that can be used
  • Not Synced
    for cultural heritage, for industry,
    for research and for education purposes.
  • Not Synced
    How do we do it? We do it with an open
    approach.
  • Not Synced
    Every single line of code that we write
    is free software.
  • Not Synced
    We do it transparently, everything that
    we do, we do it in the open,
  • Not Synced
    be that on a mailing list or on
    our issue tracker.
  • Not Synced
    And we strive to do it for the very long
    haul, so we do it with replication in mind
  • Not Synced
    so that no single entity has full control
    over the data that we collect.
  • Not Synced
    And we do it in a non-profit fashion
    so that we avoid
  • Not Synced
    business-driven decisions impacting
    the project.
  • Not Synced
    So, what do we do concretely?
  • Not Synced
    We do archiving of version control systems.
  • Not Synced
    What does that mean?
  • Not Synced
    It means we archive file contents, so
    source code, files.
  • Not Synced
    We archive revisions, which means all the
    metadata of the history of the projects,
  • Not Synced
    we try to download it and we put it inside
    a common data model that is
  • Not Synced
    shared across all the archive.
  • Not Synced
    We archive releases of the software,
    releases that have been tagged
  • Not Synced
    in a version control system as well as
    releases that we can find as tarballs
  • Not Synced
    because sometimes… boof, views of
    this source code differ.
  • Not Synced
    Of course, we archive where and when
    we've seen the data that we've collected.
  • Not Synced
    All of this, we put inside a canonical,
    VCS-agnostic, data model.
  • Not Synced
    If you have a Debian package, with its
    history, if you have a git repository,
  • Not Synced
    if you have a subversion repository, if
    you have a mercurial repository,
  • Not Synced
    it all looks the same and you can work
    on it with the same tools.
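
A minimal sketch of what "it all looks the same" could mean in practice, assuming a simplified revision record. The `Revision` fields and the `walk_history` helper are hypothetical names for illustration, not the actual Software Heritage schema. The point is that a history walker never needs to know which VCS an origin came from.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class Revision:
    """Hypothetical VCS-agnostic revision: same shape for git, svn, hg or a Debian package."""
    id: str                    # intrinsic identifier of this revision
    directory: str             # identifier of the root directory it points to
    parents: Tuple[str, ...]   # identifiers of parent revisions
    author: str
    committer: str
    message: str


def walk_history(revisions: Dict[str, Revision], head: str) -> Iterator[Revision]:
    """Yield every revision reachable from `head`, each one exactly once."""
    seen = set()
    stack = [head]
    while stack:
        rev_id = stack.pop()
        if rev_id in seen or rev_id not in revisions:
            continue
        seen.add(rev_id)
        rev = revisions[rev_id]
        yield rev
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(rev.parents))
```

The same function would work on revisions loaded from any kind of origin, since it only relies on identifiers and parent links.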
  • Not Synced
    What we don't do is archive what's around
    the software, for instance
  • Not Synced
    the bug tracking systems or the homepages
    or the wikis or the mailing lists.
  • Not Synced
    There are some projects that work
    in this space, for instance
  • Not Synced
    the Internet Archive does a lot of
    really good work around archiving the web.
  • Not Synced
    Our goal is not to replace them, but to
    work with them and be able to do
  • Not Synced
    linking across all the archives that exist.
  • Not Synced
    We can, for instance for the mailing lists
    there's the Gmane project
  • Not Synced
    that does a lot of archiving of free
    software mailing lists.
  • Not Synced
    So our long term vision is to play a part
    in a semantic Wikipedia of software,
  • Not Synced
    a Wikidata of software where we can
    hyperlink all the archives that exist
  • Not Synced
    and do stuff in the area.
  • Not Synced
    Quick tour of our infrastructure.
  • Not Synced
    Basically, all the way to the right is
    our archive.
  • Not Synced
    Our archive consists of a huge graph
    of all the metadata about
  • Not Synced
    the files, the directories, the revisions,
    the commits and the releases and
  • Not Synced
    all the projects that are on top
    of the graph.
  • Not Synced
    We separate the file storage into another
    object storage because of
  • Not Synced
    the size discrepancy: we have lots and lots
    of file contents that we need to store
  • Not Synced
    so we do that outside the database
    that is used to store the graph.
  • Not Synced
    Basically, what we archive is a set of
    software origins that are
  • Not Synced
    git repositories, mercurial repositories,
    etc. etc.
  • Not Synced
    All those origins are loaded on a
    regular schedule.
  • Not Synced
    If there is a very active software origin,
    we're gonna archive it more often
  • Not Synced
    than stale things that don't get
    a lot of updates.
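
One way such a schedule could behave, as a sketch only: the halving and doubling policy and the two bounds below are assumptions for illustration, not the actual Software Heritage scheduler.

```python
from datetime import timedelta

# Hypothetical bounds on how often an origin is revisited.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, had_changes: bool) -> timedelta:
    """Visit active origins more often and stale ones less often.

    If the last visit found new data, halve the interval;
    otherwise double it. The result is clamped to the bounds above.
    """
    candidate = current / 2 if had_changes else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))
```

With a policy like this, a repository that changes at every visit converges to the minimum interval, and one that never changes drifts to the maximum.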
  • Not Synced
    What we do to get the list of software
    origins that we archive.
  • Not Synced
    We have a bunch of listers that can
    scroll through the list of repositories,
  • Not Synced
    for instance on GitHub or other
    hosting platforms.
  • Not Synced
    We have code that can read Debian archive
    metadata to make a list of the packages
  • Not Synced
    that are inside this archive and can be
    archived, etc.
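
A lister can be pictured roughly like this. It is a sketch under assumptions: `fetch_page` stands in for whatever paginated listing a hosting platform offers, and `fake_fetch_page` and the example.org URLs are made up so the example runs without network access.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (origin URLs on this page, cursor of the next page or None).
Page = Tuple[List[str], Optional[int]]


def list_origins(fetch_page: Callable[[int], Page], start: int = 0) -> Iterator[str]:
    """Scroll through a paginated listing and yield every origin URL."""
    cursor: Optional[int] = start
    while cursor is not None:
        urls, cursor = fetch_page(cursor)
        yield from urls


def fake_fetch_page(cursor: int) -> Page:
    """Stand-in for a hosting platform: five repositories, two per page."""
    repos = [f"https://example.org/repo{i}.git" for i in range(5)]
    page = repos[cursor:cursor + 2]
    next_cursor = cursor + 2 if cursor + 2 < len(repos) else None
    return page, next_cursor
```

Each URL that comes out of the lister would then be handed to the scheduler as a software origin to load.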
  • Not Synced
    All of this is done on a regular basis.
  • Not Synced
    We are currently working on some kind
    of push mechanism so that
  • Not Synced
    people or other systems can notify us
    of updates.
  • Not Synced
    Our goal is not to do real time archiving,
    we're really in it for the long run
  • Not Synced
    but we still want to be able to prioritize
    stuff that people tell us is
  • Not Synced
    important to archive.
  • Not Synced
    The Internet Archive has a "save now"
    button and we want to implement
  • Not Synced
    something along those lines as well,
  • Not Synced
    so if we know that some software project
    is in danger for a reason or another,
  • Not Synced
    then we can prioritize archiving it.
  • Not Synced
    So this is the basic structure of a revision
    in the Software Heritage archive.
  • Not Synced
    You'll see that it's very similar to
    a git commit.
  • Not Synced
    The format of the metadata is pretty much
    what you'll find in a git commit
  • Not Synced
    with some extensions that you don't
    see here because this is from a git commit
  • Not Synced
    So basically what we do is we take the
    identifier of the directory
  • Not Synced
    that the revision points to, we take the
    identifier of the parent of the revision
  • Not Synced
    so we can keep track of the history
  • Not Synced
    and then we add some metadata,
    authorship and committership information
  • Not Synced
    and the revision message and then we take
    a hash of this,
  • Not Synced
    it makes an identifier that's probably
    unique, very very probably unique.
  • Not Synced
    Using those identifiers, we can retrace
    all the origins, all the history of
  • Not Synced
    development of the project and we can
    deduplicate across all the archive.
  • Not Synced
    All the identifiers are intrinsic, which
    means that we compute them
  • Not Synced
    from the contents of the things that
    we are archiving, which means that
  • Not Synced
    we can deduplicate very efficiently
    across all the data that we archive.
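
To make the intrinsic identifier concrete, here is a sketch that hashes a revision the way git hashes a commit object. The talk only says the format is "very similar to a git commit" and that a hash is taken, so the exact serialization and the use of SHA-1 here are assumptions borrowed from git; the function names are hypothetical.

```python
import hashlib
from typing import Sequence


def git_object_id(object_type: str, payload: bytes) -> str:
    """Hash a payload the way git does: SHA-1 over '<type> <length>\\0<payload>'."""
    header = f"{object_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def revision_id(directory: str, parents: Sequence[str],
                author: str, committer: str, message: str) -> str:
    """Compute an identifier from the revision's own content only."""
    lines = [f"tree {directory}"]                 # directory the revision points to
    lines += [f"parent {p}" for p in parents]     # links to the history
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    payload = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    return git_object_id("commit", payload)
```

Because the identifier depends only on the content, two origins that contain the same revision produce the same identifier, and the archive stores that revision once.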
  • Not Synced
    How much data do we archive?
  • Not Synced
    A bit.
  • Not Synced
    So, we have passed the billion revision
    mark a few weeks ago.
  • Not Synced
    This graph is a bit old, but anyway,
    you have a live graph on our website.
  • Not Synced
    That's more than 4.5 billion unique
    source code files.
  • Not Synced
    We don't actually discriminate between
    what we would consider is source code
  • Not Synced
    and what upstream developers consider
    as source code,
  • Not Synced
    so everything that's in a git repository,
    we consider as source code
  • Not Synced
    if it's below a size threshold.
  • Not Synced
    A billion revisions across 80 million
    projects.
  • Not Synced
    What do we archive?
  • Not Synced
    We archive GitHub, we archive Debian.
  • Not Synced
    So, for Debian, we run the archival process
    every day, every day we get the new packages
  • Not Synced
    that have been uploaded in the archive.
  • Not Synced
    GitHub, we try to keep up, we are currently
    working on some performance improvements,
  • Not Synced
    some scalability improvements to make sure
    that we can keep up
  • Not Synced
    with the development on GitHub.
  • Not Synced
    We have archived as a one-off thing
    the former content of Gitorious and Google Code
  • Not Synced
    which are two prominent code hosting
    spaces that closed recently
  • Not Synced
    and we've been working on archiving
    the contents of Bitbucket
  • Not Synced
    which is kind of a challenge because
    the API is a bit buggy and
  • Not Synced
    Atlassian isn't too interested
    in fixing it.
  • Not Synced
    In concrete storage terms, we have 175TB
    of blobs, so the files take 175TB
  • Not Synced
    and a kind of big database, 6TB.
  • Not Synced
    The database only contains the graph of
    the metadata for the archive
  • Not Synced
    which is basically a graph with 8 billion nodes
    and 70 billion edges.
  • Not Synced
    And of course it's growing daily.
  • Not Synced
    We are pretty sure this is the richest
    source code archive that's available now
  • Not Synced
    and it keeps growing.
  • Not Synced
    So how do we actually…
  • Not Synced
    What kind of stack do we use to store
    all this?
  • Not Synced
    We use Debian, of course.
  • Not Synced
    All our deployment recipes are in Puppet
    in public repositories.
  • Not Synced
    We've started using Ceph
    for the blob storage.
  • Not Synced
    We use PostgreSQL for the metadata storage
    with some of the standard tools that
  • Not Synced
    live around PostgreSQL for backups
    and replication.
  • Not Synced
    We use a standard Python stack for
    scheduling of jobs
  • Not Synced
    and for web interface stuff, basically
    psycopg2 for the low level stuff,
  • Not Synced
    Django for the web stuff
  • Not Synced
    and Celery for the scheduling of jobs.
  • Not Synced
    In house, we've written an ad hoc
    object storage system which has
  • Not Synced
    a bunch of backends that you can use.
  • Not Synced
    Basically, we are agnostic between a UNIX
    filesystem, Azure, Ceph, or tons of…
  • Not Synced
    It's a really simple object storage system
    where you can just put an object,
  • Not Synced
    get an object, put a bunch of objects,
    get a bunch of objects.
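
The shape of such an object storage can be sketched as follows. All class and method names are hypothetical, and keying objects by the SHA-1 of their bytes is an assumption; only the four operations (put, get, and their batch versions) and the idea of swappable backends come from the talk.

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterable, List


class ObjStorage(ABC):
    """Hypothetical minimal interface shared by every backend."""

    @abstractmethod
    def _write(self, obj_id: str, data: bytes) -> None: ...

    @abstractmethod
    def _read(self, obj_id: str) -> bytes: ...

    def put(self, data: bytes) -> str:
        obj_id = hashlib.sha1(data).hexdigest()  # identifier computed from content
        self._write(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._read(obj_id)

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]


class MemoryObjStorage(ObjStorage):
    """Backend keeping everything in a dict, handy for tests."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def _write(self, obj_id: str, data: bytes) -> None:
        self._objects[obj_id] = data

    def _read(self, obj_id: str) -> bytes:
        return self._objects[obj_id]


class DirectoryObjStorage(ObjStorage):
    """Backend writing one file per object on a UNIX filesystem."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)

    def _path(self, obj_id: str) -> Path:
        # Shard on the first two hex digits to avoid one huge directory.
        return self._root / obj_id[:2] / obj_id

    def _write(self, obj_id: str, data: bytes) -> None:
        path = self._path(obj_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def _read(self, obj_id: str) -> bytes:
        return self._path(obj_id).read_bytes()
```

Code that uses the storage only calls `put` and `get`, so switching from one backend to another does not change the callers.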
  • Not Synced
    We've implemented removal but we don't
    really use it yet.
  • Not Synced
    All the data model implementation,
    all the listers, the loaders, the schedulers
  • Not Synced
    everything has been written by us,
    it's a pile of Python code.
  • Not Synced
    So, basically 20 Python packages and
    around 30 Puppet modules
  • Not Synced
    to deploy all that and we've done everything
    under a copyleft license,
  • Not Synced
    GPLv3 for the backend and AGPLv3
    for the frontend.
  • Not Synced
    Even if people try and make their own
    Software Heritage using our code,
  • Not Synced
    they have to publish their changes.
  • Not Synced
    Hardware-wise, we run for now everything
    on a few hypervisors in house and
  • Not Synced
    our main storage is currently still
    on a very high density, very slow,
  • Not Synced
    very bulky storage array, but we've
    started to migrate all this thing
  • Not Synced
    into a Ceph storage cluster which
    we're gonna grow as we need
  • Not Synced
    in the next few months.
  • Not Synced
    We've also been granted by Microsoft
    sponsorship, ??? sponsorship
  • Not Synced
    for their cloud services.
  • Not Synced
    We've started putting mirrors of everything
    in their infrastructure as well
  • Not Synced
    which means full object storage mirror,
    so 170TB of stuff mirrored on Azure
  • Not Synced
    as well as a database mirror for the graph.
  • Not Synced
    And we're also doing all the content
    indexing and all the things that need
  • Not Synced
    scalability on Azure now.
  • Not Synced
    Finally, at the University of Bologna,
    we have a backend storage for the download
  • Not Synced
    so currently our main storage is
    quite slow so if you want to download
  • Not Synced
    a bundle of things that we've archived,
    then we actually keep a cache of
  • Not Synced
    what we've done so that it doesn't take
    a million years to download stuff.
  • Not Synced
    We do our development in a classic free
    and open source software way,
  • Not Synced
    so we talk on our mailing list, on IRC,
    on a forge.
  • Not Synced
    Everything is in English, everything is
    public, there is more information
  • Not Synced
    on our website if you want to actually
    have a look and see what we do.
  • Not Synced
    So, all that is very interesting but how
    do we actually look into it?
  • Not Synced
    One of the ways that you can browse,
    that you can use the archive
  • Not Synced
    is using a REST API.
  • Not Synced
    Basically, this API allows you to do
    pointwise browsing of the archive
  • Not Synced
    so you can go and follow the links
    in a graph,
  • Not Synced
    which is very slow but gives you a pretty
    much full access to the data.
  • Not Synced
    There's an index for the API that you can
    look at, but that's not really convenient,
  • Not Synced
    so we also have a web user interface.
  • Not Synced
    It's in preview right now, we're gonna do
    a full launch in the month of June.
  • Not Synced
    If you go to
    https://archive.softwareheritage.org/browse/
  • Not Synced
    with the given credentials, you can
    have a look and see what's going on.
  • Not Synced
    Basically, we have a web interface that
    allows you to look at
  • Not Synced
    what origins we have downloaded, when
    we have downloaded the origins
  • Not Synced
    with a kind of graph view of how often
    we visited the origins
  • Not Synced
    and a calendar view of when we have
    visited the origins.
  • Not Synced
    And then, inside the visits, you can
    actually browse the contents
  • Not Synced
    that we've archived.
  • Not Synced
    So, for instance, this is the Python
    repository as of May 2017
  • Not Synced
    and you can have the list of files,
    then drill down,
  • Not Synced
    it should be pretty intuitive.
  • Not Synced
    If you look at the history of a project,
    you can see the differences
  • Not Synced
    between two revisions of a project.
  • Not Synced
    Oh no, that's the syntax highlighting,
    but anyway the diffs arrive right after.
  • Not Synced
    So, yeah, pretty cool stuff.
  • Not Synced
    I should be able to do a demo as well,
    it should work.
  • Not Synced
    I'm gonna zoom in.
  • Not Synced
    So this is the main archive, you can see
    some statistics about the objects
  • Not Synced
    that we've downloaded.
  • Not Synced
    When you zoom in, you get some kind of
    overflows, because…
  • Not Synced
    Yeah, why would you do that.
  • Not Synced
    If you want to browse, we can try to find
    an origin.
  • Not Synced
    "glibc".
  • Not Synced
    So there's lots and lots of, like, random
    GitHub forks of things…
  • Not Synced
    We don't discriminate and we don't really
    filter what we download.
  • Not Synced
    We are looking into doing some relevance
    kind of sorting of the results, here.
  • Not Synced
    Next.
  • Not Synced
    Xilinx, why not.
  • Not Synced
    So, this has been downloaded for the last
    time on August 3rd 2016,
  • Not Synced
    so it's probably a dead repository,
  • Not Synced
    but yeah, you can see a bunch of source
    code,
  • Not Synced
    you can read the README of the glibc.
  • Not Synced
    If we go back to a more interesting origin
  • Not Synced
    here's the repository for git.
  • Not Synced
    I've selected voluntarily an old visit
    of the repo so that we can see
  • Not Synced
    what was going on then.
  • Not Synced
    If I look at the calendar view, you can see
    that we've had some issues actually
  • Not Synced
    updating this, but anyway.
  • Not Synced
    If I look at the last visit, then we can
    actually browse the contents,
  • Not Synced
    you can get syntax highlighting as well.
  • Not Synced
    This is a big big file with lots of comments
  • Not Synced
    Let's see the actual source code…
  • Not Synced
    Anyway, so, that's the browsing interface.
  • Not Synced
    We can also now get back what we've
    archived and download it,
  • Not Synced
    which is kind of something that you might
    want to do
  • Not Synced
    if a repository is lost, you can actually
    download it
  • Not Synced
    and get the source code back again.
  • Not Synced
    How we do that.
  • Not Synced
    If you go on the top right of this browsing
    interface, you have actions and download
  • Not Synced
    and you can download a directory that
    you are currently looking at.
  • Not Synced
    It's an asynchronous process, which means
    that if there is a lot of load,
  • Not Synced
    then it's gonna take some time to get
    actually, to be able to download the content
  • Not Synced
    So you can put in your email address so we
    can notify you when the download is ready.
  • Not Synced
    I'm gonna try my luck and say just "ok"
    and it's gonna appear at some point
  • Not Synced
    in the list of things that I've requested.
  • Not Synced
    I've already requested some things that
    we can actually get and open as a tarball.
  • Not Synced
    Yeah, I think that's the thing that I was
    actually looking at,
  • Not Synced
    which is this revision of the git
    source code
  • Not Synced
    and then I can open it
  • Not Synced
    Yay, emacs, that's what you want.
  • Not Synced
    Yay, source code.
  • Not Synced
    This seems to work.
  • Not Synced
    And then, of course, if you want to
    actually script what you're doing,
  • Not Synced
    there's an API that allows you to do
    the downloads as well, so you can.
  • Not Synced
    The source code is deduplicated a lot,
    which means that for one single repository
  • Not Synced
    you get tons of files that we have to
    collect if you want to actually download
  • Not Synced
    an archive of a directory.
  • Not Synced
    It takes a while but we have an asynchronous
    API so you can POST
  • Not Synced
    the identifier of a revision to this URL
    and then get status updates
  • Not Synced
    and at some point, it will tell you that
    the… here
  • Not Synced
    The status will tell you that the object
    is available.
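
The request-then-poll flow can be sketched like this. Nothing here calls the real API: `FakeBundleService`, its status values and the `wait_for_bundle` helper are all hypothetical, and stand in for the POST and the status updates described in the talk.

```python
import time
from typing import Callable, Dict


class FakeBundleService:
    """In-memory stand-in for an asynchronous download service."""

    def __init__(self, polls_before_ready: int = 2) -> None:
        self._polls_before_ready = polls_before_ready
        self._remaining: Dict[str, int] = {}

    def request(self, revision_id: str) -> None:
        # Plays the role of the POST: asking twice does not restart the job.
        self._remaining.setdefault(revision_id, self._polls_before_ready)

    def status(self, revision_id: str) -> Dict[str, str]:
        remaining = self._remaining[revision_id]
        if remaining > 0:
            self._remaining[revision_id] = remaining - 1
            return {"status": "pending"}
        return {"status": "done"}


def wait_for_bundle(service: FakeBundleService, revision_id: str,
                    poll_interval: float = 1.0, max_polls: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Request a bundle, then poll its status until it is ready or failed."""
    service.request(revision_id)
    for _ in range(max_polls):
        status = service.status(revision_id)
        if status["status"] in ("done", "failed"):
            return status
        sleep(poll_interval)
    raise TimeoutError(f"bundle for {revision_id} not ready after {max_polls} polls")
```

A script against the real service would replace the two `service` calls with HTTP requests to the documented endpoints and keep the same loop.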
  • Not Synced
    You can download it and you can even
    download the full history of a project
  • Not Synced
    and get that as a git-fast-export archive
    that you can reimport into
  • Not Synced
    a new git repository.
  • Not Synced
    So any kind of VCS that we've imported,
    you can export as a git repository
  • Not Synced
    and reimport on your machine.
  • Not Synced
    How to get involved in the project?
  • Not Synced
    We have a lot of features that we're
    interested in, lots of them are now
  • Not Synced
    in early access or have been done.
  • Not Synced
    There's some stuff that we would like
    help with.
  • Not Synced
    This is some stuff that we're working on:
  • Not Synced
    provenance information, you have a content
  • Not Synced
    you want to know which repository
    it comes from,
  • Not Synced
    that's something we're on.
  • Not Synced
    Full text search, the end goal is to be
    able even to trace
  • Not Synced
    source of snippets of code that have
    been copied from one project to another.
  • Not Synced
    That's something that we can look into
    with the wealth of information that
  • Not Synced
    we have inside the archive.
Title:
Software Heritage - Preserving the Free Software Commons
Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
41:31
