
Software Heritage - Preserving the Free Software Commons

  • Not Synced
    Hi, thank you.
  • Not Synced
    I'm Nicolas Dandrimont and I will indeed
    be talking to you about
  • Not Synced
    Software Heritage.
  • Not Synced
    I'm a software engineer for this project.
  • Not Synced
    I've been working on it for 3 years now.
  • Not Synced
    And we'll see what this thing is all about.
  • Not Synced
    [Mic not working]
  • Not Synced
    I guess the batteries are out.
  • Not Synced
    So, let's try that again.
  • Not Synced
    So, we all know, we've been doing
    free software for a while,
  • Not Synced
    that software source code is something
    special.
  • Not Synced
    Why is that?
  • Not Synced
    As Harold Abelson has said in SICP, his
    textbook on programming,
  • Not Synced
    programs are meant to be read by people
    and only incidentally for machines to execute.
  • Not Synced
    Basically, what software source code
    provides us is a way inside
  • Not Synced
    the mind of the designer of the program.
  • Not Synced
    For instance, you can get inside
    very crazy algorithms
  • Not Synced
    that can do very fast inverse square roots
    for 3D, that kind of stuff,
  • Not Synced
    like in the Quake III Arena source code.
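
The square-root trick mentioned here can be sketched in a few lines. This is a minimal Python illustration of the widely documented "fast inverse square root" technique, not a copy of any game's source: it reinterprets a 32-bit float's bits as an integer, uses a magic constant to get a first guess, and refines it with one Newton-Raphson step. The function name `fast_inverse_sqrt` is chosen for this example only.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive 32-bit float."""
    # Reinterpret the float's bits as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer's bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


print(fast_inverse_sqrt(4.0))  # close to 0.5
```

The result is only an approximation (within about one percent of the exact value), which is the trade-off the trick makes for speed.
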
  • Not Synced
    You can also get inside the algorithms
    that are underpinning the internet,
  • Not Synced
    for instance seeing the net queue
    algorithm in the Linux kernel.
  • Not Synced
    What we are building as the free software
    community is the free software commons.
  • Not Synced
    Basically, the commons is all the cultural
    and social and natural resources
  • Not Synced
    that we share and that everyone
    has access to.
  • Not Synced
    More specifically, the software commons
    is what we are building
  • Not Synced
    with software that is open and that is
    available for all to use, to modify,
  • Not Synced
    to execute, to distribute.
  • Not Synced
    We know that the software commons are
    a really critical part of our commons.
  • Not Synced
    Who's taking care of it?
  • Not Synced
    The software is fragile.
  • Not Synced
    Like all digital information, you can lose
    software.
  • Not Synced
    People can decide to shut down hosting
    spaces because of business decisions.
  • Not Synced
    People can hack into software hosting
    platforms and remove the code maliciously
  • Not Synced
    or just inadvertently.
  • Not Synced
    And, of course, for the obsolete stuff,
    there's rot.
  • Not Synced
    If you don't care about the data, then
    it rots and it decays and you lose it.
  • Not Synced
    So, where is the archive we go to
    when something is lost,
  • Not Synced
    when GitLab goes away, when GitHub
    goes away?
  • Not Synced
    Where do we go?
  • Not Synced
    Finally, there's one last thing that we
    noticed, it's that
  • Not Synced
    there's a lot of teams that work on
    research on software
  • Not Synced
    and there's no real big infrastructure
    for research on code.
  • Not Synced
    There's tons of critical issues around
    code: safety, security, verification, proofs.
  • Not Synced
    Nobody's doing this at a very large scale.
  • Not Synced
    If you want to see the stars, you go
    to the Atacama Desert and
  • Not Synced
    you point a telescope at the sky.
  • Not Synced
    Where is the telescope for source code?
  • Not Synced
    That's what Software Heritage wants to be.
  • Not Synced
    What we do is we collect, we preserve
    and we share all the software
  • Not Synced
    that is publicly available.
  • Not Synced
    Why do we do that? We do that to
    preserve the past, to enhance the present
  • Not Synced
    and to prepare for the future.
  • Not Synced
    What we're building is a base infrastructure
    that can be used
  • Not Synced
    for cultural heritage, for industry,
    for research and for education purposes.
  • Not Synced
    How do we do it? We do it with an open
    approach.
  • Not Synced
    Every single line of code that we write
    is free software.
  • Not Synced
    We do it transparently, everything that
    we do, we do it in the open,
  • Not Synced
    be that on a mailing list or on
    our issue tracker.
  • Not Synced
    And we strive to do it for the very long
    haul, so we do it with replication in mind
  • Not Synced
    so that no single entity has full control
    over the data that we collect.
  • Not Synced
    And we do it in a non-profit fashion
    so that we avoid
  • Not Synced
    business-driven decisions impacting
    the project.
  • Not Synced
    So, what do we do concretely?
  • Not Synced
    We do archiving of version control systems.
  • Not Synced
    What does that mean?
  • Not Synced
    It means we archive file contents, so
    source code, files.
  • Not Synced
    We archive revisions, which means all the
    metadata of the history of the projects,
  • Not Synced
    we try to download it and we put it inside
    a common data model that is
  • Not Synced
    shared across the whole archive.
  • Not Synced
    We archive releases of the software,
    releases that have been tagged
  • Not Synced
    in a version control system as well as
    releases that we can find as tarballs
  • Not Synced
    because sometimes the views of
    this source code differ.
  • Not Synced
    Of course, we archive where and when
    we've seen the data that we've collected.
  • Not Synced
    All of this, we put inside a canonical,
    VCS-agnostic, data model.
  • Not Synced
    If you have a Debian package, with its
    history, if you have a Git repository,
  • Not Synced
    if you have a Subversion repository, if
    you have a Mercurial repository,
  • Not Synced
    it all looks the same and you can work
    on it with the same tools.
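
A VCS-agnostic model of the kind described here could look roughly like the sketch below. Every class and field name (`Content`, `Revision`, `Release`, `Visit`, `authors`) is hypothetical and chosen for illustration; the talk does not specify the actual schema, and the hash used for identifiers is an assumption.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Content:
    """A file's raw bytes, identified by a hash of its data."""
    data: bytes

    @property
    def id(self) -> str:
        # The hash choice is an assumption made for this sketch.
        return hashlib.sha256(self.data).hexdigest()


@dataclass(frozen=True)
class Revision:
    """One point in a project's history, whatever VCS it came from."""
    root_directory_id: str
    author: str
    message: str
    parent_ids: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Release:
    """A named version (a tag or a tarball) pointing at a revision."""
    name: str
    revision: Revision


@dataclass(frozen=True)
class Visit:
    """Where and when a piece of software was seen."""
    origin_url: str
    origin_type: str  # e.g. "git", "svn", "hg", "deb"
    seen_at: datetime


def authors(revisions: Iterable[Revision]) -> Set[str]:
    """A tool that works the same way on any origin type."""
    return {rev.author for rev in revisions}


from_git = Revision("dir-a", "alice", "Initial commit")
from_svn = Revision("dir-b", "bob", "r1: import trunk")
visit = Visit("https://example.org/repo.git", "git", datetime.now(timezone.utc))
print(sorted(authors([from_git, from_svn])))
```

Once both histories are expressed as `Revision` objects, the `authors` helper does not need to know which version control system they came from.
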
  • Not Synced
    What we don't do is archive what's around
    the software, for instance
  • Not Synced
    the bug tracking systems or the homepages
    or the wikis or the mailing lists.
  • Not Synced
    There are some projects that work
    in this space, for instance
  • Not Synced
    the Internet Archive does a lot of
    really good work around archiving the web.
  • Not Synced
    Our goal is not to replace them, but to
    work with them and be able to do
  • Not Synced
    linking across all the archives that exist.
  • Not Synced
    For instance, for the mailing lists,
    there's the Gmane project
  • Not Synced
    that does a lot of archiving of free
    software mailing lists.
  • Not Synced
    So our long term vision is to play a part
    in a semantic Wikipedia of software,
  • Not Synced
    a Wikidata of software where we can
    hyperlink all the archives that exist
  • Not Synced
    and do stuff in the area.
  • Not Synced
    Quick tour of our infrastructure.
  • Not Synced
    Basically, all the way to the right is
    our archive.
  • Not Synced
    Our archive consists of a huge graph
    of all the metadata about
  • Not Synced
    the files, the directories, the revisions,
    the commits and the releases and
  • Not Synced
    all the projects that are on top
    of the graph.
  • Not Synced
    We separate the file storage into another
    object storage because of
  • Not Synced
    the size discrepancy: we have lots and lots
    of file contents that we need to store
  • Not Synced
    so we do that outside the database
    that is used to store the graph.
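
The split described here, a graph of metadata in one place and bulky file contents in another, can be sketched as follows. This is an in-memory illustration under stated assumptions: `ObjectStorage` and `MetadataGraph` are hypothetical names, and keying blobs by a SHA-256 hash is a choice made for this example, not something the talk specifies.

```python
import hashlib
from typing import Dict, List, Tuple


class ObjectStorage:
    """Holds raw file contents, keyed by a hash of the bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical contents are stored only once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


class MetadataGraph:
    """Holds only identifiers and named edges, never file bytes."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = {}

    def add_edge(self, parent_id: str, name: str, child_id: str) -> None:
        self._edges.setdefault(parent_id, []).append((name, child_id))

    def children(self, parent_id: str) -> List[Tuple[str, str]]:
        return list(self._edges.get(parent_id, []))


storage = ObjectStorage()
graph = MetadataGraph()

readme_id = storage.add(b"Hello, commons!\n")
graph.add_edge("directory-1", "README", readme_id)
graph.add_edge("directory-2", "README", storage.add(b"Hello, commons!\n"))

for name, content_id in graph.children("directory-1"):
    print(name, storage.get(content_id))
```

The graph stays small because it stores only identifiers, while the object storage can grow and be replicated on its own terms.
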
  • Not Synced
    Basically, what we archive is a set of
    software origins that are
  • Not Synced
    Git repositories, Mercurial repositories,
    etc. etc.
  • Not Synced
    All those origins are loaded on a
    regular schedule.
  • Not Synced
    If there is a very active software origin,
    we're gonna archive it more often
  • Not Synced
    than stale things that don't get
    a lot of updates.
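
One simple way to visit active origins more often than stale ones is to adapt the delay between visits. The sketch below is an assumed policy for illustration only: the halving and doubling rule, the bounds `MIN_INTERVAL` and `MAX_INTERVAL`, and the function name `next_interval` are not taken from the talk.

```python
from datetime import timedelta

# Hypothetical bounds on how often an origin is visited.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Visit sooner if the last visit found changes, later otherwise."""
    candidate = current / 2 if changed else current * 2
    # Keep the result within the configured bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))


interval = timedelta(days=2)
for changed in (True, True, False):
    interval = next_interval(interval, changed)
    print(changed, interval)
```

With this rule, an origin that keeps changing converges toward the minimum delay, and one that never changes drifts toward the maximum.
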
Title:
Software Heritage - Preserving the Free Software Commons
Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
41:31
