Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project. I've been working on it for 3 years now. And we'll see what this thing is all about. [Mic not working] I guess the batteries are out. So, let's try that again.
So, we all know, we've been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson has said in SICP, his textbook on programming, programs are meant to be read by people and only incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program. For instance, you can get inside very crazy algorithms that can do very fast inverse square roots for 3D, that kind of stuff, like in the Quake III source code. You can also get inside the algorithms that are underpinning the internet, for instance seeing the net queue algorithm in the Linux kernel.
What we are building as the free software community is the free software commons. Basically, the commons is all the cultural and social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that the software commons are a really critical part of our commons. Who's taking care of it?
The software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot. If you don't care about the data, then it rots and it decays and you lose it. So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?
Finally, there's one last thing that we noticed: there are a lot of teams that work on research on software and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code?
That's what Software Heritage wants to be. What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do that to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.
How do we do it? We do it with an open approach. Every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. And we strive to do it for the very long haul, so we do it with replication in mind so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion so that we avoid business-driven decisions impacting the project.
So, what do we do concretely? We do archiving of version control systems. What does that mean? It means we archive file contents, so source code, files. We archive revisions, which means all the metadata of the history of the projects: we try to download it and we put it inside a common data model that is shared across the whole archive. We archive releases of the software, releases that have been tagged in a version control system as well as releases that we can find as tarballs because sometimes… boof, views of this source code differ. Of course, we archive where and when we've seen the data that we've collected. All of this, we put inside a canonical, VCS-agnostic data model.
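As a rough illustration of what a canonical, VCS-agnostic data model can look like, here is a minimal Python sketch. All class, field and function names (`Content`, `Revision`, `Release`, `Visit`, `origins_containing`) are hypothetical and chosen for this example; they are not the actual Software Heritage schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class Content:
    """A file content (blob), identified by a hash of its bytes."""
    id: str
    length: int


@dataclass(frozen=True)
class Revision:
    """One point in a project's history, whatever VCS it came from."""
    id: str
    directory: str            # identifier of the root directory
    parents: Tuple[str, ...]  # identifiers of the parent revisions
    author: str
    message: str


@dataclass(frozen=True)
class Release:
    """A named release, found as a VCS tag or as a tarball."""
    id: str
    name: str
    target: str  # identifier of the released revision


@dataclass(frozen=True)
class Visit:
    """Where and when a given state of an origin was seen."""
    origin_url: str
    origin_type: str          # e.g. "git", "svn", "hg", "deb"
    date: datetime
    heads: Tuple[str, ...]    # revision identifiers seen during the visit


def origins_containing(visits: List[Visit], revision_id: str) -> List[str]:
    """Return the sorted URLs of all origins where a revision was seen."""
    return sorted({v.origin_url for v in visits if revision_id in v.heads})


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    visits = [
        Visit("https://example.org/project.git", "git", now, ("rev-1",)),
        Visit("https://example.org/project-svn", "svn", now, ("rev-1", "rev-2")),
    ]
    # The same revision type is used regardless of the origin's VCS.
    print(origins_containing(visits, "rev-1"))
```

Because every origin type maps onto the same few record types, a query such as `origins_containing` does not need to know which version control system the data came from.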
If you have a Debian package with its history, if you have a git repository, if you have a Subversion repository, if you have a Mercurial repository, it all looks the same and you can work on it with the same tools.
What we don't do is archive what's around the software, for instance the bug tracking systems or the homepages or the wikis or the mailing lists. There are some projects that work in this space, for instance the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to do linking across all the archives that exist. For instance, for the mailing lists there's the Gmane project that does a lot of archiving of free software mailing lists. So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software where we can hyperlink all the archives that exist and do stuff in the area.
Quick tour of our infrastructure. Basically, all the way to the right is our archive. Our archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases and all the projects that are on top of the graph. We separate the file storage into another object storage because of the size discrepancy: we have lots and lots of file contents that we need to store, so we do that outside the database that is used to store the graph.
Basically, what we archive is a set of software origins that are git repositories, Mercurial repositories, etc. All those origins are loaded on a regular schedule. If there is a very active software origin, we're gonna archive it more often than stale things that don't get a lot of updates. How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms.
We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis.
We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real time archiving, we're really in it for the long run, but we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button and we want to implement something along those lines as well, so if we know that some software project is in danger for one reason or another, then we can prioritize archiving it.
So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this is from a git commit. So basically what we do is we take the identifier of the directory that the revision points to, we take the identifier of the parent of the revision so we can keep track of the history, and then we add some metadata, authorship and committership information and the revision message, and then we take a hash of this; it makes an identifier that's probably unique, very very probably unique. Using those identifiers, we can retrace all the origins, all the history of development of the project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which means that we can deduplicate very efficiently across all the data that we archive.
How much data do we archive? A bit. So, we passed the billion revision mark a few weeks ago. This graph is a bit old, but anyway, you have a live graph on our website. That's more than 4.5 billion unique source code files.
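The intrinsic identifiers described above can be sketched in a few lines of Python. This sketch assumes plain git object hashing (SHA-1 over a `<type> <size>` header, a NUL byte, then the body) and leaves out the extensions mentioned in the talk; the function names `git_object_id`, `content_id` and `revision_id` are hypothetical and are not Software Heritage's API.

```python
import hashlib
from typing import Dict, Sequence


def git_object_id(kind: str, body: bytes) -> str:
    """Hash a git-style object: "<kind> <size>" + NUL byte + body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def content_id(data: bytes) -> str:
    """Identifier of a file content, computed only from its bytes."""
    return git_object_id("blob", data)


def revision_id(
    directory_id: str,
    parent_ids: Sequence[str],
    author: str,
    committer: str,
    message: str,
) -> str:
    """Identifier of a revision, laid out like a git commit object."""
    lines = [f"tree {directory_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", body.encode("utf-8"))


if __name__ == "__main__":
    # Deduplication: the same bytes seen in two origins yield one key.
    store: Dict[str, bytes] = {}
    for data in (b"int main(void) { return 0; }\n",
                 b"int main(void) { return 0; }\n"):
        store[content_id(data)] = data
    print(len(store), "unique content(s) stored")

    who = "Jane Doe <jane@example.org> 1517616000 +0000"
    root = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904",
                       [], who, who, "Initial revision\n")
    child = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904",
                        [root], who, who, "Second revision\n")
    print(root, child)
```

Since each revision identifier covers the identifiers of its parents and of its directory, two archives that compute the same identifier hold the same history up to that point, which is what makes deduplication across origins a simple key lookup.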
We don't actually discriminate between what we would consider source code and what upstream developers consider source code, so everything that's in a git repository, we consider as source code if it's below a size threshold. A billion revisions across 80 million projects.
What do we archive? We archive GitHub, we archive Debian. So, for Debian we run the archival process every day; every day we get the new packages that have been uploaded to the archive. GitHub, we try to keep up; we are currently working on some performance improvements, some scalability improvements to make sure that we can keep up with the development on GitHub. We have archived as a one-off thing the former content of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.
In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically a graph of 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive that's available now and it keeps growing.
So how do we actually… What kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for scheduling of jobs and for web interface stuff, basically psycopg2 for the low level stuff, Django for the web stuff and Celery for the scheduling of jobs. In house, we've written an ad hoc object storage system which has a bunch of backends that you can use.
Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal but we don't really use it yet.
All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us; it's a pile of Python code. So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've released everything under a copyleft license, GPLv3 for the backend and AGPLv3 for the frontend. Even if people try and make their own Software Heritage using our code, they have to publish their changes.
Hardware-wise, we run everything for now on a few hypervisors in house, and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all of this into a Ceph storage cluster which we're gonna grow as we need in the next few months.
We've also been granted sponsorship by Microsoft, ??? sponsorship for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now.
Finally, at the University of Bologna, we have a backend storage for the downloads: currently our main storage is quite slow, so if you want to download a bundle of things that we've archived, we actually keep a cache of what we've done so that it doesn't take a million years to download stuff.
We do our development in a classic free and open source software way, so we talk on our mailing list