Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer on this project; I've been working on it for three years now. And we'll see what this thing is all about. [Mic not working] I guess the batteries are out. So, let's try that again.

So, we all know, having done free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people, and only incidentally for machines to execute. Basically, what software source code gives us is a way inside the mind of the designer of the program. For instance, you can get inside very clever algorithms that compute very fast inverse square roots for 3D graphics, like in the Quake source code. You can also get inside the algorithms that underpin the internet, for instance the network queueing algorithms in the Linux kernel.

What we are building as the free software community is the free software commons. The commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and available for all to use, modify, execute and distribute. We know that this is a really critical part of our commons. But who's taking care of it?

Software is fragile. Like all digital information, you can lose it. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove code, maliciously or just inadvertently. And, of course, for obsolete stuff, there's rot: if nobody cares about the data, it decays and you lose it. So where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?

Finally, there's one last thing we noticed: a lot of teams do research on software, and there's no real large-scale infrastructure for research on code. There are tons of critical issues around code — safety, security, verification, proofs — and nobody is working on them at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.

What we do is collect, preserve and share all the software that is publicly available. Why do we do that? To preserve the past, enhance the present and prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything we do, we do in the open, be that on a mailing list or on our issue tracker. We strive to do it for the very long haul, so we design with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So, what do we do concretely? We archive version control systems. What does that mean? It means we archive file contents, that is, source code files.
We archive revisions, which means all the metadata of the history of the projects: we download it and put it inside a common data model that is shared across the whole archive. We archive releases of the software, both releases that have been tagged in a version control system and releases that we find as tarballs, because sometimes those two views of the source code differ. And of course we archive where and when we've seen the data that we collect.

All of this goes inside a canonical, VCS-agnostic data model. Whether you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.

What we don't archive is what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are projects that work in this space; for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to link across all the archives that exist. For mailing lists, for instance, there's the Gmane project, which archives a lot of free software mailing lists. So our long-term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and build on top of that.

A quick tour of our infrastructure. All the way to the right is our archive. The archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, with the projects on top of the graph. We separate the file storage into a distinct object storage because of the size discrepancy: we have lots and lots of file contents to store, so we store them outside the database that holds the graph.

What we archive is a set of software origins: git repositories, mercurial repositories, and so on. All those origins are loaded on a regular schedule: a very active software origin will be archived more often than stale things that don't get a lot of updates. How do we get the list of software origins to archive? We have a bunch of listers that scroll through the lists of repositories on GitHub or other hosting platforms; we have code that can read Debian archive metadata to list the packages inside that archive; and so on. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real-time archiving — we're really in it for the long run — but we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button and we want to implement something along those lines as well, so that if we know that some software project is in danger for one reason or another, we can prioritize archiving it.

So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit.
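As a rough sketch of that structure — not the actual Software Heritage schema; the field names and values here are simplified and made up — a revision is a small record, and its intrinsic identifier can be computed git-style by hashing a canonical serialization of it:

```python
import hashlib

# Hypothetical, simplified revision record; the real data model has more
# fields and a precisely specified serialization.
revision = {
    "directory": "85a74718d377195e1efd0843ba4f3260bad4fe07",  # snapshot of the tree
    "parents": ["01e2d0627a9a6edb24c37db45db5ecb31e9de808"],   # link to the history
    "author": "Example Author <author@example.com> 1494000000 +0200",
    "committer": "Example Author <author@example.com> 1494000000 +0200",
    "message": "Fix the frobnicator\n",
}

def intrinsic_id(rev: dict) -> str:
    """Hash a canonical serialization of the revision, the way git hashes
    a commit object: a header, then the body, then SHA-1 over the whole thing."""
    body_lines = [f"tree {rev['directory']}"]
    body_lines += [f"parent {p}" for p in rev["parents"]]
    body_lines += [
        f"author {rev['author']}",
        f"committer {rev['committer']}",
        "",
        rev["message"],
    ]
    body = "\n".join(body_lines).encode()
    header = f"commit {len(body)}\x00".encode()
    return hashlib.sha1(header + body).hexdigest()

print(intrinsic_id(revision))  # the same content always yields the same identifier
```

The only point of the sketch is that the identifier is a pure function of the content, which is what makes deduplication across the whole archive possible.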
The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this particular example comes from a git commit. Basically, we take the identifier of the directory that the revision points to, we take the identifiers of the parents of the revision so we can keep track of the history, we add some metadata — authorship and committership information and the revision message — and then we take a hash of all this, which gives an identifier that is very, very probably unique. Using those identifiers, we can retrace all the origins, all the history of development of a project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means we compute them from the contents of the things we're archiving, and that lets us deduplicate very efficiently across all the data we archive.

How much data do we archive? A bit. We passed the billion-revision mark a few weeks ago. This graph is a bit old, but there's a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code: everything that's in a git repository, we treat as source code, as long as it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian we run the archival process every day, so every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we're currently working on some performance and scalability improvements to make sure we can keep up with the pace of development on GitHub. As one-off efforts, we have archived the former contents of Gitorious and Google Code, two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, the files take 175 TB of blob storage, plus a kind of big database, 6 TB. The database only contains the graph of metadata for the archive, which is basically a graph with 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive available today, and it keeps growing.

So what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for job scheduling and for the web interface: psycopg2 for the low-level stuff, Django for the web stuff and Celery for the scheduling of jobs. In house, we've written an ad hoc object storage system with a bunch of backends you can use; we are agnostic between a plain UNIX filesystem, Azure, Ceph, and so on. It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal, but we don't really use it yet. All the data model implementation, all the listers, the loaders, the schedulers — everything has been written by us; it's a pile of Python code.
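To make that "put an object, get an object" idea concrete, here is a minimal sketch of a content-addressed object store with a filesystem backend. This is an illustration only, not the actual Software Heritage objstorage API; all names are made up.

```python
import hashlib
from pathlib import Path
from typing import Iterable

class FsObjStorage:
    """Toy content-addressed object store: each object is stored under the
    hash of its content, so adding the same bytes twice is a no-op."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, obj_id: str) -> Path:
        # Shard by the first two hex characters to keep directories small.
        return self.root / obj_id[:2] / obj_id

    def add(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        path = self._path(obj_id)
        if not path.exists():               # deduplication happens here
            path.parent.mkdir(exist_ok=True)
            path.write_bytes(content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._path(obj_id).read_bytes()

    def add_batch(self, contents: Iterable[bytes]) -> list[str]:
        return [self.add(c) for c in contents]

# Usage: store a blob, get it back by its identifier.
store = FsObjStorage("/tmp/toy-objstorage")
oid = store.add(b"int main(void) { return 0; }\n")
assert store.get(oid).startswith(b"int main")
```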
So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and everything is under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend. Even if people set up their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, we currently run everything on a few hypervisors in house, and our main storage is still a very high-density, very slow, very bulky storage array, but we've started to migrate all of it into a Ceph storage cluster, which we're going to grow as needed over the next few months. We've also been granted ??? sponsorship by Microsoft for their cloud services, and we've started putting mirrors of everything in their infrastructure as well: a full object storage mirror, so 170 TB of stuff mirrored on Azure, as well as a database mirror for the graph. We're also doing all the content indexing, and everything else that needs scalability, on Azure now. Finally, at the University of Bologna we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things we've archived, we keep a cache of what we've already assembled so that it doesn't take a million years to download.

We do our development in the classic free and open source software way: we talk on our mailing list, on IRC, on a forge. Everything is in English, everything is public, and there is more information on our website if you want to have a look and see what we do.

So, all that is very interesting, but how do you actually look into it? One of the ways you can use the archive is the REST API. This API lets you do pointwise browsing of the archive: you follow the links in the graph, which is slow but gives you pretty much full access to the data (there's a small scripted example of this at the end of this section). There's an index of the API that you can look at, but that's not very convenient, so we also have a web user interface. It's in preview right now; we're going to do a full launch in the month of June. If you go to https://archive.softwareheritage.org/browse/ with the given credentials, you can have a look and see what's going on.

Basically, the web interface lets you look at which origins we have downloaded and when we downloaded them, with a graph view of how often we visited an origin and a calendar view of when we visited it. And then, inside the visits, you can actually browse the contents that we've archived. For instance, this is the Python repository as of May 2017: you get the list of files, then you can drill down; it should be pretty intuitive. If you look at the history of a project, you can see the differences between two revisions of the project. Oh no, that's the syntax highlighting — but anyway, the diffs arrive right after.

So, yeah, pretty cool stuff. I should be able to do a demo as well; it should work. I'm going to zoom in. So this is the main archive; you can see some statistics about the objects we've downloaded. When you zoom in, you get some kind of overflow, because… yeah, why would you do that. If you want to browse, we can try to find an origin: "glibc". There are lots and lots of random GitHub forks of things; we don't discriminate and we don't really filter what we download. We are looking into doing some relevance sorting of the results here. Next. Xilinx, why not.
So, this was downloaded for the last time on August 3rd, 2016, so it's probably a dead repository, but you can see a bunch of source code, you can read the README of the glibc. If we go back to a more interesting origin, here's the repository for git. I've deliberately selected an old visit of the repo so that we can see what was going on then. If I look at the calendar view, you can see that we've had some issues actually updating this, but anyway. If I look at the last visit, then we can actually browse the contents, and you get syntax highlighting as well. This is a big, big file with lots of comments. Let's see the actual source code… Anyway, so, that's the browsing interface.

We can also get back what we've archived and download it, which is something you might want to do if a repository is lost: you can download it and get the source code back. How do we do that? On the top right of the browsing interface you have "Actions" and "Download", and you can download the directory you're currently looking at. It's an asynchronous process, which means that if there is a lot of load, it's going to take some time before you can actually download the content, so you can put in your email address and we'll notify you when the download is ready. I'm going to try my luck and just say "ok", and it will appear at some point in the list of things I've requested. I've already requested some things that we can actually get and open as a tarball. Yeah, I think that's the thing I was looking at, which is this revision of the git source code, and then I can open it. Yay, emacs, just what you want. Yay, source code. This seems to work.

And then, of course, if you want to script what you're doing, there's an API that lets you do the downloads as well. The source code is deduplicated a lot, which means that for one single repository there are tons of files we have to collect if you want to download an archive of a directory. It takes a while, but we have an asynchronous API: you POST the identifier of a revision to this URL, then get status updates, and at some point the status will tell you that the object is available (this flow is sketched below). You can download it, and you can even download the full history of a project and get it as a git fast-export archive that you can reimport into a new git repository. So any kind of VCS that we've imported, you can export as a git repository and reimport on your machine.

How can you get involved in the project? We have a lot of features that we're interested in; lots of them are now in early access or have been done, and there's some stuff we would like help with. Here is some of what we're working on. Provenance information: you have a content and you want to know which repositories it comes from; that's something we're working on. Full-text search: the end goal is to be able to trace even snippets of code that have been copied from one project to another. That's something we can look into with the wealth of information we have inside the archive.
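Coming back to the REST API mentioned earlier, here is a small sketch of pointwise browsing: walking from a revision to the directory it points to, one HTTP call per hop. The endpoint paths are my assumptions based on the public API index at https://archive.softwareheritage.org/api/ — check it for the authoritative routes; the revision identifier is just a placeholder.

```python
import requests

API = "https://archive.softwareheritage.org/api/1"
# Placeholder identifier; substitute a real revision id from the archive.
revision_id = "0000000000000000000000000000000000000000"

# Fetch the revision: message, author, parents and the directory it points to.
rev = requests.get(f"{API}/revision/{revision_id}/").json()
print(rev["message"], rev["directory"])

# Follow the link to the directory and list its entries: this is the
# "pointwise" graph browsing described above.
entries = requests.get(f"{API}/directory/{rev['directory']}/").json()
for entry in entries:
    print(entry["type"], entry["name"])
```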
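And a similar sketch of the asynchronous download flow described above: request a bundle, poll the status, then fetch it once it's ready. The exact vault endpoint and the response fields are assumptions on my part and should be checked against the API documentation; the identifier is again a placeholder.

```python
import time
import requests

API = "https://archive.softwareheritage.org/api/1"
revision_id = "0000000000000000000000000000000000000000"  # placeholder

# Ask the archive to cook a git fast-export bundle of this revision's history.
# (Assumed endpoint; see the API index for the real route.)
cook_url = f"{API}/vault/revision/{revision_id}/gitfast/"
requests.post(cook_url).raise_for_status()

# Poll until the bundle is ready; cooking is asynchronous on purpose, since
# reassembling a deduplicated repository can take a while.
while True:
    status = requests.get(cook_url).json()
    if status.get("status") == "done":
        break
    time.sleep(30)

# Download the bundle; it can then be reimported into a fresh repository
# with `git fast-import`, as mentioned above.
bundle = requests.get(status["fetch_url"])
with open("revision.gitfast.gz", "wb") as f:
    f.write(bundle.content)
```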