WEBVTT

99:59:59.999 --> 99:59:59.999
Hi, thank you.

99:59:59.999 --> 99:59:59.999
I'm Nicolas Dandrimont and I will indeed be talking to you about

99:59:59.999 --> 99:59:59.999
Software Heritage.

99:59:59.999 --> 99:59:59.999
I'm a software engineer for this project.

99:59:59.999 --> 99:59:59.999
I've been working on it for 3 years now.

99:59:59.999 --> 99:59:59.999
And we'll see what this thing is all about.

99:59:59.999 --> 99:59:59.999
[Mic not working]

99:59:59.999 --> 99:59:59.999
I guess the batteries are out.

99:59:59.999 --> 99:59:59.999
So, let's try that again.

99:59:59.999 --> 99:59:59.999
So, we all know, we've been doing free software for a while,

99:59:59.999 --> 99:59:59.999
that software source code is something special.

99:59:59.999 --> 99:59:59.999
Why is that?

99:59:59.999 --> 99:59:59.999
As Harold Abelson has said in SICP, his textbook on programming,

99:59:59.999 --> 99:59:59.999
programs are meant to be read by people and then incidentally for machines to execute.

99:59:59.999 --> 99:59:59.999
Basically, what software source code provides us is a way inside

99:59:59.999 --> 99:59:59.999
the mind of the designer of the program.

99:59:59.999 --> 99:59:59.999
For instance, you can have, you can get inside very crazy algorithms

99:59:59.999 --> 99:59:59.999
that can do very fast reverse square roots for 3D, that kind of stuff

99:59:59.999 --> 99:59:59.999
Like in the Quake 2 source code.

99:59:59.999 --> 99:59:59.999
You can also get inside the algorithms that are underpinning the internet,

99:59:59.999 --> 99:59:59.999
for instance seeing the net queue algorithm in the Linux kernel.

99:59:59.999 --> 99:59:59.999
What we are building as the free software community is the free software commons.

99:59:59.999 --> 99:59:59.999
Basically, the commons is all the cultural and social and natural resources

99:59:59.999 --> 99:59:59.999
that we share and that everyone has access to.
99:59:59.999 --> 99:59:59.999
More specifically, the software commons is what we are building

99:59:59.999 --> 99:59:59.999
with software that is open and that is available for all to use, to modify,

99:59:59.999 --> 99:59:59.999
to execute, to distribute.

99:59:59.999 --> 99:59:59.999
We know that those commons are a really critical part of our commons.

99:59:59.999 --> 99:59:59.999
Who's taking care of it?

99:59:59.999 --> 99:59:59.999
The software is fragile.

99:59:59.999 --> 99:59:59.999
Like all digital information, you can lose software.

99:59:59.999 --> 99:59:59.999
People can decide to shut down hosting spaces because of business decisions.

99:59:59.999 --> 99:59:59.999
People can hack into software hosting platforms and remove the code maliciously

99:59:59.999 --> 99:59:59.999
or just inadvertently.

99:59:59.999 --> 99:59:59.999
And, of course, for the obsolete stuff, there's rot.

99:59:59.999 --> 99:59:59.999
If you don't care about the data, then it rots and it decays and you lose it.

99:59:59.999 --> 99:59:59.999
So, where is the archive we go to when something is lost,

99:59:59.999 --> 99:59:59.999
when GitLab goes away, when GitHub goes away.

99:59:59.999 --> 99:59:59.999
Where do we go?

99:59:59.999 --> 99:59:59.999
Finally, there's one last thing that we noticed, it's that

99:59:59.999 --> 99:59:59.999
there's a lot of teams that work on research on software

99:59:59.999 --> 99:59:59.999
and there's no real big infrastructure for research on code.

99:59:59.999 --> 99:59:59.999
There's tons of critical issues around code: safety, security, verification, proofs.

99:59:59.999 --> 99:59:59.999
Nobody's doing this at a very large scale.

99:59:59.999 --> 99:59:59.999
If you want to see the stars, you go to the Atacama desert and

99:59:59.999 --> 99:59:59.999
you point a telescope at the sky.

99:59:59.999 --> 99:59:59.999
Where is the telescope for source code?

99:59:59.999 --> 99:59:59.999
That's what Software Heritage wants to be.
99:59:59.999 --> 99:59:59.999
What we do is we collect, we preserve and we share all the software

99:59:59.999 --> 99:59:59.999
that is publicly available.

99:59:59.999 --> 99:59:59.999
Why do we do that? We do that to preserve the past, to enhance the present

99:59:59.999 --> 99:59:59.999
and to prepare for the future.

99:59:59.999 --> 99:59:59.999
What we're building is a base infrastructure that can be used

99:59:59.999 --> 99:59:59.999
for cultural heritage, for industry, for research and for education purposes.

99:59:59.999 --> 99:59:59.999
How do we do it? We do it with an open approach.

99:59:59.999 --> 99:59:59.999
Every single line of code that we write is free software.

99:59:59.999 --> 99:59:59.999
We do it transparently, everything that we do, we do it in the open,

99:59:59.999 --> 99:59:59.999
be that on a mailing list or on our issue tracker.

99:59:59.999 --> 99:59:59.999
And we strive to do it for the very long haul, so we do it with replication in mind

99:59:59.999 --> 99:59:59.999
so that no single entity has full control over the data that we collect.

99:59:59.999 --> 99:59:59.999
And we do it in a non-profit fashion so that we avoid

99:59:59.999 --> 99:59:59.999
business-driven decisions impacting the project.

99:59:59.999 --> 99:59:59.999
So, what do we do concretely?

99:59:59.999 --> 99:59:59.999
We do archiving of version control systems.

99:59:59.999 --> 99:59:59.999
What does that mean?

99:59:59.999 --> 99:59:59.999
It means we archive file contents, so source code, files.

99:59:59.999 --> 99:59:59.999
We archive revisions, which means all the metadata of the history of the projects,

99:59:59.999 --> 99:59:59.999
we try to download it and we put it inside a common data model that is

99:59:59.999 --> 99:59:59.999
shared across all the archive.
99:59:59.999 --> 99:59:59.999
We archive releases of the software, releases that have been tagged

99:59:59.999 --> 99:59:59.999
in a version control system as well as releases that we can find as tarballs

99:59:59.999 --> 99:59:59.999
because sometimes… boof, views of this source code differ.

99:59:59.999 --> 99:59:59.999
Of course, we archive where and when we've seen the data that we've collected.

99:59:59.999 --> 99:59:59.999
All of this, we put inside a canonical, VCS-agnostic, data model.

99:59:59.999 --> 99:59:59.999
If you have a Debian package, with its history, if you have a git repository,

99:59:59.999 --> 99:59:59.999
if you have a Subversion repository, if you have a Mercurial repository,

99:59:59.999 --> 99:59:59.999
it all looks the same and you can work on it with the same tools.

99:59:59.999 --> 99:59:59.999
What we don't do is archive what's around the software, for instance

99:59:59.999 --> 99:59:59.999
the bug tracking systems or the homepages or the wikis or the mailing lists.

99:59:59.999 --> 99:59:59.999
There are some projects that work in this space, for instance

99:59:59.999 --> 99:59:59.999
the Internet Archive does a lot of really good work around archiving the web.

99:59:59.999 --> 99:59:59.999
Our goal is not to replace them, but to work with them and be able to do

99:59:59.999 --> 99:59:59.999
linking across all the archives that exist.

99:59:59.999 --> 99:59:59.999
We can, for instance for the mailing lists there's the Gmane project

99:59:59.999 --> 99:59:59.999
that does a lot of archiving of free software mailing lists.

99:59:59.999 --> 99:59:59.999
So our long term vision is to play a part in a semantic Wikipedia of software,

99:59:59.999 --> 99:59:59.999
a Wikidata of software where we can hyperlink all the archives that exist

99:59:59.999 --> 99:59:59.999
and do stuff in the area.

99:59:59.999 --> 99:59:59.999
Quick tour of our infrastructure.

99:59:59.999 --> 99:59:59.999
Basically, all the way to the right is our archive.
99:59:59.999 --> 99:59:59.999
Our archive consists of a huge graph of all the metadata about

99:59:59.999 --> 99:59:59.999
the files, the directories, the revisions, the commits and the releases and

99:59:59.999 --> 99:59:59.999
all the projects that are on top of the graph.

99:59:59.999 --> 99:59:59.999
We separate the file storage into another object storage because of

99:59:59.999 --> 99:59:59.999
the size discrepancy: we have lots and lots of file contents that we need to store

99:59:59.999 --> 99:59:59.999
so we do that outside the database that is used to store the graph.

99:59:59.999 --> 99:59:59.999
Basically, what we archive is a set of software origins that are

99:59:59.999 --> 99:59:59.999
git repositories, Mercurial repositories, etc. etc.

99:59:59.999 --> 99:59:59.999
All those origins are loaded on a regular schedule.

99:59:59.999 --> 99:59:59.999
If there is a very active software origin, we're gonna archive it more often

99:59:59.999 --> 99:59:59.999
than stale things that don't get a lot of updates.

99:59:59.999 --> 99:59:59.999
What do we do to get the list of software origins that we archive?

99:59:59.999 --> 99:59:59.999
We have a bunch of listers that can scroll through the list of repositories,

99:59:59.999 --> 99:59:59.999
for instance on GitHub or other hosting platforms.

99:59:59.999 --> 99:59:59.999
We have code that can read Debian archive metadata to make a list of the packages

99:59:59.999 --> 99:59:59.999
that are inside this archive and can be archived, etc.

99:59:59.999 --> 99:59:59.999
All of this is done on a regular basis.

99:59:59.999 --> 99:59:59.999
We are currently working on some kind of push mechanism so that

99:59:59.999 --> 99:59:59.999
people or other systems can notify us of updates.
99:59:59.999 --> 99:59:59.999
Our goal is not to do real time archiving, we're really in it for the long run

99:59:59.999 --> 99:59:59.999
but we still want to be able to prioritize stuff that people tell us is

99:59:59.999 --> 99:59:59.999
important to archive.

99:59:59.999 --> 99:59:59.999
The Internet Archive has a "save now" button and we want to implement

99:59:59.999 --> 99:59:59.999
something along those lines as well,

99:59:59.999 --> 99:59:59.999
so if we know that some software project is in danger for a reason or another,

99:59:59.999 --> 99:59:59.999
then we can prioritize archiving it.

99:59:59.999 --> 99:59:59.999
So this is the basic structure of a revision in the Software Heritage archive.

99:59:59.999 --> 99:59:59.999
You'll see that it's very similar to a git commit.

99:59:59.999 --> 99:59:59.999
The format of the metadata is pretty much what you'll find in a git commit

99:59:59.999 --> 99:59:59.999
with some extensions that you don't see here because this is from a git commit.

99:59:59.999 --> 99:59:59.999
So basically what we do is we take the identifier of the directory

99:59:59.999 --> 99:59:59.999
that the revision points to, we take the identifier of the parent of the revision

99:59:59.999 --> 99:59:59.999
so we can keep track of the history

99:59:59.999 --> 99:59:59.999
and then we add some metadata, authorship and committership information

99:59:59.999 --> 99:59:59.999
and the revision message and then we take a hash of this,

99:59:59.999 --> 99:59:59.999
it makes an identifier that's probably unique, very very probably unique.

99:59:59.999 --> 99:59:59.999
Using those identifiers, we can retrace all the origins, all the history of

99:59:59.999 --> 99:59:59.999
development of the project and we can deduplicate across all the archive.
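The revision identifier described above can be sketched in a few lines. This is a minimal sketch, assuming the plain git commit object format; the Software Heritage extensions mentioned in the talk are not modeled. The function name `revision_id` and the sample author string are made up for illustration.

```python
import hashlib


def revision_id(directory, parents, author, committer, message):
    """Compute a git-style identifier for a revision.

    Assumption: this follows git's commit object format and ignores any
    Software Heritage specific extensions.
    """
    lines = [f"tree {directory}"]
    # One "parent" line per parent revision keeps track of the history.
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the revision message.
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git prefixes the object with its type and byte length before hashing.
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical sample values, for illustration only.
who = "Jane Doe <jane@example.org> 1525000000 +0200"
empty_dir = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree
root = revision_id(empty_dir, [], who, who, "Initial import\n")
child = revision_id(empty_dir, [root], who, who, "Second revision\n")
```

Because the identifier depends only on the content being hashed, the same revision found in two different origins gets the same identifier, and changing any field (including a parent) changes it.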
99:59:59.999 --> 99:59:59.999
All the identifiers are intrinsic, which means that we compute them

99:59:59.999 --> 99:59:59.999
from the contents of the things that we are archiving, which means that

99:59:59.999 --> 99:59:59.999
we can deduplicate very efficiently across all the data that we archive.

99:59:59.999 --> 99:59:59.999
How much data do we archive?

99:59:59.999 --> 99:59:59.999
A bit.

99:59:59.999 --> 99:59:59.999
So, we have passed the billion revision mark a few weeks ago.

99:59:59.999 --> 99:59:59.999
This graph is a bit old, but anyway, you have a live graph on our website.

99:59:59.999 --> 99:59:59.999
That's more than 4.5 billion unique source code files.

99:59:59.999 --> 99:59:59.999
We don't actually discriminate between what we would consider is source code

99:59:59.999 --> 99:59:59.999
and what upstream developers consider as source code,

99:59:59.999 --> 99:59:59.999
so everything that's in a git repository, we consider as source code

99:59:59.999 --> 99:59:59.999
if it's below a size threshold.

99:59:59.999 --> 99:59:59.999
A billion revisions across 80 million projects.

99:59:59.999 --> 99:59:59.999
What do we archive?

99:59:59.999 --> 99:59:59.999
We archive GitHub, we archive Debian.

99:59:59.999 --> 99:59:59.999
So, Debian we run the archival process every day, every day we get the new packages

99:59:59.999 --> 99:59:59.999
that have been uploaded in the archive.

99:59:59.999 --> 99:59:59.999
GitHub, we try to keep up, we are currently working on some performance improvements,

99:59:59.999 --> 99:59:59.999
some scalability improvements to make sure that we can keep up

99:59:59.999 --> 99:59:59.999
with the development on GitHub.
99:59:59.999 --> 99:59:59.999
We have archived as a one-off thing the former content of Gitorious and Google Code

99:59:59.999 --> 99:59:59.999
which are two prominent code hosting spaces that closed recently

99:59:59.999 --> 99:59:59.999
and we've been working on archiving the contents of Bitbucket

99:59:59.999 --> 99:59:59.999
which is kind of a challenge because the API is a bit buggy and

99:59:59.999 --> 99:59:59.999
Atlassian isn't too interested in fixing it.

99:59:59.999 --> 99:59:59.999
In concrete storage terms, we have 175TB of blobs, so the files take 175TB

99:59:59.999 --> 99:59:59.999
and a kind of big database, 6TB.

99:59:59.999 --> 99:59:59.999
The database only contains the graph of the metadata for the archive

99:59:59.999 --> 99:59:59.999
which is basically an 8 billion nodes and 70 billion edges graph.

99:59:59.999 --> 99:59:59.999
And of course it's growing daily.

99:59:59.999 --> 99:59:59.999
We are pretty sure this is the richest source code archive that's available now

99:59:59.999 --> 99:59:59.999
and it keeps growing.

99:59:59.999 --> 99:59:59.999
So how do we actually…

99:59:59.999 --> 99:59:59.999
What kind of stack do we use to store all this?

99:59:59.999 --> 99:59:59.999
We use Debian, of course.

99:59:59.999 --> 99:59:59.999
All our deployment recipes are in Puppet in public repositories.

99:59:59.999 --> 99:59:59.999
We've started using Ceph for the blob storage.

99:59:59.999 --> 99:59:59.999
We use PostgreSQL for the metadata storage with some of the standard tools that

99:59:59.999 --> 99:59:59.999
live around PostgreSQL for backups and replication.

99:59:59.999 --> 99:59:59.999
We use standard Python stack for scheduling of jobs

99:59:59.999 --> 99:59:59.999
and for web interface stuff, basically psycopg2 for the low level stuff,

99:59:59.999 --> 99:59:59.999
Django for the web stuff

99:59:59.999 --> 99:59:59.999
and Celery for the scheduling of jobs.
99:59:59.999 --> 99:59:59.999
In house, we've written an ad hoc object storage system which has

99:59:59.999 --> 99:59:59.999
a bunch of backends that you can use.

99:59:59.999 --> 99:59:59.999
Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of…

99:59:59.999 --> 99:59:59.999
It's a really simple object storage system where you can just put an object,

99:59:59.999 --> 99:59:59.999
get an object, put a bunch of objects, get a bunch of objects.

99:59:59.999 --> 99:59:59.999
We've implemented removal but we don't really use it yet.

99:59:59.999 --> 99:59:59.999
All the data model implementation, all the listers, the loaders, the schedulers

99:59:59.999 --> 99:59:59.999
everything has been written by us, it's a pile of Python code.

99:59:59.999 --> 99:59:59.999
So, basically 20 Python packages and around 30 Puppet modules

99:59:59.999 --> 99:59:59.999
to deploy all that and we've done everything under a copyleft license,

99:59:59.999 --> 99:59:59.999
GPLv3 for the backend and AGPLv3 for the frontend.

99:59:59.999 --> 99:59:59.999
Even if people try and make their own Software Heritage using our code,

99:59:59.999 --> 99:59:59.999
they have to publish their changes.

99:59:59.999 --> 99:59:59.999
Hardware-wise, we run for now everything on a few hypervisors in house and

99:59:59.999 --> 99:59:59.999
our main storage is currently still on a very high density, very slow,

99:59:59.999 --> 99:59:59.999
very bulky storage array, but we've started to migrate all this thing

99:59:59.999 --> 99:59:59.999
into a Ceph storage cluster which we're gonna grow as we need

99:59:59.999 --> 99:59:59.999
in the next few months.

99:59:59.999 --> 99:59:59.999
We've also been granted by Microsoft sponsorship, ??? sponsorship

99:59:59.999 --> 99:59:59.999
for their cloud services.
99:59:59.999 --> 99:59:59.999
We've started putting mirrors of everything in their infrastructure as well

99:59:59.999 --> 99:59:59.999
which means full object storage mirror, so 170TB of stuff mirrored on Azure

99:59:59.999 --> 99:59:59.999
as well as a database mirror for the graph.

99:59:59.999 --> 99:59:59.999
And we're also doing all the content indexing and all the things that need

99:59:59.999 --> 99:59:59.999
scalability on Azure now.

99:59:59.999 --> 99:59:59.999
Finally, at the University of Bologna, we have a backend storage for the download

99:59:59.999 --> 99:59:59.999
so currently our main storage is quite slow so if you want to download

99:59:59.999 --> 99:59:59.999
a bundle of things that we've archived, then we actually keep a cache of

99:59:59.999 --> 99:59:59.999
what we've done so that it doesn't take a million years to download stuff.

99:59:59.999 --> 99:59:59.999
We do our development in a classic free and open source software way,

99:59:59.999 --> 99:59:59.999
so we talk on our mailing list, on IRC, on a forge.

99:59:59.999 --> 99:59:59.999
Everything is in English, everything is public, there is more information

99:59:59.999 --> 99:59:59.999
on our website if you want to actually have a look and see what we do.

99:59:59.999 --> 99:59:59.999
So, all that is very interesting but how do we actually look into it?

99:59:59.999 --> 99:59:59.999
One of the ways that you can browse, that you can use the archive

99:59:59.999 --> 99:59:59.999
is using a REST API.

99:59:59.999 --> 99:59:59.999
Basically, this API allows you to do pointwise browsing of the archive

99:59:59.999 --> 99:59:59.999
so you can go and follow the links in a graph,

99:59:59.999 --> 99:59:59.999
which is very slow but gives you a pretty much full access of the data.

99:59:59.999 --> 99:59:59.999
There's an index for the API that you can look at, but that's not really convenient,

99:59:59.999 --> 99:59:59.999
so we also have a web user interface.
99:59:59.999 --> 99:59:59.999
It's in preview right now, we're gonna do a full launch in the month of June.

99:59:59.999 --> 99:59:59.999
If you go to https://archive.softwareheritage.org/browse/

99:59:59.999 --> 99:59:59.999
with the given credentials, you can have a look and see what's going on.

99:59:59.999 --> 99:59:59.999
Basically, we have a web interface that allows you to look at

99:59:59.999 --> 99:59:59.999
what origins we have downloaded, when we have downloaded the origins

99:59:59.999 --> 99:59:59.999
with a kind of graph view of how often we visited the origins

99:59:59.999 --> 99:59:59.999
and a calendar view of when we have visited the origins.

99:59:59.999 --> 99:59:59.999
And then, inside the visits, you can actually browse the contents

99:59:59.999 --> 99:59:59.999
that we've archived.

99:59:59.999 --> 99:59:59.999
So, for instance, this is the Python repository as of May 2017

99:59:59.999 --> 99:59:59.999
and you can have the list of files, then drill down,

99:59:59.999 --> 99:59:59.999
it should be pretty intuitive.

99:59:59.999 --> 99:59:59.999
If you look at the history of a project, you can see the differences

99:59:59.999 --> 99:59:59.999
between two revisions of a project.

99:59:59.999 --> 99:59:59.999
Oh no, that's the syntax highlighting, but anyway the diffs arrive right after.
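The pointwise browsing mentioned for the REST API amounts to following one link per request. Here is a sketch of why that is slow, using an in-memory stand-in rather than the real API: the `fetch` function, the `GRAPH` dictionary and its keys are hypothetical and do not reflect the actual API's endpoints or response format.

```python
# Hypothetical in-memory stand-in for the archive graph: each revision
# identifier maps to a record that links to its parents.
GRAPH = {
    "rev3": {"message": "third", "parents": ["rev2"]},
    "rev2": {"message": "second", "parents": ["rev1"]},
    "rev1": {"message": "first", "parents": []},
}

requests_made = 0


def fetch(revision_id):
    """Stand-in for one API request returning a single object."""
    global requests_made
    requests_made += 1
    return GRAPH[revision_id]


def walk_history(start):
    """Follow parent links one object at a time, visiting each revision once."""
    seen, todo, history = set(), [start], []
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        record = fetch(current)
        history.append(record["message"])
        todo.extend(record["parents"])
    return history


history = walk_history("rev3")
```

Each revision costs one round trip here, so walking a long history this way takes one request per node, which matches the "very slow but pretty much full access" trade-off described in the talk.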