Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project. I've been working on it for 3 years now. And we'll see what this thing is all about.

[Mic not working] I guess the batteries are out. So, let's try that again.

So, we all know, we've been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people and only incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program. For instance, you can get inside very crazy algorithms that can do very fast inverse square roots for 3D, that kind of stuff, like in the Quake 2 source code. You can also get inside the algorithms that are underpinning the internet, for instance seeing the net queue algorithm in the Linux kernel.

What we are building as the free software community is the free software commons. Basically, the commons is all the cultural and social and natural resources that we share and that everyone has access to.
More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that those commons are a really critical part of our commons. But who's taking care of it?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code, maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot: if you don't care about the data, it rots and decays and you lose it. So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?

Finally, there's one last thing that we noticed: there are a lot of teams that do research on software, and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.
What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do that to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. We strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So, what do we do concretely? We do archiving of version control systems. What does that mean? It means we archive file contents, so source code files. We archive revisions, which means all the metadata of the history of the projects: we try to download it and we put it inside a common data model that is shared across all the archive.
We archive releases of the software, releases that have been tagged in a version control system as well as releases that we can find as tarballs, because sometimes views of this source code differ. Of course, we archive where and when we've seen the data that we've collected. All of this, we put inside a canonical, VCS-agnostic data model: if you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.

What we don't do is archive what's around the software, for instance the bug tracking systems or the homepages or the wikis or the mailing lists. There are some projects that work in this space: for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to do linking across all the archives that exist. For instance, for the mailing lists there's the Gmane project that does a lot of archiving of free software mailing lists. So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and do stuff in that area.

Quick tour of our infrastructure. Basically, all the way to the right is our archive.
Our archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, and all the projects that sit on top of the graph. We separate the file storage out into a separate object storage because of the size discrepancy: we have lots and lots of file contents that we need to store, so we do that outside the database that is used to store the graph.

Basically, what we archive is a set of software origins: git repositories, mercurial repositories, etc. All those origins are loaded on a regular schedule: if there is a very active software origin, we're going to archive it more often than stale things that don't get a lot of updates.

How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real-time archiving, we're really in it for the long run, but we still want to be able to prioritize stuff that people tell us is important to archive.
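The idea of visiting active origins more often than stale ones can be sketched as a simple adaptive back-off policy. This is a hypothetical illustration, not the actual Software Heritage scheduler: the function name, bounds and halving/doubling rule are all assumptions made for the example.

```python
from datetime import timedelta

def next_visit_interval(current: timedelta, changed: bool,
                        minimum: timedelta = timedelta(hours=12),
                        maximum: timedelta = timedelta(days=64)) -> timedelta:
    """Hypothetical scheduling policy for archiving visits.

    If the last visit found new commits, halve the interval so the
    origin is revisited sooner; otherwise double it, staying within
    [minimum, maximum] bounds.
    """
    proposed = current / 2 if changed else current * 2
    # Clamp to the allowed range.
    return max(minimum, min(maximum, proposed))

# An active origin gets visited more and more frequently...
print(next_visit_interval(timedelta(days=2), changed=True))    # 1 day
# ...while a stale one backs off until it hits the ceiling.
print(next_visit_interval(timedelta(days=64), changed=False))  # 64 days
```

A multiplicative policy like this converges quickly to each origin's actual activity level without keeping any per-origin history beyond the current interval.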
The Internet Archive has a "save now" button and we want to implement something along those lines as well, so if we know that some software project is in danger for one reason or another, then we can prioritize archiving it.

So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this one comes from a git commit. So basically what we do is we take the identifier of the directory that the revision points to, we take the identifier of the parent of the revision, so we can keep track of the history, and then we add some metadata, authorship and committership information and the revision message, and then we take a hash of all this, and it makes an identifier that's probably unique, very very probably unique. Using those identifiers, we can retrace all the origins, all the history of development of the project, and we can deduplicate across all the archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which in turn means that we can deduplicate very efficiently across all the data that we archive.

How much data do we archive? A bit. We passed the billion revision mark a few weeks ago.
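Since the revision format is modeled on git, the intrinsic identifier computation can be illustrated with git's own object hashing scheme: SHA-1 over a typed header plus the object body, where the body references the directory (tree) and parent identifiers. This is a sketch of git's scheme, not the exact Software Heritage code; the commit body below is made up for the example.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Compute a git-style intrinsic identifier: SHA-1 over a
    "<type> <length>\\0" header followed by the object body."""
    data = obj_type.encode() + b" " + str(len(body)).encode() + b"\x00" + body
    return hashlib.sha1(data).hexdigest()

# A commit (revision) body points at a tree (directory) identifier and
# carries authorship metadata and the message; parent lines would chain
# it to the rest of the history. Hashing all of it yields the revision id.
body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"committer Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)
print(git_object_id("commit", body))

# The well-known identifier of the empty blob falls out of the same rule:
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is a pure function of the content, two loaders archiving the same revision from different origins compute the same id, which is exactly what makes archive-wide deduplication cheap.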
This graph is a bit old, but anyway, you have a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code, so everything that's in a git repository we consider source code, if it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian, we run the archival process every day: every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we are currently working on some performance improvements, some scalability improvements, to make sure that we can keep up with the development on GitHub. We have archived, as a one-off thing, the former contents of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically an 8 billion node and 70 billion edge graph. And of course it's growing daily.
We are pretty sure this is the richest source code archive that's available now, and it keeps growing.

So how do we actually… What kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for scheduling of jobs and for web interface stuff: basically psycopg2 for the low-level stuff, Django for the web stuff, and Celery for the scheduling of jobs.

In house, we've written an ad hoc object storage system which has a bunch of backends that you can use. Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of others. It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal but we don't really use it yet.

All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us; it's a pile of Python code. So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've done everything under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend.
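A put/get object storage of the kind described, keyed by content hash, can be sketched in a few lines. This is a hypothetical in-memory stand-in, not the Software Heritage objstorage API: the class and method names are assumptions, and a real backend would write to a filesystem, Ceph or Azure instead of a dict.

```python
import hashlib

class InMemoryObjStorage:
    """Minimal content-addressed object storage sketch (hypothetical).

    Objects are keyed by the SHA-1 of their content, so storing the
    same blob twice deduplicates for free, and backends (filesystem,
    Ceph, Azure, ...) only need to implement this same tiny interface.
    """

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # re-adding an existing blob is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def put_batch(self, contents: list[bytes]) -> list[str]:
        return [self.put(c) for c in contents]

storage = InMemoryObjStorage()
key = storage.put(b"print('hello')\n")
storage.put(b"print('hello')\n")  # duplicate: stored only once
assert storage.get(key) == b"print('hello')\n"
```

Keeping the interface this narrow is what makes it easy to mirror the whole blob store onto a different backend: a mirror only has to replay puts.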
Even if people try and make their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, for now we run everything on a few hypervisors in house, and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all of this into a Ceph storage cluster which we're going to grow as we need in the next few months. We've also been granted sponsorship by Microsoft, ??? sponsorship, for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now. Finally, at the University of Bologna, we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things that we've archived, we actually keep a cache of what we've produced so that it doesn't take a million years to download stuff.

We do our development in a classic free and open source software way: we talk on our mailing list, on IRC