1 99:59:59,999 --> 99:59:59,999 Hi, thank you. 2 99:59:59,999 --> 99:59:59,999 I'm Nicolas Dandrimont and I will indeed be talking to you about 3 99:59:59,999 --> 99:59:59,999 Software Heritage. 4 99:59:59,999 --> 99:59:59,999 I'm a software engineer for this project. 5 99:59:59,999 --> 99:59:59,999 I've been working on it for 3 years now. 6 99:59:59,999 --> 99:59:59,999 And we'll see what this thing is all about. 7 99:59:59,999 --> 99:59:59,999 [Mic not working] 8 99:59:59,999 --> 99:59:59,999 I guess the batteries are out. 9 99:59:59,999 --> 99:59:59,999 So, let's try that again. 10 99:59:59,999 --> 99:59:59,999 So, we all know, we've been doing free software for a while, 11 99:59:59,999 --> 99:59:59,999 that software source code is something special. 12 99:59:59,999 --> 99:59:59,999 Why is that? 13 99:59:59,999 --> 99:59:59,999 As Harold Abelson has said in SICP, his textbook on programming, 14 99:59:59,999 --> 99:59:59,999 programs are meant to be read by people and then incidentally for machines to execute. 15 99:59:59,999 --> 99:59:59,999 Basically, what software source code provides us is a way inside 16 99:59:59,999 --> 99:59:59,999 the mind of the designer of the program. 17 99:59:59,999 --> 99:59:59,999 For instance, you can have, you can get inside very crazy algorithms 18 99:59:59,999 --> 99:59:59,999 that can do very fast reverse square roots for 3D, that kind of stuff 19 99:59:59,999 --> 99:59:59,999 Like in the Quake 2 source code. 20 99:59:59,999 --> 99:59:59,999 You can also get inside the algorithms that are underpinning the internet, 21 99:59:59,999 --> 99:59:59,999 for instance seeing the net queue algorithm in the Linux kernel. 22 99:59:59,999 --> 99:59:59,999 What we are building as the free software community is the free software commons. 23 99:59:59,999 --> 99:59:59,999 Basically, the commons is all the cultural and social and natural resources 24 99:59:59,999 --> 99:59:59,999 that we share and that everyone has access to. 
25 99:59:59,999 --> 99:59:59,999 More specifically, the software commons is what we are building 26 99:59:59,999 --> 99:59:59,999 with software that is open and that is available for all to use, to modify, 27 99:59:59,999 --> 99:59:59,999 to execute, to distribute. 28 99:59:59,999 --> 99:59:59,999 We know that those commons are a really critical part of our commons. 29 99:59:59,999 --> 99:59:59,999 Who's taking care of it? 30 99:59:59,999 --> 99:59:59,999 The software is fragile. 31 99:59:59,999 --> 99:59:59,999 Like all digital information, you can lose software. 32 99:59:59,999 --> 99:59:59,999 People can decide to shut down hosting spaces because of business decisions. 33 99:59:59,999 --> 99:59:59,999 People can hack into software hosting platforms and remove the code maliciously 34 99:59:59,999 --> 99:59:59,999 or just inadvertently. 35 99:59:59,999 --> 99:59:59,999 And, of course, for the obsolete stuff, there's rot. 36 99:59:59,999 --> 99:59:59,999 If you don't care about the data, then it rots and it decays and you lose it. 37 99:59:59,999 --> 99:59:59,999 So, where is the archive we go to when something is lost, 38 99:59:59,999 --> 99:59:59,999 when GitLab goes away, when GitHub goes away? 39 99:59:59,999 --> 99:59:59,999 Where do we go? 40 99:59:59,999 --> 99:59:59,999 Finally, there's one last thing that we noticed, it's that 41 99:59:59,999 --> 99:59:59,999 there's a lot of teams that work on research on software 42 99:59:59,999 --> 99:59:59,999 and there's no real big infrastructure for research on code. 43 99:59:59,999 --> 99:59:59,999 There's tons of critical issues around code: safety, security, verification, proofs. 44 99:59:59,999 --> 99:59:59,999 Nobody's doing this at a very large scale. 45 99:59:59,999 --> 99:59:59,999 If you want to see the stars, you go to the Atacama desert and 46 99:59:59,999 --> 99:59:59,999 you point a telescope at the sky. 47 99:59:59,999 --> 99:59:59,999 Where is the telescope for source code?
48 99:59:59,999 --> 99:59:59,999 That's what Software Heritage wants to be. 49 99:59:59,999 --> 99:59:59,999 What we do is we collect, we preserve and we share all the software 50 99:59:59,999 --> 99:59:59,999 that is publicly available. 51 99:59:59,999 --> 99:59:59,999 Why do we do that? We do that to preserve the past, to enhance the present 52 99:59:59,999 --> 99:59:59,999 and to prepare for the future. 53 99:59:59,999 --> 99:59:59,999 What we're building is a base infrastructure that can be used 54 99:59:59,999 --> 99:59:59,999 for cultural heritage, for industry, for research and for education purposes. 55 99:59:59,999 --> 99:59:59,999 How do we do it? We do it with an open approach. 56 99:59:59,999 --> 99:59:59,999 Every single line of code that we write is free software. 57 99:59:59,999 --> 99:59:59,999 We do it transparently, everything that we do, we do it in the open, 58 99:59:59,999 --> 99:59:59,999 be that on a mailing list or on our issue tracker. 59 99:59:59,999 --> 99:59:59,999 And we strive to do it for the very long haul, so we do it with replication in mind 60 99:59:59,999 --> 99:59:59,999 so that no single entity has full control over the data that we collect. 61 99:59:59,999 --> 99:59:59,999 And we do it in a non-profit fashion so that we avoid 62 99:59:59,999 --> 99:59:59,999 business-driven decisions impacting the project. 63 99:59:59,999 --> 99:59:59,999 So, what do we do concretely? 64 99:59:59,999 --> 99:59:59,999 We do archiving of version control systems. 65 99:59:59,999 --> 99:59:59,999 What does that mean? 66 99:59:59,999 --> 99:59:59,999 It means we archive file contents, so source code, files. 67 99:59:59,999 --> 99:59:59,999 We archive revisions, which means all the metadata of the history of the projects, 68 99:59:59,999 --> 99:59:59,999 we try to download it and we put it inside a common data model that is 69 99:59:59,999 --> 99:59:59,999 shared across all the archive. 
70 99:59:59,999 --> 99:59:59,999 We archive releases of the software, releases that have been tagged 71 99:59:59,999 --> 99:59:59,999 in a version control system as well as releases that we can find as tarballs 72 99:59:59,999 --> 99:59:59,999 because sometimes… boof, views of this source code differ. 73 99:59:59,999 --> 99:59:59,999 Of course, we archive where and when we've seen the data that we've collected. 74 99:59:59,999 --> 99:59:59,999 All of this, we put inside a canonical, VCS-agnostic data model. 75 99:59:59,999 --> 99:59:59,999 If you have a Debian package, with its history, if you have a git repository, 76 99:59:59,999 --> 99:59:59,999 if you have a Subversion repository, if you have a Mercurial repository, 77 99:59:59,999 --> 99:59:59,999 it all looks the same and you can work on it with the same tools. 78 99:59:59,999 --> 99:59:59,999 What we don't do is archive what's around the software, for instance 79 99:59:59,999 --> 99:59:59,999 the bug tracking systems or the homepages or the wikis or the mailing lists. 80 99:59:59,999 --> 99:59:59,999 There are some projects that work in this space, for instance 81 99:59:59,999 --> 99:59:59,999 the Internet Archive does a lot of really good work around archiving the web. 82 99:59:59,999 --> 99:59:59,999 Our goal is not to replace them, but to work with them and be able to do 83 99:59:59,999 --> 99:59:59,999 linking across all the archives that exist. 84 99:59:59,999 --> 99:59:59,999 For instance, for the mailing lists, there's the Gmane project 85 99:59:59,999 --> 99:59:59,999 that does a lot of archiving of free software mailing lists. 86 99:59:59,999 --> 99:59:59,999 So our long term vision is to play a part in a semantic Wikipedia of software, 87 99:59:59,999 --> 99:59:59,999 a Wikidata of software where we can hyperlink all the archives that exist 88 99:59:59,999 --> 99:59:59,999 and do stuff in the area. 89 99:59:59,999 --> 99:59:59,999 Quick tour of our infrastructure.
90 99:59:59,999 --> 99:59:59,999 Basically, all the way to the right is our archive. 91 99:59:59,999 --> 99:59:59,999 Our archive consists of a huge graph of all the metadata about 92 99:59:59,999 --> 99:59:59,999 the files, the directories, the revisions, the commits and the releases and 93 99:59:59,999 --> 99:59:59,999 all the projects that are on top of the graph. 94 99:59:59,999 --> 99:59:59,999 We separate the file storage into another object storage because of 95 99:59:59,999 --> 99:59:59,999 the size discrepancy: we have lots and lots of file contents that we need to store 96 99:59:59,999 --> 99:59:59,999 so we do that outside the database that is used to store the graph. 97 99:59:59,999 --> 99:59:59,999 Basically, what we archive is a set of software origins that are 98 99:59:59,999 --> 99:59:59,999 git repositories, Mercurial repositories, etc. etc. 99 99:59:59,999 --> 99:59:59,999 All those origins are loaded on a regular schedule. 100 99:59:59,999 --> 99:59:59,999 If there is a very active software origin, we're gonna archive it more often 101 99:59:59,999 --> 99:59:59,999 than stale things that don't get a lot of updates. 102 99:59:59,999 --> 99:59:59,999 What do we do to get the list of software origins that we archive? 103 99:59:59,999 --> 99:59:59,999 We have a bunch of listers that can scroll through the list of repositories, 104 99:59:59,999 --> 99:59:59,999 for instance on GitHub or other hosting platforms. 105 99:59:59,999 --> 99:59:59,999 We have code that can read Debian archive metadata to make a list of the packages 106 99:59:59,999 --> 99:59:59,999 that are inside this archive and can be archived, etc. 107 99:59:59,999 --> 99:59:59,999 All of this is done on a regular basis. 108 99:59:59,999 --> 99:59:59,999 We are currently working on some kind of push mechanism so that 109 99:59:59,999 --> 99:59:59,999 people or other systems can notify us of updates.
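The idea that active origins get visited more often than stale ones can be illustrated with a small sketch. The policy below (halve the delay after a visit that found changes, double it otherwise) and the names `next_interval`, `MIN_INTERVAL` and `MAX_INTERVAL` are assumptions made for illustration; the talk does not describe the scheduler's actual rule.

```python
from datetime import timedelta

# Hypothetical bounds on how often an origin may be visited.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current, changed):
    """Return the delay before the next visit of an origin.

    Halve the delay when the last visit found new content, double it
    otherwise, and clamp the result to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# An origin that keeps changing converges towards daily visits,
# while one that never changes drifts towards the maximum delay.
interval = timedelta(days=8)
for changed in (True, True, False):
    interval = next_interval(interval, changed)
    print(interval.days)
```

A multiplicative rule like this reacts quickly to a burst of activity while keeping the load from dormant repositories low; a real scheduler would likely also account for priorities and explicit archival requests.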
110 99:59:59,999 --> 99:59:59,999 Our goal is not to do real time archiving, we're really in it for the long run 111 99:59:59,999 --> 99:59:59,999 but we still want to be able to prioritize stuff that people tell us is 112 99:59:59,999 --> 99:59:59,999 important to archive. 113 99:59:59,999 --> 99:59:59,999 The Internet Archive has a "save now" button and we want to implement 114 99:59:59,999 --> 99:59:59,999 something along those lines as well, 115 99:59:59,999 --> 99:59:59,999 so if we know that some software project is in danger for one reason or another, 116 99:59:59,999 --> 99:59:59,999 then we can prioritize archiving it. 117 99:59:59,999 --> 99:59:59,999 So this is the basic structure of a revision in the Software Heritage archive. 118 99:59:59,999 --> 99:59:59,999 You'll see that it's very similar to a git commit. 119 99:59:59,999 --> 99:59:59,999 The format of the metadata is pretty much what you'll find in a git commit 120 99:59:59,999 --> 99:59:59,999 with some extensions that you don't see here because this is from a git commit. 121 99:59:59,999 --> 99:59:59,999 So basically what we do is we take the identifier of the directory 122 99:59:59,999 --> 99:59:59,999 that the revision points to, we take the identifier of the parent of the revision 123 99:59:59,999 --> 99:59:59,999 so we can keep track of the history 124 99:59:59,999 --> 99:59:59,999 and then we add some metadata, authorship and committership information 125 99:59:59,999 --> 99:59:59,999 and the revision message and then we take a hash of this, 126 99:59:59,999 --> 99:59:59,999 it makes an identifier that's probably unique, very very probably unique. 127 99:59:59,999 --> 99:59:59,999 Using those identifiers, we can retrace all the origins, all the history of 128 99:59:59,999 --> 99:59:59,999 development of the project and we can deduplicate across all the archive.
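The way a revision identifier is derived from the directory, the parents, the authorship information and the message can be sketched as follows. This is a minimal sketch that assumes git's own commit serialization and SHA-1, since the talk says the format is very similar to a git commit; the function name `revision_id` and the sample author are made up for illustration, and the archive's extensions are not modelled.

```python
import hashlib


def revision_id(directory_id, parent_ids, author, committer, message):
    """Compute a git-style SHA-1 identifier for a revision.

    directory_id and parent_ids are hex strings; author and committer are
    already-formatted strings such as "Name <email> 1500000000 +0000".
    """
    lines = [f"tree {directory_id}"]
    # One "parent" line per parent keeps the history linked.
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the revision message.
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git prefixes the object with its type and byte length before hashing.
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical root revision pointing at git's well-known empty tree.
example = revision_id(
    directory_id="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    parent_ids=[],
    author="Jane Doe <jane@example.org> 1500000000 +0000",
    committer="Jane Doe <jane@example.org> 1500000000 +0000",
    message="Initial revision\n",
)
print(example)
```

Because the identifier depends only on the serialized content, two archives that ingest the same revision compute the same identifier, and changing any field (a parent, the message, a timestamp) yields a different one.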
129 99:59:59,999 --> 99:59:59,999 All the identifiers are intrinsic, which means that we compute them 130 99:59:59,999 --> 99:59:59,999 from the contents of the things that we are archiving, which means that 131 99:59:59,999 --> 99:59:59,999 we can deduplicate very efficiently across all the data that we archive. 132 99:59:59,999 --> 99:59:59,999 How much data do we archive? 133 99:59:59,999 --> 99:59:59,999 A bit. 134 99:59:59,999 --> 99:59:59,999 So, we have passed the billion revision mark a few weeks ago. 135 99:59:59,999 --> 99:59:59,999 This graph is a bit old, but anyway, you have a live graph on our website. 136 99:59:59,999 --> 99:59:59,999 That's more than 4.5 billion unique source code files. 137 99:59:59,999 --> 99:59:59,999 We don't actually discriminate between what we would consider source code 138 99:59:59,999 --> 99:59:59,999 and what upstream developers consider as source code, 139 99:59:59,999 --> 99:59:59,999 so everything that's in a git repository, we consider as source code 140 99:59:59,999 --> 99:59:59,999 if it's below a size threshold. 141 99:59:59,999 --> 99:59:59,999 A billion revisions across 80 million projects. 142 99:59:59,999 --> 99:59:59,999 What do we archive? 143 99:59:59,999 --> 99:59:59,999 We archive GitHub, we archive Debian. 144 99:59:59,999 --> 99:59:59,999 So, Debian we run the archival process every day, every day we get the new packages 145 99:59:59,999 --> 99:59:59,999 that have been uploaded in the archive. 146 99:59:59,999 --> 99:59:59,999 GitHub, we try to keep up, we are currently working on some performance improvements, 147 99:59:59,999 --> 99:59:59,999 some scalability improvements to make sure that we can keep up 148 99:59:59,999 --> 99:59:59,999 with the development on GitHub.
149 99:59:59,999 --> 99:59:59,999 We have archived as a one-off thing the former content of Gitorious and Google Code 150 99:59:59,999 --> 99:59:59,999 which are two prominent code hosting spaces that closed recently 151 99:59:59,999 --> 99:59:59,999 and we've been working on archiving the contents of Bitbucket 152 99:59:59,999 --> 99:59:59,999 which is kind of a challenge because the API is a bit buggy and 153 99:59:59,999 --> 99:59:59,999 Atlassian isn't too interested in fixing it. 154 99:59:59,999 --> 99:59:59,999 In concrete storage terms, we have 175TB of blobs, so the files take 175TB 155 99:59:59,999 --> 99:59:59,999 and a kind of big database, 6TB. 156 99:59:59,999 --> 99:59:59,999 The database only contains the graph of the metadata for the archive 157 99:59:59,999 --> 99:59:59,999 which is basically an 8 billion node and 70 billion edge graph. 158 99:59:59,999 --> 99:59:59,999 And of course it's growing daily. 159 99:59:59,999 --> 99:59:59,999 We are pretty sure this is the richest source code archive that's available now 160 99:59:59,999 --> 99:59:59,999 and it keeps growing. 161 99:59:59,999 --> 99:59:59,999 So how do we actually… 162 99:59:59,999 --> 99:59:59,999 What kind of stack do we use to store all this? 163 99:59:59,999 --> 99:59:59,999 We use Debian, of course. 164 99:59:59,999 --> 99:59:59,999 All our deployment recipes are in Puppet in public repositories. 165 99:59:59,999 --> 99:59:59,999 We've started using Ceph for the blob storage. 166 99:59:59,999 --> 99:59:59,999 We use PostgreSQL for the metadata storage with some of the standard tools that 167 99:59:59,999 --> 99:59:59,999 live around PostgreSQL for backups and replication.
168 99:59:59,999 --> 99:59:59,999 We use a standard Python stack for scheduling of jobs 169 99:59:59,999 --> 99:59:59,999 and for web interface stuff, basically psycopg2 for the low level stuff, 170 99:59:59,999 --> 99:59:59,999 Django for the web stuff 171 99:59:59,999 --> 99:59:59,999 and Celery for the scheduling of jobs. 172 99:59:59,999 --> 99:59:59,999 In house, we've written an ad hoc object storage system which has 173 99:59:59,999 --> 99:59:59,999 a bunch of backends that you can use. 174 99:59:59,999 --> 99:59:59,999 Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… 175 99:59:59,999 --> 99:59:59,999 It's a really simple object storage system where you can just put an object, 176 99:59:59,999 --> 99:59:59,999 get an object, put a bunch of objects, get a bunch of objects. 177 99:59:59,999 --> 99:59:59,999 We've implemented removal but we don't really use it yet. 178 99:59:59,999 --> 99:59:59,999 All the data model implementation, all the listers, the loaders, the schedulers 179 99:59:59,999 --> 99:59:59,999 everything has been written by us, it's a pile of Python code. 180 99:59:59,999 --> 99:59:59,999 So, basically 20 Python packages and around 30 Puppet modules 181 99:59:59,999 --> 99:59:59,999 to deploy all that and we've done everything under a copyleft license, 182 99:59:59,999 --> 99:59:59,999 GPLv3 for the backend and AGPLv3 for the frontend. 183 99:59:59,999 --> 99:59:59,999 Even if people try and make their own Software Heritage using our code, 184 99:59:59,999 --> 99:59:59,999 they have to publish their changes.
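The put/get interface with interchangeable backends described above can be sketched in a few lines. This is not the project's code: the class names, the backend methods (`write`, `read`, `__contains__`) and the choice of SHA-1 of the content as the object key are all assumptions, the last one following the earlier remark that identifiers are computed from the contents.

```python
import hashlib
from pathlib import Path


class MemoryBackend:
    """Keeps objects in a dict; handy for tests."""

    def __init__(self):
        self._objects = {}

    def write(self, obj_id, data):
        self._objects[obj_id] = data

    def read(self, obj_id):
        return self._objects[obj_id]

    def __contains__(self, obj_id):
        return obj_id in self._objects


class DirectoryBackend:
    """Stores each object as a file named after its identifier."""

    def __init__(self, root):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def write(self, obj_id, data):
        (self._root / obj_id).write_bytes(data)

    def read(self, obj_id):
        return (self._root / obj_id).read_bytes()

    def __contains__(self, obj_id):
        return (self._root / obj_id).is_file()


class ObjectStorage:
    """Content-addressed store that delegates persistence to a backend."""

    def __init__(self, backend):
        self._backend = backend

    def put(self, data):
        # The identifier is derived from the bytes themselves, so storing
        # the same content twice writes it only once.
        obj_id = hashlib.sha1(data).hexdigest()
        if obj_id not in self._backend:
            self._backend.write(obj_id, data)
        return obj_id

    def get(self, obj_id):
        return self._backend.read(obj_id)

    def put_batch(self, blobs):
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids):
        return [self.get(obj_id) for obj_id in obj_ids]


store = ObjectStorage(MemoryBackend())
ids = store.put_batch([b"hello", b"world", b"hello"])
print(ids[0] == ids[2])  # identical contents share one identifier
```

Swapping `MemoryBackend()` for `DirectoryBackend("/some/path")` changes where the bytes live without touching the calling code, which is the kind of backend agnosticism the talk describes.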
185 99:59:59,999 --> 99:59:59,999 Hardware-wise, we run for now everything on a few hypervisors in house and 186 99:59:59,999 --> 99:59:59,999 our main storage is currently still on a very high density, very slow, 187 99:59:59,999 --> 99:59:59,999 very bulky storage array, but we've started to migrate all of this 188 99:59:59,999 --> 99:59:59,999 into a Ceph storage cluster which we're gonna grow as we need 189 99:59:59,999 --> 99:59:59,999 in the next few months. 190 99:59:59,999 --> 99:59:59,999 We've also been granted by Microsoft sponsorship, ??? sponsorship 191 99:59:59,999 --> 99:59:59,999 for their cloud services. 192 99:59:59,999 --> 99:59:59,999 We've started putting mirrors of everything in their infrastructure as well 193 99:59:59,999 --> 99:59:59,999 which means full object storage mirror, so 170TB of stuff mirrored on Azure 194 99:59:59,999 --> 99:59:59,999 as well as a database mirror for the graph. 195 99:59:59,999 --> 99:59:59,999 And we're also doing all the content indexing and all the things that need 196 99:59:59,999 --> 99:59:59,999 scalability on Azure now. 197 99:59:59,999 --> 99:59:59,999 Finally, at the University of Bologna, we have a backend storage for the download 198 99:59:59,999 --> 99:59:59,999 so currently our main storage is quite slow so if you want to download 199 99:59:59,999 --> 99:59:59,999 a bundle of things that we've archived, then we actually keep a cache of 200 99:59:59,999 --> 99:59:59,999 what we've done so that it doesn't take a million years to download stuff. 201 99:59:59,999 --> 99:59:59,999 We do our development in a classic free and open source software way, 202 99:59:59,999 --> 99:59:59,999 so we talk on our mailing list, on IRC, on a forge. 203 99:59:59,999 --> 99:59:59,999 Everything is in English, everything is public, there is more information 204 99:59:59,999 --> 99:59:59,999 on our website if you want to actually have a look and see what we do.
205 99:59:59,999 --> 99:59:59,999 So, all that is very interesting but how do we actually look into it? 206 99:59:59,999 --> 99:59:59,999 One of the ways that you can browse, that you can use the archive 207 99:59:59,999 --> 99:59:59,999 is using a REST API. 208 99:59:59,999 --> 99:59:59,999 Basically, this API allows you to do pointwise browsing of the archive 209 99:59:59,999 --> 99:59:59,999 so you can go and follow the links in the graph, 210 99:59:59,999 --> 99:59:59,999 which is very slow but gives you pretty much full access to the data. 211 99:59:59,999 --> 99:59:59,999 There's an index for the API that you can look at, but that's not really convenient, 212 99:59:59,999 --> 99:59:59,999 so we also have a web user interface. 213 99:59:59,999 --> 99:59:59,999 It's in preview right now, we're gonna do a full launch in the month of June. 214 99:59:59,999 --> 99:59:59,999 If you go to https://archive.softwareheritage.org/browse/ 215 99:59:59,999 --> 99:59:59,999 with the given credentials, you can have a look and see what's going on. 216 99:59:59,999 --> 99:59:59,999 Basically, we have a web interface that allows you to look at 217 99:59:59,999 --> 99:59:59,999 what origins we have downloaded, when we have downloaded the origins 218 99:59:59,999 --> 99:59:59,999 with a kind of graph view of how often we visited the origins 219 99:59:59,999 --> 99:59:59,999 and a calendar view of when we have visited the origins. 220 99:59:59,999 --> 99:59:59,999 And then, inside the visits, you can actually browse the contents 221 99:59:59,999 --> 99:59:59,999 that we've archived. 222 99:59:59,999 --> 99:59:59,999 So, for instance, this is the Python repository as of May 2017 223 99:59:59,999 --> 99:59:59,999 and you can have the list of files, then drill down, 224 99:59:59,999 --> 99:59:59,999 it should be pretty intuitive.
225 99:59:59,999 --> 99:59:59,999 If you look at the history of a project, you can see the differences 226 99:59:59,999 --> 99:59:59,999 between two revisions of a project. 227 99:59:59,999 --> 99:59:59,999 Oh no, that's the syntax highlighting, but anyway the diffs arrive right after.