1 99:59:59,999 --> 99:59:59,999 Hi, thank you. 2 99:59:59,999 --> 99:59:59,999 I'm Nicolas Dandrimont and I will indeed be talking to you about 3 99:59:59,999 --> 99:59:59,999 Software Heritage. 4 99:59:59,999 --> 99:59:59,999 I'm a software engineer for this project. 5 99:59:59,999 --> 99:59:59,999 I've been working on it for 3 years now. 6 99:59:59,999 --> 99:59:59,999 And we'll see what this thing is all about. 7 99:59:59,999 --> 99:59:59,999 [Mic not working] 8 99:59:59,999 --> 99:59:59,999 I guess the batteries are out. 9 99:59:59,999 --> 99:59:59,999 So, let's try that again. 10 99:59:59,999 --> 99:59:59,999 So, we all know, we've been doing free software for a while, 11 99:59:59,999 --> 99:59:59,999 that software source code is something special. 12 99:59:59,999 --> 99:59:59,999 Why is that? 13 99:59:59,999 --> 99:59:59,999 As Harold Abelson has said in SICP, his textbook on programming, 14 99:59:59,999 --> 99:59:59,999 programs are meant to be read by people and then incidentally for machines to execute. 15 99:59:59,999 --> 99:59:59,999 Basically, what software source code provides us is a way inside 16 99:59:59,999 --> 99:59:59,999 the mind of the designer of the program. 17 99:59:59,999 --> 99:59:59,999 For instance, you can have, you can get inside very crazy algorithms 18 99:59:59,999 --> 99:59:59,999 that can do very fast reverse square roots for 3D, that kind of stuff, 19 99:59:59,999 --> 99:59:59,999 like in the Quake 2 source code. 20 99:59:59,999 --> 99:59:59,999 You can also get inside the algorithms that are underpinning the internet, 21 99:59:59,999 --> 99:59:59,999 for instance seeing the net queue algorithm in the Linux kernel. 22 99:59:59,999 --> 99:59:59,999 What we are building as the free software community is the free software commons. 23 99:59:59,999 --> 99:59:59,999 Basically, the commons is all the cultural and social and natural resources 24 99:59:59,999 --> 99:59:59,999 that we share and that everyone has access to. 
25 99:59:59,999 --> 99:59:59,999 More specifically, the software commons is what we are building 26 99:59:59,999 --> 99:59:59,999 with software that is open and that is available for all to use, to modify, 27 99:59:59,999 --> 99:59:59,999 to execute, to distribute. 28 99:59:59,999 --> 99:59:59,999 We know that those commons are a really critical part of our commons. 29 99:59:59,999 --> 99:59:59,999 Who's taking care of it? 30 99:59:59,999 --> 99:59:59,999 The software is fragile. 31 99:59:59,999 --> 99:59:59,999 Like all digital information, you can lose software. 32 99:59:59,999 --> 99:59:59,999 People can decide to shut down hosting spaces because of business decisions. 33 99:59:59,999 --> 99:59:59,999 People can hack into software hosting platforms and remove the code maliciously 34 99:59:59,999 --> 99:59:59,999 or just inadvertently. 35 99:59:59,999 --> 99:59:59,999 And, of course, for the obsolete stuff, there's rot. 36 99:59:59,999 --> 99:59:59,999 If you don't care about the data, then it rots and it decays and you lose it. 37 99:59:59,999 --> 99:59:59,999 So, where is the archive we go to when something is lost, 38 99:59:59,999 --> 99:59:59,999 when GitLab goes away, when GitHub goes away? 39 99:59:59,999 --> 99:59:59,999 Where do we go? 40 99:59:59,999 --> 99:59:59,999 Finally, there's one last thing that we noticed, it's that 41 99:59:59,999 --> 99:59:59,999 there's a lot of teams that work on research on software 42 99:59:59,999 --> 99:59:59,999 and there's no real big infrastructure for research on code. 43 99:59:59,999 --> 99:59:59,999 There's tons of critical issues around code: safety, security, verification, proofs. 44 99:59:59,999 --> 99:59:59,999 Nobody's doing this at a very large scale. 45 99:59:59,999 --> 99:59:59,999 If you want to see the stars, you go to the Atacama desert and 46 99:59:59,999 --> 99:59:59,999 you point a telescope at the sky. 47 99:59:59,999 --> 99:59:59,999 Where is the telescope for source code? 
48 99:59:59,999 --> 99:59:59,999 That's what Software Heritage wants to be. 49 99:59:59,999 --> 99:59:59,999 What we do is we collect, we preserve and we share all the software 50 99:59:59,999 --> 99:59:59,999 that is publicly available. 51 99:59:59,999 --> 99:59:59,999 Why do we do that? We do that to preserve the past, to enhance the present 52 99:59:59,999 --> 99:59:59,999 and to prepare for the future. 53 99:59:59,999 --> 99:59:59,999 What we're building is a base infrastructure that can be used 54 99:59:59,999 --> 99:59:59,999 for cultural heritage, for industry, for research and for education purposes. 55 99:59:59,999 --> 99:59:59,999 How do we do it? We do it with an open approach. 56 99:59:59,999 --> 99:59:59,999 Every single line of code that we write is free software. 57 99:59:59,999 --> 99:59:59,999 We do it transparently, everything that we do, we do it in the open, 58 99:59:59,999 --> 99:59:59,999 be that on a mailing list or on our issue tracker. 59 99:59:59,999 --> 99:59:59,999 And we strive to do it for the very long haul, so we do it with replication in mind 60 99:59:59,999 --> 99:59:59,999 so that no single entity has full control over the data that we collect. 61 99:59:59,999 --> 99:59:59,999 And we do it in a non-profit fashion so that we avoid 62 99:59:59,999 --> 99:59:59,999 business-driven decisions impacting the project. 63 99:59:59,999 --> 99:59:59,999 So, what do we do concretely? 64 99:59:59,999 --> 99:59:59,999 We do archiving of version control systems. 65 99:59:59,999 --> 99:59:59,999 What does that mean? 66 99:59:59,999 --> 99:59:59,999 It means we archive file contents, so source code, files. 67 99:59:59,999 --> 99:59:59,999 We archive revisions, which means all the metadata of the history of the projects, 68 99:59:59,999 --> 99:59:59,999 we try to download it and we put it inside a common data model that is 69 99:59:59,999 --> 99:59:59,999 shared across all the archive. 
70 99:59:59,999 --> 99:59:59,999 We archive releases of the software, releases that have been tagged 71 99:59:59,999 --> 99:59:59,999 in a version control system as well as releases that we can find as tarballs 72 99:59:59,999 --> 99:59:59,999 because sometimes… boof, views of this source code differ. 73 99:59:59,999 --> 99:59:59,999 Of course, we archive where and when we've seen the data that we've collected. 74 99:59:59,999 --> 99:59:59,999 All of this, we put inside a canonical, VCS-agnostic, data model. 75 99:59:59,999 --> 99:59:59,999 If you have a Debian package, with its history, if you have a git repository, 76 99:59:59,999 --> 99:59:59,999 if you have a Subversion repository, if you have a Mercurial repository, 77 99:59:59,999 --> 99:59:59,999 it all looks the same and you can work on it with the same tools. 78 99:59:59,999 --> 99:59:59,999 What we don't do is archive what's around the software, for instance 79 99:59:59,999 --> 99:59:59,999 the bug tracking systems or the homepages or the wikis or the mailing lists. 80 99:59:59,999 --> 99:59:59,999 There are some projects that work in this space, for instance 81 99:59:59,999 --> 99:59:59,999 the Internet Archive does a lot of really good work around archiving the web. 82 99:59:59,999 --> 99:59:59,999 Our goal is not to replace them, but to work with them and be able to do 83 99:59:59,999 --> 99:59:59,999 linking across all the archives that exist. 84 99:59:59,999 --> 99:59:59,999 We can, for instance for the mailing lists there's the Gmane project 85 99:59:59,999 --> 99:59:59,999 that does a lot of archiving of free software mailing lists. 86 99:59:59,999 --> 99:59:59,999 So our long term vision is to play a part in a semantic Wikipedia of software, 87 99:59:59,999 --> 99:59:59,999 a Wikidata of software where we can hyperlink all the archives that exist 88 99:59:59,999 --> 99:59:59,999 and do stuff in the area. 89 99:59:59,999 --> 99:59:59,999 Quick tour of our infrastructure. 
90 99:59:59,999 --> 99:59:59,999 Basically, all the way to the right is our archive. 91 99:59:59,999 --> 99:59:59,999 Our archive consists of a huge graph of all the metadata about 92 99:59:59,999 --> 99:59:59,999 the files, the directories, the revisions, the commits and the releases and 93 99:59:59,999 --> 99:59:59,999 all the projects that are on top of the graph. 94 99:59:59,999 --> 99:59:59,999 We separate the file storage into another object storage because of 95 99:59:59,999 --> 99:59:59,999 the size discrepancy: we have lots and lots of file contents that we need to store 96 99:59:59,999 --> 99:59:59,999 so we do that outside the database that is used to store the graph. 97 99:59:59,999 --> 99:59:59,999 Basically, what we archive is a set of software origins that are 98 99:59:59,999 --> 99:59:59,999 git repositories, Mercurial repositories, etc. etc. 99 99:59:59,999 --> 99:59:59,999 All those origins are loaded on a regular schedule. 100 99:59:59,999 --> 99:59:59,999 If there is a very active software origin, we're gonna archive it more often 101 99:59:59,999 --> 99:59:59,999 than stale things that don't get a lot of updates. 102 99:59:59,999 --> 99:59:59,999 How do we get the list of software origins that we archive? 103 99:59:59,999 --> 99:59:59,999 We have a bunch of listers that can scroll through the list of repositories, 104 99:59:59,999 --> 99:59:59,999 for instance on GitHub or other hosting platforms. 105 99:59:59,999 --> 99:59:59,999 We have code that can read Debian archive metadata to make a list of the packages 106 99:59:59,999 --> 99:59:59,999 that are inside this archive and can be archived, etc. 107 99:59:59,999 --> 99:59:59,999 All of this is done on a regular basis. 108 99:59:59,999 --> 99:59:59,999 We are currently working on some kind of push mechanism so that 109 99:59:59,999 --> 99:59:59,999 people or other systems can notify us of updates. 
110 99:59:59,999 --> 99:59:59,999 Our goal is not to do real time archiving, we're really in it for the long run 111 99:59:59,999 --> 99:59:59,999 but we still want to be able to prioritize stuff that people tell us is 112 99:59:59,999 --> 99:59:59,999 important to archive. 113 99:59:59,999 --> 99:59:59,999 The Internet Archive has a "save now" button and we want to implement 114 99:59:59,999 --> 99:59:59,999 something along those lines as well, 115 99:59:59,999 --> 99:59:59,999 so if we know that some software project is in danger for one reason or another, 116 99:59:59,999 --> 99:59:59,999 then we can prioritize archiving it. 117 99:59:59,999 --> 99:59:59,999 So this is the basic structure of a revision in the Software Heritage archive. 118 99:59:59,999 --> 99:59:59,999 You'll see that it's very similar to a git commit. 119 99:59:59,999 --> 99:59:59,999 The format of the metadata is pretty much what you'll find in a git commit 120 99:59:59,999 --> 99:59:59,999 with some extensions that you don't see here because this is from a git commit. 121 99:59:59,999 --> 99:59:59,999 So basically what we do is we take the identifier of the directory 122 99:59:59,999 --> 99:59:59,999 that the revision points to, we take the identifier of the parent of the revision 123 99:59:59,999 --> 99:59:59,999 so we can keep track of the history 124 99:59:59,999 --> 99:59:59,999 and then we add some metadata, authorship and committership information 125 99:59:59,999 --> 99:59:59,999 and the revision message and then we take a hash of this, 126 99:59:59,999 --> 99:59:59,999 it makes an identifier that's probably unique, very very probably unique. 127 99:59:59,999 --> 99:59:59,999 Using those identifiers, we can retrace all the origins, all the history of 128 99:59:59,999 --> 99:59:59,999 development of the project and we can deduplicate across all the archive. 
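The way a revision identifier is derived from the directory identifier, the parent identifiers, the authorship information and the message can be sketched in Python. This is only a minimal sketch, assuming the plain git commit object layout and SHA-1; the function name `revision_id` and all sample values are hypothetical, and the Software Heritage extensions mentioned in the talk are not modelled.

```python
import hashlib


def revision_id(directory_id, parent_ids, author, committer, message):
    """Compute a git-style identifier for a revision (illustrative only)."""
    # The manifest lists the directory (git calls it "tree"), then the parents,
    # then authorship and committership, then a blank line and the message.
    lines = [f"tree {directory_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git prefixes the manifest with the object type and its length in bytes.
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical sample values, not taken from the archive.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "Jane Doe <jane@example.org> 1500000000 +0000"

root = revision_id(EMPTY_TREE, [], WHO, WHO, "Initial commit\n")
child = revision_id(EMPTY_TREE, [root], WHO, WHO, "Second commit\n")
```

Because the parent identifier is part of what gets hashed, `child` depends on the whole history behind it, and two origins that contain the same revision produce the same identifier, which is what makes deduplication possible.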
129 99:59:59,999 --> 99:59:59,999 All the identifiers are intrinsic, which means that we compute them 130 99:59:59,999 --> 99:59:59,999 from the contents of the things that we are archiving, which means that 131 99:59:59,999 --> 99:59:59,999 we can deduplicate very efficiently across all the data that we archive. 132 99:59:59,999 --> 99:59:59,999 How much data do we archive? 133 99:59:59,999 --> 99:59:59,999 A bit. 134 99:59:59,999 --> 99:59:59,999 So, we passed the billion revision mark a few weeks ago. 135 99:59:59,999 --> 99:59:59,999 This graph is a bit old, but anyway, you have a live graph on our website. 136 99:59:59,999 --> 99:59:59,999 That's more than 4.5 billion unique source code files. 137 99:59:59,999 --> 99:59:59,999 We don't actually discriminate between what we would consider is source code 138 99:59:59,999 --> 99:59:59,999 and what upstream developers consider as source code, 139 99:59:59,999 --> 99:59:59,999 so everything that's in a git repository, we consider as source code 140 99:59:59,999 --> 99:59:59,999 if it's below a size threshold. 141 99:59:59,999 --> 99:59:59,999 A billion revisions across 80 million projects. 142 99:59:59,999 --> 99:59:59,999 What do we archive? 143 99:59:59,999 --> 99:59:59,999 We archive GitHub, we archive Debian. 144 99:59:59,999 --> 99:59:59,999 So, Debian we run the archival process every day, every day we get the new packages 145 99:59:59,999 --> 99:59:59,999 that have been uploaded in the archive. 146 99:59:59,999 --> 99:59:59,999 GitHub, we try to keep up, we are currently working on some performance improvements, 147 99:59:59,999 --> 99:59:59,999 some scalability improvements to make sure that we can keep up 148 99:59:59,999 --> 99:59:59,999 with the development on GitHub. 
149 99:59:59,999 --> 99:59:59,999 We have archived as a one-off thing the former content of Gitorious and Google Code 150 99:59:59,999 --> 99:59:59,999 which are two prominent code hosting spaces that closed recently 151 99:59:59,999 --> 99:59:59,999 and we've been working on archiving the contents of Bitbucket 152 99:59:59,999 --> 99:59:59,999 which is kind of a challenge because the API is a bit buggy and 153 99:59:59,999 --> 99:59:59,999 Atlassian isn't too interested in fixing it. 154 99:59:59,999 --> 99:59:59,999 In concrete storage terms, we have 175TB of blobs, so the files take 175TB 155 99:59:59,999 --> 99:59:59,999 and a kind of big database, 6TB. 156 99:59:59,999 --> 99:59:59,999 The database only contains the graph of the metadata for the archive 157 99:59:59,999 --> 99:59:59,999 which is basically a graph of 8 billion nodes and 70 billion edges. 158 99:59:59,999 --> 99:59:59,999 And of course it's growing daily. 159 99:59:59,999 --> 99:59:59,999 We are pretty sure this is the richest source code archive that's available now 160 99:59:59,999 --> 99:59:59,999 and it keeps growing. 161 99:59:59,999 --> 99:59:59,999 So how do we actually… 162 99:59:59,999 --> 99:59:59,999 What kind of stack do we use to store all this? 163 99:59:59,999 --> 99:59:59,999 We use Debian, of course. 164 99:59:59,999 --> 99:59:59,999 All our deployment recipes are in Puppet in public repositories. 165 99:59:59,999 --> 99:59:59,999 We've started using Ceph for the blob storage. 166 99:59:59,999 --> 99:59:59,999 We use PostgreSQL for the metadata storage with some of the standard tools that 167 99:59:59,999 --> 99:59:59,999 live around PostgreSQL for backups and replication. 
168 99:59:59,999 --> 99:59:59,999 We use a standard Python stack for scheduling of jobs 169 99:59:59,999 --> 99:59:59,999 and for web interface stuff, basically psycopg2 for the low level stuff, 170 99:59:59,999 --> 99:59:59,999 Django for the web stuff 171 99:59:59,999 --> 99:59:59,999 and Celery for the scheduling of jobs. 172 99:59:59,999 --> 99:59:59,999 In house, we've written an ad hoc object storage system which has 173 99:59:59,999 --> 99:59:59,999 a bunch of backends that you can use. 174 99:59:59,999 --> 99:59:59,999 Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… 175 99:59:59,999 --> 99:59:59,999 It's a really simple object storage system where you can just put an object, 176 99:59:59,999 --> 99:59:59,999 get an object, put a bunch of objects, get a bunch of objects. 177 99:59:59,999 --> 99:59:59,999 We've implemented removal but we don't really use it yet. 178 99:59:59,999 --> 99:59:59,999 All the data model implementation, all the listers, the loaders, the schedulers 179 99:59:59,999 --> 99:59:59,999 everything has been written by us, it's a pile of Python code. 180 99:59:59,999 --> 99:59:59,999 So, basically 20 Python packages and around 30 Puppet modules 181 99:59:59,999 --> 99:59:59,999 to deploy all that and we've done everything under a copyleft license, 182 99:59:59,999 --> 99:59:59,999 GPLv3 for the backend and AGPLv3 for the frontend. 183 99:59:59,999 --> 99:59:59,999 Even if people try and make their own Software Heritage using our code, 184 99:59:59,999 --> 99:59:59,999 they have to publish their changes. 
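The "put an object, get an object, put a bunch of objects, get a bunch of objects" interface of the object storage can be illustrated with a small Python sketch. This is not the project's actual code or API: the class `InMemoryObjStorage` and its method names are hypothetical, the backend is a plain dictionary standing in for a filesystem, Azure or Ceph, and SHA-256 is assumed as the intrinsic key.

```python
import hashlib


class InMemoryObjStorage:
    """Toy content-addressed object storage (hypothetical interface)."""

    def __init__(self):
        self._objects = {}  # maps hex identifier -> raw bytes

    def put(self, content: bytes) -> str:
        # The key is computed from the content itself, so storing the
        # same bytes twice keeps a single copy.
        obj_id = hashlib.sha256(content).hexdigest()
        self._objects[obj_id] = content
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def put_batch(self, contents):
        return [self.put(c) for c in contents]

    def get_batch(self, obj_ids):
        return [self.get(i) for i in obj_ids]


storage = InMemoryObjStorage()
ids = storage.put_batch([b"int main() {}\n", b"print('hi')\n", b"int main() {}\n"])
stored_count = len(storage._objects)
```

Here the first and third objects are identical, so they share one identifier and `stored_count` is 2 rather than 3.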
185 99:59:59,999 --> 99:59:59,999 Hardware-wise, we run for now everything on a few hypervisors in house and 186 99:59:59,999 --> 99:59:59,999 our main storage is currently still on a very high density, very slow, 187 99:59:59,999 --> 99:59:59,999 very bulky storage array, but we've started to migrate all of this 188 99:59:59,999 --> 99:59:59,999 into a Ceph storage cluster which we're gonna grow as we need 189 99:59:59,999 --> 99:59:59,999 in the next few months. 190 99:59:59,999 --> 99:59:59,999 We've also been granted by Microsoft sponsorship, ??? sponsorship 191 99:59:59,999 --> 99:59:59,999 for their cloud services. 192 99:59:59,999 --> 99:59:59,999 We've started putting mirrors of everything in their infrastructure as well 193 99:59:59,999 --> 99:59:59,999 which means full object storage mirror, so 170TB of stuff mirrored on Azure 194 99:59:59,999 --> 99:59:59,999 as well as a database mirror for the graph. 195 99:59:59,999 --> 99:59:59,999 And we're also doing all the content indexing and all the things that need 196 99:59:59,999 --> 99:59:59,999 scalability on Azure now. 197 99:59:59,999 --> 99:59:59,999 Finally, at the University of Bologna, we have a backend storage for the download 198 99:59:59,999 --> 99:59:59,999 so currently our main storage is quite slow so if you want to download 199 99:59:59,999 --> 99:59:59,999 a bundle of things that we've archived, then we actually keep a cache of 200 99:59:59,999 --> 99:59:59,999 what we've done so that it doesn't take a million years to download stuff. 201 99:59:59,999 --> 99:59:59,999 We do our development in a classic free and open source software way, 202 99:59:59,999 --> 99:59:59,999 so we talk on our mailing list, on IRC, on a forge. 203 99:59:59,999 --> 99:59:59,999 Everything is in English, everything is public, there is more information 204 99:59:59,999 --> 99:59:59,999 on our website if you want to actually have a look and see what we do. 
205 99:59:59,999 --> 99:59:59,999 So, all that is very interesting but how do we actually look into it? 206 99:59:59,999 --> 99:59:59,999 One of the ways that you can browse, that you can use the archive 207 99:59:59,999 --> 99:59:59,999 is using a REST API. 208 99:59:59,999 --> 99:59:59,999 Basically, this API allows you to do pointwise browsing of the archive 209 99:59:59,999 --> 99:59:59,999 so you can go and follow the links in a graph, 210 99:59:59,999 --> 99:59:59,999 which is very slow but gives you pretty much full access to the data. 211 99:59:59,999 --> 99:59:59,999 There's an index for the API that you can look at, but that's not really convenient, 212 99:59:59,999 --> 99:59:59,999 so we also have a web user interface. 213 99:59:59,999 --> 99:59:59,999 It's in preview right now, we're gonna do a full launch in the month of June. 214 99:59:59,999 --> 99:59:59,999 If you go to https://archive.softwareheritage.org/browse/ 215 99:59:59,999 --> 99:59:59,999 with the given credentials, you can have a look and see what's going on. 216 99:59:59,999 --> 99:59:59,999 Basically, we have a web interface that allows you to look at 217 99:59:59,999 --> 99:59:59,999 what origins we have downloaded, when we have downloaded the origins 218 99:59:59,999 --> 99:59:59,999 with a kind of graph view of how often we visited the origins 219 99:59:59,999 --> 99:59:59,999 and a calendar view of when we have visited the origins. 220 99:59:59,999 --> 99:59:59,999 And then, inside the visits, you can actually browse the contents 221 99:59:59,999 --> 99:59:59,999 that we've archived. 222 99:59:59,999 --> 99:59:59,999 So, for instance, this is the Python repository as of May 2017 223 99:59:59,999 --> 99:59:59,999 and you can have the list of files, then drill down, 224 99:59:59,999 --> 99:59:59,999 it should be pretty intuitive. 
225 99:59:59,999 --> 99:59:59,999 If you look at the history of a project, you can see the differences 226 99:59:59,999 --> 99:59:59,999 between two revisions of a project. 227 99:59:59,999 --> 99:59:59,999 Oh no, that's the syntax highlighting, but anyway the diffs arrive right after. 228 99:59:59,999 --> 99:59:59,999 So, yeah, pretty cool stuff. 229 99:59:59,999 --> 99:59:59,999 I should be able to do a demo as well, it should work. 230 99:59:59,999 --> 99:59:59,999 I'm gonna zoom in. 231 99:59:59,999 --> 99:59:59,999 So this is the main archive, you can see some statistics about the objects 232 99:59:59,999 --> 99:59:59,999 that we've downloaded. 233 99:59:59,999 --> 99:59:59,999 When you zoom in, you get some kind of overflows, because… 234 99:59:59,999 --> 99:59:59,999 Yeah, why would you do that. 235 99:59:59,999 --> 99:59:59,999 If you want to browse, we can try to find an origin. 236 99:59:59,999 --> 99:59:59,999 "glibc". 237 99:59:59,999 --> 99:59:59,999 So there's lots and lots of, like, random GitHub forks of things… 238 99:59:59,999 --> 99:59:59,999 We don't discriminate and we don't really filter what we download. 239 99:59:59,999 --> 99:59:59,999 We are looking into doing some relevance kind of sorting of the results, here. 240 99:59:59,999 --> 99:59:59,999 Next. 241 99:59:59,999 --> 99:59:59,999 Xilinx, why not. 242 99:59:59,999 --> 99:59:59,999 So, this was downloaded for the last time on August 3rd 2016, 243 99:59:59,999 --> 99:59:59,999 so it's probably a dead repository, 244 99:59:59,999 --> 99:59:59,999 but yeah, you can see a bunch of source code, 245 99:59:59,999 --> 99:59:59,999 you can read the README of the glibc. 246 99:59:59,999 --> 99:59:59,999 If we go back to a more interesting origin, 247 99:59:59,999 --> 99:59:59,999 here's the repository for git. 248 99:59:59,999 --> 99:59:59,999 I've deliberately selected an old visit of the repo so that we can see 249 99:59:59,999 --> 99:59:59,999 what was going on then. 
250 99:59:59,999 --> 99:59:59,999 If I look at the calendar view, you can see that we've had some issues actually 251 99:59:59,999 --> 99:59:59,999 updating this, but anyway. 252 99:59:59,999 --> 99:59:59,999 If I look at the last visit, then we can actually browse the contents, 253 99:59:59,999 --> 99:59:59,999 you can get syntax highlighting as well. 254 99:59:59,999 --> 99:59:59,999 This is a big big file with lots of comments. 255 99:59:59,999 --> 99:59:59,999 Let's see the actual source code… 256 99:59:59,999 --> 99:59:59,999 Anyway, so, that's the browsing interface. 257 99:59:59,999 --> 99:59:59,999 We can also now get back what we've archived and download it, 258 99:59:59,999 --> 99:59:59,999 which is kind of something that you might want to do 259 99:59:59,999 --> 99:59:59,999 if a repository is lost, you can actually download it 260 99:59:59,999 --> 99:59:59,999 and get the source code back again. 261 99:59:59,999 --> 99:59:59,999 How do we do that? 262 99:59:59,999 --> 99:59:59,999 If you go on the top right of this browsing interface, you have actions and download 263 99:59:59,999 --> 99:59:59,999 and you can download a directory that you are currently looking at. 264 99:59:59,999 --> 99:59:59,999 It's an asynchronous process, which means that if there is a lot of load, 265 99:59:59,999 --> 99:59:59,999 then it's gonna take some time to actually be able to download the content. 266 99:59:59,999 --> 99:59:59,999 So you can put in your email address so we can notify you when the download is ready. 267 99:59:59,999 --> 99:59:59,999 I'm gonna try my luck and say just "ok" and it's gonna appear at some point 268 99:59:59,999 --> 99:59:59,999 in the list of things that I've requested. 269 99:59:59,999 --> 99:59:59,999 I've already requested some things that we can actually get and open as a tarball. 
270 99:59:59,999 --> 99:59:59,999 Yeah, I think that's the thing that I was actually looking at, 271 99:59:59,999 --> 99:59:59,999 which is this revision of the git source code 272 99:59:59,999 --> 99:59:59,999 and then I can open it. 273 99:59:59,999 --> 99:59:59,999 Yay, emacs, that's what you want. 274 99:59:59,999 --> 99:59:59,999 Yay, source code. 275 99:59:59,999 --> 99:59:59,999 This seems to work. 276 99:59:59,999 --> 99:59:59,999 And then, of course, if you want to actually script what you're doing, 277 99:59:59,999 --> 99:59:59,999 there's an API that allows you to do the downloads as well, so you can. 278 99:59:59,999 --> 99:59:59,999 The source code is deduplicated a lot, which means that for one single repository 279 99:59:59,999 --> 99:59:59,999 you get tons of files that we have to collect if you want to actually download 280 99:59:59,999 --> 99:59:59,999 an archive of a directory. 281 99:59:59,999 --> 99:59:59,999 It takes a while but we have an asynchronous API so you can POST 282 99:59:59,999 --> 99:59:59,999 the identifier of a revision to this URL and then get status updates 283 99:59:59,999 --> 99:59:59,999 and at some point, it will tell you that the… here 284 99:59:59,999 --> 99:59:59,999 The status will tell you that the object is available. 285 99:59:59,999 --> 99:59:59,999 You can download it and you can even download the full history of a project 286 99:59:59,999 --> 99:59:59,999 and get that as a git-fast-export archive that you can reimport into 287 99:59:59,999 --> 99:59:59,999 a new git repository. 288 99:59:59,999 --> 99:59:59,999 So any kind of VCS that we've imported, you can export as a git repository 289 99:59:59,999 --> 99:59:59,999 and reimport on your machine. 290 99:59:59,999 --> 99:59:59,999 How to get involved in the project? 291 99:59:59,999 --> 99:59:59,999 We have a lot of features that we're interested in, lots of them are now 292 99:59:59,999 --> 99:59:59,999 in early access or have been done. 
293 99:59:59,999 --> 99:59:59,999 There's some stuff that we would like help with. 294 99:59:59,999 --> 99:59:59,999 This is some stuff that we're working on: 295 99:59:59,999 --> 99:59:59,999 provenance information, you have a content 296 99:59:59,999 --> 99:59:59,999 you want to know which repository it comes from, 297 99:59:59,999 --> 99:59:59,999 that's something we're on. 298 99:59:59,999 --> 99:59:59,999 Full text search, the end goal is to be able even to trace 299 99:59:59,999 --> 99:59:59,999 the source of snippets of code that have been copied from one project to another. 300 99:59:59,999 --> 99:59:59,999 That's something that we can look into with the wealth of information that 301 99:59:59,999 --> 99:59:59,999 we have inside the archive. 302 99:59:59,999 --> 99:59:59,999 There's a lot of things that, 303 99:59:59,999 --> 99:59:59,999 I mean… 304 99:59:59,999 --> 99:59:59,999 There's a lot of things that people want to do with the archive. 305 99:59:59,999 --> 99:59:59,999 Our goal is to enable people to do things, to do interesting things 306 99:59:59,999 --> 99:59:59,999 with a lot of source code. 307 99:59:59,999 --> 99:59:59,999 If you have an idea of what you want to do with such an archive, 308 99:59:59,999 --> 99:59:59,999 please come talk to us 309 99:59:59,999 --> 99:59:59,999 and we'll be happy to help you help us. 310 99:59:59,999 --> 99:59:59,999 What we want to do is to diversify the sources of things that we archive. 311 99:59:59,999 --> 99:59:59,999 Currently, we have good support for git, we have OK support for Subversion 312 99:59:59,999 --> 99:59:59,999 and Mercurial. 313 99:59:59,999 --> 99:59:59,999 If your project of choice is in another version control system, 314 99:59:59,999 --> 99:59:59,999 we are gonna miss it. 315 99:59:59,999 --> 99:59:59,999 So people can contribute in this area. 
316 99:59:59,999 --> 99:59:59,999 For the listing part, we have coverage of Debian, we have coverage of GitHub, 317 99:59:59,999 --> 99:59:59,999 if your code is somewhere else, we won't see it, so we need people to contribute 318 99:59:59,999 --> 99:59:59,999 stuff that can list for instance GitLab instances, 319 99:59:59,999 --> 99:59:59,999 and then we can integrate that in our infrastructure and actually have 320 99:59:59,999 --> 99:59:59,999 people be able to archive their GitLab instances. 321 99:59:59,999 --> 99:59:59,999 And of course, we need to spread the word, make the project sustainable. 322 99:59:59,999 --> 99:59:59,999 We have a few sponsors now, Microsoft, Nokia, Huawei, GitHub has joined as a sponsor, 323 99:59:59,999 --> 99:59:59,999 the University of Bologna, of course Inria is sponsoring. 324 99:59:59,999 --> 99:59:59,999 But we need to keep spreading the word and keep the project sustainable. 325 99:59:59,999 --> 99:59:59,999 And, of course, we need to save endangered source code. 326 99:59:59,999 --> 99:59:59,999 For that, we have a suggestion box on the wiki that you can add things to. 327 99:59:59,999 --> 99:59:59,999 For instance, we have in the back of our minds archiving SourceForge, 328 99:59:59,999 --> 99:59:59,999 because we know that this isn't very sustainable and that it's at risk of being 329 99:59:59,999 --> 99:59:59,999 taken down at some point. 330 99:59:59,999 --> 99:59:59,999 If you want to join us, we also have some job openings that are available. 331 99:59:59,999 --> 99:59:59,999 For now it's in Paris, so if you want to consider coming to work with us in Paris, 332 99:59:59,999 --> 99:59:59,999 you can look into that. 333 99:59:59,999 --> 99:59:59,999 That's Software Heritage. 
334 99:59:59,999 --> 99:59:59,999 We are building a reference archive of all the free software 335 99:59:59,999 --> 99:59:59,999 that's ever been written 336 99:59:59,999 --> 99:59:59,999 in an international, open, non-profit and mutualised infrastructure 337 99:59:59,999 --> 99:59:59,999 that we have opened up to everyone, all users, vendors, developers can use it. 338 99:59:59,999 --> 99:59:59,999 The idea is to be at the service of the community and for society 339 99:59:59,999 --> 99:59:59,999 as a whole. 340 99:59:59,999 --> 99:59:59,999 So if you want to join us, you can look at our website, you can look at our code. 341 99:59:59,999 --> 99:59:59,999 You can also talk to me, so if you have any questions, 342 99:59:59,999 --> 99:59:59,999 I think we have 10, 12 minutes for questions. 343 99:59:59,999 --> 99:59:59,999 [Applause] 344 99:59:59,999 --> 99:59:59,999 Do you have questions? 345 99:59:59,999 --> 99:59:59,999 [Q] How do you protect the archive against stuff that you don't want to 346 99:59:59,999 --> 99:59:59,999 have in the archive? 347 99:59:59,999 --> 99:59:59,999 I think of stuff that is copyright-protected and that GitHub will also 348 99:59:59,999 --> 99:59:59,999 delete after a while. 349 99:59:59,999 --> 99:59:59,999 Worse, if I would misuse the archive as my private backup 350 99:59:59,999 --> 99:59:59,999 and store encrypted blocks on GitHub and you will eventually back them up 351 99:59:59,999 --> 99:59:59,999 for me. 352 99:59:59,999 --> 99:59:59,999 [A] There's, I think, two sides of the question. 353 99:59:59,999 --> 99:59:59,999 The first side is: 354 99:59:59,999 --> 99:59:59,999 Do we really archive only stuff that is free software and 355 99:59:59,999 --> 99:59:59,999 that we can redistribute and how do we manage, for instance, 356 99:59:59,999 --> 99:59:59,999 copyright takedown stuff. 357 99:59:59,999 --> 99:59:59,999 Currently, most of the infrastructure of the project is under French law. 
358 99:59:59,999 --> 99:59:59,999 There's a defined process to do copyright takedown in the French legal system. 359 99:59:59,999 --> 99:59:59,999 We would be really annoyed to have to take down content from the archive. 360 99:59:59,999 --> 99:59:59,999 What we do, however, is to mirror public information that is publicly available. 361 99:59:59,999 --> 99:59:59,999 Of course I'm not a lawyer for the project, so I can't really… 362 99:59:59,999 --> 99:59:59,999 I'm not 100% sure of what I'm about to say but 363 99:59:59,999 --> 99:59:59,999 what I know is that in the current state of French legislation, 364 99:59:59,999 --> 99:59:59,999 if the source of the data is still available 365 99:59:59,999 --> 99:59:59,999 so for instance if the data is still on GitHub, then you need to have 366 99:59:59,999 --> 99:59:59,999 GitHub take it down before we have to take it down. 367 99:59:59,999 --> 99:59:59,999 We're not currently filtering content for misuse of the archive, 368 99:59:59,999 --> 99:59:59,999 so the only thing that we do is put a limit on the size of the files 369 99:59:59,999 --> 99:59:59,999 that are archived in Software Heritage. 370 99:59:59,999 --> 99:59:59,999 The limit is pretty high, like 100MB. 371 99:59:59,999 --> 99:59:59,999 We can't really decide ourselves 372 99:59:59,999 --> 99:59:59,999 what is source code, what is not source code 373 99:59:59,999 --> 99:59:59,999 because for instance if your project is a cryptography library, 374 99:59:59,999 --> 99:59:59,999 you might want to have some encrypted blocks of data that are stored 375 99:59:59,999 --> 99:59:59,999 in your source code repository as test fixtures. 376 99:59:59,999 --> 99:59:59,999 And then, you need them to build the code and to make sure that it works. 377 99:59:59,999 --> 99:59:59,999 So, how would that be any different from your encrypted backup on GitHub? 
378 99:59:59,999 --> 99:59:59,999 How could we, Software Heritage, distinguish between proper use and misuse 379 99:59:59,999 --> 99:59:59,999 of the resources? 380 99:59:59,999 --> 99:59:59,999 I guess our long-term goal is to not have to care about misuse because 381 99:59:59,999 --> 99:59:59,999 it's gonna be a drop in the ocean. 382 99:59:59,999 --> 99:59:59,999 We're gonna have so much… 383 99:59:59,999 --> 99:59:59,999 We want to have enough space and enough resources 384 99:59:59,999 --> 99:59:59,999 that we don't really need to ask ourselves this question, basically. 385 99:59:59,999 --> 99:59:59,999 Thanks. 386 99:59:59,999 --> 99:59:59,999 Other questions? 387 99:59:59,999 --> 99:59:59,999 [Q] Have you looked at some form of authentication to provide additional 388 99:59:59,999 --> 99:59:59,999 assurance that the archived source code hasn't been modified or tampered with 389 99:59:59,999 --> 99:59:59,999 in some form? 390 99:59:59,999 --> 99:59:59,999 [A] First of all, all the identifiers for the objects that are inside the archive 391 99:59:59,999 --> 99:59:59,999 are cryptographic hashes of the contents that we've archived. 392 99:59:59,999 --> 99:59:59,999 So, for files, for instance, we take the SHA1, the SHA256, 393 99:59:59,999 --> 99:59:59,999 one of the BLAKE hashes and the Git-modified SHA1 of the file, 394 99:59:59,999 --> 99:59:59,999 and we use that in the manifest for the directories. 395 99:59:59,999 --> 99:59:59,999 So the directory identifiers are a hash of the manifest, 396 99:59:59,999 --> 99:59:59,999 of the list of files that are inside the directory, etc. 397 99:59:59,999 --> 99:59:59,999 So, recursively, you can make sure that the data that we give back to you 398 99:59:59,999 --> 99:59:59,999 has not been altered, at least by a bit flip or anything.
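[Note] The hash scheme described in this answer can be sketched roughly as follows. This is a minimal illustration, not Software Heritage's actual code: the function names `content_hashes` and `directory_id`, the choice of BLAKE2s as the BLAKE variant, and the tab-separated manifest layout are all assumptions made for the example. Only the Git blob hash (SHA1 over a `blob <size>\0` header followed by the file contents) follows a known, fixed format.

```python
import hashlib


def content_hashes(data: bytes) -> dict[str, str]:
    """Compute the four per-file checksums mentioned in the answer."""
    # Git hashes a blob as SHA1("blob <size>\0" + contents).
    git_header = b"blob %d\0" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: the talk only says "one of the BLAKE hashes".
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


def directory_id(entries: dict[str, str]) -> str:
    """Hash a simplified manifest mapping entry names to child identifiers.

    Hypothetical layout: one "name<TAB>identifier" line per entry, sorted
    by name so that the result does not depend on insertion order.
    """
    manifest = "".join(
        f"{name}\t{child_id}\n" for name, child_id in sorted(entries.items())
    )
    return hashlib.sha1(manifest.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    readme = content_hashes(b"hello\n")["sha1_git"]
    src = directory_id({"main.py": content_hashes(b"print('hi')\n")["sha1_git"]})
    root = directory_id({"README": readme, "src": src})
    print("root directory identifier:", root)
```

Because a directory identifier depends on the identifiers of everything below it, changing a single byte in one file changes every identifier up to the root, which is what lets a reader recompute the hashes and re-check the data they get back.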
399 99:59:59,999 --> 99:59:59,999 We regularly run a scrub of the data that we have in the archive, 400 99:59:59,999 --> 99:59:59,999 so we make sure that there's no rot inside our archive. 401 99:59:59,999 --> 99:59:59,999 We've not looked into, basically, attestation of… 402 99:59:59,999 --> 99:59:59,999 for instance, making sure that the code that we've downloaded… 403 99:59:59,999 --> 99:59:59,999 I mean, we're not doing anything more than taking a picture of the data 404 99:59:59,999 --> 99:59:59,999 and we say "We've computed this hash. Maybe the code that's been presented 405 99:59:59,999 --> 99:59:59,999 by GitHub to Software Heritage is different from what you've uploaded to GitHub, 406 99:59:59,999 --> 99:59:59,999 we can't tell." 407 99:59:59,999 --> 99:59:59,999 In the case of Git, you can always use the identifiers of the objects 408 99:59:59,999 --> 99:59:59,999 that you've pushed, so you have the commit hash, 409 99:59:59,999 --> 99:59:59,999 which is itself a cryptographic identifier of the contents of the commit. 410 99:59:59,999 --> 99:59:59,999 In turn, if the commit is signed, then the signature is still stored 411 99:59:59,999 --> 99:59:59,999 in the Software Heritage metadata and you can reproduce the original Git object 412 99:59:59,999 --> 99:59:59,999 and check the signature, but we've not done anything specific for Software Heritage 413 99:59:59,999 --> 99:59:59,999 in this area. 414 99:59:59,999 --> 99:59:59,999 Does that answer your question? 415 99:59:59,999 --> 99:59:59,999 Cool. 416 99:59:59,999 --> 99:59:59,999 Other questions? 417 99:59:59,999 --> 99:59:59,999 There's one in front. 418 99:59:59,999 --> 99:59:59,999 [Q] It's partially a question, partially a comment. 419 99:59:59,999 --> 99:59:59,999 Your initial idea was to have a telescope, or something like this for source code.
420 99:59:59,999 --> 99:59:59,999 For now, for me, it looks a little bit more like a microscope, 421 99:59:59,999 --> 99:59:59,999 so you can focus on one thing, but that's not much. 422 99:59:59,999 --> 99:59:59,999 So have you thought about how to analyze an entire ecosystem 423 99:59:59,999 --> 99:59:59,999 or something like this? 424 99:59:59,999 --> 99:59:59,999 For example, now we have Django 2, which is Python 3 only, so it would be interesting to 425 99:59:59,999 --> 99:59:59,999 look at all Django modules to see when they start moving to this Django. 426 99:59:59,999 --> 99:59:59,999 So we would need to start analyzing thousands or millions of files, but then 427 99:59:59,999 --> 99:59:59,999 we would need something SQL-like, or some MapReduce jobs 428 99:59:59,999 --> 99:59:59,999 or something like this for this. 429 99:59:59,999 --> 99:59:59,999 [A] Yes. 430 99:59:59,999 --> 99:59:59,999 So, we've started… 431 99:59:59,999 --> 99:59:59,999 The two initiators of the project, Roberto Di Cosmo and Stefano Zacchiroli, 432 99:59:59,999 --> 99:59:59,999 are both researchers in computer science so they have a strong background in 433 99:59:59,999 --> 99:59:59,999 actually mining software repositories and doing some large-scale analysis 434 99:59:59,999 --> 99:59:59,999 on source code. 435 99:59:59,999 --> 99:59:59,999 We've been talking with research groups whose main goal is to do analysis on 436 99:59:59,999 --> 99:59:59,999 large-scale source code archives. 437 99:59:59,999 --> 99:59:59,999 One of the first mirrors of the archive outside of our control 438 99:59:59,999 --> 99:59:59,999 will be in Grenoble (France). 439 99:59:59,999 --> 99:59:59,999 There are a few teams that work on actually doing large-scale research 440 99:59:59,999 --> 99:59:59,999 on source code over there, 441 99:59:59,999 --> 99:59:59,999 so that's what the mirror will be used for. 442 99:59:59,999 --> 99:59:59,999 We've also been looking at what the Google open source team does.
443 99:59:59,999 --> 99:59:59,999 They have this big repository with all the code that Google uses 444 99:59:59,999 --> 99:59:59,999 and they've started to push back, like doing large-scale analysis of 445 99:59:59,999 --> 99:59:59,999 security vulnerabilities, issues with static and dynamic analysis 446 99:59:59,999 --> 99:59:59,999 of the code, and they've started pushing their fixes upstream. 447 99:59:59,999 --> 99:59:59,999 That's something that we want to enable users to do; 448 99:59:59,999 --> 99:59:59,999 that's not something that we want to do ourselves, but we want to make sure 449 99:59:59,999 --> 99:59:59,999 that people can do it using our archive. 450 99:59:59,999 --> 99:59:59,999 So we'd be happy to work with people who already do that so that 451 99:59:59,999 --> 99:59:59,999 they can use their knowledge and their tools inside our archive. 452 99:59:59,999 --> 99:59:59,999 Does that answer your question? 453 99:59:59,999 --> 99:59:59,999 Cool. 454 99:59:59,999 --> 99:59:59,999 Any more questions? 455 99:59:59,999 --> 99:59:59,999 No? Then thank you very much, Nicolas. 456 99:59:59,999 --> 99:59:59,999 Thank you. 457 99:59:59,999 --> 99:59:59,999 [Applause]