1 00:00:05,480 --> 00:00:07,105 Hi, thank you. 2 00:00:07,879 --> 00:00:11,293 I'm Nicolas Dandrimont and I will indeed be talking to you about 3 00:00:11,293 --> 00:00:12,551 Software Heritage. 4 00:00:12,876 --> 00:00:15,232 I'm a software engineer for this project. 5 00:00:15,639 --> 00:00:17,712 I've been working on it for 3 years now. 6 00:00:18,485 --> 00:00:21,572 And we'll see what this thing is all about. 7 00:00:23,767 --> 00:00:38,808 [Mic not working] 8 00:00:39,174 --> 00:00:40,752 I guess the batteries are out. 9 00:00:49,949 --> 00:00:51,720 So, let's try that again. 10 00:00:52,050 --> 00:00:55,380 So, we all know, we've been doing free software for a while, 11 00:00:55,616 --> 00:00:59,806 that software source code is something special. 12 00:01:00,779 --> 00:01:02,031 Why is that? 13 00:01:02,731 --> 00:01:09,963 As Harold Abelson has said in SICP, his textbook on programming, 14 00:01:09,963 --> 00:01:18,782 programs are meant to be read by people and then incidentally for machines to execute. 15 00:01:20,213 --> 00:01:25,661 Basically, what software source code provides us is a way inside 16 00:01:25,661 --> 00:01:28,547 the mind of the designer of the program. 17 00:01:29,309 --> 00:01:37,938 For instance, you can have, you can get inside very crazy algorithms 18 00:01:37,938 --> 00:01:46,564 that can do very fast reverse square roots for 3D, that kind of stuff 19 00:01:47,211 --> 00:01:49,524 Like in the Quake 2 source code. 20 00:01:49,860 --> 00:01:54,606 You can also get inside the algorithms that are underpinning the internet, 21 00:01:54,606 --> 00:01:59,765 for instance seeing the net queue algorithm in the Linux kernel. 22 00:02:03,631 --> 00:02:10,218 What we are building as the free software community is the free software commons. 23 00:02:10,948 --> 00:02:18,629 Basically, the commons is all the cultural and social and natural resources 24 00:02:18,629 --> 00:02:21,802 that we share and that everyone has access to. 
25 00:02:22,410 --> 00:02:25,744 More specifically, the software commons is what we are building 26 00:02:25,744 --> 00:02:31,878 with software that is open and that is available for all to use, to modify, 27 00:02:31,878 --> 00:02:34,887 to execute, to distribute. 28 00:02:37,252 --> 00:02:45,251 We know that those commons are a really critical part of our commons. 29 00:02:46,306 --> 00:02:48,137 Who's taking care of it? 30 00:02:49,684 --> 00:02:51,800 The software is fragile. 31 00:02:51,800 --> 00:02:54,405 Like all digital information, you can lose software. 32 00:02:55,625 --> 00:03:01,634 People can decide to shut down hosting spaces because of business decisions. 33 00:03:02,939 --> 00:03:08,913 People can hack into software hosting platforms and remove the code maliciously 34 00:03:08,913 --> 00:03:10,864 or just inadvertently. 35 00:03:12,978 --> 00:03:17,898 And, of course, for the obsolete stuff, there's rot. 36 00:03:18,468 --> 00:03:24,773 If you don't care about the data, then it rots and it decays and you lose it. 37 00:03:26,157 --> 00:03:31,238 So, where is the archive we go to when something is lost, 38 00:03:31,238 --> 00:03:33,965 when GitLab goes away, when GitHub goes away? 39 00:03:34,411 --> 00:03:35,708 Where do we go? 40 00:03:36,519 --> 00:03:40,989 Finally, there's one last thing that we noticed, it's that 41 00:03:40,989 --> 00:03:48,581 there's a lot of teams that work on research on software 42 00:03:48,581 --> 00:03:54,310 and there's no real big infrastructure for research on code. 43 00:03:56,510 --> 00:04:02,129 There's tons of critical issues around code: safety, security, verification, proofs. 44 00:04:03,583 --> 00:04:07,694 Nobody's doing this at a very large scale. 45 00:04:08,466 --> 00:04:12,244 If you want to see the stars, you go to the Atacama desert and 46 00:04:12,244 --> 00:04:13,830 you point a telescope at the sky. 47 00:04:14,477 --> 00:04:17,526 Where is the telescope for source code?
48 00:04:17,973 --> 00:04:20,983 That's what Software Heritage wants to be. 49 00:04:22,081 --> 00:04:27,651 What we do is we collect, we preserve and we share all the software 50 00:04:27,651 --> 00:04:29,887 that is publicly available. 51 00:04:31,139 --> 00:04:35,852 Why do we do that? We do that to preserve the past, to enhance the present 52 00:04:35,852 --> 00:04:37,848 and to prepare for the future. 53 00:04:39,715 --> 00:04:44,588 What we're building is a base infrastructure that can be used 54 00:04:44,588 --> 00:04:50,359 for cultural heritage, for industry, for research and for education purposes. 55 00:04:50,724 --> 00:04:53,120 How do we do it? We do it with an open approach. 56 00:04:53,406 --> 00:04:56,613 Every single line of code that we write is free software. 57 00:04:59,088 --> 00:05:04,653 We do it transparently, everything that we do, we do it in the open, 58 00:05:04,653 --> 00:05:09,124 be that on a mailing list or on our issue tracker. 59 00:05:09,858 --> 00:05:15,873 And we strive to do it for the very long haul, so we do it with replication in mind 60 00:05:15,873 --> 00:05:21,806 so that no single entity has full control over the data that we collect. 61 00:05:22,945 --> 00:05:27,335 And we do it in a non-profit fashion so that we avoid 62 00:05:27,335 --> 00:05:32,786 business-driven decisions impacting the project. 63 00:05:35,470 --> 00:05:38,683 So, what do we do concretely? 64 00:05:39,009 --> 00:05:42,951 We do archiving of version control systems. 65 00:05:43,276 --> 00:05:44,617 What does that mean? 66 00:05:45,755 --> 00:05:49,411 It means we archive file contents, so source code, files. 67 00:05:49,411 --> 00:05:55,673 We archive revisions, which means all the metadata of the history of the projects, 68 00:05:55,673 --> 00:06:03,148 we try to download it and we put it inside a common data model that is 69 00:06:03,148 --> 00:06:06,968 shared across all the archive. 
70 00:06:08,555 --> 00:06:13,590 We archive releases of the software, releases that have been tagged 71 00:06:13,590 --> 00:06:18,339 in a version control system as well as releases that we can find as tarballs 72 00:06:18,339 --> 00:06:23,945 because sometimes… boof, views of this source code differ. 73 00:06:27,814 --> 00:06:32,367 Of course, we archive where and when we've seen the data that we've collected. 74 00:06:32,977 --> 00:06:40,266 All of this, we put inside a canonical, VCS-agnostic, data model. 75 00:06:41,983 --> 00:06:46,782 If you have a Debian package, with its history, if you have a git repository, 76 00:06:46,782 --> 00:06:50,197 if you have a subversion repository, if you have a mercurial repository, 77 00:06:50,197 --> 00:06:53,857 it all looks the same and you can work on it with the same tools. 78 00:06:54,995 --> 00:07:01,415 What we don't do is archive what's around the software, for instance 79 00:07:01,415 --> 00:07:05,720 the bug tracking systems or the homepages or the wikis or the mailing lists. 80 00:07:06,696 --> 00:07:10,555 There are some projects that work in this space, for instance 81 00:07:10,555 --> 00:07:15,798 the internet archive does a lot of very good work around archiving the web. 82 00:07:17,658 --> 00:07:24,417 Our goal is not to replace them, but to work with them and be able to do 83 00:07:24,417 --> 00:07:29,291 linking across all the archives that exist. 84 00:07:29,705 --> 00:07:35,020 We can, for instance for the mailing lists there's the gmane project 85 00:07:35,020 --> 00:07:38,998 that does a lot of archiving of free software mailing lists. 86 00:07:39,729 --> 00:07:47,743 So our long term vision is to play a part in a semantic wikipedia of software, 87 00:07:47,743 --> 00:07:53,921 a wikidata of software where we can hyperlink all the archives that exist 88 00:07:53,921 --> 00:07:56,853 and do stuff in the area. 89 00:08:00,594 --> 00:08:02,591 Quick tour of our infrastructure. 
90 00:08:02,828 --> 00:08:10,224 Basically, all the way to the right is our archive. 91 00:08:11,447 --> 00:08:16,851 Our archive consists of a huge graph of all the metadata about 92 00:08:16,851 --> 00:08:24,617 the files, the directories, the revisions, the commits and the releases and 93 00:08:24,617 --> 00:08:27,784 all the projects that are on top of the graph. 94 00:08:29,129 --> 00:08:33,600 We separate the file storage into another object storage because of 95 00:08:33,600 --> 00:08:41,654 the size discrepancy: we have lots and lots of file contents that we need to store 96 00:08:41,654 --> 00:08:46,323 so we do that outside of the database that is used to store the graph. 97 00:08:49,495 --> 00:08:54,162 Basically, what we archive is a set of software origins that are 98 00:08:54,162 --> 00:08:58,830 git repositories, mercurial repositories, etc. etc. 99 00:08:59,689 --> 00:09:05,254 All those origins are loaded on a regular schedule. 100 00:09:06,887 --> 00:09:13,472 If there is a very active software origin, we're gonna archive it more often 101 00:09:13,472 --> 00:09:17,750 than stale things that don't get a lot of updates. 102 00:09:19,663 --> 00:09:24,415 What do we do to get the list of software origins that we archive? 103 00:09:24,821 --> 00:09:30,677 We have a bunch of listers that can scroll through the list of repositories, 104 00:09:30,677 --> 00:09:33,767 for instance on GitHub or other hosting platforms. 105 00:09:34,945 --> 00:09:42,181 We have code that can read Debian archive metadata to make a list of the packages 106 00:09:42,181 --> 00:09:49,412 that are inside this archive and can be archived, etc. 107 00:09:50,387 --> 00:09:52,611 All of this is done on a regular basis. 108 00:09:53,515 --> 00:09:57,450 We are currently working on some kind of push mechanism so that 109 00:09:57,450 --> 00:10:01,485 people or other systems can notify us of updates.
110 00:10:02,990 --> 00:10:09,673 Our goal is not to do real time archiving, we're really in it for the long run 111 00:10:09,673 --> 00:10:16,010 but we still want to be able to prioritize stuff that people tell us is 112 00:10:16,010 --> 00:10:17,879 important to archive. 113 00:10:19,951 --> 00:10:23,930 The internet archive has a "save now" button and we want to implement 114 00:10:23,930 --> 00:10:26,245 something along those lines as well, 115 00:10:26,245 --> 00:10:31,545 so if we know that some software project is in danger for a reason or another, 116 00:10:31,545 --> 00:10:34,145 then we can prioritize archiving it. 117 00:10:35,811 --> 00:10:39,916 So this is the basic structure of a revision in the software heritage archive. 118 00:10:41,987 --> 00:10:45,073 You'll see that it's very similar to a git commit. 119 00:10:47,833 --> 00:10:53,723 The format of the metadata is pretty much what you'll find in a git commit 120 00:10:53,723 --> 00:10:59,006 with some extensions that you don't see here because this is from a git commit 121 00:11:00,713 --> 00:11:09,620 So basically what we do is we take the identifier of the directory 122 00:11:09,620 --> 00:11:16,200 that the revision points to, we take the identifier of the parent of the revision 123 00:11:16,200 --> 00:11:18,722 so we can keep track of the history 124 00:11:18,722 --> 00:11:24,817 and then we add some metadata, authorship and commitership information 125 00:11:24,817 --> 00:11:28,880 and the revision message and then we take a hash of this, 126 00:11:28,880 --> 00:11:37,050 it makes an identifier that's probably unique, very very probably unique. 127 00:11:40,257 --> 00:11:46,924 Using those identifiers, we can retrace all the origins, all the history of 128 00:11:46,924 --> 00:11:51,747 development of the project and we can deduplicate across all the archive. 
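[Editor's note] The revision identifier described above (directory identifier, parent identifiers, authorship and committer information, and the message, all hashed together) can be sketched as follows. This is a minimal sketch that assumes git's commit object layout and SHA-1; the function name `revision_id` and the sample author are hypothetical, and the archive's own implementation may differ.

```python
import hashlib


def revision_id(directory, parents, author, committer, message):
    """Hash a revision the way git hashes a commit object (assumed layout)."""
    lines = [f"tree {directory}"]
    # One "parent" line per parent keeps track of the history.
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode()
    # git prefixes the body with the object type and its length in bytes.
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical sample values, for illustration only.
who = "Jane Doe <jane@example.org> 1500000000 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

first = revision_id(empty_tree, [], who, who, "Initial commit\n")
second = revision_id(empty_tree, [first], who, who, "Second commit\n")
```

Because the identifier depends only on the content, two origins that contain the same revision yield the same identifier, which is what makes deduplication across the archive possible.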
129 00:11:52,493 --> 00:11:58,673 All the identifiers are intrinsic, which means that we compute them 130 00:11:58,673 --> 00:12:03,917 from the contents of the things that we are archiving, which means that 131 00:12:03,917 --> 00:12:11,436 we can deduplicate very efficiently across all the data that we archive. 132 00:12:12,248 --> 00:12:14,283 How much data do we archive? 133 00:12:17,128 --> 00:12:18,224 A bit. 134 00:12:18,590 --> 00:12:23,828 So, we have passed the billion revision mark a few weeks ago. 135 00:12:25,298 --> 00:12:29,966 This graph is a bit old, but anyway, you have a live graph on our website. 136 00:12:31,468 --> 00:12:35,860 That's more than 4.5 billion unique source code files. 137 00:12:38,261 --> 00:12:45,170 We don't actually discriminate between what we would consider is source code 138 00:12:45,170 --> 00:12:48,181 and what upstream developers consider as source code, 139 00:12:48,181 --> 00:12:52,327 so everything that's in a git repository, we consider as source code 140 00:12:52,327 --> 00:12:54,878 if it's below a size threshold. 141 00:12:55,980 --> 00:13:00,242 A billion revisions across 80 million projects. 142 00:13:01,389 --> 00:13:02,930 What do we archive? 143 00:13:02,930 --> 00:13:04,718 We archive Github, we archive Debian. 144 00:13:06,677 --> 00:13:11,910 So, Debian we run the archival process every day, every day we get the new packages 145 00:13:11,910 --> 00:13:13,740 that have been uploaded in the archive. 146 00:13:14,308 --> 00:13:21,453 Github, we try to keep up, we are currently working on some performance improvements, 147 00:13:21,453 --> 00:13:25,324 some scalability improvements to make sure that we can keep up 148 00:13:25,324 --> 00:13:27,478 with the development on GitHub. 
149 00:13:29,227 --> 00:13:40,117 We have archived as a one-off thing the former contents of Gitorious and Google Code 150 00:13:40,513 --> 00:13:46,727 which are two prominent code hosting spaces that closed recently 151 00:13:47,743 --> 00:13:53,993 and we've been working on archiving the contents of Bitbucket 152 00:13:53,993 --> 00:13:59,944 which is kind of a challenge because the API is a bit buggy and 153 00:13:59,944 --> 00:14:03,401 Atlassian isn't too interested in fixing it. 154 00:14:06,084 --> 00:14:16,651 In concrete storage terms, we have 175TB of blobs, so the files take 175TB 155 00:14:16,651 --> 00:14:19,902 and a kind of big database, 6TB. 156 00:14:21,165 --> 00:14:28,315 The database only contains the graph of the metadata for the archive 157 00:14:28,315 --> 00:14:34,697 which is basically a graph of 8 billion nodes and 70 billion edges. 158 00:14:35,386 --> 00:14:37,460 And of course it's growing daily. 159 00:14:37,946 --> 00:14:42,823 We are pretty sure this is the richest public source code archive that's available now 160 00:14:43,015 --> 00:14:44,763 and it keeps growing. 161 00:14:46,469 --> 00:14:48,987 So how do we actually… 162 00:14:49,475 --> 00:14:53,294 What kind of stack do we use to store all this? 163 00:14:54,762 --> 00:14:56,555 We use Debian, of course. 164 00:14:57,685 --> 00:15:02,934 All our deployment recipes are in Puppet in public repositories. 165 00:15:04,076 --> 00:15:07,731 We've started using Ceph for the blob storage. 166 00:15:09,404 --> 00:15:14,441 We use PostgreSQL for the metadata storage with some of the standard tools that 167 00:15:14,996 --> 00:15:18,172 live around PostgreSQL for backups and replication.
168 00:15:20,041 --> 00:15:27,766 We use a standard Python stack for scheduling of jobs 169 00:15:27,766 --> 00:15:35,362 and for web interface stuff, basically psycopg2 for the low level stuff, 170 00:15:35,362 --> 00:15:38,173 Django for the web stuff 171 00:15:38,173 --> 00:15:44,353 and Celery for the scheduling of jobs. 172 00:15:45,481 --> 00:15:50,453 In house, we've written an ad hoc object storage system which has 173 00:15:50,453 --> 00:15:53,351 a bunch of backends that you can use. 174 00:15:53,821 --> 00:16:03,052 Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… 175 00:16:03,418 --> 00:16:07,118 It's a really simple object storage system where you can just put an object, 176 00:16:07,118 --> 00:16:10,365 get an object, put a bunch of objects, get a bunch of objects. 177 00:16:11,949 --> 00:16:17,517 We've implemented removal but we don't really use it yet. 178 00:16:20,196 --> 00:16:24,955 All the data model implementation, all the listers, the loaders, the schedulers 179 00:16:24,955 --> 00:16:29,180 everything has been written by us, it's a pile of Python code. 180 00:16:31,860 --> 00:16:35,806 So, basically 20 Python packages and around 30 Puppet modules 181 00:16:35,806 --> 00:16:41,696 to deploy all that and we've done everything under a copyleft license, 182 00:16:41,696 --> 00:16:46,078 GPLv3 for the backend and AGPLv3 for the frontend. 183 00:16:47,064 --> 00:16:56,894 Even if people try and make their own Software Heritage using our code, 184 00:16:56,894 --> 00:16:59,660 they have to publish their changes.
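[Editor's note] The object storage mentioned earlier in this part of the talk (put an object, get an object, put a bunch, get a bunch, with interchangeable backends) might look roughly like the sketch below. All class and method names here are hypothetical and chosen for illustration; this is not the project's actual API, and using SHA-256 as the object key is an assumption.

```python
import hashlib
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ObjectStorage(ABC):
    """Backend-agnostic interface: single and batched put/get."""

    @abstractmethod
    def add(self, content: bytes) -> str: ...

    @abstractmethod
    def get(self, obj_id: str) -> bytes: ...

    def add_batch(self, contents):
        return [self.add(content) for content in contents]

    def get_batch(self, obj_ids):
        return [self.get(obj_id) for obj_id in obj_ids]


class InMemoryObjectStorage(ObjectStorage):
    def __init__(self):
        self._objects = {}

    def add(self, content):
        obj_id = hashlib.sha256(content).hexdigest()
        self._objects[obj_id] = content
        return obj_id

    def get(self, obj_id):
        return self._objects[obj_id]


class FilesystemObjectStorage(ObjectStorage):
    def __init__(self, root):
        self.root = Path(root)

    def _path(self, obj_id):
        # Shard by the first two hex digits to keep directories small.
        return self.root / obj_id[:2] / obj_id

    def add(self, content):
        obj_id = hashlib.sha256(content).hexdigest()
        path = self._path(obj_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return obj_id

    def get(self, obj_id):
        return self._path(obj_id).read_bytes()


# Usage: both backends hand back the same content-derived identifiers.
memory = InMemoryObjectStorage()
ids_memory = memory.add_batch([b"hello", b"world"])
with tempfile.TemporaryDirectory() as root:
    disk = FilesystemObjectStorage(root)
    ids_disk = disk.add_batch([b"hello", b"world"])
    roundtrip = disk.get_batch(ids_disk)
```

Keeping the interface this small is what lets a backend be swapped (a filesystem, a cloud blob store, Ceph) without touching the code that calls it.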
185 00:17:01,858 --> 00:17:10,757 Hardware-wise, we run for now everything on a few hypervisors in house and 186 00:17:10,757 --> 00:17:18,568 our main storage is currently still on a very high density, very slow, 187 00:17:18,568 --> 00:17:27,953 very bulky storage array, but we've started to migrate all of this 188 00:17:27,953 --> 00:17:33,002 into a Ceph storage cluster which we're gonna grow as we need 189 00:17:33,002 --> 00:17:35,073 in the next few months. 190 00:17:36,249 --> 00:17:43,680 We've also been granted by Microsoft sponsorship, ??? sponsorship 191 00:17:44,077 --> 00:17:45,834 for their cloud services. 192 00:17:46,445 --> 00:17:51,763 We've started putting mirrors of everything in their infrastructure as well 193 00:17:51,763 --> 00:17:59,568 which means full object storage mirror, so 170TB of stuff mirrored on Azure 194 00:17:59,568 --> 00:18:02,492 as well as a database mirror for the graph. 195 00:18:03,796 --> 00:18:08,958 And we're also doing all the content indexing and all the things that need 196 00:18:08,958 --> 00:18:11,962 scalability on Azure now. 197 00:18:16,637 --> 00:18:22,413 Finally, at the University of Bologna, we have a backend storage for the download 198 00:18:22,413 --> 00:18:29,412 so currently our main storage is quite slow so if you want to download 199 00:18:29,412 --> 00:18:34,859 a bundle of things that we've archived, then we actually keep a cache of 200 00:18:34,859 --> 00:18:40,347 what we've done so that it doesn't take a million years to download stuff. 201 00:18:41,809 --> 00:18:46,233 We do our development in a classic free and open source software way, 202 00:18:46,233 --> 00:18:52,056 so we talk on our mailing list, on IRC, on a forge. 203 00:18:52,503 --> 00:18:56,635 Everything is in English, everything is public, there is more information 204 00:18:56,635 --> 00:19:00,749 on our website if you want to actually have a look and see what we do.
205 00:19:04,278 --> 00:19:09,598 So, all that is very interesting but how do we actually look into it? 206 00:19:11,670 --> 00:19:16,051 One of the ways that you can browse, that you can use the archive 207 00:19:16,051 --> 00:19:18,619 is using a REST API. 208 00:19:19,189 --> 00:19:25,244 Basically, this API allows you to do pointwise browsing of the archive 209 00:19:25,244 --> 00:19:29,026 so you can go and follow the links in a graph, 210 00:19:29,026 --> 00:19:37,759 which is very slow but gives you a pretty much full access of the data. 211 00:19:38,450 --> 00:19:44,779 There's an index for the API that you can look at, but that's not really convenient, 212 00:19:44,779 --> 00:19:47,788 so we also have a web user interface. 213 00:19:48,828 --> 00:19:55,774 It's in preview right now, we're gonna do a full launch in the month of June. 214 00:19:57,768 --> 00:20:01,103 If you go to https://archive.softwareheritage.org/browse/ 215 00:20:01,591 --> 00:20:09,550 with the given credentials, you can have a look and see what's going on. 216 00:20:10,166 --> 00:20:18,547 Basically, we have a web interface that allows you to look at 217 00:20:18,547 --> 00:20:26,071 what origins we have downloaded, when we have downloaded the origins 218 00:20:26,071 --> 00:20:34,931 with a kind of graph view of how often we visited the origins 219 00:20:34,931 --> 00:20:37,936 and a calendar view of when we have visited the origins. 220 00:20:38,790 --> 00:20:43,749 And then, inside the visits, you can actually browse the contents 221 00:20:43,749 --> 00:20:45,048 that we've archived. 222 00:20:45,293 --> 00:20:49,884 So, for instance, this is the Python repository as of May 2017 223 00:20:49,884 --> 00:20:54,960 and you can have the list of files, then drill down, 224 00:20:54,960 --> 00:20:58,164 it should be pretty intuitive. 
225 00:20:59,160 --> 00:21:02,586 If you look at the history of a project, you can see the differences 226 00:21:02,586 --> 00:21:04,696 between two revisions of a project. 227 00:21:06,891 --> 00:21:12,261 Oh no, that's the syntax highlighting, but anyway the diffs arrive right after. 228 00:21:13,641 --> 00:21:16,327 So, yeah, pretty cool stuff. 229 00:21:16,898 --> 00:21:21,535 I should be able to do a demo as well, it should work. 230 00:21:31,112 --> 00:21:32,429 I'm gonna zoom in. 231 00:21:44,795 --> 00:21:49,474 So this is the main archive, you can see some statistics about the objects 232 00:21:49,474 --> 00:21:50,933 that we've downloaded. 233 00:21:51,137 --> 00:21:56,557 When you zoom in, you get some kind of overflows, because… 234 00:21:56,915 --> 00:21:58,867 Yeah, why would you do that. 235 00:21:59,235 --> 00:22:04,076 If you want to browse, we can try to find an origin. 236 00:22:07,407 --> 00:22:08,832 "glibc". 237 00:22:12,729 --> 00:22:17,036 So there's lots and lots of, like, random Github forks of things… 238 00:22:18,584 --> 00:22:25,784 We don't discriminate and we don't really filter what we download. 239 00:22:26,555 --> 00:22:34,393 We are looking into doing some relevance kind of sorting of the results, here. 240 00:22:36,434 --> 00:22:37,694 Next. 241 00:22:40,376 --> 00:22:42,083 Xilinx, why not. 242 00:22:43,220 --> 00:22:48,750 So, this has been downloaded for the last time of August 3rd 2016, 243 00:22:48,750 --> 00:22:50,402 so it's probably a dead repository, 244 00:22:52,717 --> 00:22:54,995 but yeah, you can see a bunch of source code, 245 00:22:56,671 --> 00:23:00,536 you can read the README of the glibc. 246 00:23:04,441 --> 00:23:07,650 If we go back to a more interesting origin 247 00:23:07,650 --> 00:23:09,643 here's the repository for git. 248 00:23:10,577 --> 00:23:17,153 I've selected voluntarily an old visit of the repo so that we can see 249 00:23:17,153 --> 00:23:18,861 what was going on then. 
250 00:23:22,759 --> 00:23:31,456 If I look at the calendar view, you can see that we've had some issues actually 251 00:23:31,456 --> 00:23:33,410 updating this, but anyway. 252 00:23:37,835 --> 00:23:46,085 If I look at the last visit, then we can actually browse the contents, 253 00:23:46,735 --> 00:23:49,336 you can get syntax highlighting as well. 254 00:23:49,904 --> 00:23:53,722 This is a big big file with lots of comments 255 00:24:02,094 --> 00:24:04,971 Let's see the actual source code… 256 00:24:07,036 --> 00:24:10,168 Anyway, so, that's the browsing interface. 257 00:24:10,452 --> 00:24:15,126 We can also now get back what we've archived and download it, 258 00:24:15,126 --> 00:24:18,705 which is kind of something that you might want to do 259 00:24:18,705 --> 00:24:23,526 if a repository is lost, you can actually download it 260 00:24:23,526 --> 00:24:25,564 and get the source code back again. 261 00:24:26,944 --> 00:24:28,455 How we do that. 262 00:24:28,733 --> 00:24:35,478 If you go on the top right of this browsing interface, you have actions and download 263 00:24:35,478 --> 00:24:40,275 and you can download the directory that you are currently looking at. 264 00:24:41,294 --> 00:24:46,010 It's an asynchronous process, which means that if there is a lot of load, 265 00:24:46,010 --> 00:24:51,458 then it's gotta take some time to get actually, to be able to download the content 266 00:24:51,947 --> 00:24:56,298 So you can put in your email address so we can notify you when the download is ready. 267 00:24:56,989 --> 00:25:03,338 I'm gonna try my luck and say just "ok" and it's gonna appear at some point 268 00:25:03,338 --> 00:25:07,609 in the list of things that I've requested. 269 00:25:11,016 --> 00:25:20,173 I've already requested some things that we can actually get and open as a tarball. 
270 00:25:31,456 --> 00:25:34,758 Yeah, I think that's the thing that I was actually looking at, 271 00:25:35,301 --> 00:25:38,439 which is this revision of the git source code 272 00:25:39,654 --> 00:25:42,252 and then I can open it 273 00:25:43,643 --> 00:25:46,572 Yay, emacs, that's when you want. 274 00:25:46,932 --> 00:25:48,314 Yay, source code. 275 00:25:51,161 --> 00:25:53,562 This seems to work. 276 00:25:57,915 --> 00:26:02,674 And then, of course, if you want to actually script what you're doing, 277 00:26:02,674 --> 00:26:07,141 there's an API that allows you to do the downloads as well, so you can. 278 00:26:10,918 --> 00:26:18,392 The source code is deduplicated a lot, which means that for one single repository 279 00:26:18,392 --> 00:26:24,200 you get tons of files that we have to collect if you want to actually download 280 00:26:24,200 --> 00:26:26,227 an archive of a directory. 281 00:26:29,614 --> 00:26:37,704 It takes a while but we have an asynchronous API so you can POST 282 00:26:37,704 --> 00:26:43,560 the identifier of a revision to this URL and then get status updates 283 00:26:43,560 --> 00:26:49,493 and at some point, it will tell you that the… here 284 00:26:49,846 --> 00:26:52,700 The status will tell you that the object is available. 285 00:26:52,984 --> 00:26:59,134 You can download it and you can even download the full history of a project 286 00:26:59,134 --> 00:27:03,565 and get that as a git-fast-export archive that you can reimport into 287 00:27:03,565 --> 00:27:05,836 a new git repository. 288 00:27:06,236 --> 00:27:13,182 So any kind of VCS that we've imported, you can export as a git repository 289 00:27:13,182 --> 00:27:17,733 and reimport on your machine. 290 00:27:19,241 --> 00:27:22,846 How to get involved in the project? 291 00:27:24,029 --> 00:27:29,030 We have a lot of features that we're interested in, lots of them are now 292 00:27:29,030 --> 00:27:31,387 in early access or have been done.
293 00:27:31,876 --> 00:27:35,620 There's some stuff that we would like help with. 294 00:27:38,226 --> 00:27:40,259 This is some stuff that we're working on: 295 00:27:40,546 --> 00:27:42,946 provenance information, you have a content 296 00:27:43,066 --> 00:27:45,420 you want to know which repository it comes from, 297 00:27:45,868 --> 00:27:47,577 that's something we're working on. 298 00:27:48,314 --> 00:27:55,215 Full text search, the end goal is to be able even to trace 299 00:27:55,215 --> 00:28:00,503 source of snippets of code that has been copied from one project to another. 300 00:28:01,321 --> 00:28:05,831 That's something that we can look into with the wealth of information that 301 00:28:05,831 --> 00:28:07,623 we have inside the archive. 302 00:28:08,641 --> 00:28:10,672 There's a lot of things that, 303 00:28:10,672 --> 00:28:11,729 I mean… 304 00:28:12,135 --> 00:28:14,731 There's a lot of things that people want to do with the archive. 305 00:28:15,352 --> 00:28:19,586 Our goal is to enable people to do things, to do interesting things 306 00:28:19,586 --> 00:28:21,900 with a lot of source code. 307 00:28:23,530 --> 00:28:27,353 If you have an idea of what you want to do with such an archive, 308 00:28:27,353 --> 00:28:29,835 please come talk to us 309 00:28:29,835 --> 00:28:34,941 and we'll be happy to help you help us. 310 00:28:37,552 --> 00:28:43,572 What we want to do is to diversify the sources of things that we archive. 311 00:28:44,466 --> 00:28:51,289 Currently, we have good support for git, we have OK support for subversion 312 00:28:51,289 --> 00:28:52,708 and mercurial. 313 00:28:54,373 --> 00:28:59,219 If your project of choice is in another version control system, 314 00:28:59,219 --> 00:29:01,093 we are gonna miss it. 315 00:29:01,663 --> 00:29:06,295 So people can contribute in this area.
316 00:29:10,117 --> 00:29:18,205 For the listing part, we have coverage of Debian, we have coverage of GitHub, 317 00:29:18,205 --> 00:29:26,418 if your code is somewhere else, we won't see it, so we need people to contribute 318 00:29:26,418 --> 00:29:29,586 stuff that can list for instance GitLab instances, 319 00:29:31,899 --> 00:29:36,410 and then we can integrate that in our infrastructure and actually have 320 00:29:36,929 --> 00:29:41,436 people be able to archive their GitLab instances. 321 00:29:42,045 --> 00:29:48,784 And of course, we need to spread the word, make the project sustainable. 322 00:29:49,118 --> 00:30:00,590 We have a few sponsors now, Microsoft, Nokia, Huawei, GitHub has joined as a sponsor 323 00:30:01,811 --> 00:30:06,365 The University of Bologna, of course Inria is sponsoring. 324 00:30:06,853 --> 00:30:11,971 But we need to keep spreading the word and keep the project sustainable. 325 00:30:13,026 --> 00:30:17,501 And, of course, we need to save endangered source code. 326 00:30:17,828 --> 00:30:22,580 For that, we have a suggestion box on the wiki that you can add things to. 327 00:30:24,208 --> 00:30:29,563 For instance, we have in the back of our minds archiving SourceForge, 328 00:30:29,563 --> 00:30:35,933 because we know that this isn't very sustainable and that's at risk of being 329 00:30:35,933 --> 00:30:38,733 taken down at some point. 330 00:30:41,696 --> 00:30:47,830 If you want to join us, we also have some job openings that are available. 331 00:30:48,602 --> 00:30:55,646 For now it's in Paris, so if you want to consider coming to work with us in Paris, 332 00:30:55,646 --> 00:30:58,086 you can look into that. 333 00:31:00,646 --> 00:31:02,684 That's Software Heritage.
334 00:31:02,684 --> 00:31:05,123 We are building a reference archive of all the free software 335 00:31:05,123 --> 00:31:06,836 that has ever been written 336 00:31:07,080 --> 00:31:10,982 in an international, open, non-profit and mutualised infrastructure 337 00:31:11,877 --> 00:31:17,933 that we have opened up to everyone, all users, vendors, developers can use it. 338 00:31:20,126 --> 00:31:25,658 The idea is to be at the service of the community and for society 339 00:31:25,658 --> 00:31:27,805 as a whole. 340 00:31:28,136 --> 00:31:32,855 So if you want to join us, you can look at our website, you can look at our code. 341 00:31:34,604 --> 00:31:38,145 You can also talk to me, so if you have any questions, 342 00:31:38,145 --> 00:31:42,125 I think we have 10, 12 minutes for questions. 343 00:31:46,228 --> 00:31:51,512 [Applause] 344 00:31:51,754 --> 00:31:52,933 Do you have questions? 345 00:31:57,207 --> 00:32:00,627 [Q] How do you protect the archive against stuff that you don't want to 346 00:32:00,627 --> 00:32:01,887 have in the archive. 347 00:32:02,170 --> 00:32:06,882 I think of stuff that is copyright-protected and that GitHub will also 348 00:32:06,882 --> 00:32:09,325 delete after a while. 349 00:32:09,730 --> 00:32:15,583 Worse, if I would misuse the archive as my private backup 350 00:32:15,583 --> 00:32:19,601 and store encrypted blocks on GitHub and you will eventually back them up 351 00:32:19,601 --> 00:32:20,779 for me. 352 00:32:24,562 --> 00:32:26,711 [A] There's, I think, two sides of the question. 353 00:32:27,077 --> 00:32:28,502 The first side is 354 00:32:28,502 --> 00:32:33,543 Do we really archive only stuff that is free software and 355 00:32:33,543 --> 00:32:40,901 that we can redistribute and how do we manage, for instance, 356 00:32:40,901 --> 00:32:42,856 copyright takedown stuff. 357 00:32:46,108 --> 00:32:51,874 Currently, most of the infrastructure of the project is under French law.
358 00:32:52,975 --> 00:33:00,047 There's a defined process to do copyright takedown in the French legal system. 359 00:33:02,365 --> 00:33:08,828 We would be really annoyed to have to take down content from the archive 360 00:33:12,486 --> 00:33:19,846 What we do, however, is to mirror public information that is publicly available. 361 00:33:21,192 --> 00:33:26,716 Of course I'm not a lawyer for the project, so I can't really… 362 00:33:29,605 --> 00:33:33,181 I'm not 100% sure of what I'm about to say but 363 00:33:33,181 --> 00:33:38,920 what I know is that in the current state of French legislation, 364 00:33:39,531 --> 00:33:42,903 if the source of the data is still available 365 00:33:42,903 --> 00:33:46,643 so for instance if the data is still on GitHub, then you need to have 366 00:33:46,643 --> 00:33:49,901 GitHub take it down before we have to take it down. 367 00:33:56,681 --> 00:34:01,881 We're not currently filtering content for misuse of the archive, 368 00:34:01,881 --> 00:34:06,361 so the only thing that we do is put a limit on the size of the files 369 00:34:06,361 --> 00:34:08,435 that are archived in Software Heritage. 370 00:34:09,536 --> 00:34:12,014 The limit is pretty high, like 100MB. 371 00:34:15,102 --> 00:34:21,440 We can't really decide ourselves 372 00:34:21,440 --> 00:34:24,084 what is source code, what is not source code 373 00:34:24,084 --> 00:34:30,669 because for instance if your project is a cryptography library, 374 00:34:30,669 --> 00:34:34,397 you might want to have some encrypted blocks of data that are stored 375 00:34:34,397 --> 00:34:38,465 in your source code repository as test fixtures. 376 00:34:39,034 --> 00:34:44,033 And then, you need them to build the code and to make sure that it works. 377 00:34:44,682 --> 00:34:48,998 So, how would that be any different than your encrypted backup on GitHub?
378 00:34:49,139 --> 00:34:55,641 How could we, Software Heritage, distinguish between proper use and misuse 379 00:34:55,641 --> 00:34:58,806 of the resources. 380 00:35:00,349 --> 00:35:05,100 I guess our long term goal is to not have to care about misuse because 381 00:35:05,100 --> 00:35:07,175 it's gonna be a drop in the ocean. 382 00:35:08,638 --> 00:35:10,916 We're gonna have so much… 383 00:35:11,893 --> 00:35:15,303 We want to have enough space and enough resources 384 00:35:15,303 --> 00:35:20,021 that we don't really need to ask ourselves this question, basically. 385 00:35:21,480 --> 00:35:22,413 Thanks. 386 00:35:26,355 --> 00:35:27,653 Other questions? 387 00:35:34,113 --> 00:35:39,359 [Q] Have you looked at some form of authentication to provide additional 388 00:35:39,359 --> 00:35:46,346 insurance that the archived source code hasn't been modified or tampered with 389 00:35:46,346 --> 00:35:47,893 in some form? 390 00:35:50,977 --> 00:35:55,971 [A] First of all, all the identifiers for the objects that are inside the archive 391 00:35:55,971 --> 00:36:00,639 are cryptographic hashes of the contents that we've archived. 392 00:36:01,612 --> 00:36:06,937 So, for files, for instance, we take the SHA1, the SHA256, 393 00:36:06,937 --> 00:36:16,077 one of the BLAKE hashes and the git modified SHA1 of the file, 394 00:36:16,646 --> 00:36:19,658 and we use that in the manifest for the directories. 395 00:36:19,902 --> 00:36:25,787 So the directories, the directory identifiers are a hash of the manifest 396 00:36:25,787 --> 00:36:30,093 of the list of files that are inside the directory, etc. 397 00:36:30,543 --> 00:36:39,286 So, recursively, you can make sure that the data that we give back to you 398 00:36:39,286 --> 00:36:47,779 has not been, at least altered, by bitflip or anything. 
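The recursive hashing scheme described in this answer can be sketched in Python. This is a minimal illustration under stated assumptions, not Software Heritage's actual code: the function names (`content_hashes`, `directory_id`, `verify_content`), the choice of BLAKE2s for "one of the BLAKE hashes", and the manifest layout are all made up for the example. Only the "git modified SHA1" follows a known format: git hashes a `blob <size>\0` header followed by the file contents.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute several checksums for one file's contents."""
    # git prepends "blob <size>\0" to the contents before hashing.
    git_header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: BLAKE2s stands in for "one of the BLAKE hashes".
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


def directory_id(entries: dict) -> str:
    """Hash a manifest listing each entry name and its identifier.

    `entries` maps an entry name to the hex identifier of a file or of a
    sub-directory. The manifest layout is hypothetical: entries sorted by
    name, each written as name, NUL byte, raw identifier bytes.
    """
    manifest = b"".join(
        name.encode("utf-8") + b"\0" + bytes.fromhex(entries[name])
        for name in sorted(entries)
    )
    return hashlib.sha1(manifest).hexdigest()


def verify_content(data: bytes, expected_sha1_git: str) -> bool:
    """Check that retrieved bytes still match their recorded identifier."""
    return content_hashes(data)["sha1_git"] == expected_sha1_git


if __name__ == "__main__":
    readme = b"hello\n"
    main_c = b"int main(void) { return 0; }\n"

    # Identifiers are built bottom-up: files, then src/, then the root.
    src_id = directory_id({"main.c": content_hashes(main_c)["sha1_git"]})
    root_id = directory_id({
        "README": content_hashes(readme)["sha1_git"],
        "src": src_id,
    })
    print("root directory identifier:", root_id)

    # A single flipped bit in a file no longer matches its identifier.
    flipped = bytes([readme[0] ^ 0x01]) + readme[1:]
    print("intact copy verifies: ", verify_content(readme, content_hashes(readme)["sha1_git"]))
    print("flipped copy verifies:", verify_content(flipped, content_hashes(readme)["sha1_git"]))
```

Because each directory identifier depends on the identifiers of everything below it, changing any file changes the identifier of every directory above it, which is what makes the recursive check possible.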
399 00:36:48,954 --> 00:36:53,386 We regularly run a scrub of the data that we have in the archive, 400 00:36:53,386 --> 00:36:57,252 so we make sure that there's no rot inside our archive. 401 00:36:58,960 --> 00:37:05,063 We've not looked into, basically, attestation of… 402 00:37:08,761 --> 00:37:13,884 for instance, making sure that the code that we've downloaded… 403 00:37:20,878 --> 00:37:26,448 I mean, we're not doing anything more than taking a picture of the data 404 00:37:26,448 --> 00:37:34,092 and we say "We've computed this hash. Maybe the code that's been presented 405 00:37:34,092 --> 00:37:38,839 by Github to Software Heritage is different than what you've uploaded to Github, 406 00:37:38,839 --> 00:37:40,312 we can't tell." 407 00:37:43,967 --> 00:37:48,925 In the case of git, you can always use the identifiers of the objects 408 00:37:48,925 --> 00:37:51,858 that you've pushed so you have the commit hash, 409 00:37:51,858 --> 00:37:56,777 which is itself a cryptographic identifier of the contents of the commit. 410 00:37:59,419 --> 00:38:02,182 In turn, if the commit is signed, then the signature is still stored 411 00:38:02,182 --> 00:38:10,800 in the Software Heritage metadata and you can reproduce the original git object 412 00:38:10,800 --> 00:38:15,356 and check the signature, but we've not done anything specific for Software Heritage 413 00:38:15,356 --> 00:38:17,185 in this area. 414 00:38:17,536 --> 00:38:19,643 Does that answer your question? 415 00:38:19,975 --> 00:38:20,302 Cool. 416 00:38:24,886 --> 00:38:25,748 Other questions? 417 00:38:27,457 --> 00:38:28,798 There's one in front. 418 00:38:31,400 --> 00:38:33,558 [Q] It's partially question, partially comment. 419 00:38:33,884 --> 00:38:39,776 Your initial idea was to have a telescope, or something like this for source code. 
420 00:38:40,223 --> 00:38:43,427 For now, for me, it looks a little bit more like microscope, 421 00:38:43,427 --> 00:38:46,507 so you can focus on one thing, but that's not much. 422 00:38:46,758 --> 00:38:51,023 So have you sorted things about how to analyze entire ecosystem 423 00:38:51,023 --> 00:38:52,203 or something like this. 424 00:38:52,203 --> 00:38:56,513 For example, now we have Django 2 which is Python 3 only so it would be interesting to 425 00:38:56,513 --> 00:39:00,899 look at all Django modules to see when they start moving to this Django. 426 00:39:01,266 --> 00:39:06,619 So we would need to start analyzing thousands or millions of files, but then 427 00:39:06,619 --> 00:39:10,840 we would need some SQL like, or some map reduce jobs 428 00:39:11,050 --> 00:39:12,430 or something like this for this. 429 00:39:12,957 --> 00:39:13,524 [A] Yes 430 00:39:13,889 --> 00:39:15,073 So, we've started… 431 00:39:16,414 --> 00:39:21,620 The two initiators of the project, Roberto Di Cosmo and Stefano Zacchiroli 432 00:39:21,812 --> 00:39:26,566 are both researchers in computer science so they have a strong background in 433 00:39:26,566 --> 00:39:34,654 actually mining software repositories and doing some large scale analysis 434 00:39:34,654 --> 00:39:36,234 on source code. 435 00:39:38,153 --> 00:39:44,818 We've been talking with research groups whose main goal is to do analysis on 436 00:39:44,818 --> 00:39:48,436 large scale source code archives. 437 00:39:50,430 --> 00:39:57,592 One of the first mirrors outside of our control of the archive 438 00:39:57,592 --> 00:39:59,016 will be in Grenoble (France). 439 00:39:59,385 --> 00:40:05,852 There's a few teams that work on actually doing large scale research 440 00:40:05,852 --> 00:40:08,697 on source code over there, 441 00:40:08,697 --> 00:40:11,339 so that's what the mirror will be used for. 442 00:40:13,411 --> 00:40:17,235 We've also been looking at what the Google open source team does. 
443 00:40:18,212 --> 00:40:22,997 They have this big repository with all the code that Google uses 444 00:40:22,997 --> 00:40:28,944 and they've started to push back, like do large scale analysis of 445 00:40:28,944 --> 00:40:37,581 security vulnerabilities, issues with static and dynamic analysis 446 00:40:37,581 --> 00:40:41,938 of the code and they've started pushing their fixes upstream. 447 00:40:42,589 --> 00:40:47,135 That's something that we want to enable users to do, 448 00:40:47,135 --> 00:40:50,631 that's not something that we want to do ourselves, but we want to make sure 449 00:40:50,631 --> 00:40:53,482 that people can do it using our archive. 450 00:40:54,620 --> 00:40:58,767 So we'd be happy to work with people who already do that so that 451 00:40:58,767 --> 00:41:04,534 they can use their knowledge and their tools inside our archive. 452 00:41:06,606 --> 00:41:08,684 Does that answer your question? 453 00:41:09,658 --> 00:41:10,673 Cool. 454 00:41:14,982 --> 00:41:16,528 Any more questions? 455 00:41:19,411 --> 00:41:21,727 No? Then thank you very much Nicolas. 456 00:41:21,930 --> 00:41:22,581 Thank you. 457 00:41:22,947 --> 00:41:25,957 [Applause]