Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project. I've been working on it for 3 years now. And we'll see what this thing is all about.
[Mic not working]
I guess the batteries are out. So, let's try that again.
So, we all know, we've been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson has said in SICP, his textbook on programming, programs are meant to be read by people and then incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program. For instance, you can get inside very crazy algorithms that can do very fast reverse square roots for 3D, that kind of stuff, like in the Quake 2 source code. You can also get inside the algorithms that are underpinning the internet, for instance seeing the net queue algorithm in the Linux kernel.
What we are building as the free software community is the free software commons. Basically, the commons is all the cultural and social and natural resources that we share and that everyone has access to.
More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that those commons are a really critical part of our commons.
Who's taking care of it? The software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot. If you don't care about the data, then it rots and it decays and you lose it.
So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?
Finally, there's one last thing that we noticed: there are a lot of teams that work on research on software and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.
What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do that to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.
How do we do it? We do it with an open approach. Every single line of code that we write is free software. We do it transparently, everything that we do, we do it in the open, be that on a mailing list or on our issue tracker. And we strive to do it for the very long haul, so we do it with replication in mind so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion so that we avoid business-driven decisions impacting the project.
So, what do we do concretely? We do archiving of version control systems. What does that mean? It means we archive file contents, so source code, files. We archive revisions, which means all the metadata of the history of the projects, we try to download it and we put it inside a common data model that is shared across all the archive.
We archive releases of the software, releases that have been tagged in a version control system as well as releases that we can find as tarballs because sometimes… boof, views of this source code differ. Of course, we archive where and when we've seen the data that we've collected. All of this, we put inside a canonical, VCS-agnostic data model. If you have a Debian package with its history, if you have a git repository, if you have a Subversion repository, if you have a Mercurial repository, it all looks the same and you can work on it with the same tools.
What we don't do is archive what's around the software, for instance the bug tracking systems or the homepages or the wikis or the mailing lists. There are some projects that work in this space, for instance the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to do linking across all the archives that exist. For instance, for the mailing lists there's the Gmane project that does a lot of archiving of free software mailing lists. So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software where we can hyperlink all the archives that exist and do stuff in the area.
Quick tour of our infrastructure. Basically, all the way to the right is our archive.
Our archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases and all the projects that are on top of the graph. We separate the file storage into another object storage because of the size discrepancy: we have lots and lots of file contents that we need to store so we do that outside the database that is used to store the graph.
Basically, what we archive is a set of software origins that are git repositories, Mercurial repositories, etc. All those origins are loaded on a regular schedule. If there is a very active software origin, we're gonna archive it more often than stale things that don't get a lot of updates.
How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis.
We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real time archiving, we're really in it for the long run, but we still want to be able to prioritize stuff that people tell us is important to archive.
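As an illustration of the "active origins get visited more often" idea, here is a minimal sketch. It is not the scheduler Software Heritage actually runs; the function name `next_visit_interval`, the halving/doubling rule and the one-day and 64-day bounds are all assumptions chosen for the example.

```python
from datetime import timedelta

# Hypothetical bounds for this sketch: never visit more often than daily,
# never less often than every 64 days.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_visit_interval(current: timedelta, had_updates: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    If the last visit found new data, the origin is considered active and
    the interval is halved; otherwise it is doubled. The result is clamped
    to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if had_updates else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

With this rule, a repository that changes on every visit converges towards daily visits, while a dead repository drifts towards the maximum interval.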
The Internet Archive has a "save now" button and we want to implement something along those lines as well, so if we know that some software project is in danger for a reason or another, then we can prioritize archiving it.
So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this is from a git commit. So basically what we do is we take the identifier of the directory that the revision points to, we take the identifier of the parent of the revision so we can keep track of the history, and then we add some metadata, authorship and committership information and the revision message, and then we take a hash of this. It makes an identifier that's probably unique, very very probably unique.
Using those identifiers, we can retrace all the origins, all the history of development of the project and we can deduplicate across all the archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which means that we can deduplicate very efficiently across all the data that we archive.
How much data do we archive? A bit. So, we have passed the billion revision mark a few weeks ago.
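The construction of a revision identifier described above can be sketched in a few lines. This is a minimal illustration that assumes the plain git commit layout (a `tree` line, `parent` lines, `author`, `committer`, then the message) hashed with SHA1; the function name `revision_id` is hypothetical, and the extensions mentioned in the talk are left out.

```python
import hashlib


def revision_id(directory: str, parents: list, author: str,
                committer: str, message: str) -> str:
    """Compute a git-style intrinsic identifier for a revision.

    `directory` and `parents` are hex identifiers; `author` and
    `committer` are full "Name <email> timestamp timezone" strings.
    """
    lines = [f"tree {directory}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git prefixes the serialized object with its type and length.
    header = b"commit %d\0" % len(body)
    return hashlib.sha1(header + body).hexdigest()
```

Because the identifier depends only on the content, two archives that ingest the same commit from different places compute the same identifier, which is what makes deduplication cheap.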
This graph is a bit old, but anyway, you have a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider is source code and what upstream developers consider as source code, so everything that's in a git repository, we consider as source code if it's below a size threshold. A billion revisions across 80 million projects.
What do we archive? We archive GitHub, we archive Debian. So, for Debian we run the archival process every day, every day we get the new packages that have been uploaded in the archive. GitHub, we try to keep up, we are currently working on some performance improvements, some scalability improvements to make sure that we can keep up with the development on GitHub. We have archived as a one-off thing the former content of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.
In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically a graph of 8 billion nodes and 70 billion edges. And of course it's growing daily.
We are pretty sure this is the richest source code archive that's available now and it keeps growing.
So how do we actually… What kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for scheduling of jobs and for web interface stuff, basically psycopg2 for the low level stuff, Django for the web stuff and Celery for the scheduling of jobs.
In house, we've written an ad hoc object storage system which has a bunch of backends that you can use. Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal but we don't really use it yet.
All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us, it's a pile of Python code. So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've done everything under a copyleft license, GPLv3 for the backend and AGPLv3 for the frontend.
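The "put an object, get an object, put a bunch, get a bunch" interface can be pictured with a small in-memory backend. This is only a sketch: the class name `InMemoryObjStorage` and its method names are hypothetical and are not taken from the Software Heritage code base, and SHA1 is assumed as the object key for simplicity.

```python
import hashlib


class InMemoryObjStorage:
    """Toy content-addressed object storage backed by a dict."""

    def __init__(self):
        self._objects = {}

    def add(self, content: bytes) -> str:
        """Store one object and return its identifier (SHA1 of the content)."""
        obj_id = hashlib.sha1(content).hexdigest()
        # Identical contents map to the same key, so they are stored once.
        self._objects.setdefault(obj_id, content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return the content stored under obj_id (KeyError if absent)."""
        return self._objects[obj_id]

    def add_batch(self, contents):
        """Store several objects, returning their identifiers in order."""
        return [self.add(c) for c in contents]

    def get_batch(self, obj_ids):
        """Fetch several objects, in the order of the given identifiers."""
        return [self.get(i) for i in obj_ids]

    def __len__(self):
        return len(self._objects)
```

A filesystem, Ceph or cloud backend would expose the same four operations and only change where the bytes end up.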
Even if people try and make their own Software Heritage using our code, they have to publish their changes.
Hardware-wise, we run for now everything on a few hypervisors in house and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all this into a Ceph storage cluster which we're gonna grow as we need in the next few months.
We've also been granted by Microsoft sponsorship, ??? sponsorship for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now.
Finally, at the University of Bologna, we have a backend storage for the download. Currently our main storage is quite slow, so if you want to download a bundle of things that we've archived, then we actually keep a cache of what we've done so that it doesn't take a million years to download stuff.
We do our development in a classic free and open source software way, so we talk on our mailing list, on IRC, on a forge. Everything is in English, everything is public, there is more information on our website if you want to actually have a look and see what we do.
So, all that is very interesting but how do we actually look into it? One of the ways that you can browse, that you can use the archive, is using a REST API. Basically, this API allows you to do pointwise browsing of the archive, so you can go and follow the links in a graph, which is very slow but gives you pretty much full access to the data. There's an index for the API that you can look at, but that's not really convenient, so we also have a web user interface.
It's in preview right now, we're gonna do a full launch in the month of June. If you go to https://archive.softwareheritage.org/browse/ with the given credentials, you can have a look and see what's going on.
Basically, we have a web interface that allows you to look at what origins we have downloaded, when we have downloaded the origins, with a kind of graph view of how often we visited the origins and a calendar view of when we have visited the origins. And then, inside the visits, you can actually browse the contents that we've archived. So, for instance, this is the Python repository as of May 2017 and you can have the list of files, then drill down, it should be pretty intuitive. If you look at the history of a project, you can see the differences between two revisions of a project. Oh no, that's the syntax highlighting, but anyway the diffs arrive right after.
So, yeah, pretty cool stuff. I should be able to do a demo as well, it should work. I'm gonna zoom in.
So this is the main archive, you can see some statistics about the objects that we've downloaded. When you zoom in, you get some kind of overflows, because… Yeah, why would you do that.
If you want to browse, we can try to find an origin. "glibc". So there's lots and lots of, like, random GitHub forks of things… We don't discriminate and we don't really filter what we download. We are looking into doing some relevance kind of sorting of the results, here. Next. Xilinx, why not. So, this has been downloaded for the last time on August 3rd 2016, so it's probably a dead repository, but yeah, you can see a bunch of source code, you can read the README of the glibc.
If we go back to a more interesting origin, here's the repository for git. I've voluntarily selected an old visit of the repo so that we can see what was going on then. If I look at the calendar view, you can see that we've had some issues actually updating this, but anyway. If I look at the last visit, then we can actually browse the contents, you can get syntax highlighting as well.
This is a big big file with lots of comments. Let's see the actual source code… Anyway, so, that's the browsing interface.
We can also now get back what we've archived and download it, which is kind of something that you might want to do: if a repository is lost, you can actually download it and get the source code back again.
How do we do that? If you go on the top right of this browsing interface, you have actions and download, and you can download a directory that you are currently looking at. It's an asynchronous process, which means that if there is a lot of load, then it's gonna take some time to actually be able to download the content. So you can put in your email address so we can notify you when the download is ready. I'm gonna try my luck and say just "ok" and it's gonna appear at some point in the list of things that I've requested.
I've already requested some things that we can actually get and open as a tarball. Yeah, I think that's the thing that I was actually looking at, which is this revision of the git source code, and then I can open it. Yay, emacs, that's what you want. Yay, source code. This seems to work.
And then, of course, if you want to actually script what you're doing, there's an API that allows you to do the downloads as well, so you can.
The source code is deduplicated a lot, which means that for one single repository you get tons of files that we have to collect if you want to actually download an archive of a directory. It takes a while but we have an asynchronous API, so you can POST the identifier of a revision to this URL and then get status updates and at some point, it will tell you that the… here. The status will tell you that the object is available. You can download it and you can even download the full history of a project and get that as a git-fast-export archive that you can reimport into a new git repository. So any kind of VCS that we've imported, you can export as a git repository and reimport on your machine.
How to get involved in the project? We have a lot of features that we're interested in, lots of them are now in early access or have been done. There's some stuff that we would like help with. This is some stuff that we're working on: provenance information, you have a content, you want to know which repository it comes from, that's something we're on. Full text search, the end goal is to be able even to trace the source of snippets of code that have been copied from one project to another. That's something that we can look into with the wealth of information that we have inside the archive.
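The POST-then-poll pattern described for the download API can be sketched generically. The talk does not spell out the URL or the status values, so none are guessed here: the two callables, the function name `wait_until_ready` and the status strings `"done"` and `"failed"` are all placeholders for this example.

```python
import time


def wait_until_ready(submit, get_status, poll_interval=5.0,
                     max_polls=120, sleep=time.sleep):
    """Submit a request, then poll until the bundle is reported ready.

    `submit` performs the initial POST; `get_status` returns the current
    status string. Both are supplied by the caller (for instance thin
    wrappers around an HTTP client). Returns True once the status is
    "done", False if `max_polls` is exhausted, and raises RuntimeError
    if the status is "failed".
    """
    submit()
    for _ in range(max_polls):
        status = get_status()
        if status == "done":
            return True
        if status == "failed":
            raise RuntimeError("bundle preparation failed")
        sleep(poll_interval)
    return False
```

Passing the HTTP calls in as functions keeps the polling logic independent of any particular endpoint and makes it easy to exercise without network access.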
There's a lot of things that, I mean… There's a lot of things that people want to do with the archive. Our goal is to enable people to do things, to do interesting things with a lot of source code. If you have an idea of what you want to do with such an archive, please come talk to us and we'll be happy to help you help us.
What we want to do is to diversify the sources of things that we archive. Currently, we have good support for git, we have OK support for Subversion and Mercurial. If your project of choice is in another version control system, we are gonna miss it. So people can contribute in this area.
For the listing part, we have coverage of Debian, we have coverage of GitHub. If your code is somewhere else, we won't see it, so we need people to contribute stuff that can list for instance GitLab instances, and then we can integrate that in our infrastructure and actually have people be able to archive their GitLab instances.
And of course, we need to spread the word, make the project sustainable. We have a few sponsors now: Microsoft, Nokia, Huawei, GitHub has joined as a sponsor, the University of Bologna, of course Inria is sponsoring. But we need to keep spreading the word and keep the project sustainable.
And, of course, we need to save endangered source code.
For that, we have a suggestion box on the wiki that you can add things to. For instance, we have in the back of our minds archiving SourceForge, because we know that this isn't very sustainable and that it's at risk of being taken down at some point.
If you want to join us, we also have some job openings that are available. For now it's in Paris, so if you want to consider coming to work with us in Paris, you can look into that.
That's Software Heritage. We are building a reference archive of all the free software that has ever been written, in an international, open, non-profit and mutualised infrastructure that we have opened up to everyone: all users, vendors, developers can use it. The idea is to be at the service of the community and of society as a whole. So if you want to join us, you can look at our website, you can look at our code. You can also talk to me, so if you have any questions, I think we have 10, 12 minutes for questions.
[Applause]
Do you have questions?
[Q] How do you protect the archive against stuff that you don't want to have in the archive? I'm thinking of stuff that is copyright-protected and that GitHub will also delete after a while. Worse, what if I misuse the archive as my private backup and store encrypted blocks on GitHub, and you eventually back them up for me?
[A] There's, I think, two sides of the question. The first side is: do we really archive only stuff that is free software and that we can redistribute, and how do we manage, for instance, copyright takedown stuff?
Currently, most of the infrastructure of the project is under French law. There's a defined process to do copyright takedown in the French legal system. We would be really annoyed to have to take down content from the archive. What we do, however, is to mirror public information that is publicly available. Of course I'm not a lawyer for the project, so I can't really… I'm not 100% sure of what I'm about to say, but what I know is that in the current state of French legislation, if the source of the data is still available, so for instance if the data is still on GitHub, then you need to have GitHub take it down before we have to take it down.
We're not currently filtering content for misuse of the archive, so the only thing that we do is put a limit on the size of the files that are archived in Software Heritage. The limit is pretty high, like 100MB. We can't really decide ourselves what is source code, what is not source code, because for instance if your project is a cryptography library, you might want to have some encrypted blocks of data that are stored in your source code repository as test fixtures.
And then, you need them to build the code and to make sure that it works. So, how would that be any different from your encrypted backup on GitHub? How could we, Software Heritage, distinguish between proper use and misuse of the resources?
I guess our long term goal is to not have to care about misuse because it's gonna be a drop in the ocean. We're gonna have so much… We want to have enough space and enough resources that we don't really need to ask ourselves this question, basically.
Thanks. Other questions?
[Q] Have you looked at some form of authentication to provide additional assurance that the archived source code hasn't been modified or tampered with in some form?
[A] First of all, all the identifiers for the objects that are inside the archive are cryptographic hashes of the contents that we've archived. So, for files, for instance, we take the SHA1, the SHA256, one of the BLAKE hashes and the git modified SHA1 of the file, and we use that in the manifest for the directories. So the directory identifiers are a hash of the manifest of the list of files that are inside the directory, etc. So, recursively, you can make sure that the data that we give back to you has not been altered, at least by a bitflip or anything.
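The set of file hashes mentioned in this answer can be computed with Python's standard library. In the sketch below, BLAKE2s with a 256-bit digest is an assumption (the talk only says "one of the BLAKE hashes"), the "git modified SHA1" is taken to be git's blob hash, and `directory_id` uses a simplified text manifest rather than the real directory serialization. The names `content_hashes` and `directory_id` are hypothetical.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute several checksums for the content of one file."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: BLAKE2s-256 stands in for "one of the BLAKE hashes".
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
        # git hashes a blob as SHA1 over "blob <length>\0" + content.
        "sha1_git": hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest(),
    }


def directory_id(entries: dict) -> str:
    """Hash a simplified manifest mapping file names to file identifiers.

    Entries are sorted by name so that the result does not depend on
    insertion order. Changing any file changes its identifier, hence the
    manifest, hence the directory identifier.
    """
    manifest = "".join(
        f"{name}\0{ident}\n" for name, ident in sorted(entries.items())
    )
    return hashlib.sha1(manifest.encode("utf-8")).hexdigest()
```

Recomputing these values on downloaded data and comparing them with the identifiers you asked for is what detects a bitflip; it does not by itself prove who produced the data.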
9:59:59.000,9:59:59.000 We regularly run a scrub of the data[br]that we have in the archive, 9:59:59.000,9:59:59.000 so we make sure that there's no rot[br]inside our archive. 9:59:59.000,9:59:59.000 We've not looked into, basically,[br]attestation of… 9:59:59.000,9:59:59.000 for instance, making sure that the code[br]that we've downloaded… 9:59:59.000,9:59:59.000 I mean, we're not doing anything more[br]than taking a picture of the data 9:59:59.000,9:59:59.000 and we say "We've computed this hash.[br]Maybe the code that's been presented 9:59:59.000,9:59:59.000 by Github to Software Heritage is different[br]than what you've uploaded to Github, 9:59:59.000,9:59:59.000 we can't tell." 9:59:59.000,9:59:59.000 In the case of git, you can always use[br]the identifiers of the objects 9:59:59.000,9:59:59.000 that you've pushed so you have[br]the commit hash, 9:59:59.000,9:59:59.000 which is itself a cryptographic identifier[br]of the contents of the commit. 9:59:59.000,9:59:59.000 In turn, if the commit is signed, then[br]the signature is still stored 9:59:59.000,9:59:59.000 in the Software Heritage metadata and[br]you can reproduce the original git object 9:59:59.000,9:59:59.000 and check the signature, but we've not[br]done anything specific for Software Heritage 9:59:59.000,9:59:59.000 in this area. 9:59:59.000,9:59:59.000 Does that answer your question? 9:59:59.000,9:59:59.000 Cool. 9:59:59.000,9:59:59.000 Other questions? 9:59:59.000,9:59:59.000 There's one in front. 9:59:59.000,9:59:59.000 [Q] It's partially a question, partially[br]a comment. 9:59:59.000,9:59:59.000 Your initial idea was to have a telescope,[br]or something like this for source code. 9:59:59.000,9:59:59.000 For now, for me, it looks a little bit[br]more like a microscope, 9:59:59.000,9:59:59.000 so you can focus on one thing, but that's[br]not much. 9:59:59.000,9:59:59.000 So have you thought about how to[br]analyze an entire ecosystem 9:59:59.000,9:59:59.000 or something like this?
9:59:59.000,9:59:59.000 For example, now we have Django 2 which is[br]Python 3 only so it would be interesting to 9:59:59.000,9:59:59.000 look at all Django modules to see when[br]they start moving to this Django. 9:59:59.000,9:59:59.000 So we would need to start analyzing[br]thousands or millions of files, but then 9:59:59.000,9:59:59.000 we would need something SQL-like, or some[br]MapReduce jobs 9:59:59.000,9:59:59.000 or something like this for this. 9:59:59.000,9:59:59.000 [A] Yes. 9:59:59.000,9:59:59.000 So, we've started… 9:59:59.000,9:59:59.000 The two initiators of the project, Roberto[br]Di Cosmo and Stefano Zacchiroli, 9:59:59.000,9:59:59.000 are both researchers in computer science[br]so they have a strong background in 9:59:59.000,9:59:59.000 actually mining software repositories and[br]doing some large scale analysis 9:59:59.000,9:59:59.000 on source code. 9:59:59.000,9:59:59.000 We've been talking with research groups[br]whose main goal is to do analysis on 9:59:59.000,9:59:59.000 large scale source code archives. 9:59:59.000,9:59:59.000 One of the first mirrors of the archive[br]outside of our control 9:59:59.000,9:59:59.000 will be in Grenoble (France). 9:59:59.000,9:59:59.000 There's a few teams that work on[br]actually doing large scale research 9:59:59.000,9:59:59.000 on source code over there, 9:59:59.000,9:59:59.000 so that's what the mirror will be[br]used for. 9:59:59.000,9:59:59.000 We've also been looking at what[br]the Google open source team does. 9:59:59.000,9:59:59.000 They have this big repository with all[br]the code that Google uses 9:59:59.000,9:59:59.000 and they've started to push back,[br]like do large scale analysis of 9:59:59.000,9:59:59.000 security vulnerabilities, issues with[br]static and dynamic analysis 9:59:59.000,9:59:59.000 of the code and they've started pushing[br]their fixes upstream.
9:59:59.000,9:59:59.000 That's something that we want to enable[br]users to do, 9:59:59.000,9:59:59.000 that's not something that we want to do[br]ourselves, but we want to make sure 9:59:59.000,9:59:59.000 that people can do it using our archive. 9:59:59.000,9:59:59.000 So we'd be happy to work with people[br]who already do that so that 9:59:59.000,9:59:59.000 they can use their knowledge and their[br]tools inside our archive. 9:59:59.000,9:59:59.000 Does that answer your question? 9:59:59.000,9:59:59.000 Cool. 9:59:59.000,9:59:59.000 Any more questions? 9:59:59.000,9:59:59.000 No? Then thank you very much Nicolas. 9:59:59.000,9:59:59.000 Thank you. 9:59:59.000,9:59:59.000 [Applause]