WEBVTT 99:59:59.999 --> 99:59:59.999 Hi, thank you. 99:59:59.999 --> 99:59:59.999 I'm Nicolas Dandrimont and I will indeed be talking to you about 99:59:59.999 --> 99:59:59.999 Software Heritage. 99:59:59.999 --> 99:59:59.999 I'm a software engineer for this project. 99:59:59.999 --> 99:59:59.999 I've been working on it for 3 years now. 99:59:59.999 --> 99:59:59.999 And we'll see what this thing is all about. 99:59:59.999 --> 99:59:59.999 [Mic not working] 99:59:59.999 --> 99:59:59.999 I guess the batteries are out. 99:59:59.999 --> 99:59:59.999 So, let's try that again. 99:59:59.999 --> 99:59:59.999 So, we all know, we've been doing free software for a while, 99:59:59.999 --> 99:59:59.999 that software source code is something special. 99:59:59.999 --> 99:59:59.999 Why is that? 99:59:59.999 --> 99:59:59.999 As Harold Abelson has said in SICP, his textbook on programming, 99:59:59.999 --> 99:59:59.999 programs are meant to be read by people and then incidentally for machines to execute. 99:59:59.999 --> 99:59:59.999 Basically, what software source code provides us is a way inside 99:59:59.999 --> 99:59:59.999 the mind of the designer of the program. 99:59:59.999 --> 99:59:59.999 For instance, you can have, you can get inside very crazy algorithms 99:59:59.999 --> 99:59:59.999 that can do very fast reverse square roots for 3D, that kind of stuff 99:59:59.999 --> 99:59:59.999 Like in the Quake 2 source code. 99:59:59.999 --> 99:59:59.999 You can also get inside the algorithms that are underpinning the internet, 99:59:59.999 --> 99:59:59.999 for instance seeing the net queue algorithm in the Linux kernel. 99:59:59.999 --> 99:59:59.999 What we are building as the free software community is the free software commons. 99:59:59.999 --> 99:59:59.999 Basically, the commons is all the cultural and social and natural resources 99:59:59.999 --> 99:59:59.999 that we share and that everyone has access to. 
99:59:59.999 --> 99:59:59.999 More specifically, the software commons is what we are building 99:59:59.999 --> 99:59:59.999 with software that is open and that is available for all to use, to modify, 99:59:59.999 --> 99:59:59.999 to execute, to distribute. 99:59:59.999 --> 99:59:59.999 We know that those commons are a really critical part of our commons. 99:59:59.999 --> 99:59:59.999 Who's taking care of it? 99:59:59.999 --> 99:59:59.999 The software is fragile. 99:59:59.999 --> 99:59:59.999 Like all digital information, you can lose software. 99:59:59.999 --> 99:59:59.999 People can decide to shut down hosting spaces because of business decisions. 99:59:59.999 --> 99:59:59.999 People can hack into software hosting platforms and remove the code maliciously 99:59:59.999 --> 99:59:59.999 or just inadvertently. 99:59:59.999 --> 99:59:59.999 And, of course, for the obsolete stuff, there's rot. 99:59:59.999 --> 99:59:59.999 If you don't care about the data, then it rots and it decays and you lose it. 99:59:59.999 --> 99:59:59.999 So, where is the archive we go to when something is lost, 99:59:59.999 --> 99:59:59.999 when GitLab goes away, when GitHub goes away? 99:59:59.999 --> 99:59:59.999 Where do we go? 99:59:59.999 --> 99:59:59.999 Finally, there's one last thing that we noticed, it's that 99:59:59.999 --> 99:59:59.999 there's a lot of teams that work on research on software 99:59:59.999 --> 99:59:59.999 and there's no real big infrastructure for research on code. 99:59:59.999 --> 99:59:59.999 There's tons of critical issues around code: safety, security, verification, proofs. 99:59:59.999 --> 99:59:59.999 Nobody's doing this at a very large scale. 99:59:59.999 --> 99:59:59.999 If you want to see the stars, you go to the Atacama desert and 99:59:59.999 --> 99:59:59.999 you point a telescope at the sky. 99:59:59.999 --> 99:59:59.999 Where is the telescope for source code? 99:59:59.999 --> 99:59:59.999 That's what Software Heritage wants to be. 
99:59:59.999 --> 99:59:59.999 What we do is we collect, we preserve and we share all the software 99:59:59.999 --> 99:59:59.999 that is publicly available. 99:59:59.999 --> 99:59:59.999 Why do we do that? We do that to preserve the past, to enhance the present 99:59:59.999 --> 99:59:59.999 and to prepare for the future. 99:59:59.999 --> 99:59:59.999 What we're building is a base infrastructure that can be used 99:59:59.999 --> 99:59:59.999 for cultural heritage, for industry, for research and for education purposes. 99:59:59.999 --> 99:59:59.999 How do we do it? We do it with an open approach. 99:59:59.999 --> 99:59:59.999 Every single line of code that we write is free software. 99:59:59.999 --> 99:59:59.999 We do it transparently, everything that we do, we do it in the open, 99:59:59.999 --> 99:59:59.999 be that on a mailing list or on our issue tracker. 99:59:59.999 --> 99:59:59.999 And we strive to do it for the very long haul, so we do it with replication in mind 99:59:59.999 --> 99:59:59.999 so that no single entity has full control over the data that we collect. 99:59:59.999 --> 99:59:59.999 And we do it in a non-profit fashion so that we avoid 99:59:59.999 --> 99:59:59.999 business-driven decisions impacting the project. 99:59:59.999 --> 99:59:59.999 So, what do we do concretely? 99:59:59.999 --> 99:59:59.999 We do archiving of version control systems. 99:59:59.999 --> 99:59:59.999 What does that mean? 99:59:59.999 --> 99:59:59.999 It means we archive file contents, so source code, files. 99:59:59.999 --> 99:59:59.999 We archive revisions, which means all the metadata of the history of the projects, 99:59:59.999 --> 99:59:59.999 we try to download it and we put it inside a common data model that is 99:59:59.999 --> 99:59:59.999 shared across all the archive. 
99:59:59.999 --> 99:59:59.999 We archive releases of the software, releases that have been tagged 99:59:59.999 --> 99:59:59.999 in a version control system as well as releases that we can find as tarballs 99:59:59.999 --> 99:59:59.999 because sometimes… boof, views of this source code differ. 99:59:59.999 --> 99:59:59.999 Of course, we archive where and when we've seen the data that we've collected. 99:59:59.999 --> 99:59:59.999 All of this, we put inside a canonical, VCS-agnostic, data model. 99:59:59.999 --> 99:59:59.999 If you have a Debian package, with its history, if you have a git repository, 99:59:59.999 --> 99:59:59.999 if you have a subversion repository, if you have a mercurial repository, 99:59:59.999 --> 99:59:59.999 it all looks the same and you can work on it with the same tools. 99:59:59.999 --> 99:59:59.999 What we don't do is archive what's around the software, for instance 99:59:59.999 --> 99:59:59.999 the bug tracking systems or the homepages or the wikis or the mailing lists. 99:59:59.999 --> 99:59:59.999 There are some projects that work in this space, for instance 99:59:59.999 --> 99:59:59.999 the internet archive does a lot of really good work around archiving the web. 99:59:59.999 --> 99:59:59.999 Our goal is not to replace them, but to work with them and be able to do 99:59:59.999 --> 99:59:59.999 linking across all the archives that exist. 99:59:59.999 --> 99:59:59.999 We can, for instance for the mailing lists there's the gmane project 99:59:59.999 --> 99:59:59.999 that does a lot of archiving of free software mailing lists. 99:59:59.999 --> 99:59:59.999 So our long term vision is to play a part in a semantic wikipedia of software, 99:59:59.999 --> 99:59:59.999 a wikidata of software where we can hyperlink all the archives that exist 99:59:59.999 --> 99:59:59.999 and do stuff in the area. 99:59:59.999 --> 99:59:59.999 Quick tour of our infrastructure. 99:59:59.999 --> 99:59:59.999 Basically, all the way to the right is our archive. 
99:59:59.999 --> 99:59:59.999 Our archive consists of a huge graph of all the metadata about 99:59:59.999 --> 99:59:59.999 the files, the directories, the revisions, the commits and the releases and 99:59:59.999 --> 99:59:59.999 all the projects that are on top of the graph. 99:59:59.999 --> 99:59:59.999 We separate the file storage into another object storage because of 99:59:59.999 --> 99:59:59.999 the size discrepancy: we have lots and lots of file contents that we need to store 99:59:59.999 --> 99:59:59.999 so we do that outside the database that is used to store the graph. 99:59:59.999 --> 99:59:59.999 Basically, what we archive is a set of software origins that are 99:59:59.999 --> 99:59:59.999 git repositories, mercurial repositories, etc. etc. 99:59:59.999 --> 99:59:59.999 All those origins are loaded on a regular schedule. 99:59:59.999 --> 99:59:59.999 If there is a very active software origin, we're gonna archive it more often 99:59:59.999 --> 99:59:59.999 than stale things that don't get a lot of updates. 99:59:59.999 --> 99:59:59.999 What do we do to get the list of software origins that we archive? 99:59:59.999 --> 99:59:59.999 We have a bunch of listers that can scroll through the list of repositories, 99:59:59.999 --> 99:59:59.999 for instance on GitHub or other hosting platforms. 99:59:59.999 --> 99:59:59.999 We have code that can read Debian archive metadata to make a list of the packages 99:59:59.999 --> 99:59:59.999 that are inside this archive and can be archived, etc. 99:59:59.999 --> 99:59:59.999 All of this is done on a regular basis. 99:59:59.999 --> 99:59:59.999 We are currently working on some kind of push mechanism so that 99:59:59.999 --> 99:59:59.999 people or other systems can notify us of updates. 
99:59:59.999 --> 99:59:59.999 Our goal is not to do real time archiving, we're really in it for the long run 99:59:59.999 --> 99:59:59.999 but we still want to be able to prioritize stuff that people tell us is 99:59:59.999 --> 99:59:59.999 important to archive. 99:59:59.999 --> 99:59:59.999 The Internet Archive has a "save now" button and we want to implement 99:59:59.999 --> 99:59:59.999 something along those lines as well, 99:59:59.999 --> 99:59:59.999 so if we know that some software project is in danger for one reason or another, 99:59:59.999 --> 99:59:59.999 then we can prioritize archiving it. 99:59:59.999 --> 99:59:59.999 So this is the basic structure of a revision in the Software Heritage archive. 99:59:59.999 --> 99:59:59.999 You'll see that it's very similar to a git commit. 99:59:59.999 --> 99:59:59.999 The format of the metadata is pretty much what you'll find in a git commit 99:59:59.999 --> 99:59:59.999 with some extensions that you don't see here because this is from a git commit. 99:59:59.999 --> 99:59:59.999 So basically what we do is we take the identifier of the directory 99:59:59.999 --> 99:59:59.999 that the revision points to, we take the identifier of the parent of the revision 99:59:59.999 --> 99:59:59.999 so we can keep track of the history 99:59:59.999 --> 99:59:59.999 and then we add some metadata, authorship and committership information 99:59:59.999 --> 99:59:59.999 and the revision message and then we take a hash of this, 99:59:59.999 --> 99:59:59.999 it makes an identifier that's probably unique, very very probably unique. 99:59:59.999 --> 99:59:59.999 Using those identifiers, we can retrace all the origins, all the history of 99:59:59.999 --> 99:59:59.999 development of the project and we can deduplicate across all the archive. 
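The revision identifier described above (directory identifier, parent identifiers, authorship and committership, message, then a hash) can be sketched in a few lines of Python. This is a minimal sketch, assuming git's commit object layout and SHA-1, since the talk only says the format is very similar to a git commit. The function name `revision_id` and the sample author string are hypothetical, and the extensions mentioned in the talk are not modelled.

```python
import hashlib


def revision_id(directory_id, parent_ids, author, committer, message):
    """Hash a git-style commit manifest into a hex identifier.

    Assumption: git's commit object layout hashed with SHA-1. The real
    archive format may carry extra fields that are not modelled here.
    """
    lines = [f"tree {directory_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git prefixes every object with its type and byte length.
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# Example inputs only: the directory id is git's well-known empty tree.
who = "A U Thor <author@example.com> 1500000000 +0000"
root = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "Initial commit\n")
child = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [root], who, who, "Initial commit\n")
```

Because the identifier depends only on the content being hashed, the same revision seen in two different origins gets the same identifier, while changing the parent (as in `child`) yields a different one.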
99:59:59.999 --> 99:59:59.999 All the identifiers are intrinsic, which means that we compute them 99:59:59.999 --> 99:59:59.999 from the contents of the things that we are archiving, which means that 99:59:59.999 --> 99:59:59.999 we can deduplicate very efficiently across all the data that we archive. 99:59:59.999 --> 99:59:59.999 How much data do we archive? 99:59:59.999 --> 99:59:59.999 A bit. 99:59:59.999 --> 99:59:59.999 So, we have passed the billion revision mark a few weeks ago. 99:59:59.999 --> 99:59:59.999 This graph is a bit old, but anyway, you have a live graph on our website. 99:59:59.999 --> 99:59:59.999 That's more than 4.5 billion unique source code files. 99:59:59.999 --> 99:59:59.999 We don't actually discriminate between what we would consider is source code 99:59:59.999 --> 99:59:59.999 and what upstream developers consider as source code, 99:59:59.999 --> 99:59:59.999 so everything that's in a git repository, we consider as source code 99:59:59.999 --> 99:59:59.999 if it's below a size threshold. 99:59:59.999 --> 99:59:59.999 A billion revisions across 80 million projects. 99:59:59.999 --> 99:59:59.999 What do we archive? 99:59:59.999 --> 99:59:59.999 We archive Github, we archive Debian. 99:59:59.999 --> 99:59:59.999 So, Debian we run the archival process every day, every day we get the new packages 99:59:59.999 --> 99:59:59.999 that have been uploaded in the archive. 99:59:59.999 --> 99:59:59.999 Github, we try to keep up, we are currently working on some performance improvements, 99:59:59.999 --> 99:59:59.999 some scalability improvements to make sure that we can keep up 99:59:59.999 --> 99:59:59.999 with the development on GitHub. 
99:59:59.999 --> 99:59:59.999 We have archived as a one-off thing the former content of Gitorious and Google Code 99:59:59.999 --> 99:59:59.999 which are two prominent code hosting spaces that closed recently 99:59:59.999 --> 99:59:59.999 and we've been working on archiving the contents of Bitbucket 99:59:59.999 --> 99:59:59.999 which is kind of a challenge because the API is a bit buggy and 99:59:59.999 --> 99:59:59.999 Atlassian isn't too interested in fixing it. 99:59:59.999 --> 99:59:59.999 In concrete storage terms, we have 175TB of blobs, so the files take 175TB 99:59:59.999 --> 99:59:59.999 and a kind of big database, 6TB. 99:59:59.999 --> 99:59:59.999 The database only contains the graph of the metadata for the archive 99:59:59.999 --> 99:59:59.999 which is basically a graph of 8 billion nodes and 70 billion edges. 99:59:59.999 --> 99:59:59.999 And of course it's growing daily. 99:59:59.999 --> 99:59:59.999 We are pretty sure this is the richest source code archive that's available now 99:59:59.999 --> 99:59:59.999 and it keeps growing. 99:59:59.999 --> 99:59:59.999 So how do we actually… 99:59:59.999 --> 99:59:59.999 What kind of stack do we use to store all this? 99:59:59.999 --> 99:59:59.999 We use Debian, of course. 99:59:59.999 --> 99:59:59.999 All our deployment recipes are in Puppet in public repositories. 99:59:59.999 --> 99:59:59.999 We've started using Ceph for the blob storage. 99:59:59.999 --> 99:59:59.999 We use PostgreSQL for the metadata storage with some of the standard tools that 99:59:59.999 --> 99:59:59.999 live around PostgreSQL for backups and replication. 99:59:59.999 --> 99:59:59.999 We use a standard Python stack for scheduling of jobs 99:59:59.999 --> 99:59:59.999 and for web interface stuff, basically psycopg2 for the low level stuff, 99:59:59.999 --> 99:59:59.999 Django for the web stuff 99:59:59.999 --> 99:59:59.999 and Celery for the scheduling of jobs. 
99:59:59.999 --> 99:59:59.999 In house, we've written an ad hoc object storage system which has 99:59:59.999 --> 99:59:59.999 a bunch of backends that you can use. 99:59:59.999 --> 99:59:59.999 Basically, we are agnostic between a UNIX filesystem, azure, Ceph, or tons of… 99:59:59.999 --> 99:59:59.999 It's a really simple object storage system where you can just put an object, 99:59:59.999 --> 99:59:59.999 get an object, put a bunch of objects, get a bunch of objects. 99:59:59.999 --> 99:59:59.999 We've implemented removal but we don't really use it yet. 99:59:59.999 --> 99:59:59.999 All the data model implementation, all the listers, the loaders, the schedulers 99:59:59.999 --> 99:59:59.999 everything has been written by us, it's a pile of Python code. 99:59:59.999 --> 99:59:59.999 So, basically 20 Python packages and around 30 Puppet modules 99:59:59.999 --> 99:59:59.999 to deploy all that and we've done everything as a copyleft license, 99:59:59.999 --> 99:59:59.999 GPLv3 for the backend and AGPLv3 for the frontend. 99:59:59.999 --> 99:59:59.999 Even if people try and make their own Software Heritage using our code, 99:59:59.999 --> 99:59:59.999 they have to publish their changes. 99:59:59.999 --> 99:59:59.999 Hardware-wise, we run for now everything on a few hypervisors in house and 99:59:59.999 --> 99:59:59.999 our main storage is currently still on a very high density, very slow, 99:59:59.999 --> 99:59:59.999 very bulky storage array, but we've started to migrate all this thing 99:59:59.999 --> 99:59:59.999 into a Ceph storage cluster which we're gonna grow as we need 99:59:59.999 --> 99:59:59.999 in the next few months. 99:59:59.999 --> 99:59:59.999 We've also been granted by Microsoft sponsorship, ??? sponsorship 99:59:59.999 --> 99:59:59.999 for their cloud services. 
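The simple object storage interface mentioned a moment ago (put an object, get an object, put a bunch, get a bunch) combines naturally with identifiers computed from content. The following is a minimal in-memory sketch of that idea, not the project's actual code: the class name `InMemoryObjectStore`, its method names, and the choice of SHA-256 are all assumptions made for illustration. A real backend would write to a filesystem, Ceph or a cloud service instead of a dict.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjectStore:
    """Hypothetical content-addressed store backed by a dict."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself (assumed: SHA-256),
        # so storing the same bytes twice keeps a single copy.
        obj_id = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        # Raises KeyError if the object was never stored.
        return self._blobs[obj_id]

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._blobs)


store = InMemoryObjectStore()
ids = store.put_batch([b"int main() {}\n", b"int main() {}\n", b"README\n"])
```

In this usage, the first two blobs are identical, so they share one identifier and the store holds two objects rather than three.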
99:59:59.999 --> 99:59:59.999 We've started putting mirrors of everything in their infrastructure as well 99:59:59.999 --> 99:59:59.999 which means full object storage mirror, so 170TB of stuff mirrored on azure 99:59:59.999 --> 99:59:59.999 as well as a database mirror for graph. 99:59:59.999 --> 99:59:59.999 And we're also doing all the content indexing and all the things that need 99:59:59.999 --> 99:59:59.999 scalability on azure now. 99:59:59.999 --> 99:59:59.999 Finally, at the university of Bologna, we have a backend storage for the download 99:59:59.999 --> 99:59:59.999 so currently our main storage is quite slow so if you want to download 99:59:59.999 --> 99:59:59.999 a bundle of things that we've archived, then we actually keep a cache of 99:59:59.999 --> 99:59:59.999 what we've done so that it doesn't take a million years to download stuff. 99:59:59.999 --> 99:59:59.999 We do our development in a classic free and open source software way, 99:59:59.999 --> 99:59:59.999 so we talk on our mailing list, on IRC, on a forge. 99:59:59.999 --> 99:59:59.999 Everything is in English, everything is public, there is more information 99:59:59.999 --> 99:59:59.999 on our website if you want to actually have a look and see what we do. 99:59:59.999 --> 99:59:59.999 So, all that is very interesting but how do we actually look into it? 99:59:59.999 --> 99:59:59.999 One of the ways that you can browse, that you can use the archive 99:59:59.999 --> 99:59:59.999 is using a REST API. 99:59:59.999 --> 99:59:59.999 Basically, this API allows you to do pointwise browsing of the archive 99:59:59.999 --> 99:59:59.999 so you can go and follow the links in a graph, 99:59:59.999 --> 99:59:59.999 which is very slow but gives you a pretty much full access of the data. 99:59:59.999 --> 99:59:59.999 There's an index for the API that you can look at, but that's not really convenient, 99:59:59.999 --> 99:59:59.999 so we also have a web user interface. 
99:59:59.999 --> 99:59:59.999 It's in preview right now, we're gonna do a full launch in the month of June. 99:59:59.999 --> 99:59:59.999 If you go to https://archive.softwareheritage.org/browse/ 99:59:59.999 --> 99:59:59.999 with the given credentials, you can have a look and see what's going on. 99:59:59.999 --> 99:59:59.999 Basically, we have a web interface that allows you to look at 99:59:59.999 --> 99:59:59.999 what origins we have downloaded, when we have downloaded the origins 99:59:59.999 --> 99:59:59.999 with a kind of graph view of how often we visited the origins 99:59:59.999 --> 99:59:59.999 and a calendar view of when we have visited the origins. 99:59:59.999 --> 99:59:59.999 And then, inside the visits, you can actually browse the contents 99:59:59.999 --> 99:59:59.999 that we've archived. 99:59:59.999 --> 99:59:59.999 So, for instance, this is the Python repository as of May 2017 99:59:59.999 --> 99:59:59.999 and you can have the list of files, then drill down, 99:59:59.999 --> 99:59:59.999 it should be pretty intuitive. 99:59:59.999 --> 99:59:59.999 If you look at the history of a project, you can see the differences 99:59:59.999 --> 99:59:59.999 between two revisions of a project. 99:59:59.999 --> 99:59:59.999 Oh no, that's the syntax highlighting, but anyway the diffs arrive right after. 99:59:59.999 --> 99:59:59.999 So, yeah, pretty cool stuff. 99:59:59.999 --> 99:59:59.999 I should be able to do a demo as well, it should work. 99:59:59.999 --> 99:59:59.999 I'm gonna zoom in. 99:59:59.999 --> 99:59:59.999 So this is the main archive, you can see some statistics about the objects 99:59:59.999 --> 99:59:59.999 that we've downloaded. 99:59:59.999 --> 99:59:59.999 When you zoom in, you get some kind of overflows, because… 99:59:59.999 --> 99:59:59.999 Yeah, why would you do that. 99:59:59.999 --> 99:59:59.999 If you want to browse, we can try to find an origin. 99:59:59.999 --> 99:59:59.999 "glibc". 
99:59:59.999 --> 99:59:59.999 So there's lots and lots of, like, random GitHub forks of things… 99:59:59.999 --> 99:59:59.999 We don't discriminate and we don't really filter what we download. 99:59:59.999 --> 99:59:59.999 We are looking into doing some relevance kind of sorting of the results, here. 99:59:59.999 --> 99:59:59.999 Next. 99:59:59.999 --> 99:59:59.999 Xilinx, why not. 99:59:59.999 --> 99:59:59.999 So, this has been downloaded for the last time on August 3rd 2016, 99:59:59.999 --> 99:59:59.999 so it's probably a dead repository, 99:59:59.999 --> 99:59:59.999 but yeah, you can see a bunch of source code, 99:59:59.999 --> 99:59:59.999 you can read the README of the glibc. 99:59:59.999 --> 99:59:59.999 If we go back to a more interesting origin 99:59:59.999 --> 99:59:59.999 here's the repository for git. 99:59:59.999 --> 99:59:59.999 I've deliberately selected an old visit of the repo so that we can see 99:59:59.999 --> 99:59:59.999 what was going on then. 99:59:59.999 --> 99:59:59.999 If I look at the calendar view, you can see that we've had some issues actually 99:59:59.999 --> 99:59:59.999 updating this, but anyway. 99:59:59.999 --> 99:59:59.999 If I look at the last visit, then we can actually browse the contents, 99:59:59.999 --> 99:59:59.999 you can get syntax highlighting as well. 99:59:59.999 --> 99:59:59.999 This is a big big file with lots of comments. 99:59:59.999 --> 99:59:59.999 Let's see the actual source code… 99:59:59.999 --> 99:59:59.999 Anyway, so, that's the browsing interface. 99:59:59.999 --> 99:59:59.999 We can also now get back what we've archived and download it, 99:59:59.999 --> 99:59:59.999 which is kind of something that you might want to do 99:59:59.999 --> 99:59:59.999 if a repository is lost, you can actually download it 99:59:59.999 --> 99:59:59.999 and get the source code back again. 99:59:59.999 --> 99:59:59.999 How we do that. 
99:59:59.999 --> 99:59:59.999 If you go on the top right of this browsing interface, you have actions and download 99:59:59.999 --> 99:59:59.999 and you can download a directory that you are currently looking at. 99:59:59.999 --> 99:59:59.999 It's an asynchronous process, which means that if there is a lot of load, 99:59:59.999 --> 99:59:59.999 then it's gonna take some time to actually be able to download the content. 99:59:59.999 --> 99:59:59.999 So you can put in your email address so we can notify you when the download is ready. 99:59:59.999 --> 99:59:59.999 I'm gonna try my luck and say just "ok" and it's gonna appear at some point 99:59:59.999 --> 99:59:59.999 in the list of things that I've requested. 99:59:59.999 --> 99:59:59.999 I've already requested some things that we can actually get and open as a tarball. 99:59:59.999 --> 99:59:59.999 Yeah, I think that's the thing that I was actually looking at, 99:59:59.999 --> 99:59:59.999 which is this revision of the git source code 99:59:59.999 --> 99:59:59.999 and then I can open it. 99:59:59.999 --> 99:59:59.999 Yay, emacs, that's what you want. 99:59:59.999 --> 99:59:59.999 Yay, source code. 99:59:59.999 --> 99:59:59.999 This seems to work. 99:59:59.999 --> 99:59:59.999 And then, of course, if you want to actually script what you're doing, 99:59:59.999 --> 99:59:59.999 there's an API that allows you to do the downloads as well, so you can. 99:59:59.999 --> 99:59:59.999 The source code is deduplicated a lot, which means that for one single repository 99:59:59.999 --> 99:59:59.999 you get tons of files that we have to collect if you want to actually download 99:59:59.999 --> 99:59:59.999 an archive of a directory. 
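For anyone scripting the asynchronous download described above, the general shape is: ask for the bundle, poll until it is ready, then fetch it. The sketch below shows only that pattern and makes no network calls. The function `wait_for_bundle`, the `FakeBundleService` stand-in and the status strings `"pending"`, `"done"` and `"failed"` are all hypothetical names chosen for illustration, not the archive's actual API.

```python
import time
from typing import Callable


def wait_for_bundle(
    request: Callable[[], None],
    status: Callable[[], str],
    fetch: Callable[[], bytes],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Request a bundle, poll its status, and fetch it once ready."""
    request()
    for _ in range(max_polls):
        state = status()
        if state == "done":
            return fetch()
        if state == "failed":
            raise RuntimeError("bundle preparation failed")
        sleep(poll_seconds)
    raise TimeoutError("bundle was not ready after the allowed number of polls")


class FakeBundleService:
    """Stand-in for a remote service; reports 'pending' a few times first."""

    def __init__(self, polls_until_ready: int) -> None:
        self._remaining = polls_until_ready
        self.requested = False

    def request(self) -> None:
        self.requested = True

    def status(self) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "pending"
        return "done"

    def fetch(self) -> bytes:
        return b"tarball bytes"


service = FakeBundleService(polls_until_ready=2)
data = wait_for_bundle(
    service.request, service.status, service.fetch,
    poll_seconds=0, sleep=lambda seconds: None,
)
```

Passing the three operations in as callables keeps the polling logic independent of any particular HTTP client, so the same loop can be pointed at real requests once the actual endpoints are known.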
99:59:59.999 --> 99:59:59.999 It takes a while but we have an asynchronous API so you can POST 99:59:59.999 --> 99:59:59.999 the identifier of a revision to this URL and then get status updates 99:59:59.999 --> 99:59:59.999 and at some point, it will tell you that the… here 99:59:59.999 --> 99:59:59.999 The status will tell you that the object is available. 99:59:59.999 --> 99:59:59.999 You can download it and you can even download the full history of a project 99:59:59.999 --> 99:59:59.999 and get that as a git-fast-export archive that you can reimport into 99:59:59.999 --> 99:59:59.999 a new git repository. 99:59:59.999 --> 99:59:59.999 So any kind of VCS that we've imported, you can export as a git repository 99:59:59.999 --> 99:59:59.999 and reimport on your machine. 99:59:59.999 --> 99:59:59.999 How to get involved in the project? 99:59:59.999 --> 99:59:59.999 We have a lot of features that we're interested in, lots of them are now 99:59:59.999 --> 99:59:59.999 in early access or have been done. 99:59:59.999 --> 99:59:59.999 There's some stuff that we would like help with. 99:59:59.999 --> 99:59:59.999 This is some stuff that we're working on: 99:59:59.999 --> 99:59:59.999 provenance information, you have a content 99:59:59.999 --> 99:59:59.999 you want to know which repository it comes from, 99:59:59.999 --> 99:59:59.999 that's something we're on. 99:59:59.999 --> 99:59:59.999 Full text search, the end goal is to be able even to trace 99:59:59.999 --> 99:59:59.999 the source of snippets of code that have been copied from one project to another. 99:59:59.999 --> 99:59:59.999 That's something that we can look into with the wealth of information that 99:59:59.999 --> 99:59:59.999 we have inside the archive. 99:59:59.999 --> 99:59:59.999 There's a lot of things that, 99:59:59.999 --> 99:59:59.999 I mean… 99:59:59.999 --> 99:59:59.999 There's a lot of things that people want to do with the archive. 
99:59:59.999 --> 99:59:59.999 Our goal is to enable people to do things, to do interesting things 99:59:59.999 --> 99:59:59.999 with a lot of source code. 99:59:59.999 --> 99:59:59.999 If you have an idea of what you want to do with such an archive, 99:59:59.999 --> 99:59:59.999 please come talk to us 99:59:59.999 --> 99:59:59.999 and we'll be happy to help you help us. 99:59:59.999 --> 99:59:59.999 What we want to do is to diversify the sources of things that we archive. 99:59:59.999 --> 99:59:59.999 Currently, we have good support for git, we have OK support for subversion 99:59:59.999 --> 99:59:59.999 and mercurial. 99:59:59.999 --> 99:59:59.999 If your project of choice is in another version control system, 99:59:59.999 --> 99:59:59.999 we are gonna miss it. 99:59:59.999 --> 99:59:59.999 So people can contribute in this area. 99:59:59.999 --> 99:59:59.999 For the listing part, we have coverage of Debian, we have coverage of GitHub, 99:59:59.999 --> 99:59:59.999 if your code is somewhere else, we won't see it, so we need people to contribute 99:59:59.999 --> 99:59:59.999 stuff that can list for instance GitLab instances, 99:59:59.999 --> 99:59:59.999 and then we can integrate that in our infrastructure and actually have 99:59:59.999 --> 99:59:59.999 people be able to archive their GitLab instances. 99:59:59.999 --> 99:59:59.999 And of course, we need to spread the word, make the project sustainable. 99:59:59.999 --> 99:59:59.999 We have a few sponsors now, Microsoft, Nokia, Huawei, GitHub has joined as a sponsor. 99:59:59.999 --> 99:59:59.999 The University of Bologna, of course Inria is sponsoring. 99:59:59.999 --> 99:59:59.999 But we need to keep spreading the word and keep the project sustainable. 99:59:59.999 --> 99:59:59.999 And, of course, we need to save endangered source code. 99:59:59.999 --> 99:59:59.999 For that, we have a suggestion box on the wiki that you can add things to. 
99:59:59.999 --> 99:59:59.999 For instance, we have in the back of our minds archiving SourceForge, 99:59:59.999 --> 99:59:59.999 because we know that this isn't very sustainable and that it's at risk of being 99:59:59.999 --> 99:59:59.999 taken down at some point. 99:59:59.999 --> 99:59:59.999 If you want to join us, we also have some job openings that are available. 99:59:59.999 --> 99:59:59.999 For now it's in Paris, so if you want to consider coming to work with us in Paris, 99:59:59.999 --> 99:59:59.999 you can look into that. 99:59:59.999 --> 99:59:59.999 That's Software Heritage. 99:59:59.999 --> 99:59:59.999 We are building a reference archive of all the free software 99:59:59.999 --> 99:59:59.999 that has ever been written 99:59:59.999 --> 99:59:59.999 in an international, open, non-profit and mutualised infrastructure 99:59:59.999 --> 99:59:59.999 that we have opened up to everyone, all users, vendors, developers can use it. 99:59:59.999 --> 99:59:59.999 The idea is to be at the service of the community and for society 99:59:59.999 --> 99:59:59.999 as a whole. 99:59:59.999 --> 99:59:59.999 So if you want to join us, you can look at our website, you can look at our code.