9:59:59.000,9:59:59.000 Hi, thank you. 9:59:59.000,9:59:59.000 I'm Nicolas Dandrimont and I will indeed[br]be talking to you about 9:59:59.000,9:59:59.000 Software Heritage. 9:59:59.000,9:59:59.000 I'm a software engineer for this project. 9:59:59.000,9:59:59.000 I've been working on it for 3 years now. 9:59:59.000,9:59:59.000 And we'll see what this thing is all about. 9:59:59.000,9:59:59.000 [Mic not working] 9:59:59.000,9:59:59.000 I guess the batteries are out. 9:59:59.000,9:59:59.000 So, let's try that again. 9:59:59.000,9:59:59.000 So, we all know, we've been doing[br]free software for a while, 9:59:59.000,9:59:59.000 that software source code is something[br]special. 9:59:59.000,9:59:59.000 Why is that? 9:59:59.000,9:59:59.000 As Harold Abelson has said in SICP, his[br]textbook on programming, 9:59:59.000,9:59:59.000 programs are meant to be read by people[br]and then incidentally for machines to execute. 9:59:59.000,9:59:59.000 Basically, what software source code[br]provides us is a way inside 9:59:59.000,9:59:59.000 the mind of the designer of the program. 9:59:59.000,9:59:59.000 For instance, you can have,[br]you can get inside very crazy algorithms 9:59:59.000,9:59:59.000 that can do very fast reverse square roots[br]for 3D, that kind of stuff 9:59:59.000,9:59:59.000 Like in the Quake 2 source code. 9:59:59.000,9:59:59.000 You can also get inside the algorithms[br]that are underpinning the internet, 9:59:59.000,9:59:59.000 for instance seeing the net queue[br]algorithm in the Linux kernel. 9:59:59.000,9:59:59.000 What we are building as the free software[br]community is the free software commons. 9:59:59.000,9:59:59.000 Basically, the commons is all the cultural[br]and social and natural resources 9:59:59.000,9:59:59.000 that we share and that everyone[br]has access to. 
9:59:59.000,9:59:59.000 More specifically, the software commons[br]is what we are building 9:59:59.000,9:59:59.000 with software that is open and that is[br]available for all to use, to modify, 9:59:59.000,9:59:59.000 to execute, to distribute. 9:59:59.000,9:59:59.000 We know that those commons are a really[br]critical part of our commons. 9:59:59.000,9:59:59.000 Who's taking care of it? 9:59:59.000,9:59:59.000 The software is fragile. 9:59:59.000,9:59:59.000 Like all digital information, you can lose[br]software. 9:59:59.000,9:59:59.000 People can decide to shut down hosting[br]spaces because of business decisions. 9:59:59.000,9:59:59.000 People can hack into software hosting[br]platforms and remove the code maliciously 9:59:59.000,9:59:59.000 or just inadvertently. 9:59:59.000,9:59:59.000 And, of course, for the obsolete stuff,[br]there's rot. 9:59:59.000,9:59:59.000 If you don't care about the data, then[br]it rots and it decays and you lose it. 9:59:59.000,9:59:59.000 So, where is the archive we go to[br]when something is lost, 9:59:59.000,9:59:59.000 when GitLab goes away, when GitHub[br]goes away? 9:59:59.000,9:59:59.000 Where do we go? 9:59:59.000,9:59:59.000 Finally, there's one last thing that we[br]noticed, it's that 9:59:59.000,9:59:59.000 there's a lot of teams that work on[br]research on software 9:59:59.000,9:59:59.000 and there's no real big infrastructure[br]for research on code. 9:59:59.000,9:59:59.000 There's tons of critical issues around[br]code: safety, security, verification, proofs. 9:59:59.000,9:59:59.000 Nobody's doing this at a very large scale. 9:59:59.000,9:59:59.000 If you want to see the stars, you go to[br]the Atacama Desert and 9:59:59.000,9:59:59.000 you point a telescope at the sky. 9:59:59.000,9:59:59.000 Where is the telescope for source code? 9:59:59.000,9:59:59.000 That's what Software Heritage wants to be. 
9:59:59.000,9:59:59.000 What we do is we collect, we preserve[br]and we share all the software 9:59:59.000,9:59:59.000 that is publicly available. 9:59:59.000,9:59:59.000 Why do we do that? We do that to[br]preserve the past, to enhance the present 9:59:59.000,9:59:59.000 and to prepare for the future. 9:59:59.000,9:59:59.000 What we're building is a base infrastructure[br]that can be used 9:59:59.000,9:59:59.000 for cultural heritage, for industry,[br]for research and for education purposes. 9:59:59.000,9:59:59.000 How do we do it? We do it with an open[br]approach. 9:59:59.000,9:59:59.000 Every single line of code that we write[br]is free software. 9:59:59.000,9:59:59.000 We do it transparently, everything that[br]we do, we do it in the open, 9:59:59.000,9:59:59.000 be that on a mailing list or on[br]our issue tracker. 9:59:59.000,9:59:59.000 And we strive to do it for the very long[br]haul, so we do it with replication in mind 9:59:59.000,9:59:59.000 so that no single entity has full control[br]over the data that we collect. 9:59:59.000,9:59:59.000 And we do it in a non-profit fashion[br]so that we avoid 9:59:59.000,9:59:59.000 business-driven decisions impacting[br]the project. 9:59:59.000,9:59:59.000 So, what do we do concretely? 9:59:59.000,9:59:59.000 We do archiving of version control systems. 9:59:59.000,9:59:59.000 What does that mean? 9:59:59.000,9:59:59.000 It means we archive file contents, so[br]source code, files. 9:59:59.000,9:59:59.000 We archive revisions, which means all the[br]metadata of the history of the projects, 9:59:59.000,9:59:59.000 we try to download it and we put it inside[br]a common data model that is 9:59:59.000,9:59:59.000 shared across all the archive. 
9:59:59.000,9:59:59.000 We archive releases of the software,[br]releases that have been tagged 9:59:59.000,9:59:59.000 in a version control system as well as[br]releases that we can find as tarballs 9:59:59.000,9:59:59.000 because sometimes… boof, views of[br]this source code differ. 9:59:59.000,9:59:59.000 Of course, we archive where and when[br]we've seen the data that we've collected. 9:59:59.000,9:59:59.000 All of this, we put inside a canonical,[br]VCS-agnostic, data model. 9:59:59.000,9:59:59.000 If you have a Debian package, with its[br]history, if you have a Git repository, 9:59:59.000,9:59:59.000 if you have a Subversion repository, if[br]you have a Mercurial repository, 9:59:59.000,9:59:59.000 it all looks the same and you can work[br]on it with the same tools. 9:59:59.000,9:59:59.000 What we don't do is archive what's around[br]the software, for instance 9:59:59.000,9:59:59.000 the bug tracking systems or the homepages[br]or the wikis or the mailing lists. 9:59:59.000,9:59:59.000 There are some projects that work[br]in this space, for instance 9:59:59.000,9:59:59.000 the Internet Archive does a lot of[br]really good work around archiving the web. 9:59:59.000,9:59:59.000 Our goal is not to replace them, but to[br]work with them and be able to do 9:59:59.000,9:59:59.000 linking across all the archives that exist. 9:59:59.000,9:59:59.000 We can, for instance for the mailing lists[br]there's the Gmane project 9:59:59.000,9:59:59.000 that does a lot of archiving of free[br]software mailing lists. 9:59:59.000,9:59:59.000 So our long-term vision is to play a part[br]in a semantic Wikipedia of software, 9:59:59.000,9:59:59.000 a Wikidata of software where we can[br]hyperlink all the archives that exist 9:59:59.000,9:59:59.000 and do stuff in the area. 9:59:59.000,9:59:59.000 Quick tour of our infrastructure. 9:59:59.000,9:59:59.000 Basically, all the way to the right is[br]our archive. 
9:59:59.000,9:59:59.000 Our archive consists of a huge graph[br]of all the metadata about 9:59:59.000,9:59:59.000 the files, the directories, the revisions,[br]the commits and the releases and 9:59:59.000,9:59:59.000 all the projects that are on top[br]of the graph. 9:59:59.000,9:59:59.000 We separate the file storage into another[br]object storage because of 9:59:59.000,9:59:59.000 the size discrepancy: we have lots and lots[br]of file contents that we need to store 9:59:59.000,9:59:59.000 so we do that outside the database[br]that is used to store the graph. 9:59:59.000,9:59:59.000 Basically, what we archive is a set of[br]software origins that are 9:59:59.000,9:59:59.000 Git repositories, Mercurial repositories,[br]etc. etc. 9:59:59.000,9:59:59.000 All those origins are loaded on a[br]regular schedule. 9:59:59.000,9:59:59.000 If there is a very active software origin,[br]we're gonna archive it more often 9:59:59.000,9:59:59.000 than stale things that don't get[br]a lot of updates
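[Editor's note] The idea of visiting active origins more often than stale ones can be illustrated with a small sketch. This is not Software Heritage's actual scheduler: the halve/double rule, the function name `next_interval`, and the bounds `MIN_INTERVAL` and `MAX_INTERVAL` are all assumptions made for illustration.

```python
from datetime import timedelta

# Hypothetical bounds on how often an origin may be visited.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    Assumed rule: if the last visit found new data, visit twice as
    often; if nothing changed, wait twice as long. The result is
    clamped to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    interval = timedelta(days=8)
    # Simulate an origin that is active for two visits, then goes quiet.
    for changed in (True, True, False, False):
        interval = next_interval(interval, changed)
        print(changed, interval)
```

With a rule of this kind, an origin that keeps producing new commits converges toward the minimum interval, while an origin that never changes drifts toward the maximum, which matches the behaviour described in the talk without claiming to reproduce it.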