1 99:59:59,999 --> 99:59:59,999 Hi, thank you. 2 99:59:59,999 --> 99:59:59,999 I'm Nicolas Dandrimont and I will indeed be talking to you about 3 99:59:59,999 --> 99:59:59,999 Software Heritage. 4 99:59:59,999 --> 99:59:59,999 I'm a software engineer for this project. 5 99:59:59,999 --> 99:59:59,999 I've been working on it for 3 years now. 6 99:59:59,999 --> 99:59:59,999 And we'll see what this thing is all about. 7 99:59:59,999 --> 99:59:59,999 [Mic not working] 8 99:59:59,999 --> 99:59:59,999 I guess the batteries are out. 9 99:59:59,999 --> 99:59:59,999 So, let's try that again. 10 99:59:59,999 --> 99:59:59,999 So, we all know, we've been doing free software for a while, 11 99:59:59,999 --> 99:59:59,999 that software source code is something special. 12 99:59:59,999 --> 99:59:59,999 Why is that? 13 99:59:59,999 --> 99:59:59,999 As Harold Abelson has said in SICP, his textbook on programming, 14 99:59:59,999 --> 99:59:59,999 programs are meant to be read by people and then incidentally for machines to execute. 15 99:59:59,999 --> 99:59:59,999 Basically, what software source code provides us is a way inside 16 99:59:59,999 --> 99:59:59,999 the mind of the designer of the program. 17 99:59:59,999 --> 99:59:59,999 For instance, you can have, you can get inside very crazy algorithms 18 99:59:59,999 --> 99:59:59,999 that can do very fast reverse square roots for 3D, that kind of stuff 19 99:59:59,999 --> 99:59:59,999 Like in the Quake 2 source code. 20 99:59:59,999 --> 99:59:59,999 You can also get inside the algorithms that are underpinning the internet, 21 99:59:59,999 --> 99:59:59,999 for instance seeing the net queue algorithm in the Linux kernel. 22 99:59:59,999 --> 99:59:59,999 What we are building as the free software community is the free software commons. 23 99:59:59,999 --> 99:59:59,999 Basically, the commons is all the cultural and social and natural resources 24 99:59:59,999 --> 99:59:59,999 that we share and that everyone has access to. 
25 99:59:59,999 --> 99:59:59,999 More specifically, the software commons is what we are building 26 99:59:59,999 --> 99:59:59,999 with software that is open and that is available for all to use, to modify, 27 99:59:59,999 --> 99:59:59,999 to execute, to distribute. 28 99:59:59,999 --> 99:59:59,999 We know that those commons are a really critical part of our commons. 29 99:59:59,999 --> 99:59:59,999 Who's taking care of it? 30 99:59:59,999 --> 99:59:59,999 The software is fragile. 31 99:59:59,999 --> 99:59:59,999 Like all digital information, you can lose software. 32 99:59:59,999 --> 99:59:59,999 People can decide to shut down hosting spaces because of business decisions. 33 99:59:59,999 --> 99:59:59,999 People can hack into software hosting platforms and remove the code maliciously 34 99:59:59,999 --> 99:59:59,999 or just inadvertently. 35 99:59:59,999 --> 99:59:59,999 And, of course, for the obsolete stuff, there's rot. 36 99:59:59,999 --> 99:59:59,999 If you don't care about the data, then it rots and it decays and you lose it. 37 99:59:59,999 --> 99:59:59,999 So, where is the archive we go to when something is lost, 38 99:59:59,999 --> 99:59:59,999 when GitLab goes away, when GitHub goes away? 39 99:59:59,999 --> 99:59:59,999 Where do we go? 40 99:59:59,999 --> 99:59:59,999 Finally, there's one last thing that we noticed, it's that 41 99:59:59,999 --> 99:59:59,999 there's a lot of teams that work on research on software 42 99:59:59,999 --> 99:59:59,999 and there's no real big infrastructure for research on code. 43 99:59:59,999 --> 99:59:59,999 There's tons of critical issues around code: safety, security, verification, proofs. 44 99:59:59,999 --> 99:59:59,999 Nobody's doing this at a very large scale. 45 99:59:59,999 --> 99:59:59,999 If you want to see the stars, you go to the Atacama Desert and 46 99:59:59,999 --> 99:59:59,999 you point a telescope at the sky. 47 99:59:59,999 --> 99:59:59,999 Where is the telescope for source code?
48 99:59:59,999 --> 99:59:59,999 That's what Software Heritage wants to be. 49 99:59:59,999 --> 99:59:59,999 What we do is we collect, we preserve and we share all the software 50 99:59:59,999 --> 99:59:59,999 that is publicly available. 51 99:59:59,999 --> 99:59:59,999 Why do we do that? We do that to preserve the past, to enhance the present 52 99:59:59,999 --> 99:59:59,999 and to prepare for the future. 53 99:59:59,999 --> 99:59:59,999 What we're building is a base infrastructure that can be used 54 99:59:59,999 --> 99:59:59,999 for cultural heritage, for industry, for research and for education purposes. 55 99:59:59,999 --> 99:59:59,999 How do we do it? We do it with an open approach. 56 99:59:59,999 --> 99:59:59,999 Every single line of code that we write is free software. 57 99:59:59,999 --> 99:59:59,999 We do it transparently, everything that we do, we do it in the open, 58 99:59:59,999 --> 99:59:59,999 be that on a mailing list or on our issue tracker. 59 99:59:59,999 --> 99:59:59,999 And we strive to do it for the very long haul, so we do it with replication in mind 60 99:59:59,999 --> 99:59:59,999 so that no single entity has full control over the data that we collect. 61 99:59:59,999 --> 99:59:59,999 And we do it in a non-profit fashion so that we avoid 62 99:59:59,999 --> 99:59:59,999 business-driven decisions impacting the project. 63 99:59:59,999 --> 99:59:59,999 So, what do we do concretely? 64 99:59:59,999 --> 99:59:59,999 We do archiving of version control systems. 65 99:59:59,999 --> 99:59:59,999 What does that mean? 66 99:59:59,999 --> 99:59:59,999 It means we archive file contents, so source code, files. 67 99:59:59,999 --> 99:59:59,999 We archive revisions, which means all the metadata of the history of the projects, 68 99:59:59,999 --> 99:59:59,999 we try to download it and we put it inside a common data model that is 69 99:59:59,999 --> 99:59:59,999 shared across all the archive. 
70 99:59:59,999 --> 99:59:59,999 We archive releases of the software, releases that have been tagged 71 99:59:59,999 --> 99:59:59,999 in a version control system as well as releases that we can find as tarballs 72 99:59:59,999 --> 99:59:59,999 because sometimes… boof, views of this source code differ. 73 99:59:59,999 --> 99:59:59,999 Of course, we archive where and when we've seen the data that we've collected. 74 99:59:59,999 --> 99:59:59,999 All of this, we put inside a canonical, VCS-agnostic data model. 75 99:59:59,999 --> 99:59:59,999 If you have a Debian package, with its history, if you have a Git repository, 76 99:59:59,999 --> 99:59:59,999 if you have a Subversion repository, if you have a Mercurial repository, 77 99:59:59,999 --> 99:59:59,999 it all looks the same and you can work on it with the same tools. 78 99:59:59,999 --> 99:59:59,999 What we don't do is archive what's around the software, for instance 79 99:59:59,999 --> 99:59:59,999 the bug tracking systems or the homepages or the wikis or the mailing lists. 80 99:59:59,999 --> 99:59:59,999 There are some projects that work in this space, for instance 81 99:59:59,999 --> 99:59:59,999 the Internet Archive does a lot of really good work around archiving the web. 82 99:59:59,999 --> 99:59:59,999 Our goal is not to replace them, but to work with them and be able to do 83 99:59:59,999 --> 99:59:59,999 linking across all the archives that exist. 84 99:59:59,999 --> 99:59:59,999 For instance, for the mailing lists, there's the Gmane project 85 99:59:59,999 --> 99:59:59,999 that does a lot of archiving of free software mailing lists. 86 99:59:59,999 --> 99:59:59,999 So our long term vision is to play a part in a semantic Wikipedia of software, 87 99:59:59,999 --> 99:59:59,999 a Wikidata of software where we can hyperlink all the archives that exist 88 99:59:59,999 --> 99:59:59,999 and do stuff in the area. 89 99:59:59,999 --> 99:59:59,999 Quick tour of our infrastructure.
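[Editor's illustration] The idea of a VCS-agnostic data model described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's actual schema: the class names (`Content`, `Revision`, `Release`, `Visit`), the `object_id` helper, the hashing scheme and the example URLs are all hypothetical. The point it illustrates is that once every object is identified by a hash of what it contains, the same history looks identical whether it was seen in a Git repository or a Subversion repository.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


def object_id(kind: str, payload: bytes) -> str:
    """Hypothetical identifier scheme: SHA-1 over a type tag plus the payload."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


@dataclass(frozen=True)
class Content:
    """A file content, identified only by its bytes."""
    data: bytes

    @property
    def id(self) -> str:
        return object_id("content", self.data)


@dataclass(frozen=True)
class Revision:
    """One point in a project's history (simplified: no directory tree)."""
    message: str
    content_ids: Tuple[str, ...]
    parent_ids: Tuple[str, ...] = ()

    @property
    def id(self) -> str:
        payload = "\n".join(
            [self.message, ",".join(self.content_ids), ",".join(self.parent_ids)]
        )
        return object_id("revision", payload.encode())


@dataclass(frozen=True)
class Release:
    """A named pointer to a revision (a tag or a tarball release)."""
    name: str
    revision_id: str


@dataclass(frozen=True)
class Visit:
    """Where and when a release was seen."""
    origin_url: str
    date: str  # ISO 8601 date string, kept as text for simplicity
    release: Release


# The same project seen at two different kinds of origin (made-up URLs).
readme = Content(b"hello\n")
rev = Revision("initial import", (readme.id,))
release = Release("v1.0", rev.id)
visits = [
    Visit("https://example.org/project.git", "2020-01-01", release),
    Visit("https://example.org/svn/project", "2020-01-02", release),
]
```

Both visits point at the same release and revision identifiers, so tools written against this model do not need to know which version control system the data came from.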
90 99:59:59,999 --> 99:59:59,999 Basically, all the way to the right is our archive. 91 99:59:59,999 --> 99:59:59,999 Our archive consists of a huge graph of all the metadata about 92 99:59:59,999 --> 99:59:59,999 the files, the directories, the revisions, the commits and the releases and 93 99:59:59,999 --> 99:59:59,999 all the projects that are on top of the graph. 94 99:59:59,999 --> 99:59:59,999 We separate the file storage into another object storage because of 95 99:59:59,999 --> 99:59:59,999 the size discrepancy: we have lots and lots of file contents that we need to store 96 99:59:59,999 --> 99:59:59,999 so we do that outside the database that is used to store the graph. 97 99:59:59,999 --> 99:59:59,999 Basically, what we archive is a set of software origins that are 98 99:59:59,999 --> 99:59:59,999 Git repositories, Mercurial repositories, etc. etc. 99 99:59:59,999 --> 99:59:59,999 All those origins are loaded on a regular schedule. 100 99:59:59,999 --> 99:59:59,999 If there is a very active software origin, we're gonna archive it more often 101 99:59:59,999 --> 99:59:59,999 than stale things that don't get a lot of updates
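[Editor's illustration] The scheduling idea above, visiting active origins more often than stale ones, could be sketched as follows. The talk does not say how the interval is computed, so the halve-or-double policy, the bounds `MIN_INTERVAL` and `MAX_INTERVAL`, and the names `OriginSchedule` and `record_visit` are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed bounds on how often an origin may be visited.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


@dataclass
class OriginSchedule:
    """Scheduling state for one software origin (hypothetical structure)."""
    url: str
    interval: timedelta
    next_visit: datetime


def record_visit(entry: OriginSchedule, visited_at: datetime, changed: bool) -> OriginSchedule:
    """Update the schedule after a visit.

    If the visit found new data, halve the interval so the origin is
    visited more often; otherwise double it. The result is clamped to
    [MIN_INTERVAL, MAX_INTERVAL].
    """
    if changed:
        interval = max(entry.interval / 2, MIN_INTERVAL)
    else:
        interval = min(entry.interval * 2, MAX_INTERVAL)
    entry.interval = interval
    entry.next_visit = visited_at + interval
    return entry


# Usage: an active origin drifts toward short intervals, a stale one toward long ones.
active = OriginSchedule("https://example.org/active.git", timedelta(days=4), datetime(2020, 1, 1))
stale = OriginSchedule("https://example.org/stale.git", timedelta(days=4), datetime(2020, 1, 1))
record_visit(active, datetime(2020, 1, 1), changed=True)
record_visit(stale, datetime(2020, 1, 1), changed=False)
```

A multiplicative policy like this one adapts quickly in both directions, and the clamps keep a very busy origin from being polled constantly and a dormant one from never being revisited.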