[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:05.48,0:00:07.10,Default,,0000,0000,0000,,Hi, thank you. Dialogue: 0,0:00:07.88,0:00:11.29,Default,,0000,0000,0000,,I'm Nicolas Dandrimont and I will indeed\Nbe talking to you about Dialogue: 0,0:00:11.29,0:00:12.55,Default,,0000,0000,0000,,Software Heritage. Dialogue: 0,0:00:12.88,0:00:15.23,Default,,0000,0000,0000,,I'm a software engineer for this project. Dialogue: 0,0:00:15.64,0:00:17.71,Default,,0000,0000,0000,,I've been working on it for 3 years now. Dialogue: 0,0:00:18.48,0:00:21.57,Default,,0000,0000,0000,,And we'll see what this thing is all about. Dialogue: 0,0:00:23.77,0:00:38.81,Default,,0000,0000,0000,,[Mic not working] Dialogue: 0,0:00:39.17,0:00:40.75,Default,,0000,0000,0000,,I guess the batteries are out. Dialogue: 0,0:00:49.95,0:00:51.72,Default,,0000,0000,0000,,So, let's try that again. Dialogue: 0,0:00:52.05,0:00:55.38,Default,,0000,0000,0000,,So, we all know, we've been doing\Nfree software for a while, Dialogue: 0,0:00:55.62,0:00:59.81,Default,,0000,0000,0000,,that software source code is something\Nspecial. Dialogue: 0,0:01:00.78,0:01:02.03,Default,,0000,0000,0000,,Why is that? Dialogue: 0,0:01:02.73,0:01:09.96,Default,,0000,0000,0000,,As Harold Abelson has said in SICP, his\Ntextbook on programming, Dialogue: 0,0:01:09.96,0:01:18.78,Default,,0000,0000,0000,,programs are meant to be read by people\Nand then incidentally for machines to execute. Dialogue: 0,0:01:20.21,0:01:25.66,Default,,0000,0000,0000,,Basically, what software source code\Nprovides us is a way inside Dialogue: 0,0:01:25.66,0:01:28.55,Default,,0000,0000,0000,,the mind of the designer of the program. 
Dialogue: 0,0:01:29.31,0:01:37.94,Default,,0000,0000,0000,,For instance, you can have,\Nyou can get inside very crazy algorithms Dialogue: 0,0:01:37.94,0:01:46.56,Default,,0000,0000,0000,,that can do very fast reverse square roots\Nfor 3D, that kind of stuff Dialogue: 0,0:01:47.21,0:01:49.52,Default,,0000,0000,0000,,Like in the Quake 2 source code. Dialogue: 0,0:01:49.86,0:01:54.61,Default,,0000,0000,0000,,You can also get inside the algorithms\Nthat are underpinning the internet, Dialogue: 0,0:01:54.61,0:01:59.76,Default,,0000,0000,0000,,for instance seeing the net queue\Nalgorithm in the Linux kernel. Dialogue: 0,0:02:03.63,0:02:10.22,Default,,0000,0000,0000,,What we are building as the free software\Ncommunity is the free software commons. Dialogue: 0,0:02:10.95,0:02:18.63,Default,,0000,0000,0000,,Basically, the commons is all the cultural\Nand social and natural resources Dialogue: 0,0:02:18.63,0:02:21.80,Default,,0000,0000,0000,,that we share and that everyone\Nhas access to. Dialogue: 0,0:02:22.41,0:02:25.74,Default,,0000,0000,0000,,More specifically, the software commons\Nis what we are building Dialogue: 0,0:02:25.74,0:02:31.88,Default,,0000,0000,0000,,with software that is open and that is\Navailable for all to use, to modify, Dialogue: 0,0:02:31.88,0:02:34.89,Default,,0000,0000,0000,,to execute, to distribute. Dialogue: 0,0:02:37.25,0:02:45.25,Default,,0000,0000,0000,,We know that those commons are a really\Ncritical part of our commons. Dialogue: 0,0:02:46.31,0:02:48.14,Default,,0000,0000,0000,,Who's taking care of it? Dialogue: 0,0:02:49.68,0:02:51.80,Default,,0000,0000,0000,,The software is fragile. Dialogue: 0,0:02:51.80,0:02:54.40,Default,,0000,0000,0000,,Like all digital information, you can lose\Nsoftware. Dialogue: 0,0:02:55.62,0:03:01.63,Default,,0000,0000,0000,,People can decide to shut down hosting\Nspaces because of business decisions. 
Dialogue: 0,0:03:02.94,0:03:08.91,Default,,0000,0000,0000,,People can hack into software hosting\Nplatforms and remove the code maliciously Dialogue: 0,0:03:08.91,0:03:10.86,Default,,0000,0000,0000,,or just inadvertently. Dialogue: 0,0:03:12.98,0:03:17.90,Default,,0000,0000,0000,,And, of course, for the obsolete stuff,\Nthere's rot. Dialogue: 0,0:03:18.47,0:03:24.77,Default,,0000,0000,0000,,If you don't care about the data, then\Nit rots and it decays and you lose it. Dialogue: 0,0:03:26.16,0:03:31.24,Default,,0000,0000,0000,,So, where is the archive we go to\Nwhen something is lost, Dialogue: 0,0:03:31.24,0:03:33.96,Default,,0000,0000,0000,,when GitLab goes away, when GitHub\Ngoes away? Dialogue: 0,0:03:34.41,0:03:35.71,Default,,0000,0000,0000,,Where do we go? Dialogue: 0,0:03:36.52,0:03:40.99,Default,,0000,0000,0000,,Finally, there's one last thing that we\Nnoticed, it's that Dialogue: 0,0:03:40.99,0:03:48.58,Default,,0000,0000,0000,,there's a lot of teams that work on\Nresearch on software Dialogue: 0,0:03:48.58,0:03:54.31,Default,,0000,0000,0000,,and there's no real big infrastructure\Nfor research on code. Dialogue: 0,0:03:56.51,0:04:02.13,Default,,0000,0000,0000,,There's tons of critical issues around\Ncode: safety, security, verification, proofs. Dialogue: 0,0:04:03.58,0:04:07.69,Default,,0000,0000,0000,,Nobody's doing this at a very large scale. Dialogue: 0,0:04:08.47,0:04:12.24,Default,,0000,0000,0000,,If you want to see the stars, you go to\Nthe Atacama Desert and Dialogue: 0,0:04:12.24,0:04:13.83,Default,,0000,0000,0000,,you point a telescope at the sky. Dialogue: 0,0:04:14.48,0:04:17.53,Default,,0000,0000,0000,,Where is the telescope for source code? Dialogue: 0,0:04:17.97,0:04:20.98,Default,,0000,0000,0000,,That's what Software Heritage wants to be.
Dialogue: 0,0:04:22.08,0:04:27.65,Default,,0000,0000,0000,,What we do is we collect, we preserve\Nand we share all the software Dialogue: 0,0:04:27.65,0:04:29.89,Default,,0000,0000,0000,,that is publicly available. Dialogue: 0,0:04:31.14,0:04:35.85,Default,,0000,0000,0000,,Why do we do that? We do that to\Npreserve the past, to enhance the present Dialogue: 0,0:04:35.85,0:04:37.85,Default,,0000,0000,0000,,and to prepare for the future. Dialogue: 0,0:04:39.72,0:04:44.59,Default,,0000,0000,0000,,What we're building is a base infrastructure\Nthat can be used Dialogue: 0,0:04:44.59,0:04:50.36,Default,,0000,0000,0000,,for cultural heritage, for industry,\Nfor research and for education purposes. Dialogue: 0,0:04:50.72,0:04:53.12,Default,,0000,0000,0000,,How do we do it? We do it with an open\Napproach. Dialogue: 0,0:04:53.41,0:04:56.61,Default,,0000,0000,0000,,Every single line of code that we write\Nis free software. Dialogue: 0,0:04:59.09,0:05:04.65,Default,,0000,0000,0000,,We do it transparently, everything that\Nwe do, we do it in the open, Dialogue: 0,0:05:04.65,0:05:09.12,Default,,0000,0000,0000,,be that on a mailing list or on\Nour issue tracker. Dialogue: 0,0:05:09.86,0:05:15.87,Default,,0000,0000,0000,,And we strive to do it for the very long\Nhaul, so we do it with replication in mind Dialogue: 0,0:05:15.87,0:05:21.81,Default,,0000,0000,0000,,so that no single entity has full control\Nover the data that we collect. Dialogue: 0,0:05:22.94,0:05:27.34,Default,,0000,0000,0000,,And we do it in a non-profit fashion\Nso that we avoid Dialogue: 0,0:05:27.34,0:05:32.79,Default,,0000,0000,0000,,business-driven decisions impacting\Nthe project. Dialogue: 0,0:05:35.47,0:05:38.68,Default,,0000,0000,0000,,So, what do we do concretely? Dialogue: 0,0:05:39.01,0:05:42.95,Default,,0000,0000,0000,,We do archiving of version control systems. Dialogue: 0,0:05:43.28,0:05:44.62,Default,,0000,0000,0000,,What does that mean? 
Dialogue: 0,0:05:45.76,0:05:49.41,Default,,0000,0000,0000,,It means we archive file contents, so\Nsource code, files. Dialogue: 0,0:05:49.41,0:05:55.67,Default,,0000,0000,0000,,We archive revisions, which means all the\Nmetadata of the history of the projects, Dialogue: 0,0:05:55.67,0:06:03.15,Default,,0000,0000,0000,,we try to download it and we put it inside\Na common data model that is Dialogue: 0,0:06:03.15,0:06:06.97,Default,,0000,0000,0000,,shared across all the archive. Dialogue: 0,0:06:08.56,0:06:13.59,Default,,0000,0000,0000,,We archive releases of the software,\Nreleases that have been tagged Dialogue: 0,0:06:13.59,0:06:18.34,Default,,0000,0000,0000,,in a version control system as well as\Nreleases that we can find as tarballs Dialogue: 0,0:06:18.34,0:06:23.94,Default,,0000,0000,0000,,because sometimes… boof, views of\Nthis source code differ. Dialogue: 0,0:06:27.81,0:06:32.37,Default,,0000,0000,0000,,Of course, we archive where and when\Nwe've seen the data that we've collected. Dialogue: 0,0:06:32.98,0:06:40.27,Default,,0000,0000,0000,,All of this, we put inside a canonical,\NVCS-agnostic, data model. Dialogue: 0,0:06:41.98,0:06:46.78,Default,,0000,0000,0000,,If you have a Debian package, with its\Nhistory, if you have a git repository, Dialogue: 0,0:06:46.78,0:06:50.20,Default,,0000,0000,0000,,if you have a subversion repository, if\Nyou have a mercurial repository, Dialogue: 0,0:06:50.20,0:06:53.86,Default,,0000,0000,0000,,it all looks the same and you can work\Non it with the same tools. Dialogue: 0,0:06:55.00,0:07:01.42,Default,,0000,0000,0000,,What we don't do is archive what's around\Nthe software, for instance Dialogue: 0,0:07:01.42,0:07:05.72,Default,,0000,0000,0000,,the bug tracking systems or the homepages\Nor the wikis or the mailing lists.
Dialogue: 0,0:07:06.70,0:07:10.56,Default,,0000,0000,0000,,There are some projects that work\Nin this space, for instance Dialogue: 0,0:07:10.56,0:07:15.80,Default,,0000,0000,0000,,the Internet Archive does a lot of\Nvery good work around archiving the web. Dialogue: 0,0:07:17.66,0:07:24.42,Default,,0000,0000,0000,,Our goal is not to replace them, but to\Nwork with them and be able to do Dialogue: 0,0:07:24.42,0:07:29.29,Default,,0000,0000,0000,,linking across all the archives that exist. Dialogue: 0,0:07:29.70,0:07:35.02,Default,,0000,0000,0000,,For instance, for the mailing lists,\Nthere's the Gmane project Dialogue: 0,0:07:35.02,0:07:39.00,Default,,0000,0000,0000,,that does a lot of archiving of free\Nsoftware mailing lists. Dialogue: 0,0:07:39.73,0:07:47.74,Default,,0000,0000,0000,,So our long term vision is to play a part\Nin a semantic Wikipedia of software, Dialogue: 0,0:07:47.74,0:07:53.92,Default,,0000,0000,0000,,a Wikidata of software where we can\Nhyperlink all the archives that exist Dialogue: 0,0:07:53.92,0:07:56.85,Default,,0000,0000,0000,,and do stuff in the area. Dialogue: 0,0:08:00.59,0:08:02.59,Default,,0000,0000,0000,,Quick tour of our infrastructure. Dialogue: 0,0:08:02.83,0:08:10.22,Default,,0000,0000,0000,,Basically, all the way to the right is\Nour archive. Dialogue: 0,0:08:11.45,0:08:16.85,Default,,0000,0000,0000,,Our archive consists of a huge graph\Nof all the metadata about Dialogue: 0,0:08:16.85,0:08:24.62,Default,,0000,0000,0000,,the files, the directories, the revisions,\Nthe commits and the releases and Dialogue: 0,0:08:24.62,0:08:27.78,Default,,0000,0000,0000,,all the projects that are on top\Nof the graph.
Dialogue: 0,0:08:29.13,0:08:33.60,Default,,0000,0000,0000,,We separate the file storage into another\Nobject storage because of Dialogue: 0,0:08:33.60,0:08:41.65,Default,,0000,0000,0000,,the size discrepancy: we have lots and lots\Nof file contents that we need to store Dialogue: 0,0:08:41.65,0:08:46.32,Default,,0000,0000,0000,,so we do that outside of the database\Nthat is used to store the graph. Dialogue: 0,0:08:49.50,0:08:54.16,Default,,0000,0000,0000,,Basically, what we archive is a set of\Nsoftware origins that are Dialogue: 0,0:08:54.16,0:08:58.83,Default,,0000,0000,0000,,git repositories, mercurial repositories,\Netc. etc. Dialogue: 0,0:08:59.69,0:09:05.25,Default,,0000,0000,0000,,All those origins are loaded on a\Nregular schedule. Dialogue: 0,0:09:06.89,0:09:13.47,Default,,0000,0000,0000,,If there is a very active software origin,\Nwe're gonna archive it more often Dialogue: 0,0:09:13.47,0:09:17.75,Default,,0000,0000,0000,,than stale things that don't get\Na lot of updates. Dialogue: 0,0:09:19.66,0:09:24.42,Default,,0000,0000,0000,,How do we get the list of software\Norigins that we archive? Dialogue: 0,0:09:24.82,0:09:30.68,Default,,0000,0000,0000,,We have a bunch of listers that can\Nscroll through the list of repositories, Dialogue: 0,0:09:30.68,0:09:33.77,Default,,0000,0000,0000,,for instance on GitHub or other\Nhosting platforms. Dialogue: 0,0:09:34.94,0:09:42.18,Default,,0000,0000,0000,,We have code that can read Debian archive\Nmetadata to make a list of the packages Dialogue: 0,0:09:42.18,0:09:49.41,Default,,0000,0000,0000,,that are inside this archive and can be\Narchived, etc. Dialogue: 0,0:09:50.39,0:09:52.61,Default,,0000,0000,0000,,All of this is done on a regular basis. Dialogue: 0,0:09:53.52,0:09:57.45,Default,,0000,0000,0000,,We are currently working on some kind\Nof push mechanism so that Dialogue: 0,0:09:57.45,0:10:01.48,Default,,0000,0000,0000,,people or other systems can notify us\Nof updates.
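The idea of visiting active origins more often than stale ones can be sketched as an adaptive interval. This is only an illustration under assumed rules: the halving/doubling policy, the bounds and the name `next_visit_interval` are hypothetical and are not taken from the Software Heritage scheduler.

```python
from datetime import timedelta

# Hypothetical bounds, chosen only for this illustration.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_visit_interval(current: timedelta, changed: bool) -> timedelta:
    """Shorten the interval when the last visit found changes, lengthen it otherwise."""
    candidate = current / 2 if changed else current * 2
    # Clamp so that no origin is polled too aggressively or forgotten entirely.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))


# An origin that keeps changing drifts towards MIN_INTERVAL,
# while an origin that never changes drifts towards MAX_INTERVAL.
active = next_visit_interval(timedelta(days=4), changed=True)
stale = next_visit_interval(timedelta(days=4), changed=False)
```

A push mechanism like the one mentioned in the talk would complement such a policy by letting an external notification trigger a visit before the interval expires.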
Dialogue: 0,0:10:02.99,0:10:09.67,Default,,0000,0000,0000,,Our goal is not to do real time archiving,\Nwe're really in it for the long run Dialogue: 0,0:10:09.67,0:10:16.01,Default,,0000,0000,0000,,but we still want to be able to prioritize\Nstuff that people tell us is Dialogue: 0,0:10:16.01,0:10:17.88,Default,,0000,0000,0000,,important to archive. Dialogue: 0,0:10:19.95,0:10:23.93,Default,,0000,0000,0000,,The Internet Archive has a "save now"\Nbutton and we want to implement Dialogue: 0,0:10:23.93,0:10:26.24,Default,,0000,0000,0000,,something along those lines as well, Dialogue: 0,0:10:26.24,0:10:31.54,Default,,0000,0000,0000,,so if we know that some software project\Nis in danger for a reason or another, Dialogue: 0,0:10:31.54,0:10:34.14,Default,,0000,0000,0000,,then we can prioritize archiving it. Dialogue: 0,0:10:35.81,0:10:39.92,Default,,0000,0000,0000,,So this is the basic structure of a revision\Nin the Software Heritage archive. Dialogue: 0,0:10:41.99,0:10:45.07,Default,,0000,0000,0000,,You'll see that it's very similar to\Na git commit.
Dialogue: 0,0:10:47.83,0:10:53.72,Default,,0000,0000,0000,,The format of the metadata is pretty much\Nwhat you'll find in a git commit Dialogue: 0,0:10:53.72,0:10:59.01,Default,,0000,0000,0000,,with some extensions that you don't\Nsee here because this is from a git commit. Dialogue: 0,0:11:00.71,0:11:09.62,Default,,0000,0000,0000,,So basically what we do is we take the\Nidentifier of the directory Dialogue: 0,0:11:09.62,0:11:16.20,Default,,0000,0000,0000,,that the revision points to, we take the\Nidentifier of the parent of the revision Dialogue: 0,0:11:16.20,0:11:18.72,Default,,0000,0000,0000,,so we can keep track of the history Dialogue: 0,0:11:18.72,0:11:24.82,Default,,0000,0000,0000,,and then we add some metadata,\Nauthorship and committership information Dialogue: 0,0:11:24.82,0:11:28.88,Default,,0000,0000,0000,,and the revision message and then we take\Na hash of this, Dialogue: 0,0:11:28.88,0:11:37.05,Default,,0000,0000,0000,,it makes an identifier that's probably\Nunique, very very probably unique. Dialogue: 0,0:11:40.26,0:11:46.92,Default,,0000,0000,0000,,Using those identifiers, we can retrace\Nall the origins, all the history of Dialogue: 0,0:11:46.92,0:11:51.75,Default,,0000,0000,0000,,development of the project and we can\Ndeduplicate across all the archive. Dialogue: 0,0:11:52.49,0:11:58.67,Default,,0000,0000,0000,,All the identifiers are intrinsic, which\Nmeans that we compute them Dialogue: 0,0:11:58.67,0:12:03.92,Default,,0000,0000,0000,,from the contents of the things that\Nwe are archiving, which means that Dialogue: 0,0:12:03.92,0:12:11.44,Default,,0000,0000,0000,,we can deduplicate very efficiently\Nacross all the data that we archive. Dialogue: 0,0:12:12.25,0:12:14.28,Default,,0000,0000,0000,,How much data do we archive? Dialogue: 0,0:12:17.13,0:12:18.22,Default,,0000,0000,0000,,A bit. Dialogue: 0,0:12:18.59,0:12:23.83,Default,,0000,0000,0000,,So, we have passed the billion revision\Nmark a few weeks ago.
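The intrinsic identifiers described above can be illustrated with git's own object hashing, which the talk says the revision format closely resembles. This is a minimal sketch, assuming the git convention of a SHA-1 over a typed header followed by the payload. The helper name `git_object_id` and the sample author are made up for the illustration, and the extensions the speaker mentions are not modelled.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash an object the way git does: '<type> <size>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# A file content is identified by its bytes alone, so two identical files
# found in two different origins get the same identifier.
empty_blob_id = git_object_id("blob", b"")

# A revision is identified by the directory it points to, its parent(s),
# the authorship and committership information, and the message.
revision_body = (
    f"tree {empty_blob_id}\n"  # placeholder: a real revision points to a directory id
    "author Jane Doe <jane@example.org> 1500000000 +0000\n"
    "committer Jane Doe <jane@example.org> 1500000000 +0000\n"
    "\n"
    "Initial revision\n"
).encode("utf-8")
revision_id = git_object_id("commit", revision_body)
```

Because the identifier depends only on the content, changing any byte of the metadata or of the tree it references yields a different revision identifier.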
Dialogue: 0,0:12:25.30,0:12:29.97,Default,,0000,0000,0000,,This graph is a bit old, but anyway,\Nyou have a live graph on our website. Dialogue: 0,0:12:31.47,0:12:35.86,Default,,0000,0000,0000,,That's more than 4.5 billion unique\Nsource code files. Dialogue: 0,0:12:38.26,0:12:45.17,Default,,0000,0000,0000,,We don't actually discriminate between\Nwhat we would consider source code Dialogue: 0,0:12:45.17,0:12:48.18,Default,,0000,0000,0000,,and what upstream developers consider\Nas source code, Dialogue: 0,0:12:48.18,0:12:52.33,Default,,0000,0000,0000,,so everything that's in a git repository,\Nwe consider as source code Dialogue: 0,0:12:52.33,0:12:54.88,Default,,0000,0000,0000,,if it's below a size threshold. Dialogue: 0,0:12:55.98,0:13:00.24,Default,,0000,0000,0000,,A billion revisions across 80 million\Nprojects. Dialogue: 0,0:13:01.39,0:13:02.93,Default,,0000,0000,0000,,What do we archive? Dialogue: 0,0:13:02.93,0:13:04.72,Default,,0000,0000,0000,,We archive GitHub, we archive Debian. Dialogue: 0,0:13:06.68,0:13:11.91,Default,,0000,0000,0000,,So, Debian we run the archival process\Nevery day, every day we get the new packages Dialogue: 0,0:13:11.91,0:13:13.74,Default,,0000,0000,0000,,that have been uploaded in the archive. Dialogue: 0,0:13:14.31,0:13:21.45,Default,,0000,0000,0000,,GitHub, we try to keep up, we are currently\Nworking on some performance improvements, Dialogue: 0,0:13:21.45,0:13:25.32,Default,,0000,0000,0000,,some scalability improvements to make sure\Nthat we can keep up Dialogue: 0,0:13:25.32,0:13:27.48,Default,,0000,0000,0000,,with the development on GitHub.
Dialogue: 0,0:13:29.23,0:13:40.12,Default,,0000,0000,0000,,We have archived as a one-off thing the\Nformer contents of Gitorious and Google Code Dialogue: 0,0:13:40.51,0:13:46.73,Default,,0000,0000,0000,,which are two prominent code hosting\Nspaces that closed recently Dialogue: 0,0:13:47.74,0:13:53.99,Default,,0000,0000,0000,,and we've been working on archiving\Nthe contents of Bitbucket Dialogue: 0,0:13:53.99,0:13:59.94,Default,,0000,0000,0000,,which is kind of a challenge because\Nthe API is a bit buggy and Dialogue: 0,0:13:59.94,0:14:03.40,Default,,0000,0000,0000,,Atlassian isn't too interested\Nin fixing it. Dialogue: 0,0:14:06.08,0:14:16.65,Default,,0000,0000,0000,,In concrete storage terms, we have 175TB\Nof blobs, so the files take 175TB Dialogue: 0,0:14:16.65,0:14:19.90,Default,,0000,0000,0000,,and a kind of big database, 6TB. Dialogue: 0,0:14:21.16,0:14:28.32,Default,,0000,0000,0000,,The database only contains the graph of\Nthe metadata for the archive Dialogue: 0,0:14:28.32,0:14:34.70,Default,,0000,0000,0000,,which is basically an 8 billion nodes and\N70 billion edges graph. Dialogue: 0,0:14:35.39,0:14:37.46,Default,,0000,0000,0000,,And of course it's growing daily. Dialogue: 0,0:14:37.95,0:14:42.82,Default,,0000,0000,0000,,We are pretty sure this is the richest public\Nsource code archive that's available now Dialogue: 0,0:14:43.02,0:14:44.76,Default,,0000,0000,0000,,and it keeps growing. Dialogue: 0,0:14:46.47,0:14:48.99,Default,,0000,0000,0000,,So how do we actually… Dialogue: 0,0:14:49.48,0:14:53.29,Default,,0000,0000,0000,,What kind of stack do we use to store\Nall this? Dialogue: 0,0:14:54.76,0:14:56.56,Default,,0000,0000,0000,,We use Debian, of course. Dialogue: 0,0:14:57.68,0:15:02.93,Default,,0000,0000,0000,,All our deployment recipes are in Puppet\Nin public repositories. Dialogue: 0,0:15:04.08,0:15:07.73,Default,,0000,0000,0000,,We've started using Ceph\Nfor the blob storage.
Dialogue: 0,0:15:09.40,0:15:14.44,Default,,0000,0000,0000,,We use PostgreSQL for the metadata storage\Nwith some of the standard tools that Dialogue: 0,0:15:15.00,0:15:18.17,Default,,0000,0000,0000,,live around PostgreSQL for backups\Nand replication. Dialogue: 0,0:15:20.04,0:15:27.77,Default,,0000,0000,0000,,We use a standard Python stack for\Nscheduling of jobs Dialogue: 0,0:15:27.77,0:15:35.36,Default,,0000,0000,0000,,and for web interface stuff, basically\Npsycopg2 for the low level stuff, Dialogue: 0,0:15:35.36,0:15:38.17,Default,,0000,0000,0000,,Django for the web stuff Dialogue: 0,0:15:38.17,0:15:44.35,Default,,0000,0000,0000,,and Celery for the scheduling of jobs. Dialogue: 0,0:15:45.48,0:15:50.45,Default,,0000,0000,0000,,In house, we've written an ad hoc\Nobject storage system which has Dialogue: 0,0:15:50.45,0:15:53.35,Default,,0000,0000,0000,,a bunch of backends that you can use. Dialogue: 0,0:15:53.82,0:16:03.05,Default,,0000,0000,0000,,Basically, we are agnostic between a UNIX\Nfilesystem, Azure, Ceph, or tons of… Dialogue: 0,0:16:03.42,0:16:07.12,Default,,0000,0000,0000,,It's a really simple object storage system\Nwhere you can just put an object, Dialogue: 0,0:16:07.12,0:16:10.36,Default,,0000,0000,0000,,get an object, put a bunch of objects,\Nget a bunch of objects. Dialogue: 0,0:16:11.95,0:16:17.52,Default,,0000,0000,0000,,We've implemented removal but we don't\Nreally use it yet. Dialogue: 0,0:16:20.20,0:16:24.96,Default,,0000,0000,0000,,All the data model implementation,\Nall the listers, the loaders, the schedulers Dialogue: 0,0:16:24.96,0:16:29.18,Default,,0000,0000,0000,,everything has been written by us,\Nit's a pile of Python code.
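The object storage interface described above (put an object, get an object, put a bunch, get a bunch) can be sketched with an in-memory backend keyed by a hash of the content. This is an assumption-laden illustration: the class and method names are hypothetical, a plain SHA-1 stands in for whatever identifiers the real system uses, and a dictionary stands in for backends such as a filesystem, Azure or Ceph.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjectStorage:
    """Toy content-addressed store; a real backend would persist the bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        # The key is derived from the content, so storing the same bytes
        # twice keeps a single copy (deduplication).
        obj_id = hashlib.sha1(content).hexdigest()
        self._objects.setdefault(obj_id, content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def put_batch(self, contents: Iterable[bytes]) -> List[str]:
        return [self.put(content) for content in contents]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._objects)


storage = InMemoryObjectStorage()
ids = storage.put_batch([b"int main() {}\n", b"# README\n", b"int main() {}\n"])
```

Keeping the interface this small is what makes it plausible to swap backends: each one only has to map an identifier to bytes and back.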
Dialogue: 0,0:16:31.86,0:16:35.81,Default,,0000,0000,0000,,So, basically 20 Python packages and\Naround 30 Puppet modules Dialogue: 0,0:16:35.81,0:16:41.70,Default,,0000,0000,0000,,to deploy all that and we've done everything\Nunder a copyleft license, Dialogue: 0,0:16:41.70,0:16:46.08,Default,,0000,0000,0000,,GPLv3 for the backend and AGPLv3\Nfor the frontend. Dialogue: 0,0:16:47.06,0:16:56.89,Default,,0000,0000,0000,,Even if people try and make their own\NSoftware Heritage using our code, Dialogue: 0,0:16:56.89,0:16:59.66,Default,,0000,0000,0000,,they have to publish their changes. Dialogue: 0,0:17:01.86,0:17:10.76,Default,,0000,0000,0000,,Hardware-wise, we run for now everything\Non a few hypervisors in house and Dialogue: 0,0:17:10.76,0:17:18.57,Default,,0000,0000,0000,,our main storage is currently still\Non a very high density, very slow, Dialogue: 0,0:17:18.57,0:17:27.95,Default,,0000,0000,0000,,very bulky storage array, but we've\Nstarted to migrate all this thing Dialogue: 0,0:17:27.95,0:17:33.00,Default,,0000,0000,0000,,into a Ceph storage cluster which\Nwe're gonna grow as we need Dialogue: 0,0:17:33.00,0:17:35.07,Default,,0000,0000,0000,,in the next few months. Dialogue: 0,0:17:36.25,0:17:43.68,Default,,0000,0000,0000,,We've also been granted by Microsoft\Nsponsorship, ??? sponsorship Dialogue: 0,0:17:44.08,0:17:45.83,Default,,0000,0000,0000,,for their cloud services. Dialogue: 0,0:17:46.44,0:17:51.76,Default,,0000,0000,0000,,We've started putting mirrors of everything\Nin their infrastructure as well Dialogue: 0,0:17:51.76,0:17:59.57,Default,,0000,0000,0000,,which means full object storage mirror,\Nso 170TB of stuff mirrored on Azure Dialogue: 0,0:17:59.57,0:18:02.49,Default,,0000,0000,0000,,as well as a database mirror for the graph. Dialogue: 0,0:18:03.80,0:18:08.96,Default,,0000,0000,0000,,And we're also doing all the content\Nindexing and all the things that need Dialogue: 0,0:18:08.96,0:18:11.96,Default,,0000,0000,0000,,scalability on Azure now.
Dialogue: 0,0:18:16.64,0:18:22.41,Default,,0000,0000,0000,,Finally, at the University of Bologna,\Nwe have a backend storage for the download Dialogue: 0,0:18:22.41,0:18:29.41,Default,,0000,0000,0000,,so currently our main storage is\Nquite slow so if you want to download Dialogue: 0,0:18:29.41,0:18:34.86,Default,,0000,0000,0000,,a bundle of things that we've archived,\Nthen we actually keep a cache of Dialogue: 0,0:18:34.86,0:18:40.35,Default,,0000,0000,0000,,what we've done so that it doesn't take\Na million years to download stuff. Dialogue: 0,0:18:41.81,0:18:46.23,Default,,0000,0000,0000,,We do our development in a classic free\Nand open source software way, Dialogue: 0,0:18:46.23,0:18:52.06,Default,,0000,0000,0000,,so we talk on our mailing list, on IRC,\Non a forge. Dialogue: 0,0:18:52.50,0:18:56.64,Default,,0000,0000,0000,,Everything is in English, everything is\Npublic, there is more information Dialogue: 0,0:18:56.64,0:19:00.75,Default,,0000,0000,0000,,on our website if you want to actually\Nhave a look and see what we do. Dialogue: 0,0:19:04.28,0:19:09.60,Default,,0000,0000,0000,,So, all that is very interesting but how\Ndo we actually look into it? Dialogue: 0,0:19:11.67,0:19:16.05,Default,,0000,0000,0000,,One of the ways that you can browse,\Nthat you can use the archive Dialogue: 0,0:19:16.05,0:19:18.62,Default,,0000,0000,0000,,is using a REST API. Dialogue: 0,0:19:19.19,0:19:25.24,Default,,0000,0000,0000,,Basically, this API allows you to do\Npointwise browsing of the archive Dialogue: 0,0:19:25.24,0:19:29.03,Default,,0000,0000,0000,,so you can go and follow the links\Nin a graph, Dialogue: 0,0:19:29.03,0:19:37.76,Default,,0000,0000,0000,,which is very slow but gives you pretty\Nmuch full access to the data. Dialogue: 0,0:19:38.45,0:19:44.78,Default,,0000,0000,0000,,There's an index for the API that you can\Nlook at, but that's not really convenient, Dialogue: 0,0:19:44.78,0:19:47.79,Default,,0000,0000,0000,,so we also have a web user interface.
Dialogue: 0,0:19:48.83,0:19:55.77,Default,,0000,0000,0000,,It's in preview right now, we're gonna do\Na full launch in the month of June. Dialogue: 0,0:19:57.77,0:20:01.10,Default,,0000,0000,0000,,If you go to \Nhttps://archive.softwareheritage.org/browse/ Dialogue: 0,0:20:01.59,0:20:09.55,Default,,0000,0000,0000,,with the given credentials, you can\Nhave a look and see what's going on. Dialogue: 0,0:20:10.17,0:20:18.55,Default,,0000,0000,0000,,Basically, we have a web interface that\Nallows you to look at Dialogue: 0,0:20:18.55,0:20:26.07,Default,,0000,0000,0000,,what origins we have downloaded, when\Nwe have downloaded the origins Dialogue: 0,0:20:26.07,0:20:34.93,Default,,0000,0000,0000,,with a kind of graph view of how often\Nwe visited the origins Dialogue: 0,0:20:34.93,0:20:37.94,Default,,0000,0000,0000,,and a calendar view of when we have\Nvisited the origins. Dialogue: 0,0:20:38.79,0:20:43.75,Default,,0000,0000,0000,,And then, inside the visits, you can\Nactually browse the contents Dialogue: 0,0:20:43.75,0:20:45.05,Default,,0000,0000,0000,,that we've archived. Dialogue: 0,0:20:45.29,0:20:49.88,Default,,0000,0000,0000,,So, for instance, this is the Python\Nrepository as of May 2017 Dialogue: 0,0:20:49.88,0:20:54.96,Default,,0000,0000,0000,,and you can have the list of files,\Nthen drill down, Dialogue: 0,0:20:54.96,0:20:58.16,Default,,0000,0000,0000,,it should be pretty intuitive. Dialogue: 0,0:20:59.16,0:21:02.59,Default,,0000,0000,0000,,If you look at the history of a project,\Nyou can see the differences Dialogue: 0,0:21:02.59,0:21:04.70,Default,,0000,0000,0000,,between two revisions of a project. Dialogue: 0,0:21:06.89,0:21:12.26,Default,,0000,0000,0000,,Oh no, that's the syntax highlighting,\Nbut anyway the diffs arrive right after. Dialogue: 0,0:21:13.64,0:21:16.33,Default,,0000,0000,0000,,So, yeah, pretty cool stuff. Dialogue: 0,0:21:16.90,0:21:21.54,Default,,0000,0000,0000,,I should be able to do a demo as well,\Nit should work. 
Dialogue: 0,0:21:31.11,0:21:32.43,Default,,0000,0000,0000,,I'm gonna zoom in. Dialogue: 0,0:21:44.80,0:21:49.47,Default,,0000,0000,0000,,So this is the main archive, you can see\Nsome statistics about the objects Dialogue: 0,0:21:49.47,0:21:50.93,Default,,0000,0000,0000,,that we've downloaded. Dialogue: 0,0:21:51.14,0:21:56.56,Default,,0000,0000,0000,,When you zoom in, you get some kind of\Noverflows, because… Dialogue: 0,0:21:56.92,0:21:58.87,Default,,0000,0000,0000,,Yeah, why would you do that. Dialogue: 0,0:21:59.24,0:22:04.08,Default,,0000,0000,0000,,If you want to browse, we can try to find\Nan origin. Dialogue: 0,0:22:07.41,0:22:08.83,Default,,0000,0000,0000,,"glibc". Dialogue: 0,0:22:12.73,0:22:17.04,Default,,0000,0000,0000,,So there's lots and lots of, like, random\NGitHub forks of things… Dialogue: 0,0:22:18.58,0:22:25.78,Default,,0000,0000,0000,,We don't discriminate and we don't really\Nfilter what we download. Dialogue: 0,0:22:26.56,0:22:34.39,Default,,0000,0000,0000,,We are looking into doing some relevance\Nkind of sorting of the results, here. Dialogue: 0,0:22:36.43,0:22:37.69,Default,,0000,0000,0000,,Next. Dialogue: 0,0:22:40.38,0:22:42.08,Default,,0000,0000,0000,,Xilinx, why not. Dialogue: 0,0:22:43.22,0:22:48.75,Default,,0000,0000,0000,,So, this has been downloaded for the last\Ntime on August 3rd 2016, Dialogue: 0,0:22:48.75,0:22:50.40,Default,,0000,0000,0000,,so it's probably a dead repository, Dialogue: 0,0:22:52.72,0:22:55.00,Default,,0000,0000,0000,,but yeah, you can see a bunch of source\Ncode, Dialogue: 0,0:22:56.67,0:23:00.54,Default,,0000,0000,0000,,you can read the README of the glibc. Dialogue: 0,0:23:04.44,0:23:07.65,Default,,0000,0000,0000,,If we go back to a more interesting origin Dialogue: 0,0:23:07.65,0:23:09.64,Default,,0000,0000,0000,,here's the repository for git.
Dialogue: 0,0:23:10.58,0:23:17.15,Default,,0000,0000,0000,,I've selected voluntarily an old visit\Nof the repo so that we can see Dialogue: 0,0:23:17.15,0:23:18.86,Default,,0000,0000,0000,,what was going on then. Dialogue: 0,0:23:22.76,0:23:31.46,Default,,0000,0000,0000,,If I look at the calendar view, you can see\Nthat we've had some issues actually Dialogue: 0,0:23:31.46,0:23:33.41,Default,,0000,0000,0000,,updating this, but anyway. Dialogue: 0,0:23:37.84,0:23:46.08,Default,,0000,0000,0000,,If I look at the last visit, then we can\Nactually browse the contents, Dialogue: 0,0:23:46.74,0:23:49.34,Default,,0000,0000,0000,,you can get syntax highlighting as well. Dialogue: 0,0:23:49.90,0:23:53.72,Default,,0000,0000,0000,,This is a big big file with lots of comments Dialogue: 0,0:24:02.09,0:24:04.97,Default,,0000,0000,0000,,Let's see the actual source code… Dialogue: 0,0:24:07.04,0:24:10.17,Default,,0000,0000,0000,,Anyway, so, that's the browsing interface. Dialogue: 0,0:24:10.45,0:24:15.13,Default,,0000,0000,0000,,We can also now get back what we've\Narchived and download it, Dialogue: 0,0:24:15.13,0:24:18.70,Default,,0000,0000,0000,,which is kind of something that you might\Nwant to do Dialogue: 0,0:24:18.70,0:24:23.53,Default,,0000,0000,0000,,if a repository is lost, you can actually\Ndownload it Dialogue: 0,0:24:23.53,0:24:25.56,Default,,0000,0000,0000,,and get the source code back again. Dialogue: 0,0:24:26.94,0:24:28.46,Default,,0000,0000,0000,,How we do that. Dialogue: 0,0:24:28.73,0:24:35.48,Default,,0000,0000,0000,,If you go on the top right of this browsing\Ninterface, you have actions and download Dialogue: 0,0:24:35.48,0:24:40.28,Default,,0000,0000,0000,,and you can download the directory that\Nyou are currently looking at. 
Dialogue: 0,0:24:41.29,0:24:46.01,Default,,0000,0000,0000,,It's an asynchronous process, which means\Nthat if there is a lot of load, Dialogue: 0,0:24:46.01,0:24:51.46,Default,,0000,0000,0000,,then it's gonna take some time to get\Nactually, to be able to download the content. Dialogue: 0,0:24:51.95,0:24:56.30,Default,,0000,0000,0000,,So you can put in your email address so we\Ncan notify you when the download is ready. Dialogue: 0,0:24:56.99,0:25:03.34,Default,,0000,0000,0000,,I'm gonna try my luck and say just "ok"\Nand it's gonna appear at some point Dialogue: 0,0:25:03.34,0:25:07.61,Default,,0000,0000,0000,,in the list of things that I've requested. Dialogue: 0,0:25:11.02,0:25:20.17,Default,,0000,0000,0000,,I've already requested some things that\Nwe can actually get and open as a tarball. Dialogue: 0,0:25:31.46,0:25:34.76,Default,,0000,0000,0000,,Yeah, I think that's the thing that I was\Nactually looking at, Dialogue: 0,0:25:35.30,0:25:38.44,Default,,0000,0000,0000,,which is this revision of the git\Nsource code Dialogue: 0,0:25:39.65,0:25:42.25,Default,,0000,0000,0000,,and then I can open it. Dialogue: 0,0:25:43.64,0:25:46.57,Default,,0000,0000,0000,,Yay, emacs, that's what you want. Dialogue: 0,0:25:46.93,0:25:48.31,Default,,0000,0000,0000,,Yay, source code. Dialogue: 0,0:25:51.16,0:25:53.56,Default,,0000,0000,0000,,This seems to work. Dialogue: 0,0:25:57.92,0:26:02.67,Default,,0000,0000,0000,,And then, of course, if you want to\Nactually script what you're doing, Dialogue: 0,0:26:02.67,0:26:07.14,Default,,0000,0000,0000,,there's an API that allows you to do\Nthe downloads as well, so you can. Dialogue: 0,0:26:10.92,0:26:18.39,Default,,0000,0000,0000,,The source code is deduplicated a lot,\Nwhich means that for one single repository Dialogue: 0,0:26:18.39,0:26:24.20,Default,,0000,0000,0000,,you get tons of files that we have to\Ncollect if you want to actually download Dialogue: 0,0:26:24.20,0:26:26.23,Default,,0000,0000,0000,,an archive of a directory.
Dialogue: 0,0:26:29.61,0:26:37.70,Default,,0000,0000,0000,,It takes a while but we have an asynchronous\NAPI so you can POST Dialogue: 0,0:26:37.70,0:26:43.56,Default,,0000,0000,0000,,the identifier of a revision to this URL\Nand then get status updates Dialogue: 0,0:26:43.56,0:26:49.49,Default,,0000,0000,0000,,and at some point, it will tell you that\Nthe… here Dialogue: 0,0:26:49.85,0:26:52.70,Default,,0000,0000,0000,,The status will tell you that the object\Nis available. Dialogue: 0,0:26:52.98,0:26:59.13,Default,,0000,0000,0000,,You can download it and you can even\Ndownload the full history of a project Dialogue: 0,0:26:59.13,0:27:03.56,Default,,0000,0000,0000,,and get that as a git-fast-export archive\Nthat you can reimport into Dialogue: 0,0:27:03.56,0:27:05.84,Default,,0000,0000,0000,,a new git repository. Dialogue: 0,0:27:06.24,0:27:13.18,Default,,0000,0000,0000,,So any kind of VCS that we've imported,\Nyou can export as a git repository Dialogue: 0,0:27:13.18,0:27:17.73,Default,,0000,0000,0000,,and reimport on your machine. Dialogue: 0,0:27:19.24,0:27:22.85,Default,,0000,0000,0000,,How to get involved in the project? Dialogue: 0,0:27:24.03,0:27:29.03,Default,,0000,0000,0000,,We have a lot of features that we're\Ninterested in, lots of them are now Dialogue: 0,0:27:29.03,0:27:31.39,Default,,0000,0000,0000,,in early access or have been done. Dialogue: 0,0:27:31.88,0:27:35.62,Default,,0000,0000,0000,,There's some stuff that we would like\Nhelp with. Dialogue: 0,0:27:38.23,0:27:40.26,Default,,0000,0000,0000,,This is some stuff that we're working on: Dialogue: 0,0:27:40.55,0:27:42.95,Default,,0000,0000,0000,,provenance information, you have a content Dialogue: 0,0:27:43.07,0:27:45.42,Default,,0000,0000,0000,,you want to know which repository\Nit comes from, Dialogue: 0,0:27:45.87,0:27:47.58,Default,,0000,0000,0000,,that's something we're working on.
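The asynchronous flow described in the talk (POST a revision identifier, then poll for status until the object is available) can be sketched roughly as follows. This is a minimal, network-free illustration only: `FakeVault`, its `request` and `status` methods, and the status strings `"new"`, `"pending"` and `"done"` are hypothetical stand-ins chosen for the example, not the real Software Heritage API.

```python
import time
from typing import Callable


class FakeVault:
    """Hypothetical in-memory stand-in for an asynchronous archive service."""

    def __init__(self, polls_until_ready: int = 3) -> None:
        self._polls_until_ready = polls_until_ready
        self._remaining = {}  # revision id -> polls left before "done"

    def request(self, revision_id: str) -> str:
        # Plays the role of the POST: register the request, report initial status.
        self._remaining.setdefault(revision_id, self._polls_until_ready)
        return "new"

    def status(self, revision_id: str) -> str:
        # Plays the role of the status check: each poll moves the job forward.
        left = self._remaining[revision_id]
        if left <= 0:
            return "done"
        self._remaining[revision_id] = left - 1
        return "pending"


def wait_until_done(get_status: Callable[[], str],
                    max_polls: int = 20,
                    delay: float = 0.0) -> int:
    """Poll get_status() until it returns "done"; return how many polls it took."""
    for attempt in range(1, max_polls + 1):
        if get_status() == "done":
            return attempt
        time.sleep(delay)
    raise TimeoutError("object not available after %d polls" % max_polls)
```

A caller would first call `vault.request(revision_id)` and then `wait_until_done(lambda: vault.status(revision_id))`. Against a real service, a non-zero `delay` keeps the polling from adding to the load.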
Dialogue: 0,0:27:48.31,0:27:55.22,Default,,0000,0000,0000,,Full text search, the end goal is to be\Nable even to trace Dialogue: 0,0:27:55.22,0:28:00.50,Default,,0000,0000,0000,,source of snippets of code that have\Nbeen copied from one project to another. Dialogue: 0,0:28:01.32,0:28:05.83,Default,,0000,0000,0000,,That's something that we can look into\Nwith the wealth of information that Dialogue: 0,0:28:05.83,0:28:07.62,Default,,0000,0000,0000,,we have inside the archive. Dialogue: 0,0:28:08.64,0:28:10.67,Default,,0000,0000,0000,,There's a lot of things that, Dialogue: 0,0:28:10.67,0:28:11.73,Default,,0000,0000,0000,,I mean… Dialogue: 0,0:28:12.14,0:28:14.73,Default,,0000,0000,0000,,There's a lot of things that people want\Nto do with the archive. Dialogue: 0,0:28:15.35,0:28:19.59,Default,,0000,0000,0000,,Our goal is to enable people to do things,\Nto do interesting things Dialogue: 0,0:28:19.59,0:28:21.90,Default,,0000,0000,0000,,with a lot of source code. Dialogue: 0,0:28:23.53,0:28:27.35,Default,,0000,0000,0000,,If you have an idea of what you want to do\Nwith such an archive, Dialogue: 0,0:28:27.35,0:28:29.84,Default,,0000,0000,0000,,please you can come talk to us Dialogue: 0,0:28:29.84,0:28:34.94,Default,,0000,0000,0000,,and we'll be happy to help you help us. Dialogue: 0,0:28:37.55,0:28:43.57,Default,,0000,0000,0000,,What we want to do is to diversify\Nthe sources of things that we archive. Dialogue: 0,0:28:44.47,0:28:51.29,Default,,0000,0000,0000,,Currently, we have good support for git,\Nwe have OK support for subversion Dialogue: 0,0:28:51.29,0:28:52.71,Default,,0000,0000,0000,,and mercurial. Dialogue: 0,0:28:54.37,0:28:59.22,Default,,0000,0000,0000,,If your project of choice is in another\Nversion control system, Dialogue: 0,0:28:59.22,0:29:01.09,Default,,0000,0000,0000,,we are gonna miss it. Dialogue: 0,0:29:01.66,0:29:06.30,Default,,0000,0000,0000,,So people can contribute in this area.
Dialogue: 0,0:29:10.12,0:29:18.20,Default,,0000,0000,0000,,For the listing part, we have coverage of\NDebian, we have coverage of Github, Dialogue: 0,0:29:18.20,0:29:26.42,Default,,0000,0000,0000,,if your code is somewhere else, we won't\Nsee it, so we need people to contribute Dialogue: 0,0:29:26.42,0:29:29.59,Default,,0000,0000,0000,,stuff that can list for instance Gitlab\Ninstances, Dialogue: 0,0:29:31.90,0:29:36.41,Default,,0000,0000,0000,,and then we can integrate that in our\Ninfrastructure and actually have Dialogue: 0,0:29:36.93,0:29:41.44,Default,,0000,0000,0000,,people be able to archive their gitlab\Ninstances. Dialogue: 0,0:29:42.04,0:29:48.78,Default,,0000,0000,0000,,And of course, we need to spread\Nthe word, make the project sustainable. Dialogue: 0,0:29:49.12,0:30:00.59,Default,,0000,0000,0000,,We have a few sponsors now, Microsoft,\NNokia, Huawei, Github has joined as a sponsor Dialogue: 0,0:30:01.81,0:30:06.36,Default,,0000,0000,0000,,The university of Bologna, of course Inria\Nis sponsoring. Dialogue: 0,0:30:06.85,0:30:11.97,Default,,0000,0000,0000,,But we need to keep spreading the word\Nand keep the project sustainable. Dialogue: 0,0:30:13.03,0:30:17.50,Default,,0000,0000,0000,,And, of course, we need to save endangered\Nsource code. Dialogue: 0,0:30:17.83,0:30:22.58,Default,,0000,0000,0000,,For that, we have a suggestion box on\Nthe wiki that you can add things to. Dialogue: 0,0:30:24.21,0:30:29.56,Default,,0000,0000,0000,,For instance, we have in the back of\Nour minds archiving SourceForge, Dialogue: 0,0:30:29.56,0:30:35.93,Default,,0000,0000,0000,,because we know that this isn't very\Nsustainable and that's at risk of being Dialogue: 0,0:30:35.93,0:30:38.73,Default,,0000,0000,0000,,taken down at some point. Dialogue: 0,0:30:41.70,0:30:47.83,Default,,0000,0000,0000,,If you want to join us, we also have\Nsome job openings that are available.
Dialogue: 0,0:30:48.60,0:30:55.65,Default,,0000,0000,0000,,For now it's in Paris, so if you want to\Nconsider coming work with us in Paris, Dialogue: 0,0:30:55.65,0:30:58.09,Default,,0000,0000,0000,,you can look into that. Dialogue: 0,0:31:00.65,0:31:02.68,Default,,0000,0000,0000,,That's Software Heritage. Dialogue: 0,0:31:02.68,0:31:05.12,Default,,0000,0000,0000,,We are building a reference archive of\Nall the free software Dialogue: 0,0:31:05.12,0:31:06.84,Default,,0000,0000,0000,,that's ever been written Dialogue: 0,0:31:07.08,0:31:10.98,Default,,0000,0000,0000,,in an international, open, non-profit and\Nmutualised infrastructure Dialogue: 0,0:31:11.88,0:31:17.93,Default,,0000,0000,0000,,that we have opened up to everyone,\Nall users, vendors, developers can use it. Dialogue: 0,0:31:20.13,0:31:25.66,Default,,0000,0000,0000,,The idea is to be at the service of\Nthe community and for society Dialogue: 0,0:31:25.66,0:31:27.80,Default,,0000,0000,0000,,as a whole. Dialogue: 0,0:31:28.14,0:31:32.86,Default,,0000,0000,0000,,So if you want to join us, you can look at\Nour website, you can look at our code. Dialogue: 0,0:31:34.60,0:31:38.14,Default,,0000,0000,0000,,You can also talk to me, so if you have\Nany questions, Dialogue: 0,0:31:38.14,0:31:42.12,Default,,0000,0000,0000,,I think we have 10, 12 minutes for questions. Dialogue: 0,0:31:46.23,0:31:51.51,Default,,0000,0000,0000,,[Applause] Dialogue: 0,0:31:51.75,0:31:52.93,Default,,0000,0000,0000,,Do you have questions? Dialogue: 0,0:31:57.21,0:32:00.63,Default,,0000,0000,0000,,[Q] How do you protect the archive\Nagainst stuff that you don't want to Dialogue: 0,0:32:00.63,0:32:01.89,Default,,0000,0000,0000,,have in the archive. Dialogue: 0,0:32:02.17,0:32:06.88,Default,,0000,0000,0000,,I think of stuff that is copyright-\Nprotected and that Github will also Dialogue: 0,0:32:06.88,0:32:09.32,Default,,0000,0000,0000,,delete after a while.
Dialogue: 0,0:32:09.73,0:32:15.58,Default,,0000,0000,0000,,Worse, if I would misuse the archive\Nas my private backup Dialogue: 0,0:32:15.58,0:32:19.60,Default,,0000,0000,0000,,and store encrypted blocks on Github\Nand you will eventually backup them Dialogue: 0,0:32:19.60,0:32:20.78,Default,,0000,0000,0000,,for me. Dialogue: 0,0:32:24.56,0:32:26.71,Default,,0000,0000,0000,,[A] There's, I think, two sides of the\Nquestion. Dialogue: 0,0:32:27.08,0:32:28.50,Default,,0000,0000,0000,,The first side is Dialogue: 0,0:32:28.50,0:32:33.54,Default,,0000,0000,0000,,Do we really archive only stuff that is\Nfree software and Dialogue: 0,0:32:33.54,0:32:40.90,Default,,0000,0000,0000,,that we can redistribute and how do we\Nmanage, for instance, Dialogue: 0,0:32:40.90,0:32:42.86,Default,,0000,0000,0000,,copyright takedown stuff. Dialogue: 0,0:32:46.11,0:32:51.87,Default,,0000,0000,0000,,Currently, most of the infrastructure\Nof the project is under French law. Dialogue: 0,0:32:52.98,0:33:00.05,Default,,0000,0000,0000,,There's a defined process to do\Ncopyright takedown in the French legal system. Dialogue: 0,0:33:02.36,0:33:08.83,Default,,0000,0000,0000,,We would be really annoyed to have to\Ntake down content from the archive Dialogue: 0,0:33:12.49,0:33:19.85,Default,,0000,0000,0000,,What we do, however, is to mirror public\Ninformation that is publicly available. 
Dialogue: 0,0:33:21.19,0:33:26.72,Default,,0000,0000,0000,,Of course I'm not a lawyer for the project,\Nso I can't really… Dialogue: 0,0:33:29.60,0:33:33.18,Default,,0000,0000,0000,,I'm not 100% sure of what I'm about to say\Nbut Dialogue: 0,0:33:33.18,0:33:38.92,Default,,0000,0000,0000,,what I know is that in the current French\Nlegislation status, Dialogue: 0,0:33:39.53,0:33:42.90,Default,,0000,0000,0000,,if the source of the data is still available Dialogue: 0,0:33:42.90,0:33:46.64,Default,,0000,0000,0000,,so for instance if the data is still on\NGithub, then you need to have Dialogue: 0,0:33:46.64,0:33:49.90,Default,,0000,0000,0000,,Github take it down before we have to\Ntake it down. Dialogue: 0,0:33:56.68,0:34:01.88,Default,,0000,0000,0000,,We're not currently filtering content for\Nmisuse of the archive, Dialogue: 0,0:34:01.88,0:34:06.36,Default,,0000,0000,0000,,so the only thing that we do is put\Na limit on the size of the files Dialogue: 0,0:34:06.36,0:34:08.44,Default,,0000,0000,0000,,that are archived in Software Heritage. Dialogue: 0,0:34:09.54,0:34:12.01,Default,,0000,0000,0000,,The limit is pretty high, like 100MB. Dialogue: 0,0:34:15.10,0:34:21.44,Default,,0000,0000,0000,,We can't really decide ourselves Dialogue: 0,0:34:21.44,0:34:24.08,Default,,0000,0000,0000,,what is source code,\Nwhat is not source code Dialogue: 0,0:34:24.08,0:34:30.67,Default,,0000,0000,0000,,because for instance if your project is\Na cryptography library, Dialogue: 0,0:34:30.67,0:34:34.40,Default,,0000,0000,0000,,you might want to have some encrypted\Nblocks of data that are stored Dialogue: 0,0:34:34.40,0:34:38.46,Default,,0000,0000,0000,,in your source code repository as\Ntest fixtures. Dialogue: 0,0:34:39.03,0:34:44.03,Default,,0000,0000,0000,,And then, you need them to build the code\Nand to make sure that it works. Dialogue: 0,0:34:44.68,0:34:49.00,Default,,0000,0000,0000,,So, how would that be any different than\Nyour encrypted backup on Github?
Dialogue: 0,0:34:49.14,0:34:55.64,Default,,0000,0000,0000,,How could we, Software Heritage,\Ndistinguish between proper use and misuse Dialogue: 0,0:34:55.64,0:34:58.81,Default,,0000,0000,0000,,of the resources. Dialogue: 0,0:35:00.35,0:35:05.10,Default,,0000,0000,0000,,I guess our long term goal is to not have\Nto care about misuse because Dialogue: 0,0:35:05.10,0:35:07.18,Default,,0000,0000,0000,,it's gonna be a drop in the ocean. Dialogue: 0,0:35:08.64,0:35:10.92,Default,,0000,0000,0000,,We're gonna have so much… Dialogue: 0,0:35:11.89,0:35:15.30,Default,,0000,0000,0000,,We want to have enough space and\Nenough resources Dialogue: 0,0:35:15.30,0:35:20.02,Default,,0000,0000,0000,,that we don't really need to ask ourselves\Nthis question, basically. Dialogue: 0,0:35:21.48,0:35:22.41,Default,,0000,0000,0000,,Thanks. Dialogue: 0,0:35:26.36,0:35:27.65,Default,,0000,0000,0000,,Other questions? Dialogue: 0,0:35:34.11,0:35:39.36,Default,,0000,0000,0000,,[Q] Have you looked at some form of\Nauthentication to provide additional Dialogue: 0,0:35:39.36,0:35:46.35,Default,,0000,0000,0000,,assurance that the archived source code\Nhasn't been modified or tampered with Dialogue: 0,0:35:46.35,0:35:47.89,Default,,0000,0000,0000,,in some form? Dialogue: 0,0:35:50.98,0:35:55.97,Default,,0000,0000,0000,,[A] First of all, all the identifiers for\Nthe objects that are inside the archive Dialogue: 0,0:35:55.97,0:36:00.64,Default,,0000,0000,0000,,are cryptographic hashes of the contents\Nthat we've archived. Dialogue: 0,0:36:01.61,0:36:06.94,Default,,0000,0000,0000,,So, for files, for instance, we take\Nthe SHA1, the SHA256, Dialogue: 0,0:36:06.94,0:36:16.08,Default,,0000,0000,0000,,one of the BLAKE hashes and the git\Nmodified SHA1 of the file, Dialogue: 0,0:36:16.65,0:36:19.66,Default,,0000,0000,0000,,and we use that in the manifest for\Nthe directories.
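As a rough illustration of the per-file identifiers mentioned in the answer, the following sketch computes several hashes of one file's contents with Python's standard `hashlib`. The function name `content_hashes` and the dictionary keys are made up for this example, and since the talk only says "one of the BLAKE hashes", the choice of BLAKE2s here is an assumption. The "git modified SHA1" is taken to be git's blob hash, which is a SHA1 over a short `blob <size>` header followed by the raw bytes.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute several cryptographic identifiers for one file's contents."""
    # Git hashes a blob as SHA1 over "blob <size>\0" followed by the raw bytes.
    git_header = b"blob %d\0" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: the talk does not say which BLAKE variant is used.
        "blake2s": hashlib.blake2s(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }
```

Keeping several independent hashes per file means that a weakness found in one algorithm does not by itself make the stored identifiers untrustworthy.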
Dialogue: 0,0:36:19.90,0:36:25.79,Default,,0000,0000,0000,,So the directories, the directory identifiers\Nare a hash of the manifest Dialogue: 0,0:36:25.79,0:36:30.09,Default,,0000,0000,0000,,of the list of files that are inside\Nthe directory, etc. Dialogue: 0,0:36:30.54,0:36:39.29,Default,,0000,0000,0000,,So, recursively, you can make sure that\Nthe data that we give back to you Dialogue: 0,0:36:39.29,0:36:47.78,Default,,0000,0000,0000,,has not been, at least altered, by bitflip\Nor anything. Dialogue: 0,0:36:48.95,0:36:53.39,Default,,0000,0000,0000,,We regularly run a scrub of the data\Nthat we have in the archive, Dialogue: 0,0:36:53.39,0:36:57.25,Default,,0000,0000,0000,,so we make sure that there's no rot\Ninside our archive. Dialogue: 0,0:36:58.96,0:37:05.06,Default,,0000,0000,0000,,We've not looked into, basically,\Nattestation of… Dialogue: 0,0:37:08.76,0:37:13.88,Default,,0000,0000,0000,,for instance, making sure that the code\Nthat we've downloaded… Dialogue: 0,0:37:20.88,0:37:26.45,Default,,0000,0000,0000,,I mean, we're not doing anything more\Nthan taking a picture of the data Dialogue: 0,0:37:26.45,0:37:34.09,Default,,0000,0000,0000,,and we say "We've computed this hash.\NMaybe the code that's been presented Dialogue: 0,0:37:34.09,0:37:38.84,Default,,0000,0000,0000,,by Github to Software Heritage is different\Nthan what you've uploaded to Github, Dialogue: 0,0:37:38.84,0:37:40.31,Default,,0000,0000,0000,,we can't tell." Dialogue: 0,0:37:43.97,0:37:48.92,Default,,0000,0000,0000,,In the case of git, you can always use\Nthe identifiers of the objects Dialogue: 0,0:37:48.92,0:37:51.86,Default,,0000,0000,0000,,that you've pushed so you have\Nthe commit hash, Dialogue: 0,0:37:51.86,0:37:56.78,Default,,0000,0000,0000,,which is itself a cryptographic identifier\Nof the contents of the commit. 
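The recursive scheme described here, where a directory identifier is a hash of a manifest listing the identifiers of its entries, can be sketched as below. This is a simplified illustration: the manifest layout (`file <id> <name>` and `dir <id> <name>` lines hashed with SHA256) and the names `blob_id` and `directory_id` are hypothetical and are not the format actually used by Software Heritage or git.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Identifier of a file: git-style blob SHA1 of its contents."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def directory_id(tree: dict) -> str:
    """Identifier of a directory.

    `tree` maps entry names to bytes (a file) or to a nested dict (a subdirectory).
    """
    lines = []
    for name in sorted(tree):  # sort so the identifier does not depend on ordering
        entry = tree[name]
        if isinstance(entry, dict):
            lines.append("dir %s %s" % (directory_id(entry), name))
        else:
            lines.append("file %s %s" % (blob_id(entry), name))
    # Hypothetical manifest layout: one line per entry, hashed as a whole.
    manifest = "\n".join(lines).encode("utf-8")
    return hashlib.sha256(manifest).hexdigest()
```

Because each directory identifier covers the identifiers of everything beneath it, recomputing the root identifier after a download is enough to detect a change in any single file of the tree.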
Dialogue: 0,0:37:59.42,0:38:02.18,Default,,0000,0000,0000,,In turn, if the commit is signed, then\Nthe signature is still stored Dialogue: 0,0:38:02.18,0:38:10.80,Default,,0000,0000,0000,,in the Software Heritage metadata and\Nyou can reproduce the original git object Dialogue: 0,0:38:10.80,0:38:15.36,Default,,0000,0000,0000,,and check the signature, but we've not\Ndone anything specific for Software Heritage Dialogue: 0,0:38:15.36,0:38:17.18,Default,,0000,0000,0000,,in this area. Dialogue: 0,0:38:17.54,0:38:19.64,Default,,0000,0000,0000,,Does that answer your question? Dialogue: 0,0:38:19.98,0:38:20.30,Default,,0000,0000,0000,,Cool. Dialogue: 0,0:38:24.89,0:38:25.75,Default,,0000,0000,0000,,Other questions? Dialogue: 0,0:38:27.46,0:38:28.80,Default,,0000,0000,0000,,There's one in front. Dialogue: 0,0:38:31.40,0:38:33.56,Default,,0000,0000,0000,,[Q] It's partially question, partially\Ncomment. Dialogue: 0,0:38:33.88,0:38:39.78,Default,,0000,0000,0000,,Your initial idea was to have a telescope,\Nor something like this for source code. Dialogue: 0,0:38:40.22,0:38:43.43,Default,,0000,0000,0000,,For now, for me, it looks a little bit\Nmore like microscope, Dialogue: 0,0:38:43.43,0:38:46.51,Default,,0000,0000,0000,,so you can focus on one thing, but that's\Nnot much. Dialogue: 0,0:38:46.76,0:38:51.02,Default,,0000,0000,0000,,So have you sorted things about how to\Nanalyze entire ecosystem Dialogue: 0,0:38:51.02,0:38:52.20,Default,,0000,0000,0000,,or something like this. Dialogue: 0,0:38:52.20,0:38:56.51,Default,,0000,0000,0000,,For example, now we have Django 2 which is\NPython 3 only so it would be interesting to Dialogue: 0,0:38:56.51,0:39:00.90,Default,,0000,0000,0000,,look at all Django modules to see when\Nthey start moving to this Django. 
Dialogue: 0,0:39:01.27,0:39:06.62,Default,,0000,0000,0000,,So we would need to start analyzing\Nthousands or millions of files, but then Dialogue: 0,0:39:06.62,0:39:10.84,Default,,0000,0000,0000,,we would need some SQL like, or some\Nmap reduce jobs Dialogue: 0,0:39:11.05,0:39:12.43,Default,,0000,0000,0000,,or something like this for this. Dialogue: 0,0:39:12.96,0:39:13.52,Default,,0000,0000,0000,,[A] Yes Dialogue: 0,0:39:13.89,0:39:15.07,Default,,0000,0000,0000,,So, we've started… Dialogue: 0,0:39:16.41,0:39:21.62,Default,,0000,0000,0000,,The two initiators of the project, Roberto\NDi Cosmo and Stefano Zacchiroli Dialogue: 0,0:39:21.81,0:39:26.57,Default,,0000,0000,0000,,are both researchers in computer science\Nso they have a strong background in Dialogue: 0,0:39:26.57,0:39:34.65,Default,,0000,0000,0000,,actually mining software repositories and\Ndoing some large scale analysis Dialogue: 0,0:39:34.65,0:39:36.23,Default,,0000,0000,0000,,on source code. Dialogue: 0,0:39:38.15,0:39:44.82,Default,,0000,0000,0000,,We've been talking with research groups\Nwhose main goal is to do analysis on Dialogue: 0,0:39:44.82,0:39:48.44,Default,,0000,0000,0000,,large scale source code archives. Dialogue: 0,0:39:50.43,0:39:57.59,Default,,0000,0000,0000,,One of the first mirrors outside of our\Ncontrol of the archive Dialogue: 0,0:39:57.59,0:39:59.02,Default,,0000,0000,0000,,will be in Grenoble (France). Dialogue: 0,0:39:59.38,0:40:05.85,Default,,0000,0000,0000,,There's a few teams that work on\Nactually doing large scale research Dialogue: 0,0:40:05.85,0:40:08.70,Default,,0000,0000,0000,,on source code over there, Dialogue: 0,0:40:08.70,0:40:11.34,Default,,0000,0000,0000,,so that's what the mirror will be\Nused for. Dialogue: 0,0:40:13.41,0:40:17.24,Default,,0000,0000,0000,,We've also been looking at what\Nthe Google open source team does. 
Dialogue: 0,0:40:18.21,0:40:23.00,Default,,0000,0000,0000,,They have this big repository with all\Nthe code that Google uses Dialogue: 0,0:40:23.00,0:40:28.94,Default,,0000,0000,0000,,and they've started to push back,\Nlike do large scale analysis of Dialogue: 0,0:40:28.94,0:40:37.58,Default,,0000,0000,0000,,security vulnerabilities, issues with\Nstatic and dynamic analysis Dialogue: 0,0:40:37.58,0:40:41.94,Default,,0000,0000,0000,,of the code and they've started pushing\Ntheir fixes upstream. Dialogue: 0,0:40:42.59,0:40:47.14,Default,,0000,0000,0000,,That's something that we want to enable\Nusers to do, Dialogue: 0,0:40:47.14,0:40:50.63,Default,,0000,0000,0000,,that's not something that we want to do\Nourselves, but we want to make sure Dialogue: 0,0:40:50.63,0:40:53.48,Default,,0000,0000,0000,,that people can do it using our archive. Dialogue: 0,0:40:54.62,0:40:58.77,Default,,0000,0000,0000,,So we'd be happy to work with people\Nwho already do that so that Dialogue: 0,0:40:58.77,0:41:04.53,Default,,0000,0000,0000,,they can use their knowledge and their\Ntools inside our archive. Dialogue: 0,0:41:06.61,0:41:08.68,Default,,0000,0000,0000,,Does that answer your question? Dialogue: 0,0:41:09.66,0:41:10.67,Default,,0000,0000,0000,,Cool. Dialogue: 0,0:41:14.98,0:41:16.53,Default,,0000,0000,0000,,Any more questions? Dialogue: 0,0:41:19.41,0:41:21.73,Default,,0000,0000,0000,,No? Then thank you very much Nicolas. Dialogue: 0,0:41:21.93,0:41:22.58,Default,,0000,0000,0000,,Thank you. Dialogue: 0,0:41:22.95,0:41:25.96,Default,,0000,0000,0000,,[Applause]