Software Heritage - Preserving the Free Software Commons
-
0:05 - 0:07Hi, thank you.
-
0:08 - 0:11I'm Nicolas Dandrimont and I will indeed
be talking to you about -
0:11 - 0:13Software Heritage.
-
0:13 - 0:15I'm a software engineer for this project.
-
0:16 - 0:18I've been working on it for 3 years now.
-
0:18 - 0:22And we'll see what this thing is all about.
-
0:24 - 0:39[Mic not working]
-
0:39 - 0:41I guess the batteries are out.
-
0:50 - 0:52So, let's try that again.
-
0:52 - 0:55So, we all know, we've been doing
free software for a while, -
0:56 - 1:00that software source code is something
special. -
1:01 - 1:02Why is that?
-
1:03 - 1:10As Harold Abelson has said in SICP, his
textbook on programming, -
1:10 - 1:19programs are meant to be read by people
and then incidentally for machines to execute. -
1:20 - 1:26Basically, what software source code
provides us is a way inside -
1:26 - 1:29the mind of the designer of the program.
-
1:29 - 1:38For instance, you can have,
you can get inside very crazy algorithms -
1:38 - 1:47that can do very fast reverse square roots
for 3D, that kind of stuff -
1:47 - 1:50Like in the Quake 2 source code.
-
1:50 - 1:55You can also get inside the algorithms
that are underpinning the internet, -
1:55 - 2:00for instance seeing the net queue
algorithm in the Linux kernel. -
2:04 - 2:10What we are building as the free software
community is the free software commons. -
2:11 - 2:19Basically, the commons is all the cultural
and social and natural resources -
2:19 - 2:22that we share and that everyone
has access to. -
2:22 - 2:26More specifically, the software commons
is what we are building -
2:26 - 2:32with software that is open and that is
available for all to use, to modify, -
2:32 - 2:35to execute, to distribute.
-
2:37 - 2:45We know that those commons are a really
critical part of our commons. -
2:46 - 2:48Who's taking care of it?
-
2:50 - 2:52The software is fragile.
-
2:52 - 2:54Like all digital information, you can lose
software. -
2:56 - 3:02People can decide to shut down hosting
spaces because of business decisions. -
3:03 - 3:09People can hack into software hosting
platforms and remove the code maliciously -
3:09 - 3:11or just inadvertently.
-
3:13 - 3:18And, of course, for the obsolete stuff,
there's rot. -
3:18 - 3:25If you don't care about the data, then
it rots and it decays and you lose it. -
3:26 - 3:31So, where is the archive we go to
when something is lost, -
3:31 - 3:34when GitLab goes away, when GitHub
goes away. -
3:34 - 3:36Where do we go?
-
3:37 - 3:41Finally, there's one last thing that we
noticed, it's that -
3:41 - 3:49there's a lot of teams that work on
research on software -
3:49 - 3:54and there's no real big infrastructure
for research on code. -
3:57 - 4:02There's tons of critical issues around
code: safety, security, verification, proofs. -
4:04 - 4:08Nobody's doing this at a very large scale.
-
4:08 - 4:12If you want to see the stars, you go
to the Atacama desert and -
4:12 - 4:14you point a telescope at the sky.
-
4:14 - 4:18Where is the telescope for source code?
-
4:18 - 4:21That's what Software Heritage wants to be.
-
4:22 - 4:28What we do is we collect, we preserve
and we share all the software -
4:28 - 4:30that is publicly available.
-
4:31 - 4:36Why do we do that? We do that to
preserve the past, to enhance the present -
4:36 - 4:38and to prepare for the future.
-
4:40 - 4:45What we're building is a base infrastructure
that can be used -
4:45 - 4:50for cultural heritage, for industry,
for research and for education purposes. -
4:51 - 4:53How do we do it? We do it with an open
approach. -
4:53 - 4:57Every single line of code that we write
is free software. -
4:59 - 5:05We do it transparently, everything that
we do, we do it in the open, -
5:05 - 5:09be that on a mailing list or on
our issue tracker. -
5:10 - 5:16And we strive to do it for the very long
haul, so we do it with replication in mind -
5:16 - 5:22so that no single entity has full control
over the data that we collect. -
5:23 - 5:27And we do it in a non-profit fashion
so that we avoid -
5:27 - 5:33business-driven decisions impacting
the project. -
5:35 - 5:39So, what do we do concretely?
-
5:39 - 5:43We do archiving of version control systems.
-
5:43 - 5:45What does that mean?
-
5:46 - 5:49It means we archive file contents, so
source code, files. -
5:49 - 5:56We archive revisions, which means all the
metadata of the history of the projects, -
5:56 - 6:03we try to download it and we put it inside
a common data model that is -
6:03 - 6:07shared across all the archive.
-
6:09 - 6:14We archive releases of the software,
releases that have been tagged -
6:14 - 6:18in a version control system as well as
releases that we can find as tarballs -
6:18 - 6:24because sometimes… boof, views of
this source code differ. -
6:28 - 6:32Of course, we archive where and when
we've seen the data that we've collected. -
6:33 - 6:40All of this, we put inside a canonical,
VCS-agnostic, data model. -
6:42 - 6:47If you have a Debian package, with its
history, if you have a git repository, -
6:47 - 6:50if you have a Subversion repository, if
you have a Mercurial repository, -
6:50 - 6:54it all looks the same and you can work
on it with the same tools. -
6:55 - 7:01What we don't do is archive what's around
the software, for instance -
7:01 - 7:06the bug tracking systems or the homepages
or the wikis or the mailing lists. -
7:07 - 7:11There are some projects that work
in this space, for instance -
7:11 - 7:16the Internet Archive does a lot of
very good work around archiving the web. -
7:18 - 7:24Our goal is not to replace them, but to
work with them and be able to do -
7:24 - 7:29linking across all the archives that exist.
-
7:30 - 7:35We can, for instance for the mailing lists
there's the Gmane project -
7:35 - 7:39that does a lot of archiving of free
software mailing lists. -
7:40 - 7:48So our long term vision is to play a part
in a semantic Wikipedia of software, -
7:48 - 7:54a Wikidata of software where we can
hyperlink all the archives that exist -
7:54 - 7:57and do stuff in the area.
-
8:01 - 8:03Quick tour of our infrastructure.
-
8:03 - 8:10Basically, all the way to the right is
our archive. -
8:11 - 8:17Our archive consists of a huge graph
of all the metadata about -
8:17 - 8:25the files, the directories, the revisions,
the commits and the releases and -
8:25 - 8:28all the projects that are on top
of the graph. -
8:29 - 8:34We separate the file storage into another
object storage because of -
8:34 - 8:42the size discrepancy: we have lots and lots
of file contents that we need to store -
8:42 - 8:46so we do that outside of the database
that is used to store the graph. -
8:49 - 8:54Basically, what we archive is a set of
software origins that are -
8:54 - 8:59git repositories, Mercurial repositories,
etc. etc. -
9:00 - 9:05All those origins are loaded on a
regular schedule. -
9:07 - 9:13If there is a very active software origin,
we're gonna archive it more often -
9:13 - 9:18than stale things that don't get
a lot of updates. -
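A minimal sketch of the adaptive schedule described here, assuming a simple rule that is not taken from the actual Software Heritage scheduler: halve the delay when a visit finds changes, double it when it finds none, and clamp the result. The names `next_interval`, `MIN_INTERVAL` and `MAX_INTERVAL` and both bounds are hypothetical.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(hours=12)  # assumed lower bound, for illustration
MAX_INTERVAL = timedelta(days=64)   # assumed upper bound, for illustration


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    Active origins (the last visit found changes) are visited sooner,
    stale ones are visited less and less often.
    """
    candidate = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))
```

With this rule, an origin that keeps changing converges to the shortest interval, and one that never changes drifts to the longest.
-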
9:20 - 9:24What we do to get the list of software
origins that we archive. -
9:25 - 9:31We have a bunch of listers that can
scroll through the list of repositories, -
9:31 - 9:34for instance on GitHub or other
hosting platforms. -
9:35 - 9:42We have code that can read Debian archive
metadata to make a list of the packages -
9:42 - 9:49that are inside this archive and can be
archived, etc. -
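A lister of this kind can be sketched as a loop over a paginated listing. This is only an illustration: `fetch_page` is a hypothetical callable standing in for whatever a hosting platform's listing API returns, and `FAKE_PAGES` is made-up data so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A page is (origin URLs on this page, cursor of the next page or None).
Page = Tuple[List[str], Optional[str]]


def list_origins(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every origin URL by following pagination cursors to the end."""
    cursor: Optional[str] = None
    while True:
        urls, cursor = fetch_page(cursor)
        yield from urls
        if cursor is None:
            break


# Stand-in for a platform's paginated repository listing (hypothetical data).
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: (["https://example.org/a.git", "https://example.org/b.git"], "p2"),
    "p2": (["https://example.org/c.git"], None),
}

origins = list(list_origins(lambda cursor: FAKE_PAGES[cursor]))
```

Keeping the network call behind `fetch_page` means the same loop could serve any platform that exposes a cursor-based listing.
-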
9:50 - 9:53All of this is done on a regular basis.
-
9:54 - 9:57We are currently working on some kind
of push mechanism so that -
9:57 - 10:01people or other systems can notify us
of updates. -
10:03 - 10:10Our goal is not to do real time archiving,
we're really in it for the long run -
10:10 - 10:16but we still want to be able to prioritize
stuff that people tell us is -
10:16 - 10:18important to archive.
-
10:20 - 10:24The Internet Archive has a "save now"
button and we want to implement -
10:24 - 10:26something along those lines as well,
-
10:26 - 10:32so if we know that some software project
is in danger for a reason or another, -
10:32 - 10:34then we can prioritize archiving it.
-
10:36 - 10:40So this is the basic structure of a revision
in the Software Heritage archive. -
10:42 - 10:45You'll see that it's very similar to
a git commit. -
10:48 - 10:54The format of the metadata is pretty much
what you'll find in a git commit -
10:54 - 10:59with some extensions that you don't
see here because this is from a git commit -
11:01 - 11:10So basically what we do is we take the
identifier of the directory -
11:10 - 11:16that the revision points to, we take the
identifier of the parent of the revision -
11:16 - 11:19so we can keep track of the history
-
11:19 - 11:25and then we add some metadata,
authorship and committership information -
11:25 - 11:29and the revision message and then we take
a hash of this, -
11:29 - 11:37it makes an identifier that's probably
unique, very very probably unique. -
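The computation just described can be sketched as follows, assuming the plain git commit layout (a manifest of `tree`, `parent`, `author` and `committer` lines, a blank line, then the message, hashed with SHA1 behind a `commit <length>` header). The extensions mentioned in the talk are left out, and `revision_id` is a hypothetical helper name, not the project's own function.

```python
import hashlib
from typing import Sequence


def revision_id(directory: str, parents: Sequence[str], author: str,
                committer: str, message: str) -> str:
    """Hash a git-style commit manifest into a hexadecimal identifier."""
    lines = [f"tree {directory}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = b"commit %d\x00" % len(body)  # git prefixes the type and size
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree
who = "Alice <alice@example.org> 1526774400 +0200"       # made-up identity
root = revision_id(EMPTY_TREE, [], who, who, "initial\n")
child = revision_id(EMPTY_TREE, [root], who, who, "second\n")
```

Because the parent identifier is part of what gets hashed, `child` pins down the whole history behind it.
-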
11:40 - 11:47Using those identifiers, we can retrace
all the origins, all the history of -
11:47 - 11:52development of the project and we can
deduplicate across all the archive. -
11:52 - 11:59All the identifiers are intrinsic, which
means that we compute them -
11:59 - 12:04from the contents of the things that
we are archiving, which means that -
12:04 - 12:11we can deduplicate very efficiently
across all the data that we archive. -
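Why intrinsic identifiers make deduplication cheap can be shown with a small content-addressed store. This is a sketch under assumptions: the `ContentStore` class is hypothetical, it keeps everything in memory, and SHA256 is picked only for the example.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Store each distinct blob once, keyed by a hash of its own bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # no-op if already archived
        return key

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
license_text = b"GNU GENERAL PUBLIC LICENSE\n"
id_a = store.add(license_text)  # seen in one project
id_b = store.add(license_text)  # same file seen again in a fork
store.add(b"int main(void) { return 0; }\n")
```

The identifier depends only on the bytes, so the same file found in a thousand forks maps to a single stored object.
-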
12:12 - 12:14How much data do we archive?
-
12:17 - 12:18A bit.
-
12:19 - 12:24So, we have passed the billion revision
mark a few weeks ago. -
12:25 - 12:30This graph is a bit old, but anyway,
you have a live graph on our website. -
12:31 - 12:36That's more than 4.5 billion unique
source code files. -
12:38 - 12:45We don't actually discriminate between
what we would consider is source code -
12:45 - 12:48and what upstream developers consider
as source code, -
12:48 - 12:52so everything that's in a git repository,
we consider as source code -
12:52 - 12:55if it's below a size threshold.
-
12:56 - 13:00A billion revisions across 80 million
projects. -
13:01 - 13:03What do we archive?
-
13:03 - 13:05We archive GitHub, we archive Debian.
-
13:07 - 13:12So, Debian we run the archival process
every day, every day we get the new packages -
13:12 - 13:14that have been uploaded in the archive.
-
13:14 - 13:21GitHub, we try to keep up, we are currently
working on some performance improvements, -
13:21 - 13:25some scalability improvements to make sure
that we can keep up -
13:25 - 13:27with the development on GitHub.
-
13:29 - 13:40We have archived as a one-off thing the
former contents of Gitorious and Google Code -
13:41 - 13:47which are two prominent code hosting
spaces that closed recently -
13:48 - 13:54and we've been working on archiving
the contents of Bitbucket -
13:54 - 14:00which is kind of a challenge because
the API is a bit buggy and -
14:00 - 14:03Atlassian isn't too interested
in fixing it. -
14:06 - 14:17In concrete storage terms, we have 175TB
of blobs, so the files take 175TB -
14:17 - 14:20and kind of big database, 6TB.
-
14:21 - 14:28The database only contains the graph of
the metadata for the archive -
14:28 - 14:35which is basically an 8 billion node and
70 billion edge graph. -
14:35 - 14:37And of course it's growing daily.
-
14:38 - 14:43We are pretty sure this is the richest public
source code archive that's available now -
14:43 - 14:45and it keeps growing.
-
14:46 - 14:49So how do we actually…
-
14:49 - 14:53What kind of stack do we use to store
all this? -
14:55 - 14:57We use Debian, of course.
-
14:58 - 15:03All our deployment recipes are in Puppet
in public repositories. -
15:04 - 15:08We've started using Ceph
for the blob storage. -
15:09 - 15:14We use PostgreSQL for the metadata storage
with some of the standard tools that -
15:15 - 15:18live around PostgreSQL for backups
and replication. -
15:20 - 15:28We use standard Python stack for
scheduling of jobs -
15:28 - 15:35and for web interface stuff, basically
psycopg2 for the low level stuff, -
15:35 - 15:38Django for the web stuff
-
15:38 - 15:44and Celery for the scheduling of jobs.
-
15:45 - 15:50In house, we've written an ad hoc
object storage system which has -
15:50 - 15:53a bunch of backends that you can use.
-
15:54 - 16:03Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of… -
16:03 - 16:07It's a really simple object storage system
where you can just put an object, -
16:07 - 16:10get an object, put a bunch of objects,
get a bunch of objects. -
16:12 - 16:18We've implemented removal but we don't
really use it yet. -
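The put/get interface described here can be sketched as a small abstract class with pluggable backends. This is not the project's actual object storage code: the class and method names are hypothetical, identifiers are assumed to be SHA1 hashes of the content, and only an in-memory backend is shown.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class ObjectStorage(ABC):
    """Backend-agnostic interface; a backend implements three primitives."""

    @abstractmethod
    def _write(self, obj_id: str, data: bytes) -> None: ...

    @abstractmethod
    def _read(self, obj_id: str) -> bytes: ...

    @abstractmethod
    def _delete(self, obj_id: str) -> None: ...

    def put(self, data: bytes) -> str:
        obj_id = hashlib.sha1(data).hexdigest()
        self._write(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._read(obj_id)

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]

    def remove(self, obj_id: str) -> None:
        self._delete(obj_id)


class InMemoryStorage(ObjectStorage):
    """Toy backend; a filesystem or cloud backend would fill the same slots."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def _write(self, obj_id: str, data: bytes) -> None:
        self._objects[obj_id] = data

    def _read(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def _delete(self, obj_id: str) -> None:
        del self._objects[obj_id]


storage = InMemoryStorage()
ids = storage.put_batch([b"first object", b"second object"])
fetched = storage.get_batch(ids)
```

Callers only ever see `put`, `get`, the batch variants and `remove`, which is what lets the backend be swapped without touching the rest of the code.
-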
16:20 - 16:25All the data model implementation,
all the listers, the loaders, the schedulers -
16:25 - 16:29everything has been written by us,
it's a pile of Python code. -
16:32 - 16:36So, basically 20 Python packages and
around 30 Puppet modules -
16:36 - 16:42to deploy all that and we've done everything
as a copyleft license, -
16:42 - 16:46GPLv3 for the backend and AGPLv3
for the frontend. -
16:47 - 16:57Even if people try and make their own
Software Heritage using our code, -
16:57 - 17:00they have to publish their changes.
-
17:02 - 17:11Hardware-wise, we run for now everything
on a few hypervisors in house and -
17:11 - 17:19our main storage is currently still
on a very high density, very slow, -
17:19 - 17:28very bulky storage array, but we've
started to migrate all this thing -
17:28 - 17:33into a Ceph storage cluster which
we're gonna grow as we need -
17:33 - 17:35in the next few months.
-
17:36 - 17:44We've also been granted by Microsoft
sponsorship, ??? sponsorship -
17:44 - 17:46for their cloud services.
-
17:46 - 17:52We've started putting mirrors of everything
in their infrastructure as well -
17:52 - 18:00which means full object storage mirror,
so 170TB of stuff mirrored on Azure -
18:00 - 18:02as well as a database mirror for graph.
-
18:04 - 18:09And we're also doing all the content
indexing and all the things that need -
18:09 - 18:12scalability on Azure now.
-
18:17 - 18:22Finally, at the University of Bologna,
we have a backend storage for the download -
18:22 - 18:29so currently our main storage is
quite slow so if you want to download -
18:29 - 18:35a bundle of things that we've archived,
then we actually keep a cache of -
18:35 - 18:40what we've done so that it doesn't take
a million years to download stuff. -
18:42 - 18:46We do our development in a classic free
and open source software way, -
18:46 - 18:52so we talk on our mailing list, on IRC,
on a forge. -
18:53 - 18:57Everything is in English, everything is
public, there is more information -
18:57 - 19:01on our website if you want to actually
have a look and see what we do. -
19:04 - 19:10So, all that is very interesting but how
do we actually look into it? -
19:12 - 19:16One of the ways that you can browse,
that you can use the archive -
19:16 - 19:19is using a REST API.
-
19:19 - 19:25Basically, this API allows you to do
pointwise browsing of the archive -
19:25 - 19:29so you can go and follow the links
in a graph, -
19:29 - 19:38which is very slow but gives you a pretty
much full access of the data. -
19:38 - 19:45There's an index for the API that you can
look at, but that's not really convenient, -
19:45 - 19:48so we also have a web user interface.
-
19:49 - 19:56It's in preview right now, we're gonna do
a full launch in the month of June. -
19:58 - 20:01If you go to
https://archive.softwareheritage.org/browse/ -
20:02 - 20:10with the given credentials, you can
have a look and see what's going on. -
20:10 - 20:19Basically, we have a web interface that
allows you to look at -
20:19 - 20:26what origins we have downloaded, when
we have downloaded the origins -
20:26 - 20:35with a kind of graph view of how often
we visited the origins -
20:35 - 20:38and a calendar view of when we have
visited the origins. -
20:39 - 20:44And then, inside the visits, you can
actually browse the contents -
20:44 - 20:45that we've archived.
-
20:45 - 20:50So, for instance, this is the Python
repository as of May 2017 -
20:50 - 20:55and you can have the list of files,
then drill down, -
20:55 - 20:58it should be pretty intuitive.
-
20:59 - 21:03If you look at the history of a project,
you can see the differences -
21:03 - 21:05between two revisions of a project.
-
21:07 - 21:12Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after. -
21:14 - 21:16So, yeah, pretty cool stuff.
-
21:17 - 21:22I should be able to do a demo as well,
it should work. -
21:31 - 21:32I'm gonna zoom in.
-
21:45 - 21:49So this is the main archive, you can see
some statistics about the objects -
21:49 - 21:51that we've downloaded.
-
21:51 - 21:57When you zoom in, you get some kind of
overflows, because… -
21:57 - 21:59Yeah, why would you do that.
-
21:59 - 22:04If you want to browse, we can try to find
an origin. -
22:07 - 22:09"glibc".
-
22:13 - 22:17So there's lots and lots of, like, random
GitHub forks of things… -
22:19 - 22:26We don't discriminate and we don't really
filter what we download. -
22:27 - 22:34We are looking into doing some relevance
kind of sorting of the results, here. -
22:36 - 22:38Next.
-
22:40 - 22:42Xilinx, why not.
-
22:43 - 22:49So, this has been downloaded for the last
time on August 3rd 2016, -
22:49 - 22:50so it's probably a dead repository,
-
22:53 - 22:55but yeah, you can see a bunch of source
code, -
22:57 - 23:01you can read the README of the glibc.
-
23:04 - 23:08If we go back to a more interesting origin
-
23:08 - 23:10here's the repository for git.
-
23:11 - 23:17I've selected voluntarily an old visit
of the repo so that we can see -
23:17 - 23:19what was going on then.
-
23:23 - 23:31If I look at the calendar view, you can see
that we've had some issues actually -
23:31 - 23:33updating this, but anyway.
-
23:38 - 23:46If I look at the last visit, then we can
actually browse the contents, -
23:47 - 23:49you can get syntax highlighting as well.
-
23:50 - 23:54This is a big big file with lots of comments
-
24:02 - 24:05Let's see the actual source code…
-
24:07 - 24:10Anyway, so, that's the browsing interface.
-
24:10 - 24:15We can also now get back what we've
archived and download it, -
24:15 - 24:19which is kind of something that you might
want to do -
24:19 - 24:24if a repository is lost, you can actually
download it -
24:24 - 24:26and get the source code back again.
-
24:27 - 24:28How we do that.
-
24:29 - 24:35If you go on the top right of this browsing
interface, you have actions and download -
24:35 - 24:40and you can download the directory that
you are currently looking at. -
24:41 - 24:46It's an asynchronous process, which means
that if there is a lot of load, -
then it's gonna take some time to get
actually, to be able to download the content -
24:52 - 24:56So you can put in your email address so we
can notify you when the download is ready. -
24:57 - 25:03I'm gonna try my luck and say just "ok"
and it's gonna appear at some point -
25:03 - 25:08in the list of things that I've requested.
-
25:11 - 25:20I've already requested some things that
we can actually get and open as a tarball. -
25:31 - 25:35Yeah, I think that's the thing that I was
actually looking at, -
25:35 - 25:38which is this revision of the git
source code -
25:40 - 25:42and then I can open it
-
25:44 - 25:47Yay, emacs, that's what you want.
-
25:47 - 25:48Yay, source code.
-
25:51 - 25:54This seems to work.
-
25:58 - 26:03And then, of course, if you want to
actually script what you're doing, -
26:03 - 26:07there's an API that allows you to do
the downloads as well, so you can. -
26:11 - 26:18The source code is deduplicated a lot,
which means that for one single repository -
26:18 - 26:24you get tons of files that we have to
collect if you want to actually download -
26:24 - 26:26an archive of a directory.
-
26:30 - 26:38It takes a while but we have an asynchronous
API so you can POST -
26:38 - 26:44the identifier of a revision to this URL
and then get status updates -
26:44 - 26:49and at some point, it will tell you that
the… here -
26:50 - 26:53The status will tell you that the object
is available. -
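The request-then-poll pattern can be sketched without committing to any endpoint. Both callables below are hypothetical placeholders for the HTTP calls (the POST and the status check), and the status strings `"done"` and `"failed"` are assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_done(request: Callable[[], None],
                    get_status: Callable[[], str],
                    poll_seconds: float = 5.0,
                    max_polls: int = 120) -> bool:
    """Submit a request, then poll its status until it is ready.

    Returns True once the status is "done", False on "failed" or
    when max_polls checks have passed without a final status.
    """
    request()
    for _ in range(max_polls):
        status = get_status()
        if status == "done":
            return True
        if status == "failed":
            return False
        time.sleep(poll_seconds)
    return False


# Simulated server that needs two checks before the bundle is ready.
statuses = iter(["new", "pending", "done"])
ready = wait_until_done(lambda: None, lambda: next(statuses), poll_seconds=0)
```

Bounding the number of polls keeps a script from waiting forever when the archive is under heavy load.
-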
26:53 - 26:59You can download it and you can even
download the full history of a project -
26:59 - 27:04and get that as a git-fast-export archive
that you can reimport into -
27:04 - 27:06a new git repository.
-
27:06 - 27:13So any kind of VCS that we've imported,
you can export as a git repository -
27:13 - 27:18and reimport on your machine.
-
27:19 - 27:23How to get involved in the project?
-
27:24 - 27:29We have a lot of features that we're
interested in, lots of them are now -
27:29 - 27:31in early access or have been done.
-
27:32 - 27:36There's some stuff that we would like
help with. -
27:38 - 27:40This is some stuff that we're working on:
-
27:41 - 27:43provenance information, you have a content
-
27:43 - 27:45you want to know which repository
it comes from, -
27:46 - 27:48that's something we're working on.
-
27:48 - 27:55Full text search, the end goal is to be
able even to trace -
27:55 - 28:01source of snippets of code that have
been copied from one project to another. -
28:01 - 28:06That's something that we can look into
with the wealth of information that -
28:06 - 28:08we have inside the archive.
-
28:09 - 28:11There's a lot of things that,
-
28:11 - 28:12I mean…
-
28:12 - 28:15There's a lot of things that people want
to do with the archive. -
28:15 - 28:20Our goal is to enable people to do things,
to do interesting things -
28:20 - 28:22with a lot of source code.
-
28:24 - 28:27If you have an idea of what you want to do
with such an archive, -
28:27 - 28:30please you can come talk to us
-
28:30 - 28:35and we'll be happy to help you help us.
-
28:38 - 28:44What we want to do is to diversify
the sources of things that we archive. -
28:44 - 28:51Currently, we have good support for git,
we have OK support for Subversion -
28:51 - 28:53and Mercurial.
-
28:54 - 28:59If your project of choice is in another
version control system, -
28:59 - 29:01we are gonna miss it.
-
29:02 - 29:06So people can contribute in this area.
-
29:10 - 29:18For the listing part, we have coverage of
Debian, we have coverage of GitHub, -
29:18 - 29:26if your code is somewhere else, we won't
see it, so we need people to contribute -
29:26 - 29:30stuff that can list for instance GitLab
instances, -
29:32 - 29:36and then we can integrate that in our
infrastructure and actually have -
29:37 - 29:41people be able to archive their GitLab
instances. -
29:42 - 29:49And of course, we need to spread
the word, make the project sustainable. -
29:49 - 30:01We have a few sponsors now, Microsoft,
Nokia, Huawei, GitHub has joined as a sponsor -
30:02 - 30:06The University of Bologna, of course Inria
is sponsoring. -
30:07 - 30:12But we need to keep spreading the word
and keep the project sustainable. -
30:13 - 30:18And, of course, we need to save endangered
source code. -
30:18 - 30:23For that, we have a suggestion box on
the wiki that you can add things to. -
30:24 - 30:30For instance, we have in the back of
our minds archiving SourceForge, -
30:30 - 30:36because we know that this isn't very
sustainable and that's at risk of being -
30:36 - 30:39taken down at some point.
-
30:42 - 30:48If you want to join us, we also have
some job openings that are available. -
30:49 - 30:56For now it's in Paris, so if you want to
consider coming to work with us in Paris, -
30:56 - 30:58you can look into that.
-
31:01 - 31:03That's Software Heritage.
-
31:03 - 31:05We are building a reference archive of
all the free software -
31:05 - 31:07that's ever been written
-
31:07 - 31:11in an international, open, non-profit and
mutualised infrastructure -
31:12 - 31:18that we have opened up to everyone,
all users, vendors, developers can use it. -
31:20 - 31:26The idea is to be at the service of
the community and for society -
31:26 - 31:28as a whole.
-
31:28 - 31:33So if you want to join us, you can look at
our website, you can look at our code. -
31:35 - 31:38You can also talk to me, so if you have
any questions, -
31:38 - 31:42I think we have 10, 12 minutes for questions.
-
31:46 - 31:52[Applause]
-
31:52 - 31:53Do you have questions?
-
31:57 - 32:01[Q] How do you protect the archive
against stuff that you don't want to -
32:01 - 32:02have in the archive.
-
32:02 - 32:07I think of stuff that is copyright-
protected and that GitHub will also -
32:07 - 32:09delete after a while.
-
32:10 - 32:16Worse, if I would misuse the archive
as my private backup -
32:16 - 32:20and store encrypted blocks on GitHub
and you will eventually backup them -
32:20 - 32:21for me.
-
32:25 - 32:27[A] There's, I think, two sides of the
question. -
32:27 - 32:29The first side is
-
32:29 - 32:34Do we really archive only stuff that is
free software and -
32:34 - 32:41that we can redistribute and how do we
manage, for instance, -
32:41 - 32:43copyright takedown stuff.
-
32:46 - 32:52Currently, most of the infrastructure
of the project is under French law. -
32:53 - 33:00There's a defined process to do
copyright takedown in the French legal system. -
33:02 - 33:09We would be really annoyed to have to
take down content from the archive -
33:12 - 33:20What we do, however, is to mirror public
information that is publicly available. -
33:21 - 33:27Of course I'm not a lawyer for the project,
so I can't really… -
33:30 - 33:33I'm not 100% sure of what I'm about to say
but -
33:33 - 33:39what I know is that in the current French
legislation status, -
33:40 - 33:43if the source of the data is still available
-
33:43 - 33:47so for instance if the data is still on
GitHub, then you need to have -
33:47 - 33:50GitHub take it down before we have to
take it down. -
33:57 - 34:02We're not currently filtering content for
misuse of the archive, -
34:02 - 34:06so the only thing that we do is put
a limit on the size of the files -
34:06 - 34:08that are archived in Software Heritage.
-
34:10 - 34:12The limit is pretty high, like 100MB.
-
34:15 - 34:21We can't really decide ourselves
-
34:21 - 34:24what is source code,
what is not source code -
34:24 - 34:31because for instance if your project is
a cryptography library, -
34:31 - 34:34you might want to have some encrypted
blocks of data that are stored -
34:34 - 34:38in your source code repository as
test fixtures. -
34:39 - 34:44And then, you need them to build the code
and to make sure that it works. -
34:45 - 34:49So, how would that be any different than
your encrypted backup on GitHub? -
34:49 - 34:56How could we, Software Heritage,
distinguish between proper use and misuse -
34:56 - 34:59of the resources.
-
35:00 - 35:05I guess our long term goal is to not have
to care about misuse because -
35:05 - 35:07it's gonna be a drop in the ocean.
-
35:09 - 35:11We're gonna have so much…
-
35:12 - 35:15We want to have enough space and
enough resources -
35:15 - 35:20that we don't really need to ask ourselves
this question, basically. -
35:21 - 35:22Thanks.
-
35:26 - 35:28Other questions?
-
35:34 - 35:39[Q] Have you looked at some form of
authentication to provide additional -
35:39 - 35:46assurance that the archived source code
hasn't been modified or tampered with -
35:46 - 35:48in some form?
-
35:51 - 35:56[A] First of all, all the identifiers for
the objects that are inside the archive -
35:56 - 36:01are cryptographic hashes of the contents
that we've archived. -
36:02 - 36:07So, for files, for instance, we take
the SHA1, the SHA256, -
36:07 - 36:16one of the BLAKE hashes and the git
modified SHA1 of the file, -
36:17 - 36:20and we use that in the manifest for
the directories. -
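The set of per-file hashes listed here can be computed with Python's standard `hashlib`. A sketch under two assumptions: `content_hashes` is a hypothetical helper name, and BLAKE2s stands in for "one of the BLAKE hashes" since the talk does not say which variant is used. The git-modified SHA1 is the SHA1 of the content prefixed with git's `blob <length>` header.

```python
import hashlib
from typing import Dict


def content_hashes(data: bytes) -> Dict[str, str]:
    """Compute several independent checksums of one file's bytes."""
    git_header = b"blob %d\x00" % len(data)  # what git prepends to a file
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s": hashlib.blake2s(data).hexdigest(),  # assumed BLAKE variant
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


hashes = content_hashes(b"")  # the empty file
```

Keeping several unrelated hash functions side by side means a collision in one of them does not make two different files indistinguishable.
-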
36:20 - 36:26So the directories, the directory identifiers
are a hash of the manifest -
36:26 - 36:30of the list of files that are inside
the directory, etc. -
36:31 - 36:39So, recursively, you can make sure that
the data that we give back to you -
36:39 - 36:48has not been, at least altered, by bitflip
or anything. -
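The recursive check can be illustrated with a toy hash tree. This sketch deliberately uses a simplified text manifest rather than git's binary tree format, and `object_id` and the sample `project` are hypothetical; the point is only that changing any file changes every identifier above it.

```python
import hashlib
from typing import Dict, Union

Tree = Dict[str, Union[bytes, "Tree"]]


def object_id(node: Union[bytes, Tree]) -> str:
    """Hash a file by its bytes and a directory by its manifest of entries."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"file\x00" + node).hexdigest()
    manifest = "".join(
        f"{name}\x00{object_id(child)}\n"
        for name, child in sorted(node.items())
    )
    return hashlib.sha1(b"dir\x00" + manifest.encode("utf-8")).hexdigest()


project: Tree = {
    "README": b"hello\n",
    "src": {"main.c": b"int main(void){return 0;}\n"},
}
original = object_id(project)

# Alter a single byte deep inside the tree, then recompute the root.
project["src"]["main.c"] = b"int main(void){return 1;}\n"
tampered = object_id(project)
```

Comparing the recomputed root identifier with the expected one is enough to detect an alteration anywhere below it.
-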
36:49 - 36:53We regularly run a scrub of the data
that we have in the archive, -
36:53 - 36:57so we make sure that there's no rot
inside our archive. -
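A scrub of this kind amounts to re-hashing what is stored and comparing against the identifier. A minimal sketch, assuming objects are keyed by the SHA1 of their content and held in a plain dictionary; `scrub` and the sample data are hypothetical.

```python
import hashlib
from typing import Dict, List


def scrub(objects: Dict[str, bytes]) -> List[str]:
    """Return the identifiers whose stored bytes no longer match their hash."""
    return [
        obj_id
        for obj_id, data in objects.items()
        if hashlib.sha1(data).hexdigest() != obj_id
    ]


good = b"stable content\n"
archive = {hashlib.sha1(good).hexdigest(): good}

rotten_id = hashlib.sha1(b"original bytes\n").hexdigest()
archive[rotten_id] = b"original bytez\n"  # simulated bit rot

corrupted = scrub(archive)
```

Anything the scrub reports can then be restored from a mirror, since the identifier says exactly which bytes are expected.
-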
36:59 - 37:05We've not looked into, basically,
attestation of… -
37:09 - 37:14for instance, making sure that the code
that we've downloaded… -
37:21 - 37:26I mean, we're not doing anything more
than taking a picture of the data -
37:26 - 37:34and we say "We've computed this hash.
Maybe the code that's been presented -
37:34 - 37:39by GitHub to Software Heritage is different
than what you've uploaded to GitHub, -
37:39 - 37:40we can't tell."
-
37:44 - 37:49In the case of git, you can always use
the identifiers of the objects -
37:49 - 37:52that you've pushed so you have
the commit hash, -
37:52 - 37:57which is itself a cryptographic identifier
of the contents of the commit. -
37:59 - 38:02In turn, if the commit is signed, then
the signature is still stored -
38:02 - 38:11in the Software Heritage metadata and
you can reproduce the original git object -
38:11 - 38:15and check the signature, but we've not
done anything specific for Software Heritage -
38:15 - 38:17in this area.
-
38:18 - 38:20Does that answer your question?
-
38:20 - 38:20Cool.
-
38:25 - 38:26Other questions?
-
38:27 - 38:29There's one in front.
-
38:31 - 38:34[Q] It's partially question, partially
comment. -
38:34 - 38:40Your initial idea was to have a telescope,
or something like this for source code. -
38:40 - 38:43For now, for me, it looks a little bit
more like microscope, -
38:43 - 38:47so you can focus on one thing, but that's
not much. -
38:47 - 38:51So have you thought about how to
analyze entire ecosystem -
38:51 - 38:52or something like this.
-
38:52 - 38:57For example, now we have Django 2 which is
Python 3 only so it would be interesting to -
38:57 - 39:01look at all Django modules to see when
they start moving to this Django. -
39:01 - 39:07So we would need to start analyzing
thousands or millions of files, but then -
39:07 - 39:11we would need some SQL like, or some
map reduce jobs -
39:11 - 39:12or something like this for this.
-
39:13 - 39:14[A] Yes
-
39:14 - 39:15So, we've started…
-
39:16 - 39:22The two initiators of the project, Roberto
Di Cosmo and Stefano Zacchiroli -
39:22 - 39:27are both researchers in computer science
so they have a strong background in -
39:27 - 39:35actually mining software repositories and
doing some large scale analysis -
39:35 - 39:36on source code.
-
39:38 - 39:45We've been talking with research groups
whose main goal is to do analysis on -
39:45 - 39:48large scale source code archives.
-
39:50 - 39:58One of the first mirrors outside of our
control of the archive -
39:58 - 39:59will be in Grenoble (France).
-
39:59 - 40:06There's a few teams that work on
actually doing large scale research -
40:06 - 40:09on source code over there,
-
40:09 - 40:11so that's what the mirror will be
used for. -
40:13 - 40:17We've also been looking at what
the Google open source team does. -
40:18 - 40:23They have this big repository with all
the code that Google uses -
40:23 - 40:29and they've started to push back,
like do large scale analysis of -
40:29 - 40:38security vulnerabilities, issues with
static and dynamic analysis -
40:38 - 40:42of the code and they've started pushing
their fixes upstream. -
40:43 - 40:47That's something that we want to enable
users to do, -
40:47 - 40:51that's not something that we want to do
ourselves, but we want to make sure -
40:51 - 40:53that people can do it using our archive.
-
40:55 - 40:59So we'd be happy to work with people
who already do that so that -
40:59 - 41:05they can use their knowledge and their
tools inside our archive. -
41:07 - 41:09Does that answer your question?
-
41:10 - 41:11Cool.
-
41:15 - 41:17Any more questions?
-
41:19 - 41:22No? Then thank you very much Nicolas.
-
41:22 - 41:23Thank you.
-
41:23 - 41:26[Applause]
- Title:
- Software Heritage - Preserving the Free Software Commons
- Description:
-
Talk given by Nicolas Dandrimont at Minidebconf Hamburg 2018
https://meetings-archive.debian.net/pub/debian-meetings/2018/miniconf-hamburg/2018-05-20/software_heritage.webm - Video Language:
- English
- Team:
- Debconf
- Project:
- 2018_mini-debconf-hamburg
- Duration:
- 41:31
tvincent edited English subtitles for Software Heritage - Preserving the Free Software Commons