-
Title:
Software Heritage - Preserving the Free Software Commons
-
Description:
-
Hi, thank you.
-
I'm Nicolas Dandrimont and I will indeed
be talking to you about
-
Software Heritage.
-
I'm a software engineer for this project.
-
I've been working on it for 3 years now.
-
And we'll see what this thing is all about.
-
[Mic not working]
-
I guess the batteries are out.
-
So, let's try that again.
-
So, we all know, we've been doing
free software for a while,
-
that software source code is something
special.
-
Why is that?
-
As Harold Abelson has said in SICP, his
textbook on programming,
-
programs are meant to be read by people
and only incidentally for machines to execute.
-
Basically, what software source code
provides us is a way inside
-
the mind of the designer of the program.
-
For instance, you can get inside
very clever algorithms
-
that can do very fast inverse square roots
for 3D, that kind of stuff,
-
like in the Quake III Arena source code.
-
You can also get inside the algorithms
that are underpinning the internet,
-
for instance seeing the network queueing
algorithms in the Linux kernel.
-
What we are building as the free software
community is the free software commons.
-
Basically, the commons is all the cultural
and social and natural resources
-
that we share and that everyone
has access to.
-
More specifically, the software commons
is what we are building
-
with software that is open and that is
available for all to use, to modify,
-
to execute, to distribute.
-
We know that this commons is a really
critical part of our shared heritage.
-
Who's taking care of it?
-
The software is fragile.
-
Like all digital information, you can lose
software.
-
People can decide to shut down hosting
spaces because of business decisions.
-
People can hack into software hosting
platforms and remove the code maliciously
-
or just inadvertently.
-
And, of course, for the obsolete stuff,
there's rot.
-
If you don't care about the data, then
it rots and it decays and you lose it.
-
So, where is the archive we go to
when something is lost,
-
when GitLab goes away, when GitHub
goes away?
-
Where do we go?
-
Finally, there's one last thing that we
noticed:
-
there are a lot of teams that work on
research on software,
-
and there's no real big infrastructure
for research on code.
-
There's tons of critical issues around
code: safety, security, verification, proofs.
-
Nobody's doing this at a very large scale.
-
If you want to see the stars, you go to
the Atacama desert and
-
you point a telescope at the sky.
-
Where is the telescope for source code?
-
That's what Software Heritage wants to be.
-
What we do is we collect, we preserve
and we share all the software
-
that is publicly available.
-
Why do we do that? We do that to
preserve the past, to enhance the present
-
and to prepare for the future.
-
What we're building is a base infrastructure
that can be used
-
for cultural heritage, for industry,
for research and for education purposes.
-
How do we do it? We do it with an open
approach.
-
Every single line of code that we write
is free software.
-
We do it transparently, everything that
we do, we do it in the open,
-
be that on a mailing list or on
our issue tracker.
-
And we strive to do it for the very long
haul, so we do it with replication in mind
-
so that no single entity has full control
over the data that we collect.
-
And we do it in a non-profit fashion
so that we avoid
-
business-driven decisions impacting
the project.
-
So, what do we do concretely?
-
We do archiving of version control systems.
-
What does that mean?
-
It means we archive file contents, so
source code, files.
-
We archive revisions, which means all the
metadata of the history of the projects,
-
we try to download it and we put it inside
a common data model that is
-
shared across all the archive.
-
We archive releases of the software,
releases that have been tagged
-
in a version control system as well as
releases that we can find as tarballs
-
because sometimes the views of
the same source code differ.
-
Of course, we archive where and when
we've seen the data that we've collected.
-
All of this, we put inside a canonical,
VCS-agnostic, data model.
-
If you have a Debian package, with its
history, if you have a git repository,
-
if you have a subversion repository, if
you have a mercurial repository,
-
it all looks the same and you can work
on it with the same tools.
-
What we don't do is archive what's around
the software, for instance
-
the bug tracking systems or the homepages
or the wikis or the mailing lists.
-
There are some projects that work
in this space, for instance
-
the internet archive does a lot of
very good work around archiving the web.
-
Our goal is not to replace them, but to
work with them and be able to do
-
linking across all the archives that exist.
-
We can, for instance for the mailing lists
there's the gmane project
-
that does a lot of archiving of free
software mailing lists.
-
So our long term vision is to play a part
in a semantic wikipedia of software,
-
a wikidata of software where we can
hyperlink all the archives that exist
-
and do stuff in the area.
-
Quick tour of our infrastructure.
-
Basically, all the way to the right is
our archive.
-
Our archive consists of a huge graph
of all the metadata about
-
the files, the directories, the revisions,
the commits and the releases and
-
all the projects that are on top
of the graph.
-
We keep the file contents in a separate
object storage because of
-
the size discrepancy: we have lots and lots
of file contents that we need to store
-
so we do that outside of the database
that is used to store the graph.
-
Basically, what we archive is a set of
software origins that are
-
git repositories, mercurial repositories,
etc. etc.
-
All those origins are loaded on a
regular schedule.
-
If there is a very active software origin,
we're gonna archive it more often
-
than stale things that don't get
a lot of updates.
-
How do we get the list of software
origins that we archive?
-
We have a bunch of listers that can
scroll through the list of repositories,
-
for instance on GitHub or other
hosting platforms.
-
We have code that can read Debian archive
metadata to make a list of the packages
-
that are inside this archive and can be
archived, etc.
-
All of this is done on a regular basis.
-
We are currently working on some kind
of push mechanism so that
-
people or other systems can notify us
of updates.
-
Our goal is not to do real time archiving,
we're really in it for the long run
-
but we still want to be able to prioritize
stuff that people tell us is
-
important to archive.
-
The internet archive has a "save now"
button and we want to implement
-
something along those lines as well,
-
so if we know that some software project
is in danger for one reason or another,
-
then we can prioritize archiving it.
-
So this is the basic structure of a revision
in the Software Heritage archive.
-
You'll see that it's very similar to
a git commit.
-
The format of the metadata is pretty much
what you'll find in a git commit
-
with some extensions that you don't
see here, because this is from a git commit.
-
So basically what we do is we take the
identifier of the directory
-
that the revision points to, we take the
identifier of the parent of the revision
-
so we can keep track of the history
-
and then we add some metadata,
authorship and commitership information
-
and the revision message and then we take
a hash of this,
-
it makes an identifier that's probably
unique, very very probably unique.
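-
To make this concrete, here is a minimal sketch, in Python, of computing such an intrinsic identifier the way git does: a SHA1 over a typed header plus the serialized body. The tree identifier, parent identifier and author below are purely illustrative, not real archive data.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Intrinsic identifier: SHA1 over a typed header plus the content,
    exactly the scheme git uses for its objects."""
    data = b"%s %d\x00%s" % (obj_type.encode(), len(body), body)
    return hashlib.sha1(data).hexdigest()

# A commit body references the directory (tree) id, the parent revision id,
# then the authorship/committership metadata and the message.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"   # id of an (empty) tree
    b"author Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"committer Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"\n"
    b"Example commit message\n"
)
commit_id = git_object_id("commit", commit_body)
```

Because the identifier is computed from the content alone, two loaders seeing the same revision anywhere in the world compute the same id, which is what makes the deduplication work.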
-
Using those identifiers, we can retrace
all the origins, all the history of
-
development of the project and we can
deduplicate across all the archive.
-
All the identifiers are intrinsic, which
means that we compute them
-
from the contents of the things that
we are archiving, which means that
-
we can deduplicate very efficiently
across all the data that we archive.
-
How much data do we archive?
-
A bit.
-
So, we passed the billion-revision
mark a few weeks ago.
-
This graph is a bit old, but anyway,
you have a live graph on our website.
-
That's more than 4.5 billion unique
source code files.
-
We don't actually discriminate between
what we would consider source code
-
and what upstream developers consider
as source code,
-
so everything that's in a git repository,
we consider as source code
-
if it's below a size threshold.
-
A billion revisions across 80 million
projects.
-
What do we archive?
-
We archive GitHub, we archive Debian.
-
For Debian, we run the archival process
every day: every day we get the new packages
-
that have been uploaded to the archive.
-
For GitHub, we try to keep up; we are currently
working on some performance improvements,
-
some scalability improvements, to make sure
that we can keep up
-
with the development on GitHub.
-
We have archived as a one-off thing the
former contents of Gitorious and Google Code
-
which are two prominent code hosting
spaces that closed recently
-
and we've been working on archiving
the contents of Bitbucket
-
which is kind of a challenge because
the API is a bit buggy and
-
Atlassian isn't too interested
in fixing it.
-
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB,
-
and a pretty big database, 6TB.
-
The database only contains the graph of
the metadata for the archive,
-
which is basically a graph of 8 billion
nodes and 70 billion edges.
-
And of course it's growing daily.
-
We are pretty sure this is the richest public
source code archive that's available now
-
and it keeps growing.
-
So how do we actually…
-
What kind of stack do we use to store
all this?
-
We use Debian, of course.
-
All our deployment recipes are in Puppet
in public repositories.
-
We've started using Ceph
for the blob storage.
-
We use PostgreSQL for the metadata storage
with some of the standard tools that
-
live around PostgreSQL for backups
and replication.
-
We use a standard Python stack for
scheduling of jobs
-
and for web interface stuff, basically
psycopg2 for the low level stuff,
-
Django for the web stuff
-
and Celery for the scheduling of jobs.
-
In house, we've written an ad hoc
object storage system which has
-
a bunch of backends that you can use.
-
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, and so on.
-
It's a really simple object storage system
where you can just put an object,
-
get an object, put a bunch of objects,
get a bunch of objects.
-
We've implemented removal but we don't
really use it yet.
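-
As a rough illustration of that put/get interface, here is a minimal content-addressed store over a plain filesystem. The class name and the sharding scheme are my own sketch, not the project's actual code.

```python
import hashlib
from pathlib import Path

class FsObjStorage:
    """Minimal content-addressed object store sketch: objects are keyed by
    the SHA1 of their content and sharded into subdirectories by key prefix."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        # Shard on the first two hex characters to keep directories small.
        return self.root / key[:2] / key

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return key

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

Because keys are content hashes, putting the same blob twice is naturally idempotent, which is the property the deduplication relies on.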
-
All the data model implementation,
all the listers, the loaders, the schedulers
-
everything has been written by us,
it's a pile of Python code.
-
So, basically 20 Python packages and
around 30 Puppet modules
-
to deploy all that, and we've released
everything under copyleft licenses:
-
GPLv3 for the backend and AGPLv3
for the frontend.
-
Even if people try and make their own
Software Heritage using our code,
-
they have to publish their changes.
-
Hardware-wise, we run for now everything
on a few hypervisors in house and
-
our main storage is currently still
on a very high density, very slow,
-
very bulky storage array, but we've
started to migrate all this thing
-
into a Ceph storage cluster which
we're gonna grow as we need
-
in the next few months.
-
We've also been granted sponsorship
by Microsoft
-
for their cloud services.
-
We've started putting mirrors of everything
in their infrastructure as well
-
which means a full object storage mirror,
so 170TB of stuff mirrored on Azure,
-
as well as a database mirror for the graph.
-
And we're also doing all the content
indexing and all the things that need
-
scalability on azure now.
-
Finally, at the University of Bologna,
we have a backend storage for downloads:
-
currently our main storage is
quite slow, so if you want to download
-
a bundle of things that we've archived,
we keep a cache of
-
what we've prepared so that it doesn't take
a million years to download stuff.
-
We do our development in a classic free
and open source software way,
-
so we talk on our mailing list, on IRC,
on a forge.
-
Everything is in English, everything is
public, there is more information
-
on our website if you want to actually
have a look and see what we do.
-
So, all that is very interesting but how
do we actually look into it?
-
One of the ways that you can browse,
that you can use the archive
-
is using a REST API.
-
Basically, this API allows you to do
pointwise browsing of the archive
-
so you can go and follow the links
in a graph,
-
which is very slow but gives you a pretty
much full access of the data.
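-
A sketch of that pointwise browsing in Python. The `/api/1/revision/` endpoint shape follows the public API documentation as I understand it and may differ; the revision hash below is illustrative.

```python
import json
from urllib.request import urlopen

API = "https://archive.softwareheritage.org/api/1"

def revision_url(sha1_git: str) -> str:
    # Pointwise lookup: one API call per node of the graph.
    return f"{API}/revision/{sha1_git}/"

def fetch(url: str) -> dict:
    """One hop in the graph: fetch a node and return its JSON description."""
    with urlopen(url) as resp:
        return json.load(resp)

# Starting from a revision, you follow the "directory" and "parents" links
# one request at a time -- slow, but it can walk the whole graph.
url = revision_url("aafb16d69fd30ff58afdd69036a26047f3aebdc6")
# rev = fetch(url)   # network call; then follow rev["directory"], rev["parents"]
```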
-
There's an index for the API that you can
look at, but that's not really convenient,
-
so we also have a web user interface.
-
It's in preview right now, we're gonna do
a full launch in the month of June.
-
If you go to
https://archive.softwareheritage.org/browse/
-
with the given credentials, you can
have a look and see what's going on.
-
Basically, we have a web interface that
allows you to look at
-
what origins we have downloaded, when
we have downloaded the origins
-
with a kind of graph view of how often
we visited the origins
-
and a calendar view of when we have
visited the origins.
-
And then, inside the visits, you can
actually browse the contents
-
that we've archived.
-
So, for instance, this is the Python
repository as of May 2017
-
and you can have the list of files,
then drill down,
-
it should be pretty intuitive.
-
If you look at the history of a project,
you can see the differences
-
between two revisions of a project.
-
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.
-
So, yeah, pretty cool stuff.
-
I should be able to do a demo as well,
it should work.
-
I'm gonna zoom in.
-
So this is the main archive, you can see
some statistics about the objects
-
that we've downloaded.
-
When you zoom in, you get some kind of
overflows, because…
-
Yeah, why would you do that.
-
If you want to browse, we can try to find
an origin.
-
"glibc".
-
So there's lots and lots of, like, random
Github forks of things…
-
We don't discriminate and we don't really
filter what we download.
-
We are looking into doing some relevance
kind of sorting of the results, here.
-
Next.
-
Xilinx, why not.
-
So, this was downloaded for the last
time on August 3rd, 2016,
-
so it's probably a dead repository,
-
but yeah, you can see a bunch of source
code,
-
you can read the README of the glibc.
-
If we go back to a more interesting origin
-
here's the repository for git.
-
I've selected voluntarily an old visit
of the repo so that we can see
-
what was going on then.
-
If I look at the calendar view, you can see
that we've had some issues actually
-
updating this, but anyway.
-
If I look at the last visit, then we can
actually browse the contents,
-
you can get syntax highlighting as well.
-
This is a big big file with lots of comments
-
Let's see the actual source code…
-
Anyway, so, that's the browsing interface.
-
We can also now get back what we've
archived and download it,
-
which is kind of something that you might
want to do
-
if a repository is lost, you can actually
download it
-
and get the source code back again.
-
How we do that.
-
If you go on the top right of this browsing
interface, you have actions and download
-
and you can download the directory that
you are currently looking at.
-
It's an asynchronous process, which means
that if there is a lot of load,
-
it can take some time before the content
is actually ready to download.
-
So you can put in your email address so we
can notify you when the download is ready.
-
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point
-
in the list of things that I've requested.
-
I've already requested some things that
we can actually get and open as a tarball.
-
Yeah, I think that's the thing that I was
actually looking at,
-
which is this revision of the git
source code
-
and then I can open it
-
Yay, emacs, that's what you want.
-
Yay, source code.
-
This seems to work.
-
And then, of course, if you want to
actually script what you're doing,
-
there's an API that allows you to do
the downloads as well, so you can.
-
The source code is deduplicated a lot,
which means that for one single repository
-
you get tons of files that we have to
collect if you want to actually download
-
an archive of a directory.
-
It takes a while but we have an asynchronous
API so you can POST
-
the identifier of a revision to this URL
and then get status updates
-
and at some point, it will tell you that
the… here:
-
the status will tell you that the object
is available.
-
You can download it and you can even
download the full history of a project
-
and get that as a git-fast-export archive
that you can reimport into
-
a new git repository.
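-
The client side of that asynchronous API can be sketched generically: POST to queue the job, then poll its status until it is done. `check_status` here stands in for an HTTP GET against the status URL; it is a hypothetical helper, not the project's actual client code.

```python
import time

def wait_until_done(check_status, poll_every: float = 1.0, max_tries: int = 60) -> bool:
    """Poll an asynchronous job until it reports 'done'.

    `check_status` is any callable returning a status string, e.g. a
    function doing an HTTP GET on the archive's status endpoint after
    the initial POST queued the export.
    """
    for _ in range(max_tries):
        status = check_status()
        if status == "done":
            return True          # the bundle is ready to download
        if status == "failed":
            raise RuntimeError("export failed")
        time.sleep(poll_every)   # back off between status checks
    return False                 # gave up; the job is still pending
```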
-
So any kind of VCS that we've imported,
you can export as a git repository
-
and reimport on your machine.
-
How to get involved in the project?
-
We have a lot of features that we're
interested in, lots of them are now
-
in early access or have been done.
-
There's some stuff that we would like
help with.
-
This is some stuff that we're working on:
-
provenance information, you have a content
-
you want to know which repository
it comes from,
-
that's something we're working on.
-
Full text search: the end goal is to be
able even to trace
-
snippets of code that have
been copied from one project to another.
-
That's something that we can look into
with the wealth of information that
-
we have inside the archive.
-
There's a lot of things that,
-
I mean…
-
There's a lot of things that people want
to do with the archive.
-
Our goal is to enable people to do things,
to do interesting things
-
with a lot of source code.
-
If you have an idea of what you want to do
with such an archive,
-
please come talk to us
-
and we'll be happy to help you help us.
-
What we want to do is to diversify
the sources of things that we archive.
-
Currently, we have good support for git,
we have OK support for subversion
-
and mercurial.
-
If your project of choice is in another
version control system,
-
we are gonna miss it.
-
So people can contribute in this area.
-
For the listing part, we have coverage of
Debian, we have coverage of GitHub;
-
if your code is somewhere else, we won't
see it, so we need people to contribute
-
stuff that can list, for instance, GitLab
instances,
-
and then we can integrate that into our
infrastructure and actually have
-
people be able to archive their GitLab
instances.
-
And of course, we need to spread
the word, make the project sustainable.
-
We have a few sponsors now: Microsoft,
Nokia, Huawei; GitHub has joined as a sponsor;
-
the University of Bologna; and of course Inria
is sponsoring.
-
But we need to keep spreading the word
and keep the project sustainable.
-
And, of course, we need to save endangered
source code.
-
For that, we have a suggestion box on
the wiki that you can add things to.
-
For instance, we have in the back of
our minds archiving SourceForge,
-
because we know that it isn't very
sustainable and is at risk of being
-
taken down at some point.
-
If you want to join us, we also have
some job openings that are available.
-
For now it's in Paris, so if you want to
consider coming to work with us in Paris,
-
you can look into that.
-
That's Software Heritage.
-
We are building a reference archive of
all the free software
-
that has ever been written,
-
in an international, open, non-profit and
mutualised infrastructure
-
that we have opened up to everyone:
users, vendors, developers can all use it.
-
The idea is to be at the service of
the community and for society
-
as a whole.
-
So if you want to join us, you can look at
our website, you can look at our code.
-
You can also talk to me, so if you have
any questions,
-
I think we have 10, 12 minutes for questions.
-
[Applause]
-
Do you have questions?
-
[Q] How do you protect the archive
against stuff that you don't want to
-
have in the archive?
-
I'm thinking of stuff that is copyright-
protected and that GitHub will also
-
delete after a while.
-
Worse, I could misuse the archive
as my private backup,
-
storing encrypted blobs on GitHub
which you would eventually back up
-
for me.
-
[A] There's, I think, two sides of the
question.
-
The first side is
-
Do we really archive only stuff that is
free software and
-
that we can redistribute and how do we
manage, for instance,
-
copyright takedown stuff.
-
Currently, most of the infrastructure
of the project is under French law.
-
There's a defined process to do
copyright takedown in the French legal system.
-
We would be really annoyed to have to
take down content from the archive
-
What we do, however, is to mirror public
information that is publicly available.
-
Of course I'm not a lawyer for the project,
so I can't really…
-
I'm not 100% sure of what I'm about to say
but
-
what I know is that under the current
French legislation,
-
if the source of the data is still available
-
so for instance if the data is still on
GitHub, then you need to have
-
GitHub take it down before we have to
take it down.
-
We're not currently filtering content for
misuse of the archive,
-
so the only thing that we do is put
a limit on the size of the files
-
that are archived in Software Heritage.
-
The limit is pretty high, like 100MB.
-
We can't really decide ourselves
-
what is source code,
what is not source code
-
because for instance if your project is
a cryptography library,
-
you might want to have some encrypted
blocks of data that are stored
-
in your source code repository as
test fixtures.
-
And then, you need them to build the code
and to make sure that it works.
-
So, how would that be any different from
your encrypted backup on GitHub?
-
How could we, Software Heritage,
distinguish between proper use and misuse
-
of the resources.
-
I guess our long term goal is to not have
to care about misuse because
-
it's gonna be a drop in the ocean.
-
We're gonna have so much…
-
We want to have enough space and
enough resources
-
that we don't really need to ask ourselves
this question, basically.
-
Thanks.
-
Other questions?
-
[Q] Have you looked at some form of
authentication to provide additional
-
insurance that the archived source code
hasn't been modified or tampered with
-
in some form?
-
[A] First of all, all the identifiers for
the objects that are inside the archive
-
are cryptographic hashes of the contents
that we've archived.
-
So, for files, for instance, we take
the SHA1, the SHA256,
-
one of the BLAKE hashes and the git
modified SHA1 of the file,
-
and we use that in the manifest for
the directories.
-
So the directories, the directory identifiers
are a hash of the manifest
-
of the list of files that are inside
the directory, etc.
-
So, recursively, you can make sure that
the data that we give back to you
-
has not been altered, at least by a bitflip
or anything like that.
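-
A minimal sketch of that recursive verification: hash each file the way git does, then hash each directory's manifest of (name, child identifier) pairs. The manifest serialization below is simplified for illustration, not the archive's real on-disk format.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """git-style SHA1 of a file: hash of a typed header plus the content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def directory_id(entries) -> str:
    """A directory identifier as the hash of its manifest: the sorted
    (name, child identifier) pairs. Any change to a file changes its id,
    which changes every ancestor directory's id, so the whole tree can be
    re-verified from the root identifier alone."""
    manifest = b"".join(
        name.encode() + b"\x00" + child_id.encode()
        for name, child_id in sorted(entries)
    )
    return hashlib.sha1(manifest).hexdigest()

# Verifying a one-file tree: recompute bottom-up and compare to the root id.
readme_id = git_blob_id(b"hello\n")
root_id = directory_id([("README", readme_id)])
```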
-
We regularly run a scrub of the data
that we have in the archive,
-
so we make sure that there's no rot
inside our archive.
-
We've not looked into, basically,
attestation of…
-
for instance, making sure that the code
that we've downloaded…
-
I mean, we're not doing anything more
than taking a picture of the data
-
and we say "We've computed this hash.
Maybe the code that's been presented
-
by GitHub to Software Heritage is different
from what you've uploaded to GitHub,
-
we can't tell."
-
In the case of git, you can always use
the identifiers of the objects
-
that you've pushed so you have
the commit hash,
-
which is itself a cryptographic identifier
of the contents of the commit.
-
In turn, if the commit is signed, then
the signature is still stored
-
in the Software Heritage metadata and
you can reproduce the original git object
-
and check the signature, but we've not
done anything specific for Software Heritage
-
in this area.
-
Does that answer your question?
-
Cool.
-
Other questions?
-
There's one in front.
-
[Q] It's partly a question, partly
a comment.
-
Your initial idea was to have a telescope,
or something like this for source code.
-
For now, to me, it looks a little bit
more like a microscope:
-
you can focus on one thing, but that's
not much.
-
So have you thought about how to
analyze an entire ecosystem,
-
or something like that?
-
For example, now we have Django 2, which is
Python 3 only, so it would be interesting to
-
look at all Django modules to see when
they start moving to this Django version.
-
So we would need to start analyzing
thousands or millions of files, but then
-
we would need something SQL-like, or some
map-reduce jobs,
-
or something like this.
-
[A] Yes
-
So, we've started…
-
The two initiators of the project, Roberto
Di Cosmo and Stefano Zacchiroli
-
are both researchers in computer science
so they have a strong background in
-
actually mining software repositories and
doing some large scale analysis
-
on source code.
-
We've been talking with research groups
whose main goal is to do analysis on
-
large scale source code archives.
-
One of the first mirrors of the archive
outside of our control
-
will be in Grenoble, France.
-
There's a few teams that work on
actually doing large scale research
-
on source code over there,
-
so that's what the mirror will be
used for.
-
We've also been looking at what
the Google open source team does.
-
They have this big repository with all
the code that Google uses
-
and they've started to give back:
they do large scale analysis of
-
security vulnerabilities, find issues with
static and dynamic analysis
-
of the code, and they push
their fixes upstream.
-
That's something that we want to enable
users to do,
-
that's not something that we want to do
ourselves, but we want to make sure
-
that people can do it using our archive.
-
So we'd be happy to work with people
who already do that so that
-
they can use their knowledge and their
tools inside our archive.
-
Does that answer your question?
-
Cool.
-
Any more questions?
-
No? Then thank you very much Nicolas.
-
Thank you.
-
[Applause]