
Software Heritage - Preserving the Free Software Commons

Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project; I've been working on it for 3 years now. And we'll see what this thing is all about.

[Mic not working]

I guess the batteries are out. So, let's try that again.
So, we all know, having been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people, and only incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program.

For instance, you can get inside very crazy algorithms that can do very fast inverse square roots for 3D, that kind of stuff, like in the Quake 2 source code. You can also get inside the algorithms that are underpinning the internet, for instance the network queueing algorithms in the Linux kernel.
What we are building as the free software community is the free software commons. Basically, the commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that those commons are a really critical part of our commons. So who's taking care of them?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code, maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot: if you don't care about the data, it rots and decays and you lose it.

So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?
Finally, there's one last thing that we noticed: there are a lot of teams that do research on software, and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale.

If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.
What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do it to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. And we strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.
So, what do we do concretely? We archive version control systems. What does that mean? It means we archive file contents, so source code files. We archive revisions, which means all the metadata of the history of the projects: we download it and we put it inside a common data model that is shared across the whole archive. We archive releases of the software: releases that have been tagged in a version control system, as well as releases that we can find as tarballs, because sometimes views of this source code differ. Of course, we archive where and when we've seen the data that we've collected.

All of this, we put inside a canonical, VCS-agnostic data model. If you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.
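The VCS-agnostic data model described above can be sketched as a couple of record types. This is only an illustration of the idea; the field names here are assumptions, not Software Heritage's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Revision:
    """One node in the VCS-agnostic history graph: the same record
    shape can describe a git commit, a Mercurial changeset or a
    Debian package upload."""
    directory: str      # intrinsic id of the source tree
    parents: List[str]  # intrinsic ids of parent revisions
    author: str
    committer: str
    message: str
    date: str           # when the change was made

@dataclass
class Origin:
    """Where a revision was observed."""
    url: str            # e.g. a clone URL or a package source
    vcs_type: str       # "git", "hg", "svn", "deb", ...

# A git commit and a Debian upload end up as the same kind of record:
r = Revision(directory="d1", parents=[], author="a", committer="a",
             message="initial import", date="2018-05-19")
```

Because every loader normalizes into the same shape, tools written against this model work identically regardless of which VCS the data came from.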
What we don't archive is what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are some projects that work in this space; for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and to be able to do linking across all the archives that exist. For the mailing lists, for instance, there's the Gmane project, which does a lot of archiving of free software mailing lists.

So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and do stuff in that area.
Quick tour of our infrastructure. Basically, all the way to the right is our archive. It consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, with all the projects on top of the graph. We separate the file storage out into a separate object storage because of the size discrepancy: we have lots and lots of file contents that we need to store, so we do that outside the database that is used to store the graph.

Basically, what we archive is a set of software origins: git repositories, mercurial repositories, etc. All those origins are loaded on a regular schedule: if there is a very active software origin, we're going to archive it more often than stale things that don't get a lot of updates.
How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates.
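The lister idea can be sketched as a generator that pages through a hosting platform's repository listing and feeds origin URLs to the scheduler. This is a hypothetical sketch: the page-fetching callback and the "clone_url" field are made up for illustration, not GitHub's or Software Heritage's actual API.

```python
def list_origins(fetch_page):
    """Walk a paginated repository listing and yield origin URLs.
    `fetch_page(n)` is assumed to return a (possibly empty) list of
    dicts with a "clone_url" key."""
    page = 0
    while True:
        repos = fetch_page(page)
        if not repos:          # an empty page means we've seen everything
            return
        for repo in repos:
            yield repo["clone_url"]
        page += 1

# Usage with a fake two-page listing:
pages = [[{"clone_url": "https://example.org/a.git"}],
         [{"clone_url": "https://example.org/b.git"}]]
origins = list(list_origins(lambda n: pages[n] if n < len(pages) else []))
# origins now holds both repository URLs, in listing order
```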
Our goal is not to do real time archiving; we're really in it for the long run. But we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button, and we want to implement something along those lines as well, so that if we know that some software project is in danger for one reason or another, we can prioritize archiving it.
So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this one is from a git commit. So basically what we do is take the identifier of the directory that the revision points to, and the identifier of the parent of the revision, so we can keep track of the history; then we add some metadata, authorship and committership information and the revision message, and we take a hash of all this, which makes an identifier that is very, very probably unique.

Using those identifiers, we can retrace all the origins, all the history of development of the project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which means that we can deduplicate very efficiently across all the data that we archive.
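The intrinsic-identifier scheme just described works like git's own object hashing: the identifier is a cryptographic hash of the object's content, so identical contents always map to the same identifier. A minimal sketch of git-style blob hashing shows why deduplication falls out for free:

```python
import hashlib

def git_object_id(obj_type: bytes, payload: bytes) -> str:
    """Git-style intrinsic identifier: SHA-1 over a small header
    (object type and payload length) followed by the payload itself."""
    header = obj_type + b" " + str(len(payload)).encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

# Identical file contents get identical identifiers, no matter which
# repository or hosting platform they were collected from:
a = git_object_id(b"blob", b"hello world\n")
b = git_object_id(b"blob", b"hello world\n")
assert a == b  # this is what makes archive-wide deduplication cheap
```

Because the identifier depends only on the content, the archive never needs a central registry to detect duplicates: computing the hash is the lookup key.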
How much data do we archive? A bit. We passed the billion-revision mark a few weeks ago. This graph is a bit old, but anyway, there's a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code: everything that's in a git repository, we consider source code, if it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian, we run the archival process every day; every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we are currently working on some performance improvements, some scalability improvements, to make sure that we can keep up with the development on GitHub. We have archived, as a one-off thing, the former contents of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically a graph of 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive that's available now, and it keeps growing.
So what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for the scheduling of jobs and for the web interface: basically psycopg2 for the low level stuff, Django for the web stuff and Celery for the scheduling of jobs.

In house, we've written an ad hoc object storage system which has a bunch of backends that you can use: basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal, but we don't really use it yet.
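The put/get interface described above can be sketched as a tiny content-addressed store. This is an in-memory stand-in with illustrative method names, not Software Heritage's actual objstorage API:

```python
import hashlib

class InMemoryObjStorage:
    """Minimal content-addressed object store: objects are keyed by
    the hash of their content, so storing the same bytes twice
    deduplicates to a single entry."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        self._objects[obj_id] = content
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def put_batch(self, contents):
        return [self.put(c) for c in contents]

    def get_batch(self, obj_ids):
        return [self.get(i) for i in obj_ids]
```

Swapping the dict for a filesystem directory, a Ceph pool or an Azure container keeps the same four operations, which is what makes the backends interchangeable.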
All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us; it's a pile of Python code. So basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've done everything under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend. So even if people try to make their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, we run everything for now on a few hypervisors in house, and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all this into a Ceph storage cluster, which we're going to grow as needed in the next few months. We've also been granted sponsorship by Microsoft, ??? sponsorship, for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now. Finally, at the University of Bologna, we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things that we've archived, we keep a cache of what we've done so that it doesn't take a million years to download stuff.

We do our development in a classic free and open source software way, so we talk on our mailing list, on IRC
Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
41:31
