Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project; I've been working on it for 3 years now. And we'll see what this thing is all about. [Mic not working] I guess the batteries are out. So, let's try that again.

So, we all know, having done free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people and only incidentally for machines to execute. Basically, what software source code gives us is a way inside the mind of the designer of the program. For instance, you can get inside very clever algorithms, like the fast inverse square root used for 3D rendering in the Quake III Arena source code. You can also get inside the algorithms that underpin the internet, for instance the network queueing algorithms in the Linux kernel.

What we are building as the free software community is the free software commons. Basically, the commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that these software commons are a really critical resource. But who's taking care of them?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove code, maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot: if you don't care for the data, it rots and decays and you lose it. So where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?

Finally, there's one last thing that we noticed: there are a lot of teams doing research on software, and there's no real large-scale infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.

What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do it to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes. How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. We strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So, what do we do concretely? We archive version control systems. What does that mean? It means we archive file contents, that is, the source code files themselves.
We archive revisions, which means all the metadata of the history of the projects: we try to download it and we put it into a common data model that is shared across the whole archive. We archive releases of the software, releases that have been tagged in a version control system as well as releases that we can find as tarballs, because those two views of the source code sometimes differ. And of course, we archive where and when we've seen the data that we've collected. All of this we put inside a canonical, VCS-agnostic data model: if you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.

What we don't do is archive what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are some projects that work in this space; for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to do linking across all the archives that exist. For the mailing lists, for instance, there's the Gmane project that does a lot of archiving of free software mailing lists. So our long-term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and build things on top of that.

A quick tour of our infrastructure. All the way to the right is our archive. The archive consists of a huge graph of all the metadata about the files, the directories, the revisions (the commits) and the releases, with all the projects sitting on top of the graph. We separate the file contents out into a separate object storage because of the size discrepancy: we have lots and lots of file contents to store, so we do that outside the database that is used to store the graph. Basically, what we archive is a set of software origins: git repositories, mercurial repositories, and so on. All those origins are loaded on a regular schedule: if a software origin is very active, we're going to archive it more often than stale things that don't get a lot of updates.

How do we get the list of software origins to archive? We have a bunch of listers that can scroll through the list of repositories on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside that archive and can be archived, and so on. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real-time archiving, we're really in it for the long run, but we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button and we want to implement something along those lines as well, so that if we know some software project is in danger for one reason or another, we can prioritize archiving it.

So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit.
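To make the canonical data model a bit more concrete, here is a rough, hypothetical sketch of what a VCS-agnostic revision record along these lines could look like. The field names and types are illustrative assumptions, not the actual Software Heritage schema.

```python
# Illustrative sketch only: a VCS-agnostic revision node as described above.
# Field names and types are assumptions, not the real Software Heritage model.
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    name: str
    email: str


@dataclass
class Revision:
    directory: str        # intrinsic id of the source tree this revision points to
    parents: List[str]    # intrinsic ids of the parent revisions (the history)
    author: Person
    author_date: str      # e.g. a timestamp with a timezone offset
    committer: Person
    committer_date: str
    message: str          # the revision/commit message
    origin_type: str      # "git", "hg", "svn", Debian package, tarball, ...
```

Whatever the origin, the loaded history ends up as nodes of this shape, which is what lets the same tools work over a Debian package, a git repository or a subversion repository alike.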
The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this example comes from a git commit. Basically, we take the identifier of the directory that the revision points to, we take the identifiers of the parents of the revision so we can keep track of the history, and then we add some metadata: authorship and committer information and the revision message. Then we take a hash of all this, which gives us an identifier that's probably unique; very, very probably unique (there's a rough sketch of this kind of computation below). Using those identifiers, we can retrace all the origins and all the history of development of a project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means we compute them from the contents of the things we are archiving, and that means we can deduplicate very efficiently across all the data that we archive.

How much data do we archive? A bit. We passed the billion-revision mark a few weeks ago. This graph is a bit old, but anyway, there's a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code, so everything that's in a git repository we treat as source code, as long as it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian, we run the archival process every day: every day we pick up the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we are currently working on some performance and scalability improvements to make sure that we can keep up with the pace of development on GitHub. We have archived, as a one-off thing, the former contents of Gitorious and Google Code, two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175 TB of blobs, so the files take 175 TB, plus a kind of big database, 6 TB. The database only contains the metadata graph for the archive, which is basically a graph with 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive available today, and it keeps growing.

So how do we actually… what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for job scheduling and the web interface: basically psycopg2 for the low-level database access, Django for the web stuff and Celery for the scheduling of jobs. In house, we've written an ad hoc object storage system which has a bunch of backends you can use: basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of… It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects (a small sketch of such an interface follows below). We've implemented removal, but we don't really use it yet. All the data model implementation, all the listers, the loaders, the schedulers: everything has been written by us, and it's a pile of Python code.
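Going back to the intrinsic identifiers mentioned above, here is a minimal sketch of that kind of computation. It follows the git commit object format (hash a manifest built only from the revision's own content); the real Software Heritage implementation handles more metadata and edge cases than shown here.

```python
# Minimal sketch of a git-style intrinsic identifier for a revision:
# hash a manifest built only from the revision's own content, so anyone
# can recompute the same id. This mirrors the git commit object format;
# it is not the actual Software Heritage code.
import hashlib


def revision_intrinsic_id(tree_id, parent_ids, author, committer,
                          author_date, committer_date, message):
    lines = ["tree %s" % tree_id]
    lines += ["parent %s" % p for p in parent_ids]
    lines.append("author %s %s" % (author, author_date))
    lines.append("committer %s %s" % (committer, committer_date))
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # git-style object header: "<type> <length>\0", then the body
    manifest = b"commit %d\x00" % len(body) + body
    return hashlib.sha1(manifest).hexdigest()


# Hypothetical example values, just to show the shape of the inputs.
rev_id = revision_intrinsic_id(
    tree_id="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    parent_ids=[],                        # a root revision has no parents
    author="Ada Lovelace <ada@example.org>",
    committer="Ada Lovelace <ada@example.org>",
    author_date="1519397463 +0100",
    committer_date="1519397463 +0100",
    message="Initial import\n",
)
print(rev_id)  # a 40-character hex id, identical for identical content
```

Two revisions with exactly the same content and history get exactly the same identifier, which is what makes deduplication across the whole archive cheap.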
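And as a rough illustration of the put/get, content-addressed object storage idea described just above, here is a tiny filesystem-backed sketch. It is an assumption-laden illustration of the concept, not the actual Software Heritage object storage API or its backends.

```python
# Illustrative content-addressed object storage: objects are stored and
# retrieved by the hash of their content. Not the real Software Heritage API.
import hashlib
import os


class FilesystemObjStorage:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, obj_id):
        # Shard by the first characters of the id to avoid huge flat directories.
        d = os.path.join(self.root, obj_id[:2], obj_id[2:4])
        os.makedirs(d, exist_ok=True)
        return os.path.join(d, obj_id)

    def add(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        path = self._path(obj_id)
        if not os.path.exists(path):  # same content, same id: storing twice is a no-op
            with open(path, "wb") as f:
                f.write(content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        with open(self._path(obj_id), "rb") as f:
            return f.read()
```

Because the object id is derived from the content itself, the blob storage deduplicates for free, just like the graph nodes do.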
So, basically, 20 Python packages and around 30 Puppet modules to deploy all that, and we've released everything under copyleft licenses: GPLv3 for the backend and AGPLv3 for the frontend. So even if people make their own Software Heritage using our code, they have to publish their changes. Hardware-wise, for now we run everything on a few hypervisors in house, and our main storage is currently still a very high-density, very slow, very bulky storage array, but we've started migrating all of this into a Ceph storage cluster, which we're going to grow as needed over the next few months.