Hi, thank you.
I'm Nicolas Dandrimont and I will indeed
be talking to you about
Software Heritage.
I'm a software engineer for this project.
I've been working on it for 3 years now.
And we'll see what this thing is all about.
[Mic not working]
I guess the batteries are out.
So, let's try that again.
So, we all know, since we've been doing
free software for a while,
that software source code is something
special.
Why is that?
As Harold Abelson said in SICP, his
textbook on programming,
programs are meant to be read by people,
and only incidentally executed by machines.
Basically, what software source code
provides us is a way inside
the mind of the designer of the program.
For instance, you can get inside
very crazy algorithms that compute
very fast inverse square roots
for 3D, that kind of stuff,
like in the Quake 2 source code.
You can also get inside the algorithms
that underpin the internet,
for instance by looking at the network
queueing algorithms in the Linux kernel.
What we are building as the free software
community is the free software commons.
Basically, the commons is all the cultural
and social and natural resources
that we share and that everyone
has access to.
More specifically, the software commons
is what we are building
with software that is open and that is
available for all to use, to modify,
to execute, to distribute.
We know that this software commons is
a really critical part of our broader commons.
Who's taking care of it?
The software is fragile.
Like all digital information, you can lose
software.
People can decide to shut down hosting
spaces because of business decisions.
People can hack into software hosting
platforms and remove the code maliciously
or just inadvertently.
And, of course, for the obsolete stuff,
there's rot.
If you don't take care of the data, then
it rots and it decays and you lose it.
So, where is the archive we go to
when something is lost,
when GitLab goes away, when GitHub
goes away?
Where do we go?
Finally, there's one last thing that we
noticed: there are a lot of teams
that work on research on software,
and there's no real big infrastructure
for research on code.
There's tons of critical issues around
code: safety, security, verification, proofs.
Nobody's doing this at a very large scale.
If you want to see the stars, you go
to the Atacama desert and
you point a telescope at the sky.
Where is the telescope for source code?
That's what Software Heritage wants to be.
What we do is we collect, we preserve
and we share all the software
that is publicly available.
Why do we do that? We do that to
preserve the past, to enhance the present
and to prepare for the future.
What we're building is a base infrastructure
that can be used
for cultural heritage, for industry,
for research and for education purposes.
How do we do it? We do it with an open
approach.
Every single line of code that we write
is free software.
We do it transparently, everything that
we do, we do it in the open,
be that on a mailing list or on
our issue tracker.
And we strive to do it for the very long
haul, so we do it with replication in mind
so that no single entity has full control
over the data that we collect.
And we do it in a non-profit fashion
so that we avoid
business-driven decisions impacting
the project.
So, what do we do concretely?
We do archiving of version control systems.
What does that mean?
It means we archive file contents, that is,
source code files.
We archive revisions, which means all the
metadata of the history of the projects:
we try to download it and we put it inside
a common data model that is
shared across the whole archive.
We archive releases of the software,
releases that have been tagged
in a version control system as well as
releases that we can find as tarballs,
because sometimes… well, those two views
of the source code differ.
Of course, we archive where and when
we've seen the data that we've collected.
All of this, we put inside a canonical,
VCS-agnostic data model.
If you have a Debian package, with its
history, if you have a git repository,
if you have a subversion repository, if
you have a mercurial repository,
it all looks the same and you can work
on it with the same tools.
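To give a rough idea of what such a canonical,
VCS-agnostic data model can look like, here is a minimal
sketch in Python; the class names and fields are illustrative
simplifications, not the actual Software Heritage schema.

```python
# Minimal sketch of a VCS-agnostic data model; names and fields are
# illustrative simplifications, not the actual Software Heritage schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Content:
    """A file content, identified by its hash."""
    sha1: str
    data: bytes


@dataclass
class DirectoryEntry:
    """A named pointer to a file content or a sub-directory."""
    name: str
    target: str        # identifier of a Content or of another directory
    is_dir: bool


@dataclass
class Revision:
    """One point in the history: a root directory, metadata and parents."""
    directory: str      # identifier of the root directory
    parents: List[str]  # identifiers of the parent revisions
    author: str
    committer: str
    message: str


@dataclass
class Release:
    """A tagged revision, or a tarball mapped onto the same model."""
    name: str
    target: str                     # identifier of a Revision
    message: Optional[str] = None


@dataclass
class Origin:
    """Where and when we have seen the data (a git repo, a Debian package...)."""
    url: str
    visit_dates: List[str] = field(default_factory=list)
```

Whatever the origin, it ends up as the same kinds of nodes
in this graph, which is what lets the same tools work on all of it.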
What we don't do is archive what's around
the software, for instance
the bug tracking systems or the homepages
or the wikis or the mailing lists.
There are some projects that work
in this space, for instance
the internet archive does a lot of
really good work around archiving the web.
Our goal is not to replace them, but to
work with them and be able to do
linking across all the archives that exist.
For instance, for mailing lists there's
the Gmane project
that does a lot of archiving of free
software mailing lists.
So our long term vision is to play a part
in a semantic Wikipedia of software,
a Wikidata of software, where we can
hyperlink all the archives that exist
and do stuff in that area.
Quick tour of our infrastructure.
Basically, all the way to the right is
our archive.
Our archive consists of a huge graph
of all the metadata about
the files, the directories, the revisions,
the commits and the releases and
all the projects that are on top
of the graph.
We separate the file contents into
a separate object storage because of
the size discrepancy: we have lots and lots
of file contents that we need to store,
so we do that outside the database
that is used to store the graph.
Basically, what we archive is a set of
software origins that are
git repositories, mercurial repositories,
etc. etc.
All those origins are loaded on a
regular schedule.
If there is a very active software origin,
we're gonna archive it more often
than stale things that don't get
a lot of updates.
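As a rough illustration of that scheduling policy (purely
hypothetical logic, not the actual Software Heritage scheduler),
the interval between visits can shrink for origins that keep
changing and grow for stale ones:

```python
# Hypothetical sketch of an adaptive visit schedule: not the real
# scheduler, just the idea of visiting active origins more often.
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(previous: timedelta, changed: bool) -> timedelta:
    """Halve the interval when the origin changed, double it when it did not."""
    if changed:
        return max(MIN_INTERVAL, previous / 2)
    return min(MAX_INTERVAL, previous * 2)


def schedule_next_visit(last_visit: datetime, previous: timedelta,
                        changed: bool) -> datetime:
    """Return the date of the next archival visit for one origin."""
    return last_visit + next_interval(previous, changed)
```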
How do we get the list of software
origins that we archive?
We have a bunch of listers that can
scroll through the list of repositories,
for instance on GitHub or other
hosting platforms.
We have code that can read Debian archive
metadata to make a list of the packages
that are inside this archive and can be
archived, etc.
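Conceptually a lister is very simple; here is a hedged sketch
against a made-up hosting platform API (the endpoint URL and the
JSON fields are invented for illustration, not a real forge's API):

```python
# Sketch of a lister paging through a hypothetical hosting platform API
# to collect software origins; endpoint and field names are made up.
import requests


def list_origins(api_base="https://forge.example.org/api/repos"):
    """Yield repository clone URLs, following the API's pagination."""
    url = api_base
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        page = response.json()
        for repo in page["results"]:
            yield repo["clone_url"]
        url = page.get("next")  # None on the last page ends the loop
```

Each origin URL yielded by a lister is then handed to the scheduler
so that a loader visits it on a regular basis.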
All of this is done on a regular basis.
We are currently working on some kind
of push mechanism so that
people or other systems can notify us
of updates.
Our goal is not to do real time archiving,
we're really in it for the long run
but we still want to be able to prioritize
stuff that people tell us is
important to archive.
The internet archive has a "save now"
button and we want to implement
something along those lines as well,
so if we know that some software project
is in danger for a reason or another,
then we can prioritize archiving it.
So this is the basic structure of a revision
in the software heritage archive.
You'll see that it's very similar to
a git commit.
The format of the metadata is pretty much
what you'll find in a git commit,
with some extensions that you don't
see here because this is from a git commit.
So basically what we do is we take the
identifier of the directory
that the revision points to, we take the
identifiers of the parents of the revision
so we can keep track of the history,
and then we add some metadata:
authorship and committership information
and the revision message. Then we take
a hash of all this, and that makes
an identifier that's probably
unique, very very probably unique.
Using those identifiers, we can retrace
all the origins, all the history of
development of the projects, and we can
deduplicate across the whole archive.
All the identifiers are intrinsic, which
means that we compute them
from the contents of the things that
we are archiving, which means that
we can deduplicate very efficiently
across all the data that we archive.
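To make the idea of an intrinsic identifier concrete, here is a
simplified sketch; the real serialization is git-compatible and more
detailed, so treat the manifest format below as an illustrative
assumption rather than the actual scheme:

```python
# Simplified sketch of an intrinsic revision identifier: a hash over
# the directory id, the parent ids and the metadata. The manifest
# format here is illustrative; the real one is git-compatible.
import hashlib
from typing import List


def revision_id(directory: str, parents: List[str], author: str,
                committer: str, message: str) -> str:
    """Return a hex identifier computed only from the revision's contents."""
    manifest = "\n".join([
        f"tree {directory}",
        *(f"parent {p}" for p in parents),
        f"author {author}",
        f"committer {committer}",
        "",
        message,
    ]).encode("utf-8")
    return hashlib.sha1(manifest).hexdigest()
```

Two identical revisions hash to the same identifier, which is what
makes deduplication across the whole archive cheap: each object is
simply stored once.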
How much data do we archive?
A bit.
So, we passed the billion-revision
mark a few weeks ago.
This graph is a bit old, but anyway,
you have a live graph on our website.
That's more than 4.5 billion unique
source code files.
We don't actually discriminate between
what we would consider source code
and what upstream developers consider
source code:
everything that's in a git repository,
we consider source code,
as long as it's below a size threshold.
A billion revisions across 80 million
projects.
What do we archive?
We archive Github, we archive Debian.
So, for Debian we run the archival process
every day: every day we get the new packages
that have been uploaded to the archive.
For GitHub, we try to keep up; we are currently
working on some performance improvements,
some scalability improvements, to make sure
that we can keep up
with the development on GitHub.
We have archived as a one-off thing
the former content of Gitorious and Google Code
which are two prominent code hosting
spaces that closed recently
and we've been working on archiving
the contents of Bitbucket
which is kind of a challenge because
the API is a bit buggy and
Atlassian isn't too interested
in fixing it.
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB,
and a kind of big database, 6TB.
The database only contains the graph of
the metadata for the archive,
which is basically a graph with 8 billion
nodes and 70 billion edges.
And of course it's growing daily.
We are pretty sure this is the richest
source code archive that's available now
and it keeps growing.
So how do we actually…
What kind of stack do we use to store
all this?
We use Debian, of course.
All our deployment recipes are in Puppet
in public repositories.
We've started using Ceph
for the blob storage.
We use PostgreSQL for the metadata storage,
with some of the standard tools that
live around PostgreSQL for backups
and replication.
We use a standard Python stack for
the scheduling of jobs
and for the web interface stuff: basically
psycopg2 for the low level stuff,
Django for the web stuff,
and Celery for the scheduling of jobs.
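As an illustration of how a loading job could be wired up with
Celery (the task, the broker URL and the clone_and_ingest helper
below are hypothetical placeholders, not the real Software Heritage code):

```python
# Hypothetical sketch of a loading task wired up with Celery; the
# broker URL, the task and the helper are illustrative placeholders.
from celery import Celery

app = Celery("loader", broker="amqp://localhost")


def clone_and_ingest(origin_url: str) -> None:
    """Placeholder for the actual loader logic (clone, walk history, store)."""
    ...


@app.task(bind=True, max_retries=3)
def load_origin(self, origin_url: str):
    """Fetch one software origin and ingest it into the archive."""
    try:
        clone_and_ingest(origin_url)
    except Exception as exc:
        # Transient failures (network, flaky APIs) are retried later.
        raise self.retry(exc=exc, countdown=60)
```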
In house, we've written an ad hoc
object storage system which has
a bunch of backends that you can use.
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…
It's a really simple object storage system
where you can just put an object,
get an object, put a bunch of objects,
get a bunch of objects.
We've implemented removal but we don't
really use it yet.
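That interface boils down to something like the following sketch;
the method names are illustrative, not the actual in-house API:

```python
# Sketch of a minimal content-addressed object storage interface with
# pluggable backends; method names are illustrative, not the real API.
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class ObjStorage(ABC):
    """Blob store where objects are keyed by their (intrinsic) hash."""

    @abstractmethod
    def add(self, obj_id: bytes, content: bytes) -> None:
        """Store one object under its identifier."""

    @abstractmethod
    def get(self, obj_id: bytes) -> bytes:
        """Retrieve one object by identifier."""

    @abstractmethod
    def delete(self, obj_id: bytes) -> None:
        """Removal is implemented but rarely used."""

    def add_batch(self, objects: Dict[bytes, bytes]) -> None:
        """Store a bunch of objects (backends may override for efficiency)."""
        for obj_id, content in objects.items():
            self.add(obj_id, content)

    def get_batch(self, obj_ids: Iterable[bytes]) -> Dict[bytes, bytes]:
        """Retrieve a bunch of objects at once."""
        return {obj_id: self.get(obj_id) for obj_id in obj_ids}
```

A filesystem backend, a Ceph backend or an Azure backend then only
has to implement add, get and delete.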
All the data model implementation,
all the listers, the loaders, the schedulers:
everything has been written by us;
it's a pile of Python code.
So, basically 20 Python packages and
around 30 Puppet modules
to deploy all that, and we've done everything
under copyleft licenses:
GPLv3 for the backend and AGPLv3
for the frontend.
Even if people try and make their own
Software Heritage using our code,
they have to publish their changes.
Hardware-wise, for now we run everything
on a few hypervisors in house, and
our main storage is currently still
on a very high density, very slow,
very bulky storage array, but we've
started to migrate all of this
into a Ceph storage cluster which
we're gonna grow as we need
in the next few months.
We've also been granted sponsorship
by Microsoft, ??? sponsorship,
for their cloud services.
We've started putting mirrors of everything
in their infrastructure as well,
which means a full object storage mirror,
so 170TB of stuff mirrored on Azure,
as well as a database mirror for the graph.
And we're also doing all the content
indexing and all the things that need
scalability on Azure now.
Finally, at the University of Bologna,
we have a backend storage for downloads.
Currently our main storage is
quite slow, so if you want to download
a bundle of things that we've archived,
we actually keep a cache of
what we've done so that it doesn't take
a million years to download stuff.
We do our development in a classic free
and open source software way,
so we talk on our mailing list