Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer for this project. I've been working on it for 3 years now. And we'll see what this thing is all about.

[Mic not working] I guess the batteries are out. So, let's try that again.

So, we all know, we've been doing free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people and only incidentally for machines to execute. Basically, what software source code provides us is a way inside the mind of the designer of the program. For instance, you can get inside very crazy algorithms that can do very fast inverse square roots for 3D, that kind of stuff, like in the Quake 2 source code. You can also get inside the algorithms that are underpinning the internet, for instance seeing the net queue algorithm in the Linux kernel.

What we are building as the free software community is the free software commons. Basically, the commons is all the cultural and social and natural resources that we share and that everyone has access to.
More specifically, the software commons is what we are building with software that is open and that is available for all to use, to modify, to execute, to distribute. We know that those commons are a really critical part of our commons. But who's taking care of it?

Software is fragile. Like all digital information, you can lose software. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove the code, maliciously or just inadvertently. And, of course, for the obsolete stuff, there's rot: if you don't care about the data, it rots and decays and you lose it. So, where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?

Finally, there's one last thing that we noticed: there are a lot of teams that do research on software, and there's no real big infrastructure for research on code. There are tons of critical issues around code: safety, security, verification, proofs. Nobody's doing this at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.
What we do is we collect, we preserve and we share all the software that is publicly available. Why do we do that? We do that to preserve the past, to enhance the present and to prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything that we do, we do in the open, be that on a mailing list or on our issue tracker. We strive to do it for the very long haul, so we do it with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So, what do we do concretely? We do archiving of version control systems. What does that mean? It means we archive file contents, so source code files. We archive revisions, which means all the metadata of the history of the projects: we try to download it and we put it inside a common data model that is shared across all the archive.
We archive releases of the software, releases that have been tagged in a version control system as well as releases that we can find as tarballs, because sometimes views of this source code differ. Of course, we archive where and when we've seen the data that we've collected. All of this, we put inside a canonical, VCS-agnostic data model: if you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.

What we don't do is archive what's around the software, for instance the bug tracking systems or the homepages or the wikis or the mailing lists. There are some projects that work in this space: for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to do linking across all the archives that exist. For instance, for the mailing lists there's the Gmane project that does a lot of archiving of free software mailing lists. So our long term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and do stuff in that area.

Quick tour of our infrastructure. Basically, all the way to the right is our archive.
Our archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, and all the projects that sit on top of the graph. We separate the file storage out into a separate object storage because of the size discrepancy: we have lots and lots of file contents that we need to store, so we do that outside the database that is used to store the graph.

Basically, what we archive is a set of software origins: git repositories, mercurial repositories, etc. All those origins are loaded on a regular schedule: if there is a very active software origin, we're going to archive it more often than stale things that don't get a lot of updates.

How do we get the list of software origins that we archive? We have a bunch of listers that can scroll through the list of repositories, for instance on GitHub or other hosting platforms. We have code that can read Debian archive metadata to make a list of the packages that are inside this archive and can be archived, etc. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real-time archiving, we're really in it for the long run, but we still want to be able to prioritize stuff that people tell us is important to archive.
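The idea of visiting active origins more often than stale ones can be sketched as a simple adaptive back-off policy. This is a hypothetical illustration, not the actual Software Heritage scheduler: the function name, bounds and halving/doubling rule are all assumptions made for the example.

```python
from datetime import timedelta

def next_visit_interval(current: timedelta, changed: bool,
                        minimum: timedelta = timedelta(hours=12),
                        maximum: timedelta = timedelta(days=64)) -> timedelta:
    """Hypothetical scheduling policy for archiving visits.

    If the last visit found new commits, halve the interval so the
    origin is revisited sooner; otherwise double it, staying within
    [minimum, maximum] bounds.
    """
    proposed = current / 2 if changed else current * 2
    # Clamp to the allowed range.
    return max(minimum, min(maximum, proposed))

# An active origin gets visited more and more frequently...
print(next_visit_interval(timedelta(days=2), changed=True))    # 1 day
# ...while a stale one backs off until it hits the ceiling.
print(next_visit_interval(timedelta(days=64), changed=False))  # 64 days
```

A multiplicative policy like this converges quickly to each origin's actual activity level without keeping any per-origin history beyond the current interval.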
The Internet Archive has a "save now" button and we want to implement something along those lines as well, so if we know that some software project is in danger for one reason or another, then we can prioritize archiving it.

So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit. The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this one comes from a git commit. So basically what we do is we take the identifier of the directory that the revision points to, we take the identifier of the parent of the revision, so we can keep track of the history, and then we add some metadata, authorship and committership information and the revision message, and then we take a hash of all this, and it makes an identifier that's probably unique, very very probably unique. Using those identifiers, we can retrace all the origins, all the history of development of the project, and we can deduplicate across all the archive. All the identifiers are intrinsic, which means that we compute them from the contents of the things that we are archiving, which in turn means that we can deduplicate very efficiently across all the data that we archive.

How much data do we archive? A bit. We passed the billion revision mark a few weeks ago.
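Since the revision format is modeled on git, the intrinsic identifier computation can be illustrated with git's own object hashing scheme: SHA-1 over a typed header plus the object body, where the body references the directory (tree) and parent identifiers. This is a sketch of git's scheme, not the exact Software Heritage code; the commit body below is made up for the example.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Compute a git-style intrinsic identifier: SHA-1 over a
    "<type> <length>\\0" header followed by the object body."""
    data = obj_type.encode() + b" " + str(len(body)).encode() + b"\x00" + body
    return hashlib.sha1(data).hexdigest()

# A commit (revision) body points at a tree (directory) identifier and
# carries authorship metadata and the message; parent lines would chain
# it to the rest of the history. Hashing all of it yields the revision id.
body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"committer Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)
print(git_object_id("commit", body))

# The well-known identifier of the empty blob falls out of the same rule:
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is a pure function of the content, two loaders archiving the same revision from different origins compute the same id, which is exactly what makes archive-wide deduplication cheap.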
This graph is a bit old, but anyway, you have a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code, so everything that's in a git repository we consider source code, if it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian, we run the archival process every day: every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we are currently working on some performance improvements, some scalability improvements, to make sure that we can keep up with the development on GitHub. We have archived, as a one-off thing, the former contents of Gitorious and Google Code, which are two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, we have 175TB of blobs, so the files take 175TB, and a kind of big database, 6TB. The database only contains the graph of the metadata for the archive, which is basically an 8 billion node and 70 billion edge graph. And of course it's growing daily.
We are pretty sure this is the richest source code archive that's available now, and it keeps growing.

So how do we actually… What kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for scheduling of jobs and for web interface stuff: basically psycopg2 for the low-level stuff, Django for the web stuff, and Celery for the scheduling of jobs.

In house, we've written an ad hoc object storage system which has a bunch of backends that you can use. Basically, we are agnostic between a UNIX filesystem, Azure, Ceph, or tons of others. It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal but we don't really use it yet.

All the data model implementation, all the listers, the loaders, the schedulers, everything has been written by us; it's a pile of Python code. So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and we've done everything under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend.
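A put/get object storage of the kind described, keyed by content hash, can be sketched in a few lines. This is a hypothetical in-memory stand-in, not the Software Heritage objstorage API: the class and method names are assumptions, and a real backend would write to a filesystem, Ceph or Azure instead of a dict.

```python
import hashlib

class InMemoryObjStorage:
    """Minimal content-addressed object storage sketch (hypothetical).

    Objects are keyed by the SHA-1 of their content, so storing the
    same blob twice deduplicates for free, and backends (filesystem,
    Ceph, Azure, ...) only need to implement this same tiny interface.
    """

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # re-adding an existing blob is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def put_batch(self, contents: list[bytes]) -> list[str]:
        return [self.put(c) for c in contents]

storage = InMemoryObjStorage()
key = storage.put(b"print('hello')\n")
storage.put(b"print('hello')\n")  # duplicate: stored only once
assert storage.get(key) == b"print('hello')\n"
```

Keeping the interface this narrow is what makes it easy to mirror the whole blob store onto a different backend: a mirror only has to replay puts.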
Even if people try and make their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, for now we run everything on a few hypervisors in house, and our main storage is currently still on a very high density, very slow, very bulky storage array, but we've started to migrate all of this into a Ceph storage cluster which we're going to grow as we need in the next few months. We've also been granted sponsorship by Microsoft, ??? sponsorship, for their cloud services. We've started putting mirrors of everything in their infrastructure as well, which means a full object storage mirror, so 170TB of stuff mirrored on Azure, as well as a database mirror for the graph. And we're also doing all the content indexing and all the things that need scalability on Azure now. Finally, at the University of Bologna, we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things that we've archived, we actually keep a cache of what we've produced so that it doesn't take a million years to download stuff.

We do our development in a classic free and open source software way: we talk on our mailing list, on IRC