WEBVTT

99:59:59.999 --> 99:59:59.999
Hi, thank you.

99:59:59.999 --> 99:59:59.999
I'm Nicolas Dandrimont and I will indeed
be talking to you about

99:59:59.999 --> 99:59:59.999
Software Heritage.

99:59:59.999 --> 99:59:59.999
I'm a software engineer for this project.

99:59:59.999 --> 99:59:59.999
I've been working on it for 3 years now.

99:59:59.999 --> 99:59:59.999
And we'll see what this thing is all about.

99:59:59.999 --> 99:59:59.999
[Mic not working]

99:59:59.999 --> 99:59:59.999
I guess the batteries are out.

99:59:59.999 --> 99:59:59.999
So, let's try that again.

99:59:59.999 --> 99:59:59.999
So, we all know, we've been doing
free software for a while,

99:59:59.999 --> 99:59:59.999
that software source code is something
special.

99:59:59.999 --> 99:59:59.999
Why is that?

99:59:59.999 --> 99:59:59.999
As Harold Abelson has said in SICP, his
textbook on programming,

99:59:59.999 --> 99:59:59.999
programs are meant to be read by people
and then incidentally for machines to execute.

99:59:59.999 --> 99:59:59.999
Basically, what software source code
provides us is a way inside

99:59:59.999 --> 99:59:59.999
the mind of the designer of the program.

99:59:59.999 --> 99:59:59.999
For instance, you can have,
you can get inside very crazy algorithms

99:59:59.999 --> 99:59:59.999
that can do very fast reverse square roots
for 3D, that kind of stuff

99:59:59.999 --> 99:59:59.999
Like in the Quake 2 source code.

99:59:59.999 --> 99:59:59.999
You can also get inside the algorithms
that are underpinning the internet,

99:59:59.999 --> 99:59:59.999
for instance seeing the net queue
algorithm in the Linux kernel.

99:59:59.999 --> 99:59:59.999
What we are building as the free software
community is the free software commons.

99:59:59.999 --> 99:59:59.999
Basically, the commons is all the cultural
and social and natural resources

99:59:59.999 --> 99:59:59.999
that we share and that everyone
has access to.

99:59:59.999 --> 99:59:59.999
More specifically, the software commons
is what we are building

99:59:59.999 --> 99:59:59.999
with software that is open and that is
available for all to use, to modify,

99:59:59.999 --> 99:59:59.999
to execute, to distribute.

99:59:59.999 --> 99:59:59.999
We know that the software commons are a really
critical part of our commons.

99:59:59.999 --> 99:59:59.999
Who's taking care of it?

99:59:59.999 --> 99:59:59.999
The software is fragile.

99:59:59.999 --> 99:59:59.999
Like all digital information, you can lose
software.

99:59:59.999 --> 99:59:59.999
People can decide to shut down hosting
spaces because of business decisions.

99:59:59.999 --> 99:59:59.999
People can hack into software hosting
platforms and remove the code maliciously

99:59:59.999 --> 99:59:59.999
or just inadvertently.

99:59:59.999 --> 99:59:59.999
And, of course, for the obsolete stuff,
there's rot.

99:59:59.999 --> 99:59:59.999
If you don't care about the data, then
it rots and it decays and you lose it.

99:59:59.999 --> 99:59:59.999
So, where is the archive we go to
when something is lost,

99:59:59.999 --> 99:59:59.999
when GitLab goes away, when GitHub
goes away.

99:59:59.999 --> 99:59:59.999
Where do we go?

99:59:59.999 --> 99:59:59.999
Finally, there's one last thing that we
noticed, it's that

99:59:59.999 --> 99:59:59.999
there's a lot of teams that work on
research on software

99:59:59.999 --> 99:59:59.999
and there's no real big infrastructure
for research on code.

99:59:59.999 --> 99:59:59.999
There's tons of critical issues around
code: safety, security, verification, proofs.

99:59:59.999 --> 99:59:59.999
Nobody's doing this at a very large scale.

99:59:59.999 --> 99:59:59.999
If you want to see the stars, you go
to the Atacama desert and

99:59:59.999 --> 99:59:59.999
you point a telescope at the sky.

99:59:59.999 --> 99:59:59.999
Where is the telescope for source code?

99:59:59.999 --> 99:59:59.999
That's what Software Heritage wants to be.

99:59:59.999 --> 99:59:59.999
What we do is we collect, we preserve
and we share all the software

99:59:59.999 --> 99:59:59.999
that is publicly available.

99:59:59.999 --> 99:59:59.999
Why do we do that? We do that to
preserve the past, to enhance the present

99:59:59.999 --> 99:59:59.999
and to prepare for the future.

99:59:59.999 --> 99:59:59.999
What we're building is a base infrastructure
that can be used

99:59:59.999 --> 99:59:59.999
for cultural heritage, for industry,
for research and for education purposes.

99:59:59.999 --> 99:59:59.999
How do we do it? We do it with an open
approach.

99:59:59.999 --> 99:59:59.999
Every single line of code that we write
is free software.

99:59:59.999 --> 99:59:59.999
We do it transparently, everything that
we do, we do it in the open,

99:59:59.999 --> 99:59:59.999
be that on a mailing list or on
our issue tracker.

99:59:59.999 --> 99:59:59.999
And we strive to do it for the very long
haul, so we do it with replication in mind

99:59:59.999 --> 99:59:59.999
so that no single entity has full control
over the data that we collect.

99:59:59.999 --> 99:59:59.999
And we do it in a non-profit fashion
so that we avoid

99:59:59.999 --> 99:59:59.999
business-driven decisions impacting
the project.

99:59:59.999 --> 99:59:59.999
So, what do we do concretely?

99:59:59.999 --> 99:59:59.999
We do archiving of version control systems.

99:59:59.999 --> 99:59:59.999
What does that mean?

99:59:59.999 --> 99:59:59.999
It means we archive file contents, so
source code, files.

99:59:59.999 --> 99:59:59.999
We archive revisions, which means all the
metadata of the history of the projects,

99:59:59.999 --> 99:59:59.999
we try to download it and we put it inside
a common data model that is

99:59:59.999 --> 99:59:59.999
shared across all the archive.

99:59:59.999 --> 99:59:59.999
We archive releases of the software,
releases that have been tagged

99:59:59.999 --> 99:59:59.999
in a version control system as well as
releases that we can find as tarballs

99:59:59.999 --> 99:59:59.999
because sometimes… boof, views of
this source code differ.

99:59:59.999 --> 99:59:59.999
Of course, we archive where and when
we've seen the data that we've collected.

99:59:59.999 --> 99:59:59.999
All of this, we put inside a canonical,
VCS-agnostic, data model.

99:59:59.999 --> 99:59:59.999
If you have a Debian package, with its
history, if you have a git repository,

99:59:59.999 --> 99:59:59.999
if you have a subversion repository, if
you have a mercurial repository,

99:59:59.999 --> 99:59:59.999
it all looks the same and you can work
on it with the same tools.

99:59:59.999 --> 99:59:59.999
What we don't do is archive what's around
the software, for instance

99:59:59.999 --> 99:59:59.999
the bug tracking systems or the homepages
or the wikis or the mailing lists.

99:59:59.999 --> 99:59:59.999
There are some projects that work
in this space, for instance

99:59:59.999 --> 99:59:59.999
the Internet Archive does a lot of
really good work around archiving the web.

99:59:59.999 --> 99:59:59.999
Our goal is not to replace them, but to
work with them and be able to do

99:59:59.999 --> 99:59:59.999
linking across all the archives that exist.

99:59:59.999 --> 99:59:59.999
We can, for instance for the mailing lists
there's the Gmane project

99:59:59.999 --> 99:59:59.999
that does a lot of archiving of free
software mailing lists.

99:59:59.999 --> 99:59:59.999
So our long term vision is to play a part
in a semantic Wikipedia of software,

99:59:59.999 --> 99:59:59.999
a Wikidata of software where we can
hyperlink all the archives that exist

99:59:59.999 --> 99:59:59.999
and do stuff in the area.

99:59:59.999 --> 99:59:59.999
Quick tour of our infrastructure.

99:59:59.999 --> 99:59:59.999
Basically, all the way to the right is
our archive.

99:59:59.999 --> 99:59:59.999
Our archive consists of a huge graph
of all the metadata about

99:59:59.999 --> 99:59:59.999
the files, the directories, the revisions,
the commits and the releases and

99:59:59.999 --> 99:59:59.999
all the projects that are on top
of the graph.

99:59:59.999 --> 99:59:59.999
We separate the file storage into another
object storage because of

99:59:59.999 --> 99:59:59.999
the size discrepancy: we have lots and lots
of file contents that we need to store

99:59:59.999 --> 99:59:59.999
so we do that outside the database
that is used to store the graph.

99:59:59.999 --> 99:59:59.999
Basically, what we archive is a set of
software origins that are

99:59:59.999 --> 99:59:59.999
git repositories, mercurial repositories,
etc. etc.

99:59:59.999 --> 99:59:59.999
All those origins are loaded on a
regular schedule.

99:59:59.999 --> 99:59:59.999
If there is a very active software origin,
we're gonna archive it more often

99:59:59.999 --> 99:59:59.999
than stale things that don't get
a lot of updates.

99:59:59.999 --> 99:59:59.999
What do we do to get the list of software
origins that we archive?

99:59:59.999 --> 99:59:59.999
We have a bunch of listers that can
scroll through the list of repositories,

99:59:59.999 --> 99:59:59.999
for instance on GitHub or other
hosting platforms.

99:59:59.999 --> 99:59:59.999
We have code that can read Debian archive
metadata to make a list of the packages

99:59:59.999 --> 99:59:59.999
that are inside this archive and can be
archived, etc.

99:59:59.999 --> 99:59:59.999
All of this is done on a regular basis.

99:59:59.999 --> 99:59:59.999
We are currently working on some kind
of push mechanism so that

99:59:59.999 --> 99:59:59.999
people or other systems can notify us
of updates.

99:59:59.999 --> 99:59:59.999
Our goal is not to do real time archiving,
we're really in it for the long run

99:59:59.999 --> 99:59:59.999
but we still want to be able to prioritize
stuff that people tell us is

99:59:59.999 --> 99:59:59.999
important to archive.

99:59:59.999 --> 99:59:59.999
The Internet Archive has a "save now"
button and we want to implement

99:59:59.999 --> 99:59:59.999
something along those lines as well,

99:59:59.999 --> 99:59:59.999
so if we know that some software project
is in danger for a reason or another,

99:59:59.999 --> 99:59:59.999
then we can prioritize archiving it.

99:59:59.999 --> 99:59:59.999
So this is the basic structure of a revision
in the Software Heritage archive.

99:59:59.999 --> 99:59:59.999
You'll see that it's very similar to
a git commit.

99:59:59.999 --> 99:59:59.999
The format of the metadata is pretty much
what you'll find in a git commit

99:59:59.999 --> 99:59:59.999
with some extensions that you don't
see here because this is from a git commit.

99:59:59.999 --> 99:59:59.999
So basically what we do is we take the
identifier of the directory

99:59:59.999 --> 99:59:59.999
that the revision points to, we take the
identifier of the parent of the revision

99:59:59.999 --> 99:59:59.999
so we can keep track of the history

99:59:59.999 --> 99:59:59.999
and then we add some metadata,
authorship and committership information

99:59:59.999 --> 99:59:59.999
and the revision message and then we take
a hash of this,

99:59:59.999 --> 99:59:59.999
it makes an identifier that's probably
unique, very very probably unique.

99:59:59.999 --> 99:59:59.999
Using those identifiers, we can retrace
all the origins, all the history of

99:59:59.999 --> 99:59:59.999
development of the project and we can
deduplicate across all the archive.

99:59:59.999 --> 99:59:59.999
All the identifiers are intrinsic, which
means that we compute them

99:59:59.999 --> 99:59:59.999
from the contents of the things that
we are archiving, which means that

99:59:59.999 --> 99:59:59.999
we can deduplicate very efficiently
across all the data that we archive.
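
NOTE
A minimal sketch of the intrinsic identifier described above, assuming a
git-style commit layout (the talk says the format is close to a git commit).
The name revision_id and the sample values are hypothetical and are not
taken from the Software Heritage code base.
```python
import hashlib
# Hypothetical helper: hash the directory id, the parent ids and the metadata together.
def revision_id(directory, parents, author, committer, message):
    # Build the manifest the way git lays out a commit object (an assumption here).
    lines = ["tree " + directory]
    lines += ["parent " + parent for parent in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode("utf-8")
    # git prefixes the object type and the body length before hashing.
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()
# Made-up sample values, for illustration only.
who = "Jane Doe <jane@example.org> 1500000000 +0000"
first = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "Initial commit\n")
second = revision_id("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [first], who, who, "Second commit\n")
print(first, second)
```
Because the identifier depends only on the content being hashed, the same
revision seen at two different origins gets the same identifier, which is
what makes deduplication across the archive cheap.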

99:59:59.999 --> 99:59:59.999
How much data do we archive?

99:59:59.999 --> 99:59:59.999
A bit.

99:59:59.999 --> 99:59:59.999
So, we passed the billion revision
mark a few weeks ago.

99:59:59.999 --> 99:59:59.999
This graph is a bit old, but anyway,
you have a live graph on our website.

99:59:59.999 --> 99:59:59.999
That's more than 4.5 billion unique
source code files.

99:59:59.999 --> 99:59:59.999
We don't actually discriminate between
what we would consider is source code

99:59:59.999 --> 99:59:59.999
and what upstream developers consider
as source code,

99:59:59.999 --> 99:59:59.999
so everything that's in a git repository,
we consider as source code

99:59:59.999 --> 99:59:59.999
if it's below a size threshold.

99:59:59.999 --> 99:59:59.999
A billion revisions across 80 million
projects.

99:59:59.999 --> 99:59:59.999
What do we archive?

99:59:59.999 --> 99:59:59.999
We archive GitHub, we archive Debian.

99:59:59.999 --> 99:59:59.999
So, Debian we run the archival process
every day, every day we get the new packages

99:59:59.999 --> 99:59:59.999
that have been uploaded in the archive.

99:59:59.999 --> 99:59:59.999
GitHub, we try to keep up, we are currently
working on some performance improvements,

99:59:59.999 --> 99:59:59.999
some scalability improvements to make sure
that we can keep up

99:59:59.999 --> 99:59:59.999
with the development on GitHub.

99:59:59.999 --> 99:59:59.999
We have archived as a one-off thing
the former content of Gitorious and Google Code

99:59:59.999 --> 99:59:59.999
which are two prominent code hosting
spaces that closed recently

99:59:59.999 --> 99:59:59.999
and we've been working on archiving
the contents of Bitbucket

99:59:59.999 --> 99:59:59.999
which is kind of a challenge because
the API is a bit buggy and

99:59:59.999 --> 99:59:59.999
Atlassian isn't too interested
in fixing it.

99:59:59.999 --> 99:59:59.999
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB

99:59:59.999 --> 99:59:59.999
and a kind of big database, 6TB.

99:59:59.999 --> 99:59:59.999
The database only contains the graph of
the metadata for the archive

99:59:59.999 --> 99:59:59.999
which is basically an 8 billion nodes and
70 billion edges graph.

99:59:59.999 --> 99:59:59.999
And of course it's growing daily.

99:59:59.999 --> 99:59:59.999
We are pretty sure this is the richest
source code archive that's available now

99:59:59.999 --> 99:59:59.999
and it keeps growing.

99:59:59.999 --> 99:59:59.999
So how do we actually…

99:59:59.999 --> 99:59:59.999
What kind of stack do we use to store
all this?

99:59:59.999 --> 99:59:59.999
We use Debian, of course.

99:59:59.999 --> 99:59:59.999
All our deployment recipes are in Puppet
in public repositories.

99:59:59.999 --> 99:59:59.999
We've started using Ceph
for the blob storage.

99:59:59.999 --> 99:59:59.999
We use PostgreSQL for the metadata storage
with some of the standard tools that

99:59:59.999 --> 99:59:59.999
live around PostgreSQL for backups
and replication.

99:59:59.999 --> 99:59:59.999
We use a standard Python stack for
scheduling of jobs

99:59:59.999 --> 99:59:59.999
and for web interface stuff, basically
psycopg2 for the low level stuff,

99:59:59.999 --> 99:59:59.999
Django for the web stuff

99:59:59.999 --> 99:59:59.999
and Celery for the scheduling of jobs.

99:59:59.999 --> 99:59:59.999
In house, we've written an ad hoc
object storage system which has

99:59:59.999 --> 99:59:59.999
a bunch of backends that you can use.

99:59:59.999 --> 99:59:59.999
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…

99:59:59.999 --> 99:59:59.999
It's a really simple object storage system
where you can just put an object,

99:59:59.999 --> 99:59:59.999
get an object, put a bunch of objects,
get a bunch of objects.

99:59:59.999 --> 99:59:59.999
We've implemented removal but we don't
really use it yet.
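
NOTE
A minimal sketch of the put/get interface described above, with an in-memory
dict standing in for a real backend. The class and method names are
hypothetical, and using the SHA-1 of the content as the object id is an
assumption made for illustration. Removal is left out.
```python
import hashlib
class InMemoryObjStorage:
    """Toy content-addressed object storage (hypothetical, for illustration)."""
    def __init__(self):
        self._blobs = {}
    def add(self, content):
        # The id is derived from the bytes, so storing the same content twice is a no-op.
        obj_id = hashlib.sha1(content).hexdigest()
        self._blobs.setdefault(obj_id, content)
        return obj_id
    def get(self, obj_id):
        return self._blobs[obj_id]
    def add_batch(self, contents):
        return [self.add(content) for content in contents]
    def get_batch(self, obj_ids):
        return [self.get(obj_id) for obj_id in obj_ids]
    def __len__(self):
        return len(self._blobs)
storage = InMemoryObjStorage()
ids = storage.add_batch([b"int main(void) { return 0; }\n", b"print('hello')\n"])
print(storage.get(ids[0]))
```
A filesystem or cloud backend would only need to provide the same four
methods, which is one way to stay agnostic between backends.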

99:59:59.999 --> 99:59:59.999
All the data model implementation,
all the listers, the loaders, the schedulers

99:59:59.999 --> 99:59:59.999
everything has been written by us,
it's a pile of Python code.

99:59:59.999 --> 99:59:59.999
So, basically 20 Python packages and
around 30 Puppet modules

99:59:59.999 --> 99:59:59.999
to deploy all that and we've done everything
under a copyleft license,

99:59:59.999 --> 99:59:59.999
GPLv3 for the backend and AGPLv3
for the frontend.

99:59:59.999 --> 99:59:59.999
Even if people try and make their own
Software Heritage using our code,

99:59:59.999 --> 99:59:59.999
they have to publish their changes.

99:59:59.999 --> 99:59:59.999
Hardware-wise, we run for now everything
on a few hypervisors in house and

99:59:59.999 --> 99:59:59.999
our main storage is currently still
on a very high density, very slow,

99:59:59.999 --> 99:59:59.999
very bulky storage array, but we've
started to migrate all this thing

99:59:59.999 --> 99:59:59.999
into a Ceph storage cluster which
we're gonna grow as we need

99:59:59.999 --> 99:59:59.999
in the next few months.

99:59:59.999 --> 99:59:59.999
We've also been granted by Microsoft
sponsorship, ??? sponsorship

99:59:59.999 --> 99:59:59.999
for their cloud services.

99:59:59.999 --> 99:59:59.999
We've started putting mirrors of everything
in their infrastructure as well

99:59:59.999 --> 99:59:59.999
which means full object storage mirror,
so 170TB of stuff mirrored on Azure

99:59:59.999 --> 99:59:59.999
as well as a database mirror for the graph.

99:59:59.999 --> 99:59:59.999
And we're also doing all the content
indexing and all the things that need

99:59:59.999 --> 99:59:59.999
scalability on Azure now.

99:59:59.999 --> 99:59:59.999
Finally, at the University of Bologna,
we have a backend storage for the download

99:59:59.999 --> 99:59:59.999
so currently our main storage is
quite slow so if you want to download

99:59:59.999 --> 99:59:59.999
a bundle of things that we've archived,
then we actually keep a cache of

99:59:59.999 --> 99:59:59.999
what we've done so that it doesn't take
a million years to download stuff.

99:59:59.999 --> 99:59:59.999
We do our development in a classic free
and open source software way,

99:59:59.999 --> 99:59:59.999
so we talk on our mailing list, on IRC,
on a forge.

99:59:59.999 --> 99:59:59.999
Everything is in English, everything is
public, there is more information

99:59:59.999 --> 99:59:59.999
on our website if you want to actually
have a look and see what we do.

99:59:59.999 --> 99:59:59.999
So, all that is very interesting but how
do we actually look into it?

99:59:59.999 --> 99:59:59.999
One of the ways that you can browse,
that you can use the archive

99:59:59.999 --> 99:59:59.999
is using a REST API.

99:59:59.999 --> 99:59:59.999
Basically, this API allows you to do
pointwise browsing of the archive

99:59:59.999 --> 99:59:59.999
so you can go and follow the links
in a graph,

99:59:59.999 --> 99:59:59.999
which is very slow but gives you a pretty
much full access to the data.

99:59:59.999 --> 99:59:59.999
There's an index for the API that you can
look at, but that's not really convenient,

99:59:59.999 --> 99:59:59.999
so we also have a web user interface.

99:59:59.999 --> 99:59:59.999
It's in preview right now, we're gonna do
a full launch in the month of June.

99:59:59.999 --> 99:59:59.999
If you go to 
https://archive.softwareheritage.org/browse/

99:59:59.999 --> 99:59:59.999
with the given credentials, you can
have a look and see what's going on.

99:59:59.999 --> 99:59:59.999
Basically, we have a web interface that
allows you to look at

99:59:59.999 --> 99:59:59.999
what origins we have downloaded, when
we have downloaded the origins

99:59:59.999 --> 99:59:59.999
with a kind of graph view of how often
we visited the origins

99:59:59.999 --> 99:59:59.999
and a calendar view of when we have
visited the origins.

99:59:59.999 --> 99:59:59.999
And then, inside the visits, you can
actually browse the contents

99:59:59.999 --> 99:59:59.999
that we've archived.

99:59:59.999 --> 99:59:59.999
So, for instance, this is the Python
repository as of May 2017

99:59:59.999 --> 99:59:59.999
and you can have the list of files,
then drill down,

99:59:59.999 --> 99:59:59.999
it should be pretty intuitive.

99:59:59.999 --> 99:59:59.999
If you look at the history of a project,
you can see the differences

99:59:59.999 --> 99:59:59.999
between two revisions of a project.

99:59:59.999 --> 99:59:59.999
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.