-
Not Synced
Hi, thank you.
-
Not Synced
I'm Nicolas Dandrimont and I will indeed
be talking to you about
-
Not Synced
Software Heritage.
-
Not Synced
I'm a software engineer for this project.
-
Not Synced
I've been working on it for 3 years now.
-
Not Synced
And we'll see what this thing is all about.
-
Not Synced
[Mic not working]
-
Not Synced
I guess the batteries are out.
-
Not Synced
So, let's try that again.
-
Not Synced
So, we all know, we've been doing
free software for a while,
-
Not Synced
that software source code is something
special.
-
Not Synced
Why is that?
-
Not Synced
As Harold Abelson has said in SICP, his
textbook on programming,
-
Not Synced
programs are meant to be read by people
and then incidentally for machines to execute.
-
Not Synced
Basically, what software source code
provides us is a way inside
-
Not Synced
the mind of the designer of the program.
-
Not Synced
For instance, you can have,
you can get inside very crazy algorithms
-
Not Synced
that can do very fast inverse square roots
for 3D, that kind of stuff
-
Not Synced
Like in the Quake III Arena source code.
-
Not Synced
You can also get inside the algorithms
that are underpinning the internet,
-
Not Synced
for instance seeing the net queue
algorithm in the Linux kernel.
-
Not Synced
What we are building as the free software
community is the free software commons.
-
Not Synced
Basically, the commons is all the cultural
and social and natural resources
-
Not Synced
that we share and that everyone
has access to.
-
Not Synced
More specifically, the software commons
is what we are building
-
Not Synced
with software that is open and that is
available for all to use, to modify,
-
Not Synced
to execute, to distribute.
-
Not Synced
We know that the software commons are a really
critical part of our commons.
-
Not Synced
Who's taking care of it?
-
Not Synced
The software is fragile.
-
Not Synced
Like all digital information, you can lose
software.
-
Not Synced
People can decide to shut down hosting
spaces because of business decisions.
-
Not Synced
People can hack into software hosting
platforms and remove the code maliciously
-
Not Synced
or just inadvertently.
-
Not Synced
And, of course, for the obsolete stuff,
there's rot.
-
Not Synced
If you don't care about the data, then
it rots and it decays and you lose it.
-
Not Synced
So, where is the archive we go to
when something is lost,
-
Not Synced
when GitLab goes away, when GitHub
goes away?
-
Not Synced
Where do we go?
-
Not Synced
Finally, there's one last thing that we
noticed, it's that
-
Not Synced
there's a lot of teams that work on
research on software
-
Not Synced
and there's no real big infrastructure
for research on code.
-
Not Synced
There's tons of critical issues around
code: safety, security, verification, proofs.
-
Not Synced
Nobody's doing this at a very large scale.
-
Not Synced
If you want to see the stars, you go
to the Atacama Desert and
-
Not Synced
you point a telescope at the sky.
-
Not Synced
Where is the telescope for source code?
-
Not Synced
That's what Software Heritage wants to be.
-
Not Synced
What we do is we collect, we preserve
and we share all the software
-
Not Synced
that is publicly available.
-
Not Synced
Why do we do that? We do that to
preserve the past, to enhance the present
-
Not Synced
and to prepare for the future.
-
Not Synced
What we're building is a base infrastructure
that can be used
-
Not Synced
for cultural heritage, for industry,
for research and for education purposes.
-
Not Synced
How do we do it? We do it with an open
approach.
-
Not Synced
Every single line of code that we write
is free software.
-
Not Synced
We do it transparently, everything that
we do, we do it in the open,
-
Not Synced
be that on a mailing list or on
our issue tracker.
-
Not Synced
And we strive to do it for the very long
haul, so we do it with replication in mind
-
Not Synced
so that no single entity has full control
over the data that we collect.
-
Not Synced
And we do it in a non-profit fashion
so that we avoid
-
Not Synced
business-driven decisions impacting
the project.
-
Not Synced
So, what do we do concretely?
-
Not Synced
We do archiving of version control systems.
-
Not Synced
What does that mean?
-
Not Synced
It means we archive file contents, so
source code, files.
-
Not Synced
We archive revisions, which means all the
metadata of the history of the projects,
-
Not Synced
we try to download it and we put it inside
a common data model that is
-
Not Synced
shared across all the archive.
-
Not Synced
We archive releases of the software,
releases that have been tagged
-
Not Synced
in a version control system as well as
releases that we can find as tarballs
-
Not Synced
because sometimes the two views of
the source code differ.
-
Not Synced
Of course, we archive where and when
we've seen the data that we've collected.
-
Not Synced
All of this, we put inside a canonical,
VCS-agnostic, data model.
-
Not Synced
If you have a Debian package, with its
history, if you have a git repository,
-
Not Synced
if you have a subversion repository, if
you have a mercurial repository,
-
Not Synced
it all looks the same and you can work
on it with the same tools.
-
Not Synced
What we don't do is archive what's around
the software, for instance
-
Not Synced
the bug tracking systems or the homepages
or the wikis or the mailing lists.
-
Not Synced
There are some projects that work
in this space, for instance
-
Not Synced
the Internet Archive does a lot of
really good work around archiving the web.
-
Not Synced
Our goal is not to replace them, but to
work with them and be able to do
-
Not Synced
linking across all the archives that exist.
-
Not Synced
For instance, for the mailing lists,
there's the Gmane project
-
Not Synced
that does a lot of archiving of free
software mailing lists.
-
Not Synced
So our long term vision is to play a part
in a semantic Wikipedia of software,
-
Not Synced
a Wikidata of software where we can
hyperlink all the archives that exist
-
Not Synced
and do stuff in the area.
-
Not Synced
Quick tour of our infrastructure.
-
Not Synced
Basically, all the way to the right is
our archive.
-
Not Synced
Our archive consists of a huge graph
of all the metadata about
-
Not Synced
the files, the directories, the revisions,
the commits and the releases and
-
Not Synced
all the projects that are on top
of the graph.
-
Not Synced
We separate the file storage into another
object storage because of
-
Not Synced
the size discrepancy: we have lots and lots
of file contents that we need to store
-
Not Synced
so we do that outside the database
that is used to store the graph.
-
Not Synced
Basically, what we archive is a set of
software origins that are
-
Not Synced
git repositories, mercurial repositories,
etc. etc.
-
Not Synced
All those origins are loaded on a
regular schedule.
-
Not Synced
If there is a very active software origin,
we're gonna archive it more often
-
Not Synced
than stale things that don't get
a lot of updates.
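A minimal sketch of that kind of adaptive schedule, assuming a simple rule that is not described in the talk: halve the wait after a visit that found new data, double it otherwise. The bounds and the name `next_interval` are hypothetical.

```python
from datetime import timedelta

# Hypothetical bounds; the talk gives no actual values.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Visit active origins more often and stale ones less often."""
    proposed = current / 2 if changed else current * 2
    # Clamp so that an origin is never hammered nor forgotten.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    interval = timedelta(days=8)
    # Two visits that found updates, then one that found nothing.
    for changed in (True, True, False):
        interval = next_interval(interval, changed)
        print(interval.days)
```

With this rule, an origin that keeps changing converges to the minimum interval, and one that never changes drifts to the maximum.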
-
Not Synced
How do we get the list of software
origins that we archive?
-
Not Synced
We have a bunch of listers that can
scroll through the list of repositories,
-
Not Synced
for instance on GitHub or other
hosting platforms.
-
Not Synced
We have code that can read Debian archive
metadata to make a list of the packages
-
Not Synced
that are inside this archive and can be
archived, etc.
-
Not Synced
All of this is done on a regular basis.
-
Not Synced
We are currently working on some kind
of push mechanism so that
-
Not Synced
people or other systems can notify us
of updates.
-
Not Synced
Our goal is not to do real time archiving,
we're really in it for the long run
-
Not Synced
but we still want to be able to prioritize
stuff that people tell us is
-
Not Synced
important to archive.
-
Not Synced
The Internet Archive has a "save now"
button and we want to implement
-
Not Synced
something along those lines as well,
-
Not Synced
so if we know that some software project
is in danger for one reason or another,
-
Not Synced
then we can prioritize archiving it.
-
Not Synced
So this is the basic structure of a revision
in the Software Heritage archive.
-
Not Synced
You'll see that it's very similar to
a git commit.
-
Not Synced
The format of the metadata is pretty much
what you'll find in a git commit
-
Not Synced
with some extensions that you don't
see here because this is from a git commit
-
Not Synced
So basically what we do is we take the
identifier of the directory
-
Not Synced
that the revision points to, we take the
identifier of the parent of the revision
-
Not Synced
so we can keep track of the history
-
Not Synced
and then we add some metadata,
authorship and committership information
-
Not Synced
and the revision message and then we take
a hash of this,
-
Not Synced
it makes an identifier that's probably
unique, very very probably unique.
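As a rough sketch of that computation, here is how a git-style commit identifier can be derived in Python. It follows git's commit object layout (a `commit <size>` header, then `tree`, `parent`, `author`, `committer` lines and the message) hashed with SHA-1. The function name and the sample values are made up for illustration, and the extensions mentioned above are not modelled.

```python
import hashlib
from typing import List


def revision_id(directory: str, parents: List[str], author: str,
                committer: str, message: str) -> str:
    """Hash the directory id, parent ids, metadata and message together."""
    lines = [f"tree {directory}"]
    lines += [f"parent {parent}" for parent in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    # git prefixes the serialized object with its type and byte length.
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # Illustrative values only, not taken from a real repository.
    who = "Jane Doe <jane@example.org> 1500000000 +0000"
    print(revision_id("a" * 40, ["b" * 40], who, who, "Fix typo\n"))
```

Because the parent identifier is part of the hashed data, changing anything in the history changes every identifier that follows it.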
-
Not Synced
Using those identifiers, we can retrace
all the origins, all the history of
-
Not Synced
development of the project and we can
deduplicate across all the archive.
-
Not Synced
All the identifiers are intrinsic, which
means that we compute them
-
Not Synced
from the contents of the things that
we are archiving, which means that
-
Not Synced
we can deduplicate very efficiently
across all the data that we archive.
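To illustrate why intrinsic identifiers make deduplication cheap, here is a small sketch of a content-addressed store. It uses git's blob hashing (SHA-1 over a `blob <size>` header plus the bytes) as the identifier. The `ContentStore` class is a hypothetical in-memory stand-in, not the archive's actual storage code.

```python
import hashlib
from typing import Dict


def content_id(data: bytes) -> str:
    """Identifier computed only from the bytes, the way git hashes a blob."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class ContentStore:
    """Toy store keyed by intrinsic identifier (hypothetical helper)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        cid = content_id(data)
        # The same bytes always map to the same key, so a second copy
        # coming from another origin costs no extra storage.
        self._objects.setdefault(cid, data)
        return cid

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = ContentStore()
    first = store.add(b"int main(void) { return 0; }\n")
    second = store.add(b"int main(void) { return 0; }\n")
    print(first == second, len(store))
```

The same idea applies one level up: two directories or revisions with identical contents get the same identifier and are stored once.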
-
Not Synced
How much data do we archive?
-
Not Synced
A bit.
-
Not Synced
So, we have passed the billion revision
mark a few weeks ago.
-
Not Synced
This graph is a bit old, but anyway,
you have a live graph on our website.
-
Not Synced
That's more than 4.5 billion unique
source code files.
-
Not Synced
We don't actually discriminate between
what we would consider source code
-
Not Synced
and what upstream developers consider
as source code,
-
Not Synced
so everything that's in a git repository,
we consider as source code
-
Not Synced
if it's below a size threshold.
-
Not Synced
A billion revisions across 80 million
projects.
-
Not Synced
What do we archive?
-
Not Synced
We archive GitHub, we archive Debian.
-
Not Synced
For Debian, we run the archival process
every day; every day we get the new packages
-
Not Synced
that have been uploaded to the archive.
-
Not Synced
For GitHub, we try to keep up; we are currently
working on some performance improvements,
-
Not Synced
some scalability improvements to make sure
that we can keep up
-
Not Synced
with the development on GitHub.
-
Not Synced
We have archived as a one-off thing
the former content of Gitorious and Google Code
-
Not Synced
which are two prominent code hosting
spaces that closed recently
-
Not Synced
and we've been working on archiving
the contents of Bitbucket
-
Not Synced
which is kind of a challenge because
the API is a bit buggy and
-
Not Synced
Atlassian isn't too interested
in fixing it.
-
Not Synced
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB
-
Not Synced
and a kind of big database, 6TB.
-
Not Synced
The database only contains the graph of
the metadata for the archive
-
Not Synced
which is basically a graph of 8 billion nodes
and 70 billion edges.
-
Not Synced
And of course it's growing daily.
-
Not Synced
We are pretty sure this is the richest
source code archive that's available now
-
Not Synced
and it keeps growing.
-
Not Synced
So how do we actually…
-
Not Synced
What kind of stack do we use to store
all this?
-
Not Synced
We use Debian, of course.
-
Not Synced
All our deployment recipes are in Puppet
in public repositories.
-
Not Synced
We've started using Ceph
for the blob storage.
-
Not Synced
We use PostgreSQL for the metadata storage
with some of the standard tools that
-
Not Synced
live around PostgreSQL for backups
and replication.
-
Not Synced
We use a standard Python stack for
scheduling of jobs
-
Not Synced
and for web interface stuff, basically
psycopg2 for the low level stuff,
-
Not Synced
Django for the web stuff
-
Not Synced
and Celery for the scheduling of jobs.
-
Not Synced
In house, we've written an ad hoc
object storage system which has
-
Not Synced
a bunch of backends that you can use.
-
Not Synced
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…
-
Not Synced
It's a really simple object storage system
where you can just put an object,
-
Not Synced
get an object, put a bunch of objects,
get a bunch of objects.
-
Not Synced
We've implemented removal but we don't
really use it yet.
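A minimal sketch of what such a backend-agnostic interface could look like, with the four operations just listed. The class and method names are assumptions for illustration and are not taken from the actual Software Heritage code. Only an in-memory backend and a plain filesystem backend are shown.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class ObjStorage(ABC):
    """Hypothetical common interface shared by every backend."""

    @abstractmethod
    def put(self, obj_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, obj_id: str) -> bytes: ...

    # Batch operations are built on top of the single-object ones.
    def put_batch(self, objects: Dict[str, bytes]) -> None:
        for obj_id, data in objects.items():
            self.put(obj_id, data)

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]


class InMemoryObjStorage(ObjStorage):
    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, obj_id: str, data: bytes) -> None:
        self._objects[obj_id] = data

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]


class PathObjStorage(ObjStorage):
    """Stores each object as a file, sharded by the first two id characters."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, obj_id: str) -> str:
        return os.path.join(self.root, obj_id[:2], obj_id)

    def put(self, obj_id: str, data: bytes) -> None:
        path = self._path(obj_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, obj_id: str) -> bytes:
        with open(self._path(obj_id), "rb") as f:
            return f.read()


if __name__ == "__main__":
    for storage in (InMemoryObjStorage(), PathObjStorage(tempfile.mkdtemp())):
        storage.put_batch({"aa01": b"first", "bb02": b"second"})
        print(storage.get_batch(["aa01", "bb02"]))
```

Callers only depend on `ObjStorage`, so swapping a local filesystem for a cloud or Ceph backend would not change the code that uses it.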
-
Not Synced
All the data model implementation,
all the listers, the loaders, the schedulers
-
Not Synced
everything has been written by us,
it's a pile of Python code.
-
Not Synced
So, basically 20 Python packages and
around 30 Puppet modules
-
Not Synced
to deploy all that and we've released everything
under a copyleft license,
-
Not Synced
GPLv3 for the backend and AGPLv3
for the frontend.
-
Not Synced
Even if people try and make their own
Software Heritage using our code,
-
Not Synced
they have to publish their changes.
-
Not Synced
Hardware-wise, we run for now everything
on a few hypervisors in house and
-
Not Synced
our main storage is currently still
on a very high density, very slow,
-
Not Synced
very bulky storage array, but we've
started to migrate all of this
-
Not Synced
into a Ceph storage cluster which
we're gonna grow as we need
-
Not Synced
in the next few months.
-
Not Synced
We've also been granted by Microsoft
sponsorship, ??? sponsorship
-
Not Synced
for their cloud services.
-
Not Synced
We've started putting mirrors of everything
in their infrastructure as well
-
Not Synced
which means full object storage mirror,
so 170TB of stuff mirrored on Azure
-
Not Synced
as well as a database mirror for the graph.
-
Not Synced
And we're also doing all the content
indexing and all the things that need
-
Not Synced
scalability on Azure now.
-
Not Synced
Finally, at the University of Bologna,
we have a backend storage for the download
-
Not Synced
so currently our main storage is
quite slow so if you want to download
-
Not Synced
a bundle of things that we've archived,
then we actually keep a cache of
-
Not Synced
what we've done so that it doesn't take
a million years to download stuff.
-
Not Synced
We do our development in a classic free
and open source software way,
-
Not Synced
so we talk on our mailing list, on IRC,
on a forge.
-
Not Synced
Everything is in English, everything is
public, there is more information
-
Not Synced
on our website if you want to actually
have a look and see what we do.
-
Not Synced
So, all that is very interesting but how
do we actually look into it?
-
Not Synced
One of the ways that you can browse,
that you can use the archive
-
Not Synced
is using a REST API.
-
Not Synced
Basically, this API allows you to do
pointwise browsing of the archive
-
Not Synced
so you can go and follow the links
in a graph,
-
Not Synced
which is very slow but gives you pretty
much full access to the data.
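A sketch of what pointwise browsing looks like from a client: fetch one revision, read the link to its parent, fetch that, and so on. The `/api/1/revision/<id>/` path and the `parents`/`url` fields are assumptions about the API layout and should be checked against the API index. The `fetch` parameter is a hypothetical hook, so the example runs here against canned data instead of the network.

```python
from typing import Callable, Dict, Iterator

# Assumed API root; only the host name comes from the talk.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def walk_history(revision_id: str, fetch: Callable[[str], Dict],
                 limit: int = 10) -> Iterator[Dict]:
    """Follow first-parent links one request at a time."""
    url = f"{API_ROOT}/revision/{revision_id}/"
    for _ in range(limit):
        revision = fetch(url)  # one round trip per node of the graph
        yield revision
        parents = revision.get("parents") or []
        if not parents:
            return
        url = parents[0]["url"]


if __name__ == "__main__":
    # Canned responses standing in for the real service.
    fake = {
        f"{API_ROOT}/revision/r2/": {
            "id": "r2",
            "parents": [{"id": "r1", "url": f"{API_ROOT}/revision/r1/"}],
        },
        f"{API_ROOT}/revision/r1/": {"id": "r1", "parents": []},
    }
    for rev in walk_history("r2", fake.__getitem__):
        print(rev["id"])
```

Against the live service, `fetch` could be a small function that opens the URL and decodes the JSON response. Each step costs one HTTP request, which is why this style of access is slow for large traversals.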
-
Not Synced
There's an index for the API that you can
look at, but that's not really convenient,
-
Not Synced
so we also have a web user interface.
-
Not Synced
It's in preview right now, we're gonna do
a full launch in the month of June.
-
Not Synced
If you go to
https://archive.softwareheritage.org/browse/
-
Not Synced
with the given credentials, you can
have a look and see what's going on.
-
Not Synced
Basically, we have a web interface that
allows you to look at
-
Not Synced
what origins we have downloaded, when
we have downloaded the origins
-
Not Synced
with a kind of graph view of how often
we visited the origins
-
Not Synced
and a calendar view of when we have
visited the origins.
-
Not Synced
And then, inside the visits, you can
actually browse the contents
-
Not Synced
that we've archived.
-
Not Synced
So, for instance, this is the Python
repository as of May 2017
-
Not Synced
and you can have the list of files,
then drill down,
-
Not Synced
it should be pretty intuitive.
-
Not Synced
If you look at the history of a project,
you can see the differences
-
Not Synced
between two revisions of a project.
-
Not Synced
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.