Hi, thank you.
I'm Nicolas Dandrimont and I will indeed
be talking to you about
Software Heritage.
I'm a software engineer for this project.
I've been working on it for 3 years now.
And we'll see what this thing is all about.
[Mic not working]
I guess the batteries are out.
So, let's try that again.
So, we all know, we've been doing
free software for a while,
that software source code is something
special.
Why is that?
As Harold Abelson said in SICP, his
textbook on programming,
programs must be written for people to read,
and only incidentally for machines to execute.
Basically, what software source code
provides us is a way inside
the mind of the designer of the program.
For instance, you can get inside
very crazy algorithms that can do
very fast inverse square roots
for 3D, that kind of stuff,
like in the Quake III source code.
You can also get inside the algorithms
that underpin the internet,
for instance the network queueing
algorithms in the Linux kernel.
What we are building as the free software
community is the free software commons.
Basically, the commons is all the cultural
and social and natural resources
that we share and that everyone
has access to.
More specifically, the software commons
is what we are building
with software that is open and that is
available for all to use, to modify,
to execute, to distribute.
We know that the software commons is a really
critical part of our shared heritage.
Who's taking care of it?
Software is fragile.
Like all digital information, you can lose
software.
People can decide to shut down hosting
spaces because of business decisions.
People can hack into software hosting
platforms and remove the code maliciously
or just inadvertently.
And, of course, for the obsolete stuff,
there's rot.
If you don't care about the data, then
it rots and it decays and you lose it.
So, where is the archive we go to
when something is lost,
when GitLab goes away, when GitHub
goes away?
Where do we go?
Finally, there's one last thing that we
noticed: there are
a lot of teams that do research
on software,
and there's no real large-scale
infrastructure for research on code.
There's tons of critical issues around
code: safety, security, verification, proofs.
Nobody's doing this at a very large scale.
If you want to see the stars, you go to
the Atacama desert and
you point a telescope at the sky.
Where is the telescope for source code?
That's what Software Heritage wants to be.
What we do is we collect, we preserve
and we share all the software
that is publicly available.
Why do we do that? We do that to
preserve the past, to enhance the present
and to prepare for the future.
What we're building is a base infrastructure
that can be used
for cultural heritage, for industry,
for research and for education purposes.
How do we do it? We do it with an open
approach.
Every single line of code that we write
is free software.
We do it transparently, everything that
we do, we do it in the open,
be that on a mailing list or on
our issue tracker.
And we strive to do it for the very long
haul, so we do it with replication in mind
so that no single entity has full control
over the data that we collect.
And we do it in a non-profit fashion
so that we avoid
business-driven decisions impacting
the project.
So, what do we do concretely?
We do archiving of version control systems.
What does that mean?
It means we archive file contents, so
source code, files.
We archive revisions, which means all the
metadata of the history of the projects,
we try to download it and we put it inside
a common data model that is
shared across all the archive.
We archive releases of the software,
releases that have been tagged
in a version control system as well as
releases that we can find as tarballs
because sometimes the two views of
the source code differ.
Of course, we archive where and when
we've seen the data that we've collected.
All of this, we put inside a canonical,
VCS-agnostic, data model.
If you have a Debian package, with its
history, if you have a git repository,
if you have a subversion repository, if
you have a mercurial repository,
it all looks the same and you can work
on it with the same tools.
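To make that shared data model concrete, here is a minimal sketch, with hypothetical names rather than the real schema, of the kind of node types a VCS-agnostic model could be built from:

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch (hypothetical names, not the real schema) of a
# VCS-agnostic data model: whatever the origin (git, mercurial,
# subversion, a Debian package), everything is mapped onto the same
# few node types, so one set of tools works on all of it.

@dataclass
class Content:
    sha1: str            # intrinsic identifier of the file bytes
    length: int

@dataclass
class Revision:
    directory: str       # id of the source tree this revision points to
    parents: List[str]   # ids of the parent revisions (the history)
    author: str
    committer: str
    message: str

@dataclass
class Release:
    name: str            # e.g. "v2.7.0", from a tag or a tarball
    target: str          # id of the revision being released

# A git commit and a Debian upload both end up as a Revision:
r = Revision(directory="ab12", parents=[],
             author="Jane <jane@example.org>",
             committer="Jane <jane@example.org>",
             message="Initial import")
```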
What we don't do is archive what's around
the software, for instance
the bug tracking systems or the homepages
or the wikis or the mailing lists.
There are some projects that work
in this space, for instance
the internet archive does a lot of
really good work around archiving the web.
Our goal is not to replace them, but to
work with them and be able to do
linking across all the archives that exist.
For the mailing lists, for instance,
there's the Gmane project
that does a lot of archiving of free
software mailing lists.
So our long-term vision is to play a part
in a semantic Wikipedia of software,
a Wikidata of software, where we can
hyperlink all the archives that exist
and do stuff in this area.
Quick tour of our infrastructure.
Basically, all the way to the right is
our archive.
Our archive consists of a huge graph
of all the metadata about
the files, the directories, the revisions,
the commits and the releases and
all the projects that are on top
of the graph.
We separate the file contents out into
a dedicated object storage because of
the size discrepancy: we have lots and lots
of file contents that we need to store
so we do that outside the database
that is used to store the graph.
Basically, what we archive is a set of
software origins that are
git repositories, mercurial repositories,
etc. etc.
All those origins are loaded on a
regular schedule.
If there is a very active software origin,
we're gonna archive it more often
than stale things that don't get
a lot of updates.
How do we get the list of software
origins that we archive?
We have a bunch of listers that can
crawl through the list of repositories,
for instance on GitHub or on other
hosting platforms.
We have code that can read Debian archive
metadata to make a list of the packages
that are inside this archive and can be
archived, etc.
All of this is done on a regular basis.
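The lister idea can be sketched in a few lines. This is a hedged illustration, not the project's real lister code: it assumes GitHub's public `GET /repositories` endpoint, which pages through all public repositories using the numeric id of the last repository seen as the `since` parameter.

```python
import json
from urllib.request import urlopen

def list_github_origins(fetch, since=0, pages=1):
    """Yield clone URLs of public GitHub repositories.

    `fetch` takes a URL and returns the decoded JSON body, so the
    network layer can be swapped out (or faked in tests)."""
    for _ in range(pages):
        repos = fetch(f"https://api.github.com/repositories?since={since}")
        if not repos:
            break
        for repo in repos:
            yield repo["html_url"] + ".git"
        since = repos[-1]["id"]   # resume after the last repo seen

def http_fetch(url):
    # Real network transport; GitHub rate limits apply.
    with urlopen(url) as resp:
        return json.load(resp)

# origins = list(list_github_origins(http_fetch, pages=1))
```

Each origin URL the lister yields would then be handed to a loader and scheduled for regular visits.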
We are currently working on some kind
of push mechanism so that
people or other systems can notify us
of updates.
Our goal is not to do real time archiving,
we're really in it for the long run
but we still want to be able to prioritize
stuff that people tell us is
important to archive.
The internet archive has a "save now"
button and we want to implement
something along those lines as well,
so if we know that some software project
is in danger for one reason or another,
then we can prioritize archiving it.
So this is the basic structure of a revision
in the software heritage archive.
You'll see that it's very similar to
a git commit.
The format of the metadata is pretty much
what you'll find in a git commit,
with some extensions that you don't
see here because this example is from a git commit.
So basically what we do is we take the
identifier of the directory
that the revision points to, we take the
identifier of the parent of the revision
so we can keep track of the history
and then we add some metadata,
authorship and committership information
and the revision message, and then we take
a hash of all this,
which gives us an identifier that's probably
unique, very very probably unique.
Using those identifiers, we can retrace
all the origins, all the history of
development of the project, and we can
deduplicate across the whole archive.
All the identifiers are intrinsic, which
means that we compute them
from the contents of the things that
we are archiving, which means that
we can deduplicate very efficiently
across all the data that we archive.
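Since the talk says these revisions are hashed much like git commits, a minimal sketch of such an intrinsic identifier can follow git's own scheme: serialize the revision's fields into a manifest, frame it as a git object, and take a SHA-1 of the result. The field layout below mirrors git's commit format; it is an illustration, not the project's exact code.

```python
import hashlib

def revision_id(directory, parents, author, committer, message):
    """Compute a git-style intrinsic identifier for a revision.

    The id is a SHA-1 over the revision's own content (tree id,
    parent ids, author, committer, message), so identical revisions
    hash to the same id and deduplicate for free."""
    lines = [f"tree {directory}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode()
    # git frames the object as b"commit <length>\0<body>" before hashing
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()
```

Because the identifier depends only on the content, two loaders visiting the same repository from different mirrors compute the same ids and the archive stores each object once.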
How much data do we archive?
A bit.
So, we passed the billion-revision
mark a few weeks ago.
This graph is a bit old, but anyway,
you have a live graph on our website.
That's more than 4.5 billion unique
source code files.
We don't actually discriminate between
what we would consider source code
and what upstream developers consider
source code,
so everything that's in a git repository,
we consider source code,
as long as it's below a size threshold.
A billion revisions across 80 million
projects.
What do we archive?
We archive GitHub, we archive Debian.
For Debian, we run the archival process
every day; every day we get the new packages
that have been uploaded to the archive.
Github, we try to keep up, we are currently
working on some performance improvements,
some scalability improvements to make sure
that we can keep up
with the development on GitHub.
We have archived as a one-off thing
the former content of Gitorious and Google Code
which are two prominent code hosting
spaces that closed recently
and we've been working on archiving
the contents of Bitbucket
which is kind of a challenge because
the API is a bit buggy and
Atlassian isn't too interested
in fixing it.
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB,
and a fairly big database: 6TB.
The database only contains the graph of
the metadata for the archive,
which is basically a graph with
8 billion nodes and 70 billion edges.
And of course it's growing daily.
We are pretty sure this is the richest
source code archive that's available now
and it keeps growing.
So how do we actually…
What kind of stack do we use to store
all this?
We use Debian, of course.
All our deployment recipes are in Puppet
in public repositories.
We've started using Ceph
for the blob storage.
We use PostgreSQL for the metadata storage,
with some of the standard tools that
live around PostgreSQL for backups
and replication.
We use a standard Python stack for
the scheduling of jobs
and for the web interface: basically
psycopg2 for the low-level stuff,
Django for the web stuff
and Celery for the scheduling of jobs.
In house, we've written an ad hoc
object storage system which has
a bunch of backends that you can use.
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, and so on.
It's a really simple object storage system
where you can just put an object,
get an object, put a bunch of objects,
get a bunch of objects.
We've implemented removal but we don't
really use it yet.
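The put/get interface described above can be sketched with an in-memory backend. The class and method names here are illustrative, not the project's actual API; the point is that a content-addressed store only needs this small surface, so filesystem, Ceph or Azure backends are interchangeable.

```python
import hashlib

class InMemoryObjStorage:
    """Minimal content-addressed object storage: objects are keyed
    by the hash of their own bytes, so storage is deduplicating and
    `put` is idempotent."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        obj_id = hashlib.sha1(data).hexdigest()
        self._objects[obj_id] = data      # same bytes -> same key
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

    def put_batch(self, blobs):
        # Batch variants mirror the "put/get a bunch of objects" calls
        return [self.put(b) for b in blobs]

    def get_batch(self, obj_ids):
        return [self.get(i) for i in obj_ids]
```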
All the data model implementation,
all the listers, the loaders, the schedulers
everything has been written by us,
it's a pile of Python code.
So, basically 20 Python packages and
around 30 Puppet modules
to deploy all that, and we've done everything
under a copyleft license:
GPLv3 for the backend and AGPLv3
for the frontend.
Even if people try and make their own
Software Heritage using our code,
they have to publish their changes.
Hardware-wise, for now we run everything
on a few hypervisors in house, and
our main storage is currently still
on a very high density, very slow,
very bulky storage array, but we've
started to migrate all of this
into a Ceph storage cluster which
we're gonna grow as we need
in the next few months.
We've also been granted sponsorship
by Microsoft for their cloud services.
We've started putting mirrors of everything
in their infrastructure as well,
which means a full object storage mirror,
so 170TB of stuff mirrored on Azure,
as well as a database mirror for the graph.
And we're also doing all the content
indexing and all the things that need
scalability on Azure now.
Finally, at the University of Bologna,
we have a backend storage for downloads:
currently our main storage is
quite slow, so if you want to download
a bundle of things that we've archived,
we keep a cache of
what we've produced so that it doesn't take
a million years to download stuff.
We do our development in a classic free
and open source software way,
so we talk on our mailing list, on IRC,
on a forge.
Everything is in English, everything is
public, there is more information
on our website if you want to actually
have a look and see what we do.
So, all that is very interesting but how
do we actually look into it?
One of the ways that you can browse,
that you can use the archive
is using a REST API.
Basically, this API allows you to do
pointwise browsing of the archive
so you can go and follow the links
in a graph,
which is very slow but gives you pretty
much full access to the data.
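This pointwise browsing can be sketched as follows: each API call returns one node of the graph plus links to its neighbours, and you follow those links one request at a time. The endpoint path and field names below are assumptions based on the talk; check the API index for the real ones.

```python
import json
from urllib.request import urlopen

API = "https://archive.softwareheritage.org/api/1"

def http_get(path):
    # Real transport over the public API.
    with urlopen(API + path) as resp:
        return json.load(resp)

def walk_history(fetch, revision_id, limit=5):
    """Follow parent links from one revision, collecting commit messages.

    `fetch` maps an API path to a decoded JSON body, so it can be
    `http_get` or a fake in tests."""
    queue, messages = [revision_id], []
    while queue and len(messages) < limit:
        rev = fetch(f"/revision/{queue.pop(0)}/")
        messages.append(rev["message"])
        # assumed response shape: parents listed as {"id": ...} objects
        queue.extend(p["id"] for p in rev.get("parents", []))
    return messages
```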
There's an index for the API that you can
look at, but that's not really convenient,
so we also have a web user interface.
It's in preview right now, we're gonna do
a full launch in the month of June.
If you go to
https://archive.softwareheritage.org/browse/
with the given credentials, you can
have a look and see what's going on.
Basically, we have a web interface that
allows you to look at
what origins we have downloaded, when
we have downloaded the origins
with a kind of graph view of how often
we visited the origins
and a calendar view of when we have
visited the origins.
And then, inside the visits, you can
actually browse the contents
that we've archived.
So, for instance, this is the Python
repository as of May 2017
and you can have the list of files,
then drill down,
it should be pretty intuitive.
If you look at the history of a project,
you can see the differences
between two revisions of a project.
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.
So, yeah, pretty cool stuff.
I should be able to do a demo as well,
it should work.
I'm gonna zoom in.
So this is the main archive, you can see
some statistics about the objects
that we've downloaded.
When you zoom in, you get some kind of
overflows, because…
Yeah, why would you do that.
If you want to browse, we can try to find
an origin.
"glibc".
So there's lots and lots of, like, random
GitHub forks of things…
We don't discriminate and we don't really
filter what we download.
We are looking into doing some relevance
kind of sorting of the results, here.
Next.
Xilinx, why not.
So, this was last downloaded
on August 3rd, 2016,
so it's probably a dead repository,
but you can see a bunch of source
code;
you can read the README of glibc.
If we go back to a more interesting origin
here's the repository for git.
I've deliberately selected an old visit
of the repo so that we can see
what was going on then.
If I look at the calendar view, you can see
that we've had some issues actually
updating this, but anyway.
If I look at the last visit, then we can
actually browse the contents,
you can get syntax highlighting as well.
This is a big big file with lots of comments
Let's see the actual source code…
Anyway, so, that's the browsing interface.
We can also now get back what we've
archived and download it,
which is kind of something that you might
want to do
if a repository is lost, you can actually
download it
and get the source code back again.
How do we do that?
If you go on the top right of this browsing
interface, you have actions and download
and you can download a directory that
you are currently looking at.
It's an asynchronous process, which means
that if there is a lot of load,
it may take some time before the content
is actually available for download.
So you can put in your email address and
we'll notify you when the download is ready.
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point
in the list of things that I've requested.
I've already requested some things that
we can actually get and open as a tarball.
Yeah, I think that's the thing that I was
actually looking at,
which is this revision of the git
source code
and then I can open it
Yay, Emacs, that's what you want.
Yay, source code.
This seems to work.
And then, of course, if you want to
actually script what you're doing,
there's an API that allows you to do
the downloads as well, so you can.
The source code is deduplicated a lot,
which means that for one single repository
there are tons of files that we have to
collect if you want to actually download
an archive of a directory.
It takes a while but we have an asynchronous
API so you can POST
the identifier of a revision to this URL
and then get status updates
and at some point, the status will tell
you that the object is available.
You can download it and you can even
download the full history of a project
and get that as a git-fast-export archive
that you can reimport into
a new git repository.
So any kind of VCS that we've imported,
you can export as a git repository
and reimport on your machine.
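This asynchronous flow can be sketched as a small polling loop. The path, field names and status values here are assumptions based on the talk, not the documented API:

```python
import time

def cook_and_wait(post, get, revision_id, poll_delay=2.0, max_polls=100):
    """Request a bundle for `revision_id`, then poll until it is cooked.

    `post` and `get` are injectable HTTP helpers taking an API path,
    so the transport (urllib, requests, a test fake) is up to the
    caller. Returns the URL of the finished bundle."""
    path = f"/vault/revision/{revision_id}/gitfast/"     # assumed path
    post(path)                            # schedule the cooking task
    for _ in range(max_polls):
        status = get(path)
        if status.get("status") == "done":                # assumed field
            return status["fetch_url"]    # bundle ready to download
        time.sleep(poll_delay)
    raise TimeoutError("bundle was not cooked in time")
```

The downloaded bundle can then be piped into `git fast-import` in a fresh repository, which is how any archived VCS comes back out as git.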
How to get involved in the project?
We have a lot of features that we're
interested in, lots of them are now
in early access or have been done.
There's some stuff that we would like
help with.
This is some stuff that we're working on:
provenance information. You have a file
content and you want to know which
repository it comes from;
that's something we're working on.
Full text search: the end goal is to be
able to trace even
snippets of code that have
been copied from one project to another.
That's something that we can look into
with the wealth of information that
we have inside the archive.
There's a lot of things that,
I mean…
There's a lot of things that people want
to do with the archive.
Our goal is to enable people to do things,
to do interesting things
with a lot of source code.
If you have an idea of what you want to do
with such an archive,
please come talk to us
and we'll be happy to help you help us.
What we want to do is to diversify
the sources of things that we archive.
Currently, we have good support for git,
we have OK support for subversion
and mercurial.
If your project of choice is in another
version control system,
we are gonna miss it.
So people can contribute in this area.
For the listing part, we have coverage of
Debian, we have coverage of GitHub;
if your code is somewhere else, we won't
see it, so we need people to contribute
stuff that can list, for instance, GitLab
instances,
and then we can integrate that into our
infrastructure and actually have
people be able to archive their GitLab
instances.
And of course, we need to spread
the word, make the project sustainable.
We have a few sponsors now: Microsoft,
Nokia, Huawei; GitHub has joined as a sponsor;
the University of Bologna; and of course
Inria is sponsoring.
But we need to keep spreading the word
and keep the project sustainable.
And, of course, we need to save endangered
source code.
For that, we have a suggestion box on
the wiki that you can add things to.
For instance, we have in the back of
our minds archiving SourceForge,
because we know that it isn't very
sustainable and it's at risk of being
taken down at some point.
If you want to join us, we also have
some job openings that are available.
For now it's in Paris, so if you want to
consider coming to work with us in Paris,
you can look into that.
That's Software Heritage.
We are building a reference archive of
all the free software
that has ever been written,
in an international, open, non-profit and
mutualised infrastructure
that we have opened up to everyone:
all users, vendors, developers can use it.
The idea is to be at the service of
the community and for society
as a whole.
So if you want to join us, you can look at
our website, you can look at our code.