Hi, thank you. I'm Nicolas Dandrimont and I will indeed be talking to you about Software Heritage. I'm a software engineer on this project; I've been working on it for three years now. And we'll see what this thing is all about. [Mic not working] I guess the batteries are out. So, let's try that again.

So, we all know, having done free software for a while, that software source code is something special. Why is that? As Harold Abelson said in SICP, his textbook on programming, programs are meant to be read by people, and only incidentally for machines to execute. Basically, what software source code gives us is a way inside the mind of the designer of the program. For instance, you can get inside very clever algorithms that compute very fast inverse square roots for 3D graphics, like in the Quake source code. You can also get inside the algorithms that underpin the internet, for instance the network queueing algorithms in the Linux kernel.

What we are building as the free software community is the free software commons. The commons is all the cultural, social and natural resources that we share and that everyone has access to. More specifically, the software commons is what we are building with software that is open and available for all to use, modify, execute and distribute. We know that this is a really critical part of our commons. But who's taking care of it?

Software is fragile. Like all digital information, you can lose it. People can decide to shut down hosting spaces because of business decisions. People can hack into software hosting platforms and remove code, maliciously or just inadvertently. And, of course, for obsolete stuff, there's rot: if nobody cares about the data, it decays and you lose it. So where is the archive we go to when something is lost, when GitLab goes away, when GitHub goes away? Where do we go?

Finally, there's one last thing we noticed: a lot of teams do research on software, and there's no real large-scale infrastructure for research on code. There are tons of critical issues around code — safety, security, verification, proofs — and nobody is working on them at a very large scale. If you want to see the stars, you go to the Atacama desert and you point a telescope at the sky. Where is the telescope for source code? That's what Software Heritage wants to be.

What we do is collect, preserve and share all the software that is publicly available. Why do we do that? To preserve the past, enhance the present and prepare for the future. What we're building is a base infrastructure that can be used for cultural heritage, for industry, for research and for education purposes.

How do we do it? We do it with an open approach: every single line of code that we write is free software. We do it transparently: everything we do, we do in the open, be that on a mailing list or on our issue tracker. We strive to do it for the very long haul, so we design with replication in mind, so that no single entity has full control over the data that we collect. And we do it in a non-profit fashion, so that we avoid business-driven decisions impacting the project.

So, what do we do concretely? We archive version control systems. What does that mean? It means we archive file contents, that is, source code files.
We archive revisions, which means all the metadata of the history of the projects: we download it and put it inside a common data model that is shared across the whole archive. We archive releases of the software, both releases that have been tagged in a version control system and releases that we find as tarballs, because sometimes those two views of the source code differ. And of course we archive where and when we've seen the data that we collect.

All of this goes inside a canonical, VCS-agnostic data model. Whether you have a Debian package with its history, a git repository, a subversion repository or a mercurial repository, it all looks the same and you can work on it with the same tools.

What we don't archive is what's around the software, for instance the bug tracking systems, the homepages, the wikis or the mailing lists. There are projects that work in this space; for instance, the Internet Archive does a lot of really good work around archiving the web. Our goal is not to replace them, but to work with them and be able to link across all the archives that exist. For mailing lists, for instance, there's the Gmane project, which archives a lot of free software mailing lists. So our long-term vision is to play a part in a semantic Wikipedia of software, a Wikidata of software, where we can hyperlink all the archives that exist and build on top of that.

A quick tour of our infrastructure. All the way to the right is our archive. The archive consists of a huge graph of all the metadata about the files, the directories, the revisions, the commits and the releases, with the projects on top of the graph. We separate the file storage into a distinct object storage because of the size discrepancy: we have lots and lots of file contents to store, so we store them outside the database that holds the graph.

What we archive is a set of software origins: git repositories, mercurial repositories, and so on. All those origins are loaded on a regular schedule: a very active software origin will be archived more often than stale things that don't get a lot of updates. How do we get the list of software origins to archive? We have a bunch of listers that scroll through the lists of repositories on GitHub or other hosting platforms; we have code that can read Debian archive metadata to list the packages inside that archive; and so on. All of this is done on a regular basis. We are currently working on some kind of push mechanism so that people or other systems can notify us of updates. Our goal is not to do real-time archiving — we're really in it for the long run — but we still want to be able to prioritize stuff that people tell us is important to archive. The Internet Archive has a "save now" button and we want to implement something along those lines as well, so that if we know that some software project is in danger for one reason or another, we can prioritize archiving it.

So this is the basic structure of a revision in the Software Heritage archive. You'll see that it's very similar to a git commit.
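As a rough sketch of that structure — not the actual Software Heritage schema; the field names and values here are simplified and made up — a revision is a small record, and its intrinsic identifier can be computed git-style by hashing a canonical serialization of it:

```python
import hashlib

# Hypothetical, simplified revision record; the real data model has more
# fields and a precisely specified serialization.
revision = {
    "directory": "85a74718d377195e1efd0843ba4f3260bad4fe07",  # snapshot of the tree
    "parents": ["01e2d0627a9a6edb24c37db45db5ecb31e9de808"],   # link to the history
    "author": "Example Author <author@example.com> 1494000000 +0200",
    "committer": "Example Author <author@example.com> 1494000000 +0200",
    "message": "Fix the frobnicator\n",
}

def intrinsic_id(rev: dict) -> str:
    """Hash a canonical serialization of the revision, the way git hashes
    a commit object: a header, then the body, then SHA-1 over the whole thing."""
    body_lines = [f"tree {rev['directory']}"]
    body_lines += [f"parent {p}" for p in rev["parents"]]
    body_lines += [
        f"author {rev['author']}",
        f"committer {rev['committer']}",
        "",
        rev["message"],
    ]
    body = "\n".join(body_lines).encode()
    header = f"commit {len(body)}\x00".encode()
    return hashlib.sha1(header + body).hexdigest()

print(intrinsic_id(revision))  # the same content always yields the same identifier
```

The only point of the sketch is that the identifier is a pure function of the content, which is what makes deduplication across the whole archive possible.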
The format of the metadata is pretty much what you'll find in a git commit, with some extensions that you don't see here because this particular example comes from a git commit. Basically, we take the identifier of the directory that the revision points to, we take the identifiers of the parents of the revision so we can keep track of the history, we add some metadata — authorship and committership information and the revision message — and then we take a hash of all this, which gives an identifier that is very, very probably unique. Using those identifiers, we can retrace all the origins, all the history of development of a project, and we can deduplicate across the whole archive. All the identifiers are intrinsic, which means we compute them from the contents of the things we're archiving, and that lets us deduplicate very efficiently across all the data we archive.

How much data do we archive? A bit. We passed the billion-revision mark a few weeks ago. This graph is a bit old, but there's a live graph on our website. That's more than 4.5 billion unique source code files. We don't actually discriminate between what we would consider source code and what upstream developers consider source code: everything that's in a git repository, we treat as source code, as long as it's below a size threshold. A billion revisions across 80 million projects.

What do we archive? We archive GitHub, we archive Debian. For Debian we run the archival process every day, so every day we get the new packages that have been uploaded to the archive. For GitHub, we try to keep up; we're currently working on some performance and scalability improvements to make sure we can keep up with the pace of development on GitHub. As one-off efforts, we have archived the former contents of Gitorious and Google Code, two prominent code hosting spaces that closed recently, and we've been working on archiving the contents of Bitbucket, which is kind of a challenge because the API is a bit buggy and Atlassian isn't too interested in fixing it.

In concrete storage terms, the files take 175 TB of blob storage, plus a kind of big database, 6 TB. The database only contains the graph of metadata for the archive, which is basically a graph with 8 billion nodes and 70 billion edges. And of course it's growing daily. We are pretty sure this is the richest source code archive available today, and it keeps growing.

So what kind of stack do we use to store all this? We use Debian, of course. All our deployment recipes are in Puppet, in public repositories. We've started using Ceph for the blob storage. We use PostgreSQL for the metadata storage, with some of the standard tools that live around PostgreSQL for backups and replication. We use a standard Python stack for job scheduling and for the web interface: psycopg2 for the low-level stuff, Django for the web stuff and Celery for the scheduling of jobs. In house, we've written an ad hoc object storage system with a bunch of backends you can use; we are agnostic between a plain UNIX filesystem, Azure, Ceph, and so on. It's a really simple object storage system where you can just put an object, get an object, put a bunch of objects, get a bunch of objects. We've implemented removal, but we don't really use it yet. All the data model implementation, all the listers, the loaders, the schedulers — everything has been written by us; it's a pile of Python code.
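To make that "put an object, get an object" idea concrete, here is a minimal sketch of a content-addressed object store with a filesystem backend. This is an illustration only, not the actual Software Heritage objstorage API; all names are made up.

```python
import hashlib
from pathlib import Path
from typing import Iterable

class FsObjStorage:
    """Toy content-addressed object store: each object is stored under the
    hash of its content, so adding the same bytes twice is a no-op."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, obj_id: str) -> Path:
        # Shard by the first two hex characters to keep directories small.
        return self.root / obj_id[:2] / obj_id

    def add(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        path = self._path(obj_id)
        if not path.exists():               # deduplication happens here
            path.parent.mkdir(exist_ok=True)
            path.write_bytes(content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._path(obj_id).read_bytes()

    def add_batch(self, contents: Iterable[bytes]) -> list[str]:
        return [self.add(c) for c in contents]

# Usage: store a blob, get it back by its identifier.
store = FsObjStorage("/tmp/toy-objstorage")
oid = store.add(b"int main(void) { return 0; }\n")
assert store.get(oid).startswith(b"int main")
```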
So, basically 20 Python packages and around 30 Puppet modules to deploy all that, and everything is under a copyleft license: GPLv3 for the backend and AGPLv3 for the frontend. Even if people set up their own Software Heritage using our code, they have to publish their changes.

Hardware-wise, we currently run everything on a few hypervisors in house, and our main storage is still a very high-density, very slow, very bulky storage array, but we've started to migrate all of it into a Ceph storage cluster, which we're going to grow as needed over the next few months. We've also been granted ??? sponsorship by Microsoft for their cloud services, and we've started putting mirrors of everything in their infrastructure as well: a full object storage mirror, so 170 TB of stuff mirrored on Azure, as well as a database mirror for the graph. We're also doing all the content indexing, and everything else that needs scalability, on Azure now. Finally, at the University of Bologna we have a backend storage for downloads: our main storage is quite slow, so if you want to download a bundle of things we've archived, we keep a cache of what we've already assembled so that it doesn't take a million years to download.

We do our development in the classic free and open source software way: we talk on our mailing list, on IRC, on a forge. Everything is in English, everything is public, and there is more information on our website if you want to have a look and see what we do.

So, all that is very interesting, but how do you actually look into it? One of the ways you can use the archive is the REST API. This API lets you do pointwise browsing of the archive: you follow the links in the graph, which is slow but gives you pretty much full access to the data (there's a small scripted example of this at the end of this section). There's an index of the API that you can look at, but that's not very convenient, so we also have a web user interface. It's in preview right now; we're going to do a full launch in the month of June. If you go to https://archive.softwareheritage.org/browse/ with the given credentials, you can have a look and see what's going on.

Basically, the web interface lets you look at which origins we have downloaded and when we downloaded them, with a graph view of how often we visited an origin and a calendar view of when we visited it. And then, inside the visits, you can actually browse the contents that we've archived. For instance, this is the Python repository as of May 2017: you get the list of files, then you can drill down; it should be pretty intuitive. If you look at the history of a project, you can see the differences between two revisions of the project. Oh no, that's the syntax highlighting — but anyway, the diffs arrive right after.

So, yeah, pretty cool stuff. I should be able to do a demo as well; it should work. I'm going to zoom in. So this is the main archive; you can see some statistics about the objects we've downloaded. When you zoom in, you get some kind of overflow, because… yeah, why would you do that. If you want to browse, we can try to find an origin: "glibc". There are lots and lots of random GitHub forks of things; we don't discriminate and we don't really filter what we download. We are looking into doing some relevance sorting of the results here. Next. Xilinx, why not.
So, this was downloaded for the last time on August 3rd, 2016, so it's probably a dead repository, but you can see a bunch of source code, you can read the README of the glibc. If we go back to a more interesting origin, here's the repository for git. I've deliberately selected an old visit of the repo so that we can see what was going on then. If I look at the calendar view, you can see that we've had some issues actually updating this, but anyway. If I look at the last visit, then we can actually browse the contents, and you get syntax highlighting as well. This is a big, big file with lots of comments. Let's see the actual source code… Anyway, so, that's the browsing interface.

We can also get back what we've archived and download it, which is something you might want to do if a repository is lost: you can download it and get the source code back. How do we do that? On the top right of the browsing interface you have "Actions" and "Download", and you can download the directory you're currently looking at. It's an asynchronous process, which means that if there is a lot of load, it's going to take some time before you can actually download the content, so you can put in your email address and we'll notify you when the download is ready. I'm going to try my luck and just say "ok", and it will appear at some point in the list of things I've requested. I've already requested some things that we can actually get and open as a tarball. Yeah, I think that's the thing I was looking at, which is this revision of the git source code, and then I can open it. Yay, emacs, just what you want. Yay, source code. This seems to work.

And then, of course, if you want to script what you're doing, there's an API that lets you do the downloads as well. The source code is deduplicated a lot, which means that for one single repository there are tons of files we have to collect if you want to download an archive of a directory. It takes a while, but we have an asynchronous API: you POST the identifier of a revision to this URL, then get status updates, and at some point the status will tell you that the object is available (this flow is sketched below). You can download it, and you can even download the full history of a project and get it as a git fast-export archive that you can reimport into a new git repository. So any kind of VCS that we've imported, you can export as a git repository and reimport on your machine.

How can you get involved in the project? We have a lot of features that we're interested in; lots of them are now in early access or have been done, and there's some stuff we would like help with. Here is some of what we're working on. Provenance information: you have a content and you want to know which repositories it comes from; that's something we're working on. Full-text search: the end goal is to be able to trace even snippets of code that have been copied from one project to another. That's something we can look into with the wealth of information we have inside the archive.
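Coming back to the REST API mentioned earlier, here is a small sketch of pointwise browsing: walking from a revision to the directory it points to, one HTTP call per hop. The endpoint paths are my assumptions based on the public API index at https://archive.softwareheritage.org/api/ — check it for the authoritative routes; the revision identifier is just a placeholder.

```python
import requests

API = "https://archive.softwareheritage.org/api/1"
# Placeholder identifier; substitute a real revision id from the archive.
revision_id = "0000000000000000000000000000000000000000"

# Fetch the revision: message, author, parents and the directory it points to.
rev = requests.get(f"{API}/revision/{revision_id}/").json()
print(rev["message"], rev["directory"])

# Follow the link to the directory and list its entries: this is the
# "pointwise" graph browsing described above.
entries = requests.get(f"{API}/directory/{rev['directory']}/").json()
for entry in entries:
    print(entry["type"], entry["name"])
```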
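And a similar sketch of the asynchronous download flow described above: request a bundle, poll the status, then fetch it once it's ready. The exact vault endpoint and the response fields are assumptions on my part and should be checked against the API documentation; the identifier is again a placeholder.

```python
import time
import requests

API = "https://archive.softwareheritage.org/api/1"
revision_id = "0000000000000000000000000000000000000000"  # placeholder

# Ask the archive to cook a git fast-export bundle of this revision's history.
# (Assumed endpoint; see the API index for the real route.)
cook_url = f"{API}/vault/revision/{revision_id}/gitfast/"
requests.post(cook_url).raise_for_status()

# Poll until the bundle is ready; cooking is asynchronous on purpose, since
# reassembling a deduplicated repository can take a while.
while True:
    status = requests.get(cook_url).json()
    if status.get("status") == "done":
        break
    time.sleep(30)

# Download the bundle; it can then be reimported into a fresh repository
# with `git fast-import`, as mentioned above.
bundle = requests.get(status["fetch_url"])
with open("revision.gitfast.gz", "wb") as f:
    f.write(bundle.content)
```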