Software Heritage - Preserving the Free Software Commons

Hi, thank you.

I'm Nicolas Dandrimont and I will indeed
be talking to you about

Software Heritage.

I'm a software engineer for this project.

I've been working on it for 3 years now.

And we'll see what this thing is all about.

[Mic not working]

I guess the batteries are out.

So, let's try that again.

So, we all know, we've been doing
free software for a while,

that software source code is something
special.

Why is that?

As Harold Abelson has said in SICP, his
textbook on programming,

programs are meant to be read by people
and then incidentally for machines to execute.

Basically, what software source code
provides us is a way inside

the mind of the designer of the program.

For instance, you can have,
you can get inside very crazy algorithms

that can do very fast reverse square roots
for 3D, that kind of stuff

Like in the Quake 2 source code.

You can also get inside the algorithms
that are underpinning the internet,

for instance seeing the net queue
algorithm in the Linux kernel.

What we are building as the free software
community is the free software commons.

Basically, the commons is all the cultural
and social and natural resources

that we share and that everyone
has access to.

More specifically, the software commons
is what we are building

with software that is open and that is
available for all to use, to modify,

to execute, to distribute.

We know that those commons are a really
critical part of our commons.

Who's taking care of it?

The software is fragile.

Like all digital information, you can lose
software.

People can decide to shut down hosting
spaces because of business decisions.

People can hack into software hosting
platforms and remove the code maliciously

or just inadvertently.

And, of course, for the obsolete stuff,
there's rot.

If you don't care about the data, then
it rots and it decays and you lose it.

So, where is the archive we go to
when something is lost,

when GitLab goes away, when GitHub
goes away?

Where do we go?

Finally, there's one last thing that we
noticed, it's that

there's a lot of teams that work on
research on software

and there's no real big infrastructure
for research on code.

There's tons of critical issues around
code: safety, security, verification, proofs.

Nobody's doing this at a very large scale.

If you want to see the stars, you go to
the Atacama desert and

you point a telescope at the sky.

Where is the telescope for source code?

That's what Software Heritage wants to be.

What we do is we collect, we preserve
and we share all the software

that is publicly available.

Why do we do that? We do that to
preserve the past, to enhance the present

and to prepare for the future.

What we're building is a base infrastructure
that can be used

for cultural heritage, for industry,
for research and for education purposes.

How do we do it? We do it with an open
approach.

Every single line of code that we write
is free software.

We do it transparently, everything that
we do, we do it in the open,

be that on a mailing list or on
our issue tracker.

And we strive to do it for the very long
haul, so we do it with replication in mind

so that no single entity has full control
over the data that we collect.

And we do it in a non-profit fashion
so that we avoid

business-driven decisions impacting
the project.

So, what do we do concretely?

We do archiving of version control systems.

What does that mean?

It means we archive file contents, so
source code, files.

We archive revisions, which means all the
metadata of the history of the projects,

we try to download it and we put it inside
a common data model that is

shared across all the archive.

We archive releases of the software,
releases that have been tagged

in a version control system as well as
releases that we can find as tarballs

because sometimes… boof, views of
this source code differ.

Of course, we archive where and when
we've seen the data that we've collected.

All of this, we put inside a canonical,
VCS-agnostic, data model.

If you have a Debian package, with its
history, if you have a git repository,

if you have a Subversion repository, if
you have a Mercurial repository,

it all looks the same and you can work
on it with the same tools.
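
To make the idea of a common, VCS-agnostic data model a bit more concrete, here is a minimal Python sketch. The class and field names are illustrative assumptions, not the actual Software Heritage schema. It only shows how contents, directories, revisions, releases and visits could point at one another by identifier, whatever version control system they came from.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


# All names below are hypothetical; they mirror the concepts named in the talk.
@dataclass
class Content:
    id: str        # identifier of the file content
    length: int    # size in bytes


@dataclass
class Directory:
    id: str
    entries: Dict[str, str]   # entry name -> id of a Content or a Directory


@dataclass
class Revision:
    id: str
    directory: str            # id of the root directory of this revision
    parents: List[str]        # ids of the parent revisions
    author: str
    committer: str
    message: str


@dataclass
class Release:
    id: str
    name: str
    target: str               # id of the revision that was released


@dataclass
class Visit:
    origin_url: str           # where we saw the data
    date: datetime            # when we saw it
    branches: Dict[str, str]  # branch or tag name -> revision or release id
```

With a model of this shape, a visit of a git repository and a visit of a Subversion repository are both just `Visit` objects pointing at `Revision` identifiers, so the same tools can walk either one.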

What we don't do is archive what's around
the software, for instance

the bug tracking systems or the homepages
or the wikis or the mailing lists.

There are some projects that work
in this space, for instance

the Internet Archive does a lot of
really good work around archiving the web.

Our goal is not to replace them, but to
work with them and be able to do

linking across all the archives that exist.

For instance, for the mailing lists,
there's the Gmane project

that does a lot of archiving of free
software mailing lists.

So our long term vision is to play a part
in a semantic Wikipedia of software,

a Wikidata of software where we can
hyperlink all the archives that exist

and do stuff in the area.

Quick tour of our infrastructure.

Basically, all the way to the right is
our archive.

Our archive consists of a huge graph
of all the metadata about

the files, the directories, the revisions,
the commits and the releases and

all the projects that are on top
of the graph.

We keep the file storage in a separate
object storage because of

the size discrepancy: we have lots and lots
of file contents that we need to store

so we do that outside the database
that is used to store the graph.

Basically, what we archive is a set of
software origins that are

git repositories, Mercurial repositories,
etc. etc.

All those origins are loaded on a
regular schedule.

If there is a very active software origin,
we're gonna archive it more often

than stale things that don't get
a lot of updates.
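
One simple way to get that behaviour is to adapt the delay between two visits of an origin to what the last visit found. The sketch below is only an illustration: the halving and doubling policy, the bounds and the function name are assumptions, not the scheduler that the project actually runs.

```python
from datetime import timedelta

# Assumed bounds, chosen only for the example.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay to wait before visiting an origin again.

    If the last visit found new data, come back twice as soon;
    if nothing changed, wait twice as long. The result is clamped
    between MIN_INTERVAL and MAX_INTERVAL.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

An active origin converges towards the minimum delay, while a stale one drifts towards the maximum and costs almost nothing to keep checking.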

How do we get the list of software
origins that we archive?

We have a bunch of listers that can
scroll through the list of repositories,

for instance on GitHub or other
hosting platforms.

We have code that can read Debian archive
metadata to make a list of the packages

that are inside this archive and can be
archived, etc.

All of this is done on a regular basis.

We are currently working on some kind
of push mechanism so that

people or other systems can notify us
of updates.

Our goal is not to do real time archiving,
we're really in it for the long run

but we still want to be able to prioritize
stuff that people tell us is

important to archive.

The Internet Archive has a "save now"
button and we want to implement

something along those lines as well,

so if we know that some software project
is in danger for one reason or another,

then we can prioritize archiving it.

So this is the basic structure of a revision
in the Software Heritage archive.

You'll see that it's very similar to
a git commit.

The format of the metadata is pretty much
what you'll find in a git commit

with some extensions that you don't
see here because this is from a git commit

So basically what we do is we take the
identifier of the directory

that the revision points to, we take the
identifier of the parent of the revision

so we can keep track of the history

and then we add some metadata,
authorship and committership information

and the revision message and then we take
a hash of this,

it makes an identifier that's probably
unique, very very probably unique.
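
Since the revision format is pretty much a git commit, the hashing step can be sketched with git's own scheme: a SHA-1 over a short header followed by the commit text. The Python below is a minimal illustration under that assumption. The function names are made up for the example, the author and committer strings are assumed to already be in git's "Name <email> timestamp timezone" form, and the extensions mentioned above are not modelled.

```python
import hashlib
from typing import List


def git_object_id(object_type: str, payload: bytes) -> str:
    """SHA-1 of a git-style header ("<type> <size>" plus a NUL byte) and the payload."""
    header = f"{object_type} {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


def revision_id(directory: str, parents: List[str], author: str,
                committer: str, message: str) -> str:
    """Identifier of a revision, computed only from the revision's own fields."""
    lines = [f"tree {directory}"]                  # the directory the revision points to
    lines += [f"parent {p}" for p in parents]      # the history links
    lines += [f"author {author}", f"committer {committer}"]
    lines += ["", message]                         # blank line, then the message
    return git_object_id("commit", "\n".join(lines).encode("utf-8"))
```

Because the identifier depends only on the fields of the revision, two archives that see the same commit compute the same identifier without talking to each other, and changing any field, even one character of the message, gives a different one.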

Using those identifiers, we can retrace
all the origins, all the history of

development of the project and we can
deduplicate across all the archive.

All the identifiers are intrinsic, which
means that we compute them

from the contents of the things that
we are archiving, which means that

we can deduplicate very efficiently
across all the data that we archive.

How much data do we archive?

A bit.

So, we have passed the billion revision
mark a few weeks ago.

This graph is a bit old, but anyway,
you have a live graph on our website.

That's more than 4.5 billion unique
source code files.

We don't actually discriminate between
what we would consider is source code

and what upstream developers consider
as source code,

so everything that's in a git repository,
we consider as source code

if it's below a size threshold.

A billion revisions across 80 million
projects.

What do we archive?

We archive GitHub, we archive Debian.

So, for Debian, we run the archival process
every day; every day we get the new packages

that have been uploaded in the archive.

GitHub, we try to keep up, we are currently
working on some performance improvements,

some scalability improvements to make sure
that we can keep up

with the development on GitHub.

We have archived as a one-off thing
the former content of Gitorious and Google Code

which are two prominent code hosting
spaces that closed recently

and we've been working on archiving
the contents of Bitbucket

which is kind of a challenge because
the API is a bit buggy and

Atlassian isn't too interested
in fixing it.

In concrete storage terms, we have 175TB
of blobs, so the files take 175TB

and a kind of big database, 6TB.

The database only contains the graph of
the metadata for the archive

which is basically a graph of 8 billion nodes
and 70 billion edges.

And of course it's growing daily.

We are pretty sure this is the richest
source code archive that's available now

and it keeps growing.

So how do we actually…

What kind of stack do we use to store
all this?

We use Debian, of course.

All our deployment recipes are in Puppet
in public repositories.

We've started using Ceph
for the blob storage.

We use PostgreSQL for the metadata storage,
with some of the standard tools that

live around PostgreSQL for backups
and replication.

We use a standard Python stack for
scheduling of jobs

and for web interface stuff, basically
psycopg2 for the low level stuff,

Django for the web stuff

and Celery for the scheduling of jobs.

In house, we've written an ad hoc
object storage system which has

a bunch of backends that you can use.

Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…

It's a really simple object storage system
where you can just put an object,

get an object, put a bunch of objects,
get a bunch of objects.
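
The four operations just listed fit in a very small interface. The Python sketch below shows one possible shape for it, with an in-memory backend standing in for a filesystem or a cloud service. The class and method names are hypothetical, and keying objects by the SHA-1 of their content is an assumption made for the example.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjStorage:
    """Hypothetical backend; other backends would expose the same four methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store one object and return its identifier."""
        obj_id = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return one object; raises KeyError if the identifier is unknown."""
        return self._objects[obj_id]

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        """Store a bunch of objects and return their identifiers in order."""
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        """Return a bunch of objects in the order they were asked for."""
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._objects)
```

Keeping the interface this small is what makes it cheap to add a backend: anything that can store and return a blob of bytes under a key can implement it.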

We've implemented removal but we don't
really use it yet.

All the data model implementation,
all the listers, the loaders, the schedulers

everything has been written by us,
it's a pile of Python code.

So, basically 20 Python packages and
around 30 Puppet modules

to deploy all that and we've released everything
under a copyleft license,

GPLv3 for the backend and AGPLv3
for the frontend.

Even if people try and make their own
Software Heritage using our code,

they have to publish their changes.

Hardware-wise, we run for now everything
on a few hypervisors in house and

our main storage is currently still
on a very high density, very slow,

very bulky storage array, but we've
started to migrate all this thing

into a Ceph storage cluster which
we're gonna grow as we need

in the next few months.

We've also been granted by Microsoft
sponsorship, ??? sponsorship

for their cloud services.

We've started putting mirrors of everything
in their infrastructure as well

which means full object storage mirror,
so 170TB of stuff mirrored on Azure

as well as a database mirror for the graph.

And we're also doing all the content
indexing and all the things that need

scalability on Azure now.

Finally, at the University of Bologna,
we have a backend storage for the download

so currently our main storage is
quite slow so if you want to download

a bundle of things that we've archived,
then we actually keep a cache of

what we've done so that it doesn't take
a million years to download stuff.

We do our development in a classic free
and open source software way,

so we talk on our mailing list, on IRC,
on a forge.

Everything is in English, everything is
public, there is more information

on our website if you want to actually
have a look and see what we do.

So, all that is very interesting but how
do we actually look into it?

One of the ways that you can browse,
that you can use the archive

is using a REST API.

Basically, this API allows you to do
pointwise browsing of the archive

so you can go and follow the links
in a graph,

which is very slow but gives you pretty
much full access to the data.

There's an index for the API that you can
look at, but that's not really convenient,

so we also have a web user interface.

It's in preview right now, we're gonna do
a full launch in the month of June.

If you go to
https://archive.softwareheritage.org/browse/

with the given credentials, you can
have a look and see what's going on.

Basically, we have a web interface that
allows you to look at

what origins we have downloaded, when
we have downloaded the origins

with a kind of graph view of how often
we visited the origins

and a calendar view of when we have
visited the origins.

And then, inside the visits, you can
actually browse the contents

that we've archived.

So, for instance, this is the Python
repository as of May 2017

and you can have the list of files,
then drill down,

it should be pretty intuitive.

If you look at the history of a project,
you can see the differences

between two revisions of a project.

Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.

So, yeah, pretty cool stuff.

I should be able to do a demo as well,
it should work.

I'm gonna zoom in.

So this is the main archive, you can see
some statistics about the objects

that we've downloaded.

When you zoom in, you get some kind of
overflows, because…

Yeah, why would you do that.

If you want to browse, we can try to find
an origin.

"glibc".

So there's lots and lots of, like, random
GitHub forks of things…

We don't discriminate and we don't really
filter what we download.

We are looking into doing some relevance
kind of sorting of the results, here.

Next.

Xilinx, why not.

So, this was downloaded for the last
time on August 3rd 2016,

so it's probably a dead repository,

but yeah, you can see a bunch of source
code,

you can read the README of the glibc.

If we go back to a more interesting origin

here's the repository for git.

I've deliberately selected an old visit
of the repo so that we can see

what was going on then.

If I look at the calendar view, you can see
that we've had some issues actually

updating this, but anyway.

If I look at the last visit, then we can
actually browse the contents,

you can get syntax highlighting as well.

This is a big big file with lots of comments

Let's see the actual source code…

Anyway, so, that's the browsing interface.

We can also now get back what we've
archived and download it,

which is kind of something that you might
want to do

if a repository is lost, you can actually
download it

and get the source code back again.

How we do that.

If you go on the top right of this browsing
interface, you have actions and download

and you can download a directory that
you are currently looking at.

It's an asynchronous process, which means
that if there is a lot of load,

then it's gonna take some time to
actually be able to download the content.

So you can put in your email address so we
can notify you when the download is ready.

I'm gonna try my luck and say just "ok"
and it's gonna appear at some point

in the list of things that I've requested.

I've already requested some things that
we can actually get and open as a tarball.

Yeah, I think that's the thing that I was
actually looking at,

which is this revision of the git
source code

and then I can open it

Yay, Emacs, that's what you want.

Yay, source code.

This seems to work.

And then, of course, if you want to
actually script what you're doing,

there's an API that allows you to do
the downloads as well, so you can.

The source code is deduplicated a lot,
which means that for one single repository

you get tons of files that we have to
collect if you want to actually download

an archive of a directory.

It takes a while but we have an asynchronous
API so you can POST

the identifier of a revision to this URL
and then get status updates

and at some point, it will tell you that
the… here

The status will tell you that the object
is available.

You can download it and you can even
download the full history of a project

and get that as a git-fast-export archive
that you can reimport into

a new git repository.

So any kind of VCS that we've imported,
you can export as a git repository

and reimport on your machine.
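
The request-then-poll flow described above can be sketched in a few lines of Python. The two callables stand in for the HTTP calls (a POST to ask for the bundle, a GET to read its status); the function name, the "status" field and the "done" and "failed" values are assumptions made for the example and are not taken from the API documentation.

```python
import time
from typing import Callable, Dict


def wait_for_bundle(request: Callable[[], Dict[str, str]],
                    poll: Callable[[], Dict[str, str]],
                    delay: float = 5.0,
                    max_polls: int = 120) -> Dict[str, str]:
    """Ask for a bundle to be prepared, then poll until it is ready.

    `request` performs the initial call (for instance an HTTP POST) and
    `poll` fetches a status update (for instance an HTTP GET). Both
    return a dict with at least a "status" key.
    """
    status = request()
    polls = 0
    while status.get("status") not in ("done", "failed"):
        if polls >= max_polls:
            raise TimeoutError("bundle still not ready, giving up")
        time.sleep(delay)          # be gentle with the server between polls
        status = poll()
        polls += 1
    if status.get("status") == "failed":
        raise RuntimeError("bundle preparation failed")
    return status
```

Passing the HTTP calls in as functions keeps the sketch independent of any particular client library. Once a git-fast-export bundle has been downloaded, it can be fed on standard input to `git fast-import` inside a freshly initialised repository.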

How to get involved in the project?

We have a lot of features that we're
interested in, lots of them are now

in early access or have been done.

There's some stuff that we would like
help with.

This is some stuff that we're working on:

provenance information, you have a content

you want to know which repository
it comes from,

that's something we're on.

Full text search, the end goal is to be
able even to trace

the source of snippets of code that have
been copied from one project to another.

That's something that we can look into
with the wealth of information that

we have inside the archive.

There's a lot of things that,

I mean…

There's a lot of things that people want
to do with the archive.

Our goal is to enable people to do things,
to do interesting things

with a lot of source code.

If you have an idea of what you want to do
with such an archive,

please come talk to us

and we'll be happy to help you help us.

What we want to do is to diversify
the sources of things that we archive.

Currently, we have good support for git,
we have OK support for Subversion

and Mercurial.

If your project of choice is in another
version control system,

we are gonna miss it.

So people can contribute in this area.

For the listing part, we have coverage of
Debian, we have coverage of GitHub,

if your code is somewhere else, we won't
see it, so we need people to contribute

stuff that can list for instance GitLab
instances,

and then we can integrate that in our
infrastructure and actually have

people be able to archive their GitLab
instances.

And of course, we need to spread
the word, make the project sustainable.

We have a few sponsors now: Microsoft,
Nokia, Huawei, GitHub has joined as a sponsor,

the University of Bologna, and of course Inria
is sponsoring.

But we need to keep spreading the word
and keep the project sustainable.

And, of course, we need to save endangered
source code.

For that, we have a suggestion box on
the wiki that you can add things to.

For instance, we have in the back of
our minds archiving SourceForge,

because we know that this isn't very
sustainable and that it's at risk of being

taken down at some point.

If you want to join us, we also have
some job openings that are available.

For now it's in Paris, so if you want to
consider coming to work with us in Paris,

you can look into that.

That's Software Heritage.

We are building a reference archive of
all the free software

that has ever been written

in an international, open, non-profit and
mutualised infrastructure

that we have opened up to everyone,
all users, vendors, developers can use it.

The idea is to be at the service of
the community and for society

as a whole.

So if you want to join us, you can look at
our website, you can look at our code.

Title:: Software Heritage - Preserving the Free Software Commons
Video Language:: English
Team:: Debconf
Project:: 2018_mini-debconf-hamburg
Duration:: 41:31
