-
Not Synced
Hi, thank you.
-
Not Synced
I'm Nicolas Dandrimont and I will indeed
be talking to you about
-
Not Synced
Software Heritage.
-
Not Synced
I'm a software engineer for this project.
-
Not Synced
I've been working on it for 3 years now.
-
Not Synced
And we'll see what this thing is all about.
-
Not Synced
[Mic not working]
-
Not Synced
I guess the batteries are out.
-
Not Synced
So, let's try that again.
-
Not Synced
So, we all know, we've been doing
free software for a while,
-
Not Synced
that software source code is something
special.
-
Not Synced
Why is that?
-
Not Synced
As Harold Abelson has said in SICP, his
textbook on programming,
-
Not Synced
programs are meant to be read by people
and then incidentally for machines to execute.
-
Not Synced
Basically, what software source code
provides us is a way inside
-
Not Synced
the mind of the designer of the program.
-
Not Synced
For instance, you can get inside
very crazy algorithms
-
Not Synced
that can do very fast inverse square roots
for 3D, that kind of stuff
-
Not Synced
Like in the Quake III Arena source code.
-
Not Synced
You can also get inside the algorithms
that are underpinning the internet,
-
Not Synced
for instance seeing the net queue
algorithm in the Linux kernel.
-
Not Synced
What we are building as the free software
community is the free software commons.
-
Not Synced
Basically, the commons is all the cultural
and social and natural resources
-
Not Synced
that we share and that everyone
has access to.
-
Not Synced
More specifically, the software commons
is what we are building
-
Not Synced
with software that is open and that is
available for all to use, to modify,
-
Not Synced
to execute, to distribute.
-
Not Synced
We know that the software commons are
a really critical part of our commons.
-
Not Synced
Who's taking care of it?
-
Not Synced
The software is fragile.
-
Not Synced
Like all digital information, you can lose
software.
-
Not Synced
People can decide to shut down hosting
spaces because of business decisions.
-
Not Synced
People can hack into software hosting
platforms and remove the code maliciously
-
Not Synced
or just inadvertently.
-
Not Synced
And, of course, for the obsolete stuff,
there's rot.
-
Not Synced
If you don't care about the data, then
it rots and it decays and you lose it.
-
Not Synced
So, where is the archive we go to
when something is lost,
-
Not Synced
when GitLab goes away, when GitHub
goes away?
-
Not Synced
Where do we go?
-
Not Synced
Finally, there's one last thing that we
noticed, it's that
-
Not Synced
there's a lot of teams that work on
research on software
-
Not Synced
and there's no real big infrastructure
for research on code.
-
Not Synced
There's tons of critical issues around
code: safety, security, verification, proofs.
-
Not Synced
Nobody's doing this at a very large scale.
-
Not Synced
If you want to see the stars, you go
to the Atacama desert and
-
Not Synced
you point a telescope at the sky.
-
Not Synced
Where is the telescope for source code?
-
Not Synced
That's what Software Heritage wants to be.
-
Not Synced
What we do is we collect, we preserve
and we share all the software
-
Not Synced
that is publicly available.
-
Not Synced
Why do we do that? We do that to
preserve the past, to enhance the present
-
Not Synced
and to prepare for the future.
-
Not Synced
What we're building is a base infrastructure
that can be used
-
Not Synced
for cultural heritage, for industry,
for research and for education purposes.
-
Not Synced
How do we do it? We do it with an open
approach.
-
Not Synced
Every single line of code that we write
is free software.
-
Not Synced
We do it transparently, everything that
we do, we do it in the open,
-
Not Synced
be that on a mailing list or on
our issue tracker.
-
Not Synced
And we strive to do it for the very long
haul, so we do it with replication in mind
-
Not Synced
so that no single entity has full control
over the data that we collect.
-
Not Synced
And we do it in a non-profit fashion
so that we avoid
-
Not Synced
business-driven decisions impacting
the project.
-
Not Synced
So, what do we do concretely?
-
Not Synced
We do archiving of version control systems.
-
Not Synced
What does that mean?
-
Not Synced
It means we archive file contents, so
source code, files.
-
Not Synced
We archive revisions, which means all the
metadata of the history of the projects,
-
Not Synced
we try to download it and we put it inside
a common data model that is
-
Not Synced
shared across all the archive.
-
Not Synced
We archive releases of the software,
releases that have been tagged
-
Not Synced
in a version control system as well as
releases that we can find as tarballs
-
Not Synced
because sometimes the views of
this source code differ.
-
Not Synced
Of course, we archive where and when
we've seen the data that we've collected.
-
Not Synced
All of this, we put inside a canonical,
VCS-agnostic, data model.
-
Not Synced
If you have a Debian package, with its
history, if you have a git repository,
-
Not Synced
if you have a subversion repository, if
you have a mercurial repository,
-
Not Synced
it all looks the same and you can work
on it with the same tools.
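
[Editor's sketch] To illustrate what a VCS-agnostic model buys you, here is a minimal Python sketch. The class and field names are hypothetical, chosen for illustration, and are not Software Heritage's actual schema. The point is only that once every source is mapped to the same record types, one history-walking function works whatever the origin was.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Revision:
    """One point in a project's history (hypothetical field names)."""
    id: str                   # identifier of this revision
    directory: str            # identifier of the root directory it points to
    parents: Tuple[str, ...]  # identifiers of the parent revisions
    author: str
    message: str


@dataclass(frozen=True)
class Release:
    """A named pointer to a revision, e.g. a tag or an imported tarball."""
    name: str
    target: str               # identifier of the revision being released


def history(revisions: Dict[str, Revision], head: str) -> List[str]:
    """Walk the parent links from `head`, whatever VCS the data came from."""
    seen, order, stack = set(), [], [head]
    while stack:
        rev_id = stack.pop()
        if rev_id in seen or rev_id not in revisions:
            continue
        seen.add(rev_id)
        order.append(rev_id)
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(revisions[rev_id].parents))
    return order
```

The same `history` call would serve a git, subversion, mercurial or Debian-package origin, because the loader for each source is assumed to have produced `Revision` records already.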
-
Not Synced
What we don't do is archive what's around
the software, for instance
-
Not Synced
the bug tracking systems or the homepages
or the wikis or the mailing lists.
-
Not Synced
There are some projects that work
in this space, for instance
-
Not Synced
the Internet Archive does a lot of
really good work around archiving the web.
-
Not Synced
Our goal is not to replace them, but to
work with them and be able to do
-
Not Synced
linking across all the archives that exist.
-
Not Synced
For instance, for the mailing lists,
there's the Gmane project
-
Not Synced
that does a lot of archiving of free
software mailing lists.
-
Not Synced
So our long term vision is to play a part
in a semantic Wikipedia of software,
-
Not Synced
a Wikidata of software where we can
hyperlink all the archives that exist
-
Not Synced
and do stuff in the area.
-
Not Synced
Quick tour of our infrastructure.
-
Not Synced
Basically, all the way to the right is
our archive.
-
Not Synced
Our archive consists of a huge graph
of all the metadata about
-
Not Synced
the files, the directories, the revisions,
the commits and the releases and
-
Not Synced
all the projects that are on top
of the graph.
-
Not Synced
We separate the file storage into another
object storage because of
-
Not Synced
the size discrepancy: we have lots and lots
of file contents that we need to store
-
Not Synced
so we do that outside the database
that is used to store the graph.
-
Not Synced
Basically, what we archive is a set of
software origins that are
-
Not Synced
git repositories, mercurial repositories,
etc. etc.
-
Not Synced
All those origins are loaded on a
regular schedule.
-
Not Synced
If there is a very active software origin,
we're gonna archive it more often
-
Not Synced
than stale things that don't get
a lot of updates.
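
[Editor's sketch] The talk does not say how the visit frequency is chosen. As a hypothetical illustration of "active origins get visited more often", one simple policy is to halve the interval when a visit finds changes and double it when it does not, within fixed bounds. The bounds and the halve/double rule below are assumptions, not the project's actual scheduler.

```python
from datetime import timedelta

# Assumed bounds for illustration; the real scheduler's limits are not given.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    Visit sooner if the last visit found new data, later if it found none.
    """
    candidate = current / 2 if changed else current * 2
    # Clamp so that no origin is hammered or forgotten entirely.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))
```

With this rule a busy repository converges towards the minimum interval, while a stale one drifts out to the maximum.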
-
Not Synced
How do we get the list of software
origins that we archive?
-
Not Synced
We have a bunch of listers that can
scroll through the list of repositories,
-
Not Synced
for instance on GitHub or other
hosting platforms.
-
Not Synced
We have code that can read Debian archive
metadata to make a list of the packages
-
Not Synced
that are inside this archive and can be
archived, etc.
-
Not Synced
All of this is done on a regular basis.
-
Not Synced
We are currently working on some kind
of push mechanism so that
-
Not Synced
people or other systems can notify us
of updates.
-
Not Synced
Our goal is not to do real time archiving,
we're really in it for the long run
-
Not Synced
but we still want to be able to prioritize
stuff that people tell us is
-
Not Synced
important to archive.
-
Not Synced
The Internet Archive has a "save now"
button and we want to implement
-
Not Synced
something along those lines as well,
-
Not Synced
so if we know that some software project
is in danger for a reason or another,
-
Not Synced
then we can prioritize archiving it.
-
Not Synced
So this is the basic structure of a revision
in the Software Heritage archive.
-
Not Synced
You'll see that it's very similar to
a git commit.
-
Not Synced
The format of the metadata is pretty much
what you'll find in a git commit
-
Not Synced
with some extensions that you don't
see here because this is from a git commit
-
Not Synced
So basically what we do is we take the
identifier of the directory
-
Not Synced
that the revision points to, we take the
identifier of the parent of the revision
-
Not Synced
so we can keep track of the history
-
Not Synced
and then we add some metadata,
authorship and committership information
-
Not Synced
and the revision message and then we take
a hash of this,
-
Not Synced
it makes an identifier that's probably
unique, very very probably unique.
-
Not Synced
Using those identifiers, we can retrace
all the origins, all the history of
-
Not Synced
development of the project and we can
deduplicate across all the archive.
-
Not Synced
All the identifiers are intrinsic, which
means that we compute them
-
Not Synced
from the contents of the things that
we are archiving, which means that
-
Not Synced
we can deduplicate very efficiently
across all the data that we archive.
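
[Editor's sketch] The identifier computation described above can be sketched in Python. This assumes git's own commit serialization (a "commit <size>" header, a NUL byte, then the tree, parent, author, committer lines and the message, hashed with SHA-1). The talk only says the format is "very similar" to a git commit, so the archive's exact manifest may differ. The function names are made up for this example.

```python
import hashlib
from typing import Sequence


def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 of '<kind> <size>\\0<body>'."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def revision_id(directory: str, parents: Sequence[str],
                author: str, committer: str, message: str) -> str:
    """Compute an intrinsic identifier from the revision's own contents."""
    lines = [f"tree {directory}"]
    lines += [f"parent {p}" for p in parents]  # parents keep the history chain
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", body.encode())
```

Because the identifier depends only on the content, the same revision fetched from two different forks hashes to the same value, so it only needs to be stored once. Changing any field, such as a parent, gives a different identifier.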
-
Not Synced
How much data do we archive?
-
Not Synced
A bit.
-
Not Synced
So, we have passed the billion revision
mark a few weeks ago.
-
Not Synced
This graph is a bit old, but anyway,
you have a live graph on our website.
-
Not Synced
That's more than 4.5 billion unique
source code files.
-
Not Synced
We don't actually discriminate between
what we would consider is source code
-
Not Synced
and what upstream developers consider
as source code,
-
Not Synced
so everything that's in a git repository,
we consider as source code
-
Not Synced
if it's below a size threshold.
-
Not Synced
A billion revisions across 80 million
projects.
-
Not Synced
What do we archive?
-
Not Synced
We archive GitHub, we archive Debian.
-
Not Synced
So, Debian we run the archival process
every day, every day we get the new packages
-
Not Synced
that have been uploaded in the archive.
-
Not Synced
GitHub, we try to keep up, we are currently
working on some performance improvements,
-
Not Synced
some scalability improvements to make sure
that we can keep up
-
Not Synced
with the development on GitHub.
-
Not Synced
We have archived as a one-off thing
the former content of Gitorious and Google Code
-
Not Synced
which are two prominent code hosting
spaces that closed recently
-
Not Synced
and we've been working on archiving
the contents of Bitbucket
-
Not Synced
which is kind of a challenge because
the API is a bit buggy and
-
Not Synced
Atlassian isn't too interested
in fixing it.
-
Not Synced
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB
-
Not Synced
and a kind of big database, 6TB.
-
Not Synced
The database only contains the graph of
the metadata for the archive
-
Not Synced
which is basically a graph with 8 billion
nodes and 70 billion edges.
-
Not Synced
And of course it's growing daily.
-
Not Synced
We are pretty sure this is the richest
source code archive that's available now
-
Not Synced
and it keeps growing.
-
Not Synced
So how do we actually…
-
Not Synced
What kind of stack do we use to store
all this?
-
Not Synced
We use Debian, of course.
-
Not Synced
All our deployment recipes are in Puppet
in public repositories.
-
Not Synced
We've started using Ceph
for the blob storage.
-
Not Synced
We use PostgreSQL for the metadata storage
with some of the standard tools that
-
Not Synced
live around PostgreSQL for backups
and replication.
-
Not Synced
We use a standard Python stack for
scheduling of jobs
-
Not Synced
and for web interface stuff, basically
psycopg2 for the low level stuff,
-
Not Synced
Django for the web stuff
-
Not Synced
and Celery for the scheduling of jobs.
-
Not Synced
In house, we've written an ad hoc
object storage system which has
-
Not Synced
a bunch of backends that you can use.
-
Not Synced
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…
-
Not Synced
It's a really simple object storage system
where you can just put an object,
-
Not Synced
get an object, put a bunch of objects,
get a bunch of objects.
-
Not Synced
We've implemented removal but we don't
really use it yet.
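
[Editor's sketch] A minimal Python sketch of such a put/get interface with interchangeable backends. This is not the project's actual code: the class names, the choice of SHA-1 of the content as the object key, and the two-character directory sharding are all assumptions made for illustration.

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterable, List


class ObjectStorage(ABC):
    """Tiny content-addressed store; backends only implement raw I/O."""

    @abstractmethod
    def _write(self, obj_id: str, data: bytes) -> None: ...

    @abstractmethod
    def _read(self, obj_id: str) -> bytes: ...

    def put(self, data: bytes) -> str:
        # Assumption: objects are keyed by the SHA-1 of their content.
        obj_id = hashlib.sha1(data).hexdigest()
        self._write(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._read(obj_id)

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]


class MemoryStorage(ObjectStorage):
    """Backend that keeps objects in a dict (handy for tests)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def _write(self, obj_id: str, data: bytes) -> None:
        self._objects[obj_id] = data

    def _read(self, obj_id: str) -> bytes:
        return self._objects[obj_id]


class DirectoryStorage(ObjectStorage):
    """Backend that stores one file per object on a UNIX filesystem."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, obj_id: str) -> Path:
        # Shard by the first two hex digits to keep directories small.
        return self.root / obj_id[:2] / obj_id

    def _write(self, obj_id: str, data: bytes) -> None:
        path = self._path(obj_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def _read(self, obj_id: str) -> bytes:
        return self._path(obj_id).read_bytes()
```

Callers only see `put`, `get` and their batch variants, so swapping a filesystem for a cloud or Ceph backend would mean adding another subclass with its own `_write` and `_read`.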
-
Not Synced
All the data model implementation,
all the listers, the loaders, the schedulers
-
Not Synced
everything has been written by us,
it's a pile of Python code.
-
Not Synced
So, basically 20 Python packages and
around 30 Puppet modules
-
Not Synced
to deploy all that and we've released
everything under a copyleft license,
-
Not Synced
GPLv3 for the backend and AGPLv3
for the frontend.
-
Not Synced
Even if people try and make their own
Software Heritage using our code,
-
Not Synced
they have to publish their changes.
-
Not Synced
Hardware-wise, we run for now everything
on a few hypervisors in house and
-
Not Synced
our main storage is currently still
on a very high density, very slow,
-
Not Synced
very bulky storage array, but we've
started to migrate all this thing
-
Not Synced
into a Ceph storage cluster which
we're gonna grow as we need
-
Not Synced
in the next few months.
-
Not Synced
We've also been granted by Microsoft
sponsorship, ??? sponsorship
-
Not Synced
for their cloud services.
-
Not Synced
We've started putting mirrors of everything
in their infrastructure as well
-
Not Synced
which means full object storage mirror,
so 170TB of stuff mirrored on Azure
-
Not Synced
as well as a database mirror for the graph.
-
Not Synced
And we're also doing all the content
indexing and all the things that need
-
Not Synced
scalability on Azure now.
-
Not Synced
Finally, at the University of Bologna,
we have a backend storage for the download
-
Not Synced
so currently our main storage is
quite slow so if you want to download
-
Not Synced
a bundle of things that we've archived,
then we actually keep a cache of
-
Not Synced
what we've done so that it doesn't take
a million years to download stuff.
-
Not Synced
We do our development in a classic free
and open source software way,
-
Not Synced
so we talk on our mailing list, on IRC,
on a forge.
-
Not Synced
Everything is in English, everything is
public, there is more information
-
Not Synced
on our website if you want to actually
have a look and see what we do.
-
Not Synced
So, all that is very interesting but how
do we actually look into it?
-
Not Synced
One of the ways that you can browse,
that you can use the archive
-
Not Synced
is using a REST API.
-
Not Synced
Basically, this API allows you to do
pointwise browsing of the archive
-
Not Synced
so you can go and follow the links
in a graph,
-
Not Synced
which is very slow but gives you pretty
much full access to the data.
-
Not Synced
There's an index for the API that you can
look at, but that's not really convenient,
-
Not Synced
so we also have a web user interface.
-
Not Synced
It's in preview right now, we're gonna do
a full launch in the month of June.
-
Not Synced
If you go to
https://archive.softwareheritage.org/browse/
-
Not Synced
with the given credentials, you can
have a look and see what's going on.
-
Not Synced
Basically, we have a web interface that
allows you to look at
-
Not Synced
what origins we have downloaded, when
we have downloaded the origins
-
Not Synced
with a kind of graph view of how often
we visited the origins
-
Not Synced
and a calendar view of when we have
visited the origins.
-
Not Synced
And then, inside the visits, you can
actually browse the contents
-
Not Synced
that we've archived.
-
Not Synced
So, for instance, this is the Python
repository as of May 2017
-
Not Synced
and you can have the list of files,
then drill down,
-
Not Synced
it should be pretty intuitive.
-
Not Synced
If you look at the history of a project,
you can see the differences
-
Not Synced
between two revisions of a project.
-
Not Synced
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.
-
Not Synced
So, yeah, pretty cool stuff.
-
Not Synced
I should be able to do a demo as well,
it should work.
-
Not Synced
I'm gonna zoom in.
-
Not Synced
So this is the main archive, you can see
some statistics about the objects
-
Not Synced
that we've downloaded.
-
Not Synced
When you zoom in, you get some kind of
overflows, because…
-
Not Synced
Yeah, why would you do that.
-
Not Synced
If you want to browse, we can try to find
an origin.
-
Not Synced
"glibc".
-
Not Synced
So there's lots and lots of, like, random
GitHub forks of things…
-
Not Synced
We don't discriminate and we don't really
filter what we download.
-
Not Synced
We are looking into doing some relevance
kind of sorting of the results, here.
-
Not Synced
Next.
-
Not Synced
Xilinx, why not.
-
Not Synced
So, this was downloaded for the last
time on August 3rd 2016,
-
Not Synced
so it's probably a dead repository,
-
Not Synced
but yeah, you can see a bunch of source
code,
-
Not Synced
you can read the README of the glibc.
-
Not Synced
If we go back to a more interesting origin
-
Not Synced
here's the repository for git.
-
Not Synced
I've deliberately selected an old visit
of the repo so that we can see
-
Not Synced
what was going on then.
-
Not Synced
If I look at the calendar view, you can see
that we've had some issues actually
-
Not Synced
updating this, but anyway.
-
Not Synced
If I look at the last visit, then we can
actually browse the contents,
-
Not Synced
you can get syntax highlighting as well.
-
Not Synced
This is a big big file with lots of comments
-
Not Synced
Let's see the actual source code…
-
Not Synced
Anyway, so, that's the browsing interface.
-
Not Synced
We can also now get back what we've
archived and download it,
-
Not Synced
which is kind of something that you might
want to do
-
Not Synced
if a repository is lost, you can actually
download it
-
Not Synced
and get the source code back again.
-
Not Synced
How we do that.
-
Not Synced
If you go on the top right of this browsing
interface, you have actions and download
-
Not Synced
and you can download a directory that
you are currently looking at.
-
Not Synced
It's an asynchronous process, which means
that if there is a lot of load,
-
Not Synced
then it's gonna take some time to
actually be able to download the content.
-
Not Synced
So you can put in your email address so we
can notify you when the download is ready.
-
Not Synced
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point
-
Not Synced
in the list of things that I've requested.
-
Not Synced
I've already requested some things that
we can actually get and open as a tarball.
-
Not Synced
Yeah, I think that's the thing that I was
actually looking at,
-
Not Synced
which is this revision of the git
source code
-
Not Synced
and then I can open it
-
Not Synced
Yay, Emacs, that's what you want.
-
Not Synced
Yay, source code.
-
Not Synced
This seems to work.
-
Not Synced
And then, of course, if you want to
actually script what you're doing,
-
Not Synced
there's an API that allows you to do
the downloads as well, so you can.
-
Not Synced
The source code is deduplicated a lot,
which means that for one single repository
-
Not Synced
you get tons of files that we have to
collect if you want to actually download
-
Not Synced
an archive of a directory.
-
Not Synced
It takes a while but we have an asynchronous
API so you can POST
-
Not Synced
the identifier of a revision to this URL
and then get status updates
-
Not Synced
and at some point, it will tell you that
the… here
-
Not Synced
The status will tell you that the object
is available.
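
[Editor's sketch] The request-then-poll pattern described here can be sketched in Python. The talk does not give the endpoint URL or the status values, so nothing below calls the real API: the two callables stand in for "POST the identifier" and "GET the status", and the status names "done" and "failed" are assumptions.

```python
import time
from typing import Callable


def wait_for_bundle(request: Callable[[], None],
                    get_status: Callable[[], str],
                    poll_seconds: float = 5.0,
                    max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Ask for a bundle to be prepared, then poll until it is ready.

    `request` and `get_status` are placeholders for the HTTP calls;
    the terminal status names are hypothetical.
    """
    request()  # e.g. POST the revision identifier
    for _ in range(max_polls):
        status = get_status()  # e.g. GET the status of the request
        if status in ("done", "failed"):
            return status
        sleep(poll_seconds)  # the job is asynchronous, so wait and retry
    raise TimeoutError("bundle was not ready after polling")
```

In a real script the two callables would wrap whatever HTTP client you use. Once the status reports that the object is available, the script can fetch the archive itself.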
-
Not Synced
You can download it and you can even
download the full history of a project
-
Not Synced
and get that as a git-fast-export archive
that you can reimport into
-
Not Synced
a new git repository.
-
Not Synced
So any kind of VCS that we've imported,
you can export as a git repository
-
Not Synced
and reimport on your machine.
-
Not Synced
How to get involved in the project?
-
Not Synced
We have a lot of features that we're
interested in, lots of them are now
-
Not Synced
in early access or have been done.
-
Not Synced
There's some stuff that we would like
help with.
-
Not Synced
This is some stuff that we're working on:
-
Not Synced
provenance information, you have a content
-
Not Synced
you want to know which repository
it comes from,
-
Not Synced
that's something we're on.
-
Not Synced
Full text search, the end goal is to be
able even to trace
-
Not Synced
the source of snippets of code that have
been copied from one project to another.
-
Not Synced
That's something that we can look into
with the wealth of information that
-
Not Synced
we have inside the archive.
-
Not Synced
There's a lot of things that,
-
Not Synced
I mean…
-
Not Synced
There's a lot of things that people want
to do with the archive.
-
Not Synced
Our goal is to enable people to do things,
to do interesting things
-
Not Synced
with a lot of source code.
-
Not Synced
If you have an idea of what you want to do
with such an archive,
-
Not Synced
please come talk to us
-
Not Synced
and we'll be happy to help you help us.
-
Not Synced
What we want to do is to diversify
the sources of things that we archive.
-
Not Synced
Currently, we have good support for git,
we have OK support for subversion
-
Not Synced
and mercurial.
-
Not Synced
If your project of choice is in another
version control system,
-
Not Synced
we are gonna miss it.
-
Not Synced
So people can contribute in this area.
-
Not Synced
For the listing part, we have coverage of
Debian, we have coverage of GitHub,
-
Not Synced
if your code is somewhere else, we won't
see it, so we need people to contribute
-
Not Synced
stuff that can list for instance GitLab
instances,
-
Not Synced
and then we can integrate that in our
infrastructure and actually have
-
Not Synced
people be able to archive their GitLab
instances.
-
Not Synced
And of course, we need to spread
the word, make the project sustainable.
-
Not Synced
We have a few sponsors now, Microsoft,
Nokia, Huawei, GitHub has joined as a sponsor,
-
Not Synced
the University of Bologna, of course Inria
is sponsoring.
-
Not Synced
But we need to keep spreading the word
and keep the project sustainable.
-
Not Synced
And, of course, we need to save endangered
source code.
-
Not Synced
For that, we have a suggestion box on
the wiki that you can add things to.
-
Not Synced
For instance, we have in the back of
our minds archiving SourceForge,
-
Not Synced
because we know that this isn't very
sustainable and that it's at risk of being
-
Not Synced
taken down at some point.
-
Not Synced
If you want to join us, we also have
some job openings that are available.
-
Not Synced
For now it's in Paris, so if you want to
consider coming to work with us in Paris,
-
Not Synced
you can look into that.
-
Not Synced
That's Software Heritage.
-
Not Synced
We are building a reference archive of
all the free software
-
Not Synced
that's ever been written
-
Not Synced
in an international, open, non-profit and
mutualised infrastructure
-
Not Synced
that we have opened up to everyone,
all users, vendors, developers can use it.
-
Not Synced
The idea is to be at the service of
the community and for society
-
Not Synced
as a whole.
-
Not Synced
So if you want to join us, you can look at
our website, you can look at our code.