-
Title:
Software Heritage - Preserving the Free Software Commons
-
Description:
-
Hi, thank you.
-
I'm Nicolas Dandrimont and I will indeed
be talking to you about
-
Software Heritage.
-
I'm a software engineer for this project.
-
I've been working on it for 3 years now.
-
And we'll see what this thing is all about.
-
[Mic not working]
-
I guess the batteries are out.
-
So, let's try that again.
-
So, we all know, we've been doing
free software for a while,
-
that software source code is something
special.
-
Why is that?
-
As Harold Abelson has said in SICP, his
textbook on programming,
-
programs are meant to be read by people
and only incidentally for machines to execute.
-
Basically, what software source code
provides us is a way inside
-
the mind of the designer of the program.
-
For instance, you can get inside
very clever algorithms
-
that can do very fast inverse square roots
for 3D, that kind of stuff,
-
like in the Quake III Arena source code.
-
You can also get inside the algorithms
that are underpinning the internet,
-
for instance seeing the network queueing
algorithms in the Linux kernel.
-
What we are building as the free software
community is the free software commons.
-
Basically, the commons is all the cultural
and social and natural resources
-
that we share and that everyone
has access to.
-
More specifically, the software commons
is what we are building
-
with software that is open and that is
available for all to use, to modify,
-
to execute, to distribute.
-
We know that this commons is a really
critical part of our shared heritage.
-
Who's taking care of it?
-
The software is fragile.
-
Like all digital information, you can lose
software.
-
People can decide to shut down hosting
spaces because of business decisions.
-
People can hack into software hosting
platforms and remove the code maliciously
-
or just inadvertently.
-
And, of course, for the obsolete stuff,
there's rot.
-
If you don't care about the data, then
it rots and it decays and you lose it.
-
So, where is the archive we go to
when something is lost,
-
when GitLab goes away, when GitHub
goes away?
-
Where do we go?
-
Finally, there's one last thing that we
noticed:
-
there are a lot of teams that work on
research on software,
-
and there's no real big infrastructure
for research on code.
-
There's tons of critical issues around
code: safety, security, verification, proofs.
-
Nobody's doing this at a very large scale.
-
If you want to see the stars, you go to
the Atacama desert and
-
you point a telescope at the sky.
-
Where is the telescope for source code?
-
That's what Software Heritage wants to be.
-
What we do is we collect, we preserve
and we share all the software
-
that is publicly available.
-
Why do we do that? We do that to
preserve the past, to enhance the present
-
and to prepare for the future.
-
What we're building is a base infrastructure
that can be used
-
for cultural heritage, for industry,
for research and for education purposes.
-
How do we do it? We do it with an open
approach.
-
Every single line of code that we write
is free software.
-
We do it transparently, everything that
we do, we do it in the open,
-
be that on a mailing list or on
our issue tracker.
-
And we strive to do it for the very long
haul, so we do it with replication in mind
-
so that no single entity has full control
over the data that we collect.
-
And we do it in a non-profit fashion
so that we avoid
-
business-driven decisions impacting
the project.
-
So, what do we do concretely?
-
We do archiving of version control systems.
-
What does that mean?
-
It means we archive file contents, so
source code, files.
-
We archive revisions, which means all the
metadata of the history of the projects,
-
we try to download it and we put it inside
a common data model that is
-
shared across all the archive.
-
We archive releases of the software,
releases that have been tagged
-
in a version control system as well as
releases that we can find as tarballs
-
because sometimes the views of
the same source code differ.
-
Of course, we archive where and when
we've seen the data that we've collected.
-
All of this, we put inside a canonical,
VCS-agnostic, data model.
-
If you have a Debian package, with its
history, if you have a git repository,
-
if you have a subversion repository, if
you have a mercurial repository,
-
it all looks the same and you can work
on it with the same tools.
-
What we don't do is archive what's around
the software, for instance
-
the bug tracking systems or the homepages
or the wikis or the mailing lists.
-
There are some projects that work
in this space, for instance
-
the internet archive does a lot of
very good work around archiving the web.
-
Our goal is not to replace them, but to
work with them and be able to do
-
linking across all the archives that exist.
-
We can, for instance for the mailing lists
there's the gmane project
-
that does a lot of archiving of free
software mailing lists.
-
So our long term vision is to play a part
in a semantic wikipedia of software,
-
a wikidata of software where we can
hyperlink all the archives that exist
-
and do stuff in the area.
-
Quick tour of our infrastructure.
-
Basically, all the way to the right is
our archive.
-
Our archive consists of a huge graph
of all the metadata about
-
the files, the directories, the revisions,
the commits and the releases and
-
all the projects that are on top
of the graph.
-
We keep the file contents in a separate
object storage because of
-
the size discrepancy: we have lots and lots
of file contents that we need to store
-
so we do that outside of the database
that is used to store the graph.
-
Basically, what we archive is a set of
software origins that are
-
git repositories, mercurial repositories,
etc. etc.
-
All those origins are loaded on a
regular schedule.
-
If there is a very active software origin,
we're gonna archive it more often
-
than stale things that don't get
a lot of updates.
-
How do we get the list of software
origins that we archive?
-
We have a bunch of listers that can
scroll through the list of repositories,
-
for instance on GitHub or other
hosting platforms.
-
We have code that can read Debian archive
metadata to make a list of the packages
-
that are inside this archive and can be
archived, etc.
-
All of this is done on a regular basis.
-
We are currently working on some kind
of push mechanism so that
-
people or other systems can notify us
of updates.
-
Our goal is not to do real time archiving,
we're really in it for the long run
-
but we still want to be able to prioritize
stuff that people tell us is
-
important to archive.
-
The internet archive has a "save now"
button and we want to implement
-
something along those lines as well,
-
so if we know that some software project
is in danger for one reason or another,
-
then we can prioritize archiving it.
-
So this is the basic structure of a revision
in the Software Heritage archive.
-
You'll see that it's very similar to
a git commit.
-
The format of the metadata is pretty much
what you'll find in a git commit
-
with some extensions that you don't
see here, because this is from a git commit.
-
So basically what we do is we take the
identifier of the directory
-
that the revision points to, we take the
identifier of the parent of the revision
-
so we can keep track of the history
-
and then we add some metadata,
authorship and commitership information
-
and the revision message and then we take
a hash of this,
-
it makes an identifier that's probably
unique, very very probably unique.
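-
To make this concrete, here is a minimal sketch, in Python, of computing such an intrinsic identifier the way git does: a SHA1 over a typed header plus the serialized body. The tree identifier, parent identifier and author below are purely illustrative, not real archive data.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Intrinsic identifier: SHA1 over a typed header plus the content,
    exactly the scheme git uses for its objects."""
    data = b"%s %d\x00%s" % (obj_type.encode(), len(body), body)
    return hashlib.sha1(data).hexdigest()

# A commit body references the directory (tree) id, the parent revision id,
# then the authorship/committership metadata and the message.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"   # id of an (empty) tree
    b"author Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"committer Jane Doe <jane@example.org> 1500000000 +0000\n"
    b"\n"
    b"Example commit message\n"
)
commit_id = git_object_id("commit", commit_body)
```

Because the identifier is computed from the content alone, two loaders seeing the same revision anywhere in the world compute the same id, which is what makes the deduplication work.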
-
Using those identifiers, we can retrace
all the origins, all the history of
-
development of the project and we can
deduplicate across all the archive.
-
All the identifiers are intrinsic, which
means that we compute them
-
from the contents of the things that
we are archiving, which means that
-
we can deduplicate very efficiently
across all the data that we archive.
-
How much data do we archive?
-
A bit.
-
So, we passed the billion-revision
mark a few weeks ago.
-
This graph is a bit old, but anyway,
you have a live graph on our website.
-
That's more than 4.5 billion unique
source code files.
-
We don't actually discriminate between
what we would consider source code
-
and what upstream developers consider
as source code,
-
so everything that's in a git repository,
we consider as source code
-
if it's below a size threshold.
-
A billion revisions across 80 million
projects.
-
What do we archive?
-
We archive GitHub, we archive Debian.
-
For Debian, we run the archival process
every day: every day we get the new packages
-
that have been uploaded to the archive.
-
For GitHub, we try to keep up; we are currently
working on some performance improvements,
-
some scalability improvements, to make sure
that we can keep up
-
with the development on GitHub.
-
We have archived as a one-off thing the
former contents of Gitorious and Google Code
-
which are two prominent code hosting
spaces that closed recently
-
and we've been working on archiving
the contents of Bitbucket
-
which is kind of a challenge because
the API is a bit buggy and
-
Atlassian isn't too interested
in fixing it.
-
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB,
-
and a pretty big database, 6TB.
-
The database only contains the graph of
the metadata for the archive,
-
which is basically a graph of 8 billion
nodes and 70 billion edges.
-
And of course it's growing daily.
-
We are pretty sure this is the richest public
source code archive that's available now
-
and it keeps growing.
-
So how do we actually…
-
What kind of stack do we use to store
all this?
-
We use Debian, of course.
-
All our deployment recipes are in Puppet
in public repositories.
-
We've started using Ceph
for the blob storage.
-
We use PostgreSQL for the metadata storage
with some of the standard tools that
-
live around PostgreSQL for backups
and replication.
-
We use a standard Python stack for
scheduling of jobs
-
and for web interface stuff, basically
psycopg2 for the low level stuff,
-
Django for the web stuff
-
and Celery for the scheduling of jobs.
-
In house, we've written an ad hoc
object storage system which has
-
a bunch of backends that you can use.
-
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, and so on.
-
It's a really simple object storage system
where you can just put an object,
-
get an object, put a bunch of objects,
get a bunch of objects.
-
We've implemented removal but we don't
really use it yet.
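-
As a rough illustration of that put/get interface, here is a minimal content-addressed store over a plain filesystem. The class name and the sharding scheme are my own sketch, not the project's actual code.

```python
import hashlib
from pathlib import Path

class FsObjStorage:
    """Minimal content-addressed object store sketch: objects are keyed by
    the SHA1 of their content and sharded into subdirectories by key prefix."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        # Shard on the first two hex characters to keep directories small.
        return self.root / key[:2] / key

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return key

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

Because keys are content hashes, putting the same blob twice is naturally idempotent, which is the property the deduplication relies on.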
-
All the data model implementation,
all the listers, the loaders, the schedulers
-
everything has been written by us,
it's a pile of Python code.
-
So, basically 20 Python packages and
around 30 Puppet modules
-
to deploy all that, and we've released
everything under copyleft licenses:
-
GPLv3 for the backend and AGPLv3
for the frontend.
-
Even if people try and make their own
Software Heritage using our code,
-
they have to publish their changes.
-
Hardware-wise, we run for now everything
on a few hypervisors in house and
-
our main storage is currently still
on a very high density, very slow,
-
very bulky storage array, but we've
started to migrate all this thing
-
into a Ceph storage cluster which
we're gonna grow as we need
-
in the next few months.
-
We've also been granted sponsorship
by Microsoft
-
for their cloud services.
-
We've started putting mirrors of everything
in their infrastructure as well
-
which means a full object storage mirror,
so 170TB of stuff mirrored on Azure,
-
as well as a database mirror for the graph.
-
And we're also doing all the content
indexing and all the things that need
-
scalability on azure now.
-
Finally, at the University of Bologna,
we have a backend storage for downloads:
-
currently our main storage is
quite slow, so if you want to download
-
a bundle of things that we've archived,
we keep a cache of
-
what we've prepared so that it doesn't take
a million years to download stuff.
-
We do our development in a classic free
and open source software way,
-
so we talk on our mailing list, on IRC,
on a forge.
-
Everything is in English, everything is
public, there is more information
-
on our website if you want to actually
have a look and see what we do.
-
So, all that is very interesting but how
do we actually look into it?
-
One of the ways that you can browse,
that you can use the archive
-
is using a REST API.
-
Basically, this API allows you to do
pointwise browsing of the archive
-
so you can go and follow the links
in a graph,
-
which is very slow but gives you a pretty
much full access of the data.
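-
A sketch of that pointwise browsing in Python. The `/api/1/revision/` endpoint shape follows the public API documentation as I understand it and may differ; the revision hash below is illustrative.

```python
import json
from urllib.request import urlopen

API = "https://archive.softwareheritage.org/api/1"

def revision_url(sha1_git: str) -> str:
    # Pointwise lookup: one API call per node of the graph.
    return f"{API}/revision/{sha1_git}/"

def fetch(url: str) -> dict:
    """One hop in the graph: fetch a node and return its JSON description."""
    with urlopen(url) as resp:
        return json.load(resp)

# Starting from a revision, you follow the "directory" and "parents" links
# one request at a time -- slow, but it can walk the whole graph.
url = revision_url("aafb16d69fd30ff58afdd69036a26047f3aebdc6")
# rev = fetch(url)   # network call; then follow rev["directory"], rev["parents"]
```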
-
There's an index for the API that you can
look at, but that's not really convenient,
-
so we also have a web user interface.
-
It's in preview right now, we're gonna do
a full launch in the month of June.
-
If you go to
https://archive.softwareheritage.org/browse/
-
with the given credentials, you can
have a look and see what's going on.
-
Basically, we have a web interface that
allows you to look at
-
what origins we have downloaded, when
we have downloaded the origins
-
with a kind of graph view of how often
we visited the origins
-
and a calendar view of when we have
visited the origins.
-
And then, inside the visits, you can
actually browse the contents
-
that we've archived.
-
So, for instance, this is the Python
repository as of May 2017
-
and you can have the list of files,
then drill down,
-
it should be pretty intuitive.
-
If you look at the history of a project,
you can see the differences
-
between two revisions of a project.
-
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.
-
So, yeah, pretty cool stuff.
-
I should be able to do a demo as well,
it should work.
-
I'm gonna zoom in.
-
So this is the main archive, you can see
some statistics about the objects
-
that we've downloaded.
-
When you zoom in, you get some kind of
overflows, because…
-
Yeah, why would you do that.
-
If you want to browse, we can try to find
an origin.
-
"glibc".
-
So there's lots and lots of, like, random
Github forks of things…
-
We don't discriminate and we don't really
filter what we download.
-
We are looking into doing some relevance
kind of sorting of the results, here.
-
Next.
-
Xilinx, why not.
-
So, this was downloaded for the last
time on August 3rd, 2016,
-
so it's probably a dead repository,
-
but yeah, you can see a bunch of source
code,
-
you can read the README of the glibc.
-
If we go back to a more interesting origin
-
here's the repository for git.
-
I've selected voluntarily an old visit
of the repo so that we can see
-
what was going on then.
-
If I look at the calendar view, you can see
that we've had some issues actually
-
updating this, but anyway.
-
If I look at the last visit, then we can
actually browse the contents,
-
you can get syntax highlighting as well.
-
This is a big big file with lots of comments
-
Let's see the actual source code…
-
Anyway, so, that's the browsing interface.
-
We can also now get back what we've
archived and download it,
-
which is kind of something that you might
want to do
-
if a repository is lost, you can actually
download it
-
and get the source code back again.
-
How we do that.
-
If you go on the top right of this browsing
interface, you have actions and download
-
and you can download the directory that
you are currently looking at.
-
It's an asynchronous process, which means
that if there is a lot of load,
-
it can take some time before the content
is actually ready to download.
-
So you can put in your email address so we
can notify you when the download is ready.
-
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point
-
in the list of things that I've requested.
-
I've already requested some things that
we can actually get and open as a tarball.
-
Yeah, I think that's the thing that I was
actually looking at,
-
which is this revision of the git
source code
-
and then I can open it
-
Yay, emacs, that's what you want.
-
Yay, source code.
-
This seems to work.
-
And then, of course, if you want to
actually script what you're doing,
-
there's an API that allows you to do
the downloads as well, so you can.
-
The source code is deduplicated a lot,
which means that for one single repository
-
you get tons of files that we have to
collect if you want to actually download
-
an archive of a directory.
-
It takes a while but we have an asynchronous
API so you can POST
-
the identifier of a revision to this URL
and then get status updates
-
and at some point, it will tell you that
the… here:
-
the status will tell you that the object
is available.
-
You can download it and you can even
download the full history of a project
-
and get that as a git-fast-export archive
that you can reimport into
-
a new git repository.
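-
The client side of that asynchronous API can be sketched generically: POST to queue the job, then poll its status until it is done. `check_status` here stands in for an HTTP GET against the status URL; it is a hypothetical helper, not the project's actual client code.

```python
import time

def wait_until_done(check_status, poll_every: float = 1.0, max_tries: int = 60) -> bool:
    """Poll an asynchronous job until it reports 'done'.

    `check_status` is any callable returning a status string, e.g. a
    function doing an HTTP GET on the archive's status endpoint after
    the initial POST queued the export.
    """
    for _ in range(max_tries):
        status = check_status()
        if status == "done":
            return True          # the bundle is ready to download
        if status == "failed":
            raise RuntimeError("export failed")
        time.sleep(poll_every)   # back off between status checks
    return False                 # gave up; the job is still pending
```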
-
So any kind of VCS that we've imported,
you can export as a git repository
-
and reimport on your machine.
-
How to get involved in the project?
-
We have a lot of features that we're
interested in, lots of them are now
-
in early access or have been done.
-
There's some stuff that we would like
help with.
-
This is some stuff that we're working on:
-
provenance information, you have a content
-
you want to know which repository
it comes from,
-
that's something we're working on.
-
Full text search: the end goal is to be
able even to trace
-
snippets of code that have
been copied from one project to another.
-
That's something that we can look into
with the wealth of information that
-
we have inside the archive.
-
There's a lot of things that,
-
I mean…
-
There's a lot of things that people want
to do with the archive.
-
Our goal is to enable people to do things,
to do interesting things
-
with a lot of source code.
-
If you have an idea of what you want to do
with such an archive,
-
please come talk to us
-
and we'll be happy to help you help us.
-
What we want to do is to diversify
the sources of things that we archive.
-
Currently, we have good support for git,
we have OK support for subversion
-
and mercurial.
-
If your project of choice is in another
version control system,
-
we are gonna miss it.
-
So people can contribute in this area.
-
For the listing part, we have coverage of
Debian, we have coverage of GitHub;
-
if your code is somewhere else, we won't
see it, so we need people to contribute
-
stuff that can list, for instance, GitLab
instances,
-
and then we can integrate that into our
infrastructure and actually have
-
people be able to archive their GitLab
instances.
-
And of course, we need to spread
the word, make the project sustainable.
-
We have a few sponsors now: Microsoft,
Nokia, Huawei; GitHub has joined as a sponsor;
-
the University of Bologna; and of course Inria
is sponsoring.
-
But we need to keep spreading the word
and keep the project sustainable.
-
And, of course, we need to save endangered
source code.
-
For that, we have a suggestion box on
the wiki that you can add things to.
-
For instance, we have in the back of
our minds archiving SourceForge,
-
because we know that it isn't very
sustainable and is at risk of being
-
taken down at some point.
-
If you want to join us, we also have
some job openings that are available.
-
For now it's in Paris, so if you want to
consider coming to work with us in Paris,
-
you can look into that.
-
That's Software Heritage.
-
We are building a reference archive of
all the free software
-
that has ever been written,
-
in an international, open, non-profit and
mutualised infrastructure
-
that we have opened up to everyone:
users, vendors, developers can all use it.
-
The idea is to be at the service of
the community and for society
-
as a whole.
-
So if you want to join us, you can look at
our website, you can look at our code.
-
You can also talk to me, so if you have
any questions,
-
I think we have 10, 12 minutes for questions.
-
[Applause]
-
Do you have questions?
-
[Q] How do you protect the archive
against stuff that you don't want to
-
have in the archive?
-
I'm thinking of stuff that is copyright-
protected and that GitHub will also
-
delete after a while.
-
Worse, I could misuse the archive
as my private backup,
-
storing encrypted blobs on GitHub
which you would eventually back up
-
for me.
-
[A] There's, I think, two sides of the
question.
-
The first side is
-
Do we really archive only stuff that is
free software and
-
that we can redistribute and how do we
manage, for instance,
-
copyright takedown stuff.
-
Currently, most of the infrastructure
of the project is under French law.
-
There's a defined process to do
copyright takedown in the French legal system.
-
We would be really annoyed to have to
take down content from the archive
-
What we do, however, is to mirror public
information that is publicly available.
-
Of course I'm not a lawyer for the project,
so I can't really…
-
I'm not 100% sure of what I'm about to say
but
-
what I know is that under the current
French legislation,
-
if the source of the data is still available
-
so for instance if the data is still on
GitHub, then you need to have
-
GitHub take it down before we have to
take it down.
-
We're not currently filtering content for
misuse of the archive,
-
so the only thing that we do is put
a limit on the size of the files
-
that are archived in Software Heritage.
-
The limit is pretty high, like 100MB.
-
We can't really decide ourselves
-
what is source code,
what is not source code
-
because for instance if your project is
a cryptography library,
-
you might want to have some encrypted
blocks of data that are stored
-
in your source code repository as
test fixtures.
-
And then, you need them to build the code
and to make sure that it works.
-
So, how would that be any different from
your encrypted backup on GitHub?
-
How could we, Software Heritage,
distinguish between proper use and misuse
-
of the resources.
-
I guess our long term goal is to not have
to care about misuse because
-
it's gonna be a drop in the ocean.
-
We're gonna have so much…
-
We want to have enough space and
enough resources
-
that we don't really need to ask ourselves
this question, basically.
-
Thanks.
-
Other questions?
-
[Q] Have you looked at some form of
authentication to provide additional
-
insurance that the archived source code
hasn't been modified or tampered with
-
in some form?
-
[A] First of all, all the identifiers for
the objects that are inside the archive
-
are cryptographic hashes of the contents
that we've archived.
-
So, for files, for instance, we take
the SHA1, the SHA256,
-
one of the BLAKE hashes and the git
modified SHA1 of the file,
-
and we use that in the manifest for
the directories.
-
So the directories, the directory identifiers
are a hash of the manifest
-
of the list of files that are inside
the directory, etc.
-
So, recursively, you can make sure that
the data that we give back to you
-
has not been altered, at least by a bitflip
or anything like that.
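-
A minimal sketch of that recursive verification: hash each file the way git does, then hash each directory's manifest of (name, child identifier) pairs. The manifest serialization below is simplified for illustration, not the archive's real on-disk format.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """git-style SHA1 of a file: hash of a typed header plus the content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def directory_id(entries) -> str:
    """A directory identifier as the hash of its manifest: the sorted
    (name, child identifier) pairs. Any change to a file changes its id,
    which changes every ancestor directory's id, so the whole tree can be
    re-verified from the root identifier alone."""
    manifest = b"".join(
        name.encode() + b"\x00" + child_id.encode()
        for name, child_id in sorted(entries)
    )
    return hashlib.sha1(manifest).hexdigest()

# Verifying a one-file tree: recompute bottom-up and compare to the root id.
readme_id = git_blob_id(b"hello\n")
root_id = directory_id([("README", readme_id)])
```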
-
We regularly run a scrub of the data
that we have in the archive,
-
so we make sure that there's no rot
inside our archive.
-
We've not looked into, basically,
attestation of…
-
for instance, making sure that the code
that we've downloaded…
-
I mean, we're not doing anything more
than taking a picture of the data
-
and we say "We've computed this hash.
Maybe the code that's been presented
-
by GitHub to Software Heritage is different
from what you've uploaded to GitHub,
-
we can't tell."
-
In the case of git, you can always use
the identifiers of the objects
-
that you've pushed so you have
the commit hash,
-
which is itself a cryptographic identifier
of the contents of the commit.
-
In turn, if the commit is signed, then
the signature is still stored
-
in the Software Heritage metadata and
you can reproduce the original git object
-
and check the signature, but we've not
done anything specific for Software Heritage
-
in this area.
-
Does that answer your question?
-
Cool.
-
Other questions?
-
There's one in front.
-
[Q] It's partly a question, partly
a comment.
-
Your initial idea was to have a telescope,
or something like this for source code.
-
For now, to me, it looks a little bit
more like a microscope:
-
you can focus on one thing, but that's
not much.
-
So have you thought about how to
analyze an entire ecosystem,
-
or something like that?
-
For example, now we have Django 2, which is
Python 3 only, so it would be interesting to
-
look at all Django modules to see when
they start moving to this Django version.
-
So we would need to start analyzing
thousands or millions of files, but then
-
we would need something SQL-like, or some
map-reduce jobs,
-
or something like this.
-
[A] Yes
-
So, we've started…
-
The two initiators of the project, Roberto
Di Cosmo and Stefano Zacchiroli
-
are both researchers in computer science
so they have a strong background in
-
actually mining software repositories and
doing some large scale analysis
-
on source code.
-
We've been talking with research groups
whose main goal is to do analysis on
-
large scale source code archives.
-
One of the first mirrors of the archive
outside of our control
-
will be in Grenoble, France.
-
There's a few teams that work on
actually doing large scale research
-
on source code over there,
-
so that's what the mirror will be
used for.
-
We've also been looking at what
the Google open source team does.
-
They have this big repository with all
the code that Google uses
-
and they've started to give back:
they do large scale analysis of
-
security vulnerabilities, find issues with
static and dynamic analysis
-
of the code, and they push
their fixes upstream.
-
That's something that we want to enable
users to do,
-
that's not something that we want to do
ourselves, but we want to make sure
-
that people can do it using our archive.
-
So we'd be happy to work with people
who already do that so that
-
they can use their knowledge and their
tools inside our archive.
-
Does that answer your question?
-
Cool.
-
Any more questions?
-
No? Then thank you very much Nicolas.
-
Thank you.
-
[Applause]