Hi, thank you.
I'm Nicolas Dandrimont and I will indeed
be talking to you about
Software Heritage.
I'm a software engineer for this project.
I've been working on it for 3 years now.
And we'll see what this thing is all about.
[Mic not working]
I guess the batteries are out.
So, let's try that again.
So, we all know, having done free software
for a while, that software source code
is something special.
Why is that?
As Harold Abelson wrote in SICP, the
programming textbook he co-authored,
programs must be written for people to read,
and only incidentally for machines to execute.
Basically, what software source code
provides us is a way inside
the mind of the designer of the program.
For instance, you can have,
you can get inside very crazy algorithms
that can do very fast reverse square roots
for 3D, that kind of stuff
Like in the Quake 2 source code.
You can also get inside the algorithms
that underpin the internet,
for instance the network queueing
algorithms in the Linux kernel.
What we are building as the free software
community is the free software commons.
Basically, the commons is all the cultural
and social and natural resources
that we share and that everyone
has access to.
More specifically, the software commons
is what we are building
with software that is open and that is
available for all to use, to modify,
to execute, to distribute.
We know that this software commons is
a really critical part of our shared commons.
Who's taking care of it?
Software is fragile.
Like all digital information, you can
lose it.
People can decide to shut down hosting
spaces because of business decisions.
People can hack into software hosting
platforms and remove the code maliciously
or just inadvertently.
And, of course, for the obsolete stuff,
there's rot.
If you don't care about the data, then
it rots and it decays and you lose it.
So, where is the archive we go to
when something is lost,
when GitLab goes away, when GitHub
goes away?
Where do we go?
Finally, there's one last thing that we
noticed: a lot of teams do research
on software, and there's no real big
infrastructure for research on code.
There's tons of critical issues around
code: safety, security, verification, proofs.
Nobody's doing this at a very large scale.
If you want to see the stars, you go to
the Atacama desert and
you point a telescope at the sky.
Where is the telescope for source code?
That's what Software Heritage wants to be.
What we do is we collect, we preserve
and we share all the software
that is publicly available.
Why do we do that? We do that to
preserve the past, to enhance the present
and to prepare for the future.
What we're building is a base infrastructure
that can be used
for cultural heritage, for industry,
for research and for education purposes.
How do we do it? We do it with an open
approach.
Every single line of code that we write
is free software.
We do it transparently, everything that
we do, we do it in the open,
be that on a mailing list or on
our issue tracker.
And we strive to do it for the very long
haul, so we do it with replication in mind
so that no single entity has full control
over the data that we collect.
And we do it in a non-profit fashion
so that we avoid
business-driven decisions impacting
the project.
So, what do we do concretely?
We do archiving of version control systems.
What does that mean?
It means we archive file contents, that is,
source code files.
We archive revisions, meaning all the
metadata of the history of the projects:
we download it and we put it inside
a common data model that is
shared across the whole archive.
We archive releases of the software:
releases that have been tagged
in a version control system, as well as
releases that we find as tarballs,
because sometimes those two views of
the source code differ.
Of course, we archive where and when
we've seen the data that we've collected.
All of this, we put inside a canonical,
VCS-agnostic data model.
If you have a Debian package, with its
history, if you have a git repository,
if you have a subversion repository, if
you have a mercurial repository,
it all looks the same and you can work
on it with the same tools.
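To make that concrete, here is a rough
sketch, in Python, of what such a canonical
model can boil down to; the type and field
names are illustrative, not the actual
Software Heritage schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Content:
    """A file, identified by hashes of its bytes."""
    sha1: str

@dataclass
class Directory:
    """(name, target id) entries: files or subdirectories."""
    entries: List[Tuple[str, str]]

@dataclass
class Revision:
    """A commit: root directory, parent revisions, metadata."""
    directory: str
    parents: List[str]
    author: str
    message: str

@dataclass
class Release:
    """A tag pointing at a revision."""
    target: str
    name: str
```

Whether the data came from git, subversion
or a Debian package, it is mapped onto
these same few node types.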
What we don't do is archive what's around
the software, for instance
the bug tracking systems or the homepages
or the wikis or the mailing lists.
There are some projects that work
in this space; for instance,
the Internet Archive does a lot of
really good work around archiving the web.
Our goal is not to replace them, but to
work with them and be able to do
linking across all the archives that exist.
For mailing lists, for instance,
there's the Gmane project,
which archives a lot of free
software mailing lists.
So our long-term vision is to play a part
in a semantic Wikipedia of software,
a Wikidata of software, where we can
hyperlink all the archives that exist
and build on top of that.
Quick tour of our infrastructure.
Basically, all the way to the right is
our archive.
The archive consists of a huge graph
of all the metadata about
the files, the directories, the revisions
(the commits) and the releases, with
all the projects sitting on top
of the graph.
We separate the file storage out into
a separate object storage because of
the size discrepancy: we have lots and lots
of file contents to store,
so we keep them outside the database
that stores the graph.
Basically, what we archive is a set of
software origins:
git repositories, mercurial repositories,
etc.
All those origins are loaded on a
regular schedule:
a very active software origin gets
archived more often
than stale things that don't get
a lot of updates.
How do we get the list of software
origins to archive?
We have a bunch of listers that can
crawl through the lists of repositories,
for instance on GitHub or other
hosting platforms.
We have code that can read Debian archive
metadata to make a list of the packages
inside that archive so they can be
archived, etc.
All of this is done on a regular basis.
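As an illustration, here is a minimal,
hypothetical lister in Python: it pages
through GitHub's public repository listing
and yields origin URLs. The real Software
Heritage listers are considerably more
robust than this.

```python
import requests

def list_github_origins(since: int = 0):
    """Yield origin URLs from GitHub's repository listing."""
    url = f"https://api.github.com/repositories?since={since}"
    while url:
        resp = requests.get(url)  # unauthenticated: low rate limits
        resp.raise_for_status()
        for repo in resp.json():
            yield repo["html_url"]  # the origin URL we would archive
        # follow the Link header to the next page
        url = resp.links.get("next", {}).get("url")
```

Each origin URL found this way would then
be handed to the scheduler for regular
loading.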
We are currently working on some kind
of push mechanism so that
people or other systems can notify us
of updates.
Our goal is not to do real time archiving,
we're really in it for the long run
but we still want to be able to prioritize
stuff that people tell us is
important to archive.
The internet archive has a "save now"
button and we want to implement
something along those lines as well,
so if we know that some software project
is in danger for one reason or another,
then we can prioritize archiving it.
So this is the basic structure of a revision
in the Software Heritage archive.
You'll see that it's very similar to
a git commit.
The format of the metadata is pretty much
what you'll find in a git commit,
with some extensions that you don't
see here because this example is from
a git commit.
So basically, we take the identifier
of the directory
that the revision points to, we take the
identifiers of the parents of the revision
so we can keep track of the history,
and then we add some metadata,
authorship and committership information
and the revision message. Then we take
a hash of all this,
and that makes an identifier that's probably
unique, very, very probably unique.
Using those identifiers, we can retrace
all the origins and all the history of
development of a project, and we can
deduplicate across the whole archive.
All the identifiers are intrinsic, which
means that we compute them
from the contents of the things that
we are archiving, so we can
deduplicate very efficiently
across all the data in the archive.
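To make that concrete, here is a sketch,
in Python, of how such a git-style intrinsic
identifier is computed. The tree and parent
hashes below are made-up placeholders:

```python
import hashlib

def git_hash(obj_type: str, body: bytes) -> str:
    # Intrinsic identifier, git-style:
    # SHA1 over "<type> <size>\0" + object body.
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# A revision manifest names the root directory (tree) and
# the parent revision(s); hashing it therefore pins the
# whole history below it.
revision = (
    b"tree 9bdf2b38b6e9161a8c2c18ee97b76bd32f13c810\n"
    b"parent 0fa1bb5b6907bd4a53e8d1ef4d7eabeb3a0d7c8d\n"
    b"author Ada Lovelace <ada@example.org> 1496275200 +0200\n"
    b"committer Ada Lovelace <ada@example.org> 1496275200 +0200\n"
    b"\n"
    b"Example commit message\n"
)
print(git_hash("commit", revision))
```

Two identical files, or two identical
subtrees, hash to the same identifier,
which is what makes the deduplication
essentially free.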
How much data do we archive?
A bit.
So, we passed the billion-revision
mark a few weeks ago.
This graph is a bit old, but anyway,
you have a live graph on our website.
That's more than 4.5 billion unique
source code files.
We don't actually discriminate between
what we would consider source code
and what upstream developers consider
source code:
everything that's in a git repository,
we consider source code,
as long as it's below a size threshold.
A billion revisions across 80 million
projects.
What do we archive?
We archive GitHub, we archive Debian.
For Debian, we run the archival process
every day, so every day we get the
new packages that have been uploaded
into the archive.
For GitHub, we try to keep up; we are
currently working on some performance
and scalability improvements to make sure
that we can keep up
with the pace of development on GitHub.
We have archived, as a one-off thing,
the former contents of Gitorious and
Google Code, two prominent code hosting
spaces that closed recently,
and we've been working on archiving
the contents of Bitbucket,
which is kind of a challenge because
the API is a bit buggy and
Atlassian isn't too interested
in fixing it.
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB,
and a pretty big database, 6TB.
The database only contains the graph of
metadata for the archive,
which is basically a graph with 8 billion
nodes and 70 billion edges.
And of course it's growing daily.
We are pretty sure this is the richest
source code archive that's available now
and it keeps growing.
So how do we actually…
What kind of stack do we use to store
all this?
We use Debian, of course.
All our deployment recipes are in Puppet
in public repositories.
We've started using Ceph
for the blob storage.
We use PostgreSQL for the metadata storage,
with some of the standard tools that
live around PostgreSQL for backups
and replication.
We use a standard Python stack for
the scheduling of jobs
and for the web interface: basically
psycopg2 for the low-level stuff,
Django for the web stuff
and Celery for the scheduling of jobs.
In house, we've written an ad hoc
object storage system with
a bunch of backends that you can use.
Basically, we're agnostic between a UNIX
filesystem, Azure, Ceph, and so on.
It's a really simple object storage system
where you can just put an object,
get an object, put a bunch of objects,
get a bunch of objects.
We've implemented removal but we don't
really use it yet.
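To give an idea of how small that API
surface is, here's a minimal, hypothetical
sketch of a content-addressed object store
with a filesystem backend; this is not the
actual Software Heritage code:

```python
import hashlib
from pathlib import Path

class FsObjStorage:
    """Content-addressed put/get over a plain filesystem."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _path(self, obj_id: str) -> Path:
        # Shard by hash prefix to keep directories small.
        return self.root / obj_id[:2] / obj_id[2:4] / obj_id

    def add(self, content: bytes) -> str:
        obj_id = hashlib.sha1(content).hexdigest()
        path = self._path(obj_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._path(obj_id).read_bytes()
```

Swapping the backend (Azure, Ceph, …)
just means reimplementing those same
add/get operations.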
All the data model implementation,
all the listers, the loaders, the schedulers
everything has been written by us,
it's a pile of Python code.
So, basically, 20 Python packages and
around 30 Puppet modules
to deploy all that, and we've released
everything under copyleft licenses:
GPLv3 for the backend and AGPLv3
for the frontend.
So even if people make their own
Software Heritage using our code,
they have to publish their changes.
Hardware-wise, we run everything for now
on a few hypervisors in house, and
our main storage is currently still
on a very high-density, very slow,
very bulky storage array, but we've
started to migrate all of this
to a Ceph storage cluster, which
we're gonna grow as needed
in the next few months.
Microsoft has also granted us sponsorship
for their cloud services.
We've started putting mirrors of everything
in their infrastructure as well,
which means a full object storage mirror,
so 175TB of stuff mirrored on Azure,
as well as a database mirror for the graph.
And we're also doing all the content
indexing and all the things that need
scalability on Azure now.
Finally, at the University of Bologna,
we have a backend storage for downloads.
Our main storage is quite slow,
so if you want to download
a bundle of things that we've archived,
we keep a cache of the bundles
we've built so that it doesn't take
a million years to download stuff.
We do our development in a classic free
and open source software way,
so we talk on our mailing list, on IRC,
on a forge.
Everything is in English, everything is
public, there is more information
on our website if you want to actually
have a look and see what we do.
So, all that is very interesting, but how
do we actually look into it?
One of the ways that you can browse
and use the archive
is a REST API.
Basically, this API allows you to do
pointwise browsing of the archive:
you can go and follow the links
in the graph,
which is very slow but gives you pretty
much full access to the data.
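For instance, a single lookup might look
like this; a hedged sketch against the
public API at archive.softwareheritage.org,
with a placeholder hash to substitute:

```python
import requests

API = "https://archive.softwareheritage.org/api/1"
# Placeholder: replace with the SHA1 of a file you care about.
SHA1 = "0123456789abcdef0123456789abcdef01234567"

# Look up one file by hash; browsing means hopping along the
# graph like this, one request per node.
resp = requests.get(f"{API}/content/sha1:{SHA1}/")
resp.raise_for_status()
print(resp.json())
```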
There's an index for the API that you can
look at, but that's not really convenient,
so we also have a web user interface.
It's in preview right now, we're gonna do
a full launch in the month of June.
If you go to
https://archive.softwareheritage.org/browse/
with the given credentials, you can
have a look and see what's going on.
Basically, we have a web interface that
allows you to look at
which origins we have downloaded and when,
with a kind of graph view of how often
we visited each origin
and a calendar view of when we
visited it.
And then, inside the visits, you can
actually browse the contents
that we've archived.
So, for instance, this is the Python
repository as of May 2017
and you can have the list of files,
then drill down,
it should be pretty intuitive.
If you look at the history of a project,
you can see the differences
between two revisions of a project.
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.
So, yeah, pretty cool stuff.
I should be able to do a demo as well,
it should work.
I'm gonna zoom in.
So this is the main archive, you can see
some statistics about the objects
that we've downloaded.
When you zoom in, you get some kind of
overflows, because…
Yeah, why would you do that.
If you want to browse, we can try to find
an origin.
"glibc".
So there's lots and lots of, like, random
GitHub forks of things…
We don't discriminate and we don't really
filter what we download.
We are looking into doing some kind of
relevance sorting of the results here.
Next.
Xilinx, why not.
So, this was downloaded for the last
time on August 3rd, 2016,
so it's probably a dead repository,
but yeah, you can see a bunch of source
code,
you can read the README of the glibc.
If we go back to a more interesting origin,
here's the repository for git.
I've deliberately selected an old visit
of the repo so that we can see
what was going on then.
If I look at the calendar view, you can see
that we've had some issues actually
updating this one, but anyway.
If I look at the last visit, then we can
actually browse the contents,
you can get syntax highlighting as well.
This is a big, big file with lots of
comments. Let's see the actual source code…
Anyway, so, that's the browsing interface.
We can also now get back what we've
archived and download it,
which is something you might want to do
if a repository is lost:
you can actually download it
and get the source code back again.
How do we do that?
If you go on the top right of this browsing
interface, you have actions and download
and you can download a directory that
you are currently looking at.
It's an asynchronous process, which means
that if there is a lot of load,
it can take some time before the content
is actually available to download.
So you can put in your email address and
we'll notify you when the download is ready.
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point
in the list of things that I've requested.
I've already requested some things that
we can actually get and open as a tarball.
Yeah, I think that's the thing that I was
actually looking at,
which is this revision of the git
source code
and then I can open it
Yay, emacs, that's what you want.
Yay, source code.
This seems to work.
And then, of course, if you want to
script what you're doing,
there's an API that allows you to do
the downloads as well.
The source code is heavily deduplicated,
which means that for one single repository
there are tons of files that we have to
collect if you want to download
an archive of a directory.
It takes a while, but we have an asynchronous
API: you can POST
the identifier of a revision to this URL,
then get status updates,
and at some point the status will tell you
that the object is available.
You can download it, and you can even
download the full history of a project
and get that as a git-fast-export archive
that you can reimport into
a new git repository.
So any kind of VCS that we've imported,
you can export as a git repository
and reimport on your machine.
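Scripted, the flow might look like this;
a hedged sketch of the asynchronous "vault"
API, where the endpoint shape follows the
public API docs at the time and the
revision id is a placeholder:

```python
import time
import requests

API = "https://archive.softwareheritage.org/api/1"
# Placeholder: replace with a real revision id from the archive.
REV = "0123456789abcdef0123456789abcdef01234567"

# Ask the vault to cook a git-fast-export bundle of the revision.
status = requests.post(f"{API}/vault/revision/{REV}/gitfast/").json()

# Poll until the bundle is ready (or the cooking failed).
while status.get("status") not in ("done", "failed"):
    time.sleep(10)
    status = requests.get(f"{API}/vault/revision/{REV}/gitfast/").json()

if status["status"] == "done":
    bundle = requests.get(status["fetch_url"])
    with open("revision.gitfast.gz", "wb") as f:
        f.write(bundle.content)
    # Reimport locally with:
    #   zcat revision.gitfast.gz | git fast-import
```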
How do you get involved in the project?
We have a lot of features that we're
interested in; lots of them are now
in early access or have been done.
And there's some stuff that we would
like help with.
This is some stuff that we're working on.
Provenance information: you have a content
and you want to know which repository
it comes from;
that's something we're on.
Full text search: the end goal is to be
able even to trace
the source of snippets of code that have
been copied from one project to another.
That's something we can look into
with the wealth of information that
we have inside the archive.
There's a lot of things that people want
to do with the archive.
Our goal is to enable people to do things,
to do interesting things
with a lot of source code.
If you have an idea of what you want to do
with such an archive,
please come talk to us,
and we'll be happy to help you help us.
What we want to do is to diversify
the sources of things that we archive.
Currently, we have good support for git,
we have OK support for subversion
and mercurial.
If your project of choice is in another
version control system,
we are gonna miss it.
So people can contribute in this area.
For the listing part, we have coverage of
Debian and coverage of GitHub;
if your code is somewhere else, we won't
see it, so we need people to contribute
code that can list, for instance, GitLab
instances,
and then we can integrate that in our
infrastructure and actually let
people archive their GitLab
instances.
And of course, we need to spread
the word, make the project sustainable.
We have a few sponsors now: Microsoft,
Nokia, Huawei; GitHub has joined as a
sponsor; the University of Bologna and,
of course, Inria are sponsoring.
But we need to keep spreading the word
and keep the project sustainable.
And, of course, we need to save endangered
source code.
For that, we have a suggestion box on
the wiki that you can add things to.
For instance, we have in the back of
our minds archiving SourceForge,
because we know that it isn't very
sustainable and is at risk of being
taken down at some point.
If you want to join us, we also have
some job openings that are available.
For now it's in Paris, so if you want to
consider coming to work with us in Paris,
you can look into that.
That's Software Heritage.
We are building a reference archive of
all the free software
that has ever been written,
in an international, open, non-profit and
mutualised infrastructure
that we have opened up to everyone:
all users, vendors and developers can use it.
The idea is to be at the service of
the community and of society
as a whole.
So if you want to join us, you can look at
our website, you can look at our code.
You can also talk to me, so if you have
any questions,
I think we have 10, 12 minutes for questions.
[Applause]
Do you have questions?
[Q] How do you protect the archive
against stuff that you don't want to
have in the archive?
I'm thinking of stuff that is copyright-
protected and that GitHub will also
delete after a while.
Worse, what if I misuse the archive
as my private backup
and store encrypted blocks on GitHub,
which you will eventually back up
for me?
[A] There are, I think, two sides to the
question.
The first side is:
do we really archive only stuff that is
free software and
that we can redistribute, and how do we
manage, for instance,
copyright takedown requests?
Currently, most of the infrastructure
of the project is under French law.
There's a defined process to do
copyright takedown in the French legal system.
We would be really annoyed to have to
take down content from the archive.
What we do, however, is mirror
information that is publicly available.
Of course I'm not a lawyer for the project,
so I can't really…
I'm not 100% sure of what I'm about to say,
but
what I know is that, under the current
French legislation,
if the source of the data is still available,
so for instance if the data is still on
GitHub, then you need to have
GitHub take it down before we have to
take it down.
We're not currently filtering content for
misuse of the archive;
the only thing that we do is put
a limit on the size of the files
that are archived in Software Heritage.
The limit is pretty high, like 100MB.
We can't really decide ourselves
what is source code
and what is not,
because, for instance, if your project is
a cryptography library,
you might want to have some encrypted
blocks of data stored
in your source code repository as
test fixtures.
You then need them to build the code
and to make sure that it works.
So, how would that be any different from
your encrypted backup on GitHub?
How could we, Software Heritage,
distinguish between proper use and misuse
of the resources?
I guess our long-term goal is to not have
to care about misuse, because
it's gonna be a drop in the ocean.
We want to have enough space and
enough resources
that we don't really need to ask ourselves
this question, basically.
Thanks.
Other questions?
[Q] Have you looked at some form of
authentication to provide additional
assurance that the archived source code
hasn't been modified or tampered with
in some form?
[A] First of all, all the identifiers for
the objects that are inside the archive
are cryptographic hashes of the contents
that we've archived.
So, for files, for instance, we take
the SHA1, the SHA256,
one of the BLAKE hashes and the git-
modified SHA1 of the file,
and we use those in the manifests of
the directories.
So the directory identifiers are, in turn,
a hash of the manifest
listing the files that are inside
the directory, etc.
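In Python terms, the per-file identifiers
amount to something like this; a sketch,
since the actual hashing lives in Software
Heritage's own libraries:

```python
import hashlib

def content_hashes(data: bytes) -> dict:
    # Git's "salted" SHA1: hash of a typed header + the bytes.
    git_header = f"blob {len(data)}\0".encode()
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }

print(content_hashes(b"hello world\n"))
```

Recomputing these hashes on anything the
archive hands back is how you check it
against the identifier you asked for.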
So, recursively, you can make sure that
the data that we give back to you
has not been altered, at least by a bitflip
or anything like that.
We regularly run a scrub of the data
that we have in the archive,
to make sure that there's no rot
inside our archive.
We've not really looked into attestation,
for instance making sure that the code
that we've downloaded is what the author
published.
I mean, we're not doing anything more
than taking a picture of the data
and saying "we've computed this hash".
Maybe the code that GitHub presented
to Software Heritage is different
from what you uploaded to GitHub;
we can't tell.
In the case of git, you can always use
the identifiers of the objects
that you've pushed, so you have
the commit hash,
which is itself a cryptographic identifier
of the contents of the commit.
In turn, if the commit is signed, then
the signature is still stored
in the Software Heritage metadata, so
you can reproduce the original git object
and check the signature, but we've not
done anything specific to Software Heritage
in this area.
Does that answer your question?
Cool.
Other questions?
There's one in front.
[Q] It's partially a question, partially
a comment.
Your initial idea was to have a telescope,
or something like that, for source code.
For now, to me, it looks a little bit
more like a microscope:
you can focus on one thing, but not
on much more.
So have you thought about how to
analyze an entire ecosystem
or something like that?
For example, now we have Django 2, which is
Python 3 only, so it would be interesting to
look at all the Django modules to see when
they start moving to it.
For that we would need to analyze
thousands or millions of files, and then
we would need something SQL-like, or some
map-reduce jobs,
or something like that.
[A] Yes.
So, we've started…
The two initiators of the project, Roberto
Di Cosmo and Stefano Zacchiroli,
are both researchers in computer science,
so they have a strong background in
mining software repositories and
doing large scale analysis
of source code.
We've been talking with research groups
whose main goal is to do analysis on
large scale source code archives.
One of the first mirrors of the archive
outside of our control
will be in Grenoble (France).
There are a few teams over there that
work on doing large scale research
on source code,
so that's what the mirror will be
used for.
We've also been looking at what
the Google open source team does.
They have this big repository with all
the code that Google uses,
and they've started to push back:
doing large scale analysis of
security vulnerabilities, finding issues
through static and dynamic analysis
of the code, and pushing
their fixes upstream.
That's something that we want to enable
users to do,
that's not something that we want to do
ourselves, but we want to make sure
that people can do it using our archive.
So we'd be happy to work with people
who already do that so that
they can use their knowledge and their
tools inside our archive.
Does that answer your question?
Cool.
Any more questions?
No? Then thank you very much Nicolas.
Thank you.
[Applause]