-
Not Synced
Hi, thank you.
-
Not Synced
I'm Nicolas Dandrimont and I will indeed
be talking to you about
-
Not Synced
Software Heritage.
-
Not Synced
I'm a software engineer for this project.
-
Not Synced
I've been working on it for 3 years now.
-
Not Synced
And we'll see what this thing is all about.
-
Not Synced
[Mic not working]
-
Not Synced
I guess the batteries are out.
-
Not Synced
So, let's try that again.
-
Not Synced
So, we all know, we've been doing
free software for a while,
-
Not Synced
that software source code is something
special.
-
Not Synced
Why is that?
-
Not Synced
As Harold Abelson has said in SICP, his
textbook on programming,
-
Not Synced
programs must be written for people to read,
and only incidentally for machines to execute.
-
Not Synced
Basically, what software source code
provides us is a way inside
-
Not Synced
the mind of the designer of the program.
-
Not Synced
For instance, you can get inside
very crazy algorithms
-
Not Synced
that can do very fast inverse square roots
for 3D, that kind of stuff,
-
Not Synced
like in the Quake III source code.
-
Not Synced
You can also get inside the algorithms
that are underpinning the internet,
-
Not Synced
for instance seeing the net queue
algorithm in the Linux kernel.
-
Not Synced
What we are building as the free software
community is the free software commons.
-
Not Synced
Basically, the commons is all the cultural
and social and natural resources
-
Not Synced
that we share and that everyone
has access to.
-
Not Synced
More specifically, the software commons
is what we are building
-
Not Synced
with software that is open and that is
available for all to use, to modify,
-
Not Synced
to execute, to distribute.
-
Not Synced
We know that the software commons are a really
critical part of our commons.
-
Not Synced
Who's taking care of it?
-
Not Synced
The software is fragile.
-
Not Synced
Like all digital information, you can lose
software.
-
Not Synced
People can decide to shut down hosting
spaces because of business decisions.
-
Not Synced
People can hack into software hosting
platforms and remove the code maliciously
-
Not Synced
or just inadvertently.
-
Not Synced
And, of course, for the obsolete stuff,
there's rot.
-
Not Synced
If you don't care about the data, then
it rots and it decays and you lose it.
-
Not Synced
So, where is the archive we go to
when something is lost,
-
Not Synced
when GitLab goes away, when GitHub
goes away?
-
Not Synced
Where do we go?
-
Not Synced
Finally, there's one last thing that we
noticed, it's that
-
Not Synced
there's a lot of teams that work on
research on software
-
Not Synced
and there's no real big infrastructure
for research on code.
-
Not Synced
There's tons of critical issues around
code: safety, security, verification, proofs.
-
Not Synced
Nobody's doing this at a very large scale.
-
Not Synced
If you want to see the stars, you go
to the Atacama Desert and
-
Not Synced
you point a telescope at the sky.
-
Not Synced
Where is the telescope for source code?
-
Not Synced
That's what Software Heritage wants to be.
-
Not Synced
What we do is we collect, we preserve
and we share all the software
-
Not Synced
that is publicly available.
-
Not Synced
Why do we do that? We do that to
preserve the past, to enhance the present
-
Not Synced
and to prepare for the future.
-
Not Synced
What we're building is a base infrastructure
that can be used
-
Not Synced
for cultural heritage, for industry,
for research and for education purposes.
-
Not Synced
How do we do it? We do it with an open
approach.
-
Not Synced
Every single line of code that we write
is free software.
-
Not Synced
We do it transparently, everything that
we do, we do it in the open,
-
Not Synced
be that on a mailing list or on
our issue tracker.
-
Not Synced
And we strive to do it for the very long
haul, so we do it with replication in mind
-
Not Synced
so that no single entity has full control
over the data that we collect.
-
Not Synced
And we do it in a non-profit fashion
so that we avoid
-
Not Synced
business-driven decisions impacting
the project.
-
Not Synced
So, what do we do concretely?
-
Not Synced
We do archiving of version control systems.
-
Not Synced
What does that mean?
-
Not Synced
It means we archive file contents, so
source code, files.
-
Not Synced
We archive revisions, which means all the
metadata of the history of the projects,
-
Not Synced
we try to download it and we put it inside
a common data model that is
-
Not Synced
shared across all the archive.
-
Not Synced
We archive releases of the software,
releases that have been tagged
-
Not Synced
in a version control system as well as
releases that we can find as tarballs
-
Not Synced
because sometimes the two views of
this source code differ.
-
Not Synced
Of course, we archive where and when
we've seen the data that we've collected.
-
Not Synced
All of this, we put inside a canonical,
VCS-agnostic, data model.
-
Not Synced
If you have a Debian package, with its
history, if you have a Git repository,
-
Not Synced
if you have a Subversion repository, if
you have a Mercurial repository,
-
Not Synced
it all looks the same and you can work
on it with the same tools.
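
[Editor's sketch] The data model described above can be illustrated with a few plain Python types. This is a minimal sketch under stated assumptions, not the project's actual schema: the class names, field names, the `release_names` helper and the example URLs are all hypothetical and chosen only to mirror what the talk lists (file contents, revisions, releases, and where and when the data was seen).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Content:
    """The raw bytes of one archived file."""
    data: bytes


@dataclass(frozen=True)
class Revision:
    """One point in a project's history, with its metadata."""
    message: str
    author: str
    files: tuple = ()    # (path, Content) pairs; hypothetical layout
    parents: tuple = ()  # parent Revision objects


@dataclass(frozen=True)
class Release:
    """A named version, found as a VCS tag or as a tarball."""
    name: str
    target: Revision


@dataclass(frozen=True)
class Visit:
    """Where and when a given state of a project was seen."""
    origin_url: str
    origin_type: str  # e.g. "git", "svn", "hg", "deb" (assumed labels)
    date: datetime
    releases: tuple = ()


def release_names(visit: Visit) -> list:
    """Works the same way whatever the origin type is."""
    return sorted(release.name for release in visit.releases)


rev = Revision(
    "Initial import",
    "Jane Doe",
    files=(("main.c", Content(b"int main(void) { return 0; }\n")),),
)
rel = Release("v1.0", rev)
now = datetime.now(timezone.utc)
git_visit = Visit("https://example.org/project.git", "git", now, (rel,))
svn_visit = Visit("https://example.org/svn/project", "svn", now, (rel,))
```

Because both visits use the same types, `release_names` does not need to know which version control system the data came from, which is the point of a VCS-agnostic model.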
-
Not Synced
What we don't do is archive what's around
the software, for instance
-
Not Synced
the bug tracking systems or the homepages
or the wikis or the mailing lists.
-
Not Synced
There are some projects that work
in this space, for instance
-
Not Synced
the Internet Archive does a lot of
really good work around archiving the web.
-
Not Synced
Our goal is not to replace them, but to
work with them and be able to do
-
Not Synced
linking across all the archives that exist.
-
Not Synced
For instance, for the mailing lists,
there's the Gmane project
-
Not Synced
that does a lot of archiving of free
software mailing lists.
-
Not Synced
So our long term vision is to play a part
in a semantic Wikipedia of software,
-
Not Synced
a Wikidata of software where we can
hyperlink all the archives that exist
-
Not Synced
and do stuff in the area.
-
Not Synced
Quick tour of our infrastructure.
-
Not Synced
Basically, all the way to the right is
our archive.
-
Not Synced
Our archive consists of a huge graph
of all the metadata about
-
Not Synced
the files, the directories, the revisions,
the commits and the releases and
-
Not Synced
all the projects that are on top
of the graph.
-
Not Synced
We separate the file storage into another
object storage because of
-
Not Synced
the size discrepancy: we have lots and lots
of file contents that we need to store
-
Not Synced
so we do that outside the database
that is used to store the graph.
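
[Editor's sketch] The split between the metadata graph and the file storage can be sketched as two small in-memory classes. This is only an illustration under assumptions: the class names are hypothetical, and keying file contents by the SHA-1 of their bytes is an assumption made for the example, since the talk does not say at this point how objects are addressed.

```python
import hashlib


class ObjectStorage:
    """Holds bulky file contents only, keyed by a hash of their bytes."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()  # assumed keying scheme
        self._blobs[key] = data  # identical contents are stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


class GraphDB:
    """Holds small metadata nodes and the labelled edges between them."""

    def __init__(self):
        self.edges = {}  # node id -> list of (label, target id)

    def link(self, source: str, label: str, target: str) -> None:
        self.edges.setdefault(source, []).append((label, target))


storage = ObjectStorage()
graph = GraphDB()

source = b"int main(void) { return 0; }\n"
key = storage.add(source)
same_key = storage.add(source)  # same bytes again, e.g. from another project

# The graph only records the key, never the bytes themselves.
graph.link("dir:project-a", "main.c", key)
graph.link("dir:project-b", "main.c", same_key)
```

The graph stays small because each node only carries identifiers, while the object storage can grow independently and holds each distinct file content once.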
-
Not Synced
Basically, what we archive is a set of
software origins that are
-
Not Synced
Git repositories, Mercurial repositories,
etc.
-
Not Synced
All those origins are loaded on a
regular schedule.
-
Not Synced
If there is a very active software origin,
we're gonna archive it more often
-
Not Synced
than stale things that don't get
a lot of updates