9:59:59.000,9:59:59.000
Hi, thank you.

9:59:59.000,9:59:59.000
I'm Nicolas Dandrimont and I will indeed[br]be talking to you about

9:59:59.000,9:59:59.000
Software Heritage.

9:59:59.000,9:59:59.000
I'm a software engineer for this project.

9:59:59.000,9:59:59.000
I've been working on it for 3 years now.

9:59:59.000,9:59:59.000
And we'll see what this thing is all about.

9:59:59.000,9:59:59.000
[Mic not working]

9:59:59.000,9:59:59.000
I guess the batteries are out.

9:59:59.000,9:59:59.000
So, let's try that again.

9:59:59.000,9:59:59.000
So, we all know, we've been doing[br]free software for a while,

9:59:59.000,9:59:59.000
that software source code is something[br]special.

9:59:59.000,9:59:59.000
Why is that?

9:59:59.000,9:59:59.000
As Harold Abelson has said in SICP, his[br]textbook on programming,

9:59:59.000,9:59:59.000
programs are meant to be read by people[br]and then incidentally for machines to execute.

9:59:59.000,9:59:59.000
Basically, what software source code[br]provides us is a way inside

9:59:59.000,9:59:59.000
the mind of the designer of the program.

9:59:59.000,9:59:59.000
For instance, you can have,[br]you can get inside very crazy algorithms

9:59:59.000,9:59:59.000
that can do very fast reverse square roots[br]for 3D, that kind of stuff

9:59:59.000,9:59:59.000
Like in the Quake 2 source code.

9:59:59.000,9:59:59.000
You can also get inside the algorithms[br]that are underpinning the internet,

9:59:59.000,9:59:59.000
for instance seeing the net queue[br]algorithm in the Linux kernel.

9:59:59.000,9:59:59.000
What we are building as the free software[br]community is the free software commons.

9:59:59.000,9:59:59.000
Basically, the commons is all the cultural[br]and social and natural resources

9:59:59.000,9:59:59.000
that we share and that everyone[br]has access to.

9:59:59.000,9:59:59.000
More specifically, the software commons[br]is what we are building

9:59:59.000,9:59:59.000
with software that is open and that is[br]available for all to use, to modify,

9:59:59.000,9:59:59.000
to execute, to distribute.

9:59:59.000,9:59:59.000
We know that those commons are a really[br]critical part of our commons.

9:59:59.000,9:59:59.000
Who's taking care of it?

9:59:59.000,9:59:59.000
The software is fragile.

9:59:59.000,9:59:59.000
Like all digital information, you can lose[br]software.

9:59:59.000,9:59:59.000
People can decide to shut down hosting[br]spaces because of business decisions.

9:59:59.000,9:59:59.000
People can hack into software hosting[br]platforms and remove the code maliciously

9:59:59.000,9:59:59.000
or just inadvertently.

9:59:59.000,9:59:59.000
And, of course, for the obsolete stuff,[br]there's rot.

9:59:59.000,9:59:59.000
If you don't care about the data, then[br]it rots and it decays and you lose it.

9:59:59.000,9:59:59.000
So, where is the archive we go to[br]when something is lost,

9:59:59.000,9:59:59.000
when GitLab goes away, when GitHub[br]goes away.

9:59:59.000,9:59:59.000
Where do we go?

9:59:59.000,9:59:59.000
Finally, there's one last thing that we[br]noticed, it's that

9:59:59.000,9:59:59.000
there's a lot of teams that work on[br]research on software

9:59:59.000,9:59:59.000
and there's no real big infrastructure[br]for research on code.

9:59:59.000,9:59:59.000
There's tons of critical issues around[br]code: safety, security, verification, proofs.

9:59:59.000,9:59:59.000
Nobody's doing this at a very large scale.

9:59:59.000,9:59:59.000
If you want to see the stars, you go to[br]the Atacama Desert and

9:59:59.000,9:59:59.000
you point a telescope at the sky.

9:59:59.000,9:59:59.000
Where is the telescope for source code?

9:59:59.000,9:59:59.000
That's what Software Heritage wants to be.

9:59:59.000,9:59:59.000
What we do is we collect, we preserve[br]and we share all the software

9:59:59.000,9:59:59.000
that is publicly available.

9:59:59.000,9:59:59.000
Why do we do that? We do that to[br]preserve the past, to enhance the present

9:59:59.000,9:59:59.000
and to prepare for the future.

9:59:59.000,9:59:59.000
What we're building is a base infrastructure[br]that can be used

9:59:59.000,9:59:59.000
for cultural heritage, for industry,[br]for research and for education purposes.

9:59:59.000,9:59:59.000
How do we do it? We do it with an open[br]approach.

9:59:59.000,9:59:59.000
Every single line of code that we write[br]is free software.

9:59:59.000,9:59:59.000
We do it transparently, everything that[br]we do, we do it in the open,

9:59:59.000,9:59:59.000
be that on a mailing list or on[br]our issue tracker.

9:59:59.000,9:59:59.000
And we strive to do it for the very long[br]haul, so we do it with replication in mind

9:59:59.000,9:59:59.000
so that no single entity has full control[br]over the data that we collect.

9:59:59.000,9:59:59.000
And we do it in a non-profit fashion[br]so that we avoid

9:59:59.000,9:59:59.000
business-driven decisions impacting[br]the project.

9:59:59.000,9:59:59.000
So, what do we do concretely?

9:59:59.000,9:59:59.000
We do archiving of version control systems.

9:59:59.000,9:59:59.000
What does that mean?

9:59:59.000,9:59:59.000
It means we archive file contents, so[br]source code, files.

9:59:59.000,9:59:59.000
We archive revisions, which means all the[br]metadata of the history of the projects,

9:59:59.000,9:59:59.000
we try to download it and we put it inside[br]a common data model that is

9:59:59.000,9:59:59.000
shared across all the archive.

9:59:59.000,9:59:59.000
We archive releases of the software,[br]releases that have been tagged

9:59:59.000,9:59:59.000
in a version control system as well as[br]releases that we can find as tarballs

9:59:59.000,9:59:59.000
because sometimes… boof, views of[br]this source code differ.

9:59:59.000,9:59:59.000
Of course, we archive where and when[br]we've seen the data that we've collected.

9:59:59.000,9:59:59.000
All of this, we put inside a canonical,[br]VCS-agnostic, data model.

9:59:59.000,9:59:59.000
If you have a Debian package, with its[br]history, if you have a git repository,

9:59:59.000,9:59:59.000
if you have a subversion repository, if[br]you have a mercurial repository,

9:59:59.000,9:59:59.000
it all looks the same and you can work[br]on it with the same tools.

9:59:59.000,9:59:59.000
What we don't do is archive what's around[br]the software, for instance

9:59:59.000,9:59:59.000
the bug tracking systems or the homepages[br]or the wikis or the mailing lists.

9:59:59.000,9:59:59.000
There are some projects that work[br]in this space, for instance

9:59:59.000,9:59:59.000
the Internet Archive does a lot of[br]really good work around archiving the web.

9:59:59.000,9:59:59.000
Our goal is not to replace them, but to[br]work with them and be able to do

9:59:59.000,9:59:59.000
linking across all the archives that exist.

9:59:59.000,9:59:59.000
For instance, for the mailing lists,[br]there's the Gmane project

9:59:59.000,9:59:59.000
that does a lot of archiving of free[br]software mailing lists.

9:59:59.000,9:59:59.000
So our long-term vision is to play a part[br]in a semantic Wikipedia of software,

9:59:59.000,9:59:59.000
a Wikidata of software where we can[br]hyperlink all the archives that exist

9:59:59.000,9:59:59.000
and do stuff in the area.

9:59:59.000,9:59:59.000
Quick tour of our infrastructure.

9:59:59.000,9:59:59.000
Basically, all the way to the right is[br]our archive.

9:59:59.000,9:59:59.000
Our archive consists of a huge graph[br]of all the metadata about

9:59:59.000,9:59:59.000
the files, the directories, the revisions,[br]the commits and the releases and

9:59:59.000,9:59:59.000
all the projects that are on top[br]of the graph.

9:59:59.000,9:59:59.000
We separate the file storage into another[br]object storage because of

9:59:59.000,9:59:59.000
the size discrepancy: we have lots and lots[br]of file contents that we need to store

9:59:59.000,9:59:59.000
so we do that outside the database[br]that is used to store the graph.

9:59:59.000,9:59:59.000
Basically, what we archive is a set of[br]software origins that are

9:59:59.000,9:59:59.000
git repositories, mercurial repositories,[br]etc. etc.

9:59:59.000,9:59:59.000
All those origins are loaded on a[br]regular schedule.

9:59:59.000,9:59:59.000
If there is a very active software origin,[br]we're gonna archive it more often

9:59:59.000,9:59:59.000
than stale things that don't get[br]a lot of updates.
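
The idea of visiting active origins more often than stale ones can be sketched as a simple feedback rule. This is only an illustration, not the project's actual scheduler: the function name `next_interval`, the halving and doubling policy and the one-day and 64-day bounds are all assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical bounds: never visit more than daily, never wait more than 64 days.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, found_changes: bool) -> timedelta:
    """Shorten the wait after a visit that found updates, lengthen it otherwise."""
    candidate = current / 2 if found_changes else current * 2
    # Clamp the result so it stays inside the configured bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))


# An origin visited every 8 days that just changed gets visited sooner next time.
busy = next_interval(timedelta(days=8), found_changes=True)
# The same origin with no changes drifts towards less frequent visits.
quiet = next_interval(timedelta(days=8), found_changes=False)
```

With a rule like this, an origin that keeps changing converges on the minimum interval, while an abandoned one drifts to the maximum and costs very little to keep checking.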

9:59:59.000,9:59:59.000
What we do to get the list of software[br]origins that we archive.

9:59:59.000,9:59:59.000
We have a bunch of listers that can[br]scroll through the list of repositories,

9:59:59.000,9:59:59.000
for instance on GitHub or other[br]hosting platforms.

9:59:59.000,9:59:59.000
We have code that can read Debian archive[br]metadata to make a list of the packages

9:59:59.000,9:59:59.000
that are inside this archive and can be[br]archived, etc.

9:59:59.000,9:59:59.000
All of this is done on a regular basis.

9:59:59.000,9:59:59.000
We are currently working on some kind[br]of push mechanism so that

9:59:59.000,9:59:59.000
people or other systems can notify us[br]of updates.

9:59:59.000,9:59:59.000
Our goal is not to do real time archiving,[br]we're really in it for the long run

9:59:59.000,9:59:59.000
but we still want to be able to prioritize[br]stuff that people tell us is

9:59:59.000,9:59:59.000
important to archive.

9:59:59.000,9:59:59.000
The Internet Archive has a "save now"[br]button and we want to implement

9:59:59.000,9:59:59.000
something along those lines as well,

9:59:59.000,9:59:59.000
so if we know that some software project[br]is in danger for a reason or another,

9:59:59.000,9:59:59.000
then we can prioritize archiving it.

9:59:59.000,9:59:59.000
So this is the basic structure of a revision[br]in the Software Heritage archive.

9:59:59.000,9:59:59.000
You'll see that it's very similar to[br]a git commit.

9:59:59.000,9:59:59.000
The format of the metadata is pretty much[br]what you'll find in a git commit

9:59:59.000,9:59:59.000
with some extensions that you don't[br]see here because this is from a git commit

9:59:59.000,9:59:59.000
So basically what we do is we take the[br]identifier of the directory

9:59:59.000,9:59:59.000
that the revision points to, we take the[br]identifier of the parent of the revision

9:59:59.000,9:59:59.000
so we can keep track of the history

9:59:59.000,9:59:59.000
and then we add some metadata,[br]authorship and committership information

9:59:59.000,9:59:59.000
and the revision message and then we take[br]a hash of this,

9:59:59.000,9:59:59.000
it makes an identifier that's probably[br]unique, very very probably unique.

9:59:59.000,9:59:59.000
Using those identifiers, we can retrace[br]all the origins, all the history of

9:59:59.000,9:59:59.000
development of the project and we can[br]deduplicate across all the archive.

9:59:59.000,9:59:59.000
All the identifiers are intrinsic, which[br]means that we compute them

9:59:59.000,9:59:59.000
from the contents of the things that[br]we are archiving, which means that

9:59:59.000,9:59:59.000
we can deduplicate very efficiently[br]across all the data that we archive.
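
Since the revision is described as very similar to a git commit, the way such an intrinsic identifier is derived can be sketched with git's own object hashing. This is a minimal sketch that follows git's commit format; the archive's real manifest carries extensions that are not shown here, and the helper names and the author identity are made up for the example.

```python
import hashlib


def git_object_id(object_type: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over a '<type> <size>' header,
    a NUL byte, then the body."""
    header = f"{object_type} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def revision_id(directory_id, parent_ids, author, committer, message):
    """Build a git-style commit manifest and hash it.

    author and committer are preformatted strings such as
    "Name <email> 1500000000 +0000".
    """
    lines = [f"tree {directory_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    manifest = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", manifest.encode("utf-8"))


# Identifier of an empty directory, computed the same way.
EMPTY_TREE = git_object_id("tree", b"")

who = "Jane Doe <jane@example.org> 1500000000 +0000"  # made-up identity
first = revision_id(EMPTY_TREE, [], who, who, "Initial revision\n")
same = revision_id(EMPTY_TREE, [], who, who, "Initial revision\n")
child = revision_id(EMPTY_TREE, [first], who, who, "Second revision\n")
```

Here `first` and `same` are equal because the identifier depends only on the content, which is what makes deduplication possible. `child` embeds the identifier of `first`, so the history stays chained together.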

9:59:59.000,9:59:59.000
How much data do we archive?

9:59:59.000,9:59:59.000
A bit.

9:59:59.000,9:59:59.000
So, we have passed the billion revision[br]mark a few weeks ago.

9:59:59.000,9:59:59.000
This graph is a bit old, but anyway,[br]you have a live graph on our website.

9:59:59.000,9:59:59.000
That's more than 4.5 billion unique[br]source code files.

9:59:59.000,9:59:59.000
We don't actually discriminate between[br]what we would consider is source code

9:59:59.000,9:59:59.000
and what upstream developers consider[br]as source code,

9:59:59.000,9:59:59.000
so everything that's in a git repository,[br]we consider as source code

9:59:59.000,9:59:59.000
if it's below a size threshold.

9:59:59.000,9:59:59.000
A billion revisions across 80 million[br]projects.

9:59:59.000,9:59:59.000
What do we archive?

9:59:59.000,9:59:59.000
We archive GitHub, we archive Debian.

9:59:59.000,9:59:59.000
So, Debian we run the archival process[br]every day, every day we get the new packages

9:59:59.000,9:59:59.000
that have been uploaded in the archive.

9:59:59.000,9:59:59.000
GitHub, we try to keep up, we are currently[br]working on some performance improvements,

9:59:59.000,9:59:59.000
some scalability improvements to make sure[br]that we can keep up

9:59:59.000,9:59:59.000
with the development on GitHub.

9:59:59.000,9:59:59.000
We have archived as a one-off thing[br]the former content of Gitorious and Google Code

9:59:59.000,9:59:59.000
which are two prominent code hosting[br]spaces that closed recently

9:59:59.000,9:59:59.000
and we've been working on archiving[br]the contents of Bitbucket

9:59:59.000,9:59:59.000
which is kind of a challenge because[br]the API is a bit buggy and

9:59:59.000,9:59:59.000
Atlassian isn't too interested[br]in fixing it.

9:59:59.000,9:59:59.000
In concrete storage terms, we have 175TB[br]of blobs, so the files take 175TB

9:59:59.000,9:59:59.000
and kind of big database, 6TB.

9:59:59.000,9:59:59.000
The database only contains the graph of[br]the metadata for the archive

9:59:59.000,9:59:59.000
which is basically a graph of 8 billion nodes[br]and 70 billion edges.

9:59:59.000,9:59:59.000
And of course it's growing daily.

9:59:59.000,9:59:59.000
We are pretty sure this is the richest[br]source code archive that's available now

9:59:59.000,9:59:59.000
and it keeps growing.

9:59:59.000,9:59:59.000
So how do we actually…

9:59:59.000,9:59:59.000
What kind of stack do we use to store[br]all this?

9:59:59.000,9:59:59.000
We use Debian, of course.

9:59:59.000,9:59:59.000
All our deployment recipes are in Puppet[br]in public repositories.

9:59:59.000,9:59:59.000
We've started using Ceph[br]for the blob storage.

9:59:59.000,9:59:59.000
We use PostgreSQL for the metadata storage[br]with some of the standard tools that

9:59:59.000,9:59:59.000
live around PostgreSQL for backups[br]and replication.

9:59:59.000,9:59:59.000
We use standard Python stack for[br]scheduling of jobs

9:59:59.000,9:59:59.000
and for web interface stuff, basically[br]psycopg2 for the low level stuff,

9:59:59.000,9:59:59.000
Django for the web stuff

9:59:59.000,9:59:59.000
and Celery for the scheduling of jobs.

9:59:59.000,9:59:59.000
In house, we've written an ad hoc[br]object storage system which has

9:59:59.000,9:59:59.000
a bunch of backends that you can use.

9:59:59.000,9:59:59.000
Basically, we are agnostic between a UNIX[br]filesystem, Azure, Ceph, or tons of…

9:59:59.000,9:59:59.000
It's a really simple object storage system[br]where you can just put an object,

9:59:59.000,9:59:59.000
get an object, put a bunch of objects,[br]get a bunch of objects.
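
The four operations just listed can be sketched as a small interface. This is not the project's code: the class name `InMemoryObjectStorage`, the method names and the choice of SHA-1 as the object key are assumptions made for illustration, and a dictionary stands in for a real backend.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjectStorage:
    """Toy backend; a real one would sit on a filesystem, Ceph, Azure, etc."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store one object and return its identifier (SHA-1, as an assumption)."""
        obj_id = hashlib.sha1(content).hexdigest()
        self._objects[obj_id] = content  # identical content lands on the same key
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return the bytes stored under an identifier."""
        return self._objects[obj_id]

    def put_batch(self, contents: Iterable[bytes]) -> List[str]:
        """Store several objects and return their identifiers in order."""
        return [self.put(content) for content in contents]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        """Fetch several objects in the order requested."""
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._objects)


storage = InMemoryObjectStorage()
ids = storage.put_batch([b"int main(void) { return 0; }\n", b"", b""])
```

Three objects were submitted but only two are stored, because the two empty files share one identifier. Keeping the interface this small is what makes it easy to swap backends.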

9:59:59.000,9:59:59.000
We've implemented removal but we don't[br]really use it yet.

9:59:59.000,9:59:59.000
All the data model implementation,[br]all the listers, the loaders, the schedulers

9:59:59.000,9:59:59.000
everything has been written by us,[br]it's a pile of Python code.

9:59:59.000,9:59:59.000
So, basically 20 Python packages and[br]around 30 Puppet modules

9:59:59.000,9:59:59.000
to deploy all that and we've done everything[br]as a copyleft license,

9:59:59.000,9:59:59.000
GPLv3 for the backend and AGPLv3[br]for the frontend.

9:59:59.000,9:59:59.000
Even if people try and make their own[br]Software Heritage using our code,

9:59:59.000,9:59:59.000
they have to publish their changes.

9:59:59.000,9:59:59.000
Hardware-wise, we run for now everything[br]on a few hypervisors in house and

9:59:59.000,9:59:59.000
our main storage is currently still[br]on a very high density, very slow,

9:59:59.000,9:59:59.000
very bulky storage array, but we've[br]started to migrate all this thing

9:59:59.000,9:59:59.000
into a Ceph storage cluster which[br]we're gonna grow as we need

9:59:59.000,9:59:59.000
in the next few months.

9:59:59.000,9:59:59.000
We've also been granted by Microsoft[br]sponsorship, ??? sponsorship

9:59:59.000,9:59:59.000
for their cloud services.

9:59:59.000,9:59:59.000
We've started putting mirrors of everything[br]in their infrastructure as well

9:59:59.000,9:59:59.000
which means full object storage mirror,[br]so 170TB of stuff mirrored on Azure

9:59:59.000,9:59:59.000
as well as a database mirror for graph.

9:59:59.000,9:59:59.000
And we're also doing all the content[br]indexing and all the things that need

9:59:59.000,9:59:59.000
scalability on Azure now.

9:59:59.000,9:59:59.000
Finally, at the University of Bologna,[br]we have a backend storage for the download

9:59:59.000,9:59:59.000
so currently our main storage is[br]quite slow so if you want to download

9:59:59.000,9:59:59.000
a bundle of things that we've archived,[br]then we actually keep a cache of

9:59:59.000,9:59:59.000
what we've done so that it doesn't take[br]a million years to download stuff.

9:59:59.000,9:59:59.000
We do our development in a classic free[br]and open source software way,

9:59:59.000,9:59:59.000
so we talk on our mailing list, on IRC,[br]on a forge.

9:59:59.000,9:59:59.000
Everything is in English, everything is[br]public, there is more information

9:59:59.000,9:59:59.000
on our website if you want to actually[br]have a look and see what we do.

9:59:59.000,9:59:59.000
So, all that is very interesting but how[br]do we actually look into it?

9:59:59.000,9:59:59.000
One of the ways that you can browse,[br]that you can use the archive

9:59:59.000,9:59:59.000
is using a REST API.

9:59:59.000,9:59:59.000
Basically, this API allows you to do[br]pointwise browsing of the archive

9:59:59.000,9:59:59.000
so you can go and follow the links[br]in a graph,

9:59:59.000,9:59:59.000
which is very slow but gives you a pretty[br]much full access of the data.

9:59:59.000,9:59:59.000
There's an index for the API that you can[br]look at, but that's not really convenient,

9:59:59.000,9:59:59.000
so we also have a web user interface.

9:59:59.000,9:59:59.000
It's in preview right now, we're gonna do[br]a full launch in the month of June.

9:59:59.000,9:59:59.000
If you go to [br]https://archive.softwareheritage.org/browse/

9:59:59.000,9:59:59.000
with the given credentials, you can[br]have a look and see what's going on.

9:59:59.000,9:59:59.000
Basically, we have a web interface that[br]allows you to look at

9:59:59.000,9:59:59.000
what origins we have downloaded, when[br]we have downloaded the origins

9:59:59.000,9:59:59.000
with a kind of graph view of how often[br]we visited the origins

9:59:59.000,9:59:59.000
and a calendar view of when we have[br]visited the origins.

9:59:59.000,9:59:59.000
And then, inside the visits, you can[br]actually browse the contents

9:59:59.000,9:59:59.000
that we've archived.

9:59:59.000,9:59:59.000
So, for instance, this is the Python[br]repository as of May 2017

9:59:59.000,9:59:59.000
and you can have the list of files,[br]then drill down,

9:59:59.000,9:59:59.000
it should be pretty intuitive.

9:59:59.000,9:59:59.000
If you look at the history of a project,[br]you can see the differences

9:59:59.000,9:59:59.000
between two revisions of a project.

9:59:59.000,9:59:59.000
Oh no, that's the syntax highlighting,[br]but anyway the diffs arrive right after.

9:59:59.000,9:59:59.000
So, yeah, pretty cool stuff.

9:59:59.000,9:59:59.000
I should be able to do a demo as well,[br]it should work.

9:59:59.000,9:59:59.000
I'm gonna zoom in.

9:59:59.000,9:59:59.000
So this is the main archive, you can see[br]some statistics about the objects

9:59:59.000,9:59:59.000
that we've downloaded.

9:59:59.000,9:59:59.000
When you zoom in, you get some kind of[br]overflows, because…

9:59:59.000,9:59:59.000
Yeah, why would you do that.

9:59:59.000,9:59:59.000
If you want to browse, we can try to find[br]an origin.

9:59:59.000,9:59:59.000
"glibc".

9:59:59.000,9:59:59.000
So there's lots and lots of, like, random[br]GitHub forks of things…

9:59:59.000,9:59:59.000
We don't discriminate and we don't really[br]filter what we download.

9:59:59.000,9:59:59.000
We are looking into doing some relevance[br]kind of sorting of the results, here.

9:59:59.000,9:59:59.000
Next.

9:59:59.000,9:59:59.000
Xilinx, why not.

9:59:59.000,9:59:59.000
So, this has been downloaded for the last[br]time on August 3rd 2016,

9:59:59.000,9:59:59.000
so it's probably a dead repository,

9:59:59.000,9:59:59.000
but yeah, you can see a bunch of source[br]code,

9:59:59.000,9:59:59.000
you can read the README of the glibc.

9:59:59.000,9:59:59.000
If we go back to a more interesting origin

9:59:59.000,9:59:59.000
here's the repository for git.

9:59:59.000,9:59:59.000
I've selected voluntarily an old visit[br]of the repo so that we can see

9:59:59.000,9:59:59.000
what was going on then.

9:59:59.000,9:59:59.000
If I look at the calendar view, you can see[br]that we've had some issues actually

9:59:59.000,9:59:59.000
updating this, but anyway.

9:59:59.000,9:59:59.000
If I look at the last visit, then we can[br]actually browse the contents,

9:59:59.000,9:59:59.000
you can get syntax highlighting as well.

9:59:59.000,9:59:59.000
This is a big big file with lots of comments

9:59:59.000,9:59:59.000
Let's see the actual source code…

9:59:59.000,9:59:59.000
Anyway, so, that's the browsing interface.

9:59:59.000,9:59:59.000
We can also now get back what we've[br]archived and download it,

9:59:59.000,9:59:59.000
which is kind of something that you might[br]want to do

9:59:59.000,9:59:59.000
if a repository is lost, you can actually[br]download it

9:59:59.000,9:59:59.000
and get the source code back again.

9:59:59.000,9:59:59.000
How we do that.

9:59:59.000,9:59:59.000
If you go on the top right of this browsing[br]interface, you have actions and download

9:59:59.000,9:59:59.000
and you can download a directory that[br]you are currently looking at.

9:59:59.000,9:59:59.000
It's an asynchronous process, which means[br]that if there is a lot of load,

9:59:59.000,9:59:59.000
then it's gonna take some time[br]to be able to download the content.

9:59:59.000,9:59:59.000
So you can put in your email address so we[br]can notify you when the download is ready.

9:59:59.000,9:59:59.000
I'm gonna try my luck and say just "ok"[br]and it's gonna appear at some point

9:59:59.000,9:59:59.000
in the list of things that I've requested.

9:59:59.000,9:59:59.000
I've already requested some things that[br]we can actually get and open as a tarball.

9:59:59.000,9:59:59.000
Yeah, I think that's the thing that I was[br]actually looking at,

9:59:59.000,9:59:59.000
which is this revision of the git[br]source code

9:59:59.000,9:59:59.000
and then I can open it

9:59:59.000,9:59:59.000
Yay, Emacs, that's what you want.

9:59:59.000,9:59:59.000
Yay, source code.

9:59:59.000,9:59:59.000
This seems to work.

9:59:59.000,9:59:59.000
And then, of course, if you want to[br]actually script what you're doing,

9:59:59.000,9:59:59.000
there's an API that allows you to do[br]the downloads as well, so you can.

9:59:59.000,9:59:59.000
The source code is deduplicated a lot,[br]which means that for one single repository

9:59:59.000,9:59:59.000
you get tons of files that we have to[br]collect if you want to actually download

9:59:59.000,9:59:59.000
an archive of a directory.

9:59:59.000,9:59:59.000
It takes a while but we have an asynchronous[br]API so you can POST

9:59:59.000,9:59:59.000
the identifier of a revision to this URL[br]and then get status updates

9:59:59.000,9:59:59.000
and at some point, it will tell you that[br]the… here

9:59:59.000,9:59:59.000
The status will tell you that the object[br]is available.

9:59:59.000,9:59:59.000
You can download it and you can even[br]download the full history of a project

9:59:59.000,9:59:59.000
and get that as a git-fast-export archive[br]that you can reimport into

9:59:59.000,9:59:59.000
a new git repository.

9:59:59.000,9:59:59.000
So any kind of VCS that we've imported,[br]you can export as a git repository

9:59:59.000,9:59:59.000
and reimport on your machine.
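
The request-then-poll pattern described for the download API can be sketched without touching the network. Everything named here is hypothetical: `wait_for_bundle`, `FakeBundleService` and the `"pending"` and `"done"` status strings are stand-ins, not the real endpoints or responses, which are documented on the project's website.

```python
import time
from typing import Callable


def wait_for_bundle(request: Callable[[], None],
                    get_status: Callable[[], str],
                    poll_seconds: float = 0.0,
                    max_polls: int = 100) -> bool:
    """Submit a request once, then poll until the status says it is ready."""
    request()
    for _ in range(max_polls):
        if get_status() == "done":
            return True
        time.sleep(poll_seconds)
    return False  # gave up; the caller can retry later


class FakeBundleService:
    """Stand-in for the remote service: reports "done" on the third status check."""

    def __init__(self) -> None:
        self.requested = False
        self.checks = 0

    def request(self) -> None:
        self.requested = True

    def status(self) -> str:
        self.checks += 1
        return "done" if self.checks >= 3 else "pending"


service = FakeBundleService()
ready = wait_for_bundle(service.request, service.status)
```

In a real script, `request` would send the POST with the revision identifier and `get_status` would fetch the status URL, with a non-zero `poll_seconds` so the service is not hammered.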

9:59:59.000,9:59:59.000
How to get involved in the project?

9:59:59.000,9:59:59.000
We have a lot of features that we're[br]interested in, lots of them are now

9:59:59.000,9:59:59.000
in early access or have been done.

9:59:59.000,9:59:59.000
There's some stuff that we would like[br]help with.

9:59:59.000,9:59:59.000
This is some stuff that we're working on:

9:59:59.000,9:59:59.000
provenance information, you have a content

9:59:59.000,9:59:59.000
you want to know which repository[br]it comes from,

9:59:59.000,9:59:59.000
that's something we're on.

9:59:59.000,9:59:59.000
Full text search, the end goal is to be[br]able even to trace

9:59:59.000,9:59:59.000
the source of snippets of code that have[br]been copied from one project to another.

9:59:59.000,9:59:59.000
That's something that we can look into[br]with the wealth of information that

9:59:59.000,9:59:59.000
we have inside the archive.

9:59:59.000,9:59:59.000
There's a lot of things that,

9:59:59.000,9:59:59.000
I mean…

9:59:59.000,9:59:59.000
There's a lot of things that people want[br]to do with the archive.

9:59:59.000,9:59:59.000
Our goal is to enable people to do things,[br]to do interesting things

9:59:59.000,9:59:59.000
with a lot of source code.

9:59:59.000,9:59:59.000
If you have an idea of what you want to do[br]with such an archive,

9:59:59.000,9:59:59.000
please you can come talk to us

9:59:59.000,9:59:59.000
and we'll be happy to help you help us.

9:59:59.000,9:59:59.000
What we want to do is to diversify[br]the sources of things that we archive.

9:59:59.000,9:59:59.000
Currently, we have good support for git,[br]we have OK support for subversion

9:59:59.000,9:59:59.000
and mercurial.

9:59:59.000,9:59:59.000
If your project of choice is in another[br]version control system,

9:59:59.000,9:59:59.000
we are gonna miss it.

9:59:59.000,9:59:59.000
So people can contribute in this area.

9:59:59.000,9:59:59.000
For the listing part, we have coverage of[br]Debian, we have coverage of GitHub,

9:59:59.000,9:59:59.000
if your code is somewhere else, we won't[br]see it, so we need people to contribute

9:59:59.000,9:59:59.000
stuff that can list for instance GitLab[br]instances,

9:59:59.000,9:59:59.000
and then we can integrate that in our[br]infrastructure and actually have

9:59:59.000,9:59:59.000
people be able to archive their GitLab[br]instances.

9:59:59.000,9:59:59.000
And of course, we need to spread[br]the word, make the project sustainable.

9:59:59.000,9:59:59.000
We have a few sponsors now, Microsoft,[br]Nokia, Huawei, GitHub has joined as a sponsor.

9:59:59.000,9:59:59.000
The University of Bologna, of course Inria[br]is sponsoring.

9:59:59.000,9:59:59.000
But we need to keep spreading the word[br]and keep the project sustainable.

9:59:59.000,9:59:59.000
And, of course, we need to save endangered[br]source code.

9:59:59.000,9:59:59.000
For that, we have a suggestion box on[br]the wiki that you can add things to.

9:59:59.000,9:59:59.000
For instance, we have in the back of[br]our minds archiving SourceForge,

9:59:59.000,9:59:59.000
because we know that this isn't very[br]sustainable and that's at risk of being

9:59:59.000,9:59:59.000
taken down at some point.

9:59:59.000,9:59:59.000
If you want to join us, we also have[br]some job openings that are available.

9:59:59.000,9:59:59.000
For now it's in Paris, so if you want to[br]consider coming to work with us in Paris,

9:59:59.000,9:59:59.000
you can look into that.

9:59:59.000,9:59:59.000
That's Software Heritage.

9:59:59.000,9:59:59.000
We are building a reference archive of[br]all the free software

9:59:59.000,9:59:59.000
that's ever been written

9:59:59.000,9:59:59.000
in an international, open, non-profit and[br]mutualised infrastructure

9:59:59.000,9:59:59.000
that we have opened up to everyone,[br]all users, vendors, developers can use it.

9:59:59.000,9:59:59.000
The idea is to be at the service of[br]the community and for society

9:59:59.000,9:59:59.000
as a whole.

9:59:59.000,9:59:59.000
So if you want to join us, you can look at[br]our website, you can look at our code.

9:59:59.000,9:59:59.000
You can also talk to me, so if you have[br]any questions,

9:59:59.000,9:59:59.000
I think we have 10, 12 minutes for questions.

9:59:59.000,9:59:59.000
[Applause]

9:59:59.000,9:59:59.000
Do you have questions?

9:59:59.000,9:59:59.000
[Q] How do you protect the archive[br]against stuff that you don't want to

9:59:59.000,9:59:59.000
have in the archive.

9:59:59.000,9:59:59.000
I think of stuff that is copyright-[br]protected and that GitHub will also

9:59:59.000,9:59:59.000
delete after a while.

9:59:59.000,9:59:59.000
Worse, if I would misuse the archive[br]as my private backup

9:59:59.000,9:59:59.000
and store encrypted blocks on GitHub[br]and you will eventually back them up

9:59:59.000,9:59:59.000
for me.

9:59:59.000,9:59:59.000
[A] There's, I think, two sides of the[br]question.

9:59:59.000,9:59:59.000
The first side is

9:59:59.000,9:59:59.000
Do we really archive only stuff that is[br]free software and

9:59:59.000,9:59:59.000
that we can redistribute and how do we[br]manage, for instance,

9:59:59.000,9:59:59.000
copyright takedown stuff.

9:59:59.000,9:59:59.000
Currently, most of the infrastructure[br]of the project is under French law.

9:59:59.000,9:59:59.000
There's a defined process to do[br]copyright takedown in the French legal system.

9:59:59.000,9:59:59.000
We would be really annoyed to have to[br]take down content from the archive

9:59:59.000,9:59:59.000
What we do, however, is to mirror public[br]information that is publicly available.

9:59:59.000,9:59:59.000
Of course I'm not a lawyer for the project,[br]so I can't really…

9:59:59.000,9:59:59.000
I'm not 100% sure of what I'm about to say[br]but

9:59:59.000,9:59:59.000
what I know is that in the current French[br]legislation status,

9:59:59.000,9:59:59.000
if the source of the data is still available

9:59:59.000,9:59:59.000
so for instance if the data is still on[br]GitHub, then you need to have

9:59:59.000,9:59:59.000
GitHub take it down before we have to[br]take it down.

9:59:59.000,9:59:59.000
We're not currently filtering content for[br]misuse of the archive,

9:59:59.000,9:59:59.000
so the only thing that we do is put[br]a limit on the size of the files

9:59:59.000,9:59:59.000
that are archived in Software Heritage.

9:59:59.000,9:59:59.000
The limit is pretty high, like 100MB.

9:59:59.000,9:59:59.000
We can't really decide ourselves

9:59:59.000,9:59:59.000
what is source code,[br]what is not source code

9:59:59.000,9:59:59.000
because for instance if your project is[br]a cryptography library,

9:59:59.000,9:59:59.000
you might want to have some encrypted[br]blocks of data that are stored

9:59:59.000,9:59:59.000
in your source code repository as[br]test fixtures.

9:59:59.000,9:59:59.000
And then, you need them to build the code[br]and to make sure that it works.

9:59:59.000,9:59:59.000
So, how would that be any different from[br]your encrypted backup on GitHub?

9:59:59.000,9:59:59.000
How could we, Software Heritage,[br]distinguish between proper use and misuse

9:59:59.000,9:59:59.000
of the resources?

9:59:59.000,9:59:59.000
I guess our long term goal is to not have[br]to care about misuse because

9:59:59.000,9:59:59.000
it's gonna be a drop in the ocean.

9:59:59.000,9:59:59.000
We're gonna have so much…

9:59:59.000,9:59:59.000
We want to have enough space and[br]enough resources

9:59:59.000,9:59:59.000
that we don't really need to ask ourselves[br]this question, basically.

9:59:59.000,9:59:59.000
Thanks.

9:59:59.000,9:59:59.000
Other questions?

9:59:59.000,9:59:59.000
[Q] Have you looked at some form of[br]authentication to provide additional

9:59:59.000,9:59:59.000
assurance that the archived source code[br]hasn't been modified or tampered with

9:59:59.000,9:59:59.000
in some form?

9:59:59.000,9:59:59.000
[A] First of all, all the identifiers for[br]the objects that are inside the archive

9:59:59.000,9:59:59.000
are cryptographic hashes of the contents[br]that we've archived.

9:59:59.000,9:59:59.000
So, for files, for instance, we take[br]the SHA1, the SHA256,

9:59:59.000,9:59:59.000
one of the BLAKE hashes and the Git-modified[br]SHA1 of the file,

9:59:59.000,9:59:59.000
and we use that in the manifest for[br]the directories.

9:59:59.000,9:59:59.000
So the directory identifiers[br]are a hash of the manifest,

9:59:59.000,9:59:59.000
that is, the list of files that are inside[br]the directory, etc.

9:59:59.000,9:59:59.000
So, recursively, you can make sure that[br]the data that we give back to you

9:59:59.000,9:59:59.000
has not, at least, been altered by a bit flip[br]or anything.
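
[Editor's note] The hashing scheme described in this answer can be sketched in a few lines of Python. This is a minimal illustration, not Software Heritage's actual code: the function names `file_ids` and `directory_id`, the choice of BLAKE2s with a 256-bit digest for "one of the BLAKE hashes", and the text layout of the directory manifest are all assumptions made for the example. Only the Git-style file hash (SHA1 over a `blob <size>\0` header plus the content) follows a format that is fixed by Git itself.

```python
import hashlib


def file_ids(data: bytes) -> dict:
    """Compute several checksums for one file's content."""
    # Git hashes a blob as sha1(b"blob <size>\0" + content).
    git_header = b"blob " + str(len(data)).encode() + b"\0"
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Assumption: BLAKE2s with a 32-byte digest stands in for
        # "one of the BLAKE hashes" mentioned in the talk.
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


def directory_id(tree: dict) -> str:
    """Hash a manifest listing every entry of a directory.

    `tree` maps names to bytes (a file) or to a nested dict (a
    subdirectory). The manifest layout here is a simplified,
    hypothetical one, not the archive's real format.
    """
    lines = []
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, dict):
            kind, ident = "dir", directory_id(entry)  # recurse
        else:
            kind, ident = "file", file_ids(entry)["sha1_git"]
        lines.append(f"{kind} {ident} {name}\n")
    return hashlib.sha1("".join(lines).encode()).hexdigest()


project = {
    "README": b"hello\n",
    "src": {"main.c": b"int main(void) { return 0; }\n"},
}
original = directory_id(project)

# Changing a single byte deep in the tree changes the top-level identifier.
project["src"]["main.c"] = b"int main(void) { return 1; }\n"
assert directory_id(project) != original
```

Because each directory identifier depends on the identifiers of everything below it, recomputing the top-level identifier is enough to detect an altered byte anywhere in the tree.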

9:59:59.000,9:59:59.000
We regularly run a scrub of the data[br]that we have in the archive,

9:59:59.000,9:59:59.000
so we make sure that there's no rot[br]inside our archive.

9:59:59.000,9:59:59.000
We've not looked into, basically,[br]attestation of…

9:59:59.000,9:59:59.000
for instance, making sure that the code[br]that we've downloaded…

9:59:59.000,9:59:59.000
I mean, we're not doing anything more[br]than taking a picture of the data

9:59:59.000,9:59:59.000
and we say "We've computed this hash.[br]Maybe the code that's been presented

9:59:59.000,9:59:59.000
by GitHub to Software Heritage is different[br]than what you've uploaded to GitHub,

9:59:59.000,9:59:59.000
we can't tell."

9:59:59.000,9:59:59.000
In the case of git, you can always use[br]the identifiers of the objects

9:59:59.000,9:59:59.000
that you've pushed so you have[br]the commit hash,

9:59:59.000,9:59:59.000
which is itself a cryptographic identifier[br]of the contents of the commit.

9:59:59.000,9:59:59.000
In turn, if the commit is signed, then[br]the signature is still stored

9:59:59.000,9:59:59.000
in the Software Heritage metadata and[br]you can reproduce the original git object

9:59:59.000,9:59:59.000
and check the signature, but we've not[br]done anything specific for Software Heritage

9:59:59.000,9:59:59.000
in this area.
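
[Editor's note] Checking a Git object against its identifier, as described above, can be sketched as follows. This is a minimal example under stated assumptions: the helper names `git_object_id` and `matches` and the sample commit are hypothetical. The hashing rule itself (SHA1 over `<type> <size>\0` followed by the object body) is Git's own. Verifying a commit signature would be a separate step with a tool such as GnuPG and is not shown.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the Git identifier of an object: sha1("<type> <size>\\0" + body)."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def matches(expected_id: str, obj_type: str, body: bytes) -> bool:
    """Check that a reconstructed object still hashes to the identifier you know."""
    return git_object_id(obj_type, body) == expected_id


# Hypothetical commit body, in the layout Git uses for commit objects.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 0 +0000\n"
    b"committer A U Thor <author@example.com> 0 +0000\n"
    b"\n"
    b"Initial commit\n"
)
commit_id = git_object_id("commit", commit_body)

# The identifier recorded when you pushed only matches the untouched object.
assert matches(commit_id, "commit", commit_body)
assert not matches(commit_id, "commit", commit_body + b"tampered")
```

If the identifier you recorded when pushing matches the one recomputed from the archived object, the commit, and through it the tree and files it points to, is byte-for-byte what you pushed.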

9:59:59.000,9:59:59.000
Does that answer your question?

9:59:59.000,9:59:59.000
Cool.

9:59:59.000,9:59:59.000
Other questions?

9:59:59.000,9:59:59.000
There's one in front.

9:59:59.000,9:59:59.000
[Q] It's partially a question, partially[br]a comment.

9:59:59.000,9:59:59.000
Your initial idea was to have a telescope,[br]or something like this for source code.

9:59:59.000,9:59:59.000
For now, for me, it looks a little bit[br]more like a microscope,

9:59:59.000,9:59:59.000
so you can focus on one thing, but that's[br]not much.

9:59:59.000,9:59:59.000
So have you thought about how to[br]analyze an entire ecosystem

9:59:59.000,9:59:59.000
or something like this?

9:59:59.000,9:59:59.000
For example, now we have Django 2 which is[br]Python 3 only so it would be interesting to

9:59:59.000,9:59:59.000
look at all Django modules to see when[br]they start moving to this Django.

9:59:59.000,9:59:59.000
So we would need to start analyzing[br]thousands or millions of files, but then

9:59:59.000,9:59:59.000
we would need something SQL-like, or some[br]MapReduce jobs

9:59:59.000,9:59:59.000
or something like this for this.

9:59:59.000,9:59:59.000
[A] Yes.

9:59:59.000,9:59:59.000
So, we've started…

9:59:59.000,9:59:59.000
The two initiators of the project, Roberto[br]Di Cosmo and Stefano Zacchiroli

9:59:59.000,9:59:59.000
are both researchers in computer science[br]so they have a strong background in

9:59:59.000,9:59:59.000
actually mining software repositories and[br]doing some large scale analysis

9:59:59.000,9:59:59.000
on source code.

9:59:59.000,9:59:59.000
We've been talking with research groups[br]whose main goal is to do analysis on

9:59:59.000,9:59:59.000
large scale source code archives.

9:59:59.000,9:59:59.000
One of the first mirrors outside of our[br]control of the archive

9:59:59.000,9:59:59.000
will be in Grenoble (France).

9:59:59.000,9:59:59.000
There are a few teams that work on[br]actually doing large scale research

9:59:59.000,9:59:59.000
on source code over there,

9:59:59.000,9:59:59.000
so that's what the mirror will be[br]used for.

9:59:59.000,9:59:59.000
We've also been looking at what[br]the Google open source team does.

9:59:59.000,9:59:59.000
They have this big repository with all[br]the code that Google uses

9:59:59.000,9:59:59.000
and they've started to push back,[br]like do large scale analysis of

9:59:59.000,9:59:59.000
security vulnerabilities, issues with[br]static and dynamic analysis

9:59:59.000,9:59:59.000
of the code and they've started pushing[br]their fixes upstream.

9:59:59.000,9:59:59.000
That's something that we want to enable[br]users to do,

9:59:59.000,9:59:59.000
that's not something that we want to do[br]ourselves, but we want to make sure

9:59:59.000,9:59:59.000
that people can do it using our archive.

9:59:59.000,9:59:59.000
So we'd be happy to work with people[br]who already do that so that

9:59:59.000,9:59:59.000
they can use their knowledge and their[br]tools inside our archive.

9:59:59.000,9:59:59.000
Does that answer your question?

9:59:59.000,9:59:59.000
Cool.

9:59:59.000,9:59:59.000
Any more questions?

9:59:59.000,9:59:59.000
No? Then thank you very much, Nicolas.

9:59:59.000,9:59:59.000
Thank you.

9:59:59.000,9:59:59.000
[Applause]