1
99:59:59,999 --> 99:59:59,999
Hi, thank you.

2
99:59:59,999 --> 99:59:59,999
I'm Nicolas Dandrimont and I will indeed
be talking to you about

3
99:59:59,999 --> 99:59:59,999
Software Heritage.

4
99:59:59,999 --> 99:59:59,999
I'm a software engineer for this project.

5
99:59:59,999 --> 99:59:59,999
I've been working on it for 3 years now.

6
99:59:59,999 --> 99:59:59,999
And we'll see what this thing is all about.

7
99:59:59,999 --> 99:59:59,999
[Mic not working]

8
99:59:59,999 --> 99:59:59,999
I guess the batteries are out.

9
99:59:59,999 --> 99:59:59,999
So, let's try that again.

10
99:59:59,999 --> 99:59:59,999
So, we all know, we've been doing
free software for a while,

11
99:59:59,999 --> 99:59:59,999
that software source code is something
special.

12
99:59:59,999 --> 99:59:59,999
Why is that?

13
99:59:59,999 --> 99:59:59,999
As Harold Abelson has said in SICP, his
textbook on programming,

14
99:59:59,999 --> 99:59:59,999
programs are meant to be read by people
and then incidentally for machines to execute.

15
99:59:59,999 --> 99:59:59,999
Basically, what software source code
provides us is a way inside

16
99:59:59,999 --> 99:59:59,999
the mind of the designer of the program.

17
99:59:59,999 --> 99:59:59,999
For instance, you can have,
you can get inside very crazy algorithms

18
99:59:59,999 --> 99:59:59,999
that can do very fast reverse square roots
for 3D, that kind of stuff

19
99:59:59,999 --> 99:59:59,999
Like in the Quake III source code.

20
99:59:59,999 --> 99:59:59,999
You can also get inside the algorithms
that are underpinning the internet,

21
99:59:59,999 --> 99:59:59,999
for instance seeing the net queue
algorithm in the Linux kernel.

22
99:59:59,999 --> 99:59:59,999
What we are building as the free software
community is the free software commons.

23
99:59:59,999 --> 99:59:59,999
Basically, the commons is all the cultural
and social and natural resources

24
99:59:59,999 --> 99:59:59,999
that we share and that everyone
has access to.

25
99:59:59,999 --> 99:59:59,999
More specifically, the software commons
is what we are building

26
99:59:59,999 --> 99:59:59,999
with software that is open and that is
available for all to use, to modify,

27
99:59:59,999 --> 99:59:59,999
to execute, to distribute.

28
99:59:59,999 --> 99:59:59,999
We know that those commons are a really
critical part of our commons.

29
99:59:59,999 --> 99:59:59,999
Who's taking care of it?

30
99:59:59,999 --> 99:59:59,999
The software is fragile.

31
99:59:59,999 --> 99:59:59,999
Like all digital information, you can lose
software.

32
99:59:59,999 --> 99:59:59,999
People can decide to shut down hosting
spaces because of business decisions.

33
99:59:59,999 --> 99:59:59,999
People can hack into software hosting
platforms and remove the code maliciously

34
99:59:59,999 --> 99:59:59,999
or just inadvertently.

35
99:59:59,999 --> 99:59:59,999
And, of course, for the obsolete stuff,
there's rot.

36
99:59:59,999 --> 99:59:59,999
If you don't care about the data, then
it rots and it decays and you lose it.

37
99:59:59,999 --> 99:59:59,999
So, where is the archive we go to
when something is lost,

38
99:59:59,999 --> 99:59:59,999
when GitLab goes away, when GitHub
goes away.

39
99:59:59,999 --> 99:59:59,999
Where do we go?

40
99:59:59,999 --> 99:59:59,999
Finally, there's one last thing that we
noticed, it's that

41
99:59:59,999 --> 99:59:59,999
there's a lot of teams that work on
research on software

42
99:59:59,999 --> 99:59:59,999
and there's no real big infrastructure
for research on code.

43
99:59:59,999 --> 99:59:59,999
There's tons of critical issues around
code: safety, security, verification, proofs.

44
99:59:59,999 --> 99:59:59,999
Nobody's doing this at a very large scale.

45
99:59:59,999 --> 99:59:59,999
If you want to see the stars, you go
to the Atacama desert and

46
99:59:59,999 --> 99:59:59,999
you point a telescope at the sky.

47
99:59:59,999 --> 99:59:59,999
Where is the telescope for source code?

48
99:59:59,999 --> 99:59:59,999
That's what Software Heritage wants to be.

49
99:59:59,999 --> 99:59:59,999
What we do is we collect, we preserve
and we share all the software

50
99:59:59,999 --> 99:59:59,999
that is publicly available.

51
99:59:59,999 --> 99:59:59,999
Why do we do that? We do that to
preserve the past, to enhance the present

52
99:59:59,999 --> 99:59:59,999
and to prepare for the future.

53
99:59:59,999 --> 99:59:59,999
What we're building is a base infrastructure
that can be used

54
99:59:59,999 --> 99:59:59,999
for cultural heritage, for industry,
for research and for education purposes.

55
99:59:59,999 --> 99:59:59,999
How do we do it? We do it with an open
approach.

56
99:59:59,999 --> 99:59:59,999
Every single line of code that we write
is free software.

57
99:59:59,999 --> 99:59:59,999
We do it transparently, everything that
we do, we do it in the open,

58
99:59:59,999 --> 99:59:59,999
be that on a mailing list or on
our issue tracker.

59
99:59:59,999 --> 99:59:59,999
And we strive to do it for the very long
haul, so we do it with replication in mind

60
99:59:59,999 --> 99:59:59,999
so that no single entity has full control
over the data that we collect.

61
99:59:59,999 --> 99:59:59,999
And we do it in a non-profit fashion
so that we avoid

62
99:59:59,999 --> 99:59:59,999
business-driven decisions impacting
the project.

63
99:59:59,999 --> 99:59:59,999
So, what do we do concretely?

64
99:59:59,999 --> 99:59:59,999
We do archiving of version control systems.

65
99:59:59,999 --> 99:59:59,999
What does that mean?

66
99:59:59,999 --> 99:59:59,999
It means we archive file contents, so
source code, files.

67
99:59:59,999 --> 99:59:59,999
We archive revisions, which means all the
metadata of the history of the projects,

68
99:59:59,999 --> 99:59:59,999
we try to download it and we put it inside
a common data model that is

69
99:59:59,999 --> 99:59:59,999
shared across all the archive.

70
99:59:59,999 --> 99:59:59,999
We archive releases of the software,
releases that have been tagged

71
99:59:59,999 --> 99:59:59,999
in a version control system as well as
releases that we can find as tarballs

72
99:59:59,999 --> 99:59:59,999
because sometimes… boof, views of
this source code differ.

73
99:59:59,999 --> 99:59:59,999
Of course, we archive where and when
we've seen the data that we've collected.

74
99:59:59,999 --> 99:59:59,999
All of this, we put inside a canonical,
VCS-agnostic, data model.

75
99:59:59,999 --> 99:59:59,999
If you have a Debian package, with its
history, if you have a git repository,

76
99:59:59,999 --> 99:59:59,999
if you have a Subversion repository, if
you have a Mercurial repository,

77
99:59:59,999 --> 99:59:59,999
it all looks the same and you can work
on it with the same tools.

78
99:59:59,999 --> 99:59:59,999
What we don't do is archive what's around
the software, for instance

79
99:59:59,999 --> 99:59:59,999
the bug tracking systems or the homepages
or the wikis or the mailing lists.

80
99:59:59,999 --> 99:59:59,999
There are some projects that work
in this space, for instance

81
99:59:59,999 --> 99:59:59,999
the Internet Archive does a lot of
really good work around archiving the web.

82
99:59:59,999 --> 99:59:59,999
Our goal is not to replace them, but to
work with them and be able to do

83
99:59:59,999 --> 99:59:59,999
linking across all the archives that exist.

84
99:59:59,999 --> 99:59:59,999
For instance, for the mailing lists,
there's the Gmane project

85
99:59:59,999 --> 99:59:59,999
that does a lot of archiving of free
software mailing lists.

86
99:59:59,999 --> 99:59:59,999
So our long term vision is to play a part
in a semantic Wikipedia of software,

87
99:59:59,999 --> 99:59:59,999
a Wikidata of software where we can
hyperlink all the archives that exist

88
99:59:59,999 --> 99:59:59,999
and do stuff in the area.

89
99:59:59,999 --> 99:59:59,999
Quick tour of our infrastructure.

90
99:59:59,999 --> 99:59:59,999
Basically, all the way to the right is
our archive.

91
99:59:59,999 --> 99:59:59,999
Our archive consists of a huge graph
of all the metadata about

92
99:59:59,999 --> 99:59:59,999
the files, the directories, the revisions,
the commits and the releases and

93
99:59:59,999 --> 99:59:59,999
all the projects that are on top
of the graph.

94
99:59:59,999 --> 99:59:59,999
We separate the file storage into another
object storage because of

95
99:59:59,999 --> 99:59:59,999
the size discrepancy: we have lots and lots
of file contents that we need to store

96
99:59:59,999 --> 99:59:59,999
so we do that outside the database
that is used to store the graph.

97
99:59:59,999 --> 99:59:59,999
Basically, what we archive is a set of
software origins that are

98
99:59:59,999 --> 99:59:59,999
git repositories, Mercurial repositories,
etc. etc.

99
99:59:59,999 --> 99:59:59,999
All those origins are loaded on a
regular schedule.

100
99:59:59,999 --> 99:59:59,999
If there is a very active software origin,
we're gonna archive it more often

101
99:59:59,999 --> 99:59:59,999
than stale things that don't get
a lot of updates.
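
The idea of visiting active origins more often than stale ones can be sketched as an adaptive interval. This is only an illustration of one possible policy, not the project's actual scheduler: the function name, the halve/double rule and both bounds are assumptions made for the example.

```python
from datetime import timedelta

# Assumed bounds, chosen only for illustration.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_visit_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    If the last visit found new data, halve the interval so the origin
    is visited more often; otherwise double it. The result is clamped
    to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

With this rule, an origin that keeps changing converges towards the minimum interval, and one that never changes drifts towards the maximum.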

102
99:59:59,999 --> 99:59:59,999
What do we do to get the list of software
origins that we archive?

103
99:59:59,999 --> 99:59:59,999
We have a bunch of listers that can
scroll through the list of repositories,

104
99:59:59,999 --> 99:59:59,999
for instance on GitHub or other
hosting platforms.

105
99:59:59,999 --> 99:59:59,999
We have code that can read Debian archive
metadata to make a list of the packages

106
99:59:59,999 --> 99:59:59,999
that are inside this archive and can be
archived, etc.

107
99:59:59,999 --> 99:59:59,999
All of this is done on a regular basis.

108
99:59:59,999 --> 99:59:59,999
We are currently working on some kind
of push mechanism so that

109
99:59:59,999 --> 99:59:59,999
people or other systems can notify us
of updates.

110
99:59:59,999 --> 99:59:59,999
Our goal is not to do real time archiving,
we're really in it for the long run

111
99:59:59,999 --> 99:59:59,999
but we still want to be able to prioritize
stuff that people tell us is

112
99:59:59,999 --> 99:59:59,999
important to archive.

113
99:59:59,999 --> 99:59:59,999
The Internet Archive has a "save now"
button and we want to implement

114
99:59:59,999 --> 99:59:59,999
something along those lines as well,

115
99:59:59,999 --> 99:59:59,999
so if we know that some software project
is in danger for a reason or another,

116
99:59:59,999 --> 99:59:59,999
then we can prioritize archiving it.

117
99:59:59,999 --> 99:59:59,999
So this is the basic structure of a revision
in the Software Heritage archive.

118
99:59:59,999 --> 99:59:59,999
You'll see that it's very similar to
a git commit.

119
99:59:59,999 --> 99:59:59,999
The format of the metadata is pretty much
what you'll find in a git commit

120
99:59:59,999 --> 99:59:59,999
with some extensions that you don't
see here because this is from a git commit

121
99:59:59,999 --> 99:59:59,999
So basically what we do is we take the
identifier of the directory

122
99:59:59,999 --> 99:59:59,999
that the revision points to, we take the
identifier of the parent of the revision

123
99:59:59,999 --> 99:59:59,999
so we can keep track of the history

124
99:59:59,999 --> 99:59:59,999
and then we add some metadata,
authorship and committership information

125
99:59:59,999 --> 99:59:59,999
and the revision message and then we take
a hash of this,

126
99:59:59,999 --> 99:59:59,999
it makes an identifier that's probably
unique, very very probably unique.

127
99:59:59,999 --> 99:59:59,999
Using those identifiers, we can retrace
all the origins, all the history of

128
99:59:59,999 --> 99:59:59,999
development of the project and we can
deduplicate across all the archive.

129
99:59:59,999 --> 99:59:59,999
All the identifiers are intrinsic, which
means that we compute them

130
99:59:59,999 --> 99:59:59,999
from the contents of the things that
we are archiving, which means that

131
99:59:59,999 --> 99:59:59,999
we can deduplicate very efficiently
across all the data that we archive.
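
The intrinsic identifier described above can be sketched with git's own object hashing, since the talk says a revision is very similar to a git commit. This is a minimal sketch assuming plain git-style SHA-1 hashing; the function names are made up for the example, and the archive's real format has extensions that are not modelled here.

```python
import hashlib
from typing import Sequence


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def revision_id(directory: str, parents: Sequence[str],
                author: str, committer: str, message: str) -> str:
    """Compute an identifier from the revision's own content only."""
    lines = [f"tree {directory}"]                 # directory the revision points to
    lines += [f"parent {p}" for p in parents]     # parent links keep the history
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode()
    return git_object_id("commit", body)
```

Because the identifier depends only on the content, the same revision found in two different origins gets the same identifier and only needs to be stored once.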

132
99:59:59,999 --> 99:59:59,999
How much data do we archive?

133
99:59:59,999 --> 99:59:59,999
A bit.

134
99:59:59,999 --> 99:59:59,999
So, we have passed the billion revision
mark a few weeks ago.

135
99:59:59,999 --> 99:59:59,999
This graph is a bit old, but anyway,
you have a live graph on our website.

136
99:59:59,999 --> 99:59:59,999
That's more than 4.5 billion unique
source code files.

137
99:59:59,999 --> 99:59:59,999
We don't actually discriminate between
what we would consider source code

138
99:59:59,999 --> 99:59:59,999
and what upstream developers consider
as source code,

139
99:59:59,999 --> 99:59:59,999
so everything that's in a git repository,
we consider as source code

140
99:59:59,999 --> 99:59:59,999
if it's below a size threshold.

141
99:59:59,999 --> 99:59:59,999
A billion revisions across 80 million
projects.

142
99:59:59,999 --> 99:59:59,999
What do we archive?

143
99:59:59,999 --> 99:59:59,999
We archive GitHub, we archive Debian.

144
99:59:59,999 --> 99:59:59,999
So, Debian we run the archival process
every day, every day we get the new packages

145
99:59:59,999 --> 99:59:59,999
that have been uploaded in the archive.

146
99:59:59,999 --> 99:59:59,999
GitHub, we try to keep up, we are currently
working on some performance improvements,

147
99:59:59,999 --> 99:59:59,999
some scalability improvements to make sure
that we can keep up

148
99:59:59,999 --> 99:59:59,999
with the development on GitHub.

149
99:59:59,999 --> 99:59:59,999
We have archived as a one-off thing
the former content of Gitorious and Google Code

150
99:59:59,999 --> 99:59:59,999
which are two prominent code hosting
spaces that closed recently

151
99:59:59,999 --> 99:59:59,999
and we've been working on archiving
the contents of Bitbucket

152
99:59:59,999 --> 99:59:59,999
which is kind of a challenge because
the API is a bit buggy and

153
99:59:59,999 --> 99:59:59,999
Atlassian isn't too interested
in fixing it.

154
99:59:59,999 --> 99:59:59,999
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB

155
99:59:59,999 --> 99:59:59,999
and kind of big database, 6TB.

156
99:59:59,999 --> 99:59:59,999
The database only contains the graph of
the metadata for the archive

157
99:59:59,999 --> 99:59:59,999
which is basically an 8 billion node,
70 billion edge graph.

158
99:59:59,999 --> 99:59:59,999
And of course it's growing daily.

159
99:59:59,999 --> 99:59:59,999
We are pretty sure this is the richest
source code archive that's available now

160
99:59:59,999 --> 99:59:59,999
and it keeps growing.

161
99:59:59,999 --> 99:59:59,999
So how do we actually…

162
99:59:59,999 --> 99:59:59,999
What kind of stack do we use to store
all this?

163
99:59:59,999 --> 99:59:59,999
We use Debian, of course.

164
99:59:59,999 --> 99:59:59,999
All our deployment recipes are in Puppet
in public repositories.

165
99:59:59,999 --> 99:59:59,999
We've started using Ceph
for the blob storage.

166
99:59:59,999 --> 99:59:59,999
We use PostgreSQL for the metadata storage
with some of the standard tools that

167
99:59:59,999 --> 99:59:59,999
live around PostgreSQL for backups
and replication.

168
99:59:59,999 --> 99:59:59,999
We use a standard Python stack for
scheduling of jobs

169
99:59:59,999 --> 99:59:59,999
and for web interface stuff, basically
psycopg2 for the low level stuff,

170
99:59:59,999 --> 99:59:59,999
Django for the web stuff

171
99:59:59,999 --> 99:59:59,999
and Celery for the scheduling of jobs.

172
99:59:59,999 --> 99:59:59,999
In house, we've written an ad hoc
object storage system which has

173
99:59:59,999 --> 99:59:59,999
a bunch of backends that you can use.

174
99:59:59,999 --> 99:59:59,999
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…

175
99:59:59,999 --> 99:59:59,999
It's a really simple object storage system
where you can just put an object,

176
99:59:59,999 --> 99:59:59,999
get an object, put a bunch of objects,
get a bunch of objects.
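
The put/get interface described above can be sketched as a tiny in-memory backend. This is a hypothetical illustration, not the project's object storage code: the class and method names are invented, and keying objects by the SHA-1 of their content is an assumption based on the intrinsic identifiers mentioned earlier in the talk.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjStorage:
    """Hypothetical backend that keeps objects in a dict keyed by SHA-1."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def add(self, content: bytes) -> str:
        """Store one object and return its content-derived identifier."""
        obj_id = hashlib.sha1(content).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._objects.setdefault(obj_id, content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return one object; raises KeyError if the id is unknown."""
        return self._objects[obj_id]

    def add_batch(self, contents: Iterable[bytes]) -> List[str]:
        """Store a bunch of objects and return their identifiers in order."""
        return [self.add(content) for content in contents]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        """Return a bunch of objects in the order of the given identifiers."""
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._objects)
```

A filesystem or cloud backend would expose the same four operations and differ only in where the bytes end up.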

177
99:59:59,999 --> 99:59:59,999
We've implemented removal but we don't
really use it yet.

178
99:59:59,999 --> 99:59:59,999
All the data model implementation,
all the listers, the loaders, the schedulers

179
99:59:59,999 --> 99:59:59,999
everything has been written by us,
it's a pile of Python code.

180
99:59:59,999 --> 99:59:59,999
So, basically 20 Python packages and
around 30 Puppet modules

181
99:59:59,999 --> 99:59:59,999
to deploy all that and we've done everything
under a copyleft license,

182
99:59:59,999 --> 99:59:59,999
GPLv3 for the backend and AGPLv3
for the frontend.

183
99:59:59,999 --> 99:59:59,999
Even if people try and make their own
Software Heritage using our code,

184
99:59:59,999 --> 99:59:59,999
they have to publish their changes.

185
99:59:59,999 --> 99:59:59,999
Hardware-wise, we run for now everything
on a few hypervisors in house and

186
99:59:59,999 --> 99:59:59,999
our main storage is currently still
on a very high density, very slow,

187
99:59:59,999 --> 99:59:59,999
very bulky storage array, but we've
started to migrate all this thing

188
99:59:59,999 --> 99:59:59,999
into a Ceph storage cluster which
we're gonna grow as we need

189
99:59:59,999 --> 99:59:59,999
in the next few months.

190
99:59:59,999 --> 99:59:59,999
We've also been granted by Microsoft
sponsorship, ??? sponsorship

191
99:59:59,999 --> 99:59:59,999
for their cloud services.

192
99:59:59,999 --> 99:59:59,999
We've started putting mirrors of everything
in their infrastructure as well

193
99:59:59,999 --> 99:59:59,999
which means full object storage mirror,
so 170TB of stuff mirrored on Azure

194
99:59:59,999 --> 99:59:59,999
as well as a database mirror for the graph.

195
99:59:59,999 --> 99:59:59,999
And we're also doing all the content
indexing and all the things that need

196
99:59:59,999 --> 99:59:59,999
scalability on Azure now.

197
99:59:59,999 --> 99:59:59,999
Finally, at the University of Bologna,
we have a backend storage for the download

198
99:59:59,999 --> 99:59:59,999
so currently our main storage is
quite slow so if you want to download

199
99:59:59,999 --> 99:59:59,999
a bundle of things that we've archived,
then we actually keep a cache of

200
99:59:59,999 --> 99:59:59,999
what we've done so that it doesn't take
a million years to download stuff.

201
99:59:59,999 --> 99:59:59,999
We do our development in a classic free
and open source software way,

202
99:59:59,999 --> 99:59:59,999
so we talk on our mailing list, on IRC,
on a forge.

203
99:59:59,999 --> 99:59:59,999
Everything is in English, everything is
public, there is more information

204
99:59:59,999 --> 99:59:59,999
on our website if you want to actually
have a look and see what we do.

205
99:59:59,999 --> 99:59:59,999
So, all that is very interesting but how
do we actually look into it?

206
99:59:59,999 --> 99:59:59,999
One of the ways that you can browse,
that you can use the archive

207
99:59:59,999 --> 99:59:59,999
is using a REST API.

208
99:59:59,999 --> 99:59:59,999
Basically, this API allows you to do
pointwise browsing of the archive

209
99:59:59,999 --> 99:59:59,999
so you can go and follow the links
in a graph,

210
99:59:59,999 --> 99:59:59,999
which is very slow but gives you a pretty
much full access to the data.

211
99:59:59,999 --> 99:59:59,999
There's an index for the API that you can
look at, but that's not really convenient,

212
99:59:59,999 --> 99:59:59,999
so we also have a web user interface.

213
99:59:59,999 --> 99:59:59,999
It's in preview right now, we're gonna do
a full launch in the month of June.

214
99:59:59,999 --> 99:59:59,999
If you go to 
https://archive.softwareheritage.org/browse/

215
99:59:59,999 --> 99:59:59,999
with the given credentials, you can
have a look and see what's going on.

216
99:59:59,999 --> 99:59:59,999
Basically, we have a web interface that
allows you to look at

217
99:59:59,999 --> 99:59:59,999
what origins we have downloaded, when
we have downloaded the origins

218
99:59:59,999 --> 99:59:59,999
with a kind of graph view of how often
we visited the origins

219
99:59:59,999 --> 99:59:59,999
and a calendar view of when we have
visited the origins.

220
99:59:59,999 --> 99:59:59,999
And then, inside the visits, you can
actually browse the contents

221
99:59:59,999 --> 99:59:59,999
that we've archived.

222
99:59:59,999 --> 99:59:59,999
So, for instance, this is the Python
repository as of May 2017

223
99:59:59,999 --> 99:59:59,999
and you can have the list of files,
then drill down,

224
99:59:59,999 --> 99:59:59,999
it should be pretty intuitive.

225
99:59:59,999 --> 99:59:59,999
If you look at the history of a project,
you can see the differences

226
99:59:59,999 --> 99:59:59,999
between two revisions of a project.

227
99:59:59,999 --> 99:59:59,999
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.