1
00:00:05,480 --> 00:00:07,105
Hi, thank you.

2
00:00:07,879 --> 00:00:11,293
I'm Nicolas Dandrimont and I will indeed
be talking to you about

3
00:00:11,293 --> 00:00:12,551
Software Heritage.

4
00:00:12,876 --> 00:00:15,232
I'm a software engineer for this project.

5
00:00:15,639 --> 00:00:17,712
I've been working on it for 3 years now.

6
00:00:18,485 --> 00:00:21,572
And we'll see what this thing is all about.

7
00:00:23,767 --> 00:00:38,808
[Mic not working]

8
00:00:39,174 --> 00:00:40,752
I guess the batteries are out.

9
00:00:49,949 --> 00:00:51,720
So, let's try that again.

10
00:00:52,050 --> 00:00:55,380
So, we all know, we've been doing
free software for a while,

11
00:00:55,616 --> 00:00:59,806
that software source code is something
special.

12
00:01:00,779 --> 00:01:02,031
Why is that?

13
00:01:02,731 --> 00:01:09,963
As Harold Abelson has said in SICP, his
textbook on programming,

14
00:01:09,963 --> 00:01:18,782
programs are meant to be read by people
and then incidentally for machines to execute.

15
00:01:20,213 --> 00:01:25,661
Basically, what software source code
provides us is a way inside

16
00:01:25,661 --> 00:01:28,547
the mind of the designer of the program.

17
00:01:29,309 --> 00:01:37,938
For instance, you can have,
you can get inside very crazy algorithms

18
00:01:37,938 --> 00:01:46,564
that can do very fast reverse square roots
for 3D, that kind of stuff

19
00:01:47,211 --> 00:01:49,524
Like in the Quake 2 source code.

20
00:01:49,860 --> 00:01:54,606
You can also get inside the algorithms
that are underpinning the internet,

21
00:01:54,606 --> 00:01:59,765
for instance seeing the net queue
algorithm in the Linux kernel.

22
00:02:03,631 --> 00:02:10,218
What we are building as the free software
community is the free software commons.

23
00:02:10,948 --> 00:02:18,629
Basically, the commons is all the cultural
and social and natural resources

24
00:02:18,629 --> 00:02:21,802
that we share and that everyone
has access to.

25
00:02:22,410 --> 00:02:25,744
More specifically, the software commons
is what we are building

26
00:02:25,744 --> 00:02:31,878
with software that is open and that is
available for all to use, to modify,

27
00:02:31,878 --> 00:02:34,887
to execute, to distribute.

28
00:02:37,252 --> 00:02:45,251
We know that those commons are a really
critical part of our commons.

29
00:02:46,306 --> 00:02:48,137
Who's taking care of it?

30
00:02:49,684 --> 00:02:51,800
The software is fragile.

31
00:02:51,800 --> 00:02:54,405
Like all digital information, you can lose
software.

32
00:02:55,625 --> 00:03:01,634
People can decide to shut down hosting
spaces because of business decisions.

33
00:03:02,939 --> 00:03:08,913
People can hack into software hosting
platforms and remove the code maliciously

34
00:03:08,913 --> 00:03:10,864
or just inadvertently.

35
00:03:12,978 --> 00:03:17,898
And, of course, for the obsolete stuff,
there's rot.

36
00:03:18,468 --> 00:03:24,773
If you don't care about the data, then
it rots and it decays and you lose it.

37
00:03:26,157 --> 00:03:31,238
So, where is the archive we go to
when something is lost,

38
00:03:31,238 --> 00:03:33,965
when GitLab goes away, when GitHub
goes away?

39
00:03:34,411 --> 00:03:35,708
Where do we go?

40
00:03:36,519 --> 00:03:40,989
Finally, there's one last thing that we
noticed, it's that

41
00:03:40,989 --> 00:03:48,581
there's a lot of teams that work on
research on software

42
00:03:48,581 --> 00:03:54,310
and there's no real big infrastructure
for research on code.

43
00:03:56,510 --> 00:04:02,129
There's tons of critical issues around
code: safety, security, verification, proofs.

44
00:04:03,583 --> 00:04:07,694
Nobody's doing this at a very large scale.

45
00:04:08,466 --> 00:04:12,244
If you want to see the stars, you go
to the Atacama Desert and

46
00:04:12,244 --> 00:04:13,830
you point a telescope at the sky.

47
00:04:14,477 --> 00:04:17,526
Where is the telescope for source code?

48
00:04:17,973 --> 00:04:20,983
That's what Software Heritage wants to be.

49
00:04:22,081 --> 00:04:27,651
What we do is we collect, we preserve
and we share all the software

50
00:04:27,651 --> 00:04:29,887
that is publicly available.

51
00:04:31,139 --> 00:04:35,852
Why do we do that? We do that to
preserve the past, to enhance the present

52
00:04:35,852 --> 00:04:37,848
and to prepare for the future.

53
00:04:39,715 --> 00:04:44,588
What we're building is a base infrastructure
that can be used

54
00:04:44,588 --> 00:04:50,359
for cultural heritage, for industry,
for research and for education purposes.

55
00:04:50,724 --> 00:04:53,120
How do we do it? We do it with an open
approach.

56
00:04:53,406 --> 00:04:56,613
Every single line of code that we write
is free software.

57
00:04:59,088 --> 00:05:04,653
We do it transparently, everything that
we do, we do it in the open,

58
00:05:04,653 --> 00:05:09,124
be that on a mailing list or on
our issue tracker.

59
00:05:09,858 --> 00:05:15,873
And we strive to do it for the very long
haul, so we do it with replication in mind

60
00:05:15,873 --> 00:05:21,806
so that no single entity has full control
over the data that we collect.

61
00:05:22,945 --> 00:05:27,335
And we do it in a non-profit fashion
so that we avoid

62
00:05:27,335 --> 00:05:32,786
business-driven decisions impacting
the project.

63
00:05:35,470 --> 00:05:38,683
So, what do we do concretely?

64
00:05:39,009 --> 00:05:42,951
We do archiving of version control systems.

65
00:05:43,276 --> 00:05:44,617
What does that mean?

66
00:05:45,755 --> 00:05:49,411
It means we archive file contents, so
source code, files.

67
00:05:49,411 --> 00:05:55,673
We archive revisions, which means all the
metadata of the history of the projects,

68
00:05:55,673 --> 00:06:03,148
we try to download it and we put it inside
a common data model that is

69
00:06:03,148 --> 00:06:06,968
shared across all the archive.

70
00:06:08,555 --> 00:06:13,590
We archive releases of the software,
releases that have been tagged

71
00:06:13,590 --> 00:06:18,339
in a version control system as well as
releases that we can find as tarballs

72
00:06:18,339 --> 00:06:23,945
because sometimes… boof, views of
this source code differ.

73
00:06:27,814 --> 00:06:32,367
Of course, we archive where and when
we've seen the data that we've collected.

74
00:06:32,977 --> 00:06:40,266
All of this, we put inside a canonical,
VCS-agnostic, data model.

75
00:06:41,983 --> 00:06:46,782
If you have a Debian package, with its
history, if you have a git repository,

76
00:06:46,782 --> 00:06:50,197
if you have a Subversion repository, if
you have a Mercurial repository,

77
00:06:50,197 --> 00:06:53,857
it all looks the same and you can work
on it with the same tools.

78
00:06:54,995 --> 00:07:01,415
What we don't do is archive what's around
the software, for instance

79
00:07:01,415 --> 00:07:05,720
the bug tracking systems or the homepages
or the wikis or the mailing lists.

80
00:07:06,696 --> 00:07:10,555
There are some projects that work
in this space, for instance

81
00:07:10,555 --> 00:07:15,798
the Internet Archive does a lot of
very good work around archiving the web.

82
00:07:17,658 --> 00:07:24,417
Our goal is not to replace them, but to
work with them and be able to do

83
00:07:24,417 --> 00:07:29,291
linking across all the archives that exist.

84
00:07:29,705 --> 00:07:35,020
We can, for instance for the mailing lists
there's the Gmane project

85
00:07:35,020 --> 00:07:38,998
that does a lot of archiving of free
software mailing lists.

86
00:07:39,729 --> 00:07:47,743
So our long term vision is to play a part
in a semantic Wikipedia of software,

87
00:07:47,743 --> 00:07:53,921
a Wikidata of software where we can
hyperlink all the archives that exist

88
00:07:53,921 --> 00:07:56,853
and do stuff in the area.

89
00:08:00,594 --> 00:08:02,591
Quick tour of our infrastructure.

90
00:08:02,828 --> 00:08:10,224
Basically, all the way to the right is
our archive.

91
00:08:11,447 --> 00:08:16,851
Our archive consists of a huge graph
of all the metadata about

92
00:08:16,851 --> 00:08:24,617
the files, the directories, the revisions,
the commits and the releases and

93
00:08:24,617 --> 00:08:27,784
all the projects that are on top
of the graph.

94
00:08:29,129 --> 00:08:33,600
We separate the file storage into another
object storage because of

95
00:08:33,600 --> 00:08:41,654
the size discrepancy: we have lots and lots
of file contents that we need to store

96
00:08:41,654 --> 00:08:46,323
so we do that outside of the database
that is used to store the graph.

97
00:08:49,495 --> 00:08:54,162
Basically, what we archive is a set of
software origins that are

98
00:08:54,162 --> 00:08:58,830
git repositories, Mercurial repositories,
etc. etc.

99
00:08:59,689 --> 00:09:05,254
All those origins are loaded on a
regular schedule.

100
00:09:06,887 --> 00:09:13,472
If there is a very active software origin,
we're gonna archive it more often

101
00:09:13,472 --> 00:09:17,750
than stale things that don't get
a lot of updates.

102
00:09:19,663 --> 00:09:24,415
What do we do to get the list of software
origins that we archive?

103
00:09:24,821 --> 00:09:30,677
We have a bunch of listers that can
scroll through the list of repositories,

104
00:09:30,677 --> 00:09:33,767
for instance on GitHub or other
hosting platforms.

105
00:09:34,945 --> 00:09:42,181
We have code that can read Debian archive
metadata to make a list of the packages

106
00:09:42,181 --> 00:09:49,412
that are inside this archive and can be
archived, etc.

107
00:09:50,387 --> 00:09:52,611
All of this is done on a regular basis.

108
00:09:53,515 --> 00:09:57,450
We are currently working on some kind
of push mechanism so that

109
00:09:57,450 --> 00:10:01,485
people or other systems can notify us
of updates.

110
00:10:02,990 --> 00:10:09,673
Our goal is not to do real time archiving,
we're really in it for the long run

111
00:10:09,673 --> 00:10:16,010
but we still want to be able to prioritize
stuff that people tell us is

112
00:10:16,010 --> 00:10:17,879
important to archive.

113
00:10:19,951 --> 00:10:23,930
The Internet Archive has a "save now"
button and we want to implement

114
00:10:23,930 --> 00:10:26,245
something along those lines as well,

115
00:10:26,245 --> 00:10:31,545
so if we know that some software project
is in danger for a reason or another,

116
00:10:31,545 --> 00:10:34,145
then we can prioritize archiving it.

117
00:10:35,811 --> 00:10:39,916
So this is the basic structure of a revision
in the Software Heritage archive.

118
00:10:41,987 --> 00:10:45,073
You'll see that it's very similar to
a git commit.

119
00:10:47,833 --> 00:10:53,723
The format of the metadata is pretty much
what you'll find in a git commit

120
00:10:53,723 --> 00:10:59,006
with some extensions that you don't
see here because this is from a git commit.

121
00:11:00,713 --> 00:11:09,620
So basically what we do is we take the
identifier of the directory

122
00:11:09,620 --> 00:11:16,200
that the revision points to, we take the
identifier of the parent of the revision

123
00:11:16,200 --> 00:11:18,722
so we can keep track of the history

124
00:11:18,722 --> 00:11:24,817
and then we add some metadata,
authorship and commitership information

125
00:11:24,817 --> 00:11:28,880
and the revision message and then we take
a hash of this,

126
00:11:28,880 --> 00:11:37,050
it makes an identifier that's probably
unique, very very probably unique.

127
00:11:40,257 --> 00:11:46,924
Using those identifiers, we can retrace
all the origins, all the history of

128
00:11:46,924 --> 00:11:51,747
development of the project and we can
deduplicate across all the archive.

129
00:11:52,493 --> 00:11:58,673
All the identifiers are intrinsic, which
means that we compute them

130
00:11:58,673 --> 00:12:03,917
from the contents of the things that
we are archiving, which means that

131
00:12:03,917 --> 00:12:11,436
we can deduplicate very efficiently
across all the data that we archive.

132
00:12:12,248 --> 00:12:14,283
How much data do we archive?

133
00:12:17,128 --> 00:12:18,224
A bit.

134
00:12:18,590 --> 00:12:23,828
So, we have passed the billion revision
mark a few weeks ago.

135
00:12:25,298 --> 00:12:29,966
This graph is a bit old, but anyway,
you have a live graph on our website.

136
00:12:31,468 --> 00:12:35,860
That's more than 4.5 billion unique
source code files.

137
00:12:38,261 --> 00:12:45,170
We don't actually discriminate between
what we would consider is source code

138
00:12:45,170 --> 00:12:48,181
and what upstream developers consider
as source code,

139
00:12:48,181 --> 00:12:52,327
so everything that's in a git repository,
we consider as source code

140
00:12:52,327 --> 00:12:54,878
if it's below a size threshold.

141
00:12:55,980 --> 00:13:00,242
A billion revisions across 80 million
projects.

142
00:13:01,389 --> 00:13:02,930
What do we archive?

143
00:13:02,930 --> 00:13:04,718
We archive GitHub, we archive Debian.

144
00:13:06,677 --> 00:13:11,910
So, Debian we run the archival process
every day, every day we get the new packages

145
00:13:11,910 --> 00:13:13,740
that have been uploaded in the archive.

146
00:13:14,308 --> 00:13:21,453
GitHub, we try to keep up, we are currently
working on some performance improvements,

147
00:13:21,453 --> 00:13:25,324
some scalability improvements to make sure
that we can keep up

148
00:13:25,324 --> 00:13:27,478
with the development on GitHub.

149
00:13:29,227 --> 00:13:40,117
We have archived as a one-off thing the
former contents of Gitorious and Google Code

150
00:13:40,513 --> 00:13:46,727
which are two prominent code hosting
spaces that closed recently

151
00:13:47,743 --> 00:13:53,993
and we've been working on archiving
the contents of Bitbucket

152
00:13:53,993 --> 00:13:59,944
which is kind of a challenge because
the API is a bit buggy and

153
00:13:59,944 --> 00:14:03,401
Atlassian isn't too interested
in fixing it.

154
00:14:06,084 --> 00:14:16,651
In concrete storage terms, we have 175TB
of blobs, so the files take 175TB

155
00:14:16,651 --> 00:14:19,902
and a kind of big database, 6TB.

156
00:14:21,165 --> 00:14:28,315
The database only contains the graph of
the metadata for the archive

157
00:14:28,315 --> 00:14:34,697
which is basically a graph with 8 billion
nodes and 70 billion edges.

158
00:14:35,386 --> 00:14:37,460
And of course it's growing daily.

159
00:14:37,946 --> 00:14:42,823
We are pretty sure this is the richest public
source code archive that's available now

160
00:14:43,015 --> 00:14:44,763
and it keeps growing.

161
00:14:46,469 --> 00:14:48,987
So how do we actually…

162
00:14:49,475 --> 00:14:53,294
What kind of stack do we use to store
all this?

163
00:14:54,762 --> 00:14:56,555
We use Debian, of course.

164
00:14:57,685 --> 00:15:02,934
All our deployment recipes are in Puppet
in public repositories.

165
00:15:04,076 --> 00:15:07,731
We've started using Ceph
for the blob storage.

166
00:15:09,404 --> 00:15:14,441
We use PostgreSQL for the metadata storage
with some of the standard tools that

167
00:15:14,996 --> 00:15:18,172
live around PostgreSQL for backups
and replication.

168
00:15:20,041 --> 00:15:27,766
We use a standard Python stack for
scheduling of jobs

169
00:15:27,766 --> 00:15:35,362
and for web interface stuff, basically
psycopg2 for the low level stuff,

170
00:15:35,362 --> 00:15:38,173
Django for the web stuff

171
00:15:38,173 --> 00:15:44,353
and Celery for the scheduling of jobs.

172
00:15:45,481 --> 00:15:50,453
In house, we've written an ad hoc
object storage system which has

173
00:15:50,453 --> 00:15:53,351
a bunch of backends that you can use.

174
00:15:53,821 --> 00:16:03,052
Basically, we are agnostic between a UNIX
filesystem, Azure, Ceph, or tons of…

175
00:16:03,418 --> 00:16:07,118
It's a really simple object storage system
where you can just put an object,

176
00:16:07,118 --> 00:16:10,365
get an object, put a bunch of objects,
get a bunch of objects.

177
00:16:11,949 --> 00:16:17,517
We've implemented removal but we don't
really use it yet.

178
00:16:20,196 --> 00:16:24,955
All the data model implementation,
all the listers, the loaders, the schedulers

179
00:16:24,955 --> 00:16:29,180
everything has been written by us,
it's a pile of Python code.

180
00:16:31,860 --> 00:16:35,806
So, basically 20 Python packages and
around 30 Puppet modules

181
00:16:35,806 --> 00:16:41,696
to deploy all that and we've done everything
under a copyleft license,

182
00:16:41,696 --> 00:16:46,078
GPLv3 for the backend and AGPLv3
for the frontend.

183
00:16:47,064 --> 00:16:56,894
Even if people try and make their own
Software Heritage using our code,

184
00:16:56,894 --> 00:16:59,660
they have to publish their changes.

185
00:17:01,858 --> 00:17:10,757
Hardware-wise, we run for now everything
on a few hypervisors in house and

186
00:17:10,757 --> 00:17:18,568
our main storage is currently still
on a very high density, very slow,

187
00:17:18,568 --> 00:17:27,953
very bulky storage array, but we've
started to migrate all this thing

188
00:17:27,953 --> 00:17:33,002
into a Ceph storage cluster which
we're gonna grow as we need

189
00:17:33,002 --> 00:17:35,073
in the next few months.

190
00:17:36,249 --> 00:17:43,680
We've also been granted by Microsoft
sponsorship, ??? sponsorship

191
00:17:44,077 --> 00:17:45,834
for their cloud services.

192
00:17:46,445 --> 00:17:51,763
We've started putting mirrors of everything
in their infrastructure as well

193
00:17:51,763 --> 00:17:59,568
which means full object storage mirror,
so 170TB of stuff mirrored on Azure

194
00:17:59,568 --> 00:18:02,492
as well as a database mirror for the graph.

195
00:18:03,796 --> 00:18:08,958
And we're also doing all the content
indexing and all the things that need

196
00:18:08,958 --> 00:18:11,962
scalability on Azure now.

197
00:18:16,637 --> 00:18:22,413
Finally, at the University of Bologna,
we have a backend storage for the download

198
00:18:22,413 --> 00:18:29,412
so currently our main storage is
quite slow so if you want to download

199
00:18:29,412 --> 00:18:34,859
a bundle of things that we've archived,
then we actually keep a cache of

200
00:18:34,859 --> 00:18:40,347
what we've done so that it doesn't take
a million years to download stuff.

201
00:18:41,809 --> 00:18:46,233
We do our development in a classic free
and open source software way,

202
00:18:46,233 --> 00:18:52,056
so we talk on our mailing list, on IRC,
on a forge.

203
00:18:52,503 --> 00:18:56,635
Everything is in English, everything is
public, there is more information

204
00:18:56,635 --> 00:19:00,749
on our website if you want to actually
have a look and see what we do.

205
00:19:04,278 --> 00:19:09,598
So, all that is very interesting but how
do we actually look into it?

206
00:19:11,670 --> 00:19:16,051
One of the ways that you can browse,
that you can use the archive

207
00:19:16,051 --> 00:19:18,619
is using a REST API.

208
00:19:19,189 --> 00:19:25,244
Basically, this API allows you to do
pointwise browsing of the archive

209
00:19:25,244 --> 00:19:29,026
so you can go and follow the links
in a graph,

210
00:19:29,026 --> 00:19:37,759
which is very slow but gives you pretty
much full access to the data.

211
00:19:38,450 --> 00:19:44,779
There's an index for the API that you can
look at, but that's not really convenient,

212
00:19:44,779 --> 00:19:47,788
so we also have a web user interface.

213
00:19:48,828 --> 00:19:55,774
It's in preview right now, we're gonna do
a full launch in the month of June.

214
00:19:57,768 --> 00:20:01,103
If you go to
https://archive.softwareheritage.org/browse/

215
00:20:01,591 --> 00:20:09,550
with the given credentials, you can
have a look and see what's going on.

216
00:20:10,166 --> 00:20:18,547
Basically, we have a web interface that
allows you to look at

217
00:20:18,547 --> 00:20:26,071
what origins we have downloaded, when
we have downloaded the origins

218
00:20:26,071 --> 00:20:34,931
with a kind of graph view of how often
we visited the origins

219
00:20:34,931 --> 00:20:37,936
and a calendar view of when we have
visited the origins.

220
00:20:38,790 --> 00:20:43,749
And then, inside the visits, you can
actually browse the contents

221
00:20:43,749 --> 00:20:45,048
that we've archived.

222
00:20:45,293 --> 00:20:49,884
So, for instance, this is the Python
repository as of May 2017

223
00:20:49,884 --> 00:20:54,960
and you can have the list of files,
then drill down,

224
00:20:54,960 --> 00:20:58,164
it should be pretty intuitive.

225
00:20:59,160 --> 00:21:02,586
If you look at the history of a project,
you can see the differences

226
00:21:02,586 --> 00:21:04,696
between two revisions of a project.

227
00:21:06,891 --> 00:21:12,261
Oh no, that's the syntax highlighting,
but anyway the diffs arrive right after.

228
00:21:13,641 --> 00:21:16,327
So, yeah, pretty cool stuff.

229
00:21:16,898 --> 00:21:21,535
I should be able to do a demo as well,
it should work.

230
00:21:31,112 --> 00:21:32,429
I'm gonna zoom in.

231
00:21:44,795 --> 00:21:49,474
So this is the main archive, you can see
some statistics about the objects

232
00:21:49,474 --> 00:21:50,933
that we've downloaded.

233
00:21:51,137 --> 00:21:56,557
When you zoom in, you get some kind of
overflows, because…

234
00:21:56,915 --> 00:21:58,867
Yeah, why would you do that.

235
00:21:59,235 --> 00:22:04,076
If you want to browse, we can try to find
an origin.

236
00:22:07,407 --> 00:22:08,832
"glibc".

237
00:22:12,729 --> 00:22:17,036
So there's lots and lots of, like, random
GitHub forks of things…

238
00:22:18,584 --> 00:22:25,784
We don't discriminate and we don't really
filter what we download.

239
00:22:26,555 --> 00:22:34,393
We are looking into doing some relevance
kind of sorting of the results, here.

240
00:22:36,434 --> 00:22:37,694
Next.

241
00:22:40,376 --> 00:22:42,083
Xilinx, why not.

242
00:22:43,220 --> 00:22:48,750
So, this has been downloaded for the last
time on August 3rd 2016,

243
00:22:48,750 --> 00:22:50,402
so it's probably a dead repository,

244
00:22:52,717 --> 00:22:54,995
but yeah, you can see a bunch of source
code,

245
00:22:56,671 --> 00:23:00,536
you can read the README of the glibc.

246
00:23:04,441 --> 00:23:07,650
If we go back to a more interesting origin

247
00:23:07,650 --> 00:23:09,643
here's the repository for git.

248
00:23:10,577 --> 00:23:17,153
I've deliberately selected an old visit
of the repo so that we can see

249
00:23:17,153 --> 00:23:18,861
what was going on then.

250
00:23:22,759 --> 00:23:31,456
If I look at the calendar view, you can see
that we've had some issues actually

251
00:23:31,456 --> 00:23:33,410
updating this, but anyway.

252
00:23:37,835 --> 00:23:46,085
If I look at the last visit, then we can
actually browse the contents,

253
00:23:46,735 --> 00:23:49,336
you can get syntax highlighting as well.

254
00:23:49,904 --> 00:23:53,722
This is a big big file with lots of comments

255
00:24:02,094 --> 00:24:04,971
Let's see the actual source code…

256
00:24:07,036 --> 00:24:10,168
Anyway, so, that's the browsing interface.

257
00:24:10,452 --> 00:24:15,126
We can also now get back what we've
archived and download it,

258
00:24:15,126 --> 00:24:18,705
which is kind of something that you might
want to do

259
00:24:18,705 --> 00:24:23,526
if a repository is lost, you can actually
download it

260
00:24:23,526 --> 00:24:25,564
and get the source code back again.

261
00:24:26,944 --> 00:24:28,455
How do we do that?

262
00:24:28,733 --> 00:24:35,478
If you go on the top right of this browsing
interface, you have actions and download

263
00:24:35,478 --> 00:24:40,275
and you can download the directory that
you are currently looking at.

264
00:24:41,294 --> 00:24:46,010
It's an asynchronous process, which means
that if there is a lot of load,

265
00:24:46,010 --> 00:24:51,458
then it's gonna take some time to get
actually, to be able to download the content.

266
00:24:51,947 --> 00:24:56,298
So you can put in your email address so we
can notify you when the download is ready.

267
00:24:56,989 --> 00:25:03,338
I'm gonna try my luck and say just "ok"
and it's gonna appear at some point

268
00:25:03,338 --> 00:25:07,609
in the list of things that I've requested.

269
00:25:11,016 --> 00:25:20,173
I've already requested some things that
we can actually get and open as a tarball.

270
00:25:31,456 --> 00:25:34,758
Yeah, I think that's the thing that I was
actually looking at,

271
00:25:35,301 --> 00:25:38,439
which is this revision of the git
source code

272
00:25:39,654 --> 00:25:42,252
and then I can open it

273
00:25:43,643 --> 00:25:46,572
Yay, Emacs, that's what you want.

274
00:25:46,932 --> 00:25:48,314
Yay, source code.

275
00:25:51,161 --> 00:25:53,562
This seems to work.

276
00:25:57,915 --> 00:26:02,674
And then, of course, if you want to
actually script what you're doing,

277
00:26:02,674 --> 00:26:07,141
there's an API that allows you to do
the downloads as well, so you can.

278
00:26:10,918 --> 00:26:18,392
The source code is deduplicated a lot,
which means that for one single repository

279
00:26:18,392 --> 00:26:24,200
you get tons of files that we have to
collect if you want to actually download

280
00:26:24,200 --> 00:26:26,227
an archive of a directory.

281
00:26:29,614 --> 00:26:37,704
It takes a while but we have an asynchronous
API so you can POST

282
00:26:37,704 --> 00:26:43,560
the identifier of a revision to this URL
and then get status updates

283
00:26:43,560 --> 00:26:49,493
and at some point, it will tell you that
the… here

284
00:26:49,846 --> 00:26:52,700
The status will tell you that the object
is available.

285
00:26:52,984 --> 00:26:59,134
You can download it and you can even
download the full history of a project

286
00:26:59,134 --> 00:27:03,565
and get that as a git-fast-export archive
that you can reimport into

287
00:27:03,565 --> 00:27:05,836
a new git repository.

288
00:27:06,236 --> 00:27:13,182
So any kind of VCS that we've imported,
you can export as a git repository

289
00:27:13,182 --> 00:27:17,733
and reimport on your machine.

290
00:27:19,241 --> 00:27:22,846
How to get involved in the project?

291
00:27:24,029 --> 00:27:29,030
We have a lot of features that we're
interested in, lots of them are now

292
00:27:29,030 --> 00:27:31,387
in early access or have been done.

293
00:27:31,876 --> 00:27:35,620
There's some stuff that we would like
help with.

294
00:27:38,226 --> 00:27:40,259
This is some stuff that we're working on:

295
00:27:40,546 --> 00:27:42,946
provenance information, you have a content

296
00:27:43,066 --> 00:27:45,420
you want to know which repository
it comes from,

297
00:27:45,868 --> 00:27:47,577
that's something we're working on.

298
00:27:48,314 --> 00:27:55,215
Full text search, the end goal is to be
able even to trace

299
00:27:55,215 --> 00:28:00,503
the source of snippets of code that have
been copied from one project to another.

300
00:28:01,321 --> 00:28:05,831
That's something that we can look into
with the wealth of information that

301
00:28:05,831 --> 00:28:07,623
we have inside the archive.

302
00:28:08,641 --> 00:28:10,672
There's a lot of things that,

303
00:28:10,672 --> 00:28:11,729
I mean…

304
00:28:12,135 --> 00:28:14,731
There's a lot of things that people want
to do with the archive.

305
00:28:15,352 --> 00:28:19,586
Our goal is to enable people to do things,
to do interesting things

306
00:28:19,586 --> 00:28:21,900
with a lot of source code.

307
00:28:23,530 --> 00:28:27,353
If you have an idea of what you want to do
with such an archive,

308
00:28:27,353 --> 00:28:29,835
please come talk to us

309
00:28:29,835 --> 00:28:34,941
and we'll be happy to help you help us.

310
00:28:37,552 --> 00:28:43,572
What we want to do is to diversify
the sources of things that we archive.

311
00:28:44,466 --> 00:28:51,289
Currently, we have good support for git,
we have OK support for Subversion

312
00:28:51,289 --> 00:28:52,708
and Mercurial.

313
00:28:54,373 --> 00:28:59,219
If your project of choice is in another
version control system,

314
00:28:59,219 --> 00:29:01,093
we are gonna miss it.

315
00:29:01,663 --> 00:29:06,295
So people can contribute in this area.

316
00:29:10,117 --> 00:29:18,205
For the listing part, we have coverage of
Debian, we have coverage of GitHub,

317
00:29:18,205 --> 00:29:26,418
if your code is somewhere else, we won't
see it, so we need people to contribute

318
00:29:26,418 --> 00:29:29,586
stuff that can list for instance GitLab
instances,

319
00:29:31,899 --> 00:29:36,410
and then we can integrate that in our
infrastructure and actually have

320
00:29:36,929 --> 00:29:41,436
people be able to archive their GitLab
instances.

321
00:29:42,045 --> 00:29:48,784
And of course, we need to spread
the word, make the project sustainable.

322
00:29:49,118 --> 00:30:00,590
We have a few sponsors now, Microsoft,
Nokia, Huawei, GitHub has joined as a sponsor.

323
00:30:01,811 --> 00:30:06,365
The University of Bologna, of course Inria
is sponsoring.

324
00:30:06,853 --> 00:30:11,971
But we need to keep spreading the word
and keep the project sustainable.

325
00:30:13,026 --> 00:30:17,501
And, of course, we need to save endangered
source code.

326
00:30:17,828 --> 00:30:22,580
For that, we have a suggestion box on
the wiki that you can add things to.

327
00:30:24,208 --> 00:30:29,563
For instance, we have in the back of
our minds archiving SourceForge,

328
00:30:29,563 --> 00:30:35,933
because we know that this isn't very
sustainable and that's at risk of being

329
00:30:35,933 --> 00:30:38,733
taken down at some point.

330
00:30:41,696 --> 00:30:47,830
If you want to join us, we also have
some job openings that are available.

331
00:30:48,602 --> 00:30:55,646
For now it's in Paris, so if you want to
consider coming to work with us in Paris,

332
00:30:55,646 --> 00:30:58,086
you can look into that.

333
00:31:00,646 --> 00:31:02,684
That's Software Heritage.

334
00:31:02,684 --> 00:31:05,123
We are building a reference archive of
all the free software

335
00:31:05,123 --> 00:31:06,836
that's ever been written

336
00:31:07,080 --> 00:31:10,982
in an international, open, non-profit and
mutualised infrastructure

337
00:31:11,877 --> 00:31:17,933
that we have opened up to everyone,
all users, vendors, developers can use it.

338
00:31:20,126 --> 00:31:25,658
The idea is to be at the service of
the community and for society

339
00:31:25,658 --> 00:31:27,805
as a whole.

340
00:31:28,136 --> 00:31:32,855
So if you want to join us, you can look at
our website, you can look at our code.

341
00:31:34,604 --> 00:31:38,145
You can also talk to me, so if you have
any questions,

342
00:31:38,145 --> 00:31:42,125
I think we have 10, 12 minutes for questions.

343
00:31:46,228 --> 00:31:51,512
[Applause]

344
00:31:51,754 --> 00:31:52,933
Do you have questions?

345
00:31:57,207 --> 00:32:00,627
[Q] How do you protect the archive
against stuff that you don't want to

346
00:32:00,627 --> 00:32:01,887
have in the archive?

347
00:32:02,170 --> 00:32:06,882
I'm thinking of stuff that is
copyright-protected and that GitHub will also

348
00:32:06,882 --> 00:32:09,325
delete after a while.

349
00:32:09,730 --> 00:32:15,583
Worse, what if I misused the archive
as my private backup

350
00:32:15,583 --> 00:32:19,601
and stored encrypted blocks on GitHub,
and you would eventually back them up

351
00:32:19,601 --> 00:32:20,779
for me?

352
00:32:24,562 --> 00:32:26,711
[A] There are, I think, two sides to the
question.

353
00:32:27,077 --> 00:32:28,502
The first side is:

354
00:32:28,502 --> 00:32:33,543
Do we really archive only stuff that is
free software and

355
00:32:33,543 --> 00:32:40,901
that we can redistribute and how do we
manage, for instance,

356
00:32:40,901 --> 00:32:42,856
copyright takedown stuff?

357
00:32:46,108 --> 00:32:51,874
Currently, most of the infrastructure
of the project is under French law.

358
00:32:52,975 --> 00:33:00,047
There's a defined process to do
copyright takedown in the French legal system.

359
00:33:02,365 --> 00:33:08,828
We would be really annoyed to have to
take down content from the archive.

360
00:33:12,486 --> 00:33:19,846
What we do, however, is to mirror public
information that is publicly available.

361
00:33:21,192 --> 00:33:26,716
Of course I'm not a lawyer for the project,
so I can't really…

362
00:33:29,605 --> 00:33:33,181
I'm not 100% sure of what I'm about to say
but

363
00:33:33,181 --> 00:33:38,920
what I know is that in the current state
of French legislation,

364
00:33:39,531 --> 00:33:42,903
if the source of the data is still available

365
00:33:42,903 --> 00:33:46,643
so for instance if the data is still on
GitHub, then you need to have

366
00:33:46,643 --> 00:33:49,901
GitHub take it down before we have to
take it down.

367
00:33:56,681 --> 00:34:01,881
We're not currently filtering content for
misuse of the archive,

368
00:34:01,881 --> 00:34:06,361
so the only thing that we do is put
a limit on the size of the files

369
00:34:06,361 --> 00:34:08,435
that are archived in Software Heritage.

370
00:34:09,536 --> 00:34:12,014
The limit is pretty high, like 100 MB.

371
00:34:15,102 --> 00:34:21,440
We can't really decide ourselves

372
00:34:21,440 --> 00:34:24,084
what is source code,
what is not source code

373
00:34:24,084 --> 00:34:30,669
because for instance if your project is
a cryptography library,

374
00:34:30,669 --> 00:34:34,397
you might want to have some encrypted
blocks of data that are stored

375
00:34:34,397 --> 00:34:38,465
in your source code repository as
test fixtures.

376
00:34:39,034 --> 00:34:44,033
And then, you need them to build the code
and to make sure that it works.

377
00:34:44,682 --> 00:34:48,998
So, how would that be any different from
your encrypted backup on GitHub?

378
00:34:49,139 --> 00:34:55,641
How could we, Software Heritage,
distinguish between proper use and misuse

379
00:34:55,641 --> 00:34:58,806
of the resources?

380
00:35:00,349 --> 00:35:05,100
I guess our long term goal is to not have
to care about misuse because

381
00:35:05,100 --> 00:35:07,175
it's gonna be a drop in the ocean.

382
00:35:08,638 --> 00:35:10,916
We're gonna have so much…

383
00:35:11,893 --> 00:35:15,303
We want to have enough space and
enough resources

384
00:35:15,303 --> 00:35:20,021
that we don't really need to ask ourselves
this question, basically.

385
00:35:21,480 --> 00:35:22,413
Thanks.

386
00:35:26,355 --> 00:35:27,653
Other questions?

387
00:35:34,113 --> 00:35:39,359
[Q] Have you looked at some form of
authentication to provide additional

388
00:35:39,359 --> 00:35:46,346
assurance that the archived source code
hasn't been modified or tampered with

389
00:35:46,346 --> 00:35:47,893
in some form?

390
00:35:50,977 --> 00:35:55,971
[A] First of all, all the identifiers for
the objects that are inside the archive

391
00:35:55,971 --> 00:36:00,639
are cryptographic hashes of the contents
that we've archived.

392
00:36:01,612 --> 00:36:06,937
So, for files, for instance, we take
the SHA-1, the SHA-256,

393
00:36:06,937 --> 00:36:16,077
one of the BLAKE hashes and the
Git-modified SHA-1 of the file,

394
00:36:16,646 --> 00:36:19,658
and we use that in the manifest for
the directories.

395
00:36:19,902 --> 00:36:25,787
So the directory identifiers
are a hash of the manifest

396
00:36:25,787 --> 00:36:30,093
of the list of files that are inside
the directory, etc.

397
00:36:30,543 --> 00:36:39,286
So, recursively, you can make sure that
the data that we give back to you

398
00:36:39,286 --> 00:36:47,779
has not been altered, at least
by a bit flip or anything.

399
00:36:48,954 --> 00:36:53,386
We regularly run a scrub of the data
that we have in the archive,

400
00:36:53,386 --> 00:36:57,252
so we make sure that there's no rot
inside our archive.

401
00:36:58,960 --> 00:37:05,063
We've not looked into, basically,
attestation of…

402
00:37:08,761 --> 00:37:13,884
for instance, making sure that the code
that we've downloaded…

403
00:37:20,878 --> 00:37:26,448
I mean, we're not doing anything more
than taking a picture of the data

404
00:37:26,448 --> 00:37:34,092
and we say "We've computed this hash.
Maybe the code that's been presented

405
00:37:34,092 --> 00:37:38,839
by GitHub to Software Heritage is different
from what you've uploaded to GitHub,

406
00:37:38,839 --> 00:37:40,312
we can't tell."

407
00:37:43,967 --> 00:37:48,925
In the case of Git, you can always use
the identifiers of the objects

408
00:37:48,925 --> 00:37:51,858
that you've pushed so you have
the commit hash,

409
00:37:51,858 --> 00:37:56,777
which is itself a cryptographic identifier
of the contents of the commit.

410
00:37:59,419 --> 00:38:02,182
In turn, if the commit is signed, then
the signature is still stored

411
00:38:02,182 --> 00:38:10,800
in the Software Heritage metadata and
you can reproduce the original git object

412
00:38:10,800 --> 00:38:15,356
and check the signature, but we've not
done anything specific for Software Heritage

413
00:38:15,356 --> 00:38:17,185
in this area.

414
00:38:17,536 --> 00:38:19,643
Does that answer your question?

415
00:38:19,975 --> 00:38:20,302
Cool.

416
00:38:24,886 --> 00:38:25,748
Other questions?

417
00:38:27,457 --> 00:38:28,798
There's one in front.

418
00:38:31,400 --> 00:38:33,558
[Q] It's partially a question, partially
a comment.

419
00:38:33,884 --> 00:38:39,776
Your initial idea was to have a telescope,
or something like this for source code.

420
00:38:40,223 --> 00:38:43,427
For now, for me, it looks a little bit
more like a microscope,

421
00:38:43,427 --> 00:38:46,507
so you can focus on one thing, but that's
not much.

422
00:38:46,758 --> 00:38:51,023
So have you thought about how to
analyze an entire ecosystem

423
00:38:51,023 --> 00:38:52,203
or something like this?

424
00:38:52,203 --> 00:38:56,513
For example, now we have Django 2, which is
Python 3 only, so it would be interesting to

425
00:38:56,513 --> 00:39:00,899
look at all Django modules to see when
they start moving to this Django.

426
00:39:01,266 --> 00:39:06,619
So we would need to start analyzing
thousands or millions of files, but then

427
00:39:06,619 --> 00:39:10,840
we would need something SQL-like, or some
MapReduce jobs

428
00:39:11,050 --> 00:39:12,430
or something like this for this.

429
00:39:12,957 --> 00:39:13,524
[A] Yes.

430
00:39:13,889 --> 00:39:15,073
So, we've started…

431
00:39:16,414 --> 00:39:21,620
The two initiators of the project, Roberto
Di Cosmo and Stefano Zacchiroli

432
00:39:21,812 --> 00:39:26,566
are both researchers in computer science
so they have a strong background in

433
00:39:26,566 --> 00:39:34,654
actually mining software repositories and
doing some large scale analysis

434
00:39:34,654 --> 00:39:36,234
on source code.

435
00:39:38,153 --> 00:39:44,818
We've been talking with research groups
whose main goal is to do analysis on

436
00:39:44,818 --> 00:39:48,436
large scale source code archives.

437
00:39:50,430 --> 00:39:57,592
One of the first mirrors outside of our
control of the archive

438
00:39:57,592 --> 00:39:59,016
will be in Grenoble (France).

439
00:39:59,385 --> 00:40:05,852
There are a few teams that work on
actually doing large-scale research

440
00:40:05,852 --> 00:40:08,697
on source code over there,

441
00:40:08,697 --> 00:40:11,339
so that's what the mirror will be
used for.

442
00:40:13,411 --> 00:40:17,235
We've also been looking at what
the Google open source team does.

443
00:40:18,212 --> 00:40:22,997
They have this big repository with all
the code that Google uses

444
00:40:22,997 --> 00:40:28,944
and they've started to push back,
like do large scale analysis of

445
00:40:28,944 --> 00:40:37,581
security vulnerabilities, issues with
static and dynamic analysis

446
00:40:37,581 --> 00:40:41,938
of the code and they've started pushing
their fixes upstream.

447
00:40:42,589 --> 00:40:47,135
That's something that we want to enable
users to do,

448
00:40:47,135 --> 00:40:50,631
that's not something that we want to do
ourselves, but we want to make sure

449
00:40:50,631 --> 00:40:53,482
that people can do it using our archive.

450
00:40:54,620 --> 00:40:58,767
So we'd be happy to work with people
who already do that so that

451
00:40:58,767 --> 00:41:04,534
they can use their knowledge and their
tools inside our archive.

452
00:41:06,606 --> 00:41:08,684
Does that answer your question?

453
00:41:09,658 --> 00:41:10,673
Cool.

454
00:41:14,982 --> 00:41:16,528
Any more questions?

455
00:41:19,411 --> 00:41:21,727
No? Then thank you very much, Nicolas.

456
00:41:21,930 --> 00:41:22,581
Thank you.

457
00:41:22,947 --> 00:41:25,957
[Applause]