English subtitles

Software Heritage - Preserving the Free Software Commons

Showing Revision 11 created 06/14/2018 by tvincent.

  1. Hi, thank you.
  2. I'm Nicolas Dandrimont and I will indeed
    be talking to you about
  3. Software Heritage.
  4. I'm a software engineer for this project.
  5. I've been working on it for 3 years now.
  6. And we'll see what this thing is all about.
  7. [Mic not working]
  8. I guess the batteries are out.
  9. So, let's try that again.
  10. So, we all know, we've been doing
    free software for a while,
  11. that software source code is something
  12. Why is that?
  13. As Harold Abelson has said in SICP, his
    textbook on programming,
  14. programs are meant to be read by people
    and then incidentally for machines to execute.
  15. Basically, what software source code
    provides us is a way inside
  16. the mind of the designer of the program.
  17. For instance, you can have,
    you can get inside very crazy algorithms
  18. that can do very fast reverse square roots
    for 3D, that kind of stuff
  19. Like in the Quake 2 source code.
  20. You can also get inside the algorithms
    that are underpinning the internet,
  21. for instance seeing the net queue
    algorithm in the Linux kernel.
  22. What we are building as the free software
    community is the free software commons.
  23. Basically, the commons is all the cultural
    and social and natural resources
  24. that we share and that everyone
    has access to.
  25. More specifically, the software commons
    is what we are building
  26. with software that is open and that is
    available for all to use, to modify,
  27. to execute, to distribute.
  28. We know that the software commons are a really
    critical part of our commons.
  29. Who's taking care of it?
  30. The software is fragile.
  31. Like all digital information, you can lose it.
  32. People can decide to shut down hosting
    spaces because of business decisions.
  33. People can hack into software hosting
    platforms and remove the code maliciously
  34. or just inadvertently.
  35. And, of course, for the obsolete stuff,
    there's rot.
  36. If you don't care about the data, then
    it rots and it decays and you lose it.
  37. So, where is the archive we go to
    when something is lost,
  38. when GitLab goes away, when GitHub
    goes away?
  39. Where do we go?
  40. Finally, there's one last thing that we
    noticed, it's that
  41. there's a lot of teams that work on
    research on software
  42. and there's no real big infrastructure
    for research on code.
  43. There's tons of critical issues around
    code: safety, security, verification, proofs.
  44. Nobody's doing this at a very large scale.
  45. If you want to see the stars, you go
    to the Atacama desert and
  46. you point a telescope at the sky.
  47. Where is the telescope for source code?
  48. That's what Software Heritage wants to be.
  49. What we do is we collect, we preserve
    and we share all the software
  50. that is publicly available.
  51. Why do we do that? We do that to
    preserve the past, to enhance the present
  52. and to prepare for the future.
  53. What we're building is a base infrastructure
    that can be used
  54. for cultural heritage, for industry,
    for research and for education purposes.
  55. How do we do it? We do it with an open
  56. Every single line of code that we write
    is free software.
  57. We do it transparently, everything that
    we do, we do it in the open,
  58. be that on a mailing list or on
    our issue tracker.
  59. And we strive to do it for the very long
    haul, so we do it with replication in mind
  60. so that no single entity has full control
    over the data that we collect.
  61. And we do it in a non-profit fashion
    so that we avoid
  62. business-driven decisions impacting
    the project.
  63. So, what do we do concretely?
  64. We do archiving of version control systems.
  65. What does that mean?
  66. It means we archive file contents, so
    source code, files.
  67. We archive revisions, which means all the
    metadata of the history of the projects,
  68. we try to download it and we put it inside
    a common data model that is
  69. shared across all the archive.
  70. We archive releases of the software,
    releases that have been tagged
  71. in a version control system as well as
    releases that we can find as tarballs
  72. because sometimes… boof, views of
    this source code differ.
  73. Of course, we archive where and when
    we've seen the data that we've collected.
  74. All of this, we put inside a canonical,
    VCS-agnostic, data model.
  75. If you have a Debian package, with its
    history, if you have a git repository,
  76. if you have a subversion repository, if
    you have a mercurial repository,
  77. it all looks the same and you can work
    on it with the same tools.
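
A minimal sketch of what such a VCS-agnostic model could look like, using Python dataclasses. The class names, field names and the `history` helper are illustrative assumptions, not the project's actual schema; the point is only that once every origin is mapped to the same few object types, one tool can walk any history.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Content:
    """A file's bytes, addressed by a hash of those bytes."""
    id: str
    length: int


@dataclass
class Directory:
    """Maps entry names to the ids of contents or sub-directories."""
    id: str
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class Revision:
    """One point in a project's history (a commit, in git terms)."""
    id: str
    directory: str       # id of the root directory
    parents: List[str]   # ids of the parent revisions
    author: str
    committer: str
    message: str


@dataclass
class Release:
    """A named revision: a VCS tag or an imported tarball."""
    id: str
    name: str
    target: str          # id of the released revision


@dataclass
class Visit:
    """Where and when the data was seen."""
    origin_url: str
    origin_type: str     # hypothetical labels, e.g. "git", "svn", "hg", "deb"
    date: datetime
    heads: Dict[str, str] = field(default_factory=dict)  # branch -> revision id


def history(revisions: Dict[str, Revision], start: str) -> List[str]:
    """Walk parent links from `start`, whatever VCS the data came from."""
    seen, order, stack = set(), [], [start]
    while stack:
        rev_id = stack.pop()
        if rev_id in seen or rev_id not in revisions:
            continue
        seen.add(rev_id)
        order.append(rev_id)
        stack.extend(revisions[rev_id].parents)
    return order
```

The same `history` function would work on revisions loaded from a Debian package or from a Subversion repository, because nothing in it depends on the source VCS.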
  78. What we don't do is archive what's around
    the software, for instance
  79. the bug tracking systems or the homepages
    or the wikis or the mailing lists.
  80. There are some projects that work
    in this space, for instance
  81. the Internet Archive does a lot of
    very good work around archiving the web.
  82. Our goal is not to replace them, but to
    work with them and be able to do
  83. linking across all the archives that exist.
  84. We can, for instance for the mailing lists
    there's the Gmane project
  85. that does a lot of archiving of free
    software mailing lists.
  86. So our long term vision is to play a part
    in a semantic Wikipedia of software,
  87. a Wikidata of software where we can
    hyperlink all the archives that exist
  88. and do stuff in the area.
  89. Quick tour of our infrastructure.
  90. Basically, all the way to the right is
    our archive.
  91. Our archive consists of a huge graph
    of all the metadata about
  92. the files, the directories, the revisions,
    the commits and the releases and
  93. all the projects that are on top
    of the graph.
  94. We separate the file storage into another
    object storage because of
  95. the size discrepancy: we have lots and lots
    of file contents that we need to store
  96. so we do that outside of the database
    that is used to store the graph.
  97. Basically, what we archive is a set of
    software origins that are
  98. git repositories, mercurial repositories,
    etc. etc.
  99. All those origins are loaded on a
    regular schedule.
  100. If there is a very active software origin,
    we're gonna archive it more often
  101. than stale things that don't get
    a lot of updates.
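
The talk does not say how the visit frequency is chosen. One simple policy that has the described effect, sketched here purely as an assumption, is to halve the interval between visits when a visit finds new data and to double it when nothing changed, within fixed bounds. `MIN_INTERVAL`, `MAX_INTERVAL` and `next_interval` are hypothetical names.

```python
from datetime import timedelta

# Hypothetical bounds: never visit more than daily, never less than every 64 days.
MIN_INTERVAL = timedelta(days=1)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the delay before the next visit of an origin.

    Active origins (the last visit found changes) are visited twice as
    often; stale origins are visited half as often.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

With this rule, an origin that changes on every visit converges to the minimum interval, and an abandoned one drifts to the maximum.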
  102. What we do to get the list of software
    origins that we archive.
  103. We have a bunch of listers that can
    scroll through the list of repositories,
  104. for instance on GitHub or other
    hosting platforms.
  105. We have code that can read Debian archive
    metadata to make a list of the packages
  106. that are inside this archive and can be
    archived, etc.
  107. All of this is done on a regular basis.
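
As a rough illustration of what a lister does, the sketch below pages through a repository listing until the platform reports no further page. The `fetch_page` callable, the cursor convention and the `fake_fetch_page` stand-in are assumptions made for the example; a real lister would call the hosting platform's own listing endpoint instead.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (repository URLs, cursor of the next page or None when done).
Page = Tuple[List[str], Optional[int]]


def list_origins(fetch_page: Callable[[int], Page], start: int = 0) -> Iterator[str]:
    """Yield every repository URL by following pagination cursors."""
    cursor: Optional[int] = start
    while cursor is not None:
        urls, cursor = fetch_page(cursor)
        yield from urls


def fake_fetch_page(cursor: int) -> Page:
    """Stand-in for a platform's paginated listing (two entries per page)."""
    repos = [f"https://example.org/repo{i}.git" for i in range(5)]
    page = repos[cursor:cursor + 2]
    next_cursor = cursor + 2 if cursor + 2 < len(repos) else None
    return page, next_cursor


if __name__ == "__main__":
    for url in list_origins(fake_fetch_page):
        print(url)
```

Keeping the fetch function separate from the loop means the same lister logic can be reused for any platform that exposes a cursor-based listing.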
  108. We are currently working on some kind
    of push mechanism so that
  109. people or other systems can notify us
    of updates.
  110. Our goal is not to do real time archiving,
    we're really in it for the long run
  111. but we still want to be able to prioritize
    stuff that people tell us is
  112. important to archive.
  113. The Internet Archive has a "save now"
    button and we want to implement
  114. something along those lines as well,
  115. so if we know that some software project
    is in danger for a reason or another,
  116. then we can prioritize archiving it.
  117. So this is the basic structure of a revision
    in the Software Heritage archive.
  118. You'll see that it's very similar to
    a git commit.
  119. The format of the metadata is pretty much
    what you'll find in a git commit
  120. with some extensions that you don't
    see here because this is from a git commit
  121. So basically what we do is we take the
    identifier of the directory
  122. that the revision points to, we take the
    identifier of the parent of the revision
  123. so we can keep track of the history
  124. and then we add some metadata,
    authorship and committership information
  125. and the revision message and then we take
    a hash of this,
  126. it makes an identifier that's probably
    unique, very very probably unique.
  127. Using those identifiers, we can retrace
    all the origins, all the history of
  128. development of the project and we can
    deduplicate across all the archive.
  129. All the identifiers are intrinsic, which
    means that we compute them
  130. from the contents of the things that
    we are archiving, which means that
  131. we can deduplicate very efficiently
    across all the data that we archive.
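
The steps described for a revision can be sketched as follows. Since the talk says the format is pretty much that of a git commit, this sketch assumes git's commit object layout and SHA-1; it does not cover the extensions mentioned in the talk, and the function names are illustrative.

```python
import hashlib
from typing import List


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Hash a payload the way git does: '<type> <size>\\0' + payload, SHA-1."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def revision_id(directory: str, parents: List[str],
                author: str, committer: str, message: str) -> str:
    """Compute an intrinsic identifier for a revision.

    `directory` and `parents` are hex identifiers; `author` and
    `committer` are full lines such as 'Name <mail> 1500000000 +0200'.
    """
    lines = [f"tree {directory}"]
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    payload = "\n".join(lines) + "\n\n" + message
    return git_object_id("commit", payload.encode("utf-8"))
```

Because the identifier depends only on the content, two archives that import the same commit compute the same identifier and store it once, which is what makes deduplication cheap.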
  132. How much data do we archive?
  133. A bit.
  134. So, we have passed the billion revision
    mark a few weeks ago.
  135. This graph is a bit old, but anyway,
    you have a live graph on our website.
  136. That's more than 4.5 billion unique
    source code files.
  137. We don't actually discriminate between
    what we would consider is source code
  138. and what upstream developers consider
    as source code,
  139. so everything that's in a git repository,
    we consider as source code
  140. if it's below a size threshold.
  141. A billion revisions across 80 million
  142. What do we archive?
  143. We archive GitHub, we archive Debian.
  144. So, Debian we run the archival process
    every day, every day we get the new packages
  145. that have been uploaded in the archive.
  146. GitHub, we try to keep up, we are currently
    working on some performance improvements,
  147. some scalability improvements to make sure
    that we can keep up
  148. with the development on GitHub.
  149. We have archived as a one-off thing the
    former contents of Gitorious and Google Code
  150. which are two prominent code hosting
    spaces that closed recently
  151. and we've been working on archiving
    the contents of Bitbucket
  152. which is kind of a challenge because
    the API is a bit buggy and
  153. Atlassian isn't too interested
    in fixing it.
  154. In concrete storage terms, we have 175TB
    of blobs, so the files take 175TB
  155. and a kind of big database, 6TB.
  156. The database only contains the graph of
    the metadata for the archive
  157. which is basically an 8 billion node and
    70 billion edge graph.
  158. And of course it's growing daily.
  159. We are pretty sure this is the richest public
    source code archive that's available now
  160. and it keeps growing.
  161. So how do we actually…
  162. What kind of stack do we use to store
    all this?
  163. We use Debian, of course.
  164. All our deployment recipes are in Puppet
    in public repositories.
  165. We've started using Ceph
    for the blob storage.
  166. We use PostgreSQL for the metadata storage
    with some of the standard tools that
  167. live around PostgreSQL for backups
    and replication.
  168. We use a standard Python stack for
    scheduling of jobs
  169. and for web interface stuff, basically
    psycopg2 for the low level stuff,
  170. Django for the web stuff
  171. and Celery for the scheduling of jobs.
  172. In house, we've written an ad hoc
    object storage system which has
  173. a bunch of backends that you can use.
  174. Basically, we are agnostic between a UNIX
    filesystem, Azure, Ceph, or tons of…
  175. It's a really simple object storage system
    where you can just put an object,
  176. get an object, put a bunch of objects,
    get a bunch of objects.
  177. We've implemented removal but we don't
    really use it yet.
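
The interface described here is small enough to sketch. The class below is an in-memory stand-in, not the project's code: the class and method names are made up for the example, and keying objects by their SHA-256 is an assumption, since the talk does not say which key the object storage uses.

```python
import hashlib
from typing import Dict, Iterable, List


class InMemoryObjStorage:
    """Toy backend: put/get single objects or batches, keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store an object and return its identifier."""
        obj_id = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return the object stored under `obj_id` (KeyError if absent)."""
        return self._blobs[obj_id]

    def put_batch(self, blobs: Iterable[bytes]) -> List[str]:
        return [self.put(blob) for blob in blobs]

    def get_batch(self, obj_ids: Iterable[str]) -> List[bytes]:
        return [self.get(obj_id) for obj_id in obj_ids]

    def __len__(self) -> int:
        return len(self._blobs)
```

A filesystem or cloud backend would expose the same four methods, which is what lets the rest of the stack stay agnostic about where the bytes live.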
  178. All the data model implementation,
    all the listers, the loaders, the schedulers
  179. everything has been written by us,
    it's a pile of Python code.
  180. So, basically 20 Python packages and
    around 30 Puppet modules
  181. to deploy all that and we've done everything
    under a copyleft license,
  182. GPLv3 for the backend and AGPLv3
    for the frontend.
  183. Even if people try and make their own
    Software Heritage using our code,
  184. they have to publish their changes.
  185. Hardware-wise, we run for now everything
    on a few hypervisors in house and
  186. our main storage is currently still
    on a very high density, very slow,
  187. very bulky storage array, but we've
    started to migrate all this thing
  188. into a Ceph storage cluster which
    we're gonna grow as we need
  189. in the next few months.
  190. We've also been granted by Microsoft
    sponsorship, ??? sponsorship
  191. for their cloud services.
  192. We've started putting mirrors of everything
    in their infrastructure as well
  193. which means full object storage mirror,
    so 170TB of stuff mirrored on Azure
  194. as well as a database mirror for graph.
  195. And we're also doing all the content
    indexing and all the things that need
  196. scalability on Azure now.
  197. Finally, at the University of Bologna,
    we have a backend storage for the download
  198. so currently our main storage is
    quite slow so if you want to download
  199. a bundle of things that we've archived,
    then we actually keep a cache of
  200. what we've done so that it doesn't take
    a million years to download stuff.
  201. We do our development in a classic free
    and open source software way,
  202. so we talk on our mailing list, on IRC,
    on a forge.
  203. Everything is in English, everything is
    public, there is more information
  204. on our website if you want to actually
    have a look and see what we do.
  205. So, all that is very interesting but how
    do we actually look into it?
  206. One of the ways that you can browse,
    that you can use the archive
  207. is using a REST API.
  208. Basically, this API allows you to do
    pointwise browsing of the archive
  209. so you can go and follow the links
    in a graph,
  210. which is very slow but gives you a pretty
    much full access of the data.
  211. There's an index for the API that you can
    look at, but that's not really convenient,
  212. so we also have a web user interface.
  213. It's in preview right now, we're gonna do
    a full launch in the month of June.
  214. If you go to
  215. with the given credentials, you can
    have a look and see what's going on.
  216. Basically, we have a web interface that
    allows you to look at
  217. what origins we have downloaded, when
    we have downloaded the origins
  218. with a kind of graph view of how often
    we visited the origins
  219. and a calendar view of when we have
    visited the origins.
  220. And then, inside the visits, you can
    actually browse the contents
  221. that we've archived.
  222. So, for instance, this is the Python
    repository as of May 2017
  223. and you can have the list of files,
    then drill down,
  224. it should be pretty intuitive.
  225. If you look at the history of a project,
    you can see the differences
  226. between two revisions of a project.
  227. Oh no, that's the syntax highlighting,
    but anyway the diffs arrive right after.
  228. So, yeah, pretty cool stuff.
  229. I should be able to do a demo as well,
    it should work.
  230. I'm gonna zoom in.
  231. So this is the main archive, you can see
    some statistics about the objects
  232. that we've downloaded.
  233. When you zoom in, you get some kind of
    overflows, because…
  234. Yeah, why would you do that.
  235. If you want to browse, we can try to find
    an origin.
  236. "glibc".
  237. So there's lots and lots of, like, random
    GitHub forks of things…
  238. We don't discriminate and we don't really
    filter what we download.
  239. We are looking into doing some relevance
    kind of sorting of the results, here.
  240. Next.
  241. Xilinx, why not.
  242. So, this has been downloaded for the last
    time on August 3rd 2016,
  243. so it's probably a dead repository,
  244. but yeah, you can see a bunch of source
  245. you can read the README of the glibc.
  246. If we go back to a more interesting origin
  247. here's the repository for git.
  248. I've selected voluntarily an old visit
    of the repo so that we can see
  249. what was going on then.
  250. If I look at the calendar view, you can see
    that we've had some issues actually
  251. updating this, but anyway.
  252. If I look at the last visit, then we can
    actually browse the contents,
  253. you can get syntax highlighting as well.
  254. This is a big big file with lots of comments
  255. Let's see the actual source code…
  256. Anyway, so, that's the browsing interface.
  257. We can also now get back what we've
    archived and download it,
  258. which is kind of something that you might
    want to do
  259. if a repository is lost, you can actually
    download it
  260. and get the source code back again.
  261. How we do that.
  262. If you go on the top right of this browsing
    interface, you have actions and download
  263. and you can download the directory that
    you are currently looking at.
  264. It's an asynchronous process, which means
    that if there is a lot of load,
  265. then it's gonna take some time to get
    actually, to be able to download the content
  266. So you can put in your email address so we
    can notify you when the download is ready.
  267. I'm gonna try my luck and say just "ok"
    and it's gonna appear at some point
  268. in the list of things that I've requested.
  269. I've already requested some things that
    we can actually get and open as a tarball.
  270. Yeah, I think that's the thing that I was
    actually looking at,
  271. which is this revision of the git
    source code
  272. and then I can open it
  273. Yay, emacs, that's what you want.
  274. Yay, source code.
  275. This seems to work.
  276. And then, of course, if you want to
    actually script what you're doing,
  277. there's an API that allows you to do
    the downloads as well, so you can.
  278. The source code is deduplicated a lot,
    which means that for one single repository
  279. you get tons of files that we have to
    collect if you want to actually download
  280. an archive of a directory.
  281. It takes a while but we have an asynchronous
    API so you can POST
  282. the identifier of a revision to this URL
    and then get status updates
  283. and at some point, it will tell you that
    the… here
  284. The status will tell you that the object
    is available.
  285. You can download it and you can even
    download the full history of a project
  286. and get that as a git-fast-export archive
    that you can reimport into
  287. a new git repository.
  288. So any kind of VCS that we've imported,
    you can export as a git repository
  289. and reimport on your machine.
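
The request-then-poll pattern described above can be sketched without tying it to any particular endpoint. In this sketch the status values ("done", "failed"), the `fetch_url` key and the `get_status` callable are all hypothetical; a real client would wrap the HTTP calls to the archive's API in `get_status`.

```python
import time
from typing import Callable, Dict


def wait_for_bundle(get_status: Callable[[], Dict[str, str]],
                    poll_seconds: float = 5.0,
                    max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll an asynchronous job until its result can be downloaded.

    `get_status` returns the latest status reply as a dict. Returns the
    download URL once the job is done, raises if it fails or never finishes.
    """
    for _ in range(max_polls):
        reply = get_status()
        if reply["status"] == "done":
            return reply["fetch_url"]
        if reply["status"] == "failed":
            raise RuntimeError("bundle preparation failed")
        sleep(poll_seconds)
    raise TimeoutError("bundle was not ready after the allowed number of polls")
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests, where waiting several seconds between polls is not wanted.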
  290. How to get involved in the project?
  291. We have a lot of features that we're
    interested in, lots of them are now
  292. in early access or have been done.
  293. There's some stuff that we would like
    help with.
  294. This is some stuff that we're working on:
  295. provenance information, you have a content
  296. you want to know which repository
    it comes from,
  297. that's something we're working on.
  298. Full text search, the end goal is to be
    able even to trace
  299. source of snippets of code that have
    been copied from one project to another.
  300. That's something that we can look into
    with the wealth of information that
  301. we have inside the archive.
  302. There's a lot of things that,
  303. I mean…
  304. There's a lot of things that people want
    to do with the archive.
  305. Our goal is to enable people to do things,
    to do interesting things
  306. with a lot of source code.
  307. If you have an idea of what you want to do
    with such an archive,
  308. please come talk to us
  309. and we'll be happy to help you help us.
  310. What we want to do is to diversify
    the sources of things that we archive.
  311. Currently, we have good support for git,
    we have OK support for subversion
  312. and mercurial.
  313. If your project of choice is in another
    version control system,
  314. we are gonna miss it.
  315. So people can contribute in this area.
  316. For the listing part, we have coverage of
    Debian, we have coverage of GitHub,
  317. if your code is somewhere else, we won't
    see it, so we need people to contribute
  318. stuff that can list for instance GitLab
  319. and then we can integrate that in our
    infrastructure and actually have
  320. people be able to archive their GitLab
  321. And of course, we need to spread
    the word, make the project sustainable.
  322. We have a few sponsors now, Microsoft,
    Nokia, Huawei, GitHub has joined as a sponsor
  323. The University of Bologna, of course Inria
    is sponsoring.
  324. But we need to keep spreading the word
    and keep the project sustainable.
  325. And, of course, we need to save endangered
    source code.
  326. For that, we have a suggestion box on
    the wiki that you can add things to.
  327. For instance, we have in the back of
    our minds archiving SourceForge,
  328. because we know that this isn't very
    sustainable and that it's at risk of being
  329. taken down at some point.
  330. If you want to join us, we also have
    some job openings that are available.
  331. For now it's in Paris, so if you want to
    consider coming to work with us in Paris,
  332. you can look into that.
  333. That's Software Heritage.
  334. We are building a reference archive of
    all the free software
  335. that's ever been written
  336. in an international, open, non-profit and
    mutualised infrastructure
  337. that we have opened up to everyone,
    all users, vendors, developers can use it.
  338. The idea is to be at the service of
    the community and for society
  339. as a whole.
  340. So if you want to join us, you can look at
    our website, you can look at our code.
  341. You can also talk to me, so if you have
    any questions,
  342. I think we have 10, 12 minutes for questions.
  343. [Applause]
  344. Do you have questions?
  345. [Q] How do you protect the archive
    against stuff that you don't want to
  346. have in the archive.
  347. I think of stuff that is copyright-protected
    and that GitHub will also
  348. delete after a while.
  349. Worse, if I would misuse the archive
    as my private backup
  350. and store encrypted blocks on GitHub
    and you will eventually back them up
  351. for me.
  352. [A] There's, I think, two sides of the
  353. The first side is
  354. Do we really archive only stuff that is
    free software and
  355. that we can redistribute and how do we
    manage, for instance,
  356. copyright takedown stuff.
  357. Currently, most of the infrastructure
    of the project is under French law.
  358. There's a defined process to do
    copyright takedown in the French legal system.
  359. We would be really annoyed to have to
    take down content from the archive
  360. What we do, however, is to mirror public
    information that is publicly available.
  361. Of course I'm not a lawyer for the project,
    so I can't really…
  362. I'm not 100% sure of what I'm about to say
  363. what I know is that in the current French
    legislation status,
  364. if the source of the data is still available
  365. so for instance if the data is still on
    Github, then you need to have
  366. Github take it down before we have to
    take it down.
  367. We're not currently filtering content for
    misuse of the archive,
  368. so the only thing that we do is put
    a limit on the size of the files
  369. that are archived in Software Heritage.
  370. The limit is pretty high, like 100MB.
  371. We can't really decide ourselves
  372. what is source code,
    what is not source code
  373. because for instance if your project is
    a cryptography library,
  374. you might want to have some encrypted
    blocks of data that are stored
  375. in your source code repository as
    test fixtures.
  376. And then, you need them to build the code
    and to make sure that it works.
  377. So, how would that be any different than
    your encrypted backup on GitHub?
  378. How could we, Software Heritage,
    distinguish between proper use and misuse
  379. of the resources.
  380. I guess our long term goal is to not have
    to care about misuse because
  381. it's gonna be a drop in the ocean.
  382. We're gonna have so much…
  383. We want to have enough space and
    enough resources
  384. that we don't really need to ask ourselves
    this question, basically.
  385. Thanks.
  386. Other questions?
  387. [Q] Have you looked at some form of
    authentication to provide additional
  388. assurance that the archived source code
    hasn't been modified or tampered with
  389. in some form?
  390. [A] First of all, all the identifiers for
    the objects that are inside the archive
  391. are cryptographic hashes of the contents
    that we've archived.
  392. So, for files, for instance, we take
    the SHA1, the SHA256,
  393. one of the BLAKE hashes and the git
    modified SHA1 of the file,
  394. and we use that in the manifest for
    the directories.
  395. So the directories, the directory identifiers
    are a hash of the manifest
  396. of the list of files that are inside
    the directory, etc.
  397. So, recursively, you can make sure that
    the data that we give back to you
  398. has not been altered, at least by bitflip
    or anything.
  399. We regularly run a scrub of the data
    that we have in the archive,
  400. so we make sure that there's no rot
    inside our archive.
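
A small sketch of the integrity scheme described in this answer. It computes several hashes per file, hashes a manifest of directory entries, and re-checks stored bytes as a scrub would. Using BLAKE2s for "one of the BLAKE hashes" and the simplified manifest layout are assumptions made for the example, not the archive's exact formats.

```python
import hashlib
from typing import Dict


def content_hashes(data: bytes) -> Dict[str, str]:
    """Compute the set of hashes kept for one file."""
    git_header = f"blob {len(data)}\0".encode()
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data).hexdigest(),  # assumed BLAKE variant
        # git's "modified" SHA-1: the content prefixed with a blob header
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
    }


def directory_id(entries: Dict[str, str]) -> str:
    """Hash a manifest of (entry name, child identifier) pairs.

    Simplified layout: one 'name\\0id' line per entry, sorted by name.
    Child identifiers may be file hashes or other directory ids, so a
    change anywhere below changes every identifier above it.
    """
    manifest = "".join(f"{name}\0{child}\n" for name, child in sorted(entries.items()))
    return hashlib.sha1(manifest.encode("utf-8")).hexdigest()


def is_intact(data: bytes, expected: Dict[str, str]) -> bool:
    """Scrub check: recompute the hashes and compare with the stored ones."""
    return content_hashes(data) == expected
```

A single flipped bit in a file changes all four of its hashes, and therefore the identifier of every directory that contains it, which is why the check can be done recursively from the top.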
  401. We've not looked into, basically,
    attestation of…
  402. for instance, making sure that the code
    that we've downloaded…
  403. I mean, we're not doing anything more
    than taking a picture of the data
  404. and we say "We've computed this hash.
    Maybe the code that's been presented
  405. by GitHub to Software Heritage is different
    than what you've uploaded to GitHub,
  406. we can't tell."
  407. In the case of git, you can always use
    the identifiers of the objects
  408. that you've pushed so you have
    the commit hash,
  409. which is itself a cryptographic identifier
    of the contents of the commit.
  410. In turn, if the commit is signed, then
    the signature is still stored
  411. in the Software Heritage metadata and
    you can reproduce the original git object
  412. and check the signature, but we've not
    done anything specific for Software Heritage
  413. in this area.
  414. Does that answer your question?
  415. Cool.
  416. Other questions?
  417. There's one in front.
  418. [Q] It's partially question, partially
  419. Your initial idea was to have a telescope,
    or something like this for source code.
  420. For now, for me, it looks a little bit
    more like a microscope,
  421. so you can focus on one thing, but that's
    not much.
  422. So have you thought about how to
    analyze an entire ecosystem
  423. or something like this.
  424. For example, now we have Django 2 which is
    Python 3 only so it would be interesting to
  425. look at all Django modules to see when
    they start moving to this Django.
  426. So we would need to start analyzing
    thousands or millions of files, but then
  427. we would need some SQL like, or some
    map reduce jobs
  428. or something like this for this.
  429. [A] Yes
  430. So, we've started…
  431. The two initiators of the project, Roberto
    Di Cosmo and Stefano Zacchiroli
  432. are both researchers in computer science
    so they have a strong background in
  433. actually mining software repositories and
    doing some large scale analysis
  434. on source code.
  435. We've been talking with research groups
    whose main goal is to do analysis on
  436. large scale source code archives.
  437. One of the first mirrors outside of our
    control of the archive
  438. will be in Grenoble (France).
  439. There's a few teams that work on
    actually doing large scale research
  440. on source code over there,
  441. so that's what the mirror will be
    used for.
  442. We've also been looking at what
    the Google open source team does.
  443. They have this big repository with all
    the code that Google uses
  444. and they've started to push back,
    like do large scale analysis of
  445. security vulnerabilities, issues with
    static and dynamic analysis
  446. of the code and they've started pushing
    their fixes upstream.
  447. That's something that we want to enable
    users to do,
  448. that's not something that we want to do
    ourselves, but we want to make sure
  449. that people can do it using our archive.
  450. So we'd be happy to work with people
    who already do that so that
  451. they can use their knowledge and their
    tools inside our archive.
  452. Does that answer your question?
  453. Cool.
  454. Any more questions?
  455. No? Then thank you very much Nicolas.
  456. Thank you.
  457. [Applause]