Software Heritage - Preserving the Free Software Commons

Hi, thank you.

I'm Nicolas Dandrimont and I will indeed
be talking to you about

Software Heritage.

I'm a software engineer for this project.

I've been working on it for 3 years now.

And we'll see what this thing is all about.

[Mic not working]

I guess the batteries are out.

So, let's try that again.

So, we all know, we've been doing
free software for a while,

that software source code is something
special.

Why is that?

As Harold Abelson has said in SICP, his
textbook on programming,

programs are meant to be read by people
and then incidentally for machines to execute.

Basically, what software source code
provides us is a way inside

the mind of the designer of the program.

For instance, you can have,
you can get inside very crazy algorithms

that can do very fast inverse square roots
for 3D, that kind of stuff,

like in the Quake III source code.

You can also get inside the algorithms
that are underpinning the internet,

for instance seeing the net queue
algorithm in the Linux kernel.

What we are building as the free software
community is the free software commons.

Basically, the commons is all the cultural
and social and natural resources

that we share and that everyone
has access to.

More specifically, the software commons
is what we are building

with software that is open and that is
available for all to use, to modify,

to execute, to distribute.

We know that the software commons are a really
critical part of our commons.

Who's taking care of it?

The software is fragile.

Like all digital information, you can lose
software.

People can decide to shut down hosting
spaces because of business decisions.

People can hack into software hosting
platforms and remove the code maliciously

or just inadvertently.

And, of course, for the obsolete stuff,
there's rot.

If you don't care about the data, then
it rots and it decays and you lose it.

So, where is the archive we go to
when something is lost,

when GitLab goes away, when GitHub
goes away?

Where do we go?

Finally, there's one last thing that we
noticed, it's that

there's a lot of teams that work on
research on software

and there's no real big infrastructure
for research on code.

There's tons of critical issues around
code: safety, security, verification, proofs.

Nobody's doing this at a very large scale.

If you want to see the stars, you go
to the Atacama Desert and

you point a telescope at the sky.

Where is the telescope for source code?

That's what Software Heritage wants to be.

What we do is we collect, we preserve
and we share all the software

that is publicly available.

Why do we do that? We do that to
preserve the past, to enhance the present

and to prepare for the future.

What we're building is a base infrastructure
that can be used

for cultural heritage, for industry,
for research and for education purposes.

How do we do it? We do it with an open
approach.

Every single line of code that we write
is free software.

We do it transparently, everything that
we do, we do it in the open,

be that on a mailing list or on
our issue tracker.

And we strive to do it for the very long
haul, so we do it with replication in mind

so that no single entity has full control
over the data that we collect.

And we do it in a non-profit fashion
so that we avoid

business-driven decisions impacting
the project.

So, what do we do concretely?

We do archiving of version control systems.

What does that mean?

It means we archive file contents, so
source code, files.

We archive revisions, which means all the
metadata of the history of the projects,

we try to download it and we put it inside
a common data model that is

shared across all the archive.

We archive releases of the software,
releases that have been tagged

in a version control system as well as
releases that we can find as tarballs

because sometimes… boof, views of
this source code differ.

Of course, we archive where and when
we've seen the data that we've collected.

All of this, we put inside a canonical,
VCS-agnostic, data model.

If you have a Debian package, with its
history, if you have a Git repository,

if you have a Subversion repository, if
you have a Mercurial repository,

it all looks the same and you can work
on it with the same tools.
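
The idea of a single, VCS-agnostic model can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the class names, the fields, and the hashing scheme in `object_id` are hypothetical and are not taken from the Software Heritage code base. The point it shows is that once file contents, directories and revisions are identified by what they contain, the same history imported from two different systems ends up as the same objects.

```python
import hashlib
from dataclasses import dataclass


def object_id(kind: str, payload: bytes) -> str:
    """Derive an identifier from an object's kind and content (hypothetical scheme)."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


@dataclass(frozen=True)
class Content:
    """A file content: just the bytes."""
    data: bytes

    @property
    def id(self) -> str:
        return object_id("content", self.data)


@dataclass(frozen=True)
class Directory:
    """A directory: sorted (name, object id) pairs."""
    entries: tuple

    @property
    def id(self) -> str:
        payload = "\n".join(f"{name} {oid}" for name, oid in self.entries)
        return object_id("directory", payload.encode())


@dataclass(frozen=True)
class Revision:
    """A point in history: a root directory plus metadata."""
    directory_id: str
    parent_ids: tuple
    author: str
    message: str

    @property
    def id(self) -> str:
        fields = [self.directory_id, *self.parent_ids, self.author, self.message]
        return object_id("revision", "\n".join(fields).encode())


@dataclass(frozen=True)
class Visit:
    """Where and when a revision was seen."""
    origin_url: str
    date: str
    revision_id: str


def make_tree(files: dict) -> Directory:
    """Build a directory from a {name: bytes} mapping, independent of input order."""
    entries = tuple(sorted((name, Content(data).id) for name, data in files.items()))
    return Directory(entries)


files = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
# The same tree, as two hypothetical loaders might hand it over in different orders.
from_git = Revision(make_tree(files).id, (), "Alice", "Initial import")
from_svn = Revision(make_tree(dict(reversed(list(files.items())))).id, (), "Alice", "Initial import")
print(from_git.id == from_svn.id)  # True: the origin's VCS leaves no trace in the id
```

Keeping the visit information in a separate `Visit` record, rather than inside the revision, is a design choice of this sketch: it lets the same revision be observed at several origins without changing its identifier.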

What we don't do is archive what's around
the software, for instance

the bug tracking systems or the homepages
or the wikis or the mailing lists.

There are some projects that work
in this space, for instance

the Internet Archive does a lot of
really good work around archiving the web.

Our goal is not to replace them, but to
work with them and be able to do

linking across all the archives that exist.

For instance, for the mailing lists,
there's the Gmane project

that does a lot of archiving of free
software mailing lists.

So our long term vision is to play a part
in a semantic Wikipedia of software,

a Wikidata of software where we can
hyperlink all the archives that exist

and do stuff in the area.

Quick tour of our infrastructure.

Basically, all the way to the right is
our archive.

Our archive consists of a huge graph
of all the metadata about

the files, the directories, the revisions,
the commits and the releases and

all the projects that are on top
of the graph.

We separate the file storage into another
object storage because of

the size discrepancy: we have lots and lots
of file contents that we need to store

so we do that outside the database
that is used to store the graph.
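
A minimal sketch of that split, assuming an in-memory dictionary in place of the real object storage and another in place of the graph database. The class names `ObjectStorage` and `GraphStore`, the node labels, and the choice of SHA-256 as the blob key are all hypothetical; the deduplication of identical contents is likewise an assumption of the sketch rather than something stated in the talk.

```python
import hashlib


class ObjectStorage:
    """Stand-in for the bulk store that holds raw file contents."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # identical contents are stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


class GraphStore:
    """Stand-in for the database that holds the metadata graph."""

    def __init__(self):
        self.edges = {}  # node id -> list of (label, child id)

    def add_node(self, node_id, children):
        self.edges[node_id] = list(children)

    def walk(self, node_id):
        """Yield every node reachable from node_id, depth first."""
        yield node_id
        for _label, child in self.edges.get(node_id, []):
            yield from self.walk(child)


objects = ObjectStorage()
graph = GraphStore()

# Two projects ship the same license file: one blob, two graph edges.
license_key = objects.add(b"MIT License\n")
same_key = objects.add(b"MIT License\n")
graph.add_node("dir:project-a", [("LICENSE", license_key)])
graph.add_node("dir:project-b", [("COPYING", same_key)])
graph.add_node("rev:1", [("root", "dir:project-a")])

print(len(objects))              # number of stored blobs
print(list(graph.walk("rev:1"))) # the graph only ever carries keys, never bytes
```

The graph side stays small because it only stores identifiers and edges, while the large, opaque payloads live in a store that can be scaled or replicated on its own.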

Basically, what we archive is a set of
software origins that are

Git repositories, Mercurial repositories,
etc. etc.

All those origins are loaded on a
regular schedule.

If there is a very active software origin,
we're gonna archive it more often

than stale things that don't get
a lot of updates.
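
One simple way to get "active origins more often, stale origins less often" is to adapt the visit interval after each visit. The sketch below is only one possible policy and is not claimed to be the scheduler the project runs: the halving and doubling rule, the bounds `MIN_INTERVAL` and `MAX_INTERVAL`, and the function names are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical bounds on how often an origin may be visited.
MIN_INTERVAL = timedelta(hours=12)
MAX_INTERVAL = timedelta(days=64)


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Halve the interval when the last visit found changes, double it otherwise."""
    if changed:
        return max(MIN_INTERVAL, current / 2)
    return min(MAX_INTERVAL, current * 2)


def schedule_next_visit(last_visit: datetime, current: timedelta, changed: bool):
    """Return the date of the next visit and the interval that produced it."""
    interval = next_interval(current, changed)
    return last_visit + interval, interval


# An origin that changed on two visits in a row, then went quiet once.
interval = timedelta(days=4)
for changed in [True, True, False]:
    interval = next_interval(interval, changed)
print(interval)
```

With this rule an origin that keeps changing converges on the minimum interval, and one that never changes drifts toward the maximum, so the visiting effort follows activity without any per-origin configuration.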

Title: Software Heritage - Preserving the Free Software Commons
Video Language: English
Team: Debconf
Project: 2018_mini-debconf-hamburg
Duration: 41:31
