
cdn.media.ccc.de/.../wikidatacon2019-1059-eng-Inventaire_What_we_learnt_from_reusing_and_extending_Wikidata_shifting_data_hd.mp4

  • 0:06 - 0:08
    (host) ...this session
    basically on time as well.
  • 0:09 - 0:13
    So yeah, this is the Inventaire guys.
  • 0:13 - 0:17
    And yeah, enjoy.
  • 0:17 - 0:19
    (laughter)
  • 0:21 - 0:23
    Thank you for being here.
  • 0:24 - 0:27
    We'll be presenting Inventaire
    very quickly.
  • 0:27 - 0:33
    We would like to go more in-depth
    on how we work day to day
  • 0:33 - 0:36
    with Wikidata's shifting data.
  • 0:38 - 0:42
    A quick demo of...
  • 0:42 - 0:47
    This is Inventaire,
    I hope a lot of people know it already.
  • 0:48 - 0:51
    It's in order to share books,
    physical books
  • 0:51 - 0:55
    and everyone can just scan the ISBN
  • 0:55 - 0:59
    and share, lend, sell.
  • 0:59 - 1:05
    We're just making a relationship platform
    in order to exchange books.
  • 1:07 - 1:11
    And of course,
    it's reusing Wikidata's data.
  • 1:16 - 1:19
    That's the first part of the project.
  • 1:19 - 1:21
    The second part is
  • 1:22 - 1:27
    the Wikidata-federated
    open bibliographic database
  • 1:27 - 1:31
    that we've been building
    for a long time now,
  • 1:31 - 1:34
    five years, something like that.
  • 1:36 - 1:38
    There you go.
  • 1:39 - 1:45
    So we basically reuse Wikidata's items
    that are in a local [inaudible]
  • 1:45 - 1:47
    for which we regulate the updates,
  • 1:47 - 1:52
    and we also add extra items
    to our local database
  • 1:52 - 1:57
    in order to cover books
    that are not already on Wikidata.
  • 1:59 - 2:02
    And we are using a very
    similar data model
  • 2:02 - 2:07
    in order to stay compatible with Wikidata's data.
  • 2:09 - 2:14
    This is a brand new entities map
  • 2:18 - 2:25
    which basically describes
    how we model the data in Inventaire
  • 2:26 - 2:28
    and we go back after
  • 2:28 - 2:35
    on the typing system
    that we are strictly enforcing
  • 2:35 - 2:38
    compared to what Wikidata is doing.
  • 2:44 - 2:50
    Because, yeah, we do like ontologies,
    we do like to talk about semantic,
  • 2:50 - 2:54
    but we are kind of dealing
    with reality here
  • 2:55 - 3:02
    and we need to strictly type
    what we are dealing with
  • 3:02 - 3:07
    especially with ontologies
    and inheritance, which have some troubles...
  • 3:08 - 3:11
    We have some trouble complying with them.
  • 3:13 - 3:16
    If you want to add something on that.
  • 3:16 - 3:19
    We'll get into the details after.
  • 3:19 - 3:24
    We have a preview of what that meant--
  • 3:24 - 3:27
    Which is really the part
    where people start to scream
  • 3:27 - 3:30
    because the way we do typing...
  • 3:30 - 3:33
    We have different types of entities
    we deal with in Inventaire
  • 3:33 - 3:37
    which are mainly works,
    editions, series, and humans.
  • 3:37 - 3:40
    And the thing is that,
    to answer the question
  • 3:40 - 3:46
    is this item in Wikidata a series,
    a work, an edition?
  • 3:46 - 3:52
    You have different ways to do that
    but none of them are both perfect
  • 3:52 - 3:57
    and fast, efficient and so the way
    we do that at the moment
  • 3:57 - 4:03
    is that we have those lists of aliases
    of what our owners...
  • 4:04 - 4:08
    That's the list of properties,
    all those properties...
  • 4:08 - 4:10
    Well, I'm sorry that's the wrong type
  • 4:21 - 4:27
    That's the aliases,
    the things we consider being humans
  • 4:27 - 4:33
    are P31Q5 obviously
    but also things that are duo, sibling duo,
  • 4:33 - 4:36
    writer, house name, pseudonym
  • 4:36 - 4:41
    we consider that when we encounter
    those things in an entity that's a human.
  • 4:41 - 4:43
    And humans are actually the simplest ones
  • 4:43 - 4:47
    because P31 Q5 is one
    of the most consistent ways
  • 4:47 - 4:50
    to type an entity in Wikidata
    and that's great
  • 4:50 - 4:53
    and we want more of that.
    (laughs)
  • 4:53 - 4:58
    But look at works, this is all
    the things we consider to be works,
  • 4:58 - 5:03
    so book, Q5 and bla, bla, bla
    and all those comic book,
  • 5:03 - 5:06
    comic strip, novella, graphic novel.
  • 5:06 - 5:10
    So every time you see those P31s,
    we consider them to be works.
  • 5:10 - 5:17
    And yeah, we could go more on those
    but that's our hack to make typing work.
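The alias-list "hack" described above can be pictured roughly like this. This is a minimal sketch, not Inventaire's actual code, and the QID sets are abbreviated illustrations of the much longer alias lists shown on the slide:

```python
# Sketch of alias-based typing: map each local type to a whitelist of
# Wikidata classes accepted as its P31 (instance of) values.
# These QID sets are illustrative, not Inventaire's full lists.
TYPE_ALIASES = {
    "human": {"Q5"},                         # human
    "work": {"Q571", "Q8261", "Q47461344"},  # book, novel, written work
    "serie": {"Q277759"},                    # book series
    "edition": {"Q3331189"},                 # version, edition or translation
}

def entity_type(p31_values):
    """Return the local type for an entity given its P31 values,
    or None when no alias matches (the failure case discussed later)."""
    for local_type, aliases in TYPE_ALIASES.items():
        if any(qid in aliases for qid in p31_values):
            return local_type
    return None
```

The point of the whitelist is that a single set lookup replaces any server-side class-hierarchy traversal, at the cost of missing unlisted classes.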
  • 5:22 - 5:26
    So now the problem in a way...
  • 5:27 - 5:31
    Wikidata is awesome, we all agree on that.
  • 5:32 - 5:35
    The only thing is that
    it might change over time
  • 5:35 - 5:42
    and so how do we keep track of those
    changes in our own local data model?
  • 5:42 - 5:47
    So those are the things that could
    be redefined on the fly, overnight.
  • 5:48 - 5:52
    There are discussions, I know,
    yet sometimes,
  • 5:52 - 5:56
    it's possible that we do have trouble
  • 5:56 - 5:59
    to redefine...
  • 6:00 - 6:02
    with properties that could be redefined.
  • 6:02 - 6:07
    For example that happened
    a couple of--maybe six months ago
  • 6:07 - 6:08
    or something like that.
  • 6:08 - 6:12
    I think it's more than a year ago
    but we have been considering it.
  • 6:13 - 6:15
    Between the time we detected this problem
  • 6:15 - 6:20
    and started to think about taking
    measures about that six months ago.
  • 6:22 - 6:25
    So all of a sudden,
    this property became...
  • 6:26 - 6:30
    Not all of a sudden
    but we found it all of a sudden.
  • 6:32 - 6:36
    ...became from languages,
  • 6:37 - 6:41
    what's exactly that [inaudible]
  • 6:42 - 6:46
    So this property was language
    of edition...
  • 6:46 - 6:50
    - What was it?
    - That was the property we were using
  • 6:50 - 6:54
    for edition's language and at some point
  • 6:54 - 6:57
    it was decided that there were too many
    properties to talk about language,
  • 6:57 - 7:02
    so this one was transformed
    into original language of film or TV show.
  • 7:02 - 7:08
    So we started having our data
    be about TV shows all of a sudden
  • 7:08 - 7:10
    and that was wrong.
  • 7:12 - 7:19
    So the conclusion of this example
    is don't recycle properties, please.
  • 7:19 - 7:21
    (laughter)
  • 7:24 - 7:30
    Other examples of shifting data--
    maybe we don't have that much time
  • 7:30 - 7:31
    to go through all of it.
  • 7:33 - 7:37
    There's some examples like when work
  • 7:37 - 7:41
    became editions on entities,
  • 7:41 - 7:44
    on items--that's also quite tricky
  • 7:44 - 7:48
    because then,
    how do we categorize it again?
  • 7:56 - 7:59
    So we are strictly typing
  • 8:00 - 8:04
    and that helps us...
  • 8:05 - 8:09
    There are advantages to that,
    lots of advantages.
  • 8:10 - 8:12
    It is a simplified world that we live in
  • 8:12 - 8:18
    It's not as complex
    as the reality Wikidata shows.
  • 8:18 - 8:23
    So every edition has
    at least one associated work
  • 8:23 - 8:26
    that is something that we can rely on.
  • 8:27 - 8:32
    So if a work becomes an edition,
    then that sometimes causes a problem.
  • 8:35 - 8:40
    Edition data cannot be added
    on a work then.
  • 8:41 - 8:44
    We are strict about that
    even if Wikidata is not.
  • 8:44 - 8:49
    So we are enforcing a policy that
    we would like to have on Wikidata
  • 8:49 - 8:54
    but we are only a small part
    of a bigger system.
  • 8:58 - 8:59
    We have done autocomplete.
  • 9:00 - 9:04
    Well, that's something
    we have demonstrated other times
  • 9:04 - 9:07
    but the idea for example when
    we have genre properties,
  • 9:07 - 9:12
    we will just suggest--like the user
    will start to type a genre
  • 9:12 - 9:18
    and only genres will be suggested,
    so it makes... it's like a simplified...
  • 9:18 - 9:20
    because we are strictly typing
  • 9:20 - 9:26
    we can have less weird input for the user
  • 9:26 - 9:29
    which is very interesting for us
  • 9:29 - 9:31
    because our users are not aware
    of Wikidata
  • 9:31 - 9:33
    and all the things that are in Wikidata,
  • 9:33 - 9:38
    so we sort of simplify everything
    as much as we can
  • 9:38 - 9:40
    but that's at the cost of flexibility.
  • 9:40 - 9:45
    The flexibility of Wikidata
    is lost in this process.
  • 9:48 - 9:54
    So for instance...
    Oh that's soon.
  • 9:54 - 9:57
    (laughs)
    So we have to go faster maybe.
  • 9:58 - 10:03
    One of the cases that was,
    how do we...
  • 10:04 - 10:06
    Wait.
  • 10:07 - 10:11
    The simplified typing system
    is at the cost of how do we get--
  • 10:11 - 10:14
    we have the list of aliases of types
    we saw earlier,
  • 10:14 - 10:16
    but sometimes we don't have all the types
  • 10:16 - 10:22
    so for example when we encounter
    a P31 science fiction trilogy,
  • 10:22 - 10:27
    if we didn't have it in our alias list
    that's breaking the system
  • 10:27 - 10:30
    and so we are back at this problem.
  • 10:30 - 10:35
    There are different ways to work on that
    and so that's trying to make...
  • 10:35 - 10:39
    In suggestion, we talked about
    that problem in our earlier presentation
  • 10:39 - 10:41
    and we were told,
    "Yeah, you should do that with SPARQL."
  • 10:41 - 10:44
    Yes, that could mean asking for
  • 10:44 - 10:50
    is this entity an instance of,
    or somehow a subclass of, written work,
  • 10:50 - 10:51
    these kind of things.
  • 10:51 - 10:55
    And that's a very expensive query
    and we can't do that for everything.
  • 10:55 - 10:58
    So that's why we have these aliases.
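The expensive SPARQL alternative mentioned above would look something like the query below. This is only a sketch of the pattern being described (instance of / transitive subclass of written work, Q47461344); the `P279*` path traversal on every lookup is what makes it too costly to run per entity:

```python
# Build the ASK query the speakers describe: is this item an instance of
# some (transitive) subclass of written work (Q47461344)?
def ask_is_written_work(qid):
    return f"""
ASK {{
  wd:{qid} wdt:P31/wdt:P279* wd:Q47461344 .
}}"""
```

Compared with the local alias whitelist, this is complete but slow; the whitelist is fast but incomplete.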
  • 11:00 - 11:02
    And also we have...
  • 11:02 - 11:07
    This lost flexibility,
    for the sake of simplicity
  • 11:07 - 11:12
    we lose flexibility and that's how
    we have this for examples maybe.
  • 11:12 - 11:18
    Yeah it's quite obvious
    we cannot do much about it.
  • 11:19 - 11:22
    There are more than humans
    that author books.
  • 11:22 - 11:24
    For example, there are collectives
  • 11:24 - 11:28
    and this is not yet taken
    into account in Inventaire.
  • 11:30 - 11:32
    And editions can be a whole series,
  • 11:32 - 11:38
    lots of different possibilities
    that do not fit into this reality.
  • 11:39 - 11:41
    That is the world.
  • 11:42 - 11:45
    Maybe going fast.
  • 11:45 - 11:49
    On the list of issues,
    we have also these querying issues,
  • 11:49 - 11:53
    different strategies to try
    to be both efficient
  • 11:54 - 11:59
    and complete in the way we find all
    the works of a given author.
  • 11:59 - 12:04
    And we can't go over all those subclasses
    because that's too expensive
  • 12:04 - 12:09
    but at the same time, we can't just ask
    for all the items that have a P50,
  • 12:10 - 12:13
    an author, because editions
    also have this P50.
  • 12:13 - 12:17
    And so, yeah,
    that's the kind of problems we have.
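One way to picture the trade-off just described: fetch every item whose P50 (author) points at the person along with its P31, then filter out known edition classes on the client side instead of walking `P279*` on the server. This is an illustration of the compromise, not Inventaire's actual query, and the edition class set is abbreviated:

```python
# Abbreviated, illustrative set of classes treated as editions.
EDITION_CLASSES = {"Q3331189"}  # version, edition or translation

def works_query(author_qid):
    # Fetch every item authored (P50) by the person, with its P31 type,
    # so editions can be filtered out afterwards.
    return f"""
SELECT ?item ?type WHERE {{
  ?item wdt:P50 wd:{author_qid} ;
        wdt:P31 ?type .
}}"""

def filter_editions(rows):
    """rows: [(item_qid, type_qid), ...] as returned by the query above."""
    return [item for item, type_ in rows if type_ not in EDITION_CLASSES]
```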
  • 12:18 - 12:22
    On the ideas we are playing with
    we have what was...
  • 12:23 - 12:28
    yeah, the concept of extending entities
    locally could be a solution
  • 12:28 - 12:30
    to some of the problems we presented.
  • 12:31 - 12:35
    It was mentioned earlier as shadow items
    but maybe it's not such a good name,
  • 12:35 - 12:38
    so we will call it
    "extend entities locally"
  • 12:38 - 12:44
    which means locally adding statements
    on an item that is on Wikidata
  • 12:45 - 12:48
    but without overwriting
    because that would be crazy.
  • 12:49 - 12:53
    That would solve some problems though,
    because for copyrighted works,
  • 12:53 - 12:59
    we could actually work with that data,
    which is not possible through Wikidata.
  • 12:59 - 13:05
    We also do not have to agree
    with Wikidata's community
  • 13:05 - 13:07
    in order to enforce our schemas.
  • 13:10 - 13:17
    We can also add links on
    non-Wikidata items from a Wikidata item.
  • 13:20 - 13:24
    But the [inaudible] is quite huge.
  • 13:25 - 13:29
    We have to follow Wikidata's algorithm
    in order to make it compatible.
  • 13:30 - 13:36
    And it's problematic
    for pushing data to Wikidata
  • 13:36 - 13:38
    if we lack some information.
  • 13:39 - 13:45
    We also reference,
    we would like to push it in order to...
  • 13:48 - 13:51
    go through it, sorry.
  • 13:52 - 13:54
    Actually it's quite the end.
  • 13:54 - 13:57
    - This one?
    - No, no, please, please.
  • 13:57 - 14:02
    (laughs) So to keep updated
    we have to sometimes
  • 14:02 - 14:06
    make mass updates of Inventaire data,
  • 14:06 - 14:09
    and that's also the occasion
    of great scripting,
  • 14:09 - 14:14
    and that's not always elegant
    but at least it's happening.
  • 14:14 - 14:17
    So we need to make this transition great
  • 14:17 - 14:20
    and not repeat
    this language of TV show mistake, for instance.
  • 14:20 - 14:25
    Yes, and maybe for a final note
  • 14:26 - 14:28
    on an argument for a small Wikidata
  • 14:28 - 14:33
    because we have problems
    with the query service update time
  • 14:33 - 14:36
    which has been mentioned
    a few times, and this is...
  • 14:37 - 14:43
    It seems to be due to the big ambition
    of Wikidata to cover all sort of items
  • 14:43 - 14:49
    including scientific articles
    and so many scientific articles.
  • 14:50 - 14:51
    Maybe they are not the only ones to blame
  • 14:51 - 14:57
    but we end up having this large delay
  • 14:57 - 15:01
    between an update and the propagation
    on the Wikidata query service,
  • 15:01 - 15:03
    and that's a problem for us
  • 15:03 - 15:07
    because for example you will have
    a user modify--adding,
  • 15:07 - 15:09
    connecting a work
    with an author on Wikidata
  • 15:09 - 15:15
    and then going to the author page
    and expecting that their contribution
  • 15:15 - 15:19
    be visible on the author page
    and they won't see it happening
  • 15:19 - 15:26
    and so, we need to find ways to tell them
    that it's going to be propagated
  • 15:26 - 15:28
    but we don't know when.
  • 15:28 - 15:30
    And then we have the problems of
  • 15:30 - 15:34
    we cache the request
    to get the author data,
  • 15:34 - 15:39
    and we don't know when we shall update
    this cached version of the query.
  • 15:39 - 15:44
    So that's the kind of problems we have
    with the query delays, the query service.
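The caching dilemma can be sketched as a small TTL cache. The class and names here are hypothetical, and the TTL is necessarily a guess, precisely because the query service's propagation delay is unknown:

```python
import time

class AuthorCache:
    """Hypothetical sketch: cache query-service results for an author,
    with a TTL guess and manual invalidation on local edits."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # qid -> (timestamp, data)

    def get(self, qid, fetch):
        # Return cached data while it is fresh; otherwise refetch.
        entry = self.store.get(qid)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        data = fetch(qid)
        self.store[qid] = (time.time(), data)
        return data

    def invalidate(self, qid):
        # Called when a user edits the entity locally, even though the
        # query service may still serve stale results for a while.
        self.store.pop(qid, None)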
  • 15:44 - 15:46
    And so having a smaller Wikidata
  • 15:46 - 15:51
    could maybe help us to not have
    to deal with this problem
  • 15:51 - 15:55
    because then we could just consider
    that the update will be quick
  • 15:55 - 15:59
    and that we can just maybe, in a few,
  • 15:59 - 16:01
    less than ten minutes update our version
  • 16:01 - 16:05
    and at least be close
    to what people contributed.
  • 16:06 - 16:07
    That's it.
  • 16:07 - 16:09
    Thanks for listening.
  • 16:09 - 16:14
    If you have any questions or comments
    we'll be happy to talk now
  • 16:14 - 16:16
    or after also during the event,
  • 16:16 - 16:20
    and we'll be here also on Sunday
    to talk about Wikibase.
  • 16:24 - 16:26
    And if you have questions, yeah?
  • 16:26 - 16:28
    (host) Why don't you give these guys
    a round of applause.
  • 16:28 - 16:31
    (applause)
  • 16:34 - 16:37
    Meanwhile we can look at the map.
  • 16:37 - 16:38
    (host) We have quite
    a generous question time
  • 16:38 - 16:41
    because these guys have finished
    with plenty of time to go
  • 16:41 - 16:43
    so lots of questions.
  • 16:43 - 16:48
    Yeah, the idea was to put on the table
    the pain we encounter
  • 16:48 - 16:52
    in daily life working out of Wikidata
  • 16:52 - 16:56
    and to have your ideas and comments
  • 16:56 - 16:58
    and how many of those pain points
    you share,
  • 16:58 - 17:02
    and what solution also you might
    have found to tackle them.
  • 17:02 - 17:05
    More general questions are also possible.
  • 17:05 - 17:07
    (host) I'm going to go
    to the chap in the front
  • 17:07 - 17:10
    and then we'll go backwards as we go.
  • 17:10 - 17:14
    (man) I guess first off, it was very
    therapeutic to hear that all this pain
  • 17:14 - 17:18
    I've encountered personally,
    it's like "Oh yes, it's not just me."
  • 17:18 - 17:20
    (laughter)
  • 17:20 - 17:24
    But one thing I've encountered
    with the schema issues is that,
  • 17:24 - 17:27
    yes, my go to approach
    is always just like
  • 17:27 - 17:30
    Oh, let's just find all subclasses
    of a specific instance
  • 17:30 - 17:32
    to solve this.
  • 17:32 - 17:34
    I've encountered
    a lot of the issues you have
  • 17:34 - 17:37
    though it seems like going
    in the reverse
  • 17:37 - 17:40
    has helped solve that issue
  • 17:40 - 17:43
    at least for my use cases,
    for instance I noticed that
  • 17:43 - 17:49
    all humans were instances
    of a manufactured component
  • 17:49 - 17:55
    and I just said, "Okay, let me find
    all classes that instance of a human is,
  • 17:55 - 18:02
    and this helped me go through
    and like find these errors in schema
  • 18:02 - 18:04
    of subclass relationships,
  • 18:05 - 18:09
    and I was wondering if you
    had gone through any of these processes?
  • 18:09 - 18:14
    Were there any other approaches,
    more to this, to find errors?
  • 18:14 - 18:19
    Yeah, we went through
    some trial and error there
  • 18:19 - 18:22
    and we encountered things like...
  • 18:23 - 18:27
    We have this very important distinction
    between editions and works,
  • 18:27 - 18:30
    but at some point,
    editions were a subclass of something
  • 18:30 - 18:34
    that was a subclass of works
    and so the separation was falling apart
  • 18:34 - 18:38
    and so that's the example of one
    of the things that were modified
  • 18:38 - 18:42
    because someone was thinking the world
    was different than what it was.
  • 18:43 - 18:49
    And so that's how we arrived
    at this more blacklist, whitelist system,
  • 18:49 - 18:53
    like a list of good types.
  • 18:53 - 18:59
    Yes, we are also coming from the editions
    and from the ISBNs of people's books
  • 18:59 - 19:05
    then, we have to go upward in the classes
    in order to find out that somehow
  • 19:05 - 19:10
    this edition inherits from the work
    and then how do you do that?
  • 19:10 - 19:13
    Like that's very problematic.
  • 19:16 - 19:20
    (man) I guess I have some SPARQL queries
    that might be useful for that.
  • 19:20 - 19:23
    It just generates a nice graph
    of the instance of subclass.
  • 19:23 - 19:24
    I didn't write it.
  • 19:24 - 19:26
    Someone wrote it for me
    when I described this problem
  • 19:26 - 19:29
    so I can't take credit
    but it might be useful for that.
  • 19:29 - 19:32
    But what about cyclic problems,
  • 19:32 - 19:37
    like how do you deal with it
    when there are inconsistencies
  • 19:37 - 19:40
    or things that are like editions,
    instance of work and... ?
  • 19:40 - 19:45
    (man) I think just visualizing it
    and it's very easy...
  • 19:45 - 19:49
    No, it's not very easy, it's possible
    to then find these inconsistencies.
  • 19:49 - 19:52
    But I also think there
    are loops in Wikidata,
  • 19:52 - 19:55
    for instance a concept
    is itself an instance of concept.
  • 19:56 - 20:01
    It's not a useful subclass or instance of
    but it's a valid one.
  • 20:01 - 20:05
    But do you generate this map
    and see if there are errors
  • 20:05 - 20:10
    - but use the results other than--
    - (man) I guess this was a...
  • 20:10 - 20:14
    Oh, I noticed that Douglas Adams
    is an instance of this
  • 20:14 - 20:17
    and there are these errors
    and then like just pitching it
  • 20:17 - 20:21
    to the communities, like fix this problem.
  • 20:21 - 20:24
    But you don't use
    this subclasses query live?
  • 20:25 - 20:27
    (man) No, not live, no.
  • 20:27 - 20:28
    It was more of a debugging
  • 20:28 - 20:33
    and then realizing
    it was small enough to fix.
  • 20:44 - 20:48
    (man 2) I suppose I'll just comment about
    the issues around books,
  • 20:48 - 20:53
    so I would say that we should
    never use a book as an instance,
  • 20:53 - 21:00
    and we should try to move books'
    instances either to works or editions,
  • 21:00 - 21:03
    and perhaps you can agree to that.
  • 21:03 - 21:10
    And furthermore,
    when I come to this, a book instance,
  • 21:10 - 21:15
    I would say that perhaps sometimes
    rather than converting to a work,
  • 21:15 - 21:20
    a literary work, I would convert it
    to an edition instead.
  • 21:20 - 21:23
    For example if it has ISBN numbers,
  • 21:23 - 21:26
    I would say that it's more
    like an edition
  • 21:27 - 21:29
    and perhaps meant like an edition.
  • 21:29 - 21:35
    Also, for example if it's cited,
    I suppose typically in citations
  • 21:35 - 21:39
    you are citing an edition
    rather than a work.
  • 21:40 - 21:47
    But I imagine according to you that
    this way could generate problems for you,
  • 21:47 - 21:52
    so you'd rather sort of convert
    the book to a work
  • 21:52 - 21:57
    and then create an instance of...
  • 21:59 - 22:02
    ...an instance of an edition
    instead of that
  • 22:02 - 22:05
    and move, for example, the ISBN numbers
  • 22:05 - 22:09
    and perhaps other identifiers
    to that item.
  • 22:10 - 22:11
    We have seen the different criteria
  • 22:11 - 22:16
    depending on who was wanting
    to make the separation.
  • 22:16 - 22:18
    So people who are interested
    in the citation
  • 22:18 - 22:20
    or coming from Wikisource
  • 22:20 - 22:23
    want to convert pretty much
    everything to editions
  • 22:23 - 22:27
    and people who are more interested
    in the works as abstract categories
  • 22:27 - 22:31
    for the editions try
    to convert everything to works.
  • 22:31 - 22:34
    And because in the case you described,
  • 22:35 - 22:40
    rather than considering that something
    with an ISBN is rather an edition,
  • 22:40 - 22:42
    I will delete the ISBN
    and consider it a work,
  • 22:42 - 22:47
    and in the case what we see often is that
  • 22:47 - 22:50
    there are Wikipedia articles
    on those items
  • 22:50 - 22:53
    and those Wikipedia articles
    don't talk about a specific edition
  • 22:53 - 22:56
    but about the concept of the work more.
  • 22:56 - 23:02
    And so these are the kind of problems
    that are discussed in WikiProject books
  • 23:02 - 23:06
    and we are not seeing the end of it
    and that's why for the moment...
  • 23:06 - 23:08
    (man 2) I want to say that if it's...
  • 23:08 - 23:12
    If there's a Wikipedia article
    about the book,
  • 23:12 - 23:17
    then it should be a work, the item.
  • 23:17 - 23:21
    Lots of them have ISBNs
    also on the Wikipedia page.
  • 23:23 - 23:26
    (man 2) I suppose that should
    then be removed
  • 23:26 - 23:30
    perhaps over to the edition item.
  • 23:30 - 23:33
    It would be nice to have
    a consensus on that.
  • 23:33 - 23:38
    It's an ongoing discussion
    on WikiProject books, I guess.
  • 23:42 - 23:45
    (host) We have time for maybe
    just one more quick question.
  • 23:50 - 23:52
    Excellent!
  • 23:52 - 23:54
    Well, if you'd like to show
    your appreciation again for these guys,
  • 23:54 - 23:55
    that would be great.
  • 23:55 - 23:59
    (applause)
Video Language:
English
Duration:
24:06
