
cdn.media.ccc.de/.../wikidatacon2019-1059-eng-Inventaire_What_we_learnt_from_reusing_and_extending_Wikidata_shifting_data_hd.mp4

  • 0:06 - 0:08
    (host) ...this session
    basically on time as well.
  • 0:09 - 0:13
    So yeah, this is the Inventaire guys.
  • 0:13 - 0:17
    And yeah, enjoy.
  • 0:17 - 0:19
    (laughter)
  • 0:21 - 0:23
    Thank you for being here.
  • 0:24 - 0:27
    We'll be presenting Inventaire
    very quickly.
  • 0:27 - 0:33
    We would like to go more in-depth
    on how we work day to day
  • 0:33 - 0:36
    with Wikidata's shifting data.
  • 0:38 - 0:42
    A quick demo of...
  • 0:42 - 0:47
    This is Inventaire,
    I hope a lot of people know it already.
  • 0:48 - 0:51
    It's in order to share books,
    physical books
  • 0:51 - 0:55
    and everyone can just scan the ISBN
  • 0:55 - 0:59
    and share, lend, sell.
  • 0:59 - 1:05
    We're just making a relationship platform
    in order to exchange books.
  • 1:07 - 1:11
    And of course,
    it's reusing Wikidata's data.
  • 1:16 - 1:19
    That's the first part of the project.
  • 1:19 - 1:21
    The second part is
  • 1:22 - 1:27
    the Wikidata-federated
    open bibliographic database
  • 1:27 - 1:31
    that we've been building
    for a long time now,
  • 1:31 - 1:34
    five years, something like that.
  • 1:36 - 1:38
    There you go.
  • 1:39 - 1:45
    So we basically reuse Wikidata's items
    that are in a local [inaudible]
  • 1:45 - 1:47
    for which we regulate the updates,
  • 1:47 - 1:52
    and we also add extra items
    to our local database
  • 1:52 - 1:57
    in order to cover books
    that are not already on Wikidata.
  • 1:59 - 2:02
    And we are using a very
    similar data model
  • 2:02 - 2:07
    in order to stay compatible with Wikidata's data.
  • 2:09 - 2:14
    This is a brand new entities map
  • 2:18 - 2:25
    which basically describes
    how we model the data in Inventaire
  • 2:26 - 2:28
    and we go back after
  • 2:28 - 2:35
    on the typing system
    that we are strictly enforcing
  • 2:35 - 2:38
    compared to what Wikidata is doing.
  • 2:44 - 2:50
    Because, yeah, we do like ontologies,
    we do like to talk about semantic,
  • 2:50 - 2:54
    but we are kind of dealing
    with reality here
  • 2:55 - 3:02
    and we need to strictly type
    what we are dealing with
  • 3:02 - 3:07
    especially with ontologies
    and inheritance, which have some troubles...
  • 3:08 - 3:11
    We have some trouble complying with them.
  • 3:13 - 3:16
    If you want to add something on that.
  • 3:16 - 3:19
    We'll get into the details after.
  • 3:19 - 3:24
    We have a preview of what that meant--
  • 3:24 - 3:27
    Which is really the part
    where people start to scream
  • 3:27 - 3:30
    because the way we do typing...
  • 3:30 - 3:33
    We have different types of entities
    we deal with in Inventaire
  • 3:33 - 3:37
    which are mainly works,
    editions, series, and humans.
  • 3:37 - 3:40
    And the thing is that,
    to answer the question
  • 3:40 - 3:46
    is this item in Wikidata a series,
    a work, an edition?
  • 3:46 - 3:52
    You have different ways to do that
    but none of them are both perfect
  • 3:52 - 3:57
    and fast, efficient and so the way
    we do that at the moment
  • 3:57 - 4:03
    is that we have those lists of aliases
    of what our owners...
  • 4:04 - 4:08
    That's the list of properties,
    all those properties...
  • 4:08 - 4:10
    Well, I'm sorry that's the wrong type
  • 4:21 - 4:27
    That's the aliases,
    the things we consider being humans
  • 4:27 - 4:33
    are P31Q5 obviously
    but also things that are duo, sibling duo,
  • 4:33 - 4:36
    writer, house name, pseudonym
  • 4:36 - 4:41
    we consider that when we encounter
    those things in an entity that's a human.
  • 4:41 - 4:43
    And humans are actually the simplest ones
  • 4:43 - 4:47
    because P31 Q5 is one
    of the most consistent ways
  • 4:47 - 4:50
    to type an entity in Wikidata
    and that's great
  • 4:50 - 4:53
    and we want more of that.
    (laughs)
  • 4:53 - 4:58
    But look at works, this is all
    the things we consider to be works,
  • 4:58 - 5:03
    so book, Q5 and bla, bla, bla
    and all those comic book,
  • 5:03 - 5:06
    comic strip, novella, graphic novel.
  • 5:06 - 5:10
    So every time you see those P31s,
    we consider them to be works.
  • 5:10 - 5:17
    And yeah, we could go more on those
    but that's our hack to make typing work.
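The alias-list "hack" described above can be pictured roughly like this. This is a minimal sketch, not Inventaire's actual code, and the QID sets are abbreviated illustrations of the much longer alias lists shown on the slide:

```python
# Sketch of alias-based typing: map each local type to a whitelist of
# Wikidata classes accepted as its P31 (instance of) values.
# These QID sets are illustrative, not Inventaire's full lists.
TYPE_ALIASES = {
    "human": {"Q5"},                         # human
    "work": {"Q571", "Q8261", "Q47461344"},  # book, novel, written work
    "serie": {"Q277759"},                    # book series
    "edition": {"Q3331189"},                 # version, edition or translation
}

def entity_type(p31_values):
    """Return the local type for an entity given its P31 values,
    or None when no alias matches (the failure case discussed later)."""
    for local_type, aliases in TYPE_ALIASES.items():
        if any(qid in aliases for qid in p31_values):
            return local_type
    return None
```

The point of the whitelist is that a single set lookup replaces any server-side class-hierarchy traversal, at the cost of missing unlisted classes.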
  • 5:22 - 5:26
    So now the problem in a way...
  • 5:27 - 5:31
    Wikidata is awesome, we all agree on that.
  • 5:32 - 5:35
    The only thing is that
    it might change over time
  • 5:35 - 5:42
    and so how do we keep track of those
    changes in our own local data model?
  • 5:42 - 5:47
    So those are the things that could
    be redefined on the fly, overnight.
  • 5:48 - 5:52
    There are discussions, I know,
    yet sometimes,
  • 5:52 - 5:56
    it's possible that we do have trouble
  • 5:56 - 5:59
    to redefine...
  • 6:00 - 6:02
    with properties that could be redefined.
  • 6:02 - 6:07
    For example that happened
    a couple of--maybe six months ago
  • 6:07 - 6:08
    or something like that.
  • 6:08 - 6:12
    I think it's more than a year ago
    but we have been considering it.
  • 6:13 - 6:15
    Between the time we detected this problem
  • 6:15 - 6:20
    and started to think about taking
    measures about that six months ago.
  • 6:22 - 6:25
    So all of a sudden,
    this property became...
  • 6:26 - 6:30
    Not all of a sudden
    but we found it all of a sudden.
  • 6:32 - 6:36
    ...became from languages,
  • 6:37 - 6:41
    what's exactly that [inaudible]
  • 6:42 - 6:46
    So this property was language
    of edition...
  • 6:46 - 6:50
    - What was it?
    - That was the property we were using
  • 6:50 - 6:54
    for edition's language and at some point
  • 6:54 - 6:57
    it was decided that there were too many
    properties to talk about language,
  • 6:57 - 7:02
    so this one was transformed
    into original language of film or TV show.
  • 7:02 - 7:08
    So we started having our data
    be about TV shows all of a sudden
  • 7:08 - 7:10
    and that was wrong.
  • 7:12 - 7:19
    So the conclusion of this example
    is don't recycle properties, please.
  • 7:19 - 7:21
    (laughter)
  • 7:24 - 7:30
    Other examples of shifting data--
    maybe we don't have that much time
  • 7:30 - 7:31
    to go through all of it.
  • 7:33 - 7:37
    There's some examples like when work
  • 7:37 - 7:41
    became editions on entities,
  • 7:41 - 7:44
    on items--that's also quite tricky
  • 7:44 - 7:48
    because then,
    how do we categorize it again?
  • 7:56 - 7:59
    So we are strictly typing
  • 8:00 - 8:04
    and that helps us...
  • 8:05 - 8:09
    There are advantages to that,
    lots of advantages.
  • 8:10 - 8:12
    It is a simplified world that we live in
  • 8:12 - 8:18
    It's not as complex
    as the reality Wikidata shows.
  • 8:18 - 8:23
    So every edition has
    at least one associated work
  • 8:23 - 8:26
    that is something that we can rely on.
  • 8:27 - 8:32
    So if a work becomes an edition,
    then that sometimes causes a problem.
  • 8:35 - 8:40
    Edition data cannot be added
    on a work then.
  • 8:41 - 8:44
    We are strict about that
    even if Wikidata is not.
  • 8:44 - 8:49
    So we are enforcing a policy that
    we would like to have on Wikidata
  • 8:49 - 8:54
    but we are only a small part
    of a bigger system.
  • 8:58 - 8:59
    We have done autocomplete.
  • 9:00 - 9:04
    Well, that's something
    we have demonstrated other times
  • 9:04 - 9:07
    but the idea for example when
    we have genre properties,
  • 9:07 - 9:12
    we will just suggest--like the user
    will start to type a genre
  • 9:12 - 9:18
    and only genres will be suggested,
    so it makes... it's like a simplified...
  • 9:18 - 9:20
    because we are strictly typing
  • 9:20 - 9:26
    we can have less weird input for the user
  • 9:26 - 9:29
    which is very interesting for us
  • 9:29 - 9:31
    because our users are not aware
    of Wikidata
  • 9:31 - 9:33
    and all the things that are in Wikidata,
  • 9:33 - 9:38
    so we sort of simplify everything
    as much as we can
  • 9:38 - 9:40
    but that's at the cost of flexibility.
  • 9:40 - 9:45
    The flexibility of Wikidata
    is lost in this process.
  • 9:48 - 9:54
    So for instance...
    Oh that's soon.
  • 9:54 - 9:57
    (laughs)
    So we have to go faster maybe.
  • 9:58 - 10:03
    One of the cases that was,
    how do we...
  • 10:04 - 10:06
    Wait.
  • 10:07 - 10:11
    The simplified typing system
    is at the cost of how do we get--
  • 10:11 - 10:14
    we have the list of aliases of types
    we saw earlier,
  • 10:14 - 10:16
    but sometimes we don't have all the types
  • 10:16 - 10:22
    so for example when we encounter
    a P31 science fiction trilogy,
  • 10:22 - 10:27
    if we didn't have it in our alias list
    that's breaking the system
  • 10:27 - 10:30
    and so we are back at this problem.
  • 10:30 - 10:35
    There are different ways to work on that
    and so that's trying to make...
  • 10:35 - 10:39
    In suggestion, we talked about
    that problem in our earlier presentation
  • 10:39 - 10:41
    and we were told,
    "Yeah, you should do that with SPARQL."
  • 10:41 - 10:44
    Yes, that could mean asking for
  • 10:44 - 10:50
    is this entity an instance of,
    or somehow a subclass of, written work,
  • 10:50 - 10:51
    these kind of things.
  • 10:51 - 10:55
    And that's a very expensive query
    and we can't do that for everything.
  • 10:55 - 10:58
    So that's why we have these aliases.
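The expensive SPARQL alternative mentioned above would look something like the query below. This is only a sketch of the pattern being described (instance of / transitive subclass of written work, Q47461344); the `P279*` path traversal on every lookup is what makes it too costly to run per entity:

```python
# Build the ASK query the speakers describe: is this item an instance of
# some (transitive) subclass of written work (Q47461344)?
def ask_is_written_work(qid):
    return f"""
ASK {{
  wd:{qid} wdt:P31/wdt:P279* wd:Q47461344 .
}}"""
```

Compared with the local alias whitelist, this is complete but slow; the whitelist is fast but incomplete.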
  • 11:00 - 11:02
    And also we have...
  • 11:02 - 11:07
    This lost flexibility,
    for the sake of simplicity
  • 11:07 - 11:12
    we lose flexibility and that's how
    we have this for examples maybe.
  • 11:12 - 11:18
    Yeah it's quite obvious
    we cannot do much about it.
  • 11:19 - 11:22
    There are more than humans
    that author books.
  • 11:22 - 11:24
    For example, there are collectives
  • 11:24 - 11:28
    and this is not yet taken
    into account in Inventaire.
  • 11:30 - 11:32
    And editions can be a whole series,
  • 11:32 - 11:38
    lots of different possibilities
    that do not fit into this reality.
  • 11:39 - 11:41
    That is the world.
  • 11:42 - 11:45
    Maybe going fast.
  • 11:45 - 11:49
    On the list of issues,
    we have also these querying issues,
  • 11:49 - 11:53
    different strategies to try
    to be both efficient
  • 11:54 - 11:59
    and complete in the way we find all
    the works of a given author.
  • 11:59 - 12:04
    And we can't go over all those subclasses
    because that's too expensive
  • 12:04 - 12:09
    but at the same time, we can't just ask
    for all the items that have a P50,
  • 12:10 - 12:13
    an author, because editions
    also have this P50.
  • 12:13 - 12:17
    And so, yeah,
    that's the kind of problems we have.
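One way to picture the trade-off just described: fetch every item whose P50 (author) points at the person along with its P31, then filter out known edition classes on the client side instead of walking `P279*` on the server. This is an illustration of the compromise, not Inventaire's actual query, and the edition class set is abbreviated:

```python
# Abbreviated, illustrative set of classes treated as editions.
EDITION_CLASSES = {"Q3331189"}  # version, edition or translation

def works_query(author_qid):
    # Fetch every item authored (P50) by the person, with its P31 type,
    # so editions can be filtered out afterwards.
    return f"""
SELECT ?item ?type WHERE {{
  ?item wdt:P50 wd:{author_qid} ;
        wdt:P31 ?type .
}}"""

def filter_editions(rows):
    """rows: [(item_qid, type_qid), ...] as returned by the query above."""
    return [item for item, type_ in rows if type_ not in EDITION_CLASSES]
```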
  • 12:18 - 12:22
    On the ideas we are playing with
    we have what was...
  • 12:23 - 12:28
    yeah, the concept of extending entities
    locally could be a solution
  • 12:28 - 12:30
    to some of the problems we presented.
  • 12:31 - 12:35
    It was mentioned earlier as shadow items
    but maybe it's not such a good name,
  • 12:35 - 12:38
    so we will call it
    "extend entities locally"
  • 12:38 - 12:44
    which means locally adding statements
    on an item that is on Wikidata
  • 12:45 - 12:48
    but without overwriting
    because that would be crazy.
  • 12:49 - 12:53
    That would solve some problems though,
    because for copyrighted works,
  • 12:53 - 12:59
    we could actually work with that data,
    which is not possible through Wikidata.
  • 12:59 - 13:05
    We also do not have to agree
    with Wikidata's community
  • 13:05 - 13:07
    in order to enforce our schemas.
  • 13:10 - 13:17
    We can also add links on
    non-Wikidata items from a Wikidata item.
  • 13:20 - 13:24
    But the [inaudible] is quite huge.
  • 13:25 - 13:29
    We have to follow Wikidata's algorithm
    in order to make it compatible.
  • 13:30 - 13:36
    And it's problematic
    for pushing data to Wikidata
  • 13:36 - 13:38
    if we lack some information.
  • 13:39 - 13:45
    We also reference,
    we would like to push it in order to...
  • 13:48 - 13:51
    go through it, sorry.
  • 13:52 - 13:54
    Actually it's quite the end.
  • 13:54 - 13:57
    - This one?
    - No, no, please, please.
  • 13:57 - 14:02
    (laughs) So to keep updated
    we have to sometimes
  • 14:02 - 14:06
    make mass updates of Inventaire data,
  • 14:06 - 14:09
    and that's also the occasion
    of great scripting,
  • 14:09 - 14:14
    and that's not always elegant
    but at least it's happening.
  • 14:14 - 14:17
    So we need to make this transition great
  • 14:17 - 14:20
    and not repeat
    this language of TV show mistake, for instance.
  • 14:20 - 14:25
    Yes, and maybe for a final note
  • 14:26 - 14:28
    on an argument for a small Wikidata
  • 14:28 - 14:33
    because we have problems
    with the query service update time
  • 14:33 - 14:36
    which has been mentioned
    a few times, and this is...
  • 14:37 - 14:43
    It seems to be due to the big ambition
    of Wikidata to cover all sort of items
  • 14:43 - 14:49
    including scientific articles
    and so many scientific articles.
  • 14:50 - 14:51
    Maybe they are not the only ones to blame
  • 14:51 - 14:57
    but we end up having this large delay
  • 14:57 - 15:01
    between an update and the propagation
    on the Wikidata query service,
  • 15:01 - 15:03
    and that's a problem for us
  • 15:03 - 15:07
    because for example you will have
    a user modify--adding,
  • 15:07 - 15:09
    connecting a work
    with an author on Wikidata
  • 15:09 - 15:15
    and then going to the author page
    and expecting that their contribution
  • 15:15 - 15:19
    be visible on the author page
    and they won't see it happening
  • 15:19 - 15:26
    and so, we need to find ways to tell them
    that it's going to be propagated
  • 15:26 - 15:28
    but we don't know when.
  • 15:28 - 15:30
    And then we have the problems of
  • 15:30 - 15:34
    we cache the request
    to get the author data,
  • 15:34 - 15:39
    and we don't know when we shall update
    this cached version of the query.
  • 15:39 - 15:44
    So that's the kind of problems we have
    with the query delays, the query service.
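The caching dilemma can be sketched as a small TTL cache. The class and names here are hypothetical, and the TTL is necessarily a guess, precisely because the query service's propagation delay is unknown:

```python
import time

class AuthorCache:
    """Hypothetical sketch: cache query-service results for an author,
    with a TTL guess and manual invalidation on local edits."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # qid -> (timestamp, data)

    def get(self, qid, fetch):
        # Return cached data while it is fresh; otherwise refetch.
        entry = self.store.get(qid)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        data = fetch(qid)
        self.store[qid] = (time.time(), data)
        return data

    def invalidate(self, qid):
        # Called when a user edits the entity locally, even though the
        # query service may still serve stale results for a while.
        self.store.pop(qid, None)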
  • 15:44 - 15:46
    And so having a smaller Wikidata
  • 15:46 - 15:51
    could maybe help us to not have
    to deal with this problem
  • 15:51 - 15:55
    because then we could just consider
    that the update will be quick
  • 15:55 - 15:59
    and that we can just maybe, in a few,
  • 15:59 - 16:01
    less than ten minutes update our version
  • 16:01 - 16:05
    and at least be close
    to what people contributed.
  • 16:06 - 16:07
    That's it.
  • 16:07 - 16:09
    Thanks for listening.
  • 16:09 - 16:14
    If you have any questions or comments
    we'll be happy to talk now
  • 16:14 - 16:16
    or after also during the event,
  • 16:16 - 16:20
    and we'll be here also on Sunday
    to talk about Wikibase.
  • 16:24 - 16:26
    And if you have questions, yeah?
  • 16:26 - 16:28
    (host) Why don't you give these guys
    a round of applause.
  • 16:28 - 16:31
    (applause)
  • 16:34 - 16:37
    Meanwhile we can look at the map.
  • 16:37 - 16:38
    (host) We have quite
    a generous question time
  • 16:38 - 16:41
    because these guys have finished
    with plenty of time to go
  • 16:41 - 16:43
    so lots of questions.
  • 16:43 - 16:48
    Yeah, the idea was to put on the table
    the pain we encounter
  • 16:48 - 16:52
    in daily life working out of Wikidata
  • 16:52 - 16:56
    and to have your ideas and comments
  • 16:56 - 16:58
    and how many of those pain points
    you share,
  • 16:58 - 17:02
    and what solution also you might
    have found to tackle them.
  • 17:02 - 17:05
    More general questions are also possible.
  • 17:05 - 17:07
    (host) I'm going to go
    to the chap in the front
  • 17:07 - 17:10
    and then we'll go backwards as we go.
  • 17:10 - 17:14
    (man) I guess first off, it was very
    therapeutic to hear that all this pain
  • 17:14 - 17:18
    I've encountered personally,
    it's like "Oh yes, it's not just me."
  • 17:18 - 17:20
    (laughter)
  • 17:20 - 17:24
    But one thing I've encountered
    with the schema issues is that,
  • 17:24 - 17:27
    yes, my go to approach
    is always just like
  • 17:27 - 17:30
    Oh, let's just find all subclasses
    of a specific instance
  • 17:30 - 17:32
    to solve this.
  • 17:32 - 17:34
    I've encountered
    a lot of the issues you have
  • 17:34 - 17:37
    though it seems like going
    in the reverse
  • 17:37 - 17:40
    has helped solve that issue
  • 17:40 - 17:43
    at least for my use cases,
    for instance I noticed that
  • 17:43 - 17:49
    all humans were instances
    of a manufactured component
  • 17:49 - 17:55
    and I just said, "Okay, let me find
    all classes that instance of a human is,
  • 17:55 - 18:02
    and this helped me go through
    and like find these errors in schema
  • 18:02 - 18:04
    of subclass relationships,
  • 18:05 - 18:09
    and I was wondering if you
    had gone through any of these processes?
  • 18:09 - 18:14
    Were there any other approaches,
    more to this, to find errors?
  • 18:14 - 18:19
    Yeah, we went through
    some trial and error there
  • 18:19 - 18:22
    and we encountered things like...
  • 18:23 - 18:27
    We have this very important distinction
    between editions and works,
  • 18:27 - 18:30
    but at some point,
    editions were a subclass of something
  • 18:30 - 18:34
    that was a subclass of works
    and so the separation was falling apart
  • 18:34 - 18:38
    and so that's the example of one
    of the things that were modified
  • 18:38 - 18:42
    because someone was thinking the world
    was different than what it was.
  • 18:43 - 18:49
    And so that's how we arrived
    at this more blacklist, whitelist system,
  • 18:49 - 18:53
    like a list of good types.
  • 18:53 - 18:59
    Yes, we are also coming from the editions
    and from the ISBNs of people's books
  • 18:59 - 19:05
    then, we have to go upward in the classes
    in order to find out that somehow
  • 19:05 - 19:10
    this edition inherits from the work
    and then how do you do that?
  • 19:10 - 19:13
    Like that's very problematic.
  • 19:16 - 19:20
    (man) I guess I have some SPARQL queries
    that might be useful for that.
  • 19:20 - 19:23
    It just generates a nice graph
    of the instance of subclass.
  • 19:23 - 19:24
    I didn't write it.
  • 19:24 - 19:26
    Someone wrote it for me
    when I described this problem
  • 19:26 - 19:29
    so I can't take credit
    but it might be useful for that.
  • 19:29 - 19:32
    But what about cyclic problems,
  • 19:32 - 19:37
    like how do you deal with it
    when there are inconsistencies
  • 19:37 - 19:40
    or things that are like editions,
    instance of work and... ?
  • 19:40 - 19:45
    (man) I think just visualizing it
    and it's very easy...
  • 19:45 - 19:49
    No, it's not very easy, it's possible
    to then find these inconsistencies.
  • 19:49 - 19:52
    But I also think there
    are loops in Wikidata,
  • 19:52 - 19:55
    for instance a concept
    is itself an instance of concept.
  • 19:56 - 20:01
    It's not a useful subclass or instance of
    but it's a valid one.
  • 20:01 - 20:05
    But do you generate this map
    and see if there are errors
  • 20:05 - 20:10
    - but use the results other than--
    - (man) I guess this was a...
  • 20:10 - 20:14
    Oh, I noticed that Douglas Adams
    is an instance of this
  • 20:14 - 20:17
    and there are these errors
    and then like just pitching it
  • 20:17 - 20:21
    to the communities, like fix this problem.
  • 20:21 - 20:24
    But you don't use
    this subclasses query live?
  • 20:25 - 20:27
    (man) No, not live, no.
  • 20:27 - 20:28
    It was more of a debugging
  • 20:28 - 20:33
    and then realizing
    it was small enough to fix.
  • 20:44 - 20:48
    (man 2) I suppose I'll just comment about
    the issues around books,
  • 20:48 - 20:53
    so I would say that we should
    never use a book as an instance,
  • 20:53 - 21:00
    and we should try to move books'
    instances either to works or editions,
  • 21:00 - 21:03
    and perhaps you can agree to that.
  • 21:03 - 21:10
    And furthermore,
    when I come to this, a book instance,
  • 21:10 - 21:15
    I would say that perhaps sometimes
    rather than converting to a work,
  • 21:15 - 21:20
    a literary work, I would convert it
    to an edition instead.
  • 21:20 - 21:23
    For example if it has ISBN numbers,
  • 21:23 - 21:26
    I would say that it's more
    like an edition
  • 21:27 - 21:29
    and perhaps meant like an edition.
  • 21:29 - 21:35
    Also, for example if it's cited,
    I suppose typically in citations
  • 21:35 - 21:39
    you are citing an edition
    rather than a work.
  • 21:40 - 21:47
    But I imagine according to you that
    this way could generate problems for you,
  • 21:47 - 21:52
    so you'd rather sort of convert
    the book to a work
  • 21:52 - 21:57
    and then create an instance of...
  • 21:59 - 22:02
    ...an instance of an edition
    instead of that
  • 22:02 - 22:05
    and move, for example, the ISBN numbers
  • 22:05 - 22:09
    and perhaps other identifiers
    to that item.
  • 22:10 - 22:11
    We have seen the different criteria
  • 22:11 - 22:16
    depending on who was wanting
    to make the separation.
  • 22:16 - 22:18
    So people who are interested
    in the citation
  • 22:18 - 22:20
    or coming from Wikisource
  • 22:20 - 22:23
    want to convert pretty much
    everything to editions
  • 22:23 - 22:27
    and people who are more interested
    in the works as abstract categories
  • 22:27 - 22:31
    for the editions try
    to convert everything to works.
  • 22:31 - 22:34
    And because in the case you described,
  • 22:35 - 22:40
    rather than considering that something
    with an ISBN is rather an edition,
  • 22:40 - 22:42
    I will delete the ISBN
    and consider it a work,
  • 22:42 - 22:47
    and in the case what we see often is that
  • 22:47 - 22:50
    there are Wikipedia articles
    on those items
  • 22:50 - 22:53
    and those Wikipedia articles
    don't talk about a specific edition
  • 22:53 - 22:56
    but about the concept of the work more.
  • 22:56 - 23:02
    And so these are the kind of problems
    that are discussed in WikiProject books
  • 23:02 - 23:06
    and we are not seeing the end of it
    and that's why for the moment...
  • 23:06 - 23:08
    (man 2) I want to say that if it's...
  • 23:08 - 23:12
    If there's a Wikipedia article
    about the book,
  • 23:12 - 23:17
    then it should be a work, the item.
  • 23:17 - 23:21
    Lots of them have ISBNs
    also on the Wikipedia page.
  • 23:23 - 23:26
    (man 2) I suppose that should
    then be removed
  • 23:26 - 23:30
    perhaps over to the edition item.
  • 23:30 - 23:33
    It would be nice to have
    a consensus on that.
  • 23:33 - 23:38
    It's an ongoing discussion
    on WikiProject books, I guess.
  • 23:42 - 23:45
    (host) We have time for maybe
    just one more quick question.
  • 23:50 - 23:52
    Excellent!
  • 23:52 - 23:54
    Well, if you'd like to show
    your appreciation again for these guys,
  • 23:54 - 23:55
    that would be great.
  • 23:55 - 23:59
    (applause)
Video Language:
English
Duration:
24:06
