
cdn.media.ccc.de/.../wikidatacon2019-9-eng-Data_quality_panel_hd.mp4

  • 0:06 - 0:09
    Hello everyone, welcome to the Data Quality panel.
  • 0:10 - 0:14
    Data quality matters because
    more and more people out there
  • 0:14 - 0:19
    rely on our data being in good shape,
    so we're going to talk about data quality,
  • 0:20 - 0:26
    and there will be four speakers
    who will give short introductions
  • 0:26 - 0:30
    on topics related to data quality
    and then we will have a Q and A.
  • 0:30 - 0:32
    And the first one is Lucas.
  • 0:34 - 0:35
    Thank you.
  • 0:36 - 0:40
    Hi, I'm Lucas, and I'm going
    to start with an overview
  • 0:40 - 0:44
    of data quality tools
    that we already have on Wikidata
  • 0:44 - 0:46
    and also some things
    that are coming up soon.
  • 0:47 - 0:51
    And I've grouped them
    into some general themes
  • 0:51 - 0:54
    of making errors more visible,
    making problems actionable,
  • 0:54 - 0:56
    getting more eyes on the data
    so that people notice the problems,
  • 0:57 - 1:03
    fixing some common sources of errors,
    maintaining the quality of the existing data
  • 1:03 - 1:04
    and also human curation.
  • 1:05 - 1:10
    And the ones that are currently available
    start with property constraints.
  • 1:10 - 1:12
    So you've probably seen this
    if you're on Wikidata.
  • 1:12 - 1:14
    You can sometimes get these icons
  • 1:15 - 1:17
    which check
    the internal consistency of the data.
  • 1:17 - 1:21
    For example,
    if one event follows the other,
  • 1:21 - 1:24
    then the other event should
    also be followed by this one,
  • 1:24 - 1:27
    which on the WikidataCon item
    was apparently missing.
  • 1:27 - 1:29
    I'm not sure,
    this feature is a few days old.
  • 1:30 - 1:35
    And there's also,
    if this is too limited or simple for you,
  • 1:35 - 1:38
    you can write any checks you want
    using the Query Service
  • 1:38 - 1:40
    which is useful for
    lots of things of course,
  • 1:40 - 1:45
    but you can also use it
    for finding errors.
  • 1:45 - 1:47
    Like if you've noticed
    one occurrence of a mistake,
  • 1:47 - 1:50
    then you can check
    if there are other places
  • 1:50 - 1:52
    where people have made
    a very similar error
  • 1:52 - 1:53
    and find that with the Query Service.
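
A minimal sketch of the kind of check described here: a SPARQL query, run from Python against the public query service, that looks for one concrete error pattern (people whose recorded date of death precedes their date of birth). The endpoint and the properties P31, P569 and P570 are standard; the particular pattern is only an illustration.

```python
# Illustrative sketch: use the Wikidata Query Service to find other
# occurrences of an error pattern you spotted once, here: humans whose
# recorded date of death is earlier than their date of birth.
import requests

QUERY = """
SELECT ?person ?personLabel ?birth ?death WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P569 ?birth ;      # date of birth
          wdt:P570 ?death .      # date of death
  FILTER(?death < ?birth)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "quality-check-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["person"]["value"], row.get("personLabel", {}).get("value", ""))
```
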
  • 1:53 - 1:55
    You can also combine the two
  • 1:55 - 1:58
    and search for constraint violations
    in the Query Service,
  • 1:58 - 2:01
    for example,
    only the violations in some area
  • 2:01 - 2:04
    or WikiProject that's relevant to you,
  • 2:04 - 2:07
    although the results are currently
    not complete, sadly.
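
A sketch of combining the two. The predicate used for constraint check results in the query service (wikibase:hasViolationForConstraint) and its coverage are assumptions to verify against the current WDQS documentation; as noted above, the loaded results are not complete.

```python
# Sketch only: querying property-constraint check results via WDQS,
# narrowed to statements of one property of interest as a stand-in for
# "an area or WikiProject relevant to you". The predicate name is an
# assumption to check against the WDQS documentation.
CONSTRAINT_QUERY = """
SELECT ?item ?statement ?constraint WHERE {
  ?item p:P569 ?statement .                                 # statements of one property
  ?statement wikibase:hasViolationForConstraint ?constraint .
}
LIMIT 50
"""
# Run with the same requests call as in the previous example.
```
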
  • 2:08 - 2:10
    There is revision scoring.
  • 2:11 - 2:13
    That's... I think this is
    from the recent changes
  • 2:13 - 2:16
    you can also get it on your watchlist:
    an automatic assessment
  • 2:16 - 2:20
    of whether this edit is likely to be
    in good faith or in bad faith
  • 2:20 - 2:22
    and whether it is likely to be
    damaging or not damaging,
  • 2:22 - 2:24
    I think those are the two dimensions.
  • 2:24 - 2:26
    So you can, if you want,
  • 2:26 - 2:30
    focus on just looking through
    the damaging but good faith edits.
  • 2:30 - 2:33
    If you're feeling particularly
    friendly and welcoming
  • 2:33 - 2:37
    you can tell these editors,
    "Thank you for your contribution,
  • 2:37 - 2:41
    here's how you should have done it
    but thank you, still."
  • 2:41 - 2:42
    And if you're not feeling that way,
  • 2:42 - 2:44
    you can go through
    the bad faith, damaging edits,
  • 2:44 - 2:46
    and revert the vandals.
  • 2:48 - 2:50
    There's also, similar to that,
    entity scoring.
  • 2:50 - 2:53
    So instead of scoring an edit,
    the change that it made,
  • 2:53 - 2:54
    you score the whole revision,
  • 2:54 - 2:56
    and I think that is
    the same quality measure
  • 2:56 - 3:00
    that Lydia mentioned
    at the beginning of the conference.
  • 3:00 - 3:05
    There's a user script that shows it up here
    and gives you a score of, like, one to five,
  • 3:05 - 3:08
    I think it was, of what the quality
    of the current item is.
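
Both the edit scores and the item quality score come from the ORES service; below is a minimal sketch of fetching them over its public API. The model names (damaging, goodfaith, itemquality) and the revision ID are assumptions for illustration and should be checked against the ORES documentation.

```python
# Illustrative sketch: fetch edit scores (damaging / goodfaith) and an
# item-quality score from the ORES API for one Wikidata revision.
# The revision ID below is made up.
import requests

rev_id = 123456789  # hypothetical revision ID
resp = requests.get(
    f"https://ores.wikimedia.org/v3/scores/wikidatawiki/{rev_id}",
    params={"models": "damaging|goodfaith|itemquality"},
    headers={"User-Agent": "quality-panel-example/0.1"},
)
scores = resp.json()["wikidatawiki"]["scores"][str(rev_id)]
for model, result in scores.items():
    print(model, result.get("score", {}).get("prediction"))
```
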
  • 3:10 - 3:16
    The primary sources tool is for
    any database that you want to import,
  • 3:16 - 3:18
    but that's not high enough quality
    to directly add to Wikidata,
  • 3:18 - 3:20
    so you add it
    to the primary sources tool instead,
  • 3:20 - 3:23
    and then humans can decide
  • 3:23 - 3:26
    should they add
    these individual statements or not.
  • 3:29 - 3:32
    Showing coordinates as maps
    is mainly a convenience feature
  • 3:32 - 3:34
    but it's also useful for quality control.
  • 3:34 - 3:37
    Like if you see this is supposed to be
    the office of Wikimedia Germany
  • 3:37 - 3:39
    and if the coordinates
    are somewhere in the Indian Ocean,
  • 3:39 - 3:42
    then you know that
    something is not right there
  • 3:42 - 3:45
    and you can see it much more easily
    than if you just had the numbers.
  • 3:46 - 3:50
    This is a gadget called
    the relative completeness indicator
  • 3:50 - 3:52
    which shows you this little icon here
  • 3:53 - 3:56
    telling you how complete
    it thinks this item is
  • 3:56 - 3:58
    and also which properties
    are most likely missing,
  • 3:58 - 4:00
    which is really useful
    if you're editing an item
  • 4:00 - 4:03
    and you're in an area
    that you're not very familiar with
  • 4:03 - 4:06
    and you don't know what
    the right properties to use are,
  • 4:06 - 4:08
    then this is a very useful gadget to have.
  • 4:10 - 4:11
    And we have Shape Expressions.
  • 4:11 - 4:16
    I think Andra or Jose
    are going to talk more about those
  • 4:16 - 4:20
    but basically, a very powerful way
    of comparing the data you have
  • 4:20 - 4:21
    against the schema,
  • 4:21 - 4:23
    like what statement should
    certain entities have,
  • 4:23 - 4:26
    what other entities should they link to
    and what should those look like,
  • 4:26 - 4:29
    and then you can find problems that way.
  • 4:30 - 4:32
    I think... No there is still more.
  • 4:32 - 4:34
    Integraality or property dashboard.
  • 4:34 - 4:37
    It gives you a quick overview
    of the data you already have.
  • 4:37 - 4:39
    For example, this is from
    the WikiProject Red Pandas,
  • 4:40 - 4:42
    and you can see that
    we have a sex or gender
  • 4:42 - 4:44
    for almost all of the red pandas,
  • 4:44 - 4:47
    the date of birth varies a lot
    by which zoo they come from
  • 4:47 - 4:50
    and we have almost
    no dead pandas which is wonderful,
  • 4:51 - 4:53
    because they're so cute.
  • 4:54 - 4:56
    So this is also useful.
  • 4:56 - 4:59
    There we go, OK,
    now for the things that are coming up.
  • 5:00 - 5:04
    Wikidata Bridge, also formerly
    known as client editing,
  • 5:04 - 5:07
    so editing Wikidata
    from Wikipedia infoboxes
  • 5:08 - 5:12
    which will on the one hand
    get more eyes on the data
  • 5:12 - 5:13
    because more people can see the data there
  • 5:13 - 5:19
    and it will hopefully encourage
    more use of Wikidata in the Wikipedias
  • 5:19 - 5:21
    and that means that more
    people can notice
  • 5:21 - 5:23
    if, for example some data is outdated
    and needs to be updated
  • 5:24 - 5:27
    than if they would
    only see it on Wikidata itself.
  • 5:29 - 5:31
    There is also tainted references.
  • 5:31 - 5:34
    The idea here is that
    if you edit a statement value,
  • 5:35 - 5:37
    you might want to update
    the references as well,
  • 5:37 - 5:39
    unless it was just a typo or something.
  • 5:40 - 5:44
    And this tainted references feature
    tells editors that,
  • 5:44 - 5:50
    and it also lets other editors
    see which other edits were made
  • 5:50 - 5:52
    that changed a statement value
    and didn't update a reference,
  • 5:52 - 5:57
    so you can clean up after that
    and decide should that be...
  • 5:58 - 6:00
    Do you need to do anything more with that
  • 6:00 - 6:03
    or is that actually fine and
    you don't need to update the reference.
  • 6:04 - 6:09
    That's related to signed statements
    which is coming from a concern, I think,
  • 6:09 - 6:12
    that some data providers have that like...
  • 6:14 - 6:17
    There's a statement that's referenced
    through the UNESCO or something
  • 6:17 - 6:20
    and then suddenly,
    someone vandalizes the statement
  • 6:20 - 6:22
    and they are worried
    that it will look like
  • 6:23 - 6:27
    this organization, like UNESCO,
    still set this vandalism value
  • 6:27 - 6:29
    and so, with signed statements,
  • 6:29 - 6:31
    they can cryptographically
    sign this reference
  • 6:31 - 6:34
    and that doesn't prevent any edits to it,
  • 6:34 - 6:38
    but at least, if someone
    vandalizes the statement
  • 6:38 - 6:40
    or edits it in any way,
    then the signature is no longer valid,
  • 6:40 - 6:43
    and you can tell this is not exactly
    what the organization said,
  • 6:43 - 6:47
    and perhaps it's a good edit
    and they should re-sign the new statement,
  • 6:47 - 6:50
    but also perhaps it should be reverted.
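
Signed statements were still being designed at this point, so the following is only a conceptual sketch of the mechanism described: a data provider signs a canonical serialization of a statement with a private key, and any later edit makes the signature verification fail. It uses Ed25519 from the Python cryptography package; none of the names or formats here come from the actual Wikibase work.

```python
# Conceptual sketch only (not the Wikibase design): sign a canonical
# serialization of a statement; any edit invalidates the signature.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

statement = {"item": "Q42", "property": "P569", "value": "+1952-03-11"}
canonical = json.dumps(statement, sort_keys=True).encode()

private_key = Ed25519PrivateKey.generate()   # held by the data provider
public_key = private_key.public_key()        # published for verification
signature = private_key.sign(canonical)

# Later, anyone can check whether the statement still matches what was signed.
statement["value"] = "+1999-01-01"           # a (vandal) edit
edited = json.dumps(statement, sort_keys=True).encode()
try:
    public_key.verify(signature, edited)
    print("signature still valid")
except InvalidSignature:
    print("statement changed since it was signed")
```
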
  • 6:51 - 6:54
    And also, this is going
    to be very exciting, I think,
  • 6:54 - 6:57
    Citoid is this amazing system
    they have on Wikipedia
  • 6:57 - 7:01
    where you can paste a URL,
    or an identifier, or an ISBN
  • 7:01 - 7:05
    or Wikidata ID or basically
    anything into the Visual Editor,
  • 7:05 - 7:08
    and it spits out a reference
    that is nicely formatted
  • 7:08 - 7:11
    and has all the data you want
    and it's wonderful to use.
  • 7:11 - 7:14
    And by comparison, on Wikidata,
    if I want to add a reference
  • 7:14 - 7:19
    I typically have to add a reference URL,
    title, author name string,
  • 7:19 - 7:20
    published in, publication date,
  • 7:20 - 7:25
    retrieved date,
    at least those, and that's annoying,
  • 7:25 - 7:29
    and integrating Citoid into Wikibase
    will hopefully help with that.
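
Citoid is already exposed as a REST endpoint on Wikimedia wikis; a minimal sketch of asking it for citation metadata for a URL follows. The exact path, the format name and the response fields are assumptions to check against the Citoid documentation, and the URL is just an example.

```python
# Sketch: ask the Citoid REST endpoint for citation metadata for a URL.
import requests
from urllib.parse import quote

url_to_cite = "https://www.example.org/some-article"  # placeholder URL
endpoint = (
    "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"
    + quote(url_to_cite, safe="")
)
resp = requests.get(endpoint, headers={"User-Agent": "citoid-example/0.1"})
for citation in resp.json():
    # Field names assumed from Citoid's citation objects.
    print(citation.get("title"), citation.get("date"), citation.get("url"))
```
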
  • 7:30 - 7:34
    And I think
    that's all the ones I had, yeah.
  • 7:34 - 7:36
    So now, I'm going to pass to Cristina.
  • 7:38 - 7:42
    (applause)
  • 7:44 - 7:45
    Hi, I'm Cristina.
  • 7:45 - 7:48
    I'm a research scientist
    from the University of Zürich,
  • 7:48 - 7:51
    and I'm also an active member
    of the Swiss Community.
  • 7:53 - 7:58
    When Claudia Müller-Birn
    and I submitted this to the WikidataCon,
  • 7:58 - 8:00
    what we wanted to do
    is continue our discussion
  • 8:00 - 8:02
    that we started
    in the beginning of the year
  • 8:02 - 8:07
    with a workshop on data quality
    and also some sessions in Wikimania.
  • 8:07 - 8:11
    So the goal of this talk
    is basically to bring some thoughts
  • 8:11 - 8:14
    that we have been collecting
    from the community and ourselves
  • 8:14 - 8:17
    and continue the discussion.
  • 8:17 - 8:20
    So what we would like is to continue
    interacting a lot with you.
  • 8:22 - 8:23
    So what we think is very important
  • 8:23 - 8:28
    is that we continuously ask
    all types of users in the community
  • 8:28 - 8:32
    about what they really need,
    what problems they have with data quality,
  • 8:32 - 8:35
    not only editors
    but also the people who are coding,
  • 8:35 - 8:36
    or consuming the data,
  • 8:36 - 8:39
    and also researchers who are
    actually using all the edit history
  • 8:39 - 8:41
    to analyze what is happening.
  • 8:42 - 8:48
    So we did a review of around 80 tools
    that exist for Wikidata
  • 8:48 - 8:52
    and we aligned them to the different
    data quality dimensions.
  • 8:52 - 8:54
    And what we saw was that actually,
  • 8:54 - 8:58
    many of them were looking at
    monitoring completeness,
  • 8:58 - 9:03
    and also some of them
    are enabling interlinking.
  • 9:03 - 9:08
    But there is a big need for tools
    that are looking into diversity,
  • 9:08 - 9:13
    which is one of the things
    that we actually can have in Wikidata,
  • 9:13 - 9:16
    especially
    this design principle of Wikidata
  • 9:16 - 9:18
    where we can have plurality
  • 9:18 - 9:20
    and different statements
    with different values
  • 9:21 - 9:22
    coming from different sources.
  • 9:22 - 9:25
    Because it's a secondary source,
    we don't really have tools
  • 9:25 - 9:28
    that actually tell us how many
    plural statements there are,
  • 9:28 - 9:31
    and how many we can improve and how,
  • 9:31 - 9:33
    and we also don't really know
  • 9:33 - 9:36
    what are all the reasons
    for plurality that we can have.
  • 9:36 - 9:39
    So from these community meetings,
  • 9:39 - 9:43
    what we discussed was the challenges
    that still need attention.
  • 9:43 - 9:47
    For example, that having
    all these crowdsourcing communities
  • 9:47 - 9:50
    is very good because different people
    attack different parts
  • 9:50 - 9:52
    of the data or the graph,
  • 9:52 - 9:55
    and we also have
    different background knowledge
  • 9:55 - 9:59
    but actually, it's very difficult to align
    everything in something homogeneous
  • 9:59 - 10:05
    because different people are using
    different properties in different ways
  • 10:05 - 10:08
    and they are also expecting
    different things from entity descriptions.
  • 10:09 - 10:13
    People also said that
    they also need more tools
  • 10:13 - 10:16
    that give a better overview
    of the global status of things.
  • 10:16 - 10:21
    So what entities are missing
    in terms of completeness,
  • 10:21 - 10:26
    but also like what are people
    working on right now most of the time,
  • 10:26 - 10:31
    and they also mentioned many times
    a tighter collaboration
  • 10:31 - 10:33
    across not only languages
    but the WikiProjects
  • 10:33 - 10:36
    and the different Wikimedia platforms.
  • 10:36 - 10:39
    And we published
    all the transcribed comments
  • 10:39 - 10:43
    from all these discussions
    in those links here in the Etherpads
  • 10:43 - 10:46
    and also in the wiki page of Wikimania.
  • 10:46 - 10:48
    Some solutions that appeared actually
  • 10:48 - 10:53
    were going in the direction
    of sharing more of the best practices
  • 10:53 - 10:56
    that are being developed
    in different WikiProjects,
  • 10:56 - 11:01
    but also people want tools
    that help organize work in teams
  • 11:01 - 11:04
    or at least understanding
    who is working on that,
  • 11:04 - 11:08
    and they were also mentioning
    that they want more showcases
  • 11:08 - 11:12
    and more templates that help them
    create things in a better way.
  • 11:13 - 11:15
    And from the contact that we have
  • 11:15 - 11:19
    with Open Governmental Data Organizations,
  • 11:19 - 11:20
    and in particular,
  • 11:20 - 11:23
    I am in contact with the canton
    and the city of Zürich,
  • 11:23 - 11:26
    they are very interested
    in working with Wikidata
  • 11:26 - 11:30
    because they want their data
    to be accessible for everyone
  • 11:30 - 11:34
    in the place where people go
    and consult or access data.
  • 11:34 - 11:37
    So for them, something that
    would be really interesting
  • 11:37 - 11:39
    is to have some kind of quality indicators
  • 11:39 - 11:41
    both in the wiki,
    which is already happening,
  • 11:41 - 11:43
    but also in SPARQL results,
  • 11:43 - 11:46
    to know whether or not they can trust
    that data from the community.
  • 11:46 - 11:48
    And then, they also want to know
  • 11:48 - 11:51
    what parts of their own data sets
    are useful for Wikidata
  • 11:51 - 11:56
    and they would love to have a tool that
    can help them assess that automatically.
  • 11:56 - 11:59
    They also need
    some kind of methodology or tool
  • 11:59 - 12:04
    that helps them decide whether
    they should import or link their data
  • 12:04 - 12:05
    because in some cases,
  • 12:05 - 12:07
    they also have their own
    linked open data sets,
  • 12:07 - 12:10
    so they don't know whether
    to just ingest the data
  • 12:10 - 12:13
    or to keep on creating links
    from the data sets to Wikidata
  • 12:13 - 12:14
    and the other way around.
  • 12:15 - 12:20
    And they also want to know where
    their websites are referenced in Wikidata.
  • 12:20 - 12:23
    And when they run such a query
    in the query service,
  • 12:23 - 12:25
    they often get timeouts,
  • 12:25 - 12:28
    so maybe we should
    really create more tools
  • 12:28 - 12:32
    that help them get answers
    to these questions.
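
A sketch of the kind of query meant here: finding statements whose references cite a given organization's website, via the reference URL property (P854). An unrestricted string scan over all references is exactly what tends to time out on the public endpoint, hence the LIMIT; the domain is a placeholder.

```python
# Sketch: find Wikidata statements whose references cite a given website.
REFERRER_QUERY = """
SELECT ?statement ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P854 ?refURL .                   # reference URL
  FILTER(STRSTARTS(STR(?refURL), "https://www.example.org/"))
}
LIMIT 100
"""
# Run against https://query.wikidata.org/sparql as in the earlier example;
# without the LIMIT (or further narrowing) queries like this often time out.
```
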
  • 12:33 - 12:36
    And, besides that,
  • 12:36 - 12:39
    we wiki researchers also sometimes
  • 12:39 - 12:42
    lack some information
    in the edit summaries.
  • 12:42 - 12:45
    So I remember that when
    we were doing some work
  • 12:45 - 12:49
    to understand
    the different behavior of editors
  • 12:49 - 12:53
    with tools or bots
    or anonymous users and so on,
  • 12:53 - 12:56
    we were really lacking, for example,
  • 12:56 - 13:01
    a standard way of tracing
    that tools were being used.
  • 13:01 - 13:03
    And there are some tools
    that are already doing that
  • 13:03 - 13:05
    like PetScan and many others,
  • 13:05 - 13:08
    but maybe we should in the community
  • 13:08 - 13:14
    discuss more about how to record these
    for fine-grained provenance.
  • 13:14 - 13:15
    And further on,
  • 13:15 - 13:21
    we think that we need to think
    of more concrete data quality dimensions
  • 13:21 - 13:25
    that are related to linked data
    but not to all types of data,
  • 13:25 - 13:31
    so we worked on some measures
    to actually assess the information gain
  • 13:31 - 13:34
    enabled by the links,
    and what we mean by that
  • 13:34 - 13:37
    is that when we link
    Wikidata to other data sets,
  • 13:37 - 13:38
    we should also be thinking
  • 13:38 - 13:42
    how much the entities are actually
    gaining in the classification,
  • 13:42 - 13:46
    also in the description
    but also in the vocabularies they use.
  • 13:46 - 13:51
    So just to give a very simple
    example of what I mean with this
  • 13:51 - 13:54
    is we can think of--
    in this case, would be Wikidata
  • 13:54 - 13:58
    or the external data set
    that is linking to Wikidata,
  • 13:58 - 14:00
    we have the entity for a person
    that is called Natasha Noy,
  • 14:00 - 14:03
    we have the affiliation and other things,
  • 14:03 - 14:05
    and then we say OK,
    we link to an external place,
  • 14:05 - 14:09
    and that entity also has that name,
    but we actually have the same value.
  • 14:09 - 14:13
    So what would be better is that we link
    to something that has a different name,
  • 14:13 - 14:17
    that is still valid because this person
    has two ways of writing the name,
  • 14:17 - 14:20
    and also other information
    that we don't have in Wikidata
  • 14:20 - 14:22
    or that we don't have
    in the other data set.
  • 14:22 - 14:25
    But also, what is even better
  • 14:25 - 14:28
    is when we actually
    see in the target data set
  • 14:28 - 14:31
    that they also have new ways
    of classifying the information.
  • 14:31 - 14:35
    So not only is this a person,
    but in the other data set,
  • 14:35 - 14:40
    they also say it's a female
    or anything else that they classify with.
  • 14:40 - 14:43
    And if in the other data set,
    they are using many other vocabularies
  • 14:43 - 14:47
    that is also helping in their whole
    information retrieval thing.
  • 14:47 - 14:51
    So with that, I also would like to say
  • 14:51 - 14:56
    that we think that we can
    showcase federated queries better
  • 14:56 - 15:00
    because when we look at the query log
    provided by Malyshev et al.,
  • 15:01 - 15:04
    we see actually that
    from the organic queries,
  • 15:04 - 15:07
    we have only very few federated queries.
  • 15:07 - 15:13
    And actually, federation is one
    of the key advantages of having linked data,
  • 15:13 - 15:17
    so maybe the community
    or the people using Wikidata
  • 15:17 - 15:19
    also need more examples on this.
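
A minimal sketch of a federated query of the kind meant here, joining Wikidata proteins to the UniProt SPARQL endpoint through the UniProt protein ID (P352). It assumes that endpoint is on the query service's federation allow-list, and the UniProt predicate shown (up:mnemonic) is illustrative.

```python
# Sketch: federated query joining Wikidata to UniProt via P352.
FEDERATED_QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?item ?uniprotID ?mnemonic WHERE {
  ?item wdt:P352 ?uniprotID .                        # UniProt protein ID
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotID)) AS ?protein)
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein up:mnemonic ?mnemonic .                 # extra data only UniProt has
  }
}
LIMIT 10
"""
# Run against https://query.wikidata.org/sparql as in the earlier examples.
```
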
  • 15:19 - 15:23
    And if we look at the list
    of endpoints that are being used,
  • 15:23 - 15:25
    this is not a complete list
    and we have many more.
  • 15:25 - 15:30
    Of course, this data was analyzed
    from queries until March 2018,
  • 15:30 - 15:35
    but we should look into the list
    of federated endpoints that we have
  • 15:35 - 15:37
    and see whether
    we are really using them or not.
  • 15:38 - 15:40
    So two questions that
    I have for the audience
  • 15:40 - 15:43
    that maybe we can use
    afterwards for the discussion are:
  • 15:43 - 15:46
    what data quality problems
    should be addressed in your opinion,
  • 15:46 - 15:47
    because of the needs that you have,
  • 15:47 - 15:50
    but also, where do you need
    more automation
  • 15:50 - 15:53
    to help you with editing or patrolling.
  • 15:54 - 15:55
    That's all, thank you very much.
  • 15:56 - 15:58
    (applause)
  • 16:06 - 16:09
    (Jose Emilio Labra) OK,
    so what I'm going to talk about
  • 16:09 - 16:15
    is some tools that we were developing
    related to Shape Expressions.
  • 16:16 - 16:19
    So this is what I want to talk...
    I am Jose Emilio Labra,
  • 16:19 - 16:23
    but this has... all these tools
    have been done by different people,
  • 16:24 - 16:28
    mainly related to the W3C ShEx,
    Shape Expressions Community Group.
  • 16:28 - 16:29
    ShEx Community Group.
  • 16:30 - 16:36
    So the first tool that I want to mention
    is RDFShape, this is a general tool,
  • 16:36 - 16:41
    because Shape Expressions
    is not only for Wikidata,
  • 16:41 - 16:44
    Shape Expressions is a language
    to validate RDF in general.
  • 16:44 - 16:48
    So this tool was developed mainly by me
  • 16:48 - 16:51
    and it's a tool
    to validate RDF in general.
  • 16:51 - 16:55
    So if you want to learn about RDF
    or you want to validate RDF
  • 16:55 - 16:59
    or SPARQL endpoints not only in Wikidata,
  • 16:59 - 17:01
    my advice is that you can use this tool.
  • 17:01 - 17:03
    Also for teaching.
  • 17:03 - 17:06
    I am a teacher in the university
  • 17:06 - 17:09
    and I use it in my semantic web course
    to teach RDF.
  • 17:09 - 17:12
    So if you want to learn RDF,
    I think it's a good tool.
  • 17:13 - 17:18
    For example, this is just a visualization
    of an RDF graph with the tool.
  • 17:19 - 17:23
    But before coming here, in the last month,
  • 17:23 - 17:28
    I started a fork of rdfshape specifically
    for Wikidata, because I thought...
  • 17:28 - 17:33
    It's called WikiShape, and yesterday,
    I presented it as a present for Wikidata.
  • 17:33 - 17:34
    So what I took is...
  • 17:34 - 17:40
    What I did is to remove all the stuff
    that was not related with Wikidata
  • 17:40 - 17:45
    and to put several things, hard-coded,
    for example, the Wikidata SPARQL endpoint,
  • 17:45 - 17:49
    but now, someone asked me
    if I could do it also for Wikibase.
  • 17:49 - 17:52
    And it is very easy
    to do it for Wikibase also.
  • 17:53 - 17:56
    So this tool, WikiShape, is quite new.
  • 17:57 - 18:00
    I think it works, most of the features,
  • 18:00 - 18:02
    but there are some features
    that maybe don't work,
  • 18:02 - 18:06
    and if you try it and you want
    to improve it, please tell me.
  • 18:06 - 18:13
    So this is [inaudible] captures,
    but I think I can even try so let's try.
  • 18:15 - 18:17
    So let's see if it works.
  • 18:17 - 18:20
    First, I have to go out of the...
  • 18:22 - 18:23
    Here.
  • 18:24 - 18:28
    Alright, yeah. So this is the tool here.
  • 18:28 - 18:30
    Things that you can do with the tool,
  • 18:30 - 18:35
    for example, is that you can
    check schemas, entity schemas.
  • 18:35 - 18:39
    You know that there is
    a new namespace which is "E whatever,"
  • 18:39 - 18:45
    so here, if you start, for example,
    writing "human"...
  • 18:45 - 18:49
    As you are writing,
    its autocomplete allows you to check,
  • 18:49 - 18:52
    for example,
    this is the Shape Expressions of a human,
  • 18:53 - 18:56
    and this is the Shape Expressions here.
  • 18:56 - 19:00
    And as you can see,
    this editor has syntax highlighting,
  • 19:00 - 19:05
    this is... well,
    maybe it's very small, the screen.
  • 19:06 - 19:08
    I can try to do it bigger.
  • 19:09 - 19:11
    Maybe you see it better now.
  • 19:11 - 19:14
    So... and this is the editor
    with syntax highlighting and also has...
  • 19:14 - 19:18
    I mean, this editor
    comes from the same source code
  • 19:18 - 19:20
    as the Wikidata query service.
  • 19:20 - 19:24
    So for example,
    if you hover with the mouse here,
  • 19:24 - 19:28
    it shows you the labels
    of the different properties.
  • 19:28 - 19:31
    So I think it's very helpful because now,
  • 19:33 - 19:39
    the entity schemas that is
    in the Wikidata is just a plain text idea,
  • 19:39 - 19:42
    and I think this editor is much better
    because it has autocomplete
  • 19:42 - 19:44
    and it also has...
  • 19:44 - 19:48
    I mean, if you, for example,
    wanted to add a constraint,
  • 19:48 - 19:52
    you say "wdt:"
  • 19:52 - 19:57
    You start writing "author"
    and then you click Ctrl+Space
  • 19:57 - 19:59
    and it suggests the different things.
  • 19:59 - 20:02
    So this is similar
    to the Wikidata query service
  • 20:02 - 20:06
    but specifically for Shape Expressions
  • 20:06 - 20:12
    because my feeling is that
    creating Shape Expressions
  • 20:12 - 20:16
    is not more difficult
    than writing SPARQL queries.
  • 20:16 - 20:21
    So some people think
    that it's at the same level,
  • 20:22 - 20:26
    It's probably easier, I think,
    because Shape Expressions was,
  • 20:26 - 20:31
    when we designed it,
    we were doing it to be easy to work with.
  • 20:31 - 20:35
    OK, so this is one of the first things,
    that you have this editor
  • 20:35 - 20:37
    for Shape Expressions.
  • 20:37 - 20:41
    And then you also have the possibility,
    for example, to visualize.
  • 20:41 - 20:45
    If you have a Shape Expression,
    use for example...
  • 20:45 - 20:49
    I think, "written work" is
    a nice Shape Expression
  • 20:49 - 20:53
    because it has some relationships
    between different things.
  • 20:55 - 20:58
    And this is the UML visualization
    of written work.
  • 20:58 - 21:02
    In a UML, this is easy to see
    the different properties.
  • 21:03 - 21:07
    When you do this, I realized
    when I tried with several people,
  • 21:07 - 21:09
    they find some mistakes
    in their Shape Expressions
  • 21:09 - 21:13
    because it's easy to detect which are
    the missing properties or whatever.
  • 21:14 - 21:16
    Then another possibility here
  • 21:16 - 21:20
    is that you can also validate,
    I think I have it here, the validation.
  • 21:20 - 21:25
    I think I had it in some tab,
    maybe I closed it.
  • 21:26 - 21:31
    OK, but you can, for example,
    you can click here, Validate entities.
  • 21:32 - 21:34
    You put, for example,
  • 21:35 - 21:42
    "Q42" with "E42," which is author.
  • 21:43 - 21:46
    With "human,"
    I think we can do it with "human."
  • 21:49 - 21:50
    And then it's...
  • 21:51 - 21:56
    And it's taking a little while to do it
    because this is doing the SPARQL queries
  • 21:56 - 21:59
    and now, for example,
    it's failing because of the network but...
  • 22:00 - 22:02
    So you can try it.
  • 22:03 - 22:07
    OK, so let's go continue
    with the presentation, with other tools.
  • 22:07 - 22:12
    So my advice is that if you want to try it
    and you have any feedback, let me know.
  • 22:13 - 22:16
    So to continue with the presentation...
  • 22:19 - 22:20
    So this is WikiShape.
  • 22:24 - 22:27
    Then, I already said this,
  • 22:28 - 22:34
    the Shape Expressions Editor
    is an independent project on GitHub.
  • 22:36 - 22:37
    You can use it in your own project.
  • 22:37 - 22:41
    If you want to do
    a Shape Expressions tool,
  • 22:41 - 22:46
    you can just embed it
    in any other project,
  • 22:46 - 22:48
    so this is in GitHub and you can use it.
  • 22:49 - 22:52
    Then the same author,
    who is one of my students,
  • 22:53 - 22:56
    he also created
    an editor for Shape Expressions,
  • 22:56 - 22:58
    also inspired by
    the Wikidata query service
  • 22:58 - 23:01
    where, in a column,
  • 23:01 - 23:05
    you have this more visual editor
    of SPARQL queries
  • 23:05 - 23:07
    where you can put these kinds of things.
  • 23:07 - 23:09
    So this is a screen capture.
  • 23:09 - 23:13
    You can see that
    that's the Shape Expression as text,
  • 23:13 - 23:18
    but this is a form-based Shape Expression,
    which would probably take a bit longer,
  • 23:19 - 23:23
    where you can fill in the different rows
    and the different fields.
  • 23:23 - 23:26
    OK, then there is ShExEr.
  • 23:27 - 23:32
    We have... it's done by one PhD student
    at the University of Oviedo
  • 23:32 - 23:34
    and he's here, so you can present ShExEr.
  • 23:38 - 23:40
    (Danny) Hello, I am Danny Fernández,
  • 23:40 - 23:44
    I am a PhD student at the University of Oviedo
    working with Labra.
  • 23:45 - 23:48
    Since we are running out of time,
    let's make this quick,
  • 23:48 - 23:53
    so let's not go for any actual demo,
    but just show some screenshots.
  • 23:53 - 23:58
    OK, so the usual way to work with
    Shape Expressions or any shape language
  • 23:58 - 24:00
    is that you have a domain expert
  • 24:00 - 24:02
    that defines a priori
    what the graph should look like,
  • 24:02 - 24:04
    defines some structures,
  • 24:04 - 24:07
    and then you use these structures
    to validate the actual data against them.
  • 24:08 - 24:12
    This tool, which, like the ones
    that Labra has been presenting,
  • 24:12 - 24:14
    is a general-purpose tool
    for any RDF source,
  • 24:14 - 24:17
    is designed to work the other way around.
  • 24:17 - 24:19
    You already have some data,
  • 24:19 - 24:23
    you select what nodes
    you want to get the shape about
  • 24:23 - 24:27
    and then you automatically
    extract or infer the shape.
  • 24:27 - 24:30
    So even if this is a general purpose tool,
  • 24:30 - 24:34
    what we did for this WikidataCon
    is this fancy button
  • 24:35 - 24:37
    that if you click it,
    essentially what happens
  • 24:37 - 24:42
    is that there are
    so many configuration params
  • 24:42 - 24:46
    and it configures them to work
    against the Wikidata endpoint
  • 24:46 - 24:48
    and it will end soon, sorry.
  • 24:49 - 24:53
    So, once you press this button
    what you get is essentially this.
  • 24:53 - 24:55
    After having selected what kind of nodes,
  • 24:55 - 24:59
    what kind of instances of a class,
    whatever you are looking for,
  • 24:59 - 25:01
    you get an automatic schema.
  • 25:02 - 25:07
    All the constraints are sorted
    by how many nodes actually conform to them,
  • 25:07 - 25:10
    you can filter the less common ones, etc.
  • 25:10 - 25:12
    So there is a poster downstairs
    about this stuff
  • 25:12 - 25:15
    and well,
    I will be downstairs and upstairs
  • 25:15 - 25:16
    and all over the place all day,
  • 25:16 - 25:19
    so if you have any further
    interest in this tool,
  • 25:19 - 25:21
    just speak to me during this journey.
  • 25:21 - 25:25
    And now, I'll give back
    the mic to Labra, thank you.
  • 25:25 - 25:29
    (applause)
  • 25:30 - 25:33
    (Jose) So let's continue
    with the other tools.
  • 25:33 - 25:35
    The other tool is the ShapeDesigner.
  • 25:35 - 25:37
    Andra, do you want to do
    the ShapeDesigner now
  • 25:37 - 25:39
    or maybe later or in the workshop?
  • 25:39 - 25:41
    There is a workshop...
  • 25:41 - 25:44
    This afternoon, there is a workshop
    specifically for Shape Expressions, and...
  • 25:45 - 25:48
    The idea is that it was going to be
    more hands-on,
  • 25:48 - 25:52
    and if you want to practice
    some ShEx, you can do it there.
  • 25:53 - 25:56
    This tool is ShEx...
    and there is Eric here,
  • 25:56 - 25:57
    so you can present it.
  • 25:58 - 26:01
    (Eric) So just super quick,
    the thing that I want to say
  • 26:01 - 26:06
    is that you've probably
    already seen the ShEx interface
  • 26:06 - 26:08
    that's tailored for Wikidata.
  • 26:08 - 26:13
    That's effectively stripped down
    and tailored specifically for Wikidata.
  • 26:13 - 26:18
    The generic one has more features,
    and I thought I'd mention it
  • 26:18 - 26:20
    because one of those features
    is particularly useful
  • 26:20 - 26:23
    for debugging Wikidata schemas,
  • 26:23 - 26:29
    which is if you go
    and you select the slurp mode,
  • 26:29 - 26:31
    what it does is it says
    while I'm validating,
  • 26:31 - 26:35
    I want to pull all the triples down
    and that means
  • 26:35 - 26:36
    if I get a bunch of failures,
  • 26:36 - 26:40
    I can go through and start looking
    at those failures and saying,
  • 26:40 - 26:42
    OK, what are the triples
    that are in here,
  • 26:42 - 26:44
    sorry, I apologize,
    the triples are down there,
  • 26:44 - 26:46
    this is just a log of what went by.
  • 26:46 - 26:49
    And then you can just sit there
    and fiddle with it in real time
  • 26:49 - 26:51
    like you play with something
    and it changes.
  • 26:51 - 26:54
    So it's a quicker version
    for doing all that stuff.
  • 26:55 - 26:56
    This is a ShExC form,
  • 26:56 - 26:59
    this is something [Joachim] had suggested
  • 27:00 - 27:05
    could be useful for populating
    Wikidata documents
  • 27:05 - 27:07
    based on a Shape Expression
    for that document.
  • 27:08 - 27:12
    This is not tailored for Wikidata,
  • 27:12 - 27:14
    but this is just to say
    that you can have a schema
  • 27:14 - 27:15
    and you can have some annotations
  • 27:15 - 27:18
    to say specifically how I want
    that schema rendered
  • 27:18 - 27:19
    and then it just builds a form,
  • 27:19 - 27:21
    and if you've got data,
    it can even populate the form.
  • 27:25 - 27:26
    PyShEx [inaudible].
  • 27:28 - 27:31
    (Jose) I think this is the last one.
  • 27:32 - 27:34
    Yes, so the last one is PyShEx.
  • 27:35 - 27:38
    PyShEx is a Python implementation
    of Shape Expressions,
  • 27:39 - 27:43
    you can also play with Jupyter Notebooks
    if you want those kinds of things.
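
A minimal sketch of validating one Wikidata entity against a small ShEx shape with PyShEx, in the spirit of the Q42-against-an-entity-schema demo earlier. The toy shape is hypothetical, and the ShExEvaluator usage and result fields should be checked against the PyShEx documentation.

```python
# Sketch: validate one Wikidata item against a toy ShEx shape with PyShEx.
import requests
from pyshex import ShExEvaluator

SHEX = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
start = @<human>
<human> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q5 ] ;        # instance of: human
  wdt:P569 . ? ;             # optional date of birth
  wdt:P21 . *                # any number of sex-or-gender values
}
"""

# Fetch the item's RDF (Turtle) and validate it against the shape.
ttl = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl").text
results = ShExEvaluator(
    rdf=ttl,
    schema=SHEX,
    focus="http://www.wikidata.org/entity/Q42",
).evaluate()

for r in results:
    print(r.focus, "conforms" if r.result else f"fails: {r.reason}")
```
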
  • 27:43 - 27:44
    OK, so that's all for this.
  • 27:44 - 27:47
    (applause)
  • 27:53 - 27:57
    (Andra) So I'm going to talk about
    a specific project that I'm involved in
  • 27:57 - 27:58
    called Gene Wiki,
  • 27:58 - 28:05
    and where we are also
    dealing with quality issues.
  • 28:05 - 28:07
    But before going into the quality,
  • 28:07 - 28:09
    maybe a quick introduction
    about what Gene Wiki is,
  • 28:10 - 28:15
    and we just released a pre-print
    of a paper that we recently have written
  • 28:15 - 28:18
    that explains the details of the project.
  • 28:20 - 28:24
    I see people taking pictures,
    but basically, what Gene Wiki does,
  • 28:24 - 28:28
    it's trying to get biomedical data,
    public data into Wikidata,
  • 28:28 - 28:32
    and we follow a specific pattern
    to get that data into Wikidata.
  • 28:33 - 28:37
    So when we have a new repository
    or a new data set
  • 28:37 - 28:40
    that is eligible
    to be included into Wikidata,
  • 28:40 - 28:41
    the first step is community engagement.
  • 28:41 - 28:44
    It is not necessarily
    directly with the Wikidata community
  • 28:44 - 28:46
    but with a local research community,
  • 28:46 - 28:50
    and we meet in person
    or online or on any platform
  • 28:50 - 28:53
    and try to come up with a data model
  • 28:53 - 28:56
    that bridges their data
    with the Wikidata model.
  • 28:56 - 29:00
    So here I have a picture of a workshop
    that happened here last year
  • 29:00 - 29:03
    which was trying to look
    at a specific data set
  • 29:03 - 29:05
    and, well, you see a lot of discussions,
  • 29:05 - 29:10
    then aligning it with schema.org
    and other ontologies that are out there.
  • 29:10 - 29:16
    And then, at the end of the first step,
    we have a whiteboard drawing of the schema
  • 29:16 - 29:17
    that we want to implement in Wikidata.
  • 29:17 - 29:20
    What you see over there,
    this is just a plain whiteboard,
  • 29:20 - 29:22
    we have it in the back there
  • 29:22 - 29:25
    so we can make some schemas
    within this panel today even.
  • 29:27 - 29:28
    So once we have the schema in place,
  • 29:28 - 29:31
    the next thing is to try to make
    that schema machine readable
  • 29:32 - 29:37
    because you want to have actionable models
    to bridge the data that you're bringing in
  • 29:37 - 29:40
    from any biomedical database
    into Wikidata.
  • 29:40 - 29:45
    And here we are applying
    Shape Expressions.
  • 29:46 - 29:53
    And we use that because
    Shape Expressions allow you to test
  • 29:53 - 29:57
    whether the data set
    is actually-- no, to first see
  • 29:57 - 30:02
    if already existing data in Wikidata
    follows the same data model
  • 30:02 - 30:05
    that was achieved in the previous process.
  • 30:05 - 30:07
    So then with the Shape Expression
    we can check:
  • 30:07 - 30:11
    OK the data that are on this topic
    in Wikidata, does it need some cleaning up
  • 30:11 - 30:15
    or do we need to adapt our model
    to the Wikidata model or vice versa.
  • 30:16 - 30:20
    Once that is in place,
    we start writing bots,
  • 30:21 - 30:24
    and the bots seed the information
  • 30:24 - 30:27
    that is in the primary sources
    into Wikidata.
  • 30:28 - 30:29
    And when the bots are ready,
  • 30:29 - 30:33
    we write these bots
    with a platform called--
  • 30:33 - 30:36
    with a Python library
    called Wikidata Integrator
  • 30:36 - 30:38
    that came out of our project.
  • 30:39 - 30:43
    And once we have our bots,
    we use a platform called Jenkins
  • 30:43 - 30:45
    for continuous integration.
  • 30:45 - 30:46
    And with Jenkins,
  • 30:46 - 30:51
    we continuously update
    the primary sources with Wikidata.
  • 30:52 - 30:56
    And this is a diagram for the paper
    I previously mentioned.
  • 30:56 - 30:57
    This is our current landscape.
  • 30:57 - 31:02
    So every orange box out there
    is a primary resource on drugs,
  • 31:02 - 31:08
    proteins, genes, diseases,
    chemical compounds with interaction,
  • 31:08 - 31:11
    and this model is too small to read now
  • 31:11 - 31:17
    but these are the databases,
    the sources that we manage in Wikidata
  • 31:17 - 31:21
    and bridge with the primary sources.
  • 31:21 - 31:22
    Here is such a workflow.
  • 31:23 - 31:25
    So one of our partners
    is the Disease Ontology;
  • 31:25 - 31:28
    the Disease Ontology is a CC0 ontology,
  • 31:28 - 31:32
    and this CC0 ontology
    has a curation cycle of its own,
  • 31:33 - 31:36
    and they just continuously
    update the Disease Ontology
  • 31:36 - 31:40
    to reflect the disease space
    or the interpretation of diseases.
  • 31:40 - 31:44
    And there is the Wikidata
    curation cycle also on diseases
  • 31:44 - 31:50
    where the Wikidata community constantly
    monitors what's going on on Wikidata.
  • 31:50 - 31:52
    And then we have two roles,
  • 31:52 - 31:55
    we call them colloquially
    the gatekeeper curator,
  • 31:56 - 32:00
    and this was me
    and a colleague five years ago
  • 32:00 - 32:03
    where we just sit on our computers
    and we monitor Wikipedia and Wikidata,
  • 32:03 - 32:09
    and if there is an issue that was
    reported back to the primary community,
  • 32:09 - 32:12
    the primary resources, they looked
    at the implementation and decided:
  • 32:12 - 32:14
    OK, do we trust the Wikidata input?
  • 32:15 - 32:19
    Yes--then it's considered,
    it goes into the cycle,
  • 32:19 - 32:23
    and the next iteration
    is part of the Disease Ontology
  • 32:23 - 32:25
    and fed back into Wikidata.
  • 32:27 - 32:31
    We're doing the same for WikiPathways.
  • 32:31 - 32:37
    WikiPathways is a MediaWiki-inspired
    pathway repository.
  • 32:37 - 32:41
    Same story, there are different
    pathway resources on Wikidata already.
  • 32:41 - 32:45
    There might be conflicts
    between those pathway resources
  • 32:45 - 32:47
    and these conflicts are reported back
  • 32:47 - 32:50
    by the gatekeeper curators
    to that community,
  • 32:50 - 32:54
    and you maintain
    the individual curation cycles.
  • 32:54 - 32:57
    But if you remember the previous cycle,
  • 32:57 - 33:03
    here I mentioned
    only two cycles, two resources,
  • 33:04 - 33:06
    we have to do that
    for every single resource that we have
  • 33:06 - 33:08
    and we have to manage what's going on
  • 33:08 - 33:09
    because when I say curation,
  • 33:09 - 33:11
    I really mean going
    to the Wikipedia talk pages,
  • 33:11 - 33:15
    going into the Wikidata talk pages
    and trying to do that.
  • 33:15 - 33:19
    That doesn't scale for
    the two gatekeeper curators we had.
  • 33:20 - 33:23
    So when I was in a conference in 2016
  • 33:23 - 33:27
    where Eric gave a presentation
    on Shape Expressions,
  • 33:27 - 33:29
    I jumped on the bandwagon and said OK,
  • 33:29 - 33:34
    Shape Expressions can help us
    detect differences in Wikidata,
  • 33:34 - 33:41
    and so that allows the gatekeepers to have
    some more efficient reporting.
  • 33:42 - 33:46
    So this year,
    I was delighted by the entity schemas
  • 33:46 - 33:51
    because now, we can store
    those entity schemas on Wikidata,
  • 33:51 - 33:53
    on Wikidata itself,
    whereas before, they were on GitHub,
  • 33:54 - 33:57
    and this aligns
    with the Wikidata interface,
  • 33:57 - 33:59
    so you have things
    like document discussions
  • 33:59 - 34:01
    but you also have revisions.
  • 34:01 - 34:05
    So you can leverage the talk pages
    and the revisions in Wikidata
  • 34:05 - 34:12
    to use that to discuss
    about what is in Wikidata
  • 34:12 - 34:14
    and what is in the primary resources.
  • 34:15 - 34:20
    So this is what Eric just presented,
    this is already quite a benefit.
  • 34:20 - 34:24
    So here, we made up a Shape Expression
    for the human gene,
  • 34:24 - 34:30
    and then we ran it through simple ShEx,
    and as you can see,
  • 34:30 - 34:32
    we just got already ni--
  • 34:32 - 34:35
    There is one issue
    that needs to be monitored
  • 34:35 - 34:37
    which there is an item
    that doesn't fit that schema,
  • 34:37 - 34:43
    and then you can sort of already
    create entity schema curation reports
  • 34:43 - 34:46
    based on... and send those
    to the different curation communities.
  • 34:48 - 34:53
    But this is the ShEx.js interface,
  • 34:53 - 34:56
    and if I can show back here,
    I only do ten,
  • 34:56 - 35:00
    but we have tens of thousands,
    and so that again doesn't scale.
  • 35:00 - 35:05
    So the Wikidata Integrator now
    has ShEx support as well,
  • 35:05 - 35:07
    and then we can just loop over items
  • 35:07 - 35:11
    where we say yes-no,
    yes-no, true-false, true-false.
  • 35:11 - 35:12
    So again,
  • 35:13 - 35:17
    increasing a bit of the efficiency
    of dealing with the reports.
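
A sketch of that "loop over items, yes/no" idea: select a batch of items with SPARQL, then run the same kind of PyShEx check from the earlier sketch on each one. This stands in for, and is not, the Wikidata Integrator's own ShEx support; the gene shape, class and property IDs are illustrative.

```python
# Sketch of batch conformance checking (stand-in for Wikidata Integrator's
# ShEx support, whose actual API is not shown here).
import requests
from pyshex import ShExEvaluator

GENE_SHEX = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
start = @<gene>
<gene> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q7187 ] ;     # instance of: gene
  wdt:P703 . +               # found in taxon: at least one value
}
"""

ITEMS_QUERY = 'SELECT ?gene WHERE { ?gene wdt:P31 wd:Q7187 . } LIMIT 10'
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": ITEMS_QUERY, "format": "json"},
    headers={"User-Agent": "shex-batch-example/0.1"},
).json()["results"]["bindings"]

report = {}
for row in rows:
    uri = row["gene"]["value"]                       # e.g. http://www.wikidata.org/entity/Q...
    qid = uri.rsplit("/", 1)[-1]
    ttl = requests.get(f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl").text
    results = ShExEvaluator(rdf=ttl, schema=GENE_SHEX, focus=uri).evaluate()
    report[qid] = all(r.result for r in results)

print(report)  # {"Q...": True/False, ...} -- the yes-no / true-false list
```
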
  • 35:17 - 35:23
    But now, recently, that builds
    on the Wikidata Query Service,
  • 35:23 - 35:25
    and well, we recently have been getting throttled
  • 35:25 - 35:27
    so again, that doesn't scale.
  • 35:27 - 35:31
    So it's still an ongoing process,
    how to deal with models on Wikidata.
  • 35:32 - 35:37
    And so again,
    ShEx is not only intimidating
  • 35:37 - 35:40
    but also the scale is just
    too big to deal with.
  • 35:41 - 35:46
    So I started working, this is my first
    proof of concept or exercise
  • 35:46 - 35:48
    where I used a tool called yEd,
  • 35:48 - 35:53
    and I started to draw
    those Shape Expressions and because...
  • 35:53 - 35:58
    and then regenerate this schema
  • 35:58 - 36:01
    into the JSON format
    of the Shape Expressions,
  • 36:01 - 36:05
    so that would already open it up
    to the audience
  • 36:05 - 36:07
    that is intimidated
    by the Shape Expressions language.
  • 36:08 - 36:12
    But actually, there is a problem
    with those visual descriptions
  • 36:12 - 36:18
    because this is also a schema
    that was actually drawn in yEd by someone.
  • 36:18 - 36:24
    And here is another one
    which is beautiful.
  • 36:24 - 36:29
    I would love to have this on my wall,
    but it is still not interoperable.
  • 36:30 - 36:32
    So I want to end my talk with,
  • 36:32 - 36:36
    and the first time, I've been
    stealing this slide, using this slide.
  • 36:36 - 36:38
    It's an honor to have him in the audience
  • 36:38 - 36:39
    and I really like this:
  • 36:39 - 36:42
    "People think RDF is a pain
    because it's complicated.
  • 36:42 - 36:44
    The truth is even worse, it's so simple,
  • 36:46 - 36:48
    because you have to work
    with real-world data problems
  • 36:48 - 36:50
    that are horribly complicated.
  • 36:50 - 36:51
    While you can avoid RDF,
  • 36:51 - 36:56
    it is harder to avoid complicated data
    and complicated computer problems."
  • 36:56 - 37:00
    This is about RDF, but I think
    this so applies to modeling as well.
  • 37:00 - 37:03
    So my point of discussion
    is should we really...
  • 37:03 - 37:06
    How do we get modeling going?
  • 37:06 - 37:11
    Should we discuss ShEx
    or visual models or...
  • 37:11 - 37:13
    How do we continue?
  • 37:13 - 37:15
    Thank you very much for your time.
  • 37:15 - 37:18
    (applause)
  • 37:20 - 37:21
    (Lydia) Thank you so much.
  • 37:22 - 37:24
    Would you come to the front
  • 37:24 - 37:28
    so that we can open
    the questions from the audience.
  • 37:29 - 37:30
    Are there questions?
  • 37:32 - 37:33
    Yes.
  • 37:34 - 37:37
    And I think, for the camera, we need to...
  • 37:39 - 37:41
    (Lydia laughing) Yeah.
  • 37:43 - 37:46
    (man3) So a question
    for Cristina, I think.
  • 37:47 - 37:52
    So you mentioned exactly
    the term "information gain"
  • 37:52 - 37:54
    from linking with other systems.
  • 37:54 - 37:56
    There is an information theoretic measure
  • 37:56 - 37:58
    using statistics and probability
    called information gain.
  • 37:58 - 38:00
    Do you have the same...
  • 38:00 - 38:02
    I mean did you mean exactly that measure,
  • 38:02 - 38:04
    the information gain
    from the probability theory
  • 38:04 - 38:05
    from information theory
  • 38:05 - 38:09
    or just use this conceptual thing
    to measure information gain some way?
  • 38:09 - 38:13
    No, so we actually defined
    and implemented measures
  • 38:14 - 38:20
    that are using the Shannon entropy,
    so it's meant as that.
  • 38:20 - 38:23
    I didn't want to go into
    details of the concrete formulas...
  • 38:23 - 38:25
    (man3) No, no, of course,
    that's why I asked the question.
  • 38:25 - 38:27
    - (Cristina) But yeah...
    - (man3) Thank you.
  • 38:33 - 38:35
    (man4) Make more
    of a comment than a question.
  • 38:35 - 38:36
    (Lydia) Go for it.
  • 38:36 - 38:40
    (man4) So there's been
    a lot of focus at the item level
  • 38:40 - 38:43
    about quality and completeness,
  • 38:43 - 38:47
    one of the things that concerns me is that
    we're not applying the same to hierarchies
  • 38:47 - 38:51
    and I think we have an issue
    in that our hierarchy often isn't good.
  • 38:51 - 38:53
    We're seeing
    this is going to be a real problem
  • 38:53 - 38:56
    with Commons searching and other things.
  • 38:57 - 39:01
    One of the things that we can do
    is to import external--
  • 39:01 - 39:05
    The way that external thesauruses
    structure their hierarchies,
  • 39:05 - 39:10
    using the P4900
    broader concept qualifier.
  • 39:11 - 39:16
    But what I think would be really helpful
    would be much better tools for doing that
  • 39:16 - 39:21
    so that you can import an
    external... thesaurus's hierarchy
  • 39:21 - 39:24
    and map that onto our Wikidata items.
  • 39:24 - 39:28
    Once it's in place
    with those P4900 qualifiers,
  • 39:28 - 39:31
    you can actually do some
    quite good querying through SPARQL
  • 39:32 - 39:38
    to see where our hierarchy
    diverges from that external hierarchy.
  • 39:38 - 39:41
    For instance, [Paula Morma],
    user PKM, you may know,
  • 39:41 - 39:44
    does a lot of work on fashion.
  • 39:44 - 39:51
    So we use that to pull in the Europeana
    Fashion Thesaurus's hierarchy
  • 39:51 - 39:54
    and the Getty AAT
    fashion thesaurus hierarchy,
  • 39:54 - 39:58
    and then see where the gaps
    were in our higher level items,
  • 39:58 - 40:01
    which is a real problem for us
    because often,
  • 40:01 - 40:04
    these are things that only exist
    as disambiguation pages on Wikipedia,
  • 40:04 - 40:09
    so we have a lot of higher level items
    in our hierarchies missing
  • 40:09 - 40:14
    and this is something that we must address
    in terms of quality and completeness,
  • 40:14 - 40:16
    but what would really help
  • 40:17 - 40:21
    would be better tools than
    the jungle of Perl scripts that I wrote...
  • 40:21 - 40:26
    If somebody could put that
    into a PAWS notebook in Python
  • 40:27 - 40:32
    to be able to take an external thesaurus,
    take its hierarchy,
  • 40:32 - 40:35
    which may well be available
    as linked data or may not,
  • 40:35 - 40:41
    to then put those into
    QuickStatements to put in P4900 values.
  • 40:41 - 40:42
    And then later,
  • 40:42 - 40:45
    when our representation
    gets more complete,
  • 40:45 - 40:50
    to update those P4900s
    because as our representation gets updated,
  • 40:50 - 40:52
    becomes more dense,
  • 40:52 - 40:55
    the values of those qualifiers
    need to change
  • 40:56 - 41:00
    to represent that we've got more
    of their hierarchy in our system.
  • 41:00 - 41:04
    If somebody could do that,
    I think that would be very helpful,
  • 41:04 - 41:07
    and we do need to also
    look at other approaches
  • 41:07 - 41:11
    to improve quality and completeness
    at the hierarchy level
  • 41:11 - 41:12
    not just at the item level.
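
A minimal sketch of the kind of helper described in this comment: read skos:broader links from an external thesaurus, map its concepts to Wikidata items through an external-ID property, and emit QuickStatements lines that add P4900 (broader concept) qualifiers. The thesaurus snippet, the property P9999 and the QID mapping are all hypothetical placeholders.

```python
# Hypothetical sketch: external SKOS hierarchy -> QuickStatements with P4900.
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

THESAURUS_TTL = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://example.org/thes/coat> skos:broader <http://example.org/thes/outerwear> .
"""
g = Graph().parse(data=THESAURUS_TTL, format="turtle")

# Placeholder mapping: concept URI -> (Wikidata QID, external ID value).
# In practice this would come from a SPARQL query over the external-ID property.
qid_for_concept = {
    "http://example.org/thes/coat": ("Q11111111", "coat-001"),       # placeholder QIDs
    "http://example.org/thes/outerwear": ("Q22222222", "outer-001"),
}
EXT_ID_PROP = "P9999"  # hypothetical external-ID property for this thesaurus

for concept, broader in g.subject_objects(SKOS.broader):
    narrow = qid_for_concept.get(str(concept))
    broad = qid_for_concept.get(str(broader))
    if narrow and broad:
        qid, ext_id = narrow
        # QuickStatements v1 line: item, property, value, qualifier property, qualifier value
        print(f'{qid}\t{EXT_ID_PROP}\t"{ext_id}"\tP4900\t{broad[0]}')
```
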
  • 41:13 - 41:15
    (Andra) Can I add to that?
  • 41:16 - 41:20
    Yes, and we actually do that,
  • 41:20 - 41:24
    and I can recommend looking at
    the Shape Expression that Finn made
  • 41:24 - 41:27
    with the lexical data
    where he creates Shape Expressions
  • 41:27 - 41:30
    and then builds on other Shape Expressions
  • 41:30 - 41:33
    so you have this concept
    of linked Shape Expressions in Wikidata,
  • 41:33 - 41:35
    and specifically, the use case,
    if I understand correctly,
  • 41:35 - 41:37
    is exactly what we are doing in Gene Wiki.
  • 41:37 - 41:41
    So you have the Disease Ontology
    which is put into Wikidata
  • 41:41 - 41:45
    and then disease data comes in
    and we apply the Shape Expressions
  • 41:45 - 41:47
    to see if that fits with this thesaurus.
  • 41:47 - 41:51
    And there are other thesauruses or other
    ontologies for controlled vocabularies
  • 41:51 - 41:53
    that still need to go into Wikidata,
  • 41:53 - 41:55
    and that's exactly why
    Shape Expression is so interesting
  • 41:55 - 41:58
    because you can have a Shape Expression
    for the Disease Ontology,
  • 41:58 - 42:00
    you can have a Shape Expression for MeSH,
  • 42:00 - 42:02
    you can say: OK,
    now I want to check the quality.
  • 42:02 - 42:04
    Because you also have
    in Wikidata the context
  • 42:04 - 42:10
    of when you have a controlled vocabulary,
    you say the quality is according to this,
  • 42:10 - 42:12
    but you might have
    a disagreeing community.
  • 42:12 - 42:16
    So the tooling is indeed in place
    but now is indeed to create those models
  • 42:16 - 42:18
    and apply them
    on the different use cases.
  • 42:19 - 42:21
    (man4) The Shape Expressions are very useful
  • 42:21 - 42:26
    once you have the external ontology
    mapped into Wikidata,
  • 42:26 - 42:29
    but my problem is that
    it's getting to that stage,
  • 42:29 - 42:35
    it's working out how much of the
    external ontology isn't yet in Wikidata
  • 42:35 - 42:36
    and where the gaps are,
  • 42:36 - 42:41
    and that's where I think that
    having much more robust tools
  • 42:41 - 42:44
    to see what's missing
    from external ontologies
  • 42:44 - 42:46
    would be very helpful.
  • 42:48 - 42:49
    The biggest problem there
  • 42:49 - 42:51
    is not so much tooling
    but more licensing.
  • 42:52 - 42:55
    So getting the ontologies
    into Wikidata is actually a piece of cake
  • 42:55 - 42:59
    but most of the ontologies have,
    how can I say that politely,
  • 43:00 - 43:03
    restrictive licensing,
    so they are not compatible with Wikidata.
  • 43:04 - 43:07
    (man4) There's a huge number
    of public sector thesauruses
  • 43:07 - 43:08
    in cultural fields.
  • 43:08 - 43:11
    - (Andra) Then we need to talk.
    - (man4) Not a problem.
  • 43:11 - 43:12
    (Andra) Then we need to talk.
  • 43:14 - 43:19
    (man5) Just... the comment I want to make
    is actually an answer to James,
  • 43:19 - 43:22
    so the thing is that
    hierarchies make graphs,
  • 43:22 - 43:24
    and when you want to...
  • 43:25 - 43:29
    I want to basically talk about...
    a common problem in hierarchies
  • 43:29 - 43:31
    is cycles,
  • 43:31 - 43:34
    so they come back to each other
    when there's a problem,
  • 43:34 - 43:36
    which you should not
    have in hierarchies.
  • 43:37 - 43:41
    This, funnily enough,
    happens in categories in Wikipedia a lot;
  • 43:41 - 43:43
    we have a lot of cycles in categories,
  • 43:44 - 43:47
    but the good news is that this is...
  • 43:48 - 43:52
    Technically, it's an NP-complete problem,
    so you cannot find this
  • 43:52 - 43:53
    easily if you build a graph of that,
  • 43:54 - 43:57
    but there are lots of ways
    that have been developed
  • 43:57 - 44:01
    to find problems
    in these hierarchy graphs.
  • 44:01 - 44:05
    Like there is a paper
    called Finding Cycles...
  • 44:05 - 44:08
    Breaking Cycles in Noisy Hierarchies,
  • 44:08 - 44:13
    and it's been used to help
    categorization of English Wikipedia.
  • 44:13 - 44:17
    You can just take this
    and apply it to these hierarchies in Wikidata,
  • 44:17 - 44:20
    and then you can find
    things that are problematic
  • 44:20 - 44:22
    and just remove the ones
    that are causing issues
  • 44:22 - 44:25
    and find the issues, actually.
  • 44:25 - 44:27
    So this is just an idea, just so you...
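
A sketch of applying that idea to a Wikidata hierarchy: fetch the subclass-of (P279) edges below a root class from the query service and list cycles with networkx. Deciding which edge of each cycle to break is the hard part the cited paper addresses; the root class here is just an example, and deep subtrees may time out.

```python
# Sketch: find cycles in a Wikidata subclass-of (P279) subgraph with networkx.
# Root class Q2095 ("food") is an arbitrary example.
import requests
import networkx as nx

QUERY = """
SELECT ?child ?parent WHERE {
  ?child wdt:P279* wd:Q2095 .    # everything below the chosen root class
  ?child wdt:P279 ?parent .
}
"""
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "cycle-check-example/0.1"},
).json()["results"]["bindings"]

g = nx.DiGraph()
for row in rows:
    g.add_edge(row["child"]["value"], row["parent"]["value"])

# Any directed cycle here means an item is (transitively) a subclass of itself.
for cycle in nx.simple_cycles(g):
    print(" -> ".join(uri.rsplit("/", 1)[-1] for uri in cycle))
```
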
  • 44:29 - 44:30
    (man4) That's all very well
  • 44:30 - 44:34
    but I think you're underestimating
    the number of bad subclass relations
  • 44:34 - 44:35
    that we have.
  • 44:35 - 44:40
    It's like having a city
    in completely the wrong country,
  • 44:40 - 44:45
    and there are tools for geography
    to identify that,
  • 44:45 - 44:49
    and we need to have
    much better tools in hierarchies
  • 44:49 - 44:53
    to identify where the equivalent
    of the item for the country
  • 44:53 - 44:58
    is missing entirely,
    or where it's actually been subclassed
  • 44:58 - 45:02
    to something that means
    something completely different.
  • 45:03 - 45:07
    (Lydia) Yeah, I think
    you're getting to something
  • 45:07 - 45:12
    that my team and I keep hearing
    from people who reuse our data
  • 45:12 - 45:14
    quite a bit as well, right,
  • 45:15 - 45:17
    An individual data point might be great
  • 45:17 - 45:20
    but if you have to look
    at the ontology and so on,
  • 45:20 - 45:22
    then it gets very...
  • 45:22 - 45:26
    And I think one of the big problems
    why this is happening
  • 45:26 - 45:31
    is that a lot of editing on Wikidata
  • 45:31 - 45:35
    happens on the basis
    of an individual item, right,
  • 45:35 - 45:36
    you make an edit on that item,
  • 45:38 - 45:42
    without realizing that this
    might have very global consequences
  • 45:42 - 45:44
    on the rest of the graph, for example.
  • 45:44 - 45:50
    And if people have ideas around
    how to make this more visible,
  • 45:50 - 45:53
    the consequences
    of an individual local edit,
  • 45:54 - 45:57
    I think that would be worth exploring,
  • 45:58 - 46:02
    to show people better
    what the consequences of an edit
  • 46:02 - 46:03
    that they might make in very good faith
  • 46:04 - 46:05
    actually are.
  • 46:07 - 46:12
    Whoa! OK, let's start with, yeah, you,
    then you, then you, then you.
  • 46:12 - 46:14
    (man5) Well, after the discussion,
  • 46:14 - 46:18
    just to express my agreement
    with what James was saying.
  • 46:18 - 46:22
    So essentially, it seems
    the most dangerous thing is the hierarchy,
  • 46:22 - 46:24
    not the hierarchy, but generally
  • 46:24 - 46:28
    the semantics of the subclass relations
    seen in Wikidata, right.
  • 46:28 - 46:33
    So I've been studying languages recently,
    just for the purposes of this conference,
  • 46:33 - 46:35
    and for example, you find plenty of cases
  • 46:35 - 46:39
    where a language is a part of
    and subclass of the same thing, OK.
  • 46:39 - 46:44
    So you know, you can say
    we have a flexible ontology.
  • 46:44 - 46:46
    Wikidata gives you freedom
    to express that, sometimes.
  • 46:46 - 46:47
    Because, for example,
  • 46:47 - 46:51
    that ontology of languages
    is also politically complicated, right?
  • 46:51 - 46:55
    It is even good to be in a position
    to express a level of uncertainty.
  • 46:55 - 46:58
    But imagine anyone who wants
    to do machine reading from that.
  • 46:58 - 46:59
    So that's really problematic.
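The kind of contradiction described here can already be listed with a query. The sketch below assumes the standard Wikidata properties instance of (P31), part of (P361) and subclass of (P279), and restricts the check to languages (Q34770) to keep the result set small; dropping that restriction checks the whole graph.

```python
# Sketch: find items that are both "part of" and "subclass of" the same target.
import requests

query = """
SELECT ?item ?target WHERE {
  ?item wdt:P31 wd:Q34770 .    # instance of: language
  ?item wdt:P361 ?target .     # part of some target
  ?item wdt:P279 ?target .     # and subclass of the very same target
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "part-of-vs-subclass-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["item"]["value"], "is both part of and subclass of", row["target"]["value"])
```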
  • 46:59 - 47:00
    And then again,
  • 47:00 - 47:04
    I don't think that ontology
    was ever imported from somewhere,
  • 47:04 - 47:05
    that's something which is originally ours.
  • 47:05 - 47:08
    It was harvested from Wikipedia
    in the very beginning, I would say.
  • 47:08 - 47:11
    So I wonder...
    this Shape Expressions thing is great,
  • 47:11 - 47:16
    and also validating and fixing,
    if you like, the Wikidata ontology
  • 47:16 - 47:18
    against external resources, a beautiful idea.
  • 47:19 - 47:20
    In the end,
  • 47:20 - 47:25
    will we end up reflecting
    the external ontologies in Wikidata?
  • 47:25 - 47:29
    And also, what do we do with
    the core part of our ontology
  • 47:29 - 47:31
    which is never harvested
    from external resources,
  • 47:31 - 47:32
    how do we go and fix that?
  • 47:32 - 47:35
    And I really think that
    that will be a problem on its own.
  • 47:35 - 47:39
    We will have to focus on that
    independently of the idea
  • 47:39 - 47:41
    of validating ontology
    with something external.
  • 47:49 - 47:53
    (man6) OK, and constraints
    and shapes are very impressive,
  • 47:53 - 47:54
    what we can do with them,
  • 47:55 - 47:58
    but the main point is not
    really being made clear --
  • 47:58 - 48:03
    it's that now we can make more explicit
    what we expect from the data.
  • 48:03 - 48:07
    Before, everyone had to write
    their own tools and scripts,
  • 48:07 - 48:11
    and so now it's more visible
    and we can discuss it.
  • 48:11 - 48:14
    But it's not about
    what's wrong or right,
  • 48:14 - 48:16
    it's about an expectation,
  • 48:16 - 48:18
    and you will have different
    expectations and discussions
  • 48:18 - 48:21
    about how we want
    to model things in Wikidata,
  • 48:21 - 48:23
    and this...
  • 48:23 - 48:26
    The current state is just
    one step in that direction,
  • 48:26 - 48:28
    because right now you need
  • 48:28 - 48:31
    a lot of technical expertise
    to get into this,
  • 48:31 - 48:36
    and we need better ways
    to visualize these constraints,
  • 48:36 - 48:40
    to maybe transform them into natural language
    so people can understand them better,
  • 48:41 - 48:44
    but it's less about what's wrong or right.
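As one possible first step toward that kind of visualization, the sketch below reads the property-constraint statements (P2302) on a single property and prints their constraint types by English label; turning the constraint qualifiers into full sentences would be the follow-up work. P569 (date of birth) is used only as an example property.

```python
# Sketch: list the constraint types declared on one property, by label.
import requests

PROPERTY = "P569"  # example property: date of birth

query = f"""
SELECT ?constraintLabel WHERE {{
  wd:{PROPERTY} p:P2302 ?statement .
  ?statement ps:P2302 ?constraint .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "constraint-explainer-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(f"{PROPERTY} has a constraint of type: {row['constraintLabel']['value']}")
```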
  • 48:45 - 48:46
    (Lydia) Yeah.
  • 48:51 - 48:54
    (man7) So for quality issues,
    I just want to echo it like...
  • 48:54 - 48:57
    I've definitely found a lot of the issues
    I've encountered have been
  • 48:59 - 49:02
    differences in opinion
    between instance of versus subclass.
  • 49:02 - 49:06
    I would say there are errors in those situations,
  • 49:06 - 49:12
    and trying to find them
    has been a very time-consuming process.
  • 49:12 - 49:15
    What I've found is like:
    "Oh, if I find very high-impression items
  • 49:15 - 49:16
    that are something...
  • 49:16 - 49:22
    and then use all the subclass instances
    to find all derived statements of this,"
  • 49:22 - 49:26
    this is a very useful way
    of looking for these errors.
  • 49:26 - 49:28
    But I was curious if Shape Expressions,
  • 49:30 - 49:32
    if there is...
  • 49:32 - 49:37
    If this can be used as a tool
    to help resolve those issues but, yeah...
  • 49:41 - 49:43
    (man8) If it has a structural footprint...
  • 49:46 - 49:49
    If it has a structural footprint
    that you can...that's sort of falsifiable,
  • 49:49 - 49:51
    you can look at that
    and say well, that's wrong,
  • 49:51 - 49:53
    then yeah, you can do that.
  • 49:53 - 49:57
    But if it's just sort of
    trying to map it to real-world objects,
  • 49:57 - 49:59
    then you're just going to need
    lots and lots of brains.
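A sketch of the workflow described in the question: start from one high-level class (a placeholder Q-ID below), collect everything that is an instance of it or of any of its subclasses, and print each item together with the direct class it came in through, since the unexpected direct classes are usually where an instance-of/subclass-of mix-up sits.

```python
# Sketch: audit what ends up classified under a high-level class via P31/P279*.
import requests

ROOT = "Q12345"  # placeholder high-level class

query = f"""
SELECT ?itemLabel ?directClassLabel WHERE {{
  ?item wdt:P31 ?directClass .
  ?directClass wdt:P279* wd:{ROOT} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 500"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "subclass-audit-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "<- via ->", row["directClassLabel"]["value"])
```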
  • 50:06 - 50:09
    (man9) Hi, Pablo Mendes
    from Apple Siri Knowledge.
  • 50:09 - 50:13
    We're here to find out how to help
    the project and the community
  • 50:13 - 50:16
    but Cristina made the mistake
    of asking what we want.
  • 50:16 - 50:20
    (laughing) So I think
    one thing I'd like to see
  • 50:21 - 50:24
    is a lot around verifiability
  • 50:24 - 50:26
    which is one of the core tenets
    of the project and the community,
  • 50:27 - 50:29
    and trustworthiness.
  • 50:29 - 50:32
    Not every statement is the same,
    some of them are heavily disputed,
  • 50:32 - 50:34
    some of them are easy to guess,
  • 50:34 - 50:36
    like somebody's
    date of birth can be verified,
  • 50:36 - 50:39
    as you saw today in the Keynote,
    gender issues are a lot more complicated.
  • 50:40 - 50:42
    Can you discuss a little bit what you know
  • 50:42 - 50:47
    in this area of data quality around
    trustworthiness and verifiability?
  • 50:55 - 50:58
    If there isn't a lot,
    I'd love to see a lot more. (laughs)
  • 51:01 - 51:02
    (Lydia) Yeah.
  • 51:03 - 51:07
    Apparently, we don't have
    a lot to say on that. (laughs)
  • 51:08 - 51:12
    (Andra) I think we can do a lot,
    but I had a discussion with you yesterday.
  • 51:12 - 51:16
    My favorite example, which I learned yesterday
    has already been deprecated,
  • 51:16 - 51:20
    is if you go to Q2, which is Earth,
  • 51:20 - 51:23
    there is a statement
    that claims that the earth is flat.
  • 51:24 - 51:26
    And I love that example
  • 51:26 - 51:28
    because there is a community
    out there that claims that
  • 51:28 - 51:30
    and they have verifiable resources.
  • 51:30 - 51:32
    So I think it's a genuine case,
  • 51:32 - 51:35
    it shouldn't be deprecated,
    it should be in Wikidata.
  • 51:35 - 51:40
    And I think Shape Expressions
    can be really instrumental there,
  • 51:40 - 51:42
    because what you can say,
  • 51:42 - 51:45
    OK, I'm really interested
    in this use case,
  • 51:45 - 51:47
    or this is a use case where you disagree,
  • 51:47 - 51:51
    but there can also be a use case
    where you say OK, I'm interested.
  • 51:51 - 51:53
    So there is this example you say,
    I have glucose.
  • 51:53 - 51:56
    And glucose when you're a biologist,
  • 51:56 - 52:00
    you don't care for the chemical
    constraints of the glucose molecule,
  • 52:00 - 52:03
    you just... everything glucose
    is the same.
  • 52:03 - 52:06
    But if you're a chemist,
    you cringe when you hear that,
  • 52:06 - 52:08
    you have 200 something...
  • 52:08 - 52:10
    So then you can have
    multiple Shape Expressions,
  • 52:10 - 52:13
    OK, I'm coming in with...
    I'm taking a chemist's view,
  • 52:13 - 52:14
    I'm applying that.
  • 52:14 - 52:17
    And then you say
    I'm from a biological use case,
  • 52:17 - 52:19
    I'm applying that Shape Expression.
  • 52:19 - 52:20
    And then when you want to collaborate,
  • 52:20 - 52:23
    yes, well you should talk
    to Eric about ShEx maps.
  • 52:24 - 52:29
    And so...
    but this journey is just starting.
  • 52:29 - 52:32
    But personally, I believe
    that it can be quite instrumental in that area.
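For reusers worried about trustworthiness, one lever that already exists is statement rank: the truthy wdt: form of a property leaves deprecated statements out, while the full p:/ps: form exposes every statement together with its rank, so downstream consumers can decide for themselves what to trust. The sketch below shows the latter for Q2 (Earth); "P9999" is a placeholder for whichever property carries the disputed claim.

```python
# Sketch: list all statement values of one property on Q2 together with their rank.
import requests

query = """
SELECT ?value ?rank WHERE {
  wd:Q2 p:P9999 ?statement .      # Q2 = Earth; P9999 is a placeholder property
  ?statement ps:P9999 ?value ;
             wikibase:rank ?rank .
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "rank-filter-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    rank = row["rank"]["value"].rsplit("#", 1)[-1]   # e.g. NormalRank, DeprecatedRank
    print(row["value"]["value"], rank)
```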
  • 52:34 - 52:36
    (Lydia) OK. Over there.
  • 52:38 - 52:39
    (laughs)
  • 52:41 - 52:46
    (woman2) I had several ideas
    from some points in the discussions,
  • 52:46 - 52:51
    so I will try not to lose...
    I had three ideas so...
  • 52:52 - 52:55
    Based on what James said a while ago,
  • 52:55 - 52:59
    we have had a very, very big problem
    on Wikidata since the beginning
  • 52:59 - 53:02
    with the upper ontology.
  • 53:02 - 53:05
    We talked about that
    two years ago at WikidataCon,
  • 53:05 - 53:07
    and we talked about that at Wikimania.
  • 53:07 - 53:10
    Well, whenever we have a Wikidata meeting,
  • 53:10 - 53:12
    we are talking about that,
  • 53:12 - 53:16
    because it's a very big problem
    at the very, very top level --
  • 53:16 - 53:23
    what an entity is, what a work is,
    what a genre is, what art is --
  • 53:23 - 53:25
    these are really the biggest concepts.
  • 53:26 - 53:33
    And that's actually
    a very weak point of the global ontology
  • 53:33 - 53:37
    because people try to clean it up regularly
  • 53:38 - 53:41
    and break everything down the line,
  • 53:43 - 53:49
    because yes, I think some of you
    may remember the guy who in good faith
  • 53:49 - 53:52
    broke absolutely all cities in the world.
  • 53:52 - 53:58
    They were not geographical items anymore,
    so there were constraint violations everywhere.
  • 53:59 - 54:00
    And it was in good faith
  • 54:00 - 54:04
    because he was really
    correcting a mistake in an item,
  • 54:04 - 54:06
    but everything broke down.
  • 54:06 - 54:09
    And I'm not sure how we can solve that
  • 54:10 - 54:16
    because there is actually
    no external institution we could just copy
  • 54:16 - 54:18
    because everyone is working on...
  • 54:19 - 54:22
    Well, if I am a performing arts database,
  • 54:22 - 54:25
    I will just work
    at the performing arts level,
  • 54:25 - 54:29
    I won't go to the philosophical concept
    of what an entity is,
  • 54:29 - 54:31
    and that's actually...
  • 54:31 - 54:35
    I don't know any database
    which is working at this level,
  • 54:35 - 54:37
    but that's the weakest point of Wikidata.
  • 54:38 - 54:41
    And probably,
    when we are talking about data quality,
  • 54:41 - 54:44
    that's actually a big part of it, so...
  • 54:44 - 54:49
    And I think it's the same
    we have stated in...
  • 54:49 - 54:50
    Oh, I am sorry, I am changing the subject,
  • 54:51 - 54:56
    but we have stated
    in different sessions about qualities,
  • 54:56 - 54:59
    which is that some of us
    are actually doing a good modeling job,
  • 54:59 - 55:01
    are doing ShEx,
    are doing things like that.
  • 55:02 - 55:08
    People don't see it on Wikidata,
    they don't see the ShEx,
  • 55:08 - 55:10
    they don't see the WikiProject
    on the discussion page,
  • 55:10 - 55:11
    and sometimes,
  • 55:11 - 55:15
    they don't even see
    the talk pages of properties,
  • 55:15 - 55:20
    which explicitly state
    that this property is used for that.
  • 55:20 - 55:24
    Like last week,
    I added constraints to a property.
  • 55:24 - 55:26
    The constraint was explicitly written
  • 55:26 - 55:29
    in the discussion
    of the creation of the property.
  • 55:29 - 55:35
    I just created the technical part
    of adding the constraint, and someone:
  • 55:35 - 55:37
    "What! You broke down all my edits!"
  • 55:37 - 55:42
    And he was using the property
    wrongly for the last two years.
  • 55:42 - 55:47
    And the property was actually very clear,
    but there were no warnings or anything,
  • 55:47 - 55:50
    and so, it's the same Pink Pony
    we stated at Wikimania,
  • 55:50 - 55:55
    to make WikiProjects more visible
    or to make ShEx more visible, but...
  • 55:55 - 55:57
    And that's what Cristina said.
  • 55:57 - 56:02
    We have a visibility problem
    of what the existing solutions are.
  • 56:02 - 56:04
    And at this session,
  • 56:04 - 56:07
    we are all talking about
    how to create more ShEx,
  • 56:07 - 56:11
    or to facilitate the jobs
    of the people who are doing the cleanup.
  • 56:12 - 56:16
    But we are cleaning up
    since the first day of Wikidata,
  • 56:16 - 56:21
    and globally, we are losing,
    and we are losing because, well,
  • 56:21 - 56:23
    if I know names are complicated
  • 56:23 - 56:26
    but I am the only one
    doing the cleanup job,
  • 56:27 - 56:30
    the guy who added
    Latin script names
  • 56:30 - 56:32
    to all Chinese researchers,
  • 56:32 - 56:36
    it will take me months to clean that up
    and I can't do it alone,
  • 56:36 - 56:39
    and he did one massive batch.
  • 56:39 - 56:40
    So we really need...
  • 56:40 - 56:44
    we have a visibility problem
    more than a tool problem, I think,
  • 56:44 - 56:46
    because we have many tools.
  • 56:46 - 56:50
    (Lydia) Right, so unfortunately,
    I've got shown a sign, (laughs),
  • 56:50 - 56:52
    so we need to wrap this up.
  • 56:52 - 56:54
    Thank you so much for your comments,
  • 56:54 - 56:57
    I hope you will continue discussing
    during the rest of the day,
  • 56:57 - 56:58
    and thanks for your input.
  • 56:58 - 57:00
    (applause)