0:00:05.945,0:00:09.476
Welcome, everyone, to the Data Quality panel.
0:00:10.288,0:00:13.671
Data quality matters because[br]more and more people out there
0:00:13.672,0:00:19.289
rely on our data being in good shape,[br]so we're going to talk about data quality,
0:00:20.029,0:00:26.000
and there will be four speakers[br]who will give short introductions
0:00:26.000,0:00:29.539
on topics related to data quality[br]and then we will have a Q and A.
0:00:30.130,0:00:32.234
And the first one is Lucas.
0:00:34.385,0:00:35.385
Thank you.
0:00:35.901,0:00:39.899
Hi, I'm Lucas, and I'm going[br]to start with an overview
0:00:39.899,0:00:43.806
of data quality tools[br]that we already have on Wikidata
0:00:43.807,0:00:46.109
and also some things[br]that are coming up soon.
0:00:46.932,0:00:50.623
And I've grouped them[br]into some general themes
0:00:50.623,0:00:53.761
of making errors more visible,[br]making problems actionable,
0:00:53.762,0:00:56.322
getting more eyes on the data[br]so that people notice the problems,
0:00:56.945,0:01:02.616
fixing some common sources of errors,[br]maintaining the quality of the existing data
0:01:02.616,0:01:03.966
and also human curation.
0:01:05.063,0:01:09.874
And the ones that are currently available[br]start with property constraints.
0:01:10.388,0:01:12.421
So you've probably seen this[br]if you're on Wikidata.
0:01:12.422,0:01:14.029
You can sometimes get these icons
0:01:14.530,0:01:17.241
which check[br]the internal consistency of the data.
0:01:17.242,0:01:20.800
For example,[br]if one event follows the other,
0:01:20.801,0:01:23.760
then the other event should[br]also be followed by this one,
0:01:23.761,0:01:27.161
which on the WikidataCon item[br]was apparently missing.
0:01:27.162,0:01:29.360
I'm not sure,[br]this feature is a few days old.
0:01:30.040,0:01:34.681
And there's also,[br]if this is too limited or simple for you,
0:01:34.682,0:01:38.080
you can write any checks you want[br]using the Query Service
0:01:38.081,0:01:39.842
which is useful for[br]lots of things of course,
0:01:39.843,0:01:44.543
but you can also use it[br]for finding errors.
0:01:44.544,0:01:46.974
Like if you've noticed[br]one occurrence of a mistake,
0:01:46.975,0:01:49.709
then you can check[br]if there are other places
0:01:49.710,0:01:51.958
where people have made[br]a very similar error
0:01:51.958,0:01:53.438
and find that with the Query Service.
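[Editor's sketch] The "follows / followed by" check mentioned above can be phrased as a Query Service query. This is a minimal sketch, assuming P155 (follows) and P156 (followed by) are the properties involved. The helper names `build_missing_inverse_query` and `run_query` are illustrative and not part of any tool named in the talk.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_missing_inverse_query(limit=50):
    """Find items that 'follow' another item which lacks the matching 'followed by'."""
    return f"""
SELECT ?item ?previous WHERE {{
  ?item wdt:P155 ?previous .                          # item follows previous
  FILTER NOT EXISTS {{ ?previous wdt:P156 ?item . }}  # previous is not 'followed by' item
}}
LIMIT {int(limit)}
""".strip()


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the query to the endpoint and return the JSON result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "data-quality-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Run over the whole graph, a query like this may be slow, so in practice it probably needs to be narrowed to one class or WikiProject first.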
0:01:53.439,0:01:54.559
You can also combine the two
0:01:54.560,0:01:57.874
and search for constraint violations[br]in the Query Service,
0:01:57.875,0:02:01.240
for example,[br]only the violations in some area
0:02:01.241,0:02:03.762
or WikiProject that's relevant to you,
0:02:03.762,0:02:06.828
although the results are currently[br]not complete, sadly.
0:02:08.422,0:02:09.877
There is revision scoring.
0:02:10.690,0:02:12.666
That's... I think this is[br]from the recent changes;
0:02:12.667,0:02:16.217
you can also get it on your watchlist:[br]an automatic assessment
0:02:16.217,0:02:20.249
of whether this edit is likely to be[br]in good faith or in bad faith
0:02:20.250,0:02:22.312
and whether it is likely to be[br]damaging or not damaging,
0:02:22.313,0:02:24.205
I think those are the two dimensions.
0:02:24.206,0:02:25.686
So you can, if you want,
0:02:25.687,0:02:29.898
focus on just looking through[br]the damaging but good faith edits.
0:02:29.899,0:02:32.523
If you're feeling particularly[br]friendly and welcoming
0:02:32.524,0:02:37.121
you can tell these editors,[br]"Thank you for your contribution,
0:02:37.122,0:02:40.560
here's how you should have done it[br]but thank you, still."
0:02:40.561,0:02:42.186
And if you're not feeling that way,
0:02:42.187,0:02:44.452
you can go through[br]the bad faith, damaging edits,
0:02:44.453,0:02:45.573
and revert the vandals.
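[Editor's sketch] The two dimensions described above can be combined into a simple triage rule. This is only an illustration: the function `triage`, the probability inputs and the 0.5 threshold are assumptions, not the scoring service's actual interface.

```python
def triage(damaging, goodfaith, threshold=0.5):
    """Bucket an edit by two probabilities in [0, 1] (hypothetical inputs)."""
    is_damaging = damaging >= threshold
    is_goodfaith = goodfaith >= threshold
    if not is_damaging:
        return "ok"  # nothing to patrol
    # Damaging edits split into "help the newcomer" and "revert the vandal".
    return "coach" if is_goodfaith else "revert"
```

A patroller in a welcoming mood would filter for `"coach"`, and one hunting vandalism for `"revert"`.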
0:02:47.544,0:02:49.761
There's also, similar to that,[br]entity scoring.
0:02:49.762,0:02:52.590
So instead of scoring an edit,[br]the change that it made,
0:02:52.591,0:02:53.904
you score the whole revision,
0:02:53.904,0:02:56.483
and I think that is[br]the same quality measure
0:02:56.483,0:02:59.863
that Lydia mentioned[br]at the beginning of the conference.
0:03:00.372,0:03:04.569
There's a user script up here[br]that gives you a score of like one to five,
0:03:04.570,0:03:08.176
I think it was, of what the quality[br]of the current item is.
0:03:10.043,0:03:15.528
The primary sources tool is for[br]any database that you want to import,
0:03:15.528,0:03:18.364
but that's not high enough quality[br]to directly add to Wikidata,
0:03:18.374,0:03:20.335
so you add it[br]to the primary sources tool instead,
0:03:20.336,0:03:22.956
and then humans can decide
0:03:22.956,0:03:26.024
should they add[br]these individual statements or not.
0:03:28.595,0:03:31.901
Showing coordinates as maps[br]is mainly a convenience feature
0:03:31.901,0:03:33.588
but it's also useful for quality control.
0:03:33.588,0:03:36.937
Like if you see this is supposed to be[br]the office of Wikimedia Germany
0:03:36.938,0:03:39.400
and if the coordinates[br]are somewhere in the Indian Ocean,
0:03:39.401,0:03:41.529
then you know that[br]something is not right there
0:03:41.530,0:03:44.790
and you can see it much more easily[br]than if you just had the numbers.
0:03:46.382,0:03:49.576
This is a gadget called[br]the relative completeness indicator
0:03:49.577,0:03:52.480
which shows you this little icon here
0:03:53.007,0:03:55.652
telling you how complete[br]it thinks this item is
0:03:55.652,0:03:57.613
and also which properties[br]are most likely missing,
0:03:57.614,0:03:59.769
which is really useful[br]if you're editing an item
0:03:59.769,0:04:03.172
and you're in an area[br]that you're not very familiar with
0:04:03.172,0:04:05.661
and you don't know what[br]the right properties to use are,
0:04:05.662,0:04:08.230
then this is a very useful gadget to have.
0:04:09.604,0:04:11.401
And we have Shape Expressions.
0:04:11.402,0:04:15.624
I think Andra or Jose[br]are going to talk more about those
0:04:15.624,0:04:19.757
but basically, a very powerful way[br]of comparing the data you have
0:04:19.758,0:04:20.758
against the schema,
0:04:20.759,0:04:22.680
like what statement should[br]certain entities have,
0:04:22.681,0:04:25.677
what other entities should they link to[br]and what should those look like,
0:04:26.229,0:04:29.374
and then you can find problems that way.
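[Editor's sketch] The idea of comparing data against a schema can be illustrated without ShEx syntax. The following plain-Python stand-in checks only how many statements an entity has per property. The schema dictionary and the function `check_entity` are hypothetical, and real Shape Expressions are far more expressive.

```python
# Hypothetical schema: property ID -> (min, max) allowed number of statements.
# P50 is "author", P1476 is "title"; None means "no upper bound".
WRITTEN_WORK_SCHEMA = {"P50": (1, None), "P1476": (1, 1)}


def check_entity(statements, schema):
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    for prop, (minimum, maximum) in schema.items():
        count = len(statements.get(prop, []))
        if count < minimum:
            problems.append(f"{prop}: expected at least {minimum}, found {count}")
        if maximum is not None and count > maximum:
            problems.append(f"{prop}: expected at most {maximum}, found {count}")
    return problems
```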
0:04:30.366,0:04:32.361
I think... No there is still more.
0:04:32.362,0:04:34.321
Integraality or property dashboard.
0:04:34.322,0:04:36.773
It gives you a quick overview[br]of the data you already have.
0:04:36.774,0:04:39.147
For example, this is from[br]the WikiProject Red Pandas,
0:04:39.657,0:04:41.681
and you can see that[br]we have a sex or gender
0:04:41.682,0:04:43.561
for almost all of the red pandas,
0:04:43.561,0:04:46.854
the date of birth varies a lot[br]by which zoo they come from
0:04:46.854,0:04:50.255
and we have almost[br]no dead pandas which is wonderful,
0:04:51.437,0:04:52.600
because they're so cute.
0:04:53.699,0:04:55.654
So this is also useful.
0:04:56.377,0:04:59.185
There we go, OK,[br]now for the things that are coming up.
0:04:59.889,0:05:03.784
Wikidata Bridge, or also known,[br]formerly known as client editing,
0:05:03.785,0:05:07.076
so editing Wikidata[br]from Wikipedia infoboxes
0:05:07.675,0:05:11.725
which will on the one hand[br]get more eyes on the data
0:05:11.725,0:05:13.441
because more people can see the data there
0:05:13.441,0:05:18.841
and it will hopefully encourage[br]more use of Wikidata in the Wikipedias
0:05:18.841,0:05:20.920
and that means that more[br]people can notice
0:05:20.921,0:05:23.389
if, for example some data is outdated[br]and needs to be updated
0:05:23.857,0:05:27.000
instead of if they would[br]only see it on Wikidata itself.
0:05:28.630,0:05:30.656
There is also tainted references.
0:05:30.657,0:05:33.959
The idea here is that[br]if you edit a statement value,
0:05:34.683,0:05:37.279
you might want to update[br]the references as well,
0:05:37.280,0:05:39.373
unless it was just a typo or something.
0:05:39.897,0:05:43.662
And this tainted references feature[br]tells editors that,
0:05:43.663,0:05:49.756
and it also lets other editors[br]see which other edits were made
0:05:49.756,0:05:52.471
that edited a statement value[br]and didn't update a reference,
0:05:52.472,0:05:56.766
so you can clean up after that[br]and decide should that be...
0:05:57.737,0:05:59.566
Do you need to do anything more about that
0:05:59.566,0:06:02.796
or is that actually fine and[br]you don't need to update the reference.
0:06:03.543,0:06:09.336
That's related to signed statements[br]which is coming from a concern, I think,
0:06:09.336,0:06:12.355
that some data providers have that like...
0:06:14.131,0:06:17.231
There's a statement that's referenced[br]to UNESCO or something
0:06:17.232,0:06:19.872
and then suddenly,[br]someone vandalizes the statement
0:06:19.873,0:06:21.836
and they are worried[br]that it will look like
0:06:22.827,0:06:26.992
this organization, like UNESCO,[br]still said this vandalism value
0:06:26.993,0:06:28.706
and so, with signed statements,
0:06:28.706,0:06:31.488
they can cryptographically[br]sign this reference
0:06:31.488,0:06:33.562
and that doesn't prevent any edits to it,
0:06:34.169,0:06:37.744
but at least, if someone[br]vandalizes the statement
0:06:37.744,0:06:40.255
or edits it in any way,[br]then the signature is no longer valid,
0:06:40.255,0:06:43.401
and you can tell this is not exactly[br]what the organization said,
0:06:43.402,0:06:47.064
and perhaps it's a good edit[br]and they should re-sign the new statement,
0:06:47.065,0:06:49.851
but also perhaps it should be reverted.
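[Editor's sketch] The talk does not say how signed statements would be implemented. The sketch below only illustrates the property described: any edit invalidates the signature without being blocked. It assumes an HMAC with a shared key as a stand-in for a real public-key signature, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_statement(statement, key):
    """Return a hex signature over a canonical JSON form of the statement."""
    canonical = json.dumps(statement, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def is_signature_valid(statement, signature, key):
    """True only if the statement is exactly what was signed."""
    return hmac.compare_digest(sign_statement(statement, key), signature)
```

An edited statement then fails the check, and a human decides whether it should be re-signed or reverted.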
0:06:51.203,0:06:54.166
And also, this is going[br]to be very exciting, I think,
0:06:54.166,0:06:56.846
Citoid is this amazing system[br]they have on Wikipedia
0:06:57.379,0:07:01.340
where you can paste a URL,[br]or an identifier, or an ISBN
0:07:01.340,0:07:04.759
or Wikidata ID or basically[br]anything into the Visual Editor,
0:07:05.260,0:07:08.241
and it spits out a reference[br]that is nicely formatted
0:07:08.242,0:07:11.049
and has all the data you want[br]and it's wonderful to use.
0:07:11.049,0:07:14.337
And by comparison, on Wikidata,[br]if I want to add a reference
0:07:14.338,0:07:18.801
I typically have to add a reference URL,[br]title, author name string,
0:07:18.802,0:07:20.449
published in, publication date,
0:07:20.450,0:07:25.141
retrieved date,[br]at least those, and that's annoying,
0:07:25.141,0:07:29.261
and integrating Citoid into Wikibase[br]will hopefully help with that.
0:07:30.245,0:07:33.604
And I think[br]that's all the ones I had, yeah.
0:07:33.604,0:07:36.400
So now, I'm going to pass to Cristina.
0:07:37.788,0:07:42.339
(applause)
0:07:43.780,0:07:45.471
Hi, I'm Cristina.
0:07:45.472,0:07:47.672
I'm a research scientist[br]from the University of Zürich,
0:07:47.673,0:07:51.417
and I'm also an active member[br]of the Swiss Community.
0:07:52.698,0:07:57.901
When Claudia Müller-Birn[br]and I submitted this to the WikidataCon,
0:07:57.902,0:08:00.410
what we wanted to do[br]is continue our discussion
0:08:00.411,0:08:02.424
that we started[br]in the beginning of the year
0:08:02.424,0:08:07.442
with a workshop on data quality[br]and also some sessions in Wikimania.
0:08:07.442,0:08:10.535
So the goal of this talk[br]is basically to bring some thoughts
0:08:10.536,0:08:14.432
that we have been collecting[br]from the community and ourselves
0:08:14.432,0:08:16.560
and continue the discussion.
0:08:16.561,0:08:20.065
So what we would like is to continue[br]interacting a lot with you.
0:08:21.557,0:08:23.371
So what we think is very important
0:08:23.372,0:08:27.580
is that we continuously ask[br]all types of users in the community
0:08:27.581,0:08:32.240
about what they really need,[br]what problems they have with data quality,
0:08:32.240,0:08:35.000
not only editors[br]but also the people who are coding,
0:08:35.000,0:08:36.241
or consuming the data,
0:08:36.242,0:08:39.494
and also researchers who are[br]actually using all the edit history
0:08:39.494,0:08:40.800
to analyze what is happening.
0:08:42.367,0:08:48.431
So we did a review of around 80 tools[br]that exist in Wikidata
0:08:48.431,0:08:52.380
and we aligned them to the different[br]data quality dimensions.
0:08:52.380,0:08:54.360
And what we saw was that actually,
0:08:54.361,0:08:57.681
many of them were looking at,[br]monitoring completeness,
0:08:57.682,0:09:02.820
but actually... and also some of them[br]are also enabling interlinking.
0:09:02.820,0:09:08.442
But there is a big need for tools[br]that are looking into diversity,
0:09:08.443,0:09:12.824
which is one of the things[br]that we actually can have in Wikidata,
0:09:12.824,0:09:15.958
especially[br]this design principle of Wikidata
0:09:15.959,0:09:17.901
where we can have plurality
0:09:17.902,0:09:20.308
and different statements[br]with different values
0:09:21.034,0:09:22.236
coming from different sources.
0:09:22.236,0:09:24.921
Because it's a secondary source,[br]we don't really have tools
0:09:24.922,0:09:27.750
that actually tell us how many[br]plural statements there are,
0:09:27.751,0:09:30.889
and how many we can improve and how,
0:09:30.890,0:09:32.833
and we also don't know really
0:09:32.833,0:09:35.538
what are all the reasons[br]for plurality that we can have.
0:09:36.491,0:09:39.201
So from these community meetings,
0:09:39.201,0:09:43.084
what we discussed was the challenges[br]that still need attention.
0:09:43.084,0:09:47.249
For example, that having[br]all these crowdsourcing communities
0:09:47.249,0:09:49.613
is very good because different people[br]attack different parts
0:09:49.613,0:09:51.833
of the data or the graph,
0:09:51.834,0:09:54.615
and we also have[br]different background knowledge
0:09:54.616,0:09:59.161
but actually, it's very difficult to align[br]everything in something homogeneous
0:09:59.162,0:10:04.920
because different people are using[br]different properties in different ways
0:10:04.920,0:10:08.401
and they are also expecting[br]different things from entity descriptions.
0:10:09.003,0:10:12.721
People also said that[br]they also need more tools
0:10:12.722,0:10:16.000
that give a better overview[br]of the global status of things.
0:10:16.000,0:10:20.733
So what entities are missing[br]in terms of completeness,
0:10:20.733,0:10:26.121
but also like what are people[br]working on right now most of the time,
0:10:26.121,0:10:30.516
and they also mention many times[br]a tighter collaboration
0:10:30.517,0:10:33.311
across not only languages[br]but the WikiProjects
0:10:33.311,0:10:35.571
and the different Wikimedia platforms.
0:10:35.571,0:10:38.859
And we published[br]all the transcribed comments
0:10:38.860,0:10:42.959
from all these discussions[br]in those links here in the Etherpads
0:10:42.959,0:10:46.162
and also in the wiki page of Wikimania.
0:10:46.162,0:10:48.481
Some solutions that appeared actually
0:10:48.481,0:10:53.001
were going into the direction[br]of sharing more the best practices
0:10:53.001,0:10:55.762
that are being developed[br]in different WikiProjects,
0:10:55.762,0:11:01.238
but also people want tools[br]that help organize work in teams
0:11:01.239,0:11:03.845
or at least understanding[br]who is working on that,
0:11:03.845,0:11:07.815
and they were also mentioning[br]that they want more showcases
0:11:07.816,0:11:12.019
and more templates that help them[br]create things in a better way.
0:11:12.946,0:11:15.161
And from the contact that we have
0:11:15.162,0:11:18.721
with Open Governmental Data Organizations,
0:11:18.722,0:11:20.068
and in particular,
0:11:20.068,0:11:23.102
I am in contact with the canton[br]and the city of Zürich,
0:11:23.102,0:11:26.207
they are very interested[br]in working with Wikidata
0:11:26.207,0:11:29.896
because they want their data[br]to be accessible for everyone
0:11:29.897,0:11:33.681
in the place where people go[br]and consult or access data.
0:11:33.682,0:11:36.550
So for them, something that[br]would be really interesting
0:11:36.551,0:11:38.600
is to have some kind of quality indicators
0:11:38.600,0:11:41.082
both in the wiki,[br]which is already happening,
0:11:41.082,0:11:42.801
but also in SPARQL results,
0:11:42.802,0:11:46.066
to know whether they can trust[br]or not that data from the community.
0:11:46.067,0:11:48.230
And then, they also want to know
0:11:48.230,0:11:51.417
what parts of their own data sets[br]are useful for Wikidata
0:11:51.418,0:11:56.040
and they would love to have a tool that[br]can help them assess that automatically.
0:11:56.041,0:11:59.066
They also need[br]some kind of methodology or tool
0:11:59.067,0:12:03.894
that helps them decide whether[br]they should import or link their data
0:12:03.894,0:12:04.894
because in some cases,
0:12:04.895,0:12:07.137
they also have their own[br]linked open data sets,
0:12:07.138,0:12:09.746
so they don't know whether[br]to just ingest the data
0:12:09.747,0:12:13.424
or to keep on creating links[br]from the data sets to Wikidata
0:12:13.425,0:12:14.425
and the other way around.
0:12:14.950,0:12:20.043
And they also want to know where[br]their websites are referred to in Wikidata.
0:12:20.044,0:12:23.361
And when they run such a query[br]in the query service,
0:12:23.362,0:12:24.848
they often get timeouts,
0:12:24.849,0:12:28.181
so maybe we should[br]really create more tools
0:12:28.181,0:12:32.240
that help them get these answers[br]for their questions.
0:12:33.148,0:12:36.208
And, besides that,
0:12:36.208,0:12:39.361
we wiki researchers also sometimes
0:12:39.362,0:12:42.023
lack some information[br]in the edit summaries.
0:12:42.024,0:12:44.953
So I remember that when[br]we were doing some work
0:12:44.954,0:12:48.919
to understand[br]the different behavior of editors
0:12:48.919,0:12:53.403
with tools or bots[br]or anonymous users and so on,
0:12:53.403,0:12:56.154
we were really lacking, for example,
0:12:56.154,0:13:01.112
a standard way of tracing[br]that tools were being used.
0:13:01.113,0:13:03.154
And there are some tools[br]that are already doing that
0:13:03.155,0:13:05.230
like PetScan and many others,
0:13:05.230,0:13:07.720
but maybe we should in the community
0:13:07.721,0:13:13.531
discuss more about how to record these[br]for fine-grained provenance.
0:13:14.169,0:13:15.321
And further on,
0:13:15.322,0:13:20.801
we think that we need to think[br]of more concrete data quality dimensions
0:13:20.802,0:13:24.961
that are related to linked data[br]but not all the types of data,
0:13:24.962,0:13:30.721
so we worked on some measures[br]to actually assess the information gain
0:13:30.722,0:13:33.881
enabled by the links,[br]and what we mean by that
0:13:33.882,0:13:36.681
is that when we link[br]Wikidata to other data sets,
0:13:36.682,0:13:38.201
we should also be thinking
0:13:38.202,0:13:41.921
how much the entities are actually[br]gaining in the classification,
0:13:41.922,0:13:45.601
also in the description[br]but also in the vocabularies they use.
0:13:45.602,0:13:51.041
So just to give a very simple[br]example of what I mean with this
0:13:51.042,0:13:54.269
is we can think of--[br]in this case, would be Wikidata
0:13:54.270,0:13:57.771
or the external data set[br]that is linking to Wikidata,
0:13:57.772,0:14:00.487
we have the entity for a person[br]that is called Natasha Noy,
0:14:00.487,0:14:02.601
we have the affiliation and other things,
0:14:02.602,0:14:05.239
and then we say OK,[br]we link to an external place,
0:14:05.240,0:14:08.919
and that entity also has that name,[br]but we actually have the same value.
0:14:08.920,0:14:12.889
So what would be better is that we link[br]to something that has a different name,
0:14:12.889,0:14:16.881
that is still valid because this person[br]has two ways of writing the name,
0:14:16.882,0:14:19.714
and also other information[br]that we don't have in Wikidata
0:14:19.715,0:14:21.760
or that we don't have[br]in the other data set.
0:14:22.390,0:14:24.652
But also, what is even better
0:14:24.653,0:14:27.770
is that we are actually[br]looking in the target data set
0:14:27.770,0:14:31.392
that they also have new ways[br]of classifying the information.
0:14:31.393,0:14:35.354
So not only is this a person,[br]but in the other data set,
0:14:35.355,0:14:39.525
they also say it's a female[br]or anything else that they classify with.
0:14:39.526,0:14:43.401
And if in the other data set,[br]they are using many other vocabularies
0:14:43.402,0:14:46.588
that is also helping in their whole[br]information retrieval thing.
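[Editor's sketch] One simple way to picture the information gained through a link is to list what the linked entity says that the local one does not. This is not the speakers' actual measure. The function `link_gain` and the sample values, including the alternative name spelling, are illustrative assumptions.

```python
def link_gain(local, remote):
    """For each property, list the remote values that the local description lacks."""
    gain = {}
    for prop, remote_values in remote.items():
        new_values = sorted(set(remote_values) - set(local.get(prop, [])))
        if new_values:
            gain[prop] = new_values
    return gain


# Illustrative data only: a link to an identical description gains nothing,
# while a second name form and an extra classification both count as gain.
local = {"name": ["Natasha Noy"], "type": ["person"]}
remote = {"name": ["Natasha Noy", "Natalya F. Noy"], "type": ["person", "female"]}
print(link_gain(local, remote))
```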
0:14:47.371,0:14:51.233
So with that, I also would like to say
0:14:51.234,0:14:55.809
that we think that we can[br]showcase federated queries better
0:14:55.810,0:15:00.448
because when we look at the query log[br]provided by Malyshev et al.,
0:15:01.285,0:15:04.301
we see actually that[br]from the organic queries,
0:15:04.302,0:15:06.921
we have only very few federated queries.
0:15:06.922,0:15:12.801
And actually, federation is one[br]of the key advantages of having linked data,
0:15:12.802,0:15:16.903
so maybe the community[br]or the people using Wikidata
0:15:16.903,0:15:18.898
also need more examples on this.
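[Editor's sketch] A federated query hands part of its pattern to another endpoint through the SPARQL `SERVICE` keyword. This is a minimal sketch, assuming the item carries an "exact match" (P2888) URI that the remote endpoint knows about. The endpoint URL in the usage line is a placeholder, and the Query Service only federates with endpoints on its allowlist.

```python
def build_federated_query(remote_endpoint, item="wd:Q42"):
    """Build a query that follows an exact-match link into a remote endpoint."""
    return f"""
SELECT ?remoteEntity ?remoteProperty ?remoteValue WHERE {{
  {item} wdt:P2888 ?remoteEntity .        # exact match URI recorded on Wikidata
  SERVICE <{remote_endpoint}> {{          # this block is evaluated remotely
    ?remoteEntity ?remoteProperty ?remoteValue .
  }}
}}
LIMIT 20
""".strip()


# Placeholder endpoint, for illustration only.
print(build_federated_query("https://example.org/sparql"))
```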
0:15:18.898,0:15:22.666
And if we look at the list[br]of endpoints that are being used,
0:15:22.667,0:15:25.401
this is not a complete list[br]and we have many more.
0:15:25.402,0:15:30.479
Of course, this data was analyzed[br]from queries until March 2018,
0:15:30.480,0:15:34.807
but we should look into the list[br]of federated endpoints that we have
0:15:34.808,0:15:37.048
and see whether[br]we are really using them or not.
0:15:37.813,0:15:40.441
So two questions that[br]I have for the audience
0:15:40.442,0:15:43.001
that maybe we can use[br]afterwards for the discussion are:
0:15:43.001,0:15:46.001
what data quality problems[br]should be addressed in your opinion,
0:15:46.002,0:15:47.412
because of the needs that you have,
0:15:47.412,0:15:50.401
but also, where do you need[br]more automation
0:15:50.402,0:15:52.943
to help you with editing or patrolling.
0:15:53.866,0:15:55.146
That's all, thank you very much.
0:15:55.779,0:15:57.527
(applause)
0:16:06.030,0:16:08.595
(Jose Emilio Labra) OK,[br]so what I'm going to talk about
0:16:08.595,0:16:14.715
is some tools that we were developing[br]related to Shape Expressions.
0:16:15.536,0:16:19.371
So this is what I want to talk...[br]I am Jose Emilio Labra,
0:16:19.371,0:16:23.215
but this has... all these tools[br]have been done by different people,
0:16:23.920,0:16:28.480
mainly related with W3C ShEx,[br]Shape Expressions Community Group.
0:16:28.481,0:16:29.481
ShEx Community Group.
0:16:30.144,0:16:36.081
So the first tool that I want to mention[br]is RDFShape, this is a general tool,
0:16:36.082,0:16:40.681
because Shape Expressions[br]is not only for Wikidata,
0:16:40.682,0:16:44.168
Shape Expressions is a language[br]to validate RDF in general.
0:16:44.168,0:16:47.568
So this tool was developed mainly by me
0:16:47.568,0:16:50.880
and it's a tool[br]to validate RDF in general.
0:16:50.881,0:16:55.139
So if you want to learn about RDF[br]or you want to validate RDF
0:16:55.140,0:16:58.621
or SPARQL endpoints not only in Wikidata,
0:16:58.622,0:17:00.891
my advice is that you can use this tool.
0:17:00.891,0:17:03.255
Also for teaching.
0:17:03.255,0:17:05.640
I am a teacher in the university
0:17:05.641,0:17:09.151
and I use it in my semantic web course[br]to teach RDF.
0:17:09.161,0:17:12.121
So if you want to learn RDF,[br]I think it's a good tool.
0:17:13.033,0:17:17.598
For example, this is just a visualization[br]of an RDF graph with the tool.
0:17:18.587,0:17:22.643
But before coming here, in the last month,
0:17:22.643,0:17:28.441
I started a fork of RDFShape specifically[br]for Wikidata, because I thought...
0:17:28.443,0:17:33.082
It's called WikiShape, and yesterday,[br]I presented it as a present for Wikidata.
0:17:33.082,0:17:34.441
So what I took is...
0:17:34.442,0:17:39.898
What I did is to remove all the stuff[br]that was not related with Wikidata
0:17:39.898,0:17:44.801
and to put several things, hard-coded,[br]for example, the Wikidata SPARQL endpoint,
0:17:44.802,0:17:49.041
but now, someone asked me[br]if I could do it also for Wikibase.
0:17:49.042,0:17:52.000
And it is very easy[br]to do it for Wikibase also.
0:17:52.760,0:17:56.280
So this tool, WikiShape, is quite new.
0:17:57.015,0:17:59.843
I think it works, most of the features,
0:17:59.844,0:18:02.468
but there are some features[br]that maybe don't work,
0:18:02.469,0:18:06.281
and if you try it and you want[br]to improve it, please tell me.
0:18:06.281,0:18:12.680
So this is [inaudible] captures,[br]but I think I can even try so let's try.
0:18:15.385,0:18:16.945
So let's see if it works.
0:18:16.953,0:18:20.070
First, I have to go out of the...
0:18:22.453,0:18:23.453
Here.
0:18:24.226,0:18:28.324
Alright, yeah. So this is the tool here.
0:18:28.324,0:18:29.844
Things that you can do with the tool,
0:18:29.845,0:18:35.275
for example, is that you can[br]check schemas, entity schemas.
0:18:35.276,0:18:38.611
You know that there is[br]a new namespace which is "E whatever,"
0:18:38.612,0:18:44.805
so here, if you start to write,[br]for example, "human"...
0:18:44.806,0:18:48.812
As you are writing,[br]its autocomplete allows you to check,
0:18:48.812,0:18:52.001
for example,[br]this is the Shape Expressions of a human,
0:18:52.790,0:18:55.937
and this is the Shape Expressions here.
0:18:55.938,0:18:59.841
And as you can see,[br]this editor has syntax highlighting,
0:18:59.842,0:19:04.559
this is... well,[br]maybe it's very small, the screen.
0:19:05.676,0:19:07.590
I can try to do it bigger.
0:19:09.194,0:19:10.973
Maybe you see it better now.
0:19:10.973,0:19:14.241
So... and this is the editor[br]with syntax highlighting and also has...
0:19:14.241,0:19:17.851
I mean, this editor[br]comes from the same source code
0:19:17.851,0:19:19.641
as the Wikidata query service.
0:19:19.642,0:19:23.960
So for example,[br]if you hover with the mouse here,
0:19:23.961,0:19:27.961
it shows you the labels[br]of the different properties.
0:19:27.962,0:19:31.298
So I think it's very helpful because now,
0:19:32.588,0:19:38.601
the entity schema editor[br]in Wikidata is just a plain text area,
0:19:38.602,0:19:42.493
and I think this editor is much better[br]because it has autocomplete
0:19:42.494,0:19:43.743
and it also has...
0:19:43.744,0:19:48.241
I mean, if you, for example,[br]wanted to add a constraint,
0:19:48.241,0:19:51.570
you say "wdt:"
0:19:51.570,0:19:56.884
You start writing "author"[br]and then you press Ctrl+Space
0:19:56.884,0:19:58.922
and it suggests the different things.
0:19:58.922,0:20:02.388
So this is similar[br]to the Wikidata query service
0:20:02.389,0:20:06.445
but specifically for Shape Expressions
0:20:06.445,0:20:11.975
because my feeling is that[br]creating Shape Expressions
0:20:11.976,0:20:15.841
is not more difficult[br]than writing SPARQL queries.
0:20:15.842,0:20:21.255
So some people think[br]that it's at the same level.
0:20:22.278,0:20:26.296
It's probably easier, I think,[br]because with Shape Expressions,
0:20:26.296,0:20:31.241
when we designed it,[br]we were trying to make it easier to work with.
0:20:31.242,0:20:35.001
OK, so this is one of the first things,[br]that you have this editor
0:20:35.001,0:20:36.620
for Shape Expressions.
0:20:37.371,0:20:41.467
And then you also have the possibility,[br]for example, to visualize.
0:20:41.468,0:20:44.801
If you have a Shape Expression,[br]use for example...
0:20:44.802,0:20:49.386
I think, "written work" is[br]a nice Shape Expression
0:20:49.386,0:20:53.300
because it has some relationships[br]between different things.
0:20:54.823,0:20:58.160
And this is the UML visualization[br]of written work.
0:20:58.161,0:21:02.090
In UML, it is easy to see[br]the different properties.
0:21:02.790,0:21:06.794
When you do this, I realized[br]when I tried with several people,
0:21:06.795,0:21:09.216
they find some mistakes[br]in their Shape Expressions
0:21:09.217,0:21:12.988
because it's easy to detect which are[br]the missing properties or whatever.
0:21:13.588,0:21:15.771
Then there is another possibility here
0:21:15.772,0:21:19.520
is that you can also validate,[br]I think I have it here, the validation.
0:21:20.496,0:21:25.285
I think I had it in some tab,[br]maybe I closed it.
0:21:26.267,0:21:30.988
OK, but you can, for example,[br]you can click here, Validate entities.
0:21:32.308,0:21:34.232
You, for example,
0:21:35.404,0:21:41.921
"Q42" with "E42" which is author.
0:21:42.818,0:21:46.180
With "human,"[br]I think we can do it with "human."
0:21:49.050,0:21:50.050
And then it's...
0:21:50.688,0:21:56.365
And it's taking a little while to do it[br]because this is doing the SPARQL queries
0:21:56.365,0:21:59.134
and now, for example,[br]it's failing by the network but...
0:21:59.657,0:22:01.580
So you can try it.
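[Editor's sketch] For anyone trying the same validation outside the demo, the two inputs are an entity's RDF and a schema's ShEx text. The sketch below only builds the Wikidata URLs where those can be fetched. It does not reproduce how WikiShape works internally, and the helper names are hypothetical.

```python
def entity_data_url(entity_id, fmt="ttl"):
    """URL of an entity's data export, e.g. Turtle for Q42."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id.upper()}.{fmt}"


def entity_schema_text_url(schema_id):
    """URL of the raw ShEx text of an entity schema, e.g. E42."""
    return f"https://www.wikidata.org/wiki/Special:EntitySchemaText/{schema_id.upper()}"


print(entity_data_url("q42"))
print(entity_schema_text_url("e42"))
```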
0:22:02.759,0:22:07.026
OK, so let's continue[br]with the presentation, with other tools.
0:22:07.026,0:22:12.353
So my advice is that if you want to try it[br]and you have any feedback, let me know.
0:22:13.133,0:22:15.540
So to continue with the presentation...
0:22:18.923,0:22:20.233
So this is WikiShape.
0:22:23.800,0:22:26.509
Then, I already said this,
0:22:27.681,0:22:34.157
the Shape Expressions Editor[br]is an independent project in GitHub.
0:22:35.605,0:22:37.472
You can use it in your own project.
0:22:37.472,0:22:41.036
If you want to do[br]a Shape Expressions tool,
0:22:41.036,0:22:45.635
you can just embed it[br]in any other project,
0:22:45.636,0:22:48.235
so this is in GitHub and you can use it.
0:22:48.868,0:22:51.970
Then the same author,[br]it's one of my students,
0:22:52.684,0:22:55.704
he also created[br]an editor for Shape Expressions,
0:22:55.704,0:22:57.799
also inspired by[br]the Wikidata query service
0:22:57.800,0:23:00.681
where, in a column,
0:23:00.682,0:23:05.103
you have this more visual editor[br]of SPARQL queries
0:23:05.104,0:23:07.135
where you can put this kind of things.
0:23:07.136,0:23:09.123
So this is a screen capture.
0:23:09.123,0:23:12.662
You can see that[br]that's the Shape Expressions in text
0:23:12.662,0:23:17.822
but this is a form-based Shape Expressions[br]where it would probably take a bit longer
0:23:18.595,0:23:23.400
where you can put the different rows[br]on the different fields.
0:23:23.401,0:23:25.800
OK, then there is ShExEr.
0:23:26.879,0:23:31.882
We have... it's done by one PhD student[br]at the University of Oviedo
0:23:31.883,0:23:34.080
and he's here, so you can present ShExEr.
0:23:38.147,0:23:40.024
(Danny) Hello, I am Danny Fernández,
0:23:40.025,0:23:43.800
I am a PhD student in University of Oviedo[br]working with Labra.
0:23:44.710,0:23:47.725
Since we are running out of time,[br]let's make this quick,
0:23:47.726,0:23:52.641
so let's not go for any actual demo,[br]but just show some screenshots.
0:23:52.642,0:23:57.897
OK, so the usual way to work with[br]Shape Expressions or any shape language
0:23:57.897,0:23:59.521
is that you have a domain expert
0:23:59.522,0:24:02.313
that defines a priori[br]what the graph should look like
0:24:02.314,0:24:03.555
and defines some structures,
0:24:03.556,0:24:06.983
and then you use these structures[br]to validate the actual data against it.
0:24:08.124,0:24:11.641
This tool, like the ones[br]that Labra has been presenting,
0:24:11.642,0:24:14.441
is a general purpose tool[br]for any RDF source,
0:24:14.442,0:24:17.375
but it is designed to work the other way around.
0:24:17.376,0:24:18.758
You already have some data,
0:24:18.759,0:24:23.165
you select what nodes[br]you want to get the shape about
0:24:23.165,0:24:26.718
and then you automatically[br]extract or infer the shape.
0:24:26.719,0:24:29.791
So even if this is a general purpose tool,
0:24:29.791,0:24:34.063
what we did for this WikidataCon[br]is this fancy button
0:24:34.884,0:24:37.081
that if you click it,[br]essentially what happens
0:24:37.081,0:24:42.079
is that there are[br]so many configuration params
0:24:42.080,0:24:46.251
and it configures it to work[br]against the Wikidata endpoint
0:24:46.251,0:24:47.971
and it will end soon, sorry.
0:24:48.733,0:24:52.883
So, once you press this button[br]what you get is essentially this.
0:24:52.884,0:24:55.126
After having selected what kind of nodes,
0:24:55.127,0:24:59.360
what kind of instances of a class,[br]whatever you are looking for,
0:24:59.361,0:25:01.321
you get an automatic schema.
0:25:02.319,0:25:07.111
All the constraints are sorted[br]by how many nodes actually conform to them,
0:25:07.112,0:25:09.772
you can filter the less common ones, etc.
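A minimal sketch of the idea just described, not ShExEr's actual algorithm: given some triples and a target class, count how many instances use each property, sort by that share, and filter out the less common ones. The function name, the `min_ratio` parameter and the toy triples are assumptions made for illustration only.

```python
from collections import Counter


def infer_property_frequencies(triples, target_class, min_ratio=0.5):
    """Return (property, ratio) pairs for properties used by instances of target_class."""
    # Instances are subjects with an "instance of" (P31) triple pointing at the class.
    instances = {s for s, p, o in triples if p == "P31" and o == target_class}
    if not instances:
        return []
    # Count each property at most once per instance.
    used = {(s, p) for s, p, o in triples if s in instances and p != "P31"}
    counts = Counter(p for _, p in used)
    ranked = [(p, n / len(instances)) for p, n in counts.items()]
    # Most common first; ties broken by property id so the output is stable.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    # Drop the properties that too few instances conform to.
    return [(p, r) for p, r in ranked if r >= min_ratio]


# Toy data (hypothetical): three instances of Q5 and one unrelated item.
TRIPLES = [
    ("Q1", "P31", "Q5"), ("Q1", "P569", "1952"), ("Q1", "P19", "Q84"),
    ("Q2", "P31", "Q5"), ("Q2", "P569", "1970"),
    ("Q3", "P31", "Q5"), ("Q3", "P569", "1981"), ("Q3", "P106", "Q36180"),
    ("Q4", "P31", "Q515"), ("Q4", "P17", "Q145"),
]

print(infer_property_frequencies(TRIPLES, "Q5", min_ratio=0.3))
```

Lowering `min_ratio` keeps rarer properties in the inferred shape; raising it keeps only what nearly every selected node has.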
0:25:09.772,0:25:12.126
So there is a poster downstairs[br]about this stuff
0:25:12.127,0:25:14.595
and well,[br]I will be downstairs and upstairs
0:25:14.596,0:25:16.454
and all over the place all day,
0:25:16.455,0:25:19.081
so if you have any further[br]interest in this tool,
0:25:19.082,0:25:21.476
just speak to me during this journey.
0:25:21.477,0:25:24.624
And now, I'll give back[br]the mic to Labra, thank you.
0:25:24.625,0:25:29.265
(applause)
0:25:29.812,0:25:32.578
(Jose) So let's continue[br]with the other tools.
0:25:32.579,0:25:34.984
The other tool is the ShapeDesigner.
0:25:34.984,0:25:37.241
Andra, do you want to do[br]the ShapeDesigner now
0:25:37.242,0:25:39.287
or maybe later or in the workshop?
0:25:39.287,0:25:40.603
There is a workshop...
0:25:40.603,0:25:44.437
This afternoon, there is a workshop[br]specifically for Shape Expressions, and...
0:25:45.265,0:25:47.939
The idea is that it is going to be[br]more hands-on,
0:25:47.940,0:25:52.324
and if you want to practice[br]some ShEx, you can do it there.
0:25:52.875,0:25:55.720
This tool is ShEx...[br]and there is Eric here,
0:25:55.721,0:25:56.890
so you can present it.
0:25:57.969,0:26:00.687
(Eric) So just super quick,[br]the thing that I want to say
0:26:00.687,0:26:05.711
is that you've probably[br]already seen the ShEx interface
0:26:05.711,0:26:07.601
that's tailored for Wikidata.
0:26:07.602,0:26:12.930
That's effectively stripped down[br]and tailored specifically for Wikidata
0:26:12.930,0:26:17.937
because the generic one has more features[br]but it turns out I thought I'd mention it
0:26:17.937,0:26:19.977
because one of those features[br]is particularly useful
0:26:19.978,0:26:23.201
for debugging Wikidata schemas,
0:26:23.201,0:26:29.224
which is if you go[br]and you select the slurp mode,
0:26:29.225,0:26:31.444
what it does is it says[br]while I'm validating,
0:26:31.445,0:26:34.694
I want to pull all the triples down[br]and that means
0:26:34.695,0:26:36.274
if I get a bunch of failures,
0:26:36.275,0:26:39.586
I can go through and start looking[br]at those failures and saying,
0:26:39.587,0:26:41.800
OK, what are the triples[br]that are in here,
0:26:41.801,0:26:44.120
sorry, I apologize,[br]the triples are down there,
0:26:44.121,0:26:45.647
this is just a log of what went by.
0:26:46.327,0:26:49.180
And then you can just sit there[br]and fiddle with it in real time
0:26:49.181,0:26:51.033
like you play with something[br]and it changes.
0:26:51.033,0:26:54.160
So it's a quicker version[br]for doing all that stuff.
0:26:55.361,0:26:56.481
This is a ShExC form,
0:26:56.482,0:26:59.455
this is something [Joachim] had suggested
0:27:00.035,0:27:04.631
could be useful for populating[br]Wikidata documents
0:27:04.631,0:27:07.338
based on a Shape Expression[br]for that document.
0:27:08.095,0:27:11.681
This is not tailored for Wikidata,
0:27:11.682,0:27:14.081
but this is just to say[br]that you can have a schema
0:27:14.082,0:27:15.402
and you can have some annotations
0:27:15.403,0:27:17.518
to say specifically how I want[br]that schema rendered
0:27:17.519,0:27:19.031
and then it just builds a form,
0:27:19.031,0:27:21.191
and if you've got data,[br]it can even populate the form.
0:27:24.517,0:27:26.164
PyShEx [inaudible].
0:27:28.025,0:27:31.080
(Jose) I think this is the last one.
0:27:31.821,0:27:34.080
Yes, so the last one is PyShEx.
0:27:34.675,0:27:38.151
PyShEx is a Python implementation[br]of Shape Expressions,
0:27:39.193,0:27:42.680
you can also play with Jupyter Notebooks[br]if you want that kind of thing.
0:27:42.680,0:27:44.432
OK, so that's all for this.
0:27:44.433,0:27:47.170
(applause)
0:27:52.916,0:27:57.073
(Andra) So I'm going to talk about[br]a specific project that I'm involved in
0:27:57.074,0:27:58.074
called Gene Wiki,
0:27:58.075,0:28:04.596
and where we are also[br]dealing with quality issues.
0:28:04.597,0:28:06.684
But before going into the quality,
0:28:06.685,0:28:09.229
maybe a quick introduction[br]about what Gene Wiki is,
0:28:09.855,0:28:15.175
and we just released a pre-print[br]of a paper that we have recently written
0:28:15.175,0:28:18.160
that explains the details of the project.
0:28:19.821,0:28:23.839
I see people taking pictures,[br]but basically, what Gene Wiki does,
0:28:23.846,0:28:28.027
it's trying to get biomedical data,[br]public data into Wikidata,
0:28:28.028,0:28:32.200
and we follow a specific pattern[br]to get that data into Wikidata.
0:28:33.130,0:28:36.809
So when we have a new repository[br]or a new data set
0:28:36.810,0:28:39.600
that is eligible[br]to be included into Wikidata,
0:28:39.601,0:28:41.293
the first step is community engagement.
0:28:41.294,0:28:43.784
It is not necessarily[br]directly with a Wikidata community
0:28:43.785,0:28:46.120
but a local research community,
0:28:46.121,0:28:50.286
and we meet in person[br]or online or on any platform
0:28:50.286,0:28:52.881
and try to come up with a data model
0:28:52.882,0:28:56.197
that bridges their data[br]with the Wikidata model.
0:28:56.197,0:28:59.944
So here I have a picture of a workshop[br]that happened here last year
0:28:59.945,0:29:02.663
which was trying to look[br]at a specific data set
0:29:02.663,0:29:05.280
and, well, you see a lot of discussions,
0:29:05.281,0:29:09.780
then aligning it with schema.org[br]and other ontologies that are out there.
0:29:10.320,0:29:15.508
And then, at the end of the first step,[br]we have a whiteboard drawing of the schema
0:29:15.509,0:29:17.336
that we want to implement in Wikidata.
0:29:17.337,0:29:20.440
What you see over there,[br]this is just plain,
0:29:20.441,0:29:21.766
we have it in the back there
0:29:21.767,0:29:25.240
so we can make some schemas[br]within this panel today even.
0:29:26.560,0:29:28.399
So once we have the schema in place,
0:29:28.400,0:29:31.320
the next thing is try to make[br]that schema machine readable
0:29:32.358,0:29:36.841
because you want to have actionable models[br]to bridge the data that you're bringing in
0:29:36.842,0:29:39.690
from any biomedical database[br]into Wikidata.
0:29:40.393,0:29:45.182
And here we are applying[br]Shape Expressions.
0:29:46.471,0:29:52.518
And we use that because[br]Shape Expressions allow you to test
0:29:52.518,0:29:57.040
whether the data set[br]is actually-- no, to first see
0:29:57.041,0:30:01.782
if already existing data in Wikidata[br]follows the same data model
0:30:01.783,0:30:04.718
that was achieved in the previous process.
0:30:04.719,0:30:06.641
So then with the Shape Expression[br]we can check:
0:30:06.642,0:30:10.926
OK the data that are on this topic[br]in Wikidata, does it need some cleaning up
0:30:10.926,0:30:15.013
or do we need to adapt our model[br]to the Wikidata model or vice versa.
0:30:15.937,0:30:19.867
Once that is in place,[br]we start writing bots,
0:30:20.670,0:30:23.801
and bots are seeding the information
0:30:23.802,0:30:27.308
that is in the primary sources[br]into Wikidata.
0:30:27.846,0:30:29.303
And when the bots are ready,
0:30:29.304,0:30:33.001
we write these bots[br]with a platform called--
0:30:33.002,0:30:36.201
with a Python library[br]called Wikidata Integrator
0:30:36.202,0:30:38.167
that came out of our project.
0:30:38.698,0:30:42.921
And once we have our bots,[br]we use a platform called Jenkins
0:30:42.921,0:30:44.540
for continuous integration.
0:30:44.540,0:30:45.762
And with Jenkins,
0:30:45.762,0:30:51.160
we continuously update[br]the primary sources with Wikidata.
0:30:52.178,0:30:55.889
And this is a diagram from the paper[br]I previously mentioned.
0:30:55.890,0:30:57.241
This is our current landscape.
0:30:57.242,0:31:02.059
So every orange box out there[br]is a primary resource on drugs,
0:31:02.060,0:31:07.827
proteins, genes, diseases,[br]chemical compounds with interaction,
0:31:07.827,0:31:10.870
and this model is too small to read now
0:31:10.870,0:31:17.472
but this is the database,[br]the sources that we manage in Wikidata
0:31:17.473,0:31:20.560
and bridge with the primary sources.
0:31:20.561,0:31:22.355
Here is such a workflow.
0:31:22.870,0:31:25.312
So one of our partners[br]is the Disease Ontology.
0:31:25.312,0:31:27.672
The Disease Ontology is a CC0 ontology,
0:31:28.179,0:31:31.990
and the CC0 ontology[br]has a curation cycle of its own,
0:31:32.756,0:31:35.736
and they just continuously[br]update the Disease Ontology
0:31:35.737,0:31:39.687
to reflect the disease space[br]or the interpretation of diseases.
0:31:40.336,0:31:44.361
And there is the Wikidata[br]curation cycle also on diseases
0:31:44.362,0:31:49.844
where the Wikidata community constantly[br]monitors what's going on on Wikidata.
0:31:50.406,0:31:51.601
And then we have two roles,
0:31:51.602,0:31:55.477
we call them colloquially[br]the gatekeeper curator,
0:31:56.009,0:31:59.561
and this was me[br]and a colleague five years ago
0:31:59.562,0:32:03.414
where we just sat at our computers[br]and monitored Wikipedia and Wikidata,
0:32:03.415,0:32:08.601
and if there was an issue, that was[br]reported back to the primary community,
0:32:08.602,0:32:11.765
the primary resources, they looked[br]at the implementation and decided:
0:32:11.765,0:32:14.240
OK, do we trust the Wikidata input?
0:32:14.850,0:32:18.555
Yes--then it's considered,[br]it goes into the cycle,
0:32:18.555,0:32:22.686
and the next iteration[br]is part of the Disease Ontology
0:32:22.687,0:32:25.411
and fed back into Wikidata.
0:32:27.419,0:32:31.480
We're doing the same for WikiPathways.
0:32:31.481,0:32:36.601
WikiPathways is a MediaWiki-inspired[br]pathway repository.
0:32:36.602,0:32:40.901
Same story, there are different[br]pathway resources on Wikidata already.
0:32:41.463,0:32:44.713
There might be conflicts[br]between those pathway resources
0:32:44.722,0:32:46.701
and these conflicts are reported back
0:32:46.702,0:32:49.521
by the gatekeeper curators[br]to that community,
0:32:49.522,0:32:53.715
and you maintain[br]the individual curation cycles.
0:32:53.715,0:32:57.068
But if you remember the previous cycle,
0:32:57.069,0:33:03.041
here I mentioned[br]only two cycles, two resources,
0:33:03.566,0:33:06.300
we have to do that[br]for every single resource that we have
0:33:06.300,0:33:08.061
and we have to manage what's going on
0:33:08.062,0:33:09.185
because when I say curation,
0:33:09.185,0:33:11.377
I really mean going[br]to the Wikipedia talk pages,
0:33:11.377,0:33:14.544
going into the Wikidata talk pages[br]and trying to do that.
0:33:14.545,0:33:19.316
That doesn't scale for[br]the two gatekeeper curators we had.
0:33:19.860,0:33:22.777
So when I was in a conference in 2016
0:33:22.778,0:33:26.933
where Eric gave a presentation[br]on Shape Expressions,
0:33:26.934,0:33:29.277
I jumped on the bandwagon and said OK,
0:33:29.278,0:33:34.240
Shape Expressions can help us[br]detect differences in Wikidata
0:33:34.240,0:33:41.159
and so that allows the gatekeepers to have[br]some more efficient reporting.
0:33:42.275,0:33:46.019
So this year,[br]I was delighted by the schema entity
0:33:46.020,0:33:50.765
because now, we can store[br]those entity schemas on Wikidata,
0:33:50.765,0:33:53.183
on Wikidata itself,[br]whereas before, it was on GitHub,
0:33:53.860,0:33:56.815
and this aligns[br]with the Wikidata interface,
0:33:56.816,0:33:59.350
so you have things[br]like document discussions
0:33:59.350,0:34:00.762
but you also have revisions.
0:34:00.763,0:34:05.261
So you can leverage the talk pages[br]and the revisions in Wikidata
0:34:05.262,0:34:12.255
to discuss[br]what is in Wikidata
0:34:12.255,0:34:14.060
and what is in the primary resources.
0:34:14.966,0:34:19.686
So this is what Eric just presented,[br]this is already quite a benefit.
0:34:19.686,0:34:24.335
So here, we made up a Shape Expression[br]for the human gene,
0:34:24.336,0:34:30.225
and then we ran it through simple ShEx,[br]and as you can see,
0:34:30.225,0:34:32.428
we just got already ni--
0:34:32.429,0:34:34.641
There is one issue[br]that needs to be monitored
0:34:34.642,0:34:37.316
which is that there is an item[br]that doesn't fit that schema,
0:34:37.316,0:34:43.139
and then you can sort of already[br]create schema entities curation reports
0:34:43.140,0:34:46.240
based on... and send that[br]to the different curation reports.
0:34:48.058,0:34:52.788
But ShEx.js is a built interface,
0:34:52.788,0:34:55.860
and if I can show back here,[br]I only do ten,
0:34:55.860,0:35:00.362
but we have tens of thousands,[br]and so that again doesn't scale.
0:35:00.362,0:35:04.654
So the Wikidata Integrator now[br]supports ShEx as well,
0:35:05.168,0:35:07.431
and then we can just loop over items
0:35:07.431,0:35:11.494
where we say yes-no,[br]yes-no, true-false, true-false.
0:35:11.495,0:35:12.495
So again,
0:35:13.065,0:35:16.514
increasing a bit of the efficiency[br]of dealing with the reports.
0:35:17.256,0:35:22.662
But now, recently, that builds[br]on the Wikidata Query Service,
0:35:23.181,0:35:24.998
and, well, we have recently been throttled
0:35:24.999,0:35:26.560
so again, that doesn't scale.
0:35:26.561,0:35:31.391
So it's still an ongoing process,[br]how to deal with models on Wikidata.
0:35:32.202,0:35:36.682
And so again,[br]ShEx is not only intimidating
0:35:36.683,0:35:40.356
but also the scale is just[br]too big to deal with.
0:35:41.068,0:35:46.081
So I started working, this is my first[br]proof of concept or exercise
0:35:46.082,0:35:47.680
where I used a tool called yEd,
0:35:48.184,0:35:52.590
and I started to draw[br]those Shape Expressions and because...
0:35:52.591,0:35:58.098
and then regenerate this schema
0:35:58.099,0:36:01.279
into this JSON format[br]of the Shape Expressions,
0:36:01.280,0:36:04.520
so that would open up already[br]to the audience
0:36:04.521,0:36:07.432
that is intimidated[br]by the Shape Expressions language.
0:36:07.961,0:36:12.308
But actually, there is a problem[br]with those visual descriptions
0:36:12.309,0:36:18.229
because this is also a schema[br]that was actually drawn in yEd by someone.
0:36:18.230,0:36:23.838
And here is another one[br]which is beautiful.
0:36:23.838,0:36:29.414
I would love to have this on my wall,[br]but it is still not interoperable.
0:36:30.281,0:36:32.131
So I want to end my talk with,
0:36:32.131,0:36:35.732
and the first time, I've been[br]stealing this slide, using this slide.
0:36:35.732,0:36:37.594
It's an honor to have him in the audience
0:36:37.595,0:36:39.423
and I really like this:
0:36:39.424,0:36:42.362
"People think RDF is a pain[br]because it's complicated.
0:36:42.362,0:36:43.985
The truth is even worse, it's so simple,
0:36:45.581,0:36:48.133
because you have to work[br]with real-world data problems
0:36:48.134,0:36:50.031
that are horribly complicated.
0:36:50.031,0:36:51.451
While you can avoid RDF,
0:36:51.451,0:36:55.760
it is harder to avoid complicated data[br]and complicated computer problems."
0:36:55.761,0:36:59.535
This is about RDF, but I think[br]this also applies to modeling as well.
0:37:00.112,0:37:02.769
So my point of discussion[br]is should we really...
0:37:03.387,0:37:05.882
How do we get modeling going?
0:37:05.882,0:37:10.826
Should we discuss ShEx[br]or visual models or...
0:37:11.426,0:37:13.271
How do we continue?
0:37:13.474,0:37:14.840
Thank you very much for your time.
0:37:15.102,0:37:17.787
(applause)
0:37:20.001,0:37:21.188
(Lydia) Thank you so much.
0:37:21.692,0:37:24.001
Would you come to the front
0:37:24.002,0:37:27.741
so that we can open[br]the questions from the audience.
0:37:28.610,0:37:30.203
Are there questions?
0:37:31.507,0:37:32.507
Yes.
0:37:34.253,0:37:36.890
And I think, for the camera, we need to...
0:37:38.835,0:37:40.968
(Lydia laughing) Yeah.
0:37:43.094,0:37:46.273
(man3) So a question[br]for Cristina, I think.
0:37:47.366,0:37:51.641
So you mentioned exactly[br]the term "information gain"
0:37:51.642,0:37:53.689
from linking with other systems.
0:37:53.690,0:37:55.619
There is an information theoretic measure
0:37:55.620,0:37:58.001
using statistics and probability[br]called information gain.
0:37:58.002,0:37:59.541
Do you have the same...
0:37:59.542,0:38:01.736
I mean did you mean exactly that measure,
0:38:01.736,0:38:04.173
the information gain[br]from the probability theory
0:38:04.174,0:38:05.240
from information theory
0:38:05.241,0:38:09.024
or did you just use it conceptually[br]to measure information gain in some way?
0:38:09.025,0:38:13.016
(Cristina) No, so we actually defined[br]and implemented measures
0:38:13.695,0:38:20.161
that are using the Shannon entropy,[br]so it's meant as that.
0:38:20.162,0:38:22.696
I didn't want to go into[br]details of the concrete formulas...
0:38:22.697,0:38:24.977
(man3) No, no, of course,[br]that's why I asked the question.
0:38:24.978,0:38:26.698
- (Cristina) But yeah...[br]- (man3) Thank you.
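The concrete formulas are not given in the talk, so the following is only a sketch of the textbook definitions the question refers to: Shannon entropy of a list of values, and information gain as the drop in entropy once the values are grouped by some attribute. The function names and the sample inputs are assumptions for illustration, not the measures the speaker implemented.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    total = len(values)
    counts = Counter(values)
    # Sum of p * log2(1/p) over the observed values; 0 for an empty list.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def information_gain(labels, groups):
    """Entropy of labels minus the weighted entropy of labels within each group."""
    total = len(labels)
    by_group = {}
    for label, group in zip(labels, groups):
        by_group.setdefault(group, []).append(label)
    conditional = sum(
        len(part) / total * shannon_entropy(part) for part in by_group.values()
    )
    return shannon_entropy(labels) - conditional


# Grouping that fully separates the labels versus one that tells us nothing.
print(information_gain(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
print(information_gain(["a", "b", "a", "b"], ["x", "x", "y", "y"]))
```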
0:38:33.091,0:38:35.047
(man4) Maybe more[br]of a comment than a question.
0:38:35.048,0:38:36.241
(Lydia) Go for it.
0:38:36.242,0:38:39.840
(man4) So there's been[br]a lot of focus at the item level
0:38:39.840,0:38:42.547
about quality and completeness.
0:38:42.547,0:38:47.374
One of the things that concerns me is that[br]we're not applying the same to hierarchies
0:38:47.374,0:38:51.480
and I think an issue we have[br]is that our hierarchy often isn't good.
0:38:51.481,0:38:53.463
We're seeing[br]this is going to be a real problem
0:38:53.464,0:38:55.774
with Commons searching and other things.
0:38:56.771,0:39:00.601
One of the abilities that we can do[br]is to import external--
0:39:00.602,0:39:04.842
The way that external thesauruses[br]structure their hierarchies,
0:39:04.842,0:39:10.291
using the P4900[br]broader concept qualifier.
0:39:11.037,0:39:16.167
But what I think would be really helpful[br]would be much better tools for doing that
0:39:16.168,0:39:21.212
so that you can import an[br]external... thesaurus's hierarchy
0:39:21.212,0:39:24.111
and map that onto our Wikidata items.
0:39:24.111,0:39:28.199
Once it's in place[br]with those P4900 qualifiers,
0:39:28.200,0:39:31.494
you can actually do some[br]quite good querying through SPARQL
0:39:32.490,0:39:37.534
to see where our hierarchy[br]diverges from that external hierarchy.
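The speaker does this comparison with SPARQL; the sketch below shows the same check on plain Python data, under the assumption that the external "broader concept" pairs (as P4900 qualifiers would record them) and Wikidata's own parent links have already been fetched. The function names and the clothing labels are hypothetical stand-ins, not real item ids.

```python
def ancestors(item, parents):
    """All items reachable from item by following parent links (safe on cycles)."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def divergent_edges(external_broader, wikidata_parents):
    """External (narrower, broader) pairs that no chain of Wikidata parent links reproduces."""
    return sorted(
        (child, broader)
        for child, broader in external_broader
        if broader not in ancestors(child, wikidata_parents)
    )


# Hypothetical data: our hierarchy versus an external thesaurus.
WIKIDATA_PARENTS = {
    "miniskirt": ["skirt"],
    "skirt": ["clothing"],
    "kilt": ["clothing"],
}
EXTERNAL_BROADER = [
    ("miniskirt", "clothing"),
    ("kilt", "skirt"),
    ("skirt", "clothing"),
]

print(divergent_edges(EXTERNAL_BROADER, WIKIDATA_PARENTS))
```

Each reported pair is a place where the external hierarchy says something our hierarchy does not; whether that is a gap or a deliberate modeling difference still needs a human decision.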
0:39:37.534,0:39:41.346
For instance, [Paula Morma],[br]user PKM, you may know,
0:39:41.346,0:39:43.533
does a lot of work on fashion.
0:39:43.533,0:39:50.524
So we use that to pull in the Europeana[br]Fashion Thesaurus's hierarchy
0:39:50.524,0:39:53.812
and the Getty AAT[br]fashion thesaurus hierarchy,
0:39:53.812,0:39:57.957
and then see where the gaps[br]were in our higher level items,
0:39:57.957,0:40:00.511
which is a real problem for us[br]because often,
0:40:00.511,0:40:04.355
these are things that only exist[br]as disambiguation pages on Wikipedia,
0:40:04.356,0:40:09.270
so we have a lot of higher level items[br]in our hierarchies missing
0:40:09.271,0:40:14.480
and this is something that we must address[br]in terms of quality and completeness,
0:40:14.480,0:40:15.971
but what would really help
0:40:16.643,0:40:20.871
would be better tools than[br]the jungle of Perl scripts that I wrote...
0:40:20.872,0:40:26.010
If somebody could put that[br]into a PAWS notebook in Python
0:40:26.561,0:40:31.972
to be able to take an external thesaurus,[br]take its hierarchy,
0:40:31.973,0:40:34.595
which may well be available[br]as linked data or may not,
0:40:35.379,0:40:40.580
to then put those into[br]QuickStatements to put in P4900 values.
0:40:41.165,0:40:42.165
And then later,
0:40:42.166,0:40:44.527
when our representation[br]gets more complete,
0:40:44.528,0:40:49.691
to update those P4900s[br]because as our representation gets updated,
0:40:49.691,0:40:51.590
becomes more dense,
0:40:51.590,0:40:55.377
the values of those qualifiers[br]need to change
0:40:56.230,0:40:59.526
to represent that we've got more[br]of their hierarchy in our system.
0:40:59.526,0:41:03.728
If somebody could do that,[br]I think that would be very helpful,
0:41:03.728,0:41:07.121
and we do need to also[br]look at other approaches
0:41:07.122,0:41:10.762
to improve quality and completeness[br]at the hierarchy level
0:41:10.763,0:41:12.378
not just at the item level.
0:41:13.308,0:41:14.840
(Andra) Can I add to that?
0:41:16.362,0:41:19.901
Yes, and we actually do that,
0:41:19.911,0:41:23.551
and I can recommend looking at[br]the Shape Expression that Finn made
0:41:23.552,0:41:27.330
with the lexical data[br]where he creates Shape Expressions
0:41:27.330,0:41:29.640
and then builds on other Shape Expressions
0:41:29.641,0:41:32.528
so you have this concept[br]of linked Shape Expressions in Wikidata,
0:41:32.529,0:41:35.005
and specifically, the use case,[br]if I understand correctly,
0:41:35.006,0:41:37.183
is exactly what we are doing in Gene Wiki.
0:41:37.184,0:41:40.841
So you have the Disease Ontology[br]which is put into Wikidata
0:41:40.842,0:41:44.681
and then disease data comes in[br]and we apply the Shape Expressions
0:41:44.682,0:41:47.247
to see if that fits with this thesaurus.
0:41:47.248,0:41:50.919
And there are other thesauruses or other[br]ontologies for controlled vocabularies
0:41:50.920,0:41:52.559
that still need to go into Wikidata,
0:41:52.559,0:41:55.401
and that's exactly why[br]Shape Expression is so interesting
0:41:55.402,0:41:57.963
because you can have a Shape Expression[br]for the Disease Ontology,
0:41:57.964,0:41:59.644
you can have a Shape Expression for MeSH,
0:41:59.645,0:42:01.761
you can say: OK,[br]now I want to check the quality.
0:42:01.762,0:42:04.059
Because you also have[br]in Wikidata the context
0:42:04.060,0:42:09.567
of when you have a controlled vocabulary,[br]you say the quality is according to this,
0:42:09.568,0:42:11.636
but you might have[br]a disagreeing community.
0:42:11.636,0:42:16.081
So the tooling is indeed in place[br]but now the need is indeed to create those models
0:42:16.082,0:42:18.144
and apply them[br]on the different use cases.
0:42:18.811,0:42:20.921
(man4) Shape Expressions are very useful
0:42:20.922,0:42:25.928
once you have the external ontology[br]mapped into Wikidata,
0:42:25.929,0:42:29.474
but my problem is[br]getting to that stage,
0:42:29.475,0:42:34.881
it's working out how much of the[br]external ontology isn't yet in Wikidata
0:42:34.882,0:42:36.256
and where the gaps are,
0:42:36.257,0:42:40.660
and that's where I think that[br]having much more robust tools
0:42:40.660,0:42:44.286
to see what's missing[br]from external ontologies
0:42:44.286,0:42:45.537
would be very helpful.
0:42:47.678,0:42:49.062
The biggest problem there
0:42:49.062,0:42:51.201
is not so much tooling[br]but more licensing.
0:42:51.803,0:42:55.249
So getting the ontologies[br]into Wikidata is actually a piece of cake
0:42:55.250,0:42:59.295
but most of the ontologies have,[br]how can I say that politely,
0:42:59.965,0:43:03.256
restrictive licensing,[br]so they are not compatible with Wikidata.
0:43:04.068,0:43:06.678
(man4) There's a huge number[br]of public sector thesauruses
0:43:06.678,0:43:08.209
in cultural fields.
0:43:08.210,0:43:10.851
- (Andra) Then we need to talk.[br]- (man4) Not a problem.
0:43:10.852,0:43:12.384
(Andra) Then we need to talk.
0:43:13.624,0:43:19.192
(man5) Just... the comment I want to make[br]is actually an answer to James,
0:43:19.192,0:43:22.401
so the thing is that[br]hierarchies make graphs,
0:43:22.374,0:43:24.041
and when you want to...
0:43:24.579,0:43:28.888
I want to basically talk about...[br]a common problem in hierarchies
0:43:28.889,0:43:30.820
is circular hierarchies,
0:43:30.821,0:43:33.796
so they come back to each other[br]when there's a problem,
0:43:33.796,0:43:35.920
which you should not[br]have in hierarchies.
0:43:37.022,0:43:41.295
This, funnily enough,[br]happens in categories in Wikipedia a lot
0:43:41.295,0:43:42.990
we have a lot of circles in categories,
0:43:43.898,0:43:46.612
but the good news is that this is...
0:43:47.713,0:43:51.582
Technically, it's an NP-complete problem,[br]so you cannot find this
0:43:51.583,0:43:53.414
easily if you build a graph of that,
0:43:54.473,0:43:57.046
but there are lots of ways[br]that have been developed
0:43:57.047,0:44:00.624
to find problems[br]in these hierarchy graphs.
0:44:00.625,0:44:04.860
Like there is a paper[br]called Finding Cycles...
0:44:04.861,0:44:07.955
Breaking Cycles in Noisy Hierarchies,
0:44:07.956,0:44:12.671
and it's been used to help[br]categorization of English Wikipedia.
0:44:12.672,0:44:17.141
You can just take this[br]and apply it to hierarchies in Wikidata,
0:44:17.142,0:44:19.540
and then you can find[br]things that are problematic
0:44:19.541,0:44:22.481
and just remove the ones[br]that are causing issues
0:44:22.482,0:44:24.593
and find the issues, actually.
0:44:24.594,0:44:26.960
So this is just an idea, just so you...
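As a small illustration of the first step, here is a sketch that reports one cycle in a parent-link graph using a depth-first search. It is not the method of the paper mentioned above: finding that a cycle exists is cheap, while choosing the smallest set of links to remove so that no cycles remain (the minimum feedback arc set problem) is the NP-hard part. The function name and the sample graph are assumptions for illustration.

```python
def find_cycle(parents):
    """Return one cycle as a list of items (first == last), or None if there is none."""
    state = {}  # item -> "open" while on the current path, "done" once fully explored
    for start in parents:
        if start in state:
            continue
        state[start] = "open"
        path = [start]
        iterators = [iter(parents.get(start, ()))]
        while iterators:
            for parent in iterators[-1]:
                if state.get(parent) == "open":
                    # Back edge to an item on the current path: slice out the cycle.
                    return path[path.index(parent):] + [parent]
                if parent not in state:
                    # Descend into an unvisited parent.
                    state[parent] = "open"
                    path.append(parent)
                    iterators.append(iter(parents.get(parent, ())))
                    break
            else:
                # All parents explored: leave this item and go back up.
                state[path.pop()] = "done"
                iterators.pop()
    return None


# Hypothetical subclass links: A -> B -> C -> A is a cycle.
print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

The search is iterative rather than recursive so that deep hierarchies do not hit Python's recursion limit.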
0:44:28.780,0:44:29.930
(man4) That's all very well
0:44:29.931,0:44:34.402
but I think you're underestimating[br]the number of bad subclass relations
0:44:34.402,0:44:35.402
that we have.
0:44:35.403,0:44:39.680
It's like having a city[br]in completely the wrong country,
0:44:40.250,0:44:44.874
and there are tools for geography[br]to identify that,
0:44:44.875,0:44:49.201
and we need to have[br]much better tools in hierarchies
0:44:49.202,0:44:53.477
to identify where the equivalent[br]of the item for the country
0:44:53.478,0:44:57.673
is missing entirely,[br]or where it's actually been subclassed
0:44:57.674,0:45:01.804
to something that means[br]something completely different.
0:45:02.804,0:45:07.165
(Lydia) Yeah, I think[br]you're getting to something
0:45:07.166,0:45:12.024
that me and my team keep hearing[br]from people who reuse our data
0:45:12.025,0:45:13.991
quite a bit as well, right,
0:45:15.002,0:45:16.638
An individual data point might be great
0:45:16.639,0:45:20.163
but if you have to look[br]at the ontology and so on,
0:45:20.164,0:45:21.857
then it gets very...
0:45:22.388,0:45:26.437
And I think one of the big problems[br]why this is happening
0:45:26.437,0:45:30.736
is that a lot of editing on Wikidata
0:45:30.736,0:45:34.544
happens on the basis[br]of an individual item, right,
0:45:34.545,0:45:36.201
you make an edit on that item,
0:45:37.653,0:45:42.075
without realizing that this[br]might have very global consequences
0:45:42.075,0:45:44.245
on the rest of the graph, for example.
0:45:44.245,0:45:50.040
And if people have ideas around[br]how to make this more visible,
0:45:50.041,0:45:53.185
the consequences[br]of an individual local edit,
0:45:54.005,0:45:56.537
I think that would be worth exploring,
0:45:57.550,0:46:01.583
to show people better[br]what the consequence of their edit
0:46:01.584,0:46:03.434
that they might do in very good faith,
0:46:04.481,0:46:05.481
what that is.
0:46:06.939,0:46:12.237
Whoa! OK, let's start with, yeah, you,[br]then you, then you, then you.
0:46:12.237,0:46:13.921
(man5) Well, after the discussion,
0:46:13.922,0:46:18.262
just to express my agreement[br]with what James was saying.
0:46:18.263,0:46:22.467
So essentially, it seems[br]the most dangerous thing is the hierarchy,
0:46:22.468,0:46:23.910
not the hierarchy, but generally
0:46:23.911,0:46:28.022
the semantics of the subclass relations[br]seen in Wikidata, right.
0:46:28.022,0:46:32.561
So I've been studying languages recently,[br]just for the purposes of this conference,
0:46:32.562,0:46:35.257
and for example, you find plenty of cases
0:46:35.257,0:46:39.463
where a language is a part of[br]and subclass of the same thing, OK.
0:46:39.463,0:46:43.577
So you know, you can say[br]we have a flexible ontology.
0:46:43.577,0:46:46.256
Wikidata gives you freedom[br]to express that, sometimes.
0:46:46.256,0:46:47.257
Because, for example,
0:46:47.258,0:46:50.721
that ontology of languages[br]is also politically complicated, right?
0:46:50.722,0:46:55.038
It is even good to be in a position[br]to express a level of uncertainty.
0:46:55.038,0:46:57.983
But imagine anyone who wants[br]to do machine reading from that.
0:46:57.984,0:46:59.468
So that's really problematic.
0:46:59.468,0:47:00.468
And then again,
0:47:00.469,0:47:03.686
I don't think that ontology[br]was ever imported from somewhere,
0:47:03.687,0:47:05.490
that's something which is originally ours.
0:47:05.491,0:47:08.321
It was harvested from Wikipedia[br]in the very beginning, I would say.
0:47:08.322,0:47:11.324
So I wonder...[br]this Shape Expressions thing is great,
0:47:11.325,0:47:15.575
and also validating and fixing,[br]if you like, the Wikidata ontology
0:47:15.576,0:47:18.191
by external resources, beautiful idea.
0:47:19.026,0:47:20.026
In the end,
0:47:20.027,0:47:25.440
will we end up reflecting[br]the external ontologies in Wikidata?
0:47:25.441,0:47:28.651
And also, what do we do with[br]the core part of our ontology
0:47:28.652,0:47:30.642
which was never harvested[br]from external resources,
0:47:30.643,0:47:31.978
how do we go and fix that?
0:47:31.979,0:47:35.276
And I really think that[br]that will be a problem on its own.
0:47:35.277,0:47:39.010
We will have to focus on that[br]independently of the idea
0:47:39.010,0:47:41.046
of validating ontology[br]with something external.
0:47:49.353,0:47:53.379
(man6) OK, and constraints[br]and shapes are very impressive
0:47:53.380,0:47:54.495
what we can do with it,
0:47:55.205,0:47:58.481
but the main point is not[br]being really made clear--
0:47:58.482,0:48:03.229
it's because now we can make more explicit[br]what we expect from the data.
0:48:03.229,0:48:06.893
Before, everyone had to write[br]their own tools and scripts
0:48:06.894,0:48:10.601
and so it's more visible[br]and we can discuss about it.
0:48:10.602,0:48:13.641
But because it's not about[br]what's wrong or right,
0:48:13.642,0:48:15.870
it's about an expectation,
0:48:15.870,0:48:18.105
and you will have different[br]expectations and discussions
0:48:18.106,0:48:20.737
about how we want[br]to model things in Wikidata,
0:48:21.246,0:48:23.095
and this...
0:48:23.096,0:48:26.280
The current state is just[br]one step in the direction
0:48:26.281,0:48:28.041
because now you need
0:48:28.042,0:48:31.041
a lot of technical expertise[br]to get into this,
0:48:31.042,0:48:35.721
and we need better ways[br]to visualize these constraints,
0:48:35.722,0:48:39.995
to transform them maybe into natural language[br]so people can better understand,
0:48:40.939,0:48:43.768
but it's less about what's wrong or right.
0:48:44.925,0:48:45.925
(Lydia) Yeah.
0:48:50.986,0:48:53.893
(man7) So for quality issues,[br]I just want to echo it like...
0:48:53.894,0:48:57.010
I've definitely found a lot of the issues[br]I've encountered have been
0:48:58.838,0:49:02.330
differences in opinion[br]between instance of versus subclass.
0:49:02.331,0:49:05.963
I would say errors in those situations
0:49:05.963,0:49:11.521
and trying to find those[br]has been a very time-consuming process.
0:49:11.522,0:49:14.840
What I've found is like:[br]"Oh, if I find very high-impression items
0:49:14.840,0:49:16.051
that are something...
0:49:16.052,0:49:21.628
and then use all the subclass instances[br]to find all derived statements of this,"
0:49:21.628,0:49:26.215
this is a very useful way[br]of looking for these errors.
0:49:26.215,0:49:28.067
But I was curious if Shape Expressions,
0:49:29.841,0:49:31.582
if there is...
0:49:31.583,0:49:36.934
If this can be used as a tool[br]to help resolve those issues but, yeah...
0:49:40.514,0:49:42.555
(man8) If it has a structural footprint...
0:49:45.910,0:49:49.310
If it has a structural footprint[br]that you can...that's sort of falsifiable,
0:49:49.310,0:49:51.191
you can look at that[br]and say well, that's wrong,
0:49:51.192,0:49:52.670
then yeah, you can do that.
0:49:52.671,0:49:56.921
But if it's just sort of[br]trying to map it to real-world objects,
0:49:56.922,0:49:59.082
then you're just going to need[br]lots and lots of brains.
0:50:05.768,0:50:08.631
(man9) Hi, Pablo Mendes[br]from Apple Siri Knowledge.
0:50:09.154,0:50:12.770
We're here to find out how to help[br]the project and the community
0:50:12.770,0:50:15.645
but Cristina made the mistake[br]of asking what we want.
0:50:16.471,0:50:20.052
(laughing) So I think[br]one thing I'd like to see
0:50:20.958,0:50:23.521
is a lot around verifiability
0:50:23.522,0:50:26.372
which is one of the core tenets[br]of the project in the community,
0:50:27.062,0:50:28.590
and trustworthiness.
0:50:28.590,0:50:32.412
Not every statement is the same,[br]some of them are heavily disputed,
0:50:32.413,0:50:33.653
some of them are easy to guess,
0:50:33.654,0:50:35.541
like somebody's[br]date of birth can be verified,
0:50:36.071,0:50:39.082
as you saw today in the Keynote,[br]gender issues are a lot more complicated.
0:50:40.205,0:50:42.130
Can you discuss a little bit what you know
0:50:42.131,0:50:47.271
in this area of data quality around[br]trustworthiness and verifiability?
0:50:55.442,0:50:58.138
If there isn't a lot,[br]I'd love to see a lot more. (laughs)
0:51:00.646,0:51:01.646
(Lydia) Yeah.
0:51:03.314,0:51:06.548
Apparently, we don't have[br]a lot to say on that. (laughs)
0:51:08.024,0:51:12.299
(Andra) I think we can do a lot,[br]but I had a discussion with you yesterday.
0:51:12.300,0:51:15.774
My favorite example I learned yesterday[br]that's already deprecated
0:51:15.774,0:51:20.281
is if you go to the Q2, which is earth,
0:51:20.282,0:51:23.343
there is a statement[br]that claims that the earth is flat.
0:51:24.183,0:51:26.055
And I love that example
0:51:26.056,0:51:28.391
because there is a community[br]out there that claims that
0:51:28.392,0:51:30.417
and they have verifiable resources.
0:51:30.418,0:51:32.254
So I think it's a genuine case,
0:51:32.255,0:51:34.641
it shouldn't be deprecated,[br]it should be in Wikidata.
0:51:34.642,0:51:40.385
And I think Shape Expressions[br]can be really instrumental there,
0:51:40.386,0:51:41.832
because what you can say,
0:51:41.833,0:51:44.856
OK, I'm really interested[br]in this use case,
0:51:44.857,0:51:47.129
or this is a use case where you disagree,
0:51:47.130,0:51:51.059
but there can also be a use case[br]where you say OK, I'm interested.
0:51:51.059,0:51:53.449
So there is this example you say,[br]I have glucose.
0:51:53.449,0:51:55.841
And glucose when you're a biologist,
0:51:55.842,0:52:00.176
you don't care for the chemical[br]constraints of the glucose molecule,
0:52:00.177,0:52:03.201
you just... everything glucose[br]is the same.
0:52:03.202,0:52:05.973
But if you're a chemist,[br]you cringe when you hear that,
0:52:05.973,0:52:08.191
you have 200 something...
0:52:08.191,0:52:10.443
So then you can have[br]multiple Shape Expressions,
0:52:10.443,0:52:12.721
OK, I'm coming in with...[br]I have a chemist's view,
0:52:12.722,0:52:13.887
I'm applying that.
0:52:13.887,0:52:16.691
And then you say[br]I'm from a biological use case,
0:52:16.691,0:52:18.524
I'm applying that Shape Expression.
0:52:18.524,0:52:20.358
And then when you want to collaborate,
0:52:20.358,0:52:22.784
yes, well you should talk[br]to Eric about ShEx maps.
0:52:23.910,0:52:28.873
And so...[br]but this journey is just starting.
0:52:28.873,0:52:32.238
But personally, I believe[br]that it's quite instrumental in that area.
0:52:34.292,0:52:35.535
(Lydia) OK. Over there.
0:52:37.949,0:52:39.168
(laughs)
0:52:40.597,0:52:46.035
(woman2) I had several ideas[br]from some points in the discussions,
0:52:46.035,0:52:50.902
so I will try not to lose...[br]I had three ideas so...
0:52:52.394,0:52:55.201
Based on what James said a while ago,
0:52:55.202,0:52:59.001
we have a very, very big problem[br]on Wikidata since the beginning
0:52:59.002,0:53:01.574
for the upper ontology.
0:53:02.363,0:53:05.339
We talked about that[br]two years ago at WikidataCon,
0:53:05.340,0:53:07.432
and we talked about that at Wikimania.
0:53:07.432,0:53:09.818
Well, whenever we have a Wikidata meeting,
0:53:09.818,0:53:11.656
we are talking about that,
0:53:11.656,0:53:15.782
because it's a very big problem[br]at a very, very high level:
0:53:15.783,0:53:23.118
what an entity is, what a work is,[br]what a genre is, art,
0:53:23.118,0:53:25.461
those are really the biggest concepts.
0:53:26.195,0:53:33.117
And that's actually[br]a very weak point on global ontology
0:53:33.118,0:53:37.453
because people try to clean up regularly
0:53:38.017,0:53:41.047
and break everything down the line,
0:53:42.516,0:53:48.649
because yes, I think some of you[br]may remember the guy who in good faith
0:53:48.649,0:53:51.785
broke absolutely all cities in the world.
0:53:51.785,0:53:57.537
They were not geographical items anymore,[br]so constraint violations everywhere.
0:53:58.720,0:54:00.278
And it was in good faith
0:54:00.278,0:54:03.623
because he was really[br]correcting a mistake in an item,
0:54:04.170,0:54:05.732
but everything broke down.
0:54:06.349,0:54:09.373
And I'm not sure how we can solve that
0:54:10.216,0:54:15.709
because there is actually[br]no external institution we could just copy
0:54:15.710,0:54:18.490
because everyone is working on...
0:54:19.154,0:54:22.041
Well, if I am a performing arts database,
0:54:22.042,0:54:24.601
I will just go[br]to the performing arts level,
0:54:24.601,0:54:29.361
and I won't go to the philosophical concept[br]of what an entity is,
0:54:29.362,0:54:31.201
and that's actually...
0:54:31.202,0:54:34.561
I don't know any database[br]which is working at this level,
0:54:34.562,0:54:36.827
but that's the weakest point of Wikidata.
0:54:37.936,0:54:40.812
And probably,[br]when we are talking about data quality,
0:54:40.812,0:54:44.034
that's actually a big part of it, so...
0:54:44.034,0:54:48.569
And I think it's the same[br]we have stated in...
0:54:48.569,0:54:50.452
Oh, I am sorry, I am changing the subject,
0:54:51.401,0:54:55.774
but we have stated[br]in different sessions about quality,
0:54:55.774,0:54:59.398
which is that actually some of us[br]are doing a good modeling job,
0:54:59.399,0:55:01.240
are doing ShEx,[br]are doing things like that.
0:55:01.967,0:55:07.655
People don't see it on Wikidata,[br]they don't see the ShEx,
0:55:07.655,0:55:10.392
they don't see the WikiProject[br]on the discussion page,
0:55:10.393,0:55:11.393
and sometimes,
0:55:11.394,0:55:14.958
they don't even see[br]the talk pages of properties,
0:55:14.958,0:55:19.628
which are explicitly stating,[br]"This property is used for that."
0:55:19.628,0:55:23.887
Like last week,[br]I added constraints to a property.
0:55:23.888,0:55:26.324
The constraint was explicitly written
0:55:26.325,0:55:28.690
in the discussion[br]of the creation of the property.
0:55:28.690,0:55:34.548
I just created the technical part[br]of adding the constraint, and someone:
0:55:34.548,0:55:37.182
"What! You broke down all my edits!"
0:55:37.183,0:55:41.542
And he was using the property[br]wrongly for the last two years.
0:55:41.542,0:55:46.868
And the property was actually very clear,[br]but there were no warnings or anything,
0:55:46.869,0:55:49.922
and so, it's the same as the Pink Pony[br]we said at Wikimania,
0:55:49.922,0:55:54.719
to make WikiProjects more visible[br]or to make ShEx more visible, but...
0:55:54.719,0:55:56.917
And that's what Cristina said.
0:55:56.917,0:56:02.368
We have a visibility problem[br]of what the existing solutions are.
0:56:02.368,0:56:04.242
And at this session,
0:56:04.242,0:56:06.862
we are all talking about[br]how to create more ShEx,
0:56:06.863,0:56:10.727
or to facilitate the jobs[br]of the people who are doing the cleanup.
0:56:11.605,0:56:15.835
But we have been cleaning up[br]since the first day of Wikidata,
0:56:15.836,0:56:20.921
and globally, we are losing,[br]and we are losing because, well,
0:56:20.922,0:56:22.960
if I know names are complicated
0:56:22.961,0:56:26.162
but I am the only one[br]doing the cleaning up job,
0:56:26.662,0:56:29.671
the guy who added[br]Latin script names
0:56:29.672,0:56:31.584
to all Chinese researchers,
0:56:32.088,0:56:35.616
it will take me months to clean that[br]and I can't do it alone,
0:56:35.616,0:56:38.777
and he did one massive batch.
0:56:38.777,0:56:40.241
So we really need...
0:56:40.242,0:56:44.158
we have a visibility problem[br]more than a tool problem, I think,
0:56:44.158,0:56:45.733
because we have many tools.
0:56:45.733,0:56:50.255
(Lydia) Right, so unfortunately,[br]I've been shown a sign, (laughs),
0:56:50.256,0:56:52.121
so we need to wrap this up.
0:56:52.122,0:56:53.563
Thank you so much for your comments,
0:56:53.563,0:56:56.611
I hope you will continue discussing[br]during the rest of the day,
0:56:56.611,0:56:57.840
and thanks for your input.
0:56:58.359,0:56:59.944
(applause)