WEBVTT
00:00:05.945 --> 00:00:09.476
Hello everyone, and welcome to the Data Quality panel.
00:00:10.288 --> 00:00:13.671
Data quality matters because
more and more people out there
00:00:13.672 --> 00:00:19.289
rely on our data being in good shape,
so we're going to talk about data quality,
00:00:20.029 --> 00:00:26.000
and there will be four speakers
who will give short introductions
00:00:26.000 --> 00:00:29.539
on topics related to data quality
and then we will have a Q and A.
00:00:30.130 --> 00:00:32.234
And the first one is Lucas.
00:00:34.385 --> 00:00:35.385
Thank you.
00:00:35.901 --> 00:00:39.899
Hi, I'm Lucas, and I'm going
to start with an overview
00:00:39.899 --> 00:00:43.806
of data quality tools
that we already have on Wikidata
00:00:43.807 --> 00:00:46.109
and also some things
that are coming up soon.
00:00:46.932 --> 00:00:50.623
And I've grouped them
into some general themes
00:00:50.623 --> 00:00:53.761
of making errors more visible,
making problems actionable,
00:00:53.762 --> 00:00:56.322
getting more eyes on the data
so that people notice the problems,
00:00:56.945 --> 00:01:02.616
fixing some common sources of errors,
maintaining the quality of the existing data
00:01:02.616 --> 00:01:03.966
and also human curation.
00:01:05.063 --> 00:01:09.874
And the ones that are currently available
start with property constraints.
00:01:10.388 --> 00:01:12.421
So you've probably seen this
if you're on Wikidata.
00:01:12.422 --> 00:01:14.029
You can sometimes get these icons
00:01:14.530 --> 00:01:17.241
which check
the internal consistency of the data.
00:01:17.242 --> 00:01:20.800
For example,
if one event follows the other,
00:01:20.801 --> 00:01:23.760
then the other event should
also be followed by this one,
00:01:23.761 --> 00:01:27.161
which on the WikidataCon item
was apparently missing.
00:01:27.162 --> 00:01:29.360
I'm not sure,
this feature is a few days old.
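The inverse check described here can be sketched in a few lines. This is only an illustration over an in-memory list of statements, not how the Wikidata constraint system is implemented; the function name `missing_inverses` and the event names are made up for the example.

```python
def missing_inverses(statements, prop="followed by", inverse="follows"):
    """List the inverse statements that should exist but are absent."""
    present = set(statements)
    return sorted(
        (value, inverse, subject)
        for subject, p, value in present
        if p == prop and (value, inverse, subject) not in present
    )


# Hypothetical example data: (subject, property, value) triples.
statements = [
    ("event 1", "followed by", "event 2"),
    ("event 2", "follows", "event 1"),
    ("event 2", "followed by", "event 3"),  # no matching "follows" on event 3
]

print(missing_inverses(statements))
```

Each tuple in the result is a statement an editor would probably want to add to the item that triggered the warning.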
00:01:30.040 --> 00:01:34.681
And there's also,
if this is too limited or simple for you,
00:01:34.682 --> 00:01:38.080
you can write any checks you want
using the Query Service
00:01:38.081 --> 00:01:39.842
which is useful for
lots of things of course,
00:01:39.843 --> 00:01:44.543
but you can also use it
for finding errors.
00:01:44.544 --> 00:01:46.974
Like if you've noticed
one occurrence of a mistake,
00:01:46.975 --> 00:01:49.709
then you can check
if there are other places
00:01:49.710 --> 00:01:51.958
where people have made
a very similar error
00:01:51.958 --> 00:01:53.438
and find that with the Query Service.
00:01:53.439 --> 00:01:54.559
You can also combine the two
00:01:54.560 --> 00:01:57.874
and search for constraint violations
in the Query Service,
00:01:57.875 --> 00:02:01.240
for example,
only the violations in some area
00:02:01.241 --> 00:02:03.762
or WikiProject that's relevant to you,
00:02:03.762 --> 00:02:06.828
although the results are currently
not complete, sadly.
00:02:08.422 --> 00:02:09.877
There is revision scoring.
00:02:10.690 --> 00:02:12.666
That's... I think this is
from the recent changes;
00:02:12.667 --> 00:02:16.217
you can also get it on your watchlist:
an automatic assessment
00:02:16.217 --> 00:02:20.249
of is this edit likely to be
in good faith or in bad faith
00:02:20.250 --> 00:02:22.312
and is it likely to be
damaging or not damaging,
00:02:22.313 --> 00:02:24.205
I think those are the two dimensions.
00:02:24.206 --> 00:02:25.686
So you can, if you want,
00:02:25.687 --> 00:02:29.898
focus on just looking through
the damaging but good faith edits.
00:02:29.899 --> 00:02:32.523
If you're feeling particularly
friendly and welcoming
00:02:32.524 --> 00:02:37.121
you can tell these editors,
"Thank you for your contribution,
00:02:37.122 --> 00:02:40.560
here's how you should have done it
but thank you, still."
00:02:40.561 --> 00:02:42.186
And if you're not feeling that way,
00:02:42.187 --> 00:02:44.452
you can go through
the bad faith, damaging edits,
00:02:44.453 --> 00:02:45.573
and revert the vandals.
00:02:47.544 --> 00:02:49.761
There's also, similar to that,
entity scoring.
00:02:49.762 --> 00:02:52.590
So instead of scoring an edit,
the change that it made,
00:02:52.591 --> 00:02:53.904
you score the whole revision,
00:02:53.904 --> 00:02:56.483
and I think that is
the same quality measure
00:02:56.483 --> 00:02:59.863
that Lydia mentioned
at the beginning of the conference.
00:03:00.372 --> 00:03:04.569
There's a user script up here
that gives you a score of like one to five,
00:03:04.570 --> 00:03:08.176
I think it was, of what the quality
of the current item is.
00:03:10.043 --> 00:03:15.528
The primary sources tool is for
any database that you want to import,
00:03:15.528 --> 00:03:18.364
but that's not high enough quality
to directly add to Wikidata,
00:03:18.374 --> 00:03:20.335
so you add it
to the primary sources tool instead,
00:03:20.336 --> 00:03:22.956
and then humans can decide
00:03:22.956 --> 00:03:26.024
should they add
these individual statements or not.
00:03:28.595 --> 00:03:31.901
Showing coordinates as maps
is mainly a convenience feature
00:03:31.901 --> 00:03:33.588
but it's also useful for quality control.
00:03:33.588 --> 00:03:36.937
Like if you see this is supposed to be
the office of Wikimedia Germany
00:03:36.938 --> 00:03:39.400
and if the coordinates
are somewhere in the Indian Ocean,
00:03:39.401 --> 00:03:41.529
then you know that
something is not right there
00:03:41.530 --> 00:03:44.790
and you can see it much more easily
than if you just had the numbers.
00:03:46.382 --> 00:03:49.576
This is a gadget called
the relative completeness indicator
00:03:49.577 --> 00:03:52.480
which shows you this little icon here
00:03:53.007 --> 00:03:55.652
telling you how complete
it thinks this item is
00:03:55.652 --> 00:03:57.613
and also which properties
are most likely missing,
00:03:57.614 --> 00:03:59.769
which is really useful
if you're editing an item
00:03:59.769 --> 00:04:03.172
and you're in an area
that you're not very familiar with
00:04:03.172 --> 00:04:05.661
and you don't know what
the right properties to use are,
00:04:05.662 --> 00:04:08.230
then this is a very useful gadget to have.
00:04:09.604 --> 00:04:11.401
And we have Shape Expressions.
00:04:11.402 --> 00:04:15.624
I think Andra or Jose
are going to talk more about those
00:04:15.624 --> 00:04:19.757
but basically, a very powerful way
of comparing the data you have
00:04:19.758 --> 00:04:20.758
against the schema,
00:04:20.759 --> 00:04:22.680
like what statement should
certain entities have,
00:04:22.681 --> 00:04:25.677
what other entities should they link to
and what should those look like,
00:04:26.229 --> 00:04:29.374
and then you can find problems that way.
00:04:30.366 --> 00:04:32.361
I think... No there is still more.
00:04:32.362 --> 00:04:34.321
Integraality or property dashboard.
00:04:34.322 --> 00:04:36.773
It gives you a quick overview
of the data you already have.
00:04:36.774 --> 00:04:39.147
For example, this is from
the WikiProject Red Pandas,
00:04:39.657 --> 00:04:41.681
and you can see that
we have a sex or gender
00:04:41.682 --> 00:04:43.561
for almost all of the red pandas,
00:04:43.561 --> 00:04:46.854
the date of birth varies a lot
by which zoo they come from
00:04:46.854 --> 00:04:50.255
and we have almost
no dead pandas which is wonderful,
00:04:51.437 --> 00:04:52.600
because they're so cute.
00:04:53.699 --> 00:04:55.654
So this is also useful.
00:04:56.377 --> 00:04:59.185
There we go, OK,
now for the things that are coming up.
00:04:59.889 --> 00:05:03.784
Wikidata Bridge, or also known,
formerly known as client editing,
00:05:03.785 --> 00:05:07.076
so editing Wikidata
from Wikipedia infoboxes
00:05:07.675 --> 00:05:11.725
which will on the one hand
get more eyes on the data
00:05:11.725 --> 00:05:13.441
because more people can see the data there
00:05:13.441 --> 00:05:18.841
and it will hopefully encourage
more use of Wikidata in the Wikipedias
00:05:18.841 --> 00:05:20.920
and that means that more
people can notice
00:05:20.921 --> 00:05:23.389
if, for example, some data is outdated
and needs to be updated
00:05:23.857 --> 00:05:27.000
instead of if they would
only see it on Wikidata itself.
00:05:28.630 --> 00:05:30.656
There is also tainted references.
00:05:30.657 --> 00:05:33.959
The idea here is that
if you edit a statement value,
00:05:34.683 --> 00:05:37.279
you might want to update
the references as well,
00:05:37.280 --> 00:05:39.373
unless it was just a typo or something.
00:05:39.897 --> 00:05:43.662
And this tainted references feature
tells editors that,
00:05:43.663 --> 00:05:49.756
and it also lets other editors
see which other edits were made
00:05:49.756 --> 00:05:52.471
that edited a statement value
and didn't update a reference
00:05:52.472 --> 00:05:56.766
then you can clean up after that
and decide should that be...
00:05:57.737 --> 00:05:59.566
Do you need to do anything more about that
00:05:59.566 --> 00:06:02.796
or is that actually fine and
you don't need to update the reference.
00:06:03.543 --> 00:06:09.336
That's related to signed statements
which is coming from a concern, I think,
00:06:09.336 --> 00:06:12.355
that some data providers have that like...
00:06:14.131 --> 00:06:17.231
There's a statement that's referenced
through the UNESCO or something
00:06:17.232 --> 00:06:19.872
and then suddenly,
someone vandalizes the statement
00:06:19.873 --> 00:06:21.836
and they are worried
that it will look like
00:06:22.827 --> 00:06:26.992
this organization, like UNESCO,
still said this vandalism value
00:06:26.993 --> 00:06:28.706
and so, with signed statements,
00:06:28.706 --> 00:06:31.488
they can cryptographically
sign this reference
00:06:31.488 --> 00:06:33.562
and that doesn't prevent any edits to it,
00:06:34.169 --> 00:06:37.744
but at least, if someone
vandalizes the statement
00:06:37.744 --> 00:06:40.255
or edits it in any way,
then the signature is no longer valid,
00:06:40.255 --> 00:06:43.401
and you can tell this is not exactly
what the organization said,
00:06:43.402 --> 00:06:47.064
and perhaps it's a good edit
and they should re-sign the new statement,
00:06:47.065 --> 00:06:49.851
but also perhaps it should be reverted.
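To make the idea concrete, here is a minimal sketch of how a signature stops verifying once a statement is edited. The talk does not say which scheme signed statements would use, so this uses an HMAC from the Python standard library as a stand-in; a real design would presumably use public-key signatures so that anyone can verify. The key, the function names, and the statement fields are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical secret held by the data provider; illustrative only.
PROVIDER_KEY = b"example-provider-key"


def canonical(statement: dict) -> bytes:
    """Serialize deterministically so identical content yields identical bytes."""
    return json.dumps(statement, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(statement: dict, key: bytes = PROVIDER_KEY) -> str:
    """Return a hex signature over the canonical form of the statement."""
    return hmac.new(key, canonical(statement), hashlib.sha256).hexdigest()


def is_valid(statement: dict, signature: str, key: bytes = PROVIDER_KEY) -> bool:
    """Check whether the signature still matches the statement as it is now."""
    return hmac.compare_digest(sign(statement, key), signature)


# Placeholder statement; the IDs and values are invented for the example.
original = {"item": "Q1", "property": "P1", "value": "1978", "reference": "UNESCO"}
signature = sign(original)

edited = dict(original, value="1878")  # any edit, good or bad

print(is_valid(original, signature))
print(is_valid(edited, signature))
```

Nothing here blocks the edit itself; the check only reports that the current value is no longer exactly what the organization signed.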
00:06:51.203 --> 00:06:54.166
And also, this is going
to be very exciting, I think,
00:06:54.166 --> 00:06:56.846
Citoid is this amazing system
they have on Wikipedia
00:06:57.379 --> 00:07:01.340
where you can paste a URL,
or an identifier, or an ISBN
00:07:01.340 --> 00:07:04.759
or Wikidata ID or basically
anything into the Visual Editor,
00:07:05.260 --> 00:07:08.241
and it spits out a reference
that is nicely formatted
00:07:08.242 --> 00:07:11.049
and has all the data you want
and it's wonderful to use.
00:07:11.049 --> 00:07:14.337
And by comparison, on Wikidata,
if I want to add a reference
00:07:14.338 --> 00:07:18.801
I typically have to add a reference URL,
title, author name string,
00:07:18.802 --> 00:07:20.449
published in, publication date,
00:07:20.450 --> 00:07:25.141
retrieved date,
at least those, and that's annoying,
00:07:25.141 --> 00:07:29.261
and integrating Citoid into Wikibase
will hopefully help with that.
00:07:30.245 --> 00:07:33.604
And I think
that's all the ones I had, yeah.
00:07:33.604 --> 00:07:36.400
So now, I'm going to pass to Cristina.
00:07:37.788 --> 00:07:42.339
(applause)
00:07:43.780 --> 00:07:45.471
Hi, I'm Cristina.
00:07:45.472 --> 00:07:47.672
I'm a research scientist
from the University of Zürich,
00:07:47.673 --> 00:07:51.417
and I'm also an active member
of the Swiss Community.
00:07:52.698 --> 00:07:57.901
When Claudia Müller-Birn
and I submitted this to the WikidataCon,
00:07:57.902 --> 00:08:00.410
what we wanted to do
is continue our discussion
00:08:00.411 --> 00:08:02.424
that we started
in the beginning of the year
00:08:02.424 --> 00:08:07.442
with a workshop on data quality
and also some sessions in Wikimania.
00:08:07.442 --> 00:08:10.535
So the goal of this talk
is basically to bring some thoughts
00:08:10.536 --> 00:08:14.432
that we have been collecting
from the community and ourselves
00:08:14.432 --> 00:08:16.560
and continue the discussion.
00:08:16.561 --> 00:08:20.065
So what we would like is to continue
interacting a lot with you.
00:08:21.557 --> 00:08:23.371
So what we think is very important
00:08:23.372 --> 00:08:27.580
is that we continuously ask
all types of users in the community
00:08:27.581 --> 00:08:32.240
about what they really need,
what problems they have with data quality,
00:08:32.240 --> 00:08:35.000
not only editors
but also the people who are coding,
00:08:35.000 --> 00:08:36.241
or consuming the data,
00:08:36.242 --> 00:08:39.494
and also researchers who are
actually using all the edit history
00:08:39.494 --> 00:08:40.800
to analyze what is happening.
00:08:42.367 --> 00:08:48.431
So we did a review of around 80 tools
that exist in Wikidata
00:08:48.431 --> 00:08:52.380
and we aligned them to the different
data quality dimensions.
00:08:52.380 --> 00:08:54.360
And what we saw was that actually,
00:08:54.361 --> 00:08:57.681
many of them were
monitoring completeness,
00:08:57.682 --> 00:09:02.820
and some of them
are also enabling interlinking.
00:09:02.820 --> 00:09:08.442
But there is a big need for tools
that are looking into diversity,
00:09:08.443 --> 00:09:12.824
which is one of the things
that we actually can have in Wikidata,
00:09:12.824 --> 00:09:15.958
especially
this design principle of Wikidata
00:09:15.959 --> 00:09:17.901
where we can have plurality
00:09:17.902 --> 00:09:20.308
and different statements
with different values
00:09:21.034 --> 00:09:22.236
coming from different sources.
00:09:22.236 --> 00:09:24.921
Because it's a secondary source,
we don't really have tools
00:09:24.922 --> 00:09:27.750
that actually tell us how many
plural statements there are,
00:09:27.751 --> 00:09:30.889
and how many we can improve and how,
00:09:30.890 --> 00:09:32.833
and we also don't know really
00:09:32.833 --> 00:09:35.538
what are all the reasons
for plurality that we can have.
00:09:36.491 --> 00:09:39.201
So from these community meetings,
00:09:39.201 --> 00:09:43.084
what we discussed was the challenges
that still need attention.
00:09:43.084 --> 00:09:47.249
For example, that having
all these crowdsourcing communities
00:09:47.249 --> 00:09:49.613
is very good because different people
attack different parts
00:09:49.613 --> 00:09:51.833
of the data or the graph,
00:09:51.834 --> 00:09:54.615
and we also have
different background knowledge
00:09:54.616 --> 00:09:59.161
but actually, it's very difficult to align
everything in something homogeneous
00:09:59.162 --> 00:10:04.920
because different people are using
different properties in different ways
00:10:04.920 --> 00:10:08.401
and they are also expecting
different things from entity descriptions.
00:10:09.003 --> 00:10:12.721
People also said that
they also need more tools
00:10:12.722 --> 00:10:16.000
that give a better overview
of the global status of things.
00:10:16.000 --> 00:10:20.733
So what entities are missing
in terms of completeness,
00:10:20.733 --> 00:10:26.121
but also like what are people
working on right now most of the time,
00:10:26.121 --> 00:10:30.516
and they also mention many times
a tighter collaboration
00:10:30.517 --> 00:10:33.311
across not only languages
but the WikiProjects
00:10:33.311 --> 00:10:35.571
and the different Wikimedia platforms.
00:10:35.571 --> 00:10:38.859
And we published
all the transcribed comments
00:10:38.860 --> 00:10:42.959
from all these discussions
in those links here in the Etherpads
00:10:42.959 --> 00:10:46.162
and also in the wiki page of Wikimania.
00:10:46.162 --> 00:10:48.481
Some solutions that appeared actually
00:10:48.481 --> 00:10:53.001
were going into the direction
of sharing more of the best practices
00:10:53.001 --> 00:10:55.762
that are being developed
in different WikiProjects,
00:10:55.762 --> 00:11:01.238
but also people want tools
that help organize work in teams
00:11:01.239 --> 00:11:03.845
or at least understanding
who is working on that,
00:11:03.845 --> 00:11:07.815
and they were also mentioning
that they want more showcases
00:11:07.816 --> 00:11:12.019
and more templates that help them
create things in a better way.
00:11:12.946 --> 00:11:15.161
And from the contact that we have
00:11:15.162 --> 00:11:18.721
with Open Governmental Data Organizations,
00:11:18.722 --> 00:11:20.068
and in particular,
00:11:20.068 --> 00:11:23.102
I am in contact with the canton
and the city of Zürich,
00:11:23.102 --> 00:11:26.207
they are very interested
in working with Wikidata
00:11:26.207 --> 00:11:29.896
because they want their data
to be accessible for everyone
00:11:29.897 --> 00:11:33.681
in the place where people go
and consult or access data.
00:11:33.682 --> 00:11:36.550
So for them, something that
would be really interesting
00:11:36.551 --> 00:11:38.600
is to have some kind of quality indicators
00:11:38.600 --> 00:11:41.082
both in the wiki,
which is already happening,
00:11:41.082 --> 00:11:42.801
but also in SPARQL results,
00:11:42.802 --> 00:11:46.066
to know whether or not they can trust
that data from the community.
00:11:46.067 --> 00:11:48.230
And then, they also want to know
00:11:48.230 --> 00:11:51.417
what parts of their own data sets
are useful for Wikidata
00:11:51.418 --> 00:11:56.040
and they would love to have a tool that
can help them assess that automatically.
00:11:56.041 --> 00:11:59.066
They also need
some kind of methodology or tool
00:11:59.067 --> 00:12:03.894
that helps them decide whether
they should import or link their data
00:12:03.894 --> 00:12:04.894
because in some cases,
00:12:04.895 --> 00:12:07.137
they also have their own
linked open data sets,
00:12:07.138 --> 00:12:09.746
so they don't know whether
to just ingest the data
00:12:09.747 --> 00:12:13.424
or to keep on creating links
from the data sets to Wikidata
00:12:13.425 --> 00:12:14.425
and the other way around.
00:12:14.950 --> 00:12:20.043
And they also want to know where
their websites are referred to in Wikidata.
00:12:20.044 --> 00:12:23.361
And when they run such a query
in the query service,
00:12:23.362 --> 00:12:24.848
they often get timeouts,
00:12:24.849 --> 00:12:28.181
so maybe we should
really create more tools
00:12:28.181 --> 00:12:32.240
that help them get these answers
for their questions.
00:12:33.148 --> 00:12:36.208
And, besides that,
00:12:36.208 --> 00:12:39.361
we wiki researchers also sometimes
00:12:39.362 --> 00:12:42.023
lack some information
in the edit summaries.
00:12:42.024 --> 00:12:44.953
So I remember that when
we were doing some work
00:12:44.954 --> 00:12:48.919
to understand
the different behavior of editors
00:12:48.919 --> 00:12:53.403
with tools or bots
or anonymous users and so on,
00:12:53.403 --> 00:12:56.154
we were really lacking, for example,
00:12:56.154 --> 00:13:01.112
a standard way of tracing
that tools were being used.
00:13:01.113 --> 00:13:03.154
And there are some tools
that are already doing that
00:13:03.155 --> 00:13:05.230
like PetScan and many others,
00:13:05.230 --> 00:13:07.720
but maybe we should in the community
00:13:07.721 --> 00:13:13.531
discuss more about how to record these
for fine-grained provenance.
00:13:14.169 --> 00:13:15.321
And further on,
00:13:15.322 --> 00:13:20.801
we think that we need to think
of more concrete data quality dimensions
00:13:20.802 --> 00:13:24.961
that are related to linked data
but not all the types of data,
00:13:24.962 --> 00:13:30.721
so we worked on some measures
to actually assess the information gain
00:13:30.722 --> 00:13:33.881
enabled by the links,
and what we mean by that
00:13:33.882 --> 00:13:36.681
is that when we link
Wikidata to other data sets,
00:13:36.682 --> 00:13:38.201
we should also be thinking
00:13:38.202 --> 00:13:41.921
how much the entities are actually
gaining in the classification,
00:13:41.922 --> 00:13:45.601
also in the description
but also in the vocabularies they use.
00:13:45.602 --> 00:13:51.041
So just to give a very simple
example of what I mean with this
00:13:51.042 --> 00:13:54.269
is we can think of--
in this case, would be Wikidata
00:13:54.270 --> 00:13:57.771
or the external data set
that is linking to Wikidata,
00:13:57.772 --> 00:14:00.487
we have the entity for a person
that is called Natasha Noy,
00:14:00.487 --> 00:14:02.601
we have the affiliation and other things,
00:14:02.602 --> 00:14:05.239
and then we say OK,
we link to an external place,
00:14:05.240 --> 00:14:08.919
and that entity also has that name,
but we actually have the same value.
00:14:08.920 --> 00:14:12.889
So what would be better is that we link
to something that has a different name,
00:14:12.889 --> 00:14:16.881
that is still valid because this person
has two ways of writing the name,
00:14:16.882 --> 00:14:19.714
and also other information
that we don't have in Wikidata
00:14:19.715 --> 00:14:21.760
or that we don't have
in the other data set.
00:14:22.390 --> 00:14:24.652
But also, what is even better
00:14:24.653 --> 00:14:27.770
is that we are actually
looking in the target data set
00:14:27.770 --> 00:14:31.392
that they also have new ways
of classifying the information.
00:14:31.393 --> 00:14:35.354
So not only is this a person,
but in the other data set,
00:14:35.355 --> 00:14:39.525
they also say it's a female
or anything else that they classify with.
00:14:39.526 --> 00:14:43.401
And if in the other data set,
they are using many other vocabularies
00:14:43.402 --> 00:14:46.588
that is also helping in their whole
information retrieval thing.
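One rough way to picture this kind of gain is to count what the linked description adds beyond what we already hold. The sketch below is not the measure from our work, just an illustration; `link_gain`, the property names, and the name variant "N. Noy" are assumptions made for the example.

```python
def link_gain(local: dict[str, set[str]], remote: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per property, the remote values that the local description lacks."""
    gain = {}
    for prop, values in remote.items():
        new_values = values - local.get(prop, set())
        if new_values:
            gain[prop] = new_values
    return gain


# What we already know locally (illustrative values only).
wikidata = {"name": {"Natasha Noy"}, "type": {"person"}}

# A link target that only repeats the same name adds nothing.
same_only = {"name": {"Natasha Noy"}}

# A link target with another spelling and an extra classification adds both.
richer = {"name": {"Natasha Noy", "N. Noy"}, "type": {"person", "female"}}

print(link_gain(wikidata, same_only))
print(link_gain(wikidata, richer))
```

An empty result marks a link that brings no new description or classification, which is the low-gain case from the example.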
00:14:47.371 --> 00:14:51.233
So with that, I also would like to say
00:14:51.234 --> 00:14:55.809
that we think that we can
showcase federated queries better
00:14:55.810 --> 00:15:00.448
because when we look at the query log
provided by Malyshev et al.,
00:15:01.285 --> 00:15:04.301
we see actually that
from the organic queries,
00:15:04.302 --> 00:15:06.921
we have only very few federated queries.
00:15:06.922 --> 00:15:12.801
And actually, federation is one
of the key advantages of having linked data,
00:15:12.802 --> 00:15:16.903
so maybe the community
or the people using Wikidata
00:15:16.903 --> 00:15:18.898
also need more examples on this.
00:15:18.898 --> 00:15:22.666
And if we look at the list
of endpoints that are being used,
00:15:22.667 --> 00:15:25.401
this is not a complete list
and we have many more.
00:15:25.402 --> 00:15:30.479
Of course, this data was analyzed
from queries until March 2018,
00:15:30.480 --> 00:15:34.807
but we should look into the list
of federated endpoints that we have
00:15:34.808 --> 00:15:37.048
and see whether
we are really using them or not.
00:15:37.813 --> 00:15:40.441
So two questions that
I have for the audience
00:15:40.442 --> 00:15:43.001
that maybe we can use
afterwards for the discussion are:
00:15:43.001 --> 00:15:46.001
what data quality problems
should be addressed in your opinion,
00:15:46.002 --> 00:15:47.412
because of the needs that you have,
00:15:47.412 --> 00:15:50.401
but also, where do you need
more automation
00:15:50.402 --> 00:15:52.943
to help you with editing or patrolling.
00:15:53.866 --> 00:15:55.146
That's all, thank you very much.
00:15:55.779 --> 00:15:57.527
(applause)
00:16:06.030 --> 00:16:08.595
(Jose Emilio Labra) OK,
so what I'm going to talk about
00:16:08.595 --> 00:16:14.715
is some tools that we were developing
related to Shape Expressions.
00:16:15.536 --> 00:16:19.371
So this is what I want to talk...
I am Jose Emilio Labra,
00:16:19.371 --> 00:16:23.215
but this has... all these tools
have been done by different people,
00:16:23.920 --> 00:16:28.480
mainly related with W3C ShEx,
Shape Expressions Community Group.
00:16:28.481 --> 00:16:29.481
ShEx Community Group.
00:16:30.144 --> 00:16:36.081
So the first tool that I want to mention
is RDFShape, this is a general tool,
00:16:36.082 --> 00:16:40.681
because Shape Expressions
is not only for Wikidata,
00:16:40.682 --> 00:16:44.168
Shape Expressions is a language
to validate RDF in general.
00:16:44.168 --> 00:16:47.568
So this tool was developed mainly by me
00:16:47.568 --> 00:16:50.880
and it's a tool
to validate RDF in general.
00:16:50.881 --> 00:16:55.139
So if you want to learn about RDF
or you want to validate RDF
00:16:55.140 --> 00:16:58.621
or SPARQL endpoints not only in Wikidata,
00:16:58.622 --> 00:17:00.891
my advice is that you can use this tool.
00:17:00.891 --> 00:17:03.255
Also for teaching.
00:17:03.255 --> 00:17:05.640
I am a teacher in the university
00:17:05.641 --> 00:17:09.151
and I use it in my semantic web course
to teach RDF.
00:17:09.161 --> 00:17:12.121
So if you want to learn RDF,
I think it's a good tool.
00:17:13.033 --> 00:17:17.598
For example, this is just a visualization
of an RDF graph with the tool.
00:17:18.587 --> 00:17:22.643
But before coming here, in the last month,
00:17:22.643 --> 00:17:28.441
I started a fork of RDFShape specifically
for Wikidata, because I thought...
00:17:28.443 --> 00:17:33.082
It's called WikiShape, and yesterday,
I presented it as a present for Wikidata.
00:17:33.082 --> 00:17:34.441
So what I took is...
00:17:34.442 --> 00:17:39.898
What I did is to remove all the stuff
that was not related to Wikidata
00:17:39.898 --> 00:17:44.801
and to put several things, hard-coded,
for example, the Wikidata SPARQL endpoint,
00:17:44.802 --> 00:17:49.041
but now, someone asked me
if I could do it also for Wikibase.
00:17:49.042 --> 00:17:52.000
And it is very easy
to do it for Wikibase also.
00:17:52.760 --> 00:17:56.280
So this tool, WikiShape, is quite new.
00:17:57.015 --> 00:17:59.843
I think it works, most of the features,
00:17:59.844 --> 00:18:02.468
but there are some features
that maybe don't work,
00:18:02.469 --> 00:18:06.281
and if you try it and you want
to improve it, please tell me.
00:18:06.281 --> 00:18:12.680
So this is [inaudible] captures,
but I think I can even try so let's try.
00:18:15.385 --> 00:18:16.945
So let's see if it works.
00:18:16.953 --> 00:18:20.070
First, I have to go out of the...
00:18:22.453 --> 00:18:23.453
Here.
00:18:24.226 --> 00:18:28.324
Alright, yeah. So this is the tool here.
00:18:28.324 --> 00:18:29.844
Things that you can do with the tool,
00:18:29.845 --> 00:18:35.275
for example, is that you can
check schemas, entity schemas.
00:18:35.276 --> 00:18:38.611
You know that there is
a new namespace which is "E whatever,"
00:18:38.612 --> 00:18:44.805
so here, if you start for example,
write for example "human"...
00:18:44.806 --> 00:18:48.812
As you are writing,
its autocomplete allows you to check,
00:18:48.812 --> 00:18:52.001
for example,
this is the Shape Expressions of a human,
00:18:52.790 --> 00:18:55.937
and this is the Shape Expressions here.
00:18:55.938 --> 00:18:59.841
And as you can see,
this editor has syntax highlighting,
00:18:59.842 --> 00:19:04.559
this is... well,
maybe it's very small, the screen.
00:19:05.676 --> 00:19:07.590
I can try to make it bigger.
00:19:09.194 --> 00:19:10.973
Maybe you see it better now.
00:19:10.973 --> 00:19:14.241
So... and this is the editor
with syntax highlighting and also has...
00:19:14.241 --> 00:19:17.851
I mean, this editor
comes from the same source code
00:19:17.851 --> 00:19:19.641
as the Wikidata query service.
00:19:19.642 --> 00:19:23.960
So for example,
if you hover with the mouse here,
00:19:23.961 --> 00:19:27.961
it shows you the labels
of the different properties.
00:19:27.962 --> 00:19:31.298
So I think it's very helpful because now,
00:19:32.588 --> 00:19:38.601
the entity schema editor that is
in Wikidata is just a plain text area,
00:19:38.602 --> 00:19:42.493
and I think this editor is much better
because it has autocomplete
00:19:42.494 --> 00:19:43.743
and it also has...
00:19:43.744 --> 00:19:48.241
I mean, if you, for example,
wanted to add a constraint,
00:19:48.241 --> 00:19:51.570
you say "wdt:"
00:19:51.570 --> 00:19:56.884
You start writing "author"
and then you press Ctrl+Space
00:19:56.884 --> 00:19:58.922
and it suggests the different things.
00:19:58.922 --> 00:20:02.388
So this is similar
to the Wikidata query service
00:20:02.389 --> 00:20:06.445
but specifically for Shape Expressions
00:20:06.445 --> 00:20:11.975
because my feeling is that
creating Shape Expressions
00:20:11.976 --> 00:20:15.841
is not more difficult
than writing SPARQL queries.
00:20:15.842 --> 00:20:21.255
So some people think
that it's at the same level.
00:20:22.278 --> 00:20:26.296
It's probably easier, I think,
because when we designed Shape Expressions,
00:20:26.296 --> 00:20:31.241
we intended it
to be easier to work with.
00:20:31.242 --> 00:20:35.001
OK, so this is one of the first things,
that you have this editor
00:20:35.001 --> 00:20:36.620
for Shape Expressions.
00:20:37.371 --> 00:20:41.467
And then you also have the possibility,
for example, to visualize.
00:20:41.468 --> 00:20:44.801
If you have a Shape Expression,
use for example...
00:20:44.802 --> 00:20:49.386
I think, "written work" is
a nice Shape Expression
00:20:49.386 --> 00:20:53.300
because it has some relationships
between different things.
00:20:54.823 --> 00:20:58.160
And this is the UML visualization
of written work.
00:20:58.161 --> 00:21:02.090
In UML, it is easy to see
the different properties.
00:21:02.790 --> 00:21:06.794
When you do this, I realized
when I tried with several people,
00:21:06.795 --> 00:21:09.216
they find some mistakes
in their Shape Expressions
00:21:09.217 --> 00:21:12.988
because it's easy to detect which are
the missing properties or whatever.
00:21:13.588 --> 00:21:15.771
Then there is another possibility here
00:21:15.772 --> 00:21:19.520
is that you can also validate,
I think I have it here, the validation.
00:21:20.496 --> 00:21:25.285
I think I had it in some label,
maybe I closed it.
00:21:26.267 --> 00:21:30.988
OK, but you can, for example,
you can click here, Validate entities.
00:21:32.308 --> 00:21:34.232
You, for example,
00:21:35.404 --> 00:21:41.921
"Q42" with "E42" which is author.
00:21:42.818 --> 00:21:46.180
With "human,"
I think we can do it with "human."
00:21:49.050 --> 00:21:50.050
And then it's...
00:21:50.688 --> 00:21:56.365
And it's taking a little while to do it
because this is doing the SPARQL queries
00:21:56.365 --> 00:21:59.134
and now, for example,
it's failing by the network but...
00:21:59.657 --> 00:22:01.580
So you can try it.
00:22:02.759 --> 00:22:07.026
OK, so let's continue
with the presentation, with other tools.
00:22:07.026 --> 00:22:12.353
So my advice is that if you want to try it
and you have any feedback, let me know.
00:22:13.133 --> 00:22:15.540
So to continue with the presentation...
00:22:18.923 --> 00:22:20.233
So this is WikiShape.
00:22:23.800 --> 00:22:26.509
Then, I already said this,
00:22:27.681 --> 00:22:34.157
the Shape Expressions Editor
is an independent project in GitHub.
00:22:35.605 --> 00:22:37.472
You can use it in your own project.
00:22:37.472 --> 00:22:41.036
If you want to do
a Shape Expressions tool,
00:22:41.036 --> 00:22:45.635
you can just embed it
in any other project,
00:22:45.636 --> 00:22:48.235
so this is in GitHub and you can use it.
00:22:48.868 --> 00:22:51.970
Then the same author,
it's one of my students,
00:22:52.684 --> 00:22:55.704
he also created
an editor for Shape Expressions,
00:22:55.704 --> 00:22:57.799
also inspired by
the Wikidata query service
00:22:57.800 --> 00:23:00.681
where, in a column,
00:23:00.682 --> 00:23:05.103
you have this more visual editor
of SPARQL queries
00:23:05.104 --> 00:23:07.135
where you can put these kinds of things.
00:23:07.136 --> 00:23:09.123
So this is a screen capture.
00:23:09.123 --> 00:23:12.662
You can see that
that's the Shape Expressions in text
00:23:12.662 --> 00:23:17.822
but this is a form-based Shape Expressions
where it would probably take a bit longer
00:23:18.595 --> 00:23:23.400
where you can put the different rows
on the different fields.
00:23:23.401 --> 00:23:25.800
OK, then there is ShExEr.
00:23:26.879 --> 00:23:31.882
We have... it's done by one PhD student
at the University of Oviedo
00:23:31.883 --> 00:23:34.080
and he's here, so you can present ShExEr.
00:23:38.147 --> 00:23:40.024
(Danny) Hello, I am Danny Fernández,
00:23:40.025 --> 00:23:43.800
I am a PhD student in University of Oviedo
working with Labra.
00:23:44.710 --> 00:23:47.725
Since we are running out of time,
let's make this quick,
00:23:47.726 --> 00:23:52.641
so let's not go for any actual demo,
but just print some screenshots.
00:23:52.642 --> 00:23:57.897
OK, so the usual way to work with
Shape Expressions or any shape language
00:23:57.897 --> 00:23:59.521
is that you have a domain expert
00:23:59.522 --> 00:24:02.313
that defines a priori
what the graph should look like,
00:24:02.314 --> 00:24:03.555
defines some structures,
00:24:03.556 --> 00:24:06.983
and then you use these structures
to validate the actual data against it.
00:24:08.124 --> 00:24:11.641
This tool, like the ones
that Labra has been presenting,
00:24:11.642 --> 00:24:14.441
is a general-purpose tool
for any RDF source,
00:24:14.442 --> 00:24:17.375
but it is designed to work the other way around.
00:24:17.376 --> 00:24:18.758
You already have some data,
00:24:18.759 --> 00:24:23.165
you select what nodes
you want to get the shape for
00:24:23.165 --> 00:24:26.718
and then you automatically
extract or infer the shape.
00:24:26.719 --> 00:24:29.791
So even if this is a general purpose tool,
00:24:29.791 --> 00:24:34.063
what we did for this WikidataCon
is this fancy button
00:24:34.884 --> 00:24:37.081
that if you click it,
essentially what happens
00:24:37.081 --> 00:24:42.079
is that there are
so many configuration params
00:24:42.080 --> 00:24:46.251
and it configures it to work
against the Wikidata endpoint
00:24:46.251 --> 00:24:47.971
and it will end soon, sorry.
00:24:48.733 --> 00:24:52.883
So, once you press this button
what you get is essentially this.
00:24:52.884 --> 00:24:55.126
After having selected what kind of nodes,
00:24:55.127 --> 00:24:59.360
what kind of instances of a class,
whatever you are looking for,
00:24:59.361 --> 00:25:01.321
you get an automatic schema.
00:25:02.319 --> 00:25:07.111
All the constraints are sorted
by how many nodes actually conform to them,
00:25:07.112 --> 00:25:09.772
you can filter the less common ones, etc.
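The extract-then-rank idea described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not ShExEr's actual implementation: the function name `infer_shape`, the `min_ratio` threshold, and the sample nodes are all hypothetical, and real shape inference also looks at value types and cardinalities.

```python
from collections import Counter


def infer_shape(nodes, min_ratio=0.5):
    """Rank predicates by the share of nodes that use them.

    `nodes` maps a node ID to the list of predicates it uses.
    Predicates used by fewer than `min_ratio` of the nodes are dropped.
    """
    counts = Counter()
    for predicates in nodes.values():
        counts.update(set(predicates))  # count each predicate once per node
    total = len(nodes)
    ranked = sorted(
        ((pred, n / total) for pred, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),  # most common first, then by name
    )
    return [(pred, ratio) for pred, ratio in ranked if ratio >= min_ratio]


# Hypothetical sample data: node ID -> predicates found on that node.
SAMPLE_NODES = {
    "node-1": ["P31", "P569", "P106"],
    "node-2": ["P31", "P569"],
    "node-3": ["P31", "P106"],
    "node-4": ["P31", "P18"],
}

if __name__ == "__main__":
    for pred, ratio in infer_shape(SAMPLE_NODES):
        print(f"{pred}: used by {ratio:.0%} of the selected nodes")
```

Raising `min_ratio` corresponds to filtering out the less common constraints; setting it to 0 keeps every predicate that was observed.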
00:25:09.772 --> 00:25:12.126
So there is a poster downstairs
about this stuff
00:25:12.127 --> 00:25:14.595
and well,
I will be downstairs and upstairs
00:25:14.596 --> 00:25:16.454
and all over the place all day,
00:25:16.455 --> 00:25:19.081
so if you have any further
interest in this tool,
00:25:19.082 --> 00:25:21.476
just speak to me during this journey.
00:25:21.477 --> 00:25:24.624
And now, I'll give back
the micro to Labra, thank you.
00:25:24.625 --> 00:25:29.265
(applause)
00:25:29.812 --> 00:25:32.578
(Jose) So let's continue
with the other tools.
00:25:32.579 --> 00:25:34.984
The other tool is the ShapeDesigner.
00:25:34.984 --> 00:25:37.241
Andra, do you want to do
the ShapeDesigner now
00:25:37.242 --> 00:25:39.287
or maybe later or in the workshop?
00:25:39.287 --> 00:25:40.603
There is a workshop...
00:25:40.603 --> 00:25:44.437
This afternoon, there is a workshop
specifically for Shape Expressions, and...
00:25:45.265 --> 00:25:47.939
The idea is that it was going to be
more hands on,
00:25:47.940 --> 00:25:52.324
and if you want to practice
some ShEx, you can do it there.
00:25:52.875 --> 00:25:55.720
This tool is ShEx...
and there is Eric here,
00:25:55.721 --> 00:25:56.890
so you can present it.
00:25:57.969 --> 00:26:00.687
(Eric) So just super quick,
the thing that I want to say
00:26:00.687 --> 00:26:05.711
is that you've probably
already seen the ShEx interface
00:26:05.711 --> 00:26:07.601
that's tailored for Wikidata.
00:26:07.602 --> 00:26:12.930
That's effectively stripped down
and tailored specifically for Wikidata
00:26:12.930 --> 00:26:17.937
because the generic one has more features
but it turns out I thought I'd mention it
00:26:17.937 --> 00:26:19.977
because one of those features
is particularly useful
00:26:19.978 --> 00:26:23.201
for debugging Wikidata schemas,
00:26:23.201 --> 00:26:29.224
which is if you go
and you select the slurp mode,
00:26:29.225 --> 00:26:31.444
what it does is it says
while I'm validating,
00:26:31.445 --> 00:26:34.694
I want to pull all the triples down
and that means
00:26:34.695 --> 00:26:36.274
if I get a bunch of failures,
00:26:36.275 --> 00:26:39.586
I can go through and start looking
at those failures and saying,
00:26:39.587 --> 00:26:41.800
OK, what are the triples
that are in here,
00:26:41.801 --> 00:26:44.120
sorry, I apologize,
the triples are down there,
00:26:44.121 --> 00:26:45.647
this is just a log of what went by.
00:26:46.327 --> 00:26:49.180
And then you can just sit there
and fiddle with it in real time
00:26:49.181 --> 00:26:51.033
like you play with something
and it changes.
00:26:51.033 --> 00:26:54.160
So it's a quicker version
for doing all that stuff.
00:26:55.361 --> 00:26:56.481
This is a ShExC form,
00:26:56.482 --> 00:26:59.455
this is something [Joachim] had suggested
00:27:00.035 --> 00:27:04.631
could be useful for populating
Wikidata documents
00:27:04.631 --> 00:27:07.338
based on a Shape Expression
for that document.
00:27:08.095 --> 00:27:11.681
This is not tailored for Wikidata,
00:27:11.682 --> 00:27:14.081
but this is just to say
that you can have a schema
00:27:14.082 --> 00:27:15.402
and you can have some annotations
00:27:15.403 --> 00:27:17.518
to say specifically how I want
that schema rendered
00:27:17.519 --> 00:27:19.031
and then it just builds a form,
00:27:19.031 --> 00:27:21.191
and if you've got data,
it can even populate the form.
00:27:24.517 --> 00:27:26.164
PyShEx [inaudible].
00:27:28.025 --> 00:27:31.080
(Jose) I think this is the last one.
00:27:31.821 --> 00:27:34.080
Yes, so the last one is PyShEx.
00:27:34.675 --> 00:27:38.151
PyShEx is a Python implementation
of Shape Expressions,
00:27:39.193 --> 00:27:42.680
you can play also with Jupyter Notebooks
if you want that kind of thing.
00:27:42.680 --> 00:27:44.432
OK, so that's all for this.
00:27:44.433 --> 00:27:47.170
(applause)
00:27:52.916 --> 00:27:57.073
(Andra) So I'm going to talk about
a specific project that I'm involved in
00:27:57.074 --> 00:27:58.074
called Gene Wiki,
00:27:58.075 --> 00:28:04.596
and where we are also
dealing with quality issues.
00:28:04.597 --> 00:28:06.684
But before going into the quality,
00:28:06.685 --> 00:28:09.229
maybe a quick introduction
about what Gene Wiki is,
00:28:09.855 --> 00:28:15.175
and we just released a pre-print
of a paper that we recently have written
00:28:15.175 --> 00:28:18.160
that explains the details of the project.
00:28:19.821 --> 00:28:23.839
I see people taking pictures,
but basically, what Gene Wiki does,
00:28:23.846 --> 00:28:28.027
it's trying to get biomedical data,
public data into Wikidata,
00:28:28.028 --> 00:28:32.200
and we follow a specific pattern
to get that data into Wikidata.
00:28:33.130 --> 00:28:36.809
So when we have a new repository
or a new data set
00:28:36.810 --> 00:28:39.600
that is eligible
to be included into Wikidata,
00:28:39.601 --> 00:28:41.293
the first step is community engagement.
00:28:41.294 --> 00:28:43.784
It is not necessarily
directly to a Wikidata community
00:28:43.785 --> 00:28:46.120
but a local research community,
00:28:46.121 --> 00:28:50.286
and we meet in person
or online or on any platform
00:28:50.286 --> 00:28:52.881
and try to come up with a data model
00:28:52.882 --> 00:28:56.197
that bridges their data
with the Wikidata model.
00:28:56.197 --> 00:28:59.944
So here I have a picture of a workshop
that happened here last year
00:28:59.945 --> 00:29:02.663
which was trying to look
at a specific data set
00:29:02.663 --> 00:29:05.280
and, well, you see a lot of discussions,
00:29:05.281 --> 00:29:09.780
then aligning it with schema.org
and other ontologies that are out there.
00:29:10.320 --> 00:29:15.508
And then, at the end of the first step,
we have a whiteboard drawing of the schema
00:29:15.509 --> 00:29:17.336
that we want to implement in Wikidata.
00:29:17.337 --> 00:29:20.440
What you see over there,
this is just plain,
00:29:20.441 --> 00:29:21.766
we have it in the back there
00:29:21.767 --> 00:29:25.240
so we can make some schemas
within this panel today even.
00:29:26.560 --> 00:29:28.399
So once we have the schema in place,
00:29:28.400 --> 00:29:31.320
the next thing is try to make
that schema machine readable
00:29:32.358 --> 00:29:36.841
because you want to have actionable models
to bridge the data that you're bringing in
00:29:36.842 --> 00:29:39.690
from any biomedical database
into Wikidata.
00:29:40.393 --> 00:29:45.182
And here we are applying
Shape Expressions.
00:29:46.471 --> 00:29:52.518
And we use that because
Shape Expressions allow you to test
00:29:52.518 --> 00:29:57.040
whether the data set
is actually-- no, to first see
00:29:57.041 --> 00:30:01.782
if already existing data in Wikidata
follows the same data model
00:30:01.783 --> 00:30:04.718
that was achieved in the previous process.
00:30:04.719 --> 00:30:06.641
So then with the Shape Expression
we can check:
00:30:06.642 --> 00:30:10.926
OK the data that are on this topic
in Wikidata, does it need some cleaning up
00:30:10.926 --> 00:30:15.013
or do we need to adapt our model
to the Wikidata model or vice versa.
00:30:15.937 --> 00:30:19.867
Once that is in place,
we start writing bots,
00:30:20.670 --> 00:30:23.801
and bots are seeding the information
00:30:23.802 --> 00:30:27.308
that is in the primary sources
into Wikidata.
00:30:27.846 --> 00:30:29.303
And when the bots are ready,
00:30:29.304 --> 00:30:33.001
we write these bots
with a platform called--
00:30:33.002 --> 00:30:36.201
with a Python library
called Wikidata Integrator
00:30:36.202 --> 00:30:38.167
that came out of our project.
00:30:38.698 --> 00:30:42.921
And once we have our bots,
we use a platform called Jenkins
00:30:42.921 --> 00:30:44.540
for continuous integration.
00:30:44.540 --> 00:30:45.762
And with Jenkins,
00:30:45.762 --> 00:30:51.160
we continuously update
the primary sources with Wikidata.
00:30:52.178 --> 00:30:55.889
And this is a diagram from the paper
I previously mentioned.
00:30:55.890 --> 00:30:57.241
This is our current landscape.
00:30:57.242 --> 00:31:02.059
So every orange box out there
is a primary resource on drugs,
00:31:02.060 --> 00:31:07.827
proteins, genes, diseases,
chemical compounds with interaction,
00:31:07.827 --> 00:31:10.870
and this model is too small to read now
00:31:10.870 --> 00:31:17.472
but this is the database,
the sources that we manage in Wikidata
00:31:17.473 --> 00:31:20.560
and bridge with the primary sources.
00:31:20.561 --> 00:31:22.355
Here is such a workflow.
00:31:22.870 --> 00:31:25.312
So one of our partners
is the Disease Ontology.
00:31:25.312 --> 00:31:27.672
The Disease Ontology is a CC0 ontology,
00:31:28.179 --> 00:31:31.990
and the CC0 Ontology
has a curation cycle on its own,
00:31:32.756 --> 00:31:35.736
and they just continuously
update the Disease Ontology
00:31:35.737 --> 00:31:39.687
to reflect the disease space
or the interpretation of diseases.
00:31:40.336 --> 00:31:44.361
And there is the Wikidata
curation cycle also on diseases
00:31:44.362 --> 00:31:49.844
where the Wikidata community constantly
monitors what's going on on Wikidata.
00:31:50.406 --> 00:31:51.601
And then we have two roles,
00:31:51.602 --> 00:31:55.477
we call them colloquially
the gatekeeper curator,
00:31:56.009 --> 00:31:59.561
and this was me
and a colleague five years ago
00:31:59.562 --> 00:32:03.414
where we just sit on our computers
and we monitor Wikipedia and Wikidata,
00:32:03.415 --> 00:32:08.601
and if there is an issue, that was
reported back to the primary community,
00:32:08.602 --> 00:32:11.765
the primary resources, they looked
at the implementation and decided:
00:32:11.765 --> 00:32:14.240
OK, do we trust the Wikidata input?
00:32:14.850 --> 00:32:18.555
Yes--then it's considered,
it goes into the cycle,
00:32:18.555 --> 00:32:22.686
and the next iteration
is part of the Disease Ontology
00:32:22.687 --> 00:32:25.411
and fed back into Wikidata.
00:32:27.419 --> 00:32:31.480
We're doing the same for WikiPathways.
00:32:31.481 --> 00:32:36.601
WikiPathways is a MediaWiki-inspired
pathway repository.
00:32:36.602 --> 00:32:40.901
Same story, there are different
pathway resources on Wikidata already.
00:32:41.463 --> 00:32:44.713
There might be conflicts
between those pathway resources
00:32:44.722 --> 00:32:46.701
and these conflicts are reported back
00:32:46.702 --> 00:32:49.521
by the gatekeeper curators
to that community,
00:32:49.522 --> 00:32:53.715
and you maintain
the individual curation cycles.
00:32:53.715 --> 00:32:57.068
But if you remember the previous cycle,
00:32:57.069 --> 00:33:03.041
here I mentioned
only two cycles, two resources,
00:33:03.566 --> 00:33:06.300
we have to do that
for every single resource that we have
00:33:06.300 --> 00:33:08.061
and we have to manage what's going on
00:33:08.062 --> 00:33:09.185
because when I say curation,
00:33:09.185 --> 00:33:11.377
I really mean going
to the Wikipedia talk pages,
00:33:11.377 --> 00:33:14.544
going into the Wikidata talk pages
and trying to do that.
00:33:14.545 --> 00:33:19.316
That doesn't scale for
the two gatekeeper curators we had.
00:33:19.860 --> 00:33:22.777
So when I was in a conference in 2016
00:33:22.778 --> 00:33:26.933
where Eric gave a presentation
on Shape Expressions,
00:33:26.934 --> 00:33:29.277
I jumped on the bandwagon and said OK,
00:33:29.278 --> 00:33:34.240
Shape Expressions can help us
detect differences in Wikidata
00:33:34.240 --> 00:33:41.159
and so that allows the gatekeepers to have
some more efficient reporting.
00:33:42.275 --> 00:33:46.019
So this year,
I was delighted by the schema entity
00:33:46.020 --> 00:33:50.765
because now, we can store
those entity schemas on Wikidata,
00:33:50.765 --> 00:33:53.183
on Wikidata itself,
whereas before, it was on GitHub,
00:33:53.860 --> 00:33:56.815
and this aligns
with the Wikidata interface,
00:33:56.816 --> 00:33:59.350
so you have things
like document discussions
00:33:59.350 --> 00:34:00.762
but you also have revisions.
00:34:00.763 --> 00:34:05.261
So you can leverage the talk pages
and the revisions in Wikidata
00:34:05.262 --> 00:34:12.255
to use that to discuss
about what is in Wikidata
00:34:12.255 --> 00:34:14.060
and what are in the primary resources.
00:34:14.966 --> 00:34:19.686
So this is what Eric just presented,
this is already quite a benefit.
00:34:19.686 --> 00:34:24.335
So here, we made up a Shape Expression
for the human gene,
00:34:24.336 --> 00:34:30.225
and then we ran it through simple ShEx,
and as you can see,
00:34:30.225 --> 00:34:32.428
we just got already ni--
00:34:32.429 --> 00:34:34.641
There is one issue
that needs to be monitored
00:34:34.642 --> 00:34:37.316
which is that there is an item
that doesn't fit that schema,
00:34:37.316 --> 00:34:43.139
and then you can sort of already
create schema entities curation reports
00:34:43.140 --> 00:34:46.240
based on... and send that
to the different curation reports.
00:34:48.058 --> 00:34:52.788
But ShEx.js is a built interface,
00:34:52.788 --> 00:34:55.860
and if I can show back here,
I only do ten,
00:34:55.860 --> 00:35:00.362
but we have tens of thousands,
and so that again doesn't scale.
00:35:00.362 --> 00:35:04.654
So the Wikidata Integrator now
supports ShEx as well,
00:35:05.168 --> 00:35:07.431
and then we can just loop over items
00:35:07.431 --> 00:35:11.494
where we say yes-no,
yes-no, true-false, true-false.
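The "loop over items, get true or false" pattern can be sketched as below. This is a deliberately simplified stand-in, assuming a shape reduced to "property must be present, optionally with one of a set of allowed values". It does not use the Wikidata Integrator or any ShEx engine, and the names `conforms`, `validate_all`, `HUMAN_GENE_SHAPE`, and the sample items are hypothetical. The property and item IDs are only illustrative.

```python
def conforms(item, shape):
    """Return True if the item satisfies every constraint in the shape.

    `shape` maps a property ID to a set of allowed values, or to None
    when any value is acceptable as long as the property is present.
    """
    for prop, allowed in shape.items():
        values = item.get(prop, [])
        if not values:
            return False  # required property is missing
        if allowed is not None and not any(v in allowed for v in values):
            return False  # property present, but no allowed value
    return True


def validate_all(items, shape):
    """Map each item ID to a plain True/False conformance result."""
    return {item_id: conforms(item, shape) for item_id, item in items.items()}


# Toy shape (assumption): instance of gene, found in taxon Homo sapiens,
# and some Entrez Gene ID.
HUMAN_GENE_SHAPE = {"P31": {"Q7187"}, "P703": {"Q15978631"}, "P351": None}

# Made-up items for illustration only.
ITEMS = {
    "gene-1": {"P31": ["Q7187"], "P703": ["Q15978631"], "P351": ["1017"]},
    "gene-2": {"P31": ["Q7187"], "P703": ["Q15978631"]},               # no ID
    "gene-3": {"P31": ["Q7187"], "P703": ["Q83310"], "P351": ["123"]},  # other taxon
}

if __name__ == "__main__":
    for item_id, ok in validate_all(ITEMS, HUMAN_GENE_SHAPE).items():
        print(item_id, ok)
```

The items that come back `False` are the ones a curation report would list for a gatekeeper curator to look at.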
00:35:11.495 --> 00:35:12.495
So again,
00:35:13.065 --> 00:35:16.514
increasing a bit of the efficiency
of dealing with the reports.
00:35:17.256 --> 00:35:22.662
But now, recently, that builds
on the Wikidata Query Service,
00:35:23.181 --> 00:35:24.998
and well, we have recently been throttled,
00:35:24.999 --> 00:35:26.560
so again, that doesn't scale.
00:35:26.561 --> 00:35:31.391
So it's still an ongoing process,
how to deal with models on Wikidata.
00:35:32.202 --> 00:35:36.682
And so again,
ShEx is not only intimidating
00:35:36.683 --> 00:35:40.356
but also the scale is just
too big to deal with.
00:35:41.068 --> 00:35:46.081
So I started working, this is my first
proof of concept or exercise
00:35:46.082 --> 00:35:47.680
where I used a tool called yEd,
00:35:48.184 --> 00:35:52.590
and I started to draw
those Shape Expressions and because...
00:35:52.591 --> 00:35:58.098
and then regenerate this schema
00:35:58.099 --> 00:36:01.279
into this JSON format
of the Shape Expressions,
00:36:01.280 --> 00:36:04.520
so that would open up already
to the audience
00:36:04.521 --> 00:36:07.432
that are intimidated
by the Shape Expressions languages.
00:36:07.961 --> 00:36:12.308
But actually, there is a problem
with those visual descriptions
00:36:12.309 --> 00:36:18.229
because this is also a schema
that was actually drawn in yEd by someone.
00:36:18.230 --> 00:36:23.838
And here is another one
which is beautiful.
00:36:23.838 --> 00:36:29.414
I would love to have this on my wall,
but it is still not interoperable.
00:36:30.281 --> 00:36:32.131
So I want to end my talk with,
00:36:32.131 --> 00:36:35.732
and the first time, I've been
stealing this slide, using this slide.
00:36:35.732 --> 00:36:37.594
It's an honor to have him in the audience
00:36:37.595 --> 00:36:39.423
and I really like this:
00:36:39.424 --> 00:36:42.362
"People think RDF is a pain
because it's complicated.
00:36:42.362 --> 00:36:43.985
The truth is even worse, it's so simple,
00:36:45.581 --> 00:36:48.133
because you have to work
with real-world data problems
00:36:48.134 --> 00:36:50.031
that are horribly complicated.
00:36:50.031 --> 00:36:51.451
While you can avoid RDF,
00:36:51.451 --> 00:36:55.760
it is harder to avoid complicated data
and complicated computer problems."
00:36:55.761 --> 00:36:59.535
This is about RDF, but I think
this also applies to modeling as well.
00:37:00.112 --> 00:37:02.769
So my point of discussion
is should we really...
00:37:03.387 --> 00:37:05.882
How do we get modeling going?
00:37:05.882 --> 00:37:10.826
Should we discuss ShEx
or visual models or...
00:37:11.426 --> 00:37:13.271
How do we continue?
00:37:13.474 --> 00:37:14.840
Thank you very much for your time.
00:37:15.102 --> 00:37:17.787
(applause)
00:37:20.001 --> 00:37:21.188
(Lydia) Thank you so much.
00:37:21.692 --> 00:37:24.001
Would you come to the front
00:37:24.002 --> 00:37:27.741
so that we can open
the questions from the audience.
00:37:28.610 --> 00:37:30.203
Are there questions?
00:37:31.507 --> 00:37:32.507
Yes.
00:37:34.253 --> 00:37:36.890
And I think, for the camera, we need to...
00:37:38.835 --> 00:37:40.968
(Lydia laughing) Yeah.
00:37:43.094 --> 00:37:46.273
(man3) So a question
for Cristina, I think.
00:37:47.366 --> 00:37:51.641
So you mentioned exactly
the term "information gain"
00:37:51.642 --> 00:37:53.689
from linking with other systems.
00:37:53.690 --> 00:37:55.619
There is an information theoretic measure
00:37:55.620 --> 00:37:58.001
using statistics and probability
called information gain.
00:37:58.002 --> 00:37:59.541
Do you have the same...
00:37:59.542 --> 00:38:01.736
I mean did you mean exactly that measure,
00:38:01.736 --> 00:38:04.173
the information gain
from the probability theory
00:38:04.174 --> 00:38:05.240
from information theory
00:38:05.241 --> 00:38:09.024
or just use this conceptual thing
to measure information gain some way?
00:38:09.025 --> 00:38:13.016
No, so we actually defined
and implemented measures
00:38:13.695 --> 00:38:20.161
that are using the Shannon entropy,
so it's meant as that.
00:38:20.162 --> 00:38:22.696
I didn't want to go into
details of the concrete formulas...
00:38:22.697 --> 00:38:24.977
(man3) No, no, of course,
that's why I asked the question.
00:38:24.978 --> 00:38:26.698
- (Cristina) But yeah...
- (man3) Thank you.
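For readers unfamiliar with the measure the question refers to, the textbook definition can be sketched as follows. The concrete formulas used in the work discussed here were not given in the talk, so this is only the standard Shannon entropy and information gain, with hypothetical function names and toy inputs.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the weighted entropy after splitting by `groups`.

    `groups[i]` is the group that `labels[i]` falls into.
    """
    total = len(labels)
    by_group = {}
    for label, group in zip(labels, groups):
        by_group.setdefault(group, []).append(label)
    remainder = sum(
        (len(subset) / total) * entropy(subset) for subset in by_group.values()
    )
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(information_gain(labels, ["x", "x", "y", "y"]))  # split separates labels
    print(information_gain(labels, ["x", "y", "x", "y"]))  # split tells us nothing
```

A gain close to the entropy of the labels means the extra information almost fully determines them; a gain near zero means it adds nothing.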
00:38:33.091 --> 00:38:35.047
(man4) Maybe more
of a comment than a question.
00:38:35.048 --> 00:38:36.241
(Lydia) Go for it.
00:38:36.242 --> 00:38:39.840
(man4) So there's been
a lot of focus at the item level
00:38:39.840 --> 00:38:42.547
about quality and completeness,
00:38:42.547 --> 00:38:47.374
one of the things that concerns me is that
we're not applying the same to hierarchies
00:38:47.374 --> 00:38:51.480
and I think an issue we have
is that our hierarchy often isn't good.
00:38:51.481 --> 00:38:53.463
We're seeing
this is going to be a real problem
00:38:53.464 --> 00:38:55.774
with Commons searching and other things.
00:38:56.771 --> 00:39:00.601
One of the abilities that we can do
is to import external--
00:39:00.602 --> 00:39:04.842
The way that external thesauruses
structure their hierarchies,
00:39:04.842 --> 00:39:10.291
using the P4900
broader concept qualifier.
00:39:11.037 --> 00:39:16.167
But what I think would be really helpful
would be much better tools for doing that
00:39:16.168 --> 00:39:21.212
so that you can import an
external... thesaurus's hierarchy
00:39:21.212 --> 00:39:24.111
and map that onto our Wikidata items.
00:39:24.111 --> 00:39:28.199
Once it's in place
with those P4900 qualifiers,
00:39:28.200 --> 00:39:31.494
you can actually do some
quite good querying through SPARQL
00:39:32.490 --> 00:39:37.534
to see where our hierarchy
diverges from that external hierarchy.
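The comparison described here is done with SPARQL in practice; the underlying check can be sketched in plain Python. Everything below is a hypothetical illustration: the dictionaries stand in for the broader-concept qualifiers and for our own subclass links, and the concept names are made up.

```python
def ancestors(node, parents):
    """All nodes reachable by following parent links upward from `node`."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def divergences(external_broader, our_parents):
    """Pairs (item, broader) stated externally but not implied by our hierarchy."""
    return sorted(
        (item, broader)
        for item, broader_items in external_broader.items()
        for broader in broader_items
        if broader not in ancestors(item, our_parents)
    )


# Hypothetical external thesaurus: concept -> broader concepts.
EXTERNAL_BROADER = {
    "ballgown": ["gown"],
    "gown": ["dress"],
    "kilt": ["skirt"],
}

# Hypothetical subclass links on our side: item -> direct superclasses.
OUR_PARENTS = {
    "ballgown": ["dress"],
    "gown": ["dress"],
    "kilt": ["clothing"],
    "skirt": ["clothing"],
    "dress": ["clothing"],
}

if __name__ == "__main__":
    for item, broader in divergences(EXTERNAL_BROADER, OUR_PARENTS):
        print(f"{item}: external source says broader concept is {broader}")
```

Each reported pair is a place where an intermediate level may be missing, or where an item hangs off a different parent than in the external source.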
00:39:37.534 --> 00:39:41.346
For instance, [Paula Morma],
user PKM, you may know,
00:39:41.346 --> 00:39:43.533
does a lot of work on fashion.
00:39:43.533 --> 00:39:50.524
So we use that to pull in the Europeana
Fashion Thesaurus's hierarchy
00:39:50.524 --> 00:39:53.812
and the Getty AAT
fashion thesaurus hierarchy,
00:39:53.812 --> 00:39:57.957
and then see where the gaps
were in our higher level items,
00:39:57.957 --> 00:40:00.511
which is a real problem for us
because often,
00:40:00.511 --> 00:40:04.355
these are things that only exist
as disambiguation pages on Wikipedia,
00:40:04.356 --> 00:40:09.270
so we have a lot of higher level items
in our hierarchies missing
00:40:09.271 --> 00:40:14.480
and this is something that we must address
in terms of quality and completeness,
00:40:14.480 --> 00:40:15.971
but what would really help
00:40:16.643 --> 00:40:20.871
would be better tools than
the jungle of Perl scripts that I wrote...
00:40:20.872 --> 00:40:26.010
If somebody could put that
into a PAWS notebook in Python
00:40:26.561 --> 00:40:31.972
to be able to take an external thesaurus,
take its hierarchy,
00:40:31.973 --> 00:40:34.595
which may well be available
as linked data or may not,
00:40:35.379 --> 00:40:40.580
to then put those into
QuickStatements to put in P4900 values.
00:40:41.165 --> 00:40:42.165
And then later,
00:40:42.166 --> 00:40:44.527
when our representation
gets more complete,
00:40:44.528 --> 00:40:49.691
to update those P4900s
because as our representation gets updated,
00:40:49.691 --> 00:40:51.590
becomes more dense,
00:40:51.590 --> 00:40:55.377
the values of those qualifiers
need to change
00:40:56.230 --> 00:40:59.526
to represent that we've got more
of their hierarchy in our system.
00:40:59.526 --> 00:41:03.728
If somebody could do that,
I think that would be very helpful,
00:41:03.728 --> 00:41:07.121
and we do need to also
look at other approaches
00:41:07.122 --> 00:41:10.762
to improve quality and completeness
at the hierarchy level
00:41:10.763 --> 00:41:12.378
not just at the item level.
00:41:13.308 --> 00:41:14.840
(Andra) Can I add to that?
00:41:16.362 --> 00:41:19.901
Yes, and we actually do that,
00:41:19.911 --> 00:41:23.551
and I can recommend looking at
the Shape Expression that Finn made
00:41:23.552 --> 00:41:27.330
with the lexical data
where he creates Shape Expressions
00:41:27.330 --> 00:41:29.640
and then builds on other Shape Expressions
00:41:29.641 --> 00:41:32.528
so you have this concept
of linked Shape Expressions in Wikidata,
00:41:32.529 --> 00:41:35.005
and specifically, the use case,
if I understand correctly,
00:41:35.006 --> 00:41:37.183
is exactly what we are doing in Gene Wiki.
00:41:37.184 --> 00:41:40.841
So you have the Disease Ontology
which is put into Wikidata
00:41:40.842 --> 00:41:44.681
and then disease data comes in
and we apply the Shape Expressions
00:41:44.682 --> 00:41:47.247
to see if that fits with this thesaurus.
00:41:47.248 --> 00:41:50.919
And there are other thesauruses or other
ontologies for controlled vocabularies
00:41:50.920 --> 00:41:52.559
that still need to go into Wikidata,
00:41:52.559 --> 00:41:55.401
and that's exactly why
Shape Expression is so interesting
00:41:55.402 --> 00:41:57.963
because you can have a Shape Expression
for the Disease Ontology,
00:41:57.964 --> 00:41:59.644
you can have a Shape Expression for MeSH,
00:41:59.645 --> 00:42:01.761
you can say: OK,
now I want to check the quality.
00:42:01.762 --> 00:42:04.059
Because you also have
in Wikidata the context
00:42:04.060 --> 00:42:09.567
of when you have a controlled vocabulary,
you say the quality is according to this,
00:42:09.568 --> 00:42:11.636
but you might have
a disagreeing community.
00:42:11.636 --> 00:42:16.081
So the tooling is indeed in place
but now the task is indeed to create those models
00:42:16.082 --> 00:42:18.144
and apply them
on the different use cases.
00:42:18.811 --> 00:42:20.921
(man4) Shape Expressions are very useful
00:42:20.922 --> 00:42:25.928
once you have the external ontology
mapped into Wikidata,
00:42:25.929 --> 00:42:29.474
but my problem is that
it's getting to that stage,
00:42:29.475 --> 00:42:34.881
it's working out how much of the
external ontology isn't yet in Wikidata
00:42:34.882 --> 00:42:36.256
and where the gaps are,
00:42:36.257 --> 00:42:40.660
and that's where I think that
having much more robust tools
00:42:40.660 --> 00:42:44.286
to see what's missing
from external ontologies
00:42:44.286 --> 00:42:45.537
would be very helpful.
00:42:47.678 --> 00:42:49.062
The biggest problem there
00:42:49.062 --> 00:42:51.201
is not so much tooling
but more licensing.
00:42:51.803 --> 00:42:55.249
So getting the ontologies
into Wikidata is actually a piece of cake
00:42:55.250 --> 00:42:59.295
but most of the ontologies have,
how can I say that politely,
00:42:59.965 --> 00:43:03.256
restrictive licensing,
so they are not compatible with Wikidata.
00:43:04.068 --> 00:43:06.678
(man4) There's a huge number
of public sector thesauruses
00:43:06.678 --> 00:43:08.209
in cultural fields.
00:43:08.210 --> 00:43:10.851
- (Andra) Then we need to talk.
- (man4) Not a problem.
00:43:10.852 --> 00:43:12.384
(Andra) Then we need to talk.
00:43:13.624 --> 00:43:19.192
(man5) Just... the comment I want to make
is actually an answer to James,
00:43:19.192 --> 00:43:22.401
so the thing is that
hierarchies make graphs,
00:43:22.374 --> 00:43:24.041
and when you want to...
00:43:24.579 --> 00:43:28.888
I want to basically talk about...
a common problem in hierarchies
00:43:28.889 --> 00:43:30.820
is circular hierarchies,
00:43:30.821 --> 00:43:33.796
so they come back to each other
when there's a problem,
00:43:33.796 --> 00:43:35.920
which you should not
have in hierarchies.
00:43:37.022 --> 00:43:41.295
This, funnily enough,
happens in categories in Wikipedia a lot
00:43:41.295 --> 00:43:42.990
we have a lot of circles in categories,
00:43:43.898 --> 00:43:46.612
but the good news is that this is...
00:43:47.713 --> 00:43:51.582
Technically, it's an NP-complete problem,
so you cannot find this
00:43:51.583 --> 00:43:53.414
easily if you build a graph of that,
00:43:54.473 --> 00:43:57.046
but there are lots of ways
that have been developed
00:43:57.047 --> 00:44:00.624
to find problems
in these hierarchy graphs.
00:44:00.625 --> 00:44:04.860
Like there is a paper
called Finding Cycles...
00:44:04.861 --> 00:44:07.955
Breaking Cycles in Noisy Hierarchies,
00:44:07.956 --> 00:44:12.671
and it's been used to help
categorization of English Wikipedia.
00:44:12.672 --> 00:44:17.141
You can just take this
and apply it to these hierarchies in Wikidata,
00:44:17.142 --> 00:44:19.540
and then you can find
things that are problematic
00:44:19.541 --> 00:44:22.481
and just remove the ones
that are causing issues
00:44:22.482 --> 00:44:24.593
and find the issues, actually.
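Finding that a hierarchy contains a cycle at all is a plain depth-first search; deciding which links are best to remove is the harder part that the paper mentioned above addresses. A minimal sketch of the detection step only, with a hypothetical function name and toy graphs (the recursive form is fine for small examples, but a very deep hierarchy would need an iterative version):

```python
def find_cycle(parents):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the hierarchy has no cycle.

    `parents` maps a node to the nodes it is a subclass or subcategory of.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for parent in parents.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GREY:  # parent is already on the current path
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(parents):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
    print(find_cycle({"a": ["b"], "b": ["c"]}))
```

Every link on a reported cycle is a candidate for review; the search itself says nothing about which of them is the wrong one.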
00:44:24.594 --> 00:44:26.960
So this is just an idea, just so you...
00:44:28.780 --> 00:44:29.930
(man4) That's all very well
00:44:29.931 --> 00:44:34.402
but I think you're underestimating
the number of bad subclass relations
00:44:34.402 --> 00:44:35.402
that we have.
00:44:35.403 --> 00:44:39.680
It's like having a city
in completely the wrong country,
00:44:40.250 --> 00:44:44.874
and there are tools for geography
to identify that,
00:44:44.875 --> 00:44:49.201
and we need to have
much better tools in hierarchies
00:44:49.202 --> 00:44:53.477
to identify where the equivalent
of the item for the country
00:44:53.478 --> 00:44:57.673
is missing entirely,
or where it's actually been subclassed
00:44:57.674 --> 00:45:01.804
to something that means
something completely different.
00:45:02.804 --> 00:45:07.165
(Lydia) Yeah, I think
you're getting to something
00:45:07.166 --> 00:45:12.024
that me and my team keep hearing
from people who reuse our data
00:45:12.025 --> 00:45:13.991
quite a bit as well, right,
00:45:15.002 --> 00:45:16.638
Individual data points might be great
00:45:16.639 --> 00:45:20.163
but if you have to look
at the ontology and so on,
00:45:20.164 --> 00:45:21.857
then it gets very...
00:45:22.388 --> 00:45:26.437
And I think one of the big problems
why this is happening
00:45:26.437 --> 00:45:30.736
is that a lot of editing on Wikidata
00:45:30.736 --> 00:45:34.544
happens on the basis
of an individual item, right,
00:45:34.545 --> 00:45:36.201
you make an edit on that item,
00:45:37.653 --> 00:45:42.075
without realizing that this
might have very global consequences
00:45:42.075 --> 00:45:44.245
on the rest of the graph, for example.
00:45:44.245 --> 00:45:50.040
And if people have ideas around
how to make this more visible,
00:45:50.041 --> 00:45:53.185
the consequences
of an individual local edit,
00:45:54.005 --> 00:45:56.537
I think that would be worth exploring,
00:45:57.550 --> 00:46:01.583
to show people better
what the consequence of their edit
00:46:01.584 --> 00:46:03.434
that they might do in very good faith,
00:46:04.481 --> 00:46:05.481
what that is.
00:46:06.939 --> 00:46:12.237
Whoa! OK, let's start with, yeah, you,
then you, then you, then you.
00:46:12.237 --> 00:46:13.921
(man5) Well, after the discussion,
00:46:13.922 --> 00:46:18.262
just to express my agreement
with what James was saying.
00:46:18.263 --> 00:46:22.467
So essentially, it seems
the most dangerous thing is the hierarchy,
00:46:22.468 --> 00:46:23.910
not the hierarchy, but generally
00:46:23.911 --> 00:46:28.022
the semantics of the subclass relations
seen in Wikidata, right.
00:46:28.022 --> 00:46:32.561
So I've been studying languages recently,
just for the purposes of this conference,
00:46:32.562 --> 00:46:35.257
and for example, you find plenty of cases
00:46:35.257 --> 00:46:39.463
where a language is a part of
and subclass of the same thing, OK.
00:46:39.463 --> 00:46:43.577
So you know, you can say
we have a flexible ontology.
00:46:43.577 --> 00:46:46.256
Wikidata gives you freedom
to express that, sometimes.
00:46:46.256 --> 00:46:47.257
Because, for example,
00:46:47.258 --> 00:46:50.721
that ontology of languages
is also politically complicated, right?
00:46:50.722 --> 00:46:55.038
It is even good to be in a position
to express a level of uncertainty.
00:46:55.038 --> 00:46:57.983
But imagine anyone who wants
to do machine reading from that.
00:46:57.984 --> 00:46:59.468
So that's really problematic.
00:46:59.468 --> 00:47:00.468
And then again,
00:47:00.469 --> 00:47:03.686
I don't think that ontology
was ever imported from somewhere,
00:47:03.687 --> 00:47:05.490
that's something which is originally ours.
00:47:05.491 --> 00:47:08.321
It was harvested from Wikipedia
at the very beginning, I would say.
00:47:08.322 --> 00:47:11.324
So I wonder...
this Shape Expressions thing is great,
00:47:11.325 --> 00:47:15.575
and also validating and fixing,
if you like, the Wikidata ontology
00:47:15.576 --> 00:47:18.191
by external resources, beautiful idea.
00:47:19.026 --> 00:47:20.026
In the end,
00:47:20.027 --> 00:47:25.440
will we end up reflecting
the external ontologies in Wikidata?
00:47:25.441 --> 00:47:28.651
And also, what do we do with
the core part of our ontology
00:47:28.652 --> 00:47:30.642
which was never harvested
from external resources,
00:47:30.643 --> 00:47:31.978
how do we go and fix that?
00:47:31.979 --> 00:47:35.276
And I really think that
that will be a problem on its own.
00:47:35.277 --> 00:47:39.010
We will have to focus on that
independently of the idea
00:47:39.010 --> 00:47:41.046
of validating the ontology
with something external.
00:47:49.353 --> 00:47:53.379
(man6) OK, and constraints
and shapes are very impressive
00:47:53.380 --> 00:47:54.495
what we can do with it,
00:47:55.205 --> 00:47:58.481
but the main point is not
being really made clear--
00:47:58.482 --> 00:48:03.229
it's because now we can make more explicit
what we expect from the data.
00:48:03.229 --> 00:48:06.893
Before, everyone had to write
their own tools and scripts
00:48:06.894 --> 00:48:10.601
and so it's more visible
and we can discuss about it.
00:48:10.602 --> 00:48:13.641
But because it's not about
what's wrong or right,
00:48:13.642 --> 00:48:15.870
it's about an expectation,
00:48:15.870 --> 00:48:18.105
and you will have different
expectations and discussions
00:48:18.106 --> 00:48:20.737
about how we want
to model things in Wikidata,
00:48:21.246 --> 00:48:23.095
and this...
00:48:23.096 --> 00:48:26.280
The current state is just
one step in the direction
00:48:26.281 --> 00:48:28.041
because now you need
00:48:28.042 --> 00:48:31.041
a lot of technical expertise
to get into this,
00:48:31.042 --> 00:48:35.721
and we need better ways
to visualize these constraints,
00:48:35.722 --> 00:48:39.995
to transform them maybe into natural language
so people can better understand,
00:48:40.939 --> 00:48:43.768
but it's less about what's wrong or right.
00:48:44.925 --> 00:48:45.925
(Lydia) Yeah.
00:48:50.986 --> 00:48:53.893
(man7) So for quality issues,
I just want to echo it like...
00:48:53.894 --> 00:48:57.010
I've definitely found a lot of the issues
I've encountered have been
00:48:58.838 --> 00:49:02.330
differences in opinion
between instance of versus subclass.
00:49:02.331 --> 00:49:05.963
I would say errors in those situations
00:49:05.963 --> 00:49:11.521
and trying to find those
has been a very time-consuming process.
00:49:11.522 --> 00:49:14.840
What I've found is like:
"Oh, if I find very high-impression items
00:49:14.840 --> 00:49:16.051
that are something...
00:49:16.052 --> 00:49:21.628
and then use all the subclass instances
to find all derived statements of this,"
00:49:21.628 --> 00:49:26.215
this is a very useful way
of looking for these errors.
00:49:26.215 --> 00:49:28.067
But I was curious if Shape Expressions,
00:49:29.841 --> 00:49:31.582
if there is...
00:49:31.583 --> 00:49:36.934
If this can be used as a tool
to help resolve those issues but, yeah...
00:49:40.514 --> 00:49:42.555
(man8) If it has a structural footprint...
00:49:45.910 --> 00:49:49.310
If it has a structural footprint
that you can...that's sort of falsifiable,
00:49:49.310 --> 00:49:51.191
you can look at that
and say well, that's wrong,
00:49:51.192 --> 00:49:52.670
then yeah, you can do that.
00:49:52.671 --> 00:49:56.921
But if it's just sort of
trying to map it to real-world objects,
00:49:56.922 --> 00:49:59.082
then you're just going to need
lots and lots of brains.
00:50:05.768 --> 00:50:08.631
(man9) Hi, Pablo Mendes
from Apple Siri Knowledge.
00:50:09.154 --> 00:50:12.770
We're here to find out how to help
the project and the community
00:50:12.770 --> 00:50:15.645
but Cristina made the mistake
of asking what we want.
00:50:16.471 --> 00:50:20.052
(laughing) So I think
one thing I'd like to see
00:50:20.958 --> 00:50:23.521
is a lot around verifiability
00:50:23.522 --> 00:50:26.372
which is one of the core tenets
of the project in the community,
00:50:27.062 --> 00:50:28.590
and trustworthiness.
00:50:28.590 --> 00:50:32.412
Not every statement is the same,
some of them are heavily disputed,
00:50:32.413 --> 00:50:33.653
some of them are easy to guess,
00:50:33.654 --> 00:50:35.541
like somebody's
date of birth can be verified,
00:50:36.071 --> 00:50:39.082
as you saw today in the Keynote,
gender issues are a lot more complicated.
00:50:40.205 --> 00:50:42.130
Can you discuss a little bit what you know
00:50:42.131 --> 00:50:47.271
in this area of data quality around
trustworthiness and verifiability?
00:50:55.442 --> 00:50:58.138
If there isn't a lot,
I'd love to see a lot more. (laughs)
00:51:00.646 --> 00:51:01.646
(Lydia) Yeah.
00:51:03.314 --> 00:51:06.548
Apparently, we don't have
a lot to say on that. (laughs)
00:51:08.024 --> 00:51:12.299
(Andra) I think we can do a lot,
but I had a discussion with you yesterday.
00:51:12.300 --> 00:51:15.774
My favorite example I learned yesterday
that's already deprecated
00:51:15.774 --> 00:51:20.281
is if you go to the Q2, which is earth,
00:51:20.282 --> 00:51:23.343
there is a statement
that claims that the earth is flat.
00:51:24.183 --> 00:51:26.055
And I love that example
00:51:26.056 --> 00:51:28.391
because there is a community
out there that claims that
00:51:28.392 --> 00:51:30.417
and they have verifiable resources.
00:51:30.418 --> 00:51:32.254
So I think it's a genuine case,
00:51:32.255 --> 00:51:34.641
it shouldn't be deprecated,
it should be in Wikidata.
00:51:34.642 --> 00:51:40.385
And I think Shape Expressions
can be really instrumental there,
00:51:40.386 --> 00:51:41.832
because what you can say,
00:51:41.833 --> 00:51:44.856
OK, I'm really interested
in this use case,
00:51:44.857 --> 00:51:47.129
or this is a use case where you disagree,
00:51:47.130 --> 00:51:51.059
but there can also be a use case
where you say OK, I'm interested.
00:51:51.059 --> 00:51:53.449
So there is this example you say,
I have glucose.
00:51:53.449 --> 00:51:55.841
And glucose when you're a biologist,
00:51:55.842 --> 00:52:00.176
you don't care for the chemical
constraints of the glucose molecule,
00:52:00.177 --> 00:52:03.201
you just... everything glucose
is the same.
00:52:03.202 --> 00:52:05.973
But if you're a chemist,
you cringe when you hear that,
00:52:05.973 --> 00:52:08.191
you have 200 something...
00:52:08.191 --> 00:52:10.443
So then you can have
multiple Shape Expressions,
00:52:10.443 --> 00:52:12.721
OK, I'm coming in with...
I have a chemist's view,
00:52:12.722 --> 00:52:13.887
I'm applying that.
00:52:13.887 --> 00:52:16.691
And then you say
I'm from a biological use case,
00:52:16.691 --> 00:52:18.524
I'm applying that Shape Expression.
00:52:18.524 --> 00:52:20.358
And then when you want to collaborate,
00:52:20.358 --> 00:52:22.784
yes, well you should talk
to Eric about ShEx maps.
00:52:23.910 --> 00:52:28.873
And so...
but this journey is just starting.
00:52:28.873 --> 00:52:32.238
But personally, I believe
that it's quite instrumental in that area.
00:52:34.292 --> 00:52:35.535
(Lydia) OK. Over there.
00:52:37.949 --> 00:52:39.168
(laughs)
00:52:40.597 --> 00:52:46.035
(woman2) I had several ideas
from some points in the discussions,
00:52:46.035 --> 00:52:50.902
so I will try not to lose...
I had three ideas so...
00:52:52.394 --> 00:52:55.201
Based on what James said a while ago,
00:52:55.202 --> 00:52:59.001
we have had a very, very big problem
on Wikidata since the beginning
00:52:59.002 --> 00:53:01.574
with the upper ontology.
00:53:02.363 --> 00:53:05.339
We talked about that
two years ago at WikidataCon,
00:53:05.340 --> 00:53:07.432
and we talked about that at Wikimania.
00:53:07.432 --> 00:53:09.818
Well, whenever we have a Wikidata meeting
00:53:09.818 --> 00:53:11.656
we are talking about that,
00:53:11.656 --> 00:53:15.782
because it's a very big problem
at a very, very high level
00:53:15.783 --> 00:53:23.118
what an entity is, what a work is,
what a genre is, art,
00:53:23.118 --> 00:53:25.461
these are really the biggest concepts.
00:53:26.195 --> 00:53:33.117
And that's actually
a very weak point of the global ontology
00:53:33.118 --> 00:53:37.453
because people try to clean up regularly
00:53:38.017 --> 00:53:41.047
and break everything down the line,
00:53:42.516 --> 00:53:48.649
because yes, I think some of you
may remember the guy who in good faith
00:53:48.649 --> 00:53:51.785
broke absolutely all cities in the world.
00:53:51.785 --> 00:53:57.537
They were not geographical items anymore,
so constraint violations everywhere.
00:53:58.720 --> 00:54:00.278
And it was in good faith
00:54:00.278 --> 00:54:03.623
because he was really
correcting a mistake in an item,
00:54:04.170 --> 00:54:05.732
but everything broke down.
00:54:06.349 --> 00:54:09.373
And I'm not sure how we can solve that
00:54:10.216 --> 00:54:15.709
because there is actually
no external institution we could just copy
00:54:15.710 --> 00:54:18.490
because everyone is working on...
00:54:19.154 --> 00:54:22.041
Well, if I am a performing arts database,
00:54:22.042 --> 00:54:24.601
I will just go
at the performing arts level,
00:54:24.601 --> 00:54:29.361
and I won't go to the philosophical concept
of what an entity is,
00:54:29.362 --> 00:54:31.201
and that's actually...
00:54:31.202 --> 00:54:34.561
I don't know any database
which is working at this level,
00:54:34.562 --> 00:54:36.827
but that's the weakest point of Wikidata.
00:54:37.936 --> 00:54:40.812
And probably,
when we are talking about data quality,
00:54:40.812 --> 00:54:44.034
that's actually a big part of it, so...
00:54:44.034 --> 00:54:48.569
And I think it's the same
we have stated in...
00:54:48.569 --> 00:54:50.452
Oh, I am sorry, I am changing the subject,
00:54:51.401 --> 00:54:55.774
but we have stated
in different sessions about quality,
00:54:55.774 --> 00:54:59.398
which is that some of us
are actually doing a good modeling job,
00:54:59.399 --> 00:55:01.240
are doing ShEx,
are doing things like that.
00:55:01.967 --> 00:55:07.655
People don't see it on Wikidata,
they don't see the ShEx,
00:55:07.655 --> 00:55:10.392
they don't see the WikiProject
on the discussion page,
00:55:10.393 --> 00:55:11.393
and sometimes,
00:55:11.394 --> 00:55:14.958
they don't even see
the talk pages of properties,
00:55:14.958 --> 00:55:19.628
which explicitly state,
"This property is used for that."
00:55:19.628 --> 00:55:23.887
Like last week,
I added constraints to a property.
00:55:23.888 --> 00:55:26.324
The constraint was explicitly written
00:55:26.325 --> 00:55:28.690
in the discussion
of the creation of the property.
00:55:28.690 --> 00:55:34.548
I just created the technical part
of adding the constraint, and someone:
00:55:34.548 --> 00:55:37.182
"What! You broke down all my edits!"
00:55:37.183 --> 00:55:41.542
And he was using the property
wrongly for the last two years.
00:55:41.542 --> 00:55:46.868
And the property was actually very clear,
but there were no warnings or anything,
00:55:46.869 --> 00:55:49.922
and so, it's the same thing we said
at the Pink Pony at Wikimania,
00:55:49.922 --> 00:55:54.719
to make WikiProjects more visible
or to make ShEx more visible, but...
00:55:54.719 --> 00:55:56.917
And that's what Cristina said.
00:55:56.917 --> 00:56:02.368
We have a visibility problem
of what the existing solutions are.
00:56:02.368 --> 00:56:04.242
And at this session,
00:56:04.242 --> 00:56:06.862
we are all talking about
how to create more ShEx,
00:56:06.863 --> 00:56:10.727
or to facilitate the jobs
of the people who are doing the cleanup.
00:56:11.605 --> 00:56:15.835
But we have been cleaning up
since the first day of Wikidata,
00:56:15.836 --> 00:56:20.921
and globally, we are losing,
and we are losing because, well,
00:56:20.922 --> 00:56:22.960
if I know names are complicated
00:56:22.961 --> 00:56:26.162
but I am the only one
doing the cleaning up job,
00:56:26.662 --> 00:56:29.671
the guy who added
Latin script names
00:56:29.672 --> 00:56:31.584
to all Chinese researchers,
00:56:32.088 --> 00:56:35.616
I will take months to clean that
and I can't do it alone,
00:56:35.616 --> 00:56:38.777
and he did one massive batch.
00:56:38.777 --> 00:56:40.241
So we really need...
00:56:40.242 --> 00:56:44.158
we have a visibility problem
more than a tool problem, I think,
00:56:44.158 --> 00:56:45.733
because we have many tools.
00:56:45.733 --> 00:56:50.255
(Lydia) Right, so unfortunately,
I've been shown a sign, (laughs),
00:56:50.256 --> 00:56:52.121
so we need to wrap this up.
00:56:52.122 --> 00:56:53.563
Thank you so much for your comments,
00:56:53.563 --> 00:56:56.611
I hope you will continue discussing
during the rest of the day,
00:56:56.611 --> 00:56:57.840
and thanks for your input.
00:56:58.359 --> 00:56:59.944
(applause)