1
00:00:05,945 --> 00:00:09,476
Hello everyone, welcome to the Data Quality panel.
2
00:00:10,288 --> 00:00:13,671
Data quality matters because
more and more people out there
3
00:00:13,672 --> 00:00:19,289
rely on our data being in good shape,
so we're going to talk about data quality,
4
00:00:20,029 --> 00:00:26,000
and there will be four speakers
who will give short introductions
5
00:00:26,000 --> 00:00:29,539
on topics related to data quality
and then we will have a Q and A.
6
00:00:30,130 --> 00:00:32,234
And the first one is Lucas.
7
00:00:34,385 --> 00:00:35,385
Thank you.
8
00:00:35,901 --> 00:00:39,899
Hi, I'm Lucas, and I'm going
to start with an overview
9
00:00:39,899 --> 00:00:43,806
of data quality tools
that we already have on Wikidata
10
00:00:43,807 --> 00:00:46,109
and also some things
that are coming up soon.
11
00:00:46,932 --> 00:00:50,623
And I've grouped them
into some general themes
12
00:00:50,623 --> 00:00:53,761
of making errors more visible,
making problems actionable,
13
00:00:53,762 --> 00:00:56,322
getting more eyes on the data
so that people notice the problems,
14
00:00:56,945 --> 00:01:02,616
fixing some common sources of errors,
maintaining the quality of the existing data
15
00:01:02,616 --> 00:01:03,966
and also human curation.
16
00:01:05,063 --> 00:01:09,874
And the ones that are currently available
start with property constraints.
17
00:01:10,388 --> 00:01:12,421
So you've probably seen this
if you're on Wikidata.
18
00:01:12,422 --> 00:01:14,029
You can sometimes get these icons
19
00:01:14,530 --> 00:01:17,241
which check
the internal consistency of the data.
20
00:01:17,242 --> 00:01:20,800
For example,
if one event follows the other,
21
00:01:20,801 --> 00:01:23,760
then the other event should
also be followed by this one,
22
00:01:23,761 --> 00:01:27,161
which on the WikidataCon item
was apparently missing.
23
00:01:27,162 --> 00:01:29,360
I'm not sure,
this feature is a few days old.
24
00:01:30,040 --> 00:01:34,681
And there's also,
if this is too limited or simple for you,
25
00:01:34,682 --> 00:01:38,080
you can write any checks you want
using the Query Service
26
00:01:38,081 --> 00:01:39,842
which is useful for
lots of things of course,
27
00:01:39,843 --> 00:01:44,543
but you can also use it
for finding errors.
28
00:01:44,544 --> 00:01:46,974
Like if you've noticed
one occurrence of a mistake,
29
00:01:46,975 --> 00:01:49,709
then you can check
if there are other places
30
00:01:49,710 --> 00:01:51,958
where people have made
a very similar error
31
00:01:51,958 --> 00:01:53,438
and find that with the Query Service.
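For example, a sketch of such an error-hunting query (illustrative, not a query from the talk; it assumes the standard Query Service prefixes):

```sparql
# After spotting one item whose date of death (P570) precedes its
# date of birth (P569), find every other place where people have
# made the same mistake.
SELECT ?person ?birth ?death WHERE {
  ?person wdt:P569 ?birth ;
          wdt:P570 ?death .
  FILTER(?death < ?birth)
}
LIMIT 100
```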
32
00:01:53,439 --> 00:01:54,559
You can also combine the two
33
00:01:54,560 --> 00:01:57,874
and search for constraint violations
in the Query Service,
34
00:01:57,875 --> 00:02:01,240
for example,
only the violations in some area
35
00:02:01,241 --> 00:02:03,762
or WikiProject that's relevant to you,
36
00:02:03,762 --> 00:02:06,828
although the results are currently
not complete, sadly.
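A hedged sketch of combining the two: the Query Service exposes constraint check results on statement nodes, so a query can be narrowed to one area. The predicate name and the chosen scope (paintings) are assumptions for illustration, and, as noted, results were incomplete at the time:

```sparql
# Find statements on paintings (Q3305213) that violate some
# property constraint, via the constraint-check data in the
# Query Service.
SELECT ?item ?statement ?constraint WHERE {
  ?item wdt:P31 wd:Q3305213 ;   # restrict to one area of interest
        ?prop ?statement .
  ?statement wikibase:hasViolationForConstraint ?constraint .
}
LIMIT 100
```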
37
00:02:08,422 --> 00:02:09,877
There is revision scoring.
38
00:02:10,690 --> 00:02:12,666
That's... I think this is
from the recent changes
39
00:02:12,667 --> 00:02:16,217
you can also get it on your watchlist:
an automatic assessment
40
00:02:16,217 --> 00:02:20,249
of whether this edit is likely to be
in good faith or in bad faith
41
00:02:20,250 --> 00:02:22,312
and is it likely to be
damaging or not damaging,
42
00:02:22,313 --> 00:02:24,205
I think those are the two dimensions.
43
00:02:24,206 --> 00:02:25,686
So you can, if you want,
44
00:02:25,687 --> 00:02:29,898
focus on just looking through
the damaging but good faith edits.
45
00:02:29,899 --> 00:02:32,523
If you're feeling particularly
friendly and welcoming
46
00:02:32,524 --> 00:02:37,121
you can tell these editors,
"Thank you for your contribution,
47
00:02:37,122 --> 00:02:40,560
here's how you should have done it
but thank you, still."
48
00:02:40,561 --> 00:02:42,186
And if you're not feeling that way,
49
00:02:42,187 --> 00:02:44,452
you can go through
the bad faith, damaging edits,
50
00:02:44,453 --> 00:02:45,573
and revert the vandals.
51
00:02:47,544 --> 00:02:49,761
There's also, similar to that,
entity scoring.
52
00:02:49,762 --> 00:02:52,590
So instead of scoring an edit,
the change that it made,
53
00:02:52,591 --> 00:02:53,904
you score the whole revision,
54
00:02:53,904 --> 00:02:56,483
and I think that is
the same quality measure
55
00:02:56,483 --> 00:02:59,863
that Lydia mentions
at the beginning of the conference.
56
00:03:00,372 --> 00:03:04,569
There's a user script, up here,
that gives you a score of like one to five,
57
00:03:04,570 --> 00:03:08,176
I think it was, of what the quality
of the current item is.
58
00:03:10,043 --> 00:03:15,528
The primary sources tool is for
any database that you want to import,
59
00:03:15,528 --> 00:03:18,364
but that's not high enough quality
to directly add to Wikidata,
60
00:03:18,374 --> 00:03:20,335
so you add it
to the primary sources tool instead,
61
00:03:20,336 --> 00:03:22,956
and then humans can decide
62
00:03:22,956 --> 00:03:26,024
should they add
these individual statements or not.
63
00:03:28,595 --> 00:03:31,901
Showing coordinates as maps
is mainly a convenience feature
64
00:03:31,901 --> 00:03:33,588
but it's also useful for quality control.
65
00:03:33,588 --> 00:03:36,937
Like if you see this is supposed to be
the office of Wikimedia Germany
66
00:03:36,938 --> 00:03:39,400
and if the coordinates
are somewhere in the Indian Ocean,
67
00:03:39,401 --> 00:03:41,529
then you know that
something is not right there
68
00:03:41,530 --> 00:03:44,790
and you can see it much more easily
than if you just had the numbers.
69
00:03:46,382 --> 00:03:49,576
This is a gadget called
the relative completeness indicator
70
00:03:49,577 --> 00:03:52,480
which shows you this little icon here
71
00:03:53,007 --> 00:03:55,652
telling you how complete
it thinks this item is
72
00:03:55,652 --> 00:03:57,613
and also which properties
are most likely missing,
73
00:03:57,614 --> 00:03:59,769
which is really useful
if you're editing an item
74
00:03:59,769 --> 00:04:03,172
and you're in an area
that you're not very familiar with
75
00:04:03,172 --> 00:04:05,661
and you don't know what
the right properties to use are,
76
00:04:05,662 --> 00:04:08,230
then this is a very useful gadget to have.
77
00:04:09,604 --> 00:04:11,401
And we have Shape Expressions.
78
00:04:11,402 --> 00:04:15,624
I think Andra or Jose
are going to talk more about those
79
00:04:15,624 --> 00:04:19,757
but basically, a very powerful way
of comparing the data you have
80
00:04:19,758 --> 00:04:20,758
against the schema,
81
00:04:20,759 --> 00:04:22,680
like what statement should
certain entities have,
82
00:04:22,681 --> 00:04:25,677
what other entities should they link to
and what should those look like,
83
00:04:26,229 --> 00:04:29,374
and then you can find problems that way.
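A minimal ShEx sketch of the idea (illustrative, not an actual Wikidata entity schema): every conforming entity must be an instance of human, and may have at most one date of birth and one date of death.

```shex
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

start = @<human>

<human> {
  wdt:P31  [ wd:Q5 ] ;        # instance of: human, required
  wdt:P569 xsd:dateTime ? ;   # date of birth, optional
  wdt:P570 xsd:dateTime ?     # date of death, optional
}
```

Entities are then checked against the shape, and every mismatch is a potential data problem.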
84
00:04:30,366 --> 00:04:32,361
I think... No, there is still more.
85
00:04:32,362 --> 00:04:34,321
Integraality, or the property dashboard.
86
00:04:34,322 --> 00:04:36,773
It gives you a quick overview
of the data you already have.
87
00:04:36,774 --> 00:04:39,147
For example, this is from
the WikiProject Red Pandas,
88
00:04:39,657 --> 00:04:41,681
and you can see that
we have a sex or gender
89
00:04:41,682 --> 00:04:43,561
for almost all of the red pandas,
90
00:04:43,561 --> 00:04:46,854
the date of birth varies a lot
by which zoo they come from
91
00:04:46,854 --> 00:04:50,255
and we have almost
no dead pandas which is wonderful,
92
00:04:51,437 --> 00:04:52,600
because they're so cute.
93
00:04:53,699 --> 00:04:55,654
So this is also useful.
94
00:04:56,377 --> 00:04:59,185
There we go, OK,
now for the things that are coming up.
95
00:04:59,889 --> 00:05:03,784
Wikidata Bridge,
formerly known as client editing,
96
00:05:03,785 --> 00:05:07,076
so editing Wikidata
from Wikipedia infoboxes
97
00:05:07,675 --> 00:05:11,725
which will on the one hand
get more eyes on the data
98
00:05:11,725 --> 00:05:13,441
because more people can see the data there
99
00:05:13,441 --> 00:05:18,841
and it will hopefully encourage
more use of Wikidata in the Wikipedias
100
00:05:18,841 --> 00:05:20,920
and that means that more
people can notice
101
00:05:20,921 --> 00:05:23,389
if, for example some data is outdated
and needs to be updated
102
00:05:23,857 --> 00:05:27,000
rather than if they only saw it
on Wikidata itself.
103
00:05:28,630 --> 00:05:30,656
There is also tainted references.
104
00:05:30,657 --> 00:05:33,959
The idea here is that
if you edit a statement value,
105
00:05:34,683 --> 00:05:37,279
you might want to update
the references as well,
106
00:05:37,280 --> 00:05:39,373
unless it was just a typo or something.
107
00:05:39,897 --> 00:05:43,662
And tainted references
tells editors that,
108
00:05:43,663 --> 00:05:49,756
and also lets other editors
see which other edits were made
109
00:05:49,756 --> 00:05:52,471
that edited a statement value
and didn't update a reference
110
00:05:52,472 --> 00:05:56,766
so you can clean up after that
and decide should that be...
111
00:05:57,737 --> 00:05:59,566
Do you need to do anything more about that
112
00:05:59,566 --> 00:06:02,796
or is that actually fine and
you don't need to update the reference.
113
00:06:03,543 --> 00:06:09,336
That's related to signed statements
which is coming from a concern, I think,
114
00:06:09,336 --> 00:06:12,355
that some data providers have that like...
115
00:06:14,131 --> 00:06:17,231
There's a statement that's referenced
to UNESCO or something
116
00:06:17,232 --> 00:06:19,872
and then suddenly,
someone vandalizes the statement
117
00:06:19,873 --> 00:06:21,836
and they are worried
that it will look like
118
00:06:22,827 --> 00:06:26,992
this organization, like UNESCO,
still stated this vandalized value
119
00:06:26,993 --> 00:06:28,706
and so, with signed statements,
120
00:06:28,706 --> 00:06:31,488
they can cryptographically
sign this reference
121
00:06:31,488 --> 00:06:33,562
and that doesn't prevent any edits to it,
122
00:06:34,169 --> 00:06:37,744
but at least, if someone
vandalizes the statement
123
00:06:37,744 --> 00:06:40,255
or edits it in any way,
then the signature is no longer valid,
124
00:06:40,255 --> 00:06:43,401
and you can tell this is not exactly
what the organization said,
125
00:06:43,402 --> 00:06:47,064
and perhaps it's a good edit
and they should re-sign the new statement,
126
00:06:47,065 --> 00:06:49,851
but also perhaps it should be reverted.
127
00:06:51,203 --> 00:06:54,166
And also, this is going
to be very exciting, I think,
128
00:06:54,166 --> 00:06:56,846
Citoid is this amazing system
they have on Wikipedia
129
00:06:57,379 --> 00:07:01,340
where you can paste a URL,
or an identifier, or an ISBN
130
00:07:01,340 --> 00:07:04,759
or Wikidata ID or basically
anything into the Visual Editor,
131
00:07:05,260 --> 00:07:08,241
and it spits out a reference
that is nicely formatted
132
00:07:08,242 --> 00:07:11,049
and has all the data you want
and it's wonderful to use.
133
00:07:11,049 --> 00:07:14,337
And by comparison, on Wikidata,
if I want to add a reference
134
00:07:14,338 --> 00:07:18,801
I typically have to add a reference URL,
title, author name string,
135
00:07:18,802 --> 00:07:20,449
published in, publication date,
136
00:07:20,450 --> 00:07:25,141
retrieved date,
at least those, and that's annoying,
137
00:07:25,141 --> 00:07:29,261
and integrating Citoid into Wikibase
will hopefully help with that.
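The reference fields Lucas lists hang off a statement node in the Wikidata RDF model; the property IDs in the comments are added here for illustration (reference URL P854, title P1476, author name string P2093, published in P1433, publication date P577, retrieved P813) and are worth double-checking:

```sparql
# Find statements whose references carry at least a reference URL
# and a retrieved date -- the minimum set mentioned in the talk.
SELECT ?statement ?url ?retrieved WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?url ;        # reference URL
       pr:P813 ?retrieved .  # retrieved (date)
}
LIMIT 10
```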
138
00:07:30,245 --> 00:07:33,604
And I think
that's all the ones I had, yeah.
139
00:07:33,604 --> 00:07:36,400
So now, I'm going to pass to Cristina.
140
00:07:37,788 --> 00:07:42,339
(applause)
141
00:07:43,780 --> 00:07:45,471
Hi, I'm Cristina.
142
00:07:45,472 --> 00:07:47,672
I'm a research scientist
from the University of Zürich,
143
00:07:47,673 --> 00:07:51,417
and I'm also an active member
of the Swiss Community.
144
00:07:52,698 --> 00:07:57,901
When Claudia Müller-Birn
and I submitted this to the WikidataCon,
145
00:07:57,902 --> 00:08:00,410
what we wanted to do
is continue our discussion
146
00:08:00,411 --> 00:08:02,424
that we started
at the beginning of the year
147
00:08:02,424 --> 00:08:07,442
with a workshop on data quality
and also some sessions at Wikimania.
148
00:08:07,442 --> 00:08:10,535
So the goal of this talk
is basically to bring some thoughts
149
00:08:10,536 --> 00:08:14,432
that we have been collecting
from the community and ourselves
150
00:08:14,432 --> 00:08:16,560
and continue the discussion.
151
00:08:16,561 --> 00:08:20,065
So what we would like is to continue
interacting a lot with you.
152
00:08:21,557 --> 00:08:23,371
So what we think is very important
153
00:08:23,372 --> 00:08:27,580
is that we continuously ask
all types of users in the community
154
00:08:27,581 --> 00:08:32,240
about what they really need,
what problems they have with data quality,
155
00:08:32,240 --> 00:08:35,000
not only editors
but also the people who are coding,
156
00:08:35,000 --> 00:08:36,241
or consuming the data,
157
00:08:36,242 --> 00:08:39,494
and also researchers who are
actually using all the edit history
158
00:08:39,494 --> 00:08:40,800
to analyze what is happening.
159
00:08:42,367 --> 00:08:48,431
So we did a review of around 80 tools
that exist for Wikidata
160
00:08:48,431 --> 00:08:52,380
and we aligned them to the different
data quality dimensions.
161
00:08:52,380 --> 00:08:54,360
And what we saw was that actually,
162
00:08:54,361 --> 00:08:57,681
many of them were looking at
monitoring completeness,
163
00:08:57,682 --> 00:09:02,820
and some of them
are also enabling interlinking.
164
00:09:02,820 --> 00:09:08,442
But there is a big need for tools
that are looking into diversity,
165
00:09:08,443 --> 00:09:12,824
which is one of the things
that we actually can have in Wikidata,
166
00:09:12,824 --> 00:09:15,958
especially
this design principle of Wikidata
167
00:09:15,959 --> 00:09:17,901
where we can have plurality
168
00:09:17,902 --> 00:09:20,308
and different statements
with different values
169
00:09:21,034 --> 00:09:22,236
coming from different sources.
170
00:09:22,236 --> 00:09:24,921
Because it's a secondary source,
we don't really have tools
171
00:09:24,922 --> 00:09:27,750
that actually tell us how many
plural statements there are,
172
00:09:27,751 --> 00:09:30,889
and how many we can improve and how,
173
00:09:30,890 --> 00:09:32,833
and we also don't know really
174
00:09:32,833 --> 00:09:35,538
what are all the reasons
for plurality that we can have.
175
00:09:36,491 --> 00:09:39,201
So from these community meetings,
176
00:09:39,201 --> 00:09:43,084
what we discussed was the challenges
that still need attention.
177
00:09:43,084 --> 00:09:47,249
For example, that having
all these crowdsourcing communities
178
00:09:47,249 --> 00:09:49,613
is very good because different people
attack different parts
179
00:09:49,613 --> 00:09:51,833
of the data or the graph,
180
00:09:51,834 --> 00:09:54,615
and we also have
different background knowledge
181
00:09:54,616 --> 00:09:59,161
but actually, it's very difficult to align
everything in something homogeneous
182
00:09:59,162 --> 00:10:04,920
because different people are using
different properties in different ways
183
00:10:04,920 --> 00:10:08,401
and they are also expecting
different things from entity descriptions.
184
00:10:09,003 --> 00:10:12,721
People also said that
they also need more tools
185
00:10:12,722 --> 00:10:16,000
that give a better overview
of the global status of things.
186
00:10:16,000 --> 00:10:20,733
So what entities are missing
in terms of completeness,
187
00:10:20,733 --> 00:10:26,121
but also like what are people
working on right now most of the time,
188
00:10:26,121 --> 00:10:30,516
and they also mention many times
a tighter collaboration
189
00:10:30,517 --> 00:10:33,311
across not only languages
but the WikiProjects
190
00:10:33,311 --> 00:10:35,571
and the different Wikimedia platforms.
191
00:10:35,571 --> 00:10:38,859
And we published
all the transcribed comments
192
00:10:38,860 --> 00:10:42,959
from all these discussions
in those links here in the Etherpads
193
00:10:42,959 --> 00:10:46,162
and also in the wiki page of Wikimania.
194
00:10:46,162 --> 00:10:48,481
Some solutions that appeared actually
195
00:10:48,481 --> 00:10:53,001
were going into the direction
of sharing more the best practices
196
00:10:53,001 --> 00:10:55,762
that are being developed
in different WikiProjects,
197
00:10:55,762 --> 00:11:01,238
but also people want tools
that help organize work in teams
198
00:11:01,239 --> 00:11:03,845
or at least understanding
who is working on that,
199
00:11:03,845 --> 00:11:07,815
and they were also mentioning
that they want more showcases
200
00:11:07,816 --> 00:11:12,019
and more templates that help them
create things in a better way.
201
00:11:12,946 --> 00:11:15,161
And from the contact that we have
202
00:11:15,162 --> 00:11:18,721
with Open Governmental Data Organizations,
203
00:11:18,722 --> 00:11:20,068
and in particular,
204
00:11:20,068 --> 00:11:23,102
I am in contact with the canton
and the city of Zürich,
205
00:11:23,102 --> 00:11:26,207
they are very interested
in working with Wikidata
206
00:11:26,207 --> 00:11:29,896
because they want their data
to be accessible for everyone
207
00:11:29,897 --> 00:11:33,681
in the place where people go
and consult or access data.
208
00:11:33,682 --> 00:11:36,550
So for them, something that
would be really interesting
209
00:11:36,551 --> 00:11:38,600
is to have some kind of quality indicators
210
00:11:38,600 --> 00:11:41,082
both in the wiki,
which is already happening,
211
00:11:41,082 --> 00:11:42,801
but also in SPARQL results,
212
00:11:42,802 --> 00:11:46,066
to know whether they can trust
or not that data from the community.
213
00:11:46,067 --> 00:11:48,230
And then, they also want to know
214
00:11:48,230 --> 00:11:51,417
what parts of their own data sets
are useful for Wikidata
215
00:11:51,418 --> 00:11:56,040
and they would love to have a tool that
can help them assess that automatically.
216
00:11:56,041 --> 00:11:59,066
They also need
some kind of methodology or tool
217
00:11:59,067 --> 00:12:03,894
that helps them decide whether
they should import or link their data
218
00:12:03,894 --> 00:12:04,894
because in some cases,
219
00:12:04,895 --> 00:12:07,137
they also have their own
linked open data sets,
220
00:12:07,138 --> 00:12:09,746
so they don't know whether
to just ingest the data
221
00:12:09,747 --> 00:12:13,424
or to keep on creating links
from the data sets to Wikidata
222
00:12:13,425 --> 00:12:14,425
and the other way around.
223
00:12:14,950 --> 00:12:20,043
And they also want to know where
their websites are referenced in Wikidata.
224
00:12:20,044 --> 00:12:23,361
And when they run such a query
in the query service,
225
00:12:23,362 --> 00:12:24,848
they often get timeouts,
226
00:12:24,849 --> 00:12:28,181
so maybe we should
really create more tools
227
00:12:28,181 --> 00:12:32,240
that help them get these answers
to their questions.
228
00:12:33,148 --> 00:12:36,208
And, besides that,
229
00:12:36,208 --> 00:12:39,361
we wiki researchers also sometimes
230
00:12:39,362 --> 00:12:42,023
lack some information
in the edit summaries.
231
00:12:42,024 --> 00:12:44,953
So I remember that when
we were doing some work
232
00:12:44,954 --> 00:12:48,919
to understand
the different behavior of editors
233
00:12:48,919 --> 00:12:53,403
with tools or bots
or anonymous users and so on,
234
00:12:53,403 --> 00:12:56,154
we were really lacking, for example,
235
00:12:56,154 --> 00:13:01,112
a standard way of tracing
that tools were being used.
236
00:13:01,113 --> 00:13:03,154
And there are some tools
that are already doing that
237
00:13:03,155 --> 00:13:05,230
like PetScan and many others,
238
00:13:05,230 --> 00:13:07,720
but maybe we should in the community
239
00:13:07,721 --> 00:13:13,531
discuss more about how to record these
for fine-grained provenance.
240
00:13:14,169 --> 00:13:15,321
And further on,
241
00:13:15,322 --> 00:13:20,801
we think that we need to think
of more concrete data quality dimensions
242
00:13:20,802 --> 00:13:24,961
that are related to linked data
but not to all the types of data,
243
00:13:24,962 --> 00:13:30,721
so we worked on some measures
to actually assess the information gain
244
00:13:30,722 --> 00:13:33,881
enabled by the links,
and what we mean by that
245
00:13:33,882 --> 00:13:36,681
is that when we link
Wikidata to other data sets,
246
00:13:36,682 --> 00:13:38,201
we should also be thinking
247
00:13:38,202 --> 00:13:41,921
how much the entities are actually
gaining in the classification,
248
00:13:41,922 --> 00:13:45,601
also in the description
but also in the vocabularies they use.
249
00:13:45,602 --> 00:13:51,041
So just to give a very simple
example of what I mean with this
250
00:13:51,042 --> 00:13:54,269
is we can think of--
in this case, would be Wikidata
251
00:13:54,270 --> 00:13:57,771
or the external data set
that is linking to Wikidata,
252
00:13:57,772 --> 00:14:00,487
we have the entity for a person
that is called Natasha Noy,
253
00:14:00,487 --> 00:14:02,601
we have the affiliation and other things,
254
00:14:02,602 --> 00:14:05,239
and then we say OK,
we link to an external place,
255
00:14:05,240 --> 00:14:08,919
and that entity also has that name,
but we actually have the same value.
256
00:14:08,920 --> 00:14:12,889
So what would be better is that we link
to something that has a different name,
257
00:14:12,889 --> 00:14:16,881
that is still valid because this person
has two ways of writing the name,
258
00:14:16,882 --> 00:14:19,714
and also other information
that we don't have in Wikidata
259
00:14:19,715 --> 00:14:21,760
or that we don't have
in the other data set.
260
00:14:22,390 --> 00:14:24,652
But also, what is even better
261
00:14:24,653 --> 00:14:27,770
is that we are actually
looking in the target data set
262
00:14:27,770 --> 00:14:31,392
that they also have new ways
of classifying the information.
263
00:14:31,393 --> 00:14:35,354
So not only is this a person,
but in the other data set,
264
00:14:35,355 --> 00:14:39,525
they also say it's a female
or anything else that they classify with.
265
00:14:39,526 --> 00:14:43,401
And if in the other data set,
they are using many other vocabularies
266
00:14:43,402 --> 00:14:46,588
that is also helping in their whole
information retrieval thing.
267
00:14:47,371 --> 00:14:51,233
So with that, I also would like to say
268
00:14:51,234 --> 00:14:55,809
that we think that we can
showcase federated queries better
269
00:14:55,810 --> 00:15:00,448
because when we look at the query log
provided by Malyshev et al.,
270
00:15:01,285 --> 00:15:04,301
we see actually that
from the organic queries,
271
00:15:04,302 --> 00:15:06,921
we have only very few federated queries.
272
00:15:06,922 --> 00:15:12,801
And actually, federation is one
of the key advantages of having linked data,
273
00:15:12,802 --> 00:15:16,903
so maybe the community
or the people using Wikidata
274
00:15:16,903 --> 00:15:18,898
also need more examples on this.
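A hedged sketch of what a federated query looks like: the SERVICE clause sends part of the query to an external SPARQL endpoint. The endpoint URL and the owl:sameAs linking pattern here are illustrative assumptions, not a recorded example from the logs:

```sparql
# Combine Wikidata data with matching entities from an external
# linked-data endpoint in a single query.
SELECT ?item ?external WHERE {
  ?item wdt:P31 wd:Q5 .                 # humans, on Wikidata's side
  SERVICE <https://example.org/sparql> {
    ?external owl:sameAs ?item .        # links back to Wikidata
  }
}
LIMIT 10
```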
275
00:15:18,898 --> 00:15:22,666
And if we look at the list
of endpoints that are being used,
276
00:15:22,667 --> 00:15:25,401
this is not a complete list
and we have many more.
277
00:15:25,402 --> 00:15:30,479
Of course, this data was analyzed
from queries until March 2018,
278
00:15:30,480 --> 00:15:34,807
but we should look into the list
of federated endpoints that we have
279
00:15:34,808 --> 00:15:37,048
and see whether
we are really using them or not.
280
00:15:37,813 --> 00:15:40,441
So two questions that
I have for the audience
281
00:15:40,442 --> 00:15:43,001
that maybe we can use
afterwards for the discussion are:
282
00:15:43,001 --> 00:15:46,001
what data quality problems
should be addressed in your opinion,
283
00:15:46,002 --> 00:15:47,412
because of the needs that you have,
284
00:15:47,412 --> 00:15:50,401
but also, where do you need
more automation
285
00:15:50,402 --> 00:15:52,943
to help you with editing or patrolling.
286
00:15:53,866 --> 00:15:55,146
That's all, thank you very much.
287
00:15:55,779 --> 00:15:57,527
(applause)
288
00:16:06,030 --> 00:16:08,595
(Jose Emilio Labra) OK,
so what I'm going to talk about
289
00:16:08,595 --> 00:16:14,715
is some tools that we were developing
related with Shape Expressions.
290
00:16:15,536 --> 00:16:19,371
So this is what I want to talk...
I am Jose Emilio Labra,
291
00:16:19,371 --> 00:16:23,215
but this has... all these tools
have been done by different people,
292
00:16:23,920 --> 00:16:28,480
mainly related with W3C ShEx,
Shape Expressions Community Group.
293
00:16:28,481 --> 00:16:29,481
ShEx Community Group.
294
00:16:30,144 --> 00:16:36,081
So the first tool that I want to mention
is RDFShape, this is a general tool,
295
00:16:36,082 --> 00:16:40,681
because Shape Expressions
is not only for Wikidata,
296
00:16:40,682 --> 00:16:44,168
Shape Expressions is a language
to validate RDF in general.
297
00:16:44,168 --> 00:16:47,568
So this tool was developed mainly by me
298
00:16:47,568 --> 00:16:50,880
and it's a tool
to validate RDF in general.
299
00:16:50,881 --> 00:16:55,139
So if you want to learn about RDF
or you want to validate RDF
300
00:16:55,140 --> 00:16:58,621
or SPARQL endpoints not only in Wikidata,
301
00:16:58,622 --> 00:17:00,891
my advice is that you can use this tool.
302
00:17:00,891 --> 00:17:03,255
Also for teaching.
303
00:17:03,255 --> 00:17:05,640
I am a teacher at the university
304
00:17:05,641 --> 00:17:09,151
and I use it in my semantic web course
to teach RDF.
305
00:17:09,161 --> 00:17:12,121
So if you want to learn RDF,
I think it's a good tool.
306
00:17:13,033 --> 00:17:17,598
For example, this is just a visualization
of an RDF graph with the tool.
307
00:17:18,587 --> 00:17:22,643
But before coming here, in the last month,
308
00:17:22,643 --> 00:17:28,441
I started a fork of RDFShape specifically
for Wikidata, because I thought...
309
00:17:28,443 --> 00:17:33,082
It's called WikiShape, and yesterday,
I presented it as a present for Wikidata.
310
00:17:33,082 --> 00:17:34,441
So what I took is...
311
00:17:34,442 --> 00:17:39,898
What I did is to remove all the stuff
that was not related with Wikidata
312
00:17:39,898 --> 00:17:44,801
and to put several things, hard-coded,
for example, the Wikidata SPARQL endpoint,
313
00:17:44,802 --> 00:17:49,041
but now, someone asked me
if I could do it also for Wikibase.
314
00:17:49,042 --> 00:17:52,000
And it is very easy
to do it for Wikibase also.
315
00:17:52,760 --> 00:17:56,280
So this tool, WikiShape, is quite new.
316
00:17:57,015 --> 00:17:59,843
I think it works, most of the features,
317
00:17:59,844 --> 00:18:02,468
but there are some features
that maybe don't work,
318
00:18:02,469 --> 00:18:06,281
and if you try it and you want
to improve it, please tell me.
319
00:18:06,281 --> 00:18:12,680
So this is [inaudible] captures,
but I think I can even try, so let's try.
320
00:18:15,385 --> 00:18:16,945
So let's see if it works.
321
00:18:16,953 --> 00:18:20,070
First, I have to go out of the...
322
00:18:22,453 --> 00:18:23,453
Here.
323
00:18:24,226 --> 00:18:28,324
Alright, yeah. So this is the tool here.
324
00:18:28,324 --> 00:18:29,844
Things that you can do with the tool,
325
00:18:29,845 --> 00:18:35,275
for example, is that you can
check schemas, entity schemas.
326
00:18:35,276 --> 00:18:38,611
You know that there is
a new namespace which is "E whatever,"
327
00:18:38,612 --> 00:18:44,805
so here, if you start,
for example, writing "human"...
328
00:18:44,806 --> 00:18:48,812
As you are writing,
the autocomplete allows you to check,
329
00:18:48,812 --> 00:18:52,001
for example,
this is the Shape Expressions of a human,
330
00:18:52,790 --> 00:18:55,937
and this is the Shape Expressions here.
331
00:18:55,938 --> 00:18:59,841
And as you can see,
this editor has syntax highlighting,
332
00:18:59,842 --> 00:19:04,559
this is... well,
maybe it's very small, the screen.
333
00:19:05,676 --> 00:19:07,590
I can try to do it bigger.
334
00:19:09,194 --> 00:19:10,973
Maybe you see it better now.
335
00:19:10,973 --> 00:19:14,241
So... and this is the editor
with syntax highlighting and also has...
336
00:19:14,241 --> 00:19:17,851
I mean, this editor
comes from the same source code
337
00:19:17,851 --> 00:19:19,641
as the Wikidata query service.
338
00:19:19,642 --> 00:19:23,960
So for example,
if you hover with the mouse here,
339
00:19:23,961 --> 00:19:27,961
it shows you the labels
of the different properties.
340
00:19:27,962 --> 00:19:31,298
So I think it's very helpful because now,
341
00:19:32,588 --> 00:19:38,601
the entity schema editor that is
in Wikidata is just a plain text area,
342
00:19:38,602 --> 00:19:42,493
and I think this editor is much better
because it has autocomplete
343
00:19:42,494 --> 00:19:43,743
and it also has...
344
00:19:43,744 --> 00:19:48,241
I mean, if you, for example,
wanted to add a constraint,
345
00:19:48,241 --> 00:19:51,570
you say "wdt:"
346
00:19:51,570 --> 00:19:56,884
You start writing "author"
and then you click Ctrl+Space
347
00:19:56,884 --> 00:19:58,922
and it suggests the different things.
348
00:19:58,922 --> 00:20:02,388
So this is similar
to the Wikidata query service
349
00:20:02,389 --> 00:20:06,445
but specifically for Shape Expressions
350
00:20:06,445 --> 00:20:11,975
because my feeling is that
creating Shape Expressions
351
00:20:11,976 --> 00:20:15,841
is not more difficult
than writing SPARQL queries.
352
00:20:15,842 --> 00:20:21,255
So some people think
that it's at the same level.
353
00:20:22,278 --> 00:20:26,296
It's probably easier, I think,
because Shape Expressions was,
354
00:20:26,296 --> 00:20:31,241
when we designed it,
we were designing it to be easier to work with.
355
00:20:31,242 --> 00:20:35,001
OK, so this is one of the first things,
that you have this editor
356
00:20:35,001 --> 00:20:36,620
for Shape Expressions.
357
00:20:37,371 --> 00:20:41,467
And then you also have the possibility,
for example, to visualize.
358
00:20:41,468 --> 00:20:44,801
If you have a Shape Expression,
use for example...
359
00:20:44,802 --> 00:20:49,386
I think, "written work" is
a nice Shape Expression
360
00:20:49,386 --> 00:20:53,300
because it has some relationships
between different things.
361
00:20:54,823 --> 00:20:58,160
And this is the UML visualization
of written work.
362
00:20:58,161 --> 00:21:02,090
In a UML, this is easy to see
the different properties.
363
00:21:02,790 --> 00:21:06,794
When you do this, I realized
when I tried with several people,
364
00:21:06,795 --> 00:21:09,216
they find some mistakes
in their Shape Expressions
365
00:21:09,217 --> 00:21:12,988
because it's easy to detect which are
the missing properties or whatever.
366
00:21:13,588 --> 00:21:15,771
Then another possibility here
367
00:21:15,772 --> 00:21:19,520
is that you can also validate,
I think I have it here, the validation.
368
00:21:20,496 --> 00:21:25,285
I think I had it in some tab,
maybe I closed it.
369
00:21:26,267 --> 00:21:30,988
OK, but you can, for example,
you can click here, Validate entities.
370
00:21:32,308 --> 00:21:34,232
You, for example,
371
00:21:35,404 --> 00:21:41,921
"q42" with "e42" which is author.
372
00:21:42,818 --> 00:21:46,180
With "human,"
I think we can do it with "human."
373
00:21:49,050 --> 00:21:50,050
And then it's...
374
00:21:50,688 --> 00:21:56,365
And it's taking a little while to do it
because this is doing the SPARQL queries
375
00:21:56,365 --> 00:21:59,134
and now, for example,
it's failing by the network but...
376
00:21:59,657 --> 00:22:01,580
So you can try it.
377
00:22:02,759 --> 00:22:07,026
OK, so let's go continue
with the presentation, with other tools.
378
00:22:07,026 --> 00:22:12,353
So my advice is to try it,
and if you have any feedback, let me know.
379
00:22:13,133 --> 00:22:15,540
So to continue with the presentation...
380
00:22:18,923 --> 00:22:20,233
So this is WikiShape.
381
00:22:23,800 --> 00:22:26,509
Then, I already said this,
382
00:22:27,681 --> 00:22:34,157
the Shape Expressions Editor
is an independent project in GitHub.
383
00:22:35,605 --> 00:22:37,472
You can use it in your own project.
384
00:22:37,472 --> 00:22:41,036
If you want to do
a Shape Expressions tool,
385
00:22:41,036 --> 00:22:45,635
you can just embed it
in any other project,
386
00:22:45,636 --> 00:22:48,235
so this is in GitHub and you can use it.
387
00:22:48,868 --> 00:22:51,970
Then the same author,
who is one of my students,
388
00:22:52,684 --> 00:22:55,704
he also created
an editor for Shape Expressions,
389
00:22:55,704 --> 00:22:57,799
also inspired by
the Wikidata query service
390
00:22:57,800 --> 00:23:00,681
where, in a column,
391
00:23:00,682 --> 00:23:05,103
you have this more visual editor
of SPARQL queries
392
00:23:05,104 --> 00:23:07,135
where you can put these kinds of things.
393
00:23:07,136 --> 00:23:09,123
So this is a screen capture.
394
00:23:09,123 --> 00:23:12,662
You can see that
that's the Shape Expression in text
395
00:23:12,662 --> 00:23:17,822
but this is a form-based Shape Expression editor,
which would probably take a bit longer,
396
00:23:18,595 --> 00:23:23,400
where you can put the different rows
on the different fields.
397
00:23:23,401 --> 00:23:25,800
OK, then there is ShExEr.
398
00:23:26,879 --> 00:23:31,882
We have... it's done by one PhD student
at the University of Oviedo
399
00:23:31,883 --> 00:23:34,080
and he's here, so you can present ShExEr.
400
00:23:38,147 --> 00:23:40,024
(Danny) Hello, I am Danny Fernández,
401
00:23:40,025 --> 00:23:43,800
I am a PhD student at the University of Oviedo
working with Labra.
402
00:23:44,710 --> 00:23:47,725
Since we are running out of time,
let's make this quick,
403
00:23:47,726 --> 00:23:52,641
so let's not go for any actual demo,
but just show some screenshots.
404
00:23:52,642 --> 00:23:57,897
OK, so the usual way to work with
Shape Expressions or any shape language
405
00:23:57,897 --> 00:23:59,521
is that you have a domain expert
406
00:23:59,522 --> 00:24:02,313
that defines a priori
what the graph should look like
407
00:24:02,314 --> 00:24:03,555
defines some structures,
408
00:24:03,556 --> 00:24:06,983
and then you use these structures
to validate the actual data against it.
409
00:24:08,124 --> 00:24:11,641
This tool, which, like the ones
that Labra has been presenting,
410
00:24:11,642 --> 00:24:14,441
is a general-purpose tool
for any RDF source,
411
00:24:14,442 --> 00:24:17,375
is designed to work the other way around.
412
00:24:17,376 --> 00:24:18,758
You already have some data,
413
00:24:18,759 --> 00:24:23,165
you select what nodes
you want to get the shape about
414
00:24:23,165 --> 00:24:26,718
and then you automatically
extract or infer the shape.
415
00:24:26,719 --> 00:24:29,791
So even if this is a general purpose tool,
416
00:24:29,791 --> 00:24:34,063
what we did for this WikidataCon
is this fancy button
417
00:24:34,884 --> 00:24:37,081
that if you click it,
essentially what happens
418
00:24:37,081 --> 00:24:42,079
is that there are
so many configuration params
419
00:24:42,080 --> 00:24:46,251
and it configures it to work
against the Wikidata endpoint
420
00:24:46,251 --> 00:24:47,971
and it will end soon, sorry.
421
00:24:48,733 --> 00:24:52,883
So, once you press this button
what you get is essentially this.
422
00:24:52,884 --> 00:24:55,126
After having selected what kind of nodes,
423
00:24:55,127 --> 00:24:59,360
what kind of instances of a class,
whatever you are looking for,
424
00:24:59,361 --> 00:25:01,321
you get an automatic schema.
425
00:25:02,319 --> 00:25:07,111
All the constraints are sorted
by how many nodes actually conform to them,
426
00:25:07,112 --> 00:25:09,772
you can filter the less common ones, etc.
427
00:25:09,772 --> 00:25:12,126
So there is a poster downstairs
about this stuff
428
00:25:12,127 --> 00:25:14,595
and well,
I will be downstairs and upstairs
429
00:25:14,596 --> 00:25:16,454
and all over the place all day,
430
00:25:16,455 --> 00:25:19,081
so if you have any further
interest in this tool,
431
00:25:19,082 --> 00:25:21,476
just speak to me during the day.
432
00:25:21,477 --> 00:25:24,624
And now, I'll give back
the mic to Labra, thank you.
433
00:25:24,625 --> 00:25:29,265
(applause)
434
00:25:29,812 --> 00:25:32,578
(Jose) So let's continue
with the other tools.
435
00:25:32,579 --> 00:25:34,984
The other tool is the ShapeDesigner.
436
00:25:34,984 --> 00:25:37,241
Andra, do you want to do
the ShapeDesigner now
437
00:25:37,242 --> 00:25:39,287
or maybe later or in the workshop?
438
00:25:39,287 --> 00:25:40,603
There is a workshop...
439
00:25:40,603 --> 00:25:44,437
This afternoon, there is a workshop
specifically for Shape Expressions, and...
440
00:25:45,265 --> 00:25:47,939
The idea is that it was going to be
more hands-on,
441
00:25:47,940 --> 00:25:52,324
and if you want to practice
some ShEx, you can do it there.
442
00:25:52,875 --> 00:25:55,720
This tool is ShEx...
and there is Eric here,
443
00:25:55,721 --> 00:25:56,890
so you can present it.
444
00:25:57,969 --> 00:26:00,687
(Eric) So just super quick,
the thing that I want to say
445
00:26:00,687 --> 00:26:05,711
is that you've probably
already seen the ShEx interface
446
00:26:05,711 --> 00:26:07,601
that's tailored for Wikidata.
447
00:26:07,602 --> 00:26:12,930
That's effectively stripped down
and tailored specifically for Wikidata
448
00:26:12,930 --> 00:26:17,937
because the generic one has more features,
but I thought I'd mention it
449
00:26:17,937 --> 00:26:19,977
because one of those features
is particularly useful
450
00:26:19,978 --> 00:26:23,201
for debugging Wikidata schemas,
451
00:26:23,201 --> 00:26:29,224
which is if you go
and you select the slurp mode,
452
00:26:29,225 --> 00:26:31,444
what it does is it says
while I'm validating,
453
00:26:31,445 --> 00:26:34,694
I want to pull all the triples down
and that means
454
00:26:34,695 --> 00:26:36,274
if I get a bunch of failures,
455
00:26:36,275 --> 00:26:39,586
I can go through and start looking
at those failures and saying,
456
00:26:39,587 --> 00:26:41,800
OK, what are the triples
that are in here,
457
00:26:41,801 --> 00:26:44,120
sorry, I apologize,
the triples are down there,
458
00:26:44,121 --> 00:26:45,647
this is just a log of what went by.
459
00:26:46,327 --> 00:26:49,180
And then you can just sit there
and fiddle with it in real time
460
00:26:49,181 --> 00:26:51,033
like you play with something
and it changes.
461
00:26:51,033 --> 00:26:54,160
So it's a quicker version
for doing all that stuff.
462
00:26:55,361 --> 00:26:56,481
This is a ShExC form,
463
00:26:56,482 --> 00:26:59,455
this is something [Joachim] had suggested
464
00:27:00,035 --> 00:27:04,631
could be useful for populating
Wikidata documents
465
00:27:04,631 --> 00:27:07,338
based on a Shape Expression
for that document.
466
00:27:08,095 --> 00:27:11,681
This is not tailored for Wikidata,
467
00:27:11,682 --> 00:27:14,081
but this is just to say
that you can have a schema
468
00:27:14,082 --> 00:27:15,402
and you can have some annotations
469
00:27:15,403 --> 00:27:17,518
to say specifically how I want
that schema rendered
470
00:27:17,519 --> 00:27:19,031
and then it just builds a form,
471
00:27:19,031 --> 00:27:21,191
and if you've got data,
it can even populate the form.
472
00:27:24,517 --> 00:27:26,164
PyShEx [inaudible].
473
00:27:28,025 --> 00:27:31,080
(Jose) I think this is the last one.
474
00:27:31,821 --> 00:27:34,080
Yes, so the last one is PyShEx.
475
00:27:34,675 --> 00:27:38,151
PyShEx is a Python implementation
of Shape Expressions,
476
00:27:39,193 --> 00:27:42,680
you can also play with Jupyter Notebooks
if you want those kinds of things.
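A rough sketch of the PyShEx workflow just mentioned, assuming PyShEx's ShExEvaluator API and reusing the Q42/author example from the live demo; the schema below is illustrative, not the actual E42 entity schema on Wikidata:

```python
# Illustrative only: a minimal ShExC schema in the spirit of E42 (author);
# the real entity schema is more detailed.
SHEX_AUTHOR = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

start = @<author>
<author> {
  wdt:P31 [ wd:Q5 ]   # instance of: human
}
"""

WIKIDATA_Q42 = "http://www.wikidata.org/entity/Q42"

def validate_q42():
    # Requires `pip install PyShEx` and network access, so the import and
    # the RDF fetch are kept inside the function (assumed API).
    from pyshex import ShExEvaluator
    results = ShExEvaluator(rdf=WIKIDATA_Q42 + ".ttl",
                            schema=SHEX_AUTHOR,
                            focus=WIKIDATA_Q42).evaluate()
    return all(r.result for r in results)
```

The same evaluator call is what you would drop into a Jupyter Notebook cell, as the speaker suggests.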
477
00:27:42,680 --> 00:27:44,432
OK, so that's all for this.
478
00:27:44,433 --> 00:27:47,170
(applause)
479
00:27:52,916 --> 00:27:57,073
(Andra) So I'm going to talk about
a specific project that I'm involved in
480
00:27:57,074 --> 00:27:58,074
called Gene Wiki,
481
00:27:58,075 --> 00:28:04,596
where we are also
dealing with quality issues.
482
00:28:04,597 --> 00:28:06,684
But before going into the quality,
483
00:28:06,685 --> 00:28:09,229
maybe a quick introduction
about what Gene Wiki is,
484
00:28:09,855 --> 00:28:15,175
and we just released a pre-print
of a paper that we have recently written
485
00:28:15,175 --> 00:28:18,160
that explains the details of the project.
486
00:28:19,821 --> 00:28:23,839
I see people taking pictures,
but basically, what Gene Wiki does,
487
00:28:23,846 --> 00:28:28,027
it's trying to get biomedical data,
public data into Wikidata,
488
00:28:28,028 --> 00:28:32,200
and we follow a specific pattern
to get that data into Wikidata.
489
00:28:33,130 --> 00:28:36,809
So when we have a new repository
or a new data set
490
00:28:36,810 --> 00:28:39,600
that is eligible
to be included into Wikidata,
491
00:28:39,601 --> 00:28:41,293
the first step is community engagement.
492
00:28:41,294 --> 00:28:43,784
It is not necessarily
directly the Wikidata community
493
00:28:43,785 --> 00:28:46,120
but a local research community,
494
00:28:46,121 --> 00:28:50,286
and we meet in person
or online or on any platform
495
00:28:50,286 --> 00:28:52,881
and try to come up with a data model
496
00:28:52,882 --> 00:28:56,197
that bridges their data
with the Wikidata model.
497
00:28:56,197 --> 00:28:59,944
So here I have a picture of a workshop
that happened here last year
498
00:28:59,945 --> 00:29:02,663
which was trying to look
at a specific data set
499
00:29:02,663 --> 00:29:05,280
and, well, you see a lot of discussions,
500
00:29:05,281 --> 00:29:09,780
then aligning it with schema.org
and other ontologies that are out there.
501
00:29:10,320 --> 00:29:15,508
And then, at the end of the first step,
we have a whiteboard drawing of the schema
502
00:29:15,509 --> 00:29:17,336
that we want to implement in Wikidata.
503
00:29:17,337 --> 00:29:20,440
What you see over there,
this is just plain,
504
00:29:20,441 --> 00:29:21,766
we have it in the back there
505
00:29:21,767 --> 00:29:25,240
so we can make some schemas
within this panel today even.
506
00:29:26,560 --> 00:29:28,399
So once we have the schema in place,
507
00:29:28,400 --> 00:29:31,320
the next thing is to try to make
that schema machine-readable
508
00:29:32,358 --> 00:29:36,841
because you want to have actionable models
to bridge the data that you're bringing in
509
00:29:36,842 --> 00:29:39,690
from any biomedical database
into Wikidata.
510
00:29:40,393 --> 00:29:45,182
And here we are applying
Shape Expressions.
511
00:29:46,471 --> 00:29:52,518
And we use that because
Shape Expressions allow you to test
512
00:29:52,518 --> 00:29:57,040
whether the data set
is actually-- no, to first see
513
00:29:57,041 --> 00:30:01,782
if already existing data in Wikidata
follows the same data model
514
00:30:01,783 --> 00:30:04,718
that was achieved in the previous process.
515
00:30:04,719 --> 00:30:06,641
So then with the Shape Expression
we can check:
516
00:30:06,642 --> 00:30:10,926
OK, the data that is on this topic
in Wikidata, does it need some cleaning up,
517
00:30:10,926 --> 00:30:15,013
or do we need to adapt our model
to the Wikidata model, or vice versa?
518
00:30:15,937 --> 00:30:19,867
Once that is in place,
we start writing bots,
519
00:30:20,670 --> 00:30:23,801
and bots are seeding the information
520
00:30:23,802 --> 00:30:27,308
that is in the primary sources
into Wikidata.
521
00:30:27,846 --> 00:30:29,303
And when the bots are ready,
522
00:30:29,304 --> 00:30:33,001
we write these bots
with a platform called--
523
00:30:33,002 --> 00:30:36,201
with a Python library
called Wikidata Integrator
524
00:30:36,202 --> 00:30:38,167
that came out of our project.
525
00:30:38,698 --> 00:30:42,921
And once we have our bots,
we use a platform called Jenkins
526
00:30:42,921 --> 00:30:44,540
for continuous integration.
527
00:30:44,540 --> 00:30:45,762
And with Jenkins,
528
00:30:45,762 --> 00:30:51,160
we continuously update
Wikidata with the primary sources.
529
00:30:52,178 --> 00:30:55,889
And this is a diagram for the paper
I previously mentioned.
530
00:30:55,890 --> 00:30:57,241
This is our current landscape.
531
00:30:57,242 --> 00:31:02,059
So every orange box out there
is a primary resource on drugs,
532
00:31:02,060 --> 00:31:07,827
proteins, genes, diseases,
chemical compounds with interaction,
533
00:31:07,827 --> 00:31:10,870
and this model is too small to read now
534
00:31:10,870 --> 00:31:17,472
but these are the databases,
the sources that we manage in Wikidata
535
00:31:17,473 --> 00:31:20,560
and bridge with the primary sources.
536
00:31:20,561 --> 00:31:22,355
Here is such a workflow.
537
00:31:22,870 --> 00:31:25,312
So one of our partners
is the Disease Ontology
538
00:31:25,312 --> 00:31:27,672
the Disease Ontology is a CC0 ontology,
539
00:31:28,179 --> 00:31:31,990
and the CC0 Ontology
has a curation cycle on its own,
540
00:31:32,756 --> 00:31:35,736
and they just continuously
update the Disease Ontology
541
00:31:35,737 --> 00:31:39,687
to reflect the disease space
or the interpretation of diseases.
542
00:31:40,336 --> 00:31:44,361
And there is the Wikidata
curation cycle also on diseases
543
00:31:44,362 --> 00:31:49,844
where the Wikidata community constantly
monitors what's going on on Wikidata.
544
00:31:50,406 --> 00:31:51,601
And then we have two roles,
545
00:31:51,602 --> 00:31:55,477
we call them colloquially
the gatekeeper curators,
546
00:31:56,009 --> 00:31:59,561
and this was me
and a colleague five years ago
547
00:31:59,562 --> 00:32:03,414
where we just sat at our computers
and we monitor Wikipedia and Wikidata,
548
00:32:03,415 --> 00:32:08,601
and if there was an issue, it was
reported back to the primary community,
549
00:32:08,602 --> 00:32:11,765
the primary resources, they looked
at the implementation and decided:
550
00:32:11,765 --> 00:32:14,240
OK, do we trust the Wikidata input?
551
00:32:14,850 --> 00:32:18,555
Yes--then it's considered,
it goes into the cycle,
552
00:32:18,555 --> 00:32:22,686
and the next iteration
is part of the Disease Ontology
553
00:32:22,687 --> 00:32:25,411
and fed back into Wikidata.
554
00:32:27,419 --> 00:32:31,480
We're doing the same for WikiPathways.
555
00:32:31,481 --> 00:32:36,601
WikiPathways is a MediaWiki-inspired
pathway repository.
556
00:32:36,602 --> 00:32:40,901
Same story, there are different
pathway resources on Wikidata already.
557
00:32:41,463 --> 00:32:44,713
There might be conflicts
between those pathway resources
558
00:32:44,722 --> 00:32:46,701
and these conflicts are reported back
559
00:32:46,702 --> 00:32:49,521
by the gatekeeper curators
to that community,
560
00:32:49,522 --> 00:32:53,715
and you maintain
the individual curation cycles.
561
00:32:53,715 --> 00:32:57,068
But if you remember the previous cycle,
562
00:32:57,069 --> 00:33:03,041
here I mentioned
only two cycles, two resources,
563
00:33:03,566 --> 00:33:06,300
we have to do that
for every single resource that we have
564
00:33:06,300 --> 00:33:08,061
and we have to manage what's going on
565
00:33:08,062 --> 00:33:09,185
because when I say curation,
566
00:33:09,185 --> 00:33:11,377
I really mean going
to the Wikipedia talk pages,
567
00:33:11,377 --> 00:33:14,544
going into the Wikidata talk pages
and trying to do that.
568
00:33:14,545 --> 00:33:19,316
That doesn't scale for
the two gatekeeper curators we had.
569
00:33:19,860 --> 00:33:22,777
So when I was at a conference in 2016
570
00:33:22,778 --> 00:33:26,933
where Eric gave a presentation
on Shape Expressions,
571
00:33:26,934 --> 00:33:29,277
I jumped on the bandwagon and said OK,
572
00:33:29,278 --> 00:33:34,240
Shape Expressions can help us
detect differences in Wikidata
573
00:33:34,240 --> 00:33:41,159
and so that allows the gatekeepers
to do some more efficient reporting.
574
00:33:42,275 --> 00:33:46,019
So this year,
I was delighted by the entity schemas
575
00:33:46,020 --> 00:33:50,765
because now, we can store
those entity schemas on Wikidata,
576
00:33:50,765 --> 00:33:53,183
on Wikidata itself,
whereas before, it was on GitHub,
577
00:33:53,860 --> 00:33:56,815
and this aligns
with the Wikidata interface,
578
00:33:56,816 --> 00:33:59,350
so you have things
like documentation and discussions
579
00:33:59,350 --> 00:34:00,762
but you also have revisions.
580
00:34:00,763 --> 00:34:05,261
So you can leverage the talk pages
and the revisions in Wikidata
581
00:34:05,262 --> 00:34:12,255
to use that to discuss
what is in Wikidata
582
00:34:12,255 --> 00:34:14,060
and what is in the primary resources.
583
00:34:14,966 --> 00:34:19,686
So this is what Eric just presented;
this is already quite a benefit.
584
00:34:19,686 --> 00:34:24,335
So here, we made up a Shape Expression
for the human gene,
585
00:34:24,336 --> 00:34:30,225
and then we ran it through ShEx Simple,
and as you can see,
586
00:34:30,225 --> 00:34:32,428
we just got already ni--
587
00:34:32,429 --> 00:34:34,641
There is one issue
that needs to be monitored
588
00:34:34,642 --> 00:34:37,316
which is that there is an item
that doesn't fit that schema,
589
00:34:37,316 --> 00:34:43,139
and then you can sort of already
create entity schema curation reports
590
00:34:43,140 --> 00:34:46,240
based on... and send those
to the different curation communities.
591
00:34:48,058 --> 00:34:52,788
But ShEx.js is a web interface,
592
00:34:52,788 --> 00:34:55,860
and if I can go back here,
I only do ten,
593
00:34:55,860 --> 00:35:00,362
but we have tens of thousands,
and so that again doesn't scale.
594
00:35:00,362 --> 00:35:04,654
So the Wikidata Integrator now
supports ShEx as well,
595
00:35:05,168 --> 00:35:07,431
and then we can just loop over items
596
00:35:07,431 --> 00:35:11,494
where we say yes-no,
yes-no, true-false, true-false.
597
00:35:11,495 --> 00:35:12,495
So again,
598
00:35:13,065 --> 00:35:16,514
increasing a bit the efficiency
of dealing with the reports.
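The item-by-item true/false loop described here can be sketched in plain Python. This is only an illustrative stand-in, not Wikidata Integrator's actual API: the required-property set and the sample items are invented, and a real run would call the library's ShEx support for each item.

```python
# Invented example: a "shape" reduced to a set of required properties.
REQUIRED = {"P31", "P703"}  # e.g. instance of, found in taxon

# Invented items: QID -> {property: value} claims.
items = {
    "Q14860": {"P31": "Q7187", "P703": "Q15978631"},  # has both properties
    "Q18609": {"P31": "Q7187"},                       # missing P703
}

def conformance_report(items, required=REQUIRED):
    """Map each item ID to True/False depending on whether it carries
    all required properties -- the yes-no loop from the talk."""
    return {qid: required.issubset(claims) for qid, claims in items.items()}

report = conformance_report(items)  # {"Q14860": True, "Q18609": False}
```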
599
00:35:17,256 --> 00:35:22,662
But now, recently, that builds
on the Wikidata Query Service,
600
00:35:23,181 --> 00:35:24,998
and well, we recently have been throttled,
601
00:35:24,999 --> 00:35:26,560
so again, that doesn't scale.
602
00:35:26,561 --> 00:35:31,391
So it's still an ongoing process,
how to deal with models on Wikidata.
603
00:35:32,202 --> 00:35:36,682
And so again,
ShEx is not only intimidating
604
00:35:36,683 --> 00:35:40,356
but also the scale is just
too big to deal with.
605
00:35:41,068 --> 00:35:46,081
So I started working, this is my first
proof of concept or exercise
606
00:35:46,082 --> 00:35:47,680
where I used a tool called yEd,
607
00:35:48,184 --> 00:35:52,590
and I started to draw
those Shape Expressions and because...
608
00:35:52,591 --> 00:35:58,098
and then regenerate this schema
609
00:35:58,099 --> 00:36:01,279
into the JSON format
of Shape Expressions,
610
00:36:01,280 --> 00:36:04,520
so that would already open it up
to the audience
611
00:36:04,521 --> 00:36:07,432
who are intimidated
by the Shape Expressions language.
612
00:36:07,961 --> 00:36:12,308
But actually, there is a problem
with those visual descriptions
613
00:36:12,309 --> 00:36:18,229
because this is also a schema
that was actually drawn in yEd by someone.
614
00:36:18,230 --> 00:36:23,838
And here is another one
which is beautiful.
615
00:36:23,838 --> 00:36:29,414
I would love to have this on my wall,
but it is still not interoperable.
616
00:36:30,281 --> 00:36:32,131
So I want to end my talk with,
617
00:36:32,131 --> 00:36:35,732
and the first time, I've been
stealing this slide, using this slide.
618
00:36:35,732 --> 00:36:37,594
It's an honor to have him in the audience
619
00:36:37,595 --> 00:36:39,423
and I really like this:
620
00:36:39,424 --> 00:36:42,362
"People think RDF is a pain
because it's complicated.
621
00:36:42,362 --> 00:36:43,985
The truth is even worse: it's painfully simplistic,
622
00:36:45,581 --> 00:36:48,133
but it allows you to work
with real-world data and problems
623
00:36:48,134 --> 00:36:50,031
that are horribly complicated.
624
00:36:50,031 --> 00:36:51,451
While you can avoid RDF,
625
00:36:51,451 --> 00:36:55,760
it is harder to avoid complicated data
and complicated computer problems."
626
00:36:55,761 --> 00:36:59,535
This is about RDF, but I think
this applies to modeling just as well.
627
00:37:00,112 --> 00:37:02,769
So my point of discussion
is should we really...
628
00:37:03,387 --> 00:37:05,882
How do we get modeling going?
629
00:37:05,882 --> 00:37:10,826
Should we discuss ShEx
or visual models or...
630
00:37:11,426 --> 00:37:13,271
How do we continue?
631
00:37:13,474 --> 00:37:14,840
Thank you very much for your time.
632
00:37:15,102 --> 00:37:17,787
(applause)
633
00:37:20,001 --> 00:37:21,188
(Lydia) Thank you so much.
634
00:37:21,692 --> 00:37:24,001
Would you come to the front
635
00:37:24,002 --> 00:37:27,741
so that we can open it up
to questions from the audience.
636
00:37:28,610 --> 00:37:30,203
Are there questions?
637
00:37:31,507 --> 00:37:32,507
Yes.
638
00:37:34,253 --> 00:37:36,890
And I think, for the camera, we need to...
639
00:37:38,835 --> 00:37:40,968
(Lydia laughing) Yeah.
640
00:37:43,094 --> 00:37:46,273
(man3) So a question
for Cristina, I think.
641
00:37:47,366 --> 00:37:51,641
So you mentioned exactly
the term "information gain"
642
00:37:51,642 --> 00:37:53,689
from linking with other systems.
643
00:37:53,690 --> 00:37:55,619
There is an information theoretic measure
644
00:37:55,620 --> 00:37:58,001
using statistics and probability,
called information gain.
645
00:37:58,002 --> 00:37:59,541
Do you have the same...
646
00:37:59,542 --> 00:38:01,736
I mean did you mean exactly that measure,
647
00:38:01,736 --> 00:38:04,173
the information gain
from the probability theory
648
00:38:04,174 --> 00:38:05,240
from information theory
649
00:38:05,241 --> 00:38:09,024
or did you just use the concept
to measure information gain in some way?
650
00:38:09,025 --> 00:38:13,016
No, so we actually defined
and implemented measures
651
00:38:13,695 --> 00:38:20,161
that are using the Shannon entropy,
so it's meant as that.
652
00:38:20,162 --> 00:38:22,696
I didn't want to go into
details of the concrete formulas...
653
00:38:22,697 --> 00:38:24,977
(man3) No, no, of course,
that's why I asked the question.
654
00:38:24,978 --> 00:38:26,698
- (Cristina) But yeah...
- (man3) Thank you.
655
00:38:33,091 --> 00:38:35,047
(man4) I'll make more
of a comment than a question.
656
00:38:35,048 --> 00:38:36,241
(Lydia) Go for it.
657
00:38:36,242 --> 00:38:39,840
(man4) So there's been
a lot of focus at the item level
658
00:38:39,840 --> 00:38:42,547
about quality and completeness,
659
00:38:42,547 --> 00:38:47,374
one of the things that concerns me is that
we're not applying the same to hierarchies
660
00:38:47,374 --> 00:38:51,480
and I think we have an issue
in that our hierarchy often isn't good.
661
00:38:51,481 --> 00:38:53,463
We're seeing
this is going to be a real problem
662
00:38:53,464 --> 00:38:55,774
with Commons searching and other things.
663
00:38:56,771 --> 00:39:00,601
One of the abilities that we can do
is to import external--
664
00:39:00,602 --> 00:39:04,842
The way that external thesauruses
structure their hierarchies,
665
00:39:04,842 --> 00:39:10,291
using the P4900
broader concept qualifier.
666
00:39:11,037 --> 00:39:16,167
But what I think would be really helpful
would be much better tools for doing that
667
00:39:16,168 --> 00:39:21,212
so that you can import an
external... thesaurus's hierarchy
668
00:39:21,212 --> 00:39:24,111
map that onto our Wikidata items.
669
00:39:24,111 --> 00:39:28,199
Once it's in place
with those P4900 qualifiers,
670
00:39:28,200 --> 00:39:31,494
you can actually do some
quite good querying through SPARQL
671
00:39:32,490 --> 00:39:37,534
to see where our hierarchy
diverges from that external hierarchy.
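The kind of SPARQL check described here might look like the sketch below. It assumes P1014 (the Getty AAT ID) as the external-ID property carrying the P4900 qualifier; any other mapped thesaurus property would work the same way.

```python
# Finds items whose external "broader concept" (the P4900 qualifier on the
# AAT-ID statement) is not reachable through the item's own P279
# superclass chain -- i.e. places where our hierarchy diverges from the
# external one. Prefixes p:/pq:/wdt: follow the standard Wikidata endpoint.
DIVERGENCE_QUERY = """
SELECT ?item ?externalParent WHERE {
  ?item p:P1014 ?stmt .
  ?stmt pq:P4900 ?externalParent .
  FILTER NOT EXISTS { ?item wdt:P279+ ?externalParent . }
}
"""
```

The query string would be posted to the Wikidata Query Service endpoint as usual.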
672
00:39:37,534 --> 00:39:41,346
For instance, [Paula Morma],
user PKM, you may know,
673
00:39:41,346 --> 00:39:43,533
does a lot of work on fashion.
674
00:39:43,533 --> 00:39:50,524
So we use that to pull in the Europeana
Fashion Thesaurus's hierarchy
675
00:39:50,524 --> 00:39:53,812
and the Getty AAT
fashion thesaurus hierarchy,
676
00:39:53,812 --> 00:39:57,957
and then see where the gaps
were in our higher level items,
677
00:39:57,957 --> 00:40:00,511
which is a real problem for us
because often,
678
00:40:00,511 --> 00:40:04,355
these are things that only exist
as disambiguation pages on Wikipedia,
679
00:40:04,356 --> 00:40:09,270
so we have a lot of higher level items
in our hierarchies missing
680
00:40:09,271 --> 00:40:14,480
and this is something that we must address
in terms of quality and completeness,
681
00:40:14,480 --> 00:40:15,971
but what would really help
682
00:40:16,643 --> 00:40:20,871
would be better tools than
the jungle of Perl scripts that I wrote...
683
00:40:20,872 --> 00:40:26,010
If somebody could put that
into a PAWS notebook in Python
684
00:40:26,561 --> 00:40:31,972
to be able to take an external thesaurus,
take its hierarchy,
685
00:40:31,973 --> 00:40:34,595
which may well be available
as linked data or may not,
686
00:40:35,379 --> 00:40:40,580
to then put those into
quick statements to put in P4900 values.
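A sketch of the tooling step being requested, on the assumption that the item-to-thesaurus mapping has already been worked out and that QuickStatements V1 accepts tab-separated claim-plus-qualifier columns; the property used (P1014, Getty AAT ID) and all IDs are only examples.

```python
def p4900_rows(prop, mapping):
    """Build QuickStatements V1 rows that add an external-ID claim with a
    P4900 "broader concept" qualifier.

    prop:    the external-ID property, e.g. "P1014" (Getty AAT ID)
    mapping: {qid: (external_id, parent_qid)} -- parent_qid is the Wikidata
             item matching the external thesaurus's broader concept.
    """
    return "\n".join(
        "\t".join([qid, prop, f'"{ext_id}"', "P4900", parent])
        for qid, (ext_id, parent) in mapping.items()
    )

# Example IDs are invented for illustration.
rows = p4900_rows("P1014", {"Q3031": ("300046004", "Q11460")})
```

Re-running the script as the local hierarchy densifies would regenerate the qualifier values, which is the update step the speaker asks for.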
687
00:40:41,165 --> 00:40:42,165
And then later,
688
00:40:42,166 --> 00:40:44,527
when our representation
gets more complete,
689
00:40:44,528 --> 00:40:49,691
to update those P4900s,
because as our representation gets updated,
690
00:40:49,691 --> 00:40:51,590
becomes more dense,
691
00:40:51,590 --> 00:40:55,377
the values of those qualifiers
need to change
692
00:40:56,230 --> 00:40:59,526
to represent that we've got more
of their hierarchy in our system.
693
00:40:59,526 --> 00:41:03,728
If somebody could do that,
I think that would be very helpful,
694
00:41:03,728 --> 00:41:07,121
and we do need to also
look at other approaches
695
00:41:07,122 --> 00:41:10,762
to improve quality and completeness
at the hierarchy level
696
00:41:10,763 --> 00:41:12,378
not just at the item level.
697
00:41:13,308 --> 00:41:14,840
(Andra) Can I add to that?
698
00:41:16,362 --> 00:41:19,901
Yes, and we actually do that,
699
00:41:19,911 --> 00:41:23,551
and I can recommend looking at
the Shape Expression that Finn made
700
00:41:23,552 --> 00:41:27,330
with the lexical data
where he creates Shape Expressions
701
00:41:27,330 --> 00:41:29,640
and then builds on other Shape Expressions
702
00:41:29,641 --> 00:41:32,528
so you have this concept
of linked Shape Expressions in Wikidata,
703
00:41:32,529 --> 00:41:35,005
and specifically, the use case,
if I understand correctly,
704
00:41:35,006 --> 00:41:37,183
is exactly what we are doing in Gene Wiki.
705
00:41:37,184 --> 00:41:40,841
So you have the Disease Ontology
which is put into Wikidata
706
00:41:40,842 --> 00:41:44,681
and then disease data comes in
and we apply the Shape Expressions
707
00:41:44,682 --> 00:41:47,247
to see if that fits with this thesaurus.
708
00:41:47,248 --> 00:41:50,919
And there are other thesauruses or other
ontologies for controlled vocabularies
709
00:41:50,920 --> 00:41:52,559
that still need to go into Wikidata,
710
00:41:52,559 --> 00:41:55,401
and that's exactly why
Shape Expression is so interesting
711
00:41:55,402 --> 00:41:57,963
because you can have a Shape Expression
for the Disease Ontology,
712
00:41:57,964 --> 00:41:59,644
you can have a Shape Expression for MeSH,
713
00:41:59,645 --> 00:42:01,761
you can say: OK,
now I want to check the quality.
714
00:42:01,762 --> 00:42:04,059
Because you also have
in Wikidata the context
715
00:42:04,060 --> 00:42:09,567
of when you have a controlled vocabulary,
you say the quality is according to this,
716
00:42:09,568 --> 00:42:11,636
but you might have
a disagreeing community.
717
00:42:11,636 --> 00:42:16,081
So the tooling is indeed in place,
but the task now is to create those models
718
00:42:16,082 --> 00:42:18,144
and apply them
on the different use cases.
719
00:42:18,811 --> 00:42:20,921
(man4) The Shape Expressions are very useful
720
00:42:20,922 --> 00:42:25,928
once you have the external ontology
mapped into Wikidata,
721
00:42:25,929 --> 00:42:29,474
but my problem is
getting to that stage,
722
00:42:29,475 --> 00:42:34,881
it's working out how much of the
external ontology isn't yet in Wikidata
723
00:42:34,882 --> 00:42:36,256
and where the gaps are,
724
00:42:36,257 --> 00:42:40,660
and that's where I think that
having much more robust tools
725
00:42:40,660 --> 00:42:44,286
to see what's missing
from external ontologies
726
00:42:44,286 --> 00:42:45,537
would be very helpful.
727
00:42:47,678 --> 00:42:49,062
The biggest problem there
728
00:42:49,062 --> 00:42:51,201
is not so much tooling
but more licensing.
729
00:42:51,803 --> 00:42:55,249
So getting the ontologies
into Wikidata is actually a piece of cake
730
00:42:55,250 --> 00:42:59,295
but most of the ontologies have,
how can I say that politely,
731
00:42:59,965 --> 00:43:03,256
restrictive licensing,
so they are not compatible with Wikidata.
732
00:43:04,068 --> 00:43:06,678
(man4) There's a huge number
of public sector thesauruses
733
00:43:06,678 --> 00:43:08,209
in cultural fields.
734
00:43:08,210 --> 00:43:10,851
- (Andra) Then we need to talk.
- (man4) Not a problem.
735
00:43:10,852 --> 00:43:12,384
(Andra) Then we need to talk.
736
00:43:13,624 --> 00:43:19,192
(man5) Just... the comment I want to make
is actually an answer to James,
737
00:43:19,192 --> 00:43:22,401
so the thing is that
hierarchies make graphs,
738
00:43:22,374 --> 00:43:24,041
and when you want to...
739
00:43:24,579 --> 00:43:28,888
I want to basically talk about...
a common problem in hierarchies
740
00:43:28,889 --> 00:43:30,820
is circular hierarchies,
741
00:43:30,821 --> 00:43:33,796
so they come back to each other,
which is a problem,
742
00:43:33,796 --> 00:43:35,920
and you should not
have that in hierarchies.
743
00:43:37,022 --> 00:43:41,295
This, funnily enough,
happens in categories in Wikipedia a lot
744
00:43:41,295 --> 00:43:42,990
we have a lot of cycles in categories,
745
00:43:43,898 --> 00:43:46,612
but the good news is that this is...
746
00:43:47,713 --> 00:43:51,582
Technically, it's an NP-complete problem,
so you cannot find these easily
747
00:43:51,583 --> 00:43:53,414
if you build a graph of that,
748
00:43:54,473 --> 00:43:57,046
but there are lots of ways
that have been developed
749
00:43:57,047 --> 00:44:00,624
to find problems
in these hierarchy graphs.
750
00:44:00,625 --> 00:44:04,860
Like there is a paper
called Finding Cycles...
751
00:44:04,861 --> 00:44:07,955
Breaking Cycles in Noisy Hierarchies,
752
00:44:07,956 --> 00:44:12,671
and it's been used to help
categorization of English Wikipedia.
753
00:44:12,672 --> 00:44:17,141
You can just take this
and apply it to these hierarchies in Wikidata,
754
00:44:17,142 --> 00:44:19,540
and then you can find
things that are problematic
755
00:44:19,541 --> 00:44:22,481
and just remove the ones
that are causing issues
756
00:44:22,482 --> 00:44:24,593
and find the issues, actually.
757
00:44:24,594 --> 00:44:26,960
So this is just an idea, just so you...
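For the cycle hunting the speaker describes, a depth-first search over subclass-of (P279) edges is enough to surface one offending loop at a time; the toy graph below is invented, and a real run would feed in P279 pairs pulled from a query dump.

```python
def find_cycle(graph):
    """graph: {node: iterable of parents}. Return one cycle as a list of
    nodes (first node repeated at the end), or None if the graph is acyclic."""
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        for parent in graph.get(node, ()):
            if parent in visiting:  # back edge: the path closes on itself
                return path[path.index(parent):] + [parent]
            if parent not in done:
                found = dfs(parent, path + [parent])
                if found:
                    return found
        visiting.discard(node)
        done.add(node)
        return None

    for node in list(graph):
        if node not in done:
            found = dfs(node, [node])
            if found:
                return found
    return None

# Invented subclass chains: the first one loops back on itself.
bad = {"city": ["settlement"], "settlement": ["place"], "place": ["city"]}
ok = {"city": ["settlement"], "settlement": ["place"], "place": []}
```

Breaking the reported cycle and re-running until `find_cycle` returns None mirrors the iterative cleanup in the "Breaking Cycles in Noisy Hierarchies" approach mentioned above.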
758
00:44:28,780 --> 00:44:29,930
(man4) That's all very well
759
00:44:29,931 --> 00:44:34,402
but I think you're underestimating
the number of bad subclass relations
760
00:44:34,402 --> 00:44:35,402
that we have.
761
00:44:35,403 --> 00:44:39,680
It's like having a city
in completely the wrong country,
762
00:44:40,250 --> 00:44:44,874
and there are tools for geography
to identify that,
763
00:44:44,875 --> 00:44:49,201
and we need to have
much better tools in hierarchies
764
00:44:49,202 --> 00:44:53,477
to identify where the equivalent
of the item for the country
765
00:44:53,478 --> 00:44:57,673
is missing entirely,
or where it's actually been subclassed
766
00:44:57,674 --> 00:45:01,804
to something that means
something completely different.
767
00:45:02,804 --> 00:45:07,165
(Lydia) Yeah, I think
you're getting to something
768
00:45:07,166 --> 00:45:12,024
that my team and I keep hearing
from people who reuse our data
769
00:45:12,025 --> 00:45:13,991
quite a bit as well, right,
770
00:45:15,002 --> 00:45:16,638
An individual data point might be great
771
00:45:16,639 --> 00:45:20,163
but if you have to look
at the ontology and so on,
772
00:45:20,164 --> 00:45:21,857
then it gets very...
773
00:45:22,388 --> 00:45:26,437
And I think one of the big reasons
why this is happening
774
00:45:26,437 --> 00:45:30,736
is that a lot of editing on Wikidata
775
00:45:30,736 --> 00:45:34,544
happens on the basis
of an individual item, right,
776
00:45:34,545 --> 00:45:36,201
you make an edit on that item,
777
00:45:37,653 --> 00:45:42,075
without realizing that this
might have very global consequences
778
00:45:42,075 --> 00:45:44,245
on the rest of the graph, for example.
779
00:45:44,245 --> 00:45:50,040
And if people have ideas around
how to make this more visible,
780
00:45:50,041 --> 00:45:53,185
the consequences
of an individual local edit,
781
00:45:54,005 --> 00:45:56,537
I think that would be worth exploring,
782
00:45:57,550 --> 00:46:01,583
to show people better
what the consequences of an edit,
783
00:46:01,584 --> 00:46:03,434
one they might make in very good faith,
784
00:46:04,481 --> 00:46:05,481
actually are.
785
00:46:06,939 --> 00:46:12,237
Whoa! OK, let's start with, yeah, you,
then you, then you, then you.
786
00:46:12,237 --> 00:46:13,921
(man5) Well, after the discussion,
787
00:46:13,922 --> 00:46:18,262
just to express my agreement
with what James was saying.
788
00:46:18,263 --> 00:46:22,467
So essentially, it seems
the most dangerous thing is the hierarchy,
789
00:46:22,468 --> 00:46:23,910
not the hierarchy, but generally
790
00:46:23,911 --> 00:46:28,022
the semantics of the subclass relations
seen in Wikidata, right.
791
00:46:28,022 --> 00:46:32,561
So I've been studying languages recently,
just for the purposes of this conference,
792
00:46:32,562 --> 00:46:35,257
and for example, you find plenty of cases
793
00:46:35,257 --> 00:46:39,463
where a language is a part of
and subclass of the same thing, OK.
794
00:46:39,463 --> 00:46:43,577
So you know, you can say
we have a flexible ontology.
795
00:46:43,577 --> 00:46:46,256
Wikidata gives you freedom
to express that, sometimes.
796
00:46:46,256 --> 00:46:47,257
Because, for example,
797
00:46:47,258 --> 00:46:50,721
that ontology of languages
is also politically complicated, right?
798
00:46:50,722 --> 00:46:55,038
It is even good to be in a position
to express a level of uncertainty.
799
00:46:55,038 --> 00:46:57,983
But imagine anyone who wants
to do machine reading from that.
800
00:46:57,984 --> 00:46:59,468
So that's really problematic.
801
00:46:59,468 --> 00:47:00,468
And then again,
802
00:47:00,469 --> 00:47:03,686
I don't think that ontology
was ever imported from somewhere,
803
00:47:03,687 --> 00:47:05,490
that's something which is originally ours.
804
00:47:05,491 --> 00:47:08,321
It was harvested from Wikipedia
in the very beginning, I would say.
805
00:47:08,322 --> 00:47:11,324
So I wonder...
this Shape Expressions thing is great,
806
00:47:11,325 --> 00:47:15,575
and also validating and fixing,
if you like, the Wikidata ontology
807
00:47:15,576 --> 00:47:18,191
by external resources, beautiful idea.
808
00:47:19,026 --> 00:47:20,026
In the end,
809
00:47:20,027 --> 00:47:25,440
will we end up reflecting
the external ontologies in Wikidata?
810
00:47:25,441 --> 00:47:28,651
And also, what do we do with
the core part of our ontology
811
00:47:28,652 --> 00:47:30,642
which is never harvested
from external resources,
812
00:47:30,643 --> 00:47:31,978
how do we go and fix that?
813
00:47:31,979 --> 00:47:35,276
And I really think that
that will be a problem on its own.
814
00:47:35,277 --> 00:47:39,010
We will have to focus on that
independently of the idea
815
00:47:39,010 --> 00:47:41,046
of validating ontology
with something external.
816
00:47:49,353 --> 00:47:53,379
(man6) OK, constraints
and shapes are very impressive
817
00:47:53,380 --> 00:47:54,495
in what we can do with them,
818
00:47:55,205 --> 00:47:58,481
but the main point is not
really being made clear--
819
00:47:58,482 --> 00:48:03,229
it's that now we can make more explicit
what we expect from the data.
820
00:48:03,229 --> 00:48:06,893
Before, everyone had to write
their own tools and scripts,
821
00:48:06,894 --> 00:48:10,601
and so it's more visible
and we can discuss it.
822
00:48:10,602 --> 00:48:13,641
But because it's not about
what's wrong or right,
823
00:48:13,642 --> 00:48:15,870
it's about an expectation,
824
00:48:15,870 --> 00:48:18,105
and you will have different
expectations and discussions
825
00:48:18,106 --> 00:48:20,737
about how we want
to model things in Wikidata,
826
00:48:21,246 --> 00:48:23,095
and this...
827
00:48:23,096 --> 00:48:26,280
The current state is just
one step in the direction
828
00:48:26,281 --> 00:48:28,041
because now you need
829
00:48:28,042 --> 00:48:31,041
a lot of technical expertise
to get into this,
830
00:48:31,042 --> 00:48:35,721
and we need better ways
to visualize these constraints,
831
00:48:35,722 --> 00:48:39,995
to transform them maybe into natural language
so people can understand them better,
832
00:48:40,939 --> 00:48:43,768
but it's less about what's wrong or right.
833
00:48:44,925 --> 00:48:45,925
(Lydia) Yeah.
834
00:48:50,986 --> 00:48:53,893
(man7) So for quality issues,
I just want to echo it like...
835
00:48:53,894 --> 00:48:57,010
I've definitely found a lot of the issues
I've encountered have been
836
00:48:58,838 --> 00:49:02,330
differences in opinion
between instance of and subclass of.
837
00:49:02,331 --> 00:49:05,963
I would say errors in those situations
838
00:49:05,963 --> 00:49:11,521
and trying to find those
has been a very time-consuming process.
839
00:49:11,522 --> 00:49:14,840
What I've found is like:
"Oh, if I find very high-impression items
840
00:49:14,840 --> 00:49:16,051
that are something...
841
00:49:16,052 --> 00:49:21,628
and then use all the subclass instances
to find all derived statements of this,"
842
00:49:21,628 --> 00:49:26,215
this is a very useful way
of looking for these errors.
843
00:49:26,215 --> 00:49:28,067
But I was curious if Shape Expressions,
844
00:49:29,841 --> 00:49:31,582
if there is...
845
00:49:31,583 --> 00:49:36,934
If this can be used as a tool
to help resolve those issues but, yeah...
846
00:49:40,514 --> 00:49:42,555
(man8) If it has a structural footprint...
847
00:49:45,910 --> 00:49:49,310
If it has a structural footprint
that you can...that's sort of falsifiable,
848
00:49:49,310 --> 00:49:51,191
you can look at that
and say well, that's wrong,
849
00:49:51,192 --> 00:49:52,670
then yeah, you can do that.
850
00:49:52,671 --> 00:49:56,921
But if it's just sort of
trying to map it to real-world objects,
851
00:49:56,922 --> 00:49:59,082
then you're just going to need
lots and lots of brains.
852
00:50:05,768 --> 00:50:08,631
(man9) Hi, Pablo Mendes
from Apple Siri Knowledge.
853
00:50:09,154 --> 00:50:12,770
We're here to find out how to help
the project and the community
854
00:50:12,770 --> 00:50:15,645
but Cristina made the mistake
of asking what we want.
855
00:50:16,471 --> 00:50:20,052
(laughing) So I think
one thing I'd like to see
856
00:50:20,958 --> 00:50:23,521
is a lot around verifiability
857
00:50:23,522 --> 00:50:26,372
which is one of the core tenets
of the project and the community,
858
00:50:27,062 --> 00:50:28,590
and trustworthiness.
859
00:50:28,590 --> 00:50:32,412
Not every statement is the same,
some of them are heavily disputed,
860
00:50:32,413 --> 00:50:33,653
some of them are easy to guess,
861
00:50:33,654 --> 00:50:35,541
like somebody's
date of birth can be verified,
862
00:50:36,071 --> 00:50:39,082
as you saw today in the Keynote,
gender issues are a lot more complicated.
863
00:50:40,205 --> 00:50:42,130
Can you discuss a little bit what you know
864
00:50:42,131 --> 00:50:47,271
in this area of data quality around
trustworthiness and verifiability?
865
00:50:55,442 --> 00:50:58,138
If there isn't a lot,
I'd love to see a lot more. (laughs)
866
00:51:00,646 --> 00:51:01,646
(Lydia) Yeah.
867
00:51:03,314 --> 00:51:06,548
Apparently, we don't have
a lot to say on that. (laughs)
868
00:51:08,024 --> 00:51:12,299
(Andra) I think we can do a lot,
but I had a discussion with you yesterday.
869
00:51:12,300 --> 00:51:15,774
My favorite example I learned yesterday
that's already deprecated
870
00:51:15,774 --> 00:51:20,281
is if you go to Q2, which is Earth,
871
00:51:20,282 --> 00:51:23,343
there is a statement
that claims that the Earth is flat.
872
00:51:24,183 --> 00:51:26,055
And I love that example
873
00:51:26,056 --> 00:51:28,391
because there is a community
out there that claims that
874
00:51:28,392 --> 00:51:30,417
and they have verifiable resources.
875
00:51:30,418 --> 00:51:32,254
So I think it's a genuine case,
876
00:51:32,255 --> 00:51:34,641
it shouldn't be deprecated,
it should be in Wikidata.
877
00:51:34,642 --> 00:51:40,385
And I think Shape Expressions
can be really instrumental there,
878
00:51:40,386 --> 00:51:41,832
because what you can say,
879
00:51:41,833 --> 00:51:44,856
OK, I'm really interested
in this use case,
880
00:51:44,857 --> 00:51:47,129
or this is a use case where you disagree,
881
00:51:47,130 --> 00:51:51,059
but there can also be a use case
where you say OK, I'm interested.
882
00:51:51,059 --> 00:51:53,449
So there is this example you say,
I have glucose.
883
00:51:53,449 --> 00:51:55,841
And glucose when you're a biologist,
884
00:51:55,842 --> 00:52:00,176
you don't care for the chemical
constraints of the glucose molecule,
885
00:52:00,177 --> 00:52:03,201
you just... everything glucose
is the same.
886
00:52:03,202 --> 00:52:05,973
But if you're a chemist,
you cringe when you hear that,
887
00:52:05,973 --> 00:52:08,191
you have 200 something...
888
00:52:08,191 --> 00:52:10,443
So then you can have
multiple Shape Expressions,
889
00:52:10,443 --> 00:52:12,721
OK, I'm coming in with...
I'm taking a chemist's view,
890
00:52:12,722 --> 00:52:13,887
I'm applying that.
891
00:52:13,887 --> 00:52:16,691
And then you say
I'm from a biological use case,
892
00:52:16,691 --> 00:52:18,524
I'm applying that Shape Expression.
893
00:52:18,524 --> 00:52:20,358
And then when you want to collaborate,
894
00:52:20,358 --> 00:52:22,784
yes, well you should talk
to Eric about ShEx maps.
895
00:52:23,910 --> 00:52:28,873
And so...
but this journey is just starting.
896
00:52:28,873 --> 00:52:32,238
But I personally believe
that it's quite instrumental in that area.
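The idea of view-specific shapes can be illustrated without real ShEx syntax; this toy Python sketch (hypothetical property names and checks, not actual Wikidata properties or a ShEx engine) just shows the same item passing one community's shape and failing another's:

```python
# Toy illustration of applying different "shapes" to one item.
# The item and the two checks are hypothetical stand-ins for ShEx schemas.

glucose = {
    "instance of": "chemical compound",
    "chemical formula": "C6H12O6",
    # no stereochemistry recorded on this toy item
}

def biologist_shape(item):
    """A biologist's view: type and formula are enough."""
    return {"instance of", "chemical formula"} <= item.keys()

def chemist_shape(item):
    """A chemist's view: stereochemistry must be specified too."""
    return biologist_shape(item) and "stereochemistry" in item

print(biologist_shape(glucose))  # → True
print(chemist_shape(glucose))    # → False
```

Neither result means the data is "wrong"; each shape encodes one community's expectation, which is exactly the point made about chemists and biologists disagreeing over glucose.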
897
00:52:34,292 --> 00:52:35,535
(Lydia) OK. Over there.
898
00:52:37,949 --> 00:52:39,168
(laughs)
899
00:52:40,597 --> 00:52:46,035
(woman2) I had several ideas
from some points in the discussions,
900
00:52:46,035 --> 00:52:50,902
so I will try not to lose...
I had three ideas so...
901
00:52:52,394 --> 00:52:55,201
Based on what James said a while ago,
902
00:52:55,202 --> 00:52:59,001
we have a very, very big problem
on Wikidata since the beginning
903
00:52:59,002 --> 00:53:01,574
with the upper ontology.
904
00:53:02,363 --> 00:53:05,339
We talked about that
two years ago at WikidataCon,
905
00:53:05,340 --> 00:53:07,432
and we talked about that at Wikimania.
906
00:53:07,432 --> 00:53:09,818
Well, every time we have a Wikidata meeting,
907
00:53:09,818 --> 00:53:11,656
we are talking about that,
908
00:53:11,656 --> 00:53:15,782
because it's a very big problem
at a very, very high level:
909
00:53:15,783 --> 00:53:23,118
what an entity is, what a work is,
what a genre is, what art is,
910
00:53:23,118 --> 00:53:25,461
these are really the biggest concepts.
911
00:53:26,195 --> 00:53:33,117
And that's actually
a very weak point of the global ontology
912
00:53:33,118 --> 00:53:37,453
because people try to clean up regularly
913
00:53:38,017 --> 00:53:41,047
and break everything down the line,
914
00:53:42,516 --> 00:53:48,649
because yes, I think some of you
may remember the guy who in good faith
915
00:53:48,649 --> 00:53:51,785
broke absolutely all cities in the world.
916
00:53:51,785 --> 00:53:57,537
They were not geographical items anymore,
so constraint violations everywhere.
917
00:53:58,720 --> 00:54:00,278
And it was in good faith
918
00:54:00,278 --> 00:54:03,623
because he was really
correcting a mistake in an item,
919
00:54:04,170 --> 00:54:05,732
but everything broke down.
920
00:54:06,349 --> 00:54:09,373
And I'm not sure how we can solve that
921
00:54:10,216 --> 00:54:15,709
because there is actually
no external institution we could just copy
922
00:54:15,710 --> 00:54:18,490
because everyone is working on...
923
00:54:19,154 --> 00:54:22,041
Well, if I am a performing arts database,
924
00:54:22,042 --> 00:54:24,601
I will just go
to the performing arts level,
925
00:54:24,601 --> 00:54:29,361
or I won't go to the philosophical concept
of what an entity is,
926
00:54:29,362 --> 00:54:31,201
and that's actually...
927
00:54:31,202 --> 00:54:34,561
I don't know any database
which is working at this level,
928
00:54:34,562 --> 00:54:36,827
but that's the weakest point of Wikidata.
929
00:54:37,936 --> 00:54:40,812
And probably,
when we are talking about data quality,
930
00:54:40,812 --> 00:54:44,034
that's actually a big part of it, so...
931
00:54:44,034 --> 00:54:48,569
And I think it's the same
we have stated in...
932
00:54:48,569 --> 00:54:50,452
Oh, I am sorry, I am changing the subject,
933
00:54:51,401 --> 00:54:55,774
but we have stated
in different sessions about qualities,
934
00:54:55,774 --> 00:54:59,398
which is that some of us
are doing a good modeling job,
935
00:54:59,399 --> 00:55:01,240
are doing ShEx,
are doing things like that.
936
00:55:01,967 --> 00:55:07,655
People don't see it on Wikidata,
they don't see the ShEx,
937
00:55:07,655 --> 00:55:10,392
they don't see the WikiProject
on the discussion page,
938
00:55:10,393 --> 00:55:11,393
and sometimes,
939
00:55:11,394 --> 00:55:14,958
they don't even see
the talk pages of properties,
940
00:55:14,958 --> 00:55:19,628
which is explicitly stating,
a), this property is used for that.
941
00:55:19,628 --> 00:55:23,887
Like last week,
I added constraints to a property.
942
00:55:23,888 --> 00:55:26,324
The constraint was explicitly written
943
00:55:26,325 --> 00:55:28,690
in the discussion
of the creation of the property.
944
00:55:28,690 --> 00:55:34,548
I just created the technical part
of adding the constraint, and someone:
945
00:55:34,548 --> 00:55:37,182
"What! You broke down all my edits!"
946
00:55:37,183 --> 00:55:41,542
And he was using the property
wrongly for the last two years.
947
00:55:41,542 --> 00:55:46,868
And the property was actually very clear,
but there were no warnings and everything,
948
00:55:46,869 --> 00:55:49,922
and so, it's the same as the Pink Pony
we said at Wikimania
949
00:55:49,922 --> 00:55:54,719
to make WikiProject more visible
or to make ShEx more visible, but...
950
00:55:54,719 --> 00:55:56,917
And that's what Cristina said.
951
00:55:56,917 --> 00:56:02,368
We have a visibility problem
of what the existing solutions are.
952
00:56:02,368 --> 00:56:04,242
And at this session,
953
00:56:04,242 --> 00:56:06,862
we are all talking about
how to create more ShEx,
954
00:56:06,863 --> 00:56:10,727
or to facilitate the jobs
of the people who are doing the cleanup.
955
00:56:11,605 --> 00:56:15,835
But we are cleaning up
since the first day of Wikidata,
956
00:56:15,836 --> 00:56:20,921
and globally, we are losing,
and we are losing because, well,
957
00:56:20,922 --> 00:56:22,960
if I know names are complicated
958
00:56:22,961 --> 00:56:26,162
but I am the only one
doing the cleaning up job,
959
00:56:26,662 --> 00:56:29,671
the guy who added
Latin script names
960
00:56:29,672 --> 00:56:31,584
to all Chinese researchers,
961
00:56:32,088 --> 00:56:35,616
it will take me months to clean that
and I can't do it alone,
962
00:56:35,616 --> 00:56:38,777
and he did one massive batch.
963
00:56:38,777 --> 00:56:40,241
So we really need...
964
00:56:40,242 --> 00:56:44,158
we have a visibility problem
more than a tool problem, I think,
965
00:56:44,158 --> 00:56:45,733
because we have many tools.
966
00:56:45,733 --> 00:56:50,255
(Lydia) Right, so unfortunately,
I've been shown a sign, (laughs)
967
00:56:50,256 --> 00:56:52,121
so we need to wrap this up.
968
00:56:52,122 --> 00:56:53,563
Thank you so much for your comments,
969
00:56:53,563 --> 00:56:56,611
I hope you will continue discussing
during the rest of the day,
970
00:56:56,611 --> 00:56:57,840
and thanks for your input.
971
00:56:58,359 --> 00:56:59,944
(applause)