
cdn.media.ccc.de/.../wikidatacon2019-9-eng-Data_quality_panel_hd.mp4

  • 0:06 - 0:09
    Hello everyone, welcome to the Data Quality panel.
  • 0:10 - 0:14
    Data quality matters because
    more and more people out there
  • 0:14 - 0:19
    rely on our data being in good shape,
    so we're going to talk about data quality,
  • 0:20 - 0:26
    and there will be four speakers
    who will give short introductions
  • 0:26 - 0:30
    on topics related to data quality
    and then we will have a Q and A.
  • 0:30 - 0:32
    And the first one is Lucas.
  • 0:34 - 0:35
    Thank you.
  • 0:36 - 0:40
    Hi, I'm Lucas, and I'm going
    to start with an overview
  • 0:40 - 0:44
    of data quality tools
    that we already have on Wikidata
  • 0:44 - 0:46
    and also some things
    that are coming up soon.
  • 0:47 - 0:51
    And I've grouped them
    into some general themes
  • 0:51 - 0:54
    of making errors more visible,
    making problems actionable,
  • 0:54 - 0:56
    getting more eyes on the data
    so that people notice the problems,
  • 0:57 - 1:03
    fixing some common sources of errors,
    maintaining the quality of the existing data
  • 1:03 - 1:04
    and also human curation.
  • 1:05 - 1:10
    And the ones that are currently available
    start with property constraints.
  • 1:10 - 1:12
    So you've probably seen this
    if you're on Wikidata.
  • 1:12 - 1:14
    You can sometimes get these icons
  • 1:15 - 1:17
    which check
    the internal consistency of the data.
  • 1:17 - 1:21
    For example,
    if one event follows the other,
  • 1:21 - 1:24
    then the other event should
    also be followed by this one,
  • 1:24 - 1:27
    which on the WikidataCon item
    was apparently missing.
  • 1:27 - 1:29
    I'm not sure,
    this feature is a few days old.
  • 1:30 - 1:35
    And there's also,
    if this is too limited or simple for you,
  • 1:35 - 1:38
    you can write any checks you want
    using the Query Service
  • 1:38 - 1:40
    which is useful for
    lots of things of course,
  • 1:40 - 1:45
    but you can also use it
    for finding errors.
  • 1:45 - 1:47
    Like if you've noticed
    one occurrence of a mistake,
  • 1:47 - 1:50
    then you can check
    if there are other places
  • 1:50 - 1:52
    where people have made
    a very similar error
  • 1:52 - 1:53
    and find that with the Query Service.
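
A minimal sketch of the kind of check described here: a SPARQL query, run from Python against the public query service, that looks for one concrete error pattern (people whose recorded date of death precedes their date of birth). The endpoint and the properties P31, P569 and P570 are standard; the particular pattern is only an illustration.

```python
# Illustrative sketch: use the Wikidata Query Service to find other
# occurrences of an error pattern you spotted once, here: humans whose
# recorded date of death is earlier than their date of birth.
import requests

QUERY = """
SELECT ?person ?personLabel ?birth ?death WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P569 ?birth ;      # date of birth
          wdt:P570 ?death .      # date of death
  FILTER(?death < ?birth)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "quality-check-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["person"]["value"], row.get("personLabel", {}).get("value", ""))
```
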
  • 1:53 - 1:55
    You can also combine the two
  • 1:55 - 1:58
    and search for constraint violations
    in the Query Service,
  • 1:58 - 2:01
    for example,
    only the violations in some area
  • 2:01 - 2:04
    or WikiProject that's relevant to you,
  • 2:04 - 2:07
    although the results are currently
    not complete, sadly.
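
A sketch of combining the two. The predicate used for constraint check results in the query service (wikibase:hasViolationForConstraint) and its coverage are assumptions to verify against the current WDQS documentation; as noted above, the loaded results are not complete.

```python
# Sketch only: querying property-constraint check results via WDQS,
# narrowed to statements of one property of interest as a stand-in for
# "an area or WikiProject relevant to you". The predicate name is an
# assumption to check against the WDQS documentation.
CONSTRAINT_QUERY = """
SELECT ?item ?statement ?constraint WHERE {
  ?item p:P569 ?statement .                                 # statements of one property
  ?statement wikibase:hasViolationForConstraint ?constraint .
}
LIMIT 50
"""
# Run with the same requests call as in the previous example.
```
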
  • 2:08 - 2:10
    There is revision scoring.
  • 2:11 - 2:13
    That's... I think this is
    from the recent changes
  • 2:13 - 2:16
    you can also get it on your watchlist:
    an automatic assessment
  • 2:16 - 2:20
    of whether this edit is likely to be
    in good faith or in bad faith
  • 2:20 - 2:22
    and whether it is likely to be
    damaging or not damaging,
  • 2:22 - 2:24
    I think those are the two dimensions.
  • 2:24 - 2:26
    So you can, if you want,
  • 2:26 - 2:30
    focus on just looking through
    the damaging but good faith edits.
  • 2:30 - 2:33
    If you're feeling particularly
    friendly and welcoming
  • 2:33 - 2:37
    you can tell these editors,
    "Thank you for your contribution,
  • 2:37 - 2:41
    here's how you should have done it
    but thank you, still."
  • 2:41 - 2:42
    And if you're not feeling that way,
  • 2:42 - 2:44
    you can go through
    the bad faith, damaging edits,
  • 2:44 - 2:46
    and revert the vandals.
  • 2:48 - 2:50
    There's also, similar to that,
    entity scoring.
  • 2:50 - 2:53
    So instead of scoring an edit,
    the change that it made,
  • 2:53 - 2:54
    you score the whole revision,
  • 2:54 - 2:56
    and I think that is
    the same quality measure
  • 2:56 - 3:00
    that Lydia mentioned
    at the beginning of the conference.
  • 3:00 - 3:05
    There's a user script that shows it up here
    and gives you a score of, like, one to five,
  • 3:05 - 3:08
    I think it was, of what the quality
    of the current item is.
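
Both the edit scores and the item quality score come from the ORES service; below is a minimal sketch of fetching them over its public API. The model names (damaging, goodfaith, itemquality) and the revision ID are assumptions for illustration and should be checked against the ORES documentation.

```python
# Illustrative sketch: fetch edit scores (damaging / goodfaith) and an
# item-quality score from the ORES API for one Wikidata revision.
# The revision ID below is made up.
import requests

rev_id = 123456789  # hypothetical revision ID
resp = requests.get(
    f"https://ores.wikimedia.org/v3/scores/wikidatawiki/{rev_id}",
    params={"models": "damaging|goodfaith|itemquality"},
    headers={"User-Agent": "quality-panel-example/0.1"},
)
scores = resp.json()["wikidatawiki"]["scores"][str(rev_id)]
for model, result in scores.items():
    print(model, result.get("score", {}).get("prediction"))
```
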
  • 3:10 - 3:16
    The primary sources tool is for
    any database that you want to import,
  • 3:16 - 3:18
    but that's not high enough quality
    to directly add to Wikidata,
  • 3:18 - 3:20
    so you add it
    to the primary sources tool instead,
  • 3:20 - 3:23
    and then humans can decide
  • 3:23 - 3:26
    should they add
    these individual statements or not.
  • 3:29 - 3:32
    Showing coordinates as maps
    is mainly a convenience feature
  • 3:32 - 3:34
    but it's also useful for quality control.
  • 3:34 - 3:37
    Like if you see this is supposed to be
    the office of Wikimedia Germany
  • 3:37 - 3:39
    and if the coordinates
    are somewhere in the Indian Ocean,
  • 3:39 - 3:42
    then you know that
    something is not right there
  • 3:42 - 3:45
    and you can see it much more easily
    than if you just had the numbers.
  • 3:46 - 3:50
    This is a gadget called
    the relative completeness indicator
  • 3:50 - 3:52
    which shows you this little icon here
  • 3:53 - 3:56
    telling you how complete
    it thinks this item is
  • 3:56 - 3:58
    and also which properties
    are most likely missing,
  • 3:58 - 4:00
    which is really useful
    if you're editing an item
  • 4:00 - 4:03
    and you're in an area
    that you're not very familiar with
  • 4:03 - 4:06
    and you don't know what
    the right properties to use are,
  • 4:06 - 4:08
    then this is a very useful gadget to have.
  • 4:10 - 4:11
    And we have Shape Expressions.
  • 4:11 - 4:16
    I think Andra or Jose
    are going to talk more about those
  • 4:16 - 4:20
    but basically, a very powerful way
    of comparing the data you have
  • 4:20 - 4:21
    against the schema,
  • 4:21 - 4:23
    like what statement should
    certain entities have,
  • 4:23 - 4:26
    what other entities should they link to
    and what should those look like,
  • 4:26 - 4:29
    and then you can find problems that way.
  • 4:30 - 4:32
    I think... No there is still more.
  • 4:32 - 4:34
    Integraality or property dashboard.
  • 4:34 - 4:37
    It gives you a quick overview
    of the data you already have.
  • 4:37 - 4:39
    For example, this is from
    the WikiProject Red Pandas,
  • 4:40 - 4:42
    and you can see that
    we have a sex or gender
  • 4:42 - 4:44
    for almost all of the red pandas,
  • 4:44 - 4:47
    the date of birth varies a lot
    by which zoo they come from
  • 4:47 - 4:50
    and we have almost
    no dead pandas which is wonderful,
  • 4:51 - 4:53
    because they're so cute.
  • 4:54 - 4:56
    So this is also useful.
  • 4:56 - 4:59
    There we go, OK,
    now for the things that are coming up.
  • 5:00 - 5:04
    Wikidata Bridge, also formerly
    known as client editing,
  • 5:04 - 5:07
    so editing Wikidata
    from Wikipedia infoboxes
  • 5:08 - 5:12
    which will on the one hand
    get more eyes on the data
  • 5:12 - 5:13
    because more people can see the data there
  • 5:13 - 5:19
    and it will hopefully encourage
    more use of Wikidata in the Wikipedias
  • 5:19 - 5:21
    and that means that more
    people can notice
  • 5:21 - 5:23
    if, for example some data is outdated
    and needs to be updated
  • 5:24 - 5:27
    than if they would
    only see it on Wikidata itself.
  • 5:29 - 5:31
    There is also tainted references.
  • 5:31 - 5:34
    The idea here is that
    if you edit a statement value,
  • 5:35 - 5:37
    you might want to update
    the references as well,
  • 5:37 - 5:39
    unless it was just a typo or something.
  • 5:40 - 5:44
    And this tainted references feature
    tells editors that,
  • 5:44 - 5:50
    and it also lets other editors
    see which other edits were made
  • 5:50 - 5:52
    that changed a statement value
    and didn't update a reference,
  • 5:52 - 5:57
    so you can clean up after that
    and decide should that be...
  • 5:58 - 6:00
    Do you need to do anything more with that
  • 6:00 - 6:03
    or is that actually fine and
    you don't need to update the reference.
  • 6:04 - 6:09
    That's related to signed statements
    which is coming from a concern, I think,
  • 6:09 - 6:12
    that some data providers have that like...
  • 6:14 - 6:17
    There's a statement that's referenced
    through the UNESCO or something
  • 6:17 - 6:20
    and then suddenly,
    someone vandalizes the statement
  • 6:20 - 6:22
    and they are worried
    that it will look like
  • 6:23 - 6:27
    this organization, like UNESCO,
    still set this vandalism value
  • 6:27 - 6:29
    and so, with signed statements,
  • 6:29 - 6:31
    they can cryptographically
    sign this reference
  • 6:31 - 6:34
    and that doesn't prevent any edits to it,
  • 6:34 - 6:38
    but at least, if someone
    vandalizes the statement
  • 6:38 - 6:40
    or edits it in any way,
    then the signature is no longer valid,
  • 6:40 - 6:43
    and you can tell this is not exactly
    what the organization said,
  • 6:43 - 6:47
    and perhaps it's a good edit
    and they should re-sign the new statement,
  • 6:47 - 6:50
    but also perhaps it should be reverted.
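
Signed statements were still being designed at this point, so the following is only a conceptual sketch of the mechanism described: a data provider signs a canonical serialization of a statement with a private key, and any later edit makes the signature verification fail. It uses Ed25519 from the Python cryptography package; none of the names or formats here come from the actual Wikibase work.

```python
# Conceptual sketch only (not the Wikibase design): sign a canonical
# serialization of a statement; any edit invalidates the signature.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

statement = {"item": "Q42", "property": "P569", "value": "+1952-03-11"}
canonical = json.dumps(statement, sort_keys=True).encode()

private_key = Ed25519PrivateKey.generate()   # held by the data provider
public_key = private_key.public_key()        # published for verification
signature = private_key.sign(canonical)

# Later, anyone can check whether the statement still matches what was signed.
statement["value"] = "+1999-01-01"           # a (vandal) edit
edited = json.dumps(statement, sort_keys=True).encode()
try:
    public_key.verify(signature, edited)
    print("signature still valid")
except InvalidSignature:
    print("statement changed since it was signed")
```
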
  • 6:51 - 6:54
    And also, this is going
    to be very exciting, I think,
  • 6:54 - 6:57
    Citoid is this amazing system
    they have on Wikipedia
  • 6:57 - 7:01
    where you can paste a URL,
    or an identifier, or an ISBN
  • 7:01 - 7:05
    or Wikidata ID or basically
    anything into the Visual Editor,
  • 7:05 - 7:08
    and it spits out a reference
    that is nicely formatted
  • 7:08 - 7:11
    and has all the data you want
    and it's wonderful to use.
  • 7:11 - 7:14
    And by comparison, on Wikidata,
    if I want to add a reference
  • 7:14 - 7:19
    I typically have to add a reference URL,
    title, author name string,
  • 7:19 - 7:20
    published in, publication date,
  • 7:20 - 7:25
    retrieved date,
    at least those, and that's annoying,
  • 7:25 - 7:29
    and integrating Citoid into Wikibase
    will hopefully help with that.
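
Citoid is already exposed as a REST endpoint on Wikimedia wikis; a minimal sketch of asking it for citation metadata for a URL follows. The exact path, the format name and the response fields are assumptions to check against the Citoid documentation, and the URL is just an example.

```python
# Sketch: ask the Citoid REST endpoint for citation metadata for a URL.
import requests
from urllib.parse import quote

url_to_cite = "https://www.example.org/some-article"  # placeholder URL
endpoint = (
    "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"
    + quote(url_to_cite, safe="")
)
resp = requests.get(endpoint, headers={"User-Agent": "citoid-example/0.1"})
for citation in resp.json():
    # Field names assumed from Citoid's citation objects.
    print(citation.get("title"), citation.get("date"), citation.get("url"))
```
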
  • 7:30 - 7:34
    And I think
    that's all the ones I had, yeah.
  • 7:34 - 7:36
    So now, I'm going to pass to Cristina.
  • 7:38 - 7:42
    (applause)
  • 7:44 - 7:45
    Hi, I'm Cristina.
  • 7:45 - 7:48
    I'm a research scientist
    from the University of Zürich,
  • 7:48 - 7:51
    and I'm also an active member
    of the Swiss Community.
  • 7:53 - 7:58
    When Claudia Müller-Birn
    and I submitted this to the WikidataCon,
  • 7:58 - 8:00
    what we wanted to do
    is continue our discussion
  • 8:00 - 8:02
    that we started
    in the beginning of the year
  • 8:02 - 8:07
    with a workshop on data quality
    and also some sessions in Wikimania.
  • 8:07 - 8:11
    So the goal of this talk
    is basically to bring some thoughts
  • 8:11 - 8:14
    that we have been collecting
    from the community and ourselves
  • 8:14 - 8:17
    and continue the discussion.
  • 8:17 - 8:20
    So what we would like is to continue
    interacting a lot with you.
  • 8:22 - 8:23
    So what we think is very important
  • 8:23 - 8:28
    is that we continuously ask
    all types of users in the community
  • 8:28 - 8:32
    about what they really need,
    what problems they have with data quality,
  • 8:32 - 8:35
    not only editors
    but also the people who are coding,
  • 8:35 - 8:36
    or consuming the data,
  • 8:36 - 8:39
    and also researchers who are
    actually using all the edit history
  • 8:39 - 8:41
    to analyze what is happening.
  • 8:42 - 8:48
    So we did a review of around 80 tools
    that exist for Wikidata
  • 8:48 - 8:52
    and we aligned them to the different
    data quality dimensions.
  • 8:52 - 8:54
    And what we saw was that actually,
  • 8:54 - 8:58
    many of them were looking at
    monitoring completeness,
  • 8:58 - 9:03
    and also some of them
    are enabling interlinking.
  • 9:03 - 9:08
    But there is a big need for tools
    that are looking into diversity,
  • 9:08 - 9:13
    which is one of the things
    that we actually can have in Wikidata,
  • 9:13 - 9:16
    especially
    this design principle of Wikidata
  • 9:16 - 9:18
    where we can have plurality
  • 9:18 - 9:20
    and different statements
    with different values
  • 9:21 - 9:22
    coming from different sources.
  • 9:22 - 9:25
    Because it's a secondary source,
    we don't really have tools
  • 9:25 - 9:28
    that actually tell us how many
    plural statements there are,
  • 9:28 - 9:31
    and how many we can improve and how,
  • 9:31 - 9:33
    and we also don't really know
  • 9:33 - 9:36
    what are all the reasons
    for plurality that we can have.
  • 9:36 - 9:39
    So from these community meetings,
  • 9:39 - 9:43
    what we discussed was the challenges
    that still need attention.
  • 9:43 - 9:47
    For example, that having
    all these crowdsourcing communities
  • 9:47 - 9:50
    is very good because different people
    attack different parts
  • 9:50 - 9:52
    of the data or the graph,
  • 9:52 - 9:55
    and we also have
    different background knowledge
  • 9:55 - 9:59
    but actually, it's very difficult to align
    everything in something homogeneous
  • 9:59 - 10:05
    because different people are using
    different properties in different ways
  • 10:05 - 10:08
    and they are also expecting
    different things from entity descriptions.
  • 10:09 - 10:13
    People also said that
    they also need more tools
  • 10:13 - 10:16
    that give a better overview
    of the global status of things.
  • 10:16 - 10:21
    So what entities are missing
    in terms of completeness,
  • 10:21 - 10:26
    but also like what are people
    working on right now most of the time,
  • 10:26 - 10:31
    and they also mentioned many times
    a tighter collaboration
  • 10:31 - 10:33
    across not only languages
    but the WikiProjects
  • 10:33 - 10:36
    and the different Wikimedia platforms.
  • 10:36 - 10:39
    And we published
    all the transcribed comments
  • 10:39 - 10:43
    from all these discussions
    in those links here in the Etherpads
  • 10:43 - 10:46
    and also in the wiki page of Wikimania.
  • 10:46 - 10:48
    Some solutions that appeared actually
  • 10:48 - 10:53
    were going in the direction
    of sharing more of the best practices
  • 10:53 - 10:56
    that are being developed
    in different WikiProjects,
  • 10:56 - 11:01
    but also people want tools
    that help organize work in teams
  • 11:01 - 11:04
    or at least understanding
    who is working on that,
  • 11:04 - 11:08
    and they were also mentioning
    that they want more showcases
  • 11:08 - 11:12
    and more templates that help them
    create things in a better way.
  • 11:13 - 11:15
    And from the contact that we have
  • 11:15 - 11:19
    with Open Governmental Data Organizations,
  • 11:19 - 11:20
    and in particular,
  • 11:20 - 11:23
    I am in contact with the canton
    and the city of Zürich,
  • 11:23 - 11:26
    they are very interested
    in working with Wikidata
  • 11:26 - 11:30
    because they want their data
    to be accessible for everyone
  • 11:30 - 11:34
    in the place where people go
    and consult or access data.
  • 11:34 - 11:37
    So for them, something that
    would be really interesting
  • 11:37 - 11:39
    is to have some kind of quality indicators
  • 11:39 - 11:41
    both in the wiki,
    which is already happening,
  • 11:41 - 11:43
    but also in SPARQL results,
  • 11:43 - 11:46
    to know whether or not they can trust
    that data from the community.
  • 11:46 - 11:48
    And then, they also want to know
  • 11:48 - 11:51
    what parts of their own data sets
    are useful for Wikidata
  • 11:51 - 11:56
    and they would love to have a tool that
    can help them assess that automatically.
  • 11:56 - 11:59
    They also need
    some kind of methodology or tool
  • 11:59 - 12:04
    that helps them decide whether
    they should import or link their data
  • 12:04 - 12:05
    because in some cases,
  • 12:05 - 12:07
    they also have their own
    linked open data sets,
  • 12:07 - 12:10
    so they don't know whether
    to just ingest the data
  • 12:10 - 12:13
    or to keep on creating links
    from the data sets to Wikidata
  • 12:13 - 12:14
    and the other way around.
  • 12:15 - 12:20
    And they also want to know where
    their websites are referenced in Wikidata.
  • 12:20 - 12:23
    And when they run such a query
    in the query service,
  • 12:23 - 12:25
    they often get timeouts,
  • 12:25 - 12:28
    so maybe we should
    really create more tools
  • 12:28 - 12:32
    that help them get answers
    to these questions.
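
A sketch of the kind of query meant here: finding statements whose references cite a given organization's website, via the reference URL property (P854). An unrestricted string scan over all references is exactly what tends to time out on the public endpoint, hence the LIMIT; the domain is a placeholder.

```python
# Sketch: find Wikidata statements whose references cite a given website.
REFERRER_QUERY = """
SELECT ?statement ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P854 ?refURL .                   # reference URL
  FILTER(STRSTARTS(STR(?refURL), "https://www.example.org/"))
}
LIMIT 100
"""
# Run against https://query.wikidata.org/sparql as in the earlier example;
# without the LIMIT (or further narrowing) queries like this often time out.
```
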
  • 12:33 - 12:36
    And, besides that,
  • 12:36 - 12:39
    we wiki researchers also sometimes
  • 12:39 - 12:42
    lack some information
    in the edit summaries.
  • 12:42 - 12:45
    So I remember that when
    we were doing some work
  • 12:45 - 12:49
    to understand
    the different behavior of editors
  • 12:49 - 12:53
    with tools or bots
    or anonymous users and so on,
  • 12:53 - 12:56
    we were really lacking, for example,
  • 12:56 - 13:01
    a standard way of tracing
    that tools were being used.
  • 13:01 - 13:03
    And there are some tools
    that are already doing that
  • 13:03 - 13:05
    like PetScan and many others,
  • 13:05 - 13:08
    but maybe we should in the community
  • 13:08 - 13:14
    discuss more about how to record these
    for fine-grained provenance.
  • 13:14 - 13:15
    And further on,
  • 13:15 - 13:21
    we think that we need to think
    of more concrete data quality dimensions
  • 13:21 - 13:25
    that are related to linked data
    but not to all types of data,
  • 13:25 - 13:31
    so we worked on some measures
    to actually assess the information gain
  • 13:31 - 13:34
    enabled by the links,
    and what we mean by that
  • 13:34 - 13:37
    is that when we link
    Wikidata to other data sets,
  • 13:37 - 13:38
    we should also be thinking
  • 13:38 - 13:42
    how much the entities are actually
    gaining in the classification,
  • 13:42 - 13:46
    also in the description
    but also in the vocabularies they use.
  • 13:46 - 13:51
    So just to give a very simple
    example of what I mean with this
  • 13:51 - 13:54
    is we can think of--
    in this case, would be Wikidata
  • 13:54 - 13:58
    or the external data set
    that is linking to Wikidata,
  • 13:58 - 14:00
    we have the entity for a person
    that is called Natasha Noy,
  • 14:00 - 14:03
    we have the affiliation and other things,
  • 14:03 - 14:05
    and then we say OK,
    we link to an external place,
  • 14:05 - 14:09
    and that entity also has that name,
    but we actually have the same value.
  • 14:09 - 14:13
    So what would be better is that we link
    to something that has a different name,
  • 14:13 - 14:17
    that is still valid because this person
    has two ways of writing the name,
  • 14:17 - 14:20
    and also other information
    that we don't have in Wikidata
  • 14:20 - 14:22
    or that we don't have
    in the other data set.
  • 14:22 - 14:25
    But also, what is even better
  • 14:25 - 14:28
    is when we actually
    see in the target data set
  • 14:28 - 14:31
    that they also have new ways
    of classifying the information.
  • 14:31 - 14:35
    So not only is this a person,
    but in the other data set,
  • 14:35 - 14:40
    they also say it's a female
    or anything else that they classify with.
  • 14:40 - 14:43
    And if in the other data set,
    they are using many other vocabularies
  • 14:43 - 14:47
    that is also helping in their whole
    information retrieval thing.
  • 14:47 - 14:51
    So with that, I also would like to say
  • 14:51 - 14:56
    that we think that we can
    showcase federated queries better
  • 14:56 - 15:00
    because when we look at the query log
    provided by Malyshev et al.,
  • 15:01 - 15:04
    we see actually that
    from the organic queries,
  • 15:04 - 15:07
    we have only very few federated queries.
  • 15:07 - 15:13
    And actually, federation is one
    of the key advantages of having linked data,
  • 15:13 - 15:17
    so maybe the community
    or the people using Wikidata
  • 15:17 - 15:19
    also need more examples on this.
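
A minimal sketch of a federated query of the kind meant here, joining Wikidata proteins to the UniProt SPARQL endpoint through the UniProt protein ID (P352). It assumes that endpoint is on the query service's federation allow-list, and the UniProt predicate shown (up:mnemonic) is illustrative.

```python
# Sketch: federated query joining Wikidata to UniProt via P352.
FEDERATED_QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?item ?uniprotID ?mnemonic WHERE {
  ?item wdt:P352 ?uniprotID .                        # UniProt protein ID
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotID)) AS ?protein)
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein up:mnemonic ?mnemonic .                 # extra data only UniProt has
  }
}
LIMIT 10
"""
# Run against https://query.wikidata.org/sparql as in the earlier examples.
```
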
  • 15:19 - 15:23
    And if we look at the list
    of endpoints that are being used,
  • 15:23 - 15:25
    this is not a complete list
    and we have many more.
  • 15:25 - 15:30
    Of course, this data was analyzed
    from queries until March 2018,
  • 15:30 - 15:35
    but we should look into the list
    of federated endpoints that we have
  • 15:35 - 15:37
    and see whether
    we are really using them or not.
  • 15:38 - 15:40
    So two questions that
    I have for the audience
  • 15:40 - 15:43
    that maybe we can use
    afterwards for the discussion are:
  • 15:43 - 15:46
    what data quality problems
    should be addressed in your opinion,
  • 15:46 - 15:47
    because of the needs that you have,
  • 15:47 - 15:50
    but also, where do you need
    more automation
  • 15:50 - 15:53
    to help you with editing or patrolling.
  • 15:54 - 15:55
    That's all, thank you very much.
  • 15:56 - 15:58
    (applause)
  • 16:06 - 16:09
    (Jose Emilio Labra) OK,
    so what I'm going to talk about
  • 16:09 - 16:15
    is some tools that we were developing
    related to Shape Expressions.
  • 16:16 - 16:19
    So this is what I want to talk...
    I am Jose Emilio Labra,
  • 16:19 - 16:23
    but this has... all these tools
    have been done by different people,
  • 16:24 - 16:28
    mainly related to the W3C ShEx,
    Shape Expressions Community Group.
  • 16:28 - 16:29
    ShEx Community Group.
  • 16:30 - 16:36
    So the first tool that I want to mention
    is RDFShape, this is a general tool,
  • 16:36 - 16:41
    because Shape Expressions
    is not only for Wikidata,
  • 16:41 - 16:44
    Shape Expressions is a language
    to validate RDF in general.
  • 16:44 - 16:48
    So this tool was developed mainly by me
  • 16:48 - 16:51
    and it's a tool
    to validate RDF in general.
  • 16:51 - 16:55
    So if you want to learn about RDF
    or you want to validate RDF
  • 16:55 - 16:59
    or SPARQL endpoints not only in Wikidata,
  • 16:59 - 17:01
    my advice is that you can use this tool.
  • 17:01 - 17:03
    Also for teaching.
  • 17:03 - 17:06
    I am a teacher in the university
  • 17:06 - 17:09
    and I use it in my semantic web course
    to teach RDF.
  • 17:09 - 17:12
    So if you want to learn RDF,
    I think it's a good tool.
  • 17:13 - 17:18
    For example, this is just a visualization
    of an RDF graph with the tool.
  • 17:19 - 17:23
    But before coming here, in the last month,
  • 17:23 - 17:28
    I started a fork of rdfshape specifically
    for Wikidata, because I thought...
  • 17:28 - 17:33
    It's called WikiShape, and yesterday,
    I presented it as a present for Wikidata.
  • 17:33 - 17:34
    So what I took is...
  • 17:34 - 17:40
    What I did is to remove all the stuff
    that was not related with Wikidata
  • 17:40 - 17:45
    and to put several things, hard-coded,
    for example, the Wikidata SPARQL endpoint,
  • 17:45 - 17:49
    but now, someone asked me
    if I could do it also for Wikibase.
  • 17:49 - 17:52
    And it is very easy
    to do it for Wikibase also.
  • 17:53 - 17:56
    So this tool, WikiShape, is quite new.
  • 17:57 - 18:00
    I think it works, most of the features,
  • 18:00 - 18:02
    but there are some features
    that maybe don't work,
  • 18:02 - 18:06
    and if you try it and you want
    to improve it, please tell me.
  • 18:06 - 18:13
    So this is [inaudible] captures,
    but I think I can even try so let's try.
  • 18:15 - 18:17
    So let's see if it works.
  • 18:17 - 18:20
    First, I have to go out of the...
  • 18:22 - 18:23
    Here.
  • 18:24 - 18:28
    Alright, yeah. So this is the tool here.
  • 18:28 - 18:30
    Things that you can do with the tool,
  • 18:30 - 18:35
    for example, is that you can
    check schemas, entity schemas.
  • 18:35 - 18:39
    You know that there is
    a new namespace which is "E whatever,"
  • 18:39 - 18:45
    so here, if you start, for example,
    writing "human"...
  • 18:45 - 18:49
    As you are writing,
    its autocomplete allows you to check,
  • 18:49 - 18:52
    for example,
    this is the Shape Expressions of a human,
  • 18:53 - 18:56
    and this is the Shape Expressions here.
  • 18:56 - 19:00
    And as you can see,
    this editor has syntax highlighting,
  • 19:00 - 19:05
    this is... well,
    maybe it's very small, the screen.
  • 19:06 - 19:08
    I can try to do it bigger.
  • 19:09 - 19:11
    Maybe you see it better now.
  • 19:11 - 19:14
    So... and this is the editor
    with syntax highlighting and also has...
  • 19:14 - 19:18
    I mean, this editor
    comes from the same source code
  • 19:18 - 19:20
    as the Wikidata query service.
  • 19:20 - 19:24
    So for example,
    if you hover with the mouse here,
  • 19:24 - 19:28
    it shows you the labels
    of the different properties.
  • 19:28 - 19:31
    So I think it's very helpful because now,
  • 19:33 - 19:39
    the entity schemas that is
    in the Wikidata is just a plain text idea,
  • 19:39 - 19:42
    and I think this editor is much better
    because it has autocomplete
  • 19:42 - 19:44
    and it also has...
  • 19:44 - 19:48
    I mean, if you, for example,
    wanted to add a constraint,
  • 19:48 - 19:52
    you say "wdt:"
  • 19:52 - 19:57
    You start writing "author"
    and then you click Ctrl+Space
  • 19:57 - 19:59
    and it suggests the different things.
  • 19:59 - 20:02
    So this is similar
    to the Wikidata query service
  • 20:02 - 20:06
    but specifically for Shape Expressions
  • 20:06 - 20:12
    because my feeling is that
    creating Shape Expressions
  • 20:12 - 20:16
    is not more difficult
    than writing SPARQL queries.
  • 20:16 - 20:21
    So some people think
    that it's at the same level,
  • 20:22 - 20:26
    It's probably easier, I think,
    because Shape Expressions was,
  • 20:26 - 20:31
    when we designed it,
    we were doing it to be easy to work with.
  • 20:31 - 20:35
    OK, so this is one of the first things,
    that you have this editor
  • 20:35 - 20:37
    for Shape Expressions.
  • 20:37 - 20:41
    And then you also have the possibility,
    for example, to visualize.
  • 20:41 - 20:45
    If you have a Shape Expression,
    use for example...
  • 20:45 - 20:49
    I think, "written work" is
    a nice Shape Expression
  • 20:49 - 20:53
    because it has some relationships
    between different things.
  • 20:55 - 20:58
    And this is the UML visualization
    of written work.
  • 20:58 - 21:02
    In a UML, this is easy to see
    the different properties.
  • 21:03 - 21:07
    When you do this, I realized
    when I tried with several people,
  • 21:07 - 21:09
    they find some mistakes
    in their Shape Expressions
  • 21:09 - 21:13
    because it's easy to detect which are
    the missing properties or whatever.
  • 21:14 - 21:16
    Then another possibility here
  • 21:16 - 21:20
    is that you can also validate,
    I think I have it here, the validation.
  • 21:20 - 21:25
    I think I had it in some tab,
    maybe I closed it.
  • 21:26 - 21:31
    OK, but you can, for example,
    you can click here, Validate entities.
  • 21:32 - 21:34
    You put, for example,
  • 21:35 - 21:42
    "Q42" with "E42," which is author.
  • 21:43 - 21:46
    With "human,"
    I think we can do it with "human."
  • 21:49 - 21:50
    And then it's...
  • 21:51 - 21:56
    And it's taking a little while to do it
    because this is doing the SPARQL queries
  • 21:56 - 21:59
    and now, for example,
    it's failing because of the network but...
  • 22:00 - 22:02
    So you can try it.
  • 22:03 - 22:07
    OK, so let's go continue
    with the presentation, with other tools.
  • 22:07 - 22:12
    So my advice is that if you want to try it
    and you have any feedback, let me know.
  • 22:13 - 22:16
    So to continue with the presentation...
  • 22:19 - 22:20
    So this is WikiShape.
  • 22:24 - 22:27
    Then, I already said this,
  • 22:28 - 22:34
    the Shape Expressions Editor
    is an independent project on GitHub.
  • 22:36 - 22:37
    You can use it in your own project.
  • 22:37 - 22:41
    If you want to do
    a Shape Expressions tool,
  • 22:41 - 22:46
    you can just embed it
    in any other project,
  • 22:46 - 22:48
    so this is in GitHub and you can use it.
  • 22:49 - 22:52
    Then the same author,
    who is one of my students,
  • 22:53 - 22:56
    he also created
    an editor for Shape Expressions,
  • 22:56 - 22:58
    also inspired by
    the Wikidata query service
  • 22:58 - 23:01
    where, in a column,
  • 23:01 - 23:05
    you have this more visual editor
    of SPARQL queries
  • 23:05 - 23:07
    where you can put these kinds of things.
  • 23:07 - 23:09
    So this is a screen capture.
  • 23:09 - 23:13
    You can see that
    that's the Shape Expression as text,
  • 23:13 - 23:18
    but this is a form-based Shape Expression,
    which would probably take a bit longer,
  • 23:19 - 23:23
    where you can fill in the different rows
    and the different fields.
  • 23:23 - 23:26
    OK, then there is ShExEr.
  • 23:27 - 23:32
    We have... it's done by one PhD student
    at the University of Oviedo
  • 23:32 - 23:34
    and he's here, so you can present ShExEr.
  • 23:38 - 23:40
    (Danny) Hello, I am Danny Fernández,
  • 23:40 - 23:44
    I am a PhD student at the University of Oviedo
    working with Labra.
  • 23:45 - 23:48
    Since we are running out of time,
    let's make this quick,
  • 23:48 - 23:53
    so let's not go for any actual demo,
    but just show some screenshots.
  • 23:53 - 23:58
    OK, so the usual way to work with
    Shape Expressions or any shape language
  • 23:58 - 24:00
    is that you have a domain expert
  • 24:00 - 24:02
    that defines a priori
    what the graph should look like,
  • 24:02 - 24:04
    defines some structures,
  • 24:04 - 24:07
    and then you use these structures
    to validate the actual data against them.
  • 24:08 - 24:12
    This tool, which, like the ones
    that Labra has been presenting,
  • 24:12 - 24:14
    is a general-purpose tool
    for any RDF source,
  • 24:14 - 24:17
    is designed to work the other way around.
  • 24:17 - 24:19
    You already have some data,
  • 24:19 - 24:23
    you select what nodes
    you want to get the shape about
  • 24:23 - 24:27
    and then you automatically
    extract or infer the shape.
  • 24:27 - 24:30
    So even if this is a general purpose tool,
  • 24:30 - 24:34
    what we did for this WikidataCon
    is this fancy button
  • 24:35 - 24:37
    that if you click it,
    essentially what happens
  • 24:37 - 24:42
    is that there are
    so many configuration params
  • 24:42 - 24:46
    and it configures them to work
    against the Wikidata endpoint
  • 24:46 - 24:48
    and it will end soon, sorry.
  • 24:49 - 24:53
    So, once you press this button
    what you get is essentially this.
  • 24:53 - 24:55
    After having selected what kind of nodes,
  • 24:55 - 24:59
    what kind of instances of a class,
    whatever you are looking for,
  • 24:59 - 25:01
    you get an automatic schema.
  • 25:02 - 25:07
    All the constraints are sorted
    by how many nodes actually conform to them,
  • 25:07 - 25:10
    you can filter the less common ones, etc.
  • 25:10 - 25:12
    So there is a poster downstairs
    about this stuff
  • 25:12 - 25:15
    and well,
    I will be downstairs and upstairs
  • 25:15 - 25:16
    and all over the place all day,
  • 25:16 - 25:19
    so if you have any further
    interest in this tool,
  • 25:19 - 25:21
    just speak to me during this journey.
  • 25:21 - 25:25
    And now, I'll give back
    the mic to Labra, thank you.
  • 25:25 - 25:29
    (applause)
  • 25:30 - 25:33
    (Jose) So let's continue
    with the other tools.
  • 25:33 - 25:35
    The other tool is the ShapeDesigner.
  • 25:35 - 25:37
    Andra, do you want to do
    the ShapeDesigner now
  • 25:37 - 25:39
    or maybe later or in the workshop?
  • 25:39 - 25:41
    There is a workshop...
  • 25:41 - 25:44
    This afternoon, there is a workshop
    specifically for Shape Expressions, and...
  • 25:45 - 25:48
    The idea is that it was going to be
    more hands-on,
  • 25:48 - 25:52
    and if you want to practice
    some ShEx, you can do it there.
  • 25:53 - 25:56
    This tool is ShEx...
    and there is Eric here,
  • 25:56 - 25:57
    so you can present it.
  • 25:58 - 26:01
    (Eric) So just super quick,
    the thing that I want to say
  • 26:01 - 26:06
    is that you've probably
    already seen the ShEx interface
  • 26:06 - 26:08
    that's tailored for Wikidata.
  • 26:08 - 26:13
    That's effectively stripped down
    and tailored specifically for Wikidata.
  • 26:13 - 26:18
    The generic one has more features,
    and I thought I'd mention it
  • 26:18 - 26:20
    because one of those features
    is particularly useful
  • 26:20 - 26:23
    for debugging Wikidata schemas,
  • 26:23 - 26:29
    which is if you go
    and you select the slurp mode,
  • 26:29 - 26:31
    what it does is it says
    while I'm validating,
  • 26:31 - 26:35
    I want to pull all the triples down
    and that means
  • 26:35 - 26:36
    if I get a bunch of failures,
  • 26:36 - 26:40
    I can go through and start looking
    at those failures and saying,
  • 26:40 - 26:42
    OK, what are the triples
    that are in here,
  • 26:42 - 26:44
    sorry, I apologize,
    the triples are down there,
  • 26:44 - 26:46
    this is just a log of what went by.
  • 26:46 - 26:49
    And then you can just sit there
    and fiddle with it in real time
  • 26:49 - 26:51
    like you play with something
    and it changes.
  • 26:51 - 26:54
    So it's a quicker version
    for doing all that stuff.
  • 26:55 - 26:56
    This is a ShExC form,
  • 26:56 - 26:59
    this is something [Joachim] had suggested
  • 27:00 - 27:05
    could be useful for populating
    Wikidata documents
  • 27:05 - 27:07
    based on a Shape Expression
    for that document.
  • 27:08 - 27:12
    This is not tailored for Wikidata,
  • 27:12 - 27:14
    but this is just to say
    that you can have a schema
  • 27:14 - 27:15
    and you can have some annotations
  • 27:15 - 27:18
    to say specifically how I want
    that schema rendered
  • 27:18 - 27:19
    and then it just builds a form,
  • 27:19 - 27:21
    and if you've got data,
    it can even populate the form.
  • 27:25 - 27:26
    PyShEx [inaudible].
  • 27:28 - 27:31
    (Jose) I think this is the last one.
  • 27:32 - 27:34
    Yes, so the last one is PyShEx.
  • 27:35 - 27:38
    PyShEx is a Python implementation
    of Shape Expressions,
  • 27:39 - 27:43
    you can also play with Jupyter Notebooks
    if you want those kinds of things.
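
A minimal sketch of validating one Wikidata entity against a small ShEx shape with PyShEx, in the spirit of the Q42-against-an-entity-schema demo earlier. The toy shape is hypothetical, and the ShExEvaluator usage and result fields should be checked against the PyShEx documentation.

```python
# Sketch: validate one Wikidata item against a toy ShEx shape with PyShEx.
import requests
from pyshex import ShExEvaluator

SHEX = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
start = @<human>
<human> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q5 ] ;        # instance of: human
  wdt:P569 . ? ;             # optional date of birth
  wdt:P21 . *                # any number of sex-or-gender values
}
"""

# Fetch the item's RDF (Turtle) and validate it against the shape.
ttl = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl").text
results = ShExEvaluator(
    rdf=ttl,
    schema=SHEX,
    focus="http://www.wikidata.org/entity/Q42",
).evaluate()

for r in results:
    print(r.focus, "conforms" if r.result else f"fails: {r.reason}")
```
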
  • 27:43 - 27:44
    OK, so that's all for this.
  • 27:44 - 27:47
    (applause)
  • 27:53 - 27:57
    (Andra) So I'm going to talk about
    a specific project that I'm involved in
  • 27:57 - 27:58
    called Gene Wiki,
  • 27:58 - 28:05
    and where we are also
    dealing with quality issues.
  • 28:05 - 28:07
    But before going into the quality,
  • 28:07 - 28:09
    maybe a quick introduction
    about what Gene Wiki is,
  • 28:10 - 28:15
    and we just released a pre-print
    of a paper that we recently have written
  • 28:15 - 28:18
    that explains the details of the project.
  • 28:20 - 28:24
    I see people taking pictures,
    but basically, what Gene Wiki does,
  • 28:24 - 28:28
    it's trying to get biomedical data,
    public data into Wikidata,
  • 28:28 - 28:32
    and we follow a specific pattern
    to get that data into Wikidata.
  • 28:33 - 28:37
    So when we have a new repository
    or a new data set
  • 28:37 - 28:40
    that is eligible
    to be included into Wikidata,
  • 28:40 - 28:41
    the first step is community engagement.
  • 28:41 - 28:44
    It is not necessarily
    directly with the Wikidata community
  • 28:44 - 28:46
    but with a local research community,
  • 28:46 - 28:50
    and we meet in person
    or online or on any platform
  • 28:50 - 28:53
    and try to come up with a data model
  • 28:53 - 28:56
    that bridges their data
    with the Wikidata model.
  • 28:56 - 29:00
    So here I have a picture of a workshop
    that happened here last year
  • 29:00 - 29:03
    which was trying to look
    at a specific data set
  • 29:03 - 29:05
    and, well, you see a lot of discussions,
  • 29:05 - 29:10
    then aligning it with schema.org
    and other ontologies that are out there.
  • 29:10 - 29:16
    And then, at the end of the first step,
    we have a whiteboard drawing of the schema
  • 29:16 - 29:17
    that we want to implement in Wikidata.
  • 29:17 - 29:20
    What you see over there,
    this is just a plain whiteboard,
  • 29:20 - 29:22
    we have it in the back there
  • 29:22 - 29:25
    so we can make some schemas
    within this panel today even.
  • 29:27 - 29:28
    So once we have the schema in place,
  • 29:28 - 29:31
    the next thing is to try to make
    that schema machine readable
  • 29:32 - 29:37
    because you want to have actionable models
    to bridge the data that you're bringing in
  • 29:37 - 29:40
    from any biomedical database
    into Wikidata.
  • 29:40 - 29:45
    And here we are applying
    Shape Expressions.
  • 29:46 - 29:53
    And we use that because
    Shape Expressions allow you to test
  • 29:53 - 29:57
    whether the data set
    is actually-- no, to first see
  • 29:57 - 30:02
    if already existing data in Wikidata
    follows the same data model
  • 30:02 - 30:05
    that was achieved in the previous process.
  • 30:05 - 30:07
    So then with the Shape Expression
    we can check:
  • 30:07 - 30:11
    OK the data that are on this topic
    in Wikidata, does it need some cleaning up
  • 30:11 - 30:15
    or do we need to adapt our model
    to the Wikidata model or vice versa.
  • 30:16 - 30:20
    Once that is in place,
    we start writing bots,
  • 30:21 - 30:24
    and the bots seed the information
  • 30:24 - 30:27
    that is in the primary sources
    into Wikidata.
  • 30:28 - 30:29
    And when the bots are ready,
  • 30:29 - 30:33
    we write these bots
    with a platform called--
  • 30:33 - 30:36
    with a Python library
    called Wikidata Integrator
  • 30:36 - 30:38
    that came out of our project.
  • 30:39 - 30:43
    And once we have our bots,
    we use a platform called Jenkins
  • 30:43 - 30:45
    for continuous integration.
  • 30:45 - 30:46
    And with Jenkins,
  • 30:46 - 30:51
    we continuously update
    the primary sources with Wikidata.
  • 30:52 - 30:56
    And this is a diagram for the paper
    I previously mentioned.
  • 30:56 - 30:57
    This is our current landscape.
  • 30:57 - 31:02
    So every orange box out there
    is a primary resource on drugs,
  • 31:02 - 31:08
    proteins, genes, diseases,
    chemical compounds with interaction,
  • 31:08 - 31:11
    and this model is too small to read now
  • 31:11 - 31:17
    but these are the databases,
    the sources that we manage in Wikidata
  • 31:17 - 31:21
    and bridge with the primary sources.
  • 31:21 - 31:22
    Here is such a workflow.
  • 31:23 - 31:25
    So one of our partners
    is the Disease Ontology;
  • 31:25 - 31:28
    the Disease Ontology is a CC0 ontology,
  • 31:28 - 31:32
    and this CC0 ontology
    has a curation cycle of its own,
  • 31:33 - 31:36
    and they just continuously
    update the Disease Ontology
  • 31:36 - 31:40
    to reflect the disease space
    or the interpretation of diseases.
  • 31:40 - 31:44
    And there is the Wikidata
    curation cycle also on diseases
  • 31:44 - 31:50
    where the Wikidata community constantly
    monitors what's going on on Wikidata.
  • 31:50 - 31:52
    And then we have two roles,
  • 31:52 - 31:55
    we call them colloquially
    the gatekeeper curator,
  • 31:56 - 32:00
    and this was me
    and a colleague five years ago
  • 32:00 - 32:03
    where we just sit on our computers
    and we monitor Wikipedia and Wikidata,
  • 32:03 - 32:09
    and if there is an issue that was
    reported back to the primary community,
  • 32:09 - 32:12
    the primary resources, they looked
    at the implementation and decided:
  • 32:12 - 32:14
    OK, do we trust the Wikidata input?
  • 32:15 - 32:19
    Yes--then it's considered,
    it goes into the cycle,
  • 32:19 - 32:23
    and the next iteration
    is part of the Disease Ontology
  • 32:23 - 32:25
    and fed back into Wikidata.
  • 32:27 - 32:31
    We're doing the same for WikiPathways.
  • 32:31 - 32:37
    WikiPathways is a MediaWiki-inspired
    pathway repository.
  • 32:37 - 32:41
    Same story, there are different
    pathway resources on Wikidata already.
  • 32:41 - 32:45
    There might be conflicts
    between those pathway resources
  • 32:45 - 32:47
    and these conflicts are reported back
  • 32:47 - 32:50
    by the gatekeeper curators
    to that community,
  • 32:50 - 32:54
    and you maintain
    the individual curation cycles.
  • 32:54 - 32:57
    But if you remember the previous cycle,
  • 32:57 - 33:03
    here I mentioned
    only two cycles, two resources,
  • 33:04 - 33:06
    we have to do that
    for every single resource that we have
  • 33:06 - 33:08
    and we have to manage what's going on
  • 33:08 - 33:09
    because when I say curation,
  • 33:09 - 33:11
    I really mean going
    to the Wikipedia talk pages,
  • 33:11 - 33:15
    going into the Wikidata talk pages
    and trying to do that.
  • 33:15 - 33:19
    That doesn't scale for
    the two gatekeeper curators we had.
  • 33:20 - 33:23
    So when I was in a conference in 2016
  • 33:23 - 33:27
    where Eric gave a presentation
    on Shape Expressions,
  • 33:27 - 33:29
    I jumped on the bandwagon and said OK,
  • 33:29 - 33:34
    Shape Expressions can help us
    detect differences in Wikidata,
  • 33:34 - 33:41
    and so that allows the gatekeepers to have
    some more efficient reporting.
  • 33:42 - 33:46
    So this year,
    I was delighted by the entity schemas
  • 33:46 - 33:51
    because now, we can store
    those entity schemas on Wikidata,
  • 33:51 - 33:53
    on Wikidata itself,
    whereas before, they were on GitHub,
  • 33:54 - 33:57
    and this aligns
    with the Wikidata interface,
  • 33:57 - 33:59
    so you have things
    like document discussions
  • 33:59 - 34:01
    but you also have revisions.
  • 34:01 - 34:05
    So you can leverage the talk pages
    and the revisions in Wikidata
  • 34:05 - 34:12
    to use that to discuss
    about what is in Wikidata
  • 34:12 - 34:14
    and what is in the primary resources.
  • 34:15 - 34:20
    So this is what Eric just presented,
    this is already quite a benefit.
  • 34:20 - 34:24
    So here, we made up a Shape Expression
    for the human gene,
  • 34:24 - 34:30
    and then we ran it through simple ShEx,
    and as you can see,
  • 34:30 - 34:32
    we just got already ni--
  • 34:32 - 34:35
    There is one issue
    that needs to be monitored
  • 34:35 - 34:37
    which there is an item
    that doesn't fit that schema,
  • 34:37 - 34:43
    and then you can sort of already
    create entity schema curation reports
  • 34:43 - 34:46
    based on... and send those
    to the different curation communities.
  • 34:48 - 34:53
    But this is the ShEx.js interface,
  • 34:53 - 34:56
    and if I can show back here,
    I only do ten,
  • 34:56 - 35:00
    but we have tens of thousands,
    and so that again doesn't scale.
  • 35:00 - 35:05
    So the Wikidata Integrator now
    has ShEx support as well,
  • 35:05 - 35:07
    and then we can just loop over items
  • 35:07 - 35:11
    where we say yes-no,
    yes-no, true-false, true-false.
  • 35:11 - 35:12
    So again,
  • 35:13 - 35:17
    increasing a bit of the efficiency
    of dealing with the reports.
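
A sketch of that "loop over items, yes/no" idea: select a batch of items with SPARQL, then run the same kind of PyShEx check from the earlier sketch on each one. This stands in for, and is not, the Wikidata Integrator's own ShEx support; the gene shape, class and property IDs are illustrative.

```python
# Sketch of batch conformance checking (stand-in for Wikidata Integrator's
# ShEx support, whose actual API is not shown here).
import requests
from pyshex import ShExEvaluator

GENE_SHEX = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
start = @<gene>
<gene> EXTRA wdt:P31 {
  wdt:P31 [ wd:Q7187 ] ;     # instance of: gene
  wdt:P703 . +               # found in taxon: at least one value
}
"""

ITEMS_QUERY = 'SELECT ?gene WHERE { ?gene wdt:P31 wd:Q7187 . } LIMIT 10'
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": ITEMS_QUERY, "format": "json"},
    headers={"User-Agent": "shex-batch-example/0.1"},
).json()["results"]["bindings"]

report = {}
for row in rows:
    uri = row["gene"]["value"]                       # e.g. http://www.wikidata.org/entity/Q...
    qid = uri.rsplit("/", 1)[-1]
    ttl = requests.get(f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl").text
    results = ShExEvaluator(rdf=ttl, schema=GENE_SHEX, focus=uri).evaluate()
    report[qid] = all(r.result for r in results)

print(report)  # {"Q...": True/False, ...} -- the yes-no / true-false list
```
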
  • 35:17 - 35:23
    But now, recently, that builds
    on the Wikidata Query Service,
  • 35:23 - 35:25
    and well, we recently have been getting throttled
  • 35:25 - 35:27
    so again, that doesn't scale.
  • 35:27 - 35:31
    So it's still an ongoing process,
    how to deal with models on Wikidata.
  • 35:32 - 35:37
    And so again,
    ShEx is not only intimidating
  • 35:37 - 35:40
    but also the scale is just
    too big to deal with.
  • 35:41 - 35:46
    So I started working, this is my first
    proof of concept or exercise
  • 35:46 - 35:48
    where I used a tool called yEd,
  • 35:48 - 35:53
    and I started to draw
    those Shape Expressions and because...
  • 35:53 - 35:58
    and then regenerate this schema
  • 35:58 - 36:01
    into the JSON format
    of the Shape Expressions,
  • 36:01 - 36:05
    so that would already open it up
    to the audience
  • 36:05 - 36:07
    that is intimidated
    by the Shape Expressions language.
  • 36:08 - 36:12
    But actually, there is a problem
    with those visual descriptions
  • 36:12 - 36:18
    because this is also a schema
    that was actually drawn in yEd by someone.
  • 36:18 - 36:24
    And here is another one
    which is beautiful.
  • 36:24 - 36:29
    I would love to have this on my wall,
    but it is still not interoperable.
  • 36:30 - 36:32
    So I want to end my talk with,
  • 36:32 - 36:36
    and the first time, I've been
    stealing this slide, using this slide.
  • 36:36 - 36:38
    It's an honor to have him in the audience
  • 36:38 - 36:39
    and I really like this:
  • 36:39 - 36:42
    "People think RDF is a pain
    because it's complicated.
  • 36:42 - 36:44
    The truth is even worse, it's so simple,
  • 36:46 - 36:48
    because you have to work
    with real-world data problems
  • 36:48 - 36:50
    that are horribly complicated.
  • 36:50 - 36:51
    While you can avoid RDF,
  • 36:51 - 36:56
    it is harder to avoid complicated data
    and complicated computer problems."
  • 36:56 - 37:00
    This is about RDF, but I think
    this so applies to modeling as well.
  • 37:00 - 37:03
    So my point of discussion
    is should we really...
  • 37:03 - 37:06
    How do we get modeling going?
  • 37:06 - 37:11
    Should we discuss ShEx
    or visual models or...
  • 37:11 - 37:13
    How do we continue?
  • 37:13 - 37:15
    Thank you very much for your time.
  • 37:15 - 37:18
    (applause)
  • 37:20 - 37:21
    (Lydia) Thank you so much.
  • 37:22 - 37:24
    Would you come to the front
  • 37:24 - 37:28
    so that we can open
    the questions from the audience.
  • 37:29 - 37:30
    Are there questions?
  • 37:32 - 37:33
    Yes.
  • 37:34 - 37:37
    And I think, for the camera, we need to...
  • 37:39 - 37:41
    (Lydia laughing) Yeah.
  • 37:43 - 37:46
    (man3) So a question
    for Cristina, I think.
  • 37:47 - 37:52
    So you mentioned exactly
    the term "information gain"
  • 37:52 - 37:54
    from linking with other systems.
  • 37:54 - 37:56
    There is an information theoretic measure
  • 37:56 - 37:58
    using statistics and probability
    called information gain.
  • 37:58 - 38:00
    Do you have the same...
  • 38:00 - 38:02
    I mean did you mean exactly that measure,
  • 38:02 - 38:04
    the information gain
    from the probability theory
  • 38:04 - 38:05
    from information theory
  • 38:05 - 38:09
    or just use this conceptual thing
    to measure information gain some way?
  • 38:09 - 38:13
    No, so we actually defined
    and implemented measures
  • 38:14 - 38:20
    that are using the Shannon entropy,
    so it's meant as that.
  • 38:20 - 38:23
    I didn't want to go into
    details of the concrete formulas...
  • 38:23 - 38:25
    (man3) No, no, of course,
    that's why I asked the question.
  • 38:25 - 38:27
    - (Cristina) But yeah...
    - (man3) Thank you.
  • 38:33 - 38:35
    (man4) Make more
    of a comment than a question.
  • 38:35 - 38:36
    (Lydia) Go for it.
  • 38:36 - 38:40
    (man4) So there's been
    a lot of focus at the item level
  • 38:40 - 38:43
    about quality and completeness,
  • 38:43 - 38:47
    one of the things that concerns me is that
    we're not applying the same to hierarchies
  • 38:47 - 38:51
    and I think we have an issue
    in that our hierarchy often isn't good.
  • 38:51 - 38:53
    We're seeing
    this is going to be a real problem
  • 38:53 - 38:56
    with Commons searching and other things.
  • 38:57 - 39:01
    One of the things that we can do
    is to import external--
  • 39:01 - 39:05
    The way that external thesauruses
    structure their hierarchies,
  • 39:05 - 39:10
    using the P4900
    broader concept qualifier.
  • 39:11 - 39:16
    But what I think would be really helpful
    would be much better tools for doing that
  • 39:16 - 39:21
    so that you can import an
    external... thesaurus's hierarchy
  • 39:21 - 39:24
    and map that onto our Wikidata items.
  • 39:24 - 39:28
    Once it's in place
    with those P4900 qualifiers,
  • 39:28 - 39:31
    you can actually do some
    quite good querying through SPARQL
  • 39:32 - 39:38
    to see where our hierarchy
    diverges from that external hierarchy.
  • 39:38 - 39:41
    For instance, [Paula Morma],
    user PKM, you may know,
  • 39:41 - 39:44
    does a lot of work on fashion.
  • 39:44 - 39:51
    So we use that to pull in the Europeana
    Fashion Thesaurus's hierarchy
  • 39:51 - 39:54
    and the Getty AAT
    fashion thesaurus hierarchy,
  • 39:54 - 39:58
    and then see where the gaps
    were in our higher level items,
  • 39:58 - 40:01
    which is a real problem for us
    because often,
  • 40:01 - 40:04
    these are things that only exist
    as disambiguation pages on Wikipedia,
  • 40:04 - 40:09
    so we have a lot of higher level items
    in our hierarchies missing
  • 40:09 - 40:14
    and this is something that we must address
    in terms of quality and completeness,
  • 40:14 - 40:16
    but what would really help
  • 40:17 - 40:21
    would be better tools than
    the jungle of Perl scripts that I wrote...
  • 40:21 - 40:26
    If somebody could put that
    into a PAWS notebook in Python
  • 40:27 - 40:32
    to be able to take an external thesaurus,
    take its hierarchy,
  • 40:32 - 40:35
    which may well be available
    as linked data or may not,
  • 40:35 - 40:41
    to then put those into
    QuickStatements to put in P4900 values.
  • 40:41 - 40:42
    And then later,
  • 40:42 - 40:45
    when our representation
    gets more complete,
  • 40:45 - 40:50
    to update those P4900s
    because as our representation gets updated,
  • 40:50 - 40:52
    becomes more dense,
  • 40:52 - 40:55
    the values of those qualifiers
    need to change
  • 40:56 - 41:00
    to represent that we've got more
    of their hierarchy in our system.
  • 41:00 - 41:04
    If somebody could do that,
    I think that would be very helpful,
  • 41:04 - 41:07
    and we do need to also
    look at other approaches
  • 41:07 - 41:11
    to improve quality and completeness
    at the hierarchy level
  • 41:11 - 41:12
    not just at the item level.
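
A minimal sketch of the kind of helper described in this comment: read skos:broader links from an external thesaurus, map its concepts to Wikidata items through an external-ID property, and emit QuickStatements lines that add P4900 (broader concept) qualifiers. The thesaurus snippet, the property P9999 and the QID mapping are all hypothetical placeholders.

```python
# Hypothetical sketch: external SKOS hierarchy -> QuickStatements with P4900.
from rdflib import Graph, Namespace

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

THESAURUS_TTL = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://example.org/thes/coat> skos:broader <http://example.org/thes/outerwear> .
"""
g = Graph().parse(data=THESAURUS_TTL, format="turtle")

# Placeholder mapping: concept URI -> (Wikidata QID, external ID value).
# In practice this would come from a SPARQL query over the external-ID property.
qid_for_concept = {
    "http://example.org/thes/coat": ("Q11111111", "coat-001"),       # placeholder QIDs
    "http://example.org/thes/outerwear": ("Q22222222", "outer-001"),
}
EXT_ID_PROP = "P9999"  # hypothetical external-ID property for this thesaurus

for concept, broader in g.subject_objects(SKOS.broader):
    narrow = qid_for_concept.get(str(concept))
    broad = qid_for_concept.get(str(broader))
    if narrow and broad:
        qid, ext_id = narrow
        # QuickStatements v1 line: item, property, value, qualifier property, qualifier value
        print(f'{qid}\t{EXT_ID_PROP}\t"{ext_id}"\tP4900\t{broad[0]}')
```
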
  • 41:13 - 41:15
    (Andra) Can I add to that?
  • 41:16 - 41:20
    Yes, and we actually do that,
  • 41:20 - 41:24
    and I can recommend looking at
    the Shape Expression that Finn made
  • 41:24 - 41:27
    with the lexical data
    where he creates Shape Expressions
  • 41:27 - 41:30
    and then builds on other Shape Expressions
  • 41:30 - 41:33
    so you have this concept
    of linked Shape Expressions in Wikidata,
  • 41:33 - 41:35
    and specifically, the use case,
    if I understand correctly,
  • 41:35 - 41:37
    is exactly what we are doing in Gene Wiki.
  • 41:37 - 41:41
    So you have the Disease Ontology
    which is put into Wikidata
  • 41:41 - 41:45
    and then disease data comes in
    and we apply the Shape Expressions
  • 41:45 - 41:47
    to see if that fits with this thesaurus.
  • 41:47 - 41:51
    And there are other thesauruses or other
    ontologies for controlled vocabularies
  • 41:51 - 41:53
    that still need to go into Wikidata,
  • 41:53 - 41:55
    and that's exactly why
    Shape Expression is so interesting
  • 41:55 - 41:58
    because you can have a Shape Expression
    for the Disease Ontology,
  • 41:58 - 42:00
    you can have a Shape Expression for MeSH,
  • 42:00 - 42:02
    you can say: OK,
    now I want to check the quality.
  • 42:02 - 42:04
    Because you also have
    in Wikidata the context
  • 42:04 - 42:10
    of when you have a controlled vocabulary,
    you say the quality is according to this,
  • 42:10 - 42:12
    but you might have
    a disagreeing community.
  • 42:12 - 42:16
    So the tooling is indeed in place
    but now is indeed to create those models
  • 42:16 - 42:18
    and apply them
    on the different use cases.
  • 42:19 - 42:21
    (man4) The Shape Expressions are very useful
  • 42:21 - 42:26
    once you have the external ontology
    mapped into Wikidata,
  • 42:26 - 42:29
    but my problem is that
    it's getting to that stage,
  • 42:29 - 42:35
    it's working out how much of the
    external ontology isn't yet in Wikidata
  • 42:35 - 42:36
    and where the gaps are,
  • 42:36 - 42:41
    and that's where I think that
    having much more robust tools
  • 42:41 - 42:44
    to see what's missing
    from external ontologies
  • 42:44 - 42:46
    would be very helpful.
  • 42:48 - 42:49
    The biggest problem there
  • 42:49 - 42:51
    is not so much tooling
    but more licensing.
  • 42:52 - 42:55
    So getting the ontologies
    into Wikidata is actually a piece of cake
  • 42:55 - 42:59
    but most of the ontologies have,
    how can I say that politely,
  • 43:00 - 43:03
    restrictive licensing,
    so they are not compatible with Wikidata.
  • 43:04 - 43:07
    (man4) There's a huge number
    of public sector thesauruses
  • 43:07 - 43:08
    in cultural fields.
  • 43:08 - 43:11
    - (Andra) Then we need to talk.
    - (man4) Not a problem.
  • 43:11 - 43:12
    (Andra) Then we need to talk.
  • 43:14 - 43:19
    (man5) Just... the comment I want to make
    is actually an answer to James,
  • 43:19 - 43:22
    so the thing is that
    hierarchies make graphs,
  • 43:22 - 43:24
    and when you want to...
  • 43:25 - 43:29
    I want to basically talk about...
    a common problem in hierarchies
  • 43:29 - 43:31
    is cycles,
  • 43:31 - 43:34
    so they come back to each other
    when there's a problem,
  • 43:34 - 43:36
    which you should not
    have in hierarchies.
  • 43:37 - 43:41
    This, funnily enough,
    happens in categories in Wikipedia a lot;
  • 43:41 - 43:43
    we have a lot of cycles in categories,
  • 43:44 - 43:47
    but the good news is that this is...
  • 43:48 - 43:52
    Technically, it's an NP-complete problem,
    so you cannot find this
  • 43:52 - 43:53
    easily if you build a graph of that,
  • 43:54 - 43:57
    but there are lots of ways
    that have been developed
  • 43:57 - 44:01
    to find problems
    in these hierarchy graphs.
  • 44:01 - 44:05
    Like there is a paper
    called Finding Cycles...
  • 44:05 - 44:08
    Breaking Cycles in Noisy Hierarchies,
  • 44:08 - 44:13
    and it's been used to help
    categorization of English Wikipedia.
  • 44:13 - 44:17
    You can just take this
    and apply it to these hierarchies in Wikidata,
  • 44:17 - 44:20
    and then you can find
    things that are problematic
  • 44:20 - 44:22
    and just remove the ones
    that are causing issues
  • 44:22 - 44:25
    and find the issues, actually.
  • 44:25 - 44:27
    So this is just an idea, just so you...
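
A sketch of applying that idea to a Wikidata hierarchy: fetch the subclass-of (P279) edges below a root class from the query service and list cycles with networkx. Deciding which edge of each cycle to break is the hard part the cited paper addresses; the root class here is just an example, and deep subtrees may time out.

```python
# Sketch: find cycles in a Wikidata subclass-of (P279) subgraph with networkx.
# Root class Q2095 ("food") is an arbitrary example.
import requests
import networkx as nx

QUERY = """
SELECT ?child ?parent WHERE {
  ?child wdt:P279* wd:Q2095 .    # everything below the chosen root class
  ?child wdt:P279 ?parent .
}
"""
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "cycle-check-example/0.1"},
).json()["results"]["bindings"]

g = nx.DiGraph()
for row in rows:
    g.add_edge(row["child"]["value"], row["parent"]["value"])

# Any directed cycle here means an item is (transitively) a subclass of itself.
for cycle in nx.simple_cycles(g):
    print(" -> ".join(uri.rsplit("/", 1)[-1] for uri in cycle))
```
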
  • 44:29 - 44:30
    (man4) That's all very well
  • 44:30 - 44:34
    but I think you're underestimating
    the number of bad subclass relations
  • 44:34 - 44:35
    that we have.
  • 44:35 - 44:40
    It's like having a city
    in completely the wrong country,
  • 44:40 - 44:45
    and there are tools for geography
    to identify that,
  • 44:45 - 44:49
    and we need to have
    much better tools in hierarchies
  • 44:49 - 44:53
    to identify where the equivalent
    of the item for the country
  • 44:53 - 44:58
    is missing entirely,
    or where it's actually been subclassed
  • 44:58 - 45:02
    to something that means
    something completely different.
  • 45:03 - 45:07
    (Lydia) Yeah, I think
    you're getting to something
  • 45:07 - 45:12
    that my team and I keep hearing
    from people who reuse our data
  • 45:12 - 45:14
    quite a bit as well, right,
  • 45:15 - 45:17
    An individual data point might be great
  • 45:17 - 45:20
    but if you have to look
    at the ontology and so on,
  • 45:20 - 45:22
    then it gets very...
  • 45:22 - 45:26
    And I think one of the big problems
    why this is happening
  • 45:26 - 45:31
    is that a lot of editing on Wikidata
  • 45:31 - 45:35
    happens on the basis
    of an individual item, right,
  • 45:35 - 45:36
    you make an edit on that item,
  • 45:38 - 45:42
    without realizing that this
    might have very global consequences
  • 45:42 - 45:44
    on the rest of the graph, for example.
  • 45:44 - 45:50
    And if people have ideas around
    how to make this more visible,
  • 45:50 - 45:53
    the consequences
    of an individual local edit,
  • 45:54 - 45:57
    I think that would be worth exploring,
  • 45:58 - 46:02
    to show people better
    what the consequences of an edit
  • 46:02 - 46:03
    that they might make in very good faith
  • 46:04 - 46:05
    actually are.
  • 46:07 - 46:12
    Whoa! OK, let's start with, yeah, you,
    then you, then you, then you.
  • 46:12 - 46:14
    (man5) Well, after the discussion,
  • 46:14 - 46:18
    just to express my agreement
    with what James was saying.
  • 46:18 - 46:22
    So essentially, it seems
    the most dangerous thing is the hierarchy,
  • 46:22 - 46:24
    not the hierarchy, but generally
  • 46:24 - 46:28
    the semantics of the subclass relations
    seen in Wikidata, right.
  • 46:28 - 46:33
    So I've been studying languages recently,
    just for the purposes of this conference,
  • 46:33 - 46:35
    and for example, you find plenty of cases
  • 46:35 - 46:39
    where a language is a part of
    and subclass of the same thing, OK.
  • 46:39 - 46:44
    So you know, you can say
    we have a flexible ontology.
  • 46:44 - 46:46
    Wikidata gives you freedom
    to express that, sometimes.
  • 46:46 - 46:47
    Because, for example,
  • 46:47 - 46:51
    that ontology of languages
    is also politically complicated, right?
  • 46:51 - 46:55
    It is even good to be in a position
    to express a level of uncertainty.
  • 46:55 - 46:58
    But imagine anyone who wants
    to do machine reading from that.
  • 46:58 - 46:59
    So that's really problematic.
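The kind of contradiction described here can already be listed with a query. The sketch below assumes the standard Wikidata properties instance of (P31), part of (P361) and subclass of (P279), and restricts the check to languages (Q34770) to keep the result set small; dropping that restriction checks the whole graph.

```python
# Sketch: find items that are both "part of" and "subclass of" the same target.
import requests

query = """
SELECT ?item ?target WHERE {
  ?item wdt:P31 wd:Q34770 .    # instance of: language
  ?item wdt:P361 ?target .     # part of some target
  ?item wdt:P279 ?target .     # and subclass of the very same target
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "part-of-vs-subclass-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["item"]["value"], "is both part of and subclass of", row["target"]["value"])
```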
  • 46:59 - 47:00
    And then again,
  • 47:00 - 47:04
    I don't think that ontology
    was ever imported from somewhere,
  • 47:04 - 47:05
    that's something which is originally ours.
  • 47:05 - 47:08
    It was harvested from Wikipedia
    in the very beginning, I would say.
  • 47:08 - 47:11
    So I wonder...
    this Shape Expressions thing is great,
  • 47:11 - 47:16
    and also validating and fixing,
    if you like, the Wikidata ontology
  • 47:16 - 47:18
    against external resources, a beautiful idea.
  • 47:19 - 47:20
    In the end,
  • 47:20 - 47:25
    will we end up reflecting
    the external ontologies in Wikidata?
  • 47:25 - 47:29
    And also, what do we do with
    the core part of our ontology
  • 47:29 - 47:31
    which is never harvested
    from external resources,
  • 47:31 - 47:32
    how do we go and fix that?
  • 47:32 - 47:35
    And I really think that
    that will be a problem on its own.
  • 47:35 - 47:39
    We will have to focus on that
    independently of the idea
  • 47:39 - 47:41
    of validating ontology
    with something external.
  • 47:49 - 47:53
    (man6) OK, and constraints
    and shapes are very impressive,
  • 47:53 - 47:54
    what we can do with them,
  • 47:55 - 47:58
    but the main point is not
    really being made clear --
  • 47:58 - 48:03
    it's that now we can make more explicit
    what we expect from the data.
  • 48:03 - 48:07
    Before, everyone had to write
    their own tools and scripts,
  • 48:07 - 48:11
    and so now it's more visible
    and we can discuss it.
  • 48:11 - 48:14
    But it's not about
    what's wrong or right,
  • 48:14 - 48:16
    it's about an expectation,
  • 48:16 - 48:18
    and you will have different
    expectations and discussions
  • 48:18 - 48:21
    about how we want
    to model things in Wikidata,
  • 48:21 - 48:23
    and this...
  • 48:23 - 48:26
    The current state is just
    one step in that direction,
  • 48:26 - 48:28
    because right now you need
  • 48:28 - 48:31
    a lot of technical expertise
    to get into this,
  • 48:31 - 48:36
    and we need better ways
    to visualize these constraints,
  • 48:36 - 48:40
    to maybe transform them into natural language
    so people can understand them better,
  • 48:41 - 48:44
    but it's less about what's wrong or right.
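As one possible first step toward that kind of visualization, the sketch below reads the property-constraint statements (P2302) on a single property and prints their constraint types by English label; turning the constraint qualifiers into full sentences would be the follow-up work. P569 (date of birth) is used only as an example property.

```python
# Sketch: list the constraint types declared on one property, by label.
import requests

PROPERTY = "P569"  # example property: date of birth

query = f"""
SELECT ?constraintLabel WHERE {{
  wd:{PROPERTY} p:P2302 ?statement .
  ?statement ps:P2302 ?constraint .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "constraint-explainer-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(f"{PROPERTY} has a constraint of type: {row['constraintLabel']['value']}")
```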
  • 48:45 - 48:46
    (Lydia) Yeah.
  • 48:51 - 48:54
    (man7) So for quality issues,
    I just want to echo it like...
  • 48:54 - 48:57
    I've definitely found a lot of the issues
    I've encountered have been
  • 48:59 - 49:02
    differences in opinion
    between instance of versus subclass.
  • 49:02 - 49:06
    I would say there are errors in those situations,
  • 49:06 - 49:12
    and trying to find them
    has been a very time-consuming process.
  • 49:12 - 49:15
    What I've found is like:
    "Oh, if I find very high-impression items
  • 49:15 - 49:16
    that are something...
  • 49:16 - 49:22
    and then use all the subclass instances
    to find all derived statements of this,"
  • 49:22 - 49:26
    this is a very useful way
    of looking for these errors.
  • 49:26 - 49:28
    But I was curious if Shape Expressions,
  • 49:30 - 49:32
    if there is...
  • 49:32 - 49:37
    If this can be used as a tool
    to help resolve those issues but, yeah...
  • 49:41 - 49:43
    (man8) If it has a structural footprint...
  • 49:46 - 49:49
    If it has a structural footprint
    that you can...that's sort of falsifiable,
  • 49:49 - 49:51
    you can look at that
    and say well, that's wrong,
  • 49:51 - 49:53
    then yeah, you can do that.
  • 49:53 - 49:57
    But if it's just sort of
    trying to map it to real-world objects,
  • 49:57 - 49:59
    then you're just going to need
    lots and lots of brains.
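A sketch of the workflow described in the question: start from one high-level class (a placeholder Q-ID below), collect everything that is an instance of it or of any of its subclasses, and print each item together with the direct class it came in through, since the unexpected direct classes are usually where an instance-of/subclass-of mix-up sits.

```python
# Sketch: audit what ends up classified under a high-level class via P31/P279*.
import requests

ROOT = "Q12345"  # placeholder high-level class

query = f"""
SELECT ?itemLabel ?directClassLabel WHERE {{
  ?item wdt:P31 ?directClass .
  ?directClass wdt:P279* wd:{ROOT} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 500"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "subclass-audit-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "<- via ->", row["directClassLabel"]["value"])
```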
  • 50:06 - 50:09
    (man9) Hi, Pablo Mendes
    from Apple Siri Knowledge.
  • 50:09 - 50:13
    We're here to find out how to help
    the project and the community
  • 50:13 - 50:16
    but Cristina made the mistake
    of asking what we want.
  • 50:16 - 50:20
    (laughing) So I think
    one thing I'd like to see
  • 50:21 - 50:24
    is a lot around verifiability
  • 50:24 - 50:26
    which is one of the core tenets
    of the project and the community,
  • 50:27 - 50:29
    and trustworthiness.
  • 50:29 - 50:32
    Not every statement is the same,
    some of them are heavily disputed,
  • 50:32 - 50:34
    some of them are easy to guess,
  • 50:34 - 50:36
    like somebody's
    date of birth can be verified,
  • 50:36 - 50:39
    as you saw today in the Keynote,
    gender issues are a lot more complicated.
  • 50:40 - 50:42
    Can you discuss a little bit what you know
  • 50:42 - 50:47
    in this area of data quality around
    trustworthiness and verifiability?
  • 50:55 - 50:58
    If there isn't a lot,
    I'd love to see a lot more. (laughs)
  • 51:01 - 51:02
    (Lydia) Yeah.
  • 51:03 - 51:07
    Apparently, we don't have
    a lot to say on that. (laughs)
  • 51:08 - 51:12
    (Andra) I think we can do a lot,
    but I had a discussion with you yesterday.
  • 51:12 - 51:16
    My favorite example, which I learned yesterday
    has already been deprecated,
  • 51:16 - 51:20
    is if you go to Q2, which is Earth,
  • 51:20 - 51:23
    there is a statement
    that claims that the earth is flat.
  • 51:24 - 51:26
    And I love that example
  • 51:26 - 51:28
    because there is a community
    out there that claims that
  • 51:28 - 51:30
    and they have verifiable resources.
  • 51:30 - 51:32
    So I think it's a genuine case,
  • 51:32 - 51:35
    it shouldn't be deprecated,
    it should be in Wikidata.
  • 51:35 - 51:40
    And I think Shape Expressions
    can be really instrumental there,
  • 51:40 - 51:42
    because what you can say,
  • 51:42 - 51:45
    OK, I'm really interested
    in this use case,
  • 51:45 - 51:47
    or this is a use case where you disagree,
  • 51:47 - 51:51
    but there can also be a use case
    where you say OK, I'm interested.
  • 51:51 - 51:53
    So there is this example you say,
    I have glucose.
  • 51:53 - 51:56
    And glucose when you're a biologist,
  • 51:56 - 52:00
    you don't care for the chemical
    constraints of the glucose molecule,
  • 52:00 - 52:03
    you just... everything glucose
    is the same.
  • 52:03 - 52:06
    But if you're a chemist,
    you cringe when you hear that,
  • 52:06 - 52:08
    you have 200 something...
  • 52:08 - 52:10
    So then you can have
    multiple Shape Expressions,
  • 52:10 - 52:13
    OK, I'm coming in with...
    I'm taking a chemist's view,
  • 52:13 - 52:14
    I'm applying that.
  • 52:14 - 52:17
    And then you say
    I'm from a biological use case,
  • 52:17 - 52:19
    I'm applying that Shape Expression.
  • 52:19 - 52:20
    And then when you want to collaborate,
  • 52:20 - 52:23
    yes, well you should talk
    to Eric about ShEx maps.
  • 52:24 - 52:29
    And so...
    but this journey is just starting.
  • 52:29 - 52:32
    But personally, I believe
    that it can be quite instrumental in that area.
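For reusers worried about trustworthiness, one lever that already exists is statement rank: the truthy wdt: form of a property leaves deprecated statements out, while the full p:/ps: form exposes every statement together with its rank, so downstream consumers can decide for themselves what to trust. The sketch below shows the latter for Q2 (Earth); "P9999" is a placeholder for whichever property carries the disputed claim.

```python
# Sketch: list all statement values of one property on Q2 together with their rank.
import requests

query = """
SELECT ?value ?rank WHERE {
  wd:Q2 p:P9999 ?statement .      # Q2 = Earth; P9999 is a placeholder property
  ?statement ps:P9999 ?value ;
             wikibase:rank ?rank .
}"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "rank-filter-sketch"})
r.raise_for_status()
for row in r.json()["results"]["bindings"]:
    rank = row["rank"]["value"].rsplit("#", 1)[-1]   # e.g. NormalRank, DeprecatedRank
    print(row["value"]["value"], rank)
```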
  • 52:34 - 52:36
    (Lydia) OK. Over there.
  • 52:38 - 52:39
    (laughs)
  • 52:41 - 52:46
    (woman2) I had several ideas
    from some points in the discussions,
  • 52:46 - 52:51
    so I will try not to lose...
    I had three ideas so...
  • 52:52 - 52:55
    Based on what James said a while ago,
  • 52:55 - 52:59
    we have had a very, very big problem
    on Wikidata since the beginning
  • 52:59 - 53:02
    with the upper ontology.
  • 53:02 - 53:05
    We talked about that
    two years ago at WikidataCon,
  • 53:05 - 53:07
    and we talked about that at Wikimania.
  • 53:07 - 53:10
    Well, whenever we have a Wikidata meeting,
  • 53:10 - 53:12
    we are talking about that,
  • 53:12 - 53:16
    because it's a very big problem
    at the very, very top level --
  • 53:16 - 53:23
    what an entity is, what a work is,
    what a genre is, what art is --
  • 53:23 - 53:25
    these are really the biggest concepts.
  • 53:26 - 53:33
    And that's actually
    a very weak point of the global ontology
  • 53:33 - 53:37
    because people try to clean it up regularly
  • 53:38 - 53:41
    and break everything down the line,
  • 53:43 - 53:49
    because yes, I think some of you
    may remember the guy who in good faith
  • 53:49 - 53:52
    broke absolutely all cities in the world.
  • 53:52 - 53:58
    They were not geographical items anymore,
    so there were constraint violations everywhere.
  • 53:59 - 54:00
    And it was in good faith
  • 54:00 - 54:04
    because he was really
    correcting a mistake in an item,
  • 54:04 - 54:06
    but everything broke down.
  • 54:06 - 54:09
    And I'm not sure how we can solve that
  • 54:10 - 54:16
    because there is actually
    no external institution we could just copy
  • 54:16 - 54:18
    because everyone is working on...
  • 54:19 - 54:22
    Well, if I am a performing arts database,
  • 54:22 - 54:25
    I will just work
    at the performing arts level,
  • 54:25 - 54:29
    I won't go to the philosophical concept
    of what an entity is,
  • 54:29 - 54:31
    and that's actually...
  • 54:31 - 54:35
    I don't know any database
    which is working at this level,
  • 54:35 - 54:37
    but that's the weakest point of Wikidata.
  • 54:38 - 54:41
    And probably,
    when we are talking about data quality,
  • 54:41 - 54:44
    that's actually a big part of it, so...
  • 54:44 - 54:49
    And I think it's the same
    we have stated in...
  • 54:49 - 54:50
    Oh, I am sorry, I am changing the subject,
  • 54:51 - 54:56
    but we have stated
    in different sessions about qualities,
  • 54:56 - 54:59
    which is that some of us
    are actually doing a good modeling job,
  • 54:59 - 55:01
    are doing ShEx,
    are doing things like that.
  • 55:02 - 55:08
    People don't see it on Wikidata,
    they don't see the ShEx,
  • 55:08 - 55:10
    they don't see the WikiProject
    on the discussion page,
  • 55:10 - 55:11
    and sometimes,
  • 55:11 - 55:15
    they don't even see
    the talk pages of properties,
  • 55:15 - 55:20
    which explicitly state
    that this property is used for that.
  • 55:20 - 55:24
    Like last week,
    I added constraints to a property.
  • 55:24 - 55:26
    The constraint was explicitly written
  • 55:26 - 55:29
    in the discussion
    of the creation of the property.
  • 55:29 - 55:35
    I just created the technical part
    of adding the constraint, and someone:
  • 55:35 - 55:37
    "What! You broke down all my edits!"
  • 55:37 - 55:42
    And he was using the property
    wrongly for the last two years.
  • 55:42 - 55:47
    And the property was actually very clear,
    but there were no warnings or anything,
  • 55:47 - 55:50
    and so, it's the same Pink Pony
    we stated at Wikimania,
  • 55:50 - 55:55
    to make WikiProjects more visible
    or to make ShEx more visible, but...
  • 55:55 - 55:57
    And that's what Cristina said.
  • 55:57 - 56:02
    We have a visibility problem
    of what the existing solutions are.
  • 56:02 - 56:04
    And at this session,
  • 56:04 - 56:07
    we are all talking about
    how to create more ShEx,
  • 56:07 - 56:11
    or to facilitate the jobs
    of the people who are doing the cleanup.
  • 56:12 - 56:16
    But we are cleaning up
    since the first day of Wikidata,
  • 56:16 - 56:21
    and globally, we are losing,
    and we are losing because, well,
  • 56:21 - 56:23
    if I know names are complicated
  • 56:23 - 56:26
    but I am the only one
    doing the cleanup job,
  • 56:27 - 56:30
    the guy who added
    Latin script names
  • 56:30 - 56:32
    to all Chinese researchers,
  • 56:32 - 56:36
    it will take me months to clean that up
    and I can't do it alone,
  • 56:36 - 56:39
    and he did one massive batch.
  • 56:39 - 56:40
    So we really need...
  • 56:40 - 56:44
    we have a visibility problem
    more than a tool problem, I think,
  • 56:44 - 56:46
    because we have many tools.
  • 56:46 - 56:50
    (Lydia) Right, so unfortunately,
    I've got shown a sign, (laughs),
  • 56:50 - 56:52
    so we need to wrap this up.
  • 56:52 - 56:54
    Thank you so much for your comments,
  • 56:54 - 56:57
    I hope you will continue discussing
    during the rest of the day,
  • 56:57 - 56:58
    and thanks for your input.
  • 56:58 - 57:00
    (applause)