-
Hello everyone, and welcome to the Data Quality panel.
-
Data quality matters because
more and more people out there
-
rely on our data being in good shape,
so we're going to talk about data quality,
-
and there will be four speakers
who will give short introductions
-
on topics related to data quality
and then we will have a Q and A.
-
And the first one is Lucas.
-
Thank you.
-
Hi, I'm Lucas, and I'm going
to start with an overview
-
of data quality tools
that we already have on Wikidata
-
and also some things
that are coming up soon.
-
And I've grouped them
into some general themes
-
of making errors more visible,
making problems actionable,
-
getting more eyes on the data
so that people notice the problems,
-
fixing some common sources of errors,
maintaining the quality of the existing data
-
and also human curation.
-
And the ones that are currently available
start with property constraints.
-
So you've probably seen this
if you're on Wikidata.
-
You can sometimes get these icons
-
which check
the internal consistency of the data.
-
For example,
if one event follows the other,
-
then the other event should
also be followed by this one,
-
which on the WikidataCon item
was apparently missing.
-
I'm not sure,
this feature is a few days old.
-
And also,
if this is too limited or simple for you,
-
you can write any checks you want
using the Query Service
-
which is useful for
lots of things of course,
-
but you can also use it
for finding errors.
-
Like if you've noticed
one occurrence of a mistake,
-
then you can check
if there are other places
-
where people have made
a very similar error
-
and find that with the Query Service.
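-
As a rough sketch of that kind of check (my own example, not one shown in the talk), here is how one might look for items that are "followed by" something which does not state the inverse "follows" link, using Python and the SPARQLWrapper package against the Wikidata Query Service:

```python
# Minimal sketch (not from the talk) of hunting for one kind of error with
# the Query Service: items that are "followed by" (P156) another item which
# does not state the inverse "follows" (P155) link back.
from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql")
wdqs.setQuery("""
SELECT ?event ?next WHERE {
  ?event wdt:P156 ?next .                        # event is followed by next
  FILTER NOT EXISTS { ?next wdt:P155 ?event }    # but next does not say it follows event
}
LIMIT 100
""")
wdqs.setReturnFormat(JSON)

for row in wdqs.query().convert()["results"]["bindings"]:
    print(row["event"]["value"], "->", row["next"]["value"])
```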
-
You can also combine the two
-
and search for constraint violations
in the Query Service,
-
for example,
only the violations in some area
-
or WikiProject that's relevant to you,
-
although the results are currently
not complete, sadly.
-
There is revision scoring.
-
That's... I think this is
from the recent changes
-
and you can also get it on your watchlist:
an automatic assessment
-
of whether this edit is likely to be
in good faith or in bad faith
-
and whether it is likely to be
damaging or not damaging,
-
I think those are the two dimensions.
-
So you can, if you want,
-
focus on just looking through
the damaging but good faith edits.
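-
As a rough illustration (not part of the talk), those damaging and good-faith scores come from the ORES service, which can be queried over HTTP; a minimal sketch in Python, assuming the public ORES v3 API and the requests library, with placeholder revision IDs:

```python
# Rough sketch (not from the talk): asking ORES for damaging/goodfaith
# scores of a few Wikidata revisions. The revision IDs are placeholders.
import requests

rev_ids = ["1034554780", "1034554781"]  # placeholder revision IDs

resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/wikidatawiki/",
    params={"models": "damaging|goodfaith", "revids": "|".join(rev_ids)},
    timeout=30,
)
resp.raise_for_status()
scores = resp.json()["wikidatawiki"]["scores"]

for rev_id, models in scores.items():
    p_damaging = models["damaging"]["score"]["probability"]["true"]
    p_goodfaith = models["goodfaith"]["score"]["probability"]["true"]
    # e.g. the "damaging but good faith" queue described above
    if p_damaging > 0.8 and p_goodfaith > 0.5:
        print(rev_id, "looks damaging but probably made in good faith")
```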
-
If you're feeling particularly
friendly and welcoming
-
you can tell these editors,
"Thank you for your contribution,
-
here's how you should have done it
but thank you, still."
-
And if you're not feeling that way,
-
you can go through
the bad faith, damaging edits,
-
and revert the vandals.
-
There's also, similar to that,
entity scoring.
-
So instead of scoring an edit,
the change that it made,
-
you score the whole revision,
-
and I think that is
the same quality measure
-
that Lydia mentioned
at the beginning of the conference.
-
There's a user script, up here,
that gives you a score of, like, one to five,
-
I think it was, of what the quality
of the current item is.
-
The primary sources tool is for
any database that you want to import,
-
but that's not high enough quality
to directly add to Wikidata,
-
so you add it
to the primary sources tool instead,
-
and then humans can decide
-
whether they should add
these individual statements or not.
-
Showing coordinates as maps
is mainly a convenience feature
-
but it's also useful for quality control.
-
Like if you see this is supposed to be
the office of Wikimedia Germany
-
but the coordinates
are somewhere in the Indian Ocean,
-
then you know that
something is not right there
-
and you can see it much more easily
than if you just had the numbers.
-
This is a gadget called
the relative completeness indicator
-
which shows you this little icon here
-
telling you how complete
it thinks this item is
-
and also which properties
are most likely missing,
-
So if you're editing an item
-
and you're in an area
that you're not very familiar with
-
and you don't know what
the right properties to use are,
-
then this is a very useful gadget to have.
-
And we have Shape Expressions.
-
I think Andra or Jose
are going to talk more about those
-
but basically, a very powerful way
of comparing the data you have
-
against the schema,
-
like what statement should
certain entities have,
-
what other entities should they link to
and what should those look like,
-
and then you can find problems that way.
-
I think... No there is still more.
-
Integraality or property dashboard.
-
It gives you a quick overview
of the data you already have.
-
For example, this is from
the WikiProject Red Pandas,
-
and you can see that
we have a sex or gender
-
for almost all of the red pandas,
-
the date of birth varies a lot
by which zoo they come from
-
and we have almost
no dead pandas which is wonderful,
-
because they're so cute.
-
So this is also useful.
-
There we go, OK,
now for the things that are coming up.
-
Wikidata Bridge,
formerly known as client editing,
-
so editing Wikidata
from Wikipedia infoboxes
-
which will on the one hand
get more eyes on the data
-
because more people can see the data there
-
and it will hopefully encourage
more use of Wikidata in the Wikipedias
-
and that means that more
people can notice
-
if, for example, some data is outdated
and needs to be updated,
-
compared to if they could
only see it on Wikidata itself.
-
There is also tainted references.
-
The idea here is that
if you edit a statement value,
-
you might want to update
the references as well,
-
unless it was just a typo or something.
-
And this tainted references feature
tells editors that,
-
and also lets other editors
see which other edits were made
-
that changed a statement value
and didn't update a reference,
-
so you can clean up after that
and decide should that be...
-
Do you need to do anything more there,
-
or is that actually fine and
you don't need to update the reference.
-
That's related to signed statements
which is coming from a concern, I think,
-
that some data providers have that like...
-
There's a statement that's referenced
through the UNESCO or something
-
and then suddenly,
someone vandalizes the statement
-
and they are worried
that it will look like
-
this organization, like UNESCO,
still stated this vandalized value
-
and so, with signed statements,
-
they can cryptographically
sign this reference
-
and that doesn't prevent any edits to it,
-
but at least, if someone
vandalizes the statement
-
or edits it in any way,
then the signature is no longer valid,
-
and you can tell this is not exactly
what the organization said,
-
and perhaps it's a good edit
and they should re-sign the new statement,
-
but also perhaps it should be reverted.
-
And also, this is going
to be very exciting, I think,
-
Citoid is this amazing system
they have on Wikipedia
-
where you can paste a URL,
or an identifier, or an ISBN
-
or Wikidata ID or basically
anything into the Visual Editor,
-
and it spits out a reference
that is nicely formatted
-
and has all the data you want
and it's wonderful to use.
-
And by comparison, on Wikidata,
if I want to add a reference
-
I typically have to add a reference URL,
title, author name string,
-
published in, publication date,
-
retrieved date,
at least those, and that's annoying,
-
and integrating Citoid into Wikibase
will hopefully help with that.
-
And I think
that's all the ones I had, yeah.
-
So now, I'm going to pass to Cristina.
-
(applause)
-
Hi, I'm Cristina.
-
I'm a research scientist
from the University of Zürich,
-
and I'm also an active member
of the Swiss Community.
-
When Claudia Müller-Birn
and I submitted this to the WikidataCon,
-
what we wanted to do
is continue our discussion
-
that we started
in the beginning of the year
-
with a workshop on data quality
and also some sessions at Wikimania.
-
So the goal of this talk
is basically to bring some thoughts
-
that we have been collecting
from the community and ourselves
-
and continue the discussion.
-
So what we would like is to continue
interacting a lot with you.
-
So what we think is very important
-
is that we continuously ask
all types of users in the community
-
about what they really need,
what problems they have with data quality,
-
not only editors
but also the people who are coding,
-
or consuming the data,
-
and also researchers who are
actually using all the edit history
-
to analyze what is happening.
-
So we did a review of around 80 tools
that exist for Wikidata
-
and we aligned them to the different
data quality dimensions.
-
And what we saw was that actually,
-
many of them were looking at
monitoring completeness,
-
and some of them
are also enabling interlinking.
-
But there is a big need for tools
that are looking into diversity,
-
which is one of the things
that we actually can have in Wikidata,
-
especially
this design principle of Wikidata
-
where we can have plurality
-
and different statements
with different values
-
coming from different sources.
-
Because it's a secondary source,
we don't really have tools
-
that actually tell us how many
plural statements there are,
-
and how many we can improve and how,
-
and we also don't really know
-
what all the reasons
for plurality can be.
-
So from these community meetings,
-
what we discussed was the challenges
that still need attention.
-
For example, that having
all these crowdsourcing communities
-
is very good because different people
attack different parts
-
of the data or the graph,
-
and we also have
different background knowledge
-
but actually, it's very difficult to align
everything into something homogeneous
-
because different people are using
different properties in different ways
-
and they are also expecting
different things from entity descriptions.
-
People also said that
they need more tools
-
that give a better overview
of the global status of things.
-
So what entities are missing
in terms of completeness,
-
but also like what are people
working on right now most of the time,
-
and they also mentioned many times
a tighter collaboration
-
across not only languages
but the WikiProjects
-
and the different Wikimedia platforms.
-
And we published
all the transcribed comments
-
from all these discussions
in those links here in the Etherpads
-
and also in the wiki page of Wikimania.
-
Some solutions that appeared actually
-
were going in the direction
of sharing more of the best practices
-
that are being developed
in different WikiProjects,
-
but also people want tools
that help organize work in teams
-
or at least understand
who is working on that,
-
and they were also mentioning
that they want more showcases
-
and more templates that help them
create things in a better way.
-
And from the contact that we have
-
with Open Governmental Data Organizations,
-
and in particular,
-
I am in contact with the canton
and the city of Zürich,
-
they are very interested
in working with Wikidata
-
because they want their data
to be accessible for everyone
-
in the place where people go
and consult or access data.
-
So for them, something that
would be really interesting
-
is to have some kind of quality indicators
-
both in the wiki,
which is already happening,
-
but also in SPARQL results,
-
to know whether they can trust
or not that data from the community.
-
And then, they also want to know
-
what parts of their own data sets
are useful for Wikidata
-
and they would love to have a tool that
can help them assess that automatically.
-
They also need
some kind of methodology or tool
-
that helps them decide whether
they should import or link their data
-
because in some cases,
-
they also have their own
linked open data sets,
-
so they don't know whether
to just ingest the data
-
or to keep on creating links
from the data sets to Wikidata
-
and the other way around.
-
And they also want to know where
their websites are referenced in Wikidata.
-
And when they run such a query
in the query service,
-
they often get timeouts,
-
so maybe we should
really create more tools
-
that help them get these answers
for their questions.
-
And, besides that,
-
we wiki researchers also sometimes
-
lack some information
in the edit summaries.
-
So I remember that when
we were doing some work
-
to understand
the different behavior of editors
-
with tools or bots
or anonymous users and so on,
-
we were really lacking, for example,
-
a standard way of tracing
that tools were being used.
-
And there are some tools
that are already doing that
-
like PetScan and many others,
-
but maybe we as a community should
-
discuss more how to record these
for fine-grained provenance.
-
And further on,
-
we think that we need to think
of more concrete data quality dimensions
-
that are related to linked data
but not to all the types of data,
-
so we worked on some measures
to actually assess the information gain
-
enabled by the links,
and what we mean by that
-
is that when we link
Wikidata to other data sets,
-
we should also be thinking
-
how much the entities are actually
gaining in the classification,
-
also in the description
but also in the vocabularies they use.
-
So just to give a very simple
example of what I mean by this
-
is we can think of--
in this case, would be Wikidata
-
or the external data set
that is linking to Wikidata,
-
we have the entity for a person
that is called Natasha Noy,
-
we have the affiliation and other things,
-
and then we say OK,
we link to an external place,
-
and that entity also has that name,
but we actually have the same value.
-
So what would be better is that we link
to something that has a different name,
-
that is still valid because this person
has two ways of writing the name,
-
and also other information
that we don't have in Wikidata
-
or that we don't have
in the other data set.
-
But also, what is even better
-
is when we actually
look in the target data set
-
and see that they also have new ways
of classifying the information.
-
So not only is this a person,
but in the other data set,
-
they also say it's a female
or anything else that they classify with.
-
And if in the other data set,
they are using many other vocabularies
-
that is also helping in their whole
information retrieval thing.
-
So with that, I also would like to say
-
that we think that we can
showcase federated queries better
-
because when we look at the query log
provided by Malyshev et al.,
-
we see actually that
from the organic queries,
-
we have only very few federated queries.
-
And actually, federation is one
of the key advantages of having linked data,
-
so maybe the community
or the people using Wikidata
-
also need more examples on this.
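-
To illustrate the kind of federation Cristina means (my own sketch, not from her slides), a query can join Wikidata with an external SPARQL endpoint through a SERVICE clause; the remote endpoint below is only a placeholder and would have to be on the WDQS federation allow-list:

```python
# Sketch (my own example, not from the slides) of a federated query: for items
# with an exact-match link (P2888) to an external RDF resource, ask a remote
# endpoint for that resource's label through a SERVICE clause.
from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql")
wdqs.setQuery("""
SELECT ?item ?external ?externalLabel WHERE {
  ?item wdt:P2888 ?external .             # exact match to an external URI
  SERVICE <https://example.org/sparql> {  # placeholder federated endpoint
    ?external rdfs:label ?externalLabel .
  }
}
LIMIT 10
""")
wdqs.setReturnFormat(JSON)

for row in wdqs.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["externalLabel"]["value"])
```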
-
And if we look at the list
of endpoints that are being used,
-
this is not a complete list
and we have many more.
-
Of course, this data was analyzed
from queries until March 2018,
-
but we should look into the list
of federated endpoints that we have
-
and see whether
we are really using them or not.
-
So two questions that
I have for the audience
-
that maybe we can use
afterwards for the discussion are:
-
what data quality problems
should be addressed in your opinion,
-
because of the needs that you have,
-
but also, where do you need
more automation
-
to help you with editing or patrolling?
-
That's all, thank you very much.
-
(applause)
-
(Jose Emilio Labra) OK,
so what I'm going to talk about
-
is some tools that we have been developing
related to Shape Expressions.
-
So this is what I want to talk...
I am Jose Emilio Labra,
-
but this has... all these tools
have been done by different people,
-
mainly related to the W3C ShEx,
the Shape Expressions Community Group,
-
the ShEx Community Group.
-
So the first tool that I want to mention
is RDFShape, this is a general tool,
-
because Shape Expressions
is not only for Wikidata,
-
Shape Expressions is a language
to validate RDF in general.
-
So this tool was developed mainly by me
-
and it's a tool
to validate RDF in general.
-
So if you want to learn about RDF
or you want to validate RDF
-
or SPARQL endpoints not only in Wikidata,
-
my advice is that you can use this tool.
-
Also for teaching.
-
I am a teacher in the university
-
and I use it in my semantic web course
to teach RDF.
-
So if you want to learn RDF,
I think it's a good tool.
-
For example, this is just a visualization
of an RDF graph with the tool.
-
But before coming here, in the last month,
-
I started a fork of rdfshape specifically
for Wikidata, because I thought...
-
It's called WikiShape, and yesterday,
I presented it as a present for Wikidata.
-
So what I took is...
-
What I did was to remove all the stuff
that was not related to Wikidata
-
and to hard-code several things,
for example, the Wikidata SPARQL endpoint,
-
but now, someone asked me
if I could do it also for Wikibase.
-
And it is very easy
to do it for Wikibase also.
-
So this tool, WikiShape, is quite new.
-
I think it works, most of the features,
-
but there are some features
that maybe don't work,
-
and if you try it and you want
to improve it, please tell me.
-
So this is [inaudible] captures,
but I think I can even try so let's try.
-
So let's see if it works.
-
First, I have to go out of the...
-
Here.
-
Alright, yeah. So this is the tool here.
-
Things that you can do with the tool,
-
for example, is that you can
check schemas, entity schemas.
-
You know that there is
a new namespace which is "E whatever,"
-
so here, if you start,
for example, writing "human"...
-
As you are writing,
its autocomplete allows you to check,
-
for example,
this is the Shape Expressions of a human,
-
and this is the Shape Expressions here.
-
And as you can see,
this editor has syntax highlighting,
-
this is... well,
maybe it's very small, the screen.
-
I can try to do it bigger.
-
Maybe you see it better now.
-
So... and this is the editor
with syntax highlighting and also has...
-
I mean, this editor
comes from the same source code
-
as the Wikidata query service.
-
So for example,
if you hover with the mouse here,
-
it shows you the labels
of the different properties.
-
So I think it's very helpful because now,
-
the entity schema editor that is
in Wikidata is just a plain text area,
-
and I think this editor is much better
because it has autocomplete
-
and it also has...
-
I mean, if you, for example,
wanted to add a constraint,
-
you say "wdt:"
-
You start writing "author"
and then you click Ctrl+Space
-
and it suggests the different things.
-
So this is similar
to the Wikidata query service
-
but specifically for Shape Expressions
-
because my feeling is that
creating Shape Expressions
-
is not more difficult
than writing SPARQL queries.
-
So some people think
that it's at the same level,
-
It's probably easier, I think,
because when we designed Shape Expressions,
-
we were trying to make it
easier to work with.
-
OK, so this is one of the first things,
that you have this editor
-
for Shape Expressions.
-
And then you also have the possibility,
for example, to visualize.
-
If you have a Shape Expression,
use for example...
-
I think, "written work" is
a nice Shape Expression
-
because it has some relationships
between different things.
-
And this is the UML visualization
of written work.
-
In UML, it is easy to see
the different properties.
-
When you do this, I realized
when I tried with several people,
-
they find some mistakes
in their Shape Expressions
-
because it's easy to detect which are
the missing properties or whatever.
-
Then another possibility here
-
is that you can also validate;
I think I have it here, the validation.
-
I think I had it in some label,
maybe I closed it.
-
OK, but you can, for example,
you can click here, Validate entities.
-
You, for example,
-
"q42" with "e42" which is author.
-
With "human,"
I think we can do it with "human."
-
And then it's...
-
And it's taking a little while to do it
because this is doing the SPARQL queries
-
and now, for example,
it's failing by the network but...
-
So you can try it.
-
OK, so let's go continue
with the presentation, with other tools.
-
So my advice is that if you want to try it
and you have any feedback, let me know.
-
So to continue with the presentation...
-
So this is WikiShape.
-
Then, I already said this,
-
the Shape Expressions Editor
is an independent project in GitHub.
-
You can use it in your own project.
-
If you want to do
a Shape Expressions tool,
-
you can just embed it
in any other project,
-
so this is in GitHub and you can use it.
-
Then the same author,
who is one of my students,
-
also created
another editor for Shape Expressions,
-
also inspired by
the Wikidata query service
-
where, in a column,
-
you have this more visual editor
of SPARQL queries
-
where you can put this kind of things.
-
So this is a screen capture.
-
You can see that
that's the Shape Expression in text,
-
but this is a form-based Shape Expressions editor
where it would probably take a bit longer,
-
where you can fill in the different rows
and the different fields.
-
OK, then there is ShExEr.
-
We have... it's done by one PhD student
at the University of Oviedo
-
and he's here, so you can present ShExEr.
-
(Danny) Hello, I am Danny Fernández,
-
I am a PhD student in University of Oviedo
working with Labra.
-
Since we are running out of time,
let's make this quick,
-
so let's not go for any actual demo,
but just show some screenshots.
-
OK, so the usual way to work with
Shape Expressions or any shape language
-
is that you have a domain expert
-
that defines a priori
how the graph should look,
-
defines some structures,
-
and then you use these structures
to validate the actual data against them.
-
This tool, which, like the ones
that Labra has been presenting,
-
is a general-purpose tool
for any RDF source,
-
is designed to work the other way around.
-
You already have some data,
-
you select what nodes
you want to get the shape about
-
and then you automatically
extract or infer the shape.
-
So even if this is a general purpose tool,
-
what we did for this WikidataCon
is this fancy button
-
that, if you click it,
essentially what happens
-
is that there are
so many configuration params,
-
and it configures them to work
against the Wikidata endpoint
-
and it will end soon, sorry.
-
So, once you press this button
what you get is essentially this.
-
After having selected what kind of nodes,
-
what kind of instances of a class,
whatever you are looking for,
-
you get an automatic schema.
-
All the constraints are sorted
by how many nodes actually conform to it,
-
you can filter the less common ones, etc.
-
So there is a poster downstairs
about this stuff
-
and well,
I will be downstairs and upstairs
-
and all over the place all day,
-
so if you have any further
interest in this tool,
-
just speak to me during the day.
-
And now, I'll give back
the mic to Labra, thank you.
-
(applause)
-
(Jose) So let's continue
with the other tools.
-
The other tool is the ShapeDesigner.
-
Andra, do you want to do
the ShapeDesigner now
-
or maybe later or in the workshop?
-
There is a workshop...
-
This afternoon, there is a workshop
specifically for Shape Expressions, and...
-
The idea is that it is going to be
more hands-on,
-
and if you want to practice
some ShEx, you can do it there.
-
This tool is ShEx...
and there is Eric here,
-
so you can present it.
-
(Eric) So just super quick,
the thing that I want to say
-
is that you've probably
already seen the ShEx interface
-
that's tailored for Wikidata.
-
That's effectively stripped down
and tailored specifically for Wikidata
-
because the generic one has more features.
But I thought I'd mention it
-
because one of those features
is particularly useful
-
for debugging Wikidata schemas,
-
which is if you go
and you select the slurp mode,
-
what it does is it says
while I'm validating,
-
I want to pull all the triples down
and that means
-
if I get a bunch of failures,
-
I can go through and start looking
at those failures and saying,
-
OK, what are the triples
that are in here,
-
sorry, I apologize,
the triples are down there,
-
this is just a log of what went by.
-
And then you can just sit there
and fiddle with it in real time
-
like you play with something
and it changes.
-
So it's a quicker version
for doing all that stuff.
-
This is a ShExC form,
-
this is something [Joachim] had suggested
-
could be useful for populating
Wikidata documents
-
based on a Shape Expression
for that document.
-
This is not tailored for Wikidata,
-
but this is just to say
that you can have a schema
-
and you can have some annotations
-
to say specifically how I want
that schema rendered
-
and then it just builds a form,
-
and if you've got data,
it can even populate the form.
-
PyShEx [inaudible].
-
(Jose) I think this is the last one.
-
Yes, so the last one is PyShEx.
-
PyShEx is a Python implementation
of Shape Expressions,
-
you can also play with Jupyter Notebooks
if you want that kind of thing.
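-
As a rough sketch of what that looks like (my example, assuming PyShEx's ShExEvaluator interface, with a tiny illustrative schema rather than one of the official entity schemas):

```python
# Rough sketch (my example): checking one Wikidata entity against a small,
# purely illustrative ShEx schema with PyShEx. Assumes PyShEx's ShExEvaluator
# API; this is not an official entity schema.
from pyshex import ShExEvaluator

SCHEMA = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd:  <http://www.wikidata.org/entity/>

start = @<#human>
<#human> {
  wdt:P31  [ wd:Q5 ] ;   # instance of: human
  wdt:P569 . ?           # date of birth, optional
}
"""

results = ShExEvaluator(
    rdf="https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl",  # Douglas Adams
    schema=SCHEMA,
    focus="http://www.wikidata.org/entity/Q42",
).evaluate()

for r in results:
    print("PASS" if r.result else f"FAIL: {r.reason}")
```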
-
OK, so that's all for this.
-
(applause)
-
(Andra) So I'm going to talk about
a specific project that I'm involved in
-
called Gene Wiki,
-
and where we are also
dealing with quality issues.
-
But before going into the quality,
-
maybe a quick introduction
about what Gene Wiki is,
-
and we just released a pre-print
of a paper that we have recently written
-
that explains the details of the project.
-
I see people taking pictures,
but basically, what Gene Wiki does
-
is try to get biomedical data,
public data, into Wikidata,
-
and we follow a specific pattern
to get that data into Wikidata.
-
So when we have a new repository
or a new data set
-
that is eligible
to be included into Wikidata,
-
the first step is community engagement.
-
It is not necessarily
directly with the Wikidata community
-
but with a local research community,
-
and we meet in person
or online or on any platform
-
and try to come up with a data model
-
that bridges their data
with the Wikidata model.
-
So here I have a picture of a workshop
that happened here last year
-
which was trying to look
at a specific data set
-
and, well, you see a lot of discussions,
-
then aligning it with schema.org
and other ontologies that are out there.
-
And then, at the end of the first step,
we have a whiteboard drawing of the schema
-
that we want to implement in Wikidata.
-
What you see over there,
this is just plain,
-
we have it in the back there
-
so we can make some schemas
within this panel today even.
-
So once we have the schema in place,
-
the next thing is to try to make
that schema machine readable
-
because you want to have actionable models
to bridge the data that you're bringing in
-
from any biomedical database
into Wikidata.
-
And here we are applying
Shape Expressions.
-
And we use that because
Shape Expressions allow you to test
-
whether the data set
is actually-- no, to first see
-
if already existing data in Wikidata
follows the same data model
-
that was achieved in the previous process.
-
So then with the Shape Expression
we can check:
-
OK, the data that is on this topic
in Wikidata, does it need some cleaning up
-
or do we need to adapt our model
to the Wikidata model or vice versa.
-
Once that is in place,
we start writing bots,
-
and the bots seed the information
-
that is in the primary sources
into Wikidata.
-
And when the bots are ready,
-
we write these bots
with a platform called--
-
with a Python library
called Wikidata Integrator
-
that came out of our project.
-
And once we have our bots,
we use a platform called Jenkins
-
for continuous integration.
-
And with Jenkins,
-
we continuously synchronize Wikidata
with the primary sources.
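-
As a rough illustration of such a bot skeleton (my sketch, assuming Wikidata Integrator's wdi_core and wdi_login interfaces, with placeholder credentials, item IDs and statements rather than actual Gene Wiki code):

```python
# Rough sketch (not Gene Wiki's actual bot code): writing one statement with a
# reference using the Wikidata Integrator library. Credentials, the source QID
# and the target item below are placeholders.
from wikidataintegrator import wdi_core, wdi_login

login = wdi_login.WDLogin(user="ExampleBot", pwd="********")  # placeholder credentials

# Reference back to the primary resource (hypothetical example values)
references = [[
    wdi_core.WDItemID(value="Q5282129", prop_nr="P248", is_reference=True),   # stated in (placeholder QID)
    wdi_core.WDExternalID(value="DOID:1612", prop_nr="P699", is_reference=True),
]]

# The statement the bot wants to seed: a Disease Ontology ID (P699)
data = [wdi_core.WDExternalID(value="DOID:1612", prop_nr="P699", references=references)]

item = wdi_core.WDItemEngine(wd_item_id="Q12345", data=data)  # placeholder target item
item.write(login)
```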
-
And this is a diagram for the paper
I previously mentioned.
-
This is our current landscape.
-
So every orange box out there
is a primary resource on drugs,
-
proteins, genes, diseases,
chemical compounds and their interactions,
-
and this model is too small to read now
-
but these are the databases,
the sources that we manage in Wikidata
-
and bridge with the primary sources.
-
Here is such a workflow.
-
So one of our partners
is the Disease Ontology
-
the Disease Ontology is a CC0 ontology,
-
and the Disease Ontology
has a curation cycle of its own,
-
and they just continuously
update the Disease Ontology
-
to reflect the disease space
or the interpretation of diseases.
-
And there is the Wikidata
curation cycle also on diseases
-
where the Wikidata community constantly
monitors what's going on on Wikidata.
-
And then we have two roles,
-
we call them colloquially
the gatekeeper curators,
-
and this was me
and a colleague five years ago
-
where we just sit on our computers
and we monitor Wikipedia and Wikidata,
-
and if there is an issue, it was
reported back to the primary community,
-
the primary resources; they looked
at the implementation and decided:
-
OK, do we trust the Wikidata input?
-
Yes--then it's considered,
it goes into the cycle,
-
and the next iteration
is part of the Disease Ontology
-
and fed back into Wikidata.
-
We're doing the same for WikiPathways.
-
WikiPathways is a MediaWiki-inspired
pathway repository.
-
Same story, there are different
pathway resources on Wikidata already.
-
There might be conflicts
between those pathway resources
-
and these conflicts are reported back
-
by the gatekeeper curators
to that community,
-
and you maintain
the individual curation cycles.
-
But if you remember the previous cycle,
-
here I mentioned
only two cycles, two resources,
-
we have to do that
for every single resource that we have
-
and we have to manage what's going on
-
because when I say curation,
-
I really mean going
to the Wikipedia talk pages,
-
going into the Wikidata talk pages
and trying to do that.
-
That doesn't scale for
the two gatekeeper curators we had.
-
So when I was in a conference in 2016
-
where Eric gave a presentation
on Shape Expressions,
-
I jumped on the bandwagon and said OK,
-
Shape Expressions can help us
detect differences in Wikidata
-
and so that allows the gatekeepers
to have some more efficient reporting.
-
So this year,
I was delighted by the entity schemas
-
because now, we can store
those entity schemas on Wikidata,
-
on Wikidata itself,
whereas before, it was on GitHub,
-
and this aligns
with the Wikidata interface,
-
so you have things
like document discussions
-
but you also have revisions.
-
So you can leverage the talk pages
and the revisions in Wikidata
-
to discuss
what is in Wikidata
-
and what is in the primary resources.
-
So this is what Eric just presented,
this is already quite a benefit.
-
So here, we made up a Shape Expression
for the human gene,
-
and then we ran it through simple ShEx,
and as you can see,
-
we just got already ni--
-
There is one issue
that needs to be monitored
-
which is that there is an item
that doesn't fit that schema,
-
and then you can sort of already
create entity schema curation reports
-
based on... and send that
to the different curation reports.
-
But that was with the ShEx.js interface,
-
and if I can show back here,
I only do ten,
-
but we have tens of thousands,
and so that again doesn't scale.
-
So the Wikidata Integrator now
supports ShEx as well,
-
and then we can just loop over items
-
where we say yes-no,
yes-no, true-false, true-false.
-
So again,
-
increasing a bit of the efficiency
of dealing with the reports.
-
But now, recently, that builds
on the Wikidata Query Service,
-
and well, we recently have been throttled,
-
so again, that doesn't scale.
-
So it's still an ongoing process,
how to deal with models on Wikidata.
-
And so again,
ShEx is not only intimidating
-
but also the scale is just
too big to deal with.
-
So I started working, this is my first
proof of concept or exercise
-
where I used a tool called yEd,
-
and I started to draw
those Shape Expressions and because...
-
and then regenerate this schema
-
into the JSON format
of the Shape Expressions,
-
so that would already open it up
to the audience
-
that is intimidated
by the Shape Expressions language.
-
But actually, there is a problem
with those visual descriptions
-
because this is also a schema
that was actually drawn in yEd by someone.
-
And here is another one
which is beautiful.
-
I would love to have this on my wall,
but it is still not interoperable.
-
So I want to end my talk with,
-
and the first time, I've been
stealing this slide, using this slide.
-
It's an honor to have him in the audience
-
and I really like this:
-
"People think RDF is a pain
because it's complicated.
-
The truth is even worse, it's painfully simplistic,
-
but it allows you to work
with real-world data and problems
-
that are horribly complicated.
-
While you can avoid RDF,
-
it is harder to avoid complicated data
and complicated computer problems."
-
This is about RDF, but I think
this so applies to modeling as well.
-
So my point of discussion
is should we really...
-
How do we get modeling going?
-
Should we discuss ShEx
or visual models or...
-
How do we continue?
-
Thank you very much for your time.
-
(applause)
-
(Lydia) Thank you so much.
-
Would you come to the front
-
so that we can open
the floor for questions from the audience.
-
Are there questions?
-
Yes.
-
And I think, for the camera, we need to...
-
(Lydia laughing) Yeah.
-
(man3) So a question
for Cristina, I think.
-
So you mentioned exactly
the term "information gain"
-
from linking with other systems.
-
There is an information theoretic measure
-
using statistics and probability
called information gain.
-
Do you have the same...
-
I mean did you mean exactly that measure,
-
the information gain
from the probability theory
-
from information theory
-
or did you just use it as a conceptual thing
to measure information gain in some way?
-
No, so we actually defined
and implemented measures
-
that are using the Shannon entropy,
so it's meant as that.
-
I didn't want to go into
details of the concrete formulas...
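-
For reference (my addition, the textbook Shannon-entropy-based definition, not necessarily the exact formulation used in Cristina's measures):

```latex
% Generic Shannon-entropy-based information gain (illustrative; not
% necessarily the exact measure defined in the work discussed above).
H(X) = -\sum_{x \in X} p(x)\,\log_2 p(x)
\qquad
IG(X \mid Y) = H(X) - H(X \mid Y)
```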
-
(man3) No, no, of course,
that's why I asked the question.
-
- (Cristina) But yeah...
- (man3) Thank you.
-
(man4) Make more
of a comment than a question.
-
(Lydia) Go for it.
-
(man4) So there's been
a lot of focus at the item level
-
about quality and completeness,
-
one of the things that concerns me is that
we're not applying the same to hierarchies
-
and I think we have an issue
in that our hierarchy often isn't good.
-
We're seeing
this is going to be a real problem
-
with Commons searching and other things.
-
One of the things that we can do
is to import external--
-
The way that external thesauruses
structure their hierarchies,
-
using the P4900
broader concept qualifier.
-
But what I think would be really helpful
would be much better tools for doing that
-
so that you can import an
external... thesaurus's hierarchy
-
and map that onto our Wikidata items.
-
Once it's in place
with those P4900 qualifiers,
-
you can actually do some
quite good querying through SPARQL
-
to see where our hierarchy
diverges from that external hierarchy.
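-
A minimal sketch of such a divergence query (my own example, not James's scripts): find statements with a P4900 "broader concept" qualifier whose broader item is not reachable from the subject through our own subclass of (P279) chain:

```python
# Minimal sketch (not James's scripts): statements carrying a P4900 "broader
# concept" qualifier where the broader item is not reachable via our own
# subclass of (P279) chain, i.e. places where Wikidata's hierarchy diverges
# from the external thesaurus's hierarchy.
from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql")
wdqs.setQuery("""
SELECT ?item ?itemLabel ?broader ?broaderLabel WHERE {
  ?item ?p ?statement .
  ?statement pq:P4900 ?broader .                    # external thesaurus says "broader"
  FILTER NOT EXISTS { ?item wdt:P279+ ?broader }    # our own subclass chain disagrees
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
""")
wdqs.setReturnFormat(JSON)

for row in wdqs.query().convert()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "is not a subclass of", row["broaderLabel"]["value"])
```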
-
For instance, [Paula Morma],
user PKM, you may know,
-
does a lot of work on fashion.
-
So we use that to pull in the Europeana
Fashion Thesaurus's hierarchy
-
and the Getty AAT
fashion thesaurus hierarchy,
-
and then see where the gaps
were in our higher level items,
-
which is a real problem for us
because often,
-
these are things that only exist
as disambiguation pages on Wikipedia,
-
so we have a lot of higher level items
in our hierarchies missing
-
and this is something that we must address
in terms of quality and completeness,
-
but what would really help
-
would be better tools than
the jungle of Perl scripts that I wrote...
-
If somebody could put that
into a PAWS notebook in Python
-
to be able to take an external thesaurus,
take its hierarchy,
-
which may well be available
as linked data or may not,
-
to then put those into
QuickStatements to put in P4900 values.
-
And then later,
-
when our representation
gets more complete,
-
to update those P4900s
because as our representation gets updated,
-
becomes more dense,
-
the values of those qualifiers
need to change
-
to represent that we've got more
of their hierarchy in our system.
-
If somebody could do that,
I think that would be very helpful,
-
and we do need to also
look at other approaches
-
to improve quality and completeness
at the hierarchy level
-
not just at the item level.
-
(Andra) Can I add to that?
-
Yes, and we actually do that,
-
and I can recommend looking at
the Shape Expression that Finn made
-
with the lexical data
where he creates Shape Expressions
-
and then builds on other Shape Expressions
-
so you have this concept
of linked Shape Expressions in Wikidata,
-
and specifically, the use case,
if I understand correctly,
-
is exactly what we are doing in Gene Wiki.
-
So you have the Disease Ontology
which is put into Wikidata
-
and then disease data comes in
and we apply the Shape Expressions
-
to see if that fits with this thesaurus.
-
And there are other thesauruses or other
ontologies for controlled vocabularies
-
that still need to go into Wikidata,
-
and that's exactly why
Shape Expression is so interesting
-
because you can have a Shape Expression
for the Disease Ontology,
-
you can have a Shape Expression for MeSH,
-
you can say: OK,
now I want to check the quality.
-
Because you also have
in Wikidata the context
-
of when you have a controlled vocabulary,
you say the quality is according to this,
-
but you might have
a disagreeing community.
-
So the tooling is indeed in place
but now the task is to create those models
-
and apply them
on the different use cases.
-
(man4) The Shape Expressions are very useful
-
once you have the external ontology
mapped into Wikidata,
-
but my problem is
getting to that stage,
-
it's working out how much of the
external ontology isn't yet in Wikidata
-
and where the gaps are,
-
and that's where I think that
having much more robust tools
-
to see what's missing
from external ontologies
-
would be very helpful.
-
(Andra) The biggest problem there
-
is not so much tooling
but more licensing.
-
So getting the ontologies
into Wikidata is actually a piece of cake
-
but most of the ontologies have,
how can I say that politely,
-
restrictive licensing,
so they are not compatible with Wikidata.
-
(man4) There's a huge number
of public sector thesauruses
-
in cultural fields.
-
- (Andra) Then we need to talk.
- (man4) Not a problem.
-
(Andra) Then we need to talk.
-
(man5) Just... the comment I want to make
is actually an answer to James,
-
so the thing is that
hierarchies make graphs,
-
and when you want to...
-
I want to basically talk about...
a common problem in hierarchies
-
is circular hierarchies,
-
so they come back to each other,
which is a problem,
-
and you should not
have that in hierarchies.
-
This, funnily enough,
happens in categories in Wikipedia a lot
-
we have a lot of circles in categories,
-
but the good news is that this is...
-
Technically, it's an NP-complete problem,
so you cannot find this easily,
-
even if you build a graph of that,
-
but there are lots of ways
that have been developed
-
to find problems
in these hierarchy graphs.
-
Like there is a paper
called Finding Cycles...
-
Breaking Cycles in Noisy Hierarchies,
-
and it's been used to help
categorization of English Wikipedia.
-
You can just take this
and apply it to these hierarchies in Wikidata,
-
and then you can find
things that are problematic
-
and just remove the ones
that are causing issues
-
and find the issues, actually.
-
So this is just an idea, just so you...
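-
A rough sketch of that idea (mine, not the method from the paper the speaker mentions): pull subclass of (P279) edges below an example root class and report cycles with networkx:

```python
# Rough sketch (not the method from the "Breaking Cycles in Noisy Hierarchies"
# paper): fetch subclass of (P279) edges below an example root and report any
# cycles with networkx. A class hierarchy should be acyclic, so every cycle
# found here is a modelling problem to look at.
import networkx as nx
from SPARQLWrapper import SPARQLWrapper, JSON

wdqs = SPARQLWrapper("https://query.wikidata.org/sparql")
wdqs.setQuery("""
SELECT ?child ?parent WHERE {
  ?child wdt:P279 ?parent .
  ?child wdt:P279* wd:Q7725634 .   # everything below "literary work" (example root)
}
""")
wdqs.setReturnFormat(JSON)

graph = nx.DiGraph()
for row in wdqs.query().convert()["results"]["bindings"]:
    graph.add_edge(row["child"]["value"], row["parent"]["value"])

for cycle in nx.simple_cycles(graph):
    print(" -> ".join(cycle))
```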
-
(man4) That's all very well
-
but I think you're underestimating
the number of bad subclass relations
-
that we have.
-
It's like having a city
in completely the wrong country,
-
and there are tools for geography
to identify that,
-
and we need to have
much better tools in hierarchies
-
to identify where the equivalent
of the item for the country
-
is missing entirely,
or where it's actually been subclassed
-
to something that means
something completely different.
-
(Lydia) Yeah, I think
you're getting to something
-
that my team and I keep hearing
from people who reuse our data
-
quite a bit as well, right,
-
An individual data point might be great
-
but if you have to look
at the ontology and so on,
-
then it gets very...
-
And I think one of the big problems
why this is happening
-
is that a lot of editing on Wikidata
-
happens on the basis
of an individual item, right,
-
you make an edit on that item,
-
without realizing that this
might have very global consequences
-
on the rest of the graph, for example.
-
And if people have ideas around
how to make this more visible,
-
the consequences
of an individual local edit,
-
I think that would be worth exploring,
-
to show people better
what the consequence of their edit
-
that they might make in very good faith,
-
actually is.
-
Whoa! OK, let's start with, yeah, you,
then you, then you, then you.
-
(man5) Well, after the discussion,
-
just to express my agreement
with what James was saying.
-
So essentially, it seems
the most dangerous thing is the hierarchy,
-
not the hierarchy, but generally
-
the semantics of the subclass relations
seen in Wikidata, right.
-
So I've been studying languages recently,
just for the purposes of this conference,
-
and for example, you find plenty of cases
-
where a language is a part of
and subclass of the same thing, OK.
-
So you know, you can say
we have a flexible ontology.
-
Wikidata gives you freedom
to express that, sometimes.
-
Because, for example,
-
that ontology of languages
is also politically complicated, right?
-
It is even good to be in a position
to express a level of uncertainty.
-
But imagine anyone who wants
to do machine reading from that.
-
So that's really problematic.
-
And then again,
-
I don't think that ontology
was ever imported from somewhere,
-
that's something which is originally ours.
-
It's harvested from Wikipedia
in the very beginning, I would say.
-
So I wonder...
this Shape Expressions thing is great,
-
and also validating and fixing,
if you like, the Wikidata ontology
-
by external resources, beautiful idea.
-
In the end,
-
will we end up reflecting
the external ontologies in Wikidata?
-
And also, what we do with
the core part of our ontology
-
which is never harvested
from external resources,
-
how do we go and fix that?
-
And I really think that
that will be a problem on its own.
-
We will have to focus on that
independently of the idea
-
of validating ontology
with something external.
-
(man6) OK, the constraints
and shapes are very impressive,
-
what we can do with them,
-
but the main point is not
really being made clear--
-
it's that now we can make more explicit
what we expect from the data.
-
Before, everyone had to write
their own tools and scripts,
-
and now it's more visible
and we can discuss it.
-
But because it's not about
what's wrong or right,
-
it's about an expectation,
-
and you will have different
expectations and discussions
-
about how we want
to model things in Wikidata,
-
and this...
-
The current state is just
one step in the direction
-
because now you need
-
a lot of technical expertise
to get into this,
-
and we need better ways
to visualize this constraint,
-
to transform it maybe in natural language
so people can better understand,
-
but it's less about what's wrong or right.
-
(Lydia) Yeah.
-
(man7) So for quality issues,
I just want to echo it like...
-
I've definitely found a lot of the issues
I've encountered have been
-
differences in opinion
between instance of versus subclass.
-
I would say errors in those situations
-
and trying to find those
has been a very time-consuming process.
-
What I've found is like:
"Oh, if I find very high-impression items
-
that are something...
-
and then use all the subclass instances
to find all derived statements of this,"
-
this is a very useful way
of looking for these errors.
-
But I was curious if Shape Expressions,
-
if there is...
-
If this can be used as a tool
to help resolve those issues but, yeah...
-
(man8) If it has a structural footprint...
-
If it has a structural footprint
that you can...that's sort of falsifiable,
-
you can look at that
and say well, that's wrong,
-
then yeah, you can do that.
-
But if it's just sort of
trying to map it to real-world objects,
-
then you're just going to need
lots and lots of brains.
-
(man9) Hi, Pablo Mendes
from Apple Siri Knowledge.
-
We're here to find out how to help
the project and the community
-
but Cristina made the mistake
of asking what we want.
-
(laughing) So I think
one thing I'd like to see
-
is a lot around verifiability
-
which is one of the core tenets
of the project in the community,
-
and trustworthiness.
-
Not every statement is the same,
some of them are heavily disputed,
-
some of them are easy to guess,
-
like somebody's
date of birth can be verified,
-
as you saw today in the Keynote,
gender issues are a lot more complicated.
-
Can you discuss a little bit what you know
-
in this area of data quality around
trustworthiness and verifiability?
-
If there isn't a lot,
I'd love to see a lot more. (laughs)
-
(Lydia) Yeah.
-
Apparently, we don't have
a lot to say on that. (laughs)
-
(Andra) I think we can do a lot,
but I had a discussion with you yesterday.
-
My favorite example I learned yesterday
that's already deprecated
-
is if you go to the Q2, which is earth,
-
there is a statement
that claims that the earth is flat.
-
And I love that example
-
because there is a community
out there that claims that
-
and they have verifiable resources.
-
So I think it's a genuine case,
-
it shouldn't be deprecated,
it should be in Wikidata.
-
And I think Shape Expressions
can be really instrumental there,
-
because what you can say,
-
OK, I'm really interested
in this use case,
-
or this is a use case where you disagree,
-
but there can also be a use case
where you say OK, I'm interested.
-
So there is this example you say,
I have glucose.
-
And glucose when you're a biologist,
-
you don't care for the chemical
constraints of the glucose molecule,
-
you just... everything glucose
is the same.
-
But if you're a chemist,
you cringe when you hear that,
-
you have 200 something...
-
So then you can have
multiple Shape Expressions,
-
OK, I'm coming in with...
I'm at a chemist view,
-
I'm applying that.
-
And then you say
I'm from a biological use case,
-
I'm applying that Shape Expression.
-
And then when you want to collaborate,
-
yes, well you should talk
to Eric about ShEx maps.
-
And so...
but this journey is just starting.
-
But I personally I believe
that it's quite instrumental in that area.
-
(Lydia) OK. Over there.
-
(laughs)
-
(woman2) I had several ideas
from some points in the discussions,
-
so I will try not to lose...
I had three ideas so...
-
Based on what James said a while ago,
-
we have had a very, very big problem
on Wikidata since the beginning
-
with the upper ontology.
-
We talked about that
two years ago at WikidataCon,
-
and we talked about that at Wikimania.
-
Well, whenever we have a Wikidata meeting,
-
we are talking about that,
-
because it's a very big problem
at a very, very high level:
-
what an entity is, what a work is,
what a genre is, what art is,
-
these are really the biggest concepts.
-
And that's actually
a very weak point of the global ontology,
-
because people try to clean up regularly
-
and break everything down the line,
-
because yes, I think some of you
may remember the guy who in good faith
-
broke absolutely all cities in the world.
-
They were not geographical items anymore,
so constraint violations everywhere.
-
And it was in good faith
-
because he was really
correcting a mistake in an item,
-
but everything broke down.
-
And I'm not sure how we can solve that
-
because there is actually
no external institution we could just copy
-
because everyone is working on...
-
Well, if I am a performing arts database,
-
I will just go
to the performing arts level,
-
and I won't go to the philosophical concept
of what an entity is,
-
and that's actually...
-
I don't know any database
which is working at this level,
-
but that's the weakest point of Wikidata.
-
And probably,
when we are talking about data quality,
-
that's actually a big part of it, so...
-
And I think it's the same
we have stated in...
-
Oh, I am sorry, I am changing the subject,
-
but we have stated
in different sessions about quality,
-
which is that some of us
are actually doing a good modeling job,
-
are doing ShEx,
are doing things like that.
-
People don't see it on Wikidata,
they don't see the ShEx,
-
they don't see the WikiProject
on the discussion page,
-
and sometimes,
-
they don't even see
the talk pages of properties,
-
which is explicitly stating,
a), this property is used for that.
-
Like last week,
I added constraints to a property.
-
The constraint was explicitly written
-
in the discussion
of the creation of the property.
-
I just created the technical part
of adding the constraint, and someone said:
-
"What! You broke down all my edits!"
-
And he was using the property
wrongly for the last two years.
-
And the property was actually very clear,
but there were no warnings or anything,
-
and so, it's the same as the Pink Pony session
we had at Wikimania,
-
to make WikiProjects more visible
or to make ShEx more visible, but...
-
And that's what Cristina said.
-
We have a visibility problem
of what the existing solutions are.
-
And at this session,
-
we are all talking about
how to create more ShEx,
-
or to facilitate the jobs
of the people who are doing the cleanup.
-
But we have been cleaning up
since the first day of Wikidata,
-
and globally, we are losing,
and we are losing because, well,
-
I know names are complicated,
-
but if I am the only one
doing the cleaning up job,
-
and a guy added
a Latin-script name
-
to all Chinese researchers,
-
it will take me months to clean that up
and I can't do it alone,
-
and he did one massive batch.
-
So we really need...
-
we have a visibility problem
more than a tool problem, I think,
-
because we have many tools.
-
(Lydia) Right, so unfortunately,
I've got shown a sign, (laughs),
-
so we need to wrap this up.
-
Thank you so much for your comments,
-
I hope you will continue discussing
during the rest of the day,
-
and thanks for your input.
-
(applause)