Hello everyone, and welcome to the Data Quality panel.
Data quality matters because
more and more people out there
rely on our data being in good shape,
so we're going to talk about data quality,
and there will be four speakers
who will give short introductions
on topics related to data quality
and then we will have a Q and A.
And the first one is Lucas.
Thank you.
Hi, I'm Lucas, and I'm going
to start with an overview
of data quality tools
that we already have on Wikidata
and also some things
that are coming up soon.
And I've grouped them
into some general themes
of making errors more visible,
making problems actionable,
getting more eyes on the data
so that people notice the problems,
fix some common sources of errors,
maintain the quality of the existing data
and also human curation.
And the ones that are currently available
start with property constraints.
So you've probably seen this
if you're on Wikidata.
You can sometimes get these icons
which check
the internal consistency of the data.
For example,
if one event follows another,
then that other event should
also point back with "followed by,"
which was apparently missing
on the WikidataCon item.
I'm not sure,
this feature is a few days old.
And also,
if this is too limited or simple for you,
you can write any checks you want
using the Query Service,
which is useful for
lots of things, of course,
but you can also use it
for finding errors.
Like if you've noticed
one occurrence of a mistake,
then you can check
if there are other places
where people have made
a very similar error
and find that with the Query Service.
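To make that concrete, here is a minimal sketch of such an ad-hoc check, written in Python; the SPARQLWrapper package and the specific test (people whose date of death precedes their date of birth) are assumptions for illustration, not something shown in the talk.

```python
# A minimal sketch of an ad-hoc consistency check run against the
# Wikidata Query Service. Assumes the SPARQLWrapper package; the
# specific check (death date before birth date) is illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?person ?birth ?death WHERE {
  ?person wdt:P569 ?birth ;   # date of birth
          wdt:P570 ?death .   # date of death
  FILTER(?death < ?birth)
}
LIMIT 100
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])
```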
You can also combine the two
and search for constraint violations
in the Query Service,
for example,
only the violations in some area
or WikiProject that's relevant to you,
although the results are currently
not complete, sadly.
There is revision scoring.
I think this is
from the Recent Changes,
and you can also get it on your watchlist:
an automatic assessment
of whether an edit is likely to be
in good faith or in bad faith,
and whether it is likely to be
damaging or not damaging,
I think those are the two dimensions.
So you can, if you want,
focus on just looking through
the damaging but good faith edits.
If you're feeling particularly
friendly and welcoming
you can tell these editors,
"Thank you for your contribution,
here's how you should have done it
but thank you, still."
And if you're not feeling that way,
you can go through
the bad faith, damaging edits,
and revert the vandals.
There's also, similar to that,
entity scoring.
So instead of scoring an edit,
the change that it made,
you score the whole revision,
and I think that is
the same quality measure
that Lydia mentioned
at the beginning of the conference.
There's a user script, shown up here,
that gives you a score from one to five,
I think it was, for the quality
of the current item.
The primary sources tool is for
any database that you want to import
but that is not high enough quality
to add directly to Wikidata,
so you add it
to the primary sources tool instead,
and then humans can decide
whether or not to add
these individual statements.
Showing coordinates as maps
is mainly a convenience feature
but it's also useful for quality control.
Like if you see this is supposed to be
the office of Wikimedia Germany
and if the coordinates
are somewhere in the Indian Ocean,
then you know that
something is not right there
and you can see it much more easily
than if you just had the numbers.
This is a gadget called
the relative completeness indicator
which shows you this little icon here
telling you how complete
it thinks this item is
and also which properties
are most likely missing,
which is really useful
if you're editing an item
and you're in an area
that you're not very familiar with
and you don't know what
the right properties to use are,
then this is a very useful gadget to have.
And we have Shape Expressions.
I think Andra or Jose
are going to talk more about those
but basically, a very powerful way
of comparing the data you have
against the schema,
like what statement should
certain entities have,
what other entities should they link to
and what should those look like,
and then you can find problems that way.
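As a small, self-contained illustration of that idea (assuming the pyshex and rdflib Python packages; the data and shapes below are made up), a shape can require that a written work links to an author who in turn fits a human shape:

```python
# Validating a tiny hand-written graph against a Shape Expression with
# PyShEx. The example data and shapes are made up for illustration.
from rdflib import Graph
from pyshex import ShExEvaluator

DATA = """
PREFIX ex: <http://example.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

ex:book wdt:P31 wd:Q47461344 ;   # instance of: written work
        wdt:P50 ex:alice .       # author
ex:alice wdt:P31 wd:Q5 .         # the author is a human
"""

SCHEMA = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

start = @<writtenWork>
<writtenWork> { wdt:P31 [ wd:Q47461344 ] ; wdt:P50 @<human> + }
<human> { wdt:P31 [ wd:Q5 ] }
"""

graph = Graph().parse(data=DATA, format="turtle")
for r in ShExEvaluator(graph, SCHEMA, "http://example.org/book").evaluate():
    print(r.focus, "conforms" if r.result else r.reason)
```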
I think... No there is still more.
Integraality or property dashboard.
It gives you a quick overview
of the data you already have.
For example, this is from
the WikiProject Red Pandas,
and you can see that
we have a sex or gender
for almost all of the red pandas,
the date of birth varies a lot
by which zoo they come from
and we have almost
no dead pandas which is wonderful,
because they're so cute.
So this is also useful.
There we go, OK,
now for the things that are coming up.
Wikidata Bridge,
formerly known as client editing,
so editing Wikidata
from Wikipedia infoboxes
which will on the one hand
get more eyes on the data
because more people can see the data there
and it will hopefully encourage
more use of Wikidata in the Wikipedias
and that means that more
people can notice
if, for example, some data is outdated
and needs to be updated,
compared to if they could
only see it on Wikidata itself.
There is also tainted references.
The idea here is that
if you edit a statement value,
you might want to update
the references as well,
unless it was just a typo or something.
And tainted references
tells editors that,
and also lets other editors
see which edits
changed a statement value
without updating a reference,
so you can clean up after that
and decide:
does anything more need to be done,
or is it actually fine and
the reference doesn't need updating.
That's related to signed statements,
which comes from a concern, I think,
that some data providers have:
there's a statement that's referenced
to UNESCO or something,
and then suddenly,
someone vandalizes the statement,
and they are worried
that it will look like
this organization, like UNESCO,
still stands behind this vandalized value,
and so, with signed statements,
they can cryptographically
sign this reference
and that doesn't prevent any edits to it,
but at least, if someone
vandalizes the statement
or edits it in any way,
then the signature is no longer valid,
and you can tell this is not exactly
what the organization said,
and perhaps it's a good edit
and they should re-sign the new statement,
but also perhaps it should be reverted.
And also, this is going
to be very exciting, I think,
Citoid is this amazing system
they have on Wikipedia
where you can paste a URL,
or an identifier, or an ISBN
or Wikidata ID or basically
anything into the Visual Editor,
and it spits out a reference
that is nicely formatted
and has all the data you want
and it's wonderful to use.
And by comparison, on Wikidata,
if I want to add a reference
I typically have to add a reference URL,
title, author name string,
published in, publication date,
retrieved date,
at least those, and that's annoying,
and integrating Citoid into Wikibase
will hopefully help with that.
And I think
that's all the ones I had, yeah.
So now, I'm going to pass to Cristina.
(applause)
Hi, I'm Cristina.
I'm a research scientist
from the University of Zürich,
and I'm also an active member
of the Swiss Community.
When Claudia Müller-Birn
and I submitted this to the WikidataCon,
what we wanted to do
is continue our discussion
that we started
at the beginning of the year
with a workshop on data quality
and also some sessions at Wikimania.
So the goal of this talk
is basically to bring some thoughts
that we have been collecting
from the community and ourselves
and continue the discussion.
So what we would like is to continue
interacting a lot with you.
So what we think is very important
is that we continuously ask
all types of users in the community
about what they really need,
what problems they have with data quality,
not only editors
but also the people who are coding,
or consuming the data,
and also researchers who are
actually using all the edit history
to analyze what is happening.
So we did a review of around 80 tools
that exist for Wikidata,
and we aligned them with the different
data quality dimensions.
And what we saw was that
many of them focus on
monitoring completeness,
and some of them
also enable interlinking.
But there is a big need for tools
that are looking into diversity,
which is one of the things
that we actually can have in Wikidata,
especially
this design principle of Wikidata
where we can have plurality
and different statements
with different values
coming from different sources.
Because it's a secondary source,
plurality is expected,
but we don't really have tools
that tell us how many
plural statements there are,
how many we can improve and how,
and we also don't really know
all the reasons
for plurality that we can have.
So from these community meetings,
what we discussed was the challenges
that still need attention.
For example, that having
all these crowdsourcing communities
is very good because different people
attack different parts
of the data or the graph,
and we also have
different background knowledge
but actually, it's very difficult to align
everything into something homogeneous
because different people are using
different properties in different ways
and they are also expecting
different things from entity descriptions.
People also said that
they need more tools
that give a better overview
of the global status of things:
what entities are missing
in terms of completeness,
but also what people are
working on most right now,
and they also mentioned many times
wanting tighter collaboration
not only across languages
but also across WikiProjects
and the different Wikimedia platforms.
And we published
all the transcribed comments
from all these discussions
in those links here in the Etherpads
and also in the wiki page of Wikimania.
Some solutions that came up
were going in the direction
of sharing more of the best practices
that are being developed
in different WikiProjects,
but also people want tools
that help organize work in teams,
or at least understand
who is working on what,
and they were also mentioning
that they want more showcases
and more templates that help them
create things in a better way.
And from the contact that we have
with open governmental data organizations,
in particular
the canton and the city of Zürich,
which I am in contact with,
they are very interested
in working with Wikidata
because they want their data
to be accessible for everyone
in the place where people go
and consult or access data.
So for them, something that
would be really interesting
is to have some kind of quality indicators
both in the wiki,
which is already happening,
but also in SPARQL results,
to know whether or not they can trust
that data from the community.
And then, they also want to know
what parts of their own data sets
are useful for Wikidata
and they would love to have a tool that
can help them assess that automatically.
They also need
some kind of methodology or tool
that helps them decide whether
they should import or link their data
because in some cases,
they also have their own
linked open data sets,
so they don't know whether
to just ingest the data
or to keep on creating links
from the data sets to Wikidata
and the other way around.
And they also want to know where
their websites are referenced in Wikidata.
And when they run such a query
in the query service,
they often get timeouts,
so maybe we should
really create more tools
that help them get these answers
for their questions.
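A sketch of the kind of query meant here, with a placeholder domain; scanning every reference URL like this is exactly the sort of query that tends to hit the timeout, so treat it as an illustration only (it again assumes the SPARQLWrapper package):

```python
# Sketch: find statements whose reference URL (P854) points to a given
# website. The domain below is a placeholder; scanning all references
# this way is exactly the sort of query that tends to time out.
from SPARQLWrapper import SPARQLWrapper, JSON

DOMAIN = "https://www.example.org/"   # placeholder for the data provider's site

QUERY = """
SELECT ?item ?refURL WHERE {
  ?item ?p ?statement .
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL .              # reference URL
  FILTER(STRSTARTS(STR(?refURL), "%s"))
}
LIMIT 100
""" % DOMAIN

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["refURL"]["value"])
```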
And, besides that,
we wiki researchers also sometimes
lack some information
in the edit summaries.
So I remember that when
we were doing some work
to understand
the different behavior of editors
with tools or bots
or anonymous users and so on,
we were really lacking, for example,
a standard way of tracing
that tools were being used.
And there are some tools
that are already doing that
like PetScan and many others,
but maybe we in the community should
discuss more about how to record this
for fine-grained provenance.
And further on,
we think that we need to think
of more concrete data quality dimensions
that are specific to linked data
rather than all types of data,
so we worked on some measures
to actually assess the information gain
enabled by the links,
and what we mean by that
is that when we link
Wikidata to other data sets,
we should also be thinking
how much the entities are actually
gaining in the classification,
also in the description
but also in the vocabularies they use.
So just to give a very simple
example of what I mean by this:
we can think of--
in this case, it would be Wikidata
or the external data set
that is linking to Wikidata--
we have the entity for a person
called Natasha Noy,
we have the affiliation and other things,
and then we say OK,
we link to an external place,
and that entity also has a name,
but it actually has the same value.
So it would be better if we linked
to something that has a different name,
that is still valid because this person
has two ways of writing the name,
and also other information
that we don't have in Wikidata
or that we don't have
in the other data set.
But what is even better
is when we actually
look in the target data set
and they also have new ways
of classifying the information.
So not only is this a person,
but in the other data set,
they also say it's a female
or anything else that they classify with.
And if the other data set
is using many other vocabularies,
that also helps with the whole
information retrieval task.
So with that, I also would like to say
that we think that we can
showcase federated queries better
because when we look at the query log
provided by Malyshev et al.,
we see that
among the organic queries,
there are only very few federated queries.
And actually, federation is one
of the key advantages of having linked data,
so maybe the community
or the people using Wikidata
also need more examples on this.
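One hedged example of what such a showcase could look like, assuming the UniProt SPARQL endpoint (believed to be on the federation allowlist) and its up:mnemonic property; the SPARQLWrapper package is again an assumption:

```python
# Sketch of a federated query: Wikidata items with a UniProt protein ID
# (P352), enriched with the protein's mnemonic from the UniProt SPARQL
# endpoint via a SERVICE clause. Endpoint and property are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?item ?uniprotId ?mnemonic WHERE {
  ?item wdt:P352 ?uniprotId .
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?protein)
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein up:mnemonic ?mnemonic .
  }
}
LIMIT 10
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["uniprotId"]["value"], row["mnemonic"]["value"])
```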
And if we look at the list
of endpoints that are being used,
this is not a complete list
and we have many more.
Of course, this data was analyzed
from queries until March 2018,
but we should look into the list
of federated endpoints that we have
and see whether
we are really using them or not.
So two questions that
I have for the audience
that maybe we can use
afterwards for the discussion are:
what data quality problems
should be addressed, in your opinion,
given the needs that you have,
but also, where do you need
more automation
to help you with editing or patrolling?
That's all, thank you very much.
(applause)
(Jose Emilio Labra) OK,
so what I'm going to talk about
is some tools that we have been developing
related to Shape Expressions.
So this is what I want to talk about...
I am Jose Emilio Labra,
but all these tools
have been done by different people,
mainly related to the W3C
Shape Expressions Community Group,
the ShEx Community Group.
So the first tool that I want to mention
is RDFShape, this is a general tool,
because Shape Expressions
is not only for Wikidata,
Shape Expressions is a language
to validate RDF in general.
So this tool was developed mainly by me,
and it validates RDF in general.
So if you want to learn about RDF,
or you want to validate RDF
or SPARQL endpoints beyond Wikidata,
my advice is to use this tool.
Also for teaching.
I am a teacher at the university
and I use it in my Semantic Web course
to teach RDF.
So if you want to learn RDF,
I think it's a good tool.
For example, this is just a visualization
of an RDF graph with the tool.
But before coming here, in the last month,
I started a fork of rdfshape specifically
for Wikidata, because I thought...
It's called WikiShape, and yesterday,
I presented it as a present for Wikidata.
What I did was remove all the stuff
that was not related to Wikidata
and hard-code several things,
for example, the Wikidata SPARQL endpoint.
But now, someone has asked me
if I could do it for Wikibase too.
And it is very easy
to do it for Wikibase also.
So this tool, WikiShape, is quite new.
I think most of the features work,
but there are some features
that maybe don't work,
and if you try it and you want
to improve it, please tell me.
So these are [inaudible] captures,
but I think I can even try it, so let's try.
So let's see if it works.
First, I have to go out of the...
Here.
Alright, yeah. So this is the tool here.
One thing that you can do with the tool,
for example, is
check schemas, entity schemas.
You know that there is
a new namespace, which is "E whatever,"
so here, if you start writing,
for example, "human"...
As you are writing,
the autocomplete lets you find it,
and, for example,
this is the Shape Expression of a human,
and this is the Shape Expression here.
And as you can see,
this editor has syntax highlighting,
this is... well,
maybe the screen is very small.
I can try to make it bigger.
Maybe you can see it better now.
So... and this is the editor
with syntax highlighting and also has...
I mean, this editor
comes from the same source code
as the Wikidata query service.
So for example,
if you hover with the mouse here,
it shows you the labels
of the different properties.
So I think it's very helpful because now,
the entity schema editor
in Wikidata is just a plain text area,
and I think this editor is much better
because it has autocomplete
and it also has...
I mean, if you, for example,
wanted to add a constraint,
you say "wdt:"
You start writing "author"
and then you click Ctrl+Space
and it suggests the different things.
So this is similar
to the Wikidata query service
but specifically for Shape Expressions
because my feeling is that
creating Shape Expressions
is not more difficult
than writing SPARQL queries.
Some people think
that it's at the same level;
it's probably easier, I think,
because when we designed
Shape Expressions,
we designed it to be easy to work with.
OK, so this is one of the first things,
that you have this editor
for Shape Expressions.
And then you also have the possibility,
for example, to visualize.
If you have a Shape Expression,
use for example...
I think, "written work" is
a nice Shape Expression
because it has some relationships
between different things.
And this is the UML visualization
of written work.
In UML, it is easy to see
the different properties.
When you do this, I realized,
when I tried it with several people,
they find some mistakes
in their Shape Expressions,
because it's easy to detect
which properties are missing, or whatever.
Then another possibility here
is that you can also validate.
I think I have it here, the validation,
I think I had it in some tab,
maybe I closed it.
OK, but you can, for example,
you can click here, Validate entities.
You, for example,
"q42" with "e42" which is author.
With "human,"
I think we can do it with "human."
And then it's...
And it's taking a little while to do it
because this is doing the SPARQL queries
and now, for example,
it's failing because of the network, but...
So you can try it.
OK, so let's go continue
with the presentation, with other tools.
So my advice is that if you want to try it
and you want any feedback let me know.
So to continue with the presentation...
So this is WikiShape.
Then, I already said this,
the Shape Expressions Editor
is an independent project on GitHub.
You can use it in your own project.
If you want to do
a Shape Expressions tool,
you can just embed it
in any other project,
so this is on GitHub and you can use it.
Then the same author,
who is one of my students,
also created
an editor for Shape Expressions,
also inspired by
the Wikidata Query Service,
where, in a column,
you have the more visual editor
of SPARQL queries
where you can put in this kind of thing.
So this is a screen capture.
You can see
the Shape Expression as text there,
but this is a form-based Shape Expression,
which would probably take a bit longer,
where you can put the different rows
into the different fields.
OK, then there is ShExEr.
We have... it's done by one PhD student
at the University of Oviedo
and he's here, so you can present ShExEr.
(Danny) Hello, I am Danny Fernández,
I am a PhD student at the University of Oviedo
working with Labra.
Since we are running out of time,
let's make this quick,
so let's not go for an actual demo,
but just show some screenshots.
OK, so the usual way to work with
Shape Expressions or any shape language
is that you have a domain expert
who defines a priori
what the graph should look like,
defines some structures,
and then you use these structures
to validate the actual data against them.
This tool, which, like the ones
that Labra has been presenting,
is a general-purpose tool
for any RDF source,
is designed to work the other way around.
You already have some data,
you select what nodes
you want to get the shape about
and then you automatically
extract or infer the shape.
So even though this is a general-purpose tool,
what we did for this WikidataCon
is this fancy button:
if you click it,
essentially what happens
is that there are
many configuration params,
and it configures them to work
against the Wikidata endpoint.
And I will finish soon, sorry.
So, once you press this button
what you get is essentially this.
After having selected what kind of nodes,
what kind of instances of a class,
whatever you are looking for,
you get an automatic schema.
All the constraints are sorted
by how many nodes actually conform to them,
you can filter out the less common ones, etc.
So there is a poster downstairs
about this stuff
and well,
I will be downstairs and upstairs
and all over the place all day,
so if you have any further
interest in this tool,
just come speak to me during the day.
And now, I'll give back
the mic to Labra, thank you.
(applause)
(Jose) So let's continue
with the other tools.
The other tool is the ShapeDesigner.
Andra, do you want to do
the ShapeDesigner now
or maybe later or in the workshop?
There is a workshop...
This afternoon, there is a workshop
specifically for Shape Expressions, and...
The idea is that it's going to be
more hands-on,
and if you want to practice
some ShEx, you can do it there.
This tool is ShEx...
and there is Eric here,
so you can present it.
(Eric) So just super quick,
the thing that I want to say
is that you've probably
already seen the ShEx interface
that's tailored for Wikidata.
That's effectively stripped down
and tailored specifically for Wikidata
because the generic one has more features,
but I thought I'd mention it
because one of those features
is particularly useful
for debugging Wikidata schemas,
which is if you go
and you select the slurp mode,
what it does is it says
while I'm validating,
I want to pull all the triples down,
and that means
if I get a bunch of failures,
I can go through and start looking
at those failures and saying,
OK, what are the triples
that are in here,
sorry, I apologize,
the triples are down there,
this is just a log of what went by.
And then you can just sit there
and fiddle with it in real time
like you play with something
and it changes.
So it's a quicker version
for doing all that stuff.
This is a ShExC form,
this is something [Joachim] had suggested
could be useful for populating
Wikidata documents
based on a Shape Expression
for that document.
This is not tailored for Wikidata,
but this is just to say
that you can have a schema
and you can have some annotations
to say specifically how I want
that schema rendered
and then it just builds a form,
and if you've got data,
it can even populate the form.
PyShEx [inaudible].
(Jose) I think this is the last one.
Yes, so the last one is PyShEx.
PyShEx is a Python implementation
of Shape Expressions,
you can play also with Jupyter Notebooks
if you want those kind of things.
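A short sketch of using PyShEx against live Wikidata data, assuming the pyshex and sparql_slurper packages; the inline schema is a made-up, minimal "human" shape checked against Q42:

```python
# Sketch: validate a live Wikidata item against a small Shape Expression
# with PyShEx, slurping the needed triples from the SPARQL endpoint.
# The schema is a made-up, minimal "human" shape for illustration.
from pyshex import ShExEvaluator
from sparql_slurper import SlurpyGraph

SCHEMA = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

start = @<human>
<human> {
  wdt:P31 [ wd:Q5 ] ;   # instance of: human
  wdt:P569 . ? ;        # date of birth, optional
  wdt:P21 . ?           # sex or gender, optional
}
"""

endpoint = "https://query.wikidata.org/sparql"
focus = "http://www.wikidata.org/entity/Q42"   # Douglas Adams

for r in ShExEvaluator(SlurpyGraph(endpoint), SCHEMA, focus).evaluate():
    print(r.focus, "conforms" if r.result else f"fails: {r.reason}")
```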
OK, so that's all for this.
(applause)
(Andra) So I'm going to talk about
a specific project that I'm involved in
called Gene Wiki,
and where we are also
dealing with quality issues.
But before going into the quality,
maybe a quick introduction
about what Gene Wiki is,
and we just released a pre-print
of a paper that we recently have written
that explains the details of the project.
I see people taking pictures,
but basically, what Gene Wiki does,
it's trying to get biomedical data,
public data into Wikidata,
and we follow a specific pattern
to get that data into Wikidata.
So when we have a new repository
or a new data set
that is eligible
to be included into Wikidata,
the first step is community engagement.
It is not necessarily
the Wikidata community directly
but a local research community,
and we meet in person
or online or on any platform
and try to come up with a data model
that bridges their data
with the Wikidata model.
So here I have a picture of a workshop
that happened here last year
which was trying to look
at a specific data set
and, well, you see a lot of discussions,
then aligning it with schema.org
and other ontologies that are out there.
And then, at the end of the first step,
we have a whiteboard drawing of the schema
that we want to implement in Wikidata.
What you see over there,
this is just plain,
we have it in the back there
so we can make some schemas
within this panel today even.
So once we have the schema in place,
the next thing is to try to make
that schema machine-readable,
because you want to have actionable models
to bridge the data that you're bringing in
from any biomedical database
into Wikidata.
And here we are applying
Shape Expressions.
And we use that because
Shape Expressions allow you to test
whether the data set is actually--
no, to first see
whether already existing data in Wikidata
follows the same data model
that was agreed in the previous step.
So then with the Shape Expression
we can check:
OK, the data that is on this topic
in Wikidata, does it need some cleaning up,
or do we need to adapt our model
to the Wikidata model or vice versa.
Once that is in place,
we start writing bots,
and the bots seed the information
that is in the primary sources
into Wikidata.
We write these bots
with a platform called--
with a Python library
called Wikidata Integrator
that came out of our project.
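A very rough sketch of that bot pattern with the WikidataIntegrator library; the credentials, the target item (the Wikidata sandbox) and the example statement are placeholders, not the project's actual bot code:

```python
# Rough sketch of the bot pattern described here, using the
# WikidataIntegrator library. Credentials, the target item and the
# example statement (a Disease Ontology ID, P699) are placeholders.
from wikidataintegrator import wdi_core, wdi_login

login = wdi_login.WDLogin(user="ExampleBot", pwd="example-password")

# One statement taken from the primary resource.
statements = [wdi_core.WDExternalID(value="DOID:1612", prop_nr="P699")]

# Attach the data to an item (here the Wikidata sandbox, Q4115189) and write it.
item = wdi_core.WDItemEngine(wd_item_id="Q4115189", data=statements)
item.write(login)
```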
And once we have our bots,
we use a platform called Jenkins
for continuous integration.
And with Jenkins,
we continuously update
the primary sources with Wikidata.
And this is a diagram for the paper
I previously mentioned.
This is our current landscape.
So every orange box out there
is a primary resource on drugs,
proteins, genes, diseases,
chemical compounds with interactions,
and this model is too small to read now,
but these are the databases,
the sources that we manage in Wikidata
and bridge with the primary sources.
Here is such a workflow.
So one of our partners
is the Disease Ontology.
The Disease Ontology is a CC0 ontology,
and it has a curation cycle of its own,
and they just continuously
update the Disease Ontology
to reflect the disease space
or the interpretation of diseases.
And there is the Wikidata
curation cycle also on diseases
where the Wikidata community constantly
monitors what's going on on Wikidata.
And then we have two roles,
we call them colloquially
the gatekeeper curators,
and this was me
and a colleague five years ago,
where we just sit at our computers
and monitor Wikipedia and Wikidata,
and if there is an issue, it is
reported back to the primary community,
the primary resources; they look
at the implementation and decide:
OK, do we trust the Wikidata input?
Yes--then it's considered,
it goes into the cycle,
and the next iteration
is part of the Disease Ontology
and fed back into Wikidata.
We're doing the same for WikiPathways.
WikiPathways is a MediaWiki-inspired
pathway repository.
Same story, there are different
pathway resources on Wikidata already.
There might be conflicts
between those pathway resources
and these conflicts are reported back
by the gatekeeper curators
to that community,
and so the individual
curation cycles are maintained.
But if you remember the previous cycle,
here I mentioned
only two cycles, two resources,
we have to do that
for every single resource that we have
and we have to manage what's going on
because when I say curation,
I really mean going
to the Wikipedia talk pages,
going into the Wikidata talk pages
and trying to do that.
That doesn't scale for
the two gatekeeper curators we had.
So when I was at a conference in 2016
where Eric gave a presentation
on Shape Expressions,
I jumped on the bandwagon and said OK,
Shape Expressions can help us
detect differences in Wikidata,
and that allows the gatekeepers
to do more efficient reporting.
So this year,
I was delighted by the entity schemas,
because now, we can store
those entity schemas on Wikidata,
on Wikidata itself,
whereas before, it was on GitHub,
and this aligns
with the Wikidata interface,
so you have things
like discussion pages,
but you also have revisions.
So you can leverage the talk pages
and the revisions in Wikidata
to discuss
what is in Wikidata
and what is in the primary resources.
So with what Eric just presented,
this is already quite a benefit.
So here, we made up a Shape Expression
for the human gene,
and then we ran it through simple ShEx,
and as you can see,
we already got ni--
there is one issue
that needs to be monitored,
which is that there is an item
that doesn't fit that schema,
and then you can already sort of
create entity schema curation reports
based on that... and send that
to the different curation reports.
But ShEx.js is a web interface,
and, if I can show back here,
I can only do ten items,
but we have tens of thousands,
and so that again doesn't scale.
So the Wikidata Integrator now
has ShEx support as well,
and then we can just loop over items
where we get yes-no,
yes-no, true-false, true-false.
So again,
increasing a bit of the efficiency
of dealing with the reports.
But that builds
on the Wikidata Query Service,
and, well, we have recently been throttled,
so again, that doesn't scale.
So it's still an ongoing process,
how to deal with models on Wikidata.
And so again,
ShEx is not only intimidating,
but the scale is also just
too big to deal with.
So I started working on this, my first
proof of concept or exercise,
where I used a tool called yEd,
and I started to draw
those Shape Expressions, and because...
and then regenerate this schema
into the JSON format
of Shape Expressions,
so that would already open things up
to the audience
that is intimidated
by the Shape Expressions language.
But actually, there is a problem
with those visual descriptions
because this is also a schema
that was actually drawn in yEd by someone.
And here is another one
which is beautiful.
I would love to have this on my wall,
but it is still not interoperable.
So I want to end my talk with,
and the first time, I've been
stealing this slide, using this slide.
It's an honor to have him in the audience
and I really like this:
"People think RDF is a pain
because it's complicated.
The truth is even worse, it's so simple,
because you have to work
with real-world data problems
that are horribly complicated.
While you can avoid RDF,
it is harder to avoid complicated data
and complicated computer problems."
This is about RDF, but I think
it applies to modeling just as well.
So my point of discussion
is should we really...
How do we get modeling going?
Should we discuss ShEx
or visual models or...
How do we continue?
Thank you very much for your time.
(applause)
(Lydia) Thank you so much.
Would you come to the front
so that we can open
the questions from the audience.
Are there questions?
Yes.
And I think, for the camera, we need to...
(Lydia laughing) Yeah.
(man3) So a question
for Cristina, I think.
So you mentioned exactly
the term "information gain"
from linking with other systems.
There is an information-theoretic measure,
using statistics and probability,
called information gain.
Do you have the same...
I mean, did you mean exactly that measure,
the information gain
from probability theory,
from information theory,
or did you just use this concept
to measure information gain in some way?
(Cristina) No, so we actually defined
and implemented measures
that are using the Shannon entropy,
so it's meant as that.
I didn't want to go into
details of the concrete formulas...
(man3) No, no, of course,
that's why I asked the question.
- (Cristina) But yeah...
- (man3) Thank you.
(man4) I'll make more
of a comment than a question.
(Lydia) Go for it.
(man4) So there's been
a lot of focus at the item level
about quality and completeness,
one of the things that concerns me is that
we're not applying the same to hierarchies,
and I think we have an issue
in that our hierarchy often isn't good.
We're seeing
this is going to be a real problem
with Commons searching and other things.
One of the things that we can do
is to import the way
that external thesauruses
structure their hierarchies,
using the P4900
broader concept qualifier.
But what I think would be really helpful
would be much better tools for doing that
so that you can import an
external thesaurus's hierarchy
and map that onto our Wikidata items.
Once it's in place
with those P4900 qualifiers,
you can actually do some
quite good querying through SPARQL
to see where our hierarchy
diverges from that external hierarchy.
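A hedged sketch of such a divergence query: it looks for any statement carrying a P4900 "broader concept" qualifier whose value is not reachable from the item via subclass of; in practice it would need narrowing to a single thesaurus property to avoid timeouts (SPARQLWrapper is assumed):

```python
# Sketch: find items where the external thesaurus's broader concept
# (P4900 qualifier) is not reachable from the item via subclass of
# (P279), i.e. where our hierarchy diverges from the external one.
# In practice you would restrict ?p to one thesaurus's ID property.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?externalBroader WHERE {
  ?item ?p ?statement .
  ?statement pq:P4900 ?externalBroader .
  FILTER NOT EXISTS { ?item wdt:P279* ?externalBroader . }
}
LIMIT 100
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], "->", row["externalBroader"]["value"])
```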
For instance, [Paula Morma],
user PKM, you may know,
does a lot of work on fashion.
So we use that to pull in the Europeana
Fashion Thesaurus's hierarchy
and the Getty AAT
fashion thesaurus hierarchy,
and then see where the gaps
were in our higher level items,
which is a real problem for us
because often,
these are things that only exist
as disambiguation pages on Wikipedia,
so we have a lot of higher level items
in our hierarchies missing
and this is something that we must address
in terms of quality and completeness,
but what would really help
would be better tools than
the jungle of pull scripts that I wrote...
If somebody could put that
into a PAWS notebook in Python
to be able to take an external thesaurus,
take its hierarchy,
which may well be available
as linked data or may not,
to then put those into
QuickStatements to add P4900 values.
And then later,
when our representation
gets more complete,
to update those P4900s
because as our representation gets updated,
becomes more dense,
the values of those qualifiers
need to change
to represent that we've got more
of their hierarchy in our system.
If somebody could do that,
I think that would be very helpful,
and we do need to also
look at other approaches
to improve quality and completeness
at the hierarchy level
not just at the item level.
(Andra) Can I add to that?
Yes, and we actually do that,
and I can recommend looking at
the Shape Expression that Finn made
with the lexical data
where he creates Shape Expressions
that then build on other Shape Expressions,
so you have this concept
of linked Shape Expressions in Wikidata,
and specifically, the use case,
if I understand correctly,
is exactly what we are doing in Gene Wiki.
So you have the Disease Ontology
which is put into Wikidata
and then disease data comes in
and we apply the Shape Expressions
to see if that fits with this thesaurus.
And there are other thesauruses or other
ontologies for controlled vocabularies
that still need to go into Wikidata,
and that's exactly why
Shape Expression is so interesting
because you can have a Shape Expression
for the Disease Ontology,
you can have a Shape Expression for MeSH,
you can say: OK,
now I want to check the quality.
Because you also have
in Wikidata the context
of when you have a controlled vocabulary,
you say the quality is according to this,
but you might have
a disagreeing community.
So the tooling is indeed in place
but now is indeed to create those models
and apply them
on the different use cases.
(man4) Shape Expressions are very useful
once you have the external ontology
mapped into Wikidata,
but my problem is getting to that stage:
it's working out how much of the
external ontology isn't yet in Wikidata
external ontology isn't yet in Wikidata
and where the gaps are,
and that's where I think that
having much more robust tools
to see what's missing
from external ontologies
would be very helpful.
(Andra) The biggest problem there
is not so much tooling
but more licensing.
So getting the ontologies
into Wikidata is actually a piece of cake
but most of the ontologies have,
how can I say that politely,
restrictive licensing,
so they are not compatible with Wikidata.
(man4) There's a huge number
of public sector thesauruses
in cultural fields.
- (Andra) Then we need to talk.
- (man4) Not a problem.
(Andra) Then we need to talk.
(man5) Just... the comment I want to make
is actually an answer to James,
so the thing is that
hierarchies make graphs,
and when you want to...
I want to basically talk about...
a common problem in hierarchies
is circular hierarchies,
where they come back to each other,
which is a problem,
because you should not
have that in hierarchies.
This, funnily enough,
happens a lot in categories in Wikipedia,
we have a lot of cycles in categories,
but the good news is that this is...
Technically, it's an NP-complete problem,
so you cannot find these
easily if you build a graph of that,
but there are lots of ways
that have been developed
to find problems
in these hierarchy graphs.
Like there is a paper
called Finding Cycles...
Breaking Cycles in Noisy Hierarchies,
and it's been used to help
categorization of English Wikipedia.
You can just take this
and apply it to these hierarchies in Wikidata,
and then you can find
things that are problematic
and just remove the ones
that are causing issues
and find the issues, actually.
So this is just an idea, just so you...
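A rough sketch of that idea, assuming the SPARQLWrapper and networkx packages: pull the subclass of edges below some root item (Q7725634, literary work, is only an example), build a directed graph, and list the cycles:

```python
# Sketch: detect cycles in part of the subclass-of (P279) hierarchy.
# Assumes SPARQLWrapper and networkx; the root item (Q7725634,
# literary work) is only an example and the approach is approximate.
import networkx as nx
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?child ?parent WHERE {
  ?child wdt:P279 ?parent .
  ?child wdt:P279* wd:Q7725634 .   # stay inside one subtree
}
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
rows = endpoint.query().convert()["results"]["bindings"]

graph = nx.DiGraph()
graph.add_edges_from((r["child"]["value"], r["parent"]["value"]) for r in rows)

for cycle in nx.simple_cycles(graph):
    print(" -> ".join(cycle))
```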
(man4) That's all very well
but I think you're underestimating
the number of bad subclass relations
that we have.
It's like having a city
in completely the wrong country,
and there are tools for geography
to identify that,
and we need to have
much better tools in hierarchies
to identify where the equivalent
of the item for the country
is missing entirely,
or where it's actually been subclassed
to something that means
something completely different.
(Lydia) Yeah, I think
you're getting to something
that my team and I keep hearing
from people who reuse our data
quite a bit as well, right:
an individual data point might be great,
but if you have to look
at the ontology and so on,
then it gets very...
And I think one of the big problems
why this is happening
is that a lot of editing on Wikidata
happens on the basis
of an individual item, right,
you make an edit on that item,
without realizing that this
might have very global consequences
on the rest of the graph, for example.
And if people have ideas around
how to make this more visible,
the consequences
of an individual local edit,
I think that would be worth exploring,
to show people better
what the consequence of their edit
that they might do in very good faith,
what that is.
Whoa! OK, let's start with, yeah, you,
then you, then you, then you.
(man5) Well, after the discussion,
just to express my agreement
with what James was saying.
So essentially, it seems
the most dangerous thing is the hierarchy,
not the hierarchy, but generally
the semantics of the subclass relations
seen in Wikidata, right.
So I've been studying languages recently,
just for the purposes of this conference,
and for example, you find plenty of cases
where a language is a part of
and subclass of the same thing, OK.
So you know, you can say
we have a flexible ontology.
Wikidata gives you freedom
to express that, sometimes.
Because, for example,
that ontology of languages
is also politically complicated, right?
It is even good to be in a position
to express a level of uncertainty.
But imagine anyone who wants
to do machine reading from that.
So that's really problematic.
And then again,
I don't think that ontology
was ever imported from somewhere,
that's something which is originally ours.
It was harvested from Wikipedia
in the very beginning, I would say.
So I wonder...
this Shape Expressions thing is great,
and also validating and fixing,
if you like, the Wikidata ontology
by external resources, beautiful idea.
In the end,
will we end up just reflecting
the external ontologies in Wikidata?
And also, what do we do with
the core part of our ontology
which is never harvested
from external resources,
how do we go and fix that?
And I really think that
that will be a problem on its own.
We will have to focus on that
independently of the idea
of validating ontology
with something external.
(man6) OK, constraints
and shapes are very impressive,
what we can do with them,
but the main point is not
really being made clear:
it's that now we can make more explicit
what we expect from the data.
Before, everyone had to write
their own tools and scripts;
now it's more visible
and we can discuss it.
But because it's not about
what's wrong or right,
it's about an expectation,
and you will have different
expectations and discussions
about how we want
to model things in Wikidata,
and this...
The current state is just
one step in that direction,
because right now you need
a lot of technical expertise
to get into this,
and we need better ways
to visualize these constraints,
to transform them maybe into natural language
so people can understand them better,
but it's less about what's wrong or right.
(Lydia) Yeah.
(man7) So for quality issues,
I just want to echo it like...
I've definitely found a lot of the issues
I've encountered have been
differences in opinion
between instance of versus subclass.
I would say errors in those situations
and trying to find those
has been a very time-consuming process.
What I've found is like:
"Oh, if I find very high-impression items
that are something...
and then use all the subclass instances
to find all derived statements of this,"
this is a very useful way
of looking for these errors.
But I was curious if Shape Expressions,
if there is...
If this can be used as a tool
to help resolve those issues but, yeah...
(man8) If it has a structural footprint...
If it has a structural footprint
that you can...that's sort of falsifiable,
you can look at that
and say well, that's wrong,
then yeah, you can do that.
But if it's just sort of
trying to map it to real-world objects,
then you're just going to need
lots and lots of brains.
(man9) Hi, Pablo Mendes
from Apple Siri Knowledge.
We're here to find out how to help
the project and the community
but Cristina made the mistake
of asking what we want.
(laughing) So I think
one thing I'd like to see
is a lot around verifiability
which is one of the core tenets
of the project in the community,
and trustworthiness.
Not every statement is the same,
some of them are heavily disputed,
some of them are easy to guess,
like somebody's
date of birth can be verified,
as you saw today in the Keynote,
gender issues are a lot more complicated.
Can you discuss a little bit what you know
in this area of data quality around
trustworthiness and verifiability?
If there isn't a lot,
I'd love to see a lot more. (laughs)
(Lydia) Yeah.
Apparently, we don't have
a lot to say on that. (laughs)
(Andra) I think we can do a lot,
but I had a discussion with you yesterday.
My favorite example, which I learned yesterday
is already deprecated,
is that if you go to Q2, which is Earth,
there is a statement
that claims that the earth is flat.
And I love that example
because there is a community
out there that claims that
and they have verifiable resources.
So I think it's a genuine case,
it shouldn't be deprecated,
it should be in Wikidata.
And I think Shape Expressions
can be really instrumental there,
because what you can say,
OK, I'm really interested
in this use case,
or this is a use case where you disagree,
but there can also be a use case
where you say OK, I'm interested.
So there is this example:
say I have glucose.
And glucose, when you're a biologist,
you don't care about the chemical
constraints of the glucose molecule,
you just... everything glucose
is the same.
But if you're a chemist,
you cringe when you hear that,
you have 200 something...
So then you can have
multiple Shape Expressions,
OK, I'm coming in with...
I have a chemist's view,
I'm applying that.
And then you say
I'm from a biological use case,
I'm applying that Shape Expression.
And then when you want to collaborate,
yes, well you should talk
to Eric about ShEx maps.
And so...
but this journey is just starting.
But I personally I believe
that it's quite instrumental in that area.
(Lydia) OK. Over there.
(laughs)
(woman2) I had several ideas
from some points in the discussions,
so I will try not to lose...
I had three ideas so...
Based on what James said a while ago,
we have a very, very big problem
on Wikidata since the beginning
for the upper ontology.
We talked about that
two years ago at WikidataCon,
and we talked about that at Wikimania.
Well, whenever we have a Wikidata meeting
we are talking about that,
because it's a very big problem
at a very, very high level:
what an entity is, what a work is,
what a genre is, what art is,
these are really the biggest concepts.
And that's actually
a very weak point of the global ontology,
because people try to clean up regularly
and break everything down the line,
because, yes, I think some of you
may remember the guy who, in good faith,
broke absolutely all cities in the world.
They were not geographical items anymore,
so there were constraint violations everywhere.
And it was in good faith
because he was really
correcting a mistake in an item,
but everything broke down.
And I'm not sure how we can solve that
because there is actually
no external institution we could just copy
because everyone is working on...
Well, if I am a performing arts database,
I will just work
at the performing arts level,
and I won't go to the philosophical concept
of what an entity is,
and that's actually...
I don't know of any database
that is working at this level,
but that's the weakest point of Wikidata.
And probably,
when we are talking about data quality,
that's actually a big part of it, so...
And I think it's the same thing
we have stated in...
Oh, I am sorry, I am changing the subject,
but we have stated
in different sessions about quality,
which is that some of us
are doing a good modeling job,
are doing ShEx,
are doing things like that.
People don't see it on Wikidata,
they don't see the ShEx,
they don't see the WikiProject
on the discussion page,
and sometimes,
they don't even see
the talk pages of properties,
which explicitly state,
a) this property is used for that.
Like last week,
I added constraints to a property.
The constraint was explicitly written
in the discussion
of the creation of the property.
I just created the technical part
of adding the constraint, and someone:
"What! You broke down all my edits!"
And he had been using the property
wrongly for the last two years.
And the property was actually very clear,
but there were no warnings and everything,
and so, it's the same pink pony
we asked for at Wikimania,
to make WikiProjects more visible
or to make ShEx more visible, but...
And that's what Cristina said.
We have a visibility problem
of what the existing solutions are.
And at this session,
we are all talking about
how to create more ShEx,
or to facilitate the jobs
of the people who are doing the cleanup.
But we have been cleaning up
since the first day of Wikidata,
and globally, we are losing,
and we are losing because, well,
I know names are complicated,
but if I am the only one
doing the cleanup job,
and the guy who added
Latin-script names
to all the Chinese researchers
did one massive batch,
it will take me months to clean that up,
and I can't do it alone.
So we really need...
we have a visibility problem
more than a tool problem, I think,
because we have many tools.
(Lydia) Right, so unfortunately,
I've been shown a sign, (laughs)
so we need to wrap this up.
Thank you so much for your comments,
I hope you will continue discussing
during the rest of the day,
and thanks for your input.
(applause)