(woman) Hello everyone.
Thank you for being here this afternoon.
We are first going to hear--
I'm just going to jump straight in
to give him plenty of time--
so we're first going to hear from
Peter Patel-Schneider
about barriers to using Wikidata
as a knowledge base.
(Peter) Thank you.
I'll skip over the abstract
because you've already seen it all.
And I should say a little bit
about myself.
I'm much more of a user of Wikidata
than an actual editor of Wikidata,
and much more of a user of Wikidata
than somebody who contributes to Wikidata,
but I very much believe in
the aims of Wikidata.
In particular, it aligns
with my research areas
which is knowledge representation,
at least in a certain sense.
I worked in description logics
for a long time, worked with W3C.
I've worked in Silicon Valley for a while,
largely building what might be called
knowledge graphs,
but I don't like the term
knowledge graphs--
I don't like what they mean,
I want to do something better
than knowledge graphs.
And I want to put this together
from various sources.
So Wikidata is a very, very good one,
but DBpedia is not so good.
Freebase is dead.
OpenStreetMap,
Open Movie Database, things like that.
And then I want to use
this store of knowledge
to do something.
And I want to use it as the source
of knowledge to do something,
and not only just facts
but also organizing my knowledge.
And currently, working where I am,
we're interested in supporting
conversational agents.
Not just things that let you play Avatar,
but let you play the movie
that's directed by the wife
of the director of Avatar.
So how can we build a conversational agent
that will do something like that?
Well, you need to know
all the facts that go behind it,
but you also need to know
that the fact that there are movies--
not just, we have Avatar,
but that we have movies--
we need to know things about movies,
we need to know things
about directorships.
We need to know things about humans--
that they're married to each other.
We need to know that there are men
and women in the world,
and somehow be able to use
this knowledge of what we're saying
to come up with the actual reference
to these things,
and then actually do
what we were asked to do.
So, though it's one end,
the other thing
that we want to be able to do
is if you think of systems like Siri,
there are hundreds or thousands--
actually, maybe Siri's not
the best example.
The Amazon system has hundreds
or thousands of little programs
that will do something for you.
And the problem that we're interested in
is how do you pick
which one can do something.
So for example, which back-end
can find me train trips
between San Francisco and Palo Alto.
There may be many systems
that will try and sell me train tickets,
but only one or perhaps two of them
will sell me that particular train ticket.
And how do I get the system to do that
without having to be able to tell it
that I want a Caltrain ticket.
So, what happens is I want to use Wikidata
as the source of a lot of this stuff,
and I regularly run into problems.
And from those problems,
I have a bunch of suggestions.
You may agree with my suggestions
or disagree with them.
Some of them are kind of on their way
to being implemented in Wikidata,
some of them aren't.
So, I'm going to do this talk
from the back forward.
I'm going to give you the summary,
and then an expansion of the summary,
and then some rationale
for my suggestions.
And the reason I'm going to do that
is if I started with all of the rationale,
I might never get to the end,
and the end is the important thing,
at least in my viewpoint.
So, my biggest suggestion, I guess,
on the community side is,
gee, guys, speak with a single voice.
(chuckles)
And speak with a voice
where I can find it.
So, it turns out
that one of my suggestions
is actually implemented,
but I only found out about it today,
because it's not used very much at all,
and it's hard to find it.
So, I really want you guys--
and me too, in some sense--
to spend some effort at the beginning
when you're creating these classes
and other things that are important,
so that a poor user like me,
who can't afford to go through five years
of impassioned discussion
to find out what male actually is,
can actually use it in our system--
in my system.
So that's sort of on the community side.
I'm a formalist.
I really want to--
and my programs are dumb.
I don't write smart programs,
I write dumb programs.
Now, they tend to be
very fancy dumb programs,
but these dumb programs
can't really handle all of the shades
of everything that you have
with start time, end time, inception.
I want to have some simple
formal mechanism
that will tell my program what's true now,
or what's true in 1987,
without having to search through
a bunch of things,
and make a bunch of guesses,
and use a lot of heuristics,
or have a machine-learning program
that's done for this particular task.
I just want you to tell me
this stuff somehow, and have a take.
So, I want to be able
to look at something which says
what the things I see in Wikidata
actually mean.
And I don't find that these days.
And then, of course, once we have that,
I want somebody--
I'm willing to do some of this work--
build tools that actually use
that formal description and say,
tell me, for example,
if I'm an instance
of architectural structure,
like the Eiffel Tower,
am I a geographic location?
I don't know.
I mean, Wikidata doesn't tell me
whether this is true or not.
I can find nowhere in Wikidata
that will do that,
because there's no formal thing.
But once you give me a formal thing
then I'm going to write a tool,
which essentially gives the implications
of what the formal things are.
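
The kind of implication tool described here can be sketched as follows. This is a minimal Python illustration over a hand-written toy hierarchy; the subclass link from architectural structure to geographic location is an assumption made for the example, not something Wikidata is known to state, and the function names are hypothetical.

```python
from collections import deque

# Toy "subclass of" links, invented for illustration only.
SUBCLASS_OF = {
    "architectural structure": {"geographic location"},  # assumed link
    "geographic location": set(),
}

# Toy "instance of" statements, also invented.
INSTANCE_OF = {"Eiffel Tower": {"architectural structure"}}


def implied_classes(item):
    """Return every class the item belongs to, following subclass links."""
    seen = set()
    queue = deque(INSTANCE_OF.get(item, ()))
    while queue:
        cls = queue.popleft()
        if cls in seen:
            continue  # already visited; also guards against cycles
        seen.add(cls)
        queue.extend(SUBCLASS_OF.get(cls, ()))
    return seen


def is_a(item, cls):
    """Yes-or-no answer: is the item an instance of cls, stated or implied?"""
    return cls in implied_classes(item)
```

With these toy links, is_a("Eiffel Tower", "geographic location") gives a definite yes. Whether that is the right answer depends entirely on the formal links that are declared, which is the point being made.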
The fourth suggestion is about bots.
Bots are great.
Bots have ultimate power
and as has been said,
with ultimate power,
comes ultimate responsibility.
And I don't believe that bots get
very much responsibility
for the things that they do,
and they need to have.
We need to be able to control the bots
and figure out what they've done wrong,
and essentially, once a bot
makes a thousand mistakes,
we want to undo that once,
as opposed to undoing that
a thousand times.
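
One way to picture "undo that once" is a log in which every bot edit carries an identifier for the run that produced it. The sketch below is Python over an in-memory dictionary; the Edit record, the batch_id field and the undo_batch function are all hypothetical names for illustration and do not describe any existing Wikidata mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    batch_id: str      # hypothetical tag shared by all edits from one bot run
    item: str
    prop: str
    old_value: object  # None means the statement did not exist before
    new_value: object


def undo_batch(store, log, batch_id):
    """Revert every edit of one batch, newest first; return how many were undone."""
    undone = 0
    for edit in reversed(log):
        if edit.batch_id != batch_id:
            continue
        key = (edit.item, edit.prop)
        # Only restore if nobody has changed the value since the bot wrote it.
        if store.get(key) == edit.new_value:
            if edit.old_value is None:
                del store[key]
            else:
                store[key] = edit.old_value
            undone += 1
    return undone
```

A single call such as undo_batch(store, log, "run-1") then reverts the whole run, while leaving alone any value that a later editor has already corrected.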
Of course, as I said,
these are my suggestions.
Other people
may have different suggestions.
I'm coming at it from a user viewpoint.
I suppose I could say something like,
I'm coming at it from a binary viewpoint.
I mean, this is a program
that really wants yes or no answers.
It doesn't understand much
in shades of gray.
So, I would really like you to tell me
what's true and what's not true.
So, that's the end of the talk, right?
(laughs)
And I sort of expanded on some things
but let me-- oops,
where are we, here, yes.
So, here let me expand upon
the things that I said.
So formally,
I really want a logic for Wikidata
because that lets me know
what Wikidata means to me.
I don't want to have data structure
with some sort of English description
somewhere that tells me something.
I want a formal statement of what this is.
And maybe it produces the wrong answers,
in which case we fix it,
but at least we know
what the answers are supposed to be,
as opposed to having to go through
five or ten different pages
of people arguing with each other
what this particular part
of Wikidata means.
So, in particular, I want to have things
that I think are useful,
like disjointness.
I want Wikidata to say that
rocks aren't humans,
to pick an example.
Now, there's lots of that stuff
in Wikidata at the moment.
There's lots of this
opposite from things,
but what does it mean?
Somebody who's an opposite--
there was something this morning
about transgender man
is the opposite of transgender woman.
Yes, in some sense,
but in what sense are they opposites?
It's not a logical sense,
it's something else.
I want to give definitions of classes
and to give an example,
I would very much like Wikidata to say
that "woman" is adult, female, human,
because if I query Wikidata--
this is going to the end--
and I ask how many women are in Wikidata,
I get... any guesses?
(woman) Less than men.
(Peter) Thirty-seven.
Less than men.
Thirty-seven.
Instances of "woman" in Wikidata-- 37.
That's obviously wrong.
Obviously, obviously wrong.
I know it, you know it,
but my program doesn't know it.
My program says 37--
well, it's not zero.
So it might be right.
I would much prefer
there to be something on "woman"
that says, "Hey, if you're trying
to figure out the women in Wikidata,
don't look at the things
that are stated to be instances of 'woman,'
look at things, well, a SPARQL query
or something like that,
find all the humans, find the female one,
the ones with sex or gender
which is female or female-ish.
That's kind of difficult there,
and then the ones that are adult--
whatever adult means--
at least that's a definition.
We can argue whether
it's the right definition or not.
But we get a number which is not 37,
much better than 37.
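
The difference between counting stated instances and applying a definition can be sketched in Python. The records below are invented toy data, and the age threshold of 18 is an assumption, since the talk leaves "adult" open; the function names are hypothetical.

```python
from datetime import date

ADULT_AGE = 18  # assumption: the talk does not say what "adult" means

# Toy records invented for the example.
ITEMS = [
    {"id": "a", "instance_of": {"human"}, "sex_or_gender": "female", "born": date(1950, 5, 1)},
    {"id": "b", "instance_of": {"human"}, "sex_or_gender": "female", "born": date(2015, 5, 1)},
    {"id": "c", "instance_of": {"human"}, "sex_or_gender": "male", "born": date(1960, 5, 1)},
    {"id": "d", "instance_of": {"woman"}, "sex_or_gender": None, "born": None},
    {"id": "e", "instance_of": {"human"}, "sex_or_gender": "female", "born": date(1980, 5, 1)},
]


def age_on(born, today):
    """Whole years between a birth date and today."""
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


def stated_women(items):
    """Items directly stated to be instances of 'woman'."""
    return [i["id"] for i in items if "woman" in i["instance_of"]]


def defined_women(items, today):
    """Items matching the definition: human, female, and adult."""
    return [
        i["id"]
        for i in items
        if "human" in i["instance_of"]
        and i["sex_or_gender"] == "female"
        and i["born"] is not None
        and age_on(i["born"], today) >= ADULT_AGE
    ]
```

On this toy data the stated count is one item and the defined count is two, and the two sets do not even overlap. The definition can be argued over, but it gives a program something definite to evaluate.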
So, I want this so that
we can actually come up with answers
to some of these questions.
So, and again, tools--
I would really like to have tools
that show implications of claims.
So, that shows that
the Eiffel Tower is a location.
Whether it is or not in the real world,
is somehow kind of irrelevant.
We can argue whether the Eiffel Tower
is a location or has a location.
Philosophers probably
have argued for decades
over whether this is the case or not.
I don't care.
Just come up with an answer that makes
at least a little bit of sense,
and I'll be happy.
So, I want a tool that'll do that.
I want, essentially,
a tool that will tell me
what's true at a particular time.
So, how big is the Aral Sea?
It's certainly not 22,000 square miles.
It's much, much smaller than that,
but the claims on the Aral Sea
are historical claims.
What's true now?
I think, 3,000 square miles.
Anyway, it's a mere puddle
of its former self, you might say.
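
A "what's true at time X" interface could look roughly like the Python sketch below. It assumes each claim carries an optional start year and end year, and it deliberately ignores every other temporal qualifier; the Statement record, the true_at function and the numbers are invented placeholders, not real measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    value: float
    start: Optional[int] = None  # first year the claim holds, if stated
    end: Optional[int] = None    # last year the claim holds, if stated


def true_at(statements, year):
    """Return the values whose [start, end] interval contains the given year."""
    return [
        s.value
        for s in statements
        if (s.start is None or s.start <= year)
        and (s.end is None or year <= s.end)
    ]


# Placeholder area claims for some lake; values and years are made up.
AREA_CLAIMS = [
    Statement(100.0, end=1999),
    Statement(10.0, start=2000),
]
```

Here true_at(AREA_CLAIMS, 1987) returns only the older value and true_at(AREA_CLAIMS, 2019) only the newer one, so the consumer never has to inspect the qualifiers directly.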
I would also like tools
that help in cleaning the data.
So, what are inconsistencies?
Is there something
that's both a rock and a human.
Well, right now,
is that a problem in Wikidata?
Well, there are these
constraint mechanisms,
but they're kind of weak,
and they're not used very well
in many places.
So, I would really like to have some tool
which essentially says, "No!
You can't have a rock and a human!
You can have, perhaps,
a human and a Klingon,
but rocks and humans, just, no."
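
A disjointness check of this kind can be sketched in Python. The declared pair and the items are toy data invented for the example, the names are hypothetical, and the check looks only at directly stated classes; a fuller tool would first expand each item's classes through the subclass hierarchy.

```python
from itertools import combinations

# Pairs of classes declared to have no common instances (toy declaration).
DISJOINT = {frozenset({"rock", "human"})}


def disjointness_violations(item_classes):
    """Return (item, class_a, class_b) for every item in two disjoint classes.

    item_classes maps an item name to the set of classes stated for it.
    """
    violations = []
    for item, classes in sorted(item_classes.items()):
        for a, b in combinations(sorted(classes), 2):
            if frozenset({a, b}) in DISJOINT:
                violations.append((item, a, b))
    return violations
```

An item stated to be both human and Klingon passes, because no disjointness is declared for that pair, while an item stated to be both rock and human is reported.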
There's an old science fiction story
called The God Makers
where they take a rock [inaudible],
make it into a God,
so maybe a rock
could be a person in that sense.
But human, no.
Hm?
(man) Are you asking for
exhaustive disjunction?
[inaudible]
(Peter) No, I'm not asking for
exhaustive decompositions.
Just junctions.
I mean, in some sense--
In what?
(woman) That's undecidable.
(Peter) What? No, well,
you mean not logically.
So, the question is
whether we can actually,
can have exhaustive definition,
exhaustive disjunctions?
Well...
(man) That's pricey, right?
To find out that bots are... yeah.
(man 2) To say that rocks
are disjoint from humans is easy,
but to do that in all the cases
you're going to want it, is--
(Peter) It's computation.
Yes, now we have a problem
with computational costs, right?
Yeah.
The computational cost of deciding it
for Wikidata as it exists right now,
is not impossible,
it's just computationally non-trivial.
So given that the query service
is running out of [inaudible],
so to do this right, requires tools
that actually think a little bit.
And that's going to require computation.
How much computation?
Well, it's not the heat death
of the universe,
it's tomorrow, perhaps,
or two seconds from now.
But two seconds times
how many million things are in Wikidata
is getting to be a reasonably big number.
One of the things you can do
is this thing doesn't have to be
completely run in one thing.
You can farm these out into other systems.
We don't have to have everything
all in one computer.
And, of course,
Google just gave us the answer.
We can just put it on
this new Google quantum computer,
and it'll do everything forever.
(woman) But it sounds like
you're asking for OWL, and--
(Peter) No, I'm asking for part of OWL.
(woman) You've been asking for
a lot of things about OWL,
and that just is not possible.
That's why Wikidata works,
is because it's not OWL.
There are actually things
that you can compute with.
(Peter) So, I am asking for
a bigger part of OWL,
not all of it, yeah?
Well, I mean, so the question is,
is Wikidata going to spend the effort
to buy another, perhaps, ten computers
to crunch away on this permanently,
or is it going to spend the effort
of having a whole bunch of people
argue about it, or whatever.
And my view is computers are dirt cheap.
I mean, I'm willing to pony up
some of my very own money
to buy Wikidata another computer
to do this stuff,
because I think it's important.
(man) [inaudible]
Yes. (laughs)
I didn't say I would give it
to Wikimedia Foundation.
But I'm not asking for things
that are trivial.
I'm asking for things
that require compute power,
that require intellectual power,
that require the community to do things.
The community is doing
some of these things.
I found out that there is this property
which essentially says,
"Hey, here's how
you're supposed to use this thing."
I forget the exact name of it.
User instructions,
I thought it was three words.
Whatever, anyway,
it essentially says-- and it's on male.
And there was a big argument about it.
The trouble is it's not supported at all.
There was this plan to have this property
and have it supported,
to have it show up everywhere,
so that people would realize
that human-- in other words,
you don't use person for humans,
right now it's stuck on the description.
And it's stuck on
a very short description.
And it's very hard to figure out
what it really means,
and only a few classes have these things.
So, we go up in the class hierarchy
to these more general things,
it's very hard to figure out
what belongs to them,
and what doesn't belong to them.
So it's no surprise
that people use them the wrong way.
Because the people in this room--
or metaphorically in this room--
may understand that geographic location
is used for a particular purpose,
but even me--
I think I have a fairly good background
in representing things--
don't know the answer to that,
or at least, it requires me to spend
at least an hour of effort
to get a good answer to that.
And that's really not scalable.
So, I'm not asking for nothing,
I'm asking for lots of things,
but the trouble is, I mean, I think--
well, I think I'm important
but anyway, you can ignore me.
I think that I'm a pretty good
use case for Wikidata.
I really want, not just a bit of Wikidata,
I want a lot of it.
And I work for a very big company
but the part of that company
that needs, or wants,
or cares about Wikidata is quite small.
So, if I worked for a company
that really cared about data,
and was willing to put
hundreds of millions of dollars
into curating Wikidata,
and put it into their own knowledge graph,
using Wikidata would be no problem.
My company, perhaps,
has a million dollars to take Wikidata
and put it into a knowledge graph.
A million dollars
doesn't go very far these days.
So, the problem--
and let me say something
that actually isn't in the slides,
but which I really firmly believe in.
The problem with Wikidata not--
Wikidata's great,
but to really use it,
you have to spend a lot of effort.
And most companies,
and most individuals, and most groups
can't expend that amount of effort
to really use it well.
I think that on the Wikidata side,
they should try to be greater
so that more people could really use it.
And that's really, I think,
the guts of this presentation
is that if the Wikidata community
improved Wikidata
so it would be more clear
as to what's going on,
then more people
could put information into it
without making mistakes,
and more people could use it
without having to spend a lot of time
to curate it.
Alright, so, we've gone through
lots of this stuff.
Let me just say a few things.
So, I've looked at a fair bit of Wikidata,
and every time I look, I find a problem.
That's bad.
I haven't done a quantitative study,
and somebody should do
a quantitative study
of some of these things,
it would require a lot of work to do it,
but essentially, I look at something
and I find a problem,
and that's not great.
I find missing information.
But I don't have anything to say about
adding in missing information.
Yes, Dan?
(Dan) With respect,
you always find problems.
(Peter) Yes.
(audience laughs)
I am very good at finding problems.
Actually, so one of the problems
that I have, the problem with "woman"--
(laughter)
The problem with--
I didn't find the problem with "woman".
(chuckles)
Turns out that a co-worker,
I showed her a page,
where I had found a different problem
and she looked at it
and said, "Oh, 'woman'."
And so she found that problem
on a display where I had already
found a problem.
So, missing information--
there just should be
more information in Wikidata.
There's factual errors in Wikidata,
but everybody's got factual errors.
Bots make it a little bit worse.
There's problems with the ontology,
which I think is a place that--
you can expend effort there
and really improve quite a lot of things.
And then there's also
the problems with qualifiers,
and really temporal qualifiers.
It's very hard to figure out
what's true at a particular time
because there's a whole bunch of
temporal qualifiers
that could be relevant.
Which ones count and which ones get used,
and are they going to stay the same?
Are we going to add a new one tomorrow?
So then I have to change
every one of my programs.
I really think all this kind of stuff,
it would be better to hide that
from the consumer
so that Wikidata would just say,
"Okay, you want to know
what's true at time X?
Here's an interface that tells you
what's true at time X,"
instead of having me
write all of this stuff.
It's on, I think it's on.
Yeah.
(man) I think you like the idea
of what is possible with Wikidata,
but you say that it's not used
like your idea.
So if, from my perspective,
Wikidata is a collection of statements
from persons and from machines,
and so on, and some might be true,
some might be discussable.
What you could do would be,
from my perspective,
you could use a computational intelligence
to score the statements
if they are...
(speaking German)
...contradictory,
or if they are common sense.
So you could score them,
and then you can filter on the score,
and then you have what you wanted.
(Peter) Possibly, except without a notion
of what things mean in Wikidata,
I can't even figure out
whether two things are contradictory.
I mean, there's constraints
and that helps,
but I don't think that's a full solution.
And common sense--
I don't have much common sense
and my programs have a lot less than I do.
We could write a lot of stuff
which tries to say some things
about common sense, but, again,
I think that requires an understanding
of what's going on.
And yes, so Wikidata has references
which are supposed to be some notion
of what's really supported,
except, here's a problem,
and it's very hard to see this.
Here's a problem with Wikidata
from a while ago.
This is a movie
that's got three directors listed--
the Corpse Bride--
and it's got Mike Johnson, twice.
Different Mike Johnsons.
And they both have a lot of references.
So there's a lot of things
that say that Corpse Bride
has got two different Mike Johnsons
as directors.
And there they are,
one is a director, one is a singer.
What happened, some bot went through
and accidentally did a bad thing
in Italian Wikipedia--
got the wrong thing in there--
and then a bunch of other bots piled on
and essentially created false references.
So, this is a real problem.
So, seven references!
That's really good.
And they're not crap references.
They're some movie databases--
real things.
So, that's one of the things.
Here's another one--
there's the Aral Sea.
These are the biggest--
by volume-- lakes in the world.
There's the Aral Sea.
That comes from Wikidata, by the way.
There's Lake Michigan-Huron.
I didn't realize
there was a Lake Michigan-Huron,
and I live on one of them.
So, here we have two problems.
This is an ontological problem--
what's a lake?
And so is Lake Michigan-Huron a lake?
Well, don't know.
This one here
is a temporal qualifier problem--
how big is the Aral Sea now?
Not 22,000 square miles.
Not 11,000 square miles.
So, what is it?
Sorry, 26,000 square miles.
Although this is something
from Google, of course,
but that's in there.
So anyway, I got a bunch of other things
along these lines,
which you can see if you care,
but I've given you my suggestions already,
you can either like my suggestions or not,
but I've-- woah-- (chuckles)
I think I've sort of
supported some things.
So, anyway, I had questions in the middle,
and we are done,
are we having a question or not?
- (woman) We're done.
- (Peter) Okay.
- (woman) Sorry, that's it.
- (Peter) (laughs)
(audience applause)