1
00:00:06,081 --> 00:00:09,419
(woman) Hello everyone.
Thank you for being here this afternoon.
2
00:00:09,419 --> 00:00:10,538
We are first going to hear--
3
00:00:10,538 --> 00:00:13,018
I'm just going to jump straight in
to give him plenty of time--
4
00:00:13,018 --> 00:00:15,864
so we're first going to hear from
Peter Patel-Schneider
5
00:00:15,864 --> 00:00:19,925
about barriers to using Wikidata
as a knowledge base.
6
00:00:19,925 --> 00:00:21,237
(Peter) Thank you.
7
00:00:22,937 --> 00:00:26,281
I'll skip over the abstract
because you've already seen it all.
8
00:00:26,281 --> 00:00:29,205
And I should say a little bit
about myself.
9
00:00:31,705 --> 00:00:37,050
I'm much more of a user of Wikidata
than an actual editor of Wikidata,
10
00:00:37,050 --> 00:00:41,526
and much more of a user of Wikidata
than somebody who contributes to Wikidata,
11
00:00:41,526 --> 00:00:44,891
but I very much believe in
the aims of Wikidata.
12
00:00:44,891 --> 00:00:47,473
In particular, it aligns
with my research areas
13
00:00:47,473 --> 00:00:50,683
which is knowledge representation,
at least in a certain sense.
14
00:00:50,683 --> 00:00:54,990
I worked in description logics
for a long time, worked with W3C.
15
00:00:54,990 --> 00:00:58,591
I've worked in Silicon Valley for a while,
16
00:00:58,591 --> 00:01:02,111
largely building what might be called
knowledge graphs,
17
00:01:02,111 --> 00:01:03,849
but I don't like the term
knowledge graphs--
18
00:01:03,849 --> 00:01:04,968
I don't like what they mean,
19
00:01:04,968 --> 00:01:07,831
I want to do something better
than knowledge graphs.
20
00:01:07,831 --> 00:01:10,426
And I want to put this together
from various sources.
21
00:01:10,426 --> 00:01:13,268
So Wikidata is a very, very good one,
22
00:01:13,268 --> 00:01:15,837
but DBpedia is not so good.
23
00:01:15,837 --> 00:01:18,610
Freebase is dead.
24
00:01:18,610 --> 00:01:21,548
OpenStreetMap,
Open Movie Database, things like that.
25
00:01:21,548 --> 00:01:24,532
And then I want to use
this store of knowledge
26
00:01:24,532 --> 00:01:26,499
to do something.
27
00:01:26,499 --> 00:01:31,889
And I want to use it as the source
of knowledge to do something,
28
00:01:31,889 --> 00:01:36,108
and not only just facts
but also organizing my knowledge.
29
00:01:36,108 --> 00:01:39,137
And currently, working where I am,
30
00:01:39,137 --> 00:01:43,245
we're interested in supporting
conversational agents.
31
00:01:43,245 --> 00:01:48,609
Not just things that let you play Avatar,
32
00:01:48,609 --> 00:01:52,850
but let you play the movie
that's directed by the wife
33
00:01:52,850 --> 00:01:55,571
of the director of Avatar.
34
00:01:55,571 --> 00:02:00,454
So how can we build a conversational agent
that will do something like that?
35
00:02:00,454 --> 00:02:04,342
Well, you need to know
all the facts that go behind it,
36
00:02:04,342 --> 00:02:07,440
but you also need to know
that the fact that there are movies--
37
00:02:07,440 --> 00:02:10,407
not just, we have Avatar,
but that we have movies--
38
00:02:10,407 --> 00:02:12,158
we need to know things about movies,
39
00:02:12,158 --> 00:02:15,040
we need to know things
about directorships.
40
00:02:15,040 --> 00:02:18,476
We need to know things about humans--
that they're married to each other.
41
00:02:18,476 --> 00:02:21,233
We need to know that there are men
and women in the world,
42
00:02:21,233 --> 00:02:25,450
and somehow be able to use
this knowledge of what we're saying
43
00:02:25,450 --> 00:02:28,357
to come up with the actual reference
to these things,
44
00:02:28,357 --> 00:02:31,609
and then actually do
what we were asked to do.
45
00:02:31,609 --> 00:02:34,423
So, that's one end,
46
00:02:34,423 --> 00:02:36,634
the other thing
that we want to be able to do
47
00:02:36,634 --> 00:02:41,114
is if you think of systems like Siri,
there are hundreds or thousands--
48
00:02:41,114 --> 00:02:43,474
actually, maybe Siri's not
the best example.
49
00:02:43,474 --> 00:02:49,790
The Amazon system has hundreds
or thousands of little programs
50
00:02:49,790 --> 00:02:51,595
that will do something for you.
51
00:02:51,595 --> 00:02:53,495
And the problem that we're interested in
52
00:02:53,495 --> 00:02:56,070
is how do you pick
which one can do something.
53
00:02:56,070 --> 00:03:00,997
So for example, which back-end
can find me train trips
54
00:03:00,997 --> 00:03:04,754
between San Francisco and Palo Alto.
55
00:03:04,754 --> 00:03:09,227
There may be many systems
that will try and sell me train tickets,
56
00:03:09,227 --> 00:03:13,800
but only one or perhaps two of them
will sell me that particular train ticket.
57
00:03:13,800 --> 00:03:18,697
And how do I get the system to do that
without having to be able to tell it
58
00:03:18,697 --> 00:03:21,428
that I want a Caltrain ticket?
59
00:03:23,128 --> 00:03:29,121
So, what happens is I want to use Wikidata
as the source of a lot of this stuff,
60
00:03:29,121 --> 00:03:31,569
and I regularly run into problems.
61
00:03:31,569 --> 00:03:35,861
And from those problems,
I have a bunch of suggestions.
62
00:03:37,061 --> 00:03:40,956
You may agree with my suggestions
or disagree with them.
63
00:03:40,956 --> 00:03:44,539
Some of them are kind of on their way
to being implemented in Wikidata,
64
00:03:44,539 --> 00:03:46,598
some of them aren't.
65
00:03:46,598 --> 00:03:50,220
So, I'm going to do this talk
from the back forward.
66
00:03:50,220 --> 00:03:53,949
I'm going to give you the summary,
and then an expansion of the summary,
67
00:03:53,949 --> 00:03:58,022
and then some rationale
for my suggestions.
68
00:03:58,022 --> 00:04:01,099
And the reason I'm going to do that
is if I started with all of the rationale,
69
00:04:01,099 --> 00:04:03,892
I might never get to the end,
and the end is the important thing,
70
00:04:03,892 --> 00:04:06,176
at least in my viewpoint.
71
00:04:06,176 --> 00:04:12,117
So, my biggest suggestion, I guess,
on the community side is,
72
00:04:12,117 --> 00:04:16,564
gee, guys, speak with a single voice.
73
00:04:16,564 --> 00:04:18,864
(chuckles)
74
00:04:20,064 --> 00:04:23,616
And speak with a voice
where I can find it.
75
00:04:23,616 --> 00:04:25,675
So, it turns out
that one of my suggestions
76
00:04:25,675 --> 00:04:30,570
is actually implemented,
but I only found out about it today,
77
00:04:30,570 --> 00:04:34,748
because it's not used very much at all,
and it's hard to find it.
78
00:04:34,748 --> 00:04:39,810
So, I really want you guys--
and me too, in some sense--
79
00:04:39,810 --> 00:04:44,075
to spend some effort at the beginning
when you're creating these classes
80
00:04:44,075 --> 00:04:46,252
and other things that are important,
81
00:04:46,252 --> 00:04:48,502
so that a poor user like me,
82
00:04:48,502 --> 00:04:52,746
who can't afford to go through five years
of impassioned discussion
83
00:04:52,746 --> 00:04:57,059
to find out what male actually is,
84
00:04:57,059 --> 00:05:00,860
can actually use it in our system--
in my system.
85
00:05:00,860 --> 00:05:03,282
So that's sort of on the community side.
86
00:05:03,282 --> 00:05:04,845
I'm a formalist.
87
00:05:04,845 --> 00:05:09,164
I really want to--
and my programs are dumb.
88
00:05:09,164 --> 00:05:11,756
I don't write smart programs,
I write dumb programs.
89
00:05:11,756 --> 00:05:14,816
Now, they tend to be
very fancy dumb programs,
90
00:05:14,816 --> 00:05:20,495
but these dumb programs
can't really handle all of the shades
91
00:05:20,495 --> 00:05:25,935
of everything that you have
with start time, end time, inception.
92
00:05:25,935 --> 00:05:29,241
I want to have some simple
formal mechanism
93
00:05:29,241 --> 00:05:33,358
that will tell my program what's true now,
94
00:05:33,358 --> 00:05:36,262
or what's true in 1987,
95
00:05:36,262 --> 00:05:38,559
without having to search through
a bunch of things,
96
00:05:38,559 --> 00:05:41,579
and make a bunch of guesses,
and use a lot of heuristics,
97
00:05:41,579 --> 00:05:45,330
or have a machine-learning program
that's done for this particular task.
98
00:05:45,330 --> 00:05:50,224
I just want you to tell me
this stuff somehow, and have a take.
99
00:05:50,224 --> 00:05:54,090
So, I want to be able
to look at something which says
100
00:05:54,090 --> 00:05:57,850
what the things I see in Wikidata
actually mean.
101
00:05:57,850 --> 00:06:01,170
And I don't find that these days.
102
00:06:01,170 --> 00:06:02,736
And then, of course, once we have that,
103
00:06:02,736 --> 00:06:06,636
I want somebody--
I'm willing to do some of this work--
104
00:06:06,636 --> 00:06:11,131
build tools that actually use
that formal description and say,
105
00:06:11,131 --> 00:06:15,781
tell me, for example,
if I'm an instance
106
00:06:15,781 --> 00:06:21,984
of architectural structure,
like the Eiffel Tower,
107
00:06:21,984 --> 00:06:24,135
am I a geographic location?
108
00:06:26,435 --> 00:06:27,474
I don't know.
109
00:06:27,474 --> 00:06:31,352
I mean, Wikidata doesn't tell me
whether this is true or not.
110
00:06:31,352 --> 00:06:34,461
I can find nowhere in Wikidata
that will do that,
111
00:06:34,461 --> 00:06:35,820
because there's no formal thing.
112
00:06:35,820 --> 00:06:38,468
But once you give me a formal thing
then I'm going to write a tool,
113
00:06:38,468 --> 00:06:41,376
which essentially gives the implications
of what the formal things are.
114
00:06:42,776 --> 00:06:47,569
The fourth suggestion is about bots.
115
00:06:47,569 --> 00:06:49,533
Bots are great.
116
00:06:49,533 --> 00:06:55,641
Bots have ultimate power
and as has been said,
117
00:06:55,641 --> 00:06:58,259
with ultimate power,
comes ultimate responsibility.
118
00:06:58,259 --> 00:07:02,380
And I don't believe that bots get
very much responsibility
119
00:07:02,380 --> 00:07:05,409
for the things that they do,
and they need to have.
120
00:07:05,409 --> 00:07:09,121
We need to be able to control the bots
and figure out what they've done wrong,
121
00:07:09,121 --> 00:07:11,884
and essentially, once a bot
makes a thousand mistakes,
122
00:07:11,884 --> 00:07:13,911
we want to undo that once,
123
00:07:13,911 --> 00:07:17,344
as opposed to undoing that
a thousand times.
124
00:07:17,344 --> 00:07:19,188
Of course, as I said,
these are my suggestions.
125
00:07:19,188 --> 00:07:20,912
Other people
may have different suggestions.
126
00:07:20,912 --> 00:07:23,980
I'm coming at it from a user viewpoint.
127
00:07:23,980 --> 00:07:25,923
I suppose I could say something like,
128
00:07:25,923 --> 00:07:28,450
I'm coming at it from a binary viewpoint.
129
00:07:28,450 --> 00:07:32,900
I mean, this is a program
that really wants yes or no answers.
130
00:07:32,900 --> 00:07:36,137
It doesn't understand much
in shades of gray.
131
00:07:36,137 --> 00:07:41,628
So, I would really like you to tell me
what's true and what's not true.
132
00:07:42,428 --> 00:07:49,416
So, that's the end of the talk, right?
(laughs)
133
00:07:51,730 --> 00:07:55,394
And I sort of expanded on some things
134
00:07:55,394 --> 00:07:58,194
but let me-- oops,
where are we, here, yes.
135
00:07:58,194 --> 00:08:00,746
So, here let me expand upon
the things that I said.
136
00:08:00,746 --> 00:08:05,662
So formally,
I really want a logic for Wikidata
137
00:08:05,662 --> 00:08:09,895
because that lets me know
what Wikidata means to me.
138
00:08:09,895 --> 00:08:11,712
I don't want to have a data structure
139
00:08:11,712 --> 00:08:16,364
with some sort of English description
somewhere that tells me something.
140
00:08:16,364 --> 00:08:19,146
I want a formal statement of what this is.
141
00:08:19,146 --> 00:08:23,750
And maybe it produces the wrong answers,
in which case we fix it,
142
00:08:23,750 --> 00:08:26,039
but at least we know
what the answers are supposed to be,
143
00:08:26,039 --> 00:08:31,594
as opposed to having to go through
five or ten different pages
144
00:08:31,594 --> 00:08:33,253
of people arguing with each other
145
00:08:33,253 --> 00:08:36,131
what this particular part
of Wikidata means.
146
00:08:36,131 --> 00:08:40,908
So, in particular, I want to have things
that I think are useful,
147
00:08:40,908 --> 00:08:42,445
like disjointness.
148
00:08:42,445 --> 00:08:48,406
I want Wikidata to say that
rocks aren't humans,
149
00:08:48,406 --> 00:08:50,877
to pick an example.
150
00:08:50,877 --> 00:08:54,250
Now, there's lots of that stuff
in Wikidata at the moment.
151
00:08:54,250 --> 00:08:57,300
There's lots of this
"opposite of" things,
152
00:08:57,300 --> 00:08:59,539
but what does it mean?
153
00:08:59,539 --> 00:09:01,328
Somebody who's an opposite--
154
00:09:01,328 --> 00:09:06,221
there was something this morning
about transgender man
155
00:09:06,221 --> 00:09:09,185
is the opposite of transgender woman.
156
00:09:11,985 --> 00:09:15,888
Yes, in some sense,
but in what sense are they opposites?
157
00:09:15,888 --> 00:09:19,277
It's not a logical sense,
it's something else.
158
00:09:19,277 --> 00:09:23,324
I want to give definitions of classes
and to give an example,
159
00:09:23,324 --> 00:09:27,248
I would very much like Wikidata to say
160
00:09:27,248 --> 00:09:31,948
that "woman" is adult, female, human,
161
00:09:31,948 --> 00:09:36,564
because if I query Wikidata--
this is going to the end--
162
00:09:36,564 --> 00:09:38,864
and I ask how many women are in Wikidata,
163
00:09:38,864 --> 00:09:42,277
I get... any guesses?
164
00:09:43,021 --> 00:09:44,538
(woman) Less than men.
165
00:09:44,538 --> 00:09:46,320
(Peter) Thirty-seven.
166
00:09:46,320 --> 00:09:47,374
Less than men.
167
00:09:47,374 --> 00:09:48,459
Thirty-seven.
168
00:09:48,459 --> 00:09:53,093
Instances of "woman" in Wikidata-- 37.
169
00:09:53,904 --> 00:09:55,331
That's obviously wrong.
170
00:09:55,331 --> 00:09:56,972
Obviously, obviously wrong.
171
00:09:56,972 --> 00:09:59,310
I know it, you know it,
172
00:09:59,310 --> 00:10:01,876
but my program doesn't know it.
173
00:10:01,876 --> 00:10:05,098
My program says 37--
well, it's not zero.
174
00:10:05,098 --> 00:10:07,354
So it might be right.
175
00:10:09,054 --> 00:10:12,738
I would much prefer
there to be something on "woman"
176
00:10:12,738 --> 00:10:16,203
that says, "Hey, if you're trying
to figure out the women in Wikidata,
177
00:10:16,203 --> 00:10:20,388
don't look at the things
that are stated to be instances of 'woman,'
178
00:10:20,388 --> 00:10:24,323
look at things, well, a SPARQL query
or something like that,
179
00:10:24,323 --> 00:10:27,312
find all the humans, find the female ones,
180
00:10:27,312 --> 00:10:32,584
the ones with sex or gender
which is female or female-ish.
181
00:10:32,584 --> 00:10:34,537
That's kind of difficult there,
182
00:10:34,537 --> 00:10:36,823
and then the ones that are adult--
whatever adult means--
183
00:10:36,823 --> 00:10:37,986
at least that's a definition.
184
00:10:37,986 --> 00:10:40,120
We can argue whether
it's the right definition or not.
185
00:10:40,120 --> 00:10:45,381
But we get a number which is not 37,
much better than 37.
186
00:10:45,381 --> 00:10:48,113
So, I want this so that
we can actually come up with answers
187
00:10:48,113 --> 00:10:50,196
to some of these questions.
188
00:10:50,196 --> 00:10:53,585
So, and again, tools--
I would really like to have tools
189
00:10:53,585 --> 00:10:55,277
that show implications of claims.
190
00:10:55,277 --> 00:10:59,429
So, that shows that
the Eiffel Tower is a location.
191
00:10:59,429 --> 00:11:04,137
Whether it is or not in the real world,
is somehow kind of irrelevant.
192
00:11:04,137 --> 00:11:10,015
We can argue whether the Eiffel Tower
is a location or has a location.
193
00:11:10,015 --> 00:11:12,780
Philosophers probably
have argued for decades
194
00:11:12,780 --> 00:11:14,776
over whether this is the case or not.
195
00:11:14,776 --> 00:11:15,914
I don't care.
196
00:11:15,914 --> 00:11:19,829
Just come up with an answer that makes
at least a little bit of sense,
197
00:11:19,829 --> 00:11:23,302
and I'll be happy.
198
00:11:23,302 --> 00:11:24,758
So, I want a tool that'll do that.
199
00:11:24,758 --> 00:11:26,646
I want, essentially,
a tool that will tell me
200
00:11:26,646 --> 00:11:28,786
what's true at a particular time.
201
00:11:28,786 --> 00:11:32,893
So, how big is the Aral Sea?
202
00:11:34,533 --> 00:11:38,558
It's certainly not 22,000 square miles.
203
00:11:38,558 --> 00:11:41,531
It's much, much smaller than that,
204
00:11:41,531 --> 00:11:46,855
but the claims on the Aral Sea
are historical claims.
205
00:11:46,855 --> 00:11:49,238
What's true now?
206
00:11:49,238 --> 00:11:51,607
I think, 3,000 square miles.
207
00:11:51,607 --> 00:11:56,997
Anyway, it's a mere puddle
of its former self, you might say.
208
00:11:56,997 --> 00:12:00,215
I would also like tools
that help in cleaning the data.
209
00:12:00,215 --> 00:12:02,296
So, what are inconsistencies?
210
00:12:02,296 --> 00:12:05,383
Is there something
that's both a rock and a human?
211
00:12:05,383 --> 00:12:09,343
Well, right now,
is that a problem in Wikidata?
212
00:12:09,343 --> 00:12:11,703
Well, there are these
constraint mechanisms,
213
00:12:11,703 --> 00:12:13,042
but they're kind of weak,
214
00:12:13,042 --> 00:12:15,835
and they're not used very well
in many places.
215
00:12:15,835 --> 00:12:21,556
So, I would really like to have some tool
which essentially says, "No!
216
00:12:21,556 --> 00:12:23,793
You can't have a rock and a human!
217
00:12:23,793 --> 00:12:28,541
You can have, perhaps,
a human and a Klingon,
218
00:12:28,541 --> 00:12:31,778
but rocks and humans, just, no."
219
00:12:35,978 --> 00:12:39,408
There's an old science fiction story
called The God Makers
220
00:12:39,408 --> 00:12:42,522
where they take a rock [inaudible],
make it into a God,
221
00:12:42,522 --> 00:12:45,084
so maybe a rock
could be a person in that sense.
222
00:12:45,084 --> 00:12:47,039
But human, no.
223
00:12:47,883 --> 00:12:49,048
Hm?
224
00:12:52,325 --> 00:12:58,055
(man) Are you asking for
exhaustive disjunction?
225
00:12:59,298 --> 00:13:02,052
[inaudible]
226
00:13:02,052 --> 00:13:06,024
(Peter) No, I'm not asking for
exhaustive decompositions.
227
00:13:06,024 --> 00:13:07,389
Just disjunctions.
228
00:13:07,389 --> 00:13:09,400
I mean, in some sense--
229
00:13:09,400 --> 00:13:10,490
In what?
230
00:13:10,490 --> 00:13:11,623
(woman) That's undecidable.
231
00:13:11,623 --> 00:13:15,357
(Peter) What? No, well,
you mean not logically.
232
00:13:15,357 --> 00:13:18,961
So, the question is
whether we can actually,
233
00:13:18,961 --> 00:13:22,061
can have exhaustive definition,
234
00:13:22,061 --> 00:13:23,744
exhaustive disjunctions?
235
00:13:23,746 --> 00:13:24,774
Well...
236
00:13:24,774 --> 00:13:28,474
(man) That's pricey, right?
To find out that bots are... yeah.
237
00:13:29,874 --> 00:13:32,696
(man 2) To say that rocks
are disjoint from humans is easy,
238
00:13:32,696 --> 00:13:36,038
but to do that in all the cases
you're going to want it, is--
239
00:13:36,038 --> 00:13:37,205
(Peter) It's computation.
240
00:13:37,205 --> 00:13:39,963
Yes, now we have a problem
with computational costs, right?
241
00:13:39,963 --> 00:13:41,457
Yeah.
242
00:13:42,057 --> 00:13:49,056
The computational cost of deciding it
for Wikidata as it exists right now,
243
00:13:49,056 --> 00:13:55,420
is not impossible,
it's just computationally non-trivial.
244
00:13:55,420 --> 00:13:58,735
So given that the query service
is running out of [inaudible],
245
00:13:58,735 --> 00:14:04,022
so to do this right, requires tools
that actually think a little bit.
246
00:14:04,022 --> 00:14:06,769
And that's going to require computation.
247
00:14:06,769 --> 00:14:08,029
How much computation?
248
00:14:08,029 --> 00:14:10,824
Well, it's not the heat death
of the universe,
249
00:14:10,824 --> 00:14:14,211
it's tomorrow, perhaps,
or two seconds from now.
250
00:14:14,211 --> 00:14:18,416
But two seconds times
how many million things are in Wikidata
251
00:14:18,416 --> 00:14:21,672
is getting to be a reasonably big number.
252
00:14:21,672 --> 00:14:22,797
One of the things you can do
253
00:14:22,797 --> 00:14:25,801
is this thing doesn't have to be
completely run in one thing.
254
00:14:25,801 --> 00:14:31,206
You can farm these out into other systems.
255
00:14:31,206 --> 00:14:36,416
We don't have to have everything
all in one computer.
256
00:14:36,416 --> 00:14:38,419
And, of course,
Google just gave us the answer.
257
00:14:38,419 --> 00:14:40,593
We can just put it on
this new Google quantum computer,
258
00:14:40,593 --> 00:14:42,105
and it'll do everything forever.
259
00:14:42,105 --> 00:14:44,911
(woman) But it sounds like
you're asking for OWL, and--
260
00:14:44,911 --> 00:14:46,552
(Peter) No, I'm asking for part of OWL.
261
00:14:46,552 --> 00:14:48,868
(woman) You've been asking for
a lot of things about OWL,
262
00:14:48,868 --> 00:14:50,359
and that just is not possible.
263
00:14:50,359 --> 00:14:53,554
That's why Wikidata works,
is because it's not OWL.
264
00:14:53,554 --> 00:14:55,636
There are actually things
that you can compute with.
265
00:14:55,636 --> 00:15:00,518
(Peter) So, I am asking for
a bigger part of OWL,
266
00:15:00,518 --> 00:15:02,663
not all of it, yeah?
267
00:15:02,663 --> 00:15:07,311
Well, I mean, so the question is,
268
00:15:07,311 --> 00:15:09,211
is Wikidata going to spend the effort
269
00:15:09,211 --> 00:15:13,776
to buy another, perhaps, ten computers
to crunch away on this permanently,
270
00:15:13,776 --> 00:15:18,316
or is it going to spend the effort
of having a whole bunch of people
271
00:15:18,316 --> 00:15:20,724
argue about it, or whatever.
272
00:15:20,724 --> 00:15:25,231
And my view is computers are dirt cheap.
273
00:15:25,231 --> 00:15:30,769
I mean, I'm willing to pony up
some of my very own money
274
00:15:30,769 --> 00:15:34,457
to buy Wikidata another computer
to do this stuff,
275
00:15:34,457 --> 00:15:36,462
because I think it's important.
276
00:15:36,462 --> 00:15:38,062
(man) [inaudible]
277
00:15:38,062 --> 00:15:39,763
(Peter) Yes. (laughs)
278
00:15:39,763 --> 00:15:42,863
I didn't say I would give it
to Wikimedia Foundation.
279
00:15:45,063 --> 00:15:49,340
But I'm not asking for things
that are trivial.
280
00:15:49,340 --> 00:15:52,308
I'm asking for things
that require compute power,
281
00:15:52,308 --> 00:15:57,041
that require intellectual power,
that require the community to do things.
282
00:15:57,041 --> 00:15:58,800
The community is doing
some of these things.
283
00:15:58,800 --> 00:16:02,957
I found out that there is this property
which essentially says,
284
00:16:02,957 --> 00:16:06,716
"Hey, here's how
you're supposed to use this thing."
285
00:16:06,716 --> 00:16:09,222
I forget the exact name of it.
286
00:16:09,222 --> 00:16:12,976
User instructions,
I thought it was three words.
287
00:16:12,976 --> 00:16:18,235
Whatever, anyway,
it essentially says-- and it's on male.
288
00:16:18,235 --> 00:16:19,851
And there was a big argument about it.
289
00:16:19,851 --> 00:16:21,716
The trouble is it's not supported at all.
290
00:16:21,716 --> 00:16:24,597
There was this plan to have this property
and have it supported,
291
00:16:24,597 --> 00:16:25,895
to have it show up everywhere,
292
00:16:25,895 --> 00:16:30,339
so that people would realize
that human-- in other words,
293
00:16:30,339 --> 00:16:34,069
you don't use person for humans.
Right now it's stuck on the description.
294
00:16:34,069 --> 00:16:35,942
And it's stuck on
a very short description.
295
00:16:35,942 --> 00:16:38,697
And it's very hard to figure out
what it really means,
296
00:16:38,697 --> 00:16:41,713
and only a few classes have these things.
297
00:16:41,713 --> 00:16:45,123
So, we go up in the class hierarchy
to these more general things,
298
00:16:45,123 --> 00:16:47,372
it's very hard to figure out
what belongs to them,
299
00:16:47,372 --> 00:16:48,801
and what doesn't belong to them.
300
00:16:48,801 --> 00:16:51,734
So it's no surprise
that people use them the wrong way.
301
00:16:51,734 --> 00:16:56,417
Because the people in this room--
or metaphorically in this room--
302
00:16:56,417 --> 00:17:00,684
may understand that geographic location
is used for a particular purpose,
303
00:17:00,684 --> 00:17:03,215
but even me--
304
00:17:03,215 --> 00:17:06,413
I think I have a fairly good background
in representing things--
305
00:17:06,413 --> 00:17:11,066
don't know the answer to that,
or at least, it requires me to spend
306
00:17:11,066 --> 00:17:13,466
at least an hour of effort
to get a good answer to that.
307
00:17:13,466 --> 00:17:16,703
And that's really not scalable.
308
00:17:16,703 --> 00:17:18,350
So, I'm not asking for nothing,
309
00:17:18,350 --> 00:17:20,472
I'm asking for lots of things,
310
00:17:20,472 --> 00:17:24,155
but the trouble is, I mean, I think--
311
00:17:24,155 --> 00:17:27,369
well, I think I'm important
but anyway, you can ignore me.
312
00:17:27,369 --> 00:17:30,993
I think that I'm a pretty good
use case for Wikidata.
313
00:17:30,993 --> 00:17:34,328
I really want, not just a bit of Wikidata,
314
00:17:34,328 --> 00:17:36,333
I want a lot of it.
315
00:17:36,333 --> 00:17:42,498
And I work for a very big company
but the part of that company
316
00:17:42,498 --> 00:17:47,906
that needs, or wants,
or cares about Wikidata is quite small.
317
00:17:47,906 --> 00:17:52,929
So, if I worked for a company
that really cared about data,
318
00:17:52,929 --> 00:17:55,560
and was willing to put
hundreds of millions of dollars
319
00:17:55,560 --> 00:17:59,985
into curating Wikidata,
and put it into their own knowledge graph,
320
00:17:59,985 --> 00:18:02,649
using Wikidata would be no problem.
321
00:18:02,649 --> 00:18:07,614
My company, perhaps,
has a million dollars to take Wikidata
322
00:18:07,614 --> 00:18:09,431
and put it into a knowledge graph.
323
00:18:09,431 --> 00:18:13,097
A million dollars
doesn't go very far these days.
324
00:18:13,097 --> 00:18:17,475
So, the problem--
and let me say something
325
00:18:17,475 --> 00:18:22,102
that actually isn't in the slides,
but which I really firmly believe in.
326
00:18:22,102 --> 00:18:24,393
The problem with Wikidata not--
327
00:18:24,393 --> 00:18:27,743
Wikidata's great,
328
00:18:27,743 --> 00:18:33,314
but to really use it,
you have to spend a lot of effort.
329
00:18:33,314 --> 00:18:39,777
And most companies,
and most individuals, and most groups
330
00:18:39,777 --> 00:18:46,073
can't expend that amount of effort
to really use it well.
331
00:18:46,073 --> 00:18:51,710
I think that on the Wikidata side,
they should try to be greater
332
00:18:51,710 --> 00:18:54,947
so that more people could really use it.
333
00:18:54,947 --> 00:18:59,458
And that's really, I think,
the guts of this presentation
334
00:18:59,458 --> 00:19:04,002
is that if the Wikidata community
improved Wikidata
335
00:19:04,002 --> 00:19:08,364
so it would be more clear
as to what's going on,
336
00:19:08,364 --> 00:19:10,742
then more people
could put information into it
337
00:19:10,742 --> 00:19:12,090
without making mistakes,
338
00:19:12,090 --> 00:19:15,395
and more people could use it
without having to spend a lot of time
339
00:19:15,395 --> 00:19:17,190
to curate it.
340
00:19:18,090 --> 00:19:23,686
Alright, so, we've gone through
lots of this stuff.
341
00:19:25,286 --> 00:19:27,515
Let me just say a few things.
342
00:19:27,515 --> 00:19:33,402
So, I've looked at a fair bit of Wikidata,
343
00:19:33,402 --> 00:19:37,341
and every time I look, I find a problem.
344
00:19:37,341 --> 00:19:39,658
That's bad.
345
00:19:40,858 --> 00:19:42,575
I haven't done a quantitative study,
346
00:19:42,575 --> 00:19:44,258
and somebody should do
a quantitative study
347
00:19:44,258 --> 00:19:46,851
of some of these things,
it would require a lot of work to do it,
348
00:19:46,851 --> 00:19:50,029
but essentially, I look at something
and I find a problem,
349
00:19:50,029 --> 00:19:51,143
and that's not great.
350
00:19:51,143 --> 00:19:52,429
I find missing information.
351
00:19:52,429 --> 00:19:57,568
But I don't have anything to say about
adding in missing information.
352
00:19:57,568 --> 00:19:59,456
Yes, Dan?
353
00:20:00,056 --> 00:20:02,358
(Dan) With respect,
you always find problems.
354
00:20:02,358 --> 00:20:03,367
(Peter) Yes.
355
00:20:03,367 --> 00:20:04,706
(audience laughs)
356
00:20:04,706 --> 00:20:07,547
I am very good at finding problems.
357
00:20:07,547 --> 00:20:13,305
Actually, so one of the problems
that I have, the problem with "woman"--
358
00:20:13,305 --> 00:20:15,105
(laughter)
359
00:20:15,105 --> 00:20:17,733
The problem with--
I didn't find the problem with "woman".
360
00:20:17,733 --> 00:20:19,933
(chuckles)
361
00:20:19,933 --> 00:20:22,630
Turns out that a co-worker,
I showed her a page,
362
00:20:22,630 --> 00:20:25,306
where I had found a different problem
and she looked at it
363
00:20:25,306 --> 00:20:27,162
and said, "Oh, 'woman'."
364
00:20:27,162 --> 00:20:28,801
And so she found that problem
365
00:20:28,801 --> 00:20:32,394
on a display that I already
found the problem.
366
00:20:32,394 --> 00:20:35,874
So, missing information--
367
00:20:35,874 --> 00:20:38,230
there just should be
more information in Wikidata.
368
00:20:38,230 --> 00:20:39,899
There's factual errors in Wikidata,
369
00:20:39,899 --> 00:20:41,429
but everybody's got factual errors.
370
00:20:41,429 --> 00:20:43,081
Bots make it a little bit worse.
371
00:20:43,081 --> 00:20:45,608
There's problems with the ontology,
372
00:20:45,608 --> 00:20:48,537
which I think is a place that--
373
00:20:48,537 --> 00:20:53,049
you can expend effort there
and really improve quite a lot of things.
374
00:20:53,049 --> 00:20:55,548
And then there's also
the problems with qualifiers,
375
00:20:55,548 --> 00:20:57,214
and really temporal qualifiers.
376
00:20:57,214 --> 00:21:00,898
It's very hard to figure out
what's true at a particular time
377
00:21:00,898 --> 00:21:03,567
because there's a whole bunch of
temporal qualifiers
378
00:21:03,567 --> 00:21:05,688
that could be relevant.
379
00:21:05,688 --> 00:21:09,023
Which ones count and which ones get used,
380
00:21:09,023 --> 00:21:10,550
and are they going to stay the same?
381
00:21:10,550 --> 00:21:12,396
Are we going to add a new one tomorrow?
382
00:21:12,396 --> 00:21:15,474
So then I have to change
every one of my programs.
383
00:21:15,474 --> 00:21:18,870
I really think all this kind of stuff,
it would be better to hide that
384
00:21:18,870 --> 00:21:21,845
from the consumer
so that Wikidata would just say,
385
00:21:21,845 --> 00:21:24,497
"Okay, you want to know
what's true at time X?
386
00:21:24,497 --> 00:21:27,648
Here's an interface that tells you
what's true at time X,"
387
00:21:27,648 --> 00:21:31,386
instead of me having
to write all of this stuff.
388
00:21:35,666 --> 00:21:37,350
It's on, I think it's on.
389
00:21:37,350 --> 00:21:39,087
Yeah.
390
00:21:40,287 --> 00:21:47,287
(man) I think you like the idea
of what is possible with Wikidata,
391
00:21:47,287 --> 00:21:53,086
but you say that it's not used
like your idea.
392
00:21:54,186 --> 00:22:00,558
So if, from my perspective,
Wikidata is a collection of statements
393
00:22:00,558 --> 00:22:05,423
from persons and from machines,
and so on, and some might be true,
394
00:22:05,423 --> 00:22:09,409
some might be discussable.
395
00:22:09,409 --> 00:22:12,641
What you could do would be,
from my perspective,
396
00:22:12,641 --> 00:22:16,298
you could use a computational intelligence
397
00:22:16,298 --> 00:22:21,542
to score the statements
if they are...
398
00:22:21,542 --> 00:22:24,042
(speaking German)
399
00:22:24,042 --> 00:22:25,342
...contradictory,
400
00:22:25,342 --> 00:22:27,669
or if they are common sense.
401
00:22:27,669 --> 00:22:31,726
So you could score them,
and then you can filter on the score,
402
00:22:31,726 --> 00:22:34,443
and then you have what you wanted.
403
00:22:34,443 --> 00:22:39,426
(Peter) Possibly, except without a notion
of what things mean in Wikidata,
404
00:22:39,426 --> 00:22:42,452
I can't even figure out
whether two things are contradictory.
405
00:22:42,452 --> 00:22:44,585
I mean, there's constraints
and that helps,
406
00:22:44,585 --> 00:22:48,443
but I don't think that's a full solution.
407
00:22:48,443 --> 00:22:53,418
And common sense--
I don't have much common sense
408
00:22:53,418 --> 00:22:56,674
and my programs have a lot less than I do.
409
00:22:56,674 --> 00:23:01,454
We could write a lot of stuff
which tries to say some things
410
00:23:01,454 --> 00:23:05,210
about common sense, but, again,
I think that requires an understanding
411
00:23:05,210 --> 00:23:06,653
of what's going on.
412
00:23:06,653 --> 00:23:11,121
And yes, so Wikidata has references
which are supposed to be some notion
413
00:23:11,121 --> 00:23:13,923
of what's really supported,
414
00:23:13,923 --> 00:23:20,604
except, here's a problem,
and it's very hard to see this.
415
00:23:20,604 --> 00:23:23,289
Here's a problem with Wikidata
from a while ago.
416
00:23:23,289 --> 00:23:25,818
This is a movie
that's got three directors listed--
417
00:23:25,818 --> 00:23:29,725
the Corpse Bride--
and it's got Mike Johnson, twice.
418
00:23:29,725 --> 00:23:32,723
Different Mike Johnsons.
419
00:23:32,723 --> 00:23:36,367
And they both have a lot of references.
420
00:23:36,367 --> 00:23:40,042
So there's a lot of things
that say that Corpse Bride
421
00:23:40,042 --> 00:23:44,902
has got two different Mike Johnsons
as directors.
422
00:23:44,902 --> 00:23:48,489
And there they are,
one is a director, one is a singer.
423
00:23:48,489 --> 00:23:52,269
What happened, some bot went through
and accidentally did a bad thing
424
00:23:52,269 --> 00:23:55,462
in Italian Wikipedia--
got the wrong thing in there--
425
00:23:55,462 --> 00:23:57,803
and then a bunch of other bots piled on
426
00:23:57,803 --> 00:24:00,648
and essentially created false references.
427
00:24:00,648 --> 00:24:02,254
So, this is a real problem.
428
00:24:02,254 --> 00:24:06,139
So, seven references!
429
00:24:06,139 --> 00:24:07,839
That's really good.
430
00:24:07,839 --> 00:24:10,015
And they're not crap references.
431
00:24:10,015 --> 00:24:16,358
They're some movie databases--
real things.
432
00:24:16,358 --> 00:24:18,756
So, that's one of the things.
433
00:24:18,756 --> 00:24:21,363
Here's another one--
there's the Aral Sea.
434
00:24:23,063 --> 00:24:28,337
These are the biggest--
by volume-- lakes in the world.
435
00:24:28,337 --> 00:24:33,219
There's the Aral Sea.
That comes from Wikidata, by the way.
436
00:24:33,219 --> 00:24:36,433
There's Lake Michigan-Huron.
437
00:24:36,433 --> 00:24:38,547
I didn't realize
there was a Lake Michigan-Huron,
438
00:24:38,547 --> 00:24:40,933
and I live on one of them.
439
00:24:41,866 --> 00:24:43,582
So, here we have two problems.
440
00:24:43,582 --> 00:24:47,167
This is an ontological problem--
what's a lake?
441
00:24:47,167 --> 00:24:50,432
And so is Lake Michigan-Huron a lake?
442
00:24:50,432 --> 00:24:52,813
Well, don't know.
443
00:24:52,813 --> 00:24:56,651
This one here
is a temporal qualifier problem--
444
00:24:56,651 --> 00:24:59,716
how big is the Aral Sea now?
445
00:24:59,716 --> 00:25:02,183
Not 22,000 square miles.
446
00:25:02,183 --> 00:25:04,645
Not 11,000 square miles.
447
00:25:04,645 --> 00:25:10,280
So, what is it?
Sorry, 26,000 square miles.
448
00:25:10,280 --> 00:25:13,613
Although this is something
from Google, of course,
449
00:25:13,613 --> 00:25:15,609
but that's in there.
450
00:25:15,609 --> 00:25:20,988
So anyway, I got a bunch of other things
along these lines,
451
00:25:20,988 --> 00:25:23,264
which you can see if you care,
452
00:25:23,264 --> 00:25:26,636
but I've given you my suggestions already,
453
00:25:26,636 --> 00:25:29,843
you can either like my suggestions or not,
454
00:25:29,843 --> 00:25:32,494
but I've-- whoa-- (chuckles)
455
00:25:33,429 --> 00:25:35,859
I think I've sort of
supported some things.
456
00:25:35,859 --> 00:25:37,995
So, anyway, I had questions in the middle,
457
00:25:37,995 --> 00:25:40,452
and we are done,
are we having a question or not?
458
00:25:40,452 --> 00:25:41,897
- (woman) We're done.
- (Peter) Okay.
459
00:25:41,897 --> 00:25:44,402
- (woman) Sorry, that's it.
- (Peter) (laughs)
460
00:25:44,402 --> 00:25:47,397
(audience applause)