WEBVTT
00:00:00.000 --> 00:00:19.480
36C3 preroll music
00:00:19.480 --> 00:00:24.140
Herald Angel: We have Tom and Max here.
They have a talk with a very
00:00:24.140 --> 00:00:28.140
complicated title that I don't quite
understand yet. It's called "Interactively
00:00:28.140 --> 00:00:35.810
Discovering Implicational Knowledge in
Wikidata". And they told me the point of
00:00:35.810 --> 00:00:39.190
the talk is that I would like to
understand what it means and I hope I
00:00:39.190 --> 00:00:42.190
will. So good luck.
Tom: Thank you very much.
00:00:42.190 --> 00:00:44.310
Herald: And have some applause, please.
00:00:44.310 --> 00:00:47.880
applause
00:00:47.880 --> 00:00:54.980
T: Thank you very much. Do you hear me?
Does it work? Hello? Oh, very good. Thank
00:00:54.980 --> 00:00:58.789
you very much and welcome to our talk
about interactively discovering
00:00:58.789 --> 00:01:05.110
implicational knowledge in Wikidata. It
is more or less a fun project we started
00:01:05.110 --> 00:01:10.890
for finding rules that are implicit in
Wikidata – entailed just by the data it
00:01:10.890 --> 00:01:18.850
has, that people inserted into the
Wikidata database so far. And we will
00:01:18.850 --> 00:01:23.570
start with the explicit knowledge. So the
explicit data in Wikidata, with Max.
00:01:23.570 --> 00:01:28.340
Max: So. Right. What is Wikidata?
Maybe you have heard about Wikidata, then
00:01:28.340 --> 00:01:33.210
that's all fine. Maybe you haven't, then
surely you've heard of Wikipedia. And
00:01:33.210 --> 00:01:36.790
Wikipedia is run by the Wikimedia
Foundation and the Wikimedia Foundation
00:01:36.790 --> 00:01:41.330
has several other projects. And one of
those is Wikidata. And Wikidata is
00:01:41.330 --> 00:01:45.490
basically a large graph that encodes
machine readable knowledge in the form of
00:01:45.490 --> 00:01:51.730
statements. And a statement basically
consists of some entity that is connected
00:01:51.730 --> 00:01:58.200
– or some entities that are connected
by some property. And these properties
00:01:58.200 --> 00:02:02.909
can then even have annotations on them.
So, for example, we have Donna Strickland
00:02:02.909 --> 00:02:09.149
here and we encode that she has received a
Nobel prize in physics last year by this
00:02:09.149 --> 00:02:16.290
property "awarded" and this has then a
qualifier "time: 2018" and also "for:
00:02:16.290 --> 00:02:23.100
Chirped Pulse Amplification". And all in
all, we have some 890 million statements
00:02:23.100 --> 00:02:31.960
on Wikidata that connect 71 million items
using 7000 properties. But there's also a
00:02:31.960 --> 00:02:36.830
bit more. So we also know that Donna
Strickland has "field of work: optics" and
00:02:36.830 --> 00:02:41.420
also "field of work: lasers" so we can use
the same property to connect some entity
00:02:41.420 --> 00:02:46.480
with different other entities. And we
don't even have to have knowledge that
00:02:46.480 --> 00:02:56.530
connects the entities. We can have a date
of birth, which is 1959.
00:02:56.530 --> 00:03:05.530
And this is
then just a plain date, not an entity. And
00:03:05.530 --> 00:03:11.510
now coming from the explicit knowledge
then, well, we have some more we have
00:03:11.510 --> 00:03:16.209
Donna Strickland has received a Nobel
prize in physics and also Marie Curie has
00:03:16.209 --> 00:03:21.170
received the Nobel prize in physics. And
we also know that Marie Curie has a Nobel
00:03:21.170 --> 00:03:27.780
prize ID that starts with "phys" and then
"1903" and some random numbers that
00:03:27.780 --> 00:03:32.970
basically are this ID. Then Marie Curie
also has received a Nobel prize in
00:03:32.970 --> 00:03:38.580
chemistry in 1911. So she has another
Nobel ID that starts with "chem" and has
00:03:38.580 --> 00:03:43.590
"1911" there. And then there's also
Frances Arnold, who received the Nobel
00:03:43.590 --> 00:03:48.549
prize in chemistry last year. So she has a
Nobel ID that starts with "chem" and has
00:03:48.549 --> 00:03:54.740
"2018" there. And now one could assume
that, well, everybody who was awarded the
00:03:54.740 --> 00:04:00.156
Nobel prize should also have a Nobel ID.
So everybody who was awarded the Nobel
00:04:00.156 --> 00:04:05.670
prize should also have a Nobel prize ID,
and we could write that as some
00:04:05.670 --> 00:04:11.791
implication here. So "awarded(nobelPrize)"
implies "nobelID". And well, if you
00:04:11.791 --> 00:04:16.349
look sharply at this picture, then there's
this arrow here conspicuously missing:
00:04:16.349 --> 00:04:22.550
Donna Strickland doesn't have a Nobel
prize ID. And indeed, there's 25 people
00:04:22.550 --> 00:04:26.669
currently on Wikidata that are missing
Nobel prize IDs, and Donna Strickland is
00:04:26.669 --> 00:04:34.060
one of them. So we call these people that
don't satisfy this implication – we call
00:04:34.060 --> 00:04:40.419
those counterexamples and well, if you
look at Wikidata on the scale of really
00:04:40.419 --> 00:04:45.350
these 890 million statements, then you
won't find any counterexamples because
00:04:45.350 --> 00:04:52.550
it's just too big. So we need some way to
automatically do that. And the idea is
00:04:52.550 --> 00:04:58.930
that, well, if we had this knowledge that
some implications are not satisfied,
00:04:58.930 --> 00:05:03.840
then this maybe encodes missing
information or wrong information, and we
00:05:03.840 --> 00:05:10.870
want to represent that in a way that is
easy to understand and also succinct. So
00:05:10.870 --> 00:05:16.090
it doesn't take long to write it down, it
should have a short representation. So
00:05:16.090 --> 00:05:23.060
that rules out anything involving complex
syntax or logical quantifiers. So no SPARQL
00:05:23.060 --> 00:05:27.480
queries as a description of that implicit
knowledge. No description logics, if
00:05:27.480 --> 00:05:33.199
you've heard of that. And we also want
something that we can actually compute on
00:05:33.199 --> 00:05:41.539
actual hardware in a reasonable timeframe.
So our approach is we use Formal Concept
00:05:41.539 --> 00:05:46.889
Analysis, which is a technique that has
been developed over the past several decades
00:05:46.889 --> 00:05:52.070
to extract what is called propositional
implications. So just logical formulas of
00:05:52.070 --> 00:05:56.240
propositional logic that are an
implication in the form of this
00:05:56.240 --> 00:06:03.020
"awarded(nobelPrize)" implies "nobelID".
So what exactly is Formal Concept
00:06:03.020 --> 00:06:08.500
Analysis? Off to Tom.
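The counterexample search Max describes – people awarded a Nobel prize but missing a Nobel prize ID – can be phrased as one query against the public Wikidata SPARQL endpoint. A sketch in Python; the identifiers are our assumptions: P166 "award received", P3188 "Nobel prize ID", Q38104 "Nobel Prize in Physics".

```python
# Assumed Wikidata identifiers (check them before relying on this):
AWARDED = "P166"        # "award received"
NOBEL_ID = "P3188"      # "Nobel prize ID"
PHYSICS_PRIZE = "Q38104"  # "Nobel Prize in Physics"

# Build the SPARQL query that lists counterexamples to
# "awarded(nobelPrize) implies nobelID":
query = f"""SELECT ?person WHERE {{
  ?person wdt:{AWARDED} wd:{PHYSICS_PRIZE} .            # awarded the prize ...
  FILTER NOT EXISTS {{ ?person wdt:{NOBEL_ID} ?id . }}  # ... but no Nobel prize ID
}}"""

print(query)
```

Anyone this query returns is a counterexample: either a statement is missing, or the implication simply does not hold.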
T: Thank you. So what is Formal Concept
00:06:08.500 --> 00:06:14.420
Analysis? It was developed in the 1980s by
Rudolf Wille and Bernhard Ganter
00:06:14.420 --> 00:06:18.539
and they were restructuring lattice
theory. Lattice theory is an ambiguous
00:06:18.539 --> 00:06:23.370
name in math, it has two meanings: One
meaning is you have a grid and have a
00:06:23.370 --> 00:06:29.050
lattice there. The other thing is to speak
about orders – order relations. So I like
00:06:29.050 --> 00:06:34.150
steaks, I like pudding and I like steaks
more than pudding. And I like rice more
00:06:34.150 --> 00:06:40.960
than steaks. That's an order, right? And
lattices are particular orders which can
00:06:40.960 --> 00:06:46.770
be used to represent propositional logic.
So easy rules like "when it rains, the
00:06:46.770 --> 00:06:52.990
street gets wet", right? So and the data
representation those guys used back then,
00:06:52.990 --> 00:06:57.080
they called it a formal context, which is
basically just a set of objects – they
00:06:57.080 --> 00:07:02.000
call them objects, it's just a name –, a
set of attributes and some incidence,
00:07:02.000 --> 00:07:07.890
which basically means which object does
have which attributes. So, for example, my
00:07:07.890 --> 00:07:13.150
laptop has the colour black. So this
object has some property, right? So that's
00:07:13.150 --> 00:07:17.870
a small example on the right for such a
formal context. So the objects there are
00:07:17.870 --> 00:07:24.379
some animals: a platypus – that's the fun
animal from Australia, the mammal which is
00:07:24.379 --> 00:07:30.279
also laying eggs and which is also
venomous –, a black widow – the spider –,
00:07:30.279 --> 00:07:35.449
the duck and the cat. So we see, the
platypus has all the properties; it has
00:07:35.449 --> 00:07:39.729
being venomous, laying eggs and being a
mammal; we have the duck, which is not a
00:07:39.729 --> 00:07:44.169
mammal, but it lays eggs, and so on and so
on. And it's very easy to grasp some
00:07:44.169 --> 00:07:49.430
implicational knowledge here. An easy rule
you can find is whenever you encounter a
00:07:49.430 --> 00:07:54.300
mammal that is venomous, it has to lay
eggs. So this is a rule that falls out of
00:07:54.300 --> 00:07:59.639
this binary data table. Our main problem
then or at this point is we do not have
00:07:59.639 --> 00:08:03.470
such a data table for Wikidata, right? We
have the implicit graph, which is way more
00:08:03.470 --> 00:08:09.030
expressive than binary data, and we cannot
even store Wikidata as a binary table.
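The small animal context above, by contrast, fits in a few lines, and Tom's rule can be checked mechanically. A minimal sketch in Python; the incidences not spelled out in the talk (such as the black widow laying eggs) are filled in as assumptions:

```python
# Toy formal context from the talk: objects mapped to their attribute sets.
# Incidences beyond what is said in the talk are assumed.
context = {
    "platypus":    {"mammal", "venomous", "lays eggs"},
    "black widow": {"venomous", "lays eggs"},
    "duck":        {"lays eggs"},
    "cat":         {"mammal"},
}

def holds(premise, conclusion, ctx):
    """An implication A -> B holds iff every object having all of A also has all of B."""
    return all(conclusion <= attrs
               for attrs in ctx.values()
               if premise <= attrs)

def counterexamples(premise, conclusion, ctx):
    """Objects that have the whole premise but miss part of the conclusion."""
    return [obj for obj, attrs in ctx.items()
            if premise <= attrs and not conclusion <= attrs]

# "Every venomous mammal lays eggs" holds (only the platypus qualifies):
print(holds({"mammal", "venomous"}, {"lays eggs"}, context))   # True
# The rule "everything that lays eggs is venomous" fails; the duck refutes it:
print(counterexamples({"lays eggs"}, {"venomous"}, context))   # ['duck']
```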
00:08:09.030 --> 00:08:13.859
Even if you tried to, we have no chance to
compute such rules from that. And for
00:08:13.859 --> 00:08:21.460
this, the people from Formal Concept
Analysis proposed an algorithm to extract
00:08:21.460 --> 00:08:27.160
implicit knowledge from an expert. So our
expert here could be Wikidata. It's an
00:08:27.160 --> 00:08:31.240
expert, you can ask Wikidata questions,
right? Using this SPARQL interface, you
00:08:31.240 --> 00:08:34.739
can ask. You can ask "Is there an example
for that? Is there a counterexample for
00:08:34.739 --> 00:08:39.880
something else?" So the algorithm is quite
easy. The algorithm is the algorithm and
00:08:39.880 --> 00:08:45.380
some expert – in our case, Wikidata –, and
the algorithm keeps notes for
00:08:45.380 --> 00:08:49.449
counterexamples and keeps notes for valid
implications. So in the beginning, we do
00:08:49.449 --> 00:08:53.569
not have any valid implications, so this
list on the right is empty, and in the
00:08:53.569 --> 00:08:56.780
beginning we do not have any
counterexamples. So the list on the left,
00:08:56.780 --> 00:09:01.900
the formal context to build up is also
empty. And all the algorithm does now is,
00:09:01.900 --> 00:09:09.170
it asks "is this implication,
X implies Y, true?"
00:09:09.170 --> 00:09:14.000
So "is it true," for example, "that an
animal that is a mammal and is venomous
00:09:14.000 --> 00:09:18.880
lays eggs?" So now the expert, which in
our case is Wikidata, can answer it. We
00:09:18.880 --> 00:09:24.860
can query that. We showed in our paper we
can query that. So we query it, and if the
00:09:24.860 --> 00:09:28.491
Wikidata expert does not find any
counterexamples, it will say, ok, that's
00:09:28.491 --> 00:09:36.200
maybe a true thing; yes. Or if
it's not a true implication in Wikidata,
00:09:36.200 --> 00:09:41.779
it can say, no, no, no, it's not true, and
here's a counterexample. So this is
00:09:41.779 --> 00:09:48.510
something you contradict by example. You
say this rule cannot be true. For example,
00:09:48.510 --> 00:09:52.900
when the street is wet, that does not mean
it has rained, right? It could be the
00:09:52.900 --> 00:10:01.380
cleaning service car or something else. So
our idea now was to use Wikidata as an
00:10:01.380 --> 00:10:05.819
expert, but also include a human into this
loop. So we do not just want to ask
00:10:05.819 --> 00:10:11.709
Wikidata, we also want to ask a human
expert as well. So we first ask in our
00:10:11.709 --> 00:10:18.520
tool the Wikidata expert for some rule.
After that, we also inquire the human
00:10:18.520 --> 00:10:22.080
expert. And he can also say "yeah, that's
true, I know that," or "No, no. Wikidata
00:10:22.080 --> 00:10:27.200
is not aware of this counterexample, I
know one." Or, in the other case "oh,
00:10:27.200 --> 00:10:32.770
Wikidata says this is true. I am aware of
a counterexample." Yeah, and so on and so
00:10:32.770 --> 00:10:37.600
on. And you can represent this more or
less – this is just some mathematical
00:10:37.600 --> 00:10:41.689
picture, it's not very important. But you
can see on the left there's an exploration
00:10:41.689 --> 00:10:46.720
going on, just Wikidata with the
algorithm, on the right an exploration, a
00:10:46.720 --> 00:10:51.419
human expert versus Wikidata which can
answer all the queries. And we combined
00:10:51.419 --> 00:10:57.720
those two into one small tool, still under
development. So, back to Max.
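A greatly simplified, non-interactive version of that loop can be sketched in Python. The real attribute exploration algorithm asks a provably minimal sequence of questions via closures; this sketch naively enumerates premises, asks the context (standing in for the Wikidata expert) what follows, and lets a callback (standing in for the human expert) accept or reject:

```python
from itertools import combinations

# Toy context from earlier in the talk (incidence partly assumed).
context = {
    "platypus":    {"mammal", "venomous", "lays eggs"},
    "black widow": {"venomous", "lays eggs"},
    "duck":        {"lays eggs"},
    "cat":         {"mammal"},
}
attributes = {"mammal", "venomous", "lays eggs"}

def consequences(premise, ctx, attrs):
    """What the context says follows from `premise`: the attributes shared
    by every object that has all of `premise`."""
    rows = [a for a in ctx.values() if premise <= a]
    if not rows:                 # no object has the premise: everything follows
        return set(attrs)
    return set.intersection(*rows) - premise

def explore(ctx, attrs, expert):
    """For every premise, ask whether its consequences should really hold.
    `expert(premise, conclusion)` returns True to accept the implication."""
    accepted = []
    for r in range(len(attrs) + 1):
        for premise in map(set, combinations(sorted(attrs), r)):
            conclusion = consequences(premise, ctx, attrs)
            if conclusion and expert(premise, conclusion):
                accepted.append((premise, conclusion))
    return accepted

# A human expert that accepts everything the data suggests:
rules = explore(context, attributes, lambda p, c: True)
print(rules)
```

With Wikidata as the context, `consequences` becomes a SPARQL query, and `expert` becomes a person clicking accept or reject in the tool.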
00:10:57.720 --> 00:11:02.980
M: Okay. So, for that to work, we
basically need to have a way of viewing
00:11:02.980 --> 00:11:08.070
Wikidata, or at least parts of Wikidata,
as a formal context. And this formal
00:11:08.070 --> 00:11:13.610
context, well, this was a binary table, so
what do we do? We just take all the items
00:11:13.610 --> 00:11:18.880
in Wikidata as objects and all the
properties as attributes of our context
00:11:18.880 --> 00:11:24.159
and then have an incidence relation that
says "well, this entity has this
00:11:24.159 --> 00:11:30.549
property," so it is incident there, and
then we end up with a context that has 71
00:11:30.549 --> 00:11:36.430
million rows and seven thousand columns.
So, well, that might actually be a slight
00:11:36.430 --> 00:11:40.180
problem there, because we want to have
something that we can run on actual
00:11:40.180 --> 00:11:45.811
hardware and not on a supercomputer. So
let's maybe not do that and focus on
00:11:45.811 --> 00:11:50.900
a smaller set of properties that are
actually related to one another through
00:11:50.900 --> 00:11:55.689
some kind of common domain, yeah? So it
doesn't make any sense to have a property
00:11:55.689 --> 00:11:59.640
that relates to spacecraft and then a
property that relates to books – that's
00:11:59.640 --> 00:12:05.050
probably not a good idea to try to find
implicit knowledge between those two. But
00:12:05.050 --> 00:12:10.259
two different properties about spacecraft,
that sounds good, right? And then the
00:12:10.259 --> 00:12:15.000
interesting question is just how do we
define the incidence for our set of
00:12:15.000 --> 00:12:20.150
properties? And that actually depends very
much on which properties we choose,
00:12:20.150 --> 00:12:25.550
because it does – for some properties, it
makes sense to account for the direction
00:12:25.550 --> 00:12:32.679
of the statement: So there is a property
called parent? Actually, no, it's child,
00:12:32.679 --> 00:12:38.309
and then there's father and mother, and
you don't want to turn those around:
00:12:38.309 --> 00:12:43.760
"A is a child of B" should be
something different from "B
00:12:43.760 --> 00:12:48.930
is a child of A." Then there's the
qualifiers that might be important for
00:12:48.930 --> 00:12:54.740
some properties. So receiving an award for
something might be something different
00:12:54.740 --> 00:13:00.740
than receiving an award for something
else. But while receiving an award in 2018
00:13:00.740 --> 00:13:06.549
and receiving one in 2017, that's probably
more or less the same thing, so we don't
00:13:06.549 --> 00:13:11.930
necessarily need to differentiate that.
And there's also a thing called subclasses
00:13:11.930 --> 00:13:15.470
and they form a hierarchy on Wikidata. And
you might also want to take that into
00:13:15.470 --> 00:13:20.150
account because while winning something
that is a Nobel prize, that means also
00:13:20.150 --> 00:13:25.190
winning an award itself, and winning the
Nobel Peace prize means winning a peace
00:13:25.190 --> 00:13:32.586
prize. So there's also implications going
on there that you want to respect. So,
00:13:32.586 --> 00:13:38.400
to see how we actually do that, let's look
at an example. So we have here, well, this
00:13:38.400 --> 00:13:47.030
is Donna Strickland. And – I forgot his
first name – Ashkin, this is one of the
00:13:47.030 --> 00:13:51.720
people that won the Nobel prize in physics
with her last year. And also Gérard
00:13:51.720 --> 00:13:57.990
Mourou. That is the third one. They all
got the Nobel prize in physics last year.
00:13:57.990 --> 00:14:04.190
So we have all these statements here, and
these two have a qualifier that says
00:14:04.190 --> 00:14:10.260
"with: Gérard Mourou" here. And I don't
think the qualifier is on this statement
00:14:10.260 --> 00:14:15.160
here, actually, but it doesn't actually
matter. So what we've done here is,
00:14:15.160 --> 00:14:21.190
put all the entities in the small graph as
rows in the table. So we have Strickland
00:14:21.190 --> 00:14:27.850
and Mourou and Ashkin, and also Arnold and
Curie that are not in the picture. But you
00:14:27.850 --> 00:14:33.290
can maybe remember that. And then here we
have awarded, and we scaled that by the
00:14:33.290 --> 00:14:37.250
instance of the different Nobel prizes
that people have won. So that's the
00:14:37.250 --> 00:14:42.209
physics Nobel in the first column, the
chemistry Nobel Prize in the second column
00:14:42.209 --> 00:14:48.380
and just general Nobel prizes in the third
column. There's awarded and that is scaled
00:14:48.380 --> 00:14:55.240
by the "with" qualifier, so awarded with
Gérard Mourou. And then there's field of
00:14:55.240 --> 00:15:00.450
work, and we have lasers here and
radioactivity, so we scale by the actual
00:15:00.450 --> 00:15:06.580
field of work that people have. And well
then, if we look at what kind of incidence
00:15:06.580 --> 00:15:11.370
we get for Donna Strickland, she has a
Nobel prize in physics and that is also a
00:15:11.370 --> 00:15:17.190
Nobel prize, and she has that together
with Mourou. And she has "field of work:
00:15:17.190 --> 00:15:23.220
lasers," but not radioactivity. Then,
Mourou himself: he has a Nobel prize in
00:15:23.220 --> 00:15:29.450
physics, and that is a Nobel prize, but
none of the others. Ashkin gets the Nobel
00:15:29.450 --> 00:15:33.890
prize in physics, and that is still a
Nobel prize, and he gets that with Gérard
00:15:33.890 --> 00:15:40.970
Mourou. And also he works on lasers, but
not in radioactivity. So Frances Arnold
00:15:40.970 --> 00:15:47.230
has a Nobel prize in chemistry, and that
is a Nobel prize. And Marie Curie, she has
00:15:47.230 --> 00:15:50.510
a Nobel prize in physics and one in
chemistry, and they are both a Nobel
00:15:50.510 --> 00:15:55.319
prize. And she also works on
radioactivity. But lasers didn't exist
00:15:55.319 --> 00:16:02.490
back then, so she doesn't get "field of
work: lasers." And then basically this
00:16:02.490 --> 00:16:10.289
table here is a representation of our
formal context. And then we've actually
00:16:10.289 --> 00:16:14.840
gone ahead and started building a tool
where you can interactively do all these
00:16:14.840 --> 00:16:20.320
things, and it will take care of building
the context for you. You just put in the
00:16:20.320 --> 00:16:24.540
properties, and Tom will show
you how that works.
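That context-building step can be sketched as follows. The statement tuples below are a hand-made stand-in for real Wikidata statements, and the scaling choices (by value, by superclass, by the "with" qualifier) mirror the example table from the talk:

```python
# Hypothetical statements modelled on the talk's example:
# (subject, property, value, qualifiers).
statements = [
    ("Strickland", "awarded", "Nobel Prize in Physics", {"with": "Mourou"}),
    ("Mourou",     "awarded", "Nobel Prize in Physics", {}),
    ("Strickland", "field of work", "lasers", {}),
    ("Curie",      "awarded", "Nobel Prize in Physics", {}),
    ("Curie",      "awarded", "Nobel Prize in Chemistry", {}),
    ("Curie",      "field of work", "radioactivity", {}),
]

# A tiny slice of the subclass hierarchy (assumed):
SUBCLASS = {"Nobel Prize in Physics": "Nobel prize",
            "Nobel Prize in Chemistry": "Nobel prize"}

def scale(statements):
    """Turn statements into binary attributes: scale by the value, by its
    superclass, and by any qualifiers the statement carries."""
    ctx = {}
    for subj, prop, value, quals in statements:
        attrs = ctx.setdefault(subj, set())
        attrs.add(f"{prop}: {value}")
        if value in SUBCLASS:                 # subclass-aware scaling
            attrs.add(f"{prop}: {SUBCLASS[value]}")
        for q, qv in quals.items():           # qualifier-aware scaling
            attrs.add(f"{prop} {q}: {qv}")
    return ctx

ctx = scale(statements)
print(sorted(ctx["Strickland"]))
```

Strickland's row then comes out as described: a Nobel prize in physics, which is also a Nobel prize, awarded with Mourou, and "field of work: lasers".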
00:16:24.540 --> 00:16:29.030
T: So here you see some first screenshots
of this tool. So please do not comment on
00:16:29.030 --> 00:16:32.520
the graphic design. We have no idea about
that, we have to ask someone about that.
00:16:32.520 --> 00:16:36.120
We're just into logics, more or less. On
the left, you see the initial state of the
00:16:36.120 --> 00:16:41.120
game. There you have five boxes:
they're called countries and borders,
00:16:41.120 --> 00:16:47.370
credit cards, use of energy, memory and
computation – I think –, and space
00:16:47.370 --> 00:16:53.180
launches, which are just presets we
defined. You can explore, for example, in
00:16:53.180 --> 00:16:57.050
the case of the credit card, you can
explore the properties from Wikidata which
00:16:57.050 --> 00:17:02.170
are called "card network," "operator," and
"fee," so you can just choose one of them,
00:17:02.170 --> 00:17:05.530
or on the right, "custom properties," you
can just input the properties you're
00:17:05.530 --> 00:17:10.640
interested in Wikidata, whatever one of
the seven thousand you like, or some
00:17:10.640 --> 00:17:15.140
number of them. On the right, I chose then
the credit card thingy and I now want to
00:17:15.140 --> 00:17:21.860
show you what happens if you now explore
these properties, right? The first step in
00:17:21.860 --> 00:17:25.750
the game is that the game will ask – I
mean, the game, the exploration process –
00:17:25.750 --> 00:17:31.020
will ask, is it true that every entity in
Wikidata will have these three properties?
00:17:31.020 --> 00:17:36.360
So are they common among all entities in
your data, which is most probably not
00:17:36.360 --> 00:17:41.540
true, right? I mean, not everything in
Wikidata has a fee, at least I hope. So,
00:17:41.540 --> 00:17:46.520
what I will do now, I would click the
"reject this implication" button, since
00:17:46.520 --> 00:17:51.480
the implication "Nothing implies
everything" is not true. In the second
00:17:51.480 --> 00:17:56.360
step now, the algorithm tries to find the
minimal number of questions to obtain the
00:17:56.360 --> 00:18:01.820
domain knowledge, so to obtain all valid
rules in this domain. So next question is
00:18:01.820 --> 00:18:06.120
"is it true that everything in Wikidata
that has a 'card network' property also
00:18:06.120 --> 00:18:12.560
has a 'fee' and an 'operator' property?"
And down here you can see Wikidata says
00:18:12.560 --> 00:18:18.110
"ok, there are 26 items which are
counterexamples," so there's 26 items in
00:18:18.110 --> 00:18:22.670
Wikidata which have the "card network"
property but do not have the other two
00:18:22.670 --> 00:18:28.200
ones. So, 26 is not a big number, this
could mean "ok, that's an error, so 26
00:18:28.200 --> 00:18:32.860
statements are missing." Or maybe that
that's, really, that's the true case.
00:18:32.860 --> 00:18:36.890
That's also ok. But you can now choose
what you think is right. You can say, "oh,
00:18:36.890 --> 00:18:40.470
I would say it should be true" or you can
say "no, I think that's ok, one of these
00:18:40.470 --> 00:18:46.380
counterexamples seems valid. Let's reject
it." I, in this case, rejected it. The next
00:18:46.380 --> 00:18:51.020
question it asks: "is it true that
everything that has an operator has also a
00:18:51.020 --> 00:18:56.290
fee and a card network?" Yeah, this is
possibly not true. There's also more than
00:18:56.290 --> 00:19:03.110
1000 counterexamples, one being, I think a
telecommunication operator in Hungary or
00:19:03.110 --> 00:19:10.340
something. And so we can reject this as
well. Next question: "everything that has
00:19:10.340 --> 00:19:15.360
an operator and a card network – so card
network means Visa, MasterCard, whatever,
00:19:15.360 --> 00:19:21.690
all this stuff – is it true that they have
to have a fee?" Wikidata says "no," it has
00:19:21.690 --> 00:19:27.570
23 items that contradict it. But one of
the items, for example, is the American
00:19:27.570 --> 00:19:32.090
Express Gold Card. I suppose the American
Express Gold Card has some fee. So this
00:19:32.090 --> 00:19:36.140
indicates, "oh, there is some missing data
in Wikidata," there is something that
00:19:36.140 --> 00:19:40.680
Wikidata does not know but should know to
reason correctly in Wikidata with your
00:19:40.680 --> 00:19:46.520
SPARQL queries. So we can now say, "yeah,
that's, uh, that's not a reject, that's an
00:19:46.520 --> 00:19:51.470
accept," because we think it should be
true. But Wikidata thinks otherwise. And
00:19:51.470 --> 00:19:55.800
you go on, we go on. This is then the last
question: "Is it true that everything that
00:19:55.800 --> 00:20:00.950
has a fee and a card network should have an
operator," and you see, "oh, no
00:20:00.950 --> 00:20:05.930
counterexamples." This means Wikidata says "this
is true," because it says there is no
00:20:05.930 --> 00:20:09.580
counterexample. If you're asking Wikidata
it says this is a valid implication in the
00:20:09.580 --> 00:20:15.400
data set so far, which could also be
indicating that something is missing, I'm
00:20:15.400 --> 00:20:20.310
not aware if this is possible or not, but
ok, for me it sounds reasonable. Everyone
00:20:20.310 --> 00:20:23.800
has a fee and a card network should also
have an operator, which meens a bank or
00:20:23.800 --> 00:20:29.220
something like that. So I accept this
implication. And then, yeah, you have won
00:20:29.220 --> 00:20:34.410
the exploration game, which essentially
means you've won some knowledge. Thank
00:20:34.410 --> 00:20:40.300
you. And the knowledge is that you know
which implications in Wikidata are true or
00:20:40.300 --> 00:20:44.340
should be true from your point of view.
And yeah, this is more or less the state
00:20:44.340 --> 00:20:50.700
of the game so far as we programmed it in
October. And the next state will be to
00:20:50.700 --> 00:20:54.970
show you some – "How much does your
opinion of the world differ from the
00:20:54.970 --> 00:20:59.950
opinion that is now reflected in the
data?" So is what you think about the data
00:20:59.950 --> 00:21:05.430
close to what is true in
Wikidata. Or maybe Wikidata has wrong
00:21:05.430 --> 00:21:10.680
information. You can find it with that.
But Max will tell you more about that.
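That planned comparison boils down to set differences between the implications you accepted during the game and the ones the data currently satisfies. A toy sketch; the implication sets here are made up:

```python
# Implications as (premise, conclusion) pairs of frozensets, so they are hashable.
def imp(premise, conclusion):
    return (frozenset(premise), frozenset(conclusion))

# What you accepted during the exploration game (made-up example data):
accepted_by_you = {
    imp({"awarded: Nobel prize"}, {"Nobel prize ID"}),
    imp({"fee", "card network"}, {"operator"}),
}
# What currently holds in the data without counterexamples (made up as well):
valid_in_wikidata = {
    imp({"fee", "card network"}, {"operator"}),
}

# Accepted by you but violated in the data: points to missing statements.
missing = accepted_by_you - valid_in_wikidata
# Holds in the data but not accepted by you: possibly wrong or incomplete data.
unexpected = valid_in_wikidata - accepted_by_you

print(missing)
print(unexpected)
```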
00:21:10.680 --> 00:21:18.220
M: Ok. So let me just quickly come
back to what we have actually done. So we
00:21:18.220 --> 00:21:23.670
offer a procedure that allows you to
explore properties in Wikidata and the
00:21:23.670 --> 00:21:30.720
implicational knowledge that holds between
these properties. And the key idea here is
00:21:30.720 --> 00:21:34.661
that when you look at these implications
that you get, while there might be some
00:21:34.661 --> 00:21:39.280
that you don't actually want because they
shouldn't be true, and there might also be
00:21:39.280 --> 00:21:46.220
ones that you don't get, but you expect to
get because they should hold. And these
00:21:46.220 --> 00:21:51.840
unwanted and/or missing implications, they
point to missing statements and items in
00:21:51.840 --> 00:21:56.130
Wikidata. So they show you where the
opportunities to improve the knowledge in
00:21:56.130 --> 00:22:00.100
Wikidata are, and, well, sometimes you
also get to learn something about the
00:22:00.100 --> 00:22:04.080
world, and in most cases, it's that the
world is more complicated than you thought
00:22:04.080 --> 00:22:10.260
it was – and that's just how life is. But
in general, implications can guide you in
00:22:10.260 --> 00:22:17.220
your way of improving Wikidata and the
state of knowledge therein. So what's
00:22:17.220 --> 00:22:22.380
next? Well, so what we currently don't
offer in the exploration game and what we
00:22:22.380 --> 00:22:27.710
definitely will focus next on is having
configurable counterexamples and also
00:22:27.710 --> 00:22:32.030
filterable counterexamples – right now you
just get a list of a random number of
00:22:32.030 --> 00:22:36.880
counterexamples. And you might want to
search through this list for something you
00:22:36.880 --> 00:22:42.520
recognise and you might also want to
explicitly say, well, this one should be a
00:22:42.520 --> 00:22:48.600
counterexample, and that's definitely
coming next. Then, well, domain specific
00:22:48.600 --> 00:22:53.750
scaling of properties, there's still much
work to be done. Currently, we only have
00:22:53.750 --> 00:23:00.500
some very basic support for that. So you
can have properties, but you can't do the
00:23:00.500 --> 00:23:03.780
fancy things where you say, "well,
everything that is an award should be
00:23:03.780 --> 00:23:10.840
considered as one instance of this
property." That's also coming and then
00:23:10.840 --> 00:23:15.550
what Tom mentioned already: compare your
knowledge that you have explored through
00:23:15.550 --> 00:23:21.610
this process against the knowledge that is
currently on Wikidata as a form of seeing
00:23:21.610 --> 00:23:26.540
"where do you stand? What is missing in
Wikidata? How can you improve Wikidata?"
00:23:26.540 --> 00:23:32.600
And well, if you have any more suggestions
for features, then just tell us. There's a
00:23:32.600 --> 00:23:39.530
GitHub link on the implication game page.
And here's the link to the tool again. So,
00:23:39.530 --> 00:23:46.140
yeah, just let us know. Open an issue and
have fun. And if you have any questions,
00:23:46.140 --> 00:23:50.230
then I guess now would be the time to ask.
T: Thank you.
00:23:50.230 --> 00:23:52.730
Herald: Thank you very much, Tom and Max.
00:23:52.730 --> 00:23:55.020
applause
00:23:55.020 --> 00:24:01.510
Herald: So we will switch microphones now
because then I can hand this microphone to
00:24:01.510 --> 00:24:07.250
you if any of you have a question for our
two speakers. Are there any questions or
00:24:07.250 --> 00:24:14.370
suggestions? Yes.
Question: Hi. Thanks for the nice talk. I
00:24:14.370 --> 00:24:18.720
wanted to ask what's the first question,
what's the most interesting implication
00:24:18.720 --> 00:24:25.020
that you've found?
M: Yeah. That would have made for a
00:24:25.020 --> 00:24:31.850
good backup slide. The most interesting
implication so far –
00:24:31.850 --> 00:24:36.010
T: The most basic thing: you would expect
everything that is launched in space by
00:24:36.010 --> 00:24:41.920
humans – no, everything that landed from
space, that has a landing date, also has a
00:24:41.920 --> 00:24:46.450
start date. So nothing has landed on Earth
that was not launched from here.
00:24:46.450 --> 00:24:55.200
M: Yes.
Q: Right now, the game only helps you find
00:24:55.200 --> 00:25:00.710
out implications. Are you also planning to
support adding data? Like, for
00:25:00.710 --> 00:25:04.309
example, let's say I have twenty five
Nobel laureates who don't have a Nobel
00:25:04.309 --> 00:25:08.220
laureate ID. Are there plans where you
could give me a simple interface for me to
00:25:08.220 --> 00:25:12.760
Google and add that ID because it would
make the process of adding new entities to
00:25:12.760 --> 00:25:17.400
Wikidata itself more simple.
M: Yes. And that's partly hidden
00:25:17.400 --> 00:25:23.050
behind this "configurable and filterable
counterexamples" thing. We will probably
00:25:23.050 --> 00:25:28.380
not have an explicit interface for adding
stuff, but most likely interface with some
00:25:28.380 --> 00:25:32.270
other tool built around Wikidata, so
probably something that will give you
00:25:32.270 --> 00:25:37.100
QuickStatements or something like that.
But yes, adding data is definitely on the
00:25:37.100 --> 00:25:41.710
roadmap.
Herald: Any more questions? Yes.
00:25:41.710 --> 00:25:48.860
Q: Wouldn't it be nice to do this in other
languages, too?
00:25:48.860 --> 00:25:52.600
T: Actually it's language independent, so
we use Wikidata and then as far as we
00:25:52.600 --> 00:25:58.110
know, Wikidata has no language itself. You
know, it has just items and properties, so
00:25:58.110 --> 00:26:02.640
Qs and Ps, and whatever language you use,
it should be translated into the language of
00:26:02.640 --> 00:26:06.180
the properties, if there is a label for
that property or for that item that you
00:26:06.180 --> 00:26:12.420
have. So if Wikidata is aware of your
language, we are.
00:26:12.420 --> 00:26:15.020
Herald: Oh, yes. More!
M: Of course, the tool still needs to be
00:26:15.020 --> 00:26:18.360
translated, but –
T: The tool itself, it should be.
00:26:18.360 --> 00:26:21.850
Q: Hi, thanks for the talk. I have a
question. Right now you only can find
00:26:21.850 --> 00:26:25.990
missing data with this, right? Or surplus
data. Would you think you'd be able to
00:26:25.990 --> 00:26:31.560
find wrong information with a similar
approach?
00:26:31.560 --> 00:26:37.001
T: Actually, we do. I mean, if Wikidata
has a counterexample to something we would
00:26:37.001 --> 00:26:42.830
expect to be true, this could point to
wrong data, right? If the counterexample
00:26:42.830 --> 00:26:47.450
is a wrong counterexample – if there is a
missing or incorrect property on an
00:26:47.450 --> 00:26:58.160
item.
Q: Ok, I get to ask a second question. So
00:26:58.160 --> 00:27:06.000
the horizontal axis in the incidence
matrix. You said it has 7000, it spans
00:27:06.000 --> 00:27:10.300
7000 columns, right?
M: Yes, because there are 7000 properties in
00:27:10.300 --> 00:27:13.850
Wikidata.
Q: But it's actually way more columns,
00:27:13.850 --> 00:27:17.849
right? Because you multiply the properties
times the arguments, right?
00:27:17.849 --> 00:27:21.360
M: Yes. So if you do any scaling then of
course that might give you multiple
00:27:21.360 --> 00:27:23.380
entries.
Q: So that's what you mean with scaling,
00:27:23.380 --> 00:27:27.770
basically?
M: Yes. But already seven thousand is way
00:27:27.770 --> 00:27:35.580
too big to actually compute that.
Q: How many would it be if you multiply
00:27:35.580 --> 00:27:48.060
all the arguments?
M: I have no idea, probably a few million.
00:27:48.060 --> 00:27:55.309
Q: Have you thought about a recursive
method, as counterexamples may be wrong by
00:27:55.309 --> 00:28:00.350
other counterexamples, like in an
argumentative graph or something like
00:28:00.350 --> 00:28:06.708
this?
T: Actually, I don't get it. How can a
00:28:06.708 --> 00:28:14.040
counterexample be wrong through another
counterexample?
00:28:14.040 --> 00:28:24.450
Q: Maybe some example says that cats can
have golden hair and then another example
00:28:24.450 --> 00:28:31.260
might say that this is not a cat.
T: Ah, so the property to be a cat or
00:28:31.260 --> 00:28:38.000
something cat-ish is missing then. Okay.
No, so far we have not considered deeper
00:28:38.000 --> 00:28:44.570
reasoning. This Horn propositional logic,
you know, it has no contradictions,
00:28:44.570 --> 00:28:47.740
because all you can do is
contradict by counterexamples, but there
00:28:47.740 --> 00:28:52.740
can never be a rule that is not true, so
far. Just in your or my opinion, maybe,
00:28:52.740 --> 00:28:56.370
but not in the logic. So what we have to
think about is some richer form of
00:28:56.370 --> 00:29:01.780
reasoning, right? So.
Q: Sorry, quick question. Because you're
00:29:01.780 --> 00:29:04.929
not considering all the 7000-odd
properties for each of the entities,
00:29:04.929 --> 00:29:07.570
right? What's your current process of
filtering? What are the relevant
00:29:07.570 --> 00:29:14.820
properties? I'm sorry, I didn't get that.
M: Well, we basically handpick those. So
00:29:14.820 --> 00:29:19.940
you have this input field, where you can go
ahead and select your properties. We also
00:29:19.940 --> 00:29:26.870
have some predefined sets. Okay. And
there are also some classes for groups of
00:29:26.870 --> 00:29:30.780
properties that are related that you could
use if you want bigger sets,
00:29:30.780 --> 00:29:35.960
T: for example, space or family or what
was the other?
00:29:35.960 --> 00:29:43.410
M: Awards is one.
T: It depends on the size of the class.
00:29:43.410 --> 00:29:47.390
For example, for space, it's not that
many, I think it's 10 or 15 properties. It
00:29:47.390 --> 00:29:51.520
will take you some hours, but you can do
it, because it's only 15 or something like
00:29:51.520 --> 00:29:58.150
that. I think for family, it's way too
much, it's like 40 or 50 properties. So a
00:29:58.150 --> 00:30:04.540
lot of questions.
Herald: I don't see any more hands. Maybe
00:30:04.540 --> 00:30:09.760
someone who has not asked a question yet
has another one – we could take that;
00:30:09.760 --> 00:30:14.270
otherwise we would be perfectly on time.
And maybe you can tell us where you will
00:30:14.270 --> 00:30:18.860
be for deeper discussions where people can
find you.
00:30:18.860 --> 00:30:22.400
T: Probably at the couches.
Herald: The couches, behind our stage.
00:30:22.400 --> 00:30:26.720
M: Or just running around somewhere. So
there's also our DECT numbers on the
00:30:26.720 --> 00:30:35.960
slides; it's 6284 for Tom and 6279 for me.
So just call and ask where we're hanging
00:30:35.960 --> 00:30:38.470
around.
Herald: Well then, thank you again. Have a
00:30:38.470 --> 00:30:40.210
round of applause.
applause
00:30:40.210 --> 00:30:42.650
T: Thank you.
M: Well, thanks for having us.
00:30:42.650 --> 00:30:45.310
applause
00:30:45.310 --> 00:30:49.740
postroll music
00:30:49.740 --> 00:31:12.000
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!