36C3 preroll music

Herald Angel: We have Tom and Max here. They have a talk here with a very complicated title that I don't quite understand yet. It's called "Interactively Discovering Implicational Knowledge in Wikidata". And they told me the point of the talk is that I would like to understand what it means, and I hope I will. So good luck.

Tom: Thank you very much.

Herald: And have some applause, please.

applause
T: Thank you very much. Do you hear me? Does it work? Hello? Oh, very good. Thank you very much, and welcome to our talk about interactively discovering implicational knowledge in Wikidata. It is more or less a fun project we started for finding rules that are implicit in Wikidata – entailed just by the data it has, that people inserted into the Wikidata database so far. And we will start with the explicit knowledge, so the explicit data in Wikidata, with Max.
M: So, what is Wikidata? Maybe you have heard about Wikidata, then that's all fine. Maybe you haven't, then surely you've heard of Wikipedia. Wikipedia is run by the Wikimedia Foundation, and the Wikimedia Foundation has several other projects, and one of those is Wikidata. Wikidata is basically a large graph that encodes machine-readable knowledge in the form of statements. A statement basically consists of some entities that are connected by some property, and these properties can then even have annotations on them. So, for example, we have Donna Strickland here, and we encode that she received the Nobel prize in physics last year by this property "awarded", which then has a qualifier "time: 2018" and also "for: Chirped Pulse Amplification". All in all, we have some 890 million statements on Wikidata that connect 71 million items using 7000 properties. But there is also a bit more. We also know that Donna Strickland has "field of work: optics" and also "field of work: lasers", so we can use the same property to connect some entity with different other entities. And we don't even have to have knowledge that connects entities: we can have a date of birth, which is 1959, and this is then just a plain date, not an entity.
Now, coming from the explicit knowledge, we have some more: Donna Strickland has received a Nobel prize in physics, and also Marie Curie has received the Nobel prize in physics. We also know that Marie Curie has a Nobel prize ID that starts with "phys" and then "1903" and some random numbers that basically make up this ID. Marie Curie also received a Nobel prize in chemistry in 1911, so she has another Nobel ID that starts with "chem" and has "1911" in it. And then there is also Frances Arnold, who received the Nobel prize in chemistry last year, so she has a Nobel ID that starts with "chem" and has "2018" in it. Now, one could assume that everybody who was awarded the Nobel prize should also have a Nobel prize ID, and we could write that as an implication: "awarded(nobelPrize)" implies "nobelID". And if you look sharply at this picture, there is an arrow conspicuously missing: Donna Strickland doesn't have a Nobel prize ID. Indeed, there are currently 25 people on Wikidata that are missing Nobel prize IDs, and Donna Strickland is one of them. We call these people that don't satisfy the implication counterexamples. And if you look at Wikidata on the scale of all 890 million statements, you won't find such counterexamples by hand, because it's just too big. So we need some way to do that automatically.
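As a concrete illustration of how such counterexamples can be found automatically – this sketch is ours, not part of the talk – one can ask Wikidata's public SPARQL endpoint directly. The identifiers used here (P166 "award received", P3188 "Nobel prize ID", Q7191 "Nobel Prize") are assumptions quoted from memory and should be verified on wikidata.org:

```python
# Sketch only: find people awarded a Nobel prize (P166) that lack
# a Nobel prize ID (P3188). P166/P3188/Q7191 are assumed IDs.
import requests

QUERY = """
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person wdt:P166 ?award .                      # received some award ...
  ?award wdt:P31/wdt:P279* wd:Q7191 .            # ... that is a Nobel prize
  FILTER NOT EXISTS { ?person wdt:P3188 ?id . }  # ... but has no Nobel prize ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "implication-exploration-sketch/0.1"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])  # e.g. the 25 laureates missing an ID
```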
The idea is that if we had this knowledge that some implications are not satisfied, then this points to missing or wrong information, and we want to represent it in a way that is easy to understand and also succinct. It shouldn't take long to write it down; it should have a short representation. That rules out anything involving complex syntax or logical quantifiers: no SPARQL queries as a description of that implicit knowledge, and no description logics, if you've heard of those. We also want something that we can actually compute on actual hardware in a reasonable timeframe. So our approach is to use Formal Concept Analysis, a technique that has been developed over the past several decades to extract what are called propositional implications – just logical formulas of propositional logic that are an implication of the form "awarded(nobelPrize)" implies "nobelID". So what exactly is Formal Concept Analysis? Off to Tom.
T: Thank you. So what is Formal Concept Analysis? It was developed in the 1980s by Rudolf Wille and Bernhard Ganter, who were restructuring lattice theory. "Lattice" is an ambiguous name in math; it has two meanings. One meaning is that you have a grid and have a lattice there. The other is to speak about orders – order relations. So: I like steaks, I like pudding, and I like steaks more than pudding, and I like rice more than steaks. That's an order, right? And lattices are particular orders which can be used to represent propositional logic – easy rules like "when it rains, the street gets wet". The data representation those guys used back then, they called it a formal context, which is basically just a set of objects – they call them objects, it's just a name –, a set of attributes, and some incidence, which basically says which object has which attributes. For example, my laptop has the colour black, so this object has some property. That's a small example on the right for such a formal context. The objects there are some animals: a platypus – that's the fun animal from Australia, the mammal which also lays eggs and is venomous –, a black widow – the spider –, the duck and the cat. So we see the platypus has all the properties: being venomous, laying eggs and being a mammal; we have the duck, which is not a mammal, but it lays eggs, and so on. And it's very easy to grasp some implicational knowledge here. An easy rule you can find is: whenever you encounter a mammal that is venomous, it has to lay eggs. This is a rule that falls out of this binary data table.
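To make the formal-context idea concrete, here is a minimal sketch (ours, not from the talk) of the animals example, together with the check that an implication holds in the table:

```python
# The animals context from the slide, as object -> attribute sets.
context = {
    "platypus":    {"mammal", "venomous", "lays eggs"},
    "black widow": {"venomous", "lays eggs"},
    "duck":        {"lays eggs"},
    "cat":         {"mammal"},
}

def holds(premise, conclusion, ctx):
    """An implication premise -> conclusion holds iff every object
    that has all premise attributes also has all conclusion attributes."""
    return all(conclusion <= attrs
               for attrs in ctx.values()
               if premise <= attrs)

# "Whenever you encounter a venomous mammal, it lays eggs."
print(holds({"mammal", "venomous"}, {"lays eggs"}, context))  # True
```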
Our main problem at this point is that we do not have such a data table for Wikidata. We have the Wikidata graph, which is way more expressive than binary data, and we cannot even store Wikidata as a binary table – and even if we tried to, we would have no chance to compute such rules from it. For this, the people from Formal Concept Analysis proposed an algorithm to extract implicit knowledge from an expert. Our expert here could be Wikidata: it is an expert, and you can ask it questions using the SPARQL interface. You can ask: "Is there an example for that? Is there a counterexample for something else?" The algorithm is quite easy. There is the algorithm and some expert – in our case, Wikidata – and the algorithm keeps notes of counterexamples and of valid implications. In the beginning, we do not have any valid implications, so the list on the right is empty, and we do not have any counterexamples, so the list on the left – the formal context to build up – is also empty. All the algorithm does now is ask: "Is this implication, X implies Y, true?" So, "is it true," for example, "that an animal that is a mammal and is venomous lays eggs?" Now the expert, which in our case is Wikidata, can answer. We can query that – we showed in our paper that we can query that. So we query it, and if the Wikidata expert does not find any counterexamples, it will say: "Ok, that may well be true, yes." Or, if it is not a true implication in Wikidata, it can say: "No, it's not true, and here is a counterexample." So this is something you contradict by example. You say this rule cannot be true: for example, when the street is wet, that does not mean it has rained – it could be the cleaning service car or something else.
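As a rough sketch of this question-and-answer loop – ours, and deliberately simplified: the real algorithm derives each next question canonically, which we skip here – the shape is roughly:

```python
# Simplified sketch of attribute exploration: for each asked
# implication, the expert either accepts it or refutes it with a
# counterexample, which then grows the formal context.
def explore(questions, expert):
    """questions: iterable of (premise, conclusion) attribute sets.
    expert(premise, conclusion) returns None to accept, or a
    counterexample (object, attributes) to refute."""
    implications = []        # accepted rules
    counterexamples = {}     # object -> attributes, the growing context
    for premise, conclusion in questions:
        answer = expert(premise, conclusion)
        if answer is None:                    # "yes, that is true"
            implications.append((premise, conclusion))
        else:                                 # "no, here is a witness"
            obj, attrs = answer
            counterexamples[obj] = attrs
    return implications, counterexamples
```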
So our idea now was to use Wikidata as an expert, but also to include a human in this loop. We do not just want to ask Wikidata, we want to ask a human expert as well. So in our tool we first ask the Wikidata expert about some rule, and after that we also inquire of the human expert, who can say "yes, that's true, I know that," or "no, Wikidata is not aware of this counterexample, I know one," or, in the other case, "oh, Wikidata says this is true, but I am aware of a counterexample," and so on. You can represent this more or less – this is just some mathematical picture, it's not very important. But you can see on the left an exploration going on, just Wikidata with the algorithm, and on the right an exploration with a human expert versus Wikidata, which can answer all the queries. And we combined those two into one small tool, still under development. So, back to Max.
M: Okay. For that to work, we basically need a way of viewing Wikidata, or at least parts of Wikidata, as a formal context. And this formal context was a binary table, so what do we do? We just take all the items in Wikidata as objects and all the properties as attributes of our context, and then have an incidence relation that says "this entity has this property". But then we end up with a context that has 71 million rows and seven thousand columns. That might actually be a slight problem, because we want something we can run on actual hardware and not on a supercomputer. So let's not do that, and instead focus on a smaller set of properties that are actually related to one another through some kind of common domain. It doesn't make any sense to have a property that relates to spacecraft next to a property that relates to books – it's probably not a good idea to try to find implicit knowledge between those two. But two different properties about spacecraft, that sounds good, right? The interesting question is then how we define the incidence for our set of properties, and that depends very much on which properties we choose. For some properties, it makes sense to account for the direction of the statement: there is a property called child, and then there are father and mother, and you don't want to turn those around – "A is a child of B" should be something different from "B is a child of A". Then there are the qualifiers, which might be important for some properties: receiving an award for something might be different from receiving an award for something else, but receiving an award in 2018 and receiving one in 2017 is probably more or less the same thing, so we don't necessarily need to differentiate there. And there is also a thing called subclasses, which form a hierarchy on Wikidata, and you might want to take that into account as well, because winning something that is a Nobel prize also means winning an award, and winning the Nobel Peace prize means winning a peace prize. So there are also implications going on there that you want to respect.
So, to see how we actually do that, let's look at an example. We have here Donna Strickland, and Ashkin – I forgot his first name –, one of the people who won the Nobel prize in physics with her last year, and also Gérard Mourou, who is the third one. They all got the Nobel prize in physics last year. So we have all these statements here, and these two have a qualifier that says "with: Gérard Mourou". I don't think the qualifier is on this statement here, actually, but it doesn't matter. What we've done here is put all the entities in the small graph as rows in the table. So we have Strickland and Mourou and Ashkin, and also Arnold and Curie, who are not in the picture, but you can maybe remember them. Then here we have "awarded", and we scaled that by the instance of the different Nobel prizes that people have won: the physics Nobel prize in the first column, the chemistry Nobel prize in the second column, and general Nobel prizes in the third column. Then there is "awarded" scaled by the "with" qualifier, so "awarded with: Gérard Mourou". And then there is field of work, where we have lasers and radioactivity, so we scale by the actual field of work that people have. And then, if we look at what kind of incidence we get: Donna Strickland has a Nobel prize in physics, which is also a Nobel prize, and she has that together with Mourou, and she has "field of work: lasers", but not radioactivity. Mourou himself has a Nobel prize in physics, which is a Nobel prize, but none of the others. Ashkin gets the Nobel prize in physics, which is still a Nobel prize, and he gets that together with Gérard Mourou, and he also works on lasers, but not on radioactivity. Frances Arnold has a Nobel prize in chemistry, which is a Nobel prize. And Marie Curie has a Nobel prize in physics and one in chemistry, and they are both a Nobel prize, and she also works on radioactivity – but lasers didn't exist back then, so she doesn't get "field of work: lasers". And then basically this table is a representation of our formal context.
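Rendered as data, the scaled context just described might look like this – a sketch using our own shorthand for the scaled column names, not Wikidata identifiers:

```python
# The scaled Nobel context from the example, as object -> attribute sets.
scaled_context = {
    "Strickland": {"awarded:physicsNobel", "awarded:nobelPrize",
                   "awardedWith:Mourou", "fieldOfWork:lasers"},
    "Mourou":     {"awarded:physicsNobel", "awarded:nobelPrize"},
    "Ashkin":     {"awarded:physicsNobel", "awarded:nobelPrize",
                   "awardedWith:Mourou", "fieldOfWork:lasers"},
    "Arnold":     {"awarded:chemistryNobel", "awarded:nobelPrize"},
    "Curie":      {"awarded:physicsNobel", "awarded:chemistryNobel",
                   "awarded:nobelPrize", "fieldOfWork:radioactivity"},
}

# Subclass scaling makes "physics Nobel implies Nobel prize" hold
# by construction in this slice of the data:
print(all({"awarded:nobelPrize"} <= attrs
          for attrs in scaled_context.values()
          if {"awarded:physicsNobel"} <= attrs))  # True
```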
So we've actually gone ahead and started building a tool where you can interactively do all these things, and it will take care of building the context for you. You just put in the properties, and Tom will show you how that works.
T: Here you see some first screenshots of this tool. Please do not comment on the graphic design – we have no idea about that, we'd have to ask someone. We're just into logics, more or less. On the left, you see the initial state of the game: there are five boxes, called countries and borders, credit cards, use of energy, memory and computation – I think –, and space launches, which are just presets we defined. For example, in the case of the credit cards, you can explore the properties from Wikidata which are called "card network", "operator" and "fee". So you can just choose one of the presets, or, under "custom properties" on the right, you can input whatever properties from Wikidata you are interested in – whichever of the seven thousand you like, or some number of them. On the right, I chose the credit card preset, and I now want to show you what happens if you explore these properties. The first step is that the game – I mean, the exploration process – will ask: is it true that every entity in Wikidata has these three properties? Are they common among all entities in your data? This is most probably not true – I mean, not everything in Wikidata has a fee, at least I hope. So what I do now is click the "reject this implication" button, since the implication "nothing implies everything" is not true. In the second step, the algorithm tries to find the minimal number of questions needed to obtain the domain knowledge, that is, all valid rules in this domain. So the next question is: "Is it true that everything in Wikidata that has a 'card network' property also has a 'fee' and an 'operator' property?" And down here you can see that Wikidata says: "Ok, there are 26 items which are counterexamples." So there are 26 items in Wikidata which have the "card network" property but do not have the other two. Now, 26 is not a big number. This could mean: "Ok, that's an error, so 26 statements are missing." Or maybe that is really the true state of affairs – that's also ok. But you can now choose what you think is right. You can say "oh, I would say it should be true", or you can say "no, I think that's ok, one of these counterexamples seems valid, let's reject it." I, in this case, rejected it. The next question it asks: "Is it true that everything that has an operator also has a fee and a card network?" This is quite possibly not true; there are also more than 1000 counterexamples, one being, I think, a telecommunication operator in Hungary or something. So we can reject this as well. Next question: everything that has an operator and a card network – card network means Visa, MasterCard, all this stuff – is it true that it has to have a fee? Wikidata says "no": it has 23 items that contradict it. But one of the items, for example, is the American Express Gold Card, and I suppose the American Express Gold Card has some fee. So this indicates: "Oh, there is some missing data in Wikidata" – there is something that Wikidata does not know but should know to reason correctly in Wikidata with your SPARQL queries. So we can now say: "That's not a reject, that's an accept," because we think it should be true, even though Wikidata thinks otherwise. And we go on. This is then the last question: "Is it true that everything that has a fee and a card network should have an operator?" And you see: no counterexamples. This means Wikidata says "this is true", because it finds no counterexample; if you ask Wikidata, this is a valid implication in the data set so far. That could also indicate that something is missing – I'm not aware if this is possible or not, but for me it sounds reasonable: everything that has a fee and a card network should also have an operator, which means a bank or something like that. So I accept this implication. And then, yeah, you have won the exploration game, which essentially means you have won some knowledge. Thank you. And the knowledge is that you know which implications in Wikidata are true or should be true from your point of view. This is more or less the state of the game so far as we programmed it in October. The next step will be to show you how much your opinion of the world differs from the opinion that is now reflected in the data: is what you think about the data close to what is true in Wikidata? Or maybe Wikidata has wrong information – you can find that out with this. But Max will tell you more about that.
M: Ok. Let me just quickly come back to what we have actually done. We offer a procedure that allows you to explore properties in Wikidata and the implicational knowledge that holds between these properties. The key idea here is that when you look at the implications you get, there might be some that you don't actually want, because they shouldn't be true, and there might also be ones that you don't get but expect to get, because they should hold. These unwanted and/or missing implications point to missing statements and items in Wikidata. So they show you where the opportunities to improve the knowledge in Wikidata are, and sometimes you also get to learn something about the world – in most cases, that the world is more complicated than you thought it was, and that's just how life is. But in general, implications can guide you in improving Wikidata and the state of knowledge therein. So what's next? What we currently don't offer in the exploration game, and what we will definitely focus on next, is configurable and filterable counterexamples. Right now you just get a list of a random number of counterexamples; you might want to search through this list for something you recognise, and you might also want to explicitly say "this one should be a counterexample". That's definitely coming next. Then, for domain-specific scaling of properties, there is still much work to be done. Currently we only have some very basic support for that: you can have properties, but you can't do the fancy things where you say "everything that is an award should be considered as one instance of this property". That's also coming. And then there is what Tom mentioned already: comparing the knowledge that you have explored through this process against the knowledge that is currently on Wikidata, as a way of seeing where you stand, what is missing in Wikidata, and how you can improve it. And if you have any more suggestions for features, just tell us – there is a GitHub link on the implication game page, and here is the link to the tool again. So, yeah, just let us know, open an issue and have fun. And if you have any questions, then I guess now would be the time to ask.

T: Thank you.

Herald: Thank you very much, Tom and Max.

applause
Herald: So we will switch microphones now, because then I can hand this microphone to you if any of you have a question for our two speakers. Are there any questions or suggestions? Yes.

Question: Hi, thanks for the nice talk. I wanted to ask: what is the most interesting implication that you've found?

M: Yeah, that would have made for a good backup slide. The most interesting implication so far –

T: The most basic thing you would expect: everything that landed from space – that has a landing date – also has a start date. So nothing landed on Earth which was not started here.

M: Yes.
Q: Right now, the game only helps you find implications. Are you also planning to let me add data? For example, say I have twenty-five Nobel laureates who don't have a Nobel prize ID – are there plans for a simple interface where I could google and add that ID? It would make the process of adding new entities to Wikidata itself simpler.

M: Yes. That's partly hidden behind this "configurable and filterable counterexamples" thing. We will probably not have an explicit interface for adding stuff, but most likely we will interface with some other tool built around Wikidata, probably something that gives you QuickStatements or the like. But yes, adding data is definitely on the roadmap.
Herald: Any more questions? Yes.

Q: Wouldn't it be nice to do this in other languages, too?

T: Actually, it's language independent. We use Wikidata, and as far as we know, Wikidata has no language itself: it just has items and properties, Qs and Ps, and whatever language you use, the questions should be translated using the labels of those properties and items, if there is a label for that property or item. So if Wikidata is aware of your language, we are.

Herald: Oh, yes. More!

M: Of course, the tool itself still needs to be translated, but –

T: The tool itself, it should be.
Q: Hi, thanks for the talk. I have a question. Right now you can only find missing data with this, right? Or surplus data. Do you think you'd be able to find wrong information with a similar approach?

T: Actually, we do. If Wikidata has a counterexample to something we would expect to be true, this could point to wrong data, right? If the counterexample is a wrong counterexample, or if a property or statement is missing on an item.
Q: Ok, I get to ask a second question. The horizontal axis in the incidence matrix – you said it spans 7000 columns, right?

M: Yes, because there are 7000 properties in Wikidata.

Q: But it's actually way more columns, right? Because you multiply the properties by their arguments.

M: Yes, if you do any scaling, then of course that can give you multiple entries per property.

Q: So that's what you mean by scaling, basically?

M: Yes. But already seven thousand is way too big to actually compute with.

Q: How many would it be if you multiplied out all the arguments?

M: I have no idea, probably a few million.
Q: Have you thought about a recursive method, as counterexamples may be wrong by other counterexamples, like in an argumentation graph or something like this?

T: Actually, I don't get it. How can a counterexample be wrong through another counterexample?

Q: Maybe some example says that cats can have golden hair, and then another example might say that this is not a cat.

T: Ah, so the property of being a cat, or something cat-ish, is missing then. Okay. No, we have not considered deeper reasoning so far. This is Horn propositional logic, you know; it has no contradictions, because all you can do is contradict by counterexamples. There can never be a rule that is not true so far – just in your or my opinion, maybe, but not in the logic. So what we would have to think about is bigger reasoning.
Q: Sorry, quick question. You're not considering all the 7000-odd properties for each of the entities, right? What is your current process of filtering out the relevant properties? I'm sorry, I didn't get that.

M: Well, we basically handpick those. You have this input field where you can go ahead and select your properties. We also have some predefined sets. And there are also some classes for groups of related properties that you could use if you want bigger sets.

T: For example, space or family, or – what was the other?

M: Awards is one.

T: It depends on the size of the class. For example, for space it's not that much, I think 10 or 15 properties. It will take you some hours, but you can do it, because it's only 15 or so. I think for family it's way too much, something like 40 or 50 properties, so a lot of questions.
Herald: I don't see any more hands. Maybe someone who has not asked a question yet has another one – we could take that. Otherwise we would be perfectly on time. And maybe you can tell us where you will be for deeper discussions, where people can find you.

T: Probably at the couches.

Herald: The couches, behind our stage.

M: Or just running around somewhere. There are also our DECT numbers on the slides: it's 6284 for Tom and 6279 for me. So just call and ask where we're hanging around.

Herald: Well then, thank you again. Have a round of applause.

applause

T: Thank you.

M: Well, thanks for having us.

applause

postroll music

subtitles created by c3subtitles.de in the year 2020. Join, and help us!