36C3 preroll music
Herald Angel: We have Tom and Max here.
They have a talk here with a very
complicated title that I don't quite
understand yet. It's called "Interactively
Discovering Implicational Knowledge in
Wikidata". And they told me that the point of
the talk is that I will understand what it
means, and I hope I will. So good luck.
Tom: Thank you very much.
Herald: And have some applause, please.
applause
T: Thank you very much. Do you hear me?
Does it work? Hello? Oh, very good. Thank
you very much and welcome to our talk
about interactively discovering
implicational knowledge in Wikidata. It
is more or less a fun project we started
for finding rules that are implicit in
Wikidata – entailed just by the data it
has, that people inserted into the
Wikidata database so far. And we will
start with the explicit knowledge. So the
explicit data in Wikidata, with Max.
Max: So. Right. What is Wikidata?
Maybe you have heard about Wikidata, then
that's all fine. Maybe you haven't, then
surely you've heard of Wikipedia. And
Wikipedia is run by the Wikimedia
Foundation and the Wikimedia Foundation
has several other projects. And one of
those is Wikidata. And Wikidata is
basically a large graph that encodes
machine readable knowledge in the form of
statements. And a statement basically
consists of some entity that is connected
– or some entities that are connected
by some property. And these properties
can then even have annotations on them.
So, for example, we have Donna Strickland
here and we encode that she has received a
Nobel prize in physics last year by this
property "awarded" and this has then a
qualifier "time: 2018" and also "for:
Chirped Pulse Amplification". And all in
all, we have some 890 million statements
on Wikidata that connect 71 million items
using 7000 properties. But there's also a
bit more. So we also know that Donna
Strickland has "field of work: optics" and
also "field of work: lasers" so we can use
the same property to connect some entity
with different other entities. And we
don't even have to have knowledge that
connects the entities. We can have a date
of birth, which is 1959. And this is
then just a plain date, not an entity. And
now coming from the explicit knowledge
then, well, we have some more we have
Donna Strickland has received a Nobel
prize in physics and also Marie Curie has
received the Nobel prize in physics. And
we also know that Marie Curie has a Nobel
prize ID that starts with "phys" and then
"1903" and some random numbers that
basically are this ID. Then Marie Curie
also has received a Nobel prize in
chemistry in 1911. So she has another
Nobel ID that starts with "chem" and has
"1911" there. And then there's also
Frances Arnold, who received the Nobel
prize in chemistry last year. So she has a
Nobel ID that starts with "chem" and has
"2018" there. And now one one could assume
that, well, everybody who was awarded the
Nobel prize should also have a Nobel ID.
So everybody who was awarded the Nobel
prize should also have a Nobel prize ID,
and we could write that as some
implication here. So "awarded(nobelPrize)"
implies "nobelID". And well, if you
look closely at this picture, then there's
an arrow conspicuously missing here:
Donna Strickland doesn't have a Nobel
prize ID. And indeed, there are 25 people
currently on Wikidata that are missing
Nobel prize IDs, and Donna Strickland is
one of them. So we call these people that
don't satisfy this implication – we call
those counterexamples and well, if you
look at Wikidata on the scale of really
these 890 million statements, then you
won't find any counterexamples because
it's just too big. So we need some way to
automatically do that. And the idea is
that, well, if we had this knowledge that
while some implications are not satisfied,
then this encodes maybe missing
information or wrong information, and we
want to represent that in a way that is
easy to understand and also succinct. So
it doesn't take long to write it down, it
should have a short representation. So
that rules out anything involving complex
syntax or logical quantifiers. So no SPARQL
queries as a description of that implicit
knowledge. No description logics, if
you've heard of that. And we also want
something that we can actually compute on
actual hardware in a reasonable timeframe.
So our approach is we use Formal Concept
Analysis, which is a technique that has
been developed over the past several decades
to extract what is called propositional
implications. So just logical formulas of
propositional logic that are an
implication in the form of this
"awarded(nobelPrize)" implies "nobleID".
So what exactly is Formal Concept
Analysis? Off to Tom.
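What this implication check boils down to computationally can be sketched in a few lines of Python. This is only a toy illustration, not the actual tool; the people and attribute labels are taken from the example above and are not real Wikidata identifiers.

```python
# Toy version of the example above: each person is mapped to the set of
# attributes they have.  Labels are illustrative, not Wikidata identifiers.
people = {
    "Donna Strickland": {"awarded(nobelPrizePhysics)"},
    "Marie Curie": {"awarded(nobelPrizePhysics)",
                    "awarded(nobelPrizeChemistry)", "nobelID"},
    "Frances Arnold": {"awarded(nobelPrizeChemistry)", "nobelID"},
}

def counterexamples(premise, conclusion, data):
    """Objects that have every premise attribute but miss a conclusion attribute."""
    return [name for name, attrs in data.items()
            if premise <= attrs and not conclusion <= attrs]

# "Everybody who was awarded the Nobel prize in physics should have a Nobel ID."
print(counterexamples({"awarded(nobelPrizePhysics)"}, {"nobelID"}, people))
# -> ['Donna Strickland']
```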
T: Thank you. So what is Formal Concept
Analysis? It was developed in the 1980s by
Rudolf Wille and Bernhard Ganter
and they were restructuring lattice
theory. Lattice theory is an ambiguous
name in math, it has two meanings: One
meaning is you have a grid and have a
lattice there. The other thing is to speak
about orders – order relations. So I like
steaks, I like pudding and I like steaks
more than pudding. And I like rice more
than steaks. That's an order, right? And
lattices are particular orders which can
be used to represent propositional logic.
So easy rules like "when it rains, the
street gets wet", right? So and the data
representation those guys used back then,
they called it a formal context, which is
basically just a set of objects – they
call them objects, it's just a name –, a
set of attributes and some incidence,
which basically means which object has
which attributes. So, for example, my
laptop has the colour black. So this
object has some property, right? So that's
a small example on the right for such a
formal context. So the objects there are
some animals: a platypus – that's the fun
animal from Australia, the mammal that
also lays eggs and is venomous –, a black
widow – the spider –,
the duck and the cat. So we see, the
platypus has all the properties: it is
venomous, it lays eggs and it is a
mammal; we have the duck, which is not a
mammal, but it lays eggs, and so on and so
on. And it's very easy to grasp some
implicational knowledge here. An easy rule
you can find is: whenever you encounter a
mammal that is venomous, it has to lay
eggs. So this is a rule that falls out of
this binary data table. Our main problem
at this point is that we do not have
such a data table for Wikidata, right? We
have the graph, which is way more
expressive than binary data, and we cannot
even store Wikidata as a binary table.
Even if you tried to, we have no chance to
compute such rules from that. And for
this, the people from Formal Concept
Analysis proposed an algorithm to extract
implicit knowledge from an expert. So our
expert here could be Wikidata. It's an
expert, you can ask Wikidata questions,
right? Using this SPARQL interface, you
can ask. You can ask "Is there an example
for that? Is there a counterexample for
something else?" So the algorithm is quite
easy. The setup is the algorithm and
some expert – in our case, Wikidata – and
the algorithm keeps notes of
counterexamples and of valid
implications. So in the beginning, we do
not have any valid implications, so this
list on the right is empty, and in the
beginning we do not have any
counterexamples. So the list on the left,
the formal context to build up is also
empty. And all the algorithm does now is,
it asks "is this implication, X follows Y,
Y follows X or X implies Y, is it true?"
So "is it true," for example, "that an
animal that is a mammal and is venomous
lays eggs?" So now the expert, which in
our case is Wikidata, can answer it – we
showed in our paper how to query that.
So we query it, and if the
Wikidata expert does not find any
counterexamples, it will say, ok, that may
well be true; yes. Or if
it's not a true implication in Wikidata,
it can say, no, no, no, it's not true, and
here's a counterexample. So this is
something you contradict by example. You
say this rule cannot be true. For example,
when the street is wet, that does not mean
it has rained, right? It could be the
cleaning service car or something else. So
our idea now was to use Wikidata as an
expert, but also include a human into this
loop. So we do not just want to ask
Wikidata, we also want to ask a human
expert as well. So we first ask in our
tool the Wikidata expert for some rule.
After that, we also inquire the human
expert. And he can also say "yeah, that's
true, I know that," or "No, no. Wikidata
is not aware of this counterexample, I
know one." Or, in the other case "oh,
Wikidata says this is true. I am aware of
a counterexample." Yeah, and so on and so
on. And you can represent this more or
less – this is just some mathematical
picture, it's not very important. But you
can see on the left there's an exploration
going on, just Wikidata with the
algorithm, on the right an exploration, a
human expert versus Wikidata which can
answer all the queries. And we combined
those two into one small tool, still under
development. So, back to Max.
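The loop Tom sketches is known in the FCA literature as attribute exploration. The following Python sketch is heavily simplified – the real algorithm enumerates candidate premises canonically via next-closure, and the expert would be Wikidata plus a human rather than a fixed table – but it shows the shape of the interaction: each question is either accepted as a valid implication or refuted by a counterexample that gets added to the growing context.

```python
# A heavily simplified sketch of the exploration loop described above.
# The candidate implications are simply given as a list, and the "expert"
# is an ordinary function standing in for Wikidata plus a human.

def explore(candidates, expert):
    """candidates: list of (premise, conclusion) pairs of attribute sets.
    expert(premise, conclusion) returns None if the implication is accepted,
    or a (name, attributes) counterexample otherwise."""
    context = {}          # the growing formal context: object -> attributes
    accepted = []         # the growing list of valid implications
    for premise, conclusion in candidates:
        answer = expert(premise, conclusion)
        if answer is None:
            accepted.append((premise, conclusion))
        else:
            name, attributes = answer
            context[name] = attributes
    return accepted, context

# Example expert: a fixed toy context playing the role of Wikidata.
toy = {"platypus": {"mammal", "venomous", "lays eggs"},
       "duck": {"lays eggs"},
       "cat": {"mammal"}}

def toy_expert(premise, conclusion):
    for name, attrs in toy.items():
        if premise <= attrs and not conclusion <= attrs:
            return name, attrs      # contradict by counterexample
    return None                     # no counterexample found: accept

implications, counterexample_context = explore(
    [({"mammal", "venomous"}, {"lays eggs"}),   # holds in the toy data
     ({"lays eggs"}, {"mammal"})],              # the duck is a counterexample
    toy_expert)
print(implications)               # the first implication is accepted
print(counterexample_context)     # {'duck': {'lays eggs'}}
```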
M: Okay. So for that to work, we
basically need to have a way of viewing
Wikidata, or at least parts of Wikidata,
as a formal context. And this formal
context, well, this was a binary table, so
what do we do? We just take all the items
in Wikidata as objects and all the
properties as attributes of our context
and then have an incidence relation that
says "well, this entity has this
property," so it is incident there, and
then we end up with a context that has 71
million rows and seven thousand columns.
So, well, that might actually be a slight
problem there, because we want to have
something that we can run on actual
hardware and not on a supercomputer. So
let's maybe not do that and focus on
a smaller set of properties that are
actually related to one another through
some kind of common domain, yeah? So it
doesn't make any sense to have a property
that relates to spacecraft and then a
property that relates to books – that's
probably not a good idea to try to find
implicit knowledge between those two. But
two different properties about spacecraft,
that sounds good, right? And then the
interesting question is just how do we
define the incidence for our set of
properties? And that actually depends very
much on which properties we choose,
because it does – for some properties, it
makes sense to account for the direction
of the statement: So there is a property
called parent? Actually, no, it's child,
and then there's father and mother, and
you don't want to turn those around: "A is
a child of B" should be something
different from "B is a child of A." Then
there's the
qualifiers that might be important for
some properties. So receiving an award for
something might be something different
than receiving an award for something
else. But receiving an award in 2018
and receiving one in 2017, that's probably
more or less the same thing, so we don't
necessarily need to differentiate that.
And there's also a thing called subclasses
and they form a hierarchy on Wikidata. And
you might also want to take that into
account, because winning something that is
a Nobel prize also means winning an award,
and winning the
Nobel Peace prize means winning a peace
prize. So there's also implications going
on there that you want to respect. So,
to see how we actually do that, let's look
at an example. So we have here, well, this
is Donna Strickland. And – I forgot his
first name – Ashkin, this is one of the
people that won the Nobel prize in physics
with her last year. And also Gérard
Mourou. That is the third one. They all
got the Nobel prize in physics last year.
So we have all these statements here, and
these two have a qualifier that says
"with: Gérard Mourou" here. And I don't
think the qualifier is on this statement
here, actually, but it doesn't actually
matter. So what we've done here is
put all the entities in the small graph as
rows in the table. So we have Strickland
and Mourou and Ashkin, and also Arnold and
Curie that are not in the picture. But you
can maybe remember that. And then here we
have awarded, and we scaled that by the
instance of the different Nobel prizes
that people have won. So that's the
physics Nobel in the first column, the
chemistry Nobel Prize in the second column
and just general Nobel prizes in the third
column. There's awarded and that is scaled
by the "with" qualifier, so awarded with
Gérard Mourou. And then there's field of
work, and we have lasers here and
radioactivity, so we scale by the actual
field of work that people have. And well
then, if we look at what kind of incidence
we get for Donna Strickland, she has a
Nobel prize in physics and that is also a
Nobel prize, and she has that together
with Mourou. And she has "field of work:
lasers," but not radioactivity. Then,
Mourou himself: he has a Nobel prize in
physics, and that is a Nobel prize, but
none of the others. Ashkin gets the Nobel
prize in physics, and that is still a
Nobel prize, and he gets that with Gérard
Mourou. And also he works on lasers, but
not in radioactivity. So Frances Arnold
has a Nobel prize in chemistry, and that
is a Nobel prize. And Marie Curie, she has
a Nobel prize in physics and one in
chemistry, and they are both a Nobel
prize. And she also works on
radioactivity. But lasers didn't exist
back then, so she doesn't get "field of
work: lasers." And then basically this
table here is a representation of our
formal context. And then we've actually
gone ahead and started building a tool
where you can interactively do all these
things, and it will take care of building
the context for you. You just put in the
properties, and Tom will show
you how that works.
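Written out by hand, the scaled context for this example could look like the sketch below. The column labels are just illustrative names for the scaled attributes; the actual tool derives them from the chosen properties, their qualifiers and the subclass hierarchy.

```python
# A hand-written sketch of the scaled context from the example above.
# Attribute names are illustrative only.
attributes = ["awarded:nobelPhysics", "awarded:nobelChemistry",
              "awarded:nobelPrize", "awarded with:Mourou",
              "field of work:lasers", "field of work:radioactivity"]

context = {
    "Strickland": {"awarded:nobelPhysics", "awarded:nobelPrize",
                   "awarded with:Mourou", "field of work:lasers"},
    "Mourou":     {"awarded:nobelPhysics", "awarded:nobelPrize"},
    "Ashkin":     {"awarded:nobelPhysics", "awarded:nobelPrize",
                   "awarded with:Mourou", "field of work:lasers"},
    "Arnold":     {"awarded:nobelChemistry", "awarded:nobelPrize"},
    "Curie":      {"awarded:nobelPhysics", "awarded:nobelChemistry",
                   "awarded:nobelPrize", "field of work:radioactivity"},
}

# Print the binary incidence table: an "x" marks an incident attribute.
print("object".ljust(12) + " ".join(a.split(":")[-1].ljust(14) for a in attributes))
for obj, attrs in context.items():
    print(obj.ljust(12) +
          " ".join(("x" if a in attrs else ".").ljust(14) for a in attributes))
```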
T: So here you see some first screenshots
of this tool. So please do not comment on
the graphic design. We have no idea about
that, we have to ask someone about that.
We're just into logic, more or less. On
the left, you see the initial state of the
game, where you have five boxes:
they're called countries and borders,
credit cards, use of energy, memory and
computation – I think –, and space
launches, which are just presets we
defined. For example, in the case of the
credit card, you can
explore the properties from Wikidata which
are called "card network," "operator," and
"fee," so you can just choose one of them,
or on the right, "custom properties," you
can just input the properties you're
interested in from Wikidata, whichever of
the seven thousand you like, or some
number of them. On the right, I chose then
the credit card thingy and I now want to
show you what happens if you now explore
these properties, right? The first step in
the game is that the game will ask – I
mean, the game, the exploration process –
will ask: is it true that every entity in
Wikidata has these three properties?
So are they common among all entities in
your data, which is most probably not
true, right? I mean, not everything in
Wikidata has a fee, at least I hope. So,
what I do now is click the
"reject this implication" button, since
the implication "Nothing implies
everything" is not true. In the second
step now, the algorithm tries to find the
minimal number of questions to obtain the
domain knowledge, so to obtain all valid
rules in this domain. So next question is
"is it true that everything in Wikidata
that has a 'card network' property also
has a 'fee' and an 'operator' property?"
And down here you can see Wikidata says
"ok, there are 26 items which are
counterexamples," so there's 26 items in
Wikidata which have the "card network"
property but do not have the other two
ones. So, 26 is not a big number, this
could mean "ok, that's an error, so 26
statements are missing." Or maybe that
that's, really, that's the true case.
That's also ok. But you can now choose
what you think is right. You can say, "oh,
I would say it should be true" or you can
say "no, I think that's ok, one of these
counterexamples seems valid. Let's reject
it." I in this case, rejected it. The next
question it asks: "is it true that
everything that has an operator has also a
fee and a card network?" Yeah, this is
possibly not true. There are also more than
1000 counterexamples, one being, I think,
a telecommunications operator in Hungary or
something. And so we can reject this as
well. Next question: "everything that has
an operator and a card network – so card
network means Visa, MasterCard, whatever,
all this stuff – is it true that they have
to have a fee?" Wikidata says "no," it has
23 items that contradict it. But one of
the items, for example, is the American
Express Gold Card. I suppose the American
Express Gold Card has some fee. So this
indicates, "oh, there is some missing data
in Wikidata," there is something that
Wikidata does not know but should know to
reason correctly in Wikidata with your
SPARQL queries. So we can now say, "yeah,
that's, uh, that's not a reject, that's an
accept," because we think it should be
true. But Wikidata thinks otherwise. And
you go on, we go on. This is then the last
question: "Is it true that everything that
has a fee and a card network should have an
operator," and you see, "oh, no counter
examples." This means Wikidata says "this
is true," because it says there is no
counterexample. If you're asking Wikidata
it says this is a valid implication in the
data set so far, which could also indicate
that something is missing – I'm not sure
whether that is the case here – but ok,
for me it sounds reasonable: everything that
has a fee and a card network should also
have an operator, which means a bank or
something like that. So I accept this
implication. And then, yeah, you have won
the exploration game, which essentially
means you've won some knowledge. Thank
you. And the knowledge is that you know
which implications in Wikidata are true or
should be true from your point of view.
And yeah, this is more or less the state
of the game so far as we programmed it in
October. And the next step will be to
show you some – "How much does your
opinion of the world differ from the
opinion that is now reflected in the
data?" So is what you think about the data
true, close to true to what is true in
Wikidata. Or maybe Wikidata has wrong
information. You can find it with that.
But Max will tell me more about that.
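Under the hood, each of these questions amounts to asking Wikidata for counterexamples: items that have all premise properties but lack at least one conclusion property. A rough sketch of such a query against the public Wikidata Query Service is shown below; the property IDs in the usage example are placeholders, since the real IDs for "card network", "operator" and "fee" would have to be looked up, and the actual tool may build its queries differently.

```python
# Sketch: count counterexamples for an implication over Wikidata properties,
# i.e. items that have every premise property but miss a conclusion property.
import requests

def count_counterexamples(premise_props, conclusion_props):
    premise = "\n      ".join(f"?item wdt:{p} [] ." for p in premise_props)
    conclusion = " ".join(f"?item wdt:{p} [] ." for p in conclusion_props)
    query = f"""
    SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {{
      {premise}
      FILTER NOT EXISTS {{ {conclusion} }}
    }}"""
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"},
                     headers={"User-Agent": "implication-exploration-sketch"})
    return int(r.json()["results"]["bindings"][0]["count"]["value"])

# "Everything that has an operator and a card network should have a fee."
# P0001, P0002, P0003 are placeholders, not the real property IDs.
print(count_counterexamples(["P0001", "P0002"], ["P0003"]))
```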
M: Ok. So let me just quickly come
back to what we have actually done. So we
offer a procedure that allows you to
explore properties in Wikidata and the
implicational knowledge that holds between
these properties. And the key idea here is
that when you look at these implications
that you get, while there might be some
that you don't actually want because they
shouldn't be true, and there might also be
ones that you don't get, but you expect to
get because they should hold. And these
unwanted and/or missing implications, they
point to missing statements and items in
Wikidata. So they show you where the
opportunities to improve the knowledge in
Wikidata are, and, well, sometimes you
also get to learn something about the
world, and in most cases, it's that the
world is more complicated than you thought
it was – and that's just how life is. But
in general, implications can guide you in
your way of improving Wikidata and the
state of knowledge therein. So what's
next? Well, so what we currently don't
offer in the exploration game and what we
will definitely focus on next is having
configurable counterexamples and also
filterable counterexamples – right now you
just get a list of a random number of
counterexamples. And you might want to
search through this list for something you
recognise and you might also want to
explicitly say, well, this one should be a
counterexample, and that's definitely
coming next. Then, well, domain specific
scaling of properties, there's still much
work to be done. Currently, we only have
some very basic support for that. So you
can have properties, but you can't do the
fancy things where you say, "well,
everything that is an award should be
considered as one instance of this
property." That's also coming and then
what Tom mentioned already: compare your
knowledge that you have explored through
this process against the knowledge that is
currently on Wikidata as a way of seeing
"where do you stand? What is missing in
Wikidata? How can you improve Wikidata?"
And well, if you have any more suggestions
for features, then just tell us. There's a
GitHub link on the implication game page.
And here's the link to the tool again. So,
yeah, just let us know. Open an issue and
have fun. And if you have any questions,
then I guess now would be the time to ask.
T: Thank you.
Herald: Thank you very much, Tom and Max.
applause
Herald: So we will switch microphones now
because then I can hand this microphone to
you if any of you have a question for our
two speakers. Are there any questions or
suggestions? Yes.
Question: Hi. Thanks for the nice talk. I
wanted to ask what's the first question,
what's the most interesting implication
that you've found?
M: Yeah. That would have made for a
good backup slide. The most interesting
implication so far –
T: The most basic thing: you would expect
everything that is launched into space by
humans – no, everything that landed from
space, that has a landing date, also has a
start date. So nothing landed on Earth
that was not launched from here.
M: Yes.
Q: Right now, the game only helps you find
out implications. Are you also planning to
let me add data as well? For example,
let's say I have twenty-five Nobel
laureates who don't have a Nobel laureate
ID. Are there plans for a simple interface
for me to Google and add that ID? Because
it would make the process of adding new
entities to Wikidata itself simpler.
M: Yes. And that's partly hidden
behind this "configurable and filterable
counterexamples" thing. We will probably
not have an explicit interface for adding
stuff, but most likely interface with some
other tool built around Wikidata, so
probably something that will give you
QuickStatements or something like that.
But yes, adding data is definitely on the
roadmap.
Herald: Any more questions? Yes.
Q: Wouldn't it be nice to do this in other
languages, too?
T: Actually it's language independent, so
we use Wikidata and then as far as we
know, Wikidata has no language itself. You
know, it has just items and properties, so
Qs and Ps, and whatever language you use,
it should be translated into the language of
the properties, if there is a label for
that property or for that item that you
have. So if Wikidata is aware of your
language, we are.
Herald: Oh, yes. More!
M: Of course, the tool still needs to be
translated, but –
T: The tool itself, it should be.
Q: Hi, thanks for the talk. I have a
question. Right now you only can find
missing data with this, right? Or surplus
data. Would you think you'd be able to
find wrong information with a similar
approach?
T: Actually, we do. I mean, if Wikidata
has a counterexample to something we would
expect to be true, this could point to
wrong data, right? If the counterexample
is a wrong counterexample – if there is a
missing property on an item.
Q: Ok, I get to ask a second question. So
the horizontal axis in the incidence
matrix. You said it has 7000, it spans
7000 columns, right?
M: Yes, because there's 7000 properties in
Wikidata.
Q: But it's actually way more columns,
right? Because you multiply the properties
times the arguments, right?
M: Yes. So if you do any scaling then of
course that might give you multiple
entries.
Q: So that's what you mean with scaling,
basically?
M: Yes. But already seven thousand is way
too big to actually compute that.
Q: How many would it be if you multiply
all the arguments?
M: I have no idea, probably a few million.
Q: Have you thought about a recursive
method, as counterexamples may be wrong by
other counterexamples, like in an
argumentative graph or something like
this?
T: Actually, I don't get it. How can a
counterexample be wrong through another
counterexample?
Q: Maybe some example says that cats can
have golden hair and then another example
might say that this is not a cat.
T: Ah, so the property to be a cat or
something cat-ish is missing then. Okay.
No, we have not considered deeper
reasoning so far. This Horn propositional
logic, you know, has no contradictions,
because all you can do is contradict by
counterexamples, so there can never be a
rule that is not true – just in your or my
opinion, maybe, but not in the logic. So
what we would have to think about is
bigger reasoning, right? So.
Q: Sorry, quick question. Because you're
not considering all the 7000 odd
properties for each of the entities,
right? What's your current process of
filtering? What are the relevant
properties? I'm sorry, I didn't get that.
M: Well, we basically handpick those. So
you have this input field? Yeah, we can go
ahead and select our properties. We also
have some predefined sets. Okay. And
there's also some classes for groups of
properties that are related that you could
use if you want bigger sets,
T: for example, space or family or what
was the other?
M: Awards is one.
T: It depends on the size of the class.
For example, for space, it's not that
much, I think it's 10 or 15 properties. It
will take you some hours, but you can do
it, because it's 15 or something like
that. I think for family, it's way too
much, it's like 40 or 50 properties. So a
lot of questions.
Herald: I don't see any more hands. Maybe
someone who has not asked a question yet
has another one, we could take that;
otherwise we would be perfectly on time.
And maybe you can tell us where you will
be for deeper discussions where people can
find you.
T: Probably at the couches.
Herald: The couches, behind our stage.
M: Or just running around somewhere. So
there's also our DECT numbers on the
slides; it's 6284 for Tom and 6279 for me.
So just call and ask where we're hanging
around.
Herald: Well then, thank you again. Have a
round of applause.
applause
T: Thank you.
M: Well, thanks for having us.
applause
postroll music
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!