-
(moderator) The next talk is
by Anders Sandholm
-
on Wikidata fact annotation
for Wikipedia across languages.
-
- Thank you.
- Thanks.
-
I wanted to start with a small confession.
-
Wow! I'm blown away
by the momentum of Wikidata
-
and the engagement of the community.
-
I am really excited about being here
-
and getting a chance to talk
about work that we've been doing.
-
This is joint work with Michael,
who's also here in the third row.
-
But before I dive more into this,
-
this wouldn't be
a Google presentation without an ad,
-
so you get that up front.
-
This is what I'll be talking about,
our project, the SLING project.
-
It is an open source project
and it's using Wikidata a lot.
-
You can go check it out on GitHub
when you get a chance
-
if you feel excited about it
after the presentation.
-
And really, what I wanted to talk about--
the title is admittedly a little bit long,
-
it's even shorter than it was
in the original program.
-
But what it comes down to,
what the project comes down to
-
is trying to answer
this one very exciting question.
-
If you want, in the beginning,
there were just two files,
-
some of you may recognize them,
-
they're essentially the dump files
from Wikidata and Wikipedia,
-
and the question we're trying
to figure out or answer is really,
-
can we dramatically improve
how good machines are
-
at understanding human language
just by using these files?
-
And of course, you're entitled to ask
-
whether that's an interesting
question to answer.
-
If you're a company whose [inaudible]
is to be able to take search queries
-
and try to answer them
in the best possible way,
-
obviously, understanding natural language
comes in as a very handy thing.
-
But even if you look at Wikidata,
-
in the previous data quality panel
earlier today,
-
there was a question that came up about
verification, or verifiability of facts.
-
So let's say you actually do
understand natural language.
-
If you have a fact and there's a source,
you could go to the source and analyze it,
-
and you can figure out whether
it actually confirms the fact
-
that claims
to have this as its source.
-
And if you could do it,
you could even go beyond that
-
and you could read articles
and annotate them, come up with facts,
-
and actually look for existing facts
that may need sources
-
and add these articles as sources.
-
Or, you know, in the wildest,
craziest of all possible worlds,
-
if you get really, really good at it
you could read articles
-
and maybe even annotate with new facts
that you could then suggest as facts
-
that you could potentially
add to Wikidata.
-
But there's a whole world of applications
of natural language understanding.
-
One of the things that's really hard when
you do natural language understanding--
-
these days, that also means
deep learning or machine learning,
-
and one of the things that's really hard
is getting enough training data.
-
And historically,
that's meant having a lot of text
-
that you need human annotators
to then first process
-
and then you can do training.
-
And part of the question here
is also really to say:
-
Can we use Wikidata and the way
in which it's interlinked with Wikipedia
-
for training data,
-
and will that be enough
to train that model?
-
So hopefully, we'll get closer
to answering this question
-
in the next 15 to 20 minutes.
-
We don't quite know the answer yet
but we have some exciting results
-
that are pointing
in the right direction, if you want.
-
Just to take a step back in terms of
the development we've seen,
-
machine learning and deep learning
have revolutionized a lot of areas
-
and this is just one example
of a particular image recognition task
-
that if you look at what happened
between 2010 and 2015,
-
in that five-year period,
we went from machines doing pretty poorly
-
to, in the end, actually performing
at the same level as humans
-
or in some cases even better
albeit for a very specific task.
-
So we've seen really a lot of things
improving dramatically.
-
And so you can ask
-
why don't we just throw deep learning
at natural language processing
-
and natural language understanding
and be done with it?
-
And the answer is, kind of,
we sort of have, to a certain extent,
-
but what it turns out is that
-
natural language understanding
is actually still a bit of a challenge
-
and one of the situations where
a lot of us interact with machines
-
that are trying to behave like
they understand what we're saying
-
is in these chat bots.
-
So this is not to pick
on anyone in particular
-
but just, I think, an experience
that a lot of us have had.
-
In this case, it's a user saying:
"I want to stay in this place."
-
The chat bot says: "OK, got it,
when will you be checking in and out?
-
For example, November 17th to 23rd."
-
And the user says:
"Well, I don't have any dates yet."
-
And then the response is:
-
"Sorry, there are no hotels available
for the dates you've requested.
-
Would you like to start a new search?"
-
So there's still some way to go
-
to get machines to really
understand human language.
-
But machine learning or deep learning
-
has been applied
already to this discipline.
-
Like, one of the examples is a recent...
-
a more successful example is BERT
-
where they're using transformers
to solve NLP or NLU tasks.
-
And it's dramatically improved
the performance but, as we've seen,
-
there is still some way to go.
-
One thing that's shared among
most of these approaches
-
is that you look at the text itself
-
and you depend on having a lot of it
so you can train your model on the text,
-
but everything is based
on just looking at the text
-
and understanding the text.
-
So the learning is really
just representation learning.
-
What we wanted to do is actually
understand and annotate the text
-
in terms of items
or entities in the real world.
-
And in general, if we take a step back,
-
why is natural language processing
or understanding so hard?
-
There are a number of reasons
why it's really hard, but at the core,
-
one of the important reasons
is that somehow,
-
the machine needs to have
knowledge of the world
-
in order to understand human language.
-
And if you think about that
for a little while:
-
What better place to look for knowledge
about the world than Wikidata?
-
So in essence, that's the approach.
-
And the question is can you leverage it,
-
can you use this wonderful knowledge
-
of the world that we already have
-
in a way that helps
to train and bootstrap your model?
-
So the alternative here is really
understanding the text
-
not just in terms of other texts
or how this text is similar to other texts
-
but in terms of the existing knowledge
that we have about the world.
-
And what makes me really excited
-
or at least makes me
have a good gut feeling about this
-
is that in some ways
-
it seems closer
to how we interact as humans.
-
So if we were having a conversation
-
and you were bringing up
the Bundeskanzler and Angela Merkel,
-
I would have an internal representation
of Q567 and it would light up.
-
And in our continued conversation,
-
mentioning other things
related to Angela Merkel,
-
I would have an easier time
associating with that
-
or figuring out
what you were actually talking about.
-
And so, in essence,
that's at the heart of this approach,
-
that we really believe
Wikidata is a key component
-
in unlocking this better understanding
of natural language.
-
And so how are we planning to do it?
-
Essentially, there are five steps
we're going through,
-
or have been going through.
-
I'll go over each
of the steps briefly in turn
-
but essentially, there are five steps.
-
First, we need to start
with the dump files that I showed you
-
to begin with--
-
understanding what's in them,
parsing them,
-
having an efficient
internal representation in memory
-
that allows us to do
quick processing on this.
-
And then, we're leveraging
some of the annotations
-
that are already in Wikipedia,
linking it to items in Wikidata.
-
I'll briefly show you what I mean by that.
-
We can use that to then
generate more advanced annotations
-
where we have much more text annotated.
-
But still, with annotations
being items or facts in Wikidata,
-
we can then train a model
based on the silver data
-
and get a reasonably good model
-
that will allow us to read
a Wikipedia document
-
and understand what the actual content is
in terms of Wikidata,
-
but only for facts that are
already in Wikidata.
-
And so that's where kind of
the hard part of this begins.
-
In order to go beyond that
we need to have a plausibility model,
-
so a model that can tell us,
-
given a lot of facts about an item
and an additional fact,
-
whether the additional fact is plausible.
-
If we can build that,
-
we can then use a more "hyper modern"
reinforcement learning aspect
-
of deep learning and machine learning
to fine-tune the model
-
and hopefully go beyond
what we've been able to do so far.
-
So real quick,
-
the first step is essentially
getting the dump files parsed,
-
understanding the contents, and linking up
Wikidata and Wikipedia information,
-
and then utilizing some of the annotations
that are already there.
-
And so this is essentially
what's happening.
-
Trust me, Michael built all of this,
it's working great.
-
But essentially, we're starting
with the two files you can see on the top,
-
the Wikidata dump and the Wikipedia dump.
-
The Wikidata dump gets processed
and we end up with a knowledge base,
-
a KB at the bottom.
-
That's essentially a store
we can hold in memory
-
that has essentially all of Wikidata in it
-
and we can quickly access
all the properties and facts and so on
-
and do analysis there.
-
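To make that concrete, here is a tiny plain-Python sketch of the idea; this is not SLING's actual store or API, and the items and properties shown are illustrative only.

```python
# Hypothetical stand-in for the in-memory knowledge base:
# QID -> {property -> list of values}, loaded once and queried in memory.
knowledge_base = {
    "Q567":   {"P31": ["Q5"], "P102": ["Q49762"]},   # Angela Merkel
    "Q49762": {"P31": ["Q7278"]},                    # her party
    "Q5":     {},                                    # human
}

def items_with_property(prop):
    """The kind of quick analysis an in-memory KB makes cheap."""
    return [qid for qid, props in knowledge_base.items() if prop in props]

print(items_with_property("P31"))   # ['Q567', 'Q49762']
```
-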
Similarly, for the documents,
-
they get processed
-
and we end up with processed documents.
-
We know all the mentions
-
and some of the things
that are already in the documents.
-
And then in the middle,
-
we have an important part
which is a phrase table
-
that allows us to basically
see for any phrase
-
what is the frequency distribution,
-
what's the most likely item
that we're referring to
-
when we're using this phrase.
-
So we're using that later on
to build the silver annotations.
-
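As a rough plain-Python sketch of what such a phrase table gives you (the phrases, items, and counts are made up; the real table is built from the link anchors in the Wikipedia dump):

```python
from collections import Counter

# Hypothetical phrase table: for each surface phrase, how often link
# anchors with that text pointed at each Wikidata item.
phrase_table = {
    "angela merkel": Counter({"Q567": 9412}),
    "cdu": Counter({"Q49762": 3120, "Q999999999": 87}),  # party vs. placeholder
}

def most_likely_item(phrase):
    """Return the item most often meant by this phrase, plus its probability."""
    counts = phrase_table.get(phrase.lower())
    if not counts:
        return None
    item, count = counts.most_common(1)[0]
    return item, count / sum(counts.values())

print(most_likely_item("CDU"))   # ('Q49762', 0.97...)
```
-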
So let's say we've run this
and then we also want to make sure
-
we utilize annotations
that are already there.
-
So an important part
of a Wikipedia article
-
is that it's not just plain text,
-
it's actually already
pre-annotated with a few things.
-
So a template is one example,
links are another example.
-
So if we take here the English article
for Angela Merkel,
-
there is one example of a link here
which is to her party.
-
If you look at the bottom,
-
that's a link to a specific
Wikipedia article,
-
and I guess for people here,
it's no surprise that, in essence,
-
that is then, if you look
at the associated Wikidata item,
-
that's essentially an annotation saying
-
this is the QID I am talking about
when I'm talking about this party,
-
the Christian Democratic Union.
-
So we're using this
to already have a good start
-
in terms of understanding what text means.
-
For all of these links,
-
we know exactly what the author
means by the phrase
-
in the cases where
there are links to QIDs.
-
We can use this and the phrase table
to then try and take a Wikipedia document
-
and fully annotate it with everything
we know about already from Wikidata.
-
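A heavily simplified sketch of that silver-annotation pass, again in plain Python with invented data: author links are kept as-is, and other spans are annotated from the phrase table only when one item clearly dominates.

```python
from collections import Counter

def silver_annotate(tokens, link_spans, phrase_table, threshold=0.9):
    """Annotate a tokenized document with QIDs (simplified sketch).

    link_spans: {(start, end): qid} for spans the author already linked.
    Other spans are looked up in the phrase table and annotated only when
    one item clearly dominates the frequency distribution.
    """
    annotations = dict(link_spans)            # author links are trusted as-is
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + 5, len(tokens)) + 1):
            span = (start, end)
            if span in annotations:
                continue
            phrase = " ".join(tokens[start:end]).lower()
            counts = phrase_table.get(phrase)
            if counts:
                qid, n = counts.most_common(1)[0]
                if n / sum(counts.values()) >= threshold:
                    annotations[span] = qid
    return annotations

tokens = ["Angela", "Merkel", "leads", "the", "CDU"]
links = {(0, 2): "Q567"}                      # the author linked her name
table = {"cdu": Counter({"Q49762": 3120, "Q999999999": 87})}
print(silver_annotate(tokens, links, table))  # adds (4, 5) -> 'Q49762'
```
-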
And we can use this to train
the first iteration of our model.
-
(coughs) Excuse me.
-
So this is exactly the same article,
-
but now, after we've annotated it
with silver annotations,
-
and essentially,
you can see all of the squares
-
are places where we've been able
to annotate with QIDs or with facts.
-
This is just a screenshot
of the viewer on the data,
-
so you can have access
to all of this information
-
and see what's come out
of the silver annotation.
-
And it's important to say that
there's no machine learning
-
or anything involved here.
-
All we've done is, sort of
mechanically, with a few tricks,
-
basically pushed information
we already have from Wikidata
-
onto the Wikipedia article.
-
And so here, if you hover over
"Chancellor of Germany"
-
that is itself referring
to a Wikidata item,
-
which has a number of properties
like "subclass of: Chancellor"
-
and "country: Germany",
each in turn referring to other items.
-
And here, it also has
the property "officeholder"
-
which happens to be
Angela Dorothea Merkel,
-
which is also mentioned in the text.
-
So there's really a full annotation
linking up the contents here.
-
But again, there is an important
and unfortunate point
-
about what we are able to
and not able to do here.
-
So what we are doing is pushing
information we already have in Wikidata,
-
so what we can't annotate here
are things that are not in Wikidata.
-
So for instance, here,
-
she was at some point appointed
Federal Minister for Women and Youth
-
and that alias or that phrase
is not in Wikidata,
-
so we're not able to make that annotation
here in our silver annotations.
-
That said, it's still... at least for me,
-
it was pretty surprising to see
how much you can actually annotate
-
and how much information is already there
-
when you combine Wikidata
with a Wikipedia article.
-
So what you can do is, once you have this,
you know, millions of documents,
-
you can train your parser
based on the annotations that are there.
-
And that's essentially a parser
that has a number of components.
-
Essentially, the text is coming in
at the bottom and, at the top,
-
we have a transition-based
frame semantic parser
-
that then generates the annotations
or these facts or references to the items.
-
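Very roughly, such a parser reads the text token by token and, at each step, a learned model picks an action that either advances through the tokens or builds up frames over them. The skeleton below only illustrates that loop; the action names loosely follow the published SLING description, the real system has more actions, and the classifier here is just a scripted stand-in.

```python
def parse(tokens, predict_action):
    """Run a toy transition loop; predict_action(state) -> (name, args)."""
    state = {"position": 0, "frames": [], "done": False}
    while not state["done"]:
        action, args = predict_action(state)
        if action == "SHIFT":                  # move to the next token
            state["position"] += 1
        elif action == "EVOKE":                # create a frame for a span
            state["frames"].append({"type": args["type"],
                                    "span": args["span"],
                                    "slots": []})
        elif action == "CONNECT":              # add a role between frames
            src, role, tgt = args["source"], args["role"], args["target"]
            state["frames"][src]["slots"].append((role, tgt))
        elif action == "STOP":                 # no more actions for this text
            state["done"] = True
    return state["frames"]

# A scripted "policy", just to exercise the loop:
script = iter([("EVOKE", {"type": "Q567", "span": (0, 2)}), ("STOP", {})])
print(parse(["Angela", "Merkel"], lambda state: next(script)))
```
-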
We built this and ran it
on more classical corpora
-
like [inaudible],
which are more classical NLP corpora,
-
but we want to be able to run this
on the full Wikipedia corpora.
-
So Michael has been rewriting this in C++
-
and we're able to really
scale up performance
-
of the parser trainer here.
-
So it will be exciting to see exactly
-
the results that are going
to come out of that.
-
So once that's in place,
-
we have a pretty good model
that's able to at least
-
predict facts that are
already known in Wikidata,
-
but ideally, we want to move beyond that,
-
and for that
we need this plausibility model
-
which in essence,
you can think of it as a black box
-
where you supply it with
all of the known facts you have
-
about a particular item
and then you provide an additional fact.
-
And by magic,
-
the black box tells you how plausible
the additional fact you're providing is,
-
and how likely it is
that this particular fact is true.
-
And...
-
I don't know if it's fair to say
that it was much to our surprise,
-
but at least, you can actually--
-
In order to train a model, you need,
-
like we've seen earlier,
you need a lot of training data
-
and essentially, you can
use Wikidata as training data.
-
You serve it basically
all the facts for a given item
-
and then you mask or hold out one fact
-
and then you provide that as a fact
that it's supposed to predict.
-
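A minimal sketch of how such training examples could be generated by that masking trick, with facts as plain (property, value) pairs and made-up items; the random negatives echo the point made later that the model learns to separate plausible facts from random ones.

```python
import random

def masked_examples(item_facts):
    """Turn one item's facts into plausibility training examples.

    For each fact, the remaining facts form the context, the held-out
    fact is a positive target, and a corrupted fact is a negative.
    """
    examples = []
    for i, held_out in enumerate(item_facts):
        context = item_facts[:i] + item_facts[i + 1:]
        examples.append((context, held_out, 1))               # plausible
        prop, _ = held_out
        corrupted = (prop, random.choice(["Q42", "Q60", "Q146"]))
        examples.append((context, corrupted, 0))              # random negative
    return examples

# Illustrative facts: instance of human, party membership, citizenship.
merkel = [("P31", "Q5"), ("P102", "Q49762"), ("P27", "Q183")]
for context, fact, label in masked_examples(merkel):
    print(fact, label)
```
-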
And just using this as training data,
-
you can get a really really good
plausibility model, actually,
-
to the extent that I was hoping one day
to maybe be able to even use it
-
for discovering what you could call
accidental vandalism in Wikidata
-
like a fact that's been added by accident
and really doesn't look like it's...
-
It doesn't fit with the normal topology
-
of facts or knowledge
in Wikidata, if you want.
-
But in this particular setup,
we need it for something else,
-
namely for doing reinforcement learning
-
so we can fine-tune the Wiki parser,
-
and basically using the plausibility model
as a reward function.
-
So when you do the training,
you try to parse a Wikipedia document
-
[inaudible] in Wikipedia
comes up with a fact
-
and we check the fact
against the plausibility model
-
and use that as feedback
or as a reward function
-
in training the model.
-
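Schematically, with all the real machinery stubbed out, that fine-tuning loop looks something like this; the callables are hypothetical stand-ins for the parser, the plausibility model, and the update rule.

```python
def fine_tune(documents, parse_document, plausibility, update_parser):
    """Schematic fine-tuning loop with plausibility scores as rewards.

    parse_document(doc) -> facts proposed by the current parser,
    plausibility(fact)  -> score in [0, 1] from the plausibility model,
    update_parser(...)  -> a policy-gradient-style parameter update.
    """
    for doc in documents:
        for fact in parse_document(doc):
            reward = plausibility(fact)        # how believable is this fact?
            update_parser(doc, fact, reward)   # reinforce plausible outputs
```
-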
And the big question here is then
can we learn to predict facts
-
that are not already in Wikidata.
-
And we hope and believe we can
but it's still not clear.
-
So this is essentially what we have been
and are planning to do.
-
There's been some
surprisingly good results
-
in terms of how far
you can get with silver annotations
-
and a plausibility model.
-
But in terms of
how far we are, if you want,
-
we sort of have
the infrastructure in place
-
to do the processing
and have everything efficiently in memory.
-
We have first instances
of silver annotations
-
and have a parser trainer in place
for the supervised learning
-
and an initial plausibility model.
-
But we're still pushing on those fronts
and very much looking forward
-
to see what comes out
of the very last bit.
-
And those were my words.
-
I'm very excited to see
what comes out of it
-
and it's been pure joy
to work with Wikidata.
-
It's been fun to see
-
how some of the things you come across
seemed wrong and then the next day,
-
you look, things are fixed
-
and it's really been amazing
to see the momentum there.
-
Like I said, the URL,
all the source code is on GitHub.
-
Our email addresses
were on the first slide,
-
so please do reach out
if you have questions or are interested
-
and I think we have time
for a couple questions now in case...
-
(applause)
-
Thanks.
-
(woman 1) Thank you for your presentation.
I do have a concern however.
-
The Wikipedia corpus
is known to be biased.
-
There's a very strong bias--
for example, fewer women, more men,
-
all sorts of other aspects in there.
-
So isn't this actually
also tainting the knowledge
-
that you are taking out of the Wikipedia?
-
Well, there are two aspects
of the question.
-
There's both in the model
that we are then training,
-
you could ask how... let's just...
-
If you make it really simple
and say like:
-
Does it mean that the model
will then be worse
-
at predicting facts
about women than men, say,
-
or some other set of groups?
-
To begin with,
if you just look at the raw data,
-
it will reflect whatever bias is
in the training data, so that's...
-
People work on this to try
and address that in the best possible way.
-
But normally,
when you're training a model,
-
it will reflect
whatever data you're training it on.
-
So that's something to account for
when doing the work, yeah.
-
(man 2) Hi, this is [Marco].
-
I am a natural language
processing practitioner.
-
I was curious about
how you model your facts.
-
So I heard you say frame semantics,
-
Right.
-
(Marco) Could you maybe
give some more details on that, please?
-
Yes, so it's frame semantics,
we're using frame semantics,
-
and basically,
-
all of the facts in Wikidata,
they're modeled as frames.
-
And so that's an essential part
of the set up
-
and how we make this work.
-
That's essentially
how we try to address the...
-
How can I make all the knowledge
that I have in Wikidata
-
available in a context where
I can annotate and train my model
-
when I am annotating or parsing text.
-
The answer is that the existing data
in Wikidata is modeled as frames.
-
So the store that we have,
-
the knowledge base with
all of the knowledge is a frame store,
-
and this is the same frame store
that we are building on top of
-
when we're then parsing the text.
-
(Marco) So you're converting
the Wikidata data model into some frame.
-
Yes, we are converting the Wikidata model
-
into one large frame store
if you want, yeah.
-
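As a rough plain-Python picture of that conversion (not SLING's actual frame store, and with illustrative data): each item becomes a frame, a bundle of role-value slots, and slot values point at other frames in the same store.

```python
# Hypothetical frame store: every item is a frame (a set of slots), and
# slot values refer to other frames by id, so the whole knowledge base
# becomes one big interlinked store.
frame_store = {
    "Q567":   {"id": "Q567", "name": "Angela Merkel",
               "P31": "Q5", "P102": "Q49762"},
    "Q5":     {"id": "Q5", "name": "human"},
    "Q49762": {"id": "Q49762", "name": "Christian Democratic Union"},
}

def resolve(frame_id, role):
    """Follow a slot from one frame to the frame it points at."""
    target = frame_store[frame_id].get(role)
    return frame_store.get(target)

print(resolve("Q567", "P102")["name"])   # Christian Democratic Union
```
-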
(man 3) Thanks. Is Pluto a planet?
-
(audience laughing)
-
Can I get the question...
-
(man 3) I like the bootstrapping thing
that you are doing,
-
I mean the way
that you're training your model
-
by picking out the known facts
about things that are verified,
-
and then training
the plausibility prediction
-
by trying to teach
the architecture of the system
-
to recognize that actually,
that fact fits.
-
So that will work for large classes,
but it will really...
-
It doesn't sound like it will learn
about surprises
-
and especially not
in small classes of items, right.
-
So if you train your model in...
-
When did Pluto disappear, I forgot...
-
As a planet, you mean.
-
(man 3) Yeah, it used to be
a member of the solar system
-
and we had how many,
nine observations there.
-
- Yeah.
- (man 3) It's slightly problematic.
-
So everyone, the kids think
that Pluto is not a planet,
-
I still think it's a planet,
but never mind.
-
So the fact that it suddenly
stopped being a planet,
-
which was supported in the period before,
I don't know, hundreds of years, right?
-
That's crazy, how would you go
about figuring out that thing?
-
For example, the new claim
is not plausible for that thing.
-
Sure. So there are two things.
-
So there's both, like, how precise
the plausibility model is.
-
So what it distinguishes between
is random facts
-
and facts that are plausible.
-
And there's also the question
of whether Pluto is a planet
-
and that's back to whether...
-
I was in another session
-
where someone brought up the example
of the earth being flat,
-
- whether that is a fact or not.
- (man 3) That makes sense.
-
So it is a fact in a sense
that you can put it in,
-
I guess you could put it in Wikidata
-
with sources that are claiming
that that's the thing.
-
So again, you would not necessarily
want to train the model in a way
-
where if you read someone saying
the planet Pluto, bla, bla, bla,
-
then it should be fine for it
-
to then say that
an annotation for this text
-
is that Pluto is a planet.
-
That doesn't mean, you know...
-
The model won't be able to tell
what, in the end, is the truth,
-
I don't think any of us here
will be able to either, so...
-
(man 3) I just want to say
-
it's not a hard accusation
against the approach
-
because even people
cannot be sure whether a new fact
-
is plausible at that moment.
-
But that's always...
-
I just maybe reiterated a question
that I am posing all the time
-
to myself and my work; I always ask.
-
We do the statistical learning thing,
it's amazing nowadays
-
we can do billions of things,
but we cannot learn about surprises,
-
and they are
very, very important in fact, right?
-
- (man 4) But, just to refute...
- (man 3) Thank you.
-
(man 4) The plausibility model
is combined with kind of two extra rules.
-
First of all,
if it's in Wikidata, it's true.
-
We just give you the benefit of the doubt,
so please make it good.
-
The second thing is if it's not
allowed by the schema it's false;
-
it's all the things in between
we're looking at.
-
So if it's a planet according to Wikidata,
it will be a true fact.
-
But it won't predict surprises
but what is important here
-
is that there's kind of
no manual human work involved,
-
so there's nothing
that prevents you from...
-
Well, now, if we're successful
with the approach,
-
there's nothing that prevents you
from continuously updating the model
-
with changes happening
in Wikidata and Wikipedia and so on.
-
So in theory, you should be able
to quickly learn new surprises.
-
(moderator) One last question.
-
- (man 4) Maybe we're biased by Wikidata.
- Yeah.
-
(man 4) You are our bias.
Whatever you annotate is what we believe.
-
So if you make it good,
if you make it balanced,
-
we can hopefully be balanced.
-
With the gender thing,
there's actually an interesting thing.
-
We are actually getting
more training facts
-
about women than men
-
because "she" is a much less
ambiguous pronoun in the text,
-
so we actually get a lot more
true facts about women.
-
So we are biased, but on the women's side.
-
(woman 2) No, I want to see
the data on that.
-
(audience laughing)
-
We should bring that along next time.
-
(man 4) You get a hard decision [inaudible].
-
(man 3) Yes, hard decision.
-
(man 5) It says SLING is...
parser across many languages
-
- and you showed us English.
- Yes!
-
(man 5) Can you say something about
the number of languages that you are--
-
Yes! Thank you for asking.
-
I had told myself to say that
up front on the first page
-
because otherwise,
I would forget, and I did.
-
So right now,
-
we're not actually looking at two files,
we're looking at 13 files.
-
So Wikipedia dumps
from 12 different languages
-
that we're processing,
-
and none of this is dependent
on the language being English.
-
So we're processing this
for all of the 12 languages.
-
Yeah.
-
For now,
-
they share the property of, I think,
using the Latin alphabet, and so on.
-
Mostly for us to be able to make sure
-
that what we are doing
still makes sense and works.
-
But there's nothing
fundamental about the approach
-
that prevents it from being used
in very different languages
-
from those being spoken around this area.
-
(woman 3) Leila from Wikimedia Foundation.
-
I may have missed this
when you presented this.
-
Do you make an attempt to bring
any references from Wikipedia articles
-
back to the property and statements
you're making in Wikidata?
-
So I briefly mentioned this
as a potential application.
-
So for now, what we're trying to do
is just to get this to work,
-
but let's say we did get it to work
with a high level of quality,
-
that would be an obvious thing
to try to do, so when you...
-
Let's say you were willing to...
-
I know there's some controversy around
using Wikipedia as a source for Wikidata,
-
that you can't have
circular references and so on,
-
so you need to have
properly sourced facts.
-
So let's say you were
coming up with new facts,
-
and obviously, you could look
at the coverage of news media and so on
-
and process these
and try to annotate these.
-
And then, that way,
find sources for facts,
-
new facts that you come up with.
-
Or you could even take existing...
-
There are a lot of facts in Wikidata
that either have no sources
-
or only have Wikipedia as a source,
so you can start processing these
-
and try to find sources
for those automatically.
-
(Leila) Or even within the articles
that you're taking this information from
-
just using the sources from there
because they may contain...
-
- Yeah. Yeah.
- Yeah. Thanks.
-
- (moderator) Thanks Anders.
- Cool. Thanks.
-
(applause)