cdn.media.ccc.de/.../wikidatacon2019-1120-eng-Wikidata_knowledge_base_completion_using_multilingual_Wikipedia_fact_extraction_hd.mp4

  • 0:06 - 0:08
    (moderator) The next talk is
    by Anders Sandholm
  • 0:08 - 0:12
    on Wikidata fact annotation
    for Wikipedia across languages.
  • 0:12 - 0:14
    - Thank you.
    - Thanks.
  • 0:22 - 0:24
    I wanted to start with a small confession.
  • 0:26 - 0:32
    Wow! I'm blown away
    by the momentum of Wikidata
  • 0:34 - 0:36
    and the engagement of the community.
  • 0:37 - 0:39
    I am really excited about being here
  • 0:39 - 0:42
    and getting a chance to talk
    about work that we've been doing.
  • 0:43 - 0:47
    This is joint work with Michael,
    who's also here in the third row.
  • 0:50 - 0:52
    But before I dive more into this,
  • 0:52 - 0:56
    this wouldn't be
    a Google presentation without an ad,
  • 0:56 - 0:58
    so you get that up front.
  • 0:58 - 1:01
    This is what I'll be talking about,
    our project, the SLING project.
  • 1:02 - 1:07
    It is an open source project
    and it's using Wikidata a lot.
  • 1:08 - 1:12
    You can go check it out on GitHub
    when you get a chance
  • 1:12 - 1:16
    if you feel excited about it
    after the presentation.
  • 1:18 - 1:23
    And really, what I wanted to talk about--
    the title is admittedly a little bit long,
  • 1:23 - 1:26
    it's even shorter than it was
    in the original program.
  • 1:26 - 1:30
    But what it comes down to,
    what the project comes down to
  • 1:30 - 1:34
    is trying to answer
    this one very exciting question.
  • 1:35 - 1:38
    If you want, in the beginning,
    there were just two files,
  • 1:40 - 1:41
    some of you may recognize them,
  • 1:42 - 1:46
    they're essentially the dump files
    from Wikidata and Wikipedia,
  • 1:47 - 1:50
    and the question we're trying
    to figure out or answer is really,
  • 1:52 - 1:54
    can we dramatically improve
    how good machines are
  • 1:54 - 1:58
    at understanding human language
    just by using these files?
  • 2:01 - 2:04
    And of course, you're entitled to ask
  • 2:04 - 2:06
    whether that's an interesting
    question to answer.
  • 2:07 - 2:14
    If you're a company that [inaudible]
    is to be able to take search queries
  • 2:14 - 2:18
    and try to answer them
    in the best possible way,
  • 2:18 - 2:24
    obviously, understanding natural language
    comes in as a very handy thing.
  • 2:25 - 2:28
    But even if you look at Wikidata,
  • 2:29 - 2:34
    in the previous data quality panel
    earlier today,
  • 2:34 - 2:39
    there was a question that came up about
    verification, or verifiability of facts.
  • 2:39 - 2:43
    So let's say you actually do
    understand natural language.
  • 2:43 - 2:47
    If you have a fact and there's a source,
    you could go to the source and analyze it,
  • 2:47 - 2:50
    and you can figure out whether
    it actually confirms the fact
  • 2:50 - 2:52
    that claims it as a source.
  • 2:53 - 2:56
    And if you could do it,
    you could even go beyond that
  • 2:56 - 3:00
    and you could read articles
    and annotate them, come up with facts,
  • 3:00 - 3:03
    and actually look for existing facts
    that may need sources
  • 3:03 - 3:06
    and add these articles as sources.
  • 3:07 - 3:11
    Or, you know, in the wildest,
    craziest possible of all worlds,
  • 3:11 - 3:14
    if you get really, really good at it
    you could read articles
  • 3:14 - 3:18
    and maybe even annotate with new facts
    that you could then suggest as facts
  • 3:18 - 3:20
    that you could potentially
    add to Wikidata.
  • 3:21 - 3:27
    But there's a whole world of applications
    of natural language understanding.
  • 3:29 - 3:32
    One of the things that's really hard when
    you do natural language understanding--
  • 3:32 - 3:36
    these days, that also means
    deep learning or machine learning,
  • 3:36 - 3:40
    and one of the things that's really hard
    is getting enough training data.
  • 3:40 - 3:43
    And historically,
    that's meant having a lot of text
  • 3:43 - 3:45
    that you need human annotators
    to then first process
  • 3:45 - 3:47
    and then you can do training.
  • 3:47 - 3:51
    And part of the question here
    is also really to say:
  • 3:51 - 3:56
    Can we use Wikidata and the way
    in which it's interlinked with Wikipedia
  • 3:57 - 3:58
    for training data,
  • 3:58 - 4:01
    and will that be enough
    to train that model?
  • 4:03 - 4:07
    So hopefully, we'll get closer
    to answering this question
  • 4:07 - 4:09
    in the next 15 to 20 minutes.
  • 4:10 - 4:14
    We don't quite know the answer yet
    but we have some exciting results
  • 4:14 - 4:17
    that are pointing
    in the right direction, if you want.
  • 4:19 - 4:24
    Just take a step back in terms of
    the development we've seen,
  • 4:24 - 4:28
    machine learning and deep learning
    has revolutionized a lot of areas
  • 4:28 - 4:32
    and this is just one example
    of a particular image recognition task
  • 4:32 - 4:37
    that if you look at what happened
    between 2010 and 2015,
  • 4:37 - 4:41
    in that five-year period,
    we went from machines doing pretty poorly
  • 4:41 - 4:45
    to, in the end, actually performing
    at the same level as humans
  • 4:45 - 4:49
    or in some cases even better
    albeit for a very specific task.
  • 4:50 - 4:56
    So we've seen really a lot of things
    improving dramatically.
  • 4:56 - 4:58
    And so you can ask
  • 4:58 - 5:02
    why don't we just throw deep learning
    at natural language processing
  • 5:02 - 5:05
    and natural language understanding
    and be done with it?
  • 5:05 - 5:12
    And the answer is, kind of,
    we've sort of done that to a certain extent,
  • 5:12 - 5:14
    but what it turns out is that
  • 5:15 - 5:18
    natural language understanding
    is actually still a bit of a challenge
  • 5:18 - 5:23
    and one of the situations where
    a lot of us interact with machines
  • 5:23 - 5:26
    that are trying to behave like
    they understand what we're saying
  • 5:26 - 5:27
    is in these chat bots.
  • 5:27 - 5:29
    So this is not to pick
    on anyone in particular
  • 5:29 - 5:32
    but just, I think, an experience
    that a lot of us have had.
  • 5:32 - 5:37
    In this case, it's a user saying
    I want to stay in this place.
  • 5:37 - 5:42
    The chat bot says: "OK, got it,
    when will you be checking in and out?
  • 5:42 - 5:44
    For example, November 17th to 23rd."
  • 5:44 - 5:47
    And the user says:
    "Well, I don't have any dates yet."
  • 5:47 - 5:48
    And then the response is:
  • 5:48 - 5:51
    "Sorry, there are no hotels available
    for the dates you've requested.
  • 5:51 - 5:53
    Would you like to start a new search?"
  • 5:53 - 5:55
    So there's still some way to go
  • 5:56 - 5:59
    to get machines to really
    understand human language.
  • 6:00 - 6:04
    But machine learning or deep learning
  • 6:04 - 6:07
    has been applied
    already to this discipline.
  • 6:07 - 6:10
    Like, one of the examples is a recent...
  • 6:10 - 6:11
    a more successful example is BERT
  • 6:11 - 6:17
    where they're using transformers
    to solve NLP or NLU tasks.
  • 6:19 - 6:22
    And it's dramatically improved
    the performance but, as we've seen,
  • 6:22 - 6:24
    there is still some way to go.
  • 6:25 - 6:28
    One thing that's shared among
    most of these approaches
  • 6:28 - 6:32
    is that you look at the text itself
  • 6:32 - 6:37
    and you depend on having a lot of it
    so you can train your model on the text,
  • 6:37 - 6:40
    but everything is based
    on just looking at the text
  • 6:40 - 6:42
    and understanding the text.
  • 6:42 - 6:46
    So the learning is really
    just representation learning.
  • 6:46 - 6:51
    What we wanted to do is actually
    understand and annotate the text
  • 6:51 - 6:54
    in terms of items
    or entities in the real world.
  • 6:56 - 7:00
    And in general, if we take a step back,
  • 7:00 - 7:03
    why is natural language processing
    or understanding so hard?
  • 7:03 - 7:08
    There are a number of reasons
    why it's really hard, but at the core,
  • 7:08 - 7:11
    one of the important reasons
    is that somehow,
  • 7:11 - 7:13
    the machine needs to have
    knowledge of the world
  • 7:13 - 7:17
    in order to understand human language.
  • 7:20 - 7:22
    And if you think about that
    for a little while.
  • 7:23 - 7:27
    What better place to look for knowledge
    about the world than Wikidata?
  • 7:27 - 7:30
    So in essence, that's the approach.
  • 7:30 - 7:32
    And the question is can you leverage it,
  • 7:32 - 7:39
    can you use this wonderful knowledge
  • 7:39 - 7:41
    of the world that we already have
  • 7:41 - 7:46
    in a way that helps you
    train and bootstrap your model.
  • 7:47 - 7:51
    So the alternative here is really
    understanding the text
  • 7:51 - 7:55
    not just in terms of other texts
    or how this text is similar to other texts
  • 7:55 - 7:59
    but in terms of the existing knowledge
    that we have about the world.
  • 8:01 - 8:03
    And what makes me really excited
  • 8:03 - 8:06
    or at least makes me
    have a good gut feeling about this
  • 8:06 - 8:07
    is that in some ways
  • 8:07 - 8:11
    it seems closer
    to how we interact as humans.
  • 8:11 - 8:14
    So if we were having a conversation
  • 8:14 - 8:18
    and you were bringing up
    the Bundeskanzler and Angela Merkel,
  • 8:19 - 8:23
    I would have an internal representation
    of Q567 and it would light up.
  • 8:23 - 8:26
    And in our continued conversation,
  • 8:26 - 8:30
    mentioning other things
    related to Angela Merkel,
  • 8:30 - 8:32
    I would have an easier time
    associating with that
  • 8:32 - 8:34
    or figuring out
    what you were actually talking about.
  • 8:35 - 8:39
    And so, in essence,
    that's at the heart of this approach,
  • 8:39 - 8:42
    that we really believe
    Wikidata is a key component
  • 8:42 - 8:46
    in unlocking this better understanding
    of natural language.
  • 8:50 - 8:51
    And so how are we planning to do it?
  • 8:53 - 8:57
    Essentially, there are five steps
    we're going through,
  • 8:57 - 8:58
    or have been going through.
  • 8:59 - 9:03
    I'll go over each
    of the steps briefly in turn
  • 9:03 - 9:04
    but essentially, there are five steps.
  • 9:04 - 9:07
    First, we need to start
    with the dump files that I showed you
  • 9:07 - 9:08
    to begin with--
  • 9:09 - 9:11
    understanding what's in them,
    parsing them,
  • 9:11 - 9:13
    having an efficient
    internal representation in memory
  • 9:13 - 9:16
    that allows us to do
    quick processing on this.
  • 9:16 - 9:19
    And then, we're leveraging
    some of the annotations
  • 9:19 - 9:23
    that are already in Wikipedia,
    linking it to items in Wikidata.
  • 9:23 - 9:25
    I'll briefly show you what I mean by that.
  • 9:25 - 9:31
    We can use that to then
    generate more advanced annotations
  • 9:32 - 9:35
    where we have much more text annotated.
  • 9:35 - 9:40
    But still, with annotations
    being items or facts in Wikidata,
  • 9:40 - 9:44
    we can then train a model
    based on the silver data
  • 9:44 - 9:46
    and get a reasonably good model
  • 9:46 - 9:49
    that will allow us to read
    a Wikipedia document
  • 9:49 - 9:53
    and understand what the actual content is
    in terms of Wikidata,
  • 9:55 - 9:58
    but only for facts that are
    already in Wikidata.
  • 9:59 - 10:02
    And so that's where kind of
    the hard part of this begins.
  • 10:02 - 10:06
    In order to go beyond that
    we need to have a plausibility model,
  • 10:06 - 10:08
    so a model that can tell us,
  • 10:08 - 10:11
    given a lot of facts about an item
    and an additional fact,
  • 10:11 - 10:13
    whether the additional fact is plausible.
  • 10:13 - 10:14
    If we can build that,
  • 10:15 - 10:22
    we can then use a more "hyper modern"
    reinforcement learning aspect
  • 10:22 - 10:26
    of deep learning and machine learning
    to fine-tune the model
  • 10:26 - 10:30
    and hopefully go beyond
    what we've been able to do so far.
  • 10:32 - 10:33
    So real quick,
  • 10:33 - 10:37
    the first step is essentially
    getting the dump files parsed,
  • 10:37 - 10:41
    understanding the contents, and linking up
    Wikidata and Wikipedia information,
  • 10:41 - 10:44
    and then utilizing some of the annotations
    that are already there.
  • 10:46 - 10:49
    And so this is essentially
    what's happening.
  • 10:49 - 10:52
    Trust me, Michael built all of this,
    it's working great.
  • 10:53 - 10:56
    But essentially, we're starting
    with the two files you can see on the top,
  • 10:56 - 10:58
    the Wikidata dump and the Wikipedia dump.
  • 10:58 - 11:02
    The Wikidata dump gets processed
    and we end up with a knowledge base,
  • 11:02 - 11:04
    a KB at the bottom.
  • 11:04 - 11:07
    That's essentially a store
    we can hold in memory
  • 11:07 - 11:10
    that has essentially all of Wikidata in it
  • 11:10 - 11:14
    and we can quickly access
    all the properties and facts and so on
  • 11:14 - 11:15
    and do analysis there.
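
A minimal sketch of such an in-memory knowledge base, assuming the standard Wikidata JSON dump (one item per line) and a plain Python dict keyed by QID; this is illustrative only, not the actual SLING store:

    import json

    def load_kb(dump_path):
        """Read a Wikidata JSON dump (one item per line) into a dict keyed by QID."""
        kb = {}
        with open(dump_path, encoding="utf-8") as dump:
            for line in dump:
                line = line.strip().rstrip(",")
                if not line or line in ("[", "]"):
                    continue
                item = json.loads(line)
                kb[item["id"]] = item            # e.g. kb["Q567"] -> Angela Merkel
        return kb

    # kb = load_kb("wikidata-all.json")
    # kb["Q567"]["claims"].keys()                # fast access to an item's properties
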
  • 11:15 - 11:16
    Similarly, for the documents,
  • 11:16 - 11:18
    they get processed
    and we end up with documents
  • 11:19 - 11:22
    that have been processed.
  • 11:22 - 11:24
    We know all the mentions
  • 11:24 - 11:27
    and some of the things
    that are already in the documents.
  • 11:27 - 11:28
    And then in the middle,
  • 11:28 - 11:30
    we have an important part
    which is a phrase table
  • 11:30 - 11:33
    that allows us to basically
    see for any phrase
  • 11:34 - 11:36
    what is the frequency distribution,
  • 11:36 - 11:39
    what's the most likely item
    that we're referring to
  • 11:39 - 11:41
    when we're using this phrase.
  • 11:41 - 11:44
    So we're using that later on
    to build the silver annotations.
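
A minimal sketch of the kind of phrase table just described, counting how often each anchor phrase links to each item so the most likely QID can be looked up later; the structure and names are illustrative, not SLING's actual format:

    from collections import Counter, defaultdict

    phrase_table = defaultdict(Counter)    # phrase -> frequency distribution over QIDs

    def add_anchor(phrase, qid):
        """Record one Wikipedia anchor link: phrase -> article -> Wikidata item."""
        phrase_table[phrase.lower()][qid] += 1

    def most_likely_item(phrase):
        """Return the item most often meant by this phrase, or None if unseen."""
        dist = phrase_table.get(phrase.lower())
        return dist.most_common(1)[0][0] if dist else None

    # add_anchor("Angela Merkel", "Q567")
    # most_likely_item("angela merkel")          # -> "Q567"
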
  • 11:44 - 11:48
    So let's say we've run this
    and then we also want to make sure
  • 11:48 - 11:52
    we utilize annotations
    that are already there.
  • 11:52 - 11:54
    So an important part
    of a Wikipedia article
  • 11:54 - 11:58
    is that it's not just plain text,
  • 11:58 - 12:01
    it's actually already
    pre-annotated with a few things.
  • 12:01 - 12:04
    So a template is one example,
    links are another example.
  • 12:04 - 12:08
    So if we take here the English article
    for Angela Merkel,
  • 12:09 - 12:12
    there is one example of a link here
    which is to her party.
  • 12:12 - 12:14
    If you look at the bottom,
  • 12:14 - 12:16
    that's a link to a specific
    Wikipedia article,
  • 12:16 - 12:20
    and I guess for people here,
    it's no surprise that, in essence,
  • 12:20 - 12:23
    that is then, if you look
    at the associated Wikidata item,
  • 12:23 - 12:26
    that's essentially an annotation saying
  • 12:26 - 12:31
    this is the QID I am talking about
    when I'm talking about this party,
  • 12:31 - 12:33
    the Christian Democratic Union.
  • 12:34 - 12:37
    So we're using this
    to already have a good start
  • 12:37 - 12:39
    in terms of understanding what text means.
  • 12:39 - 12:40
    All of these links,
  • 12:40 - 12:44
    we know exactly what the author
    means with the phrase
  • 12:45 - 12:47
    in the cases where
    there are links to QIDs.
  • 12:48 - 12:53
    We can use this and the phrase table
    to then try and take a Wikipedia document
  • 12:53 - 12:59
    and fully annotate it with everything
    we know about already from Wikidata.
  • 13:00 - 13:03
    And we can use this to train
    the first iteration of our model.
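
A rough sketch of how the existing anchor links and the phrase table could be combined into silver annotations; the helper names and the n-gram candidate generation are hypothetical simplifications of the real pipeline:

    def candidate_spans(tokens, max_len=4):
        """Hypothetical candidate mentions: all n-grams up to max_len tokens."""
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + 1 + max_len, len(tokens) + 1)):
                yield (start, end)

    def silver_annotate(tokens, anchors, phrase_table):
        """tokens: list of words; anchors: {(start, end): qid} from explicit wiki links."""
        annotations = dict(anchors)               # author-provided links are kept as-is
        for span in candidate_spans(tokens):
            if span in annotations:
                continue
            phrase = " ".join(tokens[span[0]:span[1]]).lower()
            dist = phrase_table.get(phrase)
            if dist:                               # resolve to the most frequent item
                annotations[span] = dist.most_common(1)[0][0]
        return annotations
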
  • 13:04 - 13:05
    (coughs) Excuse me.
  • 13:05 - 13:08
    So this is exactly the same article,
  • 13:08 - 13:14
    but now, after we've annotated it
    with silver annotations,
  • 13:15 - 13:18
    and essentially,
    you can see all of the squares
  • 13:18 - 13:25
    are places where we've been able
    to annotate with QIDs or with facts.
  • 13:26 - 13:31
    This is just a screenshot
    of the viewer on the data,
  • 13:31 - 13:34
    so you can have access
    to all of this information
  • 13:34 - 13:38
    and see what's come out
    of the silver annotation.
  • 13:38 - 13:41
    And it's important to say that
    there's no machine learning
  • 13:41 - 13:43
    or anything involved here.
  • 13:43 - 13:46
    All we've done is, sort of
    mechanically, with a few tricks,
  • 13:47 - 13:50
    basically pushed information
    we already have from Wikidata
  • 13:50 - 13:53
    onto the Wikipedia article.
  • 13:53 - 13:56
    And so here, if you hover over
    "Chancellor of Germany"
  • 13:56 - 14:02
    that is itself referring
    to a Wikidata item, which
  • 14:02 - 14:05
    has a number of properties
    like "subclass of: Chancellor",
  • 14:05 - 14:09
    "country: Germany",
    which again refer to other items.
  • 14:09 - 14:12
    And here, it also has
    the property "officeholder"
  • 14:12 - 14:15
    which happens to be
    Angela Dorothea Merkel,
  • 14:15 - 14:17
    which is also mentioned in the text.
  • 14:17 - 14:22
    So there's really a full annotation
    linking up the contents here.
  • 14:25 - 14:27
    But again, there is an important
    and unfortunate point
  • 14:27 - 14:32
    about what we are able to
    and not able to do here.
  • 14:32 - 14:35
    So what we are doing is pushing
    information we already have in Wikidata,
  • 14:35 - 14:40
    so what we can't annotate here
    are things that are not in Wikidata.
  • 14:40 - 14:42
    So for instance, here,
  • 14:42 - 14:45
    she was at some point appointed
    Federal Minister for Women and Youth
  • 14:45 - 14:49
    and that alias or that phrase
    is not in Wikidata,
  • 14:49 - 14:54
    so we're not able to make that annotation
    here in our silver annotations.
  • 14:56 - 15:00
    That said, it's still... at least for me,
  • 15:00 - 15:03
    it was pretty surprising to see
    how much you can actually annotate
  • 15:03 - 15:04
    and how much information is already there
  • 15:04 - 15:09
    when you combine Wikidata
    with a Wikipedia article.
  • 15:09 - 15:15
    So what you can do is, once you have this,
    you know, millions of documents,
  • 15:16 - 15:20
    you can train your parser
    based on the annotations that are there.
  • 15:21 - 15:27
    And that's essentially a parser
    that has a number of components.
  • 15:27 - 15:30
    Essentially, the text is coming in
    at the bottom and at the top,
  • 15:30 - 15:34
    we have a transition-based
    frame semantic parser
  • 15:34 - 15:39
    that then generates the annotations
    or these facts or references to the items.
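
For readers unfamiliar with the term, a transition-based parser reads the text once and, at every step, lets a learned policy pick an action such as SHIFT (advance one token) or EVOKE (create a frame for the current span). The loop below is a generic schematic of that technique, not SLING's actual parser:

    from collections import namedtuple

    Action = namedtuple("Action", "name length qid", defaults=(1, None))

    def annotate(tokens, policy):
        """Run a generic transition loop; `policy` is the learned action classifier."""
        frames, position = [], 0
        while position < len(tokens):
            action = policy(tokens, position, frames)
            if action.name == "SHIFT":             # advance one token
                position += 1
            elif action.name == "EVOKE":           # evoke a frame for the next span
                frames.append({"span": (position, position + action.length),
                               "item": action.qid})
                position += action.length
            elif action.name == "STOP":
                break
        return frames
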
  • 15:41 - 15:45
    We built this and ran it
    on more classical corpora
  • 15:45 - 15:50
    like [inaudible],
    which are more classical NLP corpora,
  • 15:50 - 15:54
    but we want to be able to run this
    on the full Wikipedia corpora.
  • 15:54 - 15:57
    So Michael has been rewriting this in C++
  • 15:57 - 16:00
    and we're able to really
    scale up performance
  • 16:00 - 16:01
    of the parser trainer here.
  • 16:01 - 16:04
    So it will be exciting to see exactly
  • 16:04 - 16:06
    the results that are going
    to come out of that.
  • 16:09 - 16:10
    So once that's in place,
  • 16:10 - 16:13
    we have a pretty good model
    that's able to at least
  • 16:13 - 16:16
    predict facts that are
    already known in Wikidata,
  • 16:16 - 16:19
    but ideally, we want to move beyond that,
  • 16:19 - 16:21
    and for that
    we need this plausibility model
  • 16:21 - 16:24
    which in essence,
    you can think of it as a black box
  • 16:24 - 16:27
    where you supply it with
    all of the known facts you have
  • 16:27 - 16:31
    about a particular item
    and then you provide an additional fact.
  • 16:31 - 16:32
    And by magic,
  • 16:32 - 16:37
    the black box tells you how plausible is
    the additional fact that you're providing
  • 16:37 - 16:40
    and how plausible it is
    that this is an actual fact for the item.
  • 16:43 - 16:44
    And...
  • 16:46 - 16:49
    I don't know if it's fair to say
    that it was much to our surprise,
  • 16:49 - 16:51
    but at least, you can actually--
  • 16:51 - 16:53
    In order to train a model, you need,
  • 16:53 - 16:55
    like we've seen earlier,
    you need a lot of training data
  • 16:55 - 16:58
    and essentially, you can
    use Wikidata as training data.
  • 16:58 - 17:02
    You serve it basically
    all the facts for a given item
  • 17:02 - 17:05
    and then you mask or hold out one fact
  • 17:05 - 17:09
    and then you provide that as a fact
    that it's supposed to predict.
  • 17:09 - 17:11
    And just using this as training data,
  • 17:11 - 17:16
    you can get a really really good
    plausibility model, actually,
  • 17:19 - 17:22
    to the extent that I was hoping one day
    to maybe be able to even use it
  • 17:22 - 17:28
    for discovering what you could call
    accidental vandalism in Wikidata
  • 17:28 - 17:33
    like a fact that's been added by accident
    and really doesn't look like it's...
  • 17:33 - 17:35
    It doesn't fit with the normal topology
  • 17:35 - 17:39
    of facts or knowledge
    in Wikidata, if you want.
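
A minimal sketch of the masking idea described above: for every item, hold out one fact as a positive example and pair the remaining context with a corrupted fact as a negative example. Facts are assumed here to be simple (property, value) pairs, which is a simplification:

    import random

    def plausibility_examples(kb):
        """Yield (context_facts, candidate_fact, label) triples from Wikidata itself."""
        for qid, facts in kb.items():              # facts: list of (property, value) pairs
            if len(facts) < 2:
                continue
            held_out = random.choice(facts)
            context = [f for f in facts if f != held_out]
            yield context, held_out, 1              # positive: a real fact of this item
            prop, _ = held_out
            corrupted = (prop, random.choice(list(kb)))   # random item as the value
            yield context, corrupted, 0              # negative: an implausible fact
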
  • 17:41 - 17:44
    But in this particular setup,
    we need it for something else,
  • 17:44 - 17:47
    namely for doing reinforcement learning
  • 17:48 - 17:51
    so we can fine-tune the Wiki parser,
  • 17:51 - 17:54
    and basically using the plausibility model
    as a reward function.
  • 17:54 - 18:00
    So when you do the training,
    you try to parse a Wikipedia document
  • 18:00 - 18:02
    [inaudible] in Wikipedia
    comes up with a fact
  • 18:02 - 18:04
    and we check the fact
    on the plausibility model
  • 18:04 - 18:08
    and use that as feedback
    or as a reward function
  • 18:08 - 18:10
    in training the model.
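
A very rough sketch of that feedback loop: the parser proposes facts for a document and the plausibility model's score is used as the reward, REINFORCE-style. All object names and methods here are hypothetical placeholders; real training needs baselines, batching, and so on:

    def fine_tune(parser, plausibility, documents, learning_rate=1e-4):
        """Fine-tune a parser with the plausibility model as the reward function."""
        for doc in documents:
            proposed = parser.sample_annotations(doc)             # stochastic parse
            rewards = [plausibility.score(doc.item, fact) for fact in proposed]
            # Reinforce facts the plausibility model finds likely, discourage the rest.
            parser.update(doc, proposed, rewards, learning_rate=learning_rate)
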
  • 18:10 - 18:13
    And the big question here is then
    can we learn to predict facts
  • 18:13 - 18:15
    that are not already in Wikidata.
  • 18:16 - 18:22
    And we hope and believe we can
    but it's still not clear.
  • 18:23 - 18:28
    So this is essentially what we have been
    and are planning to do.
  • 18:28 - 18:31
    There's been some
    surprisingly good results
  • 18:31 - 18:34
    in terms of how far
    you can get with silver annotations
  • 18:34 - 18:36
    and a plausibility model.
  • 18:36 - 18:40
    But in terms of
    how far we are, if you want,
  • 18:40 - 18:42
    we sort of have
    the infrastructure in place
  • 18:42 - 18:44
    to do the processing
    and have everything efficiently in memory.
  • 18:45 - 18:49
    We have first instances
    of silver annotations
  • 18:49 - 18:53
    and have a parser trainer in place
    for the supervised learning
  • 18:53 - 18:56
    and an initial plausibility model.
  • 18:56 - 19:00
    But we're still pushing on those fronts
    and very much looking forward
  • 19:00 - 19:03
    to see what comes out
    of the very last bit.
  • 19:08 - 19:10
    And those were my words.
  • 19:10 - 19:15
    I'm very excited to see
    what comes out of it
  • 19:15 - 19:18
    and it's been pure joy
    to work with Wikidata.
  • 19:18 - 19:20
    It's been fun to see
  • 19:20 - 19:24
    how some of the things you come across
    seemed wrong and then the next day,
  • 19:24 - 19:25
    you look, things are fixed
  • 19:25 - 19:31
    and it's really been amazing
    to see the momentum there.
  • 19:31 - 19:35
    Like I said, the URL,
    all the source code is on GitHub.
  • 19:36 - 19:39
    Our email addresses
    were on the first slide,
  • 19:39 - 19:43
    so please do reach out
    if you have questions or are interested
  • 19:43 - 19:47
    and I think we have time
    for a couple questions now in case...
  • 19:49 - 19:51
    (applause)
  • 19:51 - 19:52
    Thanks.
  • 19:56 - 19:59
    (woman 1) Thank you for your presentation.
    I do have a concern however.
  • 19:59 - 20:05
    The Wikipedia corpus
    is known to contain bias.
  • 20:05 - 20:10
    There's a very strong bias--
    for example, fewer women, more men,
  • 20:10 - 20:12
    all sorts of other aspects in there.
  • 20:12 - 20:15
    So isn't this actually
    also tainting the knowledge
  • 20:15 - 20:19
    that you are taking out of Wikipedia?
  • 20:22 - 20:25
    Well, there are two aspects
    of the question.
  • 20:25 - 20:29
    There's both in the model
    that we are then training,
  • 20:29 - 20:32
    you could ask how... let's just...
  • 20:33 - 20:36
    If you make it really simple
    and say like:
  • 20:36 - 20:41
    Does it mean that the model
    will then be worse
  • 20:41 - 20:46
    at predicting facts
    about women than men, say,
  • 20:46 - 20:50
    or some other set of groups?
  • 20:53 - 20:55
    To begin with,
    if you just look at the raw data,
  • 20:55 - 21:01
    it will reflect whatever is the bias
    in the training data, so that's...
  • 21:03 - 21:06
    People work on this to try
    and address that in the best possible way.
  • 21:06 - 21:10
    But normally,
    when you're training a model,
  • 21:10 - 21:14
    it will reflect
    whatever data you're training it on.
  • 21:15 - 21:19
    So that's something to account for
    when doing the work, yeah.
  • 21:21 - 21:23
    (man 2) Hi, this is [Marco].
  • 21:23 - 21:26
    I am a natural language
    processing practitioner.
  • 21:27 - 21:32
    I was curious about
    how you model your facts.
  • 21:32 - 21:35
    So I heard you say frame semantics,
  • 21:35 - 21:36
    Right.
  • 21:36 - 21:39
    (Marco) Could you maybe
    give some more details on that, please.
  • 21:40 - 21:47
    Yes, so it's frame semantics,
    we're using frame semantics,
  • 21:47 - 21:50
    and basically,
  • 21:50 - 21:56
    all of the facts in Wikidata,
    they're modeled as frames.
  • 21:56 - 21:59
    And so that's an essential part
    of the set up
  • 21:59 - 22:00
    and how we make this work.
  • 22:00 - 22:04
    That's essentially
    how we try to address the...
  • 22:04 - 22:07
    How can I make all the knowledge
    that I have in Wikidata
  • 22:07 - 22:11
    available in a context where
    I can annotate and train my model
  • 22:12 - 22:14
    when I am annotating or parsing text.
  • 22:14 - 22:20
    The answer is that the existing data
    in Wikidata is modeled as frames.
  • 22:20 - 22:21
    So the store that we have,
  • 22:21 - 22:24
    the knowledge base with
    all of the knowledge is a frame store,
  • 22:24 - 22:27
    and this is the same frame store
    that we are building on top of
  • 22:27 - 22:30
    when we're then parsing the text.
  • 22:30 - 22:34
    (Marco) So you're converting
    the Wikidata data model into some frame.
  • 22:35 - 22:37
    Yes, we are converting the Wikidata model
  • 22:37 - 22:40
    into one large frame store
    if you want, yeah.
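
As an illustration of what "modeling Wikidata as frames" can mean in practice, here is a toy conversion of one Wikidata JSON item into a frame-like nested structure; SLING's real frame store is quite different, so this only shows the shape of the idea:

    def item_to_frame(item):
        """Turn a Wikidata JSON item into a simple frame: {property: [values]}."""
        frame = {"id": item["id"],
                 "name": item.get("labels", {}).get("en", {}).get("value")}
        for prop, statements in item.get("claims", {}).items():
            values = []
            for statement in statements:
                snak = statement.get("mainsnak", {})
                if snak.get("snaktype") == "value":
                    values.append(snak["datavalue"]["value"])
            if values:
                frame[prop] = values
        return frame

    # frame = item_to_frame(kb["Q567"])
    # frame["P39"]                               # e.g. positions held by the item
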
  • 22:41 - 22:44
    (man 3) Thanks. Is Pluto a planet?
  • 22:44 - 22:47
    (audience laughing)
  • 22:47 - 22:48
    Can I get the question...
  • 22:48 - 22:52
    (man 3) I like the bootstrapping thing
    that you are doing,
  • 22:52 - 22:53
    I mean the way
    that you're training your model
  • 22:53 - 22:58
    by picking out the known facts
    about things that are verified,
  • 22:58 - 23:01
    and then training
    the plausibility prediction
  • 23:01 - 23:04
    by trying to teach
    the architecture of the system
  • 23:04 - 23:06
    to recognize that actually,
    that fact fits.
  • 23:06 - 23:13
    So that will work for large classes,
    but it will really...
  • 23:13 - 23:16
    It doesn't sound like it will learn
    about surprises
  • 23:16 - 23:19
    and especially not
    in small classes of items, right.
  • 23:19 - 23:21
    So if you train your model in...
  • 23:21 - 23:23
    When did Pluto disappear, I forgot...
  • 23:23 - 23:24
    As a planet, you mean.
  • 23:24 - 23:27
    (man 3) Yeah, it used to be
    a member of the solar system
  • 23:27 - 23:29
    and we had how many,
    nine observations there.
  • 23:29 - 23:31
    - Yeah.
    - (man 3) It's slightly problematic.
  • 23:31 - 23:34
    So everyone, the kids think
    that Pluto is not a planet,
  • 23:34 - 23:36
    I still think it's a planet,
    but never mind.
  • 23:36 - 23:42
    So the fact that it suddenly
    stopped being a planet,
  • 23:42 - 23:46
    which was supported in the period before,
    I don't know, hundreds of years, right?
  • 23:47 - 23:50
    That's crazy, how would you go
    for figuring out that thing?
  • 23:50 - 23:54
    For example, the new claim
    is not plausible for that thing.
  • 23:54 - 23:56
    Sure. So there are two things.
  • 23:56 - 23:59
    So there's both like how precise
    is a plausibility model.
  • 23:59 - 24:02
    So what it distinguishes between
    is random facts
  • 24:02 - 24:04
    and facts that are plausible.
  • 24:04 - 24:07
    And there's also the question
    of whether Pluto is a planet
  • 24:07 - 24:09
    and that's back to whether...
  • 24:09 - 24:10
    I was in another session
  • 24:10 - 24:14
    where someone brought up the example
    of the earth being flat,
  • 24:14 - 24:17
    - whether that is a fact or not.
    - (man 3) That makes sense.
  • 24:17 - 24:19
    So it is a fact in a sense
    that you can put it in,
  • 24:19 - 24:20
    I guess you could put it in Wikidata
  • 24:20 - 24:22
    with sources that are claiming
    that that's the thing.
  • 24:22 - 24:27
    So again, you would not necessarily
    want to train the model in a way
  • 24:27 - 24:31
    where if you read someone saying
    the planet Pluto, bla, bla, bla,
  • 24:31 - 24:34
    then it should be fine for it
  • 24:34 - 24:37
    to then say that
    an annotation for this text
  • 24:37 - 24:38
    is that Pluto is a planet.
  • 24:40 - 24:41
    That doesn't mean, you know...
  • 24:42 - 24:47
    The model won't be able to tell
    what "in the end" is the truth,
  • 24:47 - 24:49
    I don't think any of us here
    will be able to either, so...
  • 24:49 - 24:50
    (man 3) I just want to say
  • 24:50 - 24:53
    it's not a hard accusation
    against the approach
  • 24:53 - 24:56
    because even people
    cannot be sure whether that's a fact,
  • 24:56 - 24:58
    a new fact is plausible at that moment.
  • 24:59 - 25:00
    But that's always...
  • 25:00 - 25:03
    I just maybe reiterated a question
    that I am posing all the time
  • 25:03 - 25:06
    to myself and my work; I always ask.
  • 25:06 - 25:09
    We do the statistical learning thing,
    it's amazing nowadays
  • 25:09 - 25:14
    we can do billions of things,
    but we cannot learn about surprises,
  • 25:14 - 25:17
    and they are
    very, very important in fact, right?
  • 25:18 - 25:21
    - (man 4) But, just to refute...
    - (man 3) Thank you.
  • 25:23 - 25:27
    (man 4) The plausibility model
    is combined with kind of two extra rules.
  • 25:27 - 25:30
    First of all,
    if it's in Wikidata, it's true.
  • 25:30 - 25:35
    We just give you the benefit of the doubt,
    so please make it good.
  • 25:35 - 25:39
    The second thing is if it's not
    allowed by the schema it's false;
  • 25:40 - 25:43
    it's all the things in between
    we're looking at.
  • 25:43 - 25:50
    So if it's a planet according to Wikidata,
    it will be a true fact.
  • 25:53 - 25:57
    But it won't predict surprises
    but what is important here
  • 25:57 - 26:02
    is that there's kind of
    no manual human work involved,
  • 26:02 - 26:04
    so there's nothing
    that prevents you from...
  • 26:04 - 26:06
    Well, now, if we're successful
    with the approach,
  • 26:06 - 26:09
    there's nothing that prevents him
    from continuously updating the model
  • 26:09 - 26:12
    with changes happening
    in Wikidata and Wikipedia and so on.
  • 26:12 - 26:18
    So in theory, you should be able
    to quickly learn new surprises.
  • 26:18 - 26:20
    (moderator) One last question.
  • 26:20 - 26:23
    - (man 4) Maybe we're biased by Wikidata.
    - Yeah.
  • 26:24 - 26:28
    (man 4) You are our bias.
    Whatever you annotate is what we believe.
  • 26:28 - 26:32
    So if you make it good,
    if you make it balanced,
  • 26:32 - 26:34
    we can hopefully be balanced.
  • 26:34 - 26:39
    With the gender thing,
    there's actually an interesting thing.
  • 26:40 - 26:42
    We are actually getting
    more training facts
  • 26:42 - 26:44
    about women than men
  • 26:44 - 26:49
    because "she" is a much less
    ambiguous pronoun in the text,
  • 26:49 - 26:52
    so we actually get a lot more
    true facts about women.
  • 26:52 - 26:55
    So we are biased, but on the women's side.
  • 26:56 - 26:59
    (woman 2) No, I want to see
    the data on that.
  • 26:59 - 27:00
    (audience laughing)
  • 27:00 - 27:02
    We should bring that along next time.
  • 27:02 - 27:05
    (man 4) You get a hard decision [inaudible].
  • 27:05 - 27:06
    (man 3) Yes, hard decision.
  • 27:08 - 27:13
    (man 5) It says SLING is...
    a parser across many languages
  • 27:13 - 27:15
    - and you showed us English.
    - Yes!
  • 27:15 - 27:18
    (man 5) Can you say something about
    the number of languages that you are--
  • 27:18 - 27:19
    Yes! Thank you for asking.
  • 27:19 - 27:22
    I had told myself to say that
    up front on the first page
  • 27:22 - 27:23
    because otherwise,
    I would forget, and I did.
  • 27:25 - 27:26
    So right now,
  • 27:26 - 27:30
    we're not actually looking at two files,
    we're looking at 13 files.
  • 27:30 - 27:33
    So Wikipedia dumps
    from 12 different languages
  • 27:33 - 27:36
    that we're processing,
  • 27:36 - 27:41
    and none of this is dependent
    on the language being English.
  • 27:41 - 27:44
    So we're processing this
    for all of the 12 languages.
  • 27:48 - 27:49
    Yeah.
  • 27:49 - 27:50
    For now,
  • 27:50 - 27:57
    they share the property of, I think,
    using the Latin alphabet, and so on.
  • 27:57 - 27:59
    Mostly for us to be able to make sure
  • 27:59 - 28:02
    that what we are doing
    still makes sense and works.
  • 28:02 - 28:05
    But there's nothing
    fundamental about the approach
  • 28:05 - 28:10
    that prevents it from being used
    in very different languages
  • 28:10 - 28:15
    from those being spoken around this area.
  • 28:17 - 28:19
    (woman 3) Leila from Wikimedia Foundation.
  • 28:19 - 28:22
    I may have missed this
    when you presented this.
  • 28:23 - 28:28
    Do you make an attempt to bring
    any references from Wikipedia articles
  • 28:28 - 28:32
    back to the property and statements
    you're making in Wikidata?
  • 28:33 - 28:37
    So I briefly mentioned this
    as a potential application.
  • 28:37 - 28:40
    So for now, what we're trying to do
    is just to get this to work,
  • 28:41 - 28:46
    but let's say we did get it to work
    with a high level of quality,
  • 28:47 - 28:51
    that would be an obvious thing
    to try to do, so when you...
  • 28:53 - 28:55
    Let's say you were willing to...
  • 28:55 - 29:00
    I know there's some controversy around
    using Wikipedia as a source for Wikidata,
  • 29:00 - 29:02
    that you can't have
    circular references and so on,
  • 29:02 - 29:05
    so you need to have
    properly sourced facts.
  • 29:05 - 29:07
    So let's say you were
    coming up with new facts,
  • 29:07 - 29:14
    and obviously, you could look
    at the coverage of news media and so on
  • 29:14 - 29:16
    and process these
    and try to annotate these.
  • 29:16 - 29:20
    And then, that way,
    find sources for facts,
  • 29:20 - 29:21
    new facts that you come up with.
  • 29:21 - 29:22
    Or you could even take existing...
  • 29:22 - 29:26
    There are a lot of facts in Wikidata
    that either have no sources
  • 29:26 - 29:30
    or only have Wikipedia as a source,
    so you can start processing these
  • 29:30 - 29:33
    and try to find sources
    for those automatically.
  • 29:34 - 29:38
    (Leila) Or even within the articles
    that you're taking this information from
  • 29:38 - 29:42
    just using the sources from there
    because they may contain...
  • 29:42 - 29:44
    - Yeah. Yeah.
    - Yeah. Thanks.
  • 29:47 - 29:49
    - (moderator) Thanks Anders.
    - Cool. Thanks.
  • 29:50 - 29:55
    (applause)