0:00:05.961,0:00:08.133 (moderator) The next talk is[br]by Anders Sandholm 0:00:08.133,0:00:12.319 on Wikidata fact annotation[br]for Wikipedia across languages. 0:00:12.319,0:00:13.920 - Thank you.[br]- Thanks. 0:00:21.905,0:00:24.164 I wanted to start with a small confession. 0:00:26.428,0:00:31.687 Wow! I'm blown away[br]by the momentum of Wikidata 0:00:33.799,0:00:35.909 and the engagement of the community. 0:00:37.230,0:00:38.670 I am really excited about being here 0:00:38.671,0:00:42.296 and getting a chance to talk[br]about work that we've been doing. 0:00:42.914,0:00:47.398 This is joint work with Michael,[br]who's also here in the third row. 0:00:49.551,0:00:51.921 But before I dive more into this, 0:00:51.922,0:00:55.515 this wouldn't be[br]a Google presentation without an ad, 0:00:56.102,0:00:58.196 so you get that up front. 0:00:58.196,0:01:01.242 This is what I'll be talking about,[br]our project, the SLING project. 0:01:02.255,0:01:06.640 It is an open source project[br]and it's using Wikidata a lot. 0:01:08.020,0:01:11.721 You can go check it out on GitHub[br]when you get a chance 0:01:11.722,0:01:15.960 if you feel excited about it[br]after the presentation. 0:01:18.215,0:01:23.493 And really, what I wanted to talk about--[br]the title is admittedly a little bit long, 0:01:23.494,0:01:25.797 it's even shorter than it was[br]in the original program. 0:01:25.798,0:01:29.704 But what it comes down to,[br]what the project comes down to 0:01:29.704,0:01:33.617 is trying to answer[br]this one very exciting question. 0:01:34.810,0:01:38.218 If you want, in the beginning,[br]there were just two files, 0:01:39.914,0:01:41.400 some of you may recognize them, 0:01:42.416,0:01:45.953 they're essentially the dump files[br]from Wikidata and Wikipedia, 0:01:47.234,0:01:50.280 and the question we're trying[br]to figure out or answer is really, 0:01:51.570,0:01:54.423 can we dramatically improve[br]how good machines are 0:01:54.424,0:01:58.062 at understanding human language[br]just by using these files? 0:02:00.900,0:02:04.158 And of course, you're entitled to ask 0:02:04.158,0:02:06.191 whether that's an interesting[br]question to answer. 0:02:07.450,0:02:14.344 If you're a company that [inaudible][br]is to be able to take search queries 0:02:14.344,0:02:17.656 and try to answer them[br]in the best possible way, 0:02:18.460,0:02:23.989 obviously, understanding natural language[br]comes in as a very handy thing. 0:02:25.317,0:02:27.914 But even if you look at Wikidata, 0:02:29.109,0:02:33.843 in the previous data quality panel[br]earlier today, 0:02:33.843,0:02:39.070 there was a question that came up about[br]verification, or verifiability of facts. 0:02:39.070,0:02:42.623 So let's say you actually do[br]understand natural language. 0:02:42.623,0:02:47.304 If you have a fact and there's a source,[br]you could go to the source and analyze it, 0:02:47.304,0:02:49.721 and you can figure out whether[br]it actually confirms the fact 0:02:49.722,0:02:52.282 that claims[br]to have it as a source. 0:02:53.459,0:02:55.540 And if you could do that,[br]you could even go beyond it 0:02:55.541,0:02:59.723 and you could read articles[br]and annotate them, come up with facts, 0:02:59.723,0:03:03.478 and actually look for existing facts[br]that may need sources 0:03:03.479,0:03:06.109 and add these articles as sources.
0:03:07.110,0:03:11.371 Or, you know, in the wildest,[br]craziest possible of all worlds, 0:03:11.371,0:03:13.756 if you get really, really good at it[br]you could read articles 0:03:13.756,0:03:18.243 and maybe even annotate with new facts[br]that you could then suggest as facts 0:03:18.244,0:03:19.965 that you could potentially[br]add to Wikidata. 0:03:20.595,0:03:27.025 But there's a whole world of applications[br]of natural language understanding. 0:03:28.895,0:03:32.478 One of the things that's really hard when[br]you do natural language understanding-- 0:03:32.479,0:03:35.595 these days, that also means[br]deep learning or machine learning, 0:03:35.596,0:03:39.537 and one of the things that's really hard[br]is getting enough training data. 0:03:39.537,0:03:42.812 And historically,[br]that's meant having a lot of text 0:03:42.812,0:03:45.441 that you need human annotators[br]to then first process 0:03:45.442,0:03:46.801 and then you can do training. 0:03:46.802,0:03:51.184 And part of the question here[br]is also really to say: 0:03:51.184,0:03:55.930 Can we use Wikidata and the way[br]in which it's interlinked with Wikipedia 0:03:57.012,0:03:58.012 for training data, 0:03:58.013,0:04:00.600 and will that be enough[br]to train that model? 0:04:03.429,0:04:06.517 So hopefully, we'll get closer[br]to answering this question 0:04:06.518,0:04:09.289 in the next 15 to 20 minutes. 0:04:10.271,0:04:14.071 We don't quite know the answer yet[br]but we have some exciting results 0:04:14.072,0:04:16.992 that are pointing[br]in the right direction, if you want. 0:04:19.387,0:04:23.798 Just to take a step back in terms of[br]the development we've seen, 0:04:24.445,0:04:28.450 machine learning and deep learning[br]have revolutionized a lot of areas 0:04:28.450,0:04:32.431 and this is just one example[br]of a particular image recognition task 0:04:32.432,0:04:37.343 where if you look at what happened[br]between 2010 and 2015, 0:04:37.344,0:04:40.881 in that five-year period,[br]we went from machines doing pretty poorly 0:04:40.882,0:04:44.921 to, in the end, actually performing[br]at the same level as humans 0:04:44.922,0:04:48.804 or in some cases even better,[br]albeit for a very specific task. 0:04:50.224,0:04:55.515 So we've seen really a lot of things[br]improving dramatically. 0:04:56.221,0:04:57.881 And so you can ask 0:04:57.882,0:05:02.440 why don't we just throw deep learning[br]at natural language processing 0:05:02.440,0:05:04.600 and natural language understanding[br]and be done with it? 0:05:05.497,0:05:11.532 And the answer is kind of[br]that we sort of have, to a certain extent, 0:05:11.532,0:05:14.367 but what it turns out is that 0:05:15.005,0:05:17.725 natural language understanding[br]is actually still a bit of a challenge 0:05:17.726,0:05:23.281 and one of the situations where[br]a lot of us interact with machines 0:05:23.282,0:05:25.803 that are trying to behave like[br]they understand what we're saying 0:05:25.804,0:05:26.804 is in these chat bots. 0:05:26.805,0:05:28.605 So this is not to pick[br]on anyone in particular 0:05:28.606,0:05:31.991 but just, I think, an experience[br]that a lot of us have had. 0:05:31.992,0:05:36.841 In this case, it's a user saying[br]I want to stay in this place. 0:05:36.842,0:05:41.766 The chat bot says: "OK, got it,[br]when will you be checking in and out? 0:05:41.766,0:05:44.488 For example, November 17th to 23rd." 0:05:44.488,0:05:46.620 And the user says:[br]"Well, I don't have any dates yet."
0:05:46.620,0:05:47.681 And then the response is: 0:05:47.682,0:05:51.050 "Sorry, there are no hotels available[br]for the dates you've requested. 0:05:51.050,0:05:52.571 Would you like to start a new search?" 0:05:53.212,0:05:55.041 So there's still some way to go 0:05:55.862,0:05:58.755 to get machines to really[br]understand human language. 0:05:59.817,0:06:03.761 But machine learning or deep learning 0:06:03.762,0:06:06.786 has been applied[br]already to this discipline. 0:06:06.787,0:06:09.721 Like, one of the examples is a recent... 0:06:09.722,0:06:11.232 a more successful example is BERT 0:06:11.233,0:06:17.316 where they're using transformers[br]to solve NLP or NLU tasks. 0:06:18.800,0:06:22.157 And it's dramatically improved[br]the performance but, as we've seen, 0:06:22.157,0:06:23.560 there is still some way to go. 0:06:25.150,0:06:27.857 One thing that's shared among[br]most of these approaches 0:06:27.858,0:06:31.785 is that you look at the text itself 0:06:31.785,0:06:36.629 and you depend on having a lot of it[br]so you can train your model on the text, 0:06:36.629,0:06:39.761 but everything is based[br]on just looking at the text 0:06:39.762,0:06:41.675 and understanding the text. 0:06:41.675,0:06:45.727 So the learning is really[br]just representation learning. 0:06:45.727,0:06:50.653 What we wanted to do is actually[br]understand and annotate the text 0:06:50.653,0:06:54.006 in terms of items[br]or entities in the real world. 0:06:56.384,0:06:59.537 And in general, if we take a step back, 0:07:00.077,0:07:03.441 why is natural language processing[br]or understanding so hard? 0:07:03.442,0:07:07.659 There are a number of reasons[br]why it's really hard, but at the core, 0:07:07.659,0:07:11.041 one of the important reasons[br]is that somehow, 0:07:11.042,0:07:13.225 the machine needs to have[br]knowledge of the world 0:07:13.226,0:07:16.867 in order to understand human language. 0:07:19.569,0:07:22.456 And you think about that[br]for a little while. 0:07:23.074,0:07:26.654 What better place to look for knowledge[br]about the world than Wikidata? 0:07:27.318,0:07:29.625 So in essence, that's the approach. 0:07:29.625,0:07:31.985 And the question is can you leverage it, 0:07:31.985,0:07:38.877 can you use this wonderful knowledge 0:07:38.878,0:07:40.601 of the world that we already have 0:07:40.602,0:07:45.617 in a way that you can help[br]to train and bootstrap your model. 0:07:47.390,0:07:51.121 So the alternative here is really[br]understanding the text 0:07:51.122,0:07:55.439 not just in terms of other texts[br]or how this text is similar to other texts 0:07:55.439,0:07:59.104 but in terms of the existing knowledge[br]that we have about the world. 0:08:01.164,0:08:02.704 And what makes me really excited 0:08:02.705,0:08:05.905 or at least makes me[br]have a good gut feeling about this 0:08:05.906,0:08:07.372 is that in some ways 0:08:07.373,0:08:10.780 it seems closer[br]to how we interact as humans. 0:08:10.780,0:08:13.795 So if we were having a conversation 0:08:13.795,0:08:17.847 and you were bringing up[br]the Bundeskanzler and Angela Merkel, 0:08:18.662,0:08:23.173 I would have an internal representation[br]of Q567 and it would light up. 0:08:23.173,0:08:25.521 And in our continued conversation, 0:08:25.522,0:08:29.615 mentioning other things[br]related to Angela Merkel, 0:08:29.616,0:08:31.762 I would have an easier time[br]associating with that 0:08:31.763,0:08:33.920 or figuring out[br]what you were actually talking about. 
0:08:35.027,0:08:38.919 And so, in essence,[br]that's at the heart of this approach, 0:08:38.919,0:08:42.100 that we really believe[br]Wikidata is a key component 0:08:42.101,0:08:45.809 in unlocking this better understanding[br]of natural language. 0:08:49.732,0:08:51.448 And so how are we planning to do it? 0:08:52.557,0:08:56.797 Essentially, there are five steps[br]we're going through, 0:08:56.798,0:08:58.080 or have been going through. 0:08:58.788,0:09:02.841 I'll go over each[br]of the steps briefly in turn 0:09:02.841,0:09:04.410 but essentially, there are five steps. 0:09:04.410,0:09:07.120 First, we need to start[br]with the dump files that I showed you 0:09:07.120,0:09:08.120 to begin with-- 0:09:08.706,0:09:11.149 understanding what's in them,[br]parsing them, 0:09:11.149,0:09:13.397 having an efficient[br]internal representation in memory 0:09:13.397,0:09:15.716 that allows us to do[br]quick processing on this. 0:09:16.225,0:09:18.502 And then, we're leveraging[br]some of the annotations 0:09:18.503,0:09:22.605 that are already in Wikipedia,[br]linking it to items in Wikidata. 0:09:22.605,0:09:25.462 I'll briefly show you what I mean by that. 0:09:25.462,0:09:31.001 We can use that to then[br]generate more advanced annotations 0:09:31.973,0:09:34.549 where we have much more text annotated. 0:09:34.549,0:09:40.333 But still, with annotations[br]being items or facts in Wikidata, 0:09:40.334,0:09:43.717 we can then train a model[br]based on the silver data 0:09:43.717,0:09:46.212 and get a reasonably good model 0:09:46.212,0:09:49.047 that will allow us to read[br]a Wikipedia document 0:09:49.047,0:09:53.308 and understand what the actual content is[br]in terms of Wikidata, 0:09:54.613,0:09:57.580 but only for facts that are[br]already in Wikidata. 0:09:58.523,0:10:02.367 And so that's where kind of[br]the hard part of this begins. 0:10:02.367,0:10:06.100 In order to go beyond that[br]we need to have a plausibility model, 0:10:06.100,0:10:07.641 so a model that can tell us, 0:10:07.642,0:10:10.881 given a lot of facts about an item[br]and an additional fact, 0:10:10.882,0:10:12.627 whether the additional fact is plausible. 0:10:13.191,0:10:14.296 If we can build that, 0:10:14.892,0:10:21.831 we can then use a more "hyper modern"[br]reinforcement learning aspect 0:10:21.832,0:10:26.033 of deep learning and machine learning[br]to fine-tune the model 0:10:26.033,0:10:30.303 and hopefully go beyond[br]what we've been able to so far. 0:10:31.933,0:10:32.933 So real quick, 0:10:32.934,0:10:36.632 the first step is essentially[br]getting the dump files parsed, 0:10:36.632,0:10:41.021 understanding the contents, and linking up[br]Wikidata and Wikipedia information, 0:10:41.022,0:10:44.416 and then utilizing some of the annotations[br]that are already there. 0:10:45.547,0:10:49.304 And so this is essentially[br]what's happening. 0:10:49.305,0:10:51.959 Trust me, Michael built all of this,[br]it's working great. 0:10:52.701,0:10:55.621 But essentially, we're starting[br]with the two files you can see on the top, 0:10:55.622,0:10:58.244 the Wikidata dump and the Wikipedia dump. 0:10:58.245,0:11:02.413 The Wikidata dump gets processed[br]and we end up with a knowledge base, 0:11:02.413,0:11:04.376 a KB at the bottom. 
0:11:04.377,0:11:07.335 That's essentially a store[br]we can hold in memory 0:11:07.336,0:11:10.439 that has essentially all of Wikidata in it 0:11:10.440,0:11:13.841 and we can quickly access[br]all the properties and facts and so on 0:11:13.841,0:11:15.163 and do analysis there. 0:11:15.164,0:11:16.414 Similarly, for the documents, 0:11:16.415,0:11:18.486 they get processed[br]and we end up with documents 0:11:19.274,0:11:21.911 that have been processed. 0:11:21.912,0:11:23.544 We know all the mentions 0:11:23.545,0:11:26.838 and some of the things[br]that are already in the documents. 0:11:26.839,0:11:27.839 And then in the middle, 0:11:27.840,0:11:30.093 we have an important part[br]which is a phrase table 0:11:30.094,0:11:33.081 that allows us to basically[br]see for any phrase 0:11:34.096,0:11:35.753 what is the frequency distribution, 0:11:35.754,0:11:39.481 what's the most likely item[br]that we're referring to 0:11:39.481,0:11:41.165 when we're using this phrase. 0:11:41.165,0:11:44.445 So we're using that later on[br]to build the silver annotations. 0:11:44.446,0:11:48.001 So let's say we've run this[br]and then we also want to make sure 0:11:48.002,0:11:51.691 we utilize annotations[br]that are already there. 0:11:51.692,0:11:54.112 So an important part[br]of a Wikipedia article 0:11:54.113,0:11:57.841 is that it's not just plain text, 0:11:57.842,0:12:01.007 it's actually already[br]pre-annotated with a few things. 0:12:01.008,0:12:04.046 So a template is one example,[br]links is another example. 0:12:04.047,0:12:08.017 So if we take here the English article[br]for Angela Merkel, 0:12:09.387,0:12:12.301 there is one example of a link here[br]which is to her party. 0:12:12.302,0:12:13.772 If you look at the bottom, 0:12:13.773,0:12:16.426 that's a link to a specific[br]Wikipedia article, 0:12:16.427,0:12:20.155 and I guess for people here,[br]it's no surprise that, in essence, 0:12:20.156,0:12:23.360 that is then, if you look[br]at the associated Wikidata item, 0:12:23.361,0:12:25.801 that's essentially an annotation saying 0:12:25.802,0:12:31.453 this is the QID I am talking about[br]when I'm talking about this party, 0:12:31.453,0:12:32.820 the Christian Democratic Union. 0:12:33.951,0:12:37.281 So we're using this[br]to already have a good start 0:12:37.282,0:12:39.326 in terms of understanding what text means. 0:12:39.327,0:12:40.327 All of these links, 0:12:40.328,0:12:43.983 we know exactly what the author[br]means with the phrase 0:12:44.504,0:12:47.040 in the cases where[br]there are links to QIDs. 0:12:48.234,0:12:53.303 We can use this and the phrase table[br]to then try and take a Wikipedia document 0:12:53.304,0:12:58.760 and fully annotate it with everything[br]we know about already from Wikidata. 0:12:59.659,0:13:02.753 And we can use this to train[br]the first iteration of our model. 0:13:03.933,0:13:04.933 (coughs) Excuse me. 0:13:04.934,0:13:07.876 So this is exactly the same article, 0:13:08.400,0:13:13.566 but now, after we've annotated it[br]with silver annotations, 0:13:14.673,0:13:18.441 and essentially,[br]you can see all of the squares 0:13:18.442,0:13:24.530 are places where we've been able[br]to annotate with QIDs or with facts. 0:13:26.362,0:13:30.681 This is just a screenshot[br]of the viewer on the data, 0:13:30.682,0:13:34.281 so you can have access[br]to all of this information 0:13:34.282,0:13:37.577 and see what's come out[br]of the silver annotation. 
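To make the phrase table idea described above a bit more concrete, here is a minimal Python sketch of how anchor texts from Wikipedia links could be aggregated into a phrase-to-item frequency table. The input pairs, all QIDs other than Q567 (Angela Merkel), and the function names are illustrative assumptions, not SLING's actual data or API.

```python
from collections import Counter, defaultdict

# Hypothetical input: (anchor text, target QID) pairs harvested from
# Wikipedia wikilinks whose target article maps to a Wikidata item.
# QIDs other than Q567 (Angela Merkel) are illustrative placeholders.
links = [
    ("Angela Merkel", "Q567"),
    ("Merkel", "Q567"),
    ("Merkel", "Q567"),
    ("Christian Democratic Union", "Q49762"),
]

# Phrase table: for each phrase, a frequency distribution over the
# Wikidata items that the phrase has been observed to link to.
phrase_table = defaultdict(Counter)
for anchor, qid in links:
    phrase_table[anchor.lower()][qid] += 1

def most_likely_item(phrase):
    """Return the item this phrase most frequently links to, if any."""
    counts = phrase_table.get(phrase.lower())
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_item("Merkel"))  # -> Q567
```

In the talk, a table of this kind is what the later silver-annotation step relies on when it has to guess an item for a phrase that is not itself an explicit link.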
0:13:37.577,0:13:41.364 And it's important to say that[br]there's no machine learning 0:13:41.365,0:13:42.678 or anything involved here. 0:13:42.679,0:13:46.007 All we've done is sort of[br]mechanically, with a few tricks, 0:13:46.515,0:13:49.709 basically pushed information[br]we already have from Wikidata 0:13:49.710,0:13:52.760 onto the Wikipedia article. 0:13:53.328,0:13:56.202 And so here, if you hover over[br]"Chancellor of Germany", 0:13:56.202,0:14:01.973 that is itself referring[br]to a Wikidata item, 0:14:01.974,0:14:04.972 which has a number of properties[br]like "subclass of: Chancellor" 0:14:04.972,0:14:08.658 and "country: Germany",[br]which again refer to other items. 0:14:08.659,0:14:11.732 And here, it also has[br]the property "officeholder" 0:14:12.473,0:14:15.496 which happens to be[br]Angela Dorothea Merkel, 0:14:15.497,0:14:17.051 who is also mentioned in the text. 0:14:17.052,0:14:22.137 So there's really a full annotation[br]linking up the contents here. 0:14:24.645,0:14:27.429 But again, there is an important[br]and unfortunate point 0:14:27.430,0:14:31.563 about what we are able to[br]and not able to do here. 0:14:31.564,0:14:35.342 So what we are doing is pushing[br]information we already have in Wikidata, 0:14:35.342,0:14:40.169 so what we can't annotate here[br]are things that are not in Wikidata. 0:14:40.169,0:14:41.681 So for instance, here, 0:14:41.682,0:14:44.910 she was at some point appointed[br]Federal Minister for Women and Youth 0:14:44.910,0:14:48.713 and that alias or that phrase[br]is not in Wikidata, 0:14:48.713,0:14:54.000 so we're not able to make that annotation[br]here in our silver annotations. 0:14:56.227,0:14:59.943 That said, it's still... at least for me, 0:14:59.944,0:15:02.625 it was pretty surprising to see[br]how much you can actually annotate 0:15:02.626,0:15:04.266 and how much information is already there 0:15:04.267,0:15:08.877 when you combine Wikidata[br]with a Wikipedia article. 0:15:08.878,0:15:15.321 So what you can do is, once you have this,[br]you know, millions of documents, 0:15:16.275,0:15:20.240 you can train your parser[br]based on the annotations that are there. 0:15:21.134,0:15:26.968 And that's essentially a parser[br]that has a number of components. 0:15:26.969,0:15:30.481 Essentially, the text is coming in[br]at the bottom and at the top, 0:15:30.482,0:15:33.722 we have a transition-based[br]frame semantic parser 0:15:33.723,0:15:39.154 that then generates the annotations[br]or these facts or references to the items. 0:15:40.617,0:15:44.987 We built this and ran it[br]on more classical corpora 0:15:44.987,0:15:49.611 like [inaudible],[br]which are more classical NLP corpora, 0:15:49.611,0:15:53.800 but we want to be able to run this[br]on the full Wikipedia corpora. 0:15:53.800,0:15:57.201 So Michael has been rewriting this in C++ 0:15:57.202,0:15:59.932 and we're able to really[br]scale up performance 0:15:59.932,0:16:01.101 of the parser trainer here. 0:16:01.102,0:16:03.594 So it will be exciting to see exactly 0:16:03.595,0:16:05.830 the results that are going[br]to come out of that.
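As a rough, self-contained sketch of the kind of "mechanical" silver annotation described above, here is a dictionary lookup that pushes already-known phrase-to-item mappings onto plain text, with no learning involved. The phrase list and all QIDs apart from Q567 are illustrative, and the real pipeline does considerably more (templates, aliases, facts, and so on), so treat this only as an approximation of the idea.

```python
# Minimal silver-annotation pass: longest-match lookup of phrases that
# Wikipedia links already tell us how to resolve to Wikidata items.
# Phrases and QIDs below are illustrative placeholders (except Q567).
PHRASE_TO_QID = {
    "angela merkel": "Q567",
    "merkel": "Q567",
    "christian democratic union": "Q49762",
}

def silver_annotate(text, max_len=4):
    """Return (start_token, end_token, qid) mentions found by dictionary lookup."""
    tokens = text.lower().split()
    mentions, i = [], 0
    while i < len(tokens):
        hit = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest match first
            qid = PHRASE_TO_QID.get(" ".join(tokens[i:i + n]))
            if qid:
                hit = (i, i + n, qid)
                break
        if hit:
            mentions.append(hit)
            i = hit[1]
        else:
            i += 1
    return mentions

print(silver_annotate("Angela Merkel led the Christian Democratic Union"))
# -> [(0, 2, 'Q567'), (4, 7, 'Q49762')]
```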
0:16:08.638,0:16:10.263 So once that's in place, 0:16:10.264,0:16:13.459 we have a pretty good model[br]that's able to at least 0:16:13.459,0:16:16.051 predict facts that are[br]already known in Wikidata, 0:16:16.052,0:16:18.790 but ideally, we want to move beyond that, 0:16:18.790,0:16:20.703 and for that[br]we need this plausibility model 0:16:20.704,0:16:23.928 which, in essence,[br]you can think of as a black box 0:16:23.929,0:16:27.121 where you supply it with[br]all of the known facts you have 0:16:27.122,0:16:30.574 about a particular item[br]and then you provide an additional fact. 0:16:31.412,0:16:32.412 And by magic, 0:16:32.413,0:16:36.948 the black box tells you how plausible is[br]the additional fact that you're providing 0:16:36.949,0:16:40.396 and how plausible is it[br]that this particular item is a fact. 0:16:42.792,0:16:43.792 And... 0:16:45.733,0:16:48.582 I don't know if it's fair to say[br]that it was much to our surprise, 0:16:48.582,0:16:50.776 but at least, you can actually-- 0:16:50.776,0:16:52.905 In order to train a model, you need, 0:16:52.905,0:16:55.255 like we've seen earlier,[br]you need a lot of training data 0:16:55.256,0:16:57.880 and essentially, you can[br]use Wikidata as training data. 0:16:57.881,0:17:02.213 You serve it basically[br]all the facts for a given item 0:17:02.213,0:17:04.614 and then you mask or hold out one fact 0:17:04.615,0:17:08.566 and then you provide that as a fact[br]that it's supposed to predict. 0:17:09.238,0:17:10.718 And just using this as training data, 0:17:10.719,0:17:15.881 you can get a really really good[br]plausibility model, actually, 0:17:18.574,0:17:21.675 to the extent that I was hoping one day[br]to maybe be able to even use it 0:17:21.675,0:17:27.527 for discovering what you could call[br]accidental vandalism in Wikidata, 0:17:27.528,0:17:33.011 like a fact that's been added by accident[br]and really doesn't look like it's... 0:17:33.012,0:17:35.029 It doesn't fit with the normal topology 0:17:35.029,0:17:38.621 of facts or knowledge[br]in Wikidata, if you want. 0:17:41.058,0:17:43.761 But in this particular setup,[br]we need it for something else, 0:17:43.762,0:17:46.738 namely for doing reinforcement learning 0:17:47.951,0:17:50.805 so we can fine-tune the Wiki parser, 0:17:50.805,0:17:54.034 basically using the plausibility model[br]as a reward function. 0:17:54.035,0:17:59.576 So when you do the training,[br]you try to parse a Wikipedia document, 0:17:59.576,0:18:01.871 [inaudible] in Wikipedia[br]comes up with a fact, 0:18:01.871,0:18:04.281 and we check the fact[br]against the plausibility model 0:18:04.282,0:18:07.527 and use that as feedback[br]or as a reward function 0:18:08.198,0:18:09.601 in training the model. 0:18:09.602,0:18:12.708 And the big question here is then[br]can we learn to predict facts 0:18:12.709,0:18:15.000 that are not already in Wikidata. 0:18:15.800,0:18:22.300 And we hope and believe we can[br]but it's still not clear. 0:18:22.879,0:18:27.792 So this is essentially what we have been[br]and are planning to do. 0:18:27.792,0:18:31.223 There have been some[br]surprisingly good results 0:18:31.224,0:18:33.989 in terms of how far[br]you can get with silver annotations 0:18:33.990,0:18:35.720 and a plausibility model. 0:18:36.271,0:18:40.081 But in terms of[br]how far we are, if you want, 0:18:40.082,0:18:41.961 we sort of have[br]the infrastructure in place 0:18:41.962,0:18:44.480 to do the processing[br]and have everything efficiently in memory.
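To illustrate the two ideas above, generating plausibility training data by holding out one fact at a time, and then using the plausibility score as a reinforcement-learning reward, here is a hedged Python sketch. The properties shown (P31, P39, P102) are real Wikidata properties, but the item data, the negative-sampling scheme, and the function signatures are assumptions made for illustration, not the actual SLING implementation.

```python
import random

# Known facts per item as (property, value) pairs. Q4970706 and Q49762 are
# illustrative placeholder QIDs; P31/P39/P102 are real Wikidata properties.
ITEM_FACTS = {
    "Q567": [
        ("P31", "Q5"),        # instance of: human
        ("P39", "Q4970706"),  # position held (placeholder value)
        ("P102", "Q49762"),   # member of political party (placeholder value)
    ],
}

def masked_training_examples(item_facts):
    """Hold out each fact in turn: the remaining facts plus the held-out fact
    form a positive example; a randomly corrupted fact forms a (presumed)
    negative example. This is how Wikidata itself can serve as training data."""
    all_values = [v for facts in item_facts.values() for _, v in facts]
    for item, facts in item_facts.items():
        for i, held_out in enumerate(facts):
            context = facts[:i] + facts[i + 1:]
            yield item, context, held_out, 1
            corrupted = (held_out[0], random.choice(all_values))
            yield item, context, corrupted, 0

def plausibility(item, context, fact):
    """Stub for the trained plausibility model: returns a score in [0, 1]."""
    raise NotImplementedError

def parser_reward(item, context, predicted_fact):
    """Reward for a fact the parser extracts from text: the plausibility
    score, used as the reward signal when fine-tuning the parser."""
    return plausibility(item, context, predicted_fact)

for example in masked_training_examples(ITEM_FACTS):
    print(example)
```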
0:18:45.121,0:18:49.138 We have first instances[br]of silver annotations 0:18:49.139,0:18:53.041 and have a parser trainer in place[br]for the supervised learning 0:18:53.042,0:18:55.755 and an initial plausibility model. 0:18:55.756,0:19:00.400 But we're still pushing on those fronts[br]and very much looking forward 0:19:00.400,0:19:03.320 to seeing what comes out[br]of the very last bit. 0:19:07.786,0:19:10.309 And those were my words. 0:19:10.310,0:19:14.681 I'm very excited to see[br]what comes out of it 0:19:14.682,0:19:17.661 and it's been pure joy[br]to work with Wikidata. 0:19:17.662,0:19:19.513 It's been fun to see 0:19:19.514,0:19:23.917 how some of the things you come across[br]seemed wrong and then the next day, 0:19:23.918,0:19:24.958 you look, things are fixed 0:19:24.959,0:19:30.551 and it's really been amazing[br]to see the momentum there. 0:19:31.161,0:19:35.295 Like I said, the URL,[br]all the source code is on GitHub. 0:19:35.887,0:19:38.912 Our email addresses[br]were on the first slide, 0:19:38.913,0:19:42.582 so please do reach out[br]if you have questions or are interested 0:19:42.582,0:19:47.149 and I think we have time[br]for a couple questions now in case... 0:19:49.450,0:19:51.446 (applause) 0:19:51.447,0:19:52.447 Thanks. 0:19:55.583,0:19:59.400 (woman 1) Thank you for your presentation.[br]I do have a concern however. 0:19:59.401,0:20:05.441 The Wikipedia corpus[br]is known to be biased. 0:20:05.442,0:20:09.841 There's a very strong bias--[br]for example, fewer women, more men, 0:20:09.842,0:20:11.787 all sorts of other aspects in there. 0:20:11.787,0:20:15.201 So isn't this actually[br]also tainting the knowledge 0:20:15.202,0:20:19.471 that you are taking out of Wikipedia? 0:20:22.320,0:20:25.424 Well, there are two aspects[br]of the question. 0:20:25.425,0:20:28.591 There's both in the model[br]that we are then training, 0:20:28.591,0:20:32.495 you could ask how... let's just... 0:20:33.172,0:20:35.841 If you make it really simple[br]and say like: 0:20:35.842,0:20:41.204 Does it mean that the model[br]will then be worse 0:20:41.204,0:20:46.027 at predicting facts[br]about women than men, say, 0:20:46.027,0:20:50.416 or some other set of groups? 0:20:53.098,0:20:55.424 To begin with,[br]if you just look at the raw data, 0:20:55.425,0:21:00.529 it will reflect whatever the bias is[br]in the training data, so that's... 0:21:02.810,0:21:06.001 People work on this to try[br]and address that in the best possible way. 0:21:06.002,0:21:10.068 But normally,[br]when you're training a model, 0:21:10.069,0:21:14.244 it will reflect[br]whatever data you're training it on. 0:21:14.870,0:21:18.980 So that's something to account for[br]when doing the work, yeah. 0:21:21.498,0:21:23.194 (man 2) Hi, this is [Marco]. 0:21:23.195,0:21:25.960 I am a natural language[br]processing practitioner. 0:21:26.853,0:21:31.578 I was curious about[br]how you model your facts. 0:21:31.578,0:21:34.535 So I heard you say frame semantics, 0:21:34.535,0:21:35.557 Right. 0:21:35.557,0:21:38.875 (Marco) Could you maybe[br]give some more details on that, please? 0:21:40.053,0:21:46.510 Yes, so it's frame semantics,[br]we're using frame semantics, 0:21:46.510,0:21:49.642 and basically, 0:21:49.642,0:21:55.778 all of the facts in Wikidata,[br]they're modeled as frames. 0:21:56.291,0:21:58.801 And so that's an essential part[br]of the setup 0:21:58.811,0:22:00.027 and how we make this work. 0:22:00.028,0:22:03.770 That's essentially[br]how we try to address the...
0:22:03.771,0:22:06.680 How can I make all the knowledge[br]that I have in Wikidata 0:22:06.680,0:22:11.012 available in a context where[br]I can annotate and train my model 0:22:12.485,0:22:14.441 when I am annotating or parsing text? 0:22:14.442,0:22:19.806 It's that existing data[br]in Wikidata is modeled as frames. 0:22:19.806,0:22:21.007 So the store that we have, 0:22:21.008,0:22:24.041 the knowledge base with[br]all of the knowledge, is a frame store, 0:22:24.042,0:22:27.251 and this is the same frame store[br]that we are building on top of 0:22:27.251,0:22:29.521 when we're then parsing the text. 0:22:29.522,0:22:34.024 (Marco) So you're converting[br]the Wikidata data model into some frames. 0:22:34.551,0:22:36.703 Yes, we are converting the Wikidata model 0:22:36.704,0:22:39.871 into one large frame store[br]if you want, yeah. 0:22:40.558,0:22:43.605 (man 3) Thanks. Is Pluto a planet? 0:22:44.394,0:22:47.226 (audience laughing) 0:22:47.227,0:22:48.227 Can I get the question... 0:22:48.228,0:22:51.561 (man 3) I like the bootstrapping thing[br]that you are doing, 0:22:51.562,0:22:53.402 I mean the way[br]that you're training your model 0:22:53.403,0:22:57.726 by picking out the known facts[br]about things that are verified, 0:22:57.727,0:23:00.666 and then training[br]the plausibility prediction 0:23:00.667,0:23:03.681 by trying to teach[br]the architecture of the system 0:23:03.682,0:23:06.481 to recognize that actually,[br]that fact fits. 0:23:06.482,0:23:13.464 So that will work for large classes,[br]but it will really... 0:23:13.464,0:23:15.744 It doesn't sound like it will learn[br]about surprises 0:23:15.745,0:23:18.677 and especially not[br]in small classes of items, right. 0:23:18.677,0:23:20.841 So if you train your model in... 0:23:20.842,0:23:23.481 When did Pluto disappear, I forgot... 0:23:23.482,0:23:24.482 As a planet, you mean. 0:23:24.483,0:23:26.900 (man 3) Yeah, it used to be[br]a member of the solar system 0:23:26.900,0:23:29.437 and we had how many,[br]nine observations there. 0:23:29.437,0:23:31.167 - Yeah.[br]- (man 3) It's slightly problematic. 0:23:31.168,0:23:33.514 So everyone, the kids think[br]that Pluto is not a planet, 0:23:33.515,0:23:36.039 I still think it's a planet,[br]but never mind. 0:23:36.040,0:23:42.320 So the fact that it suddenly[br]stopped being a planet, 0:23:42.321,0:23:45.521 which was supported in the period before,[br]I don't know, hundreds of years, right? 0:23:47.150,0:23:50.161 That's crazy, how would you go[br]about figuring out that thing? 0:23:50.162,0:23:53.595 For example, that the new claim[br]is not plausible for that thing. 0:23:53.595,0:23:55.886 Sure. So there are two things. 0:23:55.887,0:23:59.430 So there's both like how precise[br]the plausibility model is. 0:23:59.431,0:24:02.086 So what it distinguishes between[br]is random facts 0:24:02.087,0:24:03.600 and facts that are plausible. 0:24:04.105,0:24:06.600 And there's also the question[br]of whether Pluto is a planet 0:24:06.601,0:24:09.241 and that's back to whether... 0:24:09.242,0:24:10.339 I was in another session 0:24:10.340,0:24:14.060 where someone brought up the example[br]of the earth being flat, 0:24:14.060,0:24:16.547 - whether that is a fact or not.[br]- (man 3) That makes sense. 0:24:16.548,0:24:18.508 So it is a fact in a sense[br]that you can put it in, 0:24:18.509,0:24:19.950 I guess you could put it in Wikidata 0:24:19.951,0:24:22.031 with sources that are claiming[br]that that's the thing.
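As a toy illustration of what modeling items, facts, and mentions as frames in a single store can look like: the actual SLING frame representation is richer, so the structure and names in this sketch are assumptions made purely for illustration.

```python
# Toy frame store: every item, fact, and mention is a frame, i.e. a set of
# (role, value) slots, where a value can point to another frame by its id.
class FrameStore:
    def __init__(self):
        self.frames = {}

    def add(self, fid, **slots):
        self.frames[fid] = dict(slots)
        return fid

    def resolve(self, ref):
        """Follow a reference to its frame, or return it unchanged."""
        return self.frames.get(ref, ref)

store = FrameStore()
store.add("Q5", name="human")
store.add("Q567", name="Angela Merkel", P31="Q5")  # instance of: human
# A mention found while parsing a document is itself a frame that
# evokes the item frame it refers to.
store.add("mention-1", text="Merkel", evokes="Q567")

item = store.resolve(store.resolve("mention-1")["evokes"])
print(item["name"])  # Angela Merkel
```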
0:24:22.032,0:24:26.561 So again, you would not necessarily[br]want to train the model in a way 0:24:26.562,0:24:30.721 where if you read someone saying[br]the planet Pluto, bla, bla, bla, 0:24:30.722,0:24:33.561 then it should be fine for it 0:24:33.562,0:24:36.561 to then say that[br]an annotation for this text 0:24:36.562,0:24:38.200 is that Pluto is a planet. 0:24:39.509,0:24:41.432 That doesn't mean, you know... 0:24:42.120,0:24:46.918 The model won't be able to tell[br]what "in the end" is the truth, 0:24:46.919,0:24:49.214 I don't think any of us here[br]will be able to either, so... 0:24:49.214,0:24:50.285 (man 3) I just want to say 0:24:50.285,0:24:52.775 it's not a hard accusation[br]against the approach 0:24:52.776,0:24:56.028 because even people[br]cannot be sure whether that's a fact, 0:24:56.029,0:24:58.214 whether a new fact is plausible[br]at that moment. 0:24:58.730,0:24:59.730 But that's always... 0:24:59.731,0:25:03.386 I just maybe reiterated a question[br]that I am posing all the time 0:25:03.387,0:25:05.750 to myself and my work; I always ask. 0:25:06.311,0:25:09.267 We do the statistical learning thing,[br]it's amazing nowadays 0:25:09.268,0:25:13.585 we can do billions of things,[br]but we cannot learn about surprises, 0:25:13.586,0:25:16.840 and they are[br]very, very important in fact, right? 0:25:17.595,0:25:20.711 - (man 4) But, just to refute...[br]- (man 3) Thank you. 0:25:22.567,0:25:26.551 (man 4) The plausibility model[br]is combined with kind of two extra rules. 0:25:26.551,0:25:30.361 First of all,[br]if it's in Wikidata, it's true. 0:25:30.362,0:25:34.635 We just give you the benefit of the doubt,[br]so please make it good. 0:25:34.636,0:25:39.261 The second thing is if it's not[br]allowed by the schema, it's false; 0:25:39.770,0:25:42.504 it's all the things in between[br]we're looking at. 0:25:43.436,0:25:50.366 So if it's a planet according to Wikidata,[br]it will be a true fact. 0:25:53.130,0:25:57.406 But it won't predict surprises,[br]but what is important here 0:25:57.407,0:26:01.814 is that there's kind of[br]no manual human work involved, 0:26:01.814,0:26:03.629 so there's nothing[br]that prevents you from... 0:26:03.629,0:26:05.936 Well, now, if we're successful[br]with the approach, 0:26:05.937,0:26:09.019 there's nothing that prevents him[br]from continuously updating the model 0:26:09.019,0:26:12.483 with changes happening[br]in Wikidata and Wikipedia and so on. 0:26:12.484,0:26:18.128 So in theory, you should be able[br]to quickly learn new surprises. 0:26:18.129,0:26:19.657 (moderator) One last question. 0:26:20.223,0:26:23.157 - (man 4) Maybe we're biased by Wikidata.[br]- Yeah. 0:26:23.683,0:26:27.561 (man 4) You are our bias.[br]Whatever you annotate is what we believe. 0:26:27.562,0:26:31.701 So if you make it good,[br]if you make it balanced, 0:26:31.702,0:26:33.953 we can hopefully be balanced. 0:26:33.954,0:26:39.365 With the gender thing,[br]there's actually an interesting thing. 0:26:39.951,0:26:42.299 We are actually getting[br]more training facts 0:26:42.300,0:26:43.649 about women than men 0:26:43.650,0:26:48.954 because "she" is a much less[br]ambiguous pronoun in the text, 0:26:48.954,0:26:51.600 so we actually get a lot more[br]true facts about women. 0:26:51.600,0:26:55.189 So we are biased, but on the women's side. 0:26:56.241,0:26:58.924 (woman 2) No, I want to see[br]the data on that. 0:26:58.925,0:27:00.471 (audience laughing) 0:27:00.471,0:27:02.381 We should bring that along next time.
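The two extra rules described here, combined with the plausibility model, amount to a simple decision procedure. A minimal sketch follows, with all three inputs (the known facts, the schema check, and the trained model) treated as assumed interfaces rather than actual SLING components.

```python
def judge_fact(item_qid, fact, known_facts, schema_allows, plausibility):
    """Score a candidate (property, value) fact for an item.

    known_facts: dict mapping item QID -> set of (property, value) pairs
    schema_allows: callable(item_qid, fact) -> bool, e.g. property constraints
    plausibility: callable(item_qid, fact) -> score in [0.0, 1.0]
    """
    if fact in known_facts.get(item_qid, set()):
        return 1.0  # already in Wikidata: given the benefit of the doubt, treated as true
    if not schema_allows(item_qid, fact):
        return 0.0  # not allowed by the schema: treated as false
    return plausibility(item_qid, fact)  # everything in between: ask the model
```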
0:27:02.381,0:27:04.945 (man 4) You get hard decision [inaudible]. 0:27:04.945,0:27:06.285 (man 3) Yes, hard decision. 0:27:07.885,0:27:13.001 (man 5) It says SLING is...[br]a parser across many languages 0:27:13.002,0:27:15.163 - and you showed us English.[br]- Yes! 0:27:15.163,0:27:17.934 (man 5) Can you say something about[br]the number of languages that you are-- 0:27:17.934,0:27:19.155 Yes! Thank you for asking. 0:27:19.155,0:27:21.602 I had told myself to say that[br]up front on the first page 0:27:21.602,0:27:23.363 because otherwise,[br]I would forget, and I did. 0:27:24.742,0:27:25.742 So right now, 0:27:25.743,0:27:29.876 we're not actually looking at two files,[br]we're looking at 13 files. 0:27:29.877,0:27:32.768 So Wikipedia dumps[br]from 12 different languages 0:27:32.769,0:27:35.801 that we're processing, 0:27:35.802,0:27:41.483 and none of this is dependent[br]on the language being English. 0:27:41.484,0:27:44.280 So we're processing this[br]for all of the 12 languages. 0:27:48.238,0:27:49.238 Yeah. 0:27:49.239,0:27:50.239 For now, 0:27:50.240,0:27:56.617 they share the property of, I think,[br]using the Latin alphabet, and so on. 0:27:56.617,0:27:58.601 Mostly for us to be able to make sure 0:27:58.602,0:28:02.121 that what we are doing[br]still makes sense and works. 0:28:02.121,0:28:04.961 But there's nothing[br]fundamental about the approach 0:28:04.962,0:28:09.869 that prevents it from being used[br]in very different languages 0:28:09.870,0:28:14.656 from those being spoken around this area. 0:28:17.275,0:28:19.321 (woman 3) Leila from Wikimedia Foundation. 0:28:19.322,0:28:21.850 I may have missed this[br]when you presented this. 0:28:22.904,0:28:28.385 Do you make an attempt to bring[br]any references from Wikipedia articles 0:28:28.386,0:28:32.433 back to the property and statements[br]you're making in Wikidata? 0:28:33.357,0:28:37.222 So I briefly mentioned this[br]as a potential application. 0:28:37.222,0:28:40.352 So for now, what we're trying to do[br]is just to get this to work, 0:28:41.156,0:28:46.005 but let's say we did get it to work[br]with a high level of quality, 0:28:46.622,0:28:51.240 that would be an obvious thing[br]to try to do, so when you... 0:28:52.811,0:28:55.187 Let's say you were willing to... 0:28:55.187,0:28:59.590 I know there's some controversy around[br]using Wikipedia as a source for Wikidata, 0:28:59.590,0:29:01.957 that you can't have[br]circular references and so on, 0:29:01.957,0:29:04.849 so you need to have[br]properly sourced facts. 0:29:04.850,0:29:07.420 So let's say you were[br]coming up with new facts, 0:29:07.421,0:29:14.307 and obviously, you could look[br]at the coverage of news media and so on 0:29:14.308,0:29:16.220 and process these[br]and try to annotate these. 0:29:16.221,0:29:19.522 And then, that way,[br]find sources for facts, 0:29:19.523,0:29:20.964 new facts that you come up with. 0:29:20.965,0:29:22.326 Or you could even take existing... 0:29:22.327,0:29:25.901 There are a lot of facts in Wikidata[br]that either have no sources 0:29:25.901,0:29:29.641 or only have Wikipedia as a source,[br]so you can start processing these 0:29:29.642,0:29:32.802 and try to find sources[br]for those automatically. 0:29:33.545,0:29:38.198 (Leila) Or even within the articles[br]that you're taking this information from 0:29:38.199,0:29:41.879 just using the sources from there[br]because they may contain... 0:29:42.383,0:29:44.329 - Yeah. Yeah.[br]- Yeah. Thanks. 0:29:47.428,0:29:49.315 - (moderator) Thanks Anders.[br]- Cool. Thanks.
0:29:49.919,0:29:55.345 (applause)