0:00:05.961,0:00:08.133 (moderator) The next talk is[br]by Anders Sandholm 0:00:08.133,0:00:12.319 on Wikidata fact annotation[br]for Wikipedia across languages. 0:00:12.319,0:00:13.920 - Thank you.[br]- Thanks. 0:00:21.905,0:00:24.164 I wanted to start with a small confession. 0:00:26.428,0:00:31.687 Wow! I'm blown away[br]by the momentum of Wikidata 0:00:33.799,0:00:35.909 and the engagement of the community. 0:00:37.230,0:00:38.670 I am really excited about being here 0:00:38.671,0:00:42.296 and getting a chance to talk[br]about work that we've been doing. 0:00:42.914,0:00:47.398 This is joint work with Michael,[br]who's also here in the third row. 0:00:49.551,0:00:51.921 But before I dive more into this, 0:00:51.922,0:00:55.515 this wouldn't be[br]a Google presentation without an ad, 0:00:56.102,0:00:58.196 so you get that up front. 0:00:58.196,0:01:01.242 This is what I'll be talking about,[br]our project, the SLING project. 0:01:02.255,0:01:06.640 It is an open source project[br]and it's using Wikidata a lot. 0:01:08.020,0:01:11.721 You can go check it out on GitHub[br]when you get a chance 0:01:11.722,0:01:15.960 if you feel excited about it[br]after the presentation. 0:01:18.215,0:01:23.493 And really, what I wanted to talk about--[br]the title is admittedly a little bit long, 0:01:23.494,0:01:25.797 it's even shorter than it was[br]in the original program. 0:01:25.798,0:01:29.704 But what it comes down to,[br]what the project comes down to 0:01:29.704,0:01:33.617 is trying to answer[br]this one very exciting question. 0:01:34.810,0:01:38.218 If you want, in the beginning,[br]there were just two files, 0:01:39.914,0:01:41.400 some of you may recognize them, 0:01:42.416,0:01:45.953 they're essentially the dump files[br]from Wikidata and Wikipedia, 0:01:47.234,0:01:50.280 and the question we're trying[br]to figure out or answer is really, 0:01:51.570,0:01:54.423 can we dramatically improve[br]how good machines are 0:01:54.424,0:01:58.062 at understanding human language[br]just by using these files? 0:02:00.900,0:02:04.158 And of course, you're entitled to ask 0:02:04.158,0:02:06.191 whether that's an interesting[br]question to answer. 0:02:07.450,0:02:14.344 If you're a company that [inaudible][br]is to be able to take search queries 0:02:14.344,0:02:17.656 and try to answer them[br]in the best possible way, 0:02:18.460,0:02:23.989 obviously, understanding natural language[br]comes in as a very handy thing. 0:02:25.317,0:02:27.914 But even if you look at Wikidata, 0:02:29.109,0:02:33.843 in the previous data quality panel[br]earlier today, 0:02:33.843,0:02:39.070 there was a question that came up about[br]verification, or verifiability of facts. 0:02:39.070,0:02:42.623 So let's say you actually do[br]understand natural language. 0:02:42.623,0:02:47.304 If you have a fact and there's a source,[br]you could go to the source and analyze it, 0:02:47.304,0:02:49.721 and you can figure out whether[br]it actually confirms the fact 0:02:49.722,0:02:52.282 that claims[br]to have it as a source. 0:02:53.459,0:02:55.540 And if you could do that,[br]you could even go beyond it 0:02:55.541,0:02:59.723 and you could read articles[br]and annotate them, come up with facts, 0:02:59.723,0:03:03.478 and actually look for existing facts[br]that may need sources 0:03:03.479,0:03:06.109 and add these articles as sources.
0:03:07.110,0:03:11.371 Or, you know, in the wildest,[br]craziest possible of all worlds, 0:03:11.371,0:03:13.756 if you get really, really good at it[br]you could read articles 0:03:13.756,0:03:18.243 and maybe even annotate with new facts[br]that you could then suggest as facts 0:03:18.244,0:03:19.965 that you could potentially[br]add to Wikidata. 0:03:20.595,0:03:27.025 But there's a whole world of applications[br]of natural language understanding. 0:03:28.895,0:03:32.478 One of the things that's really hard when[br]you do natural language understanding-- 0:03:32.479,0:03:35.595 these days, that also means[br]deep learning or machine learning, 0:03:35.596,0:03:39.537 and one of the things that's really hard[br]is getting enough training data. 0:03:39.537,0:03:42.812 And historically,[br]that's meant having a lot of text 0:03:42.812,0:03:45.441 that you need human annotators[br]to then first process 0:03:45.442,0:03:46.801 and then you can do training. 0:03:46.802,0:03:51.184 And part of the question here[br]is also really to say: 0:03:51.184,0:03:55.930 Can we use Wikidata and the way[br]in which it's interlinked with Wikipedia 0:03:57.012,0:03:58.012 for training data, 0:03:58.013,0:04:00.600 and will that be enough[br]to train that model? 0:04:03.429,0:04:06.517 So hopefully, we'll get closer[br]to answering this question 0:04:06.518,0:04:09.289 in the next 15 to 20 minutes. 0:04:10.271,0:04:14.071 We don't quite know the answer yet[br]but we have some exciting results 0:04:14.072,0:04:16.992 that are pointing[br]in the right direction, if you want. 0:04:19.387,0:04:23.798 Just to take a step back in terms of[br]the development we've seen, 0:04:24.445,0:04:28.450 machine learning and deep learning[br]have revolutionized a lot of areas 0:04:28.450,0:04:32.431 and this is just one example[br]of a particular image recognition task 0:04:32.432,0:04:37.343 where if you look at what happened[br]between 2010 and 2015, 0:04:37.344,0:04:40.881 in that five-year period,[br]we went from machines doing pretty poorly 0:04:40.882,0:04:44.921 to, in the end, actually performing[br]at the same level as humans 0:04:44.922,0:04:48.804 or in some cases even better,[br]albeit for a very specific task. 0:04:50.224,0:04:55.515 So we've seen really a lot of things[br]improving dramatically. 0:04:56.221,0:04:57.881 And so you can ask 0:04:57.882,0:05:02.440 why don't we just throw deep learning[br]at natural language processing 0:05:02.440,0:05:04.600 and natural language understanding[br]and be done with it? 0:05:05.497,0:05:11.532 And the answer is kind of[br]that we sort of have, to a certain extent, 0:05:11.532,0:05:14.367 but what it turns out is that 0:05:15.005,0:05:17.725 natural language understanding[br]is actually still a bit of a challenge 0:05:17.726,0:05:23.281 and one of the situations where[br]a lot of us interact with machines 0:05:23.282,0:05:25.803 that are trying to behave like[br]they understand what we're saying 0:05:25.804,0:05:26.804 is in these chat bots. 0:05:26.805,0:05:28.605 So this is not to pick[br]on anyone in particular 0:05:28.606,0:05:31.991 but just, I think, an experience[br]that a lot of us have had. 0:05:31.992,0:05:36.841 In this case, it's a user saying[br]I want to stay in this place. 0:05:36.842,0:05:41.766 The chat bot says: "OK, got it,[br]when will you be checking in and out? 0:05:41.766,0:05:44.488 For example, November 17th to 23rd." 0:05:44.488,0:05:46.620 And the user says:[br]"Well, I don't have any dates yet."
0:05:46.620,0:05:47.681 And then the response is: 0:05:47.682,0:05:51.050 "Sorry, there are no hotels available[br]for the dates you've requested. 0:05:51.050,0:05:52.571 Would you like to start a new search?" 0:05:53.212,0:05:55.041 So there's still some way to go 0:05:55.862,0:05:58.755 to get machines to really[br]understand human language. 0:05:59.817,0:06:03.761 But machine learning or deep learning 0:06:03.762,0:06:06.786 has been applied[br]already to this discipline. 0:06:06.787,0:06:09.721 Like, one of the examples is a recent... 0:06:09.722,0:06:11.232 a more successful example is BERT 0:06:11.233,0:06:17.316 where they're using transformers[br]to solve NLP or NLU tasks. 0:06:18.800,0:06:22.157 And it's dramatically improved[br]the performance but, as we've seen, 0:06:22.157,0:06:23.560 there is still some way to go. 0:06:25.150,0:06:27.857 One thing that's shared among[br]most of these approaches 0:06:27.858,0:06:31.785 is that you look at the text itself 0:06:31.785,0:06:36.629 and you depend on having a lot of it[br]so you can train your model on the text, 0:06:36.629,0:06:39.761 but everything is based[br]on just looking at the text 0:06:39.762,0:06:41.675 and understanding the text. 0:06:41.675,0:06:45.727 So the learning is really[br]just representation learning. 0:06:45.727,0:06:50.653 What we wanted to do is actually[br]understand and annotate the text 0:06:50.653,0:06:54.006 in terms of items[br]or entities in the real world. 0:06:56.384,0:06:59.537 And in general, if we take a step back, 0:07:00.077,0:07:03.441 why is natural language processing[br]or understanding so hard? 0:07:03.442,0:07:07.659 There are a number of reasons[br]why it's really hard, but at the core, 0:07:07.659,0:07:11.041 one of the important reasons[br]is that somehow, 0:07:11.042,0:07:13.225 the machine needs to have[br]knowledge of the world 0:07:13.226,0:07:16.867 in order to understand human language. 0:07:19.569,0:07:22.456 And you think about that[br]for a little while. 0:07:23.074,0:07:26.654 What better place to look for knowledge[br]about the world than Wikidata? 0:07:27.318,0:07:29.625 So in essence, that's the approach. 0:07:29.625,0:07:31.985 And the question is can you leverage it, 0:07:31.985,0:07:38.877 can you use this wonderful knowledge 0:07:38.878,0:07:40.601 of the world that we already have 0:07:40.602,0:07:45.617 in a way that you can help[br]to train and bootstrap your model. 0:07:47.390,0:07:51.121 So the alternative here is really[br]understanding the text 0:07:51.122,0:07:55.439 not just in terms of other texts[br]or how this text is similar to other texts 0:07:55.439,0:07:59.104 but in terms of the existing knowledge[br]that we have about the world. 0:08:01.164,0:08:02.704 And what makes me really excited 0:08:02.705,0:08:05.905 or at least makes me[br]have a good gut feeling about this 0:08:05.906,0:08:07.372 is that in some ways 0:08:07.373,0:08:10.780 it seems closer[br]to how we interact as humans. 0:08:10.780,0:08:13.795 So if we were having a conversation 0:08:13.795,0:08:17.847 and you were bringing up[br]the Bundeskanzler and Angela Merkel, 0:08:18.662,0:08:23.173 I would have an internal representation[br]of Q567 and it would light up. 0:08:23.173,0:08:25.521 And in our continued conversation, 0:08:25.522,0:08:29.615 mentioning other things[br]related to Angela Merkel, 0:08:29.616,0:08:31.762 I would have an easier time[br]associating with that 0:08:31.763,0:08:33.920 or figuring out[br]what you were actually talking about. 
0:08:35.027,0:08:38.919 And so, in essence,[br]that's at the heart of this approach, 0:08:38.919,0:08:42.100 that we really believe[br]Wikidata is a key component 0:08:42.101,0:08:45.809 in unlocking this better understanding[br]of natural language. 0:08:49.732,0:08:51.448 And so how are we planning to do it? 0:08:52.557,0:08:56.797 Essentially, there are five steps[br]we're going through, 0:08:56.798,0:08:58.080 or have been going through. 0:08:58.788,0:09:02.841 I'll go over each[br]of the steps briefly in turn 0:09:02.841,0:09:04.410 but essentially, there are five steps. 0:09:04.410,0:09:07.120 First, we need to start[br]with the dump files that I showed you 0:09:07.120,0:09:08.120 to begin with-- 0:09:08.706,0:09:11.149 understanding what's in them,[br]parsing them, 0:09:11.149,0:09:13.397 having an efficient[br]internal representation in memory 0:09:13.397,0:09:15.716 that allows us to do[br]quick processing on this. 0:09:16.225,0:09:18.502 And then, we're leveraging[br]some of the annotations 0:09:18.503,0:09:22.605 that are already in Wikipedia,[br]linking it to items in Wikidata. 0:09:22.605,0:09:25.462 I'll briefly show you what I mean by that. 0:09:25.462,0:09:31.001 We can use that to then[br]generate more advanced annotations 0:09:31.973,0:09:34.549 where we have much more text annotated. 0:09:34.549,0:09:40.333 But still, with annotations[br]being items or facts in Wikidata, 0:09:40.334,0:09:43.717 we can then train a model[br]based on the silver data 0:09:43.717,0:09:46.212 and get a reasonably good model 0:09:46.212,0:09:49.047 that will allow us to read[br]a Wikipedia document 0:09:49.047,0:09:53.308 and understand what the actual content is[br]in terms of Wikidata, 0:09:54.613,0:09:57.580 but only for facts that are[br]already in Wikidata. 0:09:58.523,0:10:02.367 And so that's where kind of[br]the hard part of this begins. 0:10:02.367,0:10:06.100 In order to go beyond that[br]we need to have a plausibility model, 0:10:06.100,0:10:07.641 so a model that can tell us, 0:10:07.642,0:10:10.881 given a lot of facts about an item[br]and an additional fact, 0:10:10.882,0:10:12.627 whether the additional fact is plausible. 0:10:13.191,0:10:14.296 If we can build that, 0:10:14.892,0:10:21.831 we can then use a more "hyper modern"[br]reinforcement learning aspect 0:10:21.832,0:10:26.033 of deep learning and machine learning[br]to fine-tune the model 0:10:26.033,0:10:30.303 and hopefully go beyond[br]what we've been able to so far. 0:10:31.933,0:10:32.933 So real quick, 0:10:32.934,0:10:36.632 the first step is essentially[br]getting the dump files parsed, 0:10:36.632,0:10:41.021 understanding the contents, and linking up[br]Wikidata and Wikipedia information, 0:10:41.022,0:10:44.416 and then utilizing some of the annotations[br]that are already there. 0:10:45.547,0:10:49.304 And so this is essentially[br]what's happening. 0:10:49.305,0:10:51.959 Trust me, Michael built all of this,[br]it's working great. 0:10:52.701,0:10:55.621 But essentially, we're starting[br]with the two files you can see on the top, 0:10:55.622,0:10:58.244 the Wikidata dump and the Wikipedia dump. 0:10:58.245,0:11:02.413 The Wikidata dump gets processed[br]and we end up with a knowledge base, 0:11:02.413,0:11:04.376 a KB at the bottom. 
0:11:04.377,0:11:07.335 That's essentially a store[br]we can hold in memory 0:11:07.336,0:11:10.439 that has essentially all of Wikidata in it 0:11:10.440,0:11:13.841 and we can quickly access[br]all the properties and facts and so on 0:11:13.841,0:11:15.163 and do analysis there. 0:11:15.164,0:11:16.414 Similarly, for the documents, 0:11:16.415,0:11:18.486 they get processed[br]and we end up with documents 0:11:19.274,0:11:21.911 that have been processed. 0:11:21.912,0:11:23.544 We know all the mentions 0:11:23.545,0:11:26.838 and some of the things[br]that are already in the documents. 0:11:26.839,0:11:27.839 And then in the middle, 0:11:27.840,0:11:30.093 we have an important part[br]which is a phrase table 0:11:30.094,0:11:33.081 that allows us to basically[br]see for any phrase 0:11:34.096,0:11:35.753 what is the frequency distribution, 0:11:35.754,0:11:39.481 what's the most likely item[br]that we're referring to 0:11:39.481,0:11:41.165 when we're using this phrase. 0:11:41.165,0:11:44.445 So we're using that later on[br]to build the silver annotations. 0:11:44.446,0:11:48.001 So let's say we've run this[br]and then we also want to make sure 0:11:48.002,0:11:51.691 we utilize annotations[br]that are already there. 0:11:51.692,0:11:54.112 So an important part[br]of a Wikipedia article 0:11:54.113,0:11:57.841 is that it's not just plain text, 0:11:57.842,0:12:01.007 it's actually already[br]pre-annotated with a few things. 0:12:01.008,0:12:04.046 So a template is one example,[br]links is another example. 0:12:04.047,0:12:08.017 So if we take here the English article[br]for Angela Merkel, 0:12:09.387,0:12:12.301 there is one example of a link here[br]which is to her party. 0:12:12.302,0:12:13.772 If you look at the bottom, 0:12:13.773,0:12:16.426 that's a link to a specific[br]Wikipedia article, 0:12:16.427,0:12:20.155 and I guess for people here,[br]it's no surprise that, in essence, 0:12:20.156,0:12:23.360 that is then, if you look[br]at the associated Wikidata item, 0:12:23.361,0:12:25.801 that's essentially an annotation saying 0:12:25.802,0:12:31.453 this is the QID I am talking about[br]when I'm talking about this party, 0:12:31.453,0:12:32.820 the Christian Democratic Union. 0:12:33.951,0:12:37.281 So we're using this[br]to already have a good start 0:12:37.282,0:12:39.326 in terms of understanding what text means. 0:12:39.327,0:12:40.327 All of these links, 0:12:40.328,0:12:43.983 we know exactly what the author[br]means with the phrase 0:12:44.504,0:12:47.040 in the cases where[br]there are links to QIDs. 0:12:48.234,0:12:53.303 We can use this and the phrase table[br]to then try and take a Wikipedia document 0:12:53.304,0:12:58.760 and fully annotate it with everything[br]we know about already from Wikidata. 0:12:59.659,0:13:02.753 And we can use this to train[br]the first iteration of our model. 0:13:03.933,0:13:04.933 (coughs) Excuse me. 0:13:04.934,0:13:07.876 So this is exactly the same article, 0:13:08.400,0:13:13.566 but now, after we've annotated it[br]with silver annotations, 0:13:14.673,0:13:18.441 and essentially,[br]you can see all of the squares 0:13:18.442,0:13:24.530 are places where we've been able[br]to annotate with QIDs or with facts. 0:13:26.362,0:13:30.681 This is just a screenshot[br]of the viewer on the data, 0:13:30.682,0:13:34.281 so you can have access[br]to all of this information 0:13:34.282,0:13:37.577 and see what's come out[br]of the silver annotation. 
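To make the phrase table idea described above a bit more concrete, here is a minimal Python sketch of how anchor texts from Wikipedia links could be aggregated into a phrase-to-item frequency table. The input pairs, all QIDs other than Q567 (Angela Merkel), and the function names are illustrative assumptions, not SLING's actual data or API.

```python
from collections import Counter, defaultdict

# Hypothetical input: (anchor text, target QID) pairs harvested from
# Wikipedia wikilinks whose target article maps to a Wikidata item.
# QIDs other than Q567 (Angela Merkel) are illustrative placeholders.
links = [
    ("Angela Merkel", "Q567"),
    ("Merkel", "Q567"),
    ("Merkel", "Q567"),
    ("Christian Democratic Union", "Q49762"),
]

# Phrase table: for each phrase, a frequency distribution over the
# Wikidata items that the phrase has been observed to link to.
phrase_table = defaultdict(Counter)
for anchor, qid in links:
    phrase_table[anchor.lower()][qid] += 1

def most_likely_item(phrase):
    """Return the item this phrase most frequently links to, if any."""
    counts = phrase_table.get(phrase.lower())
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_item("Merkel"))  # -> Q567
```

In the talk, a table of this kind is what the later silver-annotation step relies on when it has to guess an item for a phrase that is not itself an explicit link.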
0:13:37.577,0:13:41.364 And it's important to say that[br]there's no machine learning 0:13:41.365,0:13:42.678 or anything involved here. 0:13:42.679,0:13:46.007 All we've done is sort of[br]mechanically, with a few tricks, 0:13:46.515,0:13:49.709 basically pushed information[br]we already have from Wikidata 0:13:49.710,0:13:52.760 onto the Wikipedia article. 0:13:53.328,0:13:56.202 And so here, if you hover over[br]"Chancellor of Germany", 0:13:56.202,0:14:01.973 that is itself referring[br]to a Wikidata item, 0:14:01.974,0:14:04.972 which has a number of properties[br]like "subclass of: Chancellor" 0:14:04.972,0:14:08.658 and "country: Germany",[br]which again refer to other items. 0:14:08.659,0:14:11.732 And here, it also has[br]the property "officeholder" 0:14:12.473,0:14:15.496 which happens to be[br]Angela Dorothea Merkel, 0:14:15.497,0:14:17.051 who is also mentioned in the text. 0:14:17.052,0:14:22.137 So there's really a full annotation[br]linking up the contents here. 0:14:24.645,0:14:27.429 But again, there is an important[br]and unfortunate point 0:14:27.430,0:14:31.563 about what we are able to[br]and not able to do here. 0:14:31.564,0:14:35.342 So what we are doing is pushing[br]information we already have in Wikidata, 0:14:35.342,0:14:40.169 so what we can't annotate here[br]are things that are not in Wikidata. 0:14:40.169,0:14:41.681 So for instance, here, 0:14:41.682,0:14:44.910 she was at some point appointed[br]Federal Minister for Women and Youth 0:14:44.910,0:14:48.713 and that alias or that phrase[br]is not in Wikidata, 0:14:48.713,0:14:54.000 so we're not able to make that annotation[br]here in our silver annotations. 0:14:56.227,0:14:59.943 That said, it's still... at least for me, 0:14:59.944,0:15:02.625 it was pretty surprising to see[br]how much you can actually annotate 0:15:02.626,0:15:04.266 and how much information is already there 0:15:04.267,0:15:08.877 when you combine Wikidata[br]with a Wikipedia article. 0:15:08.878,0:15:15.321 So what you can do is, once you have this,[br]you know, millions of documents, 0:15:16.275,0:15:20.240 you can train your parser[br]based on the annotations that are there. 0:15:21.134,0:15:26.968 And that's essentially a parser[br]that has a number of components. 0:15:26.969,0:15:30.481 Essentially, the text is coming in[br]at the bottom and at the top, 0:15:30.482,0:15:33.722 we have a transition-based[br]frame semantic parser 0:15:33.723,0:15:39.154 that then generates the annotations[br]or these facts or references to the items. 0:15:40.617,0:15:44.987 We built this and ran it[br]on more classical corpora 0:15:44.987,0:15:49.611 like [inaudible],[br]which are more classical NLP corpora, 0:15:49.611,0:15:53.800 but we want to be able to run this[br]on the full Wikipedia corpora. 0:15:53.800,0:15:57.201 So Michael has been rewriting this in C++ 0:15:57.202,0:15:59.932 and we're able to really[br]scale up performance 0:15:59.932,0:16:01.101 of the parser trainer here. 0:16:01.102,0:16:03.594 So it will be exciting to see exactly 0:16:03.595,0:16:05.830 the results that are going[br]to come out of that.
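As a rough, self-contained sketch of the kind of "mechanical" silver annotation described above, here is a dictionary lookup that pushes already-known phrase-to-item mappings onto plain text, with no learning involved. The phrase list and all QIDs apart from Q567 are illustrative, and the real pipeline does considerably more (templates, aliases, facts, and so on), so treat this only as an approximation of the idea.

```python
# Minimal silver-annotation pass: longest-match lookup of phrases that
# Wikipedia links already tell us how to resolve to Wikidata items.
# Phrases and QIDs below are illustrative placeholders (except Q567).
PHRASE_TO_QID = {
    "angela merkel": "Q567",
    "merkel": "Q567",
    "christian democratic union": "Q49762",
}

def silver_annotate(text, max_len=4):
    """Return (start_token, end_token, qid) mentions found by dictionary lookup."""
    tokens = text.lower().split()
    mentions, i = [], 0
    while i < len(tokens):
        hit = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest match first
            qid = PHRASE_TO_QID.get(" ".join(tokens[i:i + n]))
            if qid:
                hit = (i, i + n, qid)
                break
        if hit:
            mentions.append(hit)
            i = hit[1]
        else:
            i += 1
    return mentions

print(silver_annotate("Angela Merkel led the Christian Democratic Union"))
# -> [(0, 2, 'Q567'), (4, 7, 'Q49762')]
```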
0:16:08.638,0:16:10.263 So once that's in place, 0:16:10.264,0:16:13.459 we have a pretty good model[br]that's able to at least 0:16:13.459,0:16:16.051 predict facts that are[br]already known in Wikidata, 0:16:16.052,0:16:18.790 but ideally, we want to move beyond that, 0:16:18.790,0:16:20.703 and for that[br]we need this plausibility model 0:16:20.704,0:16:23.928 which, in essence,[br]you can think of as a black box 0:16:23.929,0:16:27.121 where you supply it with[br]all of the known facts you have 0:16:27.122,0:16:30.574 about a particular item[br]and then you provide an additional fact. 0:16:31.412,0:16:32.412 And by magic, 0:16:32.413,0:16:36.948 the black box tells you how plausible is[br]the additional fact that you're providing 0:16:36.949,0:16:40.396 and how plausible is it[br]that this particular item is a fact. 0:16:42.792,0:16:43.792 And... 0:16:45.733,0:16:48.582 I don't know if it's fair to say[br]that it was much to our surprise, 0:16:48.582,0:16:50.776 but at least, you can actually-- 0:16:50.776,0:16:52.905 In order to train a model, you need, 0:16:52.905,0:16:55.255 like we've seen earlier,[br]you need a lot of training data 0:16:55.256,0:16:57.880 and essentially, you can[br]use Wikidata as training data. 0:16:57.881,0:17:02.213 You serve it basically[br]all the facts for a given item 0:17:02.213,0:17:04.614 and then you mask or hold out one fact 0:17:04.615,0:17:08.566 and then you provide that as a fact[br]that it's supposed to predict. 0:17:09.238,0:17:10.718 And just using this as training data, 0:17:10.719,0:17:15.881 you can get a really really good[br]plausibility model, actually, 0:17:18.574,0:17:21.675 to the extent that I was hoping one day[br]to maybe be able to even use it 0:17:21.675,0:17:27.527 for discovering what you could call[br]accidental vandalism in Wikidata, 0:17:27.528,0:17:33.011 like a fact that's been added by accident[br]and really doesn't look like it's... 0:17:33.012,0:17:35.029 It doesn't fit with the normal topology 0:17:35.029,0:17:38.621 of facts or knowledge[br]in Wikidata, if you want. 0:17:41.058,0:17:43.761 But in this particular setup,[br]we need it for something else, 0:17:43.762,0:17:46.738 namely for doing reinforcement learning 0:17:47.951,0:17:50.805 so we can fine-tune the Wiki parser, 0:17:50.805,0:17:54.034 basically using the plausibility model[br]as a reward function. 0:17:54.035,0:17:59.576 So when you do the training,[br]you try to parse a Wikipedia document, 0:17:59.576,0:18:01.871 [inaudible] in Wikipedia[br]comes up with a fact, 0:18:01.871,0:18:04.281 and we check the fact[br]against the plausibility model 0:18:04.282,0:18:07.527 and use that as feedback[br]or as a reward function 0:18:08.198,0:18:09.601 in training the model. 0:18:09.602,0:18:12.708 And the big question here is then[br]can we learn to predict facts 0:18:12.709,0:18:15.000 that are not already in Wikidata. 0:18:15.800,0:18:22.300 And we hope and believe we can[br]but it's still not clear. 0:18:22.879,0:18:27.792 So this is essentially what we have been[br]and are planning to do. 0:18:27.792,0:18:31.223 There have been some[br]surprisingly good results 0:18:31.224,0:18:33.989 in terms of how far[br]you can get with silver annotations 0:18:33.990,0:18:35.720 and a plausibility model. 0:18:36.271,0:18:40.081 But in terms of[br]how far we are, if you want, 0:18:40.082,0:18:41.961 we sort of have[br]the infrastructure in place 0:18:41.962,0:18:44.480 to do the processing[br]and have everything efficiently in memory.
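To illustrate the two ideas above, generating plausibility training data by holding out one fact at a time, and then using the plausibility score as a reinforcement-learning reward, here is a hedged Python sketch. The properties shown (P31, P39, P102) are real Wikidata properties, but the item data, the negative-sampling scheme, and the function signatures are assumptions made for illustration, not the actual SLING implementation.

```python
import random

# Known facts per item as (property, value) pairs. Q4970706 and Q49762 are
# illustrative placeholder QIDs; P31/P39/P102 are real Wikidata properties.
ITEM_FACTS = {
    "Q567": [
        ("P31", "Q5"),        # instance of: human
        ("P39", "Q4970706"),  # position held (placeholder value)
        ("P102", "Q49762"),   # member of political party (placeholder value)
    ],
}

def masked_training_examples(item_facts):
    """Hold out each fact in turn: the remaining facts plus the held-out fact
    form a positive example; a randomly corrupted fact forms a (presumed)
    negative example. This is how Wikidata itself can serve as training data."""
    all_values = [v for facts in item_facts.values() for _, v in facts]
    for item, facts in item_facts.items():
        for i, held_out in enumerate(facts):
            context = facts[:i] + facts[i + 1:]
            yield item, context, held_out, 1
            corrupted = (held_out[0], random.choice(all_values))
            yield item, context, corrupted, 0

def plausibility(item, context, fact):
    """Stub for the trained plausibility model: returns a score in [0, 1]."""
    raise NotImplementedError

def parser_reward(item, context, predicted_fact):
    """Reward for a fact the parser extracts from text: the plausibility
    score, used as the reward signal when fine-tuning the parser."""
    return plausibility(item, context, predicted_fact)

for example in masked_training_examples(ITEM_FACTS):
    print(example)
```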
0:18:45.121,0:18:49.138 We have first instances[br]of silver annotations 0:18:49.139,0:18:53.041 and have a parser trainer in place[br]for the supervised learning 0:18:53.042,0:18:55.755 and an initial plausibility model. 0:18:55.756,0:19:00.400 But we're still pushing on those fronts[br]and very much looking forward 0:19:00.400,0:19:03.320 to seeing what comes out[br]of the very last bit. 0:19:07.786,0:19:10.309 And those were my words. 0:19:10.310,0:19:14.681 I'm very excited to see[br]what comes out of it 0:19:14.682,0:19:17.661 and it's been pure joy[br]to work with Wikidata. 0:19:17.662,0:19:19.513 It's been fun to see 0:19:19.514,0:19:23.917 how some of the things you come across[br]seemed wrong and then the next day, 0:19:23.918,0:19:24.958 you look, things are fixed 0:19:24.959,0:19:30.551 and it's really been amazing[br]to see the momentum there. 0:19:31.161,0:19:35.295 Like I said, the URL,[br]all the source code is on GitHub. 0:19:35.887,0:19:38.912 Our email addresses[br]were on the first slide, 0:19:38.913,0:19:42.582 so please do reach out[br]if you have questions or are interested 0:19:42.582,0:19:47.149 and I think we have time[br]for a couple questions now in case... 0:19:49.450,0:19:51.446 (applause) 0:19:51.447,0:19:52.447 Thanks. 0:19:55.583,0:19:59.400 (woman 1) Thank you for your presentation.[br]I do have a concern however. 0:19:59.401,0:20:05.441 The Wikipedia corpus[br]is known to be biased. 0:20:05.442,0:20:09.841 There's a very strong bias--[br]for example, fewer women, more men, 0:20:09.842,0:20:11.787 all sorts of other aspects in there. 0:20:11.787,0:20:15.201 So isn't this actually[br]also tainting the knowledge 0:20:15.202,0:20:19.471 that you are taking out of Wikipedia? 0:20:22.320,0:20:25.424 Well, there are two aspects[br]of the question. 0:20:25.425,0:20:28.591 There's both in the model[br]that we are then training, 0:20:28.591,0:20:32.495 you could ask how... let's just... 0:20:33.172,0:20:35.841 If you make it really simple[br]and say like: 0:20:35.842,0:20:41.204 Does it mean that the model[br]will then be worse 0:20:41.204,0:20:46.027 at predicting facts[br]about women than men, say, 0:20:46.027,0:20:50.416 or some other set of groups? 0:20:53.098,0:20:55.424 To begin with,[br]if you just look at the raw data, 0:20:55.425,0:21:00.529 it will reflect whatever the bias is[br]in the training data, so that's... 0:21:02.810,0:21:06.001 People work on this to try[br]and address that in the best possible way. 0:21:06.002,0:21:10.068 But normally,[br]when you're training a model, 0:21:10.069,0:21:14.244 it will reflect[br]whatever data you're training it on. 0:21:14.870,0:21:18.980 So that's something to account for[br]when doing the work, yeah. 0:21:21.498,0:21:23.194 (man 2) Hi, this is [Marco]. 0:21:23.195,0:21:25.960 I am a natural language[br]processing practitioner. 0:21:26.853,0:21:31.578 I was curious about[br]how you model your facts. 0:21:31.578,0:21:34.535 So I heard you say frame semantics, 0:21:34.535,0:21:35.557 Right. 0:21:35.557,0:21:38.875 (Marco) Could you maybe[br]give some more details on that, please? 0:21:40.053,0:21:46.510 Yes, so it's frame semantics,[br]we're using frame semantics, 0:21:46.510,0:21:49.642 and basically, 0:21:49.642,0:21:55.778 all of the facts in Wikidata,[br]they're modeled as frames. 0:21:56.291,0:21:58.801 And so that's an essential part[br]of the setup 0:21:58.811,0:22:00.027 and how we make this work. 0:22:00.028,0:22:03.770 That's essentially[br]how we try to address the...
0:22:03.771,0:22:06.680 How can I make all the knowledge[br]that I have in Wikidata 0:22:06.680,0:22:11.012 available in a context where[br]I can annotate and train my model 0:22:12.485,0:22:14.441 when I am annotating or parsing text? 0:22:14.442,0:22:19.806 It's that existing data[br]in Wikidata is modeled as frames. 0:22:19.806,0:22:21.007 So the store that we have, 0:22:21.008,0:22:24.041 the knowledge base with[br]all of the knowledge, is a frame store, 0:22:24.042,0:22:27.251 and this is the same frame store[br]that we are building on top of 0:22:27.251,0:22:29.521 when we're then parsing the text. 0:22:29.522,0:22:34.024 (Marco) So you're converting[br]the Wikidata data model into some frames. 0:22:34.551,0:22:36.703 Yes, we are converting the Wikidata model 0:22:36.704,0:22:39.871 into one large frame store[br]if you want, yeah. 0:22:40.558,0:22:43.605 (man 3) Thanks. Is Pluto a planet? 0:22:44.394,0:22:47.226 (audience laughing) 0:22:47.227,0:22:48.227 Can I get the question... 0:22:48.228,0:22:51.561 (man 3) I like the bootstrapping thing[br]that you are doing, 0:22:51.562,0:22:53.402 I mean the way[br]that you're training your model 0:22:53.403,0:22:57.726 by picking out the known facts[br]about things that are verified, 0:22:57.727,0:23:00.666 and then training[br]the plausibility prediction 0:23:00.667,0:23:03.681 by trying to teach[br]the architecture of the system 0:23:03.682,0:23:06.481 to recognize that actually,[br]that fact fits. 0:23:06.482,0:23:13.464 So that will work for large classes,[br]but it will really... 0:23:13.464,0:23:15.744 It doesn't sound like it will learn[br]about surprises 0:23:15.745,0:23:18.677 and especially not[br]in small classes of items, right. 0:23:18.677,0:23:20.841 So if you train your model in... 0:23:20.842,0:23:23.481 When did Pluto disappear, I forgot... 0:23:23.482,0:23:24.482 As a planet, you mean. 0:23:24.483,0:23:26.900 (man 3) Yeah, it used to be[br]a member of the solar system 0:23:26.900,0:23:29.437 and we had how many,[br]nine observations there. 0:23:29.437,0:23:31.167 - Yeah.[br]- (man 3) It's slightly problematic. 0:23:31.168,0:23:33.514 So everyone, the kids think[br]that Pluto is not a planet, 0:23:33.515,0:23:36.039 I still think it's a planet,[br]but never mind. 0:23:36.040,0:23:42.320 So the fact that it suddenly[br]stopped being a planet, 0:23:42.321,0:23:45.521 which was supported in the period before,[br]I don't know, hundreds of years, right? 0:23:47.150,0:23:50.161 That's crazy, how would you go[br]about figuring out that thing? 0:23:50.162,0:23:53.595 For example, that the new claim[br]is not plausible for that thing. 0:23:53.595,0:23:55.886 Sure. So there are two things. 0:23:55.887,0:23:59.430 So there's both like how precise[br]the plausibility model is. 0:23:59.431,0:24:02.086 So what it distinguishes between[br]is random facts 0:24:02.087,0:24:03.600 and facts that are plausible. 0:24:04.105,0:24:06.600 And there's also the question[br]of whether Pluto is a planet 0:24:06.601,0:24:09.241 and that's back to whether... 0:24:09.242,0:24:10.339 I was in another session 0:24:10.340,0:24:14.060 where someone brought up the example[br]of the earth being flat, 0:24:14.060,0:24:16.547 - whether that is a fact or not.[br]- (man 3) That makes sense. 0:24:16.548,0:24:18.508 So it is a fact in a sense[br]that you can put it in, 0:24:18.509,0:24:19.950 I guess you could put it in Wikidata 0:24:19.951,0:24:22.031 with sources that are claiming[br]that that's the thing.
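As a toy illustration of what modeling items, facts, and mentions as frames in a single store can look like: the actual SLING frame representation is richer, so the structure and names in this sketch are assumptions made purely for illustration.

```python
# Toy frame store: every item, fact, and mention is a frame, i.e. a set of
# (role, value) slots, where a value can point to another frame by its id.
class FrameStore:
    def __init__(self):
        self.frames = {}

    def add(self, fid, **slots):
        self.frames[fid] = dict(slots)
        return fid

    def resolve(self, ref):
        """Follow a reference to its frame, or return it unchanged."""
        return self.frames.get(ref, ref)

store = FrameStore()
store.add("Q5", name="human")
store.add("Q567", name="Angela Merkel", P31="Q5")  # instance of: human
# A mention found while parsing a document is itself a frame that
# evokes the item frame it refers to.
store.add("mention-1", text="Merkel", evokes="Q567")

item = store.resolve(store.resolve("mention-1")["evokes"])
print(item["name"])  # Angela Merkel
```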
0:24:22.032,0:24:26.561 So again, you would not necessarily[br]want to train the model in a way 0:24:26.562,0:24:30.721 where if you read someone saying[br]the planet Pluto, bla, bla, bla, 0:24:30.722,0:24:33.561 then it should be fine for it 0:24:33.562,0:24:36.561 to then say that[br]an annotation for this text 0:24:36.562,0:24:38.200 is that Pluto is a planet. 0:24:39.509,0:24:41.432 That doesn't mean, you know... 0:24:42.120,0:24:46.918 The model won't be able to tell[br]what "in the end" is the truth, 0:24:46.919,0:24:49.214 I don't think any of us here[br]will be able to either, so... 0:24:49.214,0:24:50.285 (man 3) I just want to say 0:24:50.285,0:24:52.775 it's not a hard accusation[br]against the approach 0:24:52.776,0:24:56.028 because even people[br]cannot be sure whether that's a fact, 0:24:56.029,0:24:58.214 whether a new fact is plausible[br]at that moment. 0:24:58.730,0:24:59.730 But that's always... 0:24:59.731,0:25:03.386 I just maybe reiterated a question[br]that I am posing all the time 0:25:03.387,0:25:05.750 to myself and my work; I always ask. 0:25:06.311,0:25:09.267 We do the statistical learning thing,[br]it's amazing nowadays 0:25:09.268,0:25:13.585 we can do billions of things,[br]but we cannot learn about surprises, 0:25:13.586,0:25:16.840 and they are[br]very, very important in fact, right? 0:25:17.595,0:25:20.711 - (man 4) But, just to refute...[br]- (man 3) Thank you. 0:25:22.567,0:25:26.551 (man 4) The plausibility model[br]is combined with kind of two extra rules. 0:25:26.551,0:25:30.361 First of all,[br]if it's in Wikidata, it's true. 0:25:30.362,0:25:34.635 We just give you the benefit of the doubt,[br]so please make it good. 0:25:34.636,0:25:39.261 The second thing is if it's not[br]allowed by the schema, it's false; 0:25:39.770,0:25:42.504 it's all the things in between[br]we're looking at. 0:25:43.436,0:25:50.366 So if it's a planet according to Wikidata,[br]it will be a true fact. 0:25:53.130,0:25:57.406 But it won't predict surprises,[br]but what is important here 0:25:57.407,0:26:01.814 is that there's kind of[br]no manual human work involved, 0:26:01.814,0:26:03.629 so there's nothing[br]that prevents you from... 0:26:03.629,0:26:05.936 Well, now, if we're successful[br]with the approach, 0:26:05.937,0:26:09.019 there's nothing that prevents him[br]from continuously updating the model 0:26:09.019,0:26:12.483 with changes happening[br]in Wikidata and Wikipedia and so on. 0:26:12.484,0:26:18.128 So in theory, you should be able[br]to quickly learn new surprises. 0:26:18.129,0:26:19.657 (moderator) One last question. 0:26:20.223,0:26:23.157 - (man 4) Maybe we're biased by Wikidata.[br]- Yeah. 0:26:23.683,0:26:27.561 (man 4) You are our bias.[br]Whatever you annotate is what we believe. 0:26:27.562,0:26:31.701 So if you make it good,[br]if you make it balanced, 0:26:31.702,0:26:33.953 we can hopefully be balanced. 0:26:33.954,0:26:39.365 With the gender thing,[br]there's actually an interesting thing. 0:26:39.951,0:26:42.299 We are actually getting[br]more training facts 0:26:42.300,0:26:43.649 about women than men 0:26:43.650,0:26:48.954 because "she" is a much less[br]ambiguous pronoun in the text, 0:26:48.954,0:26:51.600 so we actually get a lot more[br]true facts about women. 0:26:51.600,0:26:55.189 So we are biased, but on the women's side. 0:26:56.241,0:26:58.924 (woman 2) No, I want to see[br]the data on that. 0:26:58.925,0:27:00.471 (audience laughing) 0:27:00.471,0:27:02.381 We should bring that along next time.
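The two extra rules described here, combined with the plausibility model, amount to a simple decision procedure. A minimal sketch follows, with all three inputs (the known facts, the schema check, and the trained model) treated as assumed interfaces rather than actual SLING components.

```python
def judge_fact(item_qid, fact, known_facts, schema_allows, plausibility):
    """Score a candidate (property, value) fact for an item.

    known_facts: dict mapping item QID -> set of (property, value) pairs
    schema_allows: callable(item_qid, fact) -> bool, e.g. property constraints
    plausibility: callable(item_qid, fact) -> score in [0.0, 1.0]
    """
    if fact in known_facts.get(item_qid, set()):
        return 1.0  # already in Wikidata: given the benefit of the doubt, treated as true
    if not schema_allows(item_qid, fact):
        return 0.0  # not allowed by the schema: treated as false
    return plausibility(item_qid, fact)  # everything in between: ask the model
```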
0:27:02.381,0:27:04.945 (man 4) You get hard decision [inaudible]. 0:27:04.945,0:27:06.285 (man 3) Yes, hard decision. 0:27:07.885,0:27:13.001 (man 5) It says SLING is...[br]a parser across many languages 0:27:13.002,0:27:15.163 - and you showed us English.[br]- Yes! 0:27:15.163,0:27:17.934 (man 5) Can you say something about[br]the number of languages that you are-- 0:27:17.934,0:27:19.155 Yes! Thank you for asking. 0:27:19.155,0:27:21.602 I had told myself to say that[br]up front on the first page 0:27:21.602,0:27:23.363 because otherwise,[br]I would forget, and I did. 0:27:24.742,0:27:25.742 So right now, 0:27:25.743,0:27:29.876 we're not actually looking at two files,[br]we're looking at 13 files. 0:27:29.877,0:27:32.768 So Wikipedia dumps[br]from 12 different languages 0:27:32.769,0:27:35.801 that we're processing, 0:27:35.802,0:27:41.483 and none of this is dependent[br]on the language being English. 0:27:41.484,0:27:44.280 So we're processing this[br]for all of the 12 languages. 0:27:48.238,0:27:49.238 Yeah. 0:27:49.239,0:27:50.239 For now, 0:27:50.240,0:27:56.617 they share the property of, I think,[br]using the Latin alphabet, and so on. 0:27:56.617,0:27:58.601 Mostly for us to be able to make sure 0:27:58.602,0:28:02.121 that what we are doing[br]still makes sense and works. 0:28:02.121,0:28:04.961 But there's nothing[br]fundamental about the approach 0:28:04.962,0:28:09.869 that prevents it from being used[br]in very different languages 0:28:09.870,0:28:14.656 from those being spoken around this area. 0:28:17.275,0:28:19.321 (woman 3) Leila from Wikimedia Foundation. 0:28:19.322,0:28:21.850 I may have missed this[br]when you presented this. 0:28:22.904,0:28:28.385 Do you make an attempt to bring[br]any references from Wikipedia articles 0:28:28.386,0:28:32.433 back to the property and statements[br]you're making in Wikidata? 0:28:33.357,0:28:37.222 So I briefly mentioned this[br]as a potential application. 0:28:37.222,0:28:40.352 So for now, what we're trying to do[br]is just to get this to work, 0:28:41.156,0:28:46.005 but let's say we did get it to work[br]with a high level of quality, 0:28:46.622,0:28:51.240 that would be an obvious thing[br]to try to do, so when you... 0:28:52.811,0:28:55.187 Let's say you were willing to... 0:28:55.187,0:28:59.590 I know there's some controversy around[br]using Wikipedia as a source for Wikidata, 0:28:59.590,0:29:01.957 that you can't have[br]circular references and so on, 0:29:01.957,0:29:04.849 so you need to have[br]properly sourced facts. 0:29:04.850,0:29:07.420 So let's say you were[br]coming up with new facts, 0:29:07.421,0:29:14.307 and obviously, you could look[br]at the coverage of news media and so on 0:29:14.308,0:29:16.220 and process these[br]and try to annotate these. 0:29:16.221,0:29:19.522 And then, that way,[br]find sources for facts, 0:29:19.523,0:29:20.964 new facts that you come up with. 0:29:20.965,0:29:22.326 Or you could even take existing... 0:29:22.327,0:29:25.901 There are a lot of facts in Wikidata[br]that either have no sources 0:29:25.901,0:29:29.641 or only have Wikipedia as a source,[br]so you can start processing these 0:29:29.642,0:29:32.802 and try to find sources[br]for those automatically. 0:29:33.545,0:29:38.198 (Leila) Or even within the articles[br]that you're taking this information from 0:29:38.199,0:29:41.879 just using the sources from there[br]because they may contain... 0:29:42.383,0:29:44.329 - Yeah. Yeah.[br]- Yeah. Thanks. 0:29:47.428,0:29:49.315 - (moderator) Thanks Anders.[br]- Cool. Thanks.
0:29:49.919,0:29:55.345 (applause)