0:00:05.840,0:00:07.080 Hi, I'm Lucie. 0:00:07.080,0:00:11.930 You know me from rambling about [br]not enough language data in Wikidata, 0:00:11.930,0:00:15.800 and I thought instead of rambling today, [br]which I'll leave to Lydia later today, 0:00:16.530,0:00:20.370 I'll just show you a bit, or give you[br]an insight on the projects we did 0:00:20.370,0:00:25.170 using the data that we already have [br]on Wikidata, for different causes. 0:00:25.170,0:00:28.550 So underserved languages, [br]compared to the keynote we just heard 0:00:28.550,0:00:32.530 where the person was talking about [br]underserved as like minority languages, 0:00:32.530,0:00:35.600 underserved languages to me[br]are any languages 0:00:35.600,0:00:38.760 that don't have [br]enough representation on the web. 0:00:39.420,0:00:40.930 Yeah, just to get that clear. 0:00:40.930,0:00:43.060 So, who am I? 0:00:43.060,0:00:45.910 Why am I always talking [br]about languages on Wikidata? 0:00:45.910,0:00:47.593 Not sure but... 0:00:47.593,0:00:50.280 I'm a Computer Science PhD student 0:00:50.280,0:00:52.280 at the University of Southampton. 0:00:52.280,0:00:55.420 I'm a research intern [br]at Bloomberg in London, at the moment. 0:00:55.420,0:00:58.340 I'm a resident[br]at Newspeak House in London. 0:00:58.340,0:01:01.660 I am a researcher and project manager [br]for the Scribe project, 0:01:01.660,0:01:03.230 which I'll go into in a bit, 0:01:03.230,0:01:08.530 and I recently got into the idea [br]of oral knowledge and oral citation. 0:01:08.530,0:01:10.170 Kimberly is sitting right there. 0:01:10.990,0:01:13.330 And then, occasionally,[br]I have time to sleep 0:01:13.330,0:01:16.010 and do other things, but that's very rare. 0:01:16.680,0:01:18.620 So if you're interested[br]in any of those things, 0:01:18.620,0:01:20.020 come talk and speak to me. 0:01:20.020,0:01:23.480 Generally, this is an open presentation[br]and a few questions in between. 
0:01:23.480,0:01:26.780 I'll run through a lot of things[br]in a very short time now. 0:01:27.412,0:01:30.110 Come to me afterwards[br]if you're interested in any of them. 0:01:30.640,0:01:32.170 Speak to me. I'm here. 0:01:32.170,0:01:35.460 I'm always very happy to speak to people. 0:01:35.460,0:01:39.110 So that's a bit of what[br]we will talk about today. 0:01:39.110,0:01:41.480 So Wikidata, giving an introduction, 0:01:41.480,0:01:44.060 even though that's obviously[br]not as necessary. 0:01:44.510,0:01:48.130 The article placeholder,[br]which is aimed at Wikipedia readers, 0:01:48.130,0:01:50.910 then Scribe, which is aimed[br]at Wikipedia editors, 0:01:50.910,0:01:54.440 and then we have one topic of my research, 0:01:54.440,0:01:56.880 which is completely outside of Wikipedia 0:01:56.880,0:02:00.110 where we use Wikidata[br]for question answering. 0:02:01.530,0:02:03.950 So just a quick rerun. 0:02:03.950,0:02:07.040 Why is Wikidata so cool[br]for low-resource languages? 0:02:07.040,0:02:10.820 Well, we have those unique identifiers. 0:02:10.820,0:02:13.370 I'm speaking to people that know that 0:02:13.370,0:02:14.930 much better than me even. 0:02:14.930,0:02:17.720 And then we have labels[br]in different languages. 0:02:17.720,0:02:21.820 Those can be in over,[br]I think, 400 languages by now, 0:02:21.820,0:02:24.060 so we have a good option here 0:02:24.060,0:02:27.820 to reuse language[br]in different forms and capture it. 0:02:29.310,0:02:32.730 Yeah, so that's a little bit of me[br]rambling about Wikidata 0:02:32.730,0:02:34.880 because I can't stop it. 0:02:34.880,0:02:37.040 We compared Wikidata coverage[br]to the number of native speakers, 0:02:37.040,0:02:39.107 so we can see, obviously, 0:02:39.107,0:02:41.570 there are languages[br]that are widely spoken in the world. 0:02:41.570,0:02:43.840 There's Chinese, Hindi, or Arabic, 0:02:43.840,0:02:46.640 but they have very low coverage on Wikidata. 0:02:48.000,0:02:50.130 Then the opposite. 
0:02:50.130,0:02:52.590 We have the Dutch[br]and the Swedish communities, 0:02:52.590,0:02:54.880 which are super active in Wikidata, 0:02:54.880,0:02:58.060 which is really cool, [br]and that just points out 0:02:58.060,0:03:01.330 that even though we have [br]a low number of speakers, 0:03:01.330,0:03:06.810 we can have a big impact if people [br]are very active in the communities, 0:03:06.810,0:03:09.000 which is really nice and really good. 0:03:09.000,0:03:13.600 But also let's try to even[br]that graph out in the future. 0:03:14.560,0:03:18.570 So, cool. So now we have [br]all this language data in Wikidata. 0:03:18.570,0:03:22.280 We have low-resource Wikipedias, [br]so we thought, what can we do? 0:03:22.280,0:03:27.460 Well, my undergrad supervisor[br]is sitting here, 0:03:27.460,0:03:31.070 and we worked back then[br]in the golden days, 0:03:31.070,0:03:33.620 on something called[br]the article placeholder 0:03:34.730,0:03:39.370 which takes triples from Wikidata[br]and displays them on Wikipedia. 0:03:39.370,0:03:41.570 And that's pretty much [br]relatively straightforward. 0:03:41.570,0:03:46.300 So you just take the content of Wikidata,[br]display it on Wikipedia 0:03:46.300,0:03:49.330 to attract more readers [br]and then eventually more editors 0:03:49.330,0:03:51.330 in the different low-resource languages. 0:03:51.330,0:03:53.130 They are dynamically generated, 0:03:53.130,0:03:55.550 so they're not like stubs or bot articles 0:03:55.550,0:04:00.170 that then flood the Wikipedia [br]so people can edit them. 0:04:00.170,0:04:02.420 It's basically a starting point. 0:04:02.420,0:04:04.550 And we thought, [br]well, we have that content, 0:04:04.550,0:04:08.570 and we have that knowledge[br]somewhere already, which is Wikidata. 0:04:08.570,0:04:11.600 It's often already in the languages,[br]but they don't have articles, 0:04:11.600,0:04:15.136 so at least give them[br]the insight into the information. 
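As a rough illustration of the article placeholder idea described above (take triples from Wikidata and show them to readers in their own language), here is a minimal Python sketch. It is not how the actual extension is implemented. The IDs (`Q_CITY`, `P_COUNTRY`, and so on), the small label table, and the function names `label` and `render_placeholder` are all made up for illustration; a real system would read labels from Wikidata itself.

```python
# Hypothetical label table: entity/property ID -> {language code: label}.
# All IDs and labels here are illustrative assumptions, not real Wikidata data.
LABELS = {
    "Q_CITY": {"en": "Marrakesh", "eo": "Marakeŝo"},
    "Q_COUNTRY": {"en": "Morocco", "eo": "Maroko"},
    "Q_CITY_CLASS": {"en": "city", "eo": "urbo"},
    "P_COUNTRY": {"en": "country", "eo": "lando"},
    "P_INSTANCE": {"en": "instance of"},  # no Esperanto label yet
}


def label(entity_id, lang, fallback="en"):
    """Return the label in `lang`, else in `fallback`, else the bare ID."""
    names = LABELS.get(entity_id, {})
    return names.get(lang) or names.get(fallback) or entity_id


def render_placeholder(subject, triples, lang):
    """Render (property, value) pairs about `subject` as plain text lines."""
    lines = [label(subject, lang)]
    for prop, value in triples:
        lines.append(f"{label(prop, lang)}: {label(value, lang)}")
    return "\n".join(lines)


triples = [("P_COUNTRY", "Q_COUNTRY"), ("P_INSTANCE", "Q_CITY_CLASS")]
print(render_placeholder("Q_CITY", triples, "eo"))
```

The fallback to another language in `label` is one possible design choice for missing labels; it also shows why label coverage in the reader's language matters so much for this kind of page.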
0:04:15.136,0:04:19.220 The article placeholders are live [br]on 14 low-resource Wikipedias. 0:04:20.040,0:04:21.770 If you are a Wikipedia community, 0:04:21.770,0:04:24.805 if you are part of a Wikipedia community[br]and interested in it, 0:04:24.805,0:04:26.110 let us know. 0:04:27.880,0:04:30.080 And then I went into research, 0:04:30.080,0:04:32.770 and I got stuck with[br]the article placeholder, though, 0:04:32.770,0:04:36.040 so we started to look into [br]text generation from Wikidata 0:04:36.040,0:04:38.060 for Wikipedia and low-resource languages. 0:04:38.060,0:04:39.965 And text generation is really interesting 0:04:39.965,0:04:43.310 because in research it was, at that point [br]when we started the project, 0:04:43.310,0:04:46.020 focused only on English, 0:04:46.020,0:04:48.880 which is a bit pointless in my experience 0:04:48.880,0:04:51.440 because, I mean, you have a lot of people[br]who write in English, 0:04:51.440,0:04:55.350 but then what we need is people [br]who write in those low-resource languages. 0:04:55.350,0:04:59.420 And our starting point was that[br]looking at triples on Wikipedia 0:04:59.420,0:05:01.600 is not exactly the nicest thing. 0:05:01.600,0:05:03.680 I mean, as much as I love[br]the article placeholder, 0:05:03.680,0:05:06.330 it's not exactly [br]what you want to see or expect 0:05:06.330,0:05:07.960 when you open a Wikipedia page. 0:05:07.960,0:05:09.590 So we try to generate text. 0:05:09.590,0:05:11.770 We use this beautiful[br]neural network model, 0:05:11.770,0:05:13.440 where we encode Wikidata triples. 0:05:13.440,0:05:15.755 If you're interested more[br]in the technical parts, 0:05:15.755,0:05:16.970 come and talk to me. 0:05:16.970,0:05:21.820 And so, realistically,[br]with neural text generation, 0:05:21.820,0:05:23.750 you can generate one or two sentences 0:05:23.750,0:05:27.530 before it completely scrambles[br]and becomes useless. 
0:05:27.530,0:05:32.660 So we've generated one sentence[br]that describes the topic of the triple. 0:05:32.660,0:05:35.600 And so this, for example, is Arabic. 0:05:35.600,0:05:38.620 We generate the sentence about Marrakesh, 0:05:38.620,0:05:40.660 where it just describes the city. 0:05:42.170,0:05:45.680 So for that, then, we tested this-- 0:05:45.680,0:05:49.330 So we did studies, obviously,[br]to test if our approach works, 0:05:49.330,0:05:52.480 and if it makes sense, to use such things. 0:05:52.480,0:05:55.660 And because we are [br]very application-focused, 0:05:55.660,0:05:58.730 we tested it with actual [br]Wikipedia readers and editors. 0:05:58.730,0:06:01.302 So, first, we tested it[br]with Wikipedia readers 0:06:01.302,0:06:03.020 in Arabic and Esperanto-- 0:06:03.020,0:06:06.170 so use cases with Arabic and Esperanto. 0:06:07.640,0:06:12.710 And we can see that our model[br]can generate sentences 0:06:12.710,0:06:14.493 that are very fluent 0:06:14.493,0:06:18.050 and that feel very much--[br]surprisingly, a lot, actually-- 0:06:18.050,0:06:19.640 like Wikipedia sentences. 0:06:19.640,0:06:22.710 So it picks up, so we train on,[br]for example, for Arabic, 0:06:22.710,0:06:26.470 we train on Arabic with the idea to say 0:06:26.470,0:06:29.880 we want to keep[br]the cultural context of that language 0:06:29.880,0:06:32.980 and not let it influence 0:06:32.980,0:06:35.295 from other languages[br]that have higher coverage. 0:06:36.150,0:06:38.403 And then we did a study[br]with Wikipedia editors 0:06:38.403,0:06:41.080 because in the end the article placeholder[br]is just a starting point 0:06:41.080,0:06:42.515 for people to start editing, 0:06:42.515,0:06:43.570 and we try to measure 0:06:43.570,0:06:45.950 how much of the sentences[br]would they reuse. 
0:06:45.950,0:06:48.750 How much is useful for them, basically, 0:06:48.750,0:06:51.200 and you can see [br]that there is a high number of reuse, 0:06:51.200,0:06:54.880 especially in Esperanto[br]when we test with editors. 0:06:55.860,0:07:01.150 And finally, we did also[br]qualitative interviews 0:07:01.150,0:07:05.030 with Wikipedia editors [br]across six languages. 0:07:05.030,0:07:07.680 I think we had [br]about ten people we interviewed. 0:07:08.680,0:07:12.260 And we tried to get[br]more of an understanding 0:07:12.260,0:07:15.310 of the human perspective[br]on those generated sentences. 0:07:15.310,0:07:18.060 So now we can have [br]a very quantified way of saying, 0:07:18.060,0:07:19.285 yeah, they are good, 0:07:19.285,0:07:21.284 but we wanted to see 0:07:21.284,0:07:22.775 how's the interaction 0:07:22.775,0:07:25.510 and especially with whatever[br]always happens 0:07:25.510,0:07:30.340 in neural machine translation[br]and neural text generation, 0:07:30.340,0:07:33.970 that you have those missing word tokens[br]which we put as "rare" in there. 0:07:33.970,0:07:38.860 So that's the example sentences we used.[br]All of them are about Marrakesh. 0:07:38.860,0:07:42.150 So we wanted to see how much [br]are people bothered by it, 0:07:42.150,0:07:43.198 what's the quality, 0:07:43.198,0:07:45.350 what are the things[br]that stand out to them, 0:07:45.350,0:07:50.080 and we can see that the mistakes[br]by the networks like those rare tokens 0:07:50.080,0:07:51.420 are often just ignored. 0:07:53.080,0:07:56.080 There is this interesting factor[br]that because we didn't tell them 0:07:56.080,0:08:00.640 where this happens,[br]where we got the sentences from-- 0:08:00.640,0:08:03.680 because it was on a user page of mine 0:08:03.680,0:08:05.880 but it looked like it was on a Wikipedia, 0:08:05.880,0:08:07.420 people just trusted it. 
0:08:07.420,0:08:09.000 And I think that's very important 0:08:09.000,0:08:13.350 when we look into those kinds[br]of research directions that we look into, 0:08:13.350,0:08:16.130 we cannot override [br]this trust into Wikipedia. 0:08:16.130,0:08:20.460 So if we work with Wikipedians[br]and Wikipedia itself, 0:08:20.460,0:08:23.240 if we take things from,[br]for example, Wikidata, 0:08:23.240,0:08:26.400 that's good[br]because it's also human-curated. 0:08:26.400,0:08:31.050 But when we start[br]with artificial intelligence projects, 0:08:31.050,0:08:34.680 we have to be really careful[br]what we actually expose people to 0:08:34.680,0:08:37.950 because they just trust[br]the information that we give them. 0:08:38.910,0:08:42.570 So we could see, for example,[br]in the Arabic version, 0:08:42.570,0:08:45.480 it gave the wrong location for Marrakesh, 0:08:45.480,0:08:47.770 and people, even the people I interviewed 0:08:47.770,0:08:50.330 that were living in Marrakesh [br]didn't pick up on that, 0:08:50.330,0:08:54.090 because it's on Wikipedia, [br]so it should be fine, right? 0:08:54.090,0:08:55.115 (chuckles) 0:08:55.115,0:08:56.340 Yeah. 0:08:57.680,0:09:00.750 We found there was a magical threshold[br]for the length of the generated text, 0:09:00.750,0:09:02.000 so that's something we found, 0:09:02.000,0:09:05.250 especially in comparison[br]with the content translation tool, 0:09:05.250,0:09:08.080 where you have a long [br]automatically generated text, 0:09:08.080,0:09:12.130 and people were complaining[br]that content translation was very hard 0:09:12.130,0:09:15.610 because you're just doing post-editing,[br]you don't have the creativity. 0:09:15.610,0:09:19.330 There are other remarks [br]on content translation I usually make-- 0:09:19.330,0:09:20.710 I'll skip them for now. 
0:09:22.400,0:09:25.230 So that one sentence was helpful 0:09:25.230,0:09:30.360 because even if we've made mistakes,[br]people were still willing to fix them 0:09:30.360,0:09:34.130 because it's a very short [br]intervention. 0:09:34.130,0:09:37.950 And then, finally, [br]a lot of people pointed out 0:09:37.950,0:09:40.200 that it was particularly good [br]for a new editor, 0:09:40.200,0:09:42.080 so for them to have a starting point, 0:09:42.080,0:09:44.080 to have those triples, to have a sentence, 0:09:44.080,0:09:46.150 so they have something to start from. 0:09:46.150,0:09:48.720 So after all those interviews were done, 0:09:48.720,0:09:51.990 I was like, that's very interesting. 0:09:51.990,0:09:54.260 What else can we do with that knowledge? 0:09:54.260,0:09:58.950 And so we started a new project, [br]exactly because there weren't enough yet. 0:09:58.950,0:10:02.310 And the new project we have[br]is called Scribe, 0:10:02.310,0:10:07.460 and Scribe focuses on new editors[br]that want to write a new article, 0:10:07.460,0:10:09.660 and particularly people[br]who haven't written 0:10:09.660,0:10:11.260 an article on Wikipedia yet, 0:10:11.260,0:10:14.280 and specifically also[br]on low-resource languages. 0:10:15.130,0:10:18.910 So the idea is that-- [br]that's the pixel version of me. 0:10:19.800,0:10:21.170 All my slides are basically 0:10:21.170,0:10:24.240 references to people in this room,[br]which I really love. 0:10:24.240,0:10:25.750 It feels like I'm home again. 0:10:27.000,0:10:30.880 So, yeah, I want to write a new article, 0:10:30.880,0:10:33.570 but I don't know where to start[br]as a new editor, 0:10:33.570,0:10:36.710 and so we have this project Scribe. 0:10:36.710,0:10:41.370 Scribe is a profession[br]or was the name of someone 0:10:41.370,0:10:45.260 with the profession of writing[br]in ancient Egypt. 
0:10:47.080,0:10:52.870 So the Scribe project's idea[br]is that we want to give people, basically, 0:10:52.870,0:10:55.710 a hand when they start[br]writing their first articles. 0:10:55.710,0:10:57.750 So give them a skeleton, 0:10:57.750,0:11:01.040 give them a skeleton that's based[br]on their language Wikipedia, 0:11:01.040,0:11:05.330 instead of just translating the content[br]from another language Wikipedia. 0:11:05.330,0:11:10.390 So the first thing we want to do[br]is plan section titles, 0:11:10.390,0:11:13.640 then select references for each section, 0:11:13.640,0:11:15.974 ideally in the local Wikipedia language, 0:11:15.974,0:11:19.950 and then summarize those references[br]to give a starting point to write. 0:11:21.400,0:11:25.310 For the project, we have [br]a Wikimedia Foundation project grant. 0:11:25.310,0:11:27.570 So it just started. 0:11:27.570,0:11:30.670 So we are very open[br]to feedback, in general. 0:11:30.670,0:11:35.170 That was the very first[br]not so beautiful layout, 0:11:35.170,0:11:36.830 but just for you to get an overview. 0:11:36.830,0:11:39.640 So there is this idea [br]of collecting references, 0:11:39.640,0:11:42.950 images from Commons, section titles. 0:11:42.950,0:11:45.620 And so the main thing [br]we want to use Wikidata for 0:11:45.620,0:11:47.850 is the sections. 0:11:47.860,0:11:51.400 So, basically, we want to see[br]what are articles 0:11:51.400,0:11:55.220 on similar topics [br]already existing in your language, 0:11:55.220,0:11:58.350 so we can understand[br]how the language community 0:11:58.350,0:12:02.220 decided on structuring articles. 0:12:02.220,0:12:06.170 And then we look[br]for the images, obviously, 0:12:06.170,0:12:10.480 where Wikidata also[br]is a good point to go through. 0:12:12.550,0:12:16.240 And then we made[br]a prettier interface for it 0:12:16.240,0:12:18.420 because we decided to go mobile first. 
0:12:18.420,0:12:21.280 So most of communities[br]that we aim to work with 0:12:21.280,0:12:24.510 are very heavy on mobile editing. 0:12:24.510,0:12:29.800 And so we do this mobile-first focus. 0:12:30.230,0:12:34.060 And then, it also forces us[br]to break down into steps 0:12:34.060,0:12:37.000 which eventually will lead to, [br]yeah, I don't know, 0:12:37.000,0:12:39.440 a step-by-step guide [br]on how to write a new article. 0:12:39.440,0:12:43.060 So an editor comes,[br]they can select section headers 0:12:43.060,0:12:46.630 based on existing articles[br]in their language, 0:12:46.630,0:12:49.150 write one section at a time, 0:12:49.150,0:12:54.130 switch between the sections, [br]and select references for each section. 0:12:55.600,0:12:59.050 Yeah, so the idea is that[br]we will have an easier editing experience, 0:12:59.050,0:13:00.680 especially for new editors, 0:13:00.680,0:13:05.080 to keep them in-- [br]integrate Wikidata information 0:13:05.080,0:13:08.280 and [inaudible] images [br]from Wikimedia Commons as well. 0:13:09.730,0:13:12.110 If you're interested in Scribe, 0:13:12.110,0:13:15.130 I'm working together[br]on this project with Hady. 0:13:15.130,0:13:19.310 There is a lot of things online, 0:13:19.310,0:13:23.240 but then also just come and talk to us. 0:13:23.240,0:13:25.860 Also, if you're editing [br]a low-resource Wikipedia, 0:13:25.860,0:13:28.613 we're still looking[br]for people to interview 0:13:28.613,0:13:31.570 because we're trying to emulate-- 0:13:31.570,0:13:33.880 we're trying to emulate as much as we can 0:13:33.880,0:13:36.750 what people already experience,[br]or they already edit. 0:13:36.750,0:13:38.630 I'm not big on Wikipedia editing. 0:13:38.630,0:13:40.510 Also, my native language is German. 0:13:40.510,0:13:43.580 So I need a lot of input from editors 0:13:43.580,0:13:48.080 that want to tell me[br]what they need, what they want, 0:13:48.080,0:13:51.140 where they think this project can go. 
0:13:51.140,0:13:54.590 And if you are into Wikidata,[br]also come and talk to me, please. 0:13:55.730,0:13:57.889 Okay, so that's all the projects 0:13:57.889,0:14:01.880 or most of the projects we did[br]inside the Wikimedia world. 0:14:01.880,0:14:05.780 And I want to give you one [br]short overview of what's happening 0:14:05.780,0:14:10.420 on my end of research,[br]around Wikidata as well. 0:14:14.290,0:14:15.950 So I was part of a project 0:14:15.950,0:14:17.820 that works a lot with question answering, 0:14:17.820,0:14:20.460 and I don't know too much[br]about question answering, 0:14:20.460,0:14:23.880 but what I do know a lot about[br]is knowledge graphs and multilinguality. 0:14:23.880,0:14:25.680 So, basically, what we wanted to do 0:14:25.680,0:14:29.963 is we have a question answering system[br]that gets a question from a user, 0:14:29.963,0:14:35.770 and we wanted to select a knowledge graph[br]that can answer the question best. 0:14:35.770,0:14:40.150 And again, we focused on [br]multilingual question answering system. 0:14:40.150,0:14:45.750 So if I want to ask something about Bach,[br]for example, in Spanish and French-- 0:14:45.750,0:14:48.420 because that's the two languages[br]I know best-- 0:14:48.420,0:14:52.030 then what knowledge graph has the data 0:14:52.030,0:14:53.950 to actually answer those questions. 0:14:55.160,0:14:59.260 So what we did was we found a method[br]to rank knowledge graphs, 0:15:00.600,0:15:04.800 based on the metadata of language, 0:15:04.800,0:15:08.170 that appears on the knowledge graph, 0:15:08.170,0:15:09.510 [which is split] by class. 0:15:09.510,0:15:11.460 And then we look for each class 0:15:11.460,0:15:14.440 into what languages are covered best, 0:15:14.440,0:15:18.170 and then depending on the question,[br]can suggest a knowledge graph. 
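The ranking idea just described (look, per class, at which languages each knowledge graph covers best, then suggest a graph for a given question) could be sketched roughly as follows. This is only an assumption-laden illustration: the coverage numbers, the graph names `kg_a`, `kg_b`, `kg_c`, and the choice of a simple mean as the score are all invented here and are not the method from the talk.

```python
# Hypothetical coverage table: knowledge graph -> class -> language -> share
# of entities of that class that have a label in that language (0.0 to 1.0).
# The numbers are made up for illustration only.
COVERAGE = {
    "kg_a": {"person": {"es": 0.9, "fr": 0.8}, "city": {"es": 0.4}},
    "kg_b": {"person": {"es": 0.5, "fr": 0.95}},
    "kg_c": {"city": {"es": 0.7, "fr": 0.6}},
}


def rank_knowledge_graphs(coverage, entity_class, languages):
    """Rank graphs by mean label coverage of `entity_class` over `languages`.

    Using the mean as the score is an assumption of this sketch.
    """
    scores = {}
    for kg, classes in coverage.items():
        per_lang = classes.get(entity_class, {})
        total = sum(per_lang.get(lang, 0.0) for lang in languages)
        scores[kg] = total / len(languages)
    return sorted(scores, key=lambda kg: scores[kg], reverse=True)


# A question about a person (such as Bach), asked in Spanish and French.
print(rank_knowledge_graphs(COVERAGE, "person", ["es", "fr"]))
```

A graph that has no data for the class simply scores zero, so it ends up at the bottom of the ranking rather than being dropped.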
0:15:19.000,0:15:22.510 From the big knowledge graphs[br]we looked into 0:15:22.510,0:15:25.080 and that are well known and widely used, 0:15:25.080,0:15:28.240 Wikidata covers the most languages[br]of all knowledge graphs, 0:15:28.240,0:15:31.750 and we used a test bed. 0:15:31.750,0:15:35.570 So we'd use a benchmark dataset[br]called QALD, 0:15:35.570,0:15:39.350 which we then translated-- [br]which was originally for DBpedia. 0:15:39.350,0:15:41.880 We translated it[br]for those five knowledge graphs 0:15:41.880,0:15:43.550 into SPARQL queries. 0:15:43.550,0:15:49.820 And then we gave that to a crowd [br]and looked into which knowledge graph 0:15:49.820,0:15:54.680 has the best answers[br]for each of those SPARQL queries. 0:15:54.680,0:15:59.370 And overall, the crowd workers[br]preferred Wikidata's answers 0:15:59.370,0:16:01.640 because they are very precise, 0:16:02.890,0:16:05.020 they are in most of the languages 0:16:05.020,0:16:06.530 that the others don't cover, 0:16:07.620,0:16:10.970 and they are not [br]as repetitive or redundant 0:16:10.970,0:16:12.480 as the [inaudible]. 0:16:12.480,0:16:16.680 So just to make a quick recap [br]on the whole topic 0:16:16.680,0:16:19.820 of Wikidata and the future and languages. 0:16:19.820,0:16:23.620 So we can say that Wikidata[br]is already widely used 0:16:23.620,0:16:27.910 for numerous applications in Wikipedia, 0:16:27.910,0:16:30.080 and then outside Wikipedia for research. 0:16:30.080,0:16:33.970 So what I talked about[br]is just the things I do research on, 0:16:33.970,0:16:35.970 but there is still so much more. 0:16:35.970,0:16:38.950 So there is machine translation[br]using knowledge graphs, 0:16:38.950,0:16:40.950 there is rule mining[br]over knowledge graphs, 0:16:40.950,0:16:43.530 there's entity linking in text. 0:16:43.530,0:16:47.170 There is so much more research[br]happening at the moment, 0:16:47.170,0:16:50.880 and Wikidata is getting more and more[br]popular for these uses. 
0:16:50.880,0:16:54.640 So I think we are at a very good stage 0:16:54.640,0:16:57.590 to push and connect the communities. 0:16:58.640,0:17:02.970 Yeah, to get the best [br]from both sides, basically. 0:17:03.510,0:17:04.770 Thank you very much. 0:17:04.770,0:17:07.860 If you want to have a look[br]at any of those projects, 0:17:07.860,0:17:09.285 they are there, 0:17:09.285,0:17:10.710 my slides are in Commons already. 0:17:10.710,0:17:14.800 If you want to read any of the papers, [br]I think all of them are open access. 0:17:14.800,0:17:16.260 If you can't find any of them, 0:17:16.260,0:17:18.770 write me an email [br]and I'll send it to you immediately. 0:17:18.770,0:17:20.705 Thank you very much. 0:17:20.705,0:17:22.400 (applause) 0:17:25.740,0:17:28.130 (moderator) Okay, [br]are there any questions? 0:17:28.130,0:17:31.770 - (moderator) I'll come around.[br]- (person 1) Shall I come to you? 0:17:34.794,0:17:36.370 (person 1) Hi Lucie, thank you so much, 0:17:36.370,0:17:38.460 I'm so glad to see[br]you taking this forward. 0:17:38.460,0:17:40.680 Now I'm really curious about Scribe. 0:17:41.510,0:17:43.510 The example here with the university 0:17:43.510,0:17:46.060 was the idea that the person says, 0:17:46.060,0:17:47.540 "This is a university." 0:17:47.540,0:17:49.020 And then you go to Wikidata 0:17:49.020,0:17:51.930 and say, "Oh gosh! [br]Universities have places 0:17:51.930,0:17:54.110 and presidents, and I don't know what," 0:17:54.110,0:17:57.770 that you're using these as the parts, [br]for telling the person what to do. 0:17:57.770,0:18:00.840 So, basically, the idea[br]is that someone says, 0:18:00.840,0:18:02.820 "I want to write about Nile University." 
0:18:02.820,0:18:07.040 We look into Nile University's[br]Wikidata item, 0:18:07.040,0:18:09.820 and let's say-- I work a lot with Arabic-- 0:18:09.820,0:18:13.240 so let's say we then go[br]in Arabic Wikipedia, 0:18:13.240,0:18:17.040 so we can make a grid, basically, 0:18:17.040,0:18:19.370 of all items that are around[br]Nile University. 0:18:19.370,0:18:23.060 So there are also universities,[br]there are also universities in Cairo, 0:18:23.060,0:18:25.480 or there are also universities[br]in Egypt, stuff like that, 0:18:25.480,0:18:27.350 or they have similar topics. 0:18:27.350,0:18:32.530 So we can look into[br]all the similar items on Wikidata, 0:18:32.530,0:18:36.330 and if they already have [br]a Wikipedia entry in Arabic Wikipedia, 0:18:36.330,0:18:38.610 we can look at the section titles. 0:18:38.610,0:18:41.310 - (person 1) (gasps)[br]- Exactly, and then we can make basically, 0:18:41.310,0:18:46.370 the most common way [br]about writing about a university 0:18:46.370,0:18:50.000 in Cairo on Arabic Wikipedia. 0:18:50.000,0:18:52.703 - Yeah, so that's the--[br]- (person 1) Thank you, [inaudible]. 0:18:56.880,0:18:59.550 (person 2) Hi, thank you so much[br]for your inspiring talk. 0:18:59.550,0:19:04.800 I was wondering if this would work[br]for languages in Incubator? 0:19:04.800,0:19:10.620 Like, I work with really low,[br]low, low, low-resource languages 0:19:10.620,0:19:16.461 and this thing about doing it mobile[br]would be a huge thing, 0:19:16.461,0:19:20.020 because in many communities[br]they only have phones, not laptops. 0:19:20.020,0:19:22.020 So, would it work? 0:19:22.020,0:19:26.080 So I think, to an extent-- 0:19:26.080,0:19:32.050 so the general structure, the skeleton[br]of the application would work. 
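The section-title idea from the Nile University answer a moment ago (look at similar items that already have an article in your language and take the most common section titles) can be sketched in a few lines of Python. This is a guess at the general shape, not Scribe's actual code: the function name `suggest_sections`, the `min_share` threshold, and the example section titles (given in English for readability) are all assumptions.

```python
from collections import Counter


def suggest_sections(similar_articles, min_share=0.5):
    """Return section titles used by at least `min_share` of the articles,
    most common first (ties broken alphabetically)."""
    counts = Counter()
    for sections in similar_articles:
        counts.update(set(sections))  # count each title once per article
    threshold = min_share * len(similar_articles)
    common = [(title, n) for title, n in counts.items() if n >= threshold]
    common.sort(key=lambda pair: (-pair[1], pair[0]))
    return [title for title, _ in common]


# Made-up section lists for three existing articles about similar items.
articles = [
    ["History", "Faculties", "References"],
    ["History", "Campus", "References"],
    ["History", "Faculties", "Notable alumni"],
]
print(suggest_sections(articles))
```

Finding the "similar items" in the first place (other universities, other universities in Cairo or Egypt) is the part that would rely on Wikidata; here it is assumed to have happened already.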
0:19:32.050,0:19:35.280 Two things that we're thinking about[br]a lot at the moment 0:19:35.280,0:19:37.080 for exactly those use cases is, 0:19:37.080,0:19:39.970 how much would we want,[br]for example, to say, 0:19:39.970,0:19:44.530 if there are no articles [br]on a similar topic in your Wikipedia, 0:19:44.530,0:19:46.930 how much do we want it[br]to get it from other Wikipedias. 0:19:46.930,0:19:49.750 And that's why I'm basically [br]doing those interviews at the moment, 0:19:49.750,0:19:51.420 because I try to understand 0:19:51.420,0:19:54.570 how much people already look[br]at other language Wikipedias 0:19:54.570,0:19:57.040 to make the structure of an article. 0:19:57.040,0:19:58.800 Are they generally equal 0:19:58.800,0:20:01.630 or do they differ a lot[br]based on cultural context? 0:20:01.630,0:20:04.310 So that would be something to consider, 0:20:04.310,0:20:06.640 but there is a possibility to say, 0:20:06.640,0:20:09.550 we take everything [br]from all the language Wikipedias 0:20:09.550,0:20:12.040 and then make an average, basically. 0:20:12.040,0:20:14.970 And the other problem is referencing. 0:20:14.970,0:20:16.460 So that's something we find. 0:20:16.460,0:20:20.730 We make it very convenient[br]because we use a lot of Arabic, 0:20:20.730,0:20:24.170 and Arabic actually has the problem[br]that there are a lot of references, 0:20:24.170,0:20:28.790 but they are very little used[br]or not widely used in Wikipedia. 0:20:29.260,0:20:31.570 That's not true, obviously,[br]for all languages, 0:20:31.570,0:20:34.104 and that's something[br]I'd be very interested-- 0:20:34.104,0:20:35.180 like, let's talk. 0:20:35.180,0:20:36.680 That's what I'm trying to say, 0:20:36.680,0:20:39.170 I'd be very interested [br]on your perspective on it 0:20:39.170,0:20:41.680 because I'd like to know, yeah 0:20:41.680,0:20:43.750 what do you think about referencing 0:20:43.750,0:20:45.460 done from English or any other language. 
0:20:45.460,0:20:46.840 (person 2) Have you ever tried-- 0:20:46.840,0:20:51.880 what we do is we normally[br]reference to interviews we have. 0:20:51.880,0:20:55.600 We put them in our repository,[br]institutional repository, 0:20:55.600,0:20:59.570 because these languages [br]don't have written references, 0:20:59.570,0:21:03.240 and I feel like [br]that is the way to go, but-- 0:21:03.240,0:21:06.910 I'm currently also--[br]Kimberly and I are discussing a lot. 0:21:06.910,0:21:10.930 We made a session on Wikimania[br]on oral knowledge and oral citations. 0:21:10.930,0:21:14.135 Yeah, we should hang out [br]and have a long conversation. 0:21:14.135,0:21:15.620 (laughs) 0:21:18.310,0:21:22.040 (person 3) So [Michael Davignon], [br]we'll talk about medium size, 0:21:22.040,0:21:23.910 which is probably around ten people, 0:21:23.910,0:21:27.750 so it's medium for Breton Wikipedia. 0:21:27.750,0:21:30.600 And I'm wondering if we can use Scribe, 0:21:31.530,0:21:34.880 how to find a common plan[br]the other way around 0:21:34.880,0:21:37.770 for existing articles[br]to find [the outer layers], 0:21:37.770,0:21:39.571 that's supposed to be the best plan, 0:21:39.571,0:21:42.130 but I'm not aware of more or less 0:21:42.130,0:21:44.710 [inaudible] [br]improvement existing article. 0:21:46.790,0:21:49.440 I think there's-- 0:21:49.440,0:21:50.800 I forgot the name, I think, 0:21:50.800,0:21:53.790 [Diego] in the Wikimedia Foundation[br]research team, 0:21:53.790,0:21:58.407 who's working a lot at the moment[br]with section headings. 0:21:58.407,0:22:01.420 But, yes, generally, the idea is the same. 0:22:01.420,0:22:04.640 So instead of using them[br]to make an average 0:22:04.640,0:22:07.260 you could say, [br]this is not like the average. 0:22:08.170,0:22:09.680 That's very possible, yeah. 0:22:14.750,0:22:18.330 (person 4) Hi, Lucie. 
I'm Érica Azzellini[br]from Wiki Movement, Brazil, 0:22:18.330,0:22:20.130 and I'm very-- 0:22:20.130,0:22:21.860 (Érica) Oh, can you hear me? 0:22:21.860,0:22:24.680 So, I'm Érica Azzellini[br]from Wiki Movement Brazil, 0:22:24.680,0:22:26.560 and I'm really impressed with your work 0:22:26.570,0:22:29.154 because it's really in sync 0:22:29.154,0:22:32.540 with what we've been working on in Brazil[br]with the Mbabel tool. 0:22:32.540,0:22:33.950 I don't know if you heard about it? 0:22:33.950,0:22:36.000 - Not yet. [br]- (Érica) It's a tool that we use 0:22:36.020,0:22:38.440 to automatically[br]generate Wikipedia entries 0:22:38.440,0:22:42.240 using Wikidata information [br]in a simple way 0:22:42.240,0:22:46.510 that can be replicated [br]on other language Wikipedias. 0:22:46.510,0:22:48.950 So we've been working[br]on Portuguese mainly, 0:22:48.950,0:22:51.860 and we're trying to get[br]on English Wikipedia too, 0:22:51.860,0:22:56.200 but it can be replicated[br]on any language, basically, 0:22:56.200,0:22:58.460 and I think then we could talk about it. 0:22:58.460,0:23:00.460 Absolutely, it will be super interesting 0:23:00.460,0:23:03.260 because the article placeholder[br]is an extension already, 0:23:03.260,0:23:06.130 so it might be worth [br]integrating your efforts 0:23:06.130,0:23:07.950 into the existing extension. 0:23:07.950,0:23:12.620 Lydia is also fully for it,[br]and... (laughs) 0:23:12.620,0:23:13.930 And then because-- 0:23:13.930,0:23:17.040 so one of the problems--[br][Marius] correct me if I'm wrong-- 0:23:17.040,0:23:20.310 we had was that[br]article placeholder doesn't scale 0:23:20.310,0:23:22.240 as well as it should. 0:23:22.240,0:23:24.860 So article placeholder[br]is not in Portuguese 0:23:24.860,0:23:28.545 because we're always afraid[br]it will break everything, correct? 0:23:29.460,0:23:32.286 And then [Marius] is just taking a pause. 
0:23:32.286,0:23:35.420 - (Érica) Yeah, you should be careful.[br]- Don't want to say anything about this. 0:23:35.420,0:23:38.950 But, yeah, we should connect[br]because I'd be super interested to see 0:23:38.950,0:23:42.040 how you solve those issues[br]and how it works for you. 0:23:42.040,0:23:45.310 (Érica) I'm going to present [br]on the second section 0:23:45.310,0:23:48.350 of the learning talk about this project [br]that we've been developing, 0:23:48.350,0:23:50.620 and we've been using it [br]on GLAM-Wiki initiatives 0:23:50.620,0:23:52.440 and education projects already. 0:23:52.440,0:23:54.480 - Perfect.[br]- (Érica) So let's do that. 0:23:54.480,0:23:56.440 Yeah, absolutely let's chat. 0:23:57.220,0:23:58.274 (moderator) Cool. 0:23:58.274,0:24:00.370 Some other questions on your projects? 0:24:02.460,0:24:06.820 (person 5) Hi, my name is [Alan], [br]and I think this is extremely cool. 0:24:06.820,0:24:09.170 I had a few questions about 0:24:09.170,0:24:13.110 generating Wiki sentences[br]from neural networks. 0:24:13.110,0:24:16.020 - Yeah.[br]- (person 5) So I've come across 0:24:16.020,0:24:19.240 another project[br]that was attempting to do this, 0:24:19.240,0:24:23.020 and it was essentially using [br][triples input and sentences output], 0:24:23.020,0:24:25.510 and it was able [br]to generate very fluent sentences. 0:24:25.510,0:24:29.360 But sometimes they weren't... 0:24:30.370,0:24:33.820 actually, they weren't correct,[br]with regards to the triple. 0:24:33.820,0:24:39.420 And I was curious if you had any ways[br]of doing validity checks of this sort. 0:24:39.420,0:24:43.040 Sometimes the triple [br]is "subject, predicate, object," 0:24:43.040,0:24:46.110 but the language model says, 0:24:46.110,0:24:48.565 "Okay, this object is very rare, 0:24:48.565,0:24:51.740 I'm going to say you are born in San Jose, 0:24:51.740,0:24:55.060 instead of San Francisco or vice versa." 0:24:55.060,0:24:58.880 And I was curious[br]if you had come across this? 
0:24:58.880,0:25:01.510 So that's what we call hallucinations. 0:25:01.510,0:25:05.080 The idea that [br]there's something in a sentence 0:25:05.080,0:25:07.690 that wasn't in the original triple[br]and the data. 0:25:08.400,0:25:11.350 What we do-- [br]so we don't do anything about it, 0:25:11.350,0:25:13.910 we just also realized[br]that that's happening. 0:25:13.910,0:25:15.910 It happens even more[br]for the low-resource languages, 0:25:15.910,0:25:19.730 because we work across domains,[br]so we are domain independently generating. 0:25:19.730,0:25:24.670 Traditional NLG work [br]is always biography domain, usually. 0:25:24.670,0:25:26.620 So that happens a lot 0:25:26.620,0:25:29.510 because we just have little training data [br]on the low-resource languages. 0:25:30.400,0:25:32.800 We have a few ideas. 0:25:32.800,0:25:36.840 It's one of the million topics [br]I'm supposed to work on at the moment. 0:25:38.850,0:25:42.550 One of them is to use [br]entity linking and relation extraction, 0:25:42.550,0:25:44.440 to align what we generate 0:25:44.440,0:25:46.640 with the triples[br]we inputted in the first place, 0:25:46.640,0:25:50.750 to see if it's off or the network [br]generates information it shouldn't have 0:25:50.750,0:25:54.090 or it cannot know about, basically. 0:25:54.090,0:25:58.680 That's also all I can say about this[br]because now time is over. 0:25:58.680,0:26:01.480 (person 5) I'd love to talk offline[br]about this, if you have time. 0:26:01.480,0:26:03.260 Yeah, absolutely, let's chat about it. 0:26:03.260,0:26:05.140 Thank you so much,[br]everyone, it was lovely. 0:26:05.140,0:26:06.600 (moderator) Thank you, Lucie. 0:26:06.600,0:26:08.610 (applause)
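The alignment idea mentioned in that last answer is described only as a plan, so the following is a toy sketch of what such a check might look like, not something the project is known to have built. It replaces real entity linking with naive, case-insensitive substring matching against a list of known labels; the function name `find_unsupported_entities`, the person name, and the label lists are all made up.

```python
def find_unsupported_entities(sentence, triple_labels, known_labels):
    """Return known entity labels that appear in `sentence` but are not
    among the labels of the input triples.

    Matching is naive, case-insensitive substring search, standing in for
    a real entity linker (an assumption of this sketch).
    """
    text = sentence.lower()
    supported = {lbl.lower() for lbl in triple_labels}
    return sorted(
        lbl for lbl in known_labels
        if lbl.lower() in text and lbl.lower() not in supported
    )


# Input triple: (Alice Example, place of birth, San Francisco).
triple_labels = ["Alice Example", "place of birth", "San Francisco"]
known_labels = ["Alice Example", "San Francisco", "San Jose"]

generated = "Alice Example was born in San Jose."
print(find_unsupported_entities(generated, triple_labels, known_labels))
```

A non-empty result flags a possible hallucination, as in the San Jose versus San Francisco example from the question; an empty result only means no known label contradicts the input, not that the sentence is correct.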