WEBVTT 00:00:05.840 --> 00:00:07.080 Hi, I'm Lucie. 00:00:07.080 --> 00:00:11.930 You know me from rambling about not enough language data in Wikidata, 00:00:11.930 --> 00:00:15.800 and I thought instead of rambling today, which I'll leave to Lydia later today, 00:00:16.530 --> 00:00:20.370 I'll just show you a bit, or give you an insight on the projects we did 00:00:20.370 --> 00:00:25.170 using the data that we already have on Wikidata, for different causes. 00:00:25.170 --> 00:00:28.550 So underserved languages compared to the keynote we just heard 00:00:28.550 --> 00:00:32.530 where the person was talking about underserved as like minority languages, 00:00:32.530 --> 00:00:35.600 underserved languages to me, are any languages 00:00:35.600 --> 00:00:38.760 that don't have enough representation on the web. 00:00:39.420 --> 00:00:40.930 Yeah, just to get that clear. 00:00:40.930 --> 00:00:43.060 So, who am I? 00:00:43.060 --> 00:00:45.910 Why am I always talking about languages on Wikidata? 00:00:45.910 --> 00:00:47.593 Not sure but... 00:00:47.593 --> 00:00:50.280 I'm a Computer Science PhD student 00:00:50.280 --> 00:00:52.280 at the University of Southampton. 00:00:52.280 --> 00:00:55.420 I'm a research intern at Bloomberg in London, at the moment. 00:00:55.420 --> 00:00:58.340 I'm a resident at Newspeak House in London. 00:00:58.340 --> 00:01:01.660 I am a researcher and project manager for the Scribe project, 00:01:01.660 --> 00:01:03.230 which I'll go into in a bit, 00:01:03.230 --> 00:01:08.530 and I recently got into the idea of oral knowledge and oral citation. 00:01:08.530 --> 00:01:10.170 Kimberly is sitting right there. 00:01:10.990 --> 00:01:13.330 And then, occasionally, I have time to sleep 00:01:13.330 --> 00:01:16.010 and do other things, but that's very rare. 00:01:16.680 --> 00:01:18.620 So if you're interested in any of those things, 00:01:18.620 --> 00:01:20.020 come talk and speak to me. 
00:01:20.020 --> 00:01:23.480 Generally, this is an open presentation with a few questions in between. 00:01:23.480 --> 00:01:26.780 I'll run through a lot of things in a very short time now. 00:01:27.412 --> 00:01:30.110 Come to me afterwards if you're interested in any of them. 00:01:30.640 --> 00:01:32.170 Speak to me. I'm here. 00:01:32.170 --> 00:01:35.460 I'm always very happy to speak to people. 00:01:35.460 --> 00:01:39.110 So that's a bit of what we will talk about today. 00:01:39.110 --> 00:01:41.480 So Wikidata, giving an introduction, 00:01:41.480 --> 00:01:44.060 even though that's obviously not as necessary. 00:01:44.510 --> 00:01:48.130 Then the article placeholder, which is aimed at Wikipedia readers, 00:01:48.130 --> 00:01:50.910 then Scribe, which is aimed at Wikipedia editors, 00:01:50.910 --> 00:01:54.440 and then we have one topic of my research, 00:01:54.440 --> 00:01:56.880 which is completely outside of Wikipedia 00:01:56.880 --> 00:02:00.110 where we use Wikidata for question answering. 00:02:01.530 --> 00:02:03.950 So just a quick rerun. 00:02:03.950 --> 00:02:07.040 Why is Wikidata so cool for low-resource languages? 00:02:07.040 --> 00:02:10.820 Well, we have those unique identifiers. 00:02:10.820 --> 00:02:13.370 I'm speaking to people that know that 00:02:13.370 --> 00:02:14.930 much better than me even. 00:02:14.930 --> 00:02:17.720 And then we have labels in different languages. 00:02:17.720 --> 00:02:21.820 Those can be in over, I think, 400 languages by now, 00:02:21.820 --> 00:02:24.060 so we have a good option here 00:02:24.060 --> 00:02:27.820 to reuse language in different forms and capture it. 00:02:29.310 --> 00:02:32.730 Yeah, so that's a little bit of me rambling about Wikidata 00:02:32.730 --> 00:02:34.880 because I can't stop it. 
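The point about labels in different languages can be made concrete with a minimal sketch. It assumes entity data in the shape of Wikidata's entity JSON, where `entity["labels"][language_code]["value"]` holds the label text. The item IDs and the function name `label_coverage` are made up for illustration and are not from the talk.

```python
# Hypothetical sample shaped like Wikidata entity JSON.
# The item IDs are placeholders, not real Wikidata identifiers.
SAMPLE_ENTITIES = {
    "item-1": {"labels": {
        "en": {"language": "en", "value": "Marrakesh"},
        "ar": {"language": "ar", "value": "مراكش"},
        "fr": {"language": "fr", "value": "Marrakech"},
    }},
    "item-2": {"labels": {
        "en": {"language": "en", "value": "Cairo"},
        "ar": {"language": "ar", "value": "القاهرة"},
    }},
    "item-3": {"labels": {
        "en": {"language": "en", "value": "Southampton"},
    }},
}


def label_coverage(entities, languages):
    """Return, per language, the share of entities that have a label in it."""
    total = len(entities)
    if total == 0:
        return {lang: 0.0 for lang in languages}
    return {
        lang: sum(1 for e in entities.values() if lang in e.get("labels", {})) / total
        for lang in languages
    }


if __name__ == "__main__":
    for lang, share in label_coverage(SAMPLE_ENTITIES, ["en", "ar", "fr"]).items():
        print(f"{lang}: {share:.0%} of items have a label")
```

Comparing such shares against speaker numbers is one way to read the coverage comparison that follows in the talk; the sample here is far too small to say anything about real coverage.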
00:02:34.880 --> 00:02:37.040 We compared coverage on Wikidata to the number of native speakers, 00:02:37.040 --> 00:02:39.107 so we can see, obviously, 00:02:39.107 --> 00:02:41.570 there are languages that are widely spoken in the world, 00:02:41.570 --> 00:02:43.840 like Chinese, Hindi, or Arabic, 00:02:43.840 --> 00:02:46.640 but that then have very low coverage on Wikidata. 00:02:48.000 --> 00:02:50.130 Then the opposite. 00:02:50.130 --> 00:02:52.590 So we have the Dutch and the Swedish communities, 00:02:52.590 --> 00:02:54.880 which are super active in Wikidata, 00:02:54.880 --> 00:02:58.060 which is really cool, and that just points out 00:02:58.060 --> 00:03:01.330 that even though we have a low number of speakers, 00:03:01.330 --> 00:03:06.810 we can have a big impact if people are very active in the communities, 00:03:06.810 --> 00:03:09.000 which is really nice and really good. 00:03:09.000 --> 00:03:13.600 But also let's try to equal that graph out in the future. 00:03:14.560 --> 00:03:18.570 So, cool. So now we have all this language data in Wikidata. 00:03:18.570 --> 00:03:22.280 We have low-resource Wikipedias, so we thought, what can we do? 00:03:22.280 --> 00:03:27.460 Well, my undergrad supervisor is sitting here, 00:03:27.460 --> 00:03:31.070 and we worked back then in the golden days, 00:03:31.070 --> 00:03:33.620 on something called the article placeholder 00:03:34.730 --> 00:03:39.370 which takes triples from Wikidata and displays them on Wikipedia. 00:03:39.370 --> 00:03:41.570 And that's relatively straightforward. 00:03:41.570 --> 00:03:46.300 So you just take the content of Wikidata, display it on Wikipedia 00:03:46.300 --> 00:03:49.330 to attract more readers and then eventually more editors 00:03:49.330 --> 00:03:51.330 in the different low-resource languages. 
00:03:51.330 --> 00:03:53.130 They are dynamically generated, 00:03:53.130 --> 00:03:55.550 so they're not like stubs or bot articles 00:03:55.550 --> 00:04:00.170 that then flood the Wikipedia so people can edit them. 00:04:00.170 --> 00:04:02.420 It's basically a starting point. 00:04:02.420 --> 00:04:04.550 And we thought, well, we have that content, 00:04:04.550 --> 00:04:08.570 and we have that knowledge somewhere already, which is Wikidata. 00:04:08.570 --> 00:04:11.600 It's often already in the languages, but they don't have articles, 00:04:11.600 --> 00:04:15.136 so at least give them the insight into the information. 00:04:15.136 --> 00:04:19.220 The article placeholders are live on 14 low-resource Wikipedias. 00:04:20.040 --> 00:04:21.770 If you are a Wikipedia community, 00:04:21.770 --> 00:04:24.805 if you are part of a Wikipedia community and interested in it, 00:04:24.805 --> 00:04:26.110 let us know. 00:04:27.880 --> 00:04:30.080 And then I went into research, 00:04:30.080 --> 00:04:32.770 and I got stuck with the article placeholder, though, 00:04:32.770 --> 00:04:36.040 so we started to look into text generation from Wikidata 00:04:36.040 --> 00:04:38.060 for Wikipedia and low-resource languages. 00:04:38.060 --> 00:04:39.965 And text generation is really interesting 00:04:39.965 --> 00:04:43.310 because in research it was at that point when we started the project 00:04:43.310 --> 00:04:46.020 focused only on English, 00:04:46.020 --> 00:04:48.880 which is a bit pointless in my experience 00:04:48.880 --> 00:04:51.440 because, I mean, you have a lot of people who write in English, 00:04:51.440 --> 00:04:55.350 but then what we need is people who write in those low-resource languages. 00:04:55.350 --> 00:04:59.420 And our starting point was that looking at triples on Wikipedia 00:04:59.420 --> 00:05:01.600 is not exactly the nicest thing. 
00:05:01.600 --> 00:05:03.680 I mean, as much as I love the article placeholder, 00:05:03.680 --> 00:05:06.330 it's not exactly what you want to see or expect 00:05:06.330 --> 00:05:07.960 when you open a Wikipedia page. 00:05:07.960 --> 00:05:09.590 So we try to generate text. 00:05:09.590 --> 00:05:11.770 We use this beautiful neural network model, 00:05:11.770 --> 00:05:13.440 where we encode Wikidata triples. 00:05:13.440 --> 00:05:15.755 If you're interested more in the technical parts, 00:05:15.755 --> 00:05:16.970 come and talk to me. 00:05:16.970 --> 00:05:21.820 And so, realistically, with neural text generation, 00:05:21.820 --> 00:05:23.750 you can generate one or two sentences 00:05:23.750 --> 00:05:27.530 before it completely scrambles and becomes useless. 00:05:27.530 --> 00:05:32.660 So we've generated one sentence that describes the topic of the triple. 00:05:32.660 --> 00:05:35.600 And so this, for example, is Arabic. 00:05:35.600 --> 00:05:38.620 We generate the sentence about Marrakesh, 00:05:38.620 --> 00:05:40.660 where it just describes the city. 00:05:42.170 --> 00:05:45.680 So for that, then, we tested this-- 00:05:45.680 --> 00:05:49.330 So we did studies, obviously, to test if our approach works, 00:05:49.330 --> 00:05:52.480 and if it makes sense to use such things. 00:05:52.480 --> 00:05:55.660 And because we are very application-focused, 00:05:55.660 --> 00:05:58.730 we tested it with actual Wikipedia readers and editors. 00:05:58.730 --> 00:06:01.302 So, first, we tested it with Wikipedia readers 00:06:01.302 --> 00:06:03.020 in Arabic and Esperanto-- 00:06:03.020 --> 00:06:06.170 so use cases with Arabic and Esperanto. 00:06:07.640 --> 00:06:12.710 And we can see that our model can generate sentences 00:06:12.710 --> 00:06:14.493 that are very fluent 00:06:14.493 --> 00:06:18.050 and that feel very much-- surprisingly, a lot, actually-- 00:06:18.050 --> 00:06:19.640 like Wikipedia sentences. 
00:06:19.640 --> 00:06:22.710 So it picks up, so we train on, for example, for Arabic, 00:06:22.710 --> 00:06:26.470 we train on Arabic with the idea to say 00:06:26.470 --> 00:06:29.880 we want to keep the cultural context of that language 00:06:29.880 --> 00:06:32.980 and not let it be influenced 00:06:32.980 --> 00:06:35.295 by other languages that have higher coverage. 00:06:36.150 --> 00:06:38.403 And then we did a study with Wikipedia editors 00:06:38.403 --> 00:06:41.080 because in the end the article placeholder is just a starting point 00:06:41.080 --> 00:06:42.515 for people to start editing, 00:06:42.515 --> 00:06:43.570 and we try to measure 00:06:43.570 --> 00:06:45.950 how much of the sentences they would reuse. 00:06:45.950 --> 00:06:48.750 How much is useful for them, basically, 00:06:48.750 --> 00:06:51.200 and you can see that there is a high number of reuse, 00:06:51.200 --> 00:06:54.880 especially in Esperanto when we test with editors. 00:06:55.860 --> 00:07:01.150 And finally, we also did qualitative interviews 00:07:01.150 --> 00:07:05.030 with Wikipedia editors across six languages. 00:07:05.030 --> 00:07:07.680 I think we had about ten people we interviewed. 00:07:08.680 --> 00:07:12.260 And we tried to get more of an understanding 00:07:12.260 --> 00:07:15.310 of what's a human perspective on those generated sentences. 00:07:15.310 --> 00:07:18.060 So now we can have a very quantified way of saying, 00:07:18.060 --> 00:07:19.285 yeah, they are good, 00:07:19.285 --> 00:07:21.284 but we wanted to see 00:07:21.284 --> 00:07:22.775 how's the interaction 00:07:22.775 --> 00:07:25.510 and especially with what always happens 00:07:25.510 --> 00:07:30.340 in neural machine translation and neural text generation, 00:07:30.340 --> 00:07:33.970 that you have those missing word tokens which we put as "rare" in there. 00:07:33.970 --> 00:07:38.860 So that's the example sentences we used. All of them are about Marrakesh. 
00:07:38.860 --> 00:07:42.150 So we wanted to see how much people are bothered by it, 00:07:42.150 --> 00:07:43.198 what's the quality, 00:07:43.198 --> 00:07:45.350 what are the things that stand out to them, 00:07:45.350 --> 00:07:50.080 and we can see that the mistakes by the networks like those red tokens 00:07:50.080 --> 00:07:51.420 are often just ignored. 00:07:53.080 --> 00:07:56.080 There is this interesting factor that because we didn't tell them 00:07:56.080 --> 00:08:00.640 where this happens, where we got the sentences from-- 00:08:00.640 --> 00:08:03.680 because it was on a user page of mine 00:08:03.680 --> 00:08:05.880 but it looked like it was on a Wikipedia, 00:08:05.880 --> 00:08:07.420 people just trusted it. 00:08:07.420 --> 00:08:09.000 And I think that's very important 00:08:09.000 --> 00:08:13.350 when we look into those kinds of research directions, 00:08:13.350 --> 00:08:16.130 we cannot override this trust in Wikipedia. 00:08:16.130 --> 00:08:20.460 So if we work with Wikipedians and Wikipedia itself, 00:08:20.460 --> 00:08:23.240 if we take things from, for example, Wikidata, 00:08:23.240 --> 00:08:26.400 that's good because it's also human-curated. 00:08:26.400 --> 00:08:31.050 But when we start with artificial intelligence projects, 00:08:31.050 --> 00:08:34.680 we have to be really careful what we actually expose people to 00:08:34.680 --> 00:08:37.950 because they just trust the information that we give them. 00:08:38.910 --> 00:08:42.570 So we could see, for example, in the Arabic version, 00:08:42.570 --> 00:08:45.480 it gave the wrong location for Marrakesh, 00:08:45.480 --> 00:08:47.770 and people, even the people I interviewed 00:08:47.770 --> 00:08:50.330 that were living in Marrakesh didn't pick up on that, 00:08:50.330 --> 00:08:54.090 because it's on Wikipedia, so it should be fine, right? 00:08:54.090 --> 00:08:55.115 (chuckles) 00:08:55.115 --> 00:08:56.340 Yeah. 
00:08:57.680 --> 00:09:00.750 We found there was a magical threshold for the length of the generated text, 00:09:00.750 --> 00:09:02.000 so that's something we found, 00:09:02.000 --> 00:09:05.250 especially in comparison with the content translation tool, 00:09:05.250 --> 00:09:08.080 where you have a long automatically generated text, 00:09:08.080 --> 00:09:12.130 and people were complaining that content translation was very hard 00:09:12.130 --> 00:09:15.610 because you're just doing post-editing, you don't have the creativity. 00:09:15.610 --> 00:09:19.330 There are other remarks on content translation I usually make-- 00:09:19.330 --> 00:09:20.710 I'll skip them for now. 00:09:22.400 --> 00:09:25.230 So that one sentence was helpful 00:09:25.230 --> 00:09:30.360 because even if we've made mistakes, people were still willing to fix them 00:09:30.360 --> 00:09:34.130 because it's a very short intervention. 00:09:34.130 --> 00:09:37.950 And then, finally, a lot of people pointed out 00:09:37.950 --> 00:09:40.200 that it was particularly good for a new editor, 00:09:40.200 --> 00:09:42.080 so for them to have a starting point, 00:09:42.080 --> 00:09:44.080 to have those triples, to have a sentence, 00:09:44.080 --> 00:09:46.150 so they have something to start from. 00:09:46.150 --> 00:09:48.720 So after all those interviews were done, 00:09:48.720 --> 00:09:51.990 I was like, that's very interesting. 00:09:51.990 --> 00:09:54.260 What else can we do with that knowledge? 00:09:54.260 --> 00:09:58.950 And so we started a new project, exactly because there weren't enough yet. 00:09:58.950 --> 00:10:02.310 And the new project we have is called Scribe, 00:10:02.310 --> 00:10:07.460 and Scribe focuses on new editors that want to write a new article, 00:10:07.460 --> 00:10:09.660 and particularly people who haven't written 00:10:09.660 --> 00:10:11.260 an article on Wikipedia yet, 00:10:11.260 --> 00:10:14.280 and specifically also on low-resource languages. 
00:10:15.130 --> 00:10:18.910 So the idea is that-- that's the pixel version of me. 00:10:19.800 --> 00:10:21.170 All my slides are basically 00:10:21.170 --> 00:10:24.240 references to people in this room, which I really love. 00:10:24.240 --> 00:10:25.750 It feels like I'm home again. 00:10:27.000 --> 00:10:30.880 So, yeah, I want to write a new article, 00:10:30.880 --> 00:10:33.570 but I don't know where to start as a new editor, 00:10:33.570 --> 00:10:36.710 and so we have this project Scribe. 00:10:36.710 --> 00:10:41.370 Scribe is a profession or was the name of someone 00:10:41.370 --> 00:10:45.260 with the profession of writing in ancient Egypt. 00:10:47.080 --> 00:10:52.870 So the Scribe project's idea is that we want to give people, basically, 00:10:52.870 --> 00:10:55.710 a hand when they start writing their first articles. 00:10:55.710 --> 00:10:57.750 So give them a skeleton, 00:10:57.750 --> 00:11:01.040 give them a skeleton that's based on their language Wikipedia, 00:11:01.040 --> 00:11:05.330 instead of just translating the content from another language Wikipedia. 00:11:05.330 --> 00:11:10.390 So the first thing we want to do is plan section titles, 00:11:10.390 --> 00:11:13.640 then select references for each section, 00:11:13.640 --> 00:11:15.974 ideally in the local Wikipedia language, 00:11:15.974 --> 00:11:19.950 and then summarize those references to give a starting point to write. 00:11:21.400 --> 00:11:25.310 For the project, we have a Wikimedia Foundation project grant. 00:11:25.310 --> 00:11:27.570 So it just started. 00:11:27.570 --> 00:11:30.670 So we are very open to feedback, in general. 00:11:30.670 --> 00:11:35.170 That was the very first not so beautiful layout, 00:11:35.170 --> 00:11:36.830 but just for you to get an overview. 00:11:36.830 --> 00:11:39.640 So there is this idea of collecting references, 00:11:39.640 --> 00:11:42.950 images from Commons, section titles. 
00:11:42.950 --> 00:11:45.620 And so the main things we want to use Wikidata for 00:11:45.620 --> 00:11:47.850 is the sections. 00:11:47.860 --> 00:11:51.400 So, basically, we want to see what are articles 00:11:51.400 --> 00:11:55.220 on similar topics already existing in your language, 00:11:55.220 --> 00:11:58.350 so we can understand how the language community 00:11:58.350 --> 00:12:02.220 decided on structuring articles. 00:12:02.220 --> 00:12:06.170 And then we look for the images, obviously, 00:12:06.170 --> 00:12:10.480 where Wikidata also is a good point to go through. 00:12:12.550 --> 00:12:16.240 And then we made a prettier interface for it 00:12:16.240 --> 00:12:18.420 because we decided to go mobile first. 00:12:18.420 --> 00:12:21.280 So most of communities that we aim to work with 00:12:21.280 --> 00:12:24.510 are very heavy on mobile editing. 00:12:24.510 --> 00:12:29.800 And so we do this mobile-first focus. 00:12:30.230 --> 00:12:34.060 And then, it also forces us to break down into steps 00:12:34.060 --> 00:12:37.000 which eventually will lead to, yeah, I don't know, 00:12:37.000 --> 00:12:39.440 a step-by-step guide on how to write a new article. 00:12:39.440 --> 00:12:43.060 So an editor comes, they can select section headers 00:12:43.060 --> 00:12:46.630 based on existing articles in their language, 00:12:46.630 --> 00:12:49.150 write one section at a time, 00:12:49.150 --> 00:12:54.130 switch between the sections, and select references for each section. 00:12:55.600 --> 00:12:59.050 Yeah, so the idea is that we will have an easier editing experience, 00:12:59.050 --> 00:13:00.680 especially for new editors, 00:13:00.680 --> 00:13:05.080 to keep them in-- integrate Wikidata information 00:13:05.080 --> 00:13:08.280 and [inaudible] images from Wikimedia Commons as well. 00:13:09.730 --> 00:13:12.110 If you're interested in Scribe, 00:13:12.110 --> 00:13:15.130 I'm working together on this project with Hady. 
00:13:15.130 --> 00:13:19.310 There is a lot of things online, 00:13:19.310 --> 00:13:23.240 but then also just come and talk to us. 00:13:23.240 --> 00:13:25.860 Also, if you're editing a low-resource Wikipedia, 00:13:25.860 --> 00:13:28.613 we're still looking for people to interview 00:13:28.613 --> 00:13:31.570 because we're trying to emulate-- 00:13:31.570 --> 00:13:33.880 we're trying to emulate as much as we can 00:13:33.880 --> 00:13:36.750 what people already experience, or they already edit. 00:13:36.750 --> 00:13:38.630 I'm not big on Wikipedia editing. 00:13:38.630 --> 00:13:40.510 Also, my native language is German. 00:13:40.510 --> 00:13:43.580 So I need a lot of input from editors 00:13:43.580 --> 00:13:48.080 that want to tell me what they need, what they want, 00:13:48.080 --> 00:13:51.140 where they think this project can go. 00:13:51.140 --> 00:13:54.590 And if you are into Wikidata, also come and talk to me, please. 00:13:55.730 --> 00:13:57.889 Okay, so that's all the projects 00:13:57.889 --> 00:14:01.880 or most of the projects we did inside the Wikimedia world. 00:14:01.880 --> 00:14:05.780 And I want to give you one short overview of what's happening 00:14:05.780 --> 00:14:10.420 on my end of research, around Wikidata as well. 00:14:14.290 --> 00:14:15.950 So I was part of a project 00:14:15.950 --> 00:14:17.820 that works a lot with question answering, 00:14:17.820 --> 00:14:20.460 and I don't know too much about question answering, 00:14:20.460 --> 00:14:23.880 but what I do know a lot about is knowledge graphs and multilinguality. 00:14:23.880 --> 00:14:25.680 So, basically, what we wanted to do 00:14:25.680 --> 00:14:29.963 is we have a question answering system that gets a question from a user, 00:14:29.963 --> 00:14:35.770 and we wanted to select a knowledge graph that can answer the question best. 00:14:35.770 --> 00:14:40.150 And again, we focused on multilingual question answering system. 
00:14:40.150 --> 00:14:45.750 So if I want to ask something about Bach, for example, in Spanish and French-- 00:14:45.750 --> 00:14:48.420 because that's the two languages I know best-- 00:14:48.420 --> 00:14:52.030 then what knowledge graph has the data 00:14:52.030 --> 00:14:53.950 to actually answer those questions. 00:14:55.160 --> 00:14:59.260 So what we did was we found a method to rank knowledge graphs, 00:15:00.600 --> 00:15:04.800 based on the metadata of language, 00:15:04.800 --> 00:15:08.170 that appears on the knowledge graph, 00:15:08.170 --> 00:15:09.510 [which is split] by class. 00:15:09.510 --> 00:15:11.460 And then we look for each class 00:15:11.460 --> 00:15:14.440 into what languages are covered best, 00:15:14.440 --> 00:15:18.170 and then depending on the question, can suggest a knowledge graph. 00:15:19.000 --> 00:15:22.510 From the big knowledge graphs we looked into 00:15:22.510 --> 00:15:25.080 and that are very known and widely used, 00:15:25.080 --> 00:15:28.240 Wikidata covers the most languages over all knowledge graphs, 00:15:28.240 --> 00:15:31.750 and we used a test bed. 00:15:31.750 --> 00:15:35.570 So we used a benchmark dataset called QALD, 00:15:35.570 --> 00:15:39.350 which we then translated-- which was originally for DBpedia. 00:15:39.350 --> 00:15:41.880 We translated it for those five knowledge graphs 00:15:41.880 --> 00:15:43.550 into SPARQL questions. 00:15:43.550 --> 00:15:49.820 And then we gave that to a crowd and looked into which knowledge graph 00:15:49.820 --> 00:15:54.680 has the best answers for each of those SPARQL queries. 
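The ranking idea described above (per class, look at which languages each knowledge graph covers best, then suggest a graph for a given question) can be sketched roughly as follows. Everything concrete here is an assumption: the graph names `KG-A` to `KG-C`, the coverage numbers, and the scoring by mean coverage are invented for illustration and are not results or the exact method from the study.

```python
# Hypothetical coverage table: knowledge graph -> class -> language -> share of
# entities of that class with a label in that language. Numbers are invented.
COVERAGE = {
    "KG-A": {"Person": {"es": 0.9, "fr": 0.8}},
    "KG-B": {"Person": {"es": 0.5, "fr": 0.95}},
    "KG-C": {"Person": {"es": 0.2}},
}


def rank_knowledge_graphs(coverage, entity_class, languages):
    """Rank graphs by mean label coverage for a class over the question's languages."""
    if not languages:
        raise ValueError("at least one language is required")

    def score(graph):
        per_language = coverage[graph].get(entity_class, {})
        # A language missing from the table counts as zero coverage.
        return sum(per_language.get(lang, 0.0) for lang in languages) / len(languages)

    # sorted() is stable, so graphs with equal scores keep their table order.
    return sorted(coverage, key=score, reverse=True)


if __name__ == "__main__":
    # A question about a person, asked in Spanish and French.
    print(rank_knowledge_graphs(COVERAGE, "Person", ["es", "fr"]))
```

In a real system the class would come from analysing the question (for example, recognising that Bach is a person), which this sketch leaves out.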
00:15:54.680 --> 00:15:59.370 And overall, the crowd workers preferred Wikidata's answers 00:15:59.370 --> 00:16:01.640 because they are very precise, 00:16:02.890 --> 00:16:05.020 they are in most of the languages 00:16:05.020 --> 00:16:06.530 that the others don't cover, 00:16:07.620 --> 00:16:10.970 and they are not as repetitive or redundant 00:16:10.970 --> 00:16:12.480 as the [inaudible]. 00:16:12.480 --> 00:16:16.680 So just to make a quick recap on the whole topic 00:16:16.680 --> 00:16:19.820 of Wikidata and the future and languages. 00:16:19.820 --> 00:16:23.620 So we can say that Wikidata is already widely used 00:16:23.620 --> 00:16:27.910 for numerous applications in Wikipedia, 00:16:27.910 --> 00:16:30.080 and then outside Wikipedia for research. 00:16:30.080 --> 00:16:33.970 So what I talked about is just the things I do research on, 00:16:33.970 --> 00:16:35.970 but there is still so much more. 00:16:35.970 --> 00:16:38.950 So there is machine translation using knowledge graphs, 00:16:38.950 --> 00:16:40.950 there is rule mining over knowledge graphs, 00:16:40.950 --> 00:16:43.530 there's entity linking in text. 00:16:43.530 --> 00:16:47.170 There is so much more research happening at the moment, 00:16:47.170 --> 00:16:50.880 and Wikidata is getting more and more popular for it. 00:16:50.880 --> 00:16:54.640 So I think we are at a very good stage 00:16:54.640 --> 00:16:57.590 to push and connect the communities. 00:16:58.640 --> 00:17:02.970 Yeah, to get the best from both sides, basically. 00:17:03.510 --> 00:17:04.770 Thank you very much. 00:17:04.770 --> 00:17:07.860 If you want to have a look at any of those projects, 00:17:07.860 --> 00:17:09.285 they are there, 00:17:09.285 --> 00:17:10.710 my slides are in Commons already. 00:17:10.710 --> 00:17:14.800 If you want to read any of the papers, I think all of them are open access. 
00:17:14.800 --> 00:17:16.260 If you can't find any of them, 00:17:16.260 --> 00:17:18.770 write me an email and I'll send it to you immediately. 00:17:18.770 --> 00:17:20.705 Thank you very much. 00:17:20.705 --> 00:17:22.400 (applause) 00:17:25.740 --> 00:17:28.130 (moderator) Okay, are there any questions? 00:17:28.130 --> 00:17:31.770 - (moderator) I'll come around. - (person 1) Shall I come to you? 00:17:34.794 --> 00:17:36.370 (person 1) Hi Lucie, thank you so much, 00:17:36.370 --> 00:17:38.460 I'm so glad to see you taking this forward. 00:17:38.460 --> 00:17:40.680 Now I'm really curious about Scribe. 00:17:41.510 --> 00:17:43.510 The example here within our university 00:17:43.510 --> 00:17:46.060 was the idea that the person says, 00:17:46.060 --> 00:17:47.540 "This is a university." 00:17:47.540 --> 00:17:49.020 And then you go to Wikidata 00:17:49.020 --> 00:17:51.930 and say, "Oh gosh! Universities have places 00:17:51.930 --> 00:17:54.110 and presidents, and I don't know what," 00:17:54.110 --> 00:17:57.770 that you're using these as the parts, for telling the person what to do. 00:17:57.770 --> 00:18:00.840 So, basically, the idea is that someone says, 00:18:00.840 --> 00:18:02.820 "I want to write about Nile University." 00:18:02.820 --> 00:18:07.040 We look into Nile University's Wikidata item, 00:18:07.040 --> 00:18:09.820 and let's say-- I work a lot with Arabic-- 00:18:09.820 --> 00:18:13.240 so let's say we then go in Arabic Wikipedia, 00:18:13.240 --> 00:18:17.040 so we can make a grid, basically, 00:18:17.040 --> 00:18:19.370 of all items that are around Nile University. 00:18:19.370 --> 00:18:23.060 So there are also universities, there are also universities in Cairo, 00:18:23.060 --> 00:18:25.480 or there are also universities in Egypt, stuff like that, 00:18:25.480 --> 00:18:27.350 or they have similar topics. 
00:18:27.350 --> 00:18:32.530 So we can look into all the similar items on Wikidata, 00:18:32.530 --> 00:18:36.330 and if they already have a Wikipedia entry in Arabic Wikipedia, 00:18:36.330 --> 00:18:38.610 we can look at the section titles. 00:18:38.610 --> 00:18:41.310 - (person 1) (gasps) - Exactly, and then we can make basically, 00:18:41.310 --> 00:18:46.370 the most common way about writing about a university 00:18:46.370 --> 00:18:50.000 in Cairo on Arabic Wikipedia. 00:18:50.000 --> 00:18:52.703 - Yeah, so that's the-- - (person 1) Thank you, [inaudible]. 00:18:56.880 --> 00:18:59.550 (person 2) Hi, thank you so much for your inspiring talk. 00:18:59.550 --> 00:19:04.800 I was wondering if this would work for languages in Incubator? 00:19:04.800 --> 00:19:10.620 Like, I work with really low, low, low, low-resource languages 00:19:10.620 --> 00:19:16.461 and this thing about doing it mobile would be a huge thing, 00:19:16.461 --> 00:19:20.020 because in many communities they only have phones, not laptops. 00:19:20.020 --> 00:19:22.020 So, would it work? 00:19:22.020 --> 00:19:26.080 So I think, to an extent-- 00:19:26.080 --> 00:19:32.050 so the general structure, the skeleton of the application would work. 00:19:32.050 --> 00:19:35.280 Two things that we're thinking about a lot at the moment 00:19:35.280 --> 00:19:37.080 for exactly those use cases is, 00:19:37.080 --> 00:19:39.970 how much would we want, for example, to say, 00:19:39.970 --> 00:19:44.530 if there are no articles on a similar topic in your Wikipedia, 00:19:44.530 --> 00:19:46.930 how much do we want it to get it from other Wikipedias. 00:19:46.930 --> 00:19:49.750 And that's why I'm basically doing those interviews at the moment, 00:19:49.750 --> 00:19:51.420 because I try to understand 00:19:51.420 --> 00:19:54.570 how much people already look at other language Wikipedias 00:19:54.570 --> 00:19:57.040 to make the structure of an article. 
00:19:57.040 --> 00:19:58.800 Are they generally equal 00:19:58.800 --> 00:20:01.630 or do they differ a lot based on cultural context? 00:20:01.630 --> 00:20:04.310 So that would be something to consider, 00:20:04.310 --> 00:20:06.640 but there is a possibility to say, 00:20:06.640 --> 00:20:09.550 we take everything from all the language Wikipedias 00:20:09.550 --> 00:20:12.040 and then make an average, basically. 00:20:12.040 --> 00:20:14.970 And the other problem is referencing. 00:20:14.970 --> 00:20:16.460 So that's something we find. 00:20:16.460 --> 00:20:20.730 We make it very convenient because we use a lot of Arabic, 00:20:20.730 --> 00:20:24.170 and Arabic actually has the problem that there are a lot of references, 00:20:24.170 --> 00:20:28.790 but they are very little used or not widely used in Wikipedia. 00:20:29.260 --> 00:20:31.570 That's not true, obviously, for all languages, 00:20:31.570 --> 00:20:34.104 and that's something I'd be very interested-- 00:20:34.104 --> 00:20:35.180 like, let's talk. 00:20:35.180 --> 00:20:36.680 That's what I'm trying to say, 00:20:36.680 --> 00:20:39.170 I'd be very interested on your perspective on it 00:20:39.170 --> 00:20:41.680 because I'd like to know, yeah 00:20:41.680 --> 00:20:43.750 what do you think about referencing 00:20:43.750 --> 00:20:45.460 done from English or any other language. 00:20:45.460 --> 00:20:46.840 (person 2) Have you ever tried-- 00:20:46.840 --> 00:20:51.880 what we do is we normally reference to interviews we have. 00:20:51.880 --> 00:20:55.600 We put them in our repository, institutional repository, 00:20:55.600 --> 00:20:59.570 because these languages don't have written references, 00:20:59.570 --> 00:21:03.240 and I feel like that is the way to go, but-- 00:21:03.240 --> 00:21:06.910 I'm currently also-- Kimberly and I are discussing a lot. 00:21:06.910 --> 00:21:10.930 We made a session on Wikimania on oral knowledge and oral citations. 
00:21:10.930 --> 00:21:14.135 Yeah, we should hang out and have a long conversation. 00:21:14.135 --> 00:21:15.620 (laughs) 00:21:18.310 --> 00:21:22.040 (person 3) So [Michael Davignon], we'll talk about medium size, 00:21:22.040 --> 00:21:23.910 which is probably around ten people, 00:21:23.910 --> 00:21:27.750 so it's medium for Breton Wikipedia. 00:21:27.750 --> 00:21:30.600 And I'm wondering if we can use Scribe, 00:21:31.530 --> 00:21:34.880 how to find a common plan the other way around 00:21:34.880 --> 00:21:37.770 for existing articles to find [the outliers], 00:21:37.770 --> 00:21:39.571 that's supposed to be the best plan, 00:21:39.571 --> 00:21:42.130 but I'm not aware of more or less 00:21:42.130 --> 00:21:44.710 [inaudible] improvement existing article. 00:21:46.790 --> 00:21:49.440 I think there's-- 00:21:49.440 --> 00:21:50.800 I forgot the name, I think, 00:21:50.800 --> 00:21:53.790 [Diego] in the Wikimedia Foundation research team, 00:21:53.790 --> 00:21:58.407 who's working a lot at the moment with section headings. 00:21:58.407 --> 00:22:01.420 But, yes, generally, the idea is the same. 00:22:01.420 --> 00:22:04.640 So instead of using them to make an average 00:22:04.640 --> 00:22:07.260 you could say, this is not like the average. 00:22:08.170 --> 00:22:09.680 That's very possible, yeah. 00:22:14.750 --> 00:22:18.330 (person 4) Hi, Lucie. I'm Érica Azzellini from Wiki Movement Brazil, 00:22:18.330 --> 00:22:20.130 and I'm very-- 00:22:20.130 --> 00:22:21.860 (Érica) Oh, can you hear me? 00:22:21.860 --> 00:22:24.680 So, I'm Érica Azzellini from Wiki Movement Brazil, 00:22:24.680 --> 00:22:26.560 and I'm really impressed with your work 00:22:26.570 --> 00:22:29.154 because it's really in sync 00:22:29.154 --> 00:22:32.540 with what we've been working on in Brazil with the Mbabel tool. 00:22:32.540 --> 00:22:33.950 I don't know if you heard about it? 00:22:33.950 --> 00:22:36.000 - Not yet. 
- (Érica) It's a tool that we use 00:22:36.020 --> 00:22:38.440 to automatically generate Wikipedia entries 00:22:38.440 --> 00:22:42.240 using Wikidata information in a simple way 00:22:42.240 --> 00:22:46.510 that can be replicated on other language Wikipedias. 00:22:46.510 --> 00:22:48.950 So we've been working on Portuguese mainly, 00:22:48.950 --> 00:22:51.860 and we're trying to get on English Wikipedia too, 00:22:51.860 --> 00:22:56.200 but it can be replicated on any language, basically, 00:22:56.200 --> 00:22:58.460 and I think then we could talk about it. 00:22:58.460 --> 00:23:00.460 Absolutely, it will be super interesting 00:23:00.460 --> 00:23:03.260 because the article placeholder is an extension already, 00:23:03.260 --> 00:23:06.130 so it might be worth integrating your efforts 00:23:06.130 --> 00:23:07.950 into the existing extension. 00:23:07.950 --> 00:23:12.620 Lydia is also fully for it, and... (laughs) 00:23:12.620 --> 00:23:13.930 And then because-- 00:23:13.930 --> 00:23:17.040 so one of the problems-- [Marius] correct me if I'm wrong-- 00:23:17.040 --> 00:23:20.310 we had was that article placeholder doesn't scale 00:23:20.310 --> 00:23:22.240 as well as it should. 00:23:22.240 --> 00:23:24.860 So article placeholder is not in Portuguese 00:23:24.860 --> 00:23:28.545 because we're always afraid it will break everything, correct? 00:23:29.460 --> 00:23:32.286 And then [Marius] is just taking a pause. 00:23:32.286 --> 00:23:35.420 - (Érica) Yeah, you should be careful. - Don't want to say anything about this. 00:23:35.420 --> 00:23:38.950 But, yeah, we should connect because I'd be super interested to see 00:23:38.950 --> 00:23:42.040 how you solve those issues and how it works for you.
00:23:42.040 --> 00:23:45.310 (Érica) I'm going to present on the second section 00:23:45.310 --> 00:23:48.350 of the learning talk about this project that we've been developing, 00:23:48.350 --> 00:23:50.620 and we've been using it on [GLAM-Wiki] initiatives 00:23:50.620 --> 00:23:52.440 and education projects already. 00:23:52.440 --> 00:23:54.480 - Perfect. - (Érica) So let's do that. 00:23:54.480 --> 00:23:56.440 Yeah, absolutely let's chat. 00:23:57.220 --> 00:23:58.274 (moderator) Cool. 00:23:58.274 --> 00:24:00.370 Some other questions on your projects? 00:24:02.460 --> 00:24:06.820 (person 5) Hi, my name is [Alan], and I think this is extremely cool. 00:24:06.820 --> 00:24:09.170 I had a few questions about 00:24:09.170 --> 00:24:13.110 generating Wiki sentences from neural networks. 00:24:13.110 --> 00:24:16.020 - Yeah. - (person 5) So I've come across 00:24:16.020 --> 00:24:19.240 another project that was attempting to do this, 00:24:19.240 --> 00:24:23.020 and it was essentially using [triples input and sentences output], 00:24:23.020 --> 00:24:25.510 and it was able to generate very fluent sentences. 00:24:25.510 --> 00:24:29.360 But sometimes they weren't... 00:24:30.370 --> 00:24:33.820 actually, they weren't correct, with regards to the triple. 00:24:33.820 --> 00:24:39.420 And I was curious if you had any ways of doing validity checks of this sort. 00:24:39.420 --> 00:24:43.040 Sometimes the triple is "subject, predicate, object," 00:24:43.040 --> 00:24:46.110 but the language model says, 00:24:46.110 --> 00:24:48.565 "Okay, this object is very rare, 00:24:48.565 --> 00:24:51.740 I'm going to say you were born in San Jose, 00:24:51.740 --> 00:24:55.060 instead of San Francisco or vice versa." 00:24:55.060 --> 00:24:58.880 And I was curious if you had come across this? 00:24:58.880 --> 00:25:01.510 So that's what we call hallucinations.
00:25:01.510 --> 00:25:05.080 The idea that there's something in a sentence 00:25:05.080 --> 00:25:07.690 that wasn't in the original triple and the data. 00:25:08.400 --> 00:25:11.350 What we do-- so we don't do anything about it, 00:25:11.350 --> 00:25:13.910 we just also realized that that's happening. 00:25:13.910 --> 00:25:15.910 It happens even more for low-resource languages, 00:25:15.910 --> 00:25:19.730 because we work across domains, so we are generating domain-independently. 00:25:19.730 --> 00:25:24.670 Traditional NLG work is always in the biography domain, usually. 00:25:24.670 --> 00:25:26.620 So that happens a lot 00:25:26.620 --> 00:25:29.510 because we just have little training data for the low-resource languages. 00:25:30.400 --> 00:25:32.800 We have a few ideas. 00:25:32.800 --> 00:25:36.840 It's one of the million topics I'm supposed to work on at the moment. 00:25:38.850 --> 00:25:42.550 One of them is to use entity linking and relation extraction 00:25:42.550 --> 00:25:44.440 to align what we generate 00:25:44.440 --> 00:25:46.640 with the triples we inputted in the first place, 00:25:46.640 --> 00:25:50.750 to see if it's off or the network generates information it shouldn't have 00:25:50.750 --> 00:25:54.090 or it cannot know about, basically. 00:25:54.090 --> 00:25:58.680 That's also all I can say about this because now time is over. 00:25:58.680 --> 00:26:01.480 (person 5) I'd love to talk offline about this, if you have time. 00:26:01.480 --> 00:26:03.260 Yeah, absolutely, let's chat about it. 00:26:03.260 --> 00:26:05.140 Thank you so much, everyone, it was lovely. 00:26:05.140 --> 00:26:06.600 (moderator) Thank you, Lucie. 00:26:06.600 --> 00:26:08.610 (applause)