Hi, I'm Lucie. You know me from rambling about there not being enough language data in Wikidata, and I thought instead of rambling today, which I'll leave to Lydia later today, I'll just show you a bit, or give you an insight into the projects we did using the data that we already have on Wikidata, for different causes. So, underserved languages: compared to the keynote we just heard, where the speaker was talking about underserved as, like, minority languages, underserved languages to me are any languages that don't have enough representation on the web. Yeah, just to get that clear.
So, who am I? Why am I always talking about languages on Wikidata? Not sure, but... I'm a Computer Science PhD student at the University of Southampton. I'm a research intern at Bloomberg in London at the moment. I'm a resident at Newspeak House in London. I am a researcher and project manager for the Scribe project, which I'll go into in a bit, and I recently got into the idea of oral knowledge and oral citations. Kimberly is sitting right there. And then, occasionally, I have time to sleep and do other things, but that's very rare. So if you're interested in any of those things, come and talk to me. Generally, this is an open presentation, so feel free to ask questions in between. I'll run through a lot of things in a very short time now. Come to me afterwards if you're interested in any of them. Speak to me. I'm here. I'm always very happy to speak to people.
So that's a bit of what we will talk about today: Wikidata, giving an introduction, even though that's obviously not as necessary; the article placeholder, which is aimed at Wikipedia readers; Scribe, which is aimed at Wikipedia editors; and then one topic of my research, which is completely outside of Wikipedia, where we use Wikidata for question answering.
So just a quick recap. Why is Wikidata so cool for low-resource languages? We have those unique identifiers; I'm speaking to people that know that much better than I do, even. And then we have labels in different languages. Those can be in over, I think, 400 languages by now, so we have a good option here to reuse language in different forms and capture it. Yeah, so that's a little bit of me rambling about Wikidata, because I can't stop it.
We compared Wikidata's coverage to the number of native speakers, so we can see, obviously, there are languages that are widely spoken in the world, like Chinese, Hindi, or Arabic, but then have very low coverage on Wikidata. And then the opposite: sorry, but there are the Dutch and the Swedish communities, which are super active on Wikidata, which is really cool, and that just points out that even though we have a low number of speakers, we can have a big impact if people are very active in the communities, which is really nice and really good. But also, let's try to even that graph out in the future.
So, cool. Now we have all this language data in Wikidata, and we have low-resource Wikipedias, so we thought, what can we do? Well, my undergrad supervisor is sitting here, and we worked back then, in the golden days, on something called the article placeholder, which takes triples from Wikidata and displays them on Wikipedia. And that's relatively straightforward: you just take the content of Wikidata and display it on Wikipedia to attract more readers, and then eventually more editors, in the different low-resource languages. The placeholders are dynamically generated, so they're not like stubs or bot articles that then flood the Wikipedia; people can edit them. It's basically a starting point.
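(As an illustrative aside, not from the talk itself: a minimal sketch of the "one identifier, labels in many languages" point, using the public Wikidata API. The item Q42 and the language codes below are just examples.)

    import requests

    # Ask Wikidata for the labels it stores for one item (Q42, as an example)
    # in a few languages; the same identifier carries all of them.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "Q42",
            "props": "labels",
            "languages": "ar|eo|sw|en",
            "format": "json",
        },
    )
    for lang, label in resp.json()["entities"]["Q42"]["labels"].items():
        print(lang, label["value"])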
And we thought, well, we have that content, and we have that knowledge somewhere already, which is Wikidata. It's often already there in those languages, but they don't have articles, so at least give people an insight into the information. The article placeholders are live on 14 low-resource Wikipedias. If you are part of a Wikipedia community and interested in it, let us know.
And then I went into research, and I got stuck with the article placeholder, though, so we started to look into text generation from Wikidata for Wikipedia and low-resource languages. And text generation is really interesting because, at the point when we started the project, research was focused completely on English, which is a bit pointless in my experience, because, I mean, you have a lot of people who write in English, but what we need is people who write in those low-resource languages. And our starting point was that looking at triples on Wikipedia is not exactly the nicest thing. I mean, as much as I love the article placeholder, it's not exactly what you want to see or expect when you open a Wikipedia page. So we try to generate text. We use this beautiful neural network model, where we encode Wikidata triples. If you're interested in the more technical parts, come and talk to me. And, realistically, with neural text generation, you can generate one or two sentences before it completely scrambles and becomes useless. So we generate one sentence that describes the topic of the triples. This, for example, is Arabic: we generate a sentence about Marrakesh, where it just describes the city.
So for that, then, we tested this. We did studies, obviously, to test if our approach works and if it makes sense to use such things. And because we are very application-focused, we tested it with actual Wikipedia readers and editors. So, first, we tested it with Wikipedia readers in Arabic and Esperanto, so use cases with Arabic and Esperanto. And we can see that our model can generate sentences that are very fluent and that feel very much, surprisingly a lot, actually, like Wikipedia sentences. So we train, for example for Arabic, on Arabic only, with the idea that we want to keep the cultural context of that language and not let it be influenced by other languages that have higher coverage.
And then we did a study with Wikipedia editors, because in the end the article placeholder is just a starting point for people to start editing, and we tried to measure how much of the sentences they would reuse, how much is useful for them, basically, and you can see that there is a lot of reuse, especially in Esperanto, when we tested with editors.
And finally, we also did qualitative interviews with Wikipedia editors across six languages. I think we had about ten people we interviewed. And we tried to get more of an understanding of the human perspective on those generated sentences. So we have a very quantified way of saying, yeah, they are good, but we wanted to see how the interaction goes, and especially what always happens in neural machine translation and neural text generation: you have those missing word tokens, which we put as "rare" in there. So those are the example sentences we used. All of them are about Marrakesh.
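(Another illustrative aside: the talk does not show the model itself, but the general shape of such a triple-to-sentence network, encoding the subject, predicate and object IDs and decoding one sentence in the target language, could look roughly like the sketch below. The class name, dimensions and vocabularies are invented for illustration.)

    import torch
    import torch.nn as nn

    class TripleToSentence(nn.Module):
        """Minimal sketch: encode one Wikidata triple, decode one sentence."""

        def __init__(self, n_entities, n_words, dim=128):
            super().__init__()
            self.triple_emb = nn.Embedding(n_entities, dim)  # subject/predicate/object ids
            self.word_emb = nn.Embedding(n_words, dim)       # target-language tokens
            self.bridge = nn.Linear(3 * dim, dim)            # one vector per triple
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, n_words)

        def forward(self, triples, target_tokens):
            # triples: (batch, 3) ids; target_tokens: (batch, seq) ids
            enc = torch.tanh(self.bridge(self.triple_emb(triples).flatten(1)))
            dec, _ = self.decoder(self.word_emb(target_tokens), enc.unsqueeze(0))
            return self.out(dec)                             # logits for each output word

Such a model would be trained on pairs of triples and single Wikipedia sentences in one language at a time, which mirrors the point about keeping each language's own cultural context.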
So we wanted to see how much people are bothered by it, what the quality is, and what things stand out to them, and we can see that the mistakes by the network, like those red tokens, are often just ignored. There is this interesting factor that, because we didn't tell them where we got the sentences from (it was on a user page of mine, but it looked like it was on Wikipedia), people just trusted it. And I think that's very important when we look into those kinds of research directions: we cannot override this trust in Wikipedia. So if we work with Wikipedians and Wikipedia itself, taking things from, for example, Wikidata, that's good, because it's also human-curated. But when we start with artificial intelligence projects, we have to be really careful about what we actually expose people to, because they just trust the information that we give them. We could see, for example, that the Arabic version gave the wrong location for Marrakesh, and people, even the people I interviewed that were living in Marrakesh, didn't pick up on that, because it's on Wikipedia, so it should be fine, right? (chuckles) Yeah.
We also found there was a magical threshold for the length of the generated text, especially in comparison with the content translation tool, where you have a long automatically generated text, and people were complaining that content translation was very hard because you're just doing post-editing, you don't have the creativity. There are other remarks on content translation I usually make; I'll skip them for now. So that one sentence was helpful, because even if we made mistakes, people were still willing to fix them, because it's a very short intervention. And then, finally, a lot of people pointed out that it was particularly good for new editors, for them to have a starting point, to have those triples, to have a sentence, so they have something to start from.
So after all those interviews were done, I thought, that's very interesting, what else can we do with that knowledge? And so we started a new project, exactly because there weren't enough yet. The new project we have is called Scribe, and Scribe focuses on new editors that want to write a new article, particularly people who haven't written an article on Wikipedia yet, and specifically also on low-resource languages. So the idea is that... that's the pixel version of me. All my slides are basically references to people in this room, which I really love. It feels like I'm home again. So, yeah, I want to write a new article, but I don't know where to start as a new editor, and so we have this project Scribe. A scribe was someone whose profession was writing in ancient Egypt. So the Scribe project's idea is that we want to give people, basically, a hand when they start writing their first articles. So give them a skeleton, a skeleton that's based on their language's Wikipedia, instead of just translating the content from another language's Wikipedia. The first thing we want to do is plan section titles, then select references for each section, ideally in the local Wikipedia language, and then summarize those references to give a starting point for writing. For the project, we have a Wikimedia Foundation project grant, so it just started, and we are very open to feedback in general. That was the very first, not-so-beautiful layout, but just for you to get an overview.
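(Purely as an illustration of the three steps just described, not Scribe's real code or API: a hypothetical skeleton with placeholder stubs; every name below is invented.)

    # Hypothetical skeleton of the three Scribe steps; all names are invented.

    def plan_section_titles(topic_qid, language):
        # Step 1: suggest section headers based on similar articles in that
        # language's Wikipedia (a sketch of this step appears further below).
        return ["Introduction", "History"]                  # placeholder output

    def select_references(topic_qid, section, language):
        # Step 2: pick references for the section, ideally in the local language.
        return ["https://example.org/some-source"]          # placeholder output

    def summarize_reference(url, language):
        # Step 3: a short summary as a starting point for writing, not final prose.
        return "One or two seed sentences."                 # placeholder output

    def scribe_draft(topic_qid, language="ar"):
        draft = {}
        for section in plan_section_titles(topic_qid, language):
            refs = select_references(topic_qid, section, language)
            draft[section] = [summarize_reference(r, language) for r in refs]
        return draft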
So there is this idea of collecting references, images from Commons, and section titles. The main thing we want to use Wikidata for is the sections. Basically, we want to see what articles on similar topics already exist in your language, so we can understand how the language community decided on structuring articles. And then we look for the images, obviously, where Wikidata is also a good place to go through. And then we made a prettier interface for it, because we decided to go mobile first. Most of the communities that we aim to work with are very heavy on mobile editing, so we have this mobile-first focus. And it also forces us to break the process down into steps, which eventually will lead to, yeah, I don't know, a step-by-step guide on how to write a new article. So an editor comes, they can select section headers based on existing articles in their language, write one section at a time, switch between the sections, and select references for each section. Yeah, so the idea is that we will have an easier editing experience, especially for new editors, to keep them in; integrate Wikidata information and [inaudible] images from Wikimedia Commons as well.
If you're interested in Scribe, I'm working on this project together with Hady. There are a lot of things online, but then also just come and talk to us. Also, if you're editing a low-resource Wikipedia, we're still looking for people to interview, because we're trying to emulate as much as we can what people already experience, or how they already edit. I'm not big on Wikipedia editing, and my native language is German, so I need a lot of input from editors that want to tell me what they need, what they want, where they think this project can go. And if you are into Wikidata, also come and talk to me, please.
Okay, so that's all the projects, or most of the projects, we did inside the Wikimedia world. And I want to give you one short overview of what's happening on my end of research around Wikidata as well. So I was part of a project that works a lot with question answering, and I don't know too much about question answering, but what I do know a lot about is knowledge graphs and multilinguality. So, basically, what we wanted to do is: we have a question answering system that gets a question from a user, and we wanted to select the knowledge graph that can answer the question best. And again, we focused on a multilingual question answering system. So if I want to ask something about Bach, for example, in Spanish and French, because those are the two languages I know best, then which knowledge graph has the data to actually answer those questions?
What we did was find a method to rank knowledge graphs based on the language metadata that appears in the knowledge graph, split by class. And then we look, for each class, into which languages are covered best, and then, depending on the question, we can suggest a knowledge graph. Of the big knowledge graphs we looked into that are well known and widely used, Wikidata covers the most languages of all, and we used a test bed. So we used a benchmark dataset called [QALD], which was originally for DBpedia, and we translated it for those five knowledge graphs into SPARQL queries. And then we gave that to a crowd and looked into which knowledge graph has the best answers for each of those SPARQL queries.
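(Illustrative aside: the published ranking method is more involved, but the core idea, comparing knowledge graphs by how well the question's class is labelled in the question's language, can be shown with a toy scoring function. All names and numbers below are invented.)

    def rank_knowledge_graphs(label_stats, question_language, question_class):
        """Rank graphs by the share of labels they hold in the question's
        language for the relevant class. Toy scoring, not the paper's method."""
        scores = {}
        for kg, per_class in label_stats.items():
            counts = per_class.get(question_class, {})
            total = sum(counts.values()) or 1
            scores[kg] = counts.get(question_language, 0) / total
        return sorted(scores, key=scores.get, reverse=True)

    # Invented numbers, just to show the shape of the input:
    stats = {
        "wikidata": {"composer": {"es": 900, "fr": 850, "en": 1000}},
        "other_kg": {"composer": {"es": 40, "fr": 60, "en": 1000}},
    }
    print(rank_knowledge_graphs(stats, "es", "composer"))  # ['wikidata', 'other_kg']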
And overall, the crowd workers preferred Wikidata's answers, because they are very precise, they are in most of the languages that the others don't cover, and they are not as repetitive or redundant as the [inaudible].
So, just to make a quick recap on the whole topic of Wikidata, languages, and the future. We can say that Wikidata is already widely used for numerous applications in Wikipedia, and then outside Wikipedia for research. What I talked about is just the things I do research on, but there is still so much more. There is machine translation using knowledge graphs, there is rule mining over knowledge graphs, there is entity linking in text. There is so much more research happening at the moment, and Wikidata is getting more and more popular for those uses. So I think we are at a very good stage to push and connect the communities, to get the best from both sides, basically. Thank you very much. If you want to have a look at any of those projects, they are there; my slides are on Commons already. If you want to read any of the papers, I think all of them are open access. If you can't find any of them, write me an email and I'll send it to you immediately. Thank you very much.
(applause)
(moderator) Okay, are there any questions?
- (moderator) I'll come around. - (person 1) Shall I come to you?
(person 1) Hi Lucie, thank you so much, I'm so glad to see you taking this forward. Now I'm really curious about Scribe. The example here with the university was the idea that the person says, "This is a university," and then you go to Wikidata and say, "Oh gosh! Universities have places and presidents, and I don't know what," and that you're using these as the parts for telling the person what to do.
So, basically, the idea is that someone says, "I want to write about Nile University." We look into Nile University's Wikidata item, and let's say-- I work a lot with Arabic-- so let's say we then go to Arabic Wikipedia, so we can make a grid, basically, of all items that are around Nile University. So they are also universities, also universities in Cairo, or also universities in Egypt, stuff like that, or they have similar topics. So we can look into all the similar items on Wikidata, and if they already have a Wikipedia entry on Arabic Wikipedia, we can look at the section titles.
- (person 1) (gasps)
- Exactly, and then we can basically work out the most common way of writing about a university in Cairo on Arabic Wikipedia.
- Yeah, so that's the-- - (person 1) Thank you, [inaudible].
(person 2) Hi, thank you so much for your inspiring talk. I was wondering if this would work for languages in the Incubator? Like, I work with really low, low, low, low-resource languages, and this thing about doing it mobile would be a huge thing, because in many communities they only have phones, not laptops. So, would it work?
So I think, to an extent-- the general structure, the skeleton of the application, would work. There are two things that we're thinking about a lot at the moment for exactly those use cases. One is how much we would want, for example, to say: if there are no articles on a similar topic in your Wikipedia, how much do we want to get from other Wikipedias. And that's why I'm basically doing those interviews at the moment, because I try to understand how much people already look at other-language Wikipedias to make the structure of an article. Are they generally equal, or do they differ a lot based on cultural context?
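(To make the Nile University walk-through above concrete, here is a rough sketch, not Scribe's actual code, of how one could collect the most common section headings of same-class articles in a target Wikipedia, using the public Wikidata query service and the MediaWiki API. Error handling and smarter notions of "similar item" are left out.)

    import requests
    from collections import Counter
    from urllib.parse import unquote

    def common_section_titles(topic_qid, wiki="ar"):
        # Find items sharing an instance-of class (P31) with the topic that
        # already have an article in the target-language Wikipedia.
        query = """
        SELECT DISTINCT ?article WHERE {
          wd:%s wdt:P31 ?class .
          ?item wdt:P31 ?class .
          ?article schema:about ?item ;
                   schema:isPartOf <https://%s.wikipedia.org/> .
        } LIMIT 50
        """ % (topic_qid, wiki)
        rows = requests.get("https://query.wikidata.org/sparql",
                            params={"query": query, "format": "json"},
                            headers={"User-Agent": "scribe-sketch/0.1"}).json()
        titles = [unquote(r["article"]["value"].split("/wiki/")[-1])
                  for r in rows["results"]["bindings"]]

        # Count which section headings those existing articles use most often.
        headings = Counter()
        api = "https://%s.wikipedia.org/w/api.php" % wiki
        for title in titles:
            data = requests.get(api, params={"action": "parse", "page": title,
                                             "prop": "sections",
                                             "format": "json"}).json()
            for section in data.get("parse", {}).get("sections", []):
                headings[section["line"]] += 1
        return headings.most_common(10)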
So that would be something to consider, but there is a possibility to say: we take everything from all the language Wikipedias and then make an average, basically. And the other problem is referencing. That's something we're finding; it's quite convenient for us because we work a lot with Arabic, and Arabic actually has the problem that there are a lot of references, but they are very little used, or not widely used, on Wikipedia. That's not true, obviously, for all languages, and that's something I'd be very interested in-- like, let's talk. That's what I'm trying to say: I'd be very interested in your perspective on it, because I'd like to know, yeah, what you think about referencing from English or any other language.
(person 2) Have you ever tried-- what we do is we normally reference interviews we have. We put them in our repository, an institutional repository, because these languages don't have written references, and I feel like that is the way to go, but--
I'm currently also-- Kimberly and I are discussing this a lot. We did a session at Wikimania on oral knowledge and oral citations. Yeah, we should hang out and have a long conversation. (laughs)
(person 3) So, [Michael Davignon]. We'll talk about a medium-size community, which is probably around ten people, so it's medium for the Breton Wikipedia. And I'm wondering if we can use Scribe the other way around: for an existing article, to find the common plan, [the outer layers] that's supposed to be the best plan, to more or less [inaudible] improve the existing article.
I think there's-- I forgot the name-- I think [Diego] in the Wikimedia Foundation research team, who's working a lot at the moment with section headings. But, yes, generally, the idea is the same. So instead of using them to make an average, you could say, this one is not like the average. That's very possible, yeah.
(person 4) Hi, Lucie. I'm Érica Azzellini from Wiki Movement Brazil, and I'm very--
(Érica) Oh, can you hear me? So, I'm Érica Azzellini from Wiki Movement Brazil, and I'm really impressed with your work, because it's really in sync with what we've been working on in Brazil with the Mbabel tool. I don't know if you heard about it?
- Not yet.
- (Érica) It's a tool that we use to automatically generate Wikipedia entries using Wikidata information in a simple way that can be replicated on other Wikipedia languages. So we've been working on Portuguese mainly, and we're trying to get it onto English Wikipedia too, but it can be replicated in any language, basically, and I think then we could talk about it.
Absolutely, that would be super interesting, because the article placeholder is an extension already, so it might be worth integrating your efforts into the existing extension. Lydia is also fully for it, and... (laughs) And then because-- so one of the problems we had-- [Marius], correct me if I'm wrong-- was that the article placeholder doesn't scale as well as it should. So the article placeholder is not on Portuguese Wikipedia because we're always afraid it will break everything, correct? And then [Marius] is just taking a pause.
- (Érica) Yeah, you should be careful.
- Don't want to say anything about this.
But, yeah, we should connect, because I'd be super interested to see how you solve those issues and how it works for you.
(Érica) I'm going to present in the second session of the lightning talks about this project that we've been developing, and we've been using it on [GLAM-Wiki] initiatives and education projects already.
- Perfect.
- (Érica) So let's do that.
Yeah, absolutely, let's chat.
(moderator) Cool. Any other questions on your projects?
(person 5) Hi, my name is [Alan], and I think this is extremely cool. I had a few questions about generating Wiki sentences from neural networks.
- Yeah.
- (person 5) So I've come across another project that was attempting to do this, and it was essentially using [triples as input and sentences as output], and it was able to generate very fluent sentences. But sometimes they weren't... actually, they weren't correct with regard to the triple. And I was curious if you had any ways of doing validity checks for this. Sometimes the triple is "subject, predicate, object," but the language model says, "Okay, this object is very rare, I'm going to say you were born in San Jose instead of San Francisco," or vice versa. And I was curious if you had come across this?
So that's what we call hallucinations: the idea that there's something in a sentence that wasn't in the original triples and the data. What we do-- so we don't do anything about it, we just also realized that it's happening. It happens even more for the low-resource languages, because we work across domains, so we are generating domain-independently; traditional NLG work is usually always in the biography domain. So it happens a lot, because we just have little training data in the low-resource languages. We have a few ideas; it's one of the million topics I'm supposed to work on at the moment. One of them is to use entity linking and relation extraction to align what we generate with the triples we inputted in the first place, to see if it's off, or if the network generates information it shouldn't have or cannot know about, basically. That's also all I can say about this, because now the time is over.
(person 5) I'd love to talk offline about this, if you have time.
Yeah, absolutely, let's chat about it. Thank you so much, everyone, it was lovely.
(moderator) Thank you, Lucie.
(applause)
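(One last illustrative aside on the hallucination question: in its very simplest surface form, the alignment idea mentioned above could look like the check below. The IDs, labels and sentence are made up; real entity linking and relation extraction would be needed to catch substituted facts reliably, and this only flags one direction, input facts that never show up in the output.)

    def flag_unverbalized_triples(sentence, triples, labels):
        # Return input triples whose object label never appears in the
        # generated sentence; those sentences deserve a closer look, because
        # the model may have substituted something else (a hallucination).
        sentence = sentence.lower()
        return [(s, p, o) for s, p, o in triples
                if labels.get(o, o).lower() not in sentence]

    # Made-up example data:
    triples = [("Q_CITY", "country", "Q_COUNTRY"), ("Q_CITY", "located in", "Q_REGION")]
    labels = {"Q_CITY": "Marrakesh", "Q_COUNTRY": "Morocco", "Q_REGION": "Marrakesh-Safi"}
    print(flag_unverbalized_triples("Marrakesh is a city in Morocco.", triples, labels))
    # -> [('Q_CITY', 'located in', 'Q_REGION')]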