(Lydia) Thank you so much. So, this conference, one of the big themes is languages. I want to give you an overview of where we actually are currently when it comes to languages and where we can go from here.

Wikidata is all about giving more people more access to more knowledge, and language is such an important part of making that a reality, especially since more and more of our lives depend on technology. And as our keynote speaker was saying earlier today, some of that technology leaves people behind simply because they can't speak a certain language, and that's not okay. So we want to do something about that. And in order to change that, you need at least two things. One is you need to provide content to people in their language, and the second thing is you need to provide them with interaction in their language in those applications or whatever it is you have. And Wikidata helps with both of those. The first thing, content in your language, is basically what we have in items and properties, how we describe the world. Now, this is certainly not everything you need, but it gets you quite far. The other thing is interaction in your language, and that's where lexemes come into play: if you want to talk to your digital personal assistant, or if you want to have your device translate a text, and things like that.

Alright, let's look into content in your language, so what we have in items and properties. For this, the labels on those items and properties are crucial. We need to know what this entity is called that we're talking about. Instead of talking about Q5, someone who speaks English knows that's a "human," someone who speaks German knows that's a "Mensch," and so on. So those labels on items and properties are bridging the gap between humans and machines, and between humans and humans, making more existing knowledge accessible to them.

Now, that's a nice aspiration. What does it actually look like? It looks like this. What you're seeing here is that most of the items on Wikidata have two labels, so labels in two languages. After that, it's one, and then three, and then it becomes very sad. (quiet laughter) I think we need to do better than this. But, on the other hand, I was actually expecting this to be even worse. I was expecting the average to be one. So I was quite happy to see two. (chuckles) Alright.

But it's not just interesting to know how many labels our items and properties have. It's also interesting to see in which languages. Here you see a graph of the languages that we have labels for on items. The biggest part there is Other. I just took the top 100 languages, and everything else is Other, to make this graph readable. And then there's English and Dutch, French, and, not to forget, Asturian. - (person 1) Whoo! - Whoo-hoo, yes! So what you see here is quite an imbalance and still quite a lot of focus on English. If you look at the same thing for properties, it's actually looking better. And I think part of that is simply that there are way fewer properties, so even smaller communities have a chance to keep up with that. But it's also a pretty important part of Wikidata to localize into your language. So that's good. What I want to highlight here with Asturian is that a small community can really make a huge difference with some dedication and work, and that's really cool.

A small quiz for you. If you take all the properties on Wikidata that are not external identifiers, which one has the most labels, that is, labels in the most languages?
(audience) [inaudible] I hear some agreement on instance of? You would be wrong. It's image. (chuckles) So, yeah, that tells you: if you speak one of the languages where instance of doesn't yet have a label, you might want to add it. It has 148 labels currently. Alright, on to the next slide.

This graph tells us something about how much content we are making available in a certain language and how much of that content is actually used. What you're seeing is basically a curve with most content having English labels, being available in English, and being used a lot. And then it kind of goes down. But, again, what you can see are outliers that have a lot more content than you would necessarily expect, and that is really, really good. The problem still is that it's not used a lot. Asturian and Dutch should be higher, and I think helping those communities increase the use of the data they collected is a really useful thing to do. Another good thing this analysis and others showed us is that highly used items also tend to have more labels -- or the other way around; it's not entirely clear.

And then the question is, are we serving just the powerful languages? Or are we serving everyone? What you see here is a grouping of languages. The languages that are grouped together tend to have labels together. And you see it clustering. Now here's a similar clustering, colored based on how alive, how used, how endangered each language is. And a good thing you're seeing here is that safe languages and endangered languages do not form two different clusters. They're all mixed together, which is much better than the alternative, where the safe languages, the powerful languages, are just helping each other out. No, that's not the case. And that's a really good thing. When I saw this, I thought this was very good.

Here's a similar thing where we looked at each language's status and how many labels it has. What you're seeing is a clear win for safe languages, as is expected. But what you're also seeing is that the languages in categories 2 and 3, and maybe even 4, are actually not that bad in terms of their representation in Wikidata. That's a really good thing to find. Now, if you look at the same thing for how much of that content, of those labels, is actually used on Wikipedia, for example, then we see a similar picture emerging again. And it tells us that those communities are actually making good use of their time by filling in labels for highly used items, for example. There are outliers where I think we can help those communities find the places where their work would be most valuable. But, overall, I'm happy with this picture.

Now, that was the items and properties part of Wikidata. Let's look at interaction in your language, so the lexeme part of Wikidata, where we describe words and their forms and their meanings. We've been doing this since May last year, and content has been growing. You can see here in blue the lexemes, in red the forms on those lexemes, and in yellow the senses on those lexemes. Some communities -- we'll get to that later -- have spent a lot of time creating forms and senses for their lexemes, which is really useful because that builds the core of the data set that you need. Now, we looked at all the languages that have lexemes on Wikidata, so languages we have words for; that's right now 310 languages. Now, what do you think is the top language when it comes to the number of lexemes currently in Wikidata?
(audience) [inaudible] Huh? (person 2) German. Sorry, I've heard it before. It's Russian. Russian is quite far ahead. And just to give you some perspective -- there are different opinions, but I've read, for example, that 1,000 to 3,000 words gets you roughly to conversation level in another language, and 4,000 to 10,000 words to an advanced level. So we still have a bit of catching up to do there. One thing I want you to pay attention to is Basque here, with roughly 10,000 lexemes. Now, if you look at the number of forms for those lexemes, Basque is way up there, which is really cool, and you should go to the talk that explains why that is the case. And if you look at the number of senses, so what words mean, Basque even gets to the top of the list. I think that deserves a round of applause. (applause)

Another short quiz. What's the lexeme with the most translations currently? (audience) Cats, cats, [inaudible], Douglas Adams, [inaudible] All good guesses, but no. It's this, the Russian word for "water."

Alright, so now we talked a lot about how many lexemes, forms, and senses we have, but that's just one thing you need. The other thing you need is to actually describe those lexemes, forms, and senses in a machine-readable way. And for that you have statements, like on items. One of the properties you use is usage example, so whoever is using that data can understand how to use that word in context; that could be a quote, for example. And here, Polish rocks. Good job, Polish speakers. Another property that's really useful is IPA, so how do you pronounce this word. Russian apparently has lots of IPA statements. But, again, Polish is second. And last but not least, we have pronunciation audio. Those are links to files on Commons where someone speaks the word, so you can hear a native speaker pronounce it in case you can't read IPA, for example. And there's a really nice Wikibase-powered project called Lingua Libre where you can go and help record words in your language that can then be added to lexemes on Wikidata, so other people can understand how to pronounce your words. (person 2) [inaudible] If you search for "Lingua Libre," you'll find it, and I'm sure someone can post it in the Telegram channel. Those guys rock. They did really cool stuff with Wikibase.

Alright. Then the question is, where do we go from here? Based on the numbers I've just shown you, we've come a long way towards giving more people more access to more knowledge when looking at languages on Wikidata. But there is also still a lot of work ahead of us. Some of the things you can do to help, for example, are running label-a-thons, that is, getting people together to label items on Wikidata, or doing an edit-a-thon around lexemes in your language to get the most used words in your language into Wikidata. Or you can use a tool like Terminator that helps you find the most important items in your language that are still missing a label, "most important" being measured by how often an item is used in other Wikidata items as links in statements. And, of course, for the lexeme part, now that we've got a basic coverage of those lexemes, it's also about building them out, adding more statements to them, so that they can actually form the base for meaningful applications to be built on top. Because we're getting closer to that critical mass, but we're not yet at the point where you can build serious applications on top of it. And I hope all of you will join us in doing that.
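(As a concrete illustration of that kind of gap-finding: this is not an official tool, just a minimal sketch of a script against the public Wikidata Query Service that lists lexemes of one language that still have no senses, in the same spirit as Terminator's missing-label lists. Q8752, Basque, is only an example language ID, and the script's names are illustrative.)

```python
import requests

# Minimal sketch: find lexemes of one language that still lack senses,
# via the public Wikidata Query Service. The dct:, wikibase:, and
# ontolex: prefixes are predefined on that endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q8752 ;   # Q8752 = Basque; swap in your language
          wikibase:lemma ?lemma .
  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense . }
}
LIMIT 50
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexeme-gap-finder/0.1 (example script)"},
)
response.raise_for_status()

# Each row is a lexeme URI plus its lemma: an entry someone could go add senses to.
for row in response.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```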
And that already brings me to a little help from our friends. Bruno, do you want to come over and talk to us about lexical masks?

(Bruno) Thank you, Lydia, thank you for giving me this short period of time to present this work that we are doing at Google with Denny, whom most of you have probably heard of or know. I'm a linguist at Google, so I'm very happy to be here amongst other language enthusiasts. We are also building some lexicons, and we have built this technology, or this approach, that we think can be useful for you.

Just to give you a little bit of background -- this is my lexicographic background talking here. When we build lexicon databases, we have a hard time maintaining them, keeping them consistent, and exchanging data, as you probably know. There have been several attempts to unify the features and the properties that describe those lexemes and those forms; it's not a solved problem, but there are some unification attempts on that side. But what is really missing -- and this is a problem we had at the beginning of our project at Google -- is an internal structure that describes what a lexical entry should look like: what kind of data or what kind of information we have, and what specifications are expected. So this is what we came up with: this thing called a lexical mask.

A lexical mask describes what is expected for an entry, a lexicographic entry, to be complete, both in terms of the number of forms you expect for a lexeme and the features you expect for each of those forms. Here is an example for Italian adjectives. In Italian, you expect to have four forms for your adjectives, and each of these forms has a specific combination of gender and number features. This is what we expect for Italian adjectives. Of course, you can have extremely complex masks, like the French verb conjugation, which is quite extensive, and I won't show you the Russian mask because it doesn't fit on the screen. We also have some detailed specifications, because we distinguish what is at the form level. So here you have Russian nouns, which have three numbers and a number of cases with different forms, but they also have an entry-level specification that says a noun in particular has an inherent gender and an inherent animacy feature, which is also specified in the mask. We also want to distinguish that a mask gives a specification for what, in general, an entry should look like, but you can have smaller masks for defective forms or defective lexemes that occur in language. So here is the simplest version for French verbs that have only the 3rd person singular: all the weather verbs, like "it rains" or "it snows," like in English. So we distinguish these two levels.

And how we use this at Google is that when we have a lexicon that we want to use, we quite literally throw the lexicon, all the entries, through the mask and see which entries have a problem in terms of structure. Are we missing a form? Are we missing a feature? And when there is a problem, we do some human validation; otherwise we just see that it passes the mask. So it's an extremely powerful tool to check the quality of the structure. What we are happy to announce today is that we got the green light to open-source our masks. These are schemas that we can release and that we will provide to Wikidata as ShEx files.
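(Purely to illustrate what such a mask check does: the actual released masks are ShEx schemas, and the feature names below are invented for the example rather than Google's or Wikidata's real vocabulary. A minimal Python sketch of throwing the Italian adjective example from above through its mask:)

```python
# Sketch of the lexical-mask idea. The mask for Italian adjectives from the
# talk: four forms, each with a specific combination of gender and number.
ITALIAN_ADJECTIVE_MASK = {
    frozenset({"masculine", "singular"}),
    frozenset({"masculine", "plural"}),
    frozenset({"feminine", "singular"}),
    frozenset({"feminine", "plural"}),
}

def throw_through_mask(forms):
    """Check an entry against the mask: report which expected feature
    combinations are missing and which present ones are unexpected.
    `forms` maps each form's spelling to its set of features."""
    present = {frozenset(features) for features in forms.values()}
    missing = ITALIAN_ADJECTIVE_MASK - present
    unexpected = present - ITALIAN_ADJECTIVE_MASK
    return missing, unexpected

# "nuovo" (new), entered with its feminine plural form forgotten:
entry = {
    "nuovo": {"masculine", "singular"},
    "nuovi": {"masculine", "plural"},
    "nuova": {"feminine", "singular"},
}
missing, unexpected = throw_through_mask(entry)
print("missing:", [sorted(m) for m in missing])        # [['feminine', 'plural']]
print("unexpected:", [sorted(u) for u in unexpected])  # []
```

(Entry-level checks, like the inherent gender and animacy of Russian nouns mentioned above, would be a second dictionary of expected lexeme features checked the same way.)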
This is a ShEx file for German nouns, and Denny is working on the conversion from our internal specification to a more open-source specification. We currently cover more than 25 languages. So we expect to grow on our side, but we also see this as an opportunity to collaborate on other languages. And one of the ongoing collaborations is the one Denny has with Lukas. Lukas has these great tools that provide a UI to help the user, the contributor, add more forms. So if you want to add an adjective in French, the UI tells you how many forms are expected and what kind of features each form should have. So our masks will help that tool to be defined and expanded. That's it.

(Lydia) Thank you so much. (applause) Alright. Are there questions? Do you want to talk more about lexemes? - (person 3) Yes. - Yes. (chuckles)

(person 3) My question: you were talking about giving more access to more people in more languages, but there are a lot of languages that can't be used in Wikidata. So what solution do you have for that?

When you say they can't use Wikidata, are you talking about entering labels? - (person 3) Labels, descriptions. - Right. So, for lexemes, it's a bit different, because there we don't have that restriction. For labels on items and properties, there is some restriction, because we wanted to make sure it's not the case that anyone can do anything and it becomes unmanageable. But even for a small community that wants one language and wants to work on that: come talk to us, we will make it happen.

(person 3) I mean, we did this at the Prague Hackathon in May, and it took us until almost August to be able to use our language. - Yeah. - (person 3) So, it's very slow.

Yeah, it is, unfortunately, very slow. We're currently working with the Language Committee on solving some fundamental things, like getting agreement on what kinds of languages are actually "allowed," and that has taken too long, which is the reason why your request probably took longer than it should have. (person 3) Thanks.

(person 4) Thank you. Lydia, if you remember the statistics that you showed, the number of lexemes per language: did you count all the forms as data points, or only lexemes? (Lydia) Do you mean this? Which one do you mean? (person 4) Yes, exactly. If you remember, does this number include all the forms for all the lexemes, or just how many lexemes there are?

No, this is just the number of lexemes.

(person 4) Just the number of lexemes, okay. So then it is a fair statistic, because if it also counted the forms -- that's why I'm asking -- then all the languages with inflectional morphology, like Russian, Serbian, Slovenian, et cetera, would have a natural advantage, because they have so many.

So that kicks in here, on this number of forms. (person 4) Yeah, that was this one. Thank you.

(person 5) So, I had a quick question about the... when we're talking about the actual items and properties. As far as I understand, there is currently no way to give an actual source for any of the labels and descriptions that are given. So, for example, when you're talking about an item or property, you can get conflicting labels. Yes. (person 5) So this person is like... we were talking about indigenous things before, for example. So this person is a Norwegian artist according to this source, and a Sami artist according to this source.
Or, for example, in Estonian, we had an issue where we had to change terminology to the officially used terminology in official lexicons, but we have no way to really indicate why -- what the source of this was, why this was better, and what was there before. To anyone who sees it, it was just me as a random person switching the thing. So is there a plan to make this possible in any way, so that we can actually have proper sources for the language data?

So, it is partially possible. For example, when you have an item for a person, you have statements -- first name, last name, and so on, of that person -- and then you can provide the reference there. I'm quite hesitant to add more complexity with references on labels and descriptions, but if people really, really think this is something that isn't covered by a reference on a statement, then let's talk about it. But I fear it will add a lot of complexity for what I hope are few cases, though I'm willing to be convinced otherwise if people feel very strongly about this.

(person 5) I mean, if it's added, it probably shouldn't be the default shown to all users in the beginner interface, in any case. More like, "Click here if you need to say a specific thing about this."

Do we have a sense of how many times that would actually matter?

(person 5) In Estonian, for example -- and I expect this is true of other languages as well -- there is an official name that is the actual legitimate translation into English of, say, a specific kind of municipality. That was my use case: we were using the word "parish," where the original Estonian word meant something like a church parish, and that was the origin, but that's not the official translation Estonia uses right now.

In this case, I would just add it as an official name statement and add the reference there. (person 5) Okay.

More questions, yes?

(person 6) I have two quick comments. You specifically called out Asturian as a language that does well, and I think that's an artifact. Tell me about it. (person 6) I think it's just a bot that pasted person names, like proper names, and said, "Well, this is exactly like in French or Spanish," and just massively copied them. One point of evidence is that you don't see that energy in Asturian in things that actually require translation, like property names, or names of items that are not proper names.

Asaf, you break my heart.

(person 6) I know, I like raining on parades, but I have good news as well, which is about the pronunciation numbers. As you probably know, Commons is full of pronunciation files; Dutch, for example, has no fewer than 300,000 pronunciation files already on Commons that just need to somehow be ingested. So if anyone's looking for a side project, there are tons and tons of classified, categorized pronunciation files on Commons under the category "Pronunciation," by language. That's just waiting to be matched to lexemes and put on Lexeme. And I was wondering if you could say something about the roadmap, about how much investment or what we can expect for Lexeme in the coming year, because I, for one, can't wait.

You can't wait? (chuckles) - (person 6) For more. - Yes. (chuckles) Right now, we're concentrating more on Wikibase and data quality, to see how much traction this gets, getting more feedback on where the pain points are next, and then going back to improving lexicographical data further.
And one of the things I'd love to hear from you is where exactly you see the next steps, where you want to see improvements, so that we can then figure out how to make that happen. But, of course, you're right, there's still so much to do on the technical side as well.

(person 7) Okay, as we were uploading the Basque words with forms -- and you'll see some of these kinds of things -- last week we said, "Oh, we are the first ones in something." It appeared in the press, and it's like, "Oh, Basque is the first in something, okay." (laughs) And then people ask, "Okay, but what is this for?" We don't have a really good answer. I mean, it's like, okay, this will help computers understand our language better, yes, but what kind of tools can we make in the future? We don't have a good answer for this. So I don't know if you have a good answer for this. (chuckles)

I don't know if I have a good answer, but I have an answer. So I think right now, as I was telling [inaudible], we haven't reached that critical mass where you can build a lot of the really interesting tools. But there are already some tools. Just the other day, Esther [Pandelia], for example, released a tool where you can see, I think it was the words on a globe, where they're spoken, where they're coming from. I'm probably wrong about this, but she announced it on the Project chat on Wikidata -- you can look it up there. So we have seen these first tools, just like we saw back when Wikidata started. First something like just a network: "Hey, look, there's this thing that connects to this other thing." And as we got more data, and as we reached some critical mass, more powerful applications became possible, things like Histropedia, things like question answering in your digital personal assistant, Platypus, and so on. And we're seeing a similar thing with lexemes. We're at the "hey, look, there's a connection between these two things, and there's a translation of this word into that language" stage, and as we build it out and as we describe more words, more becomes possible.

Now, what becomes possible? As Ben, our keynote speaker, was saying earlier about translations: being able to translate from one language to another. And Jens, my colleague, is always talking about the European Union looking for a translator who can translate from, I think it was, Maltese to Swedish-- - (person 8) Estonian. - Estonian. And that is not a usual combination. But once you have all these languages in one machine-readable place, you can do that: you can get a dictionary from Estonian to Maltese and back. So, covering language combinations in dictionaries that just haven't been covered before, because there wasn't enough demand to make it financially viable and to justify the work. Now we can do that. Then text generation. Lucie was talking earlier about how she's working with Hattie on generating text to get Wikipedia articles in minority languages started, and that needs data about words, and you need to understand the language to do that. Yeah, and those are just some ideas that come to my mind right now. Maybe our audience has more ideas about what they want to do when we have all the glorious data.

(person 9) Okay, I will deviate from the lexemes topic.
I will ask the question: how can I, as a member of the community, influence that priority is put on the task that a new user can come and indicate what languages they want to see and edit, without some secret Babel template knowledge? Maybe this year there will be the technical wishlist without Wikipedia topics. Maybe there's hope we can all vote on this thing we haven't fixed for seven years. So do you have any ideas and comments about this?

So you're talking about the fact that someone who is not logged into Wikidata can't change their language easily? (person 9) No, for logged-in users. So, if they are logged in, they can just change their language at the top of the page, and then it will appear where the labels and descriptions [inaudible] are, and they can edit it. (person 9) Well, actually, many times the workflow is that you want to have multiple languages available, and that's not always the case. Okay, maybe we should sit down after this talk and you show me. Cool.

More questions? Yes.

(person 10) Thanks for the presentation. Can you comment on the state of the collaboration with the Wiktionary community? As far as I've seen, there were some discussions about importing some elements of the work, but there seem to be licensing issues and some disagreements, et cetera.

Right. So, Wiktionary communities have spent a lot of time building Wiktionary. They have built amazingly complicated and complex templates that build pretty tables that automatically generate forms for you, and all kinds of really impressive and kind of crazy stuff, if you think about it. And, of course, they have invested a lot of time and effort into that. Understandably, they don't want that to just be grabbed, just like that. So there's some of that coming from there. And that's fine, that's okay. Now, the first Wiktionary communities are coming around and importing some of their data into Wikidata. Russian, as you have seen, for example, is one of those cases, and I expect more of that to happen. But it will be a slow process, just like the adoption of Wikidata's data on Wikipedia was a rather slow process. The other side is making it easier to actually use the data that is in lexemes on Wiktionary, so that they can make use of it and share data between the language Wiktionaries, which is super hard to impossible right now -- which is crazy, just like it was on Wikipedia. Wait for the birthday present. (chuckles) Yes.

(person 11) I was thinking about it the other way around. I actually didn't want to say it because I think this might be super silly, but Wiktionary already has some content, and I know that we can't transfer it to Wikidata because there's a difference in licenses. But I was thinking maybe we can do something about that. Maybe, I don't know, we can obtain the communities' permission after having, like, a public vote, where the active members of the community vote and say whether they would like or accept to transfer the content to the Wikidata lexemes. Because I just think it is such a waste.

So, that's definitely a conversation that those people who are in Wiktionary communities are very welcome to bring up there. I think it would be a bit presumptuous for us to go and force that. But, yeah, I think it's definitely worth having that conversation.
But I think it's also important to understand that there's a distinction between what is actually legally allowed, what we should be doing, and what those people want or do not want. So even if it's legally allowed, if some Wiktionary communities do not want that, I would be careful, at least. I think you need the mic for the stream.

(person 12) So, obviously, it's all very exciting, and I immediately think: how can I take this to my students, and how can I incorporate it into the courses, the work that we're doing in educational settings? And, first of all, I don't have enough knowledge at the moment, but I think the documentation that we do have could maybe be improved. So that's a kind of request to make cool videos that explain how it works, because if we have those, we can then use them, and we can get students on board, and we can make people understand how awesome it all is. And yeah, just think about documentation and think about education, please, because I think a lot could be done. There are many tasks that could be done even with... well, I wouldn't say primary schools, but certainly with even younger students. So I would really like to see that potential being tapped into, and, as of now, I personally don't understand enough to be able to create tasks or to do something practical with it. So any help, any thoughts anyone here has about that, I would be very happy to hear, and yours as well.

Yeah, let's talk about that. More questions? Someone else raised a hand. I forgot where it was.

(person 13) So, if we can't import from Wiktionary, is there some concerted effort to find other public domain sources, maybe all the data, and kind of prefilter it and organize it so that it's easy for people to check for import?

So there are first efforts. My understanding is that Basque is one of those efforts. Maybe you want to say a bit more about it?

(person 14) [inaudible] Okay, the actual answer is paying for that... I mean, we have an agreement with a contractor we usually work with. They do dictionaries -- lots of stuff, but they do dictionaries. So we agreed with them to make the students' dictionary free; we would [cast] the most common words and start uploading them with an external identifier and the scheme of things. But there was some discussion about releasing it as CC0, because they have the dictionary as CC BY, and they understood what the difference was. So there was some discussion. But I think that we can provide some tools or some examples in the future, and I think that there will be other dictionaries that we can handle. And I also think Wiktionary should start moving in that direction, but that's another great discussion.

And on top of that, Lea is also in contact with people from Occitan who work on Occitan dictionaries, and they're currently working on a similar collaboration. More questions?

(person 15) Hi! We are the people who want to import Occitan data. Aha! Perfect! (person 15) And we have a small problem. We don't know how to represent the variety of our lexemes. We have six dialects, and we want to indicate for each lexeme in which dialect it's used, and we don't have a property to do that. So as long as the property doesn't exist, it prevents us from [inaudible], because we will need to do it again when we are able to [export] the statement. And it's complicated because it's a statement that won't be asked for by many people, because it's a statement that concerns mostly minority languages.
So you will have one person asking for this. But as with our Basque colleagues, it can be one person who empowers thousands of others, so it might not be asked for a lot, but it will be very important for us.

Do you already have a new property proposal up, or do you need help creating it? (person 15) We asked four months ago. Alright, then let's get some people to help out with this property proposal. I'm sure there are enough people in this room to make this happen. (person 15) Property proposal [speaking in French]. (person 16) We didn't have an answer. (person 15) We didn't have any answer, and we don't know how to do this, because we aren't in the Wikidata community. Yup, so there are people here who can help you. Maybe someone raises their hand to take-- (person 14) I'm up for that. But I think it's quite interesting that not only the variant of the form but also handling it geographically, with coordinates or some kind of mapping, and also having different pronunciations -- I think this is something that happens in lots of languages. We should start making it happen [inaudible], and I'm going to search for the property. Cool. So you will get backing for your property proposal. Thank you.

Alright, more questions? Finn. Finn is one of those people who build stuff on top of lexicographical data.

(Finn) It's just a small question, and it's about spelling variations. It seems to be difficult to put them in... You could, of course, have multiple forms for the same word. I don't know, it seems to be... If you don't do it that way, it seems to be difficult to specify... or I don't know whether this is just a minor technical issue or whether...

Let's look at it together. I would love to see an example. Asaf.

(Asaf) Thank you. I can give a very concrete example from my mother tongue, Hebrew. Hebrew has two main variants for expressing almost every word, because the traditional spelling leaves out many of the vowels. Therefore, in modern editions of the Bible and of poetry, diacritics are used. However, those diacritics are never used for modern prose or newspaper writing or street signs. So the average daily casual use puts in extra vowels and doesn't use the diacritics, because they are, of course, more cumbersome and have all kinds of rules, and nobody knows the rules. So there are basically two variants: there's the everyday casual prose variant, and there's the Bible or poetry variant, which always comes in this traditional diacriticized text. To be useful, Lexeme would have to recognize both varieties of every single word, and of every single form of every single word. So that's a very comprehensive use case for official, stable variants. It's not dialects, it's not regions; it's basically two coexisting morphological systems. And I too don't know exactly how to express that in Lexeme today, which is one thing that is keeping me -- in partial answer to Magnus' question -- from uploading the parts that are ready of the biggest Hebrew dictionary, which is public domain and which I have been digitizing for several years now. A good portion of it is ready, but I'm not putting it on Lexeme right now, because I don't know exactly how to solve this problem.

Alright, let's solve this problem here. (chuckles) That has to be possible. Alright, more questions? If not, then thank you so much. (applause)