WEBVTT 00:00:06.303 --> 00:00:07.362 (Lydia) Thank you so much. 00:00:07.362 --> 00:00:11.244 So, this conference, one of the big themes is languages. 00:00:14.220 --> 00:00:18.508 I want to give you an overview of where we actually are currently 00:00:18.508 --> 00:00:19.812 when it comes to languages 00:00:20.264 --> 00:00:22.167 and where we can go from here. 00:00:29.036 --> 00:00:32.580 Wikidata is all about giving more people more access to more knowledge, 00:00:32.580 --> 00:00:37.168 and language is such an important part of making that a reality, 00:00:38.205 --> 00:00:43.291 especially since more and more of our lives depends on technology. 00:00:44.114 --> 00:00:48.873 And as our keynote speaker earlier today was talking, 00:00:49.723 --> 00:00:51.588 some of the technology leaves people behind 00:00:51.588 --> 00:00:55.020 simply because they can't speak a certain language, 00:00:55.320 --> 00:00:57.573 and that's not okay. 00:00:58.633 --> 00:01:02.097 So we want to do something about that. 00:01:02.927 --> 00:01:05.841 And in order to change that, you need at least two things. 00:01:06.411 --> 00:01:11.270 One is you need to provide content to the people in their language, 00:01:11.270 --> 00:01:12.955 and the second thing you need 00:01:12.955 --> 00:01:15.910 is to provide them with interaction in their language 00:01:15.910 --> 00:01:19.189 in those applications or whatever it is you have. 00:01:20.367 --> 00:01:25.277 And Wikidata helps with both of those. 00:01:25.277 --> 00:01:28.408 And the first thing, content in your language, 00:01:28.408 --> 00:01:30.879 that is basically what we have in items and properties, 00:01:31.319 --> 00:01:33.082 how we describe the world. 00:01:33.082 --> 00:01:35.085 Now, this is certainly not everything you need, 00:01:35.085 --> 00:01:39.294 but it gets you quite far ahead. 
00:01:39.764 --> 00:01:41.847 The other thing is interaction in your language, 00:01:41.847 --> 00:01:46.389 and that's where lexemes come into play 00:01:46.389 --> 00:01:49.382 If you want to talk to your digital personal assistant 00:01:49.382 --> 00:01:54.918 or if you want to have your device translate a text and things like that. 00:01:56.404 --> 00:01:59.254 Alright, let's look into content in your language. 00:01:59.254 --> 00:02:03.396 So what we have in items and properties. 00:02:05.406 --> 00:02:09.696 For this, the labels in those items and properties are crucial. 00:02:10.236 --> 00:02:14.866 We need to know what this entity is called that we're talking about. 00:02:15.656 --> 00:02:19.987 And instead of talking about Q5, 00:02:19.987 --> 00:02:22.180 someone who speaks English knows that's a "human," 00:02:22.180 --> 00:02:24.706 someone who speaks German knows that's a "mensch," 00:02:24.706 --> 00:02:26.374 and similar things. 00:02:26.374 --> 00:02:29.742 So those labels on items and properties 00:02:29.742 --> 00:02:33.619 are bridging the gap between humans and machines. 00:02:33.619 --> 00:02:35.439 And humans and humans 00:02:35.439 --> 00:02:40.115 making more existing knowledge accessible to them. 00:02:43.270 --> 00:02:46.290 Now, that's a nice aspiration. 00:02:46.290 --> 00:02:48.342 What does it actually look like? 00:02:48.342 --> 00:02:49.607 It looks like this. 00:02:50.947 --> 00:02:52.416 What you're seeing here 00:02:52.416 --> 00:02:58.496 is that most of the items on Wikidata have two labels, 00:02:58.496 --> 00:03:00.767 so labels in two languages. 00:03:01.697 --> 00:03:03.851 And after that, it's one, and then three, 00:03:03.851 --> 00:03:06.115 and then it becomes very sad. 00:03:06.781 --> 00:03:08.581 (quiet laughter) 00:03:10.047 --> 00:03:12.713 I think we need to do better than this. 00:03:14.185 --> 00:03:15.319 But, on the other hand, 00:03:15.319 --> 00:03:17.478 I was actually expecting this to be even worse. 
00:03:17.478 --> 00:03:19.560 I was expecting the average to be one. 00:03:19.560 --> 00:03:22.503 So I was quite happy to see two. (chuckles) 00:03:24.921 --> 00:03:26.186 Alright. 00:03:27.156 --> 00:03:29.527 But it's not just interesting to know 00:03:29.527 --> 00:03:33.742 how many labels our items and properties have. 00:03:33.742 --> 00:03:36.565 It's also interesting to see in which languages. 00:03:38.045 --> 00:03:43.764 Here you see a graph of the languages 00:03:43.764 --> 00:03:46.838 that we have labels for on Items. 00:03:46.838 --> 00:03:50.669 So the biggest part there is Other. 00:03:51.229 --> 00:03:53.863 So I just took the top 100 languages 00:03:54.533 --> 00:03:58.902 and everything else is Other to make this graph readable. 00:03:59.542 --> 00:04:02.142 And then there's English and Dutch, 00:04:03.002 --> 00:04:04.254 French, 00:04:05.924 --> 00:04:09.129 and not to forget, Asturian. 00:04:09.659 --> 00:04:11.889 - (person 1) Whoo! - Whoo-hoo, yes! 00:04:13.899 --> 00:04:16.954 So what you see here is quite an imbalance 00:04:16.954 --> 00:04:20.114 and still quite a lot of focus on English. 00:04:21.236 --> 00:04:24.367 Another thing is if you look at the same thing for Properties, 00:04:24.367 --> 00:04:25.999 it's actually looking better. 00:04:27.399 --> 00:04:32.750 And I think part of that is due to there just being way less properties. 00:04:32.750 --> 00:04:36.770 So even smaller communities have a chance to keep up with that. 00:04:36.770 --> 00:04:39.173 But it's also a pretty important part of Wikidata 00:04:39.173 --> 00:04:41.159 to localize into your language. 00:04:41.159 --> 00:04:42.384 So that's good. 00:04:45.752 --> 00:04:47.842 What I want to highlight here with Asturian 00:04:47.842 --> 00:04:53.698 is that a small community can really make a huge difference 00:04:54.448 --> 00:04:57.085 with some dedication and work, 00:04:57.085 --> 00:04:58.420 and that's really cool. 00:05:01.846 --> 00:05:03.530 A small quiz for you. 
00:05:03.530 --> 00:05:05.493 If you take all the properties on Wikidata 00:05:05.493 --> 00:05:07.687 that are not external identifiers, 00:05:07.687 --> 00:05:10.358 which one has the most labels, like the most languages? 00:05:10.977 --> 00:05:13.847 (audience) [inaudible] 00:05:13.847 --> 00:05:16.786 I hear some agreement on instance of? 00:05:17.506 --> 00:05:19.443 You would be wrong. 00:05:19.983 --> 00:05:22.210 It's image. (chuckles) 00:05:23.230 --> 00:05:26.366 So, yeah, that tells you, if you speak one of the languages 00:05:26.366 --> 00:05:28.621 where instance of doesn't yet have a label, 00:05:28.621 --> 00:05:30.190 you might want to add it. 00:05:32.102 --> 00:05:35.676 So it has 148 labels currently. 00:05:37.688 --> 00:05:41.249 But that's just another slide. 00:05:42.631 --> 00:05:44.162 This graph tells us something 00:05:44.162 --> 00:05:49.321 about how much content we are making available in a certain language 00:05:49.321 --> 00:05:52.042 and how much of that content is actually used. 00:05:52.042 --> 00:05:55.448 So what you're seeing is basically a curve 00:05:55.448 --> 00:06:00.987 with most content having English labels, being available in English, 00:06:01.507 --> 00:06:04.295 and being used a lot. 00:06:04.295 --> 00:06:06.449 And then it kind of goes down. 00:06:06.449 --> 00:06:09.436 But, again, what you can see are outliers 00:06:09.436 --> 00:06:15.333 who have a lot more content than you would necessarily expect, 00:06:16.903 --> 00:06:19.539 and that is really, really good. 00:06:20.839 --> 00:06:24.945 The problem still is it's not used a lot. 00:06:25.565 --> 00:06:28.742 Asturian and Dutch should be higher, 00:06:28.742 --> 00:06:31.994 and I think helping those communities 00:06:33.266 --> 00:06:35.563 increase the use of the data they collected 00:06:35.563 --> 00:06:37.682 is a really useful thing to do. 
00:06:42.910 --> 00:06:48.110 What this analysis and others showed us is also a good thing though 00:06:48.300 --> 00:06:51.378 is that we are seeing that highly used items 00:06:51.378 --> 00:06:55.295 also tend to have more labels 00:06:55.295 --> 00:06:58.188 or the other way around-- it's not entirely clear. 00:07:02.513 --> 00:07:04.376 And then the question is, 00:07:04.806 --> 00:07:07.009 are we serving just the powerful languages? 00:07:07.899 --> 00:07:11.147 Or are we serving everyone? 00:07:12.757 --> 00:07:17.743 And what you see here is a grouping of languages. 00:07:17.743 --> 00:07:21.832 The languages that are grouped together tend to have labels together. 00:07:26.042 --> 00:07:28.599 And you see it clustering. 00:07:28.599 --> 00:07:34.065 Now here's a similar clustering, colored, 00:07:34.065 --> 00:07:39.475 based on how alive, how used, 00:07:40.455 --> 00:07:43.156 how endangered the language is. 00:07:43.156 --> 00:07:44.642 And a good thing you're seeing here 00:07:44.642 --> 00:07:49.566 is that safe languages and endangered languages 00:07:49.566 --> 00:07:53.773 do not form two different clusters. 00:07:53.773 --> 00:07:58.872 But they're all mixed together, 00:08:00.262 --> 00:08:04.625 which is much better than it would be the other way around 00:08:04.625 --> 00:08:09.377 where the safe languages, the powerful languages 00:08:10.197 --> 00:08:12.164 are just helping each other out. 00:08:12.744 --> 00:08:14.356 No, that's not the case. 00:08:14.356 --> 00:08:17.417 And it's a really good thing. 00:08:17.417 --> 00:08:20.042 When I saw this, I thought this was very good. 00:08:23.474 --> 00:08:25.169 Here's a similar thing 00:08:26.239 --> 00:08:28.800 where we looked at 00:08:30.230 --> 00:08:34.222 the languages' status 00:08:34.222 --> 00:08:36.225 and how many labels it has. 00:08:39.367 --> 00:08:42.937 What you're seeing is a clear win for safe languages, 00:08:42.937 --> 00:08:44.248 as is expected. 
00:08:45.508 --> 00:08:46.693 But what you're also seeing 00:08:46.693 --> 00:08:54.407 is that the languages in category 2 and 3 and maybe even 4 00:08:54.407 --> 00:08:59.280 are not that bad, actually, 00:08:59.280 --> 00:09:02.367 in terms of their representation in Wikidata and others. 00:09:03.287 --> 00:09:06.408 It's a really good thing to find. 00:09:07.646 --> 00:09:09.129 Now, if you look at the same thing 00:09:09.129 --> 00:09:12.418 for how much of that content of those labels 00:09:12.418 --> 00:09:15.495 is actually used on Wikipedia, for example, 00:09:17.455 --> 00:09:22.563 then we see a similar picture emerging again. 00:09:23.603 --> 00:09:29.813 And it tells us that those communities are actually making good use of their time 00:09:29.813 --> 00:09:34.504 by filling in labels for higher used items, for example. 00:09:36.410 --> 00:09:40.493 There are outliers where I think we can help, 00:09:41.683 --> 00:09:48.202 to help those communities find the places where their work would be most valuable. 00:09:49.312 --> 00:09:52.663 But, overall, I'm happy with this picture. 00:09:54.823 --> 00:09:59.844 Now, that was the items and properties part of Wikidata. 00:10:00.714 --> 00:10:03.033 Now, let's look at interaction in your languages. 00:10:03.033 --> 00:10:05.203 So the lexeme parts of Wikidata 00:10:05.203 --> 00:10:09.394 where we describe words and their forms and their meanings. 00:10:10.167 --> 00:10:13.301 We've been doing this now since May last year, 00:10:16.461 --> 00:10:19.127 and content has been growing. 00:10:20.114 --> 00:10:22.149 You can see here in blue the lexemes, 00:10:22.149 --> 00:10:25.938 and then in red, the forms on those lexemes 00:10:25.938 --> 00:10:29.910 and yellow, the senses on those lexemes. 
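Editor's sketch, not part of the talk: the three series on the growth chart (lexemes, forms, senses) correspond to the three nested levels of lexicographical data. The Python below is a minimal illustration under assumed names; the classes `Lexeme`, `Form`, and `Sense` and the helper `count_parts` are hypothetical and are not the Wikibase API.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str                    # the written form, e.g. "waters"
    grammatical_features: frozenset[str]   # e.g. {"plural"}


@dataclass
class Sense:
    glosses: dict[str, str]  # language code -> short definition


@dataclass
class Lexeme:
    lemma: str
    language: str
    lexical_category: str
    forms: list[Form] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)


def count_parts(lexemes: list[Lexeme]) -> dict[str, int]:
    """Count lexemes, forms, and senses, i.e. the three series on the chart."""
    return {
        "lexemes": len(lexemes),
        "forms": sum(len(lx.forms) for lx in lexemes),
        "senses": sum(len(lx.senses) for lx in lexemes),
    }


# Illustrative data only, invented for this sketch.
water = Lexeme(
    lemma="water",
    language="en",
    lexical_category="noun",
    forms=[
        Form("water", frozenset({"singular"})),
        Form("waters", frozenset({"plural"})),
    ],
    senses=[Sense({"en": "clear liquid that forms rain, rivers, and seas"})],
)

print(count_parts([water]))
```

A language with rich inflection adds many `Form` objects per lexeme, which is why the form and sense counts can rank languages differently from the plain lexeme count.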
00:10:30.991 --> 00:10:34.451 So some communities-- we'll get to that later-- 00:10:34.451 --> 00:10:39.793 have spent a lot of time creating forms and senses for their lexemes, 00:10:39.793 --> 00:10:42.753 which is really useful 00:10:42.753 --> 00:10:48.243 because that builds the core of the data set that you need. 00:10:50.562 --> 00:10:55.133 Now, we looked at all the languages 00:10:55.133 --> 00:10:57.906 that have lexemes on Wikidata. 00:10:57.906 --> 00:11:01.003 So words we have, 00:11:01.713 --> 00:11:04.404 those are right now 310 languages. 00:11:04.884 --> 00:11:08.290 Now, what do you think is the top language 00:11:08.290 --> 00:11:11.949 when it comes to the number of lexemes currently in Wikidata? 00:11:12.933 --> 00:11:14.700 (audience) [inaudible] 00:11:19.183 --> 00:11:20.216 Huh? 00:11:20.216 --> 00:11:21.741 (person 2) German. 00:11:21.741 --> 00:11:24.252 Sorry, I've heard it before. 00:11:24.252 --> 00:11:25.651 It's Russian. 00:11:28.011 --> 00:11:29.754 Russian is quite ahead. 00:11:31.897 --> 00:11:33.832 And just to give you some perspective, 00:11:35.652 --> 00:11:36.816 there's different opinions 00:11:36.816 --> 00:11:42.231 but I've read, for example, that 1,000 to 3,000 words 00:11:42.231 --> 00:11:45.450 gets you to conversation level, roughly, in another language, 00:11:45.450 --> 00:11:49.461 and 4,000 to 10,000 words to an advanced level. 00:11:51.591 --> 00:11:55.282 So, we still have a bit to catch up there. 00:11:58.483 --> 00:12:03.279 One thing I want you to pay attention to is Basque here 00:12:03.279 --> 00:12:07.744 with 10,000, roughly, lexemes. 00:12:09.244 --> 00:12:13.003 Now, if you look at the number of forms for those lexemes, 00:12:14.163 --> 00:12:16.497 Basque is way up there, 00:12:18.257 --> 00:12:20.006 which is really cool, 00:12:20.006 --> 00:12:24.930 and you should go to a talk that explains to you why that is the case. 
00:12:27.341 --> 00:12:31.175 Now, if you look at the number of senses, so what do words mean, 00:12:32.015 --> 00:12:35.081 Basque even gets to the top of the list. 00:12:35.081 --> 00:12:37.102 I think that deserves an applause. 00:12:37.102 --> 00:12:38.921 (applause) 00:12:45.678 --> 00:12:47.118 Another short quiz. 00:12:47.118 --> 00:12:50.181 What's the lexeme with the most translations currently? 00:12:50.651 --> 00:12:55.414 (audience) Cats, cats, [inaudible], Douglas Adams, [inaudible] 00:12:56.766 --> 00:13:00.014 All good guesses, but no. 00:13:01.012 --> 00:13:04.137 It's this, the Russian word for "water." 00:13:09.571 --> 00:13:12.253 Alright, so now we talked a lot 00:13:12.253 --> 00:13:16.412 about how many lexemes, forms, and senses we have, 00:13:16.412 --> 00:13:20.493 but that's just one thing you need. 00:13:20.493 --> 00:13:21.515 The other thing you need 00:13:21.515 --> 00:13:25.161 is actually describing those lexemes, forms, and senses 00:13:25.161 --> 00:13:27.647 in a machine-readable way. 00:13:27.647 --> 00:13:30.039 And for that you have statements, like on items. 00:13:31.479 --> 00:13:36.362 And one of the properties you use is usage example. 00:13:36.362 --> 00:13:38.582 So whoever is using that data 00:13:38.582 --> 00:13:42.089 can understand how to use that word in context, 00:13:42.089 --> 00:13:44.158 so that could be a quote, for example. 00:13:45.396 --> 00:13:47.113 And here, Polish rocks. 00:13:47.900 --> 00:13:49.764 Good job, Polish speakers. 00:13:54.219 --> 00:13:57.680 Another property that's really useful is IPA, 00:13:57.680 --> 00:14:00.186 so how do you pronounce this word. 00:14:00.876 --> 00:14:07.497 Russian apparently needs lots of IPA statements. 00:14:10.419 --> 00:14:13.314 But, again, Polish, second. 00:14:17.148 --> 00:14:20.753 And last but not least we have pronunciation audio. 
00:14:20.753 --> 00:14:23.372 So that is links to files on Commons 00:14:23.372 --> 00:14:25.959 where someone speaks the word, 00:14:25.959 --> 00:14:29.913 so you can hear a native speaker pronounce the word 00:14:29.913 --> 00:14:32.871 in case you can't read IPA, for example. 00:14:34.959 --> 00:14:39.205 And there's a really nice actually Wiki-based powered project 00:14:39.205 --> 00:14:40.474 called Lingua Libre 00:14:40.884 --> 00:14:45.173 where you can go and help record words in your language 00:14:45.173 --> 00:14:47.836 that then can be added to lexemes on Wikidata, 00:14:48.446 --> 00:14:52.103 so other people can understand how to pronounce your words. 00:14:53.663 --> 00:14:55.694 (person 2) [inaudible] 00:14:55.694 --> 00:14:57.665 If you search for "Lingua Libre," 00:14:57.665 --> 00:15:00.981 and I'm sure someone can post it in the Telegram channel. 00:15:03.138 --> 00:15:04.621 Those guys rock. 00:15:04.621 --> 00:15:06.726 They did really cool stuff with Wikibase. 00:15:09.416 --> 00:15:10.617 Alright. 00:15:12.706 --> 00:15:17.285 Then the question is, where do we go from here? 00:15:19.165 --> 00:15:22.010 Based on the numbers I've just shown you, 00:15:23.030 --> 00:15:25.172 we've come a long way 00:15:25.172 --> 00:15:28.430 towards giving more people more access to more knowledge 00:15:28.430 --> 00:15:31.240 when looking at languages on Wikidata. 00:15:32.530 --> 00:15:36.392 But there is also still a lot of work ahead of us. 00:15:38.992 --> 00:15:42.341 Some of the things you can do to help, for example, 00:15:42.341 --> 00:15:44.921 is run label-a-thons 00:15:44.921 --> 00:15:50.124 like get people together to label items in Wikidata 00:15:50.914 --> 00:15:55.121 or do an edit-a-thon around lexemes in your language 00:15:55.121 --> 00:15:59.212 to get the most used words in your language into Wikidata. 
00:16:00.773 --> 00:16:03.285 Or you can use a tool like Terminator 00:16:03.285 --> 00:16:08.493 that helps you find the most important items in your language 00:16:08.493 --> 00:16:11.549 that are still missing a label. 00:16:13.274 --> 00:16:18.359 Most important being measured by how often it is used 00:16:18.359 --> 00:16:22.553 in other Wikidata items as links in statements. 00:16:25.768 --> 00:16:30.022 And, of course, for the lexeme part, 00:16:31.342 --> 00:16:35.169 now that we've got a basic coverage of those lexemes, 00:16:35.169 --> 00:16:41.163 it's also about building them out, adding more statements to them 00:16:41.163 --> 00:16:44.401 so that they actually can build the base 00:16:44.401 --> 00:16:47.421 for meaningful applications to build on top of that. 00:16:48.141 --> 00:16:50.795 Because we're getting closer to that critical mass, 00:16:50.795 --> 00:16:53.616 but we're still away from that, 00:16:53.616 --> 00:16:56.624 that you can build serious applications on top of it. 00:16:58.277 --> 00:17:01.680 And I hope all of you will join us in doing that. 00:17:02.583 --> 00:17:07.103 And that already brings me 00:17:07.103 --> 00:17:09.843 to a little help from our friends, 00:17:09.843 --> 00:17:12.812 and Bruno, do you want to come over 00:17:13.882 --> 00:17:16.854 and talk to us about lexical masks? 00:17:17.541 --> 00:17:18.567 (Bruno) Thank you, Lydia, 00:17:18.567 --> 00:17:21.519 thank you for giving me this short period of time 00:17:21.519 --> 00:17:24.150 to present this work that we are doing at Google 00:17:24.150 --> 00:17:29.635 with Denny, who most of you probably have heard of or know. 00:17:30.126 --> 00:17:32.030 At Google, I'm a linguist, 00:17:32.030 --> 00:17:36.150 so I'm very happy to be here amongst other language enthusiasts. 
00:17:36.620 --> 00:17:39.278 We are also building some lexicons, 00:17:39.278 --> 00:17:41.766 and we have built this technology 00:17:41.766 --> 00:17:45.589 or this approach that we think can be useful for you. 00:17:46.369 --> 00:17:48.455 Just to give you a little bit of background, 00:17:48.455 --> 00:17:52.068 this is my lexicographic background talking here. 00:17:52.788 --> 00:17:54.347 When we build a lexicon database, 00:17:54.347 --> 00:17:58.623 there is a lot of hard time to maintain, to keep them consistent 00:17:58.623 --> 00:18:00.125 and to exchange data, 00:18:00.125 --> 00:18:02.027 as you probably know. 00:18:02.517 --> 00:18:05.927 There are several attempts to unify the feature and the properties 00:18:05.927 --> 00:18:09.184 that are describing those lexemes and those forms, 00:18:09.184 --> 00:18:10.936 and it's not a solved problem, 00:18:10.936 --> 00:18:13.958 but there are some unification attempts on that side. 00:18:13.958 --> 00:18:15.209 But what is really missing-- 00:18:15.209 --> 00:18:18.732 and this is a problem we had at the beginning of our project at Google 00:18:18.732 --> 00:18:21.607 is to try to have an internal structure 00:18:22.197 --> 00:18:25.910 that describes how a lexical entry should look like, 00:18:25.910 --> 00:18:28.581 what kind of data or what kind of information we have 00:18:28.581 --> 00:18:32.237 and the specification that are expected. 00:18:32.237 --> 00:18:38.187 So, this is what we came up with this thing called lexicon mask. 00:18:38.897 --> 00:18:44.841 A lexicon mask is describing what is expected for an entry, 00:18:44.841 --> 00:18:47.329 a lexicographic entry, to be complete, 00:18:47.329 --> 00:18:51.436 both in terms of the number of forms you expect for a lexeme, 00:18:51.436 --> 00:18:55.607 and the number of features you expect for each of those forms. 00:18:56.397 --> 00:18:58.329 Here is an example for Italian adjectives. 
00:18:58.329 --> 00:19:02.002 You expect, in Italian, to have four forms for your adjectives, 00:19:02.002 --> 00:19:05.383 and each of these forms have a specific combination 00:19:05.383 --> 00:19:07.946 of gender and number features. 00:19:08.606 --> 00:19:12.672 This is what we expect for the Italian adjectives. 00:19:12.672 --> 00:19:16.176 Of course, you can have extremely complex masks, 00:19:16.176 --> 00:19:20.783 like the French verbs conjugation, which is quite extensive, 00:19:20.783 --> 00:19:23.487 and I don't show you any other Russian mask 00:19:23.487 --> 00:19:25.378 because it doesn't fit the screen. 00:19:26.308 --> 00:19:29.531 And we also have some detailed specifications 00:19:29.531 --> 00:19:33.421 because we distinguish what is at the form level. 00:19:33.421 --> 00:19:37.544 So here you have Russian nouns that have three numbers 00:19:37.544 --> 00:19:40.048 and a number of cases with different forms, 00:19:40.048 --> 00:19:43.086 but they also have an entry level specification 00:19:43.086 --> 00:19:45.590 that says a noun particularly has 00:19:45.590 --> 00:19:50.133 an inherent gender and an inherent animacy feature 00:19:50.133 --> 00:19:52.488 that is also specified in the mask. 00:19:54.518 --> 00:19:58.779 We also want to distinguish that a mask gives a specification 00:19:58.779 --> 00:20:01.874 for, in general, what an entry should look like. 00:20:01.874 --> 00:20:07.158 But you can have smaller masks for defective aspects of the form 00:20:07.158 --> 00:20:11.282 or defective aspects of the lexeme that happen in language. 00:20:11.282 --> 00:20:14.537 So here is the simplest version of French verbs 00:20:14.537 --> 00:20:19.729 that have only the 3rd person singular for all the weather verbs, 00:20:19.729 --> 00:20:23.969 like "it rains" or "it snows," like in English. 00:20:24.537 --> 00:20:26.493 So we distinguish these two levels. 
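Editor's sketch, not part of the talk: one way to picture a mask is as a list of form-level features with their allowed values, from which the expected forms follow as combinations. The dictionary layout, the constant names, and `expected_forms` are assumptions made for illustration; they are not Google's internal format or the ShEx representation.

```python
from itertools import product

# Hypothetical mask layout: each form-level feature maps to its allowed values.
ITALIAN_ADJECTIVE_MASK = {
    "entry_features": [],  # nothing inherent to the lexeme itself
    "form_features": {
        "gender": ["masculine", "feminine"],
        "number": ["singular", "plural"],
    },
}

# A smaller, "defective" mask: weather verbs only have the 3rd person singular.
# Tense and mood are left out to keep the sketch small.
FRENCH_WEATHER_VERB_MASK = {
    "entry_features": [],
    "form_features": {"person": ["third"], "number": ["singular"]},
}


def expected_forms(mask):
    """Return every feature combination the mask expects a form for."""
    names = list(mask["form_features"])
    value_lists = [mask["form_features"][name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


for slot in expected_forms(ITALIAN_ADJECTIVE_MASK):
    print(slot)
```

For the Italian adjective mask this yields the four gender and number combinations mentioned in the talk. Entry-level features such as the inherent gender and animacy of a Russian noun would be listed under `entry_features` rather than multiplied into the forms.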
00:20:26.923 --> 00:20:29.962 And how we use this at Google 00:20:29.962 --> 00:20:32.643 is that when we have a lexicon that we want to use, 00:20:33.063 --> 00:20:38.309 we use the mask to really literally throw the lexicons, 00:20:38.309 --> 00:20:40.163 all the entries, through the mask 00:20:40.163 --> 00:20:44.303 and see which entry has a problem in terms of structure. 00:20:44.303 --> 00:20:46.523 Are we missing a form? Are we missing a feature? 00:20:46.523 --> 00:20:51.497 And when there is a problem, we do some human validation 00:20:51.497 --> 00:20:53.751 or just to see if it passes the mask. 00:20:53.751 --> 00:20:57.924 So it's an extremely powerful tool to check the quality of the structure. 00:20:59.427 --> 00:21:01.964 So what we are happy to announce today 00:21:01.964 --> 00:21:05.408 is that we got the green light to open source our mask. 00:21:05.948 --> 00:21:07.573 So this is a schema, 00:21:07.573 --> 00:21:09.477 if you want, that we can release 00:21:09.477 --> 00:21:13.483 and that we will provide to Wikidata as ShEx files. 00:21:13.483 --> 00:21:16.688 This is a ShEx file for German nouns, 00:21:16.688 --> 00:21:20.428 and Denny is working on the conversion from our internal specification 00:21:20.428 --> 00:21:23.666 to a more open-source specification. 00:21:23.666 --> 00:21:27.522 We currently cover more than 25 languages. 00:21:27.522 --> 00:21:29.225 So we expect to grow on our side, 00:21:29.225 --> 00:21:34.350 but we also look for this opportunity to collaborate for other languages. 00:21:34.350 --> 00:21:40.728 And one of the ongoing collaborations is also the one that Denny has with Lukas. 00:21:40.728 --> 00:21:45.052 Lukas has these great tools to have a UI 00:21:45.052 --> 00:21:51.061 to help the user or the contributor to add more forms. 
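Editor's sketch, not part of the talk: throwing an entry "through the mask" can be pictured as a structural check that reports missing forms and forms whose features fit no expected slot. The data layout, `ITALIAN_ADJECTIVE_MASK`, and `check_entry` are hypothetical names invented for this illustration; the real tooling and the ShEx files may work quite differently.

```python
# Hypothetical mask for Italian adjectives: four forms, one per
# gender/number combination, and no entry-level features.
ITALIAN_ADJECTIVE_MASK = {
    "entry_features": [],
    "expected_forms": [
        {"gender": "masculine", "number": "singular"},
        {"gender": "feminine", "number": "singular"},
        {"gender": "masculine", "number": "plural"},
        {"gender": "feminine", "number": "plural"},
    ],
}


def check_entry(entry, mask):
    """Return a list of structural problems; an empty list means the entry passes."""
    problems = []
    # Entry-level features, e.g. an inherent gender on a noun.
    for feature in mask["entry_features"]:
        if feature not in entry["features"]:
            problems.append(f"missing entry feature: {feature}")
    # Every expected slot needs a form with exactly those features.
    found = [form["features"] for form in entry["forms"]]
    for slot in mask["expected_forms"]:
        if slot not in found:
            problems.append(f"missing form: {slot}")
    # A form that fits no slot usually lacks a feature or has a wrong one.
    for form in entry["forms"]:
        if form["features"] not in mask["expected_forms"]:
            problems.append(f"form {form['text']!r} fits no expected slot")
    return problems


# Illustrative entry with a deliberate gap: "rosse" has no gender feature.
rosso = {
    "lemma": "rosso",
    "features": {},
    "forms": [
        {"text": "rosso", "features": {"gender": "masculine", "number": "singular"}},
        {"text": "rossa", "features": {"gender": "feminine", "number": "singular"}},
        {"text": "rossi", "features": {"gender": "masculine", "number": "plural"}},
        {"text": "rosse", "features": {"number": "plural"}},
    ],
}

for problem in check_entry(rosso, ITALIAN_ADJECTIVE_MASK):
    print(problem)
```

Entries that come back with a non-empty list are the ones that would go to human validation in the workflow described above.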
00:21:51.061 --> 00:21:54.151 So if you want to add an adjective in French, 00:21:54.151 --> 00:21:59.057 the UI is telling you how many forms are expected 00:21:59.057 --> 00:22:01.562 and what kind of features this form should have. 00:22:01.562 --> 00:22:06.268 So our mask will help the tool to be defined and expanded. 00:22:07.238 --> 00:22:08.385 That's it. 00:22:08.791 --> 00:22:10.358 (Lydia) Thank you so much. 00:22:10.358 --> 00:22:11.993 (applause) 00:22:14.249 --> 00:22:16.891 Alright. Are there questions? 00:22:16.891 --> 00:22:19.381 Do you want to talk more about lexemes? 00:22:19.817 --> 00:22:21.475 - (person 3) Yes. - Yes. (chuckles) 00:22:33.485 --> 00:22:35.380 (person 3) My question, because you were talking 00:22:35.380 --> 00:22:39.106 about giving more access to more people in more languages. 00:22:39.106 --> 00:22:42.444 But there are a lot of languages that can't be used in Wikidata. 00:22:42.444 --> 00:22:44.588 So what solution do you have for that? 00:22:45.889 --> 00:22:47.686 When you say that can't use Wikidata, 00:22:47.686 --> 00:22:50.308 are you talking about entering labels? 00:22:50.308 --> 00:22:52.578 - (person 3) Labels, descriptions. - Right. 00:22:52.578 --> 00:22:55.498 So, for lexemes, it's a bit different 00:22:55.498 --> 00:22:57.793 because there we don't have that restriction. 00:22:58.923 --> 00:23:05.003 For labels on items and properties, there is some restriction 00:23:05.433 --> 00:23:12.411 because we wanted to make sure that it's not completely 00:23:12.411 --> 00:23:14.229 anyone does anything, 00:23:14.229 --> 00:23:17.769 and it becomes unmanageable. 00:23:19.349 --> 00:23:23.328 Even a small community who wants one language and wants to work on that, 00:23:23.898 --> 00:23:26.787 come talk to us, we will make it happen. 
(person 3) I mean, we did this at the Prague Hackathon in May, 00:23:29.202 --> 00:23:32.459 and it took us until almost August in order to be able to use our language. 00:23:32.459 --> 00:23:35.135 - Yeah. - (person 3) So, it's very slow. 00:23:35.135 --> 00:23:37.854 Yeah, it is, unfortunately, very slow. 00:23:37.854 --> 00:23:39.883 We're currently working with the Language Committee 00:23:39.883 --> 00:23:46.048 on solving some fundamental... 00:23:49.537 --> 00:23:55.447 Like, getting agreement on what kind of languages are actually "allowed," 00:23:56.047 --> 00:23:59.398 and that has taken too long, 00:23:59.988 --> 00:24:04.178 which is the reason why your request probably took longer than it should have. 00:24:04.778 --> 00:24:05.963 (person 3) Thanks. 00:24:06.815 --> 00:24:07.950 (person 4) Thank you. 00:24:07.950 --> 00:24:10.938 Lydia, if you remember the statistics that you showed, 00:24:10.938 --> 00:24:12.886 the number of lexemes per language. 00:24:12.886 --> 00:24:17.599 So, did you count all the forms as a data point 00:24:17.599 --> 00:24:20.034 or only lexemes? 00:24:21.289 --> 00:24:22.941 (Lydia) Do you mean this? 00:24:22.941 --> 00:24:24.053 Which one do you mean? 00:24:24.053 --> 00:24:25.529 (person 4) Yes, exactly. 00:24:25.797 --> 00:24:28.341 If you remember, does this number [inaudible] 00:24:28.341 --> 00:24:31.954 all the forms for all the lexemes or just how many lexemes there are? 00:24:31.954 --> 00:24:33.585 No, this is just a number of lexemes. 00:24:33.585 --> 00:24:35.395 (person 4) Just a number of lexemes, okay. 
00:24:35.395 --> 00:24:36.797 So then it is a just statistic 00:24:36.797 --> 00:24:39.390 because if it would then compose the forms-- 00:24:39.390 --> 00:24:40.614 that's why I'm asking-- 00:24:40.614 --> 00:24:42.817 then all the languages with the inflectional morphology, 00:24:42.817 --> 00:24:45.027 like Russian, Serbian, Slovenian and et cetera, 00:24:45.027 --> 00:24:47.616 they have a natural advantage because they have so many. 00:24:47.616 --> 00:24:51.990 So, this kind of kicks in here on this number of forms. 00:24:51.990 --> 00:24:53.851 (person 4) Yeah, that was this one. Thank you. 00:24:56.546 --> 00:25:00.224 (person 5) So, I had a quick question about the... 00:25:00.644 --> 00:25:06.824 When we're talking about the actual items and properties. 00:25:07.124 --> 00:25:08.901 Like as far as I understand, 00:25:08.901 --> 00:25:11.955 there is currently no way to give an actual source 00:25:11.955 --> 00:25:14.726 to any of the labels and descriptions that are given. 00:25:14.726 --> 00:25:18.047 So, for example, because when you're talking 00:25:18.047 --> 00:25:20.920 about an item property, 00:25:20.920 --> 00:25:24.509 like, for example, you can get conflicting labels. 00:25:24.509 --> 00:25:25.739 Yes. 00:25:25.739 --> 00:25:27.662 (person 5) So this person is like... 00:25:28.402 --> 00:25:30.781 We were talking about indigenous things before, for example. 00:25:30.781 --> 00:25:35.965 So this person is a Norwegian artist according to this source, 00:25:35.965 --> 00:25:38.750 and a Sami artist, according to this source. 
00:25:39.550 --> 00:25:42.883 Or, for example, in Estonian, we had an issue 00:25:42.883 --> 00:25:47.729 where we had to change terminology to the official use terminology 00:25:47.729 --> 00:25:49.482 in official lexicons, 00:25:49.482 --> 00:25:52.262 but we have no way to indicate really why, 00:25:52.262 --> 00:25:53.596 like what was the source of this 00:25:53.596 --> 00:25:55.561 and why this was better and what was there before. 00:25:55.561 --> 00:25:57.150 It was just me as a random person 00:25:57.150 --> 00:25:59.615 just switching the thing to anyone who sees it. 00:25:59.615 --> 00:26:02.520 So is there a plan to make this possible in any way 00:26:02.520 --> 00:26:06.355 so that we can actually have proper sources for the language data? 00:26:07.045 --> 00:26:11.568 So, it is partially possible. 00:26:11.568 --> 00:26:15.958 So, for example, when you have an item for a person, 00:26:16.968 --> 00:26:22.720 you have a statement, first name, last name, and so on, of that person, 00:26:22.720 --> 00:26:26.226 and then you can provide the reference for that there. 00:26:28.211 --> 00:26:32.544 I'm quite hesitant to add more complexity 00:26:32.544 --> 00:26:35.557 for references on labels and descriptions, 00:26:35.557 --> 00:26:38.624 but if people really, really think 00:26:38.624 --> 00:26:44.939 this is something that isn't covered by any reference on the statement, 00:26:44.939 --> 00:26:46.803 then let's talk about it. 00:26:49.079 --> 00:26:53.303 But I fear it will add a lot of complexity 00:26:53.303 --> 00:26:56.523 for what I hope are few cases, 00:26:57.393 --> 00:27:00.188 but I'm willing to be convinced otherwise 00:27:00.188 --> 00:27:04.087 if people really feel very strongly about this. 00:27:04.087 --> 00:27:08.177 (person 5) I mean, if it's added it probably shouldn't be the default, 00:27:08.177 --> 00:27:12.452 show to all the users as a beginner, interface, in any case. 
00:27:12.452 --> 00:27:16.190 More like, "Click here if you need to say a specific thing about this." 00:27:17.632 --> 00:27:23.368 Do we have a sense of how many times that would actually matter? 00:27:24.520 --> 00:27:26.423 (person 5) In Estonian, for example-- 00:27:26.423 --> 00:27:28.844 I expect this is true of other languages as well-- 00:27:29.274 --> 00:27:34.203 for example, there is an official name that is the actual legitimate translation, 00:27:34.203 --> 00:27:36.206 for example, into English, 00:27:36.206 --> 00:27:40.314 of, say, a specific kind of municipality. 00:27:40.614 --> 00:27:42.182 That was my use case, for example, 00:27:42.182 --> 00:27:44.409 where we were using the word "parish" 00:27:45.159 --> 00:27:50.885 which the original Estonian word was meant kind of like church parish, 00:27:50.885 --> 00:27:51.899 and that was the origin, 00:27:51.899 --> 00:27:54.809 but that's not the official translation Estonia gets right now. 00:27:55.189 --> 00:27:58.993 In this case, I would just add it as official name statements 00:27:58.993 --> 00:28:00.817 and add the reference there. 00:28:02.032 --> 00:28:03.158 (person 5) Okay. 00:28:05.186 --> 00:28:06.572 More questions, yes? 00:28:07.682 --> 00:28:10.044 (person 6) I have two quick comments. 00:28:10.044 --> 00:28:13.934 You specifically called out Asturian as a language that does well, 00:28:13.934 --> 00:28:16.455 and I think that's a false artifact. 00:28:16.455 --> 00:28:17.724 Tell me about it. 00:28:17.724 --> 00:28:19.748 (person 6) I think it's just a bot 00:28:19.748 --> 00:28:24.068 that pasted person names, like proper names, 00:28:24.068 --> 00:28:27.172 and said, "Well, this is exactly like in French or Spanish," 00:28:27.172 --> 00:28:28.558 and just massively copied it. 
00:28:28.558 --> 00:28:33.316 One point of evidence is that you don't see that energy in Asturian 00:28:33.316 --> 00:28:37.205 in things that actually require translation, like property names, 00:28:37.205 --> 00:28:39.648 or names of items that are not proper names. 00:28:39.648 --> 00:28:41.219 Asaf, you break my heart. 00:28:41.219 --> 00:28:43.198 (person 6) I know, I like raining on parades, 00:28:43.198 --> 00:28:48.458 but I have good news as well, which is about the pronunciation numbers. 00:28:49.408 --> 00:28:53.515 As you probably know, Commons is full of pronunciation files, 00:28:53.515 --> 00:28:54.668 and, for example, 00:28:54.668 --> 00:29:01.102 Dutch has no less than 300,000 pronunciation files already on Commons 00:29:01.912 --> 00:29:05.051 that just need to somehow be ingested. 00:29:05.051 --> 00:29:07.697 So if anyone's looking for a side project, 00:29:07.697 --> 00:29:08.997 there's tons and tons 00:29:08.997 --> 00:29:13.280 of classified, categorized pronunciation files on Commons 00:29:13.280 --> 00:29:16.893 under the category "Pronunciation" by language. 00:29:16.893 --> 00:29:22.840 So that's just waiting to be matched to lexemes and put on Lexeme. 00:29:23.180 --> 00:29:25.484 And I was wondering if you could say something 00:29:25.484 --> 00:29:26.585 about the road map, 00:29:26.585 --> 00:29:28.757 something about how much investment 00:29:28.757 --> 00:29:31.995 or what can we expect from Lexeme in the coming year, 00:29:31.995 --> 00:29:34.020 because I, for one, can't wait. 00:29:34.949 --> 00:29:37.044 You can't wait? (chuckles) 00:29:37.044 --> 00:29:39.118 - (person 6) For more. - Yes. 
(chuckles) 00:29:44.541 --> 00:29:49.523 Right now, we're concentrating more on Wikibase and data quality 00:29:51.493 --> 00:29:55.087 to see how much traction this gets 00:29:55.087 --> 00:30:01.676 and then getting more feedback on where the pain points are next, 00:30:01.676 --> 00:30:06.003 and then going back to improving lexicographical data further. 00:30:06.903 --> 00:30:09.790 And one of the things I'd love to hear from you 00:30:09.790 --> 00:30:14.136 is where exactly do you see the next steps, 00:30:14.136 --> 00:30:15.966 where do you want to see improvements 00:30:15.966 --> 00:30:20.340 so that we can then figure out how to make that happen. 00:30:21.125 --> 00:30:22.810 But, of course, you're right, 00:30:22.810 --> 00:30:25.712 there's still so much to do also on the technical side. 00:30:30.573 --> 00:30:35.848 (person 7) Okay, as we were uploading the Basque words with forms, 00:30:35.848 --> 00:30:37.768 and you'll see some of these kinds of things, 00:30:37.768 --> 00:30:41.329 we were both like, last week we said, "Oh, we are the first one in something." 00:30:42.919 --> 00:30:44.928 It appears in press, and it's like, 00:30:44.928 --> 00:30:49.488 "Oh, Basque are the first time in some-- they are the first in something, okay." 00:30:49.488 --> 00:30:50.606 (laughs) 00:30:50.606 --> 00:30:53.318 And then people ask, "Okay, but what is this for?" 00:30:54.678 --> 00:30:56.849 We don't have a really good answer. 00:30:56.849 --> 00:30:57.888 I mean it's like, okay, 00:30:57.888 --> 00:31:01.841 this will help computers to understand our language more, yes, 00:31:01.841 --> 00:31:05.279 but what kind of tools can we make in the future? 00:31:05.279 --> 00:31:07.467 And we don't have a good answer for this. 00:31:07.467 --> 00:31:10.625 So I don't know if you have a good answer for this. 00:31:10.625 --> 00:31:12.742 (chuckles) I don't know if I have a good answer, 00:31:12.742 --> 00:31:14.746 but I have an answer.
00:31:15.480 --> 00:31:20.425 So I think right now as I was telling [inaudible], 00:31:20.425 --> 00:31:21.924 we haven't reached that critical mass 00:31:21.924 --> 00:31:25.529 where you can build a lot of the really interesting tools. 00:31:25.529 --> 00:31:27.707 But there are already some tools. 00:31:28.267 --> 00:31:31.912 Just the other day, Esther [Pandelia], for example, 00:31:31.912 --> 00:31:33.817 released a tool where you can see, 00:31:35.837 --> 00:31:38.889 I think it was the words on a globe 00:31:38.889 --> 00:31:41.901 where they're spoken, where they're coming from. 00:31:42.631 --> 00:31:44.090 I'm probably wrong about this, 00:31:44.090 --> 00:31:46.346 but she had answered on the Project chat on Wikidata-- 00:31:46.346 --> 00:31:48.984 you can look it up there. 00:31:49.574 --> 00:31:51.805 So we have seen these first tools, 00:31:51.805 --> 00:31:55.696 just like we've seen back when Wikidata started. 00:31:56.846 --> 00:31:59.602 First some--like just a network, 00:31:59.602 --> 00:32:03.424 and like, "Hey, look, there's this thing that connects to this other thing." 00:32:04.824 --> 00:32:07.059 And as we have more data, 00:32:07.059 --> 00:32:10.352 and as we've reached some critical mass, 00:32:11.852 --> 00:32:14.747 more powerful applications become possible, 00:32:15.677 --> 00:32:17.516 things like Histropedia, 00:32:19.126 --> 00:32:21.988 things like question and answering 00:32:21.988 --> 00:32:26.663 in your digital personal assistant, Platypus, and so on. 00:32:26.663 --> 00:32:29.668 And we're seeing a similar thing with lexemes. 
00:32:31.198 --> 00:32:34.650 We're at the stage where you can build like these little, 00:32:34.650 --> 00:32:37.464 hey, look, there's a connection between the two things, 00:32:37.864 --> 00:32:42.738 and there's a translation of this word into that language stage, 00:32:42.738 --> 00:32:47.747 and as we build it out and as we describe more words, 00:32:47.747 --> 00:32:49.533 more becomes possible. 00:32:49.533 --> 00:32:51.795 Now, what becomes possible? 00:32:53.482 --> 00:32:59.483 As Ben, our keynote speaker earlier was talking about translations, 00:33:00.103 --> 00:33:03.455 being able to translate from one language to another. 00:33:03.455 --> 00:33:07.929 And Jens, my colleague, he's always talking about 00:33:07.929 --> 00:33:11.452 the European Union looking for a translator 00:33:11.452 --> 00:33:17.439 who can translate from I think it was Maltese to Swedish-- 00:33:17.439 --> 00:33:19.436 - (person 8) Estonian. - Estonian. 00:33:22.016 --> 00:33:26.211 And that is not a usual combination. 00:33:27.211 --> 00:33:31.735 But once you have all these languages in one machine-readable place, 00:33:31.735 --> 00:33:33.143 you can do that, 00:33:33.143 --> 00:33:36.857 you can get a dictionary 00:33:36.857 --> 00:33:41.735 from Estonian to Maltese and back. 00:33:42.935 --> 00:33:45.607 So covering language combinations in dictionaries 00:33:45.607 --> 00:33:47.911 that just haven't been covered before 00:33:47.911 --> 00:33:51.050 because there wasn't enough demand for it, for example, 00:33:51.050 --> 00:33:55.540 to make it financially viable and to justify the work. 00:33:55.540 --> 00:33:57.147 Now we can do that. 00:33:59.797 --> 00:34:02.318 Then text generation. 
00:34:02.318 --> 00:34:03.653 Lucie was earlier talking 00:34:03.653 --> 00:34:10.136 about how she's working with Hattie on generating text 00:34:10.136 --> 00:34:14.673 to get Wikipedia articles in minority languages started, 00:34:15.423 --> 00:34:19.512 and that needs data about words, 00:34:19.512 --> 00:34:22.589 and you need to understand the language to do that. 00:34:23.769 --> 00:34:28.133 Yeah, and those are just some that come to my mind right now. 00:34:28.693 --> 00:34:30.494 Maybe our audience has more ideas 00:34:30.494 --> 00:34:34.353 what they want to do when we have all the glorious data. 00:34:37.693 --> 00:34:40.892 (person 9) Okay, I will deviate from the lexemes topic. 00:34:40.892 --> 00:34:42.666 I will ask the question, 00:34:42.666 --> 00:34:45.634 how can I as a member of community 00:34:45.634 --> 00:34:50.135 influence that priority is put on task, 00:34:50.135 --> 00:34:56.644 that a new user comes, and he can indicate what languages he wants to see and edit 00:34:56.644 --> 00:35:01.135 without some secret verbal template knowledge. 00:35:02.145 --> 00:35:05.053 Maybe there will be this year this technical wish list 00:35:05.053 --> 00:35:07.040 without Wikipedia topics. 00:35:07.040 --> 00:35:10.119 Maybe there's a hope we can all vote about 00:35:10.119 --> 00:35:14.218 this thing we didn't fix for seven years. 00:35:14.218 --> 00:35:17.607 So do you have any ideas and comments about this? 00:35:18.217 --> 00:35:20.328 So you're talking about the fact 00:35:20.328 --> 00:35:23.518 that someone who is not logged into Wikidata 00:35:23.518 --> 00:35:25.971 can't change their language easily? 00:35:25.971 --> 00:35:27.839 (person 9) No, for [inaudible] users. 
00:35:28.309 --> 00:35:30.689 So, if they are logged in, 00:35:30.689 --> 00:35:34.871 they can just change their language at the top of the page, 00:35:35.891 --> 00:35:38.099 and then it will appear 00:35:39.769 --> 00:35:42.013 where the labels' description [inaudible] are, 00:35:42.013 --> 00:35:43.483 and they can edit it. 00:35:45.657 --> 00:35:49.009 (person 9) Well, actually, usually many times the workflow 00:35:49.009 --> 00:35:52.447 is that if you want to have multiple languages, they are available, 00:35:52.447 --> 00:35:55.419 and it's not always the case. 00:35:55.419 --> 00:35:58.584 Okay, maybe we should sit down after this talk and you show me. 00:36:01.562 --> 00:36:04.089 Cool. More questions? 00:36:05.534 --> 00:36:06.536 Yes. 00:36:11.595 --> 00:36:13.196 (person 10) Thanks for the presentation. 00:36:14.106 --> 00:36:15.127 Can you comment 00:36:15.127 --> 00:36:19.307 on the state of the collaboration with the Wiktionary community? 00:36:19.307 --> 00:36:22.296 As far as I've seen, there were some discussions 00:36:22.296 --> 00:36:26.051 about importing some elements of the work, 00:36:26.051 --> 00:36:30.843 but there seem to be licensing issues and some disagreements, et cetera. 00:36:30.843 --> 00:36:31.848 Right. 00:36:31.848 --> 00:36:36.330 So, Wiktionary communities have spent a lot of time 00:36:37.320 --> 00:36:39.473 building Wiktionary. 00:36:39.473 --> 00:36:42.643 They have built 00:36:43.193 --> 00:36:47.554 amazingly complicated and complex templates 00:36:47.554 --> 00:36:53.614 to build pretty tables that automatically generate forms for you 00:36:53.614 --> 00:36:56.392 and all kinds of really impressive, 00:36:56.392 --> 00:37:00.683 and kind of crazy stuff, if you think about it. 00:37:02.311 --> 00:37:07.994 And, of course, they have invested a lot of time and effort into that.
00:37:09.364 --> 00:37:11.801 And understandably, 00:37:11.801 --> 00:37:17.116 they don't just want that to be grabbed, 00:37:18.046 --> 00:37:19.102 just like that. 00:37:19.102 --> 00:37:21.791 So there's some of that coming from there. 00:37:22.761 --> 00:37:25.137 And that's fine, that's okay. 00:37:25.737 --> 00:37:32.092 Now, the first Wiktionary communities are talking about turning out 00:37:32.092 --> 00:37:34.329 and importing some of their data into Wikidata. 00:37:34.329 --> 00:37:39.095 Russian, you have seen, for example, is one of those cases. 00:37:40.375 --> 00:37:42.355 And I expect more of that to happen. 00:37:43.635 --> 00:37:46.800 But it will be a slow process, 00:37:46.800 --> 00:37:49.383 just like adoption of Wikidata's data on Wikipedia 00:37:49.383 --> 00:37:51.909 has been a rather slow process. 00:37:52.849 --> 00:37:56.183 On the other side of making it actually easier 00:37:56.183 --> 00:37:59.132 to use the data that is in lexemes, 00:37:59.132 --> 00:38:02.209 on Wiktionary, so that they can make use of that 00:38:02.209 --> 00:38:05.531 and share data between the language Wiktionaries 00:38:05.531 --> 00:38:08.853 which is super hard to impossible right now, 00:38:08.853 --> 00:38:11.560 which is crazy, just like it was on Wikipedia. 00:38:13.860 --> 00:38:16.325 Wait for the birthday present. (chuckles) 00:38:20.038 --> 00:38:21.182 Yes. 00:38:22.599 --> 00:38:24.827 (person 11) When I was thinking the other way around it, 00:38:24.827 --> 00:38:28.168 I actually didn't want to say it because I think this will be super silly, 00:38:28.168 --> 00:38:32.003 but I think that Wiktionary already has some content, 00:38:32.003 --> 00:38:34.978 and I know that we can't transfer it to Wikidata 00:38:34.978 --> 00:38:37.048 because there's a difference in licenses. 00:38:37.048 --> 00:38:39.631 But I was thinking maybe we can do something about that.
00:38:40.321 --> 00:38:45.913 Maybe, I don't know, we can obtain the communities' permission 00:38:45.913 --> 00:38:51.205 after like, I don't know, having like a public voting 00:38:52.075 --> 00:38:55.642 and for the community, the active members of the community 00:38:55.642 --> 00:39:02.523 to vote and say if they would like or accept or to transfer the content 00:39:02.523 --> 00:39:05.528 for which they may do the Wikidata lexemes. 00:39:06.238 --> 00:39:08.537 Because I just think it is such a waste. 00:39:09.568 --> 00:39:14.443 So, that's definitely a conversation those people 00:39:14.443 --> 00:39:18.249 who are in Wiktionary communities are very welcome to bring up there. 00:39:18.249 --> 00:39:24.647 I think it would be a bit presumptuous for us to go and force that. 00:39:25.917 --> 00:39:31.142 But, yeah, I think it's definitely worth having a conversation. 00:39:31.142 --> 00:39:33.898 But I think it's also important to understand 00:39:33.898 --> 00:39:39.082 that there's a distinction between what is actually legally allowed 00:39:39.082 --> 00:39:43.147 and what we should be doing 00:39:43.147 --> 00:39:45.426 and what those people want or do not want. 00:39:45.736 --> 00:39:47.329 So even if it's legally allowed, 00:39:47.329 --> 00:39:50.640 if some other Wiktionary communities do not want that, 00:39:50.640 --> 00:39:53.537 I would be careful, at least. 00:39:58.886 --> 00:40:02.489 I think you need the mic for the stream. 00:40:04.540 --> 00:40:07.299 (person 12) So, obviously, it's all very exciting, 00:40:07.979 --> 00:40:12.319 and I immediately think how can I take that to my students 00:40:12.319 --> 00:40:15.558 and how can I incorporate it with the courses, 00:40:15.558 --> 00:40:18.531 the work that we're doing, educational settings. 
00:40:18.531 --> 00:40:22.271 And I don't have, at the moment, 00:40:22.871 --> 00:40:24.116 first of all, enough knowledge, 00:40:24.116 --> 00:40:27.278 but I think the documentation that we do have 00:40:27.808 --> 00:40:30.082 could be maybe improved. 00:40:30.082 --> 00:40:33.437 So that's a kind of request to make cool videos 00:40:33.437 --> 00:40:35.898 that explain how it works 00:40:35.898 --> 00:40:39.948 because if we have it, we can then use it, 00:40:39.948 --> 00:40:41.985 and we can have students on board, 00:40:41.985 --> 00:40:47.072 and we can make people understand how awesome it all is. 00:40:47.072 --> 00:40:52.001 And yeah, just think about documentation and think about education, please. 00:40:52.001 --> 00:40:54.480 Because I think a lot could be done. 00:40:54.480 --> 00:40:58.585 These are like many tasks that could be done even with... 00:41:00.125 --> 00:41:02.033 well, I wouldn't say primary schools, 00:41:02.033 --> 00:41:05.495 but certainly, even younger students. 00:41:05.915 --> 00:41:10.866 And so I would really like to see that potential being tapped into, 00:41:10.866 --> 00:41:15.272 and, as of now, I personally don't understand enough 00:41:15.272 --> 00:41:19.500 to be able to create tasks or to create like... 00:41:20.430 --> 00:41:22.155 to do something practical with it. 00:41:22.155 --> 00:41:25.772 So any help, any thoughts anyone here has about that, 00:41:25.772 --> 00:41:29.648 I would be very happy to hear your thoughts, and yours as well. 00:41:30.508 --> 00:41:32.129 Yeah, let's talk about that. 00:41:35.473 --> 00:41:37.139 More questions? 00:41:37.809 --> 00:41:39.195 Someone else raised a hand. 00:41:39.195 --> 00:41:40.495 I forgot where it was. 
00:41:45.739 --> 00:41:49.996 (person 13) So, if we can't import from Wiktionary, 00:41:49.996 --> 00:41:55.772 is there some concerted effort to find other public domain sources, 00:41:55.772 --> 00:41:57.459 maybe all the data, 00:41:58.769 --> 00:42:03.167 and kind of prefilter it, organize it 00:42:03.167 --> 00:42:08.470 so that it's easy for people to check for import? 00:42:09.093 --> 00:42:11.181 So there are first efforts. 00:42:11.181 --> 00:42:14.769 My understanding is that Basque is one of those efforts. 00:42:14.769 --> 00:42:17.474 Maybe you want to say a bit more about it? 00:42:18.426 --> 00:42:20.130 (person 14) [inaudible] 00:42:23.166 --> 00:42:27.148 Okay, the actual answer is paying for that... 00:42:28.374 --> 00:42:33.381 I mean, we have an agreement with a contractor we usually work with. 00:42:34.801 --> 00:42:38.725 They do dictionaries-- 00:42:40.315 --> 00:42:42.458 lots of stuff, but they do dictionaries. 00:42:42.458 --> 00:42:47.473 So we agreed with them to make the students' dictionary free, 00:42:47.473 --> 00:42:52.782 we would [cast] the most common words and start uploading it 00:42:52.782 --> 00:42:55.590 with an external identifier and the scheme of things. 00:42:56.420 --> 00:43:02.902 But there was some discussion about leaving it on CC0 00:43:03.212 --> 00:43:05.322 because they have the dictionary with CC BY, 00:43:06.537 --> 00:43:10.326 and they understood what the difference was. 00:43:10.326 --> 00:43:13.866 So there was some discussion. 00:43:13.866 --> 00:43:19.709 But I think that we can provide some tools or some examples in the future, 00:43:19.709 --> 00:43:21.761 and I think that there will be other dictionaries 00:43:21.761 --> 00:43:24.016 that we can handle, 00:43:24.016 --> 00:43:29.274 and also I think Wiktionary should start moving in that direction, 00:43:29.274 --> 00:43:32.260 but that's another great discussion.
00:43:33.285 --> 00:43:34.487 And on top of that, 00:43:34.487 --> 00:43:38.839 Lea is also in contact with people from Occitan 00:43:38.839 --> 00:43:41.827 who work on Occitan dictionaries, 00:43:41.827 --> 00:43:45.138 and they're currently working on a similar collaboration. 00:43:51.644 --> 00:43:53.363 More questions? 00:44:01.487 --> 00:44:05.349 (person 15) Hi! We are the people who want to import Occitan data. 00:44:05.349 --> 00:44:06.585 Aha! Perfect! 00:44:06.585 --> 00:44:08.368 (person 15) And we have a small problem. 00:44:09.188 --> 00:44:14.215 We don't know how to represent the variety of all lexemes. 00:44:14.215 --> 00:44:17.893 We have six dialects, 00:44:17.893 --> 00:44:24.014 and we want to indicate for each lexeme in which dialect it's used, 00:44:24.014 --> 00:44:27.285 and we don't have a proper C0 statement to do that. 00:44:27.285 --> 00:44:31.105 So as long as the statement doesn't exist, 00:44:31.635 --> 00:44:34.465 it prevents us from [inaudible] 00:44:34.465 --> 00:44:37.603 because we will need to do it again 00:44:37.603 --> 00:44:42.076 when we will be able to [export] the statement. 00:44:42.076 --> 00:44:44.551 And it's complicated because it's a statement 00:44:44.551 --> 00:44:47.802 which won't be asked for by many people 00:44:47.802 --> 00:44:53.444 because it's a statement which concerns mostly minority languages. 00:44:53.444 --> 00:44:56.933 So you will have one person to ask this. 00:44:56.933 --> 00:45:00.022 But, as with our Basque colleagues, 00:45:00.022 --> 00:45:06.082 it can be one person who will power thousands of others, 00:45:06.082 --> 00:45:10.884 so it might not be asking a lot, 00:45:10.884 --> 00:45:14.136 but it will be very important for us. 00:45:14.874 --> 00:45:17.600 Do you already have a new property proposal up, 00:45:17.600 --> 00:45:19.470 or do you need help creating it? 00:45:21.524 --> 00:45:24.300 (person 15) We asked four months ago.
00:45:24.720 --> 00:45:28.755 Alright, then let's get some people to help out with this property proposal. 00:45:30.159 --> 00:45:33.092 I'm sure there are enough people in this room to make this happen. 00:45:33.360 --> 00:45:35.452 (person 15) Property proposal [speaking in French]. 00:45:35.452 --> 00:45:36.965 (person 16) We didn't have an answer. 00:45:36.965 --> 00:45:39.769 (person 15) We didn't have any answer, and we don't know how to do this 00:45:39.769 --> 00:45:42.953 because we aren't in the Wikidata community. 00:45:44.694 --> 00:45:48.817 Yup, so there are people here who can help you. 00:45:48.817 --> 00:45:52.134 Maybe someone raises their hand to take-- 00:45:52.574 --> 00:45:53.644 (person 14) I'm for that. 00:45:53.644 --> 00:45:55.512 But I think this is quite interesting 00:45:55.512 --> 00:45:59.059 that only the variant of form 00:45:59.059 --> 00:46:02.607 also can handle it geographically, 00:46:02.607 --> 00:46:04.995 with coordinates or some kind of mapping. 00:46:05.595 --> 00:46:07.815 Also having different pronunciations, 00:46:07.815 --> 00:46:11.837 and I think this is something that happens in lots of languages. 00:46:12.607 --> 00:46:16.262 We should start making it happen [inaudible], 00:46:16.262 --> 00:46:18.865 and I'm going to search for the property. 00:46:19.782 --> 00:46:20.933 Cool. 00:46:20.933 --> 00:46:24.446 So you will get backing for your property proposal. 00:46:26.136 --> 00:46:27.297 Thank you. 00:46:28.153 --> 00:46:30.261 Alright, more questions? 00:46:32.410 --> 00:46:33.474 Finn. 00:46:33.974 --> 00:46:35.055 Finn is one of those people 00:46:35.055 --> 00:46:38.031 who builds stuff on top of lexicographical data. 00:46:38.031 --> 00:46:40.085 (Finn) It's just a small question, 00:46:40.405 --> 00:46:44.226 and that's about spelling variations. 00:46:44.896 --> 00:46:48.002 It seems to be difficult to put them in... 00:46:48.532 --> 00:46:53.368 You could, of course, have multiple forms for the same word. 
00:46:56.327 --> 00:46:58.448 I don't know, it seems to be... 00:46:59.558 --> 00:47:03.535 If you don't do it that way, it seems to be difficult to specify... 00:47:04.771 --> 00:47:05.888 or I don't know whether 00:47:05.888 --> 00:47:09.731 this is just a minor technical issue or whether... 00:47:09.731 --> 00:47:11.252 Let's look at it together. 00:47:11.642 --> 00:47:15.230 I would love to see an example. 00:47:17.478 --> 00:47:18.478 Asaf. 00:47:26.886 --> 00:47:28.396 (Asaf) Thank you. 00:47:29.386 --> 00:47:33.685 I can give a very concrete example from my mother tongue, Hebrew. 00:47:34.205 --> 00:47:38.845 Hebrew has two main variants 00:47:38.845 --> 00:47:42.786 for expressing almost every word 00:47:42.786 --> 00:47:47.640 because the traditional spelling 00:47:47.640 --> 00:47:50.044 leaves out many of the vowels. 00:47:50.934 --> 00:47:55.207 And, therefore, in modern editions of the Bible and of poetry, 00:47:55.207 --> 00:47:57.461 diacritics are used. 00:47:57.461 --> 00:48:02.670 However, those diacritics are never used for modern prose 00:48:02.670 --> 00:48:05.974 or newspaper writing or street signs. 00:48:05.974 --> 00:48:11.209 So the average daily casual use puts in extra vowels 00:48:12.169 --> 00:48:13.519 and doesn't use the diacritics 00:48:13.519 --> 00:48:15.607 because they are, of course, more cumbersome 00:48:15.607 --> 00:48:17.893 and have all kinds of rules and nobody knows the rules. 00:48:18.633 --> 00:48:20.531 So there are basically two variants. 00:48:20.531 --> 00:48:25.322 There's the everyday casual prose variant, 00:48:25.322 --> 00:48:27.827 and there's the Bible or poetry, 00:48:27.827 --> 00:48:32.200 which always come in this traditional diacriticized text. 00:48:32.200 --> 00:48:33.302 To be useful, 00:48:33.302 --> 00:48:37.428 Lexeme would have to recognize both varieties of every single word 00:48:37.428 --> 00:48:39.747 and every single form of every single word. 
00:48:40.677 --> 00:48:43.391 So that's a very comprehensive use case 00:48:43.391 --> 00:48:46.340 for official stable variants. 00:48:46.340 --> 00:48:48.942 It's not dialect, it's not regions, 00:48:49.332 --> 00:48:53.627 it's basically two coexisting morphological systems. 00:48:54.537 --> 00:48:58.926 And I too don't know exactly how to express that in Lexeme today, 00:48:58.926 --> 00:49:02.800 which is one thing that is keeping me in partial answer to Magnus' question 00:49:02.800 --> 00:49:05.238 from uploading the parts that are ready 00:49:05.238 --> 00:49:09.394 from the biggest Hebrew dictionary, which is public domain 00:49:09.394 --> 00:49:13.141 and which I have been digitizing for several years now. 00:49:13.141 --> 00:49:14.803 A good portion of it is ready, 00:49:14.803 --> 00:49:16.549 but I'm not putting it on Lexeme right now 00:49:16.549 --> 00:49:20.245 because I don't know exactly how to solve this problem. 00:49:20.245 --> 00:49:23.387 Alright, let's solve this problem here. (chuckles) 00:49:24.503 --> 00:49:26.021 That has to be possible. 00:49:30.045 --> 00:49:32.047 Alright, more questions? 00:49:37.173 --> 00:49:39.735 If not, then thank you so much. 00:49:40.605 --> 00:49:42.675 (applause)