0:00:06.303,0:00:07.362
(Lydia) Thank you so much.
0:00:07.362,0:00:11.244
So, this conference,[br]one of the big themes is languages.
0:00:14.220,0:00:18.508
I want to give you an overview[br]of where we actually are currently
0:00:18.508,0:00:19.812
when it comes to languages
0:00:20.264,0:00:22.167
and where we can go from here.
0:00:29.036,0:00:32.580
Wikidata is all about giving more people[br]more access to more knowledge,
0:00:32.580,0:00:37.168
and language is such an important part[br]of making that a reality,
0:00:38.205,0:00:43.291
especially since more and more[br]of our lives depend on technology.
0:00:44.114,0:00:48.873
And as our keynote speaker[br]earlier today was saying,
0:00:49.723,0:00:51.588
some of the technology[br]leaves people behind
0:00:51.588,0:00:55.020
simply because they can't speak[br]a certain language,
0:00:55.320,0:00:57.573
and that's not okay.
0:00:58.633,0:01:02.097
So we want to do something about that.
0:01:02.927,0:01:05.841
And in order to change that,[br]you need at least two things.
0:01:06.411,0:01:11.270
One is you need to provide content[br]to the people in their language,
0:01:11.270,0:01:12.955
and the second thing you need
0:01:12.955,0:01:15.910
is to provide them[br]with interaction in their language
0:01:15.910,0:01:19.189
in those applications[br]or whatever it is you have.
0:01:20.367,0:01:25.277
And Wikidata helps with both of those.
0:01:25.277,0:01:28.408
And the first thing,[br]content in your language,
0:01:28.408,0:01:30.879
that is basically what we have[br]in items and properties,
0:01:31.319,0:01:33.082
how we describe the world.
0:01:33.082,0:01:35.085
Now, this is certainly[br]not everything you need,
0:01:35.085,0:01:39.294
but it gets you quite far ahead.
0:01:39.764,0:01:41.847
The other thing[br]is interaction in your language,
0:01:41.847,0:01:46.389
and that's where lexemes come into play,
0:01:46.389,0:01:49.382
if you want to talk[br]to your digital personal assistant
0:01:49.382,0:01:54.918
or if you want to have your device[br]translate a text and things like that.
0:01:56.404,0:01:59.254
Alright, let's look into[br]content in your language.
0:01:59.254,0:02:03.396
So what we have in items and properties.
0:02:05.406,0:02:09.696
For this, the labels in those items[br]and properties are crucial.
0:02:10.236,0:02:14.866
We need to know what this entity[br]is called that we're talking about.
0:02:15.656,0:02:19.987
And instead of talking about Q5,
0:02:19.987,0:02:22.180
someone who speaks English[br]knows that's a "human,"
0:02:22.180,0:02:24.706
someone who speaks German[br]knows that's a "Mensch,"
0:02:24.706,0:02:26.374
and similar things.
0:02:26.374,0:02:29.742
So those labels on items and properties
0:02:29.742,0:02:33.619
are bridging the gap[br]between humans and machines.
0:02:33.619,0:02:35.439
And humans and humans
0:02:35.439,0:02:40.115
making more existing knowledge[br]accessible to them.
0:02:43.270,0:02:46.290
Now, that's a nice aspiration.
0:02:46.290,0:02:48.342
What does it actually look like?
0:02:48.342,0:02:49.607
It looks like this.
0:02:50.947,0:02:52.416
What you're seeing here
0:02:52.416,0:02:58.496
is that most of the items[br]on Wikidata have two labels,
0:02:58.496,0:03:00.767
so labels in two languages.
0:03:01.697,0:03:03.851
And after that, it's one, and then three,
0:03:03.851,0:03:06.115
and then it becomes very sad.
0:03:06.781,0:03:08.581
(quiet laughter)
0:03:10.047,0:03:12.713
I think we need to do better than this.
0:03:14.185,0:03:15.319
But, on the other hand,
0:03:15.319,0:03:17.478
I was actually expecting this[br]to be even worse.
0:03:17.478,0:03:19.560
I was expecting the average to be one.
0:03:19.560,0:03:22.503
So I was quite happy[br]to see two. (chuckles)
0:03:24.921,0:03:26.186
Alright.
0:03:27.156,0:03:29.527
But it's not just interesting to know
0:03:29.527,0:03:33.742
how many labels our items[br]and properties have.
0:03:33.742,0:03:36.565
It's also interesting to see[br]in which languages.
0:03:38.045,0:03:43.764
Here you see a graph of the languages
0:03:43.764,0:03:46.838
that we have labels for on Items.
0:03:46.838,0:03:50.669
So the biggest part there is Other.
0:03:51.229,0:03:53.863
So I just took the top 100 languages
0:03:54.533,0:03:58.902
and everything else is Other[br]to make this graph readable.
0:03:59.542,0:04:02.142
And then there's English and Dutch,
0:04:03.002,0:04:04.254
French,
0:04:05.924,0:04:09.129
and not to forget, Asturian.
0:04:09.659,0:04:11.889
- (person 1) Whoo![br]- Whoo-hoo, yes!
0:04:13.899,0:04:16.954
So what you see here is quite an imbalance
0:04:16.954,0:04:20.114
and still quite a lot of focus on English.
0:04:21.236,0:04:24.367
Another thing is if you look[br]at the same thing for Properties,
0:04:24.367,0:04:25.999
it's actually looking better.
0:04:27.399,0:04:32.750
And I think part of that is due to there[br]just being way fewer properties.
0:04:32.750,0:04:36.770
So even smaller communities[br]have a chance to keep up with that.
0:04:36.770,0:04:39.173
But it's also a pretty[br]important part of Wikidata
0:04:39.173,0:04:41.159
to localize into your language.
0:04:41.159,0:04:42.384
So that's good.
0:04:45.752,0:04:47.842
What I want to highlight[br]here with Asturian
0:04:47.842,0:04:53.698
is that a small community[br]can really make a huge difference
0:04:54.448,0:04:57.085
with some dedication and work,
0:04:57.085,0:04:58.420
and that's really cool.
0:05:01.846,0:05:03.530
A small quiz for you.
0:05:03.530,0:05:05.493
If you take all the properties on Wikidata
0:05:05.493,0:05:07.687
that are not external identifiers,
0:05:07.687,0:05:10.358
which one has the most labels,[br]like the most languages?
0:05:10.977,0:05:13.847
(audience) [inaudible]
0:05:13.847,0:05:16.786
I hear some agreement on instance of?
0:05:17.506,0:05:19.443
You would be wrong.
0:05:19.983,0:05:22.210
It's image. (chuckles)
0:05:23.230,0:05:26.366
So, yeah, that tells you,[br]if you speak one of the languages
0:05:26.366,0:05:28.621
where instance of[br]doesn't yet have a label,
0:05:28.621,0:05:30.190
you might want to add it.
0:05:32.102,0:05:35.676
So it has 148 labels currently.
0:05:37.688,0:05:41.249
But that's just another slide.
0:05:42.631,0:05:44.162
This graph tells us something
0:05:44.162,0:05:49.321
about how much content we are making[br]available in a certain language
0:05:49.321,0:05:52.042
and how much of that content[br]is actually used.
0:05:52.042,0:05:55.448
So what you're seeing is basically a curve
0:05:55.448,0:06:00.987
with most content having English labels,[br]being available in English,
0:06:01.507,0:06:04.295
and being used a lot.
0:06:04.295,0:06:06.449
And then it kind of goes down.
0:06:06.449,0:06:09.436
But, again, what you can see are outliers
0:06:09.436,0:06:15.333
who have a lot more content[br]than you would necessarily expect,
0:06:16.903,0:06:19.539
and that is really, really good.
0:06:20.839,0:06:24.945
The problem still is it's not used a lot.
0:06:25.565,0:06:28.742
Asturian and Dutch should be higher,
0:06:28.742,0:06:31.994
and I think helping those communities
0:06:33.266,0:06:35.563
increase the use[br]of the data they collected
0:06:35.563,0:06:37.682
is a really useful thing to do.
0:06:42.910,0:06:48.110
What this analysis and others[br]showed us, which is also a good thing,
0:06:48.300,0:06:51.378
is that we are seeing[br]that highly used items
0:06:51.378,0:06:55.295
also tend to have more labels
0:06:55.295,0:06:58.188
or the other way around--[br]it's not entirely clear.
0:07:02.513,0:07:04.376
And then the question is,
0:07:04.806,0:07:07.009
are we serving[br]just the powerful languages?
0:07:07.899,0:07:11.147
Or are we serving everyone?
0:07:12.757,0:07:17.743
And what you see here[br]is a grouping of languages.
0:07:17.743,0:07:21.832
The languages that are grouped together[br]tend to have labels together.
0:07:26.042,0:07:28.599
And you see it clustering.
0:07:28.599,0:07:34.065
Now here's a similar clustering, colored,
0:07:34.065,0:07:39.475
based on how alive, how used,
0:07:40.455,0:07:43.156
how endangered the language is.
0:07:43.156,0:07:44.642
And a good thing you're seeing here
0:07:44.642,0:07:49.566
is that safe languages[br]and endangered languages
0:07:49.566,0:07:53.773
do not form two different clusters.
0:07:53.773,0:07:58.872
But they're all mixed together,
0:08:00.262,0:08:04.625
which is much better than it would be[br]the other way around
0:08:04.625,0:08:09.377
where the safe languages,[br]the powerful languages
0:08:10.197,0:08:12.164
are just helping each other out.
0:08:12.744,0:08:14.356
No, that's not the case.
0:08:14.356,0:08:17.417
And it's a really good thing.
0:08:17.417,0:08:20.042
When I saw this,[br]I thought this was very good.
0:08:23.474,0:08:25.169
Here's a similar thing
0:08:26.239,0:08:28.800
where we looked at
0:08:30.230,0:08:34.222
the language's status
0:08:34.222,0:08:36.225
and how many labels it has.
0:08:39.367,0:08:42.937
What you're seeing[br]is a clear win for safe languages,
0:08:42.937,0:08:44.248
as is expected.
0:08:45.508,0:08:46.693
But what you're also seeing
0:08:46.693,0:08:54.407
is that the languages in category 2[br]and 3 and maybe even 4
0:08:54.407,0:08:59.280
are not that bad, actually,
0:08:59.280,0:09:02.367
in terms of their representation[br]in Wikidata and others.
0:09:03.287,0:09:06.408
It's a really good thing to find.
0:09:07.646,0:09:09.129
Now, if you look at the same thing
0:09:09.129,0:09:12.418
for how much of that content[br]of those labels
0:09:12.418,0:09:15.495
is actually used[br]on Wikipedia, for example,
0:09:17.455,0:09:22.563
then we see a similar[br]picture emerging again.
0:09:23.603,0:09:29.813
And it tells us that those communities[br]are actually making good use of their time
0:09:29.813,0:09:34.504
by filling in labels[br]for higher used items, for example.
0:09:36.410,0:09:40.493
There are outliers[br]where I think we can help,
0:09:41.683,0:09:48.202
to help those communities find the places[br]where their work would be most valuable.
0:09:49.312,0:09:52.663
But, overall, I'm happy with this picture.
0:09:54.823,0:09:59.844
Now, that was the items[br]and properties part of Wikidata.
0:10:00.714,0:10:03.033
Now, let's look at interaction[br]in your languages.
0:10:03.033,0:10:05.203
So the lexeme parts of Wikidata
0:10:05.203,0:10:09.394
where we describe words[br]and their forms and their meanings.
0:10:10.167,0:10:13.301
We've been doing this now[br]since May last year,
0:10:16.461,0:10:19.127
and content has been growing.
0:10:20.114,0:10:22.149
You can see here in blue the lexemes,
0:10:22.149,0:10:25.938
and then in red,[br]the forms on those lexemes
0:10:25.938,0:10:29.910
and yellow, the senses[br]on those lexemes.
0:10:30.991,0:10:34.451
So some communities--[br]we'll get to that later--
0:10:34.451,0:10:39.793
have spent a lot of time creating forms[br]and senses for their lexemes,
0:10:39.793,0:10:42.753
which is really useful
0:10:42.753,0:10:48.243
because that builds[br]the core of the data set that you need.
0:10:50.562,0:10:55.133
Now, we looked at all the languages
0:10:55.133,0:10:57.906
that have lexemes on Wikidata.
0:10:57.906,0:11:01.003
So words we have,
0:11:01.713,0:11:04.404
those are right now 310 languages.
0:11:04.884,0:11:08.290
Now, what do you think is the top language
0:11:08.290,0:11:11.949
when it comes to the number[br]of lexemes currently in Wikidata?
0:11:12.933,0:11:14.700
(audience) [inaudible]
0:11:19.183,0:11:20.216
Huh?
0:11:20.216,0:11:21.741
(person 2) German.
0:11:21.741,0:11:24.252
Sorry, I've heard it before.
0:11:24.252,0:11:25.651
It's Russian.
0:11:28.011,0:11:29.754
Russian is quite ahead.
0:11:31.897,0:11:33.832
And just to give you some perspective,
0:11:35.652,0:11:36.816
there's different opinions
0:11:36.816,0:11:42.231
but I've read, for example,[br]that 1,000 to 3,000 words
0:11:42.231,0:11:45.450
gets you to conversation level,[br]roughly, in another language,
0:11:45.450,0:11:49.461
and 4,000 to 10,000 words[br]to an advanced level.
0:11:51.591,0:11:55.282
So, we still have a bit to catch up there.
0:11:58.483,0:12:03.279
One thing I want you[br]to pay attention to is Basque here
0:12:03.279,0:12:07.744
with 10,000, roughly, lexemes.
0:12:09.244,0:12:13.003
Now, if you look at the number[br]of forms for those lexemes,
0:12:14.163,0:12:16.497
Basque is way up there,
0:12:18.257,0:12:20.006
which is really cool,
0:12:20.006,0:12:24.930
and you should go to a talk that explains[br]to you why that is the case.
0:12:27.341,0:12:31.175
Now, if you look at the number[br]of senses, so what do words mean,
0:12:32.015,0:12:35.081
Basque even gets to the top of the list.
0:12:35.081,0:12:37.102
I think that deserves an applause.
0:12:37.102,0:12:38.921
(applause)
0:12:45.678,0:12:47.118
Another short quiz.
0:12:47.118,0:12:50.181
What's the lexeme[br]with the most translations currently?
0:12:50.651,0:12:55.414
(audience) Cats, cats, [inaudible],[br]Douglas Adams, [inaudible]
0:12:56.766,0:13:00.014
All good guesses, but no.
0:13:01.012,0:13:04.137
It's this, the Russian word for "water."
0:13:09.571,0:13:12.253
Alright, so now we talked a lot
0:13:12.253,0:13:16.412
about how many lexemes,[br]forms, and senses we have,
0:13:16.412,0:13:20.493
but that's just one thing you need.
0:13:20.493,0:13:21.515
The other thing you need
0:13:21.515,0:13:25.161
is actually describing those lexemes,[br]forms, and senses
0:13:25.161,0:13:27.647
in a machine-readable way.
0:13:27.647,0:13:30.039
And for that you have statements,[br]like on items.
0:13:31.479,0:13:36.362
And one of the properties[br]you use is usage example.
0:13:36.362,0:13:38.582
So whoever is using that data
0:13:38.582,0:13:42.089
can understand how to use[br]that word in context,
0:13:42.089,0:13:44.158
so that could be a quote, for example.
0:13:45.396,0:13:47.113
And here, Polish rocks.
0:13:47.900,0:13:49.764
Good job, Polish speakers.
0:13:54.219,0:13:57.680
Another property[br]that's really useful is IPA,
0:13:57.680,0:14:00.186
so how do you pronounce this word.
0:14:00.876,0:14:07.497
Russian apparently needs[br]lots of IPA statements.
0:14:10.419,0:14:13.314
But, again, Polish, second.
0:14:17.148,0:14:20.753
And last but not least[br]we have pronunciation audio.
0:14:20.753,0:14:23.372
So that is links to files on Commons
0:14:23.372,0:14:25.959
where someone speaks the word,
0:14:25.959,0:14:29.913
so you can hear a native speaker[br]pronounce the word
0:14:29.913,0:14:32.871
in case you can't read IPA, for example.
0:14:34.959,0:14:39.205
And there's actually a really nice[br]Wikibase-powered project
0:14:39.205,0:14:40.474
called Lingua Libre
0:14:40.884,0:14:45.173
where you can go and help record[br]words in your language
0:14:45.173,0:14:47.836
that then can be added[br]to lexemes on Wikidata,
0:14:48.446,0:14:52.103
so other people can understand[br]how to pronounce your words.
0:14:53.663,0:14:55.694
(person 2) [inaudible]
0:14:55.694,0:14:57.665
If you search for "Lingua Libre,"
0:14:57.665,0:15:00.981
and I'm sure someone can post it[br]in the Telegram channel.
0:15:03.138,0:15:04.621
Those guys rock.
0:15:04.621,0:15:06.726
They did really cool stuff with Wikibase.
0:15:09.416,0:15:10.617
Alright.
0:15:12.706,0:15:17.285
Then the question is,[br]where do we go from here?
0:15:19.165,0:15:22.010
Based on the numbers I've just shown you,
0:15:23.030,0:15:25.172
we've come a long way
0:15:25.172,0:15:28.430
towards giving more people[br]more access to more knowledge
0:15:28.430,0:15:31.240
when looking at languages on Wikidata.
0:15:32.530,0:15:36.392
But there is also still[br]a lot of work ahead of us.
0:15:38.992,0:15:42.341
Some of the things[br]you can do to help, for example,
0:15:42.341,0:15:44.921
is run label-a-thons
0:15:44.921,0:15:50.124
like get people together[br]to label items in Wikidata
0:15:50.914,0:15:55.121
or do an edit-a-thon[br]around lexemes in your language
0:15:55.121,0:15:59.212
to get the most used words[br]in your language into Wikidata.
0:16:00.773,0:16:03.285
Or you can use a tool like Terminator
0:16:03.285,0:16:08.493
that helps you find the most[br]important items in your language
0:16:08.493,0:16:11.549
that are still missing a label.
0:16:13.274,0:16:18.359
Most important being measured[br]by how often it is used
0:16:18.359,0:16:22.553
in other Wikidata items[br]as links in statements.
0:16:25.768,0:16:30.022
And, of course, for the lexeme part,
0:16:31.342,0:16:35.169
now that we've got[br]a basic coverage of those lexemes,
0:16:35.169,0:16:41.163
it's also about building them out,[br]adding more statements to them
0:16:41.163,0:16:44.401
so that they actually can build the base
0:16:44.401,0:16:47.421
for meaningful applications[br]to build on top of that.
0:16:48.141,0:16:50.795
Because we're getting closer[br]to that critical mass,
0:16:50.795,0:16:53.616
but we're still away from that,
0:16:53.616,0:16:56.624
that you can build[br]serious applications on top of it.
0:16:58.277,0:17:01.680
And I hope all of you[br]will join us in doing that.
0:17:02.583,0:17:07.103
And that already brings me
0:17:07.103,0:17:09.843
to a little help from our friends,
0:17:09.843,0:17:12.812
and Bruno, do you want to come over
0:17:13.882,0:17:16.854
and talk to us about lexical masks.
0:17:17.541,0:17:18.567
(Bruno) Thank you, Lydia,
0:17:18.567,0:17:21.519
thank you for giving me[br]this short period of time
0:17:21.519,0:17:24.150
to present this work[br]that we are doing at Google
0:17:24.150,0:17:29.635
with Denny, whom most of you[br]probably have heard of or know.
0:17:30.126,0:17:32.030
At Google, I'm a linguist,
0:17:32.030,0:17:36.150
so I'm very happy to be here[br]amongst other language enthusiasts.
0:17:36.620,0:17:39.278
We are also building some lexicons,
0:17:39.278,0:17:41.766
and we have built this technology
0:17:41.766,0:17:45.589
or this approach that we think[br]can be useful for you.
0:17:46.369,0:17:48.455
Just to give you[br]a little bit of background,
0:17:48.455,0:17:52.068
this is my lexicographic[br]background talking here.
0:17:52.788,0:17:54.347
When we build a lexicon database,
0:17:54.347,0:17:58.623
it is very hard to maintain them,[br]to keep them consistent
0:17:58.623,0:18:00.125
and to exchange data,
0:18:00.125,0:18:02.027
as you probably know.
0:18:02.517,0:18:05.927
There are several attempts[br]to unify the features and the properties
0:18:05.927,0:18:09.184
that are describing[br]those lexemes and those forms,
0:18:09.184,0:18:10.936
and it's not a solved problem,
0:18:10.936,0:18:13.958
but there are some[br]unification attempts on that side.
0:18:13.958,0:18:15.209
But what is really missing--
0:18:15.209,0:18:18.732
and this is a problem we had[br]at the beginning of our project at Google--
0:18:18.732,0:18:21.607
is to try to have an internal structure
0:18:22.197,0:18:25.910
that describes what[br]a lexical entry should look like,
0:18:25.910,0:18:28.581
what kind of data[br]or what kind of information we have
0:18:28.581,0:18:32.237
and the specifications that are expected.
0:18:32.237,0:18:38.187
So, what we came up with[br]is this thing called a lexical mask.
0:18:38.897,0:18:44.841
A lexical mask describes[br]what is expected for an entry,
0:18:44.841,0:18:47.329
a lexicographic entry, to be complete,
0:18:47.329,0:18:51.436
both in terms of the number of forms[br]you expect for a lexeme,
0:18:51.436,0:18:55.607
and the number of features[br]you expect for each of those forms.
0:18:56.397,0:18:58.329
Here is an example for Italian adjectives.
0:18:58.329,0:19:02.002
You expect, in Italian, to have[br]four forms for your adjectives,
0:19:02.002,0:19:05.383
and each of these forms[br]has a specific combination
0:19:05.383,0:19:07.946
of gender and number features.
0:19:08.606,0:19:12.672
This is what we expect[br]for the Italian adjectives.
0:19:12.672,0:19:16.176
Of course, you can have[br]extremely complex masks,
0:19:16.176,0:19:20.783
like French verb conjugation,[br]which is quite extensive,
0:19:20.783,0:19:23.487
and I don't show you[br]any other Russian mask
0:19:23.487,0:19:25.378
because it doesn't fit the screen.
0:19:26.308,0:19:29.531
And we also have[br]some detailed specifications
0:19:29.531,0:19:33.421
because we distinguish[br]what is at the form level.
0:19:33.421,0:19:37.544
So here you have Russian nouns[br]that have three numbers
0:19:37.544,0:19:40.048
and a number of cases[br]with different forms,
0:19:40.048,0:19:43.086
but they also have[br]an entry level specification
0:19:43.086,0:19:45.590
that says a noun particularly has
0:19:45.590,0:19:50.133
an inherent gender[br]and an inherent animacy feature
0:19:50.133,0:19:52.488
that is also specified in the mask.
0:19:54.518,0:19:58.779
We also want to distinguish[br]that a mask gives a specification
0:19:58.779,0:20:01.874
for, in general,[br]what an entry should look like.
0:20:01.874,0:20:07.158
But you can have smaller masks[br]for defective aspects of the form
0:20:07.158,0:20:11.282
or defective aspects of the lexeme[br]that happen in language.
0:20:11.282,0:20:14.537
So here is the simplest version[br]of French verbs
0:20:14.537,0:20:19.729
that have only the 3rd person singular[br]for all the weather verbs,
0:20:19.729,0:20:23.969
like "it rains" or "it snows,"[br]like in English.
0:20:24.537,0:20:26.493
So we distinguish these two levels.
0:20:26.923,0:20:29.962
And how we use this at Google
0:20:29.962,0:20:32.643
is that when we have a lexicon[br]that we want to use,
0:20:33.063,0:20:38.309
we use the mask to really[br]literally throw the lexicon,
0:20:38.309,0:20:40.163
all the entries, through the mask
0:20:40.163,0:20:44.303
and see which entry has a problem[br]in terms of structure.
0:20:44.303,0:20:46.523
Are we missing a form?[br]Are we missing a feature?
0:20:46.523,0:20:51.497
And when there is a problem,[br]we do some human validation
0:20:51.497,0:20:53.751
or just to see if it passes the mask.
0:20:53.751,0:20:57.924
So it's an extremely powerful tool[br]to check the quality of the structure.
0:20:59.427,0:21:01.964
So what we are happy to announce today
0:21:01.964,0:21:05.408
is that we got the green light[br]to open-source our mask.
0:21:05.948,0:21:07.573
So this is a schema
0:21:07.573,0:21:09.477
that we can release
0:21:09.477,0:21:13.483
and that we will provide[br]to Wikidata as ShEx files.
0:21:13.483,0:21:16.688
This is a ShEx file for German nouns,
0:21:16.688,0:21:20.428
and Denny is working on the conversion[br]from our internal specification
0:21:20.428,0:21:23.666
to a more open-source specification.
0:21:23.666,0:21:27.522
We currently cover more than 25 languages.
0:21:27.522,0:21:29.225
So we expect to grow on our side,
0:21:29.225,0:21:34.350
but we also look for this opportunity[br]to collaborate for other languages.
0:21:34.350,0:21:40.728
And one of the ongoing collaborations[br]is the one that Denny has with Lukas.
0:21:40.728,0:21:45.052
Lukas has these great tools to have a UI
0:21:45.052,0:21:51.061
to help the user or the contributor[br]to add more forms.
0:21:51.061,0:21:54.151
So if you want to add[br]an adjective in French,
0:21:54.151,0:21:59.057
the UI is telling you[br]how many forms are expected
0:21:59.057,0:22:01.562
and what kind of features[br]this form should have.
0:22:01.562,0:22:06.268
So our mask will help the tool[br]to be defined and expanded.
0:22:07.238,0:22:08.385
That's it.
0:22:08.791,0:22:10.358
(Lydia) Thank you so much.
0:22:10.358,0:22:11.993
(applause)
0:22:14.249,0:22:16.891
Alright. Are there questions?
0:22:16.891,0:22:19.381
Do you want to talk more about lexemes?
0:22:19.817,0:22:21.475
- (person 3) Yes.[br]- Yes. (chuckles)
0:22:33.485,0:22:35.380
(person 3) My question,[br]because you were talking
0:22:35.380,0:22:39.106
about giving more access[br]to more people in more languages.
0:22:39.106,0:22:42.444
But there are a lot of languages[br]that can't be used in Wikidata.
0:22:42.444,0:22:44.588
So what solution do you have for that?
0:22:45.889,0:22:47.686
When you say they can't be used in Wikidata,
0:22:47.686,0:22:50.308
are you talking about entering labels?
0:22:50.308,0:22:52.578
- (person 3) Labels, descriptions.[br]- Right.
0:22:52.578,0:22:55.498
So, for lexemes, it's a bit different
0:22:55.498,0:22:57.793
because there we don't have[br]that restriction.
0:22:58.923,0:23:05.003
For labels on items and properties,[br]there is some restriction
0:23:05.433,0:23:12.411
because we wanted to make sure[br]that it's not completely
0:23:12.411,0:23:14.229
anyone does anything,
0:23:14.229,0:23:17.769
and it becomes unmanageable.
0:23:19.349,0:23:23.328
Even a small community who wants[br]one language and wants to work on that,
0:23:23.898,0:23:26.787
come talk to us, we will make it happen.
0:23:26.787,0:23:29.202
(person 3) I mean, we did this[br]at the Prague Hackathon in May,
0:23:29.202,0:23:32.459
and it took us until almost August[br]in order to be able to use our language.
0:23:32.459,0:23:35.135
- Yeah.[br]- (person 3) So, it's very slow.
0:23:35.135,0:23:37.854
Yeah, it is, unfortunately, very slow.
0:23:37.854,0:23:39.883
We're currently working[br]with the Language Committee
0:23:39.883,0:23:46.048
on solving some fundamental...
0:23:49.537,0:23:55.447
Like, getting agreement on what kind[br]of languages are actually "allowed,"
0:23:56.047,0:23:59.398
and that has taken too long,
0:23:59.988,0:24:04.178
which is the reason why your request[br]probably took longer than it should have.
0:24:04.778,0:24:05.963
(person 3) Thanks.
0:24:06.815,0:24:07.950
(person 4) Thank you.
0:24:07.950,0:24:10.938
Lydia, if you remember[br]the statistics that you showed,
0:24:10.938,0:24:12.886
the number of lexemes per language.
0:24:12.886,0:24:17.599
So, did you count[br]all the forms as a data point
0:24:17.599,0:24:20.034
or only lexemes?
0:24:21.289,0:24:22.941
(Lydia) Do you mean this?
0:24:22.941,0:24:24.053
Which one do you mean?
0:24:24.053,0:24:25.529
(person 4) Yes, exactly.
0:24:25.797,0:24:28.341
If you remember,[br]does this number [inaudible]
0:24:28.341,0:24:31.954
all the forms for all the lexemes[br]or just how many lexemes there are?
0:24:31.954,0:24:33.585
No, this is just a number of lexemes.
0:24:33.585,0:24:35.395
(person 4) Just a number of lexemes, okay.
0:24:35.395,0:24:36.797
So then it is a just statistic
0:24:36.797,0:24:39.390
because if it also[br]counted the forms--
0:24:39.390,0:24:40.614
that's why I'm asking--
0:24:40.614,0:24:42.817
then all the languages[br]with inflectional morphology,
0:24:42.817,0:24:45.027
like Russian, Serbian,[br]Slovenian, et cetera,
0:24:45.027,0:24:47.616
they have a natural advantage[br]because they have so many.
0:24:47.616,0:24:51.990
So, this kind of kicks in here[br]on this number of forms.
0:24:51.990,0:24:53.851
(person 4) Yeah, that was this one.[br]Thank you.
0:24:56.546,0:25:00.224
(person 5) So, I had[br]a quick question about the...
0:25:00.644,0:25:06.824
When we're talking about[br]the actual items and properties.
0:25:07.124,0:25:08.901
Like as far as I understand,
0:25:08.901,0:25:11.955
there is currently no way[br]to give an actual source
0:25:11.955,0:25:14.726
to any of the labels[br]and descriptions that are given.
0:25:14.726,0:25:18.047
So, for example,[br]because when you're talking
0:25:18.047,0:25:20.920
about an item property,
0:25:20.920,0:25:24.509
like, for example,[br]you can get conflicting labels.
0:25:24.509,0:25:25.739
Yes.
0:25:25.739,0:25:27.662
(person 5) So this person is like...
0:25:28.402,0:25:30.781
We were talking about[br]indigenous things before, for example.
0:25:30.781,0:25:35.965
So this person is a Norwegian artist[br]according to this source,
0:25:35.965,0:25:38.750
and a Sami artist,[br]according to this source.
0:25:39.550,0:25:42.883
Or, for example, in Estonian,[br]we had an issue
0:25:42.883,0:25:47.729
where we had to change terminology[br]to the officially used terminology
0:25:47.729,0:25:49.482
in official lexicons,
0:25:49.482,0:25:52.262
but we have no way to indicate really why,
0:25:52.262,0:25:53.596
like what was the source of this
0:25:53.596,0:25:55.561
and why this was better[br]and what was there before.
0:25:55.561,0:25:57.150
It was just me as a random person
0:25:57.150,0:25:59.615
just switching the thing[br]to anyone who sees it.
0:25:59.615,0:26:02.520
So is there a plan[br]to make this possible in any way
0:26:02.520,0:26:06.355
so that we can actually have[br]proper sources for the language data?
0:26:07.045,0:26:11.568
So, it is partially possible.
0:26:11.568,0:26:15.958
So, for example, when you have[br]an item for a person,
0:26:16.968,0:26:22.720
you have a statement, first name,[br]last name, and so on, of that person,
0:26:22.720,0:26:26.226
and then you can provide[br]the reference for that there.
0:26:28.211,0:26:32.544
I'm quite hesitant to add more complexity
0:26:32.544,0:26:35.557
for references on labels and descriptions,
0:26:35.557,0:26:38.624
but if people really, really think
0:26:38.624,0:26:44.939
this is something that isn't covered[br]by any reference on the statement,
0:26:44.939,0:26:46.803
then let's talk about it.
0:26:49.079,0:26:53.303
But I fear it will add a lot of complexity
0:26:53.303,0:26:56.523
for what I hope are few cases,
0:26:57.393,0:27:00.188
but I'm willing to be convinced otherwise
0:27:00.188,0:27:04.087
if people really feel[br]very strongly about this.
0:27:04.087,0:27:08.177
(person 5) I mean, if it's added[br]it probably shouldn't be the default,
0:27:08.177,0:27:12.452
shown to all the users in the beginner[br]interface, in any case.
0:27:12.452,0:27:16.190
More like, "Click here if you need to say[br]a specific thing about this."
0:27:17.632,0:27:23.368
Do we have a sense of how many times[br]that would actually matter?
0:27:24.520,0:27:26.423
(person 5) In Estonian, for example--
0:27:26.423,0:27:28.844
I expect this is true[br]of other languages as well--
0:27:29.274,0:27:34.203
for example, there is an official name[br]that is the actual legitimate translation,
0:27:34.203,0:27:36.206
for example, into English,
0:27:36.206,0:27:40.314
of, say, a specific kind of municipality.
0:27:40.614,0:27:42.182
That was my use case, for example,
0:27:42.182,0:27:44.409
where we were using the word "parish"
0:27:45.159,0:27:50.885
where the original Estonian word[br]meant something like church parish,
0:27:50.885,0:27:51.899
and that was the origin,
0:27:51.899,0:27:54.809
but that's not the official translation[br]Estonia gets right now.
0:27:55.189,0:27:58.993
In this case, I would just add it[br]as an official name statement
0:27:58.993,0:28:00.817
and add the reference there.
0:28:02.032,0:28:03.158
(person 5) Okay.
0:28:05.186,0:28:06.572
More questions, yes?
0:28:07.682,0:28:10.044
(person 6) I have two quick comments.
0:28:10.044,0:28:13.934
You specifically called out Asturian[br]as a language that does well,
0:28:13.934,0:28:16.455
and I think that's a false artifact.
0:28:16.455,0:28:17.724
Tell me about it.
0:28:17.724,0:28:19.748
(person 6) I think it's just a bot
0:28:19.748,0:28:24.068
that pasted person names,[br]like proper names,
0:28:24.068,0:28:27.172
and said, "Well, this is exactly[br]like in French or Spanish,"
0:28:27.172,0:28:28.558
and just massively copied it.
0:28:28.558,0:28:33.316
One point of evidence is that[br]you don't see that energy in Asturian
0:28:33.316,0:28:37.205
in things that actually[br]require translation, like property names,
0:28:37.205,0:28:39.648
or names of items[br]that are not proper names.
0:28:39.648,0:28:41.219
Asaf, you break my heart.
0:28:41.219,0:28:43.198
(person 6) I know,[br]I like raining on parades,
0:28:43.198,0:28:48.458
but I have good news as well,[br]which is about the pronunciation numbers.
0:28:49.408,0:28:53.515
As you probably know,[br]Commons is full of pronunciation files,
0:28:53.515,0:28:54.668
and, for example,
0:28:54.668,0:29:01.102
Dutch has no less than 300,000[br]pronunciation files already on Commons
0:29:01.912,0:29:05.051
that just need to somehow be ingested.
0:29:05.051,0:29:07.697
So if anyone's looking for a side project,
0:29:07.697,0:29:08.997
there's tons and tons
0:29:08.997,0:29:13.280
of classified, categorized[br]pronunciation files on Commons
0:29:13.280,0:29:16.893
under the category[br]"Pronunciation" by language.
0:29:16.893,0:29:22.840
So that's just waiting to be matched[br]to lexemes and put on Lexeme.
0:29:23.180,0:29:25.484
And I was wondering[br]if you could say something
0:29:25.484,0:29:26.585
about the road map,
0:29:26.585,0:29:28.757
something about how much investment
0:29:28.757,0:29:31.995
or what can we expect[br]from Lexeme in the coming year,
0:29:31.995,0:29:34.020
because I, for one, can't wait.
0:29:34.949,0:29:37.044
You can't wait? (chuckles)
0:29:37.044,0:29:39.118
- (person 6) For more.[br]- Yes. (chuckles)
0:29:44.541,0:29:49.523
Right now, we're concentrating[br]more on Wikibase and data quality
0:29:51.493,0:29:55.087
to see how much traction this gets
0:29:55.087,0:30:01.676
and then getting more feedback[br]on where the pain points are next,
0:30:01.676,0:30:06.003
and then going back to improving[br]lexicographical data further.
0:30:06.903,0:30:09.790
And one of the things[br]I'd love to hear from you
0:30:09.790,0:30:14.136
is where exactly do you see[br]the next steps,
0:30:14.136,0:30:15.966
where do you want to see improvements
0:30:15.966,0:30:20.340
so that we can then figure out[br]how to make that happen.
0:30:21.125,0:30:22.810
But, of course, you're right,
0:30:22.810,0:30:25.712
there's still so much to do[br]also on the technical side.
0:30:30.573,0:30:35.848
(person 7) Okay, as we were uploading[br]the Basque words with forms,
0:30:35.848,0:30:37.768
and you'll see some[br]of these kinds of things,
0:30:37.768,0:30:41.329
we were both like, last week we said,[br]"Oh, we are the first one in something."
0:30:42.919,0:30:44.928
It appears in the press, and it's like,
0:30:44.928,0:30:49.488
"Oh, Basque are the first time in some--[br]they are the first in something, okay."
0:30:49.488,0:30:50.606
(laughs)
0:30:50.606,0:30:53.318
And then people ask,[br]"Okay, but what is this for?"
0:30:54.678,0:30:56.849
We don't have a real good answer.
0:30:56.849,0:30:57.888
I mean it's like, okay,
0:30:57.888,0:31:01.841
this will help computers[br]to understand more our language, yes,
0:31:01.841,0:31:05.279
but what kind of tools[br]can we make in the future?
0:31:05.279,0:31:07.467
And we don't have a good answer for this.
0:31:07.467,0:31:10.625
So I don't know[br]if you have a good answer for this.
0:31:10.625,0:31:12.742
(chuckles) I don't know[br]if I have a good answer,
0:31:12.742,0:31:14.746
but I have an answer.
0:31:15.480,0:31:20.425
So I think right now,[br]as I was telling [inaudible],
0:31:20.425,0:31:21.924
we haven't reached that critical mass
0:31:21.924,0:31:25.529
where you can build a lot[br]of the really interesting tools.
0:31:25.529,0:31:27.707
But there are already some tools.
0:31:28.267,0:31:31.912
Just the other day,[br]Esther [Pandelia], for example,
0:31:31.912,0:31:33.817
released a tool where you can see,
0:31:35.837,0:31:38.889
I think it was the words on a globe
0:31:38.889,0:31:41.901
where they're spoken,[br]where they're coming from.
0:31:42.631,0:31:44.090
I'm probably wrong about this,
0:31:44.090,0:31:46.346
but she had answered[br]on the Project chat on Wikidata--
0:31:46.346,0:31:48.984
you can look it up there.
0:31:49.574,0:31:51.805
So we have seen these first tools,
0:31:51.805,0:31:55.696
just like we've seen[br]back when Wikidata started.
0:31:56.846,0:31:59.602
First some--like just a network,
0:31:59.602,0:32:03.424
and like, "Hey, look, there's this thing[br]that connects to this other thing."
0:32:04.824,0:32:07.059
And as we have more data,
0:32:07.059,0:32:10.352
and as we've reached some critical mass,
0:32:11.852,0:32:14.747
more powerful applications[br]become possible,
0:32:15.677,0:32:17.516
things like Histropedia,
0:32:19.126,0:32:21.988
things like question answering
0:32:21.988,0:32:26.663
in your digital personal assistant,[br]Platypus, and so on.
0:32:26.663,0:32:29.668
And we're seeing[br]a similar thing with lexemes.
0:32:31.198,0:32:34.650
We're at the stage[br]where you can build like these little,
0:32:34.650,0:32:37.464
hey, look, there's a connection[br]between the two things,
0:32:37.864,0:32:42.738
and there's a translation[br]of this word into that language stage,
0:32:42.738,0:32:47.747
and as we build it out[br]and as we describe more words,
0:32:47.747,0:32:49.533
more becomes possible.
0:32:49.533,0:32:51.795
Now, what becomes possible?
0:32:53.482,0:32:59.483
As Ben, our keynote speaker earlier[br]was talking about translations,
0:33:00.103,0:33:03.455
being able to translate[br]from one language to another.
0:33:03.455,0:33:07.929
And Jens, my colleague,[br]he's always talking about
0:33:07.929,0:33:11.452
the European Union[br]looking for a translator
0:33:11.452,0:33:17.439
who can translate from[br]I think it was Maltese to Swedish--
0:33:17.439,0:33:19.436
- (person 8) Estonian.[br]- Estonian.
0:33:22.016,0:33:26.211
And that is not a usual combination.
0:33:27.211,0:33:31.735
But once you have all these languages[br]in one machine-readable place,
0:33:31.735,0:33:33.143
you can do that,
0:33:33.143,0:33:36.857
you can get a dictionary
0:33:36.857,0:33:41.735
from Estonian to Maltese and back.
0:33:42.935,0:33:45.607
So covering language[br]combinations in dictionaries
0:33:45.607,0:33:47.911
that just haven't been covered before
0:33:47.911,0:33:51.050
because there wasn't[br]enough demand for it, for example,
0:33:51.050,0:33:55.540
to make it financially viable[br]and to justify the work.
0:33:55.540,0:33:57.147
Now we can do that.
0:33:59.797,0:34:02.318
Then text generation.
0:34:02.318,0:34:03.653
Lucie was earlier talking
0:34:03.653,0:34:10.136
about how she's working[br]with Hattie on generating text
0:34:10.136,0:34:14.673
to get Wikipedia articles[br]in minority languages started,
0:34:15.423,0:34:19.512
and that needs data about words,
0:34:19.512,0:34:22.589
and you need to understand[br]the language to do that.
0:34:23.769,0:34:28.133
Yeah, and those are just some[br]that come to my mind right now.
0:34:28.693,0:34:30.494
Maybe our audience has more ideas
0:34:30.494,0:34:34.353
what they want to do[br]when we have all the glorious data.
0:34:37.693,0:34:40.892
(person 9) Okay, I will deviate[br]from the lexemes topic.
0:34:40.892,0:34:42.666
I will ask the question,
0:34:42.666,0:34:45.634
how can I as a member of community
0:34:45.634,0:34:50.135
influence that priority is put on task,
0:34:50.135,0:34:56.644
that a new user comes, and he can indicate[br]what languages he wants to see and edit
0:34:56.644,0:35:01.135
without some secret Babel[br]template knowledge.
0:35:02.145,0:35:05.053
Maybe there will be this year[br]this technical wish list
0:35:05.053,0:35:07.040
without Wikipedia topics.
0:35:07.040,0:35:10.119
Maybe there's a hope[br]we can all vote about
0:35:10.119,0:35:14.218
this thing we didn't fix for seven years.
0:35:14.218,0:35:17.607
So do you have any ideas[br]and comments about this?
0:35:18.217,0:35:20.328
So you're talking about the fact
0:35:20.328,0:35:23.518
that someone who is[br]not logged into Wikidata
0:35:23.518,0:35:25.971
can't change their language easily?
0:35:25.971,0:35:27.839
(person 9) No, for [inaudible] users.
0:35:28.309,0:35:30.689
So, if they are logged in,
0:35:30.689,0:35:34.871
they can just change their language[br]at the top of the page,
0:35:35.891,0:35:38.099
and then it will appear
0:35:39.769,0:35:42.013
where the labels, descriptions,[br][inaudible] are,
0:35:42.013,0:35:43.483
and they can edit it.
0:35:45.657,0:35:49.009
(person 9) Well, actually, usually[br]many times the workflow
0:35:49.009,0:35:52.447
is that if you want to have[br]multiple languages, they are available,
0:35:52.447,0:35:55.419
and it's not always the case.
0:35:55.419,0:35:58.584
Okay, maybe we should sit down[br]after this talk and you show me.
0:36:01.562,0:36:04.089
Cool. More questions?
0:36:05.534,0:36:06.536
Yes.
0:36:11.595,0:36:13.196
(person 10) Thanks for the presentation.
0:36:14.106,0:36:15.127
Can you comment
0:36:15.127,0:36:19.307
on the state of the collaboration[br]with the Wiktionary community?
0:36:19.307,0:36:22.296
As far as I've seen,[br]there were some discussions
0:36:22.296,0:36:26.051
about importing some elements of the work,
0:36:26.051,0:36:30.843
but there seem to be licensing issues[br]and some disagreements, et cetera.
0:36:30.843,0:36:31.848
Right.
0:36:31.848,0:36:36.330
So, Wiktionary communities[br]have spent a lot of time
0:36:37.320,0:36:39.473
building Wiktionary.
0:36:39.473,0:36:42.643
They have built
0:36:43.193,0:36:47.554
amazingly complicated[br]and complex templates
0:36:47.554,0:36:53.614
to build pretty tables[br]that automatically generate forms for you
0:36:53.614,0:36:56.392
and all kinds of really impressive,
0:36:56.392,0:37:00.683
and kind of crazy stuff,[br]if you think about it.
0:37:02.311,0:37:07.994
And, of course, they have invested[br]a lot of time and effort into that.
0:37:09.364,0:37:11.801
And understandably,
0:37:11.801,0:37:17.116
they don't just want that to be grabbed,
0:37:18.046,0:37:19.102
just like that.
0:37:19.102,0:37:21.791
So there's some of that coming from there.
0:37:22.761,0:37:25.137
And that's fine, that's okay.
0:37:25.737,0:37:32.092
Now, the first Wiktionary communities[br]are talking about turning out
0:37:32.092,0:37:34.329
and importing some[br]of their data into Wikidata.
0:37:34.329,0:37:39.095
Russian, you have seen,[br]for example, is one of those cases.
0:37:40.375,0:37:42.355
And I expect more of that to happen.
0:37:43.635,0:37:46.800
But it will be a slow process,
0:37:46.800,0:37:49.383
just like adoption[br]of Wikidata's data on Wikipedia
0:37:49.383,0:37:51.909
has been a rather slow process.
0:37:52.849,0:37:56.183
On the other side[br]of making it actually easier
0:37:56.183,0:37:59.132
to use the data that is in lexemes,
0:37:59.132,0:38:02.209
on Wiktionary, so that[br]they can make use of that
0:38:02.209,0:38:05.531
and share data between[br]the language Wiktionaries
0:38:05.531,0:38:08.853
which is super hard[br]to impossible right now,
0:38:08.853,0:38:11.560
which is crazy,[br]just like it was on Wikipedia.
0:38:13.860,0:38:16.325
Wait for the birthday present. (chuckles)
0:38:20.038,0:38:21.182
Yes.
0:38:22.599,0:38:24.827
(person 11) When I was thinking[br]the other way around it,
0:38:24.827,0:38:28.168
I actually didn't want to say it[br]because I think this will be super silly,
0:38:28.168,0:38:32.003
but I think that Wiktionary[br]already has some content,
0:38:32.003,0:38:34.978
and I know that[br]we can't transfer it to Wikidata
0:38:34.978,0:38:37.048
because there's a difference in licenses.
0:38:37.048,0:38:39.631
But I was thinking maybe[br]we can do something about that.
0:38:40.321,0:38:45.913
Maybe, I don't know, we can obtain[br]the communities' permission
0:38:45.913,0:38:51.205
after like, I don't know,[br]having like a public voting
0:38:52.075,0:38:55.642
and for the community,[br]the active members of the community
0:38:55.642,0:39:02.523
to vote and say if they would like[br]or accept to transfer the content
0:39:02.523,0:39:05.528
for which they may do[br]the Wikidata lexemes.
0:39:06.238,0:39:08.537
Because I just think it is such a waste.
0:39:09.568,0:39:14.443
So, that's definitely[br]a conversation those people
0:39:14.443,0:39:18.249
who are in Wiktionary communities[br]are very welcome to bring up there.
0:39:18.249,0:39:24.647
I think it would be a bit presumptuous[br]for us to go and force that.
0:39:25.917,0:39:31.142
But, yeah, I think it's definitely worth[br]having a conversation.
0:39:31.142,0:39:33.898
But I think it's also important[br]to understand
0:39:33.898,0:39:39.082
that there's a distinction between[br]what is actually legally allowed
0:39:39.082,0:39:43.147
and what we should be doing
0:39:43.147,0:39:45.426
and what those people want or do not want.
0:39:45.736,0:39:47.329
So even if it's legally allowed,
0:39:47.329,0:39:50.640
if some other Wiktionary communities[br]do not want that,
0:39:50.640,0:39:53.537
I would be careful, at least.
0:39:58.886,0:40:02.489
I think you need the mic[br]for the stream.
0:40:04.540,0:40:07.299
(person 12) So, obviously,[br]it's all very exciting,
0:40:07.979,0:40:12.319
and I immediately think[br]how can I take that to my students
0:40:12.319,0:40:15.558
and how can I incorporate it[br]with the courses,
0:40:15.558,0:40:18.531
the work that we're doing,[br]educational settings.
0:40:18.531,0:40:22.271
And I don't have, at the moment,
0:40:22.871,0:40:24.116
first of all, enough knowledge,
0:40:24.116,0:40:27.278
but I think the documentation[br]that we do have
0:40:27.808,0:40:30.082
could be maybe improved.
0:40:30.082,0:40:33.437
So that's a kind of request[br]to make cool videos
0:40:33.437,0:40:35.898
that explain how it works
0:40:35.898,0:40:39.948
because if we have it, we can then use it,
0:40:39.948,0:40:41.985
and we can have students on board,
0:40:41.985,0:40:47.072
and we can make people understand[br]how awesome it all is.
0:40:47.072,0:40:52.001
And yeah, just think about documentation[br]and think about education, please.
0:40:52.001,0:40:54.480
Because I think a lot could be done.
0:40:54.480,0:40:58.585
These are like many tasks[br]that could be done even with...
0:41:00.125,0:41:02.033
well, I wouldn't say primary schools,
0:41:02.033,0:41:05.495
but certainly, even younger students.
0:41:05.915,0:41:10.866
And so I would really like to see[br]that potential being tapped into,
0:41:10.866,0:41:15.272
and, as of now, I personally[br]don't understand enough
0:41:15.272,0:41:19.500
to be able to create tasks[br]or to create like...
0:41:20.430,0:41:22.155
to do something practical with it.
0:41:22.155,0:41:25.772
So any help, any thoughts[br]anyone here has about that,
0:41:25.772,0:41:29.648
I would be very happy to hear[br]your thoughts, and yours as well.
0:41:30.508,0:41:32.129
Yeah, let's talk about that.
0:41:35.473,0:41:37.139
More questions?
0:41:37.809,0:41:39.195
Someone else raised a hand.
0:41:39.195,0:41:40.495
I forgot where it was.
0:41:45.739,0:41:49.996
(person 13) So, if we can't import[br]from Wiktionary,
0:41:49.996,0:41:55.772
is there some concerted effort[br]to find other public domain sources,
0:41:55.772,0:41:57.459
maybe all the data,
0:41:58.769,0:42:03.167
and kind of prefilter it, organize it
0:42:03.167,0:42:08.470
so that it's easy to be checked[br]by people for import?
0:42:09.093,0:42:11.181
So there are first efforts.
0:42:11.181,0:42:14.769
My understanding is that Basque[br]is one of those efforts.
0:42:14.769,0:42:17.474
Maybe you want to say[br]a bit more about it?
0:42:18.426,0:42:20.130
(person 14) [inaudible]
0:42:23.166,0:42:27.148
Okay, the actual answer[br]is paying for that...
0:42:28.374,0:42:33.381
I mean, we have an agreement[br]with a contractor we usually work with.
0:42:34.801,0:42:38.725
They do dictionaries--
0:42:40.315,0:42:42.458
lots of stuff, but they do dictionaries.
0:42:42.458,0:42:47.473
So we agreed with them[br]to make free the students' dictionary,
0:42:47.473,0:42:52.782
we would [cast] the most common words[br]and start uploading it
0:42:52.782,0:42:55.590
with an external identifier[br]and the scheme of things.
0:42:56.420,0:43:02.902
But there was some discussion[br]about leaving it on CC0
0:43:03.212,0:43:05.322
because they have[br]the dictionary under CC BY,
0:43:06.537,0:43:10.326
and they understood[br]what the difference was.
0:43:10.326,0:43:13.866
So there was some discussion.
0:43:13.866,0:43:19.709
But I think that we can provide some tools[br]or some examples in the future,
0:43:19.709,0:43:21.761
and I think that there will be[br]other dictionaries
0:43:21.761,0:43:24.016
that we can handle,
0:43:24.016,0:43:29.274
and also I think Wiktionary[br]should start moving in that direction,
0:43:29.274,0:43:32.260
but that's another great discussion.
0:43:33.285,0:43:34.487
And on top of that,
0:43:34.487,0:43:38.839
Lea is also in contact[br]with people from Occitan
0:43:38.839,0:43:41.827
who work on Occitan dictionaries,
0:43:41.827,0:43:45.138
and they're currently working[br]on a similar collaboration.
0:43:51.644,0:43:53.363
More questions?
0:44:01.487,0:44:05.349
(person 15) Hi! We are the people[br]who want to import Occitan data.
0:44:05.349,0:44:06.585
Aha! Perfect!
0:44:06.585,0:44:08.368
(person 15) And we have a small problem.
0:44:09.188,0:44:14.215
We don't know how to represent[br]the variety of all lexemes.
0:44:14.215,0:44:17.893
We have six dialects,
0:44:17.893,0:44:24.014
and we want to indicate for Lexeme[br]in which dialect it's used,
0:44:24.014,0:44:27.285
and we don't have a proper[br]C0 statement to do that.
0:44:27.285,0:44:31.105
So as long as the statement doesn't exist,
0:44:31.635,0:44:34.465
it prevents us from [inaudible]
0:44:34.465,0:44:37.603
because we will need to do it again
0:44:37.603,0:44:42.076
when we will be able[br]to [export] the statement.
0:44:42.076,0:44:44.551
And it's complicated[br]because it's a statement
0:44:44.551,0:44:47.802
which won't be asked by many people
0:44:47.802,0:44:53.444
because it's a statement[br]which concerns mostly minority languages.
0:44:53.444,0:44:56.933
So you will have one person to ask this.
0:44:56.933,0:45:00.022
But, as with our Basque colleagues,
0:45:00.022,0:45:06.082
it can be one person[br]who will power thousands of others,
0:45:06.082,0:45:10.884
so it might not be asking a lot,
0:45:10.884,0:45:14.136
but it will be very important for us.
0:45:14.874,0:45:17.600
Do you already have[br]a new property proposal up,
0:45:17.600,0:45:19.470
or do you need help creating it?
0:45:21.524,0:45:24.300
(person 15) We asked four months ago.
0:45:24.720,0:45:28.755
Alright, then let's get some people[br]to help out with this property proposal.
0:45:30.159,0:45:33.092
I'm sure there are enough people[br]in this room to make this happen.
0:45:33.360,0:45:35.452
(person 15) Property proposal[br][speaking in French].
0:45:35.452,0:45:36.965
(person 16) We didn't have an answer.
0:45:36.965,0:45:39.769
(person 15) We didn't have any answer,[br]and we don't know how to do this
0:45:39.769,0:45:42.953
because we aren't[br]in the Wikidata community.
0:45:44.694,0:45:48.817
Yup, so there are people here[br]who can help you.
0:45:48.817,0:45:52.134
Maybe someone raises their hand to take--
0:45:52.574,0:45:53.644
(person 14) I'm for that.
0:45:53.644,0:45:55.512
But I think this is quite interesting
0:45:55.512,0:45:59.059
that only the variant of form
0:45:59.059,0:46:02.607
also can handle it geographically,
0:46:02.607,0:46:04.995
with coordinates or some kind of mapping.
0:46:05.595,0:46:07.815
Also having different pronunciations,
0:46:07.815,0:46:11.837
and I think this is something[br]that happens in lots of languages.
0:46:12.607,0:46:16.262
We should start making[br]it happen [inaudible],
0:46:16.262,0:46:18.865
and I'm going to search for the property.
0:46:19.782,0:46:20.933
Cool.
0:46:20.933,0:46:24.446
So you will get backing[br]for your property proposal.
0:46:26.136,0:46:27.297
Thank you.
0:46:28.153,0:46:30.261
Alright, more questions?
0:46:32.410,0:46:33.474
Finn.
0:46:33.974,0:46:35.055
Finn is one of those people
0:46:35.055,0:46:38.031
who builds stuff[br]on top of lexicographical data.
0:46:38.031,0:46:40.085
(Finn) It's just a small question,
0:46:40.405,0:46:44.226
and that's about spelling variations.
0:46:44.896,0:46:48.002
It seems to be difficult to put them in...
0:46:48.532,0:46:53.368
You could, of course,[br]have multiple forms for the same word.
0:46:56.327,0:46:58.448
I don't know, it seems to be...
0:46:59.558,0:47:03.535
If you don't do it that way,[br]it seems to be difficult to specify...
0:47:04.771,0:47:05.888
or I don't know whether
0:47:05.888,0:47:09.731
this is just a minor technical issue[br]or whether...
0:47:09.731,0:47:11.252
Let's look at it together.
0:47:11.642,0:47:15.230
I would love to see an example.
0:47:17.478,0:47:18.478
Asaf.
0:47:26.886,0:47:28.396
(Asaf) Thank you.
0:47:29.386,0:47:33.685
I can give a very concrete example[br]from my mother tongue, Hebrew.
0:47:34.205,0:47:38.845
Hebrew has two main variants
0:47:38.845,0:47:42.786
for expressing almost every word
0:47:42.786,0:47:47.640
because the traditional spelling
0:47:47.640,0:47:50.044
leaves out many of the vowels.
0:47:50.934,0:47:55.207
And, therefore, in modern editions[br]of the Bible and of poetry,
0:47:55.207,0:47:57.461
diacritics are used.
0:47:57.461,0:48:02.670
However, those diacritics[br]are never used for modern prose
0:48:02.670,0:48:05.974
or newspaper writing or street signs.
0:48:05.974,0:48:11.209
So the average daily casual use[br]puts in extra vowels
0:48:12.169,0:48:13.519
and doesn't use the diacritics
0:48:13.519,0:48:15.607
because they are,[br]of course, more cumbersome
0:48:15.607,0:48:17.893
and have all kinds of rules[br]and nobody knows the rules.
0:48:18.633,0:48:20.531
So there are basically two variants.
0:48:20.531,0:48:25.322
There's the everyday casual prose variant,
0:48:25.322,0:48:27.827
and there's the Bible or poetry,
0:48:27.827,0:48:32.200
which always come[br]in this traditional diacriticized text.
0:48:32.200,0:48:33.302
To be useful,
0:48:33.302,0:48:37.428
Lexeme would have to recognize[br]both varieties of every single word
0:48:37.428,0:48:39.747
and every single form[br]of every single word.
0:48:40.677,0:48:43.391
So that's a very comprehensive use case
0:48:43.391,0:48:46.340
for official stable variants.
0:48:46.340,0:48:48.942
It's not dialect, it's not regions,
0:48:49.332,0:48:53.627
it's basically two coexisting[br]morphological systems.
0:48:54.537,0:48:58.926
And I too don't know exactly[br]how to express that in Lexeme today,
0:48:58.926,0:49:02.800
which is one thing that is keeping me[br]in partial answer to Magnus' question
0:49:02.800,0:49:05.238
from uploading the parts that are ready
0:49:05.238,0:49:09.394
from the biggest Hebrew dictionary,[br]which is public domain
0:49:09.394,0:49:13.141
and which I have been digitizing[br]for several years now.
0:49:13.141,0:49:14.803
A good portion of it is ready,
0:49:14.803,0:49:16.549
but I'm not putting it on Lexeme right now
0:49:16.549,0:49:20.245
because I don't know exactly[br]how to solve this problem.
0:49:20.245,0:49:23.387
Alright, let's solve[br]this problem here. (chuckles)
0:49:24.503,0:49:26.021
That has to be possible.
0:49:30.045,0:49:32.047
Alright, more questions?
0:49:37.173,0:49:39.735
If not, then thank you so much.
0:49:40.605,0:49:42.675
(applause)