1
00:00:06,303 --> 00:00:07,362
(Lydia) Thank you so much.
2
00:00:07,362 --> 00:00:11,244
So, this conference,
one of the big themes is languages.
3
00:00:14,220 --> 00:00:18,508
I want to give you an overview
of where we actually are currently
4
00:00:18,508 --> 00:00:19,812
when it comes to languages
5
00:00:20,264 --> 00:00:22,167
and where we can go from here.
6
00:00:29,036 --> 00:00:32,580
Wikidata is all about giving more people
more access to more knowledge,
7
00:00:32,580 --> 00:00:37,168
and language is such an important part
of making that a reality,
8
00:00:38,205 --> 00:00:43,291
especially since more and more
of our lives depend on technology.
9
00:00:44,114 --> 00:00:48,873
And as our keynote speaker
earlier today was saying,
10
00:00:49,723 --> 00:00:51,588
some of the technology
leaves people behind
11
00:00:51,588 --> 00:00:55,020
simply because they can't speak
a certain language,
12
00:00:55,320 --> 00:00:57,573
and that's not okay.
13
00:00:58,633 --> 00:01:02,097
So we want to do something about that.
14
00:01:02,927 --> 00:01:05,841
And in order to change that,
you need at least two things.
15
00:01:06,411 --> 00:01:11,270
One is you need to provide content
to the people in their language,
16
00:01:11,270 --> 00:01:12,955
and the second thing you need
17
00:01:12,955 --> 00:01:15,910
is to provide them
with interaction in their language
18
00:01:15,910 --> 00:01:19,189
in those applications
or whatever it is you have.
19
00:01:20,367 --> 00:01:25,277
And Wikidata helps with both of those.
20
00:01:25,277 --> 00:01:28,408
And the first thing,
content in your language,
21
00:01:28,408 --> 00:01:30,879
that is basically what we have
in items and properties,
22
00:01:31,319 --> 00:01:33,082
how we describe the world.
23
00:01:33,082 --> 00:01:35,085
Now, this is certainly
not everything you need,
24
00:01:35,085 --> 00:01:39,294
but it gets you quite far ahead.
25
00:01:39,764 --> 00:01:41,847
The other thing
is interaction in your language,
26
00:01:41,847 --> 00:01:46,389
and that's where lexemes come into play,
27
00:01:46,389 --> 00:01:49,382
if you want to talk
to your digital personal assistant
28
00:01:49,382 --> 00:01:54,918
or if you want to have your device
translate a text and things like that.
29
00:01:56,404 --> 00:01:59,254
Alright, let's look into
content in your language.
30
00:01:59,254 --> 00:02:03,396
So what we have in items and properties.
31
00:02:05,406 --> 00:02:09,696
For this, the labels in those items
and properties are crucial.
32
00:02:10,236 --> 00:02:14,866
We need to know what this entity
is called that we're talking about.
33
00:02:15,656 --> 00:02:19,987
And instead of talking about Q5,
34
00:02:19,987 --> 00:02:22,180
someone who speaks English
knows that's a "human,"
35
00:02:22,180 --> 00:02:24,706
someone who speaks German
knows that's a "Mensch,"
36
00:02:24,706 --> 00:02:26,374
and similar things.
37
00:02:26,374 --> 00:02:29,742
So those labels on items and properties
38
00:02:29,742 --> 00:02:33,619
are bridging the gap
between humans and machines.
39
00:02:33,619 --> 00:02:35,439
And humans and humans
40
00:02:35,439 --> 00:02:40,115
making more existing knowledge
accessible to them.
41
00:02:43,270 --> 00:02:46,290
Now, that's a nice aspiration.
42
00:02:46,290 --> 00:02:48,342
What does it actually look like?
43
00:02:48,342 --> 00:02:49,607
It looks like this.
44
00:02:50,947 --> 00:02:52,416
What you're seeing here
45
00:02:52,416 --> 00:02:58,496
is that most of the items
on Wikidata have two labels,
46
00:02:58,496 --> 00:03:00,767
so labels in two languages.
47
00:03:01,697 --> 00:03:03,851
And after that, it's one, and then three,
48
00:03:03,851 --> 00:03:06,115
and then it becomes very sad.
49
00:03:06,781 --> 00:03:08,581
(quiet laughter)
50
00:03:10,047 --> 00:03:12,713
I think we need to do better than this.
51
00:03:14,185 --> 00:03:15,319
But, on the other hand,
52
00:03:15,319 --> 00:03:17,478
I was actually expecting this
to be even worse.
53
00:03:17,478 --> 00:03:19,560
I was expecting the average to be one.
54
00:03:19,560 --> 00:03:22,503
So I was quite happy
to see two. (chuckles)
55
00:03:24,921 --> 00:03:26,186
Alright.
56
00:03:27,156 --> 00:03:29,527
But it's not just interesting to know
57
00:03:29,527 --> 00:03:33,742
how many labels our items
and properties have.
58
00:03:33,742 --> 00:03:36,565
It's also interesting to see
in which languages.
59
00:03:38,045 --> 00:03:43,764
Here you see a graph of the languages
60
00:03:43,764 --> 00:03:46,838
that we have labels for on Items.
61
00:03:46,838 --> 00:03:50,669
So the biggest part there is Other.
62
00:03:51,229 --> 00:03:53,863
So I just took the top 100 languages
63
00:03:54,533 --> 00:03:58,902
and everything else is Other
to make this graph readable.
64
00:03:59,542 --> 00:04:02,142
And then there's English and Dutch,
65
00:04:03,002 --> 00:04:04,254
French,
66
00:04:05,924 --> 00:04:09,129
and not to forget, Asturian.
67
00:04:09,659 --> 00:04:11,889
- (person 1) Whoo!
- Whoo-hoo, yes!
68
00:04:13,899 --> 00:04:16,954
So what you see here is quite an imbalance
69
00:04:16,954 --> 00:04:20,114
and still quite a lot of focus on English.
70
00:04:21,236 --> 00:04:24,367
Another thing is if you look
at the same thing for Properties,
71
00:04:24,367 --> 00:04:25,999
it's actually looking better.
72
00:04:27,399 --> 00:04:32,750
And I think part of that is due to
there just being way fewer properties.
73
00:04:32,750 --> 00:04:36,770
So even smaller communities
have a chance to keep up with that.
74
00:04:36,770 --> 00:04:39,173
But it's also a pretty
important part of Wikidata
75
00:04:39,173 --> 00:04:41,159
to localize into your language.
76
00:04:41,159 --> 00:04:42,384
So that's good.
77
00:04:45,752 --> 00:04:47,842
What I want to highlight
here with Asturian
78
00:04:47,842 --> 00:04:53,698
is that a small community
can really make a huge difference
79
00:04:54,448 --> 00:04:57,085
with some dedication and work,
80
00:04:57,085 --> 00:04:58,420
and that's really cool.
81
00:05:01,846 --> 00:05:03,530
A small quiz for you.
82
00:05:03,530 --> 00:05:05,493
If you take all the properties on Wikidata
83
00:05:05,493 --> 00:05:07,687
that are not external identifiers,
84
00:05:07,687 --> 00:05:10,358
which one has the most labels,
like the most languages?
85
00:05:10,977 --> 00:05:13,847
(audience) [inaudible]
86
00:05:13,847 --> 00:05:16,786
I hear some agreement on "instance of"?
87
00:05:17,506 --> 00:05:19,443
You would be wrong.
88
00:05:19,983 --> 00:05:22,210
It's "image." (chuckles)
89
00:05:23,230 --> 00:05:26,366
So, yeah, that tells you,
if you speak one of the languages
90
00:05:26,366 --> 00:05:28,621
where "instance of"
doesn't yet have a label,
91
00:05:28,621 --> 00:05:30,190
you might want to add it.
92
00:05:32,102 --> 00:05:35,676
So it has 148 labels currently.
93
00:05:37,688 --> 00:05:41,249
But that's just another slide.
94
00:05:42,631 --> 00:05:44,162
This graph tells us something
95
00:05:44,162 --> 00:05:49,321
about how much content we are making
available in a certain language
96
00:05:49,321 --> 00:05:52,042
and how much of that content
is actually used.
97
00:05:52,042 --> 00:05:55,448
So what you're seeing is basically a curve
98
00:05:55,448 --> 00:06:00,987
with most content having English labels,
being available in English,
99
00:06:01,507 --> 00:06:04,295
and being used a lot.
100
00:06:04,295 --> 00:06:06,449
And then it kind of goes down.
101
00:06:06,449 --> 00:06:09,436
But, again, what you can see are outliers
102
00:06:09,436 --> 00:06:15,333
who have a lot more content
than you would necessarily expect,
103
00:06:16,903 --> 00:06:19,539
and that is really, really good.
104
00:06:20,839 --> 00:06:24,945
The problem still is it's not used a lot.
105
00:06:25,565 --> 00:06:28,742
Asturian and Dutch should be higher,
106
00:06:28,742 --> 00:06:31,994
and I think helping those communities
107
00:06:33,266 --> 00:06:35,563
increase the use
of the data they collected
108
00:06:35,563 --> 00:06:37,682
is a really useful thing to do.
109
00:06:42,910 --> 00:06:48,110
What this analysis and others
showed us, which is also a good thing,
110
00:06:48,300 --> 00:06:51,378
is that we are seeing
that highly used items
111
00:06:51,378 --> 00:06:55,295
also tend to have more labels
112
00:06:55,295 --> 00:06:58,188
or the other way around--
it's not entirely clear.
113
00:07:02,513 --> 00:07:04,376
And then the question is,
114
00:07:04,806 --> 00:07:07,009
are we serving
just the powerful languages?
115
00:07:07,899 --> 00:07:11,147
Or are we serving everyone?
116
00:07:12,757 --> 00:07:17,743
And what you see here
is a grouping of languages.
117
00:07:17,743 --> 00:07:21,832
The languages that are grouped together
tend to have labels together.
118
00:07:26,042 --> 00:07:28,599
And you see it clustering.
119
00:07:28,599 --> 00:07:34,065
Now here's a similar clustering, colored,
120
00:07:34,065 --> 00:07:39,475
based on how alive, how used,
121
00:07:40,455 --> 00:07:43,156
how endangered the language is.
122
00:07:43,156 --> 00:07:44,642
And a good thing you're seeing here
123
00:07:44,642 --> 00:07:49,566
is that safe languages
and endangered languages
124
00:07:49,566 --> 00:07:53,773
do not form two different clusters.
125
00:07:53,773 --> 00:07:58,872
But they're all mixed together,
126
00:08:00,262 --> 00:08:04,625
which is much better than it would be
the other way around
127
00:08:04,625 --> 00:08:09,377
where the safe languages,
the powerful languages
128
00:08:10,197 --> 00:08:12,164
are just helping each other out.
129
00:08:12,744 --> 00:08:14,356
No, that's not the case.
130
00:08:14,356 --> 00:08:17,417
And it's a really good thing.
131
00:08:17,417 --> 00:08:20,042
When I saw this,
I thought this was very good.
132
00:08:23,474 --> 00:08:25,169
Here's a similar thing
133
00:08:26,239 --> 00:08:28,800
where we looked at
134
00:08:30,230 --> 00:08:34,222
the language's status
135
00:08:34,222 --> 00:08:36,225
and how many labels it has.
136
00:08:39,367 --> 00:08:42,937
What you're seeing
is a clear win for safe languages,
137
00:08:42,937 --> 00:08:44,248
as is expected.
138
00:08:45,508 --> 00:08:46,693
But what you're also seeing
139
00:08:46,693 --> 00:08:54,407
is that the languages in category 2
and 3 and maybe even 4
140
00:08:54,407 --> 00:08:59,280
are not that bad, actually,
141
00:08:59,280 --> 00:09:02,367
in terms of their representation
in Wikidata and others.
142
00:09:03,287 --> 00:09:06,408
It's a really good thing to find.
143
00:09:07,646 --> 00:09:09,129
Now, if you look at the same thing
144
00:09:09,129 --> 00:09:12,418
for how much of that content
of those labels
145
00:09:12,418 --> 00:09:15,495
is actually used
on Wikipedia, for example,
146
00:09:17,455 --> 00:09:22,563
then we see a similar
picture emerging again.
147
00:09:23,603 --> 00:09:29,813
And it tells us that those communities
are actually making good use of their time
148
00:09:29,813 --> 00:09:34,504
by filling in labels
for more highly used items, for example.
149
00:09:36,410 --> 00:09:40,493
There are outliers
where I think we can help,
150
00:09:41,683 --> 00:09:48,202
to help those communities find the places
where their work would be most valuable.
151
00:09:49,312 --> 00:09:52,663
But, overall, I'm happy with this picture.
152
00:09:54,823 --> 00:09:59,844
Now, that was the items
and properties part of Wikidata.
153
00:10:00,714 --> 00:10:03,033
Now, let's look at interaction
in your languages.
154
00:10:03,033 --> 00:10:05,203
So the lexeme parts of Wikidata
155
00:10:05,203 --> 00:10:09,394
where we describe words
and their forms and their meanings.
156
00:10:10,167 --> 00:10:13,301
We've been doing this now
since May last year,
157
00:10:16,461 --> 00:10:19,127
and content has been growing.
158
00:10:20,114 --> 00:10:22,149
You can see here in blue the lexemes,
159
00:10:22,149 --> 00:10:25,938
and then in red,
the forms on those lexemes
160
00:10:25,938 --> 00:10:29,910
and yellow, the senses
on those lexemes.
161
00:10:30,991 --> 00:10:34,451
So some communities--
we'll get to that later--
162
00:10:34,451 --> 00:10:39,793
have spent a lot of time creating forms
and senses for their lexemes,
163
00:10:39,793 --> 00:10:42,753
which is really useful
164
00:10:42,753 --> 00:10:48,243
because that builds
the core of the data set that you need.
165
00:10:50,562 --> 00:10:55,133
Now, we looked at all the languages
166
00:10:55,133 --> 00:10:57,906
that have lexemes on Wikidata.
167
00:10:57,906 --> 00:11:01,003
So words we have,
168
00:11:01,713 --> 00:11:04,404
those are right now 310 languages.
169
00:11:04,884 --> 00:11:08,290
Now, what do you think is the top language
170
00:11:08,290 --> 00:11:11,949
when it comes to the number
of lexemes currently in Wikidata?
171
00:11:12,933 --> 00:11:14,700
(audience) [inaudible]
172
00:11:19,183 --> 00:11:20,216
Huh?
173
00:11:20,216 --> 00:11:21,741
(person 2) German.
174
00:11:21,741 --> 00:11:24,252
Sorry, I've heard it before.
175
00:11:24,252 --> 00:11:25,651
It's Russian.
176
00:11:28,011 --> 00:11:29,754
Russian is quite ahead.
177
00:11:31,897 --> 00:11:33,832
And just to give you some perspective,
178
00:11:35,652 --> 00:11:36,816
there are different opinions,
179
00:11:36,816 --> 00:11:42,231
but I've read, for example,
that 1,000 to 3,000 words
180
00:11:42,231 --> 00:11:45,450
gets you to conversation level,
roughly, in another language,
181
00:11:45,450 --> 00:11:49,461
and 4,000 to 10,000 words
to an advanced level.
182
00:11:51,591 --> 00:11:55,282
So, we still have a bit to catch up there.
183
00:11:58,483 --> 00:12:03,279
One thing I want you
to pay attention to is Basque here
184
00:12:03,279 --> 00:12:07,744
with 10,000, roughly, lexemes.
185
00:12:09,244 --> 00:12:13,003
Now, if you look at the number
of forms for those lexemes,
186
00:12:14,163 --> 00:12:16,497
Basque is way up there,
187
00:12:18,257 --> 00:12:20,006
which is really cool,
188
00:12:20,006 --> 00:12:24,930
and you should go to a talk that explains
to you why that is the case.
189
00:12:27,341 --> 00:12:31,175
Now, if you look at the number
of senses, so what do words mean,
190
00:12:32,015 --> 00:12:35,081
Basque even gets to the top of the list.
191
00:12:35,081 --> 00:12:37,102
I think that deserves an applause.
192
00:12:37,102 --> 00:12:38,921
(applause)
193
00:12:45,678 --> 00:12:47,118
Another short quiz.
194
00:12:47,118 --> 00:12:50,181
What's the lexeme
with the most translations currently?
195
00:12:50,651 --> 00:12:55,414
(audience) Cats, cats, [inaudible],
Douglas Adams, [inaudible]
196
00:12:56,766 --> 00:13:00,014
All good guesses, but no.
197
00:13:01,012 --> 00:13:04,137
It's this, the Russian word for "water."
198
00:13:09,571 --> 00:13:12,253
Alright, so now we talked a lot
199
00:13:12,253 --> 00:13:16,412
about how many lexemes,
forms, and senses we have,
200
00:13:16,412 --> 00:13:20,493
but that's just one thing you need.
201
00:13:20,493 --> 00:13:21,515
The other thing you need
202
00:13:21,515 --> 00:13:25,161
is actually describing those lexemes,
forms, and senses
203
00:13:25,161 --> 00:13:27,647
in a machine-readable way.
204
00:13:27,647 --> 00:13:30,039
And for that you have statements,
like on items.
205
00:13:31,479 --> 00:13:36,362
And one of the properties
you use is usage example.
206
00:13:36,362 --> 00:13:38,582
So whoever is using that data
207
00:13:38,582 --> 00:13:42,089
can understand how to use
that word in context,
208
00:13:42,089 --> 00:13:44,158
so that could be a quote, for example.
209
00:13:45,396 --> 00:13:47,113
And here, Polish rocks.
210
00:13:47,900 --> 00:13:49,764
Good job, Polish speakers.
211
00:13:54,219 --> 00:13:57,680
Another property
that's really useful is IPA,
212
00:13:57,680 --> 00:14:00,186
so how do you pronounce this word.
213
00:14:00,876 --> 00:14:07,497
Russian apparently needs
lots of IPA statements.
214
00:14:10,419 --> 00:14:13,314
But, again, Polish, second.
215
00:14:17,148 --> 00:14:20,753
And last but not least
we have pronunciation audio.
216
00:14:20,753 --> 00:14:23,372
So that is links to files on Commons
217
00:14:23,372 --> 00:14:25,959
where someone speaks the word,
218
00:14:25,959 --> 00:14:29,913
so you can hear a native speaker
pronounce the word
219
00:14:29,913 --> 00:14:32,871
in case you can't read IPA, for example.
220
00:14:34,959 --> 00:14:39,205
And there's actually a really nice
Wikibase-powered project
221
00:14:39,205 --> 00:14:40,474
called Lingua Libre
222
00:14:40,884 --> 00:14:45,173
where you can go and help record
words in your language
223
00:14:45,173 --> 00:14:47,836
that then can be added
to lexemes on Wikidata,
224
00:14:48,446 --> 00:14:52,103
so other people can understand
how to pronounce your words.
225
00:14:53,663 --> 00:14:55,694
(person 2) [inaudible]
226
00:14:55,694 --> 00:14:57,665
If you search for "Lingua Libre,"
227
00:14:57,665 --> 00:15:00,981
and I'm sure someone can post it
in the Telegram channel.
228
00:15:03,138 --> 00:15:04,621
Those guys rock.
229
00:15:04,621 --> 00:15:06,726
They did really cool stuff with Wikibase.
230
00:15:09,416 --> 00:15:10,617
Alright.
231
00:15:12,706 --> 00:15:17,285
Then the question is,
where do we go from here?
232
00:15:19,165 --> 00:15:22,010
Based on the numbers I've just shown you,
233
00:15:23,030 --> 00:15:25,172
we've come a long way
234
00:15:25,172 --> 00:15:28,430
towards giving more people
more access to more knowledge
235
00:15:28,430 --> 00:15:31,240
when looking at languages on Wikidata.
236
00:15:32,530 --> 00:15:36,392
But there is also still
a lot of work ahead of us.
237
00:15:38,992 --> 00:15:42,341
Some of the things
you can do to help, for example,
238
00:15:42,341 --> 00:15:44,921
is run label-a-thons
239
00:15:44,921 --> 00:15:50,124
like get people together
to label items in Wikidata
240
00:15:50,914 --> 00:15:55,121
or do an edit-a-thon
around lexemes in your language
241
00:15:55,121 --> 00:15:59,212
to get the most used words
in your language into Wikidata.
242
00:16:00,773 --> 00:16:03,285
Or you can use a tool like Terminator
243
00:16:03,285 --> 00:16:08,493
that helps you find the most
important items in your language
244
00:16:08,493 --> 00:16:11,549
that are still missing a label.
245
00:16:13,274 --> 00:16:18,359
Most important being measured
by how often it is used
246
00:16:18,359 --> 00:16:22,553
in other Wikidata items
as links in statements.
247
00:16:25,768 --> 00:16:30,022
And, of course, for the lexeme part,
248
00:16:31,342 --> 00:16:35,169
now that we've got
a basic coverage of those lexemes,
249
00:16:35,169 --> 00:16:41,163
it's also about building them out,
adding more statements to them
250
00:16:41,163 --> 00:16:44,401
so that they actually can build the base
251
00:16:44,401 --> 00:16:47,421
for meaningful applications
to build on top of that.
252
00:16:48,141 --> 00:16:50,795
Because we're getting closer
to that critical mass,
253
00:16:50,795 --> 00:16:53,616
but we're still away from that,
254
00:16:53,616 --> 00:16:56,624
that you can build
serious applications on top of it.
255
00:16:58,277 --> 00:17:01,680
And I hope all of you
will join us in doing that.
256
00:17:02,583 --> 00:17:07,103
And that already brings me
257
00:17:07,103 --> 00:17:09,843
to a little help from our friends,
258
00:17:09,843 --> 00:17:12,812
and Bruno, do you want to come over
259
00:17:13,882 --> 00:17:16,854
and talk to us about lexical masks?
260
00:17:17,541 --> 00:17:18,567
(Bruno) Thank you, Lydia,
261
00:17:18,567 --> 00:17:21,519
thank you for giving me
this short period of time
262
00:17:21,519 --> 00:17:24,150
to present this work
that we are doing at Google
263
00:17:24,150 --> 00:17:29,635
with Denny, whom most of you
probably have heard of or know.
264
00:17:30,126 --> 00:17:32,030
At Google, I'm a linguist,
265
00:17:32,030 --> 00:17:36,150
so I'm very happy to be here
amongst other language enthusiasts.
266
00:17:36,620 --> 00:17:39,278
We are also building some lexicons,
267
00:17:39,278 --> 00:17:41,766
and we have built this technology
268
00:17:41,766 --> 00:17:45,589
or this approach that we think
can be useful for you.
269
00:17:46,369 --> 00:17:48,455
Just to give you
a little bit of background,
270
00:17:48,455 --> 00:17:52,068
this is my lexicographic
background talking here.
271
00:17:52,788 --> 00:17:54,347
When we build a lexicon database,
272
00:17:54,347 --> 00:17:58,623
it takes a lot of effort to maintain it,
to keep it consistent
273
00:17:58,623 --> 00:18:00,125
and to exchange data,
274
00:18:00,125 --> 00:18:02,027
as you probably know.
275
00:18:02,517 --> 00:18:05,927
There are several attempts
to unify the features and the properties
276
00:18:05,927 --> 00:18:09,184
that are describing
those lexemes and those forms,
277
00:18:09,184 --> 00:18:10,936
and it's not a solved problem,
278
00:18:10,936 --> 00:18:13,958
but there are some
unification attempts on that side.
279
00:18:13,958 --> 00:18:15,209
But what is really missing--
280
00:18:15,209 --> 00:18:18,732
and this is a problem we had
at the beginning of our project at Google--
281
00:18:18,732 --> 00:18:21,607
is an internal structure
282
00:18:22,197 --> 00:18:25,910
that describes what
a lexical entry should look like,
283
00:18:25,910 --> 00:18:28,581
what kind of data
or what kind of information we have
284
00:18:28,581 --> 00:18:32,237
and the specifications that are expected.
285
00:18:32,237 --> 00:18:38,187
So, this is what we came up with:
this thing called a lexical mask.
286
00:18:38,897 --> 00:18:44,841
A lexical mask describes
what is expected for an entry,
287
00:18:44,841 --> 00:18:47,329
a lexicographic entry, to be complete,
288
00:18:47,329 --> 00:18:51,436
both in terms of the number of forms
you expect for a lexeme,
289
00:18:51,436 --> 00:18:55,607
and the number of features
you expect for each of those forms.
290
00:18:56,397 --> 00:18:58,329
Here is an example for Italian adjectives.
291
00:18:58,329 --> 00:19:02,002
You expect, in Italian, to have
four forms for your adjectives,
292
00:19:02,002 --> 00:19:05,383
and each of these forms
has a specific combination
293
00:19:05,383 --> 00:19:07,946
of gender and number features.
294
00:19:08,606 --> 00:19:12,672
This is what we expect
for the Italian adjectives.
295
00:19:12,672 --> 00:19:16,176
Of course, you can have
extremely complex masks,
296
00:19:16,176 --> 00:19:20,783
like the French verb conjugation,
which is quite extensive,
297
00:19:20,783 --> 00:19:23,487
and I don't show you
any other Russian mask
298
00:19:23,487 --> 00:19:25,378
because it doesn't fit the screen.
299
00:19:26,308 --> 00:19:29,531
And we also have
some detailed specifications
300
00:19:29,531 --> 00:19:33,421
because we distinguish
what is at the form level.
301
00:19:33,421 --> 00:19:37,544
So here you have Russian nouns
that have three numbers
302
00:19:37,544 --> 00:19:40,048
and a number of cases
with different forms,
303
00:19:40,048 --> 00:19:43,086
but they also have
an entry level specification
304
00:19:43,086 --> 00:19:45,590
that says a noun particularly has
305
00:19:45,590 --> 00:19:50,133
an inherent gender
and an inherent animacy feature
306
00:19:50,133 --> 00:19:52,488
that is also specified in the mask.
307
00:19:54,518 --> 00:19:58,779
We also want to distinguish
that a mask gives a specification
308
00:19:58,779 --> 00:20:01,874
for, in general,
what an entry should look like.
309
00:20:01,874 --> 00:20:07,158
But you can have smaller masks
for defective aspects of the form
310
00:20:07,158 --> 00:20:11,282
or defective aspects of the lexeme
that happen in language.
311
00:20:11,282 --> 00:20:14,537
So here is the simplest version
of French verbs
312
00:20:14,537 --> 00:20:19,729
that have only the 3rd person singular
for all the weather verbs,
313
00:20:19,729 --> 00:20:23,969
like "it rains" or "it snows,"
like in English.
314
00:20:24,537 --> 00:20:26,493
So we distinguish these two levels.
315
00:20:26,923 --> 00:20:29,962
And how we use this at Google
316
00:20:29,962 --> 00:20:32,643
is that when we have a lexicon
that we want to use,
317
00:20:33,063 --> 00:20:38,309
we use the mask to really
literally throw the lexicons,
318
00:20:38,309 --> 00:20:40,163
all the entries, through the mask
319
00:20:40,163 --> 00:20:44,303
and see which entry has a problem
in terms of structure.
320
00:20:44,303 --> 00:20:46,523
Are we missing a form?
Are we missing a feature?
321
00:20:46,523 --> 00:20:51,497
And when there is a problem,
we do some human validation
322
00:20:51,497 --> 00:20:53,751
or just to see if it passes the mask.
323
00:20:53,751 --> 00:20:57,924
So it's an extremely powerful tool
to check the quality of the structure.
324
00:20:59,427 --> 00:21:01,964
So what we are happy to announce today
325
00:21:01,964 --> 00:21:05,408
is that we got the green light
to open source our mask.
326
00:21:05,948 --> 00:21:07,573
So this is a schema,
327
00:21:07,573 --> 00:21:09,477
if you want, that we can release
328
00:21:09,477 --> 00:21:13,483
and that we will provide
to Wikidata as ShEx files.
329
00:21:13,483 --> 00:21:16,688
This is a ShEx file for German nouns,
330
00:21:16,688 --> 00:21:20,428
and Denny is working on the conversion
from our internal specification
331
00:21:20,428 --> 00:21:23,666
to a more open-source specification.
332
00:21:23,666 --> 00:21:27,522
We currently cover more than 25 languages.
333
00:21:27,522 --> 00:21:29,225
So we expect to grow on our side,
334
00:21:29,225 --> 00:21:34,350
but we also look for this opportunity
to collaborate for other languages.
335
00:21:34,350 --> 00:21:40,728
And one of the ongoing collaborations
is the one that Denny has with Lukas.
336
00:21:40,728 --> 00:21:45,052
Lukas has these great tools to have a UI
337
00:21:45,052 --> 00:21:51,061
to help the user or the contributor
to add more forms.
338
00:21:51,061 --> 00:21:54,151
So if you want to add
an adjective in French,
339
00:21:54,151 --> 00:21:59,057
the UI is telling you
how many forms are expected
340
00:21:59,057 --> 00:22:01,562
and what kind of features
this form should have.
341
00:22:01,562 --> 00:22:06,268
So our mask will help the tool
to be defined and expanded.
342
00:22:07,238 --> 00:22:08,385
That's it.
343
00:22:08,791 --> 00:22:10,358
(Lydia) Thank you so much.
344
00:22:10,358 --> 00:22:11,993
(applause)
345
00:22:14,249 --> 00:22:16,891
Alright. Are there questions?
346
00:22:16,891 --> 00:22:19,381
Do you want to talk more about lexemes?
347
00:22:19,817 --> 00:22:21,475
- (person 3) Yes.
- Yes. (chuckles)
348
00:22:33,485 --> 00:22:35,380
(person 3) My question,
because you were talking
349
00:22:35,380 --> 00:22:39,106
about giving more access
to more people in more languages.
350
00:22:39,106 --> 00:22:42,444
But there are a lot of languages
that can't be used in Wikidata.
351
00:22:42,444 --> 00:22:44,588
So what solution do you have for that?
352
00:22:45,889 --> 00:22:47,686
When you say they can't be used in Wikidata,
353
00:22:47,686 --> 00:22:50,308
are you talking about entering labels?
354
00:22:50,308 --> 00:22:52,578
- (person 3) Labels, descriptions.
- Right.
355
00:22:52,578 --> 00:22:55,498
So, for lexemes, it's a bit different
356
00:22:55,498 --> 00:22:57,793
because there we don't have
that restriction.
357
00:22:58,923 --> 00:23:05,003
For labels on items and properties,
there is some restriction
358
00:23:05,433 --> 00:23:12,411
because we wanted to make sure
that it's not completely
359
00:23:12,411 --> 00:23:14,229
anyone does anything,
360
00:23:14,229 --> 00:23:17,769
and it becomes unmanageable.
361
00:23:19,349 --> 00:23:23,328
Even a small community who wants
one language and wants to work on that,
362
00:23:23,898 --> 00:23:26,787
come talk to us, we will make it happen.
363
00:23:26,787 --> 00:23:29,202
(person 3) I mean, we did this
at the Prague Hackathon in May,
364
00:23:29,202 --> 00:23:32,459
and it took us until almost August
in order to be able to use our language.
365
00:23:32,459 --> 00:23:35,135
- Yeah.
- (person 3) So, it's very slow.
366
00:23:35,135 --> 00:23:37,854
Yeah, it is, unfortunately, very slow.
367
00:23:37,854 --> 00:23:39,883
We're currently working
with the Language Committee
368
00:23:39,883 --> 00:23:46,048
on solving some fundamental...
369
00:23:49,537 --> 00:23:55,447
Like, getting agreement on what kind
of languages are actually "allowed,"
370
00:23:56,047 --> 00:23:59,398
and that has taken too long,
371
00:23:59,988 --> 00:24:04,178
which is the reason why your request
probably took longer than it should have.
372
00:24:04,778 --> 00:24:05,963
(person 3) Thanks.
373
00:24:06,815 --> 00:24:07,950
(person 4) Thank you.
374
00:24:07,950 --> 00:24:10,938
Lydia, if you remember
the statistics that you showed,
375
00:24:10,938 --> 00:24:12,886
the number of lexemes per language.
376
00:24:12,886 --> 00:24:17,599
So, did you count
all the forms as a data point
377
00:24:17,599 --> 00:24:20,034
or only lexemes?
378
00:24:21,289 --> 00:24:22,941
(Lydia) Do you mean this?
379
00:24:22,941 --> 00:24:24,053
Which one do you mean?
380
00:24:24,053 --> 00:24:25,529
(person 4) Yes, exactly.
381
00:24:25,797 --> 00:24:28,341
If you remember,
does this number [inaudible]
382
00:24:28,341 --> 00:24:31,954
all the forms for all the lexemes
or just how many lexemes there are?
383
00:24:31,954 --> 00:24:33,585
No, this is just a number of lexemes.
384
00:24:33,585 --> 00:24:35,395
(person 4) Just a number of lexemes, okay.
385
00:24:35,395 --> 00:24:36,797
So then it is a just statistic
386
00:24:36,797 --> 00:24:39,390
because if it also
counted the forms--
387
00:24:39,390 --> 00:24:40,614
that's why I'm asking--
388
00:24:40,614 --> 00:24:42,817
then all the languages
with inflectional morphology,
389
00:24:42,817 --> 00:24:45,027
like Russian, Serbian,
Slovenian, et cetera,
390
00:24:45,027 --> 00:24:47,616
they have a natural advantage
because they have so many.
391
00:24:47,616 --> 00:24:51,990
So, this kind of kicks in here
on this number of forms.
392
00:24:51,990 --> 00:24:53,851
(person 4) Yeah, that was this one.
Thank you.
393
00:24:56,546 --> 00:25:00,224
(person 5) So, I had
a quick question about the...
394
00:25:00,644 --> 00:25:06,824
When we're talking about
the actual items and properties.
395
00:25:07,124 --> 00:25:08,901
Like as far as I understand,
396
00:25:08,901 --> 00:25:11,955
there is currently no way
to give an actual source
397
00:25:11,955 --> 00:25:14,726
to any of the labels
and descriptions that are given.
398
00:25:14,726 --> 00:25:18,047
So, for example,
because when you're talking
399
00:25:18,047 --> 00:25:20,920
about an item property,
400
00:25:20,920 --> 00:25:24,509
like, for example,
you can get conflicting labels.
401
00:25:24,509 --> 00:25:25,739
Yes.
402
00:25:25,739 --> 00:25:27,662
(person 5) So this person is like...
403
00:25:28,402 --> 00:25:30,781
We were talking about
indigenous things before, for example.
404
00:25:30,781 --> 00:25:35,965
So this person is a Norwegian artist
according to this source,
405
00:25:35,965 --> 00:25:38,750
and a Sami artist,
according to this source.
406
00:25:39,550 --> 00:25:42,883
Or, for example, in Estonian,
we had an issue
407
00:25:42,883 --> 00:25:47,729
where we had to change terminology
to the officially used terminology
408
00:25:47,729 --> 00:25:49,482
in official lexicons,
409
00:25:49,482 --> 00:25:52,262
but we have no way to indicate really why,
410
00:25:52,262 --> 00:25:53,596
like what was the source of this
411
00:25:53,596 --> 00:25:55,561
and why this was better
and what was there before.
412
00:25:55,561 --> 00:25:57,150
It was just me as a random person
413
00:25:57,150 --> 00:25:59,615
just switching the thing
to anyone who sees it.
414
00:25:59,615 --> 00:26:02,520
So is there a plan
to make this possible in any way
415
00:26:02,520 --> 00:26:06,355
so that we can actually have
proper sources for the language data?
416
00:26:07,045 --> 00:26:11,568
So, it is partially possible.
417
00:26:11,568 --> 00:26:15,958
So, for example, when you have
an item for a person,
418
00:26:16,968 --> 00:26:22,720
you have a statement, first name,
last name, and so on, of that person,
419
00:26:22,720 --> 00:26:26,226
and then you can provide
the reference for that there.
420
00:26:28,211 --> 00:26:32,544
I'm quite hesitant to add more complexity
421
00:26:32,544 --> 00:26:35,557
for references on labels and descriptions,
422
00:26:35,557 --> 00:26:38,624
but if people really, really think
423
00:26:38,624 --> 00:26:44,939
this is something that isn't covered
by any reference on the statement,
424
00:26:44,939 --> 00:26:46,803
then let's talk about it.
425
00:26:49,079 --> 00:26:53,303
But I fear it will add a lot of complexity
426
00:26:53,303 --> 00:26:56,523
for what I hope are few cases,
427
00:26:57,393 --> 00:27:00,188
but I'm willing to be convinced otherwise
428
00:27:00,188 --> 00:27:04,087
if people really feel
very strongly about this.
429
00:27:04,087 --> 00:27:08,177
(person 5) I mean, if it's added
it probably shouldn't be the default,
430
00:27:08,177 --> 00:27:12,452
shown to all the users as a beginner
interface, in any case.
431
00:27:12,452 --> 00:27:16,190
More like, "Click here if you need to say
a specific thing about this."
432
00:27:17,632 --> 00:27:23,368
Do we have a sense of how many times
that would actually matter?
433
00:27:24,520 --> 00:27:26,423
(person 5) In Estonian, for example--
434
00:27:26,423 --> 00:27:28,844
I expect this is true
of other languages as well--
435
00:27:29,274 --> 00:27:34,203
for example, there is an official name
that is the actual legitimate translation,
436
00:27:34,203 --> 00:27:36,206
for example, into English,
437
00:27:36,206 --> 00:27:40,314
of, say, a specific kind of municipality.
438
00:27:40,614 --> 00:27:42,182
That was my use case, for example,
439
00:27:42,182 --> 00:27:44,409
where we were using the word "parish"
440
00:27:45,159 --> 00:27:50,885
where the original Estonian word
meant kind of like church parish,
441
00:27:50,885 --> 00:27:51,899
and that was the origin,
442
00:27:51,899 --> 00:27:54,809
but that's not the official translation
Estonia gets right now.
443
00:27:55,189 --> 00:27:58,993
In this case, I would just add it
as official name statements
444
00:27:58,993 --> 00:28:00,817
and add the reference there.
445
00:28:02,032 --> 00:28:03,158
(person 5) Okay.
446
00:28:05,186 --> 00:28:06,572
More questions, yes?
447
00:28:07,682 --> 00:28:10,044
(person 6) I have two quick comments.
448
00:28:10,044 --> 00:28:13,934
You specifically called out Asturian
as a language that does well,
449
00:28:13,934 --> 00:28:16,455
and I think that's a false artifact.
450
00:28:16,455 --> 00:28:17,724
Tell me about it.
451
00:28:17,724 --> 00:28:19,748
(person 6) I think it's just a bot
452
00:28:19,748 --> 00:28:24,068
that pasted person names,
like proper names,
453
00:28:24,068 --> 00:28:27,172
and said, "Well, this is exactly
like in French or Spanish,"
454
00:28:27,172 --> 00:28:28,558
and just massively copied it.
455
00:28:28,558 --> 00:28:33,316
One point of evidence is that
you don't see that energy in Asturian
456
00:28:33,316 --> 00:28:37,205
in things that actually
require translation, like property names,
457
00:28:37,205 --> 00:28:39,648
or names of items
that are not proper names.
458
00:28:39,648 --> 00:28:41,219
Asaf, you break my heart.
459
00:28:41,219 --> 00:28:43,198
(person 6) I know,
I like raining on parades,
460
00:28:43,198 --> 00:28:48,458
but I have good news as well,
which is about the pronunciation numbers.
461
00:28:49,408 --> 00:28:53,515
As you probably know,
Commons is full of pronunciation files,
462
00:28:53,515 --> 00:28:54,668
and, for example,
463
00:28:54,668 --> 00:29:01,102
Dutch has no less than 300,000
pronunciation files already on Commons
464
00:29:01,912 --> 00:29:05,051
that just need to somehow be ingested.
465
00:29:05,051 --> 00:29:07,697
So if anyone's looking for a side project,
466
00:29:07,697 --> 00:29:08,997
there's tons and tons
467
00:29:08,997 --> 00:29:13,280
of classified, categorized
pronunciation files on Commons
468
00:29:13,280 --> 00:29:16,893
under the category
"Pronunciation by language."
469
00:29:16,893 --> 00:29:22,840
So that's just waiting to be matched
to lexemes and put on Lexeme.
470
00:29:23,180 --> 00:29:25,484
And I was wondering
if you could say something
471
00:29:25,484 --> 00:29:26,585
about the road map,
472
00:29:26,585 --> 00:29:28,757
something about how much investment
473
00:29:28,757 --> 00:29:31,995
or what can we expect
from Lexeme in the coming year,
474
00:29:31,995 --> 00:29:34,020
because I, for one, can't wait.
475
00:29:34,949 --> 00:29:37,044
You can't wait? (chuckles)
476
00:29:37,044 --> 00:29:39,118
- (person 6) For more.
- Yes. (chuckles)
477
00:29:44,541 --> 00:29:49,523
Right now, we're concentrating
more on Wikibase and data quality
478
00:29:51,493 --> 00:29:55,087
to see how much traction this gets
479
00:29:55,087 --> 00:30:01,676
and then getting more feedback on
where the pain points are next,
480
00:30:01,676 --> 00:30:06,003
and then going back to improving
lexicographical data further.
481
00:30:06,903 --> 00:30:09,790
And one of the things
I'd love to hear from you
482
00:30:09,790 --> 00:30:14,136
is where exactly do you see
the next steps,
483
00:30:14,136 --> 00:30:15,966
where do you want to see improvements
484
00:30:15,966 --> 00:30:20,340
so that we can then figure out
how to make that happen.
485
00:30:21,125 --> 00:30:22,810
But, of course, you're right,
486
00:30:22,810 --> 00:30:25,712
there's still so much to do
also on the technical side.
487
00:30:30,573 --> 00:30:35,848
(person 7) Okay, as we were uploading
the Basque words with forms,
488
00:30:35,848 --> 00:30:37,768
and you'll see some
of these kinds of things,
489
00:30:37,768 --> 00:30:41,329
we were both like, last week we said,
"Oh, we are the first one in something."
490
00:30:42,919 --> 00:30:44,928
It appears in press, and it's like,
491
00:30:44,928 --> 00:30:49,488
"Oh, Basque are the first time in some--
they are the first in something, okay."
492
00:30:49,488 --> 00:30:50,606
(laughs)
493
00:30:50,606 --> 00:30:53,318
And then people ask,
"Okay, but what is this for?"
494
00:30:54,678 --> 00:30:56,849
We don't have a real good answer.
495
00:30:56,849 --> 00:30:57,888
I mean it's like, okay,
496
00:30:57,888 --> 00:31:01,841
this will help computers
to understand more our language, yes,
497
00:31:01,841 --> 00:31:05,279
but what kind of tools
can we make in the future?
498
00:31:05,279 --> 00:31:07,467
And we don't have a good answer for this.
499
00:31:07,467 --> 00:31:10,625
So I don't know
if you have a good answer for this.
500
00:31:10,625 --> 00:31:12,742
(chuckles) I don't know
if I have a good answer,
501
00:31:12,742 --> 00:31:14,746
but I have an answer.
502
00:31:15,480 --> 00:31:20,425
So I think right now
as I was telling [inaudible],
503
00:31:20,425 --> 00:31:21,924
we haven't reached that critical mass
504
00:31:21,924 --> 00:31:25,529
where you can build a lot
of the really interesting tools.
505
00:31:25,529 --> 00:31:27,707
But there are already some tools.
506
00:31:28,267 --> 00:31:31,912
Just the other day,
Esther [Pandelia], for example,
507
00:31:31,912 --> 00:31:33,817
released a tool where you can see,
508
00:31:35,837 --> 00:31:38,889
I think it was the words on a globe
509
00:31:38,889 --> 00:31:41,901
where they're spoken,
where they're coming from.
510
00:31:42,631 --> 00:31:44,090
I'm probably wrong about this,
511
00:31:44,090 --> 00:31:46,346
but she had answered
on the Project chat on Wikidata--
512
00:31:46,346 --> 00:31:48,984
you can look it up there.
513
00:31:49,574 --> 00:31:51,805
So we have seen these first tools,
514
00:31:51,805 --> 00:31:55,696
just like we've seen
back when Wikidata started.
515
00:31:56,846 --> 00:31:59,602
First some--like just a network,
516
00:31:59,602 --> 00:32:03,424
and like, "Hey, look, there's this thing
that connects to this other thing."
517
00:32:04,824 --> 00:32:07,059
And as we have more data,
518
00:32:07,059 --> 00:32:10,352
and as we've reached some critical mass,
519
00:32:11,852 --> 00:32:14,747
more powerful applications
become possible,
520
00:32:15,677 --> 00:32:17,516
things like Histropedia,
521
00:32:19,126 --> 00:32:21,988
things like question and answering
522
00:32:21,988 --> 00:32:26,663
in your digital personal assistant,
Platypus, and so on.
523
00:32:26,663 --> 00:32:29,668
And we're seeing
a similar thing with lexemes.
524
00:32:31,198 --> 00:32:34,650
We're at the stage
where you can build like these little,
525
00:32:34,650 --> 00:32:37,464
hey, look, there's a connection
between the two things,
526
00:32:37,864 --> 00:32:42,738
and there's a translation
of this word into that language stage,
527
00:32:42,738 --> 00:32:47,747
and as we build it out
and as we describe more words,
528
00:32:47,747 --> 00:32:49,533
more becomes possible.
529
00:32:49,533 --> 00:32:51,795
Now, what becomes possible?
530
00:32:53,482 --> 00:32:59,483
As Ben, our keynote speaker earlier
was talking about translations,
531
00:33:00,103 --> 00:33:03,455
being able to translate
from one language to another.
532
00:33:03,455 --> 00:33:07,929
And Jens, my colleague,
he's always talking about
533
00:33:07,929 --> 00:33:11,452
the European Union
looking for a translator
534
00:33:11,452 --> 00:33:17,439
who can translate from
I think it was Maltese to Swedish--
535
00:33:17,439 --> 00:33:19,436
- (person 8) Estonian.
- Estonian.
536
00:33:22,016 --> 00:33:26,211
And that is not a usual combination.
537
00:33:27,211 --> 00:33:31,735
But once you have all these languages
in one machine-readable place,
538
00:33:31,735 --> 00:33:33,143
you can do that,
539
00:33:33,143 --> 00:33:36,857
you can get a dictionary
540
00:33:36,857 --> 00:33:41,735
from Estonian to Maltese and back.
541
00:33:42,935 --> 00:33:45,607
So covering language
combinations in dictionaries
542
00:33:45,607 --> 00:33:47,911
that just haven't been covered before
543
00:33:47,911 --> 00:33:51,050
because there wasn't
enough demand for it, for example,
544
00:33:51,050 --> 00:33:55,540
to make it financially viable
and to justify the work.
545
00:33:55,540 --> 00:33:57,147
Now we can do that.
546
00:33:59,797 --> 00:34:02,318
Then text generation.
547
00:34:02,318 --> 00:34:03,653
Lucie was earlier talking
548
00:34:03,653 --> 00:34:10,136
about how she's working
with Hattie on generating text
549
00:34:10,136 --> 00:34:14,673
to get Wikipedia articles
in minority languages started,
550
00:34:15,423 --> 00:34:19,512
and that needs data about words,
551
00:34:19,512 --> 00:34:22,589
and you need to understand
the language to do that.
552
00:34:23,769 --> 00:34:28,133
Yeah, and those are just some
that come to my mind right now.
553
00:34:28,693 --> 00:34:30,494
Maybe our audience has more ideas
554
00:34:30,494 --> 00:34:34,353
what they want to do
when we have all the glorious data.
555
00:34:37,693 --> 00:34:40,892
(person 9) Okay, I will deviate
from the lexemes topic.
556
00:34:40,892 --> 00:34:42,666
I will ask the question,
557
00:34:42,666 --> 00:34:45,634
how can I as a member of community
558
00:34:45,634 --> 00:34:50,135
influence that priority is put on task,
559
00:34:50,135 --> 00:34:56,644
that a new user comes, and he can indicate
what languages he wants to see and edit
560
00:34:56,644 --> 00:35:01,135
without some secret Babel
template knowledge.
561
00:35:02,145 --> 00:35:05,053
Maybe there will be this year
this technical wish list
562
00:35:05,053 --> 00:35:07,040
without Wikipedia topics.
563
00:35:07,040 --> 00:35:10,119
Maybe there's a hope
we can all vote about
564
00:35:10,119 --> 00:35:14,218
this thing we didn't fix for seven years.
565
00:35:14,218 --> 00:35:17,607
So do you have any ideas
and comments about this?
566
00:35:18,217 --> 00:35:20,328
So you're talking about the fact
567
00:35:20,328 --> 00:35:23,518
that someone who is
not logged into Wikidata
568
00:35:23,518 --> 00:35:25,971
can't change their language easily?
569
00:35:25,971 --> 00:35:27,839
(person 9) No, for [inaudible] users.
570
00:35:28,309 --> 00:35:30,689
So, if they are logged in,
571
00:35:30,689 --> 00:35:34,871
they can just change their language
at the top of the page,
572
00:35:35,891 --> 00:35:38,099
and then it will appear
573
00:35:39,769 --> 00:35:42,013
where the labels, descriptions
[inaudible] are,
574
00:35:42,013 --> 00:35:43,483
and they can edit it.
575
00:35:45,657 --> 00:35:49,009
(person 9) Well, actually, usually
many times the workflow
576
00:35:49,009 --> 00:35:52,447
is that if you want to have
multiple languages, they are available,
577
00:35:52,447 --> 00:35:55,419
and it's not always the case.
578
00:35:55,419 --> 00:35:58,584
Okay, maybe we should sit down
after this talk and you show me.
579
00:36:01,562 --> 00:36:04,089
Cool. More questions?
580
00:36:05,534 --> 00:36:06,536
Yes.
581
00:36:11,595 --> 00:36:13,196
(person 10) Thanks for the presentation.
582
00:36:14,106 --> 00:36:15,127
Can you comment
583
00:36:15,127 --> 00:36:19,307
on the state of the collaboration
with the Wiktionary community?
584
00:36:19,307 --> 00:36:22,296
As far as I've seen,
there were some discussions
585
00:36:22,296 --> 00:36:26,051
about importing some elements of the work,
586
00:36:26,051 --> 00:36:30,843
but there seems to be licensing issues
and some disagreements, et cetera.
587
00:36:30,843 --> 00:36:31,848
Right.
588
00:36:31,848 --> 00:36:36,330
So, Wiktionary communities
have spent a lot of time
589
00:36:37,320 --> 00:36:39,473
building Wiktionary.
590
00:36:39,473 --> 00:36:42,643
They have built
591
00:36:43,193 --> 00:36:47,554
amazingly complicated
and complex templates
592
00:36:47,554 --> 00:36:53,614
to build pretty tables
that automatically generate forms for you
593
00:36:53,614 --> 00:36:56,392
and all kinds of really impressive,
594
00:36:56,392 --> 00:37:00,683
and kind of crazy stuff,
if you think about it.
595
00:37:02,311 --> 00:37:07,994
And, of course, they have invested
a lot of time and effort into that.
596
00:37:09,364 --> 00:37:11,801
And understandably,
597
00:37:11,801 --> 00:37:17,116
they don't just want that to be grabbed,
598
00:37:18,046 --> 00:37:19,102
just like that.
599
00:37:19,102 --> 00:37:21,791
So there's some of that coming from there.
600
00:37:22,761 --> 00:37:25,137
And that's fine, that's okay.
601
00:37:25,737 --> 00:37:32,092
Now, the first Wiktionary communities
are talking about turning out
602
00:37:32,092 --> 00:37:34,329
and importing some
of their data into Wikidata.
603
00:37:34,329 --> 00:37:39,095
Russian, you have seen,
for example, is one of those cases
604
00:37:40,375 --> 00:37:42,355
And I expect more of that to happen.
605
00:37:43,635 --> 00:37:46,800
But it will be a slow process,
606
00:37:46,800 --> 00:37:49,383
just like adoption
of Wikidata's data on Wikipedia
607
00:37:49,383 --> 00:37:51,909
has been a rather slow process.
608
00:37:52,849 --> 00:37:56,183
On the other side
of making it actually easier
609
00:37:56,183 --> 00:37:59,132
to use the data that is in lexemes,
610
00:37:59,132 --> 00:38:02,209
on Wiktionary, so that
they can make use of that
611
00:38:02,209 --> 00:38:05,531
and share data between
the language Wiktionaries
612
00:38:05,531 --> 00:38:08,853
which is super hard
to impossible right now,
613
00:38:08,853 --> 00:38:11,560
which is crazy,
just like it was on Wikipedia.
614
00:38:13,860 --> 00:38:16,325
Wait for the birthday present. (chuckles)
615
00:38:20,038 --> 00:38:21,182
Yes.
616
00:38:22,599 --> 00:38:24,827
(person 11) When I was thinking
the other way around it,
617
00:38:24,827 --> 00:38:28,168
I actually didn't want to say it
because I think this will be super silly,
618
00:38:28,168 --> 00:38:32,003
but I think that Wiktionary
already has some content,
619
00:38:32,003 --> 00:38:34,978
and I know that
we can't transfer it to Wikidata
620
00:38:34,978 --> 00:38:37,048
because there's a difference in licenses.
621
00:38:37,048 --> 00:38:39,631
But I was thinking maybe
we can do something about that.
622
00:38:40,321 --> 00:38:45,913
Maybe, I don't know, we can obtain
the communities' permission
623
00:38:45,913 --> 00:38:51,205
after like, I don't know,
having like a public voting
624
00:38:52,075 --> 00:38:55,642
and for the community,
the active members of the community
625
00:38:55,642 --> 00:39:02,523
to vote and say if they would like
or accept or to transfer the content
626
00:39:02,523 --> 00:39:05,528
for which they may do
the Wikidata lexemes.
627
00:39:06,238 --> 00:39:08,537
Because I just think it is such a waste.
628
00:39:09,568 --> 00:39:14,443
So, that's definitely
a conversation those people
629
00:39:14,443 --> 00:39:18,249
who are in Wiktionary communities
are very welcome to bring up there.
630
00:39:18,249 --> 00:39:24,647
I think it would be a bit presumptuous
for us to go and force that.
631
00:39:25,917 --> 00:39:31,142
But, yeah, I think it's definitely worth
having a conversation.
632
00:39:31,142 --> 00:39:33,898
But I think it's also important
to understand
633
00:39:33,898 --> 00:39:39,082
that there's a distinction between
what is actually legally allowed
634
00:39:39,082 --> 00:39:43,147
and what we should be doing
635
00:39:43,147 --> 00:39:45,426
and what those people want or do not want.
636
00:39:45,736 --> 00:39:47,329
So even if it's legally allowed,
637
00:39:47,329 --> 00:39:50,640
if some other Wiktionary communities
do not want that,
638
00:39:50,640 --> 00:39:53,537
I would be careful, at least.
639
00:39:58,886 --> 00:40:02,489
I think you need the mic
for the stream.
640
00:40:04,540 --> 00:40:07,299
(person 12) So, obviously,
it's all very exciting,
641
00:40:07,979 --> 00:40:12,319
and I immediately think
how can I take that to my students
642
00:40:12,319 --> 00:40:15,558
and how can I incorporate it
with the courses,
643
00:40:15,558 --> 00:40:18,531
the work that we're doing,
educational settings.
644
00:40:18,531 --> 00:40:22,271
And I don't have, at the moment,
645
00:40:22,871 --> 00:40:24,116
first of all, enough knowledge,
646
00:40:24,116 --> 00:40:27,278
but I think the documentation
that we do have
647
00:40:27,808 --> 00:40:30,082
could be maybe improved.
648
00:40:30,082 --> 00:40:33,437
So that's a kind of request
to make cool videos
649
00:40:33,437 --> 00:40:35,898
that explain how it works
650
00:40:35,898 --> 00:40:39,948
because if we have it, we can then use it,
651
00:40:39,948 --> 00:40:41,985
and we can have students on board,
652
00:40:41,985 --> 00:40:47,072
and we can make people understand
how awesome it all is.
653
00:40:47,072 --> 00:40:52,001
And yeah, just think about documentation
and think about education, please.
654
00:40:52,001 --> 00:40:54,480
Because I think a lot could be done.
655
00:40:54,480 --> 00:40:58,585
These are like many tasks
that could be done even with...
656
00:41:00,125 --> 00:41:02,033
well, I wouldn't say primary schools,
657
00:41:02,033 --> 00:41:05,495
but certainly, even younger students.
658
00:41:05,915 --> 00:41:10,866
And so I would really like to see
that potential being tapped into,
659
00:41:10,866 --> 00:41:15,272
and, as of now, I personally
don't understand enough
660
00:41:15,272 --> 00:41:19,500
to be able to create tasks
or to create like...
661
00:41:20,430 --> 00:41:22,155
to do something practical with it.
662
00:41:22,155 --> 00:41:25,772
So any help, any thoughts
anyone here has about that,
663
00:41:25,772 --> 00:41:29,648
I would be very happy to hear
your thoughts, and yours as well.
664
00:41:30,508 --> 00:41:32,129
Yeah, let's talk about that.
665
00:41:35,473 --> 00:41:37,139
More questions?
666
00:41:37,809 --> 00:41:39,195
Someone else raised a hand.
667
00:41:39,195 --> 00:41:40,495
I forgot where it was.
668
00:41:45,739 --> 00:41:49,996
(person 13) So, if we can't import
from Wiktionary,
669
00:41:49,996 --> 00:41:55,772
is there some concerted effort
to find other public domain sources,
670
00:41:55,772 --> 00:41:57,459
maybe all the data,
671
00:41:58,769 --> 00:42:03,167
and kind of prefilter it, organize it
672
00:42:03,167 --> 00:42:08,470
so that it's easy to be checked
by people for import?
673
00:42:09,093 --> 00:42:11,181
So there are first efforts.
674
00:42:11,181 --> 00:42:14,769
My understanding is that Basque
is one of those efforts.
675
00:42:14,769 --> 00:42:17,474
Maybe you want to say
a bit more about it?
676
00:42:18,426 --> 00:42:20,130
(person 14) [inaudible]
677
00:42:23,166 --> 00:42:27,148
Okay, the actual answer
is paying for that...
678
00:42:28,374 --> 00:42:33,381
I mean, we have an agreement
with a contractor we usually work with.
679
00:42:34,801 --> 00:42:38,725
They do dictionaries--
680
00:42:40,315 --> 00:42:42,458
lots of stuff, but they do dictionaries.
681
00:42:42,458 --> 00:42:47,473
So we agreed with them
to make free the students' dictionary,
682
00:42:47,473 --> 00:42:52,782
we would [cast] the most common words
and start uploading it
683
00:42:52,782 --> 00:42:55,590
with an external identifier
and the scheme of things.
684
00:42:56,420 --> 00:43:02,902
But there was some discussion
about leaving it on CC0
685
00:43:03,212 --> 00:43:05,322
because they have
the dictionary with CC BY,
686
00:43:06,537 --> 00:43:10,326
and they understood
what the difference was.
687
00:43:10,326 --> 00:43:13,866
So there was some discussion.
688
00:43:13,866 --> 00:43:19,709
But I think that we can provide some tools
or some examples in the future,
689
00:43:19,709 --> 00:43:21,761
and I think that there will be
other dictionaries
690
00:43:21,761 --> 00:43:24,016
that we can handle,
691
00:43:24,016 --> 00:43:29,274
and also I think Wiktionary
should start moving in that direction,
692
00:43:29,274 --> 00:43:32,260
but that's another great discussion.
693
00:43:33,285 --> 00:43:34,487
And on top of that,
694
00:43:34,487 --> 00:43:38,839
Lea is also in contact
with people from Occitan
695
00:43:38,839 --> 00:43:41,827
who work on Occitan dictionaries,
696
00:43:41,827 --> 00:43:45,138
and they're currently working
on a similar collaboration.
697
00:43:51,644 --> 00:43:53,363
More questions?
698
00:44:01,487 --> 00:44:05,349
(person 15) Hi! We are the people
who want to import Occitan data.
699
00:44:05,349 --> 00:44:06,585
Aha! Perfect!
700
00:44:06,585 --> 00:44:08,368
(person 15) And we have a small problem.
701
00:44:09,188 --> 00:44:14,215
We don't know how to represent
the variety of all lexemes.
702
00:44:14,215 --> 00:44:17,893
We have six dialects,
703
00:44:17,893 --> 00:44:24,014
and we want to indicate for Lexeme
in which dialect it's used,
704
00:44:24,014 --> 00:44:27,285
and we don't have a proper
C0 statement to do that.
705
00:44:27,285 --> 00:44:31,105
So as long as the statement doesn't exist,
706
00:44:31,635 --> 00:44:34,465
it prevents us from [inaudible]
707
00:44:34,465 --> 00:44:37,603
because we will need to do it again
708
00:44:37,603 --> 00:44:42,076
when we will be able
to [export] the statement.
709
00:44:42,076 --> 00:44:44,551
And it's complicated
because it's a statement
710
00:44:44,551 --> 00:44:47,802
which won't be asked by many people
711
00:44:47,802 --> 00:44:53,444
because it's a statement
which concerns mostly minority languages.
712
00:44:53,444 --> 00:44:56,933
So you will have one person to ask this.
713
00:44:56,933 --> 00:45:00,022
But as our colleagues Basque,
714
00:45:00,022 --> 00:45:06,082
it can be one person
who will power thousands of others,
715
00:45:06,082 --> 00:45:10,884
so it might not be asking a lot,
716
00:45:10,884 --> 00:45:14,136
but it will be very important for us.
717
00:45:14,874 --> 00:45:17,600
Do you already have
a new property proposal up,
718
00:45:17,600 --> 00:45:19,470
or do you need help creating it?
719
00:45:21,524 --> 00:45:24,300
(person 15) We asked four months ago.
720
00:45:24,720 --> 00:45:28,755
Alright, then let's get some people
to help out with this property proposal.
721
00:45:30,159 --> 00:45:33,092
I'm sure there are enough people
in this room to make this happen.
722
00:45:33,360 --> 00:45:35,452
(person 15) Property proposal
[speaking in French].
723
00:45:35,452 --> 00:45:36,965
(person 16) We didn't have an answer.
724
00:45:36,965 --> 00:45:39,769
(person 15) We didn't have any answer,
and we don't know how to do this
725
00:45:39,769 --> 00:45:42,953
because we aren't
in the Wikidata community.
726
00:45:44,694 --> 00:45:48,817
Yup, so there are people here
who can help you.
727
00:45:48,817 --> 00:45:52,134
Maybe someone raises their hand to take--
728
00:45:52,574 --> 00:45:53,644
(person 14) I'm for that.
729
00:45:53,644 --> 00:45:55,512
But I think this is quite interesting
730
00:45:55,512 --> 00:45:59,059
that only the variant of form
731
00:45:59,059 --> 00:46:02,607
also can handle it geographically,
732
00:46:02,607 --> 00:46:04,995
with coordinates or some kind of mapping.
733
00:46:05,595 --> 00:46:07,815
Also having different pronunciations,
734
00:46:07,815 --> 00:46:11,837
and I think this is something
that happens in lots of languages.
735
00:46:12,607 --> 00:46:16,262
We should start making
it happen [inaudible],
736
00:46:16,262 --> 00:46:18,865
and I'm going to search for the property.
737
00:46:19,782 --> 00:46:20,933
Cool.
738
00:46:20,933 --> 00:46:24,446
So you will get backing
for your property proposal.
739
00:46:26,136 --> 00:46:27,297
Thank you.
740
00:46:28,153 --> 00:46:30,261
Alright, more questions?
741
00:46:32,410 --> 00:46:33,474
Finn.
742
00:46:33,974 --> 00:46:35,055
Finn is one of those people
743
00:46:35,055 --> 00:46:38,031
who builds stuff
on top of lexicographical data.
744
00:46:38,031 --> 00:46:40,085
(Finn) It's just a small question,
745
00:46:40,405 --> 00:46:44,226
and that's about spelling variations.
746
00:46:44,896 --> 00:46:48,002
It seems to be difficult to put them in...
747
00:46:48,532 --> 00:46:53,368
You could, of course,
have multiple forms for the same word.
748
00:46:56,327 --> 00:46:58,448
I don't know, it seems to be...
749
00:46:59,558 --> 00:47:03,535
If you don't do it that way,
it seems to be difficult to specify...
750
00:47:04,771 --> 00:47:05,888
or I don't know whether
751
00:47:05,888 --> 00:47:09,731
this is just a minor technical issue
or whether...
752
00:47:09,731 --> 00:47:11,252
Let's look at it together.
753
00:47:11,642 --> 00:47:15,230
I would love to see an example.
754
00:47:17,478 --> 00:47:18,478
Asaf.
755
00:47:26,886 --> 00:47:28,396
(Asaf) Thank you.
756
00:47:29,386 --> 00:47:33,685
I can give a very concrete example
from my mother tongue, Hebrew.
757
00:47:34,205 --> 00:47:38,845
Hebrew has two main variants
758
00:47:38,845 --> 00:47:42,786
for expressing almost every word
759
00:47:42,786 --> 00:47:47,640
because the traditional spelling
760
00:47:47,640 --> 00:47:50,044
leaves out many of the vowels.
761
00:47:50,934 --> 00:47:55,207
And, therefore, in modern editions
of the Bible and of poetry,
762
00:47:55,207 --> 00:47:57,461
diacritics are used.
763
00:47:57,461 --> 00:48:02,670
However, those diacritics
are never used for modern prose
764
00:48:02,670 --> 00:48:05,974
or newspaper writing or street signs.
765
00:48:05,974 --> 00:48:11,209
So the average daily casual use
puts in extra vowels
766
00:48:12,169 --> 00:48:13,519
and doesn't use the diacritics
767
00:48:13,519 --> 00:48:15,607
because they are,
of course, more cumbersome
768
00:48:15,607 --> 00:48:17,893
and have all kinds of rules
and nobody knows the rules.
769
00:48:18,633 --> 00:48:20,531
So there are basically two variants.
770
00:48:20,531 --> 00:48:25,322
There's the everyday casual prose variant,
771
00:48:25,322 --> 00:48:27,827
and there's the Bible or poetry,
772
00:48:27,827 --> 00:48:32,200
which always come
in this traditional diacriticized text.
773
00:48:32,200 --> 00:48:33,302
To be useful,
774
00:48:33,302 --> 00:48:37,428
Lexeme would have to recognize
both varieties of every single word
775
00:48:37,428 --> 00:48:39,747
and every single form
of every single word.
776
00:48:40,677 --> 00:48:43,391
So that's a very comprehensive use case
777
00:48:43,391 --> 00:48:46,340
for official stable variants.
778
00:48:46,340 --> 00:48:48,942
It's not dialect, it's not regions,
779
00:48:49,332 --> 00:48:53,627
it's basically two coexisting
morphological systems.
780
00:48:54,537 --> 00:48:58,926
And I too don't know exactly
how to express that in Lexeme today,
781
00:48:58,926 --> 00:49:02,800
which is one thing that is keeping me
in partial answer to Magnus' question
782
00:49:02,800 --> 00:49:05,238
from uploading the parts that are ready
783
00:49:05,238 --> 00:49:09,394
from the biggest Hebrew dictionary,
which is public domain
784
00:49:09,394 --> 00:49:13,141
and which I have been digitizing
for several years now.
785
00:49:13,141 --> 00:49:14,803
A good portion of it is ready,
786
00:49:14,803 --> 00:49:16,549
but I'm not putting it on Lexeme right now
787
00:49:16,549 --> 00:49:20,245
because I don't know exactly
how to solve this problem.
788
00:49:20,245 --> 00:49:23,387
Alright, let's solve
this problem here. (chuckles)
789
00:49:24,503 --> 00:49:26,021
That has to be possible.
790
00:49:30,045 --> 00:49:32,047
Alright, more questions?
791
00:49:37,173 --> 00:49:39,735
If not, then thank you so much.
792
00:49:40,605 --> 00:49:42,675
(applause)