1
00:00:06,303 --> 00:00:07,362
(Lydia) Thank you so much.

2
00:00:07,362 --> 00:00:11,244
So, this conference,
one of the big themes is languages.

3
00:00:14,220 --> 00:00:18,508
I want to give you an overview
of where we actually are currently

4
00:00:18,508 --> 00:00:19,812
when it comes to languages

5
00:00:20,264 --> 00:00:22,167
and where we can go from here.

6
00:00:29,036 --> 00:00:32,580
Wikidata is all about giving more people
more access to more knowledge,

7
00:00:32,580 --> 00:00:37,168
and language is such an important part
of making that a reality,

8
00:00:38,205 --> 00:00:43,291
especially since more and more
of our lives depend on technology.

9
00:00:44,114 --> 00:00:48,873
And as our keynote speaker
earlier today was saying,

10
00:00:49,723 --> 00:00:51,588
some of the technology
leaves people behind

11
00:00:51,588 --> 00:00:55,020
simply because they can't speak
a certain language,

12
00:00:55,320 --> 00:00:57,573
and that's not okay.

13
00:00:58,633 --> 00:01:02,097
So we want to do something about that.

14
00:01:02,927 --> 00:01:05,841
And in order to change that,
you need at least two things.

15
00:01:06,411 --> 00:01:11,270
One is you need to provide content
to the people in their language,

16
00:01:11,270 --> 00:01:12,955
and the second thing you need

17
00:01:12,955 --> 00:01:15,910
is to provide them
with interaction in their language

18
00:01:15,910 --> 00:01:19,189
in those applications
or whatever it is you have.

19
00:01:20,367 --> 00:01:25,277
And Wikidata helps with both of those.

20
00:01:25,277 --> 00:01:28,408
And the first thing,
<i>content in your language</i>,

21
00:01:28,408 --> 00:01:30,879
that is basically what we have
in items and properties,

22
00:01:31,319 --> 00:01:33,082
how we describe the world.

23
00:01:33,082 --> 00:01:35,085
Now, this is certainly
not everything you need,

24
00:01:35,085 --> 00:01:39,294
but it gets you quite far ahead.

25
00:01:39,764 --> 00:01:41,847
The other thing
is <i>interaction in your language</i>,

26
00:01:41,847 --> 00:01:46,389
and that's where lexemes come into play,

27
00:01:46,389 --> 00:01:49,382
if you want to talk
to your digital personal assistant

28
00:01:49,382 --> 00:01:54,918
or if you want to have your device
translate a text and things like that.

29
00:01:56,404 --> 00:01:59,254
Alright, let's look into
<i>content in your language.</i>

30
00:01:59,254 --> 00:02:03,396
So what we have in <i>items</i> and <i>properties</i>.

31
00:02:05,406 --> 00:02:09,696
For this, the labels in those items
and properties are crucial.

32
00:02:10,236 --> 00:02:14,866
We need to know what this entity
is called that we're talking about.

33
00:02:15,656 --> 00:02:19,987
And instead of talking about Q5,

34
00:02:19,987 --> 00:02:22,180
someone who speaks English
knows that's a "human,"

35
00:02:22,180 --> 00:02:24,706
someone who speaks German
knows that's a "Mensch,"

36
00:02:24,706 --> 00:02:26,374
and similar things.

37
00:02:26,374 --> 00:02:29,742
So those labels on items and properties

38
00:02:29,742 --> 00:02:33,619
are bridging the gap
between humans and machines.

39
00:02:33,619 --> 00:02:35,439
And humans and humans

40
00:02:35,439 --> 00:02:40,115
making more existing knowledge
accessible to them.

41
00:02:43,270 --> 00:02:46,290
Now, that's a nice aspiration.

42
00:02:46,290 --> 00:02:48,342
What does it actually look like?

43
00:02:48,342 --> 00:02:49,607
It looks like this.

44
00:02:50,947 --> 00:02:52,416
What you're seeing here

45
00:02:52,416 --> 00:02:58,496
is that most of the items
on Wikidata have two labels,

46
00:02:58,496 --> 00:03:00,767
so labels in two languages.

47
00:03:01,697 --> 00:03:03,851
And after that, it's one, and then three,

48
00:03:03,851 --> 00:03:06,115
and then it becomes very sad.

49
00:03:06,781 --> 00:03:08,581
(quiet laughter)

50
00:03:10,047 --> 00:03:12,713
I think we need to do better than this.

51
00:03:14,185 --> 00:03:15,319
But, on the other hand,

52
00:03:15,319 --> 00:03:17,478
I was actually expecting this
to be even worse.

53
00:03:17,478 --> 00:03:19,560
I was expecting the average to be one.

54
00:03:19,560 --> 00:03:22,503
So I was quite happy 
to see two. (chuckles)

55
00:03:24,921 --> 00:03:26,186
Alright.

56
00:03:27,156 --> 00:03:29,527
But it's not just interesting to know

57
00:03:29,527 --> 00:03:33,742
how many labels our items
and properties have.

58
00:03:33,742 --> 00:03:36,565
It's also interesting to see
in which languages.

59
00:03:38,045 --> 00:03:43,764
Here you see a graph of the languages

60
00:03:43,764 --> 00:03:46,838
that we have labels for on <i>Items</i>.

61
00:03:46,838 --> 00:03:50,669
So the biggest part there is <i>Other</i>.

62
00:03:51,229 --> 00:03:53,863
So I just took the top 100 languages

63
00:03:54,533 --> 00:03:58,902
and everything else is <i>Other</i>
to make this graph readable.

64
00:03:59,542 --> 00:04:02,142
And then there's English and Dutch,

65
00:04:03,002 --> 00:04:04,254
French,

66
00:04:05,924 --> 00:04:09,129
and not to forget, Asturian.

67
00:04:09,659 --> 00:04:11,889
- (person 1) Whoo!
- Whoo-hoo, yes!

68
00:04:13,899 --> 00:04:16,954
So what you see here is quite an imbalance

69
00:04:16,954 --> 00:04:20,114
and still quite a lot of focus on English.

70
00:04:21,236 --> 00:04:24,367
Another thing is if you look
at the same thing for <i>Properties</i>,

71
00:04:24,367 --> 00:04:25,999
it's actually looking better.

72
00:04:27,399 --> 00:04:32,750
And I think part of that is due to there
just being way fewer properties.

73
00:04:32,750 --> 00:04:36,770
So even smaller communities
have a chance to keep up with that.

74
00:04:36,770 --> 00:04:39,173
But it's also a pretty
important part of Wikidata

75
00:04:39,173 --> 00:04:41,159
to localize into your language.

76
00:04:41,159 --> 00:04:42,384
So that's good.

77
00:04:45,752 --> 00:04:47,842
What I want to highlight
here with Asturian

78
00:04:47,842 --> 00:04:53,698
is that a small community
can really make a huge difference

79
00:04:54,448 --> 00:04:57,085
with some dedication and work,

80
00:04:57,085 --> 00:04:58,420
and that's really cool.

81
00:05:01,846 --> 00:05:03,530
A small quiz for you.

82
00:05:03,530 --> 00:05:05,493
If you take all the properties on Wikidata

83
00:05:05,493 --> 00:05:07,687
that are not external identifiers,

84
00:05:07,687 --> 00:05:10,358
which one has the most labels,
like the most languages?

85
00:05:10,977 --> 00:05:13,847
(audience) [inaudible]

86
00:05:13,847 --> 00:05:16,786
I hear some agreement on <i>instance of</i>?

87
00:05:17,506 --> 00:05:19,443
You would be wrong.

88
00:05:19,983 --> 00:05:22,210
It's <i>image</i>. (chuckles)

89
00:05:23,230 --> 00:05:26,366
So, yeah, that tells you,
if you speak one of the languages

90
00:05:26,366 --> 00:05:28,621
where <i>instance of</i>
doesn't yet have a label,

91
00:05:28,621 --> 00:05:30,190
you might want to add it.

92
00:05:32,102 --> 00:05:35,676
So it has 148 labels currently.

93
00:05:37,688 --> 00:05:41,249
But that's just another slide.

94
00:05:42,631 --> 00:05:44,162
This graph tells us something

95
00:05:44,162 --> 00:05:49,321
about how much content we are making
available in a certain language

96
00:05:49,321 --> 00:05:52,042
and how much of that content
is actually used.

97
00:05:52,042 --> 00:05:55,448
So what you're seeing is basically a curve

98
00:05:55,448 --> 00:06:00,987
with most content having English labels,
being available in English,

99
00:06:01,507 --> 00:06:04,295
and being used a lot.

100
00:06:04,295 --> 00:06:06,449
And then it kind of goes down.

101
00:06:06,449 --> 00:06:09,436
But, again, what you can see are outliers

102
00:06:09,436 --> 00:06:15,333
who have a lot more content
than you would necessarily expect,

103
00:06:16,903 --> 00:06:19,539
and that is really, really good.

104
00:06:20,839 --> 00:06:24,945
The problem still is it's not used a lot.

105
00:06:25,565 --> 00:06:28,742
Asturian and Dutch should be higher,

106
00:06:28,742 --> 00:06:31,994
and I think helping those communities

107
00:06:33,266 --> 00:06:35,563
increase the use
of the data they collected

108
00:06:35,563 --> 00:06:37,682
is a really useful thing to do.

109
00:06:42,910 --> 00:06:48,110
What this analysis and others
showed us is also a good thing though

110
00:06:48,300 --> 00:06:51,378
is that we are seeing
that highly used items

111
00:06:51,378 --> 00:06:55,295
also tend to have more labels

112
00:06:55,295 --> 00:06:58,188
or the other way around--
it's not entirely clear.

113
00:07:02,513 --> 00:07:04,376
And then the question is,

114
00:07:04,806 --> 00:07:07,009
are we serving
just the powerful languages?

115
00:07:07,899 --> 00:07:11,147
Or are we serving everyone?

116
00:07:12,757 --> 00:07:17,743
And what you see here
is a grouping of languages.

117
00:07:17,743 --> 00:07:21,832
The languages that are grouped together
tend to have labels together.

118
00:07:26,042 --> 00:07:28,599
And you see it clustering.

119
00:07:28,599 --> 00:07:34,065
Now here's a similar clustering, colored,

120
00:07:34,065 --> 00:07:39,475
based on how alive, how used,

121
00:07:40,455 --> 00:07:43,156
how endangered the language is.

122
00:07:43,156 --> 00:07:44,642
And a good thing you're seeing here

123
00:07:44,642 --> 00:07:49,566
is that safe languages
and endangered languages

124
00:07:49,566 --> 00:07:53,773
do not form two different clusters.

125
00:07:53,773 --> 00:07:58,872
But they're all mixed together,

126
00:08:00,262 --> 00:08:04,625
which is much better than it would be
the other way around

127
00:08:04,625 --> 00:08:09,377
where the safe languages,
the powerful languages

128
00:08:10,197 --> 00:08:12,164
are just helping each other out.

129
00:08:12,744 --> 00:08:14,356
No, that's not the case.

130
00:08:14,356 --> 00:08:17,417
And it's a really good thing.

131
00:08:17,417 --> 00:08:20,042
When I saw this, 
I thought this was very good.

132
00:08:23,474 --> 00:08:25,169
Here's a similar thing

133
00:08:26,239 --> 00:08:28,800
where we looked at

134
00:08:30,230 --> 00:08:34,222
the language's status

135
00:08:34,222 --> 00:08:36,225
and how many labels it has.

136
00:08:39,367 --> 00:08:42,937
What you're seeing
is a clear win for safe languages,

137
00:08:42,937 --> 00:08:44,248
as is expected.

138
00:08:45,508 --> 00:08:46,693
But what you're also seeing

139
00:08:46,693 --> 00:08:54,407
is that the languages in category 2
and 3 and maybe even 4

140
00:08:54,407 --> 00:08:59,280
are not that bad, actually,

141
00:08:59,280 --> 00:09:02,367
in terms of their representation
in Wikidata and others.

142
00:09:03,287 --> 00:09:06,408
It's a really good thing to find.

143
00:09:07,646 --> 00:09:09,129
Now, if you look at the same thing

144
00:09:09,129 --> 00:09:12,418
for how much of that content
of those labels

145
00:09:12,418 --> 00:09:15,495
is actually used
on Wikipedia, for example,

146
00:09:17,455 --> 00:09:22,563
then we see a similar
picture emerging again.

147
00:09:23,603 --> 00:09:29,813
And it tells us that those communities
are actually making good use of their time

148
00:09:29,813 --> 00:09:34,504
by filling in labels
for higher used items, for example.

149
00:09:36,410 --> 00:09:40,493
There are outliers
where I think we can help,

150
00:09:41,683 --> 00:09:48,202
to help those communities find the places
where their work would be most valuable.

151
00:09:49,312 --> 00:09:52,663
But, overall, I'm happy with this picture.

152
00:09:54,823 --> 00:09:59,844
Now, that was the items
and properties part of Wikidata.

153
00:10:00,714 --> 00:10:03,033
Now, let's look at interaction
in your language.

154
00:10:03,033 --> 00:10:05,203
So the lexeme parts of Wikidata

155
00:10:05,203 --> 00:10:09,394
where we describe words
and their forms and their meanings.

156
00:10:10,167 --> 00:10:13,301
We've been doing this now
since May last year,

157
00:10:16,461 --> 00:10:19,127
and content has been growing.

158
00:10:20,114 --> 00:10:22,149
You can see here in blue the lexemes,

159
00:10:22,149 --> 00:10:25,938
and then in red, 
the forms on those lexemes

160
00:10:25,938 --> 00:10:29,910
and yellow, the senses
on those lexemes.

161
00:10:30,991 --> 00:10:34,451
So some communities--
we'll get to that later--

162
00:10:34,451 --> 00:10:39,793
have spent a lot of time creating forms
and senses for their lexemes,

163
00:10:39,793 --> 00:10:42,753
which is really useful

164
00:10:42,753 --> 00:10:48,243
because that builds
the core of the data set that you need.

165
00:10:50,562 --> 00:10:55,133
Now, we looked at all the languages

166
00:10:55,133 --> 00:10:57,906
that have lexemes on Wikidata.

167
00:10:57,906 --> 00:11:01,003
So words we have,

168
00:11:01,713 --> 00:11:04,404
those are right now 310 languages.

169
00:11:04,884 --> 00:11:08,290
Now, what do you think is the top language

170
00:11:08,290 --> 00:11:11,949
when it comes to the number
of lexemes currently in Wikidata?

171
00:11:12,933 --> 00:11:14,700
(audience) [inaudible]

172
00:11:19,183 --> 00:11:20,216
Huh?

173
00:11:20,216 --> 00:11:21,741
(person 2) German.

174
00:11:21,741 --> 00:11:24,252
Sorry, I've heard it before.

175
00:11:24,252 --> 00:11:25,651
It's Russian.

176
00:11:28,011 --> 00:11:29,754
Russian is quite ahead.

177
00:11:31,897 --> 00:11:33,832
And just to give you some perspective,

178
00:11:35,652 --> 00:11:36,816
there's different opinions

179
00:11:36,816 --> 00:11:42,231
but I've read, for example,
that 1,000 to 3,000 words

180
00:11:42,231 --> 00:11:45,450
gets you to conversation level,
roughly, in another language,

181
00:11:45,450 --> 00:11:49,461
and 4,000 to 10,000 words
to an advanced level.

182
00:11:51,591 --> 00:11:55,282
So, we still have a bit to catch up there.

183
00:11:58,483 --> 00:12:03,279
One thing I want you
to pay attention to is Basque here

184
00:12:03,279 --> 00:12:07,744
with 10,000, roughly, lexemes.

185
00:12:09,244 --> 00:12:13,003
Now, if you look at the number
of forms for those lexemes,

186
00:12:14,163 --> 00:12:16,497
Basque is way up there,

187
00:12:18,257 --> 00:12:20,006
which is really cool,

188
00:12:20,006 --> 00:12:24,930
and you should go to a talk that explains
to you why that is the case.

189
00:12:27,341 --> 00:12:31,175
Now, if you look at the number
of senses, so what do words mean,

190
00:12:32,015 --> 00:12:35,081
Basque even gets to the top of the list.

191
00:12:35,081 --> 00:12:37,102
I think that deserves an applause.

192
00:12:37,102 --> 00:12:38,921
(applause)

193
00:12:45,678 --> 00:12:47,118
Another short quiz.

194
00:12:47,118 --> 00:12:50,181
What's the lexeme
with the most translations currently?

195
00:12:50,651 --> 00:12:55,414
(audience) Cats, cats, [inaudible], 
Douglas Adams, [inaudible]

196
00:12:56,766 --> 00:13:00,014
All good guesses, but no.

197
00:13:01,012 --> 00:13:04,137
It's this, the Russian word for "water."

198
00:13:09,571 --> 00:13:12,253
Alright, so now we talked a lot

199
00:13:12,253 --> 00:13:16,412
about how many lexemes,
forms, and senses we have,

200
00:13:16,412 --> 00:13:20,493
but that's just one thing you need.

201
00:13:20,493 --> 00:13:21,515
The other thing you need

202
00:13:21,515 --> 00:13:25,161
is actually describing those lexemes,
forms, and senses

203
00:13:25,161 --> 00:13:27,647
in a machine-readable way.

204
00:13:27,647 --> 00:13:30,039
And for that you have statements,
like on items.

205
00:13:31,479 --> 00:13:36,362
And one of the properties
you use is usage example.

206
00:13:36,362 --> 00:13:38,582
So whoever is using that data

207
00:13:38,582 --> 00:13:42,089
can understand how to use
that word in context,

208
00:13:42,089 --> 00:13:44,158
so that could be a quote, for example.

209
00:13:45,396 --> 00:13:47,113
And here, Polish rocks.

210
00:13:47,900 --> 00:13:49,764
Good job, Polish speakers.

211
00:13:54,219 --> 00:13:57,680
Another property
that's really useful is IPA,

212
00:13:57,680 --> 00:14:00,186
so how do you pronounce this word.

213
00:14:00,876 --> 00:14:07,497
Russian apparently needs
lots of IPA statements.

214
00:14:10,419 --> 00:14:13,314
But, again, Polish, second.

215
00:14:17,148 --> 00:14:20,753
And last but not least
we have pronunciation audio.

216
00:14:20,753 --> 00:14:23,372
So that is links to files on Commons

217
00:14:23,372 --> 00:14:25,959
where someone speaks the word,

218
00:14:25,959 --> 00:14:29,913
so you can hear a native speaker
pronounce the word

219
00:14:29,913 --> 00:14:32,871
in case you can't read IPA, for example.

220
00:14:34,959 --> 00:14:39,205
And there's actually a really nice
Wikibase-powered project

221
00:14:39,205 --> 00:14:40,474
called Lingua Libre

222
00:14:40,884 --> 00:14:45,173
where you can go and help record
words in your language

223
00:14:45,173 --> 00:14:47,836
that then can be added
to lexemes on Wikidata,

224
00:14:48,446 --> 00:14:52,103
so other people can understand
how to pronounce your words.

225
00:14:53,663 --> 00:14:55,694
(person 2) [inaudible]

226
00:14:55,694 --> 00:14:57,665
If you search for "Lingua Libre,"

227
00:14:57,665 --> 00:15:00,981
and I'm sure someone can post it
in the Telegram channel.

228
00:15:03,138 --> 00:15:04,621
Those guys rock.

229
00:15:04,621 --> 00:15:06,726
They did really cool stuff with Wikibase.

230
00:15:09,416 --> 00:15:10,617
Alright.

231
00:15:12,706 --> 00:15:17,285
Then the question is,
where do we go from here?

232
00:15:19,165 --> 00:15:22,010
Based on the numbers I've just shown you,

233
00:15:23,030 --> 00:15:25,172
we've come a long way

234
00:15:25,172 --> 00:15:28,430
towards giving more people
more access to more knowledge

235
00:15:28,430 --> 00:15:31,240
when looking at languages on Wikidata.

236
00:15:32,530 --> 00:15:36,392
But there is also still
a lot of work ahead of us.

237
00:15:38,992 --> 00:15:42,341
Some of the things
you can do to help, for example,

238
00:15:42,341 --> 00:15:44,921
is run label-a-thons

239
00:15:44,921 --> 00:15:50,124
like get people together
to label items in Wikidata

240
00:15:50,914 --> 00:15:55,121
or do an edit-a-thon
around lexemes in your language

241
00:15:55,121 --> 00:15:59,212
to get the most used words
in your language into Wikidata.

242
00:16:00,773 --> 00:16:03,285
Or you can use a tool like Terminator

243
00:16:03,285 --> 00:16:08,493
that helps you find the most
important items in your language

244
00:16:08,493 --> 00:16:11,549
that are still missing a label.

245
00:16:13,274 --> 00:16:18,359
Most important being measured
by how often it is used

246
00:16:18,359 --> 00:16:22,553
in other Wikidata items
as links in statements.

247
00:16:25,768 --> 00:16:30,022
And, of course, for the lexeme part,

248
00:16:31,342 --> 00:16:35,169
now that we've got
a basic coverage of those lexemes,

249
00:16:35,169 --> 00:16:41,163
it's also about building them out,
adding more statements to them

250
00:16:41,163 --> 00:16:44,401
so that they actually can build the base

251
00:16:44,401 --> 00:16:47,421
for meaningful applications
to build on top of that.

252
00:16:48,141 --> 00:16:50,795
Because we're getting closer
to that critical mass,

253
00:16:50,795 --> 00:16:53,616
but we're still away from that,

254
00:16:53,616 --> 00:16:56,624
that you can build
serious applications on top of it.

255
00:16:58,277 --> 00:17:01,680
And I hope all of you
will join us in doing that.

256
00:17:02,583 --> 00:17:07,103
And that already brings me

257
00:17:07,103 --> 00:17:09,843
to a little help from our friends,

258
00:17:09,843 --> 00:17:12,812
and Bruno, do you want to come over

259
00:17:13,882 --> 00:17:16,854
and talk to us about lexical masks?

260
00:17:17,541 --> 00:17:18,567
(Bruno) Thank you, Lydia,

261
00:17:18,567 --> 00:17:21,519
thank you for giving me
this short period of time

262
00:17:21,519 --> 00:17:24,150
to present this work
that we are doing at Google

263
00:17:24,150 --> 00:17:29,635
with Denny, whom most of you
probably have heard of or know.

264
00:17:30,126 --> 00:17:32,030
I'm a linguist at Google,

265
00:17:32,030 --> 00:17:36,150
so I'm very happy to be here
amongst other language enthusiasts.

266
00:17:36,620 --> 00:17:39,278
We are also building some lexicons,

267
00:17:39,278 --> 00:17:41,766
and we have built this technology

268
00:17:41,766 --> 00:17:45,589
or this approach that we think
can be useful for you.

269
00:17:46,369 --> 00:17:48,455
Just to give you
a little bit of background,

270
00:17:48,455 --> 00:17:52,068
this is my lexicographic
background talking here.

271
00:17:52,788 --> 00:17:54,347
When we build lexicon databases,

272
00:17:54,347 --> 00:17:58,623
it is very hard to maintain them,
to keep them consistent

273
00:17:58,623 --> 00:18:00,125
and to exchange data,

274
00:18:00,125 --> 00:18:02,027
as you probably know.

275
00:18:02,517 --> 00:18:05,927
There are several attempts
to unify the features and the properties

276
00:18:05,927 --> 00:18:09,184
that are describing
those lexemes and those forms,

277
00:18:09,184 --> 00:18:10,936
and it's not a solved problem,

278
00:18:10,936 --> 00:18:13,958
but there are some
unification attempts on that side.

279
00:18:13,958 --> 00:18:15,209
But what is really missing--

280
00:18:15,209 --> 00:18:18,732
and this is a problem we had
at the beginning of our project at Google--

281
00:18:18,732 --> 00:18:21,607
is an internal structure

282
00:18:22,197 --> 00:18:25,910
that describes what
a lexical entry should look like,

283
00:18:25,910 --> 00:18:28,581
what kind of data
or what kind of information we have

284
00:18:28,581 --> 00:18:32,237
and the specifications that are expected.

285
00:18:32,237 --> 00:18:38,187
So, this is what we came up with:
this thing called a lexicon mask.

286
00:18:38,897 --> 00:18:44,841
A lexicon mask is describing
what is expected for an entry,

287
00:18:44,841 --> 00:18:47,329
a lexicographic entry, to be complete,

288
00:18:47,329 --> 00:18:51,436
both in terms of the number of forms
you expect for a lexeme,

289
00:18:51,436 --> 00:18:55,607
and the number of features
you expect for each of those forms.

290
00:18:56,397 --> 00:18:58,329
Here is an example for Italian adjectives.

291
00:18:58,329 --> 00:19:02,002
You expect, in Italian, to have
four forms for your adjectives,

292
00:19:02,002 --> 00:19:05,383
and each of these forms
have a specific combination

293
00:19:05,383 --> 00:19:07,946
of gender and number features.

294
00:19:08,606 --> 00:19:12,672
This is what we expect
for the Italian adjectives.

295
00:19:12,672 --> 00:19:16,176
Of course, you can have
extremely complex masks,

296
00:19:16,176 --> 00:19:20,783
like the French verb conjugation,
which is quite extensive,

297
00:19:20,783 --> 00:19:23,487
and I don't show you
any other Russian mask

298
00:19:23,487 --> 00:19:25,378
because it doesn't fit the screen.

299
00:19:26,308 --> 00:19:29,531
And we also have
some detailed specifications

300
00:19:29,531 --> 00:19:33,421
because we distinguish
what is at the form level.

301
00:19:33,421 --> 00:19:37,544
So here you have Russian nouns
that have three numbers

302
00:19:37,544 --> 00:19:40,048
and a number of cases
with different forms,

303
00:19:40,048 --> 00:19:43,086
but they also have
an entry level specification

304
00:19:43,086 --> 00:19:45,590
that says a noun particularly has

305
00:19:45,590 --> 00:19:50,133
an inherent gender
and an inherent animacy feature

306
00:19:50,133 --> 00:19:52,488
that is also specified in the mask.

307
00:19:54,518 --> 00:19:58,779
We also want to distinguish
that a mask gives a specification

308
00:19:58,779 --> 00:20:01,874
for, in general,
what an entry should look like.

309
00:20:01,874 --> 00:20:07,158
But you can have smaller masks
for defective aspects of the form

310
00:20:07,158 --> 00:20:11,282
or defective aspects of the lexeme
that happen in language.

311
00:20:11,282 --> 00:20:14,537
So here is the simplest version
of French verbs

312
00:20:14,537 --> 00:20:19,729
that have only the 3rd person singular
for all the weather verbs,

313
00:20:19,729 --> 00:20:23,969
like "it rains" or "it snows,"
like in English.

314
00:20:24,537 --> 00:20:26,493
So we distinguish these two levels.

315
00:20:26,923 --> 00:20:29,962
And how we use this at Google

316
00:20:29,962 --> 00:20:32,643
is that when we have a lexicon
that we want to use,

317
00:20:33,063 --> 00:20:38,309
we use the mask to really
literally throw the lexicon,

318
00:20:38,309 --> 00:20:40,163
all the entries, through the mask

319
00:20:40,163 --> 00:20:44,303
and see which entry has a problem
in terms of structure.

320
00:20:44,303 --> 00:20:46,523
Are we missing a form?
Are we missing a feature?

321
00:20:46,523 --> 00:20:51,497
And when there is a problem,
we do some human validation

322
00:20:51,497 --> 00:20:53,751
or just to see if it passes the mask.

323
00:20:53,751 --> 00:20:57,924
So it's an extremely powerful tool
to check the quality of the structure.

324
00:20:59,427 --> 00:21:01,964
So what we are happy to announce today

325
00:21:01,964 --> 00:21:05,408
is that we got the green light
to open source our mask.

326
00:21:05,948 --> 00:21:07,573
So this is a schema,

327
00:21:07,573 --> 00:21:09,477
if you want, that we can release

328
00:21:09,477 --> 00:21:13,483
and that we will provide
to Wikidata as ShEx files.

329
00:21:13,483 --> 00:21:16,688
This is a ShEx file for German nouns,

330
00:21:16,688 --> 00:21:20,428
and Denny is working on the conversion
from our internal specification

331
00:21:20,428 --> 00:21:23,666
to a more open-source specification.

332
00:21:23,666 --> 00:21:27,522
We currently cover more than 25 languages.

333
00:21:27,522 --> 00:21:29,225
So we expect to grow on our side,

334
00:21:29,225 --> 00:21:34,350
but we also look for this opportunity
to collaborate for other languages.

335
00:21:34,350 --> 00:21:40,728
And one of the ongoing collaborations
is also the one that Denny has with Lukas.

336
00:21:40,728 --> 00:21:45,052
Lukas has these great tools to have a UI

337
00:21:45,052 --> 00:21:51,061
to help the user or the contributor
to add more forms.

338
00:21:51,061 --> 00:21:54,151
So if you want to add
an adjective in French,

339
00:21:54,151 --> 00:21:59,057
the UI is telling you
how many forms are expected

340
00:21:59,057 --> 00:22:01,562
and what kind of features
this form should have.

341
00:22:01,562 --> 00:22:06,268
So our mask will help the tool
to be defined and expanded.

342
00:22:07,238 --> 00:22:08,385
That's it.

343
00:22:08,791 --> 00:22:10,358
(Lydia) Thank you so much.

344
00:22:10,358 --> 00:22:11,993
(applause)

345
00:22:14,249 --> 00:22:16,891
Alright. Are there questions?

346
00:22:16,891 --> 00:22:19,381
Do you want to talk more about lexemes?

347
00:22:19,817 --> 00:22:21,475
- (person 3) Yes.
- Yes. (chuckles)

348
00:22:33,485 --> 00:22:35,380
(person 3) My question,
because you were talking

349
00:22:35,380 --> 00:22:39,106
about giving more access
to more people in more languages.

350
00:22:39,106 --> 00:22:42,444
But there are a lot of languages
that can't be used in Wikidata.

351
00:22:42,444 --> 00:22:44,588
So what solution do you have for that?

352
00:22:45,889 --> 00:22:47,686
When you say they can't be used in Wikidata,

353
00:22:47,686 --> 00:22:50,308
are you talking about entering labels?

354
00:22:50,308 --> 00:22:52,578
- (person 3) Labels, descriptions.
- Right.

355
00:22:52,578 --> 00:22:55,498
So, for lexemes, it's a bit different

356
00:22:55,498 --> 00:22:57,793
because there we don't have
that restriction.

357
00:22:58,923 --> 00:23:05,003
For labels on items and properties,
there is some restriction

358
00:23:05,433 --> 00:23:12,411
because we wanted to make sure
that it's not completely

359
00:23:12,411 --> 00:23:14,229
anyone does anything,

360
00:23:14,229 --> 00:23:17,769
and it becomes unmanageable.

361
00:23:19,349 --> 00:23:23,328
Even a small community who wants
one language and wants to work on that,

362
00:23:23,898 --> 00:23:26,787
come talk to us, we will make it happen.

363
00:23:26,787 --> 00:23:29,202
(person 3) I mean, we did this
at the Prague Hackathon in May,

364
00:23:29,202 --> 00:23:32,459
and it took us until almost August
in order to be able to use our language.

365
00:23:32,459 --> 00:23:35,135
- Yeah.
- (person 3) So, it's very slow.

366
00:23:35,135 --> 00:23:37,854
Yeah, it is, unfortunately, very slow.

367
00:23:37,854 --> 00:23:39,883
We're currently working
with the Language Committee

368
00:23:39,883 --> 00:23:46,048
on solving some fundamental...

369
00:23:49,537 --> 00:23:55,447
Like, getting agreement on what kind
of languages are actually "allowed,"

370
00:23:56,047 --> 00:23:59,398
and that has taken too long,

371
00:23:59,988 --> 00:24:04,178
which is the reason why your request
probably took longer than it should have.

372
00:24:04,778 --> 00:24:05,963
(person 3) Thanks.

373
00:24:06,815 --> 00:24:07,950
(person 4) Thank you.

374
00:24:07,950 --> 00:24:10,938
Lydia, if you remember
the statistics that you showed,

375
00:24:10,938 --> 00:24:12,886
the number of lexemes per language.

376
00:24:12,886 --> 00:24:17,599
So, did you count
all the forms as a data point

377
00:24:17,599 --> 00:24:20,034
or only lexemes?

378
00:24:21,289 --> 00:24:22,941
(Lydia) Do you mean this?

379
00:24:22,941 --> 00:24:24,053
Which one do you mean?

380
00:24:24,053 --> 00:24:25,529
(person 4) Yes, exactly.

381
00:24:25,797 --> 00:24:28,341
If you remember,
does this number [inaudible]

382
00:24:28,341 --> 00:24:31,954
all the forms for all the lexemes
or just how many lexemes there are?

383
00:24:31,954 --> 00:24:33,585
No, this is just a number of lexemes.

384
00:24:33,585 --> 00:24:35,395
(person 4) Just a number of lexemes, okay.

385
00:24:35,395 --> 00:24:36,797
So then it is a just statistic

386
00:24:36,797 --> 00:24:39,390
because if it also
included the forms--

387
00:24:39,390 --> 00:24:40,614
that's why I'm asking--

388
00:24:40,614 --> 00:24:42,817
then all the languages
with the inflectional morphology,

389
00:24:42,817 --> 00:24:45,027
like Russian, Serbian,
Slovenian and et cetera,

390
00:24:45,027 --> 00:24:47,616
they have a natural advantage
because they have so many.

391
00:24:47,616 --> 00:24:51,990
So, this kind of kicks in here
on this number of forms.

392
00:24:51,990 --> 00:24:53,851
(person 4) Yeah, that was this one.
Thank you.

393
00:24:56,546 --> 00:25:00,224
(person 5) So, I had
a quick question about the...

394
00:25:00,644 --> 00:25:06,824
When we're talking about
the actual items and properties.

395
00:25:07,124 --> 00:25:08,901
Like as far as I understand,

396
00:25:08,901 --> 00:25:11,955
there is currently no way
to give an actual source

397
00:25:11,955 --> 00:25:14,726
to any of the labels
and descriptions that are given.

398
00:25:14,726 --> 00:25:18,047
So, for example,
because when you're talking

399
00:25:18,047 --> 00:25:20,920
about an item property,

400
00:25:20,920 --> 00:25:24,509
like, for example,
you can get conflicting labels.

401
00:25:24,509 --> 00:25:25,739
Yes.

402
00:25:25,739 --> 00:25:27,662
(person 5) So this person is like...

403
00:25:28,402 --> 00:25:30,781
We were talking about
indigenous things before, for example.

404
00:25:30,781 --> 00:25:35,965
So this person is a Norwegian artist
according to this source,

405
00:25:35,965 --> 00:25:38,750
and a Sami artist,
according to this source.

406
00:25:39,550 --> 00:25:42,883
Or, for example, in Estonian,
we had an issue

407
00:25:42,883 --> 00:25:47,729
where we had to change terminology
to the official use terminology

408
00:25:47,729 --> 00:25:49,482
in official lexicons,

409
00:25:49,482 --> 00:25:52,262
but we have no way to indicate really why,

410
00:25:52,262 --> 00:25:53,596
like what was the source of this

411
00:25:53,596 --> 00:25:55,561
and why this was better
and what was there before.

412
00:25:55,561 --> 00:25:57,150
It was just me as a random person

413
00:25:57,150 --> 00:25:59,615
just switching the thing
to anyone who sees it.

414
00:25:59,615 --> 00:26:02,520
So is there a plan
to make this possible in any way

415
00:26:02,520 --> 00:26:06,355
so that we can actually have
proper sources for the language data?

416
00:26:07,045 --> 00:26:11,568
So, it is partially possible.

417
00:26:11,568 --> 00:26:15,958
So, for example, when you have
an item for a person,

418
00:26:16,968 --> 00:26:22,720
you have a statement, first name,
last name, and so on, of that person,

419
00:26:22,720 --> 00:26:26,226
and then you can provide
the reference for that there.

420
00:26:28,211 --> 00:26:32,544
I'm quite hesitant to add more complexity

421
00:26:32,544 --> 00:26:35,557
for references on labels and descriptions,

422
00:26:35,557 --> 00:26:38,624
but if people really, really think

423
00:26:38,624 --> 00:26:44,939
this is something that isn't covered
by any reference on the statement,

424
00:26:44,939 --> 00:26:46,803
then let's talk about it.

425
00:26:49,079 --> 00:26:53,303
But I fear it will add a lot of complexity

426
00:26:53,303 --> 00:26:56,523
for what I hope are few cases,

427
00:26:57,393 --> 00:27:00,188
but I'm willing to be convinced otherwise

428
00:27:00,188 --> 00:27:04,087
if people really feel
very strongly about this.

429
00:27:04,087 --> 00:27:08,177
(person 5) I mean, if it's added
it probably shouldn't be the default,

430
00:27:08,177 --> 00:27:12,452
shown to all the users as a beginner
interface, in any case.

431
00:27:12,452 --> 00:27:16,190
More like, "Click here if you need to say
a specific thing about this."

432
00:27:17,632 --> 00:27:23,368
Do we have a sense of how many times
that would actually matter?

433
00:27:24,520 --> 00:27:26,423
(person 5) In Estonian, for example--

434
00:27:26,423 --> 00:27:28,844
I expect this is true
of other languages as well--

435
00:27:29,274 --> 00:27:34,203
for example, there is an official name
that is the actual legitimate translation,

436
00:27:34,203 --> 00:27:36,206
for example, into English,

437
00:27:36,206 --> 00:27:40,314
of, say, a specific kind of municipality.

438
00:27:40,614 --> 00:27:42,182
That was my use case, for example,

439
00:27:42,182 --> 00:27:44,409
where we were using the word "parish"

440
00:27:45,159 --> 00:27:50,885
and the original Estonian word
meant kind of like church parish,

441
00:27:50,885 --> 00:27:51,899
and that was the origin,

442
00:27:51,899 --> 00:27:54,809
but that's not the official translation
Estonia gets right now.

443
00:27:55,189 --> 00:27:58,993
In this case, I would just add it
as official name statements

444
00:27:58,993 --> 00:28:00,817
and add the reference there.

445
00:28:02,032 --> 00:28:03,158
(person 5) Okay.

446
00:28:05,186 --> 00:28:06,572
More questions, yes?

447
00:28:07,682 --> 00:28:10,044
(person 6) I have two quick comments.

448
00:28:10,044 --> 00:28:13,934
You specifically called out Asturian
as a language that does well,

449
00:28:13,934 --> 00:28:16,455
and I think that's a false artifact.

450
00:28:16,455 --> 00:28:17,724
Tell me about it.

451
00:28:17,724 --> 00:28:19,748
(person 6) I think it's just a bot

452
00:28:19,748 --> 00:28:24,068
that pasted person names,
like proper names,

453
00:28:24,068 --> 00:28:27,172
and said, "Well, this is exactly
like in French or Spanish,"

454
00:28:27,172 --> 00:28:28,558
and just massively copied it.

455
00:28:28,558 --> 00:28:33,316
One point of evidence is that
you don't see that energy in Asturian

456
00:28:33,316 --> 00:28:37,205
in things that actually
require translation, like property names,

457
00:28:37,205 --> 00:28:39,648
or names of items
that are not proper names.

458
00:28:39,648 --> 00:28:41,219
Asaf, you break my heart.

459
00:28:41,219 --> 00:28:43,198
(person 6) I know,
I like raining on parades,

460
00:28:43,198 --> 00:28:48,458
but I have good news as well,
which is about the pronunciation numbers.

461
00:28:49,408 --> 00:28:53,515
As you probably know,
Commons is full of pronunciation files,

462
00:28:53,515 --> 00:28:54,668
and, for example,

463
00:28:54,668 --> 00:29:01,102
Dutch has no less than 300,000
pronunciation files already on Commons

464
00:29:01,912 --> 00:29:05,051
that just need to somehow be ingested.

465
00:29:05,051 --> 00:29:07,697
So if anyone's looking for a side project,

466
00:29:07,697 --> 00:29:08,997
there's tons and tons

467
00:29:08,997 --> 00:29:13,280
of classified, categorized
pronunciation files on Commons

468
00:29:13,280 --> 00:29:16,893
under the category
"Pronunciation" by language.

469
00:29:16,893 --> 00:29:22,840
So that's just waiting to be matched
to lexemes and put on Lexeme.

470
00:29:23,180 --> 00:29:25,484
And I was wondering
if you could say something

471
00:29:25,484 --> 00:29:26,585
about the road map,

472
00:29:26,585 --> 00:29:28,757
something about how much investment

473
00:29:28,757 --> 00:29:31,995
or what can we expect
from Lexeme in the coming year,

474
00:29:31,995 --> 00:29:34,020
because I, for one, can't wait.

475
00:29:34,949 --> 00:29:37,044
You can't wait? (chuckles)

476
00:29:37,044 --> 00:29:39,118
- (person 6) For more.
- Yes. (chuckles)

477
00:29:44,541 --> 00:29:49,523
Right now, we're concentrating
more on Wikibase and data quality

478
00:29:51,493 --> 00:29:55,087
to see how much traction this gets

479
00:29:55,087 --> 00:30:01,676
and then getting more feedback on
where the pain points are next,

480
00:30:01,676 --> 00:30:06,003
and then going back to improving
lexicographical data further.

481
00:30:06,903 --> 00:30:09,790
And one of the things
I'd love to hear from you

482
00:30:09,790 --> 00:30:14,136
is where exactly do you see
the next steps,

483
00:30:14,136 --> 00:30:15,966
where do you want to see improvements

484
00:30:15,966 --> 00:30:20,340
so that we can then figure out
how to make that happen.

485
00:30:21,125 --> 00:30:22,810
But, of course, you're right,

486
00:30:22,810 --> 00:30:25,712
there's still so much to do
also on the technical side.

487
00:30:30,573 --> 00:30:35,848
(person 7) Okay, as we were uploading
the Basque words with forms,

488
00:30:35,848 --> 00:30:37,768
and you'll see some
of these kinds of things,

489
00:30:37,768 --> 00:30:41,329
we were both like, last week we said,
"Oh, we are the first one in something."

490
00:30:42,919 --> 00:30:44,928
It appears in press, and it's like,

491
00:30:44,928 --> 00:30:49,488
"Oh, Basque are the first time in some--
they are the first in something, okay."

492
00:30:49,488 --> 00:30:50,606
(laughs)

493
00:30:50,606 --> 00:30:53,318
And then people ask,
"Okay, but what is this for?"

494
00:30:54,678 --> 00:30:56,849
We don't have a real good answer.

495
00:30:56,849 --> 00:30:57,888
I mean it's like, okay,

496
00:30:57,888 --> 00:31:01,841
this will help computers
to understand more our language, yes,

497
00:31:01,841 --> 00:31:05,279
but what kind of tools
can we make in the future?

498
00:31:05,279 --> 00:31:07,467
And we don't have a good answer for this.

499
00:31:07,467 --> 00:31:10,625
So I don't know
if you have a good answer for this.

500
00:31:10,625 --> 00:31:12,742
(chuckles) I don't know
if I have a good answer,

501
00:31:12,742 --> 00:31:14,746
but I have an answer.

502
00:31:15,480 --> 00:31:20,425
So I think right now
as I was telling [inaudible],

503
00:31:20,425 --> 00:31:21,924
we haven't reached that critical mass

504
00:31:21,924 --> 00:31:25,529
where you can build a lot
of the really interesting tools.

505
00:31:25,529 --> 00:31:27,707
But there are already some tools.

506
00:31:28,267 --> 00:31:31,912
Just the other day,
Esther [Pandelia], for example,

507
00:31:31,912 --> 00:31:33,817
released a tool where you can see,

508
00:31:35,837 --> 00:31:38,889
I think it was the words on a globe

509
00:31:38,889 --> 00:31:41,901
where they're spoken,
where they're coming from.

510
00:31:42,631 --> 00:31:44,090
I'm probably wrong about this,

511
00:31:44,090 --> 00:31:46,346
but she had answered
on the Project chat on Wikidata--

512
00:31:46,346 --> 00:31:48,984
you can look it up there.

513
00:31:49,574 --> 00:31:51,805
So we have seen these first tools,

514
00:31:51,805 --> 00:31:55,696
just like we've seen
back when Wikidata started.

515
00:31:56,846 --> 00:31:59,602
First some--like just a network,

516
00:31:59,602 --> 00:32:03,424
and like, "Hey, look, there's this thing
that connects to this other thing."

517
00:32:04,824 --> 00:32:07,059
And as we have more data,

518
00:32:07,059 --> 00:32:10,352
and as we've reached some critical mass,

519
00:32:11,852 --> 00:32:14,747
more powerful applications
become possible,

520
00:32:15,677 --> 00:32:17,516
things like Histropedia,

521
00:32:19,126 --> 00:32:21,988
things like question and answering

522
00:32:21,988 --> 00:32:26,663
in your digital personal assistant,
Platypus, and so on.

523
00:32:26,663 --> 00:32:29,668
And we're seeing
a similar thing with lexemes.

524
00:32:31,198 --> 00:32:34,650
We're at the stage
where you can build like these little,

525
00:32:34,650 --> 00:32:37,464
hey, look, there's a connection
between the two things,

526
00:32:37,864 --> 00:32:42,738
and there's a translation
of this word into that language stage,

527
00:32:42,738 --> 00:32:47,747
and as we build it out
and as we describe more words,

528
00:32:47,747 --> 00:32:49,533
more becomes possible.

529
00:32:49,533 --> 00:32:51,795
Now, what becomes possible?

530
00:32:53,482 --> 00:32:59,483
As Ben, our keynote speaker earlier
was talking about translations,

531
00:33:00,103 --> 00:33:03,455
being able to translate
from one language to another.

532
00:33:03,455 --> 00:33:07,929
And Jens, my colleague,
he's always talking about

533
00:33:07,929 --> 00:33:11,452
the European Union
looking for a translator

534
00:33:11,452 --> 00:33:17,439
who can translate from
I think it was Maltese to Swedish--

535
00:33:17,439 --> 00:33:19,436
- (person 8) Estonian.
- Estonian.

536
00:33:22,016 --> 00:33:26,211
And that is not a usual combination.

537
00:33:27,211 --> 00:33:31,735
But once you have all these languages
in one machine-readable place,

538
00:33:31,735 --> 00:33:33,143
you can do that,

539
00:33:33,143 --> 00:33:36,857
you can get a dictionary

540
00:33:36,857 --> 00:33:41,735
from Estonian to Maltese and back.

541
00:33:42,935 --> 00:33:45,607
So covering language
combinations in dictionaries

542
00:33:45,607 --> 00:33:47,911
that just haven't been covered before

543
00:33:47,911 --> 00:33:51,050
because there wasn't
enough demand for it, for example,

544
00:33:51,050 --> 00:33:55,540
to make it financially viable
and to justify the work.

545
00:33:55,540 --> 00:33:57,147
Now we can do that.

546
00:33:59,797 --> 00:34:02,318
Then text generation.

547
00:34:02,318 --> 00:34:03,653
Lucie was earlier talking

548
00:34:03,653 --> 00:34:10,136
about how she's working
with Hady on generating text

549
00:34:10,136 --> 00:34:14,673
to get Wikipedia articles
in minority languages started,

550
00:34:15,423 --> 00:34:19,512
and that needs data about words,

551
00:34:19,512 --> 00:34:22,589
and you need to understand
the language to do that.

552
00:34:23,769 --> 00:34:28,133
Yeah, and those are just some
that come to my mind right now.

553
00:34:28,693 --> 00:34:30,494
Maybe our audience has more ideas

554
00:34:30,494 --> 00:34:34,353
what they want to do
when we have all the glorious data.

555
00:34:37,693 --> 00:34:40,892
(person 9) Okay, I will deviate
from the lexemes topic.

556
00:34:40,892 --> 00:34:42,666
I will ask the question,

557
00:34:42,666 --> 00:34:45,634
how can I as a member of community

558
00:34:45,634 --> 00:34:50,135
influence that priority is put on task,

559
00:34:50,135 --> 00:34:56,644
that a new user comes, and he can indicate
what languages he wants to see and edit

560
00:34:56,644 --> 00:35:01,135
without some secret Babel
template knowledge.

561
00:35:02,145 --> 00:35:05,053
Maybe there will be this year
this technical wish list

562
00:35:05,053 --> 00:35:07,040
without Wikipedia topics.

563
00:35:07,040 --> 00:35:10,119
Maybe there's a hope
we can all vote about

564
00:35:10,119 --> 00:35:14,218
this thing we didn't fix for seven years.

565
00:35:14,218 --> 00:35:17,607
So do you have any ideas
and comments about this?

566
00:35:18,217 --> 00:35:20,328
So you're talking about the fact

567
00:35:20,328 --> 00:35:23,518
that someone who is
not logged into Wikidata

568
00:35:23,518 --> 00:35:25,971
can't change their language easily?

569
00:35:25,971 --> 00:35:27,839
(person 9) No, for [inaudible] users.

570
00:35:28,309 --> 00:35:30,689
So, if they are logged in,

571
00:35:30,689 --> 00:35:34,871
they can just change their language
at the top of the page,

572
00:35:35,891 --> 00:35:38,099
and then it will appear

573
00:35:39,769 --> 00:35:42,013
where the labels' description
[inaudible] are,

574
00:35:42,013 --> 00:35:43,483
and they can edit it.

575
00:35:45,657 --> 00:35:49,009
(person 9) Well, actually, usually
many times the workflow

576
00:35:49,009 --> 00:35:52,447
is that if you want to have
multiple languages, they are available,

577
00:35:52,447 --> 00:35:55,419
and it's not always the case.

578
00:35:55,419 --> 00:35:58,584
Okay, maybe we should sit down
after this talk and you show me.

579
00:36:01,562 --> 00:36:04,089
Cool. More questions?

580
00:36:05,534 --> 00:36:06,536
Yes.

581
00:36:11,595 --> 00:36:13,196
(person 10) Thanks for the presentation.

582
00:36:14,106 --> 00:36:15,127
Can you comment

583
00:36:15,127 --> 00:36:19,307
on the state of the collaboration
with the Wiktionary community?

584
00:36:19,307 --> 00:36:22,296
As far as I've seen,
there were some discussions

585
00:36:22,296 --> 00:36:26,051
about importing some elements of the work,

586
00:36:26,051 --> 00:36:30,843
but there seems to be licensing issues
and some disagreements, et cetera.

587
00:36:30,843 --> 00:36:31,848
Right.

588
00:36:31,848 --> 00:36:36,330
So, Wiktionary communities
have spent a lot of time

589
00:36:37,320 --> 00:36:39,473
building Wiktionary.

590
00:36:39,473 --> 00:36:42,643
They have built

591
00:36:43,193 --> 00:36:47,554
amazingly complicated
and complex templates

592
00:36:47,554 --> 00:36:53,614
to build pretty tables
that automatically generate forms for you

593
00:36:53,614 --> 00:36:56,392
and all kinds of really impressive,

594
00:36:56,392 --> 00:37:00,683
and kind of crazy stuff,
if you think about it.

595
00:37:02,311 --> 00:37:07,994
And, of course, they have invested
a lot of time and effort into that.

596
00:37:09,364 --> 00:37:11,801
And understandably,

597
00:37:11,801 --> 00:37:17,116
they don't just want that to be grabbed,

598
00:37:18,046 --> 00:37:19,102
just like that.

599
00:37:19,102 --> 00:37:21,791
So there's some of that coming from there.

600
00:37:22,761 --> 00:37:25,137
And that's fine, that's okay.

601
00:37:25,737 --> 00:37:32,092
Now, the first Wiktionary communities
are talking about turning out

602
00:37:32,092 --> 00:37:34,329
and importing some
of their data into Wikidata.

603
00:37:34,329 --> 00:37:39,095
Russian, you have seen,
for example, is one of those cases.

604
00:37:40,375 --> 00:37:42,355
And I expect more of that to happen.

605
00:37:43,635 --> 00:37:46,800
But it will be a slow process,

606
00:37:46,800 --> 00:37:49,383
just like adoption
of Wikidata's data on Wikipedia

607
00:37:49,383 --> 00:37:51,909
has been a rather slow process.

608
00:37:52,849 --> 00:37:56,183
On the other side
of making it actually easier

609
00:37:56,183 --> 00:37:59,132
to use the data that is in lexemes,

610
00:37:59,132 --> 00:38:02,209
on Wiktionary, so that
they can make use of that

611
00:38:02,209 --> 00:38:05,531
and share data between
the language Wiktionaries

612
00:38:05,531 --> 00:38:08,853
which is super hard
to impossible right now,

613
00:38:08,853 --> 00:38:11,560
which is crazy,
just like it was on Wikipedia.

614
00:38:13,860 --> 00:38:16,325
Wait for the birthday present. (chuckles)

615
00:38:20,038 --> 00:38:21,182
Yes.

616
00:38:22,599 --> 00:38:24,827
(person 11) When I was thinking
the other way around it,

617
00:38:24,827 --> 00:38:28,168
I actually didn't want to say it
because I think this will be super silly,

618
00:38:28,168 --> 00:38:32,003
but I think that Wiktionary
already has some content,

619
00:38:32,003 --> 00:38:34,978
and I know that
we can't transfer it to Wikidata

620
00:38:34,978 --> 00:38:37,048
because there's a difference in licenses.

621
00:38:37,048 --> 00:38:39,631
But I was thinking maybe
we can do something about that.

622
00:38:40,321 --> 00:38:45,913
Maybe, I don't know, we can obtain
the communities' permission

623
00:38:45,913 --> 00:38:51,205
after like, I don't know,
having like a public voting

624
00:38:52,075 --> 00:38:55,642
and for the community,
the active members of the community

625
00:38:55,642 --> 00:39:02,523
to vote and say if they would like
or accept or to transfer the content

626
00:39:02,523 --> 00:39:05,528
for which they may do
the Wikidata lexemes.

627
00:39:06,238 --> 00:39:08,537
Because I just think it is such a waste.

628
00:39:09,568 --> 00:39:14,443
So, that's definitely
a conversation those people

629
00:39:14,443 --> 00:39:18,249
who are in Wiktionary communities
are very welcome to bring up there.

630
00:39:18,249 --> 00:39:24,647
I think it would be a bit presumptuous
for us to go and force that.

631
00:39:25,917 --> 00:39:31,142
But, yeah, I think it's definitely worth
having a conversation.

632
00:39:31,142 --> 00:39:33,898
But I think it's also important
to understand

633
00:39:33,898 --> 00:39:39,082
that there's a distinction between
what is actually legally allowed

634
00:39:39,082 --> 00:39:43,147
and what we should be doing

635
00:39:43,147 --> 00:39:45,426
and what those people want or do not want.

636
00:39:45,736 --> 00:39:47,329
So even if it's legally allowed,

637
00:39:47,329 --> 00:39:50,640
if some other Wiktionary communities
do not want that,

638
00:39:50,640 --> 00:39:53,537
I would be careful, at least.

639
00:39:58,886 --> 00:40:02,489
I think you need the mic
for the stream.

640
00:40:04,540 --> 00:40:07,299
(person 12) So, obviously,
it's all very exciting,

641
00:40:07,979 --> 00:40:12,319
and I immediately think
how can I take that to my students

642
00:40:12,319 --> 00:40:15,558
and how can I incorporate it
with the courses,

643
00:40:15,558 --> 00:40:18,531
the work that we're doing,
educational settings.

644
00:40:18,531 --> 00:40:22,271
And I don't have, at the moment,

645
00:40:22,871 --> 00:40:24,116
first of all, enough knowledge,

646
00:40:24,116 --> 00:40:27,278
but I think the documentation
that we do have

647
00:40:27,808 --> 00:40:30,082
could be maybe improved.

648
00:40:30,082 --> 00:40:33,437
So that's a kind of request
to make cool videos

649
00:40:33,437 --> 00:40:35,898
that explain how it works

650
00:40:35,898 --> 00:40:39,948
because if we have it, we can then use it,

651
00:40:39,948 --> 00:40:41,985
and we can have students on board,

652
00:40:41,985 --> 00:40:47,072
and we can make people understand
how awesome it all is.

653
00:40:47,072 --> 00:40:52,001
And yeah, just think about documentation
and think about education, please.

654
00:40:52,001 --> 00:40:54,480
Because I think a lot could be done.

655
00:40:54,480 --> 00:40:58,585
These are like many tasks
that could be done even with...

656
00:41:00,125 --> 00:41:02,033
well, I wouldn't say primary schools,

657
00:41:02,033 --> 00:41:05,495
but certainly, even younger students.

658
00:41:05,915 --> 00:41:10,866
And so I would really like to see
that potential being tapped into,

659
00:41:10,866 --> 00:41:15,272
and, as of now, I personally
don't understand enough

660
00:41:15,272 --> 00:41:19,500
to be able to create tasks
or to create like...

661
00:41:20,430 --> 00:41:22,155
to do something practical with it.

662
00:41:22,155 --> 00:41:25,772
So any help, any thoughts
anyone here has about that,

663
00:41:25,772 --> 00:41:29,648
I would be very happy to hear
your thoughts, and yours as well.

664
00:41:30,508 --> 00:41:32,129
Yeah, let's talk about that.

665
00:41:35,473 --> 00:41:37,139
More questions?

666
00:41:37,809 --> 00:41:39,195
Someone else raised a hand.

667
00:41:39,195 --> 00:41:40,495
I forgot where it was.

668
00:41:45,739 --> 00:41:49,996
(person 13) So, if we can't import
from Wiktionary,

669
00:41:49,996 --> 00:41:55,772
is there some concerted effort
to find other public domain sources,

670
00:41:55,772 --> 00:41:57,459
maybe all the data,

671
00:41:58,769 --> 00:42:03,167
and kind of prefilter it, organize it

672
00:42:03,167 --> 00:42:08,470
so that it's easy to be checked
by people for import?

673
00:42:09,093 --> 00:42:11,181
So there are first efforts.

674
00:42:11,181 --> 00:42:14,769
My understanding is that Basque
is one of those efforts.

675
00:42:14,769 --> 00:42:17,474
Maybe you want to say
a bit more about it?

676
00:42:18,426 --> 00:42:20,130
(person 14) [inaudible]

677
00:42:23,166 --> 00:42:27,148
Okay, the actual answer
is paying for that...

678
00:42:28,374 --> 00:42:33,381
I mean, we have an agreement
with a contractor we usually work with.

679
00:42:34,801 --> 00:42:38,725
They do dictionaries--

680
00:42:40,315 --> 00:42:42,458
lots of stuff, but they do dictionaries.

681
00:42:42,458 --> 00:42:47,473
So we agreed with them
to make free the students' dictionary,

682
00:42:47,473 --> 00:42:52,782
we would [cast] the most common words
and start uploading it

683
00:42:52,782 --> 00:42:55,590
with an external identifier
and the scheme of things.

684
00:42:56,420 --> 00:43:02,902
But there was some discussion
about leaving it on CC0

685
00:43:03,212 --> 00:43:05,322
because they have
the dictionary with CC BY,

686
00:43:06,537 --> 00:43:10,326
and they understood
what the difference was.

687
00:43:10,326 --> 00:43:13,866
So there was some discussion.

688
00:43:13,866 --> 00:43:19,709
But I think that we can provide some tools
or some examples in the future,

689
00:43:19,709 --> 00:43:21,761
and I think that there will be
other dictionaries

690
00:43:21,761 --> 00:43:24,016
that we can handle,

691
00:43:24,016 --> 00:43:29,274
and also I think Wiktionary
should start moving in that direction,

692
00:43:29,274 --> 00:43:32,260
but that's another great discussion.

693
00:43:33,285 --> 00:43:34,487
And on top of that,

694
00:43:34,487 --> 00:43:38,839
Lea is also in contact
with people from Occitan

695
00:43:38,839 --> 00:43:41,827
who work on Occitan dictionaries,

696
00:43:41,827 --> 00:43:45,138
and they're currently working
on a similar collaboration.

697
00:43:51,644 --> 00:43:53,363
More questions?

698
00:44:01,487 --> 00:44:05,349
(person 15) Hi! We are the people
who want to import Occitan data.

699
00:44:05,349 --> 00:44:06,585
Aha! Perfect!

700
00:44:06,585 --> 00:44:08,368
(person 15) And we have a small problem.

701
00:44:09,188 --> 00:44:14,215
We don't know how to represent
the variety of all lexemes.

702
00:44:14,215 --> 00:44:17,893
We have six dialects,

703
00:44:17,893 --> 00:44:24,014
and we want to indicate for Lexeme
in which dialect it's used,

704
00:44:24,014 --> 00:44:27,285
and we don't have a proper
C0 statement to do that.

705
00:44:27,285 --> 00:44:31,105
So as long as the statement doesn't exist,

706
00:44:31,635 --> 00:44:34,465
it prevents us from [inaudible]

707
00:44:34,465 --> 00:44:37,603
because we will need to do it again

708
00:44:37,603 --> 00:44:42,076
when we will be able
to [export] the statement.

709
00:44:42,076 --> 00:44:44,551
And it's complicated
because it's a statement

710
00:44:44,551 --> 00:44:47,802
which won't be asked by many people

711
00:44:47,802 --> 00:44:53,444
because it's a statement
which concerns mostly minority languages.

712
00:44:53,444 --> 00:44:56,933
So you will have one person to ask this.

713
00:44:56,933 --> 00:45:00,022
But as with our Basque colleagues,

714
00:45:00,022 --> 00:45:06,082
it can be one person
who will power thousands of others,

715
00:45:06,082 --> 00:45:10,884
so it might not be asking a lot,

716
00:45:10,884 --> 00:45:14,136
but it will be very important for us.

717
00:45:14,874 --> 00:45:17,600
Do you already have
a new property proposal up,

718
00:45:17,600 --> 00:45:19,470
or do you need help creating it?

719
00:45:21,524 --> 00:45:24,300
(person 15) We asked four months ago.

720
00:45:24,720 --> 00:45:28,755
Alright, then let's get some people
to help out with this property proposal.

721
00:45:30,159 --> 00:45:33,092
I'm sure there are enough people
in this room to make this happen.

722
00:45:33,360 --> 00:45:35,452
(person 15) Property proposal
[speaking in French].

723
00:45:35,452 --> 00:45:36,965
(person 16) We didn't have an answer.

724
00:45:36,965 --> 00:45:39,769
(person 15) We didn't have any answer,
and we don't know how to do this

725
00:45:39,769 --> 00:45:42,953
because we aren't
in the Wikidata community.

726
00:45:44,694 --> 00:45:48,817
Yup, so there are people here
who can help you.

727
00:45:48,817 --> 00:45:52,134
Maybe someone raises their hand to take--

728
00:45:52,574 --> 00:45:53,644
(person 14) I'm for that.

729
00:45:53,644 --> 00:45:55,512
But I think this is quite interesting

730
00:45:55,512 --> 00:45:59,059
that only the variant of form

731
00:45:59,059 --> 00:46:02,607
also can handle it geographically,

732
00:46:02,607 --> 00:46:04,995
with coordinates or some kind of mapping.

733
00:46:05,595 --> 00:46:07,815
Also having different pronunciations,

734
00:46:07,815 --> 00:46:11,837
and I think this is something
that happens in lots of languages.

735
00:46:12,607 --> 00:46:16,262
We should start making
it happen [inaudible],

736
00:46:16,262 --> 00:46:18,865
and I'm going to search for the property.

737
00:46:19,782 --> 00:46:20,933
Cool.

738
00:46:20,933 --> 00:46:24,446
So you will get backing
for your property proposal.

739
00:46:26,136 --> 00:46:27,297
Thank you.

740
00:46:28,153 --> 00:46:30,261
Alright, more questions?

741
00:46:32,410 --> 00:46:33,474
Finn.

742
00:46:33,974 --> 00:46:35,055
Finn is one of those people

743
00:46:35,055 --> 00:46:38,031
who builds stuff
on top of lexicographical data.

744
00:46:38,031 --> 00:46:40,085
(Finn) It's just a small question,

745
00:46:40,405 --> 00:46:44,226
and that's about spelling variations.

746
00:46:44,896 --> 00:46:48,002
It seems to be difficult to put them in...

747
00:46:48,532 --> 00:46:53,368
You could, of course,
have multiple forms for the same word.

748
00:46:56,327 --> 00:46:58,448
I don't know, it seems to be...

749
00:46:59,558 --> 00:47:03,535
If you don't do it that way,
it seems to be difficult to specify...

750
00:47:04,771 --> 00:47:05,888
or I don't know whether

751
00:47:05,888 --> 00:47:09,731
this is just a minor technical issue
or whether...

752
00:47:09,731 --> 00:47:11,252
Let's look at it together.

753
00:47:11,642 --> 00:47:15,230
I would love to see an example.

754
00:47:17,478 --> 00:47:18,478
Asaf.

755
00:47:26,886 --> 00:47:28,396
(Asaf) Thank you.

756
00:47:29,386 --> 00:47:33,685
I can give a very concrete example
from my mother tongue, Hebrew.

757
00:47:34,205 --> 00:47:38,845
Hebrew has two main variants

758
00:47:38,845 --> 00:47:42,786
for expressing almost every word

759
00:47:42,786 --> 00:47:47,640
because the traditional spelling

760
00:47:47,640 --> 00:47:50,044
leaves out many of the vowels.

761
00:47:50,934 --> 00:47:55,207
And, therefore, in modern editions
of the Bible and of poetry,

762
00:47:55,207 --> 00:47:57,461
diacritics are used.

763
00:47:57,461 --> 00:48:02,670
However, those diacritics
are never used for modern prose

764
00:48:02,670 --> 00:48:05,974
or newspaper writing or street signs.

765
00:48:05,974 --> 00:48:11,209
So the average daily casual use
puts in extra vowels

766
00:48:12,169 --> 00:48:13,519
and doesn't use the diacritics

767
00:48:13,519 --> 00:48:15,607
because they are,
of course, more cumbersome

768
00:48:15,607 --> 00:48:17,893
and have all kinds of rules
and nobody knows the rules.

769
00:48:18,633 --> 00:48:20,531
So there are basically two variants.

770
00:48:20,531 --> 00:48:25,322
There's the everyday casual prose variant,

771
00:48:25,322 --> 00:48:27,827
and there's the Bible or poetry,

772
00:48:27,827 --> 00:48:32,200
which always come
in this traditional diacriticized text.

773
00:48:32,200 --> 00:48:33,302
To be useful,

774
00:48:33,302 --> 00:48:37,428
Lexeme would have to recognize
both varieties of every single word

775
00:48:37,428 --> 00:48:39,747
and every single form
of every single word.

776
00:48:40,677 --> 00:48:43,391
So that's a very comprehensive use case

777
00:48:43,391 --> 00:48:46,340
for official stable variants.

778
00:48:46,340 --> 00:48:48,942
It's not dialect, it's not regions,

779
00:48:49,332 --> 00:48:53,627
it's basically two coexisting
morphological systems.

780
00:48:54,537 --> 00:48:58,926
And I too don't know exactly
how to express that in Lexeme today,

781
00:48:58,926 --> 00:49:02,800
which is one thing that is keeping me
in partial answer to Magnus' question

782
00:49:02,800 --> 00:49:05,238
from uploading the parts that are ready

783
00:49:05,238 --> 00:49:09,394
from the biggest Hebrew dictionary,
which is public domain

784
00:49:09,394 --> 00:49:13,141
and which I have been digitizing
for several years now.

785
00:49:13,141 --> 00:49:14,803
A good portion of it is ready,

786
00:49:14,803 --> 00:49:16,549
but I'm not putting it on Lexeme right now

787
00:49:16,549 --> 00:49:20,245
because I don't know exactly
how to solve this problem.

788
00:49:20,245 --> 00:49:23,387
Alright, let's solve
this problem here. (chuckles)

789
00:49:24,503 --> 00:49:26,021
That has to be possible.

790
00:49:30,045 --> 00:49:32,047
Alright, more questions?

791
00:49:37,173 --> 00:49:39,735
If not, then thank you so much.

792
00:49:40,605 --> 00:49:42,675
(applause)