1
00:00:05,888 --> 00:00:09,312
Now, there are approximately
7,500 languages

2
00:00:09,312 --> 00:00:10,806
spoken on the planet today.

3
00:00:11,770 --> 00:00:13,808
Of those, it's estimated

4
00:00:13,808 --> 00:00:18,466
that about 70%
are at risk of not surviving

5
00:00:18,466 --> 00:00:20,355
the end of the 21st century.

6
00:00:22,270 --> 00:00:24,266
Every time a language dies,

7
00:00:24,711 --> 00:00:26,622
it's severing a connection

8
00:00:26,622 --> 00:00:30,590
that has lasted for hundreds
to thousands of years,

9
00:00:30,590 --> 00:00:34,816
to culture, to history,

10
00:00:35,320 --> 00:00:38,150
and to traditions, and to knowledge.

11
00:00:38,933 --> 00:00:42,250
The linguist Kenneth Hale once said

12
00:00:42,250 --> 00:00:44,183
that every time a language dies,

13
00:00:44,183 --> 00:00:46,794
it's like dropping
an atom bomb on the Louvre.

14
00:00:49,377 --> 00:00:51,844
So the question is,

15
00:00:52,730 --> 00:00:54,800
why do languages die?

16
00:00:56,244 --> 00:01:00,155
Well, perhaps the simple answer might be

17
00:01:00,162 --> 00:01:03,051
that one could imagine 
authoritarian governments

18
00:01:03,051 --> 00:01:05,311
preventing people from speaking
their native language,

19
00:01:05,844 --> 00:01:09,630
children being punished
for speaking their language at school,

20
00:01:09,866 --> 00:01:12,911
or the government
shutting down radio stations

21
00:01:12,923 --> 00:01:14,644
in the minority language.

22
00:01:15,044 --> 00:01:16,977
And this definitely happened in the past,

23
00:01:16,977 --> 00:01:19,088
and it still, to some extent,
happens today.

24
00:01:19,616 --> 00:01:23,026
But the honest answer

25
00:01:23,026 --> 00:01:26,666
is that for the vast majority
of the cases of language extinction,

26
00:01:27,296 --> 00:01:29,336
it's a much simpler

27
00:01:29,336 --> 00:01:32,555
and a much more easy-to-explain answer.

28
00:01:33,696 --> 00:01:36,222
The languages go extinct

29
00:01:36,222 --> 00:01:37,888
because they are not passed down

30
00:01:37,888 --> 00:01:39,733
from one generation to the next.

31
00:01:42,280 --> 00:01:43,866
Every single time a person who speaks

32
00:01:43,866 --> 00:01:46,088
a minority language has a child,

33
00:01:46,752 --> 00:01:50,355
they go through a calculus.

34
00:01:51,360 --> 00:01:52,800
They ask themselves,

35
00:01:53,660 --> 00:01:56,288
"Do I pass my language down to my child,

36
00:01:56,770 --> 00:02:01,311
or do I instead teach them
only the majority language?"

37
00:02:01,311 --> 00:02:03,222
Essentially, there is a scale that goes on

38
00:02:03,900 --> 00:02:05,844
that they access in their heads,

39
00:02:06,720 --> 00:02:08,355
in which on one side

40
00:02:09,530 --> 00:02:11,733
every single time in their lives

41
00:02:11,737 --> 00:02:14,222
that they've had an opportunity
to use their native language

42
00:02:14,866 --> 00:02:18,490
for communication,
for access to traditional culture,

43
00:02:19,776 --> 00:02:21,748
a stone is placed on the left side.

44
00:02:22,228 --> 00:02:23,840
And every time that they find themselves

45
00:02:23,840 --> 00:02:25,755
unable to use their native language,

46
00:02:25,770 --> 00:02:27,955
and instead have to rely on
the majority language,

47
00:02:27,958 --> 00:02:30,066
a stone is placed on the right side.

48
00:02:31,822 --> 00:02:34,800
Now, due to the strength and the dignity

49
00:02:34,800 --> 00:02:36,600
of being able to speak 
one's mother tongue,

50
00:02:36,600 --> 00:02:38,720
the stones on the left
tend to be a bit heavier.

51
00:02:38,720 --> 00:02:42,048
But with enough stones on the right side,

52
00:02:42,560 --> 00:02:44,600
then eventually the scale tips,

53
00:02:44,600 --> 00:02:47,111
and then when a person makes the decision

54
00:02:47,111 --> 00:02:49,150
to pass their language down,

55
00:02:49,160 --> 00:02:50,622
they see their own language

56
00:02:50,622 --> 00:02:52,620
as more of a burden than a blessing.

57
00:02:55,200 --> 00:02:58,676
So the question is,
how do we reverse this?

58
00:02:59,450 --> 00:03:01,777
First, we need to think 
about the fact that,

59
00:03:03,511 --> 00:03:04,968
for any given language,

60
00:03:04,970 --> 00:03:07,900
there are certain social spheres
that they can be used in.

61
00:03:07,900 --> 00:03:08,976
So any language

62
00:03:08,976 --> 00:03:10,800
that's a mother tongue spoken today

63
00:03:10,800 --> 00:03:12,990
can be used with one's family.

64
00:03:13,790 --> 00:03:16,671
A smaller set of languages 
can be used within one's community,

65
00:03:16,671 --> 00:03:18,660
a smaller set, maybe within one's region,

66
00:03:19,288 --> 00:03:22,155
and for a small handful of languages,

67
00:03:22,511 --> 00:03:24,488
they can be used
for international communication.

68
00:03:25,824 --> 00:03:28,640
And then even across these spheres,

69
00:03:28,640 --> 00:03:31,712
there's the question of can someone
use their language

70
00:03:31,712 --> 00:03:35,533
for the purpose of education or business,

71
00:03:35,911 --> 00:03:37,600
or in technology?

72
00:03:39,136 --> 00:03:41,952
So, to better explain

73
00:03:43,200 --> 00:03:44,530
what I'm talking about here,

74
00:03:44,530 --> 00:03:46,393
I would like to use an anecdote.

75
00:03:48,400 --> 00:03:50,400
Let's say that you are about to go

76
00:03:50,400 --> 00:03:52,280
on your dream vacation to India,

77
00:03:53,155 --> 00:03:56,032
and you have an eight-hour
layover in Istanbul.

78
00:03:57,312 --> 00:04:00,640
Now, you weren't necessarily
planning on visiting Turkey,

79
00:04:00,896 --> 00:04:04,266
but with your layover
and with a Turkish friend

80
00:04:04,266 --> 00:04:05,933
telling you about an amazing restaurant

81
00:04:05,933 --> 00:04:07,400
that's not too far from the airport,

82
00:04:07,800 --> 00:04:10,600
you say, "Hey, you know, 
maybe I'll stop by during my layover."

83
00:04:11,022 --> 00:04:12,920
So, you exit the airport,

84
00:04:13,950 --> 00:04:15,480
you get to your restaurant,

85
00:04:15,480 --> 00:04:17,020
and they hand you a menu,

86
00:04:17,020 --> 00:04:19,086
and the menu is entirely in Turkish.

87
00:04:20,170 --> 00:04:22,911
Now, let's say,
for the point of this exercise,

88
00:04:22,911 --> 00:04:24,377
that you don't speak Turkish.

89
00:04:25,210 --> 00:04:26,535
What do you do?

90
00:04:28,155 --> 00:04:29,744
Well, best-case scenario,

91
00:04:29,744 --> 00:04:32,177
you find someone perhaps
who can speak your native language,

92
00:04:32,383 --> 00:04:34,264
German, English, etc.

93
00:04:36,220 --> 00:04:37,997
But let's say it's not your lucky day

94
00:04:38,000 --> 00:04:41,066
and nobody in the restaurant can speak
any German or any English.

95
00:04:42,000 --> 00:04:43,377
So what do you do?

96
00:04:43,377 --> 00:04:45,995
Well, if you are like me,
and I imagine most of you,

97
00:04:45,995 --> 00:04:48,130
you've probably turned 
to a technological solution,

98
00:04:49,535 --> 00:04:52,351
machine translation 
or a digital dictionary,

99
00:04:52,607 --> 00:04:54,196
look up each word individually,

100
00:04:54,399 --> 00:04:57,733
and eventually order yourself 
a delicious Turkish meal.

101
00:04:59,970 --> 00:05:02,844
Now, let's imagine this scenario instead,

102
00:05:03,610 --> 00:05:06,400
in which you are the native speaker
of a minority language.

103
00:05:07,455 --> 00:05:09,333
Let's say, Lower Sorbian.

104
00:05:09,333 --> 00:05:11,000
Lower Sorbian is an endangered language

105
00:05:11,000 --> 00:05:12,488
spoken here in Germany,

106
00:05:12,488 --> 00:05:16,888
about 130 kilometers
to the southeast from here,

107
00:05:17,711 --> 00:05:20,857
that's spoken only by
a few thousand people, mostly elderly.

108
00:05:22,810 --> 00:05:25,111
Now, let's say your mother tongue
is Lower Sorbian.

109
00:05:25,370 --> 00:05:26,773
You end up in the restaurant.

110
00:05:26,773 --> 00:05:28,462
Now, of course, the odds
of finding someone

111
00:05:28,462 --> 00:05:31,387
who speaks your native language 
in the restaurant are extraordinarily low.

112
00:05:32,280 --> 00:05:36,412
But, again, you can just go
to a technological solution.

113
00:05:36,890 --> 00:05:39,333
However, for your native language,

114
00:05:39,333 --> 00:05:41,718
these technological solutions don't exist.

115
00:05:42,010 --> 00:05:44,991
You would have to rely on
German or English

116
00:05:44,991 --> 00:05:47,488
as your pivot language into Turkish.

117
00:05:48,920 --> 00:05:52,382
Now, of course, you still end up
getting your delicious Turkish meal,

118
00:05:52,382 --> 00:05:54,860
but you begin to think about
how difficult this would have been

119
00:05:54,860 --> 00:05:57,170
if you were your grandfather,
who spoke no German at all.

120
00:05:58,244 --> 00:05:59,840
Now, this is just a small incident,

121
00:05:59,844 --> 00:06:04,787
but it's going to place a stone
on the right side of that scale,

122
00:06:05,310 --> 00:06:07,053
and make you think perhaps

123
00:06:07,053 --> 00:06:09,898
maybe when I have children
or maybe when I have another child,

124
00:06:10,943 --> 00:06:14,726
the burden that you went through with this

125
00:06:14,726 --> 00:06:17,133
may not be worth it 
to keep your language.

126
00:06:19,391 --> 00:06:21,284
And imagine if this was a scenario

127
00:06:21,284 --> 00:06:26,177
that was of significantly more importance,

128
00:06:26,177 --> 00:06:28,380
such as, for example, being in a hospital.

129
00:06:31,133 --> 00:06:36,161
Now, this is the point 
in which we can help--

130
00:06:36,790 --> 00:06:40,242
by we, I mean me and you
in this room can help.

131
00:06:41,400 --> 00:06:43,355
We have the tools 
to be able to help this.

132
00:06:45,155 --> 00:06:47,355
If technological tools 
are available for people

133
00:06:47,355 --> 00:06:49,350
who speak minority 
and underserved languages,

134
00:06:50,555 --> 00:06:54,022
it puts a little finger on the scale,
on the left side of the scale.

135
00:06:54,022 --> 00:06:55,776
Someone doesn't necessarily have to think

136
00:06:55,776 --> 00:06:57,680
that they have to rely on
a majority language

137
00:06:57,680 --> 00:06:59,488
in order to interact 
with the outside world,

138
00:07:00,351 --> 00:07:05,111
because it opens the social spheres

139
00:07:05,111 --> 00:07:06,328
a little bit more.

140
00:07:07,910 --> 00:07:10,333
So, of course, the ideal solution

141
00:07:10,333 --> 00:07:13,022
is that we have machine translation
in every language in the world.

142
00:07:13,022 --> 00:07:16,831
But, unfortunately,
that's just not feasible.

143
00:07:16,831 --> 00:07:19,800
Machine translation 
requires large corpuses of text,

144
00:07:19,800 --> 00:07:21,088
and for many of these languages

145
00:07:21,088 --> 00:07:23,080
that are endangered or underserved,

146
00:07:23,391 --> 00:07:25,439
such data is simply not available.

147
00:07:26,309 --> 00:07:28,279
Some of them aren't even commonly written

148
00:07:29,000 --> 00:07:32,825
and thus getting enough data
to make a machine translation engine

149
00:07:32,825 --> 00:07:34,390
is unlikely.

150
00:07:34,390 --> 00:07:38,060
But what is available is lexical data.

151
00:07:40,244 --> 00:07:43,444
Through the work of many linguists

152
00:07:43,444 --> 00:07:45,440
over the past few hundred years,

153
00:07:47,777 --> 00:07:49,728
dictionaries and grammars
have been produced

154
00:07:49,728 --> 00:07:51,680
for most of the world's languages.

155
00:07:53,920 --> 00:07:56,511
But, unfortunately, most of these works

156
00:07:56,511 --> 00:08:00,644
are not accessible
or available to the world,

157
00:08:00,647 --> 00:08:03,533
let alone to speakers 
of these minority languages.

158
00:08:04,522 --> 00:08:06,377
And it's not an intentional process,

159
00:08:06,377 --> 00:08:07,910
a lot of times it's simply because

160
00:08:07,910 --> 00:08:10,785
the initial print run
of these dictionaries was small,

161
00:08:11,155 --> 00:08:12,543
and the only copies

162
00:08:12,543 --> 00:08:16,244
are moldering away
in a university library somewhere.

163
00:08:17,511 --> 00:08:21,333
But we have the ability to take that data

164
00:08:21,333 --> 00:08:23,330
and make it accessible to the world.

165
00:08:24,133 --> 00:08:28,377
The Wikimedia Foundation
is one of the best organizations,

166
00:08:28,377 --> 00:08:30,555
I would say <i>the</i> best 
organization in the world,

167
00:08:30,975 --> 00:08:33,396
for getting data available

168
00:08:33,396 --> 00:08:36,688
to the vast majority
of the population of this planet.

169
00:08:38,533 --> 00:08:40,134
So let's work on that.

170
00:08:41,000 --> 00:08:43,222
So to explain a little bit

171
00:08:43,224 --> 00:08:45,050
about what we've been doing
in this regard,

172
00:08:45,311 --> 00:08:48,127
I'd like to introduce 
my organization, PanLex,

173
00:08:48,711 --> 00:08:51,888
which is an organization
that is attempting

174
00:08:51,888 --> 00:08:54,146
to collect lexical data for this purpose.

175
00:08:54,780 --> 00:08:56,830
We got started about 12 years ago

176
00:08:56,830 --> 00:08:59,600
at the University of Washington,
as a research project.

177
00:08:59,600 --> 00:09:01,088
The idea behind it

178
00:09:01,088 --> 00:09:03,990
was to show that inferred translations

179
00:09:04,377 --> 00:09:07,125
could create an effective
translation device,

180
00:09:07,125 --> 00:09:09,088
essentially a lexical translation device.

181
00:09:09,088 --> 00:09:12,223
This is an example 
from PanLex data itself.

182
00:09:12,680 --> 00:09:14,057
This is showing how to translate

183
00:09:14,066 --> 00:09:17,805
the word "ev" in Turkish,
which means house,

184
00:09:17,805 --> 00:09:19,555
to Lower Sorbian,

185
00:09:19,555 --> 00:09:21,201
the language I was referring to earlier.

186
00:09:21,212 --> 00:09:23,190
So it's unlikely to find

187
00:09:24,333 --> 00:09:26,200
Turkish to Lower Sorbian dictionaries,

188
00:09:26,200 --> 00:09:28,244
but by passing it through

189
00:09:28,244 --> 00:09:30,240
many, many different
intermediate languages,

190
00:09:30,488 --> 00:09:32,600
you can create effective translations.

191
00:09:34,333 --> 00:09:36,911
So, once this was shown 
in the research projects,

192
00:09:36,911 --> 00:09:39,631
the founder of PanLex, 
Dr. Jonathan Pool,

193
00:09:40,711 --> 00:09:43,666
decided, "Well, you know,
why not actually just do this?"

194
00:09:43,666 --> 00:09:45,470
So he started a non-profit

195
00:09:45,470 --> 00:09:48,522
to collect as much lexical data 
as possible and make it accessible.

196
00:09:48,911 --> 00:09:51,066
That's what we've been doing
for the past 12 years.

197
00:09:51,066 --> 00:09:54,516
In that time, we've collected
thousands and thousands of dictionaries,

198
00:09:54,516 --> 00:09:56,479
and extracted lexical data out of them

199
00:09:56,479 --> 00:10:01,340
and compiled a database that allows
inferred lexical translation

200
00:10:01,340 --> 00:10:03,755
across any of--

201
00:10:03,755 --> 00:10:05,866
Our current count is around 5,500

202
00:10:05,866 --> 00:10:07,955
of the 7,500 languages in the world.

203
00:10:08,511 --> 00:10:10,685
And, of course,

204
00:10:10,685 --> 00:10:12,221
we're constantly trying to expand that

205
00:10:12,221 --> 00:10:14,784
and expand the data 
on each individual language.

206
00:10:17,220 --> 00:10:21,111
So, the next question is,

207
00:10:22,079 --> 00:10:25,663
what can we do to work together on this?

208
00:10:26,680 --> 00:10:28,931
We, at PanLex, have been 
extremely excited to watch

209
00:10:28,931 --> 00:10:31,260
the development on lexical data

210
00:10:31,260 --> 00:10:34,175
that Wikidata has been working on lately.

211
00:10:35,155 --> 00:10:37,548
It's very fascinating to see organizations

212
00:10:37,550 --> 00:10:39,476
that are working in a very similar sphere,

213
00:10:39,476 --> 00:10:41,183
but in different aspects.

214
00:10:41,535 --> 00:10:44,351
And we are extremely excited to see

215
00:10:44,733 --> 00:10:46,466
the results of this from Wikidata.

216
00:10:46,466 --> 00:10:51,144
And also we are looking forward
to collaborating with Wikidata.

217
00:10:53,844 --> 00:10:56,271
I think that the special skills

218
00:10:56,271 --> 00:10:58,022
that we've developed
over the past 12 years,

219
00:10:58,022 --> 00:11:01,555
with not just collecting lexical data, 
but also in database design,

220
00:11:01,557 --> 00:11:03,908
could be extremely useful for Wikidata.

221
00:11:03,910 --> 00:11:07,111
And on the other side, I think that--

222
00:11:08,415 --> 00:11:10,975
I especially am excited about Wikidata's

223
00:11:11,743 --> 00:11:14,549
ability to do crowdsourcing of data.

224
00:11:15,129 --> 00:11:18,047
PanLex, currently,
our sources are entirely

225
00:11:18,399 --> 00:11:20,959
printed lexical sources
or other types of lexical sources,

226
00:11:21,170 --> 00:11:22,662
but we don't do any crowdsourcing.

227
00:11:22,670 --> 00:11:24,920
We simply don't have 
the infrastructure for it available

228
00:11:24,920 --> 00:11:26,931
and of course, the Wikimedia Foundation

229
00:11:26,933 --> 00:11:28,930
is the world expert in crowdsourcing.

230
00:11:31,848 --> 00:11:33,728
I'm really looking
forward to seeing exactly

231
00:11:33,733 --> 00:11:35,680
how we can apply these skills together.

232
00:11:38,533 --> 00:11:41,600
But, overall, I think the main thing
to think about this

233
00:11:41,600 --> 00:11:43,457
is that when we're
working on these things,

234
00:11:43,461 --> 00:11:45,133
it's minute detail.

235
00:11:45,133 --> 00:11:47,533
We're sitting around 
looking at grammatical forms,

236
00:11:47,533 --> 00:11:51,911
or paging our way through 
dictionaries, ancient dictionaries,

237
00:11:51,915 --> 00:11:53,977
or sometimes
recently published dictionaries

238
00:11:53,977 --> 00:11:57,466
and getting into written forms of words,

239
00:11:57,466 --> 00:11:59,994
and it feels very close up.

240
00:11:59,994 --> 00:12:01,535
But, occasionally, we need to remember

241
00:12:01,535 --> 00:12:02,556
to take a step back

242
00:12:02,556 --> 00:12:04,951
in that, even though what we're doing

243
00:12:06,231 --> 00:12:08,831
can feel even mundane at times,

244
00:12:10,091 --> 00:12:11,957
the work we're doing 
is extremely important.

245
00:12:13,010 --> 00:12:15,666
This is, in my opinion,
the absolute best way

246
00:12:15,666 --> 00:12:18,862
that we can support endangered languages

247
00:12:18,862 --> 00:12:21,488
and make sure that the linguistic
diversity of the planet

248
00:12:21,488 --> 00:12:25,730
is preserved up to the end 
of this century or longer.

249
00:12:26,444 --> 00:12:29,644
It's entirely possible that the work 
that we're doing today

250
00:12:29,644 --> 00:12:32,577
may result in languages

251
00:12:32,577 --> 00:12:35,355
being preserved and passed down,

252
00:12:35,355 --> 00:12:36,955
and not going extinct.

253
00:12:38,527 --> 00:12:40,605
So just to remember

254
00:12:40,605 --> 00:12:43,207
that even if you're sitting
around on your computer

255
00:12:43,207 --> 00:12:44,480
editing an individual entry

256
00:12:44,480 --> 00:12:49,707
and adding the dative form
of a small minority language

257
00:12:49,707 --> 00:12:51,796
for every single noun,

258
00:12:51,800 --> 00:12:54,577
the little thing 
that you're doing right now,

259
00:12:54,577 --> 00:12:57,528
might actually be partially responsible

260
00:12:57,533 --> 00:12:59,155
for making sure that language survives,

261
00:12:59,155 --> 00:13:01,060
until the end of the century or longer.

262
00:13:02,591 --> 00:13:03,703
Thank you very much,

263
00:13:03,703 --> 00:13:05,717
and I'd like to open
the floor to questions.

264
00:13:06,222 --> 00:13:08,373
(applause)

265
00:13:23,688 --> 00:13:24,977
(woman 1) Thank you.

266
00:13:24,977 --> 00:13:26,701
- Thank you for your talk.
- Thank you.

267
00:13:26,701 --> 00:13:28,777
(woman 1) I just have a question
about dictionaries.

268
00:13:28,777 --> 00:13:31,107
You said that you work
with printed dictionaries?

269
00:13:31,107 --> 00:13:32,312
- Yes.
- (woman 1) So my question

270
00:13:32,312 --> 00:13:34,508
is what do you take
from those dictionaries

271
00:13:34,511 --> 00:13:38,222
and if there's any copyright thing
you have to deal with?

272
00:13:38,222 --> 00:13:41,060
I anticipated this to be
the first question that I would get.

273
00:13:41,060 --> 00:13:42,827
(laughter)

274
00:13:42,827 --> 00:13:46,358
So, first off, for PanLex,

275
00:13:46,358 --> 00:13:50,244
we have, according to our legal 
resources that we have consulted,

276
00:13:52,734 --> 00:13:57,466
whereas the arrangement and organization
of a dictionary is copyrightable,

277
00:13:57,466 --> 00:14:03,260
the translation itself 
is not considered copyrightable.

278
00:14:04,170 --> 00:14:05,808
A good example is like, for example,

279
00:14:05,808 --> 00:14:10,525
a phone book is considered,
at least according to US law,

280
00:14:10,956 --> 00:14:11,965
copyrightable.

281
00:14:11,965 --> 00:14:16,800
But saying that person X's
phone number is digits D

282
00:14:16,800 --> 00:14:18,360
is not copyrightable.

283
00:14:21,666 --> 00:14:23,444
So like I said,

284
00:14:23,444 --> 00:14:25,311
according to our legal scholars,

285
00:14:25,311 --> 00:14:27,333
this is how we can deal with this.

286
00:14:27,333 --> 00:14:30,666
But even if that's not 
a solid enough legal argument,

287
00:14:30,666 --> 00:14:32,063
one important thing to remember

288
00:14:32,063 --> 00:14:38,269
is that the vast majority
of these lexical data

289
00:14:39,355 --> 00:14:40,530
is actually out of copyright.

290
00:14:40,530 --> 00:14:42,822
A significant number 
of these are out of copyright

291
00:14:42,822 --> 00:14:44,333
and thus can be used without [end].

292
00:14:44,333 --> 00:14:46,783
And the other thing 
is that oftentimes, for example,

293
00:14:47,311 --> 00:14:49,644
if we're working with
a recently made print dictionary,

294
00:14:49,644 --> 00:14:51,577
rather than trying to scan it and OCR it,

295
00:14:51,577 --> 00:14:53,439
we just email the person who made it.

296
00:14:53,439 --> 00:14:57,600
And it turns out that 
most linguists are really excited

297
00:14:57,600 --> 00:14:59,600
that their data can be made accessible.

298
00:14:59,600 --> 00:15:01,267
And so they're like, "Sure, please,

299
00:15:01,267 --> 00:15:03,273
just put it all in there 
and make it accessible."

300
00:15:05,533 --> 00:15:08,424
So like I said, we have, at least,
according to our legal opinions,

301
00:15:08,424 --> 00:15:09,466
we have the ability,

302
00:15:09,466 --> 00:15:11,177
but even if you don't want
to go with that,

303
00:15:11,177 --> 00:15:15,644
it's very easy to get
the data publicly accessible.

304
00:15:26,288 --> 00:15:28,470
- (man 1) Thank you. Hi.
- Hi.

305
00:15:28,470 --> 00:15:29,830
(man 1) Can you say a little more

306
00:15:29,830 --> 00:15:35,031
about how the person who speaks
Lower Sorbian is accessing the data?

307
00:15:35,031 --> 00:15:38,355
Like specifically how
that information is getting to them

308
00:15:38,357 --> 00:15:40,977
and how that might help to convince them

309
00:15:40,977 --> 00:15:42,800
to either try out the--

310
00:15:42,800 --> 00:15:44,680
Great question and this is actually

311
00:15:44,680 --> 00:15:46,266
one that I think about a lot as well,

312
00:15:46,266 --> 00:15:49,759
because I think that
when we talk about data access,

313
00:15:50,270 --> 00:15:53,244
there are actually
multiple steps to this.

314
00:15:53,244 --> 00:15:56,288
One is, of course, data preservation, 
make sure the data doesn't go away.

315
00:15:56,288 --> 00:15:58,911
Secondly, is make sure it's interoperable

316
00:15:59,177 --> 00:16:01,844
and can be used.

317
00:16:01,844 --> 00:16:05,370
And thirdly is make sure
that it's available.

318
00:16:05,631 --> 00:16:07,333
So in PanLex's case,

319
00:16:07,333 --> 00:16:09,755
we have an API that can be used,

320
00:16:09,755 --> 00:16:11,888
but, obviously,
that can't be used by an end user.

321
00:16:11,888 --> 00:16:14,847
But we've also developed interfaces.

322
00:16:15,155 --> 00:16:19,727
And so, for example,
if you go to <i>translate.panlex.org</i>,

323
00:16:19,728 --> 00:16:22,711
you can do translations on our database.

324
00:16:22,711 --> 00:16:25,864
If you want to mess around 
with the API, just go to <i>dev.panlex.org</i>,

325
00:16:25,866 --> 00:16:29,222
and you can find a bunch of stuff
on the API, or just <i>api.panlex.org</i>.

326
00:16:30,950 --> 00:16:32,542
But there's another step too,

327
00:16:32,542 --> 00:16:36,577
which is that even if you make
all of your data completely accessible

328
00:16:36,577 --> 00:16:40,533
with tools that are super useful
to be able to access it,

329
00:16:41,210 --> 00:16:43,244
if you don't actually promote the tools,

330
00:16:43,244 --> 00:16:45,058
then people won't actually
be able to use it.

331
00:16:45,058 --> 00:16:47,177
And this is honestly kind of a...

332
00:16:48,827 --> 00:16:51,044
the thing that isn't talked about enough,

333
00:16:51,044 --> 00:16:52,955
and I don't have a good answer for it.

334
00:16:52,955 --> 00:16:54,800
How do we make sure that--

335
00:16:55,022 --> 00:16:56,933
For example, I only fairly recently,

336
00:16:56,933 --> 00:16:59,647
only a few years ago
got acquainted with Wikidata,

337
00:16:59,647 --> 00:17:02,463
and it's exactly the kind
of thing that I'm interested in.

338
00:17:02,970 --> 00:17:07,177
So, how do we promote 
ourselves to others?

339
00:17:07,177 --> 00:17:08,780
I'm leaving that as an open question.

340
00:17:08,780 --> 00:17:10,800
Like I said, I don't have
a good answer for this.

341
00:17:10,800 --> 00:17:12,888
But, of course, in order to do that,

342
00:17:12,888 --> 00:17:14,880
we still need to accomplish
the first few steps.

343
00:17:22,133 --> 00:17:24,777
(man 2) If we want to have 
machine translation,

344
00:17:24,777 --> 00:17:27,822
don't we need a translation memory?

345
00:17:27,827 --> 00:17:30,666
I'm not sure that the individual words

346
00:17:30,666 --> 00:17:32,918
that we put into Wikidata,

347
00:17:32,918 --> 00:17:36,558
these short phrases
that we put into Wikidata,

348
00:17:36,558 --> 00:17:41,130
either as ordinary Wikidata items
or as Wikidata lexemes,

349
00:17:41,130 --> 00:17:43,953
are sufficient to do a proper translation.

350
00:17:43,955 --> 00:17:46,600
We need to have full sentences,
for example, for--

351
00:17:46,772 --> 00:17:48,320
(Benjamin) Yeah, absolutely.

352
00:17:48,577 --> 00:17:51,422
(man 2) And where do we get
this data structure?

353
00:17:51,422 --> 00:17:55,177
I'm not sure that, currently,

354
00:17:55,177 --> 00:17:59,533
Wikidata is able to very well handle

355
00:17:59,533 --> 00:18:03,066
the issue of a translation memory,

356
00:18:04,324 --> 00:18:05,965
<i>translatewiki.net</i>,

357
00:18:05,965 --> 00:18:09,490
for getting into that gap of...

358
00:18:12,111 --> 00:18:14,993
Should we do anything
in that respect, or should we--

359
00:18:15,000 --> 00:18:17,133
Yeah, and I really 
appreciate your question.

360
00:18:17,135 --> 00:18:18,715
I touched on this a little bit earlier,

361
00:18:18,715 --> 00:18:20,361
but I'd love to reiterate it.

362
00:18:21,356 --> 00:18:24,955
This is precisely the reason
that PanLex works in lexical data

363
00:18:24,955 --> 00:18:27,030
and why I'm excited about lexical data,

364
00:18:27,030 --> 00:18:29,935
as opposed to-- 
not as opposed to, but in addition

365
00:18:29,935 --> 00:18:35,207
to machine translation engines
and machine translation in general.

366
00:18:35,900 --> 00:18:39,200
As you said, machine translation
requires a specific kind of data,

367
00:18:39,740 --> 00:18:43,123
and that data is not available 
for most of the world's languages.

368
00:18:43,123 --> 00:18:44,966
For the vast majority
of the world's languages,

369
00:18:44,966 --> 00:18:46,379
that simply is not available.

370
00:18:46,650 --> 00:18:48,447
But that doesn't mean
we should just give up.

371
00:18:48,447 --> 00:18:49,627
Like why?

372
00:18:51,260 --> 00:18:54,444
If I needed to translate 
my Turkish restaurant menu,

373
00:18:54,755 --> 00:18:59,360
then lexical translation will likely 
be an exceptionally good tool for that.

374
00:18:59,360 --> 00:19:01,715
Now, I'm not saying 
that you can use lexical translation

375
00:19:01,715 --> 00:19:04,600
to do perfect paragraph 
to paragraph translation.

376
00:19:04,600 --> 00:19:06,866
When I say lexical translation,
I mean word to word

377
00:19:06,866 --> 00:19:09,670
and word to word translation
can be extremely useful.

378
00:19:12,231 --> 00:19:14,708
It's funny to think about it, 
but we didn't really have access

379
00:19:14,708 --> 00:19:16,620
to really good machine translation.

380
00:19:16,620 --> 00:19:20,191
Everyone didn't have 
access to that until fairly recently.

381
00:19:20,191 --> 00:19:23,649
And we still got by with dictionaries,

382
00:19:23,649 --> 00:19:27,687
and they're an incredibly good resource.

383
00:19:28,311 --> 00:19:31,288
And the data is available, 
so why not make it available

384
00:19:31,288 --> 00:19:34,377
to the world at large 
and to the speakers of these languages?

385
00:19:36,422 --> 00:19:38,666
(woman 2) Hi, what mechanisms 
do you have in place

386
00:19:38,666 --> 00:19:40,666
when the community itself--I'm over here.

387
00:19:40,666 --> 00:19:43,253
- Where are you? Okay, right.
- (woman 2) Yeah, sorry. (laughs)

388
00:19:43,253 --> 00:19:44,577
...when the community itself

389
00:19:44,577 --> 00:19:47,320
doesn't want part of their data in PanLex?

390
00:19:47,320 --> 00:19:48,933
Great question.

391
00:19:48,933 --> 00:19:51,955
So the way that we work with that

392
00:19:51,955 --> 00:19:56,287
is that if a dictionary is published 
and made publicly available,

393
00:19:56,666 --> 00:19:58,133
that's a good indication.

394
00:19:58,133 --> 00:20:02,400
Like you could buy it in a store
or at a university library,

395
00:20:02,400 --> 00:20:04,690
or a public library anyone can access.

396
00:20:04,690 --> 00:20:08,080
That's a good indication 
that that decision has been made.

397
00:20:08,080 --> 00:20:11,577
(woman 2) [inaudible]

398
00:20:15,740 --> 00:20:18,266
(man 3) Please, [inaudible],
could you speak in the microphone?

399
00:20:19,295 --> 00:20:20,447
Can you say it again?

400
00:20:20,447 --> 00:20:23,307
(woman 2) Linguists don't always have
the permission of the community

401
00:20:23,307 --> 00:20:24,387
in order to publish things;

402
00:20:24,387 --> 00:20:27,533
they oftentimes publish things
without the consent of the community.

403
00:20:27,533 --> 00:20:29,577
And that's absolutely true.

404
00:20:29,577 --> 00:20:32,533
I would say that is a--

405
00:20:32,533 --> 00:20:34,422
That does happen.

406
00:20:34,422 --> 00:20:36,770
I would say it's generally 
a small minority of cases,

407
00:20:36,770 --> 00:20:40,955
mostly confined 
to generally North America,

408
00:20:40,955 --> 00:20:43,355
although sometimes 
South American languages as well.

409
00:20:44,765 --> 00:20:46,488
It's something we have
to take into account.

410
00:20:46,488 --> 00:20:49,288
If we were to receive word, for example,

411
00:20:49,288 --> 00:20:52,377
that the data that is in PanLex

412
00:20:52,377 --> 00:20:56,330
should not be accessed 
by the greater world,

413
00:20:56,330 --> 00:20:58,040
then, of course, we would remove it.

414
00:20:58,040 --> 00:20:59,310
(woman 2) Good, good.

415
00:21:01,281 --> 00:21:02,451
That doesn't mean, of course,

416
00:21:02,451 --> 00:21:04,391
that we'll listen
to copyright rules necessarily

417
00:21:04,391 --> 00:21:06,542
but we will listen
to traditional communities,

418
00:21:06,542 --> 00:21:08,157
and that's the major difference.

419
00:21:08,157 --> 00:21:10,252
(woman 2) Yeah,
that's what I'm referring to.

420
00:21:15,022 --> 00:21:16,755
It brings up a really interesting point,

421
00:21:16,755 --> 00:21:18,350
which is that

422
00:21:18,844 --> 00:21:22,244
sometimes it's a really big question
of who speaks for a language.

423
00:21:23,000 --> 00:21:27,911
I had some experience actually
visiting the American Southwest

424
00:21:27,911 --> 00:21:29,755
and working with some groups,

425
00:21:29,777 --> 00:21:32,288
who work on indigenous,
the Pueblo languages out there.

426
00:21:36,053 --> 00:21:38,044
So there are approximately

427
00:21:38,044 --> 00:21:40,220
six Pueblo languages, 
depending on how you slice it,

428
00:21:40,220 --> 00:21:41,955
spoken in that area.

429
00:21:41,955 --> 00:21:44,022
But they are divided
amongst 18 different Pueblos

430
00:21:44,320 --> 00:21:47,066
and each one has their own
tribal government,

431
00:21:47,066 --> 00:21:50,022
and each government 
may have a different opinion

432
00:21:50,022 --> 00:21:54,007
on whether their language 
should be accessible to outsiders or not.

433
00:21:56,626 --> 00:21:58,170
Like, for example, Zuni Pueblo,

434
00:21:58,170 --> 00:22:01,472
it's a single Pueblo
that speaks Zuni language.

435
00:22:02,923 --> 00:22:05,274
And they're really big 
on their language going everywhere,

436
00:22:05,274 --> 00:22:07,694
they put it on the street signs 
and everything, it's great.

437
00:22:07,694 --> 00:22:10,637
But for some of the other languages,

438
00:22:10,644 --> 00:22:13,051
you might have one group that says,

439
00:22:13,051 --> 00:22:15,866
"Yeah, we don't want our language
being accessed by outsiders."

440
00:22:15,871 --> 00:22:18,838
But then you have the neighboring Pueblo
who speaks the same language say,

441
00:22:18,838 --> 00:22:21,666
"We really want our language
accessible to outsiders

442
00:22:21,666 --> 00:22:24,088
in using these technological tools,

443
00:22:24,088 --> 00:22:26,560
because we want our language
to be able to continue on."

444
00:22:26,560 --> 00:22:29,488
And it raises a really 
interesting ethical question.

445
00:22:29,488 --> 00:22:31,651
Because if you default by saying,

446
00:22:31,651 --> 00:22:34,622
"Fine, I'm cutting it off because 
this group said we should cut it off"--

447
00:22:34,622 --> 00:22:36,711
aren't you also disservicing
the second group

448
00:22:36,711 --> 00:22:39,360
because they actively
want you to roll out these things?

449
00:22:39,360 --> 00:22:42,755
So I don't think this is a question
that has an easy answer.

450
00:22:42,755 --> 00:22:44,955
But I would say 
at least in terms of PanLex.

451
00:22:44,955 --> 00:22:48,938
And for the record, we actually
haven't encountered this yet,

452
00:22:48,938 --> 00:22:50,407
that I'm aware of.

453
00:22:50,933 --> 00:22:52,920
Now, that could be partially because...

454
00:22:53,666 --> 00:22:55,444
Getting back to his question,

455
00:22:55,666 --> 00:22:57,790
we may need to promote more. (chuckles)

456
00:22:58,660 --> 00:23:02,155
But, in general, as far as I know,

457
00:23:02,155 --> 00:23:04,488
we have not had this come up.

458
00:23:04,488 --> 00:23:06,871
But our game plan for this

459
00:23:06,871 --> 00:23:10,975
is if a community says they don't want
their data in a database,

460
00:23:10,975 --> 00:23:12,095
then we remove it.

461
00:23:12,095 --> 00:23:14,916
(woman 2) Because we have come up
with it in Wikidata and Wikipedia...

462
00:23:14,916 --> 00:23:16,140
- You have?
- (woman 2) ...in Commons.

463
00:23:16,140 --> 00:23:17,407
- Really?
- (woman 2) It's been a problem.

464
00:23:17,407 --> 00:23:20,488
Yeah, I can imagine, especially in Commons
for photos or certain things.

465
00:23:20,488 --> 00:23:21,900
(woman 2) Correct.

466
00:23:27,177 --> 00:23:33,170
(man 4) Hi, I had a question about
the crowdsourcing aspect of this.

467
00:23:34,087 --> 00:23:36,644
As far as going in and asking a community

468
00:23:36,654 --> 00:23:40,480
to annotate or add data for a dataset,

469
00:23:40,480 --> 00:23:44,200
one of the things
that's a little intimidating is like,

470
00:23:44,711 --> 00:23:49,244
as an editor, I can only see
what things are missing.

471
00:23:49,244 --> 00:23:53,242
But if I'm going to spend time
on things, having an idea,

472
00:23:53,582 --> 00:23:56,672
there's a list of high priority items,

473
00:23:57,755 --> 00:24:01,198
that's, I guess,
very motivating in this aspect.

474
00:24:01,200 --> 00:24:04,222
And I was curious if you had a system

475
00:24:04,222 --> 00:24:07,866
which is, essentially, like, 
we know the gaps in our own data,

476
00:24:07,866 --> 00:24:12,088
we have linguistic evidence 
to know that these are the ones

477
00:24:12,088 --> 00:24:15,530
that if we had annotated,
these would be the high impact drivers.

478
00:24:15,530 --> 00:24:17,152
So I can imagine

479
00:24:18,202 --> 00:24:21,405
having the lexeme
for "house" very impactful,

480
00:24:21,405 --> 00:24:24,977
maybe not a lexeme
for a data or some other like.

481
00:24:24,977 --> 00:24:28,947
But I was curious if you had that,
if it is something

482
00:24:30,217 --> 00:24:35,480
that could be used
to drive these community efforts.

483
00:24:35,840 --> 00:24:37,066
Great question.

484
00:24:37,200 --> 00:24:41,216
So one thing that Wikidata
has a whole lot of--

485
00:24:41,216 --> 00:24:44,666
sorry, excuse me, PanLex 
has a whole lot of are Swadesh lists.

486
00:24:44,666 --> 00:24:47,511
We have apparently the largest collection
of Swadesh lists in the world

487
00:24:47,511 --> 00:24:48,555
which is interesting.

488
00:24:48,555 --> 00:24:50,212
If you don't know what a Swadesh list is,

489
00:24:50,212 --> 00:24:56,244
it's essentially a regularized 
list of lexical items

490
00:24:56,244 --> 00:25:00,040
that can be used 
for analysis of languages.

491
00:25:00,040 --> 00:25:02,730
They contain really basic sets.

492
00:25:02,730 --> 00:25:05,003
So there's a couple 
of different kinds of Swadesh lists.

493
00:25:05,003 --> 00:25:07,328
But there are 100 or 213 items

494
00:25:07,328 --> 00:25:08,911
and they might contain

495
00:25:08,911 --> 00:25:12,777
words like "house" and "eye" and "skin"

496
00:25:12,777 --> 00:25:14,444
and basically general words

497
00:25:14,444 --> 00:25:16,331
that you should be able
to find in any language.

498
00:25:16,331 --> 00:25:19,888
So that's like a really 
good starting point

499
00:25:19,888 --> 00:25:22,988
for having that kind of data available.

500
00:25:29,090 --> 00:25:31,126
Now, as I mentioned before,

501
00:25:31,133 --> 00:25:33,600
crowdsourcing is something
that we don't do yet

502
00:25:33,600 --> 00:25:36,066
and we're actually 
really excited to be able to do.

503
00:25:36,066 --> 00:25:37,554
It's one of the things I'm really excited

504
00:25:37,554 --> 00:25:38,993
to talk to people 
at this conference about,

505
00:25:38,993 --> 00:25:42,982
is how crowdsourcing can be used

506
00:25:42,982 --> 00:25:45,931
and the logistics behind it,

507
00:25:46,200 --> 00:25:48,867
and these are the kind 
of questions that can come up.

508
00:25:51,288 --> 00:25:53,400
So I guess the answer I can say to you

509
00:25:53,400 --> 00:25:55,376
is that we do have a priority list--

510
00:25:55,376 --> 00:25:57,684
Actually, one thing I can say
is we definitely do have a priority list

511
00:25:57,684 --> 00:25:59,730
when it comes to which languages
we are seeking out.

512
00:25:59,730 --> 00:26:02,222
So the way we do this 
is that we look for languages

513
00:26:02,222 --> 00:26:04,666
that are not currently served
by technological solutions,

514
00:26:04,666 --> 00:26:06,977
which are oftentimes minority languages,

515
00:26:06,977 --> 00:26:09,280
or usually minority languages,

516
00:26:09,280 --> 00:26:12,096
and then prioritize those.

517
00:26:13,916 --> 00:26:16,844
But in terms of individual lexical items,

518
00:26:16,851 --> 00:26:20,244
the general way we get new data

519
00:26:20,244 --> 00:26:22,977
is essentially by ingesting
an entire dictionary's worth.

520
00:26:22,977 --> 00:26:25,911
We are relying on the dictionary's choice

521
00:26:25,911 --> 00:26:29,333
of lexical items, 
rather than necessarily saying,

522
00:26:29,333 --> 00:26:31,500
we're really looking for the word 
for "house" in every language.

523
00:26:31,500 --> 00:26:35,000
But when it comes to data crowdsourcing,
we will need something like that.

524
00:26:35,000 --> 00:26:37,912
So this is an opportunity
for research and growth.

525
00:26:40,044 --> 00:26:43,088
(man 5) Hi, I'm Victor,
and this is awesome.

526
00:26:45,108 --> 00:26:46,888
As you have slides here,

527
00:26:46,888 --> 00:26:49,355
can you talk a little bit
about the technical status

528
00:26:49,355 --> 00:26:51,260
that currently you have data

529
00:26:51,260 --> 00:26:57,022
or information flow 
from and to Wikidata and PanLex.

530
00:26:57,022 --> 00:26:59,955
Is that currently implemented already

531
00:26:59,955 --> 00:27:03,888
and how do you deal with

532
00:27:03,888 --> 00:27:07,133
back and forth or even
feedback loop information

533
00:27:07,140 --> 00:27:09,950
between PanLex and Wikidata?

534
00:27:09,950 --> 00:27:13,733
So we actually don't have any formal 
connections to Wikidata at this point,

535
00:27:13,733 --> 00:27:15,343
and this is something that I'm, again,

536
00:27:15,343 --> 00:27:17,824
I'm really excited to talk 
to people in this conference about.

537
00:27:17,824 --> 00:27:20,644
We've had some interaction 
with Wiktionary,

538
00:27:21,774 --> 00:27:24,720
but Wikidata is actually
a better fit, honestly,

539
00:27:24,720 --> 00:27:26,755
for what we are looking for.

540
00:27:27,355 --> 00:27:29,201
Having directly lexical stuff

541
00:27:29,201 --> 00:27:32,311
means that we have to do a lot less 
data analysis and extraction.

542
00:27:32,933 --> 00:27:37,148
And so the answer is,
we don't yet, but we want to.

543
00:27:37,148 --> 00:27:39,800
(man 5) And if not,
what are the obstacles?

544
00:27:39,800 --> 00:27:43,511
And as we can see, Wikidata
already supports several languages,

545
00:27:43,511 --> 00:27:46,533
but when I look up <i>translate.panlex.org</i>,

546
00:27:46,533 --> 00:27:49,311
you apparently support
many, many variants,

547
00:27:49,311 --> 00:27:50,888
much more than Wikidata.

548
00:27:50,888 --> 00:27:53,316
How do you see there is a gap

549
00:27:53,316 --> 00:27:57,177
between translation
or lexical translation first,

550
00:27:57,177 --> 00:28:00,155
application versus an effort

551
00:28:00,155 --> 00:28:03,777
as trying to map a knowledge structure.

552
00:28:03,777 --> 00:28:05,866
Mapping knowledge 
will actually be very interesting.

553
00:28:05,866 --> 00:28:07,336
We've had some 
very interesting discussions

554
00:28:07,336 --> 00:28:12,311
about the way that Wikidata 
organizes their lexical data,

555
00:28:12,311 --> 00:28:13,777
your lexical data,

556
00:28:13,777 --> 00:28:16,044
and how we organize our lexical data.

557
00:28:16,044 --> 00:28:20,933
And there are subtle differences
that would require a mapping strategy,

558
00:28:21,460 --> 00:28:24,577
some of which will not
necessarily be automatic,

559
00:28:24,577 --> 00:28:27,422
but we might be able to develop 
techniques to be able to do this.

560
00:28:27,422 --> 00:28:30,796
You gave the example of language variants.

561
00:28:30,796 --> 00:28:34,111
We tend to be very "splittery"
when it comes to language variants.

562
00:28:34,111 --> 00:28:36,311
In other words,
if we get a source that says

563
00:28:36,311 --> 00:28:38,755
that this is the dialect spoken

564
00:28:38,755 --> 00:28:41,695
on the left side of the river 
in Papua New Guinea, for this language,

565
00:28:41,695 --> 00:28:42,913
and we get another source that says

566
00:28:42,913 --> 00:28:44,955
this is the dialect spoken
on the right side of the river,

567
00:28:44,955 --> 00:28:46,720
then we consider them
essentially separate languages.

568
00:28:46,720 --> 00:28:51,072
And so we do this in order to basically
preserve the most data that we can.

569
00:28:52,222 --> 00:28:54,355
Being able to map that
to how Wikidata does it--

570
00:28:54,355 --> 00:28:56,938
Actually, what I would love
is to have conversations

571
00:28:56,938 --> 00:29:00,696
about how languages

572
00:29:00,696 --> 00:29:06,323
are designated on Wikidata.

573
00:29:08,145 --> 00:29:12,320
Again, we go with the strategy
of very much a "splittery" strategy.

574
00:29:13,856 --> 00:29:17,440
We broadly rely on ISO 639-3 codes,

575
00:29:17,866 --> 00:29:19,643
which is provided by the Ethnologue,

576
00:29:19,643 --> 00:29:23,840
and then each individual code, 
we then allow multiple variants within it,

577
00:29:23,840 --> 00:29:29,098
either for script variants
or regional dialects or sociolects, etc.

578
00:29:30,240 --> 00:29:32,762
Again, opportunity 
for discussion and work.

579
00:29:35,622 --> 00:29:39,466
(woman 3) Hi, I would like to know
if you have an OCR pipeline

580
00:29:39,466 --> 00:29:44,533
and especially because 
we've been trying to do OCR on Maya,

581
00:29:44,533 --> 00:29:47,928
and we don't get any results.

582
00:29:47,933 --> 00:29:49,933
It doesn't understand anything--

583
00:29:49,933 --> 00:29:52,512
- Oh, yeah! (laughs)
- (woman 3) And... yeah.

584
00:29:52,512 --> 00:29:56,078
So if your pipelines are available.

585
00:29:56,078 --> 00:30:00,288
And the other one is just
on the overlap of ISO codes,

586
00:30:00,288 --> 00:30:01,641
like sometimes they say,

587
00:30:01,641 --> 00:30:04,199
"Oh, this is a language,
and this is another language,"

588
00:30:04,199 --> 00:30:06,555
but there are sources 
that say other stuff,

589
00:30:06,555 --> 00:30:10,133
as you were mentioning,
but they tend to overlap.

590
00:30:10,133 --> 00:30:12,955
So how do you go on...? Yeah.

591
00:30:12,956 --> 00:30:15,155
Yeah, that's absolutely
an amazing question.

592
00:30:15,155 --> 00:30:17,120
I really like it.

593
00:30:17,120 --> 00:30:20,400
So we don't have a formalized 
OCR pipeline per se;

594
00:30:20,400 --> 00:30:23,533
we do it on a sort of 
source by source basis.

595
00:30:23,533 --> 00:30:26,266
One of the reasons why 
is because we oftentimes have sources

596
00:30:26,266 --> 00:30:27,955
that don't necessarily need to be OCR'd,

597
00:30:27,955 --> 00:30:29,841
that are available 
for some of these languages,

598
00:30:29,841 --> 00:30:32,766
and we concentrate on those because
they require the least amount of work.

599
00:30:32,766 --> 00:30:35,000
But, obviously, 
if we really want to dive deep

600
00:30:35,000 --> 00:30:37,056
into some of our sources
that are in our backlog,

601
00:30:37,056 --> 00:30:40,896
we're going to need to essentially 
develop strong OCR pipelines.

602
00:30:40,896 --> 00:30:43,968
But there's another aspect too,
which is that, as you mentioned...

603
00:30:44,400 --> 00:30:48,576
like the people who designed OCR engines

604
00:30:49,088 --> 00:30:52,672
I think are not realizing
how much you can stress test them.

605
00:30:52,672 --> 00:30:55,181
Like, you know what's fun?--

606
00:30:55,181 --> 00:30:57,690
trying to OCR
a Russian-Tibetan dictionary.

607
00:30:58,600 --> 00:31:00,216
It's really hard, it turns out...

608
00:31:01,503 --> 00:31:03,747
We gave up, and we hired 
someone to just type it up,

609
00:31:04,022 --> 00:31:05,641
which was totally doable.

610
00:31:05,641 --> 00:31:07,260
And actually, it turns out

611
00:31:07,260 --> 00:31:10,266
that this amazing Russian woman 
learned to read Tibetan

612
00:31:10,266 --> 00:31:12,755
so she could type this up,
which was super cool.

613
00:31:15,333 --> 00:31:18,270
I think that if you're dealing 
with stuff in the Latin scripts,

614
00:31:18,270 --> 00:31:22,871
then I think that OCR solutions 
can be developed, that are more robust,

615
00:31:22,871 --> 00:31:24,673
that deal with 
multilingual sources like this

616
00:31:24,673 --> 00:31:26,991
and expect that you're going
to get a random four in there,

617
00:31:26,991 --> 00:31:28,284
if you're dealing with something like

618
00:31:28,284 --> 00:31:30,560
16th-century Mayan sources,
you know, with digit four.

619
00:31:32,088 --> 00:31:37,600
But there are some sources

620
00:31:37,600 --> 00:31:40,111
that OCR is probably just 
never really going to catch up to,

621
00:31:40,111 --> 00:31:42,244
or require such an immense amount of work,

622
00:31:43,200 --> 00:31:46,933
that actually we put a little
bit of this to use right now.

623
00:31:46,933 --> 00:31:48,800
We have another project
we're running at PanLex

624
00:31:48,800 --> 00:31:53,533
to transcribe all of the traditional
literature of Bali,

625
00:31:53,533 --> 00:31:57,952
and we found that in handwritten 
Balinese manuscripts,

626
00:31:58,444 --> 00:31:59,644
there's just no chance of OCR.

627
00:31:59,644 --> 00:32:02,200
So we got a bunch 
of Balinese people to type them up,

628
00:32:02,200 --> 00:32:05,000
and it's become a really cool
cultural project within Bali,

629
00:32:05,000 --> 00:32:07,288
and it's become news and stuff like that.

630
00:32:07,288 --> 00:32:09,084
So I would say

631
00:32:09,084 --> 00:32:11,377
that you don't necessarily 
need to rely on OCR,

632
00:32:11,377 --> 00:32:12,577
but there is a lot out there.

633
00:32:12,577 --> 00:32:15,160
So having good OCR solutions
would be good.

634
00:32:16,663 --> 00:32:20,992
Also, if anyone out here 
is into super multilingual OCR,

635
00:32:20,992 --> 00:32:22,635
please come talk to me.

636
00:32:29,517 --> 00:32:31,377
(man 6) Thank you for your presentation.

637
00:32:32,007 --> 00:32:34,866
You talked about integration

638
00:32:34,866 --> 00:32:37,060
between PanLex and Wikidata,

639
00:32:37,060 --> 00:32:38,792
but you haven't gone into the specifics.

640
00:32:38,792 --> 00:32:42,701
So I was checking your data license,
and it is under CC0.

641
00:32:42,701 --> 00:32:44,210
- Yes.
- (man 6) That's really great.

642
00:32:44,210 --> 00:32:46,377
So there are two possible ways

643
00:32:46,377 --> 00:32:49,400
that either we can import the data

644
00:32:49,400 --> 00:32:52,777
or we can continue something similar
to the Freebase way,

645
00:32:52,777 --> 00:32:55,688
where we had the complete
database from the Freebase,

646
00:32:55,688 --> 00:32:59,080
and we imported them, and we made a link,

647
00:32:59,080 --> 00:33:03,955
an external identifier
to the Freebase database.

648
00:33:03,955 --> 00:33:08,397
So if you have something in mind,
are you thinking similar?

649
00:33:08,397 --> 00:33:10,401
Or you just want to make...

650
00:33:15,291 --> 00:33:18,755
an independent database
which can be linked to Wikidata?

651
00:33:18,755 --> 00:33:20,533
Yeah, so this is a great question

652
00:33:20,533 --> 00:33:23,282
and actually I feel
like it's about one step ahead

653
00:33:23,282 --> 00:33:25,648
of some of the stuff 
that I've already been thinking about,

654
00:33:25,648 --> 00:33:29,555
partially because, like I said,

655
00:33:29,955 --> 00:33:32,111
getting the two databases to work together

656
00:33:32,111 --> 00:33:33,533
is a step in and of itself.

657
00:33:33,533 --> 00:33:35,332
I think the first step that we can take

658
00:33:35,333 --> 00:33:37,622
is literally just pooling 
our skills together.

659
00:33:37,911 --> 00:33:40,246
We have a lot of experience 
dealing with stuff

660
00:33:40,246 --> 00:33:42,656
like classifications of properties
of individual lexemes

661
00:33:42,656 --> 00:33:44,734
that I'd love to share.

662
00:33:45,864 --> 00:33:49,050
But being able to link the databases 
themselves would be wonderful.

663
00:33:49,050 --> 00:33:50,808
I'm 100% for that.

664
00:33:50,808 --> 00:33:54,066
I think it would be a little bit easier

665
00:33:54,066 --> 00:33:56,022
on the Wikidata towards PanLex way,

666
00:33:56,022 --> 00:33:58,866
but maybe I'm just biased
because I can see how that could work.

667
00:34:02,040 --> 00:34:06,088
Yeah, essentially, as long 
as Wikidata is comfortable

668
00:34:06,088 --> 00:34:09,620
with all the licensing stuff like that,
or we work something out,

669
00:34:09,620 --> 00:34:12,057
then I think that would be a great idea.

670
00:34:13,216 --> 00:34:16,235
We'd just have to figure out ways
of linking the data itself.

671
00:34:16,235 --> 00:34:22,234
One thing I can imagine is, essentially,
that I would love for edits to Wikidata

672
00:34:22,577 --> 00:34:26,088
to immediately become populated 
to the PanLex database,

673
00:34:26,088 --> 00:34:28,551
without having to essentially

674
00:34:28,551 --> 00:34:30,786
just reingest it every...

675
00:34:30,786 --> 00:34:35,779
essentially making Wikidata
a crowdsourceable interface to PanLex

676
00:34:35,779 --> 00:34:36,888
would be really awesome.

677
00:34:36,888 --> 00:34:39,777
And then being able to use
PanLex in immediate translations,

678
00:34:39,780 --> 00:34:42,224
to be able to do translations 
across Wikidata lexical items--

679
00:34:42,224 --> 00:34:43,770
that would be glorious.

680
00:34:55,288 --> 00:35:00,266
(man 7) This is like the auditing process
of this semantic web

681
00:35:00,266 --> 00:35:03,808
to close holes by inference.

682
00:35:05,682 --> 00:35:09,733
If we think this further,
this kind of translation,

683
00:35:09,733 --> 00:35:13,353
how do you deal with semantic mismatch

684
00:35:13,355 --> 00:35:16,088
and grammatical mismatch?

685
00:35:16,088 --> 00:35:18,888
For instance, if you try 
to translate something in German,

686
00:35:18,888 --> 00:35:21,933
you can simply put several words together

687
00:35:21,933 --> 00:35:25,986
and reach something that's sensible,

688
00:35:25,986 --> 00:35:29,184
and on the other hand,
I think I read sometimes

689
00:35:31,450 --> 00:35:38,450
not every language
has the same granular system

690
00:35:38,450 --> 00:35:40,453
for colors, for instance.

691
00:35:41,577 --> 00:35:42,800
You said everything

692
00:35:42,800 --> 00:35:45,010
uses a different system
for colors or are the same?

693
00:35:45,530 --> 00:35:48,377
(man 7) I remember maybe 
that it's just about evolution of language

694
00:35:48,377 --> 00:35:51,533
that they started out 
with black and white and then--

695
00:35:51,533 --> 00:35:53,333
Yeah, the color hierarchy.

696
00:35:53,333 --> 00:35:54,492
Actually, the color hierarchy

697
00:35:54,492 --> 00:35:57,271
is a great way to illustrate
how this works, right?

698
00:35:57,977 --> 00:36:01,400
So, essentially, when you have 
a single pivot language--

699
00:36:02,043 --> 00:36:04,822
it's really interesting when 
you read papers on machine translations

700
00:36:04,822 --> 00:36:08,000
because oftentimes they'll talk about 
some hypothetical pivot language,

701
00:36:08,000 --> 00:36:09,826
that they say, "Oh yeah,
there is a pivot language,"

702
00:36:09,826 --> 00:36:12,133
and then you read in the paper 
and say, "It's English."

703
00:36:12,133 --> 00:36:16,688
And so what this form
of lexical translation does,

704
00:36:16,680 --> 00:36:20,352
by passing it through
many different intermediate languages,

705
00:36:20,755 --> 00:36:26,142
it has the effect of being able 
to deal with a lot of semantic ambiguity.

706
00:36:26,142 --> 00:36:28,426
Because as long as you're passing it
through languages

707
00:36:28,426 --> 00:36:33,408
that contain the same or reasonably similar
semantic boundaries to a word,

708
00:36:33,408 --> 00:36:37,038
then you can avoid
the problem of essentially

709
00:36:37,038 --> 00:36:39,808
introducing semantic ambiguity 
through the pivot language.

710
00:36:39,808 --> 00:36:43,266
So using the color hierarchy thing
as an example,

711
00:36:43,266 --> 00:36:46,460
if you take a language that has 
a single color word for green and blue

712
00:36:46,460 --> 00:36:50,688
and it translates it into blue

713
00:36:50,688 --> 00:36:53,244
in your single pivot language

714
00:36:53,244 --> 00:36:54,477
and then into another language

715
00:36:54,477 --> 00:36:57,422
that has different ambiguities
on these things,

716
00:36:57,422 --> 00:37:00,283
then you end up introducing
semantic ambiguity.

717
00:37:00,283 --> 00:37:02,370
But if you pass it through
a bunch of other languages

718
00:37:02,370 --> 00:37:05,660
that also contain a single 
lexical item for green and blue,

719
00:37:05,660 --> 00:37:10,666
then, essentially,
that semantic specificity

720
00:37:11,040 --> 00:37:16,990
gets passed along
to the resultant language.

721
00:37:17,755 --> 00:37:20,666
As far as the grammatical feature aspects,

722
00:37:20,666 --> 00:37:23,488
PanLex has been primarily, in its history,

723
00:37:23,488 --> 00:37:28,960
collecting essentially lexemes, 
essentially lexical forms.

724
00:37:29,711 --> 00:37:31,800
And, by that, I mean, essentially,

725
00:37:31,804 --> 00:37:33,840
whatever you get 
as the headword for a dictionary.

726
00:37:34,800 --> 00:37:38,170
So we don't necessarily
concentrate at this time

727
00:37:38,555 --> 00:37:40,955
on collecting grammatical variant forms,

728
00:37:40,955 --> 00:37:43,360
things like [inaudible] data, etc.

729
00:37:43,360 --> 00:37:44,830
or past tense and present tense.

730
00:37:44,830 --> 00:37:46,487
But it's something we're looking into.

731
00:37:46,488 --> 00:37:48,420
One thing that it's always
important to remember

732
00:37:48,420 --> 00:37:50,600
is that because our focus is--

733
00:37:51,422 --> 00:37:54,490
is on underserved and endangered
minority languages,

734
00:37:55,000 --> 00:37:57,777
we want to make sure 
that something is available

735
00:37:57,777 --> 00:37:59,711
before we make it perfect.

736
00:38:01,621 --> 00:38:02,844
A phrase I absolutely love

737
00:38:02,844 --> 00:38:04,927
is "Don't let the perfect
be the enemy of the good,"

738
00:38:04,927 --> 00:38:06,570
and that's what we intend to do.

739
00:38:06,570 --> 00:38:09,014
But we are super interested in the idea

740
00:38:09,014 --> 00:38:12,266
of being able to handle grammatical forms,

741
00:38:12,266 --> 00:38:14,031
and being able to translate
across grammatical forms,

742
00:38:14,031 --> 00:38:15,665
and it's some stuff 
we've done some research on

743
00:38:15,665 --> 00:38:17,468
but we haven't fully implemented yet.

744
00:38:25,350 --> 00:38:28,777
(man 8) So, of the 7,500 or so languages,

745
00:38:30,448 --> 00:38:33,111
I assume you're relying on dictionaries
which are written for us,

746
00:38:33,111 --> 00:38:36,222
but do all those languages 
have standard written forms

747
00:38:36,222 --> 00:38:38,101
and how do you deal with...?

748
00:38:38,101 --> 00:38:39,887
That's a great question.

749
00:38:42,111 --> 00:38:45,062
Essentially, yes, a lot of these languages,

750
00:38:45,066 --> 00:38:47,977
as everyone's aware, are unwritten.

751
00:38:47,977 --> 00:38:50,666
However, any language 
for which a dictionary has been produced

752
00:38:50,666 --> 00:38:52,466
has some kind of orthography,

753
00:38:52,466 --> 00:38:56,710
and we rely on the orthography
produced for the dictionary.

754
00:38:56,710 --> 00:38:59,686
We occasionally do some 
slight massaging of orthography

755
00:39:00,956 --> 00:39:03,177
if we can guarantee 
it to be lossless, basically.

756
00:39:03,177 --> 00:39:05,377
But we tend to avoid it
as much as possible.

757
00:39:07,533 --> 00:39:11,485
So, essentially,
we don't get into the business

758
00:39:11,485 --> 00:39:13,229
of developing orthographies 
for languages,

759
00:39:13,229 --> 00:39:14,967
because oftentimes they've been developed,

760
00:39:14,967 --> 00:39:17,240
even if they're not really
widely published.

761
00:39:17,240 --> 00:39:22,155
So, for example,

762
00:39:22,155 --> 00:39:26,022
for a lot of languages
that are spoken in New Guinea,

763
00:39:26,488 --> 00:39:29,125
there may not be a commonly
used orthographic form,

764
00:39:29,125 --> 00:39:30,980
but some linguists 
just come up with something

765
00:39:30,980 --> 00:39:32,333
and that's a good first step.

766
00:39:33,473 --> 00:39:36,730
We also collect phonetic forms
when they're available in dictionaries,

767
00:39:36,730 --> 00:39:38,400
and so that's another way in,

768
00:39:38,400 --> 00:39:40,533
essentially an IPA 
representation of the word,

769
00:39:40,533 --> 00:39:41,800
if that's available.

770
00:39:41,800 --> 00:39:43,333
So that can also be used as well.

771
00:39:43,333 --> 00:39:45,755
But we typically don't just
use that as a pivot

772
00:39:45,755 --> 00:39:48,226
because it introduces certain ambiguities.

773
00:39:52,666 --> 00:39:55,466
(woman 4) Thank you, 
this might be a super silly question,

774
00:39:56,044 --> 00:40:00,572
but are those only the intermediate
languages you work with?

775
00:40:00,572 --> 00:40:02,215
Oh, no. Oh, no.

776
00:40:02,222 --> 00:40:03,790
(woman 4) Oh, yes, alright. Thank you.

777
00:40:03,790 --> 00:40:05,683
No, I'm glad you asked.
It answers the question.

778
00:40:05,683 --> 00:40:11,311
So this is actually a screenshot
from <i>translate.panlex.org</i>.

779
00:40:11,311 --> 00:40:12,826
If you do a translation,

780
00:40:12,826 --> 00:40:15,022
you'll get a list of translations
on the right side.

781
00:40:15,022 --> 00:40:17,874
You click a little <i>dot dot dot</i> button, 
you'll get a graph like this.

782
00:40:17,874 --> 00:40:21,760
And what this shows 
is the intermediate languages,

783
00:40:22,010 --> 00:40:24,133
the top 20 by score--

784
00:40:24,133 --> 00:40:26,093
I could go into the details
of how we do the score

785
00:40:26,093 --> 00:40:27,452
but it's not super important now--

786
00:40:27,452 --> 00:40:30,244
by score that are being used.

787
00:40:30,244 --> 00:40:33,393
But to make the translation, 
we're actually using way more than 20.

788
00:40:33,393 --> 00:40:35,797
The reason I cap it at 20
is because if you have more than 20--

789
00:40:35,797 --> 00:40:37,661
like, this is actually
a kind of physics simulation;

790
00:40:37,661 --> 00:40:39,638
you can move the things around 
and they squiggle.

791
00:40:39,638 --> 00:40:42,200
If you have more than 20,
your computer gets really mad.

792
00:40:45,400 --> 00:40:47,419
So it's more of a demonstration, yeah.

793
00:40:55,955 --> 00:40:57,888
(woman 5) Leila, 
from Wikimedia Foundation.

794
00:40:57,888 --> 00:41:00,155
Just one note on--

795
00:41:00,155 --> 00:41:03,260
You mentioned the Wikimedia Foundation
a couple of times in your presentation,

796
00:41:03,260 --> 00:41:06,533
I wanted to say if you want to do
any kind of data ingestion

797
00:41:06,533 --> 00:41:08,460
or a collaboration with Wikidata,

798
00:41:08,820 --> 00:41:11,200
perhaps Wikimedia Deutschland
would be a better place

799
00:41:11,200 --> 00:41:13,182
to have these conversations.

800
00:41:13,182 --> 00:41:16,256
Because Wikidata lives
within Wikimedia Deutschland

801
00:41:16,256 --> 00:41:17,511
and the team is there,

802
00:41:17,511 --> 00:41:19,971
and also the community 
of volunteers around Wikidata

803
00:41:19,977 --> 00:41:23,710
would be the perfect place to talk

804
00:41:23,710 --> 00:41:25,590
about any kind of ingestion

805
00:41:25,590 --> 00:41:31,136
or working with bringing
PanLex closer to Wikidata.

806
00:41:31,577 --> 00:41:32,688
Great, thank you very much,

807
00:41:32,688 --> 00:41:34,901
because honestly I'm not
exactly super familiar

808
00:41:34,901 --> 00:41:37,823
with all of the intricacies
of the architecture

809
00:41:37,823 --> 00:41:39,740
of how all the projects
relate to each other.

810
00:41:39,740 --> 00:41:41,977
I'm guessing by the laughs
that it's complicated.

811
00:41:41,977 --> 00:41:44,333
But, yeah, so basically
we would want to talk

812
00:41:44,333 --> 00:41:48,333
with whoever is responsible for Wikidata.

813
00:41:48,333 --> 00:41:52,120
So just do a little
[inaudible] place thing,

814
00:41:52,860 --> 00:41:56,470
whoever is responsible for Wikidata,
that's who we're interested in talking to,

815
00:41:56,470 --> 00:41:58,264
which is all of you volunteers.

816
00:42:03,266 --> 00:42:05,044
Any further questions?

817
00:42:10,066 --> 00:42:14,400
Okay, well, if anyone does end up having
any further questions beyond this

818
00:42:14,400 --> 00:42:17,711
or ones that I talked about-- the details 
and specifics about these things,

819
00:42:17,711 --> 00:42:19,800
please come and talk to me, 
I'm super interested.

820
00:42:19,800 --> 00:42:23,977
And especially if you're dealing 
with anything involving lexical stuff,

821
00:42:23,977 --> 00:42:28,666
anything involving 
endangered minority languages

822
00:42:28,666 --> 00:42:30,444
and underserved languages,

823
00:42:30,444 --> 00:42:34,410
and also Unicode, 
which is something I do as well.

824
00:42:36,220 --> 00:42:37,800
So thank you very much

825
00:42:37,800 --> 00:42:39,563
and thank you 
for inviting me to come speak,

826
00:42:39,563 --> 00:42:41,550
I'm hoping that you enjoyed all this.

827
00:42:41,550 --> 00:42:43,753
(applause)