1
00:00:05,888 --> 00:00:09,312
Now, there are approximately
7,500 languages
2
00:00:09,312 --> 00:00:10,806
spoken on the planet today.
3
00:00:11,770 --> 00:00:13,808
Of those, it's estimated
4
00:00:13,808 --> 00:00:18,466
that about 70%
are at risk of not surviving
5
00:00:18,466 --> 00:00:20,355
the end of the 21st century.
6
00:00:22,270 --> 00:00:24,266
Every time a language dies,
7
00:00:24,711 --> 00:00:26,622
it's severing a connection
8
00:00:26,622 --> 00:00:30,590
that has lasted for hundreds
to thousands of years,
9
00:00:30,590 --> 00:00:34,816
to culture, to history,
10
00:00:35,320 --> 00:00:38,150
and to traditions, and to knowledge.
11
00:00:38,933 --> 00:00:42,250
The linguist Kenneth Hale once said
12
00:00:42,250 --> 00:00:44,183
that every time a language dies,
13
00:00:44,183 --> 00:00:46,794
it's like dropping
an atom bomb on the Louvre.
14
00:00:49,377 --> 00:00:51,844
So the question is,
15
00:00:52,730 --> 00:00:54,800
why do languages die?
16
00:00:56,244 --> 00:01:00,155
Well, perhaps the simple answer might be
17
00:01:00,162 --> 00:01:03,051
that one could imagine
authoritarian governments
18
00:01:03,051 --> 00:01:05,311
preventing people from speaking
their native language,
19
00:01:05,844 --> 00:01:09,630
children being punished
for speaking their language at school,
20
00:01:09,866 --> 00:01:12,911
or the government
shutting down radio stations
21
00:01:12,923 --> 00:01:14,644
in the minority language.
22
00:01:15,044 --> 00:01:16,977
And this definitely happened in the past,
23
00:01:16,977 --> 00:01:19,088
and it still, to some extent,
happens today.
24
00:01:19,616 --> 00:01:23,026
But the honest answer
25
00:01:23,026 --> 00:01:26,666
is that for the vast majority
of the cases of language extinction,
26
00:01:27,296 --> 00:01:29,336
it's a much simpler
27
00:01:29,336 --> 00:01:32,555
and a much more easy-to-explain answer.
28
00:01:33,696 --> 00:01:36,222
The languages go extinct
29
00:01:36,220 --> 00:01:37,888
because they are not passed down
30
00:01:37,888 --> 00:01:39,733
from one generation to the next.
31
00:01:42,280 --> 00:01:43,866
Every single time a person who speaks
32
00:01:43,866 --> 00:01:46,088
a minority language has a child,
33
00:01:46,752 --> 00:01:50,355
they go through a calculus.
34
00:01:51,360 --> 00:01:52,800
They ask themselves,
35
00:01:53,660 --> 00:01:56,288
"Do I pass my language down to my child,
36
00:01:56,770 --> 00:02:01,311
or do I instead teach them
only the majority language?"
37
00:02:01,311 --> 00:02:03,222
Essentially, there is a scale that goes on
38
00:02:03,900 --> 00:02:05,844
that they access in their heads,
39
00:02:06,720 --> 00:02:08,355
in which on one side
40
00:02:09,530 --> 00:02:11,733
every single time in their lives
41
00:02:11,737 --> 00:02:14,222
that they've had an opportunity
to use their native language
42
00:02:14,866 --> 00:02:18,490
for communication,
for access to traditional culture,
43
00:02:19,776 --> 00:02:21,748
a stone is placed on the left side.
44
00:02:22,228 --> 00:02:23,840
And every time that they find themselves
45
00:02:23,840 --> 00:02:25,755
unable to use their native language,
46
00:02:25,770 --> 00:02:27,955
and instead have to rely on
the majority language,
47
00:02:27,958 --> 00:02:30,066
a stone is placed on the right side.
48
00:02:31,822 --> 00:02:34,800
Now, due to the strength and the dignity
49
00:02:34,800 --> 00:02:36,600
of being able to speak
one's mother tongue,
50
00:02:36,600 --> 00:02:38,720
the stones on the left
tend to be a bit heavier.
51
00:02:38,720 --> 00:02:42,048
But with enough stones on the right side,
52
00:02:42,560 --> 00:02:44,600
then eventually the scale tips,
53
00:02:44,600 --> 00:02:47,111
and then when a person makes the decision
54
00:02:47,111 --> 00:02:49,150
to pass their language down,
55
00:02:49,160 --> 00:02:50,622
they see their own language
56
00:02:50,622 --> 00:02:52,620
as more of a burden than a blessing.
57
00:02:55,200 --> 00:02:58,676
So the question is,
how do we reverse this?
58
00:02:59,450 --> 00:03:01,777
First, we need to think
about the fact that,
59
00:03:03,511 --> 00:03:04,968
for any given language,
60
00:03:04,970 --> 00:03:07,900
there are certain social spheres
that it can be used in.
61
00:03:07,900 --> 00:03:08,976
So any language
62
00:03:08,976 --> 00:03:10,800
that's a mother tongue spoken today
63
00:03:10,800 --> 00:03:12,990
can be used with one's family.
64
00:03:13,790 --> 00:03:16,671
A smaller set of languages
can be used within one's community,
65
00:03:16,671 --> 00:03:18,660
a smaller set, maybe within one's region,
66
00:03:19,288 --> 00:03:22,155
and for a small handful of languages,
67
00:03:22,511 --> 00:03:24,488
they can be used
for international communication.
68
00:03:25,824 --> 00:03:28,640
And then even across these spheres,
69
00:03:28,640 --> 00:03:31,712
there's the question of can someone
use their language,
70
00:03:31,712 --> 00:03:35,533
for the purpose of education or business,
71
00:03:35,911 --> 00:03:37,600
or in technology?
72
00:03:39,136 --> 00:03:41,952
So, to better explain
73
00:03:43,200 --> 00:03:44,530
what I'm talking about here,
74
00:03:44,530 --> 00:03:46,393
I would like to use an anecdote.
75
00:03:48,400 --> 00:03:50,400
Let's say that you are about to go
76
00:03:50,400 --> 00:03:52,280
on your dream vacation to India,
77
00:03:53,155 --> 00:03:56,032
and you have an eight-hour
layover in Istanbul.
78
00:03:57,312 --> 00:04:00,640
Now, you weren't necessarily
planning on visiting Turkey,
79
00:04:00,896 --> 00:04:04,266
but with your layover
and with a Turkish friend
80
00:04:04,266 --> 00:04:05,933
telling you about an amazing restaurant
81
00:04:05,933 --> 00:04:07,400
that's not too far from the airport,
82
00:04:07,800 --> 00:04:10,600
you say, "Hey, you know,
maybe I'll stop by during my layover."
83
00:04:11,022 --> 00:04:12,920
So, you exit the airport,
84
00:04:13,950 --> 00:04:15,480
you get to your restaurant,
85
00:04:15,480 --> 00:04:17,020
and they hand you a menu,
86
00:04:17,020 --> 00:04:19,086
and the menu is entirely in Turkish.
87
00:04:20,170 --> 00:04:22,911
Now, let's say,
for the point of this exercise,
88
00:04:22,911 --> 00:04:24,377
that you don't speak Turkish.
89
00:04:25,210 --> 00:04:26,535
What do you do?
90
00:04:28,155 --> 00:04:29,744
Well, best-case scenario,
91
00:04:29,744 --> 00:04:32,177
you find someone perhaps
who can speak your native language,
92
00:04:32,383 --> 00:04:34,264
German, English, etc.
93
00:04:36,220 --> 00:04:37,997
But let's say it's not your lucky day
94
00:04:38,000 --> 00:04:41,066
and nobody in the restaurant can speak
any German or any English.
95
00:04:42,000 --> 00:04:43,377
So what do you do?
96
00:04:43,377 --> 00:04:45,995
Well, if you are like me,
and I imagine most of you,
97
00:04:45,995 --> 00:04:48,130
you've probably turned
to a technological solution,
98
00:04:49,535 --> 00:04:52,351
machine translation
or a digital dictionary,
99
00:04:52,607 --> 00:04:54,196
look up each word individually,
100
00:04:54,399 --> 00:04:57,733
and eventually order yourself
a delicious Turkish meal.
101
00:04:59,970 --> 00:05:02,844
Now, let's imagine this scenario instead,
102
00:05:03,610 --> 00:05:06,400
in which you are the native speaker
of a minority language.
103
00:05:07,455 --> 00:05:09,333
Let's say, Lower Sorbian.
104
00:05:09,333 --> 00:05:11,000
Lower Sorbian is an endangered language
105
00:05:11,000 --> 00:05:12,488
spoken here in Germany,
106
00:05:12,488 --> 00:05:16,888
about 130 kilometers
to the southeast from here,
107
00:05:17,711 --> 00:05:20,857
that's spoken only by
a few thousand people, mostly elderly.
108
00:05:22,810 --> 00:05:25,111
Now, let's say your mother tongue
is Lower Sorbian.
109
00:05:25,370 --> 00:05:26,773
You end up in the restaurant.
110
00:05:26,773 --> 00:05:28,462
Now, of course, the odds
of finding someone
111
00:05:28,462 --> 00:05:31,387
who speaks your native language
in the restaurant are extraordinarily low.
112
00:05:32,280 --> 00:05:36,412
But, again, you can just go
to a technological solution.
113
00:05:36,890 --> 00:05:39,333
However, for your native language,
114
00:05:39,333 --> 00:05:41,718
these technological solutions don't exist.
115
00:05:42,010 --> 00:05:44,991
You would have to rely on
German or English
116
00:05:44,991 --> 00:05:47,488
as your pivot language into Turkish.
117
00:05:48,920 --> 00:05:52,382
Now, of course, you still end up
getting your delicious Turkish meal,
118
00:05:52,382 --> 00:05:54,860
but you begin to think about
how difficult this would have been
119
00:05:54,860 --> 00:05:57,170
if you were your grandfather,
who spoke no German at all.
120
00:05:58,244 --> 00:05:59,840
Now, this is just a small incident,
121
00:05:59,844 --> 00:06:04,787
but it's going to place a stone
on the right side of that scale,
122
00:06:05,310 --> 00:06:07,053
and make you think perhaps
123
00:06:07,053 --> 00:06:09,898
maybe when I have children
or maybe when I have another child,
124
00:06:10,943 --> 00:06:14,726
the burden that you went through with this
125
00:06:14,726 --> 00:06:17,133
may not be worth it
to keep your language.
126
00:06:19,391 --> 00:06:21,284
And imagine if this was a scenario
127
00:06:21,284 --> 00:06:26,177
that was of significantly more importance,
128
00:06:26,177 --> 00:06:28,380
such as, for example, being in a hospital.
129
00:06:31,133 --> 00:06:36,161
Now, this is the point
in which we can help--
130
00:06:36,790 --> 00:06:40,242
by we, I mean me and you
in this room can help.
131
00:06:41,400 --> 00:06:43,355
We have the tools
to be able to help this.
132
00:06:45,155 --> 00:06:47,355
If technological tools
are available for people
133
00:06:47,355 --> 00:06:49,350
who speak minority
and underserved languages,
134
00:06:50,555 --> 00:06:54,022
it puts a little finger on the scale,
on the left side of the scale.
135
00:06:54,022 --> 00:06:55,776
Someone doesn't necessarily have to think
136
00:06:55,776 --> 00:06:57,680
that they have to rely on
a minority language
137
00:06:57,680 --> 00:06:59,488
in order to interact
with the outside world,
138
00:07:00,351 --> 00:07:05,111
because it opens the social spheres
139
00:07:05,111 --> 00:07:06,328
a little bit more.
140
00:07:07,910 --> 00:07:10,333
So, of course, the ideal solution
141
00:07:10,333 --> 00:07:13,022
is that we have machine translation
in every language in the world.
142
00:07:13,022 --> 00:07:16,831
But, unfortunately,
that's just not feasible.
143
00:07:16,831 --> 00:07:19,800
Machine translation
requires large corpuses of text,
144
00:07:19,800 --> 00:07:21,088
and for many of these languages
145
00:07:21,088 --> 00:07:23,080
that are endangered or underserved,
146
00:07:23,391 --> 00:07:25,439
such data is simply not available.
147
00:07:26,309 --> 00:07:28,279
Some of them aren't even commonly written
148
00:07:29,000 --> 00:07:32,825
and thus getting enough data
to make a machine translation engine
149
00:07:32,825 --> 00:07:34,390
is unlikely.
150
00:07:34,390 --> 00:07:38,060
But what is available is lexical data.
151
00:07:40,244 --> 00:07:43,444
Through the work of many linguists
152
00:07:43,444 --> 00:07:45,440
over the past few hundred years,
153
00:07:47,777 --> 00:07:49,728
dictionaries and grammars
have been produced
154
00:07:49,728 --> 00:07:51,680
for most of the world's languages.
155
00:07:53,920 --> 00:07:56,511
But, unfortunately, most of these works
156
00:07:56,511 --> 00:08:00,644
are not accessible
or available to the world,
157
00:08:00,647 --> 00:08:03,533
let alone to speakers
of these minority languages.
158
00:08:04,522 --> 00:08:06,377
And it's not an intentional process,
159
00:08:06,377 --> 00:08:07,910
a lot of times it's simply because
160
00:08:07,910 --> 00:08:10,785
the initial print run
of these dictionaries was small,
161
00:08:11,155 --> 00:08:12,543
and the only copies
162
00:08:12,543 --> 00:08:16,244
are moldering away
in a university library somewhere.
163
00:08:17,511 --> 00:08:21,333
But we have the ability to take that data
164
00:08:21,333 --> 00:08:23,330
and make it accessible to the world.
165
00:08:24,133 --> 00:08:28,377
The Wikimedia Foundation
is one of the best organizations,
166
00:08:28,377 --> 00:08:30,555
I would say the best
organization in the world,
167
00:08:30,975 --> 00:08:33,396
for getting data available
168
00:08:33,396 --> 00:08:36,688
to the vast majority
of the population of this planet.
169
00:08:38,533 --> 00:08:40,134
So let's work on that.
170
00:08:41,000 --> 00:08:43,222
So to explain a little bit
171
00:08:43,224 --> 00:08:45,050
about what we've been doing
in this regard,
172
00:08:45,311 --> 00:08:48,127
I'd like to introduce
my organization, PanLex,
173
00:08:48,711 --> 00:08:51,888
which is an organization
that is attempting
174
00:08:51,888 --> 00:08:54,146
to collect lexical data for this purpose.
175
00:08:54,780 --> 00:08:56,830
We got started about 12 years ago
176
00:08:56,830 --> 00:08:59,600
at the University of Washington,
as a research project.
177
00:08:59,600 --> 00:09:01,088
The idea behind it
178
00:09:01,088 --> 00:09:03,990
was to show that inferred translations
179
00:09:04,377 --> 00:09:07,125
could create an effective
translation device,
180
00:09:07,125 --> 00:09:09,088
essentially a lexical translation device.
181
00:09:09,088 --> 00:09:12,223
This is an example
from PanLex data itself.
182
00:09:12,680 --> 00:09:14,057
This is showing how to translate
183
00:09:14,066 --> 00:09:17,805
the word "ev" in Turkish,
which means house,
184
00:09:17,805 --> 00:09:19,555
to Lower Sorbian,
185
00:09:19,555 --> 00:09:21,201
the language I was referring to earlier.
186
00:09:21,212 --> 00:09:23,190
So it's unlikely to find
187
00:09:24,333 --> 00:09:26,200
Turkish to Lower Sorbian dictionaries,
188
00:09:26,200 --> 00:09:28,244
but by passing it through
189
00:09:28,244 --> 00:09:30,240
many, many different
intermediate languages,
190
00:09:30,488 --> 00:09:32,600
you can create effective translations.
191
00:09:34,333 --> 00:09:36,911
So, once this was shown
in the research projects,
192
00:09:36,911 --> 00:09:39,631
the founder of PanLex,
Dr. Jonathan Pool,
193
00:09:40,711 --> 00:09:43,666
decided, "Well, you know,
why not actually just do this?"
194
00:09:43,666 --> 00:09:45,470
So he started a non-profit
195
00:09:45,470 --> 00:09:48,522
to collect as much lexical data
as possible and make it accessible.
196
00:09:48,911 --> 00:09:51,066
That's what we've been doing
for the past 12 years.
197
00:09:51,066 --> 00:09:54,516
In that time, we've collected
thousands and thousands of dictionaries,
198
00:09:54,516 --> 00:09:56,479
and extracted lexical data out of them
199
00:09:56,479 --> 00:10:01,340
and compiled a database that allows
inferred lexical translation
200
00:10:01,340 --> 00:10:03,755
across any of--
201
00:10:03,755 --> 00:10:05,866
Our current count is around 5,500
202
00:10:05,860 --> 00:10:07,955
of the 7,500 languages in the world.
203
00:10:08,511 --> 00:10:10,685
And, of course,
204
00:10:10,685 --> 00:10:12,221
we're constantly trying to expand that
205
00:10:12,221 --> 00:10:14,784
and expand the data
on each individual language.
206
00:10:17,220 --> 00:10:21,111
So, the next question is,
207
00:10:22,079 --> 00:10:25,663
what can we do to work together on this?
208
00:10:26,680 --> 00:10:28,931
We, at PanLex, have been
extremely excited to watch
209
00:10:28,931 --> 00:10:31,260
the development on lexical data
210
00:10:31,260 --> 00:10:34,175
that Wikidata has been working on lately.
211
00:10:35,155 --> 00:10:37,548
It's very fascinating to see organizations
212
00:10:37,550 --> 00:10:39,476
that are working in a very similar sphere,
213
00:10:39,476 --> 00:10:41,183
but in different aspects.
214
00:10:41,535 --> 00:10:44,351
And we are extremely excited to see
215
00:10:44,733 --> 00:10:46,466
the results of this from Wikidata.
216
00:10:46,466 --> 00:10:51,144
And also we are looking forward
to collaborating with Wikidata.
217
00:10:53,844 --> 00:10:56,271
I think that the special skills
218
00:10:56,271 --> 00:10:58,022
that we've developed
over the past 12 years,
219
00:10:58,022 --> 00:11:01,555
with not just collecting lexical data,
but also in database design,
220
00:11:01,557 --> 00:11:03,908
could be extremely useful for Wikidata.
221
00:11:03,910 --> 00:11:07,111
And on the other side, I think that--
222
00:11:08,415 --> 00:11:10,975
I especially am excited about Wikidata's
223
00:11:11,743 --> 00:11:14,549
ability to do crowdsourcing of data.
224
00:11:15,129 --> 00:11:18,047
PanLex, currently,
our sources are entirely
225
00:11:18,399 --> 00:11:20,959
printed lexical sources
or other types of lexical sources,
226
00:11:21,170 --> 00:11:22,662
but we don't do any crowdsourcing.
227
00:11:22,670 --> 00:11:24,920
We simply don't have
the infrastructure for it available
228
00:11:24,920 --> 00:11:26,931
and of course, the Wikimedia Foundation
229
00:11:26,933 --> 00:11:28,930
is the world expert in crowdsourcing.
230
00:11:31,848 --> 00:11:33,728
I'm really looking
forward to seeing exactly
231
00:11:33,733 --> 00:11:35,680
how we can apply these skills together.
232
00:11:38,533 --> 00:11:41,600
But, overall, I think the main thing
to think about this
233
00:11:41,600 --> 00:11:43,457
is that when we're
working on these things,
234
00:11:43,461 --> 00:11:45,133
it's minute detail.
235
00:11:45,133 --> 00:11:47,533
We're sitting around
looking at grammatical forms,
236
00:11:47,533 --> 00:11:51,911
or paging our way through
dictionaries, ancient dictionaries,
237
00:11:51,915 --> 00:11:53,977
or sometimes
recently published dictionaries
238
00:11:53,977 --> 00:11:57,466
and getting into written forms of words,
239
00:11:57,466 --> 00:11:59,994
and it feels very close up.
240
00:11:59,994 --> 00:12:01,535
But, occasionally, we need to remember
241
00:12:01,535 --> 00:12:02,556
to take a step back
242
00:12:02,556 --> 00:12:04,951
in that, even though what we're doing
243
00:12:06,231 --> 00:12:08,831
can feel even mundane at times,
244
00:12:10,091 --> 00:12:11,957
the work we're doing
is extremely important.
245
00:12:13,010 --> 00:12:15,666
This is, in my opinion,
the absolute best way
246
00:12:15,666 --> 00:12:18,862
that we can support endangered languages
247
00:12:18,862 --> 00:12:21,488
and make sure that the linguistic
diversity of the planet
248
00:12:21,488 --> 00:12:25,730
is preserved up to the end
of this century or longer.
249
00:12:26,444 --> 00:12:29,644
It's entirely possible that the work
that we're doing today
250
00:12:29,644 --> 00:12:32,577
may result in languages
251
00:12:32,577 --> 00:12:35,355
being preserved and passed down,
252
00:12:35,355 --> 00:12:36,955
and not going extinct.
253
00:12:38,527 --> 00:12:40,605
So just to remember
254
00:12:40,605 --> 00:12:43,207
that even if you're sitting
around on your computer
255
00:12:43,207 --> 00:12:44,480
editing an individual entry
256
00:12:44,480 --> 00:12:49,707
and adding the dative form
of a small minority language
257
00:12:49,707 --> 00:12:51,796
for every single noun,
258
00:12:51,800 --> 00:12:54,577
the little thing
that you're doing right now,
259
00:12:54,577 --> 00:12:57,528
might actually be partially responsible
260
00:12:57,533 --> 00:12:59,155
for making sure that language survives
261
00:12:59,155 --> 00:13:01,060
until the end of the century or longer.
262
00:13:02,591 --> 00:13:03,703
Thank you very much,
263
00:13:03,703 --> 00:13:05,717
and I'd like to open
the floor to questions.
264
00:13:06,222 --> 00:13:08,373
(applause)
265
00:13:23,688 --> 00:13:24,977
(woman 1) Thank you.
266
00:13:24,977 --> 00:13:26,701
- Thank you for your talk.
- Thank you.
267
00:13:26,701 --> 00:13:28,777
(woman 1) I just have a question
about dictionaries.
268
00:13:28,777 --> 00:13:31,107
You said that you work
with printed dictionaries?
269
00:13:31,107 --> 00:13:32,312
- Yes.
- (woman 1) So my question
270
00:13:32,312 --> 00:13:34,508
is what do you take
from those dictionaries
271
00:13:34,511 --> 00:13:38,222
and if there's any copyright thing
you have to deal with?
272
00:13:38,222 --> 00:13:41,060
I anticipated this to be
the first question that I would get.
273
00:13:41,060 --> 00:13:42,827
(laughter)
274
00:13:42,827 --> 00:13:46,358
So, first off, for PanLex,
275
00:13:46,358 --> 00:13:50,244
we have, according to our legal
resources that we have consulted,
276
00:13:52,734 --> 00:13:57,466
whereas the arrangement and organization
of a dictionary is copyrightable,
277
00:13:57,466 --> 00:14:03,260
the translation itself
is not considered copyrightable.
278
00:14:04,170 --> 00:14:05,808
A good example is like, for example,
279
00:14:05,808 --> 00:14:10,525
a phone book is considered,
at least according to US law,
280
00:14:10,956 --> 00:14:11,965
copyrightable.
281
00:14:11,965 --> 00:14:16,800
But saying that person X's
phone number is digits D
282
00:14:16,800 --> 00:14:18,360
is not copyrightable.
283
00:14:21,666 --> 00:14:23,444
So like I said,
284
00:14:23,444 --> 00:14:25,311
according to our legal scholars,
285
00:14:25,311 --> 00:14:27,333
this is how we can deal with this.
286
00:14:27,333 --> 00:14:30,666
But even if that's not
a solid enough legal argument,
287
00:14:30,666 --> 00:14:32,063
one important thing to remember
288
00:14:32,063 --> 00:14:38,269
is that the vast majority
of these lexical data,
289
00:14:39,355 --> 00:14:40,530
is actually out of copyright.
290
00:14:40,530 --> 00:14:42,822
A significant number
of these are out of copyright
291
00:14:42,822 --> 00:14:44,333
and thus can be used without [end].
292
00:14:44,333 --> 00:14:46,783
And the other thing
is that oftentimes, for example,
293
00:14:47,311 --> 00:14:49,644
if we're working with
a recently made print dictionary,
294
00:14:49,640 --> 00:14:51,577
rather than trying to scan it and OCR it,
295
00:14:51,577 --> 00:14:53,439
we just email the person who made it.
296
00:14:53,439 --> 00:14:57,600
And it turns out that
most linguists are really excited
297
00:14:57,600 --> 00:14:59,600
that their data can be made accessible.
298
00:14:59,600 --> 00:15:01,267
And so they're like, "Sure, please,
299
00:15:01,267 --> 00:15:03,273
just put it all in there
and make it accessible."
300
00:15:05,533 --> 00:15:08,424
So like I said, we have, at least,
according to our legal opinions,
301
00:15:08,424 --> 00:15:09,466
we have the ability,
302
00:15:09,466 --> 00:15:11,177
but even if you don't want
to go with that,
303
00:15:11,177 --> 00:15:15,644
it's very easy to get
the data publicly accessible.
304
00:15:26,288 --> 00:15:28,470
- (man 1) Thank you. Hi.
- Hi.
305
00:15:28,470 --> 00:15:29,830
(man 1) Can you say a little more
306
00:15:29,830 --> 00:15:35,031
about how the person who speaks
Lower Sorbian is accessing the data?
307
00:15:35,031 --> 00:15:38,355
Like specifically how
that information is getting to them
308
00:15:38,357 --> 00:15:40,977
and how that might help to convince them
309
00:15:40,977 --> 00:15:42,800
to either try out the--
310
00:15:42,800 --> 00:15:44,680
Great question and this is actually
311
00:15:44,680 --> 00:15:46,266
one that I think about a lot as well,
312
00:15:46,266 --> 00:15:49,759
because I think that
when we talk about data access,
313
00:15:50,270 --> 00:15:53,244
there are actually
multiple steps to this.
314
00:15:53,244 --> 00:15:56,288
One is, of course, data preservation,
make sure the data doesn't go away.
315
00:15:56,288 --> 00:15:58,911
Secondly, is make sure it's interoperable
316
00:15:59,177 --> 00:16:01,844
and can be used.
317
00:16:01,844 --> 00:16:05,370
And thirdly is make sure
that it's available.
318
00:16:05,631 --> 00:16:07,333
So in PanLex's case,
319
00:16:07,333 --> 00:16:09,755
we have an API that can be used,
320
00:16:09,755 --> 00:16:11,888
but, obviously,
that can't be used by an end user.
321
00:16:11,888 --> 00:16:14,847
But we've also developed interfaces.
322
00:16:15,155 --> 00:16:19,727
And so, for example,
if you go to translate.panlex.org,
323
00:16:19,728 --> 00:16:22,711
you can do translations on our database.
324
00:16:22,711 --> 00:16:25,864
If you want to mess around
with the API, just go to dev.panlex.org,
325
00:16:25,866 --> 00:16:29,222
and you can find a bunch of stuff
on the API, or just api.panlex.org.
326
00:16:30,950 --> 00:16:32,542
But there's another step too,
327
00:16:32,542 --> 00:16:36,577
which is that even if you make
all of your data completely accessible
328
00:16:36,570 --> 00:16:40,533
with tools that are super useful
to be able to access it,
329
00:16:41,210 --> 00:16:43,244
if you don't actually promote the tools,
330
00:16:43,244 --> 00:16:45,058
then people won't actually
be able to use it.
331
00:16:45,058 --> 00:16:47,177
And this is honestly kind of a...
332
00:16:48,827 --> 00:16:51,044
the thing that isn't talked about enough,
333
00:16:51,044 --> 00:16:52,955
and I don't have a good answer for it.
334
00:16:52,955 --> 00:16:54,800
How do we make sure that--
335
00:16:55,022 --> 00:16:56,933
For example, I only fairly recently,
336
00:16:56,933 --> 00:16:59,647
only a few years ago
got acquainted with Wikidata,
337
00:16:59,647 --> 00:17:02,463
and it's exactly the kind
of thing that I'm interested in.
338
00:17:02,970 --> 00:17:07,177
So, how do we promote
ourselves to others?
339
00:17:07,177 --> 00:17:08,780
I'm leaving that as an open question.
340
00:17:08,780 --> 00:17:10,800
Like I said, I don't have
a good answer for this.
341
00:17:10,800 --> 00:17:12,888
But, of course, in order to do that,
342
00:17:12,888 --> 00:17:14,880
we still need to accomplish
the first few steps.
343
00:17:22,133 --> 00:17:24,777
(man 2) If we want to have
machine translation,
344
00:17:24,777 --> 00:17:27,822
don't we need a translation memory?
345
00:17:27,827 --> 00:17:30,666
I'm not sure that the individual words
346
00:17:30,666 --> 00:17:32,918
that we put into Wikidata,
347
00:17:32,918 --> 00:17:36,558
these short phrases
that we put into Wikidata,
348
00:17:36,558 --> 00:17:41,130
either as ordinary Wikidata items
or as Wikidata lexemes,
349
00:17:41,130 --> 00:17:43,953
are sufficient to do a proper translation.
350
00:17:43,955 --> 00:17:46,600
We need to have full sentences,
for example, for--
351
00:17:46,772 --> 00:17:48,320
(Benjamin) Yeah, absolutely.
352
00:17:48,577 --> 00:17:51,422
(man 2) And where do we get
this data structure?
353
00:17:51,422 --> 00:17:55,177
I'm not sure that, currently,
354
00:17:55,177 --> 00:17:59,533
Wikidata is able to very well handle
355
00:17:59,533 --> 00:18:03,066
the issue of a translation memory,
356
00:18:04,324 --> 00:18:05,965
translatewiki.net,
357
00:18:05,965 --> 00:18:09,490
for getting into that gap of...
358
00:18:12,111 --> 00:18:14,993
Should we do anything
in that respect, or should we--
359
00:18:15,000 --> 00:18:17,133
Yeah, and I really
appreciate your question.
360
00:18:17,135 --> 00:18:18,715
I touched on this a little bit earlier,
361
00:18:18,715 --> 00:18:20,361
but I'd love to reiterate it.
362
00:18:21,356 --> 00:18:24,955
This is precisely the reason
that PanLex works in lexical data
363
00:18:24,955 --> 00:18:27,030
and why I'm excited about lexical data,
364
00:18:27,030 --> 00:18:29,935
as opposed to--
not as opposed to, but in addition
365
00:18:29,935 --> 00:18:35,207
to machine translation engines
and machine translation in general.
366
00:18:35,900 --> 00:18:39,200
As you said, machine translation
requires a specific kind of data,
367
00:18:39,740 --> 00:18:43,123
and that data is not available
for most of the world's languages.
368
00:18:43,123 --> 00:18:44,966
For the vast majority
of the world's languages,
369
00:18:44,966 --> 00:18:46,379
that simply is not available.
370
00:18:46,650 --> 00:18:48,447
But that doesn't mean
we should just give up.
371
00:18:48,447 --> 00:18:49,627
Like why?
372
00:18:51,260 --> 00:18:54,444
If I needed to translate
my Turkish restaurant menu,
373
00:18:54,755 --> 00:18:59,360
then lexical translation will likely
be an exceptionally good tool for that.
374
00:18:59,360 --> 00:19:01,715
Now, I'm not saying
that you can use lexical translation
375
00:19:01,715 --> 00:19:04,600
to do perfect paragraph
to paragraph translation.
376
00:19:04,600 --> 00:19:06,866
When I say lexical translation,
I mean word to word
377
00:19:06,866 --> 00:19:09,670
and word to word translation
can be extremely useful,
378
00:19:12,231 --> 00:19:14,708
It's funny to think about it,
but we didn't really have access
379
00:19:14,708 --> 00:19:16,620
to really good machine translation.
380
00:19:16,620 --> 00:19:20,191
Everyone didn't have
access to that until fairly recently.
381
00:19:20,191 --> 00:19:23,649
And we still got by with dictionaries,
382
00:19:23,649 --> 00:19:27,687
and they're an incredibly good resource.
383
00:19:28,311 --> 00:19:31,288
And the data is available,
so why not make it available
384
00:19:31,288 --> 00:19:34,377
to the world at large
and to the speakers of these languages?
385
00:19:36,422 --> 00:19:38,666
(woman 2) Hi, what mechanisms
do you have in place
386
00:19:38,666 --> 00:19:40,666
when the community itself--I'm over here.
387
00:19:40,666 --> 00:19:43,253
- Where are you? Okay, right.
- (woman 2) Yeah, sorry. (laughs)
388
00:19:43,253 --> 00:19:44,577
...when the community itself
389
00:19:44,577 --> 00:19:47,320
doesn't want part of their data in PanLex?
390
00:19:47,320 --> 00:19:48,933
Great question.
391
00:19:48,933 --> 00:19:51,955
So the way that we work with that
392
00:19:51,955 --> 00:19:56,287
is that if a dictionary is published
and made publicly available,
393
00:19:56,666 --> 00:19:58,133
that's a good indication.
394
00:19:58,133 --> 00:20:02,400
Like you could buy it in a store
or at a university library,
395
00:20:02,400 --> 00:20:04,690
or a public library anyone can access.
396
00:20:04,690 --> 00:20:08,080
That's a good indication
that that decision has been made.
397
00:20:08,080 --> 00:20:11,577
(woman 2) [inaudible]
398
00:20:15,740 --> 00:20:18,266
(man 3) Please, [inaudible],
could you speak in the microphone?
399
00:20:19,295 --> 00:20:20,447
Can you say it again?
400
00:20:20,447 --> 00:20:23,307
(woman 2) Linguists don't always have
the permission of the community.
401
00:20:23,307 --> 00:20:24,387
In order to publish things,
402
00:20:24,387 --> 00:20:27,533
they oftentimes publish things
without the consent of the community.
403
00:20:27,533 --> 00:20:29,577
And that's absolutely true.
404
00:20:29,577 --> 00:20:32,533
I would say that is a--
405
00:20:32,533 --> 00:20:34,422
That does happen.
406
00:20:34,422 --> 00:20:36,770
I would say it's generally
a small minority of cases,
407
00:20:36,770 --> 00:20:40,955
mostly confined
to generally North America,
408
00:20:40,955 --> 00:20:43,355
although sometimes
South American languages as well.
409
00:20:44,765 --> 00:20:46,488
It's something we have
to take into account.
410
00:20:46,488 --> 00:20:49,288
If we were to receive word, for example,
411
00:20:49,288 --> 00:20:52,377
that the data that is in PanLex
412
00:20:52,377 --> 00:20:56,330
should not be accessed
by the greater world,
413
00:20:56,330 --> 00:20:58,040
then, of course, we would remove it.
414
00:20:58,040 --> 00:20:59,310
(woman 2) Good, good.
415
00:21:01,281 --> 00:21:02,451
That doesn't mean, of course,
416
00:21:02,451 --> 00:21:04,391
that we'll listen
to copyright rules necessarily
417
00:21:04,391 --> 00:21:06,542
but we will listen
to traditional communities,
418
00:21:06,542 --> 00:21:08,157
and that's the major difference.
419
00:21:08,157 --> 00:21:10,252
(woman 2) Yeah,
that's what I'm referring to.
420
00:21:15,022 --> 00:21:16,755
It brings up a really interesting point,
421
00:21:16,755 --> 00:21:18,350
which is that
422
00:21:18,844 --> 00:21:22,244
sometimes it's a really big question
of who speaks for a language.
423
00:21:23,000 --> 00:21:27,911
I had some experience actually
visiting the American Southwest
424
00:21:27,911 --> 00:21:29,755
and working with some groups,
425
00:21:29,777 --> 00:21:32,288
who work on indigenous,
the Pueblo languages out there.
426
00:21:36,053 --> 00:21:38,044
So there are approximately
427
00:21:38,044 --> 00:21:40,220
six Pueblo languages,
depending on how you slice it,
428
00:21:40,220 --> 00:21:41,955
spoken in that area.
429
00:21:41,955 --> 00:21:44,022
But they are divided
amongst 18 different Pueblos
430
00:21:44,320 --> 00:21:47,066
and each one has their own
tribal government,
431
00:21:47,066 --> 00:21:50,022
and each government
may have a different opinion
432
00:21:50,022 --> 00:21:54,007
on whether their language
should be accessible to outsiders or not.
433
00:21:56,626 --> 00:21:58,170
Like, for example, Zuni Pueblo,
434
00:21:58,170 --> 00:22:01,472
it's a single Pueblo
that speaks the Zuni language.
435
00:22:02,923 --> 00:22:05,274
And they're really big
on their language going everywhere,
436
00:22:05,274 --> 00:22:07,694
they put it on the street signs
and everything, it's great.
437
00:22:07,694 --> 00:22:10,637
But for some of the other languages,
438
00:22:10,644 --> 00:22:13,051
you might have one group that says,
439
00:22:13,051 --> 00:22:15,866
"Yeah, we don't want our language
being accessed by outsiders."
440
00:22:15,871 --> 00:22:18,838
But then you have the neighboring Pueblo
who speaks the same language say,
441
00:22:18,838 --> 00:22:21,666
"We really want our language
accessible to outsiders
442
00:22:21,666 --> 00:22:24,088
in using these technological tools,
443
00:22:24,088 --> 00:22:26,560
because we want our language
to be able to continue on."
444
00:22:26,560 --> 00:22:29,488
And it raises a really
interesting ethical question.
445
00:22:29,488 --> 00:22:31,651
Because if you default by saying,
446
00:22:31,651 --> 00:22:34,622
"Fine, I'm cutting it off because
this group said we should cut it off"--
447
00:22:34,622 --> 00:22:36,711
aren't you also disservicing
the second group
448
00:22:36,711 --> 00:22:39,360
because they actively
want you to roll out these things?
449
00:22:39,360 --> 00:22:42,755
So I don't think this is a question
that has an easy answer.
450
00:22:42,755 --> 00:22:44,955
But I would say
at least in terms of PanLex--
451
00:22:44,955 --> 00:22:48,938
And for the record, we actually
haven't encountered this yet,
452
00:22:48,938 --> 00:22:50,407
that I'm aware of.
453
00:22:50,933 --> 00:22:52,920
Now, that could be partially because...
454
00:22:53,666 --> 00:22:55,444
Getting back to his question,
455
00:22:55,666 --> 00:22:57,790
we may need to promote more. (chuckles)
456
00:22:58,660 --> 00:23:02,155
But, in general, as far as I know,
457
00:23:02,155 --> 00:23:04,488
we have not had this come up.
458
00:23:04,488 --> 00:23:06,871
But our game plan for this
459
00:23:06,871 --> 00:23:10,975
is if a community says they don't want
their data in a database,
460
00:23:10,975 --> 00:23:12,095
then we remove it.
461
00:23:12,095 --> 00:23:14,916
(woman 2) Because we have come up
with it in Wikidata and Wikipedia...
462
00:23:14,916 --> 00:23:16,140
- You have?
- (woman 2) ...in comments.
463
00:23:16,140 --> 00:23:17,407
- Really?
- (woman 2) It's been a problem.
464
00:23:17,407 --> 00:23:20,488
Yeah, I can imagine especially in comments
for photos or certain things.
465
00:23:20,488 --> 00:23:21,900
(woman 2) Correct.
466
00:23:27,177 --> 00:23:33,170
(man 4) Hi, I had a question about
the crowdsourcing aspect of this.
467
00:23:34,087 --> 00:23:36,644
As far as going in and asking a community
468
00:23:36,654 --> 00:23:40,480
to annotate or add data for a dataset,
469
00:23:40,480 --> 00:23:44,200
one of the things
that's a little intimidating is like,
470
00:23:44,711 --> 00:23:49,244
as an editor, I can only see
what things are missing.
471
00:23:49,244 --> 00:23:53,242
But if I'm going to spend time
on things, having an idea,
472
00:23:53,582 --> 00:23:56,672
there's a list of high priority items,
473
00:23:57,755 --> 00:24:01,198
that's, I guess,
very motivating in this aspect.
474
00:24:01,200 --> 00:24:04,222
And I was curious if you had a system
475
00:24:04,222 --> 00:24:07,866
which is, essentially, like,
we know the gaps in our own data,
476
00:24:07,866 --> 00:24:12,088
we have linguistic evidence
to know that these are the ones
477
00:24:12,088 --> 00:24:15,530
that if we had annotated,
these would be the high impact drivers.
478
00:24:15,530 --> 00:24:17,152
So I can imagine
479
00:24:18,202 --> 00:24:21,405
having the lexeme
for "house" very impactful,
480
00:24:21,405 --> 00:24:24,977
maybe not a lexeme
for a data or some other like.
481
00:24:24,977 --> 00:24:28,947
But I was curious if you had that,
if it is something
482
00:24:30,217 --> 00:24:35,480
that could be used
to drive these community efforts.
483
00:24:35,840 --> 00:24:37,066
Great question.
484
00:24:37,200 --> 00:24:41,216
So one thing that Wikidata
has a whole lot of--
485
00:24:41,216 --> 00:24:44,666
sorry, excuse me, PanLex
has a whole lot of are Swadesh lists.
486
00:24:44,666 --> 00:24:47,511
We have apparently the largest collection
of Swadesh lists in the world
487
00:24:47,511 --> 00:24:48,555
which is interesting.
488
00:24:48,555 --> 00:24:50,212
If you don't know what a Swadesh list is,
489
00:24:50,212 --> 00:24:56,244
it's essentially a regularized
list of lexical items
490
00:24:56,244 --> 00:25:00,040
that can be used
for analysis of languages.
491
00:25:00,040 --> 00:25:02,730
They contain really basic sets.
492
00:25:02,730 --> 00:25:05,003
So there's a couple
of different kinds of Swadesh lists.
493
00:25:05,003 --> 00:25:07,328
But there are 100 or 213 items
494
00:25:07,328 --> 00:25:08,911
and they might contain
495
00:25:08,911 --> 00:25:12,777
words like "house" and "eye" and "skin"
496
00:25:12,777 --> 00:25:14,444
and basically general words
497
00:25:14,444 --> 00:25:16,331
that you should be able
to find in any language.
498
00:25:16,331 --> 00:25:19,888
So that's like a really
good starting point
499
00:25:19,888 --> 00:25:22,988
for having that kind of data available.
500
00:25:29,090 --> 00:25:31,126
Now, as I mentioned before,
501
00:25:31,133 --> 00:25:33,600
crowdsourcing is something
that we don't do yet
502
00:25:33,600 --> 00:25:36,066
and we're actually
really excited to be able to do.
503
00:25:36,066 --> 00:25:37,554
It's one of the things I'm really excited
504
00:25:37,554 --> 00:25:38,993
to talk to people
at this conference about,
505
00:25:38,993 --> 00:25:42,982
is how crowdsourcing can be used
506
00:25:42,982 --> 00:25:45,931
and the logistics behind it,
507
00:25:46,200 --> 00:25:48,867
and these are the kind
of questions that can come up.
508
00:25:51,288 --> 00:25:53,400
So I guess the answer I can say to you
509
00:25:53,400 --> 00:25:55,376
is that we do have a priority list--
510
00:25:55,376 --> 00:25:57,684
Actually, one thing I can say
is we definitely do have a priority list
511
00:25:57,684 --> 00:25:59,730
when it comes to which languages
we are seeking out.
512
00:25:59,730 --> 00:26:02,222
So the way we do this
is that we look for languages
513
00:26:02,222 --> 00:26:04,666
that are not currently served
by technological solutions,
514
00:26:04,666 --> 00:26:06,977
which are oftentimes minority languages,
515
00:26:06,977 --> 00:26:09,280
or usually minority languages,
516
00:26:09,280 --> 00:26:12,096
and then prioritize those.
517
00:26:13,916 --> 00:26:16,844
But in terms of individual lexical items
518
00:26:16,851 --> 00:26:20,244
being-- the general way we get new data
519
00:26:20,244 --> 00:26:22,977
is essentially by ingesting
an entire dictionary's worth.
520
00:26:22,977 --> 00:26:25,911
We are relying on the dictionary's choice
521
00:26:25,911 --> 00:26:29,333
of lexical items,
rather than necessarily saying,
522
00:26:29,333 --> 00:26:31,500
we're really looking for the word
for "house" in every language.
523
00:26:31,500 --> 00:26:35,000
But when it comes to data crowdsourcing,
we will need something like that.
524
00:26:35,000 --> 00:26:37,912
So this is an opportunity
for research and growth.
525
00:26:40,044 --> 00:26:43,088
(man 5) Hi, I'm Victor,
and this is awesome.
526
00:26:45,108 --> 00:26:46,888
As you have slides here,
527
00:26:46,888 --> 00:26:49,355
can you talk a little bit
about the technical status
528
00:26:49,355 --> 00:26:51,260
that currently you have data
529
00:26:51,260 --> 00:26:57,022
or information flow
from and to Wikidata and PanLex.
530
00:26:57,022 --> 00:26:59,955
Is that currently implemented already
531
00:26:59,955 --> 00:27:03,888
and how do you deal with
532
00:27:03,888 --> 00:27:07,133
back and forth or even
feedback loop information
533
00:27:07,140 --> 00:27:09,950
between PanLex and Wikidata?
534
00:27:09,950 --> 00:27:13,733
So we actually don't have any formal
connections to Wikidata at this point,
535
00:27:13,733 --> 00:27:15,343
and this is something that I'm, again,
536
00:27:15,343 --> 00:27:17,824
I'm really excited to talk
to people in this conference about.
537
00:27:17,824 --> 00:27:20,644
We've had some interaction
with Wiktionary,
538
00:27:21,774 --> 00:27:24,720
but Wikidata is actually
a better fit, honestly,
539
00:27:24,720 --> 00:27:26,755
for what we are looking for.
540
00:27:27,355 --> 00:27:29,201
Having directly lexical stuff
541
00:27:29,201 --> 00:27:32,311
means that we have to do a lot less
data analysis and extraction.
542
00:27:32,933 --> 00:27:37,148
And so the answer is,
we don't yet, but we want to.
543
00:27:37,148 --> 00:27:39,800
(man 5) And if not,
what are the obstacles?
544
00:27:39,800 --> 00:27:43,511
And as we can see, Wikidata
already supports several languages,
545
00:27:43,511 --> 00:27:46,533
but when I look up translate.panlex.org,
546
00:27:46,533 --> 00:27:49,311
you apparently support
many, many variants,
547
00:27:49,311 --> 00:27:50,888
much more than Wikidata.
548
00:27:50,888 --> 00:27:53,316
How do you see there is a gap
549
00:27:53,316 --> 00:27:57,177
between translation
or lexical translation first,
550
00:27:57,177 --> 00:28:00,155
application versus an effort
551
00:28:00,155 --> 00:28:03,777
as trying to map a knowledge structure.
552
00:28:03,777 --> 00:28:05,866
Mapping knowledge
will actually be very interesting.
553
00:28:05,866 --> 00:28:07,336
We've had some
very interesting discussions
554
00:28:07,336 --> 00:28:12,311
about the way that Wikidata
organizes their lexical data,
555
00:28:12,311 --> 00:28:13,777
your lexical data,
556
00:28:13,777 --> 00:28:16,044
and how we organize our lexical data.
557
00:28:16,044 --> 00:28:20,933
And there are subtle differences
that would require a mapping strategy,
558
00:28:21,460 --> 00:28:24,577
some of which will not
necessarily be automatic,
559
00:28:24,577 --> 00:28:27,422
but we might be able to develop
techniques to be able to do this.
560
00:28:27,422 --> 00:28:30,796
You gave the example of language variants.
561
00:28:30,796 --> 00:28:34,111
We tend to be very "splittery"
when it comes to language variants.
562
00:28:34,111 --> 00:28:36,311
In other words,
if we get a source that says
563
00:28:36,311 --> 00:28:38,755
that this is the dialect spoken
564
00:28:38,755 --> 00:28:41,695
on the left side of the river
in Papua New Guinea, for this language,
565
00:28:41,695 --> 00:28:42,913
and we get another source that says
566
00:28:42,913 --> 00:28:44,955
this is the dialect spoken
on the right side of the river,
567
00:28:44,955 --> 00:28:46,720
then we consider them
essentially separate languages.
568
00:28:46,720 --> 00:28:51,072
And so we do this in order to basically
preserve the most data that we can.
569
00:28:52,222 --> 00:28:54,355
Being able to map that
to how Wikidata does it--
570
00:28:54,355 --> 00:28:56,938
Actually, what I would love
is to have conversations
571
00:28:56,938 --> 00:29:00,696
about how languages
572
00:29:00,696 --> 00:29:06,323
are designated on Wikidata.
573
00:29:08,145 --> 00:29:12,320
Again, we go with the strategy
of very much a "splittery" strategy.
574
00:29:13,856 --> 00:29:17,440
We broadly rely on ISO 639-3 codes,
575
00:29:17,866 --> 00:29:19,643
which is provided by the Ethnologue,
576
00:29:19,643 --> 00:29:23,840
and then each individual code,
we then allow multiple variants within it,
577
00:29:23,840 --> 00:29:29,098
either for script variants
or regional dialects or sociolects, etc.
578
00:29:30,240 --> 00:29:32,762
Again, opportunity
for discussion and work.
579
00:29:35,622 --> 00:29:39,466
(woman 3) Hi, I would like to know
if you have an OCR pipeline
580
00:29:39,466 --> 00:29:44,533
and especially because
we've been trying to do OCR on Maya,
581
00:29:44,533 --> 00:29:47,928
and we don't get any results.
582
00:29:47,933 --> 00:29:49,933
It doesn't understand anything--
583
00:29:49,933 --> 00:29:52,512
- Oh, yeah! (laughs)
- (woman 3) And... yeah.
584
00:29:52,512 --> 00:29:56,078
So if your pipelines are available.
585
00:29:56,078 --> 00:30:00,288
And the other one is just
on the overlap of ISO codes,
586
00:30:00,288 --> 00:30:01,641
like sometimes they say,
587
00:30:01,641 --> 00:30:04,199
"Oh, this is a language,
and this is another language,"
588
00:30:04,199 --> 00:30:06,555
but there are sources
that say other stuff,
589
00:30:06,555 --> 00:30:10,133
as you were mentioning,
but they tend to overlap.
590
00:30:10,133 --> 00:30:12,955
So how do you go on...? Yeah.
591
00:30:12,956 --> 00:30:15,155
Yeah, that's absolutely
an amazing question.
592
00:30:15,155 --> 00:30:17,120
I really like it.
593
00:30:17,120 --> 00:30:20,400
So we don't have a formalized
OCR pipeline per se;
594
00:30:20,400 --> 00:30:23,533
we do it on a sort of
source by source basis.
595
00:30:23,533 --> 00:30:26,266
One of the reasons why
is because we oftentimes have sources
596
00:30:26,266 --> 00:30:27,955
that don't necessarily need to be OCR'd,
597
00:30:27,955 --> 00:30:29,841
that are available
for some of these languages,
598
00:30:29,841 --> 00:30:32,766
and we concentrate on those because
they require the least amount of work.
599
00:30:32,766 --> 00:30:35,000
But, obviously,
if we really want to dive deep
600
00:30:35,000 --> 00:30:37,056
into some of our sources
that are in our backlog,
601
00:30:37,056 --> 00:30:40,896
we're going to need to essentially
develop strong OCR pipelines.
602
00:30:40,896 --> 00:30:43,968
But there's another aspect too,
which is that, as you mentioned...
603
00:30:44,400 --> 00:30:48,576
like the people who designed OCR engines
604
00:30:49,088 --> 00:30:52,672
I think are not realizing
how much you can stress test them.
605
00:30:52,672 --> 00:30:55,181
Like, you know what's fun?--
606
00:30:55,181 --> 00:30:57,690
trying to OCR
a Russian-Tibetan dictionary.
607
00:30:58,600 --> 00:31:00,216
It's really hard, it turns out...
608
00:31:01,503 --> 00:31:03,747
We gave up, and we hired
someone to just type it up,
609
00:31:04,022 --> 00:31:05,641
which was totally doable.
610
00:31:05,641 --> 00:31:07,260
And actually, it turns out
611
00:31:07,260 --> 00:31:10,266
that this amazing Russian woman
learned to read Tibetan
612
00:31:10,266 --> 00:31:12,755
so she could type this up,
which was super cool.
613
00:31:15,333 --> 00:31:18,270
I think that if you're dealing
with stuff in the Latin scripts,
614
00:31:18,270 --> 00:31:22,871
then I think that OCR solutions
can be developed, that are more robust,
615
00:31:22,871 --> 00:31:24,673
that deal with
multilingual sources like this
616
00:31:24,673 --> 00:31:26,991
and expect that you're going
to get a random four in there,
617
00:31:26,991 --> 00:31:28,284
if you're dealing with something like
618
00:31:28,284 --> 00:31:30,560
16th-century Mayan sources,
you know, with digit four.
619
00:31:32,088 --> 00:31:37,600
But there are some sources
620
00:31:37,600 --> 00:31:40,111
that OCR is probably just
never really going to catch up to,
621
00:31:40,111 --> 00:31:42,244
or require such an immense amount of work,
622
00:31:43,200 --> 00:31:46,933
that actually we put a little
bit of this to use right now.
623
00:31:46,933 --> 00:31:48,800
We have another project
we're running at PanLex
624
00:31:48,800 --> 00:31:53,533
to transcribe all of the traditional
literature of Bali,
625
00:31:53,533 --> 00:31:57,952
and we found that in handwritten
Balinese manuscripts,
626
00:31:58,444 --> 00:31:59,644
there's just no chance of OCR.
627
00:31:59,644 --> 00:32:02,200
So we got a bunch
of Balinese people to type them up,
628
00:32:02,200 --> 00:32:05,000
and it's become a really cool
cultural project within Bali,
629
00:32:05,000 --> 00:32:07,288
and it's become news and stuff like that.
630
00:32:07,288 --> 00:32:09,084
So I would say
631
00:32:09,084 --> 00:32:11,377
that you don't necessarily
need to rely on OCR,
632
00:32:11,377 --> 00:32:12,577
but there is a lot out there.
633
00:32:12,577 --> 00:32:15,160
So having good OCR solutions
would be good.
634
00:32:16,663 --> 00:32:20,992
Also, if anyone out here
is into super multilingual OCR,
635
00:32:20,992 --> 00:32:22,635
please come talk to me.
636
00:32:29,517 --> 00:32:31,377
(man 6) Thank you for your presentation.
637
00:32:32,007 --> 00:32:34,866
You talked about integration
638
00:32:34,866 --> 00:32:37,060
between PanLex and Wikidata,
639
00:32:37,060 --> 00:32:38,792
but you haven't gone into the specifics.
640
00:32:38,792 --> 00:32:42,701
So I was checking your data license,
and it is under CC0.
641
00:32:42,701 --> 00:32:44,210
- Yes.
- (man 6) That's really great.
642
00:32:44,210 --> 00:32:46,377
So there are two possible ways
643
00:32:46,377 --> 00:32:49,400
that either we can import the data
644
00:32:49,400 --> 00:32:52,777
or we can continue something similar
to the Freebase way,
645
00:32:52,777 --> 00:32:55,688
where we had the complete
database from the Freebase,
646
00:32:55,688 --> 00:32:59,080
and we imported them, and we made a link,
647
00:32:59,080 --> 00:33:03,955
an external identifier
to the Freebase database.
648
00:33:03,955 --> 00:33:08,397
So if you have something in mind,
are you thinking similar?
649
00:33:08,397 --> 00:33:10,401
Or you just want to make...
650
00:33:15,291 --> 00:33:18,755
an independent database
which can be linked to Wikidata?
651
00:33:18,755 --> 00:33:20,533
Yeah, so this is a great question
652
00:33:20,533 --> 00:33:23,282
and actually I feel
like it's about one step ahead
653
00:33:23,282 --> 00:33:25,648
of some of the stuff
that I've already been thinking about,
654
00:33:25,648 --> 00:33:29,555
partially because, like I said,
655
00:33:29,955 --> 00:33:32,111
getting the two databases to work together
656
00:33:32,111 --> 00:33:33,533
is a step in and of itself.
657
00:33:33,533 --> 00:33:35,332
I think the first step that we can take
658
00:33:35,333 --> 00:33:37,622
is literally just pooling
our skills together.
659
00:33:37,911 --> 00:33:40,246
We have a lot of experience
dealing with stuff
660
00:33:40,246 --> 00:33:42,656
like classifications of properties
of individual lexemes
661
00:33:42,656 --> 00:33:44,734
that I'd love to share.
662
00:33:45,864 --> 00:33:49,050
But being able to link the databases
themselves would be wonderful.
663
00:33:49,050 --> 00:33:50,808
I'm 100% for that.
664
00:33:50,808 --> 00:33:54,066
I think it would be a little bit easier
665
00:33:54,066 --> 00:33:56,022
on the Wikidata towards PanLex way,
666
00:33:56,022 --> 00:33:58,866
but maybe I'm just biased
because I can see how that could work.
667
00:34:02,040 --> 00:34:06,088
Yeah, essentially, as long
as Wikidata is comfortable
668
00:34:06,088 --> 00:34:09,620
with all the licensing and stuff like that,
or we work something out,
669
00:34:09,620 --> 00:34:12,057
then I think that would be a great idea.
670
00:34:13,216 --> 00:34:16,235
We'd just have to figure out ways
of linking the data itself.
671
00:34:16,235 --> 00:34:22,234
One thing I can imagine is, essentially,
that I would love for edits to Wikidata
672
00:34:22,577 --> 00:34:26,088
to immediately become populated
to the PanLex database,
673
00:34:26,088 --> 00:34:28,551
without having to essentially
674
00:34:28,551 --> 00:34:30,786
just reingest it every...
675
00:34:30,786 --> 00:34:35,779
essentially making Wikidata
a crowdsourceable interface to PanLex
676
00:34:35,779 --> 00:34:36,888
would be really awesome.
677
00:34:36,888 --> 00:34:39,777
And then being able to use
PanLex in immediate translations,
678
00:34:39,780 --> 00:34:42,224
to be able to do translations
across Wikidata lexical items--
679
00:34:42,224 --> 00:34:43,770
that would be glorious.
680
00:34:55,288 --> 00:35:00,266
(man 7) This is like the auditing process
of this semantic web
681
00:35:00,266 --> 00:35:03,808
to close holes by inference.
682
00:35:05,682 --> 00:35:09,733
If we think this further,
this kind of translation,
683
00:35:09,733 --> 00:35:13,353
how do you deal with semantic mismatch
684
00:35:13,355 --> 00:35:16,088
and grammatical mismatch?
685
00:35:16,088 --> 00:35:18,888
For instance, if you try
to translate something in German,
686
00:35:18,888 --> 00:35:21,933
you can simply put several words together
687
00:35:21,933 --> 00:35:25,986
and reach something that's sensible,
688
00:35:25,986 --> 00:35:29,184
and on the other hand,
I think I read sometimes
689
00:35:31,450 --> 00:35:38,450
not every language
has the same granular system
690
00:35:38,450 --> 00:35:40,453
for colors, for instance.
691
00:35:41,577 --> 00:35:42,800
You said everything
692
00:35:42,800 --> 00:35:45,010
uses a different system
for colors or are the same?
693
00:35:45,530 --> 00:35:48,377
(man 7) I remember maybe
that it's just about evolution of language
694
00:35:48,377 --> 00:35:51,533
that they started out
with black and white and then--
695
00:35:51,533 --> 00:35:53,333
Yeah, the color hierarchy.
696
00:35:53,333 --> 00:35:54,492
Actually, the color hierarchy
697
00:35:54,492 --> 00:35:57,271
is a great way to illustrate
how this works, right?
698
00:35:57,977 --> 00:36:01,400
So, essentially, when you have
a single pivot language--
699
00:36:02,043 --> 00:36:04,822
it's really interesting when
you read papers on machine translation
700
00:36:04,822 --> 00:36:08,000
because oftentimes they'll talk about
some hypothetical pivot language,
701
00:36:08,000 --> 00:36:09,826
that they say, "Oh yeah,
there is a pivot language,"
702
00:36:09,826 --> 00:36:12,133
and then you read in the paper
and say, "It's English."
703
00:36:12,133 --> 00:36:16,688
And so what this form
of lexical translation does,
704
00:36:16,688 --> 00:36:20,352
by passing it through
many different intermediate languages,
705
00:36:20,755 --> 00:36:26,142
it has the effect of being able
to deal with a lot of semantic ambiguity.
706
00:36:26,142 --> 00:36:28,426
Because as long as you're passing it
through languages
707
00:36:28,426 --> 00:36:33,408
that contain the same or reasonably similar
semantic boundaries to a word,
708
00:36:33,408 --> 00:36:37,038
then you can avoid
the problem of essentially
709
00:36:37,038 --> 00:36:39,808
introducing semantic ambiguity
through the pivot language.
710
00:36:39,808 --> 00:36:43,266
So using the color hierarchy thing
as an example,
711
00:36:43,266 --> 00:36:46,460
if you take a language that has
a single color word for green and blue
712
00:36:46,460 --> 00:36:50,688
and it translates it into blue
713
00:36:50,688 --> 00:36:53,244
in your single pivot language
714
00:36:53,244 --> 00:36:54,477
and then into another language
715
00:36:54,477 --> 00:36:57,422
that has different ambiguities
on these things,
716
00:36:57,422 --> 00:37:00,283
then you end up introducing
semantic ambiguity.
717
00:37:00,283 --> 00:37:02,370
But if you pass it through
a bunch of other languages
718
00:37:02,370 --> 00:37:05,660
that also contain a single
lexical item for green and blue,
719
00:37:05,660 --> 00:37:10,666
then, essentially,
that semantic specificity
720
00:37:11,040 --> 00:37:16,990
gets passed along
to the resultant language.
721
00:37:17,755 --> 00:37:20,666
As far as the grammatical feature aspects,
722
00:37:20,666 --> 00:37:23,488
PanLex has been primarily, in its history,
723
00:37:23,488 --> 00:37:28,960
collecting essentially lexemes,
essentially lexical forms.
724
00:37:29,711 --> 00:37:31,800
And, by that, I mean, essentially,
725
00:37:31,804 --> 00:37:33,840
whatever you get
as the headword for a dictionary.
726
00:37:34,800 --> 00:37:38,170
So we don't necessarily
concentrate at this time
727
00:37:38,555 --> 00:37:40,955
on collecting grammatical variant forms,
728
00:37:40,955 --> 00:37:43,360
things like [inaudible] data, etc.
729
00:37:43,360 --> 00:37:44,830
or past tense and present tense.
730
00:37:44,830 --> 00:37:46,487
But it's something we're looking into.
731
00:37:46,488 --> 00:37:48,420
One thing that it's always
important to remember
732
00:37:48,420 --> 00:37:50,600
is that because our focus is--
733
00:37:51,422 --> 00:37:54,490
is on underserved and endangered
minority languages,
734
00:37:55,000 --> 00:37:57,777
we want to make sure
that something is available
735
00:37:57,777 --> 00:37:59,711
before we make it perfect.
736
00:38:01,621 --> 00:38:02,844
A phrase I absolutely love
737
00:38:02,844 --> 00:38:04,927
is "Don't let the perfect
be the enemy of the good,"
738
00:38:04,927 --> 00:38:06,570
and that's what we intend to do.
739
00:38:06,570 --> 00:38:09,014
But we are super interested in the idea
740
00:38:09,014 --> 00:38:12,266
of being able to handle grammatical forms,
741
00:38:12,266 --> 00:38:14,031
and being able to translate
across grammatical forms,
742
00:38:14,031 --> 00:38:15,665
and it's some stuff
we've done some research on
743
00:38:15,665 --> 00:38:17,468
but we haven't fully implemented yet.
744
00:38:25,350 --> 00:38:28,777
(man 8) So, of the 7,500 or so languages,
745
00:38:30,448 --> 00:38:33,111
I assume you're relying on dictionaries
which are written for us,
746
00:38:33,111 --> 00:38:36,222
but do all those languages
have standard written forms
747
00:38:36,222 --> 00:38:38,101
and how do you deal with...?
748
00:38:38,101 --> 00:38:39,887
That's a great question.
749
00:38:42,111 --> 00:38:45,062
Essentially, yes, a lot of these languages
750
00:38:45,066 --> 00:38:47,977
as everyone's aware, are unwritten.
751
00:38:47,977 --> 00:38:50,666
However, any language
for which a dictionary has been produced
752
00:38:50,666 --> 00:38:52,466
has some kind of orthography,
753
00:38:52,466 --> 00:38:56,710
and we rely on the orthography
produced for the dictionary.
754
00:38:56,710 --> 00:38:59,686
We occasionally do some
slight massaging of orthography
755
00:39:00,956 --> 00:39:03,177
if we can guarantee
it to be lossless, basically.
756
00:39:03,177 --> 00:39:05,377
But we tend to avoid it
as much as possible.
757
00:39:07,533 --> 00:39:11,485
So, essentially,
we don't get into the business
758
00:39:11,485 --> 00:39:13,229
of developing orthographies
for languages,
759
00:39:13,229 --> 00:39:14,967
because oftentimes they have been developed,
760
00:39:14,967 --> 00:39:17,240
even if they're not really
widely published.
761
00:39:17,240 --> 00:39:22,155
So, for example,
762
00:39:22,155 --> 00:39:26,022
for a lot of languages
that are spoken in New Guinea,
763
00:39:26,488 --> 00:39:29,125
there may not be a commonly
used orthographic form,
764
00:39:29,125 --> 00:39:30,980
but some linguists
just come up with something
765
00:39:30,980 --> 00:39:32,333
and that's a good first step.
766
00:39:33,473 --> 00:39:36,730
We also collect phonetic forms
when they're available in dictionaries,
767
00:39:36,730 --> 00:39:38,400
and so that's another way in,
768
00:39:38,400 --> 00:39:40,533
essentially an IPA
representation of the word,
769
00:39:40,533 --> 00:39:41,800
if that's available.
770
00:39:41,800 --> 00:39:43,333
So that can also be used as well.
771
00:39:43,333 --> 00:39:45,755
But we don't just typically
use that as a pivot
772
00:39:45,755 --> 00:39:48,226
because it introduces certain ambiguities.
773
00:39:52,666 --> 00:39:55,466
(woman 4) Thank you,
this might be a super silly question,
774
00:39:56,044 --> 00:40:00,572
but are those only the intermediate
languages you work with?
775
00:40:00,572 --> 00:40:02,215
Oh, no. Oh, no.
776
00:40:02,222 --> 00:40:03,790
(woman 4) Oh, yes, alright. Thank you.
777
00:40:03,790 --> 00:40:05,683
No, I'm glad you asked.
It answers the question.
778
00:40:05,683 --> 00:40:11,311
So this is actually a screenshot snap
from translate.panlex.org.
779
00:40:11,311 --> 00:40:12,826
If you do a translation,
780
00:40:12,826 --> 00:40:15,022
you'll get a list of translations
on the right side.
781
00:40:15,022 --> 00:40:17,874
You click a little dot dot dot button,
you'll get a graph like this.
782
00:40:17,874 --> 00:40:21,760
And what this shows
is the intermediate languages,
783
00:40:22,010 --> 00:40:24,133
the top 20 by score--
784
00:40:24,133 --> 00:40:26,093
I could go into the details
of how we do the score
785
00:40:26,093 --> 00:40:27,452
but it's not super important now--
786
00:40:27,452 --> 00:40:30,244
by score that are being used.
787
00:40:30,244 --> 00:40:33,393
But to make the translation,
we're actually using way more than 20.
788
00:40:33,393 --> 00:40:35,797
The reason I cap it at 20
is because if you have more than 20--
789
00:40:35,797 --> 00:40:37,661
like this is actually
a kind of a physics simulation
790
00:40:37,661 --> 00:40:39,638
you can move the things around
and they squiggle.
791
00:40:39,638 --> 00:40:42,200
If you have more than 20,
your computer gets really mad.
792
00:40:45,400 --> 00:40:47,419
So it's more of a demonstration, yeah.
793
00:40:55,955 --> 00:40:57,888
(woman 5) Leila,
from Wikimedia Foundation.
794
00:40:57,888 --> 00:41:00,155
Just one note on--
795
00:41:00,155 --> 00:41:03,260
You mentioned Wikimedia Foundation
a couple of times in your presentation,
796
00:41:03,260 --> 00:41:06,533
I wanted to say if you want to do
any kind of data ingestion
797
00:41:06,533 --> 00:41:08,460
or a collaboration with Wikidata,
798
00:41:08,820 --> 00:41:11,200
perhaps Wikimedia Deutschland
would be a better place
799
00:41:11,200 --> 00:41:13,182
to have these conversations with?
800
00:41:13,182 --> 00:41:16,256
Because Wikidata lives
within Wikimedia Deutschland
801
00:41:16,256 --> 00:41:17,511
and the team is there,
802
00:41:17,511 --> 00:41:19,971
and also the community
of volunteers around Wikidata
803
00:41:19,977 --> 00:41:23,710
would be the perfect place to talk
804
00:41:23,710 --> 00:41:25,590
about any kind of ingestions
805
00:41:25,590 --> 00:41:31,136
or working with bringing
PanLex closer to Wikidata.
806
00:41:31,577 --> 00:41:32,688
Great, thank you very much,
807
00:41:32,688 --> 00:41:34,901
because honestly I'm not
exactly super familiar
808
00:41:34,901 --> 00:41:37,823
with all of the intricacies
of the architecture
809
00:41:37,823 --> 00:41:39,740
of how all the projects
relate to each other.
810
00:41:39,740 --> 00:41:41,977
I'm guessing by the laughs
that it's complicated.
811
00:41:41,977 --> 00:41:44,333
But, yeah, so basically
we would want to talk
812
00:41:44,333 --> 00:41:48,333
with whoever is responsible for Wikidata.
813
00:41:48,333 --> 00:41:52,120
So just do a little
[inaudible] place thing,
814
00:41:52,860 --> 00:41:56,470
whoever is responsible for Wikidata,
that's who we're interested in talking to,
815
00:41:56,470 --> 00:41:58,264
which is all of you volunteers.
816
00:42:03,266 --> 00:42:05,044
Any further questions?
817
00:42:10,066 --> 00:42:14,400
Okay, well, if anyone does end up having
any further questions beyond this
818
00:42:14,400 --> 00:42:17,711
or ones that I talked about-- the details
and specifics about these things,
819
00:42:17,711 --> 00:42:19,800
please come and talk to me,
I'm super interested.
820
00:42:19,800 --> 00:42:23,977
And especially if you're dealing
with anything involving lexical stuff,
821
00:42:23,977 --> 00:42:28,666
anything involving
endangered minority languages
822
00:42:28,666 --> 00:42:30,444
and underserved languages,
823
00:42:30,444 --> 00:42:34,410
and also Unicode,
which is something I do as well.
824
00:42:36,220 --> 00:42:37,800
So thank you very much
825
00:42:37,800 --> 00:42:39,563
and thank you
for inviting me to come speak,
826
00:42:39,563 --> 00:42:41,550
I'm hoping that you enjoyed all this.
827
00:42:41,550 --> 00:42:43,753
(applause)