-
Now, there are approximately
7,500 languages
-
spoken on the planet today.
-
Of those, it's estimated
-
that about 70%
are at risk of not surviving
-
the end of the 21st century.
-
Every time a language dies,
-
it's severing a connection
-
that has lasted for hundreds
to thousands of years,
-
to culture, to history,
-
and to traditions, and to knowledge.
-
The linguist Kenneth Hale once said
-
that every time a language dies,
-
it's like dropping
an atom bomb on the Louvre.
-
So the question is,
-
why do languages die?
-
Well, the simple answer might be
-
that one could imagine
authoritarian governments
-
preventing people from speaking
their native language,
-
children being punished
for speaking their language at school,
-
or the government
shutting down radio stations
-
in the minority language.
-
And this definitely happened in the past,
-
and it still, to some extent,
happens today.
-
But the honest answer
-
is that for the vast majority
of cases of language extinction,
-
the explanation is much simpler
-
and much easier to explain.
-
The languages go extinct
-
because they are not passed down
-
from one generation to the next.
-
Every single time a person who speaks
-
a minority language has a child,
-
they go through a calculus.
-
They ask themselves,
-
"Do I pass my language down to my child,
-
or do I instead teach them
only the majority language?"
-
Essentially, there is a scale
-
that they consult in their heads,
-
in which, on one side,
-
every single time in their lives
-
that they've had an opportunity
to use their native language
-
for communication,
for access to traditional culture,
-
a stone is placed on the left side.
-
And every time that they find themselves
-
unable to use their native language,
-
and instead have to rely on
the majority language,
-
a stone is placed on the right side.
-
Now, due to the strength and the dignity
-
of being able to speak
one's mother tongue,
-
the stones on the left
tend to be a bit heavier.
-
But with enough stones on the right side,
-
then eventually the scale tips,
-
and when a person comes to the decision
-
about whether to pass their language down,
-
they see their own language
-
as more of a burden than a blessing.
-
So the question is,
how do we reverse this?
-
First, we need to think
about the fact that,
-
for any given language,
-
there are certain social spheres
that it can be used in.
-
So any language
-
that's a mother tongue spoken today
-
can be used with one's family.
-
A smaller set of languages
can be used within one's community,
-
a smaller set, maybe within one's region,
-
and for a small handful of languages,
-
they can be used
for international communication.
-
And then even across these spheres,
-
there's the question of whether someone
can use their language
-
for education or business,
-
or in technology.
-
So, to better explain
-
what I'm talking about here,
-
I would like to use an anecdote.
-
Let's say that you are about to go
-
on your dream vacation to India,
-
and you have an eight-hour
layover in Istanbul.
-
Now, you weren't necessarily
planning on visiting Turkey,
-
but with your layover
and with a Turkish friend
-
telling you about an amazing restaurant
-
that's not too far from the airport,
-
you say, "Hey, you know,
maybe I'll stop by during my layover."
-
So, you exit the airport,
-
you get to your restaurant,
-
and they hand you a menu,
-
and the menu is entirely in Turkish.
-
Now, let's say,
for the purposes of this exercise,
-
that you don't speak Turkish.
-
What do you do?
-
Well, best-case scenario,
-
you find someone perhaps
who can speak your native language,
-
German, English, etc.
-
But let's say it's not your lucky day
-
and nobody in the restaurant can speak
any German or any English.
-
So what do you do?
-
Well, if you are like me,
and I imagine most of you,
-
you'd probably turn
to a technological solution,
-
machine translation
or a digital dictionary,
-
look up each word individually,
-
and eventually order yourself
a delicious Turkish meal.
-
Now, let's imagine this scenario instead,
-
in which you are the native speaker
of a minority language.
-
Let's say, Lower Sorbian.
-
Lower Sorbian is an endangered language
-
spoken here in Germany,
-
about 130 kilometers
to the southeast of here,
-
by only a few thousand people,
mostly elderly.
-
Now, let's say your mother tongue
is Lower Sorbian.
-
You end up in the restaurant.
-
Now, of course, the odds
of finding someone
-
who speaks your native language
in the restaurant are extraordinarily low.
-
But, again, you can just go
to a technological solution.
-
However, for your native language,
-
these technological solutions don't exist.
-
You would have to rely on
German or English
-
as your pivot language into Turkish.
-
Now, of course, you still end up
getting your delicious Turkish meal,
-
but you begin to think about
how difficult this would have been
-
if you were your grandfather,
who spoke no German at all.
-
Now, this is just a small incident,
-
but it's going to place a stone
on the right side of that scale,
-
and make you think that perhaps,
-
when you have children
or when you have another child,
-
the burden that you went through here
-
may not be worth it
to keep your language.
-
And imagine if this was a scenario
-
that was of significantly more importance,
-
such as, for example, being in a hospital.
-
Now, this is the point
at which we can help--
-
and by we, I mean you and me
in this room.
-
We have the tools
to be able to help with this.
-
If technological tools
are available for people
-
who speak minority
and underserved languages,
-
it puts a little finger on the scale,
on the left side of the scale.
-
Someone doesn't necessarily have to think
-
that they have to rely on
the majority language
-
in order to interact
with the outside world,
-
because it opens the social spheres
-
a little bit more.
-
So, of course, the ideal solution
-
is that we have machine translation
in every language in the world.
-
But, unfortunately,
that's just not feasible.
-
Machine translation
requires large corpora of text,
-
and for many of these languages
-
that are endangered or underserved,
-
such data is simply not available.
-
Some of them aren't even commonly written
-
and thus getting enough data
to make a machine translation engine
-
is unlikely.
-
But what is available is lexical data.
-
Through the work of many linguists
-
over the past few hundred years,
-
dictionaries and grammars
have been produced
-
for most of the world's languages.
-
But, unfortunately, most of these works
-
are not accessible
or available to the world,
-
let alone to speakers
of these minority languages.
-
And it's not an intentional process,
-
a lot of times it's simply because
-
the initial print run
of these dictionaries was small,
-
and the only copies
-
are moldering away
in a university library somewhere.
-
But we have the ability to take that data
-
and make it accessible to the world.
-
The Wikimedia Foundation
is one of the best organizations,
-
I would say the best
organization in the world,
-
for making data available
-
to the vast majority
of the population of this planet.
-
So let's work on that.
-
So to explain a little bit
-
about what we've been doing
in this regard,
-
I'd like to introduce
my organization, PanLex,
-
which is an organization
that is attempting
-
to collect lexical data for this purpose.
-
We got started about 12 years ago
-
at the University of Washington,
as a research project.
-
The idea behind it
-
was to show that inferred translations
-
could create an effective
translation device,
-
essentially a lexical translation device.
-
This is an example
from PanLex data itself.
-
This is showing how to translate
-
the word "ev" in Turkish,
which means house,
-
to Lower Sorbian,
-
the language I was referring to earlier.
-
So you're unlikely to find
-
Turkish-to-Lower Sorbian dictionaries,
-
but by passing it through
-
many, many different
intermediate languages,
-
you can create effective translations.
-
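To make the inferred-translation idea concrete, here is a minimal sketch in Python. It is not PanLex's actual algorithm, only an illustration of the principle: a Turkish-to-Lower Sorbian candidate is scored by how many intermediate languages connect the two words through bilingual word lists we already have. The tiny dictionaries and the scoring rule are invented for the example.

```python
# Toy illustration of inferred (pivot-based) lexical translation.
# The word lists below are invented; PanLex's real scoring is more elaborate.

# Bilingual word lists we might plausibly have: (source lang, target lang)
# -> {source word: set of target words}.
DICTS = {
    ("tur", "eng"): {"ev": {"house", "home"}},
    ("tur", "deu"): {"ev": {"Haus"}},
    ("tur", "rus"): {"ev": {"дом"}},
    ("eng", "dsb"): {"house": {"dom"}, "home": {"dom"}},
    ("deu", "dsb"): {"Haus": {"dom"}},
    ("rus", "dsb"): {"дом": {"dom"}},
}

def inferred_translations(word, src, dst):
    """Translate src -> dst by pivoting through every available intermediate
    language; the score is simply the number of independent pivot paths."""
    scores = {}
    for (a, pivot_lang), forward in DICTS.items():
        if a != src or word not in forward:
            continue
        back = DICTS.get((pivot_lang, dst), {})
        for pivot_word in forward[word]:
            for candidate in back.get(pivot_word, set()):
                scores[candidate] = scores.get(candidate, 0) + 1
    # Higher score = supported by more independent pivot paths.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(inferred_translations("ev", "tur", "dsb"))
# [('dom', 4)] -- reachable via English (two pivot words), German, and Russian.
```

In the real database the paths run through far more languages and the scoring also weighs source quality, but the principle is the same.
-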
So, once this was shown
in the research project,
-
the founder of PanLex,
Dr. Jonathan Pool,
-
decided, "Well, you know,
why not actually just do this?"
-
So he started a non-profit
-
to collect as much lexical data
as possible and make it accessible.
-
That's what we've been doing
for the past 12 years.
-
In that time, we've collected
thousands and thousands of dictionaries,
-
and extracted lexical data out of them
-
and compiled a database that allows
inferred lexical translation
-
across any of--
-
Our current count is around 5,500
-
of the 7,500 languages in the world.
-
And, of course,
-
we're constantly trying to expand that
-
and expand the data
on each individual language.
-
So, the next question is,
-
what can we do to work together on this?
-
We, at PanLex, have been
extremely excited to watch
-
the work on lexical data
-
that Wikidata has been doing lately.
-
It's very fascinating to see organizations
-
that are working in a very similar sphere,
-
but on different aspects.
-
And we are extremely excited to see
-
the results of this from Wikidata.
-
And also we are looking forward
to collaborating with Wikidata.
-
I think that the special skills
-
that we've developed
over the past 12 years,
-
with not just collecting lexical data,
but also in database design,
-
could be extremely useful for Wikidata.
-
And on the other side, I think that--
-
I especially am excited about Wikidata's
-
ability to do crowdsourcing of data.
-
At PanLex, currently,
our sources are entirely
-
printed lexical sources
or other types of lexical sources,
-
but we don't do any crowdsourcing.
-
We simply don't have
the infrastructure for it available
-
and of course, the Wikimedia Foundation
-
is the world expert in crowdsourcing.
-
I'm really looking
forward to seeing exactly
-
how we can apply these skills together.
-
But, overall, I think the main thing
to keep in mind
-
is that when we're
working on these things,
-
it's minute detail.
-
We're sitting around
looking at grammatical forms,
-
or paging our way through
dictionaries, ancient dictionaries,
-
or sometimes
recently published dictionaries
-
and getting into written forms of words,
-
and it feels very close up.
-
But, occasionally, we need to remember
-
to take a step back
-
and recall that, even though what we're doing
-
can even feel mundane at times,
-
the work we're doing
is extremely important.
-
This is, in my opinion,
the absolute best way
-
that we can support endangered languages
-
and make sure that the linguistic
diversity of the planet
-
is preserved up to the end
of this century or longer.
-
It's entirely possible that the work
that we're doing today
-
may result in languages
-
being preserved and passed down,
-
and not going extinct.
-
So just to remember
-
that even if you're sitting
around on your computer
-
editing an individual entry
-
and adding the dative form
of a small minority language
-
for every single noun,
-
the little thing
that you're doing right now
-
might actually be partially responsible
-
for making sure that language survives,
-
until the end of the century or longer.
-
Thank you very much,
-
and I'd like to open
the floor to questions.
-
(applause)
-
(woman 1) Thank you.
-
- Thank you for your talk.
- Thank you.
-
(woman 1) I just have a question
about dictionaries.
-
You said that you work
with printed dictionaries?
-
- Yes.
- (woman 1) So my question
-
is what do you take
from those dictionaries
-
and if there's any copyright thing
you have to deal with?
-
I anticipated this to be
the first question that I would get.
-
(laughter)
-
So, first off, for PanLex,
-
according to the legal
resources that we have consulted,
-
whereas the arrangement and organization
of a dictionary is copyrightable,
-
the translation itself
is not considered copyrightable.
-
A good example is that
-
a phone book is considered,
at least according to US law,
-
copyrightable.
-
But saying that person X's
phone number is digits D
-
is not copyrightable.
-
So like I said,
-
according to our legal scholars,
-
this is how we can deal with this.
-
But even if that's not
a solid enough legal argument,
-
one important thing to remember
-
is that the vast majority
of this lexical data
-
is actually out of copyright.
-
A significant number
of these are out of copyright
-
and thus can be used without [end].
-
And the other thing
is that oftentimes, for example,
-
if we're working with
a recently made print dictionary,
-
rather than trying to scan it and OCR it,
-
we just email the person who made it.
-
And it turns out that
most linguists are really excited
-
that their data can be made accessible.
-
And so they're like, "Sure, please,
-
just put it all in there
and make it accessible."
-
So like I said, we have, at least,
according to our legal opinions,
-
we have the ability,
-
but even if you don't want
to go with that,
-
it's very easy to get
the data publicly accessible.
-
- (man 1) Thank you. Hi.
- Hi.
-
(man 1) Can you say a little more
-
about how the person who speaks
Lower Sorbian is accessing the data?
-
Like specifically how
that information is getting to them
-
and how that might help to convince them
-
to either try out the--
-
Great question and this is actually
-
one that I think about a lot as well,
-
because I think that
when we talk about data access,
-
there are actually
multiple steps to this.
-
One is, of course, data preservation:
making sure the data doesn't go away.
-
Second is making sure it's interoperable
-
and can be used.
-
And third is making sure
that it's available.
-
So in PanLex's case,
-
we have an API that can be used,
-
but, obviously,
that can't be used by an end user.
-
But we've also developed interfaces.
-
And so, for example,
if you go to translate.panlex.org,
-
you can do translations on our database.
-
If you want to mess around
with the API, just go to dev.panlex.org,
-
and you can find a bunch of stuff
on the API, or just api.panlex.org.
-
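For anyone who wants to experiment, here is a rough sketch of what a call against the API could look like. Treat it as an assumption, not documentation: the endpoint and field names used below (uid, txt, trans_uid, trans_txt, trans_quality) are my reading of the interface, and the authoritative parameters are the ones described at dev.panlex.org.

```python
# Hedged sketch: asking the PanLex API for Lower Sorbian translations of the
# Turkish word "ev". Endpoint and parameter names are assumptions -- check
# dev.panlex.org / api.panlex.org for the real interface before relying on this.
import json
import urllib.request

API = "https://api.panlex.org/v2/expr"  # assumed expression endpoint
query = {
    "uid": ["dsb-000"],          # assumed: return expressions in Lower Sorbian
    "trans_uid": ["tur-000"],    # assumed: translated from Turkish
    "trans_txt": ["ev"],         # assumed: the Turkish headword to translate
    "include": ["trans_quality"],
}

request = urllib.request.Request(
    API,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    payload = json.load(response)

# Assumed response shape: a "result" list of expression records.
for row in payload.get("result", []):
    print(row.get("txt"), row.get("trans_quality"))
```
-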
But there's another step too,
-
which is that even if you make
all of your data completely accessible
-
with tools that are super useful
to be able to access it,
-
if you don't actually promote the tools,
-
then people won't actually
be able to use it.
-
And this is honestly kind of a...
-
the thing that isn't talked about enough,
-
and I don't have a good answer for it.
-
How do we make sure that--
-
For example, I only fairly recently,
-
only a few years ago,
got acquainted with Wikidata,
-
and it's exactly the kind
of thing that I'm interested in.
-
So, how do we promote
ourselves to others?
-
I'm leaving that as an open question.
-
Like I said, I don't have
a good answer for this.
-
But, of course, in order to do that,
-
we still need to accomplish
the first few steps.
-
(man 2) If we want to have
machine translation,
-
don't we need a translation memory?
-
I'm not sure that the individual words
-
that we put into Wikidata,
-
these short phrases
that we put into Wikidata,
-
either as ordinary Wikidata items
or as Wikidata lexemes,
-
are sufficient to do a proper translation.
-
We need to have full sentences,
for example, for--
-
(Benjamin) Yeah, absolutely.
-
(man 2) And where do we get
this data structure?
-
I'm not sure that, currently,
-
Wikidata is able to handle very well
-
the issue of a translation memory,
-
or translatewiki.net,
-
for filling that gap of...
-
Should we do anything
in that respect, or should we--
-
Yeah, and I really
appreciate your question.
-
I touched on this a little bit earlier,
-
but I'd love to reiterate it.
-
This is precisely the reason
that PanLex works in lexical data
-
and why I'm excited about lexical data,
-
as opposed to--
not as opposed to, but in addition
-
to machine translation engines
and machine translation in general.
-
As you said, machine translation
requires a specific kind of data,
-
and that data is not available
for most of the world's languages.
-
For the vast majority
of the world's languages,
-
that simply is not available.
-
But that doesn't mean
we should just give up.
-
Like why?
-
If I needed to translate
my Turkish restaurant menu,
-
then lexical translation would likely
be an exceptionally good tool for that.
-
Now, I'm not saying
that you can use lexical translation
-
to do perfect paragraph
to paragraph translation.
-
When I say lexical translation,
I mean word to word,
-
and word-to-word translation
can be extremely useful.
-
It's funny to think about it,
but we didn't really have access
-
to really good machine translation.
-
Nobody had
access to that until fairly recently.
-
And we still got by with dictionaries,
-
and they're an incredibly good resource.
-
And the data is available,
so why not make it available
-
to the world at large
and to the speakers of these languages?
-
(woman 2) Hi, what mechanisms
do you have in place
-
when the community itself--I'm over here.
-
- Where are you? Okay, right.
- (woman 2) Yeah, sorry. (laughs)
-
...when the community itself
-
doesn't want part of their data in PanLex?
-
Great question.
-
So the way that we work with that
-
is that if a dictionary is published
and made publicly available,
-
that's a good indication.
-
Like you could buy it in a store,
or find it at a university library
-
or a public library anyone can access.
-
That's a good indication
that that decision has been made.
-
(woman 2) [inaudible]
-
(man 3) Please, [inaudible],
could you speak in the microphone?
-
Can you say it again?
-
(woman 2) Linguists don't always have
the permission of the community
-
to publish things;
-
they oftentimes publish things
without the consent of the community.
-
And that's absolutely true.
-
I would say that is a--
-
That does happen.
-
I would say it's generally
a small minority of cases,
-
mostly confined
generally to North America,
-
although sometimes
South American languages as well.
-
It's something we have
to take into account.
-
If we were to receive word, for example,
-
that the data that is in PanLex
-
should not be accessed
by the greater world,
-
then, of course, we would remove it.
-
(woman 2) Good, good.
-
That doesn't mean, of course,
-
that we'll listen
to copyright rules necessarily
-
but we will listen
to traditional communities,
-
and that's the major difference.
-
(woman 2) Yeah,
that's what I'm referring to.
-
It brings up a really interesting point,
-
which is that
-
sometimes it's a really big question
of who speaks for a language.
-
I had some experience actually
visiting the American Southwest
-
and working with some groups,
-
who work on indigenous languages,
the Pueblo languages out there.
-
So there are approximately
-
six Pueblo languages,
depending on how you slice it,
-
spoken in that area.
-
But they are divided
amongst 18 different Pueblos
-
and each one has their own
tribal government,
-
and each government
may have a different opinion
-
on whether their language
should be accessible to outsiders or not.
-
Like, for example, Zuni Pueblo,
-
it's a single Pueblo
that speaks the Zuni language.
-
And they're really big
on their language going everywhere,
-
they put it on the street signs
and everything, it's great.
-
But for some of the other languages,
-
you might have one group that says,
-
"Yeah, we don't want our language
being accessed by outsiders."
-
But then you have the neighboring Pueblo
who speaks the same language say,
-
"We really want our language
accessible to outsiders
-
in using these technological tools,
-
because we want our language
to be able to continue on."
-
And it raises a really
interesting ethical question.
-
Because if you default by saying,
-
"Fine, I'm cutting it off because
this group said we should cut it off"--
-
aren't you also doing a disservice to
the second group,
-
because they actively
want you to roll out these things?
-
So I don't think this is a question
that has an easy answer.
-
But I would say,
at least in terms of PanLex--
-
And for the record, we actually
haven't encountered this yet,
-
that I'm aware of.
-
Now, that could be partially because...
-
Getting back to his question,
-
we may need to promote more. (chuckles)
-
But, in general, as far as I know,
-
we have not had this come up.
-
But our game plan for this
-
is if a community says they don't want
their data in a database,
-
then we remove it.
-
(woman 2) Because we have come up
with it in Wikidata and Wikipedia...
-
- You have?
- (woman 2) ...on Commons.
-
- Really?
- (woman 2) It's been a problem.
-
Yeah, I can imagine, especially on Commons,
for photos or certain things.
-
(woman 2) Correct.
-
(man 4) Hi, I had a question about
the crowdsourcing aspect of this.
-
As far as going in and asking a community
-
to annotate or add data for a dataset,
-
one of the things
that's a little intimidating is like,
-
as an editor, I can only see
what things are missing.
-
But if I'm going to spend time
on things, having an idea
-
that there's a list of high-priority items
-
is, I guess,
very motivating in this respect.
-
And I was curious if you had a system
-
which is, essentially, like,
we know the gaps in our own data,
-
we have linguistic evidence
to know that these are the ones
-
that, if we had them annotated,
would be the high-impact drivers.
-
So I can imagine
-
having the lexeme
for "house" very impactful,
-
maybe not a lexeme
for a data or some other like.
-
But I was curious if you had that,
and if it is something
-
that could be used
to drive these community efforts.
-
Great question.
-
So one thing that Wikidata
has a whole lot of--
-
sorry, excuse me, PanLex
has a whole lot of are Swadesh lists.
-
We have apparently the largest collection
of Swadesh lists in the world
-
which is interesting.
-
If you don't know what a Swadesh list is,
-
it's essentially a regularized
list of lexical items
-
that can be used
for analysis of languages.
-
They contain really basic sets.
-
So there's a couple
of different kinds of Swadesh lists.
-
But there are 100 or 213 items
-
and they might contain
-
words like "house" and "eye" and "skin"
-
and basically general words
-
that you should be able
to find in any language.
-
So that's like a really
good starting point
-
for having that kind of data available.
-
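As a hypothetical illustration of how a Swadesh-style list could drive the kind of priority list the question asks about, the sketch below checks a language's existing entries against a basic-vocabulary list and reports the gaps. The word lists are toy data, not PanLex's or Wikidata's.

```python
# Toy sketch: use a Swadesh-style basic-vocabulary list to find high-priority
# gaps in one language's lexical coverage. All data here is invented.
SWADESH_SAMPLE = ["I", "you", "water", "fire", "house", "eye", "skin", "sun"]

# Concepts for which this (hypothetical) language already has an entry.
already_covered = {"water", "eye", "sun"}

missing = [concept for concept in SWADESH_SAMPLE if concept not in already_covered]
print("High-priority items to collect first:", missing)
# High-priority items to collect first: ['I', 'you', 'fire', 'house', 'skin']
```
-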
Now, as I mentioned before,
-
crowdsourcing is something
that we don't do yet
-
and we're actually
really excited to be able to do.
-
It's one of the things I'm really excited
-
to talk to people
at this conference about,
-
is how crowdsourcing can be used
-
and the logistics behind it,
-
and these are the kind
of questions that can come up.
-
So I guess the answer I can say to you
-
is that we do have a priority list--
-
Actually, one thing I can say
is we definitely do have a priority list
-
when it comes to which languages
we are seeking out.
-
So the way we do this
is that we look for languages
-
that are not currently served
by technological solutions,
-
which are oftentimes minority languages,
-
or usually minority languages,
-
and then prioritize those.
-
But in terms of individual lexical items,
-
the general way we get new data
-
is essentially by ingesting
an entire dictionary's worth.
-
We are relying on the dictionary's choice
-
of lexical items,
rather than necessarily saying,
-
we're really looking for the word
for "house" in every language.
-
But when it comes to data crowdsourcing,
we will need something like that.
-
So this is an opportunity
for research and growth.
-
(man 5) Hi, I'm Victor,
and this is awesome.
-
As you have slides here,
-
can you talk a little bit
about the technical status--
-
whether you currently have data
-
or information flowing
from and to Wikidata and PanLex?
-
Is that currently implemented already
-
and how do you deal with
-
back and forth or even
feedback loop information
-
between PanLex and Wikidata?
-
So we actually don't have any formal
connections to Wikidata at this point,
-
and this is something that I'm, again,
-
I'm really excited to talk
to people in this conference about.
-
We've had some interaction
with Wiktionary,
-
but Wikidata is actually
a better fit, honestly,
-
for what we are looking for.
-
Having directly lexical data
-
means that we have to do a lot less
data analysis and extraction.
-
And so the answer is,
we don't yet, but we want to.
-
(man 5) And if not,
what are the obstacles?
-
And as we can see, Wikidata
already supports several languages,
-
but when I look up translate.panlex.org,
-
you apparently support
many, many variants,
-
much more than Wikidata.
-
How do you see the gap
-
between translation,
or lexical translation first,
-
as an application, versus an effort
-
at trying to map a knowledge structure?
-
Mapping knowledge
will actually be very interesting.
-
We've had some
very interesting discussions
-
about the way that Wikidata
organizes their lexical data--
-
your lexical data--
-
and how we organize our lexical data.
-
And there are subtle differences
that would require a mapping strategy,
-
some of which will not
necessarily be automatic,
-
but we might be able to develop
techniques to be able to do this.
-
You gave the example of language variants.
-
We tend to be very "splittery"
when it comes to language variants.
-
In other words,
if we get a source that says
-
that this is the dialect spoken
-
on the left side of the river
in Papua New Guinea, for this language,
-
and we get another source that says
-
this is the dialect spoken
on the right side of the river,
-
then we consider them
essentially separate languages.
-
And so we do this in order to basically
preserve the most data that we can.
-
Being able to map that
to how Wikidata does it--
-
Actually, what I would love
is to have conversations
-
about how languages
-
are designated on Wikidata.
-
Again, we go with
very much a "splittery" strategy.
-
We broadly rely on ISO 639-3 codes,
-
which are provided by the Ethnologue,
-
and then, within each individual code,
we allow multiple variants,
-
either for script variants
or regional dialects or sociolects, etc.
-
Again, opportunity
for discussion and work.
-
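To illustrate the "splittery" approach, here is a small sketch of how variants can be represented: an ISO 639-3 code plus a numeric variant index (in the style of PanLex's "uid" labels such as dsb-000), so several variants sit under one code and can still be grouped together when mapping to a lumpier scheme like Wikidata's language items. The example variants are invented.

```python
# Sketch: "splittery" language variants keyed by an ISO 639-3 code plus a
# variant index. The variants listed here are illustrative only.
from collections import defaultdict

variants = {
    "dsb-000": "Lower Sorbian (standard orthography)",
    "dsb-001": "Lower Sorbian (older dictionary orthography)",
    "xxx-000": "Left-bank river dialect (placeholder code)",
    "xxx-001": "Right-bank river dialect (placeholder code)",
}

# Mapping toward a lumpier scheme mostly means grouping by the ISO 639-3 prefix.
by_iso_code = defaultdict(list)
for uid, label in variants.items():
    by_iso_code[uid.split("-")[0]].append(label)

for code, labels in sorted(by_iso_code.items()):
    print(code, "->", labels)
```
-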
(woman 3) Hi, I would like to know
if you have an OCR pipeline
-
and especially because
we've been trying to do OCR on Maya,
-
and we don't get any results.
-
It doesn't understand anything--
-
- Oh, yeah! (laughs)
- (woman 3) And... yeah.
-
So if your pipelines are available.
-
And the other one is just
on the overlap of ISO codes,
-
like sometimes they say,
-
"Oh, this is a language,
and this is another language,"
-
but there are sources
that say other stuff,
-
as you were mentioning,
but they tend to overlap.
-
So how do you go on...? Yeah.
-
Yeah, that's absolutely
an amazing question.
-
I really like it.
-
So we don't have a formalized
OCR pipeline per se;
-
we do it on a sort of
source by source basis.
-
One of the reasons why
is because we oftentimes have sources
-
that don't necessarily need to be OCR'd,
-
that are available
for some of these languages,
-
and we concentrate on those because
they require the least amount of work.
-
But, obviously,
if we really want to dive deep
-
into some of our sources
that are in our backlog,
-
we're going to need to essentially
develop strong OCR pipelines.
-
But there's another aspect too,
which is that, as you mentioned...
-
like the people who design OCR engines,
-
I think, are not realizing
how much you can stress-test them.
-
Like, you know what's fun?--
-
trying to OCR
a Russian-Tibetan dictionary.
-
It's really hard, it turns out...
-
We gave up, and we hired
someone to just type it up,
-
which was totally doable.
-
And actually, it turns out
-
that this amazing Russian woman
learned to read Tibetan
-
so she could type this up,
which was super cool.
-
I think that if you're dealing
with stuff in the Latin scripts,
-
then I think that OCR solutions
can be developed that are more robust,
-
that deal with
multilingual sources like this
-
and expect that you're going
to get a random four in there,
-
if you're dealing with something like
-
16th-century Mayan sources,
you know, with the digit four.
-
But there are some sources
-
that OCR is probably just
never really going to catch up to,
-
or would require such an immense amount of work.
-
And actually, we put a little
bit of this to use right now.
-
We have another project
we're running at PanLex
-
to transcribe all of the traditional
literature of Bali,
-
and we found that in handwritten
Balinese manuscripts,
-
there's just no chance of OCR.
-
So we got a bunch
of Balinese people to type them up,
-
and it's become a really cool
cultural project within Bali,
-
and it's become news and stuff like that.
-
So I would say
-
that you don't necessarily
need to rely on OCR,
-
but there is a lot out there.
-
So having good OCR solutions
would be good.
-
Also, if anyone out here
is into super multilingual OCR,
-
please come talk to me.
-
(man 6) Thank you for your presentation.
-
You talked about integration
-
between PanLex and Wikidata,
-
but you haven't gone into the specifics.
-
So I was checking your data license,
and it is under CC0.
-
- Yes.
- (man 6) That's really great.
-
So there are two possible ways
-
that either we can import the data
-
or we can continue something similar
to the Freebase way,
-
where we had the complete
database from the Freebase,
-
and we imported them, and we made a link,
-
an external identifier
to the Freebase database.
-
So if you have something in mind,
are you thinking similar?
-
Or you just want to make...
-
an independent database
which can be linked to Wikidata?
-
Yeah, so this is a great question
-
and actually I feel
like it's about one step ahead
-
of some of the stuff
that I've already been thinking about,
-
partially because, like I said,
-
getting the two databases to work together
-
is a step in and of itself.
-
I think the first step that we can take
-
is literally just pooling
our skills together.
-
We have a lot of experience
dealing with stuff
-
like classifications of properties
of individual lexemes
-
that I'd love to share.
-
But being able to link the databases
themselves would be wonderful.
-
I'm 100% for that.
-
I think it would be a little bit easier
-
going from Wikidata towards PanLex,
-
but maybe I'm just biased
because I can see how that could work.
-
Yeah, essentially, as long
as Wikidata is comfortable
-
with all the licensing stuff like that,
or we work something out,
-
then I think that would be a great idea.
-
We'd just have to figure out ways
of linking the data itself.
-
One thing I can imagine is, essentially,
that I would love for edits to Wikidata
-
to immediately be propagated
to the PanLex database,
-
without having to essentially
-
just reingest it every...
-
essentially making Wikidata
a crowdsourceable interface to PanLex
-
would be really awesome.
-
And then being able to use
PanLex in immediate translations,
-
to be able to do translations
across Wikidata lexical items--
-
that would be glorious.
-
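As a rough sketch of one direction of that link, the code below pulls a single Wikidata lexeme through the public Special:EntityData endpoint and flattens it into (language, lemma, gloss) rows that a lexical database could ingest. The lexeme ID is only a placeholder, and the flattened output shape is an assumption about what an ingest step might want, not an existing PanLex format.

```python
# Hedged sketch: read one Wikidata lexeme and flatten it into simple
# (language code, lemma, gloss) rows. "L42" is just a placeholder lexeme ID;
# the JSON layout assumed here is the standard lexeme structure with
# "lemmas" and "senses"/"glosses".
import json
import urllib.request

LEXEME_ID = "L42"  # placeholder
url = f"https://www.wikidata.org/wiki/Special:EntityData/{LEXEME_ID}.json"

with urllib.request.urlopen(url) as response:
    entity = json.load(response)["entities"][LEXEME_ID]

rows = []
for lang, lemma in entity.get("lemmas", {}).items():
    for sense in entity.get("senses", []):
        for gloss in sense.get("glosses", {}).values():
            rows.append((lang, lemma["value"], gloss["value"]))

for row in rows:
    print(row)
```
-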
(man 7) This is like the auditing process
of this semantic web
-
to close holes by inference.
-
If we take this further,
this kind of translation,
-
how do you deal with semantic mismatch
-
and grammatical mismatch?
-
For instance, if you try
to translate something in German,
-
you can simply put several words together
-
and reach something that's sensible,
-
and on the other hand,
I think I read somewhere that
-
not every language
has the same granular system
-
for colors, for instance.
-
You said every language
-
uses a different system
for colors, or the same?
-
(man 7) I remember maybe
that it's just about evolution of language
-
that they started out
with black and white and then--
-
Yeah, the color hierarchy.
-
Actually, the color hierarchy
-
is a great way to illustrate
how this works, right?
-
So, essentially, when you have
a single pivot language--
-
it's really interesting when
you read papers on machine translation
-
because oftentimes they'll talk about
some hypothetical pivot language,
-
that they say, "Oh yeah,
there is a pivot language,"
-
and then you read in the paper
and say, "It's English."
-
And so what this form
of lexical translation does,
-
by passing it through
many different intermediate languages,
-
it has the effect of being able
to deal with a lot of semantic ambiguity.
-
Because as long as you're passing it
through languages
-
that contain reasonably similar
semantic boundaries for a word,
-
then you can avoid
the problem of essentially
-
introducing semantic ambiguity
through the pivot language.
-
So using the color hierarchy thing
as an example,
-
if you take a language that has
a single color word for green and blue
-
and it translates it into blue
-
in your single pivot language
-
and then into another language
-
that has different ambiguities
on these things,
-
then you end up introducing
semantic ambiguity.
-
But if you pass it through
a bunch of other languages
-
that also contain a single
lexical item for green and blue,
-
then, essentially,
that semantic specificity
-
gets passed along
to the resultant language.
-
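Here is a small worked example, with invented words and languages, of why many intermediate languages help with the green/blue case: with a single English pivot the distinction collapses, but when candidates are scored across several intermediates that also use one "grue" word, the grue-preserving translation wins.

```python
# Toy illustration of the color example. Language A has one word covering
# green and blue, and so do two of the intermediate languages; English splits
# it. All words and language names here are invented.

# Word in A -> its translation in each intermediate language.
from_a = {"lang1": "grue1", "lang2": "grue2", "eng": "blue"}

# Intermediate word -> possible words in the target language C.
to_c = {
    "grue1": {"grue_C"},
    "grue2": {"grue_C"},
    "blue": {"blue_C", "grue_C"},  # English "blue" is ambiguous in C
}

votes = {}
for intermediate_word in from_a.values():
    for candidate in to_c.get(intermediate_word, set()):
        votes[candidate] = votes.get(candidate, 0) + 1

print(sorted(votes.items(), key=lambda kv: -kv[1]))
# [('grue_C', 3), ('blue_C', 1)] -- the grue-preserving translation wins,
# because most intermediates carve up the color space the same way as A.
```
-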
As far as the grammatical feature aspects go,
-
PanLex has been primarily, in its history,
-
collecting essentially lexemes,
essentially lexical forms.
-
And, by that, I mean, essentially,
-
whatever you get
as the headword for a dictionary.
-
So we don't necessarily
concentrate at this time
-
on collecting grammatical variant forms,
-
things like [inaudible] data, etc.
-
or past tense and present tense.
-
But it's something we're looking into.
-
One thing that it's always
important to remember
-
is that because our focus is--
-
is on underserved and endangered
minority languages,
-
we want to make sure
that something is available
-
before we make it perfect.
-
A phrase I absolutely love
-
is "Don't let the perfect
be the enemy of the good,"
-
and that's what we intend to do.
-
But we are super interested in the idea
-
of being able to handle grammatical forms,
-
and being able to translate
across grammatical forms,
-
and it's some stuff
we've done some research on
-
but we haven't fully implemented yet.
-
(man 8) So, of the 7,500 or so languages,
-
I assume you're relying on dictionaries
which are written for us,
-
but do all those languages
have standard written forms
-
and how do you deal with...?
-
That's a great question.
-
Essentially, yes, a lot of these languages,
-
as everyone's aware, are unwritten.
-
However, any language
for which a dictionary has been produced
-
has some kind of orthography,
-
and we rely on the orthography
produced for the dictionary.
-
We occasionally do some
slight massaging of orthography
-
if we can guarantee
it to be lossless, basically.
-
But we tend to avoid it
as much as possible.
-
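Here is a tiny, hypothetical example of what "lossless massaging" of an orthography can mean in practice: a character substitution is only kept if reversing it reproduces the original string exactly, so nothing from the source dictionary is thrown away. The substitution table is invented.

```python
# Hypothetical sketch of a lossless orthographic adjustment: apply a
# substitution table only if the change can be reversed exactly.
FORWARD = {"ŝ": "š", "ĉ": "č"}  # invented normalization table
REVERSE = {new: old for old, new in FORWARD.items()}

def apply_table(table, text):
    return "".join(table.get(ch, ch) for ch in text)

def normalize_losslessly(word):
    normalized = apply_table(FORWARD, word)
    # Keep the change only if the round trip gives back the original word.
    return normalized if apply_table(REVERSE, normalized) == word else word

print(normalize_losslessly("ŝula"))  # -> "šula"
```
-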
So, essentially,
we don't get into the business
-
of developing orthographies
for languages,
-
because oftentimes they have already been developed,
-
even if they're not really
widely published.
-
So, for example,
-
for a lot of languages
that are spoken in New Guinea,
-
there may not be a commonly
used orthographic form,
-
but some linguists
just come up with something
-
and that's a good first step.
-
We also collect phonetic forms
when they're available in dictionaries,
-
and so that's another way in,
-
essentially an IPA
representation of the word,
-
if that's available.
-
So that can also be used as well.
-
But we typically don't
use that as a pivot
-
because it introduces certain ambiguities.
-
(woman 4) Thank you,
this might be a super silly question,
-
but are those only the intermediate
languages you work with?
-
Oh, no. Oh, no.
-
(woman 4) Oh, yes, alright. Thank you.
-
No, I'm glad you asked.
It answers the question.
-
So this is actually a screenshot
from translate.panlex.org.
-
If you do a translation,
-
you'll get a list of translations
on the right side.
-
If you click the little dot-dot-dot button,
you'll get a graph like this.
-
And what this shows
is the intermediate languages,
-
the top 20 by score that are being used--
-
I could go into the details
of how we do the score,
-
but it's not super important now.
-
But to make the translation,
we're actually using way more than 20.
-
The reason I cap it at 20
is because if you have more than 20--
-
like this is actually
a kind of physics simulation;
-
you can move the things around
and they squiggle.
-
If you have more than 20,
your computer gets really mad.
-
So it's more of a demonstration, yeah.
-
(woman 5) Leila,
from Wikimedia Foundation.
-
Just one note on--
-
You mentioned Wikimedia Foundation
a couple of times in your presentation,
-
I wanted to say if you want to do
any kind of data ingestion
-
or a collaboration with Wikidata,
-
perhaps Wikimedia Deutschland
would be a better place
-
to have these conversations with?
-
Because Wikidata lives
within Wikimedia Deutschland
-
and the team is there,
-
and also the community
of volunteers around Wikidata
-
would be the perfect place to talk
-
about any kind of ingestions
-
or working with bringing
PanLex closer to Wikidata.
-
Great, thank you very much,
-
because honestly I'm not
exactly super familiar
-
with all of the intricacies
of the architecture
-
of how all the projects
relate to each other.
-
I'm guessing by the laughs
that it's complicated.
-
But, yeah, so basically
we would want to talk
-
with whoever is responsible for Wikidata.
-
So just do a little
[inaudible] place thing,
-
whoever is responsible for Wikidata,
that's who we're interested in talking to,
-
which is all of you volunteers.
-
Any further questions?
-
Okay, well, if anyone does end up having
any further questions beyond this
-
or ones that I talked about-- the details
and specifics about these things,
-
please come and talk to me,
I'm super interested.
-
And especially if you're dealing
with anything involving lexical stuff,
-
anything involving
endangered minority languages
-
and underserved languages,
-
and also Unicode,
which is something I do as well.
-
So thank you very much
-
and thank you
for inviting me to come speak,
-
I'm hoping that you enjoyed all this.
-
(applause)