Now, there are approximately
7,500 languages
spoken on the planet today.
Of those, it's estimated
that about 70%
are at risk of not surviving
the end of the 21st century.
Every time a language dies,
it's severing a connection
that has lasted for hundreds
to thousands of years,
to culture, to history,
and to traditions, and to knowledge.
The linguist Kenneth Hale once said
that every time a language dies,
it's like dropping
an atom bomb on the Louvre.
So the question is,
why do languages die?
Well, perhaps the simple answer might be
that one could imagine
authoritarian governments
preventing people from speaking
their native language,
children being punished
for speaking their language at school,
or the government
shutting down radio stations
in the minority language.
And this definitely happened in the past,
and it still, to some extent,
happens today.
But the honest answer
is that in the vast majority
of cases of language extinction,
the explanation is much simpler
and much easier to state.
The languages go extinct
because they are not passed down
from one generation to the next.
Every single time a person who speaks
a minority language has a child,
they go through a calculus.
They ask themselves,
"Do I pass my language down to my child,
or do I instead teach them
only the majority language?"
Essentially, there is a scale
that they weigh in their heads.
Every single time in their lives
that they've had an opportunity
to use their native language,
for communication,
for access to traditional culture,
a stone is placed on the left side.
And every time that they find themselves
unable to use their native language,
and instead have to rely on
the majority language,
a stone is placed on the right side.
Now, due to the strength and the dignity
of being able to speak
one's mother tongue,
the stones on the left
tend to be a bit heavier.
But with enough stones on the right side,
then eventually the scale tips,
and when a person comes to decide
whether to pass their language down,
they see their own language
as more of a burden than a blessing.
So the question is,
how do we reverse this?
First, we need to think
about the fact that,
for any given language,
there are certain social spheres
that it can be used in.
So any language
that's a mother tongue spoken today,
can be used with one's family.
A smaller set of languages
can be used within one's community,
a smaller set, maybe within one's region,
and for a small handful of languages,
they can be used
for international communication.
And then even across these spheres,
there's the question of whether someone
can use their language
for education, for business,
or in technology.
So, to better explain
what I'm talking about here,
I would like to use an anecdote.
Let's say that you are about to go
on your dream vacation to India,
and you have an eight-hour
layover in Istanbul.
Now, you weren't necessarily
planning on visiting Turkey,
but with your layover
and with a Turkish friend
telling you about an amazing restaurant
that's not too far from the airport,
you say, "Hey, you know,
maybe I'll stop by during my layover."
So, you exit the airport,
you get to your restaurant,
and they hand you a menu,
and the menu is entirely in Turkish.
Now, let's say,
for the point of this exercise,
that you don't speak Turkish.
What do you do?
Well, best-case scenario,
you find someone perhaps
who can speak your native language,
German, English, etc.
But let's say it's not your lucky day
and nobody in the restaurant can speak
any German or any English.
So what do you do?
Well, if you are like me,
and I imagine most of you,
you'd probably turn
to a technological solution,
machine translation
or a digital dictionary,
look up each word individually,
and eventually order yourself
a delicious Turkish meal.
Now, let's imagine this scenario instead,
in which you are the native speaker
of a minority language.
Let's say, Lower Sorbian.
Lower Sorbian is an endangered language
spoken here in Germany,
about 130 kilometers
southeast of here,
by only a few thousand people,
mostly elderly.
Now, let's say your mother tongue
is Lower Sorbian.
You end up in the restaurant.
Now, of course, the odds
of finding someone
who speaks your native language
in the restaurant are extraordinarily low.
But, again, you can just go
to a technological solution.
However, for your native language,
these technological solutions don't exist.
You would have to rely on
German or English
as your pivot language into Turkish.
Now, of course, you still end up
getting your delicious Turkish meal,
but you begin to think about
how difficult this would have been
if you were your grandfather,
who spoke no German at all.
Now, this is just a small incident,
but it's going to place a stone
on the right side of that scale,
and make you think that perhaps,
when you have children,
or when you have another child,
the burden that you went through here
may make keeping your language
seem not worth it.
And imagine if this was a scenario
that was of significantly more importance,
such as, for example, being in a hospital.
Now, this is the point
at which we can help--
and by "we," I mean you and me
in this room.
We have the tools
to help with this.
If technological tools
are available for people
who speak minority
and underserved languages,
it puts a little finger on the scale,
on the left side of the scale.
Someone doesn't necessarily have to think
that they have to rely on
the majority language
in order to interact
with the outside world,
because it opens the social spheres
a little bit more.
So, of course, the ideal solution
is that we have machine translation
in every language in the world.
But, unfortunately,
that's just not feasible.
Machine translation
requires large corpora of text,
and for many of these languages
that are endangered or underserved,
such data is simply not available.
Some of them aren't even commonly written
and thus getting enough data
to make a machine translation engine
is unlikely.
But what is available is lexical data.
Through the work of many linguists
over the past few hundred years,
dictionaries and grammars
have been produced
for most of the world's languages.
But, unfortunately, most of these works
are not accessible
or available to the world,
let alone to speakers
of these minority languages.
And it's not an intentional process,
a lot of times it's simply because
the initial print run
of these dictionaries was small,
and the only copies
are moldering away
in a university library somewhere.
But we have the ability to take that data
and make it accessible to the world.
The Wikimedia Foundation
is one of the best organizations,
I would say the best
organization in the world,
for getting data available
to the vast majority
of the population of this planet.
So let's work on that.
So to explain a little bit
about what we've been doing
in this regard,
I'd like to introduce
my organization, PanLex,
which is an organization
that is attempting
to collect lexical data for this purpose.
We got started about 12 years ago
at the University of Washington,
as a research project.
The idea behind it
was to show that inferred translations
could create an effective
translation device,
essentially a lexical translation device.
This is an example
from PanLex data itself.
This is showing how to translate
the word "ev" in Turkish,
which means house,
to Lower Sorbian,
the language I was referring to earlier.
So you're unlikely to find
a Turkish-to-Lower Sorbian dictionary,
but by passing it through
many, many different
intermediate languages,
you can create effective translations.
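To make that inference concrete, here is a minimal sketch of two-hop lexical translation with path-count scoring; the bilingual word lists below are toy data invented for illustration, not actual PanLex entries.

```python
from collections import Counter

# Toy bilingual dictionaries (invented for illustration): word -> translations.
tur_to_mid = {
    "deu": {"ev": {"Haus"}},   # Turkish -> German
    "pol": {"ev": {"dom"}},    # Turkish -> Polish
    "ces": {"ev": {"dům"}},    # Turkish -> Czech
}
mid_to_dsb = {
    "deu": {"Haus": {"dom"}},  # German -> Lower Sorbian
    "pol": {"dom": {"dom"}},   # Polish -> Lower Sorbian
    "ces": {"dům": {"dom"}},   # Czech -> Lower Sorbian
}

def infer_translations(word):
    """Score candidate Lower Sorbian translations of a Turkish word by
    counting how many intermediate-language paths support each candidate."""
    scores = Counter()
    for lang, dictionary in tur_to_mid.items():
        for mid_word in dictionary.get(word, set()):
            for candidate in mid_to_dsb[lang].get(mid_word, set()):
                scores[candidate] += 1
    return scores.most_common()

print(infer_translations("ev"))  # [('dom', 3)] -- supported by three paths
```

Each additional intermediate language that agrees adds another supporting path, which is the basic idea behind the inferred translations described above.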
So, once this was shown
in the research projects,
the founder of PanLex,
Dr. Jonathan Pool,
decided, "Well, you know,
why not actually just do this?"
So he started a non-profit
to collect as much lexical data
as possible and make it accessible.
That's what we've been doing
for the past 12 years.
In that time, we've collected
thousands and thousands of dictionaries,
and extracted lexical data out of them
and compiled a database that allows
inferred lexical translation
across any of--
Our current count is around 5,500
of the 7,500 languages in the world.
And, of course,
we're constantly trying to expand that
and expand the data
on each individual language.
So, the next question is,
what can we do to work together on this?
We, at PanLex, have been
extremely excited to watch
the development of lexical data
that Wikidata has been working on lately.
It's very fascinating to see organizations
that are working in a very similar sphere,
but in different aspects.
And we are extremely excited to see
the results of this from Wikidata.
And also we are looking forward
to collaborating with Wikidata.
I think that the special skills
that we've developed
over the past 12 years,
with not just collecting lexical data,
but also in database design,
could be extremely useful for Wikidata.
And on the other side, I think that--
I especially am excited about Wikidata's
ability to do crowdsourcing of data.
PanLex, currently,
our sources are entirely
printed lexical sources
or other types of lexical sources,
but we don't do any crowdsourcing.
We simply don't have
the infrastructure for it available
and of course, the Wikimedia Foundation
is the world expert in crowdsourcing.
I'm really looking
forward to seeing exactly
how we can apply these skills together.
But, overall, I think the main thing
to keep in mind here
is that when we're
working on these things,
it's minute detail.
We're sitting around
looking at grammatical forms,
or paging our way through
dictionaries, ancient dictionaries,
or sometimes
recently published dictionaries
and getting into written forms of words,
and it feels very close up.
But, occasionally, we need
to take a step back and remember
that, even though what we're doing
can feel mundane at times,
the work we're doing
is extremely important.
This is, in my opinion,
the absolute best way
that we can support endangered languages
and make sure that the linguistic
diversity of the planet
is preserved up to the end
of this century or longer.
It's entirely possible that the work
that we're doing today
may result in languages
being preserved and passed down,
and not going extinct.
So just remember
that even if you're sitting
around on your computer
editing an individual entry
and adding the dative form
of every single noun
in a small minority language,
the little thing
that you're doing right now
might actually be partially responsible
for making sure that language survives
until the end of the century or longer.
Thank you very much,
and I'd like to open
the floor to questions.
(applause)
(woman 1) Thank you.
- Thank you for your talk.
- Thank you.
(woman 1) I just have a question
about dictionaries.
You said that you work
with printed dictionaries?
- Yes.
- (woman 1) So my question
is what do you take
from those dictionaries
and if there's any copyright thing
you have to deal with?
I anticipated this to be
the first question that I would get.
(laughter)
So, first off, for PanLex:
according to the legal
resources we have consulted,
whereas the arrangement and organization
of a dictionary is copyrightable,
the translations themselves
are not considered copyrightable.
A good example is a phone book,
which is considered,
at least according to US law,
copyrightable in its arrangement.
But saying that person X's
phone number is digits D
is not copyrightable.
So like I said,
according to our legal scholars,
this is how we can deal with this.
But even if that's not
a solid enough legal argument,
one important thing to remember
is that the vast majority
of this lexical data
is actually out of copyright.
A significant number of these works
are out of copyright
and thus can be used without...
And the other thing
is that oftentimes, for example,
if we're working with
a recently made print dictionary,
rather than trying to scan it and OCR it,
we just email the person who made it.
And it turns out that
most linguists are really excited
that their data can be made accessible.
And so they're like, "Sure, please,
just put it all in there
and make it accessible."
So like I said, we have, at least
according to our legal opinions,
the ability to do this,
but even if you don't want
to go with that,
it's very easy to get
the data made publicly accessible.
- (man 1) Thank you. Hi.
- Hi.
(man 1) Can you say a little more
about how the person who speaks
Lower Sorbian is accessing the data?
Like specifically how
that information is getting to them
and how that might help to convince them
to either try out the--
Great question and this is actually
one that I think about a lot as well,
because I think that
when we talk about data access,
there are actually
multiple steps to this.
One is, of course, data preservation:
make sure the data doesn't go away.
Second is to make sure it's interoperable
and can be used.
And third is to make sure
that it's available.
So in PanLex's case,
we have an API that can be used,
but, obviously,
that can't be used by an end user.
But we've also developed interfaces.
And so, for example,
if you go to translate.panlex.org,
you can do translations on our database.
If you want to mess around
with the API, just go to dev.panlex.org,
and you can find a bunch of stuff
on the API, or just api.panlex.org.
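For anyone who wants a rough picture of what programmatic access could look like, here is a hedged sketch of calling the API from Python; the endpoint path, parameter names, and language-variant codes are assumptions made purely for illustration, so consult dev.panlex.org for the documented request format.

```python
import requests

# Illustrative only: this endpoint path and these parameters are assumptions,
# not the documented PanLex API; see https://dev.panlex.org for the real format.
API_ENDPOINT = "https://api.panlex.org/v2/expr"  # assumed path

def lookup_translations(text, source_uid="tur-000", target_uid="dsb-000"):
    """Ask the API (hypothetical parameters) for translations of `text`
    from a source language variant into a target language variant."""
    params = {
        "uid": target_uid,        # assumed: variant of the returned expressions
        "trans_txt": text,        # assumed: the text to translate
        "trans_uid": source_uid,  # assumed: variant of the input text
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: Turkish "ev" into Lower Sorbian, using assumed variant codes.
print(lookup_translations("ev"))
```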
But there's another step too,
which is that even if you make
all of your data completely accessible
with tools that are super useful
to be able to access it,
if you don't actually promote the tools,
then people won't actually
be able to use it.
And this is honestly kind of a...
the thing that isn't talked about enough,
and I don't have a good answer for it.
How do we make sure that--
For example, I only fairly recently,
only a few years ago,
got acquainted with Wikidata,
and it's exactly the kind
of thing that I'm interested in.
So, how do we promote
ourselves to others?
I'm leaving that as an open question.
Like I said, I don't have
a good answer for this.
But, of course, in order to do that,
we still need to accomplish
the first few steps.
(man 2) If we want to have
machine translation,
don't we need a translation memory?
I'm not sure that the individual words
that we put into Wikidata,
these short phrases
that we put into Wikidata,
either as ordinary Wikidata items
or as Wikidata lexemes,
are sufficient to do a proper translation.
We need to have full sentences,
for example, for--
(Benjamin) Yeah, absolutely.
(man 2) And where do we get
this data structure?
I'm not sure that, currently,
Wikidata is able to handle very well
the issue of a translation memory,
or translatewiki.net,
for getting into that gap of...
Should we do anything
in that respect, or should we--
Yeah, and I really
appreciate your question.
I touched on this a little bit earlier,
but I'd love to reiterate it.
This is precisely the reason
that PanLex works in lexical data
and why I'm excited about lexical data,
as opposed to--
not as opposed to, but in addition
to machine translation engines
and machine translation in general.
As you said, machine translation
requires a specific kind of data,
and that data is not available
for most of the world's languages.
For the vast majority
of the world's languages,
that simply is not available.
But that doesn't mean
we should just give up.
Like why?
If I needed to translate
my Turkish restaurant menu,
then lexical translation would likely
be an exceptionally good tool for that.
Now, I'm not saying
that you can use lexical translation
to do perfect paragraph
to paragraph translation.
When I say lexical translation,
I mean word-to-word translation,
and word-to-word translation
can be extremely useful.
It's funny to think about it,
but nobody really had access
to really good machine translation
until fairly recently.
And we still got by with dictionaries,
and they're an incredibly good resource.
And the data is available,
so why not make it available
to the world at large
and to the speakers of these languages?
(woman 2) Hi, what mechanisms
do you have in place
when the community itself--I'm over here.
- Where are you? Okay, right.
- (woman 2) Yeah, sorry. (laughs)
...when the community itself
doesn't want part of their data in PanLex?
Great question.
So the way that we work with that
is that if a dictionary is published
and made publicly available,
that's a good indication.
Like you could buy it in a store,
or find it at a university library
or a public library anyone can access.
That's a good indication
that that decision has been made.
(woman 2) [inaudible]
(man 3) Please, [inaudible],
could you speak in the microphone?
Can you say it again?
(woman 2) Linguists don't always have
the permission of the community
to publish things;
they oftentimes publish
without the consent of the community.
And that's absolutely true.
I would say that is a--
That does happen.
I would say it's generally
a small minority of cases,
mostly confined
to North America,
although sometimes
South American languages as well.
It's something we have
to take into account.
If we were to receive word, for example,
that the data that is in PanLex
should not be accessed
by the greater world,
then, of course, we would remove it.
(woman 2) Good, good.
That doesn't mean, of course,
that we'll listen
to copyright rules necessarily
but we will listen
to traditional communities,
and that's the major difference.
(woman 2) Yeah,
that's what I'm referring to.
It brings up a really interesting point,
which is that
sometimes it's a really big question
of who speaks for a language.
I had some experience actually
visiting the American Southwest
and working with some groups
who work on the indigenous
Pueblo languages out there.
So there are approximately
six Pueblo languages,
depending on how you slice it,
spoken in that area.
But they are divided
amongst 18 different Pueblos
and each one has their own
tribal government,
and each government
may have a different opinion
on whether their language
should be accessible to outsiders or not.
Like, for example, Zuni Pueblo,
it's a single Pueblo
that speaks the Zuni language.
And they're really big
on their language going everywhere,
they put it on the street signs
and everything, it's great.
But for some of the other languages,
you might have one group that says,
"Yeah, we don't want our language
being accessed by outsiders."
But then you have the neighboring Pueblo
who speaks the same language say,
"We really want our language
accessible to outsiders
in using these technological tools,
because we want our language
to be able to continue on."
And it raises a really
interesting ethical question.
Because if you default by saying,
"Fine, I'm cutting it off because
this group said we should cut it off"--
aren't you also doing a disservice
to the second group,
because they actively
want you to roll out these things?
So I don't think this is a question
that has an easy answer.
But I would say
at least in terms of PanLex.
And for the record, we actually
haven't encountered this yet,
that I'm aware of.
Now, that could be partially because...
Getting back to his question,
we may need to promote more. (chuckles)
But, in general, as far as I know,
we have not had this come up.
But our game plan for this
is if a community says they don't want
their data in a database,
then we remove it.
(woman 2) Because we have come up
with it in Wikidata and Wikipedia...
- You have?
- (woman 2) ...in comments.
- Really?
- (woman 2) It's been a problem.
Yeah, I can imagine especially in comments
for photos or certain things.
(woman 2) Correct.
(man 4) Hi, I had a question about
the crowdsourcing aspect of this.
As far as going in and asking a community
to annotate or add data for a dataset,
one of the things
that's a little intimidating is like,
as an editor, I can only see
what things are missing.
But if I'm going to spend time
on things, having an idea
that there's a list of high-priority items
is, I guess,
very motivating in this aspect.
And I was curious if you had a system
which is, essentially, like,
we know the gaps in our own data,
we have linguistic evidence
to know that these are the ones
that if we had annotated,
these would be the high impact drivers.
So I can imagine
having the lexeme
for "house" being very impactful,
maybe not a lexeme
for "data" or some other word like that.
But I was curious if you had that,
and if it is something
that could be used
to drive these community efforts.
Great question.
So one thing that Wikidata
has a whole lot of--
sorry, excuse me, PanLex
has a whole lot of are Swadesh lists.
We have apparently the largest collection
of Swadesh lists in the world
which is interesting.
If you don't know what a Swadesh list is,
it's essentially a regularized
list of lexical items
that can be used
for analysis of languages.
They contain really basic sets.
So there's a couple
of different kinds of Swadesh lists.
But there are 100 or 207 items
and they might contain
words like "house" and "eye" and "skin"
and basically general words
that you should be able
to find in any language.
So that's like a really
good starting point
for having that kind of data available.
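As one way to picture how a Swadesh-style list could drive prioritization, here is a small sketch that checks which basic concepts are still missing for a language; the concept list and the toy lexicon are invented for illustration and are not PanLex's actual Swadesh holdings.

```python
# A few Swadesh-style basic concepts (toy subset, for illustration only).
SWADESH_SAMPLE = ["I", "you", "water", "sun", "eye", "skin", "house"]

# Toy lexicon for a hypothetical minority language: concept -> word.
toy_lexicon = {"water": "akwa", "sun": "sol", "eye": "oko"}

def missing_concepts(lexicon, concepts=SWADESH_SAMPLE):
    """Return the basic concepts with no entry yet, i.e. the
    highest-priority gaps for contributors to fill first."""
    return [c for c in concepts if c not in lexicon]

print(missing_concepts(toy_lexicon))  # ['I', 'you', 'skin', 'house']
```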
Now, as I mentioned before,
crowdsourcing is something
that we don't do yet
and we're actually
really excited to be able to do.
It's one of the things I'm really excited
to talk to people
at this conference about,
is how crowdsourcing can be used
and the logistics behind it,
and these are the kind
of questions that can come up.
So I guess the answer I can say to you
is that we do have a priority list--
Actually, one thing I can say
is we definitely do have a priority list
when it comes to which languages
we are seeking out.
So the way we do this
is that we look for languages
that are not currently served
by technological solutions,
which are oftentimes minority languages,
or usually minority languages,
and then prioritize those.
But in terms of individual lexical items,
the general way we get new data
is essentially by ingesting
an entire dictionary's worth.
We are relying on the dictionary's choice
of lexical items,
rather than necessarily saying
we're really looking for the word
for "house" in every language.
But when it comes to data crowdsourcing,
we will need something like that.
So this is an opportunity
for research and growth.
(man 5) Hi, I'm Victor,
and this is awesome.
As you have slides here,
can you talk a little bit
about the technical status:
do you currently have data
or information flowing
between Wikidata and PanLex?
Is that implemented already,
and how do you deal with
back and forth, or even
a feedback loop of information,
between PanLex and Wikidata?
So we actually don't have any formal
connections to Wikidata at this point,
and this is something that I'm, again,
I'm really excited to talk
to people in this conference about.
We've had some interaction
with Wiktionary,
but Wikidata is actually
a better fit, honestly,
for what we are looking for.
Having directly lexical data
means that we have to do a lot less
data analysis and extraction.
And so the answer is,
we don't yet, but we want to.
(man 5) And if not,
what are the obstacles?
And as we can see, Wikidata
already supports several languages,
but when I look up translate.panlex.org,
you apparently support
many, many variants,
much more than Wikidata.
How do you see the gap
between a translation-first,
or lexical-translation-first,
application versus an effort
at trying to map a knowledge structure?
Mapping knowledge
would actually be very interesting.
We've had some
very interesting discussions
about the way that Wikidata
organizes its lexical data,
your lexical data,
and how we organize our lexical data.
And there are subtle differences
that would require a mapping strategy,
some of which will not
necessarily be automatic,
but we might be able to develop
techniques to be able to do this.
You gave the example of language variants.
We tend to be very "splittery"
when it comes to language variants.
In other words,
if we get a source that says
that this is the dialect spoken
on the left side of the river
in Papua New Guinea, for this language,
and we get another source that says
this is the dialect spoken
on the right side of the river,
then we consider them
essentially separate languages.
And so we do this in order to basically
preserve the most data that we can.
Being able to map that
to how Wikidata does it--
Actually, what I would love
is to have conversations
about how languages
are designated on Wikidata.
Again, we go with the strategy
of very much a "splittery" strategy.
We broadly rely on ISO 639-3 codes,
which are provided by the Ethnologue,
and then within each individual code,
we allow multiple variants,
either for script variants
or regional dialects or sociolects, etc.
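To sketch what that "splittery" bookkeeping could look like, here is a toy representation of language varieties under a single ISO 639-3 code; the numbering scheme and the example dialects are invented for illustration, loosely inspired by variant-style identifiers rather than copied from PanLex's actual scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LangVariant:
    """One language variety: an ISO 639-3 code plus a local variant number.
    The numbering scheme here is illustrative, not PanLex's actual scheme."""
    iso_639_3: str
    variant: int
    note: str = ""

    def uid(self) -> str:
        return f"{self.iso_639_3}-{self.variant:03d}"

# Two dialects recorded by different sources stay separate variants under
# the same code, so neither source's data gets collapsed or thrown away.
# ("xyz" is a placeholder, not a real ISO 639-3 assignment.)
left_bank = LangVariant("xyz", 1, "dialect on the left side of the river")
right_bank = LangVariant("xyz", 2, "dialect on the right side of the river")

print(left_bank.uid(), right_bank.uid())  # xyz-001 xyz-002
```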
Again, opportunity
for discussion and work.
(woman 3) Hi, I would like to know
if you have a OCR pipeline
and especially because
we've been trying to do OCR on Maya,
and we don't get any results.
It doesn't understand anything--
- Oh, yeah! (laughs)
- (woman 3) And... yeah.
So if your pipelines are available.
And the other one is just
on the overlap of ISO codes,
like sometimes they say,
"Oh, this is a language,
and this is another language,"
but there are sources
that say other stuff,
as you were mentioning,
but they tend to overlap.
So how do you go on...? Yeah.
Yeah, that's absolutely
an amazing question.
I really like it.
So we don't have a formalized
OCR pipeline per se;
we do it on a sort of
source by source basis.
One of the reasons why
is because we oftentimes have sources
that not necessarily need to be OCR'd,
that are available
for some of these languages,
and we concentrate on those because
they require the least amount of work.
But, obviously,
if we really want to dive deep
into some of our sources
that are in our backlog,
we're going to need to essentially
develop strong OCR pipelines.
But there's another aspect too,
which is that, as you mentioned...
like the people who designed OCR engines
I think are not realizing
how much you can stress test them.
Like, you know what's fun?--
trying to OCR
a Russian-Tibetan dictionary.
It's really hard, it turns out...
We gave up, and we hired
someone to just type it up,
which was totally doable.
And actually, it turns out
that this amazing Russian woman
learned to read Tibetan
so she could type this up,
which was super cool.
I think that if you're dealing
with stuff in the Latin script,
then OCR solutions can be developed
that are more robust,
that deal with
multilingual sources like this
and expect that you're going
to get a random four in there,
if you're dealing with something like
16th-century Mayan sources, you know,
with the digit four.
But there are some sources
that OCR is probably just
never really going to catch up to,
or that would require
such an immense amount of work--
actually, we're putting a little
bit of this into practice right now.
We have another project
we're running at PanLex
to transcribe all of the traditional
literature of Bali,
and we found that in handwritten
Balinese manuscripts,
there's just no chance of OCR.
So we got a bunch
of Balinese people to type them up,
and it's become a really cool
cultural project within Bali,
and it's become news and stuff like that.
So I would say
that you don't necessarily
need to rely on OCR,
but there is a lot out there.
So having good OCR solutions
would be good.
Also, if anyone out here
is into super multilingual OCR,
please come talk to me.
(man 6) Thank you for your presentation.
You talked about integration
between PanLex and Wikidata,
but you haven't gone into the specifics.
So I was checking your data license,
and it is under CC0.
- Yes.
- (man 6) That's really great.
So there are two possible ways
that either we can import the data
or we can continue something similar
to the Freebase way,
where we had the complete
database from Freebase,
and we imported it, and we made a link,
an external identifier
to the Freebase database.
So if you have something in mind,
are you thinking similar?
Or you just want to make...
an independent database
which can be linked to Wikidata?
Yeah, so this is a great question
and actually I feel
like it's about one step ahead
of some of the stuff
that I've already been thinking about,
partially because, like I said,
getting the two databases to work together
is a step in and of itself.
I think the first step that we can take
is literally just pooling
our skills together.
We have a lot of experience
dealing with stuff
like classifications of properties
of individual lexemes
that I'd love to share.
But being able to link the databases
themselves would be wonderful.
I'm 100% for that.
I think it would be a little bit easier
in the Wikidata-to-PanLex direction,
but maybe I'm just biased
because I can see how that could work.
Yeah, essentially, as long
as Wikidata is comfortable
with all the licensing stuff like that,
or we work something out,
then I think that would be a great idea.
We'd just have to figure out ways
of linking the data itself.
One thing I can imagine is, essentially,
that I would love for edits to Wikidata
to immediately propagate
to the PanLex database,
without having to essentially
just reingest it every...
essentially making Wikidata
a crowdsourceable interface to PanLex
would be really awesome.
And then being able to use
PanLex in immediate translations,
to be able to do translations
across Wikidata lexical items--
that would be glorious.
(man 7) This is like the auditing process
of this semantic web
to close holes by inference.
If we think further about
this kind of translation,
how do you deal with semantic mismatch
and grammatical mismatch?
For instance, if you try
to translate something in German,
you can simply put several words together
and reach something that's sensible,
and on the other hand,
I think I read sometimes
not every language
has the same granular system
for colors, for instance.
You said every language
uses a different system
for colors, or the same one?
(man 7) I remember maybe
that it's just about evolution of language
that they started out
with black and white and then--
Yeah, the color hierarchy.
Actually, the color hierarchy
is a great way to illustrate
how this works, right?
So, essentially, when you have
a single pivot language--
it's really interesting when
you read papers on machine translation,
because oftentimes they'll talk about
some hypothetical pivot language,
and they say, "Oh yeah,
there is a pivot language,"
and then you read further in the paper
and it says, "It's English."
And so what this form
of lexical translation does,
by passing it through
many different intermediate languages,
it has the effect of being able
to deal with a lot of semantic ambiguity.
Because as long as you're passing it
through languages
that have reasonably similar
semantic boundaries for a word,
then you can avoid
the problem of essentially
introducing semantic ambiguity
through the pivot language.
So using the color hierarchy thing
as an example,
if you take a language that has
a single color word for green and blue
and it translates it into blue
in your single pivot language
and then into another language
that has different ambiguities
on these things,
then you end up introducing
semantic ambiguity.
But if you pass it through
a bunch of other languages
that also contain a single
lexical item for green and blue,
then, essentially,
that semantic specificity
gets passed along
to the resultant language.
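As a toy illustration of that effect, here is a small sketch in which a "grue" word (a single word covering green and blue) is routed through several invented intermediate languages; with path counting, the candidate supported by many grue-preserving intermediates outscores the one reachable only through a single blue-only pivot. All words and language names here are made up.

```python
from collections import Counter

# Invented two-hop paths: (intermediate language, intermediate word, target candidate).
# Three intermediates also merge green/blue into one word; one pivot only has "blue".
paths = [
    ("intermediate_1", "grue_1", "target_grue"),
    ("intermediate_2", "grue_2", "target_grue"),
    ("intermediate_3", "grue_3", "target_grue"),
    ("pivot_only",     "blue",   "target_blue"),
]

scores = Counter(target for _lang, _word, target in paths)

# The grue-preserving candidate wins 3 paths to 1, so the original green/blue
# semantics survive the inference instead of being narrowed to "blue".
print(scores.most_common())  # [('target_grue', 3), ('target_blue', 1)]
```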
As far as the grammatical feature aspects,
PanLex has been primarily, in its history,
collecting essentially lexemes,
essentially lexical forms.
And, by that, I mean, essentially,
whatever you get
as the headword in a dictionary.
So we don't necessarily
concentrate at this time
on collecting grammatical variant forms,
things like [inaudible] data, etc.
or past tense and present tense.
But it's something we're looking into.
One thing that's always
important to remember
is that because our focus
is on underserved and endangered
minority languages,
we want to make sure
that something is available
before we make it perfect.
A phrase I absolutely love
is "Don't let the perfect
be the enemy of the good,"
and that's the approach we intend to take.
But we are super interested in the idea
of being able to handle grammatical forms,
and being able to translate
across grammatical forms,
and it's some stuff
we've done some research on
but we haven't fully implemented yet.
(man 8) So, of the 7,500 or so languages,
I assume you're relying on dictionaries
which are written for us,
but do all those languages
have standard written forms
and how do you deal with...?
That's a great question.
Essentially, yes, a lot of these languages,
as everyone's aware, are unwritten.
However, any language
for which a dictionary has been produced
has some kind of orthography,
and we rely on the orthography
produced for the dictionary.
We occasionally do some
slight massaging of orthography
if we can guarantee
it to be lossless, basically.
But we tend to avoid it
as much as possible.
So, essentially,
we don't get into the business
of developing orthographies
for languages,
because oftentimes they have
already been developed,
even if they're not really
widely published.
So, for example,
for a lot of languages
that are spoken in New Guinea,
there may not be a commonly
used orthographic form,
but some linguists
just come up with something
and that's a good first step.
We also collect phonetic forms
when they're available in dictionaries,
and so that's another way in,
essentially an IPA
representation of the word,
if that's available.
So that can also be used as well.
But we don't typically
use that as a pivot
because it introduces certain ambiguities.
(woman 4) Thank you,
this might be a super silly question,
but are those the only intermediate
languages you work with?
Oh, no. Oh, no.
(woman 4) Oh, yes, alright. Thank you.
No, I'm glad you asked.
It answers the question.
So this is actually a screenshot
from translate.panlex.org.
If you do a translation,
you'll get a list of translations
on the right side.
You click a little dot dot dot button,
you'll get a graph like this.
And what this shows
is the intermediate languages
that are being used,
the top 20 by score--
I could go into the details
of how we do the score,
but it's not super important now.
But to make the translation,
we're actually using way more than 20.
The reason I cap it at 20
is because if you have more than 20--
like this is actually
a kind of a physics simulation
you can move the things around
and they squiggle.
If you have more than 20,
your computer gets really mad.
So it's more of a demonstration, yeah.
(woman 5) Leila,
from Wikimedia Foundation.
Just one note on--
You mentioned Wikimedia Foundation
a couple of times in your presentation,
I wanted to say if you want to do
any kind of data ingestion
or a collaboration with Wikidata,
perhaps Wikimedia Deutschland
would be a better place
to have these conversations?
Because Wikidata lives
within Wikimedia Deutschland
and the team is there,
and also the community
of volunteers around Wikidata
would be the perfect place to talk
about any kind of ingestions
or working with bringing
PanLex closer to Wikidata.
Great, thank you very much,
because honestly I'm not
exactly super familiar
with all of the intricacies
of the architecture
of how all the projects
relate to each other.
I'm guessing by the laughs
that it's complicated.
But, yeah, so basically
we would want to talk
with whoever is responsible for Wikidata.
So just do a little
[inaudible] place thing,
whoever is responsible for Wikidata,
that's who we're interested in talking to,
which is all of you volunteers.
Any further questions?
Okay, well, if anyone does end up having
any further questions beyond this
or ones that I talked about-- the details
and specifics about these things,
please come and talk to me,
I'm super interested.
And especially if you're dealing
with anything involving lexical stuff,
anything involving
endangered minority languages
and underserved languages,
and also Unicode,
which is something I do as well.
So thank you very much
and thank you
for inviting me to come speak,
I'm hoping that you enjoyed all this.
(applause)