
cdn.media.ccc.de/.../wikidatacon2019-14-eng-Keynote_Why_is_collecting_lexical_data_one_of_the_best_ways_we_can_help_support_underserved_and_endangered_languages_hd.mp4

  • 0:06 - 0:09
    Now, there are approximately
    7,500 languages
  • 0:09 - 0:11
    spoken on the planet today.
  • 0:12 - 0:14
    Of those, it's estimated
  • 0:14 - 0:18
    that about 70%
    are at risk of not surviving
  • 0:18 - 0:20
    the end of the 21st century.
  • 0:22 - 0:24
    Every time a language dies,
  • 0:25 - 0:27
    it's severing a connection
  • 0:27 - 0:31
    that has lasted for hundreds
    to thousands of years,
  • 0:31 - 0:35
    to culture, to history,
  • 0:35 - 0:38
    and to traditions, and to knowledge.
  • 0:39 - 0:42
    The linguist Kenneth Hale once said
  • 0:42 - 0:44
    that every time a language dies,
  • 0:44 - 0:47
    it's like dropping
    an atom bomb on the Louvre.
  • 0:49 - 0:52
    So the question is,
  • 0:53 - 0:55
    why do languages die?
  • 0:56 - 1:00
    Well, perhaps the simple answer might be
  • 1:00 - 1:03
    that one could imagine
    authoritarian governments
  • 1:03 - 1:05
    preventing people from speaking
    their native language,
  • 1:06 - 1:10
    children being punished
    for speaking their language at school,
  • 1:10 - 1:13
    or the government
    shutting down radio stations
  • 1:13 - 1:15
    in the minority language.
  • 1:15 - 1:17
    And this definitely happened in the past,
  • 1:17 - 1:19
    and it still, to some extent,
    happens today.
  • 1:20 - 1:23
    But the honest answer
  • 1:23 - 1:27
    is that for the vast majority
    of the cases of language extinction,
  • 1:27 - 1:29
    it's a much simpler
  • 1:29 - 1:33
    and a much easier-to-explain answer.
  • 1:34 - 1:36
    The languages go extinct
  • 1:36 - 1:38
    because they are not passed down
  • 1:38 - 1:40
    from one generation to the next.
  • 1:42 - 1:44
    Every single time a person who speaks
  • 1:44 - 1:46
    a minority language has a child,
  • 1:47 - 1:50
    they go through a calculus.
  • 1:51 - 1:53
    They ask themselves,
  • 1:54 - 1:56
    "Do I pass my language down to my child,
  • 1:57 - 2:01
    or do I instead teach them
    only the majority language?"
  • 2:01 - 2:03
    Essentially, there is a scale
  • 2:04 - 2:06
    that they assess in their heads,
  • 2:07 - 2:08
    in which on one side
  • 2:10 - 2:12
    every single time in their lives
  • 2:12 - 2:14
    that they've had an opportunity
    to use their native language
  • 2:15 - 2:18
    for communication,
    for access to traditional culture,
  • 2:20 - 2:22
    a stone is placed on the left side.
  • 2:22 - 2:24
    And every time that they find themselves
  • 2:24 - 2:26
    unable to use their native language,
  • 2:26 - 2:28
    and instead have to rely on
    the majority language,
  • 2:28 - 2:30
    a stone is placed on the right side.
  • 2:32 - 2:35
    Now, due to the strength and the dignity
  • 2:35 - 2:37
    of being able to speak
    one's mother tongue,
  • 2:37 - 2:39
    the stones on the left
    tend to be a bit heavier.
  • 2:39 - 2:42
    But with enough stones on the right side,
  • 2:43 - 2:45
    then eventually the scale tips,
  • 2:45 - 2:47
    and then when a person makes the decision
  • 2:47 - 2:49
    to pass their language down,
  • 2:49 - 2:51
    they see their own language
  • 2:51 - 2:53
    as more of a burden than a blessing.
  • 2:55 - 2:59
    So the question is,
    how do we reverse this?
  • 2:59 - 3:02
    First, we need to think
    about the fact that,
  • 3:04 - 3:05
    for any given language,
  • 3:05 - 3:08
    there are certain social spheres
    that it can be used in.
  • 3:08 - 3:09
    So any language
  • 3:09 - 3:11
    that's a mother tongue spoken today
  • 3:11 - 3:13
    can be used with one's family.
  • 3:14 - 3:17
    A smaller set of languages
    can be used within one's community,
  • 3:17 - 3:19
    a smaller set, maybe within one's region,
  • 3:19 - 3:22
    and for a small handful of languages,
  • 3:23 - 3:24
    they can be used
    for international communication.
  • 3:26 - 3:29
    And then even across these spheres,
  • 3:29 - 3:32
    there's the question of whether someone
    can use their language
  • 3:32 - 3:36
    for the purpose of education or business,
  • 3:36 - 3:38
    or in technology?
  • 3:39 - 3:42
    So, to better explain
  • 3:43 - 3:45
    what I'm talking about here,
  • 3:45 - 3:46
    I would like to use an anecdote.
  • 3:48 - 3:50
    Let's say that you are about to go
  • 3:50 - 3:52
    on your dream vacation to India,
  • 3:53 - 3:56
    and you have an eight-hour
    layover in Istanbul.
  • 3:57 - 4:01
    Now, you weren't necessarily
    planning on visiting Turkey,
  • 4:01 - 4:04
    but with your layover
    and with a Turkish friend
  • 4:04 - 4:06
    telling you about an amazing restaurant
  • 4:06 - 4:07
    that's not too far from the airport,
  • 4:08 - 4:11
    you say, "Hey, you know,
    maybe I'll stop by during my layover."
  • 4:11 - 4:13
    So, you exit the airport,
  • 4:14 - 4:15
    you get to your restaurant,
  • 4:15 - 4:17
    and they hand you a menu,
  • 4:17 - 4:19
    and the menu is entirely in Turkish.
  • 4:20 - 4:23
    Now, let's say,
    for the point of this exercise,
  • 4:23 - 4:24
    that you don't speak Turkish.
  • 4:25 - 4:27
    What do you do?
  • 4:28 - 4:30
    Well, best-case scenario,
  • 4:30 - 4:32
    you find someone perhaps
    who can speak your native language,
  • 4:32 - 4:34
    German, English, etc.
  • 4:36 - 4:38
    But let's say it's not your lucky day
  • 4:38 - 4:41
    and nobody in the restaurant can speak
    any German or any English.
  • 4:42 - 4:43
    So what do you do?
  • 4:43 - 4:46
    Well, if you are like me,
    and I imagine most of you,
  • 4:46 - 4:48
    you'd probably turn
    to a technological solution,
  • 4:50 - 4:52
    machine translation
    or a digital dictionary,
  • 4:53 - 4:54
    look up each word individually,
  • 4:54 - 4:58
    and eventually order yourself
    a delicious Turkish meal.
  • 5:00 - 5:03
    Now, let's imagine this scenario instead,
  • 5:04 - 5:06
    in which you are the native speaker
    of a minority language.
  • 5:07 - 5:09
    Let's say, Lower Sorbian.
  • 5:09 - 5:11
    Lower Sorbian is an endangered language
  • 5:11 - 5:12
    spoken here in Germany,
  • 5:12 - 5:17
    about 130 kilometers
    to the southeast of here,
  • 5:18 - 5:21
    that's spoken only by
    a few thousand people, mostly elderly.
  • 5:23 - 5:25
    Now, let's say your mother tongue
    is Lower Sorbian.
  • 5:25 - 5:27
    You end up in the restaurant.
  • 5:27 - 5:28
    Now, of course, the odds
    of finding someone
  • 5:28 - 5:31
    who speaks your native language
    in the restaurant are extraordinarily low.
  • 5:32 - 5:36
    But, again, you can just go
    to a technological solution.
  • 5:37 - 5:39
    However, for your native language,
  • 5:39 - 5:42
    these technological solutions don't exist.
  • 5:42 - 5:45
    You would have to rely on
    German or English
  • 5:45 - 5:47
    as your pivot language into Turkish.
  • 5:49 - 5:52
    Now, of course, you still end up
    getting your delicious Turkish meal,
  • 5:52 - 5:55
    but you begin to think about
    how difficult this would have been
  • 5:55 - 5:57
    if you were your grandfather,
    who spoke no German at all.
  • 5:58 - 6:00
    Now, this is just a small incident,
  • 6:00 - 6:05
    but it's going to place a stone
    on the right side of that scale,
  • 6:05 - 6:07
    and make you think that perhaps,
  • 6:07 - 6:10
    when you have children,
    or when you have another child,
  • 6:11 - 6:15
    the burden that you went through with this
  • 6:15 - 6:17
    may mean it's not worth it
    to keep your language.
  • 6:19 - 6:21
    And imagine if this was a scenario
  • 6:21 - 6:26
    that was of significantly more importance,
  • 6:26 - 6:28
    such as being in a hospital.
  • 6:31 - 6:36
    Now, this is the point
    at which we can help--
  • 6:37 - 6:40
    and by we, I mean you and me
    in this room.
  • 6:41 - 6:43
    We have the tools
    to help with this.
  • 6:45 - 6:47
    If technological tools
    are available for people
  • 6:47 - 6:49
    who speak minority
    and underserved languages,
  • 6:51 - 6:54
    it puts a little finger
    on the left side of the scale.
  • 6:54 - 6:56
    Someone doesn't necessarily have to think
  • 6:56 - 6:58
    that they have to rely on
    the majority language
  • 6:58 - 6:59
    in order to interact
    with the outside world,
  • 7:00 - 7:05
    because it opens the social spheres
  • 7:05 - 7:06
    a little bit more.
  • 7:08 - 7:10
    So, of course, the ideal solution
  • 7:10 - 7:13
    is that we have machine translation
    in every language in the world.
  • 7:13 - 7:17
    But, unfortunately,
    that's just not feasible.
  • 7:17 - 7:20
    Machine translation
    requires large corpora of text,
  • 7:20 - 7:21
    and for many of these languages
  • 7:21 - 7:23
    that are endangered or underserved,
  • 7:23 - 7:25
    such data is simply not available.
  • 7:26 - 7:28
    Some of them aren't even commonly written
  • 7:29 - 7:33
    and thus getting enough data
    to make a machine translation engine
  • 7:33 - 7:34
    is unlikely.
  • 7:34 - 7:38
    But what is available is lexical data.
  • 7:40 - 7:43
    Through the work of many linguists
  • 7:43 - 7:45
    over the past few hundred years,
  • 7:48 - 7:50
    dictionaries and grammars
    have been produced
  • 7:50 - 7:52
    for most of the world's languages.
  • 7:54 - 7:57
    But, unfortunately, most of these works
  • 7:57 - 8:01
    are not accessible
    or available to the world,
  • 8:01 - 8:04
    let alone to speakers
    of these minority languages.
  • 8:05 - 8:06
    And it's not an intentional process,
  • 8:06 - 8:08
    a lot of times it's simply because
  • 8:08 - 8:11
    the initial print run
    of these dictionaries was small,
  • 8:11 - 8:13
    and the only copies
  • 8:13 - 8:16
    are moldering away
    in a university library somewhere.
  • 8:18 - 8:21
    But we have the ability to take that data
  • 8:21 - 8:23
    and make it accessible to the world.
  • 8:24 - 8:28
    The Wikimedia Foundation
    is one of the best organizations,
  • 8:28 - 8:31
    I would say the best
    organization in the world,
  • 8:31 - 8:33
    for making data available
  • 8:33 - 8:37
    to the vast majority
    of the population of this planet.
  • 8:39 - 8:40
    So let's work on that.
  • 8:41 - 8:43
    So to explain a little bit
  • 8:43 - 8:45
    about what we've been doing
    in this regard,
  • 8:45 - 8:48
    I'd like to introduce
    my organization, PanLex,
  • 8:49 - 8:52
    which is an organization
    that is attempting
  • 8:52 - 8:54
    to collect lexical data for this purpose.
  • 8:55 - 8:57
    We got started about 12 years ago
  • 8:57 - 9:00
    at the University of Washington,
    as a research project.
  • 9:00 - 9:01
    The idea behind it
  • 9:01 - 9:04
    was to show that inferred translations
  • 9:04 - 9:07
    could create an effective
    translation device,
  • 9:07 - 9:09
    essentially a lexical translation device.
  • 9:09 - 9:12
    This is an example
    from PanLex data itself.
  • 9:13 - 9:14
    This is showing how to translate
  • 9:14 - 9:18
    the word "ev" in Turkish,
    which means house,
  • 9:18 - 9:20
    to Lower Sorbian,
  • 9:20 - 9:21
    the language I was referring to earlier.
  • 9:21 - 9:23
    So you're unlikely to find
  • 9:24 - 9:26
    Turkish to Lower Sorbian dictionaries,
  • 9:26 - 9:28
    but by passing it through
  • 9:28 - 9:30
    many, many different
    intermediate languages,
  • 9:30 - 9:33
    you can create effective translations.
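
To make the mechanism concrete, here is a toy sketch of the inference just described: candidate translations are found by hopping through every available intermediate language, and each pivot path counts as one vote for the candidate it reaches. The mini-dictionaries and the one-vote-per-path scoring below are illustrative assumptions, not PanLex's actual data or algorithm.

```python
# Toy sketch of inferred (two-hop) lexical translation. Hypothetical
# mini-dictionaries; candidates are scored by how many pivot paths reach them.
from collections import Counter

# (source_lang, target_lang) -> {word: [translations]}
DICTS = {
    ("tur", "eng"): {"ev": ["house", "home"]},
    ("tur", "deu"): {"ev": ["Haus"]},
    ("tur", "pol"): {"ev": ["dom"]},
    ("eng", "dsb"): {"house": ["dom"]},
    ("deu", "dsb"): {"Haus": ["dom"]},
    ("pol", "dsb"): {"dom": ["dom"]},
}

def infer_translations(word, src, dst):
    """Translate src -> dst through every available pivot language."""
    scores = Counter()
    for (a, pivot_lang), entries in DICTS.items():
        if a != src:
            continue
        for pivot_word in entries.get(word, []):
            bridge = DICTS.get((pivot_lang, dst), {})
            for candidate in bridge.get(pivot_word, []):
                scores[candidate] += 1  # one vote per pivot path
    return scores.most_common()

print(infer_translations("ev", "tur", "dsb"))  # [('dom', 3)]
```

Candidates reached through more independent pivots rank higher, which is what makes an inferred Turkish-to-Lower-Sorbian pairing usable even though no direct dictionary exists.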
  • 9:34 - 9:37
    So, once this was shown
    in the research project,
  • 9:37 - 9:40
    the founder of PanLex,
    Dr. Jonathan Pool,
  • 9:41 - 9:44
    decided, "Well, you know,
    why not actually just do this?"
  • 9:44 - 9:45
    So he started a non-profit
  • 9:45 - 9:49
    to collect as much lexical data
    as possible and make it accessible.
  • 9:49 - 9:51
    That's what we've been doing
    for the past 12 years.
  • 9:51 - 9:55
    In that time, we've collected
    thousands and thousands of dictionaries,
  • 9:55 - 9:56
    and extracted lexical data out of them
  • 9:56 - 10:01
    and compiled a database that allows
    inferred lexical translation
  • 10:01 - 10:04
    across any of--
  • 10:04 - 10:06
    Our current count is around 5,500
  • 10:06 - 10:08
    of the 7,500 languages in the world.
  • 10:09 - 10:11
    And, of course,
  • 10:11 - 10:12
    we're constantly trying to expand that
  • 10:12 - 10:15
    and expand the data
    on each individual language.
  • 10:17 - 10:21
    So, the next question is,
  • 10:22 - 10:26
    what can we do to work together on this?
  • 10:27 - 10:29
    We, at PanLex, have been
    extremely excited to watch
  • 10:29 - 10:31
    the development of lexical data
  • 10:31 - 10:34
    that Wikidata has been working on lately.
  • 10:35 - 10:38
    It's very fascinating to see organizations
  • 10:38 - 10:39
    that are working in a very similar sphere,
  • 10:39 - 10:41
    but on different aspects.
  • 10:42 - 10:44
    And we are extremely excited to see
  • 10:45 - 10:46
    the results of this from Wikidata.
  • 10:46 - 10:51
    And also we are looking forward
    to collaborating with Wikidata.
  • 10:54 - 10:56
    I think that the special skills
  • 10:56 - 10:58
    that we've developed
    over the past 12 years,
  • 10:58 - 11:02
    not just in collecting lexical data,
    but also in database design,
  • 11:02 - 11:04
    could be extremely useful for Wikidata.
  • 11:04 - 11:07
    And on the other side, I think that--
  • 11:08 - 11:11
    I especially am excited about Wikidata's
  • 11:12 - 11:15
    ability to do crowdsourcing of data.
  • 11:15 - 11:18
    PanLex, currently,
    our sources are entirely
  • 11:18 - 11:21
    printed lexical sources
    or other types of lexical sources,
  • 11:21 - 11:23
    but we don't do any crowdsourcing.
  • 11:23 - 11:25
    We simply don't have
    the infrastructure for it available
  • 11:25 - 11:27
    and of course, the Wikimedia Foundation
  • 11:27 - 11:29
    is the world expert in crowdsourcing.
  • 11:32 - 11:34
    I'm really looking
    forward to seeing exactly
  • 11:34 - 11:36
    how we can apply these skills together.
  • 11:39 - 11:42
    But, overall, I think the main thing
    to remember here
  • 11:42 - 11:43
    is that when we're
    working on these things,
  • 11:43 - 11:45
    it's minute detail.
  • 11:45 - 11:48
    We're sitting around
    looking at grammatical forms,
  • 11:48 - 11:52
    or paging our way through
    dictionaries, ancient dictionaries,
  • 11:52 - 11:54
    or sometimes
    recently published dictionaries
  • 11:54 - 11:57
    and getting into written forms of words,
  • 11:57 - 12:00
    and it feels very close up.
  • 12:00 - 12:02
    But, occasionally, we need to remember
  • 12:02 - 12:03
    to take a step back
  • 12:03 - 12:05
    and realize that, even though what we're doing
  • 12:06 - 12:09
    can feel mundane at times,
  • 12:10 - 12:12
    the work we're doing
    is extremely important.
  • 12:13 - 12:16
    This is, in my opinion,
    the absolute best way
  • 12:16 - 12:19
    that we can support endangered languages
  • 12:19 - 12:21
    and make sure that the linguistic
    diversity of the planet
  • 12:21 - 12:26
    is preserved up to the end
    of this century or longer.
  • 12:26 - 12:30
    It's entirely possible that the work
    that we're doing today
  • 12:30 - 12:33
    may result in languages
  • 12:33 - 12:35
    being preserved and passed down,
  • 12:35 - 12:37
    and not going extinct.
  • 12:39 - 12:41
    So just remember
  • 12:41 - 12:43
    that even if you're sitting
    around on your computer
  • 12:43 - 12:44
    editing an individual entry
  • 12:44 - 12:50
    and adding the dative form
    of a small minority language
  • 12:50 - 12:52
    for every single noun,
  • 12:52 - 12:55
    the little thing
    that you're doing right now,
  • 12:55 - 12:58
    might actually be partially responsible
  • 12:58 - 12:59
    for making sure that language survives,
  • 12:59 - 13:01
    until the end of the century or longer.
  • 13:03 - 13:04
    Thank you very much,
  • 13:04 - 13:06
    and I'd like to open
    the floor to questions.
  • 13:06 - 13:08
    (applause)
  • 13:24 - 13:25
    (woman 1) Thank you.
  • 13:25 - 13:27
    - Thank you for your talk.
    - Thank you.
  • 13:27 - 13:29
    (woman 1) I just have a question
    about dictionaries.
  • 13:29 - 13:31
    You said that you work
    with printed dictionaries?
  • 13:31 - 13:32
    - Yes.
    - (woman 1) So my question
  • 13:32 - 13:35
    is what do you take
    from those dictionaries
  • 13:35 - 13:38
    and if there's any copyright thing
    you have to deal with?
  • 13:38 - 13:41
    I anticipated this to be
    the first question that I would get.
  • 13:41 - 13:43
    (laughter)
  • 13:43 - 13:46
    So, first off, for PanLex,
  • 13:46 - 13:50
    according to the legal
    resources that we have consulted,
  • 13:53 - 13:57
    whereas the arrangement and organization
    of a dictionary is copyrightable,
  • 13:57 - 14:03
    the translations themselves
    are not considered copyrightable.
  • 14:04 - 14:06
    A good example:
  • 14:06 - 14:11
    a phone book is considered,
    at least according to US law,
  • 14:11 - 14:12
    copyrightable.
  • 14:12 - 14:17
    But saying that person X's
    phone number is digits D
  • 14:17 - 14:18
    is not copyrightable.
  • 14:22 - 14:23
    So like I said,
  • 14:23 - 14:25
    according to our legal scholars,
  • 14:25 - 14:27
    this is how we can deal with this.
  • 14:27 - 14:31
    But even if that's not
    a solid enough legal argument,
  • 14:31 - 14:32
    one important thing to remember
  • 14:32 - 14:38
    is that the vast majority
    of this lexical data
  • 14:39 - 14:41
    is actually out of copyright.
  • 14:41 - 14:43
    A significant number
    of these are out of copyright
  • 14:43 - 14:44
    and thus can be used without--
  • 14:44 - 14:47
    And the other thing
    is that oftentimes, for example,
  • 14:47 - 14:50
    if we're working with
    a recently made print dictionary,
  • 14:50 - 14:52
    rather than trying to scan it and OCR it,
  • 14:52 - 14:53
    we just email the person who made it.
  • 14:53 - 14:58
    And it turns out that
    most linguists are really excited
  • 14:58 - 15:00
    that their data can be made accessible.
  • 15:00 - 15:01
    And so they're like, "Sure, please,
  • 15:01 - 15:03
    just put it all in there
    and make it accessible."
  • 15:06 - 15:08
    So like I said, at least
    according to our legal opinions,
  • 15:08 - 15:09
    we have the ability,
  • 15:09 - 15:11
    but even if you don't want
    to go with that,
  • 15:11 - 15:16
    it's very easy to get
    the data publicly accessible.
  • 15:26 - 15:28
    - (man 1) Thank you. Hi.
    - Hi.
  • 15:28 - 15:30
    (man 1) Can you say a little more
  • 15:30 - 15:35
    about how the person who speaks
    Lower Sorbian is accessing the data?
  • 15:35 - 15:38
    Like specifically how
    that information is getting to them
  • 15:38 - 15:41
    and how that might help to convince them
  • 15:41 - 15:43
    to either try out the--
  • 15:43 - 15:45
    Great question and this is actually
  • 15:45 - 15:46
    one that I think about a lot as well,
  • 15:46 - 15:50
    because I think that
    when we talk about data access,
  • 15:50 - 15:53
    there are actually
    multiple steps to this.
  • 15:53 - 15:56
    One is, of course, data preservation:
    making sure the data doesn't go away.
  • 15:56 - 15:59
    Second is making sure it's interoperable
  • 15:59 - 16:02
    and can be used.
  • 16:02 - 16:05
    And third is making sure
    that it's available.
  • 16:06 - 16:07
    So in PanLex's case,
  • 16:07 - 16:10
    we have an API that can be used,
  • 16:10 - 16:12
    but, obviously,
    that can't be used by an end user.
  • 16:12 - 16:15
    But we've also developed interfaces.
  • 16:15 - 16:20
    And so, for example,
    if you go to translate.panlex.org,
  • 16:20 - 16:23
    you can do translations on our database.
  • 16:23 - 16:26
    If you want to mess around
    with the API, just go to dev.panlex.org,
  • 16:26 - 16:29
    and you can find a bunch of stuff
    on the API, or just api.panlex.org.
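
For anyone who wants to try this from code, here is a hedged sketch of a query from Python. The endpoint, parameter names, and response shape are assumptions made for illustration; dev.panlex.org has the authoritative API reference.

```python
# Hedged sketch of a PanLex API lookup; the endpoint and parameter names are
# assumed for illustration, not verified against dev.panlex.org.
import requests

API = "https://api.panlex.org/v2/expr"  # assumed endpoint

def translate(txt, from_uid, to_uid):
    """Request expressions in `to_uid` that translate `txt` from `from_uid`.
    UIDs like "tur-000" pair an ISO 639-3 code with a variant number."""
    body = {
        "uid": to_uid,          # language variant of the results (assumed)
        "trans_uid": from_uid,  # language variant of the query (assumed)
        "trans_txt": txt,       # the word to translate (assumed)
    }
    resp = requests.post(API, json=body, timeout=10)
    resp.raise_for_status()
    return [row["txt"] for row in resp.json().get("result", [])]

print(translate("ev", "tur-000", "dsb-000"))  # Turkish -> Lower Sorbian
```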
  • 16:31 - 16:33
    But there's another step too,
  • 16:33 - 16:37
    which is that even if you make
    all of your data completely accessible
  • 16:37 - 16:41
    with tools that are super useful
    to be able to access it,
  • 16:41 - 16:43
    if you don't actually promote the tools,
  • 16:43 - 16:45
    then people won't actually
    be able to use it.
  • 16:45 - 16:47
    And this is honestly kind of a...
  • 16:49 - 16:51
    the thing that isn't talked about enough,
  • 16:51 - 16:53
    and I don't have a good answer for it.
  • 16:53 - 16:55
    How do we make sure that--
  • 16:55 - 16:57
    For example, I only fairly recently,
  • 16:57 - 17:00
    only a few years ago
    got acquainted with Wikidata,
  • 17:00 - 17:02
    and it's exactly the kind
    of thing that I'm interested in.
  • 17:03 - 17:07
    So, how do we promote
    ourselves to others?
  • 17:07 - 17:09
    I'm leaving that as an open question.
  • 17:09 - 17:11
    Like I said, I don't have
    a good answer for this.
  • 17:11 - 17:13
    But, of course, in order to do that,
  • 17:13 - 17:15
    we still need to accomplish
    the first few steps.
  • 17:22 - 17:25
    (man 2) If we want to have
    machine translation,
  • 17:25 - 17:28
    don't we need a translation memory?
  • 17:28 - 17:31
    I'm not sure that the individual words
  • 17:31 - 17:33
    that we put into Wikidata,
  • 17:33 - 17:37
    these short phrases
    that we put into Wikidata,
  • 17:37 - 17:41
    either as ordinary Wikidata items
    or as Wikidata lexemes,
  • 17:41 - 17:44
    are sufficient to do a proper translation.
  • 17:44 - 17:47
    We need to have full sentences,
    for example, for--
  • 17:47 - 17:48
    (Benjamin) Yeah, absolutely.
  • 17:49 - 17:51
    (man 2) And where do we get
    this data structure?
  • 17:51 - 17:55
    I'm not sure that, currently,
  • 17:55 - 18:00
    Wikidata is able to handle very well
  • 18:00 - 18:03
    the issue of a translation memory,
  • 18:04 - 18:06
    or translatewiki.net,
  • 18:06 - 18:09
    for getting into that gap of...
  • 18:12 - 18:15
    Should we do anything
    in that respect, or should we--
  • 18:15 - 18:17
    Yeah, and I really
    appreciate your question.
  • 18:17 - 18:19
    I touched on this a little bit earlier,
  • 18:19 - 18:20
    but I'd love to reiterate it.
  • 18:21 - 18:25
    This is precisely the reason
    that PanLex works in lexical data
  • 18:25 - 18:27
    and why I'm excited about lexical data,
  • 18:27 - 18:30
    as opposed to--
    not as opposed to, but in addition
  • 18:30 - 18:35
    to machine translation engines
    and machine translation in general.
  • 18:36 - 18:39
    As you said, machine translation
    requires a specific kind of data,
  • 18:40 - 18:43
    and that data is not available
    for most of the world's languages.
  • 18:43 - 18:45
    For the vast majority
    of the world's languages,
  • 18:45 - 18:46
    that simply is not available.
  • 18:47 - 18:48
    But that doesn't mean
    we should just give up.
  • 18:48 - 18:50
    Like why?
  • 18:51 - 18:54
    If I needed to translate
    my Turkish restaurant menu,
  • 18:55 - 18:59
    then lexical translation will likely
    be an exceptionally good tool for that.
  • 18:59 - 19:02
    Now, I'm not saying
    that you can use lexical translation
  • 19:02 - 19:05
    to do perfect paragraph
    to paragraph translation.
  • 19:05 - 19:07
    When I say lexical translation,
    I mean word to word
  • 19:07 - 19:10
    and word to word translation
    can be extremely useful,
  • 19:12 - 19:15
    It's funny to think about it,
    but we didn't really have access
  • 19:15 - 19:17
    to really good machine translation.
  • 19:17 - 19:20
    Nobody had
    access to that until fairly recently.
  • 19:20 - 19:24
    And we still got by with dictionaries,
  • 19:24 - 19:28
    and they're an incredibly good resource.
  • 19:28 - 19:31
    And the data is available,
    so why not make it available
  • 19:31 - 19:34
    to the world at large
    and to the speakers of these languages?
  • 19:36 - 19:39
    (woman 2) Hi, what mechanisms
    do you have in place
  • 19:39 - 19:41
    when the community itself--I'm over here.
  • 19:41 - 19:43
    - Where are you? Okay, right.
    - (woman 2) Yeah, sorry. (laughs)
  • 19:43 - 19:45
    ...when the community itself
  • 19:45 - 19:47
    doesn't want part of their data in PanLex?
  • 19:47 - 19:49
    Great question.
  • 19:49 - 19:52
    So the way that we work with that
  • 19:52 - 19:56
    is that if a dictionary is published
    and made publicly available,
  • 19:57 - 19:58
    that's a good indication.
  • 19:58 - 20:02
    Like you could buy it in a store
    or at a university library,
  • 20:02 - 20:05
    or a public library anyone can access.
  • 20:05 - 20:08
    That's a good indication
    that that decision has been made.
  • 20:08 - 20:12
    (woman 2) [inaudible]
  • 20:16 - 20:18
    (man 3) Please, [inaudible],
    could you speak in the microphone?
  • 20:19 - 20:20
    Can you say it again?
  • 20:20 - 20:23
    (woman 2) Linguists don't always have
    the permission of the community.
  • 20:23 - 20:24
    When they publish things,
  • 20:24 - 20:28
    they oftentimes do so
    without the consent of the community.
  • 20:28 - 20:30
    And that's absolutely true.
  • 20:30 - 20:33
    I would say that is a--
  • 20:33 - 20:34
    That does happen.
  • 20:34 - 20:37
    I would say it's generally
    a small minority of cases,
  • 20:37 - 20:41
    mostly confined
    to generally North America,
  • 20:41 - 20:43
    although sometimes
    South American languages as well.
  • 20:45 - 20:46
    It's something we have
    to take into account.
  • 20:46 - 20:49
    If we were to receive word, for example,
  • 20:49 - 20:52
    that the data that is in PanLex
  • 20:52 - 20:56
    should not be accessed
    by the greater world,
  • 20:56 - 20:58
    then, of course, we would remove it.
  • 20:58 - 20:59
    (woman 2) Good, good.
  • 21:01 - 21:02
    That doesn't mean, of course,
  • 21:02 - 21:04
    that we'll listen
    to copyright rules necessarily
  • 21:04 - 21:07
    but we will listen
    to traditional communities,
  • 21:07 - 21:08
    and that's the major difference.
  • 21:08 - 21:10
    (woman 2) Yeah,
    that's what I'm referring to.
  • 21:15 - 21:17
    It brings up a really interesting point,
  • 21:17 - 21:18
    which is that
  • 21:19 - 21:22
    sometimes it's a really big question
    of who speaks for a language.
  • 21:23 - 21:28
    I had some experience actually
    visiting the American Southwest
  • 21:28 - 21:30
    and working with some groups,
  • 21:30 - 21:32
    who work on the indigenous
    Pueblo languages out there.
  • 21:36 - 21:38
    So there are approximately
  • 21:38 - 21:40
    six Pueblo languages,
    depending on how you slice it,
  • 21:40 - 21:42
    spoken in that area.
  • 21:42 - 21:44
    But they are divided
    amongst 18 different Pueblos
  • 21:44 - 21:47
    and each one has their own
    tribal government,
  • 21:47 - 21:50
    and each government
    may have a different opinion
  • 21:50 - 21:54
    on whether their language
    should be accessible to outsiders or not.
  • 21:57 - 21:58
    Like, for example, Zuni Pueblo,
  • 21:58 - 22:01
    it's a single Pueblo
    that speaks the Zuni language.
  • 22:03 - 22:05
    And they're really big
    on their language going everywhere,
  • 22:05 - 22:08
    they put it on the street signs
    and everything, it's great.
  • 22:08 - 22:11
    But for some of the other languages,
  • 22:11 - 22:13
    you might have one group that says,
  • 22:13 - 22:16
    "Yeah, we don't want our language
    being accessed by outsiders."
  • 22:16 - 22:19
    But then you have the neighboring Pueblo
    who speaks the same language say,
  • 22:19 - 22:22
    "We really want our language
    accessible to outsiders
  • 22:22 - 22:24
    in using these technological tools,
  • 22:24 - 22:27
    because we want our language
    to be able to continue on."
  • 22:27 - 22:29
    And it raises a really
    interesting ethical question.
  • 22:29 - 22:32
    Because if you default by saying,
  • 22:32 - 22:35
    "Fine, I'm cutting it off because
    this group said we should cut it off"--
  • 22:35 - 22:37
    aren't you also doing a disservice
    to the second group,
  • 22:37 - 22:39
    because they actively
    want you to roll out these things?
  • 22:39 - 22:43
    So I don't think this is a question
    that has an easy answer.
  • 22:43 - 22:45
    But I would say
    at least in terms of PanLex--
  • 22:45 - 22:49
    And for the record, we actually
    haven't encountered this yet,
  • 22:49 - 22:50
    that I'm aware of.
  • 22:51 - 22:53
    Now, that could be partially because...
  • 22:54 - 22:55
    Getting back to his question,
  • 22:56 - 22:58
    we may need to promote more. (chuckles)
  • 22:59 - 23:02
    But, in general, as far as I know,
  • 23:02 - 23:04
    we have not had this come up.
  • 23:04 - 23:07
    But our game plan for this
  • 23:07 - 23:11
    is if a community says they don't want
    their data in a database,
  • 23:11 - 23:12
    then we remove it.
  • 23:12 - 23:15
    (woman 2) Because we have come up
    with it in Wikidata and Wikipedia...
  • 23:15 - 23:16
    - You have?
    - (woman 2) ...in comments.
  • 23:16 - 23:17
    - Really?
    - (woman 2) It's been a problem.
  • 23:17 - 23:20
    Yeah, I can imagine especially in comments
    for photos or certain things.
  • 23:20 - 23:22
    (woman 2) Correct.
  • 23:27 - 23:33
    (man 4) Hi, I had a question about
    the crowdsourcing aspect of this.
  • 23:34 - 23:37
    As far as going in and asking a community
  • 23:37 - 23:40
    to annotate or add data for a dataset,
  • 23:40 - 23:44
    one of the things
    that's a little intimidating is like,
  • 23:45 - 23:49
    as an editor, I can only see
    what things are missing.
  • 23:49 - 23:53
    But if I'm going to spend time
    on things, having an idea
  • 23:54 - 23:57
    that there's a list of high-priority items,
  • 23:58 - 24:01
    that's, I guess,
    very motivating in this aspect.
  • 24:01 - 24:04
    And I was curious if you had a system
  • 24:04 - 24:08
    which is, essentially, like,
    we know the gaps in our own data,
  • 24:08 - 24:12
    we have linguistic evidence
    to know that these are the ones
  • 24:12 - 24:16
    that, if annotated,
    would be the high-impact drivers.
  • 24:16 - 24:17
    So I can imagine
  • 24:18 - 24:21
    having the lexeme
    for "house" be very impactful,
  • 24:21 - 24:25
    maybe not a lexeme
    for "data" or some other word like that.
  • 24:25 - 24:29
    But I was curious if you had that,
    and if it is something
  • 24:30 - 24:35
    that could be used
    to drive these community efforts.
  • 24:36 - 24:37
    Great question.
  • 24:37 - 24:41
    So one thing that Wikidata
    has a whole lot of--
  • 24:41 - 24:45
    sorry, excuse me, PanLex
    has a whole lot of are Swadesh lists.
  • 24:45 - 24:48
    We have apparently the largest collection
    of Swadesh lists in the world,
  • 24:48 - 24:49
    which is interesting.
  • 24:49 - 24:50
    If you don't know what a Swadesh list is,
  • 24:50 - 24:56
    it's essentially a regularized
    list of lexical items
  • 24:56 - 25:00
    that can be used
    for analysis of languages.
  • 25:00 - 25:03
    They contain really basic sets.
  • 25:03 - 25:05
    So there's a couple
    of different kinds of Swadesh lists.
  • 25:05 - 25:07
    But there are 100 or 207 items
  • 25:07 - 25:09
    and they might contain
  • 25:09 - 25:13
    words like "house" and "eye" and "skin"
  • 25:13 - 25:14
    and basically general words
  • 25:14 - 25:16
    that you should be able
    to find in any language.
  • 25:16 - 25:20
    So that's like a really
    good starting point
  • 25:20 - 25:23
    for having that kind of data available.
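
As a sketch of how a Swadesh-style list could drive crowdsourcing priorities, the snippet below surfaces, per language, the basic-vocabulary items still missing. The mini-list and coverage data are invented; this is a workflow idea suggested by the answer, not an existing PanLex or Wikidata feature.

```python
# Hypothetical: use a Swadesh-style list as a crowdsourcing priority queue.
SWADESH_MINI = ["I", "you", "water", "fire", "house", "eye", "skin"]

coverage = {  # invented per-language coverage of the list
    "dsb": {"I", "you", "water", "house"},
    "zun": {"water", "fire", "eye", "skin"},
}

def missing_items(lang):
    """Return the high-priority items not yet attested for `lang`."""
    return [w for w in SWADESH_MINI if w not in coverage.get(lang, set())]

for lang in coverage:
    print(lang, "->", missing_items(lang))
# dsb -> ['fire', 'eye', 'skin']
# zun -> ['I', 'you', 'house']
```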
  • 25:29 - 25:31
    Now, as I mentioned before,
  • 25:31 - 25:34
    crowdsourcing is something
    that we don't do yet
  • 25:34 - 25:36
    and we're actually
    really excited to be able to do.
  • 25:36 - 25:38
    One of the things I'm really excited
  • 25:38 - 25:39
    to talk to people
    at this conference about,
  • 25:39 - 25:43
    is how crowdsourcing can be used
  • 25:43 - 25:46
    and the logistics behind it,
  • 25:46 - 25:49
    and these are the kind
    of questions that can come up.
  • 25:51 - 25:53
    So I guess the answer I can say to you
  • 25:53 - 25:55
    is that we do have a priority list--
  • 25:55 - 25:58
    Actually, one thing I can say
    is we definitely do have a priority list
  • 25:58 - 26:00
    when it comes to which languages
    we are seeking out.
  • 26:00 - 26:02
    So the way we do this
    is that we look for languages
  • 26:02 - 26:05
    that are not currently served
    by technological solutions,
  • 26:05 - 26:07
    which are oftentimes minority languages,
  • 26:07 - 26:09
    or usually minority languages,
  • 26:09 - 26:12
    and then prioritize those.
  • 26:14 - 26:17
    But in terms of individual lexical items
  • 26:17 - 26:20
    the general way we get new data
  • 26:20 - 26:23
    is essentially by ingesting
    an entire dictionary's worth.
  • 26:23 - 26:26
    We are relying on the dictionary's choice
  • 26:26 - 26:29
    of lexical items,
    rather than necessarily saying,
  • 26:29 - 26:32
    we're really looking for the word
    for "house" in every language.
  • 26:32 - 26:35
    But when it comes to data crowdsourcing,
    we will need something like that.
  • 26:35 - 26:38
    So this is an opportunity
    for research and growth.
  • 26:40 - 26:43
    (man 5) Hi, I'm Victor,
    and this is awesome.
  • 26:45 - 26:47
    As you have slides here,
  • 26:47 - 26:49
    can you talk a little bit
    about the technical status
  • 26:49 - 26:51
    of whether you currently have data
  • 26:51 - 26:57
    or information flow
    from and to Wikidata and PanLex?
  • 26:57 - 27:00
    Is that currently implemented already
  • 27:00 - 27:04
    and how do you deal with
  • 27:04 - 27:07
    back and forth or even
    feedback loop information
  • 27:07 - 27:10
    between PanLex and Wikidata?
  • 27:10 - 27:14
    So we actually don't have any formal
    connections to Wikidata at this point,
  • 27:14 - 27:15
    and this is something that I'm, again,
  • 27:15 - 27:18
    I'm really excited to talk
    to people in this conference about.
  • 27:18 - 27:21
    We've had some interaction
    with Wiktionary,
  • 27:22 - 27:25
    but Wikidata is actually
    a better fit, honestly,
  • 27:25 - 27:27
    for what we are looking for.
  • 27:27 - 27:29
    Having the lexical stuff directly
  • 27:29 - 27:32
    means that we have to do a lot less
    data analysis and extraction.
  • 27:33 - 27:37
    And so the answer is,
    we don't yet, but we want to.
  • 27:37 - 27:40
    (man 5) And if not,
    what are the obstacles?
  • 27:40 - 27:44
    And as we can see, Wikidata
    already supports several languages,
  • 27:44 - 27:47
    but when I look up translate.panlex.org,
  • 27:47 - 27:49
    you apparently support
    many, many variants,
  • 27:49 - 27:51
    much more than Wikidata.
  • 27:51 - 27:53
    How do you see the gap
  • 27:53 - 27:57
    between translation
    or lexical translation first,
  • 27:57 - 28:00
    as an application, versus an effort
  • 28:00 - 28:04
    at trying to map a knowledge structure?
  • 28:04 - 28:06
    Mapping knowledge
    will actually be very interesting.
  • 28:06 - 28:07
    We've had some
    very interesting discussions
  • 28:07 - 28:12
    about the way that Wikidata
    organizes their lexical data,
  • 28:12 - 28:14
    your lexical data,
  • 28:14 - 28:16
    and how we organize our lexical data.
  • 28:16 - 28:21
    And there are subtle differences
    that would require a mapping strategy,
  • 28:21 - 28:25
    some of which will not
    necessarily be automatic,
  • 28:25 - 28:27
    but we might be able to develop
    techniques to be able to do this.
  • 28:27 - 28:31
    You gave the example of language variants.
  • 28:31 - 28:34
    We tend to be very "splittery"
    when it comes to language variants.
  • 28:34 - 28:36
    In other words,
    if we get a source that says
  • 28:36 - 28:39
    that this is the dialect spoken
  • 28:39 - 28:42
    on the left side of the river
    in Papua New Guinea, for this language,
  • 28:42 - 28:43
    and we get another source that says
  • 28:43 - 28:45
    this is the dialect spoken
    on the right side of the river,
  • 28:45 - 28:47
    then we consider them
    essentially separate languages.
  • 28:47 - 28:51
    And so we do this in order to basically
    preserve the most data that we can.
  • 28:52 - 28:54
    Being able to map that
    to how Wikidata does it--
  • 28:54 - 28:57
    Actually, what I would love
    is to have conversations
  • 28:57 - 29:01
    about how languages
  • 29:01 - 29:06
    are designated on Wikidata.
  • 29:08 - 29:12
    Again, we go with the strategy
    of very much a "splittery" strategy.
  • 29:14 - 29:17
    We broadly rely on ISO 639-3 codes,
  • 29:18 - 29:20
    which are provided by the Ethnologue,
  • 29:20 - 29:24
    and then within each individual code,
    we allow multiple variants,
  • 29:24 - 29:29
    either for script variants
    or regional dialects or sociolects, etc.
  • 29:30 - 29:33
    Again, opportunity
    for discussion and work.
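
As a sketch of what this "splittery" designation can look like in practice: an identifier pairs an ISO 639-3 code with a variant number, so two dialects under the same code stay distinct. The field names and the made-up code "xyz" are illustrative, not PanLex's actual schema; only the code-plus-variant idea comes from the talk.

```python
# Illustrative model of "ISO 639-3 code + variant" language identification.
from dataclasses import dataclass

@dataclass(frozen=True)
class LanguageVariant:
    iso639_3: str    # e.g. "dsb" for Lower Sorbian; "xyz" below is made up
    variant: int     # 0 for the default variety, 1+ for others
    label: str = ""  # free-text note: script, regional dialect, sociolect, ...

    @property
    def uid(self) -> str:
        return f"{self.iso639_3}-{self.variant:03d}"  # e.g. "dsb-000"

left_bank = LanguageVariant("xyz", 1, "dialect, left bank of the river")
right_bank = LanguageVariant("xyz", 2, "dialect, right bank of the river")
assert left_bank.uid != right_bank.uid  # kept separate to preserve data
```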
  • 29:36 - 29:39
    (woman 3) Hi, I would like to know
    if you have an OCR pipeline
  • 29:39 - 29:45
    and especially because
    we've been trying to do OCR on Maya,
  • 29:45 - 29:48
    and we don't get any results.
  • 29:48 - 29:50
    It doesn't understand anything--
  • 29:50 - 29:53
    - Oh, yeah! (laughs)
    - (woman 3) And... yeah.
  • 29:53 - 29:56
    So if your pipelines are available.
  • 29:56 - 30:00
    And the other one is just
    on the overlap of ISO codes,
  • 30:00 - 30:02
    like sometimes they say,
  • 30:02 - 30:04
    "Oh, this is a language,
    and this is another language,"
  • 30:04 - 30:07
    but there are sources
    that say other stuff,
  • 30:07 - 30:10
    as you were mentioning,
    but they tend to overlap.
  • 30:10 - 30:13
    So how do you go on...? Yeah.
  • 30:13 - 30:15
    Yeah, that's absolutely
    an amazing question.
  • 30:15 - 30:17
    I really like it.
  • 30:17 - 30:20
    So we don't have a formalized
    OCR pipeline per se;
  • 30:20 - 30:24
    we do it on a sort of
    source-by-source basis.
  • 30:24 - 30:26
    One of the reasons why
    is because we oftentimes have sources
  • 30:26 - 30:28
    that don't necessarily need to be OCR'd,
  • 30:28 - 30:30
    that are available
    for some of these languages,
  • 30:30 - 30:33
    and we concentrate on those because
    they require the least amount of work.
  • 30:33 - 30:35
    But, obviously,
    if we really want to dive deep
  • 30:35 - 30:37
    into some of our sources
    that are in our backlog,
  • 30:37 - 30:41
    we're going to need to essentially
    develop strong OCR pipelines.
  • 30:41 - 30:44
    But there's another aspect too,
    which is that, as you mentioned...
  • 30:44 - 30:49
    like the people who designed OCR engines
  • 30:49 - 30:53
    I think don't realize
    how much you can stress-test them.
  • 30:53 - 30:55
    Like, you know what's fun?--
  • 30:55 - 30:58
    trying to OCR
    a Russian-Tibetan dictionary.
  • 30:59 - 31:00
    It's really hard, it turns out...
  • 31:02 - 31:04
    We gave up, and we hired
    someone to just type it up,
  • 31:04 - 31:06
    which was totally doable.
  • 31:06 - 31:07
    And actually, it turns out
  • 31:07 - 31:10
    that this amazing Russian woman
    learned to read Tibetan
  • 31:10 - 31:13
    so she could type this up,
    which was super cool.
  • 31:15 - 31:18
    I think that if you're dealing
    with stuff in the Latin scripts,
  • 31:18 - 31:23
    then I think that OCR solutions
    can be developed, that are more robust,
  • 31:23 - 31:25
    that deal with
    multilingual sources like this
  • 31:25 - 31:27
    and expect that you're going
    to get a random four in there,
  • 31:27 - 31:28
    if you're dealing with something like
  • 31:28 - 31:31
    16th-century Mayan sources,
    you know, with the digit four.
  • 31:32 - 31:38
    But there are some sources
  • 31:38 - 31:40
    that OCR is probably just
    never really going to catch up to,
  • 31:40 - 31:42
    or would require such an immense amount of work.
  • 31:43 - 31:47
    Actually, we put a little
    bit of this to use right now.
  • 31:47 - 31:49
    We have another project
    we're running at PanLex
  • 31:49 - 31:54
    to transcribe all of the traditional
    literature of Bali,
  • 31:54 - 31:58
    and we found that in handwritten
    Balinese manuscripts,
  • 31:58 - 32:00
    there's just no chance of OCR.
  • 32:00 - 32:02
    So we got a bunch
    of Balinese people to type them up,
  • 32:02 - 32:05
    and it's become a really cool
    cultural project within Bali,
  • 32:05 - 32:07
    and it's become news and stuff like that.
  • 32:07 - 32:09
    So I would say
  • 32:09 - 32:11
    that you don't necessarily
    need to rely on OCR,
  • 32:11 - 32:13
    but there is a lot out there.
  • 32:13 - 32:15
    So having good OCR solutions
    would be good.
  • 32:17 - 32:21
    Also, if anyone out here
    is into super multilingual OCR,
  • 32:21 - 32:23
    please come talk to me.
  • 32:30 - 32:31
    (man 6) Thank you for your presentation.
  • 32:32 - 32:35
    You talked about integration
  • 32:35 - 32:37
    between PanLex and Wikidata,
  • 32:37 - 32:39
    but you haven't gone into the specifics.
  • 32:39 - 32:43
    So I was checking your data license,
    and it is under CC0.
  • 32:43 - 32:44
    - Yes.
    - (man 6) That's really great.
  • 32:44 - 32:46
    So there are two possible ways
  • 32:46 - 32:49
    that either we can import the data
  • 32:49 - 32:53
    or we can continue something similar
    to the Freebase way,
  • 32:53 - 32:56
    where we had the complete
    database from the Freebase,
  • 32:56 - 32:59
    and we imported them, and we made a link,
  • 32:59 - 33:04
    an external identifier
    to the Freebase database.
  • 33:04 - 33:08
    So if you have something in mind,
    are you thinking similar?
  • 33:08 - 33:10
    Or you just want to make...
  • 33:15 - 33:19
    an independent database
    which can be linked to Wikidata?
  • 33:19 - 33:21
    Yeah, so this is a great question
  • 33:21 - 33:23
    and actually I feel
    like it's about one step ahead
  • 33:23 - 33:26
    of some of the stuff
    that I've already been thinking about,
  • 33:26 - 33:30
    partially because, like I said,
  • 33:30 - 33:32
    getting the two databases to work together
  • 33:32 - 33:34
    is a step in and of itself.
  • 33:34 - 33:35
    I think the first step that we can take
  • 33:35 - 33:38
    is literally just pooling
    our skills together.
  • 33:38 - 33:40
    We have a lot of experience
    dealing with stuff
  • 33:40 - 33:43
    like classifications of properties
    of individual lexemes
  • 33:43 - 33:45
    that I'd love to share.
  • 33:46 - 33:49
    But being able to link the databases
    themselves would be wonderful.
  • 33:49 - 33:51
    I'm 100% for that.
  • 33:51 - 33:54
    I think it would be a little bit easier
  • 33:54 - 33:56
    in the Wikidata-to-PanLex direction,
  • 33:56 - 33:59
    but maybe I'm just biased
    because I can see how that could work.
  • 34:02 - 34:06
    Yeah, essentially, as long
    as Wikidata is comfortable
  • 34:06 - 34:10
    with all the licensing stuff like that,
    or we work something out,
  • 34:10 - 34:12
    then I think that would be a great idea.
  • 34:13 - 34:16
    We'd just have to figure out ways
    of linking the data itself.
  • 34:16 - 34:22
    One thing I can imagine is, essentially,
    that I would love for edits to Wikidata
  • 34:23 - 34:26
    to immediately be propagated
    to the PanLex database,
  • 34:26 - 34:29
    without having to essentially
  • 34:29 - 34:31
    just reingest it every...
  • 34:31 - 34:36
    essentially making Wikidata
    a crowdsourceable interface to PanLex
  • 34:36 - 34:37
    would be really awesome.
  • 34:37 - 34:40
    And then being able to use
    PanLex in immediate translations,
  • 34:40 - 34:42
    to be able to do translations
    across Wikidata lexical items--
  • 34:42 - 34:44
    that would be glorious.
  • 34:55 - 35:00
    (man 7) This is like the auditing process
    of this semantic web
  • 35:00 - 35:04
    to close holes by inference.
  • 35:06 - 35:10
    If we think this further,
    this kind of translation,
  • 35:10 - 35:13
    how do you deal with semantic mismatch
  • 35:13 - 35:16
    and grammatical mismatch?
  • 35:16 - 35:19
    For instance, if you try
    to translate something in German,
  • 35:19 - 35:22
    you can simply put several words together
  • 35:22 - 35:26
    and reach something that's sensible,
  • 35:26 - 35:29
    and on the other hand,
    I think I read somewhere that
  • 35:31 - 35:38
    not every language
    has the same granular system
  • 35:38 - 35:40
    for colors, for instance.
  • 35:42 - 35:43
    You said everything
  • 35:43 - 35:45
    uses a different system
    for colors, or the same?
  • 35:46 - 35:48
    (man 7) I remember maybe
    that it's just about evolution of language
  • 35:48 - 35:52
    that they started out
    with black and white and then--
  • 35:52 - 35:53
    Yeah, the color hierarchy.
  • 35:53 - 35:54
    Actually, the color hierarchy
  • 35:54 - 35:57
    is a great way to illustrate
    how this works, right?
  • 35:58 - 36:01
    So, essentially, when you have
    a single pivot language--
  • 36:02 - 36:05
    it's really interesting when
    you read papers on machine translation
  • 36:05 - 36:08
    because oftentimes they'll talk about
    some hypothetical pivot language,
  • 36:08 - 36:10
    that they say, "Oh yeah,
    there is a pivot language,"
  • 36:10 - 36:12
    and then you read the paper
    and see, "It's English."
  • 36:12 - 36:17
    And so what this form
    of lexical translation does,
  • 36:17 - 36:20
    by passing it through
    many different intermediate languages,
  • 36:21 - 36:26
    it has the effect of being able
    to deal with a lot of semantic ambiguity.
  • 36:26 - 36:28
    Because as long as you're passing it
    through languages
  • 36:28 - 36:33
    that contain reasonably similar
    semantic boundaries for a word,
  • 36:33 - 36:37
    then you can avoid
    the problem of essentially
  • 36:37 - 36:40
    introducing semantic ambiguity
    through the pivot language.
  • 36:40 - 36:43
    So using the color hierarchy thing
    as an example,
  • 36:43 - 36:46
    if you take a language that has
    a single color word for green and blue
  • 36:46 - 36:51
    and you translate it into "blue"
  • 36:51 - 36:53
    in your single pivot language
  • 36:53 - 36:54
    and then into another language
  • 36:54 - 36:57
    that has different ambiguities
    on these things,
  • 36:57 - 37:00
    then you end up introducing
    semantic ambiguity.
  • 37:00 - 37:02
    But if you pass it through
    a bunch of other languages
  • 37:02 - 37:06
    that also contain a single
    lexical item for green and blue,
  • 37:06 - 37:11
    then, essentially,
    that semantic specificity
  • 37:11 - 37:17
    gets passed along
    to the resultant language.
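
Here is a toy illustration of that point, with hypothetical words: the source term "grue" colexifies green and blue; a single splitting pivot (English-like) scatters the votes, while pivots that also colexify concentrate them on the colexifying target word. The vote-counting is a simplification assumed for illustration, not PanLex's actual scoring.

```python
# Toy demo: multiple colexifying pivots outvote one meaning-splitting pivot.
from collections import Counter

# pivot -> (source-to-pivot dict, pivot-to-target dict); all words invented
PIVOT_DICTS = {
    "eng":  ({"grue": ["green", "blue"]}, {"green": ["zel"], "blue": ["blu"]}),
    "pivA": ({"grue": ["gruA"]},          {"gruA": ["grue2"]}),
    "pivB": ({"grue": ["gruB"]},          {"gruB": ["grue2"]}),
}

def translate(word):
    scores = Counter()
    for src_to_piv, piv_to_dst in PIVOT_DICTS.values():
        for pivot_word in src_to_piv.get(word, []):
            for candidate in piv_to_dst.get(pivot_word, []):
                scores[candidate] += 1
    return scores.most_common()

print(translate("grue"))  # [('grue2', 2), ('zel', 1), ('blu', 1)]
```

The colexified target "grue2" wins because two pivots preserve the green/blue boundary, which is the agreement effect described above.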
  • 37:18 - 37:21
    As far as the grammatical feature aspects,
  • 37:21 - 37:23
    PanLex has been primarily, in its history,
  • 37:23 - 37:29
    collecting essentially lexemes,
    essentially lexical forms.
  • 37:30 - 37:32
    And, by that, I mean, essentially,
  • 37:32 - 37:34
    whatever you get
    as the headword for a dictionary.
  • 37:35 - 37:38
    So we don't necessarily
    concentrate at this time
  • 37:39 - 37:41
    on collecting grammatical variant forms,
  • 37:41 - 37:43
    things like [inaudible], dative, etc.,
  • 37:43 - 37:45
    or past tense and present tense.
  • 37:45 - 37:46
    But it's something we're looking into.
  • 37:46 - 37:48
    One thing that it's always
    important to remember
  • 37:48 - 37:51
    is that because our focus is--
  • 37:51 - 37:54
    is on underserved and endangered
    minority languages,
  • 37:55 - 37:58
    we want to make sure
    that something is available
  • 37:58 - 38:00
    before we make it perfect.
  • 38:02 - 38:03
    A phrase I absolutely love
  • 38:03 - 38:05
    is "Don't let the perfect
    be the enemy of the good,"
  • 38:05 - 38:07
    and that's what we intend to do.
  • 38:07 - 38:09
    But we are super interested in the idea
  • 38:09 - 38:12
    of being able to handle grammatical forms,
  • 38:12 - 38:14
    and being able to translate
    across grammatical forms,
  • 38:14 - 38:16
    and it's some stuff
    we've done some research on
  • 38:16 - 38:17
    but we haven't fully implemented yet.
  • 38:25 - 38:29
    (man 8) So, of the 7,500 or so languages,
  • 38:30 - 38:33
    I assume you're relying on dictionaries
    which are written for us,
  • 38:33 - 38:36
    but do all those languages
    have standard written forms
  • 38:36 - 38:38
    and how do you deal with...?
  • 38:38 - 38:40
    That's a great question.
  • 38:42 - 38:45
    Essentially, yes, a lot of these languages,
  • 38:45 - 38:48
    as everyone's aware, are unwritten.
  • 38:48 - 38:51
    However, any language
    for which a dictionary has been produced
  • 38:51 - 38:52
    has some kind of orthography,
  • 38:52 - 38:57
    and we rely on the orthography
    produced for the dictionary.
  • 38:57 - 39:00
    We occasionally do some
    slight massaging of orthography
  • 39:01 - 39:03
    if we can guarantee
    it to be lossless, basically.
  • 39:03 - 39:05
    But we tend to avoid it
    as much as possible.
  • 39:08 - 39:11
    So, essentially,
    we don't get into the business
  • 39:11 - 39:13
    of developing orthographies
    for languages,
  • 39:13 - 39:15
    because oftentimes they have already been developed,
  • 39:15 - 39:17
    even if they're not really
    widely published.
  • 39:17 - 39:22
    So, for example,
  • 39:22 - 39:26
    for a lot of languages
    that are spoken in New Guinea,
  • 39:26 - 39:29
    there may not be a commonly
    used orthographic form,
  • 39:29 - 39:31
    but some linguists
    just come up with something
  • 39:31 - 39:32
    and that's a good first step.
  • 39:33 - 39:37
    We also collect phonetic forms
    when they're available in dictionaries,
  • 39:37 - 39:38
    and so that's another way in,
  • 39:38 - 39:41
    essentially an IPA
    representation of the word,
  • 39:41 - 39:42
    if that's available.
  • 39:42 - 39:43
    So that can also be used as well.
  • 39:43 - 39:46
    But we typically don't
    use that as a pivot
  • 39:46 - 39:48
    because it introduces certain ambiguities.
  • 39:53 - 39:55
    (woman 4) Thank you,
    this might be a super silly question,
  • 39:56 - 40:01
    but are those only the intermediate
    languages you work with?
  • 40:01 - 40:02
    Oh, no. Oh, no.
  • 40:02 - 40:04
    (woman 4) Oh, yes, alright. Thank you.
  • 40:04 - 40:06
    No, I'm glad you asked.
    It answers the question.
  • 40:06 - 40:11
    So this is actually a screenshot
    from translate.panlex.org.
  • 40:11 - 40:13
    If you do a translation,
  • 40:13 - 40:15
    you'll get a list of translations
    on the right side.
  • 40:15 - 40:18
    You click a little dot dot dot button,
    you'll get a graph like this.
  • 40:18 - 40:22
    And what this shows
    is the intermediate languages,
  • 40:22 - 40:24
    the top 20 by score--
  • 40:24 - 40:26
    I could go into the details
    of how we do the score
  • 40:26 - 40:27
    but it's not super important now--
  • 40:27 - 40:30
    that are being used.
  • 40:30 - 40:33
    But to make the translation,
    we're actually using way more than 20.
  • 40:33 - 40:36
    The reason I cap it at 20
    is because if you have more than 20--
  • 40:36 - 40:38
    like this is actually
    a kind of a physics simulation
  • 40:38 - 40:40
    you can move the things around
    and they squiggle.
  • 40:40 - 40:42
    If you have more than 20,
    your computer gets really mad.
  • 40:45 - 40:47
    So it's more of a demonstration, yeah.
  • 40:56 - 40:58
    (woman 5) Leila,
    from Wikimedia Foundation.
  • 40:58 - 41:00
    Just one note on--
  • 41:00 - 41:03
    You mentioned Wikimedia Foundation
    a couple of times in your presentation,
  • 41:03 - 41:07
    I wanted to say if you want to do
    any kind of data ingestion
  • 41:07 - 41:08
    or a collaboration with Wikidata,
  • 41:09 - 41:11
    perhaps Wikimedia Deutschland
    would be a better place
  • 41:11 - 41:13
    to have these conversations?
  • 41:13 - 41:16
    Because Wikidata lives
    within Wikimedia Deutschland
  • 41:16 - 41:18
    and the team is there,
  • 41:18 - 41:20
    and also the community
    of volunteers around Wikidata
  • 41:20 - 41:24
    would be the perfect place to talk
  • 41:24 - 41:26
    about any kind of ingestions
  • 41:26 - 41:31
    or working on bringing
    PanLex closer to Wikidata.
  • 41:32 - 41:33
    Great, thank you very much,
  • 41:33 - 41:35
    because honestly I'm not
    exactly super familiar
  • 41:35 - 41:38
    with all of the intricacies
    of the architecture
  • 41:38 - 41:40
    of how all the projects
    relate to each other.
  • 41:40 - 41:42
    I'm guessing by the laughs
    that it's complicated.
  • 41:42 - 41:44
    But, yeah, so basically
    we would want to talk
  • 41:44 - 41:48
    with whoever is responsible for Wikidata.
  • 41:48 - 41:52
    So just do a little
    [inaudible] place thing,
  • 41:53 - 41:56
    whoever is responsible for Wikidata,
    that's who we're interested in talking to,
  • 41:56 - 41:58
    which is all of you volunteers.
  • 42:03 - 42:05
    Any further questions?
  • 42:10 - 42:14
    Okay, well, if anyone does end up having
    any further questions beyond this
  • 42:14 - 42:18
    or ones that I talked about-- the details
    and specifics about these things,
  • 42:18 - 42:20
    please come and talk to me,
    I'm super interested.
  • 42:20 - 42:24
    And especially if you're dealing
    with anything involving lexical stuff,
  • 42:24 - 42:29
    anything involving
    endangered minority languages
  • 42:29 - 42:30
    and underserved languages,
  • 42:30 - 42:34
    and also Unicode,
    which is something I do as well.
  • 42:36 - 42:38
    So thank you very much
  • 42:38 - 42:40
    and thank you
    for inviting me to come speak,
  • 42:40 - 42:42
    I'm hoping that you enjoyed all this.
  • 42:42 - 42:44
    (applause)