
cdn.media.ccc.de/.../wikidatacon2019-2-eng-Wikidata_and_languages_hd.mp4

  • 0:06 - 0:07
    (Lydia) Thank you so much.
  • 0:07 - 0:11
    So, this conference,
    one of the big themes is languages.
  • 0:14 - 0:19
    I want to give you an overview
    of where we actually are currently
  • 0:19 - 0:20
    when it comes to languages
  • 0:20 - 0:22
    and where we can go from here.
  • 0:29 - 0:33
    Wikidata is all about giving more people
    more access to more knowledge,
  • 0:33 - 0:37
    and language is such an important part
    of making that a reality,
  • 0:38 - 0:43
    especially since more and more
    of our lives depends on technology.
  • 0:44 - 0:49
    And as our keynote speaker
    earlier today was talking,
  • 0:50 - 0:52
    some of the technology
    leaves people behind
  • 0:52 - 0:55
    simply because they can't speak
    a certain language,
  • 0:55 - 0:58
    and that's not okay.
  • 0:59 - 1:02
    So we want to do something about that.
  • 1:03 - 1:06
    And in order to change that,
    you need at least two things.
  • 1:06 - 1:11
    One is you need to provide content
    to the people in their language,
  • 1:11 - 1:13
    and the second thing you need
  • 1:13 - 1:16
    is to provide them
    with interaction in their language
  • 1:16 - 1:19
    in those applications
    or whatever it is you have.
  • 1:20 - 1:25
    And Wikidata helps with both of those.
  • 1:25 - 1:28
    And the first thing,
    content in your language,
  • 1:28 - 1:31
    that is basically what we have
    in items and properties,
  • 1:31 - 1:33
    how we describe the world.
  • 1:33 - 1:35
    Now, this is certainly
    not everything you need,
  • 1:35 - 1:39
    but it gets you quite far ahead.
  • 1:40 - 1:42
    The other thing
    is interaction in your language,
  • 1:42 - 1:46
    and that's where lexemes come into play
  • 1:46 - 1:49
    If you want to talk
    to your digital personal assistant
  • 1:49 - 1:55
    or if you want to have your device
    translate a text and things like that.
  • 1:56 - 1:59
    Alright, let's look into
    content in your language.
  • 1:59 - 2:03
    So what we have in items and properties.
  • 2:05 - 2:10
    For this, the labels in those items
    and properties are crucial.
  • 2:10 - 2:15
    We need to know what this entity
    is called that we're talking about.
  • 2:16 - 2:20
    And instead of talking about Q5,
  • 2:20 - 2:22
    someone who speaks English
    knows that's a "human,"
  • 2:22 - 2:25
    someone who speaks German
    knows that's a "Mensch,"
  • 2:25 - 2:26
    and similar things.
  • 2:26 - 2:30
    So those labels on items and properties
  • 2:30 - 2:34
    are bridging the gap
    between humans and machines.
  • 2:34 - 2:35
    And humans and humans
  • 2:35 - 2:40
    making more existing knowledge
    accessible to them.
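[Editor's note: the label lookup described here can be sketched in a few lines of Python. The English, German, and French labels below are Q5's real ones; the fallback chain is an illustrative simplification, not Wikidata's actual language-fallback logic.]

```python
# Labels make an opaque ID like Q5 readable for humans in their language.
Q5_LABELS = {
    "en": "human",
    "de": "Mensch",
    "fr": "être humain",
}

def label_for(entity_id, labels, preferred):
    """Return the first available label along a preferred-language chain."""
    for lang in preferred:
        if lang in labels:
            return labels[lang]
    return entity_id  # no label available: show the opaque ID itself

print(label_for("Q5", Q5_LABELS, ["ast", "en"]))  # -> human (no Asturian label yet)
```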
  • 2:43 - 2:46
    Now, that's a nice aspiration.
  • 2:46 - 2:48
    What does it actually look like?
  • 2:48 - 2:50
    It looks like this.
  • 2:51 - 2:52
    What you're seeing here
  • 2:52 - 2:58
    is that most of the items
    on Wikidata have two labels,
  • 2:58 - 3:01
    so labels in two languages.
  • 3:02 - 3:04
    And after that, it's one, and then three,
  • 3:04 - 3:06
    and then it becomes very sad.
  • 3:07 - 3:09
    (quiet laughter)
  • 3:10 - 3:13
    I think we need to do better than this.
  • 3:14 - 3:15
    But, on the other hand,
  • 3:15 - 3:17
    I was actually expecting this
    to be even worse.
  • 3:17 - 3:20
    I was expecting the average to be one.
  • 3:20 - 3:23
    So I was quite happy
    to see two. (chuckles)
  • 3:25 - 3:26
    Alright.
  • 3:27 - 3:30
    But it's not just interesting to know
  • 3:30 - 3:34
    how many labels our items
    and properties have.
  • 3:34 - 3:37
    It's also interesting to see
    in which languages.
  • 3:38 - 3:44
    Here you see a graph of the languages
  • 3:44 - 3:47
    that we have labels for on Items.
  • 3:47 - 3:51
    So the biggest part there is Other.
  • 3:51 - 3:54
    So I just took the top 100 languages
  • 3:55 - 3:59
    and everything else is Other
    to make this graph readable.
  • 4:00 - 4:02
    And then there's English and Dutch,
  • 4:03 - 4:04
    French,
  • 4:06 - 4:09
    and not to forget, Asturian.
  • 4:10 - 4:12
    - (person 1) Whoo!
    - Whoo-hoo, yes!
  • 4:14 - 4:17
    So what you see here is quite an imbalance
  • 4:17 - 4:20
    and still quite a lot of focus on English.
  • 4:21 - 4:24
    Another thing is if you look
    at the same thing for Properties,
  • 4:24 - 4:26
    it's actually looking better.
  • 4:27 - 4:33
    And I think part of that is simply
    that there are way fewer properties.
  • 4:33 - 4:37
    So even smaller communities
    have a chance to keep up with that.
  • 4:37 - 4:39
    But it's also a pretty
    important part of Wikidata
  • 4:39 - 4:41
    to localize into your language.
  • 4:41 - 4:42
    So that's good.
  • 4:46 - 4:48
    What I want to highlight
    here with Asturian
  • 4:48 - 4:54
    is that a small community
    can really make a huge difference
  • 4:54 - 4:57
    with some dedication and work,
  • 4:57 - 4:58
    and that's really cool.
  • 5:02 - 5:04
    A small quiz for you.
  • 5:04 - 5:05
    If you take all the properties on Wikidata
  • 5:05 - 5:08
    that are not external identifiers,
  • 5:08 - 5:10
    which one has the most labels,
    like the most languages?
  • 5:11 - 5:14
    (audience) [inaudible]
  • 5:14 - 5:17
    I hear some agreement on instance of?
  • 5:18 - 5:19
    You would be wrong.
  • 5:20 - 5:22
    It's image. (chuckles)
  • 5:23 - 5:26
    So, yeah, that tells you,
    if you speak one of the languages
  • 5:26 - 5:29
    where instance of
    doesn't yet have a label,
  • 5:29 - 5:30
    you might want to add it.
  • 5:32 - 5:36
    So it has 148 labels currently.
  • 5:38 - 5:41
    But that's just another slide.
  • 5:43 - 5:44
    This graph tells us something
  • 5:44 - 5:49
    about how much content we are making
    available in a certain language
  • 5:49 - 5:52
    and how much of that content
    is actually used.
  • 5:52 - 5:55
    So what you're seeing is basically a curve
  • 5:55 - 6:01
    with most content having English labels,
    being available in English,
  • 6:02 - 6:04
    and being used a lot.
  • 6:04 - 6:06
    And then it kind of goes down.
  • 6:06 - 6:09
    But, again, what you can see are outliers
  • 6:09 - 6:15
    that have a lot more content
    than you would necessarily expect,
  • 6:17 - 6:20
    and that is really, really good.
  • 6:21 - 6:25
    The problem still is it's not used a lot.
  • 6:26 - 6:29
    Asturian and Dutch should be higher,
  • 6:29 - 6:32
    and I think helping those communities
  • 6:33 - 6:36
    increase the use
    of the data they collected
  • 6:36 - 6:38
    is a really useful thing to do.
  • 6:43 - 6:48
    What this analysis and others showed us,
    and this is also a good thing,
  • 6:48 - 6:51
    is that we are seeing
    that highly used items
  • 6:51 - 6:55
    also tend to have more labels
  • 6:55 - 6:58
    or the other way around--
    it's not entirely clear.
  • 7:03 - 7:04
    And then the question is,
  • 7:05 - 7:07
    are we serving
    just the powerful languages?
  • 7:08 - 7:11
    Or are we serving everyone?
  • 7:13 - 7:18
    And what you see here
    is a grouping of languages.
  • 7:18 - 7:22
    The languages that are grouped together
    tend to have labels together.
  • 7:26 - 7:29
    And you see it clustering.
  • 7:29 - 7:34
    Now here's a similar clustering, colored,
  • 7:34 - 7:39
    based on how alive, how used,
  • 7:40 - 7:43
    how endangered the language is.
  • 7:43 - 7:45
    And a good thing you're seeing here
  • 7:45 - 7:50
    is that safe languages
    and endangered languages
  • 7:50 - 7:54
    do not form two different clusters.
  • 7:54 - 7:59
    But they're all mixed together,
  • 8:00 - 8:05
    which is much better than it would be
    the other way around
  • 8:05 - 8:09
    where the safe languages,
    the powerful languages
  • 8:10 - 8:12
    are just helping each other out.
  • 8:13 - 8:14
    No, that's not the case.
  • 8:14 - 8:17
    And it's a really good thing.
  • 8:17 - 8:20
    When I saw this,
    I thought this was very good.
  • 8:23 - 8:25
    Here's a similar thing
  • 8:26 - 8:29
    where we looked at
  • 8:30 - 8:34
    each language's status
  • 8:34 - 8:36
    and how many labels it has.
  • 8:39 - 8:43
    What you're seeing
    is a clear win for safe languages,
  • 8:43 - 8:44
    as is expected.
  • 8:46 - 8:47
    But what you're also seeing
  • 8:47 - 8:54
    is that the languages in category 2
    and 3 and maybe even 4
  • 8:54 - 8:59
    are not that bad, actually,
  • 8:59 - 9:02
    in terms of their representation
    in Wikidata and others.
  • 9:03 - 9:06
    It's a really good thing to find.
  • 9:08 - 9:09
    Now, if you look at the same thing
  • 9:09 - 9:12
    for how much of that content
    of those labels
  • 9:12 - 9:15
    is actually used
    on Wikipedia, for example,
  • 9:17 - 9:23
    then we see a similar
    picture emerging again.
  • 9:24 - 9:30
    And it tells us that those communities
    are actually making good use of their time
  • 9:30 - 9:35
    by filling in labels
    for highly used items, for example.
  • 9:36 - 9:40
    There are outliers
    where I think we can help,
  • 9:42 - 9:48
    to help those communities find the places
    where their work would be most valuable.
  • 9:49 - 9:53
    But, overall, I'm happy with this picture.
  • 9:55 - 10:00
    Now, that was the items
    and properties part of Wikidata.
  • 10:01 - 10:03
    Now, let's look at interaction
    in your languages.
  • 10:03 - 10:05
    So the lexeme parts of Wikidata
  • 10:05 - 10:09
    where we describe words
    and their forms and their meanings.
  • 10:10 - 10:13
    We've been doing this now
    since May last year,
  • 10:16 - 10:19
    and content has been growing.
  • 10:20 - 10:22
    You can see here in blue the lexemes,
  • 10:22 - 10:26
    and then in red,
    the forms on those lexemes
  • 10:26 - 10:30
    and yellow, the senses
    on those lexemes.
  • 10:31 - 10:34
    So some communities--
    we'll get to that later--
  • 10:34 - 10:40
    have spent a lot of time creating forms
    and senses for their lexemes,
  • 10:40 - 10:43
    which is really useful
  • 10:43 - 10:48
    because that builds
    the core of the data set that you need.
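[Editor's note: the lexeme data model described here, a lexeme grouping forms and senses, can be sketched roughly as below. Field names are illustrative, not Wikidata's exact JSON schema.]

```python
from dataclasses import dataclass, field

@dataclass
class Form:
    representation: str   # the inflected spelling, e.g. "waters"
    features: tuple = ()  # grammatical features, e.g. ("plural",)

@dataclass
class Sense:
    gloss: str            # a short description of one meaning

@dataclass
class Lexeme:
    lemma: str
    language: str         # language code
    category: str         # lexical category, e.g. "noun"
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)

water = Lexeme(
    "water", "en", "noun",
    forms=[Form("water", ("singular",)), Form("waters", ("plural",))],
    senses=[Sense("transparent liquid, H2O")],
)
print(len(water.forms), len(water.senses))  # -> 2 1
```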
  • 10:51 - 10:55
    Now, we looked at all the languages
  • 10:55 - 10:58
    that have lexemes on Wikidata.
  • 10:58 - 11:01
    So words we have,
  • 11:02 - 11:04
    those are right now 310 languages.
  • 11:05 - 11:08
    Now, what do you think is the top language
  • 11:08 - 11:12
    when it comes to the number
    of lexemes currently in Wikidata?
  • 11:13 - 11:15
    (audience) [inaudible]
  • 11:19 - 11:20
    Huh?
  • 11:20 - 11:22
    (person 2) German.
  • 11:22 - 11:24
    Sorry, I've heard it before.
  • 11:24 - 11:26
    It's Russian.
  • 11:28 - 11:30
    Russian is quite ahead.
  • 11:32 - 11:34
    And just to give you some perspective,
  • 11:36 - 11:37
    there are different opinions,
  • 11:37 - 11:42
    but I've read, for example,
    that 1,000 to 3,000 words
  • 11:42 - 11:45
    gets you to conversation level,
    roughly, in another language,
  • 11:45 - 11:49
    and 4,000 to 10,000 words
    to an advanced level.
  • 11:52 - 11:55
    So, we still have a bit to catch up there.
  • 11:58 - 12:03
    One thing I want you
    to pay attention to is Basque here
  • 12:03 - 12:08
    with 10,000, roughly, lexemes.
  • 12:09 - 12:13
    Now, if you look at the number
    of forms for those lexemes,
  • 12:14 - 12:16
    Basque is way up there,
  • 12:18 - 12:20
    which is really cool,
  • 12:20 - 12:25
    and you should go to a talk that explains
    to you why that is the case.
  • 12:27 - 12:31
    Now, if you look at the number
    of senses, so what do words mean,
  • 12:32 - 12:35
    Basque even gets to the top of the list.
  • 12:35 - 12:37
    I think that deserves an applause.
  • 12:37 - 12:39
    (applause)
  • 12:46 - 12:47
    Another short quiz.
  • 12:47 - 12:50
    What's the lexeme
    with the most translations currently?
  • 12:51 - 12:55
    (audience) Cats, cats, [inaudible],
    Douglas Adams, [inaudible]
  • 12:57 - 13:00
    All good guesses, but no.
  • 13:01 - 13:04
    It's this, the Russian word for "water."
  • 13:10 - 13:12
    Alright, so now we talked a lot
  • 13:12 - 13:16
    about how many lexemes,
    forms, and senses we have,
  • 13:16 - 13:20
    but that's just one thing you need.
  • 13:20 - 13:22
    The other thing you need
  • 13:22 - 13:25
    is actually describing those lexemes,
    forms, and senses
  • 13:25 - 13:28
    in a machine-readable way.
  • 13:28 - 13:30
    And for that you have statements,
    like on items.
  • 13:31 - 13:36
    And one of the properties
    you use is usage example.
  • 13:36 - 13:39
    So whoever is using that data
  • 13:39 - 13:42
    can understand how to use
    that word in context,
  • 13:42 - 13:44
    so that could be a quote, for example.
  • 13:45 - 13:47
    And here, Polish rocks.
  • 13:48 - 13:50
    Good job, Polish speakers.
  • 13:54 - 13:58
    Another property
    that's really useful is IPA,
  • 13:58 - 14:00
    so how do you pronounce this word.
  • 14:01 - 14:07
    Russian apparently needs
    lots of IPA statements.
  • 14:10 - 14:13
    But, again, Polish, second.
  • 14:17 - 14:21
    And last but not least
    we have pronunciation audio.
  • 14:21 - 14:23
    So those are links to files on Commons
  • 14:23 - 14:26
    where someone speaks the word,
  • 14:26 - 14:30
    so you can hear a native speaker
    pronounce the word
  • 14:30 - 14:33
    in case you can't read IPA, for example.
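[Editor's note: the three properties just mentioned (usage example, IPA transcription, pronunciation audio) live as statements on lexemes and their forms. A toy completeness check over such statements might look like this; the data layout and file name are illustrative, not Wikidata's actual format.]

```python
# Forms of the Polish lexeme "woda", with whatever statements they have so far.
forms = [
    {"repr": "woda",   "statements": {"IPA transcription": "ˈvɔda"}},
    {"repr": "wody",   "statements": {}},
    {"repr": "wodzie", "statements": {"pronunciation audio": "Pl-wodzie.ogg"}},
]

def missing(forms, prop):
    """List form representations that have no statement for `prop` yet."""
    return [f["repr"] for f in forms if prop not in f["statements"]]

print(missing(forms, "IPA transcription"))  # -> ['wody', 'wodzie']
```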
  • 14:35 - 14:39
    And there's actually a really nice
    Wikibase-powered project
  • 14:39 - 14:40
    called Lingua Libre
  • 14:41 - 14:45
    where you can go and help record
    words in your language
  • 14:45 - 14:48
    that then can be added
    to lexemes on Wikidata,
  • 14:48 - 14:52
    so other people can understand
    how to pronounce your words.
  • 14:54 - 14:56
    (person 2) [inaudible]
  • 14:56 - 14:58
    If you search for "Lingua Libre," you'll find it,
  • 14:58 - 15:01
    and I'm sure someone can post it
    in the Telegram channel.
  • 15:03 - 15:05
    Those guys rock.
  • 15:05 - 15:07
    They did really cool stuff with Wikibase.
  • 15:09 - 15:11
    Alright.
  • 15:13 - 15:17
    Then the question is,
    where do we go from here?
  • 15:19 - 15:22
    Based on the numbers I've just shown you,
  • 15:23 - 15:25
    we've come a long way
  • 15:25 - 15:28
    towards giving more people
    more access to more knowledge
  • 15:28 - 15:31
    when looking at languages on Wikidata.
  • 15:33 - 15:36
    But there is also still
    a lot of work ahead of us.
  • 15:39 - 15:42
    Some of the things
    you can do to help, for example,
  • 15:42 - 15:45
    is run label-a-thons
  • 15:45 - 15:50
    like get people together
    to label items in Wikidata
  • 15:51 - 15:55
    or do an edit-a-thon
    around lexemes in your language
  • 15:55 - 15:59
    to get the most used words
    in your language into Wikidata.
  • 16:01 - 16:03
    Or you can use a tool like Terminator
  • 16:03 - 16:08
    that helps you find the most
    important items in your language
  • 16:08 - 16:12
    that are still missing a label.
  • 16:13 - 16:18
    Most important being measured
    by how often it is used
  • 16:18 - 16:23
    in other Wikidata items
    as links in statements.
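[Editor's note: that importance measure, how often an item is linked from other items' statements, can be illustrated in a few lines. The Q-IDs and links here are toy data, not real Terminator output.]

```python
from collections import Counter

links = {                 # each item maps to the items its statements link to
    "Q64":   ["Q5", "Q515"],
    "Q42":   ["Q5", "Q36180"],
    "Q1055": ["Q515"],
}
labeled = {"Q5"}          # items that already have a label in our language

usage = Counter(t for targets in links.values() for t in targets)
todo = sorted((q for q in usage if q not in labeled),
              key=usage.get, reverse=True)
print(todo)  # most-used items still lacking a label first -> ['Q515', 'Q36180']
```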
  • 16:26 - 16:30
    And, of course, for the lexeme part,
  • 16:31 - 16:35
    now that we've got
    a basic coverage of those lexemes,
  • 16:35 - 16:41
    it's also about building them out,
    adding more statements to them
  • 16:41 - 16:44
    so that they actually can build the base
  • 16:44 - 16:47
    for meaningful applications
    to build on top of that.
  • 16:48 - 16:51
    Because we're getting closer
    to that critical mass,
  • 16:51 - 16:54
    but we're still some way from the point
  • 16:54 - 16:57
    that you can build
    serious applications on top of it.
  • 16:58 - 17:02
    And I hope all of you
    will join us in doing that.
  • 17:03 - 17:07
    And that already brings me
  • 17:07 - 17:10
    to a little help from our friends,
  • 17:10 - 17:13
    and Bruno, do you want to come over
  • 17:14 - 17:17
    and talk to us about lexical masks?
  • 17:18 - 17:19
    (Bruno) Thank you, Lydia,
  • 17:19 - 17:22
    thank you for giving me
    this short period of time
  • 17:22 - 17:24
    to present this work
    that we are doing at Google
  • 17:24 - 17:30
    with Denny, whom most of you
    have probably heard of or know.
  • 17:30 - 17:32
    And I'm a linguist at Google,
  • 17:32 - 17:36
    so I'm very happy to be here
    amongst other language enthusiasts.
  • 17:37 - 17:39
    We are also building some lexicons,
  • 17:39 - 17:42
    and we have built this technology
  • 17:42 - 17:46
    or this approach that we think
    can be useful for you.
  • 17:46 - 17:48
    Just to give you
    a little bit of background,
  • 17:48 - 17:52
    this is my lexicographic
    background talking here.
  • 17:53 - 17:54
    When we build lexicon databases,
  • 17:54 - 17:59
    they are hard to maintain,
    to keep consistent,
  • 17:59 - 18:00
    and to exchange data,
  • 18:00 - 18:02
    as you probably know.
  • 18:03 - 18:06
    There are several attempts
    to unify the features and the properties
  • 18:06 - 18:09
    that are describing
    those lexemes and those forms,
  • 18:09 - 18:11
    and it's not a solved problem,
  • 18:11 - 18:14
    but there are some
    unification attempts on that side.
  • 18:14 - 18:15
    But what is really missing--
  • 18:15 - 18:19
    and this is a problem we had
    at the beginning of our project at Google--
  • 18:19 - 18:22
    is to try to have an internal structure
  • 18:22 - 18:26
    that describes what
    a lexical entry should look like,
  • 18:26 - 18:29
    what kind of data
    or what kind of information we have
  • 18:29 - 18:32
    and the specifications that are expected.
  • 18:32 - 18:38
    So, this is what we came up with,
    this thing called a lexical mask.
  • 18:39 - 18:45
    A lexical mask describes
    what is expected for an entry,
  • 18:45 - 18:47
    a lexicographic entry, to be complete,
  • 18:47 - 18:51
    both in terms of the number of forms
    you expect for a lexeme,
  • 18:51 - 18:56
    and the number of features
    you expect for each of those forms.
  • 18:56 - 18:58
    Here is an example for Italian adjectives.
  • 18:58 - 19:02
    You expect, in Italian, to have
    four forms for your adjectives,
  • 19:02 - 19:05
    and each of these forms
    has a specific combination
  • 19:05 - 19:08
    of gender and number features.
  • 19:09 - 19:13
    This is what we expect
    for the Italian adjectives.
  • 19:13 - 19:16
    Of course, you can have
    extremely complex masks,
  • 19:16 - 19:21
    like the French verbs conjugation,
    which is quite extensive,
  • 19:21 - 19:23
    and I won't show you
    the Russian mask
  • 19:23 - 19:25
    because it doesn't fit the screen.
  • 19:26 - 19:30
    And we also have
    some detailed specifications
  • 19:30 - 19:33
    because we distinguish
    what is at the form level.
  • 19:33 - 19:38
    So here you have Russian nouns
    that have three numbers
  • 19:38 - 19:40
    and a number of cases
    with different forms,
  • 19:40 - 19:43
    but they also have
    an entry level specification
  • 19:43 - 19:46
    that says a noun particularly has
  • 19:46 - 19:50
    an inherent gender
    and an inherent animacy feature
  • 19:50 - 19:52
    that is also specified in the mask.
  • 19:55 - 19:59
    We also want to distinguish
    that a mask gives a specification
  • 19:59 - 20:02
    for, in general,
    what an entry should look like.
  • 20:02 - 20:07
    But you can have smaller masks
    for defective aspects of the form
  • 20:07 - 20:11
    or defective aspects of the lexeme
    that happen in language.
  • 20:11 - 20:15
    So here is the simplest version
    of French verbs
  • 20:15 - 20:20
    that have only the 3rd person singular
    for all the weather verbs,
  • 20:20 - 20:24
    like "it rains" or "it snows,"
    like in English.
  • 20:25 - 20:26
    So we distinguish these two levels.
  • 20:27 - 20:30
    And how we use this at Google
  • 20:30 - 20:33
    is that when we have a lexicon
    that we want to use,
  • 20:33 - 20:38
    we use the mask to really
    literally throw the lexicons,
  • 20:38 - 20:40
    all the entries, through the mask
  • 20:40 - 20:44
    and see which entry has a problem
    in terms of structure.
  • 20:44 - 20:47
    Are we missing a form?
    Are we missing a feature?
  • 20:47 - 20:51
    And when there is a problem,
    we do some human validation
  • 20:51 - 20:54
    or just check whether it passes the mask.
  • 20:54 - 20:58
    So it's an extremely powerful tool
    to check the quality of the structure.
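[Editor's note: the check Bruno describes, "throwing" every entry through the mask and flagging structural problems, can be sketched as below. The Italian-adjective mask (four forms, each a gender/number combination) follows his earlier example; the data structures are assumptions, not Google's internal format.]

```python
ITALIAN_ADJECTIVE_MASK = {
    ("masculine", "singular"), ("masculine", "plural"),
    ("feminine", "singular"), ("feminine", "plural"),
}

def check(entry_forms, mask):
    """Return (missing, unexpected) feature combinations for one entry."""
    have = {f["features"] for f in entry_forms}
    return mask - have, have - mask

rosso = [  # an incomplete entry: the feminine plural has not been added yet
    {"repr": "rosso", "features": ("masculine", "singular")},
    {"repr": "rossi", "features": ("masculine", "plural")},
    {"repr": "rossa", "features": ("feminine", "singular")},
]
missing, unexpected = check(rosso, ITALIAN_ADJECTIVE_MASK)
print(missing)  # -> {('feminine', 'plural')}
```

Entries that fail the check go to human validation, as described above.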
  • 20:59 - 21:02
    So what we are happy to announce today
  • 21:02 - 21:05
    is that we got the green light
    to open-source our masks.
  • 21:06 - 21:08
    So this is a schema.
  • 21:08 - 21:09
    If you want that, we can release
  • 21:09 - 21:13
    and that we will provide
    to Wikidata as ShEx files.
  • 21:13 - 21:17
    This is a ShEx file for German nouns,
  • 21:17 - 21:20
    and Denny is working on the conversion
    from our internal specification
  • 21:20 - 21:24
    to a more open-source specification.
  • 21:24 - 21:28
    We currently cover more than 25 languages.
  • 21:28 - 21:29
    So we expect to grow on our side,
  • 21:29 - 21:34
    but we also look for this opportunity
    to collaborate for other languages.
  • 21:34 - 21:41
    And there is also an ongoing collaboration
    that Denny has with Lukas.
  • 21:41 - 21:45
    Lukas has these great tools to have a UI
  • 21:45 - 21:51
    to help the user or the contributor
    to add more forms.
  • 21:51 - 21:54
    So if you want to add
    an adjective in French,
  • 21:54 - 21:59
    the UI is telling you
    how many forms are expected
  • 21:59 - 22:02
    and what kind of features
    this form should have.
  • 22:02 - 22:06
    So our mask will help the tool
    to be defined and expanded.
  • 22:07 - 22:08
    That's it.
  • 22:09 - 22:10
    (Lydia) Thank you so much.
  • 22:10 - 22:12
    (applause)
  • 22:14 - 22:17
    Alright. Are there questions?
  • 22:17 - 22:19
    Do you want to talk more about lexemes?
  • 22:20 - 22:21
    - (person 3) Yes.
    - Yes. (chuckles)
  • 22:33 - 22:35
    (person 3) My question,
    because you were talking
  • 22:35 - 22:39
    about giving more access
    to more people in more languages.
  • 22:39 - 22:42
    But there are a lot of languages
    that can't be used in Wikidata.
  • 22:42 - 22:45
    So what solution do you have for that?
  • 22:46 - 22:48
    When you say they can't use Wikidata,
  • 22:48 - 22:50
    are you talking about entering labels?
  • 22:50 - 22:53
    - (person 3) Labels, descriptions.
    - Right.
  • 22:53 - 22:55
    So, for lexemes, it's a bit different
  • 22:55 - 22:58
    because there we don't have
    that restriction.
  • 22:59 - 23:05
    For labels on items and properties,
    there is some restriction
  • 23:05 - 23:12
    because we wanted to make sure
    that it's not a free-for-all where
  • 23:12 - 23:14
    anyone does anything,
  • 23:14 - 23:18
    and it becomes unmanageable.
  • 23:19 - 23:23
    But if even a small community wants
    a language and wants to work on that,
  • 23:24 - 23:27
    come talk to us, we will make it happen.
  • 23:27 - 23:29
    (person 3) I mean, we did this
    at the Prague Hackathon in May,
  • 23:29 - 23:32
    and it took us until almost August
    in order to be able to use our language.
  • 23:32 - 23:35
    - Yeah.
    - (person 3) So, it's very slow.
  • 23:35 - 23:38
    Yeah, it is, unfortunately, very slow.
  • 23:38 - 23:40
    We're currently working
    with the Language Committee
  • 23:40 - 23:46
    on solving some fundamental...
  • 23:50 - 23:55
    Like, getting agreement on what kind
    of languages are actually "allowed,"
  • 23:56 - 23:59
    and that has taken too long,
  • 24:00 - 24:04
    which is the reason why your request
    probably took longer than it should have.
  • 24:05 - 24:06
    (person 3) Thanks.
  • 24:07 - 24:08
    (person 4) Thank you.
  • 24:08 - 24:11
    Lydia, if you remember
    the statistics that you showed,
  • 24:11 - 24:13
    the number of lexemes per language.
  • 24:13 - 24:18
    So, did you count
    all the forms as a data point
  • 24:18 - 24:20
    or only lexemes?
  • 24:21 - 24:23
    (Lydia) Do you mean this?
  • 24:23 - 24:24
    Which one do you mean?
  • 24:24 - 24:26
    (person 4) Yes, exactly.
  • 24:26 - 24:28
    If you remember,
    does this number [inaudible]
  • 24:28 - 24:32
    all the forms for all the lexemes
    or just how many lexemes there are?
  • 24:32 - 24:34
    No, this is just a number of lexemes.
  • 24:34 - 24:35
    (person 4) Just a number of lexemes, okay.
  • 24:35 - 24:37
    So then it is a fair statistic,
  • 24:37 - 24:39
    because if it also
    counted the forms--
  • 24:39 - 24:41
    that's why I'm asking--
  • 24:41 - 24:43
    then all the languages
    with the inflectional morphology,
  • 24:43 - 24:45
    like Russian, Serbian,
    Slovenian, et cetera,
  • 24:45 - 24:48
    they have a natural advantage
    because they have so many.
  • 24:48 - 24:52
    So, this kind of kicks in here
    on this number of forms.
  • 24:52 - 24:54
    (person 4) Yeah, that was this one.
    Thank you.
  • 24:57 - 25:00
    (person 5) So, I had
    a quick question about the...
  • 25:01 - 25:07
    When we're talking about
    the actual items and properties.
  • 25:07 - 25:09
    Like as far as I understand,
  • 25:09 - 25:12
    there is currently no way
    to give an actual source
  • 25:12 - 25:15
    to any of the labels
    and descriptions that are given.
  • 25:15 - 25:18
    So, for example,
    because when you're talking
  • 25:18 - 25:21
    about an item property,
  • 25:21 - 25:25
    like, for example,
    you can get conflicting labels.
  • 25:25 - 25:26
    Yes.
  • 25:26 - 25:28
    (person 5) So this person is like...
  • 25:28 - 25:31
    We were talking about
    indigenous things before, for example.
  • 25:31 - 25:36
    So this person is a Norwegian artist
    according to this source,
  • 25:36 - 25:39
    and a Sami artist,
    according to this source.
  • 25:40 - 25:43
    Or, for example, in Estonian,
    we had an issue
  • 25:43 - 25:48
    where we had to change terminology
    to the official use terminology
  • 25:48 - 25:49
    in official lexicons,
  • 25:49 - 25:52
    but we have no way to indicate really why,
  • 25:52 - 25:54
    like what was the source of this
  • 25:54 - 25:56
    and why this was better
    and what was there before.
  • 25:56 - 25:57
    It was just me as a random person
  • 25:57 - 26:00
    just switching the thing
    to anyone who sees it.
  • 26:00 - 26:03
    So is there a plan
    to make this possible in any way
  • 26:03 - 26:06
    so that we can actually have
    proper sources for the language data?
  • 26:07 - 26:12
    So, it is partially possible.
  • 26:12 - 26:16
    So, for example, when you have
    an item for a person,
  • 26:17 - 26:23
    you have a statement, first name,
    last name, and so on, of that person,
  • 26:23 - 26:26
    and then you can provide
    the reference for that there.
  • 26:28 - 26:33
    I'm quite hesitant to add more complexity
  • 26:33 - 26:36
    for references on labels and descriptions,
  • 26:36 - 26:39
    but if people really, really think
  • 26:39 - 26:45
    this is something that isn't covered
    by any reference on the statement,
  • 26:45 - 26:47
    then let's talk about it.
  • 26:49 - 26:53
    But I fear it will add a lot of complexity
  • 26:53 - 26:57
    for what I hope are few cases,
  • 26:57 - 27:00
    but I'm willing to be convinced otherwise
  • 27:00 - 27:04
    if people really feel
    very strongly about this.
  • 27:04 - 27:08
    (person 5) I mean, if it's added
    it probably shouldn't be the default,
  • 27:08 - 27:12
    shown to all the users in the beginner
    interface, in any case.
  • 27:12 - 27:16
    More like, "Click here if you need to say
    a specific thing about this."
  • 27:18 - 27:23
    Do we have a sense of how many times
    that would actually matter?
  • 27:25 - 27:26
    (person 5) In Estonian, for example--
  • 27:26 - 27:29
    I expect this is true
    of other languages as well--
  • 27:29 - 27:34
    for example, there is an official name
    that is the actual legitimate translation,
  • 27:34 - 27:36
    for example, into English,
  • 27:36 - 27:40
    of, say, a specific kind of municipality.
  • 27:41 - 27:42
    That was my use case, for example,
  • 27:42 - 27:44
    where we were using the word "parish"
  • 27:45 - 27:51
    where the original Estonian word
    meant something like a church parish,
  • 27:51 - 27:52
    and that was the origin,
  • 27:52 - 27:55
    but that's not the official translation
    Estonia uses right now.
  • 27:55 - 27:59
    In this case, I would just add it
    as official name statements
  • 27:59 - 28:01
    and add the reference there.
  • 28:02 - 28:03
    (person 5) Okay.
  • 28:05 - 28:07
    More questions, yes?
  • 28:08 - 28:10
    (person 6) I have two quick comments.
  • 28:10 - 28:14
    You specifically called out Asturian
    as a language that does well,
  • 28:14 - 28:16
    and I think that's an artifact.
  • 28:16 - 28:18
    Tell me about it.
  • 28:18 - 28:20
    (person 6) I think it's just a bot
  • 28:20 - 28:24
    that pasted person names,
    like proper names,
  • 28:24 - 28:27
    and said, "Well, this is exactly
    like in French or Spanish,"
  • 28:27 - 28:29
    and just massively copied it.
  • 28:29 - 28:33
    One point of evidence is that
    you don't see that energy in Asturian
  • 28:33 - 28:37
    in things that actually
    require translation, like property names,
  • 28:37 - 28:40
    or names of items
    that are not proper names.
  • 28:40 - 28:41
    Asaf, you break my heart.
  • 28:41 - 28:43
    (person 6) I know,
    I like raining on parades,
  • 28:43 - 28:48
    but I have good news as well,
    which is about the pronunciation numbers.
  • 28:49 - 28:54
    As you probably know,
    Commons is full of pronunciation files,
  • 28:54 - 28:55
    and, for example,
  • 28:55 - 29:01
    Dutch has no less than 300,000
    pronunciation files already on Commons
  • 29:02 - 29:05
    that just need to somehow be ingested.
  • 29:05 - 29:08
    So if anyone's looking for a side project,
  • 29:08 - 29:09
    there's tons and tons
  • 29:09 - 29:13
    of classified, categorized
    pronunciation files on Commons
  • 29:13 - 29:17
    under the category
    "Pronunciation" by language.
  • 29:17 - 29:23
    So that's just waiting to be matched
    to lexemes and put on Lexeme.
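[Editor's note: the ingestion side project Asaf suggests, matching Commons pronunciation files to existing lexemes by lemma, might start like this. The "Xx-word.ogg" filename convention and the file names are assumptions for illustration; real Commons names vary.]

```python
files = ["Nl-water.ogg", "Nl-fiets.ogg", "Nl-gezellig.ogg"]
dutch_lemmas = {"water", "fiets"}  # lemmas of Dutch lexemes already on Wikidata

def match(files, lemmas):
    """Map each lemma to a candidate audio file, skipping unmatched files."""
    matched = {}
    for name in files:
        word = name.split("-", 1)[-1].rsplit(".", 1)[0].lower()
        if word in lemmas:
            matched[word] = name
    return matched

print(match(files, dutch_lemmas))
# -> {'water': 'Nl-water.ogg', 'fiets': 'Nl-fiets.ogg'}
```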
  • 29:23 - 29:25
    And I was wondering
    if you could say something
  • 29:25 - 29:27
    about the road map,
  • 29:27 - 29:29
    something about how much investment
  • 29:29 - 29:32
    or what can we expect
    from Lexeme in the coming year,
  • 29:32 - 29:34
    because I, for one, can't wait.
  • 29:35 - 29:37
    You can't wait? (chuckles)
  • 29:37 - 29:39
    - (person 6) For more.
    - Yes. (chuckles)
  • 29:45 - 29:50
    Right now, we're concentrating
    more on Wikibase and data quality
  • 29:51 - 29:55
    to see how much traction this gets
  • 29:55 - 30:02
    and then getting more feedback on
    where the pain points are next,
  • 30:02 - 30:06
    and then going back to improving
    lexicographical data further.
  • 30:07 - 30:10
    And one of the things
    I'd love to hear from you
  • 30:10 - 30:14
    is where exactly do you see
    the next steps,
  • 30:14 - 30:16
    where do you want to see improvements
  • 30:16 - 30:20
    so that we can then figure out
    how to make that happen.
  • 30:21 - 30:23
    But, of course, you're right,
  • 30:23 - 30:26
    there's still so much to do
    also on the technical side.
  • 30:31 - 30:36
    (person 7) Okay, as we were uploading
    the Basque words with forms,
  • 30:36 - 30:38
    and you'll see some
    of these kinds of things,
  • 30:38 - 30:41
    we were both like, last week we said,
    "Oh, we are the first one in something."
  • 30:43 - 30:45
    It appears in the press, and it's like,
  • 30:45 - 30:49
    "Oh, Basque are the first time in some--
    they are the first in something, okay."
  • 30:49 - 30:51
    (laughs)
  • 30:51 - 30:53
    And then people ask,
    "Okay, but what is this for?"
  • 30:55 - 30:57
    We don't have a real good answer.
  • 30:57 - 30:58
    I mean it's like, okay,
  • 30:58 - 31:02
    this will help computers
    to understand our language better, yes,
  • 31:02 - 31:05
    but what kind of tools
    can we make in the future?
  • 31:05 - 31:07
    And we don't have a good answer for this.
  • 31:07 - 31:11
    So I don't know
    if you have a good answer for this.
  • 31:11 - 31:13
    (chuckles) I don't know
    if I have a good answer,
  • 31:13 - 31:15
    but I have an answer.
  • 31:15 - 31:20
    So I think right now
    as I was telling [inaudible],
  • 31:20 - 31:22
    we haven't reached that critical mass
  • 31:22 - 31:26
    where you can build a lot
    of the really interesting tools.
  • 31:26 - 31:28
    But there are already some tools.
  • 31:28 - 31:32
    Just the other day,
    Esther [Pandelia], for example,
  • 31:32 - 31:34
    released a tool where you can see,
  • 31:36 - 31:39
    I think it was the words on a globe
  • 31:39 - 31:42
    where they're spoken,
    where they're coming from.
  • 31:43 - 31:44
    I'm probably wrong about this,
  • 31:44 - 31:46
    but she had answered
    on the Project chat on Wikidata--
  • 31:46 - 31:49
    you can look it up there.
  • 31:50 - 31:52
    So we have seen these first tools,
  • 31:52 - 31:56
    just like we've seen
    back when Wikidata started.
  • 31:57 - 32:00
    First some--like just a network,
  • 32:00 - 32:03
    and like, "Hey, look, there's this thing
    that connects to this other thing."
  • 32:05 - 32:07
    And as we have more data,
  • 32:07 - 32:10
    and as we've reached some critical mass,
  • 32:12 - 32:15
    more powerful applications
    become possible,
  • 32:16 - 32:18
    things like Histropedia,
  • 32:19 - 32:22
    things like question answering
  • 32:22 - 32:27
    in your digital personal assistant,
    Platypus, and so on.
  • 32:27 - 32:30
    And we're seeing
    a similar thing with lexemes.
  • 32:31 - 32:35
    We're at the stage
    where you can build like these little,
  • 32:35 - 32:37
    "hey, look, there's a connection
    between the two things,"
  • 32:38 - 32:43
    and "there's a translation
    of this word into that language" stage,
  • 32:43 - 32:48
    and as we build it out
    and as we describe more words,
  • 32:48 - 32:50
    more becomes possible.
  • 32:50 - 32:52
    Now, what becomes possible?
  • 32:53 - 32:59
    As Ben, our keynote speaker,
    was saying earlier about translations,
  • 33:00 - 33:03
    being able to translate
    from one language to another.
  • 33:03 - 33:08
    And Jens, my colleague,
    he's always talking about
  • 33:08 - 33:11
    the European Union
    looking for a translator
  • 33:11 - 33:17
    who can translate from
    I think it was Maltese to Swedish--
  • 33:17 - 33:19
    - (person 8) Estonian.
    - Estonian.
  • 33:22 - 33:26
    And that is not a usual combination.
  • 33:27 - 33:32
    But once you have all these languages
    in one machine-readable place,
  • 33:32 - 33:33
    you can do that,
  • 33:33 - 33:37
    you can get a dictionary
  • 33:37 - 33:42
    from Estonian to Maltese and back.
  • 33:43 - 33:46
    So covering language
    combinations in dictionaries
  • 33:46 - 33:48
    that just haven't been covered before
  • 33:48 - 33:51
    because there wasn't
    enough demand for it, for example,
  • 33:51 - 33:56
    to make it financially viable
    and to justify the work.
  • 33:56 - 33:57
    Now we can do that.
  • 34:00 - 34:02
    Then text generation.
  • 34:02 - 34:04
    Lucie was earlier talking
  • 34:04 - 34:10
    about how she's working
    with Hattie on generating text
  • 34:10 - 34:15
    to get Wikipedia articles
    in minority languages started,
  • 34:15 - 34:20
    and that needs data about words,
  • 34:20 - 34:23
    and you need to understand
    the language to do that.
  • 34:24 - 34:28
    Yeah, and those are just some
    that come to my mind right now.
  • 34:29 - 34:30
    Maybe our audience has more ideas
  • 34:30 - 34:34
    what they want to do
    when we have all the glorious data.
  • 34:38 - 34:41
    (person 9) Okay, I will deviate
    from the lexemes topic.
  • 34:41 - 34:43
    I will ask the question,
  • 34:43 - 34:46
    how can I, as a member of the community,
  • 34:46 - 34:50
    influence that priority is put on the task
  • 34:50 - 34:57
    that a new user comes, and he can indicate
    what languages he wants to see and edit
  • 34:57 - 35:01
    without some secret Babel
    template knowledge.
  • 35:02 - 35:05
    Maybe this year there will be
    this technical wish list
  • 35:05 - 35:07
    without Wikipedia topics.
  • 35:07 - 35:10
    Maybe there's a hope
    we can all vote about
  • 35:10 - 35:14
    this thing we didn't fix for seven years.
  • 35:14 - 35:18
    So do you have any ideas
    and comments about this?
  • 35:18 - 35:20
    So you're talking about the fact
  • 35:20 - 35:24
    that someone who is
    not logged into Wikidata
  • 35:24 - 35:26
    can't change their language easily?
  • 35:26 - 35:28
    (person 9) No, for [inaudible] users.
  • 35:28 - 35:31
    So, if they are logged in,
  • 35:31 - 35:35
    they can just change their language
    at the top of the page,
  • 35:36 - 35:38
    and then it will appear
  • 35:40 - 35:42
    where the labels and descriptions
    [inaudible] are,
  • 35:42 - 35:43
    and they can edit it.
  • 35:46 - 35:49
    (person 9) Well, actually, usually
    many times the workflow
  • 35:49 - 35:52
    is that you want to have
    multiple languages available,
  • 35:52 - 35:55
    and it's not always the case.
  • 35:55 - 35:59
    Okay, maybe we should sit down
    after this talk and you show me.
  • 36:02 - 36:04
    Cool. More questions?
  • 36:06 - 36:07
    Yes.
  • 36:12 - 36:13
    (person 10) Thanks for the presentation.
  • 36:14 - 36:15
    Can you comment
  • 36:15 - 36:19
    on the state of the collaboration
    with the Wiktionary community?
  • 36:19 - 36:22
    As far as I've seen,
    there were some discussions
  • 36:22 - 36:26
    about importing some elements of the work,
  • 36:26 - 36:31
    but there seems to be licensing issues
    and some disagreements, et cetera.
  • 36:31 - 36:32
    Right.
  • 36:32 - 36:36
    So, Wiktionary communities
    have spent a lot of time
  • 36:37 - 36:39
    building Wiktionary.
  • 36:39 - 36:43
    They have built
  • 36:43 - 36:48
    amazingly complicated
    and complex templates
  • 36:48 - 36:54
    to build pretty tables
    that automatically generate forms for you
  • 36:54 - 36:56
    and all kinds of really impressive,
  • 36:56 - 37:01
    and kind of crazy stuff,
    if you think about it.
  • 37:02 - 37:08
    And, of course, they have invested
    a lot of time and effort into that.
  • 37:09 - 37:12
    And understandably,
  • 37:12 - 37:17
    they don't just want that to be grabbed,
  • 37:18 - 37:19
    just like that.
  • 37:19 - 37:22
    So there's some of that coming from there.
  • 37:23 - 37:25
    And that's fine, that's okay.
  • 37:26 - 37:32
    Now, the first Wiktionary communities
    are talking about turning around
  • 37:32 - 37:34
    and importing some
    of their data into Wikidata.
  • 37:34 - 37:39
    Russian, you have seen,
    for example, is one of those cases.
  • 37:40 - 37:42
    And I expect more of that to happen.
  • 37:44 - 37:47
    But it will be a slow process,
  • 37:47 - 37:49
    just like adoption
    of Wikidata's data on Wikipedia
  • 37:49 - 37:52
    has been a rather slow process.
  • 37:53 - 37:56
    On the other side, we're working
    on making it actually easier
  • 37:56 - 37:59
    to use the data that is in lexemes,
  • 37:59 - 38:02
    on Wiktionary, so that
    they can make use of that
  • 38:02 - 38:06
    and share data between
    the language Wiktionaries,
  • 38:06 - 38:09
    which is super hard
    to impossible right now,
  • 38:09 - 38:12
    which is crazy,
    just like it was on Wikipedia.
  • 38:14 - 38:16
    Wait for the birthday present. (chuckles)
  • 38:20 - 38:21
    Yes.
  • 38:23 - 38:25
    (person 11) I was thinking about it
    the other way around,
  • 38:25 - 38:28
    I actually didn't want to say it
    because I think this will be super silly,
  • 38:28 - 38:32
    but I think that Wiktionary
    already has some content,
  • 38:32 - 38:35
    and I know that
    we can't transfer it to Wikidata
  • 38:35 - 38:37
    because there's a difference in licenses.
  • 38:37 - 38:40
    But I was thinking maybe
    we can do something about that.
  • 38:40 - 38:46
    Maybe, I don't know, we can obtain
    the communities' permission
  • 38:46 - 38:51
    after like, I don't know,
    having like a public voting
  • 38:52 - 38:56
    and for the community,
    the active members of the community
  • 38:56 - 39:03
    to vote and say if they would like
    or accept to transfer the content
  • 39:03 - 39:06
    from which they may build
    the Wikidata lexemes.
  • 39:06 - 39:09
    Because I just think it is such a waste.
  • 39:10 - 39:14
    So, that's definitely
    a conversation those people
  • 39:14 - 39:18
    who are in Wiktionary communities
    are very welcome to bring up there.
  • 39:18 - 39:25
    I think it would be a bit presumptuous
    for us to go and force that.
  • 39:26 - 39:31
    But, yeah, I think it's definitely worth
    having a conversation.
  • 39:31 - 39:34
    But I think it's also important
    to understand
  • 39:34 - 39:39
    that there's a distinction between
    what is actually legally allowed
  • 39:39 - 39:43
    and what we should be doing
  • 39:43 - 39:45
    and what those people want or do not want.
  • 39:46 - 39:47
    So even if it's legally allowed,
  • 39:47 - 39:51
    if some other Wiktionary communities
    do not want that,
  • 39:51 - 39:54
    I would be careful, at least.
  • 39:59 - 40:02
    I think you need the mic
    for the stream.
  • 40:05 - 40:07
    (person 12) So, obviously,
    it's all very exciting,
  • 40:08 - 40:12
    and I immediately think
    how can I take that to my students
  • 40:12 - 40:16
    and how can I incorporate it
    with the courses,
  • 40:16 - 40:19
    the work that we're doing,
    educational settings.
  • 40:19 - 40:22
    And I don't have, at the moment,
  • 40:23 - 40:24
    first of all, enough knowledge,
  • 40:24 - 40:27
    but I think the documentation
    that we do have
  • 40:28 - 40:30
    could be maybe improved.
  • 40:30 - 40:33
    So that's a kind of request
    to make cool videos
  • 40:33 - 40:36
    that explain how it works
  • 40:36 - 40:40
    because if we have it, we can then use it,
  • 40:40 - 40:42
    and we can have students on board,
  • 40:42 - 40:47
    and we can make people understand
    how awesome it all is.
  • 40:47 - 40:52
    And yeah, just think about documentation
    and think about education, please.
  • 40:52 - 40:54
    Because I think a lot could be done.
  • 40:54 - 40:59
    These are like many tasks
    that could be done even with...
  • 41:00 - 41:02
    well, I wouldn't say primary schools,
  • 41:02 - 41:05
    but certainly, even younger students.
  • 41:06 - 41:11
    And so I would really like to see
    that potential being tapped into,
  • 41:11 - 41:15
    and, as of now, I personally
    don't understand enough
  • 41:15 - 41:20
    to be able to create tasks
    or to create like...
  • 41:20 - 41:22
    to do something practical with it.
  • 41:22 - 41:26
    So any help, any thoughts
    anyone here has about that,
  • 41:26 - 41:30
    I would be very happy to hear
    your thoughts, and yours as well.
  • 41:31 - 41:32
    Yeah, let's talk about that.
  • 41:35 - 41:37
    More questions?
  • 41:38 - 41:39
    Someone else raised a hand.
  • 41:39 - 41:40
    I forgot where it was.
  • 41:46 - 41:50
    (person 13) So, if we can't import
    from Wiktionary,
  • 41:50 - 41:56
    is there some concerted effort
    to find other public domain sources,
  • 41:56 - 41:57
    maybe all the data,
  • 41:59 - 42:03
    and kind of prefilter it, organize it
  • 42:03 - 42:08
    so that it's easy to be checked
    by people for import?
  • 42:09 - 42:11
    So there are first efforts.
  • 42:11 - 42:15
    My understanding is that Basque
    is one of those efforts.
  • 42:15 - 42:17
    Maybe you want to say
    a bit more about it?
  • 42:18 - 42:20
    (person 14) [inaudible]
  • 42:23 - 42:27
    Okay, the actual answer
    is paying for that...
  • 42:28 - 42:33
    I mean, we have an agreement
    with a contractor we usually work with.
  • 42:35 - 42:39
    They do dictionaries--
  • 42:40 - 42:42
    lots of stuff, but they do dictionaries.
  • 42:42 - 42:47
    So we agreed with them
    to make the students' dictionary free,
  • 42:47 - 42:53
    we would [cast] the most common words
    and start uploading it
  • 42:53 - 42:56
    with an external identifier
    and the scheme of things.
  • 42:56 - 43:03
    But there was some discussion
    about leaving it on CC0
  • 43:03 - 43:05
    because they have
    the dictionary under CC BY,
  • 43:07 - 43:10
    and they understood
    what the difference was.
  • 43:10 - 43:14
    So there was some discussion.
  • 43:14 - 43:20
    But I think that we can provide some tools
    or some examples in the future,
  • 43:20 - 43:22
    and I think that there will be
    other dictionaries
  • 43:22 - 43:24
    that we can handle,
  • 43:24 - 43:29
    and also I think Wiktionary
    should start moving in that direction,
  • 43:29 - 43:32
    but that's another great discussion.
  • 43:33 - 43:34
    And on top of that,
  • 43:34 - 43:39
    Lea is also in contact
    with people from Occitan
  • 43:39 - 43:42
    who work on Occitan dictionaries,
  • 43:42 - 43:45
    and they're currently working
    on a Sumerian collaboration.
  • 43:52 - 43:53
    More questions?
  • 44:01 - 44:05
    (person 15) Hi! We are the people
    who want to import Occitan data.
  • 44:05 - 44:07
    Aha! Perfect!
  • 44:07 - 44:08
    (person 15) And we have a small problem.
  • 44:09 - 44:14
    We don't know how to represent
    the variety of all lexemes.
  • 44:14 - 44:18
    We have six dialects,
  • 44:18 - 44:24
    and we want to indicate for a lexeme
    in which dialect it's used,
  • 44:24 - 44:27
    and we don't have a proper
    statement to do that.
  • 44:27 - 44:31
    So as long as the statement doesn't exist,
  • 44:32 - 44:34
    it prevents us from [inaudible]
  • 44:34 - 44:38
    because we will need to do it again
  • 44:38 - 44:42
    when we will be able
    to [export] the statement.
  • 44:42 - 44:45
    And it's complicated
    because it's a statement
  • 44:45 - 44:48
    which won't be asked by many people
  • 44:48 - 44:53
    because it's a statement
    which concerns mostly minority languages.
  • 44:53 - 44:57
    So you will have one person asking for this.
  • 44:57 - 45:00
    But as with our Basque colleagues,
  • 45:00 - 45:06
    it can be one person
    who will empower thousands of others,
  • 45:06 - 45:11
    so it might not be asking a lot,
  • 45:11 - 45:14
    but it will be very important for us.
  • 45:15 - 45:18
    Do you already have
    a new property proposal up,
  • 45:18 - 45:19
    or do you need help creating it?
  • 45:22 - 45:24
    (person 15) We asked four months ago.
  • 45:25 - 45:29
    Alright, then let's get some people
    to help out with this property proposal.
  • 45:30 - 45:33
    I'm sure there are enough people
    in this room to make this happen.
  • 45:33 - 45:35
    (person 15) Property proposal
    [speaking in French].
  • 45:35 - 45:37
    (person 16) We didn't have an answer.
  • 45:37 - 45:40
    (person 15) We didn't have any answer,
    and we don't know how to do this
  • 45:40 - 45:43
    because we aren't
    in the Wikidata community.
  • 45:45 - 45:49
    Yup, so there are people here
    who can help you.
  • 45:49 - 45:52
    Maybe someone raises their hand to take--
  • 45:53 - 45:54
    (person 14) I'm for that.
  • 45:54 - 45:56
    But I think this is quite interesting
  • 45:56 - 45:59
    that not only the variant of a form--
  • 45:59 - 46:03
    we can also handle it geographically,
  • 46:03 - 46:05
    with coordinates or some kind of mapping.
  • 46:06 - 46:08
    Also having different pronunciations,
  • 46:08 - 46:12
    and I think this is something
    that happens in lots of languages.
  • 46:13 - 46:16
    We should start making
    it happen [inaudible],
  • 46:16 - 46:19
    and I'm going to search for the property.
  • 46:20 - 46:21
    Cool.
  • 46:21 - 46:24
    So you will get backing
    for your property proposal.
  • 46:26 - 46:27
    Thank you.
  • 46:28 - 46:30
    Alright, more questions?
  • 46:32 - 46:33
    Finn.
  • 46:34 - 46:35
    Finn is one of those people
  • 46:35 - 46:38
    who builds stuff
    on top of lexicographical data.
  • 46:38 - 46:40
    (Finn) It's just a small question,
  • 46:40 - 46:44
    and that's about spelling variations.
  • 46:45 - 46:48
    It seems to be difficult to put them in...
  • 46:49 - 46:53
    You could, of course,
    have multiple forms for the same word.
  • 46:56 - 46:58
    I don't know, it seems to be...
  • 47:00 - 47:04
    If you don't do it that way,
    it seems to be difficult to specify...
  • 47:05 - 47:06
    or I don't know whether
  • 47:06 - 47:10
    this is just a minor technical issue
    or whether...
  • 47:10 - 47:11
    Let's look at it together.
  • 47:12 - 47:15
    I would love to see an example.
  • 47:17 - 47:18
    Asaf.
  • 47:27 - 47:28
    (Asaf) Thank you.
  • 47:29 - 47:34
    I can give a very concrete example
    from my mother tongue, Hebrew.
  • 47:34 - 47:39
    Hebrew has two main variants
  • 47:39 - 47:43
    for expressing almost every word
  • 47:43 - 47:48
    because the traditional spelling
  • 47:48 - 47:50
    leaves out many of the vowels.
  • 47:51 - 47:55
    And, therefore, in modern editions
    of the Bible and of poetry,
  • 47:55 - 47:57
    diacritics are used.
  • 47:57 - 48:03
    However, those diacritics
    are never used for modern prose
  • 48:03 - 48:06
    or newspaper writing or street signs.
  • 48:06 - 48:11
    So the average daily casual use
    puts in extra vowels
  • 48:12 - 48:14
    and doesn't use the diacritics
  • 48:14 - 48:16
    because they are,
    of course, more cumbersome
  • 48:16 - 48:18
    and have all kinds of rules
    and nobody knows the rules.
  • 48:19 - 48:21
    So there are basically two variants.
  • 48:21 - 48:25
    There's the everyday casual prose variant,
  • 48:25 - 48:28
    and there's the Bible or poetry,
  • 48:28 - 48:32
    which always come
    in this traditional diacriticized text.
  • 48:32 - 48:33
    To be useful,
  • 48:33 - 48:37
    Lexeme would have to recognize
    both varieties of every single word
  • 48:37 - 48:40
    and every single form
    of every single word.
  • 48:41 - 48:43
    So that's a very comprehensive use case
  • 48:43 - 48:46
    for official stable variants.
  • 48:46 - 48:49
    It's not dialect, it's not regions,
  • 48:49 - 48:54
    it's basically two coexisting
    morphological systems.
  • 48:55 - 48:59
    And I too don't know exactly
    how to express that in Lexeme today,
  • 48:59 - 49:03
    which is one thing that is keeping me--
    in partial answer to Magnus' question--
  • 49:03 - 49:05
    from uploading the parts that are ready
  • 49:05 - 49:09
    from the biggest Hebrew dictionary,
    which is public domain
  • 49:09 - 49:13
    and which I have been digitizing
    for several years now.
  • 49:13 - 49:15
    A good portion of it is ready,
  • 49:15 - 49:17
    but I'm not putting it on Lexeme right now
  • 49:17 - 49:20
    because I don't know exactly
    how to solve this problem.
  • 49:20 - 49:23
    Alright, let's solve
    this problem here. (chuckles)
  • 49:25 - 49:26
    That has to be possible.
  • 49:30 - 49:32
    Alright, more questions?
  • 49:37 - 49:40
    If not, then thank you so much.
  • 49:41 - 49:43
    (applause)
Title:
cdn.media.ccc.de/.../wikidatacon2019-2-eng-Wikidata_and_languages_hd.mp4
Video Language:
English
Duration:
49:51
