
cdn.media.ccc.de/.../wikidatacon2019-1060-eng-New_usages_of_Wikidata_to_support_underserved_language_communities_hd.mp4

  • 0:06 - 0:07
    Hi, I'm Lucie.
  • 0:07 - 0:12
    You know me from rambling about
    not enough language data in Wikidata,
  • 0:12 - 0:16
    and I thought instead of rambling today,
    which I'll leave to Lydia later today,
  • 0:17 - 0:20
    I'll just show you a bit, or give you
    an insight on the projects we did
  • 0:20 - 0:25
    using the data that we already have
    on Wikidata, for different causes.
  • 0:25 - 0:29
    So underserved languages
    compared to the keynote we just heard
  • 0:29 - 0:33
    where the person was talking about
    underserved as like minority languages,
  • 0:33 - 0:36
    underserved languages, to me,
    are any languages
  • 0:36 - 0:39
    that don't have
    enough representation on the web.
  • 0:39 - 0:41
    Yeah, just to get that clear.
  • 0:41 - 0:43
    So, who am I?
  • 0:43 - 0:46
    Why am I always talking
    about languages on Wikidata?
  • 0:46 - 0:48
    Not sure but...
  • 0:48 - 0:50
    I'm a Computer Science PhD student
  • 0:50 - 0:52
    at the University of Southampton.
  • 0:52 - 0:55
    I'm a research intern
    at Bloomberg in London, at the moment.
  • 0:55 - 0:58
    I'm a resident
    at Newspeak House in London.
  • 0:58 - 1:02
    I am a researcher and project manager
    for the Scribe project,
  • 1:02 - 1:03
    which I'll go into in a bit,
  • 1:03 - 1:09
    and I recently got into the idea
    of oral knowledge and oral citation.
  • 1:09 - 1:10
    Kimberly is sitting right there.
  • 1:11 - 1:13
    And then, occasionally,
    I have time to sleep
  • 1:13 - 1:16
    and do other things, but that's very rare.
  • 1:17 - 1:19
    So if you're interested
    in any of those things,
  • 1:19 - 1:20
    come and talk to me.
  • 1:20 - 1:23
    Generally, this is an open presentation,
    so feel free to ask questions in between.
  • 1:23 - 1:27
    I'll run through a lot of things
    in a very short time now.
  • 1:27 - 1:30
    Come to me afterwards
    if you're interested in any of them.
  • 1:31 - 1:32
    Speak to me. I'm here.
  • 1:32 - 1:35
    I'm always very happy to speak to people.
  • 1:35 - 1:39
    So that's a bit of what
    we will talk about today.
  • 1:39 - 1:41
    So Wikidata, giving an introduction,
  • 1:41 - 1:44
    even though that's obviously
    not as necessary.
  • 1:45 - 1:48
    The article placeholder,
    which is aimed at Wikipedia readers,
  • 1:48 - 1:51
    then Scribe, which is aimed
    at Wikipedia editors,
  • 1:51 - 1:54
    and then we have one topic of my research,
  • 1:54 - 1:57
    which is completely outside of Wikipedia
  • 1:57 - 2:00
    where we use Wikidata
    for question answering.
  • 2:02 - 2:04
    So just a quick rerun.
  • 2:04 - 2:07
    Why is Wikidata so cool
    for low-resource languages
  • 2:07 - 2:11
    where we have those unique identifiers?
  • 2:11 - 2:13
    I'm speaking to people that know that
  • 2:13 - 2:15
    much better than me even.
  • 2:15 - 2:18
    And then we have labels
    in different languages.
  • 2:18 - 2:22
    Those can be in over,
    I think, 400 languages by now,
  • 2:22 - 2:24
    so we have a good option here
  • 2:24 - 2:28
    to reuse language
    in different forms and capture it.
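As an aside (not part of the talk), here is a minimal sketch of what those multilingual labels look like programmatically, using the public Wikidata API; the item Q42 (Douglas Adams) is just an arbitrary example.

```python
import requests

# Minimal sketch: fetch every label attached to one Wikidata item via the
# public wbgetentities API. Q42 (Douglas Adams) is an arbitrary example item.
def get_labels(qid: str) -> dict[str, str]:
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": qid,
            "props": "labels",
            "format": "json",
        },
    )
    response.raise_for_status()
    labels = response.json()["entities"][qid]["labels"]
    # Map language code -> label, e.g. {"en": "Douglas Adams", "ar": ..., ...}
    return {lang: entry["value"] for lang, entry in labels.items()}

if __name__ == "__main__":
    labels = get_labels("Q42")
    print(f"Q42 has labels in {len(labels)} languages")
```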
  • 2:29 - 2:33
    Yeah, so that's a little bit of me
    rambling about Wikidata
  • 2:33 - 2:35
    because I can't stop it.
  • 2:35 - 2:37
    We compared language coverage in Wikidata
    to the number of native speakers,
  • 2:37 - 2:39
    so we can see, obviously,
  • 2:39 - 2:42
    there are languages
    that are widely spoken in the world.
  • 2:42 - 2:44
    There's Chinese, Hindi, or Arabic,
  • 2:44 - 2:47
    but then very low coverage on Wikidata.
  • 2:48 - 2:50
    Then the opposite.
  • 2:50 - 2:53
    So we have the Dutch
    and the Swedish communities,
  • 2:53 - 2:55
    which were super active in Wikidata,
  • 2:55 - 2:58
    which is really cool,
    and that just points out
  • 2:58 - 3:01
    that even though we have
    a low number of speakers,
  • 3:01 - 3:07
    we can have a big impact if people
    are very active in the communities,
  • 3:07 - 3:09
    which is really nice and really good.
  • 3:09 - 3:14
    But also, let's try to even
    that graph out in the future.
  • 3:15 - 3:19
    So, cool. So now we have
    all this language data in Wikidata.
  • 3:19 - 3:22
    We have low-resource Wikipedias,
    so we thought, what can we do?
  • 3:22 - 3:27
    Well, my undergrad supervisor
    is sitting here,
  • 3:27 - 3:31
    and we worked back then
    in the golden days,
  • 3:31 - 3:34
    on something called
    the article placeholder
  • 3:35 - 3:39
    which takes triples from Wikidata
    and displays them on Wikipedia.
  • 3:39 - 3:42
    And that's
    relatively straightforward.
  • 3:42 - 3:46
    So you just take the content of Wikidata,
    display it on Wikipedia
  • 3:46 - 3:49
    to attract more readers
    and then eventually more editors
  • 3:49 - 3:51
    in the different low-resource languages.
  • 3:51 - 3:53
    They are dynamically generated,
  • 3:53 - 3:56
    so they're not like stubs or bot articles
  • 3:56 - 4:00
    that then flood the Wikipedia
    so people can edit them.
  • 4:00 - 4:02
    It's basically a starting point.
  • 4:02 - 4:05
    And we thought,
    well, we have that content,
  • 4:05 - 4:09
    and we have that knowledge
    somewhere already, which is Wikidata.
  • 4:09 - 4:12
    The information often already exists in those languages,
    even if they don't have articles,
  • 4:12 - 4:15
    so we can at least give readers
    an insight into the information.
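To make the idea concrete, here is an illustrative sketch, not the ArticlePlaceholder extension's actual code, of the kind of lookup behind such a placeholder: pulling an item's statements with labels in one target language. The item Q42 and the language code "eo" are stand-in examples.

```python
import requests

# Illustrative only, not the ArticlePlaceholder extension's implementation:
# fetch an item's direct statements with property and value labels in one
# target language (Esperanto here), the raw material a placeholder page shows.
QUERY = """
SELECT ?propertyLabel ?valueLabel WHERE {
  wd:Q42 ?claim ?value .
  ?property wikibase:directClaim ?claim .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "eo". }
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "article-placeholder-sketch/0.1 (example)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    # Entity values come back with their Esperanto label; literals as-is.
    print(row["propertyLabel"]["value"], "->", row["valueLabel"]["value"])
```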
  • 4:15 - 4:19
    The article placeholders are live
    on 14 low-resource Wikipedias.
  • 4:20 - 4:22
    If you are a Wikipedia community,
  • 4:22 - 4:25
    if you are part of a Wikipedia community
    and interested in it,
  • 4:25 - 4:26
    let us know.
  • 4:28 - 4:30
    And then I went into research,
  • 4:30 - 4:33
    and I got stuck with
    the article placeholder, though,
  • 4:33 - 4:36
    so we started to look into
    text generation from Wikidata
  • 4:36 - 4:38
    for Wikipedia and low-resource languages.
  • 4:38 - 4:40
    And text generation is really interesting
  • 4:40 - 4:43
    because research, at the point
    when we started the project,
  • 4:43 - 4:46
    was focused entirely on English,
  • 4:46 - 4:49
    which is a bit pointless in my experience
  • 4:49 - 4:51
    because, I mean, you have a lot of people
    who write in English,
  • 4:51 - 4:55
    but then what we need is people
    who write in those low-resource languages.
  • 4:55 - 4:59
    And our starting point was that,
    looking at triples on Wikipedia
  • 4:59 - 5:02
    is not exactly the nicest thing.
  • 5:02 - 5:04
    I mean, as much as I love
    the article placeholder,
  • 5:04 - 5:06
    it's not exactly
    what you want to see or expect
  • 5:06 - 5:08
    when you open a Wikipedia page.
  • 5:08 - 5:10
    So we try to generate text.
  • 5:10 - 5:12
    We use this beautiful
    neural network model,
  • 5:12 - 5:13
    where we encode Wikidata triples.
  • 5:13 - 5:16
    If you're interested more
    in the technical parts,
  • 5:16 - 5:17
    come and talk to me.
  • 5:17 - 5:22
    And so, realistically,
    with neural text generation,
  • 5:22 - 5:24
    you can generate one or two sentences
  • 5:24 - 5:28
    before it completely scrambles
    and becomes useless.
  • 5:28 - 5:33
    So we've generated one sentence
    that describes the topic of the triple.
  • 5:33 - 5:36
    And so this, for example, is Arabic.
  • 5:36 - 5:39
    We generate the sentence about Marrakesh,
  • 5:39 - 5:41
    where it just describes the city.
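Purely as an illustration of the setup, and not the model from the talk, the input side of such a generator typically linearises the triples into a token sequence, with out-of-vocabulary words mapped to a placeholder like the "rare" token mentioned later in the talk. Everything here, including the tiny vocabulary, is made up.

```python
# Toy illustration, not the actual model: linearise triples about Marrakesh
# into a token sequence a sequence-to-sequence encoder could consume,
# replacing out-of-vocabulary words with a <rare> placeholder token.
VOCAB = {"Marrakesh", "instance", "of", "city", "country", "Morocco"}

def linearise(triples: list[tuple[str, str, str]]) -> list[str]:
    tokens = []
    for subject, predicate, obj in triples:
        for part in (subject, predicate, obj):
            tokens += [w if w in VOCAB else "<rare>" for w in part.split()]
        tokens.append("<sep>")  # marks the boundary between triples
    return tokens

print(linearise([
    ("Marrakesh", "instance of", "city"),
    ("Marrakesh", "country", "Morocco"),
    ("Marrakesh", "named after", "Almoravid dynasty"),  # unknown words -> <rare>
]))
```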
  • 5:42 - 5:46
    So for that, then, we tested this--
  • 5:46 - 5:49
    So we did studies, obviously,
    to test if our approach works,
  • 5:49 - 5:52
    and if it makes sense to use such things.
  • 5:52 - 5:56
    And because we are
    very application-focused,
  • 5:56 - 5:59
    we tested it with actual
    Wikipedia readers and editors.
  • 5:59 - 6:01
    So, first, we tested it
    with Wikipedia readers
  • 6:01 - 6:03
    in Arabic and Esperanto--
  • 6:03 - 6:06
    so use cases with Arabic and Esperanto.
  • 6:08 - 6:13
    And we can see that our model
    can generate sentences
  • 6:13 - 6:14
    that are very fluent
  • 6:14 - 6:18
    and that feel very much--
    surprisingly, a lot, actually--
  • 6:18 - 6:20
    like Wikipedia sentences.
  • 6:20 - 6:23
    So it picks up the style. For example,
    for Arabic,
  • 6:23 - 6:26
    we train only on Arabic, with the idea to say
  • 6:26 - 6:30
    we want to keep
    the cultural context of that language
  • 6:30 - 6:33
    and not let it be influenced
  • 6:33 - 6:35
    by other languages
    that have higher coverage.
  • 6:36 - 6:38
    And then we did a study
    with Wikipedia editors
  • 6:38 - 6:41
    because in the end the article placeholder
    is just a starting point
  • 6:41 - 6:43
    for people to start editing,
  • 6:43 - 6:44
    and we try to measure
  • 6:44 - 6:46
    how much of the sentences
    they would reuse.
  • 6:46 - 6:49
    How much is useful for them, basically,
  • 6:49 - 6:51
    and you can see
    that there is a high rate of reuse,
  • 6:51 - 6:55
    especially in Esperanto
    when we test with editors.
  • 6:56 - 7:01
    And finally, we did also
    qualitative interviews
  • 7:01 - 7:05
    with Wikipedia editors
    across six languages.
  • 7:05 - 7:08
    I think we had
    about ten people we interviewed.
  • 7:09 - 7:12
    And we tried to get
    more of an understanding
  • 7:12 - 7:15
    of the human perspective
    on those generated sentences.
  • 7:15 - 7:18
    So now we have
    a very quantified way of saying,
  • 7:18 - 7:19
    yeah, they are good,
  • 7:19 - 7:21
    but we wanted to see
  • 7:21 - 7:23
    how the interaction goes,
  • 7:21 - 7:23
    especially with what
    always happens
  • 7:23 - 7:26
    in neural machine translation
    and neural text generation:
  • 7:26 - 7:30
    you have those missing word tokens,
    which we mark as "rare" in there.
  • 7:34 - 7:39
    So those are the example sentences we used.
    All of them are about Marrakesh.
  • 7:39 - 7:42
    So we wanted to see how much
    are people bothered by it,
  • 7:42 - 7:43
    what's the quality,
  • 7:43 - 7:45
    what are the things
    that stand out to them,
  • 7:45 - 7:50
    and we can see that the mistakes
    by the network, like those red tokens,
  • 7:50 - 7:51
    are often just ignored.
  • 7:53 - 7:56
    There is this interesting factor
    that because we didn't tell them
  • 7:56 - 8:01
    where this happens,
    where we got the sentences from--
  • 8:01 - 8:04
    because it was on a user page of mine
  • 8:04 - 8:06
    but it looked like it was on a Wikipedia,
  • 8:06 - 8:07
    people just trusted it.
  • 8:07 - 8:09
    And I think that's very important
  • 8:09 - 8:13
    when we look into those kinds
    of research directions:
  • 8:13 - 8:16
    we cannot undermine
    this trust in Wikipedia.
  • 8:16 - 8:20
    So if we work with Wikipedians
    and Wikipedia itself,
  • 8:20 - 8:23
    if we take things from,
    for example, Wikidata,
  • 8:23 - 8:26
    that's good
    because it's also human-curated.
  • 8:26 - 8:31
    But when we start
    with artificial intelligence projects,
  • 8:31 - 8:35
    we have to be really careful
    what we actually expose people to,
  • 8:35 - 8:38
    because they just trust
    the information that we give them.
  • 8:39 - 8:43
    So we could see, for example,
    in the Arabic version,
  • 8:43 - 8:45
    it gave the wrong location for Marrakesh,
  • 8:45 - 8:48
    and people, even the people I interviewed
  • 8:48 - 8:50
    that were living in Marrakesh
    didn't pick up on that,
  • 8:50 - 8:54
    because it's on Wikipedia,
    so it should be fine, right?
  • 8:54 - 8:55
    (chuckles)
  • 8:55 - 8:56
    Yeah.
  • 8:58 - 9:01
    We found there was a magical threshold
    for the length of the generated text,
  • 9:01 - 9:02
    so that's something we found,
  • 9:02 - 9:05
    especially in comparison
    with the content translation tool,
  • 9:05 - 9:08
    where you have a long
    automatically generated text,
  • 9:08 - 9:12
    and people were complaining
    that content translation was very hard
  • 9:12 - 9:16
    because you're just doing post-editing,
    you don't have the creativity.
  • 9:16 - 9:19
    There are other remarks
    on content translation I usually make--
  • 9:19 - 9:21
    I'll skip them for now.
  • 9:22 - 9:25
    So that one sentence was helpful
  • 9:25 - 9:30
    because even if we made mistakes,
    people were still willing to fix them
  • 9:30 - 9:34
    because it's a very short
    intervention.
  • 9:34 - 9:38
    And then, finally,
    a lot of people pointed out,
  • 9:38 - 9:40
    that it was particularly good
    for a new editor,
  • 9:40 - 9:42
    so for them to have a starting point,
  • 9:42 - 9:44
    to have those triples, to have a sentence,
  • 9:44 - 9:46
    so they have something to start from.
  • 9:46 - 9:49
    So after all those interviews were done,
  • 9:49 - 9:52
    I thought, that's very interesting.
  • 9:52 - 9:54
    What else can we do with that knowledge?
  • 9:54 - 9:59
    And so we started a new project,
    exactly because there weren't enough projects yet.
  • 9:59 - 10:02
    And the new project we have
    is called Scribe,
  • 10:02 - 10:07
    and Scribe focuses on new editors
    that want to write a new article,
  • 10:07 - 10:10
    and particularly people
    who haven't written
  • 10:10 - 10:11
    an article on Wikipedia yet,
  • 10:11 - 10:14
    and specifically also
    on low-resource languages.
  • 10:15 - 10:19
    So the idea is that--
    that's the pixel version of me.
  • 10:20 - 10:21
    All my slides are basically
  • 10:21 - 10:24
    references to people in this room,
    which I really love.
  • 10:24 - 10:26
    It feels like I'm home again.
  • 10:27 - 10:31
    So, yeah, I want to write a new article,
  • 10:31 - 10:34
    but I don't know where to start
    as a new editor,
  • 10:34 - 10:37
    and so we have this project Scribe.
  • 10:37 - 10:41
    Scribe is a profession--
    or was the name for someone
  • 10:41 - 10:45
    with the profession of writing
    in ancient Egypt.
  • 10:47 - 10:53
    So the Scribe project's idea
    is that we want to give people, basically,
  • 10:53 - 10:56
    a hand when they start
    writing their first articles.
  • 10:56 - 10:58
    So give them a skeleton,
  • 10:58 - 11:01
    give them a skeleton that's based
    on their language Wikipedia,
  • 11:01 - 11:05
    instead of just translating the content
    from another language Wikipedia.
  • 11:05 - 11:10
    So the first thing we want to do
    is plan section titles,
  • 11:10 - 11:14
    then select references for each section,
  • 11:14 - 11:16
    ideally in the local Wikipedia language,
  • 11:16 - 11:20
    and then summarize those references
    to give a starting point to write.
  • 11:21 - 11:25
    For the project, we have
    a Wikimedia Foundation project grant.
  • 11:25 - 11:28
    So it just started.
  • 11:28 - 11:31
    So we are very open
    to feedback, in general.
  • 11:31 - 11:35
    That was the very first
    not so beautiful layout,
  • 11:35 - 11:37
    but just for you to get an overview.
  • 11:37 - 11:40
    So there is this idea
    of collecting references,
  • 11:40 - 11:43
    images from Commons, section titles.
  • 11:43 - 11:46
    And so the main thing
    we want to use Wikidata for
  • 11:46 - 11:48
    is the sections.
  • 11:48 - 11:51
    So, basically, we want to see
    which articles
  • 11:51 - 11:55
    on similar topics
    already exist in your language,
  • 11:55 - 11:58
    so we can understand
    how the language community
  • 11:58 - 12:02
    decided on structuring articles.
  • 12:02 - 12:06
    And then we look
    for the images, obviously,
  • 12:06 - 12:10
    where Wikidata is also
    a good place to go through.
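A rough sketch of what that Wikidata lookup could look like (illustrative only, not Scribe's actual implementation): find items that share an "instance of" class with the target item and already have an article on the target-language Wikipedia, whose section headings could then be collected. Q42 and Arabic Wikipedia are stand-in examples here.

```python
import requests

# Illustrative sketch only, not Scribe's implementation: find items that share
# an "instance of" (P31) class with a target item and already have an article
# on one language Wikipedia, so their section headings could be collected.
QUERY = """
SELECT DISTINCT ?article WHERE {
  wd:Q42 wdt:P31 ?class .          # classes of the target item
  ?similar wdt:P31 ?class .        # other items of the same class
  ?article schema:about ?similar ;
           schema:isPartOf <https://ar.wikipedia.org/> .
}
LIMIT 50
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "scribe-sketch/0.1 (example)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["article"]["value"])  # URLs of comparable Arabic Wikipedia articles
```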
  • 12:13 - 12:16
    And then we made
    a prettier interface for it
  • 12:16 - 12:18
    because we decided to go mobile first.
  • 12:18 - 12:21
    So most of the communities
    that we aim to work with
  • 12:21 - 12:25
    are very heavy on mobile editing.
  • 12:25 - 12:30
    And so we do this mobile-first focus.
  • 12:30 - 12:34
    And then, it also forces us
    to break the process down into steps
  • 12:34 - 12:37
    which eventually will lead to,
    yeah, I don't know,
  • 12:37 - 12:39
    a step-by-step guide
    on how to write a new article.
  • 12:39 - 12:43
    So an editor comes,
    they can select section headers
  • 12:43 - 12:47
    based on existing articles
    in their language,
  • 12:47 - 12:49
    write one section at a time,
  • 12:49 - 12:54
    switch between the sections,
    and select references for each section.
  • 12:56 - 12:59
    Yeah, so the idea is that
    we will have an easier editing experience,
  • 12:59 - 13:01
    especially for new editors,
  • 13:01 - 13:05
    to keep them in--
    integrate Wikidata information
  • 13:05 - 13:08
    and [inaudible] images
    from Wikimedia Commons as well.
  • 13:10 - 13:12
    If you're interested in Scribe,
  • 13:12 - 13:15
    I'm working together
    on this project with Hady.
  • 13:15 - 13:19
    There are a lot of things online,
  • 13:19 - 13:23
    but then also just come and talk to us.
  • 13:23 - 13:26
    Also, if you're editing
    a low-resource Wikipedia,
  • 13:26 - 13:29
    we're still looking
    for people to interview
  • 13:29 - 13:32
    because we're trying to emulate--
  • 13:32 - 13:34
    we're trying to emulate as much as we can
  • 13:34 - 13:37
    what people already experience
    or how they already edit.
  • 13:37 - 13:39
    I'm not big on Wikipedia editing.
  • 13:39 - 13:41
    Also, my native language is German.
  • 13:41 - 13:44
    So I need a lot of input from editors
  • 13:44 - 13:48
    that want to tell me
    what they need, what they want,
  • 13:48 - 13:51
    where they think this project can go.
  • 13:51 - 13:55
    And if you are into Wikidata,
    also come and talk to me, please.
  • 13:56 - 13:58
    Okay, so that's all the projects
  • 13:58 - 14:02
    or most of the projects we did
    inside the Wikimedia world.
  • 14:02 - 14:06
    And I want to give you one
    short overview of what's happening
  • 14:06 - 14:10
    on my end of research,
    around Wikidata as well.
  • 14:14 - 14:16
    So I was part of a project
  • 14:16 - 14:18
    that works a lot with question answering,
  • 14:18 - 14:20
    and I don't know too much
    about question answering,
  • 14:20 - 14:24
    but what I do know a lot about
    is knowledge graphs and multilinguality.
  • 14:24 - 14:26
    So, basically, what we wanted to do
  • 14:26 - 14:30
    is we have a question answering system
    that gets a question from a user,
  • 14:30 - 14:36
    and we wanted to select a knowledge graph
    that can answer the question best.
  • 14:36 - 14:40
    And again, we focused on
    a multilingual question answering system.
  • 14:40 - 14:46
    So if I want to ask something about Bach,
    for example, in Spanish and French--
  • 14:46 - 14:48
    because those are the two languages
    I know best--
  • 14:48 - 14:52
    then which knowledge graph has the data
  • 14:52 - 14:54
    to actually answer those questions?
  • 14:55 - 14:59
    So what we did was we found a method
    to rank knowledge graphs,
  • 15:01 - 15:05
    based on the metadata of the languages
  • 15:05 - 15:08
    that appear in the knowledge graph,
  • 15:08 - 15:10
    split by class.
  • 15:10 - 15:11
    And then, for each class, we look
  • 15:11 - 15:14
    into which languages are covered best,
  • 15:14 - 15:18
    and then, depending on the question,
    we can suggest a knowledge graph.
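To make the idea a bit more tangible, here is a toy sketch of that kind of ranking. It is my own simplification, not the method from the paper, and all figures in it are invented.

```python
# Toy sketch of the ranking idea, not the actual method from the paper:
# given per-graph, per-class counts of entities that carry a label in each
# language, score each knowledge graph for a question's class and language.
LABELLED = {
    # graph -> class -> language -> entities with a label in that language
    "Wikidata": {"composer": {"es": 9_000, "fr": 9_500}},
    "DBpedia":  {"composer": {"es": 3_000, "fr": 4_000}},
}
CLASS_SIZE = {
    "Wikidata": {"composer": 10_000},
    "DBpedia":  {"composer": 5_000},
}

def rank_graphs(question_class: str, language: str) -> list[tuple[str, float]]:
    scores = []
    for graph, classes in LABELLED.items():
        labelled = classes.get(question_class, {}).get(language, 0)
        total = CLASS_SIZE[graph].get(question_class, 0)
        coverage = labelled / total if total else 0.0  # fraction with a label
        scores.append((graph, coverage))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# A Spanish question about a composer (Bach, say) goes to the graph with the
# best Spanish label coverage for that class.
print(rank_graphs("composer", "es"))
```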
  • 15:19 - 15:23
    Of the big knowledge graphs
    we looked into
  • 15:23 - 15:25
    that are well known and widely used,
  • 15:25 - 15:28
    Wikidata covers the most languages
    of all knowledge graphs,
  • 15:28 - 15:32
    and we used a test bed.
  • 15:32 - 15:36
    So we used a benchmark dataset
    called QALD,
  • 15:36 - 15:39
    which we then translated--
    it was originally for DBpedia.
  • 15:39 - 15:42
    We translated it
    for those five knowledge graphs
  • 15:42 - 15:44
    into SPARQL queries.
  • 15:44 - 15:50
    And then we gave that to a crowd
    and looked into which knowledge graph
  • 15:50 - 15:55
    has the best answers
    for each of those SPARQL queries.
  • 15:55 - 15:59
    And overall, the crowd workers
    preferred Wikidata's answers
  • 15:59 - 16:02
    because they are very precise,
  • 16:03 - 16:05
    they are in most of the languages
  • 16:05 - 16:07
    that the others don't cover,
  • 16:08 - 16:11
    and they are not
    as repetitive or redundant
  • 16:11 - 16:12
    as the [inaudible].
  • 16:12 - 16:17
    So just to make a quick recap
    on the whole topic
  • 16:17 - 16:20
    of Wikidata and the future and languages.
  • 16:20 - 16:24
    So we can say that Wikidata
    is already widely used
  • 16:24 - 16:28
    for numerous applications in Wikipedia,
  • 16:28 - 16:30
    and then outside Wikipedia for research.
  • 16:30 - 16:34
    So what I talked about
    is just the things I do research on,
  • 16:34 - 16:36
    but there is still so much more.
  • 16:36 - 16:39
    So there is machine translation
    using knowledge graphs,
  • 16:39 - 16:41
    there is rule mining
    over knowledge graphs,
  • 16:41 - 16:44
    there's entity linking in text.
  • 16:44 - 16:47
    There is so much more research
    happening at the moment,
  • 16:47 - 16:51
    and Wikidata is getting more and more
    popular for these kinds of uses.
  • 16:51 - 16:55
    So I think we are at a very good stage
  • 16:55 - 16:58
    to push and connect the communities.
  • 16:59 - 17:03
    Yeah, to get the best
    from both sides, basically.
  • 17:04 - 17:05
    Thank you very much.
  • 17:05 - 17:08
    If you want to have a look
    at any of those projects,
  • 17:08 - 17:09
    they are there,
  • 17:09 - 17:11
    my slides are in Commons already.
  • 17:11 - 17:15
    If you want to read any of the papers,
    I think all of them are open access.
  • 17:15 - 17:16
    If you can't find any of them,
  • 17:16 - 17:19
    write me an email
    and I'll send it to you immediately.
  • 17:19 - 17:21
    Thank you very much.
  • 17:21 - 17:22
    (applause)
  • 17:26 - 17:28
    (moderator) Okay,
    are there any questions?
  • 17:28 - 17:32
    - (moderator) I'll come around.
    - (person 1) Shall I come to you?
  • 17:35 - 17:36
    (person 1) Hi Lucie, thank you so much,
  • 17:36 - 17:38
    I'm so glad to see
    you taking this forward.
  • 17:38 - 17:41
    Now I'm really curious about Scribe.
  • 17:42 - 17:44
    Is the example here, with a university,
  • 17:44 - 17:46
    the idea that the person says,
  • 17:46 - 17:48
    "This is a university,"
  • 17:48 - 17:49
    and then you go to Wikidata
  • 17:49 - 17:52
    and say, "Oh gosh!
    Universities have places
  • 17:52 - 17:54
    and presidents, and I don't know what,"
  • 17:54 - 17:58
    and that you're using these as the parts
    for telling the person what to do?
  • 17:58 - 18:01
    So, basically, the idea
    is that someone says,
  • 18:01 - 18:03
    "I want to write about Nile University."
  • 18:03 - 18:07
    We look into Nile University's
    Wikidata item,
  • 18:07 - 18:10
    and let's say-- I work a lot with Arabic--
  • 18:10 - 18:13
    so let's say we then go
    in Arabic Wikipedia,
  • 18:13 - 18:17
    so we can make a grid, basically,
  • 18:17 - 18:19
    of all items that are around
    Nile University.
  • 18:19 - 18:23
    So there are also universities,
    there are also universities in Cairo,
  • 18:23 - 18:25
    or there are also universities
    in Egypt, stuff like that,
  • 18:25 - 18:27
    or they have similar topics.
  • 18:27 - 18:33
    So we can look into
    all the similar items on Wikidata,
  • 18:33 - 18:36
    and if they already have
    a Wikipedia entry in Arabic Wikipedia,
  • 18:36 - 18:39
    we can look at the section titles.
  • 18:39 - 18:41
    - (person 1) (gasps)
    - Exactly, and then we can basically derive
  • 18:41 - 18:46
    the most common way
    of writing about a university
  • 18:46 - 18:50
    in Cairo on Arabic Wikipedia.
  • 18:50 - 18:53
    - Yeah, so that's the--
    - (person 1) Thank you, [inaudible].
  • 18:57 - 19:00
    (person 2) Hi, thank you so much
    for your inspiring talk.
  • 19:00 - 19:05
    I was wondering if this would work
    for languages in Incubator?
  • 19:05 - 19:11
    Like, I work with really low,
    low, low, low-resource languages
  • 19:11 - 19:16
    and this thing about doing it mobile
    would be a huge thing,
  • 19:16 - 19:20
    because in many communities
    they only have phones, not laptops.
  • 19:20 - 19:22
    So, would it work?
  • 19:22 - 19:26
    So I think, to an extent--
  • 19:26 - 19:32
    so the general structure, the skeleton
    of the application would work.
  • 19:32 - 19:35
    Two things that we're thinking about
    a lot at the moment
  • 19:35 - 19:37
    for exactly those use cases is,
  • 19:37 - 19:40
    how much would we want,
    for example, to say,
  • 19:40 - 19:45
    if there are no articles
    on a similar topic in your Wikipedia,
  • 19:45 - 19:47
    how much do we want
    to get from other Wikipedias.
  • 19:47 - 19:50
    And that's why I'm basically
    doing those interviews at the moment,
  • 19:50 - 19:51
    because I try to understand
  • 19:51 - 19:55
    how much people already look
    at other language Wikipedias
  • 19:55 - 19:57
    to make the structure of an article.
  • 19:57 - 19:59
    Are they generally equal
  • 19:59 - 20:02
    or do they differ a lot
    based on cultural context?
  • 20:02 - 20:04
    So that would be something to consider,
  • 20:04 - 20:07
    but there is a possibility to say,
  • 20:07 - 20:10
    we take everything
    from all the language Wikipedias
  • 20:10 - 20:12
    and then make an average, basically.
  • 20:12 - 20:15
    And the other problem is referencing.
  • 20:15 - 20:16
    So that's something we find.
  • 20:16 - 20:21
    We made it quite convenient for ourselves
    because we work a lot with Arabic,
  • 20:21 - 20:24
    and Arabic actually has the problem
    that there are a lot of references,
  • 20:24 - 20:29
    but they are very little used
    or not widely used in Wikipedia.
  • 20:29 - 20:32
    That's not true, obviously,
    for all languages,
  • 20:32 - 20:34
    and that's something
    I'd be very interested--
  • 20:34 - 20:35
    like, let's talk.
  • 20:35 - 20:37
    That's what I'm trying to say,
  • 20:37 - 20:39
    I'd be very interested
    in your perspective on it
  • 20:39 - 20:42
    because I'd like to know, yeah
  • 20:42 - 20:44
    what you think about referencing
  • 20:44 - 20:45
    done from English or any other language.
  • 20:45 - 20:47
    (person 2) Have you ever tried--
  • 20:47 - 20:52
    what we do is we normally
    reference to interviews we have.
  • 20:52 - 20:56
    We put them in our repository,
    institutional repository,
  • 20:56 - 21:00
    because these languages
    don't have written references,
  • 21:00 - 21:03
    and I feel like
    that is the way to go, but--
  • 21:03 - 21:07
    I'm currently also--
    Kimberly and I are discussing a lot.
  • 21:07 - 21:11
    We did a session at Wikimania
    on oral knowledge and oral citations.
  • 21:11 - 21:14
    Yeah, we should hang out
    and have a long conversation.
  • 21:14 - 21:16
    (laughs)
  • 21:18 - 21:22
    (person 3) So, [Michael Davignon].
    We'll talk about a medium-sized community,
  • 21:22 - 21:24
    which is probably around ten people,
  • 21:24 - 21:28
    so it's medium for the Breton Wikipedia.
  • 21:28 - 21:31
    And I'm wondering if we can use Scribe
  • 21:32 - 21:35
    the other way around,
    to find a common plan
  • 21:35 - 21:38
    for an existing article,
    to find [the outer layers]
  • 21:38 - 21:40
    that's supposed to be the best plan,
  • 21:40 - 21:42
    but I'm not aware of more or less
  • 21:42 - 21:45
    [inaudible]
    improving an existing article.
  • 21:47 - 21:49
    I think there's--
  • 21:49 - 21:51
    I forgot the name, I think,
  • 21:51 - 21:54
    [Diego] in the Wikimedia Foundation
    research team,
  • 21:54 - 21:58
    who's working a lot at the moment
    with section headings.
  • 21:58 - 22:01
    But, yes, generally, the idea is the same.
  • 22:01 - 22:05
    So instead of using them
    to make an average
  • 22:05 - 22:07
    you could say,
    "this one is not like the average."
  • 22:08 - 22:10
    That's very possible, yeah.
  • 22:15 - 22:18
    (person 4) Hi, Lucie. I'm Érica Azzellini
    from Wiki Movement Brazil,
  • 22:18 - 22:20
    and I'm very--
  • 22:20 - 22:22
    (Érica) Oh, can you hear me?
  • 22:22 - 22:25
    So, I'm Érica Azzellini
    from Wiki Movement Brazil,
  • 22:25 - 22:27
    and I'm really impressed with your work
  • 22:27 - 22:29
    because it's really in sync
  • 22:29 - 22:33
    with what we've been working on in Brazil
    with the Mbabel tool.
  • 22:33 - 22:34
    I don't know if you heard about it?
  • 22:34 - 22:36
    - Not yet.
    - (Érica) It's a tool that we use
  • 22:36 - 22:38
    to automatically
    generate Wikipedia entries
  • 22:38 - 22:42
    using Wikidata information
    in a simple way
  • 22:42 - 22:47
    that can be replicated
    on other Wikipedia languages.
  • 22:47 - 22:49
    So we've been working
    on Portuguese mainly,
  • 22:49 - 22:52
    and we're trying to get it
    onto English Wikipedia too,
  • 22:52 - 22:56
    but it can be replicated
    on any language, basically,
  • 22:56 - 22:58
    and I think then we could talk about it.
  • 22:58 - 23:00
    Absolutely, it will be super interesting
  • 23:00 - 23:03
    because the article placeholder
    is an extension already,
  • 23:03 - 23:06
    so it might be worth
    integrating your efforts
  • 23:06 - 23:08
    into the existing extension.
  • 23:08 - 23:13
    Lydia is also fully for it,
    and... (laughs)
  • 23:13 - 23:14
    And then because--
  • 23:14 - 23:17
    so one of the problems--
    [Marius] correct me if I'm wrong--
  • 23:17 - 23:20
    we had was that
    article placeholder doesn't scale
  • 23:20 - 23:22
    as well as it should.
  • 23:22 - 23:25
    So the article placeholder
    is not on Portuguese Wikipedia
  • 23:25 - 23:29
    because we're always afraid
    it will break everything, correct?
  • 23:29 - 23:32
    And then [Marius] is just taking a pause.
  • 23:32 - 23:35
    - (Érica) Yeah, you should be careful.
    - Don't want to say anything about this.
  • 23:35 - 23:39
    But, yeah, we should connect
    because I'd be super interested to see
  • 23:39 - 23:42
    how you solve those issues
    and how it works for you.
  • 23:42 - 23:45
    (Érica) I'm going to present
    in the second session
  • 23:45 - 23:48
    of lightning talks about this project
    that we've been developing,
  • 23:48 - 23:51
    and we've been using it
    on GLAM-Wiki initiatives
  • 23:51 - 23:52
    and education projects already.
  • 23:52 - 23:54
    - Perfect.
    - (Érica) So let's do that.
  • 23:54 - 23:56
    Yeah, absolutely let's chat.
  • 23:57 - 23:58
    (moderator) Cool.
  • 23:58 - 24:00
    Some other questions on your projects?
  • 24:02 - 24:07
    (person 5) Hi, my name is [Alan],
    and I think this is extremely cool.
  • 24:07 - 24:09
    I had a few questions about
  • 24:09 - 24:13
    generating Wiki sentences
    from neural networks.
  • 24:13 - 24:16
    - Yeah.
    - (person 5) So I've come across
  • 24:16 - 24:19
    another project
    that was attempting to do this,
  • 24:19 - 24:23
    and it was essentially using
    [triples input and sentences output],
  • 24:23 - 24:26
    and it was able
    to generate very fluent sentences.
  • 24:26 - 24:29
    But sometimes they weren't...
  • 24:30 - 24:34
    actually, they weren't correct,
    with regards to the triple.
  • 24:34 - 24:39
    And I was curious if you had any ways
    of doing validity checks of this sort.
  • 24:39 - 24:43
    Sometimes the triple
    is "subject, predicate, object,"
  • 24:43 - 24:46
    but the language model says,
  • 24:46 - 24:49
    "Okay, this object is very rare,
  • 24:49 - 24:52
    I'm going to say you are born in San Jose,
  • 24:52 - 24:55
    instead of San Francisco or vice versa."
  • 24:55 - 24:59
    And I was curious
    if you had come across this?
  • 24:59 - 25:02
    So that's what we call hallucinations.
  • 25:02 - 25:05
    The idea that
    there's something in a sentence
  • 25:05 - 25:08
    that wasn't in the original triple
    or the data.
  • 25:08 - 25:11
    What we do--
    so we don't do anything about it,
  • 25:11 - 25:14
    we just also realized
    that that's happening.
  • 25:14 - 25:16
    It happens even more
    for the low-resource languages,
  • 25:16 - 25:20
    because we work across domains,
    so we are generating domain-independently.
  • 25:20 - 25:25
    Traditional NLG work
    is usually in the biography domain.
  • 25:25 - 25:27
    So that happens a lot
  • 25:27 - 25:30
    because we just have little training data
    on the low-resource languages.
  • 25:30 - 25:33
    We have a few ideas.
  • 25:33 - 25:37
    It's one of the million topics
    I'm supposed to work on at the moment.
  • 25:39 - 25:43
    One of them is to use
    entity linking and relation extraction,
  • 25:43 - 25:44
    to align what we generate
  • 25:44 - 25:47
    with the triples
    we inputted in the first place,
  • 25:47 - 25:51
    to see if it's off or the network
    generates information it shouldn't have
  • 25:51 - 25:54
    or it cannot know about, basically.
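A tiny sketch of that alignment idea (one possible way to do it, not an implemented system from the talk): after a hypothetical entity-linking step over the generated sentence, anything linked that never occurred in the input triples gets flagged.

```python
# Toy sketch of the alignment check described above (one possible approach,
# not an implemented system): flag entities that entity linking finds in the
# generated sentence but that never appeared in the input triples.
def flag_hallucinations(linked_entities: set[str],
                        input_triples: list[tuple[str, str, str]]) -> set[str]:
    seen_in_input = {part for triple in input_triples for part in triple}
    return linked_entities - seen_in_input

triples = [("Marrakesh", "country", "Morocco")]
# Entities a (hypothetical) entity linker found in the generated sentence:
generated = {"Marrakesh", "Spain"}
print(flag_hallucinations(generated, triples))  # {'Spain'} -> likely hallucination
```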
  • 25:54 - 25:59
    That's also all I can say about this
    because now time is over.
  • 25:59 - 26:01
    (person 5) I'd love to talk offline
    about this, if you have time.
  • 26:01 - 26:03
    Yeah, absolutely, let's chat about it.
  • 26:03 - 26:05
    Thank you so much,
    everyone, it was lovely.
  • 26:05 - 26:07
    (moderator) Thank you, Lucie.
  • 26:07 - 26:09
    (applause)