
cdn.media.ccc.de/.../wikidatacon2019-1120-eng-Wikidata_knowledge_base_completion_using_multilingual_Wikipedia_fact_extraction_hd.mp4

  • 0:06 - 0:08
    (moderator) The next talk is
    by Anders Sandholm
  • 0:08 - 0:12
    on Wikidata fact annotation
    for Wikipedia across languages.
  • 0:12 - 0:14
    - Thank you.
    - Thanks.
  • 0:22 - 0:24
    I wanted to start with a small confession.
  • 0:26 - 0:32
    Wow! I'm blown away
    by the momentum of Wikidata
  • 0:34 - 0:36
    and the engagement of the community.
  • 0:37 - 0:39
    I am really excited about being here
  • 0:39 - 0:42
    and getting a chance to talk
    about work that we've been doing.
  • 0:43 - 0:47
    This is joint work with Michael,
    who's also here in the third row.
  • 0:50 - 0:52
    But before I dive more into this,
  • 0:52 - 0:56
    this wouldn't be
    a Google presentation without an ad,
  • 0:56 - 0:58
    so you get that up front.
  • 0:58 - 1:01
    This is what I'll be talking about,
    our project, the SLING project.
  • 1:02 - 1:07
    It is an open source project
    and it's using Wikidata a lot.
  • 1:08 - 1:12
    You can go check it out on GitHub
    when you get a chance
  • 1:12 - 1:16
    if you feel excited about it
    after the presentation.
  • 1:18 - 1:23
    And really, what I wanted to talk about--
    the title is admittedly a little bit long,
  • 1:23 - 1:26
    it's even shorter than it was
    in the original program.
  • 1:26 - 1:30
    But what it comes down to,
    what the project comes down to
  • 1:30 - 1:34
    is trying to answer
    this one very exciting question.
  • 1:35 - 1:38
    If you want, in the beginning,
    there were just two files,
  • 1:40 - 1:41
    some of you may recognize them,
  • 1:42 - 1:46
    they're essentially the dump files
    from Wikidata and Wikipedia,
  • 1:47 - 1:50
    and the question we're trying
    to figure out or answer is really,
  • 1:52 - 1:54
    can we dramatically improve
    how good machines are
  • 1:54 - 1:58
    at understanding human language
    just by using these files?
  • 2:01 - 2:04
    And of course, you're entitled to ask
  • 2:04 - 2:06
    whether that's an interesting
    question to answer.
  • 2:07 - 2:14
    If you're a company that [inaudible]
    is to be able to take search queries
  • 2:14 - 2:18
    and try to answer them
    in the best possible way,
  • 2:18 - 2:24
    obviously, understanding natural language
    comes in as a very handy thing.
  • 2:25 - 2:28
    But even if you look at Wikidata,
  • 2:29 - 2:34
    in the previous data quality panel
    earlier today,
  • 2:34 - 2:39
    there was a question that came up about
    verification, or verifiability of facts.
  • 2:39 - 2:43
    So let's say you actually do
    understand natural language.
  • 2:43 - 2:47
    If you have a fact and there's a source,
    you could go to the source and analyze it,
  • 2:47 - 2:50
    and you can figure out whether
    it actually confirms the fact
  • 2:50 - 2:52
    that is claiming
    to have this as a source.
  • 2:53 - 2:56
    And if you could do it,
    you could even go beyond that
  • 2:56 - 3:00
    and you could read articles
    and annotate them, come up with facts,
  • 3:00 - 3:03
    and actually look for existing facts
    that may need sources
  • 3:03 - 3:06
    and add these articles as sources.
  • 3:07 - 3:11
    Or, you know, in the wildest,
    craziest possible of all worlds,
  • 3:11 - 3:14
    if you get really, really good at it
    you could read articles
  • 3:14 - 3:18
    and maybe even annotate with new facts
    that you could then suggest as facts
  • 3:18 - 3:20
    that you could potentially
    add to Wikidata.
  • 3:21 - 3:27
    But there's a whole world of applications
    of natural language understanding.
  • 3:29 - 3:32
    One of the things that's really hard when
    you do natural language understanding--
  • 3:32 - 3:36
    these days, that also means
    deep learning or machine learning,
  • 3:36 - 3:40
    and one of the things that's really hard
    is getting enough training data.
  • 3:40 - 3:43
    And historically,
    that's meant having a lot of text
  • 3:43 - 3:45
    that you need human annotators
    to then first process
  • 3:45 - 3:47
    and then you can do training.
  • 3:47 - 3:51
    And part of the question here
    is also really to say:
  • 3:51 - 3:56
    Can we use Wikidata and the way
    in which it's interlinked with Wikipedia
  • 3:57 - 3:58
    for training data,
  • 3:58 - 4:01
    and will that be enough
    to train that model?
  • 4:03 - 4:07
    So hopefully, we'll get closer
    to answering this question
  • 4:07 - 4:09
    in the next 15 to 20 minutes.
  • 4:10 - 4:14
    We don't quite know the answer yet
    but we have some exciting results
  • 4:14 - 4:17
    that are pointing
    in the right direction, if you want.
  • 4:19 - 4:24
    Just take a step back in terms of
    the development we've seen,
  • 4:24 - 4:28
    machine learning and deep learning
    has revolutionized a lot of areas
  • 4:28 - 4:32
    and this is just one example
    of a particular image recognition task
  • 4:32 - 4:37
    that if you look at what happened
    between 2010 and 2015,
  • 4:37 - 4:41
    in that five-year period,
    we went from machines doing pretty poorly
  • 4:41 - 4:45
    to, in the end, actually performing
    at the same level as humans
  • 4:45 - 4:49
    or in some cases even better
    albeit for a very specific task.
  • 4:50 - 4:56
    So we've seen really a lot of things
    improving dramatically.
  • 4:56 - 4:58
    And so you can ask
  • 4:58 - 5:02
    why don't we just throw deep learning
    at natural language processing
  • 5:02 - 5:05
    and natural language understanding
    and be done with it?
  • 5:05 - 5:12
    And the answer is, kind of,
    we've sort of done that to a certain extent,
  • 5:12 - 5:14
    but what it turns out is that
  • 5:15 - 5:18
    natural language understanding
    is actually still a bit of a challenge
  • 5:18 - 5:23
    and one of the situations where
    a lot of us interact with machines
  • 5:23 - 5:26
    that are trying to behave like
    they understand what we're saying
  • 5:26 - 5:27
    is in these chat bots.
  • 5:27 - 5:29
    So this is not to pick
    on anyone in particular
  • 5:29 - 5:32
    but just, I think, an experience
    that a lot of us have had.
  • 5:32 - 5:37
    In this case, it's a user saying
    I want to stay in this place.
  • 5:37 - 5:42
    The chat bot says: "OK, got it,
    when will you be checking in and out?
  • 5:42 - 5:44
    For example, November 17th to 23rd."
  • 5:44 - 5:47
    And the user says:
    "Well, I don't have any dates yet."
  • 5:47 - 5:48
    And then the response is:
  • 5:48 - 5:51
    "Sorry, there are no hotels available
    for the dates you've requested.
  • 5:51 - 5:53
    Would you like to start a new search?"
  • 5:53 - 5:55
    So there's still some way to go
  • 5:56 - 5:59
    to get machines to really
    understand human language.
  • 6:00 - 6:04
    But machine learning or deep learning
  • 6:04 - 6:07
    has been applied
    already to this discipline.
  • 6:07 - 6:10
    Like, one of the examples is a recent...
  • 6:10 - 6:11
    a more successful example is BERT
  • 6:11 - 6:17
    where they're using transformers
    to solve NLP or NLU tasks.
  • 6:19 - 6:22
    And it's dramatically improved
    the performance but, as we've seen,
  • 6:22 - 6:24
    there is still some way to go.
  • 6:25 - 6:28
    One thing that's shared among
    most of these approaches
  • 6:28 - 6:32
    is that you look at the text itself
  • 6:32 - 6:37
    and you depend on having a lot of it
    so you can train your model on the text,
  • 6:37 - 6:40
    but everything is based
    on just looking at the text
  • 6:40 - 6:42
    and understanding the text.
  • 6:42 - 6:46
    So the learning is really
    just representation learning.
  • 6:46 - 6:51
    What we wanted to do is actually
    understand and annotate the text
  • 6:51 - 6:54
    in terms of items
    or entities in the real world.
  • 6:56 - 7:00
    And in general, if we take a step back,
  • 7:00 - 7:03
    why is natural language processing
    or understanding so hard?
  • 7:03 - 7:08
    There are a number of reasons
    why it's really hard, but at the core,
  • 7:08 - 7:11
    one of the important reasons
    is that somehow,
  • 7:11 - 7:13
    the machine needs to have
    knowledge of the world
  • 7:13 - 7:17
    in order to understand human language.
  • 7:20 - 7:22
    And you think about that
    for a little while.
  • 7:23 - 7:27
    What better place to look for knowledge
    about the world than Wikidata?
  • 7:27 - 7:30
    So in essence, that's the approach.
  • 7:30 - 7:32
    And the question is can you leverage it,
  • 7:32 - 7:39
    can you use this wonderful knowledge
  • 7:39 - 7:41
    of the world that we already have
  • 7:41 - 7:46
    in a way that can help
    to train and bootstrap your model.
  • 7:47 - 7:51
    So the alternative here is really
    understanding the text
  • 7:51 - 7:55
    not just in terms of other texts
    or how this text is similar to other texts
  • 7:55 - 7:59
    but in terms of the existing knowledge
    that we have about the world.
  • 8:01 - 8:03
    And what makes me really excited
  • 8:03 - 8:06
    or at least makes me
    have a good gut feeling about this
  • 8:06 - 8:07
    is that in some ways
  • 8:07 - 8:11
    it seems closer
    to how we interact as humans.
  • 8:11 - 8:14
    So if we were having a conversation
  • 8:14 - 8:18
    and you were bringing up
    the Bundeskanzler and Angela Merkel,
  • 8:19 - 8:23
    I would have an internal representation
    of Q567 and it would light up.
  • 8:23 - 8:26
    And in our continued conversation,
  • 8:26 - 8:30
    mentioning other things
    related to Angela Merkel,
  • 8:30 - 8:32
    I would have an easier time
    associating with that
  • 8:32 - 8:34
    or figuring out
    what you were actually talking about.
  • 8:35 - 8:39
    And so, in essence,
    that's at the heart of this approach,
  • 8:39 - 8:42
    that we really believe
    Wikidata is a key component
  • 8:42 - 8:46
    in unlocking this better understanding
    of natural language.
  • 8:50 - 8:51
    And so how are we planning to do it?
  • 8:53 - 8:57
    Essentially, there are five steps
    we're going through,
  • 8:57 - 8:58
    or have been going through.
  • 8:59 - 9:03
    I'll go over each
    of the steps briefly in turn
  • 9:03 - 9:04
    but essentially, there are five steps.
  • 9:04 - 9:07
    First, we need to start
    with the dump files that I showed you
  • 9:07 - 9:08
    to begin with--
  • 9:09 - 9:11
    understanding what's in them,
    parsing them,
  • 9:11 - 9:13
    having an efficient
    internal representation in memory
  • 9:13 - 9:16
    that allows us to do
    quick processing on this.
  • 9:16 - 9:19
    And then, we're leveraging
    some of the annotations
  • 9:19 - 9:23
    that are already in Wikipedia,
    linking it to items in Wikidata.
  • 9:23 - 9:25
    I'll briefly show you what I mean by that.
  • 9:25 - 9:31
    We can use that to then
    generate more advanced annotations
  • 9:32 - 9:35
    where we have much more text annotated.
  • 9:35 - 9:40
    But still, with annotations
    being items or facts in Wikidata,
  • 9:40 - 9:44
    we can then train a model
    based on the silver data
  • 9:44 - 9:46
    and get a reasonably good model
  • 9:46 - 9:49
    that will allow us to read
    a Wikipedia document
  • 9:49 - 9:53
    and understand what the actual content is
    in terms of Wikidata,
  • 9:55 - 9:58
    but only for facts that are
    already in Wikidata.
  • 9:59 - 10:02
    And so that's where kind of
    the hard part of this begins.
  • 10:02 - 10:06
    In order to go beyond that
    we need to have a plausibility model,
  • 10:06 - 10:08
    so a model that can tell us,
  • 10:08 - 10:11
    given a lot of facts about an item
    and an additional fact,
  • 10:11 - 10:13
    whether the additional fact is plausible.
  • 10:13 - 10:14
    If we can build that,
  • 10:15 - 10:22
    we can then use a more "hyper modern"
    reinforcement learning aspect
  • 10:22 - 10:26
    of deep learning and machine learning
    to fine-tune the model
  • 10:26 - 10:30
    and hopefully go beyond
    what we've been able to do so far.
  • 10:32 - 10:33
    So real quick,
  • 10:33 - 10:37
    the first step is essentially
    getting the dump files parsed,
  • 10:37 - 10:41
    understanding the contents, and linking up
    Wikidata and Wikipedia information,
  • 10:41 - 10:44
    and then utilizing some of the annotations
    that are already there.
  • 10:46 - 10:49
    And so this is essentially
    what's happening.
  • 10:49 - 10:52
    Trust me, Michael built all of this,
    it's working great.
  • 10:53 - 10:56
    But essentially, we're starting
    with the two files you can see on the top,
  • 10:56 - 10:58
    the Wikidata dump and the Wikipedia dump.
  • 10:58 - 11:02
    The Wikidata dump gets processed
    and we end up with a knowledge base,
  • 11:02 - 11:04
    a KB at the bottom.
  • 11:04 - 11:07
    That's essentially a store
    we can hold in memory
  • 11:07 - 11:10
    that has essentially all of Wikidata in it
  • 11:10 - 11:14
    and we can quickly access
    all the properties and facts and so on
  • 11:14 - 11:15
    and do analysis there.
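
To make the knowledge-base step concrete: below is a minimal sketch of loading a Wikidata JSON dump into an in-memory store keyed by QID. This is illustrative Python assuming the standard Wikidata JSON dump layout, not SLING's actual C++ store; the function name and the simplified record format are assumptions.

    import bz2
    import json

    def load_knowledge_base(dump_path):
        """Read a Wikidata JSON dump (one entity per line) into a dict keyed by QID."""
        kb = {}
        with bz2.open(dump_path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip().rstrip(",")
                if not line or line in ("[", "]"):
                    continue  # skip the enclosing array brackets of the dump
                entity = json.loads(line)
                # Keep labels and claims so property lookups stay fast and in memory.
                kb[entity["id"]] = {
                    "labels": {lang: v["value"] for lang, v in entity.get("labels", {}).items()},
                    "claims": entity.get("claims", {}),
                }
        return kb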
  • 11:15 - 11:16
    Similarly, for the documents,
  • 11:16 - 11:18
    they get processed
    and we end up with documents
  • 11:19 - 11:22
    that have been processed.
  • 11:22 - 11:24
    We know all the mentions
  • 11:24 - 11:27
    and some of the things
    that are already in the documents.
  • 11:27 - 11:28
    And then in the middle,
  • 11:28 - 11:30
    we have an important part
    which is a phrase table
  • 11:30 - 11:33
    that allows us to basically
    see for any phrase
  • 11:34 - 11:36
    what is the frequency distribution,
  • 11:36 - 11:39
    what's the most likely item
    that we're referring to
  • 11:39 - 11:41
    when we're using this phrase.
  • 11:41 - 11:44
    So we're using that later on
    to build the silver annotations.
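
A rough sketch of the phrase-table idea described here, assuming (phrase, QID) pairs have already been harvested from Wikipedia link anchors; the class and its layout are illustrative, not SLING's actual data structure.

    from collections import Counter, defaultdict

    class PhraseTable:
        """Maps a phrase to a frequency distribution over the Wikidata items it links to."""

        def __init__(self):
            self.counts = defaultdict(Counter)

        def add(self, phrase, qid):
            self.counts[phrase.lower()][qid] += 1

        def distribution(self, phrase):
            counts = self.counts[phrase.lower()]
            total = sum(counts.values())
            return {qid: n / total for qid, n in counts.items()}

        def most_likely(self, phrase):
            counts = self.counts[phrase.lower()]
            return counts.most_common(1)[0][0] if counts else None

    # e.g. table.add("Angela Merkel", "Q567"); table.most_likely("Angela Merkel") -> "Q567"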
  • 11:44 - 11:48
    So let's say we've run this
    and then we also want to make sure
  • 11:48 - 11:52
    we utilize annotations
    that are already there.
  • 11:52 - 11:54
    So an important part
    of a Wikipedia article
  • 11:54 - 11:58
    is that it's not just plain text,
  • 11:58 - 12:01
    it's actually already
    pre-annotated with a few things.
  • 12:01 - 12:04
    So a template is one example,
    links are another example.
  • 12:04 - 12:08
    So if we take here the English article
    for Angela Merkel,
  • 12:09 - 12:12
    there is one example of a link here
    which is to her party.
  • 12:12 - 12:14
    If you look at the bottom,
  • 12:14 - 12:16
    that's a link to a specific
    Wikipedia article,
  • 12:16 - 12:20
    and I guess for people here,
    it's no surprise that, in essence,
  • 12:20 - 12:23
    that is then, if you look
    at the associated Wikidata item,
  • 12:23 - 12:26
    that's essentially an annotation saying
  • 12:26 - 12:31
    this is the QID I am talking about
    when I'm talking about this party,
  • 12:31 - 12:33
    the Christian Democratic Union.
  • 12:34 - 12:37
    So we're using this
    to already have a good start
  • 12:37 - 12:39
    in terms of understanding what text means.
  • 12:39 - 12:40
    All of these links,
  • 12:40 - 12:44
    we know exactly what the author
    means with the phrase
  • 12:45 - 12:47
    in the cases where
    there are links to QIDs.
  • 12:48 - 12:53
    We can use this and the phrase table
    to then try and take a Wikipedia document
  • 12:53 - 12:59
    and fully annotate it with everything
    we know about already from Wikidata.
  • 13:00 - 13:03
    And we can use this to train
    the first iteration of our model.
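
A much-simplified sketch of how such silver annotations could be produced from the two ingredients just mentioned, the article's own links plus the phrase table; the function and its greedy longest-match strategy are assumptions for illustration, far cruder than the "few tricks" the talk alludes to.

    def silver_annotate(tokens, link_spans, phrase_table, max_len=4):
        """Annotate a tokenized Wikipedia article with QIDs.

        link_spans: {(start, end): qid} taken directly from the article's wiki links.
        Remaining spans are resolved greedily through the phrase table.
        """
        annotations = dict(link_spans)  # explicit links are trusted as-is
        covered = {i for (start, end) in link_spans for i in range(start, end)}
        i = 0
        while i < len(tokens):
            if i in covered:
                i += 1
                continue
            for n in range(max_len, 0, -1):  # prefer the longest matching phrase
                qid = phrase_table.most_likely(" ".join(tokens[i:i + n]))
                if qid is not None:
                    annotations[(i, i + n)] = qid
                    i += n
                    break
            else:
                i += 1
        return annotations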
  • 13:04 - 13:05
    (coughs) Excuse me.
  • 13:05 - 13:08
    So this is exactly the same article,
  • 13:08 - 13:14
    but now, after we've annotated it
    with silver annotations,
  • 13:15 - 13:18
    and essentially,
    you can see all of the squares
  • 13:18 - 13:25
    are places where we've been able
    to annotate with QIDs or with facts.
  • 13:26 - 13:31
    This is just a screenshot
    of the viewer on the data,
  • 13:31 - 13:34
    so you can have access
    to all of this information
  • 13:34 - 13:38
    and see what's come out
    of the silver annotation.
  • 13:38 - 13:41
    And it's important to say that
    there's no machine learning
  • 13:41 - 13:43
    or anything involved here.
  • 13:43 - 13:46
    All we've done is sort of
    mechanically, with a few tricks,
  • 13:47 - 13:50
    basically pushed information
    we already have from Wikidata
  • 13:50 - 13:53
    onto the Wikipedia article.
  • 13:53 - 13:56
    And so here, if you hover over
    "Chancellor of Germany"
  • 13:56 - 14:02
    that is itself a Wikidata,
    that's referring to a Wikidata item,
  • 14:02 - 14:05
    has a number of properties
    like "subclass of: Chancellor",
  • 14:05 - 14:09
    "country: Germany",
    those again referring to other items.
  • 14:09 - 14:12
    And here, it also has
    the property "officeholder"
  • 14:12 - 14:15
    which happens to be
    Angela Dorothea Merkel,
  • 14:15 - 14:17
    which is also mentioned in the text.
  • 14:17 - 14:22
    So there's really a full annotation
    linking up the contents here.
  • 14:25 - 14:27
    But again, there is an important
    and unfortunate point
  • 14:27 - 14:32
    about what we are able to
    and not able to do here.
  • 14:32 - 14:35
    So what we are doing is pushing
    information we already have in Wikidata,
  • 14:35 - 14:40
    so what we can't annotate here
    are things that are not in Wikidata.
  • 14:40 - 14:42
    So for instance, here,
  • 14:42 - 14:45
    she was at some point appointed
    Federal Minister for Women and Youth
  • 14:45 - 14:49
    and that alias or that phrase
    is not in Wikidata,
  • 14:49 - 14:54
    so we're not able to make that annotation
    here in our silver annotations.
  • 14:56 - 15:00
    That said, it's still... at least for me,
  • 15:00 - 15:03
    it was pretty surprising to see
    how much you can actually annotate
  • 15:03 - 15:04
    and how much information is already there
  • 15:04 - 15:09
    when you combine Wikidata
    with a Wikipedia article.
  • 15:09 - 15:15
    So what you can do is, once you have this,
    you know, millions of documents,
  • 15:16 - 15:20
    you can train your parser
    based on the annotations that are there.
  • 15:21 - 15:27
    And that's essentially a parser
    that has a number of components.
  • 15:27 - 15:30
    Essentially, the text is coming in
    at the bottom and at the top,
  • 15:30 - 15:34
    we have a transition-based
    frame semantic parser
  • 15:34 - 15:39
    that then generates the annotations
    or these facts or references to the items.
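
To give a flavour of the transition-based formulation, here is a drastically reduced sketch: an oracle that turns silver annotations into a gold action sequence the parser can be trained to reproduce. Only two toy actions are used; SLING's real transition system is richer (with actions such as CONNECT and ASSIGN) and a learned network chooses the actions.

    def oracle_actions(tokens, annotations):
        """Derive a gold action sequence from silver annotations.

        annotations: {(start, end): qid}. SHIFT moves to the next token,
        EVOKE creates a frame over a mention span.
        """
        starts = {start: (end, qid) for (start, end), qid in annotations.items()}
        actions = []
        i = 0
        while i < len(tokens):
            if i in starts:
                end, qid = starts[i]
                actions.append(("EVOKE", qid, end - i))  # evoke a frame spanning the mention
                i = end
            else:
                actions.append(("SHIFT",))
                i += 1
        return actions

    # A parser model is then trained to predict each action from the current
    # token and its context, one step at a time.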
  • 15:41 - 15:45
    We built this and ran it
    on more classical corpora
  • 15:45 - 15:50
    like [inaudible],
    which are more classical NLP corpora,
  • 15:50 - 15:54
    but we want to be able to run this
    on the full Wikipedia corpora.
  • 15:54 - 15:57
    So Michael has been rewriting this in C++
  • 15:57 - 16:00
    and we're able to really
    scale up performance
  • 16:00 - 16:01
    of the parser trainer here.
  • 16:01 - 16:04
    So it will be exciting to see exactly
  • 16:04 - 16:06
    the results that are going
    to come out of that.
  • 16:09 - 16:10
    So once that's in place,
  • 16:10 - 16:13
    we have a pretty good model
    that's able to at least
  • 16:13 - 16:16
    predict facts that are
    already known in Wikidata,
  • 16:16 - 16:19
    but ideally, we want to move beyond that,
  • 16:19 - 16:21
    and for that
    we need this plausibility model
  • 16:21 - 16:24
    which in essence,
    you can think of it as a black box
  • 16:24 - 16:27
    where you supply it with
    all of the known facts you have
  • 16:27 - 16:31
    about a particular item
    and then you provide an additional fact.
  • 16:31 - 16:32
    And by magic,
  • 16:32 - 16:37
    the black box tells you how plausible is
    the additional fact that you're providing
  • 16:37 - 16:40
    and how plausible it is
    that this particular statement is actually a fact.
  • 16:43 - 16:44
    And...
  • 16:46 - 16:49
    I don't know if it's fair to say
    that it was much to our surprise,
  • 16:49 - 16:51
    but at least, you can actually--
  • 16:51 - 16:53
    In order to train a model, you need,
  • 16:53 - 16:55
    like we've seen earlier,
    you need a lot of training data
  • 16:55 - 16:58
    and essentially, you can
    use Wikidata as training data.
  • 16:58 - 17:02
    You serve it basically
    all the facts for a given item
  • 17:02 - 17:05
    and then you mask or hold out one fact
  • 17:05 - 17:09
    and then you provide that as a fact
    that it's supposed to predict.
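
A minimal sketch of that hold-one-out training-data generation, with facts simplified to (property, value) pairs per item; the negative examples (random facts borrowed from other items) are an assumption based on the "random facts" comparison mentioned later in the Q&A.

    import random

    def plausibility_examples(kb, negatives_per_item=1):
        """Yield (context_facts, candidate_fact, label) training triples.

        kb: {qid: [(property, value), ...]}.
        Positive: one real fact held out, with the item's remaining facts as context.
        Negative: a random fact borrowed from another item, labelled implausible.
        """
        items = list(kb)
        for qid, facts in kb.items():
            if len(facts) < 2:
                continue
            held_out = random.choice(facts)
            context = [f for f in facts if f != held_out]
            yield context, held_out, 1
            for _ in range(negatives_per_item):
                other_facts = kb[random.choice(items)]
                if other_facts:
                    yield context, random.choice(other_facts), 0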
  • 17:09 - 17:11
    And just using this as training data,
  • 17:11 - 17:16
    you can get a really really good
    plausibility model, actually,
  • 17:19 - 17:22
    to the extent that I was hoping one day
    to maybe be able to even use it
  • 17:22 - 17:28
    for discovering what you could call
    accidental vandalism in Wikidata
  • 17:28 - 17:33
    like a fact that's been added by accident
    and really doesn't look like it's...
  • 17:33 - 17:35
    It doesn't fit with the normal topology
  • 17:35 - 17:39
    of facts or knowledge
    in Wikidata, if you want.
  • 17:41 - 17:44
    But in this particular setup,
    we need it for something else,
  • 17:44 - 17:47
    namely for doing reinforcement learning
  • 17:48 - 17:51
    so we can fine-tune the Wiki parser,
  • 17:51 - 17:54
    and basically using the plausibility model
    as a reward function.
  • 17:54 - 18:00
    So when you do the training,
    you try to parse a Wikipedia document
  • 18:00 - 18:02
    [inaudible] in Wikipedia
    comes up with a fact
  • 18:02 - 18:04
    and we check the fact
    on the plausibility model
  • 18:04 - 18:08
    and use that as feedback
    or as a reward function
  • 18:08 - 18:10
    in training the model.
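
A very rough sketch of that loop, with the plausibility model acting as the reward signal; the parser and plausibility objects and their methods are stand-ins, and the update is reduced to a single REINFORCE-style step rather than whatever training recipe is actually used.

    def fine_tune(parser, plausibility, documents, kb, steps=1000):
        """Fine-tune the parser so that the facts it extracts score as plausible."""
        for _, doc in zip(range(steps), documents):
            facts, log_probs = parser.sample_facts(doc)      # stochastic fact extraction
            rewards = []
            for subject, prop, value in facts:
                context = kb.get(subject, [])                 # known facts about the item
                rewards.append(plausibility.score(context, (prop, value)))
            # Policy-gradient style update: reinforce extractions the
            # plausibility model considers likely to be real facts.
            parser.update(log_probs, rewards)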
  • 18:10 - 18:13
    And the big question here is then
    can we learn to predict facts
  • 18:13 - 18:15
    that are not already in Wikidata.
  • 18:16 - 18:22
    And we hope and believe we can
    but it's still not clear.
  • 18:23 - 18:28
    So this is essentially what we have been
    and are planning to do.
  • 18:28 - 18:31
    There's been some
    surprisingly good results
  • 18:31 - 18:34
    in terms of how far
    you can get with silver annotations
  • 18:34 - 18:36
    and a plausibility model.
  • 18:36 - 18:40
    But in terms of
    how far we are, if you want,
  • 18:40 - 18:42
    we sort of have
    the infrastructure in place
  • 18:42 - 18:44
    to do the processing
    and have everything efficiently in memory.
  • 18:45 - 18:49
    We have first instances
    of silver annotations
  • 18:49 - 18:53
    and have a parser trainer in place
    for the supervised learning
  • 18:53 - 18:56
    and an initial plausibility model.
  • 18:56 - 19:00
    But we're still pushing on those fronts
    and very much looking forward
  • 19:00 - 19:03
    to see what comes out
    of the very last bit.
  • 19:08 - 19:10
    And those were my words.
  • 19:10 - 19:15
    I'm very excited to see
    what comes out of it
  • 19:15 - 19:18
    and it's been pure joy
    to work with Wikidata.
  • 19:18 - 19:20
    It's been fun to see
  • 19:20 - 19:24
    how some of the things you come across
    seemed wrong and then the next day,
  • 19:24 - 19:25
    you look, things are fixed
  • 19:25 - 19:31
    and it's really been amazing
    to see the momentum there.
  • 19:31 - 19:35
    Like I said, the URL,
    all the source code is on GitHub.
  • 19:36 - 19:39
    Our email addresses
    were on the first slide,
  • 19:39 - 19:43
    so please do reach out
    if you have questions or are interested
  • 19:43 - 19:47
    and I think we have time
    for a couple questions now in case...
  • 19:49 - 19:51
    (applause)
  • 19:51 - 19:52
    Thanks.
  • 19:56 - 19:59
    (woman 1) Thank you for your presentation.
    I do have a concern however.
  • 19:59 - 20:05
    The Wikipedia corpus
    is known to be biased.
  • 20:05 - 20:10
    There's a very strong bias--
    for example, fewer women, more men,
  • 20:10 - 20:12
    all sorts of other aspects in there.
  • 20:12 - 20:15
    So isn't this actually
    also tainting the knowledge
  • 20:15 - 20:19
    that you are taking out of the Wikipedia?
  • 20:22 - 20:25
    Well, there are two aspects
    of the question.
  • 20:25 - 20:29
    There's both in the model
    that we are then training,
  • 20:29 - 20:32
    you could ask how... let's just...
  • 20:33 - 20:36
    If you make it really simple
    and say like:
  • 20:36 - 20:41
    Does it mean that the model
    will then be worse
  • 20:41 - 20:46
    at predicting facts
    about women than men, say,
  • 20:46 - 20:50
    or some other set of groups?
  • 20:53 - 20:55
    To begin with,
    if you just look at the raw data,
  • 20:55 - 21:01
    it will reflect whatever is the bias
    in the training data, so that's...
  • 21:03 - 21:06
    People work on this to try
    and address that in the best possible way.
  • 21:06 - 21:10
    But normally,
    when you're training a model,
  • 21:10 - 21:14
    it will reflect
    whatever data you're training it on.
  • 21:15 - 21:19
    So that's something to account for
    when doing the work, yeah.
  • 21:21 - 21:23
    (man 2) Hi, this is [Marco].
  • 21:23 - 21:26
    I am a natural language
    processing practitioner.
  • 21:27 - 21:32
    I was curious about
    how you model your facts.
  • 21:32 - 21:35
    So I heard you say frame semantics,
  • 21:35 - 21:36
    Right.
  • 21:36 - 21:39
    (Marco) Could you maybe
    give some more details on that, please.
  • 21:40 - 21:47
    Yes, so it's frame semantics,
    we're using frame semantics,
  • 21:47 - 21:50
    and basically,
  • 21:50 - 21:56
    all of the facts in Wikidata,
    they're modeled as frames.
  • 21:56 - 21:59
    And so that's an essential part
    of the set up
  • 21:59 - 22:00
    and how we make this work.
  • 22:00 - 22:04
    That's essentially
    how we try to address the...
  • 22:04 - 22:07
    How can I make all the knowledge
    that I have in Wikidata
  • 22:07 - 22:11
    available in a context where
    I can annotate and train my model
  • 22:12 - 22:14
    when I am annotating or parsing text.
  • 22:14 - 22:20
    It is that the existing data
    in Wikidata is modeled as frames.
  • 22:20 - 22:21
    So the store that we have,
  • 22:21 - 22:24
    the knowledge base with
    all of the knowledge is a frame store,
  • 22:24 - 22:27
    and this is the same frame store
    that we are building on top of
  • 22:27 - 22:30
    when we're then parsing the text.
  • 22:30 - 22:34
    (Marco) So you're converting
    the Wikidata data model into some frame.
  • 22:35 - 22:37
    Yes, we are converting the Wikidata model
  • 22:37 - 22:40
    into one large frame store
    if you want, yeah.
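
A toy illustration of what "converting the Wikidata data model into one large frame store" could look like: every item becomes a frame whose slots are its properties, with QID values acting as references to other frames. This is a conceptual sketch of the idea, not SLING's actual frame representation.

    def build_frame_store(kb):
        """Turn simplified Wikidata items into frames (dicts of slot -> value).

        kb: {qid: {"label": str, "claims": {pid: [value, ...]}}}.
        Values that are QIDs are kept as references, so frames link to each other.
        """
        store = {}
        for qid, item in kb.items():
            frame = {"id": qid, "name": item.get("label")}
            for pid, values in item.get("claims", {}).items():
                frame[pid] = values if len(values) > 1 else values[0]
            store[qid] = frame
        return store

    # e.g. store["Q567"] might contain {"id": "Q567", "name": "Angela Merkel",
    #      "P27": "Q183", ...}, where "P27: Q183" (country of citizenship: Germany)
    #      points at the frame for Germany.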
  • 22:41 - 22:44
    (man 3) Thanks. Is Pluto a planet?
  • 22:44 - 22:47
    (audience laughing)
  • 22:47 - 22:48
    Can I get the question...
  • 22:48 - 22:52
    (man 3) I like the bootstrapping thing
    that you are doing,
  • 22:52 - 22:53
    I mean the way
    that you're training your model
  • 22:53 - 22:58
    by picking out the known facts
    about things that are verified,
  • 22:58 - 23:01
    and then training
    the plausibility prediction
  • 23:01 - 23:04
    by trying to teach
    the architecture of the system
  • 23:04 - 23:06
    to recognize that actually,
    that fact fits.
  • 23:06 - 23:13
    So that will work for large classes,
    but it will really...
  • 23:13 - 23:16
    It doesn't sound like it will learn
    about surprises
  • 23:16 - 23:19
    and especially not
    in small classes of items, right.
  • 23:19 - 23:21
    So if you train your model in...
  • 23:21 - 23:23
    When did Pluto disappear, I forgot...
  • 23:23 - 23:24
    As a planet, you mean.
  • 23:24 - 23:27
    (man 3) Yeah, it used to be
    a member of the solar system
  • 23:27 - 23:29
    and we had how many,
    nine observations there.
  • 23:29 - 23:31
    - Yeah.
    - (man 3) It's slightly problematic.
  • 23:31 - 23:34
    So everyone, the kids think
    that Pluto is not a planet,
  • 23:34 - 23:36
    I still think it's a planet,
    but never mind.
  • 23:36 - 23:42
    So the fact that it suddenly
    stopped being a planet,
  • 23:42 - 23:46
    which was supported in the period before,
    I don't know, hundreds of years, right?
  • 23:47 - 23:50
    That's crazy, how would you go
    for figuring out that thing?
  • 23:50 - 23:54
    For example, the new claim
    is not plausible for that thing.
  • 23:54 - 23:56
    Sure. So there are two things.
  • 23:56 - 23:59
    So there's both like how precise
    is a plausibility model.
  • 23:59 - 24:02
    So what it distinguishes between
    is random facts
  • 24:02 - 24:04
    and facts that are plausible.
  • 24:04 - 24:07
    And there's also the question
    of whether Pluto is a planet
  • 24:07 - 24:09
    and that's back to whether...
  • 24:09 - 24:10
    I was in another session
  • 24:10 - 24:14
    where someone brought up the example
    of the earth being flat,
  • 24:14 - 24:17
    - whether that is a fact or not.
    - (man 3) That makes sense.
  • 24:17 - 24:19
    So it is a fact in a sense
    that you can put it in,
  • 24:19 - 24:20
    I guess you could put it in Wikidata
  • 24:20 - 24:22
    with sources that are claiming
    that that's the thing.
  • 24:22 - 24:27
    So again, you would not necessarily
    want to train the model in a way
  • 24:27 - 24:31
    where if you read someone saying
    the planet Pluto, bla, bla, bla,
  • 24:31 - 24:34
    then it should be fine for it
  • 24:34 - 24:37
    to then say that
    an annotation for this text
  • 24:37 - 24:38
    is that Pluto is a planet.
  • 24:40 - 24:41
    That doesn't mean, you know...
  • 24:42 - 24:47
    The model won't be able to tell
    what "in the end" is the truth,
  • 24:47 - 24:49
    I don't think any of us here
    will be able to either, so...
  • 24:49 - 24:50
    (man 3) I just want to say
  • 24:50 - 24:53
    it's not a hard accusation
    against the approach
  • 24:53 - 24:56
    because even people
    cannot be sure whether that's a fact,
  • 24:56 - 24:58
    a new fact is plausible at that moment.
  • 24:59 - 25:00
    But that's always...
  • 25:00 - 25:03
    I just maybe reiterated a question
    that I am posing all the time
  • 25:03 - 25:06
    to myself and my work; I always ask.
  • 25:06 - 25:09
    We do the statistical learning thing,
    it's amazing nowadays
  • 25:09 - 25:14
    we can do billions of things,
    but we cannot learn about surprises,
  • 25:14 - 25:17
    and they are
    very, very important in fact, right?
  • 25:18 - 25:21
    - (man 4) But, just to refute...
    - (man 3) Thank you.
  • 25:23 - 25:27
    (man 4) The plausibility model
    is combined with kind of two extra rules.
  • 25:27 - 25:30
    First of all,
    if it's in Wikidata, it's true.
  • 25:30 - 25:35
    We just give you the benefit of the doubt,
    so please make it good.
  • 25:35 - 25:39
    The second thing is if it's not
    allowed by the schema it's false;
  • 25:40 - 25:43
    it's all the things in between
    we're looking at.
  • 25:43 - 25:50
    So if it's a planet according to Wikidata,
    it will be a true fact.
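
A small sketch of those two rules wrapped around the learned plausibility score, as just described; the schema and plausibility objects, the threshold, and the fact representation are all assumptions for illustration.

    from collections import namedtuple

    Fact = namedtuple("Fact", "subject property value")

    def judge_fact(fact, kb, schema, plausibility, threshold=0.5):
        """Combine the two hard rules with the learned plausibility model.

        kb: {qid: [(property, value), ...]}; schema and plausibility are stand-ins.
        """
        known = kb.get(fact.subject, [])
        if (fact.property, fact.value) in known:
            return True                      # rule 1: already in Wikidata -> treated as true
        if not schema.allows(fact.subject, fact.property, fact.value):
            return False                     # rule 2: not allowed by the schema -> false
        # Everything in between is left to the learned plausibility score.
        return plausibility.score(known, (fact.property, fact.value)) >= threshold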
  • 25:53 - 25:57
    But it won't predict surprises
    but what is important here
  • 25:57 - 26:02
    is that there's kind of
    no manual human work involved,
  • 26:02 - 26:04
    so there's nothing
    that prevents you from...
  • 26:04 - 26:06
    Well, now, if we're successful
    with the approach,
  • 26:06 - 26:09
    there's nothing that prevents him
    from continuously updating the model
  • 26:09 - 26:12
    with changes happening
    in Wikidata and Wikipedia and so on.
  • 26:12 - 26:18
    So in theory, you should be able
    to quickly learn new surprises.
  • 26:18 - 26:20
    (moderator) One last question.
  • 26:20 - 26:23
    - (man 4) Maybe we're biased by Wikidata.
    - Yeah.
  • 26:24 - 26:28
    (man 4) You are our bias.
    Whatever you annotate is what we believe.
  • 26:28 - 26:32
    So if you make it good,
    if you make it balanced,
  • 26:32 - 26:34
    we can hopefully be balanced.
  • 26:34 - 26:39
    With the gender thing,
    there's actually an interesting thing.
  • 26:40 - 26:42
    We are actually getting
    more training facts
  • 26:42 - 26:44
    about women than men
  • 26:44 - 26:49
    because "she" is a much less
    ambiguous pronoun in the text,
  • 26:49 - 26:52
    so we actually get a lot more
    true facts about women.
  • 26:52 - 26:55
    So we are biased, but on the women's side.
  • 26:56 - 26:59
    (woman 2) No, I want to see
    the data on that.
  • 26:59 - 27:00
    (audience laughing)
  • 27:00 - 27:02
    We should bring that along next time.
  • 27:02 - 27:05
    (man 4) You get a hard decision [inaudible].
  • 27:05 - 27:06
    (man 3) Yes, hard decision.
  • 27:08 - 27:13
    (man 5) It says SLING is...
    parser across many languages
  • 27:13 - 27:15
    - and you showed us English.
    - Yes!
  • 27:15 - 27:18
    (man 5) Can you say something about
    the number of languages that you are--
  • 27:18 - 27:19
    Yes! Thank you for asking.
  • 27:19 - 27:22
    I had told myself to say that
    up front on the first page
  • 27:22 - 27:23
    because otherwise,
    I would forget, and I did.
  • 27:25 - 27:26
    So right now,
  • 27:26 - 27:30
    we're not actually looking at two files,
    we're looking at 13 files.
  • 27:30 - 27:33
    So Wikipedia dumps
    from 12 different languages
  • 27:33 - 27:36
    that we're processing,
  • 27:36 - 27:41
    and none of this is dependent
    on the language being English.
  • 27:41 - 27:44
    So we're processing this
    for all of the 12 languages.
  • 27:48 - 27:49
    Yeah.
  • 27:49 - 27:50
    For now,
  • 27:50 - 27:57
    they share the property of, I think,
    using the Latin alphabet, and so on.
  • 27:57 - 27:59
    Mostly for us to be able to make sure
  • 27:59 - 28:02
    that what we are doing
    still makes sense and works.
  • 28:02 - 28:05
    But there's nothing
    fundamental about the approach
  • 28:05 - 28:10
    that prevents it from being used
    in very different languages
  • 28:10 - 28:15
    from those being spoken around this area.
  • 28:17 - 28:19
    (woman 3) Leila from Wikimedia Foundation.
  • 28:19 - 28:22
    I may have missed this
    when you presented this.
  • 28:23 - 28:28
    Do you make an attempt to bring
    any references from Wikipedia articles
  • 28:28 - 28:32
    back to the property and statements
    you're making in Wikidata?
  • 28:33 - 28:37
    So I briefly mentioned this
    as a potential application.
  • 28:37 - 28:40
    So for now, what we're trying to do
    is just to get this to work,
  • 28:41 - 28:46
    but let's say we did get it to work
    with a high level of quality,
  • 28:47 - 28:51
    that would be an obvious thing
    to try to do, so when you...
  • 28:53 - 28:55
    Let's say you were willing to...
  • 28:55 - 29:00
    I know there's some controversy around
    using Wikipedia as a source for Wikidata,
  • 29:00 - 29:02
    that you can't have
    circular references and so on,
  • 29:02 - 29:05
    so you need to have
    properly sourced facts.
  • 29:05 - 29:07
    So let's say you were
    coming up with new facts,
  • 29:07 - 29:14
    and obviously, you could look
    at the coverage of news media and so on
  • 29:14 - 29:16
    and process these
    and try to annotate these.
  • 29:16 - 29:20
    And then, that way,
    find sources for facts,
  • 29:20 - 29:21
    new facts that you come up with.
  • 29:21 - 29:22
    Or you could even take existing...
  • 29:22 - 29:26
    There are a lot of facts in Wikidata
    that either have no sources
  • 29:26 - 29:30
    or only have Wikipedia as a source,
    so you can start processing these
  • 29:30 - 29:33
    and try to find sources
    for those automatically.
  • 29:34 - 29:38
    (Leila) Or even within the articles
    that you're taking this information from
  • 29:38 - 29:42
    just using the sources from there
    because they may contain...
  • 29:42 - 29:44
    - Yeah. Yeah.
    - Yeah. Thanks.
  • 29:47 - 29:49
    - (moderator) Thanks Anders.
    - Cool. Thanks.
  • 29:50 - 29:55
    (applause)