cdn.media.ccc.de/.../wikidatacon2019-1119-eng-Analysing_Translation_of_Wikidata_Properties_hd.mp4

  • 0:06 - 0:09
    (moderator) Good afternoon, everybody.
    We're about to start.
  • 0:09 - 0:11
    I'm presenting you John Samuel
  • 0:11 - 0:17
    who works at the French
    engineering school CPE,
  • 0:17 - 0:20
    based in Lyon in France.
  • 0:20 - 0:21
    And he will tell us something more
  • 0:21 - 0:27
    about the translation
    of properties in Wikidata.
  • 0:27 - 0:30
    As you know,
    as is the case in all sessions,
  • 0:30 - 0:32
    there is an etherpad
    for collaborative note-taking.
  • 0:32 - 0:35
    Please don't forget that.
  • 0:35 - 0:36
    We'll have the presentation
  • 0:36 - 0:40
    and then we'll have
    some time for a short Q&A.
  • 0:40 - 0:42
    - The floor is yours.
    - (John) Thanks, [inaudible].
  • 0:43 - 0:45
    Thank you all for coming here.
  • 0:45 - 0:50
    So my talk is about analyzing
    translation of Wikidata properties.
  • 0:50 - 0:53
    So, just to give you a quick outline.
  • 0:53 - 0:55
    I would like to introduce this topic.
  • 0:55 - 0:59
    I will present a tool
    that I developed some years ago,
  • 0:59 - 1:01
    called WDProp,
    which I'm continuously working on,
  • 1:01 - 1:04
    and based on the feedback
    from the community,
  • 1:04 - 1:05
    I add new features.
  • 1:05 - 1:09
    And then I will talk about
    something called coarser analysis,
  • 1:09 - 1:12
    where I would like to look
    at the property translation,
  • 1:12 - 1:15
    from a much larger picture.
  • 1:15 - 1:19
    So I will talk about
    how we collected this data,
  • 1:19 - 1:23
    because this work is also done
    with one of my students, Thibaut Chamard.
  • 1:23 - 1:27
    And then I will present some results,
    and finally, I will conclude the talk.
  • 1:27 - 1:31
    So Wikidata, as you all know,
    it started in 2012,
  • 1:31 - 1:34
    and it's a free, open, linked,
    structured, collaborative,
  • 1:34 - 1:36
    and multilingual knowledge base.
  • 1:37 - 1:40
    My focus today
    is on the multilingual part,
  • 1:40 - 1:43
    because there is a big change
    from the traditional way
  • 1:43 - 1:45
    of how we used to edit on Wikipedia sites.
  • 1:45 - 1:48
    There were multiple subdomains,
  • 1:48 - 1:51
    and now you have a single domain
    on Wikidata
  • 1:51 - 1:56
    where multilingual contributors come
    and write or create articles.
  • 1:56 - 1:57
    So this is collaborative.
  • 1:57 - 2:01
    There has been work to say
    what exactly is collaborative,
  • 2:01 - 2:02
    why it is collaborative.
  • 2:02 - 2:05
    I have given references for these works.
  • 2:05 - 2:07
    So this is, if you see Wikidata,
  • 2:07 - 2:11
    everything starts
    from the property.
  • 2:11 - 2:14
    The property is proposed
    and then discussed and voted on.
  • 2:14 - 2:17
    And then it is created
    and finally translated,
  • 2:17 - 2:20
    and then you are finally
    able to use these properties.
  • 2:20 - 2:22
    But these properties may also be deleted--
  • 2:22 - 2:24
    there's also something called deletion.
  • 2:24 - 2:27
    But, as I highlighted on this slide,
  • 2:27 - 2:29
    my focus is on the multilingual aspect,
  • 2:29 - 2:33
    and the property creation
    and translation point of view.
  • 2:33 - 2:36
    So you have been here
    for the past two days,
  • 2:36 - 2:40
    and by this time
    you have seen many articles,
  • 2:40 - 2:46
    and I just want to point out
    what I am looking for on a Wikidata item.
  • 2:46 - 2:48
    This is a Wikidata item,
  • 2:48 - 2:52
    so you have this Q2841, which is Bogotá,
  • 2:52 - 2:56
    which is the capital city of Colombia,
  • 2:56 - 2:57
    and you have four parts here:
  • 2:57 - 3:01
    the languages, the labels,
    the description, and aliases.
  • 3:01 - 3:02
    So you can see,
    for different languages
  • 3:02 - 3:05
    you'll have the label,
    you have the description
  • 3:05 - 3:11
    as well as any aliases,
    the "also known as," if there are any.
  • 3:11 - 3:14
    And this, under the city,
    where you see the labels
  • 3:14 - 3:16
    and the properties together.
  • 3:16 - 3:21
    This is Avignon, a city in France.
  • 3:21 - 3:25
    So what I'm interested in
    is only the properties part.
  • 3:25 - 3:31
    For example, official name, native label,
    country, capital of, et cetera.
  • 3:31 - 3:34
    So when I say property,
    for example, if a country,
  • 3:34 - 3:38
    in this country,
    I'm looking at different aspects:
  • 3:38 - 3:40
    the language, the label,
    and the description,
  • 3:40 - 3:43
    and see how things change.
  • 3:43 - 3:44
    For example, if you take instance of--
  • 3:44 - 3:49
    okay, everybody knows instance of,
    you have been using it quite a lot--
  • 3:49 - 3:54
    this is P31, you see
    the number of aliases in English
  • 3:54 - 3:59
    for the property P31, instance of,
  • 3:59 - 4:04
    and then you would find
    that these types of properties
  • 4:04 - 4:08
    are created after discussion
    with the community.
  • 4:08 - 4:11
    So if I take the complete prop--
    the procedure,
  • 4:11 - 4:13
    what happens to creation of properties--
  • 4:13 - 4:17
    you start proposing properties
    with some possible translation.
  • 4:17 - 4:19
    It is important it's not just in English.
  • 4:19 - 4:24
    You have the templates
    to suggest your properties
  • 4:24 - 4:25
    in your local language.
  • 4:25 - 4:29
    So that's why it's a proposition
    with possible translation.
  • 4:29 - 4:32
    And then you put it to discussion,
    then it is put to voting,
  • 4:32 - 4:37
    and it's created, and then finally,
    the community members start translating it
  • 4:37 - 4:39
    and people put it into use.
  • 4:39 - 4:42
    But then you cannot be guaranteed
    the properties that are created
  • 4:42 - 4:44
    are always there forever.
  • 4:44 - 4:47
    Properties can be deleted,
    just like items can be deleted.
  • 4:47 - 4:51
    But then, again,
    it goes through a similar procedure.
  • 4:51 - 4:55
    You put the property
  • 4:55 - 4:58
    as you propose that it should be deleted,
  • 4:58 - 5:02
    and if the community decides it,
    it votes it, and then if it is decided--
  • 5:02 - 5:05
    the majority votes
    has decided to delete it--
  • 5:05 - 5:09
    we deprecate the property,
    and finally we delete this property.
  • 5:09 - 5:15
    So for today's talk, I'm mostly interested
    in the translation part.
  • 5:15 - 5:17
    So where are the translations happening?
  • 5:17 - 5:20
    First, the translation would happen
    at the proposition part,
  • 5:20 - 5:23
    and then you could find that,
    at the time of creation,
  • 5:23 - 5:28
    the person who creates the property
    can use the exact names
  • 5:28 - 5:31
    that were suggested
    by the property proposer
  • 5:31 - 5:35
    and he or she will create the properties,
  • 5:35 - 5:39
    and later, you start translating
    these properties.
  • 5:39 - 5:43
    So let us look at why this matters,
    why it is important.
  • 5:43 - 5:45
    So I put some examples.
  • 5:45 - 5:47
    This is, again, on P31,
  • 5:47 - 5:52
    instance of, the very, very famous
    property P31,
  • 5:52 - 5:56
    and you see there is
    no description for this item.
  • 5:56 - 6:01
    There are almost
    six descriptions on this image,
  • 6:01 - 6:03
    where we do not have any description.
  • 6:03 - 6:07
    Again, some more description
    for Odia and Punjabi,
  • 6:07 - 6:08
    there is no description.
  • 6:08 - 6:11
    This is a property
    which is used quite a lot,
  • 6:11 - 6:14
    and you see that there is
    no description for it.
  • 6:14 - 6:18
    And there is a surprising part
    that you could also have cases
  • 6:18 - 6:22
    where there are descriptions,
    but there are no labels.
  • 6:22 - 6:25
    For example, Ruffian,
    that has been shown here,
  • 6:25 - 6:30
    again on property P31,
    there is a label that is missing.
  • 6:30 - 6:34
    So this was the initial
    inspiration for this work
  • 6:34 - 6:37
    when I started working
    on property analysis.
  • 6:37 - 6:44
    I wanted to look at
    what aspects of properties,
  • 6:44 - 6:46
    or what aspects of property
  • 6:46 - 6:50
    that the whole flow chart
    that we have seen,
  • 6:50 - 6:51
    is multilingual.
  • 6:51 - 6:53
    So I wanted to look at,
  • 6:53 - 6:56
    okay, we know that Wikidata
    is multilingual,
  • 6:56 - 6:59
    and it's collaborative,
    that has been done.
  • 6:59 - 7:05
    But are we really able to achieve
    a truly multilingual experience?
  • 7:05 - 7:09
    That was the question
    behind the creation of WDProp.
  • 7:09 - 7:11
    So you may ask
    why there are so many people
  • 7:11 - 7:15
    who have worked on items,
    there are people who have worked on--
  • 7:15 - 7:17
    users, multilingual users
    and bots, et cetera,
  • 7:17 - 7:19
    why you want to focus on properties?
  • 7:19 - 7:23
    The answer is,
    I want to focus on properties
  • 7:23 - 7:26
    because it's very, very
    less influenced by bots.
  • 7:26 - 7:29
    You may have heard today or yesterday,
  • 7:29 - 7:32
    many people said,
    "Okay, if you have translation
  • 7:32 - 7:37
    in your local languages,
    and it has reached a very good number,
  • 7:37 - 7:39
    you should ensure
    what type of translation it is.
  • 7:39 - 7:44
    Is it just bots, which copies
    the name of a person to another language.
  • 7:44 - 7:47
    Then is it really translation?"
  • 7:47 - 7:48
    Okay, that's debatable.
  • 7:48 - 7:51
    But, of course,
    there is an influence by bot,
  • 7:51 - 7:55
    but in case of properties,
    there is not so much influence by bots,
  • 7:55 - 7:56
    and that is a good part.
  • 7:56 - 8:01
    That's why I focus on the properties part.
  • 8:01 - 8:06
    So, as I said, when WDProp was created,
  • 8:06 - 8:09
    it was to understand every aspect--
    the proposal, the creation, translation.
  • 8:09 - 8:12
    What are the templates that are available.
  • 8:12 - 8:16
    Are these templates,
    for example, you said support,
  • 8:16 - 8:22
    if a French person opens Wikidata,
    a Wikidata France translation page,
  • 8:22 - 8:28
    can he see the word, [soutien],
    for that particular property proposal?
  • 8:28 - 8:29
    Is it possible?
  • 8:29 - 8:33
    So this type of things was needed.
  • 8:33 - 8:36
    In the end, it was also
    about giving real-time statistics
  • 8:36 - 8:38
    to the multilingual contributors.
  • 8:38 - 8:39
    It's not about one time,
  • 8:39 - 8:42
    it's like you just made it
    and published for one time-- no.
  • 8:42 - 8:45
    You want people
    to get this data in real time.
  • 8:45 - 8:47
    So what are we doing?
  • 8:47 - 8:52
    So the goal of WDProp
    was to understand everything
  • 8:52 - 8:54
    about Wikidata properties.
  • 8:54 - 8:57
    So, label, aliases, description.
  • 8:57 - 9:01
    So you have got all these three translated
    so the middle part where you say,
  • 9:01 - 9:06
    this property is completely usable
    because all the three aspects
  • 9:06 - 9:09
    have been translated.
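
(note) The idea of a property being "completely usable" in a language can be sketched in Python. This is a minimal sketch, not WDProp's actual code: the function names and the sample entity are hypothetical, and the sample only mimics the general shape of Wikidata's entity JSON (labels and descriptions keyed by language code, aliases as a list per language).

```python
from typing import Dict


def translation_status(entity: dict, language: str) -> Dict[str, bool]:
    """Report which of label, description and aliases exist for one language."""
    return {
        "label": language in (entity.get("labels") or {}),
        "description": language in (entity.get("descriptions") or {}),
        # Aliases are a list per language; an empty list counts as missing.
        "aliases": bool((entity.get("aliases") or {}).get(language)),
    }


def is_fully_translated(entity: dict, language: str) -> bool:
    """True only when all three aspects are present for the language."""
    return all(translation_status(entity, language).values())


# Hypothetical, hand-written sample; the values are illustrative only.
sample = {
    "id": "P31",
    "labels": {
        "en": {"language": "en", "value": "instance of"},
        "fr": {"language": "fr", "value": "nature de l'élément"},
    },
    "descriptions": {
        "en": {"language": "en", "value": "illustrative description"},
    },
    "aliases": {
        "en": [{"language": "en", "value": "is a"}],
    },
}

if __name__ == "__main__":
    for lang in ("en", "fr", "pa"):
        print(lang, translation_status(sample, lang))
```

Keeping the three checks separate makes it possible to spot the cases mentioned earlier, such as a description without a label.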
  • 9:09 - 9:12
    So let me just show you quickly,
    what is this WDProp,
  • 9:12 - 9:14
    what I'm talking about.
  • 9:14 - 9:15
    So this is the WDProp,
  • 9:15 - 9:20
    it's available on
    tools.wmflabs.org/wdprop/.
  • 9:20 - 9:24
    So you have a lot of statistics
    and if I ask you some questions today,
  • 9:24 - 9:28
    like, for example,
    "How many data types are there
  • 9:28 - 9:31
    that are supported by Wikidata right now?"
  • 9:31 - 9:34
    So if such questions, we do not know,
  • 9:34 - 9:38
    sometimes because there are new data types
    that keep on coming.
  • 9:38 - 9:42
    So this data,
    this is generated in real time,
  • 9:42 - 9:45
    this creates the data structure
    and it will give you the answer.
  • 9:45 - 9:46
    How many languages are there?
  • 9:46 - 9:50
    Yes, of course,
    see that there are 313 languages.
  • 9:50 - 9:55
    And then, for example,
    how many labels were translated.
  • 9:55 - 9:59
    So you could see
    that the data is being fetched.
  • 9:59 - 10:00
    I hope it comes.
  • 10:02 - 10:03
    Okay, let's hope. (chuckles)
  • 10:08 - 10:12
    Okay, I will take
    some other stuff as well.
  • 10:12 - 10:14
    Browsing all properties by their time.
  • 10:14 - 10:17
    Yes. So you see,
    this is count of translated labels,
  • 10:17 - 10:20
    and you see all this data
    that is coming real time,
  • 10:20 - 10:22
    and you can see that the labels
  • 10:22 - 10:27
    are currently available
    for 6,804 properties in English,
  • 10:27 - 10:31
    followed by Dutch, followed by Arabic,
    followed by Ukrainian, and then French.
  • 10:31 - 10:33
    So this is real-time statistics.
  • 10:33 - 10:35
    So you could also do the same
    for description,
  • 10:35 - 10:38
    also do for aliases, et cetera.
  • 10:38 - 10:41
    And you could get the overall
    translation statuses if you want.
  • 10:41 - 10:44
    So there are some other things
    that we will discuss later,
  • 10:44 - 10:46
    if time permits.
  • 10:46 - 10:50
    But you could navigate
    all the different items
  • 10:50 - 10:52
    on the left-hand side,
  • 10:52 - 10:54
    and you could see
    there are a lot of things
  • 10:54 - 10:59
    that could really help to see
    what things are happening in WDProp.
  • 10:59 - 11:04
    So this is, for example,
    Wikidata properties,
  • 11:04 - 11:06
    these are the properties
    that are currently available.
  • 11:06 - 11:10
    But as I said some time back,
    properties could be deleted.
  • 11:10 - 11:13
    And this, you see that these are
    the properties that were deleted,
  • 11:13 - 11:17
    starting from P1, P2, P3, P4, P5,
    these have all been deleted,
  • 11:17 - 11:23
    and you could get this thing
    just from the statistics board.
  • 11:23 - 11:25
    And here, so same thing.
  • 11:25 - 11:30
    Then, the next thing that interested me
    was to understand the translation pattern.
  • 11:30 - 11:33
    So, for example, sometimes we feel
    that some languages--
  • 11:33 - 11:37
    so English is created first,
    and followed by maybe Dutch,
  • 11:37 - 11:38
    or maybe French,
  • 11:38 - 11:41
    and maybe after French,
    it could be Arabic.
  • 11:41 - 11:44
    So these things
    could be interesting to know.
  • 11:44 - 11:49
    So for that, we started to look
    at the idea of translation path--
  • 11:49 - 11:52
    exactly how things are translated.
  • 11:52 - 11:57
    So again, if you go to the property page,
    you could click on any property.
  • 11:57 - 11:58
    Sorry.
  • 11:59 - 12:01
    Maybe I can show.
  • 12:04 - 12:06
    So you could click on any property
    and you could just say,
  • 12:06 - 12:08
    "Give me the translation path."
  • 12:08 - 12:11
    It takes some time,
    but it will start bringing the data,
  • 12:11 - 12:15
    because it's real time,
    so you get the data coming from all this.
  • 12:15 - 12:17
    So you get the date,
  • 12:17 - 12:22
    you get what things have been changed,
    when was something deleted, et cetera.
  • 12:22 - 12:24
    Why it is important?
  • 12:25 - 12:29
    For example, you see
    this is something that happened in 2017,
  • 12:29 - 12:32
    and the label has been removed.
  • 12:32 - 12:34
    This is the official website.
  • 12:34 - 12:39
    So imagine you have removed the label
    from the official website--
  • 12:39 - 12:40
    sorry, this country--
  • 12:40 - 12:43
    so anybody who doesn't know P17,
    what it is, cannot even understand,
  • 12:43 - 12:46
    because the label has been deleted
    by the person.
  • 12:46 - 12:48
    So this type of vandalism exists.
  • 12:48 - 12:51
    Another example where, completely,
  • 12:51 - 12:53
    all the language labels
    have been deleted--
  • 12:53 - 12:56
    English, French, Spanish, German,
    everything has been deleted.
  • 12:56 - 12:58
    There are no labels,
    there are no descriptions.
  • 12:58 - 13:01
    So you could find these types of things
    from the translation path
  • 13:01 - 13:05
    and just because of the color code,
    you could see what happened on what day,
  • 13:05 - 13:10
    and you could check exactly,
    because it is also linked.
  • 13:10 - 13:14
    If you click on any of this,
    you could also get a link to the revision,
  • 13:14 - 13:19
    identify what exactly happened
    during that particular revision.
  • 13:19 - 13:21
    So this is coming from revision history.
  • 13:21 - 13:25
    So if you click on any of this,
    you get what exactly is happening
  • 13:25 - 13:29
    in any particular revision.
  • 13:29 - 13:31
    So how did we build it?
  • 13:31 - 13:32
    Just if you come back,
  • 13:32 - 13:38
    here, you see there is something
    called a comment on the right-hand side.
  • 13:38 - 13:43
    You see there is something
    called added aliases,
  • 13:43 - 13:47
    "added British English aliases,"
    "changed Esperanto label,"
  • 13:47 - 13:48
    "added [io] label," et cetera.
  • 13:48 - 13:51
    So we made use of this information,
  • 13:51 - 13:53
    for example,
    for label, description, and aliases,
  • 13:53 - 13:56
    if you add something,
    you have some sort of comment
  • 13:56 - 13:58
    which starts with wbsetlabel-add.
  • 13:58 - 14:02
    Or if it is updated,
    you have wbsetlabel-set.
  • 14:02 - 14:04
    And if you remove something,
    you see it is removed.
  • 14:04 - 14:07
    And based on this type of information,
  • 14:07 - 14:11
    we were able to build
    such a translation path.
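
(note) The comment-based approach can be sketched in Python as follows. This is a hedged illustration, not the tool's real implementation: it assumes raw revision comments of the form `/* wbsetlabel-add:1|en */ ...`, and the names `TranslationEvent`, `parse_revision`, `translation_path` and the sample revisions are all hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Assumed comment shape: /* wbset<field>-<action>:<count>|<language> */ text
SUMMARY_RE = re.compile(
    r"/\*\s*wbset(?P<field>label|description|aliases)-"
    r"(?P<action>add|set|remove|update):\d+\|(?P<lang>[A-Za-z0-9-]+)"
)


class TranslationEvent(NamedTuple):
    timestamp: str  # ISO 8601 string, so plain sorting is chronological
    field: str      # "label", "description" or "aliases"
    action: str     # "add", "set", "remove" or "update"
    language: str


def parse_revision(timestamp: str, comment: str) -> Optional[TranslationEvent]:
    """Turn one revision comment into an event, or None if it does not match."""
    m = SUMMARY_RE.search(comment)
    if m is None:
        return None
    return TranslationEvent(timestamp, m["field"], m["action"], m["lang"])


def translation_path(revisions: Iterable[Tuple[str, str]]) -> List[TranslationEvent]:
    """Chronological list of the translation events found in the comments."""
    events = (parse_revision(ts, comment) for ts, comment in revisions)
    return sorted(e for e in events if e is not None)


# Made-up revisions for illustration; the last one carries no language.
sample_revisions = [
    ("2013-02-01T10:00:00Z", "/* wbsetlabel-add:1|fr */ nature de l'élément"),
    ("2013-01-30T09:00:00Z", "/* wbsetlabel-add:1|en */ instance of"),
    ("2013-03-05T12:00:00Z", "/* wbsetaliases-add:2|en-gb */ is a, type"),
    ("2014-06-01T08:00:00Z", "/* wbeditentity-update:0| */ several labels at once"),
]

if __name__ == "__main__":
    for event in translation_path(sample_revisions):
        print(event)
```

The last sample revision is silently skipped, which is exactly the weakness of relying on comments alone: an edit that changes several labels at once names no language.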
  • 14:11 - 14:17
    Okay, this is good, but what happened
    is that this type of information,
  • 14:17 - 14:19
    this type of things,
    just using the comment,
  • 14:19 - 14:24
    it is useful for building real-time tools,
    just like what I showed before, WDProp,
  • 14:24 - 14:31
    but it is very difficult to detect
    when there are multiple changes.
  • 14:31 - 14:35
    For example, if you have seen
    bots activity on Wikidata,
  • 14:35 - 14:40
    some bots make multiple labels
    in one single edit.
  • 14:40 - 14:42
    In that case,
    you cannot find what happened
  • 14:42 - 14:46
    because you do not have wbsetlabel,
    that particular language.
  • 14:46 - 14:49
    So you do not have a set of languages
    along with your comment.
  • 14:49 - 14:54
    So these are some problems
    if you want to use this type of approach.
  • 14:55 - 14:58
    So what we did,
    we decided to collect the data,
  • 14:58 - 15:01
    and we decided to publicly
    make this data available.
  • 15:03 - 15:06
    And what we did,
    we wanted to make use of content.
  • 15:06 - 15:09
    So what we did,
    we started with every revision,
  • 15:09 - 15:12
    and we took the content of each revision.
  • 15:12 - 15:17
    And we took the next revision,
    and we decided to find the difference
  • 15:17 - 15:20
    between these two revisions,
    to find what exactly changes,
  • 15:20 - 15:22
    which of the labels got changed.
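
(note) Comparing the content of consecutive revisions can be sketched like this in Python. It is only an illustration under assumptions: each revision's content is taken to be a dictionary with a `labels` mapping from language code to `{"language": ..., "value": ...}`, and the helper names and sample revisions are hypothetical rather than taken from the published data set.

```python
from typing import Dict, Iterable, List, Tuple


def labels_of(content: dict) -> Dict[str, str]:
    """Map language code -> label text for one revision's content."""
    # An empty container of any type (dict or list) is treated as "no labels".
    labels = content.get("labels") or {}
    return {lang: entry["value"] for lang, entry in labels.items()}


def diff_labels(old: dict, new: dict) -> List[Tuple[str, str]]:
    """List (language, change type) pairs between two revision contents."""
    before, after = labels_of(old), labels_of(new)
    changes = []
    for lang in sorted(before.keys() | after.keys()):
        if lang not in before:
            changes.append((lang, "added"))
        elif lang not in after:
            changes.append((lang, "removed"))
        elif before[lang] != after[lang]:
            changes.append((lang, "changed"))
    return changes


def label_history(
    property_id: str, revisions: Iterable[Tuple[str, dict]]
) -> List[Tuple[str, str, str, str]]:
    """Rows of (timestamp, property, language, type); revisions oldest first."""
    rows = []
    previous: dict = {}
    for timestamp, content in revisions:
        for lang, kind in diff_labels(previous, content):
            rows.append((timestamp, property_id, lang, kind))
        previous = content
    return rows


if __name__ == "__main__":
    # Made-up revision contents for illustration.
    revs = [
        ("2013-01-01T00:00:00Z", {"labels": {"en": {"language": "en", "value": "country"}}}),
        ("2013-01-02T00:00:00Z", {"labels": {
            "en": {"language": "en", "value": "country"},
            "fr": {"language": "fr", "value": "pays"},
            "nl": {"language": "nl", "value": "land"},
        }}),
    ]
    for row in label_history("P17", revs):
        print(row)
```

Unlike the comment-based approach, an edit that touches several languages at once yields one row per language.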
  • 15:22 - 15:25
    Because of that, we got
    much more interesting information,
  • 15:25 - 15:29
    much more accurate information
    than the previous approach
  • 15:29 - 15:31
    because it is very important
    for doing analysis.
  • 15:31 - 15:34
    It is important
    that you make use of correct data.
  • 15:34 - 15:37
    So you have four columns
    that were used here--
  • 15:37 - 15:39
    timestamp, property,
    language, type, et cetera.
  • 15:39 - 15:44
    And you get this data in this format.
    It is publicly available.
  • 15:44 - 15:47
    So what does this data give me?
  • 15:47 - 15:49
    This data gives me information
  • 15:49 - 15:55
    that currently almost 4,000 plus,
  • 15:55 - 15:57
    4,500 properties
  • 15:57 - 16:00
    have labels between 0 and 20.
  • 16:00 - 16:02
    So there are a lot of properties
  • 16:02 - 16:07
    that do not have
    more than 20 multilingual labels.
  • 16:07 - 16:11
    And there are only
    1,500 properties
  • 16:11 - 16:13
    that have been translated into up to 40 languages.
  • 16:13 - 16:19
    And yesterday, if you were present
    during the talk of Lydia Pintscher,
  • 16:19 - 16:22
    she talked about P18,
    so P18 is something here.
  • 16:22 - 16:25
    So you can see there are only
    a couple of six or seven properties
  • 16:25 - 16:30
    that are currently having all the--
  • 16:30 - 16:35
    P18 has 154 translations,
    just to give that idea.
  • 16:35 - 16:40
    So there is one property
    which is having 154 multilingual labels.
  • 16:40 - 16:44
    There are properties
    which have only one particular label.
  • 16:44 - 16:50
    And the average number
    of labels is only 21,
  • 16:50 - 16:53
    and the standard deviation is 20.
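
(note) Statistics of this kind can be computed from (property, language) pairs along these lines. This is a sketch with hypothetical function names and toy input; it does not reproduce the figures quoted in the talk, and the bucket boundaries (0-19, 20-39, and so on) are an assumption.

```python
import statistics
from collections import Counter
from typing import Dict, Iterable, Tuple


def label_counts(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct label languages per property."""
    languages: Dict[str, set] = {}
    for property_id, language in rows:
        languages.setdefault(property_id, set()).add(language)
    return {prop: len(langs) for prop, langs in languages.items()}


def bucketize(counts: Dict[str, int], width: int = 20) -> Dict[int, int]:
    """Histogram: lower bound of each bucket -> number of properties in it."""
    hist = Counter((n // width) * width for n in counts.values())
    return dict(sorted(hist.items()))


def summarize(counts: Dict[str, int]) -> Tuple[float, float]:
    """Mean and population standard deviation of labels per property."""
    values = list(counts.values())
    return statistics.mean(values), statistics.pstdev(values)


if __name__ == "__main__":
    # Toy data, not real Wikidata numbers.
    rows = [("P1", "en"), ("P1", "fr"), ("P1", "nl"), ("P2", "en")]
    counts = label_counts(rows)
    print(counts, bucketize(counts), summarize(counts))
```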
  • 16:53 - 16:56
    Okay, what would we like to say next?
  • 16:56 - 17:00
    So you have seen something similar
    in the real-time data.
  • 17:00 - 17:02
    This is from the collected data.
  • 17:02 - 17:08
    So this is what are the top languages
    that are coming up in the results.
  • 17:08 - 17:09
    So these we have seen.
  • 17:09 - 17:13
    But my next point is,
    are there combinations possible.
  • 17:13 - 17:17
    For example, if there is French,
    there is Arabic.
  • 17:17 - 17:20
    If there is Arabic,
    there is some other language.
  • 17:20 - 17:22
    If there's French,
    there's Ukrainian, et cetera.
  • 17:22 - 17:26
    Can we find such type of combinations
    in the translation data set?
  • 17:26 - 17:27
    So, yes, it is possible.
  • 17:27 - 17:30
    So if you see this count,
    these frequent itemsets--
  • 17:30 - 17:32
    so I've just shown seven of them--
  • 17:32 - 17:35
    you find that there are combinations
    that are possible.
  • 17:37 - 17:41
    Okay, let us say, is there a possibility
    of having four labels,
  • 17:41 - 17:44
    like if there is English,
    there's also possibility to find Dutch,
  • 17:44 - 17:46
    Arabic, Ukrainian.
  • 17:46 - 17:48
    If there is English,
    there's possibility to find Dutch,
  • 17:48 - 17:50
    French, and Arabic, et cetera.
  • 17:50 - 17:53
    You can also find a lot of combinations.
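
(note) Counting language combinations can be illustrated with a brute-force count in Python. This is not necessarily the method behind the slides: a proper frequent-itemset algorithm such as Apriori prunes candidates, whereas this sketch simply enumerates every combination of a given size, which is only practical for small sizes. The function name and the toy input are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def frequent_language_sets(
    languages_by_property: Dict[str, Set[str]], size: int, min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Language combinations of `size` shared by at least `min_support` properties."""
    counts: Counter = Counter()
    for languages in languages_by_property.values():
        # Sorting gives each combination one canonical tuple form.
        for combo in combinations(sorted(languages), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Toy data: which languages each property has a label in.
    toy = {
        "P1": {"en", "nl", "ar", "uk"},
        "P2": {"en", "nl", "fr", "ar"},
        "P3": {"en", "nl"},
    }
    print(frequent_language_sets(toy, size=2, min_support=2))
```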
  • 17:53 - 17:54
    Why is it important?
  • 17:54 - 17:57
    Because it is important to know if,
  • 17:57 - 18:00
    for example,
    if you have multilingual speakers
  • 18:00 - 18:04
    who are contributors,
    who can speak multiple languages,
  • 18:04 - 18:07
    if you're able to find
    any particular pattern
  • 18:07 - 18:13
    that helps us to find
    that if you tell this person to translate,
  • 18:13 - 18:15
    a new property is created
    to translate this label,
  • 18:15 - 18:19
    because he already
    speaks multiple languages,
  • 18:19 - 18:22
    we can suggest these things to the user.
  • 18:22 - 18:25
    So let me just show you one example.
  • 18:25 - 18:27
    This is a complete translation path
  • 18:27 - 18:30
    that has been obtained
    from different languages.
  • 18:30 - 18:35
    So here, what we have done is
    we selected two small minority languages,
  • 18:35 - 18:39
    like Tagalog and Kapampangan,
  • 18:39 - 18:43
    which are minority languages
    from the Philippines,
  • 18:43 - 18:46
    and you see that there is
    a strong transfer
  • 18:46 - 18:50
    between Tagalog and Kapampangan.
  • 18:50 - 18:52
    So these types of things can be detected
  • 18:52 - 18:55
    when you have such type
    of translation results.
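
(note) One possible way to look for such a transfer between two languages is to count how often one language's label is first added directly after another's on the same property. The talk does not spell out its exact measure, so this is an assumption, and the function name and the toy events are hypothetical; `tl` and `pam` are the language codes for Tagalog and Kapampangan.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def language_transitions(events: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count (earlier language, next language) pairs over all properties.

    `events` are (timestamp, property_id, language) label additions.
    Only the first label in each language counts for a property.
    """
    order: Dict[str, List[str]] = {}
    for _timestamp, property_id, language in sorted(events):
        seen = order.setdefault(property_id, [])
        if language not in seen:
            seen.append(language)
    counts: Counter = Counter()
    for languages in order.values():
        # Pair each language with the one that followed it on this property.
        counts.update(zip(languages, languages[1:]))
    return counts


if __name__ == "__main__":
    # Toy events, not real Wikidata history.
    toy = [
        ("2019-01-01", "P1", "en"),
        ("2019-01-02", "P1", "tl"),
        ("2019-01-03", "P1", "pam"),
        ("2019-02-01", "P2", "tl"),
        ("2019-02-02", "P2", "pam"),
    ]
    print(language_transitions(toy).most_common())
```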
  • 18:55 - 18:57
    So that is another advantage.
  • 18:57 - 19:00
    To conclude my work,
    I would like to say,
  • 19:00 - 19:05
    this is important that we understand
    how properties are translated
  • 19:05 - 19:11
    because if you want to extract data
    from Wikipedia,
  • 19:11 - 19:15
    you need to know what are the words
  • 19:15 - 19:16
    in the local languages
    that are being used.
  • 19:16 - 19:20
    What is "image" in French,
    what is "image" in Punjabi,
  • 19:20 - 19:23
    what is "image" in Hindi,
    or any other language.
  • 19:23 - 19:26
    So that is important for importing data.
  • 19:26 - 19:30
    And tomorrow, of course,
    if you are able to fetch this data,
  • 19:30 - 19:35
    to Wikidata, we could also
    use new projects like Wikidata Bridge,
  • 19:35 - 19:39
    which we could use
    to fill other infoboxes,
  • 19:39 - 19:45
    like multilingual Wikipedia articles,
  • 19:45 - 19:47
    and this could be really helpful.
  • 19:47 - 19:51
    So with that, I would like to thank you,
    and if you have questions,
  • 19:51 - 19:54
    I would be happy to answer them.
  • 19:55 - 19:57
    (moderator) Anybody with questions?
  • 19:59 - 20:02
    (audience applause)
  • 20:08 - 20:09
    Yes?
  • 20:12 - 20:16
    (man) So what you're doing
    is mainly analyzing how this--
  • 20:16 - 20:17
    - (John) Yes.
    - (man) ...is all happening?
  • 20:17 - 20:21
    Do you know if there are initiatives
    or if there are tools
  • 20:21 - 20:25
    which can help make this easier,
    like translation of properties?
  • 20:25 - 20:28
    Yes. Tools, like, for example,
    what to translate
  • 20:28 - 20:33
    from Wikimedia Foundation, is helpful,
    but I have not seen--
  • 20:33 - 20:36
    This is not currently
    integrated with Wikidata.
  • 20:36 - 20:42
    What to translate is only integrated
    with certain languages on Wikipedia,
  • 20:42 - 20:44
    but not on Wikidata.
  • 20:44 - 20:46
    But that could be really interesting.
  • 20:46 - 20:50
    Yes, thank you for bringing this up,
    because just imagine,
  • 20:50 - 20:54
    if we know that a person
    has been labeling in multiple languages,
  • 20:54 - 20:57
    and we also have
    this what to translate tool,
  • 20:57 - 21:00
    and we have these statistics,
    we have this data
  • 21:00 - 21:05
    coming from this type
    of property translation,
  • 21:05 - 21:09
    it is easier to suggest to a person
    that new properties have been created,
  • 21:09 - 21:11
    and then you could--
  • 21:11 - 21:14
    Right now it's not integrated to Wikidata.
  • 21:16 - 21:17
    (moderator) Anybody else?
  • 21:20 - 21:23
    (man 2) I have one question myself,
    that comes back to it,
  • 21:23 - 21:28
    does anybody know of working lists
    on translating properties?
  • 21:28 - 21:29
    Sorry?
  • 21:29 - 21:30
    (man 2) Does anybody
    know of working lists
  • 21:30 - 21:32
    about translating properties,
  • 21:32 - 21:38
    like, I can imagine from your statistics,
    you could say, this is the top 100
  • 21:38 - 21:40
    most widely used properties
  • 21:40 - 21:43
    that lack translations
    in this and this language?
  • 21:43 - 21:47
    No, there is, I think,
    there are ways by,
  • 21:47 - 21:51
    for example,
    you could browse by data types,
  • 21:51 - 21:54
    browse by property classes.
  • 21:54 - 21:57
    For example, here is something
    called property classes
  • 21:57 - 22:01
    where people have created projects--
  • 22:01 - 22:03
    it's taking time--
    so you have projects,
  • 22:03 - 22:09
    and you could say, how would I describe,
    what are the, for example,
  • 22:09 - 22:12
    what are the properties
    that I could describe for this,
  • 22:12 - 22:14
    for describing IEEE standard version?
  • 22:14 - 22:17
    You need edition number,
    you need edition translation, et cetera.
  • 22:17 - 22:23
    So if you have a targeted thing,
    you could search for what type of classes.
  • 22:23 - 22:26
    For example, if you're working
    in GLAM or histories,
  • 22:26 - 22:30
    you could say, what history-related
    documents are there?
  • 22:30 - 22:33
    So you could say, historical,
    and you could find historical.
  • 22:33 - 22:36
    Okay, this is a property class,
    go to this property class.
  • 22:36 - 22:38
    And, sorry, where is it?
  • 22:38 - 22:40
    So it is having something
    called "Mérimée ID."
  • 22:40 - 22:44
    So people have been
    trying to use property classes
  • 22:44 - 22:46
    to link objects.
  • 22:46 - 22:50
    That helps if you're working
    on a particular project,
  • 22:50 - 22:52
    and you could find
    the properties related to that.
  • 22:52 - 22:58
    (man 2) But your tool could quite easily
    make a list of, let's say,
  • 22:58 - 23:03
    the top 100 most widely used properties
  • 23:03 - 23:07
    that haven't got, I don't know,
    Punjabi label, let's say?
  • 23:07 - 23:10
    - (John) For that, I will just--
    - (man 2) Which could be interesting.
  • 23:10 - 23:14
    (John) Okay, tell me any language,
    for example, let us say, Netherlands,
  • 23:14 - 23:17
    because it's performing very well.
  • 23:17 - 23:22
    So I would say-- translated labels.
  • 23:22 - 23:24
    So this is translate-- sorry.
  • 23:30 - 23:33
    (mouse clicking)
  • 23:37 - 23:39
    For example, Hindi.
  • 23:39 - 23:40
    So here, what happens,
  • 23:40 - 23:44
    here you just see any properties
    that need translation.
  • 23:44 - 23:47
    So there are like 6,647 properties
  • 23:47 - 23:50
    that need translation
    in a particular language.
  • 23:50 - 23:55
    So you could click on any language
    that you want and get the data.
  • 23:55 - 23:59
    And you could get the list
    of where people need support.
  • 23:59 - 24:03
    So, this could be interesting
    to link with property usage,
  • 24:03 - 24:06
    how many people, is it really top,
    is it under the top ten.
  • 24:06 - 24:09
    So suggest those ten top hundred,
    in that language.
  • 24:09 - 24:11
    That would be an interesting list.
    That's good.
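
(note) The working list discussed here, the most-used properties that still lack a label in a given language, can be sketched as follows. This is an illustration only: the usage counts and label sets are made up, the function name is hypothetical, and real numbers would have to come from a usage report and from the label data.

```python
from typing import Dict, List, Set


def missing_label_worklist(
    usage: Dict[str, int],
    labels: Dict[str, Set[str]],
    language: str,
    top_n: int = 100,
) -> List[str]:
    """Among the `top_n` most-used properties, those with no label in `language`."""
    ranked = sorted(usage, key=usage.get, reverse=True)[:top_n]
    return [prop for prop in ranked if language not in labels.get(prop, set())]


if __name__ == "__main__":
    # Made-up usage counts and label languages, for illustration.
    usage = {"P31": 900, "P17": 500, "P18": 300, "P9999": 1}
    labels = {"P31": {"en", "pa"}, "P17": {"en"}, "P18": {"en", "pa"}}
    print(missing_label_worklist(usage, labels, "pa", top_n=3))
```

Ranking by usage first and filtering afterwards keeps the list focused on the properties translators are most likely to meet.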
  • 24:12 - 24:13
    (man 3) Just what you asked,
  • 24:13 - 24:17
    there is a list of top 100
    most used properties on Wikidata.
  • 24:17 - 24:19
    It's on Wikidata.
  • 24:19 - 24:21
    So, yeah, it's there,
  • 24:21 - 24:26
    under Wikidata Database Reports/
    Top 100 Properties.
  • 24:26 - 24:31
    So one thing could be that
    we could just link this and suggest it.
  • 24:31 - 24:33
    (moderator) Could you maybe
    add the link to the etherpad,
  • 24:33 - 24:37
    and then maybe,
    this information can come together.
  • 24:37 - 24:39
    (John) Okay.
  • 24:40 - 24:42
    (moderator) If there are
    no other questions,
  • 24:42 - 24:44
    then we will conclude here.
  • 24:44 - 24:49
    And we have two, three minutes break
    until we start with the next speaker.
  • 24:49 - 24:51
    - Thanks.
    - (John) Thank you very much.
  • 24:51 - 24:53
    (audience applause)
Title:
cdn.media.ccc.de/.../wikidatacon2019-1119-eng-Analysing_Translation_of_Wikidata_Properties_hd.mp4
Video Language:
English
Duration:
25:00
