
cdn.media.ccc.de/.../wikidatacon2019-1091-eng-Wikidata_Statistics_What_Where_and_How_hd.mp4

  • 0:06 - 0:10
    Yes, Wikidata Statistics:
    What, Where, and How?
  • 0:10 - 0:13
This is an attempt at an overview
of the analytical systems
  • 0:13 - 0:16
focusing on what was developed
at Wikimedia Deutschland
  • 0:16 - 0:18
    in the previous almost three years
  • 0:18 - 0:22
    since I started doing data science
for Wikidata and the dictionary.
  • 0:22 - 0:28
    So, during this presentation,
    I will try to switch from the presentation
  • 0:28 - 0:32
    to the dashboards
    and show you the end data products.
  • 0:33 - 0:35
    However, if that causes any trouble,
  • 0:35 - 0:39
this is actually the URL
    of the analytics portal.
  • 0:39 - 0:41
    So everything that
    I will be presenting here,
  • 0:41 - 0:44
    whatever you can see on the slides,
    you can also check out later
  • 0:44 - 0:47
    from the presentation,
    go and play with the real thing.
  • 0:47 - 0:51
    Otherwise, you will see only
    the screenshots here from the slides.
  • 0:51 - 0:58
    So the goal-- well, the talk
    will be a failed attempt to communicate
  • 0:58 - 1:02
    an almost endlessly
    technically complicated field
  • 1:03 - 1:07
    in terms that can actually motivate
    people to start making use
  • 1:07 - 1:08
of these analytical products,
  • 1:08 - 1:11
into whose development
we are really putting a lot of effort.
  • 1:11 - 1:14
    So, as I said, I will try
    to provide an overview
  • 1:14 - 1:16
    of the Wikidata Statistics
    and Analytics systems.
  • 1:16 - 1:21
    So I will try to exemplify the usage
    of some of them, not all.
  • 1:21 - 1:23
    And also I will try to go just
    a little bit under the hood
  • 1:23 - 1:27
    to try to illustrate how it is done,
    what is done here,
  • 1:27 - 1:31
    because I thought it might be
    interesting to the audience.
  • 1:32 - 1:34
    Okay, so say...
  • 1:35 - 1:39
    In analytics and data science,
    you always start with formulating
  • 1:39 - 1:42
    as clearly as possible
    your goals and motivations.
  • 1:42 - 1:47
    Otherwise, you enter into endless cycles
    of developing analytical tools
  • 1:47 - 1:50
    and data science products
    that actually do something,
  • 1:50 - 1:53
    but nobody really understands
    what they're being built for.
  • 1:53 - 1:58
    In 2017, in Wikimedia Deutschland,
    a request, a demand was formulated--
  • 1:58 - 2:00
    we said that we needed
    an analytical system
  • 2:00 - 2:02
    that will give an insight into the ways
  • 2:02 - 2:06
    that Wikidata items are reused
    across the Wikimedia projects,
  • 2:06 - 2:09
    meaning across the Wikipedia universe--
    all the encyclopedias,
  • 2:09 - 2:12
    and then Wikivoyage,
    Wikibooks, WikiCite, etc.--
  • 2:12 - 2:16
    all the websites, approximately 800
    that we are actually managing.
  • 2:16 - 2:20
So, just to explain the difference
between the two kinds of data here.
  • 2:20 - 2:24
On the left, for example, you see a small,
a very small subset of Wikidata.
  • 2:24 - 2:28
    These are the languages,
    some of the Slavic, I think, languages,
  • 2:28 - 2:30
    and in Wikidata they are connected,
  • 2:30 - 2:34
through their properties, and they belong
to different classes, etc.
  • 2:34 - 2:37
    But we were looking
    for a different kind of mapping.
  • 2:37 - 2:41
    So what you see here,
    on the right side, is a set of items
  • 2:41 - 2:45
    all belonging to the class
    of architectural structures, I would say.
  • 2:45 - 2:48
    And this here is the result
    of their empirical embeddings.
  • 2:48 - 2:51
    So the items related here--
  • 2:51 - 2:56
    they are linked by their similarity
    of usage across Wikipedias, for example.
  • 2:56 - 2:58
    So what does it mean-- the similarity?
  • 2:59 - 3:03
    To be similar in terms of how an item
    is used across the Wikipedias.
  • 3:03 - 3:07
So imagine you take an array of numbers,
  • 3:07 - 3:11
and each element of the array
    is one project-- it's English Wikipedia,
  • 3:12 - 3:17
    it is French Wikivoyage,
    it is Italian Wikipedia, etc.
  • 3:18 - 3:20
    And then, you count how many times
  • 3:20 - 3:23
    a particular item has been used
    in that project.
  • 3:24 - 3:28
So you use an array of numbers
    to describe the item that way.
  • 3:28 - 3:30
    It's a little bit more complicated
    in practice.
  • 3:31 - 3:36
    And then, you can describe all items
    in Wikidata that were ever used
  • 3:36 - 3:39
    across the websites at all
by such arrays of numbers,
  • 3:39 - 3:41
    called embeddings, technically, right?
  • 3:42 - 3:46
    From those data,
    using different distance metrics,
  • 3:46 - 3:49
    applying machine learning methods,
    doing dimensionality reduction,
  • 3:49 - 3:50
    and similar things,
  • 3:50 - 3:53
    you can actually figure out
    what is the similarity pattern.
  • 3:53 - 3:56
    And here items are connected,
  • 3:56 - 4:01
by how similar their patterns
of usage are across different Wikipedias.
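
A minimal sketch of the embedding idea just described, with made-up usage counts and a tiny set of projects (not the actual WDCM data or pipeline); each item becomes an array of per-project usage counts, and similarity between items is read off the distances between those arrays:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical usage counts: rows are Wikidata items, columns are client
# projects; each cell says how many times the item is used on that project.
projects = ["enwiki", "frwikivoyage", "itwiki"]
items = ["Q243", "Q90", "Q64"]            # illustrative item IDs
usage = np.array([
    [120,  45,  80],
    [300,  60, 150],
    [250,   5, 400],
], dtype=float)

# Each row is the item's "embedding". Cosine distance between rows gives one
# possible usage-similarity structure; the real system combines other
# distance metrics and dimensionality reduction on far larger matrices.
similarity = 1.0 - squareform(pdist(usage, metric="cosine"))
for i, qid in enumerate(items):
    nearest = max((j for j in range(len(items)) if j != i),
                  key=lambda j: similarity[i, j])
    print(f"{qid} is most similar in usage to {items[nearest]}")
```
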
  • 4:02 - 4:05
    Once again, every visualization,
    every result that I show--
  • 4:05 - 4:08
    there is a link on the presentation,
    so you can go and check for yourself.
  • 4:08 - 4:11
    You can play
    with this thing interactively.
  • 4:11 - 4:16
Similarly, we were able to derive
a graph like this one.
  • 4:16 - 4:20
    This one does not connect
    the Wikidata items, it connects projects.
  • 4:20 - 4:23
but by looking at how similar they are
  • 4:23 - 4:27
    in terms of how they use
    different Wikidata items.
  • 4:30 - 4:32
To be as precise as possible,
  • 4:32 - 4:35
    the data that we use to do this--
    they do not live in Wikidata,
  • 4:35 - 4:37
they are not a part of Wikidata,
  • 4:37 - 4:39
the data are not located there at all.
  • 4:39 - 4:42
We have Wikidata,
    we have formulated our motivational goals,
  • 4:42 - 4:46
    and immediately we started talking
    about the data model and the structures.
  • 4:46 - 4:50
    What structures and data models
do you need to answer the questions
  • 4:50 - 4:52
that you initially posed.
  • 4:53 - 4:59
    So there is Wikibase
and its client-side usage tracking mechanism,
  • 4:59 - 5:02
    which is installed in all those wikis,
  • 5:02 - 5:07
    that actually tracks the Wikidata usage
    on a project, on Wikipedia, for example.
  • 5:07 - 5:11
    So every time an item is used
    in [meaningful works]
  • 5:11 - 5:15
or in a different way,
a row enters a huge SQL table
  • 5:15 - 5:18
that records
the usage of that item.
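
To the best of my knowledge, the table Wikibase Client uses for this tracking is wbc_entity_usage on each client wiki; a hedged sketch of reading one wiki's usage rows is below (the replica host is hypothetical, and the column names should be checked against the live schema):

```python
import pymysql  # assumes access to a MediaWiki/Wikibase client database replica

# Wikibase Client tracks entity usage per wiki in a SQL table
# (wbc_entity_usage, to the best of my knowledge). Every use of an item on a
# page adds a row with the entity ID and the "aspect" of usage.
conn = pymysql.connect(host="enwiki.replica.example",   # hypothetical host
                       db="enwiki",
                       read_default_file="~/.my.cnf")
with conn.cursor() as cur:
    cur.execute("""
        SELECT eu_entity_id, COUNT(*) AS n_pages
        FROM wbc_entity_usage
        GROUP BY eu_entity_id
        ORDER BY n_pages DESC
        LIMIT 10
    """)
    for entity_id, n_pages in cur.fetchall():
        print(entity_id, n_pages)
```
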
  • 5:18 - 5:22
    Now, immediately, we had to face
    a data-engineering problem, of course,
  • 5:22 - 5:26
    because we are talking
about hundreds of huge SQL tables,
  • 5:26 - 5:29
    and we had to do
    machine learning and statistics
  • 5:29 - 5:33
    across all the data together,
    not separately,
  • 5:33 - 5:37
    in order to be able to produce structures,
    like this one or like this one.
  • 5:38 - 5:41
    So in cooperation with the Analytics
    Engineering Team of the Foundation,
  • 5:41 - 5:44
    we started transferring
    those data from Wikibase
  • 5:44 - 5:49
    to the Wikimedia Foundation Data Lake
    which is actually a big data storage.
  • 5:49 - 5:53
    The data do not live there
    in a relational database.
  • 5:53 - 5:54
    They live in something similar--
  • 5:54 - 5:57
it's Hadoop, and Hive tables
    are there, etc.,
  • 5:57 - 5:59
    but it's a huge,
    huge engineering procedure.
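
A hedged sketch of what working with such data in the Data Lake can look like once it has been transferred; the Hive table name below is hypothetical, and the real tables and partitions are defined by the Analytics data engineers:

```python
from pyspark.sql import SparkSession, functions as F

# Sketch only: read the (hypothetically named) sqooped usage data from Hive
# and aggregate it across all wikis at once.
spark = (SparkSession.builder
         .appName("wdcm-sketch")
         .enableHiveSupport()
         .getOrCreate())

usage = spark.table("analytics.wbc_entity_usage_all_wikis")   # hypothetical
per_item_per_wiki = (usage
    .groupBy("eu_entity_id", "wiki_db")
    .agg(F.count("*").alias("usage_count")))

# Pivoting wikis into columns gives exactly the "array of numbers per item"
# embedding described earlier, computed across all ~800 wikis together.
embeddings = (per_item_per_wiki
    .groupBy("eu_entity_id")
    .pivot("wiki_db")
    .sum("usage_count")
    .na.fill(0))
embeddings.show(5)
```
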
  • 5:59 - 6:03
    So not all data in analytics,
    especially in big games like this
  • 6:03 - 6:06
    that we have to play
    with Wikidata and Wikipedia,
  • 6:06 - 6:08
    are immediately available to you.
  • 6:08 - 6:09
    One source of complication
  • 6:09 - 6:13
    is before you actually start solving
    the problem in a scientific way,
  • 6:13 - 6:17
to put it that way, is to engineer
the data sets, to prepare the structures
  • 6:17 - 6:21
that you actually need for doing
machine learning, statistics,
  • 6:21 - 6:23
    and similar things.
  • 6:23 - 6:27
    This is a full design of the system
    called the Wikidata Concepts Monitor
  • 6:27 - 6:28
that tracks the Wikidata item reuse statistics.
  • 6:28 - 6:31
    I will not go
    into details here, of course.
  • 6:32 - 6:36
    The obvious complication
    is that-- as I wrote it up--
  • 6:36 - 6:38
    many systems need to work together.
  • 6:38 - 6:41
    You have to synchronize
    many different sources of data,
  • 6:41 - 6:43
    many different infrastructures
  • 6:43 - 6:48
    just in order to make it happen,
    even before starting thinking
  • 6:48 - 6:52
    in terms of methodologies, science,
    statistics, and similar.
  • 6:54 - 6:58
    As I said, we started
    with our goals and motivation,
  • 6:58 - 7:02
    then, typically, the data model
    and the structures that you need--
  • 7:02 - 7:05
    they correspond to those goals
    and motivations that should always be--
  • 7:05 - 7:08
    your first step in developing
    an analytics project.
  • 7:08 - 7:11
    Then you figure out
    it's really too complicated,
  • 7:11 - 7:13
it cannot be done by one person--
  • 7:13 - 7:15
    It cannot be done on one computer,
    to put it that way.
  • 7:15 - 7:18
    So we needed to work
    with the analytics infrastructure
  • 7:18 - 7:20
    and then add an additional layer
    of complication--
  • 7:20 - 7:24
    that's communication
    with external teams and cooperators
  • 7:24 - 7:28
    because, obviously, such a system
    cannot be managed easily by one person.
  • 7:28 - 7:31
    Actually, I think
    it would be pretty impossible.
  • 7:32 - 7:34
    So, as I mentioned,
    there is this Data Lake,
  • 7:34 - 7:38
    our big data storage in Hadoop,
  • 7:38 - 7:42
    and the team of awesome data engineers
    in the Foundation
  • 7:42 - 7:44
    called the Analytics Engineering Team.
  • 7:44 - 7:48
To a data scientist, data engineers are the people
    who actually watch your back
  • 7:48 - 7:49
    while you're trying to do your things.
  • 7:49 - 7:52
    If you cannot rely on
    a good engineering team,
  • 7:52 - 7:54
    there's not much you will be able
    to do by yourself.
  • 7:56 - 8:00
    This infrastructure is actually
    maintained by the Foundation,
  • 8:00 - 8:04
    so you enter through
    several statistical servers--
  • 8:04 - 8:06
    these blue boxes down there.
  • 8:06 - 8:09
    You can communicate
    with the relational database systems.
  • 8:09 - 8:11
We use MariaDB.
  • 8:11 - 8:12
    You can communicate with the Data Lake.
  • 8:12 - 8:18
    And, of course, for your computations,
    you go to the so-called Analytics Cluster
  • 8:18 - 8:21
where you run things
like Apache Spark, which actually--
  • 8:21 - 8:25
    it's the only really efficient way
    to process the data
  • 8:25 - 8:27
    that we need to process.
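
A hedged sketch of that access pattern from one of the statistics servers: small relational lookups go to the MariaDB replicas, while heavy computation is handed to Spark on the Analytics Cluster. The host name and resource settings here are illustrative, not the real configuration:

```python
import pymysql
from pyspark.sql import SparkSession

# Relational lookups against a MariaDB replica (hypothetical host name).
replica = pymysql.connect(read_default_file="~/.my.cnf",
                          host="wikidatawiki.analytics.db.example",
                          db="wikidatawiki")

# Heavy processing goes to Spark running on the cluster (YARN); at this data
# volume, it is the only realistic way to process everything together.
spark = (SparkSession.builder
         .appName("wikidata-analytics-sketch")
         .master("yarn")
         .config("spark.executor.memory", "8g")       # illustrative settings
         .config("spark.executor.instances", "16")
         .enableHiveSupport()
         .getOrCreate())
```
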
  • 8:27 - 8:32
    When I started doing this back in 2017,
    I remember when I saw
  • 8:32 - 8:35
    only the schema of the infrastructure
    for the first time.
  • 8:35 - 8:39
    If I could not rely on my colleague
    Adam Shorland--
  • 8:39 - 8:40
    who is still with us
    in Wikimedia Deutschland--
  • 8:40 - 8:44
    I would never make it, I wouldn't even
    know how to navigate the structure.
  • 8:46 - 8:49
    As you start building a project
    to do analytics for Wikidata,
  • 8:49 - 8:52
you see how it gets
    more and more complicated
  • 8:52 - 8:55
    because you have to deal
    with synchronizing different systems,
  • 8:55 - 8:58
    different teams, infrastructures,
different data sets.
  • 8:58 - 9:00
    However, it pays off,
  • 9:00 - 9:03
    that synchronization and all the pain.
  • 9:03 - 9:08
    It can get really nasty sometimes,
    and the most recent example
  • 9:08 - 9:11
    is the production
    of the Data Quality Report for Wikidata.
  • 9:12 - 9:17
That's an initial assessment
of the quality of the data we have in Wikidata.
  • 9:17 - 9:18
    In order to produce it,
  • 9:18 - 9:22
    we had to rely on the Quality Predictions
    from the ORES system,
  • 9:22 - 9:25
    the machine learning system,
    developed by Aaron Halfaker,
  • 9:25 - 9:28
and the Scoring Platform team,
  • 9:28 - 9:32
and combine that with the Wikidata
    Concepts Monitor reuse statistics.
  • 9:33 - 9:37
Then, the revision history-- the full revision
history of all Wikipedias--
  • 9:37 - 9:40
    is available in one single
    huge big data table
  • 9:40 - 9:41
    called the MediaWiki History.
  • 9:41 - 9:43
    That lives in the Data Lake.
  • 9:43 - 9:47
    And also we had to process
    the JSON Dump in HDFS.
  • 9:47 - 9:49
So we're talking about four
massive structures,
  • 9:49 - 9:52
    like two machine learning systems
    with their complexities,
  • 9:52 - 9:54
    and two huge data sets.
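
As one hedged example of the pieces involved: the ORES item-quality predictions can be fetched per revision from the public scoring API (shown below roughly as it was exposed around the time of this talk; the endpoint and response layout should be verified), and then joined with the Wikidata Concepts Monitor reuse counts:

```python
import requests

# Hedged sketch: request item quality predictions from the ORES scoring API
# for a few hypothetical Wikidata revision IDs.
revids = [1234567, 7654321]
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/wikidatawiki/",
    params={"models": "itemquality", "revids": "|".join(map(str, revids))},
    timeout=30,
)
scores = resp.json()["wikidatawiki"]["scores"]
for revid, payload in scores.items():
    prediction = payload["itemquality"]["score"]["prediction"]   # one of A..E
    print(revid, prediction)

# These per-item predictions are then joined with the reuse counts (and with
# MediaWiki History and the JSON dump in HDFS) to build tables like
# "quality class vs. reuse".
```
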
  • 9:54 - 9:58
    Everything needs to work in sync in order
    to be able to produce the Quality Report
  • 9:58 - 10:01
    that we're presenting
    this year at WikidataCon.
  • 10:01 - 10:04
But if we hadn't done it, if we had failed
or something like that,
  • 10:05 - 10:08
    we couldn't say, we couldn't show
    beautiful things like this.
  • 10:08 - 10:12
    So on the horizontal axis, you have
    the ORES Quality Prediction score.
  • 10:12 - 10:13
    We use five categories.
  • 10:13 - 10:17
    And you can inform yourself-- just google
    "Wikidata data quality categories."
  • 10:17 - 10:19
    You will find the description.
  • 10:19 - 10:22
    The A-class to the left--
    the best items that we have,
  • 10:22 - 10:25
    and at the same time--
    that's the green box--
  • 10:25 - 10:28
    they are the most
    reused items in Wikipedia.
  • 10:28 - 10:30
    So it's not like,
    as Lydia explained yesterday,
  • 10:30 - 10:33
    it's not like all our items
    are of the highest quality.
  • 10:33 - 10:38
    To the contrary, we have many items
    that are not of that high quality,
  • 10:38 - 10:41
    but at least we know
    what we're doing with them.
  • 10:41 - 10:42
    And you can see the regularity.
  • 10:42 - 10:46
    As the quality of an item
    decreases from left to right,
  • 10:46 - 10:49
    the items tend to be less and less reused.
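
A minimal sketch of how a chart like that can be assembled once the two data sets are joined; the frame below is made up purely for illustration:

```python
import pandas as pd

# Hypothetical joined data: one row per item, with its ORES quality class
# and its reuse count from the Wikidata Concepts Monitor.
items = pd.DataFrame({
    "item":    ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"],
    "quality": ["A",  "A",  "B",  "C",  "D",  "E"],
    "reuse":   [900,  450,  300,  120,   40,    5],
})

# Median reuse per quality class illustrates the regularity from the talk:
# reuse tends to drop as quality goes from A to E.
print(items.groupby("quality")["reuse"].median().sort_index())
```
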
  • 10:50 - 10:54
    So also this synchronization
    helped us learn things like this.
  • 10:54 - 10:58
    To the right, for example,
these five time series here.
  • 10:58 - 11:05
    Each time series corresponds
    to one of the quality categories--
  • 11:05 - 11:07
A, B, C, D, or E.
  • 11:07 - 11:11
    And the time is on the horizontal axis
    running from left to right.
  • 11:11 - 11:16
    And you can see here how many items
    from each quality-class
  • 11:16 - 11:19
    received their latest revision when.
  • 11:20 - 11:24
    So the top quality class A
    that is this [inaudible] line
  • 11:24 - 11:30
    which is found, say,
at the rightmost position here,
  • 11:30 - 11:31
    and the shortest line.
  • 11:31 - 11:35
    So those are the best items that we have.
  • 11:35 - 11:38
    And what you can see
    is actually that there is no item
  • 11:38 - 11:45
    that did not receive at least
    one revision after December 2018,
  • 11:45 - 11:48
    meaning one thing-- if you want quality
    in Wikidata, you have to work on it.
  • 11:48 - 11:51
    So the best items that we have
    are actually the items
  • 11:51 - 11:53
    that we're really paying attention to.
  • 11:53 - 11:56
    If you look at the classes
    of lower quality, the other time-series,
  • 11:56 - 11:59
    you will see that we have items
    that were revised in 2012
  • 11:59 - 12:01
    for the last time.
  • 12:01 - 12:03
    So it tells a story of responsibilities--
  • 12:03 - 12:08
    how much work we put
    into the items [that actually work].
  • 12:08 - 12:09
That's what brings quality.
  • 12:13 - 12:17
    While we do these things,
we also try to make as much use
  • 12:17 - 12:20
    of the byproducts
    of these procedures as possible.
  • 12:21 - 12:23
    So, for example, in order
    to develop the project
  • 12:23 - 12:25
    called Wikidata Languages Landscape--
  • 12:25 - 12:28
    I think I mentioned it yesterday
    during the Birthday Presentation--
  • 12:31 - 12:34
    I had to perform a quite thorough study
  • 12:34 - 12:38
of the sub-ontology
of languages in Wikidata.
  • 12:38 - 12:42
    And you know what?
    There are problems in that ontology.
  • 12:46 - 12:48
I tried not to miss
giving you an opportunity.
  • 12:49 - 12:52
    So this is the dashboard actually
    about the languages
  • 12:52 - 12:55
    called the Wikidata Languages Landscape.
  • 12:55 - 12:59
    Once again, you have all the URLs
    in the presentation.
  • 13:00 - 13:04
    So for example, you want to take a look
    at a particular language.
  • 13:04 - 13:09
    Say, English, okay.
  • 13:09 - 13:15
    So the dashboard will generate
    its local ontological context
  • 13:15 - 13:19
    and mark all the relations
of the form instance of,
  • 13:19 - 13:21
subclass of, or part of.
  • 13:22 - 13:24
    Why did I choose to do this?
  • 13:24 - 13:26
    To help you fix the language ontology.
  • 13:26 - 13:32
    Why? Because you will find many languages,
    for example, my native language
  • 13:32 - 13:34
    which used to be Serbo-Croatian,
  • 13:34 - 13:39
    and for silly reasons now we have Serbian
    and Croatian-- it's a political thing.
  • 13:39 - 13:41
    I don't want to go into it,
    but you realize
  • 13:41 - 13:43
    that Serbian is now, for example,
    at the same time
  • 13:43 - 13:47
    a subclass of Serbo-Croatian
    and a part of Serbo-Croatian.
  • 13:47 - 13:48
The same holds for Croatian--
  • 13:48 - 13:51
    Croatian is also a part
    and a subclass of Serbo-Croatian.
  • 13:51 - 13:52
    So Serbo-Croatian used to be a language.
  • 13:52 - 13:55
    Now we don't have
    normative support for it.
  • 13:55 - 13:57
    But still, it's not a language class,
    it's a language.
  • 13:57 - 14:01
    Can it be a part of it
    or can it be a subclass of it?
  • 14:01 - 14:03
So it's a confusion of mereological
    and set-theoretic relations,
  • 14:03 - 14:06
    and I think it should be fixed somehow.
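
A hedged sketch of how such suspicious cases can be pulled directly from the Wikidata Query Service; the IDs used here are P31 (instance of), P279 (subclass of), P361 (part of), and Q34770 (language), and should be double-checked before relying on the result:

```python
import requests

# Find languages that are simultaneously a subclass of and a part of the
# same entity -- the mixing of set-theoretic and part-whole relations
# mentioned above. Runs against the public Wikidata Query Service.
query = """
SELECT ?lang ?langLabel ?parent ?parentLabel WHERE {
  ?lang wdt:P31 wd:Q34770 .      # instance of: language
  ?lang wdt:P279 ?parent .       # subclass of some entity
  ?lang wdt:P361 ?parent .       # ... and part of the same entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "language-ontology-check-sketch/0.1"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["langLabel"]["value"], "<->", row["parentLabel"]["value"])
```
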
  • 14:07 - 14:09
    In other words, don't say
  • 14:10 - 14:15
    that you don't have the tool
    to fix the ontology.
  • 14:15 - 14:18
    Just find some time and go play with it.
  • 14:19 - 14:22
Speaking of languages, as I mentioned,
let me just show you this project.
  • 14:23 - 14:27
    Many people liked this thing
that I published online on Twitter.
  • 14:27 - 14:29
    That's one of the things, you know.
  • 14:29 - 14:33
    Data science is usually
    sold via visualizations.
  • 14:33 - 14:34
    People like to visualize things,
  • 14:34 - 14:37
    and, of course,
    we do pay attention to that.
  • 14:38 - 14:40
    Aesthetics is a part of communication.
  • 14:42 - 14:44
    It's not the most important thing
    for a scientific finding
  • 14:44 - 14:45
    to show you something beautiful,
  • 14:45 - 14:49
    but if you can show something beautiful,
    you shouldn't miss the opportunity.
  • 14:49 - 14:52
    So here we did
    with the languages in Wikidata
  • 14:52 - 14:54
    the same thing that we do
    with items and projects
  • 14:54 - 14:56
    in the Wikidata Concepts Monitor.
  • 14:56 - 15:03
    We actually group languages by similarity,
    and the similarity was defined
  • 15:03 - 15:06
as how much they overlap
    across the items.
  • 15:06 - 15:11
    So if I can talk about
    the same things in English
  • 15:11 - 15:14
    and in some West-African
    language, for example,
  • 15:14 - 15:16
    then those two things, those two languages
  • 15:16 - 15:19
    are similar in terms
    of their reference sets.
  • 15:19 - 15:21
    What they can refer to.
  • 15:22 - 15:25
    Each language here
  • 15:25 - 15:27
    points to its closest neighbor,
    nearest neighbor--
  • 15:27 - 15:30
to the one which is most similar to it.
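
A minimal sketch of that nearest-neighbor idea with made-up label sets: each language's reference set is approximated by the items it labels, overlap is measured with Jaccard similarity, and every language points to its single most similar neighbor:

```python
# Hypothetical sets of item IDs that carry a label in each language.
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4", "Q5"},
    "fr": {"Q1", "Q2", "Q3", "Q6"},
    "yo": {"Q2", "Q3", "Q7"},          # a West African language, as in the talk
}

def jaccard(a: set, b: set) -> float:
    """Overlap of two reference sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Every language points to its nearest neighbor; these edges form a graph
# that is not fully connected, so groupings emerge on their own.
for lang, items in labels.items():
    neighbor = max((other for other in labels if other != lang),
                   key=lambda other: jaccard(items, labels[other]))
    print(f"{lang} -> {neighbor}")
```
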
  • 15:30 - 15:36
    And, of course, you can see
    these groupings actually occur naturally.
  • 15:36 - 15:38
    So it's not a fully-connected graph.
  • 15:38 - 15:41
    Clustering this thing
    was nothing like [there is].
  • 15:41 - 15:44
    Also, what you can learn
    from the Languages Landscape project
  • 15:44 - 15:49
is what happens when you combine our data
    with external resources.
  • 15:49 - 15:51
    So this is also very informative for us,
  • 15:51 - 15:54
    for the whole, I would say,
    Wikimedia community.
  • 15:55 - 15:57
    We have the UNESCO language status
  • 15:57 - 16:00
    which Wikidata actually gets from UNESCO,
  • 16:00 - 16:02
    its websites and databases,
  • 16:02 - 16:05
    and the Ethnologue language status
    on the vertical axes.
  • 16:05 - 16:09
    We have the Concepts Monitor
    reuse statistic.
  • 16:09 - 16:13
    So we look at all the items that have
    a label in a particular language,
  • 16:13 - 16:16
    and then we look at
    how popular those items are,
  • 16:16 - 16:18
    how many times people used them.
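
A small sketch of that statistic with hypothetical inputs: take the items that carry a label in each language and sum how often those items are reused:

```python
import pandas as pd

# Hypothetical inputs: which items carry a label in which language,
# and how often each item is reused across the projects.
labels = pd.DataFrame({"item": ["Q1", "Q1", "Q2", "Q3"],
                       "language": ["en", "yo", "fr", "fr"]})
reuse = pd.DataFrame({"item": ["Q1", "Q2", "Q3"],
                      "reuse": [500, 120, 30]})

# Reuse "popularity" per language: total reuse of the items labelled in it.
per_language = (labels.merge(reuse, on="item")
                      .groupby("language")["reuse"].sum())
print(per_language)
```
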
  • 16:19 - 16:25
    Of course, those safe national languages,
    languages that are not endangered,
  • 16:26 - 16:28
    they have a slight advantage.
  • 16:28 - 16:31
    But the situation is not really that bad.
  • 16:31 - 16:34
    Say, for example, take a look
    at the Ethnologue category
  • 16:34 - 16:37
    of "Second language only"--
    that's the rightmost one.
  • 16:37 - 16:42
    You will see three languages
    there being reused
  • 16:42 - 16:44
    in a way comparable to the most favorable,
  • 16:44 - 16:47
    not endangered category
    of national languages.
  • 16:48 - 16:49
It's not like the gender bias.
  • 16:49 - 16:54
    Wikipedia seems to be really reflecting
    the gender bias that exists in the world.
  • 16:54 - 16:58
Then we have nice initiatives, like Women in Red,
that are trying to fix this thing.
  • 16:58 - 17:02
    With languages, well, of course,
    some languages are a little bit favored,
  • 17:02 - 17:04
    but it's not that bad,
  • 17:04 - 17:08
    and that finding really brought
a lot of joy to us.
  • 17:09 - 17:13
    Now, speaking of external resources,
    every time that I look at this graph,
  • 17:13 - 17:16
    I say to myself, "We know
    who is the queen of the databases."
  • 17:18 - 17:22
You know the external identifier
properties in Wikidata.
  • 17:23 - 17:30
    So here we take all external identifiers
that were present in the August
  • 17:32 - 17:35
    JSON Dump of Wikidata, which we processed.
  • 17:35 - 17:38
    Then, once again,
    did some statistics on it
  • 17:38 - 17:45
    and grouped all the external identifiers
    by how much they overlap across the items.
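
A hedged sketch of that kind of processing: stream the entity JSON dump, keep the external-identifier properties each item carries, and count how often pairs of identifiers co-occur on the same item. The dump path is hypothetical, and the real job ran over the dump in HDFS rather than a local file:

```python
import bz2
import json
from collections import Counter
from itertools import combinations

DUMP = "wikidata-20190801-all.json.bz2"   # hypothetical local path

pair_counts = Counter()   # how many items share each pair of identifiers
with bz2.open(DUMP, "rt") as f:
    for line in f:
        line = line.rstrip(",\n")
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        # External-identifier properties present on this item.
        ext_ids = sorted(
            prop for prop, statements in entity.get("claims", {}).items()
            if statements
            and statements[0]["mainsnak"].get("datatype") == "external-id"
        )
        pair_counts.update(combinations(ext_ids, 2))

# Identifier pairs that co-occur on many items end up close together in the
# overlap-based grouping shown on the slide.
print(pair_counts.most_common(10))
```
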
  • 17:51 - 17:53
    Aha, here we are.
  • 17:55 - 17:58
    That visualization, except for maybe
    being aesthetically pleasing,
  • 17:58 - 18:00
    is not that useful,
  • 18:00 - 18:03
    but you have an interactive version
    developed in the dashboard.
  • 18:04 - 18:08
    If you go and inspect
    the interactive version,
  • 18:08 - 18:11
    you can learn, for example,
    one obvious fact
  • 18:11 - 18:14
    that they really follow
    some natural semantics.
  • 18:14 - 18:16
    They are grouped in intuitive ways.
  • 18:16 - 18:22
    We should be perfectly expecting them
    to give some feedback on the quality
  • 18:22 - 18:24
    of the organizational data in Wikidata,
  • 18:24 - 18:27
telling us that the situation
is really not that bad.
  • 18:27 - 18:30
    What I am saying is
    that all the external identifiers
  • 18:30 - 18:32
    from the databases
    on sports, for example,
  • 18:32 - 18:35
    you will find to be in one cluster.
  • 18:35 - 18:39
    And then, for example, you will even
    be able to figure out what sport.
  • 18:39 - 18:44
    Databases on tennis are here,
    databases on football are here, etc.
  • 18:48 - 18:51
    Yes, these external resources
  • 18:51 - 18:54
    are things that we really try
    to pay a lot of attention to.
  • 18:55 - 19:00
    All right, as I said, the final thing
    is communication and aesthetics.
  • 19:00 - 19:01
    We do pay attention to it.
  • 19:01 - 19:04
    So, for example, this thing--
    many people liked it.
  • 19:04 - 19:07
    It's a little bit rescaled for aesthetics,
  • 19:07 - 19:12
    the same network of external identifiers
    that you were able to see.
  • 19:12 - 19:16
    But you don't get
    to these results for free, of course.
  • 19:17 - 19:20
    For example, this one was obtained
    by running a clustering algorithm
  • 19:20 - 19:24
    on Jaccard distances--
    technical terms, I'm not going into it.
  • 19:24 - 19:29
    And first, we had to start from a matrix
    actually derived from 408 languages
  • 19:29 - 19:32
that are reused across the Wikimedia projects.
  • 19:32 - 19:35
    Wikidata knows about
    many languages, not only 400.
  • 19:35 - 19:40
But only about 400 of them are actually
used as labels of the items that get reused,
  • 19:40 - 19:44
giving a 400-by-60-million-items contingency
matrix-- that's a lot of computations.
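
A minimal sketch of that last step on a tiny made-up contingency matrix (languages by items) instead of the real 400-by-60-million one: compute pairwise Jaccard distances between the rows and hand them to a standard clustering routine (the actual algorithm used may differ):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster, linkage

# Tiny stand-in for the languages-by-items contingency matrix:
# rows are languages, columns are items, True = the language labels that item.
languages = ["en", "fr", "de", "yo"]
matrix = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
], dtype=bool)

# Jaccard distances between the rows, then agglomerative clustering on them.
dist = pdist(matrix, metric="jaccard")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(languages, clusters)))
```
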
  • 19:45 - 19:47
    To add an additional layer of complication
  • 19:47 - 19:51
and, of course, this is the most beautiful part
    of your work as a data scientist,
  • 19:51 - 19:55
    but it doesn't get to occupy
  • 19:55 - 19:58
    more than, say, 10% or 15% of your time,
  • 19:58 - 20:01
    because everything else
    goes to data engineering
  • 20:01 - 20:03
    and synchronization of different systems.
  • 20:03 - 20:05
    With the machine learning
and statistics things,
  • 20:05 - 20:07
    we use plenty of different algorithms.
  • 20:07 - 20:13
I don't think now is the time to go
and talk about the details of these things.
  • 20:13 - 20:15
    I have plenty of opportunities
    to discuss them,
  • 20:15 - 20:18
    but it's typically
    a highly technical topic,
  • 20:18 - 20:21
    better suited for a scientific conference.
  • 20:23 - 20:27
Here are all the layers of complexity.
  • 20:27 - 20:30
    In the end, we have to add
    deployment and dashboards,
  • 20:30 - 20:33
because they won't build
themselves, these things.
  • 20:34 - 20:37
    And all these things, all these phases
  • 20:37 - 20:41
of development of an analytics
or data science project
  • 20:41 - 20:47
    need to fit together in order
    to be able to derive empirical results
  • 20:47 - 20:49
on a system of Wikidata's complexity.
  • 20:50 - 20:54
    The true picture is that you cannot
    really just run through these cycles.
  • 20:54 - 20:57
    All the phases of the process
    are interdependent
  • 20:57 - 21:00
    because you really
    have to plan very early on
  • 21:00 - 21:04
what visualizations you are going to use,
    what technology you will use
  • 21:04 - 21:07
    to render those visualizations in the end.
  • 21:07 - 21:09
    What machine learning algorithms
    you will be using,
  • 21:09 - 21:14
    because all of them have their own taste
    about what data structures they like.
  • 21:14 - 21:17
    And then you hit the constraints
of the infrastructure, and similar things.
  • 21:17 - 21:19
    I am not complaining,
    I'm really enjoying this.
  • 21:19 - 21:22
    This is the most beautiful playground
    I've ever seen in my life.
  • 21:22 - 21:25
    Thanks to you and people
    who built Wikidata.
  • 21:25 - 21:26
    Thank you very much!
  • 21:26 - 21:28
    That would be it.
  • 21:28 - 21:30
    (moderator) Thank you, Goran.
  • 21:30 - 21:32
    (applause)
  • 21:33 - 21:35
    (moderator) You have time
    for a couple of questions.
  • 21:44 - 21:48
    (man) Well, you did a lot of research,
    I can see that.
  • 21:48 - 21:49
    (Goran) Sorry?
  • 21:49 - 21:52
    (man) You did a lot of research,
    I can see that.
  • 21:52 - 21:57
I'm wondering if there is anything
    that you discovered during the research
  • 21:57 - 21:59
    that surprised you.
  • 21:59 - 22:01
    Thank you for that question.
  • 22:01 - 22:08
    Actually, I wanted to focus
    on that in this talk
  • 22:08 - 22:11
    until I realized that we simply
    won't have enough time
  • 22:11 - 22:14
    to explain everything.
  • 22:15 - 22:19
    Most of the time
    when you're analyzing big datasets
  • 22:19 - 22:22
    structured in a way like Wikidata.
  • 22:22 - 22:26
even when you're going into the wild,
meaning studying the reuse of data
  • 22:26 - 22:27
    across Wikipedia,
  • 22:27 - 22:31
    where actually people can do
    whatever they like with those items,
  • 22:32 - 22:34
    you have a lot of data,
    a lot of information.
  • 22:34 - 22:36
    Of course, you see structure.
  • 22:36 - 22:40
    Most of the time, 90% of the time,
    you see things that are expected.
  • 22:41 - 22:47
Things like which projects
make the most use of Wikidata.
  • 22:47 - 22:50
    And you can almost--
you don't have to do too much statistics,
  • 22:51 - 22:55
you can rely on your expectations
about the world and see what's happening.
  • 22:57 - 22:59
    Many things were surprising,
  • 22:59 - 23:03
    and those things that were surprising
    are really the most informative things.
  • 23:05 - 23:09
    When one communicates the findings
    from analytics and such systems,
  • 23:09 - 23:14
it's important-- people typically expect
    either "wow" visualizations
  • 23:14 - 23:18
and we have tons of data, so we can always
    deliver "wow" visualizations,
  • 23:19 - 23:22
    or they expect to learn things like,
  • 23:22 - 23:24
    "Our project is doing better
    than this project"
  • 23:24 - 23:26
    or "Yes, we are rocking!" etc.,
  • 23:26 - 23:30
    while the goal of the whole game
    should actually be to learn
  • 23:30 - 23:34
    what is wrong, what is not working,
    what could be done better.
  • 23:35 - 23:36
    Many things were surprising.
  • 23:38 - 23:42
    For example, the distribution
    of item usage across languages--
  • 23:42 - 23:44
    that was surprising to me.
  • 23:44 - 23:45
    This thing.
  • 23:47 - 23:51
    So I did not really expect
    that the situation with languages
  • 23:51 - 23:54
would be this good, I would say.
  • 23:55 - 24:01
    My expectation would be that languages
    that have less economic support,
  • 24:01 - 24:04
    normative support,
    even political support--
  • 24:04 - 24:07
    that's a fact when you talk
    about languages--
  • 24:07 - 24:12
    will not be so widely reused
    across the Wikimedia universe.
  • 24:12 - 24:16
    In fact, it turns out
    that the differences-- we can see them,
  • 24:16 - 24:19
but it's far away from the gender bias,
which is really bad, I think--
  • 24:19 - 24:21
    we need to work there.
  • 24:21 - 24:22
    That was surprising, for example.
  • 24:22 - 24:26
    It was a positive surprise,
    to put it that way.
  • 24:26 - 24:28
    Then from time to time,
    we discover projects
  • 24:29 - 24:35
    that actually do a great job by reusing
the Wikidata content in Wikimedia.
  • 24:35 - 24:38
    We're totally surprised to learn that
    such a project can do it.
  • 24:39 - 24:43
    Then you start thinking, you figure out
    there is a community of people
  • 24:43 - 24:44
    actually doing it.
  • 24:44 - 24:49
    And it's a strange feeling because I get
    to see all these things through machines,
  • 24:49 - 24:52
    through databases,
    through visualizations and tables,
  • 24:52 - 24:58
    and it's always that strange feeling
    when I realize this result was produced
  • 24:58 - 25:03
by a group of people who don't even know
that I'm looking at their results now.
  • 25:06 - 25:08
    (moderator) Another question?
  • 25:14 - 25:15
    Thank you.
  • 25:15 - 25:16
    Is that it? Thank you very much!
  • 25:16 - 25:18
    (moderator) Thank you.
  • 25:18 - 25:20
    (applause)