WEBVTT

00:00:06.343 --> 00:00:09.678
Yes, Wikidata Statistics:
What, Where, and How?

00:00:09.678 --> 00:00:13.005
This has been an attempt at an overview
for analytical systems

00:00:13.005 --> 00:00:16.173
focusing on what was developed
with the Wikimedia Deutschland

00:00:16.173 --> 00:00:18.155
in the previous almost three years

00:00:18.155 --> 00:00:22.346
since I started doing data science
for Wikidata and Wiktionary.

00:00:22.346 --> 00:00:28.345
So, during this presentation,
I will try to switch from the presentation

00:00:28.346 --> 00:00:32.092
to the dashboards
and show you the end data products.

00:00:32.995 --> 00:00:35.029
However, if that causes any trouble,

00:00:35.029 --> 00:00:39.070
so this is actually the URL
of the analytics portal.

00:00:39.070 --> 00:00:41.272
So everything that
I will be presenting here,

00:00:41.272 --> 00:00:44.105
whatever you can see on the slides,
you can also check out later

00:00:44.105 --> 00:00:47.285
from the presentation,
go and play with the real thing.

00:00:47.285 --> 00:00:51.101
Otherwise, you will see only
the screenshots here from the slides.

00:00:51.101 --> 00:00:58.275
So the goal-- well, the talk
will be a failed attempt to communicate

00:00:58.275 --> 00:01:01.502
an almost endlessly
technically complicated field

00:01:02.567 --> 00:01:06.843
in terms that can actually motivate
people to start making use

00:01:06.843 --> 00:01:08.338
of this analytical product

00:01:08.338 --> 00:01:11.010
in which development
we are really putting a lot of effort.

00:01:11.010 --> 00:01:13.631
So, as I said, I will try
to provide an overview

00:01:13.631 --> 00:01:15.679
of the Wikidata Statistics
and Analytics systems.

00:01:15.679 --> 00:01:20.636
So I will try to exemplify the usage
of some of them, not all.

00:01:20.636 --> 00:01:23.362
And also I will try to go just
a little bit under the hood

00:01:23.362 --> 00:01:27.453
to try to illustrate how it is done,
what is done here,

00:01:27.453 --> 00:01:31.144
because I thought it might be
interesting to the audience.

00:01:31.818 --> 00:01:33.534
Okay, so say...

00:01:34.804 --> 00:01:38.538
In analytics and data science,
you always start with formulating

00:01:38.538 --> 00:01:41.709
as clearly as possible
your goals and motivations.

00:01:41.709 --> 00:01:47.080
Otherwise, you enter into endless cycles
of developing analytical tools

00:01:47.080 --> 00:01:49.733
and data science products
that actually do something,

00:01:49.733 --> 00:01:52.835
but nobody really understands
what they're being built for.

00:01:52.835 --> 00:01:57.669
In 2017, in Wikimedia Deutschland,
a request, a demand was formulated--

00:01:57.925 --> 00:01:59.740
we said that we needed
an analytical system

00:01:59.740 --> 00:02:01.936
that will give an insight into the ways

00:02:01.936 --> 00:02:05.865
that Wikidata items are reused
across the Wikimedia projects,

00:02:05.865 --> 00:02:09.016
meaning across the Wikipedia universe--
all the encyclopedias,

00:02:09.016 --> 00:02:11.826
and then Wikivoyage,
Wikibooks, WikiCite, etc.--

00:02:11.826 --> 00:02:15.610
all the websites, approximately 800
that we are actually managing.

00:02:15.610 --> 00:02:19.553
So just to explain the differences
between the data.

00:02:19.553 --> 00:02:23.794
On the left, for example, you see a small
or very small subset of Wikidata.

00:02:23.794 --> 00:02:28.114
These are the languages,
some of the Slavic, I think, languages,

00:02:28.114 --> 00:02:30.383
and in Wikidata they are connected,

00:02:30.383 --> 00:02:34.194
but they are properties and belong
to different classes, etc.

00:02:34.194 --> 00:02:36.785
But we were looking
for a different kind of mapping.

00:02:36.785 --> 00:02:41.085
So what you see here,
on the right side, is a set of items

00:02:41.085 --> 00:02:44.823
all belonging to the class
of architectural structures, I would say.

00:02:44.823 --> 00:02:48.496
And this here is the result
of their empirical embeddings.

00:02:48.496 --> 00:02:50.511
So the items related here--

00:02:50.518 --> 00:02:55.952
they are linked by their similarity
of usage across Wikipedias, for example.

00:02:55.952 --> 00:02:57.842
So what does it mean-- the similarity?

00:02:58.632 --> 00:03:03.068
To be similar in terms of how an item
is used across the Wikipedias.

00:03:03.068 --> 00:03:06.943
So imagine you take an array of numbers,

00:03:07.353 --> 00:03:11.107
and each element of the array
is one project-- it's English Wikipedia,

00:03:11.558 --> 00:03:17.417
it is French Wikivoyage,
it is Italian Wikipedia, etc.

00:03:17.901 --> 00:03:20.495
And then, you count how many times

00:03:20.495 --> 00:03:23.085
a particular item has been used
in that project.

00:03:24.112 --> 00:03:27.631
So you use an array of numbers
to describe the item that way.

00:03:27.631 --> 00:03:29.768
It's a little bit more complicated
in practice.

00:03:31.299 --> 00:03:36.074
And then, you can describe all items
in Wikidata that were ever used

00:03:36.074 --> 00:03:39.358
across the websites at all
by such arrays of numbers,

00:03:39.358 --> 00:03:41.320
called embeddings, technically, right?
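
NOTE
A minimal sketch of the usage arrays described in this part of the talk, not the production system. The items, project names, and counts are invented for illustration, and usage_log and embed are hypothetical names. Each item becomes an array with one usage count per project.
```python
from collections import Counter
# Hypothetical usage log: one (item, project) pair per recorded use of an item.
usage_log = [
    ("Q42", "enwiki"), ("Q42", "enwiki"), ("Q42", "frwikivoyage"),
    ("Q64", "enwiki"), ("Q64", "itwiki"), ("Q64", "itwiki"),
]
# Fixed project order: position i of every array refers to projects[i].
projects = ["enwiki", "frwikivoyage", "itwiki"]
def embed(item, log, projects):
    """Return the array of usage counts of `item`, one number per project."""
    counts = Counter(project for used_item, project in log if used_item == item)
    return [counts[project] for project in projects]
embeddings = {item: embed(item, usage_log, projects) for item in ("Q42", "Q64")}
print(embeddings)
```
With these invented counts, Q42 is described by [2, 1, 0] and Q64 by [1, 0, 2]. As the speaker says, the real procedure is more complicated.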

00:03:41.791 --> 00:03:45.513
From those data,
using different distance metrics,

00:03:45.513 --> 00:03:48.893
applying machine learning methods,
doing dimensionality reduction,

00:03:48.893 --> 00:03:50.382
and similar things,

00:03:50.382 --> 00:03:53.093
you can actually figure out
what is the similarity pattern.

00:03:53.093 --> 00:03:55.622
And here items are connected,

00:03:55.622 --> 00:04:00.501
by how similar are their patterns
of usage across different Wikipedias.

00:04:01.726 --> 00:04:04.551
Once again, every visualization,
every result that I show--

00:04:04.551 --> 00:04:08.278
there is a link on the presentation,
so you can go and check for yourself.

00:04:08.278 --> 00:04:10.578
You can play
with this thing interactively.

00:04:10.578 --> 00:04:15.826
Similarly, we will be able to derive
a graph like this one.

00:04:15.826 --> 00:04:20.011
This one does not connect
the Wikidata items, it connects projects.

00:04:20.011 --> 00:04:23.069
By looking at how similar they are

00:04:23.069 --> 00:04:26.779
in terms of how they use
different Wikidata items.

00:04:30.235 --> 00:04:31.733
To be as precise as possible,

00:04:32.468 --> 00:04:35.369
the data that we use to do this--
they do not live in Wikidata,

00:04:35.369 --> 00:04:36.818
they are not a part of the Wikidata,

00:04:36.818 --> 00:04:38.823
data does not at all [locate] here.

00:04:38.823 --> 00:04:41.917
We have the Wikidata,
we have formulated our motivational goals,

00:04:41.917 --> 00:04:45.773
and immediately we started talking
about the data model and the structures.

00:04:45.773 --> 00:04:49.760
What structures and data models
you need to answer the questions

00:04:49.760 --> 00:04:52.412
that you have initially proposed.

00:04:52.941 --> 00:04:59.064
So there is Wikibase
and the client-side tracking mechanism,

00:04:59.064 --> 00:05:01.884
which is installed in all those wikis,

00:05:01.884 --> 00:05:07.001
that actually tracks the Wikidata usage
on a project, on Wikipedia, for example.

00:05:07.001 --> 00:05:10.700
So every time an item is used
in [meaningful works]

00:05:10.700 --> 00:05:14.743
or in a different way--
there is a row in a huge SQL table

00:05:14.743 --> 00:05:18.124
that enters and checks
the usage of that number.

00:05:18.124 --> 00:05:22.326
Now, immediately, we had to face
a data-engineering problem, of course,

00:05:22.326 --> 00:05:26.434
because we are talking
about hundreds of huge SQL tables,

00:05:26.434 --> 00:05:29.301
and we had to do
machine learning and statistics

00:05:29.301 --> 00:05:32.746
across all the data together,
not separately,

00:05:32.746 --> 00:05:37.283
in order to be able to produce structures,
like this one or like this one.

00:05:37.578 --> 00:05:41.332
So in cooperation with the Analytics
Engineering Team of the Foundation,

00:05:41.332 --> 00:05:44.459
we started transferring
those data from Wikibase

00:05:44.459 --> 00:05:49.181
to the Wikimedia Foundation Data Lake
which is actually a big data storage.

00:05:49.181 --> 00:05:52.753
The data do not live there
in a relational database.

00:05:52.753 --> 00:05:54.060
They live in something similar--

00:05:54.060 --> 00:05:56.546
it's Hadoop, and Hive tables
are there, etc.,

00:05:56.546 --> 00:05:58.552
but it's a huge,
huge engineering procedure.

00:05:58.552 --> 00:06:03.405
So not all data in analytics,
especially in big games like this

00:06:03.405 --> 00:06:06.001
that we have to play
with Wikidata and Wikipedia,

00:06:06.001 --> 00:06:07.667
are immediately available to you.

00:06:07.667 --> 00:06:09.171
One source of complication

00:06:09.171 --> 00:06:13.459
is before you actually start solving
the problem in a scientific way,

00:06:13.459 --> 00:06:16.847
to put it that way, is to engineer
the datasets, to prepare the structures

00:06:16.847 --> 00:06:20.805
that you actually need for doing
machine learning, statistics,

00:06:20.805 --> 00:06:22.588
and similar things.

00:06:23.464 --> 00:06:26.921
This is a full design of the system
called the Wikidata Concepts Monitor

00:06:26.921 --> 00:06:28.380
that tracks their reuse statistics.

00:06:28.380 --> 00:06:30.844
I will not go
into details here, of course.

00:06:32.394 --> 00:06:35.764
The obvious complication
is that-- as I wrote it up--

00:06:35.764 --> 00:06:38.432
many systems need to work together.

00:06:38.432 --> 00:06:41.248
You have to synchronize
many different sources of data,

00:06:41.248 --> 00:06:42.846
many different infrastructures

00:06:42.846 --> 00:06:47.994
just in order to make it happen,
even before starting thinking

00:06:47.994 --> 00:06:52.247
in terms of methodologies, science,
statistics, and similar.

00:06:53.955 --> 00:06:57.930
As I said, we started
with our goals and motivation,

00:06:57.930 --> 00:07:01.629
then, typically, the data model
and the structures that you need--

00:07:01.629 --> 00:07:04.881
they correspond to those goals
and motivations that should always be--

00:07:04.881 --> 00:07:08.250
your first step in developing
an analytics project.

00:07:08.250 --> 00:07:10.857
Then you figure out
it's really too complicated,

00:07:10.857 --> 00:07:12.846
it cannot be done by one person--

00:07:12.846 --> 00:07:15.077
It cannot be done on one computer,
to put it that way.

00:07:15.077 --> 00:07:17.771
So we needed to work
with the analytics infrastructure

00:07:17.771 --> 00:07:20.403
and then add an additional layer
of complication--

00:07:20.403 --> 00:07:23.750
that's communication
with external teams and cooperators

00:07:23.750 --> 00:07:28.366
because, obviously, such a system
cannot be managed easily by one person.

00:07:28.366 --> 00:07:31.358
Actually, I think
it would be pretty impossible.

00:07:31.720 --> 00:07:33.587
So, as I mentioned,
there is this Data Lake,

00:07:33.587 --> 00:07:38.091
our big data storage in Hadoop,

00:07:38.091 --> 00:07:41.880
and the team of awesome data engineers
in the Foundation

00:07:41.880 --> 00:07:43.987
called the Analytics Engineering Team.

00:07:43.987 --> 00:07:47.660
To data scientists, data engineers are people
who actually watch your back

00:07:47.660 --> 00:07:49.426
while you're trying to do your things.

00:07:49.426 --> 00:07:51.766
If you cannot rely on
a good engineering team,

00:07:51.766 --> 00:07:54.164
there's not much you will be able
to do by yourself.

00:07:55.636 --> 00:08:00.357
This infrastructure is actually
maintained by the Foundation,

00:08:00.357 --> 00:08:04.127
so you enter through
several statistical servers--

00:08:04.441 --> 00:08:06.152
these blue boxes down there.

00:08:06.152 --> 00:08:09.274
You can communicate
with the relational database systems.

00:08:09.274 --> 00:08:10.531
We used the MariaDB.

00:08:10.531 --> 00:08:12.274
You can communicate with the Data Lake.

00:08:12.274 --> 00:08:17.536
And, of course, for your computations,
you go to the so-called Analytics Cluster

00:08:17.536 --> 00:08:20.712
where you do things
like Apache Spark that actually--

00:08:20.712 --> 00:08:25.138
it's the only really efficient way
to process the data

00:08:25.138 --> 00:08:27.313
that we need to process.

00:08:27.313 --> 00:08:32.219
When I started doing this back in 2017,
I remember when I saw

00:08:32.219 --> 00:08:35.421
only the schema of the infrastructure
for the first time.

00:08:35.421 --> 00:08:38.504
If I could not rely on my colleague
Adam Shorland--

00:08:38.504 --> 00:08:40.471
who is still with us
in Wikimedia Deutschland--

00:08:40.471 --> 00:08:44.008
I would never make it, I wouldn't even
know how to navigate the structure.

00:08:46.070 --> 00:08:49.085
As you start building a project
to do analytics for Wikidata,

00:08:49.085 --> 00:08:52.391
you see how increasingly it gets
more and more complicated

00:08:52.391 --> 00:08:55.046
because you have to deal
with synchronizing different systems,

00:08:55.046 --> 00:08:57.908
different teams, infrastructures,
different datasets.

00:08:58.419 --> 00:08:59.968
However, it pays off,

00:09:00.346 --> 00:09:02.948
that synchronization and all the pain.

00:09:03.282 --> 00:09:07.632
It can get really nasty sometimes,
and the most recent example

00:09:07.632 --> 00:09:10.777
is the production
of the Data Quality Report for Wikidata.

00:09:12.128 --> 00:09:16.926
That's an initial assessment
of the quality of work we had in Wikidata.

00:09:16.926 --> 00:09:18.278
In order to produce it,

00:09:18.278 --> 00:09:22.211
we had to rely on the Quality Predictions
from the ORES system,

00:09:22.211 --> 00:09:25.283
the machine learning system,
developed by Aaron Halfaker,

00:09:25.283 --> 00:09:27.502
and the scoring platform

00:09:28.383 --> 00:09:32.317
to combine that with the Wikidata
Concepts Monitor reuse statistics.

00:09:32.806 --> 00:09:36.691
The revision history, the full revision
history of all Wikipedias

00:09:36.691 --> 00:09:40.009
is available in one single
huge big data table

00:09:40.009 --> 00:09:41.358
called the MediaWiki History.

00:09:41.358 --> 00:09:42.982
That lives in the Data Lake.

00:09:42.982 --> 00:09:46.672
And also we had to process
the JSON Dump in HDFS.

00:09:46.672 --> 00:09:48.804
So we're talking about four
massive structures,

00:09:48.804 --> 00:09:51.946
like two machine learning systems
with their complexities,

00:09:51.946 --> 00:09:53.893
and two huge data sets.

00:09:53.893 --> 00:09:58.231
Everything needs to work in sync in order
to be able to produce the Quality Report

00:09:58.231 --> 00:10:00.750
that we're presenting
this year at WikidataCon.

00:10:00.750 --> 00:10:04.376
But if we didn't do, if we [listed]
or something like that,

00:10:04.759 --> 00:10:07.765
we couldn't say, we couldn't show
beautiful things like this.

00:10:07.765 --> 00:10:12.130
So on the horizontal axis, you have
the ORES Quality Prediction score.

00:10:12.130 --> 00:10:13.490
We use five categories.

00:10:13.490 --> 00:10:17.436
And you can inform yourself-- just google
"Wikidata data quality categories."

00:10:17.436 --> 00:10:18.798
You will find the description.

00:10:18.798 --> 00:10:22.271
The A-class to the left--
the best items that we have,

00:10:22.271 --> 00:10:24.969
and at the same time--
that's the green box--

00:10:24.969 --> 00:10:27.812
they are the most
reused items in Wikipedia.

00:10:27.812 --> 00:10:30.450
So it's not like,
as Lydia explained yesterday,

00:10:30.450 --> 00:10:32.742
it's not like all our items
are of the highest quality.

00:10:32.742 --> 00:10:38.030
To the contrary, we have many items
that are not of that high quality,

00:10:38.030 --> 00:10:40.541
but at least we know
what we're doing with them.

00:10:40.541 --> 00:10:42.124
And you can see the regularity.

00:10:42.124 --> 00:10:46.228
As the quality of an item
decreases from left to right,

00:10:46.228 --> 00:10:49.179
the items tend to be less and less reused.

00:10:49.724 --> 00:10:53.817
So also this synchronization
helped us learn things like this.

00:10:54.225 --> 00:10:57.850
To the right, for example,
these five time series here.

00:10:58.274 --> 00:11:05.252
Each time series corresponds
to one of the quality categories--

00:11:05.252 --> 00:11:06.642
A, B, C, or D.

00:11:06.642 --> 00:11:11.222
And the time is on the horizontal axis
running from left to right.

00:11:11.222 --> 00:11:15.883
And you can see here how many items
from each quality-class

00:11:15.883 --> 00:11:19.305
received their latest revision when.

00:11:19.792 --> 00:11:23.956
So the top quality class A
that is this [inaudible] line

00:11:23.956 --> 00:11:29.693
which is found, say,
at the most right position here,

00:11:29.693 --> 00:11:31.113
and the shortest line.

00:11:31.113 --> 00:11:34.584
So those are the best items that we have.

00:11:35.341 --> 00:11:38.247
And what you can see
is actually that there is no item

00:11:38.247 --> 00:11:44.580
that did not receive at least
one revision after December 2018,

00:11:44.580 --> 00:11:48.118
meaning one thing-- if you want quality 
in Wikidata, you have to work on it.

00:11:48.118 --> 00:11:50.893
So the best items that we have
are actually the items

00:11:50.893 --> 00:11:52.801
that we're really paying attention to.

00:11:52.801 --> 00:11:55.743
If you look at the classes
of lower quality, the other time-series,

00:11:55.743 --> 00:11:59.173
you will see that we have items
that were revised in 2012

00:11:59.173 --> 00:12:00.683
for the last time.

00:12:01.156 --> 00:12:03.348
So it tells a story of responsibilities--

00:12:03.348 --> 00:12:07.694
how much work we put
into the items [that actually work].

00:12:07.694 --> 00:12:09.421
What brings quality.

00:12:13.043 --> 00:12:17.205
While we do these things,
we also try to make as much use

00:12:17.205 --> 00:12:20.163
of the byproducts
of these procedures as possible.

00:12:20.569 --> 00:12:23.308
So, for example, in order
to develop the project

00:12:23.308 --> 00:12:25.425
called Wikidata Languages Landscape--

00:12:25.425 --> 00:12:28.375
I think I mentioned it yesterday
during the Birthday Presentation--

00:12:30.545 --> 00:12:34.444
I had to perform a quite thorough study

00:12:34.444 --> 00:12:37.725
of the sub-ontology
in Wikidata of languages.

00:12:37.725 --> 00:12:41.712
And you know what?
There are problems in that ontology.

00:12:45.502 --> 00:12:48.467
I tried not to miss
to give you an opportunity.

00:12:49.301 --> 00:12:52.247
So this is the dashboard actually
about the languages

00:12:52.247 --> 00:12:54.791
called the Wikidata Languages Landscape.

00:12:54.791 --> 00:12:58.594
Once again, you have all the URLs
in the presentation.

00:12:59.694 --> 00:13:03.720
So for example, you want to take a look
at a particular language.

00:13:03.720 --> 00:13:08.688
Say, English, okay.

00:13:09.448 --> 00:13:14.636
So the dashboard will generate
its local ontological context

00:13:14.636 --> 00:13:19.006
and mark all the relations
of the form instance

00:13:19.006 --> 00:13:21.276
of, subclass of, and part of.

00:13:21.716 --> 00:13:23.845
Why did I choose to do this?

00:13:23.845 --> 00:13:25.991
To help you fix the language ontology.

00:13:25.991 --> 00:13:31.586
Why? Because you will find many languages,
for example, my native language

00:13:31.586 --> 00:13:33.618
which used to be Serbo-Croatian,

00:13:33.618 --> 00:13:38.553
and for silly reasons now we have Serbian
and Croatian-- it's a political thing.

00:13:38.553 --> 00:13:40.554
I don't want to go into it,
but you realize

00:13:40.554 --> 00:13:43.255
that Serbian is now, for example,
at the same time

00:13:43.255 --> 00:13:46.637
a subclass of Serbo-Croatian
and a part of Serbo-Croatian.

00:13:46.955 --> 00:13:48.395
Which still holds for the Croatian--

00:13:48.395 --> 00:13:50.860
Croatian is also a part
and a subclass of Serbo-Croatian.

00:13:50.860 --> 00:13:52.496
So Serbo-Croatian used to be a language.

00:13:52.496 --> 00:13:54.957
Now we don't have
normative support for it.

00:13:54.957 --> 00:13:57.086
But still, it's not a language class,
it's a language.

00:13:57.086 --> 00:14:00.528
Can it be a part of it
or can it be a subclass of it?

00:14:00.528 --> 00:14:03.297
So it's a confusion of mereological
and set-theoretic relations,

00:14:03.297 --> 00:14:05.803
and I think it should be fixed somehow.

00:14:06.656 --> 00:14:09.245
In other words, don't say

00:14:10.129 --> 00:14:14.993
that you don't have the tool
to fix the ontology.

00:14:14.993 --> 00:14:17.859
Just find some time and go play with it.

00:14:19.257 --> 00:14:22.431
Speaking of languages, I mentioned,
just to show you this project.

00:14:22.990 --> 00:14:27.162
Many people liked this thing
that I published online on Twitter.

00:14:27.162 --> 00:14:28.567
That's one of the things, you know.

00:14:28.567 --> 00:14:32.565
Data science is usually
sold via visualizations.

00:14:32.565 --> 00:14:34.202
People like to visualize things,

00:14:34.202 --> 00:14:36.843
and, of course,
we do pay attention to that.

00:14:37.763 --> 00:14:40.385
Aesthetics is a part of communication.

00:14:41.772 --> 00:14:44.051
It's not the most important thing
for a scientific finding

00:14:44.051 --> 00:14:45.348
to show you something beautiful,

00:14:45.348 --> 00:14:48.621
but if you can show something beautiful,
you shouldn't miss the opportunity.

00:14:48.621 --> 00:14:51.876
So here we did
with the languages in Wikidata

00:14:51.876 --> 00:14:53.987
the same thing that we do
with items and projects

00:14:53.987 --> 00:14:56.161
in the Wikidata Concepts Monitor.

00:14:56.161 --> 00:15:02.898
We actually group languages by similarity,
and the similarity was defined

00:15:02.898 --> 00:15:05.800
as how much do they overlap
across the items.

00:15:06.452 --> 00:15:10.531
So if I can talk about
the same things in English

00:15:10.531 --> 00:15:13.973
and in some West-African
language, for example,

00:15:13.973 --> 00:15:15.807
then those two things, those two languages

00:15:15.807 --> 00:15:19.209
are similar in terms
of their reference sets.

00:15:19.209 --> 00:15:21.302
What they can refer to.

00:15:22.330 --> 00:15:24.849
Each language here

00:15:24.849 --> 00:15:27.368
points to its closest neighbor,
nearest neighbor--

00:15:27.368 --> 00:15:29.840
to the one which is most similar to it.
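
NOTE
A minimal sketch of the nearest-neighbour graph described here, assuming that overlap is measured as Jaccard similarity between the sets of items labelled in each language. The language codes, item sets, and the names labelled_items, jaccard, and nearest_neighbour are hypothetical and chosen only for illustration.
```python
# Hypothetical data: for each language, the set of items that have a label in it.
labelled_items = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "de": {"Q1", "Q2", "Q3"},
    "sr": {"Q2", "Q5"},
    "hr": {"Q2", "Q5", "Q6"},
}
def jaccard(a, b):
    """Overlap of two sets: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)
def nearest_neighbour(language, data):
    """Return the other language whose item set overlaps most with `language`."""
    others = (other for other in data if other != language)
    return max(others, key=lambda other: jaccard(data[language], data[other]))
# One directed edge per language, pointing at its most similar language.
edges = {language: nearest_neighbour(language, labelled_items) for language in labelled_items}
print(edges)
```
In this toy data, en and de point at each other and so do sr and hr, so the graph falls into two separate groups rather than being fully connected.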

00:15:29.840 --> 00:15:35.595
And, of course, you can see
these groupings actually occur naturally.

00:15:35.595 --> 00:15:37.549
So it's not a fully-connected graph.

00:15:37.549 --> 00:15:40.838
Clustering this thing
was nothing like [there is].

00:15:41.471 --> 00:15:44.418
Also, what you can learn
from the Languages Landscape project

00:15:44.418 --> 00:15:49.294
is when you combine our data
with external resources.

00:15:49.294 --> 00:15:51.369
So this is also very informative for us,

00:15:51.369 --> 00:15:54.240
for the whole, I would say,
Wikimedia community.

00:15:54.563 --> 00:15:56.636
We have the UNESCO language status

00:15:56.636 --> 00:15:59.755
which Wikidata actually gets from UNESCO,

00:15:59.755 --> 00:16:01.907
its websites and databases,

00:16:01.907 --> 00:16:05.198
and the Ethnologue language status.
On the vertical axes,

00:16:05.198 --> 00:16:08.751
we have the Concepts Monitor
reuse statistic.

00:16:08.945 --> 00:16:12.973
So we look at all the items that have
a label in a particular language,

00:16:12.973 --> 00:16:15.949
and then we look at
how popular those items are,

00:16:15.949 --> 00:16:18.010
how many times people used them.

00:16:19.310 --> 00:16:25.059
Of course, those safe national languages,
languages that are not endangered,

00:16:25.886 --> 00:16:28.165
they have a slight advantage.

00:16:28.165 --> 00:16:30.624
But the situation is not really that bad.

00:16:30.624 --> 00:16:33.660
Say, for example, take a look
at the Ethnologue category

00:16:33.660 --> 00:16:37.206
of "Second language only"--
that's the rightmost one.

00:16:37.206 --> 00:16:41.798
You will see three languages
there being reused

00:16:41.798 --> 00:16:44.445
in a way comparable to the most favorable,

00:16:44.445 --> 00:16:47.456
not endangered category
of national languages.

00:16:47.756 --> 00:16:49.414
It's not like gender bias.

00:16:49.414 --> 00:16:53.784
Wikipedia seems to be really reflecting
the gender bias that exists in the world.

00:16:53.784 --> 00:16:58.130
Then we have nice initiatives like women 
who are trying to fix this thing.

00:16:58.130 --> 00:17:02.210
With languages, well, of course,
some languages are a little bit favored,

00:17:02.210 --> 00:17:04.276
but it's not that bad,

00:17:04.276 --> 00:17:07.872
and that finding really brought
a lot of joy to ourselves.

00:17:08.739 --> 00:17:12.732
Now, speaking of external resources,
every time that I look at this graph,

00:17:12.732 --> 00:17:16.482
I say to myself, "We know
who is the queen of the databases."

00:17:18.122 --> 00:17:22.294
You know the external identifiers
property in Wikidata.

00:17:23.020 --> 00:17:30.171
So here we take all external identifiers
that were present in the August

00:17:31.504 --> 00:17:34.823
JSON Dump of Wikidata, which we processed.

00:17:34.823 --> 00:17:38.079
Then, once again,
did some statistics on it

00:17:38.079 --> 00:17:45.125
and grouped all the external identifiers
by how much they overlap across the items.

00:17:51.228 --> 00:17:52.944
Aha, here we are.

00:17:55.021 --> 00:17:58.363
That visualization, except for maybe
being aesthetically pleasing,

00:17:58.363 --> 00:17:59.691
is not that useful,

00:17:59.691 --> 00:18:03.007
but you have an interactive version
developed in the dashboard.

00:18:04.231 --> 00:18:07.857
If you go and inspect
the interactive version,

00:18:07.857 --> 00:18:10.984
you can learn, for example,
one obvious fact

00:18:10.984 --> 00:18:13.615
that they really follow
some natural semantics.

00:18:13.615 --> 00:18:15.706
They are grouped in intuitive ways.

00:18:16.050 --> 00:18:21.745
We should be perfectly expecting them
to give some feedback on the quality

00:18:21.745 --> 00:18:24.453
of the organizational data in Wikidata,

00:18:24.453 --> 00:18:26.797
telling that situation
is really not that bad.

00:18:27.307 --> 00:18:30.129
What I am saying is
that all the external identifiers

00:18:30.129 --> 00:18:32.230
from the databases
on sports, for example,

00:18:32.230 --> 00:18:34.685
you will find to be in one cluster.

00:18:34.685 --> 00:18:38.681
And then, for example, you will even
be able to figure out what sport.

00:18:39.198 --> 00:18:44.277
Databases on tennis are here,
databases on football are here, etc.

00:18:48.175 --> 00:18:50.670
Yes, these external resources

00:18:50.670 --> 00:18:53.684
are things that we really try
to pay a lot of attention to.

00:18:54.653 --> 00:18:59.781
All right, as I said, the final thing
is communication and aesthetics.

00:18:59.781 --> 00:19:01.265
We do pay attention to it.

00:19:01.265 --> 00:19:04.183
So, for example, this thing--
many people liked it.

00:19:04.183 --> 00:19:07.184
It's a little bit rescaled for aesthetics,

00:19:07.184 --> 00:19:11.808
the same network of external identifiers
that you were able to see.

00:19:11.808 --> 00:19:16.318
But you don't get
to these results for free, of course.

00:19:16.707 --> 00:19:20.163
For example, this one was obtained
by running a clustering algorithm

00:19:20.163 --> 00:19:23.946
on Jaccard distances--
technical terms, I'm not going into it.

00:19:23.946 --> 00:19:29.093
And first, we had to start from a matrix
actually derived from 408 languages

00:19:29.093 --> 00:19:31.852
that are reused across the Wikimedia.

00:19:31.852 --> 00:19:35.268
Wikidata knows about
many languages, not only 400.

00:19:35.268 --> 00:19:39.704
But only 400 of them are actually
labels of the items that get reused

00:19:39.704 --> 00:19:43.880
across 60 million items contingency
matrix-- that's a lot of computations.
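
NOTE
A minimal sketch of clustering on Jaccard distances computed from a binary contingency matrix. The talk does not name the clustering algorithm, so single-linkage agglomerative clustering is an assumption made here for illustration. The tiny matrix and the names jaccard_distance and single_linkage are hypothetical.
```python
from itertools import combinations
# Hypothetical binary contingency matrix: rows are languages, columns are items,
# and 1 means "this item has a label in this language".
matrix = {
    "en": [1, 1, 1, 1, 0, 0],
    "de": [1, 1, 1, 0, 0, 0],
    "sr": [0, 1, 0, 0, 1, 0],
    "hr": [0, 1, 0, 0, 1, 1],
}
def jaccard_distance(u, v):
    """1 - |intersection| / |union| for two binary rows."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0
def single_linkage(rows, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [{name} for name in rows]
    def gap(c1, c2):
        # Distance between clusters: the smallest distance between their members.
        return min(jaccard_distance(rows[a], rows[b]) for a in c1 for b in c2)
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)
    return clusters
clusters = single_linkage(matrix, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```
This pairwise loop only works for a toy matrix. At the scale mentioned in the talk, hundreds of languages by tens of millions of items, the same idea needs a distributed tool such as the Apache Spark setup described earlier.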

00:19:44.591 --> 00:19:47.112
To add an additional layer of complication

00:19:47.112 --> 00:19:51.382
and, of course, the most beautiful part 
of your work as a data scientist,

00:19:51.382 --> 00:19:55.216
but it doesn't get to occupy

00:19:55.216 --> 00:19:58.266
more than, say, 10% or 15% of your time,

00:19:58.266 --> 00:20:00.932
because everything else
goes to data engineering

00:20:00.932 --> 00:20:03.083
and synchronization of different systems.

00:20:03.083 --> 00:20:04.936
With the machine learning
and statistic things,

00:20:04.936 --> 00:20:07.249
we use plenty of different algorithms.

00:20:07.249 --> 00:20:12.845
I don't think this is now time to go
and talk about details of these things.

00:20:12.845 --> 00:20:14.916
I have plenty of opportunities
to discuss them,

00:20:14.916 --> 00:20:18.466
but it's typically
a highly technical topic,

00:20:18.466 --> 00:20:21.369
better suited for a scientific conference.

00:20:22.999 --> 00:20:26.509
Here are all the layers of complexity.

00:20:26.509 --> 00:20:30.206
In the end, we have to add
deployment and dashboards,

00:20:30.206 --> 00:20:33.445
because they won't build
themselves to this thing.

00:20:33.831 --> 00:20:36.854
And all these things, all these phases

00:20:36.854 --> 00:20:40.581
of development of an analytics
or data science project

00:20:41.188 --> 00:20:46.560
need to fit together in order
to be able to derive empirical results

00:20:46.565 --> 00:20:49.392
on the system of Wikidata's complexity.

00:20:49.848 --> 00:20:53.720
The true picture is that you cannot
really just run through these cycles.

00:20:54.417 --> 00:20:56.884
All the phases of the process
are interdependent

00:20:56.884 --> 00:21:00.012
because you really
have to plan very early on

00:21:00.012 --> 00:21:04.115
what visualizations are you going to use,
what technology you will use

00:21:04.115 --> 00:21:06.654
to render those visualizations in the end.

00:21:06.654 --> 00:21:08.888
What machine learning algorithms
you will be using,

00:21:08.888 --> 00:21:13.534
because all of them have their own taste
about what data structures they like.

00:21:13.534 --> 00:21:16.695
And then you hit the constraints
of infrastructure-- similar things.

00:21:16.695 --> 00:21:18.827
I am not complaining,
I'm really enjoying this.

00:21:18.827 --> 00:21:22.400
This is the most beautiful playground
I've ever seen in my life.

00:21:22.400 --> 00:21:25.381
Thanks to you and people
who built Wikidata.

00:21:25.381 --> 00:21:26.388
Thank you very much!

00:21:26.388 --> 00:21:27.729
That would be it.

00:21:28.119 --> 00:21:29.991
(moderator) Thank you, Goran.

00:21:29.991 --> 00:21:32.290
(applause)

00:21:32.825 --> 00:21:35.261
(moderator) You have time
for a couple of questions.

00:21:44.322 --> 00:21:47.663
(man) Well, you did a lot of research,
I can see that.

00:21:47.663 --> 00:21:48.676
(Goran) Sorry?

00:21:48.676 --> 00:21:51.642
(man) You did a lot of research,
I can see that.

00:21:51.642 --> 00:21:57.244
I'm wondering if there is anything
that you discovered during the research

00:21:57.244 --> 00:21:58.853
that surprised you.

00:21:59.327 --> 00:22:01.356
(Goran) Thank you for that question.

00:22:01.356 --> 00:22:07.663
Actually, I wanted to focus
on that in this talk

00:22:07.663 --> 00:22:11.244
until I realized that we simply
won't have enough time

00:22:11.244 --> 00:22:13.816
to explain everything.

00:22:15.407 --> 00:22:19.247
Most of the time
when you're analyzing big datasets

00:22:19.247 --> 00:22:22.179
structured in a way like Wikidata.

00:22:22.179 --> 00:22:26.345
Even when you're going into the wild,
meaning study the reuse of data

00:22:26.345 --> 00:22:27.442
across Wikipedia,

00:22:27.442 --> 00:22:30.622
where actually people can do
whatever they like with those items,

00:22:31.662 --> 00:22:33.917
you have a lot of data,
a lot of information.

00:22:33.917 --> 00:22:35.603
Of course, you see structure.

00:22:35.603 --> 00:22:40.209
Most of the time, 90% of the time,
you see things that are expected.

00:22:41.195 --> 00:22:46.678
Things like which projects
make the most use of Wikidata.

00:22:46.678 --> 00:22:49.891
And you can almost--
you didn't have to do too much statistics,

00:22:50.721 --> 00:22:54.897
you can rely on the expectations
of all the world and see what's happening.

00:22:56.694 --> 00:22:58.643
Many things were surprising,

00:22:58.643 --> 00:23:03.308
and those things that were surprising
are really the most informative things.

00:23:05.372 --> 00:23:09.069
When one communicates the findings
from analytics and such systems,

00:23:09.486 --> 00:23:14.200
it's important, people typically expect
either "wow" visualizations

00:23:14.200 --> 00:23:18.316
and we have tons of data, so we can always
deliver "wow" visualizations,

00:23:18.912 --> 00:23:21.563
or they expect to learn things like,

00:23:21.563 --> 00:23:24.204
"Our project is doing better
than this project"

00:23:24.204 --> 00:23:26.239
or "Yes, we are rocking!" etc.,

00:23:26.239 --> 00:23:30.148
while the goal of the whole game
should actually be to learn

00:23:30.148 --> 00:23:34.128
what is wrong, what is not working,
what could be done better.

00:23:34.938 --> 00:23:36.451
Many things were surprising.

00:23:38.341 --> 00:23:42.061
For example, the distribution
of item usage across languages--

00:23:42.061 --> 00:23:43.850
that was surprising to me.

00:23:43.850 --> 00:23:45.014
This thing.

00:23:47.098 --> 00:23:51.348
So I did not really expect
that the situation with languages

00:23:51.348 --> 00:23:54.352
will be this good, I would say.

00:23:54.830 --> 00:24:01.332
My expectation would be that languages
that have less economic support,

00:24:01.332 --> 00:24:03.651
normative support,
even political support--

00:24:03.651 --> 00:24:06.601
that's a fact when you talk
about languages--

00:24:06.601 --> 00:24:11.521
will not be so widely reused
across the Wikimedia universe.

00:24:11.521 --> 00:24:15.540
In fact, it turns out
that the differences-- we can see them,

00:24:15.540 --> 00:24:18.977
but it's far away from gender bias
which is really bad, I think,

00:24:18.977 --> 00:24:20.707
we need to work there.

00:24:20.707 --> 00:24:22.456
That was surprising, for example.

00:24:22.456 --> 00:24:25.725
It was a positive surprise,
to put it that way.

00:24:25.725 --> 00:24:28.271
Then from time to time,
we discover projects

00:24:28.821 --> 00:24:34.775
that actually do a great job by reusing
the Wikidata content and Wikimedia.

00:24:34.775 --> 00:24:37.895
We're totally surprised to learn that
such a project can do it.

00:24:38.612 --> 00:24:42.554
Then you start thinking, you figure out
there is a community of people

00:24:42.554 --> 00:24:44.000
actually doing it.

00:24:44.468 --> 00:24:48.735
And it's a strange feeling because I get
to see all these things through machines,

00:24:48.735 --> 00:24:51.971
through databases,
through visualizations and tables,

00:24:51.971 --> 00:24:58.165
and it's always that strange feeling
when I realize this result was produced

00:24:58.165 --> 00:25:03.094
by a group of people, they don't even know
that I'm looking at their result now.

00:25:06.101 --> 00:25:07.832
(moderator) Another question?

00:25:13.657 --> 00:25:14.703
Thank you.

00:25:14.703 --> 00:25:16.237
Is that it? Thank you very much!

00:25:16.237 --> 00:25:17.734
(moderator) Thank you.

00:25:17.734 --> 00:25:19.890
(applause)