-
Yes, Wikidata Statistics:
What, Where, and How?
-
This is an attempt at an overview
of our analytical systems
-
focusing on what was developed
with the Wikimedia Deutschland
-
in the previous almost three years
-
since I started doing data science
for Wikidata and Wiktionary.
-
So, during this presentation,
I will try to switch from the presentation
-
to the dashboards
and show you the end data products.
-
However, if that causes any trouble--
-
this is actually the URL
of the analytics portal.
-
So everything that
I will be presenting here,
-
whatever you can see on the slides,
you can also check out later
-
from the presentation,
go and play with the real thing.
-
Otherwise, you will see only
the screenshots here from the slides.
-
So the goal-- well, the talk
will be a failed attempt to communicate
-
an almost endlessly
technically complicated field
-
in terms that can actually motivate
people to start making use
-
of these analytical products,
-
into whose development
we are really putting a lot of effort.
-
So, as I said, I will try
to provide an overview
-
of the Wikidata Statistics
and Analytics systems.
-
So I will try to exemplify the usage
of some of them, not all.
-
And also I will try to go just
a little bit under the hood
-
to try to illustrate how it is done,
what is done here,
-
because I thought it might be
interesting to the audience.
-
Okay, so say...
-
In analytics and data science,
you always start with formulating
-
as clearly as possible
your goals and motivations.
-
Otherwise, you enter into endless cycles
of developing analytical tools
-
and data science products
that actually do something,
-
but nobody really understands
what they're being built for.
-
In 2017, in Wikimedia Deutschland,
a request, a demand was formulated--
-
we said that we needed
an analytical system
-
that will give an insight into the ways
-
that Wikidata items are reused
across the Wikimedia projects,
-
meaning across the Wikipedia universe--
all the encyclopedias,
-
and then Wikivoyage,
Wikibooks, WikiCite, etc.--
-
all of the approximately 800 websites
that we are actually managing.
-
So just to explain the differences
between the data.
-
On the left, for example, you see a small,
a very small subset of Wikidata.
-
These are the languages--
some of the Slavic languages, I think--
-
and in Wikidata they are connected,
-
but by properties, and they belong
to different classes, etc.
-
But we were looking
for a different kind of mapping.
-
So what you see here,
on the right side, is a set of items
-
all belonging to the class
of architectural structures, I would say.
-
And this here is the result
of their empirical embeddings.
-
So the items related here--
-
they are linked by their similarity
of usage across Wikipedias, for example.
-
So what does it mean-- the similarity?
-
To be similar in terms of how an item
is used across the Wikipedias.
-
So imagine you take an array of numbers,
-
and each element of the array
is one project-- it's English Wikipedia,
-
it is French Wikivoyage,
it is Italian Wikipedia, etc.
-
And then, you count how many times
-
a particular item has been used
in that project.
-
So you use an array of numbers
to describe the item that way.
-
It's a little bit more complicated
in practice.
-
And then, you can describe all items
in Wikidata that were ever used
-
across the websites at all
by such arrays of numbers,
-
called embeddings, technically, right?
-
From those data,
using different distance metrics,
-
applying machine learning methods,
doing dimensionality reduction,
-
and similar things,
-
you can actually figure out
what the similarity pattern is.
-
And here items are connected
-
by how similar their patterns of usage
are across different Wikipedias.
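-
To make this concrete, here is a minimal
sketch of the idea in Python-- the item IDs
are real, but the counts and the pipeline
are toy stand-ins for what actually runs
on the cluster:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy usage counts: position i of each array says how many times
# the item is used on projects[i] (made-up numbers; the real data
# covers roughly 800 wikis).
projects = ["enwiki", "frwikivoyage", "itwiki"]
usage = {
    "Q64":  np.array([120, 35, 18]),   # Berlin
    "Q90":  np.array([150, 80, 25]),   # Paris
    "Q220": np.array([40, 5, 90]),     # Rome
}

# Each item is now described by an array of numbers -- its embedding.
# Link every item to the item with the most similar usage pattern.
for q, vec in usage.items():
    others = {p: v for p, v in usage.items() if p != q}
    nearest = min(others, key=lambda p: cosine(vec, others[p]))
    print(q, "is most similar to", nearest)
```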
-
Once again, every visualization,
every result that I show--
-
there is a link on the presentation,
so you can go and check for yourself.
-
You can play
with this thing interactively.
-
Similarly, we are able to derive
a graph like this one.
-
This one does not connect
the Wikidata items, it connects projects,
-
but by looking at how similar they are
-
in terms of how they use
different Wikidata items.
-
To be as precise as possible,
-
the data that we use to do this--
they do not live in Wikidata,
-
they are not a part of Wikidata;
-
the data are not located here at all.
-
We have Wikidata,
we have formulated our goals and motivation,
-
and immediately we started talking
about the data model and the structures.
-
What structures and data models
do you need to answer the questions
-
that you initially proposed?
-
So there is Wikibase
and the client-side tracking mechanism,
-
which is installed in all those wikis,
-
that actually tracks the Wikidata usage
on a project, on Wikipedia, for example.
-
So every time an item is used
in [meaningful ways]
-
or in a different way--
a row enters a huge SQL table
-
that tracks
the usage of that item.
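-
For the technically curious, that table is
the Wikibase client's wbc_entity_usage table.
A sketch of reading it from Python-- the host
and credentials are placeholders; only the
table and column names are meant to be real:

```python
import pymysql

# Placeholder connection to one wiki's replica database.
conn = pymysql.connect(host="replica.example", user="research",
                       password="...", database="enwiki")

# wbc_entity_usage holds one row per tracked usage
# of a Wikidata entity on a page of this wiki.
with conn.cursor() as cur:
    cur.execute("""
        SELECT eu_entity_id, COUNT(*) AS n_usages
        FROM wbc_entity_usage
        GROUP BY eu_entity_id
        ORDER BY n_usages DESC
        LIMIT 10
    """)
    for entity_id, n_usages in cur.fetchall():
        print(entity_id, n_usages)
```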
-
Now, immediately, we had to face
a data-engineering problem, of course,
-
because we are talking
about hundreds of huge SQL tables,
-
and we had to do
machine learning and statistics
-
across all the data together,
not separately,
-
in order to be able to produce structures,
like this one or like this one.
-
So in cooperation with the Analytics
Engineering Team of the Foundation,
-
we started transferring
those data from Wikibase
-
to the Wikimedia Foundation Data Lake,
which is actually a big data store.
-
The data do not live there
in a relational database.
-
They live in something similar--
-
it's Hadoop, and Hive tables
are there, etc.,
-
but it's a huge,
huge engineering procedure.
-
So not all data in analytics,
especially in big games like this
-
that we have to play
with Wikidata and Wikipedia,
-
are immediately available to you.
-
One source of complication--
-
before you actually start solving
the problem in a scientific way,
-
to put it that way-- is engineering
the datasets, preparing the structures
-
that you actually need for doing
machine learning, statistics,
-
and similar things.
-
This is a full design of the system
called the Wikidata Concepts Monitor
-
that tracks the reuse statistics of Wikidata items.
-
I will not go
into details here, of course.
-
The obvious complication
is that-- as I wrote it up--
-
many systems need to work together.
-
You have to synchronize
many different sources of data,
-
many different infrastructures
-
just in order to make it happen,
even before starting thinking
-
in terms of methodologies, science,
statistics, and similar.
-
As I said, we started
with our goals and motivation,
-
then, typically, the data model
and the structures that you need--
-
they correspond to those goals
and motivations, which should always be
-
your first step in developing
an analytics project.
-
Then you figure out
it's really too complicated,
-
it cannot be done by one person--
-
It cannot be done on one computer,
to put it that way.
-
So we needed to work
with the analytics infrastructure
-
and then add an additional layer
of complication--
-
that's communication
with external teams and cooperators
-
because, obviously, such a system
cannot be managed easily by one person.
-
Actually, I think
it would be pretty impossible.
-
So, as I mentioned,
there is this Data Lake,
-
our big data storage in Hadoop,
-
and the team of awesome data engineers
in the Foundation
-
called the Analytics Engineering Team.
-
To data scientists, data engineers
are the people who actually watch your back
-
while you're trying to do your things.
-
If you cannot rely on
a good engineering team,
-
there's not much you will be able
to do by yourself.
-
This infrastructure is actually
maintained by the Foundation,
-
so you enter through
several statistical servers--
-
these blue boxes down there.
-
You can communicate
with the relational database systems.
-
We use MariaDB.
-
You can communicate with the Data Lake.
-
And, of course, for your computations,
you go to the so-called Analytics Cluster
-
where you run things
like Apache Spark, which is actually
-
the only really efficient way
to process the data
-
that we need to process.
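-
A hedged sketch of what such a Spark job looks
like-- the table and column names here are
illustrative, not the real Data Lake schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("wikidata-usage-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table of per-wiki entity usage.
usage = spark.table("demo.wikidata_entity_usage")

# On how many wikis, and how often, is each item used?
top = (usage.groupBy("entity_id")
            .agg(F.countDistinct("wiki_db").alias("n_wikis"),
                 F.count("*").alias("n_usages"))
            .orderBy(F.desc("n_usages")))
top.show(20)
```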
-
When I started doing this back in 2017,
I remember when I saw
-
only the schema of the infrastructure
for the first time.
-
If I could not rely on my colleague
Adam Shorland--
-
who is still with us
in Wikimedia Deutschland--
-
I would never have made it; I wouldn't even
know how to navigate the structure.
-
As you start building a project
to do analytics for Wikidata,
-
you see how it gets
more and more complicated
-
because you have to deal
with synchronizing different systems,
-
different teams, infrastructures,
different datasets.
-
However, it pays off,
-
that synchronization and all the pain.
-
It can get really nasty sometimes,
and the most recent example
-
is the production
of the Data Quality Report for Wikidata.
-
That's an initial assessment
of the quality of what we have in Wikidata.
-
In order to produce it,
-
we had to rely on the Quality Predictions
from the ORES system,
-
the machine learning system
developed by Aaron Halfaker
-
and the Scoring Platform team,
-
and combine that with the Wikidata
Concepts Monitor reuse statistics.
-
Then, the full revision history
of all Wikipedias
-
is available in one single
huge big data table
-
called the MediaWiki History.
-
That lives in the Data Lake.
-
And also we had to process
the JSON Dump in HDFS.
-
So we're talking about four massive
structures: two machine learning systems
-
with their complexities,
-
and two huge datasets.
-
Everything needs to work in sync in order
to be able to produce the Quality Report
-
that we're presenting
this year at WikidataCon.
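-
As an illustration of one of those moving parts:
at the time, ORES exposed a public scores API
with an itemquality model for Wikidata. Fetching
predictions looks roughly like this-- the revision
IDs are made up, and the exact response layout
is my assumption:

```python
import requests

# Made-up revision IDs, purely for illustration.
revids = ["987654321", "987654322"]
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/wikidatawiki/",
    params={"models": "itemquality", "revids": "|".join(revids)},
)
resp.raise_for_status()

# Assumed layout: one predicted quality class (A-E) per revision.
scores = resp.json()["wikidatawiki"]["scores"]
for rev, result in scores.items():
    print(rev, result["itemquality"]["score"]["prediction"])
```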
-
But if we hadn't done it, if we had
skipped something like that,
-
we couldn't say, we couldn't show
beautiful things like this.
-
So on the horizontal axis, you have
the ORES Quality Prediction score.
-
We use five categories.
-
And you can inform yourself-- just google
"Wikidata data quality categories."
-
You will find the description.
-
The A-class to the left--
the best items that we have,
-
and at the same time--
that's the green box--
-
they are the most
reused items in Wikipedia.
-
So it's not like,
as Lydia explained yesterday,
-
it's not like all our items
are of the highest quality.
-
On the contrary, we have many items
that are not of that high quality,
-
but at least we know
what we're doing with them.
-
And you can see the regularity.
-
As the quality of an item
decreases from left to right,
-
the items tend to be less and less reused.
-
So also this synchronization
helped us learn things like this.
-
To the right, for example,
these five time series here.
-
Each time series corresponds
to one of the quality categories--
-
A, B, C, D, or E.
-
And the time is on the horizontal axis
running from left to right.
-
And you can see here how many items
from each quality class
-
received their latest revision when.
-
So the top quality class A--
that is this [inaudible] line,
-
which is found, say,
at the rightmost position here,
-
and is the shortest line.
-
So those are the best items that we have.
-
And what you can see
is actually that there is no item
-
that did not receive at least
one revision after December 2018,
-
meaning one thing-- if you want quality
in Wikidata, you have to work on it.
-
So the best items that we have
are actually the items
-
that we're really paying attention to.
-
If you look at the classes
of lower quality, the other time-series,
-
you will see that we have items
that were revised in 2012
-
for the last time.
-
So it tells a story of responsibility--
-
how much work we put
into the items is actually
-
what brings quality.
-
While we do these things,
we also try to make as much use
-
as possible of the byproducts
of these procedures.
-
So, for example, in order
to develop the project
-
called Wikidata Languages Landscape--
-
I think I mentioned it yesterday
during the Birthday Presentation--
-
I had to perform a quite thorough study
-
of the sub-ontology
of languages in Wikidata.
-
And you know what?
There are problems in that ontology.
-
And I didn't want to miss the chance
to give you an opportunity to fix them.
-
So this is the dashboard actually
about the languages
-
called the Wikidata Languages Landscape.
-
Once again, you have all the URLs
in the presentation.
-
So for example, you want to take a look
at a particular language.
-
Say, English, okay.
-
So the dashboard will generate
its local ontological context
-
and mark all the relations
of the form instance of,
-
subclass of, or part of.
-
Why did I choose to do this?
-
To help you fix the language ontology.
-
Why? Because you will find many languages,
for example, my native language
-
which used to be Serbo-Croatian,
-
and for silly reasons now we have Serbian
and Croatian-- it's a political thing.
-
I don't want to go into it,
but you realize
-
that Serbian is now, for example,
at the same time
-
a subclass of Serbo-Croatian
and a part of Serbo-Croatian.
-
The same holds for Croatian--
-
Croatian is also a part
and a subclass of Serbo-Croatian.
-
So Serbo-Croatian used to be a language.
-
Now we don't have
normative support for it.
-
But still, it's not a language class,
it's a language.
-
Can it be a part of it
or can it be a subclass of it?
-
So it's a confusion of mereological
and set-theoretic relations,
-
and I think it should be fixed somehow.
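-
If you want to hunt for these confusions yourself,
a query against the Wikidata Query Service along
these lines should surface them-- P31, P279, and
P361 are the real property IDs; that Q34770 is
the class for language is my assumption:

```python
import requests

# Find languages that are simultaneously a subclass of (P279)
# and a part of (P361) the same thing -- the mereological /
# set-theoretic confusion described above.
query = """
SELECT ?lang ?langLabel ?parent ?parentLabel WHERE {
  ?lang wdt:P31 wd:Q34770 ;
        wdt:P279 ?parent ;
        wdt:P361 ?parent .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["langLabel"]["value"], "<->", row["parentLabel"]["value"])
```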
-
In other words, don't say
-
that you don't have the tool
to fix the ontology.
-
Just find some time and go play with it.
-
Speaking of languages-- as I mentioned,
I just want to show you this project.
-
Many people liked this thing
when I published it online on Twitter.
-
That's one of the things, you know.
-
Data science is usually
sold via visualizations.
-
People like to visualize things,
-
and, of course,
we do pay attention to that.
-
Aesthetics is a part of communication.
-
It's not the most important thing
for a scientific finding
-
to show you something beautiful,
-
but if you can show something beautiful,
you shouldn't miss the opportunity.
-
So here we did
with the languages in Wikidata
-
the same thing that we do
with items and projects
-
in the Wikidata Concepts Monitor.
-
We actually group languages by similarity,
and the similarity was defined
-
as how much they overlap
across the items.
-
So if I can talk about
the same things in English
-
and in some West-African
language, for example,
-
then those two things, those two languages
-
are similar in terms
of their reference sets.
-
What they can refer to.
-
Each language here
-
points to its closest neighbor,
nearest neighbor--
-
to the language which is most similar to it.
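-
In code, that nearest-neighbor linking is simple
once the reference sets exist; a tiny sketch with
made-up data (the real sets span tens of millions
of items):

```python
# Made-up reference sets: the items that have
# a label in each language.
labels = {
    "en": {"Q64", "Q90", "Q220", "Q1"},
    "fr": {"Q64", "Q90", "Q220"},
    "de": {"Q64", "Q1"},
}

def overlap(a, b):
    """Similarity as the number of shared items."""
    return len(labels[a] & labels[b])

# Each language points to the language most similar to it,
# which is why the groupings emerge on their own.
for lang in labels:
    neighbor = max((l for l in labels if l != lang),
                   key=lambda l: overlap(lang, l))
    print(lang, "->", neighbor)
```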
-
And, of course, you can see
these groupings actually occur naturally.
-
So it's not a fully-connected graph.
-
Clustering this thing
was then not difficult at all.
-
Also, what you can learn
from the Languages Landscape project
-
is what you get when you combine
our data with external resources.
-
So this is also very informative for us,
-
for the whole, I would say,
Wikimedia community.
-
We have the UNESCO language status
-
which Wikidata actually gets from UNESCO,
-
its websites and databases,
-
and the Ethnologue language status
on the vertical axes.
-
We have the Concepts Monitor
reuse statistic.
-
So we look at all the items that have
a label in a particular language,
-
and then we look at
how popular those items are,
-
how many times people used them.
-
Of course, those safe national languages,
languages that are not endangered,
-
they have a slight advantage.
-
But the situation is not really that bad.
-
Say, for example, take a look
at the Ethnologue category
-
of "Second language only"--
that's the rightmost one.
-
You will see three languages
there being reused
-
in a way comparable to the most favorable,
-
not endangered category
of national languages.
-
It's not like the gender bias.
-
Wikipedia seems to be really reflecting
the gender bias that exists in the world.
-
Then we have nice initiatives, like the women
who are trying to fix this thing.
-
With languages, well, of course,
some languages are a little bit favored,
-
but it's not that bad,
-
and that finding really
brought us a lot of joy.
-
Now, speaking of external resources,
every time that I look at this graph,
-
I say to myself, "We know
who is the queen of the databases."
-
You know the external identifier
properties in Wikidata.
-
So here we take all external identifiers
that were present in August,
-
JSON Dump of Wikidata, which we processed.
-
Then, once again,
did some statistics on it
-
and grouped all the external identifiers
by how much they overlap across the items.
-
Aha, here we are.
-
That visualization, except for maybe
being aesthetically pleasing,
-
is not that useful,
-
but you have an interactive version
developed in the dashboard.
-
If you go and inspect
the interactive version,
-
you can learn, for example,
one obvious fact
-
that they really follow
some natural semantics.
-
They are grouped in intuitive ways.
-
We could perfectly well expect them
to give some feedback on the quality
-
of the organization of data in Wikidata,
-
and they tell us that the situation
is really not that bad.
-
What I am saying is
that all the external identifiers
-
from the databases
on sports, for example,
-
you will find to be in one cluster.
-
And then, for example, you will even
be able to figure out what sport.
-
Databases on tennis are here,
databases on football are here, etc.
-
Yes, these external resources
-
are things that we really try
to pay a lot of attention to.
-
All right, as I said, the final thing
is communication and aesthetics.
-
We do pay attention to it.
-
So, for example, this thing--
many people liked it.
-
It's a little bit rescaled for aesthetics,
-
the same network of external identifiers
that you were able to see.
-
But you don't get
to these results for free, of course.
-
For example, this one was obtained
by running a clustering algorithm
-
on Jaccard distances--
technical terms, I'm not going into it.
-
And first, we had to start from a matrix
actually derived from 408 languages
-
that are reused across the Wikimedia projects.
-
Wikidata knows about
many more languages, not only 400,
-
but only those 400 or so actually appear
as labels of the items that get reused--
-
a 408-languages-by-60-million-items
contingency matrix; that's a lot of computation.
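-
In miniature, that computation looks something
like this-- a tiny random stand-in for the real
408-languages-by-60-million-items binary matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny stand-in: rows = languages, columns = items,
# True = "this item has a label in this language".
rng = np.random.default_rng(42)
matrix = rng.integers(0, 2, size=(8, 50)).astype(bool)

# Pairwise Jaccard distances between the language rows,
# then hierarchical clustering on top of them.
distances = pdist(matrix, metric="jaccard")
tree = linkage(distances, method="average")
print(fcluster(tree, t=3, criterion="maxclust"))
```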
-
To add an additional layer of complication--
-
the machine learning is, of course,
the most beautiful part
of your work as a data scientist,
-
but it doesn't get to occupy
-
more than, say, 10% or 15% of your time,
-
because everything else
goes to data engineering
-
and synchronization of different systems.
-
As for the machine learning
and statistics,
-
we use plenty of different algorithms.
-
I don't think now is the time to go
and talk about the details of these things.
-
I have plenty of opportunities
to discuss them,
-
but it's typically
a highly technical topic,
-
better suited for a scientific conference.
-
Here are all the layers of complexity.
-
In the end, we have to add
deployment and dashboards,
-
because these things
won't build themselves.
-
And all these things, all these phases
-
of the development of an analytics
or data science project
-
need to fit together in order
to be able to derive empirical results
-
on a system of Wikidata's complexity.
-
The true picture is that you cannot
really just run through these cycles.
-
All the phases of the process
are interdependent
-
because you really
have to plan very early on
-
what visualizations you are going to use,
what technology you will use
-
to render those visualizations in the end.
-
What machine learning algorithms
you will be using,
-
because all of them have their own taste
about what data structures they like.
-
And then you hit the constraints
of infrastructure-- similar things.
-
I am not complaining,
I'm really enjoying this.
-
This is the most beautiful playground
I've ever seen in my life.
-
Thanks to you and people
who built Wikidata.
-
Thank you very much!
-
That would be it.
-
(moderator) Thank you, Goran.
-
(applause)
-
(moderator) You have time
for a couple of questions.
-
(man) Well, you did a lot of research,
I can see that.
-
(Goran) Sorry?
-
(man) You did a lot of research,
I can see that.
-
I'm wondering if there was anything
that you discovered during the research
-
that surprised you.
-
Thank you for that question.
-
Actually, I wanted to focus
on that in this talk
-
until I realized that we simply
won't have enough time
-
to explain everything.
-
Most of the time,
when you're analyzing big datasets
-
structured in the way Wikidata is--
-
even when you're going into the wild,
meaning studying the reuse of data
-
across Wikipedia,
-
where actually people can do
whatever they like with those items,
-
you have a lot of data,
a lot of information.
-
Of course, you see structure.
-
Most of the time, 90% of the time,
you see things that are expected.
-
Things like which projects
make the most use of Wikidata.
-
And you can almost--
you don't have to do too much statistics,
-
you can rely on everyone's expectations
and see what's happening.
-
Many things were surprising,
-
and those things that were surprising
are really the most informative things.
-
When one communicates the findings
from analytics and such systems,
-
it's important-- people typically expect
either "wow" visualizations--
-
and we have tons of data, so we can always
deliver "wow" visualizations--
-
or they expect to learn things like,
-
"Our project is doing better
than this project"
-
or "Yes, we are rocking!" etc.,
-
while the goal of the whole game
should actually be to learn
-
what is wrong, what is not working,
what could be done better.
-
Many things were surprising.
-
For example, the distribution
of item usage across languages--
-
that was surprising to me.
-
This thing.
-
So I did not really expect
that the situation with languages
-
will be this good, I would say.
-
My expectation would be that languages
that have less economic support,
-
normative support,
even political support--
-
that's a fact when you talk
about languages--
-
would not be so widely reused
across the Wikimedia universe.
-
In fact, it turns out
that the differences-- we can see them,
-
but it's far away from the gender bias,
which is really bad, I think--
-
we need to work on that.
-
That was surprising, for example.
-
It was a positive surprise,
to put it that way.
-
Then from time to time,
we discover projects
-
that actually do a great job reusing
the Wikidata content in Wikimedia.
-
We were totally surprised to learn
that such a project could do it.
-
Then you start thinking, you figure out
there is a community of people
-
actually doing it.
-
And it's a strange feeling because I get
to see all these things through machines,
-
through databases,
through visualizations and tables,
-
and it's always that strange feeling
when I realize this result was produced
-
by a group of people who don't even know
that I'm looking at their results now.
-
(moderator) Another question?
-
Thank you.
-
Is that it? Thank you very much!
-
(moderator) Thank you.
-
(applause)