Yes, Wikidata Statistics:
What, Where, and How?
This is an attempt at an overview
of the analytical systems,
focusing on what was developed
at Wikimedia Deutschland
over the previous almost three years,
since I started doing data science
for Wikidata and Wiktionary.
So, during this presentation,
I will try to switch from the presentation
to the dashboards
and show you the end data products.
However, in case that causes any trouble,
this is actually the URL
of the analytics portal.
So everything that
I will be presenting here,
whatever you can see on the slides,
you can also check out later
from the presentation--
go and play with the real thing.
Otherwise, you will see only
the screenshots here on the slides.
So the goal-- well, the talk
will be a failed attempt to communicate
an almost endlessly
technically complicated field
in terms that can actually motivate
people to start making use
of these analytical products,
into whose development
we are really putting a lot of effort.
So, as I said, I will try
to provide an overview
of the Wikidata Statistics
and Analytics systems.
So I will try to exemplify the usage
of some of them, not all.
And also I will try to go just
a little bit under the hood
to try to illustrate how it is done,
what is done here,
because I thought it might be
interesting to the audience.
Okay, so say...
In analytics and data science,
you always start with formulating
as clearly as possible
your goals and motivations.
Otherwise, you enter into endless cycles
of developing analytical tools
and data science products
that actually do something,
but nobody really understands
what they're being built for.
In 2017, at Wikimedia Deutschland,
a request, a demand, was formulated--
we said that we needed
an analytical system
that would give insight into the ways
that Wikidata items are reused
across the Wikimedia projects,
meaning across the Wikipedia universe--
all the encyclopedias,
and then Wikivoyage,
Wikibooks, WikiCite, etc.--
all the websites, approximately 800
that we are actually managing.
So just to explain the difference
between the data.
On the left, for example, you see a small,
or very small, subset of Wikidata.
These are languages--
some of the Slavic languages, I think--
and in Wikidata they are connected
through their properties, belonging
to different classes, etc.
But we were looking
for a different kind of mapping.
So what you see here,
on the right side, is a set of items
all belonging to the class
of architectural structures, I would say.
And this here is the result
of their empirical embeddings.
So the items connected here--
they are linked by the similarity
of their usage across Wikipedias, for example.
So what does it mean-- the similarity?
To be similar in terms of how an item
is used across the Wikipedias.
So imagine you take an array of numbers,
and each element of the array
is one project-- it's English Wikipedia,
it is French Wikivoyage,
it is Italian Wikipedia, etc.
And then, you count how many times
a particular item has been used
in that project.
So you use an array of numbers
to describe the item that way.
It's a little bit more complicated
in practice.
And then, you can describe all items
in Wikidata that were ever used
across the websites at all
by such arrays of numbers,
called embeddings, technically, right?
From those data,
using different distance metrics,
applying machine learning methods,
doing dimensionality reduction,
and similar things,
you can actually figure out
what the similarity pattern is.
And here items are connected
by how similar their patterns
of usage are across different Wikipedias.
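To make the idea a bit more concrete, here is a minimal sketch of such usage embeddings-- the projects and counts below are invented for illustration, and the real computation is of course far larger and more involved:

```python
# A minimal sketch of item "usage embeddings": each Wikidata item is
# described by an array of usage counts, one element per Wikimedia project.
# The project list and all counts are invented for illustration only.
import numpy as np

projects = ["enwiki", "frwikivoyage", "itwiki", "dewiki"]

usage = {
    "Q64":  np.array([120, 3, 15, 340]),   # invented counts (Berlin)
    "Q90":  np.array([200, 40, 25, 90]),   # invented counts (Paris)
    "Q220": np.array([180, 35, 310, 60]),  # invented counts (Rome)
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two usage-count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarity of usage patterns across projects.
for x in usage:
    for y in usage:
        if x < y:
            print(x, y, round(cosine_similarity(usage[x], usage[y]), 3))
```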
Once again, every visualization,
every result that I show--
there is a link on the presentation,
so you can go and check for yourself.
You can play
with this thing interactively.
Similarly, we were able to derive
a graph like this one.
This one does not connect
the Wikidata items; it connects projects,
looking at how similar they are
in terms of how they use
different Wikidata items.
To be as precise as possible,
the data that we use to do this--
they do not live in Wikidata,
they are not a part of Wikidata,
the data are not located there at all.
We have Wikidata,
we have formulated our motivating goals,
and immediately we started talking
about the data model and the structures:
what structures and data models
you need to answer the questions
that you have initially proposed.
So there is Wikibase
and the client-side tracking mechanism,
which is installed in all those wikis,
that actually tracks the Wikidata usage
on a project-- on a Wikipedia, for example.
So every time an item is used
in a meaningful way,
or in a different way--
a row enters a huge SQL table
that registers and tracks
the usage of that item.
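As a rough illustration of the shape of that tracking data-- the table and column names here are simplified stand-ins, not the exact production schema-- counting per-item, per-wiki usage from such rows is conceptually just an aggregation:

```python
# A toy stand-in for the client-side usage tracking table: one row per
# (wiki, page, item) usage. Column names are simplified illustrations,
# not the exact production schema.
from collections import Counter
from typing import NamedTuple

class UsageRow(NamedTuple):
    wiki: str       # which project, e.g. "enwiki"
    page_id: int    # page on that project where the item is used
    entity_id: str  # the Wikidata item, e.g. "Q64"

rows = [
    UsageRow("enwiki", 101, "Q64"),
    UsageRow("enwiki", 102, "Q64"),
    UsageRow("dewiki", 7,   "Q64"),
    UsageRow("enwiki", 103, "Q90"),
]

# Per-item, per-wiki usage counts -- the raw material for the
# usage "embeddings" described above.
counts = Counter((r.entity_id, r.wiki) for r in rows)
print(counts)
```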
Now, immediately, we had to face
a data-engineering problem, of course,
because we are talking
about hundreds of huge SQL tables,
and we had to do
machine learning and statistics
across all the data together,
not separately,
in order to be able to produce structures
like this one or like this one.
So in cooperation with the Analytics
Engineering Team of the Foundation,
we started transferring
those data from Wikibase
to the Wikimedia Foundation Data Lake
which is actually a big data storage.
The data do not live there
in a relational database.
They live in something similar--
it's Hadoop, with Hive tables
on top of it, etc.--
but it's a huge,
huge engineering procedure.
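For a sense of what working against the Data Lake looks like, here is a minimal Apache Spark sketch; the table and column names are hypothetical placeholders, not the actual production tables:

```python
# A minimal sketch of aggregating Wikidata usage in the Data Lake with
# Apache Spark. Table and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("wikidata-usage-sketch")
    .enableHiveSupport()   # Hive tables in the Data Lake
    .getOrCreate()
)

usage = spark.table("wmf.wikidata_entity_usage")  # hypothetical table name

# Count how many times each item is used on each wiki -- the same
# per-item, per-project counts, but computed across all wikis at once.
counts = (
    usage
    .groupBy("entity_id", "wiki_db")
    .agg(F.count("*").alias("usages"))
)

counts.write.mode("overwrite").saveAsTable("my_db.entity_usage_counts")
```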
So not all data in analytics,
especially in big games like this one
that we have to play
with Wikidata and Wikipedia,
are immediately available to you.
One source of complication is that,
before you actually start solving
the problem in a scientific way,
to put it that way, you have to engineer
the datasets-- to prepare the structures
that you actually need for doing
machine learning, statistics,
and similar things.
This is the full design of the system
called the Wikidata Concepts Monitor
that tracks these reuse statistics.
I will not go
into details here, of course.
The obvious complication
is that-- as I wrote it up--
many systems need to work together.
You have to synchronize
many different sources of data,
many different infrastructures
just in order to make it happen,
even before you start thinking
in terms of methodology, science,
statistics, and the like.
As I said, we started
with our goals and motivation;
then, typically, come the data model
and the structures that you need--
they correspond to those goals
and motivations, which should always be
your first step in developing
an analytics project.
Then you figure out
it's really too complicated--
it cannot be done by one person,
it cannot be done on one computer,
to put it that way.
So we needed to work
with the analytics infrastructure
and then add an additional layer
of complication--
that's communication
with external teams and collaborators
because, obviously, such a system
cannot be managed easily by one person.
Actually, I think
it would be pretty impossible.
So, as I mentioned,
there is this Data Lake,
our big data storage in Hadoop,
and the team of awesome data engineers
in the Foundation
called the Analytics Engineering Team.
To a data scientist, data engineers are the people
who actually watch your back
while you're trying to do your things.
If you cannot rely on
a good engineering team,
there's not much you will be able
to do by yourself.
This infrastructure is actually
maintained by the Foundation,
so you enter through
several statistical servers--
these blue boxes down there.
You can communicate
with the relational database systems.
We used MariaDB.
You can communicate with the Data Lake.
And, of course, for your computations,
you go to the so-called Analytics Cluster,
where you use things
like Apache Spark, which is actually
the only really efficient way
to process the data
that we need to process.
When I started doing this back in 2017,
I remember seeing
just the schema of the infrastructure
for the first time.
If I had not been able to rely on my colleague
Adam Shorland--
who is still with us
in Wikimedia Deutschland--
I would never have made it; I wouldn't even
have known how to navigate the structure.
As you start building a project
to do analytics for Wikidata,
you see how it gets
more and more complicated,
because you have to deal
with synchronizing different systems,
different teams, infrastructures,
different datasets.
However, it pays off--
all that synchronization and all the pain.
It can get really nasty sometimes,
and the most recent example
is the production
of the Data Quality Report for Wikidata.
That's an initial assessment
of the quality of what we have in Wikidata.
In order to produce it,
we had to rely on the quality predictions
from the ORES system--
the machine learning system
developed by Aaron Halfaker
and the Scoring Platform team--
and to combine them with the Wikidata
Concepts Monitor reuse statistics.
We also used the revision history-- the full
revision history of all Wikipedias
is available in one single
huge big data table
called MediaWiki History,
which lives in the Data Lake.
And we also had to process
the JSON dump in HDFS.
So we're talking about four
massive structures:
two machine learning systems
with their complexities,
and two huge datasets.
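As a toy illustration of that combination step-- all item IDs, classes, and counts below are invented-- joining the quality predictions with the reuse counts and summarizing per quality class conceptually looks like this:

```python
# A toy illustration of combining ORES quality predictions with
# Wikidata Concepts Monitor reuse counts. All item IDs, classes, and
# counts are invented; the real datasets are vastly larger.
import pandas as pd

ores_quality = pd.DataFrame({
    "item":  ["Q64", "Q90", "Q220", "Q1"],
    "class": ["A",   "B",   "C",    "E"],   # invented predicted quality class
})

reuse = pd.DataFrame({
    "item":   ["Q64", "Q90", "Q220", "Q1"],
    "usages": [3400,  2100,  650,    12],   # invented total reuse across wikis
})

combined = ores_quality.merge(reuse, on="item", how="inner")

# Median reuse per quality class -- the kind of summary behind the
# "higher quality goes with more reuse" plot.
print(combined.groupby("class")["usages"].median())
```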
Everything needs to work in sync in order
to be able to produce the Quality Report
that we're presenting
this year at WikidataCon.
But if we hadn't done it,
if we had given up or something like that,
we couldn't show
beautiful things like this.
So on the horizontal axis, you have
the ORES Quality Prediction score.
We use five categories.
And you can inform yourself-- just google
"Wikidata data quality categories."
You will find the description.
The A-class to the left--
the best items that we have,
and at the same time--
that's the green box--
they are the most
reused items in Wikipedia.
So it's not like,
as Lydia explained yesterday,
it's not like all our items
are of the highest quality.
On the contrary, we have many items
that are not of that high quality,
but at least we know
what we're doing with them.
And you can see the regularity.
As the quality of an item
decreases from left to right,
the items tend to be less and less reused.
So this synchronization also
helped us learn things like this.
To the right, for example,
are these five time series.
Each time series corresponds
to one of the quality categories--
A, B, C, D, or E.
And time is on the horizontal axis,
running from left to right.
And you can see here how many items
from each quality class
received their latest revision, and when.
So the top quality class, A--
that is this [inaudible] line,
the one found, say,
at the rightmost position here,
and the shortest line.
So those are the best items that we have.
And what you can see
is actually that there is no item
that did not receive at least
one revision after December 2018,
meaning one thing-- if you want quality
in Wikidata, you have to work on it.
So the best items that we have
are actually the items
that we're really paying attention to.
If you look at the classes
of lower quality, the other time series,
you will see that we have items
that were revised in 2012
for the last time.
So it tells a story of responsibility--
how much work we put into the items
is what actually brings quality.
While we do these things,
we also try to make as much use
as possible of the byproducts
of these procedures.
So, for example, in order
to develop the project
called Wikidata Languages Landscape--
I think I mentioned it yesterday
during the Birthday Presentation--
I had to perform a quite thorough study
of the sub-ontology
of languages in Wikidata.
And you know what?
There are problems in that ontology.
And I tried not to miss the chance
to give you an opportunity to fix them.
So this is the dashboard about the languages,
called the Wikidata Languages Landscape.
Once again, you have all the URLs
in the presentation.
So, for example, you want to take a look
at a particular language.
Say, English, okay.
The dashboard will generate
its local ontological context
and mark all the relations
of the form instance of,
subclass of, or part of.
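As a rough sketch of how such a local ontological context can be pulled from Wikidata-- this is my own minimal query against the public SPARQL endpoint, not the dashboard's actual code-- take English (Q1860) and the three relations instance of (P31), subclass of (P279), and part of (P361):

```python
# A minimal sketch of fetching the local ontological context of a language
# from the Wikidata Query Service. Illustrative only; not the code behind
# the Languages Landscape dashboard.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# English (Q1860): everything it is an instance of (P31),
# a subclass of (P279), or a part of (P361).
QUERY = """
SELECT ?relation ?target ?targetLabel WHERE {
  VALUES ?relation { wdt:P31 wdt:P279 wdt:P361 }
  wd:Q1860 ?relation ?target .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "languages-landscape-sketch/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["relation"]["value"], "->", row["targetLabel"]["value"])
```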
Why did I choose to do this?
To help you fix the language ontology.
Why? Because you will find many languages,
for example, my native language
which used to be Serbo-Croatian,
and for silly reasons now we have Serbian
and Croatian-- it's a political thing.
I don't want to go into it,
but you realize
that Serbian is now, for example,
at the same time
a subclass of Serbo-Croatian
and a part of Serbo-Croatian.
The same still holds for Croatian--
Croatian is also a part
and a subclass of Serbo-Croatian.
So Serbo-Croatian used to be a language.
Now we don't have
normative support for it.
But still, it's not a language class,
it's a language.
So can another language be a part of it,
or can it be a subclass of it?
It's a confusion of mereological
and set-theoretic relations,
and I think it should be fixed somehow.
In other words, don't say
that you don't have the tool
to fix the ontology.
Just find some time and go play with it.
Speaking of languages, as I mentioned,
I just want to show you this project.
Many people liked this thing
when I published it online on Twitter.
That's one of the things, you know.
Data science is usually
sold via visualizations.
People like to visualize things,
and, of course,
we do pay attention to that.
Aesthetics is a part of communication.
It's not the most important thing
for a scientific finding
to show you something beautiful,
but if you can show something beautiful,
you shouldn't miss the opportunity.
So here we did
with the languages in Wikidata
the same thing that we do
with items and projects
in the Wikidata Concepts Monitor.
We actually group languages by similarity,
and the similarity was defined
as how much they overlap
across the items.
So if I can talk about
the same things in English
and in some West-African
language, for example,
then those two languages
are similar in terms
of their reference sets--
what they can refer to.
Each language here
points to its closest neighbor,
its nearest neighbor--
the one which is most similar to it.
And, of course, you can see
these groupings actually occur naturally.
So it's not a fully-connected graph.
Clustering this thing
was nothing like [there is].
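A minimal sketch of the underlying idea, with invented toy data: represent each language by the set of items it has labels for, measure overlap with the Jaccard index, and point each language at its nearest neighbor.

```python
# A toy sketch of the "nearest neighbor by overlap" idea: each language is
# represented by the set of items it has labels for (invented sets here),
# similarity is the Jaccard index, and each language points to the other
# language most similar to it.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two item sets."""
    return len(a & b) / len(a | b)

items_per_language = {
    "en": {"Q64", "Q90", "Q220", "Q1"},
    "fr": {"Q64", "Q90", "Q220"},
    "sr": {"Q64", "Q1"},
    "hr": {"Q64", "Q1", "Q90"},
}

for lang, items in items_per_language.items():
    neighbor = max(
        (other for other in items_per_language if other != lang),
        key=lambda other: jaccard(items, items_per_language[other]),
    )
    print(f"{lang} -> {neighbor}")
```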
Also, what you can learn
from the Languages Landscape project
is what happens when you combine our data
with external resources.
So this is also very informative for us,
for the whole, I would say,
Wikimedia community.
We have the UNESCO language status,
which Wikidata actually gets from UNESCO,
from its websites and databases,
and the Ethnologue language status
on the vertical axes.
And we have the Concepts Monitor
reuse statistics.
So we look at all the items that have
a label in a particular language,
and then we look at
how popular those items are,
how many times people used them.
Of course, those safe national languages,
languages that are not endangered,
they have a slight advantage.
But the situation is not really that bad.
Say, for example, take a look
at the Ethnologue category
of "Second language only"--
that's the rightmost one.
You will see three languages
there being reused
in a way comparable to the most favorable,
not endangered category
of national languages.
It's not like the gender bias.
Wikipedia seems to be really reflecting
the gender bias that exists in the world,
and then we have nice initiatives,
like the women who are trying to fix this thing.
With languages, well, of course,
some languages are a little bit favored,
but it's not that bad,
and that finding really brought
us a lot of joy.
Now, speaking of external resources,
every time that I look at this graph,
I say to myself, "We know
who is the queen of the databases."
You know the external identifier
properties in Wikidata.
So here we take all the external identifiers
that were present in the August
JSON dump of Wikidata, which we processed.
Then, once again, we
did some statistics on them
and grouped all the external identifiers
by how much they overlap across the items.
Aha, here we are.
That visualization, except for maybe
being aesthetically pleasing,
is not that useful,
but you have an interactive version
developed in the dashboard.
If you go and inspect
the interactive version,
you can learn, for example,
one obvious fact
that they really follow
some natural semantics.
They are grouped in intuitive ways.
We could perfectly well expect this
to give some feedback on the quality
of the organization of data in Wikidata,
telling us that the situation
is really not that bad.
What I am saying is
that all the external identifiers
from the databases
on sports, for example,
you will find to be in one cluster.
And then, for example, you will even
be able to figure out what sport.
Databases on tennis are here,
databases on football are here, etc.
Yes, these external resources
are things that we really try
to pay a lot of attention to.
All right, as I said, the final thing
is communication and aesthetics.
We do pay attention to it.
So, for example, this thing--
many people liked it.
It's a little bit rescaled for aesthetics,
the same network of external identifiers
that you were able to see.
But you don't get
to these results for free, of course.
For example, this one was obtained
by running a clustering algorithm
on Jaccard distances--
technical terms, I'm not going into it.
And first, we had to start from a matrix
actually derived from the 408 languages
that are reused across the Wikimedia projects.
Wikidata knows about many more
languages, not only 400,
but only around 400 of them actually appear
as labels of the items that get reused--
a contingency matrix of 400 languages
across some 60 million items.
That's a lot of computation.
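For the clustering step itself, here is a minimal sketch with scipy, where a tiny random binary matrix stands in for the real languages-by-items contingency matrix: compute pairwise Jaccard distances and run hierarchical clustering on them.

```python
# A minimal sketch of clustering on Jaccard distances with scipy.
# A tiny random binary matrix stands in for the real languages-by-items
# contingency matrix, which has ~400 rows and tens of millions of columns.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
incidence = rng.integers(0, 2, size=(8, 50)).astype(bool)  # 8 "languages", 50 "items"

# Pairwise Jaccard distances between the rows (languages).
distances = pdist(incidence, metric="jaccard")

# Average-linkage hierarchical clustering on the distance matrix.
tree = linkage(distances, method="average")
labels = fcluster(tree, t=3, criterion="maxclust")  # cut into 3 clusters

print(squareform(distances).round(2))
print("cluster labels:", labels)
```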
This adds an additional layer of complication,
and it is, of course, the most beautiful part
of your work as a data scientist,
but it doesn't get to occupy
more than, say, 10% or 15% of your time,
because everything else
goes to data engineering
and the synchronization of different systems.
With the machine learning
and statistical things,
we use plenty of different algorithms.
I don't think now is the time to go
and talk about the details of these things.
There will be plenty of opportunities
to discuss them,
but it's typically
a highly technical topic,
better suited for a scientific conference.
Here are all the layers of complexity.
In the end, we have to add
deployment and the dashboards,
because they won't build
themselves on top of this thing.
And all these things, all these phases
of development of an analytics
or data science project,
need to fit together in order
to be able to derive empirical results
on a system of Wikidata's complexity.
The true picture is that you cannot
really just run through these cycles.
All the phases of the process
are interdependent,
because you really
have to plan very early on
what visualizations you are going to use,
what technology you will use
to render those visualizations in the end,
and what machine learning algorithms
you will be using,
because all of them have their own tastes
about what data structures they like.
And then you hit the constraints
of the infrastructure, and similar things.
I am not complaining,
I'm really enjoying this.
This is the most beautiful playground
I've ever seen in my life.
Thanks to you and people
who built Wikidata.
Thank you very much!
That would be it.
(moderator) Thank you, Goran.
(applause)
(moderator) You have time
for a couple of questions.
(man) Well, you did a lot of research,
I can see that.
(Goran) Sorry?
(man) You did a lot of research,
I can see that.
I'm wondering if there is anything
that you discovered during the research
that surprised you.
Thank you for that question.
Actually, I wanted to focus
on that in this talk
until I realized that we simply
won't have enough time
to explain everything.
Most of the time,
when you're analyzing big datasets
structured in the way Wikidata is--
even when you're going into the wild,
meaning studying the reuse of data
across Wikipedia,
where people can actually do
whatever they like with those items--
you have a lot of data,
a lot of information.
Of course, you see structure.
Most of the time, 90% of the time,
you see things that are expected--
things like which projects
make the most use of Wikidata.
And you almost--
you don't have to do too much statistics;
you can rely on your expectations
about the world and see what's happening.
Many things were surprising,
and those things that were surprising
are really the most informative things.
When one communicates the findings
from analytics and such systems,
it's important to know that people typically expect
either "wow" visualizations--
and we have tons of data, so we can always
deliver "wow" visualizations--
or they expect to learn things like,
"Our project is doing better
than this project"
or "Yes, we are rocking!", etc.,
while the goal of the whole game
should actually be to learn
what is wrong, what is not working,
what could be done better.
Many things were surprising.
For example, the distribution
of item usage across languages--
that was surprising to me.
This thing.
So I did not really expect
that the situation with languages
would be this good, I would say.
My expectation was that languages
that have less economic support,
normative support,
even political support--
that's a fact when you talk
about languages--
would not be so widely reused
across the Wikimedia universe.
In fact, it turns out
that the differences-- we can see them,
but it's far from the gender bias,
which is really bad, I think;
we need to work there.
That was surprising, for example.
It was a positive surprise,
to put it that way.
Then, from time to time,
we discover projects
that actually do a great job of reusing
the Wikidata content across Wikimedia.
We're totally surprised to learn that
such a project can do it.
Then you start thinking, and you figure out
there is a community of people
actually doing it.
And it's a strange feeling, because I get
to see all these things through machines,
through databases,
through visualizations and tables,
and it's always that strange feeling
when I realize this result was produced
by a group of people who don't even know
that someone is looking at their results right now.
(moderator) Another question?
Thank you.
Is that it? Thank you very much!
(moderator) Thank you.
(applause)