-
Yes, Wikidata Statistics:
What, Where, and How?
-
This is an attempt at an overview
of our analytical systems
-
focusing on what was developed
with the Wikimedia Deutschland
-
in the previous almost three years
-
since I started doing data science
for Wikidata and Wiktionary.
-
So, during this presentation,
I will try to switch from the presentation
-
to the dashboards
and show you the end data products.
-
However, if that causes any trouble--
-
this is actually the URL
of the analytics portal.
-
So everything that
I will be presenting here,
-
whatever you can see on the slides,
you can also check out later
-
from the presentation,
go and play with the real thing.
-
Otherwise, you will see only
the screenshots here from the slides.
-
So the goal-- well, the talk
will be a failed attempt to communicate
-
an almost endlessly
technically complicated field
-
in terms that can actually motivate
people to start making use
-
of these analytical products,
-
into whose development
we are really putting a lot of effort.
-
So, as I said, I will try
to provide an overview
-
of the Wikidata Statistics
and Analytics systems.
-
So I will try to exemplify the usage
of some of them, not all.
-
And also I will try to go just
a little bit under the hood
-
to try to illustrate how it is done,
what is done here,
-
because I thought it might be
interesting to the audience.
-
Okay, so say...
-
In analytics and data science,
you always start with formulating
-
as clearly as possible
your goals and motivations.
-
Otherwise, you enter into endless cycles
of developing analytical tools
-
and data science products
that actually do something,
-
but nobody really understands
what they're being built for.
-
In 2017, in Wikimedia Deutschland,
a request, a demand was formulated--
-
we said that we needed
an analytical system
-
that will give an insight into the ways
-
that Wikidata items are reused
across the Wikimedia projects,
-
meaning across the Wikipedia universe--
all the encyclopedias,
-
and then Wikivoyage,
Wikibooks, WikiCite, etc.--
-
all of the approximately 800 websites
that we are actually managing.
-
So just to explain the differences
between the data.
-
On the left, for example, you see a small,
a very small subset of Wikidata.
-
These are the languages--
some of the Slavic languages, I think--
-
and in Wikidata they are connected,
-
but by properties, and they belong
to different classes, etc.
-
But we were looking
for a different kind of mapping.
-
So what you see here,
on the right side, is a set of items
-
all belonging to the class
of architectural structures, I would say.
-
And this here is the result
of their empirical embeddings.
-
So the items related here--
-
they are linked by their similarity
of usage across Wikipedias, for example.
-
So what does it mean-- the similarity?
-
To be similar in terms of how an item
is used across the Wikipedias.
-
So imagine you take an array of numbers,
-
and each element of the array
is one project-- it's English Wikipedia,
-
it is French Wikivoyage,
it is Italian Wikipedia, etc.
-
And then, you count how many times
-
a particular item has been used
in that project.
-
So you use an array of numbers
to describe the item that way.
-
It's a little bit more complicated
in practice.
-
And then, you can describe all items
in Wikidata that were ever used
-
across the websites at all
by such arrays of numbers,
-
called embeddings, technically, right?
-
From those data,
using different distance metrics,
-
applying machine learning methods,
doing dimensionality reduction,
-
and similar things,
-
you can actually figure out
what the similarity pattern is.
-
And here items are connected
-
by how similar their patterns of usage
are across different Wikipedias.
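-
To make this concrete, here is a minimal
sketch of the idea in Python-- the item IDs
are real, but the counts and the pipeline
are toy stand-ins for what actually runs
on the cluster:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy usage counts: position i of each array says how many times
# the item is used on projects[i] (made-up numbers; the real data
# covers roughly 800 wikis).
projects = ["enwiki", "frwikivoyage", "itwiki"]
usage = {
    "Q64":  np.array([120, 35, 18]),   # Berlin
    "Q90":  np.array([150, 80, 25]),   # Paris
    "Q220": np.array([40, 5, 90]),     # Rome
}

# Each item is now described by an array of numbers -- its embedding.
# Link every item to the item with the most similar usage pattern.
for q, vec in usage.items():
    others = {p: v for p, v in usage.items() if p != q}
    nearest = min(others, key=lambda p: cosine(vec, others[p]))
    print(q, "is most similar to", nearest)
```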
-
Once again, every visualization,
every result that I show--
-
there is a link on the presentation,
so you can go and check for yourself.
-
You can play
with this thing interactively.
-
Similarly, we are able to derive
a graph like this one.
-
This one does not connect
the Wikidata items, it connects projects,
-
but by looking at how similar they are
-
in terms of how they use
different Wikidata items.
-
To be as precise as possible,
-
the data that we use to do this--
they do not live in Wikidata,
-
they are not a part of Wikidata;
-
the data are not located here at all.
-
We have Wikidata,
we have formulated our goals and motivation,
-
and immediately we started talking
about the data model and the structures.
-
What structures and data models
do you need to answer the questions
-
that you initially proposed?
-
So there is Wikibase
and the client-side tracking mechanism,
-
which is installed in all those wikis,
-
that actually tracks the Wikidata usage
on a project, on Wikipedia, for example.
-
So every time an item is used
in [meaningful ways]
-
or in a different way--
a row enters a huge SQL table
-
that tracks
the usage of that item.
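-
For the technically curious, that table is
the Wikibase client's wbc_entity_usage table.
A sketch of reading it from Python-- the host
and credentials are placeholders; only the
table and column names are meant to be real:

```python
import pymysql

# Placeholder connection to one wiki's replica database.
conn = pymysql.connect(host="replica.example", user="research",
                       password="...", database="enwiki")

# wbc_entity_usage holds one row per tracked usage
# of a Wikidata entity on a page of this wiki.
with conn.cursor() as cur:
    cur.execute("""
        SELECT eu_entity_id, COUNT(*) AS n_usages
        FROM wbc_entity_usage
        GROUP BY eu_entity_id
        ORDER BY n_usages DESC
        LIMIT 10
    """)
    for entity_id, n_usages in cur.fetchall():
        print(entity_id, n_usages)
```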
-
Now, immediately, we had to face
a data-engineering problem, of course,
-
because we are talking
about hundreds of huge SQL tables,
-
and we had to do
machine learning and statistics
-
across all the data together,
not separately,
-
in order to be able to produce structures,
like this one or like this one.
-
So in cooperation with the Analytics
Engineering Team of the Foundation,
-
we started transferring
those data from Wikibase
-
to the Wikimedia Foundation Data Lake,
which is actually a big data store.
-
The data do not live there
in a relational database.
-
They live in something similar--
-
it's Hadoop, and Hive tables
are there, etc.,
-
but it's a huge,
huge engineering procedure.
-
So not all data in analytics,
especially in big games like this
-
that we have to play
with Wikidata and Wikipedia,
-
are immediately available to you.
-
One source of complication--
-
before you actually start solving
the problem in a scientific way,
-
to put it that way-- is engineering
the datasets, preparing the structures
-
that you actually need for doing
machine learning, statistics,
-
and similar things.
-
This is a full design of the system
called the Wikidata Concepts Monitor
-
that tracks the reuse statistics of Wikidata items.
-
I will not go
into details here, of course.
-
The obvious complication
is that-- as I wrote it up--
-
many systems need to work together.
-
You have to synchronize
many different sources of data,
-
many different infrastructures
-
just in order to make it happen,
even before starting thinking
-
in terms of methodologies, science,
statistics, and similar.
-
As I said, we started
with our goals and motivation,
-
then, typically, the data model
and the structures that you need--
-
they correspond to those goals
and motivations, which should always be
-
your first step in developing
an analytics project.
-
Then you figure out
it's really too complicated,
-
it cannot be done by one person--
-
It cannot be done on one computer,
to put it that way.
-
So we needed to work
with the analytics infrastructure
-
and then add an additional layer
of complication--
-
that's communication
with external teams and cooperators
-
because, obviously, such a system
cannot be managed easily by one person.
-
Actually, I think
it would be pretty impossible.
-
So, as I mentioned,
there is this Data Lake,
-
our big data storage in Hadoop,
-
and the team of awesome data engineers
in the Foundation
-
called the Analytics Engineering Team.
-
To data scientists, data engineers
are the people who actually watch your back
-
while you're trying to do your things.
-
If you cannot rely on
a good engineering team,
-
there's not much you will be able
to do by yourself.
-
This infrastructure is actually
maintained by the Foundation,
-
so you enter through
several statistical servers--
-
these blue boxes down there.
-
You can communicate
with the relational database systems.
-
We use MariaDB.
-
You can communicate with the Data Lake.
-
And, of course, for your computations,
you go to the so-called Analytics Cluster
-
where you run things
like Apache Spark, which is actually
-
the only really efficient way
to process the data
-
that we need to process.
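-
A hedged sketch of what such a Spark job looks
like-- the table and column names here are
illustrative, not the real Data Lake schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("wikidata-usage-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table of per-wiki entity usage.
usage = spark.table("demo.wikidata_entity_usage")

# On how many wikis, and how often, is each item used?
top = (usage.groupBy("entity_id")
            .agg(F.countDistinct("wiki_db").alias("n_wikis"),
                 F.count("*").alias("n_usages"))
            .orderBy(F.desc("n_usages")))
top.show(20)
```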
-
When I started doing this back in 2017,
I remember when I saw
-
only the schema of the infrastructure
for the first time.
-
If I could not rely on my colleague
Adam Shorland--
-
who is still with us
in Wikimedia Deutschland--
-
I would never have made it; I wouldn't even
know how to navigate the structure.
-
As you start building a project
to do analytics for Wikidata,
-
you see how it gets
more and more complicated
-
because you have to deal
with synchronizing different systems,
-
different teams, infrastructures,
different datasets.
-
However, it pays off,
-
that synchronization and all the pain.
-
It can get really nasty sometimes,
and the most recent example
-
is the production
of the Data Quality Report for Wikidata.
-
That's an initial assessment
of the quality of what we have in Wikidata.
-
In order to produce it,
-
we had to rely on the Quality Predictions
from the ORES system,
-
the machine learning system
developed by Aaron Halfaker
-
and the Scoring Platform team,
-
and combine that with the Wikidata
Concepts Monitor reuse statistics.
-
Then, the full revision history
of all Wikipedias
-
is available in one single
huge big data table
-
called the MediaWiki History.
-
That lives in the Data Lake.
-
And also we had to process
the JSON Dump in HDFS.
-
So we're talking about four massive
structures: two machine learning systems
-
with their complexities,
-
and two huge datasets.
-
Everything needs to work in sync in order
to be able to produce the Quality Report
-
that we're presenting
this year at WikidataCon.
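-
As an illustration of one of those moving parts:
at the time, ORES exposed a public scores API
with an itemquality model for Wikidata. Fetching
predictions looks roughly like this-- the revision
IDs are made up, and the exact response layout
is my assumption:

```python
import requests

# Made-up revision IDs, purely for illustration.
revids = ["987654321", "987654322"]
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/wikidatawiki/",
    params={"models": "itemquality", "revids": "|".join(revids)},
)
resp.raise_for_status()

# Assumed layout: one predicted quality class (A-E) per revision.
scores = resp.json()["wikidatawiki"]["scores"]
for rev, result in scores.items():
    print(rev, result["itemquality"]["score"]["prediction"])
```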
-
But if we hadn't done it, if we had
skipped something like that,
-
we couldn't say, we couldn't show
beautiful things like this.
-
So on the horizontal axis, you have
the ORES Quality Prediction score.
-
We use five categories.
-
And you can inform yourself-- just google
"Wikidata data quality categories."
-
You will find the description.
-
The A-class to the left--
the best items that we have,
-
and at the same time--
that's the green box--
-
they are the most
reused items in Wikipedia.
-
So it's not like,
as Lydia explained yesterday,
-
it's not like all our items
are of the highest quality.
-
On the contrary, we have many items
that are not of that high quality,
-
but at least we know
what we're doing with them.
-
And you can see the regularity.
-
As the quality of an item
decreases from left to right,
-
the items tend to be less and less reused.
-
So also this synchronization
helped us learn things like this.
-
To the right, for example,
these five time series here.
-
Each time series corresponds
to one of the quality categories--
-
A, B, C, D, or E.
-
And the time is on the horizontal axis
running from left to right.
-
And you can see here how many items
from each quality class
-
received their latest revision when.
-
So the top quality class A--
that is this [inaudible] line,
-
which is found, say,
at the rightmost position here,
-
and is the shortest line.
-
So those are the best items that we have.
-
And what you can see
is actually that there is no item
-
that did not receive at least
one revision after December 2018,
-
meaning one thing-- if you want quality
in Wikidata, you have to work on it.
-
So the best items that we have
are actually the items
-
that we're really paying attention to.
-
If you look at the classes
of lower quality, the other time-series,
-
you will see that we have items
that were revised in 2012
-
for the last time.
-
So it tells a story of responsibility--
-
how much work we put
into the items is actually
-
what brings quality.
-
While we do these things,
we also try to make as much use
-
as possible of the byproducts
of these procedures.
-
So, for example, in order
to develop the project
-
called Wikidata Languages Landscape--
-
I think I mentioned it yesterday
during the Birthday Presentation--
-
I had to perform a quite thorough study
-
of the sub-ontology
of languages in Wikidata.
-
And you know what?
There are problems in that ontology.
-
And I didn't want to miss the chance
to give you an opportunity to fix them.
-
So this is the dashboard actually
about the languages
-
called the Wikidata Languages Landscape.
-
Once again, you have all the URLs
in the presentation.
-
So for example, you want to take a look
at a particular language.
-
Say, English, okay.
-
So the dashboard will generate
its local ontological context
-
and mark all the relations
of the form instance of,
-
subclass of, or part of.
-
Why did I choose to do this?
-
To help you fix the language ontology.
-
Why? Because you will find many languages,
for example, my native language
-
which used to be Serbo-Croatian,
-
and for silly reasons now we have Serbian
and Croatian-- it's a political thing.
-
I don't want to go into it,
but you realize
-
that Serbian is now, for example,
at the same time
-
a subclass of Serbo-Croatian
and a part of Serbo-Croatian.
-
The same holds for Croatian--
-
Croatian is also a part
and a subclass of Serbo-Croatian.
-
So Serbo-Croatian used to be a language.
-
Now we don't have
normative support for it.
-
But still, it's not a language class,
it's a language.
-
Can it be a part of it
or can it be a subclass of it?
-
So it's a confusion of mereological
and set-theoretic relations,
-
and I think it should be fixed somehow.
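-
If you want to hunt for these confusions yourself,
a query against the Wikidata Query Service along
these lines should surface them-- P31, P279, and
P361 are the real property IDs; that Q34770 is
the class for language is my assumption:

```python
import requests

# Find languages that are simultaneously a subclass of (P279)
# and a part of (P361) the same thing -- the mereological /
# set-theoretic confusion described above.
query = """
SELECT ?lang ?langLabel ?parent ?parentLabel WHERE {
  ?lang wdt:P31 wd:Q34770 ;
        wdt:P279 ?parent ;
        wdt:P361 ?parent .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["langLabel"]["value"], "<->", row["parentLabel"]["value"])
```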
-
In other words, don't say
-
that you don't have the tool
to fix the ontology.
-
Just find some time and go play with it.
-
Speaking of languages-- as I mentioned,
I just want to show you this project.
-
Many people liked this thing
when I published it online on Twitter.
-
That's one of the things, you know.
-
Data science is usually
sold via visualizations.
-
People like to visualize things,
-
and, of course,
we do pay attention to that.
-
Aesthetics is a part of communication.
-
It's not the most important thing
for a scientific finding
-
to show you something beautiful,
-
but if you can show something beautiful,
you shouldn't miss the opportunity.
-
So here we did
with the languages in Wikidata
-
the same thing that we do
with items and projects
-
in the Wikidata Concepts Monitor.
-
We actually group languages by similarity,
and the similarity was defined
-
as how much they overlap
across the items.
-
So if I can talk about
the same things in English
-
and in some West-African
language, for example,
-
then those two things, those two languages
-
are similar in terms
of their reference sets.
-
What they can refer to.
-
Each language here
-
points to its closest neighbor,
nearest neighbor--
-
to the language which is most similar to it.
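-
In code, that nearest-neighbor linking is simple
once the reference sets exist; a tiny sketch with
made-up data (the real sets span tens of millions
of items):

```python
# Made-up reference sets: the items that have
# a label in each language.
labels = {
    "en": {"Q64", "Q90", "Q220", "Q1"},
    "fr": {"Q64", "Q90", "Q220"},
    "de": {"Q64", "Q1"},
}

def overlap(a, b):
    """Similarity as the number of shared items."""
    return len(labels[a] & labels[b])

# Each language points to the language most similar to it,
# which is why the groupings emerge on their own.
for lang in labels:
    neighbor = max((l for l in labels if l != lang),
                   key=lambda l: overlap(lang, l))
    print(lang, "->", neighbor)
```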
-
And, of course, you can see
these groupings actually occur naturally.
-
So it's not a fully-connected graph.
-
Clustering this thing
was then not difficult at all.
-
Also, what you can learn
from the Languages Landscape project
-
is what you get when you combine
our data with external resources.
-
So this is also very informative for us,
-
for the whole, I would say,
Wikimedia community.
-
We have the UNESCO language status
-
which Wikidata actually gets from UNESCO,
-
its websites and databases,
-
and the Ethnologue language status
on the vertical axes.
-
We have the Concepts Monitor
reuse statistic.
-
So we look at all the items that have
a label in a particular language,
-
and then we look at
how popular those items are,
-
how many times people used them.
-
Of course, those safe national languages,
languages that are not endangered,
-
they have a slight advantage.
-
But the situation is not really that bad.
-
Say, for example, take a look
at the Ethnologue category
-
of "Second language only"--
that's the rightmost one.
-
You will see three languages
there being reused
-
in a way comparable to the most favorable,
-
not endangered category
of national languages.
-
It's not like the gender bias.
-
Wikipedia seems to be really reflecting
the gender bias that exists in the world.
-
Then we have nice initiatives, like the women
who are trying to fix this thing.
-
With languages, well, of course,
some languages are a little bit favored,
-
but it's not that bad,
-
and that finding really
brought us a lot of joy.
-
Now, speaking of external resources,
every time that I look at this graph,
-
I say to myself, "We know
who is the queen of the databases."
-
You know the external identifier
properties in Wikidata.
-
So here we take all external identifiers
that were present in August,
-
JSON Dump of Wikidata, which we processed.
-
Then, once again,
did some statistics on it
-
and grouped all the external identifiers
by how much they overlap across the items.
-
Aha, here we are.
-
That visualization, except for maybe
being aesthetically pleasing,
-
is not that useful,
-
but you have an interactive version
developed in the dashboard.
-
If you go and inspect
the interactive version,
-
you can learn, for example,
one obvious fact
-
that they really follow
some natural semantics.
-
They are grouped in intuitive ways.
-
We could perfectly well expect them
to give some feedback on the quality
-
of the organization of data in Wikidata,
-
and they tell us that the situation
is really not that bad.
-
What I am saying is
that all the external identifiers
-
from the databases
on sports, for example,
-
you will find to be in one cluster.
-
And then, for example, you will even
be able to figure out what sport.
-
Databases on tennis are here,
databases on football are here, etc.
-
Yes, these external resources
-
are things that we really try
to pay a lot of attention to.
-
All right, as I said, the final thing
is communication and aesthetics.
-
We do pay attention to it.
-
So, for example, this thing--
many people liked it.
-
It's a little bit rescaled for aesthetics,
-
the same network of external identifiers
that you were able to see.
-
But you don't get
to these results for free, of course.
-
For example, this one was obtained
by running a clustering algorithm
-
on Jaccard distances--
technical terms, I'm not going into it.
-
And first, we had to start from a matrix
actually derived from 408 languages
-
that are reused across the Wikimedia projects.
-
Wikidata knows about
many more languages, not only 400,
-
but only those 400 or so actually appear
as labels of the items that get reused--
-
a 408-languages-by-60-million-items
contingency matrix; that's a lot of computation.
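-
In miniature, that computation looks something
like this-- a tiny random stand-in for the real
408-languages-by-60-million-items binary matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny stand-in: rows = languages, columns = items,
# True = "this item has a label in this language".
rng = np.random.default_rng(42)
matrix = rng.integers(0, 2, size=(8, 50)).astype(bool)

# Pairwise Jaccard distances between the language rows,
# then hierarchical clustering on top of them.
distances = pdist(matrix, metric="jaccard")
tree = linkage(distances, method="average")
print(fcluster(tree, t=3, criterion="maxclust"))
```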
-
To add an additional layer of complication--
-
the machine learning is, of course,
the most beautiful part
of your work as a data scientist,
-
but it doesn't get to occupy
-
more than, say, 10% or 15% of your time,
-
because everything else
goes to data engineering
-
and synchronization of different systems.
-
As for the machine learning
and statistics,
-
we use plenty of different algorithms.
-
I don't think now is the time to go
and talk about the details of these things.
-
I have plenty of opportunities
to discuss them,
-
but it's typically
a highly technical topic,
-
better suited for a scientific conference.
-
Here are all the layers of complexity.
-
In the end, we have to add
deployment and dashboards,
-
because these things
won't build themselves.
-
And all these things, all these phases
-
of the development of an analytics
or data science project
-
need to fit together in order
to be able to derive empirical results
-
on a system of Wikidata's complexity.
-
The true picture is that you cannot
really just run through these cycles.
-
All the phases of the process
are interdependent
-
because you really
have to plan very early on
-
what visualizations you are going to use,
what technology you will use
-
to render those visualizations in the end.
-
What machine learning algorithms
you will be using,
-
because all of them have their own taste
about what data structures they like.
-
And then you hit the constraints
of infrastructure-- similar things.
-
I am not complaining,
I'm really enjoying this.
-
This is the most beautiful playground
I've ever seen in my life.
-
Thanks to you and people
who built Wikidata.
-
Thank you very much!
-
That would be it.
-
(moderator) Thank you, Goran.
-
(applause)
-
(moderator) You have time
for a couple of questions.
-
(man) Well, you did a lot of research,
I can see that.
-
(Goran) Sorry?
-
(man) You did a lot of research,
I can see that.
-
I'm wondering if there was anything
that you discovered during the research
-
that surprised you.
-
Thank you for that question.
-
Actually, I wanted to focus
on that in this talk
-
until I realized that we simply
won't have enough time
-
to explain everything.
-
Most of the time,
when you're analyzing big datasets
-
structured in the way Wikidata is--
-
even when you're going into the wild,
meaning studying the reuse of data
-
across Wikipedia,
-
where actually people can do
whatever they like with those items,
-
you have a lot of data,
a lot of information.
-
Of course, you see structure.
-
Most of the time, 90% of the time,
you see things that are expected.
-
Things like which projects
make the most use of Wikidata.
-
And you can almost--
you don't have to do too much statistics,
-
you can rely on everyone's expectations
and see what's happening.
-
Many things were surprising,
-
and those things that were surprising
are really the most informative things.
-
When one communicates the findings
from analytics and such systems,
-
it's important-- people typically expect
either "wow" visualizations--
-
and we have tons of data, so we can always
deliver "wow" visualizations--
-
or they expect to learn things like,
-
"Our project is doing better
than this project"
-
or "Yes, we are rocking!" etc.,
-
while the goal of the whole game
should actually be to learn
-
what is wrong, what is not working,
what could be done better.
-
Many things were surprising.
-
For example, the distribution
of item usage across languages--
-
that was surprising to me.
-
This thing.
-
So I did not really expect
that the situation with languages
-
will be this good, I would say.
-
My expectation would be that languages
that have less economic support,
-
normative support,
even political support--
-
that's a fact when you talk
about languages--
-
would not be so widely reused
across the Wikimedia universe.
-
In fact, it turns out
that the differences-- we can see them,
-
but it's far away from the gender bias,
which is really bad, I think--
-
we need to work on that.
-
That was surprising, for example.
-
It was a positive surprise,
to put it that way.
-
Then from time to time,
we discover projects
-
that actually do a great job reusing
the Wikidata content in Wikimedia.
-
We were totally surprised to learn
that such a project could do it.
-
Then you start thinking, you figure out
there is a community of people
-
actually doing it.
-
And it's a strange feeling because I get
to see all these things through machines,
-
through databases,
through visualizations and tables,
-
and it's always that strange feeling
when I realize this result was produced
-
by a group of people who don't even know
that I'm looking at their results now.
-
(moderator) Another question?
-
Thank you.
-
Is that it? Thank you very much!
-
(moderator) Thank you.
-
(applause)