1 00:00:06,343 --> 00:00:09,678 Yes, Wikidata Statistics: What, Where, and How? 2 00:00:09,678 --> 00:00:13,005 This is an attempt at an overview of the analytical systems 3 00:00:13,005 --> 00:00:16,173 focusing on what was developed at Wikimedia Deutschland 4 00:00:16,173 --> 00:00:18,155 in the previous almost three years 5 00:00:18,155 --> 00:00:22,346 since I started doing data science for Wikidata and the dictionary. 6 00:00:22,346 --> 00:00:28,345 So, during this presentation, I will try to switch from the presentation 7 00:00:28,346 --> 00:00:32,092 to the dashboards and show you the end data products. 8 00:00:32,995 --> 00:00:35,029 However, if that causes any trouble, 9 00:00:35,029 --> 00:00:39,070 this is actually the URL of the analytics portal. 10 00:00:39,070 --> 00:00:41,272 So everything that I will be presenting here, 11 00:00:41,272 --> 00:00:44,105 whatever you can see on the slides, you can also check out later 12 00:00:44,105 --> 00:00:47,285 from the presentation-- go and play with the real thing. 13 00:00:47,285 --> 00:00:51,101 Otherwise, you will see only the screenshots here on the slides. 14 00:00:51,101 --> 00:00:58,275 So the goal-- well, the talk will be a failed attempt to communicate 15 00:00:58,275 --> 00:01:01,502 an almost endlessly technically complicated field 16 00:01:02,567 --> 00:01:06,843 in terms that can actually motivate people to start making use 17 00:01:06,843 --> 00:01:08,338 of this analytical product 18 00:01:08,338 --> 00:01:11,010 into whose development we are really putting a lot of effort. 19 00:01:11,010 --> 00:01:13,631 So, as I said, I will try to provide an overview 20 00:01:13,631 --> 00:01:15,679 of the Wikidata Statistics and Analytics systems. 21 00:01:15,679 --> 00:01:20,636 And I will try to exemplify the usage of some of them, not all.
22 00:01:20,636 --> 00:01:23,362 And also I will try to go just a little bit under the hood 23 00:01:23,362 --> 00:01:27,453 to try to illustrate how it is done, what is done here, 24 00:01:27,453 --> 00:01:31,144 because I thought it might be interesting to the audience. 25 00:01:31,818 --> 00:01:33,534 Okay, so say... 26 00:01:34,804 --> 00:01:38,538 In analytics and data science, you always start with formulating 27 00:01:38,538 --> 00:01:41,709 as clearly as possible your goals and motivations. 28 00:01:41,709 --> 00:01:47,080 Otherwise, you enter into endless cycles of developing analytical tools 29 00:01:47,080 --> 00:01:49,733 and data science products that actually do something, 30 00:01:49,733 --> 00:01:52,835 but nobody really understands what they're being built for. 31 00:01:52,835 --> 00:01:57,669 In 2017, in Wikimedia Deutschland, a request, a demand was formulated-- 32 00:01:57,925 --> 00:01:59,740 we said that we needed an analytical system 33 00:01:59,740 --> 00:02:01,936 that would give an insight into the ways 34 00:02:01,936 --> 00:02:05,865 that Wikidata items are reused across the Wikimedia projects, 35 00:02:05,865 --> 00:02:09,016 meaning across the Wikipedia universe-- all the encyclopedias, 36 00:02:09,016 --> 00:02:11,826 and then Wikivoyage, Wikibooks, WikiCite, etc.-- 37 00:02:11,826 --> 00:02:15,610 all the websites, approximately 800, that we are actually managing. 38 00:02:15,610 --> 00:02:19,553 So just to explain the differences between the data. 39 00:02:19,553 --> 00:02:23,794 On the left, for example, you see a small, or very small, subset of Wikidata. 40 00:02:23,794 --> 00:02:28,114 These are the languages, some of the Slavic, I think, languages, 41 00:02:28,114 --> 00:02:30,383 and in Wikidata they are connected 42 00:02:30,383 --> 00:02:34,194 by their properties and belong to different classes, etc. 43 00:02:34,194 --> 00:02:36,785 But we were looking for a different kind of mapping.
44 00:02:36,785 --> 00:02:41,085 So what you see here, on the right side, is a set of items 45 00:02:41,085 --> 00:02:44,823 all belonging to the class of architectural structures, I would say. 46 00:02:44,823 --> 00:02:48,496 And this here is the result of their empirical embeddings. 47 00:02:48,496 --> 00:02:50,511 So the items related here-- 48 00:02:50,518 --> 00:02:55,952 they are linked by their similarity of usage across Wikipedias, for example. 49 00:02:55,952 --> 00:02:57,842 So what does it mean-- the similarity? 50 00:02:58,632 --> 00:03:03,068 To be similar in terms of how an item is used across the Wikipedias. 51 00:03:03,068 --> 00:03:06,943 So imagine you take an array of numbers, 52 00:03:07,353 --> 00:03:11,107 and each element of the array is one project-- it's English Wikipedia, 53 00:03:11,558 --> 00:03:17,417 it is French Wikivoyage, it is Italian Wikipedia, etc. 54 00:03:17,901 --> 00:03:20,495 And then, you count how many times 55 00:03:20,495 --> 00:03:23,085 a particular item has been used in that project. 56 00:03:24,112 --> 00:03:27,631 So you use an array of numbers to describe the item that way. 57 00:03:27,631 --> 00:03:29,768 It's a little bit more complicated in practice. 58 00:03:31,299 --> 00:03:36,074 And then, you can describe all items in Wikidata that were ever used 59 00:03:36,074 --> 00:03:39,358 across the websites at all by such arrays of numbers, 60 00:03:39,358 --> 00:03:41,320 called embeddings, technically, right? 61 00:03:41,791 --> 00:03:45,513 From those data, using different distance metrics, 62 00:03:45,513 --> 00:03:48,893 applying machine learning methods, doing dimensionality reduction, 63 00:03:48,893 --> 00:03:50,382 and similar things, 64 00:03:50,382 --> 00:03:53,093 you can actually figure out what the similarity pattern is. 65 00:03:53,093 --> 00:03:55,622 And here items are connected 66 00:03:55,622 --> 00:04:00,501 by how similar their patterns of usage are across different Wikipedias.
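[Editor's note: a minimal sketch of the idea described above-- each item as an array of per-project usage counts, compared by a distance/similarity measure. The item IDs, project order, and counts are invented for illustration; the real pipeline covers hundreds of projects and adds machine learning and dimensionality reduction on top.]

```python
import math

# Hypothetical usage counts per project, in a fixed project order,
# e.g. [enwiki, frwikivoyage, itwiki]. Item IDs and numbers are invented.
usage = {
    "Q_tower":   [120, 30, 55],
    "Q_pyramid": [90, 25, 40],
    "Q_local":   [5, 0, 80],   # an item used almost only on one project
}

def cosine_similarity(a, b):
    """Similarity of two item-usage arrays (the 'embeddings' above)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Items used in similar proportions across projects come out more similar
# to each other than to items with a very different usage profile.
```

At the real scale, tens of millions of items times hundreds of projects, this is exactly the kind of computation that needs a cluster rather than a laptop.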
67 00:04:01,726 --> 00:04:04,551 Once again, every visualization, every result that I show-- 68 00:04:04,551 --> 00:04:08,278 there is a link in the presentation, so you can go and check for yourself. 69 00:04:08,278 --> 00:04:10,578 You can play with this thing interactively. 70 00:04:10,578 --> 00:04:15,826 Similarly, we were able to derive a graph like this one. 71 00:04:15,826 --> 00:04:20,011 This one does not connect the Wikidata items, it connects projects, 72 00:04:20,011 --> 00:04:23,069 by looking at how similar they are 73 00:04:23,069 --> 00:04:26,779 in terms of how they use different Wikidata items. 74 00:04:30,235 --> 00:04:31,733 To be as precise as possible, 75 00:04:32,468 --> 00:04:35,369 the data that we use to do this-- they do not live in Wikidata, 76 00:04:35,369 --> 00:04:36,818 they are not a part of Wikidata, 77 00:04:36,818 --> 00:04:38,823 the data is not [located] there at all. 78 00:04:38,823 --> 00:04:41,917 We have Wikidata, we have formulated our motivational goals, 79 00:04:41,917 --> 00:04:45,773 and immediately we started talking about the data model and the structures. 80 00:04:45,773 --> 00:04:49,760 What structures and data models do you need to answer the questions 81 00:04:49,760 --> 00:04:52,412 that you initially proposed? 82 00:04:52,941 --> 00:04:59,064 So there is Wikibase and the client-side tracking mechanism, 83 00:04:59,064 --> 00:05:01,884 which is installed in all those wikis, 84 00:05:01,884 --> 00:05:07,001 that actually tracks the Wikidata usage on a project, on Wikipedia, for example. 85 00:05:07,001 --> 00:05:10,700 So every time an item is used in [meaningful ways] 86 00:05:10,700 --> 00:05:14,743 or in a different way-- there is a row in a huge SQL table 87 00:05:14,743 --> 00:05:18,124 that enters and records the usage of that item.
88 00:05:18,124 --> 00:05:22,326 Now, immediately, we had to face a data-engineering problem, of course, 89 00:05:22,326 --> 00:05:26,434 because we are talking about hundreds of huge SQL tables, 90 00:05:26,434 --> 00:05:29,301 and we had to do machine learning and statistics 91 00:05:29,301 --> 00:05:32,746 across all the data together, not separately, 92 00:05:32,746 --> 00:05:37,283 in order to be able to produce structures like this one or like this one. 93 00:05:37,578 --> 00:05:41,332 So in cooperation with the Analytics Engineering Team of the Foundation, 94 00:05:41,332 --> 00:05:44,459 we started transferring those data from Wikibase 95 00:05:44,459 --> 00:05:49,181 to the Wikimedia Foundation Data Lake, which is actually a big data storage. 96 00:05:49,181 --> 00:05:52,753 The data do not live there in a relational database. 97 00:05:52,753 --> 00:05:54,060 They live in something similar-- 98 00:05:54,060 --> 00:05:56,546 it's Hadoop, and Hive tables are there, etc., 99 00:05:56,546 --> 00:05:58,552 but it's a huge, huge engineering procedure. 100 00:05:58,552 --> 00:06:03,405 So not all data in analytics, especially in big games like this 101 00:06:03,405 --> 00:06:06,001 that we have to play with Wikidata and Wikipedia, 102 00:06:06,001 --> 00:06:07,667 are immediately available to you. 103 00:06:07,667 --> 00:06:09,171 One source of complication, 104 00:06:09,171 --> 00:06:13,459 before you actually start solving the problem in a scientific way, 105 00:06:13,459 --> 00:06:16,847 to put it that way, is to engineer the data sets and prepare the structures 106 00:06:16,847 --> 00:06:20,805 that you actually need for doing machine learning, statistics, 107 00:06:20,805 --> 00:06:22,588 and similar things. 108 00:06:23,464 --> 00:06:26,921 This is a full design of the system called the Wikidata Concepts Monitor 109 00:06:26,921 --> 00:06:28,380 that tracks the reuse statistics.
110 00:06:28,380 --> 00:06:30,844 I will not go into details here, of course. 111 00:06:32,394 --> 00:06:35,764 The obvious complication is that-- as I wrote it up-- 112 00:06:35,764 --> 00:06:38,432 many systems need to work together. 113 00:06:38,432 --> 00:06:41,248 You have to synchronize many different sources of data, 114 00:06:41,248 --> 00:06:42,846 many different infrastructures, 115 00:06:42,846 --> 00:06:47,994 just in order to make it happen, even before starting to think 116 00:06:47,994 --> 00:06:52,247 in terms of methodologies, science, statistics, and similar. 117 00:06:53,955 --> 00:06:57,930 As I said, we started with our goals and motivation; 118 00:06:57,930 --> 00:07:01,629 then, typically, the data model and the structures that you need-- 119 00:07:01,629 --> 00:07:04,881 they correspond to those goals and motivations, which should always be 120 00:07:04,881 --> 00:07:08,250 your first step in developing an analytics project. 121 00:07:08,250 --> 00:07:10,857 Then you figure out it's really too complicated, 122 00:07:10,857 --> 00:07:12,846 it cannot be done by one person-- 123 00:07:12,846 --> 00:07:15,077 it cannot be done on one computer, to put it that way. 124 00:07:15,077 --> 00:07:17,771 So we needed to work with the analytics infrastructure, 125 00:07:17,771 --> 00:07:20,403 and then add an additional layer of complication-- 126 00:07:20,403 --> 00:07:23,750 that's communication with external teams and cooperators, 127 00:07:23,750 --> 00:07:28,366 because, obviously, such a system cannot be managed easily by one person. 128 00:07:28,366 --> 00:07:31,358 Actually, I think it would be pretty impossible. 129 00:07:31,720 --> 00:07:33,587 So, as I mentioned, there is this Data Lake, 130 00:07:33,587 --> 00:07:38,091 our big data storage in Hadoop, 131 00:07:38,091 --> 00:07:41,880 and the team of awesome data engineers in the Foundation 132 00:07:41,880 --> 00:07:43,987 called the Analytics Engineering Team.
133 00:07:43,987 --> 00:07:47,660 To a data scientist, data engineers are people who actually watch your back 134 00:07:47,660 --> 00:07:49,426 while you're trying to do your things. 135 00:07:49,426 --> 00:07:51,766 If you cannot rely on a good engineering team, 136 00:07:51,766 --> 00:07:54,164 there's not much you will be able to do by yourself. 137 00:07:55,636 --> 00:08:00,357 This infrastructure is actually maintained by the Foundation, 138 00:08:00,357 --> 00:08:04,127 so you enter through several statistical servers-- 139 00:08:04,441 --> 00:08:06,152 these blue boxes down there. 140 00:08:06,152 --> 00:08:09,274 You can communicate with the relational database systems. 141 00:08:09,274 --> 00:08:10,531 We used MariaDB. 142 00:08:10,531 --> 00:08:12,274 You can communicate with the Data Lake. 143 00:08:12,274 --> 00:08:17,536 And, of course, for your computations, you go to the so-called Analytics Cluster, 144 00:08:17,536 --> 00:08:20,712 where you use things like Apache Spark, which actually-- 145 00:08:20,712 --> 00:08:25,138 it's the only really efficient way to process the data 146 00:08:25,138 --> 00:08:27,313 that we need to process. 147 00:08:27,313 --> 00:08:32,219 When I started doing this back in 2017, I remember when I saw 148 00:08:32,219 --> 00:08:35,421 only the schema of the infrastructure for the first time. 149 00:08:35,421 --> 00:08:38,504 If I could not rely on my colleague Adam Shorland-- 150 00:08:38,504 --> 00:08:40,471 who is still with us in Wikimedia Deutschland-- 151 00:08:40,471 --> 00:08:44,008 I would never have made it; I wouldn't even know how to navigate the structure.
152 00:08:46,070 --> 00:08:49,085 As you start building a project to do analytics for Wikidata, 153 00:08:49,085 --> 00:08:52,391 you see how it gets more and more complicated, 154 00:08:52,391 --> 00:08:55,046 because you have to deal with synchronizing different systems, 155 00:08:55,046 --> 00:08:57,908 different teams, infrastructures, different data sets. 156 00:08:58,419 --> 00:08:59,968 However, it pays off, 157 00:09:00,346 --> 00:09:02,948 that synchronization and all the pain. 158 00:09:03,282 --> 00:09:07,632 It can get really nasty sometimes, and the most recent example 159 00:09:07,632 --> 00:09:10,777 is the production of the Data Quality Report for Wikidata. 160 00:09:12,128 --> 00:09:16,926 That's an initial assessment of the quality of what we have in Wikidata. 161 00:09:16,926 --> 00:09:18,278 In order to produce it, 162 00:09:18,278 --> 00:09:22,211 we had to rely on the quality predictions from the ORES system, 163 00:09:22,211 --> 00:09:25,283 the machine learning system developed by Aaron Halfaker 164 00:09:25,283 --> 00:09:27,502 and the Scoring Platform team, 165 00:09:28,383 --> 00:09:32,317 and combine that with the Wikidata Concepts Monitor reuse statistics. 166 00:09:32,806 --> 00:09:36,691 The revision history, the full revision history of all Wikipedias, 167 00:09:36,691 --> 00:09:40,009 is available in one single huge big data table 168 00:09:40,009 --> 00:09:41,358 called the MediaWiki History. 169 00:09:41,358 --> 00:09:42,982 That lives in the Data Lake. 170 00:09:42,982 --> 00:09:46,672 And also we had to process the JSON Dump in HDFS. 171 00:09:46,672 --> 00:09:48,804 So we're talking about four massive structures: 172 00:09:48,804 --> 00:09:51,946 two machine learning systems with their complexities, 173 00:09:51,946 --> 00:09:53,893 and two huge data sets.
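[Editor's note: the combination described above, ORES quality predictions joined with Concepts Monitor reuse statistics, boils down to a per-item join and aggregation. A toy sketch with invented item IDs, classes, and counts; the real data live in Hadoop and are processed with Spark, not Python dictionaries.]

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-item ORES quality class and per-item reuse count.
quality = {"Q1": "A", "Q2": "B", "Q3": "A", "Q4": "E", "Q5": "E"}
reuse = {"Q1": 500, "Q2": 120, "Q3": 300, "Q4": 7, "Q5": 3}

# Join the two sources on the item ID, then aggregate reuse per class.
reuse_by_class = defaultdict(list)
for item, cls in quality.items():
    reuse_by_class[cls].append(reuse.get(item, 0))

median_reuse = {cls: median(counts) for cls, counts in reuse_by_class.items()}
```

On toy data like this, the top class comes out more reused than the bottom one, which is the regularity the report found at scale.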
174 00:09:53,893 --> 00:09:58,231 Everything needs to work in sync in order to be able to produce the Quality Report 175 00:09:58,231 --> 00:10:00,750 that we're presenting this year at WikidataCon. 176 00:10:00,750 --> 00:10:04,376 But if we hadn't done that, 177 00:10:04,759 --> 00:10:07,765 we couldn't show beautiful things like this. 178 00:10:07,765 --> 00:10:12,130 So on the horizontal axis, you have the ORES quality prediction score. 179 00:10:12,130 --> 00:10:13,490 We use five categories. 180 00:10:13,490 --> 00:10:17,436 And you can inform yourself-- just google "Wikidata data quality categories." 181 00:10:17,436 --> 00:10:18,798 You will find the description. 182 00:10:18,798 --> 00:10:22,271 The A-class to the left-- the best items that we have, 183 00:10:22,271 --> 00:10:24,969 and at the same time-- that's the green box-- 184 00:10:24,969 --> 00:10:27,812 they are the most reused items in Wikipedia. 185 00:10:27,812 --> 00:10:30,450 So, as Lydia explained yesterday, 186 00:10:30,450 --> 00:10:32,742 it's not like all our items are of the highest quality. 187 00:10:32,742 --> 00:10:38,030 On the contrary, we have many items that are not of that high quality, 188 00:10:38,030 --> 00:10:40,541 but at least we know what we're doing with them. 189 00:10:40,541 --> 00:10:42,124 And you can see the regularity. 190 00:10:42,124 --> 00:10:46,228 As the quality of an item decreases from left to right, 191 00:10:46,228 --> 00:10:49,179 the items tend to be less and less reused. 192 00:10:49,724 --> 00:10:53,817 So this synchronization also helped us learn things like this. 193 00:10:54,225 --> 00:10:57,850 To the right, for example, these five time series here. 194 00:10:58,274 --> 00:11:05,252 Each time series corresponds to one of the quality categories-- 195 00:11:05,252 --> 00:11:06,642 A, B, C, D, or E.
196 00:11:06,642 --> 00:11:11,222 And the time is on the horizontal axis, running from left to right. 197 00:11:11,222 --> 00:11:15,883 And you can see here how many items from each quality class 198 00:11:15,883 --> 00:11:19,305 received their latest revision when. 199 00:11:19,792 --> 00:11:23,956 So the top quality class, A, is this [inaudible] line, 200 00:11:23,956 --> 00:11:29,693 which is found, say, at the rightmost position here, 201 00:11:29,693 --> 00:11:31,113 and is the shortest line. 202 00:11:31,113 --> 00:11:34,584 So those are the best items that we have. 203 00:11:35,341 --> 00:11:38,247 And what you can see is actually that there is no item 204 00:11:38,247 --> 00:11:44,580 that did not receive at least one revision after December 2018, 205 00:11:44,580 --> 00:11:48,118 meaning one thing-- if you want quality in Wikidata, you have to work on it. 206 00:11:48,118 --> 00:11:50,893 So the best items that we have are actually the items 207 00:11:50,893 --> 00:11:52,801 that we're really paying attention to. 208 00:11:52,801 --> 00:11:55,743 If you look at the classes of lower quality, the other time series, 209 00:11:55,743 --> 00:11:59,173 you will see that we have items that were revised in 2012 210 00:11:59,173 --> 00:12:00,683 for the last time. 211 00:12:01,156 --> 00:12:03,348 So it tells a story of responsibilities-- 212 00:12:03,348 --> 00:12:07,694 how much work we put into the items [that actually work]. 213 00:12:07,694 --> 00:12:09,421 That is what brings quality. 214 00:12:13,043 --> 00:12:17,205 While we do these things, we also try to make as much use 215 00:12:17,205 --> 00:12:20,163 of the byproducts of these procedures as possible.
216 00:12:20,569 --> 00:12:23,308 So, for example, in order to develop the project 217 00:12:23,308 --> 00:12:25,425 called Wikidata Languages Landscape-- 218 00:12:25,425 --> 00:12:28,375 I think I mentioned it yesterday during the Birthday Presentation-- 219 00:12:30,545 --> 00:12:34,444 I had to perform a quite thorough study 220 00:12:34,444 --> 00:12:37,725 of the sub-ontology of languages in Wikidata. 221 00:12:37,725 --> 00:12:41,712 And you know what? There are problems in that ontology. 222 00:12:45,502 --> 00:12:48,467 I did not want to miss giving you an opportunity. 223 00:12:49,301 --> 00:12:52,247 So this is the dashboard actually about the languages, 224 00:12:52,247 --> 00:12:54,791 called the Wikidata Languages Landscape. 225 00:12:54,791 --> 00:12:58,594 Once again, you have all the URLs in the presentation. 226 00:12:59,694 --> 00:13:03,720 So, for example, you want to take a look at a particular language. 227 00:13:03,720 --> 00:13:08,688 Say, English, okay. 228 00:13:09,448 --> 00:13:14,636 So the dashboard will generate its local ontological context 229 00:13:14,636 --> 00:13:19,006 and mark all the relations of the form instance of, 230 00:13:19,006 --> 00:13:21,276 subclass of, and part of. 231 00:13:21,716 --> 00:13:23,845 Why did I choose to do this? 232 00:13:23,845 --> 00:13:25,991 To help you fix the language ontology. 233 00:13:25,991 --> 00:13:31,586 Why? Because you will find many languages, for example, my native language, 234 00:13:31,586 --> 00:13:33,618 which used to be Serbo-Croatian, 235 00:13:33,618 --> 00:13:38,553 and for silly reasons now we have Serbian and Croatian-- it's a political thing. 236 00:13:38,553 --> 00:13:40,554 I don't want to go into it, but you realize 237 00:13:40,554 --> 00:13:43,255 that Serbian is now, for example, at the same time 238 00:13:43,255 --> 00:13:46,637 a subclass of Serbo-Croatian and a part of Serbo-Croatian.
239 00:13:46,955 --> 00:13:48,395 The same still holds for Croatian-- 240 00:13:48,395 --> 00:13:50,860 Croatian is also a part and a subclass of Serbo-Croatian. 241 00:13:50,860 --> 00:13:52,496 So Serbo-Croatian used to be a language. 242 00:13:52,496 --> 00:13:54,957 Now we don't have normative support for it. 243 00:13:54,957 --> 00:13:57,086 But still, it's not a language class, it's a language. 244 00:13:57,086 --> 00:14:00,528 Can it be a part of it, or can it be a subclass of it? 245 00:14:00,528 --> 00:14:03,297 So it's a confusion of [mereological] and set-theoretic relations, 246 00:14:03,297 --> 00:14:05,803 and I think it should be fixed somehow. 247 00:14:06,656 --> 00:14:09,245 In other words, don't say 248 00:14:10,129 --> 00:14:14,993 that you don't have the tool to fix the ontology. 249 00:14:14,993 --> 00:14:17,859 Just find some time and go play with it. 250 00:14:19,257 --> 00:14:22,431 Speaking of languages, as I mentioned, I just want to show you this project. 251 00:14:22,990 --> 00:14:27,162 Many people liked this thing that I published online on Twitter. 252 00:14:27,162 --> 00:14:28,567 That's one of the things, you know. 253 00:14:28,567 --> 00:14:32,565 Data science is usually sold via visualizations. 254 00:14:32,565 --> 00:14:34,202 People like to visualize things, 255 00:14:34,202 --> 00:14:36,843 and, of course, we do pay attention to that. 256 00:14:37,763 --> 00:14:40,385 Aesthetics is a part of communication. 257 00:14:41,772 --> 00:14:44,051 It's not the most important thing for a scientific finding 258 00:14:44,051 --> 00:14:45,348 to show you something beautiful, 259 00:14:45,348 --> 00:14:48,621 but if you can show something beautiful, you shouldn't miss the opportunity. 260 00:14:48,621 --> 00:14:51,876 So here we did with the languages in Wikidata 261 00:14:51,876 --> 00:14:53,987 the same thing that we do with items and projects 262 00:14:53,987 --> 00:14:56,161 in the Wikidata Concepts Monitor.
263 00:14:56,161 --> 00:15:02,898 We actually group languages by similarity, and the similarity was defined 264 00:15:02,898 --> 00:15:05,800 as how much they overlap across the items. 265 00:15:06,452 --> 00:15:10,531 So if I can talk about the same things in English 266 00:15:10,531 --> 00:15:13,973 and in some West African language, for example, 267 00:15:13,973 --> 00:15:15,807 then those two things, those two languages, 268 00:15:15,807 --> 00:15:19,209 are similar in terms of their reference sets-- 269 00:15:19,209 --> 00:15:21,302 what they can refer to. 270 00:15:22,330 --> 00:15:24,849 Each language here 271 00:15:24,849 --> 00:15:27,368 points to its closest neighbor, nearest neighbor-- 272 00:15:27,368 --> 00:15:29,840 to the one which is most similar to it. 273 00:15:29,840 --> 00:15:35,595 And, of course, you can see these groupings actually occur naturally. 274 00:15:35,595 --> 00:15:37,549 So it's not a fully-connected graph. 275 00:15:37,549 --> 00:15:40,838 Clustering this thing was nothing like [there is]. 276 00:15:41,471 --> 00:15:44,418 Also, what you can learn from the Languages Landscape project 277 00:15:44,418 --> 00:15:49,294 is when you combine our data with external resources. 278 00:15:49,294 --> 00:15:51,369 So this is also very informative for us, 279 00:15:51,369 --> 00:15:54,240 for the whole, I would say, Wikimedia community. 280 00:15:54,563 --> 00:15:56,636 We have the UNESCO language status, 281 00:15:56,636 --> 00:15:59,755 which Wikidata actually gets from UNESCO, 282 00:15:59,755 --> 00:16:01,907 its websites and databases, 283 00:16:01,907 --> 00:16:05,198 and the Ethnologue language status on the vertical axis. 284 00:16:05,198 --> 00:16:08,751 We have the Concepts Monitor reuse statistic.
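[Editor's note: the nearest-neighbor construction described above can be sketched as follows. Each language is represented by the set of items it can refer to, similarity is set overlap (Jaccard), and each language points to its most similar neighbor. The language codes and item sets below are invented for illustration.]

```python
# Hypothetical item sets per language (invented for illustration).
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "fr": {"Q1", "Q2", "Q3"},
    "sw": {"Q3", "Q9"},
}

def jaccard(a, b):
    """Overlap of two languages' item sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def nearest_neighbor(lang):
    """The language whose item set overlaps most with lang's."""
    others = (other for other in labels if other != lang)
    return max(others, key=lambda other: jaccard(labels[lang], labels[other]))
```

Drawing an arrow from each language to its nearest neighbor yields a sparse graph in which groupings emerge naturally, rather than a fully connected one.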
285 00:16:08,945 --> 00:16:12,973 So we look at all the items that have a label in a particular language, 286 00:16:12,973 --> 00:16:15,949 and then we look at how popular those items are, 287 00:16:15,949 --> 00:16:18,010 how many times people used them. 288 00:16:19,310 --> 00:16:25,059 Of course, those safe national languages, languages that are not endangered-- 289 00:16:25,886 --> 00:16:28,165 they have a slight advantage. 290 00:16:28,165 --> 00:16:30,624 But the situation is not really that bad. 291 00:16:30,624 --> 00:16:33,660 Say, for example, take a look at the Ethnologue category 292 00:16:33,660 --> 00:16:37,206 of "Second language only"-- that's the rightmost one. 293 00:16:37,206 --> 00:16:41,798 You will see three languages there being reused 294 00:16:41,798 --> 00:16:44,445 in a way comparable to the most favorable, 295 00:16:44,445 --> 00:16:47,456 not endangered category of national languages. 296 00:16:47,756 --> 00:16:49,414 It's not like the gender bias. 297 00:16:49,414 --> 00:16:53,784 Wikipedia seems to be really reflecting the gender bias that exists in the world. 298 00:16:53,784 --> 00:16:58,130 Then we have nice initiatives, like the women who are trying to fix this thing. 299 00:16:58,130 --> 00:17:02,210 With languages, well, of course, some languages are a little bit favored, 300 00:17:02,210 --> 00:17:04,276 but it's not that bad, 301 00:17:04,276 --> 00:17:07,872 and that finding really brought a lot of joy to us. 302 00:17:08,739 --> 00:17:12,732 Now, speaking of external resources, every time that I look at this graph, 303 00:17:12,732 --> 00:17:16,482 I say to myself, "We know who is the queen of the databases." 304 00:17:18,122 --> 00:17:22,294 You know the external identifiers property in Wikidata. 305 00:17:23,020 --> 00:17:30,171 So here we take all external identifiers that were present in the August 306 00:17:31,504 --> 00:17:34,823 JSON Dump of Wikidata, which we processed.
307 00:17:34,823 --> 00:17:38,079 Then, once again, we did some statistics on it 308 00:17:38,079 --> 00:17:45,125 and grouped all the external identifiers by how much they overlap across the items. 309 00:17:51,228 --> 00:17:52,944 Aha, here we are. 310 00:17:55,021 --> 00:17:58,363 That visualization, except for maybe being aesthetically pleasing, 311 00:17:58,363 --> 00:17:59,691 is not that useful, 312 00:17:59,691 --> 00:18:03,007 but you have an interactive version developed in the dashboard. 313 00:18:04,231 --> 00:18:07,857 If you go and inspect the interactive version, 314 00:18:07,857 --> 00:18:10,984 you can learn, for example, one obvious fact: 315 00:18:10,984 --> 00:18:13,615 that they really follow some natural semantics. 316 00:18:13,615 --> 00:18:15,706 They are grouped in intuitive ways. 317 00:18:16,050 --> 00:18:21,745 We should be perfectly expecting them to give some feedback on the quality 318 00:18:21,745 --> 00:18:24,453 of the organization of the data in Wikidata, 319 00:18:24,453 --> 00:18:26,797 telling us that the situation is really not that bad. 320 00:18:27,307 --> 00:18:30,129 What I am saying is that all the external identifiers 321 00:18:30,129 --> 00:18:32,230 from the databases on sports, for example, 322 00:18:32,230 --> 00:18:34,685 you will find to be in one cluster. 323 00:18:34,685 --> 00:18:38,681 And then, for example, you will even be able to figure out which sport. 324 00:18:39,198 --> 00:18:44,277 Databases on tennis are here, databases on football are here, etc. 325 00:18:48,175 --> 00:18:50,670 Yes, these external resources 326 00:18:50,670 --> 00:18:53,684 are things that we really try to pay a lot of attention to. 327 00:18:54,653 --> 00:18:59,781 All right, as I said, the final thing is communication and aesthetics. 328 00:18:59,781 --> 00:19:01,265 We do pay attention to it. 329 00:19:01,265 --> 00:19:04,183 So, for example, this thing-- many people liked it.
330 00:19:04,183 --> 00:19:07,184 It's a little bit rescaled for aesthetics-- 331 00:19:07,184 --> 00:19:11,808 the same network of external identifiers that you were able to see. 332 00:19:11,808 --> 00:19:16,318 But you don't get these results for free, of course. 333 00:19:16,707 --> 00:19:20,163 For example, this one was obtained by running a clustering algorithm 334 00:19:20,163 --> 00:19:23,946 on Jaccard distances-- technical terms, I'm not going into it. 335 00:19:23,946 --> 00:19:29,093 And first, we had to start from a matrix actually derived from 408 languages 336 00:19:29,093 --> 00:19:31,852 that are reused across the Wikimedia projects. 337 00:19:31,852 --> 00:19:35,268 Wikidata knows about many languages, not only 400. 338 00:19:35,268 --> 00:19:39,704 But only 400 of them actually label items that get reused-- 339 00:19:39,704 --> 00:19:43,880 a 408-languages-by-60-million-items contingency matrix-- that's a lot of computation. 340 00:19:44,591 --> 00:19:47,112 Machine learning and statistics add an additional layer of complication, 341 00:19:47,112 --> 00:19:51,382 and they are, of course, the most beautiful part of your work as a data scientist, 342 00:19:51,382 --> 00:19:55,216 but they don't get to occupy 343 00:19:55,216 --> 00:19:58,266 more than, say, 10% or 15% of your time, 344 00:19:58,266 --> 00:20:00,932 because everything else goes to data engineering 345 00:20:00,932 --> 00:20:03,083 and synchronization of different systems. 346 00:20:03,083 --> 00:20:04,936 With the machine learning and statistics things, 347 00:20:04,936 --> 00:20:07,249 we use plenty of different algorithms. 348 00:20:07,249 --> 00:20:12,845 I don't think now is the time to go and talk about the details of these things. 349 00:20:12,845 --> 00:20:14,916 I will have plenty of opportunities to discuss them, 350 00:20:14,916 --> 00:20:18,466 but it's typically a highly technical topic, 351 00:20:18,466 --> 00:20:21,369 better suited for a scientific conference.
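[Editor's note: the clustering-on-Jaccard-distances step mentioned above could look roughly like this. Start from the language-by-item table, compute pairwise Jaccard distances, then group languages whose distance falls under a threshold, i.e. connected components of the thresholded distance graph, a simple form of single-linkage clustering. All names and numbers are invented toy data; the real computation runs over the 60-million-item contingency matrix.]

```python
# Toy language-by-item table (invented); each language maps to the set
# of items it labels.
table = {
    "lang_a": {"Q1", "Q2", "Q3"},
    "lang_b": {"Q1", "Q2"},
    "lang_c": {"Q8", "Q9"},
}

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def cluster(threshold):
    """Group languages closer than `threshold` into the same cluster
    (connected components of the thresholded distance graph)."""
    langs = list(table)
    parent = {lang: lang for lang in langs}

    def find(lang):                         # follow parent links to the root
        while parent[lang] != lang:
            lang = parent[lang]
        return lang

    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            if jaccard_distance(table[a], table[b]) < threshold:
                parent[find(a)] = find(b)   # union the two components

    groups = {}
    for lang in langs:
        groups.setdefault(find(lang), set()).add(lang)
    return list(groups.values())
```

Production systems would typically use a library routine for this (for example, hierarchical clustering over a precomputed distance matrix), but the idea is the same: languages with heavily overlapping item sets end up in the same group.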
352 00:20:22,999 --> 00:20:26,509 Here are all the layers of complexity. 353 00:20:26,509 --> 00:20:30,206 In the end, we have to add deployment and dashboards, 354 00:20:30,206 --> 00:20:33,445 because they won't build themselves. 355 00:20:33,831 --> 00:20:36,854 And all these things, all these phases 356 00:20:36,854 --> 00:20:40,581 of development of an analytics or data science project, 357 00:20:41,188 --> 00:20:46,560 need to fit together in order to be able to derive empirical results 358 00:20:46,565 --> 00:20:49,392 on a system of Wikidata's complexity. 359 00:20:49,848 --> 00:20:53,720 The true picture is that you cannot really just run through these cycles. 360 00:20:54,417 --> 00:20:56,884 All the phases of the process are interdependent, 361 00:20:56,884 --> 00:21:00,012 because you really have to plan very early on 362 00:21:00,012 --> 00:21:04,115 what visualizations you are going to use, what technology you will use 363 00:21:04,115 --> 00:21:06,654 to render those visualizations in the end, 364 00:21:06,654 --> 00:21:08,888 what machine learning algorithms you will be using, 365 00:21:08,888 --> 00:21:13,534 because all of them have their own taste about what data structures they like. 366 00:21:13,534 --> 00:21:16,695 And then you hit the constraints of infrastructure-- similar things. 367 00:21:16,695 --> 00:21:18,827 I am not complaining, I'm really enjoying this. 368 00:21:18,827 --> 00:21:22,400 This is the most beautiful playground I've ever seen in my life. 369 00:21:22,400 --> 00:21:25,381 Thanks to you and the people who built Wikidata. 370 00:21:25,381 --> 00:21:26,388 Thank you very much! 371 00:21:26,388 --> 00:21:27,729 That would be it. 372 00:21:28,119 --> 00:21:29,991 (moderator) Thank you, Goran. 373 00:21:29,991 --> 00:21:32,290 (applause) 374 00:21:32,825 --> 00:21:35,261 (moderator) You have time for a couple of questions. 375 00:21:44,322 --> 00:21:47,663 (man) Well, you did a lot of research, I can see that.
376 00:21:47,663 --> 00:21:48,676 (Goran) Sorry? 377 00:21:48,676 --> 00:21:51,642 (man) You did a lot of research, I can see that. 378 00:21:51,642 --> 00:21:57,244 I'm wondering if there is anything that you discovered during the research 379 00:21:57,244 --> 00:21:58,853 that surprised you. 380 00:21:59,327 --> 00:22:01,356 Thank you for that question. 381 00:22:01,356 --> 00:22:07,663 Actually, I wanted to focus on that in this talk 382 00:22:07,663 --> 00:22:11,244 until I realized that we simply wouldn't have enough time 383 00:22:11,244 --> 00:22:13,816 to explain everything. 384 00:22:15,407 --> 00:22:19,247 Most of the time, when you're analyzing big datasets 385 00:22:19,247 --> 00:22:22,179 structured in the way Wikidata is-- 386 00:22:22,179 --> 00:22:26,345 even when you're going into the wild, meaning studying the reuse of data 387 00:22:26,345 --> 00:22:27,442 across Wikipedia, 388 00:22:27,442 --> 00:22:30,622 where people can actually do whatever they like with those items-- 389 00:22:31,662 --> 00:22:33,917 you have a lot of data, a lot of information. 390 00:22:33,917 --> 00:22:35,603 Of course, you see structure. 391 00:22:35,603 --> 00:22:40,209 Most of the time, 90% of the time, you see things that are expected. 392 00:22:41,195 --> 00:22:46,678 Things like which projects make the most use of Wikidata. 393 00:22:46,678 --> 00:22:49,891 And you can almost-- you don't have to do too much statistics, 394 00:22:50,721 --> 00:22:54,897 you can rely on the expectations of the whole world and see what's happening. 395 00:22:56,694 --> 00:22:58,643 Many things were surprising, 396 00:22:58,643 --> 00:23:03,308 and those things that were surprising are really the most informative things.
397 00:23:05,372 --> 00:23:09,069 When one communicates the findings from analytics and such systems, 398 00:23:09,486 --> 00:23:14,200 it's important-- people typically expect either "wow" visualizations-- 399 00:23:14,200 --> 00:23:18,316 and we have tons of data, so we can always deliver "wow" visualizations-- 400 00:23:18,912 --> 00:23:21,563 or they expect to learn things like, 401 00:23:21,563 --> 00:23:24,204 "Our project is doing better than this project" 402 00:23:24,204 --> 00:23:26,239 or "Yes, we are rocking!" etc., 403 00:23:26,239 --> 00:23:30,148 while the goal of the whole game should actually be to learn 404 00:23:30,148 --> 00:23:34,128 what is wrong, what is not working, what could be done better. 405 00:23:34,938 --> 00:23:36,451 Many things were surprising. 406 00:23:38,341 --> 00:23:42,061 For example, the distribution of item usage across languages-- 407 00:23:42,061 --> 00:23:43,850 that was surprising to me. 408 00:23:43,850 --> 00:23:45,014 This thing. 409 00:23:47,098 --> 00:23:51,348 So I did not really expect that the situation with languages 410 00:23:51,348 --> 00:23:54,352 would be this good, I would say. 411 00:23:54,830 --> 00:24:01,332 My expectation was that languages that have less economic support, 412 00:24:01,332 --> 00:24:03,651 normative support, even political support-- 413 00:24:03,651 --> 00:24:06,601 that's a factor when you talk about languages-- 414 00:24:06,601 --> 00:24:11,521 would not be so widely reused across the Wikimedia universe. 415 00:24:11,521 --> 00:24:15,540 In fact, it turns out that the differences-- we can see them, 416 00:24:15,540 --> 00:24:18,977 but it's far away from the gender bias, which is really bad, I think; 417 00:24:18,977 --> 00:24:20,707 we need to work there. 418 00:24:20,707 --> 00:24:22,456 That was surprising, for example. 419 00:24:22,456 --> 00:24:25,725 It was a positive surprise, to put it that way.
420 00:24:25,725 --> 00:24:28,271 Then from time to time, we discover projects 421 00:24:28,821 --> 00:24:34,775 that actually do a great job of reusing the Wikidata content in Wikimedia. 422 00:24:34,775 --> 00:24:37,895 We're totally surprised to learn that such a project can do it. 423 00:24:38,612 --> 00:24:42,554 Then you start thinking, and you figure out there is a community of people 424 00:24:42,554 --> 00:24:44,000 actually doing it. 425 00:24:44,468 --> 00:24:48,735 And it's a strange feeling because I get to see all these things through machines, 426 00:24:48,735 --> 00:24:51,971 through databases, through visualizations and tables, 427 00:24:51,971 --> 00:24:58,165 and it's always that strange feeling when I realize this result was produced 428 00:24:58,165 --> 00:25:03,094 by a group of people who don't even know that someone is looking at their result now. 429 00:25:06,101 --> 00:25:07,832 (moderator) Another question? 430 00:25:13,657 --> 00:25:14,703 Thank you. 431 00:25:14,703 --> 00:25:16,237 Is that it? Thank you very much! 432 00:25:16,237 --> 00:25:17,734 (moderator) Thank you. 433 00:25:17,734 --> 00:25:19,890 (applause)