1 00:00:06,343 --> 00:00:09,678 Yes, Wikidata Statistics: What, Where, and How? 2 00:00:09,678 --> 00:00:13,005 This is an attempt at an overview of the analytical systems 3 00:00:13,005 --> 00:00:16,173 focusing on what was developed at Wikimedia Deutschland 4 00:00:16,173 --> 00:00:18,155 in the previous almost three years 5 00:00:18,155 --> 00:00:22,346 since I started doing data science for Wikidata and the dictionary. 6 00:00:22,346 --> 00:00:28,345 So, during this presentation, I will try to switch from the presentation 7 00:00:28,346 --> 00:00:32,092 to the dashboards and show you the end data products. 8 00:00:32,995 --> 00:00:35,029 However, if that causes any trouble, 9 00:00:35,029 --> 00:00:39,070 this is actually the URL of the analytics portal. 10 00:00:39,070 --> 00:00:41,272 So everything that I will be presenting here, 11 00:00:41,272 --> 00:00:44,105 whatever you can see on the slides, you can also check out later 12 00:00:44,105 --> 00:00:47,285 from the presentation-- go and play with the real thing. 13 00:00:47,285 --> 00:00:51,101 Otherwise, you will see only the screenshots here on the slides. 14 00:00:51,101 --> 00:00:58,275 So the goal-- well, the talk will be a failed attempt to communicate 15 00:00:58,275 --> 00:01:01,502 an almost endlessly technically complicated field 16 00:01:02,567 --> 00:01:06,843 in terms that can actually motivate people to start making use 17 00:01:06,843 --> 00:01:08,338 of this analytical product 18 00:01:08,338 --> 00:01:11,010 into whose development we are really putting a lot of effort. 19 00:01:11,010 --> 00:01:13,631 So, as I said, I will try to provide an overview 20 00:01:13,631 --> 00:01:15,679 of the Wikidata Statistics and Analytics systems. 21 00:01:15,679 --> 00:01:20,636 And I will try to exemplify the usage of some of them, not all.
22 00:01:20,636 --> 00:01:23,362 And also I will try to go just a little bit under the hood 23 00:01:23,362 --> 00:01:27,453 to try to illustrate how it is done, what is done here, 24 00:01:27,453 --> 00:01:31,144 because I thought it might be interesting to the audience. 25 00:01:31,818 --> 00:01:33,534 Okay, so say... 26 00:01:34,804 --> 00:01:38,538 In analytics and data science, you always start with formulating 27 00:01:38,538 --> 00:01:41,709 as clearly as possible your goals and motivations. 28 00:01:41,709 --> 00:01:47,080 Otherwise, you enter into endless cycles of developing analytical tools 29 00:01:47,080 --> 00:01:49,733 and data science products that actually do something, 30 00:01:49,733 --> 00:01:52,835 but nobody really understands what they're being built for. 31 00:01:52,835 --> 00:01:57,669 In 2017, in Wikimedia Deutschland, a request, a demand was formulated-- 32 00:01:57,925 --> 00:01:59,740 we said that we needed an analytical system 33 00:01:59,740 --> 00:02:01,936 that would give an insight into the ways 34 00:02:01,936 --> 00:02:05,865 that Wikidata items are reused across the Wikimedia projects, 35 00:02:05,865 --> 00:02:09,016 meaning across the Wikipedia universe-- all the encyclopedias, 36 00:02:09,016 --> 00:02:11,826 and then Wikivoyage, Wikibooks, WikiCite, etc.-- 37 00:02:11,826 --> 00:02:15,610 all the websites, approximately 800, that we are actually managing. 38 00:02:15,610 --> 00:02:19,553 So just to explain the differences between the data. 39 00:02:19,553 --> 00:02:23,794 On the left, for example, you see a small, or very small, subset of Wikidata. 40 00:02:23,794 --> 00:02:28,114 These are the languages, some of the Slavic, I think, languages, 41 00:02:28,114 --> 00:02:30,383 and in Wikidata they are connected 42 00:02:30,383 --> 00:02:34,194 by their properties and belong to different classes, etc. 43 00:02:34,194 --> 00:02:36,785 But we were looking for a different kind of mapping.
44 00:02:36,785 --> 00:02:41,085 So what you see here, on the right side, is a set of items 45 00:02:41,085 --> 00:02:44,823 all belonging to the class of architectural structures, I would say. 46 00:02:44,823 --> 00:02:48,496 And this here is the result of their empirical embeddings. 47 00:02:48,496 --> 00:02:50,511 So the items related here-- 48 00:02:50,518 --> 00:02:55,952 they are linked by their similarity of usage across Wikipedias, for example. 49 00:02:55,952 --> 00:02:57,842 So what does it mean-- the similarity? 50 00:02:58,632 --> 00:03:03,068 To be similar in terms of how an item is used across the Wikipedias. 51 00:03:03,068 --> 00:03:06,943 So imagine you take an array of numbers, 52 00:03:07,353 --> 00:03:11,107 and each element of the array is one project-- it's English Wikipedia, 53 00:03:11,558 --> 00:03:17,417 it is French Wikivoyage, it is Italian Wikipedia, etc. 54 00:03:17,901 --> 00:03:20,495 And then, you count how many times 55 00:03:20,495 --> 00:03:23,085 a particular item has been used in that project. 56 00:03:24,112 --> 00:03:27,631 So you use an array of numbers to describe the item that way. 57 00:03:27,631 --> 00:03:29,768 It's a little bit more complicated in practice. 58 00:03:31,299 --> 00:03:36,074 And then, you can describe all items in Wikidata that were ever used 59 00:03:36,074 --> 00:03:39,358 across the websites at all by such arrays of numbers, 60 00:03:39,358 --> 00:03:41,320 called embeddings, technically, right? 61 00:03:41,791 --> 00:03:45,513 From those data, using different distance metrics, 62 00:03:45,513 --> 00:03:48,893 applying machine learning methods, doing dimensionality reduction, 63 00:03:48,893 --> 00:03:50,382 and similar things, 64 00:03:50,382 --> 00:03:53,093 you can actually figure out what the similarity pattern is. 65 00:03:53,093 --> 00:03:55,622 And here items are connected 66 00:03:55,622 --> 00:04:00,501 by how similar their patterns of usage are across different Wikipedias.
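[Editor's note: a minimal sketch of the idea described above-- each item as an array of per-project usage counts, compared by a distance/similarity measure. The item IDs, project order, and counts are invented for illustration; the real pipeline covers hundreds of projects and adds machine learning and dimensionality reduction on top.]

```python
import math

# Hypothetical usage counts per project, in a fixed project order,
# e.g. [enwiki, frwikivoyage, itwiki]. Item IDs and numbers are invented.
usage = {
    "Q_tower":   [120, 30, 55],
    "Q_pyramid": [90, 25, 40],
    "Q_local":   [5, 0, 80],   # an item used almost only on one project
}

def cosine_similarity(a, b):
    """Similarity of two item-usage arrays (the 'embeddings' above)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Items used in similar proportions across projects come out more similar
# to each other than to items with a very different usage profile.
```

At the real scale, tens of millions of items times hundreds of projects, this is exactly the kind of computation that needs a cluster rather than a laptop.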
67 00:04:01,726 --> 00:04:04,551 Once again, every visualization, every result that I show-- 68 00:04:04,551 --> 00:04:08,278 there is a link in the presentation, so you can go and check for yourself. 69 00:04:08,278 --> 00:04:10,578 You can play with this thing interactively. 70 00:04:10,578 --> 00:04:15,826 Similarly, we were able to derive a graph like this one. 71 00:04:15,826 --> 00:04:20,011 This one does not connect the Wikidata items, it connects projects, 72 00:04:20,011 --> 00:04:23,069 by looking at how similar they are 73 00:04:23,069 --> 00:04:26,779 in terms of how they use different Wikidata items. 74 00:04:30,235 --> 00:04:31,733 To be as precise as possible, 75 00:04:32,468 --> 00:04:35,369 the data that we use to do this-- they do not live in Wikidata, 76 00:04:35,369 --> 00:04:36,818 they are not a part of Wikidata, 77 00:04:36,818 --> 00:04:38,823 the data is not [located] there at all. 78 00:04:38,823 --> 00:04:41,917 We have Wikidata, we have formulated our motivational goals, 79 00:04:41,917 --> 00:04:45,773 and immediately we started talking about the data model and the structures. 80 00:04:45,773 --> 00:04:49,760 What structures and data models do you need to answer the questions 81 00:04:49,760 --> 00:04:52,412 that you initially proposed? 82 00:04:52,941 --> 00:04:59,064 So there is Wikibase and the client-side tracking mechanism, 83 00:04:59,064 --> 00:05:01,884 which is installed in all those wikis, 84 00:05:01,884 --> 00:05:07,001 that actually tracks the Wikidata usage on a project, on Wikipedia, for example. 85 00:05:07,001 --> 00:05:10,700 So every time an item is used in [meaningful ways] 86 00:05:10,700 --> 00:05:14,743 or in a different way-- there is a row in a huge SQL table 87 00:05:14,743 --> 00:05:18,124 that enters and records the usage of that item.
88 00:05:18,124 --> 00:05:22,326 Now, immediately, we had to face a data-engineering problem, of course, 89 00:05:22,326 --> 00:05:26,434 because we are talking about hundreds of huge SQL tables, 90 00:05:26,434 --> 00:05:29,301 and we had to do machine learning and statistics 91 00:05:29,301 --> 00:05:32,746 across all the data together, not separately, 92 00:05:32,746 --> 00:05:37,283 in order to be able to produce structures like this one or like this one. 93 00:05:37,578 --> 00:05:41,332 So in cooperation with the Analytics Engineering Team of the Foundation, 94 00:05:41,332 --> 00:05:44,459 we started transferring those data from Wikibase 95 00:05:44,459 --> 00:05:49,181 to the Wikimedia Foundation Data Lake, which is actually a big data storage. 96 00:05:49,181 --> 00:05:52,753 The data do not live there in a relational database. 97 00:05:52,753 --> 00:05:54,060 They live in something similar-- 98 00:05:54,060 --> 00:05:56,546 it's Hadoop, and Hive tables are there, etc., 99 00:05:56,546 --> 00:05:58,552 but it's a huge, huge engineering procedure. 100 00:05:58,552 --> 00:06:03,405 So not all data in analytics, especially in big games like this 101 00:06:03,405 --> 00:06:06,001 that we have to play with Wikidata and Wikipedia, 102 00:06:06,001 --> 00:06:07,667 are immediately available to you. 103 00:06:07,667 --> 00:06:09,171 One source of complication, 104 00:06:09,171 --> 00:06:13,459 before you actually start solving the problem in a scientific way, 105 00:06:13,459 --> 00:06:16,847 to put it that way, is to engineer the data sets and prepare the structures 106 00:06:16,847 --> 00:06:20,805 that you actually need for doing machine learning, statistics, 107 00:06:20,805 --> 00:06:22,588 and similar things. 108 00:06:23,464 --> 00:06:26,921 This is a full design of the system called the Wikidata Concepts Monitor 109 00:06:26,921 --> 00:06:28,380 that tracks the reuse statistics.
110 00:06:28,380 --> 00:06:30,844 I will not go into details here, of course. 111 00:06:32,394 --> 00:06:35,764 The obvious complication is that-- as I wrote it up-- 112 00:06:35,764 --> 00:06:38,432 many systems need to work together. 113 00:06:38,432 --> 00:06:41,248 You have to synchronize many different sources of data, 114 00:06:41,248 --> 00:06:42,846 many different infrastructures, 115 00:06:42,846 --> 00:06:47,994 just in order to make it happen, even before starting to think 116 00:06:47,994 --> 00:06:52,247 in terms of methodologies, science, statistics, and similar. 117 00:06:53,955 --> 00:06:57,930 As I said, we started with our goals and motivation; 118 00:06:57,930 --> 00:07:01,629 then, typically, the data model and the structures that you need-- 119 00:07:01,629 --> 00:07:04,881 they correspond to those goals and motivations, which should always be 120 00:07:04,881 --> 00:07:08,250 your first step in developing an analytics project. 121 00:07:08,250 --> 00:07:10,857 Then you figure out it's really too complicated, 122 00:07:10,857 --> 00:07:12,846 it cannot be done by one person-- 123 00:07:12,846 --> 00:07:15,077 it cannot be done on one computer, to put it that way. 124 00:07:15,077 --> 00:07:17,771 So we needed to work with the analytics infrastructure, 125 00:07:17,771 --> 00:07:20,403 and then add an additional layer of complication-- 126 00:07:20,403 --> 00:07:23,750 that's communication with external teams and cooperators, 127 00:07:23,750 --> 00:07:28,366 because, obviously, such a system cannot be managed easily by one person. 128 00:07:28,366 --> 00:07:31,358 Actually, I think it would be pretty impossible. 129 00:07:31,720 --> 00:07:33,587 So, as I mentioned, there is this Data Lake, 130 00:07:33,587 --> 00:07:38,091 our big data storage in Hadoop, 131 00:07:38,091 --> 00:07:41,880 and the team of awesome data engineers in the Foundation 132 00:07:41,880 --> 00:07:43,987 called the Analytics Engineering Team.
133 00:07:43,987 --> 00:07:47,660 To a data scientist, data engineers are people who actually watch your back 134 00:07:47,660 --> 00:07:49,426 while you're trying to do your things. 135 00:07:49,426 --> 00:07:51,766 If you cannot rely on a good engineering team, 136 00:07:51,766 --> 00:07:54,164 there's not much you will be able to do by yourself. 137 00:07:55,636 --> 00:08:00,357 This infrastructure is actually maintained by the Foundation, 138 00:08:00,357 --> 00:08:04,127 so you enter through several statistical servers-- 139 00:08:04,441 --> 00:08:06,152 these blue boxes down there. 140 00:08:06,152 --> 00:08:09,274 You can communicate with the relational database systems. 141 00:08:09,274 --> 00:08:10,531 We used MariaDB. 142 00:08:10,531 --> 00:08:12,274 You can communicate with the Data Lake. 143 00:08:12,274 --> 00:08:17,536 And, of course, for your computations, you go to the so-called Analytics Cluster, 144 00:08:17,536 --> 00:08:20,712 where you use things like Apache Spark, which actually-- 145 00:08:20,712 --> 00:08:25,138 it's the only really efficient way to process the data 146 00:08:25,138 --> 00:08:27,313 that we need to process. 147 00:08:27,313 --> 00:08:32,219 When I started doing this back in 2017, I remember when I saw 148 00:08:32,219 --> 00:08:35,421 only the schema of the infrastructure for the first time. 149 00:08:35,421 --> 00:08:38,504 If I could not rely on my colleague Adam Shorland-- 150 00:08:38,504 --> 00:08:40,471 who is still with us in Wikimedia Deutschland-- 151 00:08:40,471 --> 00:08:44,008 I would never have made it; I wouldn't even know how to navigate the structure.
152 00:08:46,070 --> 00:08:49,085 As you start building a project to do analytics for Wikidata, 153 00:08:49,085 --> 00:08:52,391 you see how it gets more and more complicated, 154 00:08:52,391 --> 00:08:55,046 because you have to deal with synchronizing different systems, 155 00:08:55,046 --> 00:08:57,908 different teams, infrastructures, different data sets. 156 00:08:58,419 --> 00:08:59,968 However, it pays off, 157 00:09:00,346 --> 00:09:02,948 that synchronization and all the pain. 158 00:09:03,282 --> 00:09:07,632 It can get really nasty sometimes, and the most recent example 159 00:09:07,632 --> 00:09:10,777 is the production of the Data Quality Report for Wikidata. 160 00:09:12,128 --> 00:09:16,926 That's an initial assessment of the quality of what we have in Wikidata. 161 00:09:16,926 --> 00:09:18,278 In order to produce it, 162 00:09:18,278 --> 00:09:22,211 we had to rely on the quality predictions from the ORES system, 163 00:09:22,211 --> 00:09:25,283 the machine learning system developed by Aaron Halfaker 164 00:09:25,283 --> 00:09:27,502 and the Scoring Platform team, 165 00:09:28,383 --> 00:09:32,317 and combine that with the Wikidata Concepts Monitor reuse statistics. 166 00:09:32,806 --> 00:09:36,691 The revision history, the full revision history of all Wikipedias, 167 00:09:36,691 --> 00:09:40,009 is available in one single huge big data table 168 00:09:40,009 --> 00:09:41,358 called the MediaWiki History. 169 00:09:41,358 --> 00:09:42,982 That lives in the Data Lake. 170 00:09:42,982 --> 00:09:46,672 And also we had to process the JSON Dump in HDFS. 171 00:09:46,672 --> 00:09:48,804 So we're talking about four massive structures: 172 00:09:48,804 --> 00:09:51,946 two machine learning systems with their complexities, 173 00:09:51,946 --> 00:09:53,893 and two huge data sets.
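[Editor's note: the combination described above, ORES quality predictions joined with Concepts Monitor reuse statistics, boils down to a per-item join and aggregation. A toy sketch with invented item IDs, classes, and counts; the real data live in Hadoop and are processed with Spark, not Python dictionaries.]

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-item ORES quality class and per-item reuse count.
quality = {"Q1": "A", "Q2": "B", "Q3": "A", "Q4": "E", "Q5": "E"}
reuse = {"Q1": 500, "Q2": 120, "Q3": 300, "Q4": 7, "Q5": 3}

# Join the two sources on the item ID, then aggregate reuse per class.
reuse_by_class = defaultdict(list)
for item, cls in quality.items():
    reuse_by_class[cls].append(reuse.get(item, 0))

median_reuse = {cls: median(counts) for cls, counts in reuse_by_class.items()}
```

On toy data like this, the top class comes out more reused than the bottom one, which is the regularity the report found at scale.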
174 00:09:53,893 --> 00:09:58,231 Everything needs to work in sync in order to be able to produce the Quality Report 175 00:09:58,231 --> 00:10:00,750 that we're presenting this year at WikidataCon. 176 00:10:00,750 --> 00:10:04,376 But if we hadn't done that, 177 00:10:04,759 --> 00:10:07,765 we couldn't show beautiful things like this. 178 00:10:07,765 --> 00:10:12,130 So on the horizontal axis, you have the ORES quality prediction score. 179 00:10:12,130 --> 00:10:13,490 We use five categories. 180 00:10:13,490 --> 00:10:17,436 And you can inform yourself-- just google "Wikidata data quality categories." 181 00:10:17,436 --> 00:10:18,798 You will find the description. 182 00:10:18,798 --> 00:10:22,271 The A-class to the left-- the best items that we have, 183 00:10:22,271 --> 00:10:24,969 and at the same time-- that's the green box-- 184 00:10:24,969 --> 00:10:27,812 they are the most reused items in Wikipedia. 185 00:10:27,812 --> 00:10:30,450 So, as Lydia explained yesterday, 186 00:10:30,450 --> 00:10:32,742 it's not like all our items are of the highest quality. 187 00:10:32,742 --> 00:10:38,030 On the contrary, we have many items that are not of that high quality, 188 00:10:38,030 --> 00:10:40,541 but at least we know what we're doing with them. 189 00:10:40,541 --> 00:10:42,124 And you can see the regularity. 190 00:10:42,124 --> 00:10:46,228 As the quality of an item decreases from left to right, 191 00:10:46,228 --> 00:10:49,179 the items tend to be less and less reused. 192 00:10:49,724 --> 00:10:53,817 So this synchronization also helped us learn things like this. 193 00:10:54,225 --> 00:10:57,850 To the right, for example, these five time series here. 194 00:10:58,274 --> 00:11:05,252 Each time series corresponds to one of the quality categories-- 195 00:11:05,252 --> 00:11:06,642 A, B, C, D, or E.
196 00:11:06,642 --> 00:11:11,222 And the time is on the horizontal axis, running from left to right. 197 00:11:11,222 --> 00:11:15,883 And you can see here how many items from each quality class 198 00:11:15,883 --> 00:11:19,305 received their latest revision when. 199 00:11:19,792 --> 00:11:23,956 So the top quality class, A, is this [inaudible] line, 200 00:11:23,956 --> 00:11:29,693 which is found, say, at the rightmost position here, 201 00:11:29,693 --> 00:11:31,113 and is the shortest line. 202 00:11:31,113 --> 00:11:34,584 So those are the best items that we have. 203 00:11:35,341 --> 00:11:38,247 And what you can see is actually that there is no item 204 00:11:38,247 --> 00:11:44,580 that did not receive at least one revision after December 2018, 205 00:11:44,580 --> 00:11:48,118 meaning one thing-- if you want quality in Wikidata, you have to work on it. 206 00:11:48,118 --> 00:11:50,893 So the best items that we have are actually the items 207 00:11:50,893 --> 00:11:52,801 that we're really paying attention to. 208 00:11:52,801 --> 00:11:55,743 If you look at the classes of lower quality, the other time series, 209 00:11:55,743 --> 00:11:59,173 you will see that we have items that were revised in 2012 210 00:11:59,173 --> 00:12:00,683 for the last time. 211 00:12:01,156 --> 00:12:03,348 So it tells a story of responsibilities-- 212 00:12:03,348 --> 00:12:07,694 how much work we put into the items [that actually work]. 213 00:12:07,694 --> 00:12:09,421 That is what brings quality. 214 00:12:13,043 --> 00:12:17,205 While we do these things, we also try to make as much use 215 00:12:17,205 --> 00:12:20,163 of the byproducts of these procedures as possible.
216 00:12:20,569 --> 00:12:23,308 So, for example, in order to develop the project 217 00:12:23,308 --> 00:12:25,425 called Wikidata Languages Landscape-- 218 00:12:25,425 --> 00:12:28,375 I think I mentioned it yesterday during the Birthday Presentation-- 219 00:12:30,545 --> 00:12:34,444 I had to perform a quite thorough study 220 00:12:34,444 --> 00:12:37,725 of the sub-ontology of languages in Wikidata. 221 00:12:37,725 --> 00:12:41,712 And you know what? There are problems in that ontology. 222 00:12:45,502 --> 00:12:48,467 I did not want to miss giving you an opportunity. 223 00:12:49,301 --> 00:12:52,247 So this is the dashboard actually about the languages, 224 00:12:52,247 --> 00:12:54,791 called the Wikidata Languages Landscape. 225 00:12:54,791 --> 00:12:58,594 Once again, you have all the URLs in the presentation. 226 00:12:59,694 --> 00:13:03,720 So, for example, you want to take a look at a particular language. 227 00:13:03,720 --> 00:13:08,688 Say, English, okay. 228 00:13:09,448 --> 00:13:14,636 So the dashboard will generate its local ontological context 229 00:13:14,636 --> 00:13:19,006 and mark all the relations of the form instance of, 230 00:13:19,006 --> 00:13:21,276 subclass of, and part of. 231 00:13:21,716 --> 00:13:23,845 Why did I choose to do this? 232 00:13:23,845 --> 00:13:25,991 To help you fix the language ontology. 233 00:13:25,991 --> 00:13:31,586 Why? Because you will find many languages, for example, my native language, 234 00:13:31,586 --> 00:13:33,618 which used to be Serbo-Croatian, 235 00:13:33,618 --> 00:13:38,553 and for silly reasons now we have Serbian and Croatian-- it's a political thing. 236 00:13:38,553 --> 00:13:40,554 I don't want to go into it, but you realize 237 00:13:40,554 --> 00:13:43,255 that Serbian is now, for example, at the same time 238 00:13:43,255 --> 00:13:46,637 a subclass of Serbo-Croatian and a part of Serbo-Croatian.
239 00:13:46,955 --> 00:13:48,395 The same still holds for Croatian-- 240 00:13:48,395 --> 00:13:50,860 Croatian is also a part and a subclass of Serbo-Croatian. 241 00:13:50,860 --> 00:13:52,496 So Serbo-Croatian used to be a language. 242 00:13:52,496 --> 00:13:54,957 Now we don't have normative support for it. 243 00:13:54,957 --> 00:13:57,086 But still, it's not a language class, it's a language. 244 00:13:57,086 --> 00:14:00,528 Can it be a part of it, or can it be a subclass of it? 245 00:14:00,528 --> 00:14:03,297 So it's a confusion of [mereological] and set-theoretic relations, 246 00:14:03,297 --> 00:14:05,803 and I think it should be fixed somehow. 247 00:14:06,656 --> 00:14:09,245 In other words, don't say 248 00:14:10,129 --> 00:14:14,993 that you don't have the tool to fix the ontology. 249 00:14:14,993 --> 00:14:17,859 Just find some time and go play with it. 250 00:14:19,257 --> 00:14:22,431 Speaking of languages, as I mentioned, I just want to show you this project. 251 00:14:22,990 --> 00:14:27,162 Many people liked this thing that I published online on Twitter. 252 00:14:27,162 --> 00:14:28,567 That's one of the things, you know. 253 00:14:28,567 --> 00:14:32,565 Data science is usually sold via visualizations. 254 00:14:32,565 --> 00:14:34,202 People like to visualize things, 255 00:14:34,202 --> 00:14:36,843 and, of course, we do pay attention to that. 256 00:14:37,763 --> 00:14:40,385 Aesthetics is a part of communication. 257 00:14:41,772 --> 00:14:44,051 It's not the most important thing for a scientific finding 258 00:14:44,051 --> 00:14:45,348 to show you something beautiful, 259 00:14:45,348 --> 00:14:48,621 but if you can show something beautiful, you shouldn't miss the opportunity. 260 00:14:48,621 --> 00:14:51,876 So here we did with the languages in Wikidata 261 00:14:51,876 --> 00:14:53,987 the same thing that we do with items and projects 262 00:14:53,987 --> 00:14:56,161 in the Wikidata Concepts Monitor.
263 00:14:56,161 --> 00:15:02,898 We actually group languages by similarity, and the similarity was defined 264 00:15:02,898 --> 00:15:05,800 as how much they overlap across the items. 265 00:15:06,452 --> 00:15:10,531 So if I can talk about the same things in English 266 00:15:10,531 --> 00:15:13,973 and in some West African language, for example, 267 00:15:13,973 --> 00:15:15,807 then those two things, those two languages, 268 00:15:15,807 --> 00:15:19,209 are similar in terms of their reference sets-- 269 00:15:19,209 --> 00:15:21,302 what they can refer to. 270 00:15:22,330 --> 00:15:24,849 Each language here 271 00:15:24,849 --> 00:15:27,368 points to its closest neighbor, nearest neighbor-- 272 00:15:27,368 --> 00:15:29,840 to the one which is most similar to it. 273 00:15:29,840 --> 00:15:35,595 And, of course, you can see these groupings actually occur naturally. 274 00:15:35,595 --> 00:15:37,549 So it's not a fully-connected graph. 275 00:15:37,549 --> 00:15:40,838 Clustering this thing was nothing like [there is]. 276 00:15:41,471 --> 00:15:44,418 Also, what you can learn from the Languages Landscape project 277 00:15:44,418 --> 00:15:49,294 is when you combine our data with external resources. 278 00:15:49,294 --> 00:15:51,369 So this is also very informative for us, 279 00:15:51,369 --> 00:15:54,240 for the whole, I would say, Wikimedia community. 280 00:15:54,563 --> 00:15:56,636 We have the UNESCO language status, 281 00:15:56,636 --> 00:15:59,755 which Wikidata actually gets from UNESCO, 282 00:15:59,755 --> 00:16:01,907 its websites and databases, 283 00:16:01,907 --> 00:16:05,198 and the Ethnologue language status on the vertical axis. 284 00:16:05,198 --> 00:16:08,751 We have the Concepts Monitor reuse statistic.
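[Editor's note: the nearest-neighbor construction described above can be sketched as follows. Each language is represented by the set of items it can refer to, similarity is set overlap (Jaccard), and each language points to its most similar neighbor. The language codes and item sets below are invented for illustration.]

```python
# Hypothetical item sets per language (invented for illustration).
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "fr": {"Q1", "Q2", "Q3"},
    "sw": {"Q3", "Q9"},
}

def jaccard(a, b):
    """Overlap of two languages' item sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def nearest_neighbor(lang):
    """The language whose item set overlaps most with lang's."""
    others = (other for other in labels if other != lang)
    return max(others, key=lambda other: jaccard(labels[lang], labels[other]))
```

Drawing an arrow from each language to its nearest neighbor yields a sparse graph in which groupings emerge naturally, rather than a fully connected one.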
285 00:16:08,945 --> 00:16:12,973 So we look at all the items that have a label in a particular language, 286 00:16:12,973 --> 00:16:15,949 and then we look at how popular those items are, 287 00:16:15,949 --> 00:16:18,010 how many times people used them. 288 00:16:19,310 --> 00:16:25,059 Of course, those safe national languages, languages that are not endangered-- 289 00:16:25,886 --> 00:16:28,165 they have a slight advantage. 290 00:16:28,165 --> 00:16:30,624 But the situation is not really that bad. 291 00:16:30,624 --> 00:16:33,660 Say, for example, take a look at the Ethnologue category 292 00:16:33,660 --> 00:16:37,206 of "Second language only"-- that's the rightmost one. 293 00:16:37,206 --> 00:16:41,798 You will see three languages there being reused 294 00:16:41,798 --> 00:16:44,445 in a way comparable to the most favorable, 295 00:16:44,445 --> 00:16:47,456 not endangered category of national languages. 296 00:16:47,756 --> 00:16:49,414 It's not like the gender bias. 297 00:16:49,414 --> 00:16:53,784 Wikipedia seems to be really reflecting the gender bias that exists in the world. 298 00:16:53,784 --> 00:16:58,130 Then we have nice initiatives, like the women who are trying to fix this thing. 299 00:16:58,130 --> 00:17:02,210 With languages, well, of course, some languages are a little bit favored, 300 00:17:02,210 --> 00:17:04,276 but it's not that bad, 301 00:17:04,276 --> 00:17:07,872 and that finding really brought a lot of joy to us. 302 00:17:08,739 --> 00:17:12,732 Now, speaking of external resources, every time that I look at this graph, 303 00:17:12,732 --> 00:17:16,482 I say to myself, "We know who is the queen of the databases." 304 00:17:18,122 --> 00:17:22,294 You know the external identifiers property in Wikidata. 305 00:17:23,020 --> 00:17:30,171 So here we take all external identifiers that were present in the August 306 00:17:31,504 --> 00:17:34,823 JSON Dump of Wikidata, which we processed.
307 00:17:34,823 --> 00:17:38,079 Then, once again, we did some statistics on it 308 00:17:38,079 --> 00:17:45,125 and grouped all the external identifiers by how much they overlap across the items. 309 00:17:51,228 --> 00:17:52,944 Aha, here we are. 310 00:17:55,021 --> 00:17:58,363 That visualization, except for maybe being aesthetically pleasing, 311 00:17:58,363 --> 00:17:59,691 is not that useful, 312 00:17:59,691 --> 00:18:03,007 but you have an interactive version developed in the dashboard. 313 00:18:04,231 --> 00:18:07,857 If you go and inspect the interactive version, 314 00:18:07,857 --> 00:18:10,984 you can learn, for example, one obvious fact: 315 00:18:10,984 --> 00:18:13,615 that they really follow some natural semantics. 316 00:18:13,615 --> 00:18:15,706 They are grouped in intuitive ways. 317 00:18:16,050 --> 00:18:21,745 We should be perfectly expecting them to give some feedback on the quality 318 00:18:21,745 --> 00:18:24,453 of the organization of the data in Wikidata, 319 00:18:24,453 --> 00:18:26,797 telling us that the situation is really not that bad. 320 00:18:27,307 --> 00:18:30,129 What I am saying is that all the external identifiers 321 00:18:30,129 --> 00:18:32,230 from the databases on sports, for example, 322 00:18:32,230 --> 00:18:34,685 you will find to be in one cluster. 323 00:18:34,685 --> 00:18:38,681 And then, for example, you will even be able to figure out which sport. 324 00:18:39,198 --> 00:18:44,277 Databases on tennis are here, databases on football are here, etc. 325 00:18:48,175 --> 00:18:50,670 Yes, these external resources 326 00:18:50,670 --> 00:18:53,684 are things that we really try to pay a lot of attention to. 327 00:18:54,653 --> 00:18:59,781 All right, as I said, the final thing is communication and aesthetics. 328 00:18:59,781 --> 00:19:01,265 We do pay attention to it. 329 00:19:01,265 --> 00:19:04,183 So, for example, this thing-- many people liked it.
330 00:19:04,183 --> 00:19:07,184 It's a little bit rescaled for aesthetics-- 331 00:19:07,184 --> 00:19:11,808 the same network of external identifiers that you were able to see. 332 00:19:11,808 --> 00:19:16,318 But you don't get these results for free, of course. 333 00:19:16,707 --> 00:19:20,163 For example, this one was obtained by running a clustering algorithm 334 00:19:20,163 --> 00:19:23,946 on Jaccard distances-- technical terms, I'm not going into it. 335 00:19:23,946 --> 00:19:29,093 And first, we had to start from a matrix actually derived from 408 languages 336 00:19:29,093 --> 00:19:31,852 that are reused across the Wikimedia projects. 337 00:19:31,852 --> 00:19:35,268 Wikidata knows about many languages, not only 400. 338 00:19:35,268 --> 00:19:39,704 But only 400 of them actually label items that get reused-- 339 00:19:39,704 --> 00:19:43,880 a 408-languages-by-60-million-items contingency matrix-- that's a lot of computation. 340 00:19:44,591 --> 00:19:47,112 Machine learning and statistics add an additional layer of complication, 341 00:19:47,112 --> 00:19:51,382 and they are, of course, the most beautiful part of your work as a data scientist, 342 00:19:51,382 --> 00:19:55,216 but they don't get to occupy 343 00:19:55,216 --> 00:19:58,266 more than, say, 10% or 15% of your time, 344 00:19:58,266 --> 00:20:00,932 because everything else goes to data engineering 345 00:20:00,932 --> 00:20:03,083 and synchronization of different systems. 346 00:20:03,083 --> 00:20:04,936 With the machine learning and statistics things, 347 00:20:04,936 --> 00:20:07,249 we use plenty of different algorithms. 348 00:20:07,249 --> 00:20:12,845 I don't think now is the time to go and talk about the details of these things. 349 00:20:12,845 --> 00:20:14,916 I will have plenty of opportunities to discuss them, 350 00:20:14,916 --> 00:20:18,466 but it's typically a highly technical topic, 351 00:20:18,466 --> 00:20:21,369 better suited for a scientific conference.
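[Editor's note: the clustering-on-Jaccard-distances step mentioned above could look roughly like this. Start from the language-by-item table, compute pairwise Jaccard distances, then group languages whose distance falls under a threshold, i.e. connected components of the thresholded distance graph, a simple form of single-linkage clustering. All names and numbers are invented toy data; the real computation runs over the 60-million-item contingency matrix.]

```python
# Toy language-by-item table (invented); each language maps to the set
# of items it labels.
table = {
    "lang_a": {"Q1", "Q2", "Q3"},
    "lang_b": {"Q1", "Q2"},
    "lang_c": {"Q8", "Q9"},
}

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def cluster(threshold):
    """Group languages closer than `threshold` into the same cluster
    (connected components of the thresholded distance graph)."""
    langs = list(table)
    parent = {lang: lang for lang in langs}

    def find(lang):                         # follow parent links to the root
        while parent[lang] != lang:
            lang = parent[lang]
        return lang

    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            if jaccard_distance(table[a], table[b]) < threshold:
                parent[find(a)] = find(b)   # union the two components

    groups = {}
    for lang in langs:
        groups.setdefault(find(lang), set()).add(lang)
    return list(groups.values())
```

Production systems would typically use a library routine for this (for example, hierarchical clustering over a precomputed distance matrix), but the idea is the same: languages with heavily overlapping item sets end up in the same group.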
352 00:20:22,999 --> 00:20:26,509 Here are all the layers of complexity. 353 00:20:26,509 --> 00:20:30,206 In the end, we have to add deployment and dashboards, 354 00:20:30,206 --> 00:20:33,445 because they won't build themselves. 355 00:20:33,831 --> 00:20:36,854 And all these things, all these phases 356 00:20:36,854 --> 00:20:40,581 of development of an analytics or data science project, 357 00:20:41,188 --> 00:20:46,560 need to fit together in order to be able to derive empirical results 358 00:20:46,565 --> 00:20:49,392 on a system of Wikidata's complexity. 359 00:20:49,848 --> 00:20:53,720 The true picture is that you cannot really just run through these cycles. 360 00:20:54,417 --> 00:20:56,884 All the phases of the process are interdependent, 361 00:20:56,884 --> 00:21:00,012 because you really have to plan very early on 362 00:21:00,012 --> 00:21:04,115 what visualizations you are going to use, what technology you will use 363 00:21:04,115 --> 00:21:06,654 to render those visualizations in the end, 364 00:21:06,654 --> 00:21:08,888 what machine learning algorithms you will be using, 365 00:21:08,888 --> 00:21:13,534 because all of them have their own taste about what data structures they like. 366 00:21:13,534 --> 00:21:16,695 And then you hit the constraints of infrastructure-- similar things. 367 00:21:16,695 --> 00:21:18,827 I am not complaining, I'm really enjoying this. 368 00:21:18,827 --> 00:21:22,400 This is the most beautiful playground I've ever seen in my life. 369 00:21:22,400 --> 00:21:25,381 Thanks to you and the people who built Wikidata. 370 00:21:25,381 --> 00:21:26,388 Thank you very much! 371 00:21:26,388 --> 00:21:27,729 That would be it. 372 00:21:28,119 --> 00:21:29,991 (moderator) Thank you, Goran. 373 00:21:29,991 --> 00:21:32,290 (applause) 374 00:21:32,825 --> 00:21:35,261 (moderator) You have time for a couple of questions. 375 00:21:44,322 --> 00:21:47,663 (man) Well, you did a lot of research, I can see that.
376 00:21:47,663 --> 00:21:48,676 (Goran) Sorry? 377 00:21:48,676 --> 00:21:51,642 (man) You did a lot of research, I can see that. 378 00:21:51,642 --> 00:21:57,244 I'm wondering if there is anything that you discovered during the research 379 00:21:57,244 --> 00:21:58,853 that surprised you. 380 00:21:59,327 --> 00:22:01,356 Thank you for that question. 381 00:22:01,356 --> 00:22:07,663 Actually, I wanted to focus on that in this talk 382 00:22:07,663 --> 00:22:11,244 until I realized that we simply wouldn't have enough time 383 00:22:11,244 --> 00:22:13,816 to explain everything. 384 00:22:15,407 --> 00:22:19,247 Most of the time, when you're analyzing big datasets 385 00:22:19,247 --> 00:22:22,179 structured in the way Wikidata is-- 386 00:22:22,179 --> 00:22:26,345 even when you're going into the wild, meaning studying the reuse of data 387 00:22:26,345 --> 00:22:27,442 across Wikipedia, 388 00:22:27,442 --> 00:22:30,622 where people can actually do whatever they like with those items-- 389 00:22:31,662 --> 00:22:33,917 you have a lot of data, a lot of information. 390 00:22:33,917 --> 00:22:35,603 Of course, you see structure. 391 00:22:35,603 --> 00:22:40,209 Most of the time, 90% of the time, you see things that are expected. 392 00:22:41,195 --> 00:22:46,678 Things like which projects make the most use of Wikidata. 393 00:22:46,678 --> 00:22:49,891 And you can almost-- you don't have to do too much statistics, 394 00:22:50,721 --> 00:22:54,897 you can rely on the expectations of the whole world and see what's happening. 395 00:22:56,694 --> 00:22:58,643 Many things were surprising, 396 00:22:58,643 --> 00:23:03,308 and those things that were surprising are really the most informative things.
397 00:23:05,372 --> 00:23:09,069 When one communicates the findings from analytics and such systems, 398 00:23:09,486 --> 00:23:14,200 it's important-- people typically expect either "wow" visualizations-- 399 00:23:14,200 --> 00:23:18,316 and we have tons of data, so we can always deliver "wow" visualizations-- 400 00:23:18,912 --> 00:23:21,563 or they expect to learn things like, 401 00:23:21,563 --> 00:23:24,204 "Our project is doing better than this project" 402 00:23:24,204 --> 00:23:26,239 or "Yes, we are rocking!" etc., 403 00:23:26,239 --> 00:23:30,148 while the goal of the whole game should actually be to learn 404 00:23:30,148 --> 00:23:34,128 what is wrong, what is not working, what could be done better. 405 00:23:34,938 --> 00:23:36,451 Many things were surprising. 406 00:23:38,341 --> 00:23:42,061 For example, the distribution of item usage across languages-- 407 00:23:42,061 --> 00:23:43,850 that was surprising to me. 408 00:23:43,850 --> 00:23:45,014 This thing. 409 00:23:47,098 --> 00:23:51,348 So I did not really expect that the situation with languages 410 00:23:51,348 --> 00:23:54,352 would be this good, I would say. 411 00:23:54,830 --> 00:24:01,332 My expectation was that languages that have less economic support, 412 00:24:01,332 --> 00:24:03,651 normative support, even political support-- 413 00:24:03,651 --> 00:24:06,601 that's a factor when you talk about languages-- 414 00:24:06,601 --> 00:24:11,521 would not be so widely reused across the Wikimedia universe. 415 00:24:11,521 --> 00:24:15,540 In fact, it turns out that the differences-- we can see them, 416 00:24:15,540 --> 00:24:18,977 but it's far away from the gender bias, which is really bad, I think; 417 00:24:18,977 --> 00:24:20,707 we need to work there. 418 00:24:20,707 --> 00:24:22,456 That was surprising, for example. 419 00:24:22,456 --> 00:24:25,725 It was a positive surprise, to put it that way.
420 00:24:25,725 --> 00:24:28,271 Then from time to time, we discover projects 421 00:24:28,821 --> 00:24:34,775 that actually do a great job of reusing the Wikidata content in Wikimedia. 422 00:24:34,775 --> 00:24:37,895 We're totally surprised to learn that such a project can do it. 423 00:24:38,612 --> 00:24:42,554 Then you start thinking, and you figure out there is a community of people 424 00:24:42,554 --> 00:24:44,000 actually doing it. 425 00:24:44,468 --> 00:24:48,735 And it's a strange feeling because I get to see all these things through machines, 426 00:24:48,735 --> 00:24:51,971 through databases, through visualizations and tables, 427 00:24:51,971 --> 00:24:58,165 and it's always that strange feeling when I realize this result was produced 428 00:24:58,165 --> 00:25:03,094 by a group of people who don't even know that someone is looking at their result now. 429 00:25:06,101 --> 00:25:07,832 (moderator) Another question? 430 00:25:13,657 --> 00:25:14,703 Thank you. 431 00:25:14,703 --> 00:25:16,237 Is that it? Thank you very much! 432 00:25:16,237 --> 00:25:17,734 (moderator) Thank you. 433 00:25:17,734 --> 00:25:19,890 (applause)