WEBVTT 00:00:06.343 --> 00:00:09.678 Yes, Wikidata Statistics: What, Where, and How? 00:00:09.678 --> 00:00:13.005 This has been an attempt of an overview for analytical systems 00:00:13.005 --> 00:00:16.173 focusing on what was developed with the Wikimedia Deutschland 00:00:16.173 --> 00:00:18.155 in the previous almost three years 00:00:18.155 --> 00:00:22.346 since I started doing data science for Wikidata and the dictionary. 00:00:22.346 --> 00:00:28.345 So, during this presentation, I will try to switch from the presentation 00:00:28.346 --> 00:00:32.092 to the dashboards and show you the end data products. 00:00:32.995 --> 00:00:35.029 However, if that causes any trouble, 00:00:35.029 --> 00:00:39.070 so this is actually the URL of the analytics portal. 00:00:39.070 --> 00:00:41.272 So everything that I will be presenting here, 00:00:41.272 --> 00:00:44.105 whatever you can see on the slides, you can also check out later 00:00:44.105 --> 00:00:47.285 from the presentation, go and play with the real thing. 00:00:47.285 --> 00:00:51.101 Otherwise, you will see only the screenshots here from the slides. 00:00:51.101 --> 00:00:58.275 So the goal-- well, the talk will be a failed attempt to communicate 00:00:58.275 --> 00:01:01.502 an almost endlessly technically complicated field 00:01:02.567 --> 00:01:06.843 in terms that can actually motivate people to start making use 00:01:06.843 --> 00:01:08.338 of this analytical product 00:01:08.338 --> 00:01:11.010 in which development we are really putting a lot of effort. 00:01:11.010 --> 00:01:13.631 So, as I said, I will try to provide an overview 00:01:13.631 --> 00:01:15.679 of the Wikidata Statistics and Analytics systems. 00:01:15.679 --> 00:01:20.636 So I will try to exemplify the usage of some of them, not all. 
00:01:20.636 --> 00:01:23.362 And also I will try to go just a little bit under the hood 00:01:23.362 --> 00:01:27.453 to try to illustrate how it is done, what is done here, 00:01:27.453 --> 00:01:31.144 because I thought it might be interesting to the audience. 00:01:31.818 --> 00:01:33.534 Okay, so say... 00:01:34.804 --> 00:01:38.538 In analytics and data science, you always start with formulating 00:01:38.538 --> 00:01:41.709 as clearly as possible your goals and motivations. 00:01:41.709 --> 00:01:47.080 Otherwise, you enter into endless cycles of developing analytical tools 00:01:47.080 --> 00:01:49.733 and data science products that actually do something, 00:01:49.733 --> 00:01:52.835 but nobody really understands what they're being built for. 00:01:52.835 --> 00:01:57.669 In 2017, in Wikimedia Deutschland, a request, a demand was formulated-- 00:01:57.925 --> 00:01:59.740 we said that we needed an analytical system 00:01:59.740 --> 00:02:01.936 that will give an insight into the ways 00:02:01.936 --> 00:02:05.865 that Wikidata items are reused across the Wikimedia projects, 00:02:05.865 --> 00:02:09.016 meaning across the Wikipedia universe-- all the encyclopedias, 00:02:09.016 --> 00:02:11.826 and then Wikivoyage, Wikibooks, WikiCite, etc.-- 00:02:11.826 --> 00:02:15.610 all the websites, approximately 800 that we are actually managing. 00:02:15.610 --> 00:02:19.553 So just to explain the differences between the data. 00:02:19.553 --> 00:02:23.794 On the left, for example, you see a small or very small subset of Wikidata. 00:02:23.794 --> 00:02:28.114 These are the languages, some of the Slavic, I think, languages, 00:02:28.114 --> 00:02:30.383 and in Wikidata they are connected, 00:02:30.383 --> 00:02:34.194 but they are properties and belong to different classes, etc. 00:02:34.194 --> 00:02:36.785 But we were looking for a different kind of mapping. 
00:02:36.785 --> 00:02:41.085 So what you see here, on the right side, is a set of items 00:02:41.085 --> 00:02:44.823 all belonging to the class of architectural structures, I would say. 00:02:44.823 --> 00:02:48.496 And this here is the result of their empirical embeddings. 00:02:48.496 --> 00:02:50.511 So the items related here-- 00:02:50.518 --> 00:02:55.952 they are linked by their similarity of usage across Wikipedias, for example. 00:02:55.952 --> 00:02:57.842 So what does it mean-- the similarity? 00:02:58.632 --> 00:03:03.068 To be similar in terms of how an item is used across the Wikipedias. 00:03:03.068 --> 00:03:06.943 So imagine you take an array of numbers, 00:03:07.353 --> 00:03:11.107 and each element of an array is one project-- it's English Wikipedia, 00:03:11.558 --> 00:03:17.417 it is French Wikivoyage, it is Italian Wikipedia, etc. 00:03:17.901 --> 00:03:20.495 And then, you count how many times 00:03:20.495 --> 00:03:23.085 a particular item has been used in that project. 00:03:24.112 --> 00:03:27.631 So you use an array of numbers to describe the item that way. 00:03:27.631 --> 00:03:29.768 It's a little bit more complicated in practice. 00:03:31.299 --> 00:03:36.074 And then, you can describe all items in Wikidata that were ever used 00:03:36.074 --> 00:03:39.358 across the websites at all by such arrays of numbers, 00:03:39.358 --> 00:03:41.320 called embeddings, technically, right? 00:03:41.791 --> 00:03:45.513 From those data, using different distance metrics, 00:03:45.513 --> 00:03:48.893 applying machine learning methods, doing dimensionality reduction, 00:03:48.893 --> 00:03:50.382 and similar things, 00:03:50.382 --> 00:03:53.093 you can actually figure out what is the similarity pattern. 00:03:53.093 --> 00:03:55.622 And here items are connected, 00:03:55.622 --> 00:04:00.501 by how similar are their patterns of usage across different Wikipedias. 
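The idea of describing each item by an array of per-project usage counts can be sketched in a few lines of Python. This is only an illustrative sketch: the project names, item names, and counts are made up, and cosine similarity is an assumed choice, since the talk only says that "different distance metrics" were used.

```python
import math

# Hypothetical projects and made-up usage counts, for illustration only.
PROJECTS = ["enwiki", "frwikivoyage", "itwiki"]

# One array per item: how many times the item is used in each project.
usage = {
    "Q_bridge": [120, 3, 40],
    "Q_tower": [100, 5, 35],
    "Q_poet": [2, 0, 90],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two usage-count arrays."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item that is never used is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(item, vectors):
    """Return the other item whose usage pattern is closest to `item`."""
    others = (name for name in vectors if name != item)
    return max(others, key=lambda name: cosine_similarity(vectors[item], vectors[name]))
```

With these toy counts, `most_similar("Q_bridge", usage)` picks `"Q_tower"`, because both are used mostly in the same projects in similar proportions. The real system works at a much larger scale and, as the speaker says, is more complicated in practice.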
00:04:01.726 --> 00:04:04.551 Once again, every visualization, every result that I show-- 00:04:04.551 --> 00:04:08.278 there is a link on the presentation, so you can go and check for yourself. 00:04:08.278 --> 00:04:10.578 You can play with this thing interactively. 00:04:10.578 --> 00:04:15.826 Similarly, we will be able to derive a graph like this one. 00:04:15.826 --> 00:04:20.011 This one does not connect the Wikidata items, it connects projects. 00:04:20.011 --> 00:04:23.069 But looking at how similar they are 00:04:23.069 --> 00:04:26.779 in terms of how they use different Wikidata items. 00:04:30.235 --> 00:04:31.733 To be as precise as possible, 00:04:32.468 --> 00:04:35.369 the data that we use to do this-- they do not live in Wikidata, 00:04:35.369 --> 00:04:36.818 they are not a part of the Wikidata, 00:04:36.818 --> 00:04:38.823 data does not at all [locate] here. 00:04:38.823 --> 00:04:41.917 We have the Wikidata, we have formulated our motivational goals, 00:04:41.917 --> 00:04:45.773 and immediately we started talking about the data model and the structures. 00:04:45.773 --> 00:04:49.760 What structures and data models you need to answer the questions 00:04:49.760 --> 00:04:52.412 that you have initially proposed. 00:04:52.941 --> 00:04:59.064 So there is Wikibase and the client-side tracking mechanism, 00:04:59.064 --> 00:05:01.884 which is installed in all those wikis, 00:05:01.884 --> 00:05:07.001 that actually tracks the Wikidata usage on a project, on Wikipedia, for example. 00:05:07.001 --> 00:05:10.700 So every time an item is used in [meaningful works] 00:05:10.700 --> 00:05:14.743 or in a different way-- there is a row in a huge SQL table 00:05:14.743 --> 00:05:18.124 that enters and checks the usage of that number. 
00:05:18.124 --> 00:05:22.326 Now, immediately, we had to face a data-engineering problem, of course, 00:05:22.326 --> 00:05:26.434 because we are talking about hundreds of huge SQL tables, 00:05:26.434 --> 00:05:29.301 and we had to do machine learning and statistics 00:05:29.301 --> 00:05:32.746 across all the data together, not separately, 00:05:32.746 --> 00:05:37.283 in order to be able to produce structures, like this one or like this one. 00:05:37.578 --> 00:05:41.332 So in cooperation with the Analytics Engineering Team of the Foundation, 00:05:41.332 --> 00:05:44.459 we started transferring those data from Wikibase 00:05:44.459 --> 00:05:49.181 to the Wikimedia Foundation Data Lake which is actually a big data storage. 00:05:49.181 --> 00:05:52.753 The data do not live there in a relational database. 00:05:52.753 --> 00:05:54.060 They live in something similar-- 00:05:54.060 --> 00:05:56.546 it's Hadoop, and Hive tables are there, etc., 00:05:56.546 --> 00:05:58.552 but it's a huge, huge engineering procedure. 00:05:58.552 --> 00:06:03.405 So not all data in analytics, especially in big games like this 00:06:03.405 --> 00:06:06.001 that we have to play with Wikidata and Wikipedia, 00:06:06.001 --> 00:06:07.667 are immediately available to you. 00:06:07.667 --> 00:06:09.171 One source of complication 00:06:09.171 --> 00:06:13.459 is before you actually start solving the problem in a scientific way, 00:06:13.459 --> 00:06:16.847 to put it that way, is to engineer the datasets to prepare the structures 00:06:16.847 --> 00:06:20.805 that you actually need for doing machine learning, statistics 00:06:20.805 --> 00:06:22.588 and similar things. 00:06:23.464 --> 00:06:26.921 This is a full design of the system called the Wikidata Concepts Monitor 00:06:26.921 --> 00:06:28.380 that tracks their reuse statistics. 00:06:28.380 --> 00:06:30.844 I will not go into details here, of course. 
00:06:32.394 --> 00:06:35.764 The obvious complication is that-- as I wrote it up-- 00:06:35.764 --> 00:06:38.432 many systems need to work together. 00:06:38.432 --> 00:06:41.248 You have to synchronize many different sources of data, 00:06:41.248 --> 00:06:42.846 many different infrastructures 00:06:42.846 --> 00:06:47.994 just in order to make it happen, even before starting thinking 00:06:47.994 --> 00:06:52.247 in terms of methodologies, science, statistics, and similar. 00:06:53.955 --> 00:06:57.930 As I said, we started with our goals and motivation, 00:06:57.930 --> 00:07:01.629 then, typically, the data model and the structures that you need-- 00:07:01.629 --> 00:07:04.881 they correspond to those goals and motivations that should always be-- 00:07:04.881 --> 00:07:08.250 your first step in developing an analytics project. 00:07:08.250 --> 00:07:10.857 Then you figure out it's really too complicated, 00:07:10.857 --> 00:07:12.846 it cannot be done by one person-- 00:07:12.846 --> 00:07:15.077 It cannot be done on one computer, to put it that way. 00:07:15.077 --> 00:07:17.771 So we needed to work with the analytics infrastructure 00:07:17.771 --> 00:07:20.403 and then add an additional layer of complication-- 00:07:20.403 --> 00:07:23.750 that's communication with external teams and cooperators 00:07:23.750 --> 00:07:28.366 because, obviously, such a system cannot be managed easily by one person. 00:07:28.366 --> 00:07:31.358 Actually, I think it would be pretty impossible. 00:07:31.720 --> 00:07:33.587 So, as I mentioned, there is this Data Lake, 00:07:33.587 --> 00:07:38.091 our big data storage in Hadoop, 00:07:38.091 --> 00:07:41.880 and the team of awesome data engineers in the Foundation 00:07:41.880 --> 00:07:43.987 called the Analytics Engineering Team. 00:07:43.987 --> 00:07:47.660 To data science, data engineers are people who actually watch your back 00:07:47.660 --> 00:07:49.426 while you're trying to do your things. 
00:07:49.426 --> 00:07:51.766 If you cannot rely on a good engineering team, 00:07:51.766 --> 00:07:54.164 there's not much you will be able to do by yourself. 00:07:55.636 --> 00:08:00.357 This infrastructure is actually maintained by the Foundation, 00:08:00.357 --> 00:08:04.127 so you enter through several statistical servers-- 00:08:04.441 --> 00:08:06.152 these blue boxes down there. 00:08:06.152 --> 00:08:09.274 You can communicate with the relational database systems. 00:08:09.274 --> 00:08:10.531 We used the MariaDB. 00:08:10.531 --> 00:08:12.274 You can communicate with the Data Lake. 00:08:12.274 --> 00:08:17.536 And, of course, for your computations, you go to the so-called Analytics Cluster 00:08:17.536 --> 00:08:20.712 where you do things like Apache Spark that actually-- 00:08:20.712 --> 00:08:25.138 it's the only really efficient way to process the data 00:08:25.138 --> 00:08:27.313 that we need to process. 00:08:27.313 --> 00:08:32.219 When I started doing this back in 2017, I remember when I saw 00:08:32.219 --> 00:08:35.421 only the schema of the infrastructure for the first time. 00:08:35.421 --> 00:08:38.504 If I could not rely on my colleague Adam Shorland-- 00:08:38.504 --> 00:08:40.471 who is still with us in Wikimedia Deutschland-- 00:08:40.471 --> 00:08:44.008 I would never make it, I wouldn't even know how to navigate the structure. 00:08:46.070 --> 00:08:49.085 As you start building a project to do analytics for Wikidata, 00:08:49.085 --> 00:08:52.391 you see how increasingly it gets more and more complicated 00:08:52.391 --> 00:08:55.046 because you have to deal with synchronizing different systems, 00:08:55.046 --> 00:08:57.908 different teams, infrastructures, different datasets. 00:08:58.419 --> 00:08:59.968 However, it pays off, 00:09:00.346 --> 00:09:02.948 that synchronization and all the pain. 
00:09:03.282 --> 00:09:07.632 It can get really nasty sometimes, and the most recent example 00:09:07.632 --> 00:09:10.777 is the production of the Data Quality Report for Wikidata. 00:09:12.128 --> 00:09:16.926 That's an initial assessment of the quality of work we had in Wikidata. 00:09:16.926 --> 00:09:18.278 In order to produce it, 00:09:18.278 --> 00:09:22.211 we had to rely on the Quality Predictions from the ORES system, 00:09:22.211 --> 00:09:25.283 the machine learning system, developed by Aaron Halfaker, 00:09:25.283 --> 00:09:27.502 and the scoring platform 00:09:28.383 --> 00:09:32.317 to combine that with the Wikidata Concepts Monitor reuse statistics. 00:09:32.806 --> 00:09:36.691 The revision history, the full revision history of all Wikipedias 00:09:36.691 --> 00:09:40.009 is available in one single huge big data table 00:09:40.009 --> 00:09:41.358 called the MediaWiki History. 00:09:41.358 --> 00:09:42.982 That lives in the Data Lake. 00:09:42.982 --> 00:09:46.672 And also we had to process the JSON Dump in HDFS. 00:09:46.672 --> 00:09:48.804 So we're talking about four massive structures, 00:09:48.804 --> 00:09:51.946 like two machine learning systems with their complexities, 00:09:51.946 --> 00:09:53.893 and two huge data sets. 00:09:53.893 --> 00:09:58.231 Everything needs to work in sync in order to be able to produce the Quality Report 00:09:58.231 --> 00:10:00.750 that we're presenting this year at WikidataCon. 00:10:00.750 --> 00:10:04.376 But if we didn't do, if we [listed] or something like that, 00:10:04.759 --> 00:10:07.765 we couldn't say, we couldn't show beautiful things like this. 00:10:07.765 --> 00:10:12.130 So on the horizontal axis, you have the ORES Quality Prediction score. 00:10:12.130 --> 00:10:13.490 We use five categories. 00:10:13.490 --> 00:10:17.436 And you can inform yourself-- just google "Wikidata data quality categories." 00:10:17.436 --> 00:10:18.798 You will find the description. 
00:10:18.798 --> 00:10:22.271 The A-class to the left-- the best items that we have, 00:10:22.271 --> 00:10:24.969 and at the same time-- that's the green box-- 00:10:24.969 --> 00:10:27.812 they are the most reused items in Wikipedia. 00:10:27.812 --> 00:10:30.450 So it's not like, as Lydia explained yesterday, 00:10:30.450 --> 00:10:32.742 it's not like all our items are of the highest quality. 00:10:32.742 --> 00:10:38.030 To the contrary, we have many items that are not of that high quality, 00:10:38.030 --> 00:10:40.541 but at least we know what we're doing with them. 00:10:40.541 --> 00:10:42.124 And you can see the regularity. 00:10:42.124 --> 00:10:46.228 As the quality of an item decreases from left to right, 00:10:46.228 --> 00:10:49.179 the items tend to be less and less reused. 00:10:49.724 --> 00:10:53.817 So also this synchronization helped us learn things like this. 00:10:54.225 --> 00:10:57.850 To the right, for example, these five time series here. 00:10:58.274 --> 00:11:05.252 Each time series corresponds to one of the quality categories-- 00:11:05.252 --> 00:11:06.642 A, B, C, or D. 00:11:06.642 --> 00:11:11.222 And the time is on the horizontal axis running from left to right. 00:11:11.222 --> 00:11:15.883 And you can see here how many items from each quality-class 00:11:15.883 --> 00:11:19.305 received their latest revision when. 00:11:19.792 --> 00:11:23.956 So the top quality class A that is this [inaudible] line 00:11:23.956 --> 00:11:29.693 which is found, say, at the most right position here, 00:11:29.693 --> 00:11:31.113 and the shortest line. 00:11:31.113 --> 00:11:34.584 So those are the best items that we have. 00:11:35.341 --> 00:11:38.247 And what you can see is actually that there is no item 00:11:38.247 --> 00:11:44.580 that did not receive at least one revision after December 2018, 00:11:44.580 --> 00:11:48.118 meaning one thing-- if you want quality in Wikidata, you have to work on it. 
00:11:48.118 --> 00:11:50.893 So the best items that we have are actually the items 00:11:50.893 --> 00:11:52.801 that we're really paying attention to. 00:11:52.801 --> 00:11:55.743 If you look at the classes of lower quality, the other time-series, 00:11:55.743 --> 00:11:59.173 you will see that we have items that were revised in 2012 00:11:59.173 --> 00:12:00.683 for the last time. 00:12:01.156 --> 00:12:03.348 So it tells a story of responsibilities-- 00:12:03.348 --> 00:12:07.694 how much work we put into the items [that actually work]. 00:12:07.694 --> 00:12:09.421 What brings quality. 00:12:13.043 --> 00:12:17.205 While we do these things, we also try to make as much use 00:12:17.205 --> 00:12:20.163 of the byproducts of these procedures as possible. 00:12:20.569 --> 00:12:23.308 So, for example, in order to develop the project 00:12:23.308 --> 00:12:25.425 called Wikidata Languages Landscape-- 00:12:25.425 --> 00:12:28.375 I think I mentioned it yesterday during the Birthday Presentation-- 00:12:30.545 --> 00:12:34.444 I had to perform a quite thorough study 00:12:34.444 --> 00:12:37.725 of the sub-ontology in Wikidata of languages. 00:12:37.725 --> 00:12:41.712 And you know what? There are problems in that ontology. 00:12:45.502 --> 00:12:48.467 I tried not to miss to give you an opportunity. 00:12:49.301 --> 00:12:52.247 So this is the dashboard actually about the languages 00:12:52.247 --> 00:12:54.791 called the Wikidata Languages Landscape. 00:12:54.791 --> 00:12:58.594 Once again, you have all the URLs in the presentation. 00:12:59.694 --> 00:13:03.720 So for example, you want to take a look at a particular language. 00:13:03.720 --> 00:13:08.688 Say, English, okay. 00:13:09.448 --> 00:13:14.636 So the dashboard will generate its local ontological context 00:13:14.636 --> 00:13:19.006 and mark all the relations of the form instance of, 00:13:19.006 --> 00:13:21.276 subclass of, and part of. 00:13:21.716 --> 00:13:23.845 Why did I choose to do this? 
00:13:23.845 --> 00:13:25.991 To help you fix the language ontology. 00:13:25.991 --> 00:13:31.586 Why? Because you will find many languages, for example, my native language 00:13:31.586 --> 00:13:33.618 which used to be Serbo-Croatian, 00:13:33.618 --> 00:13:38.553 and for silly reasons now we have Serbian and Croatian-- it's a political thing. 00:13:38.553 --> 00:13:40.554 I don't want to go into it, but you realize 00:13:40.554 --> 00:13:43.255 that Serbian is now, for example, at the same time 00:13:43.255 --> 00:13:46.637 a subclass of Serbo-Croatian and a part of Serbo-Croatian. 00:13:46.955 --> 00:13:48.395 Which still holds for the Croatian-- 00:13:48.395 --> 00:13:50.860 Croatian is also a part and a subclass of Serbo-Croatian. 00:13:50.860 --> 00:13:52.496 So Serbo-Croatian used to be a language. 00:13:52.496 --> 00:13:54.957 Now we don't have normative support for it. 00:13:54.957 --> 00:13:57.086 But still, it's not a language class, it's a language. 00:13:57.086 --> 00:14:00.528 Can it be a part of it or can it be a subclass of it? 00:14:00.528 --> 00:14:03.297 So it's a confusion of mereological and set-theoretic relations, 00:14:03.297 --> 00:14:05.803 and I think it should be fixed somehow. 00:14:06.656 --> 00:14:09.245 In other words, don't say 00:14:10.129 --> 00:14:14.993 that you don't have the tool to fix the ontology. 00:14:14.993 --> 00:14:17.859 Just find some time and go play with it. 00:14:19.257 --> 00:14:22.431 Speaking of languages, I mentioned, just to show you this project. 00:14:22.990 --> 00:14:27.162 Many people liked this thing that I published online on Twitter. 00:14:27.162 --> 00:14:28.567 That's one of the things, you know. 00:14:28.567 --> 00:14:32.565 Data science is usually sold via visualizations. 00:14:32.565 --> 00:14:34.202 People like to visualize things, 00:14:34.202 --> 00:14:36.843 and, of course, we do pay attention to that. 00:14:37.763 --> 00:14:40.385 Aesthetics is a part of communication. 
00:14:41.772 --> 00:14:44.051 It's not the most important thing for a scientific finding 00:14:44.051 --> 00:14:45.348 to show you something beautiful, 00:14:45.348 --> 00:14:48.621 but if you can show something beautiful, you shouldn't miss the opportunity. 00:14:48.621 --> 00:14:51.876 So here we did with the languages in Wikidata 00:14:51.876 --> 00:14:53.987 the same thing that we do with items and projects 00:14:53.987 --> 00:14:56.161 in the Wikidata Concepts Monitor. 00:14:56.161 --> 00:15:02.898 We actually group languages by similarity, and the similarity was defined 00:15:02.898 --> 00:15:05.800 as how much do they overlap across the items. 00:15:06.452 --> 00:15:10.531 So if I can talk about the same things in English 00:15:10.531 --> 00:15:13.973 and in some West-African language, for example, 00:15:13.973 --> 00:15:15.807 then those two things, those two languages 00:15:15.807 --> 00:15:19.209 are similar in terms of their reference sets. 00:15:19.209 --> 00:15:21.302 What they can refer to. 00:15:22.330 --> 00:15:24.849 Each language here 00:15:24.849 --> 00:15:27.368 points to its closest neighbor, nearest neighbor-- 00:15:27.368 --> 00:15:29.840 to the one which is most similar to it. 00:15:29.840 --> 00:15:35.595 And, of course, you can see these groupings actually occur naturally. 00:15:35.595 --> 00:15:37.549 So it's not a fully-connected graph. 00:15:37.549 --> 00:15:40.838 Clustering this thing was nothing like [there is]. 00:15:41.471 --> 00:15:44.418 Also, what you can learn from the Languages Landscape project 00:15:44.418 --> 00:15:49.294 is when you combine our data with external resources. 00:15:49.294 --> 00:15:51.369 So this is also very informative for us, 00:15:51.369 --> 00:15:54.240 for the whole, I would say, Wikimedia community. 
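The nearest-neighbor structure described for the languages, where each language points to the language it overlaps with most across items, can be sketched as follows. This is a toy sketch under stated assumptions: the label coverage is made up, and Jaccard similarity is assumed as the overlap measure.

```python
# Made-up label coverage: language -> set of item IDs that have a label in it.
labels = {
    "en": {"Q1", "Q2", "Q3", "Q4"},
    "de": {"Q1", "Q2", "Q3"},
    "sr": {"Q5", "Q6"},
    "hr": {"Q5", "Q6", "Q7"},
}


def jaccard_similarity(a, b):
    """Shared items divided by all items covered by either language."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbor(lang, coverage):
    """Return the other language whose item set overlaps most with `lang`."""
    others = [other for other in coverage if other != lang]
    return max(others, key=lambda other: jaccard_similarity(coverage[lang], coverage[other]))


# One outgoing edge per language, pointing to its most similar language.
edges = {lang: nearest_neighbor(lang, labels) for lang in labels}
```

With this toy data, `en` and `de` point at each other, and so do `sr` and `hr`. No edge links the two pairs, which is the sense in which such a graph is not fully connected and the groupings appear on their own.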
00:15:54.563 --> 00:15:56.636 We have the UNESCO language status 00:15:56.636 --> 00:15:59.755 which Wikidata actually gets from UNESCO, 00:15:59.755 --> 00:16:01.907 its websites and databases, 00:16:01.907 --> 00:16:05.198 and the Ethnologue language status. On the vertical axes, 00:16:05.198 --> 00:16:08.751 we have the Concepts Monitor reuse statistic. 00:16:08.945 --> 00:16:12.973 So we look at all the items that have a label in a particular language, 00:16:12.973 --> 00:16:15.949 and then we look at how popular those items are, 00:16:15.949 --> 00:16:18.010 how many times people used them. 00:16:19.310 --> 00:16:25.059 Of course, those safe national languages, languages that are not endangered, 00:16:25.886 --> 00:16:28.165 they have a slight advantage. 00:16:28.165 --> 00:16:30.624 But the situation is not really that bad. 00:16:30.624 --> 00:16:33.660 Say, for example, take a look at the Ethnologue category 00:16:33.660 --> 00:16:37.206 of "Second language only"-- that's the rightmost one. 00:16:37.206 --> 00:16:41.798 You will see three languages there being reused 00:16:41.798 --> 00:16:44.445 in a way comparable to the most favorable, 00:16:44.445 --> 00:16:47.456 not endangered category of national languages. 00:16:47.756 --> 00:16:49.414 It's not like a gender bias. 00:16:49.414 --> 00:16:53.784 Wikipedia seems to be really reflecting the gender bias that exists in the world. 00:16:53.784 --> 00:16:58.130 Then we have nice initiatives like women who are trying to fix this thing. 00:16:58.130 --> 00:17:02.210 With languages, well, of course, some languages are a little bit favored, 00:17:02.210 --> 00:17:04.276 but it's not that bad, 00:17:04.276 --> 00:17:07.872 and that finding really brought a lot of joy to ourselves. 00:17:08.739 --> 00:17:12.732 Now, speaking of external resources, every time that I look at this graph, 00:17:12.732 --> 00:17:16.482 I say to myself, "We know who is the queen of the databases." 
00:17:18.122 --> 00:17:22.294 You know the external identifiers property in Wikidata. 00:17:23.020 --> 00:17:30.171 So here we take all external identifiers that were present in the August 00:17:31.504 --> 00:17:34.823 JSON Dump of Wikidata, which we processed. 00:17:34.823 --> 00:17:38.079 Then, once again, did some statistics on it 00:17:38.079 --> 00:17:45.125 and grouped all the external identifiers by how much they overlap across the items. 00:17:51.228 --> 00:17:52.944 Aha, here we are. 00:17:55.021 --> 00:17:58.363 That visualization, except for maybe being aesthetically pleasing, 00:17:58.363 --> 00:17:59.691 is not that useful, 00:17:59.691 --> 00:18:03.007 but you have an interactive version developed in the dashboard. 00:18:04.231 --> 00:18:07.857 If you go and inspect the interactive version, 00:18:07.857 --> 00:18:10.984 you can learn, for example, one obvious fact 00:18:10.984 --> 00:18:13.615 that they really follow some natural semantics. 00:18:13.615 --> 00:18:15.706 They are grouped in intuitive ways. 00:18:16.050 --> 00:18:21.745 We should be perfectly expecting them to give some feedback on the quality 00:18:21.745 --> 00:18:24.453 of the organizational data in Wikidata, 00:18:24.453 --> 00:18:26.797 telling that situation is really not that bad. 00:18:27.307 --> 00:18:30.129 What I am saying is that all the external identifiers 00:18:30.129 --> 00:18:32.230 from the databases on sports, for example, 00:18:32.230 --> 00:18:34.685 you will find to be in one cluster. 00:18:34.685 --> 00:18:38.681 And then, for example, you will even be able to figure out what sport. 00:18:39.198 --> 00:18:44.277 Databases on tennis are here, databases on football are here, etc. 00:18:48.175 --> 00:18:50.670 Yes, these external resources 00:18:50.670 --> 00:18:53.684 are things that we really try to pay a lot of attention to. 00:18:54.653 --> 00:18:59.781 All right, as I said, the final thing is communication and aesthetics. 
00:18:59.781 --> 00:19:01.265 We do pay attention to it. 00:19:01.265 --> 00:19:04.183 So, for example, this thing-- many people liked it. 00:19:04.183 --> 00:19:07.184 It's a little bit rescaled for aesthetics, 00:19:07.184 --> 00:19:11.808 the same network of external identifiers that you were able to see. 00:19:11.808 --> 00:19:16.318 But you don't get to these results for free, of course. 00:19:16.707 --> 00:19:20.163 For example, this one was obtained by running a clustering algorithm 00:19:20.163 --> 00:19:23.946 on Jaccard distances-- technical terms, I'm not going into it. 00:19:23.946 --> 00:19:29.093 And first, we had to start from a matrix actually derived from 408 languages 00:19:29.093 --> 00:19:31.852 that are reused across the Wikimedia. 00:19:31.852 --> 00:19:35.268 Wikidata knows about many languages, not only 400. 00:19:35.268 --> 00:19:39.704 But only 400 of them are actually labels of the items that get reused 00:19:39.704 --> 00:19:43.880 across 60 million items contingency matrix-- that's a lot of computations. 00:19:44.591 --> 00:19:47.112 To add an additional layer of complication 00:19:47.112 --> 00:19:51.382 and, of course, the most beautiful part of your work as a data scientist, 00:19:51.382 --> 00:19:55.216 but it doesn't get to occupy 00:19:55.216 --> 00:19:58.266 more than, say, 10% or 15% of your time, 00:19:58.266 --> 00:20:00.932 because everything else goes to data engineering 00:20:00.932 --> 00:20:03.083 and synchronization of different systems. 00:20:03.083 --> 00:20:04.936 With the machine learning and statistic things, 00:20:04.936 --> 00:20:07.249 we use plenty of different algorithms. 00:20:07.249 --> 00:20:12.845 I don't think this is now time to go and talk about details of these things. 00:20:12.845 --> 00:20:14.916 I have plenty of opportunities to discuss them, 00:20:14.916 --> 00:20:18.466 but it's typically a highly technical topic, 00:20:18.466 --> 00:20:21.369 better suited for a scientific conference. 
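The step from a contingency matrix to clusters via Jaccard distances can be sketched on a toy scale. Everything below is an assumption made for illustration: the matrix is made up, and the talk does not name the clustering algorithm, so a simple threshold-based single-linkage grouping stands in for it.

```python
from itertools import combinations

# Made-up binary contingency matrix: rows are items, columns are languages.
# A 1 means the item has a label in that language.
LANGUAGES = ["en", "de", "sr", "hr"]
contingency = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]


def jaccard_distance(col_a, col_b, matrix):
    """1 minus (rows where both columns are set / rows where either is set)."""
    both = sum(1 for row in matrix if row[col_a] and row[col_b])
    either = sum(1 for row in matrix if row[col_a] or row[col_b])
    return 1.0 - both / either if either else 0.0


def cluster(matrix, names, threshold):
    """Single-linkage grouping: join columns closer than `threshold`."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(names)), 2):
        if jaccard_distance(a, b, matrix) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())


clusters = cluster(contingency, LANGUAGES, threshold=0.5)
```

On this toy matrix the result is two groups, `["en", "de"]` and `["sr", "hr"]`. The sketch compares every pair of columns over every row, which is fine for six rows but gives a feeling for why a matrix of 60 million items by roughly 400 languages needs a cluster rather than one machine.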
00:20:22.999 --> 00:20:26.509 Here are all layers of complexity. 00:20:26.509 --> 00:20:30.206 In the end, we have to add deployment and dashboards, 00:20:30.206 --> 00:20:33.445 because they won't build themselves to this thing. 00:20:33.831 --> 00:20:36.854 And all these things, all these phases 00:20:36.854 --> 00:20:40.581 of development of an analytics or data science project 00:20:41.188 --> 00:20:46.560 need to fit together in order to be able to derive empirical results 00:20:46.565 --> 00:20:49.392 on the system of Wikidata's complexity. 00:20:49.848 --> 00:20:53.720 The true picture is that you cannot really just run through these cycles. 00:20:54.417 --> 00:20:56.884 All the phases of the process are interdependent 00:20:56.884 --> 00:21:00.012 because you really have to plan very early on 00:21:00.012 --> 00:21:04.115 what visualizations are you going to use, what technology you will use 00:21:04.115 --> 00:21:06.654 to render those visualizations in the end. 00:21:06.654 --> 00:21:08.888 What machine learning algorithms you will be using, 00:21:08.888 --> 00:21:13.534 because all of them have their own taste about what data structures they like. 00:21:13.534 --> 00:21:16.695 And then you hit the constraints of infrastructure-- similar things. 00:21:16.695 --> 00:21:18.827 I am not complaining, I'm really enjoying this. 00:21:18.827 --> 00:21:22.400 This is the most beautiful playground I've ever seen in my life. 00:21:22.400 --> 00:21:25.381 Thanks to you and people who built Wikidata. 00:21:25.381 --> 00:21:26.388 Thank you very much! 00:21:26.388 --> 00:21:27.729 That would be it. 00:21:28.119 --> 00:21:29.991 (moderator) Thank you, Goran. 00:21:29.991 --> 00:21:32.290 (applause) 00:21:32.825 --> 00:21:35.261 (moderator) You have time for a couple of questions. 00:21:44.322 --> 00:21:47.663 (man) Well, you did a lot of research, I can see that. 00:21:47.663 --> 00:21:48.676 (Goran) Sorry? 
00:21:48.676 --> 00:21:51.642 (man) You did a lot of research, I can see that. 00:21:51.642 --> 00:21:57.244 I'm wondering if there is anything that you discovered during the research 00:21:57.244 --> 00:21:58.853 that surprised you. 00:21:59.327 --> 00:22:01.356 Thank you for that question. 00:22:01.356 --> 00:22:07.663 Actually, I wanted to focus on that in this talk 00:22:07.663 --> 00:22:11.244 until I realized that we simply won't have enough time 00:22:11.244 --> 00:22:13.816 to explain everything. 00:22:15.407 --> 00:22:19.247 Most of the time when you're analyzing big datasets 00:22:19.247 --> 00:22:22.179 structured in a way like Wikidata. 00:22:22.179 --> 00:22:26.345 Even when you're going to the wild, meaning study the reuse of data 00:22:26.345 --> 00:22:27.442 across Wikipedia, 00:22:27.442 --> 00:22:30.622 where actually people can do whatever they like with those items, 00:22:31.662 --> 00:22:33.917 you have a lot of data, a lot of information. 00:22:33.917 --> 00:22:35.603 Of course, you see structure. 00:22:35.603 --> 00:22:40.209 Most of the time, 90% of the time, you see things that are expected. 00:22:41.195 --> 00:22:46.678 Things like what projects make the most use of Wikidata. 00:22:46.678 --> 00:22:49.891 And you can almost-- you don't have to do too much statistics, 00:22:50.721 --> 00:22:54.897 you can rely on the expectations of all the world and see what's happening. 00:22:56.694 --> 00:22:58.643 Many things were surprising, 00:22:58.643 --> 00:23:03.308 and those things that were surprising are really the most informative things. 
00:23:05.372 --> 00:23:09.069 When one communicates the findings from analytics and such systems, 00:23:09.486 --> 00:23:14.200 it's important, people typically expect either "wow" visualizations 00:23:14.200 --> 00:23:18.316 and we have tons of data so we can always deliver "wow" visualizations, 00:23:18.912 --> 00:23:21.563 or they expect to learn things like, 00:23:21.563 --> 00:23:24.204 "Our project is doing better than this project" 00:23:24.204 --> 00:23:26.239 or "Yes, we are rocking!" etc., 00:23:26.239 --> 00:23:30.148 while the goal of the whole game should actually be to learn 00:23:30.148 --> 00:23:34.128 what is wrong, what is not working, what could be done better. 00:23:34.938 --> 00:23:36.451 Many things were surprising. 00:23:38.341 --> 00:23:42.061 For example, the distribution of item usage across languages-- 00:23:42.061 --> 00:23:43.850 that was surprising to me. 00:23:43.850 --> 00:23:45.014 This thing. 00:23:47.098 --> 00:23:51.348 So I did not really expect that the situation with languages 00:23:51.348 --> 00:23:54.352 will be this good, I would say. 00:23:54.830 --> 00:24:01.332 My expectation would be that languages that have less economic support, 00:24:01.332 --> 00:24:03.651 normative support, even political support-- 00:24:03.651 --> 00:24:06.601 that's a fact when you talk about languages-- 00:24:06.601 --> 00:24:11.521 will not be so widely reused across the Wikimedia universe. 00:24:11.521 --> 00:24:15.540 In fact, it turns out that the differences-- we can see them, 00:24:15.540 --> 00:24:18.977 but it's far away from gender bias which is really bad, I think, 00:24:18.977 --> 00:24:20.707 we need to work there. 00:24:20.707 --> 00:24:22.456 That was surprising, for example. 00:24:22.456 --> 00:24:25.725 It was a positive surprise, to put it that way. 00:24:25.725 --> 00:24:28.271 Then from time to time, we discover projects 00:24:28.821 --> 00:24:34.775 that actually do a great job by reusing the Wikidata content and Wikimedia. 
00:24:34.775 --> 00:24:37.895 We're totally surprised to learn that such a project can do it. 00:24:38.612 --> 00:24:42.554 Then you start thinking, you figure out there is a community of people 00:24:42.554 --> 00:24:44.000 actually doing it. 00:24:44.468 --> 00:24:48.735 And it's a strange feeling because I get to see all these things through machines, 00:24:48.735 --> 00:24:51.971 through databases, through visualizations and tables, 00:24:51.971 --> 00:24:58.165 and it's always that strange feeling when I realize this result was produced 00:24:58.165 --> 00:25:03.094 by a group of people, they don't even know that I'm looking at their result now. 00:25:06.101 --> 00:25:07.832 (moderator) Another question? 00:25:13.657 --> 00:25:14.703 Thank you. 00:25:14.703 --> 00:25:16.237 Is that it? Thank you very much! 00:25:16.237 --> 00:25:17.734 (moderator) Thank you. 00:25:17.734 --> 00:25:19.890 (applause)