Yes, Wikidata Statistics: What, Where, and How? This is an attempt at an overview of the analytical systems developed at Wikimedia Deutschland over the previous almost three years, since I started doing data science for Wikidata and Wiktionary. During this presentation, I will try to switch between the slides and the dashboards and show you the end data products. However, in case that causes any trouble, this is the URL of the analytics portal. Everything that I will be presenting here, whatever you can see on the slides, you can also check out later: go and play with the real thing. Otherwise, you will see only the screenshots from the slides.

So the goal-- well, the talk will be a failed attempt to communicate an almost endlessly, technically complicated field in terms that can actually motivate people to start making use of these analytical products, into whose development we are really putting a lot of effort. As I said, I will try to provide an overview of the Wikidata statistics and analytics systems, to exemplify the usage of some of them, not all, and also to go just a little bit under the hood to illustrate how it is done and what is done here, because I thought it might be interesting to the audience.

Okay, so. In analytics and data science, you always start by formulating your goals and motivation as clearly as possible. Otherwise, you enter endless cycles of developing analytical tools and data science products that actually do something, but nobody really understands what they are being built for. In 2017, at Wikimedia Deutschland, a request, a demand was formulated: we said that we needed an analytical system that would give insight into the ways Wikidata items are reused across the Wikimedia projects, meaning across the Wikipedia universe-- all the encyclopedias, and then Wikivoyage, Wikibooks, WikiCite, etc.-- all the websites, approximately 800, that we are actually managing.

So just to explain the difference between the data. On the left, for example, you see a small, or very small, subset of Wikidata. These are languages-- some of the Slavic languages, I think-- and in Wikidata they are connected by their properties, they belong to different classes, etc. But we were looking for a different kind of mapping. What you see here, on the right side, is a set of items all belonging to the class of architectural structures, I would say. And this is the result of their empirical embeddings: the items here are linked by the similarity of their usage across the Wikipedias, for example.

So what does it mean, this similarity-- to be similar in terms of how an item is used across the Wikipedias? Imagine you take an array of numbers, and each element of the array is one project: English Wikipedia, French Wikivoyage, Italian Wikipedia, etc. Then you count how many times a particular item has been used in that project. So you use an array of numbers to describe the item that way. It's a little bit more complicated in practice. You can then describe all items in Wikidata that were ever used across the websites by such arrays of numbers-- called embeddings, technically. From those data, using different distance metrics, applying machine learning methods, doing dimensionality reduction, and similar things, you can actually figure out the similarity pattern.
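To make this concrete, here is a minimal sketch of such usage embeddings, with toy data: the item IDs, projects, and counts below are invented for illustration, and cosine similarity stands in for whichever distance metrics the real pipeline uses.

```python
# Minimal sketch: each Wikidata item is described by an array of per-project
# usage counts (its "embedding"); similarity is then computed between arrays.
# All IDs and counts here are toy values, not real data.
import numpy as np

projects = ["enwiki", "frwikivoyage", "itwiki"]  # one array element per project

# For each item: how many times it has been used in each project.
usage = {
    "Q1234": np.array([120.0, 3.0, 40.0]),
    "Q5678": np.array([110.0, 5.0, 35.0]),
    "Q9999": np.array([2.0, 90.0, 1.0]),
}

def cosine_similarity(a, b):
    """One similarity metric of many that could be applied to usage arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for x, y in [("Q1234", "Q5678"), ("Q1234", "Q9999"), ("Q5678", "Q9999")]:
    print(x, y, round(cosine_similarity(usage[x], usage[y]), 3))
```

Here Q1234 and Q5678 come out as highly similar (their usage patterns across projects nearly coincide), while Q9999 does not-- exactly the kind of structure the similarity graphs above are built from.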
And here items are connected by how similar their patterns of usage are across the different Wikipedias. Once again, for every visualization, every result that I show, there is a link in the presentation, so you can go and check for yourself; you can play with these things interactively.

Similarly, we were able to derive a graph like this one. This one does not connect Wikidata items; it connects projects, by looking at how similar they are in terms of how they use different Wikidata items. To be as precise as possible: the data that we use to do this do not live in Wikidata; they are not a part of Wikidata, the data are not located here at all.

We have Wikidata, we have formulated our motivation and goals, and immediately we start talking about the data model and the structures: what structures and data models do you need to answer the questions that you initially proposed? There is Wikibase and its client-side tracking mechanism, which is installed in all those wikis and actually tracks the Wikidata usage on a project-- on a Wikipedia, for example. Every time an item is used there, in one way or another, a row enters a huge SQL table that records the usage of that item.

Now, immediately, we had to face a data-engineering problem, of course, because we are talking about hundreds of huge SQL tables, and we had to do machine learning and statistics across all the data together, not separately, in order to be able to produce structures like this one or like this one. So, in cooperation with the Analytics Engineering team of the Foundation, we started transferring those data from Wikibase to the Wikimedia Foundation Data Lake, which is a big data storage. The data do not live there in a relational database; they live in something similar-- it's Hadoop, with Hive tables, etc.-- but it's a huge, huge engineering procedure. (A sketch of what this kind of aggregation can look like follows below.)

So not all data in analytics-- especially in big games like the one we have to play with Wikidata and Wikipedia-- are immediately available to you. One source of complication, before you actually start solving the problem in a scientific way, to put it that way, is to engineer the data sets, to prepare the structures that you actually need for doing machine learning, statistics, and similar things.

This is the full design of the system called the Wikidata Concepts Monitor, which tracks the reuse statistics. I will not go into details here, of course. The obvious complication is that-- as I wrote it up-- many systems need to work together. You have to synchronize many different sources of data and many different infrastructures just to make it happen, even before you start thinking in terms of methodologies, science, statistics, and similar.

As I said, we started with our goals and motivation; then, typically, the data model and the structures that you need correspond to those goals and motivations-- that should always be your first step in developing an analytics project. Then you figure out it's really too complicated: it cannot be done by one person, it cannot be done on one computer, to put it that way. So we needed to work with the analytics infrastructure, and that adds an additional layer of complication-- communication with external teams and collaborators-- because, obviously, such a system cannot be managed easily by one person. Actually, I think it would be pretty impossible.
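The per-wiki tracking table in MediaWiki is wbc_entity_usage; the following is a hedged PySpark sketch of the kind of aggregation described above, under the assumption that those hundreds of per-wiki tables have already been copied into one Hive table, wdcm.entity_usage, with an added wiki_db column. The combined table name and that column are assumptions for illustration; only wbc_entity_usage and its eu_* columns reflect the actual MediaWiki schema, and this is not the Wikidata Concepts Monitor's actual code.

```python
# Hedged sketch: aggregate client-side Wikidata usage rows into an
# item-by-project matrix with PySpark. Assumes a combined Hive table
# `wdcm.entity_usage` (hypothetical name) holding all wikis' wbc_entity_usage
# rows plus a `wiki_db` column identifying the source wiki.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wd-usage-aggregation")
         .enableHiveSupport()
         .getOrCreate())

# One row per (item, project): how many distinct pages on that project use it.
usage_counts = spark.sql("""
    SELECT eu_entity_id AS item,
           wiki_db      AS project,
           COUNT(DISTINCT eu_page_id) AS pages_using_item
    FROM wdcm.entity_usage
    WHERE eu_entity_id LIKE 'Q%'   -- items only, not properties or lexemes
    GROUP BY eu_entity_id, wiki_db
""")

# Pivot into the item x project matrix: the raw material for the embeddings.
embedding_matrix = (usage_counts
                    .groupBy("item")
                    .pivot("project")
                    .sum("pages_using_item")
                    .na.fill(0))
```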
So, as I mentioned, there is this Data Lake, our big data storage in Hadoop, and a team of awesome data engineers in the Foundation called the Analytics Engineering team. To a data scientist, data engineers are the people who watch your back while you're trying to do your thing. If you cannot rely on a good engineering team, there's not much you will be able to do by yourself.

This infrastructure is maintained by the Foundation. You enter through several statistical servers-- these blue boxes down there. You can communicate with the relational database systems-- we use MariaDB. You can communicate with the Data Lake. And, of course, for your computations, you go to the so-called Analytics Cluster, where you use things like Apache Spark, which is really the only efficient way to process the data that we need to process. When I started doing this back in 2017, I remember seeing just the schema of the infrastructure for the first time. If I had not been able to rely on my colleague Adam Shorland-- who is still with us at Wikimedia Deutschland-- I would never have made it; I wouldn't even have known how to navigate the structure.

As you start building a project to do analytics for Wikidata, you see how it gets more and more complicated, because you have to deal with synchronizing different systems, different teams, different infrastructures, different data sets. However, that synchronization, and all the pain, pays off.

It can get really nasty sometimes, and the most recent example is the production of the Data Quality Report for Wikidata-- an initial assessment of the quality of what we have in Wikidata. In order to produce it, we had to rely on the quality predictions from ORES, the machine learning system developed by Aaron Halfaker and the Scoring Platform team, and combine them with the Wikidata Concepts Monitor reuse statistics. The full revision history of all Wikipedias is available in one single, huge big data table called MediaWiki History, which lives in the Data Lake. And we also had to process the JSON dump in HDFS. So we're talking about four massive structures: two machine learning systems, with their complexities, and two huge data sets. Everything needs to work in sync in order to produce the Quality Report that we're presenting this year at WikidataCon. But if we hadn't done that, we couldn't show beautiful things like this.

On the horizontal axis, you have the ORES quality prediction score. We use five categories-- you can inform yourself, just google "Wikidata data quality categories," and you will find the description. The A class, to the left, holds the best items that we have, and at the same time-- that's the green box-- they are the most reused items in Wikipedia. So, as Lydia explained yesterday, it's not like all our items are of the highest quality. To the contrary, we have many items that are not of that high quality, but at least we know what we're doing with them. And you can see the regularity: as the quality of an item decreases from left to right, the items tend to be less and less reused.

This synchronization also helped us learn things like the following. To the right, for example, are five time series. Each time series corresponds to one of the quality categories-- A, B, C, D, or E. Time is on the horizontal axis, running from left to right.
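For context: at the time of this talk, the ORES quality predictions were served over a public API, so per-revision item-quality scores could be fetched directly. A hedged sketch follows; the revision ID is a placeholder, and error handling is minimal.

```python
# Hedged sketch: fetch Wikidata item-quality predictions (classes A..E) from
# the ORES v3 API for a batch of revision IDs. Revision IDs are placeholders.
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/wikidatawiki/"

def item_quality(rev_ids):
    """Map each revision ID to its predicted quality class."""
    params = {"models": "itemquality", "revids": "|".join(map(str, rev_ids))}
    response = requests.get(ORES_URL, params=params, timeout=30)
    response.raise_for_status()
    scores = response.json()["wikidatawiki"]["scores"]
    return {rev: data["itemquality"]["score"]["prediction"]
            for rev, data in scores.items()}

print(item_quality([123456789]))  # placeholder revision ID
```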
And you can see here how many items from each quality class received their latest revision, and when. The top quality class, A, is this [inaudible] line, found, say, at the rightmost position here-- the shortest line. Those are the best items that we have. And what you can see is that there is no A-class item that did not receive at least one revision after December 2018, which means one thing: if you want quality in Wikidata, you have to work on it. So the best items that we have are actually the items that we're really paying attention to. If you look at the classes of lower quality, the other time series, you will see that we have items that were revised in 2012 for the last time. So it tells a story of responsibility: it is the work we put into the items that brings quality.

While we do these things, we also try to make as much use as possible of the byproducts of these procedures. So, for example, in order to develop the project called Wikidata Languages Landscape-- I think I mentioned it yesterday during the birthday presentation-- I had to perform a quite thorough study of the sub-ontology of languages in Wikidata. And you know what? There are problems in that ontology. I didn't want to miss the chance to give you an opportunity to fix them.

So this is the dashboard about the languages, called Wikidata Languages Landscape. Once again, you have all the URLs in the presentation. So, for example, you want to take a look at a particular language-- say, English, okay. The dashboard will generate its local ontological context and mark all the relations of the form "instance of," "subclass of," and "part of." Why did I choose to do this? To help you fix the language ontology. Why? Because you will find many languages-- for example, my native language, which used to be Serbo-Croatian; for silly reasons we now have Serbian and Croatian-- it's a political thing, I don't want to go into it. But you realize that Serbian is now, for example, at the same time a subclass of Serbo-Croatian and a part of Serbo-Croatian. The same holds for Croatian-- Croatian is also a part and a subclass of Serbo-Croatian. So, Serbo-Croatian used to be a language; now we don't have normative support for it, but still, it's not a language class, it's a language. Can something be a part of it, or can it be a subclass of it? It's a confusion of mereological and set-theoretic relations, and I think it should be fixed somehow. In other words, don't say that you don't have the tool to fix the ontology. Just find some time and go play with it. (A sketch of what such a check looks like as a query follows below.)

Speaking of languages-- I mentioned it, so let me just show you this project. Many people liked this thing that I published on Twitter. That's one of those things, you know: data science is usually sold via visualizations. People like to visualize things, and, of course, we do pay attention to that. Aesthetics is a part of communication. It's not the most important thing for a scientific finding to show you something beautiful, but if you can show something beautiful, you shouldn't miss the opportunity.

So here we did with the languages in Wikidata the same thing that we do with items and projects in the Wikidata Concepts Monitor. We grouped languages by similarity, and similarity was defined by how much they overlap across the items. If I can talk about the same things in English and in some West African language, for example, then those two languages are similar in terms of their reference sets-- what they can refer to.
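Going back to the ontology problem for a moment, here is a hedged sketch of how such a check can be expressed as a query against the Wikidata Query Service: find languages that are simultaneously a "subclass of" (P279) and a "part of" (P361) of the same entity. It is deliberately simplified (it only matches direct instances of language, Q34770) and is an illustration, not the dashboard's actual implementation.

```python
# Hedged sketch: ask the Wikidata Query Service for languages that are both
# `subclass of` (P279) and `part of` (P361) the same entity -- the mix-up of
# set-theoretic and mereological relations described above.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?lang ?langLabel ?parent ?parentLabel WHERE {
  ?lang wdt:P31 wd:Q34770 .    # instance of: language (direct instances only)
  ?lang wdt:P279 ?parent .     # subclass of some entity ...
  ?lang wdt:P361 ?parent .     # ... and also part of that same entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "ontology-check-sketch/0.1 (example)"},  # WDQS asks for a UA
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["langLabel"]["value"], "<->", row["parentLabel"]["value"])
```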
Each language here points to its nearest neighbor-- to the language which is most similar to it. And, of course, you can see that these groupings occur naturally-- so it's not a fully connected graph; the clusters simply emerge.

Also, what you can learn from the Languages Landscape project comes from combining our data with external resources. This is also very informative for us-- for the whole Wikimedia community, I would say. We have the UNESCO language status, which Wikidata actually gets from UNESCO, its websites and databases, and the Ethnologue language status on the vertical axes. And we have the Concepts Monitor reuse statistic: we look at all the items that have a label in a particular language, and then we look at how popular those items are, how many times people used them. Of course, the safe, national languages-- the languages that are not endangered-- have a slight advantage. But the situation is not really that bad. Take a look, for example, at the Ethnologue category of "Second language only"-- that's the rightmost one. You will see three languages there being reused in a way comparable to the most favorable, not endangered category of national languages. It's not like the gender bias: Wikipedia seems to really reflect the gender bias that exists in the world, and there we have nice initiatives, like Women in Red, trying to fix it. With languages, well, of course, some languages are a little bit favored, but it's not that bad, and that finding really brought us a lot of joy.

Now, speaking of external resources: every time I look at this graph, I say to myself, "We know who the queen of the databases is." You know the external identifier properties in Wikidata. So here we took all external identifiers that were present in the August JSON dump of Wikidata, which we processed; then, once again, we did some statistics on it and grouped all the external identifiers by how much they overlap across the items.

Aha, here we are. That visualization, except for maybe being aesthetically pleasing, is not that useful, but there is an interactive version developed in the dashboard. If you go and inspect the interactive version, you can learn, for example, one obvious fact: they really follow some natural semantics. They are grouped in intuitive ways-- exactly what we should expect-- which gives some feedback on the quality of the organization of data in Wikidata, telling us that the situation is really not that bad. What I am saying is that all the external identifiers from databases on sports, for example, you will find in one cluster. And then you will even be able to figure out which sport: databases on tennis are here, databases on football are here, etc. Yes, these external resources are things that we really try to pay a lot of attention to.

All right. As I said, the final thing is communication and aesthetics, and we do pay attention to it. So, for example, this thing-- many people liked it. It's a little bit rescaled for aesthetics; it's the same network of external identifiers that you were able to see. But you don't get these results for free, of course. This one, for example, was obtained by running a clustering algorithm on Jaccard distances-- technical terms, I'm not going into it. And first we had to start from a matrix derived from the 408 languages that are reused across Wikimedia. Wikidata knows about many more languages, not only 400.
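As a hedged illustration of that grouping step-- it applies to the languages and to the external identifiers alike-- here is a toy version of clustering on Jaccard distances over sets of items. The identifier names and item sets are invented, and average-linkage hierarchical clustering stands in for whichever clustering algorithm the real pipeline uses.

```python
# Toy sketch: Jaccard distances between the item sets of external identifiers,
# then hierarchical clustering. All names and item sets below are invented.
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

# For each external identifier: the set of items it appears on.
items_by_identifier = {
    "tennis DB A":   {"Q101", "Q102", "Q103"},
    "tennis DB B":   {"Q101", "Q103", "Q104"},
    "football DB A": {"Q201", "Q202"},
    "football DB B": {"Q201", "Q202", "Q203"},
}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: identifiers sharing many items end up 'close'."""
    return 1.0 - len(a & b) / len(a | b)

names = list(items_by_identifier)
# Condensed pairwise distance vector, in the order linkage() expects.
condensed = [jaccard_distance(items_by_identifier[x], items_by_identifier[y])
             for x, y in combinations(names, 2)]

clusters = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
for name, cluster_id in zip(names, clusters):
    print(cluster_id, name)  # tennis DBs land in one cluster, football in the other
```

In the real setting, the matrix is of course vastly larger-- hundreds of languages or identifiers against tens of millions of items-- which is exactly why Spark and the Analytics Cluster come into play.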
But only around 400 of those languages actually appear as labels of the items that get reused-- and a contingency matrix of 408 languages across some 60 million items is a lot of computation. That adds an additional layer of complication. And, of course, the machine learning and statistics are the most beautiful part of your work as a data scientist, but they don't get to occupy more than, say, 10% or 15% of your time, because everything else goes to data engineering and the synchronization of different systems. On the machine learning and statistics side, we use plenty of different algorithms. I don't think now is the time to talk about the details of these things-- I will have plenty of opportunities to discuss them, but it's typically a highly technical topic, better suited for a scientific conference.

Here are all the layers of complexity. In the end, we have to add deployment and the dashboards, because they won't build themselves. And all these things, all these phases of the development of an analytics or data science project, need to fit together in order to be able to derive empirical results on a system of Wikidata's complexity. The true picture is that you cannot really just run through these phases in a cycle. All the phases of the process are interdependent, because you really have to plan very early on what visualizations you are going to use, what technology you will use to render those visualizations in the end, what machine learning algorithms you will be using-- because all of them have their own taste in the data structures they like-- and then you hit the constraints of the infrastructure, and similar things.

I am not complaining-- I'm really enjoying this. This is the most beautiful playground I've ever seen in my life, thanks to you and the people who built Wikidata. Thank you very much! That would be it.

(moderator) Thank you, Goran. (applause) You have time for a couple of questions.

(man) Well, you did a lot of research, I can see that.

(Goran) Sorry?

(man) You did a lot of research, I can see that. I'm wondering if there is anything that you discovered during the research that surprised you.

(Goran) Thank you for that question. Actually, I wanted to focus on that in this talk, until I realized that we simply wouldn't have enough time to explain everything. Most of the time, when you're analyzing big datasets structured the way Wikidata is-- even when you go into the wild, meaning you study the reuse of data across the Wikipedias, where people can actually do whatever they like with those items-- you have a lot of data, a lot of information. Of course, you see structure. Most of the time, 90% of the time, you see things that are expected-- things like which projects make the most use of Wikidata. You almost don't have to do any statistics; you can rely on everyone's expectations and see what's happening.

Many things were surprising, though, and those surprising things are really the most informative ones. When one communicates findings from analytics systems like these, people typically expect either "wow" visualizations-- and we have tons of data, so we can always deliver "wow" visualizations-- or they expect to learn things like "Our project is doing better than that project," or "Yes, we are rocking!" etc., while the goal of the whole game should actually be to learn what is wrong, what is not working, what could be done better. Many things were surprising. For example, the distribution of item usage across languages-- that was surprising to me. This thing.
So I did not really expect that the situation with languages would be this good, I would say. My expectation was that languages that have less economic support, normative support, even political support-- that's a fact when you talk about languages-- would not be so widely reused across the Wikimedia universe. In fact, it turns out that the differences are there-- we can see them-- but it's far from the gender bias, which is really bad, I think; we need to work there. That was surprising, for example. It was a positive surprise, to put it that way.

Then, from time to time, we discover projects that actually do a great job of reusing the Wikidata content across Wikimedia, and we're totally surprised to learn that such a project can do it. Then you start thinking, and you figure out there is a community of people actually doing it. And it's a strange feeling, because I get to see all these things through machines, through databases, through visualizations and tables, and it's always that strange feeling when I realize that a result was produced by a group of people who don't even know that I'm looking at their results now.

(moderator) Another question? Thank you. Is that it?

(Goran) Thank you very much!

(moderator) Thank you. (applause)