[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:06.34,0:00:09.68,Default,,0000,0000,0000,,Yes, Wikidata Statistics:\NWhat, Where, and How? Dialogue: 0,0:00:09.68,0:00:13.00,Default,,0000,0000,0000,,This has been an attempt of an overview\Nfor analytical systems Dialogue: 0,0:00:13.00,0:00:16.17,Default,,0000,0000,0000,,focusing on what was developed\Nwith the Wikimedia Deutschland Dialogue: 0,0:00:16.17,0:00:18.16,Default,,0000,0000,0000,,in the previous almost three years Dialogue: 0,0:00:18.16,0:00:22.35,Default,,0000,0000,0000,,since I started doing data science\Nfor Wikidata and the dictionary. Dialogue: 0,0:00:22.35,0:00:28.34,Default,,0000,0000,0000,,So, during this presentation,\NI will try to switch from the presentation Dialogue: 0,0:00:28.35,0:00:32.09,Default,,0000,0000,0000,,to the dashboards\Nand show you the end data products. Dialogue: 0,0:00:32.10,0:00:35.03,Default,,0000,0000,0000,,However, if that causes any trouble, Dialogue: 0,0:00:35.03,0:00:39.07,Default,,0000,0000,0000,,so this is actually the URL\Nof the analytics portal. Dialogue: 0,0:00:39.07,0:00:41.27,Default,,0000,0000,0000,,So everything that\NI will be presenting here, Dialogue: 0,0:00:41.27,0:00:44.10,Default,,0000,0000,0000,,whatever you can see on the slides,\Nyou can also check out later Dialogue: 0,0:00:44.10,0:00:47.28,Default,,0000,0000,0000,,from the presentation,\Ngo and play with the real thing. Dialogue: 0,0:00:47.28,0:00:51.10,Default,,0000,0000,0000,,Otherwise, you will see only\Nthe screenshots here from the slides.
Dialogue: 0,0:00:51.10,0:00:58.28,Default,,0000,0000,0000,,So the goal-- well, the talk\Nwill be a failed attempt to communicate Dialogue: 0,0:00:58.28,0:01:01.50,Default,,0000,0000,0000,,an almost endlessly\Ntechnically complicated field Dialogue: 0,0:01:02.57,0:01:06.84,Default,,0000,0000,0000,,in terms that can actually motivate\Npeople to start making use Dialogue: 0,0:01:06.84,0:01:08.34,Default,,0000,0000,0000,,of this analytical product Dialogue: 0,0:01:08.34,0:01:11.01,Default,,0000,0000,0000,,in which development\Nwe are really putting a lot of effort. Dialogue: 0,0:01:11.01,0:01:13.63,Default,,0000,0000,0000,,So, as I said, I will try\Nto provide an overview Dialogue: 0,0:01:13.63,0:01:15.68,Default,,0000,0000,0000,,of the Wikidata Statistics\Nand Analytics systems. Dialogue: 0,0:01:15.68,0:01:20.64,Default,,0000,0000,0000,,So I will try to exemplify the usage\Nof some of them, not all. Dialogue: 0,0:01:20.64,0:01:23.36,Default,,0000,0000,0000,,And also I will try to go just\Na little bit under the hood Dialogue: 0,0:01:23.36,0:01:27.45,Default,,0000,0000,0000,,to try to illustrate how it is done,\Nwhat is done here, Dialogue: 0,0:01:27.45,0:01:31.14,Default,,0000,0000,0000,,because I thought it might be\Ninteresting to the audience. Dialogue: 0,0:01:31.82,0:01:33.53,Default,,0000,0000,0000,,Okay, so say... Dialogue: 0,0:01:34.80,0:01:38.54,Default,,0000,0000,0000,,In analytics and data science,\Nyou always start with formulating Dialogue: 0,0:01:38.54,0:01:41.71,Default,,0000,0000,0000,,as clearly as possible\Nyour goals and motivations. Dialogue: 0,0:01:41.71,0:01:47.08,Default,,0000,0000,0000,,Otherwise, you enter into endless cycles\Nof developing analytical tools Dialogue: 0,0:01:47.08,0:01:49.73,Default,,0000,0000,0000,,and data science products\Nthat actually do something, Dialogue: 0,0:01:49.73,0:01:52.84,Default,,0000,0000,0000,,but nobody really understands\Nwhat they're being built for. 
Dialogue: 0,0:01:52.84,0:01:57.67,Default,,0000,0000,0000,,In 2017, in Wikimedia Deutschland,\Na request, a demand was formulated-- Dialogue: 0,0:01:57.92,0:01:59.74,Default,,0000,0000,0000,,we said that we needed\Nan analytical system Dialogue: 0,0:01:59.74,0:02:01.94,Default,,0000,0000,0000,,that will give an insight into the ways Dialogue: 0,0:02:01.94,0:02:05.86,Default,,0000,0000,0000,,that Wikidata items are reused\Nacross the Wikimedia projects, Dialogue: 0,0:02:05.86,0:02:09.02,Default,,0000,0000,0000,,meaning across the Wikipedia universe--\Nall the encyclopedias, Dialogue: 0,0:02:09.02,0:02:11.83,Default,,0000,0000,0000,,and then Wikivoyage,\NWikibooks, WikiCite, etc.-- Dialogue: 0,0:02:11.83,0:02:15.61,Default,,0000,0000,0000,,all the websites, approximately 800\Nthat we are actually managing. Dialogue: 0,0:02:15.61,0:02:19.55,Default,,0000,0000,0000,,So just to explain the differences\Nbetween the data. Dialogue: 0,0:02:19.55,0:02:23.79,Default,,0000,0000,0000,,On the left, for example, you see a small\Nor very small subset of Wikidata. Dialogue: 0,0:02:23.79,0:02:28.11,Default,,0000,0000,0000,,These are the languages,\Nsome of the Slavic, I think, languages, Dialogue: 0,0:02:28.11,0:02:30.38,Default,,0000,0000,0000,,and in Wikidata they are connected, Dialogue: 0,0:02:30.38,0:02:34.19,Default,,0000,0000,0000,,but they are properties and belong\Nto different classes, etc. Dialogue: 0,0:02:34.19,0:02:36.78,Default,,0000,0000,0000,,But we were looking\Nfor a different kind of mapping. Dialogue: 0,0:02:36.78,0:02:41.08,Default,,0000,0000,0000,,So what you see here,\Non the right side, is a set of items Dialogue: 0,0:02:41.08,0:02:44.82,Default,,0000,0000,0000,,all belonging to the class\Nof architectural structures, I would say. Dialogue: 0,0:02:44.82,0:02:48.50,Default,,0000,0000,0000,,And this here is the result\Nof their empirical embeddings.
Dialogue: 0,0:02:48.50,0:02:50.51,Default,,0000,0000,0000,,So the items related here-- Dialogue: 0,0:02:50.52,0:02:55.95,Default,,0000,0000,0000,,they are linked by their similarity\Nof usage across Wikipedias, for example. Dialogue: 0,0:02:55.95,0:02:57.84,Default,,0000,0000,0000,,So what does it mean-- the similarity? Dialogue: 0,0:02:58.63,0:03:03.07,Default,,0000,0000,0000,,To be similar in terms of how an item\Nis used across the Wikipedias. Dialogue: 0,0:03:03.07,0:03:06.94,Default,,0000,0000,0000,,So imagine you take an array of numbers, Dialogue: 0,0:03:07.35,0:03:11.11,Default,,0000,0000,0000,,and each element of an array\Nis one project-- it's English Wikipedia, Dialogue: 0,0:03:11.56,0:03:17.42,Default,,0000,0000,0000,,it is French Wikivoyage,\Nit is Italian Wikipedia, etc. Dialogue: 0,0:03:17.90,0:03:20.50,Default,,0000,0000,0000,,And then, you count how many times Dialogue: 0,0:03:20.50,0:03:23.08,Default,,0000,0000,0000,,a particular item has been used\Nin that project. Dialogue: 0,0:03:24.11,0:03:27.63,Default,,0000,0000,0000,,So you use an array of numbers\Nto describe the item that way. Dialogue: 0,0:03:27.63,0:03:29.77,Default,,0000,0000,0000,,It's a little bit more complicated\Nin practice. Dialogue: 0,0:03:31.30,0:03:36.07,Default,,0000,0000,0000,,And then, you can describe all items\Nin Wikidata that were ever used Dialogue: 0,0:03:36.07,0:03:39.36,Default,,0000,0000,0000,,across the websites at all\Nby such arrays of numbers, Dialogue: 0,0:03:39.36,0:03:41.32,Default,,0000,0000,0000,,called embeddings, technically, right?
Dialogue: 0,0:03:41.79,0:03:45.51,Default,,0000,0000,0000,,From those data,\Nusing different distance metrics, Dialogue: 0,0:03:45.51,0:03:48.89,Default,,0000,0000,0000,,applying machine learning methods,\Ndoing dimensionality reduction, Dialogue: 0,0:03:48.89,0:03:50.38,Default,,0000,0000,0000,,and similar things, Dialogue: 0,0:03:50.38,0:03:53.09,Default,,0000,0000,0000,,you can actually figure out\Nwhat is the similarity pattern. Dialogue: 0,0:03:53.09,0:03:55.62,Default,,0000,0000,0000,,And here items are connected, Dialogue: 0,0:03:55.62,0:04:00.50,Default,,0000,0000,0000,,by how similar are their patterns\Nof usage across different Wikipedias. Dialogue: 0,0:04:01.73,0:04:04.55,Default,,0000,0000,0000,,Once again, every visualization,\Nevery result that I show-- Dialogue: 0,0:04:04.55,0:04:08.28,Default,,0000,0000,0000,,there is a link on the presentation,\Nso you can go and check for yourself. Dialogue: 0,0:04:08.28,0:04:10.58,Default,,0000,0000,0000,,You can play\Nwith this thing interactively. Dialogue: 0,0:04:10.58,0:04:15.83,Default,,0000,0000,0000,,Similarly, we will be able to derive\Na graph like this one. Dialogue: 0,0:04:15.83,0:04:20.01,Default,,0000,0000,0000,,This one does not connect\Nthe Wikidata items, it connects projects, Dialogue: 0,0:04:20.01,0:04:23.07,Default,,0000,0000,0000,,by looking at how similar they are Dialogue: 0,0:04:23.07,0:04:26.78,Default,,0000,0000,0000,,in terms of how they use\Ndifferent Wikidata items. Dialogue: 0,0:04:30.24,0:04:31.73,Default,,0000,0000,0000,,To be as precise as possible, Dialogue: 0,0:04:32.47,0:04:35.37,Default,,0000,0000,0000,,the data that we use to do this--\Nthey do not live in Wikidata, Dialogue: 0,0:04:35.37,0:04:36.82,Default,,0000,0000,0000,,they are not a part of the Wikidata, Dialogue: 0,0:04:36.82,0:04:38.82,Default,,0000,0000,0000,,data does not at all [locate] here.
Dialogue: 0,0:04:38.82,0:04:41.92,Default,,0000,0000,0000,,We have the Wikidata,\Nwe have formulated our motivational goals, Dialogue: 0,0:04:41.92,0:04:45.77,Default,,0000,0000,0000,,and immediately we started talking\Nabout the data model and the structures. Dialogue: 0,0:04:45.77,0:04:49.76,Default,,0000,0000,0000,,What structures and data models\Nyou need to answer the questions Dialogue: 0,0:04:49.76,0:04:52.41,Default,,0000,0000,0000,,that you have initially proposed. Dialogue: 0,0:04:52.94,0:04:59.06,Default,,0000,0000,0000,,So there is Wikibase\Nand the client-side tracking mechanism, Dialogue: 0,0:04:59.06,0:05:01.88,Default,,0000,0000,0000,,which is installed in all those wikis, Dialogue: 0,0:05:01.88,0:05:07.00,Default,,0000,0000,0000,,that actually tracks the Wikidata usage\Non a project, on Wikipedia, for example. Dialogue: 0,0:05:07.00,0:05:10.70,Default,,0000,0000,0000,,So every time an item is used\Nin [meaningful works] Dialogue: 0,0:05:10.70,0:05:14.74,Default,,0000,0000,0000,,or in a different way--\Nthere is a row in a huge SQL table Dialogue: 0,0:05:14.74,0:05:18.12,Default,,0000,0000,0000,,that enters and checks\Nthe usage of that number. Dialogue: 0,0:05:18.12,0:05:22.33,Default,,0000,0000,0000,,Now, immediately, we had to face\Na data-engineering problem, of course, Dialogue: 0,0:05:22.33,0:05:26.43,Default,,0000,0000,0000,,because we are talking\Nabout hundreds of huge SQL tables, Dialogue: 0,0:05:26.43,0:05:29.30,Default,,0000,0000,0000,,and we had to do\Nmachine learning and statistics Dialogue: 0,0:05:29.30,0:05:32.75,Default,,0000,0000,0000,,across all the data together,\Nnot separately, Dialogue: 0,0:05:32.75,0:05:37.28,Default,,0000,0000,0000,,in order to be able to produce structures,\Nlike this one or like this one.
Dialogue: 0,0:05:37.58,0:05:41.33,Default,,0000,0000,0000,,So in cooperation with the Analytics\NEngineering Team of the Foundation, Dialogue: 0,0:05:41.33,0:05:44.46,Default,,0000,0000,0000,,we started transferring\Nthose data from Wikibase Dialogue: 0,0:05:44.46,0:05:49.18,Default,,0000,0000,0000,,to the Wikimedia Foundation Data Lake\Nwhich is actually a big data storage. Dialogue: 0,0:05:49.18,0:05:52.75,Default,,0000,0000,0000,,The data do not live there\Nin a relational database. Dialogue: 0,0:05:52.75,0:05:54.06,Default,,0000,0000,0000,,They live in something similar-- Dialogue: 0,0:05:54.06,0:05:56.55,Default,,0000,0000,0000,,it's Hadoop, and Hive tables\Nare there, etc., Dialogue: 0,0:05:56.55,0:05:58.55,Default,,0000,0000,0000,,but it's a huge,\Nhuge engineering procedure. Dialogue: 0,0:05:58.55,0:06:03.40,Default,,0000,0000,0000,,So not all data in analytics,\Nespecially in big games like this Dialogue: 0,0:06:03.40,0:06:06.00,Default,,0000,0000,0000,,that we have to play\Nwith Wikidata and Wikipedia, Dialogue: 0,0:06:06.00,0:06:07.67,Default,,0000,0000,0000,,are immediately available to you. Dialogue: 0,0:06:07.67,0:06:09.17,Default,,0000,0000,0000,,One source of complication Dialogue: 0,0:06:09.17,0:06:13.46,Default,,0000,0000,0000,,is before you actually start solving\Nthe problem in a scientific way, Dialogue: 0,0:06:13.46,0:06:16.85,Default,,0000,0000,0000,,to put it that way, is to engineer\Nthe datasets to prepare the structures Dialogue: 0,0:06:16.85,0:06:20.80,Default,,0000,0000,0000,,that you actually need for doing\Nmachine learning, statistics Dialogue: 0,0:06:20.80,0:06:22.59,Default,,0000,0000,0000,,and similar things. Dialogue: 0,0:06:23.46,0:06:26.92,Default,,0000,0000,0000,,This is a full design of the system\Ncalled the Wikidata Concepts Monitor Dialogue: 0,0:06:26.92,0:06:28.38,Default,,0000,0000,0000,,that tracks their reuse statistics.
Dialogue: 0,0:06:28.38,0:06:30.84,Default,,0000,0000,0000,,I will not go\Ninto details here, of course. Dialogue: 0,0:06:32.39,0:06:35.76,Default,,0000,0000,0000,,The obvious complication\Nis that-- as I wrote it up-- Dialogue: 0,0:06:35.76,0:06:38.43,Default,,0000,0000,0000,,many systems need to work together. Dialogue: 0,0:06:38.43,0:06:41.25,Default,,0000,0000,0000,,You have to synchronize\Nmany different sources of data, Dialogue: 0,0:06:41.25,0:06:42.85,Default,,0000,0000,0000,,many different infrastructures Dialogue: 0,0:06:42.85,0:06:47.99,Default,,0000,0000,0000,,just in order to make it happen,\Neven before starting thinking Dialogue: 0,0:06:47.99,0:06:52.25,Default,,0000,0000,0000,,in terms of methodologies, science,\Nstatistics, and similar. Dialogue: 0,0:06:53.96,0:06:57.93,Default,,0000,0000,0000,,As I said, we started\Nwith our goals and motivation, Dialogue: 0,0:06:57.93,0:07:01.63,Default,,0000,0000,0000,,then, typically, the data model\Nand the structures that you need-- Dialogue: 0,0:07:01.63,0:07:04.88,Default,,0000,0000,0000,,they correspond to those goals\Nand motivations that should always be-- Dialogue: 0,0:07:04.88,0:07:08.25,Default,,0000,0000,0000,,your first step in developing\Nan analytics project. Dialogue: 0,0:07:08.25,0:07:10.86,Default,,0000,0000,0000,,Then you figure out\Nit's really too complicated, Dialogue: 0,0:07:10.86,0:07:12.85,Default,,0000,0000,0000,,it cannot be done when one person-- Dialogue: 0,0:07:12.85,0:07:15.08,Default,,0000,0000,0000,,It cannot be done on one computer,\Nto put it that way. 
Dialogue: 0,0:07:15.08,0:07:17.77,Default,,0000,0000,0000,,So we needed to work\Nwith the analytics infrastructure Dialogue: 0,0:07:17.77,0:07:20.40,Default,,0000,0000,0000,,and then add an additional layer\Nof complication-- Dialogue: 0,0:07:20.40,0:07:23.75,Default,,0000,0000,0000,,that's communication\Nwith external teams and cooperators Dialogue: 0,0:07:23.75,0:07:28.37,Default,,0000,0000,0000,,because, obviously, such a system\Ncannot be managed easily by one person. Dialogue: 0,0:07:28.37,0:07:31.36,Default,,0000,0000,0000,,Actually, I think\Nit would be pretty impossible. Dialogue: 0,0:07:31.72,0:07:33.59,Default,,0000,0000,0000,,So, as I mentioned,\Nthere is this Data Lake, Dialogue: 0,0:07:33.59,0:07:38.09,Default,,0000,0000,0000,,our big data storage in Hadoop, Dialogue: 0,0:07:38.09,0:07:41.88,Default,,0000,0000,0000,,and the team of awesome data engineers\Nin the Foundation Dialogue: 0,0:07:41.88,0:07:43.99,Default,,0000,0000,0000,,called the Analytics Engineering Team. Dialogue: 0,0:07:43.99,0:07:47.66,Default,,0000,0000,0000,,To data science, data engineers are people\Nwho actually watch your back Dialogue: 0,0:07:47.66,0:07:49.43,Default,,0000,0000,0000,,while you're trying to do your things. Dialogue: 0,0:07:49.43,0:07:51.77,Default,,0000,0000,0000,,If you cannot rely on\Na good engineering team, Dialogue: 0,0:07:51.77,0:07:54.16,Default,,0000,0000,0000,,there's not much you will be able\Nto do by yourself. Dialogue: 0,0:07:55.64,0:08:00.36,Default,,0000,0000,0000,,This infrastructure is actually\Nmaintained by the Foundation, Dialogue: 0,0:08:00.36,0:08:04.13,Default,,0000,0000,0000,,so you enter through\Nseveral statistical servers-- Dialogue: 0,0:08:04.44,0:08:06.15,Default,,0000,0000,0000,,these blue boxes down there. Dialogue: 0,0:08:06.15,0:08:09.27,Default,,0000,0000,0000,,You can communicate\Nwith the relational database systems. Dialogue: 0,0:08:09.27,0:08:10.53,Default,,0000,0000,0000,,We used the MariaDB. 
Dialogue: 0,0:08:10.53,0:08:12.27,Default,,0000,0000,0000,,You can communicate with the Data Lake. Dialogue: 0,0:08:12.27,0:08:17.54,Default,,0000,0000,0000,,And, of course, for your computations,\Nyou go to the so-called Analytics Cluster Dialogue: 0,0:08:17.54,0:08:20.71,Default,,0000,0000,0000,,where you do things\Nlike Apache Spark that actually-- Dialogue: 0,0:08:20.71,0:08:25.14,Default,,0000,0000,0000,,it's the only really efficient way\Nto process the data Dialogue: 0,0:08:25.14,0:08:27.31,Default,,0000,0000,0000,,that we need to process. Dialogue: 0,0:08:27.31,0:08:32.22,Default,,0000,0000,0000,,When I started doing this back in 2017,\NI remember when I saw Dialogue: 0,0:08:32.22,0:08:35.42,Default,,0000,0000,0000,,only the schema of the infrastructure\Nfor the first time. Dialogue: 0,0:08:35.42,0:08:38.50,Default,,0000,0000,0000,,If I could not rely on my colleague\NAdam Shorland-- Dialogue: 0,0:08:38.50,0:08:40.47,Default,,0000,0000,0000,,who is still with us\Nin Wikimedia Deutschland-- Dialogue: 0,0:08:40.47,0:08:44.01,Default,,0000,0000,0000,,I would never make it, I wouldn't even\Nknow how to navigate the structure. Dialogue: 0,0:08:46.07,0:08:49.08,Default,,0000,0000,0000,,As you start building a project\Nto do analytics for Wikidata, Dialogue: 0,0:08:49.08,0:08:52.39,Default,,0000,0000,0000,,you see how increasingly it gets\Nmore and more complicated Dialogue: 0,0:08:52.39,0:08:55.05,Default,,0000,0000,0000,,because you have to deal\Nwith synchronizing different systems, Dialogue: 0,0:08:55.05,0:08:57.91,Default,,0000,0000,0000,,different teams, infrastructures,\Ndifferent datasets. Dialogue: 0,0:08:58.42,0:08:59.97,Default,,0000,0000,0000,,However, it pays off, Dialogue: 0,0:09:00.35,0:09:02.95,Default,,0000,0000,0000,,that synchronization and all the pain.
Dialogue: 0,0:09:03.28,0:09:07.63,Default,,0000,0000,0000,,It can get really nasty sometimes,\Nand the most recent example Dialogue: 0,0:09:07.63,0:09:10.78,Default,,0000,0000,0000,,is the production\Nof the Data Quality Report for Wikidata. Dialogue: 0,0:09:12.13,0:09:16.93,Default,,0000,0000,0000,,That's an initial assessment\Nof the quality of work we had in Wikidata. Dialogue: 0,0:09:16.93,0:09:18.28,Default,,0000,0000,0000,,In order to produce it, Dialogue: 0,0:09:18.28,0:09:22.21,Default,,0000,0000,0000,,we had to rely on the Quality Predictions\Nfrom the ORES system, Dialogue: 0,0:09:22.21,0:09:25.28,Default,,0000,0000,0000,,the machine learning system,\Ndeveloped by Aaron Halfaker, Dialogue: 0,0:09:25.28,0:09:27.50,Default,,0000,0000,0000,,and the scoring platform Dialogue: 0,0:09:28.38,0:09:32.32,Default,,0000,0000,0000,,to combine that with the Wikidata\NConcepts Monitor reuse statistics. Dialogue: 0,0:09:32.81,0:09:36.69,Default,,0000,0000,0000,,The revision history, the full revision\Nhistory of all Wikipedias Dialogue: 0,0:09:36.69,0:09:40.01,Default,,0000,0000,0000,,is available in one single\Nhuge big data table Dialogue: 0,0:09:40.01,0:09:41.36,Default,,0000,0000,0000,,called the MediaWiki History. Dialogue: 0,0:09:41.36,0:09:42.98,Default,,0000,0000,0000,,That lives in the Data Lake. Dialogue: 0,0:09:42.98,0:09:46.67,Default,,0000,0000,0000,,And also we had to process\Nthe JSON Dump in HDFS. Dialogue: 0,0:09:46.67,0:09:48.80,Default,,0000,0000,0000,,So we're talking about four\Nmassive structures, Dialogue: 0,0:09:48.80,0:09:51.95,Default,,0000,0000,0000,,like two machine learning systems\Nwith their complexities, Dialogue: 0,0:09:51.95,0:09:53.89,Default,,0000,0000,0000,,and two huge data sets. Dialogue: 0,0:09:53.89,0:09:58.23,Default,,0000,0000,0000,,Everything needs to work in sync in order\Nto be able to produce the Quality Report Dialogue: 0,0:09:58.23,0:10:00.75,Default,,0000,0000,0000,,that we're presenting\Nthis year at WikidataCon.
Dialogue: 0,0:10:00.75,0:10:04.38,Default,,0000,0000,0000,,But if we didn't do, if we [listed]\Nor something like that, Dialogue: 0,0:10:04.76,0:10:07.76,Default,,0000,0000,0000,,we couldn't say, we couldn't show\Nbeautiful things like this. Dialogue: 0,0:10:07.76,0:10:12.13,Default,,0000,0000,0000,,So on the horizontal axis, you have\Nthe ORES Quality Prediction score. Dialogue: 0,0:10:12.13,0:10:13.49,Default,,0000,0000,0000,,We use five categories. Dialogue: 0,0:10:13.49,0:10:17.44,Default,,0000,0000,0000,,And you can inform yourself-- just google\N"Wikidata data quality categories." Dialogue: 0,0:10:17.44,0:10:18.80,Default,,0000,0000,0000,,You will find the description. Dialogue: 0,0:10:18.80,0:10:22.27,Default,,0000,0000,0000,,The A-class to the left--\Nthe best items that we have, Dialogue: 0,0:10:22.27,0:10:24.97,Default,,0000,0000,0000,,and at the same time--\Nthat's the green box-- Dialogue: 0,0:10:24.97,0:10:27.81,Default,,0000,0000,0000,,they are the most\Nreused items in Wikipedia. Dialogue: 0,0:10:27.81,0:10:30.45,Default,,0000,0000,0000,,So it's not like,\Nas Lydia explained yesterday, Dialogue: 0,0:10:30.45,0:10:32.74,Default,,0000,0000,0000,,it's not like all our items\Nare of the highest quality. Dialogue: 0,0:10:32.74,0:10:38.03,Default,,0000,0000,0000,,To the contrary, we have many items\Nthat are not of that high quality, Dialogue: 0,0:10:38.03,0:10:40.54,Default,,0000,0000,0000,,but at least we know\Nwhat we're doing with them. Dialogue: 0,0:10:40.54,0:10:42.12,Default,,0000,0000,0000,,And you can see the regularity. Dialogue: 0,0:10:42.12,0:10:46.23,Default,,0000,0000,0000,,As the quality of an item\Ndecreases from left to right, Dialogue: 0,0:10:46.23,0:10:49.18,Default,,0000,0000,0000,,the items tend to be less and less reused. Dialogue: 0,0:10:49.72,0:10:53.82,Default,,0000,0000,0000,,So also this synchronization\Nhelped us learn things like this. 
Dialogue: 0,0:10:54.22,0:10:57.85,Default,,0000,0000,0000,,To the right, for example,\Nthese five time series here. Dialogue: 0,0:10:58.27,0:11:05.25,Default,,0000,0000,0000,,Each time series corresponds\Nto one of the quality categories-- Dialogue: 0,0:11:05.25,0:11:06.64,Default,,0000,0000,0000,,A, B, C, D, or E. Dialogue: 0,0:11:06.64,0:11:11.22,Default,,0000,0000,0000,,And the time is on the horizontal axis\Nrunning from left to right. Dialogue: 0,0:11:11.22,0:11:15.88,Default,,0000,0000,0000,,And you can see here how many items\Nfrom each quality-class Dialogue: 0,0:11:15.88,0:11:19.30,Default,,0000,0000,0000,,received their latest revision when. Dialogue: 0,0:11:19.79,0:11:23.96,Default,,0000,0000,0000,,So the top quality class A\Nthat is this [inaudible] line Dialogue: 0,0:11:23.96,0:11:29.69,Default,,0000,0000,0000,,which is found, say,\Nat the rightmost position here, Dialogue: 0,0:11:29.69,0:11:31.11,Default,,0000,0000,0000,,and the shortest line. Dialogue: 0,0:11:31.11,0:11:34.58,Default,,0000,0000,0000,,So those are the best items that we have. Dialogue: 0,0:11:35.34,0:11:38.25,Default,,0000,0000,0000,,And what you can see\Nis actually that there is no item Dialogue: 0,0:11:38.25,0:11:44.58,Default,,0000,0000,0000,,that did not receive at least\None revision after December 2018, Dialogue: 0,0:11:44.58,0:11:48.12,Default,,0000,0000,0000,,meaning one thing-- if you want quality\Nin Wikidata, you have to work on it. Dialogue: 0,0:11:48.12,0:11:50.89,Default,,0000,0000,0000,,So the best items that we have\Nare actually the items Dialogue: 0,0:11:50.89,0:11:52.80,Default,,0000,0000,0000,,that we're really paying attention to. Dialogue: 0,0:11:52.80,0:11:55.74,Default,,0000,0000,0000,,If you look at the classes\Nof lower quality, the other time-series, Dialogue: 0,0:11:55.74,0:11:59.17,Default,,0000,0000,0000,,you will see that we have items\Nthat were revised in 2012 Dialogue: 0,0:11:59.17,0:12:00.68,Default,,0000,0000,0000,,for the last time.
Dialogue: 0,0:12:01.16,0:12:03.35,Default,,0000,0000,0000,,So it tells a story of responsibilities-- Dialogue: 0,0:12:03.35,0:12:07.69,Default,,0000,0000,0000,,how much work we put\Ninto the items [that actually work]. Dialogue: 0,0:12:07.69,0:12:09.42,Default,,0000,0000,0000,,What brings quality. Dialogue: 0,0:12:13.04,0:12:17.20,Default,,0000,0000,0000,,While we do these things,\Nwe also try to make use Dialogue: 0,0:12:17.20,0:12:20.16,Default,,0000,0000,0000,,of the byproducts\Nof these procedures as possible. Dialogue: 0,0:12:20.57,0:12:23.31,Default,,0000,0000,0000,,So, for example, in order\Nto develop the project Dialogue: 0,0:12:23.31,0:12:25.42,Default,,0000,0000,0000,,called Wikidata Languages Landscape-- Dialogue: 0,0:12:25.42,0:12:28.38,Default,,0000,0000,0000,,I think I mentioned it yesterday\Nduring the Birthday Presentation-- Dialogue: 0,0:12:30.54,0:12:34.44,Default,,0000,0000,0000,,I had to perform a quite thorough study Dialogue: 0,0:12:34.44,0:12:37.72,Default,,0000,0000,0000,,of the sub-ontology\Nin Wikidata of languages. Dialogue: 0,0:12:37.72,0:12:41.71,Default,,0000,0000,0000,,And you know what?\NThere are problems in that ontology. Dialogue: 0,0:12:45.50,0:12:48.47,Default,,0000,0000,0000,,I tried not to miss\Nto give you an opportunity. Dialogue: 0,0:12:49.30,0:12:52.25,Default,,0000,0000,0000,,So this is the dashboard actually\Nabout the languages Dialogue: 0,0:12:52.25,0:12:54.79,Default,,0000,0000,0000,,called the Wikidata Languages Landscape. Dialogue: 0,0:12:54.79,0:12:58.59,Default,,0000,0000,0000,,Once again, you have all the URLs\Nin the presentation. Dialogue: 0,0:12:59.69,0:13:03.72,Default,,0000,0000,0000,,So for example, you want to take a look\Nat a particular language. Dialogue: 0,0:13:03.72,0:13:08.69,Default,,0000,0000,0000,,Say, English, okay. 
Dialogue: 0,0:13:09.45,0:13:14.64,Default,,0000,0000,0000,,So the dashboard will generate\Nits local ontological context Dialogue: 0,0:13:14.64,0:13:19.01,Default,,0000,0000,0000,,and mark all the relations\Nof the form instance of, Dialogue: 0,0:13:19.01,0:13:21.28,Default,,0000,0000,0000,,subclass of, and part of. Dialogue: 0,0:13:21.72,0:13:23.84,Default,,0000,0000,0000,,Why did I choose to do this? Dialogue: 0,0:13:23.84,0:13:25.99,Default,,0000,0000,0000,,To help you fix the language ontology. Dialogue: 0,0:13:25.99,0:13:31.59,Default,,0000,0000,0000,,Why? Because you will find many languages,\Nfor example, my native language Dialogue: 0,0:13:31.59,0:13:33.62,Default,,0000,0000,0000,,which used to be Serbo-Croatian, Dialogue: 0,0:13:33.62,0:13:38.55,Default,,0000,0000,0000,,and for silly reasons now we have Serbian\Nand Croatian-- it's a political thing. Dialogue: 0,0:13:38.55,0:13:40.55,Default,,0000,0000,0000,,I don't want to go into it,\Nbut you realize Dialogue: 0,0:13:40.55,0:13:43.26,Default,,0000,0000,0000,,that Serbian is now, for example,\Nat the same time Dialogue: 0,0:13:43.26,0:13:46.64,Default,,0000,0000,0000,,a subclass of Serbo-Croatian\Nand a part of Serbo-Croatian. Dialogue: 0,0:13:46.96,0:13:48.40,Default,,0000,0000,0000,,Which still holds for the Croatian-- Dialogue: 0,0:13:48.40,0:13:50.86,Default,,0000,0000,0000,,Croatian is also a part\Nand a subclass of Serbo-Croatian. Dialogue: 0,0:13:50.86,0:13:52.50,Default,,0000,0000,0000,,So Serbo-Croatian used to be a language. Dialogue: 0,0:13:52.50,0:13:54.96,Default,,0000,0000,0000,,Now we don't have\Nnormative support for it. Dialogue: 0,0:13:54.96,0:13:57.09,Default,,0000,0000,0000,,But still, it's not a language class,\Nit's a language. Dialogue: 0,0:13:57.09,0:14:00.53,Default,,0000,0000,0000,,Can it be a part of it\Nor can it be a subclass of it?
Dialogue: 0,0:14:00.53,0:14:03.30,Default,,0000,0000,0000,,So it's a confusion of mereological\Nand set-theoretic relations, Dialogue: 0,0:14:03.30,0:14:05.80,Default,,0000,0000,0000,,and I think it should be fixed somehow. Dialogue: 0,0:14:06.66,0:14:09.24,Default,,0000,0000,0000,,In other words, don't say Dialogue: 0,0:14:10.13,0:14:14.99,Default,,0000,0000,0000,,that you don't have the tool\Nto fix the ontology. Dialogue: 0,0:14:14.99,0:14:17.86,Default,,0000,0000,0000,,Just find some time and go play with it. Dialogue: 0,0:14:19.26,0:14:22.43,Default,,0000,0000,0000,,Speaking of languages, I mentioned,\Njust to show you this project. Dialogue: 0,0:14:22.99,0:14:27.16,Default,,0000,0000,0000,,Many people liked this thing\Nthat I published online on Twitter. Dialogue: 0,0:14:27.16,0:14:28.57,Default,,0000,0000,0000,,That's one of the things, you know. Dialogue: 0,0:14:28.57,0:14:32.56,Default,,0000,0000,0000,,Data science is usually\Nsold via visualizations. Dialogue: 0,0:14:32.56,0:14:34.20,Default,,0000,0000,0000,,People like to visualize things, Dialogue: 0,0:14:34.20,0:14:36.84,Default,,0000,0000,0000,,and, of course,\Nwe do pay attention to that. Dialogue: 0,0:14:37.76,0:14:40.38,Default,,0000,0000,0000,,Aesthetics is a part of communication. Dialogue: 0,0:14:41.77,0:14:44.05,Default,,0000,0000,0000,,It's not the most important thing\Nfor a scientific finding Dialogue: 0,0:14:44.05,0:14:45.35,Default,,0000,0000,0000,,to show you something beautiful, Dialogue: 0,0:14:45.35,0:14:48.62,Default,,0000,0000,0000,,but if you can show something beautiful,\Nyou shouldn't miss the opportunity. Dialogue: 0,0:14:48.62,0:14:51.88,Default,,0000,0000,0000,,So here we did\Nwith the languages in Wikidata Dialogue: 0,0:14:51.88,0:14:53.99,Default,,0000,0000,0000,,the same thing that we do\Nwith items and projects Dialogue: 0,0:14:53.99,0:14:56.16,Default,,0000,0000,0000,,in the Wikidata Concepts Monitor.
Dialogue: 0,0:14:56.16,0:15:02.90,Default,,0000,0000,0000,,We actually group languages by similarity,\Nand the similarity was defined Dialogue: 0,0:15:02.90,0:15:05.80,Default,,0000,0000,0000,,as how much do they overlap\Nacross the items. Dialogue: 0,0:15:06.45,0:15:10.53,Default,,0000,0000,0000,,So if I can talk about\Nthe same things in English Dialogue: 0,0:15:10.53,0:15:13.97,Default,,0000,0000,0000,,and in some West-African\Nlanguage, for example, Dialogue: 0,0:15:13.97,0:15:15.81,Default,,0000,0000,0000,,then those two things, those two languages Dialogue: 0,0:15:15.81,0:15:19.21,Default,,0000,0000,0000,,are similar in terms\Nof their reference sets. Dialogue: 0,0:15:19.21,0:15:21.30,Default,,0000,0000,0000,,What they can refer to. Dialogue: 0,0:15:22.33,0:15:24.85,Default,,0000,0000,0000,,Each language here Dialogue: 0,0:15:24.85,0:15:27.37,Default,,0000,0000,0000,,points to its closest neighbor,\Nnearest neighbor-- Dialogue: 0,0:15:27.37,0:15:29.84,Default,,0000,0000,0000,,to the one which is most similar to it. Dialogue: 0,0:15:29.84,0:15:35.60,Default,,0000,0000,0000,,And, of course, you can see\Nthese groupings actually occur naturally. Dialogue: 0,0:15:35.60,0:15:37.55,Default,,0000,0000,0000,,So it's not a fully-connected graph. Dialogue: 0,0:15:37.55,0:15:40.84,Default,,0000,0000,0000,,Clustering this thing\Nwas nothing like [there is]. Dialogue: 0,0:15:41.47,0:15:44.42,Default,,0000,0000,0000,,Also, what you can learn\Nfrom the Languages Landscape project Dialogue: 0,0:15:44.42,0:15:49.29,Default,,0000,0000,0000,,is when you combine our data\Nwith external resources. Dialogue: 0,0:15:49.29,0:15:51.37,Default,,0000,0000,0000,,So this is also very informative for us, Dialogue: 0,0:15:51.37,0:15:54.24,Default,,0000,0000,0000,,for the whole, I would say,\NWikimedia community.
Dialogue: 0,0:15:54.56,0:15:56.64,Default,,0000,0000,0000,,We have the UNESCO language status Dialogue: 0,0:15:56.64,0:15:59.76,Default,,0000,0000,0000,,which Wikidata actually gets from UNESCO, Dialogue: 0,0:15:59.76,0:16:01.91,Default,,0000,0000,0000,,its websites and databases, Dialogue: 0,0:16:01.91,0:16:05.20,Default,,0000,0000,0000,,and the Ethnologue language status.\NOn the vertical axes, Dialogue: 0,0:16:05.20,0:16:08.75,Default,,0000,0000,0000,,we have the Concepts Monitor\Nreuse statistic. Dialogue: 0,0:16:08.94,0:16:12.97,Default,,0000,0000,0000,,So we look at all the items that have\Na label in a particular language, Dialogue: 0,0:16:12.97,0:16:15.95,Default,,0000,0000,0000,,and then we look at\Nhow popular those items are, Dialogue: 0,0:16:15.95,0:16:18.01,Default,,0000,0000,0000,,how many times people used them. Dialogue: 0,0:16:19.31,0:16:25.06,Default,,0000,0000,0000,,Of course, those safe national languages,\Nlanguages that are not endangered, Dialogue: 0,0:16:25.89,0:16:28.16,Default,,0000,0000,0000,,they have a slight advantage. Dialogue: 0,0:16:28.16,0:16:30.62,Default,,0000,0000,0000,,But the situation is not really that bad. Dialogue: 0,0:16:30.62,0:16:33.66,Default,,0000,0000,0000,,Say, for example, take a look\Nat the Ethnologue category Dialogue: 0,0:16:33.66,0:16:37.21,Default,,0000,0000,0000,,of "Second language only"--\Nthat's the rightmost one. Dialogue: 0,0:16:37.21,0:16:41.80,Default,,0000,0000,0000,,You will see three languages\Nthere being reused Dialogue: 0,0:16:41.80,0:16:44.44,Default,,0000,0000,0000,,in a way comparable to the most favorable, Dialogue: 0,0:16:44.44,0:16:47.46,Default,,0000,0000,0000,,not endangered category\Nof national languages. Dialogue: 0,0:16:47.76,0:16:49.41,Default,,0000,0000,0000,,It's not like the gender bias. Dialogue: 0,0:16:49.41,0:16:53.78,Default,,0000,0000,0000,,Wikipedia seems to be really reflecting\Nthe gender bias that exists in the world.
Dialogue: 0,0:16:53.78,0:16:58.13,Default,,0000,0000,0000,,Then we have nice initiatives like women \Nwho are trying to fix this thing. Dialogue: 0,0:16:58.13,0:17:02.21,Default,,0000,0000,0000,,With languages, well, of course,\Nsome languages are a little bit favored, Dialogue: 0,0:17:02.21,0:17:04.28,Default,,0000,0000,0000,,but it's not that bad, Dialogue: 0,0:17:04.28,0:17:07.87,Default,,0000,0000,0000,,and that finding really brought\Na lot of joy to ourselves. Dialogue: 0,0:17:08.74,0:17:12.73,Default,,0000,0000,0000,,Now, speaking of external resources,\Nevery time that I look at this graph, Dialogue: 0,0:17:12.73,0:17:16.48,Default,,0000,0000,0000,,I say to myself, "We know\Nwho is the queen of the databases." Dialogue: 0,0:17:18.12,0:17:22.29,Default,,0000,0000,0000,,You know the external identifiers\Nproperty in Wikidata. Dialogue: 0,0:17:23.02,0:17:30.17,Default,,0000,0000,0000,,So here we take all external identifiers\Nthat were present in August, Dialogue: 0,0:17:31.50,0:17:34.82,Default,,0000,0000,0000,,JSON Dump of Wikidata, which we processed. Dialogue: 0,0:17:34.82,0:17:38.08,Default,,0000,0000,0000,,Then, once again,\Ndid some statistics on it Dialogue: 0,0:17:38.08,0:17:45.12,Default,,0000,0000,0000,,and grouped all the external identifiers\Nby how much they overlap across the items. Dialogue: 0,0:17:51.23,0:17:52.94,Default,,0000,0000,0000,,Aha, here we are. Dialogue: 0,0:17:55.02,0:17:58.36,Default,,0000,0000,0000,,That visualization, except for maybe\Nbeing aesthetically pleasing, Dialogue: 0,0:17:58.36,0:17:59.69,Default,,0000,0000,0000,,is not that useful, Dialogue: 0,0:17:59.69,0:18:03.01,Default,,0000,0000,0000,,but you have an interactive version\Ndeveloped in the dashboard. 
Dialogue: 0,0:18:04.23,0:18:07.86,Default,,0000,0000,0000,,If you go and inspect\Nthe interactive version, Dialogue: 0,0:18:07.86,0:18:10.98,Default,,0000,0000,0000,,you can learn, for example,\None obvious fact Dialogue: 0,0:18:10.98,0:18:13.62,Default,,0000,0000,0000,,that they really follow\Nsome natural semantics. Dialogue: 0,0:18:13.62,0:18:15.71,Default,,0000,0000,0000,,They are grouped in intuitive ways. Dialogue: 0,0:18:16.05,0:18:21.74,Default,,0000,0000,0000,,We should be perfectly expecting them\Nto give some feedback on the quality Dialogue: 0,0:18:21.74,0:18:24.45,Default,,0000,0000,0000,,of the organizational data in Wikidata, Dialogue: 0,0:18:24.45,0:18:26.80,Default,,0000,0000,0000,,telling that situation\Nis really not that bad. Dialogue: 0,0:18:27.31,0:18:30.13,Default,,0000,0000,0000,,What I am saying is\Nthat all the external identifiers Dialogue: 0,0:18:30.13,0:18:32.23,Default,,0000,0000,0000,,from the databases\Non sports, for example, Dialogue: 0,0:18:32.23,0:18:34.68,Default,,0000,0000,0000,,you will find to be in one cluster. Dialogue: 0,0:18:34.68,0:18:38.68,Default,,0000,0000,0000,,And then, for example, you will even\Nbe able to figure out what sport. Dialogue: 0,0:18:39.20,0:18:44.28,Default,,0000,0000,0000,,Databases on tennis are here,\Ndatabases on football are here, etc. Dialogue: 0,0:18:48.18,0:18:50.67,Default,,0000,0000,0000,,Yes, these external resources Dialogue: 0,0:18:50.67,0:18:53.68,Default,,0000,0000,0000,,are things that we really try\Nto pay a lot of attention to. Dialogue: 0,0:18:54.65,0:18:59.78,Default,,0000,0000,0000,,All right, as I said, the final thing\Nis communication and aesthetics. Dialogue: 0,0:18:59.78,0:19:01.26,Default,,0000,0000,0000,,We do pay attention to it. Dialogue: 0,0:19:01.26,0:19:04.18,Default,,0000,0000,0000,,So, for example, this thing--\Nmany people liked it. 
Dialogue: 0,0:19:04.18,0:19:07.18,Default,,0000,0000,0000,,It's a little bit rescaled for aesthetics, Dialogue: 0,0:19:07.18,0:19:11.81,Default,,0000,0000,0000,,the same network of external identifiers\Nthat you were able to see. Dialogue: 0,0:19:11.81,0:19:16.32,Default,,0000,0000,0000,,But you don't get\Nto these results for free, of course. Dialogue: 0,0:19:16.71,0:19:20.16,Default,,0000,0000,0000,,For example, this one was obtained\Nby running a clustering algorithm Dialogue: 0,0:19:20.16,0:19:23.95,Default,,0000,0000,0000,,on Jaccard distances--\Ntechnical terms, I'm not going into it. Dialogue: 0,0:19:23.95,0:19:29.09,Default,,0000,0000,0000,,And first, we had to start from a matrix\Nactually derived from 408 languages Dialogue: 0,0:19:29.09,0:19:31.85,Default,,0000,0000,0000,,that are reused across the Wikimedia. Dialogue: 0,0:19:31.85,0:19:35.27,Default,,0000,0000,0000,,Wikidata knows about\Nmany languages, not only 400. Dialogue: 0,0:19:35.27,0:19:39.70,Default,,0000,0000,0000,,But only 400 of them are actually\Nlabels of the items that get reused Dialogue: 0,0:19:39.70,0:19:43.88,Default,,0000,0000,0000,,across 60 million items contingency\Nmatrix-- that's a lot of computations. Dialogue: 0,0:19:44.59,0:19:47.11,Default,,0000,0000,0000,,To add an additional layer of complication Dialogue: 0,0:19:47.11,0:19:51.38,Default,,0000,0000,0000,,and, of course, the most beautiful part \Nof your work as a data scientist, Dialogue: 0,0:19:51.38,0:19:55.22,Default,,0000,0000,0000,,but it doesn't get to occupy Dialogue: 0,0:19:55.22,0:19:58.27,Default,,0000,0000,0000,,more than, say, 10% or 15% of your time, Dialogue: 0,0:19:58.27,0:20:00.93,Default,,0000,0000,0000,,because everything else\Ngoes to data engineering Dialogue: 0,0:20:00.93,0:20:03.08,Default,,0000,0000,0000,,and synchronization of different systems. 
Dialogue: 0,0:20:03.08,0:20:04.94,Default,,0000,0000,0000,,With the machine learning\Nand statistic things, Dialogue: 0,0:20:04.94,0:20:07.25,Default,,0000,0000,0000,,we use plenty of different algorithms. Dialogue: 0,0:20:07.25,0:20:12.84,Default,,0000,0000,0000,,I don't think now is the time to go\Nand talk about details of these things. Dialogue: 0,0:20:12.84,0:20:14.92,Default,,0000,0000,0000,,I have plenty of opportunities\Nto discuss them, Dialogue: 0,0:20:14.92,0:20:18.47,Default,,0000,0000,0000,,but it's typically\Na highly technical topic, Dialogue: 0,0:20:18.47,0:20:21.37,Default,,0000,0000,0000,,better suited for a scientific conference. Dialogue: 0,0:20:23.00,0:20:26.51,Default,,0000,0000,0000,,Here are all the layers of complexity. Dialogue: 0,0:20:26.51,0:20:30.21,Default,,0000,0000,0000,,In the end, we have to add\Ndeployment and dashboards, Dialogue: 0,0:20:30.21,0:20:33.44,Default,,0000,0000,0000,,because they won't build\Nthemselves to this thing. Dialogue: 0,0:20:33.83,0:20:36.85,Default,,0000,0000,0000,,And all these things, all these phases Dialogue: 0,0:20:36.85,0:20:40.58,Default,,0000,0000,0000,,of development of analytics\Nof data science project Dialogue: 0,0:20:41.19,0:20:46.56,Default,,0000,0000,0000,,need to fit together in order\Nto be able to derive empirical results Dialogue: 0,0:20:46.56,0:20:49.39,Default,,0000,0000,0000,,on the system of Wikidata's complexity. Dialogue: 0,0:20:49.85,0:20:53.72,Default,,0000,0000,0000,,The true picture is that you cannot\Nreally just run through these cycles. Dialogue: 0,0:20:54.42,0:20:56.88,Default,,0000,0000,0000,,All the phases of the process\Nare interdependent Dialogue: 0,0:20:56.88,0:21:00.01,Default,,0000,0000,0000,,because you really\Nhave to plan very early on Dialogue: 0,0:21:00.01,0:21:04.12,Default,,0000,0000,0000,,what visualizations you are going to use,\Nwhat technology you will use Dialogue: 0,0:21:04.12,0:21:06.65,Default,,0000,0000,0000,,to render those visualizations in the end. 
Dialogue: 0,0:21:06.65,0:21:08.89,Default,,0000,0000,0000,,What machine learning algorithms\Nyou will be using, Dialogue: 0,0:21:08.89,0:21:13.53,Default,,0000,0000,0000,,because all of them have their own taste\Nabout what data structures they like. Dialogue: 0,0:21:13.53,0:21:16.70,Default,,0000,0000,0000,,And then you hit the constraints\Nof infrastructure-- similar things. Dialogue: 0,0:21:16.70,0:21:18.83,Default,,0000,0000,0000,,I am not complaining,\NI'm really enjoying this. Dialogue: 0,0:21:18.83,0:21:22.40,Default,,0000,0000,0000,,This is the most beautiful playground\NI've ever seen in my life. Dialogue: 0,0:21:22.40,0:21:25.38,Default,,0000,0000,0000,,Thanks to you and people\Nwho built Wikidata. Dialogue: 0,0:21:25.38,0:21:26.39,Default,,0000,0000,0000,,Thank you very much! Dialogue: 0,0:21:26.39,0:21:27.73,Default,,0000,0000,0000,,That would be it. Dialogue: 0,0:21:28.12,0:21:29.99,Default,,0000,0000,0000,,(moderator) Thank you, Goran. Dialogue: 0,0:21:29.99,0:21:32.29,Default,,0000,0000,0000,,(applause) Dialogue: 0,0:21:32.82,0:21:35.26,Default,,0000,0000,0000,,(moderator) You have time\Nfor a couple of questions. Dialogue: 0,0:21:44.32,0:21:47.66,Default,,0000,0000,0000,,(man) Well, you did a lot of research,\NI can see that. Dialogue: 0,0:21:47.66,0:21:48.68,Default,,0000,0000,0000,,(Goran) Sorry? Dialogue: 0,0:21:48.68,0:21:51.64,Default,,0000,0000,0000,,(man) You did a lot of research,\NI can see that. Dialogue: 0,0:21:51.64,0:21:57.24,Default,,0000,0000,0000,,I'm wondering if there anything\Nthat you discovered during the research Dialogue: 0,0:21:57.24,0:21:58.85,Default,,0000,0000,0000,,that surprised you. Dialogue: 0,0:21:59.33,0:22:01.36,Default,,0000,0000,0000,,Thank you for that question. 
Dialogue: 0,0:22:01.36,0:22:07.66,Default,,0000,0000,0000,,Actually, I wanted to focus\Non that in this talk Dialogue: 0,0:22:07.66,0:22:11.24,Default,,0000,0000,0000,,until I realized that we simply\Nwon't have enough time Dialogue: 0,0:22:11.24,0:22:13.82,Default,,0000,0000,0000,,to explain everything. Dialogue: 0,0:22:15.41,0:22:19.25,Default,,0000,0000,0000,,Most of the time\Nwhen you're analyzing big datasets Dialogue: 0,0:22:19.25,0:22:22.18,Default,,0000,0000,0000,,structured in a way like Wikidata, Dialogue: 0,0:22:22.18,0:22:26.34,Default,,0000,0000,0000,,even when you're going to the wild,\Nmeaning studying the reuse of data Dialogue: 0,0:22:26.34,0:22:27.44,Default,,0000,0000,0000,,across Wikipedia, Dialogue: 0,0:22:27.44,0:22:30.62,Default,,0000,0000,0000,,where actually people can do\Nwhatever they like with those items, Dialogue: 0,0:22:31.66,0:22:33.92,Default,,0000,0000,0000,,you have a lot of data,\Na lot of information. Dialogue: 0,0:22:33.92,0:22:35.60,Default,,0000,0000,0000,,Of course, you see structure. Dialogue: 0,0:22:35.60,0:22:40.21,Default,,0000,0000,0000,,Most of the time, 90% of the time,\Nyou see things that are expected. Dialogue: 0,0:22:41.20,0:22:46.68,Default,,0000,0000,0000,,Things like what projects\Nmake the most use of Wikidata. Dialogue: 0,0:22:46.68,0:22:49.89,Default,,0000,0000,0000,,And you can almost--\Nyou don't have to do too much statistics, Dialogue: 0,0:22:50.72,0:22:54.90,Default,,0000,0000,0000,,you can rely on the expectations\Nof all the world and see what's happening. Dialogue: 0,0:22:56.69,0:22:58.64,Default,,0000,0000,0000,,Many things were surprising, Dialogue: 0,0:22:58.64,0:23:03.31,Default,,0000,0000,0000,,and those things that were surprising\Nare really the most informative things. 
Dialogue: 0,0:23:05.37,0:23:09.07,Default,,0000,0000,0000,,When one communicates the findings\Nfrom analytics and such systems, Dialogue: 0,0:23:09.49,0:23:14.20,Default,,0000,0000,0000,,it's important, people typically expect\Neither "wow" visualizations Dialogue: 0,0:23:14.20,0:23:18.32,Default,,0000,0000,0000,,and we have tons of data so we can always\Ndeliver "wow" visualizations, Dialogue: 0,0:23:18.91,0:23:21.56,Default,,0000,0000,0000,,or they expect to learn things like, Dialogue: 0,0:23:21.56,0:23:24.20,Default,,0000,0000,0000,,"Our project is doing better\Nthan this project" Dialogue: 0,0:23:24.20,0:23:26.24,Default,,0000,0000,0000,,or "Yes, we are rocking!" etc., Dialogue: 0,0:23:26.24,0:23:30.15,Default,,0000,0000,0000,,while the goal of the whole game\Nshould actually be to learn Dialogue: 0,0:23:30.15,0:23:34.13,Default,,0000,0000,0000,,what is wrong, what is not working,\Nwhat could be done better. Dialogue: 0,0:23:34.94,0:23:36.45,Default,,0000,0000,0000,,Many things were surprising. Dialogue: 0,0:23:38.34,0:23:42.06,Default,,0000,0000,0000,,For example, the distribution\Nof item usage across languages-- Dialogue: 0,0:23:42.06,0:23:43.85,Default,,0000,0000,0000,,that was surprising to me. Dialogue: 0,0:23:43.85,0:23:45.01,Default,,0000,0000,0000,,This thing. Dialogue: 0,0:23:47.10,0:23:51.35,Default,,0000,0000,0000,,So I did not really expect\Nthat the situation with languages Dialogue: 0,0:23:51.35,0:23:54.35,Default,,0000,0000,0000,,would be this good, I would say. Dialogue: 0,0:23:54.83,0:24:01.33,Default,,0000,0000,0000,,My expectation would be that languages\Nthat have less economic support, Dialogue: 0,0:24:01.33,0:24:03.65,Default,,0000,0000,0000,,normative support,\Neven political support-- Dialogue: 0,0:24:03.65,0:24:06.60,Default,,0000,0000,0000,,that's a fact when you talk\Nabout languages-- Dialogue: 0,0:24:06.60,0:24:11.52,Default,,0000,0000,0000,,will not be so widely reused\Nacross the Wikimedia universe. 
Dialogue: 0,0:24:11.52,0:24:15.54,Default,,0000,0000,0000,,In fact, it turns out\Nthat the differences-- we can see them, Dialogue: 0,0:24:15.54,0:24:18.98,Default,,0000,0000,0000,,but it's far away from gender bias\Nwhich is really bad, I think, Dialogue: 0,0:24:18.98,0:24:20.71,Default,,0000,0000,0000,,we need to work there. Dialogue: 0,0:24:20.71,0:24:22.46,Default,,0000,0000,0000,,That was surprising, for example. Dialogue: 0,0:24:22.46,0:24:25.72,Default,,0000,0000,0000,,It was a positive surprise,\Nto put it that way. Dialogue: 0,0:24:25.72,0:24:28.27,Default,,0000,0000,0000,,Then from time to time,\Nwe discover projects Dialogue: 0,0:24:28.82,0:24:34.78,Default,,0000,0000,0000,,that actually do a great job by reusing\Nthe Wikidata content and Wikimedia. Dialogue: 0,0:24:34.78,0:24:37.90,Default,,0000,0000,0000,,We're totally surprised to learn that\Nsuch a project can do it. Dialogue: 0,0:24:38.61,0:24:42.55,Default,,0000,0000,0000,,Then you start thinking, you figure out\Nthere is a community of people Dialogue: 0,0:24:42.55,0:24:44.00,Default,,0000,0000,0000,,actually doing it. Dialogue: 0,0:24:44.47,0:24:48.74,Default,,0000,0000,0000,,And it's a strange feeling because I get\Nto see all these things through machines, Dialogue: 0,0:24:48.74,0:24:51.97,Default,,0000,0000,0000,,through databases,\Nthrough visualizations and tables, Dialogue: 0,0:24:51.97,0:24:58.16,Default,,0000,0000,0000,,and it's always that strange feeling\Nwhen I realize this result was produced Dialogue: 0,0:24:58.16,0:25:03.09,Default,,0000,0000,0000,,by a group of people, they don't even know\NI'm looking at their result now. Dialogue: 0,0:25:06.10,0:25:07.83,Default,,0000,0000,0000,,(moderator) Another question? Dialogue: 0,0:25:13.66,0:25:14.70,Default,,0000,0000,0000,,Thank you. Dialogue: 0,0:25:14.70,0:25:16.24,Default,,0000,0000,0000,,Is that it? Thank you very much! Dialogue: 0,0:25:16.24,0:25:17.73,Default,,0000,0000,0000,,(moderator) Thank you. 
Dialogue: 0,0:25:17.73,0:25:19.89,Default,,0000,0000,0000,,(applause)