WEBVTT 00:00:02.651 --> 00:00:03.900 Asaf Bartov: Testing, testing. 00:00:10.036 --> 00:00:12.640 Is this heard in the room? 00:00:15.190 --> 00:00:15.690 Testing. 00:00:22.620 --> 00:00:24.930 Hello, everyone. 00:00:24.930 --> 00:00:29.460 This is a gentle introduction to Wikidata 00:00:29.460 --> 00:00:31.922 for absolute beginners. 00:00:31.922 --> 00:00:34.130 If you're an absolute beginner, if you've never heard 00:00:34.130 --> 00:00:38.210 of Wikidata, or if you've heard of Wikidata but don't quite get 00:00:38.210 --> 00:00:41.360 it, don't know what it's good for, have only used it 00:00:41.360 --> 00:00:43.880 for inter-wiki links-- 00:00:43.880 --> 00:00:46.247 if you're anywhere on this range, 00:00:46.247 --> 00:00:47.330 you're in the right place. 00:00:50.990 --> 00:00:52.040 My name is Asaf Bartov. 00:00:52.040 --> 00:00:54.590 I work for the Wikimedia Foundation, 00:00:54.590 --> 00:00:59.790 and I am a Wikidata enthusiast. 00:00:59.790 --> 00:01:05.620 So the first thing I want to say is that you are lucky. 00:01:05.620 --> 00:01:10.540 You are lucky because Wikidata is already 00:01:10.540 --> 00:01:15.415 and is quickly becoming even more of an important research 00:01:15.415 --> 00:01:21.730 tool for anyone who's trying to ask questions 00:01:21.730 --> 00:01:25.030 about large amounts of information. 00:01:25.030 --> 00:01:29.770 It will become more and more used across the humanities, 00:01:29.770 --> 00:01:33.460 in particular, because of the things that it's able to do, 00:01:33.460 --> 00:01:37.090 some of which we will demonstrate shortly. 00:01:37.090 --> 00:01:40.750 And you are lucky because you get to find out about it now 00:01:40.750 --> 00:01:43.400 before most of the world. 00:01:43.400 --> 00:01:49.120 So by the end of this talk, you will be a Wikidata hipster 00:01:49.120 --> 00:01:51.250 because you'll be able to say, oh yeah. 00:01:51.250 --> 00:01:53.470 I knew about Wikidata before it was cool. 
00:01:56.090 --> 00:02:00.370 So before we actually visit Wikidata, 00:02:00.370 --> 00:02:08.620 I want to share two key problems that Wikidata seeks to solve 00:02:08.620 --> 00:02:12.940 and which would help us understand why it exists. 00:02:12.940 --> 00:02:17.640 The first problem is that of dated data, that 00:02:17.640 --> 00:02:20.880 is data that is out of date. 00:02:20.880 --> 00:02:23.960 And this is apparent on Wikipedia 00:02:23.960 --> 00:02:27.870 across our free knowledge encyclopedias. 00:02:27.870 --> 00:02:32.160 Data on Wikipedia is not always up to date. 00:02:32.160 --> 00:02:37.470 And the more obscure it is, the more likely 00:02:37.470 --> 00:02:40.280 it is not to be up to date. 00:02:40.280 --> 00:02:49.360 So the Polish Wikipedia may have an article about a small town 00:02:49.360 --> 00:02:55.480 in Argentina, and that article will include information 00:02:55.480 --> 00:03:00.910 about that town like population size, name of the mayor. 00:03:00.910 --> 00:03:04.580 And that information, ideally, was 00:03:04.580 --> 00:03:08.540 correct at the time the article was created on the Polish 00:03:08.540 --> 00:03:10.370 Wikipedia-- 00:03:10.370 --> 00:03:13.760 maybe translated from another wiki. 00:03:13.760 --> 00:03:17.900 But then how likely is it to be kept up to date? 00:03:17.900 --> 00:03:20.960 How likely is it that the Polish Wikipedia would give us 00:03:20.960 --> 00:03:25.880 the correct and latest numbers or data about the population 00:03:25.880 --> 00:03:28.370 size of that town or the mayor, right? 00:03:28.370 --> 00:03:31.720 So this is the kind of data that does go out of date, right? 00:03:31.720 --> 00:03:34.250 Every few years-- five, 10 years-- 00:03:34.250 --> 00:03:37.850 there is a census, and now there are new population figures. 
00:03:37.850 --> 00:03:42.440 Now the census in Argentina will be made available in Argentina 00:03:42.440 --> 00:03:45.500 in Spanish, probably, which brings us 00:03:45.500 --> 00:03:48.710 to another component of the problem of dated data, which 00:03:48.710 --> 00:03:53.810 is there are no obvious triggers for updating the data. 00:03:53.810 --> 00:03:58.520 So the Polish Wikipedian is not sent an email 00:03:58.520 --> 00:04:00.680 by the Argentinean government saying, hey, 00:04:00.680 --> 00:04:01.820 we have a new census. 00:04:01.820 --> 00:04:05.420 There are new population numbers for you to update on Wikipedia. 00:04:05.420 --> 00:04:07.550 No such email is sent. 00:04:07.550 --> 00:04:10.146 So it's kind of hard to notice when. 00:04:10.146 --> 00:04:12.770 And of course, multiply that by all the different jurisdictions 00:04:12.770 --> 00:04:14.670 around the world. 00:04:14.670 --> 00:04:16.610 There's no easy way to notice when 00:04:16.610 --> 00:04:17.790 your data goes out of date. 00:04:20.620 --> 00:04:24.070 So that's difficult to keep up to date. 00:04:24.070 --> 00:04:27.940 And even if we were to receive some kind of indication-- 00:04:27.940 --> 00:04:31.310 oh, there's a new census in Argentina, 00:04:31.310 --> 00:04:33.100 so a whole bunch of population figures 00:04:33.100 --> 00:04:34.960 have now gone out of date. 00:04:34.960 --> 00:04:37.240 Updating it on the Polish Wikipedia 00:04:37.240 --> 00:04:40.090 and the French Wikipedia and the Indonesian Wikipedia 00:04:40.090 --> 00:04:44.920 and the Arabic Wikipedia is a whole bunch of repetitive work 00:04:44.920 --> 00:04:46.540 that a lot of different volunteers 00:04:46.540 --> 00:04:49.900 will need to do just for that one updated piece 00:04:49.900 --> 00:04:54.810 of information about Argentina. 
00:04:54.810 --> 00:04:57.720 So I hope this is clear and resonates 00:04:57.720 --> 00:05:01.920 with some of your experience editing Wikipedia-- 00:05:01.920 --> 00:05:04.170 data that is out of date or that needs 00:05:04.170 --> 00:05:08.640 to be updated manually, menially, 00:05:08.640 --> 00:05:16.190 on a fairly frequent schedule across the different countries 00:05:16.190 --> 00:05:18.410 and data sources. 00:05:18.410 --> 00:05:22.340 The other-- and I think maybe more interesting-- 00:05:22.340 --> 00:05:26.210 shortcoming or problem that I want to discuss 00:05:26.210 --> 00:05:30.260 is what I call the inflexible ways 00:05:30.260 --> 00:05:36.020 of lateral queries, crosscutting queries of knowledge. 00:05:36.020 --> 00:05:43.980 So if I want an answer to the question, what countries 00:05:43.980 --> 00:05:48.740 in the world export rubber-- 00:05:52.300 --> 00:05:54.790 that's a reasonable question, right? 00:05:54.790 --> 00:05:57.460 That information is on Wikipedia. 00:05:57.460 --> 00:05:58.630 Do you agree? 00:05:58.630 --> 00:06:00.640 If you go to Wikipedia and read up 00:06:00.640 --> 00:06:05.560 about Brazil, about Peru, about Germany, somewhere in there-- 00:06:05.560 --> 00:06:09.010 maybe a sub-article called Economics of Brazil-- 00:06:09.010 --> 00:06:13.600 you will find the main exports of that country. 00:06:13.600 --> 00:06:15.400 And you can find out whether or not 00:06:15.400 --> 00:06:16.930 that country exports rubber. 00:06:16.930 --> 00:06:19.994 But what if I don't want to go country by country 00:06:19.994 --> 00:06:21.160 looking for the word rubber? 00:06:21.160 --> 00:06:22.090 I just want an answer. 00:06:22.090 --> 00:06:25.540 What are the countries that export rubber? 00:06:25.540 --> 00:06:28.360 Even though that information is in Wikipedia, 00:06:28.360 --> 00:06:29.680 it's hard to get at. 00:06:29.680 --> 00:06:31.680 It's hard to query. 
00:06:31.680 --> 00:06:35.770 Now, you may say, well, that's what we have categories for, 00:06:35.770 --> 00:06:36.270 right? 00:06:36.270 --> 00:06:39.820 Categories are a way to cut across Wikipedia. 00:06:39.820 --> 00:06:45.110 So if someone made a category called rubber 00:06:45.110 --> 00:06:48.380 exporting countries, then you can go to that category 00:06:48.380 --> 00:06:51.560 and see a list of countries that export rubber. 00:06:51.560 --> 00:06:53.390 And if nobody has made it yet, well, you 00:06:53.390 --> 00:06:56.990 can create that category and, with a kind of one-time effort, 00:06:56.990 --> 00:06:59.730 populate that category, and you're done. 00:06:59.730 --> 00:07:01.970 Well, yes. 00:07:01.970 --> 00:07:04.250 That's still not very convenient. 00:07:04.250 --> 00:07:06.980 But also, it's still very, very limited, 00:07:06.980 --> 00:07:12.380 because what if I only want countries that export rubber 00:07:12.380 --> 00:07:15.950 and have a democratic system of government, 00:07:15.950 --> 00:07:18.770 or any other kind of additional condition 00:07:18.770 --> 00:07:20.510 that I would like to add to this? 00:07:20.510 --> 00:07:22.230 Or take a completely different example. 00:07:22.230 --> 00:07:26.750 What if I want to know which Flemish town had 00:07:26.750 --> 00:07:31.510 the most painters born in it? 00:07:31.510 --> 00:07:34.480 There's a ton of Flemish painters. 00:07:34.480 --> 00:07:37.870 Most of them were born somewhere. 00:07:37.870 --> 00:07:39.685 We could theoretically, just you know, 00:07:39.685 --> 00:07:43.900 look up all the birthplaces of all the Flemish painters 00:07:43.900 --> 00:07:46.900 and tally up the numbers and figure out 00:07:46.900 --> 00:07:51.610 what is the place where the most Flemish painters come from? 00:07:51.610 --> 00:07:53.050 I don't know the answer to that. 00:07:53.050 --> 00:07:55.420 It would be nice to be able to get that answer. 
00:07:55.420 --> 00:07:57.610 Again, the data is in Wikipedia. 00:07:57.610 --> 00:08:00.400 Those birthplaces are listed in the articles 00:08:00.400 --> 00:08:01.636 about those painters. 00:08:01.636 --> 00:08:05.710 But there's no easy way to get that information. 00:08:05.710 --> 00:08:13.420 What if I want to ask, who are some painters whose father was 00:08:13.420 --> 00:08:14.245 also a painter? 00:08:16.840 --> 00:08:18.500 That's a thing that exists, right? 00:08:18.500 --> 00:08:22.630 Some painters are sons of painters. 00:08:22.630 --> 00:08:26.560 You know, Bruegel comes to mind as an obvious example. 00:08:26.560 --> 00:08:28.240 But there's a bunch of others, right? 00:08:28.240 --> 00:08:29.380 So who are those people? 00:08:29.380 --> 00:08:30.930 What if I want to ask that question? 00:08:30.930 --> 00:08:33.400 That's the kind of question that not only Wikipedia 00:08:33.400 --> 00:08:34.600 doesn't answer today. 00:08:34.600 --> 00:08:41.500 If you walk to your friendly university library reference 00:08:41.500 --> 00:08:45.010 desk and say, hello, I would like 00:08:45.010 --> 00:08:49.290 a list of painters whose father was also a painter, 00:08:49.290 --> 00:08:52.820 how would that librarian help you? 00:08:52.820 --> 00:08:57.960 There's no easy way to get an answer to a question like that. 00:08:57.960 --> 00:09:01.100 What if you only want a list of painters 00:09:01.100 --> 00:09:05.870 who were immigrants, painters who lived somewhere else 00:09:05.870 --> 00:09:08.240 than where they were born? 00:09:08.240 --> 00:09:09.770 There's no book. 00:09:09.770 --> 00:09:11.720 I guess maybe there is, but you know, 00:09:11.720 --> 00:09:15.590 it's not obvious that there's a ready resource that says, list 00:09:15.590 --> 00:09:17.840 of painters who are immigrants. 
00:09:17.840 --> 00:09:19.910 And the librarian would probably refer you 00:09:19.910 --> 00:09:22.760 to a book on the shelf called, I don't know, 00:09:22.760 --> 00:09:24.200 The Complete Dictionary of Flemish 00:09:24.200 --> 00:09:26.300 Painters and go, look up the index, 00:09:26.300 --> 00:09:28.520 you know, and if you see a similar surname, 00:09:28.520 --> 00:09:29.910 maybe they're father and son. 00:09:29.910 --> 00:09:35.000 And kind of cobble together the answer on your own. 00:09:35.000 --> 00:09:37.100 The reason I'm comparing this to a library 00:09:37.100 --> 00:09:42.170 is to show you that this is a kind of question that is not 00:09:42.170 --> 00:09:46.760 readily satisfiable today. 00:09:46.760 --> 00:09:50.240 Now, these questions may sound contrived to you. 00:09:50.240 --> 00:09:52.460 You may say to yourself, well, you 00:09:52.460 --> 00:09:54.860 know, painters who are also sons of painters, yeah. 00:09:54.860 --> 00:09:57.680 You know, that never occurred to me 00:09:57.680 --> 00:09:59.610 as a question I might care about. 00:09:59.610 --> 00:10:01.850 But I want to invite you to consider 00:10:01.850 --> 00:10:06.380 that this kind of question, questions like that question, 00:10:06.380 --> 00:10:09.260 may well be questions you do care about. 00:10:09.260 --> 00:10:12.740 And I also want to suggest that the fact it is so nearly 00:10:12.740 --> 00:10:16.250 impossible, the fact that there's no obvious way 00:10:16.250 --> 00:10:19.250 to ask that kind of question today, 00:10:19.250 --> 00:10:21.200 is partly responsible to your not 00:10:21.200 --> 00:10:22.970 coming up with those questions, right? 00:10:22.970 --> 00:10:25.850 We tend to be limited by the possible. 
00:10:25.850 --> 00:10:30.080 You know, until human flight was made possible, 00:10:30.080 --> 00:10:32.840 it did not occur to anyone to say, oh yeah, by this time 00:10:32.840 --> 00:10:34.430 next week I will be in Australia, 00:10:34.430 --> 00:10:36.630 because that was just impossible. 00:10:36.630 --> 00:10:38.587 But when flight is possible, there's 00:10:38.587 --> 00:10:40.670 all kinds of things that suddenly become possible, 00:10:40.670 --> 00:10:42.740 and there's all kinds of needs that 00:10:42.740 --> 00:10:46.430 arise based on the availability of resources 00:10:46.430 --> 00:10:48.600 to fulfill those needs. 00:10:48.600 --> 00:10:54.120 So many of these research questions, compound lateral 00:10:54.120 --> 00:10:58.520 cross-cutting queries, are not being asked because people have 00:10:58.520 --> 00:11:00.410 internalized the fact that there is no way 00:11:00.410 --> 00:11:05.750 to get an answer to questions like, 00:11:05.750 --> 00:11:13.270 what is the most popular first name among British politicians? 00:11:13.270 --> 00:11:14.520 I just made that up, you know? 00:11:14.520 --> 00:11:15.340 Is it John? 00:11:15.340 --> 00:11:16.510 Maybe. 00:11:16.510 --> 00:11:19.030 Maybe it's William, for whatever reason. 00:11:19.030 --> 00:11:22.030 You know, these are the kinds of questions we don't routinely 00:11:22.030 --> 00:11:25.855 ask because we know that it's like, who are you going to ask? 00:11:25.855 --> 00:11:28.330 How are you going to get an answer to that? 00:11:28.330 --> 00:11:36.040 So this problem of not having very flexible ways of querying 00:11:36.040 --> 00:11:38.220 the data that we already have-- 00:11:38.220 --> 00:11:41.230 in Wikipedia, in Wikisource, elsewhere-- 00:11:41.230 --> 00:11:45.060 is a significant limitation. 00:11:45.060 --> 00:11:50.880 So these two key problems have one solution. 
00:11:50.880 --> 00:11:55.500 And that is an editable, central storage 00:11:55.500 --> 00:12:00.510 for structured and linked data on a wiki, 00:12:00.510 --> 00:12:05.160 under a free license, which is a very long way of saying 00:12:05.160 --> 00:12:07.290 Wikidata. 00:12:07.290 --> 00:12:08.470 That is Wikidata. 00:12:08.470 --> 00:12:11.190 Wikidata is an editable, central storage 00:12:11.190 --> 00:12:15.840 for structured and linked data on a wiki, 00:12:15.840 --> 00:12:17.700 under a free license. 00:12:17.700 --> 00:12:22.590 So let's take this apart and unpack it. 00:12:22.590 --> 00:12:24.820 First of all, it's a central storage. 00:12:24.820 --> 00:12:27.660 This relates to the first problem, right? 00:12:27.660 --> 00:12:34.370 If we had one place containing data like population size, 00:12:34.370 --> 00:12:38.270 we would be able to update that one place and then have 00:12:38.270 --> 00:12:42.260 all of the different Wikipedias draw the data from that one 00:12:42.260 --> 00:12:45.320 place so that we wouldn't have to manually, 00:12:45.320 --> 00:12:49.980 repetitively update it across our hundreds of projects. 00:12:49.980 --> 00:12:53.690 So having central storage makes, I hope, kind 00:12:53.690 --> 00:12:57.230 of immediate, intuitive sense. 00:12:57.230 --> 00:13:02.840 But what do I mean by structured and linked data? 00:13:02.840 --> 00:13:10.120 So structured data means that each datum, each piece-- 00:13:10.120 --> 00:13:15.880 individual piece-- of data is managed on its own, 00:13:15.880 --> 00:13:19.660 is identified and defined on its own, 00:13:19.660 --> 00:13:21.040 as distinct from Wikipedia. 00:13:21.040 --> 00:13:22.990 Wikipedia has articles. 00:13:22.990 --> 00:13:27.190 The article about Brazil includes a ton of data, 00:13:27.190 --> 00:13:31.570 all kinds of information, and it's presented as text, 00:13:31.570 --> 00:13:34.270 as several paragraphs-- several pages-- 00:13:34.270 --> 00:13:36.540 of text, right? 
00:13:36.540 --> 00:13:41.460 Now, we do have an approximation of structured data 00:13:41.460 --> 00:13:43.580 on Wikipedia. 00:13:43.580 --> 00:13:45.300 If you've browsed Wikipedia a little, 00:13:45.300 --> 00:13:49.100 you've noticed that we often have an info box, what we 00:13:49.100 --> 00:13:50.750 call an info box on Wikipedia. 00:13:50.750 --> 00:13:55.220 That's the table on the right side if it's a left to right 00:13:55.220 --> 00:13:57.200 language, the table on the right side 00:13:57.200 --> 00:14:02.270 that has information that is easy to tabulate, right? 00:14:02.270 --> 00:14:08.210 So you know, birth date, birth place, death date, death place, 00:14:08.210 --> 00:14:09.710 nationality-- 00:14:09.710 --> 00:14:16.670 or if it's about a country, area, population, anthem, 00:14:16.670 --> 00:14:20.090 type of government, whatever you are likely to find. 00:14:20.090 --> 00:14:23.150 If it's a movie, then you know, starring, 00:14:23.150 --> 00:14:27.350 genre, box office receipts, whatever pieces of data 00:14:27.350 --> 00:14:29.900 are relevant to an article about a movie. 00:14:29.900 --> 00:14:34.940 So we do already kind of group pieces of information 00:14:34.940 --> 00:14:40.160 on Wikipedia into this kind of structured format. 00:14:40.160 --> 00:14:43.630 Those of you who have ever looked at the source, 00:14:43.630 --> 00:14:45.970 at what the wiki code under that looks like, 00:14:45.970 --> 00:14:49.640 know that it's only semi-structured. 00:14:49.640 --> 00:14:52.370 It looks neat and organized in a table, 00:14:52.370 --> 00:14:55.660 but really, it's just a bunch of text that is put there. 00:14:55.660 --> 00:14:57.140 It is not centralized. 00:14:57.140 --> 00:15:00.100 Every Wikipedia has its own copy of that data. 
00:15:00.100 --> 00:15:02.930 And if I go and update the population size 00:15:02.930 --> 00:15:07.070 on Spanish Wikipedia of that Argentinean town, 00:15:07.070 --> 00:15:10.190 it does not get updated automagically 00:15:10.190 --> 00:15:13.520 on the English Wikipedia or the Arabic Wikipedia, right? 00:15:13.520 --> 00:15:17.150 So the structured data that we already have on Wikipedia 00:15:17.150 --> 00:15:20.939 is not managed centrally. 00:15:20.939 --> 00:15:22.480 The other thing about structured data 00:15:22.480 --> 00:15:29.250 is, when you have a notion of an individual piece of data, that 00:15:29.250 --> 00:15:33.390 is the cornerstone of allowing the kinds of queries 00:15:33.390 --> 00:15:34.770 that I was talking about. 00:15:34.770 --> 00:15:40.440 That is what will allow me to ask questions like, 00:15:40.440 --> 00:15:43.470 what is the Flemish town where the most painters were born, 00:15:43.470 --> 00:15:46.650 or what are the world's largest cities that 00:15:46.650 --> 00:15:49.730 have a female mayor? 00:15:49.730 --> 00:15:52.430 I could come up with other examples all day long, right? 00:15:52.430 --> 00:15:55.280 These are all questions that you can ask, 00:15:55.280 --> 00:15:59.390 once you break down your data into individual pieces, each 00:15:59.390 --> 00:16:02.300 of which is-- 00:16:02.300 --> 00:16:06.950 you're able to refer to each of those programmatically. 00:16:06.950 --> 00:16:10.430 The computer can identify, isolate, 00:16:10.430 --> 00:16:14.700 and calculate based on each of those pieces of data. 00:16:14.700 --> 00:16:17.060 So that's why the structure is important. 00:16:17.060 --> 00:16:22.520 Now, Wikidata is also a linked data repository. 00:16:22.520 --> 00:16:24.890 What does it mean that the data is linked? 00:16:24.890 --> 00:16:29.700 Well, it means that a single piece of data can point at, 00:16:29.700 --> 00:16:34.770 can link to another whole bag of data. 
00:16:34.770 --> 00:16:43.360 So if we are describing, for example, a person, 00:16:43.360 --> 00:16:46.960 and we record the single piece of data 00:16:46.960 --> 00:16:54.820 that this person was born in Salem, Massachusetts, 00:16:54.820 --> 00:17:02.300 that single piece of data links to the item about Salem, 00:17:02.300 --> 00:17:04.060 Massachusetts because, of course, 00:17:04.060 --> 00:17:07.010 we know a lot of things about that place, Salem, 00:17:07.010 --> 00:17:07.869 Massachusetts. 00:17:07.869 --> 00:17:09.245 So it's not just the text-- 00:17:09.245 --> 00:17:13.450 S-A-L-E-M. It's not just, that's where they were born. 00:17:13.450 --> 00:17:17.170 But it's a link to all the data that we have 00:17:17.170 --> 00:17:19.270 about Salem, Massachusetts. 00:17:19.270 --> 00:17:24.940 If we say someone's nationality is French, 00:17:24.940 --> 00:17:26.589 that is a link to France. 00:17:26.589 --> 00:17:30.700 That is a link to everything we know about the country France. 00:17:30.700 --> 00:17:34.150 The fact that the data is linked and structured 00:17:34.150 --> 00:17:37.630 allows not only humans, but also computers 00:17:37.630 --> 00:17:41.620 to traverse information and to bring 00:17:41.620 --> 00:17:44.950 us different pieces of relevant information 00:17:44.950 --> 00:17:49.000 programmatically, automatically, based on those links. 00:17:49.000 --> 00:17:52.000 Because it's not just text, it's an actual link 00:17:52.000 --> 00:17:56.700 to another chunk of data. 00:17:56.700 --> 00:17:58.880 If this sounds a little abstract, 00:17:58.880 --> 00:18:01.190 it will become much clearer in just a second 00:18:01.190 --> 00:18:03.230 when we see it in action. 
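A minimal sketch of what "linked" means here, assuming a toy in-memory store. The item keys, property names, and the `follow` helper are hypothetical illustrations, not Wikidata's real identifiers or API. The point is only that the place of birth is stored as a reference to another item rather than as the text S-A-L-E-M, so a program can keep following links.

```python
# Toy store: every key and property name here is made up for illustration.
items = {
    "person_x": {
        "label": "An imaginary person",
        "place_of_birth": {"item": "salem_ma"},  # a link, not a string of text
    },
    "salem_ma": {
        "label": "Salem, Massachusetts",
        "country": {"item": "usa"},  # another link
    },
    "usa": {"label": "United States of America"},
}


def follow(item_id, *properties):
    """Follow a chain of item-valued properties and return the final item id."""
    current = item_id
    for prop in properties:
        current = items[current][prop]["item"]
    return current


# Because the birthplace is a link, one more hop reaches the country.
birth_country = follow("person_x", "place_of_birth", "country")
print(items[birth_country]["label"])
```

Had the birthplace been stored as plain text, the second hop would have nothing to land on.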
00:18:03.230 --> 00:18:06.200 But the other components of this little definition are, 00:18:06.200 --> 00:18:09.650 of course, this central storage of structured and linked data 00:18:09.650 --> 00:18:12.620 needs to be editable, of course, because we 00:18:12.620 --> 00:18:14.370 need to keep it up to date. 00:18:14.370 --> 00:18:16.460 We need to correct mistakes. 00:18:16.460 --> 00:18:21.300 And we want it on a wiki under a free license. 00:18:21.300 --> 00:18:23.940 The free license is, of course, essential to enable 00:18:23.940 --> 00:18:30.910 reuse of that data, to enable all kinds of reuse of the data. 00:18:30.910 --> 00:18:34.060 And Wikidata, unlike Wikipedia, is released 00:18:34.060 --> 00:18:36.160 under a different free license. 00:18:36.160 --> 00:18:41.590 Wikidata is released under CC0 waiver. 00:18:41.590 --> 00:18:44.920 That means unlike Wikipedia, where 00:18:44.920 --> 00:18:51.160 you have to attribute Wikipedia when you reuse information 00:18:51.160 --> 00:18:55.150 from Wikipedia, you do not need to attribute Wikidata, 00:18:55.150 --> 00:18:57.040 and you do not need to share alike your work. 00:18:57.040 --> 00:19:02.020 It's an unencumbered license to reuse the data in any way you 00:19:02.020 --> 00:19:03.267 want, including commercially. 00:19:03.267 --> 00:19:05.350 You don't have to say that it comes from Wikidata. 00:19:05.350 --> 00:19:07.390 I mean, it could be nice, but you don't have to. 00:19:07.390 --> 00:19:09.280 You're under no obligation to do it. 00:19:09.280 --> 00:19:14.080 And that is important to allow certain kinds of reuse 00:19:14.080 --> 00:19:17.140 where, for example, if you're building some kind of device, 00:19:17.140 --> 00:19:20.680 you may not have a practical way to give attribution. 00:19:20.680 --> 00:19:23.920 And had we required that to use Wikidata, 00:19:23.920 --> 00:19:27.250 we would have made Wikidata less reusable. 
00:19:27.250 --> 00:19:32.940 So Wikidata is unencumbered by the requirement of attribution. 00:19:32.940 --> 00:19:35.730 And of course, because it's on a wiki, 00:19:35.730 --> 00:19:40.421 we get all the benefits that we are used to expect from a wiki, 00:19:40.421 --> 00:19:40.920 right? 00:19:40.920 --> 00:19:42.810 So it's a wiki, which means, yes. 00:19:42.810 --> 00:19:44.910 It has discussion pages. 00:19:44.910 --> 00:19:46.500 It has revision histories. 00:19:46.500 --> 00:19:47.620 It remembers everything. 00:19:47.620 --> 00:19:50.610 So if you screw it up, you can always go a version back. 00:19:50.610 --> 00:19:52.380 Or if someone else vandalized the content, 00:19:52.380 --> 00:19:54.610 we can always go back, just like Wikipedia. 00:19:54.610 --> 00:19:56.880 So we get all the benefits we're used to-- 00:19:56.880 --> 00:20:01.260 user talk pages, group discussion pages, watch lists, 00:20:01.260 --> 00:20:03.755 all the features that we expect in a wiki. 00:20:06.740 --> 00:20:11.170 In short, Wikidata is love. 00:20:11.170 --> 00:20:14.100 I hope you agree with me by the end of this talk. 00:20:14.100 --> 00:20:18.580 So let's zoom in and see what this structured data 00:20:18.580 --> 00:20:21.420 looks like. 00:20:21.420 --> 00:20:29.460 So structured data on Wikidata is collected in statements. 00:20:29.460 --> 00:20:31.930 And statements have the general form 00:20:31.930 --> 00:20:39.490 of this triple, this tripartite ascription-- 00:20:39.490 --> 00:20:43.550 items, properties, and values. 00:20:43.550 --> 00:20:46.930 Now an item is the subject, is the topic 00:20:46.930 --> 00:20:48.820 that we are trying to describe. 00:20:48.820 --> 00:20:52.164 It can be any topic that Wikipedia can cover, 00:20:52.164 --> 00:20:53.830 and many others that Wikipedia wouldn't. 
00:20:53.830 --> 00:20:57.490 So the topic, the item can be Germany, 00:20:57.490 --> 00:21:00.520 or it can be Salem, Massachusetts, 00:21:00.520 --> 00:21:03.340 or it can be the concept of redemption. 00:21:03.340 --> 00:21:04.610 It can be anything at all. 00:21:04.610 --> 00:21:10.000 Anything you can imagine describing in any way with data 00:21:10.000 --> 00:21:11.990 can be the item. 00:21:11.990 --> 00:21:15.430 So the item, consider it like the title 00:21:15.430 --> 00:21:17.480 of the rest of the data. 00:21:17.480 --> 00:21:20.860 And then what do we say about Salem, Massachusetts 00:21:20.860 --> 00:21:22.330 or about Germany? 00:21:22.330 --> 00:21:26.770 Well, that's a series of properties and values, 00:21:26.770 --> 00:21:28.450 properties and values. 00:21:28.450 --> 00:21:32.680 The property is the kind of datum, 00:21:32.680 --> 00:21:39.770 like birth date or language spoken or manner of death. 00:21:39.770 --> 00:21:42.640 These are all real properties. 00:21:42.640 --> 00:21:46.030 Or national anthem, if I'm trying to describe a country-- 00:21:46.030 --> 00:21:47.830 these are properties. 00:21:47.830 --> 00:21:49.880 And then they have values, right? 00:21:49.880 --> 00:21:55.740 So this person, this imaginary person's place 00:21:55.740 --> 00:21:59.640 of birth, the value of the property place of birth 00:21:59.640 --> 00:22:02.430 is Salem, Massachusetts. 00:22:02.430 --> 00:22:06.690 So you can think about it as like a government form-- 00:22:06.690 --> 00:22:09.540 or not government, just any form that you're filling out-- 00:22:09.540 --> 00:22:12.420 where there are field names, and then empty spaces for you 00:22:12.420 --> 00:22:13.110 to fill out. 00:22:13.110 --> 00:22:14.460 That's the value, OK? 00:22:14.460 --> 00:22:18.150 So the field names or the categories 00:22:18.150 --> 00:22:19.350 are the properties, right? 
00:22:19.350 --> 00:22:22.960 So name, language, occupation, date of birth-- 00:22:22.960 --> 00:22:24.420 these are all properties. 00:22:24.420 --> 00:22:26.640 And the values are the actual piece 00:22:26.640 --> 00:22:31.391 of data, the actual information that we have. 00:22:31.391 --> 00:22:33.870 And of course, different kinds of data 00:22:33.870 --> 00:22:40.170 are relevant for describing different kinds of items. 00:22:40.170 --> 00:22:45.030 And the key in the value is it can be either a literal value-- 00:22:45.030 --> 00:22:50.370 like if we're describing the height of a mountain, 00:22:50.370 --> 00:22:55.826 we might say just the number 8,848. 00:22:55.826 --> 00:22:57.325 That's the height of which mountain? 00:23:01.990 --> 00:23:04.070 Not everyone at once. 00:23:04.070 --> 00:23:07.430 Oh, because it's meters, the metric system. 00:23:07.430 --> 00:23:08.270 Yeah, Mt. 00:23:08.270 --> 00:23:12.390 Everest is 8,848 meters. 00:23:12.390 --> 00:23:14.160 Yes. 00:23:14.160 --> 00:23:15.780 Get with it, America. 00:23:15.780 --> 00:23:17.630 The metric system. 00:23:17.630 --> 00:23:20.930 All right, so that can be a literal value 00:23:20.930 --> 00:23:22.580 like an actual number. 00:23:22.580 --> 00:23:28.280 Or it can be a link to an item, pointing at another item. 00:23:28.280 --> 00:23:30.890 But in this statement, it is the value. 00:23:30.890 --> 00:23:35.150 So if I'm talking about Germany, the item is Germany. 00:23:35.150 --> 00:23:39.680 And the property capital city has the value Berlin. 00:23:39.680 --> 00:23:43.130 But the value is not B-E-R-L-I-N. 00:23:43.130 --> 00:23:48.740 The value is a pointer to the item Berlin, right? 00:23:48.740 --> 00:23:51.410 That's the link. 00:23:51.410 --> 00:23:56.671 So a single item is described by a series of such statements, 00:23:56.671 --> 00:23:57.170 right? 00:23:57.170 --> 00:24:01.400 There's hundreds and hundreds of things I can say about Germany. 
00:24:01.400 --> 00:24:04.280 There's hundreds of things I can say about a person. 00:24:04.280 --> 00:24:06.350 And these will generally take the form 00:24:06.350 --> 00:24:08.330 of a property and a value. 00:24:08.330 --> 00:24:11.720 By the way, some properties may have more than one value. 00:24:11.720 --> 00:24:15.920 Consider the property languages spoken. 00:24:15.920 --> 00:24:18.050 People can speak more than one language, right? 00:24:18.050 --> 00:24:20.330 So if I'm describing myself, 00:24:20.330 --> 00:24:22.400 we can say languages spoken-- 00:24:22.400 --> 00:24:26.000 English, Hebrew, Latin, whatever. 00:24:26.000 --> 00:24:27.860 So a property can have more than one value. 00:24:30.970 --> 00:24:34.010 So if the item is about a country, 00:24:34.010 --> 00:24:38.890 it would have statements about properties like population, 00:24:38.890 --> 00:24:43.180 land area, official languages, borders with, anthem, 00:24:43.180 --> 00:24:45.070 capital city. 00:24:45.070 --> 00:24:48.580 If I'm describing a person, I have a whole mostly different 00:24:48.580 --> 00:24:51.220 set of properties that are relevant, right? 00:24:51.220 --> 00:24:54.160 Date of birth, place of birth, citizenship, occupation, 00:24:54.160 --> 00:24:56.950 father, mother, religion, notable works-- 00:24:56.950 --> 00:24:59.780 now, are all of these relevant for all people? 00:24:59.780 --> 00:25:00.970 No, of course not. 00:25:00.970 --> 00:25:02.140 It depends. 00:25:02.140 --> 00:25:05.220 And different items about different people 00:25:05.220 --> 00:25:08.920 will either have or not have these fields, right? 00:25:08.920 --> 00:25:12.640 So we wouldn't record religion for absolutely every person. 00:25:12.640 --> 00:25:14.200 Some people manage to do without. 00:25:14.200 --> 00:25:17.710 And also, it's not relevant for a lot of people, like, 00:25:17.710 --> 00:25:20.320 what their religion happens to be. 
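The two points just made (a property may hold several values, and not every item carries every property) can be sketched roughly as follows. The `Item` class and the property names are hypothetical, chosen only to mirror the spoken example. This is not how Wikidata itself is implemented.

```python
from collections import defaultdict


class Item:
    """A toy item: a label plus a mapping from property name to a list of values."""

    def __init__(self, label):
        self.label = label
        self.statements = defaultdict(list)

    def add(self, prop, value):
        # Each call records one more statement; earlier values are kept.
        self.statements[prop].append(value)

    def values(self, prop):
        # A property that was never recorded simply yields an empty list.
        return list(self.statements.get(prop, []))


speaker = Item("the speaker")
for language in ("English", "Hebrew", "Latin"):
    speaker.add("languages_spoken", language)

print(speaker.values("languages_spoken"))
print(speaker.values("religion"))
```

Storing a list per property means "more than one value" and "no value recorded" need no special cases.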
00:25:20.320 --> 00:25:22.840 Date of birth is generally relevant for most people 00:25:22.840 --> 00:25:24.060 that we're documenting. 00:25:24.060 --> 00:25:29.390 So some properties kind of crop up more commonly than others. 00:25:29.390 --> 00:25:33.220 A person's height, for example, is not generally 00:25:33.220 --> 00:25:35.596 considered of encyclopedic value, right? 00:25:35.596 --> 00:25:36.970 We don't, for example, if we have 00:25:36.970 --> 00:25:40.840 an article about even a really well-documented person 00:25:40.840 --> 00:25:45.610 like Winston Churchill, does Wikipedia mention his height? 00:25:45.610 --> 00:25:47.620 I don't think it does. 00:25:47.620 --> 00:25:50.320 Even though I'm sure we could probably 00:25:50.320 --> 00:25:52.810 find a source somewhere that lists his height, 00:25:52.810 --> 00:25:55.570 it's just not a very relevant piece 00:25:55.570 --> 00:25:57.506 of information about Churchill. 00:25:57.506 --> 00:25:59.380 With everything else that's written about him 00:25:59.380 --> 00:26:00.796 and that we know about him that we 00:26:00.796 --> 00:26:03.460 want to include in the article, a person's height 00:26:03.460 --> 00:26:08.180 is not really something of great value most of the time. 00:26:08.180 --> 00:26:14.420 But if we are describing Michael Jordan, it is relevant. 00:26:14.420 --> 00:26:15.430 I'm dating myself. 00:26:15.430 --> 00:26:19.230 People still know Michael Jordan, right? 00:26:19.230 --> 00:26:21.600 You know, a basketball player, that's 00:26:21.600 --> 00:26:24.204 when height is very relevant, right? 00:26:24.204 --> 00:26:25.620 That's one of the first things you 00:26:25.620 --> 00:26:28.020 say when you're describing a basketball player, 00:26:28.020 --> 00:26:31.380 is list their height. 00:26:31.380 --> 00:26:33.690 So even within the class of person, 00:26:33.690 --> 00:26:36.480 some properties may be more or less relevant, 00:26:36.480 --> 00:26:38.320 depending on the context. 
00:26:38.320 --> 00:26:40.090 So let's look at some examples. 00:26:40.090 --> 00:26:42.870 These are examples of statements. 00:26:42.870 --> 00:26:45.400 Each line is a statement. 00:26:45.400 --> 00:26:47.130 So here's the first one. 00:26:47.130 --> 00:26:53.270 I want to state, about the item Earth, our planet. 00:26:53.270 --> 00:26:55.760 And what I want to say about Earth 00:26:55.760 --> 00:27:00.980 is that the property highest point on Earth 00:27:00.980 --> 00:27:03.310 has the value Mt. 00:27:03.310 --> 00:27:04.817 Everest. 00:27:04.817 --> 00:27:05.900 Would you agree with that? 00:27:05.900 --> 00:27:09.580 That is the highest point on Earth. 00:27:09.580 --> 00:27:11.100 That's a statement. 00:27:11.100 --> 00:27:14.020 It says something specific, one piece 00:27:14.020 --> 00:27:15.517 of information about Earth. 00:27:15.517 --> 00:27:17.350 Now of course, there's a lot of other things 00:27:17.350 --> 00:27:18.820 we want to say about Earth-- 00:27:18.820 --> 00:27:21.165 circumference, average temperature, 00:27:21.165 --> 00:27:22.540 I don't know, all kinds of things 00:27:22.540 --> 00:27:26.750 we can describe the planet with, density, the galaxy 00:27:26.750 --> 00:27:28.250 it belongs to, all that. 00:27:28.250 --> 00:27:30.400 But here's one piece of information, 00:27:30.400 --> 00:27:37.370 one very specific field in the detailed form about Earth. 00:27:37.370 --> 00:27:38.990 The highest point is Mt. 00:27:38.990 --> 00:27:39.590 Everest. 00:27:39.590 --> 00:27:41.570 Now here's a second statement. 00:27:41.570 --> 00:27:42.920 This time Mt. 00:27:42.920 --> 00:27:46.690 Everest itself is the item that I'm describing, right? 00:27:46.690 --> 00:27:48.590 The topic has changed. 00:27:48.590 --> 00:27:50.120 Now I'm saying something about Mt. 00:27:50.120 --> 00:27:52.340 Everest, and what I'm saying about Mt. 00:27:52.340 --> 00:27:56.860 Everest is elevation above sea level. 
00:27:56.860 --> 00:28:01.190 Sounds the same but it isn't, because the highest 00:28:01.190 --> 00:28:04.670 point on Earth answers the question where, 00:28:04.670 --> 00:28:08.090 like on the planet, what is the highest point? 00:28:08.090 --> 00:28:08.720 It's Mt. 00:28:08.720 --> 00:28:09.630 Everest. 00:28:09.630 --> 00:28:12.911 But how high is that highest point is a different piece 00:28:12.911 --> 00:28:13.535 of information. 00:28:13.535 --> 00:28:14.710 Do you agree? 00:28:14.710 --> 00:28:16.790 It's the actual altitude. 00:28:16.790 --> 00:28:19.600 It's not where on the planet it is. 00:28:19.600 --> 00:28:21.680 So it may sound similar, but these are actually 00:28:21.680 --> 00:28:24.030 very different pieces of information. 00:28:24.030 --> 00:28:27.800 So that highest point, how high is it? 00:28:27.800 --> 00:28:31.790 Well, it's 8,848 meters high. 00:28:31.790 --> 00:28:36.550 Now the third statement gives another piece of information 00:28:36.550 --> 00:28:37.960 about the first item. 00:28:37.960 --> 00:28:40.870 Same item-- I could have grouped them together. 00:28:40.870 --> 00:28:42.400 Another thing I know about the Earth 00:28:42.400 --> 00:28:46.480 is that the deepest point on the planet 00:28:46.480 --> 00:28:53.050 is the Challenger Deep, part of the so-called Mariana 00:28:53.050 --> 00:28:54.760 Trench in the ocean. 00:28:54.760 --> 00:28:56.530 So that is the deepest point. 00:28:56.530 --> 00:28:58.180 And how deep is it? 00:28:58.180 --> 00:29:01.384 I again use the elevation above sea level. 00:29:01.384 --> 00:29:03.550 That's the name of the property even though it's not 00:29:03.550 --> 00:29:04.750 above sea level. 00:29:04.750 --> 00:29:08.260 I have a negative value because the elevation of the Challenger 00:29:08.260 --> 00:29:13.700 Deep is minus 11 kilometers, more or less. 00:29:13.700 --> 00:29:14.200 All right? 00:29:14.200 --> 00:29:15.620 So these are statements. 
00:29:15.620 --> 00:29:18.820 These are four individual pieces of data. 00:29:18.820 --> 00:29:21.160 And I could also look at it this way. 00:29:21.160 --> 00:29:25.210 Maybe that's closer to the government form example 00:29:25.210 --> 00:29:26.620 that I was giving, right? 00:29:26.620 --> 00:29:29.190 So I want to say something about Earth. 00:29:29.190 --> 00:29:30.760 What do I want to say? 00:29:30.760 --> 00:29:33.580 Two things-- highest point. 00:29:33.580 --> 00:29:36.760 That's the field, that's the property, 00:29:36.760 --> 00:29:37.780 and this is the value. 00:29:37.780 --> 00:29:39.190 The highest point is Mt. 00:29:39.190 --> 00:29:40.240 Everest. 00:29:40.240 --> 00:29:42.880 The deepest point is Challenger Deep. 00:29:42.880 --> 00:29:46.450 And then I have things to say about Challenger Deep-- 00:29:46.450 --> 00:29:49.630 the property of elevation above sea level, the value 00:29:49.630 --> 00:29:52.280 is minus 11 kilometers. 00:29:55.900 --> 00:30:00.600 Now here's yet another view of the same data 00:30:00.600 --> 00:30:04.530 once more, with numeric IDs. 00:30:04.530 --> 00:30:08.150 So this is the same information, the same four statements. 00:30:08.150 --> 00:30:13.020 But this time, in addition to using words, 00:30:13.020 --> 00:30:21.270 I'm also including weird numbers following either Q or P. 00:30:21.270 --> 00:30:25.890 So P stands for property. 00:30:25.890 --> 00:30:30.330 So the highest point property is P610. 00:30:30.330 --> 00:30:34.216 And the deepest point property is P1589. 00:30:34.216 --> 00:30:35.340 What do these numbers mean? 00:30:35.340 --> 00:30:36.985 They don't mean anything at all. 00:30:36.985 --> 00:30:37.860 They're just numbers. 00:30:37.860 --> 00:30:39.760 They're just sequential numbers. 00:30:39.760 --> 00:30:42.600 And if I create a new Wikidata item right now, 00:30:42.600 --> 00:30:46.020 it'll get just the next available number. 00:30:46.020 --> 00:30:47.790 So they're just numbers. 
00:30:47.790 --> 00:30:49.080 So P stands for property. 00:30:49.080 --> 00:30:51.480 What does Q stand for? 00:30:51.480 --> 00:30:53.460 Does anyone know? 00:30:53.460 --> 00:30:58.500 It's a trick question because it's hard to guess. 00:30:58.500 --> 00:31:01.896 But the principal architect of Wikidata, 00:31:01.896 --> 00:31:07.860 a Wikipedian named Danny [INAUDIBLE] and data scientist, 00:31:07.860 --> 00:31:10.950 is married to a lovely lady named [INAUDIBLE] 00:31:10.950 --> 00:31:16.320 spelled with a Q. And this is a loving tribute. 00:31:16.320 --> 00:31:21.780 And she's also a Wikipedian and an admin of Uzbek Wikipedia. 00:31:21.780 --> 00:31:31.650 So Q2 is just the numeric identifier of the item Earth. 00:31:31.650 --> 00:31:36.190 And Q513 is the identifier of Mt. 00:31:36.190 --> 00:31:37.310 Everest. 00:31:37.310 --> 00:31:42.950 You notice that we use that ID across the statement, right? 00:31:42.950 --> 00:31:48.520 So from Wikidata's perspective, this 00:31:48.520 --> 00:31:53.290 is actually what the database actually contains. 00:31:53.290 --> 00:31:55.030 What we were saying with words-- 00:31:55.030 --> 00:31:57.650 the Earth, highest point, whatever-- 00:31:57.650 --> 00:31:58.540 never mind that. 00:31:58.540 --> 00:32:03.250 Q2 has P610 with a value Q513. 00:32:03.250 --> 00:32:06.190 That's what Wikidata cares about, OK? 00:32:06.190 --> 00:32:09.770 Now that, you'll agree, is a little inaccessible. 00:32:09.770 --> 00:32:13.120 Just these lists of numbers, that's a little hard. 00:32:13.120 --> 00:32:16.240 So Wikidata understands and allows 00:32:16.240 --> 00:32:19.690 us to continue using our words. 00:32:19.690 --> 00:32:23.650 But actually, it gets translated into numeric IDs. 00:32:23.650 --> 00:32:25.050 Now why is this a good idea? 00:32:30.070 --> 00:32:33.070 Why can't we just say Earth or Mt. 00:32:33.070 --> 00:32:35.120 Everest? 00:32:35.120 --> 00:32:36.170 Any thoughts? 
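The ID-based form of a statement described above ("Q2 has P610 with a value Q513") can be sketched in a few lines of Python. This is only an illustration, not how Wikidata is implemented: the `Statement` type, the `LABELS_EN` table, and the `render` helper are hypothetical names, and only the IDs quoted in the talk (Q2, Q513, P610, P1589) are used.

```python
from typing import Dict, NamedTuple


class Statement(NamedTuple):
    item: str   # Q-identifier of the item being described
    prop: str   # P-identifier of the property
    value: str  # in this example, the Q-identifier of another item


# English labels for the IDs mentioned in the talk (hypothetical lookup table).
LABELS_EN: Dict[str, str] = {
    "Q2": "Earth",
    "Q513": "Mt. Everest",
    "P610": "highest point",
    "P1589": "deepest point",
}

# What is stored is just IDs; the words live in a separate label table.
STATEMENTS = [Statement("Q2", "P610", "Q513")]


def render(stmt: Statement, labels: Dict[str, str]) -> str:
    """Translate stored IDs back into words, keeping the raw ID if no label exists."""
    return " | ".join(labels.get(part, part) for part in stmt)


print(render(STATEMENTS[0], LABELS_EN))  # Earth | highest point | Mt. Everest
print(render(STATEMENTS[0], {}))         # Q2 | P610 | Q513
```

Swapping `LABELS_EN` for a table in another language would change the rendered words without touching the stored statement.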
00:32:36.170 --> 00:32:39.530 This is an open question. 00:32:39.530 --> 00:32:41.540 Why is this a good idea to use numbers 00:32:41.540 --> 00:32:43.260 instead of the names of things? 00:32:47.000 --> 00:32:51.750 Yes, because more than one thing can have the same name. 00:32:51.750 --> 00:32:52.590 What do you mean? 00:32:52.590 --> 00:32:53.460 There's only one Mt. 00:32:53.460 --> 00:32:54.480 Everest. 00:32:54.480 --> 00:32:55.510 Well, yeah. 00:32:55.510 --> 00:32:58.710 But there's also a movie called-- and probably 00:32:58.710 --> 00:33:00.000 more than one-- called Mt. 00:33:00.000 --> 00:33:04.080 Everest, or a TV documentary literally called Mt. 00:33:04.080 --> 00:33:06.590 Everest. 00:33:06.590 --> 00:33:09.960 And of course, if I'm describing a person named 00:33:09.960 --> 00:33:14.930 Frank Johnson, not the only Frank Johnson on the planet, 00:33:14.930 --> 00:33:16.180 right? 00:33:16.180 --> 00:33:17.760 But wait, you say. 00:33:17.760 --> 00:33:20.640 On Wikipedia we deal with that problem, right? 00:33:20.640 --> 00:33:23.490 How do we deal with that problem on Wikipedia? 00:33:23.490 --> 00:33:26.270 Does anyone in the audience know? 00:33:26.270 --> 00:33:27.969 The standard way to deal with the fact 00:33:27.969 --> 00:33:30.260 that there is more than one Frank Johnson in the world, 00:33:30.260 --> 00:33:35.600 on Wikipedia, is to use parentheses after the name. 00:33:35.600 --> 00:33:39.200 So there is Frank Johnson (actor) 00:33:39.200 --> 00:33:42.620 and Frank Johnson (politician), for example, 00:33:42.620 --> 00:33:44.700 if that's the distinction we need to make. 00:33:44.700 --> 00:33:48.140 So you put in parentheses kind of the minimal amount 00:33:48.140 --> 00:33:51.840 of information you need to tell apart these Frank Johnsons. 00:33:51.840 --> 00:33:54.530 What if there's two politician Frank Johnsons? 
00:33:54.530 --> 00:33:58.880 Well, then you would say Frank Johnson (Delaware politician) 00:33:58.880 --> 00:34:01.960 versus Frank Johnson (California politician), right? 00:34:01.960 --> 00:34:05.210 You just put in that bit of context to tell them apart. 00:34:05.210 --> 00:34:07.640 So that's the solution that Wikipedians came up 00:34:07.640 --> 00:34:12.469 with years and years ago because they did need 00:34:12.469 --> 00:34:15.560 a unique name for the article. 00:34:15.560 --> 00:34:18.170 You can't have two articles literally called 00:34:18.170 --> 00:34:20.790 Frank Johnson on Wikipedia. 00:34:20.790 --> 00:34:23.570 So that's the solution on Wikipedia. 00:34:23.570 --> 00:34:28.429 But Wikidata was designed much later, more than a decade 00:34:28.429 --> 00:34:31.340 after Wikipedia, and was able to kind of learn 00:34:31.340 --> 00:34:34.520 from the experience of Wikipedia, which 00:34:34.520 --> 00:34:39.380 has tremendous experience with multilingualism, much 00:34:39.380 --> 00:34:42.870 more than most sites and projects, as we know. 00:34:42.870 --> 00:34:44.659 And so the Wikidata team understood 00:34:44.659 --> 00:34:47.840 from the get go that this will be an issue, 00:34:47.840 --> 00:34:50.989 and it's better to use numbers that are unequivocally 00:34:50.989 --> 00:34:54.800 different from each other instead of labels, 00:34:54.800 --> 00:34:57.290 instead of the actual name, the actual text, 00:34:57.290 --> 00:34:59.630 because names are not unique. 00:34:59.630 --> 00:35:03.260 Names can change, right? 00:35:03.260 --> 00:35:08.960 Just last year, there was a big naming reform in Ukraine 00:35:08.960 --> 00:35:13.610 and a whole bunch of towns and districts were renamed. 00:35:13.610 --> 00:35:17.330 Does that mean we should change all the data that we have, like 00:35:17.330 --> 00:35:19.550 lose all the data that we have about the old name? 
00:35:19.550 --> 00:35:22.130 No, we ideally just want to change the name 00:35:22.130 --> 00:35:24.020 without breaking links. 00:35:24.020 --> 00:35:28.550 So having the links actually refer to the numbers 00:35:28.550 --> 00:35:32.090 is one way to ensure the integrity of the data, 00:35:32.090 --> 00:35:35.360 of the links, when renaming happens. 00:35:35.360 --> 00:35:39.230 Another reason is well, even if the name doesn't change, 00:35:39.230 --> 00:35:42.230 not all humans call everything the same, right? 00:35:42.230 --> 00:35:46.180 So Earth is Earth in English, but it's 00:35:46.180 --> 00:35:48.210 [SPEAKING ARABIC] in Arabic. 00:35:48.210 --> 00:35:49.585 It's [SPEAKING HEBREW] in Hebrew. 00:35:53.480 --> 00:35:56.570 So obviously, Earth-- even that is not 00:35:56.570 --> 00:36:01.920 as unambiguous or unequivocal as you might think. 00:36:01.920 --> 00:36:03.500 And so that is the reason Wikidata, 00:36:03.500 --> 00:36:07.640 which is built to be multilingual from the start, 00:36:07.640 --> 00:36:11.230 talks about numbers rather than labels. 00:36:11.230 --> 00:36:12.150 OK. 00:36:12.150 --> 00:36:15.370 Ha, I had a whole slide about that and I forgot. 00:36:15.370 --> 00:36:17.830 Yes, so even London, again, is not 00:36:17.830 --> 00:36:20.710 just London, England, which is what you were thinking about. 00:36:20.710 --> 00:36:22.030 It's also a city in Canada. 00:36:22.030 --> 00:36:26.260 And it's also a family name, like Jack London. 00:36:26.260 --> 00:36:27.430 It's also a movie company. 00:36:27.430 --> 00:36:32.230 There must be some hotel named London somewhere. 00:36:32.230 --> 00:36:36.070 This is a good opportunity to remind everyone 00:36:36.070 --> 00:36:41.110 that the vast majority of humankind 00:36:41.110 --> 00:36:45.700 does not speak a word of English. 00:36:45.700 --> 00:36:48.790 That's a statistic worth remembering. 00:36:48.790 --> 00:36:55.240 The vast majority of the planet does not speak English at all. 
00:36:55.240 --> 00:36:57.070 That does not contradict the datum 00:36:57.070 --> 00:37:00.070 that English is the most widely spoken language. 00:37:00.070 --> 00:37:02.860 And yet, in aggregate, a majority of people 00:37:02.860 --> 00:37:07.180 speak other languages, and not English at all. 00:37:07.180 --> 00:37:13.150 So moving swiftly on, this is a pause for questions 00:37:13.150 --> 00:37:15.610 about what I've covered so far. 00:37:15.610 --> 00:37:17.390 Any questions in the audience? 00:37:17.390 --> 00:37:19.450 If not, we move to IRC. 00:37:19.450 --> 00:37:21.042 If there are any questions-- 00:37:23.880 --> 00:37:26.891 Any questions? 00:37:26.891 --> 00:37:27.390 No? 00:37:27.390 --> 00:37:28.305 IRC? 00:37:28.305 --> 00:37:29.490 Any questions? 00:37:33.580 --> 00:37:34.180 OK. 00:37:34.180 --> 00:37:38.170 We will have additional pauses for questions later. 00:37:38.170 --> 00:37:41.470 But enough of my hand-waving. 00:37:41.470 --> 00:37:44.590 Let's go explore Wikidata. 00:37:44.590 --> 00:37:49.730 So Wikidata lives at wikidata.org. 00:37:49.730 --> 00:37:59.570 And Wikidata already has more than 25 million items. 00:37:59.570 --> 00:38:05.570 That is, it collects statements about more than 25 00:38:05.570 --> 00:38:08.270 million topics. 00:38:08.270 --> 00:38:12.170 It has many, many more than 25 million statements 00:38:12.170 --> 00:38:14.660 because many of these items have dozens or hundreds 00:38:14.660 --> 00:38:16.370 of statements. 00:38:16.370 --> 00:38:20.720 So it documents 25 million things-- 00:38:20.720 --> 00:38:23.153 people, books, rivers, whatever. 00:38:26.010 --> 00:38:28.800 Just to give us a sense of how big that number is, 00:38:28.800 --> 00:38:32.430 how many articles do we have on English Wikipedia? 00:38:32.430 --> 00:38:35.610 More than-- yes, more than 5 million articles. 00:38:35.610 --> 00:38:37.990 And that's the largest Wikipedia. 
00:38:37.990 --> 00:38:41.100 So Wikidata is already describing 00:38:41.100 --> 00:38:45.450 more than five times, or about five times as many items 00:38:45.450 --> 00:38:48.460 as even our largest Wikipedia. 00:38:48.460 --> 00:38:50.840 So obviously, Wikidata contains data 00:38:50.840 --> 00:38:56.900 about things that have no article on any Wikipedia. 00:38:56.900 --> 00:39:01.980 It is a much, much larger, more comprehensive project. 00:39:01.980 --> 00:39:04.250 All right, the second thing we might notice 00:39:04.250 --> 00:39:07.610 is, well, this looks kind of like Wikipedia, right? 00:39:07.610 --> 00:39:11.210 If we've never visited, it looks kind of like Wikipedia. 00:39:11.210 --> 00:39:13.490 It has this sidebar. 00:39:13.490 --> 00:39:15.290 It has these buttons at the top. 00:39:15.290 --> 00:39:17.810 It looks like it's from the '90s. 00:39:17.810 --> 00:39:18.770 Yeah. 00:39:18.770 --> 00:39:20.900 So the reason it looks like Wikipedia 00:39:20.900 --> 00:39:24.410 is that it is a wiki running on MediaWiki software. 00:39:24.410 --> 00:39:28.430 It is running on software very much like Wikipedia. 00:39:28.430 --> 00:39:32.180 But it is running on a kind of modification 00:39:32.180 --> 00:39:34.010 of the standard wiki software. 00:39:34.010 --> 00:39:36.170 It has an additional, very important component 00:39:36.170 --> 00:39:38.630 named Wikibase, which gives it all 00:39:38.630 --> 00:39:42.700 of its structured and linked data power. 00:39:42.700 --> 00:39:46.763 So let's start exploring Wikidata. 00:39:52.830 --> 00:39:55.770 Let's take something local-- 00:39:55.770 --> 00:39:57.530 Harvey Milk. 00:39:57.530 --> 00:40:00.190 Harvey Milk. 00:40:00.190 --> 00:40:03.460 What does Wikidata know about Harvey Milk? 00:40:03.460 --> 00:40:06.730 For those on YouTube who may not be local, 00:40:06.730 --> 00:40:15.580 he's a San Francisco politician and gay rights activist 00:40:15.580 --> 00:40:18.380 who was murdered in the '70s. 
00:40:18.380 --> 00:40:21.280 It was very significant in the history of those struggles 00:40:21.280 --> 00:40:22.710 in this country. 00:40:22.710 --> 00:40:27.220 So what does Wikidata tell us about Harvey Milk? 00:40:27.220 --> 00:40:29.770 Well, the first thing is it knows 00:40:29.770 --> 00:40:34.562 that Harvey Milk is Q17141. 00:40:34.562 --> 00:40:36.520 That's the most important piece of information, 00:40:36.520 --> 00:40:38.770 is first of all, that is the identifier. 00:40:38.770 --> 00:40:42.490 That is the item number of all the data 00:40:42.490 --> 00:40:46.150 that we will collect about Harvey Milk. 00:40:46.150 --> 00:40:50.020 The second thing you see right under the title 00:40:50.020 --> 00:40:54.730 is this line, this very, very brief summary, right? 00:40:54.730 --> 00:40:59.620 "American politician who became a martyr in the gay community." 00:40:59.620 --> 00:41:02.080 This line is the description line. 00:41:02.080 --> 00:41:04.640 So the name of the item-- 00:41:04.640 --> 00:41:05.980 this is the label. 00:41:05.980 --> 00:41:07.450 We call it label on Wikidata. 00:41:07.450 --> 00:41:08.740 That's the label. 00:41:08.740 --> 00:41:10.990 And this line is the description. 00:41:10.990 --> 00:41:13.480 Now why is this description important? 00:41:13.480 --> 00:41:16.990 This is the description that helps us tell this Harvey 00:41:16.990 --> 00:41:23.230 Milk from any other Harvey Milk that may exist, all right? 00:41:23.230 --> 00:41:26.530 So again, this would be useful if I'm 00:41:26.530 --> 00:41:30.190 looking up someone with a slightly more generic name. 00:41:30.190 --> 00:41:33.910 That line will help me tell apart the item about Harvey 00:41:33.910 --> 00:41:38.860 Milk the gay activist rather than Harvey Milk the film 00:41:38.860 --> 00:41:41.750 actor, OK? 00:41:41.750 --> 00:41:43.100 And where is it coming from? 
00:41:43.100 --> 00:41:48.690 Well, Wikidata has this whole table, 00:41:48.690 --> 00:41:52.790 as you can see, with descriptions and labels 00:41:52.790 --> 00:41:54.750 in other languages. 00:41:54.750 --> 00:41:59.600 So Wikidata is able to refer to Harvey Milk in Arabic which, 00:41:59.600 --> 00:42:04.010 don't panic, is written from right to left. 00:42:04.010 --> 00:42:07.730 It also knows what to call him in Bulgarian. 00:42:07.730 --> 00:42:11.030 I mean, it's the same name, but it's in a different script. 00:42:11.030 --> 00:42:13.640 In French, in Hebrew, and that's it? 00:42:13.640 --> 00:42:17.960 Does it not know a name for Harvey Milk in Italian? 00:42:17.960 --> 00:42:19.760 Of course it does. 00:42:19.760 --> 00:42:22.250 It actually has labels for this person 00:42:22.250 --> 00:42:24.435 in many, many, many languages. 00:42:24.435 --> 00:42:30.080 It doesn't have descriptions in every language, as you can see. 00:42:30.080 --> 00:42:30.800 OK? 00:42:30.800 --> 00:42:36.240 So why was Wikidata showing me these languages and not others? 00:42:36.240 --> 00:42:39.260 I mean, why this somewhat arbitrary collection-- 00:42:39.260 --> 00:42:42.860 English, Arabic, Bulgarian, German, French, and Hebrew? 00:42:42.860 --> 00:42:45.300 Because I told it to. 00:42:45.300 --> 00:42:50.390 So if we briefly click over to my user page-- 00:42:50.390 --> 00:42:52.730 again, like every wiki, you have user accounts. 00:42:52.730 --> 00:42:53.960 You have user pages. 00:42:53.960 --> 00:42:55.380 This is my user page. 00:42:55.380 --> 00:42:59.750 And as you can see, there's this little user 00:42:59.750 --> 00:43:03.230 information box here called a Babel box by Wikipedians, 00:43:03.230 --> 00:43:06.610 where I list the languages that I speak. 00:43:06.610 --> 00:43:11.000 And Wikidata uses this box just to kind of helpfully 00:43:11.000 --> 00:43:12.944 show me these languages. 
00:43:12.944 --> 00:43:14.360 Of course, all the other languages 00:43:14.360 --> 00:43:19.580 are still available, as you saw, by clicking the more languages. 00:43:19.580 --> 00:43:22.940 But this is just a useful little way 00:43:22.940 --> 00:43:27.590 of getting the languages I care about up there first. 00:43:27.590 --> 00:43:29.060 By the way, this is a lie. 00:43:29.060 --> 00:43:31.170 I don't actually speak Bulgarian. 00:43:31.170 --> 00:43:33.740 That stayed on my user page because I was demonstrating 00:43:33.740 --> 00:43:37.010 this in Bulgaria and I wanted that label to show up there 00:43:37.010 --> 00:43:38.420 during the talk-- 00:43:38.420 --> 00:43:40.250 just in case you were going to tell me 00:43:40.250 --> 00:43:43.840 a really good Bulgarian joke. 00:43:43.840 --> 00:43:48.470 OK so for example, Hebrew is my mother tongue. 00:43:48.470 --> 00:43:51.730 And we have a Hebrew label for Harvey Milk. 00:43:51.730 --> 00:43:53.810 But we don't have a description. 00:43:53.810 --> 00:44:00.950 So let's fix that right now by clicking the edit button right 00:44:00.950 --> 00:44:01.960 here. 00:44:01.960 --> 00:44:05.930 I click edit, and this table became editable. 00:44:05.930 --> 00:44:09.661 And now I can very briefly type a description. 00:44:22.899 --> 00:44:24.440 AUDIENCE: Online in about 20 seconds. 00:44:24.440 --> 00:44:25.400 But can we hold it? 00:44:25.400 --> 00:44:26.066 ASAF BARTOV: OK. 00:44:28.454 --> 00:44:30.430 That was good timing for the screen to crash. 00:44:53.642 --> 00:44:54.142 OK? 00:44:59.082 --> 00:45:01.800 Are we back? 00:45:01.800 --> 00:45:02.850 OK. 00:45:02.850 --> 00:45:03.690 Sorry about that. 00:45:03.690 --> 00:45:07.500 So this was all about what to call him in different languages 00:45:07.500 --> 00:45:09.930 and scripts and how to tell this person apart 00:45:09.930 --> 00:45:13.590 from other people with potentially the same name. 
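The per-language table of labels and descriptions shown in the demo can be pictured as two small lookup tables attached to one item ID. The sketch below is a hypothetical in-memory model, not the Wikidata API: the `ITEM` dictionary and the `pick` helper are made-up names, the French and German entries are illustrative assumptions, and the "first preferred language that has an entry" rule is an assumption of this sketch rather than a description of how the site chooses what to display.

```python
from typing import Dict, List, Optional

# Hypothetical snapshot of one item's label/description table.
# The English description is the one read out in the talk; the French and
# German rows are illustrative assumptions.
ITEM = {
    "id": "Q17141",
    "labels": {"en": "Harvey Milk", "de": "Harvey Milk", "fr": "Harvey Milk"},
    "descriptions": {
        "en": "American politician who became a martyr in the gay community",
    },
}


def pick(table: Dict[str, str], preferred: List[str]) -> Optional[str]:
    """Return the text for the first preferred language that has an entry."""
    for lang in preferred:
        if lang in table:
            return table[lang]
    return None  # nothing recorded in any of the user's languages


user_languages = ["fr", "en"]  # e.g. taken from a user's language preferences
print(pick(ITEM["labels"], user_languages))        # a label exists in French
print(pick(ITEM["descriptions"], user_languages))  # falls back to English
print(pick(ITEM["descriptions"], ["fr"]))          # None: a gap someone could fill
```

A `None` result corresponds to the situation in the demo where a label exists in a language but the description is still missing.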
00:45:13.590 --> 00:45:17.930 Let's scroll down and see what else does Wikidata 00:45:17.930 --> 00:45:19.680 know about this person? 00:45:19.680 --> 00:45:24.060 So as you can see, this is a list of statements, right? 00:45:24.060 --> 00:45:25.500 This is a list of statements. 00:45:25.500 --> 00:45:27.900 And the properties are on the left, 00:45:27.900 --> 00:45:30.340 the values are on the right. 00:45:30.340 --> 00:45:33.870 So the first thing Wikidata knows about Harvey Milk 00:45:33.870 --> 00:45:38.520 is a very important property called instance of. 00:45:38.520 --> 00:45:39.910 Instance of. 00:45:39.910 --> 00:45:44.690 And the property instance of answers the very basic question 00:45:44.690 --> 00:45:49.460 what kind of thing is this that I'm describing? 00:45:49.460 --> 00:45:50.870 Is it a book? 00:45:50.870 --> 00:45:51.980 Is it a poem? 00:45:51.980 --> 00:45:53.570 Is it a mountain? 00:45:53.570 --> 00:45:55.520 Is it a theological concept? 00:45:55.520 --> 00:45:57.800 No, it's a human. 00:45:57.800 --> 00:46:00.020 It's a person, OK? 00:46:00.020 --> 00:46:01.880 The item about Mt. 00:46:01.880 --> 00:46:07.070 Everest will say instance of mountain, OK? 00:46:07.070 --> 00:46:10.790 This is a very important property. 00:46:10.790 --> 00:46:12.500 Why is it important? 00:46:12.500 --> 00:46:14.630 Wouldn't anyone looking at this know that this is 00:46:14.630 --> 00:46:15.550 a human being? 00:46:15.550 --> 00:46:16.310 Yes. 00:46:16.310 --> 00:46:18.720 Anyone looking at this will know. 00:46:18.720 --> 00:46:23.780 But if I want a computer to be able to pull information 00:46:23.780 --> 00:46:28.160 about people, I want to be able to easily exclude 00:46:28.160 --> 00:46:30.680 all the mountains and poems and other things that 00:46:30.680 --> 00:46:33.440 are not people from my query. 
00:46:33.440 --> 00:46:37.400 So this single datum, this single piece of data, 00:46:37.400 --> 00:46:41.720 is what tells computers and algorithms very clearly, 00:46:41.720 --> 00:46:42.890 this is a human. 00:46:42.890 --> 00:46:47.340 Things that aren't instance of human are other things. 00:46:47.340 --> 00:46:48.230 OK? 00:46:48.230 --> 00:46:50.145 So it may sound very trivial, but it's not. 00:46:50.145 --> 00:46:51.770 It's very important to have an instance 00:46:51.770 --> 00:46:54.077 of field for Wikidata items. 00:46:54.077 --> 00:46:55.410 All right, what else do we know? 00:46:55.410 --> 00:46:59.360 Well, Wikidata knows about an image for Harvey Milk. 00:46:59.360 --> 00:47:02.982 Again, we can find a ton of images-- or maybe not a ton, 00:47:02.982 --> 00:47:04.940 but we can find dozens of images of Harvey Milk 00:47:04.940 --> 00:47:10.430 on Commons, on our Wikimedia multimedia repository. 00:47:10.430 --> 00:47:13.430 So why should we have a single image here on Wikidata? 00:47:13.430 --> 00:47:16.280 Again, this is mostly for reusers. 00:47:16.280 --> 00:47:18.920 If I'm building some kind of tool that pulls information 00:47:18.920 --> 00:47:21.680 from Wikidata, it's nice if there's 00:47:21.680 --> 00:47:24.680 at least one representative image to kind of use 00:47:24.680 --> 00:47:30.300 as the default or immediate image for Harvey Milk 00:47:30.300 --> 00:47:33.120 in some other reused context. 00:47:33.120 --> 00:47:34.770 All right, sex or gender-- 00:47:34.770 --> 00:47:35.670 male. 00:47:35.670 --> 00:47:38.790 Country of citizenship-- United States of America. 00:47:38.790 --> 00:47:39.910 Given name is Harvey. 00:47:39.910 --> 00:47:41.580 The date of birth is so and so. 00:47:41.580 --> 00:47:44.340 The place of birth is Woodmere. 00:47:44.340 --> 00:47:45.870 The place of death is San Francisco. 00:47:45.870 --> 00:47:48.640 The manner of death is homicide. 00:47:48.640 --> 00:47:50.930 Wikidata knows that. 
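To see why a field as obvious as instance of matters to a program, here is a small sketch that filters a toy dataset. Everything in it is hypothetical: the `ITEMS` list stands in for the database, properties are spelled out in words instead of P-numbers, and `having` is a made-up helper, not a Wikidata function.

```python
from typing import Any, Dict, List

# Toy stand-in for the database; only properties mentioned in the talk are used.
ITEMS: List[Dict[str, Any]] = [
    {"label": "Harvey Milk", "instance of": ["human"], "manner of death": ["homicide"]},
    {"label": "Mt. Everest", "instance of": ["mountain"]},
    {"label": "Winston Churchill", "instance of": ["human"]},
]


def having(items: List[Dict[str, Any]], prop: str, value: str) -> List[Dict[str, Any]]:
    """Keep only the items that have `value` among the values of `prop`."""
    return [item for item in items if value in item.get(prop, [])]


# Without "instance of", a program could not tell people from mountains.
humans = having(ITEMS, "instance of", "human")
print([item["label"] for item in humans])

# Filters compose: humans whose recorded manner of death is homicide.
victims = having(humans, "manner of death", "homicide")
print([item["label"] for item in victims])
```

Items that lack a property are simply skipped by the filter, which mirrors the point that not every item has every field.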
00:47:50.930 --> 00:47:55.700 Now again, every little datum like that 00:47:55.700 --> 00:48:02.210 is the basis for later querying and answering questions. 00:48:02.210 --> 00:48:07.390 So the fact that we record the manner of death of people-- 00:48:07.390 --> 00:48:09.230 or at least of some people-- 00:48:09.230 --> 00:48:11.900 will allow us later to go, you know, 00:48:11.900 --> 00:48:17.120 who are some people from Belgium who died by homicide? 00:48:17.120 --> 00:48:24.650 That's a question Wikidata can answer, thanks to this field. 00:48:24.650 --> 00:48:27.680 The other thing I mentioned is that things are links. 00:48:27.680 --> 00:48:29.680 So the place of birth is Woodmere. 00:48:29.680 --> 00:48:31.900 I don't know where Woodmere is, but I 00:48:31.900 --> 00:48:34.390 can click that and find out. 00:48:34.390 --> 00:48:38.270 Here is the Wikidata item about Woodmere, right? 00:48:38.270 --> 00:48:41.230 It was the value in the statement about Harvey Milk, 00:48:41.230 --> 00:48:43.900 but now I'm looking at the item about Woodmere. 00:48:43.900 --> 00:48:48.047 And it turns out it's in Nassau County, New York, right? 00:48:48.047 --> 00:48:50.380 And of course, Wikidata has a whole bunch of information 00:48:50.380 --> 00:48:55.450 for me about Woodmere-- 00:48:55.450 --> 00:48:59.720 what country it's in and the coordinates and the population 00:48:59.720 --> 00:49:06.230 and the area, all the things you would expect about a place, OK? 00:49:06.230 --> 00:49:07.512 Let's get back to Harvey Milk. 00:49:10.370 --> 00:49:13.260 So the manner of death, the cause of death-- 00:49:13.260 --> 00:49:16.880 now here, Wikidata gives us excellent information. 00:49:16.880 --> 00:49:20.390 The actual cause of death is ballistic trauma. 00:49:20.390 --> 00:49:22.160 That's a professional term. 00:49:22.160 --> 00:49:27.560 And this statement has qualifiers. 00:49:27.560 --> 00:49:30.650 So until now, I was talking about triples, right? 
00:49:30.650 --> 00:49:33.260 The item has a property with a certain value. 00:49:33.260 --> 00:49:35.270 Actually, each statement can also 00:49:35.270 --> 00:49:38.030 have a number of qualifiers which 00:49:38.030 --> 00:49:45.424 add aspects of information, still about that one question 00:49:45.424 --> 00:49:46.590 that we're answering, right? 00:49:46.590 --> 00:49:49.904 So if this property answers cause of death, 00:49:49.904 --> 00:49:51.320 it's not discussing anything else. 00:49:51.320 --> 00:49:52.880 It's not discussing languages. 00:49:52.880 --> 00:49:54.920 It's not discussing date of birth, right? 00:49:54.920 --> 00:49:56.930 It's talking about the cause of death. 00:49:56.930 --> 00:49:59.300 But we're not just saying ballistic trauma. 00:49:59.300 --> 00:50:04.550 We're saying ballistic trauma with the quantity attribute 00:50:04.550 --> 00:50:05.660 being five. 00:50:05.660 --> 00:50:07.550 What does that mean? 00:50:07.550 --> 00:50:08.870 Five bullets, right? 00:50:08.870 --> 00:50:12.780 There are five ballistic traumas. 00:50:12.780 --> 00:50:15.300 He was shot five times. 00:50:15.300 --> 00:50:18.210 And he was shot by this person named Dan White. 00:50:18.210 --> 00:50:25.020 And this ballistic trauma, like this actual shooting, 00:50:25.020 --> 00:50:28.420 is itself the subject of this other thing. 00:50:28.420 --> 00:50:31.440 This is a link to a whole other Wikidata 00:50:31.440 --> 00:50:35.510 item about the Moscone-Milk assassinations. 00:50:35.510 --> 00:50:38.610 Moscone was the San Francisco mayor at the time. 00:50:43.540 --> 00:50:47.510 We'll see slightly better or easier to understand examples 00:50:47.510 --> 00:50:49.460 of qualifiers in a bit. 00:50:49.460 --> 00:50:54.440 So if this was confusing, hang on. 00:50:54.440 --> 00:50:55.970 So he was killed by Dan White. 00:50:55.970 --> 00:50:57.800 He spoke English. 
00:50:57.800 --> 00:50:59.960 His occupation-- here's an example 00:50:59.960 --> 00:51:03.140 of a property with more than one value, right? 00:51:03.140 --> 00:51:06.260 So Milk was a politician. 00:51:06.260 --> 00:51:09.710 But he was also a Navy officer, at least for a while. 00:51:09.710 --> 00:51:12.980 That was another thing that he did during his life. 00:51:12.980 --> 00:51:15.350 And he was a human rights activist, right? 00:51:15.350 --> 00:51:20.600 So some people are writers and translators. 00:51:20.600 --> 00:51:22.610 So people can have more than one occupation. 00:51:22.610 --> 00:51:26.310 People can speak more than one language. 00:51:26.310 --> 00:51:29.130 Here's a better example of a qualifier. 00:51:29.130 --> 00:51:35.090 So the property award received has the value Presidential 00:51:35.090 --> 00:51:37.560 Medal of Freedom. 00:51:37.560 --> 00:51:42.570 And that award has an attribute called point in time, 00:51:42.570 --> 00:51:44.070 like when was this? 00:51:44.070 --> 00:51:46.580 This was in 2009. 00:51:46.580 --> 00:51:50.510 Do you see that this piece of data-- 00:51:50.510 --> 00:52:04.780 2009-- is a sub-statement or is subjugated 00:52:04.780 --> 00:52:09.621 to the context of this award, was the Presidential Medal 00:52:09.621 --> 00:52:10.120 of Freedom? 00:52:10.120 --> 00:52:13.430 It can't just kind of free float in the article. 00:52:13.430 --> 00:52:17.650 It's not that 2009 is itself a meaningful thing, right? 00:52:17.650 --> 00:52:21.550 This medal was awarded in 2009. 00:52:21.550 --> 00:52:22.170 If 00:52:22.170 --> 00:52:24.070 Wikidata doesn't tell us, for example, 00:52:24.070 --> 00:52:27.130 when he was a Navy officer, OK? 
00:52:27.130 --> 00:52:30.100 But if we were, for example, to look that up right now 00:52:30.100 --> 00:52:33.820 and find out that Milk was a Navy officer between 1962 00:52:33.820 --> 00:52:39.542 and 1964, we could go back here to the Navy officer bit 00:52:39.542 --> 00:52:41.010 and click edit. 00:52:41.010 --> 00:52:44.190 This is how I edit this particular little piece 00:52:44.190 --> 00:52:45.360 of information. 00:52:45.360 --> 00:52:49.350 And add a qualifier like this. 00:52:49.350 --> 00:52:51.300 I click Add Qualifier. 00:52:51.300 --> 00:52:57.660 And I could pick start time and end time, right? 00:52:57.660 --> 00:53:04.990 And then I could type 1962 to 1964, 00:53:04.990 --> 00:53:08.000 and that would be teaching Wikidata. 00:53:08.000 --> 00:53:10.660 Oh, I'm sorry, I meant to do that for Navy officer. 00:53:10.660 --> 00:53:11.230 OK. 00:53:11.230 --> 00:53:14.800 But, you know, that is the exact-- 00:53:14.800 --> 00:53:18.400 the accurate time span of that statement. 00:53:18.400 --> 00:53:22.850 So it's true to say about a person, he was a Navy officer, 00:53:22.850 --> 00:53:25.990 even if of course he wasn't a Navy officer his entire life. 00:53:25.990 --> 00:53:28.120 But it's better and it's more accurate, 00:53:28.120 --> 00:53:32.260 to say he was a Navy officer between 1962 and 1964. 00:53:32.260 --> 00:53:35.380 Don't worry, I'm not saving this. 00:53:35.380 --> 00:53:39.150 No vandalizing of Wikidata in this session. 00:53:39.150 --> 00:53:40.450 OK. 00:53:40.450 --> 00:53:41.140 Moving on. 00:53:41.140 --> 00:53:42.430 What else does Wikidata know? 00:53:42.430 --> 00:53:43.960 He was educated at this university. 00:53:43.960 --> 00:53:46.970 He was a member of this political party. 00:53:46.970 --> 00:53:47.470 Right? 00:53:47.470 --> 00:53:49.428 That's of course if they're a relevant property 00:53:49.428 --> 00:53:52.270 for a politician. 
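NOTE The qualifier idea described above (a statement is a property plus a value, optionally refined by qualifiers such as start time, end time, or point in time) can be sketched as a small data structure. This is only an illustrative Python model, not Wikidata's actual data format or API: the Statement class, its field names, and the describe helper are hypothetical, and the 1962 to 1964 dates are the for-example figures used in the talk, not verified facts.
```python
from dataclasses import dataclass, field
@dataclass
class Statement:
    """One claim about an item: a property, a value, and optional qualifiers."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)
# Hypothetical model for illustration; labels stand in for real property IDs.
occupation = Statement("occupation", "Navy officer")
# Adding qualifiers narrows the claim to a time span (example dates from the talk).
occupation.qualifiers["start time"] = 1962
occupation.qualifiers["end time"] = 1964
award = Statement("award received", "Presidential Medal of Freedom", {"point in time": 2009})
def describe(s: Statement) -> str:
    """Render a statement and any qualifiers as one readable line."""
    extras = ", ".join(f"{k}: {v}" for k, v in s.qualifiers.items())
    return f"{s.prop} = {s.value}" + (f" ({extras})" if extras else "")
```
Calling describe(occupation) renders the Navy officer statement together with its time span, which is the extra precision a qualifier adds; the 2009 on the award only means something attached to that award statement.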
00:53:52.270 --> 00:53:56.500 Religion, military branch, what is the category on commons 00:53:56.500 --> 00:53:58.720 that discusses this item, is something 00:53:58.720 --> 00:54:00.790 that Wikidata can tell us. 00:54:00.790 --> 00:54:02.200 And that's it. 00:54:02.200 --> 00:54:04.570 Now, is that everything that we could possibly 00:54:04.570 --> 00:54:07.780 say in a structured way about Harvey Milk? 00:54:07.780 --> 00:54:08.680 No. 00:54:08.680 --> 00:54:13.570 We could probably find at least a few more things to say. 00:54:13.570 --> 00:54:17.170 We will see how to contribute new information to Wikidata 00:54:17.170 --> 00:54:19.990 in just a minute with a different example. 00:54:19.990 --> 00:54:23.360 But this-- all this was a set of statements. 00:54:23.360 --> 00:54:23.860 Right? 00:54:23.860 --> 00:54:25.927 This was the title statements here. 00:54:28.840 --> 00:54:31.160 But at the bottom of the list of statements is 00:54:31.160 --> 00:54:34.300 another section called identifiers. 00:54:34.300 --> 00:54:36.960 And I want to spend a minute talking about what that is. 00:54:36.960 --> 00:54:43.630 So identifiers is a collection of keys. 00:54:43.630 --> 00:54:47.980 A collection of IDs, or codes, that 00:54:47.980 --> 00:54:52.890 are keys to other information sources. 00:54:52.890 --> 00:54:58.560 And a lot of Wikidata items have a whole series of keys 00:54:58.560 --> 00:55:03.030 to other databases, other sites, other repositories, 00:55:03.030 --> 00:55:08.340 that help you or a computer be able to access not just 00:55:08.340 --> 00:55:12.240 some database and look for information about Harvey Milk, 00:55:12.240 --> 00:55:16.950 but access the exact record relevant to Harvey Milk. 00:55:16.950 --> 00:55:20.280 And again, if you imagine someone named John Smith, 00:55:20.280 --> 00:55:21.690 that is really valuable, right? 
00:55:21.690 --> 00:55:23.250 If you're not just told, oh yeah, 00:55:23.250 --> 00:55:24.875 you can look at the Library of Congress 00:55:24.875 --> 00:55:27.840 for John Smith, good luck with that. 00:55:27.840 --> 00:55:30.240 Or if I tell you, go to the Library of Congress 00:55:30.240 --> 00:55:35.810 to this record for this John Smith, you see the difference. 00:55:35.810 --> 00:55:42.080 So Wikidata tells us that on VIAF, which is the Virtual 00:55:42.080 --> 00:55:44.570 International Authority File. 00:55:44.570 --> 00:55:50.140 It's an aggregated master index built by bibliographers, 00:55:50.140 --> 00:55:52.831 by librarians, of people. 00:55:52.831 --> 00:55:53.330 Right? 00:55:53.330 --> 00:55:56.720 It tries to kind of aggregate information about people 00:55:56.720 --> 00:55:59.270 across library catalogs everywhere. 00:55:59.270 --> 00:56:05.120 So the VIAF ID for Harvey Milk is this number. 00:56:05.120 --> 00:56:07.340 And conveniently, if I click that, 00:56:07.340 --> 00:56:10.160 I'm not taken to some Wikidata item. 00:56:10.160 --> 00:56:13.010 I'm actually taken to the relevant site. 00:56:13.010 --> 00:56:16.760 So this took me right to viaf.org, the Virtual 00:56:16.760 --> 00:56:21.770 International Authority File, directly to their record 00:56:21.770 --> 00:56:23.310 about Harvey Milk. 00:56:23.310 --> 00:56:23.810 All right? 00:56:23.810 --> 00:56:27.290 And that itself leads me to national catalogs 00:56:27.290 --> 00:56:29.630 of national libraries all over the world. 00:56:29.630 --> 00:56:32.360 We won't get into the things you can do with VIAF. 00:56:32.360 --> 00:56:37.220 The point is Wikidata contained the piece of thread 00:56:37.220 --> 00:56:40.820 that I could tug on to arrive directly 00:56:40.820 --> 00:56:44.840 to that information in other databases. 00:56:44.840 --> 00:56:45.680 Yes. 00:56:45.680 --> 00:56:49.670 And it has that for many, many kinds of databases.
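NOTE The way an identifier acts as a key to the exact record in another database can be sketched roughly as follows. This is a minimal illustration in Python, not how Wikidata is implemented: the URL_PATTERNS table, the record_url helper, the example catalog, and the sample ID are all hypothetical, and the VIAF pattern shown is an assumption about how viaf.org addresses its records.
```python
# Assumed URL patterns, with "$1" marking where the external ID is substituted.
# Both patterns and the sample ID are illustrative, not taken from the talk.
URL_PATTERNS = {
    "VIAF ID": "https://viaf.org/viaf/$1",
    "Example catalog ID": "https://catalog.example.org/record/$1",
}
def record_url(identifier_name: str, external_id: str) -> str:
    """Turn an (identifier, ID) pair into a direct link to that one record."""
    return URL_PATTERNS[identifier_name].replace("$1", external_id)
# Hypothetical ID: it leads straight to one record, with no searching by name.
print(record_url("VIAF ID", "12345"))
```
The design point is that the stored value is only the short ID; the link is derived from it, so a person or a program can jump to the right record without searching for a name like John Smith.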
00:56:49.670 --> 00:56:53.150 The BNF, for example, that's the National Library of France. 00:56:53.150 --> 00:56:56.270 And that will take me to that index card. 00:56:56.270 --> 00:56:57.320 IMDB. 00:56:57.320 --> 00:56:58.620 We all know IMDB, right? 00:56:58.620 --> 00:57:03.320 So here I have the key to Harvey Milk in IMDB. 00:57:03.320 --> 00:57:05.810 And this is what IMDB says about Harvey Milk, right? 00:57:05.810 --> 00:57:08.480 They have their own piece of information about him, 00:57:08.480 --> 00:57:11.590 of course, with filmography and everything else. 00:57:11.590 --> 00:57:15.140 And see, I did not have to search IMDB for it. 00:57:15.140 --> 00:57:19.070 I just had the key right there waiting for me. 00:57:19.070 --> 00:57:21.080 Now, again, this is very convenient for me 00:57:21.080 --> 00:57:24.590 as I just showed you the human use case for this. 00:57:24.590 --> 00:57:27.530 But it's even more powerful in aggregate 00:57:27.530 --> 00:57:35.450 when we allow computers to traverse this network of links 00:57:35.450 --> 00:57:36.110 between-- 00:57:36.110 --> 00:57:41.690 not just within Wikidata, but between data storage facilities 00:57:41.690 --> 00:57:43.850 and repositories. 00:57:43.850 --> 00:57:49.790 This is sometimes referred to as the linked open data cloud. 00:57:49.790 --> 00:57:52.670 Cloud, because it's multiple different repositories 00:57:52.670 --> 00:57:54.740 that are interlinked. 00:57:54.740 --> 00:58:02.210 And Wikidata is already, and to a growing extent, the nexus, 00:58:02.210 --> 00:58:04.460 the connection point between a lot 00:58:04.460 --> 00:58:06.780 of these different databases. 00:58:06.780 --> 00:58:09.230 So IMDB, for example, it's a good example 00:58:09.230 --> 00:58:11.300 because it's a site almost everyone knows, 00:58:11.300 --> 00:58:14.000 IMDB has information about Harvey Milk.
00:58:14.000 --> 00:58:16.670 But that information does not include a link 00:58:16.670 --> 00:58:19.140 to the French National Library. 00:58:19.140 --> 00:58:19.645 Right? 00:58:19.645 --> 00:58:20.770 Do you see what I'm saying? 00:58:20.770 --> 00:58:25.550 So IMDB is a data repository with IDs and allows linking. 00:58:25.550 --> 00:58:28.100 But it does not give you what Wikidata gives you which 00:58:28.100 --> 00:58:32.850 is this kind of collection of-- 00:58:32.850 --> 00:58:36.330 it's like a junction of all these different data sources. 00:58:36.330 --> 00:58:37.910 So Wikidata is the place where you 00:58:37.910 --> 00:58:40.730 can document these interrelationships 00:58:40.730 --> 00:58:41.640 or equivalencies. 00:58:41.640 --> 00:58:42.140 Right? 00:58:42.140 --> 00:58:48.770 So ID, you know, 587548 on IMDB is discussing the same topic 00:58:48.770 --> 00:58:52.260 as French National Library ID whatever. 00:58:52.260 --> 00:58:55.210 Wikidata contains that piece of information, 00:58:55.210 --> 00:58:59.090 that this ID in this database is about the same person 00:58:59.090 --> 00:59:04.050 as that ID in that database. 00:59:04.050 --> 00:59:05.290 OK. 00:59:05.290 --> 00:59:07.420 So that's what identifiers are about. 00:59:07.420 --> 00:59:11.320 Still scrolling down the Wikidata item about Harvey 00:59:11.320 --> 00:59:15.500 Milk, we have the site links. 00:59:15.500 --> 00:59:20.840 The site links are links to Wikimedia projects 00:59:20.840 --> 00:59:22.770 that are related to this item. 00:59:22.770 --> 00:59:25.250 So of course there are Wikipedia articles 00:59:25.250 --> 00:59:28.880 about Harvey Milk in many, many different Wikipedias. 00:59:28.880 --> 00:59:31.700 Quite a few language versions. 00:59:31.700 --> 00:59:34.960 And there are pages on Wikiquote, 00:59:34.960 --> 00:59:36.680 one of the sister projects. 00:59:36.680 --> 00:59:38.630 There are pages on Wikiquote with some quotes 00:59:38.630 --> 00:59:40.130 from Harvey Milk.
00:59:40.130 --> 00:59:45.060 And there is even a page for Harvey Milk on Wikisource. 00:59:45.060 --> 00:59:45.560 Right? 00:59:45.560 --> 00:59:47.840 So this is a collection of those links. 00:59:47.840 --> 00:59:52.760 And those of you who have maybe only dealt with Wikidata 00:59:52.760 --> 00:59:57.290 for inter-wiki links, which we used to do in the old days 00:59:57.290 --> 00:59:59.600 manually within the article text, 00:59:59.600 --> 01:00:01.716 now we do it through Wikidata, so maybe 01:00:01.716 --> 01:00:03.590 the only thing you did know about Wikidata 01:00:03.590 --> 01:00:10.130 is how to update these inter-wiki tables on Wikidata. 01:00:10.130 --> 01:00:11.430 All right. 01:00:11.430 --> 01:00:14.090 So that concludes our little tour 01:00:14.090 --> 01:00:18.560 of the anatomy of a Wikidata page. 01:00:18.560 --> 01:00:22.370 I will just remind you that it's a wiki page, which 01:00:22.370 --> 01:00:26.120 means it has a discussion page, a talk page. 01:00:26.120 --> 01:00:27.960 This one happens to be empty. 01:00:27.960 --> 01:00:30.092 But, you know, if we have concerns or arguments 01:00:30.092 --> 01:00:31.550 about some of the data here that is 01:00:31.550 --> 01:00:33.290 what we would use to discuss this 01:00:33.290 --> 01:00:36.830 and to arrive at consensus. 01:00:36.830 --> 01:00:41.760 It also has a history view just like every Wikipedia article. 01:00:41.760 --> 01:00:47.402 So you can see here a list of edits. 01:00:47.402 --> 01:00:48.860 Maybe some of you have never looked 01:00:48.860 --> 01:00:51.710 at a history page on Wikipedia, so this looks overwhelming. 01:00:51.710 --> 01:00:55.040 But every line here, every entry here, 01:00:55.040 --> 01:00:58.240 is a single edit, a single revision, a single change 01:00:58.240 --> 01:01:00.440 to this Wikidata item. 01:01:00.440 --> 01:01:01.670 Just Harvey Milk.
01:01:01.670 --> 01:01:04.250 And you can see at the very top this edit that I just 01:01:04.250 --> 01:01:06.680 made-- this is my volunteer account 01:01:06.680 --> 01:01:09.650 and I just made this edit, and in parentheses you 01:01:09.650 --> 01:01:10.790 can see what I did. 01:01:10.790 --> 01:01:14.640 I added an HE, Hebrew, description. 01:01:14.640 --> 01:01:16.930 And this is the text that I added in Hebrew. 01:01:16.930 --> 01:01:17.430 Right? 01:01:17.430 --> 01:01:21.470 So we can see who added what to the Wikidata item, 01:01:21.470 --> 01:01:24.960 just like we can do the same on Wikipedia. 01:01:24.960 --> 01:01:26.390 So we have the revision history. 01:01:26.390 --> 01:01:27.560 We can undo edits. 01:01:27.560 --> 01:01:30.320 We can revert, just like on Wikipedia. 01:01:34.420 --> 01:01:36.940 And what else did I want to show here? 01:01:36.940 --> 01:01:40.930 We can add an item to my watch list using the star, 01:01:40.930 --> 01:01:42.020 just like on Wikipedia. 01:01:42.020 --> 01:01:46.670 So we have all these standard wiki features 01:01:46.670 --> 01:01:47.878 that we would come to expect. 01:01:50.440 --> 01:01:54.270 Let's pause for questions. 01:01:54.270 --> 01:01:58.412 Any questions about what we've covered so far? 01:02:02.573 --> 01:02:03.073 Yes. 01:02:06.950 --> 01:02:11.345 Are attributes of statements preset for the specific value? 01:02:16.640 --> 01:02:19.830 No, they're not preset. 01:02:19.830 --> 01:02:29.760 And generally Wikidata does not enforce logic by default. 01:02:29.760 --> 01:02:32.130 So, I mean, there's nothing to prevent you 01:02:32.130 --> 01:02:38.700 from editing the item about Brazil, 01:02:38.700 --> 01:02:42.990 and adding the property height. 01:02:46.690 --> 01:02:50.430 Now height is not a relevant property for a country. 01:02:50.430 --> 01:02:50.970 Right? 01:02:50.970 --> 01:02:53.880 I mean, maybe average elevation, maybe.
01:02:53.880 --> 01:02:56.400 But not just height, which is used for humans 01:02:56.400 --> 01:02:59.040 or for physical things. 01:02:59.040 --> 01:03:02.400 So you could add that property to Brazil and save it 01:03:02.400 --> 01:03:04.650 and the wiki would not complain. 01:03:04.650 --> 01:03:07.590 Now in the background there are kind 01:03:07.590 --> 01:03:13.020 of extra-wiki, outside-the-wiki processes for constraint 01:03:13.020 --> 01:03:13.710 validation. 01:03:13.710 --> 01:03:16.050 So there are bots and other processes that 01:03:16.050 --> 01:03:17.940 run, and occasionally, for example, 01:03:17.940 --> 01:03:26.570 identify non-living things with a date of birth field. 01:03:26.570 --> 01:03:27.720 That's nonsensical. 01:03:27.720 --> 01:03:29.010 That should not exist. 01:03:29.010 --> 01:03:31.710 If someone mistakenly added that there are processes 01:03:31.710 --> 01:03:34.350 that would flag that to be fixed. 01:03:34.350 --> 01:03:36.690 But the wiki itself, Wikidata, will not 01:03:36.690 --> 01:03:38.550 prevent you from adding that. 01:03:38.550 --> 01:03:41.940 And that is by design to keep things flexible. 01:03:41.940 --> 01:03:43.930 So that people don't run into, oh wait, 01:03:43.930 --> 01:03:46.560 but I can't add this because nobody thought 01:03:46.560 --> 01:03:49.830 that I would need this, maybe. 01:03:49.830 --> 01:03:54.530 I hope that answers your question. 01:03:54.530 --> 01:03:57.290 You say helpful answer, question mark. 01:03:57.290 --> 01:03:59.510 So was it a helpful answer, or? 01:04:03.940 --> 01:04:04.440 OK. 01:04:04.440 --> 01:04:05.426 Yes, Eleanor. 01:04:05.426 --> 01:04:10.707 AUDIENCE: [INAUDIBLE] 01:04:10.707 --> 01:04:12.040 ASAF BARTOV: Excellent question. 01:04:12.040 --> 01:04:13.030 I'll repeat it. 01:04:13.030 --> 01:04:16.180 You ask how do I find the Wikidata item 01:04:16.180 --> 01:04:18.370 number from Wikipedia.
01:04:18.370 --> 01:04:21.580 If I'm reading about Harvey Milk and I want to look at the data 01:04:21.580 --> 01:04:23.600 how do I do that? 01:04:23.600 --> 01:04:27.400 That is an excellent question and let's skip to Wikipedia. 01:04:27.400 --> 01:04:32.030 Conveniently I have the link right here on English. 01:04:32.030 --> 01:04:35.600 So this is the Wikipedia article about Harvey Milk 01:04:35.600 --> 01:04:42.740 and every article on Wikipedia should have a Wikidata 01:04:42.740 --> 01:04:47.660 item associated with it, but it doesn't happen automatically. 01:04:47.660 --> 01:04:51.470 So if I just created a page on Wikipedia 01:04:51.470 --> 01:04:55.010 I also need to create a Wikidata entity for it 01:04:55.010 --> 01:04:57.170 if it doesn't already exist. 01:04:57.170 --> 01:04:59.420 It could already exist because it was already 01:04:59.420 --> 01:05:01.970 covered in a different language, for example. 01:05:01.970 --> 01:05:05.390 So that was parenthetical. 01:05:05.390 --> 01:05:09.020 But every article on Wikipedia should have, here on the side, 01:05:09.020 --> 01:05:14.270 on the side, under Tools, a link called Wikidata item. 01:05:14.270 --> 01:05:15.450 Right here. 01:05:15.450 --> 01:05:16.160 OK. 01:05:16.160 --> 01:05:18.110 That Wikidata item is a link 01:05:18.110 --> 01:05:21.710 that takes you to Wikidata, to the entity, 01:05:21.710 --> 01:05:23.510 and there you find the number. 01:05:23.510 --> 01:05:25.370 You can-- you don't even have to click it. 01:05:25.370 --> 01:05:27.830 I mean, the URL itself tells you the number. 01:05:27.830 --> 01:05:34.620 The number, you see, it's wikidata.org/wiki/Q17141. 01:05:34.620 --> 01:05:35.444 OK. 01:05:35.444 --> 01:05:36.860 So that was an excellent question. 01:05:36.860 --> 01:05:37.686 Other questions? 01:05:37.686 --> 01:05:38.185 Yes. 01:05:41.470 --> 01:05:44.430 Yeah, about the additional attributes, the qualifiers.
01:05:44.430 --> 01:05:46.920 So, yes, I answered more generically. 01:05:46.920 --> 01:05:49.370 But just like the properties themselves 01:05:49.370 --> 01:05:53.390 are not limited per item, the qualifiers per statement 01:05:53.390 --> 01:05:57.750 are also not entirely preordained. 01:05:57.750 --> 01:05:59.570 But there is some structure to it. 01:05:59.570 --> 01:06:03.140 I don't want to go into it at great length right now. 01:06:03.140 --> 01:06:06.320 If we have time in the end we can get back to that. 01:06:06.320 --> 01:06:09.590 But some qualifiers are again relevant for some things, 01:06:09.590 --> 01:06:13.180 start time, end time, and others won't be. 01:06:13.180 --> 01:06:16.280 Wikidata does try to offer you-- 01:06:16.280 --> 01:06:18.710 you may remember when I clicked add qualifier, 01:06:18.710 --> 01:06:22.170 it gave me kind of drop down of some relevant qualifiers. 01:06:22.170 --> 01:06:24.475 So it does try to help you in that way. 01:06:27.280 --> 01:06:28.160 Other question? 01:06:28.160 --> 01:06:31.180 Are the values for instance of already 01:06:31.180 --> 01:06:33.310 mappable to external ontologies? 01:06:36.500 --> 01:06:41.310 That is a complicated question. 01:06:41.310 --> 01:06:43.490 I'll help people understand the question first. 01:06:43.490 --> 01:06:48.570 So an ontology is a structure, some kind 01:06:48.570 --> 01:06:52.350 of hierarchy or cloud, of entities 01:06:52.350 --> 01:06:54.510 and their interrelationships. 01:06:54.510 --> 01:06:56.920 An ontology would say, for example, 01:06:56.920 --> 01:06:58.710 a person is a living thing. 01:06:58.710 --> 01:06:59.670 So is a dog. 01:06:59.670 --> 01:07:02.340 They're both living things, but they're different things. 01:07:02.340 --> 01:07:09.910 And then, you know, say things about those entities 01:07:09.910 --> 01:07:11.350 and their interrelationships. 
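NOTE The person and dog example above can be written down as a toy ontology. This is a minimal sketch in Python under stated assumptions: the SUBCLASS_OF table and the is_a helper are hypothetical names, and the class labels are illustrative rather than actual Wikidata items or any particular external ontology.
```python
# Toy ontology: each class points to its parent class (None at the top).
# Labels are illustrative; real ontologies use identifiers, not plain words.
SUBCLASS_OF = {
    "person": "living thing",
    "dog": "living thing",
    "living thing": None,
}
def is_a(cls: str, ancestor: str) -> bool:
    """Walk up the parent links to check whether cls falls under ancestor."""
    current = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False
```
Here is_a("person", "living thing") and is_a("dog", "living thing") both hold, while is_a("person", "dog") does not: both are living things, but they are different things.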
01:07:11.350 --> 01:07:13.300 Now there are many, many competing, 01:07:13.300 --> 01:07:17.230 or coexisting models of ontologies. 01:07:17.230 --> 01:07:19.840 Many of them were created for specific needs. 01:07:19.840 --> 01:07:25.170 Many of them want to be a universal ontology. 01:07:25.170 --> 01:07:27.790 But of course it's impossible to quite 01:07:27.790 --> 01:07:32.150 agree on one complete and simple ontology. 01:07:32.150 --> 01:07:34.240 And so there are many ontologies. 01:07:34.240 --> 01:07:38.520 Which brings up your question, can we map across ontologies? 01:07:38.520 --> 01:07:43.840 Can we say that when Wikidata says instance of book that 01:07:43.840 --> 01:07:47.260 is equivalent to some other ontology saying instance 01:07:47.260 --> 01:07:49.940 of bibliographic record? 01:07:49.940 --> 01:07:50.860 And the answer is yes. 01:07:50.860 --> 01:07:52.360 There are some such mappings. 01:07:52.360 --> 01:07:54.420 They are incomplete. 01:07:54.420 --> 01:07:58.240 And there's no kind of automagic thing happening 01:07:58.240 --> 01:08:01.180 in the wiki vis-a-vis those other ontologies. 01:08:01.180 --> 01:08:03.250 That's kind of left as an exercise 01:08:03.250 --> 01:08:06.280 for those dealing with those other ontologies, and for tool 01:08:06.280 --> 01:08:09.880 builders and other platform improvements 01:08:09.880 --> 01:08:13.050 beyond Wikidata itself. 01:08:13.050 --> 01:08:13.750 OK. 01:08:13.750 --> 01:08:15.190 Other questions? 01:08:15.190 --> 01:08:17.430 Yeah, we have one from the YouTube stream. 01:08:17.430 --> 01:08:21.160 Someone asked, why can't I link Howard Carter's occupation 01:08:21.160 --> 01:08:26.439 to archeologist when I use an infobox that fetches info 01:08:26.439 --> 01:08:28.960 from Wikidata? 01:08:28.960 --> 01:08:33.160 Why can't I link it from the infobox?
01:08:33.160 --> 01:08:35.500 So, someone on the stream answered 01:08:35.500 --> 01:08:37.659 saying, because it's an improper connection, 01:08:37.659 --> 01:08:39.700 because the target is not about the subject only. 01:08:43.020 --> 01:08:46.710 The target is not about the subject? 01:08:46.710 --> 01:08:48.479 If I understand the question correctly, 01:08:48.479 --> 01:08:53.130 what you would want to be able to do is from within Wikipedia 01:08:53.130 --> 01:08:59.130 be able to say occupation and link to a Wikidata entry 01:08:59.130 --> 01:09:01.050 about archeology. 01:09:01.050 --> 01:09:03.569 That doesn't quite work that way. 01:09:03.569 --> 01:09:05.430 We will get to a little discussion 01:09:05.430 --> 01:09:08.460 of that in an upcoming section of this talk. 01:09:08.460 --> 01:09:13.260 So I will defer the rest of my answer to then. 01:09:13.260 --> 01:09:15.319 OK. 01:09:15.319 --> 01:09:19.160 So we're done with questions for this phase, 01:09:19.160 --> 01:09:22.850 and my browser got tired of waiting for me. 01:09:22.850 --> 01:09:26.551 So, yes. 01:09:26.551 --> 01:09:27.050 All right. 01:09:27.050 --> 01:09:36.850 So we took a look at Wikidata, and we took questions. 01:09:36.850 --> 01:09:41.020 So now, let's teach Wikidata some new things. 01:09:41.020 --> 01:09:44.020 Some things it doesn't already know. 01:09:44.020 --> 01:09:47.109 Let's look at this item here. 01:09:47.109 --> 01:09:50.950 So this item is about one of my favorite writers, 01:09:50.950 --> 01:09:53.840 an American writer named Helen Dewitt. 01:09:53.840 --> 01:10:01.570 Wikidata, of course, fondly refers to her as q54674, 01:10:01.570 --> 01:10:03.070 but we can call her Helen Dewitt. 01:10:03.070 --> 01:10:05.740 And what can we contribute here? 01:10:05.740 --> 01:10:10.600 So Wikidata has far less information about Helen Dewitt. 01:10:10.600 --> 01:10:13.144 Most of you probably haven't heard of her, that's OK. 
01:10:13.144 --> 01:10:14.560 What does Wikidata know about her? 01:10:14.560 --> 01:10:16.450 Well instance of human. 01:10:16.450 --> 01:10:17.800 We have a photo of her. 01:10:17.800 --> 01:10:18.780 She's female. 01:10:18.780 --> 01:10:20.530 She's an American. 01:10:20.530 --> 01:10:21.790 Her name is Helen. 01:10:21.790 --> 01:10:22.630 Date of birth. 01:10:22.630 --> 01:10:23.650 Place of birth. 01:10:23.650 --> 01:10:25.970 She's an author, a novelist, a writer. 01:10:25.970 --> 01:10:28.840 She was educated at the University of Oxford. 01:10:28.840 --> 01:10:33.160 And Wikidata knows what her official website is. 01:10:33.160 --> 01:10:35.780 That's useful, but that's it. 01:10:35.780 --> 01:10:37.780 Now we can contribute information here. 01:10:37.780 --> 01:10:43.120 For example, she's an American author writing in English. 01:10:43.120 --> 01:10:45.550 So we could add that information. 01:10:45.550 --> 01:10:48.430 We could click the Add button here. 01:10:48.430 --> 01:10:50.200 And this is a good moment to acknowledge 01:10:50.200 --> 01:10:54.830 that the user interface of Wikidata is a work in progress. 01:10:54.830 --> 01:10:56.740 It's not as intuitive as it might be. 01:10:56.740 --> 01:10:58.570 So you need to understand that click-- 01:10:58.570 --> 01:11:01.630 to add a completely new property, 01:11:01.630 --> 01:11:04.060 You need to click this Add button. 01:11:04.060 --> 01:11:08.020 If you want to add an additional value to the property official 01:11:08.020 --> 01:11:11.530 website, you need to click this Add button. 01:11:11.530 --> 01:11:13.780 It makes a kind of sense with a shaded box. 01:11:13.780 --> 01:11:15.880 But, you know, you need to kind of pay attention, 01:11:15.880 --> 01:11:18.901 and it's not as friendly as it might be. 01:11:18.901 --> 01:11:20.650 [COUGHING] Excuse me. 01:11:20.650 --> 01:11:23.380 So, let's add a property here. 01:11:23.380 --> 01:11:25.690 Click the Add button. 
01:11:25.690 --> 01:11:29.740 Again, Wikidata tries to be useful by suggesting 01:11:29.740 --> 01:11:32.760 some relevant properties for humans. 01:11:32.760 --> 01:11:36.640 A bit more morbidly it suggests, how about date of death? 01:11:36.640 --> 01:11:38.700 That's not cool, Wikidata. 01:11:38.700 --> 01:11:40.480 Helen Dewitt is still alive. 01:11:40.480 --> 01:11:42.700 So I will not add date of death, but I 01:11:42.700 --> 01:11:46.140 can add languages spoken, written, or signed. 01:11:46.140 --> 01:11:48.370 OK, so I click that. 01:11:48.370 --> 01:11:51.670 And she writes in English. 01:11:51.670 --> 01:11:54.450 I just type English-- whoops. 01:11:54.450 --> 01:11:56.750 Not in Hebrew. 01:11:56.750 --> 01:11:58.380 Don't panic. 01:11:58.380 --> 01:12:01.010 I type English here. 01:12:01.010 --> 01:12:04.250 And, oh, and of course Wikidata has auto-complete, right? 01:12:04.250 --> 01:12:06.080 So it tries to help me along. 01:12:06.080 --> 01:12:10.100 But you will notice that it has all kinds of things 01:12:10.100 --> 01:12:10.940 called English. 01:12:10.940 --> 01:12:14.030 I mean, it turns out that there is a place in Indiana 01:12:14.030 --> 01:12:16.370 called English, Indiana. 01:12:16.370 --> 01:12:17.150 Did I mean that? 01:12:17.150 --> 01:12:20.210 No, of course I didn't mean that she writes her books 01:12:20.210 --> 01:12:21.961 in English, Indiana. 01:12:21.961 --> 01:12:22.460 Right? 01:12:22.460 --> 01:12:26.180 But, you know, Wikidata gives me the option of linking to that. 01:12:26.180 --> 01:12:30.530 I also don't mean the botanist Carl Schwartz English. 01:12:30.530 --> 01:12:32.870 No, no I mean the west Germanic language 01:12:32.870 --> 01:12:34.029 originating in England. 01:12:34.029 --> 01:12:34.820 That's what I mean. 01:12:34.820 --> 01:12:36.110 So I click that. 01:12:36.110 --> 01:12:37.760 And I click Save. 01:12:37.760 --> 01:12:38.450 And that's it. 
01:12:38.450 --> 01:12:41.780 Again I have just made an edit to Wikidata. 01:12:41.780 --> 01:12:47.750 I have just taught Wikidata that this author speaks English. 01:12:47.750 --> 01:12:50.370 Now, again, this may be very obvious. 01:12:50.370 --> 01:12:52.280 She's American. 01:12:52.280 --> 01:12:54.560 Of course not all Americans write in English. 01:12:54.560 --> 01:12:56.930 It may be obvious if you look at her books. 01:12:56.930 --> 01:12:59.060 The important thing is that now Wikidata 01:12:59.060 --> 01:13:02.090 knows this as a piece of data. 01:13:02.090 --> 01:13:04.610 And, again, think ahead to queries, which we will 01:13:04.610 --> 01:13:06.980 demonstrate in a little bit. 01:13:06.980 --> 01:13:09.000 Without this piece of information 01:13:09.000 --> 01:13:14.060 that I just added, if I were to ask Wikidata five minutes ago, 01:13:14.060 --> 01:13:19.760 give me a list of novelists writing in English, OK, 01:13:19.760 --> 01:13:22.730 Wikidata would have returned thousands of results. 01:13:22.730 --> 01:13:27.600 But Helen Dewitt would not have been among them. 01:13:27.600 --> 01:13:32.000 Because up until two minutes ago Wikidata 01:13:32.000 --> 01:13:35.640 didn't know that Helen Dewitt writes in English and not 01:13:35.640 --> 01:13:37.520 in Spanish. 01:13:37.520 --> 01:13:38.730 Do you see? 01:13:38.730 --> 01:13:42.570 It is this explicit statement that will now 01:13:42.570 --> 01:13:46.560 make her be included in any future queries that asks, 01:13:46.560 --> 01:13:48.700 who are novelists writing in English? 01:13:53.250 --> 01:13:54.500 OK. 01:13:54.500 --> 01:13:58.560 By the way, she's a PhD in Classics. 01:13:58.560 --> 01:14:05.590 She speaks-- or at least reads and writes Latin and Greek, 01:14:05.590 --> 01:14:07.270 ancient Greek, and I could-- 01:14:07.270 --> 01:14:09.610 I can-- I mean, I happen to know that. 01:14:09.610 --> 01:14:12.420 But wait, wait, wait, wait, wait, you say. 
01:14:12.420 --> 01:14:14.130 What about original research? 01:14:14.130 --> 01:14:18.890 I mean, you can't just add stuff like that to Wikidata. 01:14:18.890 --> 01:14:19.920 Don't you need sources? 01:14:19.920 --> 01:14:22.860 Citations? 01:14:22.860 --> 01:14:23.890 Of course I do. 01:14:23.890 --> 01:14:25.020 Yes. 01:14:25.020 --> 01:14:27.720 Let's add some sources to this. 01:14:27.720 --> 01:14:31.410 So on Wikidata, just like Wikipedia, 01:14:31.410 --> 01:14:34.980 things should generally be supported by citations, 01:14:34.980 --> 01:14:36.990 by references. 01:14:36.990 --> 01:14:43.290 And just like Wikipedia, they aren't always supported 01:14:43.290 --> 01:14:44.650 in that way. 01:14:44.650 --> 01:14:48.870 OK so, I mean, I can just add it to Wikidata. 01:14:48.870 --> 01:14:49.442 Watch me. 01:14:49.442 --> 01:14:50.400 I just did that, right? 01:14:50.400 --> 01:14:54.450 I just added English and Latin without any citation, 01:14:54.450 --> 01:14:56.850 and I will not be arrested for it. 01:14:56.850 --> 01:14:59.520 Just like I could edit a Wikipedia article 01:14:59.520 --> 01:15:02.610 and add some information without a citation. 01:15:02.610 --> 01:15:03.600 It may stick. 01:15:03.600 --> 01:15:06.810 It may stay in the article, or it may be reverted. 01:15:06.810 --> 01:15:11.010 It depends on the kind of information I'm adding. 01:15:11.010 --> 01:15:13.740 It depends how many people are paying attention 01:15:13.740 --> 01:15:15.060 to the article on Wikipedia. 01:15:15.060 --> 01:15:18.420 And it works the same way on Wikidata. 01:15:18.420 --> 01:15:21.780 OK, so, you can add some things without references. 01:15:21.780 --> 01:15:23.970 Ideally, when you add information, you 01:15:23.970 --> 01:15:25.570 should include references. 01:15:25.570 --> 01:15:30.990 So let's be good Wikidata citizens and add a source. 01:15:30.990 --> 01:15:34.395 Here is an article that I prepared in advance.
01:15:38.100 --> 01:15:39.370 This is Helen Dewitt. 01:15:39.370 --> 01:15:44.450 And in this article, somewhere, it actually 01:15:44.450 --> 01:15:51.770 says right at the bottom here, see, 01:15:51.770 --> 01:15:54.990 Dewitt knows, in descending order of proficiency, Latin, 01:15:54.990 --> 01:15:57.010 ancient Greek, French, German, Spanish, 01:15:57.010 --> 01:15:59.460 and Portuguese, Dutch, Danish, Norwegian, Swedish, Arabic, 01:15:59.460 --> 01:16:01.680 Hebrew and Japanese. 01:16:01.680 --> 01:16:04.770 This may sound excessive, but it's true. 01:16:04.770 --> 01:16:06.330 I met this woman. 01:16:06.330 --> 01:16:09.670 So anyway, we don't have to include all of that. 01:16:09.670 --> 01:16:13.050 The point is this article from a reasonably reliable source, 01:16:13.050 --> 01:16:15.840 this magazine, this interview, can 01:16:15.840 --> 01:16:19.270 count as a source for the languages she speaks. 01:16:19.270 --> 01:16:20.700 So I copy the URL. 01:16:20.700 --> 01:16:23.130 I just copied off my browser. 01:16:23.130 --> 01:16:27.530 And, whoops-- that's not-- 01:16:27.530 --> 01:16:28.580 here we go. 01:16:28.580 --> 01:16:31.610 And I can just add a reference here 01:16:31.610 --> 01:16:34.670 to the information that I just added to Wikidata, right? 01:16:34.670 --> 01:16:38.300 I can click Add Reference. 01:16:38.300 --> 01:16:45.800 And then just say the reference URL is, and I just paste. 01:16:45.800 --> 01:16:48.840 I paste this URL. 01:16:48.840 --> 01:16:50.160 Hit Enter. 01:16:50.160 --> 01:16:51.060 And that's it. 01:16:51.060 --> 01:16:55.380 And now the fact that she speaks Latin has a reference. 01:16:55.380 --> 01:16:58.320 If you look at the other things here on Wikidata, 01:16:58.320 --> 01:17:02.660 you can see that these IDs, for example, have references, too. 01:17:02.660 --> 01:17:03.420 Right? 
01:17:03.420 --> 01:17:06.570 In this case, the reference just says, excuse me-- 01:17:14.760 --> 01:17:18.600 In this case it just says imported from English Wikipedia. 01:17:18.600 --> 01:17:24.970 But wait, you say, can Wikipedia be a source? 01:17:24.970 --> 01:17:26.620 Not properly, no. 01:17:26.620 --> 01:17:30.100 I mean, just like Wikipedia itself doesn't cite itself. 01:17:30.100 --> 01:17:33.790 We don't say, this person was born in this city. 01:17:33.790 --> 01:17:34.870 How do we know? 01:17:34.870 --> 01:17:37.210 We read it on Wikipedia in another language. 01:17:37.210 --> 01:17:39.610 That's not a good citation. 01:17:39.610 --> 01:17:41.400 It's not a good citation for Wikidata 01:17:41.400 --> 01:17:45.040 either, so why do we put it here? 01:17:45.040 --> 01:17:49.240 Well, you can see the qualifier here is different, right? 01:17:49.240 --> 01:17:53.535 It's not reference URL, which is what I put in for Latin here. 01:18:17.020 --> 01:18:20.320 It's not reference URL here, it's a different qualifier. 01:18:20.320 --> 01:18:23.020 It says-- saying, imported from. 01:18:23.020 --> 01:18:25.960 So this is not an actual reference that 01:18:25.960 --> 01:18:27.610 supports this piece of data. 01:18:27.610 --> 01:18:30.730 It just shows where this data came from. 01:18:30.730 --> 01:18:33.670 It's a slightly different thing, because this data was 01:18:33.670 --> 01:18:37.210 mass imported into Wikidata. 01:18:37.210 --> 01:18:40.960 So it wasn't input by hand by some volunteer. 01:18:40.960 --> 01:18:44.770 It was imported into Wikidata en masse by a script, 01:18:44.770 --> 01:18:46.180 by a program. 01:18:46.180 --> 01:18:49.820 And we want to know, where did this number come from? 01:18:49.820 --> 01:18:51.440 Well, it came from English Wikipedia. 
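NOTE The difference between "imported from" and a real citation can be checked mechanically. A minimal sketch, assuming each reference is modeled as a small dictionary keyed by property ID: P143 (imported from Wikimedia project), P854 (reference URL) and Q328 (English Wikipedia) are real Wikidata IDs, while the function name and the example.org URL are made up for illustration.

```python
IMPORTED_FROM = "P143"  # real property: "imported from Wikimedia project"


def has_supporting_reference(references):
    """True if at least one reference contains something other than
    a bare "imported from" provenance marker."""
    return any(set(reference) - {IMPORTED_FROM} for reference in references)


# Provenance only: the value was bulk-imported from English Wikipedia (Q328).
imported_only = [{"P143": "Q328"}]
# A reference URL (P854) pointing at an outside source; placeholder address.
with_url = [{"P143": "Q328"}, {"P854": "https://example.org/interview"}]
```

A tool could use a check like this to list statements that still need a proper source.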
01:18:51.440 --> 01:18:54.130 So again, that's not a proper reference 01:18:54.130 --> 01:18:56.200 for the validity of the information, 01:18:56.200 --> 01:18:59.200 but it does at least tell us it came from English Wikipedia. 01:18:59.200 --> 01:19:03.460 We can click and look on English Wikipedia and find out. 01:19:03.460 --> 01:19:05.230 Maybe there's a footnote there that 01:19:05.230 --> 01:19:08.970 says where it did come from. 01:19:08.970 --> 01:19:11.000 OK. 01:19:11.000 --> 01:19:15.320 So this was an example of teaching Wikidata something 01:19:15.320 --> 01:19:16.910 that it didn't know. 01:19:16.910 --> 01:19:18.512 Something about the languages. 01:19:18.512 --> 01:19:20.720 And of course I could add this reference for English. 01:19:20.720 --> 01:19:23.210 I could add all the other languages that she speaks. 01:19:23.210 --> 01:19:26.060 And I won't bore you with that, but that is basically 01:19:26.060 --> 01:19:27.050 how it's done. 01:19:27.050 --> 01:19:29.720 So you click this Add to add a completely new-- 01:19:32.650 --> 01:19:34.030 completely new statement. 01:19:34.030 --> 01:19:36.250 Now, by the way, the fact that these are the only two 01:19:36.250 --> 01:19:39.220 suggestions that Wikidata can think of 01:19:39.220 --> 01:19:42.100 doesn't mean these are the only options. 01:19:42.100 --> 01:19:46.750 OK, you can just type anything that may be relevant. 01:19:46.750 --> 01:19:50.950 We could add, for example, award. 01:19:50.950 --> 01:19:52.570 Just start typing award. 01:19:52.570 --> 01:19:54.910 And here I have a bunch of properties 01:19:54.910 --> 01:19:56.510 that are relevant for awards. 01:19:56.510 --> 01:20:00.100 Awards received, together with, conferred by, right? 01:20:00.100 --> 01:20:05.790 There's all kinds of properties that I could rely on. 01:20:05.790 --> 01:20:09.600 And of course there is a list of all the properties of Wikidata. 01:20:09.600 --> 01:20:11.580 And that list is also sorted by type. 
01:20:11.580 --> 01:20:15.480 So yes, there is a list of properties relevant to people 01:20:15.480 --> 01:20:17.130 so that you don't have to guess. 01:20:17.130 --> 01:20:18.660 But a surprising amount of the time 01:20:18.660 --> 01:20:22.760 you can just start typing and get the right properties 01:20:22.760 --> 01:20:25.340 suggested to you. 01:20:25.340 --> 01:20:27.230 OK. 01:20:27.230 --> 01:20:33.050 So we taught Wikidata something new, 01:20:33.050 --> 01:20:38.980 and now let's teach Wikidata something completely new. 01:20:38.980 --> 01:20:39.480 Right? 01:20:39.480 --> 01:20:42.480 So how do we create a new Wikidata item? 01:20:42.480 --> 01:20:46.880 So, like I said, if I created a Wikipedia article 01:20:46.880 --> 01:20:49.520 about something that was not previously covered 01:20:49.520 --> 01:20:53.540 on any other Wikipedia, chances are 01:20:53.540 --> 01:20:57.170 there would not be an already existing Wikidata item. 01:20:57.170 --> 01:21:03.190 Sometimes there might be, because Wikidata 01:21:03.190 --> 01:21:06.857 does have 25 million entities. 01:21:06.857 --> 01:21:08.190 But sometimes there wouldn't be. 01:21:08.190 --> 01:21:10.148 So, first of all, I could search for it, right? 01:21:10.148 --> 01:21:14.210 So I could go to Wikidata, to the search box 01:21:14.210 --> 01:21:17.390 here, and just start typing, and search for what I want, right? 01:21:17.390 --> 01:21:20.690 So if I'm searching for Helen Dewitt I just say Helen, 01:21:20.690 --> 01:21:25.590 and I can see whether or not it exists. 01:21:25.590 --> 01:21:29.240 And there's a detailed search results page, et cetera, 01:21:29.240 --> 01:21:33.074 where I can find out if the item does exist or not. 01:21:33.074 --> 01:21:35.240 Excuse me, this reminds me of a very important thing 01:21:35.240 --> 01:21:36.620 I wanted to demonstrate, and that 01:21:36.620 --> 01:21:42.710 is the multilingualism of Wikidata. 
01:21:42.710 --> 01:21:49.340 So remember all these labels in other languages. 01:21:49.340 --> 01:21:54.390 Wikidata knows what to call Helen Dewitt in Hebrew. 01:21:54.390 --> 01:22:00.800 And it will show it to Wikidata users whose language is Hebrew. 01:22:00.800 --> 01:22:04.220 Mine is set to English, for your sake. 01:22:04.220 --> 01:22:08.830 But if I change this I go to Preferences here and change 01:22:08.830 --> 01:22:09.740 my language. 01:22:09.740 --> 01:22:15.475 [INAUDIBLE] All right, and I hit Save. 01:22:15.475 --> 01:22:20.350 Wikidata will start talking to me in Hebrew. 01:22:20.350 --> 01:22:23.090 Now brace yourselves. 01:22:23.090 --> 01:22:24.620 Are you ready? 01:22:24.620 --> 01:22:28.430 Don't panic, it's right to left. 01:22:28.430 --> 01:22:32.630 Oh my god everything is topsy-turvy. 01:22:32.630 --> 01:22:36.590 So this is the same article in Hebrew. 01:22:36.590 --> 01:22:39.290 So the sidebar has switched direction, 01:22:39.290 --> 01:22:41.300 and I know most of you cannot read it. 01:22:41.300 --> 01:22:42.480 Bear with me. 01:22:42.480 --> 01:22:44.750 This is the label that we previously 01:22:44.750 --> 01:22:46.840 saw in the label box. 01:22:46.840 --> 01:22:49.580 This is how you spell Helen Dewitt in Hebrew. 01:22:49.580 --> 01:22:52.550 And here is the description in Hebrew. 01:22:52.550 --> 01:22:54.980 It's not the description in English, this description, 01:22:54.980 --> 01:22:57.380 American writer, which I was shown previously. 01:22:57.380 --> 01:23:00.740 Now I'm shown the Hebrew description, appropriately. 01:23:00.740 --> 01:23:03.500 But more interestingly, oh my god! 01:23:03.500 --> 01:23:07.640 All these statements are suddenly in Hebrew. 01:23:07.640 --> 01:23:08.940 How did that happen? 01:23:11.570 --> 01:23:15.560 Well this tiny word here is the very concise way 01:23:15.560 --> 01:23:22.450 to say in Hebrew, instance of, and this word here means human. 
01:23:22.450 --> 01:23:25.960 So these are links to the same things, right? 01:23:25.960 --> 01:23:28.100 It still links to Q5. 01:23:28.100 --> 01:23:31.780 Q5 is the Wikidata entity for human. 01:23:31.780 --> 01:23:33.370 These are still the same things. 01:23:33.370 --> 01:23:37.600 But because Wikidata has multiple labels for everything, 01:23:37.600 --> 01:23:39.580 it has multiple labels for items. 01:23:39.580 --> 01:23:42.760 And it also has multiple labels for property names. 01:23:42.760 --> 01:23:46.450 So Wikidata knows how to say, instance of, 01:23:46.450 --> 01:23:50.140 and award received, in other languages. 01:23:50.140 --> 01:23:54.490 That is why it is able to show me all this data in Hebrew 01:23:54.490 --> 01:23:59.890 even if none of that data was actually input into Wikidata 01:23:59.890 --> 01:24:01.870 by a Hebrew speaker. 01:24:01.870 --> 01:24:04.900 That data could have been input by English speakers, 01:24:04.900 --> 01:24:08.230 but thanks to the fact that someone once 01:24:08.230 --> 01:24:12.760 translated the word photo into Hebrew, 01:24:12.760 --> 01:24:14.830 I can see this field in Hebrew. 01:24:17.750 --> 01:24:21.230 So one of the things you can do to help Wikidata, 01:24:21.230 --> 01:24:23.600 right now, without any special knowledge 01:24:23.600 --> 01:24:26.210 is to help translate those labels. 01:24:26.210 --> 01:24:29.030 Every label only needs to be translated just once. 01:24:29.030 --> 01:24:31.310 So you can see that all of these properties, date 01:24:31.310 --> 01:24:34.720 of birth, name et cetera, they all have Hebrew labels. 01:24:34.720 --> 01:24:36.760 Maybe one of these would not. 01:24:36.760 --> 01:24:38.361 No, they all have Hebrew labels. 01:24:38.361 --> 01:24:39.110 Doing pretty good. 01:24:42.960 --> 01:24:45.810 And I'm able to search in my own language. 01:24:45.810 --> 01:24:48.210 I'm able to click Add. 
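NOTE The reason every statement appears in Hebrew is that items and properties each store one label per language, and the interface simply looks up the label for the viewer's language. A small sketch of that lookup, assuming a simplified entity dictionary: Q5 is the real ID for human, the three labels are given for illustration, and the fallback rule shown here (try English, then the bare ID) is an assumption and not Wikidata's actual fallback chain.

```python
def label_for(entity, language, fallback="en"):
    """Return the entity's label in `language`, else in `fallback`,
    else the bare entity ID."""
    labels = entity.get("labels", {})
    return labels.get(language) or labels.get(fallback) or entity["id"]


# Illustrative labels for the "human" item; one entry per language.
human = {
    "id": "Q5",
    "labels": {"en": "human", "de": "Mensch", "es": "ser humano"},
}
```

Because the statement itself only stores the IDs, translating a label once makes it visible on every item that uses it.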
01:24:48.210 --> 01:24:49.890 This word is Add, so I click this, 01:24:49.890 --> 01:24:51.780 and now I have the Add screen. 01:24:51.780 --> 01:24:55.860 It all speaks my language, and it's awesome. 01:24:55.860 --> 01:25:00.330 And now for your sake I will switch back to English, 01:25:00.330 --> 01:25:03.090 but it is important to know you can 01:25:03.090 --> 01:25:05.740 edit Wikidata in any language. 01:25:05.740 --> 01:25:09.050 And it is far more multilingual and multilingual-friendly 01:25:09.050 --> 01:25:13.260 than, for example, Commons, which is also a project we all share. 01:25:13.260 --> 01:25:17.730 But Commons has some limitations on how multilingual it is. 01:25:17.730 --> 01:25:21.410 For example, the category names, et cetera. 01:25:21.410 --> 01:25:23.270 OK. 01:25:23.270 --> 01:25:25.670 So we were beginning to discuss creating 01:25:25.670 --> 01:25:27.140 something completely new. 01:25:27.140 --> 01:25:29.360 AUDIENCE: Quick questions, if that's OK? 01:25:29.360 --> 01:25:30.980 So there's two questions on IRC. 01:25:30.980 --> 01:25:33.890 The first one is, can you show search for something 01:25:33.890 --> 01:25:35.420 like getting the list of things? 01:25:35.420 --> 01:25:38.360 I want to learn how to search for something properly, like, 01:25:38.360 --> 01:25:43.705 show me all the items with this value of this property. 01:25:43.705 --> 01:25:45.080 ASAF BARTOV: Yes. 01:25:45.080 --> 01:25:47.540 That is part of this talk, but I'll 01:25:47.540 --> 01:25:49.250 get to that a little bit later. 01:25:49.250 --> 01:25:52.010 There's a whole section where I will demonstrate the very, very 01:25:52.010 --> 01:25:55.190 powerful query system of Wikidata, 01:25:55.190 --> 01:25:57.170 where I will cash that check that I gave 01:25:57.170 --> 01:25:59.090 at the beginning, of all these painters 01:25:59.090 --> 01:26:01.029 who are sons of painters queries, et cetera. 01:26:01.029 --> 01:26:02.570 So I will demonstrate how to do that. 
01:26:02.570 --> 01:26:04.190 AUDIENCE: Other question. 01:26:04.190 --> 01:26:07.250 How does Wikidata deal with link rot, and other issues 01:26:07.250 --> 01:26:09.680 stemming from bare URL refs? 01:26:13.528 --> 01:26:16.290 ASAF BARTOV: URLs break. 01:26:16.290 --> 01:26:18.730 We call that link rot. 01:26:18.730 --> 01:26:22.470 Wikidata doesn't have any particular magic 01:26:22.470 --> 01:26:24.730 around link rot, just like Wikipedia. 01:26:24.730 --> 01:26:29.100 So if you do use a bare URL it may well rot. 01:26:29.100 --> 01:26:34.230 But you can add qualifiers with backup URLs elsewhere, 01:26:34.230 --> 01:26:37.680 on the Internet Archive or another mirroring service. 01:26:37.680 --> 01:26:42.780 And potentially that could be a software feature for Wikidata 01:26:42.780 --> 01:26:46.590 to automatically save or ensure that something 01:26:46.590 --> 01:26:48.660 is saved on Internet Archive, but I don't 01:26:48.660 --> 01:26:50.670 know that it is doing so now. 01:26:50.670 --> 01:26:56.040 So, just like Wikipedia, if it is a bare URL it may rot. 01:26:56.040 --> 01:27:00.240 And may need to be replaced, possibly by bot. 01:27:00.240 --> 01:27:01.390 Other questions? 01:27:09.840 --> 01:27:12.650 All right, so let's talk about how you 01:27:12.650 --> 01:27:15.090 create a completely new item. 01:27:15.090 --> 01:27:16.300 It's very simple. 01:27:16.300 --> 01:27:21.810 You go to Wikidata and you click here on the side. 01:27:21.810 --> 01:27:30.180 There's a link, create new item, which gives you this screen. 01:27:30.180 --> 01:27:35.030 And let's create an item about a book 01:27:35.030 --> 01:27:39.500 that I'm reading right now by this Bulgarian writer. 01:27:39.500 --> 01:27:43.950 So we have an article about this writer guy named Deyan Enev. 01:27:43.950 --> 01:27:48.530 But we don't have an article or a Wikidata item 01:27:48.530 --> 01:28:07.980 about one of his famous books called Circus Bulgaria. 
01:28:07.980 --> 01:28:10.050 That's the book I'm reading, his first collection 01:28:10.050 --> 01:28:11.216 of short stories in English. 01:28:11.216 --> 01:28:14.280 Circus Bulgaria came out in 2010, Portobello Books, 01:28:14.280 --> 01:28:17.099 translated by Kapka Kassabova. 01:28:17.099 --> 01:28:18.390 So that's the book I'm reading. 01:28:18.390 --> 01:28:20.520 As you can see it's not a link on Wikipedia. 01:28:20.520 --> 01:28:23.370 There's no article about it, and there's not even 01:28:23.370 --> 01:28:26.310 a Wikidata entity item about it. 01:28:26.310 --> 01:28:32.220 But we can totally create it, even without a Wikipedia 01:28:32.220 --> 01:28:33.090 article. 01:28:33.090 --> 01:28:34.980 So let's create this new item. 01:28:34.980 --> 01:28:37.260 Let's create it in English for the purposes 01:28:37.260 --> 01:28:38.880 of our demonstration. 01:28:38.880 --> 01:28:44.910 The name of the item is Circus Bulgaria. 01:28:44.910 --> 01:28:47.520 Circus Bulgaria, that's the name. 01:28:47.520 --> 01:28:50.670 Not Circus Bulgaria parentheses book, 01:28:50.670 --> 01:28:53.520 or anything you may be used to from Wikipedia. 01:28:53.520 --> 01:28:56.520 It's the actual name of the book, 01:28:56.520 --> 01:29:00.450 and the description, again, remember, 01:29:00.450 --> 01:29:03.270 the description field is just to kind of help 01:29:03.270 --> 01:29:08.681 tell apart this Circus Bulgaria from any other potential Circus 01:29:08.681 --> 01:29:09.180 Bulgaria. 01:29:09.180 --> 01:29:11.280 Maybe there's a film or something. 01:29:11.280 --> 01:29:20.480 So it's enough to just say something like short story 01:29:20.480 --> 01:29:23.270 collection. 01:29:23.270 --> 01:29:27.830 I might add by Deyan Enev and if just in case, again, 01:29:27.830 --> 01:29:31.910 some future other short story collection by some other author 01:29:31.910 --> 01:29:33.560 happens to have that same name. 01:29:33.560 --> 01:29:36.391 That should be disambiguating enough. 
01:29:36.391 --> 01:29:36.890 OK. 01:29:36.890 --> 01:29:39.770 Short story collection by Deyan Enev. 01:29:39.770 --> 01:29:42.050 I could have aliases for this. 01:29:42.050 --> 01:29:47.240 The aliases assist findability. 01:29:47.240 --> 01:29:51.020 This particular book has just this one name, so that's fine. 01:29:51.020 --> 01:29:52.260 And I click Create. 01:29:52.260 --> 01:29:52.760 That's it. 01:29:52.760 --> 01:29:55.990 I just start with a label, and a description. 01:29:55.990 --> 01:29:58.740 I click Create. 01:29:58.740 --> 01:30:03.890 I have a brand new Q number for my new Wikidata item. 01:30:03.890 --> 01:30:05.960 And Wikidata knows what to call it. 01:30:05.960 --> 01:30:09.320 And a description in one language at least. 01:30:09.320 --> 01:30:11.930 And that's it, and I can start populating it. 01:30:11.930 --> 01:30:15.050 As you can see, it has no site links, 01:30:15.050 --> 01:30:17.450 but it's ready to be taught. 01:30:17.450 --> 01:30:20.450 So, for example, I can start by teaching 01:30:20.450 --> 01:30:24.610 it the name of the book in another language 01:30:24.610 --> 01:30:25.870 that I happen to speak. 01:30:29.050 --> 01:30:31.720 Now it has two labels, in English and Hebrew. 01:30:31.720 --> 01:30:36.880 I could also look up the book [INAUDIBLE], 01:30:36.880 --> 01:30:39.510 the original Bulgarian label for this book. 01:30:39.510 --> 01:30:41.550 Seems relevant. 01:30:41.550 --> 01:30:43.320 Again, I do not speak Bulgarian. 01:30:43.320 --> 01:30:49.860 But I can go to the Bulgarian Wikipedia through inter-wiki. 01:30:49.860 --> 01:30:51.510 This is this gentleman. 01:30:51.510 --> 01:30:54.510 And I could find-- 01:30:54.510 --> 01:30:59.190 I can read Cyrillic so I could easily find-- 01:30:59.190 --> 01:31:00.030 when I say easily-- 01:31:02.940 --> 01:31:05.710 when I say easily-- 01:31:05.710 --> 01:31:12.731 maybe not so easy, but I can search for it. 01:31:21.070 --> 01:31:22.180 Here we go. 
01:31:22.180 --> 01:31:25.190 Tsirk Bulgaria. 01:31:25.190 --> 01:31:27.510 That is the name of the book. 01:31:27.510 --> 01:31:28.910 Tsirk, as in circus. 01:31:28.910 --> 01:31:30.440 No problem. 01:31:30.440 --> 01:31:32.725 So I just copy this right here. 01:31:35.240 --> 01:31:38.090 And I go back to my new item. 01:31:38.090 --> 01:31:45.725 My new item, which is here, and I edit the Bulgarian field. 01:31:48.260 --> 01:31:49.950 And here it is. 01:31:49.950 --> 01:31:50.720 Awesome. 01:31:50.720 --> 01:31:51.220 All right. 01:31:51.220 --> 01:31:55.420 But I still haven't told Wikidata anything about this. 01:31:55.420 --> 01:31:56.920 I know I'm talking about a book. 01:31:56.920 --> 01:31:59.110 Wikidata doesn't know that yet. 01:31:59.110 --> 01:32:02.630 So let's start by adding some statements. 01:32:02.630 --> 01:32:05.390 First of all, I click Add. 01:32:05.390 --> 01:32:07.190 Wikidata sensibly says, how about we 01:32:07.190 --> 01:32:08.630 start with instance of. 01:32:08.630 --> 01:32:11.090 Tell me what kind of animal-- no, not kind of animal. 01:32:11.090 --> 01:32:13.940 What kind of thing are you trying to describe here? 01:32:13.940 --> 01:32:18.130 Well, it's an instance of a book. 01:32:18.130 --> 01:32:20.930 Not in Hebrew, please. 01:32:20.930 --> 01:32:22.180 So it's an instance of a book. 01:32:22.180 --> 01:32:23.763 I could even be a little more specific 01:32:23.763 --> 01:32:31.920 and say it's an instance of a short story collection. 01:32:31.920 --> 01:32:34.620 There we go, short story collection. 01:32:34.620 --> 01:32:36.800 I hit Save. 01:32:36.800 --> 01:32:37.430 Awesome. 01:32:37.430 --> 01:32:39.680 So now we know what kind of thing it is. 01:32:39.680 --> 01:32:42.860 It's not a human, it's not a mountain, it's not a concept. 01:32:42.860 --> 01:32:44.760 It's a short story collection. 01:32:44.760 --> 01:32:46.400 Now I can add some other things. 
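NOTE The new item built up to this point amounts to a few labels, one description and a single instance-of statement. A rough sketch of that shape, assuming a simplified dictionary model: P31 (instance of) is a real property ID, but Q_SHORT_STORY_COLLECTION is a stand-in because the actual item ID is not given in the talk, the Cyrillic spelling is an assumed rendering of the title read out above, and the helper names are invented for illustration.

```python
def new_item(label, description, language="en"):
    """Create an item with one label and one description, and no statements."""
    return {
        "labels": {language: label},
        "descriptions": {language: description},
        "claims": {},
    }


def add_statement(item, property_id, value):
    """Append a value under the given property; a property may hold several."""
    item["claims"].setdefault(property_id, []).append(value)
    return item


book = new_item("Circus Bulgaria", "short story collection by Deyan Enev")
# Assumed Cyrillic spelling of the original title ("Tsirk Bulgaria").
book["labels"]["bg"] = "Цирк България"
# Placeholder value; the real Q number of "short story collection" is not shown.
add_statement(book, "P31", "Q_SHORT_STORY_COLLECTION")
```

Note that the label carries no disambiguating suffix; telling items apart is the job of the description.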
01:32:46.400 --> 01:32:48.770 See, Wikidata is already working for me. 01:32:48.770 --> 01:32:51.020 Because it's a short story collection 01:32:51.020 --> 01:32:53.960 it's offering me to populate these properties, and not 01:32:53.960 --> 01:32:54.890 other ones. 01:32:54.890 --> 01:32:56.990 Publication date, original language, 01:32:56.990 --> 01:33:00.350 genre, country of origin, these are all relevant, right? 01:33:00.350 --> 01:33:04.220 So let's start with original language of the work 01:33:04.220 --> 01:33:07.410 is Bulgarian. 01:33:07.410 --> 01:33:09.810 Not Bulgaria, Bulgarian. 01:33:09.810 --> 01:33:12.040 This is the item I want to link. 01:33:12.040 --> 01:33:21.570 Hit Save, and whatever. 01:33:21.570 --> 01:33:22.890 Author. 01:33:22.890 --> 01:33:26.540 Let's identify the author. 01:33:26.540 --> 01:33:29.350 So the author, the main creator of the work, 01:33:29.350 --> 01:33:32.470 is that gentleman Deyan Enev. 01:33:32.470 --> 01:33:34.750 And remember, he has a Wikipedia article. 01:33:34.750 --> 01:33:37.210 He also has a Wikidata entity. 01:33:37.210 --> 01:33:39.640 So Wikidata does know about him. 01:33:39.640 --> 01:33:48.930 So I hit Save, and I can add something about the translator. 01:33:52.530 --> 01:33:54.390 And what was that lady's name? 01:33:57.990 --> 01:34:00.120 Kapka Kassabova. 01:34:00.120 --> 01:34:05.430 Now it so happens that Wikidata already knows about this lady. 01:34:08.330 --> 01:34:08.840 See? 01:34:08.840 --> 01:34:12.290 So I can just start typing and then just link to it. 01:34:12.290 --> 01:34:12.840 Awesome. 01:34:12.840 --> 01:34:13.824 But what if it didn't? 01:34:13.824 --> 01:34:15.740 What if it was translated by someone who isn't 01:34:15.740 --> 01:34:17.690 already covered on Wikidata? 
01:34:17.690 --> 01:34:22.190 Well I could just type the name as a string, 01:34:22.190 --> 01:34:25.760 but ideally I could create a Wikidata entity 01:34:25.760 --> 01:34:28.940 about this translator so that there is a possibility 01:34:28.940 --> 01:34:30.350 to link to her. 01:34:33.560 --> 01:34:36.920 Now I might actually add a qualifier here 01:34:36.920 --> 01:34:40.310 because, she's not the translator of the book, right? 01:34:40.310 --> 01:34:43.620 She's the translator of the book into English. 01:34:43.620 --> 01:34:44.440 Right. 01:34:44.440 --> 01:34:50.151 So the language that she translated into is English. 01:34:50.151 --> 01:34:50.650 Right? 01:34:50.650 --> 01:34:53.620 This book-- remember I'm describing the book. 01:34:53.620 --> 01:34:55.376 The item is about the book. 01:34:55.376 --> 01:34:57.250 So the book would have a different translator 01:34:57.250 --> 01:34:58.510 into Polish. 01:34:58.510 --> 01:35:02.320 So this is an example of a property or a statement 01:35:02.320 --> 01:35:06.430 that doesn't make sense without one of those qualifiers. 01:35:06.430 --> 01:35:08.140 It's just not correct. 01:35:08.140 --> 01:35:11.320 It doesn't make sense to say that translator is. 01:35:11.320 --> 01:35:14.950 The English translator, or even this English translator. 01:35:14.950 --> 01:35:17.770 In 50 years maybe there would be an additional English 01:35:17.770 --> 01:35:18.940 translation. 01:35:18.940 --> 01:35:24.774 So that's an example of needing that qualifier. 01:35:24.774 --> 01:35:27.190 And of course I could go on and populate the other fields. 01:35:27.190 --> 01:35:29.710 We don't have to do that right now. 01:35:29.710 --> 01:35:32.960 Publication date, country of origin, et cetera. 01:35:32.960 --> 01:35:35.440 So this is already beginning to look like all those items 01:35:35.440 --> 01:35:38.440 that we already saw, but just a moment ago it didn't exist. 
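NOTE The translator example shows why a qualifier is needed: the same property can hold several values that only make sense together with a language. A minimal sketch under stated assumptions: P655 (translator), Q1860 (English) and Q809 (Polish) are real Wikidata IDs, using P407 (language of work or name) as the qualifier is an assumption about which property was picked in the demo, and both translator IDs are placeholders, the Polish one being entirely hypothetical.

```python
TRANSLATOR = "P655"        # real property: "translator"
LANGUAGE_OF_WORK = "P407"  # assumed qualifier: "language of work or name"

statements = [
    # Placeholder ID for the English translator named in the talk.
    {"property": TRANSLATOR, "value": "Q_KASSABOVA",
     "qualifiers": {LANGUAGE_OF_WORK: "Q1860"}},  # English
    # Entirely hypothetical second translation, for contrast.
    {"property": TRANSLATOR, "value": "Q_POLISH_TRANSLATOR",
     "qualifiers": {LANGUAGE_OF_WORK: "Q809"}},   # Polish
]


def translators_into(statements, language_id):
    """Return translator values whose qualifier matches the target language."""
    return [
        s["value"]
        for s in statements
        if s["property"] == TRANSLATOR
        and s.get("qualifiers", {}).get(LANGUAGE_OF_WORK) == language_id
    ]
```

Without the qualifier, a consumer of the data could not tell which translator belongs to which edition.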
01:35:38.440 --> 01:35:43.920 Just a moment ago Wikidata had no concept of this work. 01:35:43.920 --> 01:35:46.500 This happens to be one of his notable works. 01:35:46.500 --> 01:35:52.080 So I could actually go to the item about Deyan Enev which 01:35:52.080 --> 01:35:56.190 has all this information already, occupation, languages, 01:35:56.190 --> 01:35:59.170 and add a property. 01:35:59.170 --> 01:36:01.050 Remember, I'm not limited to these. 01:36:01.050 --> 01:36:06.180 I can add a property called notable works, 01:36:06.180 --> 01:36:08.670 and mention my new item. 01:36:08.670 --> 01:36:12.120 Circus Bulgaria. 01:36:12.120 --> 01:36:12.750 See? 01:36:12.750 --> 01:36:15.180 My new item is showing up, and thanks 01:36:15.180 --> 01:36:18.660 to this description that I wrote, short story collection, 01:36:18.660 --> 01:36:22.650 it's already appearing here in the dropdown very conveniently. 01:36:22.650 --> 01:36:24.270 So I linked to this. 01:36:24.270 --> 01:36:25.154 I hit Save. 01:36:28.680 --> 01:36:32.310 Ideally again I should find some references showing 01:36:32.310 --> 01:36:34.620 that this is a notable work by him, 01:36:34.620 --> 01:36:37.000 but we won't spend time on that right now. 01:36:37.000 --> 01:36:39.010 But the point is we created a new item. 01:36:39.010 --> 01:36:40.410 We populated it a little bit. 01:36:40.410 --> 01:36:44.400 We linked to it so that it's more discoverable by mentioning 01:36:44.400 --> 01:36:47.760 it in the author name, and of course the book item 01:36:47.760 --> 01:36:50.710 itself mentions the author and links to the author. 01:36:50.710 --> 01:36:52.770 So that's all good. 01:36:52.770 --> 01:36:57.780 One last thing we shall do is give it some useful identifier 01:36:57.780 --> 01:37:02.880 so let's add, say, the Library of Congress record 01:37:02.880 --> 01:37:03.940 for this book. 01:37:03.940 --> 01:37:04.440 OK. 01:37:04.440 --> 01:37:07.710 So I have prepared this in advance. 
01:37:07.710 --> 01:37:08.760 Ooh. 01:37:08.760 --> 01:37:12.720 Just in time, with 80 seconds to go before it's giving up on me. 01:37:12.720 --> 01:37:14.310 Oh it has already given up on me. 01:37:14.310 --> 01:37:15.490 That is very unfortunate. 01:37:23.300 --> 01:37:29.110 So I go to the Library of Congress and I find this book. 01:37:29.110 --> 01:37:33.050 I find this entry, right? 01:37:33.050 --> 01:37:37.320 In the Library of Congress database about this book. 01:37:37.320 --> 01:37:39.120 And it has a permalink. 01:37:39.120 --> 01:37:42.570 It has a kind of guaranteed to be permanent link. 01:37:42.570 --> 01:37:47.950 I can just copy that link, go back to my little book, 01:37:47.950 --> 01:37:55.770 and say the Library of Congress. 01:37:55.770 --> 01:38:01.070 Yeah, LCCN, that's what they call their IDs, the call 01:38:01.070 --> 01:38:02.120 number. 01:38:02.120 --> 01:38:06.502 And I paste it here. 01:38:06.502 --> 01:38:08.210 I actually don't need the URL. 01:38:08.210 --> 01:38:09.136 I need just a number. 01:38:12.440 --> 01:38:13.520 And there we go. 01:38:13.520 --> 01:38:16.550 I have added it, and now Wikidata 01:38:16.550 --> 01:38:20.630 knows how to find bibliographic information about this book. 01:38:20.630 --> 01:38:24.710 And any re-user of Wikidata, some program, 01:38:24.710 --> 01:38:28.950 some tool that connects books to authors 01:38:28.950 --> 01:38:32.870 or does statistical analysis or whatever, some future yet to be 01:38:32.870 --> 01:38:35.090 imagined tool could automatically 01:38:35.090 --> 01:38:39.170 find additional metadata on the Library of Congress site thanks 01:38:39.170 --> 01:38:41.840 to this connection that I just made. 01:38:41.840 --> 01:38:44.150 And of course I could add many other IDs 01:38:44.150 --> 01:38:46.460 to other catalogs around the world, 01:38:46.460 --> 01:38:48.150 and we won't do that right now. 01:38:48.150 --> 01:38:51.840 You can see that it's now showing up under identifiers. 
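NOTE The identifier field wants only the number, not the whole permalink. A small sketch of that trimming step, assuming the permalink has the form https://lccn.loc.gov/ followed by the number; the number used here is a made-up placeholder and the function name is invented for illustration.

```python
from urllib.parse import urlparse


def lccn_from_permalink(url):
    """Return the last path segment of a Library of Congress permalink."""
    path = urlparse(url).path
    return path.strip("/").split("/")[-1]


# Placeholder number, not the real record for this book.
example_lccn = lccn_from_permalink("https://lccn.loc.gov/2010000000")
```

Storing the bare number lets any tool rebuild the link itself and look up further catalog metadata.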
01:38:51.840 --> 01:38:56.330 So this is how we created a brand new piece of data. 01:38:56.330 --> 01:38:59.632 Questions about this, about creating new items? 01:39:18.100 --> 01:39:19.180 Yeah, all right. 01:39:19.180 --> 01:39:25.510 So we've seen how to contribute to Wikidata on our own, 01:39:25.510 --> 01:39:26.350 kind of through-- 01:39:26.350 --> 01:39:27.840 directly through Wikidata. 01:39:30.680 --> 01:39:35.220 Now you may be thinking, but Asaf, this 01:39:35.220 --> 01:39:39.880 sounds like a ton of work, recording 01:39:39.880 --> 01:39:44.500 all of these little tiny bits of information about every person 01:39:44.500 --> 01:39:47.410 and every book and every town. 01:39:47.410 --> 01:39:50.520 And if you think that, you would be correct. 01:39:50.520 --> 01:39:52.730 That is a ton of work. 01:39:52.730 --> 01:39:54.600 It's a lot of work. 01:39:54.600 --> 01:39:59.930 However, it is centralized, so it is reusable on other wikis, 01:39:59.930 --> 01:40:03.860 and we will show in just a moment how we pull information 01:40:03.860 --> 01:40:07.296 from Wikidata into Wikipedia or other projects. 01:40:10.860 --> 01:40:13.780 We will show that in just a moment. 01:40:13.780 --> 01:40:18.660 But here's an awesome little game 01:40:18.660 --> 01:40:23.205 that Wikidata volunteer Magnus Manske 01:40:23.205 --> 01:40:30.900 has authored, called the Wikidata game, in which he 01:40:30.900 --> 01:40:31.920 tricks people-- 01:40:31.920 --> 01:40:35.730 sorry, helps people make contributions 01:40:35.730 --> 01:40:41.500 to Wikidata in a very, very easy and pleasant way. 01:40:41.500 --> 01:40:44.410 Let's look at the Wikidata game. 01:40:44.410 --> 01:40:47.840 So the first thing you need to do in that Wikidata game 01:40:47.840 --> 01:40:50.660 is to log in, because the Wikidata 01:40:50.660 --> 01:40:53.150 game makes edits in your name. 01:40:53.150 --> 01:40:54.980 So we need to authorize it. 01:40:54.980 --> 01:40:57.250 It's perfectly safe. 
01:40:57.250 --> 01:41:01.090 And after you do that you can go to the Wikidata game. 01:41:01.090 --> 01:41:02.020 So this is the game. 01:41:02.020 --> 01:41:03.520 Now I'm logged in. 01:41:03.520 --> 01:41:05.230 And the Wikidata game actually includes 01:41:05.230 --> 01:41:06.970 a number of different games. 01:41:06.970 --> 01:41:09.310 Let's start with a person game. 01:41:09.310 --> 01:41:14.170 So Wikidata shows you-- 01:41:14.170 --> 01:41:20.800 shows you an item, and asks you a very simple question. 01:41:20.800 --> 01:41:23.200 Person, or not a person? 01:41:26.410 --> 01:41:30.550 So Wikidata goes through Wikidata entities 01:41:30.550 --> 01:41:35.540 that don't even have the instance of property. 01:41:35.540 --> 01:41:37.520 Which is why Wikidata doesn't know, 01:41:37.520 --> 01:41:41.120 literally doesn't know, if this is a person, or a mountain, 01:41:41.120 --> 01:41:44.390 or a city, or a country, or anything else. 01:41:44.390 --> 01:41:47.150 So it asks you, because this is the kind of question that 01:41:47.150 --> 01:41:50.300 Wikidata cannot decide on its own, 01:41:50.300 --> 01:41:54.800 but for us humans it's generally trivial to be able to say 01:41:54.800 --> 01:41:58.220 whether something that we're looking at is a person or not. 01:41:58.220 --> 01:42:03.590 It gets slightly trickier when the information is in Javanese, 01:42:03.590 --> 01:42:06.470 as it is here, rather than English. 01:42:06.470 --> 01:42:10.010 So this item happens to be described in Javanese. 01:42:10.010 --> 01:42:14.360 My Javanese, spoken in Indonesia, is very weak. 01:42:14.360 --> 01:42:19.620 However, I can tell that this is not a person. 01:42:19.620 --> 01:42:20.730 How can I tell? 01:42:20.730 --> 01:42:23.220 Without understanding a word of Javanese 01:42:23.220 --> 01:42:25.950 I see that it mentions 1000 kilometers 01:42:25.950 --> 01:42:28.860 and square kilometers, see? 
01:42:28.860 --> 01:42:32.520 So this is about a place, or an area, 01:42:32.520 --> 01:42:36.090 or a region, or whatever, but not a person. 01:42:36.090 --> 01:42:39.060 So this is an example of how even 01:42:39.060 --> 01:42:41.100 without understanding language you can sometimes 01:42:41.100 --> 01:42:42.400 make a determination. 01:42:42.400 --> 01:42:45.030 However, of course, you should be sure. 01:42:45.030 --> 01:42:47.700 This is definitely not what the Wikipedia article 01:42:47.700 --> 01:42:49.150 about a person looks like. 01:42:49.150 --> 01:42:50.430 So this is not a person. 01:42:50.430 --> 01:42:52.780 I just click it and I'm shown the next item. 01:42:56.600 --> 01:42:59.660 This item is in another language I do not speak, 01:42:59.660 --> 01:43:00.950 and I just don't know. 01:43:00.950 --> 01:43:03.740 I do not know if this is about a person or not. 01:43:03.740 --> 01:43:07.350 So I click Not Sure. 01:43:07.350 --> 01:43:11.190 This is in Swedish, and it's about Sulawesi, still 01:43:11.190 --> 01:43:13.770 Indonesia. 01:43:13.770 --> 01:43:16.530 And it is not about a person. 01:43:16.530 --> 01:43:18.150 I have enough Swedish for that. 01:43:18.150 --> 01:43:21.750 So I click not a person. 01:43:21.750 --> 01:43:24.420 Now, you may say, well, do I really 01:43:24.420 --> 01:43:28.350 have to deal with all these languages that I don't speak? 01:43:28.350 --> 01:43:29.190 The answer is no. 01:43:29.190 --> 01:43:30.630 You don't have to. 01:43:30.630 --> 01:43:32.580 Here at the bottom of the Wikidata game 01:43:32.580 --> 01:43:33.840 there are settings. 01:43:33.840 --> 01:43:38.270 You can click that and tell Wikidata, 01:43:38.270 --> 01:43:41.840 I cannot even read Chinese or Japanese, 01:43:41.840 --> 01:43:44.600 so please don't show me items in those languages. 01:43:44.600 --> 01:43:47.060 Because I wouldn't even be able to guess. 
01:43:47.060 --> 01:43:50.000 I prefer these languages in which I can relatively easily 01:43:50.000 --> 01:43:51.380 make determinations. 01:43:51.380 --> 01:43:54.601 And I can even tell Wikidata to only show me these languages. 01:43:54.601 --> 01:43:55.100 You see? 01:43:55.100 --> 01:43:57.350 This was not selected, which is why I 01:43:57.350 --> 01:44:00.600 was shown some other languages. 01:44:00.600 --> 01:44:04.240 I could say, only use these languages, and save. 01:44:04.240 --> 01:44:06.100 And now I can try this game again. 01:44:06.100 --> 01:44:07.980 However, that can slow it down a little. 01:44:07.980 --> 01:44:09.000 So here we go. 01:44:09.000 --> 01:44:11.640 Here's a Spanish-- which is one of the languages I 01:44:11.640 --> 01:44:14.640 told Wikidata game it can use. 01:44:14.640 --> 01:44:16.480 This is a Spanish item. 01:44:16.480 --> 01:44:19.265 Now is it about a person or not? 01:44:22.120 --> 01:44:23.230 It is not about a person. 01:44:25.906 --> 01:44:26.780 Is it about a person? 01:44:29.155 --> 01:44:29.655 No. 01:44:32.900 --> 01:44:35.180 Yes, it is right? 01:44:35.180 --> 01:44:38.550 Monk Cistercian, Pedro de Ovideo Falconi. 01:44:38.550 --> 01:44:40.890 That sounds like a person. 01:44:40.890 --> 01:44:42.680 Frau Pedro Nasser. 01:44:42.680 --> 01:44:44.960 Yeah, he was born in Madrid 1577. 01:44:44.960 --> 01:44:46.280 This is a person. 01:44:46.280 --> 01:44:47.060 OK. 01:44:47.060 --> 01:44:49.730 So I click person. 01:44:49.730 --> 01:44:52.100 Again, if you're not sure, click not sure. 01:44:52.100 --> 01:44:55.100 The point is, just by clicking person and as you can see 01:44:55.100 --> 01:44:57.780 this would work very well on mobile, 01:44:57.780 --> 01:45:01.430 which is why I said you can contribute on your commute. 01:45:01.430 --> 01:45:04.100 You can just hold your phone or tablet or whatever, 01:45:04.100 --> 01:45:05.840 and just tap. 01:45:05.840 --> 01:45:07.040 Person, not a person. 
01:45:07.040 --> 01:45:08.900 Person, not a person. 01:45:08.900 --> 01:45:12.500 The amazing thing is that just tapping person has actually 01:45:12.500 --> 01:45:15.830 made an edit to Wikidata on my behalf, which 01:45:15.830 --> 01:45:21.560 I can find out, like every wiki, by clicking contributions. 01:45:21.560 --> 01:45:24.200 And as you can see in addition to the stuff about circus 01:45:24.200 --> 01:45:28.340 Bulgaria, my latest edit is in fact about this Pedro de Ovideo 01:45:28.340 --> 01:45:30.130 Falconi person. 01:45:30.130 --> 01:45:32.000 And the edit was, you can-- 01:45:32.000 --> 01:45:38.030 I hope you can see this, created the claim instance of human. 01:45:38.030 --> 01:45:39.110 So I added-- 01:45:39.110 --> 01:45:43.100 I mean Wikidata game added for me the statement 01:45:43.100 --> 01:45:44.180 instance of human. 01:45:44.180 --> 01:45:47.780 Now, the awesome thing is that it was super easy to do. 01:45:47.780 --> 01:45:51.890 I didn't have to go into that entity, click the Add button, 01:45:51.890 --> 01:45:57.080 choose the instance of property, choose human, hit Save. 01:45:57.080 --> 01:45:59.210 Instead of all these operations I just 01:45:59.210 --> 01:46:04.250 tapped on my screen, person, not a person. 01:46:04.250 --> 01:46:10.280 And I can do hundreds of edits during my daily commute. 01:46:10.280 --> 01:46:12.410 There are other games, like the gender game. 01:46:12.410 --> 01:46:14.810 So this is about-- 01:46:14.810 --> 01:46:17.240 this is when Wikidata already knows 01:46:17.240 --> 01:46:19.760 that this item is a person, but it doesn't 01:46:19.760 --> 01:46:21.710 know the gender of this person. 01:46:21.710 --> 01:46:25.340 Which is another one of the more basic items. 01:46:25.340 --> 01:46:27.770 And this is taking a long time because of the language 01:46:27.770 --> 01:46:29.870 limitations that I set on it. 
01:46:29.870 --> 01:46:32.660 I guess the less exotic languages have already 01:46:32.660 --> 01:46:35.130 been exhausted in the game. 01:46:35.130 --> 01:46:36.880 We don't have to wait all this time. 01:46:40.280 --> 01:46:44.970 We can try something else. 01:46:44.970 --> 01:46:45.950 How about occupation? 01:46:45.950 --> 01:46:46.850 The occupation game. 01:46:46.850 --> 01:46:49.400 Here we go, this is in Russian. 01:46:49.400 --> 01:46:55.540 And what is the occupation of this gentleman? 01:46:55.540 --> 01:46:58.630 Well he is an [INAUDIBLE]. 01:46:58.630 --> 01:47:00.700 He's a church person. 01:47:00.700 --> 01:47:04.300 However, so the occupation game is 01:47:04.300 --> 01:47:06.490 where Wikidata game will automatically 01:47:06.490 --> 01:47:10.990 pull likely occupations from the article text 01:47:10.990 --> 01:47:13.810 and ask for confirmation. 01:47:13.810 --> 01:47:16.840 So if he-- if this person really is a deacon, 01:47:16.840 --> 01:47:17.770 I should click that. 01:47:17.770 --> 01:47:19.990 But I'm not sure. 01:47:19.990 --> 01:47:24.950 I'm not clear on the Russian church's distinctions between-- 01:47:24.950 --> 01:47:26.620 I mean [INAUDIBLE] is pretty senior, 01:47:26.620 --> 01:47:28.690 but I don't know if that automatically also means 01:47:28.690 --> 01:47:30.100 he's a deacon or not. 01:47:30.100 --> 01:47:32.720 And [INAUDIBLE] is not listed here. 01:47:32.720 --> 01:47:36.380 So I will click not listed. 01:47:36.380 --> 01:47:39.540 Also, these guesses are not always correct. 01:47:39.540 --> 01:47:42.680 So, this guy for example, is in Russian. 01:47:42.680 --> 01:47:43.430 I can read this. 01:47:43.430 --> 01:47:44.470 He's a philologist. 01:47:44.470 --> 01:47:45.380 He's a linguist. 01:47:45.380 --> 01:47:48.510 So I can confirm it and click linguist. 01:47:48.510 --> 01:47:49.010 All right? 
01:47:49.010 --> 01:47:51.950 And again, if we look at my contributions 01:47:51.950 --> 01:47:55.700 we can see the Wikidata game on my behalf 01:47:55.700 --> 01:47:59.930 created occupation linguist. 01:47:59.930 --> 01:48:02.450 OK. 01:48:02.450 --> 01:48:04.370 Just by typing linguist there. 01:48:04.370 --> 01:48:07.040 Now if it's taken from the article, 01:48:07.040 --> 01:48:09.860 why would it ever be wrong? 01:48:09.860 --> 01:48:15.970 Well Jesus was the son of a carpenter. 01:48:15.970 --> 01:48:18.870 The word carpenter appears in the text. 01:48:18.870 --> 01:48:22.840 That doesn't mean it's correct to say Jesus was a carpenter. 01:48:22.840 --> 01:48:23.340 OK? 01:48:23.340 --> 01:48:24.660 Just a trivial example, right? 01:48:24.660 --> 01:48:30.250 So many, many articles will say, you know, born to a physician. 01:48:30.250 --> 01:48:32.850 And so the word physician could be guessed, 01:48:32.850 --> 01:48:36.030 but it wouldn't be correct unless the son is also 01:48:36.030 --> 01:48:38.090 a physician. 01:48:38.090 --> 01:48:43.540 So I hope it gives you the gist of it. 01:48:43.540 --> 01:48:47.500 There is also a distributed Wikidata game, 01:48:47.500 --> 01:48:48.774 which is pretty awesome. 01:48:51.450 --> 01:48:54.320 Here we go, which has additional games. 01:48:54.320 --> 01:49:02.610 So, for example, the key on game gives you, 01:49:02.610 --> 01:49:06.940 maybe it gives you, some items to play with. 01:49:16.610 --> 01:49:17.110 Yes? 01:49:17.110 --> 01:49:17.610 No? 01:49:17.610 --> 01:49:18.430 OK. 01:49:18.430 --> 01:49:20.830 So it gives you this little card, 01:49:20.830 --> 01:49:27.940 and asks you to confirm is this instance of human settlement? 01:49:27.940 --> 01:49:30.480 That is, is it a village, town, city, whatever. 01:49:30.480 --> 01:49:33.310 Is it a kind of human settlement or not? 01:49:33.310 --> 01:49:34.340 Or maybe it's a book. 01:49:34.340 --> 01:49:35.540 Maybe it's a poem. 
01:49:35.540 --> 01:49:38.980 Again, so, is it a human settlement? 01:49:38.980 --> 01:49:41.500 And you can click the languages here to see the information. 01:49:41.500 --> 01:49:43.270 So I can click English. 01:49:43.270 --> 01:49:44.572 And indeed the article-- 01:49:44.572 --> 01:49:46.030 I mean the actual Wikipedia article 01:49:46.030 --> 01:49:49.360 says Camigji is a town and territory 01:49:49.360 --> 01:49:51.370 in this district in the Congo. 01:49:51.370 --> 01:49:54.640 So yes, this is an instance of human settlement. 01:49:54.640 --> 01:49:57.580 So I clicked yes. 01:49:57.580 --> 01:50:00.460 And just clicking yes again went to that item, 01:50:00.460 --> 01:50:02.740 and added instance of human settlement. 01:50:02.740 --> 01:50:05.560 Now the point of all these games is 01:50:05.560 --> 01:50:08.140 these are tools, written by programmers, 01:50:08.140 --> 01:50:12.490 making kind of semi-educated guesses about these fairly 01:50:12.490 --> 01:50:14.120 basic properties. 01:50:14.120 --> 01:50:17.770 And they are meant to semi-automate, to assist, 01:50:17.770 --> 01:50:23.730 in the accumulation of all these important pieces of data. 01:50:23.730 --> 01:50:26.640 Now every single click here helps 01:50:26.640 --> 01:50:31.000 Wikidata give better results, richer results 01:50:31.000 --> 01:50:32.380 in future queries. 01:50:32.380 --> 01:50:38.130 Again, as of right now Wikidata can include Camigji 01:50:38.130 --> 01:50:42.690 if I ask it, you know, what are some towns in Congo? 01:50:42.690 --> 01:50:44.220 Until now it could not. 01:50:44.220 --> 01:50:46.830 Because it literally didn't know. 01:50:46.830 --> 01:50:51.950 So every time we click male, female, person, not a person, 01:50:51.950 --> 01:50:56.640 make these decisions, we help improve Wikidata 01:50:56.640 --> 01:51:01.560 and enrich the results that we could receive.
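The effect of a single click in the game can be sketched as a toy model. This is an illustration only, not how Wikidata stores data: statements are reduced to plain (item, property, value) triples, the two town IDs and the country ID are hypothetical placeholders, and only P31 ("instance of"), P17 ("country") and Q486972 ("human settlement") are meant as real identifiers.

```python
# Toy model: a statement is an (item, property, value) triple.
# Real Wikidata statements also carry references, qualifiers and ranks.
P_INSTANCE_OF = "P31"            # "instance of"
P_COUNTRY = "P17"                # "country"
Q_HUMAN_SETTLEMENT = "Q486972"   # "human settlement"
Q_COUNTRY = "Q_COUNTRY"          # hypothetical placeholder for the country item

# Hypothetical items: TOWN_A is fully described, TOWN_B lacks "instance of".
statements = {
    ("Q_TOWN_A", P_INSTANCE_OF, Q_HUMAN_SETTLEMENT),
    ("Q_TOWN_A", P_COUNTRY, Q_COUNTRY),
    ("Q_TOWN_B", P_COUNTRY, Q_COUNTRY),
}


def settlements_in(country, stmts):
    """Return items known to be human settlements in the given country."""
    in_country = {s for s, p, o in stmts if p == P_COUNTRY and o == country}
    settlements = {
        s for s, p, o in stmts
        if p == P_INSTANCE_OF and o == Q_HUMAN_SETTLEMENT
    }
    return sorted(in_country & settlements)


before = settlements_in(Q_COUNTRY, statements)

# One "yes" click in the game adds exactly one statement.
statements.add(("Q_TOWN_B", P_INSTANCE_OF, Q_HUMAN_SETTLEMENT))

after = settlements_in(Q_COUNTRY, statements)
```

Before the click the second town is missing from the answer because the model has no statement saying what it is. After the click it shows up, with no change to the query itself.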
01:51:01.560 --> 01:51:04.590 Any questions about this, about kind of micro contributions 01:51:04.590 --> 01:51:07.010 through the Wikidata game? 01:51:07.010 --> 01:51:09.890 If that looks appealing I encourage 01:51:09.890 --> 01:51:12.860 you to go and visit the Wikidata game 01:51:12.860 --> 01:51:15.205 and start contributing in that way. 01:51:19.580 --> 01:51:21.650 There is a question here. 01:51:21.650 --> 01:51:24.650 If I make an article about Circus Bulgaria how should 01:51:24.650 --> 01:51:26.630 I correctly connect them? 01:51:26.630 --> 01:51:28.740 That is an excellent question. 01:51:28.740 --> 01:51:33.090 So once-- so now there is a Wikidata item about that book, 01:51:33.090 --> 01:51:37.650 but there is no Wikipedia article anywhere. 01:51:37.650 --> 01:51:41.460 Now suppose I write one in, Bulgarian maybe, 01:51:41.460 --> 01:51:42.870 you go to Wikidata. 01:51:42.870 --> 01:51:45.180 You find the item by searching. 01:51:45.180 --> 01:51:49.170 You find the item, and then the empty site links section 01:51:49.170 --> 01:51:50.850 right at the bottom there-- 01:51:50.850 --> 01:51:52.020 where are we? 01:51:52.020 --> 01:51:53.100 We have this? 01:51:53.100 --> 01:51:55.050 Circus Bulgaria. 01:51:55.050 --> 01:51:56.010 Let's demonstrate this. 01:51:56.010 --> 01:51:58.000 So here is the item about the book. 01:51:58.000 --> 01:52:01.030 Let's say that now there is an article 01:52:01.030 --> 01:52:03.670 because I just created it. 01:52:03.670 --> 01:52:07.450 I can go here to the empty Wikipedia link section, 01:52:07.450 --> 01:52:11.760 click Edit, type the name of the wiki, 01:52:11.760 --> 01:52:16.430 let's say English, and then type the name of the page 01:52:16.430 --> 01:52:18.230 that I just created. 01:52:18.230 --> 01:52:20.790 Circus-- right? 01:52:20.790 --> 01:52:23.400 And again, it offers me auto-complete 01:52:23.400 --> 01:52:25.080 for my convenience. 
01:52:25.080 --> 01:52:28.260 Now we don't actually have the article created, 01:52:28.260 --> 01:52:30.480 but I could let's just say this was the article. 01:52:30.480 --> 01:52:33.330 I can just click this, hit Save, and that 01:52:33.330 --> 01:52:36.450 would associate the new Wikipedia article 01:52:36.450 --> 01:52:38.130 with this Wikidata item. 01:52:38.130 --> 01:52:41.940 That is the beginning of the inter-wiki list for this item. 01:52:41.940 --> 01:52:43.620 I will not click Save Now, because we 01:52:43.620 --> 01:52:45.289 didn't have the article yet. 01:52:45.289 --> 01:52:46.830 So I hope that answers that question. 01:52:46.830 --> 01:52:50.340 Was there another question that I missed here? 01:52:50.340 --> 01:52:51.450 No. 01:52:51.450 --> 01:52:53.170 OK. 01:52:53.170 --> 01:52:55.300 Any questions about the Wikidata game? 01:52:55.300 --> 01:53:00.740 About this idea of micro contributions? 01:53:00.740 --> 01:53:05.330 If not then we can move on to embedding data, 01:53:05.330 --> 01:53:07.490 and after that we can discuss queries, 01:53:07.490 --> 01:53:12.000 how to get at all this data from Wikidata. 01:53:12.000 --> 01:53:16.500 So the short version of how to embed data from Wikidata 01:53:16.500 --> 01:53:19.920 is that there is this little magic incantation. 01:53:19.920 --> 01:53:25.410 Curly brace, curly brace, hash mark, property. 01:53:25.410 --> 01:53:29.820 It looks like a template, but it isn't because of that hash. 01:53:29.820 --> 01:53:31.320 And that is magic. 01:53:31.320 --> 01:53:34.170 Take a look at this little demo that I prepared. 01:53:34.170 --> 01:53:37.950 This page, which is off my user page on meta, 01:53:37.950 --> 01:53:40.110 but it could be on any wiki. 01:53:40.110 --> 01:53:42.490 OK. 
01:53:42.490 --> 01:53:49.420 Says, since San Francisco is item Q62 in Wikidata, 01:53:49.420 --> 01:53:55.240 and since population is property P1082, I can tell you 01:53:55.240 --> 01:53:58.840 that according to Wikidata the population of San Francisco 01:53:58.840 --> 01:54:02.180 is this. 01:54:02.180 --> 01:54:08.420 And this bolded number here was produced with this incantation. 01:54:08.420 --> 01:54:14.420 Curly brace, curly brace, hash mark, property P1082, 01:54:14.420 --> 01:54:18.751 that's population, pipe, from what item? 01:54:18.751 --> 01:54:19.250 Right? 01:54:19.250 --> 01:54:21.650 'Cause I'm pulling an arbitrary number. 01:54:21.650 --> 01:54:23.570 I could put any property in any item 01:54:23.570 --> 01:54:27.020 here, and kind of include it, embedded, into my text. 01:54:27.020 --> 01:54:29.630 This isn't even about-- you notice this is my user page. 01:54:29.630 --> 01:54:32.480 This isn't even the article about San Francisco. 01:54:32.480 --> 01:54:35.210 I just want to pull that number into this thing 01:54:35.210 --> 01:54:36.410 that I'm writing. 01:54:36.410 --> 01:54:38.820 So it's fairly simple. 01:54:38.820 --> 01:54:40.970 I identify the property. 01:54:40.970 --> 01:54:43.440 I identify the item to take it from. 01:54:43.440 --> 01:54:47.120 And Wikidata will, I mean Wikipedia, 01:54:47.120 --> 01:54:50.480 or the wiki I'm on, in this case meta, will go to Wikidata 01:54:50.480 --> 01:54:52.820 and fetch it for me. 01:54:52.820 --> 01:54:56.480 Likewise, since Denny Vrandecic, the designer of Wikidata 01:54:56.480 --> 01:55:01.370 is item 18618629, right? 01:55:01.370 --> 01:55:04.790 I mean, he's a notable person, so he has a Wikidata entity.
01:55:04.790 --> 01:55:09.160 And since occupation is property 106, and date of birth is 569, 01:55:09.160 --> 01:55:12.290 and place of birth is 19, because 01:55:12.290 --> 01:55:14.720 of all that I can tell you that Vrandecic was born 01:55:14.720 --> 01:55:19.130 in Stuttgart, on this date, and is researcher, programmer, 01:55:19.130 --> 01:55:20.850 and computer scientist. 01:55:20.850 --> 01:55:25.010 If you look at the source for this page, click Edit Source, 01:55:25.010 --> 01:55:28.700 you can see that the word Stuttgart does not appear here, 01:55:28.700 --> 01:55:30.530 because it came from Wikidata. 01:55:30.530 --> 01:55:34.171 I did not write this into my little demo page here. 01:55:34.171 --> 01:55:34.670 See? 01:55:34.670 --> 01:55:37.380 Place of birth is-- 01:55:37.380 --> 01:55:37.880 where is it? 01:55:37.880 --> 01:55:38.380 Here. 01:55:38.380 --> 01:55:43.790 Born in property 19 from Q number so-and-so. 01:55:43.790 --> 01:55:46.970 That is how easy it is to pull stuff 01:55:46.970 --> 01:55:51.890 into a wiki from Wikidata. 01:55:51.890 --> 01:55:55.280 OK, now there's some nuance to it. 01:55:55.280 --> 01:55:57.470 And there are some additional parameters 01:55:57.470 --> 01:55:58.130 you can give. 01:55:58.130 --> 01:56:00.230 And you can ask Wikidata to give you 01:56:00.230 --> 01:56:03.635 not just the text of the values, but actually make it links. 01:56:06.750 --> 01:56:14.825 So, for example, if I change this from property to values-- 01:56:25.950 --> 01:56:29.142 No, that did not work at all. 01:56:29.142 --> 01:56:29.850 Wasn't it values? 01:56:29.850 --> 01:56:30.350 What was it? 01:56:33.370 --> 01:56:34.614 Values and then-- 01:57:19.265 --> 01:57:19.890 Oh, statements. 01:57:19.890 --> 01:57:20.710 My bad, sorry. 01:57:20.710 --> 01:57:22.980 The magic word is statements. 01:57:22.980 --> 01:57:24.010 Statements. 01:57:24.010 --> 01:57:28.680 So going back here.
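Written out, the incantation described here looks like `{{#property:P1082|from=Q62}}`, and the linked variant swaps in the word `statements`. A minimal sketch that assembles these strings follows. The helper name `embed` and its `linked` flag are made up for this illustration, while the property and item IDs are the ones quoted in the talk.

```python
def embed(property_id, item_id=None, linked=False):
    """Build the wikitext for pulling one property value from Wikidata.

    property_id: e.g. "P1082" (population)
    item_id:     e.g. "Q62" (San Francisco); if omitted, no "from" is emitted
                 and the wiki uses the item connected to the current page
    linked:      use "statements" instead of "property" to ask for links
    """
    name = "#statements" if linked else "#property"
    call = name + ":" + property_id
    if item_id is not None:
        call += "|from=" + item_id
    return "{{" + call + "}}"


population = embed("P1082", "Q62")
occupation = embed("P106", "Q18618629", linked=True)
birthplace = embed("P19", "Q18618629")
```

Here `population` is the exact text that would be typed into the wiki page to get the bolded number, and `occupation` is the variant that tries to render the values as links.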
01:57:28.680 --> 01:57:35.385 If I change the word property to the word statements 01:57:35.385 --> 01:57:40.890 here then this same value-- 01:57:40.890 --> 01:57:43.300 that did not work at all. 01:57:43.300 --> 01:57:46.690 Oh, because I'm on meta. 01:57:46.690 --> 01:57:48.670 So because I'm on meta, meta doesn't 01:57:48.670 --> 01:57:52.230 have an article named researcher, programmer, 01:57:52.230 --> 01:57:53.500 or computer scientist. 01:57:53.500 --> 01:57:55.120 But Wikipedia does. 01:57:55.120 --> 01:58:00.210 If I included this same syntax in Wikipedia, 01:58:00.210 --> 01:58:02.950 like English Wikipedia, for example-- 01:58:02.950 --> 01:58:04.855 So let's go there right now. 01:58:11.240 --> 01:58:13.480 And go-- go to my-- 01:58:18.550 --> 01:58:19.345 Go to my sandbox. 01:58:23.090 --> 01:58:27.982 If I just brutally paste this on my sandbox here-- 01:58:32.690 --> 01:58:35.810 So, see, these became links. 01:58:35.810 --> 01:58:39.740 Because Wikipedia has an article called programmer and computer 01:58:39.740 --> 01:58:40.910 scientist. 01:58:40.910 --> 01:58:43.460 So, like I said, there's some additional nuance 01:58:43.460 --> 01:58:44.840 to the embedding. 01:58:44.840 --> 01:58:47.030 The important thing is that this is 01:58:47.030 --> 01:58:51.470 the key to delivering on that first problem that I mentioned. 01:58:51.470 --> 01:58:55.970 How to get data from a central location 01:58:55.970 --> 01:58:58.850 onto your wiki in your language. 01:58:58.850 --> 01:59:04.460 Basically using property and statements magic incantations. 01:59:04.460 --> 01:59:07.100 And of course, usually, this would be 01:59:07.100 --> 01:59:10.010 in the context of an info box. 01:59:10.010 --> 01:59:14.180 Some wikis-- English Wikipedia is not leading the way there. 
01:59:14.180 --> 01:59:16.490 Some smaller wikis are more advanced 01:59:16.490 --> 01:59:22.070 actually in integrating Wikidata embeddings like this 01:59:22.070 --> 01:59:24.620 into their info boxes. 01:59:24.620 --> 01:59:26.300 So that instead of the info box just 01:59:26.300 --> 01:59:30.620 being a template on the wiki with field equals value, 01:59:30.620 --> 01:59:31.685 field equals value. 01:59:31.685 --> 01:59:35.700 That template of the info box on the wiki 01:59:35.700 --> 01:59:40.160 pulls the values, the birthdate, the languages, et cetera, 01:59:40.160 --> 01:59:44.210 pulls them from Wikidata. 01:59:44.210 --> 01:59:49.820 So basically just-- I just demonstrated single calls 01:59:49.820 --> 01:59:52.550 to this, but of course an info box template 01:59:52.550 --> 01:59:56.270 would include maybe 20 or 40 such embeds, 01:59:56.270 --> 01:59:57.710 and that is not a problem. 01:59:57.710 --> 02:00:01.460 Of course, before you go and edit the English Wikipedia's 02:00:01.460 --> 02:00:06.050 info box person and replace it all with Wikidata embeds, 02:00:06.050 --> 02:00:09.050 you should discuss it with the English Wikipedia community. 02:00:09.050 --> 02:00:12.000 These discussions have already been taking place. 02:00:12.000 --> 02:00:13.640 There are some concerns about how 02:00:13.640 --> 02:00:17.150 to patrol this, how to keep it newbie friendly, et cetera. 02:00:17.150 --> 02:00:20.690 So there are legitimate concerns with just moving everything 02:00:20.690 --> 02:00:22.910 to be embedded from Wikidata. 02:00:22.910 --> 02:00:26.450 But the communities are gradually handling this. 02:00:26.450 --> 02:00:29.390 I mean this ability to embed from Wikidata is not very old. 02:00:29.390 --> 02:00:31.550 It's been around for about a year. 02:00:31.550 --> 02:00:35.150 So communities are still working on kind 02:00:35.150 --> 02:00:37.560 of integrating that technology. 
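One way an infobox template might combine a locally written value with a Wikidata fallback is to put the embed in the default slot of a template parameter. The sketch below only generates such wikitext. The field names (`birth_date`, `birth_place`, `occupation`) and the row layout are assumptions for illustration and are not taken from any particular wiki's infobox.

```python
def field_with_fallback(param, property_id):
    """Wikitext for: use the locally supplied parameter if it was passed,
    otherwise fall back to the value stored on the connected Wikidata item."""
    return "{{{" + param + "|{{#statements:" + property_id + "}}}}}"


# Hypothetical infobox fields mapped to the properties named in the talk.
FIELDS = [
    ("Born", "birth_date", "P569"),
    ("Place of birth", "birth_place", "P19"),
    ("Occupation", "occupation", "P106"),
]


def infobox_rows(fields):
    """One 'label: value' line per field; a real template would use a table."""
    return "\n".join(
        label + ": " + field_with_fallback(param, pid)
        for label, param, pid in fields
    )


rows = infobox_rows(FIELDS)
```

A full infobox would repeat this pattern for each of its 20 or 40 fields. Whether a community wants local values to override Wikidata, or the other way round, is exactly the kind of question those discussions are about.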
02:00:37.560 --> 02:00:40.190 But that is kind of just the basics of how 02:00:40.190 --> 02:00:44.210 to pull data, individual bits of data. That's not querying, 02:00:44.210 --> 02:00:47.330 that's not asking those sweeping questions that I was talking 02:00:47.330 --> 02:00:48.850 about yet. 02:00:48.850 --> 02:00:50.720 We'll get to that. Right now this is 02:00:50.720 --> 02:00:55.310 how to pull a specific datum, a specific piece of data, 02:00:55.310 --> 02:00:57.395 from Wikidata. 02:01:01.530 --> 02:01:02.530 OK. 02:01:02.530 --> 02:01:07.080 So here's another quick thing to demonstrate 02:01:07.080 --> 02:01:09.880 before we go to queries, and that 02:01:09.880 --> 02:01:12.010 is the article placeholder. 02:01:12.010 --> 02:01:15.010 The article placeholder is a feature 02:01:15.010 --> 02:01:19.660 that is being tested on the Esperanto Wikipedia, and maybe 02:01:19.660 --> 02:01:22.180 another wiki, I don't remember. 02:01:22.180 --> 02:01:28.490 And it is using the potential of Wikidata 02:01:28.490 --> 02:01:32.690 to offer a placeholder for an article. 02:01:32.690 --> 02:01:37.940 An automatically generated Wikidata-powered replacement 02:01:37.940 --> 02:01:41.720 placeholder for articles that don't yet 02:01:41.720 --> 02:01:45.950 exist on Esperanto. 02:01:45.950 --> 02:01:50.440 So let's go to the Esperanto Wikipedia. 02:01:50.440 --> 02:01:52.440 I don't speak Esperanto. 02:01:52.440 --> 02:01:56.760 But let's look for Helen Dewitt, our friend, 02:01:56.760 --> 02:01:58.170 in Esperanto Wikipedia. 02:01:58.170 --> 02:02:00.270 Now Esperanto is not one of the Wikipedias 02:02:00.270 --> 02:02:03.060 that have an article about Helen Dewitt. 02:02:03.060 --> 02:02:04.890 And so it tells me that, right? 02:02:04.890 --> 02:02:06.570 There is no Helen Dewitt. 02:02:06.570 --> 02:02:08.670 Maybe you were looking for Helena Dewitt. 02:02:08.670 --> 02:02:10.200 No, I was not.
02:02:10.200 --> 02:02:13.650 You can start an article about Helen Dewitt. 02:02:13.650 --> 02:02:15.390 You can search. 02:02:15.390 --> 02:02:17.820 You know, there's all this stuff. 02:02:17.820 --> 02:02:24.180 But there is also this little option here, hiding, 02:02:24.180 --> 02:02:30.640 which tells me that the Esperanto Wikipedia is-- 02:02:30.640 --> 02:02:31.580 what's happening here? 02:02:35.140 --> 02:02:35.890 Yes. 02:02:35.890 --> 02:02:40.520 The Esperanto Wikipedia is ready to give me this page. 02:02:40.520 --> 02:02:44.020 This page, as you can see, it's on the Esperanto Wikipedia, 02:02:44.020 --> 02:02:46.090 but it's not an article. 02:02:46.090 --> 02:02:47.480 See, it's a special page. 02:02:47.480 --> 02:02:49.700 It's machine generated. 02:02:49.700 --> 02:02:52.150 You can see the URL as well. 02:02:52.150 --> 02:02:54.410 It's not, you know, slash Helen Dewitt. 02:02:54.410 --> 02:02:58.450 It's slash Specialaĵo:AboutTopic, 02:02:58.450 --> 02:03:01.570 and then the Wikidata ID of Helen Dewitt. 02:03:01.570 --> 02:03:03.760 And what I get here-- 02:03:03.760 --> 02:03:05.860 I get an English description, by the way, 02:03:05.860 --> 02:03:08.300 because there is no Esperanto description. 02:03:08.300 --> 02:03:10.420 Wikidata can't make it up. 02:03:10.420 --> 02:03:13.600 But what it can do is offer me these pieces 02:03:13.600 --> 02:03:16.960 of data in my language, in this case Esperanto. 02:03:16.960 --> 02:03:18.921 I'm on the Esperanto Wikipedia. 02:03:18.921 --> 02:03:19.420 OK. 02:03:19.420 --> 02:03:23.380 So it tells me that she's American, for example, 02:03:23.380 --> 02:03:26.090 and it tells me that in Esperanto. 02:03:26.090 --> 02:03:29.350 OK, and it tells me that she speaks Latin. 02:03:29.350 --> 02:03:32.410 Remember we taught Wikidata that?
02:03:32.410 --> 02:03:35.800 It tells me that she was educated in Oxford, 02:03:35.800 --> 02:03:38.050 you know, and gives me the references to the extent 02:03:38.050 --> 02:03:39.130 that they exist. 02:03:39.130 --> 02:03:41.560 I mean this is not an article. 02:03:41.560 --> 02:03:46.650 It's not, you know, paragraphs of fluent Esperanto text. 02:03:46.650 --> 02:03:50.190 But it is information that I can understand 02:03:50.190 --> 02:03:51.960 if I speak this language. 02:03:51.960 --> 02:03:55.380 And it's better than nothing. 02:03:55.380 --> 02:04:00.120 And remember Helen Dewitt was not a very detailed article. 02:04:00.120 --> 02:04:03.690 If I were to ask about, I don't know, some politician, 02:04:03.690 --> 02:04:08.340 or popular singer that has more data in Wikidata, 02:04:08.340 --> 02:04:12.690 then this machine-generated thing would have been richer. 02:04:12.690 --> 02:04:16.320 So this feature is available and is under beta testing 02:04:16.320 --> 02:04:19.530 right now, but generally if this sounds interesting for you, 02:04:19.530 --> 02:04:21.600 especially if you come from a smaller wiki that 02:04:21.600 --> 02:04:25.230 is missing a lot of articles that people may want to learn 02:04:25.230 --> 02:04:28.320 about, you can contact the Wikimedia Foundation 02:04:28.320 --> 02:04:33.486 and ask for article placeholder to be enabled on your wiki. 02:04:33.486 --> 02:04:34.860 And again, this is a placeholder. 02:04:34.860 --> 02:04:37.890 Of course, it exists only until someone actually 02:04:37.890 --> 02:04:43.290 writes a proper Esperanto article about Helen Dewitt. 02:04:43.290 --> 02:04:45.060 So I hope this is clear. 02:04:45.060 --> 02:04:50.810 This is all coming from Wikidata on the fly. 02:04:50.810 --> 02:04:51.470 In real time. 02:04:51.470 --> 02:04:57.500 As you can see it includes my latest edits to Helen Dewitt. 02:04:57.500 --> 02:04:58.940 OK.
02:04:58.940 --> 02:05:05.250 Questions about the-- questions about the article placeholder? 02:05:05.250 --> 02:05:09.580 If there are, try and put them on the channel. 02:05:09.580 --> 02:05:13.300 And this brings us to one of the main courses of this talk, 02:05:13.300 --> 02:05:15.270 which is querying Wikidata. 02:05:15.270 --> 02:05:18.660 So I've explained how Wikidata works. 02:05:18.660 --> 02:05:19.680 We've walked through it. 02:05:19.680 --> 02:05:20.850 We've added to it. 02:05:20.850 --> 02:05:22.800 We've created a new item. 02:05:22.800 --> 02:05:26.360 We learned how to contribute during our commutes. 02:05:26.360 --> 02:05:30.150 And all this was-- you kept promising us, 02:05:30.150 --> 02:05:32.050 Asaf, that this would be-- 02:05:32.050 --> 02:05:34.690 this would enable these amazing queries. 02:05:34.690 --> 02:05:37.960 So time to make good on that. 02:05:37.960 --> 02:05:42.880 The URL you need to remember is query.wikidata.org. 02:05:42.880 --> 02:05:49.390 And that will take you to a query system that 02:05:49.390 --> 02:05:52.510 uses a language called SPARQL. 02:05:52.510 --> 02:05:58.150 SPARQL, spelt with a Q. This language 02:05:58.150 --> 02:06:01.690 is not a Wikimedia creation. 02:06:01.690 --> 02:06:06.010 It's a standardized language used for querying linked data 02:06:06.010 --> 02:06:07.540 sources. 02:06:07.540 --> 02:06:10.720 And because of that there 02:06:10.720 --> 02:06:14.590 are certain usability prices that we pay for using SPARQL, 02:06:14.590 --> 02:06:16.010 for using a standard language. 02:06:16.010 --> 02:06:19.570 It's not completely custom made for querying Wikidata, 02:06:19.570 --> 02:06:21.740 and we'll see that in just a moment. 02:06:21.740 --> 02:06:23.530 The principle to remember about Wikidata 02:06:23.530 --> 02:06:27.880 query is that Wikidata will tell you everything it knows, 02:06:27.880 --> 02:06:29.470 but no more.
02:06:29.470 --> 02:06:32.440 I have anticipated this several times already, right? 02:06:32.440 --> 02:06:35.980 Until this moment when we taught Wikidata 02:06:35.980 --> 02:06:38.590 that Helen Dewitt speaks Latin, she 02:06:38.590 --> 02:06:41.500 would not have appeared in query results 02:06:41.500 --> 02:06:45.974 asking who are American writers who speak Latin? 02:06:45.974 --> 02:06:47.140 She would not have appeared. 02:06:47.140 --> 02:06:49.090 But as of this afternoon, she will 02:06:49.090 --> 02:06:52.950 appear because I've added that piece of information. 02:06:52.950 --> 02:07:01.380 So a result of that principle is that you can never say, 02:07:01.380 --> 02:07:05.950 well I ran a Wikidata query and this 02:07:05.950 --> 02:07:11.510 is the list of Flemish painters who are sons of painters. 02:07:11.510 --> 02:07:12.310 The list. 02:07:12.310 --> 02:07:14.110 That these are all the Flemish painters 02:07:14.110 --> 02:07:15.220 who are sons of painters. 02:07:15.220 --> 02:07:19.390 That is never something you can say based on a Wikidata query, 02:07:19.390 --> 02:07:22.390 because of course, maybe not all the Flemish painters 02:07:22.390 --> 02:07:26.020 who are sons of painters have been expressed in Wikidata 02:07:26.020 --> 02:07:26.760 yet. 02:07:26.760 --> 02:07:28.840 Wikidata doesn't know about some of them, 02:07:28.840 --> 02:07:30.340 or maybe it knows about all of them 02:07:30.340 --> 02:07:32.500 but doesn't know the important fact 02:07:32.500 --> 02:07:35.200 that this person is the son of that person, 02:07:35.200 --> 02:07:38.740 because those properties have not been added. 02:07:38.740 --> 02:07:40.940 And so they cannot be included in the results. 02:07:40.940 --> 02:07:42.550 So the results of a Wikidata query 02:07:42.550 --> 02:07:46.870 are never the definitive sets.
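The "American writers who speak Latin" question could be expressed in SPARQL roughly as below. This is a sketch under assumptions: the helper `build_query` is invented for this illustration, and the identifiers used (P27 country of citizenship, Q30 United States, P106 occupation, Q36180 writer, P1412 languages spoken, Q397 Latin) are believed to be the right ones but should be checked on Wikidata before relying on them.

```python
def build_query(constraints, limit=100):
    """Assemble a SPARQL query for items matching every (property, value) pair.

    The result only ever covers items that carry *all* of these statements,
    so it is "some" matching items, never a guaranteed complete list.
    """
    lines = ["SELECT ?item ?itemLabel WHERE {"]
    for prop, value in constraints:
        lines.append("  ?item wdt:" + prop + " wd:" + value + " .")
    # Label service of the Wikidata query system: turns Q numbers into names.
    lines.append(
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }'
    )
    lines.append("}")
    lines.append("LIMIT " + str(limit))
    return "\n".join(lines)


american_latin_writers = build_query(
    [
        ("P27", "Q30"),      # country of citizenship: United States (assumed ID)
        ("P106", "Q36180"),  # occupation: writer (assumed ID)
        ("P1412", "Q397"),   # languages spoken: Latin (assumed ID)
    ],
    limit=50,
)
```

An item missing any one of the three statements is silently left out, which is why adding the "speaks Latin" statement to one writer changes the answer.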
02:07:46.870 --> 02:07:49.600 What you can say about a Wikidata query is here 02:07:49.600 --> 02:07:52.840 are some Flemish painters who are sons of painters. 02:07:52.840 --> 02:07:56.260 Here are some cities with female mayors. 02:07:56.260 --> 02:07:58.270 Whatever it is you're querying about 02:07:58.270 --> 02:08:01.030 is never guaranteed to be complete 02:08:01.030 --> 02:08:03.580 because Wikidata, like Wikipedia, is 02:08:03.580 --> 02:08:05.530 a work in progress. 02:08:05.530 --> 02:08:13.240 And of course, the more we teach Wikidata the 02:08:13.240 --> 02:08:16.240 more useful it becomes. 02:08:16.240 --> 02:08:22.520 OK so let's go and see those queries. 02:08:22.520 --> 02:08:25.990 So this is query.wikidata.org. 02:08:25.990 --> 02:08:29.000 It's not the wiki. 02:08:29.000 --> 02:08:29.500 All right? 02:08:29.500 --> 02:08:32.530 So this isn't like some page on the wiki itself. 02:08:32.530 --> 02:08:35.099 This is kind of an external system. 02:08:35.099 --> 02:08:35.890 So it's not a wiki. 02:08:35.890 --> 02:08:37.960 You can see I don't have a user page here. 02:08:37.960 --> 02:08:39.520 I don't have a history tab. 02:08:39.520 --> 02:08:40.960 This isn't a wiki page. 02:08:40.960 --> 02:08:44.560 This is a special kind of tool or system. 02:08:44.560 --> 02:08:51.330 And it invites me to input a SPARQL query. 02:08:51.330 --> 02:08:55.060 Now most of us do not speak SPARQL. 02:08:55.060 --> 02:08:59.800 It's a technical language. 02:08:59.800 --> 02:09:01.720 It's a query language. 02:09:01.720 --> 02:09:06.760 Some of you may be thinking about SQL, the database query 02:09:06.760 --> 02:09:08.500 language. 02:09:08.500 --> 02:09:13.330 SPARQL is named with kind of a wink, or a nod, to SQL. 02:09:13.330 --> 02:09:17.440 But, I warn you, if you are comfortable in 02:09:17.440 --> 02:09:22.750 SQL don't expect to carry over your knowledge of SQL 02:09:22.750 --> 02:09:23.550 into SPARQL. 02:09:23.550 --> 02:09:26.140 They're not the same.
02:09:26.140 --> 02:09:27.940 They are superficially similar. 02:09:27.940 --> 02:09:28.440 Right? 02:09:28.440 --> 02:09:31.530 So they both use the keyword select, 02:09:31.530 --> 02:09:35.010 and they use the word where, and they use things like limit, 02:09:35.010 --> 02:09:35.770 and order. 02:09:35.770 --> 02:09:38.190 So again, if you know this already from SQL 02:09:38.190 --> 02:09:40.500 those mean roughly the same things, 02:09:40.500 --> 02:09:44.550 but don't expect it to behave just like SQL. 02:09:44.550 --> 02:09:49.800 You do need to spend some time understanding how SPARQL works. 02:09:49.800 --> 02:09:52.560 So, by all means, I invite you to go and read 02:09:52.560 --> 02:09:55.680 one of the many fine SPARQL tutorials that 02:09:55.680 --> 02:09:59.590 are out there on the web, or to click the Help button here, 02:09:59.590 --> 02:10:03.930 which also includes help about SPARQL. 02:10:03.930 --> 02:10:08.440 But I also know that most of us when 02:10:08.440 --> 02:10:12.580 we want to do some advanced formatting on wiki, 02:10:12.580 --> 02:10:16.090 for example, we don't go and read the help page 02:10:16.090 --> 02:10:18.220 on templates, right? 02:10:18.220 --> 02:10:21.460 We go to a page that already does what we want to do, 02:10:21.460 --> 02:10:27.430 and adopt and adapt the code from that other page, right? 02:10:27.430 --> 02:10:30.610 So we just take something that does roughly what we want, 02:10:30.610 --> 02:10:33.280 and just copy it over and change what we need to change. 02:10:33.280 --> 02:10:35.620 That is a very pragmatic and reasonable way 02:10:35.620 --> 02:10:37.420 to do things which is why-- 02:10:37.420 --> 02:10:39.850 and the Wikidata engineers know this, 02:10:39.850 --> 02:10:43.300 which is why they prepared this very handy button for us 02:10:43.300 --> 02:10:45.580 called examples. 02:10:45.580 --> 02:10:47.710 We click the examples button.
02:10:47.710 --> 02:10:52.390 And, oh my god, there is a ton of-- well there's 312 example 02:10:52.390 --> 02:10:55.582 queries for us to choose from. 02:10:55.582 --> 02:10:57.040 And we can just pick something that 02:10:57.040 --> 02:11:00.310 is roughly like what we're trying to find out, 02:11:00.310 --> 02:11:02.740 and then just change what needs changing. 02:11:02.740 --> 02:11:05.410 So let's take a very simple one. 02:11:05.410 --> 02:11:07.020 The cats query. 02:11:07.020 --> 02:11:10.270 Maybe one of the simplest you could possibly have. 02:11:10.270 --> 02:11:13.510 And let's run it first and then I'll kind of 02:11:13.510 --> 02:11:16.420 walk you through it. 02:11:16.420 --> 02:11:18.460 The goal here is not to teach you SPARQL, 02:11:18.460 --> 02:11:20.860 but to get you to be kind of literate in SPARQL. 02:11:20.860 --> 02:11:23.980 To kind of understand why this does what it does. 02:11:23.980 --> 02:11:25.730 So let's run this query first. 02:11:25.730 --> 02:11:31.390 We click Run and here I have results at the bottom. 02:11:31.390 --> 02:11:34.060 The item, which is just a Wikidata item, 02:11:34.060 --> 02:11:35.290 which of course is a number. 02:11:35.290 --> 02:11:38.860 Remember, Wikidata thinks of items as Q numbers. 02:11:38.860 --> 02:11:40.900 And the label, because we're humans 02:11:40.900 --> 02:11:43.190 and we prefer words to numbers. 02:11:43.190 --> 02:11:49.870 So these 114 results are all the cats 02:11:49.870 --> 02:11:53.310 that Wikidata knows about. 02:11:53.310 --> 02:11:55.380 Is this all the cats in the world? 02:11:55.380 --> 02:11:57.320 No, of course not, remember? 02:11:57.320 --> 02:11:59.730 It's all the cats Wikidata knows about, which 02:11:59.730 --> 02:12:01.410 means they're somehow notable. 02:12:01.410 --> 02:12:05.130 I mean someone bothered to describe them on Wikidata. 02:12:05.130 --> 02:12:12.570 And Wikidata was told this item is an instance of cat. 02:12:12.570 --> 02:12:13.620 Right?
02:12:13.620 --> 02:12:17.040 So these are those cats. 02:12:17.040 --> 02:12:18.540 And we can click any of them. 02:12:18.540 --> 02:12:20.190 I don't know, Pixel, for example. 02:12:20.190 --> 02:12:21.780 Click the Wikipedia item. 02:12:21.780 --> 02:12:24.090 And here is the Wikidata item about Pixel 02:12:24.090 --> 02:12:25.860 with the Q number. 02:12:25.860 --> 02:12:28.980 And he is a tortoiseshell cat. 02:12:28.980 --> 02:12:32.640 And as you can see instance of cat. 02:12:32.640 --> 02:12:33.610 OK. 02:12:33.610 --> 02:12:37.220 And he is five inches high. 02:12:37.220 --> 02:12:41.780 And he is apparently documented in Indonesian, in Bahasa. 02:12:41.780 --> 02:12:45.080 Right here this is Pixel. 02:12:45.080 --> 02:12:50.060 And he is apparently somehow related to the Guinness World 02:12:50.060 --> 02:12:52.160 Records book. 02:12:52.160 --> 02:12:54.650 I don't speak Bahasa, so I don't know exactly why 02:12:54.650 --> 02:12:56.120 this cat is so notable. 02:12:56.120 --> 02:12:58.889 But, of course, cats can become notable 02:12:58.889 --> 02:12:59.930 for all kinds of reasons. 02:12:59.930 --> 02:13:02.204 Maybe they're a YouTube sensation, 02:13:02.204 --> 02:13:03.620 you know, maybe they were involved 02:13:03.620 --> 02:13:05.330 in some historical event. 02:13:05.330 --> 02:13:09.410 I like this cat named Gladstone. 02:13:09.410 --> 02:13:16.590 This cat named Gladstone is-- 02:13:16.590 --> 02:13:19.950 he has position held Chief Mouser 02:13:19.950 --> 02:13:22.320 to Her Majesty's Treasury. 02:13:22.320 --> 02:13:25.230 This is an official cat with a job. 02:13:25.230 --> 02:13:29.190 And he has been holding this job, mind you, since the 28th 02:13:29.190 --> 02:13:31.570 of June this past year. 02:13:31.570 --> 02:13:32.970 That's the start time.
02:13:32.970 --> 02:13:35.760 And there is no end time which means he currently 02:13:35.760 --> 02:13:38.850 holds the position of Chief Mouser 02:13:38.850 --> 02:13:40.470 to Her Majesty's Treasury. 02:13:40.470 --> 02:13:42.750 His employer is Her Majesty's Treasury. 02:13:42.750 --> 02:13:44.290 He's a male creature. 02:13:44.290 --> 02:13:46.650 And Wikidata knows that this cat is 02:13:46.650 --> 02:13:53.127 named after William Gladstone, the Victorian prime minister. 02:13:53.127 --> 02:13:54.960 Of course if I don't know who this person is 02:13:54.960 --> 02:13:57.540 I can click through and learn that he 02:13:57.540 --> 02:14:01.860 was a liberal politician and prime minister, right? 02:14:01.860 --> 02:14:03.390 He even has a Twitter account. 02:14:03.390 --> 02:14:05.910 And Wikidata sends me right to it. 02:14:05.910 --> 02:14:08.040 The treasury cat Twitter account. 02:14:08.040 --> 02:14:11.010 And he has articles in German, and English, 02:14:11.010 --> 02:14:15.520 and of course Japanese, because he's a cat. 02:14:15.520 --> 02:14:16.020 All right. 02:14:16.020 --> 02:14:19.500 So this was a very simple query. 02:14:19.500 --> 02:14:21.400 Let's find out why it works. 02:14:21.400 --> 02:14:21.900 OK. 02:14:21.900 --> 02:14:25.800 So what did we actually tell Wikidata to do for us? 02:14:25.800 --> 02:14:31.650 We said, please select some items for us 02:14:31.650 --> 02:14:33.580 along with their labels. 02:14:33.580 --> 02:14:34.080 OK? 02:14:34.080 --> 02:14:36.180 Along with their human-readable labels 02:14:36.180 --> 02:14:42.010 because if I remove this label what I get is, see, 02:14:42.010 --> 02:14:44.200 just a list of item numbers. 02:14:44.200 --> 02:14:45.280 That's not as fun. 02:14:45.280 --> 02:14:46.930 So that's what this little bit did. 02:14:46.930 --> 02:14:49.630 I just said, give me the items, but also their 02:14:49.630 --> 02:14:52.330 human-readable label.
02:14:52.330 --> 02:14:54.620 And I want you to select a bunch of items, 02:14:54.620 --> 02:14:56.770 but not just any random bunch of items, 02:14:56.770 --> 02:15:01.210 I want to select items where a certain condition holds. 02:15:01.210 --> 02:15:02.790 What is the condition? 02:15:02.790 --> 02:15:06.430 The condition is that the item that I want you to select 02:15:06.430 --> 02:15:14.360 needs to have property 31 with a value of Q146. 02:15:14.360 --> 02:15:15.670 Well, that's helpful. 02:15:15.670 --> 02:15:18.070 If I hover over these numbers-- 02:15:18.070 --> 02:15:19.750 Again, I get the human readable version. 02:15:19.750 --> 02:15:23.530 So I'm looking for items that have property 02:15:23.530 --> 02:15:28.841 instance of with the value cat. 02:15:28.841 --> 02:15:29.340 Right? 02:15:29.340 --> 02:15:31.173 Because that's literally what I want, right? 02:15:31.173 --> 02:15:33.960 I want all the items that have a property, a statement, that 02:15:33.960 --> 02:15:36.840 says instance of cat. 02:15:36.840 --> 02:15:37.950 That's the condition. 02:15:37.950 --> 02:15:41.640 I'm not interested in items that are instance of book, 02:15:41.640 --> 02:15:43.200 or instance of human. 02:15:43.200 --> 02:15:46.290 I'm interested in instance of cat. 02:15:46.290 --> 02:15:51.090 That is the only condition here in this query. 02:15:51.090 --> 02:15:55.800 This complicated line I ask you to basically ignore. 02:15:55.800 --> 02:15:57.510 This is one of those sacrifices that we 02:15:57.510 --> 02:16:00.720 make for using a standard language like SPARQL. 02:16:00.720 --> 02:16:02.820 But the role of this complicated line 02:16:02.820 --> 02:16:04.920 is to basically ensure that we get 02:16:04.920 --> 02:16:07.860 the English label for that cat. 02:16:07.860 --> 02:16:08.817 OK? 02:16:08.817 --> 02:16:09.900 So don't worry about that. 02:16:09.900 --> 02:16:11.550 Just leave it there. 
02:16:11.550 --> 02:16:13.320 And we run the query and we get the list 02:16:13.320 --> 02:16:17.330 of cats with their English labels, and that is awesome. 02:16:17.330 --> 02:16:21.510 By the way, if I change EN, without really understanding 02:16:21.510 --> 02:16:27.260 this line, if I change EN to HE, for Hebrew, 02:16:27.260 --> 02:16:30.160 I get the same results with a Hebrew label. 02:16:30.160 --> 02:16:33.670 Of course, these cats, nobody bothered to give them 02:16:33.670 --> 02:16:35.709 Hebrew labels unfortunately. 02:16:35.709 --> 02:16:37.570 So I get the Q number. 02:16:37.570 --> 02:16:42.874 But if I changed it to Japanese, JA, 02:16:42.874 --> 02:16:45.290 I would get still a bunch of Q numbers for where there 02:16:45.290 --> 02:16:47.389 isn't a Japanese label, but I would get the labels 02:16:47.389 --> 02:16:48.781 in Japanese. 02:16:48.781 --> 02:16:49.280 OK? 02:16:49.280 --> 02:16:51.260 So this is an example of how you don't even 02:16:51.260 --> 02:16:54.620 need to understand all the syntax of this query 02:16:54.620 --> 02:16:56.100 to adapt it to your needs. 02:16:56.100 --> 02:16:58.070 If you want this query as is, but you 02:16:58.070 --> 02:17:00.320 want the labels in Japanese, you can just 02:17:00.320 --> 02:17:03.190 change the language code here. 02:17:03.190 --> 02:17:06.559 OK so that is all this query does. 02:17:06.559 --> 02:17:08.870 Again, just give me the items that 02:17:08.870 --> 02:17:17.590 have property 31, instance of, with a value 146, which is cat. 02:17:17.590 --> 02:17:20.379 Let's take a question just about this very simple query 02:17:20.379 --> 02:17:25.809 before we advance to more complicated queries. 02:17:25.809 --> 02:17:29.200 Any questions just about this? 02:17:29.200 --> 02:17:32.850 Like, did anyone kind of really lose me talking 02:17:32.850 --> 02:17:35.010 about this simple query?
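The cats query walked through above can be sketched roughly as follows. This is a reconstruction from the spoken description, not a copy of what was on screen: the variable names `?item` and `?itemLabel` and the helper `cats_query` are assumptions, and the label-service line stands in for the "complicated line" the talk says to leave alone.

```python
def cats_query(language: str = "en") -> str:
    """Build a SPARQL query for items that are instance of (P31) cat (Q146).

    `language` is the label language code, e.g. "en", "he" or "ja".
    """
    return f"""
SELECT ?item ?itemLabel
WHERE {{
  ?item wdt:P31 wd:Q146 .   # the one condition: instance of cat
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
""".strip()


if __name__ == "__main__":
    # Same query, Japanese labels: only the language code changes.
    print(cats_query("ja"))
```

The printed text is what would be pasted into the query box at query.wikidata.org; swapping the language code is the only edit needed to get labels in another language.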
02:17:35.010 --> 02:17:39.389 Again, this query just tells Wikidata, get me all the items 02:17:39.389 --> 02:17:41.280 that somewhere among their statements 02:17:41.280 --> 02:17:44.219 have instance of cat. 02:17:44.219 --> 02:17:46.670 That's the only condition. 02:17:46.670 --> 02:17:47.740 No questions. 02:17:47.740 --> 02:17:49.959 OK, feel free to ask if you come up with one. 02:17:49.959 --> 02:17:54.709 So let's complicate things a little. 02:17:54.709 --> 02:17:59.365 Let's ask only for male cats. 02:18:02.080 --> 02:18:03.070 OK. 02:18:03.070 --> 02:18:07.330 Remember this cat Gladstone is male, 02:18:07.330 --> 02:18:09.850 and we know this because he has a property called 02:18:09.850 --> 02:18:14.320 sex or gender, and the value is male creature, right? 02:18:14.320 --> 02:18:17.950 So let's add another condition right here 02:18:17.950 --> 02:18:19.860 under the first condition. 02:18:19.860 --> 02:18:20.870 OK? 02:18:20.870 --> 02:18:22.750 This is a new line. 02:18:22.750 --> 02:18:24.940 And I'm adding a new condition to the query. 02:18:24.940 --> 02:18:30.520 I'm saying, not only do I want this item that you return 02:18:30.520 --> 02:18:35.469 to be instance of cat, I also want this same item 02:18:35.469 --> 02:18:39.280 to have another property, the property sex or gender. 02:18:39.280 --> 02:18:40.299 Right? 02:18:40.299 --> 02:18:43.480 And I need to refer to the property by number. 02:18:43.480 --> 02:18:45.760 But don't worry, Wikidata will help you. 02:18:45.760 --> 02:18:49.500 So you start with this prefix, Wikidata WDT. 02:18:52.520 --> 02:18:54.980 Again, just ignore that prefix; it's 02:18:54.980 --> 02:18:58.940 one of the features of SPARQL that we need to respect. 02:18:58.940 --> 02:19:02.715 WDT colon, and then I can just type control space 02:19:02.715 --> 02:19:04.340 to do a search, to do an auto complete.
02:19:04.340 --> 02:19:08.090 So I can just type sex and Wikidata helpfully 02:19:08.090 --> 02:19:11.760 offers me a drop down with relevant properties. 02:19:11.760 --> 02:19:15.200 So I click property 21, which is the sex or gender property. 02:19:15.200 --> 02:19:17.629 And then I say, so I want the sex or gender property 02:19:17.629 --> 02:19:19.670 to have the Wikidata value. 02:19:19.670 --> 02:19:21.799 Again, control space. 02:19:21.799 --> 02:19:25.340 And I can just say male creature. 02:19:25.340 --> 02:19:25.850 See? 02:19:25.850 --> 02:19:30.950 There's a different item for male, as in human, 02:19:30.950 --> 02:19:33.799 and a different one for male creature, for reasons 02:19:33.799 --> 02:19:34.910 that we won't go into. 02:19:34.910 --> 02:19:36.535 Let's pick male creature, because we're 02:19:36.535 --> 02:19:38.040 talking about cats here. 02:19:38.040 --> 02:19:38.540 All right. 02:19:38.540 --> 02:19:42.080 And add a period here at the end and click Run. 02:19:42.080 --> 02:19:48.330 And instead of 114 cats, we get, this time, we got 43 results. 02:19:48.330 --> 02:19:53.360 Including our friend Gladstone who is a male creature cat. 02:19:53.360 --> 02:19:58.530 So that means all the rest are female, right? 02:19:58.530 --> 02:20:00.410 Wrong. 02:20:00.410 --> 02:20:00.980 Wrong. 02:20:00.980 --> 02:20:02.840 That does not mean that at all. 02:20:02.840 --> 02:20:06.530 What it means is of the 114 items that 02:20:06.530 --> 02:20:11.960 have instance of cat, only 43 have explicitly 02:20:11.960 --> 02:20:14.690 sex male creature. 02:20:14.690 --> 02:20:17.570 The rest of them do not. 02:20:17.570 --> 02:20:21.800 Maybe because they have sex female creature, 02:20:21.800 --> 02:20:25.930 but maybe because they don't have that property at all.
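The point about interpretation can be illustrated with a small in-memory model. Everything here is hypothetical toy data, not Wikidata content: the item names and the three-element statement tuples are made up only to show why "cats minus male cats" is not the same as "female cats". In the real query the second condition is a `wdt:P21` line with the value male creature.

```python
# Hypothetical toy statements: (item, property, value). Not real Wikidata data.
STATEMENTS = [
    ("cat_a", "instance of", "cat"),
    ("cat_a", "sex or gender", "male creature"),
    ("cat_b", "instance of", "cat"),
    ("cat_b", "sex or gender", "female creature"),
    ("cat_c", "instance of", "cat"),   # no sex or gender statement at all
    ("book_x", "instance of", "book"),
]


def items_with(prop, value, statements=STATEMENTS):
    """Return the items that explicitly carry this property/value statement."""
    return {item for item, p, v in statements if p == prop and v == value}


cats = items_with("instance of", "cat")
# Two conditions on the same item correspond to a set intersection.
male_cats = cats & items_with("sex or gender", "male creature")
female_cats = cats & items_with("sex or gender", "female creature")
# Cats with neither statement: the query simply knows nothing about them.
unspecified = cats - male_cats - female_cats

print(len(cats), len(male_cats), len(female_cats), len(unspecified))
```

In this toy set there are three cats, one explicitly male, one explicitly female, and one with no sex or gender statement, so subtracting the male count from the total would overstate the number of female cats.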
02:20:25.930 --> 02:20:28.290 I'm emphasizing this to kind of help 02:20:28.290 --> 02:20:31.770 you train yourself to correctly interpret 02:20:31.770 --> 02:20:34.140 the results of queries from Wikidata. 02:20:34.140 --> 02:20:36.870 Don't jump into this kind of simplistic conclusion, 02:20:36.870 --> 02:20:41.820 OK there's 114 total, 43 male, therefore the rest are female. 02:20:41.820 --> 02:20:43.520 That is not correct. 02:20:43.520 --> 02:20:45.030 OK? 02:20:45.030 --> 02:20:49.740 But 43 of those explicitly had another statement, sex 02:20:49.740 --> 02:20:52.530 or gender, male creature. 02:20:52.530 --> 02:20:55.020 So I just added another condition, 02:20:55.020 --> 02:20:58.290 and now my query is asking two separate things 02:20:58.290 --> 02:21:00.150 about the results. 02:21:00.150 --> 02:21:04.472 They need to be a cat and a male creature. 02:21:04.472 --> 02:21:06.270 AUDIENCE: Maybe we should see how many 02:21:06.270 --> 02:21:08.100 cats have Twitter accounts. 02:21:08.100 --> 02:21:11.440 But there is a question from YouTube, 02:21:11.440 --> 02:21:14.220 which is will you talk about the export possibilities 02:21:14.220 --> 02:21:17.280 of the result of the query? 02:21:17.280 --> 02:21:18.420 ASAF BARTOV: Absolutely. 02:21:18.420 --> 02:21:21.000 Absolutely I will in just a little bit. 02:21:21.000 --> 02:21:23.010 I mean there is, in addition to just getting 02:21:23.010 --> 02:21:28.350 this kind of table, I can get these results in other formats. 02:21:28.350 --> 02:21:30.360 And I can also download these results. 02:21:30.360 --> 02:21:32.820 I can click the Download button and get them 02:21:32.820 --> 02:21:35.070 as a comma separated file, tab separated 02:21:35.070 --> 02:21:38.910 file, a JSON file, which is useful for programmatic uses. 02:21:38.910 --> 02:21:40.590 I can also get a link. 02:21:40.590 --> 02:21:42.330 So I can get a link to this query. 
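As a rough illustration of the "programmatic uses" mentioned for downloaded results, the sketch below reads a comma-separated results file with Python's standard `csv` module. The sample text is a hypothetical placeholder, not real query output, and it assumes the downloaded file starts with a header row naming the query's variables (`item`, `itemLabel`); check an actual download before relying on that layout.

```python
import csv
import io

# Hypothetical stand-in for a downloaded CSV file; the URI and label are placeholders.
SAMPLE_CSV = (
    "item,itemLabel\n"
    "http://www.wikidata.org/entity/Q_PLACEHOLDER,Example cat\n"
)


def read_results(text):
    """Parse CSV query results into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def item_id(uri):
    """Return the trailing identifier of an entity URI (the part after the last slash)."""
    return uri.rsplit("/", 1)[-1]


rows = read_results(SAMPLE_CSV)
for row in rows:
    print(item_id(row["item"]), row["itemLabel"])
```

For a real file, `io.StringIO(text)` would be replaced by an opened file handle; the rest stays the same.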
02:21:42.330 --> 02:21:45.990 I mean, I spent all this time designing this beautiful query. 02:21:45.990 --> 02:21:50.280 I can get a short URL that was generated especially for me 02:21:50.280 --> 02:21:52.170 right now with a tiny URL. 02:21:52.170 --> 02:21:54.690 I can just paste this into Twitter and go, 02:21:54.690 --> 02:21:59.280 hey people look at all the male cats that Wikidata knows about. 02:21:59.280 --> 02:22:01.170 OK, this is not a very exciting query. 02:22:01.170 --> 02:22:03.900 But once I get to a really complicated exciting query 02:22:03.900 --> 02:22:07.650 I can totally share that very easily through this. 02:22:07.650 --> 02:22:09.750 And we will get to more interesting queries 02:22:09.750 --> 02:22:11.740 in just a second. 02:22:11.740 --> 02:22:16.400 Any questions on this kind of basic querying so far? 02:22:16.400 --> 02:22:17.940 OK. 02:22:17.940 --> 02:22:25.340 So that was a very simple example. 02:22:25.340 --> 02:22:30.250 Let's spend a moment exploring. 02:22:30.250 --> 02:22:38.920 So this cat Gladstone was named after this dude, William 02:22:38.920 --> 02:22:43.550 Gladstone, who was an important British politician. 02:22:43.550 --> 02:22:45.760 I'm sure he's not the only thing out there 02:22:45.760 --> 02:22:48.970 in the universe that's named after Gladstone, right? 02:22:48.970 --> 02:22:52.120 I mean there has got to be, I don't know, 02:22:52.120 --> 02:22:54.790 park benches, planets, asteroids, 02:22:54.790 --> 02:22:59.590 something other than the cat, named after this guy. 02:22:59.590 --> 02:23:04.030 So we can ask Wikidata to tell us all the things 02:23:04.030 --> 02:23:06.850 that, you know, without saying instance of something. 02:23:06.850 --> 02:23:10.960 Like, I don't know, anything named after William Gladstone. 02:23:10.960 --> 02:23:12.760 So how do I do that? 02:23:12.760 --> 02:23:15.310 Same principle. 
02:23:15.310 --> 02:23:19.850 Instead of asking about the property instance of, property 02:23:19.850 --> 02:23:25.360 31, instead of that, I will ask about the property 02:23:25.360 --> 02:23:26.860 named after-- 02:23:26.860 --> 02:23:29.120 sorry, named after-- 02:23:29.120 --> 02:23:30.830 I don't need to remember the number. 02:23:30.830 --> 02:23:32.240 I have auto-complete. 02:23:32.240 --> 02:23:35.360 Named after is property 138. 02:23:35.360 --> 02:23:37.430 And I want anything at all that is 02:23:37.430 --> 02:23:42.080 named after this person, William Gladstone. 02:23:42.080 --> 02:23:43.850 Here we go. 02:23:43.850 --> 02:23:45.860 Which is 160852. 02:23:45.860 --> 02:23:46.820 Whatever. 02:23:46.820 --> 02:23:48.230 OK. 02:23:48.230 --> 02:23:50.510 You notice I removed instance of cat. 02:23:50.510 --> 02:23:52.040 I removed the male creature. 02:23:52.040 --> 02:23:55.130 I'm only asking, get me all the items 02:23:55.130 --> 02:23:58.940 that are somehow named after that particular politician. 02:23:58.940 --> 02:24:00.920 And I run the query, and it turns out 02:24:00.920 --> 02:24:05.007 that Wikidata knows about three such things. 02:24:05.007 --> 02:24:06.590 Does that mean that's the only-- these 02:24:06.590 --> 02:24:08.881 are the only three things named after him in the world? 02:24:08.881 --> 02:24:09.939 Of course not. 02:24:09.939 --> 02:24:12.230 But these are the only three items that are in Wikidata 02:24:12.230 --> 02:24:17.720 and explicitly have the property named after Gladstone. 02:24:17.720 --> 02:24:20.150 For all I know, there may be a village 02:24:20.150 --> 02:24:23.600 in England called Gladstone named after this person. 02:24:23.600 --> 02:24:27.410 But if nobody added the property, named after, linking 02:24:27.410 --> 02:24:30.950 to the person, he wouldn't show up in the results to my query. 02:24:30.950 --> 02:24:33.750 So Wikidata knows about three such things.
02:24:33.750 --> 02:24:36.110 One of them is something called the Gladstone Professor 02:24:36.110 --> 02:24:37.360 of Government. 02:24:37.360 --> 02:24:40.370 I can click through and see that it's a chair at Oxford 02:24:40.370 --> 02:24:41.180 University, right? 02:24:41.180 --> 02:24:43.470 So it's a position. 02:24:43.470 --> 02:24:49.520 And another is the William Gladstone school number 18. 02:24:49.520 --> 02:24:51.470 William Gladstone school number 18. 02:24:51.470 --> 02:24:52.900 Where is that? 02:24:52.900 --> 02:24:55.380 That is in Sofia, Bulgaria. 02:24:55.380 --> 02:24:56.470 Again. 02:24:56.470 --> 02:24:59.000 All right, so that's a particular school in Bulgaria 02:24:59.000 --> 02:25:02.720 named after William Gladstone. 02:25:02.720 --> 02:25:07.220 And finally, the third result is, of course, our pal 02:25:07.220 --> 02:25:09.800 Gladstone the Chief Mouser. 02:25:09.800 --> 02:25:12.674 If I click through, that's the cat. 02:25:12.674 --> 02:25:14.090 All right, so that was an example. 02:25:14.090 --> 02:25:15.700 I mean, you saw how easy it was. 02:25:15.700 --> 02:25:18.980 I just named the property and the value that I care about, 02:25:18.980 --> 02:25:21.420 and I get the results. 02:25:21.420 --> 02:25:23.289 Again, I mean, it's kind of a silly example, 02:25:23.289 --> 02:25:24.080 but think about it. 02:25:24.080 --> 02:25:27.570 This is-- how else can you answer that question? 02:25:27.570 --> 02:25:30.470 There's no reference desk, even at a great University 02:25:30.470 --> 02:25:34.250 of Oxford, where you can walk in and say, give me 02:25:34.250 --> 02:25:37.470 a list of things named after Gladstone. 02:25:37.470 --> 02:25:40.590 There's no easy way to answer that unless you happen 02:25:40.590 --> 02:25:44.520 to have a very large structured and linked 02:25:44.520 --> 02:25:48.130 data store, like Wikidata. 02:25:48.130 --> 02:25:50.560 All right, so that was a silly example.
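The "named after" query described above, with the instance-of and sex-or-gender conditions removed, might look something like this. The property number (138) and the item number (160852) are the ones spoken in the talk; the helper name `named_after_query` and the variable names are assumptions made for this sketch.

```python
def named_after_query(target_qid: str, language: str = "en") -> str:
    """Build a SPARQL query for any item whose 'named after' (P138) value is target_qid."""
    return f"""
SELECT ?item ?itemLabel
WHERE {{
  ?item wdt:P138 wd:{target_qid} .   # named after: the given item, no other condition
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
""".strip()


if __name__ == "__main__":
    # Q160852 is the number read out in the talk for William Gladstone.
    print(named_after_query("Q160852"))
```

Because there is no instance-of condition, the results can be of any type, which matches the mix described in the talk: a professorship, a school and a cat.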
02:25:50.560 --> 02:25:51.280 Let's take some-- 02:25:51.280 --> 02:25:53.113 AUDIENCE: There's a bunch of stuff on there. 02:25:53.113 --> 02:25:54.446 ASAF: Oh, OK. 02:25:54.446 --> 02:25:57.430 AUDIENCE: Can you show easy query on the video? 02:25:57.430 --> 02:26:02.260 And somebody needs to know how to just do property 02:26:02.260 --> 02:26:05.750 exists without giving a specific value. 02:26:05.750 --> 02:26:11.030 And then once you show easy query you reload the page and-- 02:26:11.030 --> 02:26:13.240 ASAF: I don't know easy query. 02:26:13.240 --> 02:26:15.670 So is that a gadget? 02:26:15.670 --> 02:26:17.110 I don't know what easy query is. 02:26:17.110 --> 02:26:19.870 I don't use it. 02:26:19.870 --> 02:26:24.760 So someone can maybe send a link or something? 02:26:24.760 --> 02:26:26.100 Oh it is a gadget. 02:26:26.100 --> 02:26:27.100 I don't have it enabled. 02:26:31.610 --> 02:26:32.480 That is nice. 02:26:32.480 --> 02:26:42.080 So now, what I just did by hand, by formulating the query named 02:26:42.080 --> 02:26:45.200 after Gladstone-- 02:26:45.200 --> 02:26:48.390 I guess this is the-- 02:26:48.390 --> 02:26:48.960 Is it? 02:26:53.000 --> 02:26:53.720 Yeah. 02:26:53.720 --> 02:26:56.050 So this-- I just clicked the three-- 02:26:56.050 --> 02:26:57.470 the ellipsis here. 02:26:57.470 --> 02:26:58.460 Right after the name. 02:26:58.460 --> 02:26:59.630 You see this? 02:26:59.630 --> 02:27:03.050 This was just added by enabling easy query, 02:27:03.050 --> 02:27:04.640 which I just learned about. 02:27:04.640 --> 02:27:07.640 So you just click this and it auto-magically 02:27:07.640 --> 02:27:09.620 made this kind of trivial query. 02:27:09.620 --> 02:27:12.380 Of course, if I want a more complicated query like, 02:27:12.380 --> 02:27:14.510 I don't know, give me all the things that 02:27:14.510 --> 02:27:18.110 are named after Lincoln but are a school, 02:27:18.110 --> 02:27:21.650 I will still need to kind of edit a custom query. 
02:27:21.650 --> 02:27:23.450 But this is a super easy and very nice 02:27:23.450 --> 02:27:28.620 way of just doing a very super quick query for exactly this. 02:27:28.620 --> 02:27:29.120 Right? 02:27:29.120 --> 02:27:33.410 Like, what other items have exactly this property and value 02:27:33.410 --> 02:27:35.720 named after William Gladstone? 02:27:35.720 --> 02:27:38.750 So, thank you to whoever made this suggestion 02:27:38.750 --> 02:27:42.140 to demonstrate that, and I'm glad I learned something 02:27:42.140 --> 02:27:45.230 too today. 02:27:45.230 --> 02:27:48.590 Let's move to another sample query. 02:27:48.590 --> 02:27:50.360 Here's a fun example. 02:27:50.360 --> 02:27:56.910 Popular surnames among fictional characters. 02:27:56.910 --> 02:27:58.650 Think about that for a second. 02:27:58.650 --> 02:28:03.030 Popular surnames among fictional characters. 02:28:03.030 --> 02:28:06.510 So we're asking Wikidata to go through all 02:28:06.510 --> 02:28:10.120 the fictional characters you know, 02:28:10.120 --> 02:28:13.510 and of those look through their surnames, group 02:28:13.510 --> 02:28:15.910 them so that you can count them, the repetitions 02:28:15.910 --> 02:28:18.460 of the surnames, and give me the most 02:28:18.460 --> 02:28:21.550 popular surnames among them. 02:28:21.550 --> 02:28:26.280 Additionally, I want you to awesomely present the results 02:28:26.280 --> 02:28:28.020 as a bubble chart. 02:28:28.020 --> 02:28:29.220 Oh, yeah. 02:28:29.220 --> 02:28:31.050 Wikidata can do that. 02:28:31.050 --> 02:28:34.420 And I run the query. 02:28:34.420 --> 02:28:36.750 And check it out. 02:28:36.750 --> 02:28:41.130 The most popular names among fictional characters 02:28:41.130 --> 02:28:45.780 we can say that Wikidata knows about are Jones, Smith, Taylor, et cetera. 02:28:45.780 --> 02:28:48.450 I mean for all we know, the most popular name 02:28:48.450 --> 02:28:50.770 among fictional characters actually in the world 02:28:50.770 --> 02:28:52.350 may be Wu.
02:28:52.350 --> 02:28:54.790 Or something in Chinese for all we know. 02:28:54.790 --> 02:28:57.930 But if that has not been modeled in Wikidata, 02:28:57.930 --> 02:29:01.020 we're not going to get that. 02:29:01.020 --> 02:29:03.540 So Taylor, Smith, Jones, Williams, 02:29:03.540 --> 02:29:06.870 seem to be the most popular names. 02:29:06.870 --> 02:29:08.400 And again, I could limit this. 02:29:08.400 --> 02:29:11.520 I could make the same query but add, 02:29:11.520 --> 02:29:14.250 only among works whose original language 02:29:14.250 --> 02:29:19.020 was Italian, for example, to get more interesting results if I 02:29:19.020 --> 02:29:21.480 only care about Italian literature. 02:29:21.480 --> 02:29:24.720 But this is an example of how I got awesome bubble 02:29:24.720 --> 02:29:28.170 charts for free, and I can just plug this 02:29:28.170 --> 02:29:30.900 into an awesome presentation that I make. 02:29:30.900 --> 02:29:34.500 Of course I can still look at the raw table. 02:29:34.500 --> 02:29:37.940 So the query still resulted in a bunch of data, right? 02:29:37.940 --> 02:29:42.480 So Smith repeats 41 times, Jones 38 times, Taylor 34 times, 02:29:42.480 --> 02:29:43.750 et cetera, et cetera. 02:29:43.750 --> 02:29:48.960 And down that list. 02:29:48.960 --> 02:29:52.320 And I could, again, I could export this into a file 02:29:52.320 --> 02:29:56.100 and load it up in a spreadsheet, and do additional processing 02:29:56.100 --> 02:29:56.670 on it. 02:29:56.670 --> 02:29:58.560 I can link to it. 02:29:58.560 --> 02:30:02.530 I can do all kinds of awesome things with it. 02:30:02.530 --> 02:30:05.250 So that's another awesome query. 02:30:05.250 --> 02:30:08.460 We don't have to go into every line by line analysis 02:30:08.460 --> 02:30:11.670 here of why this works the way it does. 02:30:11.670 --> 02:30:15.840 I want to show you some other queries first. 02:30:15.840 --> 02:30:22.470 Let's look at-- this is just fun, overall causes of death. 
02:30:22.470 --> 02:30:24.870 Again a bubble chart just looking 02:30:24.870 --> 02:30:28.260 at people who died of things, and have 02:30:28.260 --> 02:30:30.760 a cause of death listed. 02:30:30.760 --> 02:30:34.380 And we learn that the most commonly listed cause of death 02:30:34.380 --> 02:30:40.350 is myocardial infarction, pneumonitis, cerebral vascular, 02:30:40.350 --> 02:30:42.620 lung cancer, et cetera, et cetera. 02:30:42.620 --> 02:30:44.850 And again, in a bubble chart. 02:30:44.850 --> 02:30:49.670 And so how does that work? 02:30:49.670 --> 02:30:53.050 So just very briefly, the important parts of this query 02:30:53.050 --> 02:30:59.150 are I'm looking for something, for some person, who 02:30:59.150 --> 02:31:04.240 is instance of 31, instance of Q5, which is human. 02:31:04.240 --> 02:31:05.390 So a human. 02:31:05.390 --> 02:31:07.130 Again, just to kind of limit the query. 02:31:07.130 --> 02:31:11.330 I'm not interested in books or mountains. 02:31:11.330 --> 02:31:14.420 I'm looking for humans who have that same person, 02:31:14.420 --> 02:31:21.150 that same variable PID, should have a 509, meaning-- 02:31:21.150 --> 02:31:22.412 Hello. 02:31:22.412 --> 02:31:24.620 Why don't I have the-- 02:31:24.620 --> 02:31:25.120 Yeah. 02:31:25.120 --> 02:31:28.480 A 509, which is cause of death. 02:31:28.480 --> 02:31:31.540 And that cause of death is another variable, 02:31:31.540 --> 02:31:32.930 that I'm calling CID. 02:31:32.930 --> 02:31:35.410 Now, previously we were saying you 02:31:35.410 --> 02:31:36.850 know I want things that are named 02:31:36.850 --> 02:31:39.550 after Gladstone specifically. 02:31:39.550 --> 02:31:42.000 Only things that have that particular value. 02:31:42.000 --> 02:31:44.320 Here I'm saying I'm looking for things 02:31:44.320 --> 02:31:47.110 that have some cause of death. 02:31:47.110 --> 02:31:48.760 Not a specific one. 
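A sketch of the cause-of-death query being described, in which the value of property 509 is a variable rather than a fixed item. The variable names `?pid` and `?cid` follow the talk; the `COUNT`, `GROUP BY` and `ORDER BY` clauses and the `#defaultView:BubbleChart` directive are assumptions about how the counting and bubble chart are obtained, not a transcription of the on-screen query. The short `Counter` example underneath is a plain-Python analogue of grouping and counting, run on hypothetical pairs.

```python
from collections import Counter

CAUSES_OF_DEATH_QUERY = """
#defaultView:BubbleChart
SELECT ?cid ?cidLabel (COUNT(?pid) AS ?count)
WHERE {
  ?pid wdt:P31 wd:Q5 .     # the person is an instance of human
  ?pid wdt:P509 ?cid .     # the person has some cause of death, whatever it is
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?cid ?cidLabel
ORDER BY DESC(?count)
""".strip()

# Hypothetical (person, cause) pairs standing in for the matched statements.
PAIRS = [("p1", "cause_a"), ("p2", "cause_a"), ("p3", "cause_b")]

# Grouping by cause and counting the people in each group.
counts = Counter(cause for _person, cause in PAIRS)
print(counts.most_common())
```

Leaving the value as a variable is also how one asks "does this property exist at all" without naming a specific value: any item with at least one such statement matches.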
02:31:48.760 --> 02:31:50.260 I just wanted to get everything that 02:31:50.260 --> 02:31:54.880 has a statement with some value about property 509 02:31:54.880 --> 02:31:56.530 cause of death. 02:31:56.530 --> 02:31:57.940 OK? 02:31:57.940 --> 02:32:04.410 And then this other bit of magic here, the group by, 02:32:04.410 --> 02:32:07.870 tells Wikidata I'm not actually interested 02:32:07.870 --> 02:32:09.100 in every individual thing. 02:32:09.100 --> 02:32:12.310 I want you to group those causes, and then count them 02:32:12.310 --> 02:32:14.230 and give me the top ones. 02:32:14.230 --> 02:32:15.523 So that's how this query works. 02:32:20.550 --> 02:32:22.320 Here's that query I promised. 02:32:22.320 --> 02:32:26.460 Painters whose fathers were also painters. 02:32:26.460 --> 02:32:28.630 I can only think of a couple. 02:32:28.630 --> 02:32:31.890 I mean, Monet and Vogel. 02:32:31.890 --> 02:32:34.800 But I'm sure Wikidata knows many more. 02:32:34.800 --> 02:32:38.620 So let's run this query. 02:32:38.620 --> 02:32:40.270 And I have 100 results. 02:32:40.270 --> 02:32:43.120 By the way, I have limited it to 100 results just 02:32:43.120 --> 02:32:44.650 to keep it kind of snappy. 02:32:44.650 --> 02:32:47.530 But actually, we could maybe try removing the limit 02:32:47.530 --> 02:32:50.170 and see if Wikidata could tell us 02:32:50.170 --> 02:32:53.890 the total number in Wikidata. 02:32:53.890 --> 02:32:55.120 Yeah, that wasn't too bad. 02:32:55.120 --> 02:32:58.400 So 1,270 results. 02:32:58.400 --> 02:32:59.140 OK. 02:32:59.140 --> 02:33:04.150 Wikidata, already at this early date in its progress, 02:33:04.150 --> 02:33:07.540 already knows about more than 1,200 painters 02:33:07.540 --> 02:33:10.980 who are sons of painters. 02:33:10.980 --> 02:33:16.140 Sons of male painters, like their father is a painter.
02:33:16.140 --> 02:33:18.120 There may be additional painters who 02:33:18.120 --> 02:33:21.390 are sons of female painters not included in this query. 02:33:21.390 --> 02:33:24.990 Again, always remember what exactly you are asking. 02:33:24.990 --> 02:33:27.840 In this query I was asking about the father. 02:33:27.840 --> 02:33:30.330 I'm leaving out any possible painters who 02:33:30.330 --> 02:33:32.720 are sons of mother painters. 02:33:32.720 --> 02:33:33.390 OK? 02:33:33.390 --> 02:33:35.250 So how does this work? 02:33:35.250 --> 02:33:39.630 I'm asking for the painter along with the human label, 02:33:39.630 --> 02:33:42.630 and the father along with the human label. 02:33:42.630 --> 02:33:47.610 So Michel Monet is the son of Claude Monet. 02:33:47.610 --> 02:33:54.180 And Domenico Tintoretto is the son of the famous Tintoretto 02:33:54.180 --> 02:33:57.210 whose label, you know, is just Tintoretto like Michelangelo. 02:33:57.210 --> 02:33:59.960 You know, you don't always have to have the full name 02:33:59.960 --> 02:34:02.420 in the common label. 02:34:02.420 --> 02:34:07.010 Paloma Picasso is the daughter of Pablo Picasso. 02:34:07.010 --> 02:34:07.510 OK. 02:34:07.510 --> 02:34:11.040 So Wikidata knows about all these results. 02:34:11.040 --> 02:34:14.610 Of course Holbein the Younger son of Holbein the Elder. 02:34:14.610 --> 02:34:15.760 And how did we get there? 02:34:15.760 --> 02:34:20.860 Well we asked Wikidata to look for something, 02:34:20.860 --> 02:34:26.820 let's call it painter, which has 106, which is occupation, 02:34:26.820 --> 02:34:31.100 with a value painter. 02:34:31.100 --> 02:34:31.600 Right? 02:34:31.600 --> 02:34:35.310 This unwieldy number 1028181, that's painter. 02:34:35.310 --> 02:34:40.250 So I'm asking for any item that has occupation painter. 02:34:40.250 --> 02:34:43.300 And let's call that item painter. 02:34:43.300 --> 02:34:49.770 I also want that painter to have a property 22, which is father. 
02:34:49.770 --> 02:34:50.850 OK. 02:34:50.850 --> 02:34:52.350 Father. 02:34:52.350 --> 02:34:55.140 And I want it to have some value. 02:34:55.140 --> 02:34:58.770 OK, I'm putting it into another variable called father. 02:34:58.770 --> 02:35:01.320 I could have called it, you know, frog. 02:35:01.320 --> 02:35:04.230 That doesn't change anything, just to be clear. 02:35:04.230 --> 02:35:06.630 What matters is that this is the property father. 02:35:06.630 --> 02:35:10.320 I could have called it anything I want. 02:35:10.320 --> 02:35:13.590 So, and then, I have a third condition. 02:35:13.590 --> 02:35:18.010 That the father, like whatever it says here in property 22, 02:35:18.010 --> 02:35:22.590 I want that father to have himself a property 106 02:35:22.590 --> 02:35:27.750 occupation with a value painter. 02:35:27.750 --> 02:35:28.730 OK? 02:35:28.730 --> 02:35:30.800 These conditions combined to give me 02:35:30.800 --> 02:35:36.080 a list of people who have a father and that father 02:35:36.080 --> 02:35:37.850 has occupation painter as well. 02:35:37.850 --> 02:35:40.550 Of course, if I suddenly, or if you suddenly, 02:35:40.550 --> 02:35:44.480 are consumed by curiosity to know 02:35:44.480 --> 02:35:51.344 who are some politicians who are sons of carpenters? 02:35:51.344 --> 02:35:52.760 You could just change that, right? 02:35:52.760 --> 02:35:56.700 Change the first value from painter to politician. 02:35:56.700 --> 02:36:02.624 Change the third line's value from painter to carpenter. 02:36:02.624 --> 02:36:04.040 Maybe that list will be very short 02:36:04.040 --> 02:36:06.680 because carpenters don't tend to be notable, 02:36:06.680 --> 02:36:08.910 so they wouldn't be represented on Wikidata. 02:36:08.910 --> 02:36:11.990 That's why this works relatively well with painters, right? 02:36:11.990 --> 02:36:14.420 Because most of them are notable. 02:36:14.420 --> 02:36:16.370 But generally you could do that, right? 
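The three conditions just described can be sketched as a small query builder. This is a minimal sketch, assuming the query on screen looks roughly like this; the function name and its parameters are hypothetical, and only the property numbers (106, 22) and the painter number (1028181) come from the talk.

```python
PAINTER = "Q1028181"  # the "unwieldy number" for painter quoted in the talk


def child_and_father_query(child_occupation: str = PAINTER,
                           father_occupation: str = PAINTER,
                           language: str = "en",
                           limit: int = 100) -> str:
    """Return SPARQL text for people whose father has a given occupation.

    Hypothetical helper: it only builds the query text and does not run it.
    """
    return f"""
SELECT ?painter ?painterLabel ?father ?fatherLabel WHERE {{
  ?painter wdt:P106 wd:{child_occupation} .   # occupation (P106) of the child
  ?painter wdt:P22 ?father .                  # father (P22), any value
  ?father wdt:P106 wd:{father_occupation} .   # the father's own occupation
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(child_and_father_query())
```

Swapping the first and third conditions is then a matter of passing different Q numbers, looked up on Wikidata, as `child_occupation` and `father_occupation`. The variable names stay ?painter and ?father either way, which, as the talk points out, changes nothing.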
02:36:16.370 --> 02:36:18.500 That's an example of how you can take a query 02:36:18.500 --> 02:36:22.340 and just replace one of those values, or even the language. 02:36:22.340 --> 02:36:26.840 So again, I could ask for these same painters. 02:36:26.840 --> 02:36:27.650 It's limited again. 02:36:27.650 --> 02:36:31.190 These same painters, but with Arabic labels. 02:36:31.190 --> 02:36:34.880 Same query, but I have Arabic labels for these painters. 02:36:34.880 --> 02:36:37.250 And of course where there is no Arabic label 02:36:37.250 --> 02:36:40.360 I get the Q number. 02:36:40.360 --> 02:36:40.860 OK? 02:36:40.860 --> 02:36:43.650 So that's that query that I promised you, 02:36:43.650 --> 02:36:47.670 painters who are sons of painters, can be done by Wikidata 02:36:47.670 --> 02:36:49.830 in under one second. 02:36:49.830 --> 02:36:51.480 How awesome is that? 02:36:51.480 --> 02:36:52.950 We can also get some statistics. 02:36:52.950 --> 02:36:55.920 So how about counting total articles 02:36:55.920 --> 02:36:59.740 in a given wiki by gender? 02:36:59.740 --> 02:37:02.070 This is what we call the content gender 02:37:02.070 --> 02:37:06.900 gap, as distinct from the participation gender gap. 02:37:06.900 --> 02:37:10.276 This is the gender gap in what we cover on Wikipedia. 02:37:10.276 --> 02:37:11.400 So let's take one of these. 02:37:16.380 --> 02:37:17.630 So this is a query. 02:37:17.630 --> 02:37:23.130 Articles about women in some given Wikipedia. 02:37:23.130 --> 02:37:23.660 All right. 02:37:23.660 --> 02:37:25.799 So let's take-- 02:37:25.799 --> 02:37:26.340 I don't know. 02:37:26.340 --> 02:37:30.240 Let's take the Tamil Wikipedia. 02:37:30.240 --> 02:37:32.460 That's language code TA. 02:37:32.460 --> 02:37:34.950 So I just put TA here. 02:37:34.950 --> 02:37:38.850 And I click Run, and I get this count. 02:37:38.850 --> 02:37:39.960 That's all I wanted.
02:37:39.960 --> 02:37:41.720 I'm not actually interested in the items, 02:37:41.720 --> 02:37:44.962 like in the list of women on the Tamil Wikipedia. 02:37:44.962 --> 02:37:45.920 I just want the number. 02:37:45.920 --> 02:37:48.510 So I selected the count here. 02:37:48.510 --> 02:37:52.610 And this number turns out to be 2,159. 02:37:52.610 --> 02:37:57.300 So there are about 2,000 articles about women 02:37:57.300 --> 02:38:02.350 on the Tamil Wikipedia that Wikidata knows to be female. 02:38:02.350 --> 02:38:02.850 Right? 02:38:02.850 --> 02:38:05.730 I'm asking about the gender field, property 21 again. 02:38:05.730 --> 02:38:08.900 Remember, if there's some article about a woman in Tamil 02:38:08.900 --> 02:38:12.090 Wikipedia, but Wikidata doesn't have 02:38:12.090 --> 02:38:14.460 a statement about the gender, that person 02:38:14.460 --> 02:38:15.640 will not be counted here. 02:38:15.640 --> 02:38:18.240 So again, be careful about kind of stating 02:38:18.240 --> 02:38:22.800 that this is exactly the number of women articles on Tamil 02:38:22.800 --> 02:38:23.340 Wikipedia. 02:38:23.340 --> 02:38:24.600 That's probably not true. 02:38:24.600 --> 02:38:27.560 I'm sure some of those articles are missing 02:38:27.560 --> 02:38:30.740 a sex or gender property. 02:38:30.740 --> 02:38:33.150 But for raw statistics, that's probably good, 02:38:33.150 --> 02:38:35.700 because some men are also missing the sex or gender 02:38:35.700 --> 02:38:37.620 property. 02:38:37.620 --> 02:38:41.820 So we could take the same query for men. 02:38:41.820 --> 02:38:43.170 It's essentially the exact same. 02:38:43.170 --> 02:38:48.840 It just has this unwieldy number for males, Q6581097. 02:38:48.840 --> 02:38:52.710 I can change this language code again to TA for Tamil. 02:38:52.710 --> 02:38:58.880 And how many men are covered on Tamil Wikipedia? 14,649. 02:38:58.880 --> 02:38:59.610 OK. 02:38:59.610 --> 02:39:06.880 So women, 2,100, men, about seven times as many.
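The pair of counting queries can be sketched as one parameterized builder. This is a sketch under assumptions: the function and constant names are hypothetical, and the sitelink pattern (schema:about with schema:isPartOf) is the usual Wikidata Query Service way to ask "has an article on this wiki", which may differ in detail from the query shown in the talk. The male number is the one quoted in the talk; the female number is the corresponding Wikidata item.

```python
FEMALE = "Q6581072"
MALE = "Q6581097"  # the number quoted in the talk


def gender_count_query(lang_code: str, gender_qid: str) -> str:
    """Return SPARQL text counting items of one gender with an article on a wiki.

    Hypothetical helper: it only builds the query text and does not run it.
    """
    return f"""
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {{
  ?item wdt:P21 wd:{gender_qid} .          # sex or gender (P21)
  ?article schema:about ?item ;            # a page that is about the item
           schema:isPartOf <https://{lang_code}.wikipedia.org/> .
}}
""".strip()


def gap_ratio(men: int, women: int) -> float:
    """How many times more articles about men than about women."""
    return men / women


if __name__ == "__main__":
    print(gender_count_query("ta", FEMALE))
    print(gender_count_query("ta", MALE))
    # Counts read out in the talk for Tamil Wikipedia.
    print(round(gap_ratio(14649, 2159), 1))
```

With the two counts read out in the talk, the ratio comes to roughly 6.8, which matches the "about seven times as many" estimate. Items with no sex or gender statement are invisible to both queries, as the talk cautions.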
02:39:06.880 --> 02:39:07.380 Right? 02:39:07.380 --> 02:39:12.300 So that's the approximate size of the content gender 02:39:12.300 --> 02:39:14.610 gap on Tamil Wikipedia. 02:39:14.610 --> 02:39:18.850 And again, I can complicate this query as much as I want. 02:39:18.850 --> 02:39:21.390 For example, I can try and find out 02:39:21.390 --> 02:39:30.390 if this gender gap is wider or narrower among musicians, 02:39:30.390 --> 02:39:31.350 just as an example. 02:39:31.350 --> 02:39:35.850 I could just add a line here that says occupation musician, 02:39:35.850 --> 02:39:37.890 and then I'm only counting articles 02:39:37.890 --> 02:39:41.190 on Tamil Wikipedia about musicians who are female 02:39:41.190 --> 02:39:43.190 versus articles on Tamil Wikipedia 02:39:43.190 --> 02:39:45.030 about musicians who are male. 02:39:45.030 --> 02:39:47.890 And I can kind of compare the gender-- 02:39:47.890 --> 02:39:53.820 the content gender gap across occupations on Tamil Wikipedia. 02:39:53.820 --> 02:39:56.030 Do you see the important point here? 02:39:56.030 --> 02:39:58.490 Is that this is not just kind of a one purpose query. 02:39:58.490 --> 02:40:01.250 I can just with a single additional conditional suddenly 02:40:01.250 --> 02:40:04.370 make it a much more interesting query, because I break it down 02:40:04.370 --> 02:40:05.540 by occupation. 02:40:05.540 --> 02:40:07.810 Or I break it down by century. 02:40:07.810 --> 02:40:12.530 Do we have more of the coverage gap in 19th century people 02:40:12.530 --> 02:40:13.940 than in 21st century people? 02:40:13.940 --> 02:40:15.560 I mean, I sure hope so, right? 02:40:15.560 --> 02:40:18.480 The patriarchy is weakening somewhat. 02:40:18.480 --> 02:40:21.830 So I wouldn't be surprised if there are many more notable men 02:40:21.830 --> 02:40:23.430 covered about the 19th century. 
02:40:23.430 --> 02:40:25.784 But if we are also covering-- 02:40:25.784 --> 02:40:27.200 I mean if the gender gap is just 02:40:27.200 --> 02:40:29.540 as wide for 21st century people, that would 02:40:29.540 --> 02:40:30.800 be a little disappointing. 02:40:30.800 --> 02:40:35.870 Again that's something I can fairly easily find out 02:40:35.870 --> 02:40:38.980 on Wikidata query. 02:40:38.980 --> 02:40:41.500 Any questions so far, or are you just sharing links? 02:40:41.500 --> 02:40:43.160 AUDIENCE: Yep there is one. 02:40:43.160 --> 02:40:47.480 So somebody is wondering if you can demonstrate, or at least 02:40:47.480 --> 02:40:50.420 give a short answer to the latter part of this question. 02:40:50.420 --> 02:40:52.530 Is it possible using Wikidata SPARQL 02:40:52.530 --> 02:40:55.520 to find specific Wikidata articles, e.g. 02:40:55.520 --> 02:40:59.060 featured articles, of a certain language which do not 02:40:59.060 --> 02:41:01.160 exist in another language? 02:41:01.160 --> 02:41:03.770 I know it is possible to find category based 02:41:03.770 --> 02:41:05.820 results using the PetScan tool. 02:41:05.820 --> 02:41:09.110 But can we specify that by selecting e.g. 02:41:09.110 --> 02:41:10.055 featured articles? 02:41:10.055 --> 02:41:11.390 ASAF BARTOV: Yes. 02:41:11.390 --> 02:41:12.600 Excellent question. 02:41:12.600 --> 02:41:14.120 It is possible, indeed. 02:41:14.120 --> 02:41:17.570 And I will demonstrate one such query. 02:41:17.570 --> 02:41:19.190 Another query that I already mentioned: 02:41:19.190 --> 02:41:24.840 largest cities in the world with a female mayor. 02:41:24.840 --> 02:41:29.190 This query-- let's close some of these tabs 02:41:29.190 --> 02:41:30.315 before my browser chokes. 02:41:33.600 --> 02:41:36.840 So this query lists the major world cities 02:41:36.840 --> 02:41:39.120 run by women currently. 02:41:39.120 --> 02:41:45.650 And the answer is Mumbai, Mexico City, Tokyo, bunch of others.
02:41:49.470 --> 02:41:52.371 And wait-- that's not it at all. 02:41:52.371 --> 02:41:53.370 I clicked the wrong one. 02:41:53.370 --> 02:41:55.050 That's the map of paintings. 02:41:55.050 --> 02:41:55.800 OK. 02:41:55.800 --> 02:41:57.370 Let's demonstrate that for a second. 02:41:57.370 --> 02:41:59.520 So this is the map of all paintings 02:41:59.520 --> 02:42:03.870 for which we know a location with the count per location. 02:42:03.870 --> 02:42:07.770 And the results are awesomely presented on a map. 02:42:07.770 --> 02:42:08.830 OK. 02:42:08.830 --> 02:42:12.420 Again, under the hood this is a table, of course, of results. 02:42:12.420 --> 02:42:15.660 But, awesomely, I can browse it as a map. 02:42:15.660 --> 02:42:20.320 So here is a map of the world with all the paintings 02:42:20.320 --> 02:42:22.060 that Wikidata knows about. 02:42:22.060 --> 02:42:23.920 Not just knows about the paintings, 02:42:23.920 --> 02:42:28.180 but knows about their location in a museum. 02:42:28.180 --> 02:42:30.670 Not surprisingly Europe is much better 02:42:30.670 --> 02:42:35.540 covered than Russia or Africa. 02:42:35.540 --> 02:42:40.150 There is a huge gap in contribution to Wikidata 02:42:40.150 --> 02:42:41.740 from these countries. 02:42:41.740 --> 02:42:43.780 And some of it can be fixed. 02:42:43.780 --> 02:42:47.740 And of course there is much more documentation, and much more 02:42:47.740 --> 02:42:50.260 art in Europe. 02:42:50.260 --> 02:42:54.280 But if we zoom in, I don't know, Rome probably 02:42:54.280 --> 02:42:55.900 has a few paintings. 02:42:55.900 --> 02:42:56.400 Right? 02:43:00.080 --> 02:43:02.288 Hello. 02:43:02.288 --> 02:43:04.200 Sorry. 02:43:04.200 --> 02:43:09.780 It's-- Yes. 02:43:09.780 --> 02:43:13.290 Vatican City sounds like a good bet, right? 02:43:13.290 --> 02:43:14.290 I can zoom in here. 02:43:14.290 --> 02:43:16.290 And I can just click one of these dots 02:43:16.290 --> 02:43:21.400 and see in this point there are two paintings. 
02:43:21.400 --> 02:43:25.270 And in this one there is one and it's the Archbasilica 02:43:25.270 --> 02:43:27.460 of St. John Lateran. 02:43:27.460 --> 02:43:31.060 Let's see, this is the actual St. Peter, right? 02:43:31.060 --> 02:43:33.650 Sistine Chapel has 23 paintings. 02:43:33.650 --> 02:43:34.330 What? 02:43:34.330 --> 02:43:36.670 The Sistine Chapel has way more than 23 paintings. 02:43:36.670 --> 02:43:40.330 Correct, but 23 of them are documented on Wikidata. 02:43:40.330 --> 02:43:43.330 Have their own item for the painting, not 02:43:43.330 --> 02:43:45.280 the Sistine Chapel, the painting has 02:43:45.280 --> 02:43:49.540 an item that lists its being in the Sistine Chapel. 02:43:49.540 --> 02:43:50.950 There are 23 of those. 02:43:50.950 --> 02:43:52.270 OK. 02:43:52.270 --> 02:43:54.310 There is definitely room to document 02:43:54.310 --> 02:43:57.040 the rest of the artworks in the Sistine Chapel. 02:43:57.040 --> 02:43:59.740 So, again, this is just not the kind of query 02:43:59.740 --> 02:44:03.330 you were able to make before Wikidata, 02:44:03.330 --> 02:44:07.750 and it's a fairly simple query, as you can see. 02:44:07.750 --> 02:44:13.020 There are examples using maps like airports within 100 02:44:13.020 --> 02:44:15.040 kilometers of Berlin. 02:44:15.040 --> 02:44:18.310 Again using the coordinates as a useful data point. 02:44:18.310 --> 02:44:21.880 And here is a map showing me only airports within a 100 02:44:21.880 --> 02:44:25.990 kilometer radius from Berlin. 02:44:25.990 --> 02:44:29.140 But I wanted to show you the mayors query. 02:44:29.140 --> 02:44:34.510 Let's click the-- oh I just have the wrong link here. 02:44:34.510 --> 02:44:41.040 But I can still find it here by typing mayor. 02:44:41.040 --> 02:44:44.590 Here we go, largest cities with female mayor. 02:44:44.590 --> 02:44:47.230 So this is a slightly more complicated query. 02:44:47.230 --> 02:44:53.010 But if I run it, I get the top 10, because I set limit to 10. 
02:44:53.010 --> 02:44:54.820 I get the top 10 cities in the world, 02:44:54.820 --> 02:44:59.710 by population size, that are currently run by women. 02:44:59.710 --> 02:45:03.490 Tokyo, Mumbai, Yokohama, Caracas, et cetera. 02:45:03.490 --> 02:45:08.080 And one interesting thing that you may want to notice here 02:45:08.080 --> 02:45:10.690 is that I'm asking for cities. 02:45:10.690 --> 02:45:13.660 I mean items that are instance of city. 02:45:13.660 --> 02:45:16.420 And that have a head of government, 02:45:16.420 --> 02:45:18.640 that have some statement about who 02:45:18.640 --> 02:45:28.440 is in charge, and that statement has sex that's listed up here 02:45:28.440 --> 02:45:29.886 as female. 02:45:29.886 --> 02:45:31.510 Don't worry about the syntax right now. 02:45:31.510 --> 02:45:34.590 I just want to show you some specific angle here. 02:45:34.590 --> 02:45:37.920 And I'm further filtering these results. 02:45:37.920 --> 02:45:45.400 I only want those items where there is not the property 02:45:45.400 --> 02:45:48.630 and the qualifier, end time. 02:45:48.630 --> 02:45:50.390 Why is that important? 02:45:50.390 --> 02:45:56.530 Because if a city once had a female mayor, 02:45:56.530 --> 02:45:59.890 but that mayor is not the mayor anymore, because mayors change, 02:45:59.890 --> 02:46:01.600 I don't want them in this query. 02:46:01.600 --> 02:46:04.990 I want a query of cities currently having 02:46:04.990 --> 02:46:05.680 a female mayor. 02:46:05.680 --> 02:46:07.990 And of course Wikidata may have historical data 02:46:07.990 --> 02:46:09.880 with start and end time, as we've 02:46:09.880 --> 02:46:14.530 seen, that documents this person was the mayor of Tokyo 02:46:14.530 --> 02:46:17.170 or San Francisco between these years. 02:46:17.170 --> 02:46:18.820 But if there is no end time, that means 02:46:18.820 --> 02:46:21.520 they are currently the mayor.
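The qualifier filter described here can be sketched as follows. This is a minimal reconstruction, assuming the on-screen query resembles the well-known Wikidata Query Service example; the function name is hypothetical, and the population property used for sorting is an assumption based on the "by population size" remark.

```python
def female_mayor_cities_query(limit: int = 10) -> str:
    """Return SPARQL text for the largest cities whose current head of
    government is recorded as female.

    Hypothetical helper: it only builds the query text and does not run it.
    """
    return f"""
SELECT DISTINCT ?city ?cityLabel ?mayor ?mayorLabel ?population WHERE {{
  ?city wdt:P31 wd:Q515 .              # instance of (P31) city (Q515)
  ?city p:P6 ?statement .              # a head of government (P6) statement
  ?statement ps:P6 ?mayor .            # the person named in that statement
  ?mayor wdt:P21 wd:Q6581072 .         # sex or gender (P21) is female
  FILTER NOT EXISTS {{ ?statement pq:P582 ?end . }}   # no end time (P582) qualifier
  ?city wdt:P1082 ?population .        # population (P1082), used for sorting
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY DESC(?population)
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(female_mayor_cities_query())
```

The query goes through the full statement (p: and ps:) rather than the direct wdt: shortcut because qualifiers such as end time hang off the statement, not off the value. Dropping the FILTER NOT EXISTS line would let former mayors back into the results.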
02:46:21.520 --> 02:46:24.490 So that's an example of asking about a qualifier 02:46:24.490 --> 02:46:28.180 of a statement, to again, to get the results we actually want. 02:46:28.180 --> 02:46:31.630 If we want current mayors it's important to put this filter. 02:46:31.630 --> 02:46:35.365 If we don't, we will get historical female mayors 02:46:35.365 --> 02:46:35.865 as well. 02:46:39.920 --> 02:46:40.490 All right. 02:46:40.490 --> 02:46:45.380 So these are some example queries. 02:46:45.380 --> 02:46:49.085 Questions about that? 02:46:51.620 --> 02:46:53.030 Oh, the featured article example. 02:46:58.280 --> 02:47:01.700 So let's look at that. 02:47:07.050 --> 02:47:12.660 So I have prepared such a query recently. 02:47:12.660 --> 02:47:15.300 Here we go. 02:47:15.300 --> 02:47:18.570 So this is a query. 02:47:18.570 --> 02:47:20.472 I just saved it here on my user page. 02:47:20.472 --> 02:47:21.930 I mean, this is not Wikidata query. 02:47:21.930 --> 02:47:25.390 This is just a meta page containing the query usefully. 02:47:28.260 --> 02:47:33.800 And let's run this. 02:47:33.800 --> 02:47:38.030 So this query, it's actually not very complicated. 02:47:38.030 --> 02:47:40.030 It's just has a long list of countries, 02:47:40.030 --> 02:47:42.170 because I'm asking about African countries. 02:47:42.170 --> 02:47:42.670 OK. 02:47:42.670 --> 02:47:45.010 I'm looking for human females from one 02:47:45.010 --> 02:47:51.060 of these countries that have an article in English. 02:47:51.060 --> 02:47:53.010 That's what this line means. 02:47:53.010 --> 02:47:55.620 But not in French. 02:47:55.620 --> 02:47:57.570 That's what this part means. 02:47:57.570 --> 02:47:59.170 OK. 02:47:59.170 --> 02:48:01.720 This part, these two lines together. 02:48:01.720 --> 02:48:03.190 But not in French. 02:48:03.190 --> 02:48:05.920 And this is what's called a badge. 02:48:05.920 --> 02:48:09.430 That's Wikidata's concept of good and featured articles. 
02:48:09.430 --> 02:48:10.600 It's called a badge. 02:48:10.600 --> 02:48:16.500 So I want them to have some badge on English Wikipedia. 02:48:16.500 --> 02:48:17.000 OK? 02:48:17.000 --> 02:48:22.250 So again, this query is asking for the top 100 women 02:48:22.250 --> 02:48:26.150 from Africa who are documented on English Wikipedia, 02:48:26.150 --> 02:48:28.730 in a featured or good article status. 02:48:28.730 --> 02:48:30.660 But not on French Wikipedia. 02:48:30.660 --> 02:48:33.270 So this is a query that's a to-do query, right? 02:48:33.270 --> 02:48:35.630 That's a query for French editors 02:48:35.630 --> 02:48:40.100 to consider what they might usefully translate or create 02:48:40.100 --> 02:48:41.180 in French. 02:48:41.180 --> 02:48:48.860 And if we run this, we see we have three results. 02:48:48.860 --> 02:48:50.720 I mean, we have many women from Africa 02:48:50.720 --> 02:48:52.460 covered on English Wikipedia. 02:48:52.460 --> 02:48:57.500 But only three articles have featured or good status 02:48:57.500 --> 02:49:03.460 among those that do not have French Wikipedia coverage. 02:49:03.460 --> 02:49:04.900 Let me rephrase that. 02:49:04.900 --> 02:49:07.990 Among the English Wikipedia articles about African women 02:49:07.990 --> 02:49:11.170 that don't have a French counterpart, 02:49:11.170 --> 02:49:14.520 only three are featured or good. 02:49:14.520 --> 02:49:16.960 OK? 02:49:16.960 --> 02:49:17.640 Do you see this? 02:49:17.640 --> 02:49:19.720 The badge is good article. 02:49:19.720 --> 02:49:23.550 This little incantation here is what allows 02:49:23.550 --> 02:49:25.950 you to ask about the badge. 02:49:25.950 --> 02:49:28.730 This here. 02:49:28.730 --> 02:49:33.420 And, by the way, the slides will be uploaded to Commons. 02:49:33.420 --> 02:49:38.708 And we will-- how shall we make it available on the YouTube 02:49:38.708 --> 02:49:39.710 thing as well? 02:49:42.730 --> 02:49:43.230 No, no.
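The badge query described above can be sketched like this. The actual query is on the speaker's user page and is not reproduced in the transcript, so everything here is an assumption: the function name, the use of country of citizenship (P27) for "from one of these countries", and the caller-supplied country list. The wikibase:badge predicate is the Wikidata Query Service's way to read a sitelink's badge.

```python
from typing import Sequence


def untranslated_badged_articles_query(country_qids: Sequence[str],
                                       have_lang: str = "en",
                                       missing_lang: str = "fr",
                                       limit: int = 100) -> str:
    """Return SPARQL text for women from the given countries who have a
    badged article on one Wikipedia and no article on another.

    Hypothetical helper: it only builds the query text and does not run it.
    """
    countries = " ".join(f"wd:{qid}" for qid in country_qids)
    return f"""
SELECT ?item ?itemLabel ?article ?badgeLabel WHERE {{
  VALUES ?country {{ {countries} }}
  ?item wdt:P31 wd:Q5 ;                # human
        wdt:P21 wd:Q6581072 ;          # female
        wdt:P27 ?country .             # country of citizenship (assumed property)
  ?article schema:about ?item ;
           schema:isPartOf <https://{have_lang}.wikipedia.org/> ;
           wikibase:badge ?badge .     # the article carries some badge
  FILTER NOT EXISTS {{
    ?other schema:about ?item ;
           schema:isPartOf <https://{missing_lang}.wikipedia.org/> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # Two example country items (Nigeria, Kenya); the talk's query lists many more.
    print(untranslated_badged_articles_query(["Q1033", "Q114"]))
```

As written, the sketch accepts any badge; restricting ?badge to the featured and good article badge items would narrow it to exactly the two statuses mentioned in the talk.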
02:49:43.230 --> 02:49:45.870 But, I mean, for people who will later watch this video. 02:49:52.119 --> 02:49:54.160 Oh yeah, we can add it to the YouTube description 02:49:54.160 --> 02:49:55.368 and the Commons description. 02:49:55.368 --> 02:49:58.090 So in the-- if you're watching this video later, 02:49:58.090 --> 02:50:00.820 in the description, we will add a link to this query 02:50:00.820 --> 02:50:01.480 specifically. 02:50:01.480 --> 02:50:03.340 Because it's not in the slides right now. 02:50:03.340 --> 02:50:03.910 It will be. 02:50:06.622 --> 02:50:07.980 OK. 02:50:07.980 --> 02:50:10.260 So. 02:50:10.260 --> 02:50:13.590 Questions so far? 02:50:13.590 --> 02:50:14.700 We're almost done. 02:50:14.700 --> 02:50:16.260 We have a few minutes left. 02:50:16.260 --> 02:50:18.090 So questions about queries? 02:50:18.090 --> 02:50:20.130 I mean, I'm sure there's tons of things 02:50:20.130 --> 02:50:21.510 you don't know how to do yet. 02:50:21.510 --> 02:50:24.720 And maybe you didn't really get a sense for SPARQL. 02:50:24.720 --> 02:50:27.120 It's something you need to really do on your own 02:50:27.120 --> 02:50:28.290 on your computer. 02:50:28.290 --> 02:50:29.465 See how it works. 02:50:29.465 --> 02:50:30.090 Fiddle with it. 02:50:30.090 --> 02:50:30.900 Change something. 02:50:30.900 --> 02:50:33.270 See that it breaks and complains. 02:50:33.270 --> 02:50:37.470 But, very importantly-- oh I had this in the other questions 02:50:37.470 --> 02:50:38.340 slide. 02:50:38.340 --> 02:50:42.480 Remember Wikidata project chat. 02:50:42.480 --> 02:50:45.810 That's kind of the Wikidata equivalent of the village pump. 02:50:45.810 --> 02:50:47.790 It's the page on Wikidata where you can just 02:50:47.790 --> 02:50:49.830 show up and ask a question.
02:50:49.830 --> 02:50:52.290 In my experience, the Wikidata community 02:50:52.290 --> 02:50:55.410 is very nice, very welcoming, and very eager 02:50:55.410 --> 02:51:00.100 to help newer people integrate and learn how to do things. 02:51:00.100 --> 02:51:01.800 There's also an IRC channel. 02:51:01.800 --> 02:51:04.260 If you know what IRC is and how to use it, by all means, 02:51:04.260 --> 02:51:07.890 go to the IRC channel #wikidata. 02:51:07.890 --> 02:51:09.330 There's people there all the time, 02:51:09.330 --> 02:51:11.040 and you can just ask a question. 02:51:11.040 --> 02:51:13.245 If you're trying to do a query, and you don't quite 02:51:13.245 --> 02:51:15.870 understand the syntax, or you're not sure how to get the result 02:51:15.870 --> 02:51:16.680 you want, 02:51:16.680 --> 02:51:20.050 there are people there who will gladly help you do that. 02:51:20.050 --> 02:51:22.560 There is also a Wikidata newsletter 02:51:22.560 --> 02:51:25.680 published by the Wikidata team, which is centered in Germany, 02:51:25.680 --> 02:51:27.330 at Wikimedia Germany. 02:51:27.330 --> 02:51:31.890 And they send out a newsletter in English with Wikidata news. 02:51:31.890 --> 02:51:33.570 You know, new properties, new items, 02:51:33.570 --> 02:51:34.920 new things in the project. 02:51:34.920 --> 02:51:36.840 But also sample queries. 02:51:36.840 --> 02:51:39.300 So once a week there is kind of an awesome query 02:51:39.300 --> 02:51:43.440 to learn from, if you want to learn that way instead 02:51:43.440 --> 02:51:46.230 of reading like a whole manual on SPARQL. 02:51:46.230 --> 02:51:48.300 So I'm just encouraging you to get help 02:51:48.300 --> 02:51:49.470 in one of those channels. 02:51:49.470 --> 02:51:51.000 Of course you can write to me. 02:51:51.000 --> 02:51:55.920 Just reach out to me and ask me questions as well. 02:51:55.920 --> 02:51:58.860 I hope by now you agree that Wikidata is love, 02:51:58.860 --> 02:52:03.150 and Wikidata is awesome.
02:52:03.150 --> 02:52:06.480 If there are no questions, we do have a tiny bit of time 02:52:06.480 --> 02:52:11.510 to demonstrate one more tool but that's-- 02:52:11.510 --> 02:52:12.010 no? 02:52:12.010 --> 02:52:13.170 No questions. 02:52:13.170 --> 02:52:17.600 OK so let's talk about-- 02:52:17.600 --> 02:52:19.100 well, the Reasonator is kind of nice, 02:52:19.100 --> 02:52:22.890 but it's a little like the article placeholder. 02:52:22.890 --> 02:52:25.530 So this is not Wikidata, this is a tool, again 02:52:25.530 --> 02:52:26.805 built by Magnus Manske-- 02:52:26.805 --> 02:52:29.310 AUDIENCE: There's also one final question to you in case-- 02:52:29.310 --> 02:52:29.820 ASAF BARTOV: Oh, there is a question. 02:52:29.820 --> 02:52:30.390 AUDIENCE: Yeah. 02:52:30.390 --> 02:52:32.348 ASAF BARTOV: What are the advantages and disadvantages 02:52:32.348 --> 02:52:35.370 of creating an item before an article is 02:52:35.370 --> 02:52:37.920 done on English Wikipedia? 02:52:37.920 --> 02:52:42.340 Well, I mean, this example that I just made, right? 02:52:42.340 --> 02:52:46.960 I'm reading this book by a notable author. 02:52:46.960 --> 02:52:47.810 OK. 02:52:47.810 --> 02:52:51.400 I want this to exist on Wikidata, 02:52:51.400 --> 02:52:53.320 and to be mentioned on Wikidata, so 02:52:53.320 --> 02:52:56.950 that when people look up that author in Wikidata 02:52:56.950 --> 02:52:59.170 they will know about one of his notable works. 02:52:59.170 --> 02:53:02.470 But I'm not prepared to put in the time investment 02:53:02.470 --> 02:53:05.670 to build a whole article on English Wikipedia. 02:53:05.670 --> 02:53:07.420 Either because I don't have the time, or I 02:53:07.420 --> 02:53:09.040 don't have good sources. 02:53:09.040 --> 02:53:11.560 Or maybe my English is not good enough, 02:53:11.560 --> 02:53:14.980 but it is good enough to just record these very basic facts 02:53:14.980 --> 02:53:17.850 and point to the Library of Congress records et cetera.
02:53:17.850 --> 02:53:20.170 So that it's better than nothing. 02:53:20.170 --> 02:53:23.170 So that's one reason to maybe do it. 02:53:23.170 --> 02:53:26.690 Another reason is to be able to link to it. 02:53:26.690 --> 02:53:30.190 So remember that translator lady already 02:53:30.190 --> 02:53:33.280 had an item on Wikidata, but if she hadn't we could have just 02:53:33.280 --> 02:53:38.560 created a very, very basic rudimentary item about her just 02:53:38.560 --> 02:53:41.740 saying, you know, this name is human. 02:53:41.740 --> 02:53:43.060 Country, Bulgaria. 02:53:43.060 --> 02:53:45.220 Occupation, translator. 02:53:45.220 --> 02:53:48.580 Even just that would have been something, 02:53:48.580 --> 02:53:51.610 and would have enabled me to link to this person. 02:53:51.610 --> 02:53:56.860 So these are legitimate reasons to create Wikidata entities 02:53:56.860 --> 02:54:01.510 without, or at least before, creating a Wikipedia article. 02:54:01.510 --> 02:54:02.709 If you are going to create-- 02:54:02.709 --> 02:54:04.750 I mean if you're at an edit-a-thon or something, 02:54:04.750 --> 02:54:07.690 and you have come to create Wikipedia articles, 02:54:07.690 --> 02:54:10.660 by all means, first create the Wikipedia article, 02:54:10.660 --> 02:54:13.982 then create the Wikidata item and link to it. 02:54:17.580 --> 02:54:20.500 I hope that answers the question. 02:54:20.500 --> 02:54:24.940 So the Reasonator is simply a kind 02:54:24.940 --> 02:54:31.330 of prettier view of items in Wikidata. 02:54:31.330 --> 02:54:35.980 So you can just type the name of an item or the number. 02:54:35.980 --> 02:54:39.010 Let's pick just a random number, 42. 02:54:39.010 --> 02:54:39.595 Say 42. 02:54:42.770 --> 02:54:45.950 Which happens to be, maybe you've 02:54:45.950 --> 02:54:51.310 heard of this guy, Douglas Adams. 02:54:51.310 --> 02:54:55.490 He happened to have received the Q number 42.
02:54:55.490 --> 02:54:58.760 I'm sure it's a cosmic coincidence 02:54:58.760 --> 02:55:01.460 of infinite improbability. 02:55:01.460 --> 02:55:03.470 And this is a view-- 02:55:03.470 --> 02:55:05.690 this is a tool that is not Wikidata. 02:55:05.690 --> 02:55:09.690 It's a tool built on top of Wikidata called Reasonator. 02:55:09.690 --> 02:55:14.750 And it gives us the information from Q42, that is from the-- 02:55:14.750 --> 02:55:18.800 this item in Wikidata, which looks like an item in Wikidata. 02:55:18.800 --> 02:55:21.320 But it gives it to us in a slightly more rational kind 02:55:21.320 --> 02:55:22.430 of layout. 02:55:22.430 --> 02:55:24.200 It even kind of generates a little bit 02:55:24.200 --> 02:55:27.620 of pseudo article text for us. 02:55:27.620 --> 02:55:30.429 You know, Douglas Adams was a British writer, playwright, 02:55:30.429 --> 02:55:31.970 screenwriter, bla-bla-bla, an author. 02:55:31.970 --> 02:55:35.630 He was born on this date, in this place, to these people. 02:55:35.630 --> 02:55:39.080 He studied at this place between these years. 02:55:39.080 --> 02:55:40.670 That's all machine generated. 02:55:40.670 --> 02:55:42.230 Nobody wrote this text. 02:55:42.230 --> 02:55:46.330 That's all taken from those statements in Wikidata, 02:55:46.330 --> 02:55:51.080 and generates this reasonable reading summary paragraph. 02:55:51.080 --> 02:55:54.140 And then it gives us this little table of relatives. 02:55:54.140 --> 02:55:55.610 It's all taken from Wikidata. 02:55:55.610 --> 02:55:57.740 But as you can see, this is already 02:55:57.740 --> 02:56:02.120 a little more accessible than the essentially arbitrary 02:56:02.120 --> 02:56:05.120 ordering of statements on Wikidata. 02:56:05.120 --> 02:56:06.200 And that's OK. 02:56:06.200 --> 02:56:08.060 I mean, that's kind of by design. 02:56:08.060 --> 02:56:10.100 Wikidata is the platform.
02:56:10.100 --> 02:56:11.960 There is going to be-- there are going 02:56:11.960 --> 02:56:15.680 to be many new applications, and platforms, and tools, 02:56:15.680 --> 02:56:19.010 and visual interfaces on top of Wikidata 02:56:19.010 --> 02:56:23.000 to browse Wikidata in more friendly or more customized 02:56:23.000 --> 02:56:24.480 ways. 02:56:24.480 --> 02:56:27.080 For example, one of the things that Reasonator 02:56:27.080 --> 02:56:31.610 does for us is give us pictures and maps and a timeline. 02:56:31.610 --> 02:56:32.960 Check this out. 02:56:32.960 --> 02:56:38.990 Timeline, machine generated, just from dates and points 02:56:38.990 --> 02:56:44.090 in time mentioned in the relatively rich Wikidata 02:56:44.090 --> 02:56:47.200 item about Douglas Adams. 02:56:47.200 --> 02:56:47.700 Right? 02:56:47.700 --> 02:56:50.030 So this timeline, for example again, completely machine 02:56:50.030 --> 02:56:51.140 generated. 02:56:51.140 --> 02:56:53.270 But he was educated between these years, 02:56:53.270 --> 02:56:54.920 so I can put it on the timeline. 02:56:54.920 --> 02:56:57.260 And this is the year he was nominated for a Hugo Award, 02:56:57.260 --> 02:56:59.570 so I can put that in a timeline. 02:56:59.570 --> 02:57:00.600 Et cetera. 02:57:00.600 --> 02:57:03.050 So that's just a super quick demonstration 02:57:03.050 --> 02:57:06.620 of that tool, the Reasonator. 02:57:06.620 --> 02:57:10.310 Links are all here in the slides. 02:57:10.310 --> 02:57:13.390 And the final tool I wanted to mention very quickly 02:57:13.390 --> 02:57:16.220 is the Mix'n'match tool. 02:57:16.220 --> 02:57:21.980 You remember my explanation about Wikidata as nexus, 02:57:21.980 --> 02:57:27.320 as connection point between many databases, many data sources. 02:57:27.320 --> 02:57:31.080 Those depend on these equivalencies. 02:57:31.080 --> 02:57:35.300 On Wikidata being taught that this item is like that 02:57:35.300 --> 02:57:37.940 ID in this other database.
02:57:37.940 --> 02:57:41.810 And Mix'n'match is a tool, again, by Magnus Manske. 02:57:41.810 --> 02:57:44.690 Maybe you're detecting a pattern here. 02:57:44.690 --> 02:57:47.390 It's a tool by Magnus that is designed 02:57:47.390 --> 02:57:50.270 to enable us to kind of take a foreign, 02:57:50.270 --> 02:57:54.950 an external data set, put it alongside Wikidata, 02:57:54.950 --> 02:57:56.690 and kind of try and align them. 02:57:56.690 --> 02:57:59.410 So this item in this external data set, 02:57:59.410 --> 02:58:01.230 is that already covered in Wikidata? 02:58:01.230 --> 02:58:02.890 If so, by what Q number? 02:58:02.890 --> 02:58:03.890 By what item? 02:58:03.890 --> 02:58:06.170 If not, maybe we need to create a Wikidata 02:58:06.170 --> 02:58:07.610 item to represent it. 02:58:07.610 --> 02:58:10.010 Or maybe it's a duplicate, or something. 02:58:10.010 --> 02:58:15.980 So the Mix'n'match tool has a list of external data sets, 02:58:15.980 --> 02:58:18.140 as you can see. 02:58:18.140 --> 02:58:21.260 The Art and Architecture Thesaurus by the Getty Research 02:58:21.260 --> 02:58:22.220 Institute. 02:58:22.220 --> 02:58:26.690 Or the Australian Dictionary of Biography. 02:58:26.690 --> 02:58:28.880 All kinds of external data sets here. 02:58:32.470 --> 02:58:40.060 Somewhere here I had a specific link to the Royal Society. 02:58:40.060 --> 02:58:41.710 It can also give me some statistics. 02:58:41.710 --> 02:58:47.410 So there is an external data set of all the Fellows of the Royal 02:58:47.410 --> 02:58:48.001 Society. 02:58:48.001 --> 02:58:48.500 Right? 02:58:48.500 --> 02:58:54.970 The oldest academic learned society in England. 02:58:54.970 --> 02:58:57.415 And the internet is tired. 02:59:03.240 --> 02:59:04.640 Here we go. 02:59:04.640 --> 02:59:07.115 Nope. 02:59:07.115 --> 02:59:08.105 Did that work? 02:59:12.560 --> 02:59:15.390 Fellows of the Royal Society, here we go. 02:59:15.390 --> 02:59:17.970 So this one is complete.
02:59:17.970 --> 02:59:21.330 I mean, people have manually gone over every single item 02:59:21.330 --> 02:59:24.330 there and either matched it to Wikidata 02:59:24.330 --> 02:59:27.390 or declared that it was not in scope, or a duplicate 02:59:27.390 --> 02:59:28.520 or whatever. 02:59:28.520 --> 02:59:31.230 But let's look at site stats. 02:59:31.230 --> 02:59:35.210 This is a fun kind of aspect of this tool. 02:59:35.210 --> 02:59:38.530 But that is not working. 02:59:38.530 --> 02:59:40.820 Or it's taking too long. 02:59:40.820 --> 02:59:43.940 So let's just demonstrate how this works. 02:59:43.940 --> 02:59:45.590 Maybe Britannica? 02:59:45.590 --> 02:59:46.780 Is that done already? 02:59:52.570 --> 02:59:53.990 Here we go. 02:59:53.990 --> 02:59:55.330 Encyclopedia Britannica. 02:59:55.330 --> 02:59:55.960 Yeah. 02:59:55.960 --> 03:00:02.040 So the Encyclopedia Britannica has-- 03:00:02.040 --> 03:00:05.940 40% of the items there are not yet processed. 03:00:05.940 --> 03:00:07.830 So let's process one of them. 03:00:07.830 --> 03:00:16.180 For example, there is an item in the Encyclopedia Britannica 03:00:16.180 --> 03:00:19.960 called Boston, England. 03:00:19.960 --> 03:00:23.050 As you know, all American place names 03:00:23.050 --> 03:00:26.050 are totally stolen from elsewhere. 03:00:26.050 --> 03:00:29.440 So there is a Boston in England, though it's 03:00:29.440 --> 03:00:30.700 no longer the famous one. 03:00:30.700 --> 03:00:36.340 And the Mix'n'match tool has automatically 03:00:36.340 --> 03:00:39.610 matched it based on the label to 03:00:39.610 --> 03:00:43.900 Q100, which is Boston, big city in the United States. 03:00:43.900 --> 03:00:45.500 And that is incorrect, right? 03:00:45.500 --> 03:00:48.910 That's kind of a naive computer going, well, this is Boston, 03:00:48.910 --> 03:00:50.820 and this other thing is also Boston. 03:00:50.820 --> 03:00:56.260 And it is asking me to confirm this match or not. 03:00:56.260 --> 03:00:57.400 You see?
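The workflow just described (an automatic guess based on the label, followed by a human confirming or rejecting it) can be sketched as below. This is an illustration under assumptions, not the tool's real code or API: the `LABEL_INDEX` lookup table, the `Entry` class, and the `naive_match`, `reject`, and `set_item` helpers are all hypothetical names. Q100 is the item the tool guessed in the demo, and Q311975 is the item number the speaker looks up for the Lincolnshire town in the next part of the demo.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label index: the item a naive matcher finds first for a bare label.
LABEL_INDEX = {"Boston": "Q100"}  # Q100 = Boston, the city in the United States


@dataclass
class Entry:
    """One record from an external data set, plus its matching state."""
    external_name: str
    suggested_item: Optional[str] = None  # automatic guess, awaiting review
    confirmed_item: Optional[str] = None  # set only by a human decision


def naive_match(entry: Entry) -> Entry:
    """Guess an item by label only, ignoring the qualifier after the comma."""
    bare_label = entry.external_name.split(",")[0].strip()
    entry.suggested_item = LABEL_INDEX.get(bare_label)
    return entry


def reject(entry: Entry) -> Entry:
    """Human says the automatic guess is wrong: the entry becomes unmatched."""
    entry.suggested_item = None
    return entry


def set_item(entry: Entry, item_id: str) -> Entry:
    """Human supplies the correct item ID by hand."""
    entry.confirmed_item = item_id
    entry.suggested_item = None
    return entry


entry = naive_match(Entry("Boston, England"))
auto_guess = entry.suggested_item  # the naive, label-only guess
reject(entry)                      # not the American Boston
set_item(entry, "Q311975")         # the town in Lincolnshire
print(auto_guess, entry.confirmed_item)
```

Keeping the automatic suggestion separate from the confirmed match is the design point: the guess is cheap and often right, but nothing counts as a match until a person has reviewed it.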
03:00:57.400 --> 03:01:01.120 So this is the Boston, England from Britannica. 03:01:01.120 --> 03:01:04.720 And the tool is asking me, is this the same as 03:01:04.720 --> 03:01:06.910 Boston, Q100, in America? 03:01:06.910 --> 03:01:07.990 The answer is no. 03:01:07.990 --> 03:01:10.110 I remove this. 03:01:10.110 --> 03:01:11.860 I remove this match. 03:01:11.860 --> 03:01:15.430 And now this Boston, England is unmatched. 03:01:15.430 --> 03:01:23.230 And I can match it to the correct one in England. 03:01:23.230 --> 03:01:27.370 I can do this by searching English Wikipedia, 03:01:27.370 --> 03:01:28.780 or searching Wikidata. 03:01:28.780 --> 03:01:32.000 I mean, it has these handy links. 03:01:32.000 --> 03:01:36.910 So the English town is in Lincolnshire. 03:01:36.910 --> 03:01:38.230 Boston, Lincolnshire. 03:01:38.230 --> 03:01:46.030 So I can go there and then get the Wikidata item number. 03:01:46.030 --> 03:01:49.810 See, this is not Q100, Boston in the States; 03:01:49.810 --> 03:01:53.440 this is Q311975, town in Lincolnshire. 03:01:53.440 --> 03:01:57.310 I can get this Q number, go back to the 03:01:57.310 --> 03:01:58.160 Mix'n'match tool-- 03:01:58.160 --> 03:01:59.110 Where was that? 03:01:59.110 --> 03:02:00.180 Here we are. 03:02:00.180 --> 03:02:01.510 And set Q. 03:02:01.510 --> 03:02:08.650 I can tell the tool that this is the right Boston, and click OK. 03:02:08.650 --> 03:02:14.550 And now this town in Lincolnshire, 03:02:14.550 --> 03:02:17.100 you can see this here, this item, Q311975, 03:02:17.100 --> 03:02:21.190 is linked to Britannica. 03:02:21.190 --> 03:02:22.660 What does this mean? 03:02:22.660 --> 03:02:23.820 Well, if we go there.
03:02:23.820 --> 03:02:25.380 If we actually go to the Wikidata 03:02:25.380 --> 03:02:28.890 entity, you will see that in addition 03:02:28.890 --> 03:02:34.140 to the few statements that it already had, it now has, 03:02:34.140 --> 03:02:38.610 thanks to my clicking, it now has another identifier here. 03:02:38.610 --> 03:02:39.270 See? 03:02:39.270 --> 03:02:43.950 Encyclopedia Britannica Online ID, with this link. 03:02:43.950 --> 03:02:49.440 And if we click it, we will indeed reach this page 03:02:49.440 --> 03:02:51.510 in the Britannica Online, which is indeed 03:02:51.510 --> 03:02:53.700 about this town in Lincolnshire. 03:02:53.700 --> 03:02:54.510 You see? 03:02:54.510 --> 03:02:58.650 So I've contributed one of those mappings, one 03:02:58.650 --> 03:03:01.950 of those identifiers, into Wikidata. 03:03:01.950 --> 03:03:04.860 And I didn't have to do it manually. 03:03:04.860 --> 03:03:07.980 This tool kind of prompted me to either confirm-- 03:03:07.980 --> 03:03:09.480 if it was correct, I could have just 03:03:09.480 --> 03:03:12.150 clicked confirm. Since it wasn't correct, 03:03:12.150 --> 03:03:16.920 I corrected it manually, but it made this edit on my behalf. 03:03:16.920 --> 03:03:21.180 So that's another tool that encourages us to systematically 03:03:21.180 --> 03:03:24.360 teach Wikidata more things. 03:03:24.360 --> 03:03:25.860 And we're out of time. 03:03:25.860 --> 03:03:29.430 Go edit Wikidata. Now that you have the power, 03:03:29.430 --> 03:03:30.510 you know the deal. 03:03:30.510 --> 03:03:32.430 Use it for good, and not for evil. 03:03:32.430 --> 03:03:35.640 If you have questions, this is my email address. 03:03:35.640 --> 03:03:38.640 If you're watching this video not live, the description 03:03:38.640 --> 03:03:41.610 will have links to the slides, and to a bunch 03:03:41.610 --> 03:03:44.610 of other useful pieces of information. 03:03:44.610 --> 03:03:49.510 Any last questions on IRC?
03:03:49.510 --> 03:03:53.210 If not, thank you for your attention. 03:03:53.210 --> 03:03:56.470 And if you like this, and if you feel that you now get Wikidata, 03:03:56.470 --> 03:03:58.330 and you get what it's good for, and you're 03:03:58.330 --> 03:04:01.660 inspired to contribute, I have only one request from you. 03:04:01.660 --> 03:04:04.960 I mean, in addition to using it for good not for evil, 03:04:04.960 --> 03:04:07.630 I ask that you spread the word. 03:04:07.630 --> 03:04:09.550 Show this video-- share this video 03:04:09.550 --> 03:04:13.180 with other people in your community, or around you. 03:04:13.180 --> 03:04:16.000 Teach this yourself once you're comfortable 03:04:16.000 --> 03:04:17.650 with these concepts. 03:04:17.650 --> 03:04:21.330 Feel free to use my slides. 03:04:21.330 --> 03:04:23.580 Yeah, and edit Wikidata. 03:04:23.580 --> 03:04:27.010 Thank you very much, and goodbye.