WEBVTT 00:00:00.000 --> 00:00:19.480 36C3 preroll music 00:00:19.480 --> 00:00:24.140 Herald Angel: We have Tom and Max here. They have a talk here with a very 00:00:24.140 --> 00:00:28.140 complicated title that I don't quite understand yet. It's called "Interactively 00:00:28.140 --> 00:00:35.810 Discovering Implicational Knowledge in Wikidata." And they told me the point of 00:00:35.810 --> 00:00:39.190 the talk is that I would like to understand what it means and I hope I 00:00:39.190 --> 00:00:42.190 will. So good luck. Tom: Thank you very much. 00:00:42.190 --> 00:00:44.310 Herald: And have some applause, please. 00:00:44.310 --> 00:00:47.880 applause 00:00:47.880 --> 00:00:54.980 T: Thank you very much. Do you hear me? Does it work? Hello? Oh, very good. Thank 00:00:54.980 --> 00:00:58.789 you very much and welcome to our talk about interactively discovering 00:00:58.789 --> 00:01:05.110 implicational knowledge in Wikidata. It is more or less a fun project we started 00:01:05.110 --> 00:01:10.890 for finding rules that are implicit in Wikidata – entailed just by the data it 00:01:10.890 --> 00:01:18.850 has, that people inserted into the Wikidata database so far. And we will 00:01:18.850 --> 00:01:23.570 start with the explicit knowledge. So the explicit data in Wikidata, with Max. 00:01:23.570 --> 00:01:28.340 Max: So. Right. What is Wikidata? Maybe you have heard about Wikidata, then 00:01:28.340 --> 00:01:33.210 that's all fine. Maybe you haven't, then surely you've heard of Wikipedia. And 00:01:33.210 --> 00:01:36.790 Wikipedia is run by the Wikimedia Foundation and the Wikimedia Foundation 00:01:36.790 --> 00:01:41.330 has several other projects. And one of those is Wikidata. And Wikidata is 00:01:41.330 --> 00:01:45.490 basically a large graph that encodes machine readable knowledge in the form of 00:01:45.490 --> 00:01:51.730 statements.
And a statement basically consists of some entity that is connected 00:01:51.730 --> 00:01:58.200 – or some entities that are connected by some property. And these properties 00:01:58.200 --> 00:02:02.909 can then even have annotations on them. So, for example, we have Donna Strickland 00:02:02.909 --> 00:02:09.149 here and we encode that she has received a Nobel prize in physics last year by this 00:02:09.149 --> 00:02:16.290 property "awarded" and this has then a qualifier "time: 2018" and also "for: 00:02:16.290 --> 00:02:23.100 Chirped Pulse Amplification". And all in all, we have some 890 million statements 00:02:23.100 --> 00:02:31.960 on Wikidata that connect 71 million items using 7000 properties. But there's also a 00:02:31.960 --> 00:02:36.830 bit more. So we also know that Donna Strickland has "field of work: optics" and 00:02:36.830 --> 00:02:41.420 also "field of work: lasers" so we can use the same property to connect some entity 00:02:41.420 --> 00:02:46.480 with different other entities. And we don't even have to have knowledge that 00:02:46.480 --> 00:02:56.530 connects the entities. We can have a date of birth, which is 1959. Nineteen ninety. 00:02:56.530 --> 00:03:05.530 No. Nineteen fifty nine. Yes. And this is then just a plain date, not an entity. And 00:03:05.530 --> 00:03:11.510 now coming from the explicit knowledge then, well, we have some more: we have 00:03:11.510 --> 00:03:16.209 Donna Strickland has received a Nobel prize in physics and also Marie Curie has 00:03:16.209 --> 00:03:21.170 received the Nobel prize in physics. And we also know that Marie Curie has a Nobel 00:03:21.170 --> 00:03:27.780 prize ID that starts with "phys" and then "1903" and some random numbers that 00:03:27.780 --> 00:03:32.970 basically are this ID. Then Marie Curie also has received a Nobel prize in 00:03:32.970 --> 00:03:38.580 chemistry in 1911. So she has another Nobel ID that starts with "chem" and has 00:03:38.580 --> 00:03:43.590 "1911" there.
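The statement model just described (entity, property, value, plus qualifier annotations) can be pictured as a small data structure. This is only an illustrative sketch; the field names are ours, not Wikidata's actual JSON schema.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    """One Wikidata-style statement: subject --property--> value,
    optionally annotated with qualifiers."""
    subject: str
    property: str
    value: str
    qualifiers: dict = field(default_factory=dict)

# Donna Strickland's Nobel prize, with the qualifiers from the talk.
award = Statement(
    subject="Donna Strickland",
    property="awarded",
    value="Nobel Prize in Physics",
    qualifiers={"time": "2018", "for": "Chirped Pulse Amplification"},
)
# The same property can be used twice with different values...
optics = Statement("Donna Strickland", "field of work", "optics")
lasers = Statement("Donna Strickland", "field of work", "lasers")
# ...and a value need not be an entity at all, e.g. a plain date.
born = Statement("Donna Strickland", "date of birth", "1959")
```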
And then there's also Frances Arnold, who received the Nobel 00:03:43.590 --> 00:03:48.549 prize in chemistry last year. So she has a Nobel ID that starts with "chem" and has 00:03:48.549 --> 00:03:54.740 "2018" there. And now one could assume that, well, everybody who was awarded the 00:03:54.740 --> 00:04:00.156 Nobel prize should also have a Nobel ID. So everybody who was awarded the Nobel 00:04:00.156 --> 00:04:05.670 prize should also have a Nobel prize ID, and we could write that as some 00:04:05.670 --> 00:04:11.791 implication here. So "awarded(nobelPrize)" implies "nobelID". And well, if you 00:04:11.791 --> 00:04:16.349 look sharply at this picture, then there's this arrow here conspicuously missing: 00:04:16.349 --> 00:04:22.550 Donna Strickland doesn't have a Nobel prize ID. And indeed, there's 25 people 00:04:22.550 --> 00:04:26.669 currently on Wikidata that are missing Nobel prize IDs, and Donna Strickland is 00:04:26.669 --> 00:04:34.060 one of them. So we call these people that don't satisfy this implication – we call 00:04:34.060 --> 00:04:40.419 those counterexamples and well, if you look at Wikidata on the scale of really 00:04:40.419 --> 00:04:45.350 these 890 million statements, then you won't find any counterexamples because 00:04:45.350 --> 00:04:52.550 it's just too big. So we need some way to automatically do that. And the idea is 00:04:52.550 --> 00:04:58.930 that, well, if we had this knowledge that some implications are not satisfied, 00:04:58.930 --> 00:05:03.840 then this encodes maybe missing information or wrong information, and we 00:05:03.840 --> 00:05:10.870 want to represent that in a way that is easy to understand and also succinct. So 00:05:10.870 --> 00:05:16.090 it doesn't take long to write it down, it should have a short representation. So 00:05:16.090 --> 00:05:23.060 that rules out anything including complex syntax or logical quantifiers.
So no SPARQL 00:05:23.060 --> 00:05:27.480 queries as a description of that implicit knowledge. No description logics, if 00:05:27.480 --> 00:05:33.199 you've heard of that. And we also want something that we can actually compute on 00:05:33.199 --> 00:05:41.539 actual hardware in a reasonable timeframe. So our approach is we use Formal Concept 00:05:41.539 --> 00:05:46.889 Analysis, which is a technique that has been developed over the past several years 00:05:46.889 --> 00:05:52.070 to extract what is called propositional implications. So just logical formulas of 00:05:52.070 --> 00:05:56.240 propositional logic that are an implication in the form of this 00:05:56.240 --> 00:06:03.020 "awarded(nobelPrize)" implies "nobelID". So what exactly is Formal Concept 00:06:03.020 --> 00:06:08.500 Analysis? Off to Tom. T: Thank you. So what is Formal Concept 00:06:08.500 --> 00:06:14.420 Analysis? It was developed in the 1980s by Rudolf Wille and Bernhard Ganter 00:06:14.420 --> 00:06:18.539 and they were restructuring lattice theory. Lattice theory is an ambiguous 00:06:18.539 --> 00:06:23.370 name in math, it has two meanings: One meaning is you have a grid and have a 00:06:23.370 --> 00:06:29.050 lattice there. The other thing is to speak about orders – order relations. So I like 00:06:29.050 --> 00:06:34.150 steaks, I like pudding and I like steaks more than pudding. And I like rice more 00:06:34.150 --> 00:06:40.960 than steaks. That's an order, right? And lattices are particular orders which can 00:06:40.960 --> 00:06:46.770 be used to represent propositional logic. So easy rules like "when it rains, the 00:06:46.770 --> 00:06:52.990 street gets wet", right?
So and the data representation those guys used back then, 00:06:52.990 --> 00:06:57.080 they called it a formal context, which is basically just a set of objects – they 00:06:57.080 --> 00:07:02.000 call them objects, it's just a name –, a set of attributes and some incidence, 00:07:02.000 --> 00:07:07.890 which basically means which object does have which attributes. So, for example, my 00:07:07.890 --> 00:07:13.150 laptop has the colour black. So this object has some property, right? So that's 00:07:13.150 --> 00:07:17.870 a small example on the right for such a formal context. So the objects there are 00:07:17.870 --> 00:07:24.379 some animals: a platypus – that's the fun animal from Australia, the mammal which is 00:07:24.379 --> 00:07:30.279 also laying eggs and which is also venomous –, a black widow – the spider –, 00:07:30.279 --> 00:07:35.449 the duck and the cat. So we see, the platypus has all the properties; it is 00:07:35.449 --> 00:07:39.729 venomous, lays eggs and is a mammal; we have the duck, which is not a 00:07:39.729 --> 00:07:44.169 mammal, but it lays eggs, and so on and so on. And it's very easy to grasp some 00:07:44.169 --> 00:07:49.430 implicational knowledge here. An easy rule you can find is whenever you encounter a 00:07:49.430 --> 00:07:54.300 mammal that is venomous, it has to lay eggs. So this is a rule that falls out of 00:07:54.300 --> 00:07:59.639 this binary data table. Our main problem then or at this point is we do not have 00:07:59.639 --> 00:08:03.470 such a data table for Wikidata, right? We have the implicit graph, which is way more 00:08:03.470 --> 00:08:09.030 expressive than binary data, and we cannot even store Wikidata as a binary table. 00:08:09.030 --> 00:08:13.859 Even if you tried to, we have no chance to compute such rules from that.
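The animal context just described is a tiny binary table, and checking whether an implication like "venomous mammal ⇒ lays eggs" holds is just a set comparison. A minimal sketch, using the objects and attributes from the example above:

```python
# A formal context: objects, attributes, and an incidence relation,
# here stored as "which attributes does each object have?"
context = {
    "platypus":    {"venomous", "lays eggs", "mammal"},
    "black widow": {"venomous", "lays eggs"},
    "duck":        {"lays eggs"},
    "cat":         {"mammal"},
}

def holds(premise, conclusion, context):
    """An implication premise -> conclusion holds if every object that
    has all premise attributes also has all conclusion attributes."""
    return all(conclusion <= attrs
               for attrs in context.values()
               if premise <= attrs)

# "Every venomous mammal lays eggs" — no counterexample in the table.
print(holds({"venomous", "mammal"}, {"lays eggs"}, context))  # True
# "Everything that lays eggs is venomous" — the duck contradicts it.
print(holds({"lays eggs"}, {"venomous"}, context))  # False
```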
And for 00:08:13.859 --> 00:08:21.460 this, the people from Formal Concept Analysis proposed an algorithm to extract 00:08:21.460 --> 00:08:27.160 implicit knowledge from an expert. So our expert here could be Wikidata. It's an 00:08:27.160 --> 00:08:31.240 expert, you can ask Wikidata questions, right? Using this SPARQL interface, you 00:08:31.240 --> 00:08:34.739 can ask. You can ask "Is there an example for that? Is there a counterexample for 00:08:34.739 --> 00:08:39.880 something else?" So the algorithm is quite easy. There is the algorithm and 00:08:39.880 --> 00:08:45.380 some expert – in our case, Wikidata –, and the algorithm keeps notes for 00:08:45.380 --> 00:08:49.449 counterexamples and keeps notes for valid implications. So in the beginning, we do 00:08:49.449 --> 00:08:53.569 not have any valid implications, so this list on the right is empty, and in the 00:08:53.569 --> 00:08:56.780 beginning we do not have any counterexamples. So the list on the left, 00:08:56.780 --> 00:09:01.900 the formal context to build up is also empty. And all the algorithm does now is, 00:09:01.900 --> 00:09:09.170 it asks "is this implication, X implies Y, is it true?" 00:09:09.170 --> 00:09:14.000 So "is it true," for example, "that an animal that is a mammal and is venomous 00:09:14.000 --> 00:09:18.880 lays eggs?" So now the expert, which in our case is Wikidata, can answer it. We 00:09:18.880 --> 00:09:24.860 can query that. We showed in our paper we can query that. So we query it, and if the 00:09:24.860 --> 00:09:28.491 Wikidata expert does not find any counterexamples, it will say, ok, that's 00:09:28.491 --> 00:09:36.200 maybe a true thing; it's yes. Or if it's not a true implication in Wikidata, 00:09:36.200 --> 00:09:41.779 it can say, no, no, no, it's not true, and here's a counterexample. So this is 00:09:41.779 --> 00:09:48.510 something you contradict by example. You say this rule cannot be true.
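The expert loop just described can be sketched as follows. Here a lookup table plays the role of the Wikidata expert, and the question-generation step is simplified to iterating over a fixed list of candidate implications; the real attribute exploration algorithm derives each next question from the implications accepted so far.

```python
def explore(candidates, expert):
    """Attribute exploration, schematically: ask the expert about each
    candidate implication; collect accepted rules and counterexamples."""
    accepted, counterexamples = [], {}
    for premise, conclusion in candidates:
        answer = expert(premise, conclusion)
        if answer is True:           # expert confirms the rule
            accepted.append((premise, conclusion))
        else:                        # expert names a counterexample
            name, attrs = answer
            counterexamples[name] = attrs
    return accepted, counterexamples

# A toy "Wikidata expert" backed by part of the animal table.
data = {
    "platypus": {"venomous", "lays eggs", "mammal"},
    "duck": {"lays eggs"},
}

def expert(premise, conclusion):
    for name, attrs in data.items():
        if premise <= attrs and not conclusion <= attrs:
            return (name, attrs)     # counterexample found
    return True

rules, counterexamples = explore(
    [({"venomous", "mammal"}, {"lays eggs"}),   # accepted
     ({"lays eggs"}, {"mammal"})],              # refuted by the duck
    expert,
)
```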
For example, 00:09:48.510 --> 00:09:52.900 when the street is wet, that does not mean it has rained, right? It could be the 00:09:52.900 --> 00:10:01.380 cleaning service car or something else. So our idea now was to use Wikidata as an 00:10:01.380 --> 00:10:05.819 expert, but also include a human into this loop. So we do not just want to ask 00:10:05.819 --> 00:10:11.709 Wikidata, we also want to ask a human expert as well. So we first ask in our 00:10:11.709 --> 00:10:18.520 tool the Wikidata expert for some rule. After that, we also inquire the human 00:10:18.520 --> 00:10:22.080 expert. And he can also say "yeah, that's true, I know that," or "No, no. Wikidata 00:10:22.080 --> 00:10:27.200 is not aware of this counterexample, I know one." Or, in the other case "oh, 00:10:27.200 --> 00:10:32.770 Wikidata says this is true. I am aware of a counterexample." Yeah, and so on and so 00:10:32.770 --> 00:10:37.600 on. And you can represent this more or less – this is just some mathematical 00:10:37.600 --> 00:10:41.689 picture, it's not very important. But you can see on the left there's an exploration 00:10:41.689 --> 00:10:46.720 going on, just Wikidata with the algorithm, on the right an exploration, a 00:10:46.720 --> 00:10:51.419 human expert versus Wikidata which can answer all the queries. And we combined 00:10:51.419 --> 00:10:57.720 those two into one small tool, still under development. So, back to Max. 00:10:57.720 --> 00:11:02.980 M: Okay. So far for that to work, we basically need to have a way of viewing 00:11:02.980 --> 00:11:08.070 Wikidata, or at least parts of Wikidata, as a formal context. And this formal 00:11:08.070 --> 00:11:13.610 context, well, this was a binary table, so what do we do? 
We just take all the items 00:11:13.610 --> 00:11:18.880 in Wikidata as objects and all the properties as attributes of our context 00:11:18.880 --> 00:11:24.159 and then have an incidence relation that says "well, this entity has this 00:11:24.159 --> 00:11:30.549 property," so it is incident there, and then we end up with a context that has 71 00:11:30.549 --> 00:11:36.430 million rows and seven thousand columns. So, well, that might actually be a slight 00:11:36.430 --> 00:11:40.180 problem there, because we want to have something that we can run on actual 00:11:40.180 --> 00:11:45.811 hardware and not on a supercomputer. So let's maybe not do that and focus on 00:11:45.811 --> 00:11:50.900 a smaller set of properties that are actually related to one another through 00:11:50.900 --> 00:11:55.689 some kind of common domain, yeah? So it doesn't make any sense to have a property 00:11:55.689 --> 00:11:59.640 that relates to spacecraft and then a property that relates to books – that's 00:11:59.640 --> 00:12:05.050 probably not a good idea to try to find implicit knowledge between those two. But 00:12:05.050 --> 00:12:10.259 two different properties about spacecraft, that sounds good, right? And then the 00:12:10.259 --> 00:12:15.000 interesting question is just how do we define the incidence for our set of 00:12:15.000 --> 00:12:20.150 properties? And that actually depends very much on which properties we choose, 00:12:20.150 --> 00:12:25.550 because for some properties, it makes sense to account for the direction 00:12:25.550 --> 00:12:32.679 of the statement: So there is a property called parent? Actually, no, it's child, 00:12:32.679 --> 00:12:38.309 and then there's father and mother, and you don't want to turn those around, as 00:12:38.309 --> 00:12:43.760 you want "A is a child of B" to be something different than "B 00:12:43.760 --> 00:12:48.930 is a child of A."
Then there's the qualifiers that might be important for 00:12:48.930 --> 00:12:54.740 some properties. So receiving an award for something might be something different 00:12:54.740 --> 00:13:00.740 than receiving an award for something else. But while receiving an award in 2018 00:13:00.740 --> 00:13:06.549 and receiving one in 2017, that's probably more or less the same thing, so we don't 00:13:06.549 --> 00:13:11.930 necessarily need to differentiate that. And there's also a thing called subclasses 00:13:11.930 --> 00:13:15.470 and they form a hierarchy on Wikidata. And you might also want to take that into 00:13:15.470 --> 00:13:20.150 account because while winning something that is a Nobel prize, that means also 00:13:20.150 --> 00:13:25.190 winning an award itself, and winning the Nobel Peace prize means winning a peace 00:13:25.190 --> 00:13:32.586 prize. So there's also implications going on there that you want to respect. So, 00:13:32.586 --> 00:13:38.400 to see how we actually do that, let's look at an example. So we have here, well, this 00:13:38.400 --> 00:13:47.030 is Donna Strickland. And – I forgot his first name – Ashkin, this is one of the 00:13:47.030 --> 00:13:51.720 people that won the Nobel prize in physics with her last year. And also Gérard 00:13:51.720 --> 00:13:57.990 Mourou. That is the third one. They all got the Nobel prize in physics last year. 00:13:57.990 --> 00:14:04.190 So we have all these statements here, and these two have a qualifier that says 00:14:04.190 --> 00:14:10.260 "with: Gérard Mourou" here. And I don't think the qualifier is on this statement 00:14:10.260 --> 00:14:15.160 here, actually, but it doesn't actually matter. So what we've done here is, 00:14:15.160 --> 00:14:21.190 put all the entities in the small graph as rows in the table. So we have Strickland 00:14:21.190 --> 00:14:27.850 and Mourou and Ashkin, and also Arnold and Curie that are not in the picture.
But you 00:14:27.850 --> 00:14:33.290 can maybe remember that. And then here we have awarded, and we scaled that by the 00:14:33.290 --> 00:14:37.250 instance of the different Nobel prizes that people have won. So that's the 00:14:37.250 --> 00:14:42.209 physics Nobel in the first column, the chemistry Nobel Prize in the second column 00:14:42.209 --> 00:14:48.380 and just general Nobel prizes in the third column. There's awarded and that is scaled 00:14:48.380 --> 00:14:55.240 by the "with" qualifier, so awarded with Gérard Mourou. And then there's field of 00:14:55.240 --> 00:15:00.450 work, and we have lasers here and radioactivity, so we scale by the actual 00:15:00.450 --> 00:15:06.580 field of work that people have. And well then, if we look at what kind of incidence 00:15:06.580 --> 00:15:11.370 we get for Donna Strickland, she has a Nobel prize in physics and that is also a 00:15:11.370 --> 00:15:17.190 Nobel prize, and she has that together with Mourou. And she has "field of work: 00:15:17.190 --> 00:15:23.220 lasers," but not radioactivity. Then, Mourou himself: he has a Nobel prize in 00:15:23.220 --> 00:15:29.450 physics, and that is a Nobel prize, but none of the others. Ashkin gets the Nobel 00:15:29.450 --> 00:15:33.890 prize in physics, and that is still a Nobel prize, and he gets that with Gérard 00:15:33.890 --> 00:15:40.970 Mourou. And also he works on lasers, but not in radioactivity. So Frances Arnold 00:15:40.970 --> 00:15:47.230 has a Nobel prize in chemistry, and that is a Nobel prize. And Marie Curie, she has 00:15:47.230 --> 00:15:50.510 a Nobel prize in physics and one in chemistry, and they are both a Nobel 00:15:50.510 --> 00:15:55.319 prize. And she also works on radioactivity. But lasers didn't exist 00:15:55.319 --> 00:16:02.490 back then, so she doesn't get "field of work: lasers." And then basically this 00:16:02.490 --> 00:16:10.289 table here is a representation of our formal context. 
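The scaling Max walks through can be reproduced mechanically: for each statement, emit one context attribute per scaling rule (one per Nobel prize instance, one generic "Nobel Prize" attribute from the subclass hierarchy, one per "with" qualifier, one per field of work). The statement tuples below are a simplified stand-in for Wikidata's actual data model, and the subclass lookup is hard-coded; both are assumptions for illustration.

```python
# Each statement: (subject, property, value, qualifiers dict).
statements = [
    ("Strickland", "awarded", "Nobel Prize in Physics", {"with": "Mourou"}),
    ("Strickland", "field of work", "lasers", {}),
    ("Mourou", "awarded", "Nobel Prize in Physics", {}),
    ("Ashkin", "awarded", "Nobel Prize in Physics", {"with": "Mourou"}),
    ("Ashkin", "field of work", "lasers", {}),
    ("Arnold", "awarded", "Nobel Prize in Chemistry", {}),
    ("Curie", "awarded", "Nobel Prize in Physics", {}),
    ("Curie", "awarded", "Nobel Prize in Chemistry", {}),
    ("Curie", "field of work", "radioactivity", {}),
]

def scale(statements):
    """Turn statements into a formal context by scaling each property."""
    context = {}
    for subj, prop, value, quals in statements:
        attrs = context.setdefault(subj, set())
        if prop == "awarded":
            attrs.add(f"awarded: {value}")
            attrs.add("awarded: Nobel Prize")        # subclass scaling
            if "with" in quals:                      # qualifier scaling
                attrs.add(f"awarded with: {quals['with']}")
        elif prop == "field of work":
            attrs.add(f"field of work: {value}")
    return context

ctx = scale(statements)
# Curie's row holds both specific prizes, the generic "Nobel Prize"
# attribute, and her field of work — exactly the table from the talk.
```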
So and then we've actually 00:16:10.289 --> 00:16:14.840 gone ahead and started building a tool where you can interactively do all these 00:16:14.840 --> 00:16:20.320 things, and it will take care of building the context for you. You just put in the 00:16:20.320 --> 00:16:24.540 properties, and Tom will show you how that works. 00:16:24.540 --> 00:16:29.030 T: So here you see some first screenshots of this tool. So please do not comment on 00:16:29.030 --> 00:16:32.520 the graphic design. We have no idea about that, we have to ask someone about that. 00:16:32.520 --> 00:16:36.120 We're just into logics, more or less. On the left, you see the initial state of the 00:16:36.120 --> 00:16:41.120 game. On the left you have five boxes: they're called countries and borders, 00:16:41.120 --> 00:16:47.370 credit cards, use of energy, memory and computation – I think –, and space 00:16:47.370 --> 00:16:53.180 launches, which are just presets we defined. You can explore, for example, in 00:16:53.180 --> 00:16:57.050 the case of the credit card, you can explore the properties from Wikidata which 00:16:57.050 --> 00:17:02.170 are called "card network," "operator," and "fee," so you can just choose one of them, 00:17:02.170 --> 00:17:05.530 or on the right, "custom properties," you can just input the properties you're 00:17:05.530 --> 00:17:10.640 interested in Wikidata, whatever one of the seven thousand you like, or some 00:17:10.640 --> 00:17:15.140 number of them. On the right, I chose then the credit card thingy and I now want to 00:17:15.140 --> 00:17:21.860 show you what happens if you now explore these properties, right? The first step in 00:17:21.860 --> 00:17:25.750 the game is that the game will ask – I mean, the game, the exploration process – 00:17:25.750 --> 00:17:31.020 will ask, is it true that every entity in Wikidata will have these three properties? 
00:17:31.020 --> 00:17:36.360 So are they common among all entities in your data, which is most probably not 00:17:36.360 --> 00:17:41.540 true, right? I mean, not everything in Wikidata has a fee, at least I hope. So, 00:17:41.540 --> 00:17:46.520 what I will do now, I would click the "reject this implication" button, since 00:17:46.520 --> 00:17:51.480 the implication "Nothing implies everything" is not true. In the second 00:17:51.480 --> 00:17:56.360 step now, the algorithm tries to find the minimal number of questions to obtain the 00:17:56.360 --> 00:18:01.820 domain knowledge, so to obtain all valid rules in this domain. So the next question is 00:18:01.820 --> 00:18:06.120 "is it true that everything in Wikidata that has a 'card network' property also 00:18:06.120 --> 00:18:12.560 has a 'fee' and an 'operator' property?" And down here you can see Wikidata says 00:18:12.560 --> 00:18:18.110 "ok, there are 26 items which are counterexamples," so there's 26 items in 00:18:18.110 --> 00:18:22.670 Wikidata which have the "card network" property but do not have the other two 00:18:22.670 --> 00:18:28.200 ones. So, 26 is not a big number, this could mean "ok, that's an error, so 26 00:18:28.200 --> 00:18:32.860 statements are missing." Or maybe that's really the true case. 00:18:32.860 --> 00:18:36.890 That's also ok. But you can now choose what you think is right. You can say, "oh, 00:18:36.890 --> 00:18:40.470 I would say it should be true" or you can say "no, I think that's ok, one of these 00:18:40.470 --> 00:18:46.380 counterexamples seems valid. Let's reject it." I, in this case, rejected it. The next 00:18:46.380 --> 00:18:51.020 question it asks: "is it true that everything that has an operator has also a 00:18:51.020 --> 00:18:56.290 fee and a card network?" Yeah, this is possibly not true.
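The "minimal number of questions" behaviour comes from the exploration algorithm only asking about attribute sets that are not already settled by the implications accepted so far; the computation underneath is a simple closure fixpoint. A sketch, reusing the credit-card attribute names (the accepted rule here is hypothetical, chosen just to show the mechanism):

```python
def closure(attrs, implications):
    """Smallest superset of attrs that satisfies every implication:
    keep firing rules whose premise is contained until nothing changes."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed

# Suppose "card network -> fee, operator" had already been accepted.
accepted = [({"card network"}, {"fee", "operator"})]

# Then {"card network", "fee"} already closes to all three attributes,
# so the algorithm never needs to ask a separate question about it.
print(closure({"card network", "fee"}, accepted))
```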
There's also more than 00:18:56.290 --> 00:19:03.110 1000 counterexamples, one being, I think, a telecommunication operator in Hungary or 00:19:03.110 --> 00:19:10.340 something. And so we can reject this as well. Next question: "everything that has 00:19:10.340 --> 00:19:15.360 an operator and a card network – so card network means Visa, MasterCard, whatever, 00:19:15.360 --> 00:19:21.690 all this stuff – is it true that they have to have a fee?" Wikidata says "no," it has 00:19:21.690 --> 00:19:27.570 23 items that contradict it. But one of the items, for example, is the American 00:19:27.570 --> 00:19:32.090 Express Gold Card. I suppose the American Express Gold Card has some fee. So this 00:19:32.090 --> 00:19:36.140 indicates, "oh, there is some missing data in Wikidata," there is something that 00:19:36.140 --> 00:19:40.680 Wikidata does not know but should know to reason correctly in Wikidata with your 00:19:40.680 --> 00:19:46.520 SPARQL queries. So we can now say, "yeah, that's, uh, that's not a reject, that's an 00:19:46.520 --> 00:19:51.470 accept," because we think it should be true. But Wikidata thinks otherwise. And 00:19:51.470 --> 00:19:55.800 you go on, we go on. This is then the last question: "Is it true that everything that 00:19:55.800 --> 00:20:00.950 has a fee and a card network should have an operator," and you see, "oh, no 00:20:00.950 --> 00:20:05.930 counterexamples." This means Wikidata says "this is true," because it says there is no 00:20:05.930 --> 00:20:09.580 counterexample. If you're asking Wikidata, it says this is a valid implication in the 00:20:09.580 --> 00:20:15.400 data set so far, which could also be indicating that something is missing, I'm 00:20:15.400 --> 00:20:20.310 not aware if this is possible or not, but ok, for me it sounds reasonable. Everything 00:20:20.310 --> 00:20:23.800 that has a fee and a card network should also have an operator, which means a bank or 00:20:23.800 --> 00:20:29.220 something like that.
So I accept this implication. And then, yeah, you have won 00:20:29.220 --> 00:20:34.410 the exploration game, which essentially means you've won some knowledge. Thank 00:20:34.410 --> 00:20:40.300 you. And the knowledge is that you know which implications in Wikidata are true or 00:20:40.300 --> 00:20:44.340 should be true from your point of view. And yeah, this is more or less the state 00:20:44.340 --> 00:20:50.700 of the game so far as we programmed it in October. And the next step will be to 00:20:50.700 --> 00:20:54.970 show you some – "How much does your opinion of the world differ from the 00:20:54.970 --> 00:20:59.950 opinion that is now reflected in the data?" So is what you think about the data 00:20:59.950 --> 00:21:05.430 close to what is true in Wikidata? Or maybe Wikidata has wrong 00:21:05.430 --> 00:21:10.680 information. You can find it with that. But Max will tell you more about that. 00:21:10.680 --> 00:21:18.220 M: Ok. So let me just quickly come back to what we have actually done. So we 00:21:18.220 --> 00:21:23.670 offer a procedure that allows you to explore properties in Wikidata and the 00:21:23.670 --> 00:21:30.720 implicational knowledge that holds between these properties. And the key idea here is 00:21:30.720 --> 00:21:34.661 that when you look at these implications that you get, there might be some 00:21:34.661 --> 00:21:39.280 that you don't actually want because they shouldn't be true, and there might also be 00:21:39.280 --> 00:21:46.220 ones that you don't get, but you expect to get because they should hold. And these 00:21:46.220 --> 00:21:51.840 unwanted and/or missing implications, they point to missing statements and items in 00:21:51.840 --> 00:21:56.130 Wikidata.
So they show you where the opportunities to improve the knowledge in 00:21:56.130 --> 00:22:00.100 Wikidata are, and, well, sometimes you also get to learn something about the 00:22:00.100 --> 00:22:04.080 world, and in most cases, it's that the world is more complicated than you thought 00:22:04.080 --> 00:22:10.260 it was – and that's just how life is. But in general, implications can guide you in 00:22:10.260 --> 00:22:17.220 your way of improving Wikidata and the state of knowledge therein. So what's 00:22:17.220 --> 00:22:22.380 next? Well, so what we currently don't offer in the exploration game and what we 00:22:22.380 --> 00:22:27.710 will definitely focus on next is having configurable counterexamples and also 00:22:27.710 --> 00:22:32.030 filterable counterexamples – right now you just get a list of a random number of 00:22:32.030 --> 00:22:36.880 counterexamples. And you might want to search through this list for something you 00:22:36.880 --> 00:22:42.520 recognise and you might also want to explicitly say, well, this one should be a 00:22:42.520 --> 00:22:48.600 counterexample, and that's definitely coming next. Then, well, domain specific 00:22:48.600 --> 00:22:53.750 scaling of properties, there's still much work to be done. Currently, we only have 00:22:53.750 --> 00:23:00.500 some very basic support for that. So you can have properties, but you can't do the 00:23:00.500 --> 00:23:03.780 fancy things where you say, "well, everything that is an award should be 00:23:03.780 --> 00:23:10.840 considered as one instance of this property." That's also coming and then 00:23:10.840 --> 00:23:15.550 what Tom mentioned already: compare your knowledge that you have explored through 00:23:15.550 --> 00:23:21.610 this process against the knowledge that is currently on Wikidata as a form of seeing 00:23:21.610 --> 00:23:26.540 "where do you stand? What is missing in Wikidata? How can you improve Wikidata?"
00:23:26.540 --> 00:23:32.600 And well, if you have any more suggestions for features, then just tell us. There's a 00:23:32.600 --> 00:23:39.530 Github link on the implication game page. And here's the link to the tool again. So, 00:23:39.530 --> 00:23:46.140 yeah, just let us know. Open an issue and have fun. And if you have any questions, 00:23:46.140 --> 00:23:50.230 then I guess now would be the time to ask. T: Thank you. 00:23:50.230 --> 00:23:52.730 Herald: Thank you very much, Tom and Max. 00:23:52.730 --> 00:23:55.020 applause 00:23:55.020 --> 00:24:01.510 Herald: So we will switch microphones now because then I can hand this microphone to 00:24:01.510 --> 00:24:07.250 you if any of you have a question for our two speakers. Are there any questions or 00:24:07.250 --> 00:24:14.370 suggestions? Yes. Question: Hi. Thanks for the nice talk. I 00:24:14.370 --> 00:24:18.720 wanted to ask what's the first question, what's the most interesting implication 00:24:18.720 --> 00:24:25.020 that you've found? M: Yeah. That would have made for a 00:24:25.020 --> 00:24:31.850 good back up slide. The most interesting implication so far – 00:24:31.850 --> 00:24:36.010 T: The most basic thing you would expect everything that is launched in space by 00:24:36.010 --> 00:24:41.920 humans – no, everything that landed from space, that has a landing date, also has a 00:24:41.920 --> 00:24:46.450 start date. So nothing landed on earth, which was not started here. 00:24:46.450 --> 00:24:55.200 M: Yes. Q: Right now, the game only helps you find 00:24:55.200 --> 00:25:00.710 out implications. Are you also planning to have that I can also add data like for 00:25:00.710 --> 00:25:04.309 example, let's say I have twenty five Nobel laureates who don't have a Nobel 00:25:04.309 --> 00:25:08.220 laureate ID. 
Are there plans where you could give me a simple interface for me to 00:25:08.220 --> 00:25:12.760 Google and add that ID, because it would make the process of adding new entities to 00:25:12.760 --> 00:25:17.400 Wikidata itself simpler. M: Yes. And that's partly hidden 00:25:17.400 --> 00:25:23.050 behind this "configurable and filterable counterexamples" thing. We will probably 00:25:23.050 --> 00:25:28.380 not have an explicit interface for adding stuff, but most likely interface with some 00:25:28.380 --> 00:25:32.270 other tool built around Wikidata, so probably something that will give you 00:25:32.270 --> 00:25:37.100 QuickStatements or something like that. But yes, adding data is definitely on the 00:25:37.100 --> 00:25:41.710 roadmap. Herald: Any more questions? Yes. 00:25:41.710 --> 00:25:48.860 Q: Wouldn't it be nice to do this in other languages, too? 00:25:48.860 --> 00:25:52.600 T: Actually it's language independent, so we use Wikidata and then as far as we 00:25:52.600 --> 00:25:58.110 know, Wikidata has no language itself. You know, it has just items and properties, so 00:25:58.110 --> 00:26:02.640 Qs and Ps, and whatever language you use, it should be translated into the language of 00:26:02.640 --> 00:26:06.180 the properties, if there is a label for that property or for that item that you 00:26:06.180 --> 00:26:12.420 have. So if Wikidata is aware of your language, we are. 00:26:12.420 --> 00:26:15.020 Herald: Oh, yes. More! M: Of course, the tool still needs to be 00:26:15.020 --> 00:26:18.360 translated, but – T: The tool itself, it should be. 00:26:18.360 --> 00:26:21.850 Q: Hi, thanks for the talk. I have a question. Right now you can only find 00:26:21.850 --> 00:26:25.990 missing data with this, right? Or surplus data. Would you think you'd be able to 00:26:25.990 --> 00:26:31.560 find wrong information with a similar approach? 00:26:31.560 --> 00:26:37.001 T: Actually, we do.
I mean, if Wikidata has a counterexample to something we would 00:26:37.001 --> 00:26:42.830 expect to be true, this could point to wrong data, right? If the counterexample 00:26:42.830 --> 00:26:47.450 is a wrong counterexample – if there is a property missing, or a property missing on an 00:26:47.450 --> 00:26:58.160 item. Q: Ok, I get to ask a second question. So 00:26:58.160 --> 00:27:06.000 the horizontal axis in the incidence matrix. You said it has 7000, it spans 00:27:06.000 --> 00:27:10.300 7000 columns, right? M: Yes, because there are 7000 properties in 00:27:10.300 --> 00:27:13.850 Wikidata. Q: But it's actually way more columns, 00:27:13.850 --> 00:27:17.849 right? Because you multiply the properties by the arguments, right? 00:27:17.849 --> 00:27:21.360 M: Yes. So if you do any scaling, then of course that might give you multiple 00:27:21.360 --> 00:27:23.380 entries. Q: So that's what you mean by scaling, 00:27:23.380 --> 00:27:27.770 basically? M: Yes. But already seven thousand is way 00:27:27.770 --> 00:27:35.580 too big to actually compute that. Q: How many would it be if you multiplied 00:27:35.580 --> 00:27:48.060 all the arguments? M: I have no idea, probably a few million. 00:27:48.060 --> 00:27:55.309 Q: Have you thought about a recursive method, as counterexamples may be refuted by 00:27:55.309 --> 00:28:00.350 other counterexamples, like in an argumentation graph or something like 00:28:00.350 --> 00:28:06.708 this? T: Actually, I don't get it. How can a 00:28:06.708 --> 00:28:14.040 counterexample be wrong through another counterexample? 00:28:14.040 --> 00:28:24.450 Q: Maybe some example says that cats can have golden hair, and then another example 00:28:24.450 --> 00:28:31.260 might say that this is not a cat. T: Ah, so the property to be a cat or 00:28:31.260 --> 00:28:38.000 something cat-ish is missing then. Okay. No, so far we have not considered deeper
This horn-propositional logic, you know, it has no contradictions, 00:28:44.570 --> 00:28:47.740 because all you can do is you can contradict by counterexamples, but there 00:28:47.740 --> 00:28:52.740 can never be a rule that is not true, so far. Just in your or my opinion, maybe, 00:28:52.740 --> 00:28:56.370 but not in the logic. So what we have to think about is that we have bigger 00:28:56.370 --> 00:29:01.780 reasoning, right? So. Q: Sorry, quick question. Because you're 00:29:01.780 --> 00:29:04.929 not considering all the 7000 odd properties for each of the entities, 00:29:04.929 --> 00:29:07.570 right? What's your current process of filtering? What are the relevant 00:29:07.570 --> 00:29:14.820 properties? I'm sorry, I didn't get that. M: Well, we basically handpick those. So 00:29:14.820 --> 00:29:19.940 you have this input field? Yeah, we can go ahead and select our properties. We also 00:29:19.940 --> 00:29:26.870 have some predefined sets. Okay. And there's also some classes for groups of 00:29:26.870 --> 00:29:30.780 properties that are related that you could use if you want bigger sets, 00:29:30.780 --> 00:29:35.960 T: for example, space or family or what was the other? 00:29:35.960 --> 00:29:43.410 M: Awards is one. T: It depends on the size of the class. 00:29:43.410 --> 00:29:47.390 For example, for space, it's not that much, I think it's 10 or 15 properties. It 00:29:47.390 --> 00:29:51.520 will take you some hours, but you can do because they are 15 or something like 00:29:51.520 --> 00:29:58.150 that. I think for family, it's way too much, it's like 40 of 50 properties. So a 00:29:58.150 --> 00:30:04.540 lot of questions. Herald: I don't see any more hands. Maybe 00:30:04.540 --> 00:30:09.760 someone who has not asked the question yet has another one we could take that, 00:30:09.760 --> 00:30:14.270 otherwise we would be perfectly on time. 
And maybe you can tell us where you will 00:30:14.270 --> 00:30:18.860 be for deeper discussions, where people can find you. 00:30:18.860 --> 00:30:22.400 T: Probably at the couches. Herald: The couches, behind our stage. 00:30:22.400 --> 00:30:26.720 M: Or just running around somewhere. There are also our DECT numbers on the 00:30:26.720 --> 00:30:35.960 slides; it's 6284 for Tom and 6279 for me. So just call and ask where we're hanging 00:30:35.960 --> 00:30:38.470 around. H: Well then, thank you again. Have a 00:30:38.470 --> 00:30:40.210 round of applause. applause 00:30:40.210 --> 00:30:42.650 T: Thank you. M: Well, thanks for having us. 00:30:42.650 --> 00:30:45.310 applause 00:30:45.310 --> 00:30:49.740 postroll music 00:30:49.740 --> 00:31:12.000 subtitles created by c3subtitles.de in the year 2020. Join, and help us!