WEBVTT 00:00:05.945 --> 00:00:09.476 Hello everyone to the Data Quality panel. 00:00:10.288 --> 00:00:13.671 Data quality matters because more and more people out there 00:00:13.672 --> 00:00:19.289 rely on our data being in good shape, so we're going to talk about data quality, 00:00:20.029 --> 00:00:26.000 and there will be four speakers who will give short introductions 00:00:26.000 --> 00:00:29.539 on topics related to data quality and then we will have a Q and A. 00:00:30.130 --> 00:00:32.234 And the first one is Lucas. 00:00:34.385 --> 00:00:35.385 Thank you. 00:00:35.901 --> 00:00:39.899 Hi, I'm Lucas, and I'm going to start with an overview 00:00:39.899 --> 00:00:43.806 of data quality tools that we already have on Wikidata 00:00:43.807 --> 00:00:46.109 and also some things that are coming up soon. 00:00:46.932 --> 00:00:50.623 And I've grouped them into some general themes 00:00:50.623 --> 00:00:53.761 of making errors more visible, making problems actionable, 00:00:53.762 --> 00:00:56.322 getting more eyes on the data so that people notice the problems, 00:00:56.945 --> 00:01:02.616 fix some common sources of errors, maintain the quality of the existing data 00:01:02.616 --> 00:01:03.966 and also human curation. 00:01:05.063 --> 00:01:09.874 And the ones that are currently available start with property constraints. 00:01:10.388 --> 00:01:12.421 So you've probably seen this if you're on Wikidata. 00:01:12.422 --> 00:01:14.029 You can sometimes get these icons 00:01:14.530 --> 00:01:17.241 which check the internal consistency of the data. 00:01:17.242 --> 00:01:20.800 For example, if one event follows the other, 00:01:20.801 --> 00:01:23.760 then the other event should also be followed by this one, 00:01:23.761 --> 00:01:27.161 which on the WikidataCon item was apparently missing. 00:01:27.162 --> 00:01:29.360 I'm not sure, this feature is a few days old. 
00:01:30.040 --> 00:01:34.681 And there's also, if this is too limited or simple for you, 00:01:34.682 --> 00:01:38.080 you can write any checks you want using the Query Service 00:01:38.081 --> 00:01:39.842 which is useful for lots of things of course, 00:01:39.843 --> 00:01:44.543 but you can also use it for finding errors. 00:01:44.544 --> 00:01:46.974 Like if you've noticed one occurrence of a mistake, 00:01:46.975 --> 00:01:49.709 then you can check if there are other places 00:01:49.710 --> 00:01:51.958 where people have made a very similar error 00:01:51.958 --> 00:01:53.438 and find that with the Query Service. 00:01:53.439 --> 00:01:54.559 You can also combine the two 00:01:54.560 --> 00:01:57.874 and search for constraint violations in the Query Service, 00:01:57.875 --> 00:02:01.240 for example, only the violations in some area 00:02:01.241 --> 00:02:03.762 or WikiProject that's relevant to you, 00:02:03.762 --> 00:02:06.828 although the results are currently not complete, sadly. 00:02:08.422 --> 00:02:09.877 There is revision scoring. 00:02:10.690 --> 00:02:12.666 That's... I think this is from the recent changes 00:02:12.667 --> 00:02:16.217 you can also get it on your watch list an automatic assessment 00:02:16.217 --> 00:02:20.249 of is this edit likely to be in good faith or in bad faith 00:02:20.250 --> 00:02:22.312 and is it likely to be damaging or not damaging, 00:02:22.313 --> 00:02:24.205 I think those are the two dimensions. 00:02:24.206 --> 00:02:25.686 So you can, if you want, 00:02:25.687 --> 00:02:29.898 focus on just looking through the damaging but good faith edits. 00:02:29.899 --> 00:02:32.523 If you're feeling particularly friendly and welcoming 00:02:32.524 --> 00:02:37.121 you can tell these editors, "Thank you for your contribution, 00:02:37.122 --> 00:02:40.560 here's how you should have done it but thank you, still." 
00:02:40.561 --> 00:02:42.186 And if you're not feeling that way, 00:02:42.187 --> 00:02:44.452 you can go through the bad faith, damaging edits, 00:02:44.453 --> 00:02:45.573 and revert the vandals. 00:02:47.544 --> 00:02:49.761 There's also, similar to that, entity scoring. 00:02:49.762 --> 00:02:52.590 So instead of scoring an edit, the change that it made, 00:02:52.591 --> 00:02:53.904 you score the whole revision, 00:02:53.904 --> 00:02:56.483 and I think that is the same quality measure 00:02:56.483 --> 00:02:59.863 that Lydia mentioned at the beginning of the conference. 00:03:00.372 --> 00:03:04.569 There's a user script up here that gives you a score of like one to five, 00:03:04.570 --> 00:03:08.176 I think it was, of what the quality of the current item is. 00:03:10.043 --> 00:03:15.528 The primary sources tool is for any database that you want to import, 00:03:15.528 --> 00:03:18.364 but that's not high enough quality to directly add to Wikidata, 00:03:18.374 --> 00:03:20.335 so you add it to the primary sources tool instead, 00:03:20.336 --> 00:03:22.956 and then humans can decide 00:03:22.956 --> 00:03:26.024 should they add these individual statements or not. 00:03:28.595 --> 00:03:31.901 Showing coordinates as maps is mainly a convenience feature 00:03:31.901 --> 00:03:33.588 but it's also useful for quality control. 00:03:33.588 --> 00:03:36.937 Like if you see this is supposed to be the office of Wikimedia Germany 00:03:36.938 --> 00:03:39.400 and if the coordinates are somewhere in the Indian Ocean, 00:03:39.401 --> 00:03:41.529 then you know that something is not right there 00:03:41.530 --> 00:03:44.790 and you can see it much more easily than if you just had the numbers. 
00:03:46.382 --> 00:03:49.576 This is a gadget called the relative completeness indicator 00:03:49.577 --> 00:03:52.480 which shows you this little icon here 00:03:53.007 --> 00:03:55.652 telling you how complete it thinks this item is 00:03:55.652 --> 00:03:57.613 and also which properties are most likely missing, 00:03:57.614 --> 00:03:59.769 which is really useful if you're editing an item 00:03:59.769 --> 00:04:03.172 and you're in an area that you're not very familiar with 00:04:03.172 --> 00:04:05.661 and you don't know what the right properties to use are, 00:04:05.662 --> 00:04:08.230 then this is a very useful gadget to have. 00:04:09.604 --> 00:04:11.401 And we have Shape Expressions. 00:04:11.402 --> 00:04:15.624 I think Andra or Jose are going to talk more about those 00:04:15.624 --> 00:04:19.757 but basically, it's a very powerful way of comparing the data you have 00:04:19.758 --> 00:04:20.758 against a schema, 00:04:20.759 --> 00:04:22.680 like what statements should certain entities have, 00:04:22.681 --> 00:04:25.677 what other entities should they link to and what should those look like, 00:04:26.229 --> 00:04:29.374 and then you can find problems that way. 00:04:30.366 --> 00:04:32.361 I think... No, there is still more. 00:04:32.362 --> 00:04:34.321 Integraality or property dashboard. 00:04:34.322 --> 00:04:36.773 It gives you a quick overview of the data you already have. 00:04:36.774 --> 00:04:39.147 For example, this is from the WikiProject Red Pandas, 00:04:39.657 --> 00:04:41.681 and you can see that we have a sex or gender 00:04:41.682 --> 00:04:43.561 for almost all of the red pandas, 00:04:43.561 --> 00:04:46.854 the date of birth varies a lot by which zoo they come from 00:04:46.854 --> 00:04:50.255 and we have almost no dead pandas which is wonderful, 00:04:51.437 --> 00:04:52.600 because they're so cute. 00:04:53.699 --> 00:04:55.654 So this is also useful. 
00:04:56.377 --> 00:04:59.185 There we go, OK, now for the things that are coming up. 00:04:59.889 --> 00:05:03.784 Wikidata Bridge, formerly known as client editing, 00:05:03.785 --> 00:05:07.076 so editing Wikidata from Wikipedia infoboxes 00:05:07.675 --> 00:05:11.725 which will on the one hand get more eyes on the data 00:05:11.725 --> 00:05:13.441 because more people can see the data there 00:05:13.441 --> 00:05:18.841 and it will hopefully encourage more use of Wikidata in the Wikipedias 00:05:18.841 --> 00:05:20.920 and that means that more people can notice 00:05:20.921 --> 00:05:23.389 if, for example, some data is outdated and needs to be updated 00:05:23.857 --> 00:05:27.000 than if they would only see it on Wikidata itself. 00:05:28.630 --> 00:05:30.656 There is also tainted references. 00:05:30.657 --> 00:05:33.959 The idea here is that if you edit a statement value, 00:05:34.683 --> 00:05:37.279 you might want to update the references as well, 00:05:37.280 --> 00:05:39.373 unless it was just a typo or something. 00:05:39.897 --> 00:05:43.662 And this tainted references feature tells editors that, 00:05:43.663 --> 00:05:49.756 and it also lets other editors see which other edits were made 00:05:49.756 --> 00:05:52.471 that changed a statement value and didn't update a reference, 00:05:52.472 --> 00:05:56.766 so you can clean up after that and decide should that be... 00:05:57.737 --> 00:05:59.566 Do you need to do anything more about that, 00:05:59.566 --> 00:06:02.796 or is that actually fine and you don't need to update the reference. 00:06:03.543 --> 00:06:09.336 That's related to signed statements which is coming from a concern, I think, 00:06:09.336 --> 00:06:12.355 that some data providers have that like... 
00:06:14.131 --> 00:06:17.231 There's a statement that's referenced through UNESCO or something 00:06:17.232 --> 00:06:19.872 and then suddenly, someone vandalizes the statement 00:06:19.873 --> 00:06:21.836 and they are worried that it will look like 00:06:22.827 --> 00:06:26.992 this organization, like UNESCO, still said this vandalism value 00:06:26.993 --> 00:06:28.706 and so, with signed statements, 00:06:28.706 --> 00:06:31.488 they can cryptographically sign this reference 00:06:31.488 --> 00:06:33.562 and that doesn't prevent any edits to it, 00:06:34.169 --> 00:06:37.744 but at least, if someone vandalizes the statement 00:06:37.744 --> 00:06:40.255 or edits it in any way, then the signature is no longer valid, 00:06:40.255 --> 00:06:43.401 and you can tell this is not exactly what the organization said, 00:06:43.402 --> 00:06:47.064 and perhaps it's a good edit and they should re-sign the new statement, 00:06:47.065 --> 00:06:49.851 but also perhaps it should be reverted. 00:06:51.203 --> 00:06:54.166 And also, this is going to be very exciting, I think, 00:06:54.166 --> 00:06:56.846 Citoid is this amazing system they have on Wikipedia 00:06:57.379 --> 00:07:01.340 where you can paste a URL, or an identifier, or an ISBN 00:07:01.340 --> 00:07:04.759 or Wikidata ID or basically anything into the Visual Editor, 00:07:05.260 --> 00:07:08.241 and it spits out a reference that is nicely formatted 00:07:08.242 --> 00:07:11.049 and has all the data you want and it's wonderful to use. 00:07:11.049 --> 00:07:14.337 And by comparison, on Wikidata, if I want to add a reference 00:07:14.338 --> 00:07:18.801 I typically have to add a reference URL, title, author name string, 00:07:18.802 --> 00:07:20.449 published in, publication date, 00:07:20.450 --> 00:07:25.141 retrieved date, at least those, and that's annoying, 00:07:25.141 --> 00:07:29.261 and integrating Citoid into Wikibase will hopefully help with that. 
00:07:30.245 --> 00:07:33.604 And I think that's all the ones I had, yeah. 00:07:33.604 --> 00:07:36.400 So now, I'm going to pass to Cristina. 00:07:37.788 --> 00:07:42.339 (applause) 00:07:43.780 --> 00:07:45.471 Hi, I'm Cristina. 00:07:45.472 --> 00:07:47.672 I'm a research scientist from the University of Zürich, 00:07:47.673 --> 00:07:51.417 and I'm also an active member of the Swiss Community. 00:07:52.698 --> 00:07:57.901 When Claudia Müller-Birn and I submitted this to the WikidataCon, 00:07:57.902 --> 00:08:00.410 what we wanted to do is continue our discussion 00:08:00.411 --> 00:08:02.424 that we started in the beginning of the year 00:08:02.424 --> 00:08:07.442 with a workshop on data quality and also some sessions in Wikimania. 00:08:07.442 --> 00:08:10.535 So the goal of this talk is basically to bring some thoughts 00:08:10.536 --> 00:08:14.432 that we have been collecting from the community and ourselves 00:08:14.432 --> 00:08:16.560 and continue discussion. 00:08:16.561 --> 00:08:20.065 So what we would like is to continue interacting a lot with you. 00:08:21.557 --> 00:08:23.371 So what we think is very important 00:08:23.372 --> 00:08:27.580 is that we continuously ask all types of users in the community 00:08:27.581 --> 00:08:32.240 about what they really need, what problems they have with data quality, 00:08:32.240 --> 00:08:35.000 not only editors but also the people who are coding, 00:08:35.000 --> 00:08:36.241 or consuming the data, 00:08:36.242 --> 00:08:39.494 and also researchers who are actually using all the edit history 00:08:39.494 --> 00:08:40.800 to analyze what is happening. 00:08:42.367 --> 00:08:48.431 So we did a review of around 80 tools that are existing in Wikidata 00:08:48.431 --> 00:08:52.380 and we aligned them to the different data quality dimensions. 
00:08:52.380 --> 00:08:54.360 And what we saw was that actually, 00:08:54.361 --> 00:08:57.681 many of them were looking at, monitoring completeness, 00:08:57.682 --> 00:09:02.820 but actually... and also some of them are also enabling interlinking. 00:09:02.820 --> 00:09:08.442 But there is a big need for tools that are looking into diversity, 00:09:08.443 --> 00:09:12.824 which is one of the things that we actually can have in Wikidata, 00:09:12.824 --> 00:09:15.958 especially this design principle of Wikidata 00:09:15.959 --> 00:09:17.901 where we can have plurality 00:09:17.902 --> 00:09:20.308 and different statements with different values 00:09:21.034 --> 00:09:22.236 coming from different sources. 00:09:22.236 --> 00:09:24.921 Because it's a secondary source, we don't have really tools 00:09:24.922 --> 00:09:27.750 that actually tell us how many plural statements there are, 00:09:27.751 --> 00:09:30.889 and how many we can improve and how, 00:09:30.890 --> 00:09:32.833 and we also don't know really 00:09:32.833 --> 00:09:35.538 what are all the reasons for plurality that we can have. 00:09:36.491 --> 00:09:39.201 So from these community meetings, 00:09:39.201 --> 00:09:43.084 what we discussed was the challenges that still need attention. 00:09:43.084 --> 00:09:47.249 For example, that having all these crowdsourcing communities 00:09:47.249 --> 00:09:49.613 is very good because different people attack different parts 00:09:49.613 --> 00:09:51.833 of the data or the graph, 00:09:51.834 --> 00:09:54.615 and we also have different background knowledge 00:09:54.616 --> 00:09:59.161 but actually, it's very difficult to align everything in something homogeneous 00:09:59.162 --> 00:10:04.920 because different people are using different properties in different ways 00:10:04.920 --> 00:10:08.401 and they are also expecting different things from entity descriptions. 
00:10:09.003 --> 00:10:12.721 People also said that they need more tools 00:10:12.722 --> 00:10:16.000 that give a better overview of the global status of things. 00:10:16.000 --> 00:10:20.733 So what entities are missing in terms of completeness, 00:10:20.733 --> 00:10:26.121 but also like what are people working on right now most of the time, 00:10:26.121 --> 00:10:30.516 and they also mention many times a tighter collaboration 00:10:30.517 --> 00:10:33.311 across not only languages but the WikiProjects 00:10:33.311 --> 00:10:35.571 and the different Wikimedia platforms. 00:10:35.571 --> 00:10:38.859 And we published all the transcribed comments 00:10:38.860 --> 00:10:42.959 from all these discussions in those links here in the Etherpads 00:10:42.959 --> 00:10:46.162 and also in the wiki page of Wikimania. 00:10:46.162 --> 00:10:48.481 Some solutions that appeared actually 00:10:48.481 --> 00:10:53.001 were going in the direction of sharing more of the best practices 00:10:53.001 --> 00:10:55.762 that are being developed in different WikiProjects, 00:10:55.762 --> 00:11:01.238 but also people want tools that help organize work in teams 00:11:01.239 --> 00:11:03.845 or at least understand who is working on that, 00:11:03.845 --> 00:11:07.815 and they were also mentioning that they want more showcases 00:11:07.816 --> 00:11:12.019 and more templates that help them create things in a better way. 00:11:12.946 --> 00:11:15.161 And from the contact that we have 00:11:15.162 --> 00:11:18.721 with Open Governmental Data Organizations, 00:11:18.722 --> 00:11:20.068 and in particular, 00:11:20.068 --> 00:11:23.102 I am in contact with the canton and the city of Zürich, 00:11:23.102 --> 00:11:26.207 they are very interested in working with Wikidata 00:11:26.207 --> 00:11:29.896 because they want their data to be accessible for everyone 00:11:29.897 --> 00:11:33.681 in the place where people go and consult or access data. 
00:11:33.682 --> 00:11:36.550 So for them, something that would be really interesting 00:11:36.551 --> 00:11:38.600 is to have some kind of quality indicators 00:11:38.600 --> 00:11:41.082 both in the wiki, which is already happening, 00:11:41.082 --> 00:11:42.801 but also in SPARQL results, 00:11:42.802 --> 00:11:46.066 to know whether they can trust or not that data from the community. 00:11:46.067 --> 00:11:48.230 And then, they also want to know 00:11:48.230 --> 00:11:51.417 what parts of their own data sets are useful for Wikidata 00:11:51.418 --> 00:11:56.040 and they would love to have a tool that can help them assess that automatically. 00:11:56.041 --> 00:11:59.066 They also need some kind of methodology or tool 00:11:59.067 --> 00:12:03.894 that helps them decide whether they should import or link their data 00:12:03.894 --> 00:12:04.894 because in some cases, 00:12:04.895 --> 00:12:07.137 they also have their own linked open data sets, 00:12:07.138 --> 00:12:09.746 so they don't know whether to just ingest the data 00:12:09.747 --> 00:12:13.424 or to keep on creating links from the data sets to Wikidata 00:12:13.425 --> 00:12:14.425 and the other way around. 00:12:14.950 --> 00:12:20.043 And they also want to know where their websites are referred in Wikidata. 00:12:20.044 --> 00:12:23.361 And when they run such a query in the query service, 00:12:23.362 --> 00:12:24.848 they often get timeouts, 00:12:24.849 --> 00:12:28.181 so maybe we should really create more tools 00:12:28.181 --> 00:12:32.240 that help them get these answers for their questions. 00:12:33.148 --> 00:12:36.208 And, besides that, 00:12:36.208 --> 00:12:39.361 we wiki researchers also sometimes 00:12:39.362 --> 00:12:42.023 lack some information in the edit summaries. 
00:12:42.024 --> 00:12:44.953 So I remember that when we were doing some work 00:12:44.954 --> 00:12:48.919 to understand the different behavior of editors 00:12:48.919 --> 00:12:53.403 with tools or bots or anonymous users and so on, 00:12:53.403 --> 00:12:56.154 we were really lacking, for example, 00:12:56.154 --> 00:13:01.112 a standard way of tracing that tools were being used. 00:13:01.113 --> 00:13:03.154 And there are some tools that are already doing that 00:13:03.155 --> 00:13:05.230 like PetScan and many others, 00:13:05.230 --> 00:13:07.720 but maybe we should in the community 00:13:07.721 --> 00:13:13.531 discuss more about how to record these for fine-grained provenance. 00:13:14.169 --> 00:13:15.321 And further on, 00:13:15.322 --> 00:13:20.801 we think that we need to think of more concrete data quality dimensions 00:13:20.802 --> 00:13:24.961 that are related to linked data but not all the types of data, 00:13:24.962 --> 00:13:30.721 so we worked on some measures to actually assess the information gain 00:13:30.722 --> 00:13:33.881 enabled by the links, and what we mean by that 00:13:33.882 --> 00:13:36.681 is that when we link Wikidata to other data sets, 00:13:36.682 --> 00:13:38.201 we should also be thinking 00:13:38.202 --> 00:13:41.921 how much the entities are actually gaining in the classification, 00:13:41.922 --> 00:13:45.601 also in the description but also in the vocabularies they use. 
00:13:45.602 --> 00:13:51.041 So just to give a very simple example of what I mean with this 00:13:51.042 --> 00:13:54.269 is we can think of-- in this case, would be Wikidata 00:13:54.270 --> 00:13:57.771 or the external data set that is linking to Wikidata, 00:13:57.772 --> 00:14:00.487 we have the entity for a person that is called Natasha Noy, 00:14:00.487 --> 00:14:02.601 we have the affiliation and other things, 00:14:02.602 --> 00:14:05.239 and then we say OK, we link to an external place, 00:14:05.240 --> 00:14:08.919 and that entity also has that name, but we actually have the same value. 00:14:08.920 --> 00:14:12.889 So what would be better is that we link to something that has a different name, 00:14:12.889 --> 00:14:16.881 that is still valid because this person has two ways of writing the name, 00:14:16.882 --> 00:14:19.714 and also other information that we don't have in Wikidata 00:14:19.715 --> 00:14:21.760 or that we don't have in the other data set. 00:14:22.390 --> 00:14:24.652 But also, what is even better 00:14:24.653 --> 00:14:27.770 is that we are actually looking in the target data set 00:14:27.770 --> 00:14:31.392 that they also have new ways of classifying the information. 00:14:31.393 --> 00:14:35.354 So not only is this a person, but in the other data set, 00:14:35.355 --> 00:14:39.525 they also say it's a female or anything else that they classify with. 00:14:39.526 --> 00:14:43.401 And if in the other data set, they are using many other vocabularies 00:14:43.402 --> 00:14:46.588 that is also helping in their whole information retrieval thing. 
00:14:47.371 --> 00:14:51.233 So with that, I also would like to say 00:14:51.234 --> 00:14:55.809 that we think that we can showcase federated queries better 00:14:55.810 --> 00:15:00.448 because when we look at the query log provided by Malyshev et al., 00:15:01.285 --> 00:15:04.301 we see actually that from the organic queries, 00:15:04.302 --> 00:15:06.921 we have only very few federated queries. 00:15:06.922 --> 00:15:12.801 And actually, federation is one of the key advantages of having linked data, 00:15:12.802 --> 00:15:16.903 so maybe the community or the people using Wikidata 00:15:16.903 --> 00:15:18.898 also need more examples on this. 00:15:18.898 --> 00:15:22.666 And if we look at the list of endpoints that are being used, 00:15:22.667 --> 00:15:25.401 this is not a complete list and we have many more. 00:15:25.402 --> 00:15:30.479 Of course, this data was analyzed from queries until March 2018, 00:15:30.480 --> 00:15:34.807 but we should look into the list of federated endpoints that we have 00:15:34.808 --> 00:15:37.048 and see whether we are really using them or not. 00:15:37.813 --> 00:15:40.441 So two questions that I have for the audience 00:15:40.442 --> 00:15:43.001 that maybe we can use afterwards for the discussion are: 00:15:43.001 --> 00:15:46.001 what data quality problems should be addressed in your opinion, 00:15:46.002 --> 00:15:47.412 because of the needs that you have, 00:15:47.412 --> 00:15:50.401 but also, where do you need more automation 00:15:50.402 --> 00:15:52.943 to help you with editing or patrolling. 00:15:53.866 --> 00:15:55.146 That's all, thank you very much. 00:15:55.779 --> 00:15:57.527 (applause) 00:16:06.030 --> 00:16:08.595 (Jose Emilio Labra) OK, so what I'm going to talk about 00:16:08.595 --> 00:16:14.715 is some tools that we were developing related with Shape Expressions. 00:16:15.536 --> 00:16:19.371 So this is what I want to talk... I am Jose Emilio Labra, 00:16:19.371 --> 00:16:23.215 but this has... 
all these tools have been done by different people, 00:16:23.920 --> 00:16:28.480 mainly related with W3C ShEx, Shape Expressions Community Group. 00:16:28.481 --> 00:16:29.481 ShEx Community Group. 00:16:30.144 --> 00:16:36.081 So the first tool that I want to mention is RDFShape, this is a general tool, 00:16:36.082 --> 00:16:40.681 because Shape Expressions is not only for Wikidata, 00:16:40.682 --> 00:16:44.168 Shape Expressions is a language to validate RDF in general. 00:16:44.168 --> 00:16:47.568 So this tool was developed mainly by me 00:16:47.568 --> 00:16:50.880 and it's a tool to validate RDF in general. 00:16:50.881 --> 00:16:55.139 So if you want to learn about RDF or you want to validate RDF 00:16:55.140 --> 00:16:58.621 or SPARQL endpoints not only in Wikidata, 00:16:58.622 --> 00:17:00.891 my advice is that you can use this tool. 00:17:00.891 --> 00:17:03.255 Also for teaching. 00:17:03.255 --> 00:17:05.640 I am a teacher in the university 00:17:05.641 --> 00:17:09.151 and I use it in my semantic web course to teach RDF. 00:17:09.161 --> 00:17:12.121 So if you want to learn RDF, I think it's a good tool. 00:17:13.033 --> 00:17:17.598 For example, this is just a visualization of an RDF graph with the tool. 00:17:18.587 --> 00:17:22.643 But before coming here, in the last month, 00:17:22.643 --> 00:17:28.441 I started a fork of rdfshape specifically for Wikidata, because I thought... 00:17:28.443 --> 00:17:33.082 It's called WikiShape, and yesterday, I presented it as a present for Wikidata. 00:17:33.082 --> 00:17:34.441 So what I took is... 00:17:34.442 --> 00:17:39.898 What I did is to remove all the stuff that was not related with Wikidata 00:17:39.898 --> 00:17:44.801 and to put several things, hard-coded, for example, the Wikidata SPARQL endpoint, 00:17:44.802 --> 00:17:49.041 but now, someone asked me if I could do it also for Wikibase. 00:17:49.042 --> 00:17:52.000 And it is very easy to do it for Wikibase also. 
00:17:52.760 --> 00:17:56.280 So this tool, WikiShape, is quite new. 00:17:57.015 --> 00:17:59.843 I think it works, most of the features, 00:17:59.844 --> 00:18:02.468 but there are some features that maybe don't work, 00:18:02.469 --> 00:18:06.281 and if you try it and you want to improve it, please tell me. 00:18:06.281 --> 00:18:12.680 So this is [inaudible] captures, but I think I can even try so let's try. 00:18:15.385 --> 00:18:16.945 So let's see if it works. 00:18:16.953 --> 00:18:20.070 First, I have to go out of the... 00:18:22.453 --> 00:18:23.453 Here. 00:18:24.226 --> 00:18:28.324 Alright, yeah. So this is the tool here. 00:18:28.324 --> 00:18:29.844 Things that you can do with the tool, 00:18:29.845 --> 00:18:35.275 for example, is that you can check schemas, entity schemas. 00:18:35.276 --> 00:18:38.611 You know that there is a new namespace which is "E whatever," 00:18:38.612 --> 00:18:44.805 so here, if you start for example, write for example "human"... 00:18:44.806 --> 00:18:48.812 As you are writing, its autocomplete allows you to check, 00:18:48.812 --> 00:18:52.001 for example, this is the Shape Expressions of a human, 00:18:52.790 --> 00:18:55.937 and this is the Shape Expressions here. 00:18:55.938 --> 00:18:59.841 And as you can see, this editor has syntax highlighting, 00:18:59.842 --> 00:19:04.559 this is... well, maybe it's very small, the screen. 00:19:05.676 --> 00:19:07.590 I can try to do it bigger. 00:19:09.194 --> 00:19:10.973 Maybe you see it better now. 00:19:10.973 --> 00:19:14.241 So... and this is the editor with syntax highlighting and also has... 00:19:14.241 --> 00:19:17.851 I mean, this editor comes from the same source code 00:19:17.851 --> 00:19:19.641 as the Wikidata query service. 00:19:19.642 --> 00:19:23.960 So for example, if you hover with the mouse here, 00:19:23.961 --> 00:19:27.961 it shows you the labels of the different properties. 
00:19:27.962 --> 00:19:31.298 So I think it's very helpful because now, 00:19:32.588 --> 00:19:38.601 the entity schemas in Wikidata are just edited in a plain text area, 00:19:38.602 --> 00:19:42.493 and I think this editor is much better because it has autocomplete 00:19:42.494 --> 00:19:43.743 and it also has... 00:19:43.744 --> 00:19:48.241 I mean, if you, for example, wanted to add a constraint, 00:19:48.241 --> 00:19:51.570 you say "wdt:" 00:19:51.570 --> 00:19:56.884 You start writing "author" and then you press Ctrl+Space 00:19:56.884 --> 00:19:58.922 and it suggests the different things. 00:19:58.922 --> 00:20:02.388 So this is similar to the Wikidata query service 00:20:02.389 --> 00:20:06.445 but specifically for Shape Expressions 00:20:06.445 --> 00:20:11.975 because my feeling is that creating Shape Expressions 00:20:11.976 --> 00:20:15.841 is not more difficult than writing SPARQL queries. 00:20:15.842 --> 00:20:21.255 So some people think that it's at the same level, 00:20:22.278 --> 00:20:26.296 It's probably easier, I think, because Shape Expressions was, 00:20:26.296 --> 00:20:31.241 when we designed it, we were doing it to be easier to work with. 00:20:31.242 --> 00:20:35.001 OK, so this is one of the first things, that you have this editor 00:20:35.001 --> 00:20:36.620 for Shape Expressions. 00:20:37.371 --> 00:20:41.467 And then you also have the possibility, for example, to visualize. 00:20:41.468 --> 00:20:44.801 If you have a Shape Expression, use for example... 00:20:44.802 --> 00:20:49.386 I think, "written work" is a nice Shape Expression 00:20:49.386 --> 00:20:53.300 because it has some relationships between different things. 00:20:54.823 --> 00:20:58.160 And this is the UML visualization of written work. 00:20:58.161 --> 00:21:02.090 In UML, it is easy to see the different properties. 
00:21:02.790 --> 00:21:06.794 When you do this, I realized when I tried with several people, 00:21:06.795 --> 00:21:09.216 they find some mistakes in their Shape Expressions 00:21:09.217 --> 00:21:12.988 because it's easy to detect which are the missing properties or whatever. 00:21:13.588 --> 00:21:15.771 Then there is another possibility here 00:21:15.772 --> 00:21:19.520 is that you can also validate, I think I have it here, the validation. 00:21:20.496 --> 00:21:25.285 I think I had it in some label, maybe I closed it. 00:21:26.267 --> 00:21:30.988 OK, but you can, for example, you can click here, Validate entities. 00:21:32.308 --> 00:21:34.232 You, for example, 00:21:35.404 --> 00:21:41.921 "q42" with "e42" which is author. 00:21:42.818 --> 00:21:46.180 With "human," I think we can do it with "human." 00:21:49.050 --> 00:21:50.050 And then it's... 00:21:50.688 --> 00:21:56.365 And it's taking a little while to do it because this is doing the SPARQL queries 00:21:56.365 --> 00:21:59.134 and now, for example, it's failing by the network but... 00:21:59.657 --> 00:22:01.580 So you can try it. 00:22:02.759 --> 00:22:07.026 OK, so let's go continue with the presentation, with other tools. 00:22:07.026 --> 00:22:12.353 So my advice is that if you want to try it and you want any feedback let me know. 00:22:13.133 --> 00:22:15.540 So to continue with the presentation... 00:22:18.923 --> 00:22:20.233 So this is WikiShape. 00:22:23.800 --> 00:22:26.509 Then, I already said this, 00:22:27.681 --> 00:22:34.157 the Shape Expressions Editor is an independent project in GitHub. 00:22:35.605 --> 00:22:37.472 You can use it in your own project. 00:22:37.472 --> 00:22:41.036 If you want to do a Shape Expressions tool, 00:22:41.036 --> 00:22:45.635 you can just embed it in any other project, 00:22:45.636 --> 00:22:48.235 so this is in GitHub and you can use it. 
00:22:48.868 --> 00:22:51.970 Then the same author, it's one of my students, 00:22:52.684 --> 00:22:55.704 he also created an editor for Shape Expressions, 00:22:55.704 --> 00:22:57.799 also inspired by the Wikidata query service 00:22:57.800 --> 00:23:00.681 where, in a column, 00:23:00.682 --> 00:23:05.103 you have this more visual editor of SPARQL queries 00:23:05.104 --> 00:23:07.135 where you can put this kind of things. 00:23:07.136 --> 00:23:09.123 So this is a screen capture. 00:23:09.123 --> 00:23:12.662 You can see that that's the Shape Expressions in text 00:23:12.662 --> 00:23:17.822 but this is a form-based Shape Expressions where it would probably take a bit longer 00:23:18.595 --> 00:23:23.400 where you can put the different rows on the different fields. 00:23:23.401 --> 00:23:25.800 OK, then there is ShExEr. 00:23:26.879 --> 00:23:31.882 We have... it's done by one PhD student at the University of Oviedo 00:23:31.883 --> 00:23:34.080 and he's here, so you can present ShExEr. 00:23:38.147 --> 00:23:40.024 (Danny) Hello, I am Danny Fernández, 00:23:40.025 --> 00:23:43.800 I am a PhD student at the University of Oviedo working with Labra. 00:23:44.710 --> 00:23:47.725 Since we are running out of time, let's make this quick, 00:23:47.726 --> 00:23:52.641 so let's not go for any actual demo, but just show some screenshots. 00:23:52.642 --> 00:23:57.897 OK, so the usual way to work with Shape Expressions or any shape language 00:23:57.897 --> 00:23:59.521 is that you have a domain expert 00:23:59.522 --> 00:24:02.313 that defines a priori what the graph should look like, 00:24:02.314 --> 00:24:03.555 defines some structures, 00:24:03.556 --> 00:24:06.983 and then you use these structures to validate the actual data against it. 
00:24:08.124 --> 00:24:11.641 This tool, like the ones that Labra has been presenting, 00:24:11.642 --> 00:24:14.441 is a general purpose tool for any RDF source, 00:24:14.442 --> 00:24:17.375 and it is designed to work the other way around. 00:24:17.376 --> 00:24:18.758 You already have some data, 00:24:18.759 --> 00:24:23.165 you select what nodes you want to get the shape about 00:24:23.165 --> 00:24:26.718 and then you automatically extract or infer the shape. 00:24:26.719 --> 00:24:29.791 So even if this is a general purpose tool, 00:24:29.791 --> 00:24:34.063 what we did for this WikidataCon is this fancy button 00:24:34.884 --> 00:24:37.081 that if you click it, essentially what happens 00:24:37.081 --> 00:24:42.079 is that there are so many configuration params 00:24:42.080 --> 00:24:46.251 and it configures it to work against the Wikidata endpoint 00:24:46.251 --> 00:24:47.971 and I will end soon, sorry. 00:24:48.733 --> 00:24:52.883 So, once you press this button what you get is essentially this. 00:24:52.884 --> 00:24:55.126 After having selected what kind of nodes, 00:24:55.127 --> 00:24:59.360 what kind of instances of a class, whatever you are looking for, 00:24:59.361 --> 00:25:01.321 you get an automatic schema. 00:25:02.319 --> 00:25:07.111 All the constraints are sorted by how many nodes actually conform to them, 00:25:07.112 --> 00:25:09.772 you can filter the less common ones, etc. 00:25:09.772 --> 00:25:12.126 So there is a poster downstairs about this stuff 00:25:12.127 --> 00:25:14.595 and well, I will be downstairs and upstairs 00:25:14.596 --> 00:25:16.454 and all over the place all day, 00:25:16.455 --> 00:25:19.081 so if you have any further interest in this tool, 00:25:19.082 --> 00:25:21.476 just speak to me during the day. 00:25:21.477 --> 00:25:24.624 And now, I'll give the mic back to Labra, thank you.
00:25:24.625 --> 00:25:29.265 (applause) 00:25:29.812 --> 00:25:32.578 (Jose) So let's continue with the other tools. 00:25:32.579 --> 00:25:34.984 The other tool is the ShapeDesigner. 00:25:34.984 --> 00:25:37.241 Andra, do you want to do the ShapeDesigner now 00:25:37.242 --> 00:25:39.287 or maybe later or in the workshop? 00:25:39.287 --> 00:25:40.603 There is a workshop... 00:25:40.603 --> 00:25:44.437 This afternoon, there is a workshop specifically for Shape Expressions, and... 00:25:45.265 --> 00:25:47.939 The idea is that it was going to be more hands-on, 00:25:47.940 --> 00:25:52.324 and if you want to practice some ShEx, you can do it there. 00:25:52.875 --> 00:25:55.720 This tool is ShEx... and there is Eric here, 00:25:55.721 --> 00:25:56.890 so you can present it. 00:25:57.969 --> 00:26:00.687 (Eric) So just super quick, the thing that I want to say 00:26:00.687 --> 00:26:05.711 is that you've probably already seen the ShEx interface 00:26:05.711 --> 00:26:07.601 that's tailored for Wikidata. 00:26:07.602 --> 00:26:12.930 That's effectively stripped down and tailored specifically for Wikidata 00:26:12.930 --> 00:26:17.937 because the generic one has more features, but I thought I'd mention it 00:26:17.937 --> 00:26:19.977 because one of those features is particularly useful 00:26:19.978 --> 00:26:23.201 for debugging Wikidata schemas, 00:26:23.201 --> 00:26:29.224 which is if you go and you select the slurp mode, 00:26:29.225 --> 00:26:31.444 what it does is it says while I'm validating, 00:26:31.445 --> 00:26:34.694 I want to pull all the triples down and that means 00:26:34.695 --> 00:26:36.274 if I get a bunch of failures, 00:26:36.275 --> 00:26:39.586 I can go through and start looking at those failures and saying, 00:26:39.587 --> 00:26:41.800 OK, what are the triples that are in here, 00:26:41.801 --> 00:26:44.120 sorry, I apologize, the triples are down there, 00:26:44.121 --> 00:26:45.647 this is just a log of what went by.
00:26:46.327 --> 00:26:49.180 And then you can just sit there and fiddle with it in real time 00:26:49.181 --> 00:26:51.033 like you play with something and it changes. 00:26:51.033 --> 00:26:54.160 So it's a quicker version for doing all that stuff. 00:26:55.361 --> 00:26:56.481 This is a ShExC form, 00:26:56.482 --> 00:26:59.455 this is something [Joachim] had suggested 00:27:00.035 --> 00:27:04.631 could be useful for populating Wikidata documents 00:27:04.631 --> 00:27:07.338 based on a Shape Expression for that document. 00:27:08.095 --> 00:27:11.681 This is not tailored for Wikidata, 00:27:11.682 --> 00:27:14.081 but this is just to say that you can have a schema 00:27:14.082 --> 00:27:15.402 and you can have some annotations 00:27:15.403 --> 00:27:17.518 to say specifically how I want that schema rendered 00:27:17.519 --> 00:27:19.031 and then it just builds a form, 00:27:19.031 --> 00:27:21.191 and if you've got data, it can even populate the form. 00:27:24.517 --> 00:27:26.164 PyShEx [inaudible]. 00:27:28.025 --> 00:27:31.080 (Jose) I think this is the last one. 00:27:31.821 --> 00:27:34.080 Yes, so the last one is PyShEx. 00:27:34.675 --> 00:27:38.151 PyShEx is a Python implementation of Shape Expressions, 00:27:39.193 --> 00:27:42.680 you can also play with it in Jupyter Notebooks if you want those kinds of things. 00:27:42.680 --> 00:27:44.432 OK, so that's all for this. 00:27:44.433 --> 00:27:47.170 (applause) 00:27:52.916 --> 00:27:57.073 (Andra) So I'm going to talk about a specific project that I'm involved in 00:27:57.074 --> 00:27:58.074 called Gene Wiki, 00:27:58.075 --> 00:28:04.596 and where we are also dealing with quality issues.
00:28:04.597 --> 00:28:06.684 But before going into the quality, 00:28:06.685 --> 00:28:09.229 maybe a quick introduction about what Gene Wiki is, 00:28:09.855 --> 00:28:15.175 and we just released a pre-print of a paper that we recently have written 00:28:15.175 --> 00:28:18.160 that explains the details of the project. 00:28:19.821 --> 00:28:23.839 I see people taking pictures, but basically, what Gene Wiki does, 00:28:23.846 --> 00:28:28.027 it's trying to get biomedical data, public data into Wikidata, 00:28:28.028 --> 00:28:32.200 and we follow a specific pattern to get that data into Wikidata. 00:28:33.130 --> 00:28:36.809 So when we have a new repository or a new data set 00:28:36.810 --> 00:28:39.600 that is eligible to be included into Wikidata, 00:28:39.601 --> 00:28:41.293 the first step is community engagement. 00:28:41.294 --> 00:28:43.784 It is not necessary directly to a Wikidata community 00:28:43.785 --> 00:28:46.120 but a local research community, 00:28:46.121 --> 00:28:50.286 and we meet in person or online or on any platform 00:28:50.286 --> 00:28:52.881 and try to come up with a data model 00:28:52.882 --> 00:28:56.197 that bridges their data with the Wikidata model. 00:28:56.197 --> 00:28:59.944 So here I have a picture of a workshop that happened here last year 00:28:59.945 --> 00:29:02.663 which was trying to look at a specific data set 00:29:02.663 --> 00:29:05.280 and, well, you see a lot of discussions, 00:29:05.281 --> 00:29:09.780 then aligning it with schema.org and other ontologies that are out there. 00:29:10.320 --> 00:29:15.508 And then, at the end of the first step, we have a whiteboard drawing of the schema 00:29:15.509 --> 00:29:17.336 that we want to implement in Wikidata. 00:29:17.337 --> 00:29:20.440 What you see over there, this is just plain, 00:29:20.441 --> 00:29:21.766 we have it in the back there 00:29:21.767 --> 00:29:25.240 so we can make some schemas within this panel today even. 
00:29:26.560 --> 00:29:28.399 So once we have the schema in place, 00:29:28.400 --> 00:29:31.320 the next thing is try to make that schema machine readable 00:29:32.358 --> 00:29:36.841 because you want to have actionable models to bridge the data that you're bringing in 00:29:36.842 --> 00:29:39.690 from any biomedical database into Wikidata. 00:29:40.393 --> 00:29:45.182 And here we are applying Shape Expressions. 00:29:46.471 --> 00:29:52.518 And we use that because Shape Expressions allow you to test 00:29:52.518 --> 00:29:57.040 whether the data set is actually-- no, to first see 00:29:57.041 --> 00:30:01.782 of already existing data in Wikidata follows the same data model 00:30:01.783 --> 00:30:04.718 that was achieved in the previous process. 00:30:04.719 --> 00:30:06.641 So then with the Shape Expression we can check: 00:30:06.642 --> 00:30:10.926 OK the data that are on this topic in Wikidata, does it need some cleaning up 00:30:10.926 --> 00:30:15.013 or do we need to adapt our model to the Wikidata model or vice versa. 00:30:15.937 --> 00:30:19.867 Once that is in place and we start writing bots, 00:30:20.670 --> 00:30:23.801 and bots are seeding the information 00:30:23.802 --> 00:30:27.308 that is in the primary sources into Wikidata. 00:30:27.846 --> 00:30:29.303 And when the bots are ready, 00:30:29.304 --> 00:30:33.001 we write these bots with a platform called-- 00:30:33.002 --> 00:30:36.201 with a Python library called Wikidata Integrator 00:30:36.202 --> 00:30:38.167 that came out of our project. 00:30:38.698 --> 00:30:42.921 And once we have our bots, we use a platform called Jenkins 00:30:42.921 --> 00:30:44.540 for continuous integration. 00:30:44.540 --> 00:30:45.762 And with Jenkins, 00:30:45.762 --> 00:30:51.160 we continuously update the primary sources with Wikidata. 00:30:52.178 --> 00:30:55.889 And this is a diagram for the paper I previously mentioned. 00:30:55.890 --> 00:30:57.241 This is our current landscape. 
00:30:57.242 --> 00:31:02.059 So every orange box out there is a primary resource on drugs, 00:31:02.060 --> 00:31:07.827 proteins, genes, diseases, chemical compounds with interaction, 00:31:07.827 --> 00:31:10.870 and this model is too small to read now 00:31:10.870 --> 00:31:17.472 but this is the database, the sources that we manage in Wikidata 00:31:17.473 --> 00:31:20.560 and bridge with the primary sources. 00:31:20.561 --> 00:31:22.355 Here is such a workflow. 00:31:22.870 --> 00:31:25.312 So one of our partners is the Disease Ontology, 00:31:25.312 --> 00:31:27.672 the Disease Ontology is a CC0 ontology, 00:31:28.179 --> 00:31:31.990 and the CC0 Ontology has a curation cycle of its own, 00:31:32.756 --> 00:31:35.736 and they just continuously update the Disease Ontology 00:31:35.737 --> 00:31:39.687 to reflect the disease space or the interpretation of diseases. 00:31:40.336 --> 00:31:44.361 And there is the Wikidata curation cycle also on diseases 00:31:44.362 --> 00:31:49.844 where the Wikidata community constantly monitors what's going on on Wikidata. 00:31:50.406 --> 00:31:51.601 And then we have two roles, 00:31:51.602 --> 00:31:55.477 we call them colloquially the gatekeeper curator, 00:31:56.009 --> 00:31:59.561 and this was me and a colleague five years ago 00:31:59.562 --> 00:32:03.414 where we just sit on our computers and we monitor Wikipedia and Wikidata, 00:32:03.415 --> 00:32:08.601 and if there is an issue that was reported back to the primary community, 00:32:08.602 --> 00:32:11.765 the primary resources, they looked at the implementation and decided: 00:32:11.765 --> 00:32:14.240 OK, do we trust the Wikidata input? 00:32:14.850 --> 00:32:18.555 Yes--then it's considered, it goes into the cycle, 00:32:18.555 --> 00:32:22.686 and the next iteration is part of the Disease Ontology 00:32:22.687 --> 00:32:25.411 and fed back into Wikidata. 00:32:27.419 --> 00:32:31.480 We're doing the same for WikiPathways.
00:32:31.481 --> 00:32:36.601 WikiPathways is a MediaWiki-inspired pathway and pathway repository. 00:32:36.602 --> 00:32:40.901 Same story, there are different pathway resources on Wikidata already. 00:32:41.463 --> 00:32:44.713 There might be conflicts between those pathway resources 00:32:44.722 --> 00:32:46.701 and these conflicts are reported back 00:32:46.702 --> 00:32:49.521 by the gatekeeper curators to that community, 00:32:49.522 --> 00:32:53.715 and you maintain the individual curation cycles. 00:32:53.715 --> 00:32:57.068 But if you remember the previous cycle, 00:32:57.069 --> 00:33:03.041 here I mentioned only two cycles, two resources, 00:33:03.566 --> 00:33:06.300 we have to do that for every single resource that we have 00:33:06.300 --> 00:33:08.061 and we have to manage what's going on 00:33:08.062 --> 00:33:09.185 because when I say curation, 00:33:09.185 --> 00:33:11.377 I really mean going to the Wikipedia talk pages, 00:33:11.377 --> 00:33:14.544 going into the Wikidata talk pages and trying to do that. 00:33:14.545 --> 00:33:19.316 That doesn't scale for the two gatekeeper curators we had. 00:33:19.860 --> 00:33:22.777 So when I was in a conference in 2016 00:33:22.778 --> 00:33:26.933 where Eric gave a presentation on Shape Expressions, 00:33:26.934 --> 00:33:29.277 I jumped on the bandwagon and said OK, 00:33:29.278 --> 00:33:34.240 Shape Expressions can help us detect the differences in Wikidata 00:33:34.240 --> 00:33:41.159 and so that allows the gatekeepers to have some more efficient reporting.
00:33:42.275 --> 00:33:46.019 So this year, I was delighted by the schema entity 00:33:46.020 --> 00:33:50.765 because now, we can store those entity schemas on Wikidata, 00:33:50.765 --> 00:33:53.183 on Wikidata itself, whereas before, it was on GitHub, 00:33:53.860 --> 00:33:56.815 and this aligns with the Wikidata interface, 00:33:56.816 --> 00:33:59.350 so you have things like document discussions 00:33:59.350 --> 00:34:00.762 but you also have revisions. 00:34:00.763 --> 00:34:05.261 So you can leverage the talk pages and the revisions in Wikidata 00:34:05.262 --> 00:34:12.255 to use that to discuss what is in Wikidata 00:34:12.255 --> 00:34:14.060 and what is in the primary resources. 00:34:14.966 --> 00:34:19.686 So this is what Eric just presented, this is already quite a benefit. 00:34:19.686 --> 00:34:24.335 So here, we made up a Shape Expression for the human gene, 00:34:24.336 --> 00:34:30.225 and then we ran it through ShEx Simple, and as you can see, 00:34:30.225 --> 00:34:32.428 we just got already ni-- 00:34:32.429 --> 00:34:34.641 There is one issue that needs to be monitored 00:34:34.642 --> 00:34:37.316 which is that there is an item that doesn't fit that schema, 00:34:37.316 --> 00:34:43.139 and then you can sort of already create schema entities curation reports 00:34:43.140 --> 00:34:46.240 based on... and send that to the different curation reports. 00:34:48.058 --> 00:34:52.788 But the ShEx.js is a built interface, 00:34:52.788 --> 00:34:55.860 and if I can show back here, I only do ten, 00:34:55.860 --> 00:35:00.362 but we have tens of thousands, and so that again doesn't scale. 00:35:00.362 --> 00:35:04.654 So the Wikidata Integrator now supports ShEx as well, 00:35:05.168 --> 00:35:07.431 and then we can just loop over items 00:35:07.431 --> 00:35:11.494 where we say yes-no, yes-no, true-false, true-false. 00:35:11.495 --> 00:35:12.495 So again, 00:35:13.065 --> 00:35:16.514 increasing a bit of the efficiency of dealing with the reports.
00:35:17.256 --> 00:35:22.662 But now, recently, that builds on the Wikidata Query Service, 00:35:23.181 --> 00:35:24.998 and well, we recently have been throttled 00:35:24.999 --> 00:35:26.560 so again, that doesn't scale. 00:35:26.561 --> 00:35:31.391 So it's still an ongoing process, how to deal with models on Wikidata. 00:35:32.202 --> 00:35:36.682 And so again, ShEx is not only intimidating 00:35:36.683 --> 00:35:40.356 but also the scale is just too big to deal with. 00:35:41.068 --> 00:35:46.081 So I started working, this is my first proof of concept or exercise 00:35:46.082 --> 00:35:47.680 where I used a tool called yEd, 00:35:48.184 --> 00:35:52.590 and I started to draw those Shape Expressions and because... 00:35:52.591 --> 00:35:58.098 and then regenerate this schema 00:35:58.099 --> 00:36:01.279 into this JSON format of the Shape Expressions, 00:36:01.280 --> 00:36:04.520 so that would open up already to the audience 00:36:04.521 --> 00:36:07.432 that is intimidated by the Shape Expressions language. 00:36:07.961 --> 00:36:12.308 But actually, there is a problem with those visual descriptions 00:36:12.309 --> 00:36:18.229 because this is also a schema that was actually drawn in yEd by someone. 00:36:18.230 --> 00:36:23.838 And here is another one which is beautiful. 00:36:23.838 --> 00:36:29.414 I would love to have this on my wall, but it is still not interoperable. 00:36:30.281 --> 00:36:32.131 So I want to end my talk with, 00:36:32.131 --> 00:36:35.732 and the first time, I've been stealing this slide, using this slide. 00:36:35.732 --> 00:36:37.594 It's an honor to have him in the audience 00:36:37.595 --> 00:36:39.423 and I really like this: 00:36:39.424 --> 00:36:42.362 "People think RDF is a pain because it's complicated. 00:36:42.362 --> 00:36:43.985 The truth is even worse, it's so simple, 00:36:45.581 --> 00:36:48.133 because you have to work with real-world data problems 00:36:48.134 --> 00:36:50.031 that are horribly complicated.
00:36:50.031 --> 00:36:51.451 While you can avoid RDF, 00:36:51.451 --> 00:36:55.760 it is harder to avoid complicated data and complicated computer problems." 00:36:55.761 --> 00:36:59.535 This is about RDF, but I think this so applies to modeling as well. 00:37:00.112 --> 00:37:02.769 So my point of discussion is should we really... 00:37:03.387 --> 00:37:05.882 How do we get modeling going? 00:37:05.882 --> 00:37:10.826 Should we discuss ShEx or visual models or... 00:37:11.426 --> 00:37:13.271 How do we continue? 00:37:13.474 --> 00:37:14.840 Thank you very much for your time. 00:37:15.102 --> 00:37:17.787 (applause) 00:37:20.001 --> 00:37:21.188 (Lydia) Thank you so much. 00:37:21.692 --> 00:37:24.001 Would you come to the front 00:37:24.002 --> 00:37:27.741 so that we can open the questions from the audience. 00:37:28.610 --> 00:37:30.203 Are there questions? 00:37:31.507 --> 00:37:32.507 Yes. 00:37:34.253 --> 00:37:36.890 And I think, for the camera, we need to... 00:37:38.835 --> 00:37:40.968 (Lydia laughing) Yeah. 00:37:43.094 --> 00:37:46.273 (man3) So a question for Cristina, I think. 00:37:47.366 --> 00:37:51.641 So you mentioned exactly the term "information gain" 00:37:51.642 --> 00:37:53.689 from linking with other systems. 00:37:53.690 --> 00:37:55.619 There is an information theoretic measure 00:37:55.620 --> 00:37:58.001 using statistic and probability called information gain. 00:37:58.002 --> 00:37:59.541 Do you have the same... 00:37:59.542 --> 00:38:01.736 I mean did you mean exactly that measure, 00:38:01.736 --> 00:38:04.173 the information gain from the probability theory 00:38:04.174 --> 00:38:05.240 from information theory 00:38:05.241 --> 00:38:09.024 or just use this conceptual thing to measure information gain some way? 00:38:09.025 --> 00:38:13.016 No, so we actually defined and implemented measures 00:38:13.695 --> 00:38:20.161 that are using the Shannon entropy, so it's meant as that. 
00:38:20.162 --> 00:38:22.696 I didn't want to go into details of the concrete formulas... 00:38:22.697 --> 00:38:24.977 (man3) No, no, of course, that's why I asked the question. 00:38:24.978 --> 00:38:26.698 - (Cristina) But yeah... - (man3) Thank you. 00:38:33.091 --> 00:38:35.047 (man4) Make more of a comment than a question. 00:38:35.048 --> 00:38:36.241 (Lydia) Go for it. 00:38:36.242 --> 00:38:39.840 (man4) So there's been a lot of focus at the item level 00:38:39.840 --> 00:38:42.547 about quality and completeness, 00:38:42.547 --> 00:38:47.374 one of the things that concerns me is that we're not applying the same to hierarchies 00:38:47.374 --> 00:38:51.480 and I think we have an issue is that our hierarchy often isn't good. 00:38:51.481 --> 00:38:53.463 We're seeing this is going to be a real problem 00:38:53.464 --> 00:38:55.774 with Commons searching and other things. 00:38:56.771 --> 00:39:00.601 One of the abilities that we can do is to import external-- 00:39:00.602 --> 00:39:04.842 The way that external thesauruses structure their hierarchies, 00:39:04.842 --> 00:39:10.291 using the P4900 broader concept qualifier. 00:39:11.037 --> 00:39:16.167 But what I think would be really helpful would be much better tools for doing that 00:39:16.168 --> 00:39:21.212 so that you can import an external... thesaurus's hierarchy 00:39:21.212 --> 00:39:24.111 map that onto our Wikidata items. 00:39:24.111 --> 00:39:28.199 Once it's in place with those P4900 qualifiers, 00:39:28.200 --> 00:39:31.494 you can actually do some quite good querying through SPARQL 00:39:32.490 --> 00:39:37.534 to see where our hierarchy diverges from that external hierarchy. 00:39:37.534 --> 00:39:41.346 For instance, [Paula Morma], user PKM, you may know, 00:39:41.346 --> 00:39:43.533 does a lot of work on fashion. 
00:39:43.533 --> 00:39:50.524 So we use that to pull in the Europeana Fashion Thesaurus's hierarchy 00:39:50.524 --> 00:39:53.812 and the Getty AAT fashion thesaurus hierarchy, 00:39:53.812 --> 00:39:57.957 and then see where the gaps were in our higher level items, 00:39:57.957 --> 00:40:00.511 which is a real problem for us because often, 00:40:00.511 --> 00:40:04.355 these are things that only exist as disambiguation pages on Wikipedia, 00:40:04.356 --> 00:40:09.270 so we have a lot of higher level items in our hierarchies missing 00:40:09.271 --> 00:40:14.480 and this is something that we must address in terms of quality and completeness, 00:40:14.480 --> 00:40:15.971 but what would really help 00:40:16.643 --> 00:40:20.871 would be better tools than the jungle of Perl scripts that I wrote... 00:40:20.872 --> 00:40:26.010 If somebody could put that into a PAWS notebook in Python 00:40:26.561 --> 00:40:31.972 to be able to take an external thesaurus, take its hierarchy, 00:40:31.973 --> 00:40:34.595 which may well be available as linked data or may not, 00:40:35.379 --> 00:40:40.580 to then put those into QuickStatements to put in P4900 values. 00:40:41.165 --> 00:40:42.165 And then later, 00:40:42.166 --> 00:40:44.527 when our representation gets more complete, 00:40:44.528 --> 00:40:49.691 to update those P4900s because as our representation gets dated, 00:40:49.691 --> 00:40:51.590 becomes more dense, 00:40:51.590 --> 00:40:55.377 the values of those qualifiers need to change 00:40:56.230 --> 00:40:59.526 to represent that we've got more of their hierarchy in our system. 00:40:59.526 --> 00:41:03.728 If somebody could do that, I think that would be very helpful, 00:41:03.728 --> 00:41:07.121 and we do need to also look at other approaches 00:41:07.122 --> 00:41:10.762 to improve quality and completeness at the hierarchy level 00:41:10.763 --> 00:41:12.378 not just at the item level. 00:41:13.308 --> 00:41:14.840 (Andra) Can I add to that?
00:41:16.362 --> 00:41:19.901 Yes, and we actually do that, 00:41:19.911 --> 00:41:23.551 and I can recommend looking at the Shape Expression that Finn made 00:41:23.552 --> 00:41:27.330 with the lexical data where he creates Shape Expressions 00:41:27.330 --> 00:41:29.640 and then builds on other Shape Expressions 00:41:29.641 --> 00:41:32.528 so you have this concept of linked Shape Expressions in Wikidata, 00:41:32.529 --> 00:41:35.005 and specifically, the use case, if I understand correctly, 00:41:35.006 --> 00:41:37.183 is exactly what we are doing in Gene Wiki. 00:41:37.184 --> 00:41:40.841 So you have the Disease Ontology which is put into Wikidata 00:41:40.842 --> 00:41:44.681 and then disease data comes in and we apply the Shape Expressions 00:41:44.682 --> 00:41:47.247 to see if that fits with this thesaurus. 00:41:47.248 --> 00:41:50.919 And there are other thesauruses or other ontologies for controlled vocabularies 00:41:50.920 --> 00:41:52.559 that still need to go into Wikidata, 00:41:52.559 --> 00:41:55.401 and that's exactly why Shape Expressions are so interesting 00:41:55.402 --> 00:41:57.963 because you can have a Shape Expression for the Disease Ontology, 00:41:57.964 --> 00:41:59.644 you can have a Shape Expression for MeSH, 00:41:59.645 --> 00:42:01.761 you can say: OK, now I want to check the quality. 00:42:01.762 --> 00:42:04.059 Because you also have in Wikidata the context 00:42:04.060 --> 00:42:09.567 of when you have a controlled vocabulary, you say the quality is according to this, 00:42:09.568 --> 00:42:11.636 but you might have a disagreeing community. 00:42:11.636 --> 00:42:16.081 So the tooling is indeed in place but the task now is to create those models 00:42:16.082 --> 00:42:18.144 and apply them to the different use cases.
00:42:18.811 --> 00:42:20.921 (man4) The ShapeExpression's very useful 00:42:20.922 --> 00:42:25.928 once you have the external ontology mapped into Wikidata, 00:42:25.929 --> 00:42:29.474 but my problem is that it's getting to that stage, 00:42:29.475 --> 00:42:34.881 it's working out how much of the external ontology isn't yet in Wikidata 00:42:34.882 --> 00:42:36.256 and where the gaps are, 00:42:36.257 --> 00:42:40.660 and that's where I think that having much more robust tools 00:42:40.660 --> 00:42:44.286 to see what's missing from external ontologies 00:42:44.286 --> 00:42:45.537 would be very helpful. 00:42:47.678 --> 00:42:49.062 The biggest problem there 00:42:49.062 --> 00:42:51.201 is not so much tooling but more licensing. 00:42:51.803 --> 00:42:55.249 So getting the ontologies into Wikidata is actually a piece of cake 00:42:55.250 --> 00:42:59.295 but most of the ontologies have, how can I say that politely, 00:42:59.965 --> 00:43:03.256 restrictive licensing, so they are not compatible with Wikidata. 00:43:04.068 --> 00:43:06.678 (man4) There's a huge number of public sector thesauruses 00:43:06.678 --> 00:43:08.209 in cultural fields. 00:43:08.210 --> 00:43:10.851 - (Andra) Then we need to talk. - (man4) Not a problem. 00:43:10.852 --> 00:43:12.384 (Andra) Then we need to talk. 00:43:13.624 --> 00:43:19.192 (man5) Just... the comment I want to make is actually answer to James, 00:43:19.192 --> 00:43:22.401 so the thing is that hierarchies make graphs, 00:43:22.374 --> 00:43:24.041 and when you want to... 00:43:24.579 --> 00:43:28.888 I want to basically talk about... a common problem in hierarchies 00:43:28.889 --> 00:43:30.820 is circle hierarchies, 00:43:30.821 --> 00:43:33.796 so they come back to each other when there's a problem, 00:43:33.796 --> 00:43:35.920 which you should not have that in hierarchies. 
00:43:37.022 --> 00:43:41.295 This, funnily enough, happens in categories in Wikipedia a lot 00:43:41.295 --> 00:43:42.990 we have a lot of circles in categories, 00:43:43.898 --> 00:43:46.612 but the good news is that this is... 00:43:47.713 --> 00:43:51.582 Technically, it's an NP-complete problem, so you cannot find this 00:43:51.583 --> 00:43:53.414 easily if you build a graph of that, 00:43:54.473 --> 00:43:57.046 but there are lots of ways that have been developed 00:43:57.047 --> 00:44:00.624 to find problems in these hierarchy graphs. 00:44:00.625 --> 00:44:04.860 Like there is a paper called Finding Cycles... 00:44:04.861 --> 00:44:07.955 Breaking Cycles in Noisy Hierarchies, 00:44:07.956 --> 00:44:12.671 and it's been used to help categorization of English Wikipedia. 00:44:12.672 --> 00:44:17.141 You can just take this and apply it to these hierarchies in Wikidata, 00:44:17.142 --> 00:44:19.540 and then you can find things that are problematic 00:44:19.541 --> 00:44:22.481 and just remove the ones that are causing issues 00:44:22.482 --> 00:44:24.593 and find the issues, actually. 00:44:24.594 --> 00:44:26.960 So this is just an idea, just so you... 00:44:28.780 --> 00:44:29.930 (man4) That's all very well 00:44:29.931 --> 00:44:34.402 but I think you're underestimating the number of bad subclass relations 00:44:34.402 --> 00:44:35.402 that we have. 00:44:35.403 --> 00:44:39.680 It's like having a city in completely the wrong country, 00:44:40.250 --> 00:44:44.874 and there are tools for geography to identify that, 00:44:44.875 --> 00:44:49.201 and we need to have much better tools in hierarchies 00:44:49.202 --> 00:44:53.477 to identify where the equivalent of the item for the country 00:44:53.478 --> 00:44:57.673 is missing entirely, or where it's actually been subclassed 00:44:57.674 --> 00:45:01.804 to something that means something completely different.
00:45:02.804 --> 00:45:07.165 (Lydia) Yeah, I think you're getting to something 00:45:07.166 --> 00:45:12.024 that me and my team keeps hearing from people who reuse our data 00:45:12.025 --> 00:45:13.991 quite a bit as well, right, 00:45:15.002 --> 00:45:16.638 Individual data point might be great 00:45:16.639 --> 00:45:20.163 but if you have to look at the ontology and so on, 00:45:20.164 --> 00:45:21.857 then it gets very... 00:45:22.388 --> 00:45:26.437 And I think one of the big problems why this is happening 00:45:26.437 --> 00:45:30.736 is that a lot of editing on Wikidata 00:45:30.736 --> 00:45:34.544 happens on the basis of an individual item, right, 00:45:34.545 --> 00:45:36.201 you make an edit on that item, 00:45:37.653 --> 00:45:42.075 without realizing that this might have very global consequences 00:45:42.075 --> 00:45:44.245 on the rest of the graph, for example. 00:45:44.245 --> 00:45:50.040 And if people have ideas around how to make this more visible, 00:45:50.041 --> 00:45:53.185 the consequences of an individual local edit, 00:45:54.005 --> 00:45:56.537 I think that would be worth exploring, 00:45:57.550 --> 00:46:01.583 to show people better what the consequence of their edit 00:46:01.584 --> 00:46:03.434 that they might do in very good faith, 00:46:04.481 --> 00:46:05.481 what that is. 00:46:06.939 --> 00:46:12.237 Whoa! OK, let's start with, yeah, you, then you, then you, then you. 00:46:12.237 --> 00:46:13.921 (man5) Well, after the discussion, 00:46:13.922 --> 00:46:18.262 just to express my agreement with what James was saying. 00:46:18.263 --> 00:46:22.467 So essentially, it seems the most dangerous thing is the hierarchy, 00:46:22.468 --> 00:46:23.910 not the hierarchy, but generally 00:46:23.911 --> 00:46:28.022 the semantics of the subclass relations seen in Wikidata, right. 
00:46:28.022 --> 00:46:32.561 So I've been studying languages recently, just for the purposes of this conference, 00:46:32.562 --> 00:46:35.257 and for example, you find plenty of cases 00:46:35.257 --> 00:46:39.463 where a language is a part of and subclass of the same thing, OK. 00:46:39.463 --> 00:46:43.577 So you know, you can say we have a flexible ontology. 00:46:43.577 --> 00:46:46.256 Wikidata gives you freedom to express that, sometimes. 00:46:46.256 --> 00:46:47.257 Because, for example, 00:46:47.258 --> 00:46:50.721 that ontology of languages is also politically complicated, right? 00:46:50.722 --> 00:46:55.038 It is even good to be in a position to express a level of uncertainty. 00:46:55.038 --> 00:46:57.983 But imagine anyone who wants to do machine reading from that. 00:46:57.984 --> 00:46:59.468 So that's really problematic. 00:46:59.468 --> 00:47:00.468 And then again, 00:47:00.469 --> 00:47:03.686 I don't think that ontology was ever imported from somewhere, 00:47:03.687 --> 00:47:05.490 that's something which is originally ours. 00:47:05.491 --> 00:47:08.321 It's harvested from Wikipedia in the very beginning I will say. 00:47:08.322 --> 00:47:11.324 So I wonder... this Shape Expressions thing is great, 00:47:11.325 --> 00:47:15.575 and also validating and fixing, if you like, the Wikidata ontology 00:47:15.576 --> 00:47:18.191 by external resources, beautiful idea. 00:47:19.026 --> 00:47:20.026 In the end, 00:47:20.027 --> 00:47:25.440 will we end by reflecting the external ontologies in Wikidata? 00:47:25.441 --> 00:47:28.651 And also, what we do with the core part of our ontology 00:47:28.652 --> 00:47:30.642 which is never harvested from external resources, 00:47:30.643 --> 00:47:31.978 how do we go and fix that? 00:47:31.979 --> 00:47:35.276 And I really think that that will be a problem on its own. 
00:47:35.277 --> 00:47:39.010 We will have to focus on that independently of the idea 00:47:39.010 --> 00:47:41.046 of validating ontology with something external. 00:47:49.353 --> 00:47:53.379 (man6) OK, and constraints and shapes are very impressive 00:47:53.380 --> 00:47:54.495 what we can do with it, 00:47:55.205 --> 00:47:58.481 but the main point is not being really made clear-- 00:47:58.482 --> 00:48:03.229 it's because now we can make more explicit what we expect from the data. 00:48:03.229 --> 00:48:06.893 Before, each one has to write its own tools and scripts 00:48:06.894 --> 00:48:10.601 and so it's more visible and we can discuss about it. 00:48:10.602 --> 00:48:13.641 But because it's not about what's wrong or right, 00:48:13.642 --> 00:48:15.870 it's about an expectation, 00:48:15.870 --> 00:48:18.105 and you will have different expectations and discussions 00:48:18.106 --> 00:48:20.737 about how we want to model things in Wikidata, 00:48:21.246 --> 00:48:23.095 and this... 00:48:23.096 --> 00:48:26.280 The current state is just one step in the direction 00:48:26.281 --> 00:48:28.041 because now you need 00:48:28.042 --> 00:48:31.041 very much technical expertise to get into this, 00:48:31.042 --> 00:48:35.721 and we need better ways to visualize this constraint, 00:48:35.722 --> 00:48:39.995 to transform it maybe in natural language so people can better understand, 00:48:40.939 --> 00:48:43.768 but it's less about what's wrong or right. 00:48:44.925 --> 00:48:45.925 (Lydia) Yeah. 00:48:50.986 --> 00:48:53.893 (man7) So for quality issues, I just want to echo it like... 00:48:53.894 --> 00:48:57.010 I've definitely found a lot of the issues I've encountered have been 00:48:58.838 --> 00:49:02.330 differences in opinion between instance of versus subclass. 00:49:02.331 --> 00:49:05.963 I would say errors in those situations 00:49:05.963 --> 00:49:11.521 and trying to find those has been a very time-consuming process.
00:49:11.522 --> 00:49:14.840 What I've found is like: "Oh, if I find very high-impression items 00:49:14.840 --> 00:49:16.051 that are something... 00:49:16.052 --> 00:49:21.628 and then use all the subclass instances to find all derived statements of this," 00:49:21.628 --> 00:49:26.215 this is a very useful way of looking for these errors. 00:49:26.215 --> 00:49:28.067 But I was curious if Shape Expressions, 00:49:29.841 --> 00:49:31.582 if there is... 00:49:31.583 --> 00:49:36.934 If this can be used as a tool to help resolve those issues but, yeah... 00:49:40.514 --> 00:49:42.555 (man8) If it has a structural footprint... 00:49:45.910 --> 00:49:49.310 If it has a structural footprint that you can...that's sort of falsifiable, 00:49:49.310 --> 00:49:51.191 you can look at that and say well, that's wrong, 00:49:51.192 --> 00:49:52.670 then yeah, you can do that. 00:49:52.671 --> 00:49:56.921 But if it's just sort of trying to map it to real-world objects, 00:49:56.922 --> 00:49:59.082 then you're just going to need lots and lots of brains. 00:50:05.768 --> 00:50:08.631 (man9) Hi, Pablo Mendes from Apple Siri Knowledge. 00:50:09.154 --> 00:50:12.770 We're here to find out how to help the project and the community 00:50:12.770 --> 00:50:15.645 but Cristina made the mistake of asking what we want. 00:50:16.471 --> 00:50:20.052 (laughing) So I think one thing I'd like to see 00:50:20.958 --> 00:50:23.521 is a lot around verifiability 00:50:23.522 --> 00:50:26.372 which is one of the core tenets of the project in the community, 00:50:27.062 --> 00:50:28.590 and trustworthiness. 00:50:28.590 --> 00:50:32.412 Not every statement is the same, some of them are heavily disputed, 00:50:32.413 --> 00:50:33.653 some of them are easy to guess, 00:50:33.654 --> 00:50:35.541 like somebody's date of birth can be verified, 00:50:36.071 --> 00:50:39.082 as you saw today in the Keynote, gender issues are a lot more complicated. 
00:50:40.205 --> 00:50:42.130 Can you discuss a little bit what you know 00:50:42.131 --> 00:50:47.271 in this area of data quality around trustworthiness and verifiability? 00:50:55.442 --> 00:50:58.138 If there isn't a lot, I'd love to see a lot more. (laughs) 00:51:00.646 --> 00:51:01.646 (Lydia) Yeah. 00:51:03.314 --> 00:51:06.548 Apparently, we don't have a lot to say on that. (laughs) 00:51:08.024 --> 00:51:12.299 (Andra) I think we can do a lot, but I had a discussion with you yesterday. 00:51:12.300 --> 00:51:15.774 My favorite example I learned yesterday that's already deprecated 00:51:15.774 --> 00:51:20.281 is if you go to the Q2, which is earth, 00:51:20.282 --> 00:51:23.343 there is a statement that claims that the earth is flat. 00:51:24.183 --> 00:51:26.055 And I love that example 00:51:26.056 --> 00:51:28.391 because there is a community out there that claims that 00:51:28.392 --> 00:51:30.417 and they have verifiable resources. 00:51:30.418 --> 00:51:32.254 So I think it's a genuine case, 00:51:32.255 --> 00:51:34.641 it shouldn't be deprecated, it should be in Wikidata. 00:51:34.642 --> 00:51:40.385 And I think Shape Expressions can be really instrumental there, 00:51:40.386 --> 00:51:41.832 because what you can say, 00:51:41.833 --> 00:51:44.856 OK, I'm really interested in this use case, 00:51:44.857 --> 00:51:47.129 or this is a use case where you disagree, 00:51:47.130 --> 00:51:51.059 but there can also be a use case where you say OK, I'm interested. 00:51:51.059 --> 00:51:53.449 So there is this example you say, I have glucose. 00:51:53.449 --> 00:51:55.841 And glucose when you're a biologist, 00:51:55.842 --> 00:52:00.176 you don't care for the chemical constraints of the glucose molecule, 00:52:00.177 --> 00:52:03.201 you just... everything glucose is the same. 00:52:03.202 --> 00:52:05.973 But if you're a chemist, you cringe when you hear that, 00:52:05.973 --> 00:52:08.191 you have 200 something...
00:52:08.191 --> 00:52:10.443 So then you can have multiple Shape Expressions, 00:52:10.443 --> 00:52:12.721 OK, I'm coming in with... I'm at a chemist view, 00:52:12.722 --> 00:52:13.887 I'm applying that. 00:52:13.887 --> 00:52:16.691 And then you say I'm from a biological use case, 00:52:16.691 --> 00:52:18.524 I'm applying that Shape Expression. 00:52:18.524 --> 00:52:20.358 And then when you want to collaborate, 00:52:20.358 --> 00:52:22.784 yes, well you should talk to Eric about ShEx maps. 00:52:23.910 --> 00:52:28.873 And so... but this journey is just starting. 00:52:28.873 --> 00:52:32.238 But I personally I believe that it's quite instrumental in that area. 00:52:34.292 --> 00:52:35.535 (Lydia) OK. Over there. 00:52:37.949 --> 00:52:39.168 (laughs) 00:52:40.597 --> 00:52:46.035 (woman2) I had several ideas from some points in the discussions, 00:52:46.035 --> 00:52:50.902 so I will try not to lose... I had three ideas so... 00:52:52.394 --> 00:52:55.201 Based on what James said a while ago, 00:52:55.202 --> 00:52:59.001 we have a very, very big problem on Wikidata since the beginning 00:52:59.002 --> 00:53:01.574 for the upper ontology. 00:53:02.363 --> 00:53:05.339 We talked about that two years ago at WikidataCon, 00:53:05.340 --> 00:53:07.432 and we talked about that at Wikimania. 00:53:07.432 --> 00:53:09.818 Well, always we have a Wikidata meeting 00:53:09.818 --> 00:53:11.656 we are talking about that, 00:53:11.656 --> 00:53:15.782 because it's a very big problem at a very, very high level 00:53:15.783 --> 00:53:23.118 what entity is, with what work is, what genre is, art, 00:53:23.118 --> 00:53:25.461 are really the biggest concept.
00:53:26.195 --> 00:53:33.117 And that's actually a very weak point on global ontology 00:53:33.118 --> 00:53:37.453 because people try to clean up regularly 00:53:38.017 --> 00:53:41.047 and broke everything down the line, 00:53:42.516 --> 00:53:48.649 because yes, I think some of you may remember the guy who in good faith 00:53:48.649 --> 00:53:51.785 broke absolutely all cities in the world. 00:53:51.785 --> 00:53:57.537 They were not geographical items anymore, so violation constraints everywhere. 00:53:58.720 --> 00:54:00.278 And it was in good faith 00:54:00.278 --> 00:54:03.623 because he was really correcting a mistake in an item, 00:54:04.170 --> 00:54:05.732 but everything broke down. 00:54:06.349 --> 00:54:09.373 And I'm not sure how we can solve that 00:54:10.216 --> 00:54:15.709 because there is actually no external institution we could just copy 00:54:15.710 --> 00:54:18.490 because everyone is working on... 00:54:19.154 --> 00:54:22.041 Well, if I am performing art database, 00:54:22.042 --> 00:54:24.601 I will just go at the performing art level, 00:54:24.601 --> 00:54:29.361 or I won't go to the philosophical concept of what an entity is, 00:54:29.362 --> 00:54:31.201 and that's actually... 00:54:31.202 --> 00:54:34.561 I don't know any database which is working at this level, 00:54:34.562 --> 00:54:36.827 but that's the weakest point of Wikidata. 00:54:37.936 --> 00:54:40.812 And probably, when we are talking about data quality, 00:54:40.812 --> 00:54:44.034 that's actually a big part of it, so... 00:54:44.034 --> 00:54:48.569 And I think it's the same we have stated in... 00:54:48.569 --> 00:54:50.452 Oh, I am sorry, I am changing the subject, 00:54:51.401 --> 00:54:55.774 but we have stated in different sessions about qualities, 00:54:55.774 --> 00:54:59.398 which is actually some of us are doing good modeling job, 00:54:59.399 --> 00:55:01.240 are doing ShEx, are doing things like that.
00:55:01.967 --> 00:55:07.655 People don't see it on Wikidata, they don't see the ShEx, 00:55:07.655 --> 00:55:10.392 they don't see the WikiProject on the discussion page, 00:55:10.393 --> 00:55:11.393 and sometimes, 00:55:11.394 --> 00:55:14.958 they don't even see the talk pages of properties, 00:55:14.958 --> 00:55:19.628 which is explicitly stating, a), this property is used for that. 00:55:19.628 --> 00:55:23.887 Like last week, I added constraints to a property. 00:55:23.888 --> 00:55:26.324 The constraint was explicitly written 00:55:26.325 --> 00:55:28.690 in the discussion of the creation of the property. 00:55:28.690 --> 00:55:34.548 I just created the technical part of adding the constraint, and someone: 00:55:34.548 --> 00:55:37.182 "What! You broke down all my edits!" 00:55:37.183 --> 00:55:41.542 And he was using the property wrongly for the last two years. 00:55:41.542 --> 00:55:46.868 And the property was actually very clear, but there were no warnings and everything, 00:55:46.869 --> 00:55:49.922 and so, it's the same at the Pink Pony we said at Wikimania 00:55:49.922 --> 00:55:54.719 to make WikiProject more visible or to make ShEx more visible, but... 00:55:54.719 --> 00:55:56.917 And that's what Cristina said. 00:55:56.917 --> 00:56:02.368 We have a visibility problem of what the existing solutions are. 00:56:02.368 --> 00:56:04.242 And at this session, 00:56:04.242 --> 00:56:06.862 we are all talking about how to create more ShEx, 00:56:06.863 --> 00:56:10.727 or to facilitate the jobs of the people who are doing the cleanup. 
00:56:11.605 --> 00:56:15.835 But we are cleaning up since the first day of Wikidata, 00:56:15.836 --> 00:56:20.921 and globally, we are losing, and we are losing because, well, 00:56:20.922 --> 00:56:22.960 if I know names are complicated 00:56:22.961 --> 00:56:26.162 but I am the only one doing the cleaning up job, 00:56:26.662 --> 00:56:29.671 the guy who added Latin script name 00:56:29.672 --> 00:56:31.584 to all Chinese researcher, 00:56:32.088 --> 00:56:35.616 I will take months to clean that and I can't do it alone, 00:56:35.616 --> 00:56:38.777 and he did one massive batch. 00:56:38.777 --> 00:56:40.241 So we really need... 00:56:40.242 --> 00:56:44.158 we have a visibility problem more than a tool problem, I think, 00:56:44.158 --> 00:56:45.733 because we have many tools. 00:56:45.733 --> 00:56:50.255 (Lydia) Right, so unfortunately, I've got shown a sign, (laughs), 00:56:50.256 --> 00:56:52.121 so we need to wrap this up. 00:56:52.122 --> 00:56:53.563 Thank you so much for your comments, 00:56:53.563 --> 00:56:56.611 I hope you will continue discussing during the rest of the day, 00:56:56.611 --> 00:56:57.840 and thanks for your input. 00:56:58.359 --> 00:56:59.944 (applause)