WEBVTT 00:00:07.133 --> 00:00:11.738 I work as a teacher at the University of Alicante, 00:00:11.738 --> 00:00:17.040 where I recently obtained my PhD on data libraries and linked open data. 00:00:17.040 --> 00:00:19.038 And I'm also a software developer 00:00:19.038 --> 00:00:21.718 at the Biblioteca Virtual Miguel de Cervantes. 00:00:21.718 --> 00:00:24.467 And today, I'm going to talk about data quality. 00:00:28.252 --> 00:00:31.527 Well, those are my colleagues at the university. 00:00:32.457 --> 00:00:36.727 And as you may know, many organizations are publishing their data 00:00:36.727 --> 00:00:38.447 as linked open data-- 00:00:38.447 --> 00:00:41.437 for example, the National Library of France, 00:00:41.437 --> 00:00:45.947 the National Library of Spain, us, which is Cervantes Virtual, 00:00:45.947 --> 00:00:49.007 the British National Bibliography, 00:00:49.007 --> 00:00:51.667 the Library of Congress and Europeana. 00:00:51.667 --> 00:00:56.000 All of them provide a SPARQL endpoint, 00:00:56.000 --> 00:00:58.875 which is useful in order to retrieve the data. 00:00:59.104 --> 00:01:00.984 And if I'm not wrong, 00:01:00.984 --> 00:01:05.890 the Library of Congress only provides the data as a dump that you can't use. 00:01:07.956 --> 00:01:13.787 When we published our repository as linked open data, 00:01:13.787 --> 00:01:17.475 my idea was for it to be reused by other institutions. 00:01:17.981 --> 00:01:24.000 But what if I'm an institution that wants to enrich its data 00:01:24.000 --> 00:01:27.435 with data from other data libraries? 00:01:27.574 --> 00:01:30.674 Which data set should I use? 00:01:30.674 --> 00:01:34.314 Which data set is better in terms of quality? 00:01:36.874 --> 00:01:41.314 The benefits of the evaluation of data quality in libraries are many. 00:01:41.314 --> 00:01:47.143 For example, methodologies can be improved in order to include new criteria 00:01:47.182 --> 00:01:49.162 to assess the quality. 00:01:49.162 --> 00:01:54.592 And also, organizations can benefit from best practices and guidelines 00:01:54.602 --> 00:01:58.270 in order to publish their data as linked open data. 00:02:00.012 --> 00:02:03.462 What do we need in order to assess the quality? 00:02:03.462 --> 00:02:06.862 Well, obviously, a set of candidates and a set of features. 00:02:06.862 --> 00:02:10.077 For example, do they have a SPARQL endpoint, 00:02:10.077 --> 00:02:13.132 do they have a web interface, how many publications do they have, 00:02:13.132 --> 00:02:18.092 how many vocabularies do they use, how many Wikidata properties do they have? 00:02:18.092 --> 00:02:20.892 And where can I get those candidates? 00:02:20.892 --> 00:02:22.472 I used the LOD Cloud-- 00:02:22.472 --> 00:02:27.422 but when I was doing this slide, I thought about using Wikidata 00:02:27.562 --> 00:02:29.746 in order to retrieve those candidates. 00:02:29.746 --> 00:02:34.295 For example, getting entities of type data library 00:02:34.295 --> 00:02:36.473 which have a SPARQL endpoint. 00:02:36.473 --> 00:02:38.693 You have here the link. 00:02:41.453 --> 00:02:45.083 And I came up with these data libraries. 00:02:45.104 --> 00:02:50.233 The first one uses the bibliographic ontology as its main vocabulary, 00:02:50.233 --> 00:02:54.122 and the others are based, more or less, on FRBR, 00:02:54.122 --> 00:02:57.180 which is a vocabulary published by IFLA. 00:02:57.180 --> 00:03:00.013 And this is just an example of how we could compare 00:03:00.013 --> 00:03:04.393 data libraries using bubble charts on Wikidata.
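[A hedged, minimal sketch of the kind of Wikidata query described above for retrieving candidate data libraries that offer a SPARQL endpoint. The class and property IDs are assumptions for illustration only: Q212805 is used here for "digital library" and P5305 for "SPARQL endpoint URL"; both should be verified before use.]

SELECT ?library ?libraryLabel ?endpoint WHERE {
  ?library wdt:P31 wd:Q212805 ;   # instance of digital library (assumed QID, verify)
           wdt:P5305 ?endpoint .  # SPARQL endpoint URL (assumed property ID, verify)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}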
00:03:04.393 --> 00:03:08.613 And this is just an example comparing how many Wikidata properties 00:03:08.613 --> 00:03:10.633 there are per data library. 00:03:13.483 --> 00:03:15.980 Well, how can we measure quality? 00:03:15.928 --> 00:03:17.972 There are different methodologies, 00:03:17.972 --> 00:03:19.726 for example, FRBR 1, 00:03:19.726 --> 00:03:24.337 which provides a set of criteria grouped by dimensions, 00:03:24.337 --> 00:03:27.556 and those in green are the ones that I found-- 00:03:27.556 --> 00:03:30.917 that I could assess by means of Wikidata. 00:03:33.870 --> 00:03:39.397 And we also found that we could define new criteria, 00:03:39.397 --> 00:03:44.567 for example, a new one to evaluate the number of duplications in Wikidata. 00:03:45.047 --> 00:03:47.206 We used those properties. 00:03:47.206 --> 00:03:50.098 And this is an example of SPARQL, 00:03:50.098 --> 00:03:54.486 in order to count the number of duplicates per property. 00:03:57.136 --> 00:04:00.366 And about the results--well, at the moment of doing this study, 00:04:00.366 --> 00:04:05.216 not the slides, there was no property for the British National Bibliography. 00:04:05.860 --> 00:04:08.260 They don't provide provenance information, 00:04:08.260 --> 00:04:11.536 which could be useful for metadata enrichment. 00:04:11.536 --> 00:04:14.660 And they don't allow editing the information. 00:04:14.660 --> 00:04:17.166 So, we've been talking about Wikibase the whole weekend, 00:04:17.166 --> 00:04:21.396 and maybe we should try to adopt Wikibase as an interface. 00:04:23.186 --> 00:04:25.436 And they are focused on their own content, 00:04:25.436 --> 00:04:28.856 and this is just the SPARQL query based on Wikidata 00:04:28.856 --> 00:04:31.411 in order to assess the population. 00:04:32.066 --> 00:04:36.006 And the BnF provides labels in multiple languages, 00:04:36.006 --> 00:04:38.956 and they all use self-describing URIs, 00:04:38.956 --> 00:04:43.058 which means that in the URI, they have the type of entity, 00:04:43.058 --> 00:04:48.406 which allows the human reader to understand what they are using. 00:04:51.499 --> 00:04:55.256 And more results: they provide different output formats, 00:04:55.256 --> 00:04:58.646 they use external vocabularies. 00:04:58.854 --> 00:05:01.116 Only the British National Bibliography 00:05:01.116 --> 00:05:03.734 provides machine-readable licensing information. 00:05:03.734 --> 00:05:09.124 And up to one-third of the instances are connected to external repositories, 00:05:09.124 --> 00:05:11.225 which is really nice. 00:05:12.604 --> 00:05:18.290 And well, this study, this work, has been done in our Labs team. 00:05:18.364 --> 00:05:22.391 A lab in a GLAM is a group of people 00:05:22.391 --> 00:05:27.520 who want to explore new ways 00:05:27.587 --> 00:05:30.306 of reusing data collections. 00:05:31.039 --> 00:05:35.054 And there's a community led by the British Library, 00:05:35.054 --> 00:05:37.366 and in particular, Mahendra Mahey, 00:05:37.366 --> 00:05:40.610 and we had a first event in London, 00:05:40.610 --> 00:05:42.601 and another one in Copenhagen, 00:05:42.601 --> 00:05:45.279 and we're going to have a new one in May 00:05:45.279 --> 00:05:48.240 at the Library of Congress in Washington. 00:05:48.528 --> 00:05:52.481 And we are now 250 people. 00:05:52.481 --> 00:05:56.421 And I'm so glad that I found somebody here at the WikidataCon 00:05:56.421 --> 00:05:58.860 who has just joined us-- 00:05:58.860 --> 00:06:01.160 Sylvia from [inaudible], Mexico.
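[A hedged sketch of the kind of SPARQL mentioned earlier for counting duplicates: it looks for values of an external identifier that are attached to more than one Wikidata item. P950 (Biblioteca Nacional de España ID) is only an illustrative choice; substitute the identifier property of the data library being assessed.]

SELECT ?value (COUNT(DISTINCT ?item) AS ?items) WHERE {
  ?item wdt:P950 ?value .   # P950 = Biblioteca Nacional de España ID (illustrative)
}
GROUP BY ?value
HAVING (COUNT(DISTINCT ?item) > 1)
ORDER BY DESC(?items)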
00:06:01.160 --> 00:06:04.509 And I'd like to invite you to our community, 00:06:04.509 --> 00:06:09.719 since you may be part of a GLAM institution. 00:06:10.659 --> 00:06:13.164 So, we can talk later if you want to know about this. 00:06:14.589 --> 00:06:16.719 And this--it's all about people. 00:06:16.719 --> 00:06:19.669 This is me, people from the British Library, 00:06:19.669 --> 00:06:24.629 Library of Congress, Universities, and National Libraries in Europe 00:06:24.871 --> 00:06:28.050 And there's a link here in case you want to know more. 00:06:28.433 --> 00:06:32.655 And, well, last month, we decided to meet in Doha 00:06:32.655 --> 00:06:37.448 in order to write a book about how to create a lab in our GLAM. 00:06:38.585 --> 00:06:43.279 And they choose 15 people, and I was so lucky to be there. 00:06:45.314 --> 00:06:48.594 And the book follows the Booksprint methodology, 00:06:48.594 --> 00:06:51.674 which means that nothing is prepared beforehand. 00:06:51.674 --> 00:06:53.495 All is done there in a week. 00:06:53.495 --> 00:06:55.725 And believe me, it was really hard work 00:06:55.725 --> 00:06:58.905 to have their whole book done in this week. 00:06:59.890 --> 00:07:04.490 And I'd like to introduce you to the book, which will be published-- 00:07:04.490 --> 00:07:06.455 it was supposed to be published this week, 00:07:06.455 --> 00:07:08.274 but it will be next week. 00:07:08.974 --> 00:07:13.014 And it will be published open, so you can have it, 00:07:13.065 --> 00:07:15.668 and I can show you a little bit later if you want. 00:07:15.734 --> 00:07:17.601 And those are the authors. 00:07:17.601 --> 00:07:19.678 I'm here-- I'm so happy, too. 00:07:19.678 --> 00:07:22.110 And those are the institutions-- 00:07:22.110 --> 00:07:26.722 Library of Congress, British Library-- and this is the title. 00:07:27.330 --> 00:07:29.604 And now, I'd like to show you-- 00:07:31.441 --> 00:07:33.971 a map that I'm doing. 00:07:34.278 --> 00:07:37.234 We are launching a website for our community, 00:07:37.234 --> 00:07:42.893 and I'm in charge of creating a map with our institutions there. 00:07:43.097 --> 00:07:44.860 This is not finished. 00:07:44.860 --> 00:07:50.276 But this is just SPARQL, and below, 00:07:51.546 --> 00:07:53.027 we see the map. 00:07:53.027 --> 00:07:58.086 And we see here the new people that I found, here, 00:07:58.086 --> 00:08:00.486 at the WikidataCon-- I'm so happy for this. 00:08:00.621 --> 00:08:05.631 And we have here my data library of my university, 00:08:05.681 --> 00:08:08.490 and many other institutions. 00:08:09.051 --> 00:08:10.940 Also, from Australia-- 00:08:11.850 --> 00:08:13.061 if I can do it. 00:08:13.930 --> 00:08:15.711 Well, here, we have some links. 00:08:19.586 --> 00:08:21.088 There you go. 00:08:21.189 --> 00:08:23.059 Okay, this is not finished. 00:08:23.539 --> 00:08:26.049 We are still working on this, and that's all. 00:08:26.057 --> 00:08:28.170 Thank you very much for your attention. 00:08:28.858 --> 00:08:33.683 (applause) 00:08:41.962 --> 00:08:48.079 [inaudible] 00:08:59.490 --> 00:09:00.870 Good morning, everybody. 00:09:00.870 --> 00:09:01.930 I'm Olaf Janssen. 00:09:01.930 --> 00:09:03.570 I'm the Wikimedia coordinator 00:09:03.570 --> 00:09:06.150 at the National Library of the Netherlands. 00:09:06.310 --> 00:09:08.390 And I would like to share my work, 00:09:08.390 --> 00:09:11.610 which I'm doing about creating Linked Open Data 00:09:11.640 --> 00:09:15.351 for Dutch Public Libraries using Wikidata. 
00:09:17.600 --> 00:09:20.850 And my story starts roughly a year ago 00:09:20.850 --> 00:09:24.581 when I was at the GLAM Wiki conference in Tel Aviv, in Israel. 00:09:25.301 --> 00:09:27.938 And there are two men with very similar shirts, 00:09:27.938 --> 00:09:31.120 and equally similar hairdos, [Matt]... 00:09:31.120 --> 00:09:33.440 (laughter) 00:09:33.440 --> 00:09:35.325 And on the left, that's me. 00:09:35.325 --> 00:09:39.065 And a year ago, I didn't have any practical knowledge and skills 00:09:39.065 --> 00:09:40.265 about Wikidata. 00:09:40.265 --> 00:09:43.285 I looked at Wikidata, and I looked at the items, 00:09:43.285 --> 00:09:44.524 and I played with it. 00:09:44.524 --> 00:09:47.070 But I wasn't able to make a SPARQL query 00:09:47.070 --> 00:09:50.285 or to do data modeling with the right shape expression. 00:09:51.305 --> 00:09:52.865 That's a year ago. 00:09:53.465 --> 00:09:57.065 And on the lefthand side, that's Simon Cobb, user: Sic19. 00:09:57.304 --> 00:10:00.265 And I was talking to him, because, just before, 00:10:00.525 --> 00:10:01.974 he had given a presentation 00:10:01.974 --> 00:10:06.374 about improving the coverage of public libraries in Wikidata. 00:10:06.757 --> 00:10:08.934 And I was very inspired by his talk. 00:10:09.564 --> 00:10:13.355 And basically, he was talking about adding basic data 00:10:13.355 --> 00:10:14.867 about public libraries. 00:10:14.867 --> 00:10:19.046 So, the name of the library, if available, the photo of the building, 00:10:19.046 --> 00:10:21.497 the address data of the library, 00:10:21.497 --> 00:10:25.120 the geo-coordinates latitude and longitude, 00:10:25.120 --> 00:10:26.367 and some other things, 00:10:26.367 --> 00:10:29.187 including with all source references. 00:10:31.317 --> 00:10:34.557 And what I was very impressed about a year ago was this map. 00:10:34.557 --> 00:10:37.337 This is a map about public libraries in the U.K. 00:10:37.337 --> 00:10:38.577 with all the colors. 00:10:38.577 --> 00:10:43.017 And you can see that all the libraries are layered by library organizations. 00:10:43.017 --> 00:10:46.210 And when he showed this, I was really, "Wow, that's cool." 00:10:46.637 --> 00:10:49.138 So, then, one minute later, I thought, 00:10:49.138 --> 00:10:52.918 "Well, let's do it for the country for that one." 00:10:52.918 --> 00:10:54.850 (laughter) 00:10:57.149 --> 00:10:59.496 And something about public libraries in the Netherlands-- 00:10:59.496 --> 00:11:03.020 there are about 1,300 library branches in our country, 00:11:03.020 --> 00:11:06.710 grouped into 160 library organizations. 00:11:07.723 --> 00:11:10.937 And you might wonder why do I want to do this project? 00:11:10.997 --> 00:11:14.137 Well, first of all, because for the common good, for society, 00:11:14.137 --> 00:11:16.707 because I think using Wikidata, 00:11:16.707 --> 00:11:20.657 and from there, creating Wikipedia articles, 00:11:20.657 --> 00:11:23.417 and opening it up via the linked open data cloud-- 00:11:23.417 --> 00:11:29.006 it's improving visibility and reusability of public libraries in the Netherlands. 00:11:30.110 --> 00:11:32.197 And my second goal was actually a more personal one, 00:11:32.197 --> 00:11:36.517 because a year ago, I had this yearly evaluation with my manager, 00:11:37.243 --> 00:11:41.737 and we decided it was a good idea that I got more practical skills 00:11:41.737 --> 00:11:45.853 on linked open data, data modeling, and also on Wikidata. 
00:11:46.464 --> 00:11:50.286 And of course, I wanted to be able to make these kinds of maps myself. 00:11:50.286 --> 00:11:51.396 (laughter) 00:11:54.345 --> 00:11:57.100 Then you might wonder why do I want to do this? 00:11:57.100 --> 00:12:01.723 Isn't there already enough basic library data out there in the Netherlands 00:12:02.450 --> 00:12:04.233 to have a good coverage? 00:12:06.019 --> 00:12:08.367 So, let me show you some of the websites 00:12:08.367 --> 00:12:12.882 that are available to discover address and location information 00:12:12.882 --> 00:12:14.505 about Dutch public libraries. 00:12:14.505 --> 00:12:17.722 And the first one is this one-- Gidsvoornederland.nl-- 00:12:17.722 --> 00:12:20.641 and that's the official public library inventory 00:12:20.641 --> 00:12:23.037 maintained by my library, the National Library. 00:12:23.727 --> 00:12:29.160 And you can look up addresses and geo-coordinates on that website. 00:12:30.493 --> 00:12:32.797 Then there is this site, Bibliotheekinzicht-- 00:12:32.797 --> 00:12:36.502 this is also an official website maintained by my National Library. 00:12:36.502 --> 00:12:38.982 And this is about public library statistics. 00:12:41.010 --> 00:12:43.933 Then there is another one, debibliotheken.nl-- 00:12:43.933 --> 00:12:46.005 as you can see there is also address information 00:12:46.005 --> 00:12:49.659 about library organizations, not about individual branches. 00:12:51.724 --> 00:12:55.010 And there's even this one, which also has address information. 00:12:56.546 --> 00:12:59.028 And of course, there's something like Google Maps, 00:12:59.028 --> 00:13:02.157 which also has all the names and the locations and the addresses. 00:13:03.455 --> 00:13:06.218 And this one, the International Library of Technology, 00:13:06.218 --> 00:13:09.580 which has a worldwide inventory of libraries, 00:13:09.646 --> 00:13:11.393 including the Netherlands. 00:13:13.058 --> 00:13:15.049 And I even discovered there is a data set 00:13:15.049 --> 00:13:18.423 you can buy for 50 euros or so to download it. 00:13:18.423 --> 00:13:21.023 And there is also--seems to be I didn't download it, 00:13:21.023 --> 00:13:23.633 but there seems to be address information available. 00:13:24.273 --> 00:13:30.180 You might wonder is this kind of data good enough for the purposes I had? 00:13:32.282 --> 00:13:37.372 So, this is my birthday list for my ideal public library data list. 00:13:37.439 --> 00:13:39.105 And what's on my list? 00:13:39.173 --> 00:13:43.830 First of all, the data I want to have must be up-to-date-ish-- 00:13:43.830 --> 00:13:45.604 it must be fairly up-to-date. 00:13:45.604 --> 00:13:48.513 So, doesn't have to be real time, 00:13:48.513 --> 00:13:51.323 but let's say, a couple of months, or half a year, 00:13:53.284 --> 00:13:57.354 delayed with official publication, that's okay for my purposes. 00:13:58.116 --> 00:14:00.956 And I want to have it both library branches 00:14:00.956 --> 00:14:02.697 and the library organizations. 00:14:04.206 --> 00:14:08.400 Then I want my data to be structured, because it has to be machine-readable. 00:14:08.301 --> 00:14:11.986 It has to be in open file format, such as CSV or JSON or RDF. 00:14:12.717 --> 00:14:15.197 It has to be linked to other resources preferably. 00:14:16.011 --> 00:14:22.182 And the uses--the license on the data needs to be manifest public domain or CC0. 
00:14:23.520 --> 00:14:26.192 Then, I would like my data to have an API, 00:14:26.599 --> 00:14:30.548 which must be public, free, and preferably also anonymous 00:14:30.548 --> 00:14:34.900 so you don't have to use an API key, or you have to register an account. 00:14:36.103 --> 00:14:38.863 And I also want to have a SPARQL interface. 00:14:41.131 --> 00:14:43.651 So, now, these are all the sites I just showed you. 00:14:43.717 --> 00:14:46.450 And I'm going to make a big grid. 00:14:47.337 --> 00:14:50.017 And then, this is about the evaluation I did. 00:14:51.187 --> 00:14:54.166 I'm not going into it, but there is no single column 00:14:54.166 --> 00:14:56.007 which has all green check marks. 00:14:56.007 --> 00:14:57.997 That's the important thing to take away. 00:14:58.967 --> 00:15:03.947 And so, in summary, there was no linked public free linked open data 00:15:03.947 --> 00:15:08.937 for Dutch public libraries available before I started my project. 00:15:09.237 --> 00:15:13.027 So, this was the ideal motivation to actually work on it. 00:15:14.730 --> 00:15:17.427 So, that's what I've been doing for a year now. 00:15:17.717 --> 00:15:22.977 And I've been adding libraries bit by bit, organization by organization to Wikidata. 00:15:23.417 --> 00:15:26.387 I created also a project website on it. 00:15:26.727 --> 00:15:29.567 It's still rather messy, but it has all the information, 00:15:29.567 --> 00:15:33.240 and I try to keep it as up-to-date as possible. 00:15:33.240 --> 00:15:36.277 And also all the SPARQL queries you can see are linked from here. 00:15:38.002 --> 00:15:40.235 And I'm just adding really basic information. 00:15:40.235 --> 00:15:44.097 You see the instances, images if available, 00:15:44.097 --> 00:15:47.229 addresses, locations, et cetera, municipalities. 00:15:48.534 --> 00:15:53.276 And where possible, I also try to link the libraries to external identifiers. 00:15:56.024 --> 00:15:58.415 And then, you can really easily-- we all know, 00:15:58.415 --> 00:16:03.050 generating some Listeria lists with public libraries grouped 00:16:03.050 --> 00:16:05.060 by organizations, for instance. 00:16:05.060 --> 00:16:08.380 Or using SPARQL queries, you can also do aggregation on data-- 00:16:08.380 --> 00:16:11.060 let's say, give me all the municipalities in the Netherlands 00:16:11.060 --> 00:16:15.115 and the number of library branches in all the municipalities. 00:16:17.025 --> 00:16:20.228 With one click, you can make these kinds of photo galleries. 00:16:22.092 --> 00:16:23.655 And what I set out to do first, 00:16:23.655 --> 00:16:26.036 you can really create these kinds of maps. 00:16:27.176 --> 00:16:30.425 And you might wonder, "Are there any libraries here or there?" 00:16:30.555 --> 00:16:33.355 There are--they are not yet in Wikidata. 00:16:33.355 --> 00:16:35.055 We're still working on that. 00:16:35.135 --> 00:16:37.644 And actually, last week, I spoke with a volunteer, 00:16:37.644 --> 00:16:40.864 who's helping now with entering the libraries. 00:16:41.644 --> 00:16:45.394 You can really make cool--in Wikidata, 00:16:45.394 --> 00:16:47.914 and also with using the Cartographer extension, 00:16:47.914 --> 00:16:50.244 you can use these kinds of maps. 00:16:51.724 --> 00:16:53.736 And I even took it one step further. 00:16:53.911 --> 00:16:57.399 I also have some Python skills, and some Leaflet things skills-- 00:16:57.399 --> 00:16:59.971 so, I created, and I'm quite proud of it, actually. 
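[A hedged sketch of the aggregation described above: counting public library branches per Dutch municipality. The QIDs are assumptions to verify: Q28564 for "public library" and Q2039348 for "municipality of the Netherlands".]

SELECT ?municipality ?municipalityLabel (COUNT(?library) AS ?branches) WHERE {
  ?library wdt:P31 wd:Q28564 ;        # instance of public library (assumed QID)
           wdt:P131 ?municipality .   # located in the administrative territorial entity
  ?municipality wdt:P31 wd:Q2039348 . # municipality of the Netherlands (assumed QID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
GROUP BY ?municipality ?municipalityLabel
ORDER BY DESC(?branches)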
00:16:59.971 --> 00:17:03.482 I created this library heat map, which is fully interactive. 00:17:03.482 --> 00:17:05.956 You can zoom in to it, and you can see all the libraries, 00:17:06.712 --> 00:17:08.726 and you can also run it off Wiki. 00:17:08.726 --> 00:17:10.552 So, you can just embed it in your own website, 00:17:10.552 --> 00:17:13.412 and it fully runs interactively. 00:17:15.131 --> 00:17:17.592 So, now going back to my big scary table. 00:17:19.512 --> 00:17:22.970 There is one column on the right, which is blank. 00:17:22.970 --> 00:17:24.940 And no surprise, it will be Wikidata. 00:17:24.940 --> 00:17:26.448 Let's see how it scores there. 00:17:26.448 --> 00:17:29.500 (cheering) 00:17:32.892 --> 00:17:35.191 So, I actually think of printing this on a T-shirt. 00:17:35.301 --> 00:17:37.288 (laughter) 00:17:37.788 --> 00:17:39.700 So, just to summarize this in words, 00:17:39.700 --> 00:17:41.129 thanks to my project, now, 00:17:41.129 --> 00:17:45.879 there is public free linked open data available for Dutch public libraries. 00:17:47.124 --> 00:17:49.686 And who can benefit from my effort? 00:17:50.333 --> 00:17:52.002 Well, all kinds of parties-- 00:17:52.002 --> 00:17:54.274 you see Wikipedia, because you can generate lists 00:17:54.274 --> 00:17:56.051 and overviews and articles, 00:17:56.051 --> 00:17:59.908 for instance, using this and be able to from Wikidata 00:17:59.908 --> 00:18:01.976 for our National Library for-- 00:18:02.850 --> 00:18:05.391 IFLA also has an inventory of worldwide libraries, 00:18:05.391 --> 00:18:07.216 they can also reuse the data. 00:18:07.650 --> 00:18:09.497 And especially for Sandra, 00:18:09.549 --> 00:18:13.237 it's also important for the Ministry-- Dutch Ministry of Culture-- 00:18:13.277 --> 00:18:15.667 because Sandra is going to have a talk about Wikidata 00:18:15.667 --> 00:18:18.287 with the Ministry this Monday, next Monday. 00:18:19.922 --> 00:18:22.277 And also, on the righthand side, for instance, 00:18:23.891 --> 00:18:27.098 Amazon with Alexa, the assistant, 00:18:27.098 --> 00:18:28.961 they're also using Wikidata, 00:18:28.961 --> 00:18:30.995 so you can imagine that they also use, 00:18:30.995 --> 00:18:33.357 if you're looking for public library information, 00:18:33.357 --> 00:18:36.580 they can also use Wikidata for that. 00:18:38.955 --> 00:18:41.680 Because one year ago, Simon Cobb inspired me 00:18:41.680 --> 00:18:44.244 to do this project, I would like to call upon you, 00:18:44.244 --> 00:18:45.664 if you have time available, 00:18:45.664 --> 00:18:49.532 and if you have data from your own country about public libraries, 00:18:51.572 --> 00:18:54.422 make the coverage better, add more red dots, 00:18:54.982 --> 00:18:56.982 and of course, I'm willing to help you with that. 00:18:56.982 --> 00:18:59.227 And Simon is also willing to help with this. 00:18:59.870 --> 00:19:01.471 And so, I hope next year, somebody else 00:19:01.471 --> 00:19:03.901 will be at this conference or another conference 00:19:03.901 --> 00:19:06.291 and there will be more red dots on the map. 00:19:07.551 --> 00:19:08.911 Thank you very much. 00:19:09.004 --> 00:19:12.740 (applause) 00:19:18.336 --> 00:19:20.086 Thank you, Olaf. 00:19:20.086 --> 00:19:23.554 Next we have Ursula Oberst and Heleen Smits 00:19:23.613 --> 00:19:27.734 presenting how can a small research library benefit from Wikidata: 00:19:27.734 --> 00:19:31.423 enhancing library products using Wikidata. 00:19:53.717 --> 00:19:57.637 Okay. Good morning. 
My name is Heleen Smits. 00:19:58.680 --> 00:20:01.753 And my colleague, Ursula Oberst--where are you? 00:20:01.753 --> 00:20:03.873 (laughter) 00:20:04.371 --> 00:20:09.220 And I work at the Library of the African Studies Center 00:20:09.220 --> 00:20:11.086 in Leiden, in the Netherlands. 00:20:11.086 --> 00:20:15.038 And the African Studies Center is a center devoted-- 00:20:15.038 --> 00:20:21.464 is an academic institution devoted entirely to the study of Africa, 00:20:21.464 --> 00:20:23.986 focusing on the Humanities and Social Sciences. 00:20:24.672 --> 00:20:28.123 We used to be an independent research organization, 00:20:28.123 --> 00:20:33.064 but in 2016, we became part of Leiden University, 00:20:33.064 --> 00:20:38.433 and our catalog was integrated into the larger university catalog. 00:20:39.283 --> 00:20:43.593 Though it remained possible to do a search in the part of the Leiden-- 00:20:43.593 --> 00:20:45.894 of the African Studies Catalog, alone, 00:20:47.960 --> 00:20:50.505 we remained independent in some respects. 00:20:50.586 --> 00:20:53.262 For example, with respect to our thesaurus. 00:20:54.921 --> 00:20:59.883 And also with respect to the products we make for our users, 00:21:01.180 --> 00:21:04.378 such as acquisition lists and web dossiers. 00:21:05.158 --> 00:21:11.975 And it is in the field of the web dossiers 00:21:11.975 --> 00:21:14.582 that we have been looking 00:21:14.582 --> 00:21:19.582 for possible ways to apply Wikidata, 00:21:19.582 --> 00:21:23.372 and that's the part where Ursula will, in the second part of this talk, 00:21:24.212 --> 00:21:27.184 show you a bit of what we've been doing there. 00:21:31.250 --> 00:21:35.160 The web dossiers are our collections 00:21:35.160 --> 00:21:39.000 of titles from our catalog that we compile 00:21:39.000 --> 00:21:45.591 around a theme usually connected to, for example, a conference, 00:21:45.591 --> 00:21:51.227 or to a special event, and actually, the most recent web dossier we made 00:21:51.227 --> 00:21:56.017 was connected to the year of indigenous languages, 00:21:56.017 --> 00:21:59.547 and that was around proverbs in African languages. 00:22:00.780 --> 00:22:02.327 Our first steps-- 00:22:04.307 --> 00:22:09.287 next slide--our first steps on the Wiki path as a library 00:22:10.267 --> 00:22:15.046 were in 2013, when we were one of 12 GLAM institutions 00:22:15.046 --> 00:22:16.472 in the Netherlands, 00:22:16.472 --> 00:22:20.952 part of the project of Wikipedians in Residence, 00:22:20.952 --> 00:22:26.443 and we had, for two months, a Wikipedian in the house, 00:22:27.035 --> 00:22:32.527 and he gave us training on adding articles to Wikipedia, 00:22:33.000 --> 00:22:37.720 and also, we made a start with uploading photo collections to Commons, 00:22:38.530 --> 00:22:42.650 which always remained a little bit dependent on funding, as well, 00:22:43.229 --> 00:22:45.702 whether we would be able to digitize them, 00:22:45.702 --> 00:22:50.350 and to mostly have a student assistant to do this. 00:22:51.220 --> 00:22:55.440 But it was actually a great addition to what we could offer 00:22:55.440 --> 00:22:57.560 as an academic library. 00:22:59.370 --> 00:23:04.742 In May 2018--so, that is my colleague Ursula-- 00:23:04.742 --> 00:23:09.465 she started to really explore-- dive into Wikidata 00:23:09.465 --> 00:23:14.515 and see what we, as a small and not very experienced library 00:23:14.515 --> 00:23:18.175 in these fields, could do with that.
00:23:25.050 --> 00:23:26.995 So, I mentioned, we have our own thesaurus. 00:23:28.210 --> 00:23:30.689 And this is where we started. 00:23:30.689 --> 00:23:34.502 This is a thesaurus of 13,000 terms, 00:23:34.502 --> 00:23:37.670 all in the field of African studies. 00:23:37.670 --> 00:23:41.457 It contains a lot of African languages, 00:23:43.417 --> 00:23:46.360 names of ethnic groups in Africa, 00:23:47.586 --> 00:23:49.431 and other proper names, 00:23:49.431 --> 00:23:55.509 which are perhaps especially interesting for Wikidata. 00:23:58.604 --> 00:24:04.824 So, it is a real authority-controlled 00:24:04.824 --> 00:24:08.370 vocabulary with 5,000 preferred terms. 00:24:08.554 --> 00:24:11.204 So, we submitted the request to Wikidata, 00:24:11.204 --> 00:24:17.135 and that was actually very quickly met with a positive response, 00:24:17.214 --> 00:24:19.354 which was very encouraging for us. 00:24:22.884 --> 00:24:25.574 Our thesaurus was loaded into Mix-n-Match, 00:24:25.574 --> 00:24:31.691 and by now, 75% of the terms 00:24:31.691 --> 00:24:36.145 have been manually matched with Wikidata. 00:24:38.061 --> 00:24:42.081 So, it means, well, that we are now-- 00:24:42.971 --> 00:24:47.687 we are added as an identifier-- 00:24:48.387 --> 00:24:51.553 for example, if you click on Swahili language, 00:24:52.463 --> 00:24:57.152 what happens then in Wikidata is that the number-- 00:24:59.004 --> 00:25:02.354 the one that connects our term to the Wikidata term-- 00:25:02.560 --> 00:25:05.620 we enter it into our thesaurus, 00:25:05.620 --> 00:25:10.000 and from there, you can do a search directly in the catalog 00:25:10.000 --> 00:25:12.560 by clicking the button again. 00:25:12.560 --> 00:25:18.160 It means, also, that Wikidata is not really integrated 00:25:18.160 --> 00:25:19.572 into our catalog. 00:25:19.572 --> 00:25:22.090 But that's also more difficult. 00:25:22.314 --> 00:25:26.053 Okay, we have to give the floor 00:25:26.053 --> 00:25:30.838 to Ursula for the next part. 00:25:30.838 --> 00:25:32.554 (Ursula) Thank you very much, Heleen. 00:25:32.554 --> 00:25:37.258 So, I will talk about our experiences 00:25:37.258 --> 00:25:39.677 with incorporating Wikidata elements 00:25:39.677 --> 00:25:41.356 into our web dossiers. 00:25:41.356 --> 00:25:44.607 A web dossier is--oh, sorry, yeah, sorry. 00:25:45.447 --> 00:25:49.646 A web dossier, or a classical web dossier, consists of three parts: 00:25:50.248 --> 00:25:53.320 an introduction to the subject, 00:25:53.320 --> 00:25:56.060 mostly written by one of our researchers; 00:25:56.060 --> 00:26:01.328 a selection of titles, both books and articles from our collection; 00:26:01.328 --> 00:26:06.146 and the third part, an annotated list 00:26:06.146 --> 00:26:08.876 with links to electronic resources. 00:26:09.161 --> 00:26:15.815 And this year, we added a fourth part to our web dossiers, 00:26:15.815 --> 00:26:18.276 which is the Wikidata elements. 00:26:19.008 --> 00:26:22.007 And it all started last year, 00:26:22.007 --> 00:26:25.206 and my story is similar to the story of Olaf, actually. 00:26:25.352 --> 00:26:29.570 Last year, I had no clue about Wikidata, 00:26:29.570 --> 00:26:33.402 and I discovered this wonderful article by Alex Stinson 00:26:33.402 --> 00:26:36.932 on how to write a query in Wikidata. 00:26:37.382 --> 00:26:41.592 And he chose a subject-- a very appealing subject to me. 00:26:41.592 --> 00:26:45.902 Namely, "Discovering Women Writers from North Africa."
00:26:46.402 --> 00:26:51.162 I can really recommend this article, 00:26:51.162 --> 00:26:52.981 because it's very instructive. 00:26:52.981 --> 00:26:57.422 And I thought, I'm going to work on this query, 00:26:57.422 --> 00:27:02.662 and try to change it to: "Southern African Women Writers," 00:27:02.662 --> 00:27:07.034 and try to add a link to their work in our catalog. 00:27:07.311 --> 00:27:10.861 And on the right-hand side, you see the SPARQL query 00:27:11.592 --> 00:27:15.181 which searches for "Southern African Women Writers." 00:27:15.181 --> 00:27:20.686 If you click on the button, on the blue button on the lefthand side, 00:27:21.526 --> 00:27:23.971 the search result will appear beneath. 00:27:23.971 --> 00:27:26.448 The search result can have different formats. 00:27:26.448 --> 00:27:29.871 In my case, the search result is a map. 00:27:29.871 --> 00:27:32.850 And the nice thing about Wikidata 00:27:32.850 --> 00:27:36.652 is that you can embed this search result 00:27:36.652 --> 00:27:38.682 into your own webpage, 00:27:38.682 --> 00:27:42.339 and that's what we are now doing with our web dossiers. 00:27:42.339 --> 00:27:47.039 So, this was the very first one on Southern African women writers, 00:27:47.039 --> 00:27:49.649 with the classical three elements, 00:27:49.649 --> 00:27:53.209 plus this map on the lefthand side, 00:27:53.209 --> 00:27:55.650 which gives extra information-- 00:27:55.650 --> 00:27:58.219 a link to the Southern African woman writer-- 00:27:58.219 --> 00:28:00.749 a link to her works in our catalog, 00:28:00.749 --> 00:28:07.252 and a link to the Wikidata record of her birth place, and her name, 00:28:08.219 --> 00:28:13.099 her personal record, plus a photo, if it's available on Wikidata. 00:28:16.231 --> 00:28:20.329 And to retrieve a nice map 00:28:20.329 --> 00:28:24.032 with a lot of red dots on the African continent, 00:28:24.032 --> 00:28:28.662 you need nice data in Wikidata--complete, sufficient data. 00:28:29.042 --> 00:28:33.442 So, with our second web dossier on public art in Africa, 00:28:33.442 --> 00:28:38.420 we also started to enhance the data in Wikidata. 00:28:38.420 --> 00:28:43.242 In this case, for public art-- we added geo-locations-- 00:28:43.242 --> 00:28:46.919 geo-locations to Wikidata. 00:28:46.919 --> 00:28:51.139 And we also searched for works of public art in Commons, 00:28:51.139 --> 00:28:55.165 and if they don't have a record on Wikidata yet, 00:28:55.165 --> 00:29:00.670 we added the record to Wikidata. 00:29:00.855 --> 00:29:05.327 And the third thing we do-- 00:29:05.327 --> 00:29:09.958 because when we prepare a web dossier, 00:29:09.958 --> 00:29:15.514 we download the titles from our catalog, 00:29:15.514 --> 00:29:17.584 and the titles are in MARC 21, 00:29:17.584 --> 00:29:23.226 so we have to convert them to a format that is presentable on the website, 00:29:23.226 --> 00:29:28.229 and it doesn't take much time and effort to convert the same set of titles 00:29:28.229 --> 00:29:30.457 to Wikidata QuickStatements, 00:29:30.457 --> 00:29:36.999 and then, we also upload the title set to Wikidata, 00:29:36.999 --> 00:29:41.254 and you can see the titles we uploaded 00:29:41.254 --> 00:29:44.124 from our latest web dossier 00:29:44.124 --> 00:29:47.514 on African proverbs in Scholia. 00:29:48.546 --> 00:29:52.294 Scholia is a really nice tool that visualizes publications 00:29:52.294 --> 00:29:54.674 present in Wikidata.
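[A hedged sketch of a query in the spirit of the one described above: women writers born in Southern Africa, plotted by birthplace on a map. South Africa (Q258) stands in here for the full list of Southern African countries, and the property choices are illustrative rather than the exact original query.]

#defaultView:Map
SELECT ?writer ?writerLabel ?birthplace ?birthplaceLabel ?coord ?image WHERE {
  VALUES ?country { wd:Q258 }     # South Africa; extend with the other Southern African countries
  ?writer wdt:P31 wd:Q5 ;         # human
          wdt:P21 wd:Q6581072 ;   # sex or gender: female
          wdt:P106 wd:Q36180 ;    # occupation: writer
          wdt:P19 ?birthplace .   # place of birth
  ?birthplace wdt:P17 ?country ;
              wdt:P625 ?coord .   # coordinate location, used by the map view
  OPTIONAL { ?writer wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}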
00:29:54.674 --> 00:29:59.674 And, one second--when it is possible, we add a Scholia template 00:29:59.674 --> 00:30:01.863 to our web dossier's topic. 00:30:01.863 --> 00:30:03.272 Thank you very much. 00:30:03.272 --> 00:30:08.079 (applause) 00:30:09.255 --> 00:30:11.724 Thank you, Heleen and Ursula. 00:30:12.010 --> 00:30:16.866 Next we have Adrian Pohl presenting using Wikidata 00:30:16.866 --> 00:30:22.265 to improve spatial subject indexing and regional bibliography. 00:30:45.181 --> 00:30:46.621 Okay, hello everybody. 00:30:46.621 --> 00:30:49.630 I'm going right into the topic. 00:30:49.630 --> 00:30:54.146 I only have ten minutes to present a three-year project. 00:30:54.535 --> 00:30:57.044 It wasn't full time. (laughs) 00:30:57.044 --> 00:31:00.100 Okay, what's the NWBib? 00:31:00.100 --> 00:31:04.404 It's an acronym for North-Rhine Westphalian Bibliography. 00:31:04.404 --> 00:31:07.944 It's a regional bibliography that records literature 00:31:07.944 --> 00:31:11.441 about people and places in North Rhine-Westphalia. 00:31:12.534 --> 00:31:14.103 And the monograph's in it-- 00:31:15.162 --> 00:31:19.451 there are a lot of articles in it, and most of them are quite unique, 00:31:19.451 --> 00:31:22.052 so, that's the interesting thing about this bibliography-- 00:31:22.052 --> 00:31:25.472 because it's often less quite obscure stuff-- 00:31:25.472 --> 00:31:28.188 local people writing about that tradition, 00:31:28.188 --> 00:31:29.488 and something like this. 00:31:29.612 --> 00:31:33.428 And there's over 400,000 entries in there. 00:31:33.428 --> 00:31:37.689 And the bibliography started in 1983, 00:31:37.689 --> 00:31:42.718 and so we only have titles from this publication year onwards. 00:31:44.744 --> 00:31:49.166 If you want to take a look at it, it's at nwbib.de, 00:31:49.166 --> 00:31:50.859 that's the web application. 00:31:50.859 --> 00:31:55.389 It's based on our service, lobid.org, the API. 00:31:57.148 --> 00:32:01.220 Because it's cataloged as part of the hbz union catalog, 00:32:01.220 --> 00:32:04.988 which comprises around 20 million records, 00:32:04.988 --> 00:32:08.869 it's an [inaudible] Aleph system we get the data out of there, 00:32:08.869 --> 00:32:11.308 and make RDF out of it, 00:32:11.308 --> 00:32:16.408 and provide it as via JSON or the HTTP API. 00:32:17.129 --> 00:32:20.507 So, the initial status in 2017 00:32:20.507 --> 00:32:25.307 was we had nearly 9,000 distinct strings 00:32:25.307 --> 00:32:28.727 about places--referring to places, in North Rhine-Westphalia. 00:32:28.727 --> 00:32:34.187 Mostly, those were administrative areas, like towns and districts, 00:32:34.187 --> 00:32:38.458 but also monasteries, principalities, or natural regions. 00:32:38.907 --> 00:32:43.517 And we already used Wikidata in 2017, 00:32:43.517 --> 00:32:48.496 and matched those strings with Wikidata API to Wikidata entries 00:32:48.496 --> 00:32:51.907 quite naively to get the geo-coordinates from there, 00:32:51.907 --> 00:32:57.210 and do some geo-based discovery stuff with it. 00:32:57.326 --> 00:32:59.910 But this had some drawbacks. 00:32:59.910 --> 00:33:02.577 And so, the matching was really poor, 00:33:02.577 --> 00:33:05.197 and there were a lot of false positives, 00:33:05.197 --> 00:33:09.184 and we still had no hierarchy in those places, 00:33:09.184 --> 00:33:13.201 and we still had a lot of non-unique names. 00:33:13.505 --> 00:33:15.356 So, this is an example here. 00:33:16.616 --> 00:33:18.378 Does this work? 
00:33:18.494 --> 00:33:22.314 Yeah, as you can see, for one place, Brauweiler, 00:33:22.314 --> 00:33:24.615 there are four different strings in there. 00:33:24.820 --> 00:33:27.893 So, we all know how this happens. 00:33:27.893 --> 00:33:31.994 If there's no authority file, you end up with this data. 00:33:31.994 --> 00:33:33.894 But we want to improve on that. 00:33:34.614 --> 00:33:38.211 And as you can also see why the matching didn't work-- 00:33:38.211 --> 00:33:40.382 so you have this name of the place, 00:33:40.382 --> 00:33:45.170 and there's often the name of the superior administrative area, 00:33:45.170 --> 00:33:50.532 and even, on the second level, another superior administrative area, 00:33:50.532 --> 00:33:52.040 often in the name, 00:33:52.040 --> 00:33:58.909 to identify the place successfully. 00:33:58.909 --> 00:34:04.679 So, the goal was to build a full-fledged spatial classification based on this data, 00:34:04.679 --> 00:34:07.109 with a hierarchical view of places, 00:34:09.079 --> 00:34:11.389 with one entry or ID for each place. 00:34:11.518 --> 00:34:17.488 And we got this mock-up from the NWBib editors in 2016, made in Excel, 00:34:18.048 --> 00:34:23.116 to get a feeling of what they would like to have. 00:34:25.006 --> 00:34:28.198 There you have the-- Regierungsbezirk-- 00:34:28.198 --> 00:34:31.016 that's the most superior administrative area-- 00:34:31.016 --> 00:34:34.918 we have in there some towns or districts--rural districts-- 00:34:34.918 --> 00:34:39.861 and then, it's going down to the parts of towns, 00:34:39.861 --> 00:34:42.011 even to this level. 00:34:43.225 --> 00:34:46.232 And we chose Wikidata for this task. 00:34:46.232 --> 00:34:50.087 We also looked at the GND, the Integrated Authority File, 00:34:50.087 --> 00:34:54.918 and GeoNames--but Wikidata had the best coverage, 00:34:54.918 --> 00:34:56.902 and the best infrastructure. 00:34:58.112 --> 00:35:02.072 The coverage for the places and the geo-coordinates we need, 00:35:02.072 --> 00:35:04.512 and the hierarchical information, for example. 00:35:04.512 --> 00:35:06.732 There were a lot of places, also, in the GND, 00:35:06.732 --> 00:35:09.694 but there was no hierarchical information in there. 00:35:11.170 --> 00:35:13.682 And also, Wikidata provides the infrastructure 00:35:13.682 --> 00:35:15.343 for editing and versioning. 00:35:15.343 --> 00:35:20.022 And there's also a community that helps maintain the data, 00:35:20.022 --> 00:35:22.052 which was quite good. 00:35:22.950 --> 00:35:26.882 Okay, but there was a requirement by the NWBib editors. 00:35:27.682 --> 00:35:31.447 They did not want to directly rely on Wikidata, 00:35:31.447 --> 00:35:32.972 which was understandable. 00:35:32.972 --> 00:35:34.982 We don't have those servers under our control, 00:35:34.982 --> 00:35:38.002 and we won't know what's going on there. 00:35:38.084 --> 00:35:41.944 There might be some unwelcome edits that destroy the classification, 00:35:41.944 --> 00:35:44.159 or parts of it, or vandalism. 00:35:44.159 --> 00:35:50.794 So, we decided to put an intermediate SKOS file in between, 00:35:50.794 --> 00:35:55.534 on which the application would rely-- which would be generated from Wikidata. 00:35:57.113 --> 00:35:59.462 And SKOS is the Simple Knowledge Organization System-- 00:35:59.462 --> 00:36:03.919 it's the standard way to model 00:36:03.919 --> 00:36:07.519 a classification in the linked data world. 00:36:07.603 --> 00:36:09.278 So, how did we do it? Five steps.
00:36:09.278 --> 00:36:14.037 I will come to each of the steps in more detail. 00:36:14.037 --> 00:36:18.460 We matched the strings to Wikidata with a better approach than before. 00:36:18.727 --> 00:36:23.131 Created the classification based on Wikidata, 00:36:23.131 --> 00:36:26.255 then added the links back from Wikidata to NWBib 00:36:26.255 --> 00:36:27.590 with a custom property. 00:36:27.590 --> 00:36:32.659 And now, we are in the process of establishing a good process 00:36:32.659 --> 00:36:36.559 for updating the classification in Wikidata. 00:36:36.619 --> 00:36:38.888 Seeing--having a diff of the changes, 00:36:38.888 --> 00:36:41.158 and then publishing it to the SKOS file. 00:36:42.813 --> 00:36:44.646 I will come to the details. 00:36:44.646 --> 00:36:46.261 So, the matching approach-- 00:36:46.261 --> 00:36:48.356 as the API wasn't sufficient, 00:36:48.356 --> 00:36:53.585 and because we have those different levels in the strings, 00:36:54.441 --> 00:36:59.036 we built a custom Elasticsearch index for our task. 00:36:59.596 --> 00:37:04.378 I think by now, you could probably, as well, use OpenRefine for doing this, 00:37:04.378 --> 00:37:09.306 but at that point in time, it wasn't available for Wikidata. 00:37:10.186 --> 00:37:14.336 And we built this index based on a SPARQL query, 00:37:14.336 --> 00:37:20.484 for entities in NRW, and with a specific type. 00:37:20.484 --> 00:37:25.069 And the query evolved over time a lot. 00:37:25.148 --> 00:37:29.157 And we have a few entries there--you can see the history on GitHub. 00:37:29.727 --> 00:37:32.088 So, what we put in the matching index, 00:37:32.088 --> 00:37:36.337 in the spatial object, is what we need in our data. 00:37:36.337 --> 00:37:39.662 It's the label and the ID or the link to Wikidata, 00:37:40.222 --> 00:37:43.874 the geo-coordinates, and the type from Wikidata [inaudible], as well. 00:37:44.194 --> 00:37:50.488 But also very important for the matching: the aliases and the broader entity-- 00:37:50.488 --> 00:37:54.138 and this is also an example where the name of the broader entity 00:37:54.138 --> 00:37:57.875 and the district itself are very similar. 00:37:57.937 --> 00:38:03.096 So, it's important to have some type information, as well, 00:38:03.096 --> 00:38:04.606 for the matching. 00:38:04.900 --> 00:38:07.900 So, the nationwide results were very good. 00:38:07.900 --> 00:38:11.110 We could automatically match more than 99% of records 00:38:11.110 --> 00:38:12.265 with this approach. 00:38:13.885 --> 00:38:16.356 These were only 92% of the strings. 00:38:16.540 --> 00:38:18.140 So, obviously, the results-- 00:38:18.140 --> 00:38:20.610 those strings that only occurred one or two times 00:38:20.610 --> 00:38:22.419 often didn't appear in Wikidata. 00:38:22.419 --> 00:38:26.309 And so, we had to do a lot of work with those in the [long tail]. 00:38:27.905 --> 00:38:32.039 And for around 1,000 strings, the matching was incorrect. 00:38:32.114 --> 00:38:34.950 But the catalogers did a lot of work in the Aleph catalog, 00:38:34.950 --> 00:38:39.869 but also in Wikidata, they made more than 6,000 manual edits to Wikidata 00:38:39.869 --> 00:38:45.019 to reach 100% coverage by adding aliases, type information, 00:38:45.085 --> 00:38:46.615 creating new entries. 00:38:46.615 --> 00:38:49.100 Okay, so, I have to speed up. 00:38:49.546 --> 00:38:54.295 We created the classification based on this, on the hierarchical statements. 00:38:54.295 --> 00:38:58.580 P131 is the main property there.
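[A hedged sketch of the kind of SPARQL query described above for feeding the matching index: places located, transitively, in North Rhine-Westphalia (assumed here to be Q1198), with label, aliases, type, coordinates, and the direct superior area. The real NWBib query evolved over time and is richer than this.]

SELECT ?place ?placeLabel ?alias ?type ?coord ?broader WHERE {
  ?place wdt:P131+ wd:Q1198 ;             # located in the administrative territorial entity, transitively, in NRW
         wdt:P31 ?type .                  # instance of, used for type-aware matching
  OPTIONAL { ?place wdt:P625 ?coord . }   # coordinate location
  OPTIONAL { ?place wdt:P131 ?broader . } # direct superior administrative area
  OPTIONAL { ?place skos:altLabel ?alias . FILTER(LANG(?alias) = "de") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
LIMIT 500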
00:38:59.827 --> 00:39:02.495 We added the information to our data. 00:39:03.035 --> 00:39:06.525 So, we now have this in our data, a spatial object-- 00:39:06.525 --> 00:39:11.535 and the focus is this--the link to Wikidata, and the types are there, 00:39:12.625 --> 00:39:17.554 and here's the ID from the SKOS classification 00:39:17.554 --> 00:39:19.234 we built based on Wikidata. 00:39:20.034 --> 00:39:23.555 And you can see there are Q identifiers in there. 00:39:26.940 --> 00:39:29.286 Now, you can basically query our API 00:39:29.286 --> 00:39:34.051 with such a query using Wikidata URIs, 00:39:34.316 --> 00:39:38.627 and get literature, in this example, about Cologne back. 00:39:39.724 --> 00:39:45.675 Then we created a Wikidata property for NWBib and added those links 00:39:45.675 --> 00:39:50.995 from Wikidata to the classification-- batch-loaded them with QuickStatements. 00:39:52.105 --> 00:39:53.634 And there's also a nice-- 00:39:53.634 --> 00:39:59.344 also a move to using a qualifier on this property 00:39:59.344 --> 00:40:02.994 to add the broader information there. 00:40:02.994 --> 00:40:06.333 So, I think people won't mess around with this 00:40:06.333 --> 00:40:09.223 as they might with the P131 statement. 00:40:10.094 --> 00:40:11.743 So, this is what it looks like. 00:40:12.563 --> 00:40:16.142 This will go to the classification where you can then start a query. 00:40:18.670 --> 00:40:23.293 Now, we have to build this update and review process, 00:40:23.293 --> 00:40:28.692 and we will add those data like this, 00:40:28.692 --> 00:40:32.452 with a $0 subfield, to Aleph, 00:40:32.452 --> 00:40:36.962 and the catalogers will start using those Wikidata-based IDs, 00:40:36.962 --> 00:40:41.012 URIs, for cataloging, for spatial indexing. 00:40:44.702 --> 00:40:50.082 So, by now, there are more than 400,000 NWBib entries with links to Wikidata, 00:40:50.082 --> 00:40:55.905 and more than 4,400 Wikidata entries with links to NWBib. 00:40:56.617 --> 00:40:58.042 Thank you. 00:40:58.042 --> 00:41:03.182 (applause) 00:41:07.574 --> 00:41:09.682 Thank you, Adrian. 00:41:13.312 --> 00:41:15.472 I got it. Thank you. 00:41:31.122 --> 00:41:34.402 So, as you've seen me before, I'm Hilary Thorsen. 00:41:34.402 --> 00:41:36.152 I'm Wikimedian in residence 00:41:36.152 --> 00:41:38.382 with the Linked Data for Production Project. 00:41:38.382 --> 00:41:39.942 I am based at Stanford, 00:41:39.942 --> 00:41:42.590 and I'm here today with my colleague, Lena Denis, 00:41:42.590 --> 00:41:45.581 who is Cartographic Assistant at Harvard Library. 00:41:45.581 --> 00:41:50.041 And Christine Fernsebner Eslao is here in spirit. 00:41:50.041 --> 00:41:53.530 She is currently back in Boston, but supporting us from afar. 00:41:53.530 --> 00:41:56.240 So, we'll be talking about Wikidata and Libraries 00:41:56.240 --> 00:42:00.350 as partners in data production, organization, and project inspiration. 00:42:00.850 --> 00:42:04.300 And our work is part of the Linked Data for Production Project. 00:42:05.450 --> 00:42:08.190 So, Linked Data for Production is in its second phase, 00:42:08.190 --> 00:42:10.450 called Pathway for Implementation. 00:42:10.450 --> 00:42:13.291 And it's an Andrew W. Mellon Foundation grant, 00:42:13.291 --> 00:42:16.120 involving the partnership of several universities, 00:42:16.120 --> 00:42:20.280 with the goal of constructing a pathway for shifting the catalog community 00:42:20.280 --> 00:42:24.860 to begin describing library resources with linked data.
00:42:24.860 --> 00:42:26.919 And it builds upon a previous grant, 00:42:26.919 --> 00:42:30.369 but this iteration is focused on the practical aspects 00:42:30.369 --> 00:42:32.009 of the transition. 00:42:33.559 --> 00:42:35.650 One of these pathways of investigation 00:42:35.650 --> 00:42:39.000 has been integrating library metadata with Wikidata. 00:42:39.429 --> 00:42:41.054 We have a lot of questions, 00:42:41.054 --> 00:42:42.999 but some of the ones we're most interested in 00:42:42.999 --> 00:42:46.180 are how we can integrate library metadata with Wikidata, 00:42:46.180 --> 00:42:49.580 and make contributing a part of our cataloging workflows, 00:42:49.580 --> 00:42:53.589 how Wikidata can help us improve our library discovery environment, 00:42:53.589 --> 00:42:55.929 how it can help us reveal more relationships 00:42:55.929 --> 00:42:59.629 and connections within our data and with external data sets, 00:42:59.629 --> 00:43:04.370 and if we have connections in our own data that can be added to Wikidata, 00:43:04.370 --> 00:43:07.480 how libraries can help fill in gaps in Wikidata, 00:43:07.480 --> 00:43:09.969 and how libraries can work with local communities 00:43:09.969 --> 00:43:13.070 to describe library and archival resources. 00:43:14.010 --> 00:43:17.129 Finding answers to these questions has focused on the mutual benefit 00:43:17.129 --> 00:43:19.649 for the library and Wikidata communities. 00:43:19.649 --> 00:43:22.949 We've learned through starting to work on our different Wikidata projects, 00:43:22.949 --> 00:43:25.279 that many of the issues libraries grapple with, 00:43:25.279 --> 00:43:29.451 like data modeling, identity management, data maintenance, documentation, 00:43:29.451 --> 00:43:31.289 and instruction on linked data, 00:43:31.289 --> 00:43:33.970 are ones the Wikidata community works on too. 00:43:34.370 --> 00:43:36.099 I'm going to turn things over to Lena 00:43:36.099 --> 00:43:39.640 to talk about what she's been working on now. 00:43:46.550 --> 00:43:51.040 Hi, so, as Hilary briefly mentioned, I work as a map librarian at Harvard, 00:43:51.040 --> 00:43:54.180 where I process maps, atlases, and archives for our online catalog. 00:43:54.180 --> 00:43:56.580 And while processing two-dimensional cartographic works 00:43:56.580 --> 00:43:59.572 is relatively straightforward, cataloging archival collections 00:43:59.572 --> 00:44:02.429 so that their cartographic resources can be made discoverable 00:44:02.429 --> 00:44:04.119 has always been more difficult. 00:44:04.119 --> 00:44:06.989 So, my use case for Wikidata is visually modeling relationships 00:44:06.989 --> 00:44:10.389 between archival collections and the individual items within them, 00:44:10.389 --> 00:44:13.210 as well as between archival drafts and published works. 00:44:13.359 --> 00:44:17.329 So, I used Wikidata to highlight the work of our cartographer named Erwin Raisz, 00:44:17.329 --> 00:44:19.890 who worked at Harvard in the early 20th century. 00:44:19.890 --> 00:44:22.539 He was known for his vividly detailed and artistic landforms, 00:44:22.539 --> 00:44:23.939 like this one on the screen-- 00:44:23.939 --> 00:44:26.294 but also for inventing the armadillo projection, 00:44:26.294 --> 00:44:29.020 writing the first cartography textbook in English, 00:44:29.020 --> 00:44:31.318 and making various other important contributions 00:44:31.318 --> 00:44:32.919 to the field of geography.
00:44:32.919 --> 00:44:34.609 And at the Harvard Map Collection, 00:44:34.609 --> 00:44:38.509 we have a 66-item collection of Raisz's field notebooks, 00:44:38.509 --> 00:44:41.359 which begin when he was a student and end just before his death. 00:44:43.679 --> 00:44:46.229 So, this is the collection-level record that I made for them, 00:44:46.229 --> 00:44:47.994 which merely gives an overview, 00:44:47.994 --> 00:44:50.513 but his notebooks are full of information 00:44:50.513 --> 00:44:53.351 that he used in later atlases, maps, and textbooks. 00:44:53.351 --> 00:44:56.313 But researchers don't know how to find that trajectory information, 00:44:56.313 --> 00:44:58.665 and the system is not designed to show them. 00:45:01.030 --> 00:45:03.734 So, I felt that with Wikidata, and other Wikimedia platforms, 00:45:03.734 --> 00:45:05.154 I'd be able to take advantage 00:45:05.154 --> 00:45:08.075 of information that already exists about him on the open web, 00:45:08.075 --> 00:45:10.629 along with library records and a notebook inventory 00:45:10.629 --> 00:45:12.574 that I had made in an Excel spreadsheet 00:45:12.574 --> 00:45:15.416 to show relationships and influences between his works. 00:45:15.574 --> 00:45:18.594 So here, you can see how I edited and reconciled library data 00:45:18.594 --> 00:45:20.165 in OpenRefine. 00:45:20.165 --> 00:45:23.164 And then, I used QuickStatements to batch import my results. 00:45:23.304 --> 00:45:25.244 So, now, I was ready to create knowledge graphs 00:45:25.244 --> 00:45:27.864 with SPARQL queries to show patterns of influence. 00:45:30.084 --> 00:45:33.304 The examples here show how I leveraged Wikimedia Commons images 00:45:33.304 --> 00:45:34.664 that I connected to him. 00:45:34.664 --> 00:45:36.459 And the hierarchy of some of his works 00:45:36.459 --> 00:45:38.604 that were contributing factors to other works. 00:45:38.604 --> 00:45:42.354 So, modeling Raisz's works on Wikidata allowed me to encompass in a single image, 00:45:42.354 --> 00:45:45.890 or in this case, in two images, the connections that require many pages 00:45:45.890 --> 00:45:47.864 of bibliographic data to reveal. 00:45:51.684 --> 00:45:55.544 So, this video is going to load. 00:45:55.563 --> 00:45:57.233 Yes! Alright. 00:45:57.233 --> 00:46:00.113 This video is a minute and a half long screencast I made, 00:46:00.113 --> 00:46:02.033 that I'm going to narrate as you watch. 00:46:02.033 --> 00:46:05.423 It shows the process of inputting and then running a SPARQL query, 00:46:05.423 --> 00:46:09.283 showing hierarchical relationships between notebooks, an atlas, and a map 00:46:09.283 --> 00:46:11.033 that Raisz created about Cuba. 00:46:11.033 --> 00:46:12.603 He worked there before the revolution, 00:46:12.603 --> 00:46:14.633 so he had the unique position of having support 00:46:14.633 --> 00:46:17.013 from both the American and the Cuban governments. 00:46:17.334 --> 00:46:20.583 So, I made this query as an example to show people who work on Raisz, 00:46:20.583 --> 00:46:24.134 and who are interested in narrowing down what materials they'd like to request 00:46:24.134 --> 00:46:26.154 when they come to us for research. 
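[A hedged sketch of a knowledge-graph query in the spirit of the ones described above: works whose creator is Erwin Raisz and, where stated, the works they are based on, rendered with the query service's graph view. P170 (creator) and P144 (based on) are illustrative property choices; the actual modeling of the Raisz items may differ.]

#defaultView:Graph
SELECT ?work ?workLabel ?source ?sourceLabel WHERE {
  ?raisz rdfs:label "Erwin Raisz"@en .   # look the cartographer up by English label
  ?work wdt:P170 ?raisz .                # creator
  OPTIONAL { ?work wdt:P144 ?source . }  # based on, where such links exist
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}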
00:46:26.154 --> 00:46:29.684 To make the approach replicable for other archival collections, 00:46:29.684 --> 00:46:33.105 I hope that Harvard and other institutions will prioritize Wikidata look-ups 00:46:33.105 --> 00:46:35.414 as they move to linked data cataloging production, 00:46:35.414 --> 00:46:37.520 which my co-presenters can speak to the progress on 00:46:37.520 --> 00:46:38.854 better than I can. 00:46:38.854 --> 00:46:41.543 But my work has brought me-- has brought to mind a particular issue 00:46:41.543 --> 00:46:46.580 that I see as a future opportunity, which is that of archival modeling. 00:46:47.369 --> 00:46:52.302 So, to an archivist, an item is a discrete archival material 00:46:52.302 --> 00:46:55.000 within a larger collection of archival materials 00:46:55.000 --> 00:46:56.884 that is not a physical location. 00:46:56.884 --> 00:47:00.663 So an archivist from the American National Archives and Records Administration, 00:47:00.663 --> 00:47:02.943 who is also a Wikidata enthusiast, 00:47:02.943 --> 00:47:05.742 advised me when I was trying to determine how to express this 00:47:05.742 --> 00:47:07.734 using an example item, 00:47:07.734 --> 00:47:10.456 that I'm going to show as soon as this video is finally over. 00:47:11.433 --> 00:47:14.391 Alright. Great. 00:47:20.437 --> 00:47:22.100 Nope, that's not what I wanted. 00:47:22.135 --> 00:47:23.536 Here we go. 00:47:31.190 --> 00:47:32.280 It's doing that. 00:47:32.280 --> 00:47:34.154 (humming) 00:47:34.208 --> 00:47:37.418 Nope. Sorry. Sorry. 00:47:40.444 --> 00:47:43.045 Alright, I don't know why it's not going full screen again. 00:47:43.045 --> 00:47:44.329 I can't get it to do anything. 00:47:44.329 --> 00:47:46.880 But this is the-- oh, my gosh. 00:47:46.880 --> 00:47:48.235 Stop that. Alright. 00:47:48.235 --> 00:47:51.195 So, this is the item that I mentioned. 00:47:51.575 --> 00:47:53.655 So, this was what the archivist 00:47:53.655 --> 00:47:55.964 from the National Archives and Records Administration 00:47:55.964 --> 00:47:57.414 showed me as an example. 00:47:57.414 --> 00:48:02.414 And he recommended this compromise, which is to use the part of property 00:48:02.414 --> 00:48:05.614 to connect a lower level description to a higher level of description, 00:48:05.614 --> 00:48:08.534 which allows the relationships between different hierarchical levels 00:48:08.534 --> 00:48:10.840 to be asserted as statements and qualifiers. 00:48:10.840 --> 00:48:12.884 So, in this example that's on screen, 00:48:12.884 --> 00:48:16.294 the relationship between an item, a series, a collection, and a record group 00:48:16.294 --> 00:48:19.655 are thus contained and described within a Wikidata item entity. 00:48:19.655 --> 00:48:22.024 So, I followed this model in my work on Raisz. 00:48:22.704 --> 00:48:26.024 And one of my images is missing. 00:48:26.024 --> 00:48:27.971 No, it's not. It's right there. I'm sorry. 00:48:28.210 --> 00:48:30.613 And so, I followed this model on my work on Raisz, 00:48:30.613 --> 00:48:33.103 but I look forward to further standardization. 00:48:38.983 --> 00:48:41.352 So, another archival project Harvard is working on 00:48:41.352 --> 00:48:44.632 is the Arthur Freedman collection of more than 2,000 hours 00:48:44.632 --> 00:48:48.702 of punk rock performances from the 1970s to early 2000s 00:48:48.702 --> 00:48:51.970 in the Boston and Cambridge, Massachussets areas. 00:48:51.970 --> 00:48:55.145 It includes many bands and venues that no longer exist. 
00:48:38.983 --> 00:48:41.352 So, another archival project Harvard is working on 00:48:41.352 --> 00:48:44.632 is the Arthur Freedman collection of more than 2,000 hours 00:48:44.632 --> 00:48:48.702 of punk rock performances from the 1970s to early 2000s 00:48:48.702 --> 00:48:51.970 in the Boston and Cambridge, Massachusetts areas. 00:48:51.970 --> 00:48:55.145 It includes many bands and venues that no longer exist. 00:48:55.604 --> 00:48:59.505 So far, work has been done in OpenRefine on reconciliation of the bands and venues 00:48:59.505 --> 00:49:02.324 to see which need an item created in Wikidata. 00:49:02.886 --> 00:49:05.964 A basic item will be created via batch process next spring, 00:49:05.964 --> 00:49:08.697 and then, an edit-a-thon will be held in conjunction 00:49:08.697 --> 00:49:12.254 with the New England Music Library Association's meeting in Boston 00:49:12.254 --> 00:49:15.866 to focus on adding more statements to the batch-created items, 00:49:15.866 --> 00:49:18.937 by drawing on local music community knowledge. 00:49:18.937 --> 00:49:22.086 We're interested in learning more about models for pairing librarians 00:49:22.086 --> 00:49:26.310 and Wiki enthusiasts with new contributors who have domain knowledge. 00:49:26.297 --> 00:49:29.293 Items will eventually be linked to digitized video 00:49:29.293 --> 00:49:31.387 in Harvard's digital collection platform 00:49:31.387 --> 00:49:33.167 once rights have been cleared with artists, 00:49:33.167 --> 00:49:35.147 which will likely be a slow process. 00:49:36.327 --> 00:49:38.030 There's also a great amount of interest 00:49:38.030 --> 00:49:41.680 in moving away from manual cataloging and creation of authority data 00:49:41.680 --> 00:49:43.247 towards identity management, 00:49:43.247 --> 00:49:45.667 where descriptions can be created in batches. 00:49:45.667 --> 00:49:48.057 An additional project that focused on 00:49:48.057 --> 00:49:51.297 creating international standard name identifiers, or ISNIs, 00:49:51.297 --> 00:49:53.477 for avant-garde and women filmmakers 00:49:53.477 --> 00:49:57.657 can be adapted for creating Wikidata items for these filmmakers, as well. 00:49:57.657 --> 00:50:01.076 Spreadsheets with the ISNIs, filmmaker names, and other details 00:50:01.076 --> 00:50:04.697 can be reconciled in OpenRefine, and uploaded with QuickStatements. 00:50:04.910 --> 00:50:06.940 Once people and organizations have been described, 00:50:06.940 --> 00:50:09.316 we'll move toward describing the films in Wikidata, 00:50:09.316 --> 00:50:12.526 which will likely present some additional modeling challenges. 00:50:13.446 --> 00:50:15.486 A library presentation wouldn't be complete 00:50:15.486 --> 00:50:16.882 without a MARC record. 00:50:16.882 --> 00:50:19.916 Here, you can see the record for Karen Aqua's taxonomy film, 00:50:19.916 --> 00:50:22.096 where her ISNI and Wikidata Q number 00:50:22.096 --> 00:50:24.176 have been added to the 100 field. 00:50:24.176 --> 00:50:26.636 The ISNIs and Wikidata Q numbers that have been created 00:50:26.636 --> 00:50:30.066 can then be batch added back into MARC records via MarcEdit. 00:50:30.066 --> 00:50:33.236 You might be asking why I'm showing you this ugly MARC record, 00:50:33.236 --> 00:50:35.596 instead of some beautiful linked data statements. 00:50:35.596 --> 00:50:38.576 And that's because our libraries will be working in a hybrid environment 00:50:38.576 --> 00:50:39.896 for some time. 00:50:39.896 --> 00:50:42.326 Our library catalogs still rely on MARC records, 00:50:42.326 --> 00:50:44.076 so by adding in these URIs, 00:50:44.076 --> 00:50:46.366 we can try to take advantage of linked data, 00:50:46.366 --> 00:50:48.346 while our systems still use MARC.
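[Illustrative sketch, not part of the talk: one way the ISNI spreadsheets described above could be checked against Wikidata, so that only missing filmmakers are batch-created and the Q numbers of existing items can be written back into MARC records. P213 (ISNI) is the real property; the ISNI values are formatted examples, not project data.]

SELECT ?isni ?person ?personLabel WHERE {
  VALUES ?isni { "0000 0001 2345 6789" "0000 0004 9876 5432" }   # example ISNIs only
  OPTIONAL { ?person wdt:P213 ?isni . }    # rows with no ?person need a new Wikidata item
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}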
00:50:49.496 --> 00:50:52.950 Adding URIs into MARC records makes an additional aspect 00:50:52.950 --> 00:50:54.335 of our project possible. 00:50:54.335 --> 00:50:56.894 Work has been done at Stanford and Cornell to bring data 00:50:56.894 --> 00:51:01.873 from Wikidata into our library catalog using URIs already in our MARC records. 00:51:02.334 --> 00:51:05.090 You can see an example of a knowledge panel, 00:51:05.090 --> 00:51:06.984 where all the data is sourced from Wikidata, 00:51:06.984 --> 00:51:11.004 and links back to the item itself, along with an invitation to contribute. 00:51:11.403 --> 00:51:15.130 This is currently in a test environment, not in production in our catalog. 00:51:15.130 --> 00:51:17.444 Ideally, eventually, these will be generated 00:51:17.444 --> 00:51:19.916 from linked data descriptions of library resources 00:51:19.916 --> 00:51:22.954 created using Sinopia, our linked data editor 00:51:22.954 --> 00:51:24.563 developed for cataloging. 00:51:24.563 --> 00:51:27.994 We found that adding a look-up to Wikidata in Sinopia is difficult. 00:51:27.994 --> 00:51:31.514 The scale and modeling of Wikidata make it hard to partition the data 00:51:31.514 --> 00:51:33.544 to be able to look up typed entities, 00:51:33.544 --> 00:51:34.900 and we've run into the problem 00:51:34.900 --> 00:51:37.493 of SPARQL not being good for keyword search, 00:51:37.493 --> 00:51:41.883 but wanting our keyword APIs to return SPARQL-like RDF descriptions. 00:51:41.883 --> 00:51:45.043 So, as you can see, we still have quite a bit of work to do. 00:51:45.043 --> 00:51:47.937 This round of the grant runs until June 2020, 00:51:47.937 --> 00:51:50.163 so, we'll be continuing our exploration. 00:51:50.163 --> 00:51:53.113 And I just wanted to invite anyone 00:51:53.113 --> 00:51:57.573 who has a continued interest in talking about Wikidata and libraries: 00:51:57.573 --> 00:52:01.454 I lead a Wikidata Affinity Group that's open to anyone to join. 00:52:01.454 --> 00:52:03.013 We meet every two weeks, 00:52:03.013 --> 00:52:05.513 and our next call is Tuesday, November the 5th, 00:52:05.513 --> 00:52:08.073 so if you're interested in continuing discussions, 00:52:08.073 --> 00:52:10.393 I would love to talk with you further. 00:52:10.393 --> 00:52:11.890 Thank you, everyone. 00:52:11.890 --> 00:52:13.623 And thank you to the other presenters 00:52:13.623 --> 00:52:16.893 for talking about all of their wonderful projects. 00:52:16.893 --> 00:52:21.283 (applause)