1 00:00:07,133 --> 00:00:11,738 I work as a teacher at the University of Alicante, 2 00:00:11,738 --> 00:00:17,040 where I recently obtained my PhD on data libraries and linked open data. 3 00:00:17,040 --> 00:00:19,038 And I'm also a software developer 4 00:00:19,038 --> 00:00:21,718 at the Biblioteca Virtual Miguel de Cervantes. 5 00:00:21,718 --> 00:00:24,467 And today, I'm going to talk about data quality. 6 00:00:28,252 --> 00:00:31,527 Well, those are my colleagues at the university. 7 00:00:32,457 --> 00:00:36,727 And as you may know, many organizations are publishing their data 8 00:00:36,727 --> 00:00:38,447 as linked open data-- 9 00:00:38,447 --> 00:00:41,437 for example, the National Library of France, 10 00:00:41,437 --> 00:00:45,947 the National Library of Spain, us, which is Cervantes Virtual, 11 00:00:45,947 --> 00:00:49,007 the British National Bibliography, 12 00:00:49,007 --> 00:00:51,667 the Library of Congress, and Europeana. 13 00:00:51,667 --> 00:00:56,000 All of them provide a SPARQL endpoint, 14 00:00:56,000 --> 00:00:58,875 which is useful in order to retrieve the data. 15 00:00:59,104 --> 00:01:00,984 And if I'm not wrong, 16 00:01:00,984 --> 00:01:05,890 the Library of Congress only provides the data as a dump, which you can't query directly. 17 00:01:07,956 --> 00:01:13,787 When we published our repository as linked open data, 18 00:01:13,787 --> 00:01:17,475 the idea was for it to be reused by other institutions. 19 00:01:17,981 --> 00:01:24,000 But what if I'm an institution that wants to enrich its data 20 00:01:24,000 --> 00:01:27,435 with data from other data libraries? 21 00:01:27,574 --> 00:01:30,674 Which data set should I use? 22 00:01:30,674 --> 00:01:34,314 Which data set is better in terms of quality? 23 00:01:36,874 --> 00:01:41,314 The benefits of evaluating data quality in libraries are many. 24 00:01:41,314 --> 00:01:47,143 For example, methodologies can be improved to include new criteria 25 00:01:47,182 --> 00:01:49,162 for assessing quality. 26 00:01:49,162 --> 00:01:54,592 And also, organizations can benefit from best practices and guidelines 27 00:01:54,602 --> 00:01:58,270 for publishing their data as linked open data. 28 00:02:00,012 --> 00:02:03,462 What do we need in order to assess quality? 29 00:02:03,462 --> 00:02:06,862 Well, obviously, a set of candidates and a set of features. 30 00:02:06,862 --> 00:02:10,077 For example, do they have a SPARQL endpoint, 31 00:02:10,077 --> 00:02:13,132 do they have a web interface, how many publications do they have, 32 00:02:13,132 --> 00:02:18,092 how many vocabularies do they use, how many Wikidata properties do they have? 33 00:02:18,092 --> 00:02:20,892 And where can I get those candidates? 34 00:02:20,892 --> 00:02:22,472 I used the LOD Cloud-- 35 00:02:22,472 --> 00:02:27,422 but when I was preparing this slide, I thought about using Wikidata 36 00:02:27,562 --> 00:02:29,746 to retrieve those candidates-- 37 00:02:29,746 --> 00:02:34,295 for example, getting entities of type data library 38 00:02:34,295 --> 00:02:36,473 which have a SPARQL endpoint. 39 00:02:36,473 --> 00:02:38,693 You have the link here. 40 00:02:41,453 --> 00:02:45,083 And I came up with these data libraries. 41 00:02:45,104 --> 00:02:50,233 The first one uses the Bibliographic Ontology as its main vocabulary, 42 00:02:50,233 --> 00:02:54,122 and the others are based, more or less, on FRBR, 43 00:02:54,122 --> 00:02:57,180 which is a vocabulary published by IFLA.
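A minimal sketch of the kind of candidate query described above; the identifiers are assumptions, not taken from the talk ("digital library" is assumed to be Q212805, and "SPARQL endpoint URL" to be P5305):

```sparql
# Hypothetical sketch: candidate data libraries that expose a SPARQL endpoint.
# Q212805 ("digital library") and P5305 ("SPARQL endpoint URL") are assumed identifiers.
SELECT ?library ?libraryLabel ?endpoint WHERE {
  ?library wdt:P31/wdt:P279* wd:Q212805 ;  # instance of (a subclass of) digital library
           wdt:P5305 ?endpoint .           # its SPARQL endpoint URL
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```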
44 00:02:57,180 --> 00:03:00,013 And this is just an example of how we could compare 45 00:03:00,013 --> 00:03:04,393 data libraries using bubble charts on Wikidata. 46 00:03:04,393 --> 00:03:08,613 And this is just an example comparing how many Wikidata properties 47 00:03:08,613 --> 00:03:10,633 there are per data library. 48 00:03:13,483 --> 00:03:15,980 Well, how can we measure quality? 49 00:03:15,928 --> 00:03:17,972 There are different methodologies, 50 00:03:17,972 --> 00:03:19,726 for example, FRBR 1, 51 00:03:19,726 --> 00:03:24,337 which provides a set of criteria grouped by dimensions, 52 00:03:24,337 --> 00:03:27,556 and those in green are the ones that I found-- 53 00:03:27,556 --> 00:03:30,917 that I could assess by means of Wikidata. 54 00:03:33,870 --> 00:03:39,397 And we also found that we could define new criteria-- 55 00:03:39,397 --> 00:03:44,567 for example, a new one to evaluate the number of duplicates in Wikidata. 56 00:03:45,047 --> 00:03:47,206 We used those properties. 57 00:03:47,206 --> 00:03:50,098 And this is an example of a SPARQL query 58 00:03:50,098 --> 00:03:54,486 to count the number of duplicated properties. 59 00:03:57,136 --> 00:04:00,366 And about the results: well, at the moment of doing this study-- 60 00:04:00,366 --> 00:04:05,216 not the slides--there was no property for the British National Bibliography. 61 00:04:05,860 --> 00:04:08,260 They don't provide provenance information, 62 00:04:08,260 --> 00:04:11,536 which could be useful for metadata enrichment. 63 00:04:11,536 --> 00:04:14,660 And they don't allow editing the information. 64 00:04:14,660 --> 00:04:17,166 So, we've been talking about Wikibase the whole weekend, 65 00:04:17,166 --> 00:04:21,396 and maybe we should try to adopt Wikibase as an interface. 66 00:04:23,186 --> 00:04:25,436 And they are focused on their own content, 67 00:04:25,436 --> 00:04:28,856 and this is just a SPARQL query on Wikidata 68 00:04:28,856 --> 00:04:31,411 in order to assess the population. 69 00:04:32,066 --> 00:04:36,006 And the BnF provides labels in multiple languages, 70 00:04:36,006 --> 00:04:38,956 and they all use self-describing URIs, 71 00:04:38,956 --> 00:04:43,058 which means that the URI includes the type of entity, 72 00:04:43,058 --> 00:04:48,406 which allows the human reader to understand what they are using. 73 00:04:51,499 --> 00:04:55,256 And more results: they provide different output formats, 74 00:04:55,256 --> 00:04:58,646 they use external vocabularies. 75 00:04:58,854 --> 00:05:01,116 Only the British National Bibliography 76 00:05:01,116 --> 00:05:03,734 provides machine-readable licensing information. 77 00:05:03,734 --> 00:05:09,124 And up to one-third of the instances are connected to external repositories, 78 00:05:09,124 --> 00:05:11,225 which is really nice. 79 00:05:12,604 --> 00:05:18,290 And this study, this work, has been done in our Labs team. 80 00:05:18,364 --> 00:05:22,391 A lab in a GLAM is a group of people 81 00:05:22,391 --> 00:05:27,520 who want to explore new ways 82 00:05:27,587 --> 00:05:30,306 of reusing data collections. 83 00:05:31,039 --> 00:05:35,054 And there's a community led by the British Library, 84 00:05:35,054 --> 00:05:37,366 and in particular, Mahendra Mahey, 85 00:05:37,366 --> 00:05:40,610 and we had a first event in London, 86 00:05:40,610 --> 00:05:42,601 and another one in Copenhagen, 87 00:05:42,601 --> 00:05:45,279 and we're going to have a new one in May 88 00:05:45,279 --> 00:05:48,240 at the Library of Congress in Washington.
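One plausible reading of the duplicate-counting query mentioned above, sketched here rather than copied from the slides: count how often a single external identifier value (for example, the BnF ID, P268) is shared by more than one Wikidata item:

```sparql
# Sketch: values of one external identifier (BnF ID, P268) attached to
# more than one item, i.e. likely duplicates.
SELECT ?id (COUNT(?item) AS ?items) WHERE {
  ?item wdt:P268 ?id .
}
GROUP BY ?id
HAVING (COUNT(?item) > 1)
ORDER BY DESC(?items)
```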
89 00:05:48,528 --> 00:05:52,481 And we are now 250 people. 90 00:05:52,481 --> 00:05:56,421 And I'm so glad that I found somebody here at the WikidataCon 91 00:05:56,421 --> 00:05:58,860 who has just joined us-- 92 00:05:58,860 --> 00:06:01,160 Sylvia from [inaudible], Mexico. 93 00:06:01,160 --> 00:06:04,509 And I'd like to invite you to our community, 94 00:06:04,509 --> 00:06:09,719 since you may be part of a GLAM institution. 95 00:06:10,659 --> 00:06:13,164 So, we can talk later if you want to know more about this. 96 00:06:14,589 --> 00:06:16,719 And this--it's all about people. 97 00:06:16,719 --> 00:06:19,669 This is me, people from the British Library, 98 00:06:19,669 --> 00:06:24,629 the Library of Congress, universities, and national libraries in Europe. 99 00:06:24,871 --> 00:06:28,050 And there's a link here in case you want to know more. 100 00:06:28,433 --> 00:06:32,655 And, well, last month, we decided to meet in Doha 101 00:06:32,655 --> 00:06:37,448 in order to write a book about how to create a lab in a GLAM. 102 00:06:38,585 --> 00:06:43,279 And they chose 15 people, and I was so lucky to be there. 103 00:06:45,314 --> 00:06:48,594 And the book follows the Book Sprint methodology, 104 00:06:48,594 --> 00:06:51,674 which means that nothing is prepared beforehand. 105 00:06:51,674 --> 00:06:53,495 All is done there, in a week. 106 00:06:53,495 --> 00:06:55,725 And believe me, it was really hard work 107 00:06:55,725 --> 00:06:58,905 to have the whole book done in that week. 108 00:06:59,890 --> 00:07:04,490 And I'd like to introduce you to the book, which will be published-- 109 00:07:04,490 --> 00:07:06,455 it was supposed to be published this week, 110 00:07:06,455 --> 00:07:08,274 but it will be next week. 111 00:07:08,974 --> 00:07:13,014 And it will be published openly, so you can have it, 112 00:07:13,065 --> 00:07:15,668 and I can show you a little bit later if you want. 113 00:07:15,734 --> 00:07:17,601 And those are the authors. 114 00:07:17,601 --> 00:07:19,678 I'm here-- I'm so happy, too. 115 00:07:19,678 --> 00:07:22,110 And those are the institutions-- 116 00:07:22,110 --> 00:07:26,722 the Library of Congress, the British Library-- and this is the title. 117 00:07:27,330 --> 00:07:29,604 And now, I'd like to show you-- 118 00:07:31,441 --> 00:07:33,971 a map that I'm making. 119 00:07:34,278 --> 00:07:37,234 We are launching a website for our community, 120 00:07:37,234 --> 00:07:42,893 and I'm in charge of creating a map with our institutions on it. 121 00:07:43,097 --> 00:07:44,860 This is not finished. 122 00:07:44,860 --> 00:07:50,276 But this is just SPARQL, and below, 123 00:07:51,546 --> 00:07:53,027 we see the map. 124 00:07:53,027 --> 00:07:58,086 And we see here the new people that I found, here, 125 00:07:58,086 --> 00:08:00,486 at the WikidataCon-- I'm so happy about this. 126 00:08:00,621 --> 00:08:05,631 And we have here the data library of my university, 127 00:08:05,681 --> 00:08:08,490 and many other institutions. 128 00:08:09,051 --> 00:08:10,940 Also, from Australia-- 129 00:08:11,850 --> 00:08:13,061 if I can do it. 130 00:08:13,930 --> 00:08:15,711 Well, here, we have some links. 131 00:08:19,586 --> 00:08:21,088 There you go. 132 00:08:21,189 --> 00:08:23,059 Okay, this is not finished. 133 00:08:23,539 --> 00:08:26,049 We are still working on this, and that's all. 134 00:08:26,057 --> 00:08:28,170 Thank you very much for your attention.
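A sketch of the kind of institution map described above; the VALUES list is purely illustrative (two well-known institutions stand in for the community's actual member list, which the real map presumably draws from elsewhere):

```sparql
#defaultView:Map
# Illustrative sketch: plot a hand-picked list of institutions by their coordinates.
SELECT ?inst ?instLabel ?coords WHERE {
  VALUES ?inst { wd:Q23308 wd:Q131454 }   # British Library, Library of Congress (examples only)
  ?inst wdt:P625 ?coords .                # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```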
135 00:08:28,858 --> 00:08:33,683 (applause) 136 00:08:41,962 --> 00:08:48,079 [inaudible] 137 00:08:59,490 --> 00:09:00,870 Good morning, everybody. 138 00:09:00,870 --> 00:09:01,930 I'm Olaf Janssen. 139 00:09:01,930 --> 00:09:03,570 I'm the Wikimedia coordinator 140 00:09:03,570 --> 00:09:06,150 at the National Library of the Netherlands. 141 00:09:06,310 --> 00:09:08,390 And I would like to share my work, 142 00:09:08,390 --> 00:09:11,610 which I'm doing about creating linked open data 143 00:09:11,640 --> 00:09:15,351 for Dutch public libraries using Wikidata. 144 00:09:17,600 --> 00:09:20,850 And my story starts roughly a year ago, 145 00:09:20,850 --> 00:09:24,581 when I was at the GLAM Wiki conference in Tel Aviv, in Israel. 146 00:09:25,301 --> 00:09:27,938 And here are two men with very similar shirts, 147 00:09:27,938 --> 00:09:31,120 and equally similar hairdos, [Matt]... 148 00:09:31,120 --> 00:09:33,440 (laughter) 149 00:09:33,440 --> 00:09:35,325 And on the left, that's me. 150 00:09:35,325 --> 00:09:39,065 And a year ago, I didn't have any practical knowledge and skills 151 00:09:39,065 --> 00:09:40,265 about Wikidata. 152 00:09:40,265 --> 00:09:43,285 I looked at Wikidata, and I looked at the items, 153 00:09:43,285 --> 00:09:44,524 and I played with it. 154 00:09:44,524 --> 00:09:47,070 But I wasn't able to make a SPARQL query 155 00:09:47,070 --> 00:09:50,285 or to do data modeling with the right shape expression. 156 00:09:51,305 --> 00:09:52,865 That's a year ago. 157 00:09:53,465 --> 00:09:57,065 And on the other side, that's Simon Cobb, user Sic19. 158 00:09:57,304 --> 00:10:00,265 And I was talking to him because, just before, 159 00:10:00,525 --> 00:10:01,974 he had given a presentation 160 00:10:01,974 --> 00:10:06,374 about improving the coverage of public libraries in Wikidata. 161 00:10:06,757 --> 00:10:08,934 And I was very inspired by his talk. 162 00:10:09,564 --> 00:10:13,355 And basically, he was talking about adding basic data 163 00:10:13,355 --> 00:10:14,867 about public libraries. 164 00:10:14,867 --> 00:10:19,046 So, the name of the library, if available, a photo of the building, 165 00:10:19,046 --> 00:10:21,497 the address data of the library, 166 00:10:21,497 --> 00:10:25,120 the geo-coordinates, latitude and longitude, 167 00:10:25,120 --> 00:10:26,367 and some other things, 168 00:10:26,367 --> 00:10:29,187 all with source references. 169 00:10:31,317 --> 00:10:34,557 And what I was very impressed about a year ago was this map. 170 00:10:34,557 --> 00:10:37,337 This is a map of public libraries in the U.K. 171 00:10:37,337 --> 00:10:38,577 with all the colors. 172 00:10:38,577 --> 00:10:43,017 And you can see that all the libraries are layered by library organizations. 173 00:10:43,017 --> 00:10:46,210 And when he showed this, I was really, "Wow, that's cool." 174 00:10:46,637 --> 00:10:49,138 So, then, one minute later, I thought, 175 00:10:49,138 --> 00:10:52,918 "Well, let's do that for my own country." 176 00:10:52,918 --> 00:10:54,850 (laughter) 177 00:10:57,149 --> 00:10:59,496 And something about public libraries in the Netherlands-- 178 00:10:59,496 --> 00:11:03,020 there are about 1,300 library branches in our country, 179 00:11:03,020 --> 00:11:06,710 grouped into 160 library organizations. 180 00:11:07,723 --> 00:11:10,937 And you might wonder: why do I want to do this project?
181 00:11:10,997 --> 00:11:14,137 Well, first of all, for the common good, for society, 182 00:11:14,137 --> 00:11:16,707 because I think using Wikidata, 183 00:11:16,707 --> 00:11:20,657 and from there, creating Wikipedia articles, 184 00:11:20,657 --> 00:11:23,417 and opening it up via the linked open data cloud-- 185 00:11:23,417 --> 00:11:29,006 it's improving the visibility and reusability of public libraries in the Netherlands. 186 00:11:30,110 --> 00:11:32,197 And my second goal was actually a more personal one, 187 00:11:32,197 --> 00:11:36,517 because a year ago, I had this yearly evaluation with my manager, 188 00:11:37,243 --> 00:11:41,737 and we decided it was a good idea that I get more practical skills 189 00:11:41,737 --> 00:11:45,853 in linked open data, data modeling, and also in Wikidata. 190 00:11:46,464 --> 00:11:50,286 And of course, I wanted to be able to make these kinds of maps myself. 191 00:11:50,286 --> 00:11:51,396 (laughter) 192 00:11:54,345 --> 00:11:57,100 Then you might wonder: why do I want to do this? 193 00:11:57,100 --> 00:12:01,723 Isn't there already enough basic library data out there in the Netherlands 194 00:12:02,450 --> 00:12:04,233 to have good coverage? 195 00:12:06,019 --> 00:12:08,367 So, let me show you some of the websites 196 00:12:08,367 --> 00:12:12,882 that are available to discover address and location information 197 00:12:12,882 --> 00:12:14,505 about Dutch public libraries. 198 00:12:14,505 --> 00:12:17,722 And the first one is this one-- Gidsvoornederland.nl-- 199 00:12:17,722 --> 00:12:20,641 and that's the official public library inventory 200 00:12:20,641 --> 00:12:23,037 maintained by my library, the National Library. 201 00:12:23,727 --> 00:12:29,160 And you can look up addresses and geo-coordinates on that website. 202 00:12:30,493 --> 00:12:32,797 Then there is this site, Bibliotheekinzicht-- 203 00:12:32,797 --> 00:12:36,502 this is also an official website maintained by my National Library. 204 00:12:36,502 --> 00:12:38,982 And this is about public library statistics. 205 00:12:41,010 --> 00:12:43,933 Then there is another one, debibliotheken.nl-- 206 00:12:43,933 --> 00:12:46,005 as you can see, there is also address information 207 00:12:46,005 --> 00:12:49,659 about library organizations, not about individual branches. 208 00:12:51,724 --> 00:12:55,010 And there's even this one, which also has address information. 209 00:12:56,546 --> 00:12:59,028 And of course, there's something like Google Maps, 210 00:12:59,028 --> 00:13:02,157 which also has all the names and the locations and the addresses. 211 00:13:03,455 --> 00:13:06,218 And this one, the International Library of Technology, 212 00:13:06,218 --> 00:13:09,580 which has a worldwide inventory of libraries, 213 00:13:09,646 --> 00:13:11,393 including the Netherlands. 214 00:13:13,058 --> 00:13:15,049 And I even discovered there is a data set 215 00:13:15,049 --> 00:13:18,423 you can buy for 50 euros or so to download. 216 00:13:18,423 --> 00:13:21,023 And there also seems to be--I didn't download it, 217 00:13:21,023 --> 00:13:23,633 but there seems to be address information available. 218 00:13:24,273 --> 00:13:30,180 You might wonder: is this kind of data good enough for the purposes I had? 219 00:13:32,282 --> 00:13:37,372 So, this is my birthday list for my ideal public library data set. 220 00:13:37,439 --> 00:13:39,105 And what's on my list?
221 00:13:39,173 --> 00:13:43,830 First of all, the data I want to have must be up-to-date-ish-- 222 00:13:43,830 --> 00:13:45,604 it must be fairly up-to-date. 223 00:13:45,604 --> 00:13:48,513 So, it doesn't have to be real time, 224 00:13:48,513 --> 00:13:51,323 but let's say, a couple of months, or half a year, 225 00:13:53,284 --> 00:13:57,354 delayed from official publication--that's okay for my purposes. 226 00:13:58,116 --> 00:14:00,956 And I want to have both the library branches 227 00:14:00,956 --> 00:14:02,697 and the library organizations. 228 00:14:04,206 --> 00:14:08,400 Then I want my data to be structured, because it has to be machine-readable. 229 00:14:08,301 --> 00:14:11,986 It has to be in an open file format, such as CSV or JSON or RDF. 230 00:14:12,717 --> 00:14:15,197 It has to be linked to other resources, preferably. 231 00:14:16,011 --> 00:14:22,182 And the license on the data needs to be clearly public domain or CC0. 232 00:14:23,520 --> 00:14:26,192 Then, I would like my data to have an API, 233 00:14:26,599 --> 00:14:30,548 which must be public, free, and preferably also anonymous, 234 00:14:30,548 --> 00:14:34,900 so you don't have to use an API key or register an account. 235 00:14:36,103 --> 00:14:38,863 And I also want to have a SPARQL interface. 236 00:14:41,131 --> 00:14:43,651 So, now, these are all the sites I just showed you. 237 00:14:43,717 --> 00:14:46,450 And I'm going to make a big grid. 238 00:14:47,337 --> 00:14:50,017 And then, this is the evaluation I did. 239 00:14:51,187 --> 00:14:54,166 I'm not going into it, but there is no single column 240 00:14:54,166 --> 00:14:56,007 which has all green check marks. 241 00:14:56,007 --> 00:14:57,997 That's the important thing to take away. 242 00:14:58,967 --> 00:15:03,947 And so, in summary, there was no public, free linked open data 243 00:15:03,947 --> 00:15:08,937 for Dutch public libraries available before I started my project. 244 00:15:09,237 --> 00:15:13,027 So, this was the ideal motivation to actually work on it. 245 00:15:14,730 --> 00:15:17,427 So, that's what I've been doing for a year now. 246 00:15:17,717 --> 00:15:22,977 And I've been adding libraries bit by bit, organization by organization, to Wikidata. 247 00:15:23,417 --> 00:15:26,387 I also created a project website for it. 248 00:15:26,727 --> 00:15:29,567 It's still rather messy, but it has all the information, 249 00:15:29,567 --> 00:15:33,240 and I try to keep it as up-to-date as possible. 250 00:15:33,240 --> 00:15:36,277 And also, all the SPARQL queries you can see are linked from here. 251 00:15:38,002 --> 00:15:40,235 And I'm just adding really basic information. 252 00:15:40,235 --> 00:15:44,097 You see the instances, images if available, 253 00:15:44,097 --> 00:15:47,229 addresses, locations, municipalities, et cetera. 254 00:15:48,534 --> 00:15:53,276 And where possible, I also try to link the libraries to external identifiers. 255 00:15:56,024 --> 00:15:58,415 And then, you can really easily--as we all know-- 256 00:15:58,415 --> 00:16:03,050 generate some Listeria lists with public libraries grouped 257 00:16:03,050 --> 00:16:05,060 by organization, for instance. 258 00:16:05,060 --> 00:16:08,380 Or using SPARQL queries, you can also do aggregation on the data-- 259 00:16:08,380 --> 00:16:11,060 let's say, give me all the municipalities in the Netherlands 260 00:16:11,060 --> 00:16:15,115 and the number of library branches in each of those municipalities.
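A sketch of that kind of aggregation query; the class QIDs here are assumptions (the real queries are linked from the project website):

```sparql
# Sketch: number of library branches per Dutch municipality.
# Q7075 is "library"; Q2039348 is assumed to be "municipality of the Netherlands".
SELECT ?municipality ?municipalityLabel (COUNT(DISTINCT ?branch) AS ?branches) WHERE {
  ?branch wdt:P31/wdt:P279* wd:Q7075 ;   # instance of (a subclass of) library
          wdt:P131 ?municipality .       # located in this administrative entity
  ?municipality wdt:P31 wd:Q2039348 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". }
}
GROUP BY ?municipality ?municipalityLabel
ORDER BY DESC(?branches)
```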
261 00:16:17,025 --> 00:16:20,228 With one click, you can make these kinds of photo galleries. 262 00:16:22,092 --> 00:16:23,655 And what I set out to do first-- 263 00:16:23,655 --> 00:16:26,036 you can really create these kinds of maps. 264 00:16:27,176 --> 00:16:30,425 And you might wonder, "Are there any libraries here or there?" 265 00:16:30,555 --> 00:16:33,355 There are--they are not yet in Wikidata. 266 00:16:33,355 --> 00:16:35,055 We're still working on that. 267 00:16:35,135 --> 00:16:37,644 And actually, last week, I spoke with a volunteer 268 00:16:37,644 --> 00:16:40,864 who's now helping with entering the libraries. 269 00:16:41,644 --> 00:16:45,394 You can really make cool maps in Wikidata, 270 00:16:45,394 --> 00:16:47,914 and also, using the Kartographer extension, 271 00:16:47,914 --> 00:16:50,244 you can make these kinds of maps. 272 00:16:51,724 --> 00:16:53,736 And I even took it one step further. 273 00:16:53,911 --> 00:16:57,399 I also have some Python skills, and some Leaflet skills-- 274 00:16:57,399 --> 00:16:59,971 so, I created--and I'm quite proud of it, actually-- 275 00:16:59,971 --> 00:17:03,482 I created this library heat map, which is fully interactive. 276 00:17:03,482 --> 00:17:05,956 You can zoom into it, and you can see all the libraries, 277 00:17:06,712 --> 00:17:08,726 and you can also run it off-wiki. 278 00:17:08,726 --> 00:17:10,552 So, you can just embed it in your own website, 279 00:17:10,552 --> 00:17:13,412 and it runs fully interactively. 280 00:17:15,131 --> 00:17:17,592 So, now going back to my big scary table. 281 00:17:19,512 --> 00:17:22,970 There is one column on the right, which is blank. 282 00:17:22,970 --> 00:17:24,940 And no surprise, it will be Wikidata. 283 00:17:24,940 --> 00:17:26,448 Let's see how it scores there. 284 00:17:26,448 --> 00:17:29,500 (cheering) 285 00:17:32,892 --> 00:17:35,191 So, I'm actually thinking of printing this on a T-shirt. 286 00:17:35,301 --> 00:17:37,288 (laughter) 287 00:17:37,788 --> 00:17:39,700 So, just to summarize this in words: 288 00:17:39,700 --> 00:17:41,129 thanks to my project, now, 289 00:17:41,129 --> 00:17:45,879 there is public, free linked open data available for Dutch public libraries. 290 00:17:47,124 --> 00:17:49,686 And who can benefit from my effort? 291 00:17:50,333 --> 00:17:52,002 Well, all kinds of parties-- 292 00:17:52,002 --> 00:17:54,274 you see Wikipedia, because you can generate lists 293 00:17:54,274 --> 00:17:56,051 and overviews and articles, 294 00:17:56,051 --> 00:17:59,908 for instance, using this data from Wikidata; 295 00:17:59,908 --> 00:18:01,976 our National Library; 296 00:18:02,850 --> 00:18:05,391 IFLA, which also has an inventory of worldwide libraries-- 297 00:18:05,391 --> 00:18:07,216 they can also reuse the data. 298 00:18:07,650 --> 00:18:09,497 And especially for Sandra, 299 00:18:09,549 --> 00:18:13,237 it's also important for the Ministry-- the Dutch Ministry of Culture-- 300 00:18:13,277 --> 00:18:15,667 because Sandra is going to have a talk about Wikidata 301 00:18:15,667 --> 00:18:18,287 with the Ministry this Monday, next Monday.
302 00:18:19,922 --> 00:18:22,277 And also, on the righthand side, for instance, 303 00:18:23,891 --> 00:18:27,098 Amazon with Alexa, the assistant-- 304 00:18:27,098 --> 00:18:28,961 they're also using Wikidata, 305 00:18:28,961 --> 00:18:30,995 so you can imagine that, 306 00:18:30,995 --> 00:18:33,357 if you're looking for public library information, 307 00:18:33,357 --> 00:18:36,580 they can also use Wikidata for that. 308 00:18:38,955 --> 00:18:41,680 Because one year ago, Simon Cobb inspired me 309 00:18:41,680 --> 00:18:44,244 to do this project, I would like to call upon you: 310 00:18:44,244 --> 00:18:45,664 if you have time available, 311 00:18:45,664 --> 00:18:49,532 and if you have data from your own country about public libraries, 312 00:18:51,572 --> 00:18:54,422 make the coverage better, add more red dots, 313 00:18:54,982 --> 00:18:56,982 and of course, I'm willing to help you with that. 314 00:18:56,982 --> 00:18:59,227 And Simon is also willing to help with this. 315 00:18:59,870 --> 00:19:01,471 And so, I hope next year, somebody else 316 00:19:01,471 --> 00:19:03,901 will be at this conference or another conference, 317 00:19:03,901 --> 00:19:06,291 and there will be more red dots on the map. 318 00:19:07,551 --> 00:19:08,911 Thank you very much. 319 00:19:09,004 --> 00:19:12,740 (applause) 320 00:19:18,336 --> 00:19:20,086 Thank you, Olaf. 321 00:19:20,086 --> 00:19:23,554 Next we have Ursula Oberst and Heleen Smits, 322 00:19:23,613 --> 00:19:27,734 presenting how a small research library can benefit from Wikidata: 323 00:19:27,734 --> 00:19:31,423 enhancing library products using Wikidata. 324 00:19:53,717 --> 00:19:57,637 Okay. Good morning. My name is Heleen Smits. 325 00:19:58,680 --> 00:20:01,753 And my colleague, Ursula Oberst--where are you? 326 00:20:01,753 --> 00:20:03,873 (laughter) 327 00:20:04,371 --> 00:20:09,220 And I work at the library of the African Studies Centre 328 00:20:09,220 --> 00:20:11,086 in Leiden, in the Netherlands. 329 00:20:11,086 --> 00:20:15,038 And the African Studies Centre is a center devoted-- 330 00:20:15,038 --> 00:20:21,464 is an academic institution devoted entirely to the study of Africa, 331 00:20:21,464 --> 00:20:23,986 focusing on the humanities and social studies. 332 00:20:24,672 --> 00:20:28,123 We used to be an independent research organization, 333 00:20:28,123 --> 00:20:33,064 but in 2016, we became part of Leiden University, 334 00:20:33,064 --> 00:20:38,433 and our catalog was integrated into the larger university catalog. 335 00:20:39,283 --> 00:20:43,593 Though it remained possible to do a search in the part of the Leiden-- 336 00:20:43,593 --> 00:20:45,894 of the African Studies catalog alone, 337 00:20:47,960 --> 00:20:50,505 we remained independent in some respects. 338 00:20:50,586 --> 00:20:53,262 For example, with respect to our thesaurus. 339 00:20:54,921 --> 00:20:59,883 And also with respect to the products we make for our users, 340 00:21:01,180 --> 00:21:04,378 such as acquisition lists and web dossiers. 341 00:21:05,158 --> 00:21:11,975 And it is in the field of the web dossiers 342 00:21:11,975 --> 00:21:14,582 that we have been looking 343 00:21:14,582 --> 00:21:19,582 for possible ways to apply Wikidata, 344 00:21:19,582 --> 00:21:23,372 and that's the part where Ursula, in the second part of this talk, 345 00:21:24,212 --> 00:21:27,184 will show you a bit of what we've been doing there.
346 00:21:31,250 --> 00:21:35,160 The web dossiers are collections 347 00:21:35,160 --> 00:21:39,000 of titles from our catalog that we compile 348 00:21:39,000 --> 00:21:45,591 around a theme, usually connected to, for example, a conference 349 00:21:45,591 --> 00:21:51,227 or a special event. And actually, the most recent web dossier we made 350 00:21:51,227 --> 00:21:56,017 was connected to the Year of Indigenous Languages, 351 00:21:56,017 --> 00:21:59,547 and that was around proverbs in African languages. 352 00:22:00,780 --> 00:22:02,327 Our first steps-- 353 00:22:04,307 --> 00:22:09,287 next slide--our first steps on the wiki path as a library 354 00:22:10,267 --> 00:22:15,046 were in 2013, when we were one of 12 GLAM institutions 355 00:22:15,046 --> 00:22:16,472 in the Netherlands 356 00:22:16,472 --> 00:22:20,952 taking part in the Wikipedians in Residence project, 357 00:22:20,952 --> 00:22:26,443 and we had, for two months, a Wikipedian in the house, 358 00:22:27,035 --> 00:22:32,527 and he gave us training in adding articles to Wikipedia, 359 00:22:33,000 --> 00:22:37,720 and also, we made a start with uploading photo collections to Commons, 360 00:22:38,530 --> 00:22:42,650 which always remained a little bit dependent on funding, as well-- 361 00:22:43,229 --> 00:22:45,702 whether we would be able to digitize them, 362 00:22:45,702 --> 00:22:50,350 and, mostly, to have a student assistant to do this. 363 00:22:51,220 --> 00:22:55,440 But it was actually a great addition to what we could offer 364 00:22:55,440 --> 00:22:57,560 as an academic library. 365 00:22:59,370 --> 00:23:04,742 In May 2018--so, that's my colleague Ursula-- 366 00:23:04,742 --> 00:23:09,465 she started to really explore--dive into--Wikidata, 367 00:23:09,465 --> 00:23:14,515 and see what we, as a small and not very experienced library 368 00:23:14,515 --> 00:23:18,175 in these fields, could do with that. 369 00:23:25,050 --> 00:23:26,995 So, as I mentioned, we have our own thesaurus. 370 00:23:28,210 --> 00:23:30,689 And this is where we started. 371 00:23:30,689 --> 00:23:34,502 This is a thesaurus of 13,000 terms, 372 00:23:34,502 --> 00:23:37,670 all in the field of African studies. 373 00:23:37,670 --> 00:23:41,457 It contains a lot of African languages, 374 00:23:43,417 --> 00:23:46,360 names of ethnic groups in Africa, 375 00:23:47,586 --> 00:23:49,431 and other proper names, 376 00:23:49,431 --> 00:23:55,509 which are perhaps especially interesting for Wikidata. 377 00:23:58,604 --> 00:24:04,824 So, it is a real authority-controlled 378 00:24:04,824 --> 00:24:08,370 vocabulary with 5,000 preferred terms. 379 00:24:08,554 --> 00:24:11,204 So, we submitted the request to Wikidata, 380 00:24:11,204 --> 00:24:17,135 and that was actually very quickly met with a positive response, 381 00:24:17,214 --> 00:24:19,354 which was very encouraging for us. 382 00:24:22,884 --> 00:24:25,574 Our thesaurus was loaded into Mix'n'match, 383 00:24:25,574 --> 00:24:31,691 and by now, 75% of the terms 384 00:24:31,691 --> 00:24:36,145 have been manually matched with Wikidata.
385 00:24:38,061 --> 00:24:42,081 So, it means, well, that we are now-- 386 00:24:42,971 --> 00:24:47,687 we are added as an identifier-- 387 00:24:48,387 --> 00:24:51,553 for example, if you click on Swahili language, 388 00:24:52,463 --> 00:24:57,152 what happens then in Wikidata is that the number 389 00:24:59,004 --> 00:25:02,354 that connects our term--the Wikidata item-- 390 00:25:02,560 --> 00:25:05,620 brings you to our thesaurus, 391 00:25:05,620 --> 00:25:10,000 and from there, you can do a search directly in the catalog 392 00:25:10,000 --> 00:25:12,560 by clicking the button again. 393 00:25:12,560 --> 00:25:18,160 It means, also, that Wikidata is not really integrated 394 00:25:18,160 --> 00:25:19,572 into our catalog. 395 00:25:19,572 --> 00:25:22,090 But that's also more difficult. 396 00:25:22,314 --> 00:25:26,053 Okay, we have to give the floor 397 00:25:26,053 --> 00:25:30,838 to Ursula for the next part. 398 00:25:30,838 --> 00:25:32,554 (Ursula) Thank you very much, Heleen. 399 00:25:32,554 --> 00:25:37,258 So, I will talk about our experiences 400 00:25:37,258 --> 00:25:39,677 with incorporating Wikidata elements 401 00:25:39,677 --> 00:25:41,356 into our web dossiers. 402 00:25:41,356 --> 00:25:44,607 A web dossier is--oh, sorry, yeah, sorry. 403 00:25:45,447 --> 00:25:49,646 A web dossier, or a classical web dossier, consists of three parts: 404 00:25:50,248 --> 00:25:53,320 an introduction to the subject, 405 00:25:53,320 --> 00:25:56,060 mostly written by one of our researchers; 406 00:25:56,060 --> 00:26:01,328 a selection of titles, both books and articles, from our collection; 407 00:26:01,328 --> 00:26:06,146 and the third part, an annotated list 408 00:26:06,146 --> 00:26:08,876 with links to electronic resources. 409 00:26:09,161 --> 00:26:15,815 And this year, we added a fourth part to our web dossiers, 410 00:26:15,815 --> 00:26:18,276 which is the Wikidata elements. 411 00:26:19,008 --> 00:26:22,007 And it all started last year, 412 00:26:22,007 --> 00:26:25,206 and my story is similar to the story of Olaf, actually. 413 00:26:25,352 --> 00:26:29,570 Last year, I had no clue about Wikidata, 414 00:26:29,570 --> 00:26:33,402 and I discovered this wonderful article by Alex Stinson 415 00:26:33,402 --> 00:26:36,932 on how to write a query in Wikidata. 416 00:26:37,382 --> 00:26:41,592 And he chose a subject-- a very appealing subject to me-- 417 00:26:41,592 --> 00:26:45,902 namely, "Discovering Women Writers from North Africa." 418 00:26:46,402 --> 00:26:51,162 I can really recommend this article, 419 00:26:51,162 --> 00:26:52,981 because it's very instructive. 420 00:26:52,981 --> 00:26:57,422 And I thought: I'm going to work on this query, 421 00:26:57,422 --> 00:27:02,662 and try to change it to "Southern African women writers," 422 00:27:02,662 --> 00:27:07,034 and try to add a link to their works in our catalog. 423 00:27:07,311 --> 00:27:10,861 And on the righthand side, you see the SPARQL query, 424 00:27:11,592 --> 00:27:15,181 which searches for Southern African women writers. 425 00:27:15,181 --> 00:27:20,686 If you click on the button, on the blue button on the lefthand side, 426 00:27:21,526 --> 00:27:23,971 the search result will appear beneath. 427 00:27:23,971 --> 00:27:26,448 The search result can have different formats. 428 00:27:26,448 --> 00:27:29,871 In my case, the search result is a map.
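A minimal sketch of such a map query, adapted from the general pattern rather than copied from the slide (South Africa, Q258, stands in for the whole region here; the actual dossier query also links each writer to the catalog):

```sparql
#defaultView:Map
# Sketch: women writers from South Africa, plotted by birthplace.
SELECT ?writer ?writerLabel ?coords WHERE {
  ?writer wdt:P31 wd:Q5 ;            # human
          wdt:P21 wd:Q6581072 ;      # sex or gender: female
          wdt:P106 wd:Q36180 ;       # occupation: writer
          wdt:P27 wd:Q258 ;          # country of citizenship: South Africa
          wdt:P19 ?birthplace .      # place of birth
  ?birthplace wdt:P625 ?coords .     # coordinates of the birthplace
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```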
429 00:27:29,871 --> 00:27:32,850 And the nice thing about Wikidata 430 00:27:32,850 --> 00:27:36,652 is that you can embed this search result 431 00:27:36,652 --> 00:27:38,682 into your own webpage, 432 00:27:38,682 --> 00:27:42,339 and that's what we are now doing with our web dossiers. 433 00:27:42,339 --> 00:27:47,039 So, this was the very first one, on Southern African women writers, 434 00:27:47,039 --> 00:27:49,649 listing the classical three elements, 435 00:27:49,649 --> 00:27:53,209 plus this map on the lefthand side, 436 00:27:53,209 --> 00:27:55,650 which gives extra information-- 437 00:27:55,650 --> 00:27:58,219 a link to the Southern African woman writer, 438 00:27:58,219 --> 00:28:00,749 a link to her works in our catalog, 439 00:28:00,749 --> 00:28:07,252 and a link to the Wikidata record of her birthplace, and her name, 440 00:28:08,219 --> 00:28:13,099 her personal record, plus a photo, if it's available on Wikidata. 441 00:28:16,231 --> 00:28:20,329 And to retrieve a nice map 442 00:28:20,329 --> 00:28:24,032 with a lot of red dots on the African continent, 443 00:28:24,032 --> 00:28:28,662 you need nice data in Wikidata--complete, sufficient data. 444 00:28:29,042 --> 00:28:33,442 So, with our second web dossier, on public art in Africa, 445 00:28:33,442 --> 00:28:38,420 we also started to enhance the data in Wikidata. 446 00:28:38,420 --> 00:28:43,242 In this case, for public art, we added geo-locations-- 447 00:28:43,242 --> 00:28:46,919 geo-locations to Wikidata. 448 00:28:46,919 --> 00:28:51,139 And we also searched for works of public art in Commons, 449 00:28:51,139 --> 00:28:55,165 and if they didn't have a record on Wikidata yet, 450 00:28:55,165 --> 00:29:00,670 we added the record to Wikidata. 451 00:29:00,855 --> 00:29:05,327 And the third thing we do: 452 00:29:05,327 --> 00:29:09,958 when we prepare a web dossier, 453 00:29:09,958 --> 00:29:15,514 we download the titles from our catalog, 454 00:29:15,514 --> 00:29:17,584 and the titles are in MARC 21, 455 00:29:17,584 --> 00:29:23,226 so we have to convert them to a format that is presentable on the website, 456 00:29:23,226 --> 00:29:28,229 and it doesn't take much time and effort to convert the same set of titles 457 00:29:28,229 --> 00:29:30,457 to Wikidata QuickStatements, 458 00:29:30,457 --> 00:29:36,999 and then, we also upload the title set to Wikidata, 459 00:29:36,999 --> 00:29:41,254 and you can see the titles we uploaded 460 00:29:41,254 --> 00:29:44,124 from our latest web dossier 461 00:29:44,124 --> 00:29:47,514 on African proverbs in Scholia-- 462 00:29:48,546 --> 00:29:52,294 a really nice tool that visualizes publications 463 00:29:52,294 --> 00:29:54,674 present in Wikidata. 464 00:29:54,674 --> 00:29:59,674 And, one second--when it is possible, we add a Scholia template 465 00:29:59,674 --> 00:30:01,863 to our web dossier's topic. 466 00:30:01,863 --> 00:30:03,272 Thank you very much. 467 00:30:03,272 --> 00:30:08,079 (applause) 468 00:30:09,255 --> 00:30:11,724 Thank you, Heleen and Ursula. 469 00:30:12,010 --> 00:30:16,866 Next we have Adrian Pohl presenting using Wikidata 470 00:30:16,866 --> 00:30:22,265 to improve spatial subject indexing in a regional bibliography. 471 00:30:45,181 --> 00:30:46,621 Okay, hello everybody. 472 00:30:46,621 --> 00:30:49,630 I'm going right into the topic. 473 00:30:49,630 --> 00:30:54,146 I only have ten minutes to present a three-year project. 474 00:30:54,535 --> 00:30:57,044 It wasn't full time.
(laughs) 475 00:30:57,044 --> 00:31:00,100 Okay, what's the NWBib? 476 00:31:00,100 --> 00:31:04,404 It's an acronym for North Rhine-Westphalian Bibliography. 477 00:31:04,404 --> 00:31:07,944 It's a regional bibliography that records literature 478 00:31:07,944 --> 00:31:11,441 about people and places in North Rhine-Westphalia. 479 00:31:12,534 --> 00:31:14,103 And besides the monographs in it-- 480 00:31:15,162 --> 00:31:19,451 there are a lot of articles in it, and most of them are quite unique, 481 00:31:19,451 --> 00:31:22,052 so, that's the interesting thing about this bibliography-- 482 00:31:22,052 --> 00:31:25,472 because it's often quite obscure stuff-- 483 00:31:25,472 --> 00:31:28,188 local people writing about their traditions, 484 00:31:28,188 --> 00:31:29,488 and things like this. 485 00:31:29,612 --> 00:31:33,428 And there are over 400,000 entries in there. 486 00:31:33,428 --> 00:31:37,689 And the bibliography started in 1983, 487 00:31:37,689 --> 00:31:42,718 and so we only have titles from this publication year onwards. 488 00:31:44,744 --> 00:31:49,166 If you want to take a look at it, it's at nwbib.de, 489 00:31:49,166 --> 00:31:50,859 that's the web application. 490 00:31:50,859 --> 00:31:55,389 It's based on our service, lobid.org, the API. 491 00:31:57,148 --> 00:32:01,220 It's cataloged as part of the hbz union catalog, 492 00:32:01,220 --> 00:32:04,988 which comprises around 20 million records, 493 00:32:04,988 --> 00:32:08,869 in an [inaudible] Aleph system. We get the data out of there, 494 00:32:08,869 --> 00:32:11,308 and make RDF out of it, 495 00:32:11,308 --> 00:32:16,408 and provide it as JSON via the HTTP API. 496 00:32:17,129 --> 00:32:20,507 So, the initial status in 2017 497 00:32:20,507 --> 00:32:25,307 was we had nearly 9,000 distinct strings 498 00:32:25,307 --> 00:32:28,727 about places--referring to places--in North Rhine-Westphalia. 499 00:32:28,727 --> 00:32:34,187 Mostly, those were administrative areas, like towns and districts, 500 00:32:34,187 --> 00:32:38,458 but also monasteries, principalities, or natural regions. 501 00:32:38,907 --> 00:32:43,517 And we already used Wikidata in 2017, 502 00:32:43,517 --> 00:32:48,496 and matched those strings to Wikidata entries with the Wikidata API, 503 00:32:48,496 --> 00:32:51,907 quite naively, to get the geo-coordinates from there, 504 00:32:51,907 --> 00:32:57,210 and do some geo-based discovery stuff with it. 505 00:32:57,326 --> 00:32:59,910 But this had some drawbacks. 506 00:32:59,910 --> 00:33:02,577 And so, the matching was really poor, 507 00:33:02,577 --> 00:33:05,197 and there were a lot of false positives, 508 00:33:05,197 --> 00:33:09,184 and we still had no hierarchy in those places, 509 00:33:09,184 --> 00:33:13,201 and we still had a lot of non-unique names. 510 00:33:13,505 --> 00:33:15,356 So, this is an example here. 511 00:33:16,616 --> 00:33:18,378 Does this work? 512 00:33:18,494 --> 00:33:22,314 Yeah, as you can see, for one place, Brauweiler, 513 00:33:22,314 --> 00:33:24,615 there are four different strings in there. 514 00:33:24,820 --> 00:33:27,893 So, we all know how this happens. 515 00:33:27,893 --> 00:33:31,994 If there's no authority file, you end up with this data. 516 00:33:31,994 --> 00:33:33,894 But we want to improve on that.
517 00:33:34,614 --> 00:33:38,211 And you can also see why the matching didn't work-- 518 00:33:38,211 --> 00:33:40,382 you have the name of the place, 519 00:33:40,382 --> 00:33:45,170 and there's often the name of the superior administrative area, 520 00:33:45,170 --> 00:33:50,532 and even, on the second level, a superior administrative area 521 00:33:50,532 --> 00:33:52,040 often in the string, 522 00:33:52,040 --> 00:33:58,909 needed to identify the place successfully. 523 00:33:58,909 --> 00:34:04,679 So, the goal was to build a full-fledged spatial classification based on this data, 524 00:34:04,679 --> 00:34:07,109 with a hierarchical view of places, 525 00:34:09,079 --> 00:34:11,389 with one entry or ID for each place. 526 00:34:11,518 --> 00:34:17,488 And we got this mock-up from the NWBib editors in 2016, made in Excel, 527 00:34:18,048 --> 00:34:23,116 to get a feeling of what they would like to have. 528 00:34:25,006 --> 00:34:28,198 There you have the-- Regierungsbezirk-- 529 00:34:28,198 --> 00:34:31,016 that's the most superior administrative area-- 530 00:34:31,016 --> 00:34:34,918 we have in there some towns or districts--rural districts-- 531 00:34:34,918 --> 00:34:39,861 and then, it's going down to the parts of towns, 532 00:34:39,861 --> 00:34:42,011 even to this level. 533 00:34:43,225 --> 00:34:46,232 And we chose Wikidata for this task. 534 00:34:46,232 --> 00:34:50,087 We also looked at the GND, the Integrated Authority File, 535 00:34:50,087 --> 00:34:54,918 and GeoNames--but Wikidata had the best coverage, 536 00:34:54,918 --> 00:34:56,902 and the best infrastructure. 537 00:34:58,112 --> 00:35:02,072 The coverage for the places and the geo-coordinates we need, 538 00:35:02,072 --> 00:35:04,512 and the hierarchical information, for example. 539 00:35:04,512 --> 00:35:06,732 There were a lot of places, also, in the GND, 540 00:35:06,732 --> 00:35:09,694 but there was no hierarchical information in there. 541 00:35:11,170 --> 00:35:13,682 And also, Wikidata provides the infrastructure 542 00:35:13,682 --> 00:35:15,343 for editing and versioning. 543 00:35:15,343 --> 00:35:20,022 And there's also a community that helps maintain the data, 544 00:35:20,022 --> 00:35:22,052 which was quite good. 545 00:35:22,950 --> 00:35:26,882 Okay, but there was a requirement by the NWBib editors. 546 00:35:27,682 --> 00:35:31,447 They did not want to directly rely on Wikidata, 547 00:35:31,447 --> 00:35:32,972 which was understandable. 548 00:35:32,972 --> 00:35:34,982 We don't have those servers under our control, 549 00:35:34,982 --> 00:35:38,002 and we don't know what's going on there. 550 00:35:38,084 --> 00:35:41,944 There might be some unwelcome edits that destroy the classification, 551 00:35:41,944 --> 00:35:44,159 or parts of it, or vandalism. 552 00:35:44,159 --> 00:35:50,794 So, we decided to put an intermediate SKOS file in between, 553 00:35:50,794 --> 00:35:55,534 on which the application would rely, and which would be generated from Wikidata. 554 00:35:57,113 --> 00:35:59,462 And SKOS is the Simple Knowledge Organization System-- 555 00:35:59,462 --> 00:36:03,919 it's the standard way to model 556 00:36:03,919 --> 00:36:07,519 a classification in the linked data world. 557 00:36:07,603 --> 00:36:09,278 So, how did we do it? Five steps. 558 00:36:09,278 --> 00:36:14,037 I will come to each of the steps in more detail. 559 00:36:14,037 --> 00:36:18,460 We matched the strings to Wikidata with a better approach than before.
560 00:36:18,727 --> 00:36:23,131 Created the classification based on Wikidata, then added 561 00:36:23,131 --> 00:36:26,255 the links back from Wikidata to NWBib 562 00:36:26,255 --> 00:36:27,590 with a custom property. 563 00:36:27,590 --> 00:36:32,659 And now, we are in the process of establishing a good process 564 00:36:32,659 --> 00:36:36,559 for updating the classification in Wikidata-- 565 00:36:36,619 --> 00:36:38,888 seeing--having a diff of the changes, 566 00:36:38,888 --> 00:36:41,158 and then publishing it to the SKOS file. 567 00:36:42,813 --> 00:36:44,646 I will come to the details. 568 00:36:44,646 --> 00:36:46,261 So, the matching approach-- 569 00:36:46,261 --> 00:36:48,356 as the API wasn't sufficient, 570 00:36:48,356 --> 00:36:53,585 and because we have those different levels in the strings, 571 00:36:54,441 --> 00:36:59,036 we built a custom Elasticsearch index for our task. 572 00:36:59,596 --> 00:37:04,378 I think by now, you could probably use OpenRefine for doing this as well, 573 00:37:04,378 --> 00:37:09,306 but at that point in time, it wasn't available for Wikidata. 574 00:37:10,186 --> 00:37:14,336 And we built this index based on a SPARQL query, 575 00:37:14,336 --> 00:37:20,484 for entities in NRW with a specific type. 576 00:37:20,484 --> 00:37:25,069 And the query evolved over time a lot. 577 00:37:25,148 --> 00:37:29,157 And it went through a few iterations--you can see the history on GitHub. 578 00:37:29,727 --> 00:37:32,088 So, what we put in the matching index, 579 00:37:32,088 --> 00:37:36,337 in the spatial object, is what we need in our data. 580 00:37:36,337 --> 00:37:39,662 It's the label and the ID, or the link to Wikidata, 581 00:37:40,222 --> 00:37:43,874 the geo-coordinates, and the type from Wikidata [inaudible], as well. 582 00:37:44,194 --> 00:37:50,488 But also, very important for the matching: the aliases and the broader entity-- 583 00:37:50,488 --> 00:37:54,138 and this is also an example where the name of the broader entity 584 00:37:54,138 --> 00:37:57,875 and the district itself are very similar. 585 00:37:57,937 --> 00:38:03,096 So, it's important to have some type information, as well, 586 00:38:03,096 --> 00:38:04,606 for the matching. 587 00:38:04,900 --> 00:38:07,900 So, the nationwide results were very good. 588 00:38:07,900 --> 00:38:11,110 We could automatically match more than 99% of records 589 00:38:11,110 --> 00:38:12,265 with this approach. 590 00:38:13,885 --> 00:38:16,356 But these were only 92% of the strings. 591 00:38:16,540 --> 00:38:18,140 So, obviously, the rest-- 592 00:38:18,140 --> 00:38:20,610 those strings that only occurred one or two times-- 593 00:38:20,610 --> 00:38:22,419 often didn't appear in Wikidata. 594 00:38:22,419 --> 00:38:26,309 And so, we had to do a lot of work on those in the [long tail]. 595 00:38:27,905 --> 00:38:32,039 And for around 1,000 strings, the matching was incorrect. 596 00:38:32,114 --> 00:38:34,950 But the catalogers did a lot of work in the Aleph catalog, 597 00:38:34,950 --> 00:38:39,869 but also in Wikidata: they made more than 6,000 manual edits to Wikidata 598 00:38:39,869 --> 00:38:45,019 to reach 100% coverage, by adding aliases and type information, 599 00:38:45,085 --> 00:38:46,615 and creating new entries. 600 00:38:46,615 --> 00:38:49,100 Okay, so, I have to speed up. 601 00:38:49,546 --> 00:38:54,295 We created the classification based on this, on the hierarchical statements. 602 00:38:54,295 --> 00:38:58,580 P131 is the main property there.
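The actual index query evolved a lot and its history is on GitHub; a minimal sketch of the kind of query it might have started from (Q1198 is North Rhine-Westphalia):

```sparql
# Sketch: places in North Rhine-Westphalia with the fields the matching index
# needs--label, aliases, type, geo-coordinates, and the broader area (P131).
SELECT ?place ?placeLabel ?alias ?type ?coords ?broader WHERE {
  ?place wdt:P131+ wd:Q1198 ;                # transitively located in NRW
         wdt:P31 ?type .                     # the place's type
  OPTIONAL { ?place wdt:P625 ?coords . }     # geo-coordinates
  OPTIONAL { ?place wdt:P131 ?broader . }    # direct superior administrative area
  OPTIONAL { ?place skos:altLabel ?alias .
             FILTER(LANG(?alias) = "de") }   # German aliases, used for matching
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
```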
603 00:38:59,827 --> 00:39:02,495 We added the information to our data. 604 00:39:03,035 --> 00:39:06,525 So, we now have this in our data's spatial object-- 605 00:39:06,525 --> 00:39:11,535 and, as you can see, the link to Wikidata and the types are there, 606 00:39:12,625 --> 00:39:17,554 and here's the ID from the SKOS classification 607 00:39:17,554 --> 00:39:19,234 we built based on Wikidata. 608 00:39:20,034 --> 00:39:23,555 And you can see there are Q identifiers in there. 609 00:39:26,940 --> 00:39:29,286 Now, you can basically query our API 610 00:39:29,286 --> 00:39:34,051 with such a query using Wikidata URIs, 611 00:39:34,316 --> 00:39:38,627 and get back literature--in this example, about Cologne. 612 00:39:39,724 --> 00:39:45,675 Then we created a Wikidata property for NWBib and added those links 613 00:39:45,675 --> 00:39:50,995 from Wikidata to the classification-- batch loaded them with QuickStatements. 614 00:39:52,105 --> 00:39:53,634 And there's also a nice-- 615 00:39:53,634 --> 00:39:59,344 also a move to using a qualifier on this property 616 00:39:59,344 --> 00:40:02,994 to add the broader information there. 617 00:40:02,994 --> 00:40:06,333 So, I think people won't mess around with this as much 618 00:40:06,333 --> 00:40:09,223 as with the P131 statements. 619 00:40:10,094 --> 00:40:11,743 So, this is what it looks like. 620 00:40:12,563 --> 00:40:16,142 This will go to the classification where you can then start a query. 621 00:40:18,670 --> 00:40:23,293 Now, we have to build this update and review process, 622 00:40:23,293 --> 00:40:28,692 and we will add those data, like this, 623 00:40:28,692 --> 00:40:32,452 with a $0 subfield, to Aleph, 624 00:40:32,452 --> 00:40:36,962 and the catalogers will start using those Wikidata-based IDs-- 625 00:40:36,962 --> 00:40:41,012 URIs--for cataloging, for spatial indexing. 626 00:40:44,702 --> 00:40:50,082 So, by now, there are more than 400,000 NWBib entries with links to Wikidata, 627 00:40:50,082 --> 00:40:55,905 and more than 4,400 Wikidata entries with links to NWBib. 628 00:40:56,617 --> 00:40:58,042 Thank you. 629 00:40:58,042 --> 00:41:03,182 (applause) 630 00:41:07,574 --> 00:41:09,682 Thank you, Adrian. 631 00:41:13,312 --> 00:41:15,472 I got it. Thank you. 632 00:41:31,122 --> 00:41:34,402 So, as you've seen me before, I'm Hilary Thorsen. 633 00:41:34,402 --> 00:41:36,152 I'm the Wikimedian in Residence 634 00:41:36,152 --> 00:41:38,382 with the Linked Data for Production Project. 635 00:41:38,382 --> 00:41:39,942 I am based at Stanford, 636 00:41:39,942 --> 00:41:42,590 and I'm here today with my colleague, Lena Denis, 637 00:41:42,590 --> 00:41:45,581 who is Cartographic Assistant at Harvard Library. 638 00:41:45,581 --> 00:41:50,041 And Christine Fernsebner Eslao is here in spirit. 639 00:41:50,041 --> 00:41:53,530 She is currently back in Boston, but supporting us from afar. 640 00:41:53,530 --> 00:41:56,240 So, we'll be talking about Wikidata and libraries 641 00:41:56,240 --> 00:42:00,350 as partners in data production, organization, and project inspiration. 642 00:42:00,850 --> 00:42:04,300 And our work is part of the Linked Data for Production Project. 643 00:42:05,450 --> 00:42:08,190 So, Linked Data for Production is in its second phase, 644 00:42:08,190 --> 00:42:10,450 called Pathway for Implementation. 645 00:42:10,450 --> 00:42:13,291 And it's an Andrew W.
Mellon Foundation grant, 646 00:42:13,291 --> 00:42:16,120 involving the partnership of several universities, 647 00:42:16,120 --> 00:42:20,280 with the goal of constructing a pathway for shifting the cataloging community 648 00:42:20,280 --> 00:42:24,860 to begin describing library resources with linked data. 649 00:42:24,860 --> 00:42:26,919 And it builds upon a previous grant, 650 00:42:26,919 --> 00:42:30,369 but this iteration is focused on the practical aspects 651 00:42:30,369 --> 00:42:32,009 of the transition. 652 00:42:33,559 --> 00:42:35,650 One of these pathways of investigation 653 00:42:35,650 --> 00:42:39,000 has been integrating library metadata with Wikidata. 654 00:42:39,429 --> 00:42:41,054 We have a lot of questions, 655 00:42:41,054 --> 00:42:42,999 but some of the ones we're most interested in 656 00:42:42,999 --> 00:42:46,180 are how we can integrate library metadata with Wikidata, 657 00:42:46,180 --> 00:42:49,580 and make contribution a part of our cataloging workflows, 658 00:42:49,580 --> 00:42:53,589 how Wikidata can help us improve our library discovery environment, 659 00:42:53,589 --> 00:42:55,929 how it can help us reveal more relationships 660 00:42:55,929 --> 00:42:59,629 and connections within our data and with external data sets, 661 00:42:59,629 --> 00:43:04,370 and if we have connections in our own data that can be added to Wikidata, 662 00:43:04,370 --> 00:43:07,480 how libraries can help fill in gaps in Wikidata, 663 00:43:07,480 --> 00:43:09,969 and how libraries can work with local communities 664 00:43:09,969 --> 00:43:13,070 to describe library and archival resources. 665 00:43:14,010 --> 00:43:17,129 Finding answers to these questions has focused on the mutual benefit 666 00:43:17,129 --> 00:43:19,649 for the library and Wikidata communities. 667 00:43:19,649 --> 00:43:22,949 We've learned, through starting to work on our different Wikidata projects, 668 00:43:22,949 --> 00:43:25,279 that many of the issues libraries grapple with, 669 00:43:25,279 --> 00:43:29,451 like data modeling, identity management, data maintenance, documentation, 670 00:43:29,451 --> 00:43:31,289 and instruction on linked data, 671 00:43:31,289 --> 00:43:33,970 are ones the Wikidata community works on too. 672 00:43:34,370 --> 00:43:36,099 I'm going to turn things over to Lena 673 00:43:36,099 --> 00:43:39,640 to talk about what she's been working on now. 674 00:43:46,550 --> 00:43:51,040 Hi, so, as Hilary briefly mentioned, I work as a map librarian at Harvard, 675 00:43:51,040 --> 00:43:54,180 where I process maps, atlases, and archives for our online catalog. 676 00:43:54,180 --> 00:43:56,580 And while processing two-dimensional cartographic works 677 00:43:56,580 --> 00:43:59,572 is relatively straightforward, cataloging archival collections 678 00:43:59,572 --> 00:44:02,429 so that their cartographic resources can be made discoverable 679 00:44:02,429 --> 00:44:04,119 has always been more difficult. 680 00:44:04,119 --> 00:44:06,989 So, my use case for Wikidata is visually modeling relationships 681 00:44:06,989 --> 00:44:10,389 between archival collections and the individual items within them, 682 00:44:10,389 --> 00:44:13,210 as well as between archival drafts and published works. 683 00:44:13,359 --> 00:44:17,329 So, I used Wikidata to highlight the work of a cartographer named Erwin Raisz, 684 00:44:17,329 --> 00:44:19,890 who worked at Harvard in the early 20th century.
685 00:44:19,890 --> 00:44:22,539 He was known for his vividly detailed and artistic landforms, 686 00:44:22,539 --> 00:44:23,939 like this one on the screen-- 687 00:44:23,939 --> 00:44:26,294 but also for inventing the armadillo projection, 688 00:44:26,294 --> 00:44:29,020 writing the first cartography textbook in English, 689 00:44:29,020 --> 00:44:31,318 and making various other important contributions 690 00:44:31,318 --> 00:44:32,919 to the field of geography. 691 00:44:32,919 --> 00:44:34,609 And at the Harvard Map Collection, 692 00:44:34,609 --> 00:44:38,509 we have a 66-item collection of Raisz's field notebooks, 693 00:44:38,509 --> 00:44:41,359 which begin when he was a student and end just before his death. 694 00:44:43,679 --> 00:44:46,229 So, this is the collection-level record that I made for them, 695 00:44:46,229 --> 00:44:47,994 which merely gives an overview, 696 00:44:47,994 --> 00:44:50,513 but his notebooks are full of information 697 00:44:50,513 --> 00:44:53,351 that he used in later atlases, maps, and textbooks. 698 00:44:53,351 --> 00:44:56,313 But researchers don't know how to find that trajectory information, 699 00:44:56,313 --> 00:44:58,665 and the system is not designed to show them. 700 00:45:01,030 --> 00:45:03,734 So, I felt that with Wikidata, and other Wikimedia platforms, 701 00:45:03,734 --> 00:45:05,154 I'd be able to take advantage 702 00:45:05,154 --> 00:45:08,075 of information that already exists about him on the open web, 703 00:45:08,075 --> 00:45:10,629 along with library records and a notebook inventory 704 00:45:10,629 --> 00:45:12,574 that I had made in an Excel spreadsheet, 705 00:45:12,574 --> 00:45:15,416 to show relationships and influences between his works. 706 00:45:15,574 --> 00:45:18,594 So here, you can see how I edited and reconciled library data 707 00:45:18,594 --> 00:45:20,165 in OpenRefine. 708 00:45:20,165 --> 00:45:23,164 And then, I used QuickStatements to batch import my results. 709 00:45:23,304 --> 00:45:25,244 So, now, I was ready to create knowledge graphs 710 00:45:25,244 --> 00:45:27,864 with SPARQL queries to show patterns of influence. 711 00:45:30,084 --> 00:45:33,304 The examples here show how I leveraged Wikimedia Commons images 712 00:45:33,304 --> 00:45:34,664 that I connected to him, 713 00:45:34,664 --> 00:45:36,459 and the hierarchy of some of his works 714 00:45:36,459 --> 00:45:38,604 that were contributing factors to other works. 715 00:45:38,604 --> 00:45:42,354 So, modeling Raisz's works on Wikidata allowed me to encompass in a single image, 716 00:45:42,354 --> 00:45:45,890 or in this case, in two images, the connections that require many pages 717 00:45:45,890 --> 00:45:47,864 of bibliographic data to reveal. 718 00:45:51,684 --> 00:45:55,544 So, this video is going to load. 719 00:45:55,563 --> 00:45:57,233 Yes! Alright. 720 00:45:57,233 --> 00:46:00,113 This video is a minute-and-a-half-long screencast I made 721 00:46:00,113 --> 00:46:02,033 that I'm going to narrate as you watch. 722 00:46:02,033 --> 00:46:05,423 It shows the process of inputting and then running a SPARQL query, 723 00:46:05,423 --> 00:46:09,283 showing hierarchical relationships between notebooks, an atlas, and a map 724 00:46:09,283 --> 00:46:11,033 that Raisz created about Cuba. 725 00:46:11,033 --> 00:46:12,603 He worked there before the revolution, 726 00:46:12,603 --> 00:46:14,633 so he had the unique position of having support 727 00:46:14,633 --> 00:46:17,013 from both the American and the Cuban governments.
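The query from the screencast is not reproduced here; a minimal sketch of the general pattern, assuming the works are linked with "part of" (P361) and that the creator item carries the English label "Erwin Raisz", might be:

```sparql
#defaultView:Graph
# Sketch: the part-of hierarchy around works created by Erwin Raisz.
SELECT ?work ?workLabel ?whole ?wholeLabel WHERE {
  ?creator rdfs:label "Erwin Raisz"@en .   # assumes this English label exists
  ?work wdt:P170 ?creator ;                # creator: Erwin Raisz
        wdt:P361 ?whole .                  # part of a larger unit
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```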
728 00:46:17,334 --> 00:46:20,583 So, I made this query as an example to show people who work on Raisz, 729 00:46:20,583 --> 00:46:24,134 and who are interested in narrowing down what materials they'd like to request 730 00:46:24,134 --> 00:46:26,154 when they come to us for research. 731 00:46:26,154 --> 00:46:29,684 To make the approach replicable for other archival collections, 732 00:46:29,684 --> 00:46:33,105 I hope that Harvard and other institutions will prioritize Wikidata look-ups 733 00:46:33,105 --> 00:46:35,414 as they move to linked data cataloging production, 734 00:46:35,414 --> 00:46:37,520 which my co-presenters can speak to the progress on 735 00:46:37,520 --> 00:46:38,854 better than I can. 736 00:46:38,854 --> 00:46:41,543 But my work has brought to mind a particular issue 737 00:46:41,543 --> 00:46:46,580 that I see as a future opportunity, which is that of archival modeling. 738 00:46:47,369 --> 00:46:52,302 So, to an archivist, an item is a discrete archival material 739 00:46:52,302 --> 00:46:55,000 within a larger collection of archival materials 740 00:46:55,000 --> 00:46:56,884 that is not a physical location. 741 00:46:56,884 --> 00:47:00,663 So an archivist from the American National Archives and Records Administration, 742 00:47:00,663 --> 00:47:02,943 who is also a Wikidata enthusiast, 743 00:47:02,943 --> 00:47:05,742 advised me when I was trying to determine how to express this, 744 00:47:05,742 --> 00:47:07,734 using an example item 745 00:47:07,734 --> 00:47:10,456 that I'm going to show as soon as this video is finally over. 746 00:47:11,433 --> 00:47:14,391 Alright. Great. 747 00:47:20,437 --> 00:47:22,100 Nope, that's not what I wanted. 748 00:47:22,135 --> 00:47:23,536 Here we go. 749 00:47:31,190 --> 00:47:32,280 It's doing that. 750 00:47:32,280 --> 00:47:34,154 (humming) 751 00:47:34,208 --> 00:47:37,418 Nope. Sorry. Sorry. 752 00:47:40,444 --> 00:47:43,045 Alright, I don't know why it's not going full screen again. 753 00:47:43,045 --> 00:47:44,329 I can't get it to do anything. 754 00:47:44,329 --> 00:47:46,880 But this is the-- oh, my gosh. 755 00:47:46,880 --> 00:47:48,235 Stop that. Alright. 756 00:47:48,235 --> 00:47:51,195 So, this is the item that I mentioned. 757 00:47:51,575 --> 00:47:53,655 So, this was what the archivist 758 00:47:53,655 --> 00:47:55,964 from the National Archives and Records Administration 759 00:47:55,964 --> 00:47:57,414 showed me as an example. 760 00:47:57,414 --> 00:48:02,414 And he recommended this compromise, which is to use the "part of" property 761 00:48:02,414 --> 00:48:05,614 to connect a lower-level description to a higher-level description, 762 00:48:05,614 --> 00:48:08,534 which allows the relationships between different hierarchical levels 763 00:48:08,534 --> 00:48:10,840 to be asserted as statements and qualifiers. 764 00:48:10,840 --> 00:48:12,884 So, in this example that's on screen, 765 00:48:12,884 --> 00:48:16,294 the relationships between an item, a series, a collection, and a record group 766 00:48:16,294 --> 00:48:19,655 are thus contained and described within a Wikidata item entity. 767 00:48:19,655 --> 00:48:22,024 So, I followed this model in my work on Raisz. 768 00:48:22,704 --> 00:48:26,024 And one of my images is missing. 769 00:48:26,024 --> 00:48:27,971 No, it's not. It's right there. I'm sorry. 770 00:48:28,210 --> 00:48:30,613 And so, I followed this model in my work on Raisz, 771 00:48:30,613 --> 00:48:33,103 but I look forward to further standardization.
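A side benefit of that compromise is that the hierarchy is easy to query back out: because each level points upward with "part of" (P361), a transitive property path recovers the whole chain. A minimal sketch, with a placeholder Q-id:

  # Sketch: list every level (series, collection, record group) above an
  # archival item by following "part of" (P361) transitively.
  # wd:Q_ARCHIVAL_ITEM is a placeholder, not a real entity.
  SELECT ?ancestor ?ancestorLabel WHERE {
    wd:Q_ARCHIVAL_ITEM wdt:P361+ ?ancestor .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }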
772 00:48:38,983 --> 00:48:41,352 So, another archival project Harvard is working on 773 00:48:41,352 --> 00:48:44,632 is the Arthur Freedman collection of more than 2,000 hours 774 00:48:44,632 --> 00:48:48,702 of punk rock performances from the 1970s to early 2000s 775 00:48:48,702 --> 00:48:51,970 in the Boston and Cambridge, Massachusetts areas. 776 00:48:51,970 --> 00:48:55,145 It includes many bands and venues that no longer exist. 777 00:48:55,604 --> 00:48:59,505 So far, work has been done in OpenRefine on reconciliation of the bands and venues 778 00:48:59,505 --> 00:49:02,324 to see which need an item created in Wikidata. 779 00:49:02,886 --> 00:49:05,964 Basic items will be created via a batch process next spring, 780 00:49:05,964 --> 00:49:08,697 and then, an edit-a-thon will be held in conjunction 781 00:49:08,697 --> 00:49:12,254 with the New England Music Library Association's meeting in Boston 782 00:49:12,254 --> 00:49:15,866 to focus on adding more statements to the batch-created items 783 00:49:15,866 --> 00:49:18,937 by drawing on local music community knowledge. 784 00:49:18,937 --> 00:49:22,086 We're interested in learning more about models for pairing librarians 785 00:49:22,086 --> 00:49:26,310 and Wiki enthusiasts with new contributors who have domain knowledge. 786 00:49:26,297 --> 00:49:29,293 Items will eventually be linked to digitized video 787 00:49:29,293 --> 00:49:31,387 in Harvard's digital collection platform 788 00:49:31,387 --> 00:49:33,167 once rights have been cleared with artists, 789 00:49:33,167 --> 00:49:35,147 which will likely be a slow process. 790 00:49:36,327 --> 00:49:38,030 There's also a great amount of interest 791 00:49:38,030 --> 00:49:41,680 in moving away from manual cataloging and creation of authority data 792 00:49:41,680 --> 00:49:43,247 towards identity management, 793 00:49:43,247 --> 00:49:45,667 where descriptions can be created in batches. 794 00:49:45,667 --> 00:49:48,057 An additional project focused on 795 00:49:48,057 --> 00:49:51,297 creating international standard name identifiers, or ISNIs, 796 00:49:51,297 --> 00:49:53,477 for avant-garde and women filmmakers 797 00:49:53,477 --> 00:49:57,657 can be adapted for creating Wikidata items for these filmmakers as well. 798 00:49:57,657 --> 00:50:01,076 Spreadsheets with the ISNIs, filmmaker names, and other details 799 00:50:01,076 --> 00:50:04,697 can be reconciled in OpenRefine and uploaded with QuickStatements. 800 00:50:04,910 --> 00:50:06,940 Once people and organizations have been described, 801 00:50:06,940 --> 00:50:09,316 we'll move toward describing the films in Wikidata, 802 00:50:09,316 --> 00:50:12,526 which will likely present some additional modeling challenges. 803 00:50:13,446 --> 00:50:15,486 A library presentation wouldn't be complete 804 00:50:15,486 --> 00:50:16,882 without a MARC record. 805 00:50:16,882 --> 00:50:19,916 Here, you can see the record for Karen Aqua's taxonomy film, 806 00:50:19,916 --> 00:50:22,096 where her ISNI and Wikidata Q number 807 00:50:22,096 --> 00:50:24,176 have been added to the 100 field. 808 00:50:24,176 --> 00:50:26,636 The ISNIs and Wikidata Q numbers that have been created 809 00:50:26,636 --> 00:50:30,066 can then be batch added back into MARC records via MarcEdit.
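One way to fetch those Q numbers in bulk is a SPARQL lookup keyed on the ISNI property (P213); a sketch with placeholder identifier values:

  # Sketch: map a batch of ISNIs to existing Wikidata items, e.g. to pull
  # Q numbers back into MARC records with MarcEdit. The ISNI values below
  # are placeholders (Wikidata formats P213 values with spaces).
  SELECT ?isni ?person ?personLabel WHERE {
    VALUES ?isni { "0000 0000 0000 0001" "0000 0000 0000 0002" }
    ?person wdt:P213 ?isni .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }

ISNIs that come back with no match are also the ones that still need items created, so the same query can double as a reconciliation check.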
810 00:50:30,066 --> 00:50:33,236 You might be asking why I'm showing you this ugly MARC record, 811 00:50:33,236 --> 00:50:35,596 instead of some beautiful linked data statements. 812 00:50:35,596 --> 00:50:38,576 And that's because our libraries will be working in a hybrid environment 813 00:50:38,576 --> 00:50:39,896 for some time. 814 00:50:39,896 --> 00:50:42,326 Our library catalogs still rely on MARC records, 815 00:50:42,326 --> 00:50:44,076 so by adding in these URIs, 816 00:50:44,076 --> 00:50:46,366 we can try to take advantage of linked data 817 00:50:46,366 --> 00:50:48,346 while our systems still use MARC. 818 00:50:49,496 --> 00:50:52,950 Adding URIs into MARC records makes an additional aspect 819 00:50:52,950 --> 00:50:54,335 of our project possible. 820 00:50:54,335 --> 00:50:56,894 Work has been done at Stanford and Cornell to bring data 821 00:50:56,894 --> 00:51:01,873 from Wikidata into our library catalog using URIs already in our MARC records. 822 00:51:02,334 --> 00:51:05,090 You can see an example of a knowledge panel, 823 00:51:05,090 --> 00:51:06,984 where all the data is sourced from Wikidata 824 00:51:06,984 --> 00:51:11,004 and links back to the item itself, along with an invitation to contribute. 825 00:51:11,403 --> 00:51:15,130 This is currently in a test environment, not in production in our catalog. 826 00:51:15,130 --> 00:51:17,444 Ideally, eventually, these will be generated 827 00:51:17,444 --> 00:51:19,916 from linked data descriptions of library resources 828 00:51:19,916 --> 00:51:22,954 created using Sinopia, our linked data editor 829 00:51:22,954 --> 00:51:24,563 developed for cataloging. 830 00:51:24,563 --> 00:51:27,994 We found that adding a look-up to Wikidata in Sinopia is difficult. 831 00:51:27,994 --> 00:51:31,514 The scale and modeling of Wikidata make it hard to partition the data 832 00:51:31,514 --> 00:51:33,544 to be able to look up typed entities, 833 00:51:33,544 --> 00:51:34,900 and we've run into the problem 834 00:51:34,900 --> 00:51:37,493 of SPARQL not being good for keyword search 835 00:51:37,493 --> 00:51:41,883 while wanting our keyword APIs to return SPARQL-like RDF descriptions (one common workaround is sketched below). 836 00:51:41,883 --> 00:51:45,043 So, as you can see, we still have quite a bit of work to do. 837 00:51:45,043 --> 00:51:47,937 This round of the grant runs until June 2020, 838 00:51:47,937 --> 00:51:50,163 so we'll be continuing our exploration. 839 00:51:50,163 --> 00:51:53,113 And I just wanted to invite anyone 840 00:51:53,113 --> 00:51:57,573 who has a continuing interest in talking about Wikidata and libraries: 841 00:51:57,573 --> 00:52:01,454 I lead a Wikidata Affinity Group that's open to anyone to join. 842 00:52:01,454 --> 00:52:03,013 We meet every two weeks, 843 00:52:03,013 --> 00:52:05,513 and our next call is Tuesday, November the 5th, 844 00:52:05,513 --> 00:52:08,073 so if you're interested in continuing discussions, 845 00:52:08,073 --> 00:52:10,393 I would love to talk with you further. 846 00:52:10,393 --> 00:52:11,890 Thank you, everyone. 847 00:52:11,890 --> 00:52:13,623 And thank you to the other presenters 848 00:52:13,623 --> 00:52:16,893 for talking about all of their wonderful projects. 849 00:52:16,893 --> 00:52:21,283 (applause)
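On the keyword-search problem mentioned above: one commonly used workaround in the Wikidata Query Service, not necessarily the approach the grant team tried, is the wikibase:mwapi SERVICE, which runs a keyword search inside a SPARQL query so the hits can then be described with ordinary triples. A minimal sketch:

  # Sketch: keyword search via the EntitySearch MediaWiki API from inside
  # SPARQL, then describe the hits in RDF terms. "Erwin Raisz" is just an
  # example search input.
  SELECT ?item ?itemLabel ?type ?typeLabel WHERE {
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:api "EntitySearch" ;
                      wikibase:endpoint "www.wikidata.org" ;
                      mwapi:search "Erwin Raisz" ;
                      mwapi:language "en" .
      ?item wikibase:apiOutputItem mwapi:item .
    }
    OPTIONAL { ?item wdt:P31 ?type . }   # "instance of", for typed filtering
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }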