WEBVTT
00:00:07.133 --> 00:00:11.738
I work as a teacher
at the University of Alicante,
00:00:11.738 --> 00:00:17.040
where I recently obtained my PhD
on data libraries and linked open data.
00:00:17.040 --> 00:00:19.038
And I'm also a software developer
00:00:19.038 --> 00:00:21.718
at the Biblioteca Virtual
Miguel de Cervantes.
00:00:21.718 --> 00:00:24.467
And today, I'm going to talk
about data quality.
00:00:28.252 --> 00:00:31.527
Well, those are my colleagues
at the university.
00:00:32.457 --> 00:00:36.727
And as you may know, many organizations
are publishing their data
00:00:36.727 --> 00:00:38.447
as linked open data--
00:00:38.447 --> 00:00:41.437
for example,
the National Library of France,
00:00:41.437 --> 00:00:45.947
the National Library of Spain,
us, which is Cervantes Virtual,
00:00:45.947 --> 00:00:49.007
the British National Bibliography,
00:00:49.007 --> 00:00:51.667
the Library of Congress and Europeana.
00:00:51.667 --> 00:00:56.000
All of them provide a SPARQL endpoint,
00:00:56.000 --> 00:00:58.875
which is useful
for retrieving the data.
00:00:59.104 --> 00:01:00.984
And if I'm not wrong,
00:01:00.984 --> 00:01:05.890
the Library of Congress only provides
its data as a dump, which you can't query directly.
00:01:07.956 --> 00:01:13.787
When we published our repository
as linked open data,
00:01:13.787 --> 00:01:17.475
my idea was for it to be reused
by other institutions.
00:01:17.981 --> 00:01:24.000
But what if I'm an institution
that wants to enrich its data
00:01:24.000 --> 00:01:27.435
with data from other data libraries?
00:01:27.574 --> 00:01:30.674
Which data set should I use?
00:01:30.674 --> 00:01:34.314
Which data set is better
in terms of quality?
00:01:36.874 --> 00:01:41.314
The benefits of the evaluation
of data quality in libraries are many.
00:01:41.314 --> 00:01:47.143
For example, methodologies can be improved
in order to include new criteria,
00:01:47.182 --> 00:01:49.162
with which to assess the quality.
00:01:49.162 --> 00:01:54.592
And also, organizations can benefit
from best practices and guidelines
00:01:54.602 --> 00:01:58.270
in order to publish their data
as linked open data.
00:02:00.012 --> 00:02:03.462
What do we need
in order to assess the quality?
00:02:03.462 --> 00:02:06.862
Well, obviously, a set of candidates
and a set of features.
00:02:06.862 --> 00:02:10.077
For example, do they have
a SPARQL endpoint,
00:02:10.077 --> 00:02:13.132
do they have a web interface,
how many publications do they have,
00:02:13.132 --> 00:02:18.092
how many vocabularies do they use,
how many Wikidata properties do they have,
00:02:18.092 --> 00:02:20.892
and where can I get those candidates?
00:02:20.892 --> 00:02:22.472
I used the LOD Cloud--
00:02:22.472 --> 00:02:27.422
but when I was doing this slide,
I thought about using Wikidata
00:02:27.562 --> 00:02:29.746
in order to retrieve those candidates.
00:02:29.746 --> 00:02:34.295
For example, getting entities
of type data library,
00:02:34.295 --> 00:02:36.473
which have a SPARQL endpoint.
00:02:36.473 --> 00:02:38.693
You have here the link.
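The retrieval step described here, asking Wikidata for entities of type data library that expose a SPARQL endpoint, might look roughly like this Python sketch. The IDs are assumptions, since the exact query isn't shown in the captions: Q212805 ("digital library") stands in for the data-library class, and P5305 ("SPARQL endpoint URL") for the endpoint property.

```python
from textwrap import dedent

def candidate_query(lib_class: str = "Q212805", endpoint_prop: str = "P5305") -> str:
    """Build a SPARQL query for candidate data libraries.

    Assumed IDs (verify on Wikidata): Q212805 = digital library,
    P5305 = SPARQL endpoint URL.
    """
    return dedent(f"""\
        SELECT ?library ?libraryLabel ?endpoint WHERE {{
          ?library wdt:P31/wdt:P279* wd:{lib_class} ;
                   wdt:{endpoint_prop} ?endpoint .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}""")

print(candidate_query())
```

The resulting string can be pasted into query.wikidata.org or sent to its /sparql endpoint.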
00:02:41.453 --> 00:02:45.083
And I came up with these data libraries.
00:02:45.104 --> 00:02:50.233
The first one uses the Bibliographic Ontology
as its main vocabulary,
00:02:50.233 --> 00:02:54.122
and the others are based,
more or less, on FRBR,
00:02:54.122 --> 00:02:57.180
which is a vocabulary published by IFLA.
00:02:57.180 --> 00:03:00.013
And this is just an example
of how we could compare
00:03:00.013 --> 00:03:04.393
data libraries using
bubble charts on Wikidata.
00:03:04.393 --> 00:03:08.613
And this is just an example comparing
how many Wikidata properties
00:03:08.613 --> 00:03:10.633
each data library has.
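A comparison like the one on the slide can be approximated by counting, for each library's external-identifier property, how many Wikidata items carry it. The property IDs below are real authority-control properties (P268 BnF ID, P950 BNE ID, P244 Library of Congress authority ID), but the mapping itself is an illustrative assumption, not the speaker's exact list.

```python
# Illustrative mapping of data libraries to their Wikidata external-ID properties.
LIBRARY_PROPS = {
    "BnF": "P268",                  # Bibliothèque nationale de France ID
    "BNE": "P950",                  # Biblioteca Nacional de España ID
    "Library of Congress": "P244",  # Library of Congress authority ID
}

def usage_count_query(prop: str) -> str:
    """SPARQL that counts how many items carry the given identifier property."""
    return f"SELECT (COUNT(?item) AS ?n) WHERE {{ ?item wdt:{prop} ?id . }}"

for name, prop in LIBRARY_PROPS.items():
    print(name, "->", usage_count_query(prop))
```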
00:03:13.483 --> 00:03:15.980
Well, how can we measure quality?
00:03:15.928 --> 00:03:17.972
There are different methodologies,
00:03:17.972 --> 00:03:19.726
for example, FRBR 1,
00:03:19.726 --> 00:03:24.337
which provides a set of criteria
grouped by dimensions,
00:03:24.337 --> 00:03:27.556
and those in green
are the ones that I found--
00:03:27.556 --> 00:03:30.917
that I could assess by means of Wikidata.
00:03:33.870 --> 00:03:39.397
And we also found that we
could define new criteria,
00:03:39.397 --> 00:03:44.567
for example, a new one to evaluate
the number of duplications in Wikidata.
00:03:45.047 --> 00:03:47.206
We use those properties.
00:03:47.206 --> 00:03:50.098
And this is an example of a SPARQL query
00:03:50.098 --> 00:03:54.486
in order to count the number
of duplicates per property.
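One way to express such a duplicates criterion, assuming it is based on P460 ("said to be the same as"), is to count items that both carry a given library's identifier and are flagged as possibly the same as another item. This is a sketch, not the query from the slide.

```python
def duplicate_count_query(id_prop: str) -> str:
    """Count items with the library's identifier (id_prop) that are linked
    by P460 ("said to be the same as") to another item, i.e. likely duplicates.
    """
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?dups) WHERE {\n"
        f"  ?item wdt:{id_prop} ?id ;\n"
        "        wdt:P460 ?other .\n"
        "}"
    )

print(duplicate_count_query("P268"))  # e.g. duplicates among BnF-linked items
```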
00:03:57.136 --> 00:04:00.366
And about the results,
well, at the time of doing this study,
00:04:00.366 --> 00:04:05.216
not of the slides, there was no property
for the British National Bibliography.
00:04:05.860 --> 00:04:08.260
They don't provide provenance information,
00:04:08.260 --> 00:04:11.536
which could be useful
for metadata enrichment.
00:04:11.536 --> 00:04:14.660
And they don't allow users
to edit the information.
00:04:14.660 --> 00:04:17.166
So, we've been talking
about Wikibase the whole weekend,
00:04:17.166 --> 00:04:21.396
and maybe we should try to adopt
Wikibase as an interface.
00:04:23.186 --> 00:04:25.436
And they are focused on their own content,
00:04:25.436 --> 00:04:28.856
and this is just the SPARQL query
based on Wikidata
00:04:28.856 --> 00:04:31.411
in order to assess the population.
00:04:32.066 --> 00:04:36.006
And the BnF provides labels
in multiple languages,
00:04:36.006 --> 00:04:38.956
and they all use self-describing URIs,
00:04:38.956 --> 00:04:43.058
meaning that in the URI,
they include the type of entity,
00:04:43.058 --> 00:04:48.406
which allows the human reader
to understand what they are looking at.
00:04:51.499 --> 00:04:55.256
And more results: they provide
different output formats,
00:04:55.256 --> 00:04:58.646
they use external vocabularies.
00:04:58.854 --> 00:05:01.116
Only the British National Bibliography
00:05:01.116 --> 00:05:03.734
provides machine-readable
licensing information.
00:05:03.734 --> 00:05:09.124
And up to one-third of the instances
are connected to external repositories,
00:05:09.124 --> 00:05:11.225
which is really nice.
00:05:12.604 --> 00:05:18.290
And this study, this work,
has been done in our Labs team.
00:05:18.364 --> 00:05:22.391
A lab in a GLAM is a group of people
00:05:22.391 --> 00:05:27.520
who want to explore new ways
00:05:27.587 --> 00:05:30.306
of reusing data collections.
00:05:31.039 --> 00:05:35.054
And there's a community
led by the British Library,
00:05:35.054 --> 00:05:37.366
and in particular, Mahendra Mahey,
00:05:37.366 --> 00:05:40.610
and we had a first event in London,
00:05:40.610 --> 00:05:42.601
and another one in Copenhagen,
00:05:42.601 --> 00:05:45.279
and we're going to have a new one in May
00:05:45.279 --> 00:05:48.240
at the Library of Congress in Washington.
00:05:48.528 --> 00:05:52.481
And we are now 250 people.
00:05:52.481 --> 00:05:56.421
And I'm so glad that I found
somebody here at the WikidataCon
00:05:56.421 --> 00:05:58.860
who has just joined us--
00:05:58.860 --> 00:06:01.160
Sylvia from [inaudible], Mexico.
00:06:01.160 --> 00:06:04.509
And I'd like to invite you
to our community,
00:06:04.509 --> 00:06:09.719
since you may be part
of a GLAM institution.
00:06:10.659 --> 00:06:13.164
So, we can talk later
if you want to know about this.
00:06:14.589 --> 00:06:16.719
And this--it's all about people.
00:06:16.719 --> 00:06:19.669
This is me, people
from the British Library,
00:06:19.669 --> 00:06:24.629
Library of Congress, universities,
and national libraries in Europe.
00:06:24.871 --> 00:06:28.050
And there's a link here
in case you want to know more.
00:06:28.433 --> 00:06:32.655
And, well, last month,
we decided to meet in Doha
00:06:32.655 --> 00:06:37.448
in order to write a book
about how to create a lab in our GLAM.
00:06:38.585 --> 00:06:43.279
And they chose 15 people,
and I was so lucky to be there.
00:06:45.314 --> 00:06:48.594
And the book follows
the Booksprint methodology,
00:06:48.594 --> 00:06:51.674
which means that nothing
is prepared beforehand.
00:06:51.674 --> 00:06:53.495
All is done there in a week.
00:06:53.495 --> 00:06:55.725
And believe me, it was really hard work
00:06:55.725 --> 00:06:58.905
to have the whole book
done in that week.
00:06:59.890 --> 00:07:04.490
And I'd like to introduce you to the book,
which will be published--
00:07:04.490 --> 00:07:06.455
it was supposed to be published this week,
00:07:06.455 --> 00:07:08.274
but it will be next week.
00:07:08.974 --> 00:07:13.014
And it will be published openly,
so you can have it,
00:07:13.065 --> 00:07:15.668
and I can show you
a little bit later if you want.
00:07:15.734 --> 00:07:17.601
And those are the authors.
00:07:17.601 --> 00:07:19.678
I'm here-- I'm so happy, too.
00:07:19.678 --> 00:07:22.110
And those are the institutions--
00:07:22.110 --> 00:07:26.722
Library of Congress, British Library--
and this is the title.
00:07:27.330 --> 00:07:29.604
And now, I'd like to show you--
00:07:31.441 --> 00:07:33.971
a map that I'm working on.
00:07:34.278 --> 00:07:37.234
We are launching a website
for our community,
00:07:37.234 --> 00:07:42.893
and I'm in charge of creating a map
with our institutions there.
00:07:43.097 --> 00:07:44.860
This is not finished.
00:07:44.860 --> 00:07:50.276
But this is just SPARQL, and below,
00:07:51.546 --> 00:07:53.027
we see the map.
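A map like this is typically produced with the Query Service's `#defaultView:Map` directive plus each institution's coordinate location (P625). The item IDs in the VALUES clause below are illustrative guesses at two member institutions, not the community's actual list.

```python
def labs_map_query(qids):
    """SPARQL for the Wikidata Query Service that renders institutions on a map."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "#defaultView:Map\n"
        "SELECT ?inst ?instLabel ?coord WHERE {\n"
        f"  VALUES ?inst {{ {values} }}\n"
        "  ?inst wdt:P625 ?coord .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

# Hypothetical member list: British Library (Q23308), Library of Congress (Q131454)
print(labs_map_query(["Q23308", "Q131454"]))
```

Run in the Query Service, the `#defaultView:Map` comment makes the result render as a map rather than a table.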
00:07:53.027 --> 00:07:58.086
And we see here
the new people that I found
00:07:58.086 --> 00:08:00.486
at the WikidataCon--
I'm so happy for this.
00:08:00.621 --> 00:08:05.631
And we have here the data library
of my university,
00:08:05.681 --> 00:08:08.490
and many other institutions.
00:08:09.051 --> 00:08:10.940
Also, from Australia--
00:08:11.850 --> 00:08:13.061
if I can do it.
00:08:13.930 --> 00:08:15.711
Well, here, we have some links.
00:08:19.586 --> 00:08:21.088
There you go.
00:08:21.189 --> 00:08:23.059
Okay, this is not finished.
00:08:23.539 --> 00:08:26.049
We are still working on this,
and that's all.
00:08:26.057 --> 00:08:28.170
Thank you very much for your attention.
00:08:28.858 --> 00:08:33.683
(applause)
00:08:41.962 --> 00:08:48.079
[inaudible]
00:08:59.490 --> 00:09:00.870
Good morning, everybody.
00:09:00.870 --> 00:09:01.930
I'm Olaf Janssen.
00:09:01.930 --> 00:09:03.570
I'm the Wikimedia coordinator
00:09:03.570 --> 00:09:06.150
at the National Library
of the Netherlands.
00:09:06.310 --> 00:09:08.390
And I would like to share my work,
00:09:08.390 --> 00:09:11.610
which is about creating
Linked Open Data
00:09:11.640 --> 00:09:15.351
for Dutch Public Libraries using Wikidata.
00:09:17.600 --> 00:09:20.850
And my story starts roughly a year ago
00:09:20.850 --> 00:09:24.581
when I was at the GLAM Wiki conference
in Tel Aviv, in Israel.
00:09:25.301 --> 00:09:27.938
And there were two men
with very similar shirts,
00:09:27.938 --> 00:09:31.120
and equally similar hairdos, [Matt]...
00:09:31.120 --> 00:09:33.440
(laughter)
00:09:33.440 --> 00:09:35.325
And on the left, that's me.
00:09:35.325 --> 00:09:39.065
And a year ago, I didn't have
any practical knowledge and skills
00:09:39.065 --> 00:09:40.265
about Wikidata.
00:09:40.265 --> 00:09:43.285
I looked at Wikidata,
and I looked at the items,
00:09:43.285 --> 00:09:44.524
and I played with it.
00:09:44.524 --> 00:09:47.070
But I wasn't able to make a SPARQL query
00:09:47.070 --> 00:09:50.285
or to do data modeling
with the right shape expression.
00:09:51.305 --> 00:09:52.865
That's a year ago.
00:09:53.465 --> 00:09:57.065
And on the lefthand side,
that's Simon Cobb, user: Sic19.
00:09:57.304 --> 00:10:00.265
And I was talking to him,
because, just before,
00:10:00.525 --> 00:10:01.974
he had given a presentation
00:10:01.974 --> 00:10:06.374
about improving the coverage
of public libraries in Wikidata.
00:10:06.757 --> 00:10:08.934
And I was very inspired by his talk.
00:10:09.564 --> 00:10:13.355
And basically, he was talking
about adding basic data
00:10:13.355 --> 00:10:14.867
about public libraries.
00:10:14.867 --> 00:10:19.046
So, the name of the library, if available,
the photo of the building,
00:10:19.046 --> 00:10:21.497
the address data of the library,
00:10:21.497 --> 00:10:25.120
the geo-coordinates
latitude and longitude,
00:10:25.120 --> 00:10:26.367
and some other things,
00:10:26.367 --> 00:10:29.187
all with source references.
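Basic, sourced statements like these can be batch-added with QuickStatements. This sketch emits v1 rows for one library; the IDs are assumptions I believe match the list above (Q28564 "public library", P969 street address, P625 coordinate location, with S854 as the reference-URL source), and the sample values are hypothetical.

```python
def library_rows(qid, name_en, address, lat, lon, source_url):
    """QuickStatements v1 rows (tab-separated) for one public library item.

    Assumed IDs: Q28564 = public library, P969 = street address,
    P625 = coordinate location, S854 = reference URL (source).
    """
    return [
        f'{qid}\tLen\t"{name_en}"',                          # English label
        f"{qid}\tP31\tQ28564",                               # instance of
        f'{qid}\tP969\t"{address}"\tS854\t"{source_url}"',   # sourced address
        f'{qid}\tP625\t@{lat}/{lon}\tS854\t"{source_url}"',  # sourced coordinates
    ]

# Q4115189 is the Wikidata sandbox item; the rest is made-up example data.
for row in library_rows("Q4115189", "Example Library",
                        "Dorpsstraat 1, Ede", 52.04, 5.66,
                        "https://www.gidsvoornederland.nl"):
    print(row)
```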
00:10:31.317 --> 00:10:34.557
And what I was very impressed
about a year ago was this map.
00:10:34.557 --> 00:10:37.337
This is a map about
public libraries in the U.K.
00:10:37.337 --> 00:10:38.577
with all the colors.
00:10:38.577 --> 00:10:43.017
And you can see that all the libraries
are layered by library organizations.
00:10:43.017 --> 00:10:46.210
And when he showed this,
I was really, "Wow, that's cool."
00:10:46.637 --> 00:10:49.138
So, then, one minute later, I thought,
00:10:49.138 --> 00:10:52.918
"Well, let's do it
for my own country."
00:10:52.918 --> 00:10:54.850
(laughter)
00:10:57.149 --> 00:10:59.496
And something about public libraries
in the Netherlands--
00:10:59.496 --> 00:11:03.020
there are about 1,300 library
branches in our country,
00:11:03.020 --> 00:11:06.710
grouped into 160 library organizations.
00:11:07.723 --> 00:11:10.937
And you might wonder why
do I want to do this project?
00:11:10.997 --> 00:11:14.137
Well, first of all,
for the common good, for society,
00:11:14.137 --> 00:11:16.707
because I think using Wikidata,
00:11:16.707 --> 00:11:20.657
and from there,
creating Wikipedia articles,
00:11:20.657 --> 00:11:23.417
and opening it up
via the linked open data cloud--
00:11:23.417 --> 00:11:29.006
it's improving visibility and reusability
of public libraries in the Netherlands.
00:11:30.110 --> 00:11:32.197
And my second goal was actually
a more personal one,
00:11:32.197 --> 00:11:36.517
because a year ago, I had this
yearly evaluation with my manager,
00:11:37.243 --> 00:11:41.737
and we decided it was a good idea
that I got more practical skills
00:11:41.737 --> 00:11:45.853
on linked open data, data modeling,
and also on Wikidata.
00:11:46.464 --> 00:11:50.286
And of course, I wanted to be able to make
these kinds of maps myself.
00:11:50.286 --> 00:11:51.396
(laughter)
00:11:54.345 --> 00:11:57.100
Then you might wonder
why do I want to do this?
00:11:57.100 --> 00:12:01.723
Isn't there already enough basic
library data out there in the Netherlands
00:12:02.450 --> 00:12:04.233
to have a good coverage?
00:12:06.019 --> 00:12:08.367
So, let me show you some of the websites
00:12:08.367 --> 00:12:12.882
that are available to discover
address and location information
00:12:12.882 --> 00:12:14.505
about Dutch public libraries.
00:12:14.505 --> 00:12:17.722
And the first one is this one--
Gidsvoornederland.nl--
00:12:17.722 --> 00:12:20.641
and that's the official
public library inventory
00:12:20.641 --> 00:12:23.037
maintained by my library,
the National Library.
00:12:23.727 --> 00:12:29.160
And you can look up addresses
and geo-coordinates on that website.
00:12:30.493 --> 00:12:32.797
Then there is this site,
Bibliotheekinzicht--
00:12:32.797 --> 00:12:36.502
this is also an official website
maintained by my National Library.
00:12:36.502 --> 00:12:38.982
And this is about
public library statistics.
00:12:41.010 --> 00:12:43.933
Then there is another one,
debibliotheken.nl--
00:12:43.933 --> 00:12:46.005
as you can see, there is also
address information
00:12:46.005 --> 00:12:49.659
about library organizations,
not about individual branches.
00:12:51.724 --> 00:12:55.010
And there's even this one,
which also has address information.
00:12:56.546 --> 00:12:59.028
And of course, there's something
like Google Maps,
00:12:59.028 --> 00:13:02.157
which also has all the names
and the locations and the addresses.
00:13:03.455 --> 00:13:06.218
And this one, the International
Library of Technology,
00:13:06.218 --> 00:13:09.580
which has a worldwide
inventory of libraries,
00:13:09.646 --> 00:13:11.393
including the Netherlands.
00:13:13.058 --> 00:13:15.049
And I even discovered there is a data set
00:13:15.049 --> 00:13:18.423
you can buy for 50 euros or so
to download it.
00:13:18.423 --> 00:13:21.023
And there also seems to be--
I didn't download it,
00:13:21.023 --> 00:13:23.633
but there seems to be address
information available.
00:13:24.273 --> 00:13:30.180
You might wonder: is this kind of data
good enough for the purposes I had?
00:13:32.282 --> 00:13:37.372
So, this is my birthday list
for my ideal public library data list.
00:13:37.439 --> 00:13:39.105
And what's on my list?
00:13:39.173 --> 00:13:43.830
First of all, the data I want to have
must be up-to-date-ish--
00:13:43.830 --> 00:13:45.604
it must be fairly up-to-date.
00:13:45.604 --> 00:13:48.513
So, it doesn't have to be real time,
00:13:48.513 --> 00:13:51.323
but let's say, a couple
of months, or half a year,
00:13:53.284 --> 00:13:57.354
behind the official publication,
that's okay for my purposes.
00:13:58.116 --> 00:14:00.956
And I want to have it for both
the library branches
00:14:00.956 --> 00:14:02.697
and the library organizations.
00:14:04.206 --> 00:14:08.400
Then I want my data to be structured,
because it has to be machine-readable.
00:14:08.301 --> 00:14:11.986
It has to be in an open file format,
such as CSV, JSON, or RDF.
00:14:12.717 --> 00:14:15.197
And preferably, it has to be linked
to other resources.
00:14:16.011 --> 00:14:22.182
And the license on the data
needs to be explicitly public domain or CC0.
00:14:23.520 --> 00:14:26.192
Then, I would like my data to have an API,
00:14:26.599 --> 00:14:30.548
which must be public, free,
and preferably also anonymous
00:14:30.548 --> 00:14:34.900
so you don't have to use an API key
or register an account.
00:14:36.103 --> 00:14:38.863
And I also want to have
a SPARQL interface.
00:14:41.131 --> 00:14:43.651
So, now, these are all the sites
I just showed you.
00:14:43.717 --> 00:14:46.450
And I'm going to make a big grid.
00:14:47.337 --> 00:14:50.017
And then, this is about
the evaluation I did.
00:14:51.187 --> 00:14:54.166
I'm not going into it,
but there is no single column
00:14:54.166 --> 00:14:56.007
which has all green check marks.
00:14:56.007 --> 00:14:57.997
That's the important thing to take away.
00:14:58.967 --> 00:15:03.947
And so, in summary, there was no
public, free linked open data
00:15:03.947 --> 00:15:08.937
for Dutch public libraries available
before I started my project.
00:15:09.237 --> 00:15:13.027
So, this was the ideal motivation
to actually work on it.
00:15:14.730 --> 00:15:17.427
So, that's what I've been doing
for a year now.
00:15:17.717 --> 00:15:22.977
And I've been adding libraries bit by bit,
organization by organization to Wikidata.
00:15:23.417 --> 00:15:26.387
I also created a project website for it.
00:15:26.727 --> 00:15:29.567
It's still rather messy,
but it has all the information,
00:15:29.567 --> 00:15:33.240
and I try to keep it
as up-to-date as possible.
00:15:33.240 --> 00:15:36.277
And also all the SPARQL queries
you can see are linked from here.
00:15:38.002 --> 00:15:40.235
And I'm just adding
really basic information.
00:15:40.235 --> 00:15:44.097
You see the instances,
images if available,
00:15:44.097 --> 00:15:47.229
addresses, locations, municipalities,
et cetera.
00:15:48.534 --> 00:15:53.276
And where possible, I also try to link
the libraries to external identifiers.
00:15:56.024 --> 00:15:58.415
And then, you can really easily--
as we all know--
00:15:58.415 --> 00:16:03.050
generate some Listeria lists
with public libraries grouped
00:16:03.050 --> 00:16:05.060
by organizations, for instance.
00:16:05.060 --> 00:16:08.380
Or using SPARQL queries,
you can also do aggregations on the data--
00:16:08.380 --> 00:16:11.060
let's say, give me all
the municipalities in the Netherlands
00:16:11.060 --> 00:16:15.115
and the number of library branches
in all the municipalities.
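That aggregation can be written as a GROUP BY query. The class IDs are my assumptions: Q2039348 ("municipality of the Netherlands") and Q28564 ("public library"), with P131 ("located in the administrative territorial entity") linking a branch to its municipality.

```python
# SPARQL counting library branches per Dutch municipality (assumed IDs:
# Q2039348 = municipality of the Netherlands, Q28564 = public library).
BRANCHES_PER_MUNICIPALITY = """\
SELECT ?mun ?munLabel (COUNT(?lib) AS ?branches) WHERE {
  ?mun wdt:P31 wd:Q2039348 .
  ?lib wdt:P31 wd:Q28564 ;
       wdt:P131 ?mun .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?mun ?munLabel
ORDER BY DESC(?branches)
"""
print(BRANCHES_PER_MUNICIPALITY)
```

Pasted into query.wikidata.org, this renders as a table; adding `#defaultView:BubbleChart` on the first line would render it as a chart instead.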
00:16:17.025 --> 00:16:20.228
With one click, you can make
these kinds of photo galleries.
00:16:22.092 --> 00:16:23.655
And what I set out to do first,
00:16:23.655 --> 00:16:26.036
you can really create these kinds of maps.
00:16:27.176 --> 00:16:30.425
And you might wonder,
"Are there any libraries here or there?"
00:16:30.555 --> 00:16:33.355
There are--they are not yet in Wikidata.
00:16:33.355 --> 00:16:35.055
We're still working on that.
00:16:35.135 --> 00:16:37.644
And actually, last week,
I spoke with a volunteer,
00:16:37.644 --> 00:16:40.864
who's helping now
with entering the libraries.
00:16:41.644 --> 00:16:45.394
You can really make cool maps in Wikidata,
00:16:45.394 --> 00:16:47.914
and also, using
the Cartographer extension,
00:16:47.914 --> 00:16:50.244
you can embed these kinds of maps.
00:16:51.724 --> 00:16:53.736
And I even took it one step further.
00:16:53.911 --> 00:16:57.399
I also have some Python skills,
and some Leaflet skills--
00:16:57.399 --> 00:16:59.971
so, I created, and I'm quite
proud of it, actually.
00:16:59.971 --> 00:17:03.482
I created this library heat map,
which is fully interactive.
00:17:03.482 --> 00:17:05.956
You can zoom in to it,
and you can see all the libraries,
00:17:06.712 --> 00:17:08.726
and you can also run it off-wiki.
00:17:08.726 --> 00:17:10.552
So, you can just embed it
in your own website,
00:17:10.552 --> 00:17:13.412
and it fully runs interactively.
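An off-wiki heat map like the one described can be a single HTML file that loads Leaflet and the leaflet.heat plugin from a CDN. This is a minimal stdlib sketch, not the speaker's actual code; the CDN URLs and map centre are assumptions, and the points would come from, say, a SPARQL result.

```python
import json

def heatmap_html(points):
    """Render a standalone Leaflet heat map page (sketch; CDN URLs assumed).

    points: list of (lat, lon) tuples, e.g. from a Wikidata SPARQL result.
    """
    data = json.dumps([[lat, lon] for lat, lon in points])
    return f"""<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>
<script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>
<script src="https://unpkg.com/leaflet.heat/dist/leaflet-heat.js"></script>
</head><body>
<div id="map" style="height:100vh"></div>
<script>
  var map = L.map("map").setView([52.2, 5.3], 8);  // centre on the Netherlands
  L.tileLayer("https://tile.openstreetmap.org/{{z}}/{{x}}/{{y}}.png").addTo(map);
  L.heatLayer({data}).addTo(map);
</script>
</body></html>"""

# Write the page to disk; any static web server (or a browser) can then show it.
with open("library_heatmap.html", "w") as f:
    f.write(heatmap_html([(52.37, 4.89), (52.16, 4.49)]))
```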
00:17:15.131 --> 00:17:17.592
So, now going back to my big scary table.
00:17:19.512 --> 00:17:22.970
There is one column
on the right, which is blank.
00:17:22.970 --> 00:17:24.940
And no surprise, it will be Wikidata.
00:17:24.940 --> 00:17:26.448
Let's see how it scores there.
00:17:26.448 --> 00:17:29.500
(cheering)
00:17:32.892 --> 00:17:35.191
So, I actually think
of printing this on a T-shirt.
00:17:35.301 --> 00:17:37.288
(laughter)
00:17:37.788 --> 00:17:39.700
So, just to summarize this in words,
00:17:39.700 --> 00:17:41.129
thanks to my project, now,
00:17:41.129 --> 00:17:45.879
there is public free linked open data
available for Dutch public libraries.
00:17:47.124 --> 00:17:49.686
And who can benefit from my effort?
00:17:50.333 --> 00:17:52.002
Well, all kinds of parties--
00:17:52.002 --> 00:17:54.274
you see Wikipedia,
because you can generate lists
00:17:54.274 --> 00:17:56.051
and overviews and articles,
00:17:56.051 --> 00:17:59.908
for instance, using this
data pulled from Wikidata--
00:17:59.908 --> 00:18:01.976
for our National Library for--
00:18:02.850 --> 00:18:05.391
IFLA also has an inventory
of worldwide libraries,
00:18:05.391 --> 00:18:07.216
they can also reuse the data.
00:18:07.650 --> 00:18:09.497
And especially for Sandra,
00:18:09.549 --> 00:18:13.237
it's also important for the Ministry--
Dutch Ministry of Culture--
00:18:13.277 --> 00:18:15.667
because Sandra is going
to have a talk about Wikidata
00:18:15.667 --> 00:18:18.287
with the Ministry this Monday,
next Monday.
00:18:19.922 --> 00:18:22.277
And also, on the righthand side,
for instance,
00:18:23.891 --> 00:18:27.098
Amazon with Alexa, the assistant,
00:18:27.098 --> 00:18:28.961
they're also using Wikidata,
00:18:28.961 --> 00:18:30.995
so you can imagine that,
00:18:30.995 --> 00:18:33.357
if you're looking for public
library information,
00:18:33.357 --> 00:18:36.580
they can also use Wikidata for that.
00:18:38.955 --> 00:18:41.680
Because one year ago,
Simon Cobb inspired me
00:18:41.680 --> 00:18:44.244
to do this project,
I would like to call upon you,
00:18:44.244 --> 00:18:45.664
if you have time available,
00:18:45.664 --> 00:18:49.532
and if you have data from your own country
about public libraries,
00:18:51.572 --> 00:18:54.422
make the coverage better,
add more red dots,
00:18:54.982 --> 00:18:56.982
and of course, I'm willing
to help you with that.
00:18:56.982 --> 00:18:59.227
And Simon is also willing
to help with this.
00:18:59.870 --> 00:19:01.471
And so, I hope next year, somebody else
00:19:01.471 --> 00:19:03.901
will be at this conference
or another conference
00:19:03.901 --> 00:19:06.291
and there will be more
red dots on the map.
00:19:07.551 --> 00:19:08.911
Thank you very much.
00:19:09.004 --> 00:19:12.740
(applause)
00:19:18.336 --> 00:19:20.086
Thank you, Olaf.
00:19:20.086 --> 00:19:23.554
Next we have Ursula Oberst
and Heleen Smits
00:19:23.613 --> 00:19:27.734
presenting "How Can a Small
Research Library Benefit from Wikidata:
00:19:27.734 --> 00:19:31.423
Enhancing Library Products Using Wikidata."
00:19:53.717 --> 00:19:57.637
Okay. Good morning.
My name is Heleen Smits.
00:19:58.680 --> 00:20:01.753
And my colleague,
Ursula Oberst--where are you?
00:20:01.753 --> 00:20:03.873
(laughter)
00:20:04.371 --> 00:20:09.220
And I work at the Library
of the African Studies Center
00:20:09.220 --> 00:20:11.086
in Leiden, in the Netherlands.
00:20:11.086 --> 00:20:15.038
And the African Studies Center
is a center devoted--
00:20:15.038 --> 00:20:21.464
is an academic institution
devoted entirely to the study of Africa,
00:20:21.464 --> 00:20:23.986
focusing on Humanities and Social Studies.
00:20:24.672 --> 00:20:28.123
We used to be an independent
research organization,
00:20:28.123 --> 00:20:33.064
but in 2016, we became part
of Leiden University,
00:20:33.064 --> 00:20:38.433
and our catalog was integrated
into the larger university catalog.
00:20:39.283 --> 00:20:43.593
Though it remained possible
to do a search in the part of the Leiden--
00:20:43.593 --> 00:20:45.894
of the African Studies Catalog, alone,
00:20:47.960 --> 00:20:50.505
we remained independent in some respects.
00:20:50.586 --> 00:20:53.262
For example, with respect
to our thesaurus.
00:20:54.921 --> 00:20:59.883
And also with respect
to the products we make for our users,
00:21:01.180 --> 00:21:04.378
such as acquisition lists
and web dossiers.
00:21:05.158 --> 00:21:11.975
And it is in the field of the web dossiers
00:21:11.975 --> 00:21:14.582
that we have been looking
00:21:14.582 --> 00:21:19.582
for possible ways to apply Wikidata,
00:21:19.582 --> 00:21:23.372
and that's the part where Ursula
will in the second part of this talk
00:21:24.212 --> 00:21:27.184
show you a bit
what we've been doing there.
00:21:31.250 --> 00:21:35.160
The web dossiers are our collections
00:21:35.160 --> 00:21:39.000
of titles from our catalog
that we compile
00:21:39.000 --> 00:21:45.591
around a theme usually connected
to, for example, a conference,
00:21:45.591 --> 00:21:51.227
or to a special event, and actually,
the most recent web dossier we made
00:21:51.227 --> 00:21:56.017
was connected to the year
of indigenous languages,
00:21:56.017 --> 00:21:59.547
and that was around proverbs
in African languages.
00:22:00.780 --> 00:22:02.327
Our first steps--
00:22:04.307 --> 00:22:09.287
next slide--our first steps
on the Wiki path as a library,
00:22:10.267 --> 00:22:15.046
were in 2013, when we were one
of 12 GLAM institutions
00:22:15.046 --> 00:22:16.472
in the Netherlands,
00:22:16.472 --> 00:22:20.952
part of the project
of Wikipedians in Residence,
00:22:20.952 --> 00:22:26.443
and we had for two months,
a Wikipedian in the house,
00:22:27.035 --> 00:22:32.527
and he gave us training
in adding articles to Wikipedia,
00:22:33.000 --> 00:22:37.720
and also, we made a start with uploading
photo collections to Commons,
00:22:38.530 --> 00:22:42.650
which always remained a little bit
dependent on funding, as well,
00:22:43.229 --> 00:22:45.702
whether we would be able to digitize them,
00:22:45.702 --> 00:22:50.350
and on mostly having
a student assistant to do this.
00:22:51.220 --> 00:22:55.440
But it was actually a great addition
to what we could offer
00:22:55.440 --> 00:22:57.560
as an academic library.
00:22:59.370 --> 00:23:04.742
In May 2018--so that is my Ursula,
my colleague Ursula--
00:23:04.742 --> 00:23:09.465
she started to really explore--
dive into Wikidata
00:23:09.465 --> 00:23:14.515
and see what we, as a small
and not very experienced library
00:23:14.515 --> 00:23:18.175
in these fields, could do with that.
00:23:25.050 --> 00:23:26.995
So, as I mentioned, we have
our own thesaurus.
00:23:28.210 --> 00:23:30.689
And this is where we started.
00:23:30.689 --> 00:23:34.502
This is a thesaurus of 13,000 terms,
00:23:34.502 --> 00:23:37.670
all in the field of African studies.
00:23:37.670 --> 00:23:41.457
It contains a lot of African languages,
00:23:43.417 --> 00:23:46.360
names of ethnic groups in Africa,
00:23:47.586 --> 00:23:49.431
and other proper names,
00:23:49.431 --> 00:23:55.509
which are perhaps especially
interesting for Wikidata.
00:23:58.604 --> 00:24:04.824
So, it is a real authority-controlled
00:24:04.824 --> 00:24:08.370
vocabulary
with 5,000 preferred terms.
00:24:08.554 --> 00:24:11.204
So, we submitted the request to Wikidata,
00:24:11.204 --> 00:24:17.135
and that was actually very quickly
met with a positive response,
00:24:17.214 --> 00:24:19.354
which was very encouraging for us.
00:24:22.884 --> 00:24:25.574
Our thesaurus was loaded into Mix-n-Match,
00:24:25.574 --> 00:24:31.691
and by now, 75% of the terms
00:24:31.691 --> 00:24:36.145
have been manually matched with Wikidata.
00:24:38.061 --> 00:24:42.081
So, it means, well, that we are now--
00:24:42.971 --> 00:24:47.687
we are added as an identifier--
00:24:48.387 --> 00:24:51.553
for example, if you click
on Swahili language,
00:24:52.463 --> 00:24:57.152
what happens then in Wikidata
is that the number that
00:24:59.004 --> 00:25:02.354
connects our term--
the Wikidata item--
00:25:02.560 --> 00:25:05.620
is entered into our thesaurus,
00:25:05.620 --> 00:25:10.000
and from there, you can do a search
directly in the catalog
00:25:10.000 --> 00:25:12.560
by clicking the button again.
00:25:12.560 --> 00:25:18.160
It means, also, that Wikidata
has not really been integrated
00:25:18.160 --> 00:25:19.572
into our catalog.
00:25:19.572 --> 00:25:22.090
But that's also more difficult.
00:25:22.314 --> 00:25:26.053
Okay, we have to give the floor
00:25:26.053 --> 00:25:30.838
to Ursula for the next part.
00:25:30.838 --> 00:25:32.554
(Ursula) Thank you very much, Heleen.
00:25:32.554 --> 00:25:37.258
So, I will talk about our experiences
00:25:37.258 --> 00:25:39.677
with incorporating Wikidata elements
00:25:39.677 --> 00:25:41.356
into our web dossiers.
00:25:41.356 --> 00:25:44.607
A web dossier is--oh, sorry, yeah, sorry.
00:25:45.447 --> 00:25:49.646
A web dossier, or a classical web dossier,
consists of three parts:
00:25:50.248 --> 00:25:53.320
an introduction to the subject,
00:25:53.320 --> 00:25:56.060
mostly written by one of our researchers;
00:25:56.060 --> 00:26:01.328
a selection of titles, both books
and articles from our collection;
00:26:01.328 --> 00:26:06.146
and the third part, an annotated list
00:26:06.146 --> 00:26:08.876
with links to electronic resources.
00:26:09.161 --> 00:26:15.815
And this year, we added a fourth part
to our web dossiers,
00:26:15.815 --> 00:26:18.276
which is the Wikidata elements.
00:26:19.008 --> 00:26:22.007
And it all started last year,
00:26:22.007 --> 00:26:25.206
and my story is similar
to the story of Olaf, actually.
00:26:25.352 --> 00:26:29.570
Last year, when I had no clue
about Wikidata,
00:26:29.570 --> 00:26:33.402
I discovered this wonderful
article by Alex Stinson
00:26:33.402 --> 00:26:36.932
on how to write a query in Wikidata.
00:26:37.382 --> 00:26:41.592
And he chose a subject--
a very appealing subject to me.
00:26:41.592 --> 00:26:45.902
Namely, "Discovering Women Writers
from North Africa."
00:26:46.402 --> 00:26:51.162
I can really recommend this article,
00:26:51.162 --> 00:26:52.981
because it's very instructive.
00:26:52.981 --> 00:26:57.422
And I thought I will be--
I'm going to work on this query,
00:26:57.422 --> 00:27:02.662
and try to change it to:
"Southern African Women Writers,"
00:27:02.662 --> 00:27:07.034
and try to add a link
to their work in our catalog.
00:27:07.311 --> 00:27:10.861
And on the right-hand side,
you see the SPARQL query
00:27:11.592 --> 00:27:15.181
which searches for
"Southern African Women Writers."
00:27:15.181 --> 00:27:20.686
If you click on the button,
on the blue button on the left-hand side,
00:27:21.526 --> 00:27:23.971
the search result will appear beneath.
00:27:23.971 --> 00:27:26.448
The search result can have
different formats.
00:27:26.448 --> 00:27:29.871
In my case, the search result is a map.
00:27:29.871 --> 00:27:32.850
And the nice thing about Wikidata
00:27:32.850 --> 00:27:36.652
is that you can embed
this search result
00:27:36.652 --> 00:27:38.682
into your own webpage,
00:27:38.682 --> 00:27:42.339
and that's what we are now doing
with our web dossiers.
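The embedding step the speaker describes can be sketched as follows. This is a minimal illustration, not the speaker's actual query: it assumes a simplified SPARQL query (restricted to writers born in South Africa, Q258, for brevity) and uses the Wikidata Query Service's embeddable result view.

```python
from urllib.parse import quote

# A hedged sketch: a SPARQL query for women writers born in a Southern
# African country (only Q258, South Africa, shown here for brevity),
# returning coordinates so the result renders as a map.
sparql = """
#defaultView:Map
SELECT ?writer ?writerLabel ?coords WHERE {
  ?writer wdt:P31 wd:Q5 ;          # instance of: human
          wdt:P21 wd:Q6581072 ;    # sex or gender: female
          wdt:P106 wd:Q36180 ;     # occupation: writer
          wdt:P19 ?place .         # place of birth
  ?place wdt:P17 wd:Q258 ;         # birthplace country: South Africa
         wdt:P625 ?coords .        # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# The Wikidata Query Service exposes an embeddable result view at
# query.wikidata.org/embed.html#<url-encoded query>, which can be put
# inside an <iframe> on your own webpage.
embed_url = "https://query.wikidata.org/embed.html#" + quote(sparql)
iframe = f'<iframe src="{embed_url}" width="800" height="500"></iframe>'
print(iframe[:60])
```

Running the query itself requires the live endpoint; the point here is only how a query becomes an embeddable widget.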
00:27:42.339 --> 00:27:47.039
So, this was the very first one
on Southern African women writers,
00:27:47.039 --> 00:27:49.649
with the three classical elements,
00:27:49.649 --> 00:27:53.209
plus this map on the left-hand side,
00:27:53.209 --> 00:27:55.650
which gives extra information--
00:27:55.650 --> 00:27:58.219
a link to the Southern African
women writer--
00:27:58.219 --> 00:28:00.749
a link to her works in our catalog,
00:28:00.749 --> 00:28:07.252
and a link to the Wikidata record
of her birth place, and her name,
00:28:08.219 --> 00:28:13.099
her personal record, plus a photo,
if it's available on Wikidata.
00:28:16.231 --> 00:28:20.329
And to retrieve a nice map
00:28:20.329 --> 00:28:24.032
with a lot of red dots
on the African continent,
00:28:24.032 --> 00:28:28.662
you need nice data in Wikidata--
complete, sufficient data.
00:28:29.042 --> 00:28:33.442
So, with our second web dossier
on public art in Africa,
00:28:33.442 --> 00:28:38.420
we also started to enhance
the data in Wikidata.
00:28:38.420 --> 00:28:43.242
In this case, for public art,
we added geo-locations--
00:28:43.242 --> 00:28:46.919
geo-locations to Wikidata.
00:28:46.919 --> 00:28:51.139
And we also searched for works
of public art in commons,
00:28:51.139 --> 00:28:55.165
and if they don't have
a record on Wikidata yet,
00:28:55.165 --> 00:29:00.670
we added the record to Wikidata.
00:29:00.855 --> 00:29:05.327
And the third thing we do,
00:29:05.327 --> 00:29:09.958
because when we prepare a web dossier,
00:29:09.958 --> 00:29:15.514
we download the titles from our catalog,
00:29:15.514 --> 00:29:17.584
and the titles are in MARC 21,
00:29:17.584 --> 00:29:23.226
so we have to convert them to a format
that is presentable on the website,
00:29:23.226 --> 00:29:28.229
and it takes not much time and effort
to convert the same set of titles
00:29:28.229 --> 00:29:30.457
to Wikidata QuickStatements,
00:29:30.457 --> 00:29:36.999
and then, we also upload
a title set to Wikidata,
00:29:36.999 --> 00:29:41.254
and you can see the titles we uploaded
00:29:41.254 --> 00:29:44.124
from our latest web dossier
00:29:44.124 --> 00:29:47.514
on African proverbs in Scholia.
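The MARC-to-QuickStatements conversion mentioned here can be sketched like this. The title, the item/property choices, and the dict field names are illustrative assumptions, not the library's actual mapping; the output follows QuickStatements V1 tab-separated syntax.

```python
# A hedged sketch of the conversion step: field names and the choice of
# Q13442814 (scholarly article) / P577 (publication date) are
# illustrative assumptions, not the library's actual mapping.
def titles_to_quickstatements(titles):
    """Turn simple title dicts into QuickStatements V1 (tab-separated) lines."""
    lines = []
    for t in titles:
        lines.append("CREATE")
        # English label on the newly created item.
        lines.append('LAST\tLen\t"%s"' % t["title"])
        # instance of: scholarly article.
        lines.append("LAST\tP31\tQ13442814")
        if "year" in t:
            # +YYYY-00-00T00:00:00Z/9 is QuickStatements year precision.
            lines.append("LAST\tP577\t+%d-00-00T00:00:00Z/9" % t["year"])
    return "\n".join(lines)

# Hypothetical title record for illustration.
batch = titles_to_quickstatements(
    [{"title": "A study of African proverbs", "year": 1994}]
)
print(batch)
```

The resulting text block can be pasted directly into the QuickStatements batch-import form.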
00:29:48.546 --> 00:29:52.294
A really nice tool
that visualizes publications
00:29:52.294 --> 00:29:54.674
present in Wikidata.
00:29:54.674 --> 00:29:59.674
And, one second--when it is possible,
we add a Scholia template
00:29:59.674 --> 00:30:01.863
to our web dossier's topic.
00:30:01.863 --> 00:30:03.272
Thank you very much.
00:30:03.272 --> 00:30:08.079
(applause)
00:30:09.255 --> 00:30:11.724
Thank you, Heleen and Ursula.
00:30:12.010 --> 00:30:16.866
Next we have Adrian Pohl
presenting using Wikidata
00:30:16.866 --> 00:30:22.265
to improve spatial subject indexing
in a regional bibliography.
00:30:45.181 --> 00:30:46.621
Okay, hello everybody.
00:30:46.621 --> 00:30:49.630
I'm going right into the topic.
00:30:49.630 --> 00:30:54.146
I only have ten minutes to present
a three-year project.
00:30:54.535 --> 00:30:57.044
It wasn't full time. (laughs)
00:30:57.044 --> 00:31:00.100
Okay, what's the NWBib?
00:31:00.100 --> 00:31:04.404
It's an acronym for North
Rhine-Westphalian Bibliography.
00:31:04.404 --> 00:31:07.944
It's a regional bibliography
that records literature
00:31:07.944 --> 00:31:11.441
about people and places
in North Rhine-Westphalia.
00:31:12.534 --> 00:31:14.103
And there are monographs in it--
00:31:15.162 --> 00:31:19.451
there are a lot of articles in it,
and most of them are quite unique,
00:31:19.451 --> 00:31:22.052
so, that's the interesting thing
about this bibliography--
00:31:22.052 --> 00:31:25.472
because it's often
quite obscure stuff--
00:31:25.472 --> 00:31:28.188
local people writing
about that tradition,
00:31:28.188 --> 00:31:29.488
and something like this.
00:31:29.612 --> 00:31:33.428
And there's over 400,000 entries in there.
00:31:33.428 --> 00:31:37.689
And the bibliography started in 1983,
00:31:37.689 --> 00:31:42.718
and so we only have titles
from this publication year onwards.
00:31:44.744 --> 00:31:49.166
If you want to take a look at it,
it's at nwbib.de,
00:31:49.166 --> 00:31:50.859
that's the web application.
00:31:50.859 --> 00:31:55.389
It's based on our service,
lobid.org, the API.
00:31:57.148 --> 00:32:01.220
Because it's cataloged as part
of the hbz union catalog,
00:32:01.220 --> 00:32:04.988
which comprises around 20 million records,
00:32:04.988 --> 00:32:08.869
it's an [inaudible] Aleph system--
we get the data out of there,
00:32:08.869 --> 00:32:11.308
and make RDF out of it,
00:32:11.308 --> 00:32:16.408
and provide it as JSON
via the HTTP API.
00:32:17.129 --> 00:32:20.507
So, the initial status in 2017
00:32:20.507 --> 00:32:25.307
was we had nearly 9,000 distinct strings
00:32:25.307 --> 00:32:28.727
about places--referring to places,
in North Rhine-Westphalia.
00:32:28.727 --> 00:32:34.187
Mostly, those were administrative areas,
like towns and districts,
00:32:34.187 --> 00:32:38.458
but also monasteries, principalities,
or natural regions.
00:32:38.907 --> 00:32:43.517
And we already used Wikidata in 2017,
00:32:43.517 --> 00:32:48.496
and matched those strings
with the Wikidata API to Wikidata entries
00:32:48.496 --> 00:32:51.907
quite naively to get
the geo-coordinates from there,
00:32:51.907 --> 00:32:57.210
and do some geo-based
discovery stuff with it.
00:32:57.326 --> 00:32:59.910
But this had some drawbacks.
00:32:59.910 --> 00:33:02.577
And so, the matching was really poor,
00:33:02.577 --> 00:33:05.197
and there were a lot of false positives,
00:33:05.197 --> 00:33:09.184
and we still had no hierarchy
in those places,
00:33:09.184 --> 00:33:13.201
and we still had a lot
of non-unique names.
00:33:13.505 --> 00:33:15.356
So, this is an example here.
00:33:16.616 --> 00:33:18.378
Does this work?
00:33:18.494 --> 00:33:22.314
Yeah, as you can see,
for one place, Brauweiler,
00:33:22.314 --> 00:33:24.615
there are four different strings in there.
00:33:24.820 --> 00:33:27.893
So, we all know how this happens.
00:33:27.893 --> 00:33:31.994
If there's no authority file,
you end up with this data.
00:33:31.994 --> 00:33:33.894
But we want to improve on that.
00:33:34.614 --> 00:33:38.211
And as you can also see,
why the matching didn't work--
00:33:38.211 --> 00:33:40.382
so you have this name of the place
00:33:40.382 --> 00:33:45.170
and there's often the name
of the superior administrative area,
00:33:45.170 --> 00:33:50.532
and even on the second level,
a superior administrative area
00:33:50.532 --> 00:33:52.040
is often needed in the name
00:33:52.040 --> 00:33:58.909
to identify the place successfully.
00:33:58.909 --> 00:34:04.679
So, the goal was to build a full-fledged
spatial classification based on this data,
00:34:04.679 --> 00:34:07.109
with a hierarchical view of places,
00:34:09.079 --> 00:34:11.389
with one entry or ID for each place.
00:34:11.518 --> 00:34:17.488
And we got this mock-up
from the NWBib editors in 2016, made in Excel,
00:34:18.048 --> 00:34:23.116
to get a feeling of what
they would like to have.
00:34:25.006 --> 00:34:28.198
There you have the--
Regierungsbezirk--
00:34:28.198 --> 00:34:31.016
that's the most superior
administrative area--
00:34:31.016 --> 00:34:34.918
we have in there some towns
or districts--rural districts--
00:34:34.918 --> 00:34:39.861
and then, it's going down
to the parts of towns,
00:34:39.861 --> 00:34:42.011
even to this level.
00:34:43.225 --> 00:34:46.232
And we chose Wikidata for this task.
00:34:46.232 --> 00:34:50.087
We also looked at the GND,
the Integrated Authority File,
00:34:50.087 --> 00:34:54.918
and GeoNames--but Wikidata
had the best coverage,
00:34:54.918 --> 00:34:56.902
and the best infrastructure.
00:34:58.112 --> 00:35:02.072
The coverage for the places
and the geo-coordinates we need,
00:35:02.072 --> 00:35:04.512
and the hierarchical
information, for example.
00:35:04.512 --> 00:35:06.732
There were a lot of places,
also, in the GND,
00:35:06.732 --> 00:35:09.694
but there was no hierarchical
information in there.
00:35:11.170 --> 00:35:13.682
And also, Wikidata provides
the infrastructure
00:35:13.682 --> 00:35:15.343
for editing and versioning.
00:35:15.343 --> 00:35:20.022
And there's also a community
that helps maintaining the data,
00:35:20.022 --> 00:35:22.052
which was quite good.
00:35:22.950 --> 00:35:26.882
Okay, but there was a requirement
by the NWBib editors.
00:35:27.682 --> 00:35:31.447
They did not want to directly
rely on Wikidata,
00:35:31.447 --> 00:35:32.972
which was understandable.
00:35:32.972 --> 00:35:34.982
We don't have those servers
under our control,
00:35:34.982 --> 00:35:38.002
and we won't know what's going on there.
00:35:38.084 --> 00:35:41.944
There might be some unwelcome edits
that destroy the classification,
00:35:41.944 --> 00:35:44.159
or parts of it, or vandalism.
00:35:44.159 --> 00:35:50.794
So, we decided to put
an intermediate SKOS file in between,
00:35:50.794 --> 00:35:55.534
on which the application would rely--
which would be generated from Wikidata.
00:35:57.113 --> 00:35:59.462
And SKOS is the Simple Knowledge
Organization System--
00:35:59.462 --> 00:36:03.919
it's the standard way to model
00:36:03.919 --> 00:36:07.519
a classification in the linked data world.
00:36:07.603 --> 00:36:09.278
So, how did we do it? Five steps.
00:36:09.278 --> 00:36:14.037
I will come to each
of the steps in more detail.
00:36:14.037 --> 00:36:18.460
We matched the strings to Wikidata
with a better approach than before.
00:36:18.727 --> 00:36:23.131
Created a classification
based on Wikidata, added it,
00:36:23.131 --> 00:36:26.255
then added the links
from Wikidata to NWBib
00:36:26.255 --> 00:36:27.590
with a custom property.
00:36:27.590 --> 00:36:32.659
And now, we are in the process
of establishing a good process
00:36:32.659 --> 00:36:36.559
for updating the classification
in Wikidata.
00:36:36.619 --> 00:36:38.888
Seeing--having a diff
of the changes,
00:36:38.888 --> 00:36:41.158
and then publishing it to the SKOS file.
00:36:42.813 --> 00:36:44.646
I will come to the details.
00:36:44.646 --> 00:36:46.261
So, the matching approach--
00:36:46.261 --> 00:36:48.356
as the API wasn't sufficient,
00:36:48.356 --> 00:36:53.585
and because we have those
different levels in the strings,
00:36:54.441 --> 00:36:59.036
we built a custom Elasticsearch
index for our task.
00:36:59.596 --> 00:37:04.378
I think by now, you could probably,
as well, use OpenRefine for doing this,
00:37:04.378 --> 00:37:09.306
but at that point in time,
it wasn't available for Wikidata.
00:37:10.186 --> 00:37:14.336
And we built this index based
on a SPARQL query,
00:37:14.336 --> 00:37:20.484
for entities in NRW
with a specific type.
00:37:20.484 --> 00:37:25.069
And the query evolved over time a lot.
00:37:25.148 --> 00:37:29.157
And we have a few entries--
you can see the history on GitHub.
00:37:29.727 --> 00:37:32.088
So, what we put in the matching index,
00:37:32.088 --> 00:37:36.337
in the spatial object,
is what we need in our data.
00:37:36.337 --> 00:37:39.662
It's the label and the ID
or the link to Wikidata,
00:37:40.222 --> 00:37:43.874
the geo-coordinates, and the type
from Wikidata [inaudible], as well.
00:37:44.194 --> 00:37:50.488
But also very important for the matching
are the aliases and the broader entity--
00:37:50.488 --> 00:37:54.138
and this is also an example where the name
of the broader entity
00:37:54.138 --> 00:37:57.875
and the district itself are very similar.
00:37:57.937 --> 00:38:03.096
So, it's important to have
some type information, as well,
00:38:03.096 --> 00:38:04.606
for the matching.
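The kind of query such a matching index could be built from might look like the sketch below. The exact query lobid used evolved on GitHub; this version is an assumption for illustration (Q1198 is North Rhine-Westphalia, P131 is "located in the administrative territorial entity"), as are the index field names and the example QIDs.

```python
# A sketch of a SPARQL query gathering label, aliases, type, broader
# entity, and coordinates for places below NRW -- the fields the talk
# names as important for matching. Not the project's actual query.
MATCHING_QUERY = """
SELECT ?place ?placeLabel ?alias ?type ?broader ?coords WHERE {
  ?place wdt:P131+ wd:Q1198 ;      # anywhere below NRW in the hierarchy
         wdt:P31 ?type ;           # type, needed to disambiguate matches
         wdt:P131 ?broader .       # direct parent, for the hierarchy
  OPTIONAL { ?place wdt:P625 ?coords . }
  OPTIONAL { ?place skos:altLabel ?alias .
             FILTER(LANG(?alias) = "de") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de". }
}
"""

def to_index_doc(row):
    """Shape one query-result row into a document for a search index
    (field names are illustrative)."""
    return {
        "label": row["placeLabel"],
        "aliases": row.get("aliases", []),
        "type": row["type"],
        "broader": row["broader"],
        "coords": row.get("coords"),
    }

# Hypothetical QIDs for the type and the broader entity.
doc = to_index_doc({"placeLabel": "Brauweiler",
                    "type": "Q123",
                    "broader": "Q456"})
print(doc["label"])
```

Each document would then be indexed in Elasticsearch so that a catalog string like "Brauweiler (Pulheim)" can be matched against label, aliases, and the broader entity's name together.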
00:38:04.900 --> 00:38:07.900
So, the matching results
were very good.
00:38:07.900 --> 00:38:11.110
We could automatically match
more than 99% of records
00:38:11.110 --> 00:38:12.265
with this approach.
00:38:13.885 --> 00:38:16.356
These were only 92% of the strings.
00:38:16.540 --> 00:38:18.140
So, obviously, the results--
00:38:18.140 --> 00:38:20.610
those strings that only occurred
one or two times
00:38:20.610 --> 00:38:22.419
often didn't appear in Wikidata.
00:38:22.419 --> 00:38:26.309
And so, we had to do a lot of work
with those in the [long tail].
00:38:27.905 --> 00:38:32.039
And for around 1,000 strings,
the matching was incorrect.
00:38:32.114 --> 00:38:34.950
But the catalogers did a lot of work
in the Aleph catalog,
00:38:34.950 --> 00:38:39.869
but also in Wikidata, they made
more than 6,000 manual edits to Wikidata
00:38:39.869 --> 00:38:45.019
to reach 100% coverage by adding
aliases, type information,
00:38:45.085 --> 00:38:46.615
creating new entries.
00:38:46.615 --> 00:38:49.100
Okay, so, I have to speed up.
00:38:49.546 --> 00:38:54.295
We created a classification based on this,
on the hierarchical statements.
00:38:54.295 --> 00:38:58.580
P131 is the main property there.
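Deriving a SKOS hierarchy from P131 statements, as described here, can be sketched as follows. The nwbib.de namespace usage and the parent QID in the example pair are illustrative assumptions (Q365 is Cologne).

```python
# A minimal sketch of turning Wikidata P131 (parent) pairs into SKOS
# broader statements in Turtle; the namespace is an assumption for
# illustration, not the project's actual output.
def pairs_to_skos(pairs):
    """pairs: iterable of (child_qid, parent_qid) from P131 statements."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
             "@prefix n: <https://nwbib.de/spatial#> ."]
    for child, parent in pairs:
        lines.append(f"n:{child} a skos:Concept ; skos:broader n:{parent} .")
    return "\n".join(lines)

# Hypothetical example pair: Q365 is Cologne; the parent QID here is an
# illustrative placeholder for its Regierungsbezirk.
ttl = pairs_to_skos([("Q365", "Q7927")])
print(ttl)
```

Regenerating this file from a fresh Wikidata query, then diffing it against the previous version, is what makes the review step before publication possible.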
00:38:59.827 --> 00:39:02.495
We added the information to our data.
00:39:03.035 --> 00:39:06.525
So, we now have this
in our data spatial object--
00:39:06.525 --> 00:39:11.535
and the focus is this--the link to Wikidata,
and the types are there,
00:39:12.625 --> 00:39:17.554
and here's the ID
from the SKOS classification
00:39:17.554 --> 00:39:19.234
we built based on Wikidata.
00:39:20.034 --> 00:39:23.555
And you can see there
are Q identifiers in there.
00:39:26.940 --> 00:39:29.286
Now, you can basically query our API
00:39:29.286 --> 00:39:34.051
with such a query using Wikidata URIs,
00:39:34.316 --> 00:39:38.627
and get literature, in this example,
about Cologne back.
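The API call described here can be sketched like this. The exact lobid.org query syntax is an assumption from memory (the real API is documented at lobid.org/resources); Q365 is Cologne.

```python
from urllib.parse import quote_plus

# A hedged sketch of querying the lobid-resources API for literature
# about a place, filtering on a spatial ID that is itself based on a
# Wikidata Q identifier. The parameter syntax is an assumption.
def nwbib_spatial_query(qid):
    """Build a lobid-resources search URL for one NWBib spatial ID."""
    spatial_id = f"https://nwbib.de/spatial#{qid}"
    return ("https://lobid.org/resources/search?q="
            + quote_plus(f'spatial.id:"{spatial_id}"'))

url = nwbib_spatial_query("Q365")  # Q365 = Cologne
print(url)
```

Fetching that URL (network required) would return JSON records for titles indexed under that place.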
00:39:39.724 --> 00:39:45.675
Then we created a Wikidata property
for NWBib and added those links
00:39:45.675 --> 00:39:50.995
from Wikidata to the classification--
batch-loaded them with QuickStatements.
00:39:52.105 --> 00:39:53.634
And there's also a nice--
00:39:53.634 --> 00:39:59.344
also a move to using a qualifier
on this property
00:39:59.344 --> 00:40:02.994
to add the broader information there.
00:40:02.994 --> 00:40:06.333
So, I think people won't mess around
with this as much
00:40:06.333 --> 00:40:09.223
as with the P131 statement.
00:40:10.094 --> 00:40:11.743
So, this is what it looks like.
00:40:12.563 --> 00:40:16.142
This will go to the classification
where you can then start a query.
00:40:18.670 --> 00:40:23.293
Now, we have to build this
update and review process,
00:40:23.293 --> 00:40:28.692
and we will add those data like this,
00:40:28.692 --> 00:40:32.452
with a $0 subfield to Aleph,
00:40:32.452 --> 00:40:36.962
and the catalogers will start
using those Wikidata-based IDs,
00:40:36.962 --> 00:40:41.012
URIs, for cataloging for spatial indexing.
00:40:44.702 --> 00:40:50.082
So, by now, there are more than 400,000
NWBib entries with links to Wikidata,
00:40:50.082 --> 00:40:55.905
and more than 4,400 Wikidata entries
with links to NWBib.
00:40:56.617 --> 00:40:58.042
Thank you.
00:40:58.042 --> 00:41:03.182
(applause)
00:41:07.574 --> 00:41:09.682
Thank you, Adrian.
00:41:13.312 --> 00:41:15.472
I got it. Thank you.
00:41:31.122 --> 00:41:34.402
So, as you've seen me before,
I'm Hilary Thorsen.
00:41:34.402 --> 00:41:36.152
I'm Wikimedian in residence
00:41:36.152 --> 00:41:38.382
with the Linked Data
for Production Project.
00:41:38.382 --> 00:41:39.942
I am based at Stanford,
00:41:39.942 --> 00:41:42.590
and I'm here today
with my colleague, Lena Denis,
00:41:42.590 --> 00:41:45.581
who is Cartographic Assistant
at Harvard Library.
00:41:45.581 --> 00:41:50.041
And Christine Fernsebner Eslao
is here in spirit.
00:41:50.041 --> 00:41:53.530
She is currently back in Boston,
but supporting us from afar.
00:41:53.530 --> 00:41:56.240
So, we'll be talking
about Wikidata and Libraries
00:41:56.240 --> 00:42:00.350
as partners in data production,
organization, and project inspiration.
00:42:00.850 --> 00:42:04.300
And our work is part of the Linked Data
for Production Project.
00:42:05.450 --> 00:42:08.190
So, Linked Data for Production
is in its second phase,
00:42:08.190 --> 00:42:10.450
called Pathway for Implementation.
00:42:10.450 --> 00:42:13.291
And it's an Andrew W. Mellon
Foundation grant,
00:42:13.291 --> 00:42:16.120
involving the partnership
of several universities,
00:42:16.120 --> 00:42:20.280
with the goal of constructing a pathway
for shifting the catalog community
00:42:20.280 --> 00:42:24.860
to begin describing library
resources with linked data.
00:42:24.860 --> 00:42:26.919
And it builds upon a previous grant,
00:42:26.919 --> 00:42:30.369
but this iteration is focused
on the practical aspects
00:42:30.369 --> 00:42:32.009
of the transition.
00:42:33.559 --> 00:42:35.650
One of these pathways of investigation
00:42:35.650 --> 00:42:39.000
has been integrating
library metadata with Wikidata.
00:42:39.429 --> 00:42:41.054
We have a lot of questions,
00:42:41.054 --> 00:42:42.999
but some of the ones
we're most interested in
00:42:42.999 --> 00:42:46.180
are how we can integrate
library metadata with Wikidata,
00:42:46.180 --> 00:42:49.580
and make contribution
a part of our cataloging workflows,
00:42:49.580 --> 00:42:53.589
how Wikidata can help us improve
our library discovery environment,
00:42:53.589 --> 00:42:55.929
how it can help us reveal
more relationships
00:42:55.929 --> 00:42:59.629
and connections within our data
and with external data sets,
00:42:59.629 --> 00:43:04.370
and if we have connections in our own data
that can be added to Wikidata,
00:43:04.370 --> 00:43:07.480
how libraries can help
fill in gaps in Wikidata,
00:43:07.480 --> 00:43:09.969
and how libraries can work
with local communities
00:43:09.969 --> 00:43:13.070
to describe library
and archival resources.
00:43:14.010 --> 00:43:17.129
Finding answers to these questions
has focused on the mutual benefit
00:43:17.129 --> 00:43:19.649
for the library and Wikidata communities.
00:43:19.649 --> 00:43:22.949
We've learned through starting to work
on our different Wikidata projects,
00:43:22.949 --> 00:43:25.279
that many of the issues
libraries grapple with,
00:43:25.279 --> 00:43:29.451
like data modeling, identity management,
data maintenance, documentation,
00:43:29.451 --> 00:43:31.289
and instruction on linked data,
00:43:31.289 --> 00:43:33.970
are ones the Wikidata
community works on too.
00:43:34.370 --> 00:43:36.099
I'm going to turn things over to Lena
00:43:36.099 --> 00:43:39.640
to talk about what
she's been working on now.
00:43:46.550 --> 00:43:51.040
Hi, so, as Hilary briefly mentioned,
I work as a map librarian at Harvard,
00:43:51.040 --> 00:43:54.180
where I process maps, atlases,
and archives for our online catalog.
00:43:54.180 --> 00:43:56.580
And while processing two-dimensional
cartographic works
00:43:56.580 --> 00:43:59.572
is relatively straightforward,
cataloging archival collections
00:43:59.572 --> 00:44:02.429
so that their cartographic resources
can be made discoverable,
00:44:02.429 --> 00:44:04.119
has always been more difficult.
00:44:04.119 --> 00:44:06.989
So, my use case for Wikidata
is visually modeling relationships
00:44:06.989 --> 00:44:10.389
between archival collections
and the individual items within them,
00:44:10.389 --> 00:44:13.210
as well as between archival drafts
and published works.
00:44:13.359 --> 00:44:17.329
So, I used Wikidata to highlight the work
of a cartographer named Erwin Raisz,
00:44:17.329 --> 00:44:19.890
who worked at Harvard
in the early 20th century.
00:44:19.890 --> 00:44:22.539
He was known for his vividly detailed
and artistic land forms,
00:44:22.539 --> 00:44:23.939
like this one on the screen--
00:44:23.939 --> 00:44:26.294
but also for inventing
the armadillo projection,
00:44:26.294 --> 00:44:29.020
writing the first cartography
textbook in English
00:44:29.020 --> 00:44:31.318
and for various other
important contributions
00:44:31.318 --> 00:44:32.919
to the field of geography.
00:44:32.919 --> 00:44:34.609
And at the Harvard Map Collection,
00:44:34.609 --> 00:44:38.509
we have a 66-item collection
of Raisz's field notebooks,
00:44:38.509 --> 00:44:41.359
which begin when he was a student
and end just before his death.
00:44:43.679 --> 00:44:46.229
So, this is the collection-level record
that I made for them,
00:44:46.229 --> 00:44:47.994
which merely gives an overview,
00:44:47.994 --> 00:44:50.513
but his notebooks are full of information
00:44:50.513 --> 00:44:53.351
that he used in later atlases,
maps, and textbooks.
00:44:53.351 --> 00:44:56.313
But researchers don't know how to find
that trajectory information,
00:44:56.313 --> 00:44:58.665
and the system
is not designed to show them.
00:45:01.030 --> 00:45:03.734
So, I felt that with Wikidata,
and other Wikimedia platforms,
00:45:03.734 --> 00:45:05.154
I'd be able to take advantage
00:45:05.154 --> 00:45:08.075
of information that already exists
about him on the open web,
00:45:08.075 --> 00:45:10.629
along with library records
and a notebook inventory
00:45:10.629 --> 00:45:12.574
that I had made in an Excel spreadsheet
00:45:12.574 --> 00:45:15.416
to show relationships and influences
between his works.
00:45:15.574 --> 00:45:18.594
So here, you can see how I edited
and reconciled library data
00:45:18.594 --> 00:45:20.165
in OpenRefine.
00:45:20.165 --> 00:45:23.164
And then, I used QuickStatements
to batch import my results.
00:45:23.304 --> 00:45:25.244
So, now, I was ready
to create knowledge graphs
00:45:25.244 --> 00:45:27.864
with SPARQL queries
to show patterns of influence.
00:45:30.084 --> 00:45:33.304
The examples here show
how I leveraged Wikimedia Commons images
00:45:33.304 --> 00:45:34.664
that I connected to him.
00:45:34.664 --> 00:45:36.459
And the hierarchy of some of his works
00:45:36.459 --> 00:45:38.604
that were contributing
factors to other works.
00:45:38.604 --> 00:45:42.354
So, modeling Raisz's works on Wikidata
allowed me to encompass in a single image,
00:45:42.354 --> 00:45:45.890
or in this case, in two images,
the connections that require many pages
00:45:45.890 --> 00:45:47.864
of bibliographic data to reveal.
00:45:51.684 --> 00:45:55.544
So, this video is going to load.
00:45:55.563 --> 00:45:57.233
Yes! Alright.
00:45:57.233 --> 00:46:00.113
This video is a minute and a half long
screencast I made,
00:46:00.113 --> 00:46:02.033
that I'm going to narrate as you watch.
00:46:02.033 --> 00:46:05.423
It shows the process of inputting
and then running a SPARQL query,
00:46:05.423 --> 00:46:09.283
showing hierarchical relationships
between notebooks, an atlas, and a map
00:46:09.283 --> 00:46:11.033
that Raisz created about Cuba.
00:46:11.033 --> 00:46:12.603
He worked there before the revolution,
00:46:12.603 --> 00:46:14.633
so he had the unique position
of having support
00:46:14.633 --> 00:46:17.013
from both the American
and the Cuban governments.
00:46:17.334 --> 00:46:20.583
So, I made this query as an example
to show people who work on Raisz,
00:46:20.583 --> 00:46:24.134
and who are interested in narrowing down
what materials they'd like to request
00:46:24.134 --> 00:46:26.154
when they come to us for research.
00:46:26.154 --> 00:46:29.684
To make the approach replicable
for other archival collections,
00:46:29.684 --> 00:46:33.105
I hope that Harvard and other institutions
will prioritize Wikidata look-ups
00:46:33.105 --> 00:46:35.414
as they move to linked data
cataloging production,
00:46:35.414 --> 00:46:37.520
which my co-presenters
can speak to the progress on
00:46:37.520 --> 00:46:38.854
better than I can.
00:46:38.854 --> 00:46:41.543
But my work has brought me--
has brought to mind a particular issue
00:46:41.543 --> 00:46:46.580
that I see as a future opportunity,
which is that of archival modeling.
00:46:47.369 --> 00:46:52.302
So, to an archivist, an item
is a discrete archival material
00:46:52.302 --> 00:46:55.000
within a larger collection
of archival materials
00:46:55.000 --> 00:46:56.884
that is not a physical location.
00:46:56.884 --> 00:47:00.663
So an archivist from the American National
Archives and Records Administration,
00:47:00.663 --> 00:47:02.943
who is also a Wikidata enthusiast,
00:47:02.943 --> 00:47:05.742
advised me when I was trying
to determine how to express this
00:47:05.742 --> 00:47:07.734
using an example item,
00:47:07.734 --> 00:47:10.456
that I'm going to show
as soon as this video is finally over.
00:47:11.433 --> 00:47:14.391
Alright. Great.
00:47:20.437 --> 00:47:22.100
Nope, that's not what I wanted.
00:47:22.135 --> 00:47:23.536
Here we go.
00:47:31.190 --> 00:47:32.280
It's doing that.
00:47:32.280 --> 00:47:34.154
(humming)
00:47:34.208 --> 00:47:37.418
Nope. Sorry. Sorry.
00:47:40.444 --> 00:47:43.045
Alright, I don't know why
it's not going full screen again.
00:47:43.045 --> 00:47:44.329
I can't get it to do anything.
00:47:44.329 --> 00:47:46.880
But this is the-- oh, my gosh.
00:47:46.880 --> 00:47:48.235
Stop that. Alright.
00:47:48.235 --> 00:47:51.195
So, this is the item that I mentioned.
00:47:51.575 --> 00:47:53.655
So, this was what the archivist
00:47:53.655 --> 00:47:55.964
from the National Archives
and Records Administration
00:47:55.964 --> 00:47:57.414
showed me as an example.
00:47:57.414 --> 00:48:02.414
And he recommended this compromise,
which is to use the 'part of' property
00:48:02.414 --> 00:48:05.614
to connect a lower level description
to a higher level of description,
00:48:05.614 --> 00:48:08.534
which allows the relationships
between different hierarchical levels
00:48:08.534 --> 00:48:10.840
to be asserted as statements
and qualifiers.
00:48:10.840 --> 00:48:12.884
So, in this example that's on screen,
00:48:12.884 --> 00:48:16.294
the relationship between an item,
a series, a collection, and a record group
00:48:16.294 --> 00:48:19.655
are thus contained and described
within a Wikidata item entity.
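The modeling idea in this compromise can be sketched as follows. Each archival level is its own item, and "part of" (P361) statements chain item to series to collection to record group; the QIDs below are hypothetical placeholders, not real Wikidata items.

```python
# A sketch of the 'part of' (P361) chain across archival levels.
# All QIDs here are hypothetical placeholders for illustration.
PART_OF = "P361"

statements = {
    "Q_item":         {PART_OF: "Q_series"},
    "Q_series":       {PART_OF: "Q_collection"},
    "Q_collection":   {PART_OF: "Q_record_group"},
    "Q_record_group": {},
}

def hierarchy_chain(qid, statements):
    """Follow 'part of' links upward and return the full chain of levels."""
    chain = [qid]
    while PART_OF in statements.get(qid, {}):
        qid = statements[qid][PART_OF]
        chain.append(qid)
    return chain

print(hierarchy_chain("Q_item", statements))
# -> ['Q_item', 'Q_series', 'Q_collection', 'Q_record_group']
```

Because each link is an ordinary statement, qualifiers (for example, series or box numbers) can be attached to the P361 claims themselves.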
00:48:19.655 --> 00:48:22.024
So, I followed this model
in my work on Raisz.
00:48:22.704 --> 00:48:26.024
And one of my images is missing.
00:48:26.024 --> 00:48:27.971
No, it's not. It's right there. I'm sorry.
00:48:28.210 --> 00:48:30.613
And so, I followed this model
on my work on Raisz,
00:48:30.613 --> 00:48:33.103
but I look forward
to further standardization.
00:48:38.983 --> 00:48:41.352
So, another archival project
Harvard is working on
00:48:41.352 --> 00:48:44.632
is the Arthur Freedman collection
of more than 2,000 hours
00:48:44.632 --> 00:48:48.702
of punk rock performances
from the 1970s to early 2000s
00:48:48.702 --> 00:48:51.970
in the Boston and Cambridge,
Massachusetts areas.
00:48:51.970 --> 00:48:55.145
It includes many bands and venues
that no longer exist.
00:48:55.604 --> 00:48:59.505
So far, work has been done in OpenRefine
on reconciliation of the bands and venues
00:48:59.505 --> 00:49:02.324
to see which need an item
created in Wikidata.
00:49:02.886 --> 00:49:05.964
A basic item will be created
via batch process next spring,
00:49:05.964 --> 00:49:08.697
and then, an edit-a-thon will be
held in conjunction
00:49:08.697 --> 00:49:12.254
with the New England Music Library
Association's meeting in Boston
00:49:12.254 --> 00:49:15.866
to focus on adding more statements
to the batch-created items,
00:49:15.866 --> 00:49:18.937
by drawing on local music
community knowledge.
00:49:18.937 --> 00:49:22.086
We're interested in learning more
about models for pairing librarians
00:49:22.086 --> 00:49:26.310
and Wiki enthusiasts with new contributors
who have domain knowledge.
00:49:26.297 --> 00:49:29.293
Items will eventually be linked
to digitized video
00:49:29.293 --> 00:49:31.387
in Harvard's digital collection platform
00:49:31.387 --> 00:49:33.167
once rights have
been cleared with artists,
00:49:33.167 --> 00:49:35.147
which will likely be a slow process.
00:49:36.327 --> 00:49:38.030
There's also a great amount of interest
00:49:38.030 --> 00:49:41.680
in moving away from manual cataloging
and creation of authority data
00:49:41.680 --> 00:49:43.247
towards identity management,
00:49:43.247 --> 00:49:45.667
where descriptions
can be created in batches.
00:49:45.667 --> 00:49:48.057
An additional project that focused on
00:49:48.057 --> 00:49:51.297
creating international standard
name identifiers, or ISNIs,
00:49:51.297 --> 00:49:53.477
for avant-garde and women filmmakers
00:49:53.477 --> 00:49:57.657
can be adapted for creating Wikidata items
for these filmmakers, as well.
00:49:57.657 --> 00:50:01.076
Spreadsheets with the ISNIs,
filmmaker names, and other details
00:50:01.076 --> 00:50:04.697
can be reconciled in OpenRefine,
and uploaded with QuickStatements.
00:50:04.910 --> 00:50:06.940
Once people and organizations
have been described,
00:50:06.940 --> 00:50:09.316
we'll move toward describing
the films in Wikidata,
00:50:09.316 --> 00:50:12.526
which will likely present
some additional modeling challenges.
00:50:13.446 --> 00:50:15.486
A library presentation
wouldn't be complete
00:50:15.486 --> 00:50:16.882
without a MARC record.
00:50:16.882 --> 00:50:19.916
Here, you can see the record
for Karen Aqua's taxonomy film,
00:50:19.916 --> 00:50:22.096
where her ISNI and Wikidata Q number
00:50:22.096 --> 00:50:24.176
have been added to the 100 field.
00:50:24.176 --> 00:50:26.636
The ISNIs and Wikidata Q numbers
that have been created
00:50:26.636 --> 00:50:30.066
can then be batch added
back into MARC records via MarcEdit.
00:50:30.066 --> 00:50:33.236
You might be asking why I'm showing you
this ugly MARC record,
00:50:33.236 --> 00:50:35.596
instead of some beautiful
linked data statements.
00:50:35.596 --> 00:50:38.576
And that's because our libraries
will be working in a hybrid environment
00:50:38.576 --> 00:50:39.896
for some time.
00:50:39.896 --> 00:50:42.326
Our library catalogs still rely
on MARC records,
00:50:42.326 --> 00:50:44.076
so by adding in these URIs,
00:50:44.076 --> 00:50:46.366
we can try to take advantage
of linked data,
00:50:46.366 --> 00:50:48.346
while our systems still use MARC.
00:50:49.496 --> 00:50:52.950
Adding URIs into MARC records
makes an additional aspect
00:50:52.950 --> 00:50:54.335
of our project possible.
00:50:54.335 --> 00:50:56.894
Work has been done at Stanford
and Cornell to bring data
00:50:56.894 --> 00:51:01.873
from Wikidata into our library catalog
using URIs already in our MARC records.
00:51:02.334 --> 00:51:05.090
You can see an example
of a knowledge panel,
00:51:05.090 --> 00:51:06.984
where all the data is sourced
from Wikidata,
00:51:06.984 --> 00:51:11.004
and links back to the item itself,
along with an invitation to contribute.
00:51:11.403 --> 00:51:15.130
This is currently in a test environment,
not in production in our catalog.
00:51:15.130 --> 00:51:17.444
Ideally, eventually,
these will be generated
00:51:17.444 --> 00:51:19.916
from linked data descriptions
of library resources
00:51:19.916 --> 00:51:22.954
created using Sinopia,
our linked data editor
00:51:22.954 --> 00:51:24.563
developed for cataloging.
00:51:24.563 --> 00:51:27.994
We found that adding a look-up
to Wikidata in Sinopia is difficult.
00:51:27.994 --> 00:51:31.514
The scale and modeling of Wikidata
makes it hard to partition the data
00:51:31.514 --> 00:51:33.544
to be able to look up typed entities,
00:51:33.544 --> 00:51:34.900
and we've run into the problem
00:51:34.900 --> 00:51:37.493
of SPARQL not being good
for keyword search,
00:51:37.493 --> 00:51:41.883
but wanting our keyword APIs
to return SPARQL-like RDF descriptions.
00:51:41.883 --> 00:51:45.043
So, as you can see, we still have
quite a bit of work to do.
00:51:45.043 --> 00:51:47.937
This round of the grant
runs until June 2020,
00:51:47.937 --> 00:51:50.163
so, we'll be continuing our exploration.
00:51:50.163 --> 00:51:53.113
And I just wanted to invite anyone
00:51:53.113 --> 00:51:57.573
who has a continued interest in talking
about Wikidata and libraries,
00:51:57.573 --> 00:52:01.454
I lead a Wikidata Affinity Group
that's open to anyone to join.
00:52:01.454 --> 00:52:03.013
We meet every two weeks,
00:52:03.013 --> 00:52:05.513
and our next call is Tuesday,
November the 5th,
00:52:05.513 --> 00:52:08.073
so if you're interested
in continuing discussions,
00:52:08.073 --> 00:52:10.393
I would love to talk with you further.
00:52:10.393 --> 00:52:11.890
Thank you, everyone.
00:52:11.890 --> 00:52:13.623
And thank you to the other presenters
00:52:13.623 --> 00:52:16.893
for talking about all
of their wonderful projects.
00:52:16.893 --> 00:52:21.283
(applause)