36C3 Wikipaka WG: Querying Linked Data with SPARQL and the Wikidata Query Service

0:00 - 0:22

36C3 preroll music
0:22 - 0:30

Okay so now to our speaker, he’s Lucas.
He's a SPARQL magician I'm told, so and he
0:30 - 0:35

will introduce you to his favorite
querying language, SPARQL, and give you a
0:35 - 0:40

little introduction and in the second part
he will do some live coding which is
0:40 - 0:46

always really interesting and funny and
you can give him some things that he's
0:46 - 0:50

querying for you and I'm sure we'll have
lots of fun and interesting learning stuff
0:50 - 0:54

here so give a warm round of applause to
Lucas.
0:54 - 0:56

[Applause]
1:01 - 1:09

[inaudible]
1:09 - 1:13

Is this better? Aha! It's a bit too loud
so I'll just talk a bit until they have
1:13 - 1:19

figured it out. Yeah so this is going to
be kind of two parts but not really that
1:19 - 1:22

separate but in the second part I'm
basically going to write the queries that
1:22 - 1:27

you suggest so if you – if you see what
I'm going to do here and then think oh I
1:27 - 1:31

have a great idea for something we could
perhaps query then just remember that and
1:31 - 1:35

we'll get back to that hopefully because
otherwise the second half is going to be
1:35 - 1:40

really short if I don't get any ideas from
you. But yeah, so this is about querying
1:40 - 1:46

linked data which allows you to do all
kinds of crazy things and answer all kinds
1:46 - 1:51

of crazy questions such as I think I had
on the slides something like "what are the
1:51 - 1:54

largest cities with a female mayor?" and
if you wanted to find that out
1:54 - 1:59

traditionally you could like go through
Wikipedia and try to find all the largest
1:59 - 2:03

cities and see which ones have a female
mayor and which ones don't or perhaps
2:03 - 2:07

there's a category with all the cities
with a female mayor but then you have to
2:07 - 2:12

sort them by population and it's a whole
mess and with linked data you can find
2:12 - 2:17

that out much more easily and also all
kinds of other things but let's start with
2:17 - 2:25

some simple fantasy linked data so this is
a tiny snippet of linked data, some data
2:25 - 2:30

graph. It's just composed of a load of
nodes which are these ovals and rectangles
2:30 - 2:35

here and they're connected with arrows and
each of these forms kind of a triple
2:35 - 2:40

consisting of the start node and then the
arrow and then the end node and that's how
2:40 - 2:45

we represent all the information you have
in there, in this linked database. So for
2:45 - 2:48

example we can read this as this talk
right now happens in the Esszimmer or the
2:48 - 2:52

dining room which is the name of this
stage here and it's going to be followed
2:52 - 2:56

by the live querying session which also
happens in Esszimmer and the live querying
2:56 - 3:01

session in turn follows this talk again
and the Esszimmer, the dining room, is
3:01 - 3:06

next to the kitchen, the Küche, and the
kitchen is next to the dining room again
3:06 - 3:10

and both of them are part of the
Wikipaka-WG which is part of 36C3 and the
3:10 - 3:17

talk happens right now and at the same
time there's also some talk about how
3:17 - 3:22

state elections are climate elections or
something in the Chaos West stage, starts
3:22 - 3:26

at the same time, Chaos West stage is part
of the Chaos West Assembly which is part
3:26 - 3:32

of 36C3 as well and so this graph has a
few important properties, for example
3:32 - 3:36

there's some redundant connections here,
you could see, you could say, if this talk
3:36 - 3:39

is followed by the live querying then you
don't really need to know that live
3:39 - 3:44

querying follows this talk, it's kind of
redundant information. You already know
3:44 - 3:49

it, but it doesn't hurt to have it, and it
often makes your life easier if you have a
3:49 - 3:54

little bit of redundancy in your graph and
then if you find that one half of this
3:54 - 3:57

connection is missing for example you can
still investigate what's going on and also
3:57 - 4:02

in here we have kind of bi-directional
connection so Esszimmer is next to Küche
4:02 - 4:08

which is next to Esszimmer but this is two
separate arrows and could also be that
4:08 - 4:12

only one of them is there so you don't
have arrows which go into-, in both
4:12 - 4:16

directions at once in this data model, it
has to be, if you want something like this
4:16 - 4:19

you have to have two separate arrows
because that keeps the data model very
4:19 - 4:26

simple. You just have subject predicate
object and that's everything you have, and
4:26 - 4:33

then to query this graph, you kind of
select a tiny part of it and then you
4:33 - 4:39

remove some part that you don't know about
for example we know that this talk is
4:39 - 4:44

followed by live querying and if we remove
the live querying part, then we can ask
4:44 - 4:51

something like... Okay, I did it the other
way around. Never mind, this way. This
4:51 - 4:54

talk is followed by which talk? and then
you have a question but because you've
4:54 - 5:00

left out this part and then if you ask
this question to a query service it can,
5:00 - 5:07

kind of, you can think of this like a,
err, damn, I only know the German word for
5:07 - 5:12

this one, a, Schablone, template, so you
put this over the graph and this has to
5:12 - 5:16

match the existing node this has to match
the existing arrow and then you see which
5:16 - 5:20

nodes can you put in here and in this case
that's only the live querying or the other
5:20 - 5:27

way around which talk follows this one so
you can have the beginning of the triple
5:27 - 5:31

can be a variable like this one or the end
of the triple can be a variable like in
5:31 - 5:39

this case and you can also have more
complicated patterns like, no there's not
5:39 - 5:42

a more complicated pattern, this is the
same pattern. You have the question which
5:42 - 5:46

talk happens in Esszimmer and you have two
answers: this talk happens in Esszimmer
5:46 - 5:52

and live querying happens in Esszimmer.
But you can also combine more graph nodes
5:52 - 5:58

like this, for example, which talk happens
in some room, which is part of the
5:58 - 6:02

Wikipaka-WG. So we have one free part here
and one free part here. But we know that
6:02 - 6:06

these two have to be connected with,
"happens in", and then this has to be
6:06 - 6:11

connected with "is part of" to the
Wikipaka-WG. And you can kind of
6:11 - 6:17

construct– if you can phrase your question
as a kind of graph like this, where some
6:17 - 6:19

parts are predetermined that you already
know about and the other parts that you
6:19 - 6:26

want to find. Those are these kind of
variables which are here indicated with
6:26 - 6:31

just dashed lines. Then you can ask that
question to the graph and find the
6:31 - 6:36

matching results. In this case, you have
these two matches, this talk happens in
6:36 - 6:40

Esszimmer as part of Wikipaka-WG and live
querying happens in Esszimmer, is part of
6:40 - 6:47

Wikidata– Wikipaka-WG. And then, if you–
if we had more information in this graph
6:47 - 6:52

here, we might also have other rooms. For
example, there's this library over there
6:52 - 6:56

which also is going to have some talks. If
we had the whole schedule in here, we
6:56 - 7:01

would find those as well. And we could
also adapt the query so that we don't even
7:01 - 7:07

make the Wikipaka-WG part fixed. We could
ask for anything that happens in 36C3. So
7:07 - 7:11

that would be some variable, happens in
some room, is part of some assembly, is
7:11 - 7:16

part of 36C3. And then we would find this
thing as well because it fits the same
7:16 - 7:22

kind of pattern: happens in, is part of,
is part of 36C3. Does that make sense?
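The template matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are a hand-typed approximation of the example graph on the slides, "climate talk" is a stand-in name for the talk on the Chaos West stage, and `match` is a hypothetical helper, not part of any SPARQL library. Terms that start with a question mark play the role of the dashed, unknown nodes.

```python
# Hand-typed approximation of the example graph (subject, predicate, object).
TRIPLES = [
    ("this talk", "happens in", "Esszimmer"),
    ("live querying", "happens in", "Esszimmer"),
    ("this talk", "followed by", "live querying"),
    ("live querying", "follows", "this talk"),
    ("Esszimmer", "next to", "Küche"),
    ("Küche", "next to", "Esszimmer"),
    ("Esszimmer", "part of", "WikipakaWG"),
    ("Küche", "part of", "WikipakaWG"),
    ("WikipakaWG", "part of", "36C3"),
    ("climate talk", "happens in", "Chaos West stage"),
    ("Chaos West stage", "part of", "Chaos West Assembly"),
    ("Chaos West Assembly", "part of", "36C3"),
]


def match(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # A variable must keep the value it was first bound to.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        # A fixed term must match the graph exactly.
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


# "This talk is followed by which talk?"
print(match([("this talk", "followed by", "?next")]))

# "Which talk happens in some room which is part of the Wikipaka-WG?"
in_wikipaka = match([
    ("?talk", "happens in", "?room"),
    ("?room", "part of", "WikipakaWG"),
])
print(sorted(s["?talk"] for s in in_wikipaka))  # ['live querying', 'this talk']

# "What happens in some room, part of some assembly, part of 36C3?"
at_congress = match([
    ("?talk", "happens in", "?room"),
    ("?room", "part of", "?assembly"),
    ("?assembly", "part of", "36C3"),
])
print(sorted(s["?talk"] for s in at_congress))
# ['climate talk', 'live querying', 'this talk']
```

Loosening the fixed Wikipaka-WG node into the variable `?assembly` is what lets the third query also pick up the Chaos West talk.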
7:22 - 7:32

Hopefully. I'm seeing a lot of nodding
heads. OK, that's great. So then we can
7:32 - 7:38

try to move ahead to actually ask some of
these questions to a real query system.
7:38 - 7:43

Because in reality, you're not going to
actually draw these graphs, but you have
7:43 - 7:48

some kind of language where you phrase
them instead, which looks a bit like this.
7:48 - 7:53

So you have the part: SELECT anything
WHERE, that is kind of like SQL, and then
7:53 - 7:58

everything else is not like SQL. Forget
SQL! I hear this is easier to understand
7:58 - 8:03

if you don't know SQL. I didn't know SQL
that much when I learned SPARQL, and I
8:03 - 8:09

think it helped me, apparently. But what
you write down here is these, is this kind
8:09 - 8:14

of description of the graph, and these
dashed parts, which are the variables
8:14 - 8:18

which you don't yet know. Those are marked
with a question mark because that's kind
8:18 - 8:21

of what you use to ask a question. In this
case, I've just called it "?talk", but it
8:21 - 8:27

could be any name, basically. And then
instead of "happens in" as two words, I've
8:27 - 8:33

just written "happensIn" as one and then
with the prefix "36C3" and it happens in
8:33 - 8:38

the 36C3 Esszimmer because I don't really
have a separate dining room at home, but a
8:38 - 8:43

lot of people do. So if we just wrote it
happens in Esszimmer, that would be pretty
8:43 - 8:48

ambiguous and no one would know which
dining room you're talking about.
8:48 - 8:53

And by adding this prefix we know we're
talking about just the dining room in
8:53 - 8:59

this, at thirty– 36C3. I think, I assume
there's no other assembly that has
8:59 - 9:01

something called the dining room. If it
does, then we would have to add something
9:01 - 9:06

else here to make it clear. And I've used
the same prefix for "happensIn" to make
9:06 - 9:10

clear which kind of "happens in" relation
we're talking about, that it's one
9:10 - 9:16

specific to Congress events. And then you
could ask this to a query service which
9:16 - 9:22

has this example graph in it, and you
might get the response that it's these two
9:22 - 9:28

talks. And at the end, you have this
period here because if you read the whole
9:28 - 9:33

thing, it's kind of like a sentence again.
Because the talk happens in Esszimmer. And
9:33 - 9:37

if you have two sentences, then you have
two periods. So the talk happens in some
9:37 - 9:41

room. And this room is part of the
Wikipaka-WG. And because we've used the
9:41 - 9:48

same variable name here and down here,
this has to be the same room. And it
9:48 - 9:51

couldn't just be two different things. So
if we use two different variable names
9:51 - 9:56

here, room and something else, then we
would just get all the combinations of
9:56 - 9:59

talks happening somewhere and rooms being
part of Wikipaka-WG without them being
9:59 - 10:03

connected in any way, but because they use the
same variable name they have to be
10:03 - 10:09

connected like this. And then you would
get these results we've seen earlier. What
10:09 - 10:14

you can also do is leave out the room. So
when I translate this into English, I
10:14 - 10:18

could say, the talk happens in the room
and the room is part of Wikipaka-WG. But I
10:18 - 10:23

could also say the talk happens in some
room, which is part of the Wikipaka-WG,
10:23 - 10:26

as kind of a– I don't know what that's
called in English kind of a relative
10:26 - 10:33

sentence sub-something-clause where we
don't really talk about the room in itself
10:33 - 10:37

just as a part of this larger sentence.
And you can write that in SPARQL as well.
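The point about reusing a variable name can be checked on a toy graph. The following Python sketch is an illustration only: the five triples are a cut-down, hand-typed version of the example graph, "climate talk" is a stand-in name, and `match` is a hypothetical helper rather than a real query engine.

```python
TRIPLES = [
    ("this talk", "happens in", "Esszimmer"),
    ("live querying", "happens in", "Esszimmer"),
    ("climate talk", "happens in", "Chaos West stage"),
    ("Esszimmer", "part of", "WikipakaWG"),
    ("Küche", "part of", "WikipakaWG"),
]


def match(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Same variable ?room in both patterns: the two triples must be connected.
connected = match([
    ("?talk", "happens in", "?room"),
    ("?room", "part of", "WikipakaWG"),
])

# Two different variables: every talk is combined with every Wikipaka-WG room.
unconnected = match([
    ("?talk", "happens in", "?room"),
    ("?other", "part of", "WikipakaWG"),
])

print(len(connected))    # 2
print(len(unconnected))  # 6 (3 talks x 2 rooms)

# Keeping only ?talk corresponds to not naming the room at all.
print(sorted({s["?talk"] for s in connected}))  # ['live querying', 'this talk']
```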
10:37 - 10:44

And then it looks like this. And these
square brackets kind of describe what the
10:44 - 10:48

room looks like without giving it names.
So in this case, you can only select the
10:48 - 10:52

talk up here and we don't have a room
variable. But if you don't care about what
10:52 - 10:56

the room is, then that can be very useful.
I've also changed something else here.
10:56 - 11:04

I've replaced the 36C3 in "isPartOf" with
schema, which is another prefix and schema
11:04 - 11:09

is kind of this collection of useful
prefixes and other nodes that you can
11:09 - 11:14

reuse, for example, if you're describing
things you have on your website, you might
11:14 - 11:19

say you have an article with a
schema:title and a schema:publicationDate.
11:19 - 11:23

So this was mainly introduced by Google
and some other search engines. But we can
11:23 - 11:28

use the same vocabulary to talk about our
talks because "isPartOf" is one of these
11:28 - 11:36

standard terms we can use for that. And
what else do I have. OK, the next thing I
11:36 - 11:41

have is actual queries. So I think I'm
just going to– I'm almost going to switch
11:41 - 11:45

to Wikidata, so I should talk a bit about
Wikidata. So all these examples here were
11:45 - 11:53

just on some example graph, which I made
up here and threw on a slide with a lot of
11:53 - 11:58

probably overengineered TikZ LaTeX magic,
which I shouldn't have wasted that much
11:58 - 12:04

time about. But it looks nice. And… but if
we want to write real queries, we could
12:04 - 12:07

load this thing into a query service, but
it wouldn't be that interesting because
12:07 - 12:12

it's kind of small. But there are a lot of
real data graphs out there that you can
12:12 - 12:17

query with this query language, SPARQL.
And one of the coolest ones, at least in
12:17 - 12:21

my opinion, is called Wikidata or
Wikidata. There's some kind of discussion
12:21 - 12:28

about how it's pronounced. And it's kind
of a free database of anything that's
12:28 - 12:34

relevant. And it's part of the same family
of projects as Wikipedia and Wikimedia
12:34 - 12:38

Commons and other things. And it's also
maintained by the same community of
12:38 - 12:42

volunteers. And you can find all kinds of
really interesting and cool and funny data
12:42 - 12:46

there. So all of these example queries,
which I have here, we're just going to ask
12:46 - 12:57

to Wikidata. But first, I will just give
you one or two minutes to try to imagine
12:57 - 13:04

what this question would look like, either
in the graph format or in the SPARQL
13:04 - 13:09

format. Just try to figure out how you
would formulate: "which software is
13:09 - 13:15

written in bash" as a kind of, this kind
of graph query. And then we can see what
13:15 - 13:23

we can come up with. So. I didn't think
this through. I need some waiting loop
13:23 - 13:36

music now. Does anyone have a kind of idea
of what the graph looks like, because I'm
13:36 - 13:41

going to uncover it now and then you can
compare, if it looks the same way. So it
13:41 - 13:46

would look like this, at least using the
Wikidata terminology. So instead of "is
13:46 - 13:52

written in", the property is called
probing– programming language. And this
13:52 - 13:56

could also, this could be called "bash" or
"Bourne Again Shell" or "GNU bash" or
13:56 - 14:02

something. Doesn't really matter. And in
SPARQL, it looks like this, which is a lot
14:02 - 14:07

less readable, unfortunately, because one
of the things about Wikidata is that it's
14:07 - 14:14

multilingual. So instead of saying
"programming language", we say "P277". And
14:14 - 14:18

I think that's beautiful, haha. No, but
this is a property ID and you can look up
14:18 - 14:23

what this property is called in English or
in German or in any other language. So if
14:23 - 14:31

we look at Wikidata.org and look for – I
think I forgot to zoom in. Yeah. There we
14:31 - 14:40

go. I hope that's readable. Property P,
what was it? 277. That is the property
14:40 - 14:45

"programming language", at least in… okay,
you can't read that. There you go. At
14:45 - 14:48

least in English. In German it's
"Programmiersprache", and it has tons of
14:48 - 14:52

other languages too. So you can use
Wikidata in any language you want, which
14:52 - 14:57

is very nice. I could also show this page
in a different language and then all of
14:57 - 15:01

this would look different. The downside is
that the SPARQL query is not quite as
15:01 - 15:07

readable because you have to use all these
numeric identifiers, but you don't have to
15:07 - 15:15

memorize them at least. So let's… oops,
try to write this query. SELECT * WHERE
15:15 - 15:25

and we have the software, which is… which
has the programming language "bash", and
15:25 - 15:31

then we have to add these prefixes first,
so bash is going to be a Wikidata item. So
15:31 - 15:36

we abbreviate that with "wd" and that's a
prefix. And then if I press control space,
15:36 - 15:42

or I think on Macs command space works as
well, then it searches for bash and shows
15:42 - 15:47

me these suggestions and then I can just
select the right one. In this case, "GNU
15:47 - 15:51

bash", and then I have the ID, and if I
move the mouse over it again, then I can
15:51 - 15:56

see what this ID refers to. So it's not
quite as bad as– so on the PDF slides, you
15:56 - 16:01

just see the ID. But if you're actually on
the query.wikidata.org website… let me
16:01 - 16:05

make that a bit larger so you can all see
it. And if you want to try that out on
16:05 - 16:09

your laptop, I don't know, here it's a bit
[audio outage] And for the programming
16:09 - 16:17

language, we use a slightly different
prefix, which is "wdt", which stands for
16:17 - 16:21

"truthy". So we're only interested in
"truthy" information and not all the
16:21 - 16:29

information. And then we find this
property P277. And if we run this query
16:29 - 16:35

with control-enter or with this button
here, then we get a collection of other
16:35 - 16:40

IDs. Yeah. Does anyone want to get
software which is written in bash? This
16:40 - 16:51

one has a very low ID that is going to be…
Loading. There we go. Autopackage. Some
16:51 - 16:55

package management system that I haven't
even heard of, but it's written in bash.
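The two prefixes are only abbreviations for longer IRIs. As a rough Python illustration (the helper name `expand` is made up for this sketch; the two namespace IRIs are the ones Wikidata uses for `wd:` and `wdt:`):

```python
# Namespace IRIs behind the two prefixes used in the query.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",        # items and properties as entities
    "wdt": "http://www.wikidata.org/prop/direct/",  # "truthy" direct statements
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'wdt:P277' into its full IRI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


print(expand("wdt:P277"))  # http://www.wikidata.org/prop/direct/P277
print(expand("wd:P277"))   # http://www.wikidata.org/entity/P277
```

The same ID therefore means different things depending on the prefix: with `wd:` it names the property itself, with `wdt:` it is the arrow that connects an item directly to its value.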
16:55 - 17:01

OK, so… wait. Er, so here you can see all
these statements and "programming
17:01 - 17:08

language: GNU Bash" is the one we looked
for. And unfortunately… so this is not a
17:08 - 17:12

very useful list. So one thing we can do
in the Wikidata Query Service, which is
17:12 - 17:17

pretty specific to Wikidata, is to add the
so-called label service, which is
17:17 - 17:21

basically magic that you don't need to
understand. But you write something like
17:21 - 17:26

"serv" or "service" and then with
control+space again for autocompletion.
17:26 - 17:31

And it suggests you this thing. And you
just keep that in your query at all times,
17:31 - 17:35

basically. And then you say, I would like
to have not just a software, but also the
17:35 - 17:41

software label. And then we get down here,
the label of the software. And I can also
17:41 - 17:46

add the software description. And then we
also see what, what is described. At least
17:46 - 17:53

if it has a description and then the query
results are already a lot more usable. And
17:53 - 17:59

I'm just going to rename this to "item"
and then we can edit this query however we
17:59 - 18:04

want and the variable name will always
kind of match. Because the next query
18:04 - 18:08

won't be about software anymore. So it'll
be confusing if you just still call it
18:08 - 18:13

"software". But, yeah, there is some
software here like Apache Yetus, Ruby
18:13 - 18:19

Version Manager, Wikidata missing
pictures, Pi-hole, all written in Bash.
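For reference, the query text built up in this part of the demo might look roughly like the string produced below. This Python sketch only assembles text and does not contact the query service. Two assumptions: `with_labels` is a hypothetical helper, and the Q-ID of the "GNU Bash" item is not stated in the talk, so the pattern uses the made-up placeholder `wd:Q_GNU_BASH`, which has to be replaced with the real ID before the query can run.

```python
def with_labels(pattern, variable="item", languages=("en",)):
    """Wrap a triple pattern in a SELECT that also asks for label and description."""
    language_list = ",".join(languages)
    return (
        f"SELECT ?{variable} ?{variable}Label ?{variable}Description WHERE {{\n"
        f"  {pattern}\n"
        f"  SERVICE wikibase:label {{ "
        f'bd:serviceParam wikibase:language "{language_list}". }}\n'
        f"}}"
    )


# wd:Q_GNU_BASH is a placeholder, not a real Wikidata ID.
query = with_labels("?item wdt:P277 wd:Q_GNU_BASH .", languages=("en", "de"))
print(query)
```

The label service fills in `?itemLabel` and `?itemDescription` for the `?item` variable. When several languages are listed, they are tried in order, so `"en,de"` falls back to German where no English label exists.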
18:19 - 18:28

OK, I have several more example queries
here, which are kind of simple, should I
18:28 - 18:34

skip ahead or is it good if I do a few
more simple examples. Skip ahead? Is that
18:34 - 18:41

OK? OK, then let's. So who was born at sea
is not all that interesting. Just Place of
18:41 - 18:45

birth at sea. We have a special value for
that and it's not a very interesting list.
18:45 - 18:49

I think a few results, just five or so,
because most people are going to have
18:49 - 18:52

"place of birth: Atlantic Ocean" or
something. Which places are located on the
18:52 - 18:57

White Elster, just something for the
Leipzig people. And where does the
18:57 - 19:01

Neverending Story take place? This is
actually kind of cute. Let's do that.
19:01 - 19:06

Also, this is a bit interesting because in
this case, the variable is in the last
19:06 - 19:13

place and not the first one. So that… and
then we have the Neverending Story in the
19:13 - 19:20

beginning and narrative location. And then
the item is at the end instead of at the
19:20 - 19:25

beginning of a triple. And it works just
as well, except that a lot of these don't
19:25 - 19:32

have a label in English. So let's add
German as a fallback language. And then we
19:32 - 19:38

get all of these places which someone
added to Wikidata at some point. Let's see
19:38 - 19:42

if there's any useful information about
them. So they all have IDs in the same
19:42 - 19:48

range. So it looks like they were all
created at the same time because they
19:48 - 19:52

are just increasing all the time. So the
Gelichterland is a place from the
19:52 - 19:55

Neverending Story, it's a finctional…
fictional country. It has a capital, which
19:55 - 20:01

is this fictional place. It's located on
the… this terrain feature, it's present in
20:01 - 20:06

the Neverending Story. And it depicts
horror fiction. I'm not sure about that,
20:06 - 20:12

but let's leave it alone for now. OK,
yeah. And skip to a slightly more
20:12 - 20:20

interesting query, which is this one,
which popes had children. So what is the
20:20 - 20:25

graph going to look like for this? How
many, how many triples are we going to
20:25 - 20:29

have? So triple is node, arrow, and
another node, how many triples would you
20:29 - 20:36

need for "Pope has a child"? Let's do a
raising hands. Who thinks you need zero
20:36 - 20:43

triples, OK? Who thinks you need one
triple? Who thinks you need two triples?
20:43 - 20:48

That's more people. Does anyone think you
need three triples? No. OK, so mostly two,
20:48 - 20:54

but some people think one. So the one… the
people who think it might need one triple,
20:54 - 21:03

perhaps are thinking of something like the
Pope, which is the leader of the worldwide
21:03 - 21:11

Catholic Church, has a child, this child
or it's called item, but that's not going
21:11 - 21:15

to have any results. Or it could be the
other way around. And you could say that…
21:15 - 21:26

oh let's just comment this out. The item
has "father: the pope". And that doesn't
21:26 - 21:31

work. Because the items are not… the
children are not directly connected to the
21:31 - 21:35

item for the office of the pope, instead
it's going to be two levels. It's going to
21:35 - 21:40

say the child has a father, some person,
and then the person has the office pope or
21:40 - 21:45

has the position pope or is a pope or
something. So you need this level of
21:45 - 21:49

indirection. So in the graph that looks
either like this or it could be the other
21:49 - 21:55

way around. So either the child has a
father pope, which has "position held:
21:55 - 22:01

pope" or the pope has a child and also a
"position held", so that's kind of an
22:01 - 22:04

example of the redundancy I mentioned
earlier, we have the two directions
22:04 - 22:11

"child" and also "father"/"mother", and-
so you can ask your query in two ways, and
22:11 - 22:14

it doesn't really make that much of a
difference, assuming that the data is
22:14 - 22:20

complete. And I think someone occasionally
runs queries to check if any of these
22:20 - 22:25

circles are missing. So let's try one of
them, let's just stay with this one, so
22:25 - 22:32

the item does not have "pope" as father,
it has some pope, and then this pope has
22:32 - 22:43

"position held: pope". And then let's add
the "pope" label and… yeah, pope label is
22:43 - 22:50

enough, and then we get 24 results! So we
have a Duke of Parma which, who was the
22:50 - 22:55

son of Paul III. Paul III had three
children. Let's sort by this. Wow,
22:55 - 23:04

Alexander VI was very busy. And some of
them just have, oh oh oh, we have
23:04 - 23:09

duplicates, Giovanni Borgia and Giovanni
Borgia. Should I demonstrate Wikidata
23:09 - 23:14

editing now or do we just ignore this? So,
yeah, someone imported a lot of
23:14 - 23:19

information from this peerage database and
apparently we have some duplicate items
23:19 - 23:24

here, let's just leave those alone for
now. In fact, I think this and this also
23:24 - 23:30

looks suspiciously similar. Giovanni
Borgia, unless he had two children of that
23:30 - 23:38

name. I mean, he could have. So this… we
have a date of birth 1470s… 1498. No, that
23:38 - 23:45

might actually be different children. OK,
not a very creative father in the names.
23:45 - 23:53

Yeah. And wait, that's a pope who's a
child of another pope. Very interesting!
23:53 - 23:56

And another one. And another one. We have
three popes who are children of other
23:56 - 24:02

popes. Let's search for those! So we would
also need for that, that the item has
24:02 - 24:11

"position held: Pope", and I could copy
paste this, but just do this. So the item
24:11 - 24:14

should be… child should have a "father:
pope" and the item should have "position
24:14 - 24:18

held: Pope", and the pope should also have
"position held: pope". And in this case,
24:18 - 24:23

it would probably be less confusing to
call these "child" and "father", because
24:23 - 24:26

this is also a pope now, but… variable
names. One of the three hardest problems
24:26 - 24:30

in computer science, right? Yeah, we have
three children who are… three popes who
24:30 - 24:37

are children of other popes. Wow. I'm
actually going to save this query, popes
24:37 - 24:42

who were children of other popes. But
actually, we can future-proof this a
24:42 - 24:48

little bit, because right now we've only
said that the father should be a pope. But
24:48 - 24:51

in case there's ever a female pope, let's
just switch this around and say that the
24:51 - 24:59

pope should have the child… item and then
it's going to work, even if the pope
24:59 - 25:03

happens to be female and is a mother
instead of a father. There we go, same
25:03 - 25:13

three results. OK, and let's keep that,
and open a new tab for next queries. Yeah.
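The level of indirection in the pope queries can be reproduced on a toy graph. Everything in this Python sketch is invented for illustration: "pope A", "pope B" and "duke X" are placeholders rather than real data, and `match` is a hypothetical helper, not the query service.

```python
TRIPLES = [
    ("pope A", "position held", "pope"),
    ("pope B", "position held", "pope"),
    ("pope A", "child", "pope B"),
    ("pope B", "father", "pope A"),
    ("pope A", "child", "duke X"),
    ("duke X", "father", "pope A"),
]


def match(patterns, triples=TRIPLES):
    """Return every variable binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


# One triple: nobody has the office "pope" itself as a father.
print(match([("?item", "father", "pope")]))  # []

# Two triples: the father is some person, and that person held the position.
children = match([
    ("?item", "father", "?pope"),
    ("?pope", "position held", "pope"),
])
print(sorted(s["?item"] for s in children))  # ['duke X', 'pope B']

# Popes who are children of popes, using "child" so the parent's gender is irrelevant.
pope_children = match([
    ("?pope", "child", "?item"),
    ("?pope", "position held", "pope"),
    ("?item", "position held", "pope"),
])
print(sorted(s["?item"] for s in pope_children))  # ['pope B']
```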
25:13 - 25:18

Which Microsoft software runs on Linux.
OK. That's not that funny. So perhaps we
25:18 - 25:23

can just skip it… I don't know. That joke
kind of ran out of steam a while ago.
25:23 - 25:27

Basically looks like this and it's like
Visual Studio Code and three other
25:27 - 25:31

programs, meh. What are some compositions
for organ and orchestra. This isn't funny
25:31 - 25:36

at all, but I just find it very nice
because it's just an awesome sound. And so
25:36 - 25:41

that would be… the composition has the
instrumentation "organ" and also
25:41 - 25:53

"orchestra", which we can write as… item,
item label… composition… instrumentation,
25:53 - 26:12

this one, orchestra. And also,
"composition… organ". And then, oops,
26:12 - 26:18

yeah, this should be "item"… and also I
forgot to add the label service. There we
26:18 - 26:28

go. And we have 12 results, which is nice
if you want to listen to any of those. We
26:28 - 26:39

could also check if any of them have an
audio file on Commons. Let's see. One, OK,
26:39 - 26:46

and I think we've heard this one already.
So, but… one thing that's kind of annoying
26:46 - 26:50

here, I should have mentioned this in the
last query, I think. So I had to repeat
26:50 - 26:53

the item and the property ID, which is a
bit annoying and makes the query difficult
26:53 - 26:58

to read. And what you can do is leave that
out and you can also do this in the
26:58 - 27:05

previous case. So let's actually go one
slide back. So here I didn't write twice
27:05 - 27:07

that it's the software which should have
the developer, and also the operating
27:07 - 27:11

system. I just wrote the software has
"developer: Microsoft" and also with a
27:11 - 27:17

semicolon at the end instead of a period,
it has "operating system: Linux". So if
27:17 - 27:19

you read this as English it's just one
sentence where you don't repeat the
27:19 - 27:22

subject twice. The software has
"developer: Microsoft" and "operating
27:22 - 27:26

system: Linux", instead of "software has
developer: Microsoft" and "software has
27:26 - 27:31

operating system: Linux". And if you… if
the property here is also the same thing,
27:31 - 27:36

then you can even leave that out and add a
comma at the end and just list the two
27:36 - 27:41

values and you don't even have to repeat
the instrumentation. So let's do that here
27:41 - 27:47

and abbreviate this query. And it has the
exact same 12 results, just slightly more
27:47 - 27:55

convenient to read and… to write at least,
hopefully also to read. I don't know. But
27:55 - 27:57

you don't use the comma that much. The
semicolon is pretty useful, like we could
27:57 - 28:07

have written this as, the pope has, er,
the child and also position held like
28:07 - 28:11

this. It means exactly the same, but you
can immediately see that both of these
28:11 - 28:18

refer to the pope because there's just a
bunch of blank space here. Yeah, so then
28:18 - 28:28

we have this one. This isn't funny at all,
but there are a lot of people who used to
28:28 - 28:33

be in the Nazi Party during World War 2
and then who later just went back into a
28:33 - 28:37

civil life and even received the
Bundesverdienstkreuz, the order of merit
28:37 - 28:42

of the Federal Republic of Germany. And
you can find those… in this case I've done
28:42 - 28:47

it with three triples, which is, the
person was a member of this political
28:47 - 28:52

party and received this award. And also
I've added that they're "instance of:
28:52 - 28:55

human", because we also have a lot of
fictional data on Wikidata. You already
28:55 - 28:58

saw that with the Neverending Story stuff
earlier. So there might also be a
28:58 - 29:02

fictional character who was a member of
this political party and who received the
29:02 - 29:07

award, and we're not really interested in
those. So we add "instance of: human", and
29:07 - 29:11

then we are certain that we only get real
results and not fictional results. And it
29:11 - 29:14

doesn't really cost us anything because
the Query Service can optimize that pretty
29:14 - 29:22

well. So let's write that… actually, let's
do that here. So the item should be
29:22 - 29:32

"instance of: human", which is Q5, because
it's a very common item, and "member of
29:32 - 29:40

political party". And you can see I can
search by the German abbreviation and find
29:40 - 29:44

this, even though it's not a label,
because there are search aliases. And also
29:44 - 29:49

"award received", the
Bundesverdienstkreuz, because I can't be
29:49 - 29:54

bothered to type in the whole English
name. There we go. And we find, I think…
29:54 - 30:04

how many results? Eleven results. Yeah.
And this actually isn't quite correct,
30:04 - 30:10

because in theory, you don't get this
order, this order has like 11 parts or
30:10 - 30:15

something. You can get the Grand Cross
with Distinction or you can get the Star
30:15 - 30:19

or whatever. I think it's listed somewhere
here. Yeah, you can get the Grand Cross
30:19 - 30:23

Special Class, you can get the Grand Cross
Special Issue, you can get the Grand Cross
30:23 - 30:27

First Class, blah blah blah. And so, in
theory, any of these people should have
30:27 - 30:34

one of these awards and not just "order of
merit". But I think when I checked, all of
30:34 - 30:42

them just had… all the results, just had
directly "order of merit". But actually,
30:42 - 30:48

no we can try to search for the correct
ones instead. So it would not be part of
30:48 - 30:54

this directly, it would be… "award
received" would be some award, such as
30:54 - 31:03

this one, and then this award is part of
the order of merit, so "award"… "part of"…
31:03 - 31:15

Let's see if that finds any results. Oh.
Oh. Oh, dear. Yeah, that, that… that's a
31:15 - 31:21

lot of results. "Herbert von Karajan".
That's depressing. OK, yeah. OK, so
31:21 - 31:24

I think I… when I tried this out and
didn't find any results, I just did
31:24 - 31:30

something wrong, because this way we find
a lot more results. And if we… so we don't
31:30 - 31:36

actually select the award here, because we
don't care what kind of award they got. So
31:36 - 31:42

we could also use this abbreviation again,
like this. So we just say they got some
31:42 - 31:47

award, which is part of the order of
merit. And in this case, we could even
31:47 - 31:54

abbreviate that further and say, we put a
slash here. And then, that kind of
31:54 - 31:58

describes a path that you have to take
from this item to this item and you have
31:58 - 32:04

to first get to some award received. And
then that has to be part of something
32:04 - 32:08

else. And you can add as many elements
here as you want. And then we get the
32:08 - 32:18

exact same 802 results… and… lots of well-
known names here. And if we want to find
32:18 - 32:22

the original 11 ones that directly had the
order of merit as the award received, we
32:22 - 32:26

can add a question mark here, which is
just like in a regular expression, it says
32:26 - 32:32

this part is optional. They can have
directly received this award or they can
32:32 - 32:36

have received some award, which is part of
the order of merit. And then we should get
32:36 - 32:48

813. Yeah, 813 results, so 802, plus the
11 from earlier. And… I'm starting this
32:48 - 32:53

with "instance of: human", which… and the
Query Service is going to re-order this
32:53 - 32:57

because searching for all the humans and
then filtering for the ones who are in
32:57 - 33:01

this political party and so on wouldn't be
efficient. So I don't have to worry about
33:01 - 33:06

that. I could write it in this order, or I
could shuffle it around. Doesn't make any
33:06 - 33:10

difference. The Query Service already
knows in which order to do these things.
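The ways of writing the award condition discussed above differ only in one triple pattern. A rough sketch, assuming `wdt:P361` for "part of" and reusing the hypothetical placeholders `wd:Q_PARTY` and `wd:Q_ORDER_OF_MERIT` for the items chosen on screen:

```sparql
# Sketch only: wd:Q_PARTY and wd:Q_ORDER_OF_MERIT are placeholders, not real items.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P102 wd:Q_PARTY.

  # Variant 1, explicit intermediate variable (some award that is part of the order):
  #   ?item wdt:P166 ?award.
  #   ?award wdt:P361 wd:Q_ORDER_OF_MERIT.
  #
  # Variant 2, the same condition as a path; "/" chains the two steps:
  #   ?item wdt:P166/wdt:P361 wd:Q_ORDER_OF_MERIT.
  #
  # Variant 3, "?" makes the "part of" step optional (zero or one times),
  # so people who hold the order itself directly are matched as well:
  ?item wdt:P166/wdt:P361? wd:Q_ORDER_OF_MERIT.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

The triple patterns can be listed in any order; as described above, the Query Service decides the evaluation order itself.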
33:10 - 33:14

So you don't have to worry about that. You
can just start with "is a human" and then
33:14 - 33:23

add everything else. I think I have one
more complicated query here. Yeah, so
33:23 - 33:28

that's one of the examples I mentioned
earlier, the largest cities by population
33:28 - 33:33

with a female mayor. So the graph for that
is, I think the largest one I prepared for
33:33 - 33:38

the slides, except the one in the
beginning. And it looks like this. We
33:38 - 33:41

should have a city which is a city,
"instance of: city", and it has a certain
33:41 - 33:46

population, and it has… so for the mayor,
we use the same property as for head of
33:46 - 33:52

government. And if you don't know that,
you could look at some city like Berlin
33:52 - 33:59

and maybe you know what the mayor of
Berlin is called… what was it? Something
33:59 - 34:05

"Müller", I think. Yeah. And then you can
see, aha, the property for the mayor is
34:05 - 34:14

"head of government". Or you could also
search for, the city should have a mayor,
34:14 - 34:19

and then you'll still find "head of
government", the right property. And that
34:19 - 34:25

mayor should be a human and she should
have the gender "female". Oops. There's a
34:25 - 34:28

question mark there for no reason at all.
That's not a variable. That should be the
34:28 - 34:37

fixed value. Sorry. So let's put that
there. We have a city which is "instance
34:37 - 34:50

of: city", and it also has a population
which we're going to use later and it also
34:50 - 34:55

has a head of government. No, that's
wrong. Not the "office held by head of
34:55 - 34:59

government", the "head of government"
itself, which we call the mayor and then
34:59 - 35:18

the mayor is "instance of: human" and
gender should be female… come on… female.
35:18 - 35:28

And let's select the city, cityLabel,
mayorLabel and also the population. And
35:28 - 35:31

then we find some 83 results. That's not
yet the largest cities with a female
35:31 - 35:37

mayor. That's just all of them. And in
Wikidata we know about 83, apparently. And
35:37 - 35:42

if your local hometown has a female mayor,
just go ahead and add it to Wikidata and
35:42 - 35:47

it's probably relevant. It's not– So the
relevance criteria are not as strict as on
35:47 - 35:53

Wikipedia fortunately. But if we want just
the most populous ones, we can go a bit
35:53 - 36:00

back into SQL land and say we want to
ORDER BY the population and in SQL you
36:00 - 36:03

would write DESC afterwards and in SPARQL
it's different. You write
36:03 - 36:10

DESC(?population). Erm, I think it's nicer
that way. But perhaps it would have been
36:10 - 36:14

nicer to just stick with the SQL syntax. I
don't know. And we want to limit this to
36:14 - 36:19

just the ten most populous cities, for
example. And here we go. Tokyo is
36:19 - 36:26

currently the biggest one, then Hong Kong,
Baghdad, Surabaya, Rome. Yeah. And, oh.
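Put together, the query built up to this point could look roughly like this. The identifiers are the common Wikidata ones (`wd:Q515` city, `wdt:P1082` population, `wdt:P6` head of government, `wdt:P21` sex or gender, `wd:Q6581072` female); the variable names are a free choice, and the results will differ from the ones read out here as the data changes.

```sparql
SELECT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city wdt:P31 wd:Q515;          # instance of: city
        wdt:P1082 ?population;    # population
        wdt:P6 ?mayor.            # head of government, the property used for mayors
  ?mayor wdt:P31 wd:Q5;           # instance of: human
         wdt:P21 wd:Q6581072.     # gender: female (a fixed item, not a variable)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?population)        # SPARQL wraps the sort key: DESC(?x), not "?x DESC"
LIMIT 10
```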
36:26 - 36:37

This doesn't make that much sense, Caracas
has two mayors. Anyone… yeah, exactly. So
36:37 - 36:44

we're only supposed to get the current
mayor. Head of government… yeah. Does
36:44 - 36:52

anyone know which one is the current one?
Or we could just check Wikipedia… Caracas,
36:52 - 36:56

which hopefully doesn't get its
information from Wikidata yet. So it's not
36:56 - 37:08

circular. And the mayor is… Carolina,
Carolina Cestari… Cestari, I don't know.
37:12 - 37:15

[Laughter]
37:15 - 37:25

OK, so let's add a new one. Ah…? Doesn't
have an item yet, is that… is that the
37:25 - 37:31

mayor, or is chief of government something
else? Doesn't occur anywhere else on the
37:31 - 37:45

page, of course. Local government… mayor…
no. OK, so let's just… I don't know,
37:45 - 37:55

doesn't she have a Wikipedia article? No.
Just appears in some lists and then she
37:55 - 38:01

doesn't have a Wikidata item yet? No.
Then… I don't know. We'll do some live
38:01 - 38:05

Wikidata editing. It wasn't part of this
talk, but let's just do it. Carolina
38:05 - 38:17

Cestari… what country is that? Venezuela.
Venezuelan politician, and that sounds
38:17 - 38:23

like a female name, so I'm just going to
guess and check that after the talk. So
38:23 - 38:29

she's definitely a human. And gender is
female and that is going to be enough for
38:29 - 38:38

our query. Do this search again. There we
go. And set this to preferred rank. So
38:38 - 38:41

that's how the Query Service knows that
this is the current value and it should
38:41 - 38:44

only return this one. And ideally, one of
the head of government values should have
38:44 - 38:50

this preferred rank to mark it as the
correct current value. And then all the
38:50 - 38:54

other ones are additional data that you
can use if you want. But it's not the main
38:54 - 39:01

value and we are not going to get it in a
simple query. And then there's some error
39:01 - 39:06

because Caracas isn't some kind of
political territorial entity and it should
39:06 - 39:13

have a start time. I don't care right now.
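The effect of the rank can be made visible by querying the full statements instead of the `wdt:` shortcut. A sketch, assuming the Query Service's predefined `p:` and `ps:` prefixes and the `wikibase:rank` predicate; `wd:Q_CARACAS` is a hypothetical placeholder for the Caracas item, whose ID is not given here.

```sparql
# Sketch only: wd:Q_CARACAS is a placeholder, not a real item ID.
SELECT ?mayor ?mayorLabel ?rank WHERE {
  wd:Q_CARACAS p:P6 ?statement.     # every head-of-government statement, whatever its rank
  ?statement ps:P6 ?mayor;          # the value of that statement
             wikibase:rank ?rank.   # wikibase:PreferredRank, NormalRank or DeprecatedRank
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

The `wdt:P6` form used in the simple query only sees the best-ranked statements: the preferred ones if there are any, otherwise the normal-ranked ones.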
OK, so we run this query again and
39:13 - 39:21

hopefully get just one result for Caracas
this time. No. Uhm, we have to wait a bit
39:21 - 39:26

until the Query Service is updated.
Because it's kind of asynchronous. It just
39:26 - 39:34

keeps watching for changes and eventually
it will get the new data, but… okay. It
39:34 - 39:42

might take a bit longer. Anyways. That's
how that query works. Does that make kind
39:42 - 39:52

of sense? OK, great. Yeah, I think this is
almost exactly what I wrote here. Yeah.
39:52 - 39:56

Except with some labels and the label
service. Yeah. There is one problem here,
39:56 - 40:02

which is, for example, I happen to know
that Mexico City is a very large city with
40:02 - 40:11

a population of… population: almost 9
million. So it should be right after Tokyo
40:11 - 40:19

in front of Hong Kong. And the head of
government is a Claudia Sheinbaum or
40:19 - 40:24

something, which sounds like a woman. So
we should get this result in the query.
40:24 - 40:29

The reason we don't is that Mexico City is
an instance of "big city" and we have
40:29 - 40:35

searched for "instance of: city". And
there's some debate about does this class
40:35 - 40:40

even make sense at all? I think this is
actually the German classification of, a
40:40 - 40:44

big city is one with 100 000 inhabitants,
and in other languages or countries, a big
40:44 - 40:49

city might be something else, but for now
that… the data is what it is. Fortunately,
40:49 - 40:54

what we have here is the information, a
"big city" is a subclass of a city/town,
40:54 - 41:05

which is a subclass of "locality", which
is a subclass of… Wait. We should arrive
41:05 - 41:08

at city at some point, but I think we've
already gone past that. It's also an
41:08 - 41:12

instance of capital. Let's go down that
instead. A capital is a subclass of city,
41:12 - 41:17

there we go. So if we can tell the Query
Service to follow these subclass
41:17 - 41:23

connections, then we should find these
cities. And one way to do that… to make it
41:23 - 41:30

work for Mexico City would be to say, it
has to be "instance of", some, with the
41:30 - 41:37

path again, "subclass of: city" and then
we would find Mexico City, but we would
41:37 - 41:43

not find all the… oh, we would still find
Tokyo because it's still a capital, I
41:43 - 41:47

guess. But we've missed a lot of other
cities, I think which we used to have…
41:47 - 41:54

yeah. Rome, for example, is gone. Because
it's… that's just an instance of city
41:54 - 41:57

directly. And we've now made the subclass
mandatory. What we should do is make it
41:57 - 42:02

optional, or even better, we would– we
should say there can be any number of this
42:02 - 42:07

element. So there… it can be an instance
of city or it can be an instance of a
42:07 - 42:11

subclass of city, it can be an instance of
a subclass of a subclass of city. You can
42:11 - 42:14

follow any number of elements, that what
this… that's what this star means, just
42:14 - 42:19

like in a regular expression. And then we
probably have to say we only want the
42:19 - 42:24

distinct ones because there are like five
different ways to go through the subclass
42:24 - 42:30

tree until you've found "city". And we're
not interested in the different ways. But
42:30 - 42:35

now we should get Tokyo and Mexico City.
And Rome is also here and Caracas is
42:35 - 42:39

completely gone because we found enough
other cities which we were missing
42:39 - 42:46

earlier. So you kind of have to watch out
and sometimes use elements like this…
42:46 - 42:52

"subclass of"-tree is pretty common, or
with a, something… order of merit, we had
42:52 - 42:57

to use this "part of". You have to watch
out if the results are plausible, or
42:57 - 43:01

ideally, you know some item that should be
in the results, and then you check, is it
43:01 - 43:06

there? Why is it not there? And
investigate like that. But that's a fixed
43:06 - 43:11

version of the query. And… yeah, if we
were not interested in the mayor, we could
43:11 - 43:15

do the same trick again. But, yeah. It
doesn't make that much of a difference.
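The fixed version of the city query described above could be sketched like this, assuming `wdt:P279` for "subclass of". Only the first triple pattern and the `DISTINCT` keyword differ from the earlier version.

```sparql
SELECT DISTINCT ?city ?cityLabel ?mayorLabel ?population WHERE {
  # "instance of", followed by any number (zero or more) of "subclass of" steps.
  # This matches direct instances of city as well as instances of "big city",
  # "capital" and any other class below city.
  ?city wdt:P31/wdt:P279* wd:Q515;
        wdt:P1082 ?population;
        wdt:P6 ?mayor.
  ?mayor wdt:P31 wd:Q5;
         wdt:P21 wd:Q6581072.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?population)
LIMIT 10
```

`DISTINCT` collapses the rows that would otherwise be repeated once per route through the subclass tree.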
43:15 - 43:19

And I think… yeah, that was almost the
only difference. Yeah, except that I
43:19 - 43:23

removed the population so we can order by
a variable that you don't select in the
43:23 - 43:34

end if you want. And I think I am out of
slides. So, yeah, if you want to see more
43:34 - 43:38

queries, you can look at these Twitter or
social media accounts. There's a huge list
43:38 - 43:43

of example queries on Wikidata, which is
so big that it's getting too big for a
43:43 - 43:46

wiki page, and people had to move some
queries out there and it's kind of just
43:46 - 43:51

grown since 2015 or something. And there's
a lot of garbage there, but also a lot of
43:51 - 43:56

useful queries if you want to look at
that. And I had two more queries in the
43:56 - 44:01

talk description which we haven't talked
about yet, and I think we have the time. I
44:01 - 44:04

can just try to open these. "Which films
starred more than one future head of
44:04 - 44:15

government?" Does that work? It doesn't.
Can I copy the URL here? Yeah, copy link
44:15 - 44:21

address. So that's a kind of longer query,
which is why it didn't really fit on one
44:21 - 44:26

slide. But the important film is you have…
er, the important part is you have some
44:26 - 44:32

film… instance of, or subclass of film, it
has a publication date and a cast member,
44:32 - 44:41

which is the head of government. And the
head of government held some position,
44:41 - 44:47

some head of government, er, some subclass
of head of government. And that should be
44:47 - 44:53

after the film was published. And then you
get a bunch of results. I think this takes
44:53 - 45:00

like 11 seconds or something. And you get
like films with Schwarzenegger and one
45:00 - 45:06

other actor who became US governor. I
don't remember the name. And you also get
45:06 - 45:10

a lot of… or several films from World War
II with future French heads of government,
45:10 - 45:16

which is really cool. So, like a film that
was shot about the liberation of Paris,
45:16 - 45:20

where it's… it's kind of a stretch to call
them cast members, but they're definitely
45:20 - 45:26

in the film. And if we get the result,
then I can tell you what the film is
45:26 - 45:35

called. Yeah, it might be busy right now,
so you get up to 60 seconds in the Query
45:35 - 45:40

Service and then in the end your query is
killed if it takes longer than that. So
45:40 - 45:43

sometimes it can be a bit of a struggle to
make the query work within 60 seconds.
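The film query is not reproduced here, so the following is only a guess at its shape based on the description above, not the query from the talk description. It assumes `wd:Q11424` (film), `wdt:P577` (publication date), `wdt:P161` (cast member), `P39` (position held), `pq:P580` (start time) and `wd:Q2285706` (head of government); these IDs are worth double-checking before use, and the query may well run into the timeout mentioned above.

```sparql
# Sketch only: a reconstruction from the spoken description, not the original query.
SELECT ?film ?filmLabel (COUNT(DISTINCT ?person) AS ?futureHeads) WHERE {
  ?film wdt:P31/wdt:P279* wd:Q11424;    # instance of film, or of a subclass of film
        wdt:P577 ?published;            # publication date
        wdt:P161 ?person.               # cast member
  ?person p:P39 ?positionStatement.     # position held, as a full statement
  ?positionStatement ps:P39/wdt:P279* wd:Q2285706;   # some kind of head of government
                     pq:P580 ?start.    # qualifier: start time of that position
  FILTER(?start > ?published)           # the position began after the film came out
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?film ?filmLabel
HAVING(COUNT(DISTINCT ?person) > 1)     # more than one future head of government
```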
45:43 - 45:48

There we go, 50 seconds. That was close.
So there's yeah, there's a "La Libération
45:48 - 45:52

de Paris" with Charles de Gaulle, who was
president of the Council and president of
45:52 - 45:58

the provisional government, and also
Georges Bidault, I think, who was prime
45:58 - 46:03

minister and president of the Council, and
other stuff. We have several Indian films
46:03 - 46:10

with people who went on to become chief
ministers. And then down here there's some
46:10 - 46:14

Canadian politicians, apparently. And then
here's Arnold Schwarzenegger and Jesse
46:14 - 46:21

Ventura, who both became governors and
also starred in several films. And the
46:21 - 46:26

other thing was, we have a lot of data
about the British government because a lot
46:26 - 46:32

of volunteers have just been slaving away
at that data and adding and adding more
46:32 - 46:39

information. I think they've… they have
all their parliaments, complete with party
46:39 - 46:43

affiliations and everything for at least
the last 100 years and some partial data
46:43 - 46:47

for a lot more than that, because they
have a very long parliamentary history.
46:47 - 46:51

And then you can do queries like "how many
people named John are there in
46:51 - 46:56

parliament", and "how many women with any
name". And you can see when the women were
46:56 - 47:02

finally more than just the men who are
named "John". And it's kind of an amusing
47:02 - 47:08

graph. Or not so amusing. Takes a while as
well. I hope it doesn't take 50 seconds,
47:08 - 47:14

but it looks like the Query Service might
be busy at the moment. But I think it was
47:14 - 47:20

something like in 1991 or so is the
crossover point. Oh yeah. And I should
47:20 - 47:24

mention anyway, so everything we saw right
now was just a lot of tables. But you can
47:24 - 47:31

also show results in different ways, such
as a line chart. There we go. So in 1992,
47:31 - 47:35

this was the first parliament which had
more women than Johns. And then the Johns
47:35 - 47:41

have slightly declined and the women have
gone up to 220. How many people are in the
47:41 - 47:48

House of Commons in total? Does anyone
know? No. So I don't know what percentage
47:48 - 47:52

this is. Uh, but, this was… yeah, this
latest election from 12 December already
47:52 - 48:03

in there. Yeah. [indistinguishable] What?
So the query looks like this. So this one
48:03 - 48:06

is broken into several parts. We first
find all the members of parliament, so
48:06 - 48:11

they should be human, again, no fictional
people, and then they should have some
48:11 - 48:16

"position held", which is a subclass of
"member of parliament" in the House of
48:16 - 48:22

Commons. And then there should also be,
um, a parliamentary term on that, so that
48:22 - 48:28

we know which parliament it is and when it
starts. And then down here, we import all
48:28 - 48:35

those MPs and filter for just the ones
with the "given name: John". And then we
48:35 - 48:40

filter for just the ones with "gender:
female". And there's an optional "subclass
48:40 - 48:44

of" in here, because currently the data
model is that there is a separate item for
48:44 - 48:49

transgender female and someone can have
"gender: transfemale– transgender female",
48:49 - 48:53

which is a subclass of "female". And there
is a discussion right now to get rid of
48:53 - 48:57

that and have a separate property for that
instead. And then all the trans people
48:57 - 48:59

just have "gender:", their right gender,
and you don't have to mess with subclass.
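The structure described above, where one part finds the MPs and later parts "import" them, matches the named subqueries (`WITH { … } AS %name` and `INCLUDE %name`) that the Query Service's Blazegraph backend offers as an extension to standard SPARQL. The following is a rough reconstruction of that structure, not the query shown in the talk. `wd:Q_HOUSE_OF_COMMONS_MP` and `wd:Q_JOHN` are hypothetical placeholders; `P2937` (parliamentary term), `P580` (start time) and `P735` (given name) are the property IDs as far as I know and should be verified.

```sparql
#defaultView:LineChart
# Sketch only: wd:Q_HOUSE_OF_COMMONS_MP and wd:Q_JOHN are placeholders, not real items.
SELECT ?start ?johns ?women
WITH {
  # All MPs, with the parliamentary term they sat in and the start of that term.
  SELECT DISTINCT ?mp ?term ?start WHERE {
    ?mp wdt:P31 wd:Q5;                  # human, so no fictional people
        p:P39 ?statement.               # position held, as a full statement
    ?statement ps:P39 ?position;
               pq:P2937 ?term.          # qualifier: parliamentary term
    ?position wdt:P279 wd:Q_HOUSE_OF_COMMONS_MP.   # subclass of "member of parliament"
    ?term wdt:P580 ?start.              # start time of the term
  }
} AS %mps
WITH {
  SELECT ?term ?start (COUNT(DISTINCT ?mp) AS ?johns) WHERE {
    INCLUDE %mps.
    ?mp wdt:P735 wd:Q_JOHN.             # given name: John
  }
  GROUP BY ?term ?start
} AS %johns
WITH {
  SELECT ?term ?start (COUNT(DISTINCT ?mp) AS ?women) WHERE {
    INCLUDE %mps.
    ?mp wdt:P21/wdt:P279? wd:Q6581072.  # gender: female, or a direct subclass of it
  }
  GROUP BY ?term ?start
} AS %women
WHERE {
  INCLUDE %johns.
  INCLUDE %women.
}
ORDER BY ?start
```

The first line is a comment that asks the query UI to open the result as a line chart instead of a table; how the columns map to axes and series is best checked in the UI. Because the final step joins the two counts, a term with no Johns or no women at all would drop out of this sketch.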
48:59 - 49:04

But right now we still… well, we need it
in theory, I don't think there are any MPs
49:04 - 49:09

in practice. But, you know, you know, you
can just keep it in there. And then we
49:09 - 49:15

import the results and get them here
either as a line chart or as a table, if
49:15 - 49:21

you want to sort it by the time… yeah, the
data starts in 1919, apparently. So we
49:21 - 49:25

have exactly a hundred years of history
there. We can also show it as a bar chart,
49:25 - 49:31

if that makes more sense. No it doesn't.
That makes no sense. Line chart is the
49:31 - 49:35

right one. Oh, right, but if you show the
line chart again, then it breaks for some
49:35 - 49:39

reason, there's some bug there. So let's
just show it again. There we go. That's
49:39 - 49:47

the right… chart. Yeah, and I guess… oh
wow, it's already… 50 minutes, so I guess
49:47 - 49:55

this is the point where we start moving to
the live querying part, and I was told I
49:55 - 49:59

should make at least a short break for the
stream, so the Angels know where to cut
49:59 - 50:03

between. But we could also take a
10-minute break and then start the next
50:03 - 50:09

talk on time. Does that sound OK? Or is 10
minutes too long? Uhm, if you're going to
50:09 - 50:14

stay here, which would be very nice, then
please think of some example queries that
50:14 - 50:17

you think we could write, and then I can
try to write them, because otherwise I'm
50:17 - 50:22

not going to have much to do. But yeah,
let's do a 10 minute break and see you
50:22 - 50:25

then. Thank you so far.
50:25 - 50:27

[Applause]
50:27 - 50:32

Postroll Music
50:32 - 50:55

Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!

Title: 36C3 Wikipaka WG: Querying Linked Data with SPARQL and the Wikidata Query Service
Video Language: English
Duration: 50:55
