1
00:00:00,000 --> 00:00:22,090
[36C3 preroll music]
2
00:00:22,090 --> 00:00:30,160
Okay so now to our speaker, he’s Lucas.
He's a SPARQL magician I'm told, so and he
3
00:00:30,160 --> 00:00:35,230
will introduce you to his favorite
querying language, SPARQL, and give you a
4
00:00:35,230 --> 00:00:40,020
little introduction and in the second part
he will do some live coding which is
5
00:00:40,020 --> 00:00:45,840
always really interesting and funny and
you can give him some things that he's
6
00:00:45,840 --> 00:00:50,070
querying for you and I'm sure we'll have
lots of fun and interesting learning stuff
7
00:00:50,070 --> 00:00:53,790
here so give a warm round of applause to
Lucas.
8
00:00:53,790 --> 00:00:55,730
[Applause]
9
00:01:01,030 --> 00:01:09,040
[inaudible]
10
00:01:09,040 --> 00:01:13,440
Is this better? Aha! It's a bit too loud
so I'll just talk a bit until they have
11
00:01:13,440 --> 00:01:18,729
figured it out. Yeah so this is going to
be kind of two parts but not really that
12
00:01:18,729 --> 00:01:22,420
separate but in the second part I'm
basically going to write the queries that
13
00:01:22,420 --> 00:01:27,010
you suggest so if you – if you see what
I'm going to do here and then think oh I
14
00:01:27,010 --> 00:01:31,270
have a great idea for something we could
perhaps query then just remember that and
15
00:01:31,270 --> 00:01:34,869
we'll get back to that hopefully because
otherwise the second half is going to be
16
00:01:34,869 --> 00:01:40,040
really short if I don't get any ideas from
you. But yeah, so this is about querying
17
00:01:40,040 --> 00:01:46,190
linked data which allows you to do all
kinds of crazy things and answer all kinds
18
00:01:46,190 --> 00:01:50,770
of crazy questions such as I think I had
on the slides something like "what are the
19
00:01:50,770 --> 00:01:54,390
largest cities with a female mayor?" and
if you wanted to find that out
20
00:01:54,390 --> 00:01:59,200
traditionally you could like go through
Wikipedia and try to find all the largest
21
00:01:59,200 --> 00:02:03,320
cities and see which ones have a female
mayor and which ones don't or perhaps
22
00:02:03,320 --> 00:02:06,550
there's a category with all the cities
with a female mayor but then you have to
23
00:02:06,550 --> 00:02:12,200
sort them by population and it's a whole
mess and with linked data you can find
24
00:02:12,200 --> 00:02:17,489
that out much more easily and also all
kinds of other things but let's start with
25
00:02:17,489 --> 00:02:24,580
some simple fantasy linked data so this is
a tiny snippet of linked data, some data
26
00:02:24,580 --> 00:02:30,049
graph. It's just composed of a load of
nodes which are these ovals and rectangles
27
00:02:30,049 --> 00:02:35,049
here and they're connected with arrows and
each of these forms kind of a triple
28
00:02:35,049 --> 00:02:39,820
consisting of the start node and then the
arrow and then the end node and that's how
29
00:02:39,820 --> 00:02:45,250
we represent all the information you have
in there, in this linked database. So for
30
00:02:45,250 --> 00:02:48,410
example we can read this as this talk
right now happens in the Esszimmer or the
31
00:02:48,410 --> 00:02:51,959
dining room which is the name of this
stage here and it's going to be followed
32
00:02:51,959 --> 00:02:55,930
by the live querying session which also
happens in Esszimmer and the live querying
33
00:02:55,930 --> 00:03:00,900
session in turn follows this talk again
and the Esszimmer, the dining room, is
34
00:03:00,900 --> 00:03:06,049
next to the kitchen, the Küche, and the
kitchen is next to the dining room again
35
00:03:06,049 --> 00:03:09,890
and both of them are part of the
Wikipaka-WG which is part of 36C3 and the
36
00:03:09,890 --> 00:03:17,340
talk happens right now and at the same
time there's also some talk about how
37
00:03:17,340 --> 00:03:22,110
state elections are climate elections or
something in the Chaos West stage, starts
38
00:03:22,110 --> 00:03:25,530
at the same time, Chaos West stage is part
of the Chaos West Assembly which is part
39
00:03:25,530 --> 00:03:31,670
of 36C3 as well and so this graph has a
few important properties, for example
40
00:03:31,670 --> 00:03:36,060
there's some redundant connections here,
you could see, you could say, if this talk
41
00:03:36,060 --> 00:03:39,180
is followed by the live querying then you
don't really need to know that live
42
00:03:39,180 --> 00:03:43,810
querying follows this talk, it's kind of
redundant information. You already know
43
00:03:43,810 --> 00:03:48,900
it, but it doesn't hurt to have it, and it
often makes your life easier if you have a
44
00:03:48,900 --> 00:03:53,650
little bit of redundancy in your graph and
then if you find that one half of this
45
00:03:53,650 --> 00:03:57,269
connection is missing for example you can
still investigate what's going on and also
46
00:03:57,269 --> 00:04:02,480
in here we have kind of bi-directional
connection so Esszimmer is next to Küche
47
00:04:02,480 --> 00:04:07,510
which is next to Esszimmer but this is two
separate arrows and could also be that
48
00:04:07,510 --> 00:04:11,790
only one of them is there so you don't
have arrows which go into-, in both
49
00:04:11,790 --> 00:04:16,010
directions at once in this data model, it
has to be, if you want something like this
50
00:04:16,010 --> 00:04:18,680
you have to have two separate arrows
because that keeps the data model very
51
00:04:18,680 --> 00:04:25,720
simple. You just have subject predicate
object and that's everything you have, and
52
00:04:25,720 --> 00:04:33,210
then to query this graph, you kind of
select a tiny part of it and then you
53
00:04:33,210 --> 00:04:39,090
remove some part that you don't know about
for example we know that this talk is
54
00:04:39,090 --> 00:04:43,650
followed by live querying and if we remove
the live querying part, then we can ask
55
00:04:43,650 --> 00:04:50,650
something like... Okay, I did it the other
way around. Never mind, this way. This
56
00:04:50,650 --> 00:04:53,530
talk is followed by which talk? and then
you have a question but because you've
57
00:04:53,530 --> 00:05:00,449
left out this part and then if you ask
this question to a query service it can,
58
00:05:00,449 --> 00:05:06,820
kind of, you can think of this like a,
err, damn, I only know the German word for
59
00:05:06,820 --> 00:05:11,620
this one, a, Schablone, template, so you
put this over the graph and this has to
60
00:05:11,620 --> 00:05:15,660
match the existing node, this has to match
the existing arrow and then you see which
61
00:05:15,660 --> 00:05:20,170
nodes can you put in here and in this case
that's only the live querying or the other
62
00:05:20,170 --> 00:05:26,510
way around which talk follows this one so
you can have the beginning of the triple
63
00:05:26,510 --> 00:05:31,490
can be a variable like this one or the end
of the triple can be a variable like in
64
00:05:31,490 --> 00:05:38,540
this case and you can also have more
complicated patterns like, no there's not
65
00:05:38,540 --> 00:05:42,280
a more complicated pattern, this is the
same pattern. You have the question which
66
00:05:42,280 --> 00:05:46,139
talk happens in Esszimmer and you have two
answers: this talk happens in Esszimmer
67
00:05:46,139 --> 00:05:51,819
and live querying happens in Esszimmer.
But you can also combine more graph nodes
68
00:05:51,819 --> 00:05:57,659
like this, for example, which talk happens
in some room, which is part of the
69
00:05:57,659 --> 00:06:02,389
Wikipaka-WG. So we have one free part here
and one free part here. But we know that
70
00:06:02,389 --> 00:06:06,220
these two have to be connected with,
"happens in", and then this has to be
71
00:06:06,220 --> 00:06:10,819
connected with "is part of" to the
Wikipaka-WG. And you can kind of
72
00:06:10,819 --> 00:06:16,610
construct– if you can phrase your question
as a kind of graph like this, where some
73
00:06:16,610 --> 00:06:19,439
parts are predetermined that you already
know about and the other parts that you
74
00:06:19,439 --> 00:06:26,249
want to find. Those are these kind of
variables which are here indicated with
75
00:06:26,249 --> 00:06:30,990
just dashed lines. Then you can ask that
question to the graph and find the
76
00:06:30,990 --> 00:06:35,930
matching results. In this case, you have
these two matches, this talk happens in
77
00:06:35,930 --> 00:06:40,139
Esszimmer, is part of Wikipaka-WG, and live
querying happens in Esszimmer, is part of
78
00:06:40,139 --> 00:06:46,759
Wikidata– Wikipaka-WG. And then, if you–
if we had more information in this graph
79
00:06:46,759 --> 00:06:51,770
here, we might also have other rooms. For
example, there's this library over there
80
00:06:51,770 --> 00:06:56,080
which also is going to have some talks. If
we had the whole schedule in here, we
81
00:06:56,080 --> 00:07:01,069
would find those as well. And we could
also adapt the query so that we don't even
82
00:07:01,069 --> 00:07:06,860
make the Wikipaka-WG part fixed. We could
ask for anything that happens in 36C3. So
83
00:07:06,860 --> 00:07:11,360
that would be some variable, happens in
some room, is part of some assembly, is
84
00:07:11,360 --> 00:07:15,610
part of 36C3. And then we would find this
thing as well because it fits the same
85
00:07:15,610 --> 00:07:21,530
kind of pattern: happens in, is part of,
is part of 36C3. Does that make sense?
86
00:07:21,530 --> 00:07:31,539
Hopefully. I'm seeing a lot of nodding
heads. OK, that's great. So then we can
87
00:07:31,539 --> 00:07:38,060
try to move ahead to actually ask some of
these questions to a real query system.
88
00:07:38,060 --> 00:07:43,029
Because in reality, you're not going to
actually draw these graphs, but you have
89
00:07:43,029 --> 00:07:47,529
some kind of language where you phrase
them instead, which looks a bit like this.
90
00:07:47,529 --> 00:07:52,719
So you have the part: SELECT anything
WHERE, that is kind of like SQL, and then
91
00:07:52,719 --> 00:07:57,900
everything else is not like SQL. Forget
SQL! I hear this is easier to understand
92
00:07:57,900 --> 00:08:03,499
if you don't know SQL. I didn't know SQL
that much when I learned SPARQL, and I
93
00:08:03,499 --> 00:08:08,850
think it helped me, apparently. But what
you write down here is these, is this kind
94
00:08:08,850 --> 00:08:14,080
of description of the graph, and these
dashed parts, which are the variables
95
00:08:14,080 --> 00:08:17,919
which you don't yet know. Those are marked
with a question mark because that's kind
96
00:08:17,919 --> 00:08:21,150
of what you use to ask a question. In this
case, I've just called it "?talk", but it
97
00:08:21,150 --> 00:08:27,069
could be any name, basically. And then
instead of "happens in" as two words, I've
98
00:08:27,069 --> 00:08:32,510
just written "happensIn" as one and then
with the prefix "36C3" and it happens in
99
00:08:32,510 --> 00:08:38,289
the 36C3 Esszimmer because I don't really
have a separate dining room at home, but a
100
00:08:38,289 --> 00:08:43,320
lot of people do. So if we just wrote it
happens in Esszimmer, that would be pretty
101
00:08:43,320 --> 00:08:48,110
ambiguous and no one would know which
dining room you're talking about.
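The slide query described here is not part of the subtitles. A minimal sketch of what it may have looked like follows. The prefix name `c3:` and its URI are assumptions made for illustration: the talk pronounces the prefix as "36C3", but a SPARQL prefix name has to start with a letter, and the real slide may have used something else.

```sparql
# Hypothetical prefix; the URI is made up for this sketch.
PREFIX c3: <https://example.org/36c3/>

# Which talks happen in the Esszimmer?
SELECT ?talk WHERE {
  ?talk c3:happensIn c3:Esszimmer.
}
```

Against the example graph from the slides, this pattern should match this talk and the live querying session.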
102
00:08:48,110 --> 00:08:52,780
And by adding this prefix we know we're
talking about just the dining room in
103
00:08:52,780 --> 00:08:58,510
this, at thirty– 36C3. I think, I assume
there's no other assembly that has
104
00:08:58,510 --> 00:09:01,370
something called the dining room. If it
does, then we would have to add something
105
00:09:01,370 --> 00:09:06,199
else here to make it clear. And I've used
the same prefix for "happensIn" to make
106
00:09:06,199 --> 00:09:09,970
clear which kind of "happens in" relation
we're talking about, that it's one
107
00:09:09,970 --> 00:09:15,650
specific to Congress events. And then you
could ask this to a query service which
108
00:09:15,650 --> 00:09:21,750
has this example graph in it, and you
might get the response that it's these two
109
00:09:21,750 --> 00:09:27,760
talks. And at the end, you have this
period here because if you read the whole
110
00:09:27,760 --> 00:09:33,089
thing, it's kind of like a sentence again.
Because the talk happens in Esszimmer. And
111
00:09:33,089 --> 00:09:36,810
if you have two sentences, then you have
two periods. So the talk happens in some
112
00:09:36,810 --> 00:09:40,990
room. And this room is part of the
Wikipaka-WG. And because we've used the
113
00:09:40,990 --> 00:09:47,510
same variable name here and down here,
this has to be the same room. And it
114
00:09:47,510 --> 00:09:50,790
couldn't just be two different things. So
if we use two different variable names
115
00:09:50,790 --> 00:09:55,500
here, room and something else, then we
would just get all the combinations of
116
00:09:55,500 --> 00:09:59,260
talks happening somewhere and rooms being
part of Wikipaka-WG without them being
117
00:09:59,260 --> 00:10:02,970
connected in any way, but because they use the
same variable name they have to be
118
00:10:02,970 --> 00:10:08,840
connected like this. And then you would
get these results we've seen earlier. What
119
00:10:08,840 --> 00:10:13,830
you can also do is leave out the room. So
when I translate this into English, I
120
00:10:13,830 --> 00:10:18,410
could say, the talk happens in the room
and the room is part of Wikipaka-WG. But I
121
00:10:18,410 --> 00:10:23,160
could also say the talk happens in some
room, which is part of the Wikipaka-WG,
122
00:10:23,160 --> 00:10:26,300
as kind of a– I don't know what that's
called in English kind of a relative
123
00:10:26,300 --> 00:10:32,509
sentence sub-something-clause where we
don't really talk about the room in itself
124
00:10:32,509 --> 00:10:36,600
just as a part of this larger sentence.
And you can write that in SPARQL as well.
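Before the shorter form, the two-sentence version described above can be sketched like this. As before, the `c3:` prefix, its URI, and the exact node names are assumptions for illustration, not taken from the slides.

```sparql
# Hypothetical prefix; the URI is made up for this sketch.
PREFIX c3: <https://example.org/36c3/>

SELECT ?talk ?room WHERE {
  ?talk c3:happensIn ?room.          # the talk happens in some room
  ?room c3:isPartOf c3:WikipakaWG.   # and that same room is part of the Wikipaka-WG
}
```

Because `?room` appears in both patterns, both have to match the same node. Renaming one of them would yield every combination of talks and Wikipaka-WG rooms instead.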
125
00:10:36,600 --> 00:10:44,220
And then it looks like this. And these
square brackets kind of describe what the
126
00:10:44,220 --> 00:10:48,480
room looks like without giving it a name.
So in this case, you can only select the
127
00:10:48,480 --> 00:10:51,959
talk up here and we don't have a room
variable. But if you don't care about what
128
00:10:51,959 --> 00:10:55,740
the room is, then that can be very useful.
I've also changed something else here.
129
00:10:55,740 --> 00:11:04,149
I've replaced the 36C3 in "isPartOf" with
schema, which is another prefix and schema
130
00:11:04,149 --> 00:11:09,380
is kind of this collection of useful
prefixes and other nodes that you can
131
00:11:09,380 --> 00:11:14,189
reuse, for example, if you're describing
things you have on your website, you might
132
00:11:14,189 --> 00:11:18,890
say you have an article with a
schema:title and a schema:publicationDate.
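The square-bracket form with the schema prefix might look roughly like the following. The `c3:` prefix and node names are again assumptions for illustration; `schema:isPartOf` is a real schema.org term.

```sparql
# Hypothetical prefix; the URI is made up for this sketch.
PREFIX c3: <https://example.org/36c3/>
PREFIX schema: <http://schema.org/>

SELECT ?talk WHERE {
  # The brackets describe an unnamed room: "something that is part of the Wikipaka-WG".
  ?talk c3:happensIn [ schema:isPartOf c3:WikipakaWG ].
}
```

Since the room has no variable name here, only `?talk` can be selected.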
133
00:11:18,890 --> 00:11:22,870
So this was mainly introduced by Google
and some other search engines. But we can
134
00:11:22,870 --> 00:11:27,880
use the same vocabulary to talk about our
talks because "isPartOf" is one of these
135
00:11:27,880 --> 00:11:35,829
standard terms we can use for that. And
what else do I have. OK, the next thing I
136
00:11:35,829 --> 00:11:41,190
have is actual queries. So I think I'm
just going to– I'm almost going to switch
137
00:11:41,190 --> 00:11:45,350
to Wikidata, so I should talk a bit about
Wikidata. So all these examples here were
138
00:11:45,350 --> 00:11:52,639
just on some example graph, which I made
up here and threw on a slide with a lot of
139
00:11:52,639 --> 00:11:58,180
probably overengineered TikZ LaTeX magic,
which I shouldn't have wasted that much
140
00:11:58,180 --> 00:12:03,790
time on. But it looks nice. And… but if
we want to write real queries, we could
141
00:12:03,790 --> 00:12:07,470
load this thing into a query service, but
it wouldn't be that interesting because
142
00:12:07,470 --> 00:12:12,220
it's kind of small. But there are a lot of
real data graphs out there that you can
143
00:12:12,220 --> 00:12:17,120
query with this query language, SPARQL.
And one of the coolest ones, at least in
144
00:12:17,120 --> 00:12:21,170
my opinion, is called Wikidata or
Wikidata [pronounced differently]. There's some kind of discussion
145
00:12:21,170 --> 00:12:27,980
about how it's pronounced. And it's kind
of a free database of anything that's
146
00:12:27,980 --> 00:12:33,910
relevant. And it's part of the same family
of projects as Wikipedia and Wikimedia
147
00:12:33,910 --> 00:12:37,769
Commons and other things. And it's also
maintained by the same community of
148
00:12:37,769 --> 00:12:42,269
volunteers. And you can find all kinds of
really interesting and cool and funny data
149
00:12:42,269 --> 00:12:46,009
there. So all of these example queries,
which I have here, we're just going to ask
150
00:12:46,009 --> 00:12:57,380
to Wikidata. But first, I will just give
you one or two minutes to try to imagine
151
00:12:57,380 --> 00:13:04,079
what this question would look like, either
in the graph format or in the SPARQL
152
00:13:04,079 --> 00:13:09,339
format. Just try to figure out how you
would formulate: "which software is
153
00:13:09,339 --> 00:13:15,100
written in bash" as a kind of, this kind
of graph query. And then we can see what
154
00:13:15,100 --> 00:13:22,970
we can come up with. So. I didn't think
this through. I need some waiting loop
155
00:13:22,970 --> 00:13:36,380
music now. Does anyone have a kind of idea
of what the graph looks like, because I'm
156
00:13:36,380 --> 00:13:41,160
going to uncover it now and then you can
compare, if it looks the same way. So it
157
00:13:41,160 --> 00:13:45,760
would look like this, at least using the
Wikidata terminology. So instead of "is
158
00:13:45,760 --> 00:13:51,790
written in", the property is called
programming language. And this
159
00:13:51,790 --> 00:13:56,050
could also, this could be called "bash" or
"Bourne Again Shell" or "GNU bash" or
160
00:13:56,050 --> 00:14:02,009
something. Doesn't really matter. And in
SPARQL, it looks like this, which is a lot
161
00:14:02,009 --> 00:14:06,630
less readable, unfortunately, because one
of the things about Wikidata is that it's
162
00:14:06,630 --> 00:14:14,290
multilingual. So instead of saying
"programming language", we say "P277". And
163
00:14:14,290 --> 00:14:17,509
I think that's beautiful, haha. No, but
this is a property ID and you can look up
164
00:14:17,509 --> 00:14:22,589
what this property is called in English or
in German or in any other language. So if
165
00:14:22,589 --> 00:14:31,420
we look at Wikidata.org and look for – I
think I forgot to zoom in. Yeah. There we
166
00:14:31,420 --> 00:14:40,180
go. I hope that's readable. Property P,
what was it? 277. That is the property
167
00:14:40,180 --> 00:14:45,019
"programming language", at least in… okay,
you can't read that. There you go. At
168
00:14:45,019 --> 00:14:48,220
least in English. In German it's
"Programmiersprache", and it has tons of
169
00:14:48,220 --> 00:14:51,530
other languages too. So you can use
Wikidata in any language you want, which
170
00:14:51,530 --> 00:14:56,640
is very nice. I could also show this page
in a different language and then all of
171
00:14:56,640 --> 00:15:01,330
this would look different. The downside is
that the SPARQL query is not quite as
172
00:15:01,330 --> 00:15:06,649
readable because you have to use all these
numeric identifiers, but you don't have to
173
00:15:06,649 --> 00:15:14,920
memorize them at least. So let's… oops,
try to write this query. SELECT * WHERE
174
00:15:14,920 --> 00:15:25,290
and we have the software, which is… which
has the programming language "bash", and
175
00:15:25,290 --> 00:15:30,589
then we have to add these prefixes first,
so bash is going to be a Wikidata item. So
176
00:15:30,589 --> 00:15:35,639
we abbreviate that with "wd" and that's a
prefix. And then if I press control space,
177
00:15:35,639 --> 00:15:41,660
or I think on Macs command space works as
well, then it searches for bash and shows
178
00:15:41,660 --> 00:15:46,850
me these suggestions and then I can just
select the right one. In this case, "GNU
179
00:15:46,850 --> 00:15:50,959
bash", and then I have the ID, and if I
move the mouse over it again, then I can
180
00:15:50,959 --> 00:15:55,760
see what this ID refers to. So it's not
quite as bad as– so on the PDF slides, you
181
00:15:55,760 --> 00:16:00,879
just see the ID. But if you're actually on
the query.wikidata.org website… let me
182
00:16:00,879 --> 00:16:05,370
make that a bit larger so you can all see
it. And if you want to try that out on
183
00:16:05,370 --> 00:16:09,180
your laptop, I don't know, here it's a bit
[audio outage] And for the programming
184
00:16:09,180 --> 00:16:17,290
language, we use a slightly different
prefix, which is "wdt", which stands for
185
00:16:17,290 --> 00:16:21,270
"truthy". So we're only interested in
"truthy" information and not all the
186
00:16:21,270 --> 00:16:28,529
information. And then we find this
property P277. And if we run this query
187
00:16:28,529 --> 00:16:34,620
with control-enter or with this button
here, then we get a collection of other
188
00:16:34,620 --> 00:16:40,240
IDs. Yeah. Does anyone want to get
software which is written in bash? This
189
00:16:40,240 --> 00:16:51,209
one has a very low ID that is going to be…
Loading. There we go. Autopackage. Some
190
00:16:51,209 --> 00:16:55,060
package management system that I haven't
even heard of, but it's written in bash.
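The query typed in this part of the demo is not captured in the subtitles. It presumably looked something like the sketch below. `P277` is the property ID named in the talk; the item ID `wd:Q189248` for Bash is quoted from memory and is not stated in the talk, so it should be confirmed with the autocompletion described above. The `wd:` and `wdt:` prefixes are predefined on query.wikidata.org.

```sparql
# Which software is written in Bash?
SELECT * WHERE {
  ?software wdt:P277 wd:Q189248.   # programming language: Bash (item ID assumed)
}
```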
191
00:16:55,060 --> 00:17:01,130
OK, so… wait. Er, so here you can see all
these statements and "programming
192
00:17:01,130 --> 00:17:08,010
language: GNU Bash" is the one we looked
for. And unfortunately… so this is not a
193
00:17:08,010 --> 00:17:11,720
very useful list. So one thing we can do
in the Wikidata Query Service, which is
194
00:17:11,720 --> 00:17:17,140
pretty specific to Wikidata, is to add the
so-called label service, which is
195
00:17:17,140 --> 00:17:21,300
basically magic that you don't need to
understand. But you write something like
196
00:17:21,300 --> 00:17:25,650
"serv" or "service" and then with
control+space again for autocompletion.
197
00:17:25,650 --> 00:17:30,600
And it suggests you this thing. And you
just keep that in your query at all times,
198
00:17:30,600 --> 00:17:34,800
basically. And then you say, I would like
to have not just a software, but also the
199
00:17:34,800 --> 00:17:41,200
software label. And then we get down here,
the label of the software. And I can also
200
00:17:41,200 --> 00:17:46,270
add the software description. And then we
also see what, what is described. At least
201
00:17:46,270 --> 00:17:53,150
if it has a description and then the query
results are already a lot more usable. And
202
00:17:53,150 --> 00:17:59,170
I'm just going to rename this to "item"
and then we can edit this query however we
203
00:17:59,170 --> 00:18:04,340
want and the variable name will always
kind of match. Because the next query
204
00:18:04,340 --> 00:18:07,610
won't be about software anymore. So it'll
be confusing if you just still call it
205
00:18:07,610 --> 00:18:13,210
"software". But, yeah, there is some
software here like Apache Yetus, Ruby
206
00:18:13,210 --> 00:18:18,780
Version Manager, Wikidata missing
pictures, Pi-hole, all written in Bash.
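With the label service and the renamed variable, the query would look roughly like this. The `SERVICE` line is the snippet the Wikidata Query Service editor autocompletes; the Bash item ID is, as before, an assumption to be verified.

```sparql
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item wdt:P277 wd:Q189248.   # programming language: Bash (item ID assumed)
  # Fills in ?itemLabel and ?itemDescription for any selected ?item.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

Adding a fallback language, as done a little later in the talk, means extending the language list, for example `"[AUTO_LANGUAGE],en,de"`.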
207
00:18:18,780 --> 00:18:27,790
OK, I have several more example queries
here, which are kind of simple, should I
208
00:18:27,790 --> 00:18:34,100
skip ahead or is it good if I do a few
more simple examples. Skip ahead? Is that
209
00:18:34,100 --> 00:18:41,180
OK? OK, then let's. So who was born at sea
is not all that interesting. Just Place of
210
00:18:41,180 --> 00:18:45,020
birth at sea. We have a special value for
that and it's not a very interesting list.
211
00:18:45,020 --> 00:18:48,780
I think a few results, just five or so,
because most people are going to have
212
00:18:48,780 --> 00:18:51,890
"place of birth: Atlantic Ocean" or
something. Which places are located on the
213
00:18:51,890 --> 00:18:57,180
White Elster, just something for the
Leipzig people. And where does the
214
00:18:57,180 --> 00:19:00,750
Neverending Story take place? This
is actually kind of cute. Let's do that.
215
00:19:00,750 --> 00:19:06,220
Also, this is a bit interesting because in
this case, the variable is in the last
216
00:19:06,220 --> 00:19:13,330
place and not the first one. So that… and
then we have the Neverending Story in the
217
00:19:13,330 --> 00:19:19,620
beginning and narrative location. And then
the item is at the end instead of at the
218
00:19:19,620 --> 00:19:24,660
beginning of a triple. And it works just
as well, except that a lot of these don't
219
00:19:24,660 --> 00:19:31,800
have a label in English. So let's add
German as a fallback language. And then we
220
00:19:31,800 --> 00:19:37,630
get all of these places which someone
added to Wikidata at some point. Let's see
221
00:19:37,630 --> 00:19:42,410
if there's any useful information about
them. So they all have IDs in the same
222
00:19:42,410 --> 00:19:47,890
range. So it looks like they were all
created at the same time because the IDs
223
00:19:47,890 --> 00:19:51,880
are just increasing all the time. So the
Gelichterland is a place from the
224
00:19:51,880 --> 00:19:55,261
Neverending Story, it's a
fictional country. It has a capital, which
225
00:19:55,261 --> 00:20:00,740
is this fictional place. It's located on
the… this terrain feature, it's present in
226
00:20:00,740 --> 00:20:05,600
the Neverending Story. And it depicts
horror fiction. I'm not sure about that,
227
00:20:05,600 --> 00:20:12,350
but let's leave it alone for now. OK,
yeah. And skip to a slightly more
228
00:20:12,350 --> 00:20:20,120
interesting query, which is this one,
which popes had children. So what is the
229
00:20:20,120 --> 00:20:24,580
graph going to look like for this? How
many, how many triples are we going to
230
00:20:24,580 --> 00:20:29,090
have? So triple is node, arrow, and
another node, how many triples would you
231
00:20:29,090 --> 00:20:36,500
need for "Pope has a child"? Let's do a
raising hands. Who thinks you need zero
232
00:20:36,500 --> 00:20:43,380
triples, OK? Who thinks you need one
triple? Who thinks you need two triples?
233
00:20:43,380 --> 00:20:48,180
That's more people. Does anyone think you
need three triples? No. OK, so mostly two,
234
00:20:48,180 --> 00:20:54,250
but some people think one. So the one… the
people who think it might need one triple,
235
00:20:54,250 --> 00:21:02,650
perhaps are thinking of something like the
Pope, which is the leader of the worldwide
236
00:21:02,650 --> 00:21:11,200
Catholic Church, has a child, this child
or it's called item, but that's not going
237
00:21:11,200 --> 00:21:15,360
to have any results. Or it could be the
other way around. And you could say that…
238
00:21:15,360 --> 00:21:26,370
oh let's just comment this out. The item
has "father: the pope". And that doesn't
239
00:21:26,370 --> 00:21:30,650
work. Because the items are not… the
children are not directly connected to the
240
00:21:30,650 --> 00:21:35,170
item for the office of the pope, instead
it's going to be two levels. It's going to
241
00:21:35,170 --> 00:21:40,400
say the child has a father, some person,
and then the person has the office pope or
242
00:21:40,400 --> 00:21:44,690
has the position pope or is a pope or
something. So you need this level of
243
00:21:44,690 --> 00:21:49,150
indirection. So in the graph that looks
either like this or it could be the other
244
00:21:49,150 --> 00:21:54,880
way around. So either the child has a
father pope, which has "position held:
245
00:21:54,880 --> 00:22:00,930
pope" or the pope has a child and also a
"position held", so that's kind of an
246
00:22:00,930 --> 00:22:04,090
example of the redundancy I mentioned
earlier, we have the two directions
247
00:22:04,090 --> 00:22:11,300
"child" and also "father"/"mother", and-
so you can ask your query in two ways, and
248
00:22:11,300 --> 00:22:14,170
it doesn't really make that much of a
difference, assuming that the data is
249
00:22:14,170 --> 00:22:19,700
complete. And I think someone occasionally
runs queries to check if any of these
250
00:22:19,700 --> 00:22:25,460
circles are missing. So let's try one of
them, let's just stay with this one, so
251
00:22:25,460 --> 00:22:32,030
the item does not have "pope" as father,
it has some pope, and then this pope has
252
00:22:32,030 --> 00:22:42,720
"position held: pope". And then let's add
the "pope" label and… yeah, pope label is
253
00:22:42,720 --> 00:22:49,800
enough, and then we get 24 results! So we
have a Duke of Parma which, who was the
254
00:22:49,800 --> 00:22:55,150
son of Paul III. Paul III had three
children. Let's sort by this. Wow,
255
00:22:55,150 --> 00:23:04,390
Alexander VI was very busy. And some of
them just have, oh oh oh, we have
256
00:23:04,390 --> 00:23:08,780
duplicates, Giovanni Borgia and Giovanni
Borgia. Should I demonstrate Wikidata
257
00:23:08,780 --> 00:23:13,550
editing now or do we just ignore this? So,
yeah, someone imported a lot of
258
00:23:13,550 --> 00:23:19,050
information from this peerage database and
apparently we have some duplicate items
259
00:23:19,050 --> 00:23:24,140
here, let's just leave those alone for
now. In fact, I think this and this also
260
00:23:24,140 --> 00:23:29,770
looks suspiciously similar. Giovanni
Borgia, unless he had two children of that
261
00:23:29,770 --> 00:23:38,190
name. I mean, he could have. So this… we
have a date of birth 1470s… 1498. No, that
262
00:23:38,190 --> 00:23:44,970
might actually be different children. OK,
not a very creative father in the names.
263
00:23:44,970 --> 00:23:52,980
Yeah. And wait, that's a pope who's a
child of another pope. Very interesting!
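The two-triple query built up above is not shown in the subtitles. A sketch of it, using the variable names from the talk, could look like this. The IDs are not spoken in the talk and are quoted from memory: `P22` (father), `P39` (position held), and `Q19546` (pope).

```sparql
SELECT ?item ?itemLabel ?pope ?popeLabel WHERE {
  ?item wdt:P22 ?pope.        # the child has some father...
  ?pope wdt:P39 wd:Q19546.    # ...who held the position "pope"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

The single-triple attempts fail because no child is linked directly to the item for the office itself; the second triple supplies the missing level of indirection.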
264
00:23:52,980 --> 00:23:56,460
And another one. And another one. We have
three popes who are children of other
265
00:23:56,460 --> 00:24:02,000
popes. Let's search for those! So we would
also need for that, that the item has
266
00:24:02,000 --> 00:24:11,380
"position held: Pope", and I could copy
paste this, but just do this. So the item
267
00:24:11,380 --> 00:24:14,300
should be… child should have a "father:
pope" and the item should have "position
268
00:24:14,300 --> 00:24:18,380
held: Pope", and the pope should also have
"position held: pope". And in this case,
269
00:24:18,380 --> 00:24:22,690
it would probably be less confusing to
call these "child" and "father", because
270
00:24:22,690 --> 00:24:26,480
this is also a pope now, but… variable
names. One of the three hardest problems
271
00:24:26,480 --> 00:24:30,490
in computer science, right? Yeah, we have
three children who are… three popes who
272
00:24:30,490 --> 00:24:36,540
are children of other popes. Wow. I'm
actually going to save this query, popes
273
00:24:36,540 --> 00:24:42,470
who were children of other popes. But
actually, we can future-proof this a
274
00:24:42,470 --> 00:24:47,910
little bit, because right now we've only
said that the father should be a pope. But
275
00:24:47,910 --> 00:24:50,730
in case there's ever a female pope, let's
just switch this around and say that the
276
00:24:50,730 --> 00:24:58,640
pope should have the child… item and then
it's going to work, even if the pope
277
00:24:58,640 --> 00:25:03,140
happens to be female and is a mother
instead of a father. There we go, same
278
00:25:03,140 --> 00:25:13,070
three results. OK, and let's keep that,
and open a new tab for the next queries. Yeah.
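The saved, future-proofed query might look roughly like the sketch below. It uses the "child" direction so that it does not depend on the parent being a father. The IDs are again quoted from memory rather than from the talk: `P40` (child), `P39` (position held), `Q19546` (pope).

```sparql
# Popes who were children of other popes
SELECT ?item ?itemLabel ?pope ?popeLabel WHERE {
  ?pope wdt:P40 ?item.        # the parent has this child
  ?pope wdt:P39 wd:Q19546.    # the parent held the position "pope"
  ?item wdt:P39 wd:Q19546.    # and so did the child
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

As the talk notes, `?child` and `?parent` would be less confusing variable names here, since both nodes are popes.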
279
00:25:13,070 --> 00:25:18,340
Which Microsoft software runs on Linux?
OK. That's not that funny. So perhaps we
280
00:25:18,340 --> 00:25:23,010
can just skip it… I don't know. That joke
kind of ran out of steam a while ago.
281
00:25:23,010 --> 00:25:26,630
Basically looks like this and it's like
Visual Studio Code and three other
282
00:25:26,630 --> 00:25:31,230
programs, meh. What are some compositions
for organ and orchestra? This isn't funny
283
00:25:31,230 --> 00:25:35,710
at all, but I just find it very nice
because it's just an awesome sound. And so
284
00:25:35,710 --> 00:25:40,860
that would be… the composition has the
instrumentation "organ" and also
285
00:25:40,860 --> 00:25:52,670
"orchestra", which we can write as… item,
item label… composition… instrumentation,
286
00:25:52,670 --> 00:26:11,650
this one, orchestra. And also,
"composition… organ". And then, oops,
287
00:26:11,650 --> 00:26:18,120
yeah, this should be "item"… and also I
forgot to add the label service. There we
288
00:26:18,120 --> 00:26:28,300
go. And we have 12 results, which is nice
if you want to listen to any of those. We
289
00:26:28,300 --> 00:26:38,570
could also check if any of them have an
audio file on Commons. Let's see. One, OK,
290
00:26:38,570 --> 00:26:46,460
and I think we've heard this one already.
So, but… one thing that's kind of annoying
291
00:26:46,460 --> 00:26:50,420
here, I should have mentioned this in the
last query, I think. So I had to repeat
292
00:26:50,420 --> 00:26:53,490
the item and the property ID, which is a
bit annoying and makes the query difficult
293
00:26:53,490 --> 00:26:57,740
to read. And what you can do is leave that
out and you can also do this in the
294
00:26:57,740 --> 00:27:04,660
previous case. So let's actually go one
slide back. So here I didn't write twice
295
00:27:04,660 --> 00:27:07,350
that it's the software which should have
the developer, and also the operating
296
00:27:07,350 --> 00:27:10,860
system. I just wrote the software has
"developer: Microsoft" and also with a
297
00:27:10,860 --> 00:27:16,690
semicolon at the end instead of a period,
it has "operating system: Linux". So if
298
00:27:16,690 --> 00:27:18,920
you read this as English it's just one
sentence where you don't repeat the
299
00:27:18,920 --> 00:27:22,230
subject twice. The software has
"developer: Microsoft" and "operating
300
00:27:22,230 --> 00:27:26,220
system: Linux", instead of "software has
developer: Microsoft" and "software has
301
00:27:26,220 --> 00:27:31,350
operating system: Linux". And if you… if
the property here is also the same thing,
302
00:27:31,350 --> 00:27:36,330
then you can even leave that out and add a
comma at the end and just list the two
303
00:27:36,330 --> 00:27:41,170
values and you don't even have to repeat
the instrumentation. So let's do that here
304
00:27:41,170 --> 00:27:47,450
and abbreviate this query. And it has the
exact same 12 results, just slightly more
305
00:27:47,450 --> 00:27:54,720
convenient to read and… to write at least,
hopefully also to read. I don't know. But
306
00:27:54,720 --> 00:27:56,840
you don't use the comma that much. The
semicolon is pretty useful, like we could
307
00:27:56,840 --> 00:28:06,600
have written this as, the pope has, er,
the child and also position held like
308
00:28:06,600 --> 00:28:10,530
this. It means exactly the same, but you
can immediately see that both of these
309
00:28:10,530 --> 00:28:18,400
refer to the pope because there's just a
bunch of blank space here. Yeah, so then
310
00:28:18,400 --> 00:28:27,690
we have this one. This isn't funny at all,
but there are a lot of people who used to
311
00:28:27,690 --> 00:28:33,000
be in the Nazi Party during World War II
and then who later just went back into a
312
00:28:33,000 --> 00:28:37,340
civil life and even received the
Bundesverdienstkreuz, the order of merit
313
00:28:37,340 --> 00:28:42,320
of the Federal Republic of Germany. And
you can find those… in this case I've done
314
00:28:42,320 --> 00:28:46,600
it with three triples, which is, the
person was a member of this political
315
00:28:46,600 --> 00:28:51,510
party and received this award. And also
I've added that they're "instance of:
316
00:28:51,510 --> 00:28:55,040
human", because we also have a lot of
fictional data on Wikidata. You already
317
00:28:55,040 --> 00:28:57,701
saw that with the Neverending Story stuff
earlier. So there might also be a
318
00:28:57,701 --> 00:29:02,040
fictional character who was a member of
this political party and who received the
319
00:29:02,040 --> 00:29:07,300
award, and we're not really interested in
those. So we add "instance of: human", and
320
00:29:07,300 --> 00:29:11,420
then we are certain that we only get real
results and not fictional results. And it
321
00:29:11,420 --> 00:29:14,410
doesn't really cost us anything because
the Query Service can optimize that pretty
322
00:29:14,410 --> 00:29:22,160
well. So let's write that… actually, let's
do that here. So the item should be
323
00:29:22,160 --> 00:29:31,670
"instance of: human", which is Q5, because
it's a very common item, and "member of
324
00:29:31,670 --> 00:29:39,920
political party". And you can see I can
search by the German abbreviation and find
325
00:29:39,920 --> 00:29:44,250
this, even though it's not a label,
because there are search aliases. And also
326
00:29:44,250 --> 00:29:48,690
"award received", the
Bundesverdienstkreuz, because I can't be
327
00:29:48,690 --> 00:29:54,300
bothered to type in the whole English
name. There we go. And we find, I think…
328
00:29:54,300 --> 00:30:03,720
how many results? Eleven results. Yeah.
And this actually isn't quite correct,
329
00:30:03,720 --> 00:30:10,250
because in theory, you don't get this
order, this order has like 11 parts or
330
00:30:10,250 --> 00:30:15,310
something. You can get the Grand Cross
with Distinction or you can get the Star
331
00:30:15,310 --> 00:30:19,280
or whatever. I think it's listed somewhere
here. Yeah, you can get the Grand Cross
332
00:30:19,280 --> 00:30:22,580
Special Class, you can get the Grand Cross
Special Issue, you can get the Grand Cross
333
00:30:22,580 --> 00:30:27,020
First Class, blah blah blah. And so, in
theory, any of these people should have
334
00:30:27,020 --> 00:30:34,190
one of these awards and not just "order of
merit". But I think when I checked, all of
335
00:30:34,190 --> 00:30:42,190
them just had… all the results, just had
directly "order of merit". But actually,
336
00:30:42,190 --> 00:30:48,230
no, we can try to search for the correct
ones instead. So it would not be part of
337
00:30:48,230 --> 00:30:53,650
this directly, it would be… "award
received" would be some award, such as
338
00:30:53,650 --> 00:31:03,310
this one, and then this award is part of
the order of merit, so "award"… "part of"…
339
00:31:03,310 --> 00:31:14,670
Let's see if that finds any results. Oh.
Oh. Oh, dear. Yeah, that, that… that's a
340
00:31:14,670 --> 00:31:21,210
lot of results. "Herbert von Karajan".
That's… that's depressing. OK, yeah. OK, so
341
00:31:21,210 --> 00:31:24,000
I think I… when I tried this out and
didn't find any results, I just did
342
00:31:24,000 --> 00:31:30,430
something wrong, because this way we find
a lot more results. And if we… so we don't
343
00:31:30,430 --> 00:31:35,660
actually select the award here, because we
don't care what kind of award they got. So
344
00:31:35,660 --> 00:31:41,710
we could also use this abbreviation again,
like this. So we just say they got some
345
00:31:41,710 --> 00:31:47,280
award, which is part of the order of
merit. And in this case, we could even
346
00:31:47,280 --> 00:31:54,000
abbreviate that further and say, we put a
slash here. And then, that kind of
347
00:31:54,000 --> 00:31:58,420
describes a path that you have to take
from this item to this item and you have
348
00:31:58,420 --> 00:32:03,900
to first get to some award received. And
then that has to be part of something
349
00:32:03,900 --> 00:32:08,020
else. And you can add as many elements
here as you want. And then we get the
350
00:32:08,020 --> 00:32:17,540
exact same 802 results… and… lots of well-
known names here. And if we want to find
351
00:32:17,540 --> 00:32:21,500
the original 11 ones that directly had the
order of merit as the award received, we
352
00:32:21,500 --> 00:32:25,970
can add a question mark here, which is
just like in a regular expression, it says
353
00:32:25,970 --> 00:32:32,360
this part is optional. They can have
directly received this award or they can
354
00:32:32,360 --> 00:32:36,090
have received some award, which is part of
the order of merit. And then we should get
355
00:32:36,090 --> 00:32:47,540
813. Yeah, 813 results, so 802, plus the
11 from earlier. And… I'm starting this
356
00:32:47,540 --> 00:32:53,020
with "instance of: human", which… and the
Query Service is going to re-order this
357
00:32:53,020 --> 00:32:57,210
because searching for all the humans and
then filtering for the ones who are in
358
00:32:57,210 --> 00:33:01,270
this political party and so on wouldn't be
efficient. So I don't have to worry about
359
00:33:01,270 --> 00:33:05,970
that. I could write it in this order, or I
could shuffle it around. Doesn't make any
360
00:33:05,970 --> 00:33:10,020
difference. The Query Service already
knows in which order to do these things.
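
The award query with the semicolon abbreviation and the optional path element could be sketched as follows. The identifiers Q7320 (Nazi Party) and Q10905380 (Order of Merit of the Federal Republic of Germany) are assumptions quoted from memory, as are the property numbers in the comments; verify them before relying on the results.

```sparql
# Nazi Party members who received the Order of Merit (sketch; IDs are assumptions).
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 ;                     # instance of: human
        wdt:P102 wd:Q7320 ;                 # member of political party
        wdt:P166/wdt:P361? wd:Q10905380 .   # award received, optionally followed by "part of"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

Without the `?`, only people whose award is a part of the order match; with a plain `wdt:P166 wd:Q10905380`, only people who hold the order itself as their award match. The `?` applies to the last path element only, so the path accepts both cases.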
361
00:33:10,020 --> 00:33:14,110
So you don't have to worry about that. You
can just start with "is a human" and then
362
00:33:14,110 --> 00:33:23,310
add everything else. I think I have one
more complicated query here. Yeah, so
363
00:33:23,310 --> 00:33:27,620
that's one of the examples I mentioned
earlier, the largest cities by population
364
00:33:27,620 --> 00:33:33,200
with a female mayor. So the graph for that
is, I think the largest one I prepared for
365
00:33:33,200 --> 00:33:37,570
the slides, except the one in the
beginning. And it looks like this. We
366
00:33:37,570 --> 00:33:41,340
should have a city which is a city,
"instance of: city", and it has a certain
367
00:33:41,340 --> 00:33:45,990
population, and it has… so for the mayor,
we use the same property as for head of
368
00:33:45,990 --> 00:33:52,270
government. And if you don't know that,
you could look at some city like Berlin
369
00:33:52,270 --> 00:33:59,280
and maybe you know what the mayor of
Berlin is called… what was it? Something
370
00:33:59,280 --> 00:34:04,540
"Müller", I think. Yeah. And then you can
see, aha, the property for the mayor is
371
00:34:04,540 --> 00:34:13,909
"head of government". Or you could also
search for, the city should have a mayor,
372
00:34:13,909 --> 00:34:19,490
and then you'll still find "head of
government", the right property. And that
373
00:34:19,490 --> 00:34:24,879
mayor should be a human and she should
have the gender "female". Oops. There's a
374
00:34:24,879 --> 00:34:28,369
question mark there for no reason at all.
That's not a variable. That should be the
375
00:34:28,369 --> 00:34:36,940
fixed value. Sorry. So let's put that
there. We have a city which is "instance
376
00:34:36,940 --> 00:34:49,759
of: city", and it also has a population
which we're going to use later and it also
377
00:34:49,759 --> 00:34:55,139
has a head of government. No, that's
wrong. Not the "office held by head of
378
00:34:55,139 --> 00:34:59,380
government", the "head of government"
itself, which we call the mayor and then
379
00:34:59,380 --> 00:35:17,609
the mayor is "instance of: human" and
gender should be female… come on… female.
380
00:35:17,609 --> 00:35:27,649
And let's select the city, cityLabel,
mayorLabel and also the population. And
381
00:35:27,649 --> 00:35:31,220
then we find some 83 results. That's not
yet the largest cities with a female
382
00:35:31,220 --> 00:35:37,269
mayor. That's just all of them. And in
Wikidata we know about 83, apparently. And
383
00:35:37,269 --> 00:35:41,740
if your local hometown has a female mayor,
just go ahead and add it to Wikidata and
384
00:35:41,740 --> 00:35:47,009
it's probably relevant. It's not– So the
relevance criteria are not as strict as on
385
00:35:47,009 --> 00:35:52,529
Wikipedia fortunately. But if we want just
the most populous ones, we can go a bit
386
00:35:52,529 --> 00:35:59,760
back into SQL land and say we want to
ORDER BY the population and in SQL you
387
00:35:59,760 --> 00:36:03,420
would write DESC afterwards and in SPARQL
it's different. You write
388
00:36:03,420 --> 00:36:09,700
DESC(?population). Erm, I think it's nicer
that way. But perhaps it would have been
389
00:36:09,700 --> 00:36:13,740
nicer to just stick with the SQL syntax. I
don't know. And we want to limit this to
390
00:36:13,740 --> 00:36:19,160
just the ten most populous cities, for
example. And here we go. Tokyo is
391
00:36:19,160 --> 00:36:25,819
currently the biggest one, then Hong Kong,
Baghdad, Surabaya, Rome. Yeah. And, oh.
392
00:36:25,819 --> 00:36:37,190
This doesn't make that much sense, Caracas
has two mayors. Anyone… yeah, exactly. So
393
00:36:37,190 --> 00:36:43,819
we're only supposed to get the current
mayor. Head of government… yeah. Does
394
00:36:43,819 --> 00:36:51,890
anyone know which one is the current one?
Or we could just check Wikipedia… Caracas,
395
00:36:51,890 --> 00:36:55,730
which hopefully doesn't get its
information from Wikidata yet. So it's not
396
00:36:55,730 --> 00:37:07,940
circular. And the mayor is… Carolina,
Carolina Cestari… Cestari, I don't know.
397
00:37:12,090 --> 00:37:14,660
[Laughter]
398
00:37:14,660 --> 00:37:25,420
OK, so let's add a new one. Ah…? Doesn't
have an item yet, is that… is that the
399
00:37:25,420 --> 00:37:31,369
mayor, or is chief of government something
else? Doesn't occur anywhere else on the
400
00:37:31,369 --> 00:37:45,420
page, of course. Local government… mayor…
no. OK, so let's just… I don't know,
401
00:37:45,420 --> 00:37:55,059
doesn't she have a Wikipedia article? No.
Just appears in some lists and then she
402
00:37:55,059 --> 00:38:01,210
doesn't have a Wikidata item yet? No.
Then… I don't know. We'll do some live
403
00:38:01,210 --> 00:38:04,660
Wikidata editing. It wasn't part of this
talk, but let's just do it. Carolina
404
00:38:04,660 --> 00:38:17,270
Cestari… what country is that? Venezuela.
Venezuelan politician, and that sounds
405
00:38:17,270 --> 00:38:22,609
like a female name, so I'm just going to
guess and check that after the talk. So
406
00:38:22,609 --> 00:38:29,330
she's definitely a human. And gender is
female and that is going to be enough for
407
00:38:29,330 --> 00:38:37,930
our query. Do this search again. There we
go. And set this to preferred rank. So
408
00:38:37,930 --> 00:38:40,559
that's how the Query Service knows that
this is the current value and it should
409
00:38:40,559 --> 00:38:44,500
only return this one. And ideally, one of
the head of government values should have
410
00:38:44,500 --> 00:38:50,240
this preferred rank to mark it as the
correct current value. And then all the
411
00:38:50,240 --> 00:38:53,640
other ones are additional data that you
can use if you want. But it's not the main
412
00:38:53,640 --> 00:39:00,859
value and we are not going to get it in a
simple query. And then there's some error
413
00:39:00,859 --> 00:39:06,259
because Caracas isn't some kind of
political territorial entity and it should
414
00:39:06,259 --> 00:39:12,579
have a start time. I don't care right now.
OK, so we run this query again and
415
00:39:12,579 --> 00:39:21,400
hopefully get just one result for Caracas
this time. No. Uhm, we have to wait a bit
416
00:39:21,400 --> 00:39:26,450
until the Query Service is updated.
Because it's kind of asynchronous. It just
417
00:39:26,450 --> 00:39:33,639
keeps watching for changes and eventually
it will get the new data, but… okay. It
418
00:39:33,639 --> 00:39:42,079
might take a bit longer. Anyways. That's
how that query works. Does that make kind
419
00:39:42,079 --> 00:39:51,710
of sense? OK, great. Yeah, I think this is
almost exactly what I wrote here. Yeah.
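
As a rough sketch, the first version of the female-mayor query described above might read like this. The identifiers (Q515 city, P1082 population, P6 head of government, P21 sex or gender, Q6581072 female) are quoted from memory and should be treated as assumptions.

```sparql
# Ten most populous cities with a female mayor, first version (sketch).
SELECT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city wdt:P31 wd:Q515 ;         # instance of: city
        wdt:P1082 ?population ;   # population
        wdt:P6 ?mayor .           # head of government, which is also used for mayors
  ?mayor wdt:P31 wd:Q5 ;          # instance of: human
         wdt:P21 wd:Q6581072 .    # sex or gender: female (a fixed value, not a variable)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?population)
LIMIT 10
```

The `wdt:` prefix only returns the best-ranked statements for a property, which is why marking the current mayor's statement as preferred removes former mayors from the results once the Query Service has caught up.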
420
00:39:51,710 --> 00:39:56,039
Except with some labels and the label
service. Yeah. There is one problem here,
421
00:39:56,039 --> 00:40:02,019
which is, for example, I happen to know
that Mexico City is a very large city with
422
00:40:02,019 --> 00:40:11,430
a population of… population: almost 9
million. So it should be right after Tokyo
423
00:40:11,430 --> 00:40:19,259
in front of Hong Kong. And the head of
government is a Claudia Sheinbaum or
424
00:40:19,259 --> 00:40:23,980
something, which sounds like a woman. So
we should get this result in the query.
425
00:40:23,980 --> 00:40:29,089
The reason we don't is that Mexico City is
an instance of "big city" and we have
426
00:40:29,089 --> 00:40:35,470
searched for "instance of: city". And
there's some debate about does this class
427
00:40:35,470 --> 00:40:39,860
even make sense at all? I think this is
actually the German classification of, a
428
00:40:39,860 --> 00:40:43,859
big city is one with 100,000 inhabitants,
and in other languages or countries, a big
429
00:40:43,859 --> 00:40:49,000
city might be something else, but for now
that… the data is what it is. Fortunately,
430
00:40:49,000 --> 00:40:54,049
what we have here is the information, a
"big city" is a subclass of a city/town,
431
00:40:54,049 --> 00:41:04,599
which is a subclass of "locality", which
is a subclass of… Wait. We should arrive
432
00:41:04,599 --> 00:41:07,789
at city at some point, but I think we've
already gone past that. It's also an
433
00:41:07,789 --> 00:41:12,080
instance of capital. Let's go down that
instead. A capital is a subclass of city,
434
00:41:12,080 --> 00:41:16,670
there we go. So if we can tell the Query
Service to follow these subclass
435
00:41:16,670 --> 00:41:22,609
connections, then we should find these
cities. And one way to do that… to make it
436
00:41:22,609 --> 00:41:29,500
work for Mexico City would be to say, it
has to be "instance of", some, with the
437
00:41:29,500 --> 00:41:37,160
path again, "subclass of: city" and then
we would find Mexico City, but we would
438
00:41:37,160 --> 00:41:42,690
not find all the… oh, we would still find
Tokyo because it's still a capital, I
439
00:41:42,690 --> 00:41:47,319
guess. But we've missed a lot of other
cities, I think which we used to have…
440
00:41:47,319 --> 00:41:53,609
yeah. Rome, for example, is gone. Because
it's… that's just an instance of city
441
00:41:53,609 --> 00:41:57,420
directly. And we've now made the subclass
mandatory. What we should do is make it
442
00:41:57,420 --> 00:42:02,490
optional, or even better, we would– we
should say there can be any number of this
443
00:42:02,490 --> 00:42:06,960
element. So there… it can be an instance
of city or it can be an instance of a
444
00:42:06,960 --> 00:42:10,839
subclass of city, it can be an instance of
a subclass of a subclass of city. You can
445
00:42:10,839 --> 00:42:14,359
follow any number of elements, that what
this… that's what this star means, just
446
00:42:14,359 --> 00:42:19,390
like in a regular expression. And then we
probably have to say we only want the
447
00:42:19,390 --> 00:42:24,359
distinct ones because there are like five
different ways to go through the subclass
448
00:42:24,359 --> 00:42:30,050
tree until you've found "city". And we're
not interested in the different ways. But
449
00:42:30,050 --> 00:42:35,330
now we should get Tokyo and Mexico City.
And Rome is also here and Caracas is
450
00:42:35,330 --> 00:42:39,249
completely gone because we found enough
other cities which we were missing
451
00:42:39,249 --> 00:42:45,810
earlier. So you kind of have to watch out
and sometimes use elements like this…
452
00:42:45,810 --> 00:42:51,940
"subclass of"-tree is pretty common, or
with a, something… order of merit, we had
453
00:42:51,940 --> 00:42:56,839
to use this "part of". You have to watch
out if the results are plausible, or
454
00:42:56,839 --> 00:43:00,570
ideally, you know some item that should be
in the results, and then you check, is it
455
00:43:00,570 --> 00:43:05,779
there? Why is it not there? And
investigate like that. But that's a fixed
456
00:43:05,779 --> 00:43:10,910
version of the query. And… yeah, if we
were not interested in the mayor, we could
457
00:43:10,910 --> 00:43:15,019
do the same trick again. But, yeah. It
doesn't make that much of a difference.
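
A sketch of the fixed query, with the subclass path and `DISTINCT`, might look like the following. As before, the identifiers (Q515 city, P279 subclass of, P1082 population, P6 head of government, P21 sex or gender, Q6581072 female) are assumptions quoted from memory.

```sparql
# Largest cities with a female mayor, following subclasses of "city" (sketch).
SELECT DISTINCT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;   # instance of city, or of any (sub)subclass of city
        wdt:P1082 ?population ;
        wdt:P6 ?mayor .
  ?mayor wdt:P31 wd:Q5 ;
         wdt:P21 wd:Q6581072 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?population)
LIMIT 10
```

The `*` allows zero or more "subclass of" steps, so direct instances of city such as Rome still match, while `DISTINCT` collapses the duplicate rows that arise when a city reaches "city" through several subclass chains.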
458
00:43:15,019 --> 00:43:19,119
And I think… yeah, that was almost the
only difference. Yeah, except that I
459
00:43:19,119 --> 00:43:22,829
removed the population so we can order by
a variable that you don't select in the
460
00:43:22,829 --> 00:43:33,759
end if you want. And I think I am out of
slides. So, yeah, if you want to see more
461
00:43:33,759 --> 00:43:38,029
queries, you can look at these Twitter or
social media accounts. There's a huge list
462
00:43:38,029 --> 00:43:43,299
of example queries on Wikidata, which is
so big that it's getting too big for a
463
00:43:43,299 --> 00:43:46,499
wiki page, and people had to move some
queries out there and it's kind of just
464
00:43:46,499 --> 00:43:50,890
grown since 2015 or something. And there's
a lot of garbage there, but also a lot of
465
00:43:50,890 --> 00:43:55,970
useful queries if you want to look at
that. And I had two more queries in the
466
00:43:55,970 --> 00:44:00,900
talk description which we haven't talked
about yet, and I think we have the time. I
467
00:44:00,900 --> 00:44:04,400
can just try to open these. "Which films
starred more than one future head of
468
00:44:04,400 --> 00:44:15,210
government?" Does that work? It doesn't.
Can I copy the URL here? Yeah, copy link
469
00:44:15,210 --> 00:44:20,700
address. So that's a kind of longer query,
which is why it didn't really fit on one
470
00:44:20,700 --> 00:44:26,480
slide. But the important film is you have…
er, the important part is you have some
471
00:44:26,480 --> 00:44:32,480
film… instance of, or subclass of film, it
has a publication date and a cast member,
472
00:44:32,480 --> 00:44:41,070
which is the head of government. And the
head of government held some position,
473
00:44:41,070 --> 00:44:47,009
some head of government, er, some subclass
of head of government. And that should be
474
00:44:47,009 --> 00:44:53,330
after the film was published. And then you
get a bunch of results. I think this takes
475
00:44:53,330 --> 00:45:00,069
like 11 seconds or something. And you get
like films with Schwarzenegger and one
476
00:45:00,069 --> 00:45:05,750
other actor who became US governor. I
don't remember the name. And you also get
477
00:45:05,750 --> 00:45:09,890
a lot of… or several films from World War
II with future French heads of government,
478
00:45:09,890 --> 00:45:15,910
which is really cool. So, like a film that
was shot about the liberation of Paris,
479
00:45:15,910 --> 00:45:20,289
where it's… it's kind of a stretch to call
them cast members, but they're definitely
480
00:45:20,289 --> 00:45:26,190
in the film. And if we get the result,
then I can tell you what the film is
481
00:45:26,190 --> 00:45:35,381
called. Yeah, it might be busy right now,
so you get up to 60 seconds in the Query
482
00:45:35,381 --> 00:45:40,210
Service and then in the end your query is
killed if it takes longer than that. So
483
00:45:40,210 --> 00:45:43,039
sometimes it can be a bit of a struggle to
make the query work within 60 seconds.
484
00:45:43,039 --> 00:45:48,359
There we go, 50 seconds. That was close.
So there's yeah, there's a "La Libération
485
00:45:48,359 --> 00:45:52,450
de Paris" with Charles de Gaulle, who was
president of the Council and president of
486
00:45:52,450 --> 00:45:58,240
the provisional government, and also
Georges Bidault, I think, who was prime
487
00:45:58,240 --> 00:46:02,700
minister and president of the Council, and
other stuff. We have several Indian films
488
00:46:02,700 --> 00:46:09,589
with people who went on to become chief
ministers. And then down here there's some
489
00:46:09,589 --> 00:46:14,490
Canadian politicians, apparently. And then
here's Arnold Schwarzenegger and Jesse
490
00:46:14,490 --> 00:46:21,450
Ventura, who both became governors and
also starred in several films. And the
491
00:46:21,450 --> 00:46:26,320
other thing was, we have a lot of data
about the British government because a lot
492
00:46:26,320 --> 00:46:31,670
of volunteers have just been slaving away
at that data and adding and adding more
493
00:46:31,670 --> 00:46:38,789
information. I think they've… they have
all their parliaments, complete with party
494
00:46:38,789 --> 00:46:42,990
affiliations and everything for at least
the last 100 years and some partial data
495
00:46:42,990 --> 00:46:47,020
for a lot more than that, because they
have a very long parliamentary history.
496
00:46:47,020 --> 00:46:51,180
And then you can do queries like "how many
people named John are there in
497
00:46:51,180 --> 00:46:56,420
parliament", and "how many women with any
name". And you can see when the women were
498
00:46:56,420 --> 00:47:01,710
finally more than just the men who are
named "John". And it's kind of an amusing
499
00:47:01,710 --> 00:47:08,160
graph. Or not so amusing. Takes a while as
well. I hope it doesn't take 50 seconds,
500
00:47:08,160 --> 00:47:13,549
but it looks like the Query Service might
be busy at the moment. But I think it was
501
00:47:13,549 --> 00:47:19,910
something like in 1991 or so is the
crossover point. Oh yeah. And I should
502
00:47:19,910 --> 00:47:23,840
mention anyway, so everything we saw right
now was just a lot of tables. But you can
503
00:47:23,840 --> 00:47:31,170
also show results in different ways, such
as a line chart. There we go. So in 1992,
504
00:47:31,170 --> 00:47:35,390
this was the first parliament which had
more women than Johns. And then the Johns
505
00:47:35,390 --> 00:47:41,480
have slightly declined and the women have
gone up to 220. How many people are in the
506
00:47:41,480 --> 00:47:47,690
House of Commons in total? Does anyone
know? No. So I don't know what percentage
507
00:47:47,690 --> 00:47:52,500
this is. Uh, but, this was… yeah, this
latest election from 12 December already
508
00:47:52,500 --> 00:48:02,739
in there. Yeah. [indistinguishable] What?
So the query looks like this. So this one
509
00:48:02,739 --> 00:48:06,400
is broken into several parts. We first
find all the members of parliament, so
510
00:48:06,400 --> 00:48:10,509
they should be human, again, no fictional
people, and then they should have some
511
00:48:10,509 --> 00:48:15,540
"position held", which is a subclass of
"member of parliament" in the House of
512
00:48:15,540 --> 00:48:22,440
Commons. And then there should also be,
um, a parliamentary term on that, so that
513
00:48:22,440 --> 00:48:27,660
we know which parliament it is and when it
starts. And then down here, we import all
514
00:48:27,660 --> 00:48:35,230
those MPs and filter for just the ones
with the "given name: John". And then we
515
00:48:35,230 --> 00:48:39,989
filter for just the ones with "gender:
female". And there's an optional "subclass
516
00:48:39,989 --> 00:48:44,259
of" in here, because currently the data
model is that there is a separate item for
517
00:48:44,259 --> 00:48:49,410
transgender female and someone can have
"gender: transfemale– transgender female",
518
00:48:49,410 --> 00:48:52,940
which is a subclass of "female". And there
is a discussion right now to get rid of
519
00:48:52,940 --> 00:48:56,519
that and have a separate property for that
instead. And then all the trans people
520
00:48:56,519 --> 00:48:59,390
just have "gender:", their right gender,
and you don't have to mess with subclass.
521
00:48:59,390 --> 00:49:03,660
But right now we still… well, we need it
in theory, I don't think there are any MPs
522
00:49:03,660 --> 00:49:08,540
in practice. But, you know, you know, you
can just keep it in there. And then we
523
00:49:08,540 --> 00:49:15,359
import the results and get them here
either as a line chart or as a table, if
524
00:49:15,359 --> 00:49:20,769
you want to sort it by the time… yeah, the
data starts in 1919, apparently. So we
525
00:49:20,769 --> 00:49:25,450
have exactly a hundred years of history
there. We can also show it as a bar chart,
526
00:49:25,450 --> 00:49:30,529
if that makes more sense. No it doesn't.
That makes no sense. Line chart is the
527
00:49:30,529 --> 00:49:35,059
right one. Oh, right, but if you show the
line chart again, then it breaks for some
528
00:49:35,059 --> 00:49:39,059
reason, there's some bug there. So let's
just show it again. There we go. That's
529
00:49:39,059 --> 00:49:47,160
the right… chart. Yeah, and I guess… oh
wow, it's already… 50 minutes, so I guess
530
00:49:47,160 --> 00:49:55,359
this is the point where we start moving to
the live querying part, and I was told I
531
00:49:55,359 --> 00:49:58,690
should make at least a short break for the
stream, so the Angels know where to cut
532
00:49:58,690 --> 00:50:02,770
between. But we could also take a
10-minute break and then start the next
533
00:50:02,770 --> 00:50:09,170
talk on time. Does that sound OK? Or is 10
minutes too long? Uhm, if you're going to
534
00:50:09,170 --> 00:50:13,670
stay here, which would be very nice, then
please think of some example queries that
535
00:50:13,670 --> 00:50:16,820
you think we could write, and then I can
try to write them, because otherwise I'm
536
00:50:16,820 --> 00:50:21,569
not going to have much to do. But yeah,
let's do a 10-minute break and see you
537
00:50:21,569 --> 00:50:24,569
then. Thank you so far.
538
00:50:24,569 --> 00:50:27,219
[Applause]
539
00:50:27,219 --> 00:50:32,429
Postroll Music
540
00:50:32,429 --> 00:50:55,000
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!