-
[36C3 preroll music]
-
Okay, so now to our speaker: he's Lucas.
He's a SPARQL magician, I'm told, and he
-
will introduce you to his favorite
querying language, SPARQL, and give you a
-
little introduction and in the second part
he will do some live coding which is
-
always really interesting and funny and
you can give him some things that he's
-
querying for you and I'm sure we'll have
lots of fun and interesting learning stuff
-
here so give a warm round of applause to
Lucas.
-
[Applause]
-
[inaudible]
-
Is this better? Aha! It's a bit too loud
so I'll just talk a bit until they have
-
figured it out. Yeah so this is going to
be kind of two parts but not really that
-
separate but in the second part I'm
basically going to write the queries that
-
you suggest so if you – if you see what
I'm going to do here and then think oh I
-
have a great idea for something we could
perhaps query then just remember that and
-
we'll get back to that hopefully because
otherwise the second half is going to be
-
really short if I don't get any ideas from
you. But yeah, so this is about querying
-
linked data which allows you to do all
kinds of crazy things and answer all kinds
-
of crazy questions such as I think I had
on the slides something like "what are the
-
largest cities with a female mayor?" and
if you wanted to find that out
-
traditionally you could like go through
Wikipedia and try to find all the largest
-
cities and see which ones have a female
mayor and which ones don't or perhaps
-
there's a category with all the cities
with a female mayor but then you have to
-
sort them by population and it's a whole
mess and with linked data you can find
-
that out much more easily and also all
kinds of other things but let's start with
-
some simple fantasy linked data so this is
a tiny snippet of linked data, some data
-
graph. It's just composed of a load of
nodes which are these ovals and rectangles
-
here and they're connected with arrows and
each of these forms kind of a triple
-
consisting of the start node and then the
arrow and then the end node and that's how
-
we represent all the information you have
in there, in this linked database. So for
-
example we can read this as this talk
right now happens in the Esszimmer or the
-
dining room which is the name of this
stage here and it's going to be followed
-
by the live querying session which also
happens in Esszimmer and the live querying
-
session in turn follows this talk again
and the Esszimmer, the dining room, is
-
next to the kitchen, the Küche, and the
kitchen is next to the dining room again
-
and both of them are part of the
Wikipaka-WG which is part of 36C3 and the
-
talk happens right now and at the same
time there's also some talk about how
-
state elections are climate elections or
something in the Chaos West stage, starts
-
at the same time, Chaos West stage is part
of the Chaos West Assembly which is part
-
of 36C3 as well and so this graph has a
few important properties, for example
-
there's some redundant connections here,
you could see, you could say, if this talk
-
is followed by the live querying then you
don't really need to know that live
-
querying follows this talk, it's kind of
redundant information. You already know
-
it, but it doesn't hurt to have it, and it
often makes your life easier if you have a
-
little bit of redundancy in your graph and
then if you find that one half of this
-
connection is missing for example you can
still investigate what's going on and also
-
in here we have kind of bi-directional
connection so Esszimmer is next to Küche
-
which is next to Esszimmer but this is two
separate arrows and could also be that
-
only one of them is there so you don't
have arrows which go in both
-
directions at once in this data model, it
has to be, if you want something like this
-
you have to have two separate arrows
because that keeps the data model very
-
simple. You just have subject predicate
object and that's everything you have, and
-
then to query this graph, you kind of
select a tiny part of it and then you
-
remove some part that you don't know about
for example we know that this talk is
-
followed by live querying and if we remove
the live querying part, then we can ask
-
something like... Okay, I did it the other
way around. Never mind, this way. This
-
talk is followed by which talk? and then
you have a question because you've
-
left out this part and then if you ask
this question to a query service it can,
-
kind of, you can think of this like a,
err, damn, I only know the German word for
-
this one, a, Schablone, template, so you
put this over the graph and this has to
-
match the existing node this has to match
the existing arrow and then you see which
-
nodes can you put in here and in this case
that's only the live querying or the other
-
way around which talk follows this one so
you can have the beginning of the triple
-
can be a variable like this one or the end
of the triple can be a variable like in
-
this case and you can also have more
complicated patterns like, no there's not
-
a more complicated pattern, this is the
same pattern. You have the question which
-
talk happens in Esszimmer and you have two
answers: this talk happens in Esszimmer
-
and live querying happens in Esszimmer.
But you can also combine more graph nodes
-
like this, for example, which talk happens
in some room, which is part of the
-
Wikipaka-WG. So we have one free part here
and one free part here. But we know that
-
these two have to be connected with,
"happens in", and then this has to be
-
connected with "is part of" to the
Wikipaka-WG. And you can kind of
-
construct– if you can phrase your question
as a kind of graph like this, where some
-
parts are predetermined that you already
know about and the other parts that you
-
want to find. Those are these kind of
variables which are here indicated with
-
just dashed lines. Then you can ask that
question to the graph and find the
-
matching results. In this case, you have
these two matches, this talk happens in
-
Esszimmer as part of Wikipaka-WG and live
querying happens in Esszimmer, is part of
-
Wikidata– Wikipaka-WG. And then, if you–
if we had more information in this graph
-
here, we might also have other rooms. For
example, there's this library over there
-
which also is going to have some talks. If
we had the whole schedule in here, we
-
would find those as well. And we could
also adapt the query so that we don't even
-
make the Wikipaka-WG part fixed. We could
ask for anything that happens in 36C3. So
-
that would be some variable, happens in
some room, is part of some assembly, is
-
part of 36C3. And then we would find this
thing as well because it fits the same
-
kind of pattern: happens in, is part of,
is part of 36C3. Does that make sense?
-
Hopefully. I'm seeing a lot of nodding
heads. OK, that's great. So then we can
-
try to move ahead to actually ask some of
these questions to a real query system.
-
Because in reality, you're not going to
actually draw these graphs, but you have
-
some kind of language where you phrase
them instead, which looks a bit like this.
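The slide itself is not part of this transcript, so the following is only a sketch of what such a query might look like. The prefix is written `c3:` rather than `36C3:` because SPARQL prefix names cannot start with a digit, and the namespace IRI is a placeholder invented for this illustration.

```sparql
# Assumption: this namespace IRI is made up for the sketch; it is not a real vocabulary.
PREFIX c3: <https://example.org/36c3/>

# Which talks happen in the Esszimmer?
SELECT ?talk WHERE {
  ?talk c3:happensIn c3:Esszimmer .   # subject (variable), predicate, object, then a period
}
```

Run against the example graph, this pattern should match the two talks that happen in the Esszimmer.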
-
So you have the part: SELECT anything
WHERE, that is kind of like SQL, and then
-
everything else is not like SQL. Forget
SQL! I hear this is easier to understand
-
if you don't know SQL. I didn't know SQL
that much when I learned SPARQL, and I
-
think it helped me, apparently. But what
you write down here is these, is this kind
-
of description of the graph, and these
dashed parts, which are the variables
-
which you don't yet know. Those are marked
with a question mark because that's kind
-
of what you use to ask a question. In this
case, I've just called it "?talk", but it
-
could be any name, basically. And then
instead of "happens in" as two words, I've
-
just written "happensIn" as one and then
with the prefix "36C3" and it happens in
-
the 36C3 Esszimmer because I don't really
have a separate dining room at home, but a
-
lot of people do. So if we just wrote it
happens in Esszimmer, that would be pretty
-
ambiguous and no one would know which
dining room you're talking about.
-
And by adding this prefix we know we're
talking about just the dining room in
-
this, at thirty– 36C3. I think, I assume
there's no other assembly that has
-
something called the dining room. If it
does, then we would have to add something
-
else here to make it clear. And I've used
the same prefix for "happensIn" to make
-
clear which kind of "happens in" relation
we're talking about, that it's one
-
specific to Congress events. And then you
could ask this to a query service which
-
has this example graph in it, and you
might get the response that it's these two
-
talks. And at the end, you have this
period here because if you read the whole
-
thing, it's kind of like a sentence again.
Because the talk happens in Esszimmer. And
-
if you have two sentences, then you have
two periods. So the talk happens in some
-
room. And this room is part of the
Wikipaka-WG. And because we've used the
-
same variable name here and down here,
this has to be the same room. And it
-
couldn't just be two different things. So
if we use two different variable names
-
here, room and something else, then we
would just get all the combinations of
-
talks happening somewhere and rooms being
part of Wikipaka-WG without them being
-
connected in any way, but because they use the
same variable name they have to be
-
connected like this. And then you would
get these results we've seen earlier. What
-
you can also do is leave out the room. So
when I translate this into English, I
-
could say, the talk happens in the room
and the room is part of Wikipaka-WG. But I
-
could also say the talk happens in some
room, which is part of the Wikipaka-WG,
-
as kind of a– I don't know what that's
called in English kind of a relative
-
sentence sub-something-clause where we
don't really talk about the room in itself
-
just as a part of this larger sentence.
And you can write that in SPARQL as well.
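As a rough sketch of the two equivalent spellings (long form in the comments, bracket form as the actual pattern), again assuming a placeholder `c3:` prefix and namespace that are not from the original slides:

```sparql
# Assumption: placeholder namespace, invented for this sketch.
PREFIX c3: <https://example.org/36c3/>

SELECT ?talk WHERE {
  # Long form, with a named room variable shared by both patterns:
  #   ?talk c3:happensIn ?room .
  #   ?room c3:isPartOf c3:WikipakaWG .
  #
  # Short form: the square brackets stand for "some room" without naming it,
  # so there is no ?room variable left to select.
  ?talk c3:happensIn [ c3:isPartOf c3:WikipakaWG ] .
}
```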
-
And then it looks like this. And these
square brackets kind of describe what the
-
room looks like without giving it names.
So in this case, you can only select the
-
talk up here and we don't have a room
variable. But if you don't care about what
-
the room is, then that can be very useful.
I've also changed something else here.
-
I've replaced the 36C3 in "isPartOf" with
schema, which is another prefix and schema
-
is kind of this collection of useful
prefixes and other nodes that you can
-
reuse, for example, if you're describing
things you have on your website, you might
-
say you have an article with a
schema:title and a schema:publicationDate.
-
So this was mainly introduced by Google
and some other search engines. But we can
-
use the same vocabulary to talk about our
talks because "isPartOf" is one of these
-
standard terms we can use for that. And
what else do I have. OK, the next thing I
-
have is actual queries. So I think I'm
just going to– I'm almost going to switch
-
to Wikidata, so I should talk a bit about
Wikidata. So all these examples here were
-
just on some example graph, which I made
up here and threw on a slide with a lot of
-
probably overengineered TikZ LaTeX magic,
which I shouldn't have wasted that much
-
time on. But it looks nice. And… but if
we want to write real queries, we could
-
load this thing into a query service, but
it wouldn't be that interesting because
-
it's kind of small. But there are a lot of
real data graphs out there that you can
-
query with this query language, SPARQL.
And one of the coolest ones, at least in
-
my opinion, is called Wikidata or
Wikidata. There's some kind of discussion
-
about how it's pronounced. And it's kind
of a free database of anything that's
-
relevant. And it's part of the same family
of projects as Wikipedia and Wikimedia
-
Commons and other things. And it's also
maintained by the same community of
-
volunteers. And you can find all kinds of
really interesting and cool and funny data
-
there. So all of these example queries,
which I have here, we're just going to ask
-
to Wikidata. But first, I will just give
you one or two minutes to try to imagine
-
what this question would look like, either
in the graph format or in the SPARQL
-
format. Just try to figure out how you
would formulate: "which software is
-
written in bash" as a kind of, this kind
of graph query. And then we can see what
-
we can come up with. So. I didn't think
this through. I need some waiting loop
-
music now. Does anyone have a kind of idea
of what the graph looks like, because I'm
-
going to uncover it now and then you can
compare, if it looks the same way. So it
-
would look like, this at least using the
Wikidata terminology. So instead of "is
-
written in", the property is called
probing– programming language. And this
-
could also, this could be called "bash" or
"Bourne Again Shell" or "GNU bash" or
-
something. Doesn't really matter. And in
SPARQL, it looks like this, which is a lot
-
less readable, unfortunately, because one
of the things about Wikidata is that it's
-
multilingual. So instead of saying
"programming language", we say "P277". And
-
I think that's beautiful, haha. No, but
this is a property ID and you can look up
-
what this property is called in English or
in German or in any other language. So if
-
we look at Wikidata.org and look for – I
think I forgot to zoom in. Yeah. There we
-
go. I hope that's readable. Property P,
what was it? 277. That is the property
-
"programming language", at least in… okay,
you can't read that. There you go. At
-
least in English. In German it's
"Programmiersprache", and it has tons of
-
other languages too. So you can use
Wikidata in any language you want, which
-
is very nice. I could also show this page
in a different language and then all of
-
this would look different. The downside is
that the SPARQL query is not quite as
-
readable because you have to use all these
numeric identifiers, but you don't have to
-
memorize them at least. So let's… oops,
try to write this query. SELECT * WHERE
-
and we have the software, which is… which
has the programming language "bash", and
-
then we have to add these prefixes first,
so bash is going to be a Wikidata item. So
-
we abbreviate that with "wd" and that's a
prefix. And then if I press control space,
-
or I think on Macs command space works as
well, then it searches for bash and shows
-
me these suggestions and then I can just
select the right one. In this case, "GNU
-
bash", and then I have the ID, and if I
move the mouse over it again, then I can
-
see what this ID refers to. So it's not
quite as bad as– so on the PDF slides, you
-
just see the ID. But if you're actually on
the query.wikidata.org website… let me
-
make that a bit larger so you can all see
it. And if you want to try that out on
-
your laptop, I don't know, here it's a bit
[audio outage] And for the programming
-
language, we use a slightly different
prefix, which is "wdt", which stands for
-
"truthy". So we're only interested in
"truthy" information and not all the
-
information. And then we find this
property P277. And if we run this query
-
with control-enter or with this button
here, then we get a collection of other
-
IDs. Yeah. Does anyone want to get
software which is written in bash? This
-
one has a very low ID that is going to be…
Loading. There we go. Autopackage. Some
-
package management system that I haven't
even heard of, but it's written in bash.
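A sketch of the query as it may have looked at this point. The item ID for GNU Bash is written from memory here and should be confirmed with the autocompletion described above; on query.wikidata.org the `wd:` and `wdt:` prefixes are predefined, so no PREFIX lines are needed.

```sparql
SELECT * WHERE {
  # ?software has "programming language" (P277) "GNU Bash".
  # Assumption: Q189248 is the Bash item; verify with Ctrl+Space in the query editor.
  ?software wdt:P277 wd:Q189248 .
}
```

Without labels, the result is just a column of item IDs, which is why the list is hard to read.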
-
OK, so… wait. Er, so here you can see all
these statements and "programming
-
language: GNU Bash" is the one we looked
for. And unfortunately… so this is not a
-
very useful list. So one thing we can do
in the Wikidata Query Service, which is
-
pretty specific to Wikidata, is to add the
so-called label service, which is
-
basically magic that you don't need to
understand. But you write something like
-
"serv" or "service" and then with
control+space again for autocompletion.
-
And it suggests you this thing. And you
just keep that in your query at all times,
-
basically. And then you say, I would like
to have not just a software, but also the
-
software label. And then we get down here,
the label of the software. And I can also
-
add the software description. And then we
also see what, what is described. At least
-
if it has a description and then the query
results are already a lot more usable. And
-
I'm just going to rename this to "item"
and then we can edit this query however we
-
want and the variable name will always
kind of match. Because the next query
-
won't be about software anymore. So it'll
be confusing if you just still call it
-
"software". But, yeah, there is some
software here like Apache Yetus, Ruby
-
Version Manager, Wikidata missing
pictures, Pi-hole, all written in Bash.
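Put together, the query with the label service might look roughly like this. The `SERVICE` line is the snippet that autocompletion inserts on the Wikidata Query Service; the Bash item ID is again an assumption to double-check.

```sparql
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item wdt:P277 wd:Q189248 .   # programming language: GNU Bash (ID assumed, verify it)
  # The label service fills ?itemLabel and ?itemDescription for ?item automatically.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

Listing more language codes in that string, for example `"en,de"`, gives the service fallback languages to try in order.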
-
OK, I have several more example queries
here, which are kind of simple, should I
-
skip ahead or is it good if I do a few
more simple examples. Skip ahead? Is that
-
OK? OK, then let's. So who was born at sea
is not all that interesting. Just Place of
-
birth at sea. We have a special value for
that and it's not a very interesting list.
-
I think a few results, just five or so,
because most people are going to have
-
"place of birth: Atlantic Ocean" or
something. Which places are located on the
-
White Elster, just something for the
Leipzig people. And where does the
-
Neverending Story take place? This is
actually kind of cute. Let's do that.
-
Also, this is a bit interesting because in
this case, the variable is in the last
-
place and not the first one. So that… and
then we have the Neverending Story in the
-
beginning and narrative location. And then
the item is at the end instead of at the
-
beginning of a triple. And it works just
as well, except that a lot of these don't
-
have a label in English. So let's add
German as a fallback language. And then we
-
get all of these places which someone
added to Wikidata at some point. Let's see
-
if there's any useful information about
them. So they all have IDs in the same
-
range. So it looks like they were all
created at the same time because the IDs
-
are just increasing all the time. So the
Gelichterland is a place from the
-
Neverending Story, it's a fictional
fictional country. It has a capital, which
-
is this fictional place. It's located on
the… this terrain feature, it's present in
-
the Neverending Story. And it depicts
horror fiction. I'm not sure about that,
-
but let's leave it alone for now. OK,
yeah. And skip to a slightly more
-
interesting query, which is this one,
which popes had children. So what is the
-
graph going to look like for this? How
many, how many triples are we going to
-
have? So triple is node, arrow, and
another node, how many triples would you
-
need for "Pope has a child"? Let's do a
raising hands. Who thinks you need zero
-
triples, OK? Who thinks you need one
triple? Who thinks you need two triples?
-
That's more people. Does anyone think you
need three triples? No. OK, so mostly two,
-
but some people think one. So the one… the
people who think it might need one triple,
-
perhaps are thinking of something like the
Pope, which is the leader of the worldwide
-
Catholic Church, has a child, this child
or it's called item, but that's not going
-
to have any results. Or it could be the
other way around. And you could say that…
-
oh let's just comment this out. The item
has "father: the pope". And that doesn't
-
work. Because the items are not… the
children are not directly connected to the
-
item for the office of the pope, instead
it's going to be two levels. It's going to
-
say the child has a father, some person,
and then the person has the office pope or
-
has the position pope or is a pope or
something. So you need this level of
-
indirection. So in the graph that looks
either like this or it could be the other
-
way around. So either the child has a
father pope, which has "position held:
-
pope" or the pope has a child and also a
"position held", so that's kind of an
-
example of the redundancy I mentioned
earlier, we have the two directions
-
"child" and also "father"/"mother", and-
so you can ask your query in two ways, and
-
it doesn't really make that much of a
difference, assuming that the data is
-
complete. And I think someone occasionally
runs queries to check if any of these
-
circles are missing. So let's try one of
them, let's just stay with this one, so
-
the item does not have "pope" as father,
it has some pope, and then this pope has
-
"position held: pope". And then let's add
the "pope" label and… yeah, pope label is
-
enough, and then we get 24 results! So we
have a Duke of Parma which, who was the
-
son of Paul III. Paul III had three
children. Let's sort by this. Wow,
-
Alexander VI was very busy. And some of
them just have, oh oh oh, we have
-
duplicates, Giovanni Borgia and Giovanni
Borgia. Should I demonstrate Wikidata
-
editing now or do we just ignore this? So,
yeah, someone imported a lot of
-
information from this peerage database and
apparently we have some duplicate items
-
here, let's just leave those alone for
now. In fact, I think this and this also
-
looks suspiciously similar. Giovanni
Borgia, unless he had two children of that
-
name. I mean, he could have. So this… we
have a date of birth 1470s… 1498. No, that
-
might actually be different children. OK,
not a very creative father in the names.
-
Yeah. And wait, that's a pope who's a
child of another pope. Very interesting!
-
And another one. And another one. We have
three popes who are children of other
-
popes. Let's search for those! So we would
also need for that, that the item has
-
"position held: Pope", and I could copy
paste this, but just do this. So the item
-
should be… child should have a "father:
pope" and the item should have "position
-
held: Pope", and the pope should also have
"position held: pope". And in this case,
-
it would probably be less confusing to
call these "child" and "father", because
-
this is also a pope now, but… variable
names. One of the three hardest problems
-
in computer science, right? Yeah, we have
three children who are… three popes who
-
are children of other popes. Wow. I'm
actually going to save this query, popes
-
who were children of other popes. But
actually, we can future-proof this a
-
little bit, because right now we've only
said that the father should be a pope. But
-
in case there's ever a female pope, let's
just switch this around and say that the
-
pope should have the child… item and then
it's going to work, even if the pope
-
happens to be female and is a mother
instead of a father. There we go, same
-
three results. OK, and let's keep that,
and open a new tab for next queries. Yeah.
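A sketch of the final "popes who were children of other popes" query, written with one full triple per line. Variable names follow the talk; the exact text typed on stage is not in the transcript, so treat this as a reconstruction.

```sparql
SELECT ?item ?itemLabel ?popeLabel WHERE {
  ?pope wdt:P40 ?item .        # the pope has the child ?item (works for mothers too)
  ?pope wdt:P39 wd:Q19546 .    # position held: pope
  ?item wdt:P39 wd:Q19546 .    # the child also held the position "pope"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

Dropping the third pattern gives the earlier, broader question of which popes had children at all.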
-
Which Microsoft software runs on Linux.
OK. That's not that funny. So perhaps we
-
can just skip it… I don't know. That joke
kind of ran out of steam a while ago.
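For reference, a query along these lines would presumably answer the question. It is a sketch, not the slide's exact text; the semicolon means "same subject, next property".

```sparql
SELECT ?software ?softwareLabel WHERE {
  ?software wdt:P178 wd:Q2283 ;   # developer: Microsoft
            wdt:P306 wd:Q388 .    # operating system: Linux
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```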
-
Basically looks like this and it's like
Visual Studio Code and three other
-
programs, meh. What are some compositions
for organ and orchestra. This isn't funny
-
at all, but I just find it very nice
because it's just an awesome sound. And so
-
that would be… the composition has the
instrumentation "organ" and also
-
"orchestra", which we can write as… item,
item label… composition… instrumentation,
-
this one, orchestra. And also,
"composition… organ". And then, oops,
-
yeah, this should be "item"… and also I
forgot to add the label service. There we
-
go. And we have 12 results, which is nice
if you want to listen to any of those. We
-
could also check if any of them have an
audio file on Commons. Let's see. One, OK,
-
and I think we've heard this one already.
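The query typed here probably looked something like the sketch below, with the subject and property repeated on each line. The two item IDs are written from memory and are assumptions; confirm them with the editor's autocompletion before relying on the results.

```sparql
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P870 wd:Q42998 .   # instrumentation: orchestra (ID assumed, verify it)
  ?item wdt:P870 wd:Q1444 .    # instrumentation: organ (ID assumed, verify it)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
```

Because both patterns share the same subject and property, the two values can also be listed after a single property, separated by a comma: `?item wdt:P870 wd:Q42998, wd:Q1444 .`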
So, but… one thing that's kind of annoying
-
here, I should have mentioned this in the
last query, I think. So I had to repeat
-
the item and the property ID, which is a
bit annoying and makes the query difficult
-
to read. And what you can do is leave that
out and you can also do this in the
-
previous case. So let's actually go one
slide back. So here I didn't write twice
-
that it's the software which should have
the developer, and also the operating
-
system. I just wrote the software has
"developer: Microsoft" and also with a
-
semicolon at the end instead of a period,
it has "operating system: Linux". So if
-
you read this as English it's just one
sentence where you don't repeat the
-
subject twice. The software has
"developer: Microsoft" and "operating
-
system: Linux", instead of "software has
developer: Microsoft" and "software has
-
operating system: Linux". And if you… if
the property here is also the same thing,
-
then you can even leave that out and add a
comma at the end and just list the two
-
values and you don't even have to repeat
the instrumentation. So let's do that here
-
and abbreviate this query. And it has the
exact same 12 results, just slightly more
-
convenient to read and… to write at least,
hopefully also to read. I don't know. But
-
you don't use the comma that much. The
semicolon is pretty useful, like we could
-
have written this as, the pope has, er,
the child and also position held like
-
this. It means exactly the same, but you
can immediately see that both of these
-
refer to the pope because there's just a
bunch of blank space here. Yeah, so then
-
we have this one. This isn't funny at all,
but there are a lot of people who used to
-
be in the Nazi Party during World War 2
and then who later just went back into a
-
civil life and even received the
Bundesverdienstkreuz, the order of merit
-
of the Federal Republic of Germany. And
you can find those… in this case I've done
-
it with three triples, which is, the
person was a member of this political
-
party and received this award. And also
I've added that they're "instance of:
-
human", because we also have a lot of
fictional data on Wikidata. You already
-
saw that with the Neverending Story stuff
earlier. So there might also be a
-
fictional character who was a member of
this political party and who received the
-
award, and we're not really interested in
those. So we add "instance of: human", and
-
then we are certain that we only get real
results and not fictional results. And it
-
doesn't really cost us anything because
the Query Service can optimize that pretty
-
well. So let's write that… actually, let's
do that here. So the item should be
-
"instance of: human", which is Q5, because
it's a very common item, and "member of
-
political party". And you can see I can
search by the German abbreviation and find
-
this, even though it's not a label,
because there are search aliases. And also
-
"award received", the
Bundesverdienstkreuz, because I can't be
-
bothered to type in the whole English
name. There we go. And we find, I think…
-
how many results? Eleven results. Yeah.
And this actually isn't quite correct,
-
because in theory, you don't get this
order, this order has like 11 parts or
-
something. You can get the Grand Cross
with Distinction or you can get the Star
-
or whatever. I think it's listed somewhere
here. Yeah, you can get the Grand Cross
-
Special Class, you can get the Grand Cross
Special Issue, you can get the Grand Cross
-
First Class, blah blah blah. And so, in
theory, any of these people should have
-
one of these awards and not just "order of
merit". But I think when I checked, all of
-
them just had… all the results, just had
directly "order of merit". But actually,
-
no we can try to search for the correct
ones instead. So it would not be part of
-
this directly, it would be… "award
received" would be some award, such as
-
this one, and then this award is part of
the order of merit, so "award"… "part of"…
-
Let's see if that finds any results. Oh.
Oh. Oh, dear. Yeah, that, that… that's a
-
lot of results. "Herbert von Karajan".
That's depressing. OK, yeah. OK, so
-
I think I… when I tried this out and
didn't find any results, I just did
-
something wrong because, this way we find
a lot more results. And if we… so we don't
-
actually select the award here, because we
don't care what kind of award they got. So
-
we could also use this abbreviation again,
like this. So we just say they got some
-
award, which is part of the order of
merit. And in this case, we could even
-
abbreviate that further and say, we put a
slash here. And then, that kind of
-
describes a path that you have to take
from this item to this item and you have
-
to first get to some award received. And
then that has to be part of something
-
else. And you can add as many elements
here as you want. And then we get the
-
exact same 802 results… and… lots of well-
known names here. And if we want to find
-
the original 11 ones that directly had the
order of merit as the award received, we
-
can add a question mark here, which is
just like in a regular expression, it says
-
this part is optional. They can have
directly received this award or they can
-
have received some award, which is part of
the order of merit. And then we should get
-
813. Yeah, 813 results, so 802, plus the
11 from earlier. And… I'm starting this
-
with "instance of: human", which… and the
Query Service is going to re-order this
-
because searching for all the humans and
then filtering for the ones who are in
-
this political party and so on wouldn't be
efficient. So I don't have to worry about
-
that. I could write it in this order, or I
could shuffle it around. Doesn't make any
-
difference. The Query Service already
knows in which order to do these things.
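The slash and question-mark abbreviations described above are SPARQL property paths. As a sketch, here they are applied to the made-up 36C3 example graph from the beginning of the talk rather than to Wikidata; the `c3:` prefix and its namespace IRI are placeholders invented for this illustration. In the Wikidata query, the corresponding path would be `wdt:P166/wdt:P361?` (award received, then optionally part of).

```sparql
# Assumption: placeholder namespace, not a real vocabulary.
PREFIX c3: <https://example.org/36c3/>

SELECT ?talk WHERE {
  # "/" chains two steps: ?talk happensIn something, which isPartOf the Wikipaka-WG.
  # The trailing "?" makes the isPartOf step optional (zero or one occurrence),
  # so a talk that happens directly in the Wikipaka-WG would match as well.
  ?talk c3:happensIn/c3:isPartOf? c3:WikipakaWG .
}
```

Without the trailing `?`, only the two-step matches remain, which mirrors the difference between the 802 and 813 results in the talk.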
-
So you don't have to worry about that. You
can just start with "is a human" and then
-
add everything else. I think I have one
more complicated query here. Yeah, so
-
that's one of the examples I mentioned
earlier, the largest cities by population
-
with a female mayor. So the graph for that
is, I think the largest one I prepared for
-
the slides, except the one in the
beginning. And it looks like this. We
-
should have a city which is a city,
"instance of: city", and it has a certain
-
population, and it has… so for the mayor,
we use the same property as for head of
-
government. And if you don't know that,
you could look at some city like Berlin
-
and maybe you know what the mayor of
Berlin is called… what was it? Something
-
"Müller", I think. Yeah. And then you can
see, aha, the property for the mayor is
-
"head of government". Or you could also
search for, the city should have a mayor,
-
and then you'll still find "head of
government", the right property. And that
-
mayor should be a human and she should
have the gender "female". Oops. There's a
-
question mark there for no reason at all.
That's not a variable. That should be the
-
fixed value. Sorry. So let's put that
there. We have a city which is "instance
-
of: city", and it also has a population
which we're going to use later and it also
-
has a head of government. No, that's
wrong. Not the "office held by head of
-
government", the "head of government"
itself, which we call the mayor and then
-
the mayor is "instance of: human" and
gender should be female… come on… female.
-
And let's select the city, cityLabel,
mayorLabel and also the population. And
-
then we find some 83 results. That's not
yet the largest cities with a female
-
mayor. That's just all of them. And in
Wikidata we know about 83, apparently. And
-
if your local hometown has a female mayor,
just go ahead and add it to Wikidata and
-
it's probably relevant. It's not– So the
relevance criteria are not as strict as on
-
Wikipedia fortunately. But if we want just
the most populous ones, we can go a bit
-
back into SQL land and say we want to
ORDER BY the population and in SQL you
-
would write DESC afterwards and in SPARQL
it's different. You write
-
DESC(?population). Erm, I think it's nicer
that way. But perhaps it would have been
-
nicer to just stick with the SQL syntax. I
don't know. And we want to limit this to
-
just the ten most populous cities, for
example. And here we go. Tokyo is
-
currently the biggest one, then Hong Kong,
Baghdad, Surabaya, Rome. Yeah. And, oh.
-
This doesn't make that much sense, Caracas
has two mayors. Anyone… yeah, exactly. So
-
we're only supposed to get the current
mayor. Head of government… yeah. Does
-
anyone know which one is the current one?
Or we could just check Wikipedia… Caracas,
-
which hopefully doesn't get its
information from Wikidata yet. So it's not
-
circular. And the mayor is… Carolina,
Carolina Cestari… Cestari, I don't know.
-
[Laughter]
-
OK, so let's add a new one. Ah…? Doesn't
have an item yet, is that… is that the
-
mayor, or is chief of government something
else? Doesn't occur anywhere else on the
-
page, of course. Local government… mayor…
no. OK, so let's just… I don't know,
-
doesn't she have a Wikipedia article? No.
Just appears in some lists and then she
-
doesn't have a Wikidata item yet? No.
Then… I don't know. We'll do some live
-
Wikidata editing. It wasn't part of this
talk, but let's just do it. Carolina
-
Cestari… what country is that? Venezuela.
Venezuelan politician, and that sounds
-
like a female name, so I'm just going to
guess and check that after the talk. So
-
she's definitely a human. And gender is
female and that is going to be enough for
-
our query. Do this search again. There we
go. And set this to preferred rank. So
-
that's how the Query Service knows that
this is the current value and it should
-
only return this one. And ideally, one of
the head of government values should have
-
this preferred rank to mark it as the
correct current value. And then all the
-
other ones are additional data that you
can use if you want. But it's not the main
-
value and we are not going to get it in a
simple query. And then there's some error
-
because Caracas isn't some kind of
political territorial entity and it should
-
have a start time. I don't care right now.
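The rank behaviour described above can be modelled in a few lines. This is only a sketch under the usual Wikidata rank rules: a simple lookup sees the preferred statements if there are any, otherwise the normal ones, and never the deprecated ones. The statement list is hypothetical; only the name Carolina Cestari comes from the talk.

```python
# Hypothetical "head of government" statements for Caracas.
# Values and ranks are illustrative, not copied from Wikidata.
STATEMENTS = [
    {"value": "Carolina Cestari", "rank": "preferred"},
    {"value": "earlier office holder", "rank": "normal"},
    {"value": "mistaken entry", "rank": "deprecated"},
]


def truthy_values(statements):
    """Return the values a simple lookup would see (best non-deprecated rank)."""
    preferred = [s["value"] for s in statements if s["rank"] == "preferred"]
    if preferred:
        return preferred
    # No preferred statement: every normal-rank statement counts.
    return [s["value"] for s in statements if s["rank"] == "normal"]


if __name__ == "__main__":
    print(truthy_values(STATEMENTS))  # ['Carolina Cestari']
```

Before the edit in the talk both mayors had normal rank, which is why the query returned Caracas twice.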
OK, so we run this query again and
-
hopefully get just one result for Caracas
this time. No. Uhm, we have to wait a bit
-
until the Query Service is updated.
Because it's kind of asynchronous. It just
-
keeps watching for changes and eventually
it will get the new data, but… okay. It
-
might take a bit longer. Anyways. That's
how that query works. Does that make kind
-
of sense? OK, great. Yeah, I think this is
almost exactly what I wrote here. Yeah.
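Written out, the query built up so far might look roughly like the sketch below. It is a reconstruction from the spoken description, not a copy of the slide; the item and property IDs in the comments are the ones Wikidata uses for these concepts. The Python wrapper and the helper name `build_request_url` are assumptions added only to show how such a query could be sent to the public endpoint.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Reconstructed from the talk: cities with a female head of government,
# ten most populous first.
QUERY = """
SELECT ?city ?cityLabel ?mayorLabel ?population WHERE {
  ?city wdt:P31 wd:Q515;          # instance of: city
        wdt:P1082 ?population;    # population
        wdt:P6 ?mayor.            # head of government
  ?mayor wdt:P31 wd:Q5;           # instance of: human
         wdt:P21 wd:Q6581072.     # sex or gender: female
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 10
"""


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    return ENDPOINT + "?" + urlencode({"query": query.strip(), "format": "json"})


if __name__ == "__main__":
    print(build_request_url(QUERY))
```

Note that `DESC(?population)` wraps the variable, unlike SQL, where `DESC` follows the column name.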
-
Except with some labels and the label
service. Yeah. There is one problem here,
-
which is, for example, I happen to know
that Mexico City is a very large city with
-
a population of… population: almost 9
million. So it should be right after Tokyo
-
in front of Hong Kong. And the head of
government is a Claudia Sheinbaum or
-
something, which sounds like a woman. So
we should get this result in the query.
-
The reason we don't is that Mexico City is
an instance of "big city" and we have
-
searched for "instance of: city". And
there's some debate about does this class
-
even make sense at all? I think this is
actually the German classification of, a
-
big city is one with 100,000 inhabitants,
and in other languages or countries, a big
-
city might be something else, but for now
that… the data is what it is. Fortunately,
-
what we have here is the information, a
"big city" is a subclass of a city/town,
-
which is a subclass of "locality", which
is a subclass of… Wait. We should arrive
-
at city at some point, but I think we've
already gone past that. It's also an
-
instance of capital. Let's go down that
instead. A capital is a subclass of city,
-
there we go. So if we can tell the Query
Service to follow these subclass
-
connections, then we should find these
cities. And one way to do that… to make it
-
work for Mexico City would be to say, it
has to be "instance of", some, with the
-
path again, "subclass of: city" and then
we would find Mexico City, but we would
-
not find all the… oh, we would still find
Tokyo because it's still a capital, I
-
guess. But we've missed a lot of other
cities, I think which we used to have…
-
yeah. Rome, for example, is gone. Because
it's… that's just an instance of city
-
directly. And we've now made the subclass
mandatory. What we should do is make it
-
optional, or even better, we would– we
should say there can be any number of this
-
element. So there… it can be an instance
of city or it can be an instance of a
-
subclass of city, it can be an instance of
a subclass of a subclass of city. You can
-
follow any number of elements, that what
this… that's what this star means, just
-
like in a regular expression. And then we
probably have to say we only want the
-
distinct ones because there are like five
different ways to go through the subclass
-
tree until you've found "city". And we're
not interested in the different ways. But
-
now we should get Tokyo and Mexico City.
And Rome is also here and Caracas is
-
completely gone because we found enough
other cities which we were missing
-
earlier. So you kind of have to watch out
and sometimes use elements like this…
-
"subclass of"-tree is pretty common, or
with a, something… order of merit, we had
-
to use this "part of". You have to watch
out if the results are plausible, or
-
ideally, you know some item that should be
in the results, and then you check, is it
-
there? Why is it not there? And
investigate like that. But that's a fixed
-
version of the query. And… yeah, if we
were not interested in the mayor, we could
-
do the same trick again. But, yeah. It
doesn't make that much of a difference.
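The three variants tried above (plain "instance of", one mandatory "subclass of" step, and the starred path, which in SPARQL is written `wdt:P31/wdt:P279*`) can be compared on a toy graph. The class tree and city typings below are hypothetical and heavily simplified; they only loosely follow what was shown on screen.

```python
# Hypothetical, simplified data loosely following the talk; not real Wikidata content.
SUBCLASS_OF = {
    "capital": ["city"],
    "big city": ["city/town"],
    "city/town": ["locality"],
}
INSTANCE_OF = {
    "Rome": ["city"],
    "Tokyo": ["city", "capital"],
    "Mexico City": ["big city", "capital"],
}


def superclasses(cls):
    """Return cls plus everything reachable via zero or more subclass-of steps."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def direct(target):
    """Like 'instance of' alone: the item must be typed with target itself."""
    return sorted(i for i, cs in INSTANCE_OF.items() if target in cs)


def one_step(target):
    """Like 'instance of / subclass of': exactly one subclass step is mandatory."""
    return sorted(i for i, cs in INSTANCE_OF.items()
                  if any(target in SUBCLASS_OF.get(c, []) for c in cs))


def any_steps(target):
    """Like 'instance of / subclass of*': zero or more subclass steps."""
    return sorted(i for i, cs in INSTANCE_OF.items()
                  if any(target in superclasses(c) for c in cs))


if __name__ == "__main__":
    print(direct("city"))     # ['Rome', 'Tokyo']
    print(one_step("city"))   # ['Mexico City', 'Tokyo']
    print(any_steps("city"))  # ['Mexico City', 'Rome', 'Tokyo']
```

The `any(...)` check collapses multiple routes to the same class into a single match, which is the job DISTINCT does in the real query.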
-
And I think… yeah, that was almost the
only difference. Yeah, except that I
-
removed the population so we can order by
a variable that you don't select in the
-
end if you want. And I think I am out of
slides. So, yeah, if you want to see more
-
queries, you can look at these Twitter or
social media accounts. There's a huge list
-
of example queries on Wikidata, which is
so big that it's getting too big for a
-
wiki page, and people had to move some
queries out there and it's kind of just
-
grown since 2015 or something. And there's
a lot of garbage there, but also a lot of
-
useful queries if you want to look at
that. And I had two more queries in the
-
talk description which we haven't talked
about yet, and I think we have the time. I
-
can just try to open these. "Which films
starred more than one future head of
-
government?" Does that work? It doesn't.
Can I copy the URL here? Yeah, copy link
-
address. So that's a kind of longer query,
which is why it didn't really fit on one
-
slide. But the important film is you have…
er, the important part is you have some
-
film… instance of, or subclass of film, it
has a publication date and a cast member,
-
which is the head of government. And the
head of government held some position,
-
some head of government, er, some subclass
of head of government. And that should be
-
after the film was published. And then you
get a bunch of results. I think this takes
-
like 11 seconds or something. And you get
like films with Schwarzenegger and one
-
other actor who became US governor. I
don't remember the name. And you also get
-
a lot of… or several films from World War
II with future French heads of government,
-
which is really cool. So, like a film that
was shot about the liberation of Paris,
-
where it's… it's kind of a stretch to call
them cast members, but they're definitely
-
in the film. And if we get the result,
then I can tell you what the film is
-
called. Yeah, it might be busy right now,
so you get up to 60 seconds in the Query
-
Service and then in the end your query is
killed if it takes longer than that. So
-
sometimes it can be a bit of a struggle to
make the query work within 60 seconds.
-
There we go, 50 seconds. That was close.
So there's yeah, there's a "La Libération
-
de Paris" with Charles de Gaulle, who was
president of the Council and president of
-
the provisional government, and also
Georges Bidault, I think, who was prime
-
minister and president of the Council, and
other stuff. We have several Indian films
-
with people who went on to become chief
ministers. And then down here there's some
-
Canadian politicians, apparently. And then
here's Arnold Schwarzenegger and Jesse
-
Ventura, who both became governors and
also starred in several films. And the
-
other thing was, we have a lot of data
about the British government because a lot
-
of volunteers have just been slaving away
at that data and adding and adding more
-
information. I think they've… they have
all their parliaments, complete with party
-
affiliations and everything for at least
the last 100 years and some partial data
-
for a lot more than that, because they
have a very long parliamentary history.
-
And then you can do queries like "how many
people named John are there in
-
parliament", and "how many women with any
name". And you can see when the women were
-
finally more than just the men who are
named "John". And it's kind of an amusing
-
graph. Or not so amusing. Takes a while as
well. I hope it doesn't take 50 seconds,
-
but it looks like the Query Service might
be busy at the moment. But I think it was
-
something like in 1991 or so is the
crossover point. Oh yeah. And I should
-
mention anyway, so everything we saw right
now was just a lot of tables. But you can
-
also show results in different ways, such
as a line chart. There we go. So in 1992,
-
this was the first parliament which had
more women than Johns. And then the Johns
-
have slightly declined and the women have
gone up to 220. How many people are in the
-
House of Commons in total? Does anyone
know? No. So I don't know what percentage
-
this is. Uh, but, this was… yeah, this
latest election from 12 December is already
-
in there. Yeah. [indistinguishable] What?
So the query looks like this. So this one
-
is broken into several parts. We first
find all the members of parliament, so
-
they should be human, again, no fictional
people, and then they should have some
-
"position held", which is a subclass of
"member of parliament" in the House of
-
Commons. And then there should also be,
um, a parliamentary term on that, so that
-
we know which parliament it is and when it
starts. And then down here, we import all
-
those MPs and filter for just the ones
with the "given name: John". And then we
-
filter for just the ones with "gender:
female". And there's an optional "subclass
-
of" in here, because currently the data
model is that there is a separate item for
-
transgender female and someone can have
"gender: transfemale– transgender female",
-
which is a subclass of "female". And there
is a discussion right now to get rid of
-
that and have a separate property for that
instead. And then all the trans people
-
just have "gender:", their right gender,
and you don't have to mess with subclass.
-
But right now we still… well, we need it
in theory, I don't think there are any MPs
-
in practice. But, you know, you know, you
can just keep it in there. And then we
-
import the results and get them here
either as a line chart or as a table, if
-
you want to sort it by the time… yeah, the
data starts in 1919, apparently. So we
-
have exactly a hundred years of history
there. We can also show it as a bar chart,
-
if that makes more sense. No it doesn't.
That makes no sense. Line chart is the
-
right one. Oh, right, but if you show the
line chart again, then it breaks for some
-
reason, there's some bug there. So let's
just show it again. There we go. That's
-
the right… chart. Yeah, and I guess… oh
wow, it's already… 50 minutes, so I guess
-
this is the point where we start moving to
the live querying part, and I was told I
-
should make at least a short break for the
stream, so the Angels know where to cut
-
between. But we could also take a
10-minute break and then start the next
-
talk on time. Does that sound OK? Or is 10
minutes too long? Uhm, if you're going to
-
stay here, which would be very nice, then
please think of some example queries that
-
you think we could write, and then I can
try to write them, because otherwise I'm
-
not going to have much to do. But yeah,
let's do a 10-minute break and see you
-
then. Thank you so far.
-
[Applause]
-
Postroll Music
-
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!