[36C3 preroll music]
Okay so now to our speaker, he’s Lucas.
He's a SPARQL magician, I'm told, and he
will introduce you to his favorite
querying language, SPARQL, and give you a
little introduction and in the second part
he will do some live coding which is
always really interesting and funny and
you can give him some things that he's
querying for you and I'm sure we'll have
lots of fun and interesting learning stuff
here so give a warm round of applause to
Lucas.
[Applause]
[inaudible]
Is this better? Aha! It's a bit too loud
so I'll just talk a bit until they have
figured it out. Yeah so this is going to
be kind of two parts but not really that
separate but in the second part I'm
basically going to write the queries that
you suggest so if you – if you see what
I'm going to do here and then think oh I
have a great idea for something we could
perhaps query then just remember that and
we'll get back to that hopefully because
otherwise the second half is going to be
really short if I don't get any ideas from
you. But yeah, so this is about querying
linked data which allows you to do all
kinds of crazy things and answer all kinds
of crazy questions such as I think I had
on the slides something like "what are the
largest cities with a female mayor?" and
if you wanted to find that out
traditionally you could like go through
Wikipedia and try to find all the largest
cities and see which ones have a female
mayor and which ones don't or perhaps
there's a category with all the cities
with a female mayor but then you have to
sort them by population and it's a whole
mess and with linked data you can find
that out much more easily and also all
kinds of other things but let's start with
some simple fantasy linked data so this is
a tiny snippet of linked data, some data
graph. It's just composed of a load of
nodes which are these ovals and rectangles
here and they're connected with arrows and
each of these forms kind of a triple
consisting of the start node and then the
arrow and then the end node and that's how
we represent all the information you have
in there, in this linked database. So for
example we can read this as this talk
right now happens in the Esszimmer or the
dining room which is the name of this
stage here and it's going to be followed
by the live querying session which also
happens in Esszimmer and the live querying
session in turn follows this talk again
and the Esszimmer, the dining room, is
next to the kitchen, the Küche, and the
kitchen is next to the dining room again
and both of them are part of the
Wikipaka-WG which is part of 36C3 and the
talk happens right now and at the same
time there's also some talk about how
state elections are climate elections or
something in the Chaos West stage, starts
at the same time, Chaos West stage is part
of the Chaos West Assembly which is part
of 36C3 as well and so this graph has a
few important properties, for example
there's some redundant connections here,
you could see, you could say, if this talk
is followed by the live querying then you
don't really need to know that live
querying follows this talk, it's kind of
redundant information. You already know
it, but it doesn't hurt to have it, and it
often makes your life easier if you have a
little bit of redundancy in your graph and
then if you find that one half of this
connection is missing for example you can
still investigate what's going on and also
in here we have kind of bi-directional
connection so Esszimmer is next to Küche
which is next to Esszimmer but this is two
separate arrows and could also be that
only one of them is there so you don't
have arrows which go in both
directions at once in this data model, it
has to be, if you want something like this
you have to have two separate arrows
because that keeps the data model very
simple. You just have subject predicate
object and that's everything you have, and
then to query this graph, you kind of
select a tiny part of it and then you
remove some part that you don't know about
for example we know that this talk is
followed by live querying and if we remove
the live querying part, then we can ask
something like... Okay, I did it the other
way around. Never mind, this way. This
talk is followed by which talk? and then
you have a question, because you've
left out this part and then if you ask
this question to a query service it can,
kind of, you can think of this like a,
err, damn, I only know the German word for
this one, a, Schablone, template, so you
put this over the graph and this has to
match the existing node this has to match
the existing arrow and then you see which
nodes can you put in here and in this case
that's only the live querying or the other
way around which talk follows this one so
you can have the beginning of the triple
can be a variable like this one or the end
of the triple can be a variable like in
this case and you can also have more
complicated patterns like, no there's not
a more complicated pattern, this is the
same pattern. You have the question which
talk happens in Esszimmer and you have two
answers: this talk happens in Esszimmer
and live querying happens in Esszimmer.
But you can also combine more graph nodes
like this, for example, which talk happens
in some room, which is part of the
Wikipaka-WG. So we have one free part here
and one free part here. But we know that
these two have to be connected with,
"happens in", and then this has to be
connected with "is part of" to the
Wikipaka-WG. And you can kind of
construct– if you can phrase your question
as a kind of graph like this, where some
parts are predetermined that you already
know about and the other parts that you
want to find. Those are these kind of
variables which are here indicated with
just dashed lines. Then you can ask that
question to the graph and find the
matching results. In this case, you have
these two matches, this talk happens in
Esszimmer as part of Wikipaka-WG and live
querying happens in Esszimmer, is part of
Wikipaka-WG. And then, if you–
if we had more information in this graph
here, we might also have other rooms. For
example, there's this library over there
which also is going to have some talks. If
we had the whole schedule in here, we
would find those as well. And we could
also adapt the query so that we don't even
make the Wikipaka-WG part fixed. We could
ask for anything that happens in 36C3. So
that would be some variable, happens in
some room, is part of some assembly, is
part of 36C3. And then we would find this
thing as well because it fits the same
kind of pattern: happens in, is part of,
is part of 36C3. Does that make sense?
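A minimal sketch of the template idea described above, in Python rather than a query language. The triples are a subset of the example graph from the slide; the function name `match`, the name "climate talk" for the Chaos West talk, and the convention that variables are strings starting with `?` are assumptions made for this illustration, not part of any real query service.

```python
# Toy version of the example graph: a set of (subject, predicate, object) triples.
GRAPH = {
    ("this talk", "happens in", "Esszimmer"),
    ("live querying", "happens in", "Esszimmer"),
    ("this talk", "followed by", "live querying"),
    ("live querying", "follows", "this talk"),
    ("Esszimmer", "next to", "Küche"),
    ("Küche", "next to", "Esszimmer"),
    ("Esszimmer", "part of", "Wikipaka-WG"),
    ("Küche", "part of", "Wikipaka-WG"),
    ("Wikipaka-WG", "part of", "36C3"),
    ("climate talk", "happens in", "Chaos West stage"),  # placeholder name
    ("Chaos West stage", "part of", "Chaos West Assembly"),
    ("Chaos West Assembly", "part of", "36C3"),
}


def is_var(term):
    return term.startswith("?")


def match(patterns, graph=GRAPH, binding=None):
    """Yield every variable binding that makes all patterns match the graph."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # A variable must keep the value it was given earlier.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, new)


# "Which talk happens in some room which is part of the Wikipaka-WG?"
in_wg = list(match([("?talk", "happens in", "?room"),
                    ("?room", "part of", "Wikipaka-WG")]))

# "Anything that happens in some room, part of some assembly, part of 36C3."
at_congress = list(match([("?talk", "happens in", "?room"),
                          ("?room", "part of", "?assembly"),
                          ("?assembly", "part of", "36C3")]))
```

Because `?room` appears in both patterns of the first question, it has to take the same value in both, which is what ties the two arrows together.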
Hopefully. I'm seeing a lot of nodding
heads. OK, that's great. So then we can
try to move ahead to actually ask some of
these questions to a real query system.
Because in reality, you're not going to
actually draw these graphs, but you have
some kind of language where you phrase
them instead, which looks a bit like this.
So you have the part: SELECT anything
WHERE, that is kind of like SQL, and then
everything else is not like SQL. Forget
SQL! I hear this is easier to understand
if you don't know SQL. I didn't know SQL
that much when I learned SPARQL, and I
think it helped me, apparently. But what
you write down here is this kind
of description of the graph, and these
dashed parts, which are the variables
which you don't yet know. Those are marked
with a question mark because that's kind
of what you use to ask a question. In this
case, I've just called it "?talk", but it
could be any name, basically. And then
instead of "happens in" as two words, I've
just written "happensIn" as one and then
with the prefix "36C3" and it happens in
the 36C3 Esszimmer because I don't really
have a separate dining room at home, but a
lot of people do. So if we just wrote it
happens in Esszimmer, that would be pretty
ambiguous and no one would know which
dining room you're talking about.
And by adding this prefix we know we're
talking about just the dining room in
this, at thirty– 36C3. I think, I assume
there's no other assembly that has
something called the dining room. If it
does, then we would have to add something
else here to make it clear. And I've used
the same prefix for "happensIn" to make
clear which kind of "happens in" relation
we're talking about, that it's one
specific to Congress events. And then you
could ask this to a query service which
has this example graph in it, and you
might get the response that it's these two
talks. And at the end, you have this
period here because if you read the whole
thing, it's kind of like a sentence again.
Because the talk happens in Esszimmer. And
if you have two sentences, then you have
two periods. So the talk happens in some
room. And this room is part of the
Wikipaka-WG. And because we've used the
same variable name here and down here,
this has to be the same room. And it
couldn't just be two different things. So
if we use two different variable names
here, room and something else, then we
would just get all the combinations of
talks happening somewhere and rooms being
part of Wikipaka-WG without them being
connected in any way, but because they use the
same variable name they have to be
connected like this. And then you would
get these results we've seen earlier. What
you can also do is leave out the room. So
when I translate this into English, I
could say, the talk happens in the room
and the room is part of Wikipaka-WG. But I
could also say the talk happens in some
room, which is part of the Wikipaka-WG,
as kind of a– I don't know what that's
called in English kind of a relative
sentence sub-something-clause where we
don't really talk about the room in itself
just as a part of this larger sentence.
And you can write that in SPARQL as well.
And then it looks like this. And these
square brackets kind of describe what the
room looks like without giving it names.
So in this case, you can only select the
talk up here and we don't have a room
variable. But if you don't care about what
the room is, then that can be very useful.
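The slides are not part of this transcript, so the three query forms described so far are reconstructed below as a sketch. The prefix `c3:` and its namespace IRI are assumptions standing in for the "36C3" prefix mentioned in the talk, and `variables` is a hypothetical helper that only serves to show which names each form exposes.

```python
import re

# Assumed namespace; the real IRI behind the talk's "36C3" prefix is not given.
PREFIX = "PREFIX c3: <https://example.org/36c3/>\n"

# One triple: which talks happen in the Esszimmer?
ONE_TRIPLE = PREFIX + """
SELECT ?talk WHERE {
  ?talk c3:happensIn c3:Esszimmer .
}
"""

# Two triples joined through the shared variable ?room.
TWO_TRIPLES = PREFIX + """
SELECT ?talk ?room WHERE {
  ?talk c3:happensIn ?room .
  ?room c3:isPartOf c3:WikipakaWG .
}
"""

# Same question, but the room is described in square brackets and gets no name.
BRACKETS = PREFIX + """
SELECT ?talk WHERE {
  ?talk c3:happensIn [ c3:isPartOf c3:WikipakaWG ] .
}
"""


def variables(query):
    """Return the set of variable names (without '?') used in a query string."""
    return set(re.findall(r"\?(\w+)", query))
```

Calling `variables(BRACKETS)` gives only `talk`, which matches the remark that the bracket form leaves no room variable to select.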
I've also changed something else here.
I've replaced the 36C3 in "isPartOf" with
schema, which is another prefix and schema
is kind of this collection of useful
prefixes and other nodes that you can
reuse, for example, if you're describing
things you have on your website, you might
say you have an article with a
schema:title and a schema:publicationDate.
So this was mainly introduced by Google
and some other search engines. But we can
use the same vocabulary to talk about our
talks because "isPartOf" is one of these
standard terms we can use for that. And
what else do I have. OK, the next thing I
have is actual queries. So I think I'm
just going to– I'm almost going to switch
to Wikidata, so I should talk a bit about
Wikidata. So all these examples here were
just on some example graph, which I made
up here and threw on a slide with a lot of
probably overengineered tikz LaTeX magic,
which I shouldn't have wasted that much
time on. But it looks nice. And… but if
we want to write real queries, we could
load this thing into a query service, but
it wouldn't be that interesting because
it's kind of small. But there are a lot of
real data graphs out there that you can
query with this query language, SPARQL.
And one of the coolest ones, at least in
my opinion, is called Wikidata or
Wikidata. There's some kind of discussion
about how it's pronounced. And it's kind
of a free database of anything that's
relevant. And it's part of the same family
of projects as Wikipedia and Wikimedia
Commons and other things. And it's also
maintained by the same community of
volunteers. And you can find all kinds of
really interesting and cool and funny data
there. So all of these example queries,
which I have here, we're just going to ask
to Wikidata. But first, I will just give
you one or two minutes to try to imagine
what this question would look like, either
in the graph format or in the SPARQL
format. Just try to figure out how you
would formulate: "which software is
written in bash" as a kind of, this kind
of graph query. And then we can see what
we can come up with. So. I didn't think
this through. I need some waiting loop
music now. Does anyone have a kind of idea
of what the graph looks like, because I'm
going to uncover it now and then you can
compare, if it looks the same way. So it
would look like, this at least using the
Wikidata terminology. So instead of "is
written in", the property is called
programming language. And this
could also, this could be called "bash" or
"Bourne Again Shell" or "GNU bash" or
something. Doesn't really matter. And in
SPARQL, it looks like this, which is a lot
less readable, unfortunately, because one
of the things about Wikidata is that it's
multilingual. So instead of saying
"programming language", we say "P277". And
I think that's beautiful, haha. No, but
this is a property ID and you can look up
what this property is called in English or
in German or in any other language. So if
we look at Wikidata.org and look for – I
think I forgot to zoom in. Yeah. There we
go. I hope that's readable. Property P,
what was it? 277. That is the property
"programming language", at least in… okay,
you can't read that. There you go. At
least in English. In German it's
"Programmiersprache", and it has tons of
other languages too. So you can use
Wikidata in any language you want, which
is very nice. I could also show this page
in a different language and then all of
this would look different. The downside is
that the SPARQL query is not quite as
readable because you have to use all these
numeric identifiers, but you don't have to
memorize them at least. So let's… oops,
try to write this query. SELECT * WHERE
and we have the software, which is… which
has the programming language "bash", and
then we have to add these prefixes first,
so bash is going to be a Wikidata item. So
we abbreviate that with "wd" and that's a
prefix. And then if I press control space,
or I think on Macs command space works as
well, then it searches for bash and shows
me these suggestions and then I can just
select the right one. In this case, "GNU
bash", and then I have the ID, and if I
move the mouse over it again, then I can
see what this ID refers to. So it's not
quite as bad as– so on the PDF slides, you
just see the ID. But if you're actually on
the query.wikidata.org website… let me
make that a bit larger so you can all see
it. And if you want to try that out on
your laptop, I don't know, here it's a bit
[audio outage] And for the programming
language, we use a slightly different
prefix, which is "wdt", which stands for
"truthy". So we're only interested in
"truthy" information and not all the
information. And then we find this
property P277. And if we run this query
with control-enter or with this button
here, then we get a collection of other
IDs. Yeah. Does anyone want to get
software which is written in bash? This
one has a very low ID that is going to be…
Loading. There we go. Autopackage. Some
package management system that I haven't
even heard of, but it's written in bash.
OK, so… wait. Er, so here you can see all
these statements and "programming
language: GNU Bash" is the one we looked
for. And unfortunately… so this is not a
very useful list. So one thing we can do
in the Wikidata Query Service, which is
pretty specific to Wikidata, is to add the
so-called label service, which is
basically magic that you don't need to
understand. But you write something like
"serv" or "service" and then with
control+space again for autocompletion.
And it suggests you this thing. And you
just keep that in your query at all times,
basically. And then you say, I would like
to have not just a software, but also the
software label. And then we get down here,
the label of the software. And I can also
add the software description. And then we
also see what, what is described. At least
if it has a description and then the query
results are already a lot more usable. And
I'm just going to rename this to "item"
and then we can edit this query however we
want and the variable name will always
kind of match. Because the next query
won't be about software anymore. So it'll
be confusing if you just still call it
"software". But, yeah, there is some
software here like Apache Yetus, Ruby
Version Manager, Wikidata missing
pictures, Pi-hole, all written in Bash.
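The query built up in this part can be sketched as follows. P277 is the property named in the talk; the item ID `Q189248` for GNU Bash is an assumption and should be confirmed with the autocompletion. The function names are hypothetical, and nothing here contacts the network: the code only builds the query text and a request URL for the query service's SPARQL endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed item ID for GNU Bash; check it by hovering over the ID in the editor.
BASH = "Q189248"


def software_written_in(language_qid, languages="en"):
    """Build the query: items whose programming language (P277) is the given item."""
    return f"""
SELECT ?item ?itemLabel ?itemDescription WHERE {{
  ?item wdt:P277 wd:{language_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{languages}" . }}
}}
"""


def request_url(query):
    """URL that asks the endpoint to run the query and return JSON."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def query_from_url(url):
    """Recover the query text from a request URL (useful to check the encoding)."""
    return parse_qs(urlsplit(url).query)["query"][0]


query = software_written_in(BASH)
url = request_url(query)
```

Passing `"en,de"` as `languages` would ask the label service to fall back to German when an item has no English label.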
OK, I have several more example queries
here, which are kind of simple, should I
skip ahead or is it good if I do a few
more simple examples. Skip ahead? Is that
OK? OK, then let's. So who was born at sea
is not all that interesting. Just Place of
birth at sea. We have a special value for
that and it's not a very interesting list.
I think a few results, just five or so,
because most people are going to have
"place of birth: Atlantic Ocean" or
something. Which places are located on the
White Elster, just something for the
Leipzig people. And where does the
Neverending Story take place? This is
actually kind of cute. Let's do that.
Also, this is a bit interesting because in
this case, the variable is in the last
place and not the first one. So that… and
then we have the Neverending Story in the
beginning and narrative location. And then
the item is at the end instead of at the
beginning of a triple. And it works just
as well, except that a lot of these don't
have a label in English. So let's add
German as a fallback language. And then we
get all of these places which someone
added to Wikidata at some point. Let's see
if there's any useful information about
them. So they all have IDs in the same
range. So it looks like they were all
created at the same time because the IDs
are just increasing all the time. So the
Gelichterland is a place from the
Neverending Story, it's a
fictional country. It has a capital, which
is this fictional place. It's located on
the… this terrain feature, it's present in
the Neverending Story. And it depicts
horror fiction. I'm not sure about that,
but let's leave it alone for now. OK,
yeah. And skip to a slightly more
interesting query, which is this one,
which popes had children. So what is the
graph going to look like for this? How
many, how many triples are we going to
have? So triple is node, arrow, and
another node, how many triples would you
need for "Pope has a child"? Let's do a show of
hands. Who thinks you need zero
triples, OK? Who thinks you need one
triple? Who thinks you need two triples?
That's more people. Does anyone think you
need three triples? No. OK, so mostly two,
but some people think one. So the one… the
people who think it might need one triple,
perhaps are thinking of something like the
Pope, which is the leader of the worldwide
Catholic Church, has a child, this child
or it's called item, but that's not going
to have any results. Or it could be the
other way around. And you could say that…
oh let's just comment this out. The item
has "father: the pope". And that doesn't
work. Because the items are not… the
children are not directly connected to the
item for the office of the pope, instead
it's going to be two levels. It's going to
say the child has a father, some person,
and then the person has the office pope or
has the position pope or is a pope or
something. So you need this level of
indirection. So in the graph that looks
either like this or it could be the other
way around. So either the child has a
father pope, which has "position held:
pope" or the pope has a child and also a
"position held", so that's kind of an
example of the redundancy I mentioned
earlier, we have the two directions
"child" and also "father"/"mother", and-
so you can ask your query in two ways, and
it doesn't really make that much of a
difference, assuming that the data is
complete. And I think someone occasionally
runs queries to check if any of these
circles are missing. So let's try one of
them, let's just stay with this one, so
the item does not have "pope" as father,
it has some pope, and then this pope has
"position held: pope". And then let's add
the "pope" label and… yeah, pope label is
enough, and then we get 24 results! So we
have a Duke of Parma, who was the
son of Paul III. Paul III had three
children. Let's sort by this. Wow,
Alexander VI was very busy. And some of
them just have, oh oh oh, we have
duplicates, Giovanni Borgia and Giovanni
Borgia. Should I demonstrate Wikidata
editing now or do we just ignore this? So,
yeah, someone imported a lot of
information from this peerage database and
apparently we have some duplicate items
here, let's just leave those alone for
now. In fact, I think this and this also
looks suspiciously similar. Giovanni
Borgia, unless he had two children of that
name. I mean, he could have. So this… we
have a date of birth 1470s… 1498. No, that
might actually be different children. OK,
not a very creative father in the names.
Yeah. And wait, that's a pope who's a
child of another pope. Very interesting!
And another one. And another one. We have
three popes who are children of other
popes. Let's search for those! So we would
also need for that, that the item has
"position held: Pope", and I could copy
paste this, but just do this. So the item
should be… child should have a "father:
pope" and the item should have "position
held: Pope", and the pope should also have
"position held: pope". And in this case,
it would probably be less confusing to
call these "child" and "father", because
this is also a pope now, but… variable
names. One of the three hardest problems
in computer science, right? Yeah, we have
three children who are… three popes who
are children of other popes. Wow. I'm
actually going to save this query, popes
who were children of other popes. But
actually, we can future-proof this a
little bit, because right now we've only
said that the father should be a pope. But
in case there's ever a female pope, let's
just switch this around and say that the
pope should have the child… item and then
it's going to work, even if the pope
happens to be female and is a mother
instead of a father. There we go, same
three results. OK, and let's keep that,
and open a new tab for next queries. Yeah.
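The pope queries from this part, written out as a sketch. The IDs used are P39 (position held), P40 (child) and Q19546 (pope); the exact text typed during the talk is not in the transcript, so the layout and variable names are a reconstruction. `triple_count` is a hypothetical helper that only counts the pattern lines.

```python
# Label service line shared by the queries below.
LABELS = 'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }'

# One triple: asks for children of the office "pope" itself.
# The talk reports that this form has no results.
ONE_TRIPLE = """
SELECT ?item WHERE {
  wd:Q19546 wdt:P40 ?item .
}
"""

# Two triples: some person holds the position, and that person has the child.
CHILDREN_OF_POPES = f"""
SELECT ?item ?itemLabel ?pope ?popeLabel WHERE {{
  ?pope wdt:P39 wd:Q19546 .
  ?pope wdt:P40 ?item .
  {LABELS}
}}
"""

# Third triple: the child must also have held the position.
POPES_WHO_WERE_CHILDREN_OF_POPES = f"""
SELECT ?item ?itemLabel ?pope ?popeLabel WHERE {{
  ?pope wdt:P39 wd:Q19546 .
  ?pope wdt:P40 ?item .
  ?item wdt:P39 wd:Q19546 .
  {LABELS}
}}
"""


def triple_count(query):
    """Count the wdt: triples in the WHERE block (one per line in these strings)."""
    return sum(1 for line in query.splitlines() if " wdt:" in line)
```

Using the child property from the pope's side, rather than the father property from the child's side, is the "future-proof" direction chosen at the end of this part.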
Which Microsoft software runs on Linux.
OK. That's not that funny. So perhaps we
can just skip it… I don't know. That joke
kind of ran out of steam a while ago.
Basically looks like this and it's like
Visual Studio Code and three other
programs, meh. What are some compositions
for organ and orchestra. This isn't funny
at all, but I just find it very nice
because it's just an awesome sound. And so
that would be… the composition has the
instrumentation "organ" and also
"orchestra", which we can write as… item,
item label… composition… instrumentation,
this one, orchestra. And also,
"composition… organ". And then, oops,
yeah, this should be "item"… and also I
forgot to add the label service. There we
go. And we have 12 results, which is nice
if you want to listen to any of those. We
could also check if any of them have an
audio file on Commons. Let's see. One, OK,
and I think we've heard this one already.
So, but… one thing that's kind of annoying
here, I should have mentioned this in the
last query, I think. So I had to repeat
the item and the property ID, which is a
bit annoying and makes the query difficult
to read. And what you can do is leave that
out and you can also do this in the
previous case. So let's actually go one
slide back. So here I didn't write twice
that it's the software which should have
the developer, and also the operating
system. I just wrote the software has
"developer: Microsoft" and also with a
semicolon at the end instead of a period,
it has "operating system: Linux". So if
you read this as English it's just one
sentence where you don't repeat the
subject twice. The software has
"developer: Microsoft" and "operating
system: Linux", instead of "software has
developer: Microsoft" and "software has
operating system: Linux". And if you… if
the property here is also the same thing,
then you can even leave that out and add a
comma at the end and just list the two
values and you don't even have to repeat
the instrumentation. So let's do that here
and abbreviate this query. And it has the
exact same 12 results, just slightly more
convenient to read and… to write at least,
hopefully also to read. I don't know. But
you don't use the comma that much. The
semicolon is pretty useful, like we could
have written this as, the pope has, er,
the child and also position held like
this. It means exactly the same, but you
can immediately see that both of these
refer to the pope because there's just a
bunch of blank space here. Yeah, so then
we have this one. This isn't funny at all,
but there are a lot of people who used to
be in the Nazi Party during World War 2
and then who later just went back into a
civil life and even received the
Bundesverdienstkreuz, the order of merit
of the Federal Republic of Germany. And
you can find those… in this case I've done
it with three triples, which is, the
person was a member of this political
party and received this award. And also
I've added that they're "instance of:
human", because we also have a lot of
fictional data on Wikidata. You already
saw that with the Neverending Story stuff
earlier. So there might also be a
fictional character who was a member of
this political party and who received the
award, and we're not really interested in
those. So we add "instance of: human", and
then we are certain that we only get real
results and not fictional results. And it
doesn't really cost us anything because
the Query Service can optimize that pretty
well. So let's write that… actually, let's
do that here. So the item should be
"instance of: human", which is Q5, because
it's a very common item, and "member of
political party". And you can see I can
search by the German abbreviation and find
this, even though it's not a label,
because there are search aliases. And also
"award received", the
Bundesverdienstkreuz, because I can't be
bothered to type in the whole English
name. There we go. And we find, I think…
how many results? Eleven results. Yeah.
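The semicolon and comma abbreviations used in the last few queries can be illustrated with a small sketch that expands them back into full triples. `expand` is a hypothetical helper, not part of SPARQL or any library, and it assumes that every token, including the punctuation, is separated by whitespace. The `ex:` names are made-up readable stand-ins for the Wikidata IDs.

```python
def expand(pattern):
    """Expand ';' and ',' abbreviations into full (subject, predicate, object) triples.

    Assumes every token, including the punctuation, is separated by whitespace.
    """
    triples = []
    subject = predicate = None
    expect = "subject"
    for token in pattern.split():
        if token == ".":
            expect = "subject"      # sentence over: next token starts a new triple
        elif token == ";":
            expect = "predicate"    # keep the subject, a new predicate follows
        elif token == ",":
            expect = "object"       # keep subject and predicate, a new object follows
        elif expect == "subject":
            subject, expect = token, "predicate"
        elif expect == "predicate":
            predicate, expect = token, "object"
        else:
            triples.append((subject, predicate, token))
    return triples


# ex: is a made-up prefix with readable names instead of Wikidata IDs.
long_form = """
?software ex:developer ex:Microsoft .
?software ex:operatingSystem ex:Linux .
"""
with_semicolon = "?software ex:developer ex:Microsoft ; ex:operatingSystem ex:Linux ."
with_comma = "?item ex:instrumentation ex:organ , ex:orchestra ."
```

Here `expand(with_semicolon)` and `expand(long_form)` yield the same two triples, which is the sense in which the abbreviated query "means exactly the same".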
And this actually isn't quite correct,
because in theory, you don't get this
order, this order has like 11 parts or
something. You can get the Grand Cross
with Distinction or you can get the Star
or whatever. I think it's listed somewhere
here. Yeah, you can get the Grand Cross
Special Class, you can get the Grand Cross
Special Issue, you can get the Grand Cross
First Class, blah blah blah. And so, in
theory, any of these people should have
one of these awards and not just "order of
merit". But I think when I checked, all of
them just had… all the results, just had
directly "order of merit". But actually,
no we can try to search for the correct
ones instead. So it would not be part of
this directly, it would be… "award
received" would be some award, such as
this one, and then this award is part of
the order of merit, so "award"… "part of"…
Let's see if that finds any results. Oh.
Oh. Oh, dear. Yeah, that, that… that's a
lot of results. "Herbert von Karajan".
That's that's depressing. OK, yeah. OK, so
I think I… when I tried this out and
didn't find any results, I just did
something wrong because, this way we find
a lot more results. And if we… so we don't
actually select the award here, because we
don't care what kind of award they got. So
we could also use this abbreviation again,
like this. So we just say they got some
award, which is part of the order of
merit. And in this case, we could even
abbreviate that further and say, we put a
slash here. And then, that kind of
describes a path that you have to take
from this item to this item and you have
to first get to some award received. And
then that has to be part of something
else. And you can add as many elements
here as you want. And then we get the
exact same 802 results… and… lots of well-
known names here. And if we want to find
the original 11 ones that directly had the
order of merit as the award received, we
can add a question mark here, which is
just like in a regular expression, it says
this part is optional. They can have
directly received this award or they can
have received some award, which is part of
the order of merit. And then we should get
813. Yeah, 813 results, so 802, plus the
11 from earlier. And… I'm starting this
with "instance of: human", which… and the
Query Service is going to re-order this
because searching for all the humans and
then filtering for the ones who are in
this political party and so on wouldn't be
efficient. So I don't have to worry about
that. I could write it in this order, or I
could shuffle it around. Doesn't make any
difference. The Query Service already
knows in which order to do these things.
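The slash and the optional question mark in a path can be sketched on toy data. The people and the triples below are placeholders, not Wikidata content, and `reaches` is a hypothetical helper that handles exactly the two-step case discussed here: "award received / part of", with the last step optionally skipped.

```python
# Toy data with readable names; the people are placeholders.
TRIPLES = {
    ("person A", "award received", "Grand Cross"),
    ("Grand Cross", "part of", "Order of Merit"),
    ("person B", "award received", "Order of Merit"),
    ("person C", "award received", "some other award"),
}


def objects(subject, predicate):
    """All nodes reachable from subject by one arrow with the given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def reaches(start, first, second, target, second_optional=False):
    """True if start --first--> x --second--> target.

    With second_optional=True the last step may be skipped, which mirrors
    the path first/second? (the '?' applies to the second step only).
    """
    for middle in objects(start, first):
        if second_optional and middle == target:
            return True
        if target in objects(middle, second):
            return True
    return False


people = ["person A", "person B", "person C"]

# award received / part of
via_part = [p for p in people
            if reaches(p, "award received", "part of", "Order of Merit")]

# award received / part of ?
direct_or_via_part = [p for p in people
                      if reaches(p, "award received", "part of", "Order of Merit",
                                 second_optional=True)]
```

In this toy data the plain path finds only person A, while the path with the optional last step also finds person B, who holds the order directly.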
So you don't have to worry about that. You
can just start with "is a human" and then
add everything else. I think I have one
more complicated query here. Yeah, so
that's one of the examples I mentioned
earlier, the largest cities by population
with a female mayor. So the graph for that
is, I think the largest one I prepared for
the slides, except the one in the
beginning. And it looks like this. We
should have a city which is a city,
"instance of: city", and it has a certain
population, and it has… so for the mayor,
we use the same property as for head of
government. And if you don't know that,
you could look at some city like Berlin
and maybe you know what the mayor of
Berlin is called… what was it? Something
"Müller", I think. Yeah. And then you can
see, aha, the property for the mayor is
"head of government". Or you could also
search for, the city should have a mayor,
and then you'll still find "head of
government", the right property. And that
mayor should be a human and she should
have the gender "female". Oops. There's a
question mark there for no reason at all.
That's not a variable. That should be the
fixed value. Sorry. So let's put that
there. We have a city which is "instance
of: city", and it also has a population
which we're going to use later and it also
has a head of government. No, that's
wrong. Not the "office held by head of
government", the "head of government"
itself, which we call the mayor and then
the mayor is "instance of: human" and
gender should be female… come on… female.
And let's select the city, cityLabel,
mayorLabel and also the population. And
then we find some 83 results. That's not
yet the largest cities with a female
mayor. That's just all of them. And in
Wikidata we know about 83, apparently. And
if your local hometown has a female mayor,
just go ahead and add it to Wikidata and
it's probably relevant. It's not– So the
relevance criteria are not as strict as on
Wikipedia fortunately. But if we want just
the most populous ones, we can go a bit
back into SQL land and say we want to
ORDER BY the population and in SQL you
would write DESC afterwards and in SPARQL
it's different. You write
DESC(?population). Erm, I think it's nicer
that way. But perhaps it would have been
nicer to just stick with the SQL syntax. I
don't know. And we want to limit this to
just the ten most populous cities, for
example. And here we go. Tokyo is
currently the biggest one, then Hong Kong,
Baghdad, Surabaya, Rome. Yeah. And, oh.
This doesn't make that much sense, Caracas
has two mayors. Anyone… yeah, exactly. So
we're only supposed to get the current
mayor. Head of government… yeah. Does
anyone know which one is the current one?
Or we could just check Wikipedia… Caracas,
which hopefully doesn't get its
information from Wikidata yet. So it's not
circular. And the mayor is… Carolina,
Carolina Cestari… Cestari, I don't know.
[Laughter]
OK, so let's add a new one. Ah…? Doesn't
have an item yet, is that… is that the
mayor, or is chief of government something
else? Doesn't occur anywhere else on the
page, of course. Local government… mayor…
no. OK, so let's just… I don't know,
doesn't she have a Wikipedia article? No.
Just appears in some lists and then she
doesn't have a Wikidata item yet? No.
Then… I don't know. We'll do some live
Wikidata editing. It wasn't part of this
talk, but let's just do it. Carolina
Cestari… what country is that? Venezuela.
Venezuelan politician, and that sounds
like a female name, so I'm just going to
guess and check that after the talk. So
she's definitely a human. And gender is
female and that is going to be enough for
our query. Do this search again. There we
go. And set this to preferred rank. So
that's how the Query Service knows that
this is the current value and it should
only return this one. And ideally, one of
the head of government values should have
this preferred rank to mark it as the
correct current value. And then all the
other ones are additional data that you
can use if you want. But it's not the main
value and we are not going to get it in a
simple query. And then there's some error
because Caracas isn't some kind of
political territorial entity and it should
have a start time. I don't care right now.
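The effect of the preferred rank can be sketched with a toy model. It assumes the usual Wikidata rule that a simple query only sees the best-ranked statements of a property: the preferred ones if any exist, otherwise the normal ones. The function name and the placeholder for the other mayor are made up for illustration.

```python
def best_rank_values(statements):
    """Return the values a simple query would see.

    statements: list of (value, rank) pairs, where rank is "preferred",
    "normal" or "deprecated". Deprecated statements are never returned.
    """
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        return preferred
    return [value for value, rank in statements if rank == "normal"]


# Toy data: the other mayor's name is a placeholder, not real data.
before = [("(other mayor)", "normal"), ("Carolina Cestari", "normal")]
after = [("(other mayor)", "normal"), ("Carolina Cestari", "preferred")]

print(best_rank_values(before))  # both values, as in the first run
print(best_rank_values(after))   # only the preferred value
```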
OK, so we run this query again and
hopefully get just one result for Caracas
this time. No. Uhm, we have to wait a bit
until the Query Service is updated.
Because it's kind of asynchronous. It just
keeps watching for changes and eventually
it will get the new data, but… okay. It
might take a bit longer. Anyways. That's
how that query works. Does that make kind
of sense? OK, great. Yeah, I think this is
almost exactly what I wrote here. Yeah.
Except with some labels and the label
service. Yeah. There is one problem here,
which is, for example, I happen to know
that Mexico City is a very large city with
a population of… population: almost 9
million. So it should be right after Tokyo
in front of Hong Kong. And the head of
government is a Claudia Sheinbaum or
something, which sounds like a woman. So
we should get this result in the query.
The reason we don't is that Mexico City is
an instance of "big city" and we have
searched for "instance of: city". And
there's some debate about does this class
even make sense at all? I think this is
actually the German classification of, a
big city is one with 100,000 inhabitants,
and in other languages or countries, a big
city might be something else, but for now
that… the data is what it is. Fortunately,
what we have here is the information, a
"big city" is a subclass of a city/town,
which is a subclass of "locality", which
is a subclass of. Wait. We should arrive
at city at some point, but I think we've
already gone past that. It's also an
instance of capital. Let's go down that
instead. A capital is a subclass of city,
there we go. So if we can tell the Query
Service to follow these subclass
connections, then we should find these
cities. And one way to do that… to make it
work for Mexico City would be to say, it
has to be "instance of", some, with the
path again, "subclass of: city" and then
we would find Mexico City, but we would
not find all the… oh, we would still find
Tokyo because it's still a capital, I
guess. But we've missed a lot of other
cities, I think which we used to have…
yeah. Rome, for example, is gone. Because
it's… that's just an instance of city
directly. And we've now made the subclass
mandatory. What we should do is make it
optional, or even better, we would– we
should say there can be any number of this
element. So there… it can be an instance
of city or it can be an instance of a
subclass of city, it can be an instance of
a subclass of a subclass of city. You can
follow any number of elements, that what
this… that's what this star means, just
like in a regular expression. And then we
probably have to say we only want the
distinct ones because there are like five
different ways to go through the subclass
tree until you've found "city". And we're
not interested in the different ways. But
now we should get Tokyo and Mexico City.
And Rome is also here and Caracas is
completely gone because we found enough
other cities which we were missing
earlier. So you kind of have to watch out
and sometimes use elements like this…
"subclass of"-tree is pretty common, or
with a, something… order of merit, we had
to use this "part of". You have to watch
out if the results are plausible, or
ideally, you know some item that should be
in the results, and then you check, is it
there? Why is it not there? And
investigate like that. But that's a fixed
version of the query. And… yeah, if we
were not interested in the mayor, we could
do the same trick again. But, yeah. It
doesn't make that much of a difference.
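The difference between the three variants of the class check can be sketched on a toy graph. The edges below only mirror what is read out in the talk and are not the full Wikidata class tree; the function names and mode labels are hypothetical. In SPARQL the final pattern would presumably be written as `?city wdt:P31/wdt:P279* wd:Q515`, with the identifiers again being assumptions.

```python
# Toy "subclass of" and "instance of" edges, following the talk.
SUBCLASS_OF = {
    "capital": {"city"},
    "big city": {"city/town"},
    "city/town": {"locality"},
}
INSTANCE_OF = {
    "Rome": {"city"},
    "Tokyo": {"capital"},
    "Mexico City": {"big city", "capital"},
}


def superclasses_star(cls):
    """cls itself plus everything reachable via zero or more subclass steps."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def instances(target, mode):
    """Items whose class matches target under the given path mode."""
    result = set()  # a set, so several paths to one item count once (DISTINCT)
    for item, classes in INSTANCE_OF.items():
        for cls in classes:
            if mode == "direct":        # instance of
                reachable = {cls}
            elif mode == "one_step":    # instance of / subclass of
                reachable = set(SUBCLASS_OF.get(cls, ()))
            else:                       # instance of / subclass of*
                reachable = superclasses_star(cls)
            if target in reachable:
                result.add(item)
    return result


for mode in ("direct", "one_step", "star"):
    print(mode, sorted(instances("city", mode)))
```

The direct check finds only Rome, the mandatory single step loses Rome again, and the star variant finds all three cities.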
And I think… yeah, that was almost the
only difference. Yeah, except that I
removed the population so we can order by
a variable that you don't select in the
end if you want. And I think I am out of
slides. So, yeah, if you want to see more
queries, you can look at these Twitter or
social media accounts. There's a huge list
of example queries on Wikidata, which is
so big that it's getting too big for a
wiki page, and people had to move some
queries out there and it's kind of just
grown since 2015 or something. And there's
a lot of garbage there, but also a lot of
useful queries if you want to look at
that. And I had two more queries in the
talk description which we haven't talked
about yet, and I think we have the time. I
can just try to open these. "Which films
starred more than one future head of
government?" Does that work? It doesn't.
Can I copy the URL here? Yeah, copy link
address. So that's a kind of longer query,
which is why it didn't really fit on one
slide. But the important film is you have…
er, the important part is you have some
film… instance of, or subclass of film, it
has a publication date and a cast member,
which is the head of government. And the
head of government held some position,
some head of government, er, some subclass
of head of government. And that should be
after the film was published. And then you
get a bunch of results. I think this takes
like 11 seconds or something. And you get
like films with Schwarzenegger and one
other actor who became US governor. I
don't remember the name. And you also get
a lot of… or several films from World War
II with future French heads of government,
which is really cool. So, like a film that
was shot about the liberation of Paris,
where it's… it's kind of a stretch to call
them cast members, but they're definitely
in the film. And if we get the result,
then I can tell you what the film is
called. Yeah, it might be busy right now,
so you get up to 60 seconds in the Query
Service and then in the end your query is
killed if it takes longer than that. So
sometimes it can be a bit of a struggle to
make the query work within 60 seconds.
There we go, 50 seconds. That was close.
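The logic of that query (cast members who took office after the film was published, grouped by film, keeping films with more than one such person) can be sketched in plain Python. The small cast list and the years are filled in from general knowledge for illustration and are not output of the Query Service; the function name is hypothetical.

```python
from collections import defaultdict

# (film, publication year, cast member): a few illustrative rows.
CAST = [
    ("Predator", 1987, "Arnold Schwarzenegger"),
    ("Predator", 1987, "Jesse Ventura"),
    ("The Running Man", 1987, "Arnold Schwarzenegger"),
    ("The Running Man", 1987, "Jesse Ventura"),
    ("Twins", 1988, "Arnold Schwarzenegger"),
    ("Twins", 1988, "Danny DeVito"),
]

# Person -> year they first took office as a head of government.
TOOK_OFFICE = {"Arnold Schwarzenegger": 2003, "Jesse Ventura": 1999}


def films_with_future_heads(min_people=2):
    """Films whose cast includes at least min_people future heads of government."""
    future_heads = defaultdict(set)
    for film, year, person in CAST:
        start = TOOK_OFFICE.get(person)
        # Keep only people who took office after the film was published.
        if start is not None and start > year:
            future_heads[film].add(person)
    return {
        film: people
        for film, people in future_heads.items()
        if len(people) >= min_people
    }


print(sorted(films_with_future_heads()))
```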
So there's yeah, there's a "La Libération
de Paris" with Charles de Gaulle, who was
president of the Council and president of
the provisional government, and also
Georges Bidault, I think, who was prime
minister and president of the Council, and
other stuff. We have several Indian films
with people who went on to become chief
ministers. And then down here there's some
Canadian politicians, apparently. And then
here's Arnold Schwarzenegger and Jesse
Ventura, who both became governors and
also starred in several films. And the
other thing was, we have a lot of data
about the British government because a lot
of volunteers have just been slaving away
at that data and adding and adding more
information. I think they've… they have
all their parliaments, complete with party
affiliations and everything for at least
the last 100 years and some partial data
for a lot more than that, because they
have a very long parliamentary history.
And then you can do queries like "how many
people named John are there in
parliament", and "how many women with any
name". And you can see when the women were
finally more than just the men who are
named "John". And it's kind of an amusing
graph. Or not so amusing. Takes a while as
well. I hope it doesn't take 50 seconds,
but it looks like the Query Service might
be busy at the moment. But I think it was
something like in 1991 or so is the
crossover point. Oh yeah. And I should
mention anyway, so everything we saw right
now was just a lot of tables. But you can
also show results in different ways, such
as a line chart. There we go. So in 1992,
this was the first parliament which had
more women than Johns. And then the Johns
have slightly declined and the women have
gone up to 220. How many people are in the
House of Commons in total? Does anyone
know? No. So I don't know what percentage
this is. Uh, but, this was… yeah, this
latest election from 12 December already
in there. Yeah. [inaudible] What?
So the query looks like this. So this one
is broken into several parts. We first
find all the members of parliament, so
they should be human, again, no fictional
people, and then they should have some
"position held", which is a subclass of
"member of parliament" in the House of
Commons. And then there should also be,
um, a parliamentary term on that, so that
we know which parliament it is and when it
starts. And then down here, we import all
those MPs and filter for just the ones
with the "given name: John". And then we
filter for just the ones with "gender:
female". And there's an optional "subclass
of" in here, because currently the data
model is that there is a separate item for
transgender female and someone can have
"gender: transfemale– transgender female",
which is a subclass of "female". And there
is a discussion right now to get rid of
that and have a separate property for that
instead. And then all the trans people
just have "gender:", their right gender,
and you don't have to mess with subclass.
But right now we still… well, we need it
in theory, I don't think there are any MPs
in practice. But, you know, you know, you
can just keep it in there. And then we
import the results and get them here
either as a line chart or as a table, if
you want to sort it by the time… yeah, the
data starts in 1919, apparently. So we
have exactly a hundred years of history
there. We can also show it as a bar chart,
if that makes more sense. No it doesn't.
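The structure of this query (collect the MPs per parliament once, then count the Johns and the women in each) can be sketched as follows. The records are invented toy data, not real MPs or real counts, and all names in the code are hypothetical. The gender check allows zero or one "subclass of" step, which is one reading of the optional step described above.

```python
from collections import Counter

# Invented toy records: (parliament number, given name, gender).
MPS = [
    (1, "John", "male"),
    (1, "John", "male"),
    (1, "Mary", "female"),
    (2, "John", "male"),
    (2, "Ann", "female"),
    (2, "Sue", "transgender female"),
]

GENDER_SUBCLASS_OF = {"transgender female": "female"}


def is_female(gender):
    """True for "female" itself or a direct subclass of it."""
    return gender == "female" or GENDER_SUBCLASS_OF.get(gender) == "female"


def counts_per_parliament(mps):
    """Rows of (parliament, number of Johns, number of women), sorted."""
    johns, women = Counter(), Counter()
    for parliament, given_name, gender in mps:
        if given_name == "John":
            johns[parliament] += 1
        if is_female(gender):
            women[parliament] += 1
    parliaments = sorted(set(johns) | set(women))
    return [(p, johns[p], women[p]) for p in parliaments]


for row in counts_per_parliament(MPS):
    print(row)
```

Each row corresponds to one point on the two lines of the chart.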
That makes no sense. Line chart is the
right one. Oh, right, but if you show the
line chart again, then it breaks for some
reason, there's some bug there. So let's
just show it again. There we go. That's
the right… chart. Yeah, and I guess… oh
wow, it's already… 50 minutes, so I guess
this is the point where we start moving to
the live querying part, and I was told I
should make at least a short break for the
stream, so the Angels know where to cut
between. But we could also take a 10
minute break and then start the next
talk on time. Does that sound OK? Or is 10
minutes too long? Uhm, if you're going to
stay here, which would be very nice, then
please think of some example queries that
you think we could write, and then I can
try to write them, because otherwise I'm
not going to have much to do. But yeah,
let's do a 10 minute break and see you
then. Thank you so far.
[Applause]
Postroll Music
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!