Asaf Bartov: Testing, testing.
Is this heard in the room?
Testing.
Hello, everyone.
This is a gentle
introduction to Wikidata
for absolute beginners.
If you're an absolute
beginner, if you've never heard
of Wikidata, or if you've heard
of Wikidata but don't quite get
it, don't know what it's
good for, have only used it
for inter-wiki links--
if you're anywhere
on this range,
you're in the right place.
My name is Asaf Bartov.
I work for the
Wikimedia Foundation,
and I am a Wikidata enthusiast.
So the first thing I want to
say is that you are lucky.
You are lucky because
Wikidata already is,
and is quickly becoming even
more of, an important research
tool for anyone who's
trying to ask questions
about large amounts
of information.
It will become more and more
used across the humanities,
in particular, because of the
things that it's able to do,
some of which we will
demonstrate shortly.
And you are lucky because you
get to find out about it now
before most of the world.
So by the end of this talk,
you will be a Wikidata hipster
because you'll be
able to say, oh yeah.
I knew about Wikidata
before it was cool.
So before we actually
visit Wikidata,
I want to share two key problems
that Wikidata seeks to solve
and which would help us
understand why it exists.
The first problem is that
of dated data, that
is, data that is out of date.
And this is apparent
on Wikipedia
across our free
knowledge encyclopedias.
Data on Wikipedia is
not always up to date.
And the more obscure
it is, the more likely
it is not to be up to date.
So the Polish Wikipedia may have
an article about a small town
in Argentina, and that article
will include information
about that town like population
size, name of the mayor.
And that information,
ideally, was
correct at the time the article
was created on the Polish
Wikipedia--
maybe translated
from another wiki.
But then how likely is
it to be kept up to date?
How likely is it that the
Polish Wikipedia would give us
the correct and latest numbers
or data about the population
size of that town
or the mayor, right?
So this is the kind of data
that does go out of date, right?
Every few years--
five, 10 years--
there is a census, and now there
are new population figures.
Now the census in Argentina will
be made available in Argentina
in Spanish, probably,
which brings us
to another component of the
problem of dated data, which
is there are no obvious
triggers for updating the data.
So the Polish Wikipedian
is not sent an email
by the Argentinean
government saying, hey,
we have a new census.
There are new population numbers
for you to update on Wikipedia.
No such email is sent.
So it's kind of
hard to notice when.
And of course, multiply that by
all the different jurisdictions
around the world.
There's no easy
way to notice when
your data goes out of date.
So that's difficult
to keep up to date.
And even if we were to receive
some kind of indication--
oh, there's a new
census in Argentina,
so a whole bunch of
population figures
have now gone out of date.
Updating it on the
Polish Wikipedia
and the French Wikipedia
and the Indonesian Wikipedia
and the Arabic Wikipedia is a
whole bunch of repetitive work
that a lot of
different volunteers
will need to do just for
that one updated piece
of information about Argentina.
So I hope this is
clear and resonates
with some of your experience
editing Wikipedia--
data that is out of
date or that needs
to be updated
manually, menially,
on a fairly frequent schedule
across the different countries
and data sources.
The other-- and I think
maybe more interesting--
shortcoming or problem
that I want to discuss
is what I call the
inflexible ways
of lateral queries, cross-cutting
queries of knowledge.
So if I want an answer to
the question, what countries
in the world export rubber--
that's a reasonable
question, right?
That information
is on Wikipedia.
Do you agree?
If you go to
Wikipedia and read up
about Brazil, about Peru, about
Germany, somewhere in there--
maybe a sub-article called
Economics of Brazil--
you will find the main
exports of that country.
And you can find
out whether or not
that country exports rubber.
But what if I don't want
to go country by country
looking for the word rubber?
I just want an answer.
What are the countries
that export rubber?
Even though that
information is in Wikipedia,
it's hard to get at.
It's hard to query.
Now, you may say, well, that's
what we have categories for,
right?
Categories are a way to
cut across Wikipedia.
So if someone made a
category called rubber
exporting countries, then
you can go to that category
and see a list of countries
that export rubber.
And if nobody has
made it yet, well, you
can create that category and,
with a kind of one-time effort,
populate that category,
and you're done.
Well, yes.
That's still not
very convenient.
But also, it's still
very, very limited,
because what if I only want
countries that export rubber
and have a democratic
system of government,
or any other kind of
additional condition
that I would like
to add to this?
Or take a completely
different example.
What if I want to know
which Flemish town had
the most painters born in it?
There's a ton of
Flemish painters.
Most of them were
born somewhere.
We could theoretically
just, you know,
look up all the birthplaces
of all the Flemish painters
and tally up the
numbers and figure out
what is the place where the
most Flemish painters come from?
I don't know the answer to that.
It would be nice to be
able to get that answer.
Again, the data is in Wikipedia.
Those birthplaces are
listed in the articles
about those painters.
But there's no easy way
to get that information.
What if I want to ask, who are
some painters whose father was
also a painter?
That's a thing
that exists, right?
Some painters are
sons of painters.
You know, Bruegel comes to
mind as an obvious example.
But there's a bunch
of others, right?
So who are those people?
What if I want to
ask that question?
That's the kind of question
that not only Wikipedia
doesn't answer today.
If you walk to your friendly
university library reference
desk and say,
hello, I would like
a list of painters whose
father was also a painter,
how would that
librarian help you?
There's no easy way to get an
answer to a question like that.
What if you only want
a list of painters
who were immigrants, painters
who lived somewhere else
than where they were born?
There's no book.
I guess maybe there
is, but you know,
it's not obvious that there's a
ready resource that says, list
of painters who are immigrants.
And the librarian would
probably refer you
to a book on the shelf
called, I don't know,
The Complete
Dictionary of Flemish
Painters and go,
look up the index,
you know, and if you
see a similar surname,
maybe they're father and son.
And kind of cobble together
the answer on your own.
The reason I'm comparing
this to a library
is to show you that this is a
kind of question that is not
readily satisfiable today.
Now, these questions may
sound contrived to you.
You may say to
yourself, well, you
know, painters who are also
sons of painters, yeah.
You know, that
never occurred to me
as a question I
might care about.
But I want to invite
you to consider
that this kind of question,
questions like that question,
may well be questions
you do care about.
And I also want to suggest
that the fact it is so nearly
impossible, the fact that
there's no obvious way
to ask that kind
of question today,
is partly responsible
for your not
coming up with those
questions, right?
We tend to be limited
by the possible.
You know, until human
flight was made possible,
it did not occur to anyone
to say, oh yeah, by this time
next week I will
be in Australia,
because that was
just impossible.
But when flight is
possible, there's
all kinds of things that
suddenly become possible,
and there's all
kinds of needs that
arise based on the
availability of resources
to fulfill those needs.
So many of these research
questions, compound lateral
cross-cutting queries, are not
being asked because people have
internalized the fact
that there is no way
to get an answer
to questions like,
what is the most popular first
name among British politicians?
I just made that up, you know?
Is it John?
Maybe.
Maybe it's William,
for whatever reason.
You know, these are the kinds
of questions we don't routinely
ask because we know that it's
like, who are you going to ask?
How are you going to
get an answer to that?
So this problem of not having
very flexible ways of querying
the data that we already have--
in Wikipedia, in
Wikisource, elsewhere--
is a significant limitation.
So these two key problems
have one solution.
And that is an editable,
central storage
for structured and
linked data on a wiki,
under a free license, which
is a very long way of saying
Wikidata.
That is Wikidata.
Wikidata is an editable,
central storage
for structured and
linked data on a wiki,
under a free license.
So let's take this
apart and unpack it.
First of all, it's
a central storage.
This relates to the
first problem, right?
If we had one place containing
data like population size,
we would be able to update
that one place and then have
all of the different Wikipedias
draw the data from that one
place so that we wouldn't
have to manually,
repetitively update it across
our hundreds of projects.
So having central storage
makes, I hope, kind
of immediate, intuitive sense.
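A minimal sketch of the central-storage idea, in Python. Everything in it is hypothetical: the names `central_store` and `render_infobox`, the topic key, and the population figures are invented for illustration, and this is not how the Wikipedias actually read from Wikidata.

```python
# Hypothetical central store: one record per topic, shared by every wiki.
central_store = {
    "example-town": {"population": 12000},  # invented figure
}

def render_infobox(language: str, topic: str) -> str:
    """Each language edition reads the same central value when it renders."""
    population = central_store[topic]["population"]
    return f"[{language}] population: {population}"

# A new census comes out: the single central value is updated once.
central_store["example-town"]["population"] = 12500  # invented figure

# Every edition now shows the new figure without any per-wiki edit.
for lang in ("pl", "fr", "id", "ar"):
    print(render_infobox(lang, "example-town"))
```

The point of the design is that the update happens in one place, while the number of readers of that value can grow without adding any update work.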
But what do I mean by
structured and linked data?
So structured data means
that each datum, each piece--
individual piece-- of data
is managed on its own,
is identified and
defined on its own,
as distinct from Wikipedia.
Wikipedia has articles.
The article about Brazil
includes a ton of data,
all kinds of information,
and it's presented as text,
as several paragraphs--
several pages--
of text, right?
Now, we do have an
approximation of structured data
on Wikipedia.
If you've browsed
Wikipedia a little,
you've noticed that we often
have an info box, what we
call an info box on Wikipedia.
That's the table on the right
side if it's a left to right
language, the table
on the right side
that has information that
is easy to tabulate, right?
So you know, birth date, birth
place, death date, death place,
nationality--
or if it's about a country,
area, population, anthem,
type of government, whatever
you are likely to find.
If it's a movie, then
you know, starring,
genre, box office receipts,
whatever pieces of data
are relevant to an
article about a movie.
So we do already kind of
group pieces of information
on Wikipedia into this
kind of structured format.
Those of you who have
ever looked at the source,
at what the wiki code
under that looks like,
know that it's only
semi-structured.
It looks neat and
organized in a table,
but really, it's just a bunch
of text that is put there.
It is not centralized.
Every Wikipedia has its
own copy of that data.
And if I go and update
the population size
on Spanish Wikipedia of
that Argentinean town,
it does not get
updated automagically
on the English Wikipedia or
the Arabic Wikipedia, right?
So the structured data that
we already have on Wikipedia
is not managed centrally.
The other thing
about structured data
is, when you have a notion of an
individual piece of data, that
is the cornerstone of
allowing the kinds of queries
that I was talking about.
That is what will allow
me to ask questions like,
what is the Flemish town where
the most painters were born,
or what are the world's
largest cities that
have a female mayor?
I could come up with other
examples all day long, right?
These are all questions
that you can ask,
once you break down your data
into individual pieces, each
of which is--
you're able to refer to each
of those programmatically.
The computer can
identify, isolate,
and calculate based on each
of those pieces of data.
So that's why the
structure is important.
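To see why individually addressable pieces of data make such questions answerable, here is a toy sketch in Python. The record layout and the function name are assumptions made for the example. The people listed are only a tiny sample, and `None` means "not recorded in this toy data".

```python
# Toy records: every field is a separate piece of data a program can read.
people = {
    "Pieter Bruegel the Elder": {"occupation": "painter", "father": None},
    "Pieter Brueghel the Younger": {
        "occupation": "painter",
        "father": "Pieter Bruegel the Elder",
    },
    "Jan Brueghel the Elder": {
        "occupation": "painter",
        "father": "Pieter Bruegel the Elder",
    },
    "Jan Rubens": {"occupation": "lawyer", "father": None},
    "Peter Paul Rubens": {"occupation": "painter", "father": "Jan Rubens"},
}

def painters_with_painter_fathers(records: dict) -> list:
    """Return painters whose recorded father is also recorded as a painter."""
    result = []
    for name, fields in records.items():
        father = fields["father"]
        if (
            fields["occupation"] == "painter"
            and father in records
            and records[father]["occupation"] == "painter"
        ):
            result.append(name)
    return sorted(result)

print(painters_with_painter_fathers(people))
```

With free text, a program would have to guess at what each sentence means. With separate fields, the question becomes a simple filter.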
Now, Wikidata is also a
linked data repository.
What does it mean that
the data is linked?
Well, it means that a single
piece of data can point at,
can link to another
whole bag of data.
So if we are describing,
for example, a person,
and we record the
single piece of data
that this person was born
in Salem, Massachusetts,
that single piece of data
links to the item about Salem,
Massachusetts
because, of course,
we know a lot of things
about that place, Salem,
Massachusetts.
So it's not just the text--
S-A-L-E-M. It's not just,
that's where they were born.
But it's a link to all
the data that we have
about Salem, Massachusetts.
If we say someone's
nationality is French,
that is a link to France.
That is a link to everything we
know about the country France.
The fact that the data
is linked and structured
allows not only humans,
but also computers
to traverse information
and to bring
us different pieces of
relevant information
programmatically, automatically,
based on those links.
Because it's not just
text, it's an actual link
to another chunk of data.
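A small Python sketch of what "linked" means in practice. The talk speaks only of "a person" born in Salem; Nathaniel Hawthorne is filled in here as a concrete example, and the keys `hawthorne`, `salem`, and `usa` are made-up local identifiers, not real Wikidata IDs.

```python
# Each key identifies one item; a value that is itself a key is a link.
items = {
    "hawthorne": {"label": "Nathaniel Hawthorne", "place of birth": "salem"},
    "salem": {"label": "Salem, Massachusetts", "country": "usa"},
    "usa": {"label": "United States"},
}

def country_of_birth(person_key: str) -> str:
    """Follow person -> place of birth -> country, one link at a time."""
    place = items[items[person_key]["place of birth"]]
    country = items[place["country"]]
    return country["label"]

print(country_of_birth("hawthorne"))
```

Nothing about the country is stored on the person. The program reaches it by following links, which is what a plain text string such as "Salem" would not allow.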
If this sounds a
little abstract,
it will become much
clearer in just a second
when we see it in action.
But the other components of
this little definition are,
of course, this central storage
of structured and linked data
needs to be editable,
of course, because we
need to keep it up to date.
We need to correct mistakes.
And we want it on a wiki
under a free license.
The free license is, of
course, essential to enable
reuse of that data, to enable
all kinds of reuse of the data.
And Wikidata, unlike
Wikipedia, is released
under a different free license.
Wikidata is released
under the CC0 waiver.
That means unlike
Wikipedia, where
you have to attribute Wikipedia
when you reuse information
from Wikipedia, you do not
need to attribute Wikidata,
and you do not need to
share alike your work.
It's an unencumbered license to
reuse the data in any way you
want, including commercially.
You don't have to say that
it comes from Wikidata.
I mean, it could be nice,
but you don't have to.
You're under no
obligation to do it.
And that is important to
allow certain kinds of reuse
where, for example, if you're
building some kind of device,
you may not have a practical
way to give attribution.
And had we required
that to use Wikidata,
we would have made
Wikidata less reusable.
So Wikidata is unencumbered by
the requirement of attribution.
And of course, because
it's on a wiki,
we get all the benefits that we
are used to expect from a wiki,
right?
So it's a wiki,
which means, yes.
It has discussion pages.
It has revision histories.
It remembers everything.
So if you screw it up, you
can always go a version back.
Or if someone else
vandalized the content,
we can always go back,
just like Wikipedia.
So we get all the
benefits we're used to--
user talk pages, group
discussion pages, watch lists,
all the features that
we expect in a wiki.
In short, Wikidata is love.
I hope you agree with me
by the end of this talk.
So let's zoom in and see
what this structured data
looks like.
So structured data on Wikidata
is collected in statements.
And statements have
the general form
of this triple, this
tripartite ascription--
items, properties, and values.
Now an item is the
subject, is the topic
that we are trying to describe.
It can be any topic that
Wikipedia can cover,
and many others that
Wikipedia wouldn't.
So the topic, the
item can be Germany,
or it can be Salem,
Massachusetts,
or it can be the
concept of redemption.
It can be anything at all.
Anything you can imagine
describing in any way with data
can be the item.
So the item, consider
it like the title
of the rest of the data.
And then what do we say
about Salem, Massachusetts
or about Germany?
Well, that's a series of
properties and values,
properties and values.
The property is
the kind of datum,
like birth date or language
spoken or manner of death.
These are all real properties.
Or national anthem, if I'm
trying to describe a country--
these are properties.
And then they have
values, right?
So this person, this
imaginary person's place
of birth, the value of the
property place of birth
is Salem, Massachusetts.
So you can think about it
as like a government form--
or not government, just any
form that you're filling out--
where there are field names,
and then empty spaces for you
to fill out.
That's the value, OK?
So the field names
or the categories
are the properties, right?
So name, language,
occupation, date of birth--
these are all properties.
And the values are
the actual piece
of data, the actual
information that we have.
And of course,
different kinds of data
are relevant for describing
different kinds of items.
And the key in the value is it
can be either a literal value--
like if we're describing
the height of a mountain,
we might say just
the number 8,848.
That's the height
of which mountain?
Not everyone at once.
Oh, because it's meters,
the metric system.
Yeah, Mt.
Everest is 8,848 meters.
Yes.
Get with it, America.
The metric system.
All right, so that
can be a literal value
like an actual number.
Or it can be a link to an
item, pointing at another item.
But in this statement,
it is the value.
So if I'm talking about
Germany, the item is Germany.
And the property capital
city has the value Berlin.
But the value is
not B-E-R-L-I-N.
The value is a pointer to
the item Berlin, right?
That's the link.
So a single item is described
by a series of such statements,
right?
There's hundreds and hundreds of
things I can say about Germany.
There's hundreds of things
I can say about a person.
And these will
generally take the form
of a property and a value.
By the way, some properties
may have more than one value.
Consider the property
languages spoken.
People can speak more
than one language, right?
So if I'm
describing myself,
we can say languages spoken--
English, Hebrew,
Latin, whatever.
So a property can have
more than one value.
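The shape of a statement can be sketched as a small data structure. This is an illustration in Python, not Wikidata's actual storage format. The class names `ItemRef` and `Statement`, the helper `values_of`, and the property name "height in meters" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class ItemRef:
    """A pointer to another item, as opposed to a literal value."""
    name: str

# A value is either a literal (number, text) or a link to an item.
Value = Union[int, float, str, ItemRef]

@dataclass(frozen=True)
class Statement:
    item: ItemRef   # what we are describing
    prop: str       # the kind of datum
    value: Value    # the datum itself

statements = [
    Statement(ItemRef("Germany"), "capital city", ItemRef("Berlin")),
    Statement(ItemRef("Mt. Everest"), "height in meters", 8848),
    # One property, several values: one statement per value.
    Statement(ItemRef("the speaker"), "languages spoken", ItemRef("English")),
    Statement(ItemRef("the speaker"), "languages spoken", ItemRef("Hebrew")),
    Statement(ItemRef("the speaker"), "languages spoken", ItemRef("Latin")),
]

def values_of(item_name: str, prop: str) -> list:
    """Collect every value recorded for one property of one item."""
    return [
        s.value for s in statements
        if s.item.name == item_name and s.prop == prop
    ]

print(values_of("the speaker", "languages spoken"))
```

Keeping links and literals as different types lets a program tell "Berlin the item" apart from the six letters B-E-R-L-I-N.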
So if the item is
about a country,
it would have statements about
properties like population,
land area, official languages,
borders with, anthem,
capital city.
If I'm describing a person, I
have a whole mostly different
set of properties that
are relevant, right?
Date of birth, place of birth,
citizenship, occupation,
father, mother,
religion, notable works--
now, are all of these
relevant for all people?
No, of course not.
It depends.
And different items
about different people
will either have or not
have these fields, right?
So we wouldn't record religion
for absolutely every person.
Some people manage
to do without.
And also, it's not relevant
for a lot of people, like,
what their religion
happens to be.
Date of birth is generally
relevant for most people
that we're documenting.
So some properties kind of crop
up more commonly than others.
A person's height, for
example, is not generally
considered of
encyclopedic value, right?
We don't, for
example, if we have
an article about even a
really well-documented person
like Winston Churchill, does
Wikipedia mention his height?
I don't think it does.
Even though I'm sure
we could probably
find a source somewhere
that lists his height,
it's just not a
very relevant piece
of information about Churchill.
With everything else
that's written about him
and that we know
about him that we
want to include in the
article, a person's height
is not really something of
great value most of the time.
But if we are describing
Michael Jordan, it is relevant.
I'm dating myself.
People still know
Michael Jordan, right?
You know, a basketball
player, that's
when height is very
relevant, right?
That's one of the
first things you
say when you're describing
a basketball player,
is list their height.
So even within the
class of person,
some properties may be
more or less relevant,
depending on the context.
So let's look at some examples.
These are examples
of statements.
Each line is a statement.
So here's the first one.
I want to state, about the
item Earth, our planet.
And what I want
to say about Earth
is that the property
highest point on Earth
has the value Mt.
Everest.
Would you agree with that?
That is the highest
point on Earth.
That's a statement.
It says something
specific, one piece
of information about Earth.
Now of course, there's
a lot of other things
we want to say about Earth--
circumference,
average temperature,
I don't know, all
kinds of things
we can describe the planet
with, density, the galaxy
it belongs to, all that.
But here's one piece
of information,
one very specific field in
the detailed form about Earth.
The highest point is Mt.
Everest.
Now here's a second statement.
This time Mt.
Everest itself is the item
that I'm describing, right?
The topic has changed.
Now I'm saying
something about Mt.
Everest, and what
I'm saying about Mt.
Everest is elevation
above sea level.
Sounds the same but it
isn't, because the highest
point on Earth answers
the question where,
like on the planet, what
is the highest point?
It's Mt.
Everest.
But how high is that highest
point is a different piece
of information.
Do you agree?
It's the actual altitude.
It's not where on
the planet it is.
So it may sound similar,
but these are actually
very different pieces
of information.
So that highest
point, how high is it?
Well, it's 8,848 meters high.
Now the third statement gives
another piece of information
about the first item.
Same item-- I could have
grouped them together.
Another thing I
know about the Earth
is that the deepest
point on the planet
is the Challenger Deep, part
of the so-called Mariana
Trench in the ocean.
So that is the deepest point.
And how deep is it?
I again use the elevation
above sea level.
That's the name of the
property even though it's not
above sea level.
I have a negative value because
the elevation of the Challenger
Deep is minus 11
kilometers, more or less.
All right?
So these are statements.
These are four individual
pieces of data.
And I could also
look at it this way.
Maybe that's closer to the
government form example
that I was giving, right?
So I want to say
something about Earth.
What do I want to say?
Two things-- highest point.
That's the field,
that's the property,
and this is the value.
The highest point is Mt.
Everest.
The deepest point
is Challenger Deep.
And then I have things to
say about Challenger Deep--
the property of elevation
above sea level, the value
is minus 11 kilometers.
Now here's yet another
view of the same data
once more, with numeric IDs.
So this is the same information,
the same four statements.
But this time, in
addition to using words,
I'm also including weird
numbers following either Q or P.
So P stands for property.
So the highest point
property is P610.
And the deepest point
property is P1589.
What do these numbers mean?
They don't mean anything at all.
They're just numbers.
They're just sequential numbers.
And if I create a new
Wikidata item right now,
it'll get just the
next available number.
So they're just numbers.
So P stands for property.
What does Q stand for?
Does anyone know?
It's a trick question
because it's hard to guess.
But the principal
architect of Wikidata,
a Wikipedian named Danny
[INAUDIBLE] and data scientist,
is married to a lovely
lady named [INAUDIBLE]
spelled with a Q. And
this is a loving tribute.
And she's also a Wikipedian and
an admin of Uzbek Wikipedia.
So Q2 is just the numeric
identifier of the item Earth.
And Q513 is the
identifier of Mt.
Everest.
You notice that we use that ID
across the statement, right?
So from Wikidata's
perspective, this
is actually what the
database actually contains.
What we were saying with words--
the Earth, highest
point, whatever--
never mind that.
Q2 has P610 with a value Q513.
That's what Wikidata
cares about, OK?
Now that, you'll agree,
is a little inaccessible.
Just these lists of numbers,
that's a little hard.
So Wikidata
understands and allows
us to continue using our words.
But actually, it gets
translated into numeric IDs.
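A sketch of that translation in Python, using only the three identifiers quoted in the talk (Q2, P610, Q513). The `labels` table and the `render` function are hypothetical names for illustration.

```python
# What is stored: identifiers only.
statement = ("Q2", "P610", "Q513")

# What humans read: a separate table from identifier to words.
labels = {
    "Q2": "Earth",
    "P610": "highest point",
    "Q513": "Mt. Everest",
}

def render(stmt: tuple, label_table: dict) -> str:
    """Turn an ID triple into a readable line by looking up each label."""
    item, prop, value = stmt
    return f"{label_table[item]}: {label_table[prop]} = {label_table[value]}"

print(render(statement, labels))
```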
Now why is this a good idea?
Why can't we just
say Earth or Mt.
Everest?
Any thoughts?
This is an open question.
Why is this a good
idea to use numbers
instead of the names of things?
Yes, because more than one
thing can have the same name.
What do you mean?
There's only one Mt.
Everest.
Well, yeah.
But there's also a
movie called-- and probably
more than one-- called Mt.
Everest, or a TV documentary
literally called Mt.
Everest.
And of course, if I'm
describing a person named
Frank Johnson, that's not the only
Frank Johnson on the planet,
right?
But wait, you say.
On Wikipedia we deal
with that problem, right?
How do we deal with that
problem on Wikipedia?
Does anyone in
the audience know?
The standard way to
deal with the fact
that there is more than one
Frank Johnson in the world,
on Wikipedia, is to use
parentheses after the name.
So there is Frank
Johnson (actor)
and Frank Johnson
(politician), for example,
if that's the distinction
we need to make.
So you put in parentheses
kind of the minimal amount
of information you need to tell
apart these Frank Johnsons.
What if there's two
politician Frank Johnsons?
Well, then you would say Frank
Johnson (Delaware politician)
versus Frank Johnson
(California politician), right?
You just put in that bit of
context to tell them apart.
So that's the solution
that Wikipedians came up
with years and years ago
because they did need
a unique name for the article.
You can't have two
articles literally called
Frank Johnson on Wikipedia.
So that's the
solution on Wikipedia.
But Wikidata was designed
much later, more than a decade
after Wikipedia, and was
able to kind of learn
from the experience
of Wikipedia, which
has tremendous experience
with multilingualism, much
more than most sites and
projects, as we know.
And so the Wikidata
team understood
from the get go that
this will be an issue,
and it's better to use
numbers that are unequivocally
different from each
other instead of labels,
instead of the actual
name, the actual text,
because names are not unique.
Names can change, right?
Just last year, there was a
big naming reform in Ukraine
and a whole bunch of towns
and districts were renamed.
Does that mean we should change
all the data that we have, like
lose all the data that we
have about the old name?
No, we ideally just
want to change the name
without breaking links.
So having the links actually
refer to the numbers
is one way to ensure the
integrity of the data,
of the links, when
renaming happens.
Another reason is well, even
if the name doesn't change,
not all humans call
everything the same, right?
So Earth is Earth
in English, but it's
[SPEAKING ARABIC] in Arabic.
It's [SPEAKING HEBREW]
in Hebrew.
So obviously, Earth--
even that is not
as unambiguous or unequivocal
as you might think.
And so that is the
reason Wikidata,
which is built to be
multilingual from the start,
talks about numbers
rather than labels.
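Both reasons, renaming and multilingualism, can be sketched in a few lines of Python. The table layout, the function names `label_for` and `relabel`, and the fallback rule are assumptions made for the example, and Wikidata's real language fallback is more involved. "Q2" is the Earth ID from the talk, while the German and French labels are added here for illustration.

```python
# Labels live in their own table, keyed by ID and then by language code.
labels = {
    "Q2": {"en": "Earth", "de": "Erde", "fr": "Terre"},
}

# Statements refer to IDs only, never to label text.
statements = [("Q2", "P610", "Q513")]

def label_for(item_id: str, language: str, fallback: str = "en") -> str:
    """Return the label in the requested language, else the fallback, else the bare ID."""
    by_language = labels.get(item_id, {})
    return by_language.get(language) or by_language.get(fallback) or item_id

def relabel(item_id: str, language: str, new_label: str) -> None:
    """Change what a thing is called; the statements are not touched."""
    labels.setdefault(item_id, {})[language] = new_label

print(label_for("Q2", "de"))    # German label
print(label_for("Q2", "pl"))    # no Polish label in this toy table, so English is used
print(label_for("Q513", "en"))  # no labels at all in this toy table, so the bare ID is used
```

Because the statements hold only IDs, calling `relabel` changes what readers see without changing a single link.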
OK.
Ha, I had a whole slide
about that and I forgot.
Yes, so even London,
again, is not
just London, England, which is
what you were thinking about.
It's also a city in Canada.
And it's also a family
name, like Jack London.
It's also a movie company.
There must be some hotel
named London somewhere.
This is a good opportunity
to remind everyone
that the vast
majority of humankind
does not speak a
word of English.
That's a statistic
worth remembering.
The vast majority of the planet
does not speak English at all.
That does not
contradict the datum
that English is the most
widely spoken language.
And yet, in aggregate,
a majority of people
speak other languages,
and not English at all.
So moving swiftly on, this
is a pause for questions
about what I've covered so far.
Any questions in the audience?
If not, we move to IRC.
If there are any questions--
Any questions?
No?
IRC?
Any questions?
OK.
We will have additional
pauses for questions later.
But enough of my hand-waving.
Let's go explore Wikidata.
So Wikidata lives
at wikidata.org.
And Wikidata already has
more than 25 million items.
That is, it collects
statements about more than 25
million topics.
It has many, many more
than 25 million statements
because many of these items
have dozens or hundreds
of statements.
So it documents 25
million things--
people, books, rivers, whatever.
Just to give us a sense
of how big that number is,
how many articles do we
have on English Wikipedia?
More than-- yes, more
than 5 million articles.
And that's the
largest Wikipedia.
So Wikidata is
already describing
more than five times, or
about five times as many items
as even our largest Wikipedia.
So obviously,
Wikidata contains data
about things that have no
article on any Wikipedia.
It is a much, much larger,
more comprehensive project.
All right, the second
thing we might notice
is, well, this looks kind
of like Wikipedia, right?
If we've never visited, it
looks kind of like Wikipedia.
It has this sidebar.
It has these buttons at the top.
It looks like it's
from the '90s.
Yeah.
So the reason it
looks like Wikipedia
is that it is a wiki running
on MediaWiki software.
It is running on software
very much like Wikipedia.
But it is running on
a kind of modification
of the standard wiki software.
It has an additional,
very important component
named Wikibase,
which gives it all
of its structured and
linked data power.
So let's start
exploring Wikidata.
Let's take something local--
Harvey Milk.
Harvey Milk.
What does Wikidata
know about Harvey Milk?
For those on YouTube
who may not be local,
he's a San Francisco politician
and gay rights activist
who was murdered in the '70s.
He was very significant in
the history of those struggles
in this country.
So what does Wikidata
tell us about Harvey Milk?
Well, the first
thing is it knows
that Harvey Milk is Q17141.
That's the most important
piece of information,
is first of all, that
is the identifier.
That is the item
number of all the data
that we will collect
about Harvey Milk.
The second thing you see
right under the title
is this line, this very,
very brief summary, right?
"American politician who became
a martyr in the gay community."
This line is the
description line.
So the name of the item--
this is the label.
We call it label on Wikidata.
That's the label.
And this line is
the description.
Now why is this
description important?
This is the description that
helps us tell this Harvey
Milk from any other Harvey
Milk that may exist, all right?
So again, this would
be useful if I'm
looking up someone with a
slightly more generic name.
That line will help me tell
apart the item about Harvey
Milk the gay activist from
than Harvey Milk the film
actor, OK?
And where is it coming from?
Well, Wikidata has
this whole table,
as you can see, with
descriptions and labels
in other languages.
So Wikidata is able to refer
to Harvey Milk in Arabic which,
don't panic, is written
from right to left.
It also knows what to
call him in Bulgarian.
I mean, it's the same name,
but it's in a different script.
In French, in Hebrew,
and that's it?
Does it not know a name
for Harvey Milk in Italian?
Of course it does.
It actually has
labels for this person
in many, many, many languages.
It doesn't have descriptions in
every language, as you can see.
OK?
So why was Wikidata showing me
these languages and not others?
I mean, why this somewhat
arbitrary collection--
English, Arabic, Bulgarian,
German, French, and Hebrew?
Because I told it to.
So if we briefly click
over to my user page--
again, like every wiki,
you have user accounts.
You have user pages.
This is my user page.
And as you can see,
there's this little user
information box here called
a Babel box by Wikipedians,
where I list the
languages that I speak.
And Wikidata uses this box
just to kind of helpfully
show me these languages.
Of course, all the
other languages
are still available, as you saw,
by clicking the more languages.
But this is just a
useful little way
of getting the languages I
care about up there first.
By the way, this is a lie.
I don't actually
speak Bulgarian.
That stayed on my user page
because I was demonstrating
this in Bulgaria and I wanted
that label to show up there
during the talk--
just in case you
were going to tell me
a really good Bulgarian joke.
OK so for example, Hebrew
is my mother tongue.
And we have a Hebrew
label for Harvey Milk.
But we don't have a description.
So let's fix that right now by
clicking the edit button right
here.
I click edit, and this
table became editable.
And now I can very briefly
type a description.
AUDIENCE: Online in
about 20 seconds.
But can we hold it?
ASAF BARTOV: OK.
That was good timing
for the screen to crash.
OK?
Are we back?
OK.
Sorry about that.
So this was all about what to
call him in different languages
and scripts and how to
tell this person apart
from other people with
potentially the same name.
Let's scroll down and see
what else does Wikidata
know about this person?
So as you can see, this is
a list of statements, right?
This is a list of statements.
And the properties
are on the left,
the values are on the right.
So the first thing Wikidata
knows about Harvey Milk
is a very important
property called instance of.
Instance of.
And the property instance of
answers the very basic question
what kind of thing is
this that I'm describing?
Is it a book?
Is it a poem?
Is it a mountain?
Is it a theological concept?
No, it's a human.
It's a person, OK?
The item about Mt.
Everest will say
instance of mountain, OK?
This is a very
important property.
Why is it important?
Wouldn't anyone looking
at this know that this is
a human being?
Yes.
Anyone looking at
this will know.
But if I want a computer to
be able to pull information
about people, I want to
be able to easily exclude
all the mountains and
poems and other things that
are not people from my query.
So this single datum,
this single piece of data,
is what tells computers and
algorithms very clearly,
this is a human.
Things that aren't instance
of human are other things.
OK?
So it may sound very
trivial, but it's not.
It's very important
to have an instance
of field for Wikidata items.
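As a rough sketch of that idea in Python, here is how a program might use the instance of value to keep only people. The records below are hand-written stand-ins, not real Wikidata output. Q5 is the Wikidata item for human; the IDs for Mount Everest and for mountain are assumptions and worth checking before reuse.

```python
# Toy records: a hand-written stand-in for Wikidata items, not real API output.
ITEMS = [
    {"id": "Q17141", "label": "Harvey Milk", "instance_of": "Q5"},     # human
    {"id": "Q513", "label": "Mount Everest", "instance_of": "Q8502"},  # mountain (assumed IDs)
]

HUMAN = "Q5"  # the Wikidata item for "human"


def only_humans(items):
    """Return the labels of items whose instance-of value is human."""
    return [item["label"] for item in items if item["instance_of"] == HUMAN]


print(only_humans(ITEMS))
```

Without the instance of field, the program would have no reliable way to tell the person from the mountain.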
All right, what else do we know?
Well, Wikidata knows about
an image for Harvey Milk.
Again, we can find a ton of
images-- or maybe not a ton,
but we can find dozens
of images of Harvey Milk
on Commons, on our Wikimedia
multimedia repository.
So why should we have a
single image here on Wikidata?
Again, this is
mostly for reusers.
If I'm building some kind of
tool that pulls information
from Wikidata, it's
nice if there's
at least one representative
image to kind of use
as the default or immediate
image for Harvey Milk
in some other reused context.
All right, sex or gender--
male.
Country of citizenship--
United States of America.
Given name is Harvey.
The date of birth is so and so.
The place of birth is Woodmere.
The place of death
is San Francisco.
The manner of death is homicide.
Wikidata knows that.
Now again, every
little datum like that
is the basis for later querying
and answering questions.
So the fact that we record the
manner of death of people--
or at least of some people--
will allow us later
to go, you know,
who are some people from
Belgium who died by homicide?
That's a question Wikidata can
answer, thanks to this field.
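A question like that could be sketched as a SPARQL query sent to the public Wikidata Query Service. The Python below only builds the request URL; it does not contact the server. The property and item IDs (P27, Q31, P1196, Q149086) are quoted from memory and should be treated as assumptions to verify on Wikidata.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

# Assumed IDs: P27 = country of citizenship, Q31 = Belgium,
# P1196 = manner of death, Q149086 = homicide.
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P27 wd:Q31 ;
          wdt:P1196 wd:Q149086 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""


def build_query_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(build_query_url(QUERY))
```

Each line inside the braces is one condition, and each condition only matches items that carry that explicit statement.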
The other thing I mentioned
is that things are links.
So the place of
birth is Woodmere.
I don't know where
Woodmere is, but I
can click that and find out.
Here is the Wikidata item
about Woodmere, right?
It was the value in the
statement about Harvey Milk,
but now I'm looking at
the item about Woodmere.
And it turns out it's in
Nassau County, New York, right?
And of course, Wikidata has
a whole bunch of information
for me about Woodmere--
what country it's in and the
coordinates and the population
and the area, all the things you
would expect about a place, OK?
Let's get back to Harvey Milk.
So the manner of death,
the cause of death--
now here, Wikidata gives
us excellent information.
The actual cause of death
is ballistic trauma.
That's a professional term.
And this statement
has qualifiers.
So until now, I was talking
about triples, right?
The item has a property
with a certain value.
Actually, each
statement can also
have a number of
qualifiers which
add aspects of information,
still about that one question
that we're answering, right?
So if this property
answers cause of death,
it's not discussing
anything else.
It's not discussing languages.
It's not discussing
date of birth, right?
It's talking about
the cause of death.
But we're not just
saying ballistic trauma.
We're saying ballistic trauma
with the quantity attribute
being five.
What does that mean?
Five bullets, right?
There are five
ballistic traumas.
He was shot five times.
And he was shot by this
person named Dan White.
And this ballistic trauma,
like this actual shooting,
is itself the subject
of this other thing.
This is a link to a
whole other Wikidata
item about the Moscone-Milk
assassinations.
Moscone was the San
Francisco mayor at the time.
We'll see slightly better or
easier to understand examples
of qualifiers in a bit.
So if this was
confusing, hang on.
So he was killed by Dan White.
He spoke English.
His occupation--
here's an example
of a property with more
than one value, right?
So Milk was a politician.
But he was also a Navy
officer, at least for a while.
That was another thing that
he did during his life.
And he was a human
rights activist, right?
So some people are
writers and translators.
So people can have more
than one occupation.
People can speak more
than one language.
Here's a better
example of a qualifier.
So the property award received
has the value Presidential
Medal of Freedom.
And that award has an
attribute called point in time,
like when was this?
This was in 2009.
Do you see that
this piece of data--
2009-- is a sub-statement
or is subjugated
to the context of this award,
the Presidential Medal
of Freedom?
It can't just kind of
free float in the article.
It's not that 2009 is itself
a meaningful thing, right?
This medal was awarded in 2009.
Wikidata doesn't
tell us, for example,
when he was a Navy officer, OK?
But if we were, for example,
to look that up right now
and find out that Milk was
a Navy officer between 1962
and 1964, we could go back
here to the Navy officer bit
and click edit.
This is how I edit this
particular little piece
of information.
And add a qualifier like this.
I click Add Qualifier.
And I could pick start
time and end time, right?
And then I could
type 1962 to 1964,
and that would be
teaching Wikidata.
Oh, I'm sorry, I meant to
do that for Navy officer.
OK.
But, you know,
that is the exact--
the accurate time span
of that statement.
So it's true to say about a
person, he was a Navy officer,
even if of course he wasn't a
Navy officer his entire life.
But it's better and
it's more accurate,
to say he was a Navy officer
between 1962 and 1964.
Don't worry, I'm
not saving this.
No vandalizing of
Wikidata in this session.
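One way to picture a statement with qualifiers is as a small record: a property, a value, and a set of extra details that only make sense in the context of that value. This is a minimal Python sketch; the `Statement` class and `describe` function are made up for illustration and are not how Wikidata stores data internally. The dates are the ones used in the demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One claim about an item: a property, a value, and optional qualifiers."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


award = Statement("award received", "Presidential Medal of Freedom",
                  {"point in time": 2009})
navy = Statement("occupation", "Navy officer")

# Adding qualifiers narrows the statement without changing what it is about.
navy.qualifiers["start time"] = 1962
navy.qualifiers["end time"] = 1964


def describe(st):
    """Render a statement and its qualifiers as one readable line."""
    extras = ", ".join(f"{k}: {v}" for k, v in st.qualifiers.items())
    return f"{st.prop} = {st.value}" + (f" ({extras})" if extras else "")


print(describe(award))
print(describe(navy))
```

The year 2009 never floats free here; it lives inside the award statement, which is the point of a qualifier.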
OK.
Moving on.
What else does Wikidata know?
He was educated at
this university.
He was a member of
this political party.
Right?
That's of course
a very relevant property
for a politician.
Religion, military branch,
what is the category on Commons
that discusses this
item, is something
that Wikidata can tell us.
And that's it.
Now, is that everything
that we could possibly
say in a structured
way about Harvey Milk?
No.
We could probably find at
least a few more things to say.
We will see how to contribute
new information to Wikidata
in just a minute with
a different example.
But this-- all this was
a set of statements.
Right?
This was titled
Statements here.
But at the bottom of the
list of statements is
another section
called identifiers.
And I want to spend a minute
talking about what that is.
So identifiers is a
collection of keys.
A collection of
IDs, or codes, that
are keys to other
information sources.
And a lot of Wikidata items
have a whole series of keys
to other databases, other
sites, other repositories,
that help you or a computer
be able to access not just
some database and look for
information about Harvey Milk,
but access the exact record
relevant to Harvey Milk.
And again, if you imagine
someone named John Smith,
that is really valuable, right?
If you're not just
told, oh yeah,
you can look at the
Library of Congress
for John Smith,
good luck with that.
Or if I tell you, go to
the Library of Congress
to this record for this John
Smith, you see the difference.
So Wikidata tells us that on
VIAF, which is the Virtual
International Authority File.
It's an aggregated master
index built by bibliographers,
by librarians, of people.
Right?
It tries to kind of aggregate
information about people
across library
catalogs everywhere.
So the VIAF ID for Harvey
Milk is this number.
And conveniently,
if I click that,
I'm not taken to
some Wikidata item.
I'm actually taken
to the relevant site.
So this took me right
to viaf.org, the Virtual
International Authority File,
directly to their record
about Harvey Milk.
All right?
And that itself leads
me to national catalogs
of national libraries
all over the world.
We won't get into the
things you can do with VIAF.
The point is Wikidata
contained the piece of thread
that I could tug on
to arrive directly
to that information
in other databases.
Yes.
And it has that for many,
many kinds of databases.
The BNF, for example, that's
the National Library of France.
And that will take me
to that index card.
IMDB.
We all know IMDB, right?
So here I have the key
to Harvey Milk in IMDB.
And this is what IMDB says
about Harvey Milk, right?
They have their own piece
of information about him,
of course, with filmography
and everything else.
And see, I did not have
to search IMDB for it.
I just had the key right
there waiting for me.
Now, again, this is
very convenient for me
as I just showed you the
human use case for this.
But it's even more
powerful in aggregate
when we allow computers to
traverse this network of links
between--
not just within Wikidata, but
between data storage facilities
and repositories.
This is sometimes referred to
as the linked open data cloud.
Cloud, because it's multiple
different repositories
that are interlinked.
And Wikidata is already, and
to a growing extent, the nexus,
the connection
point between a lot
of these different databases.
So IMDB, for example,
it's a good example
because it's a site
almost everyone knows,
IMDB has information
about Harvey Milk.
But that information
does not include a link
to the French National Library.
Right?
Do you see what I'm saying?
So IMDB is a data repository
with IDs and allows linking.
But it does not give you
what Wikidata gives you which
is this kind of collection of--
it's like a junction of all
these different data sources.
So Wikidata is the
place where you
can document these
interrelationships
or equivalencies.
Right?
So ID, you know, 587548 on IMDB
is discussing the same topic
as French National
Library ID whatever.
Wikidata contains that
piece of information:
that this ID in this database
is about the same person
as that ID in that database.
OK.
So that's what
identifiers are about.
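A small Python sketch of why those keys are useful: given an identifier and its value, a program can go straight to the exact external record with no searching. The URL patterns are assumptions based on each site's public address scheme, and the ID values are placeholders, not Harvey Milk's real identifiers.

```python
# URL patterns are assumptions based on each site's public address scheme.
URL_PATTERNS = {
    "VIAF ID": "https://viaf.org/viaf/{id}",
    "IMDb ID": "https://www.imdb.com/name/{id}/",
}


def external_link(identifier_name, value):
    """Turn an (identifier, value) pair into a direct link to the external record."""
    return URL_PATTERNS[identifier_name].format(id=value)


# Placeholder values for illustration only.
identifiers = {"VIAF ID": "12345", "IMDb ID": "nm0000000"}

for name, value in identifiers.items():
    print(name, "->", external_link(name, value))
```

Because both keys hang off the same item, the item itself records that the two external records describe the same person.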
Still scrolling down the
Wikidata item about Harvey
Milk, we have the site links.
The site links are links
to Wikimedia projects
that are related to this item.
So of course there
are Wikipedia articles
about Harvey Milk in many,
many different Wikipedias.
Quite a few language versions.
And there are
pages on Wikiquote,
one of the sister projects.
There are pages on
Wikiquote with some quotes
from Harvey Milk.
And there is even a page for
Harvey Milk on Wikisource.
Right?
So this is a collection
of those links.
And those of you who have maybe
only dealt with Wikidata
for inter-wiki links, which
we used to do in the old days
manually within
the article text,
now we do it through
Wikidata, so maybe that's
the only thing you did
know about Wikidata
is how to update these
inter-wiki tables on Wikidata.
All right.
So that concludes
our little tour
of the anatomy of
a Wikidata page.
I will just remind you that
it's a wiki page, which
means it has a discussion
page, a talk page.
This one happens to be empty.
But, you know, if we have
concerns or arguments
about some of the
data here that is
what we would use
to discuss this
and to arrive at consensus.
It also has a history view just
like every Wikipedia article.
So you can see here
a list of edits.
Maybe some of you
have never looked
at a history page on Wikipedia,
so this looks overwhelming.
But every line here,
every entry here,
is a single edit, a single
revision, a single change
to this Wikidata item.
Just Harvey Milk.
And you can see at the very
top this edit that I just
made-- this is my
volunteer account
and I just made this edit,
and in parentheses you
can see what I did.
I added an HE,
Hebrew, description.
And this is the text
that I added in Hebrew.
Right?
So we can see who added
what to the Wikidata item,
just like we can do
the same on Wikipedia.
So we have the revision history.
We can undo edits.
We can revert, just
like on Wikipedia.
And what else did I
want to show here?
We can add an item to my
watch list using the star,
just like on Wikipedia.
So we have all these
standard wiki features
that we would come to expect.
Let's pause for questions.
Any questions about what
we've covered so far?
Yes.
Are attributes of statements
preset for the specific value?
No, they're not preset.
And generally Wikidata does
not enforce logic by default.
So, I mean, there's
nothing to prevent you
from editing the
item about Brazil,
and adding the property height.
Now height is not a relevant
property for a country.
Right?
I mean, maybe average
elevation, maybe.
But not just height,
which is used for humans
or for physical things.
So you could add that
property to Brazil and save it
and the wiki would not complain.
Now in the background
there are kind
of extra-wiki, outside-the-wiki
processes for constraint
validation.
So there are bots and
other processes that
run, and occasionally,
for example,
identify non-living things
with a date of birth field.
That's nonsensical.
That should not exist.
If someone mistakenly added
that, there are processes
that would flag
that to be fixed.
But the wiki itself,
Wikidata, will not
prevent you from adding that.
And that is by design
to keep things flexible.
So that people don't
run into, oh wait,
but I can't add this
because nobody thought
that I would need this, maybe.
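To illustrate what such a background check might look like, here is a minimal Python sketch. It is not the actual constraint system used by Wikidata; the records and the rule are invented for the example, and the date of birth on Brazil is a deliberate mistake for the check to catch.

```python
# Toy items: hand-written dictionaries, not real Wikidata output.
ITEMS = [
    {"label": "Harvey Milk", "instance of": "human", "date of birth": "1930"},
    {"label": "Brazil", "instance of": "country", "date of birth": "1500"},  # deliberate error
    {"label": "Mount Everest", "instance of": "mountain"},
]

# Assumed rule for this sketch: only these kinds of things can have a birth date.
CAN_HAVE_BIRTH_DATE = {"human"}


def flag_violations(items):
    """Return labels of items that carry a date of birth but should not."""
    return [
        item["label"]
        for item in items
        if "date of birth" in item
        and item["instance of"] not in CAN_HAVE_BIRTH_DATE
    ]


print(flag_violations(ITEMS))
```

The save was allowed; the check runs afterwards and only reports what looks wrong, leaving the fix to an editor.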
I hope that answers
your question.
You say helpful
answer, question mark.
So was it a helpful answer, or?
OK.
Yes, Eleanor.
AUDIENCE: [INAUDIBLE]
ASAF BARTOV: Excellent question.
I'll repeat it.
You ask how do I find
the Wikidata item
number from Wikipedia.
If I'm reading about Harvey Milk
and I want to look at the data
how do I do that?
That is an excellent question
and let's skip to Wikipedia.
Conveniently I have the
link right here on English.
So this is the Wikipedia
article about Harvey Milk
and every article on Wikipedia
should have a Wikidata
item associated with it, but it
doesn't happen automatically.
So if I just created
a page on Wikipedia
I also need to create a
Wikidata entity for it
if it doesn't already exist.
It could already exist
because it was already
covered in a different
language, for example.
So that was parenthetical.
But every article on Wikipedia
should have, here on the side,
on the sidebar under Tools,
a link called Wikidata item.
Right here.
OK.
That Wikidata
item is a link
that takes you to
Wikidata, to the entity,
and there you find the number.
You can-- you don't
even have to click it.
I mean, the URL itself
tells you the number.
The number, you see, it's
wikidata.org/wiki/Q17141.
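For a program, the same lookup can be sketched in Python in two ways: read the number out of the item URL, or ask the Wikipedia API which item a page is linked to. The helper names are made up. The second function only builds the request URL and assumes the standard MediaWiki `pageprops` query with the `wikibase_item` property.

```python
import re
from urllib.parse import urlencode


def qid_from_url(url):
    """Pull the Q number out of a Wikidata item URL, or return None."""
    match = re.search(r"/wiki/(Q\d+)$", url, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def pageprops_url(title, wiki="en.wikipedia.org"):
    """Build a MediaWiki API URL asking which Wikidata item a page is linked to."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    return f"https://{wiki}/w/api.php?" + urlencode(params)


print(qid_from_url("https://www.wikidata.org/wiki/q17141"))
print(pageprops_url("Harvey Milk"))
```

The first function accepts the Q in either case and always returns it in upper case.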
OK.
So that was an
excellent question.
Other questions?
Yes.
Yeah, about the additional
attributes, the qualifiers.
So, yes, I answered
more generically.
But just like the
properties themselves
are not limited per item,
the qualifiers per statement
are also not
entirely preordained.
But there is some
structure to it.
I don't want to go into it
at great length right now.
If we have time in the end
we can get back to that.
But some qualifiers are again
relevant for some things,
start time, end time,
and others won't be.
Wikidata does try to offer you--
you may remember when I
clicked add qualifier,
it gave me kind of drop down
of some relevant qualifiers.
So it does try to
help you in that way.
Other question?
Are the values for
instance of already
mappable to external ontologies?
That is a complicated question.
I'll help people understand
the question first.
So an ontology is a
structure, some kind
of hierarchy or
cloud, of entities
and their interrelationships.
An ontology would
say, for example,
a person is a living thing.
So is a dog.
They're both living things,
but they're different things.
And then, you know, say
things about those entities
and their interrelationships.
Now there are many,
many competing,
or coexisting models
of ontologies.
Many of them were created
for specific needs.
Many of them want to be
a universal ontology.
But of course it's
impossible to quite
agree on one complete
and simple ontology.
And so there are
many ontologies.
Which brings up your question,
can we map across ontologies?
Can we say that when Wikidata
says instance of book that
is equivalent to some other
ontology saying instance
of bibliographic record?
And the answer is yes.
There are some such mappings.
They are incomplete.
And there's no kind of
automagic thing happening
in the wiki vis-a-vis
those other ontologies.
That's kind of
left as an exercise
for those dealing with those
other ontologies, and for tool
builders and other
platform improvements
beyond Wikidata itself.
OK.
Other questions?
Yeah, we have one from
the YouTube stream.
Someone asked, why can't I
link Howard Carter's occupation
to archeologist when I use
an infobox that fetches info
from Wikidata?
Why can't I link it
from the infobox?
So, someone on the
stream answered
saying, because it's
an improper connection,
because the target is not
about the subject only.
The target is not
about the subject?
If I understand the
question correctly,
what you would want to be able
to do is from within Wikipedia
be able to say occupation
and link to a Wikidata entry
about archeology.
That doesn't quite
work that way.
We will get to a
little discussion
of that in an upcoming
section of this talk.
So I will defer the rest
of my answer to then.
OK.
So we're done with
questions for this phase,
and my browser got
tired of waiting for me.
So, yes.
All right.
So we took a look at Wikidata,
and we took questions.
So now, let's teach
Wikidata some new things.
Some things it
doesn't already know.
Let's look at this item here.
So this item is about one
of my favorite writers,
an American writer
named Helen Dewitt.
Wikidata, of course, fondly
refers to her as Q54674,
but we can call
her Helen Dewitt.
And what can we contribute here?
So Wikidata has far less
information about Helen Dewitt.
Most of you probably haven't
heard of her, that's OK.
What does Wikidata
know about her?
Well instance of human.
We have a photo of her.
She's female.
She's an American.
Her name is Helen.
Date of birth.
Place of birth.
She's an author, a
novelist, a writer.
She was educated at the
University of Oxford.
And Wikidata knows what
her official website is.
That's useful, but that's it.
Now we can contribute
information here.
For example, she's an American
author writing in English.
So we could add
that information.
We could click the
Add button here.
And this is a good
moment to acknowledge
that the user interface of
Wikidata is a work in progress.
It's not as intuitive
as it might be.
So you need to
understand that click--
to add a completely
new property,
you need to click
this Add button.
If you want to add an additional
value to the property official
website, you need to
click this Add button.
It makes a kind of
sense with a shaded box.
But, you know, you need
to kind of pay attention,
and it's not as
friendly as it might be.
[COUGHING] Excuse me.
So, let's add a property here.
Click the Add button.
Again, Wikidata tries to
be useful by suggesting
some relevant
properties for humans.
A bit more morbidly it suggests,
how about date of death?
That's not cool, Wikidata.
Helen Dewitt is still alive.
So I will not add
date of death, but I
can add languages spoken,
written, or signed.
OK, so I click that.
And she writes in English.
I just type English-- whoops.
Not in Hebrew.
Don't panic.
I type English here.
And, oh, and of course Wikidata
has auto-complete, right?
So it tries to help me along.
But you will notice that
it has all kinds of things
called English.
I mean, it turns out that
there is a place in Indiana
called English, Indiana.
Did I mean that?
No, of course I didn't mean
that she writes her books
in English, Indiana.
Right?
But, you know, Wikidata gives me
the option of linking to that.
I also don't mean the botanist
Carl Schwartz English.
No, no I mean the
West Germanic language
originating in England.
That's what I mean.
So I click that.
And I click Save.
And that's it.
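The same disambiguation step can be sketched in Python. The first function builds a request to the Wikibase `wbsearchentities` API action, which performs this kind of label search; it only builds the URL and sends nothing. The candidate list is hand-written to mimic the suggestions shown on screen, and its `id` values are placeholders, not real Q numbers.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def search_url(text, language="en"):
    """Build a wbsearchentities URL for looking up items by label."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": "item",
        "format": "json",
    }
    return API + "?" + urlencode(params)


# Hand-written candidates; the "id" values are placeholders.
CANDIDATES = [
    {"id": "Q-town", "label": "English", "description": "town in Indiana"},
    {"id": "Q-botanist", "label": "Carl Schwartz English", "description": "botanist"},
    {"id": "Q-language", "label": "English",
     "description": "West Germanic language originating in England"},
]


def pick(candidates, keyword):
    """Return the first candidate whose description mentions the keyword."""
    for candidate in candidates:
        if keyword in candidate["description"]:
            return candidate
    return None


print(search_url("English"))
print(pick(CANDIDATES, "language")["id"])
```

The label alone is ambiguous; it is the description that lets a person or a program choose the right item.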
Again I have just made
an edit to Wikidata.
I have just taught Wikidata
that this author speaks English.
Now, again, this
may be very obvious.
She's American.
Of course not all
Americans write in English.
It may be obvious if
you look at her books.
The important thing
is that now Wikidata
knows this as a piece of data.
And, again, think ahead
to queries, which we will
demonstrate in a little bit.
Without this piece
of information
that I just added, if I were to
ask Wikidata five minutes ago,
give me a list of novelists
writing in English, OK,
Wikidata would have returned
thousands of results.
But Helen Dewitt would
not have been among them.
Because up until two
minutes ago Wikidata
didn't know that Helen Dewitt
writes in English and not
in Spanish.
Do you see?
It is this explicit
statement that will now
make her be included in any
future queries that ask,
who are novelists
writing in English?
OK.
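A toy Python version of that before-and-after makes the point. The store below is invented for illustration, and "Another Novelist" is a made-up record standing in for the thousands of real results.

```python
# Toy store: item label -> {property: set of values}. Not real Wikidata output.
LANG = "languages spoken, written, or signed"

store = {
    "Helen Dewitt": {"occupation": {"novelist"}},
    "Another Novelist": {"occupation": {"novelist"}, LANG: {"English"}},
}


def novelists_writing_in(language):
    """Return labels of novelists with an explicit statement for the language."""
    return sorted(
        label
        for label, props in store.items()
        if "novelist" in props.get("occupation", set())
        and language in props.get(LANG, set())
    )


before = novelists_writing_in("English")

# The edit from the demonstration: add one explicit statement.
store["Helen Dewitt"].setdefault(LANG, set()).add("English")

after = novelists_writing_in("English")
print(before)
print(after)
```

Nothing about the author changed between the two queries; only what the database has been told changed.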
By the way, she's
a PhD in Classics.
She speaks-- or at least reads
and writes Latin and Greek,
ancient Greek, and I could--
I can-- I mean, I
happen to know that.
But wait, wait, wait,
wait, wait, you say.
What about original research?
I mean, you can't just add
stuff like that to Wikidata.
Don't you need sources?
Citations?
Of course I do.
Yes.
Let's add some sources to this.
So on Wikidata,
just like Wikipedia,
things should generally
be supported by citations,
by references.
And just like Wikipedia,
they aren't always supported
in that way.
OK so, I mean, I can
just add it to Wikidata.
Watch me.
I just did that, right?
I just added English and
Latin without any citation,
and I will not be
arrested for it.
Just like I could edit
a Wikipedia article
and add some information
without a citation.
It may stick.
It may stay in the article,
or it may be reverted.
It depends on the kind of
information I'm adding.
It depends how many people
are paying attention
to the article on Wikipedia.
And it works the
same way on Wikidata.
OK, so, you can add some
things without references.
Ideally, when you
add information, you
should include references.
So let's be good Wikidata
citizens and add a source.
Here is an article that
I prepared in advance.
This is Helen Dewitt.
And in this article,
somewhere, it actually
says right at the
bottom here, see,
Dewitt knows, in descending
order of proficiency, Latin,
ancient Greek, French,
German, Spanish,
and Portuguese, Dutch, Danish,
Norwegian, Swedish, Arabic,
Hebrew and Japanese.
This may sound
excessive, but it's true.
I met this woman.
So anyway, we don't have
to include all of that.
The point is this article from
a reasonably reliable source,
this magazine,
this interview, can
count as a source for
the languages she speaks.
So I copy the URL.
I just copy it off my browser.
And, whoops-- that's not--
here we go.
And I can just add
a reference here
to the information that I
just added to Wikidata, right?
I can click Add Reference.
And then just say the reference
URL is, and I just paste.
I paste this URL.
Hit Enter.
And that's it.
And now the fact that she
speaks Latin has a reference.
If you look at the other
things here on Wikidata,
you can see that these IDs, for
example, have references, too.
Right?
In this case, the reference
just says, excuse me--
In this case it just says
imported from English Wikipedia.
But wait, you say, can
Wikipedia be a source?
Not properly, no.
I mean, just like Wikipedia
itself doesn't cite itself.
We don't say, this person
was born in this city.
How do we know?
We read it on Wikipedia
in another language.
That's not a good citation.
It's not a good
citation for Wikidata
either. So why do we put it here?
Well you can see the qualifier
here is different, right?
It's not reference URL, which
is what I put in for Latin here.
It's not reference URL here,
it's a different qualifier.
It says-- saying, imported from.
So this is not an
actual reference that
supports this piece of data.
It just shows where did
this data come from.
It's a slightly different
thing, because this data was
mass imported into Wikidata.
So it wasn't input by
hand by some volunteer.
It was imported into Wikidata
en masse by a script,
by a program.
And we want to know, where
did this number come from?
Well it came from
English Wikipedia.
So again, that's not
a proper reference
for the validity
of the information,
but it does at least tell us
it came from English Wikipedia.
We can click and look on
English Wikipedia and find out.
Maybe there's a
footnote there that
says where it did come from.
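The distinction can be sketched in Python as a simple check on what kind of reference a statement carries. The property IDs (P854 for reference URL, P143 for imported from Wikimedia project) are quoted from memory and should be verified; the URL is a placeholder, not the interview shown on screen.

```python
# Assumed property IDs: P854 = reference URL, P143 = imported from Wikimedia project.
REFERENCE_URL = "P854"
IMPORTED_FROM = "P143"


def is_sourced(references):
    """True if at least one reference points at an outside source via a URL."""
    return any(REFERENCE_URL in ref for ref in references)


# Hand-written examples for illustration.
latin_refs = [{REFERENCE_URL: "https://example.org/interview"}]
id_refs = [{IMPORTED_FROM: "English Wikipedia"}]

print(is_sourced(latin_refs))
print(is_sourced(id_refs))
```

A tool that wants to find statements still needing a real citation could treat imported-from entries as provenance only, as this check does.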
OK.
So this was an example of
teaching Wikidata something
that it didn't know.
Something about the languages.
And of course I could add
this reference for English.
I could add all the other
languages that she speaks.
And I won't bore you with
that, but that is basically
how it's done.
So you click this Add to
add a completely new--
completely new statement.
Now, by the way, the fact
that these are the only two
suggestions that
Wikidata can think of,
doesn't mean these
are the only options.
OK, you can just type
anything that may be relevant.
We could add, for
example, award.
Just start typing award.
And here I have
a bunch of properties
that are relevant for awards.
Awards received, together
with, conferred by, right?
There's all kinds of properties
that I could rely on.
And of course there is a list of
all the properties of Wikidata.
And that list is
also sorted by type.
So yes, there is a list of
properties relevant to people
so that you don't have to guess.
But a surprising
amount of the time
you can just start typing
and get the right properties
suggested to you.
OK.
So we taught Wikidata
something new,
and now let's teach Wikidata
something completely new.
Right?
So how do we create
a new Wikidata item?
So, like I said, if I
created a Wikipedia article
about something that was
not previously covered
on any other
Wikipedia, chances are
there would not be an already
existing Wikidata item.
Sometimes there might
be, because Wikidata
does have 25 million entities.
But sometimes there wouldn't be.
So, first of all, I could
search for it, right?
So I could go to Wikidata
to the search box
here and just start typing, and
search for what I want, right?
So if I'm searching for Helen
Dewitt I just say Helen,
and I can see whether
or not it exists.
And there's a detailed search
results page, et cetera,
where I can find out
if the item does exist or not.
Excuse me, this reminds me
of a very important thing
I wanted to
demonstrate, and that
is the multilingualism
of Wikidata.
So remember all these
labels in other languages.
Wikidata knows what to call
Helen Dewitt in Hebrew.
And it will show it to Wikidata
users whose language is Hebrew.
Mine is set to
English, for your sake.
But if I change this-- I go to
Preferences here and change
my language.
[INAUDIBLE] All
right, and I hit Save.
Wikidata will start
talking to me in Hebrew.
Now brace yourselves.
Are you ready?
Don't panic, it's right to left.
Oh my god everything
is topsy-turvy.
So this is the same
article in Hebrew.
So the sidebar has
switched direction,
and I know most of
you cannot read it.
Bear with me.
This is the label
that we previously
saw in the label box.
This is how you spell
Helen Dewitt in Hebrew.
And here is the
description in Hebrew.
It's not the description in
English, this description,
American writer, which
I was shown previously.
Now I'm shown the Hebrew
description, appropriately.
But more interestingly,
oh my god!
All these statements
are suddenly in Hebrew.
How did that happen?
Well this tiny word here
is the very concise way
to say in Hebrew, instance of,
and this word here means human.
So these are links to
the same things, right?
It still links to Q5.
Q5 is the Wikidata
entity for human.
These are still the same things.
But because Wikidata has
multiple labels for everything,
it has multiple
labels for items.
And it also has multiple
labels for property names.
So Wikidata knows how
to say, instance of,
and award received,
in other languages.
That is why it is able to show
me all this data in Hebrew
even if none of that data was
actually input into Wikidata
by a Hebrew speaker.
That data could have been
input by English speakers,
but thanks to the
fact that someone once
translated the word
photo into Hebrew,
I can see this field in Hebrew.
So one of the things you
can do to help Wikidata,
right now, without
any special knowledge
is to help translate
those labels.
Every label only needs to
be translated just once.
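Here is a minimal Python sketch of why one translation per label is enough. Statements are stored as IDs, and labels are looked up only at display time. The label table is a small sample for illustration: the German strings are my best recollection and should be checked, and the fall-back-to-English rule is an assumption of this sketch rather than a description of Wikidata's exact behavior.

```python
# Sample labels for illustration only.
LABELS = {
    "P31": {"en": "instance of", "de": "ist ein(e)"},
    "Q5": {"en": "human", "de": "Mensch"},
    "P18": {"en": "image"},  # no German label in this toy table
}


def label(entity_id, language, fallback="en"):
    """Return the label in the requested language, else the fallback, else the ID."""
    names = LABELS.get(entity_id, {})
    return names.get(language) or names.get(fallback) or entity_id


def render(prop_id, value_id, language):
    """Show one stored statement (IDs only) in the reader's language."""
    return f"{label(prop_id, language)}: {label(value_id, language)}"


print(render("P31", "Q5", "en"))
print(render("P31", "Q5", "de"))
print(label("P18", "de"))
```

The stored statement is the same pair of IDs in every case; once a label is translated, every item using that property or value benefits.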
So you can see that all
of these properties, date
of birth, name et cetera,
they all have Hebrew labels.
Maybe one of these would not.
No, they all have Hebrew labels.
Doing pretty good.
And I'm able to search
in my own language.
I'm able to click Add.
This word is Add,
so I click this,
and now I have the Add screen.
It all speaks my language,
and it's awesome.
And now for your sake I
will switch back to English,
but it is important
to know you can
edit Wikidata in any language.
And it is far more multi-lingual
and multi-lingual friendly
than, for example, Commons, which
is also a project we all share.
But Commons has some limitations
on how multi-lingual it is.
For example, the category
names, et cetera.
OK.
So we were beginning
to discuss creating
something completely new.
AUDIENCE: Quick
questions, if that's OK?
So there's two questions on IRC.
The first one is, can you
show search for something
like getting the list of things?
I want to learn how to search
for something properly like,
show me all the items with
this value of this property.
ASAF BARTOV: Yes.
That is part of
this talk, but I'll
get to that in a
little bit later.
There's a whole section where I
will demonstrate the very, very
powerful query
system of Wikidata
where I will cash
that check that I gave
at the beginning of
all these painters
who are sons of painters
queries, et cetera.
So I will demonstrate
how to do that.
AUDIENCE: Other question.
How does Wikidata deal
with link rot, and other issues
stemming from bare URL refs?
ASAF BARTOV: URLs break.
We call that link rot.
Wikidata doesn't have
any particular magic
around link rot,
just like Wikipedia.
So if you do use a bare
URL it may well rot.
But you can add qualifiers
with backup URLs elsewhere,
on the Internet Archive or
another mirroring service.
And potentially that could be
a software feature for Wikidata
to automatically save
or ensure that something
is saved on Internet
Archive, but I don't
know that it is doing so now.
So, just like Wikipedia, if
it is a bare URL it may rot.
And may need to be
replaced, possibly by bot.
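As a rough Python sketch of the backup idea, a reference could carry an archived copy next to the bare URL. The address pattern is the Internet Archive's public Wayback Machine scheme as I understand it; the timestamp, the example URL, and the field names are placeholders for illustration.

```python
def wayback_url(original_url, timestamp="2017"):
    """Build an Internet Archive address for a snapshot near the timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


# Hypothetical reference: a bare URL plus a backup address stored next to it.
reference = {"reference URL": "https://example.org/interview"}
reference["archive URL"] = wayback_url(reference["reference URL"])

print(reference["archive URL"])
```

This only builds the address; whether a snapshot actually exists there still has to be checked.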
Other questions?
All right, so let's
talk about how you
create a completely new item.
It's very simple.
You go to Wikidata and you
click here on the side.
There's a link, create new item,
which gives you this screen.
And let's create an
item about a book
that I'm reading right now
by this Bulgarian writer.
So we have an article about this
writer guy named Deyan Enev.
But we don't have an
article or a Wikidata item
about one of his famous
books called Circus Bulgaria.
That's the book I'm reading,
his first collection
of short stories in English.
Circus Bulgaria came out
in 2010, Portobello Books,
translated by Kapka Kassabova.
So that's the book I'm reading.
As you can see it's not
a link on Wikipedia.
There's no article about
it, and there's not even
a Wikidata entity item about it.
But we can totally create
it, even without a Wikipedia
article.
So let's create this new item.
Let's create it in
English for the purposes
of our demonstration.
The name of the item
is Circus Bulgaria.
Circus Bulgaria,
that's the name.
Not Circus Bulgaria
parentheses book,
or anything you may be
used to from Wikipedia.
It's the actual
name of the book,
and the description,
again, remember,
the description field
is just to kind of help
tell apart this Circus Bulgaria
from any other potential Circus
Bulgaria.
Maybe there's a
film or something.
So it's enough to just say
something like short story
collection.
I might add "by Deyan Enev",
just in case, again,
some future other short story
collection by some other author
happens to have that same name.
That should be
disambiguating enough.
OK.
Short story collection
by Deyan Enev.
I could have aliases for this.
The aliases assist find-ability.
This particular book has just
this one name, so that's fine.
And I click Create.
That's it.
I just start with a
label, and a description.
I click Create.
I have a brand new Q number
for my new Wikidata item.
And Wikidata knows
what to call it.
And a description in
one language at least.
And that's it, and I
can start populating it.
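As a rough sketch of what exists at this point, the new item can be pictured as a small record: a Q number, a label and a description per language, optional aliases, and no statements yet. The class, the field names, and the placeholder ID "Q_NEW" below are illustrative assumptions, not Wikidata's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Illustrative model of a freshly created item (not Wikidata's real schema)."""
    qid: str
    labels: dict = field(default_factory=dict)        # language code -> label
    descriptions: dict = field(default_factory=dict)  # language code -> description
    aliases: dict = field(default_factory=dict)       # language code -> list of aliases
    statements: list = field(default_factory=list)    # empty until we add some


def create_item(qid, lang, label, description):
    """Create an item with one label and one description in a single language."""
    return Item(qid=qid, labels={lang: label}, descriptions={lang: description})


# "Q_NEW" is a hypothetical placeholder; the real Q number is assigned by Wikidata.
book = create_item("Q_NEW", "en", "Circus Bulgaria",
                   "short story collection by Deyan Enev")
```

Adding the Hebrew or Bulgarian label later would just mean adding another key to `labels`.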
As you can see, it
has no site links,
but it's ready to be taught.
So, for example, I
can start by teaching
it the name of the book
in another language
that I happen to speak.
Now it has two labels
in English and Hebrew.
I could also look
up the original Bulgarian
label for this book.
Seems relevant.
Again, I do not speak Bulgarian.
But I can go to the Bulgarian
Wikipedia through the interwiki link.
This is this gentleman.
And I could find--
I can read Cyrillic so
I could easily find--
when I say easily--
when I say easily--
maybe not so easy, but
I can search for it.
Here we go.
Tsirk Bulgaria.
That is the name of the book.
Tsirk, as in circus.
No problem.
So I just copy this right here.
And I go back to my new item.
My new item, which is here,
and I edit the Bulgarian field.
And here it is.
Awesome.
All right.
But I still haven't told
Wikidata anything about this.
I know I'm talking about a book.
Wikidata doesn't
know that yet.
So let's start by
adding some statements.
First of all, I click Add.
Wikidata sensibly
says, how about we
start with instance of.
Tell me what kind of animal--
no, not kind of animal.
What kind of thing are you
trying to describe here?
Well it's an instance of a book.
Not in Hebrew, please.
So it's an instance of a book.
I could even be a
little more specific
and say it's an instance of
a short story collection.
There we go, short
story collection.
I hit Save.
Awesome.
So now we know what
kind of thing it is.
It's not a human, it's not a
mountain, it's not a concept.
It's a short story collection.
Now I can add some other things.
See, Wikidata is
already working for me.
Because it's a short
story collection
it's offering me to populate
these properties, and not
other ones.
Publication date,
original language,
genre, country of origin,
these are all relevant, right?
So let's start with original
language of the work
is Bulgarian.
Not Bulgaria, Bulgarian.
This is the item I want to link.
Hit Save, and whatever.
Author.
Let's identify the author.
So the author, the main
creator of the work,
is that gentleman Deyan Enev.
And remember, he has
a Wikipedia article.
He also has a Wikidata entity.
So Wikidata does know about him.
So I hit Save, and I can add
something about the translator.
And what was that lady's name?
Kapka Kassabova.
Now it so happens that Wikidata
already knows about this lady.
See?
So I can just start typing
and then just link to it.
Awesome.
But what if it didn't?
What if it was translated
by someone who isn't
already covered on Wikidata?
Well I could just type
the name as a string,
but ideally I could
create a Wikidata entity
about this translator so
that there is a possibility
to link to her.
Now I might actually
add a qualifier here
because, she's not the
translator of the book, right?
She's the translator of
the book into English.
Right.
So the language that she
translated into is English.
Right?
This book-- remember
I'm describing the book.
The item is about the book.
So the book would have
a different translator
into Polish.
So this is an example of
a property or a statement
that doesn't make sense without
one of those qualifiers.
It's just not correct.
It doesn't make sense to
say that translator is.
The English translator, or
even this English translator.
In 50 years maybe there would
be an additional English
translation.
So that's an example of
needing that qualifier.
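One way to picture the qualifier is as extra data attached to the statement itself. The sketch below assumes the translator property is P655 and that the language qualifier is P407 (language of work or name); the talk does not say which qualifier property was chosen, so treat that as an assumption. The two translator IDs are made-up placeholders.

```python
from dataclasses import dataclass, field

TRANSLATOR = "P655"   # translator
LANGUAGE = "P407"     # language of work or name (assumed choice of qualifier)
ENGLISH = "Q1860"
POLISH = "Q809"


@dataclass
class Statement:
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)  # qualifier property -> value


def translators_into(statements, language_qid):
    """Return the translators whose statement is qualified with the given language."""
    return [
        s.value
        for s in statements
        if s.prop == TRANSLATOR and s.qualifiers.get(LANGUAGE) == language_qid
    ]


# Placeholder item IDs; real statements would link to real Q numbers.
book_statements = [
    Statement(TRANSLATOR, "Q_ENGLISH_TRANSLATOR", {LANGUAGE: ENGLISH}),
    Statement(TRANSLATOR, "Q_POLISH_TRANSLATOR", {LANGUAGE: POLISH}),
]
```

Without the qualifier, the two translator statements would be indistinguishable, which is the problem described above.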
And of course I could go on
and populate the other fields.
We don't have to
do that right now.
Publication date, country
of origin, et cetera.
So this is already beginning
to look like all those items
that we already saw, but just
a moment ago it didn't exist.
Just a moment ago Wikidata
had no concept of this work.
This happens to be one
of his notable works.
So I could actually go to the
item about Deyan Enev which
has all this information
already, occupation, languages,
and add a property.
Remember, I'm not
limited to these.
I can add a property
called notable works,
and mention my new item.
Circus Bulgaria.
See?
My new item is
showing up, and thanks
to this description that I
wrote, short story collection,
it's already appearing here in
the dropdown very conveniently.
So I linked to this.
I hit Save.
Ideally again I should find
some references showing
that this is a
notable work by him,
but we won't spend
time on that right now.
But the point is we
created a new item.
We populated it a little bit.
We linked to it so that it's
more discoverable by mentioning
it in the author item, and
of course the book item
itself mentions the author
and links to the author.
So that's all good.
One last thing we shall do is
give it some useful identifier
so let's add, say, the
Library of Congress record
for this book.
OK.
So I have prepared
this in advance.
Ooh.
Just in time, with 80 seconds to
go before it's giving up on me.
Oh it has already
given up on me.
That is very unfortunate.
So I go to the Library of
Congress and I find this book.
I find this entry, right?
In the Library of Congress
database about this book.
And it has a permalink.
It has a kind of guaranteed
to be permanent link.
I can just copy that link,
go back to my little book,
and say the Library of Congress.
Yeah, LCCN, that's what they
call their IDs, the call
number.
And I paste it here.
I actually don't need the URL.
I need just a number.
And there we go.
I have added it,
and now Wikidata
knows how to find bibliographic
information about this book.
And any re-user of
Wikidata, some program,
some tool that connects
books to authors
or does statistical analysis or
whatever, some future yet to be
imagined tool
could automatically
find additional metadata on the
Library of Congress site thanks
to this connection
that I just made.
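The point about storing just the number rather than the URL can be sketched as follows. This assumes the permalink has the shape https://lccn.loc.gov/ followed by the identifier, and the example number is made up for illustration; a re-using tool would rebuild the link from the stored ID.

```python
from urllib.parse import urlparse

# Assumed permalink pattern for Library of Congress records.
LCCN_URL_TEMPLATE = "https://lccn.loc.gov/{id}"


def lccn_from_permalink(url):
    """Keep only the identifier at the end of the permalink."""
    path = urlparse(url).path
    return path.strip("/").split("/")[-1]


def permalink_from_lccn(lccn):
    """Rebuild a link from the stored identifier, as a re-using tool might."""
    return LCCN_URL_TEMPLATE.format(id=lccn)


# Hypothetical identifier, not the real record for this book.
example_id = lccn_from_permalink("https://lccn.loc.gov/2010000000")
```

Storing only the ID means the link pattern can change in one place without touching every item.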
And of course I could
add many other IDs
to other catalogs
around the world,
and we won't do that right now.
You can see that it's now
showing up under identifiers.
So this is how we created
a brand new piece of data.
Questions about this,
about creating new items?
Yeah, all right.
So we've seen how to contribute
to Wikidata on our own,
kind of through--
directly through Wikidata.
Now you may be
thinking, but Asaf, this
sounds like a ton
of work recording
all of these little tiny bits of
information about every person
and every book and every town.
And if you think that,
you would be correct.
That is a ton of work.
It's a lot of work.
However, it is centralized, so
it is reusable on other wikis
and we will show in just a
moment how we pull information
from Wikidata into
Wikipedia or other projects.
We will show that
in just a moment.
But here's an
awesome little game
that a Wikidata
volunteer, Magnus Manske,
has authored called the
Wikidata game, in which he
tricks people--
sorry, helps people
make contributions
to Wikidata in a very,
very easy and pleasant way.
Let's look at the Wikidata game.
So the first thing you need
to do in that Wikidata game
is to log in,
because the Wikidata
game makes edits in your name.
So we need to authorize it.
It's perfectly safe.
And after you do that you
can go to the Wikidata game.
So this is the game.
Now I'm logged in.
And the Wikidata game
actually includes
a number of different games.
Let's start with a person game.
So Wikidata shows you--
shows you an item, and asks
you a very simple question.
Person, or not a person?
So the Wikidata game goes through
Wikidata entities
that don't even have the
instance of property.
Which is why Wikidata
doesn't know,
literally doesn't know, if this
is a person, or a mountain,
or a city, or a country,
or anything else.
So it asks you, because this
is the kind of question that
Wikidata cannot
decide on its own,
but for us humans it's generally
trivial to be able to say
whether something that we're
looking at is a person or not.
It gets slightly trickier when
the information is in Javanese,
as it is here,
rather than English.
So this item happens to
be described in Javanese.
My Javanese, spoken in
Indonesia, is very weak.
However, I can tell that
this is not a person.
How can I tell?
Without understanding
a word of Javanese
I see that it mentions
1000 kilometers
and square kilometers, see?
So this is about a
place, or an area,
or a region, or whatever,
but not a person.
So this is an
example of how even
without understanding
language you can sometimes
make a determination.
However, of course,
you should be sure.
This is definitely not
what the Wikipedia article
about a person looks like.
So this is not a person.
I just click it and I'm
shown the next item.
This item is in another
language I do not speak,
and I just don't know.
I do not know if this is
about a person or not.
So I click Not Sure.
This is in Swedish, and
it's about Sulawesi, still
Indonesia.
And it is not about a person.
I have enough Swedish for that.
So I click not a person.
Now, you may say,
well, do I really
have to deal with all these
languages that I don't speak?
The answer is no.
You don't have to.
Here at the bottom
of the Wikidata game
there are settings.
You can click that
and tell Wikidata,
I cannot even read
Chinese or Japanese,
so please don't show me
items in those languages.
Because I wouldn't
even be able to guess.
I prefer these languages in
which I can relatively easily
make determinations.
And I can even tell Wikidata to
only show me these languages.
You see?
This was not selected,
which is why I
was shown some other languages.
I could say, only use
these languages, and save.
And now I can try
this game again.
However, that can
slow it down a little.
So here we go.
Here's a Spanish-- which
is one of the languages I
told Wikidata game it can use.
This is a Spanish item.
Now is it about a person or not?
It is not about a person.
Is it about a person?
No.
Yes, it is, right?
Monk, Cistercian, Pedro
de Oviedo Falconi.
That sounds like a person.
Fray Pedro nació.
Yeah, he was born
in Madrid 1577.
This is a person.
OK.
So I click person.
Again, if you're not
sure, click not sure.
The point is, just by clicking
person and as you can see
this would work
very well on mobile,
which is why I said you can
contribute on your commute.
You can just hold your
phone or tablet or whatever,
and just tap.
Person, not a person.
Person, not a person.
The amazing thing is that just
tapping person has actually
made an edit to Wikidata
on my behalf, which
I can find out, like every
wiki, by clicking contributions.
And as you can see in addition
to the stuff about Circus
Bulgaria, my latest edit is in
fact about this Pedro de Oviedo
Falconi person.
And the edit was, you can--
I hope you can see this, created
the claim instance of human.
So I added--
I mean Wikidata game
added for me the statement
instance of human.
Now, the awesome thing is
that it was super easy to do.
I didn't have to go into that
entity, click the Add button,
choose the instance of property,
choose human, hit Save.
Instead of all these
operations I just
tapped on my screen,
person, not a person.
And I can do hundreds of
edits during my daily commute.
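The logic of the person game, as described here, can be sketched in a few lines: pick items that have no instance of (P31) statement, and turn a "person" tap into the claim instance of human (Q5). What the real game records for "not a person" or "not sure" is not covered in the talk, so this sketch simply leaves the item unchanged in those cases. The item IDs are placeholders.

```python
INSTANCE_OF = "P31"
HUMAN = "Q5"


def needs_person_question(claims):
    """An item is a candidate if it has no 'instance of' statement at all."""
    return INSTANCE_OF not in claims


def apply_answer(claims, answer):
    """Return updated claims; only a 'person' answer adds a statement here."""
    if answer == "person":
        updated = dict(claims)
        updated[INSTANCE_OF] = [HUMAN]
        return updated
    return claims  # assumption: other answers change nothing on the item


# Placeholder items: property ID -> list of values.
items = {
    "Q_MONK": {},                     # nothing known yet
    "Q_BOOK": {"P31": ["Q_SOMETHING"]},  # already classified
}
candidates = [qid for qid, claims in items.items() if needs_person_question(claims)]
```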
There are other games,
like the gender game.
So this is about--
this is when Wikidata
already knows
that this item is a
person, but it doesn't
know the gender of this person.
Which is another one of
the more basic items.
And this is taking a long
time because of the language
limitations that I set on it.
I guess the less exotic
languages have already
been exhausted in the game.
We don't have to
wait all this time.
We can try something else.
How about occupation?
The occupation game.
Here we go, this is in Russian.
And what is the occupation
of this gentleman?
Well he is an [INAUDIBLE].
He's a church person.
However, so the
occupation game is
where Wikidata game
will automatically
pull likely occupations
from the article text
and ask for confirmation.
So if he-- if this person
really is a deacon,
I should click that.
But I'm not sure.
I'm not clear on the Russian
church's distinctions between--
I mean [INAUDIBLE]
is pretty senior,
but I don't know if that
automatically also means
he's a deacon or not.
And [INAUDIBLE] is
not listed here.
So I will click not listed.
Also, these guesses
are not always correct.
So, this guy for
example, is in Russian.
I can read this.
He's a philologist.
He's a linguist.
So I can confirm it
and click linguist.
All right?
And again, if we look
at my contributions
we can see the Wikidata
game on my behalf
created occupation linguist.
OK.
Just by tapping linguist there.
Now if it's taken
from the article,
why would it ever be wrong?
Well Jesus was the
son of a carpenter.
The word carpenter
appears in the text.
That doesn't mean it's correct
to say Jesus was a carpenter.
OK?
Just a trivial example, right?
So many, many articles will say,
you know, born to a physician.
And so the word physician
could be guessed,
but it wouldn't be correct
unless the son is also
a physician.
So I hope it gives
you the gist of it.
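To see why such guesses need a human to confirm them, here is a deliberately naive sketch of pulling candidate occupations out of article text by keyword matching. It is not the game's actual method, which the talk does not describe; the word list and the sample sentence are invented for illustration.

```python
import re

# Hypothetical list of occupation words to look for.
KNOWN_OCCUPATIONS = ["carpenter", "physician", "linguist"]


def candidate_occupations(text):
    """Return every known occupation word that appears anywhere in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [occupation for occupation in KNOWN_OCCUPATIONS if occupation in words]


# The father's occupation is matched just as readily as the subject's own.
guesses = candidate_occupations("Born to a physician, he became a linguist.")
```

Both words come back as candidates, and only a person reading the sentence can tell that just one of them describes the subject.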
There is also a
distributed Wikidata game,
which is pretty awesome.
Here we go, which
has additional games.
So, for example, the
Kian game gives you,
maybe it gives you,
some items to play with.
Yes?
No?
OK.
So it gives you
this little card,
and asks you to confirm is this
instance of human settlement?
That is, is it a village,
town, city, whatever.
Is it a kind of human
settlement or not?
Or maybe it's a book.
Maybe it's a poem.
Again, so, is it a
human settlement?
And you can click the languages
here to see the information.
So I can click English.
And indeed the article--
I mean the actual
Wikipedia article
says Kamiji is a
town and territory
in this district in the Congo.
So yes, this is an instance
of human settlement.
So I clicked yes.
And just clicking yes
again went to that item,
and added instance
of human settlement.
Now the point of
all these games is
these are tools,
written by programmers,
making kind of semi educated
guesses about these fairly
basic properties.
And they are meant to
semi automate, to assist,
in the accumulation of all
these important pieces of data.
Now every single
click here helps
Wikidata give better
results, richer results
in future queries.
Again, as of right now
Wikidata can include Kamiji
if I ask it, you know, what
are some towns in Congo?
Until now it could not.
Because it literally
didn't know.
So every time we click male,
female, person, not a person,
make these decisions,
we help improve Wikidata
and enrich the results
that we could receive.
Any questions about this, about
kind of micro contributions
through the Wikidata game?
If that looks
appealing I encourage
you to go and visit
the Wikidata game
and start contributing
in that way.
There is a question here.
If I make an article about
Circus Bulgaria how should
I correctly connect them?
That is an excellent question.
So once-- so now there is a
Wikidata item about that book,
but there is no Wikipedia
article anywhere.
Now suppose I write one,
in Bulgarian maybe.
You go to Wikidata.
You find the item by searching.
You find the item, and then
the empty site links section
right at the bottom there--
where are we?
We have this?
Circus Bulgaria.
Let's demonstrate this.
So here is the item
about the book.
Let's say that now
there is an article
because I just created it.
I can go here to the empty
Wikipedia link section,
click Edit, type the
name of the wiki,
let's say English, and then
type the name of the page
that I just created.
Circus-- right?
And again, it offers
me auto-complete
for my convenience.
Now we don't actually
have the article created,
but let's just
say this was the article.
I can just click this,
hit Save, and that
would associate the
new Wikipedia article
with this Wikidata item.
That is the beginning of the
inter-wiki list for this item.
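Conceptually, the site links section is a mapping from a wiki to a page title on that wiki. The sketch below assumes site keys such as "enwiki" and "bgwiki"; the Bulgarian title is a placeholder, since no such article exists in the demonstration.

```python
def add_sitelink(sitelinks, site, title):
    """Return a copy of the site links with one more wiki page attached."""
    updated = dict(sitelinks)
    updated[site] = title
    return updated


def interwiki_sites(sitelinks):
    """The list of wikis that have a page connected to this item."""
    return sorted(sitelinks)


links = {}                                                  # the empty section
links = add_sitelink(links, "enwiki", "Circus Bulgaria")    # hypothetical article
links = add_sitelink(links, "bgwiki", "TITLE_IN_BULGARIAN")  # placeholder title
```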
I will not click
Save now, because we
don't have the article yet.
So I hope that
answers that question.
Was there another question
that I missed here?
No.
OK.
Any questions about
the Wikidata game?
About this idea of
micro contributions?
If not then we can move
on to embedding data,
and after that we
can discuss queries,
how to get at all this
data from Wikidata.
So the short version of how
to embed data from Wikidata
is that there is this
little magic incantation.
Curly brace, curly brace,
hash mark, property.
It looks like a template, but
it isn't because of that hash.
And that is magic.
Take a look at this little
demo that I prepared.
This page, which is off
my user page on meta,
but it could be on any wiki.
OK.
Says, since San Francisco
is item Q62 in Wikidata,
and since population is
property P1082, I can tell you
that according to Wikidata the
population of San Francisco
is this.
And this bolded number here was
produced with this incantation.
Curly brace, curly brace,
hash mark, property P1082,
that's population,
pipe, from, what item?
Right?
Cause I'm pulling
an arbitrary number.
I could put any
property in any item
here, and kind of include
it, embedded, into my text.
This isn't even about-- you
notice this is my user page.
This isn't even the article
about San Francisco.
I just want to pull that
number into this thing
that I'm writing.
So it's fairly simple.
I identify the property.
I identify the item
to take it from.
And Wikidata will,
I mean Wikipedia,
or the wiki I'm on, in this
case meta, will go to Wikidata
and fetch it for me.
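Written out, the incantation is plain wikitext. The small helper below only assembles that text for a given property and item, using the San Francisco example; the helper name is invented for illustration, and the wiki itself does the actual fetching when the page is rendered.

```python
def property_call(prop_id, item_id):
    """Build the wikitext that pulls one property from one item."""
    return "{{#property:" + prop_id + "|from=" + item_id + "}}"


# P1082 is population, Q62 is San Francisco, as in the demo page.
population = property_call("P1082", "Q62")
sentence = "The population of San Francisco is '''" + population + "'''."
```

Pasting the resulting `sentence` into any wiki page connected to Wikidata should render the bolded number.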
Likewise, since Denny Vrandecic,
the designer of Wikidata
is item 18618629, right?
I mean, he's a notable person,
so he has a Wikidata entity.
And since occupation is property
106, and date of birth is 569,
and place of birth
is 19, because
of all that I can tell you
that Vrandecic was born
in Stuttgart, on this date,
and is researcher, programmer,
and computer scientist.
If you look at the source for
this page, click Edit Source,
you can see that the word
Stuttgart does not appear here,
because it came from Wikidata.
I did not write this into
my little demo page here.
See?
Place of birth is--
where is it?
Here.
Born in property 19 from
Q number so-and-so.
That is how easy
it is to pull stuff
into a wiki from Wikidata.
OK now there's
some nuance to it.
And there are
some additional parameters
you can give.
And you can ask
Wikidata to give you
not just the text of the values,
but actually make it links.
So, for example, if I change
this from property to values--
No, that did not work at all.
Wasn't it values?
What was it?
Values and then--
Oh, statements.
My bad, sorry.
The magic word is statements.
Statements.
So going back here.
If I change the word property
to the word statements
here then this same value--
that did not work at all.
Oh, because I'm on meta.
So because I'm on
meta, meta doesn't
have an article named
researcher, programmer,
or computer scientist.
But Wikipedia does.
If I included this same
syntax in Wikipedia,
like English Wikipedia,
for example--
So let's go there right now.
And go-- go to my--
Go to my sandbox.
If I just brutally paste
this on my sandbox here--
So, see, these became links.
Because Wikipedia has an article
called programmer and computer
scientist.
So, like I said, there's
some additional nuance
to the embedding.
The important thing
is that this is
the key to delivering on that
first problem that I mentioned.
How to get data from
a central location
onto your wiki in your language.
Basically using property and
statements magic incantations.
And of course,
usually, this would be
in the context of an info box.
Some wikis-- English Wikipedia
is not leading the way there.
Some smaller wikis
are more advanced
actually in integrating
Wikidata embeddings like this
into their info boxes.
So that instead of
the info box just
being a template on the wiki
with field equals value,
field equals value.
That template of the
info box on the wiki
pulls the values, the birthdate,
the languages, et cetera,
pulls them from Wikidata.
So basically just-- I just
demonstrated single calls
to this, but of course
an info box template
would include maybe
20 or 40 such embeds,
and that is not a problem.
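A minimal sketch of what "many embeds in one info box" could look like, generating one row per field with the statements variant so that values become links where the wiki has matching articles. The row layout and field labels are hypothetical, not any wiki's real info box template; as far as I understand, a real template would usually omit the from part and rely on the item connected to the page it is used on.

```python
# Field label -> property, using the properties mentioned in the demo.
FIELDS = [
    ("Born", "P569"),
    ("Place of birth", "P19"),
    ("Occupation", "P106"),
]


def statements_call(prop_id, item_id):
    """Wikitext for the linked variant of the embed."""
    return "{{#statements:" + prop_id + "|from=" + item_id + "}}"


def infobox_rows(item_id):
    """One hypothetical info box row per field, each pulling from Wikidata."""
    return "\n".join(
        "| " + label + " = " + statements_call(prop_id, item_id)
        for label, prop_id in FIELDS
    )


rows = infobox_rows("Q18618629")
```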
Of course, before you go and
edit the English Wikipedia's
info box person and replace
it all with Wikidata embeds,
you should discuss it with the
English Wikipedia community.
These discussions have
already been taking place.
There are some
concerns about how
to patrol this, how to keep
it newbie friendly, et cetera.
So there are legitimate concerns
with just moving everything
to be embedded from Wikidata.
But the communities are
gradually handling this.
I mean this ability to embed
from Wikidata is not very old.
It's been around
for about a year.
So communities are
still working on kind
of integrating that technology.
But that is kind
of just the basics of how
to pull data, individual bits
of data, that's not querying,
that's not asking those sweeping
questions that I was talking
about yet.
We'll get to that.
Right now this is
how to pull a specific datum,
a specific piece of data,
from Wikidata.
OK.
So here's another quick
thing to demonstrate
before we go to
queries, and that
is the article placeholder.
The article placeholder
is a feature
that is being tested on the
Esperanto Wikipedia, and maybe
another wiki, I don't remember.
And it is using the
potential of Wikidata
to offer a placeholder
for an article.
An automatically generated
Wikidata powered replacement
placeholder for an article
for articles that don't yet
exist on Esperanto.
So let's go to the
Esperanto Wikipedia.
I don't speak Esperanto.
But let's look for Helen
Dewitt, our friend,
in Esperanto Wikipedia.
Now Esperanto is not
one of the Wikipedias
that have an article
about Helen Dewitt.
And so it tells me that, right?
There is no Helen Dewitt.
Maybe you were looking
for Helena Dewitt.
No, I was not.
You can start an article
about Helen Dewitt.
You can search.
You know, there's
all this stuff.
But there is also this
little option here, hiding,
which tells me that the
Esperanto Wikipedia is--
what's happening here?
Yes.
The Esperanto Wikipedia is
ready to give me this page.
This page, as you can see, it's
on the Esperanto Wikipedia,
but it's not an article.
See, it's a special page.
It's machine generated.
You can see the URL as well.
It's not, you know,
slash Helen Dewitt.
It's slash Specialaĵo,
AboutTopic,
and then the Wikidata
ID of Helen Dewitt.
And what I get here--
I get an English
description, by the way,
because there is no
Esperanto description.
Wikidata can't make it up.
But what it can do is
offer me these pieces
of data in my language,
in this case Esperanto.
I'm on the Esperanto Wikipedia.
OK.
So it tells me that she's
American, for example,
and it tells me
that in Esperanto.
OK and it tells me
that she speaks Latin.
Remember we taught
Wikidata that?
It tells me that she
was educated in Oxford,
you know, and gives me the
references to the extent
that they exist.
I mean this is not an article.
It's not, you know, paragraphs
of fluent Esperanto text.
But it is information
that I can understand
if I speak this language.
And it's better than nothing.
And remember Helen Dewitt was
not a very detailed article.
If I were to ask about, I
don't know, some politician,
or popular singer that
has more data in Wikidata,
then this machine generated
thing would have been richer.
So this feature is available
and is under beta testing
right now, but generally if
this sounds interesting for you
especially if you come
from a smaller wiki that
is missing a lot of articles
that people may want to learn
about, you can contact
the Wikimedia Foundation
and ask for article placeholder
to be enabled on your wiki.
And again, this
is a placeholder.
Of course, it exists only
until someone actually
writes a proper Esperanto
article about Helen Dewitt.
So I hope this is clear.
This is all coming from
Wikidata on the fly.
In real time.
As you can see it includes my
latest edits to Helen Dewitt.
OK.
Questions about the-- questions
about the article placeholder?
If there are, try and
put them on the channel.
And this brings us to one of
the main courses of this talk,
which is querying Wikidata.
So I've explained
how Wikidata works.
We've walked through it.
We've added to it.
We've created a new item.
We learned how to contribute
during our commutes.
And all this while you
kept promising us,
Asaf, that this would be--
this would enable
these amazing queries.
So time to make good on that.
The URL you need to remember
is query.wikidata.org.
And that will take you
to a query system that
uses a language called SPARQL.
SPARQL, spelt with
a Q. This language
is not a Wikimedia creation.
It's a standardized language
used for querying linked data
sources.
And because of that
there are
certain usability prices
that we pay for using SPARQL,
for using a standard language.
It's not completely custom
made for querying Wikidata,
and we'll see that
in just a moment.
The principle to
remember about Wikidata
query is that Wikidata will
tell you everything it knows,
but no more.
I have anticipated this
several times already, right?
Until this moment when
we taught Wikidata
that Helen Dewitt
speaks Latin, she
would not have appeared
in query results
asking who are American
writers who speak Latin?
She would not have appeared.
But as of this
afternoon, she will
appear because I've added
that piece of information.
So a result of that principle
is that you can never say,
well I ran a Wikidata
query and this
is the list of Flemish painters
who are sons of painters.
The list.
That these are all
the Flemish painters
who are sons of painters.
That is never something you can
say based on a Wikidata query,
because of course, maybe
not all the Flemish painters
who are sons of painters have
been expressed in Wikidata
yet.
Wikidata doesn't know
about some of them,
or maybe it knows
about all of them
but doesn't know
the important fact
that this person is
the son of that person,
because those properties
have not been added.
And so they cannot be
included in the results.
So the results of
a Wikidata query
are never the definitive sets.
What you can say about
a Wikidata query is here
are some Flemish painters
who are sons of painters.
Here are some cities
with female mayors.
Whatever it is
you're querying about
is never guaranteed
to be complete
because Wikidata,
like Wikipedia, is
a work in progress.
And of course, the more
we teach Wikidata the
more useful it becomes.
OK so lets go and
see those queries.
So this is query.wikidata.org.
It's not the wiki.
All right?
So this isn't like some
page on the wiki itself.
This is kind of an
external system.
So it's not a wiki.
You can see I don't
have a user page here.
I don't have a history tab.
This isn't a wiki page.
This is a special kind
of tool or system.
And it invites me to
input a SPARQL query.
Now most of us do
not speak SPARQL.
It's a technical language.
It's a query language.
Some of you may be thinking
about SQL, the database query
language.
SPARQL is named with kind
of a wink, or a nod, to SQL.
But, I warn you, if
you are comfortable in
SQL don't expect to carry
over your knowledge of SQL
into SPARQL.
They're not the same.
They are superficially similar.
Right?
So they both use
the keyword select,
and they use the word where,
and they use things like limit,
and order.
So again, if you know
this already from SQL
those mean roughly
the same things,
but don't expect it to
behave just like SQL.
You do need to spend some time
understanding how SPARQL works.
So, by all means, I
invite you to go and read
one of the many fine
SPARQL tutorials that
are out there on the web, or
to click the Help button here,
which also includes
help about SPARQL.
But I also know
that most of us when
we want to do some advanced
formatting on wiki,
for example, we don't go
and read the help page
on templates, right?
We go to a page that already
does what we want to do,
and adopt and adapt the code
from that other page, right?
So we just take something that
does roughly what we want,
and just copy it over and
change what we need to change.
That is a very pragmatic
and reasonable way
to do things which is why--
and the Wikidata
engineers know this,
which is why they prepared
this very handy button for us
called examples.
We click the examples button.
And, oh my god, there is a ton
of-- well there's 312 example
queries for us to choose from.
And we can just
pick something that
is roughly like what
we're trying to find out,
and then just change
what needs changing.
So let's take a very simple one.
The cats query.
Maybe one of the simplest
you could possibly have.
And let's run it first
and then I'll kind of
walk you through it.
The goal here is not
to teach you SPARQL,
but to get you to be kind
of literate in SPARQL.
To kind of understand why
this does what it does.
So let's run this query first.
We click Run and here I
have results at the bottom.
The item, which is
just a Wikidata item,
which of course is a number.
Remember, Wikidata thinks
of items as Q numbers.
And the label,
because we're humans
and we prefer words to numbers.
So these 114 results
are all the cats
that Wikidata knows about.
Is this all the
cats in the world?
No of course not, remember?
It's all the cats Wikidata
knows about, which
means they're somehow notable.
I mean someone bothered to
describe them on Wikidata.
And Wikidata was told this
item is an instance of cat.
Right?
So these are those cats.
And we can click any of them.
I don't know,
Pixel, for example.
Click the Wikidata item.
And here is the Wikidata
item about Pixel
with the Q number.
And he is a tortoiseshell cat.
And as you can see
instance of cat.
OK.
And he is five inches high.
And he is apparently documented
in Indonesian, in Bahasa.
Right here this is Pixel.
And he is apparently somehow
related to the Guinness World
Records book.
I don't speak Bahasa, so
I don't know exactly why
this cat is so notable.
But, of course, cats
can become notable
for all kinds of reasons.
Maybe they're a
YouTube sensation,
you know, maybe
they were involved
in some historical event.
I like this cat named Gladstone.
This cat named Gladstone is--
he has position
held Chief Mouser
to Her Majesty's Treasury.
This is an official
cat with a job.
And he has been holding this
job, mind you, since the 28th
of June this past year.
That's the start time.
And there is no end time
which means he currently
holds the position
of Chief Mouser
to Her Majesty's Treasury.
His employer is Her
Majesty's Treasury.
He's a male creature.
And Wikidata knows
that this cat is
named after William Gladstone,
the Victorian prime minister.
Of course if I don't
know who this person is
I can click through
and learn that he
was a liberal politician
and prime minister, right?
He even has a Twitter account.
And Wikidata sends
me right to it.
The treasury cat
Twitter account.
And he has articles in
German, and English,
and of course Japanese,
because he's a cat.
All right.
So this was a very simple query.
Let's find out why it works.
OK.
So what did we actually
tell Wikidata to do for us?
We said, please select
some items for us
along with their labels.
OK?
Along with their
human readable labels
because if I remove this
label what I get is, see,
just a list of item numbers.
That's not as fun.
So that's what this
little bit did.
I just said, give me the
items, but also their
human-readable label.
And I want you to
select a bunch of items,
but not just any
random bunch of items,
I want to select items where
a certain condition holds.
What is the condition?
The condition is that the
item that I want you to select
needs to have property
31 with a value of Q146.
Well, that's helpful.
If I hover over these numbers--
Again, I get the human
readable version.
So I'm looking for
items that have property
instance of with the value cat.
Right?
Because that's literally
what I want, right?
I want all the items that have
a property, a statement, that
says instance of cat.
That's the condition.
I'm not interested in items
that are instance of book,
or instance of human.
I'm interested in
instance of cat.
That is the only condition
here in this query.
This complicated line I ask
you to basically ignore.
This is one of those
sacrifices that we
make for using a standard
language like SPARQL.
But the role of this
complicated line
is to basically
ensure that we get
the English label for that cat.
OK?
So don't worry about that.
Just leave it there.
And we run the query
and we get the list
of cats with their English
labels, and that is awesome.
By the way, if I change EN,
without really understanding
this line, if I change
EN to HE, for Hebrew,
I get the same results
with a Hebrew label.
Of course, these cats,
nobody bothered to give them
Hebrew labels unfortunately.
So I get the Q number.
But if I changed
it to Japanese, JA,
I would still get a bunch of
Q numbers where there
isn't a Japanese label,
but otherwise I would get the labels
in Japanese.
OK?
So this is an example
of how you don't even
need to understand all
the syntax of this query
to adapt it to your needs.
If you want this
query as is, but you
want the labels in
Japanese, you can just
change the language code here.
OK so that is all
this query does.
Again, just give
me the items that
have property 31, instance of,
with a value 146, which is cat.
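The query text itself does not appear in this transcript.
A minimal sketch of what the cat query described above
likely looked like, kept in a Python string so the language
code can be swapped. It assumes the standard prefixes of the
Wikidata Query Service (wd:, wdt:, wikibase:, bd:); the
helper name cat_query is made up for illustration.

```python
# Sketch of the "all cats" query as described in the talk.
# The exact on-screen text is an assumption; P31 (instance of) and
# Q146 (cat) are the numbers named in the talk.
CAT_QUERY_TEMPLATE = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "%(lang)s" . }
}
"""


def cat_query(lang="en"):
    """Return the cat query with labels requested in the given language code."""
    return CAT_QUERY_TEMPLATE % {"lang": lang}


if __name__ == "__main__":
    # Same query, Japanese labels: only the language code changes.
    print(cat_query("ja"))
```

Only the language code differs between cat_query("en") and
cat_query("ja"); the condition on P31 and Q146 stays the same.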
Let's take a question just
about this very simple query
before we advance to
more complicated queries.
Any questions just about this?
Like, did anyone kind of
really lose me talking
about this simple query?
Again, this query just tells
Wikidata, get me all the items
that somewhere among
their statements
have instance of cat.
That's the only condition.
No questions.
OK, feel free to ask if
you come up with one.
So let's complicate
things a little.
Let's ask only for male cats.
OK.
Remember this cat
Gladstone is male,
and we know this because
he has a property called
sex or gender, and the value
is male creature, right?
So let's add another
condition right here
under the first condition.
OK?
This is a new line.
And I'm adding a new
condition to the query.
I'm saying, not only do I
want this item that you return
to be instance of cat, I
also want this same item
to have another property,
the property sex or gender.
Right?
And I need to refer to
the property by number.
But don't worry,
Wikidata will help you.
So you start with this
prefix, WDT.
Again, just ignore
that prefix; it's
one of the features of SPARQL
that we need to respect.
WDT colon, and then I can
just type control space
to do a search, to
do an auto complete.
So I can just type sex
and Wikidata helpfully
offers me a drop down
with relevant properties.
So I click property 21, which
is the sex or gender property.
And then I say, so I want
the sex or gender property
to have the Wikidata value.
Again, control space.
And I can just
say male creature.
See?
There's a different item
for male, as in human,
and a different one for
male creature, for reasons
that we won't go into.
Let's pick male
creature, because we're
talking about cats here.
All right.
And add a period here at
the end and click Run.
And instead of 114 cats,
this time we get 43 results.
Including our friend Gladstone
who is a male creature cat.
So that means all the
rest are female, right?
Wrong.
Wrong.
That does not mean that at all.
What it means is of
the 114 items that
have instance of cat,
only 43 have explicitly
sex male creature.
The rest of them do not.
Maybe because they have
sex female creature,
but maybe because they don't
have that property at all.
I'm emphasizing
this to kind of help
you train yourself to
correctly interpret
the results of
queries from Wikidata.
Don't jump into this kind
of simplistic conclusion,
OK there's 114 total, 43 male,
therefore the rest are female.
That is not correct.
OK?
But 43 of those explicitly
had another statement, sex
or gender, male creature.
So I just added
another condition,
and now my query is
asking two separate things
about the results.
They need to be a cat
and a male creature.
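The warning about not over-reading the 43 results can be
made concrete with a toy data set. This is a sketch with
made-up items, not real Wikidata content. P31, P21 and Q146
are the numbers named in the talk; Q44148, Q43445 and Q571
are assumed here to be the IDs for male creature, female
creature and book.

```python
# Toy stand-in for Wikidata items: each dict maps a property to a value.
# All items and labels here are invented for illustration.
MALE_CREATURE = "Q44148"    # assumed ID for "male creature"
FEMALE_CREATURE = "Q43445"  # assumed ID for "female creature"

ITEMS = [
    {"label": "cat A", "P31": "Q146", "P21": MALE_CREATURE},
    {"label": "cat B", "P31": "Q146", "P21": FEMALE_CREATURE},
    {"label": "cat C", "P31": "Q146"},   # no sex or gender statement at all
    {"label": "cat D", "P31": "Q146", "P21": MALE_CREATURE},
    {"label": "book E", "P31": "Q571"},  # assumed ID for book; never matches
]


def select(items, **conditions):
    """Keep items that explicitly have every given property with that value."""
    return [
        item for item in items
        if all(item.get(prop) == value for prop, value in conditions.items())
    ]


cats = select(ITEMS, P31="Q146")
male_cats = select(ITEMS, P31="Q146", P21=MALE_CREATURE)
female_cats = select(ITEMS, P31="Q146", P21=FEMALE_CREATURE)
unknown = [cat for cat in cats if "P21" not in cat]

if __name__ == "__main__":
    print(len(cats), len(male_cats), len(female_cats), len(unknown))
```

Subtracting the male cats from all cats does not give the
female cats here, because one cat has no P21 statement.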
AUDIENCE: Maybe we
should see how many
cats have Twitter accounts.
But there is a
question from YouTube,
which is will you talk about
the export possibilities
of the result of the query?
ASAF BARTOV: Absolutely.
Absolutely I will in
just a little bit.
I mean there is, in
addition to just getting
this kind of table, I can get
these results in other formats.
And I can also
download these results.
I can click the Download
button and get them
as a comma separated
file, tab separated
file, a JSON file, which is
useful for programmatic uses.
I can also get a link.
So I can get a
link to this query.
I mean, I spent all this time
designing this beautiful query.
I can get a short URL that was
generated especially for me
right now with a tiny URL.
I can just paste this
into Twitter and go,
hey people look at all the male
cats that Wikidata knows about.
OK, this is not a
very exciting query.
But once I get to a really
complicated exciting query
I can totally share that
very easily through this.
And we will get to more
interesting queries
in just a second.
Any questions on this kind
of basic querying so far?
OK.
So that was a very
simple example.
Let's spend a moment exploring.
So this cat Gladstone was
named after this dude, William
Gladstone, who was an
important British politician.
I'm sure he's not the
only thing out there
in the universe that's named
after Gladstone, right?
I mean there has got
to be, I don't know,
park benches,
planets, asteroids,
something other than the
cat, named after this guy.
So we can ask Wikidata
to tell us all the things
that, you know, without
saying instance of something.
Like, I don't know, anything
named after William Gladstone.
So how do I do that?
Same principle.
Instead of asking about the
property instance of, property
31, instead of that, I
will ask about the property
named after--
sorry, named after--
I don't need to
remember the number.
I have auto-complete.
Named after is property 138.
And I want anything
at all that is
named after this person,
William Gladstone.
Here we go.
Which is 160852.
Whatever.
OK.
You notice I removed
instance of cat.
I removed the male creature.
I'm only asking,
get me all the items
that are somehow named after
that particular politician.
And I run the query,
and it turns out
that Wikidata knows
about three such things.
Does that mean that's
the only-- these
are the only three things
named after him in the world?
Of course not.
But these are the only three
items that are in Wikidata
and explicitly have the
property named after Gladstone.
For all I know, there
may be a village
in England called Gladstone
named after this person.
But if nobody added the
property, named after, linking
to the person, it wouldn't show
up in the results to my query.
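As a sketch, the single-condition query just described
could be generated like this. P138 and Q160852 are the
numbers named in the talk; the function name and the
surrounding query text are assumptions based on the
standard Wikidata Query Service prefixes.

```python
def property_value_query(prop, value, lang="en"):
    """Build a query for every item that has `prop` with exactly `value`."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:{prop} wd:{value} .\n"
        "  SERVICE wikibase:label { "
        f'bd:serviceParam wikibase:language "{lang}" . }}\n'
        "}"
    )


if __name__ == "__main__":
    # Everything named after (P138) William Gladstone (Q160852).
    print(property_value_query("P138", "Q160852"))
```

There is no instance-of condition in this sketch, so any
kind of item can match, as long as it has that statement.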
So Wikidata knows about
three such things.
One of them is something
called the Gladstone Professor
of Government.
I can click through and see
that it's a chair at Oxford
University, right?
So it's a position.
And another is the William
Gladstone school number 18.
William Gladstone
school number 18.
Where is that?
That is in Sofia, Bulgaria.
Again.
All right, so that's a
particular school in Bulgaria
named after William Gladstone.
And finally, the third
result is, of course, our pal
Gladstone the Chief Mouser.
If I click through,
that's the cat.
All right, so that
was an example.
I mean, you saw how easy it was.
I just named the property and
the value that I care about,
and I get the results.
Again, I mean, it's
kind of a silly example,
but think about it.
This is-- how else can
you answer that question?
There's no reference desk,
even at the great University
of Oxford, where you can
walk in and say, give me
a list of things
named after Gladstone.
There's no easy way to
answer that unless you happen
to have a very large
structured and linked
data store, like Wikidata.
All right, so that
was a silly example.
Let's take some--
AUDIENCE: There's a
bunch of stuff on there.
ASAF: Oh, OK.
AUDIENCE: Can you show
easy query on the video?
And somebody needs to know
how to just do property
exists without giving
a specific value.
And then once you show easy
query you reload the page and--
ASAF: I don't know easy query.
So is that a gadget?
I don't know what easy query is.
I don't use it.
So someone can maybe
send a link or something?
Oh it is a gadget.
I don't have it enabled.
That is nice.
So now, what I just did by hand,
by formulating the query named
after Gladstone--
I guess this is the--
Is it?
Yeah.
So this-- I just
clicked the three--
the ellipsis here.
Right after the name.
You see this?
This was just added by
enabling easy query,
which I just learned about.
So you just click this
and it auto-magically
made this kind of trivial query.
Of course, if I want a more
complicated query like,
I don't know, give me
all the things that
are named after Lincoln
but are a school,
I will still need to kind
of edit a custom query.
But this is a super
easy and very nice
way of just doing a very super
quick query for exactly this.
Right?
Like, what other items have
exactly this property and value
named after William Gladstone?
So, thank you to whoever
made this suggestion
to demonstrate that, and
I'm glad I learned something
too today.
Let's move to
another sample query.
Here's a fun example.
Popular surnames among
fictional characters.
Think about that for a second.
Popular surnames among
fictional characters.
So we're asking Wikidata
to go through all
the fictional
characters you know,
and of those look through
their surnames, group
them so that you can count
them, the repetitions
of the surnames,
and give me the most
popular surnames among them.
Additionally, I want you to
awesomely present the results
as a bubble chart.
Oh, yeah.
Wikidata can do that.
And I run the query.
And check it out.
The most popular names
among fictional characters
that Wikidata knows about are
Jones, Smith, Taylor, et cetera.
I mean for all we know,
the most popular name
among fictional characters
actually in the world
may be Wu.
Or something in Chinese
for all we know.
But if that has not been
modeled in Wikidata,
we're not going to get that.
So Taylor, Smith,
Jones, Williams,
seem to be the
most popular names.
And again, I could limit this.
I could make the
same query but add,
only among works whose
original language
was Italian, for example, to get
more interesting results if I
only care about
Italian literature.
But this is an example of
how I got awesome bubble
charts for free, and
I can just plug this
into an awesome
presentation that I make.
Of course I can still
look at the raw table.
So the query still resulted
in a bunch of data, right?
So Smith repeats 41 times,
Jones 38 times, Taylor 34 times,
et cetera, et cetera.
And down that list.
And I could, again, I could
export this into a file
and load it up in a spreadsheet,
and do additional processing
on it.
I can link to it.
I can do all kinds of
awesome things with it.
So that's another awesome query.
We don't have to go into
every line by line analysis
here of why this
works the way it does.
I want to show you some
other queries first.
Let's look at-- this is just
fun, overall causes of death.
Again a bubble
chart just looking
at people who died
of things, and have
a cause of death listed.
And we learn that the most
commonly listed causes of death
are myocardial infarction,
pneumonitis, cerebral vascular,
lung cancer, et
cetera, et cetera.
And again, in a bubble chart.
And so how does that work?
So just very briefly, the
important parts of this query
are I'm looking for something,
for some person, who
has property 31, instance of,
with value Q5, which is human.
So a human.
Again, just to kind
of limit the query.
I'm not interested in
books or mountains.
I'm looking for humans
who have that same person,
that same variable PID,
should have a 509, meaning--
Hello.
Why don't I have the--
Yeah.
A 509, which is cause of death.
And that cause of death
is another variable,
that I'm calling CID.
Now, previously
we were saying you
know I want things
that are named
after Gladstone specifically.
Only things that have
that particular value.
Here I'm saying I'm
looking for things
that have some cause of death.
Not a specific one.
I just wanted to
get everything that
has a statement with some
value about property 509
cause of death.
OK?
And then this other bit of
magic here, the group by,
tells Wikidata I'm not
actually interested
in every individual thing.
I want you to group those
causes, and then count them
and give me the top ones.
So that's how this query works.
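What the GROUP BY step does can be sketched in plain Python
on made-up data. The variable names pid and cid follow the
talk; the sample people and causes are invented, and the
SPARQL in the comment is an assumed reconstruction, not the
text from the screen.

```python
from collections import Counter

# Assumed shape of the query described in the talk:
#   SELECT ?cid (COUNT(*) AS ?count) WHERE {
#     ?pid wdt:P31 wd:Q5 .      # instance of human
#     ?pid wdt:P509 ?cid .      # has some cause of death
#   } GROUP BY ?cid ORDER BY DESC(?count)

# Invented (pid, cid) pairs standing in for matched statements.
MATCHES = [
    ("person1", "myocardial infarction"),
    ("person2", "lung cancer"),
    ("person3", "myocardial infarction"),
    ("person4", "pneumonia"),
    ("person5", "myocardial infarction"),
    ("person6", "lung cancer"),
]


def top_causes(matches, limit=10):
    """Group by cause, count each group, and return the most common first."""
    counts = Counter(cid for _pid, cid in matches)
    return counts.most_common(limit)


if __name__ == "__main__":
    for cause, count in top_causes(MATCHES):
        print(count, cause)
```

The individual people disappear from the output; only each
cause and the size of its group remain.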
Here's that query I promised.
Painters whose fathers
were also painters.
I can only think of a couple.
I mean, Monet and Vogel.
But I'm sure Wikidata
knows many more.
So let's run this query.
And I have 100 results.
By the way, I have limited
it to 100 results just
to keep it kind of snappy.
But actually, we could
maybe try removing the limit
and see if Wikidata
could tell us
the total number in Wikidata.
Yeah, that wasn't too bad.
So 1,270 results.
OK.
Wikidata, already at this
early date in its progress,
already knows about
more than 1,200 painters
who are sons of painters.
Sons of male painters, like
their father is a painter.
There may be
additional painters who
are sons of female painters
not included in this query.
Again, always remember what
exactly you are asking.
In this query I was
asking about the father.
I'm leaving out any
possible painters who
are sons of mother painters.
OK?
So how does this work?
I'm asking for the painter
along with the human label,
and the father along
with the human label.
So Michel Monet is the
son of Claude Monet.
And Domenico Tintoretto is the
son of the famous Tintoretto
whose label, you know, is just
Tintoretto like Michelangelo.
You know, you don't always
have to have the full name
in the common label.
Paloma Picasso is the
daughter of Pablo Picasso.
OK.
So Wikidata knows about
all these results.
Of course Holbein the Younger
son of Holbein the Elder.
And how did we get there?
Well we asked Wikidata
to look for something,
let's call it painter, which
has 106, which is occupation,
with a value painter.
Right?
This unwieldy number
1028181, that's painter.
So I'm asking for any item
that has occupation painter.
And let's call
that item painter.
I also want that painter to have
a property 22, which is father.
OK.
Father.
And I want it to
have some value.
OK, I'm putting it into
another variable called father.
I could have called
it, you know, frog.
That doesn't change
anything, just to be clear.
What matters is that this
is the property father.
I could have called
it anything I want.
So, and then, I have
a third condition.
That the father, like whatever
it says here in property 22,
I want that father to have
himself a property 106
occupation with a value painter.
OK?
These conditions
combine to give me
a list of people who have
a father and that father
has occupation painter as well.
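The three conditions can be sketched as a tiny join over
made-up data. P106, P22 and Q1028181 are the numbers named
in the talk; the people below and the ID Q_OTHER are
placeholders, and the query in the comment is an assumed
reconstruction.

```python
# Assumed shape of the three conditions:
#   ?painter wdt:P106 wd:Q1028181 .   # occupation: painter
#   ?painter wdt:P22  ?father .       # has some father
#   ?father  wdt:P106 wd:Q1028181 .   # the father is a painter too

PAINTER = "Q1028181"

# Invented records: name -> occupation (P106) and father (P22).
PEOPLE = {
    "child1": {"P106": PAINTER, "P22": "father1"},
    "father1": {"P106": PAINTER},
    "child2": {"P106": PAINTER, "P22": "father2"},
    "father2": {"P106": "Q_OTHER"},   # placeholder for a non-painter job
    "child3": {"P106": PAINTER},      # no father statement
}


def children_of(people, child_job, father_job):
    """Return (child, father) pairs satisfying all three conditions."""
    pairs = []
    for name, props in people.items():
        father = props.get("P22")
        if props.get("P106") != child_job or father is None:
            continue
        if people.get(father, {}).get("P106") == father_job:
            pairs.append((name, father))
    return pairs


if __name__ == "__main__":
    print(children_of(PEOPLE, PAINTER, PAINTER))
```

Changing the two job arguments is the equivalent of
swapping the occupation values in the query.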
Of course, if I suddenly,
or if you suddenly,
are consumed by
curiosity to know
who are some politicians
who are sons of carpenters?
You could just
change that, right?
Change the first value
from painter to politician.
Change the third line's value
from painter to carpenter.
Maybe that list
will be very short
because carpenters don't
tend to be notable,
so they wouldn't be
represented on Wikidata.
That's why this works relatively
well with painters, right?
Because most of
them are notable.
But generally you
could do that, right?
That's an example of
how you can take a query
and just replace one of those
values, or even the language.
So again, I could ask
for these same painters.
It's limited again.
These same painters,
but with Arabic labels.
Same query, but I have Arabic
labels for these painters.
And of course where
there is no Arabic label
I get the Q number.
OK?
So that's that query
that I promised you,
painters who are sons of painters,
can be done by Wikidata
in under one second.
How awesome is that?
We can also get some statistics.
So how about counting
total articles
in a given wiki by gender.
This is what we call
the content gender
gap, as distinct from the
participation gender gap.
This is the gender gap in
what we cover on Wikipedia.
So let's take one of these.
So this is a query.
Articles about women in
some given Wikipedia.
All right.
So let's take--
I don't know.
Let's take the Tamil Wikipedia.
That's language code TA.
So I just put TA here.
And I click Run, and
I get this count.
That's all I wanted.
I'm not actually
interested in the items,
like in the list of women
on the Tamil Wikipedia.
I just want the number.
So I selected the count here.
And this number
turns out to be 2,159.
So there are about 2,000
articles about women
in the Tamil Wikipedia that
Wikidata knows to be female.
Right?
I'm asking about the gender
field, property 21 again.
Remember, if there's some
article about a woman in Tamil
Wikipedia, but Wikidata
doesn't have
a statement about the
gender, that person
will not be counted here.
So again, be careful
about kind of stating
that this is exactly the number
of articles about women on Tamil
Wikipedia.
That's probably not true.
I'm sure some of those
articles are missing
a sex or gender property.
But for raw statistics,
that's probably good,
because some men are also
missing the sex or gender
property.
So we could take the
same query for men.
It's essentially the exact same.
It just has this unwieldy
number for males, 6581097.
I can change this language
code again to TA for Tamil.
And how many men are covered
on Tamil Wikipedia? 14,649.
OK.
So women, 2,100, men,
about seven times as many.
Right?
So that's the approximate
size of the content gender
gap on Tamil Wikipedia.
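The arithmetic behind "about seven times as many", using
the two counts read out above. The function name is made up
for illustration, and the numbers are only as complete as
the sex or gender statements on Wikidata.

```python
def gap_ratio(men, women):
    """How many articles about men exist per article about women."""
    return men / women


TAMIL_WOMEN = 2159   # count reported in the talk
TAMIL_MEN = 14649    # count reported in the talk

ratio = gap_ratio(TAMIL_MEN, TAMIL_WOMEN)
share_women = TAMIL_WOMEN / (TAMIL_MEN + TAMIL_WOMEN)

if __name__ == "__main__":
    print(f"{ratio:.1f} articles about men per article about women")
    print(f"{share_women:.1%} of the counted articles are about women")
```

Running the same two counts for another language code, or
with an extra occupation condition, gives a comparable ratio.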
And again, I can complicate
this query as much as I want.
For example, I can
try and find out
if this gender gap is wider
or narrower among musicians,
just as an example.
I could just add a line here
that says occupation musician,
and then I'm only
counting articles
on Tamil Wikipedia about
musicians who are female
versus articles
on Tamil Wikipedia
about musicians who are male.
And I can kind of
compare the gender--
the content gender gap across
occupations on Tamil Wikipedia.
Do you see the
important point here?
It's that this is not just
kind of a one-purpose query.
With just a single
additional condition I can suddenly
make it a much more interesting
query, because I break it down
by occupation.
Or I break it down by century.
Do we have more of the coverage
gap in 19th century people
than in 21st century people?
I mean, I sure hope so, right?
The patriarchy is
weakening somewhat.
So I wouldn't be surprised if
there are many more notable men
covered from the 19th century.
But if we are also covering--
I mean, if the
gender gap is just
as wide for 21st century
people, that would
be a little disappointing.
Again that's something I
can fairly easily find out
on Wikidata query.
Any questions so far, or
are you just sharing links?
AUDIENCE: Yep there is one.
So somebody is wondering if you
can demonstrate, or at least
give a short answer of the
latter of this question.
Is it possible, using
Wikidata SPARQL,
to find specific
Wikipedia articles, e.g.
featured articles, of a
certain language which do not
exist in another language.
I know it is possible
to find category based
results using the PetScan tool.
But can we specify
that by selecting e.g.
featured articles?
ASAF BARTOV: Yes.
Excellent question.
It is possible, indeed.
And I will demonstrate
one such query.
Another query that
I already mentioned
largest cities in the
world with a female mayor.
This query-- let's
close some of these tabs
before my browser chokes.
So this query lists
the major world cities
run by women currently.
And the answer is Mumbai, Mexico
City, Tokyo, bunch of others.
And wait-- that's not it at all.
I clicked the wrong one.
That's the map of paintings.
OK.
Let's demonstrate
that for a second.
So this is the map
of all paintings
for which we know a location
with the count per location.
And the results are
awesomely presented on a map.
OK.
Again, under the hood this is
a table, of course, of results.
But, awesomely, I can
browse it as a map.
So here is a map of the
world with all the paintings
that Wikidata knows about.
Not just knows
about the paintings,
but knows about their
location in a museum.
Not surprisingly
Europe is much better
covered than Russia or Africa.
There is a huge gap in
contribution to Wikidata
from these countries.
And some of it can be fixed.
And of course there is much more
documentation, and much more
art in Europe.
But if we zoom in, I
don't know, Rome probably
has a few paintings.
Right?
Hello.
Sorry.
It's-- Yes.
Vatican City sounds
like a good bet, right?
I can zoom in here.
And I can just click
one of these dots
and see in this point
there are two paintings.
And in this one there is one
and it's the Archbasilica
of St. John Lateran.
Let's see, this is the
actual St. Peter, right?
Sistine Chapel has 23 paintings.
What?
The Sistine Chapel has way
more than 23 paintings.
Correct, but 23 of them
are documented on Wikidata.
Have their own item
for the painting, not
the Sistine Chapel,
the painting has
an item that lists its
being in the Sistine Chapel.
There are 23 of those.
OK.
There is definitely
room to document
the rest of the artworks
in the Sistine Chapel.
So, again, this is just
not the kind of query
you were able to
make before Wikidata,
and it's a fairly simple
query, as you can see.
There are examples using
maps like airports within 100
kilometers of Berlin.
Again using the coordinates
as a useful data point.
And here is a map showing me
only airports within a 100
kilometer radius from Berlin.
But I wanted to show
you the mayors query.
Let's click the-- oh I just
have the wrong link here.
But I can still find it
here by typing mayor.
Here we go, largest
cities with female mayor.
So this is a slightly
more complicated query.
But if I run it, I get the top
10, because I set limit to 10.
I get the top 10
cities in the world,
by population size, that
are currently run by women.
Tokyo, Mumbai, Yokohama,
Caracas, et cetera.
And one interesting thing that
you may want to notice here
is that I'm asking for cities.
I mean items, that
are instance of city.
And that have a
head of government,
that have some
statement about who
is in charge, and that statement
has sex that's listed up here
as female.
Don't worry about
the syntax right now.
I just want to show you
some specific angle here.
And I'm further
filtering these results.
I only want those items where
the statement does not have
the qualifier end time.
Why is that important?
Because if a city once
had a female mayor,
but that mayor is not the mayor
anymore, because mayors change,
I don't want them in this query.
I want a query of
cities currently having
a female mayor.
And of course Wikidata
may have historical data
with start and
end time, as we've
seen, that documents this
person was the mayor of Tokyo
or San Francisco
between these years.
But if there is no
end time, that means
they are currently the mayor.
So that's an example of
asking about a qualifier
of a statement, to again, to get
the results we actually want.
If we want current mayors it's
important to put this filter.
If we don't, we will get
historical female mayors
as well.
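The "no end time" filter can be sketched on made-up
statements. P6 (head of government) and P582 (end time) are
assumed here to be the Wikidata IDs involved, and the SPARQL
in the comment is an assumed reconstruction; the cities,
names and years are invented.

```python
# Assumed SPARQL shape for the filter:
#   ?city p:P6 ?statement .
#   ?statement ps:P6 ?mayor .
#   FILTER NOT EXISTS { ?statement pq:P582 ?end . }

# Invented head-of-government statements with their qualifiers.
STATEMENTS = [
    {"city": "City A", "mayor": "Mayor 1", "start": 2010, "end": 2014},
    {"city": "City A", "mayor": "Mayor 2", "start": 2014},  # still in office
    {"city": "City B", "mayor": "Mayor 3", "start": 2001, "end": 2009},
]


def current_heads(statements):
    """Keep only statements that have no end-time qualifier."""
    return [s for s in statements if "end" not in s]


if __name__ == "__main__":
    for s in current_heads(STATEMENTS):
        print(s["city"], s["mayor"])
```

Without the filter, all three statements would come back,
including the two historical ones.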
All right.
So these are some
example queries.
Questions about that?
Oh, the featured
article example.
So let's look at that.
So I have prepared
such a query recently.
Here we go.
So this is a query.
I just saved it here
on my user page.
I mean, this is
not Wikidata query.
This is just a meta page
containing the query usefully.
And let's run this.
So this query, it's actually
not very complicated.
It just has a long
list of countries,
because I'm asking
about African countries.
OK.
I'm looking for human
females from one
of these countries that
have an article in English.
That's what this line means.
But not in French.
That's what this part means.
OK.
This part, these
two lines together.
But not in French.
And this is what's
called a badge.
That's Wikidata's concept of
good and featured articles.
It's called a badge.
So I want them to have some
badge on English Wikipedia.
OK?
So again, this query is
asking for the top 100 women
from Africa who are documented
on English Wikipedia,
in a featured or
good article status.
But not on French Wikipedia.
So this is a query that's
a to-do query, right?
That's a query
for French editors
to consider what they might
usefully translate or create
in French.
And if we run this, we see
we have three results.
I mean, we have many
women from Africa
covered on English Wikipedia.
But only three articles
have featured or good status
among those that do not have
French Wikipedia coverage.
Let me rephrase that.
Among the English Wikipedia
articles about African women
that don't have a
French counterpart,
only three are featured or good.
OK?
Do you see this?
The badge is good article.
This little incantation
here is what allows
you to ask about the badge.
This here.
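The logic of this to-do query, sketched on invented data:
an English article that carries a badge, and no French
article. The SPARQL in the comment is an assumed
reconstruction of the two key parts (sitelinks via
schema:about and schema:isPartOf, badges via
wikibase:badge), not the text from the screen.

```python
# Assumed SPARQL shape for the two key parts:
#   ?enArticle schema:about ?item ;
#              schema:isPartOf <https://en.wikipedia.org/> ;
#              wikibase:badge ?badge .
#   FILTER NOT EXISTS {
#     ?frArticle schema:about ?item ;
#                schema:isPartOf <https://fr.wikipedia.org/> .
#   }

# Invented items: sitelinks per language, plus badges on those sitelinks.
ITEMS = [
    {"id": "item1", "sitelinks": {"en", "fr"}, "badges": {"en": "good article"}},
    {"id": "item2", "sitelinks": {"en"}, "badges": {"en": "featured article"}},
    {"id": "item3", "sitelinks": {"en"}, "badges": {}},
    {"id": "item4", "sitelinks": {"fr"}, "badges": {}},
]


def translation_todo(items, source="en", target="fr"):
    """Items with a badged article in `source` and no article in `target`."""
    return [
        item["id"]
        for item in items
        if source in item["sitelinks"]
        and source in item["badges"]
        and target not in item["sitelinks"]
    ]


if __name__ == "__main__":
    print(translation_todo(ITEMS))
```

Swapping the source and target arguments would give the
to-do list in the other direction.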
And, by the way, the slides
will be uploaded to Commons.
And we will-- how shall we make
it available on the YouTube
thing as well?
No, no.
But, I mean, for people who
will later watch this video.
Oh yeah, we can add it to
the YouTube description
and the Commons description.
So in the-- if you're
watching this video later,
in the description, we will
add a link to this query
specifically.
Because it's not in
the slides right now.
It will be.
OK.
So.
Questions so far?
We're almost done.
We have a few minutes left.
So questions about queries?
I mean, I'm sure
there's tons of things
you don't know how to do yet.
And maybe you didn't really
get the sense for SPARQL.
It's something you need
to really do on your own
on your computer.
See how it works.
Fiddle with it.
Change something.
See that it breaks
and complains.
But, very importantly-- oh I
had this in the other questions
slide.
Remember Wikidata project chat.
That's kind of the Wikidata
equivalent of the village pump.
It's the page on Wikidata
where you can just
show up and ask a question.
In my experience, the
Wikidata community
is very nice, very
welcoming, and very eager
to help newer people integrate
and learn how to do things.
There's also an IRC channel.
If you know what IRC is and
how to use it, by all means,
go to the IRC channel #wikidata.
There's people
there all the time,
and you can just ask a question.
If you're trying to do a
query, and you don't quite
understand the syntax, or you're
not sure how to get the result
you want.
There are people there who
will gladly help you do that.
There is also a
Wikidata newsletter
published by the Wikidata team,
which is centered in Germany
at Wikimedia Germany.
And they send out a newsletter
in English with Wikidata news.
You know, new
properties, new items,
new things in the project.
But also sample queries.
So once a week there is
kind of an awesome query
to learn from, if you want
to learn that way instead
of reading like a
whole manual on SPARQL.
So I'm just encouraging
you to get help
in one of those channels.
Of course you can write to me.
Just reach out to me and
ask me questions as well.
I hope by now you agree
that Wikidata is love,
and Wikidata is awesome.
If there are no questions,
we do have a tiny bit of time
to demonstrate one
more tool but that's--
no?
No questions.
OK so let's talk about--
well, the Reasonator
is kind of nice,
but it's a little like
the article placeholder.
So this is not Wikidata
this is a tool again
built by Magnus Manske--
AUDIENCE: There's also one
final question to you in case--
ASAF BARTOV: Oh,
there is a question.
AUDIENCE: Yeah.
ASAF BARTOV: What are the
advantages and disadvantages
of creating an item
before an article is
done on English Wikipedia?
Well, I mean, this example
that I just made right.
I'm reading this book
by a notable author.
OK.
I want this to
exist on Wikidata,
and to be mentioned
on Wikidata, so
that when people look up
that author in Wikidata
they will know about one
of his notable works.
But I'm not prepared to
put in the time investment
to build a whole article
on English Wikipedia.
Either because I don't
have the time, or I
don't have good sources.
Or maybe my English
is not good enough,
but it is good enough to just
record these very basic facts
and point to the Library of
Congress records et cetera.
So that it's better
than nothing.
So that's one reason
to maybe do it.
Another reason is to
be able to link to it.
So remember that
translator lady already
had an item on Wikidata, but if
she hadn't we could have just
created a very, very basic
rudimentary item about her just
saying, you know,
this name is human.
Country, Bulgaria.
Occupation, translator.
Even just that
would have been something,
and would have enabled me
to link to this person.
So these are legitimate reasons
to create Wikidata entities
without, or at least before,
creating a Wikipedia article.
If you are going to create--
I mean, if you're at an
edit-a-thon or something,
and you have come to
create Wikipedia articles,
by all means, first create
the Wikipedia article,
then create the Wikidata
item and link to it.
I hope that answers
the question.
So the Reasonator
is simply a kind
of prettier view of
items in Wikidata.
So you can just type the name
of an item or the number.
Let's pick just a
random number, 42.
Say 42.
Which happens to
be, maybe you've
heard of this guy,
Douglas Adams.
He happened to have received
the Q number 42.
I'm sure it's a
cosmic coincidence
of infinite improbability.
And this is a view--
this is a tool that
is not Wikidata.
It's a tool built on top of
Wikidata called Reasonator.
And it gives us the information
from Q42, that is from the--
this item in Wikidata, which
looks like an item in Wikidata.
But it gives it to us in a
slightly more rational kind
of layout.
It even kind of
generates a little bit
of pseudo article text for us.
You know, Douglas Adams was
a British writer, playwright,
screenwriter,
bla-bla-bla, an author.
He was born on this date, in
this place, to these people.
He studied at this place
between these years.
That's all machine generated.
Nobody wrote this text.
That's all taken from those
statements in Wikidata,
and generates this reasonable
reading summary paragraph.
And then it gives us this
little table of relatives.
It's all taken from Wikidata.
But as you can see,
this is already
a little more accessible than
the essentially arbitrary
ordering of statements
on Wikidata.
And that's OK.
I mean, that's
kind of by design.
Wikidata is the platform.
There is going to
be-- there are going
to be many new applications,
and platforms, and tools,
and visual interfaces
on top of Wikidata
to browse Wikidata in more
friendly or more customized
ways.
For example, one of the
things that Reasonator
does for us is give us pictures
and maps and a timeline.
Check this out.
A timeline, machine generated,
just from dates and points
in time, mentioned in the
relatively rich Wikidata
item about Douglas Adams.
Right?
So this timeline, again,
is completely machine
generated.
But he was educated
between these years,
so I can put it on the timeline.
And this is the year he was
nominated for a Hugo Award,
so I can put that on the timeline.
Et cetera.
So that's just a super
quick demonstration
of that tool, the Reasonator.
Links are all here
in the slides.
And the final tool I wanted
to mention very quickly
is the Mix'n'match tool.
You remember my explanation
about Wikidata as a nexus,
as a connection point between many
databases, many data sources.
Those depend on
these equivalencies,
on Wikidata being taught
that this item is the same as that
ID in this other database.
And Mix'n'match is a tool,
again, by Magnus Manske.
Maybe you're detecting
a pattern here.
It's a tool by Magnus
that is designed
to enable us to kind
of take a foreign,
an external data set, put
it alongside Wikidata,
and kind of try and align them.
So this item in this
external data set,
is that already
covered in Wikidata?
If so, by what Q number?
By what item?
If not, maybe we need
to create a Wikidata
item to represent it.
Or maybe it's a
duplicate, or something.
So the Mix'n'match tool has
a list of external data sets,
as you can see.
The Art and Architecture
Thesaurus by the Getty Research
Institute.
Or the Australian
Dictionary of Biography.
All kinds of external
data sets here.
Somewhere here I had a specific
link to the Royal Society.
It can also give
me some statistics.
So there is an external data set
of all the Fellows of the Royal
Society.
Right?
The oldest academic
learned society in England.
And the internet is tired.
Here we go.
Nope.
Did that work?
Fellows of the Royal
Society, here we go.
So this one is complete.
I mean, people have manually
gone over every single item
there and either
matched it to Wikidata
or declared that it was not
in scope, or a duplicate
or whatever.
But let's look at site stats.
This is a fun kind of
aspect of this tool.
But that is not working.
Or it's taking too long.
So let's just demonstrate
how this works.
Maybe Britannica?
Is that done already?
Here we go.
Encyclopedia Britannica.
Yeah.
So for the Encyclopedia
Britannica,
40% of the items there
are not yet processed.
So let's process one of them.
For example, there is an item
in the Encyclopedia Britannica
called Boston, England.
As you know,
all American place names
are totally stolen
from elsewhere.
So there is a Boston
in England, though it's
no longer the famous one.
And the Mix'n'match
tool has automatically
matched it, based on
the label, to
Q100, which is Boston, the big
city in the United States.
And that is incorrect, right?
That's kind of a naive computer
going, well, this is Boston,
and this other thing
is also Boston.
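As a rough sketch of why a label-only match goes wrong here: the code below is not Mix'n'match's real matching logic, just an assumed simplification. The tiny `wikidata_items` table and the functions `base_label`, `naive_match`, and `candidates` are hypothetical names made up for this example.

```python
# Hypothetical miniature of Wikidata: Q number -> (label, description).
# Only two items, typed by hand for illustration.
wikidata_items = {
    "Q100": ("Boston", "big city in the United States"),
    "Q311975": ("Boston", "town in Lincolnshire"),
}


def base_label(name):
    """Drop any qualifier after a comma: 'Boston, England' -> 'boston'."""
    return name.split(",")[0].strip().lower()


def naive_match(external_name, items):
    """Return the first Q number whose label matches, ignoring context."""
    target = base_label(external_name)
    for qid, (label, _description) in items.items():
        if label.lower() == target:
            return qid
    return None


def candidates(external_name, items):
    """Return every Q number with a matching label, for a human to review."""
    target = base_label(external_name)
    return [
        qid
        for qid, (label, _description) in items.items()
        if label.lower() == target
    ]


print(naive_match("Boston, England", wikidata_items))
print(candidates("Boston, England", wikidata_items))
```

The naive matcher stops at the first item labeled Boston, which in this toy table is Q100. Listing all the candidates shows that the label alone cannot decide, so a person has to confirm or reject the suggestion.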
And it is asking me to
confirm this match or not.
You see?
So this is the Boston,
England from Britannica.
And the tool is asking
me, is this the same as
Boston, Q100, in America?
The answer is no.
I removed this.
I remove this match.
And now this Boston,
England is unmatched.
And I can match it to the
correct one in England.
I can do this by searching
English Wikipedia,
or searching Wikidata.
I mean, it has
these handy links.
So the English town
is in Lincolnshire.
Boston, Lincolnshire.
So I can go there and then
get the Wikidata item number.
See, this is not
Q100, Boston in the States;
this is Q311975,
town in Lincolnshire.
I can get this Q
number, go back to the
Mix'n'match tool--
Where was that?
Here we are.
And set Q.
I can tell the tool that this is
the right Boston, and click OK.
And now this town
in Lincolnshire,
you can see this here,
this item, Q311975,
is linked to Britannica.
What does this mean?
Well, if we go there,
if we actually go
to the Wikidata
entity, you will see
that in addition
to the few statements that
it already had, it now has,
thanks to my clicking, it now
has another identifier here.
See?
Encyclopedia Britannica
Online ID, with this link.
And if we click it, we
will indeed reach this page
in the Britannica
online, which is indeed
about this town in Lincolnshire.
You see?
So I've contributed one
of those mappings, one
of those identifiers,
into Wikidata.
And I didn't have
to do it manually.
This tool kind of prompted
me to confirm the match.
If it was correct,
I could have just
clicked confirm. Since
it wasn't correct,
I corrected it manually, but
it made this edit on my behalf.
So that's another tool that
encourages us to systematically
teach Wikidata more things.
And we're out of time.
Go edit Wikidata. Now
that you have the power,
you know the deal.
Use it for good,
and not for evil.
If you have questions,
this is my email address.
If you're watching this video
not live, the description
will have links to the
slides, and to a bunch
of other useful
pieces of information.
Any last questions on IRC?
If not, thank you
for your attention.
And if you like this, and if you
feel that you now get Wikidata,
and you get what it's
good for, and you're
inspired to contribute, I have
only one request from you.
I mean, in addition to using
it for good, not for evil,
I ask that you spread the word.
Show this video--
share this video
with other people in your
community, or around you.
Teach this yourself
once you're comfortable
with these concepts.
Feel free to use my slides.
Yeah, and edit Wikidata.
Thank you very
much, and goodbye.