-
(moderator) Good afternoon, everybody.
We're about to start.
-
I'm presenting to you John Samuel
-
who works at the French
engineering school CPE,
-
based in Lyon in France.
-
And he will tell us something more
-
about the translation
of properties in Wikidata.
-
As you know,
as is the case in all sessions,
-
there is an etherpad
for collaborative note-taking.
-
Please don't forget that.
-
We'll have the presentation
-
and then we'll have
some time for a short Q&A.
-
- The floor is yours.
- (John) Thanks, [inaudible].
-
Thank you all for coming here.
-
So my talk is about analyzing
translation of Wikidata properties.
-
So just to give you a quick outline.
-
I would like to introduce this topic.
-
I will present a tool
that I developed some years ago,
-
called WDProp,
which I'm continuously working on,
-
and based on the feedback
from the community,
-
I add new features.
-
And then I will talk about
something called coarser analysis,
-
where I would like to look
at the property translation,
-
from a much larger picture.
-
So I will talk about
how we collected this data,
-
because this work is also done
with one of my students, Thibaut Chamard.
-
And then I will present some results,
and finally, I will conclude the talk.
-
So Wikidata, as you all know,
it started in 2012,
-
and it's a free, open, linked,
structured, collaborative,
-
and multilingual knowledge base.
-
My focus today
is on the multilingual part,
-
because there is a big change
from the traditional way
-
of how we used to edit on Wikipedia sites.
-
There were multiple subdomains,
-
and now you have a single domain
on Wikidata
-
where multilingual contributors come
and write or create articles.
-
So this is collaborative.
-
There has been work to say
what exactly is collaborative,
-
why it is collaborative.
-
I have given references for these works.
-
So this is, if you see Wikidata,
-
everything
starts from the property.
-
The property is proposed
and then discussed and voted on.
-
And then it is created
and finally translated,
-
and then you are finally
able to use these properties.
-
But these properties may also be deleted--
-
there's also something called deletion.
-
But, as I highlighted on this slide,
-
my focus is on the multilingual aspect,
-
and the property creation
and translation point of view.
-
So you have been here
for the past two days,
-
and by this time
you have seen many articles,
-
and I just want to point out
what I am looking for on a Wikidata item.
-
This is a Wikidata item,
-
so you have this Q2841, which is Bogotá,
-
which is the capital city of Colombia,
-
and you have four parts here:
-
the languages, the labels,
the description, and aliases.
-
So you can see,
for different languages
-
you'll have the label,
you have the description
-
as well as any aliases
("also known as"), if there are any.
-
And this, under the city,
where you see the labels
-
and the properties together.
-
This is Avignon, a city in France.
-
So what I'm interested in
is only the properties part.
-
For example, official name, native label,
country, capital of, et cetera.
-
So when I say property,
for example, if a country,
-
in this country,
I'm looking at different aspects:
-
the language, the label,
and the description,
-
and see how things change.
-
For example, if you take instance of--
-
okay, everybody knows instance of,
you have been using it quite a lot--
-
this is P31, you see
the number of aliases in English
-
for the property P31, instance of,
-
and then you would find
that these types of properties
-
are created after discussion
with the community.
-
So if I take the complete prop--
the procedure,
-
what happens to creation of properties--
-
you start proposing properties
with some possible translation.
-
It is important it's not just in English.
-
You have the templates
to suggest your properties
-
in your local language.
-
So that's why it's a proposition
with possible translation.
-
And then you put it to discussion,
then it is put to voting,
-
and it's created, and then finally,
the community members start translating it
-
and people put it into use.
-
But then you cannot be guaranteed
the properties that are created
-
are always there forever.
-
Properties can be deleted,
just like items can be deleted.
-
But then, again,
it goes through a similar procedure.
-
You put the property
-
as you propose that it should be deleted,
-
and if the community decides it,
it votes it, and then if it is decided--
-
the majority vote
has decided to delete it--
-
we deprecate the property,
and finally we delete this property.
-
So for today's talk, I'm mostly interested
in the translation part.
-
So where are the translations happening?
-
First, the translation would happen
at the proposition part,
-
and then you could find that,
at the time of creation,
-
the person who creates the property
can use the exact names
-
that were suggested
by the property proposer
-
and he or she will create the properties,
-
and later, you start translating
these properties.
-
So let us look at why this matters,
why it is important.
-
So I put some examples.
-
This is, again, on P31,
-
instance of, the very, very famous
property P31,
-
and you see there is
no description for this item.
-
There are almost
six descriptions on this image,
-
where we do not have any description.
-
Again, some more description
for Odia and Punjabi,
-
there is no description.
-
This is a property
which is used quite a lot,
-
and you see that there is
no description for it.
-
And there is a surprising part
that you could also have cases
-
where there are descriptions,
but there are no labels.
-
For example, Russian,
that has been shown here,
-
again on property P31,
there is a label that is missing.
-
So this was the initial
inspiration for this work
-
when I started working
on property analysis.
-
I wanted to look at
what aspects of properties,
-
or what aspects of property
-
that the whole flow chart
that we have seen,
-
is multilingual.
-
So I wanted to look at,
-
okay, we know that Wikidata
is multilingual,
-
and it's collaborative,
that has been done.
-
But are we really able to achieve
a truly multilingual experience?
-
That was the question
behind the creation of WDProp.
-
So you may ask
why there are so many people
-
who have worked on items,
there are people who have worked on--
-
users, multilingual users
and bots, et cetera,
-
why you want to focus on properties?
-
The answer is,
I want to focus on properties
-
because it's much, much
less influenced by bots.
-
You may have heard today or yesterday,
-
many people said,
"Okay, if you have translation
-
in your local languages,
and it has reached a very good number,
-
you should ensure
what type of translation it is.
-
Is it just bots, which copies
the name of a person to another language.
-
Then is it really translation?"
-
Okay, that's debatable.
-
But, of course,
there is an influence by bot,
-
but in case of properties,
there is not so much influence by bots,
-
and that is a good part.
-
That's why I focus on the properties part.
-
So, as I said, when WDProp was created,
-
it was to understand every aspect--
the proposal, the creation, translation.
-
What are the templates that are available.
-
Are these templates,
for example, you said support,
-
if a French person opens Wikidata,
a Wikidata France translation page,
-
can he see the word, [soutien],
for that particular property proposal?
-
Is it possible?
-
So these types of things were needed.
-
In the end, it was also
about giving real-time statistics
-
to the multilingual contributors.
-
It's not about one time,
-
it's like you just made it
and published for one time-- no.
-
You want people
to get this data in real time.
-
So what are we doing?
-
So the goal of WDProp
was to understand everything
-
about Wikidata properties.
-
So, label, aliases, description.
-
So you have got all these three translated
so the middle part where you say,
-
this property is completely usable
because all the three aspects
-
have been translated.
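[Editor's sketch] The "completely usable" test described here can be written down as a small check. This is only a minimal sketch, assuming the property is available as a dictionary in the shape of the Wikibase entity JSON (`labels`, `descriptions`, `aliases` keyed by language code). The function names and the sample data are hypothetical and are not taken from WDProp.

```python
def translation_status(entity, lang):
    """Report which of the three translatable parts exist for one language."""
    return {
        "label": lang in entity.get("labels", {}),
        "description": lang in entity.get("descriptions", {}),
        # An empty alias list counts as "no aliases".
        "aliases": bool(entity.get("aliases", {}).get(lang)),
    }


def is_fully_translated(entity, lang):
    """True only when label, description, and aliases are all present."""
    return all(translation_status(entity, lang).values())


# Illustrative, trimmed-down entity; not a real export of P31.
sample = {
    "labels": {
        "en": {"language": "en", "value": "instance of"},
        "fr": {"language": "fr", "value": "nature de l'élément"},
    },
    "descriptions": {
        "en": {"language": "en", "value": "example description"},
    },
    "aliases": {
        "en": [{"language": "en", "value": "is a"}],
    },
}

print(translation_status(sample, "fr"))
print(is_fully_translated(sample, "en"))
```

In this sample, French has a label but no description or aliases, which is the kind of partial translation the talk points out.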
-
So let me just show you quickly,
what is this WDProp,
-
what I'm talking about.
-
So this is the WDProp,
-
it's available on
tools.wmflabs.org/wdprop/.
-
So you have a lot of statistics
and if I ask you some questions today,
-
like, for example,
"How many data types are there
-
that are supported by Wikidata right now?"
-
So for such questions, we do not know,
-
sometimes because there are new data types
that keep on coming.
-
So this data,
this is generated in real time,
-
this creates the data structure
and it will give you the answer.
-
How many languages are there?
-
Yes, of course,
see that there are 313 languages.
-
And then, for example,
how many labels were translated.
-
So you could see
that the data is being fetched.
-
I hope it comes.
-
Okay, let's hope. (chuckles)
-
Okay, I will take
some other stuff as well.
-
Browsing all properties by their time.
-
Yes. So you see,
this is count of translated labels,
-
and you see all this data
that is coming in real time,
-
and you can see that the labels
-
are currently available
for 6,804 properties in English,
-
followed by Dutch, followed by Arabic,
followed by Ukrainian, and then French.
-
So this is real-time statistics.
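[Editor's sketch] The talk does not say how WDProp fetches these counts. One plausible way to get a live "labels per language" ranking is a SPARQL query against the Wikidata Query Service. The sketch below only builds the request URL and does not send it. The query text and the helper name `build_query_url` are assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Count property labels per language, most translated language first.
LABELS_PER_LANGUAGE = """
SELECT ?lang (COUNT(?property) AS ?labels) WHERE {
  ?property a wikibase:Property ;
            rdfs:label ?label .
  BIND(LANG(?label) AS ?lang)
}
GROUP BY ?lang
ORDER BY DESC(?labels)
"""


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql.strip(), "format": "json"})


url = build_query_url(LABELS_PER_LANGUAGE)
# Decode the URL again to confirm the query survived the encoding.
params = parse_qs(urlparse(url).query)
print(params["format"][0])
```

Sending the URL with any HTTP client would return one row per language. The numbers change over time, so none are quoted here.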
-
So you could also do the same
for description,
-
also do for aliases, et cetera.
-
And you could get the overall
translation statuses if you want.
-
So there are some other things
that we will discuss later,
-
if time permits.
-
But you could navigate
all the different items
-
on the left-hand side,
-
and you could see
there are a lot of things
-
that could really help to see
what things are happening in WDProp.
-
So this is, for example,
Wikidata properties,
-
these are the properties
that are currently available.
-
But as I said some time back,
properties could be deleted.
-
And this, you see that these are
the properties that were deleted,
-
starting from P1, P2, P3, P4, P5,
these have all been deleted,
-
and you could get this thing
just from the statistics board.
-
And here, so same thing.
-
Then, the next thing that interested me
was to understand the translation pattern.
-
So, for example, sometimes we feel
that some languages--
-
so English is created first,
and followed by maybe Dutch,
-
or maybe French,
-
and maybe after French,
it could be Arabic.
-
So these things
could be interesting to know.
-
So for that, we started to look
at the idea of translation path--
-
exactly how things are translated.
-
So again, if you go to the property page,
you could click on any property.
-
Sorry.
-
Maybe I can show.
-
So you could click on any property
and you could just say,
-
"Give me the translation path."
-
It takes some time,
but it will start bringing the data,
-
because it's real time,
so you get the data coming from all this.
-
So you get the date,
-
you get what things have been changed,
when was something deleted, et cetera.
-
Why is it important?
-
For example, you see
this is something that happened in 2017,
-
and the label has been removed.
-
This is the official website.
-
So imagine you have removed the label
from the official website--
-
sorry, this country--
-
so anybody who doesn't know P17,
what it is, cannot even understand,
-
because the label has been deleted
by the person.
-
So this type of vandalism exists.
-
Another example where, completely,
-
all the language labels
have been deleted--
-
English, French, Spanish, German,
everything has been deleted.
-
There are no labels,
there are no descriptions.
-
So you could find these types of things
from the translation path
-
and just because of the color code,
you could see what happened on what day,
-
and you could check exactly,
because it is also linked.
-
If you click on any of this,
you could also get a link to the revision,
-
identify what exactly happened
during that particular revision.
-
So this is coming from revision history.
-
So if you click on any of this,
you get what exactly is happening
-
in any particular revision.
-
So how did we build it?
-
Just if you come back,
-
here, you see there is something
called a comment on the right-hand side.
-
You see there is something
called added aliases,
-
"added British English aliases,"
"changed Esperanto label,"
-
"added [io] label," et cetera.
-
So we made use of this information,
-
for example,
for label, description, and aliases,
-
if you add something,
you have some sort of comment
-
which starts with wbsetlabel-add.
-
Or if it is updated,
you have wbsetlabel-set.
-
And if you remove something,
you see it is removed.
-
And based on this type of information,
-
we were able to build
such a translation path.
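[Editor's sketch] The comment-based approach can be illustrated with a small parser. It assumes that edit summaries look like `/* wbsetlabel-add:1|en */ text`, and that description and alias edits follow the same pattern with `wbsetdescription` and `wbsetaliases`. The regular expression, the function names, and the sample revisions are hypothetical, not the WDProp source.

```python
import re

# Assumed summary shape: "/* wbsetlabel-add:1|en */ new text"
SUMMARY_RE = re.compile(
    r"/\*\s*wbset(?P<field>label|description|aliases)"
    r"-(?P<action>add|set|remove)"
    r":\d*\|(?P<lang>[a-z][a-z0-9-]*)\s*\*/"
)


def parse_comment(comment):
    """Return (field, action, language) or None if the comment does not match."""
    match = SUMMARY_RE.search(comment)
    if match is None:
        return None
    return match["field"], match["action"], match["lang"]


def translation_path(revisions):
    """Build a chronological list of translation events.

    `revisions` is an iterable of (timestamp, comment) pairs, oldest first.
    """
    path = []
    for timestamp, comment in revisions:
        parsed = parse_comment(comment)
        if parsed is not None:
            path.append((timestamp, *parsed))
    return path


# Made-up revision history for illustration only.
history = [
    ("2013-01-30T12:00:00Z", "/* wbsetlabel-add:1|en */ instance of"),
    ("2013-02-02T09:30:00Z", "/* wbsetaliases-add:2|en-gb */ is a, type"),
    ("2014-05-11T18:45:00Z", "/* wbeditentity-update:0| */ batch edit"),
    ("2017-03-08T07:15:00Z", "/* wbsetlabel-remove:1|eo */ ekzemplo de"),
]

for event in translation_path(history):
    print(event)
```

The third revision is skipped because its summary names no single language, which is the same limitation the talk describes for edits that change several labels at once.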
-
Okay, this is good, but what happened
is that this type of information,
-
this type of things,
just using the comment,
-
it is useful for building real-time tools,
just like what I showed before, WDProp,
-
but it is very difficult to detect
when there are multiple changes.
-
For example, if you have seen
bot activity on Wikidata,
-
some bots make multiple labels
in one single edit.
-
In that case,
you cannot find what happened
-
because you do not have wbsetlabel,
that particular language.
-
So you do not have a set of languages
along with your comment.
-
So these are some problems
if you want to use this type of approach.
-
So what we did,
we decided to collect the data,
-
and we decided to publicly
make this data available.
-
And what we did,
we wanted to make use of content.
-
So what we did,
we started with every revision,
-
and we took the content of each revision.
-
And we took the next revision,
and we decided to find the difference
-
between these two revisions,
to find what exactly changes,
-
which of the labels got changed.
-
Because of that, we got
much more interesting information,
-
much more accurate information
than the previous approach
-
because it is very important
for doing analysis.
-
It is important
that you make use of correct data.
-
So you have four columns
that were used here--
-
timestamp, property,
language, type, et cetera.
-
And you get this data in this format.
It is publicly available.
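[Editor's sketch] The content-based approach, comparing each revision with the next one, might look roughly like this. The sketch assumes each revision has already been reduced to a dictionary from language code to label text. The row layout follows the four columns named in the talk (timestamp, property, language, type), but the concrete type values such as `label-added` are invented here.

```python
def label_changes(prop_id, timestamp, old, new, kind="label"):
    """Compare two consecutive revisions and emit one row per changed language.

    `old` and `new` map language codes to text. Each row is
    (timestamp, property, language, type).
    """
    rows = []
    for lang in sorted(set(old) | set(new)):
        before, after = old.get(lang), new.get(lang)
        if before == after:
            continue  # nothing happened for this language
        if before is None:
            change = "added"
        elif after is None:
            change = "removed"
        else:
            change = "changed"
        rows.append((timestamp, prop_id, lang, f"{kind}-{change}"))
    return rows


# Made-up consecutive revisions of one property's labels.
previous = {"en": "country", "fr": "pays", "de": "Staat"}
current = {"en": "country", "fr": "État", "nl": "land"}

for row in label_changes("P17", "2017-06-01T00:00:00Z", previous, current):
    print(row)
```

Because the comparison looks at content rather than the edit summary, an edit that touches several languages at once yields one row per language.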
-
So what does this data give me?
-
This data gives me information
-
that currently almost 4,000 plus,
-
4,500 properties
-
have labels between 0 and 20.
-
So there are a lot of properties
-
that do not have
more than 20 multilingual labels.
-
And there are only
1,500 properties
-
that have been translated up to 40.
-
And yesterday, if you were present
during the talk of Lydia Pintscher,
-
she talked about P18,
so P18 is something here.
-
So you can see there are only
a couple of six or seven properties
-
that are currently having all the--
-
P18 has 154 translations,
just to give that idea.
-
So there is one property
which is having 154 multilingual labels.
-
There are properties
which have only one particular label.
-
And the average number
of labels is only 21,
-
and the standard deviation is 20.
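[Editor's sketch] Figures like these (a histogram in steps of 20, a mean, a standard deviation) can be derived from a per-property label count. The sketch below uses invented counts, apart from the 154 labels the talk quotes for P18, so its output does not reproduce the numbers on the slide. The function name and bucket convention are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def label_count_summary(label_counts, bucket=20):
    """Summarise how many labels each property has.

    Buckets are keyed by their lower bound: 0 means 0-19, 20 means 20-39, etc.
    """
    counts = list(label_counts.values())
    histogram = Counter((n // bucket) * bucket for n in counts)
    return {
        "mean": mean(counts),
        "stdev": pstdev(counts),  # population standard deviation
        "min": min(counts),
        "max": max(counts),
        "histogram": dict(sorted(histogram.items())),
    }


# Only the P18 value comes from the talk; the other entries are placeholders.
counts_per_property = {"P18": 154, "prop_a": 1, "prop_b": 12, "prop_c": 25}

summary = label_count_summary(counts_per_property)
print(summary["histogram"])
print(summary["mean"], summary["max"])
```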
-
Okay, what would we like to say next?
-
So you have seen something similar
in the real-time data.
-
This is from the collected data.
-
So this is what are the top languages
that are coming up in the results.
-
So these we have seen.
-
But my next point is,
are there combinations possible?
-
For example, if there is French,
there is Arabic.
-
If there is Arabic,
there is some other language.
-
If there's French,
there's Ukrainian, et cetera.
-
Can we find such type of combinations
in the translation data set?
-
So, yes, it is possible.
-
So if you see this count,
this frequent itemsets--
-
so I've just shown seven of them--
-
you find that there are combinations
that are possible.
-
Okay, let us say, is there a possibility
of having four labels,
-
like if there is English,
there's also possibility to find Dutch,
-
Arabic, Ukrainian.
-
If there is English,
there's possibility to find Dutch,
-
French, and Arabic, et cetera.
-
You can also find a lot of combinations.
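[Editor's sketch] The talk does not name the algorithm behind these frequent itemsets. A brute-force version of the idea is to count, for every set of languages of a given size, how many properties have labels in all of them. The data and the function name below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_language_sets(property_languages, size, min_support):
    """Count language combinations of `size` shared by at least `min_support` properties."""
    counts = Counter()
    for langs in property_languages.values():
        # Sort so that the same set always produces the same tuple.
        for combo in combinations(sorted(set(langs)), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Made-up data: which languages each property has a label in.
labels_by_property = {
    "prop_a": ["en", "nl", "ar", "uk"],
    "prop_b": ["en", "nl", "ar"],
    "prop_c": ["en", "fr"],
    "prop_d": ["en", "nl", "fr", "ar"],
}

print(frequent_language_sets(labels_by_property, size=2, min_support=3))
print(frequent_language_sets(labels_by_property, size=3, min_support=3))
```

Enumerating every combination grows very quickly for properties with more than a hundred labels, so a real analysis would more likely use a pruning method such as Apriori or FP-Growth.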
-
Why is it important?
-
Because it is important to know if,
-
for example,
if you have multilingual speakers
-
who are contributors,
who can speak multiple languages,
-
if you're able to find
any particular pattern
-
that helps us to find
that if you tell this person to translate,
-
a new property is created
to translate this label,
-
because he already
speaks multiple languages,
-
we can suggest these things to the user.
-
So let's just show you one example.
-
This is a complete translation path
-
that has been obtained
from different languages.
-
So here, what we have done is
we selected two small minority languages,
-
like Tagalog and Kapampangan,
-
which are minority languages
from the Philippines,
-
and you see that there is
a strong transfer
-
between Tagalog and Kapampangan.
-
So these types of things can be detected
-
when you have such type
of translation results.
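[Editor's sketch] The talk does not define "transfer" precisely. One reading is that a label in one language is often followed directly by a label in the other. Under that assumption, transfers can be counted from the order in which languages first appear on each property. The language codes `tl` (Tagalog) and `pam` (Kapampangan) are real, but the paths are invented.

```python
from collections import Counter


def language_transfers(paths):
    """Count how often one language is immediately followed by another.

    `paths` maps a property to its language codes in order of first label.
    """
    transfers = Counter()
    for langs in paths.values():
        for source, target in zip(langs, langs[1:]):
            transfers[(source, target)] += 1
    return transfers


# Made-up translation orders for three properties.
paths = {
    "prop_a": ["en", "tl", "pam"],
    "prop_b": ["en", "fr", "tl", "pam"],
    "prop_c": ["en", "pam", "tl"],
}

transfers = language_transfers(paths)
print(transfers[("tl", "pam")], transfers[("pam", "tl")])
```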
-
So that is another advantage.
-
To conclude my work,
I would like to say,
-
this is important that we understand
how properties are translated
-
because if you want to extract data
from Wikipedia,
-
you need to know what are the words
-
in the local languages
that are being used.
-
What is "image" in French,
what is "image" in Punjabi,
-
what is "image" in Hindi,
or any other language.
-
So that is important for importing data.
-
And tomorrow, of course,
if you are able to fetch this data,
-
to Wikidata, we could also
use new projects like Wikidata Bridge,
-
which we could use
to fill other infoboxes,
-
like multilingual Wikipedia articles,
-
and this could be really helpful.
-
So with that, I would like to thank you,
and if you have questions,
-
I would be happy to answer them.
-
(moderator) Anybody with questions?
-
(audience applause)
-
Yes?
-
(man) So what you're doing
is mainly analyzing how this--
-
- (John) Yes.
- (man) ...is all happening?
-
Do you know if there are initiatives
or if there are tools
-
which can help make this easier,
like translation of properties?
-
Yes. Tools, like, for example,
what to translate
-
from Wikimedia Foundation, is helpful,
but I have not seen--
-
This is not currently
integrated with Wikidata.
-
What to translate is only integrated
with certain languages on Wikipedia,
-
but not on Wikidata.
-
But that could be really interesting.
-
Yes, thank you for bringing this up,
because just imagine,
-
if we know that a person
has been labeling in multiple languages,
-
and we also have
this what to translate tool,
-
and we have these statistics,
we have this data
-
coming from this type
of property translation,
-
it is easier to suggest to a person
that new properties have been created,
-
and then you could--
-
Right now it's not integrated with Wikidata.
-
(moderator) Anybody else?
-
(man 2) I have one question myself,
that comes back to it,
-
does anybody know of working lists
on translating properties?
-
Sorry?
-
(man 2) Does anybody
know of working lists
-
about translating properties,
-
like, I can imagine from your statistics,
you could say, this is the top 100
-
most widely used properties
-
that lack translations
in this and this language?
-
No, there is, I think,
there are ways by,
-
for example,
you could browse by data types,
-
browse by property classes.
-
For example, here is something
called property classes
-
where people have created projects--
-
it's taking time--
so you have projects,
-
and you could say, how would I describe,
what are the, for example,
-
what are the properties
that I could describe for this,
-
for describing IEEE standard version?
-
You need edition number,
you need edition translation, et cetera.
-
So if you have a targeted thing,
you could search for what type of classes.
-
For example, if you're working
in GLAM or histories,
-
you could say, what history-related
documents are there?
-
So you could say, historical,
and you could find historical.
-
Okay, this is a property class,
go to this property class.
-
And, sorry, where is it?
-
So it is having something
called "Merimee ID."
-
So people have been
trying to use property classes
-
to link objects.
-
That helps if you're working
on a particular project,
-
and you could find
the properties related to that.
-
(man 2) But your tool could quite easily
make a list of, let's say,
-
the top 100 most widely used properties
-
that haven't got, I don't know,
Punjabi label, let's say?
-
- (John) For that, I will just--
- (man 2) Which could be interesting.
-
(John) Okay, tell me any language,
for example, let us say, Netherlands,
-
because it's performing very well.
-
So I would say-- translated labels.
-
So this is translate-- sorry.
-
(mouse clicking)
-
For example, Hindi.
-
So here, what happens,
-
here you just see any properties
that need translation.
-
So there are like 6,647 properties
-
that need translation
in a particular language.
-
So you could click on any language
that you want and get the data.
-
And you could get the list
of where people need support.
-
So, this could be interesting
to link with property usage,
-
how many people, is it really top,
is it under the top ten.
-
So suggest those ten top hundred,
in that language.
-
That would be an interesting list.
That's good.
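[Editor's sketch] The list discussed here, the most used properties that still lack a label in a given language, combines two inputs: a usage count per property and the set of languages each property is labelled in. The sketch below assumes both are already available as dictionaries. All numbers and label sets are placeholders, not real Wikidata figures.

```python
def missing_label_worklist(usage, labels, lang, top=100):
    """Return the `top` most used properties that have no label in `lang`.

    `usage` maps property -> number of uses.
    `labels` maps property -> set of language codes with a label.
    """
    # Rank by usage, highest first; break ties by property ID for stable output.
    ranked = sorted(usage, key=lambda prop: (-usage[prop], prop))[:top]
    return [prop for prop in ranked if lang not in labels.get(prop, set())]


# Placeholder inputs for illustration.
usage = {"P31": 900, "P17": 500, "P18": 300, "P21": 100}
labels = {
    "P31": {"en", "pa"},
    "P17": {"en"},
    "P18": {"en", "pa"},
}

print(missing_label_worklist(usage, labels, "pa", top=3))
print(missing_label_worklist(usage, labels, "pa", top=4))
```

A property absent from `labels` is treated as having no labels at all, so it always ends up on the list.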
-
(man 3) Just what you asked,
-
there is a list of top 100
most used properties on Wikidata.
-
It's on Wikidata.
-
So, yeah, it's there,
-
under Wikidata Database Reports/
Top 100 Properties.
-
So one thing could be that
we could just link this and suggest it.
-
(moderator) Could you maybe
add the link to the etherpad,
-
and then maybe,
this information can come together.
-
(John) Okay.
-
(moderator) If there are
no other questions,
-
then we will conclude here.
-
And we have a two- or three-minute break
until we start with the next speaker.
-
- Thanks.
- (John) Thank you very much.
-
(audience applause)