WEBVTT
00:00:06.055 --> 00:00:09.281
(moderator) Good afternoon, everybody.
We're about to start.
00:00:09.281 --> 00:00:11.416
I'm presenting to you John Samuel
00:00:11.416 --> 00:00:17.207
who works at the French
engineering school CPE,
00:00:17.207 --> 00:00:19.658
based in Lyon in France.
00:00:19.658 --> 00:00:21.101
And he will tell us something more
00:00:21.101 --> 00:00:27.271
about the translation
of properties in Wikidata.
00:00:27.271 --> 00:00:29.604
As you know,
as is the case in all sessions,
00:00:29.604 --> 00:00:32.172
there is an etherpad
for collaborative note-taking.
00:00:32.172 --> 00:00:34.904
Please don't forget that.
00:00:34.904 --> 00:00:36.302
We'll have the presentation
00:00:36.302 --> 00:00:39.988
and then we'll have
some time for a short Q&A.
00:00:39.988 --> 00:00:42.051
- The floor is yours.
- (John) Thanks, [inaudible].
00:00:42.917 --> 00:00:45.114
Thank you all for coming here.
00:00:45.114 --> 00:00:50.257
So my talk is about analyzing
translation of Wikidata properties.
00:00:50.257 --> 00:00:52.743
So, just to give you a quick outline.
00:00:52.743 --> 00:00:54.859
I would like to introduce this topic.
00:00:54.859 --> 00:00:58.756
I will present a tool
that I developed some years ago,
00:00:58.756 --> 00:01:01.446
called WDProp,
which I'm continuously working on,
00:01:01.446 --> 00:01:03.795
and based on the feedback
from the community,
00:01:03.795 --> 00:01:05.319
I add new features.
00:01:05.319 --> 00:01:09.368
And then I will talk about
something called coarser analysis,
00:01:09.368 --> 00:01:12.476
where I would like to look
at the property translation,
00:01:12.476 --> 00:01:15.257
from a much larger picture.
00:01:15.257 --> 00:01:18.667
So I will talk about
how we collected this data,
00:01:18.667 --> 00:01:23.002
because this work is also done
with one of my students, Thibaut Chamard.
00:01:23.002 --> 00:01:26.682
And then I will present some results,
and finally, I will conclude the talk.
00:01:27.469 --> 00:01:30.982
So Wikidata, as you all know,
it started in 2012,
00:01:30.982 --> 00:01:33.877
and it's a free, open, linked,
structured, collaborative,
00:01:33.877 --> 00:01:36.010
and multilingual knowledge base.
00:01:36.910 --> 00:01:40.063
My focus today
is on the multilingual part,
00:01:40.063 --> 00:01:42.979
because there is a big change
from the traditional way
00:01:42.979 --> 00:01:45.412
of how we used to edit on Wikipedia sites.
00:01:45.412 --> 00:01:47.917
There were multiple subdomains,
00:01:47.917 --> 00:01:50.753
and now you'll have a single domain
on Wikidata
00:01:50.753 --> 00:01:56.191
where multilingual contributors come
and write or create articles.
00:01:56.191 --> 00:01:57.499
So this is collaborative.
00:01:57.499 --> 00:02:00.585
There has been work to say
what exactly is collaborative,
00:02:00.585 --> 00:02:02.441
why it is collaborative.
00:02:02.441 --> 00:02:04.597
I have given references for these works.
00:02:04.597 --> 00:02:07.254
So this is, if you see Wikidata,
00:02:07.254 --> 00:02:11.057
everything that starts
is starting from the property.
00:02:11.057 --> 00:02:14.144
The property is proposed
and then discussed and voted on.
00:02:14.144 --> 00:02:17.471
And then it is created
and finally translated,
00:02:17.471 --> 00:02:20.005
and then you are finally
able to use these properties.
00:02:20.005 --> 00:02:22.010
But these properties may also be deleted--
00:02:22.010 --> 00:02:24.019
there's also something called deletion.
00:02:24.019 --> 00:02:26.700
But, as I highlighted on this slide,
00:02:26.700 --> 00:02:28.856
my focus is on the multilingual aspect,
00:02:28.856 --> 00:02:32.671
and the property creation
and translation point of view.
00:02:32.671 --> 00:02:36.408
So you have been here
for the past two days,
00:02:36.408 --> 00:02:40.095
and by this time
you have seen many articles,
00:02:40.095 --> 00:02:46.029
and I just want to point
what I am looking for on a Wikidata item.
00:02:46.029 --> 00:02:48.005
This is a Wikidata item,
00:02:48.005 --> 00:02:51.697
so you have this Q2841, which is Bogotá,
00:02:51.697 --> 00:02:55.597
which is the capital city of Colombia,
00:02:55.597 --> 00:02:57.389
and you have four parts here:
00:02:57.389 --> 00:03:00.678
the languages, the labels,
the description, and aliases.
00:03:00.678 --> 00:03:02.255
So you can see,
for different languages
00:03:02.255 --> 00:03:05.089
you'll have the label,
you have the description
00:03:05.089 --> 00:03:10.970
as well as, if there are any aliases,
also known as, you could see them.
00:03:10.970 --> 00:03:14.180
And this, under the city,
where you see the labels
00:03:14.180 --> 00:03:16.155
and the properties together.
00:03:16.155 --> 00:03:20.845
This is Avignon, a city in France.
00:03:20.845 --> 00:03:24.966
So what I'm interested in
is only the properties part.
00:03:24.966 --> 00:03:30.638
For example, official name, native label,
country, capital of, et cetera.
00:03:30.638 --> 00:03:34.310
So when I say property,
for example, if a country,
00:03:34.310 --> 00:03:37.736
in this country,
I'm looking at different aspects:
00:03:37.736 --> 00:03:39.986
the language, the label,
and the description,
00:03:39.986 --> 00:03:42.670
and see how things change.
00:03:42.670 --> 00:03:44.446
For example, if you take instance of--
00:03:44.446 --> 00:03:48.932
okay, everybody knows instance of,
you have been using it quite a lot--
00:03:48.932 --> 00:03:54.089
this is P31, you see
the number of aliases in English
00:03:54.089 --> 00:03:58.667
for the property P31, instance of,
00:03:58.667 --> 00:04:03.686
and then you would find
that these types of properties
00:04:03.686 --> 00:04:07.536
are created after discussion
with the community.
00:04:07.536 --> 00:04:10.513
So if I take the complete prop--
the procedure,
00:04:10.513 --> 00:04:13.343
what happens to creation of properties--
00:04:13.343 --> 00:04:17.347
you start proposing properties
with some possible translation.
00:04:17.347 --> 00:04:19.388
It is important it's not just in English.
00:04:19.388 --> 00:04:23.734
You have the templates
to suggest your properties
00:04:23.734 --> 00:04:25.129
in your local language.
00:04:25.129 --> 00:04:28.552
So that's why it's a proposition
with possible translation.
00:04:28.552 --> 00:04:32.367
And then you put it to discussion,
then it is put to voting,
00:04:32.367 --> 00:04:37.273
and it's created, and then finally,
the community members start translating it
00:04:37.273 --> 00:04:38.976
and people put it into use.
00:04:38.976 --> 00:04:42.336
But then you cannot be guaranteed
the properties that are created
00:04:42.336 --> 00:04:44.435
are always there forever.
00:04:44.435 --> 00:04:47.417
Properties can be deleted,
just like items can be deleted.
00:04:47.417 --> 00:04:51.004
But then, again,
it goes through a similar procedure.
00:04:51.004 --> 00:04:54.727
You put the property
00:04:54.727 --> 00:04:58.427
as you propose that it should be deleted,
00:04:58.427 --> 00:05:02.424
and if the community decides it,
it votes it, and then if it is decided--
00:05:02.424 --> 00:05:05.238
the majority vote
has decided to delete it--
00:05:05.238 --> 00:05:09.191
we deprecate the property,
and finally we delete this property.
00:05:09.191 --> 00:05:14.826
So for today's talk, I'm mostly interested
in the translation part.
00:05:14.826 --> 00:05:17.004
So where are the translations happening?
00:05:17.004 --> 00:05:20.037
First, the translation would happen
at the proposition part,
00:05:20.037 --> 00:05:22.778
and then you could find that,
at the time of creation,
00:05:22.778 --> 00:05:27.917
the person who creates the property
can use the exact names
00:05:27.917 --> 00:05:31.062
that were suggested
by the property proposer
00:05:31.062 --> 00:05:34.753
and he or she will create the properties,
00:05:34.753 --> 00:05:38.705
and later, you start translating
these properties.
00:05:38.705 --> 00:05:43.176
So let us look at why this matters,
why it is important.
00:05:43.176 --> 00:05:44.909
So I put some examples.
00:05:44.909 --> 00:05:47.162
This is, again, on P31,
00:05:47.162 --> 00:05:51.762
instance of, the very, very famous
property P31,
00:05:51.762 --> 00:05:56.094
and you see there is
no description for this item.
00:05:56.094 --> 00:06:00.876
There are almost
six descriptions on this image,
00:06:00.876 --> 00:06:03.310
where we do not have any description.
00:06:03.310 --> 00:06:06.961
Again, some more description
for Odia and Punjabi,
00:06:06.961 --> 00:06:07.970
there is no description.
00:06:07.970 --> 00:06:10.806
This is a property
which is used quite a lot,
00:06:10.806 --> 00:06:13.820
and you see that there is
no description for it.
00:06:13.820 --> 00:06:17.876
And there is a surprising part
that you could also have cases
00:06:17.876 --> 00:06:22.000
where there are descriptions,
but there are no labels.
00:06:22.000 --> 00:06:25.293
For example, Ruffian,
that has been shown here,
00:06:25.293 --> 00:06:30.116
again on property P31,
there is a label that is missing.
00:06:30.116 --> 00:06:34.100
So this was the initial
inspiration for this work
00:06:34.100 --> 00:06:37.486
when I started working
on property analysis.
00:06:37.486 --> 00:06:44.272
I wanted to look at
what aspects of properties,
00:06:44.272 --> 00:06:46.459
or what aspects of property
00:06:46.459 --> 00:06:49.569
that the whole flow chart
that we have seen,
00:06:49.569 --> 00:06:51.316
is multilingual.
00:06:51.316 --> 00:06:53.048
So I wanted to look at,
00:06:53.048 --> 00:06:56.304
okay, we know that Wikidata
is multilingual,
00:06:56.304 --> 00:06:58.984
and it's collaborative,
that has been done.
00:06:58.984 --> 00:07:05.285
But are we really able to achieve
a truly multilingual experience?
00:07:05.285 --> 00:07:09.054
That was the question
behind the creation of WDProp.
00:07:09.054 --> 00:07:11.166
So you may ask
why there are so many people
00:07:11.166 --> 00:07:14.600
who have worked on items,
there are people who have worked on--
00:07:14.600 --> 00:07:17.047
users, multilingual users
and bots, et cetera,
00:07:17.047 --> 00:07:19.444
why do you want to focus on properties?
00:07:19.444 --> 00:07:22.770
The answer is,
I want to focus on properties
00:07:22.770 --> 00:07:25.738
because they are much
less influenced by bots.
00:07:25.738 --> 00:07:28.581
You may have heard today or yesterday,
00:07:28.581 --> 00:07:31.895
many people said,
"Okay, if you have translation
00:07:31.895 --> 00:07:36.761
in your local languages,
and it has reached a very good number,
00:07:36.761 --> 00:07:39.227
you should ensure
what type of translation it is.
00:07:39.227 --> 00:07:44.339
Is it just bots, which copy
the name of a person to another language.
00:07:44.339 --> 00:07:47.242
Then is it really translation?"
00:07:47.242 --> 00:07:48.413
Okay, that's debatable.
00:07:48.413 --> 00:07:51.365
But, of course,
there is an influence by bot,
00:07:51.365 --> 00:07:54.811
but in case of properties,
there is not so much influence by bots,
00:07:54.811 --> 00:07:55.913
and that is a good part.
00:07:55.913 --> 00:08:00.706
That's why I focus on the properties part.
00:08:00.706 --> 00:08:05.552
So, as I said, when WDProp was created,
00:08:05.552 --> 00:08:09.451
it was to understand every aspect--
the proposal, the creation, translation.
00:08:09.451 --> 00:08:12.326
What are the templates that are available.
00:08:12.326 --> 00:08:16.232
Are these templates,
for example, you said support,
00:08:16.232 --> 00:08:21.875
if a French person opens Wikidata,
a Wikidata France translation page,
00:08:21.875 --> 00:08:28.039
can he see the word, [soutien],
for that particular property proposal?
00:08:28.039 --> 00:08:29.373
Is it possible?
00:08:29.373 --> 00:08:33.125
So these types of things were needed.
00:08:33.125 --> 00:08:35.987
In the end, it was also
about giving real-time statistics
00:08:35.987 --> 00:08:37.741
to the multilingual contributors.
00:08:37.741 --> 00:08:38.783
It's not about one time,
00:08:38.783 --> 00:08:42.178
it's like you just made it
and published for one time-- no.
00:08:42.178 --> 00:08:45.434
You want people
to get this data in real time.
00:08:45.434 --> 00:08:46.716
So what are we doing?
00:08:46.716 --> 00:08:52.065
So the goal of WDProp
was to understand everything
00:08:52.065 --> 00:08:54.418
about Wikidata properties.
00:08:54.418 --> 00:08:56.955
So, label, aliases, description.
00:08:56.955 --> 00:09:01.348
So you have got all these three translated
so the middle part where you say,
00:09:01.348 --> 00:09:05.618
this property is completely usable
because all the three aspects
00:09:05.618 --> 00:09:08.984
have been translated.
00:09:08.984 --> 00:09:12.055
So let me just show you quickly,
what is this WDProp,
00:09:12.055 --> 00:09:14.224
what I'm talking about.
00:09:14.224 --> 00:09:15.496
So this is the WDProp,
00:09:15.496 --> 00:09:19.726
it's available on
tools.wmflabs.org/wdprop/.
00:09:19.726 --> 00:09:23.813
So you have a lot of statistics
and if I ask you some questions today,
00:09:23.813 --> 00:09:27.960
like, for example,
"How many data types are there
00:09:27.960 --> 00:09:30.846
that are supported by Wikidata right now?"
00:09:30.846 --> 00:09:34.369
So such questions, we do not know,
00:09:34.369 --> 00:09:37.549
sometimes because there are new data types
that keep on coming.
00:09:37.549 --> 00:09:41.668
So this data,
this is generated in real time,
00:09:41.668 --> 00:09:44.993
this creates the data structure
and it will give you the answer.
00:09:44.993 --> 00:09:46.486
How many languages are there?
00:09:46.486 --> 00:09:50.194
Yes, of course,
see that there are 313 languages.
00:09:50.194 --> 00:09:55.092
And then, for example,
how many labels were translated.
00:09:55.092 --> 00:09:58.694
So you could see
that the data is being fetched.
00:09:58.694 --> 00:10:00.242
I hope it comes.
00:10:01.512 --> 00:10:03.003
Okay, let's hope. (chuckles)
00:10:07.984 --> 00:10:11.621
Okay, I will take
some other stuff as well.
00:10:11.621 --> 00:10:13.964
Browsing all properties by their time.
00:10:13.964 --> 00:10:17.079
Yes. So you see,
this is count of translated labels,
00:10:17.079 --> 00:10:20.142
and you see all this data
that is coming real time,
00:10:20.142 --> 00:10:21.781
and you can see that the labels
00:10:21.781 --> 00:10:26.881
are currently available
for 6,804 properties in English,
00:10:26.881 --> 00:10:31.291
followed by Dutch, followed by Arabic,
followed by Ukrainian, and then French.
00:10:31.291 --> 00:10:32.922
So this is real-time statistics.
00:10:32.922 --> 00:10:35.446
So you could also do the same
for description,
00:10:35.446 --> 00:10:37.747
also do for aliases, et cetera.
00:10:37.747 --> 00:10:41.383
And you could get the overall
translation statuses if you want.
00:10:41.383 --> 00:10:43.937
So there are some other things
that we will discuss later,
00:10:43.937 --> 00:10:45.586
if time permits.
00:10:45.586 --> 00:10:50.132
But you could navigate
all the different items
00:10:50.132 --> 00:10:52.367
on the left-hand side,
00:10:52.367 --> 00:10:54.127
and you could see
there are a lot of things
00:10:54.127 --> 00:10:59.471
that could really help to see
what things are happening in WDProp.
00:10:59.471 --> 00:11:03.591
So this is, for example,
Wikidata properties,
00:11:03.591 --> 00:11:05.789
these are the properties
that are currently available.
00:11:05.789 --> 00:11:10.039
But as I said some time back,
properties could be deleted.
00:11:10.039 --> 00:11:13.121
And this, you see that these are
the properties that were deleted,
00:11:13.121 --> 00:11:17.171
starting from P1, P2, P3, P4, P5,
these have all been deleted,
00:11:17.171 --> 00:11:23.005
and you could get this thing
just from the statistics board.
00:11:23.005 --> 00:11:24.947
And here, so same thing.
00:11:24.947 --> 00:11:29.938
Then, the next thing that interested me
was to understand the translation pattern.
00:11:29.938 --> 00:11:33.388
So, for example, sometimes we feel
that some languages--
00:11:33.388 --> 00:11:36.514
so English is created first,
and followed by maybe Dutch,
00:11:36.514 --> 00:11:38.201
or maybe French,
00:11:38.201 --> 00:11:40.701
and maybe after French,
it could be Arabic.
00:11:40.701 --> 00:11:43.627
So these things
could be interesting to know.
00:11:43.627 --> 00:11:48.596
So for that, we started to look
at the idea of translation path--
00:11:48.596 --> 00:11:51.607
exactly how things are translated.
00:11:51.607 --> 00:11:56.542
So again, if you go to the property page,
you could click on any property.
00:11:56.542 --> 00:11:57.662
Sorry.
00:11:59.375 --> 00:12:01.053
Maybe I can show.
00:12:03.527 --> 00:12:06.497
So you could click on any property
and you could just say,
00:12:06.497 --> 00:12:07.794
"Give me the translation path."
00:12:07.794 --> 00:12:11.487
It takes some time,
but it will start bringing the data,
00:12:11.487 --> 00:12:15.434
because it's real time,
so you get the data coming from all this.
00:12:15.434 --> 00:12:16.595
So you get the date,
00:12:16.595 --> 00:12:22.244
you get what things have been changed,
when was something deleted, et cetera.
00:12:22.244 --> 00:12:23.848
Why is it important?
00:12:24.948 --> 00:12:29.401
For example, you see
this is something that happened in 2017,
00:12:29.401 --> 00:12:31.955
and the label has been removed.
00:12:31.955 --> 00:12:33.893
This is the official website.
00:12:33.893 --> 00:12:38.944
So imagine you have removed the label
from the official website--
00:12:38.944 --> 00:12:39.978
sorry, this country--
00:12:39.978 --> 00:12:43.357
so anybody who doesn't know P17,
what it is, cannot even understand,
00:12:43.357 --> 00:12:45.971
because the label has been deleted
by the person.
00:12:45.971 --> 00:12:47.915
So this type of vandalism exists.
00:12:47.915 --> 00:12:50.710
Another example where, completely,
00:12:50.710 --> 00:12:52.601
all the language labels
have been deleted--
00:12:52.601 --> 00:12:56.183
English, French, Spanish, German,
everything has been deleted.
00:12:56.183 --> 00:12:58.329
There are no labels,
there are no descriptions.
00:12:58.329 --> 00:13:01.033
So you could find these types of things
from the translation path
00:13:01.033 --> 00:13:05.483
and just because of the color code,
you could see what happened on what day,
00:13:05.483 --> 00:13:09.666
and you could check exactly,
because it is also linked.
00:13:09.666 --> 00:13:14.261
If you click on any of this,
you could also get a link to the revision,
00:13:14.261 --> 00:13:19.478
identify what exactly happened
during that particular revision.
00:13:19.478 --> 00:13:21.309
So this is coming from revision history.
00:13:21.309 --> 00:13:25.311
So if you click on any of this,
you get what exactly is happening
00:13:25.311 --> 00:13:28.567
in any particular revision.
00:13:28.567 --> 00:13:30.733
So how did we build it?
00:13:30.733 --> 00:13:31.923
Just if you come back,
00:13:31.923 --> 00:13:38.396
here, you see there is something
called a comment on the right-hand side.
00:13:38.396 --> 00:13:42.602
You see there is something
called added aliases,
00:13:42.602 --> 00:13:46.613
"added British English aliases,"
"changed Esperanto label,"
00:13:46.613 --> 00:13:48.109
"added [io] label," et cetera.
00:13:48.109 --> 00:13:50.710
So we made use of this information,
00:13:50.710 --> 00:13:53.209
for example,
for label description and aliases,
00:13:53.209 --> 00:13:55.507
if you add something,
you have some sort of comment
00:13:55.507 --> 00:13:58.216
which starts with wbsetlabel-add.
00:13:58.216 --> 00:14:01.635
Or if it is updated,
you have wbsetlabel-set.
00:14:01.635 --> 00:14:04.487
And if you remove something,
you see it is removed.
00:14:04.487 --> 00:14:06.795
And based on this type of information,
00:14:06.795 --> 00:14:11.167
we were able to build
such a translation path.
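
NOTE
A minimal sketch of the comment-based approach just described, not shown in the talk.
It assumes revision comments of the form "/* wbsetlabel-add:1|en */ instance of";
the regular expression and the helper name parse_summary are illustrative only.
```python
import re
# Assumed summary shape: "/* wbset<term>-<action>:<count>|<language> */ <text>"
SUMMARY = re.compile(
    r"/\*\s*wbset(label|description|aliases)-(add|set|remove):\d+\|([\w-]+)\s*\*/"
)
def parse_summary(comment):
    """Return (term_type, action, language), or None if the comment does not match."""
    match = SUMMARY.search(comment)
    return match.groups() if match else None
# A single-language edit is recognised; anything else yields None.
print(parse_summary("/* wbsetlabel-add:1|en */ instance of"))
```
Comments that do not name one term and one language return None, which is the
limitation discussed in the cues that follow.
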
00:14:11.167 --> 00:14:16.557
Okay, this is good, but what happened
is that this type of information,
00:14:16.557 --> 00:14:19.366
this type of things,
just using the comment,
00:14:19.366 --> 00:14:23.932
it is useful for building real-time tools,
just like what I showed before, WDProp,
00:14:23.932 --> 00:14:30.886
but it is very difficult to detect
when there are multiple changes.
00:14:30.886 --> 00:14:34.871
For example, if you have seen
bots activity on Wikidata,
00:14:34.871 --> 00:14:39.550
some bots make multiple labels
in one single edit.
00:14:39.550 --> 00:14:42.037
In that case,
you cannot find what happened
00:14:42.037 --> 00:14:45.878
because you do not have wbsetlabel,
that particular language.
00:14:45.878 --> 00:14:49.254
So you do not have a set of languages
along with your comment.
00:14:49.254 --> 00:14:53.703
So these are some problems
if you want to use this type of approach.
00:14:54.603 --> 00:14:58.245
So what we did,
we decided to collect the data,
00:14:58.245 --> 00:15:01.316
and we decided to publicly
make this data available.
00:15:02.516 --> 00:15:06.246
And what we did,
we wanted to make use of content.
00:15:06.246 --> 00:15:08.579
So what we did,
we started with every revision,
00:15:08.579 --> 00:15:12.096
and we took the content of each revision.
00:15:12.096 --> 00:15:16.717
And we took the next revision,
and we decided to find the difference
00:15:16.717 --> 00:15:19.885
between these two revisions,
to find what exactly changes,
00:15:19.885 --> 00:15:21.822
which of the labels got changed.
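
NOTE
A minimal sketch of the revision-diff idea just described, not shown in the talk.
It assumes each revision's labels have already been reduced to a plain mapping
from language code to label text; diff_labels and the sample values are hypothetical.
```python
def diff_labels(old, new):
    """Compare the label maps of two consecutive revisions, language by language."""
    changes = []
    for lang in sorted(set(old) | set(new)):
        if lang not in old:
            changes.append((lang, "add"))      # label appears for the first time
        elif lang not in new:
            changes.append((lang, "remove"))   # label was deleted
        elif old[lang] != new[lang]:
            changes.append((lang, "set"))      # label text was changed
    return changes
# Toy data: one revision changes the French label and adds a Dutch one.
before = {"en": "instance of", "fr": "nature"}
after = {"en": "instance of", "fr": "nature de l'élément", "nl": "is een"}
print(diff_labels(before, after))
```
Because the comparison works on content rather than on the edit comment, an edit
that touches several languages at once still yields one change per language.
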
00:15:21.822 --> 00:15:25.436
Because of that, we got
much more interesting information,
00:15:25.436 --> 00:15:28.899
much more accurate information
than the previous approach
00:15:28.899 --> 00:15:31.274
because it is very important
for doing analysis.
00:15:31.274 --> 00:15:34.020
It is important
that you make use of correct data.
00:15:34.020 --> 00:15:36.866
So you have four columns
that were used here--
00:15:36.866 --> 00:15:39.091
timestamp, property,
language, type, et cetera.
00:15:39.091 --> 00:15:44.494
And you get this data in this format.
It is publicly available.
00:15:44.494 --> 00:15:47.446
So what does this data give me?
00:15:47.446 --> 00:15:48.791
This data gives me information
00:15:48.791 --> 00:15:54.791
that currently almost 4,000 plus,
00:15:54.791 --> 00:15:57.291
4,500 properties
00:15:57.291 --> 00:15:59.917
have labels between 0 and 20.
00:15:59.917 --> 00:16:02.145
So there are a lot of properties
00:16:02.145 --> 00:16:07.107
that do not have
more than 20 multilingual labels.
00:16:07.107 --> 00:16:10.888
And there are only
1,500 language properties
00:16:10.888 --> 00:16:12.857
that have been translated up to 40.
00:16:12.857 --> 00:16:18.699
And yesterday, if you were present
during the talk of Lydia Pintscher,
00:16:18.699 --> 00:16:21.967
she talked about P18,
so P18 is something here.
00:16:21.967 --> 00:16:25.332
So you can see there are only
a couple of six or seven properties
00:16:25.332 --> 00:16:30.147
that are currently having all the--
00:16:30.147 --> 00:16:35.092
P18 has 154 translations,
just to give that idea.
00:16:35.092 --> 00:16:39.913
So there is one property
which is having 154 multilingual labels.
00:16:39.913 --> 00:16:43.807
There are properties
which have only one particular label.
00:16:43.807 --> 00:16:50.112
And the average number
of labels is only 21,
00:16:50.112 --> 00:16:52.945
and the standard deviation is 20.
00:16:52.945 --> 00:16:55.967
Okay, what next we would like to say?
00:16:55.967 --> 00:16:59.970
So you have seen something similar
in the real-time data.
00:16:59.970 --> 00:17:02.079
This is from the collected data.
00:17:02.079 --> 00:17:07.503
So this is what are the top languages
that are coming up in the results.
00:17:07.503 --> 00:17:09.186
So these we have seen.
00:17:09.186 --> 00:17:13.314
But my next point is,
are there combinations possible.
00:17:13.314 --> 00:17:16.522
For example, if there is French,
there is Arabic.
00:17:16.522 --> 00:17:19.505
If there is Arabic,
there is some other language.
00:17:19.505 --> 00:17:22.102
If there's French,
there's Ukrainian, et cetera.
00:17:22.102 --> 00:17:26.093
Can we find such type of combinations
in the translation data set?
00:17:26.093 --> 00:17:27.415
So, yes, it is possible.
00:17:27.415 --> 00:17:30.195
So if you see this count,
this frequent itemsets--
00:17:30.195 --> 00:17:32.134
so I've just shown seven of them--
00:17:32.134 --> 00:17:35.315
you find that there are combinations
that are possible.
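
NOTE
A minimal sketch of counting frequent language combinations, not shown in the talk.
It is a brute-force count over all combinations of a given size, not necessarily
the method used for the slides; the function name, the min_support parameter and
the sample language sets are hypothetical.
```python
from collections import Counter
from itertools import combinations
def frequent_language_sets(labels_by_property, size, min_support):
    """Count language combinations of a given size that co-occur on properties."""
    counts = Counter()
    for languages in labels_by_property.values():
        # Sorting gives every combination one canonical order.
        counts.update(combinations(sorted(languages), size))
    return {combo: n for combo, n in counts.items() if n >= min_support}
# Toy data: which languages have a label on each property.
sample = {"P31": {"en", "nl", "fr"}, "P17": {"en", "nl"}, "P18": {"en", "fr", "nl"}}
print(frequent_language_sets(sample, size=2, min_support=3))
```
With the toy data, only English and Dutch appear together on all three properties.
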
00:17:36.901 --> 00:17:41.397
Okay, let us say, is there a possibility
of having four labels,
00:17:41.397 --> 00:17:44.313
like if there is English,
there's also possibility to find Dutch,
00:17:44.313 --> 00:17:45.794
Arabic, Ukrainian.
00:17:45.794 --> 00:17:48.041
If there is English,
there's possibility to find Dutch,
00:17:48.041 --> 00:17:49.798
French, and Arabic, et cetera.
00:17:49.798 --> 00:17:52.763
You can also find a lot of combinations.
00:17:52.763 --> 00:17:53.907
Why is it important?
00:17:53.907 --> 00:17:57.432
Because it is important to know if,
00:17:57.432 --> 00:17:59.998
for example,
if you have multilingual speakers
00:17:59.998 --> 00:18:03.664
who are contributors,
who can speak multiple languages,
00:18:03.664 --> 00:18:07.402
if you're able to find
any particular pattern
00:18:07.402 --> 00:18:12.556
that helps us to find
that if you tell this person to translate,
00:18:12.556 --> 00:18:15.276
a new property is created
to translate this label,
00:18:15.276 --> 00:18:19.213
because he already
speaks multiple languages,
00:18:19.213 --> 00:18:21.669
we can suggest these things to the user.
00:18:21.669 --> 00:18:24.858
So let's just show you one example.
00:18:24.858 --> 00:18:27.257
This is a complete translation path
00:18:27.257 --> 00:18:29.774
that has been obtained
from different languages.
00:18:29.774 --> 00:18:35.001
So here, what we have done is
we selected two small minority languages,
00:18:35.001 --> 00:18:39.293
like Tagalog and Kapampangan,
00:18:39.293 --> 00:18:42.602
which are minority languages
from the Philippines,
00:18:42.602 --> 00:18:46.156
and you see that there is
a strong transfer
00:18:46.156 --> 00:18:49.645
between Tagalog and Kapampangan.
00:18:49.645 --> 00:18:51.784
So these types of things can be detected
00:18:51.784 --> 00:18:54.738
when you have such type
of translation results.
00:18:54.738 --> 00:18:57.311
So that is another advantage.
00:18:57.311 --> 00:18:59.780
To conclude my work,
I would like to say,
00:18:59.780 --> 00:19:05.128
this is important that we understand
how properties are translated
00:19:05.128 --> 00:19:10.534
because if you want to extract data
from Wikipedia,
00:19:10.534 --> 00:19:14.661
you need to know what are the words
00:19:14.661 --> 00:19:16.491
in the local languages
that are being used.
00:19:16.491 --> 00:19:20.208
What is "image" in French,
what is "image" in Punjabi,
00:19:20.208 --> 00:19:22.539
what is "image" in Hindi,
or any other language.
00:19:22.539 --> 00:19:25.890
So that is important for importing data.
00:19:25.890 --> 00:19:30.023
And tomorrow, of course,
if you are able to fetch this data,
00:19:30.023 --> 00:19:35.193
to Wikidata, we could also
use new projects like Wikidata Bridge,
00:19:35.193 --> 00:19:38.963
which we could use
to fill other info boxes,
00:19:38.963 --> 00:19:44.563
like multilingual Wikipedia articles,
00:19:44.563 --> 00:19:47.370
and this could be really helpful.
00:19:47.370 --> 00:19:51.238
So with that, I would like to thank you,
and if you have questions,
00:19:51.238 --> 00:19:54.321
I would be happy to answer them.
00:19:55.131 --> 00:19:57.218
(moderator) Anybody with questions?
00:19:58.842 --> 00:20:01.854
(audience applause)
00:20:08.387 --> 00:20:09.479
Yes?
00:20:11.988 --> 00:20:15.746
(man) So what you're doing
is mainly analyzing how this--
00:20:15.746 --> 00:20:17.389
- (John) Yes.
- (man) ...is all happening?
00:20:17.389 --> 00:20:21.418
Do you know if there are initiatives
or if there are tools
00:20:21.418 --> 00:20:25.331
which can help make this easier,
like translation of properties?
00:20:25.331 --> 00:20:28.321
Yes. Tools, like, for example,
what to translate
00:20:28.321 --> 00:20:32.995
from Wikimedia Foundation, is helpful,
but I have not seen--
00:20:32.995 --> 00:20:35.522
This is not currently
integrated with Wikidata.
00:20:35.522 --> 00:20:41.672
What to translate is only integrated
with certain languages on Wikipedia,
00:20:41.672 --> 00:20:44.485
but not on Wikidata.
00:20:44.485 --> 00:20:46.460
But that could be really interesting.
00:20:46.460 --> 00:20:50.165
Yes, thank you for bringing this up,
because just imagine,
00:20:50.165 --> 00:20:54.490
if we know that a person
has been labeling in multiple languages,
00:20:54.490 --> 00:20:56.842
and we also have
this what to translate tool,
00:20:56.842 --> 00:21:00.007
and we have these statistics,
we have this data
00:21:00.007 --> 00:21:04.657
coming from this type
of property translation,
00:21:04.657 --> 00:21:09.423
it is easier to suggest to a person
that new properties have been created,
00:21:09.423 --> 00:21:11.461
and then you could--
00:21:11.461 --> 00:21:13.980
Right now it's not integrated with Wikidata.
00:21:15.674 --> 00:21:17.432
(moderator) Anybody else?
00:21:20.246 --> 00:21:23.315
(man 2) I have one question myself,
that comes back to it,
00:21:23.315 --> 00:21:27.748
does anybody know of working lists
on translating properties?
00:21:27.748 --> 00:21:28.769
Sorry?
00:21:28.769 --> 00:21:30.489
(man 2) Does anybody
know of working lists
00:21:30.489 --> 00:21:31.695
about translating properties,
00:21:31.695 --> 00:21:37.751
like, I can imagine from your statistics,
you could say, this is the top 100
00:21:37.751 --> 00:21:39.944
most widely used properties
00:21:39.944 --> 00:21:42.844
that lack translations
in this and this language?
00:21:42.844 --> 00:21:47.494
No, there is, I think,
there are ways by,
00:21:47.494 --> 00:21:51.112
for example,
you could browse by data types,
00:21:51.112 --> 00:21:53.843
browse by property classes.
00:21:53.843 --> 00:21:57.398
For example, here is something
called property classes
00:21:57.398 --> 00:22:00.743
where people have created projects--
00:22:00.743 --> 00:22:03.272
it's taking time--
so you have projects,
00:22:03.272 --> 00:22:08.597
and you could say, how would I describe,
what are the, for example,
00:22:08.597 --> 00:22:11.978
what are the properties
that I could describe for this,
00:22:11.978 --> 00:22:14.183
for describing IEEE standard version?
00:22:14.183 --> 00:22:16.846
You need edition number,
you need edition translation, et cetera.
00:22:16.846 --> 00:22:22.890
So if you have a targeted thing,
you could search for what type of classes.
00:22:22.890 --> 00:22:25.853
For example, if you're working
in GLAM or histories,
00:22:25.853 --> 00:22:29.652
you could say, what is history-related
any document are there?
00:22:29.652 --> 00:22:32.715
So you could say, historical,
and you could find historical.
00:22:32.715 --> 00:22:36.247
Okay, this is a property class,
go to this property class.
00:22:36.247 --> 00:22:37.855
And, sorry, where is it?
00:22:37.855 --> 00:22:40.437
So it is having something
called "Mérimée ID."
00:22:40.437 --> 00:22:44.467
So people have been
trying to use property classes
00:22:44.467 --> 00:22:45.913
to link objects.
00:22:45.913 --> 00:22:49.577
That helps if you're working
on a particular project,
00:22:49.577 --> 00:22:52.342
and you could find
properties related to that.
00:22:52.342 --> 00:22:58.246
(man 2) But your tool could quite easily
make a list of, let's say,
00:22:58.246 --> 00:23:02.746
the top 100 most widely used properties
00:23:02.746 --> 00:23:07.488
who haven't got, I don't know,
a Punjabi label, let's say?
00:23:07.488 --> 00:23:10.284
- (John) For that, I will just--
- (man 2) Which could be interesting.
00:23:10.284 --> 00:23:14.310
(John) Okay, tell me any language,
for example, let us say, Netherlands,
00:23:14.310 --> 00:23:17.456
because it's performing very well.
00:23:17.456 --> 00:23:21.861
So I would say-- translated labels.
00:23:21.861 --> 00:23:24.011
So this is translate-- sorry.
00:23:30.491 --> 00:23:33.059
(mouse clicking)
00:23:36.747 --> 00:23:38.697
For example, Hindi.
00:23:38.697 --> 00:23:40.497
So here, what happens,
00:23:40.497 --> 00:23:44.335
here you just see any properties
that need translation.
00:23:44.335 --> 00:23:47.473
So there are like 6,647 properties
00:23:47.473 --> 00:23:50.299
that need translation
in a particular language.
00:23:50.299 --> 00:23:54.998
So you could click on any language
that you want and get the data.
00:23:54.998 --> 00:23:58.778
And you could get the list
of where people need support.
00:23:58.778 --> 00:24:03.345
So, this could be interesting
to link with property usage,
00:24:03.345 --> 00:24:06.232
how many people, is it really top,
is it under the top ten.
00:24:06.232 --> 00:24:08.871
So suggest those ten top hundred,
in that language.
00:24:08.871 --> 00:24:11.282
That would be an interesting list.
That's good.
00:24:11.852 --> 00:24:13.054
(man 3) Just what you asked,
00:24:13.054 --> 00:24:17.077
there is a list of top 100
most used properties on Wikidata.
00:24:17.077 --> 00:24:18.924
It's on Wikidata.
00:24:18.924 --> 00:24:21.432
So, yeah, it's there,
00:24:21.432 --> 00:24:25.942
under Wikidata Database Reports/
Top 100 Properties.
00:24:25.942 --> 00:24:31.083
So one thing could be that
we could just link this and suggest it.
00:24:31.083 --> 00:24:33.349
(moderator) Could you maybe
add the link to the etherpad,
00:24:33.349 --> 00:24:37.270
and then maybe,
this information can come together.
00:24:37.270 --> 00:24:38.631
(John) Okay.
00:24:40.049 --> 00:24:42.007
(moderator) If there are
no other questions,
00:24:42.007 --> 00:24:44.045
then we will conclude here.
00:24:44.045 --> 00:24:49.236
And we have two, three minutes break
until we start with the next speaker.
00:24:49.236 --> 00:24:50.864
- Thanks.
- (John) Thank you very much.
00:24:50.864 --> 00:24:53.041
(audience applause)