1
00:00:06,055 --> 00:00:09,281
(moderator) Good afternoon, everybody.
We're about to start.
2
00:00:09,281 --> 00:00:11,416
I'm presenting you John Samuel
3
00:00:11,416 --> 00:00:17,207
who works at the French
engineering school CPE,
4
00:00:17,207 --> 00:00:19,658
based in Lyon in France.
5
00:00:19,658 --> 00:00:21,101
And he will tell us something more
6
00:00:21,101 --> 00:00:27,271
about the translation
of properties in Wikidata.
7
00:00:27,271 --> 00:00:29,604
As you know,
as is the case in all sessions,
8
00:00:29,604 --> 00:00:32,172
there is an etherpad
for collaborative note-taking.
9
00:00:32,172 --> 00:00:34,904
Please don't forget that.
10
00:00:34,904 --> 00:00:36,302
We'll have the presentation
11
00:00:36,302 --> 00:00:39,988
and then we'll have
some time for a short Q&A.
12
00:00:39,988 --> 00:00:42,051
- The floor is yours.
- (John) Thanks, [inaudible].
13
00:00:42,917 --> 00:00:45,114
Thank you all for coming here.
14
00:00:45,114 --> 00:00:50,257
So my talk is about analyzing
translation of Wikidata properties.
15
00:00:50,257 --> 00:00:52,743
So, just to give you a quick outline.
16
00:00:52,743 --> 00:00:54,859
I would like to introduce this topic.
17
00:00:54,859 --> 00:00:58,756
I will present a tool
that I developed some years ago,
18
00:00:58,756 --> 00:01:01,446
called WDProp,
which I'm continuously working on,
19
00:01:01,446 --> 00:01:03,795
and based on the feedback
from the community,
20
00:01:03,795 --> 00:01:05,319
I add new features.
21
00:01:05,319 --> 00:01:09,368
And then I will talk about
something called coarser analysis,
22
00:01:09,368 --> 00:01:12,476
where I would like to look
at the property translation,
23
00:01:12,476 --> 00:01:15,257
from a much larger picture.
24
00:01:15,257 --> 00:01:18,667
So I will talk about
how we collected this data,
25
00:01:18,667 --> 00:01:23,002
because this work is also done
with one of my students, Thibaut Chamard.
26
00:01:23,002 --> 00:01:26,682
And then I will present some results,
and finally, I will conclude the talk.
27
00:01:27,469 --> 00:01:30,982
So Wikidata, as you all know,
it started in 2012,
28
00:01:30,982 --> 00:01:33,877
and it's a free, open, linked,
structured, collaborative,
29
00:01:33,877 --> 00:01:36,010
and multilingual knowledge base.
30
00:01:36,910 --> 00:01:40,063
My focus today
is on the multilingual part,
31
00:01:40,063 --> 00:01:42,979
because there is a big change
from the traditional way
32
00:01:42,979 --> 00:01:45,412
of how we used to edit on Wikipedia site.
33
00:01:45,412 --> 00:01:47,917
There were multiple subdomains,
34
00:01:47,917 --> 00:01:50,753
and now you have a single domain
on Wikidata
35
00:01:50,753 --> 00:01:56,191
where multilingual contributors come
and write or create articles.
36
00:01:56,191 --> 00:01:57,499
So this is collaborative.
37
00:01:57,499 --> 00:02:00,585
There has been work to say
what exactly is collaborative,
38
00:02:00,585 --> 00:02:02,441
why it is collaborative.
39
00:02:02,441 --> 00:02:04,597
I have given references for these works.
40
00:02:04,597 --> 00:02:07,254
So this is, if you see Wikidata,
41
00:02:07,254 --> 00:02:11,057
everything that starts
is starting from the property.
42
00:02:11,057 --> 00:02:14,144
The property is proposed
and then discussed and voted.
43
00:02:14,144 --> 00:02:17,471
And then it is created
and finally translated,
44
00:02:17,471 --> 00:02:20,005
and then you are finally
able to use these properties.
45
00:02:20,005 --> 00:02:22,010
But these properties may also be deleted--
46
00:02:22,010 --> 00:02:24,019
there's also something called deletion.
47
00:02:24,019 --> 00:02:26,700
But, as I highlighted on this slide,
48
00:02:26,700 --> 00:02:28,856
my focus is on the multilingual aspect,
49
00:02:28,856 --> 00:02:32,671
and the property creation
and translation point of view.
50
00:02:32,671 --> 00:02:36,408
So you have been here
for the past two days,
51
00:02:36,408 --> 00:02:40,095
and by this time
you have seen many articles,
52
00:02:40,095 --> 00:02:46,029
and I just want to point
what I am looking for on a Wikidata item.
53
00:02:46,029 --> 00:02:48,005
This is a Wikidata item,
54
00:02:48,005 --> 00:02:51,697
so you have this Q2841, which is Bogotá,
55
00:02:51,697 --> 00:02:55,597
which is the capital city of Colombia,
56
00:02:55,597 --> 00:02:57,389
and you have four parts here:
57
00:02:57,389 --> 00:03:00,678
the languages, the labels,
the description, and aliases.
58
00:03:00,678 --> 00:03:02,255
So you can see,
for different languages
59
00:03:02,255 --> 00:03:05,089
you'll have the label,
you have the description
60
00:03:05,089 --> 00:03:10,970
as well as, if there are any aliases,
also known as, you could see them.
61
00:03:10,970 --> 00:03:14,180
And this, under the city,
where you see the labels
62
00:03:14,180 --> 00:03:16,155
and the properties together.
63
00:03:16,155 --> 00:03:20,845
This is Avignon, a city in France.
64
00:03:20,845 --> 00:03:24,966
So what I'm interested in
is only the properties part.
65
00:03:24,966 --> 00:03:30,638
For example, official name, native label,
country, capital of, et cetera.
66
00:03:30,638 --> 00:03:34,310
So when I say property,
for example, if a country,
67
00:03:34,310 --> 00:03:37,736
in this country,
I'm looking at different aspects:
68
00:03:37,736 --> 00:03:39,986
the language, the label,
and the description,
69
00:03:39,986 --> 00:03:42,670
and see how things change.
70
00:03:42,670 --> 00:03:44,446
For example, if you take instance of--
71
00:03:44,446 --> 00:03:48,932
okay, everybody knows instance of,
you have been using it quite a lot--
72
00:03:48,932 --> 00:03:54,089
this is P31, you see
the number of aliases in English
73
00:03:54,089 --> 00:03:58,667
for the property P31, instance of,
74
00:03:58,667 --> 00:04:03,686
and then you would find
that these types of properties
75
00:04:03,686 --> 00:04:07,536
are created after discussion
with the community.
76
00:04:07,536 --> 00:04:10,513
So if I take the complete prop--
the procedure,
77
00:04:10,513 --> 00:04:13,343
what happens to creation of properties--
78
00:04:13,343 --> 00:04:17,347
you start proposing properties
with some possible translation.
79
00:04:17,347 --> 00:04:19,388
It is important it's not just in English.
80
00:04:19,388 --> 00:04:23,734
You have the templates
to suggest your properties
81
00:04:23,734 --> 00:04:25,129
in your local language.
82
00:04:25,129 --> 00:04:28,552
So that's why it's a proposition
with possible translation.
83
00:04:28,552 --> 00:04:32,367
And then you put it to discussion,
then you are put to voting,
84
00:04:32,367 --> 00:04:37,273
and it's created, and then finally,
the community members start translating it
85
00:04:37,273 --> 00:04:38,976
and people put it into use.
86
00:04:38,976 --> 00:04:42,336
But then you cannot be guaranteed
the properties that are created
87
00:04:42,336 --> 00:04:44,435
are always there forever.
88
00:04:44,435 --> 00:04:47,417
Properties can be deleted,
just like items can be deleted.
89
00:04:47,417 --> 00:04:51,004
But then, again,
it goes through a similar procedure.
90
00:04:51,004 --> 00:04:54,727
You put the property
91
00:04:54,727 --> 00:04:58,427
as you propose that it should be deleted,
92
00:04:58,427 --> 00:05:02,424
and if the community decides it,
it votes it, and then if it is decided--
93
00:05:02,424 --> 00:05:05,238
the majority votes
has decided to delete it--
94
00:05:05,238 --> 00:05:09,191
we deprecate the property,
and finally we delete this property.
95
00:05:09,191 --> 00:05:14,826
So for today's talk, I'm mostly interested
in the translation part.
96
00:05:14,826 --> 00:05:17,004
So where are the translations happening?
97
00:05:17,004 --> 00:05:20,037
First, the translation would happen
at the proposition part,
98
00:05:20,037 --> 00:05:22,778
and then you could find that,
at the time of creation,
99
00:05:22,778 --> 00:05:27,917
the person who creates the property
can use the exact names
100
00:05:27,917 --> 00:05:31,062
that were suggested
by the property proposer
101
00:05:31,062 --> 00:05:34,753
and he or she will create the properties,
102
00:05:34,753 --> 00:05:38,705
and later, you start translating
these properties.
103
00:05:38,705 --> 00:05:43,176
So let us look at why this matters,
why it is important.
104
00:05:43,176 --> 00:05:44,909
So I put some examples.
105
00:05:44,909 --> 00:05:47,162
This is, again, on P31,
106
00:05:47,162 --> 00:05:51,762
instance of, the very, very famous
property P31,
107
00:05:51,762 --> 00:05:56,094
and you see there is
no description for this item.
108
00:05:56,094 --> 00:06:00,876
There are almost
six descriptions on this image,
109
00:06:00,876 --> 00:06:03,310
where we do not have any description.
110
00:06:03,310 --> 00:06:06,961
Again, some more description
for Odia and Punjabi,
111
00:06:06,961 --> 00:06:07,970
there is no description.
112
00:06:07,970 --> 00:06:10,806
This is a property
which is used quite a lot,
113
00:06:10,806 --> 00:06:13,820
and you see that there is
no description for it.
114
00:06:13,820 --> 00:06:17,876
And there is a surprising part
that you could also have cases
115
00:06:17,876 --> 00:06:22,000
where there are descriptions,
but there are no labels.
116
00:06:22,000 --> 00:06:25,293
For example, Ruffian,
that has been shown here,
117
00:06:25,293 --> 00:06:30,116
again on property P31,
there is a label that is missing.
118
00:06:30,116 --> 00:06:34,100
So this was the initial
inspiration for this work
119
00:06:34,100 --> 00:06:37,486
when I started working
on property analysis.
120
00:06:37,486 --> 00:06:44,272
I wanted to look at
what aspects of properties,
121
00:06:44,272 --> 00:06:46,459
or what aspects of property
122
00:06:46,459 --> 00:06:49,569
that the whole flow chart
that we have seen,
123
00:06:49,569 --> 00:06:51,316
is multilingual.
124
00:06:51,316 --> 00:06:53,048
So I wanted to look at,
125
00:06:53,048 --> 00:06:56,304
okay, we know that Wikidata
is multilingual,
126
00:06:56,304 --> 00:06:58,984
and it's collaborative,
that has been done.
127
00:06:58,984 --> 00:07:05,285
But are we really able to achieve
a truly multilingual experience?
128
00:07:05,285 --> 00:07:09,054
That was the question
behind the creation of WDProp.
129
00:07:09,054 --> 00:07:11,166
So you may ask
why there are so many people
130
00:07:11,166 --> 00:07:14,600
who have worked on items,
there are people who have worked on--
131
00:07:14,600 --> 00:07:17,047
users, multilingual users
and bots, et cetera,
132
00:07:17,047 --> 00:07:19,444
why you want to focus on properties?
133
00:07:19,444 --> 00:07:22,770
The answer is,
I want to focus on properties
134
00:07:22,770 --> 00:07:25,738
because it's much, much
less influenced by bots.
135
00:07:25,738 --> 00:07:28,581
You may have heard today or yesterday,
136
00:07:28,581 --> 00:07:31,895
many people said,
"Okay, if you have translation
137
00:07:31,895 --> 00:07:36,761
in your local languages,
and it has reached a very good number,
138
00:07:36,761 --> 00:07:39,227
you should ensure
what type of translation it is.
139
00:07:39,227 --> 00:07:44,339
Is it just bots, which copy
the name of a person to another language?
140
00:07:44,339 --> 00:07:47,242
Then is it really translation?"
141
00:07:47,242 --> 00:07:48,413
Okay, that's debatable.
142
00:07:48,413 --> 00:07:51,365
But, of course,
there is an influence by bot,
143
00:07:51,365 --> 00:07:54,811
but in case of properties,
there is not so much influence by bots,
144
00:07:54,811 --> 00:07:55,913
and that is a good part.
145
00:07:55,913 --> 00:08:00,706
That's why I focus on the properties part.
146
00:08:00,706 --> 00:08:05,552
So, as I said, when WDProp was created,
147
00:08:05,552 --> 00:08:09,451
it was to understand every aspect--
the proposal, the creation, translation.
148
00:08:09,451 --> 00:08:12,326
What are the templates that are available.
149
00:08:12,326 --> 00:08:16,232
Are these templates,
for example, you said support,
150
00:08:16,232 --> 00:08:21,875
if a French person opens Wikidata,
a Wikidata France translation page,
151
00:08:21,875 --> 00:08:28,039
can he see the word, [soutien],
for that particular property proposal?
152
00:08:28,039 --> 00:08:29,373
Is it possible?
153
00:08:29,373 --> 00:08:33,125
So this type of things was needed.
154
00:08:33,125 --> 00:08:35,987
In the end, it was also
about giving real-time statistics
155
00:08:35,987 --> 00:08:37,741
to the multilingual contributors.
156
00:08:37,741 --> 00:08:38,783
It's not about one time,
157
00:08:38,783 --> 00:08:42,178
it's like you just made it
and published for one time-- no.
158
00:08:42,178 --> 00:08:45,434
You want people
to get this data in real time.
159
00:08:45,434 --> 00:08:46,716
So what are we doing?
160
00:08:46,716 --> 00:08:52,065
So the goal of WDProp
was to understand everything
161
00:08:52,065 --> 00:08:54,418
about Wikidata properties.
162
00:08:54,418 --> 00:08:56,955
So, label, aliases, description.
163
00:08:56,955 --> 00:09:01,348
So you have got all these three translated
so the middle part where you say,
164
00:09:01,348 --> 00:09:05,618
this property is completely usable
because all the three aspects
165
00:09:05,618 --> 00:09:08,984
have been translated.
166
00:09:08,984 --> 00:09:12,055
So let me just show you quickly,
what is this WDProp,
167
00:09:12,055 --> 00:09:14,224
what I'm talking about.
168
00:09:14,224 --> 00:09:15,496
So this is the WDProp,
169
00:09:15,496 --> 00:09:19,726
it's available on
tools.wmflabs.org/wdprop/.
170
00:09:19,726 --> 00:09:23,813
So you have a lot of statistics
and if I ask you some questions today,
171
00:09:23,813 --> 00:09:27,960
like, for example,
"How many data types are there
172
00:09:27,960 --> 00:09:30,846
that are supported by Wikidata right now?"
173
00:09:30,846 --> 00:09:34,369
So if such questions, we do not know,
174
00:09:34,369 --> 00:09:37,549
sometimes because there are new data types
that keep on coming.
175
00:09:37,549 --> 00:09:41,668
So this data,
this is generated in real time,
176
00:09:41,668 --> 00:09:44,993
this creates the data structure
and it will give you the answer.
177
00:09:44,993 --> 00:09:46,486
How many languages are there?
178
00:09:46,486 --> 00:09:50,194
Yes, of course,
see that there are 313 languages.
179
00:09:50,194 --> 00:09:55,092
And then, for example,
how many labels were translated.
180
00:09:55,092 --> 00:09:58,694
So you could see
that the data is being fetched.
181
00:09:58,694 --> 00:10:00,242
I hope it comes.
182
00:10:01,512 --> 00:10:03,003
Okay, let's hope. (chuckles)
183
00:10:07,984 --> 00:10:11,621
Okay, I will take
some other stuff as well.
184
00:10:11,621 --> 00:10:13,964
Browsing all properties by their time.
185
00:10:13,964 --> 00:10:17,079
Yes. So you see,
this is count of translated labels,
186
00:10:17,079 --> 00:10:20,142
and you see all this data
that is coming in real time,
187
00:10:20,142 --> 00:10:21,781
and you can see that the labels
188
00:10:21,781 --> 00:10:26,881
are currently available
for 6,804 properties in English,
189
00:10:26,881 --> 00:10:31,291
followed by Dutch, followed by Arabic,
followed by Ukrainian, and then French.
190
00:10:31,291 --> 00:10:32,922
So this is real-time statistics.
191
00:10:32,922 --> 00:10:35,446
So you could also do the same
for description,
192
00:10:35,446 --> 00:10:37,747
also do for aliases, et cetera.
193
00:10:37,747 --> 00:10:41,383
And you could get the overall
translation statuses if you want.
194
00:10:41,383 --> 00:10:43,937
So there are some other things
that we will discuss later,
195
00:10:43,937 --> 00:10:45,586
if time permits.
196
00:10:45,586 --> 00:10:50,132
But you could navigate
all the different items
197
00:10:50,132 --> 00:10:52,367
on the left-hand side,
198
00:10:52,367 --> 00:10:54,127
and you could see
there are a lot of things
199
00:10:54,127 --> 00:10:59,471
that could really help to see
what things are happening in WDProp.
200
00:10:59,471 --> 00:11:03,591
So this is, for example,
Wikidata properties,
201
00:11:03,591 --> 00:11:05,789
these are the properties
that are currently available.
202
00:11:05,789 --> 00:11:10,039
But as I said some time back,
properties could be deleted.
203
00:11:10,039 --> 00:11:13,121
And this, you see that these are
the properties that were deleted,
204
00:11:13,121 --> 00:11:17,171
starting from P1, P2, P3, P4, P5,
these have all been deleted,
205
00:11:17,171 --> 00:11:23,005
and you could get this thing
just from the statistics board.
206
00:11:23,005 --> 00:11:24,947
And here, so same thing.
207
00:11:24,947 --> 00:11:29,938
Then, the next thing that interested me
was to understand the translation pattern.
208
00:11:29,938 --> 00:11:33,388
So, for example, sometimes we feel
that some languages--
209
00:11:33,388 --> 00:11:36,514
so English is created first,
and followed by maybe Dutch,
210
00:11:36,514 --> 00:11:38,201
or maybe French,
211
00:11:38,201 --> 00:11:40,701
and maybe after French,
it could be Arabic.
212
00:11:40,701 --> 00:11:43,627
So these things
could be interesting to know.
213
00:11:43,627 --> 00:11:48,596
So for that, we started to look
at the idea of translation path--
214
00:11:48,596 --> 00:11:51,607
exactly how things are translated.
215
00:11:51,607 --> 00:11:56,542
So again, if you go to the property page,
you could click on any property.
216
00:11:56,542 --> 00:11:57,662
Sorry.
217
00:11:59,375 --> 00:12:01,053
Maybe I can show.
218
00:12:03,527 --> 00:12:06,497
So you could click on any property
and you could just say,
219
00:12:06,497 --> 00:12:07,794
"Give me the translation path."
220
00:12:07,794 --> 00:12:11,487
It takes some time,
but it will start bringing the data,
221
00:12:11,487 --> 00:12:15,434
because it's real time,
so you get the data coming from all this.
222
00:12:15,434 --> 00:12:16,595
So you get the date,
223
00:12:16,595 --> 00:12:22,244
you get what things have been changed,
when was something deleted, et cetera.
224
00:12:22,244 --> 00:12:23,848
Why is it important?
225
00:12:24,948 --> 00:12:29,401
For example, you see
this is something that happened in 2017,
226
00:12:29,401 --> 00:12:31,955
and the label has been removed.
227
00:12:31,955 --> 00:12:33,893
This is the official website.
228
00:12:33,893 --> 00:12:38,944
So imagine you have removed the label
from the official website--
229
00:12:38,944 --> 00:12:39,978
sorry, this country--
230
00:12:39,978 --> 00:12:43,357
so anybody who doesn't know P17,
what it is, cannot even understand,
231
00:12:43,357 --> 00:12:45,971
because the label has been deleted
by the person.
232
00:12:45,971 --> 00:12:47,915
So this type of vandalism exists.
233
00:12:47,915 --> 00:12:50,710
Another example where, completely,
234
00:12:50,710 --> 00:12:52,601
all the language labels
have been deleted--
235
00:12:52,601 --> 00:12:56,183
English, French, Spanish, German,
everything has been deleted.
236
00:12:56,183 --> 00:12:58,329
There are no labels,
there are no descriptions.
237
00:12:58,329 --> 00:13:01,033
So you could find these types of things
from the translation path
238
00:13:01,033 --> 00:13:05,483
and just because of the color code,
you could see what happened on what day,
239
00:13:05,483 --> 00:13:09,666
and you could check exactly,
because it is also linked.
240
00:13:09,666 --> 00:13:14,261
If you click on any of this,
you could also get a link to the revision,
241
00:13:14,261 --> 00:13:19,478
identify what exactly happened
during that particular revision.
242
00:13:19,478 --> 00:13:21,309
So this is coming from revision history.
243
00:13:21,309 --> 00:13:25,311
So if you click on any of this,
you get what exactly is happening
244
00:13:25,311 --> 00:13:28,567
in any particular revision.
245
00:13:28,567 --> 00:13:30,733
So how did we build it?
246
00:13:30,733 --> 00:13:31,923
Just if you come back,
247
00:13:31,923 --> 00:13:38,396
here, you see there is something
called a comment on the right-hand side.
248
00:13:38,396 --> 00:13:42,602
You see there is something
called added aliases,
249
00:13:42,602 --> 00:13:46,613
"added British English aliases,"
"changed Esperanto label,"
250
00:13:46,613 --> 00:13:48,109
"added [io] label," et cetera.
251
00:13:48,109 --> 00:13:50,710
So we made use of this information,
252
00:13:50,710 --> 00:13:53,209
for example,
for label, description, and aliases,
253
00:13:53,209 --> 00:13:55,507
if you add something,
you have some sort of comment
254
00:13:55,507 --> 00:13:58,216
which starts with wbsetlabel-add.
255
00:13:58,216 --> 00:14:01,635
Or if it is updated,
you have wbsetlabel-set.
256
00:14:01,635 --> 00:14:04,487
And if you remove something,
you see it is removed.
257
00:14:04,487 --> 00:14:06,795
And based on this type of information,
258
00:14:06,795 --> 00:14:11,167
we were able to build
such a translation path.
259
00:14:11,167 --> 00:14:16,557
Okay, this is good, but what happened
is that this type of information,
260
00:14:16,557 --> 00:14:19,366
this type of things,
just using the comment,
261
00:14:19,366 --> 00:14:23,932
it is useful for building real-time tools,
just like what I showed before, WDProp,
262
00:14:23,932 --> 00:14:30,886
but it is very difficult to detect
when there are multiple changes.
263
00:14:30,886 --> 00:14:34,871
For example, if you have seen
bot activity on Wikidata,
264
00:14:34,871 --> 00:14:39,550
some bots make multiple labels
in one single edit.
265
00:14:39,550 --> 00:14:42,037
In that case,
you cannot find what happened
266
00:14:42,037 --> 00:14:45,878
because you do not have wbsetlabel,
that particular language.
267
00:14:45,878 --> 00:14:49,254
So you do not have a set of languages
along with your comment.
268
00:14:49,254 --> 00:14:53,703
So these are some problems
if you want to use this type of approach.
269
00:14:54,603 --> 00:14:58,245
So what we did,
we decided to collect the data,
270
00:14:58,245 --> 00:15:01,316
and we decided to publicly
make this data available.
271
00:15:02,516 --> 00:15:06,246
And what we did,
we wanted to make use of content.
272
00:15:06,246 --> 00:15:08,579
So what we did,
we started with every revision,
273
00:15:08,579 --> 00:15:12,096
and we took the content of each revision.
274
00:15:12,096 --> 00:15:16,717
And we took the next revision,
and we decided to find the difference
275
00:15:16,717 --> 00:15:19,885
between these two revisions,
to find what exactly changes,
276
00:15:19,885 --> 00:15:21,822
which of the labels got changed.
277
00:15:21,822 --> 00:15:25,436
Because of that, we got
much more interesting information,
278
00:15:25,436 --> 00:15:28,899
much more accurate information
than the previous approach
279
00:15:28,899 --> 00:15:31,274
because it is very important
for doing analysis.
280
00:15:31,274 --> 00:15:34,020
It is important
that you make use of correct data.
281
00:15:34,020 --> 00:15:36,866
So you have four columns
that were used here--
282
00:15:36,866 --> 00:15:39,091
timestamp, property,
language, type, et cetera.
283
00:15:39,091 --> 00:15:44,494
And you get this data in this format.
It is publicly available.
284
00:15:44,494 --> 00:15:47,446
So what does this data give me?
285
00:15:47,446 --> 00:15:48,791
This data gives me information
286
00:15:48,791 --> 00:15:54,791
that currently almost 4,000 plus,
287
00:15:54,791 --> 00:15:57,291
4,500 properties
288
00:15:57,291 --> 00:15:59,917
have labels between 0 and 20.
289
00:15:59,917 --> 00:16:02,145
So there are a lot of properties
290
00:16:02,145 --> 00:16:07,107
that do not have
more than 20 multilingual labels.
291
00:16:07,107 --> 00:16:10,888
And there are only
1,500 language properties
292
00:16:10,888 --> 00:16:12,857
that have been translated up to 40.
293
00:16:12,857 --> 00:16:18,699
And yesterday, if you were present
during the talk of Lydia Pintscher,
294
00:16:18,699 --> 00:16:21,967
she talked about P18,
so P18 is something here.
295
00:16:21,967 --> 00:16:25,332
So you can see there are only
a couple of six or seven properties
296
00:16:25,332 --> 00:16:30,147
that are currently having all the--
297
00:16:30,147 --> 00:16:35,092
P18 has 154 translations,
just to give that idea.
298
00:16:35,092 --> 00:16:39,913
So there is one property
which is having 154 multilingual labels.
299
00:16:39,913 --> 00:16:43,807
There are properties
which have only one particular label.
300
00:16:43,807 --> 00:16:50,112
And the average number
of labels is only 21,
301
00:16:50,112 --> 00:16:52,945
and the standard deviation is 20.
302
00:16:52,945 --> 00:16:55,967
Okay, what next we would like to say?
303
00:16:55,967 --> 00:16:59,970
So you have seen something similar
in the real-time data.
304
00:16:59,970 --> 00:17:02,079
This is from the collected data.
305
00:17:02,079 --> 00:17:07,503
So this is what are the top languages
that are coming up in the results.
306
00:17:07,503 --> 00:17:09,186
So these we have seen.
307
00:17:09,186 --> 00:17:13,314
But my next point is,
are there combinations possible.
308
00:17:13,314 --> 00:17:16,522
For example, if there is French,
there is Arabic.
309
00:17:16,522 --> 00:17:19,505
If there is Arabic,
there is some other language.
310
00:17:19,505 --> 00:17:22,102
If there's French,
there's Ukrainian, et cetera.
311
00:17:22,102 --> 00:17:26,093
Can we find such type of combinations
in the translation data set?
312
00:17:26,093 --> 00:17:27,415
So, yes, it is possible.
313
00:17:27,415 --> 00:17:30,195
So if you see this count,
this frequent itemsets--
314
00:17:30,195 --> 00:17:32,134
so I've just shown seven of them--
315
00:17:32,134 --> 00:17:35,315
you find that there are combinations
that are possible.
316
00:17:36,901 --> 00:17:41,397
Okay, let us say, is there a possibility
of having four labels,
317
00:17:41,397 --> 00:17:44,313
like if there is English,
there's also possibility to find Dutch,
318
00:17:44,313 --> 00:17:45,794
Arabic, Ukrainian.
319
00:17:45,794 --> 00:17:48,041
If there is English,
there's possibility to find Dutch,
320
00:17:48,041 --> 00:17:49,798
French, and Arabic, et cetera.
321
00:17:49,798 --> 00:17:52,763
You can also find a lot of combinations.
322
00:17:52,763 --> 00:17:53,907
Why is it important?
323
00:17:53,907 --> 00:17:57,432
Because it is important to know if,
324
00:17:57,432 --> 00:17:59,998
for example,
if you have multilingual speakers
325
00:17:59,998 --> 00:18:03,664
who are contributors,
who can speak multiple languages,
326
00:18:03,664 --> 00:18:07,402
if you're able to find
any particular pattern
327
00:18:07,402 --> 00:18:12,556
that helps us to find
that if you tell this person to translate,
328
00:18:12,556 --> 00:18:15,276
a new property is created
to translate this label,
329
00:18:15,276 --> 00:18:19,213
because he already
speaks multiple languages,
330
00:18:19,213 --> 00:18:21,669
we can suggest these things to the user.
331
00:18:21,669 --> 00:18:24,858
So let me just show you one example.
332
00:18:24,858 --> 00:18:27,257
This is a complete translation path
333
00:18:27,257 --> 00:18:29,774
that has been obtained
from different languages.
334
00:18:29,774 --> 00:18:35,001
So here, what we have done is
we selected two small minority languages,
335
00:18:35,001 --> 00:18:39,293
like Tagalog and Kapampangan,
336
00:18:39,293 --> 00:18:42,602
which are minority languages
from the Philippines,
337
00:18:42,602 --> 00:18:46,156
and you see that there is
a strong transfer
338
00:18:46,156 --> 00:18:49,645
between Tagalog and Kapampangan.
339
00:18:49,645 --> 00:18:51,784
So these types of things can be detected
340
00:18:51,784 --> 00:18:54,738
when you have such type
of translation results.
341
00:18:54,738 --> 00:18:57,311
So that is another advantage.
342
00:18:57,311 --> 00:18:59,780
To conclude my work,
I would like to say,
343
00:18:59,780 --> 00:19:05,128
this is important that we understand
how properties are translated
344
00:19:05,128 --> 00:19:10,534
because if you want to extract data
from Wikipedia,
345
00:19:10,534 --> 00:19:14,661
you need to know what are the words
346
00:19:14,661 --> 00:19:16,491
in the local languages
that are being used.
347
00:19:16,491 --> 00:19:20,208
What is "image" in French,
what is "image" in Punjabi,
348
00:19:20,208 --> 00:19:22,539
what is "image" in Hindi,
or any other language.
349
00:19:22,539 --> 00:19:25,890
So that is important for importing data.
350
00:19:25,890 --> 00:19:30,023
And tomorrow, of course,
if you are able to fetch this data,
351
00:19:30,023 --> 00:19:35,193
to Wikidata, we could also
use new projects like Wikidata Bridge,
352
00:19:35,193 --> 00:19:38,963
which we could use
to fill other infoboxes,
353
00:19:38,963 --> 00:19:44,563
like multilingual Wikipedia articles,
354
00:19:44,563 --> 00:19:47,370
and this could be really helpful.
355
00:19:47,370 --> 00:19:51,238
So with that, I would like to thank you,
and if you have questions,
356
00:19:51,238 --> 00:19:54,321
I would be happy to answer them.
357
00:19:55,131 --> 00:19:57,218
(moderator) Anybody with questions?
358
00:19:58,842 --> 00:20:01,854
(audience applause)
359
00:20:08,387 --> 00:20:09,479
Yes?
360
00:20:11,988 --> 00:20:15,746
(man) So what you're doing
is mainly analyzing how this--
361
00:20:15,746 --> 00:20:17,389
- (John) Yes.
- (man) ...is all happening?
362
00:20:17,389 --> 00:20:21,418
Do you know if there are initiatives
or if there are tools
363
00:20:21,418 --> 00:20:25,331
which can help make this easier,
like translation of properties?
364
00:20:25,331 --> 00:20:28,321
Yes. Tools, like, for example,
Content Translation
365
00:20:28,321 --> 00:20:32,995
from Wikimedia Foundation, are helpful,
but I have not seen--
366
00:20:32,995 --> 00:20:35,522
This is not currently
integrated with Wikidata.
367
00:20:35,522 --> 00:20:41,672
Content Translation is only integrated
with certain languages on Wikipedia,
368
00:20:41,672 --> 00:20:44,485
but not on Wikidata.
369
00:20:44,485 --> 00:20:46,460
But that could be really interesting.
370
00:20:46,460 --> 00:20:50,165
Yes, thank you for bringing this up,
because just imagine,
371
00:20:50,165 --> 00:20:54,490
if we know that a person
has been labeling in multiple languages,
372
00:20:54,490 --> 00:20:56,842
and we also have
this Content Translation tool,
373
00:20:56,842 --> 00:21:00,007
and we have these statistics,
we have this data
374
00:21:00,007 --> 00:21:04,657
coming from this type
of property translation,
375
00:21:04,657 --> 00:21:09,423
it is easier to suggest to a person
that new properties have been created,
376
00:21:09,423 --> 00:21:11,461
and then you could--
377
00:21:11,461 --> 00:21:13,980
Right now it's not integrated to Wikidata.
378
00:21:15,674 --> 00:21:17,432
(moderator) Anybody else?
379
00:21:20,246 --> 00:21:23,315
(man 2) I have one question myself,
that comes back to it,
380
00:21:23,315 --> 00:21:27,748
does anybody know of working lists
on translating properties?
381
00:21:27,748 --> 00:21:28,769
Sorry?
382
00:21:28,769 --> 00:21:30,489
(man 2) Does anybody
know of working lists
383
00:21:30,489 --> 00:21:31,695
about translating properties,
384
00:21:31,695 --> 00:21:37,751
like, I can imagine from your statistics,
you could say, this is the top 100
385
00:21:37,751 --> 00:21:39,944
most widely used properties
386
00:21:39,944 --> 00:21:42,844
who lack translations
in this and this language?
387
00:21:42,844 --> 00:21:47,494
No, there is, I think,
there are ways by,
388
00:21:47,494 --> 00:21:51,112
for example,
you could browse by data types,
389
00:21:51,112 --> 00:21:53,843
browse by property classes.
390
00:21:53,843 --> 00:21:57,398
For example, here is something
called property classes
391
00:21:57,398 --> 00:22:00,743
where people have created projects--
392
00:22:00,743 --> 00:22:03,272
it's taking time--
so you have projects,
393
00:22:03,272 --> 00:22:08,597
and you could say, how would I describe,
what are the, for example,
394
00:22:08,597 --> 00:22:11,978
what are the properties
that I could describe for this,
395
00:22:11,978 --> 00:22:14,183
for describing IEEE standard version?
396
00:22:14,183 --> 00:22:16,846
You need edition number,
you need edition translation, et cetera.
397
00:22:16,846 --> 00:22:22,890
So if you have a targeted thing,
you could search for what type of classes.
398
00:22:22,890 --> 00:22:25,853
For example, if you're working
in GLAM or histories,
399
00:22:25,853 --> 00:22:29,652
you could say, what is history-related
any document are there?
400
00:22:29,652 --> 00:22:32,715
So you could say, historical,
and you could find historical.
401
00:22:32,715 --> 00:22:36,247
Okay, this is a property class,
go to this property class.
402
00:22:36,247 --> 00:22:37,855
And, sorry, where is it?
403
00:22:37,855 --> 00:22:40,437
So it is having something
called "Merimee ID."
404
00:22:40,437 --> 00:22:44,467
So people have been
trying to use property classes
405
00:22:44,467 --> 00:22:45,913
to link objects.
406
00:22:45,913 --> 00:22:49,577
That helps if you're working
on a particular project,
407
00:22:49,577 --> 00:22:52,342
and you could find
the properties related to that.
408
00:22:52,342 --> 00:22:58,246
(man 2) But your tool could quite easily
make a list of, let's say,
409
00:22:58,246 --> 00:23:02,746
the top 100 most widely used properties
410
00:23:02,746 --> 00:23:07,488
who haven't got, I don't know,
Punjabi label, let's say?
411
00:23:07,488 --> 00:23:10,284
- (John) For that, I will just--
- (man 2) Which could be interesting.
412
00:23:10,284 --> 00:23:14,310
(John) Okay, tell me any language,
for example, let us say, Netherlands,
413
00:23:14,310 --> 00:23:17,456
because it's performing very well.
414
00:23:17,456 --> 00:23:21,861
So I would say-- translated labels.
415
00:23:21,861 --> 00:23:24,011
So this is translate-- sorry.
416
00:23:30,491 --> 00:23:33,059
(mouse clicking)
417
00:23:36,747 --> 00:23:38,697
For example, Hindi.
418
00:23:38,697 --> 00:23:40,497
So here, what happens,
419
00:23:40,497 --> 00:23:44,335
here you just see any properties
that need translation.
420
00:23:44,335 --> 00:23:47,473
So there are like 6,647 properties
421
00:23:47,473 --> 00:23:50,299
that need translation
in a particular language.
422
00:23:50,299 --> 00:23:54,998
So you could click on any language
that you want and get the data.
423
00:23:54,998 --> 00:23:58,778
And you could get the list
of where people need support.
424
00:23:58,778 --> 00:24:03,345
So, this could be interesting
to link with property usage,
425
00:24:03,345 --> 00:24:06,232
how many people, is it really top,
is it under the top ten.
426
00:24:06,232 --> 00:24:08,871
So suggest those ten top hundred,
in that language.
427
00:24:08,871 --> 00:24:11,282
That would be an interesting list.
That's good.
428
00:24:11,852 --> 00:24:13,054
(man 3) Just what you asked,
429
00:24:13,054 --> 00:24:17,077
there is a list of top 100
most used properties on Wikidata.
430
00:24:17,077 --> 00:24:18,924
It's on Wikidata.
431
00:24:18,924 --> 00:24:21,432
So, yeah, it's there,
432
00:24:21,432 --> 00:24:25,942
under Wikidata Database Reports/
Top 100 Properties.
433
00:24:25,942 --> 00:24:31,083
So one thing could be that
we could just link this and suggest it.
434
00:24:31,083 --> 00:24:33,349
(moderator) Could you maybe
add the link to the etherpad,
435
00:24:33,349 --> 00:24:37,270
and then maybe,
this information can come together.
436
00:24:37,270 --> 00:24:38,631
(John) Okay.
437
00:24:40,049 --> 00:24:42,007
(moderator) If there are
no other questions,
438
00:24:42,007 --> 00:24:44,045
then we will conclude here.
439
00:24:44,045 --> 00:24:49,236
And we have two, three minutes break
until we start with the next speaker.
440
00:24:49,236 --> 00:24:50,864
- Thanks.
- (John) Thank you very much.
441
00:24:50,864 --> 00:24:53,041
(audience applause)