WEBVTT 00:00:06.055 --> 00:00:09.281 (moderator) Good afternoon, everybody. We're about to start. 00:00:09.281 --> 00:00:11.416 I'm presenting you John Samuel 00:00:11.416 --> 00:00:17.207 who works at the French engineering school CPE, 00:00:17.207 --> 00:00:19.658 based in Lyon in France. 00:00:19.658 --> 00:00:21.101 And he will tell us something more 00:00:21.101 --> 00:00:27.271 about the translation of properties in Wikidata. 00:00:27.271 --> 00:00:29.604 As you know, as is the case in all sessions, 00:00:29.604 --> 00:00:32.172 there is an etherpad for collaborative note-taking. 00:00:32.172 --> 00:00:34.904 Please don't forget that. 00:00:34.904 --> 00:00:36.302 We'll have the presentation 00:00:36.302 --> 00:00:39.988 and then we'll have some time for a short Q&A. 00:00:39.988 --> 00:00:42.051 - The floor is yours. - (John) Thanks, [inaudible]. 00:00:42.917 --> 00:00:45.114 Thank you all for coming here. 00:00:45.114 --> 00:00:50.257 So my talk is about analyzing translation of Wikidata properties. 00:00:50.257 --> 00:00:52.743 So just give you a quick outline. 00:00:52.743 --> 00:00:54.859 I would like to introduce this topic. 00:00:54.859 --> 00:00:58.756 I will present a tool that I developed some years before, 00:00:58.756 --> 00:01:01.446 called WDProp, which I'm continuously working, 00:01:01.446 --> 00:01:03.795 and based on the feedback from the community, 00:01:03.795 --> 00:01:05.319 I add new features. 00:01:05.319 --> 00:01:09.368 And then I will talk about something called coarser analysis, 00:01:09.368 --> 00:01:12.476 where I would like to look at the property translation, 00:01:12.476 --> 00:01:15.257 from a much larger picture. 00:01:15.257 --> 00:01:18.667 So I will talk about how we collected this data, 00:01:18.667 --> 00:01:23.002 because this work is also done with one of my students, Thibaut Chamard. 00:01:23.002 --> 00:01:26.682 And then I will present some results, and finally, I will conclude the talk. 
00:01:27.469 --> 00:01:30.982 So Wikidata, as you all know, it started in 2012, 00:01:30.982 --> 00:01:33.877 and it's a free, open, linked, structured, collaborative, 00:01:33.877 --> 00:01:36.010 and multilingual knowledge base. 00:01:36.910 --> 00:01:40.063 My focus today is on the multilingual part, 00:01:40.063 --> 00:01:42.979 because there is a big change from the traditional way 00:01:42.979 --> 00:01:45.412 of how we used to edit on Wikipedia site. 00:01:45.412 --> 00:01:47.917 There were multiple subdomains, 00:01:47.917 --> 00:01:50.753 and now you'll have a single domain on a Wikidata 00:01:50.753 --> 00:01:56.191 where multilingual contributors come and write or create articles. 00:01:56.191 --> 00:01:57.499 So this is a collaborative. 00:01:57.499 --> 00:02:00.585 There has been work to say what exactly is collaborative, 00:02:00.585 --> 00:02:02.441 why it is collaborative. 00:02:02.441 --> 00:02:04.597 I have given references for these works. 00:02:04.597 --> 00:02:07.254 So this is, if you see Wikidata, 00:02:07.254 --> 00:02:11.057 everything that starts is starting from the property. 00:02:11.057 --> 00:02:14.144 The property is proposed and then discussed and voted. 00:02:14.144 --> 00:02:17.471 And then it is created and finally translated, 00:02:17.471 --> 00:02:20.005 and then you are finally able to use these properties. 00:02:20.005 --> 00:02:22.010 But these properties may also be deleted-- 00:02:22.010 --> 00:02:24.019 there's also something called deletion. 00:02:24.019 --> 00:02:26.700 But, as I highlighted on this slide, 00:02:26.700 --> 00:02:28.856 my focus is on the multilingual aspect, 00:02:28.856 --> 00:02:32.671 and the property creation and translation point of view. 00:02:32.671 --> 00:02:36.408 So you have been here for the past two days, 00:02:36.408 --> 00:02:40.095 and by this time you have seen many articles, 00:02:40.095 --> 00:02:46.029 and I just want to point what am I looking for on a Wikidata item. 
00:02:46.029 --> 00:02:48.005 This is a Wikidata item, 00:02:48.005 --> 00:02:51.697 so you have this Q2841, which is Bogotá, 00:02:51.697 --> 00:02:55.597 which is the capital city of Colombia, 00:02:55.597 --> 00:02:57.389 and you have four parts here: 00:02:57.389 --> 00:03:00.678 the languages, the labels, the description, and aliases. 00:03:00.678 --> 00:03:02.255 So you can see, for different languages 00:03:02.255 --> 00:03:05.089 you'll have the label, you have the description 00:03:05.089 --> 00:03:10.970 as well as if there any aliases also known as, you could see them. 00:03:10.970 --> 00:03:14.180 And this, under the city, where you see the labels 00:03:14.180 --> 00:03:16.155 and the properties together. 00:03:16.155 --> 00:03:20.845 This is Avignon, a city in France. 00:03:20.845 --> 00:03:24.966 So what I'm interested in is only the properties part. 00:03:24.966 --> 00:03:30.638 For example, official name, native label, country, capital of, et cetera. 00:03:30.638 --> 00:03:34.310 So when I say property, for example, if a country, 00:03:34.310 --> 00:03:37.736 in this country, I'm looking at different aspects: 00:03:37.736 --> 00:03:39.986 the language, the label, and the description, 00:03:39.986 --> 00:03:42.670 and see how things change. 00:03:42.670 --> 00:03:44.446 For example, if you take instance of-- 00:03:44.446 --> 00:03:48.932 okay, everybody knows instance of, you have been using it quite a lot-- 00:03:48.932 --> 00:03:54.089 this is P31, you see the number of aliases in English 00:03:54.089 --> 00:03:58.667 for the property P31 in instance of, 00:03:58.667 --> 00:04:03.686 and then you would find that these types of properties 00:04:03.686 --> 00:04:07.536 are created after discussion with the community. 
00:04:07.536 --> 00:04:10.513 So if I take the complete prop-- the procedure, 00:04:10.513 --> 00:04:13.343 what happens to creation of properties-- 00:04:13.343 --> 00:04:17.347 you start proposing properties with some possible translation. 00:04:17.347 --> 00:04:19.388 It is important it's not just in English. 00:04:19.388 --> 00:04:23.734 You have the templates to suggest your properties 00:04:23.734 --> 00:04:25.129 in your local language. 00:04:25.129 --> 00:04:28.552 So that's why it's a proposition with possible translation. 00:04:28.552 --> 00:04:32.367 And then you put it to discussion, then you are put to voting, 00:04:32.367 --> 00:04:37.273 and it's created, and then finally, the community members start translating it 00:04:37.273 --> 00:04:38.976 and people put it into use. 00:04:38.976 --> 00:04:42.336 But then you cannot be guaranteed the properties that are created 00:04:42.336 --> 00:04:44.435 are always there forever. 00:04:44.435 --> 00:04:47.417 Properties can be deleted, just like items can be deleted. 00:04:47.417 --> 00:04:51.004 But then, again, it goes through a similar procedure. 00:04:51.004 --> 00:04:54.727 You put the property 00:04:54.727 --> 00:04:58.427 as you propose that it should be deleted, 00:04:58.427 --> 00:05:02.424 and if the community decides it, it votes it, and then if it is decided-- 00:05:02.424 --> 00:05:05.238 the majority votes has decided to delete it-- 00:05:05.238 --> 00:05:09.191 we deprecate the property, and finally we delete this property. 00:05:09.191 --> 00:05:14.826 So for today's talk, I'm mostly interested for the translation part. 00:05:14.826 --> 00:05:17.004 So where are the translations happening? 
00:05:17.004 --> 00:05:20.037 First, the translation would happen at the proposition part, 00:05:20.037 --> 00:05:22.778 and then you could find that, at the time of creation, 00:05:22.778 --> 00:05:27.917 the person who creates the property can use the exact names 00:05:27.917 --> 00:05:31.062 that were suggested by the property proposer 00:05:31.062 --> 00:05:34.753 and he or she will create the properties, 00:05:34.753 --> 00:05:38.705 and later, you start translating these properties. 00:05:38.705 --> 00:05:43.176 So let us look at why this matters, why it is important. 00:05:43.176 --> 00:05:44.909 So I put some examples. 00:05:44.909 --> 00:05:47.162 This is, again, on P31, 00:05:47.162 --> 00:05:51.762 instance of the very, very famous property P31, 00:05:51.762 --> 00:05:56.094 and you see there is no description for this item. 00:05:56.094 --> 00:06:00.876 There are almost six descriptions on this image, 00:06:00.876 --> 00:06:03.310 where we do not have any description. 00:06:03.310 --> 00:06:06.961 Again, some more description for Odia and Punjabi, 00:06:06.961 --> 00:06:07.970 there is no description. 00:06:07.970 --> 00:06:10.806 This is a property which is used quite a lot, 00:06:10.806 --> 00:06:13.820 and you see that there is no description for it. 00:06:13.820 --> 00:06:17.876 And there is a surprising part that you could also have cases 00:06:17.876 --> 00:06:22.000 where there are descriptions, but there are no labels. 00:06:22.000 --> 00:06:25.293 For example, Ruffian, that has been shown here, 00:06:25.293 --> 00:06:30.116 again on property P31, there is a label that is missing. 00:06:30.116 --> 00:06:34.100 So this was the initial inspiration for this work 00:06:34.100 --> 00:06:37.486 when I started working on property analysis. 
00:06:37.486 --> 00:06:44.272 I wanted to look at what aspects of properties, 00:06:44.272 --> 00:06:46.459 or what aspects of property 00:06:46.459 --> 00:06:49.569 that the whole flow chart that we have seen, 00:06:49.569 --> 00:06:51.316 is multilingual. 00:06:51.316 --> 00:06:53.048 So I wanted to look at, 00:06:53.048 --> 00:06:56.304 okay, we know that Wikidata is multilingual, 00:06:56.304 --> 00:06:58.984 and it's collaborative, that has been done. 00:06:58.984 --> 00:07:05.285 But are we really able to achieve a truly multilingual experience? 00:07:05.285 --> 00:07:09.054 That was the question behind the creation of WDProp. 00:07:09.054 --> 00:07:11.166 So you may ask why there are so many people 00:07:11.166 --> 00:07:14.600 who have worked on items, there are people who have worked on-- 00:07:14.600 --> 00:07:17.047 users, multilingual users and bots, et cetera, 00:07:17.047 --> 00:07:19.444 why you want to focus on properties? 00:07:19.444 --> 00:07:22.770 The answer is, I want to focus on properties 00:07:22.770 --> 00:07:25.738 because they are far less influenced by bots. 00:07:25.738 --> 00:07:28.581 You may have heard today or yesterday, 00:07:28.581 --> 00:07:31.895 many people said, "Okay, if you have translation 00:07:31.895 --> 00:07:36.761 in your local languages, and it has reached a very good number, 00:07:36.761 --> 00:07:39.227 you should ensure what type of translation it is. 00:07:39.227 --> 00:07:44.339 Is it just bots, which copy the name of a person to another language? 00:07:44.339 --> 00:07:47.242 Then is it really translation?" 00:07:47.242 --> 00:07:48.413 Okay, that's debatable. 00:07:48.413 --> 00:07:51.365 But, of course, there is an influence by bots, 00:07:51.365 --> 00:07:54.811 but in case of properties, there is not so much influence by bots, 00:07:54.811 --> 00:07:55.913 and that is a good part. 00:07:55.913 --> 00:08:00.706 That's why I focus on the properties part.
00:08:00.706 --> 00:08:05.552 So, as I said, when WDProp was created, 00:08:05.552 --> 00:08:09.451 it was to understand every aspect-- the proposal, the creation, translation. 00:08:09.451 --> 00:08:12.326 What are the templates that are available. 00:08:12.326 --> 00:08:16.232 Are these templates, for example, you said support, 00:08:16.232 --> 00:08:21.875 if a French person opens Wikidata, a Wikidata France translation page, 00:08:21.875 --> 00:08:28.039 can he see the word, [soutien], for that particular property proposal? 00:08:28.039 --> 00:08:29.373 Is it possible? 00:08:29.373 --> 00:08:33.125 So this type of things was needed. 00:08:33.125 --> 00:08:35.987 In the end, it was also about giving real-time statistics 00:08:35.987 --> 00:08:37.741 to the multilingual contributors. 00:08:37.741 --> 00:08:38.783 It's not about one time, 00:08:38.783 --> 00:08:42.178 it's like you just made it and published for one time-- no. 00:08:42.178 --> 00:08:45.434 You want people to get this data in real time. 00:08:45.434 --> 00:08:46.716 So what are we doing? 00:08:46.716 --> 00:08:52.065 So the goal of WDProp was to understand everything 00:08:52.065 --> 00:08:54.418 about Wikidata properties. 00:08:54.418 --> 00:08:56.955 So, label, aliases, description. 00:08:56.955 --> 00:09:01.348 So you have got all these three translated so the middle part where you say, 00:09:01.348 --> 00:09:05.618 this property is completely usable because all the three aspects 00:09:05.618 --> 00:09:08.984 have been translated. 00:09:08.984 --> 00:09:12.055 So let me just show you quickly, what is this WDProp, 00:09:12.055 --> 00:09:14.224 what I'm talking about. 00:09:14.224 --> 00:09:15.496 So this is the WDProp, 00:09:15.496 --> 00:09:19.726 it's available on tools.wmflabs.org/wdprop/. 
00:09:19.726 --> 00:09:23.813 So you have a lot of statistics and if I ask you some questions today, 00:09:23.813 --> 00:09:27.960 like, for example, "How many data types are there 00:09:27.960 --> 00:09:30.846 that are supported by Wikidata right now?" 00:09:30.846 --> 00:09:34.369 So with such questions, we do not know, 00:09:34.369 --> 00:09:37.549 sometimes because there are new data types that keep on coming. 00:09:37.549 --> 00:09:41.668 So this data, this is generated in real time, 00:09:41.668 --> 00:09:44.993 this creates the data structure and it will give you the answer. 00:09:44.993 --> 00:09:46.486 How many languages are there? 00:09:46.486 --> 00:09:50.194 Yes, of course, see that there are 313 languages. 00:09:50.194 --> 00:09:55.092 And then, for example, how many labels were translated. 00:09:55.092 --> 00:09:58.694 So you could see that the data is being fetched. 00:09:58.694 --> 00:10:00.242 I hope it comes. 00:10:01.512 --> 00:10:03.003 Okay, let's hope. (chuckles) 00:10:07.984 --> 00:10:11.621 Okay, I will take some other stuff as well. 00:10:11.621 --> 00:10:13.964 Browsing all properties by their time. 00:10:13.964 --> 00:10:17.079 Yes. So you see, this is count of translated labels, 00:10:17.079 --> 00:10:20.142 and you see all this data that is coming in real time, 00:10:20.142 --> 00:10:21.781 and you can see that the labels 00:10:21.781 --> 00:10:26.881 are currently available: 6,804 in English, 00:10:26.881 --> 00:10:31.291 followed by Dutch, followed by Arabic, followed by Ukrainian, and then French. 00:10:31.291 --> 00:10:32.922 So this is real-time statistics. 00:10:32.922 --> 00:10:35.446 So you could also do the same for description, 00:10:35.446 --> 00:10:37.747 also do for aliases, et cetera. 00:10:37.747 --> 00:10:41.383 And you could get the overall translation statuses if you want. 00:10:41.383 --> 00:10:43.937 So there are some other things that we will discuss later, 00:10:43.937 --> 00:10:45.586 if time permits.
00:10:45.586 --> 00:10:50.132 But you could navigate all the different items 00:10:50.132 --> 00:10:52.367 on the left-hand side, 00:10:52.367 --> 00:10:54.127 and you could see there are a lot of things 00:10:54.127 --> 00:10:59.471 that could really help to see what things are happening in WDProp. 00:10:59.471 --> 00:11:03.591 So this is, for example, Wikidata properties, 00:11:03.591 --> 00:11:05.789 these are the properties that are currently available. 00:11:05.789 --> 00:11:10.039 But as I said some time back, properties could be deleted. 00:11:10.039 --> 00:11:13.121 And this, you see that these are the properties that were deleted, 00:11:13.121 --> 00:11:17.171 starting from P1, P2, P3, P4, P5, these have all been deleted, 00:11:17.171 --> 00:11:23.005 and you could get this thing just from the statistics board. 00:11:23.005 --> 00:11:24.947 And here, so same thing. 00:11:24.947 --> 00:11:29.938 Then, the next thing that interested me was to understand the translation pattern. 00:11:29.938 --> 00:11:33.388 So, for example, sometimes we feel that some languages-- 00:11:33.388 --> 00:11:36.514 so English is created first, and followed by maybe Dutch, 00:11:36.514 --> 00:11:38.201 or maybe French, 00:11:38.201 --> 00:11:40.701 and maybe after French, it could be Arabic. 00:11:40.701 --> 00:11:43.627 So these things could be interesting to know. 00:11:43.627 --> 00:11:48.596 So for that, we started to look at the idea of translation path-- 00:11:48.596 --> 00:11:51.607 exactly how things are translated. 00:11:51.607 --> 00:11:56.542 So again, if you go to the property page, you could click on any property. 00:11:56.542 --> 00:11:57.662 Sorry. 00:11:59.375 --> 00:12:01.053 Maybe I can show. 00:12:03.527 --> 00:12:06.497 So you could click on any property and you could just say, 00:12:06.497 --> 00:12:07.794 "Give me the translation path." 
00:12:07.794 --> 00:12:11.487 It takes some time, but it will start bringing the data, 00:12:11.487 --> 00:12:15.434 because it's real time, so you get the data coming from all this. 00:12:15.434 --> 00:12:16.595 So you get the date, 00:12:16.595 --> 00:12:22.244 you get what things have been changed, when was something deleted, et cetera. 00:12:22.244 --> 00:12:23.848 Why it is important? 00:12:24.948 --> 00:12:29.401 For example, you see this is something that happened in 2017, 00:12:29.401 --> 00:12:31.955 and the label has been removed. 00:12:31.955 --> 00:12:33.893 This is the official website. 00:12:33.893 --> 00:12:38.944 So imagine you have removed the label from the official website-- 00:12:38.944 --> 00:12:39.978 sorry, this country-- 00:12:39.978 --> 00:12:43.357 so anybody who doesn't know P17, what it is, cannot even understand, 00:12:43.357 --> 00:12:45.971 because the label has been deleted by the person. 00:12:45.971 --> 00:12:47.915 So this type of vandalism exists. 00:12:47.915 --> 00:12:50.710 Another example where, completely, 00:12:50.710 --> 00:12:52.601 all the language labels have been deleted-- 00:12:52.601 --> 00:12:56.183 English, French, Spanish, German, everything has been deleted. 00:12:56.183 --> 00:12:58.329 There are no labels, there are no descriptions. 00:12:58.329 --> 00:13:01.033 So you could find these types of things from the translation path 00:13:01.033 --> 00:13:05.483 and just because of the color code, you could see what happened on what day, 00:13:05.483 --> 00:13:09.666 and you could check exactly, because it is also linked. 00:13:09.666 --> 00:13:14.261 If you click on any of this, you could also get a link to the revision, 00:13:14.261 --> 00:13:19.478 identify what exactly happened during that particular revision. 00:13:19.478 --> 00:13:21.309 So this is coming from revision history. 
00:13:21.309 --> 00:13:25.311 So if you click on any of this, you get what exactly is happening 00:13:25.311 --> 00:13:28.567 in any particular revision. 00:13:28.567 --> 00:13:30.733 So how did we build it? 00:13:30.733 --> 00:13:31.923 Just if you come back, 00:13:31.923 --> 00:13:38.396 here, you see there is something called a comment on the right-hand side. 00:13:38.396 --> 00:13:42.602 You see there is something called added aliases, 00:13:42.602 --> 00:13:46.613 "added British English aliases," "changed Esperanto label," 00:13:46.613 --> 00:13:48.109 "added [io] label," et cetera. 00:13:48.109 --> 00:13:50.710 So we made use of this information, 00:13:50.710 --> 00:13:53.209 for example, for label description and aliases, 00:13:53.209 --> 00:13:55.507 if you add something, you have some sort of comment 00:13:55.507 --> 00:13:58.216 which starts with wbsetlabel-add. 00:13:58.216 --> 00:14:01.635 Or if it is updated, you have wbsetlabel-set. 00:14:01.635 --> 00:14:04.487 And if you remove something, you see it is removed. 00:14:04.487 --> 00:14:06.795 And based on this type of information, 00:14:06.795 --> 00:14:11.167 we were able to build such a translation path. 00:14:11.167 --> 00:14:16.557 Okay, this is good, but what happened is that this type of information, 00:14:16.557 --> 00:14:19.366 this type of things, just using the comment, 00:14:19.366 --> 00:14:23.932 it is useful for building real-time tools, just like what I showed before, WDProp, 00:14:23.932 --> 00:14:30.886 but it is very difficult to detect when there are multiple changes. 00:14:30.886 --> 00:14:34.871 For example, if you have seen bots activity on Wikidata, 00:14:34.871 --> 00:14:39.550 some bots make multiple labels in one single edit. 00:14:39.550 --> 00:14:42.037 In that case, you cannot find what happened 00:14:42.037 --> 00:14:45.878 because you do not have wbsetlabel, that particular language. 
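The comment-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the WDProp implementation: it assumes that single-term edit summaries look like `/* wbsetlabel-add:1|fr */ pays`, and the names `SUMMARY_RE`, `TranslationEvent`, and `parse_summary` are hypothetical. It also shows the limitation mentioned in the talk: a summary that carries no language code yields nothing.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout of a single-term edit summary: "/* wbsetlabel-add:1|fr */ pays".
# The exact layout is an assumption made for this sketch.
SUMMARY_RE = re.compile(
    r"/\*\s*wbset(?P<kind>label|description|aliases)"
    r"-(?P<action>add|set|remove|update)"
    r":\d+\|(?P<lang>[A-Za-z0-9-]+)\s*\*/"
)


class TranslationEvent(NamedTuple):
    """One step of a translation path (hypothetical record type)."""
    kind: str      # "label", "description" or "aliases"
    action: str    # "add", "set", "remove" or "update"
    language: str  # language code taken from the summary, e.g. "fr"


def parse_summary(comment: str) -> Optional[TranslationEvent]:
    """Return the event encoded in an edit summary, or None if the
    summary does not name a single term and language."""
    match = SUMMARY_RE.search(comment)
    if match is None:
        return None
    return TranslationEvent(match["kind"], match["action"], match["lang"])
```

A summary such as `/* wbeditentity-update:0| */`, which is the kind of comment a multi-label bot edit may leave, returns `None` here, so the languages touched by that edit cannot be recovered from the comment alone.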
00:14:45.878 --> 00:14:49.254 So you do not have a set of languages along with your comment. 00:14:49.254 --> 00:14:53.703 So these are some problems if you want to use this type of approach. 00:14:54.603 --> 00:14:58.245 So what we did, we decided to collect the data, 00:14:58.245 --> 00:15:01.316 and we decided to publicly make this data available. 00:15:02.516 --> 00:15:06.246 And what we did, we wanted to make use of content. 00:15:06.246 --> 00:15:08.579 So what we did, we started with every revision, 00:15:08.579 --> 00:15:12.096 and we took the content of each revision. 00:15:12.096 --> 00:15:16.717 And we took the next revision, and we decided to find the difference 00:15:16.717 --> 00:15:19.885 between these two revisions, to find what exactly changes, 00:15:19.885 --> 00:15:21.822 which of the labels got changed. 00:15:21.822 --> 00:15:25.436 Because of that, we got much more interesting information, 00:15:25.436 --> 00:15:28.899 much more accurate information than the previous approach 00:15:28.899 --> 00:15:31.274 because it is very important for doing analysis. 00:15:31.274 --> 00:15:34.020 It is important that you make use of correct data. 00:15:34.020 --> 00:15:36.866 So you have four columns that were used here-- 00:15:36.866 --> 00:15:39.091 timestamp, property, language, type, et cetera. 00:15:39.091 --> 00:15:44.494 And you get this data in this format. It is publicly available. 00:15:44.494 --> 00:15:47.446 So what does this data give me? 00:15:47.446 --> 00:15:48.791 This data gives me information 00:15:48.791 --> 00:15:54.791 that currently almost 4,000 plus, 00:15:54.791 --> 00:15:57.291 4,500 properties 00:15:57.291 --> 00:15:59.917 have labels between 0 and 20. 00:15:59.917 --> 00:16:02.145 So there are a lot of properties 00:16:02.145 --> 00:16:07.107 who do not have more than 20 multilingual labels. 00:16:07.107 --> 00:16:10.888 And there are only 1,500 language properties 00:16:10.888 --> 00:16:12.857 that have been translated up to 40. 
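The content-based approach, comparing each revision with the next one, might look like the sketch below. It assumes each revision's content is entity JSON with a `labels` map of the form `{"en": {"language": "en", "value": "..."}}`; the function names and the exact row layout are hypothetical, loosely modelled on the timestamp, property, language, and type columns mentioned in the talk.

```python
from typing import Dict, List, Tuple


def labels_of(entity: dict) -> Dict[str, str]:
    """Flatten the assumed entity JSON 'labels' map to {language: text}."""
    return {lang: term["value"] for lang, term in entity.get("labels", {}).items()}


def diff_terms(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two {language: text} maps and list what changed per language."""
    changes = []
    for lang in sorted(set(old) | set(new)):
        if lang not in old:
            changes.append((lang, "add"))
        elif lang not in new:
            changes.append((lang, "remove"))
        elif old[lang] != new[lang]:
            changes.append((lang, "update"))
    return changes


def label_rows(prop: str, timestamp: str, old_entity: dict, new_entity: dict):
    """Emit one (timestamp, property, language, change) row per changed label."""
    return [
        (timestamp, prop, lang, change)
        for lang, change in diff_terms(labels_of(old_entity), labels_of(new_entity))
    ]
```

Because the comparison works on content rather than on the comment, an edit that changes several labels at once produces one row per language, which is the case the comment-based approach cannot handle.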
00:16:12.857 --> 00:16:18.699 And yesterday, if you were present during the talk of Lydia Pintscher, 00:16:18.699 --> 00:16:21.967 she talked about P18, so P18 is something here. 00:16:21.967 --> 00:16:25.332 So you can see there are only a couple of six or seven properties 00:16:25.332 --> 00:16:30.147 that are currently having all the-- 00:16:30.147 --> 00:16:35.092 P18 has 154 translations, just to give that idea. 00:16:35.092 --> 00:16:39.913 So there is one property which is having 154 multilingual labels. 00:16:39.913 --> 00:16:43.807 There are properties which have only one particular label. 00:16:43.807 --> 00:16:50.112 And the average number of labels is only 21, 00:16:50.112 --> 00:16:52.945 and the standard deviation is 20. 00:16:52.945 --> 00:16:55.967 Okay, what next we would like to say? 00:16:55.967 --> 00:16:59.970 So you have seen something similar in the real-time data. 00:16:59.970 --> 00:17:02.079 This is from the collected data. 00:17:02.079 --> 00:17:07.503 So this is what are the top languages that are coming up in the results. 00:17:07.503 --> 00:17:09.186 So these we have seen. 00:17:09.186 --> 00:17:13.314 But my next point is, are there combinations possible. 00:17:13.314 --> 00:17:16.522 For example, if there is French, there is Arabic. 00:17:16.522 --> 00:17:19.505 If there is Arabic, there is some other language. 00:17:19.505 --> 00:17:22.102 If there's French, there's Ukrainian, et cetera. 00:17:22.102 --> 00:17:26.093 Can we find such type of combinations in the translation data set? 00:17:26.093 --> 00:17:27.415 So, yes, it is possible. 00:17:27.415 --> 00:17:30.195 So if you see this count, this frequent itemsets-- 00:17:30.195 --> 00:17:32.134 so I've just shown seven of them-- 00:17:32.134 --> 00:17:35.315 you find that there are combinations that are possible. 
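The idea of frequent language combinations can be illustrated with a brute-force support count. The talk does not say which itemset-mining algorithm was used, so this is only a sketch of the concept: for every property, enumerate all language combinations of a given size and keep those that occur in at least `min_support` properties. The function name and the toy data are made up for illustration and are not real Wikidata counts.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def frequent_language_sets(
    labels_by_property: Dict[str, Set[str]], size: int, min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Count how many properties carry labels in every language of each
    combination of the given size; keep combinations meeting min_support."""
    counts: Counter = Counter()
    for languages in labels_by_property.values():
        # Sorting makes ("en", "nl") and ("nl", "en") the same key.
        for combo in combinations(sorted(languages), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Toy data: which languages have a label for each property.
toy = {
    "P31": {"en", "nl", "fr"},
    "P17": {"en", "nl"},
    "P18": {"en", "ar"},
}
pairs = frequent_language_sets(toy, size=2, min_support=2)
```

With this toy input, only the English and Dutch pair reaches a support of 2. Brute-force enumeration grows quickly with the number of languages per property, so a real analysis over hundreds of languages would need a pruning algorithm instead.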
00:17:36.901 --> 00:17:41.397 Okay, let us say, is there a possibility of having four labels, 00:17:41.397 --> 00:17:44.313 like if there is English, there's also possibility to find Dutch, 00:17:44.313 --> 00:17:45.794 Arabic, Ukrainian. 00:17:45.794 --> 00:17:48.041 If there is English, there's possibility to find Dutch, 00:17:48.041 --> 00:17:49.798 French, and Arabic, et cetera. 00:17:49.798 --> 00:17:52.763 You can also find a lot of combinations. 00:17:52.763 --> 00:17:53.907 Why it is important? 00:17:53.907 --> 00:17:57.432 Because it is important to know if, 00:17:57.432 --> 00:17:59.998 for example, if you have multilingual speakers 00:17:59.998 --> 00:18:03.664 who are contributors, who can speak multiple languages, 00:18:03.664 --> 00:18:07.402 if you're able to find any particular pattern 00:18:07.402 --> 00:18:12.556 that helps us to find that if you tell this person to translate, 00:18:12.556 --> 00:18:15.276 a new property is created to translate this label, 00:18:15.276 --> 00:18:19.213 because he already speaks multiple languages, 00:18:19.213 --> 00:18:21.669 we can suggest these things to the user. 00:18:21.669 --> 00:18:24.858 So let's just show you one example. 00:18:24.858 --> 00:18:27.257 This is a complete translation path 00:18:27.257 --> 00:18:29.774 that has obtained from different languages. 00:18:29.774 --> 00:18:35.001 So here, what we have done is we selected two small minority languages, 00:18:35.001 --> 00:18:39.293 like Tagalog and Kapampangan, 00:18:39.293 --> 00:18:42.602 which are minority languages from Philippines, 00:18:42.602 --> 00:18:46.156 and you see that there is a strong transfer 00:18:46.156 --> 00:18:49.645 between Tagalog and Kapampangan. 00:18:49.645 --> 00:18:51.784 So these types of things can be detected 00:18:51.784 --> 00:18:54.738 when you have such type of translation results. 00:18:54.738 --> 00:18:57.311 So that is another advantage. 
00:18:57.311 --> 00:18:59.780 To conclude my work, I would like to say, 00:18:59.780 --> 00:19:05.128 this is important that we understand how properties are translated 00:19:05.128 --> 00:19:10.534 because if you want to extract data from Wikipedia, 00:19:10.534 --> 00:19:14.661 you need to know what are the words 00:19:14.661 --> 00:19:16.491 in the local languages that are being used. 00:19:16.491 --> 00:19:20.208 What is "image" in French, what is "image" in Punjabi, 00:19:20.208 --> 00:19:22.539 what is "image" in Hindi, or any other language. 00:19:22.539 --> 00:19:25.890 So that is important for importing data. 00:19:25.890 --> 00:19:30.023 And tomorrow, of course, if you are able to fetch this data, 00:19:30.023 --> 00:19:35.193 to Wikidata, we could also use new projects like Wikidata Bridge, 00:19:35.193 --> 00:19:38.963 which we could use to fill other info boxes, 00:19:38.963 --> 00:19:44.563 like multilingual Wikipedia articles, 00:19:44.563 --> 00:19:47.370 and this could be really helpful. 00:19:47.370 --> 00:19:51.238 So with that, I would like to thank you, and if you have questions, 00:19:51.238 --> 00:19:54.321 I would be happy to answer them. 00:19:55.131 --> 00:19:57.218 (moderator) Anybody with questions? 00:19:58.842 --> 00:20:01.854 (audience applause) 00:20:08.387 --> 00:20:09.479 Yes? 00:20:11.988 --> 00:20:15.746 (man) So what you're doing is mainly analyzing how this-- 00:20:15.746 --> 00:20:17.389 - (John) Yes. - (man) ...is all happening? 00:20:17.389 --> 00:20:21.418 Do you know if there are initiatives or if there are tools 00:20:21.418 --> 00:20:25.331 which can help make this easier, like translation of properties? 00:20:25.331 --> 00:20:28.321 Yes. Tools, like, for example, what to translate 00:20:28.321 --> 00:20:32.995 from Wikimedia Foundation, is helpful, but I have not seen-- 00:20:32.995 --> 00:20:35.522 This is not currently integrated with Wikidata.
00:20:35.522 --> 00:20:41.672 What to translate is only integrated with certain languages on Wikipedia, 00:20:41.672 --> 00:20:44.485 but not on Wikidata. 00:20:44.485 --> 00:20:46.460 But that could be really interesting. 00:20:46.460 --> 00:20:50.165 Yes, thank you for bringing this up, because just imagine, 00:20:50.165 --> 00:20:54.490 if we know that a person has been labeling in multiple languages, 00:20:54.490 --> 00:20:56.842 and we also have this what to translate tool, 00:20:56.842 --> 00:21:00.007 and we have these statistics, we have this data 00:21:00.007 --> 00:21:04.657 coming from this type of property translation, 00:21:04.657 --> 00:21:09.423 it is easier to suggest to a person that new properties have been created, 00:21:09.423 --> 00:21:11.461 and then you could-- 00:21:11.461 --> 00:21:13.980 Right now it's not integrated to Wikidata. 00:21:15.674 --> 00:21:17.432 (moderator) Anybody else? 00:21:20.246 --> 00:21:23.315 (man 2) I have one question myself, that comes back to it, 00:21:23.315 --> 00:21:27.748 does anybody know of working lists on translating properties? 00:21:27.748 --> 00:21:28.769 Sorry? 00:21:28.769 --> 00:21:30.489 (man 2) Does anybody know of working lists 00:21:30.489 --> 00:21:31.695 about translating properties, 00:21:31.695 --> 00:21:37.751 like, I can imagine from your statistics, you could say, this is the top 100 00:21:37.751 --> 00:21:39.944 most widely used properties 00:21:39.944 --> 00:21:42.844 who lack translations in this and this language? 00:21:42.844 --> 00:21:47.494 No, there is, I think, there are ways by, 00:21:47.494 --> 00:21:51.112 for example, you could browse by data types, 00:21:51.112 --> 00:21:53.843 browse by property classes. 
00:21:53.843 --> 00:21:57.398 For example, here is something called property classes 00:21:57.398 --> 00:22:00.743 where people have created projects-- 00:22:00.743 --> 00:22:03.272 it's taking time-- so you have projects, 00:22:03.272 --> 00:22:08.597 and you could say, how would I describe, what are the, for example, 00:22:08.597 --> 00:22:11.978 what are the properties that I could describe for this, 00:22:11.978 --> 00:22:14.183 for describing IEEE standard version? 00:22:14.183 --> 00:22:16.846 You need edition number, you need edition translation, et cetera. 00:22:16.846 --> 00:22:22.890 So if you have a targeted thing, you could search for what type of classes. 00:22:22.890 --> 00:22:25.853 For example, if you're working in GLAM or histories, 00:22:25.853 --> 00:22:29.652 you could say, what is history-related any document are there? 00:22:29.652 --> 00:22:32.715 So you could say, historical, and you could find historical. 00:22:32.715 --> 00:22:36.247 Okay, this is a property class, go to this property class. 00:22:36.247 --> 00:22:37.855 And, sorry, where is it? 00:22:37.855 --> 00:22:40.437 So it is having something called "Merimee ID." 00:22:40.437 --> 00:22:44.467 So people have been trying to use property classes 00:22:44.467 --> 00:22:45.913 to link objects. 00:22:45.913 --> 00:22:49.577 That helps if you're working on a particular project, 00:22:49.577 --> 00:22:52.342 and you could find that property's related to that. 00:22:52.342 --> 00:22:58.246 (man 2) But your tool could quite easily make a list of, let's say, 00:22:58.246 --> 00:23:02.746 the top 100 most widely used properties 00:23:02.746 --> 00:23:07.488 who haven't got, I don't know, Punjabi label, let's say? 00:23:07.488 --> 00:23:10.284 - (John) For that, I will just-- - (man 2) Which could be interesting. 00:23:10.284 --> 00:23:14.310 (John) Okay, tell me any language, for example, let us say, Netherlands, 00:23:14.310 --> 00:23:17.456 because it's performing very well. 
00:23:17.456 --> 00:23:21.861 So I would say-- translated labels. 00:23:21.861 --> 00:23:24.011 So this is translate-- sorry. 00:23:30.491 --> 00:23:33.059 (mouse clicking) 00:23:36.747 --> 00:23:38.697 For example, Hindi. 00:23:38.697 --> 00:23:40.497 So here, what happens, 00:23:40.497 --> 00:23:44.335 here you just see any properties that need translation. 00:23:44.335 --> 00:23:47.473 So there are like 6,647 properties 00:23:47.473 --> 00:23:50.299 that need translation in a particular language. 00:23:50.299 --> 00:23:54.998 So you could click on any language that you want and get the data. 00:23:54.998 --> 00:23:58.778 And you could get the list of where people need support. 00:23:58.778 --> 00:24:03.345 So, this could be interesting to link with property usage, 00:24:03.345 --> 00:24:06.232 how many people, is it really top, is it under the top ten. 00:24:06.232 --> 00:24:08.871 So suggest those ten top hundred, in that language. 00:24:08.871 --> 00:24:11.282 That would be an interesting list. That's good. 00:24:11.852 --> 00:24:13.054 (man 3) Just what you asked, 00:24:13.054 --> 00:24:17.077 there is a list of top 100 most used properties on Wikidata. 00:24:17.077 --> 00:24:18.924 It's on Wikidata. 00:24:18.924 --> 00:24:21.432 So, yeah, it's there, 00:24:21.432 --> 00:24:25.942 under Wikidata Database Reports/ Top 100 Properties. 00:24:25.942 --> 00:24:31.083 So one thing could be that we could just link this and suggest it. 00:24:31.083 --> 00:24:33.349 (moderator) Could you maybe add the link to the etherpad, 00:24:33.349 --> 00:24:37.270 and then maybe, this information can come together. 00:24:37.270 --> 00:24:38.631 (John) Okay. 00:24:40.049 --> 00:24:42.007 (moderator) If there is no other questions, 00:24:42.007 --> 00:24:44.045 then we will conclude here. 00:24:44.045 --> 00:24:49.236 And we have two, three minutes break until we start with the next speaker. 00:24:49.236 --> 00:24:50.864 - Thanks. - (John) Thank you very much. 
00:24:50.864 --> 00:24:53.041 (audience applause)