(moderator) Good afternoon, everybody. We're about to start. I'm presenting you John Samuel who works at the French engineering school CPE, based in Lyon in France. And he will tell us something more about the translation of properties in Wikidata. As you know, as is the case in all sessions, there is an etherpad for collaborative note-taking. Please don't forget that. We'll have the presentation and then we'll have some time for a short Q&A. - The floor is yours. - (John) Thanks, [inaudible]. Thank you all for coming here. So my talk is about analyzing translation of Wikidata properties. So just give you a quick outline. I would like to introduce this topic. I will present a tool that I developed some years before, called WDProp, which I'm continuously working, and based on the feedback from the community, I add new features. And then I will talk about something called coarser analysis, where I would like to look at the property translation, from a much larger picture. So I will talk about how we collected this data, because this work is also done with one of my students, Thibaut Chamard. And then I will present some results, and finally, I will conclude the talk. So Wikidata, as you all know, it started in 2012, and it's a free, open, linked, structured, collaborative, and multilingual knowledge base. My focus today is on the multilingual part, because there is a big change from the traditional way of how we used to edit on Wikipedia site. There were multiple subdomains, and now you'll have a single domain on a Wikidata where multilingual contributors come and write or create articles. So this is a collaborative. There has been work to say what exactly is collaborative, why it is collaborative. I have given references for these works. So this is, if you see Wikidata, everything that starts is starting from the property. The property is proposed and then discussed and voted. 
And then it is created and finally translated, and then you are finally able to use these properties. But these properties may also be deleted-- there's also something called deletion. But, as I highlighted on this slide, my focus is on the multilingual aspect, and the property creation and translation point of view. So you have been here for the past two days, and by this time you have seen many articles, and I just want to point what am I looking for on a Wikidata item. This is a Wikidata item, so you have this Q2841, which is Bogotá, which is the capital city of Colombia, and you have four parts here: the languages, the labels, the description, and aliases. So you can see, for different languages you'll have the label, you have the description as well as if there any aliases also known as, you could see them. And this, under the city, where you see the labels and the properties together. This is Avignon, a city in France. So what I'm interested in is only the properties part. For example, official name, native label, country, capital of, et cetera. So when I say property, for example, if a country, in this country, I'm looking at different aspects: the language, the label, and the description, and see how things change. For example, if you take instance of-- okay, everybody knows instance of, you have been using it quite a lot-- this is P31, you see the number of aliases in English for the property P31 in instance of, and then you would find that these types of properties are created after discussion with the community. So if I take the complete prop-- the procedure, what happens to creation of properties-- you start proposing properties with some possible translation. It is important it's not just in English. You have the templates to suggest your properties in your local language. So that's why it's a proposition with possible translation. 
And then you put it to discussion, then you are put to voting, and it's created, and then finally, the community members start translating it and people put it into use. But then you cannot be guaranteed the properties that are created are always there forever. Properties can be deleted, just like items can be deleted. But then, again, it goes through a similar procedure. You put the property as you propose that it should be deleted, and if the community decides it, it votes it, and then if it is decided-- the majority votes has decided to delete it-- we deprecate the property, and finally we delete this property. So for today's talk, I'm mostly interested for the translation part. So where are the translations happening? First, the translation would happen at the proposition part, and then you could find that, at the time of creation, the person who creates the property can use the exact names that were suggested by the property proposer and he or she will create the properties, and later, you start translating these properties. So let us look at why this matters, why it is important. So I put some examples. This is, again, on P31, instance of the very, very famous property P31, and you see there is no description for this item. There are almost six descriptions on this image, where we do not have any description. Again, some more description for Odia and Punjabi, there is no description. This is a property which is used quite a lot, and you see that there is no description for it. And there is a surprising part that you could also have cases where there are descriptions, but there are no labels. For example, Ruffian, that has been shown here, again on property P31, there is a label that is missing. So this was the initial inspiration for this work when I started working on property analysis. I wanted to look at what aspects of properties, or what aspects of property that the whole flow chart that we have seen, is multilingual. 
So I wanted to look at, okay, we know that Wikidata is multilingual, and it's collaborative, that has been done. But are we really able to achieve a truly multilingual experience? That was the question behind the creation of WDProp. So you may ask, there are so many people who have worked on items, there are people who have worked on-- users, multilingual users and bots, et cetera, why do you want to focus on properties? The answer is, I want to focus on properties because they are very, very little influenced by bots. You may have heard today or yesterday, many people said, "Okay, if you have translation in your local languages, and it has reached a very good number, you should check what type of translation it is. Is it just bots, which copy the name of a person to another language? Then is it really translation?" Okay, that's debatable. But, of course, there is an influence by bots, but in the case of properties, there is not so much influence by bots, and that is a good part. That's why I focus on the [properties] part. So, as I said, when WDProp was created, it was to understand every aspect-- the proposal, the creation, translation. What are the templates that are available? Are these templates, for example, you said support, if a French person opens Wikidata, a Wikidata France translation page, can he see the word, [soutien], for that particular property proposal? Is it possible? So these types of things were needed. In the end, it was also about giving real-time statistics to the multilingual contributors. It's not about one time, it's not like you just made it and published it one time-- no. You want people to get this data in real time. So what are we doing? So the goal of WDProp was to understand everything about Wikidata properties. So, label, aliases, description. So when you have got all three translated, that is the middle part, where you say this property is completely usable because all three aspects have been translated.
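(editor's note) The "completely usable" idea above, a property whose label, description, and aliases are all translated in a given language, can be sketched in a few lines of Python. This is a minimal illustration, not WDProp's code: the function names are hypothetical, the input is assumed to be a dict with "labels", "descriptions", and "aliases" keyed by language code (the shape of Wikidata's entity JSON), and the sample values are made up for the example.

```python
from typing import List, Set


def fully_translated_languages(entity: dict) -> Set[str]:
    """Languages that have a label, a description, and at least one alias."""
    labels = set(entity.get("labels", {}))
    descriptions = set(entity.get("descriptions", {}))
    # An alias entry counts only if its list is non-empty.
    aliases = {lang for lang, values in entity.get("aliases", {}).items() if values}
    return labels & descriptions & aliases


def missing_parts(entity: dict, language: str) -> List[str]:
    """Which of the three parts are absent (or empty) for one language."""
    parts = ("labels", "descriptions", "aliases")
    return [part for part in parts if not entity.get(part, {}).get(language)]


# Illustrative data only, not the real content of any property.
sample = {
    "labels": {
        "en": {"language": "en", "value": "instance of"},
        "fr": {"language": "fr", "value": "nature de l'élément"},
        "pa": {"language": "pa", "value": "(label)"},
    },
    "descriptions": {
        "en": {"language": "en", "value": "(description)"},
        "fr": {"language": "fr", "value": "(description)"},
    },
    "aliases": {
        "en": [{"language": "en", "value": "is a"}],
    },
}

print(fully_translated_languages(sample))
print(missing_parts(sample, "pa"))
```

Here only "en" counts as completely translated; "fr" lacks aliases and "pa" has a label but nothing else.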
So let me just show you quickly, what is this WDProp, what I'm talking about. So this is WDProp, it's available on tools.wmflabs.org/wdprop/. So you have a lot of statistics, and if I ask you some questions today, like, for example, "How many data types are there that are supported by Wikidata right now?" Such questions, we sometimes do not know the answer, because there are new data types that keep on coming. So this data, this is generated in real time, this creates the data structure and it will give you the answer. How many languages are there? Yes, of course, see that there are 313 languages. And then, for example, how many labels were translated. So you could see that the data is being fetched. I hope it comes. Okay, let's hope. (chuckles) Okay, I will take some other stuff as well. Browsing all properties by their time. Yes. So you see, this is the count of translated labels, and you see all this data that is coming in real time, and you can see that 6,804 labels are currently available in English, followed by Dutch, followed by Arabic, followed by Ukrainian, and then French. So this is real-time statistics. So you could also do the same for descriptions, also for aliases, et cetera. And you could get the overall translation statuses if you want. So there are some other things that we will discuss later, if time permits. But you could navigate all the different items on the left-hand side, and you could see there are a lot of things that could really help to see what things are happening in WDProp. So this is, for example, Wikidata properties, these are the properties that are currently available. But as I said some time back, properties could be deleted. And this, you see that these are the properties that were deleted, starting from P1, P2, P3, P4, P5, these have all been deleted, and you could get this thing just from the statistics board. And here, so same thing. Then, the next thing that interested me was to understand the translation pattern.
So, for example, sometimes we feel that some languages-- so English is created first, and followed by maybe Dutch, or maybe French, and maybe after French, it could be Arabic. So these things could be interesting to know. So for that, we started to look at the idea of translation path-- exactly how things are translated. So again, if you go to the property page, you could click on any property. Sorry. Maybe I can show. So you could click on any property and you could just say, "Give me the translation path." It takes some time, but it will start bringing the data, because it's real time, so you get the data coming from all this. So you get the date, you get what things have been changed, when was something deleted, et cetera. Why it is important? For example, you see this is something that happened in 2017, and the label has been removed. This is the official website. So imagine you have removed the label from the official website-- sorry, this country-- so anybody who doesn't know P17, what it is, cannot even understand, because the label has been deleted by the person. So this type of vandalism exists. Another example where, completely, all the language labels have been deleted-- English, French, Spanish, German, everything has been deleted. There are no labels, there are no descriptions. So you could find these types of things from the translation path and just because of the color code, you could see what happened on what day, and you could check exactly, because it is also linked. If you click on any of this, you could also get a link to the revision, identify what exactly happened during that particular revision. So this is coming from revision history. So if you click on any of this, you get what exactly is happening in any particular revision. So how did we build it? Just if you come back, here, you see there is something called a comment on the right-hand side. 
You see there is something called added aliases, "added British English aliases," "changed Esperanto label," "added [io] label," et cetera. So we made use of this information, for example, for label description and aliases, if you add something, you have some sort of comment which starts with wbsetlabel-add. Or if it is updated, you have wbsetlabel-set. And if you remove something, you see it is removed. And based on this type of information, we were able to build such a translation path. Okay, this is good, but what happened is that this type of information, this type of things, just using the comment, it is useful for building real-time tools, just like what I showed before, WDProp, but it is very difficult to detect when there are multiple changes. For example, if you have seen bots activity on Wikidata, some bots make multiple labels in one single edit. In that case, you cannot find what happened because you do not have wbsetlabel, that particular language. So you do not have a set of languages along with your comment. So these are some problems if you want to use this type of approach. So what we did, we decided to collect the data, and we decided to publicly make this data available. And what we did, we wanted to make use of content. So what we did, we started with every revision, and we took the content of each revision. And we took the next revision, and we decided to find the difference between these two revisions, to find what exactly changes, which of the labels got changed. Because of that, we got much more interesting information, much more accurate information than the previous approach because it is very important for doing analysis. It is important that you make use of correct data. So you have four columns that were used here-- timestamp, property, language, type, et cetera. And you get this data in this format. It is publicly available. So what does this data give me? 
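(editor's note) The content-based approach described above, comparing each revision with the next one to see which labels changed, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the published data set: the function name and the change names ("add", "update", "remove") are hypothetical, revisions are assumed to be already fetched and sorted oldest first as (timestamp, {language: label}) pairs, and the exact meaning of the "type" column in the talk's data is not specified, so the fourth column here simply records the kind of change.

```python
from typing import Dict, List, Tuple

Revision = Tuple[str, Dict[str, str]]  # (timestamp, {language code: label})


def diff_labels(prop: str, revisions: List[Revision]) -> List[Tuple[str, str, str, str]]:
    """Return (timestamp, property, language, change) rows for label changes."""
    rows = []
    previous: Dict[str, str] = {}  # nothing exists before the first revision
    for timestamp, labels in revisions:
        for lang in sorted(set(previous) | set(labels)):
            old, new = previous.get(lang), labels.get(lang)
            if old is None and new is not None:
                change = "add"
            elif old is not None and new is None:
                change = "remove"
            elif old != new:
                change = "update"
            else:
                continue  # label unchanged in this revision
            rows.append((timestamp, prop, lang, change))
        previous = labels
    return rows


# Illustrative history only; the second revision touches two languages at once,
# which is the case an edit comment alone cannot describe.
history = [
    ("2013-01-01T00:00:00Z", {"en": "country"}),
    ("2013-02-01T00:00:00Z", {"en": "country", "fr": "pays", "nl": "land"}),
    ("2017-05-01T00:00:00Z", {"fr": "pays", "nl": "Land"}),
]

for row in diff_labels("P17", history):
    print(row)
```

Because the comparison works on revision content rather than on the edit comment, an edit that changes several languages yields one row per language. The same loop would apply to descriptions, and to aliases if each alias list is compared as a set.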
This data gives me information that currently almost 4,000 plus, 4,500 properties have labels between 0 and 20. So there are a lot of properties who do not have more than 20 multilingual labels. And there are only 1,500 language properties that have been translated up to 40. And yesterday, if you were present during the talk of Lydia Pintscher, she talked about P18, so P18 is something here. So you can see there are only a couple of six or seven properties that are currently having all the-- P18 has 154 translations, just to give that idea. So there is one property which is having 154 multilingual labels. There are properties which have only one particular label. And the average number of labels is only 21, and the standard deviation is 20. Okay, what next we would like to say? So you have seen something similar in the real-time data. This is from the collected data. So this is what are the top languages that are coming up in the results. So these we have seen. But my next point is, are there combinations possible. For example, if there is French, there is Arabic. If there is Arabic, there is some other language. If there's French, there's Ukrainian, et cetera. Can we find such type of combinations in the translation data set? So, yes, it is possible. So if you see this count, this frequent itemsets-- so I've just shown seven of them-- you find that there are combinations that are possible. Okay, let us say, is there a possibility of having four labels, like if there is English, there's also possibility to find Dutch, Arabic, Ukrainian. If there is English, there's possibility to find Dutch, French, and Arabic, et cetera. You can also find a lot of combinations. Why it is important? 
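(editor's note) The language combinations mentioned above are frequent itemsets: sets of languages that occur together as labels on many properties. A minimal sketch of the counting step, assuming a mapping from property ID to the set of languages that have a label, might look like this. The function name, the `min_support` parameter, and the sample data are hypothetical, and the sketch counts combinations by brute force rather than using whatever mining algorithm the study actually applied.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def frequent_language_sets(
    property_languages: Dict[str, Set[str]], size: int, min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Count language combinations of a given size; keep those seen often enough."""
    counts: Counter = Counter()
    for languages in property_languages.values():
        # Sort first so the same set of languages always yields the same tuple.
        for combo in combinations(sorted(languages), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Illustrative data only.
labels_by_property = {
    "P1": {"en", "nl", "fr"},
    "P2": {"en", "nl"},
    "P3": {"en", "nl", "ar"},
    "P4": {"en"},
}

print(frequent_language_sets(labels_by_property, size=2, min_support=2))
```

Brute-force enumeration is only practical for small combination sizes: a property with 154 labels already has millions of four-language combinations, so a pruning algorithm such as Apriori would be the usual choice on the full data.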
Because it is important to know, for example, if you have contributors who can speak multiple languages. If we are able to find any particular pattern, then when a new property is created, we can suggest to such a person that they translate its label, because they already speak multiple languages. So let me just show you one example. This is a complete translation path that has been obtained from different languages. So here, what we have done is we selected two small minority languages, Tagalog and Kapampangan, which are minority languages from the Philippines, and you see that there is a strong transfer between Tagalog and Kapampangan. So these types of things can be detected when you have such translation results. So that is another advantage. To conclude my work, I would like to say, it is important that we understand how properties are translated because if you want to extract data from Wikipedia, you need to know what words are being used in the local languages. What is "image" in French, what is "image" in Punjabi, what is "image" in Hindi, or any other language. So that is important for importing data. And tomorrow, of course, if you are able to fetch this data into Wikidata, we could also use new projects like Wikidata Bridge, which we could use to fill other info boxes, like multilingual Wikipedia articles, and this could be really helpful. So with that, I would like to thank you, and if you have questions, I would be happy to answer them. (moderator) Anybody with questions? (audience applause) Yes? (man) So what you're doing is mainly analyzing how this-- - (John) Yes. - (man) ...is all happening? Do you know if there are initiatives or if there are tools which can help make this easier, like translation of properties? (John) Yes.
Tools, like, for example, what to translate from Wikimedia Foundation, is helpful, but I have not seen-- This is not currently integrated with Wikidata. What to translate is only integrated with certain languages on Wikipedia, but not on Wikidata. But that could be really interesting. Yes, thank you for bringing this up, because just imagine, if we know that a person has been labeling in multiple languages, and we also have this what to translate tool, and we have these statistics, we have this data coming from this type of property translation, it is easier to suggest to a person that new properties have been created, and then you could-- Right now it's not integrated to Wikidata. (moderator) Anybody else? (man 2) I have one question myself, that comes back to it, does anybody know of working lists on translating properties? Sorry? (man 2) Does anybody know of working lists about translating properties, like, I can imagine from your statistics, you could say, this is the top 100 most widely used properties who lack translations in this and this language? No, there is, I think, there are ways by, for example, you could browse by data types, browse by property classes. For example, here is something called property classes where people have created projects-- it's taking time-- so you have projects, and you could say, how would I describe, what are the, for example, what are the properties that I could describe for this, for describing IEEE standard version? You need edition number, you need edition translation, et cetera. So if you have a targeted thing, you could search for what type of classes. For example, if you're working in GLAM or histories, you could say, what is history-related any document are there? So you could say, historical, and you could find historical. Okay, this is a property class, go to this property class. And, sorry, where is it? So it is having something called "Merimee ID." So people have been trying to use property classes to link objects. 
That helps if you're working on a particular project, and you could find the properties related to that. (man 2) But your tool could quite easily make a list of, let's say, the top 100 most widely used properties that haven't got, I don't know, a Punjabi label, let's say? - (John) For that, I will just-- - (man 2) Which could be interesting. (John) Okay, tell me any language, for example, let us say, Netherlands, because it's performing very well. So I would say-- translated labels. So this is translate-- sorry. (mouse clicking) For example, Hindi. So here, what happens, here you just see any properties that need translation. So there are like 6,647 properties that need translation in a particular language. So you could click on any language that you want and get the data. And you could get the list of where people need support. So, this could be interesting to link with property usage, how many people, is it really near the top, is it among the top ten. So suggest those top ten or top hundred, in that language. That would be an interesting list. That's good. (man 3) Just what you asked, there is a list of the top 100 most used properties on Wikidata. It's on Wikidata. So, yeah, it's there, under Wikidata Database Reports/ Top 100 Properties. So one thing could be that we could just link this and suggest it. (moderator) Could you maybe add the link to the etherpad, and then maybe, this information can come together. (John) Okay. (moderator) If there are no other questions, then we will conclude here. And we have a two, three minutes break until we start with the next speaker. - Thanks. - (John) Thank you very much. (audience applause)
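(editor's note) The working list discussed in the Q&A, the most used properties that still lack a label in a given language, could be assembled by joining usage counts with label data. The sketch below is one possible reading of that suggestion, not an existing WDProp feature: the function name is hypothetical, the usage counts are assumed to come from a source such as the database report mentioned above, and the numbers are invented for the example.

```python
from typing import Dict, List, Set


def translation_worklist(
    usage: Dict[str, int],
    labels: Dict[str, Set[str]],
    language: str,
    top_n: int = 100,
) -> List[str]:
    """Most used properties with no label in `language`, most used first."""
    missing = [prop for prop in usage if language not in labels.get(prop, set())]
    # Sort by descending usage; ties are broken by property ID for stable output.
    missing.sort(key=lambda prop: (-usage[prop], prop))
    return missing[:top_n]


# Illustrative numbers only.
usage_counts = {"P31": 1000, "P17": 500, "P18": 300, "P569": 200}
label_languages = {
    "P31": {"en", "hi"},
    "P17": {"en"},
    "P18": {"en", "hi"},
    "P569": {"en"},
}

print(translation_worklist(usage_counts, label_languages, "hi", top_n=2))
```

This ranks all untranslated properties by usage and keeps the first `top_n`. The other reading of the question, taking the 100 most used properties first and then listing those that lack a label, would filter in the opposite order.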