(Lydia) Thank you so much.
So, this conference,
one of the big themes is languages.
I want to give you an overview
of where we actually are currently
when it comes to languages
and where we can go from here.
Wikidata is all about giving more people
more access to more knowledge,
and language is such an important part
of making that a reality,
especially since more and more
of our lives depend on technology.
And as our keynote speaker
was saying earlier today,
some of the technology
leaves people behind
simply because they can't speak
a certain language,
and that's not okay.
So we want to do something about that.
And in order to change that,
you need at least two things.
One is you need to provide content
to the people in their language,
and the second thing you need
is to provide them
with interaction in their language
in those applications
or whatever it is you have.
And Wikidata helps with both of those.
And the first thing,
content in your language,
that is basically what we have
in items and properties,
how we describe the world.
Now, this is certainly
not everything you need,
but it gets you quite far ahead.
The other thing
is interaction in your language,
and that's where lexemes come into play:
if you want to talk
to your digital personal assistant,
or if you want to have your device
translate a text, and things like that.
Alright, let's look into
content in your language.
So what we have in items and properties.
For this, the labels in those items
and properties are crucial.
We need to know what this entity
is called that we're talking about.
And instead of talking about Q5,
someone who speaks English
knows that's a "human,"
someone who speaks German
knows that's a "mensch,"
and similar things.
So those labels on items and properties
are bridging the gap
between humans and machines.
And between humans and humans,
making more existing knowledge
accessible to them.
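To make that bridge concrete, here is a minimal sketch of how a program can fetch those labels through the standard wbgetentities API; the particular language list is just an illustrative choice:

```python
# Minimal sketch: fetch the labels of Q5 ("human") in a few languages
# via Wikidata's wbgetentities API. The language list is illustrative.
import requests

params = {
    "action": "wbgetentities",
    "ids": "Q5",
    "props": "labels",
    "languages": "en|de|fr|ast",
    "format": "json",
}
resp = requests.get("https://www.wikidata.org/w/api.php", params=params)
for lang, label in resp.json()["entities"]["Q5"]["labels"].items():
    print(lang, "->", label["value"])  # e.g. en -> human, de -> Mensch
```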
Now, that's a nice aspiration.
What does it actually look like?
It looks like this.
What you're seeing here
is that most of the items
on Wikidata have two labels,
so labels in two languages.
And after that, it's one, and then three,
and then it becomes very sad.
(quiet laughter)
I think we need to do better than this.
But, on the other hand,
I was actually expecting this
to be even worse.
I was expecting the average to be one.
So I was quite happy
to see two. (chuckles)
Alright.
But it's not just interesting to know
how many labels our items
and properties have.
It's also interesting to see
in which languages.
Here you see a graph of the languages
that we have labels for on Items.
So the biggest part there is Other.
So I just took the top 100 languages
and everything else is Other
to make this graph readable.
And then there's English and Dutch,
French,
and not to forget, Asturian.
- (person 1) Whoo!
- Whoo-hoo, yes!
So what you see here is quite an imbalance
and still quite a lot of focus on English.
Another thing is if you look
at the same thing for Properties,
it's actually looking better.
And I think part of that is that
there are just way fewer properties,
so even smaller communities
have a chance to keep up with that.
But it's also a pretty
important part of Wikidata
to localize into your language.
So that's good.
What I want to highlight
here with Asturian
is that a small community
can really make a huge difference
with some dedication and work,
and that's really cool.
A small quiz for you.
If you take all the properties on Wikidata
that are not external identifiers,
which one has the most labels,
like the most languages?
(audience) [inaudible]
I hear some agreement on instance of?
You would be wrong.
It's image. (chuckles)
So, yeah, that tells you,
if you speak one of the languages
where instance of
doesn't yet have a label,
you might want to add it.
It has 148 labels currently.
But on to another slide.
This graph tells us something
about how much content we are making
available in a certain language
and how much of that content
is actually used.
So what you're seeing is basically a curve
with most content having English labels,
being available in English,
and being used a lot.
And then it kind of goes down.
But, again, what you can see are outliers
who have a lot more content
than you would necessarily expect,
and that is really, really good.
The problem still is it's not used a lot.
Asturian and Dutch should be higher,
and I think helping those communities
increase the use
of the data they collected
is a really useful thing to do.
What this analysis and others
also showed us, which is a good thing,
is that highly used items
tend to have more labels--
or the other way around;
it's not entirely clear which.
And then the question is,
are we serving
just the powerful languages?
Or are we serving everyone?
And what you see here
is a grouping of languages.
The languages that are grouped together
tend to have labels together.
And you see it clustering.
Now here's a similar clustering, colored,
based on how alive, how used,
how endangered the language is.
And a good thing you're seeing here
is that safe languages
and endangered languages
do not form two different clusters.
But they're all mixed together,
which is much better than it would be
the other way around
where the safe languages,
the powerful languages
are just helping each other out.
No, that's not the case.
And it's a really good thing.
When I saw this,
I thought this was very good.
Here's a similar thing
where we looked at
each language's status
and how many labels it has.
What you're seeing
is a clear win for safe languages,
as is expected.
But what you're also seeing
is that the languages in category 2
and 3 and maybe even 4
are not that bad, actually,
in terms of their representation
in Wikidata and others.
It's a really good thing to find.
Now, if you look at the same thing
for how much of that content
of those labels
is actually used
on Wikipedia, for example,
then we see a similar
picture emerging again.
And it tells us that those communities
are actually making good use of their time
by filling in labels
for higher used items, for example.
There are outliers
where I think we can step in
to help those communities find the places
where their work would be most valuable.
But, overall, I'm happy with this picture.
Now, that was the items
and properties part of Wikidata.
Now, let's look at interaction
in your languages.
So the lexeme parts of Wikidata
where we describe words
and their forms and their meanings.
We've been doing this now
since May last year,
and content has been growing.
You can see here in blue the lexemes,
and then in red,
the forms on those lexemes
and yellow, the senses
on those lexemes.
So some communities--
we'll get to that later--
have spent a lot of time creating forms
and senses for their lexemes,
which is really useful
because that builds
the core of the data set that you need.
Now, we looked at all the languages
that have lexemes on Wikidata,
so languages we have words for;
those are right now 310 languages.
Now, what do you think is the top language
when it comes to the number
of lexemes currently in Wikidata?
(audience) [inaudible]
Huh?
(person 2) German.
Sorry, I've heard it before.
It's Russian.
Russian is quite ahead.
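If you want to check numbers like these yourself, here is a sketch of how one might count lexemes per language on the Wikidata Query Service; it assumes the lexeme RDF mapping (lexemes typed as ontolex:LexicalEntry, language via dct:language), which is my understanding of the current model:

```python
# Sketch: count lexemes per language via the Wikidata Query Service.
# Assumes the lexeme RDF mapping (ontolex:LexicalEntry, dct:language).
import requests

query = """
SELECT ?languageLabel (COUNT(?lexeme) AS ?lexemes) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          dct:language ?language .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?languageLabel
ORDER BY DESC(?lexemes)
LIMIT 10
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["languageLabel"]["value"], row["lexemes"]["value"])
```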
And just to give you some perspective,
there's different opinions
but I've read, for example,
that 1,000 to 3,000 words
gets you to conversation level,
roughly, in another language,
and 4,000 to 10,000 words
to an advanced level.
So, we still have a bit to catch up there.
One thing I want you
to pay attention to is Basque here
with 10,000, roughly, lexemes.
Now, if you look at the number
of forms for those lexemes,
Basque is way up there,
which is really cool,
and you should go to a talk that explains
to you why that is the case.
Now, if you look at the number
of senses, so what do words mean,
Basque even gets to the top of the list.
I think that deserves a round of applause.
(applause)
Another short quiz.
What's the lexeme
with the most translations currently?
(audience) Cats, cats, [inaudible],
Douglas Adams, [inaudible]
All good guesses, but no.
It's this, the Russian word for "water."
Alright, so now we talked a lot
about how many lexemes,
forms, and senses we have,
but that's just one thing you need.
The other thing you need
is to actually describe those lexemes,
forms, and senses
in a machine-readable way.
And for that you have statements,
like on items.
And one of the properties
you use is usage example.
So whoever is using that data
can understand how to use
that word in context,
so that could be a quote, for example.
And here, Polish rocks.
Good job, Polish speakers.
Another property
that's really useful is IPA,
so how do you pronounce this word.
Russian apparently needs
lots of IPA statements.
But, again, Polish, second.
And last but not least
we have pronunciation audio.
So that is links to files on Commons
where someone speaks the word,
so you can hear a native speaker
pronounce the word
in case you can't read IPA, for example.
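As a rough way to compare how much those three properties are used, here is a hedged sketch of a count query; P5831 (usage example), P898 (IPA transcription), and P443 (pronunciation audio) are my reading of the property IDs mentioned here, so double-check them before relying on the numbers:

```python
# Hedged sketch: count statements for three lexeme-related properties.
# P5831 = usage example, P898 = IPA transcription, P443 = pronunciation
# audio -- assumed IDs, worth double-checking on Wikidata itself.
import requests

query = """
SELECT ?prop (COUNT(?value) AS ?statements) WHERE {
  VALUES ?prop { wdt:P5831 wdt:P898 wdt:P443 }
  ?entity ?prop ?value .
}
GROUP BY ?prop
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["prop"]["value"], row["statements"]["value"])
```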
And there's a really nice
Wiki-powered project
called Lingua Libre
where you can go and help record
words in your language
that then can be added
to lexemes on Wikidata,
so other people can understand
how to pronounce your words.
(person 2) [inaudible]
If you search for "Lingua Libre,"
and I'm sure someone can post it
in the Telegram channel.
Those guys rock.
They did really cool stuff with Wikibase.
Alright.
Then the question is,
where do we go from here?
Based on the numbers I've just shown you,
we've come a long way
towards giving more people
more access to more knowledge
when looking at languages on Wikidata.
But there is also still
a lot of work ahead of us.
Some of the things
you can do to help, for example,
are running label-a-thons,
where you get people together
to label items in Wikidata,
or doing an edit-a-thon
around lexemes in your language
to get the most used words
in your language into Wikidata.
Or you can use a tool like Terminator
that helps you find the most
important items in your language
that are still missing a label.
"Most important" being measured
by how often an item is used
in other Wikidata items,
as links in statements.
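As a much-simplified illustration of what such a tool does (leaving out the usage-based ranking that Terminator adds), here is a sketch that finds items missing a label in one language, restricted to humans and Asturian purely as an example:

```python
# Simplified sketch of a Terminator-style check: items with no label
# in a given language. Restricted to humans (P31 = Q5) and capped with
# LIMIT so the query stays cheap; the real tool also ranks by usage.
import requests

language = "ast"  # Asturian, as an example
query = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .
  FILTER NOT EXISTS {
    ?item rdfs:label ?label .
    FILTER(LANG(?label) = "%s")
  }
}
LIMIT 50
""" % language
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"])
```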
And, of course, for the lexeme part,
now that we've got
a basic coverage of those lexemes,
it's also about building them out,
adding more statements to them
so that they can actually form the base
for meaningful applications
to be built on top of them.
Because we're getting closer
to the critical mass
where you can build
serious applications on top of it,
but we're not quite there yet.
And I hope all of you
will join us in doing that.
And that already brings me
to a little help from our friends,
and Bruno, do you want to come over
and talk to us about lexical masks?
(Bruno) Thank you, Lydia,
thank you for giving me
this short period of time
to present this work
that we are doing at Google
with Denny, whom most of you
have probably heard of or know.
At Google, I'm a linguist,
so I'm very happy to be here
amongst other language enthusiasts.
We are also building some lexicons,
and we have built this technology
or this approach that we think
can be useful for you.
Just to give you
a little bit of background,
this is my lexicographic
background talking here.
When we build lexicon databases,
we have a hard time maintaining them,
keeping them consistent,
and exchanging data,
as you probably know.
There have been several attempts
to unify the features and the properties
that describe
those lexemes and those forms,
and it's not a solved problem,
but there are some
unification attempts on that side.
But what is really missing--
and this is a problem we had
at the beginning of our project at Google--
is an internal structure
that describes what
a lexical entry should look like:
what kind of data
or what kind of information we have
and which specifications are expected.
So this is what we came up with,
this thing called a lexical mask.
A lexical mask describes
what is expected for an entry,
a lexicographic entry, to be complete,
both in terms of the number of forms
you expect for a lexeme
and the features you expect
for each of those forms.
Here is an example for Italian adjectives.
In Italian, you expect to have
four forms for your adjectives,
and each of these forms
has a specific combination
of gender and number features.
This is what we expect
for Italian adjectives.
Of course, you can have
extremely complex masks,
like the French verb conjugation,
which is quite extensive,
and I won't show you
the Russian mask
because it doesn't fit on the screen.
And we also have
some detailed specifications,
because we distinguish
what is at the form level.
So here you have Russian nouns
that have three numbers
and a number of cases
with different forms,
but they also have
an entry-level specification
that says a noun in particular has
an inherent gender
and an inherent animacy feature,
which is also specified in the mask.
We also want to distinguish
the general case from the exceptions:
a mask gives a specification for what,
in general, an entry should look like,
but you can have smaller masks
for defective forms
or defective lexemes
that occur in a language.
So here is the simplest version,
for French verbs
that have only the 3rd person singular:
the weather verbs,
like "it rains" or "it snows,"
as in English.
So we distinguish these two levels.
And how we use this at Google
is that when we have a lexicon
that we want to use,
we quite literally throw the lexicon,
all the entries, through the mask
and see which entries have a problem
in terms of structure.
Are we missing a form?
Are we missing a feature?
When there is a problem,
we do some human validation;
otherwise we just see
that it passes the mask.
So it's an extremely powerful tool
to check the quality of the structure.
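To illustrate the idea, here is a small sketch of such a structural check; the data structures are made up for this example and are not Google's actual format:

```python
# Illustrative sketch of a lexical-mask check. The mask lists the
# feature combinations a complete entry must cover; the data layout
# here is invented for the example, not Google's actual format.

ITALIAN_ADJECTIVE_MASK = {
    frozenset({"masculine", "singular"}),
    frozenset({"masculine", "plural"}),
    frozenset({"feminine", "singular"}),
    frozenset({"feminine", "plural"}),
}

def check_entry(forms, mask):
    """Return (missing, unexpected) feature combinations for an entry."""
    present = {frozenset(features) for features in forms.values()}
    return mask - present, present - mask

# "bello" with the feminine plural form missing:
forms = {
    "bello": ["masculine", "singular"],
    "belli": ["masculine", "plural"],
    "bella": ["feminine", "singular"],
}
missing, unexpected = check_entry(forms, ITALIAN_ADJECTIVE_MASK)
print("missing:", missing)        # the feminine plural ("belle") is absent
print("unexpected:", unexpected)  # empty set
```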
So what we are happy to announce today
is that we got the green light
to open-source our masks.
These are schemas
that we can release, if you want them,
and that we will provide
to Wikidata as ShEx files.
This is a ShEx file for German nouns,
and Denny is working on the conversion
from our internal specification
to a more open-source specification.
We currently cover more than 25 languages.
We expect to grow on our side,
but we also see this as an opportunity
to collaborate on other languages.
And there is also an ongoing collaboration
that Denny has with Lukas.
Lukas has these great tools with a UI
that helps the user, the contributor,
add more forms.
So if you want to add
an adjective in French,
the UI tells you
how many forms are expected
and what kind of features
each form should have.
So our masks will help
that tool be defined and expanded.
That's it.
(Lydia) Thank you so much.
(applause)
Alright. Are there questions?
Do you want to talk more about lexemes?
- (person 3) Yes.
- Yes. (chuckles)
(person 3) My question,
because you were talking
about giving more people
more access in more languages:
there are a lot of languages
that can't be used on Wikidata.
So what solution do you have for that?
When you say they can't use Wikidata,
are you talking about entering labels?
- (person 3) Labels, descriptions.
- Right.
So, for lexemes, it's a bit different
because there we don't have
that restriction.
For labels on items and properties,
there is some restriction,
because we wanted to make sure
that it's not a free-for-all
where anyone does anything
and it becomes unmanageable.
But even if you're a small community
that wants one language to work in,
come talk to us; we will make it happen.
(person 3) I mean, we did this
at the Prague Hackathon in May,
and it took us until almost August
in order to be able to use our language.
- Yeah.
- (person 3) So, it's very slow.
Yeah, it is, unfortunately, very slow.
We're currently working
with the Language Committee
on solving some fundamental things,
like getting agreement on what kinds
of languages are actually "allowed,"
and that has taken too long,
which is the reason why your request
probably took longer than it should have.
(person 3) Thanks.
(person 4) Thank you.
Lydia, if you remember
the statistics that you showed,
the number of lexemes per language.
So, did you count
all the forms as a data point
or only lexemes?
(Lydia) Do you mean this?
Which one do you mean?
(person 4) Yes, exactly.
If you remember,
does this number [inaudible]
all the forms for all the lexemes
or just how many lexemes there are?
No, this is just the number of lexemes.
(person 4) Just the number of lexemes, okay.
So then it is a fair statistic,
because if it included the forms--
that's why I'm asking--
then all the languages
with inflectional morphology,
like Russian, Serbian,
Slovenian, et cetera,
would have a natural advantage
because they have so many forms.
So that kind of kicks in here,
with this number of forms.
(person 4) Yeah, that was this one.
Thank you.
(person 5) So, I had
a quick question about
the actual items and properties.
As far as I understand,
there is currently no way
to give an actual source
for any of the labels
and descriptions that are given.
And when you're talking
about an item or property,
for example,
you can get conflicting labels.
Yes.
(person 5) So this person is like...
We were talking about
indigenous things before, for example.
So this person is a Norwegian artist
according to one source,
and a Sami artist
according to another source.
Or, for example, in Estonian,
we had an issue
where we had to change terminology
to the officially used terminology
from official lexicons,
but we have no way to really indicate why:
what the source of this was,
why this was better,
and what was there before.
To anyone who sees it,
it was just me as a random person
switching the thing.
So is there a plan
to make this possible in any way
so that we can actually have
proper sources for the language data?
So, it is partially possible.
So, for example, when you have
an item for a person,
you have statements for the first name,
last name, and so on, of that person,
and then you can provide
the references for those there.
I'm quite hesitant to add more complexity
for references on labels and descriptions,
but if people really, really think
this is something that isn't covered
by any reference on the statement,
then let's talk about it.
But I fear it will add a lot of complexity
for what I hope are few cases,
but I'm willing to be convinced otherwise
if people really feel
very strongly about this.
(person 5) I mean, if it's added,
it probably shouldn't be shown
to all users by default
in the beginner interface, in any case.
More like, "Click here if you need to say
something specific about this."
Do we have a sense of how many times
that would actually matter?
(person 5) In Estonian, for example--
and I expect this is true
of other languages as well--
there is an official name
that is the actual legitimate translation
into English
of, say, a specific kind of municipality.
That was my use case, for example:
we were using the word "parish,"
and the original Estonian word
meant something like a church parish--
that was the origin--
but that's not the official translation
Estonia uses right now.
In this case, I would just add it
as an official name statement
and add the reference there.
(person 5) Okay.
More questions, yes?
(person 6) I have two quick comments.
You specifically called out Asturian
as a language that does well,
and I think that's an artifact.
Tell me about it.
(person 6) I think it's just a bot
that pasted person names,
like proper names,
and said, "Well, this is exactly
like in French or Spanish,"
and just massively copied them.
One piece of evidence is that
you don't see that energy in Asturian
for things that actually
require translation, like property names,
or names of items
that are not proper names.
Asaf, you break my heart.
(person 6) I know,
I like raining on parades,
but I have good news as well,
which is about the pronunciation numbers.
As you probably know,
Commons is full of pronunciation files,
and, for example,
Dutch has no less than 300,000
pronunciation files already on Commons
that just need to somehow be ingested.
So if anyone's looking for a side project,
there's tons and tons
of classified, categorized
pronunciation files on Commons
under the category
"Pronunciation" by language.
So that's just waiting to be matched
to lexemes and put on Lexeme.
And I was wondering
if you could say something
about the road map,
something about how much investment
or what can we expect
from Lexeme in the coming year,
because I, for one, can't wait.
You can't wait? (chuckles)
- (person 6) For more.
- Yes. (chuckles)
Right now, we're concentrating
more on Wikibase and data quality,
to see how much traction this gets
and to get more feedback
on where the pain points are,
and then we'll go back to improving
lexicographical data further.
And one of the things
I'd love to hear from you
is where exactly do you see
the next steps,
where do you want to see improvements
so that we can then figure out
how to make that happen.
But, of course, you're right,
there's still so much to do
also on the technical side.
(person 7) Okay, as we were uploading
the Basque words with forms--
and you'll see some
of these kinds of things--
last week we said,
"Oh, we are the first in something."
It appeared in the press, and it was like,
"Oh, the Basques are
the first in something, okay."
(laughs)
And then people ask,
"Okay, but what is this for?"
We don't have a really good answer.
I mean, it's like, okay,
this will help computers
understand our language better, yes,
but what kind of tools
can we make in the future?
And we don't have a good answer for this.
So I don't know
if you have a good answer for this.
(chuckles) I don't know
if I have a good answer,
but I have an answer.
So I think right now
as I was telling [inaudible],
we haven't reached that critical mass
where you can build a lot
of the really interesting tools.
But there are already some tools.
Just the other day,
Esther [Pandelia], for example,
released a tool where you can see,
I think it was the words on a globe
where they're spoken,
where they're coming from.
I'm probably wrong about this,
but she had answered
on the Project chat on Wikidata--
you can look it up there.
So we have seen these first tools,
just like we've seen
back when Wikidata started.
First some--like just a network,
and like, "Hey, look, there's this thing
that connects to this other thing."
And as we have more data,
and as we've reached some critical mass,
more powerful applications
become possible,
things like Histropedia,
things like question and answering
in your digital personal assistant,
Platypus, and so on.
And we're seeing
a similar thing with lexemes.
We're at the stage
where you can build these little
"hey, look, there's a connection
between the two things,
and there's a translation
of this word into that language" tools,
and as we build it out
and as we describe more words,
more becomes possible.
Now, what becomes possible?
As Ben, our keynote speaker,
was saying earlier about translations:
being able to translate
from one language to another.
And Jens, my colleague,
he's always talking about
the European Union
looking for a translator
who can translate from
I think it was Maltese to Swedish--
- (person 8) Estonian.
- Estonian.
And that is not a usual combination.
But once you have all these languages
in one machine-readable place,
you can do that,
you can get a dictionary
from Estonian to Maltese and back.
So covering language
combinations in dictionaries
that just haven't been covered before
because there wasn't
enough demand for it, for example,
to make it financially viable
and to justify the work.
Now we can do that.
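Here is a hedged sketch of how such a dictionary could be pulled out of lexeme data, assuming P5972 is the property that links a sense to its translations (an ID worth double-checking):

```python
# Hedged sketch: Estonian-Maltese word pairs from sense-to-sense
# translation statements. Assumes P5972 = "translation"; the lemma's
# language tag is used to pick the two languages.
import requests

query = """
SELECT ?estonian ?maltese WHERE {
  ?l1 wikibase:lemma ?estonian ;
      ontolex:sense ?s1 .
  ?s1 wdt:P5972 ?s2 .
  ?l2 wikibase:lemma ?maltese ;
      ontolex:sense ?s2 .
  FILTER(LANG(?estonian) = "et" && LANG(?maltese) = "mt")
}
LIMIT 50
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["estonian"]["value"], "<->", row["maltese"]["value"])
```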
Then text generation.
Lucie was earlier talking
about how she's working
with Hattie on generating text
to get Wikipedia articles
in minority languages started,
and that needs data about words,
and you need to understand
the language to do that.
Yeah, and those are just some
that come to my mind right now.
Maybe our audience has more ideas
what they want to do
when we have all the glorious data.
(person 9) Okay, I will deviate
from the lexemes topic.
I will ask the question:
how can I, as a member of the community,
influence that priority is put on the task
of letting a new user indicate
which languages they want to see and edit
without some secret
Babel-template knowledge?
Maybe this year there will be
a technical wishlist
without Wikipedia topics.
Maybe there's hope we can all vote
on this thing we haven't fixed
for seven years.
So do you have any ideas
and comments about this?
So you're talking about the fact
that someone who is
not logged into Wikidata
can't change their language easily?
(person 9) No, for [inaudible] users.
So, if they are logged in,
they can just change their language
at the top of the page,
and then it will appear
where the labels and descriptions
[inaudible] are,
and they can edit them.
(person 9) Well, actually,
many times the workflow assumes
that if you want to have
multiple languages, they are available,
and that's not always the case.
Okay, maybe we should sit down
after this talk and you show me.
Cool. More questions?
Yes.
(person 10) Thanks for the presentation.
Can you comment
on the state of the collaboration
with the Wiktionary community?
As far as I've seen,
there were some discussions
about importing some elements of the work,
but there seem to be licensing issues
and some disagreements, et cetera.
Right.
So, Wiktionary communities
have spent a lot of time
building Wiktionary.
They have built
amazingly complicated
and complex templates
to build pretty tables
that automatically generate forms for you
and all kinds of really impressive,
and kind of crazy stuff,
if you think about it.
And, of course, they have invested
a lot of time and effort into that.
And understandably,
they don't just want that to be grabbed,
just like that.
So there's some of that coming from there.
And that's fine, that's okay.
Now, the first Wiktionary communities
are starting to talk
about importing some
of their data into Wikidata.
Russian, as you have seen,
for example, is one of those cases,
and I expect more of that to happen.
But it will be a slow process,
just like adoption
of Wikidata's data on Wikipedia
has been a rather slow process.
On the other side,
we're working on making it easier
to use the data that is in lexemes
on Wiktionary, so that
they can make use of that
and share data between
the language Wiktionaries,
which is super hard
to impossible right now,
which is crazy,
just like it was on Wikipedia.
Wait for the birthday present. (chuckles)
Yes.
(person 11) I was thinking about it
the other way around.
I actually didn't want to say it
because I think it might be super silly,
but Wiktionary
already has some content,
and I know that
we can't transfer it to Wikidata
because there's a difference in licenses.
But I was thinking maybe
we can do something about that.
Maybe, I don't know, we can obtain
the communities' permission
by having, like, a public vote
where the active members of the community
say whether they would like
or accept transferring the content
into Wikidata lexemes.
Because I just think it is such a waste.
So, that's definitely
a conversation those people
who are in Wiktionary communities
are very welcome to bring up there.
I think it would be a bit presumptuous
for us to go and force that.
But, yeah, I think it's definitely worth
having a conversation.
But I think it's also important
to understand
that there's a distinction between
what is actually legally allowed
and what we should be doing
and what those people want or do not want.
So even if it's legally allowed,
if some other Wiktionary communities
do not want that,
I would be careful, at least.
I think you need the mic
for the stream.
(person 12) So, obviously,
it's all very exciting,
and I immediately think
how can I take that to my students
and how can I incorporate it
with the courses,
the work that we're doing,
educational settings.
And I don't have, at the moment,
first of all, enough knowledge,
but I think the documentation
that we do have
could maybe be improved.
So that's a kind of request
to make cool videos
that explain how it works
because if we have it, we can then use it,
and we can have students on board,
and we can make people understand
how awesome it all is.
And yeah, just think about documentation
and think about education, please.
Because I think a lot could be done.
There are many tasks
that could be done even with...
well, I wouldn't say primary schools,
but certainly with younger students.
And so I would really like to see
that potential being tapped into,
and, as of now, I personally
don't understand enough
to be able to create tasks
or to create like...
to do something practical with it.
So any help, any thoughts
anyone here has about that,
I would be very happy to hear
your thoughts, and yours as well.
Yeah, let's talk about that.
More questions?
Someone else raised a hand.
I forgot where it was.
(person 13) So, if we can't import
from Wiktionary,
is there some concerted effort
to find other public domain sources,
maybe all the data,
and kind of prefilter it and organize it
so that it's easy for people
to check it for import?
So there are first efforts.
My understanding is that Basque
is one of those efforts.
Maybe you want to say
a bit more about it?
(person 14) [inaudible]
Okay, the actual answer
is paying for that...
I mean, we have an agreement
with a contractor we usually work with.
They do dictionaries--
lots of stuff, but they do dictionaries.
So we agreed with them
to make the students' dictionary free;
we would [cast] the most common words
and start uploading them
with an external identifier
and that scheme of things.
But there was some discussion
about releasing it as CC0,
because they have
the dictionary under CC BY,
and they understood
what the difference was.
But I think that we can provide some tools
or some examples in the future,
and I think that there will be
other dictionaries
that we can handle,
and also I think Wiktionary
should start moving in that direction,
but that's another great discussion.
And on top of that,
Lea is also in contact
with people who work
on Occitan dictionaries,
and they're currently working
on a Sumerian collaboration.
More questions?
(person 15) Hi! We are the people
who want to import Occitan data.
Aha! Perfect!
(person 15) And we have a small problem.
We don't know how to represent
the varieties of our lexemes.
We have six dialects,
and we want to indicate for a lexeme
in which dialect it's used,
and we don't have a proper
statement to do that.
So as long as the statement doesn't exist,
it prevents us from [inaudible],
because we would need to do it again
when we are able
to [export] the statement.
And it's complicated,
because it's a statement
that won't be asked for by many people,
since it's a statement
that concerns mostly minority languages.
So you will have one person asking this.
But as with our Basque colleagues,
it can be one person
who empowers thousands of others,
so it might not seem like asking a lot,
but it will be very important for us.
Do you already have
a new property proposal up,
or do you need help creating it?
(person 15) We asked four months ago.
Alright, then let's get some people
to help out with this property proposal.
I'm sure there are enough people
in this room to make this happen.
(person 15) Property proposal
[speaking in French].
(person 16) We didn't have an answer.
(person 15) We didn't have any answer,
and we don't know how to do this
because we aren't
in the Wikidata community.
Yup, so there are people here
who can help you.
Maybe someone raises their hand to take--
(person 14) I'm for that.
But I think it's quite interesting
that it's not only a variant of a form--
you could also handle it geographically,
with coordinates or some kind of mapping,
and with different pronunciations,
and I think this is something
that happens in lots of languages.
We should start making
it happen [inaudible],
and I'm going to search for the property.
Cool.
So you will get backing
for your property proposal.
Thank you.
Alright, more questions?
Finn.
Finn is one of those people
who builds stuff
on top of lexicographical data.
(Finn) It's just a small question,
and that's about spelling variations.
It seems to be difficult to put them in...
You could, of course,
have multiple forms for the same word.
I don't know, it seems to be...
If you don't do it that way,
it seems to be difficult to specify...
or I don't know whether
this is just a minor technical issue
or whether...
Let's look at it together.
I would love to see an example.
Asaf.
(Asaf) Thank you.
I can give a very concrete example
from my mother tongue, Hebrew.
Hebrew has two main variants
for expressing almost every word
because the traditional spelling
leaves out many of the vowels.
And, therefore, in modern editions
of the Bible and of poetry,
diacritics are used.
However, those diacritics
are never used for modern prose
or newspaper writing or street signs.
So the average daily casual use
puts in extra vowels
and doesn't use the diacritics
because they are,
of course, more cumbersome
and have all kinds of rules
and nobody knows the rules.
So there are basically two variants.
There's the everyday casual prose variant,
and there's the Bible or poetry,
which always come
in this traditional diacriticized text.
To be useful,
Lexeme would have to recognize
both varieties of every single word
and every single form
of every single word.
So that's a very comprehensive use case
for official spelling variants.
It's not dialects, it's not regions;
it's basically two coexisting
orthographic systems.
And I too don't know exactly
how to express that in Lexeme today,
which is one thing that is keeping me--
in partial answer to Magnus' question--
from uploading the parts that are ready
of the biggest Hebrew dictionary,
which is public domain
and which I have been digitizing
for several years now.
A good portion of it is ready,
but I'm not putting it on Lexeme right now
because I don't know exactly
how to solve this problem.
Alright, let's solve
this problem here. (chuckles)
That has to be possible.
Alright, more questions?
If not, then thank you so much.
(applause)