-
(Lydia) Thank you so much.
-
So, this conference,
one of the big themes is languages.
-
I want to give you an overview
of where we actually are currently
-
when it comes to languages
-
and where we can go from here.
-
Wikidata is all about giving more people
more access to more knowledge,
-
and language is such an important part
of making that a reality,
-
especially since more and more
of our lives depends on technology.
-
And as our keynote speaker
earlier today was saying,
-
some of the technology
leaves people behind
-
simply because they can't speak
a certain language,
-
and that's not okay.
-
So we want to do something about that.
-
And in order to change that,
you need at least two things.
-
One is you need to provide content
to the people in their language,
-
and the second thing you need
-
is to provide them
with interaction in their language
-
in those applications
or whatever it is you have.
-
And Wikidata helps with both of those.
-
And the first thing,
content in your language,
-
that is basically what we have
in items and properties,
-
how we describe the world.
-
Now, this is certainly
not everything you need,
-
but it gets you quite far.
-
The other thing
is interaction in your language,
-
and that's where lexemes come into play,
-
if you want to talk
to your digital personal assistant
-
or have your device
translate a text, and things like that.
-
Alright, let's look into
content in your language.
-
So what we have in items and properties.
-
For this, the labels in those items
and properties are crucial.
-
We need to know what this entity
is called that we're talking about.
-
And instead of talking about Q5,
-
someone who speaks English
knows that's a "human,"
-
someone who speaks German
knows that's a "Mensch,"
-
and similar things.
-
So those labels on items and properties
-
are bridging the gap
between humans and machines.
-
And humans and humans
-
making more existing knowledge
accessible to them.
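-
As a rough illustration of that bridging, here is how a consumer might pick a human-readable label for an entity, with a language fallback. The dictionary is an abridged stand-in for what the Wikidata API returns for Q5; real code would fetch it over HTTP.

```python
# Minimal sketch of resolving an entity ID to a human-readable label.
# SAMPLE_ENTITY mimics, in abridged form, the labels section of the
# Wikidata API output for Q5; a real client would fetch this over HTTP.
SAMPLE_ENTITY = {
    "id": "Q5",
    "labels": {
        "en": {"language": "en", "value": "human"},
        "de": {"language": "de", "value": "Mensch"},
    },
}

def label_for(entity, language, fallback="en"):
    """Return the label in the requested language, falling back if missing."""
    labels = entity.get("labels", {})
    if language in labels:
        return labels[language]["value"]
    return labels.get(fallback, {}).get("value", entity["id"])

print(label_for(SAMPLE_ENTITY, "de"))   # Mensch
print(label_for(SAMPLE_ENTITY, "ast"))  # no Asturian label yet: human
```

The fallback branch is exactly why label coverage matters: when a language has no label, users get English instead, or, in the worst case, the bare Q-ID.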
-
Now, that's a nice aspiration.
-
What does it actually look like?
-
It looks like this.
-
What you're seeing here
-
is that most of the items
on Wikidata have two labels,
-
so labels in two languages.
-
And after that, it's one, and then three,
-
and then it becomes very sad.
-
(quiet laughter)
-
I think we need to do better than this.
-
But, on the other hand,
-
I was actually expecting this
to be even worse.
-
I was expecting the average to be one.
-
So I was quite happy
to see two. (chuckles)
-
Alright.
-
But it's not just interesting to know
-
how many labels our items
and properties have.
-
It's also interesting to see
in which languages.
-
Here you see a graph of the languages
-
that we have labels for on items.
-
So the biggest part there is Other.
-
So I just took the top 100 languages
-
and everything else is Other
to make this graph readable.
-
And then there's English and Dutch,
-
French,
-
and not to forget, Asturian.
-
- (person 1) Whoo!
- Whoo-hoo, yes!
-
So what you see here is quite an imbalance
-
and still quite a lot of focus on English.
-
Another thing is if you look
at the same thing for properties,
-
it's actually looking better.
-
And I think part of that is simply
that there are way fewer properties.
-
So even smaller communities
have a chance to keep up with that.
-
But properties are also a pretty
important part of Wikidata
-
to localize into your language.
-
So that's good.
-
What I want to highlight
here with Asturian
-
is that a small community
can really make a huge difference
-
with some dedication and work,
-
and that's really cool.
-
A small quiz for you.
-
If you take all the properties on Wikidata
-
that are not external identifiers,
-
which one has the most labels,
like the most languages?
-
(audience) [inaudible]
-
I hear some agreement on instance of?
-
You would be wrong.
-
It's image. (chuckles)
-
So, yeah, that tells you,
if you speak one of the languages
-
where instance of
doesn't yet have a label,
-
you might want to add it.
-
So it has 148 labels currently.
-
But that's just another slide.
-
This graph tells us something
-
about how much content we are making
available in a certain language
-
and how much of that content
is actually used.
-
So what you're seeing is basically a curve
-
with most content having English labels,
being available in English,
-
and being used a lot.
-
And then it kind of goes down.
-
But, again, what you can see are outliers
-
who have a lot more content
than you would necessarily expect,
-
and that is really, really good.
-
The problem still is it's not used a lot.
-
Asturian and Dutch should be higher,
-
and I think helping those communities
-
increase the use
of the data they collected
-
is a really useful thing to do.
-
What this analysis and others
showed us, though, is also a good thing:
-
we are seeing
that highly used items
-
also tend to have more labels
-
or the other way around--
it's not entirely clear.
-
And then the question is,
-
are we serving
just the powerful languages?
-
Or are we serving everyone?
-
And what you see here
is a grouping of languages.
-
The languages that are grouped together
tend to have labels together.
-
And you see it clustering.
-
Now here's a similar clustering, colored,
-
based on how alive, how used,
-
how endangered the language is.
-
And a good thing you're seeing here
-
is that safe languages
and endangered languages
-
do not form two different clusters.
-
But they're all mixed together,
-
which is much better than it would be
the other way around
-
where the safe languages,
the powerful languages
-
are just helping each other out.
-
No, that's not the case.
-
And it's a really good thing.
-
When I saw this,
I thought this was very good.
-
Here's a similar thing
-
where we looked at
-
each language's status
-
and how many labels it has.
-
What you're seeing
is a clear win for safe languages,
-
as is expected.
-
But what you're also seeing
-
is that the languages in category 2
and 3 and maybe even 4
-
are not that bad, actually,
-
in terms of their representation
in Wikidata and others.
-
It's a really good thing to find.
-
Now, if you look at the same thing
-
for how much of that content
of those labels
-
is actually used
on Wikipedia, for example,
-
then we see a similar
picture emerging again.
-
And it tells us that those communities
are actually making good use of their time
-
by filling in labels
for highly used items, for example.
-
There are outliers
where I think we can help,
-
to help those communities find the places
where their work would be most valuable.
-
But, overall, I'm happy with this picture.
-
Now, that was the items
and properties part of Wikidata.
-
Now, let's look at interaction
in your languages.
-
So the lexeme parts of Wikidata
-
where we describe words
and their forms and their meanings.
-
We've been doing this now
since May last year,
-
and content has been growing.
-
You can see here in blue the lexemes,
-
and then in red,
the forms on those lexemes
-
and yellow, the senses
on those lexemes.
-
So some communities--
we'll get to that later--
-
have spent a lot of time creating forms
and senses for their lexemes,
-
which is really useful
-
because that builds
the core of the data set that you need.
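-
A minimal sketch of that lexeme/form/sense split as a data structure (field names are illustrative, not the exact Wikibase schema):

```python
from dataclasses import dataclass, field

@dataclass
class Form:
    representation: str          # one inflected spelling of the word
    grammatical_features: list   # e.g. ["plural"], ["singular", "dative"]

@dataclass
class Sense:
    gloss: str                   # a short description of one meaning

@dataclass
class Lexeme:
    lemma: str
    language: str
    lexical_category: str        # e.g. "noun", "verb", "adjective"
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)

# A toy English lexeme with one extra form and one sense.
water = Lexeme(lemma="water", language="en", lexical_category="noun")
water.forms.append(Form("waters", ["plural"]))
water.senses.append(Sense("transparent liquid that falls as rain"))
```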
-
Now, we looked at all the languages
-
that have lexemes on Wikidata.
-
So, the words we have--
-
those are right now in 310 languages.
-
Now, what do you think is the top language
-
when it comes to the number
of lexemes currently in Wikidata?
-
(audience) [inaudible]
-
Huh?
-
(person 2) German.
-
Sorry, I've heard it before.
-
It's Russian.
-
Russian is quite ahead.
-
And just to give you some perspective,
-
there's different opinions
-
but I've read, for example,
that 1,000 to 3,000 words
-
gets you to conversation level,
roughly, in another language,
-
and 4,000 to 10,000 words
to an advanced level.
-
So, we still have a bit to catch up there.
-
One thing I want you
to pay attention to is Basque here
-
with 10,000, roughly, lexemes.
-
Now, if you look at the number
of forms for those lexemes,
-
Basque is way up there,
-
which is really cool,
-
and you should go to a talk that explains
to you why that is the case.
-
Now, if you look at the number
of senses, so what do words mean,
-
Basque even gets to the top of the list.
-
I think that deserves applause.
-
(applause)
-
Another short quiz.
-
What's the lexeme
with the most translations currently?
-
(audience) Cats, cats, [inaudible],
Douglas Adams, [inaudible]
-
All good guesses, but no.
-
It's this, the Russian word for "water."
-
Alright, so now we talked a lot
-
about how many lexemes,
forms, and senses we have,
-
but that's just one thing you need.
-
The other thing you need
-
is actually describing those lexemes,
forms, and senses
-
in a machine-readable way.
-
And for that you have statements,
like on items.
-
And one of the properties
you use is usage example.
-
So whoever is using that data
-
can understand how to use
that word in context,
-
so that could be a quote, for example.
-
And here, Polish rocks.
-
Good job, Polish speakers.
-
Another property
that's really useful is IPA,
-
so how do you pronounce this word.
-
Russian apparently needs
lots of IPA statements.
-
But, again, Polish, second.
-
And last but not least
we have pronunciation audio.
-
So that is links to files on Commons
-
where someone speaks the word,
-
so you can hear a native speaker
pronounce the word
-
in case you can't read IPA, for example.
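-
A sketch of how those three kinds of statements might sit on a form, keyed by Wikidata property ID (P5831 usage example, P898 IPA transcription, P443 pronunciation audio; the values here are invented for illustration):

```python
# Toy statement set for one form of the Polish lexeme "woda" (water).
# Keys are Wikidata property IDs: P5831 usage example, P898 IPA
# transcription, P443 pronunciation audio. Values are invented.
form_statements = {
    "P5831": ["Szklanka wody, prosze."],   # a usage example sentence
    "P898": ["/ˈvɔda/"],                   # IPA transcription
    "P443": ["Pl-woda.ogg"],               # file name on Wikimedia Commons
}

def has_pronunciation_help(statements):
    """A consumer can pronounce the form if it has IPA or an audio file."""
    return bool(statements.get("P898") or statements.get("P443"))
```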
-
And there's actually a really nice
Wikibase-powered project
-
called Lingua Libre
-
where you can go and help record
words in your language
-
that then can be added
to lexemes on Wikidata,
-
so other people can understand
how to pronounce your words.
-
(person 2) [inaudible]
-
If you search for "Lingua Libre,"
you'll find it,
-
and I'm sure someone can post the link
in the Telegram channel.
-
Those guys rock.
-
They did really cool stuff with Wikibase.
-
Alright.
-
Then the question is,
where do we go from here?
-
Based on the numbers I've just shown you,
-
we've come a long way
-
towards giving more people
more access to more knowledge
-
when looking at languages on Wikidata.
-
But there is also still
a lot of work ahead of us.
-
Some of the things
you can do to help, for example,
-
is run label-a-thons
-
like get people together
to label items in Wikidata
-
or do an edit-a-thon
around lexemes in your language
-
to get the most used words
in your language into Wikidata.
-
Or you can use a tool like Terminator
-
that helps you find the most
important items in your language
-
that are still missing a label.
-
Most important being measured
by how often it is used
-
in other Wikidata items
as links in statements.
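-
The core idea of such a tool fits in a few lines: rank items by usage and surface the ones still missing a label in the target language. The data below is a toy stand-in for real usage statistics:

```python
# Toy stand-in for real usage statistics: (item ID, usage count, labels).
items = [
    ("Q5",    1_500_000, {"en": "human", "de": "Mensch"}),
    ("Q146",    300_000, {"en": "house cat"}),
    ("Q42",      90_000, {"en": "Douglas Adams", "de": "Douglas Adams"}),
]

def most_wanted_labels(items, language, limit=10):
    """Most-used items that still lack a label in `language`."""
    missing = [(qid, count) for qid, count, labels in items
               if language not in labels]
    missing.sort(key=lambda pair: pair[1], reverse=True)
    return [qid for qid, _ in missing[:limit]]

print(most_wanted_labels(items, "de"))  # ['Q146']
```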
-
And, of course, for the lexeme part,
-
now that we've got
a basic coverage of those lexemes,
-
it's also about building them out,
adding more statements to them
-
so that they actually can build the base
-
for meaningful applications
to build on top of that.
-
Because we're getting closer
to that critical mass
-
where you can build
serious applications on top of it,
-
but we're still some way from that.
-
And I hope all of you
will join us in doing that.
-
And that already brings me
-
to a little help from our friends,
-
and Bruno, do you want to come over
-
and talk to us about lexical masks?
-
(Bruno) Thank you, Lydia,
-
thank you for giving me
this short period of time
-
to present this work
that we are doing at Google
-
with Denny, whom most of you
probably have heard of or know.
-
At Google, I'm a linguist,
-
so I'm very happy to be here
amongst other language enthusiasts.
-
We are also building some lexicons,
-
and we have built this technology
-
or this approach that we think
can be useful for you.
-
Just to give you
a little bit of background,
-
this is my lexicographic
background talking here.
-
When we build lexicon databases,
-
we have a hard time maintaining them
and keeping them consistent
-
and exchanging data,
-
as you probably know.
-
There are several attempts
to unify the features and the properties
-
that are describing
those lexemes and those forms,
-
and it's not a solved problem,
-
but there are some
unification attempts on that side.
-
But what is really missing--
-
and this is a problem we had
at the beginning of our project at Google--
-
is to have an internal structure
-
that describes what
a lexical entry should look like,
-
what kind of data
or what kind of information we have,
-
and the specifications that are expected.
-
So we came up
with this thing called a lexical mask.
-
A lexical mask describes
what is expected for an entry,
-
a lexicographic entry, to be complete,
-
both in terms of the number of forms
you expect for a lexeme,
-
and the number of features
you expect for each of those forms.
-
Here is an example for Italian adjectives.
-
You expect, in Italian, to have
four forms for your adjectives,
-
and each of these forms
has a specific combination
-
of gender and number features.
-
This is what we expect
for the Italian adjectives.
-
Of course, you can have
extremely complex masks,
-
like French verb conjugation,
which is quite extensive,
-
and I won't show you
the Russian mask
-
because it doesn't fit on the screen.
-
And we also have
some detailed specifications
-
because we distinguish
what is at the form level.
-
So here you have Russian nouns
that have three numbers
-
and a number of cases
with different forms,
-
but they also have
an entry-level specification
-
that says a noun particularly has
-
an inherent gender
and an inherent animacy feature
-
that is also specified in the mask.
-
We also want to distinguish
that a mask gives a specification
-
for, in general,
what an entry should look like.
-
But you can have smaller masks
for defective aspects of the form
-
or defective aspects of the lexeme
that happen in language.
-
So here is the simplest version
of French verbs
-
that have only the 3rd person singular
for all the weather verbs,
-
like "it rains" or "it snows,"
like in English.
-
So we distinguish these two levels.
-
And how we use this at Google
-
is that when we have a lexicon
that we want to use,
-
we use the mask to really
literally throw the lexicons,
-
all the entries, through the mask
-
and see which entry has a problem
in terms of structure.
-
Are we missing a form?
Are we missing a feature?
-
And when there is a problem,
we do some human validation;
-
otherwise, the entry just passes the mask.
-
So it's an extremely powerful tool
to check the quality of the structure.
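-
A minimal sketch of that mask check: the mask is the set of feature combinations a complete entry must cover, here the Italian adjective mask described earlier (the entry data is invented):

```python
# A mask: the set of feature combinations every complete entry must have.
# For Italian adjectives (per the slide): gender x number = four forms.
ITALIAN_ADJECTIVE_MASK = {
    ("masculine", "singular"), ("masculine", "plural"),
    ("feminine", "singular"), ("feminine", "plural"),
}

def check_entry(forms, mask):
    """Throw an entry through the mask: report missing/unexpected forms."""
    present = set(forms)
    return {"missing": mask - present, "unexpected": present - mask}

# An invented, incomplete entry for "rosso" (red): feminine plural missing.
rosso = {("masculine", "singular"), ("masculine", "plural"),
         ("feminine", "singular")}
report = check_entry(rosso, ITALIAN_ADJECTIVE_MASK)
print(report["missing"])  # {('feminine', 'plural')}
```

Entries that fail the check would go to human validation, as Bruno describes.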
-
So what we are happy to announce today
-
is that we got the green light
to open-source our masks.
-
So these are schemas,
-
if you want, that we can release
-
and that we will provide
to Wikidata as ShEx files.
-
This is a ShEx file for German nouns,
-
and Denny is working on the conversion
from our internal specification
-
to a more open-source specification.
-
We currently cover more than 25 languages.
-
So we expect to grow on our side,
-
but we also look for this opportunity
to collaborate for other languages.
-
And there's also an ongoing collaboration
that Denny has with Lukas.
-
Lukas has this great tool with a UI
to help the user or the contributor
to help the user or the contributor
to add more forms.
-
So if you want to add
an adjective in French,
-
the UI is telling you
how many forms are expected
-
and what kind of features
this form should have.
-
So our mask will help the tool
to be defined and expanded.
-
That's it.
-
(Lydia) Thank you so much.
-
(applause)
-
Alright. Are there questions?
-
Do you want to talk more about lexemes?
-
- (person 3) Yes.
- Yes. (chuckles)
-
(person 3) My question,
because you were talking
-
about giving more access
to more people in more languages.
-
But there are a lot of languages
that can't be used in Wikidata.
-
So what solution do you have for that?
-
When you say they can't use Wikidata,
-
are you talking about entering labels?
-
- (person 3) Labels, descriptions.
- Right.
-
So, for lexemes, it's a bit different
-
because there we don't have
that restriction.
-
For labels on items and properties,
there is some restriction
-
because we wanted to make sure
that it's not completely open,
-
where anyone does anything,
-
and it becomes unmanageable.
-
Even if a small community wants
one language and wants to work on that,
-
come talk to us; we will make it happen.
-
(person 3) I mean, we did this
at the Prague Hackathon in May,
-
and it took us until almost August
in order to be able to use our language.
-
- Yeah.
- (person 3) So, it's very slow.
-
Yeah, it is, unfortunately, very slow.
-
We're currently working
with the Language Committee
-
on solving some fundamental...
-
Like, getting agreement on what kind
of languages are actually "allowed,"
-
and that has taken too long,
-
which is the reason why your request
probably took longer than it should have.
-
(person 3) Thanks.
-
(person 4) Thank you.
-
Lydia, if you remember
the statistics that you showed,
-
the number of lexemes per language.
-
So, did you count
all the forms as a data point
-
or only lexemes?
-
(Lydia) Do you mean this?
-
Which one do you mean?
-
(person 4) Yes, exactly.
-
If you remember,
does this number [inaudible]
-
all the forms for all the lexemes
or just how many lexemes there are?
-
No, this is just a number of lexemes.
-
(person 4) Just a number of lexemes, okay.
-
So then it is a fair statistic,
-
because if it also
counted the forms--
-
that's why I'm asking--
-
then all the languages
with rich inflectional morphology,
-
like Russian, Serbian,
Slovenian, et cetera,
-
would have a natural advantage
because they have so many forms.
-
So, this kind of kicks in here
on this number of forms.
-
(person 4) Yeah, that was this one.
Thank you.
-
(person 5) So, I had
a quick question about the...
-
When we're talking about
the actual items and properties.
-
Like as far as I understand,
-
there is currently no way
to give an actual source
-
to any of the labels
and descriptions that are given.
-
So, for example, when you're talking
-
about an item or property,
-
you can get conflicting labels.
-
Yes.
-
(person 5) So this person is like...
-
We were talking about
indigenous things before, for example.
-
So this person is a Norwegian artist
according to this source,
-
and a Sami artist,
according to this source.
-
Or, for example, in Estonian,
we had an issue
-
where we had to change the terminology
to the officially used terminology
-
in official lexicons,
-
but we have no way to indicate really why,
-
like what was the source of this
-
and why this was better
and what was there before.
-
It was just me as a random person
-
just switching the thing
to anyone who sees it.
-
So is there a plan
to make this possible in any way
-
so that we can actually have
proper sources for the language data?
-
So, it is partially possible.
-
So, for example, when you have
an item for a person,
-
you have a statement, first name,
last name, and so on, of that person,
-
and then you can provide
the reference for that there.
-
I'm quite hesitant to add more complexity
-
for references on labels and descriptions,
-
but if people really, really think
-
this is something that isn't covered
by any reference on the statement,
-
then let's talk about it.
-
But I fear it will add a lot of complexity
-
for what I hope are few cases,
-
but I'm willing to be convinced otherwise
-
if people really feel
very strongly about this.
-
(person 5) I mean, if it's added,
it probably shouldn't be the default,
-
shown to all users in the beginner
interface, in any case.
-
More like, "Click here if you need to say
a specific thing about this."
-
Do we have a sense of how many times
that would actually matter?
-
(person 5) In Estonian, for example--
-
I expect this is true
of other languages as well--
-
for example, there is an official name
that is the actual legitimate translation,
-
for example, into English,
-
of, say, a specific kind of municipality.
-
That was my use case, for example,
-
where we were using the word "parish"
-
which the original Estonian word
was meant kind of like church parish,
-
and that was the origin,
-
but that's not the official translation
Estonia gets right now.
-
In this case, I would just add it
as official name statements
-
and add the reference there.
-
(person 5) Okay.
-
More questions, yes?
-
(person 6) I have two quick comments.
-
You specifically called out Asturian
as a language that does well,
-
and I think that's a false artifact.
-
Tell me about it.
-
(person 6) I think it's just a bot
-
that pasted person names,
like proper names,
-
and said, "Well, this is exactly
like in French or Spanish,"
-
and just massively copied it.
-
One point of evidence is that
you don't see that energy in Asturian
-
in things that actually
require translation, like property names,
-
or names of items
that are not proper names.
-
Asaf, you break my heart.
-
(person 6) I know,
I like raining on parades,
-
but I have good news as well,
which is about the pronunciation numbers.
-
As you probably know,
Commons is full of pronunciation files,
-
and, for example,
-
Dutch has no less than 300,000
pronunciation files already on Commons
-
that just need to somehow be ingested.
-
So if anyone's looking for a side project,
-
there's tons and tons
-
of classified, categorized
pronunciation files on Commons
-
under the category
"Pronunciation" by language.
-
So that's just waiting to be matched
to lexemes and put on Lexeme.
-
And I was wondering
if you could say something
-
about the road map,
-
something about how much investment
-
or what can we expect
from Lexeme in the coming year,
-
because I, for one, can't wait.
-
You can't wait? (chuckles)
-
- (person 6) For more.
- Yes. (chuckles)
-
Right now, we're concentrating
more on Wikibase and data quality
-
to see how much traction this gets
-
and then getting feedback on
where the pain points are next,
-
and then going back to improving
lexicographical data further.
-
And one of the things
I'd love to hear from you
-
is where exactly do you see
the next steps,
-
where do you want to see improvements
-
so that we can then figure out
how to make that happen.
-
But, of course, you're right,
-
there's still so much to do
also on the technical side.
-
(person 7) Okay, as we were uploading
the Basque words with forms,
-
and you'll see some
of these kinds of things,
-
last week we were like,
"Oh, we are the first ones in something."
-
It appears in the press, and it's like,
-
"Oh, the Basques are the first in some--
they are the first in something, okay."
-
(laughs)
-
And then people ask,
"Okay, but what is this for?"
-
We don't have a real good answer.
-
I mean it's like, okay,
-
this will help computers
understand our language better, yes,
-
but what kind of tools
can we make in the future?
-
And we don't have a good answer for this.
-
So I don't know
if you have a good answer for this.
-
(chuckles) I don't know
if I have a good answer,
-
but I have an answer.
-
So I think right now
as I was telling [inaudible],
-
we haven't reached that critical mass
-
where you can build a lot
of the really interesting tools.
-
But there are already some tools.
-
Just the other day,
Esther [Pandelia], for example,
-
released a tool where you can see,
-
I think it was the words on a globe
-
where they're spoken,
where they're coming from.
-
I'm probably wrong about this,
-
but she had answered
on the Project chat on Wikidata--
-
you can look it up there.
-
So we have seen these first tools,
-
just like we've seen
back when Wikidata started.
-
First some--like just a network,
-
and like, "Hey, look, there's this thing
that connects to this other thing."
-
And as we have more data,
-
and as we've reached some critical mass,
-
more powerful applications
become possible,
-
things like Histropedia,
-
things like question answering
-
in your digital personal assistant,
Platypus, and so on.
-
And we're seeing
a similar thing with lexemes.
-
We're at the stage
where you can build like these little,
-
hey, look, there's a connection
between the two things,
-
and there's a translation
of this word into that language stage,
-
and as we build it out
and as we describe more words,
-
more becomes possible.
-
Now, what becomes possible?
-
As Ben, our keynote speaker,
was saying earlier about translations,
-
being able to translate
from one language to another.
-
And Jens, my colleague,
he's always talking about
-
the European Union
looking for a translator
-
who can translate from
I think it was Maltese to Swedish--
-
- (person 8) Estonian.
- Estonian.
-
And that is not a usual combination.
-
But once you have all these languages
in one machine-readable place,
-
you can do that,
-
you can get a dictionary
-
from Estonian to Maltese and back.
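-
The reason uncommon pairs become cheap is that each word only needs to point at a shared, language-neutral node; any pair of languages then falls out of a join. A toy sketch with invented data:

```python
# Each word points at a language-neutral concept; a bilingual dictionary
# for ANY language pair is then just a join on the concept. Toy data;
# real lexemes can of course have several words per concept and language.
word_to_concept = {
    ("et", "vesi"):  "water-concept",   # Estonian
    ("mt", "ilma"):  "water-concept",   # Maltese
    ("en", "water"): "water-concept",   # English
    ("et", "maja"):  "house-concept",   # no Maltese entry yet
}

def dictionary(source_lang, target_lang, mapping):
    """Derive source->target word pairs via the shared concept."""
    by_concept = {}
    for (lang, word), concept in mapping.items():
        by_concept.setdefault(concept, {})[lang] = word
    return {langs[source_lang]: langs[target_lang]
            for langs in by_concept.values()
            if source_lang in langs and target_lang in langs}

print(dictionary("et", "mt", word_to_concept))  # {'vesi': 'ilma'}
```

No one had to write an Estonian-Maltese dictionary by hand; it falls out of the shared structure, which is the point being made above.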
-
So covering language
combinations in dictionaries
-
that just haven't been covered before
-
because there wasn't
enough demand for it, for example,
-
to make it financially viable
and to justify the work.
-
Now we can do that.
-
Then text generation.
-
Lucie was earlier talking
-
about how she's working
with Hattie on generating text
-
to get Wikipedia articles
in minority languages started,
-
and that needs data about words,
-
and you need to understand
the language to do that.
-
Yeah, and those are just some
that come to my mind right now.
-
Maybe our audience has more ideas
-
what they want to do
when we have all the glorious data.
-
(person 9) Okay, I will deviate
from the lexemes topic.
-
I will ask the question,
-
how can I, as a member of the community,
-
influence that priority is put on the task
-
that a new user can come and indicate
what languages they want to see and edit
-
without some secret Babel
template knowledge.
-
Maybe this year there will be
this technical wishlist
-
without Wikipedia topics.
-
Maybe there's hope
we can all vote on
-
this thing we haven't fixed
for seven years.
-
So do you have any ideas
and comments about this?
-
So you're talking about the fact
-
that someone who is
not logged into Wikidata
-
can't change their language easily?
-
(person 9) No, for [inaudible] users.
-
So, if they are logged in,
-
they can just change their language
at the top of the page,
-
and then it will appear
-
where the labels and descriptions
[inaudible] are,
-
and they can edit it.
-
(person 9) Well, actually, usually,
many times the workflow
-
is that you want to have
multiple languages available,
-
and it's not always the case.
-
Okay, maybe we should sit down
after this talk and you show me.
-
Cool. More questions?
-
Yes.
-
(person 10) Thanks for the presentation.
-
Can you comment
-
on the state of the collaboration
with the Wiktionary community?
-
As far as I've seen,
there were some discussions
-
about importing some elements of the work,
-
but there seems to be licensing issues
and some disagreements, et cetera.
-
Right.
-
So, Wiktionary communities
have spent a lot of time
-
building Wiktionary.
-
They have built
-
amazingly complicated
and complex templates
-
to build pretty tables
that automatically generate forms for you
-
and all kinds of really impressive,
-
and kind of crazy stuff,
if you think about it.
-
And, of course, they have invested
a lot of time and effort into that.
-
And understandably,
-
they don't just want that to be grabbed,
-
just like that.
-
So there's some of that coming from there.
-
And that's fine, that's okay.
-
Now, the first Wiktionary communities
are talking about coming around
-
and importing some
of their data into Wikidata.
-
Russian, as you have seen,
for example, is one of those cases.
-
And I expect more of that to happen.
-
But it will be a slow process,
-
just like adoption
of Wikidata's data on Wikipedia
-
has been a rather slow process.
-
On the other side, we're working
on making it actually easier
-
to use the data that is in lexemes,
-
on Wiktionary, so that
they can make use of that
-
and share data between
the language Wiktionaries
-
which is super hard
to impossible right now,
-
which is crazy,
just like it was on Wikipedia.
-
Wait for the birthday present. (chuckles)
-
Yes.
-
(person 11) I was thinking about it
the other way around.
-
I actually didn't want to say it
because I think this will be super silly,
-
but I think that Wiktionary
already has some content,
-
and I know that
we can't transfer it to Wikidata
-
because there's a difference in licenses.
-
But I was thinking maybe
we can do something about that.
-
Maybe, I don't know, we can obtain
the communities' permission
-
after like, I don't know,
having like a public voting
-
and for the active members
of the community
-
to vote and say whether they would accept
transferring the content
-
so that it can be used
for Wikidata lexemes.
-
Because I just think it is such a waste.
-
So, that's definitely
a conversation those people
-
who are in Wiktionary communities
are very welcome to bring up there.
-
I think it would be a bit presumptuous
for us to go and force that.
-
But, yeah, I think it's definitely worth
having a conversation.
-
But I think it's also important
to understand
-
that there's a distinction between
what is actually legally allowed
-
and what we should be doing
-
and what those people want or do not want.
-
So even if it's legally allowed,
-
if some other Wiktionary communities
do not want that,
-
I would be careful, at least.
-
I think you need the mic
for the stream.
-
(person 12) So, obviously,
it's all very exciting,
-
and I immediately think
how can I take that to my students
-
and how can I incorporate it
with the courses,
-
the work that we're doing,
educational settings.
-
And I don't have, at the moment,
-
first of all, enough knowledge,
-
but I think the documentation
that we do have
-
could be maybe improved.
-
So that's a kind of request
to make cool videos
-
that explain how it works
-
because if we have it, we can then use it,
-
and we can have students on board,
-
and we can make people understand
how awesome it all is.
-
And yeah, just think about documentation
and think about education, please.
-
Because I think a lot could be done.
-
These are like many tasks
that could be done even with...
-
well, I wouldn't say primary schools,
-
but certainly, even younger students.
-
And so I would really like to see
that potential being tapped into,
-
and, as of now, I personally
don't understand enough
-
to be able to create tasks
or to create like...
-
to do something practical with it.
-
So any help, any thoughts
anyone here has about that,
-
I would be very happy to hear
your thoughts, and yours as well.
-
Yeah, let's talk about that.
-
More questions?
-
Someone else raised a hand.
-
I forgot where it was.
-
(person 13) So, if we can't import
from Wiktionary,
-
is there some concerted effort
to find other public domain sources,
-
maybe all the data,
-
and kind of prefilter it, organize it
-
so that it's easy to be checked
by people for import?
-
So there are first efforts.
-
My understanding is that Basque
is one of those efforts.
-
Maybe you want to say
a bit more about it?
-
(person 14) [inaudible]
-
Okay, the actual answer
is paying for that...
-
I mean, we have an agreement
with a contractor we usually work with.
-
They do dictionaries--
-
lots of stuff, but they do dictionaries.
-
So we agreed with them
to make the students' dictionary free;
-
we would [cast] the most common words
and start uploading them
-
with an external identifier
and that kind of thing.
-
But there was some discussion
about releasing it under CC0
-
because they have
the dictionary under CC BY,
-
and they understood
what the difference was.
-
So there was some discussion.
-
But I think that we can provide some tools
or some examples in the future,
-
and I think that there will be
other dictionaries
-
that we can handle,
-
and also I think Wiktionary
should start moving in that direction,
-
but that's another great discussion.
-
And on top of that,
-
Lea is also in contact
with people from the Occitan community
-
who work on Occitan dictionaries,
-
and they're currently working
on a Sumerian collaboration.
-
More questions?
-
(person 15) Hi! We are the people
who want to import Occitan data.
-
Aha! Perfect!
-
(person 15) And we have a small problem.
-
We don't know how to represent
the varieties of our lexemes.
-
We have six dialects,
-
and we want to indicate for each lexeme
in which dialect it's used,
-
and we don't have a suitable
property to state that.
-
So as long as that property doesn't exist,
-
it prevents us from [inaudible]
-
because we will need to do it again
-
when we are able
to [export] the statement.
-
And it's complicated
because it's a statement
-
which won't be asked by many people
-
because it's a statement
which concerns mostly minority languages.
-
So you will have only one person asking for this.
-
But as with our Basque colleagues,
-
it can be one person
who will empower thousands of others,
-
so it might not be asking a lot,
-
but it will be very important for us.
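-
To make the modeling need concrete: a minimal sketch of what such a dialect statement on a Lexeme form could look like in Wikibase-style JSON, assuming a dialect property eventually exists. The property id "P0000" and the item ids here are placeholders, not real Wikidata identifiers.

```python
# Sketch only: a Wikibase-style form fragment carrying a dialect claim.
# "P0000" and "Q000001" are placeholder ids, not real Wikidata entities.

def form_with_dialect(representation, lang_code, dialect_qid,
                      dialect_property="P0000"):
    """Build a Lexeme-form JSON fragment with a dialect statement."""
    return {
        "representations": {
            lang_code: {"language": lang_code, "value": representation},
        },
        "grammaticalFeatures": [],
        "claims": {
            dialect_property: [{
                "mainsnak": {
                    "snaktype": "value",
                    "property": dialect_property,
                    "datavalue": {
                        "value": {"entity-type": "item", "id": dialect_qid},
                        "type": "wikibase-entityid",
                    },
                },
                "type": "statement",
                "rank": "normal",
            }],
        },
    }

# An Occitan form tagged with a (placeholder) item for one of the six dialects.
form = form_with_dialect("cantar", "oc", "Q000001")
print(form["claims"]["P0000"][0]["mainsnak"]["datavalue"]["value"]["id"])
```

With one statement per form, queries could then filter forms by dialect item, which is exactly the "one property empowering thousands" situation described above.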
-
Do you already have
a new property proposal up,
-
or do you need help creating it?
-
(person 15) We asked four months ago.
-
Alright, then let's get some people
to help out with this property proposal.
-
I'm sure there are enough people
in this room to make this happen.
-
(person 15) Property proposal
[speaking in French].
-
(person 16) We didn't have an answer.
-
(person 15) We didn't have any answer,
and we don't know how to do this
-
because we aren't
in the Wikidata community.
-
Yup, so there are people here
who can help you.
-
Maybe someone raises their hand to take--
-
(person 14) I'm for that.
-
But I think this is quite interesting
-
that not only the variant of a form
-
but also the geographic side can be handled,
-
with coordinates or some kind of mapping.
-
Also having different pronunciations,
-
and I think this is something
that happens in lots of languages.
-
We should start making
it happen [inaudible],
-
and I'm going to search for the property.
-
Cool.
-
So you will get backing
for your property proposal.
-
Thank you.
-
Alright, more questions?
-
Finn.
-
Finn is one of those people
-
who builds stuff
on top of lexicographical data.
-
(Finn) It's just a small question,
-
and that's about spelling variations.
-
It seems to be difficult to put them in...
-
You could, of course,
have multiple forms for the same word.
-
I don't know, it seems to be...
-
If you don't do it that way,
it seems to be difficult to specify...
-
or I don't know whether
-
this is just a minor technical issue
or whether...
-
Let's look at it together.
-
I would love to see an example.
-
Asaf.
-
(Asaf) Thank you.
-
I can give a very concrete example
from my mother tongue, Hebrew.
-
Hebrew has two main variants
-
for expressing almost every word
-
because the traditional spelling
-
leaves out many of the vowels.
-
And, therefore, in modern editions
of the Bible and of poetry,
-
diacritics are used.
-
However, those diacritics
are never used for modern prose
-
or newspaper writing or street signs.
-
So the average daily casual use
puts in extra vowels
-
and doesn't use the diacritics
-
because they are,
of course, more cumbersome
-
and have all kinds of rules
and nobody knows the rules.
-
So there are basically two variants.
-
There's the everyday casual prose variant,
-
and there's the Bible or poetry,
-
which always comes
in this traditional diacriticized text.
-
To be useful,
-
Lexeme would have to recognize
both varieties of every single word
-
and every single form
of every single word.
-
So that's a very comprehensive use case
-
for official stable variants.
-
It's not dialect, it's not regions,
-
it's basically two coexisting
morphological systems.
-
And I too don't know exactly
how to express that in Lexeme today,
-
which is one thing that,
in partial answer to Magnus' question,
-
is keeping me from uploading
the parts that are ready
-
from the biggest Hebrew dictionary,
which is public domain
-
and which I have been digitizing
for several years now.
-
A good portion of it is ready,
-
but I'm not putting it on Lexeme right now
-
because I don't know exactly
how to solve this problem.
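-
One conceivable modeling, sketched here as an assumption rather than established practice: keep a single form per grammatical slot and give it one representation per spelling system, keyed by variant language codes. The codes "he-x-plene" and "he-x-defective" below are purely illustrative, not official Wikidata codes, and the values are placeholders rather than real Hebrew spellings.

```python
# Sketch only: one Lexeme form holding both Hebrew spelling systems as
# separate representations. The variant codes are hypothetical, and the
# string values are placeholders standing in for the two spellings.

def dual_spelling_form(plene, defective, features=None):
    """Return a form fragment with both spelling variants of one word form."""
    return {
        "representations": {
            # everyday spelling with extra vowel letters, no diacritics
            "he-x-plene": {"language": "he-x-plene", "value": plene},
            # traditional spelling with diacritics, as in Bible editions
            "he-x-defective": {"language": "he-x-defective", "value": defective},
        },
        "grammaticalFeatures": features or [],
    }

form = dual_spelling_form("word-with-vowel-letters", "word-with-diacritics")
print(sorted(form["representations"]))
```

The appeal of this shape is that both spellings stay attached to the same grammatical form instead of doubling every form, which matches the "two coexisting systems" description rather than treating them as separate dialects.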
-
Alright, let's solve
this problem here. (chuckles)
-
That has to be possible.
-
Alright, more questions?
-
If not, then thank you so much.
-
(applause)