WEBVTT
00:00:06.303 --> 00:00:07.362
(Lydia) Thank you so much.
00:00:07.362 --> 00:00:11.244
So, this conference,
one of the big themes is languages.
00:00:14.220 --> 00:00:18.508
I want to give you an overview
of where we actually are currently
00:00:18.508 --> 00:00:19.812
when it comes to languages
00:00:20.264 --> 00:00:22.167
and where we can go from here.
00:00:29.036 --> 00:00:32.580
Wikidata is all about giving more people
more access to more knowledge,
00:00:32.580 --> 00:00:37.168
and language is such an important part
of making that a reality,
00:00:38.205 --> 00:00:43.291
especially since more and more
of our lives depend on technology.
00:00:44.114 --> 00:00:48.873
And as our keynote speaker
was saying earlier today,
00:00:49.723 --> 00:00:51.588
some of the technology
leaves people behind
00:00:51.588 --> 00:00:55.020
simply because they can't speak
a certain language,
00:00:55.320 --> 00:00:57.573
and that's not okay.
00:00:58.633 --> 00:01:02.097
So we want to do something about that.
00:01:02.927 --> 00:01:05.841
And in order to change that,
you need at least two things.
00:01:06.411 --> 00:01:11.270
One is you need to provide content
to the people in their language,
00:01:11.270 --> 00:01:12.955
and the second thing you need
00:01:12.955 --> 00:01:15.910
is to provide them
with interaction in their language
00:01:15.910 --> 00:01:19.189
in those applications
or whatever it is you have.
00:01:20.367 --> 00:01:25.277
And Wikidata helps with both of those.
00:01:25.277 --> 00:01:28.408
And the first thing,
content in your language,
00:01:28.408 --> 00:01:30.879
that is basically what we have
in items and properties,
00:01:31.319 --> 00:01:33.082
how we describe the world.
00:01:33.082 --> 00:01:35.085
Now, this is certainly
not everything you need,
00:01:35.085 --> 00:01:39.294
but it gets you quite far ahead.
00:01:39.764 --> 00:01:41.847
The other thing
is interaction in your language,
00:01:41.847 --> 00:01:46.389
and that's where lexemes come into play,
00:01:46.389 --> 00:01:49.382
if you want to talk
to your digital personal assistant
00:01:49.382 --> 00:01:54.918
or if you want to have your device
translate a text and things like that.
00:01:56.404 --> 00:01:59.254
Alright, let's look into
content in your language.
00:01:59.254 --> 00:02:03.396
So what we have in items and properties.
00:02:05.406 --> 00:02:09.696
For this, the labels in those items
and properties are crucial.
00:02:10.236 --> 00:02:14.866
We need to know what this entity
is called that we're talking about.
00:02:15.656 --> 00:02:19.987
And instead of talking about Q5,
00:02:19.987 --> 00:02:22.180
someone who speaks English
knows that's a "human,"
00:02:22.180 --> 00:02:24.706
someone who speaks German
knows that's a "Mensch,"
00:02:24.706 --> 00:02:26.374
and similar things.
00:02:26.374 --> 00:02:29.742
So those labels on items and properties
00:02:29.742 --> 00:02:33.619
are bridging the gap
between humans and machines.
00:02:33.619 --> 00:02:35.439
And between humans and humans,
00:02:35.439 --> 00:02:40.115
making more existing knowledge
accessible to them.
00:02:43.270 --> 00:02:46.290
Now, that's a nice aspiration.
00:02:46.290 --> 00:02:48.342
What does it actually look like?
00:02:48.342 --> 00:02:49.607
It looks like this.
00:02:50.947 --> 00:02:52.416
What you're seeing here
00:02:52.416 --> 00:02:58.496
is that most of the items
on Wikidata have two labels,
00:02:58.496 --> 00:03:00.767
so labels in two languages.
00:03:01.697 --> 00:03:03.851
And after that, it's one, and then three,
00:03:03.851 --> 00:03:06.115
and then it becomes very sad.
00:03:06.781 --> 00:03:08.581
(quiet laughter)
00:03:10.047 --> 00:03:12.713
I think we need to do better than this.
00:03:14.185 --> 00:03:15.319
But, on the other hand,
00:03:15.319 --> 00:03:17.478
I was actually expecting this
to be even worse.
00:03:17.478 --> 00:03:19.560
I was expecting the average to be one.
00:03:19.560 --> 00:03:22.503
So I was quite happy
to see two. (chuckles)
00:03:24.921 --> 00:03:26.186
Alright.
00:03:27.156 --> 00:03:29.527
But it's not just interesting to know
00:03:29.527 --> 00:03:33.742
how many labels our items
and properties have.
00:03:33.742 --> 00:03:36.565
It's also interesting to see
in which languages.
00:03:38.045 --> 00:03:43.764
Here you see a graph of the languages
00:03:43.764 --> 00:03:46.838
that we have labels for on Items.
00:03:46.838 --> 00:03:50.669
So the biggest part there is Other.
00:03:51.229 --> 00:03:53.863
So I just took the top 100 languages
00:03:54.533 --> 00:03:58.902
and everything else is Other
to make this graph readable.
00:03:59.542 --> 00:04:02.142
And then there's English and Dutch,
00:04:03.002 --> 00:04:04.254
French,
00:04:05.924 --> 00:04:09.129
and not to forget, Asturian.
00:04:09.659 --> 00:04:11.889
- (person 1) Whoo!
- Whoo-hoo, yes!
00:04:13.899 --> 00:04:16.954
So what you see here is quite an imbalance
00:04:16.954 --> 00:04:20.114
and still quite a lot of focus on English.
00:04:21.236 --> 00:04:24.367
Another thing is if you look
at the same thing for Properties,
00:04:24.367 --> 00:04:25.999
it's actually looking better.
00:04:27.399 --> 00:04:32.750
And I think part of that is due to there
just being far fewer properties.
00:04:32.750 --> 00:04:36.770
So even smaller communities
have a chance to keep up with that.
00:04:36.770 --> 00:04:39.173
But it's also a pretty
important part of Wikidata
00:04:39.173 --> 00:04:41.159
to localize into your language.
00:04:41.159 --> 00:04:42.384
So that's good.
00:04:45.752 --> 00:04:47.842
What I want to highlight
here with Asturian
00:04:47.842 --> 00:04:53.698
is that a small community
can really make a huge difference
00:04:54.448 --> 00:04:57.085
with some dedication and work,
00:04:57.085 --> 00:04:58.420
and that's really cool.
00:05:01.846 --> 00:05:03.530
A small quiz for you.
00:05:03.530 --> 00:05:05.493
If you take all the properties on Wikidata
00:05:05.493 --> 00:05:07.687
that are not external identifiers,
00:05:07.687 --> 00:05:10.358
which one has the most labels,
like the most languages?
00:05:10.977 --> 00:05:13.847
(audience) [inaudible]
00:05:13.847 --> 00:05:16.786
I hear some agreement on "instance of"?
00:05:17.506 --> 00:05:19.443
You would be wrong.
00:05:19.983 --> 00:05:22.210
It's "image." (chuckles)
00:05:23.230 --> 00:05:26.366
So, yeah, that tells you,
if you speak one of the languages
00:05:26.366 --> 00:05:28.621
where "instance of"
doesn't yet have a label,
00:05:28.621 --> 00:05:30.190
you might want to add it.
00:05:32.102 --> 00:05:35.676
So it has 148 labels currently.
00:05:37.688 --> 00:05:41.249
But that's just another slide.
00:05:42.631 --> 00:05:44.162
This graph tells us something
00:05:44.162 --> 00:05:49.321
about how much content we are making
available in a certain language
00:05:49.321 --> 00:05:52.042
and how much of that content
is actually used.
00:05:52.042 --> 00:05:55.448
So what you're seeing is basically a curve
00:05:55.448 --> 00:06:00.987
with most content having English labels,
being available in English,
00:06:01.507 --> 00:06:04.295
and being used a lot.
00:06:04.295 --> 00:06:06.449
And then it kind of goes down.
00:06:06.449 --> 00:06:09.436
But, again, what you can see are outliers
00:06:09.436 --> 00:06:15.333
who have a lot more content
than you would necessarily expect,
00:06:16.903 --> 00:06:19.539
and that is really, really good.
00:06:20.839 --> 00:06:24.945
The problem still is it's not used a lot.
00:06:25.565 --> 00:06:28.742
Asturian and Dutch should be higher,
00:06:28.742 --> 00:06:31.994
and I think helping those communities
00:06:33.266 --> 00:06:35.563
increase the use
of the data they collected
00:06:35.563 --> 00:06:37.682
is a really useful thing to do.
00:06:42.910 --> 00:06:48.110
What this analysis and others
showed us, and this is also a good thing,
00:06:48.300 --> 00:06:51.378
is that we are seeing
that highly used items
00:06:51.378 --> 00:06:55.295
also tend to have more labels
00:06:55.295 --> 00:06:58.188
or the other way around--
it's not entirely clear.
00:07:02.513 --> 00:07:04.376
And then the question is,
00:07:04.806 --> 00:07:07.009
are we serving
just the powerful languages?
00:07:07.899 --> 00:07:11.147
Or are we serving everyone?
00:07:12.757 --> 00:07:17.743
And what you see here
is a grouping of languages.
00:07:17.743 --> 00:07:21.832
The languages that are grouped together
tend to have labels together.
00:07:26.042 --> 00:07:28.599
And you see it clustering.
00:07:28.599 --> 00:07:34.065
Now here's a similar clustering, colored,
00:07:34.065 --> 00:07:39.475
based on how alive, how used,
00:07:40.455 --> 00:07:43.156
how endangered the language is.
00:07:43.156 --> 00:07:44.642
And a good thing you're seeing here
00:07:44.642 --> 00:07:49.566
is that safe languages
and endangered languages
00:07:49.566 --> 00:07:53.773
do not form two different clusters.
00:07:53.773 --> 00:07:58.872
But they're all mixed together,
00:08:00.262 --> 00:08:04.625
which is much better than it would be
the other way around
00:08:04.625 --> 00:08:09.377
where the safe languages,
the powerful languages
00:08:10.197 --> 00:08:12.164
are just helping each other out.
00:08:12.744 --> 00:08:14.356
No, that's not the case.
00:08:14.356 --> 00:08:17.417
And it's a really good thing.
00:08:17.417 --> 00:08:20.042
When I saw this,
I thought this was very good.
00:08:23.474 --> 00:08:25.169
Here's a similar thing
00:08:26.239 --> 00:08:28.800
where we looked at
00:08:30.230 --> 00:08:34.222
each language's status
00:08:34.222 --> 00:08:36.225
and how many labels it has.
00:08:39.367 --> 00:08:42.937
What you're seeing
is a clear win for safe languages,
00:08:42.937 --> 00:08:44.248
as is expected.
00:08:45.508 --> 00:08:46.693
But what you're also seeing
00:08:46.693 --> 00:08:54.407
is that the languages in category 2
and 3 and maybe even 4
00:08:54.407 --> 00:08:59.280
are not that bad, actually,
00:08:59.280 --> 00:09:02.367
in terms of their representation
in Wikidata and others.
00:09:03.287 --> 00:09:06.408
It's a really good thing to find.
00:09:07.646 --> 00:09:09.129
Now, if you look at the same thing
00:09:09.129 --> 00:09:12.418
for how much of that content
of those labels
00:09:12.418 --> 00:09:15.495
is actually used
on Wikipedia, for example,
00:09:17.455 --> 00:09:22.563
then we see a similar
picture emerging again.
00:09:23.603 --> 00:09:29.813
And it tells us that those communities
are actually making good use of their time
00:09:29.813 --> 00:09:34.504
by filling in labels
for more highly used items, for example.
00:09:36.410 --> 00:09:40.493
There are outliers
where I think we can help,
00:09:41.683 --> 00:09:48.202
to help those communities find the places
where their work would be most valuable.
00:09:49.312 --> 00:09:52.663
But, overall, I'm happy with this picture.
00:09:54.823 --> 00:09:59.844
Now, that was the items
and properties part of Wikidata.
00:10:00.714 --> 00:10:03.033
Now, let's look at interaction
in your language,
00:10:03.033 --> 00:10:05.203
so the lexeme part of Wikidata,
00:10:05.203 --> 00:10:09.394
where we describe words
and their forms and their meanings.
00:10:10.167 --> 00:10:13.301
We've been doing this now
since May last year,
00:10:16.461 --> 00:10:19.127
and content has been growing.
00:10:20.114 --> 00:10:22.149
You can see here in blue the lexemes,
00:10:22.149 --> 00:10:25.938
and then in red,
the forms on those lexemes
00:10:25.938 --> 00:10:29.910
and yellow, the senses
on those lexemes.
00:10:30.991 --> 00:10:34.451
So some communities--
we'll get to that later--
00:10:34.451 --> 00:10:39.793
have spent a lot of time creating forms
and senses for their lexemes,
00:10:39.793 --> 00:10:42.753
which is really useful
00:10:42.753 --> 00:10:48.243
because that builds
the core of the data set that you need.
00:10:50.562 --> 00:10:55.133
Now, we looked at all the languages
00:10:55.133 --> 00:10:57.906
that have lexemes on Wikidata.
00:10:57.906 --> 00:11:01.003
So, for the words we have,
00:11:01.713 --> 00:11:04.404
that is 310 languages right now.
00:11:04.884 --> 00:11:08.290
Now, what do you think is the top language
00:11:08.290 --> 00:11:11.949
when it comes to the number
of lexemes currently in Wikidata?
00:11:12.933 --> 00:11:14.700
(audience) [inaudible]
00:11:19.183 --> 00:11:20.216
Huh?
00:11:20.216 --> 00:11:21.741
(person 2) German.
00:11:21.741 --> 00:11:24.252
Sorry, I've heard it before.
00:11:24.252 --> 00:11:25.651
It's Russian.
00:11:28.011 --> 00:11:29.754
Russian is quite ahead.
00:11:31.897 --> 00:11:33.832
And just to give you some perspective,
00:11:35.652 --> 00:11:36.816
there are different opinions,
00:11:36.816 --> 00:11:42.231
but I've read, for example,
that 1,000 to 3,000 words
00:11:42.231 --> 00:11:45.450
gets you to conversation level,
roughly, in another language,
00:11:45.450 --> 00:11:49.461
and 4,000 to 10,000 words
to an advanced level.
00:11:51.591 --> 00:11:55.282
So, we still have a bit to catch up there.
00:11:58.483 --> 00:12:03.279
One thing I want you
to pay attention to is Basque here
00:12:03.279 --> 00:12:07.744
with 10,000, roughly, lexemes.
00:12:09.244 --> 00:12:13.003
Now, if you look at the number
of forms for those lexemes,
00:12:14.163 --> 00:12:16.497
Basque is way up there,
00:12:18.257 --> 00:12:20.006
which is really cool,
00:12:20.006 --> 00:12:24.930
and you should go to a talk that explains
to you why that is the case.
00:12:27.341 --> 00:12:31.175
Now, if you look at the number
of senses, so what do words mean,
00:12:32.015 --> 00:12:35.081
Basque even gets to the top of the list.
00:12:35.081 --> 00:12:37.102
I think that deserves a round of applause.
00:12:37.102 --> 00:12:38.921
(applause)
00:12:45.678 --> 00:12:47.118
Another short quiz.
00:12:47.118 --> 00:12:50.181
What's the lexeme
with the most translations currently?
00:12:50.651 --> 00:12:55.414
(audience) Cats, cats, [inaudible],
Douglas Adams, [inaudible]
00:12:56.766 --> 00:13:00.014
All good guesses, but no.
00:13:01.012 --> 00:13:04.137
It's this, the Russian word for "water."
00:13:09.571 --> 00:13:12.253
Alright, so now we talked a lot
00:13:12.253 --> 00:13:16.412
about how many lexemes,
forms, and senses we have,
00:13:16.412 --> 00:13:20.493
but that's just one thing you need.
00:13:20.493 --> 00:13:21.515
The other thing you need
00:13:21.515 --> 00:13:25.161
is actually describing those lexemes,
forms, and senses
00:13:25.161 --> 00:13:27.647
in a machine-readable way.
00:13:27.647 --> 00:13:30.039
And for that you have statements,
like on items.
00:13:31.479 --> 00:13:36.362
And one of the properties
you use is usage example.
00:13:36.362 --> 00:13:38.582
So whoever is using that data
00:13:38.582 --> 00:13:42.089
can understand how to use
that word in context,
00:13:42.089 --> 00:13:44.158
so that could be a quote, for example.
00:13:45.396 --> 00:13:47.113
And here, Polish rocks.
00:13:47.900 --> 00:13:49.764
Good job, Polish speakers.
00:13:54.219 --> 00:13:57.680
Another property
that's really useful is IPA,
00:13:57.680 --> 00:14:00.186
so how do you pronounce this word.
00:14:00.876 --> 00:14:07.497
Russian apparently needs
lots of IPA statements.
00:14:10.419 --> 00:14:13.314
But, again, Polish, second.
00:14:17.148 --> 00:14:20.753
And last but not least
we have pronunciation audio.
00:14:20.753 --> 00:14:23.372
So that is links to files on Commons
00:14:23.372 --> 00:14:25.959
where someone speaks the word,
00:14:25.959 --> 00:14:29.913
so you can hear a native speaker
pronounce the word
00:14:29.913 --> 00:14:32.871
in case you can't read IPA, for example.
00:14:34.959 --> 00:14:39.205
And there's actually a really nice
Wikibase-powered project
00:14:39.205 --> 00:14:40.474
called Lingua Libre
00:14:40.884 --> 00:14:45.173
where you can go and help record
words in your language
00:14:45.173 --> 00:14:47.836
that then can be added
to lexemes on Wikidata,
00:14:48.446 --> 00:14:52.103
so other people can understand
how to pronounce your words.
00:14:53.663 --> 00:14:55.694
(person 2) [inaudible]
00:14:55.694 --> 00:14:57.665
If you search for "Lingua Libre,"
00:14:57.665 --> 00:15:00.981
and I'm sure someone can post it
in the Telegram channel.
00:15:03.138 --> 00:15:04.621
Those guys rock.
00:15:04.621 --> 00:15:06.726
They did really cool stuff with Wikibase.
00:15:09.416 --> 00:15:10.617
Alright.
00:15:12.706 --> 00:15:17.285
Then the question is,
where do we go from here?
00:15:19.165 --> 00:15:22.010
Based on the numbers I've just shown you,
00:15:23.030 --> 00:15:25.172
we've come a long way
00:15:25.172 --> 00:15:28.430
towards giving more people
more access to more knowledge
00:15:28.430 --> 00:15:31.240
when looking at languages on Wikidata.
00:15:32.530 --> 00:15:36.392
But there is also still
a lot of work ahead of us.
00:15:38.992 --> 00:15:42.341
One of the things
you can do to help, for example,
00:15:42.341 --> 00:15:44.921
is to run label-a-thons,
00:15:44.921 --> 00:15:50.124
where you get people together
to label items in Wikidata,
00:15:50.914 --> 00:15:55.121
or do an edit-a-thon
around lexemes in your language
00:15:55.121 --> 00:15:59.212
to get the most used words
in your language into Wikidata.
00:16:00.773 --> 00:16:03.285
Or you can use a tool like Terminator
00:16:03.285 --> 00:16:08.493
that helps you find the most
important items in your language
00:16:08.493 --> 00:16:11.549
that are still missing a label.
00:16:13.274 --> 00:16:18.359
"Most important" here is measured
by how often an item is used
00:16:18.359 --> 00:16:22.553
in other Wikidata items
as links in statements.
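As a rough illustration of that ranking idea (not Terminator's actual code, which the talk does not show), the sketch below counts how often each item is the target of a statement in other items and sorts the items that lack a label in a given language by that count. The item IDs, the `links` field, and the function name `rank_unlabeled` are made up for the example.

```python
from collections import Counter


def rank_unlabeled(items, language):
    """Return IDs of items lacking a label in `language`, most-linked first."""
    # Count how often each item appears as the target of a statement
    # (the "links" field is a hypothetical simplification).
    link_counts = Counter(
        target for item in items.values() for target in item["links"]
    )
    missing = [qid for qid, item in items.items() if language not in item["labels"]]
    # Most frequently linked items come first.
    return sorted(missing, key=lambda qid: link_counts[qid], reverse=True)


# Toy data: made-up IDs, not real Wikidata items.
toy_items = {
    "T4": {"labels": {"en": "d"}, "links": ["T1", "T3"]},
    "T2": {"labels": {"en": "b", "ast": "b"}, "links": ["T1", "T3"]},
    "T3": {"labels": {"en": "c"}, "links": ["T1", "T4"]},
    "T1": {"labels": {"en": "a"}, "links": []},
}

ranking = rank_unlabeled(toy_items, "ast")
print(ranking)
```

A community could work through such a list from the top, so that the labels added first are the ones that show up in the most statements.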
00:16:25.768 --> 00:16:30.022
And, of course, for the lexeme part,
00:16:31.342 --> 00:16:35.169
now that we've got
a basic coverage of those lexemes,
00:16:35.169 --> 00:16:41.163
it's also about building them out,
adding more statements to them
00:16:41.163 --> 00:16:44.401
so that they can actually form the base
00:16:44.401 --> 00:16:47.421
for meaningful applications
to be built on top of them.
00:16:48.141 --> 00:16:50.795
Because we're getting closer
to that critical mass,
00:16:50.795 --> 00:16:53.616
but we're still some way from the point
00:16:53.616 --> 00:16:56.624
where you can build
serious applications on top of it.
00:16:58.277 --> 00:17:01.680
And I hope all of you
will join us in doing that.
00:17:02.583 --> 00:17:07.103
And that already brings me
00:17:07.103 --> 00:17:09.843
to a little help from our friends,
00:17:09.843 --> 00:17:12.812
and Bruno, do you want to come over
00:17:13.882 --> 00:17:16.854
and talk to us about lexical masks?
00:17:17.541 --> 00:17:18.567
(Bruno) Thank you, Lydia,
00:17:18.567 --> 00:17:21.519
thank you for giving me
this short period of time
00:17:21.519 --> 00:17:24.150
to present this work
that we are doing at Google
00:17:24.150 --> 00:17:29.635
with Denny, whom most of you
have probably heard of or know.
00:17:30.126 --> 00:17:32.030
At Google, I'm a linguist,
00:17:32.030 --> 00:17:36.150
so I'm very happy to be here
amongst other language enthusiasts.
00:17:36.620 --> 00:17:39.278
We are also building some lexicons,
00:17:39.278 --> 00:17:41.766
and we have built this technology
00:17:41.766 --> 00:17:45.589
or this approach that we think
can be useful for you.
00:17:46.369 --> 00:17:48.455
Just to give you
a little bit of background,
00:17:48.455 --> 00:17:52.068
this is my lexicographic
background talking here.
00:17:52.788 --> 00:17:54.347
When we build lexicon databases,
00:17:54.347 --> 00:17:58.623
it is very hard to maintain them,
to keep them consistent,
00:17:58.623 --> 00:18:00.125
and to exchange data,
00:18:00.125 --> 00:18:02.027
as you probably know.
00:18:02.517 --> 00:18:05.927
There are several attempts
to unify the features and the properties
00:18:05.927 --> 00:18:09.184
that are describing
those lexemes and those forms,
00:18:09.184 --> 00:18:10.936
and it's not a solved problem,
00:18:10.936 --> 00:18:13.958
but there are some
unification attempts on that side.
00:18:13.958 --> 00:18:15.209
But what is really missing--
00:18:15.209 --> 00:18:18.732
and this is a problem we had
at the beginning of our project at Google--
00:18:18.732 --> 00:18:21.607
is an internal structure
00:18:22.197 --> 00:18:25.910
that describes what
a lexical entry should look like,
00:18:25.910 --> 00:18:28.581
what kind of data
or what kind of information we have
00:18:28.581 --> 00:18:32.237
and the specifications that are expected.
00:18:32.237 --> 00:18:38.187
So, what we came up with
is this thing called a lexicon mask.
00:18:38.897 --> 00:18:44.841
A lexicon mask describes
what is expected for an entry,
00:18:44.841 --> 00:18:47.329
a lexicographic entry, to be complete,
00:18:47.329 --> 00:18:51.436
both in terms of the number of forms
you expect for a lexeme,
00:18:51.436 --> 00:18:55.607
and the number of features
you expect for each of those forms.
00:18:56.397 --> 00:18:58.329
Here is an example for Italian adjectives.
00:18:58.329 --> 00:19:02.002
You expect, in Italian, to have
four forms for your adjectives,
00:19:02.002 --> 00:19:05.383
and each of these forms
has a specific combination
00:19:05.383 --> 00:19:07.946
of gender and number features.
00:19:08.606 --> 00:19:12.672
This is what we expect
for the Italian adjectives.
00:19:12.672 --> 00:19:16.176
Of course, you can have
extremely complex masks,
00:19:16.176 --> 00:19:20.783
like the French verb conjugation,
which is quite extensive,
00:19:20.783 --> 00:19:23.487
and I don't show you
any other Russian mask
00:19:23.487 --> 00:19:25.378
because it doesn't fit the screen.
00:19:26.308 --> 00:19:29.531
And we also have
some detailed specifications
00:19:29.531 --> 00:19:33.421
because we distinguish
what is at the form level.
00:19:33.421 --> 00:19:37.544
So here you have Russian nouns
that have three numbers
00:19:37.544 --> 00:19:40.048
and a number of cases
with different forms,
00:19:40.048 --> 00:19:43.086
but they also have
an entry level specification
00:19:43.086 --> 00:19:45.590
that says a particular noun has
00:19:45.590 --> 00:19:50.133
an inherent gender
and an inherent animacy feature
00:19:50.133 --> 00:19:52.488
that is also specified in the mask.
00:19:54.518 --> 00:19:58.779
We also want to distinguish
that a mask gives a specification
00:19:58.779 --> 00:20:01.874
for, in general,
what an entry should look like.
00:20:01.874 --> 00:20:07.158
But you can have smaller masks
for defective aspects of the form
00:20:07.158 --> 00:20:11.282
or defective aspects of the lexeme
that happen in language.
00:20:11.282 --> 00:20:14.537
So here is the simplest version
of French verbs
00:20:14.537 --> 00:20:19.729
that have only the 3rd person singular
for all the weather verbs,
00:20:19.729 --> 00:20:23.969
like "it rains" or "it snows,"
like in English.
00:20:24.537 --> 00:20:26.493
So we distinguish these two levels.
00:20:26.923 --> 00:20:29.962
And how we use this at Google
00:20:29.962 --> 00:20:32.643
is that when we have a lexicon
that we want to use,
00:20:33.063 --> 00:20:38.309
we use the mask to quite
literally throw the lexicon,
00:20:38.309 --> 00:20:40.163
all the entries, through the mask
00:20:40.163 --> 00:20:44.303
and see which entry has a problem
in terms of structure.
00:20:44.303 --> 00:20:46.523
Are we missing a form?
Are we missing a feature?
00:20:46.523 --> 00:20:51.497
And when there is a problem,
we do some human validation
00:20:51.497 --> 00:20:53.751
or just check whether it passes the mask.
00:20:53.751 --> 00:20:57.924
So it's an extremely powerful tool
to check the quality of the structure.
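The check described here can be sketched in a few lines. This is an illustration only, not Google's implementation or the ShEx schema: the mask representation, the feature names, and `check_entry` are assumptions made for the example, using the Italian adjective mask from earlier in the talk (four forms, one per gender/number combination).

```python
from itertools import product

# Assumed mask for Italian adjectives: one form per gender/number pair.
ITALIAN_ADJECTIVE_MASK = {
    frozenset(pair)
    for pair in product(("masculine", "feminine"), ("singular", "plural"))
}


def check_entry(forms, mask):
    """Compare an entry's forms against a mask.

    `forms` is a list of (written form, set of features) pairs.
    Returns (missing, unexpected) sets of feature combinations.
    """
    found = {frozenset(features) for _, features in forms}
    return mask - found, found - mask


complete = [
    ("rosso", {"masculine", "singular"}),
    ("rossa", {"feminine", "singular"}),
    ("rossi", {"masculine", "plural"}),
    ("rosse", {"feminine", "plural"}),
]
incomplete = complete[:3]  # the feminine plural form is missing

for name, entry in (("complete", complete), ("incomplete", incomplete)):
    missing, unexpected = check_entry(entry, ITALIAN_ADJECTIVE_MASK)
    status = "passes" if not missing and not unexpected else "needs review"
    print(name, status, [sorted(m) for m in missing])
```

Entries reported as needing review are the ones that would go on to human validation; a smaller mask for defective lexemes could be passed as `mask` in the same way.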
00:20:59.427 --> 00:21:01.964
So what we are happy to announce today
00:21:01.964 --> 00:21:05.408
is that we got the green light
to open-source our masks.
00:21:05.948 --> 00:21:07.573
So this is a schema, if you will,
00:21:07.573 --> 00:21:09.477
that we can release
00:21:09.477 --> 00:21:13.483
and that we will provide
to Wikidata as ShEx files.
00:21:13.483 --> 00:21:16.688
This is a ShEx file for German nouns,
00:21:16.688 --> 00:21:20.428
and Denny is working on the conversion
from our internal specification
00:21:20.428 --> 00:21:23.666
to a more open-source specification.
00:21:23.666 --> 00:21:27.522
We currently cover more than 25 languages.
00:21:27.522 --> 00:21:29.225
So we expect to grow on our side,
00:21:29.225 --> 00:21:34.350
but we also look for this opportunity
to collaborate for other languages.
00:21:34.350 --> 00:21:40.728
And one of the ongoing collaborations
is one that Denny has with Lukas.
00:21:40.728 --> 00:21:45.052
Lukas has these great tools with a UI
00:21:45.052 --> 00:21:51.061
to help the user or the contributor
to add more forms.
00:21:51.061 --> 00:21:54.151
So if you want to add
an adjective in French,
00:21:54.151 --> 00:21:59.057
the UI tells you
how many forms are expected
00:21:59.057 --> 00:22:01.562
and what kind of features
each form should have.
00:22:01.562 --> 00:22:06.268
So our masks will help that tool
be defined and expanded.
00:22:07.238 --> 00:22:08.385
That's it.
00:22:08.791 --> 00:22:10.358
(Lydia) Thank you so much.
00:22:10.358 --> 00:22:11.993
(applause)
00:22:14.249 --> 00:22:16.891
Alright. Are there questions?
00:22:16.891 --> 00:22:19.381
Do you want to talk more about lexemes?
00:22:19.817 --> 00:22:21.475
- (person 3) Yes.
- Yes. (chuckles)
00:22:33.485 --> 00:22:35.380
(person 3) My question,
because you were talking
00:22:35.380 --> 00:22:39.106
about giving more access
to more people in more languages.
00:22:39.106 --> 00:22:42.444
But there are a lot of languages
that can't be used in Wikidata.
00:22:42.444 --> 00:22:44.588
So what solution do you have for that?
00:22:45.889 --> 00:22:47.686
When you say they can't be used in Wikidata,
00:22:47.686 --> 00:22:50.308
are you talking about entering labels?
00:22:50.308 --> 00:22:52.578
- (person 3) Labels, descriptions.
- Right.
00:22:52.578 --> 00:22:55.498
So, for lexemes, it's a bit different
00:22:55.498 --> 00:22:57.793
because there we don't have
that restriction.
00:22:58.923 --> 00:23:05.003
For labels on items and properties,
there is some restriction
00:23:05.433 --> 00:23:12.411
because we wanted to make sure
that it doesn't turn into
00:23:12.411 --> 00:23:14.229
anyone doing anything
00:23:14.229 --> 00:23:17.769
and becoming unmanageable.
00:23:19.349 --> 00:23:23.328
Even if you're a small community that wants
one language and wants to work on it,
00:23:23.898 --> 00:23:26.787
come talk to us, we will make it happen.
00:23:26.787 --> 00:23:29.202
(person 3) I mean, we did this
at the Prague Hackathon in May,
00:23:29.202 --> 00:23:32.459
and it took us until almost August
in order to be able to use our language.
00:23:32.459 --> 00:23:35.135
- Yeah.
- (person 3) So, it's very slow.
00:23:35.135 --> 00:23:37.854
Yeah, it is, unfortunately, very slow.
00:23:37.854 --> 00:23:39.883
We're currently working
with the Language Committee
00:23:39.883 --> 00:23:46.048
on solving some fundamental...
00:23:49.537 --> 00:23:55.447
Like, getting agreement on what kind
of languages are actually "allowed,"
00:23:56.047 --> 00:23:59.398
and that has taken too long,
00:23:59.988 --> 00:24:04.178
which is the reason why your request
probably took longer than it should have.
00:24:04.778 --> 00:24:05.963
(person 3) Thanks.
00:24:06.815 --> 00:24:07.950
(person 4) Thank you.
00:24:07.950 --> 00:24:10.938
Lydia, if you remember
the statistics that you showed,
00:24:10.938 --> 00:24:12.886
the number of lexemes per language.
00:24:12.886 --> 00:24:17.599
So, did you count
all the forms as a data point
00:24:17.599 --> 00:24:20.034
or only lexemes?
00:24:21.289 --> 00:24:22.941
(Lydia) Do you mean this?
00:24:22.941 --> 00:24:24.053
Which one do you mean?
00:24:24.053 --> 00:24:25.529
(person 4) Yes, exactly.
00:24:25.797 --> 00:24:28.341
If you remember,
does this number [inaudible]
00:24:28.341 --> 00:24:31.954
all the forms for all the lexemes
or just how many lexemes there are?
00:24:31.954 --> 00:24:33.585
No, this is just the number of lexemes.
00:24:33.585 --> 00:24:35.395
(person 4) Just the number of lexemes, okay.
00:24:35.395 --> 00:24:36.797
So then it is a fair statistic,
00:24:36.797 --> 00:24:39.390
because if it also
included the forms--
00:24:39.390 --> 00:24:40.614
that's why I'm asking--
00:24:40.614 --> 00:24:42.817
then all the languages
with inflectional morphology,
00:24:42.817 --> 00:24:45.027
like Russian, Serbian,
Slovenian, et cetera,
00:24:45.027 --> 00:24:47.616
would have a natural advantage
because they have so many.
00:24:47.616 --> 00:24:51.990
So, this kind of kicks in here
on this number of forms.
00:24:51.990 --> 00:24:53.851
(person 4) Yeah, that was this one.
Thank you.
00:24:56.546 --> 00:25:00.224
(person 5) So, I had
a quick question about the...
00:25:00.644 --> 00:25:06.824
When we're talking about
the actual items and properties.
00:25:07.124 --> 00:25:08.901
Like as far as I understand,
00:25:08.901 --> 00:25:11.955
there is currently no way
to give an actual source
00:25:11.955 --> 00:25:14.726
to any of the labels
and descriptions that are given.
00:25:14.726 --> 00:25:18.047
So, for example,
when you're talking
00:25:18.047 --> 00:25:20.920
about an item or property,
00:25:20.920 --> 00:25:24.509
you can get conflicting labels.
00:25:24.509 --> 00:25:25.739
Yes.
00:25:25.739 --> 00:25:27.662
(person 5) So this person is like...
00:25:28.402 --> 00:25:30.781
We were talking about
indigenous things before, for example.
00:25:30.781 --> 00:25:35.965
So this person is a Norwegian artist
according to this source,
00:25:35.965 --> 00:25:38.750
and a Sami artist,
according to this source.
00:25:39.550 --> 00:25:42.883
Or, for example, in Estonian,
we had an issue
00:25:42.883 --> 00:25:47.729
where we had to change terminology
to the officially used terminology
00:25:47.729 --> 00:25:49.482
in official lexicons,
00:25:49.482 --> 00:25:52.262
but we have no way to indicate really why,
00:25:52.262 --> 00:25:53.596
like what was the source of this
00:25:53.596 --> 00:25:55.561
and why this was better
and what was there before.
00:25:55.561 --> 00:25:57.150
It was just me, a random person,
00:25:57.150 --> 00:25:59.615
switching the thing,
as far as anyone who sees it can tell.
00:25:59.615 --> 00:26:02.520
So is there a plan
to make this possible in any way
00:26:02.520 --> 00:26:06.355
so that we can actually have
proper sources for the language data?
00:26:07.045 --> 00:26:11.568
So, it is partially possible.
00:26:11.568 --> 00:26:15.958
So, for example, when you have
an item for a person,
00:26:16.968 --> 00:26:22.720
you have statements for the first name,
last name, and so on, of that person,
00:26:22.720 --> 00:26:26.226
and then you can provide
the reference for that there.
00:26:28.211 --> 00:26:32.544
I'm quite hesitant to add more complexity
00:26:32.544 --> 00:26:35.557
for references on labels and descriptions,
00:26:35.557 --> 00:26:38.624
but if people really, really think
00:26:38.624 --> 00:26:44.939
this is something that isn't covered
by any reference on the statement,
00:26:44.939 --> 00:26:46.803
then let's talk about it.
00:26:49.079 --> 00:26:53.303
But I fear it will add a lot of complexity
00:26:53.303 --> 00:26:56.523
for what I hope are few cases,
00:26:57.393 --> 00:27:00.188
but I'm willing to be convinced otherwise
00:27:00.188 --> 00:27:04.087
if people really feel
very strongly about this.
00:27:04.087 --> 00:27:08.177
(person 5) I mean, if it's added,
it probably shouldn't be the default,
00:27:08.177 --> 00:27:12.452
shown to all users in the beginner
interface, in any case.
00:27:12.452 --> 00:27:16.190
More like, "Click here if you need to say
a specific thing about this."
00:27:17.632 --> 00:27:23.368
Do we have a sense of how many times
that would actually matter?
00:27:24.520 --> 00:27:26.423
(person 5) In Estonian, for example--
00:27:26.423 --> 00:27:28.844
I expect this is true
of other languages as well--
00:27:29.274 --> 00:27:34.203
for example, there is an official name
that is the actual legitimate translation,
00:27:34.203 --> 00:27:36.206
for example, into English,
00:27:36.206 --> 00:27:40.314
of, say, a specific kind of municipality.
00:27:40.614 --> 00:27:42.182
That was my use case, for example,
00:27:42.182 --> 00:27:44.409
where we were using the word "parish"
00:27:45.159 --> 00:27:50.885
where the original Estonian word
meant something like church parish,
00:27:50.885 --> 00:27:51.899
and that was the origin,
00:27:51.899 --> 00:27:54.809
but that's not the official translation
Estonia gets right now.
00:27:55.189 --> 00:27:58.993
In this case, I would just add it
as official name statements
00:27:58.993 --> 00:28:00.817
and add the reference there.
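A minimal sketch of what that suggestion could look like as data: one statement using official name (P1448) that carries a reference URL (P854). The dictionary layout is deliberately simplified and is not the exact Wikibase JSON; the name text, the URL, and the helper function names are placeholders invented for illustration.

```python
# Simplified, illustrative shape of an "official name" statement with a reference.
# P1448 = official name, P854 = reference URL.
# The name text and URL used below are placeholders, not real data.

def official_name_statement(text, language, reference_url):
    """Build a simplified statement dict for an official name with one reference."""
    return {
        "property": "P1448",
        "value": {"text": text, "language": language},
        "references": [{"P854": reference_url}],
    }


def has_reference(statement):
    """Return True if the statement carries at least one reference."""
    return len(statement.get("references", [])) > 0


stmt = official_name_statement(
    "Example Municipality", "en", "https://example.org/official-names"
)
print(has_reference(stmt))
```

The point of the design is that the source hangs on the statement, so the label itself needs no reference machinery of its own.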
00:28:02.032 --> 00:28:03.158
(person 5) Okay.
00:28:05.186 --> 00:28:06.572
More questions, yes?
00:28:07.682 --> 00:28:10.044
(person 6) I have two quick comments.
00:28:10.044 --> 00:28:13.934
You specifically called out Asturian
as a language that does well,
00:28:13.934 --> 00:28:16.455
and I think that's a false artifact.
00:28:16.455 --> 00:28:17.724
Tell me about it.
00:28:17.724 --> 00:28:19.748
(person 6) I think it's just a bot
00:28:19.748 --> 00:28:24.068
that pasted person names,
like proper names,
00:28:24.068 --> 00:28:27.172
and said, "Well, this is exactly
like in French or Spanish,"
00:28:27.172 --> 00:28:28.558
and just massively copied it.
00:28:28.558 --> 00:28:33.316
One point of evidence is that
you don't see that energy in Asturian
00:28:33.316 --> 00:28:37.205
in things that actually
require translation, like property names,
00:28:37.205 --> 00:28:39.648
or names of items
that are not proper names.
00:28:39.648 --> 00:28:41.219
Asaf, you break my heart.
00:28:41.219 --> 00:28:43.198
(person 6) I know,
I like raining on parades,
00:28:43.198 --> 00:28:48.458
but I have good news as well,
which is about the pronunciation numbers.
00:28:49.408 --> 00:28:53.515
As you probably know,
Commons is full of pronunciation files,
00:28:53.515 --> 00:28:54.668
and, for example,
00:28:54.668 --> 00:29:01.102
Dutch has no less than 300,000
pronunciation files already on Commons
00:29:01.912 --> 00:29:05.051
that just need to somehow be ingested.
00:29:05.051 --> 00:29:07.697
So if anyone's looking for a side project,
00:29:07.697 --> 00:29:08.997
there's tons and tons
00:29:08.997 --> 00:29:13.280
of classified, categorized
pronunciation files on Commons
00:29:13.280 --> 00:29:16.893
under the category
"Pronunciation" by language.
00:29:16.893 --> 00:29:22.840
So that's just waiting to be matched
to lexemes and put on Lexeme.
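A rough sketch of that matching step, assuming file names follow a `Nl-<word>.ogg` pattern and that lemmas are already available as a lemma-to-lexeme-ID mapping. The file names and lexeme IDs below are made up, and real naming on Commons varies and would need checking. On Wikidata, the link itself would go in as pronunciation audio (P443).

```python
import re

# Assumed naming pattern for Dutch pronunciation files: "Nl-<word>.ogg".
FILENAME_PATTERN = re.compile(r"^Nl-(?P<word>.+)\.ogg$", re.IGNORECASE)


def match_files_to_lexemes(filenames, lemma_to_lexeme):
    """Split file names into those that match a known lemma and those that do not."""
    matches = {}      # lexeme ID -> list of file names
    unmatched = []    # file names left for human review
    for name in filenames:
        m = FILENAME_PATTERN.match(name)
        word = m.group("word").lower() if m else None
        lexeme_id = lemma_to_lexeme.get(word) if word else None
        if lexeme_id:
            matches.setdefault(lexeme_id, []).append(name)
        else:
            unmatched.append(name)
    return matches, unmatched


# Placeholder data: hypothetical file names and lexeme IDs.
files = ["Nl-huis.ogg", "Nl-fiets.ogg", "Nl-onbekend.ogg", "notes.txt"]
lemmas = {"huis": "L-EXAMPLE-1", "fiets": "L-EXAMPLE-2"}

matches, unmatched = match_files_to_lexemes(files, lemmas)
print(matches)
print(unmatched)
```

Anything that ends up in `unmatched` is left for people to review rather than guessed at, which also covers homographs that a plain string match cannot tell apart.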
00:29:23.180 --> 00:29:25.484
And I was wondering
if you could say something
00:29:25.484 --> 00:29:26.585
about the road map,
00:29:26.585 --> 00:29:28.757
something about how much investment
00:29:28.757 --> 00:29:31.995
or what can we expect
from Lexeme in the coming year,
00:29:31.995 --> 00:29:34.020
because I, for one, can't wait.
00:29:34.949 --> 00:29:37.044
You can't wait? (chuckles)
00:29:37.044 --> 00:29:39.118
- (person 6) For more.
- Yes. (chuckles)
00:29:44.541 --> 00:29:49.523
Right now, we're concentrating
more on Wikibase and data quality
00:29:51.493 --> 00:29:55.087
to see how much traction this gets
00:29:55.087 --> 00:30:01.676
and then getting more feedback on
where the pain points are next,
00:30:01.676 --> 00:30:06.003
and then going back to improving
lexicographical data further.
00:30:06.903 --> 00:30:09.790
And one of the things
I'd love to hear from you
00:30:09.790 --> 00:30:14.136
is where exactly do you see
the next steps,
00:30:14.136 --> 00:30:15.966
where do you want to see improvements
00:30:15.966 --> 00:30:20.340
so that we can then figure out
how to make that happen.
00:30:21.125 --> 00:30:22.810
But, of course, you're right,
00:30:22.810 --> 00:30:25.712
there's still so much to do
also on the technical side.
00:30:30.573 --> 00:30:35.848
(person 7) Okay, as we were uploading
the Basque words with forms,
00:30:35.848 --> 00:30:37.768
and you'll see some
of these kinds of things,
00:30:37.768 --> 00:30:41.329
we were both like, last week we said,
"Oh, we are the first one in something."
00:30:42.919 --> 00:30:44.928
It appears in press, and it's like,
00:30:44.928 --> 00:30:49.488
"Oh, Basque are the first time in some--
they are the first in something, okay."
00:30:49.488 --> 00:30:50.606
(laughs)
00:30:50.606 --> 00:30:53.318
And then people ask,
"Okay, but what is this for?"
00:30:54.678 --> 00:30:56.849
We don't have a real good answer.
00:30:56.849 --> 00:30:57.888
I mean it's like, okay,
00:30:57.888 --> 00:31:01.841
this will help computers
to understand more our language, yes,
00:31:01.841 --> 00:31:05.279
but what kind of tools
can we make in the future?
00:31:05.279 --> 00:31:07.467
And we don't have a good answer for this.
00:31:07.467 --> 00:31:10.625
So I don't know
if you have a good answer for this.
00:31:10.625 --> 00:31:12.742
(chuckles) I don't know
if I have a good answer,
00:31:12.742 --> 00:31:14.746
but I have an answer.
00:31:15.480 --> 00:31:20.425
So I think right now
as I was telling [inaudible],
00:31:20.425 --> 00:31:21.924
we haven't reached that critical mass
00:31:21.924 --> 00:31:25.529
where you can build a lot
of the really interesting tools.
00:31:25.529 --> 00:31:27.707
But there are already some tools.
00:31:28.267 --> 00:31:31.912
Just the other day,
Esther [Pandelia], for example,
00:31:31.912 --> 00:31:33.817
released a tool where you can see,
00:31:35.837 --> 00:31:38.889
I think it was the words on a globe
00:31:38.889 --> 00:31:41.901
where they're spoken,
where they're coming from.
00:31:42.631 --> 00:31:44.090
I'm probably wrong about this,
00:31:44.090 --> 00:31:46.346
but she had answered
on the Project chat on Wikidata--
00:31:46.346 --> 00:31:48.984
you can look it up there.
00:31:49.574 --> 00:31:51.805
So we have seen these first tools,
00:31:51.805 --> 00:31:55.696
just like we've seen
back when Wikidata started.
00:31:56.846 --> 00:31:59.602
First some--like just a network,
00:31:59.602 --> 00:32:03.424
and like, "Hey, look, there's this thing
that connects to this other thing."
00:32:04.824 --> 00:32:07.059
And as we have more data,
00:32:07.059 --> 00:32:10.352
and as we've reached some critical mass,
00:32:11.852 --> 00:32:14.747
more powerful applications
become possible,
00:32:15.677 --> 00:32:17.516
things like Histropedia,
00:32:19.126 --> 00:32:21.988
things like question and answering
00:32:21.988 --> 00:32:26.663
in your digital personal assistant,
Platypus, and so on.
00:32:26.663 --> 00:32:29.668
And we're seeing
a similar thing with lexemes.
00:32:31.198 --> 00:32:34.650
We're at the stage
where you can build like these little,
00:32:34.650 --> 00:32:37.464
hey, look, there's a connection
between the two things,
00:32:37.864 --> 00:32:42.738
and there's a translation
of this word into that language stage,
00:32:42.738 --> 00:32:47.747
and as we build it out
and as we describe more words,
00:32:47.747 --> 00:32:49.533
more becomes possible.
00:32:49.533 --> 00:32:51.795
Now, what becomes possible?
00:32:53.482 --> 00:32:59.483
As Ben, our keynote speaker earlier
was talking about translations,
00:33:00.103 --> 00:33:03.455
being able to translate
from one language to another.
00:33:03.455 --> 00:33:07.929
And Jens, my colleague,
he's always talking about
00:33:07.929 --> 00:33:11.452
the European Union
looking for a translator
00:33:11.452 --> 00:33:17.439
who can translate from
I think it was Maltese to Swedish--
00:33:17.439 --> 00:33:19.436
- (person 8) Estonian.
- Estonian.
00:33:22.016 --> 00:33:26.211
And that is not a usual combination.
00:33:27.211 --> 00:33:31.735
But once you have all these languages
in one machine-readable place,
00:33:31.735 --> 00:33:33.143
you can do that,
00:33:33.143 --> 00:33:36.857
you can get a dictionary
00:33:36.857 --> 00:33:41.735
from Estonian to Maltese and back.
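One way this could work, as a sketch: if senses in different languages point to the same concept (on Wikidata, via item for this sense, P5137), a bilingual word list falls out by pivoting on the shared item. The item IDs below are placeholders and the tiny word list is only illustrative; the function name is invented for this example.

```python
from collections import defaultdict


def build_dictionary(senses, source_lang, target_lang):
    """Pair up lemmas of two languages whose senses share the same item.

    senses: iterable of (language_code, lemma, item_id) tuples.
    """
    by_item = defaultdict(lambda: defaultdict(set))
    for lang, lemma, item in senses:
        by_item[item][lang].add(lemma)

    result = defaultdict(set)
    for langs in by_item.values():
        for src in langs.get(source_lang, ()):
            result[src].update(langs.get(target_lang, ()))
    # Drop source words that have no counterpart in the target language.
    return {src: sorted(targets) for src, targets in result.items() if targets}


# Illustrative data with placeholder item IDs.
senses = [
    ("et", "vesi", "Q-WATER"),
    ("mt", "ilma", "Q-WATER"),
    ("et", "maja", "Q-HOUSE"),
    ("mt", "dar", "Q-HOUSE"),
    ("et", "raamat", "Q-BOOK"),  # no Maltese sense yet, so it is left out
]

print(build_dictionary(senses, "et", "mt"))
print(build_dictionary(senses, "mt", "et"))
```

Because the pivot is the shared item rather than a third language, the same data gives both directions, and any other language pair, without extra work.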
00:33:42.935 --> 00:33:45.607
So covering language
combinations in dictionaries
00:33:45.607 --> 00:33:47.911
that just haven't been covered before
00:33:47.911 --> 00:33:51.050
because there wasn't
enough demand for it, for example,
00:33:51.050 --> 00:33:55.540
to make it financially viable
and to justify the work.
00:33:55.540 --> 00:33:57.147
Now we can do that.
00:33:59.797 --> 00:34:02.318
Then text generation.
00:34:02.318 --> 00:34:03.653
Lucie was earlier talking
00:34:03.653 --> 00:34:10.136
about how she's working
with Hady on generating text
00:34:10.136 --> 00:34:14.673
to get Wikipedia articles
in minority languages started,
00:34:15.423 --> 00:34:19.512
and that needs data about words,
00:34:19.512 --> 00:34:22.589
and you need to understand
the language to do that.
00:34:23.769 --> 00:34:28.133
Yeah, and those are just some
that come to my mind right now.
00:34:28.693 --> 00:34:30.494
Maybe our audience has more ideas
00:34:30.494 --> 00:34:34.353
what they want to do
when we have all the glorious data.
00:34:37.693 --> 00:34:40.892
(person 9) Okay, I will deviate
from the lexemes topic.
00:34:40.892 --> 00:34:42.666
I will ask the question,
00:34:42.666 --> 00:34:45.634
how can I as a member of community
00:34:45.634 --> 00:34:50.135
influence that priority is put on task,
00:34:50.135 --> 00:34:56.644
that a new user comes, and he can indicate
what languages he wants to see and edit
00:34:56.644 --> 00:35:01.135
without some secret Babel
template knowledge.
00:35:02.145 --> 00:35:05.053
Maybe there will be this year
this technical wish list
00:35:05.053 --> 00:35:07.040
without Wikipedia topics.
00:35:07.040 --> 00:35:10.119
Maybe there's a hope
we can all vote about
00:35:10.119 --> 00:35:14.218
this thing we didn't fix for seven years.
00:35:14.218 --> 00:35:17.607
So do you have any ideas
and comments about this?
00:35:18.217 --> 00:35:20.328
So you're talking about the fact
00:35:20.328 --> 00:35:23.518
that someone who is
not logged into Wikidata
00:35:23.518 --> 00:35:25.971
can't change their language easily?
00:35:25.971 --> 00:35:27.839
(person 9) No, for [inaudible] users.
00:35:28.309 --> 00:35:30.689
So, if they are logged in,
00:35:30.689 --> 00:35:34.871
they can just change their language
at the top of the page,
00:35:35.891 --> 00:35:38.099
and then it will appear
00:35:39.769 --> 00:35:42.013
where the labels, descriptions
[inaudible] are,
00:35:42.013 --> 00:35:43.483
and they can edit it.
00:35:45.657 --> 00:35:49.009
(person 9) Well, actually, usually
many times the workflow
00:35:49.009 --> 00:35:52.447
is that if you want to have
multiple languages, they are available,
00:35:52.447 --> 00:35:55.419
and it's not always the case.
00:35:55.419 --> 00:35:58.584
Okay, maybe we should sit down
after this talk and you show me.
00:36:01.562 --> 00:36:04.089
Cool. More questions?
00:36:05.534 --> 00:36:06.536
Yes.
00:36:11.595 --> 00:36:13.196
(person 10) Thanks for the presentation.
00:36:14.106 --> 00:36:15.127
Can you comment
00:36:15.127 --> 00:36:19.307
on the state of the collaboration
with the Wiktionary community?
00:36:19.307 --> 00:36:22.296
As far as I've seen,
there were some discussions
00:36:22.296 --> 00:36:26.051
about importing some elements of the work,
00:36:26.051 --> 00:36:30.843
but there seem to be licensing issues
and some disagreements, et cetera.
00:36:30.843 --> 00:36:31.848
Right.
00:36:31.848 --> 00:36:36.330
So, Wiktionary communities
have spent a lot of time
00:36:37.320 --> 00:36:39.473
building Wiktionary.
00:36:39.473 --> 00:36:42.643
They have built
00:36:43.193 --> 00:36:47.554
amazingly complicated
and complex templates
00:36:47.554 --> 00:36:53.614
to build pretty tables
that automatically generate forms for you
00:36:53.614 --> 00:36:56.392
and all kinds of really impressive,
00:36:56.392 --> 00:37:00.683
and kind of crazy stuff,
if you think about it.
00:37:02.311 --> 00:37:07.994
And, of course, they have invested
a lot of time and effort into that.
00:37:09.364 --> 00:37:11.801
And understandably,
00:37:11.801 --> 00:37:17.116
they don't just want that to be grabbed,
00:37:18.046 --> 00:37:19.102
just like that.
00:37:19.102 --> 00:37:21.791
So there's some of that coming from there.
00:37:22.761 --> 00:37:25.137
And that's fine, that's okay.
00:37:25.737 --> 00:37:32.092
Now, the first Wiktionary communities
are talking about turning out
00:37:32.092 --> 00:37:34.329
and importing some
of their data into Wikidata.
00:37:34.329 --> 00:37:39.095
Russian, you have seen,
for example, is one of those cases.
00:37:40.375 --> 00:37:42.355
And I expect more of that to happen.
00:37:43.635 --> 00:37:46.800
But it will be a slow process,
00:37:46.800 --> 00:37:49.383
just like adoption
of Wikidata's data on Wikipedia
00:37:49.383 --> 00:37:51.909
has been a rather slow process.
00:37:52.849 --> 00:37:56.183
On the other side
of making it actually easier
00:37:56.183 --> 00:37:59.132
to use the data that is in lexemes,
00:37:59.132 --> 00:38:02.209
on Wiktionary, so that
they can make use of that
00:38:02.209 --> 00:38:05.531
and share data between
the language Wiktionaries
00:38:05.531 --> 00:38:08.853
which is super hard
to impossible right now,
00:38:08.853 --> 00:38:11.560
which is crazy,
just like it was on Wikipedia.
00:38:13.860 --> 00:38:16.325
Wait for the birthday present. (chuckles)
00:38:20.038 --> 00:38:21.182
Yes.
00:38:22.599 --> 00:38:24.827
(person 11) When I was thinking
the other way around it,
00:38:24.827 --> 00:38:28.168
I actually didn't want to say it
because I think this will be super silly,
00:38:28.168 --> 00:38:32.003
but I think that Wiktionary
already has some content,
00:38:32.003 --> 00:38:34.978
and I know that
we can't transfer it to Wikidata
00:38:34.978 --> 00:38:37.048
because there's a difference in licenses.
00:38:37.048 --> 00:38:39.631
But I was thinking maybe
we can do something about that.
00:38:40.321 --> 00:38:45.913
Maybe, I don't know, we can obtain
the communities' permission
00:38:45.913 --> 00:38:51.205
after like, I don't know,
having like a public voting
00:38:52.075 --> 00:38:55.642
and for the community,
the active members of the community
00:38:55.642 --> 00:39:02.523
to vote and say if they would like
or accept or to transfer the content
00:39:02.523 --> 00:39:05.528
for which they may do
the Wikidata lexemes.
00:39:06.238 --> 00:39:08.537
Because I just think it is such a waste.
00:39:09.568 --> 00:39:14.443
So, that's definitely
a conversation those people
00:39:14.443 --> 00:39:18.249
who are in Wiktionary communities
are very welcome to bring up there.
00:39:18.249 --> 00:39:24.647
I think it would be a bit presumptuous
for us to go and force that.
00:39:25.917 --> 00:39:31.142
But, yeah, I think it's definitely worth
having a conversation.
00:39:31.142 --> 00:39:33.898
But I think it's also important
to understand
00:39:33.898 --> 00:39:39.082
that there's a distinction between
what is actually legally allowed
00:39:39.082 --> 00:39:43.147
and what we should be doing
00:39:43.147 --> 00:39:45.426
and what those people want or do not want.
00:39:45.736 --> 00:39:47.329
So even if it's legally allowed,
00:39:47.329 --> 00:39:50.640
if some other Wiktionary communities
do not want that,
00:39:50.640 --> 00:39:53.537
I would be careful, at least.
00:39:58.886 --> 00:40:02.489
I think you need the mic
for the stream.
00:40:04.540 --> 00:40:07.299
(person 12) So, obviously,
it's all very exciting,
00:40:07.979 --> 00:40:12.319
and I immediately think
how can I take that to my students
00:40:12.319 --> 00:40:15.558
and how can I incorporate it
with the courses,
00:40:15.558 --> 00:40:18.531
the work that we're doing,
educational settings.
00:40:18.531 --> 00:40:22.271
And I don't have, at the moment,
00:40:22.871 --> 00:40:24.116
first of all, enough knowledge,
00:40:24.116 --> 00:40:27.278
but I think the documentation
that we do have
00:40:27.808 --> 00:40:30.082
could be maybe improved.
00:40:30.082 --> 00:40:33.437
So that's a kind of request
to make cool videos
00:40:33.437 --> 00:40:35.898
that explain how it works
00:40:35.898 --> 00:40:39.948
because if we have it, we can then use it,
00:40:39.948 --> 00:40:41.985
and we can have students on board,
00:40:41.985 --> 00:40:47.072
and we can make people understand
how awesome it all is.
00:40:47.072 --> 00:40:52.001
And yeah, just think about documentation
and think about education, please.
00:40:52.001 --> 00:40:54.480
Because I think a lot could be done.
00:40:54.480 --> 00:40:58.585
These are like many tasks
that could be done even with...
00:41:00.125 --> 00:41:02.033
well, I wouldn't say primary schools,
00:41:02.033 --> 00:41:05.495
but certainly, even younger students.
00:41:05.915 --> 00:41:10.866
And so I would really like to see
that potential being tapped into,
00:41:10.866 --> 00:41:15.272
and, as of now, I personally
don't understand enough
00:41:15.272 --> 00:41:19.500
to be able to create tasks
or to create like...
00:41:20.430 --> 00:41:22.155
to do something practical with it.
00:41:22.155 --> 00:41:25.772
So any help, any thoughts
anyone here has about that,
00:41:25.772 --> 00:41:29.648
I would be very happy to hear
your thoughts, and yours as well.
00:41:30.508 --> 00:41:32.129
Yeah, let's talk about that.
00:41:35.473 --> 00:41:37.139
More questions?
00:41:37.809 --> 00:41:39.195
Someone else raised a hand.
00:41:39.195 --> 00:41:40.495
I forgot where it was.
00:41:45.739 --> 00:41:49.996
(person 13) So, if we can't import
from Wiktionary,
00:41:49.996 --> 00:41:55.772
is there some concerted effort
to find other public domain sources,
00:41:55.772 --> 00:41:57.459
maybe all the data,
00:41:58.769 --> 00:42:03.167
and kind of prefilter it, organize it
00:42:03.167 --> 00:42:08.470
so that it's easy to be checked
by people for import?
00:42:09.093 --> 00:42:11.181
So there are first efforts.
00:42:11.181 --> 00:42:14.769
My understanding is that Basque
is one of those efforts.
00:42:14.769 --> 00:42:17.474
Maybe you want to say
a bit more about it?
00:42:18.426 --> 00:42:20.130
(person 14) [inaudible]
00:42:23.166 --> 00:42:27.148
Okay, the actual answer
is paying for that...
00:42:28.374 --> 00:42:33.381
I mean, we have an agreement
with a contractor we usually work with.
00:42:34.801 --> 00:42:38.725
They do dictionaries--
00:42:40.315 --> 00:42:42.458
lots of stuff, but they do dictionaries.
00:42:42.458 --> 00:42:47.473
So we agreed with them
to make free the students' dictionary,
00:42:47.473 --> 00:42:52.782
we would [cast] the most common words
and start uploading it
00:42:52.782 --> 00:42:55.590
with an external identifier
and the scheme of things.
00:42:56.420 --> 00:43:02.902
But there was some discussion
about leaving it on CC0
00:43:03.212 --> 00:43:05.322
because they have
the dictionary with CC BY,
00:43:06.537 --> 00:43:10.326
and they understood
what the difference was.
00:43:10.326 --> 00:43:13.866
So there was some discussion.
00:43:13.866 --> 00:43:19.709
But I think that we can provide some tools
or some examples in the future,
00:43:19.709 --> 00:43:21.761
and I think that there will be
other dictionaries
00:43:21.761 --> 00:43:24.016
that we can handle,
00:43:24.016 --> 00:43:29.274
and also I think Wiktionary
should start moving in that direction,
00:43:29.274 --> 00:43:32.260
but that's another great discussion.
00:43:33.285 --> 00:43:34.487
And on top of that,
00:43:34.487 --> 00:43:38.839
Lea is also in contact
with people from Occitan
00:43:38.839 --> 00:43:41.827
who work on Occitan dictionaries,
00:43:41.827 --> 00:43:45.138
and they're currently working
on a similar collaboration.
00:43:51.644 --> 00:43:53.363
More questions?
00:44:01.487 --> 00:44:05.349
(person 15) Hi! We are the people
who want to import Occitan data.
00:44:05.349 --> 00:44:06.585
Aha! Perfect!
00:44:06.585 --> 00:44:08.368
(person 15) And we have a small problem.
00:44:09.188 --> 00:44:14.215
We don't know how to represent
the variety of all lexemes.
00:44:14.215 --> 00:44:17.893
We have six dialects,
00:44:17.893 --> 00:44:24.014
and we want to indicate for Lexeme
in which dialect it's used,
00:44:24.014 --> 00:44:27.285
and we don't have a proper
C0 statement to do that.
00:44:27.285 --> 00:44:31.105
So as long as the statement doesn't exist,
00:44:31.635 --> 00:44:34.465
it prevents us from [inaudible]
00:44:34.465 --> 00:44:37.603
because we will need to do it again
00:44:37.603 --> 00:44:42.076
when we will be able
to [export] the statement.
00:44:42.076 --> 00:44:44.551
And it's complicated
because it's a statement
00:44:44.551 --> 00:44:47.802
which won't be asked by many people
00:44:47.802 --> 00:44:53.444
because it's a statement
which concerns mostly minority languages.
00:44:53.444 --> 00:44:56.933
So you will have one person to ask this.
00:44:56.933 --> 00:45:00.022
But as with our Basque colleagues,
00:45:00.022 --> 00:45:06.082
it can be one person
who will power thousands of others,
00:45:06.082 --> 00:45:10.884
so it might not be asking a lot,
00:45:10.884 --> 00:45:14.136
but it will be very important for us.
00:45:14.874 --> 00:45:17.600
Do you already have
a new property proposal up,
00:45:17.600 --> 00:45:19.470
or do you need help creating it?
00:45:21.524 --> 00:45:24.300
(person 15) We asked four months ago.
00:45:24.720 --> 00:45:28.755
Alright, then let's get some people
to help out with this property proposal.
00:45:30.159 --> 00:45:33.092
I'm sure there are enough people
in this room to make this happen.
00:45:33.360 --> 00:45:35.452
(person 15) Property proposal
[speaking in French].
00:45:35.452 --> 00:45:36.965
(person 16) We didn't have an answer.
00:45:36.965 --> 00:45:39.769
(person 15) We didn't have any answer,
and we don't know how to do this
00:45:39.769 --> 00:45:42.953
because we aren't
in the Wikidata community.
00:45:44.694 --> 00:45:48.817
Yup, so there are people here
who can help you.
00:45:48.817 --> 00:45:52.134
Maybe someone raises their hand to take--
00:45:52.574 --> 00:45:53.644
(person 14) I'm for that.
00:45:53.644 --> 00:45:55.512
But I think this is quite interesting
00:45:55.512 --> 00:45:59.059
that only the variant of form
00:45:59.059 --> 00:46:02.607
also can handle it geographically,
00:46:02.607 --> 00:46:04.995
with coordinates or some kind of mapping.
00:46:05.595 --> 00:46:07.815
Also having different pronunciations,
00:46:07.815 --> 00:46:11.837
and I think this is something
that happens in lots of languages.
00:46:12.607 --> 00:46:16.262
We should start making
it happen [inaudible],
00:46:16.262 --> 00:46:18.865
and I'm going to search for the property.
00:46:19.782 --> 00:46:20.933
Cool.
00:46:20.933 --> 00:46:24.446
So you will get backing
for your property proposal.
00:46:26.136 --> 00:46:27.297
Thank you.
00:46:28.153 --> 00:46:30.261
Alright, more questions?
00:46:32.410 --> 00:46:33.474
Finn.
00:46:33.974 --> 00:46:35.055
Finn is one of those people
00:46:35.055 --> 00:46:38.031
who builds stuff
on top of lexicographical data.
00:46:38.031 --> 00:46:40.085
(Finn) It's just a small question,
00:46:40.405 --> 00:46:44.226
and that's about spelling variations.
00:46:44.896 --> 00:46:48.002
It seems to be difficult to put them in...
00:46:48.532 --> 00:46:53.368
You could, of course,
have multiple forms for the same word.
00:46:56.327 --> 00:46:58.448
I don't know, it seems to be...
00:46:59.558 --> 00:47:03.535
If you don't do it that way,
it seems to be difficult to specify...
00:47:04.771 --> 00:47:05.888
or I don't know whether
00:47:05.888 --> 00:47:09.731
this is just a minor technical issue
or whether...
00:47:09.731 --> 00:47:11.252
Let's look at it together.
00:47:11.642 --> 00:47:15.230
I would love to see an example.
00:47:17.478 --> 00:47:18.478
Asaf.
00:47:26.886 --> 00:47:28.396
(Asaf) Thank you.
00:47:29.386 --> 00:47:33.685
I can give a very concrete example
from my mother tongue, Hebrew.
00:47:34.205 --> 00:47:38.845
Hebrew has two main variants
00:47:38.845 --> 00:47:42.786
for expressing almost every word
00:47:42.786 --> 00:47:47.640
because the traditional spelling
00:47:47.640 --> 00:47:50.044
leaves out many of the vowels.
00:47:50.934 --> 00:47:55.207
And, therefore, in modern editions
of the Bible and of poetry,
00:47:55.207 --> 00:47:57.461
diacritics are used.
00:47:57.461 --> 00:48:02.670
However, those diacritics
are never used for modern prose
00:48:02.670 --> 00:48:05.974
or newspaper writing or street signs.
00:48:05.974 --> 00:48:11.209
So the average daily casual use
puts in extra vowels
00:48:12.169 --> 00:48:13.519
and doesn't use the diacritics
00:48:13.519 --> 00:48:15.607
because they are,
of course, more cumbersome
00:48:15.607 --> 00:48:17.893
and have all kinds of rules
and nobody knows the rules.
00:48:18.633 --> 00:48:20.531
So there are basically two variants.
00:48:20.531 --> 00:48:25.322
There's the everyday casual prose variant,
00:48:25.322 --> 00:48:27.827
and there's the Bible or poetry,
00:48:27.827 --> 00:48:32.200
which always come
in this traditional diacriticized text.
00:48:32.200 --> 00:48:33.302
To be useful,
00:48:33.302 --> 00:48:37.428
Lexeme would have to recognize
both varieties of every single word
00:48:37.428 --> 00:48:39.747
and every single form
of every single word.
00:48:40.677 --> 00:48:43.391
So that's a very comprehensive use case
00:48:43.391 --> 00:48:46.340
for official stable variants.
00:48:46.340 --> 00:48:48.942
It's not dialect, it's not regions,
00:48:49.332 --> 00:48:53.627
it's basically two coexisting
morphological systems.
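Purely as an illustration of the requirement, and not of how Lexeme handles it today: one form carrying two written representations, keyed by a variant code. The variant codes `he-plain` and `he-vocalized`, the form ID, and the class itself are invented for this sketch; the example word is "hope".

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """One grammatical form with several spellings, keyed by a variant code."""
    form_id: str
    representations: dict = field(default_factory=dict)  # variant code -> spelling

    def spelling(self, variant, fallback=None):
        """Return the spelling for a variant, or for the fallback variant if given."""
        if variant in self.representations:
            return self.representations[variant]
        if fallback is not None:
            return self.representations.get(fallback)
        return None


# Hypothetical variant codes; both spellings belong to the same form.
form = Form(
    "F-EXAMPLE-1",
    {
        "he-plain": "תקווה",        # everyday spelling with extra vowel letters
        "he-vocalized": "תִּקְוָה",  # traditional spelling with diacritics
    },
)

print(form.spelling("he-plain"))
print(form.spelling("he-vocalized"))
```

Keeping both spellings on the same form, rather than as two separate forms, means grammatical features and statements would only need to be entered once.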
00:48:54.537 --> 00:48:58.926
And I too don't know exactly
how to express that in Lexeme today,
00:48:58.926 --> 00:49:02.800
which is one thing that is keeping me
in partial answer to Magnus' question
00:49:02.800 --> 00:49:05.238
from uploading the parts that are ready
00:49:05.238 --> 00:49:09.394
from the biggest Hebrew dictionary,
which is public domain
00:49:09.394 --> 00:49:13.141
and which I have been digitizing
for several years now.
00:49:13.141 --> 00:49:14.803
A good portion of it is ready,
00:49:14.803 --> 00:49:16.549
but I'm not putting it on Lexeme right now
00:49:16.549 --> 00:49:20.245
because I don't know exactly
how to solve this problem.
00:49:20.245 --> 00:49:23.387
Alright, let's solve
this problem here. (chuckles)
00:49:24.503 --> 00:49:26.021
That has to be possible.
00:49:30.045 --> 00:49:32.047
Alright, more questions?
00:49:37.173 --> 00:49:39.735
If not, then thank you so much.
00:49:40.605 --> 00:49:42.675
(applause)