
cdn.media.ccc.de/.../wikidatacon2019-2-eng-Wikidata_and_languages_hd.mp4

  • 0:06 - 0:07
    (Lydia) Thank you so much.
  • 0:07 - 0:11
    So, this conference,
    one of the big themes is languages.
  • 0:14 - 0:19
    I want to give you an overview
    of where we actually are currently
  • 0:19 - 0:20
    when it comes to languages
  • 0:20 - 0:22
    and where we can go from here.
  • 0:29 - 0:33
    Wikidata is all about giving more people
    more access to more knowledge,
  • 0:33 - 0:37
    and language is such an important part
    of making that a reality,
  • 0:38 - 0:43
    especially since more and more
    of our lives depends on technology.
  • 0:44 - 0:49
    And as our keynote speaker
    earlier today was talking,
  • 0:50 - 0:52
    some of the technology
    leaves people behind
  • 0:52 - 0:55
    simply because they can't speak
    a certain language,
  • 0:55 - 0:58
    and that's not okay.
  • 0:59 - 1:02
    So we want to do something about that.
  • 1:03 - 1:06
    And in order to change that,
    you need at least two things.
  • 1:06 - 1:11
    One is you need to provide content
    to the people in their language,
  • 1:11 - 1:13
    and the second thing you need
  • 1:13 - 1:16
    is to provide them
    with interaction in their language
  • 1:16 - 1:19
    in those applications
    or whatever it is you have.
  • 1:20 - 1:25
    And Wikidata helps with both of those.
  • 1:25 - 1:28
    And the first thing,
    content in your language,
  • 1:28 - 1:31
    that is basically what we have
    in items and properties,
  • 1:31 - 1:33
    how we describe the world.
  • 1:33 - 1:35
    Now, this is certainly
    not everything you need,
  • 1:35 - 1:39
    but it gets you quite far ahead.
  • 1:40 - 1:42
    The other thing
    is interaction in your language,
  • 1:42 - 1:46
    and that's where lexemes come into play
  • 1:46 - 1:49
    If you want to talk
    to your digital personal assistant
  • 1:49 - 1:55
    or if you want to have your device
    translate a text and things like that.
  • 1:56 - 1:59
    Alright, let's look into
    content in your language.
  • 1:59 - 2:03
    So what we have in items and properties.
  • 2:05 - 2:10
    For this, the labels in those items
    and properties are crucial.
  • 2:10 - 2:15
    We need to know what this entity
    is called that we're talking about.
  • 2:16 - 2:20
    And instead of talking about Q5,
  • 2:20 - 2:22
    someone who speaks English
    knows that's a "human,"
  • 2:22 - 2:25
    someone who speaks German
    knows that's a "Mensch,"
  • 2:25 - 2:26
    and similar things.
  • 2:26 - 2:30
    So those labels on items and properties
  • 2:30 - 2:34
    are bridging the gap
    between humans and machines.
  • 2:34 - 2:35
    And humans and humans
  • 2:35 - 2:40
    making more existing knowledge
    accessible to them.
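[Editor's note: the label lookup described here can be sketched in a few lines of Python. The English, German, and French labels below are Q5's real ones; the fallback chain is an illustrative simplification, not Wikidata's actual language-fallback logic.]

```python
# Labels make an opaque ID like Q5 readable for humans in their language.
Q5_LABELS = {
    "en": "human",
    "de": "Mensch",
    "fr": "être humain",
}

def label_for(entity_id, labels, preferred):
    """Return the first available label along a preferred-language chain."""
    for lang in preferred:
        if lang in labels:
            return labels[lang]
    return entity_id  # no label available: show the opaque ID itself

print(label_for("Q5", Q5_LABELS, ["ast", "en"]))  # -> human (no Asturian label yet)
```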
  • 2:43 - 2:46
    Now, that's a nice aspiration.
  • 2:46 - 2:48
    What does it actually look like?
  • 2:48 - 2:50
    It looks like this.
  • 2:51 - 2:52
    What you're seeing here
  • 2:52 - 2:58
    is that most of the items
    on Wikidata have two labels,
  • 2:58 - 3:01
    so labels in two languages.
  • 3:02 - 3:04
    And after that, it's one, and then three,
  • 3:04 - 3:06
    and then it becomes very sad.
  • 3:07 - 3:09
    (quiet laughter)
  • 3:10 - 3:13
    I think we need to do better than this.
  • 3:14 - 3:15
    But, on the other hand,
  • 3:15 - 3:17
    I was actually expecting this
    to be even worse.
  • 3:17 - 3:20
    I was expecting the average to be one.
  • 3:20 - 3:23
    So I was quite happy
    to see two. (chuckles)
  • 3:25 - 3:26
    Alright.
  • 3:27 - 3:30
    But it's not just interesting to know
  • 3:30 - 3:34
    how many labels our items
    and properties have.
  • 3:34 - 3:37
    It's also interesting to see
    in which languages.
  • 3:38 - 3:44
    Here you see a graph of the languages
  • 3:44 - 3:47
    that we have labels for on Items.
  • 3:47 - 3:51
    So the biggest part there is Other.
  • 3:51 - 3:54
    So I just took the top 100 languages
  • 3:55 - 3:59
    and everything else is Other
    to make this graph readable.
  • 4:00 - 4:02
    And then there's English and Dutch,
  • 4:03 - 4:04
    French,
  • 4:06 - 4:09
    and not to forget, Asturian.
  • 4:10 - 4:12
    - (person 1) Whoo!
    - Whoo-hoo, yes!
  • 4:14 - 4:17
    So what you see here is quite an imbalance
  • 4:17 - 4:20
    and still quite a lot of focus on English.
  • 4:21 - 4:24
    Another thing is if you look
    at the same thing for Properties,
  • 4:24 - 4:26
    it's actually looking better.
  • 4:27 - 4:33
    And I think part of that is simply
    that there are way fewer properties.
  • 4:33 - 4:37
    So even smaller communities
    have a chance to keep up with that.
  • 4:37 - 4:39
    But it's also a pretty
    important part of Wikidata
  • 4:39 - 4:41
    to localize into your language.
  • 4:41 - 4:42
    So that's good.
  • 4:46 - 4:48
    What I want to highlight
    here with Asturian
  • 4:48 - 4:54
    is that a small community
    can really make a huge difference
  • 4:54 - 4:57
    with some dedication and work,
  • 4:57 - 4:58
    and that's really cool.
  • 5:02 - 5:04
    A small quiz for you.
  • 5:04 - 5:05
    If you take all the properties on Wikidata
  • 5:05 - 5:08
    that are not external identifiers,
  • 5:08 - 5:10
    which one has the most labels,
    like the most languages?
  • 5:11 - 5:14
    (audience) [inaudible]
  • 5:14 - 5:17
    I hear some agreement on instance of?
  • 5:18 - 5:19
    You would be wrong.
  • 5:20 - 5:22
    It's image. (chuckles)
  • 5:23 - 5:26
    So, yeah, that tells you,
    if you speak one of the languages
  • 5:26 - 5:29
    where instance of
    doesn't yet have a label,
  • 5:29 - 5:30
    you might want to add it.
  • 5:32 - 5:36
    So it has 148 labels currently.
  • 5:38 - 5:41
    But that's just another slide.
  • 5:43 - 5:44
    This graph tells us something
  • 5:44 - 5:49
    about how much content we are making
    available in a certain language
  • 5:49 - 5:52
    and how much of that content
    is actually used.
  • 5:52 - 5:55
    So what you're seeing is basically a curve
  • 5:55 - 6:01
    with most content having English labels,
    being available in English,
  • 6:02 - 6:04
    and being used a lot.
  • 6:04 - 6:06
    And then it kind of goes down.
  • 6:06 - 6:09
    But, again, what you can see are outliers
  • 6:09 - 6:15
    that have a lot more content
    than you would necessarily expect,
  • 6:17 - 6:20
    and that is really, really good.
  • 6:21 - 6:25
    The problem still is it's not used a lot.
  • 6:26 - 6:29
    Asturian and Dutch should be higher,
  • 6:29 - 6:32
    and I think helping those communities
  • 6:33 - 6:36
    increase the use
    of the data they collected
  • 6:36 - 6:38
    is a really useful thing to do.
  • 6:43 - 6:48
    What this analysis and others showed us,
    and this is also a good thing,
  • 6:48 - 6:51
    is that we are seeing
    that highly used items
  • 6:51 - 6:55
    also tend to have more labels
  • 6:55 - 6:58
    or the other way around--
    it's not entirely clear.
  • 7:03 - 7:04
    And then the question is,
  • 7:05 - 7:07
    are we serving
    just the powerful languages?
  • 7:08 - 7:11
    Or are we serving everyone?
  • 7:13 - 7:18
    And what you see here
    is a grouping of languages.
  • 7:18 - 7:22
    The languages that are grouped together
    tend to have labels together.
  • 7:26 - 7:29
    And you see it clustering.
  • 7:29 - 7:34
    Now here's a similar clustering, colored,
  • 7:34 - 7:39
    based on how alive, how used,
  • 7:40 - 7:43
    how endangered the language is.
  • 7:43 - 7:45
    And a good thing you're seeing here
  • 7:45 - 7:50
    is that safe languages
    and endangered languages
  • 7:50 - 7:54
    do not form two different clusters.
  • 7:54 - 7:59
    But they're all mixed together,
  • 8:00 - 8:05
    which is much better than it would be
    the other way around
  • 8:05 - 8:09
    where the safe languages,
    the powerful languages
  • 8:10 - 8:12
    are just helping each other out.
  • 8:13 - 8:14
    No, that's not the case.
  • 8:14 - 8:17
    And it's a really good thing.
  • 8:17 - 8:20
    When I saw this,
    I thought this was very good.
  • 8:23 - 8:25
    Here's a similar thing
  • 8:26 - 8:29
    where we looked at
  • 8:30 - 8:34
    each language's status
  • 8:34 - 8:36
    and how many labels it has.
  • 8:39 - 8:43
    What you're seeing
    is a clear win for safe languages,
  • 8:43 - 8:44
    as is expected.
  • 8:46 - 8:47
    But what you're also seeing
  • 8:47 - 8:54
    is that the languages in category 2
    and 3 and maybe even 4
  • 8:54 - 8:59
    are not that bad, actually,
  • 8:59 - 9:02
    in terms of their representation
    in Wikidata and others.
  • 9:03 - 9:06
    It's a really good thing to find.
  • 9:08 - 9:09
    Now, if you look at the same thing
  • 9:09 - 9:12
    for how much of that content
    of those labels
  • 9:12 - 9:15
    is actually used
    on Wikipedia, for example,
  • 9:17 - 9:23
    then we see a similar
    picture emerging again.
  • 9:24 - 9:30
    And it tells us that those communities
    are actually making good use of their time
  • 9:30 - 9:35
    by filling in labels
    for highly used items, for example.
  • 9:36 - 9:40
    There are outliers
    where I think we can help,
  • 9:42 - 9:48
    to help those communities find the places
    where their work would be most valuable.
  • 9:49 - 9:53
    But, overall, I'm happy with this picture.
  • 9:55 - 10:00
    Now, that was the items
    and properties part of Wikidata.
  • 10:01 - 10:03
    Now, let's look at interaction
    in your languages.
  • 10:03 - 10:05
    So the lexeme parts of Wikidata
  • 10:05 - 10:09
    where we describe words
    and their forms and their meanings.
  • 10:10 - 10:13
    We've been doing this now
    since May last year,
  • 10:16 - 10:19
    and content has been growing.
  • 10:20 - 10:22
    You can see here in blue the lexemes,
  • 10:22 - 10:26
    and then in red,
    the forms on those lexemes
  • 10:26 - 10:30
    and yellow, the senses
    on those lexemes.
  • 10:31 - 10:34
    So some communities--
    we'll get to that later--
  • 10:34 - 10:40
    have spent a lot of time creating forms
    and senses for their lexemes,
  • 10:40 - 10:43
    which is really useful
  • 10:43 - 10:48
    because that builds
    the core of the data set that you need.
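[Editor's note: the lexeme data model described here, a lexeme grouping forms and senses, can be sketched roughly as below. Field names are illustrative, not Wikidata's exact JSON schema.]

```python
from dataclasses import dataclass, field

@dataclass
class Form:
    representation: str   # the inflected spelling, e.g. "waters"
    features: tuple = ()  # grammatical features, e.g. ("plural",)

@dataclass
class Sense:
    gloss: str            # a short description of one meaning

@dataclass
class Lexeme:
    lemma: str
    language: str         # language code
    category: str         # lexical category, e.g. "noun"
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)

water = Lexeme(
    "water", "en", "noun",
    forms=[Form("water", ("singular",)), Form("waters", ("plural",))],
    senses=[Sense("transparent liquid, H2O")],
)
print(len(water.forms), len(water.senses))  # -> 2 1
```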
  • 10:51 - 10:55
    Now, we looked at all the languages
  • 10:55 - 10:58
    that have lexemes on Wikidata.
  • 10:58 - 11:01
    So words we have,
  • 11:02 - 11:04
    those are right now 310 languages.
  • 11:05 - 11:08
    Now, what do you think is the top language
  • 11:08 - 11:12
    when it comes to the number
    of lexemes currently in Wikidata?
  • 11:13 - 11:15
    (audience) [inaudible]
  • 11:19 - 11:20
    Huh?
  • 11:20 - 11:22
    (person 2) German.
  • 11:22 - 11:24
    Sorry, I've heard it before.
  • 11:24 - 11:26
    It's Russian.
  • 11:28 - 11:30
    Russian is quite ahead.
  • 11:32 - 11:34
    And just to give you some perspective,
  • 11:36 - 11:37
    there are different opinions,
  • 11:37 - 11:42
    but I've read, for example,
    that 1,000 to 3,000 words
  • 11:42 - 11:45
    gets you to conversation level,
    roughly, in another language,
  • 11:45 - 11:49
    and 4,000 to 10,000 words
    to an advanced level.
  • 11:52 - 11:55
    So, we still have a bit to catch up there.
  • 11:58 - 12:03
    One thing I want you
    to pay attention to is Basque here
  • 12:03 - 12:08
    with 10,000, roughly, lexemes.
  • 12:09 - 12:13
    Now, if you look at the number
    of forms for those lexemes,
  • 12:14 - 12:16
    Basque is way up there,
  • 12:18 - 12:20
    which is really cool,
  • 12:20 - 12:25
    and you should go to a talk that explains
    to you why that is the case.
  • 12:27 - 12:31
    Now, if you look at the number
    of senses, so what do words mean,
  • 12:32 - 12:35
    Basque even gets to the top of the list.
  • 12:35 - 12:37
    I think that deserves an applause.
  • 12:37 - 12:39
    (applause)
  • 12:46 - 12:47
    Another short quiz.
  • 12:47 - 12:50
    What's the lexeme
    with the most translations currently?
  • 12:51 - 12:55
    (audience) Cats, cats, [inaudible],
    Douglas Adams, [inaudible]
  • 12:57 - 13:00
    All good guesses, but no.
  • 13:01 - 13:04
    It's this, the Russian word for "water."
  • 13:10 - 13:12
    Alright, so now we talked a lot
  • 13:12 - 13:16
    about how many lexemes,
    forms, and senses we have,
  • 13:16 - 13:20
    but that's just one thing you need.
  • 13:20 - 13:22
    The other thing you need
  • 13:22 - 13:25
    is actually describing those lexemes,
    forms, and senses
  • 13:25 - 13:28
    in a machine-readable way.
  • 13:28 - 13:30
    And for that you have statements,
    like on items.
  • 13:31 - 13:36
    And one of the properties
    you use is usage example.
  • 13:36 - 13:39
    So whoever is using that data
  • 13:39 - 13:42
    can understand how to use
    that word in context,
  • 13:42 - 13:44
    so that could be a quote, for example.
  • 13:45 - 13:47
    And here, Polish rocks.
  • 13:48 - 13:50
    Good job, Polish speakers.
  • 13:54 - 13:58
    Another property
    that's really useful is IPA,
  • 13:58 - 14:00
    so how do you pronounce this word.
  • 14:01 - 14:07
    Russian apparently needs
    lots of IPA statements.
  • 14:10 - 14:13
    But, again, Polish, second.
  • 14:17 - 14:21
    And last but not least
    we have pronunciation audio.
  • 14:21 - 14:23
    So those are links to files on Commons
  • 14:23 - 14:26
    where someone speaks the word,
  • 14:26 - 14:30
    so you can hear a native speaker
    pronounce the word
  • 14:30 - 14:33
    in case you can't read IPA, for example.
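[Editor's note: the three properties just mentioned (usage example, IPA transcription, pronunciation audio) live as statements on lexemes and their forms. A toy completeness check over such statements might look like this; the data layout and file name are illustrative, not Wikidata's actual format.]

```python
# Forms of the Polish lexeme "woda", with whatever statements they have so far.
forms = [
    {"repr": "woda",   "statements": {"IPA transcription": "ˈvɔda"}},
    {"repr": "wody",   "statements": {}},
    {"repr": "wodzie", "statements": {"pronunciation audio": "Pl-wodzie.ogg"}},
]

def missing(forms, prop):
    """List form representations that have no statement for `prop` yet."""
    return [f["repr"] for f in forms if prop not in f["statements"]]

print(missing(forms, "IPA transcription"))  # -> ['wody', 'wodzie']
```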
  • 14:35 - 14:39
    And there's actually a really nice
    Wikibase-powered project
  • 14:39 - 14:40
    called Lingua Libre
  • 14:41 - 14:45
    where you can go and help record
    words in your language
  • 14:45 - 14:48
    that then can be added
    to lexemes on Wikidata,
  • 14:48 - 14:52
    so other people can understand
    how to pronounce your words.
  • 14:54 - 14:56
    (person 2) [inaudible]
  • 14:56 - 14:58
    If you search for "Lingua Libre," you'll find it,
  • 14:58 - 15:01
    and I'm sure someone can post it
    in the Telegram channel.
  • 15:03 - 15:05
    Those guys rock.
  • 15:05 - 15:07
    They did really cool stuff with Wikibase.
  • 15:09 - 15:11
    Alright.
  • 15:13 - 15:17
    Then the question is,
    where do we go from here?
  • 15:19 - 15:22
    Based on the numbers I've just shown you,
  • 15:23 - 15:25
    we've come a long way
  • 15:25 - 15:28
    towards giving more people
    more access to more knowledge
  • 15:28 - 15:31
    when looking at languages on Wikidata.
  • 15:33 - 15:36
    But there is also still
    a lot of work ahead of us.
  • 15:39 - 15:42
    Some of the things
    you can do to help, for example,
  • 15:42 - 15:45
    is run label-a-thons
  • 15:45 - 15:50
    like get people together
    to label items in Wikidata
  • 15:51 - 15:55
    or do an edit-a-thon
    around lexemes in your language
  • 15:55 - 15:59
    to get the most used words
    in your language into Wikidata.
  • 16:01 - 16:03
    Or you can use a tool like Terminator
  • 16:03 - 16:08
    that helps you find the most
    important items in your language
  • 16:08 - 16:12
    that are still missing a label.
  • 16:13 - 16:18
    Most important being measured
    by how often it is used
  • 16:18 - 16:23
    in other Wikidata items
    as links in statements.
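[Editor's note: that importance measure, how often an item is linked from other items' statements, can be illustrated in a few lines. The Q-IDs and links here are toy data, not real Terminator output.]

```python
from collections import Counter

links = {                 # each item maps to the items its statements link to
    "Q64":   ["Q5", "Q515"],
    "Q42":   ["Q5", "Q36180"],
    "Q1055": ["Q515"],
}
labeled = {"Q5"}          # items that already have a label in our language

usage = Counter(t for targets in links.values() for t in targets)
todo = sorted((q for q in usage if q not in labeled),
              key=usage.get, reverse=True)
print(todo)  # most-used items still lacking a label first -> ['Q515', 'Q36180']
```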
  • 16:26 - 16:30
    And, of course, for the lexeme part,
  • 16:31 - 16:35
    now that we've got
    a basic coverage of those lexemes,
  • 16:35 - 16:41
    it's also about building them out,
    adding more statements to them
  • 16:41 - 16:44
    so that they actually can build the base
  • 16:44 - 16:47
    for meaningful applications
    to build on top of that.
  • 16:48 - 16:51
    Because we're getting closer
    to that critical mass,
  • 16:51 - 16:54
    but we're still some way from the point
  • 16:54 - 16:57
    that you can build
    serious applications on top of it.
  • 16:58 - 17:02
    And I hope all of you
    will join us in doing that.
  • 17:03 - 17:07
    And that already brings me
  • 17:07 - 17:10
    to a little help from our friends,
  • 17:10 - 17:13
    and Bruno, do you want to come over
  • 17:14 - 17:17
    and talk to us about lexical masks?
  • 17:18 - 17:19
    (Bruno) Thank you, Lydia,
  • 17:19 - 17:22
    thank you for giving me
    this short period of time
  • 17:22 - 17:24
    to present this work
    that we are doing at Google
  • 17:24 - 17:30
    with Denny, whom most of you
    have probably heard of or know.
  • 17:30 - 17:32
    And I'm a linguist at Google,
  • 17:32 - 17:36
    so I'm very happy to be here
    amongst other language enthusiasts.
  • 17:37 - 17:39
    We are also building some lexicons,
  • 17:39 - 17:42
    and we have built this technology
  • 17:42 - 17:46
    or this approach that we think
    can be useful for you.
  • 17:46 - 17:48
    Just to give you
    a little bit of background,
  • 17:48 - 17:52
    this is my lexicographic
    background talking here.
  • 17:53 - 17:54
    When we build lexicon databases,
  • 17:54 - 17:59
    they are hard to maintain,
    to keep consistent,
  • 17:59 - 18:00
    and to exchange data,
  • 18:00 - 18:02
    as you probably know.
  • 18:03 - 18:06
    There are several attempts
    to unify the features and the properties
  • 18:06 - 18:09
    that are describing
    those lexemes and those forms,
  • 18:09 - 18:11
    and it's not a solved problem,
  • 18:11 - 18:14
    but there are some
    unification attempts on that side.
  • 18:14 - 18:15
    But what is really missing--
  • 18:15 - 18:19
    and this is a problem we had
    at the beginning of our project at Google--
  • 18:19 - 18:22
    is to try to have an internal structure
  • 18:22 - 18:26
    that describes what
    a lexical entry should look like,
  • 18:26 - 18:29
    what kind of data
    or what kind of information we have
  • 18:29 - 18:32
    and the specifications that are expected.
  • 18:32 - 18:38
    So, this is what we came up with,
    this thing called a lexical mask.
  • 18:39 - 18:45
    A lexical mask describes
    what is expected for an entry,
  • 18:45 - 18:47
    a lexicographic entry, to be complete,
  • 18:47 - 18:51
    both in terms of the number of forms
    you expect for a lexeme,
  • 18:51 - 18:56
    and the number of features
    you expect for each of those forms.
  • 18:56 - 18:58
    Here is an example for Italian adjectives.
  • 18:58 - 19:02
    You expect, in Italian, to have
    four forms for your adjectives,
  • 19:02 - 19:05
    and each of these forms
    has a specific combination
  • 19:05 - 19:08
    of gender and number features.
  • 19:09 - 19:13
    This is what we expect
    for the Italian adjectives.
  • 19:13 - 19:16
    Of course, you can have
    extremely complex masks,
  • 19:16 - 19:21
    like the French verbs conjugation,
    which is quite extensive,
  • 19:21 - 19:23
    and I won't show you
    the Russian mask
  • 19:23 - 19:25
    because it doesn't fit the screen.
  • 19:26 - 19:30
    And we also have
    some detailed specifications
  • 19:30 - 19:33
    because we distinguish
    what is at the form level.
  • 19:33 - 19:38
    So here you have Russian nouns
    that have three numbers
  • 19:38 - 19:40
    and a number of cases
    with different forms,
  • 19:40 - 19:43
    but they also have
    an entry level specification
  • 19:43 - 19:46
    that says a noun particularly has
  • 19:46 - 19:50
    an inherent gender
    and an inherent animacy feature
  • 19:50 - 19:52
    that is also specified in the mask.
  • 19:55 - 19:59
    We also want to distinguish
    that a mask gives a specification
  • 19:59 - 20:02
    for, in general,
    what an entry should look like.
  • 20:02 - 20:07
    But you can have smaller masks
    for defective aspects of the form
  • 20:07 - 20:11
    or defective aspects of the lexeme
    that happen in language.
  • 20:11 - 20:15
    So here is the simplest version
    of French verbs
  • 20:15 - 20:20
    that have only the 3rd person singular
    for all the weather verbs,
  • 20:20 - 20:24
    like "it rains" or "it snows,"
    like in English.
  • 20:25 - 20:26
    So we distinguish these two levels.
  • 20:27 - 20:30
    And how we use this at Google
  • 20:30 - 20:33
    is that when we have a lexicon
    that we want to use,
  • 20:33 - 20:38
    we use the mask to really
    literally throw the lexicons,
  • 20:38 - 20:40
    all the entries, through the mask
  • 20:40 - 20:44
    and see which entry has a problem
    in terms of structure.
  • 20:44 - 20:47
    Are we missing a form?
    Are we missing a feature?
  • 20:47 - 20:51
    And when there is a problem,
    we do some human validation
  • 20:51 - 20:54
    or just check whether it passes the mask.
  • 20:54 - 20:58
    So it's an extremely powerful tool
    to check the quality of the structure.
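[Editor's note: the check Bruno describes, "throwing" every entry through the mask and flagging structural problems, can be sketched as below. The Italian-adjective mask (four forms, each a gender/number combination) follows his earlier example; the data structures are assumptions, not Google's internal format.]

```python
ITALIAN_ADJECTIVE_MASK = {
    ("masculine", "singular"), ("masculine", "plural"),
    ("feminine", "singular"), ("feminine", "plural"),
}

def check(entry_forms, mask):
    """Return (missing, unexpected) feature combinations for one entry."""
    have = {f["features"] for f in entry_forms}
    return mask - have, have - mask

rosso = [  # an incomplete entry: the feminine plural has not been added yet
    {"repr": "rosso", "features": ("masculine", "singular")},
    {"repr": "rossi", "features": ("masculine", "plural")},
    {"repr": "rossa", "features": ("feminine", "singular")},
]
missing, unexpected = check(rosso, ITALIAN_ADJECTIVE_MASK)
print(missing)  # -> {('feminine', 'plural')}
```

Entries that fail the check go to human validation, as described above.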
  • 20:59 - 21:02
    So what we are happy to announce today
  • 21:02 - 21:05
    is that we got the green light
    to open-source our masks.
  • 21:06 - 21:08
    So this is a schema.
  • 21:08 - 21:09
    If you want that, we can release
  • 21:09 - 21:13
    and that we will provide
    to Wikidata as ShEx files.
  • 21:13 - 21:17
    This is a ShEx file for German nouns,
  • 21:17 - 21:20
    and Denny is working on the conversion
    from our internal specification
  • 21:20 - 21:24
    to a more open-source specification.
  • 21:24 - 21:28
    We currently cover more than 25 languages.
  • 21:28 - 21:29
    So we expect to grow on our side,
  • 21:29 - 21:34
    but we also look for this opportunity
    to collaborate for other languages.
  • 21:34 - 21:41
    And there is also an ongoing collaboration
    that Denny has with Lukas.
  • 21:41 - 21:45
    Lukas has these great tools to have a UI
  • 21:45 - 21:51
    to help the user or the contributor
    to add more forms.
  • 21:51 - 21:54
    So if you want to add
    an adjective in French,
  • 21:54 - 21:59
    the UI is telling you
    how many forms are expected
  • 21:59 - 22:02
    and what kind of features
    this form should have.
  • 22:02 - 22:06
    So our mask will help the tool
    to be defined and expanded.
  • 22:07 - 22:08
    That's it.
  • 22:09 - 22:10
    (Lydia) Thank you so much.
  • 22:10 - 22:12
    (applause)
  • 22:14 - 22:17
    Alright. Are there questions?
  • 22:17 - 22:19
    Do you want to talk more about lexemes?
  • 22:20 - 22:21
    - (person 3) Yes.
    - Yes. (chuckles)
  • 22:33 - 22:35
    (person 3) My question,
    because you were talking
  • 22:35 - 22:39
    about giving more access
    to more people in more languages.
  • 22:39 - 22:42
    But there are a lot of languages
    that can't be used in Wikidata.
  • 22:42 - 22:45
    So what solution do you have for that?
  • 22:46 - 22:48
    When you say they can't use Wikidata,
  • 22:48 - 22:50
    are you talking about entering labels?
  • 22:50 - 22:53
    - (person 3) Labels, descriptions.
    - Right.
  • 22:53 - 22:55
    So, for lexemes, it's a bit different
  • 22:55 - 22:58
    because there we don't have
    that restriction.
  • 22:59 - 23:05
    For labels on items and properties,
    there is some restriction
  • 23:05 - 23:12
    because we wanted to make sure
    that it's not a free-for-all where
  • 23:12 - 23:14
    anyone does anything,
  • 23:14 - 23:18
    and it becomes unmanageable.
  • 23:19 - 23:23
    But if even a small community wants
    a language and wants to work on that,
  • 23:24 - 23:27
    come talk to us, we will make it happen.
  • 23:27 - 23:29
    (person 3) I mean, we did this
    at the Prague Hackathon in May,
  • 23:29 - 23:32
    and it took us until almost August
    in order to be able to use our language.
  • 23:32 - 23:35
    - Yeah.
    - (person 3) So, it's very slow.
  • 23:35 - 23:38
    Yeah, it is, unfortunately, very slow.
  • 23:38 - 23:40
    We're currently working
    with the Language Committee
  • 23:40 - 23:46
    on solving some fundamental...
  • 23:50 - 23:55
    Like, getting agreement on what kind
    of languages are actually "allowed,"
  • 23:56 - 23:59
    and that has taken too long,
  • 24:00 - 24:04
    which is the reason why your request
    probably took longer than it should have.
  • 24:05 - 24:06
    (person 3) Thanks.
  • 24:07 - 24:08
    (person 4) Thank you.
  • 24:08 - 24:11
    Lydia, if you remember
    the statistics that you showed,
  • 24:11 - 24:13
    the number of lexemes per language.
  • 24:13 - 24:18
    So, did you count
    all the forms as a data point
  • 24:18 - 24:20
    or only lexemes?
  • 24:21 - 24:23
    (Lydia) Do you mean this?
  • 24:23 - 24:24
    Which one do you mean?
  • 24:24 - 24:26
    (person 4) Yes, exactly.
  • 24:26 - 24:28
    If you remember,
    does this number [inaudible]
  • 24:28 - 24:32
    all the forms for all the lexemes
    or just how many lexemes there are?
  • 24:32 - 24:34
    No, this is just a number of lexemes.
  • 24:34 - 24:35
    (person 4) Just a number of lexemes, okay.
  • 24:35 - 24:37
    So then it is a fair statistic,
  • 24:37 - 24:39
    because if it also
    counted the forms--
  • 24:39 - 24:41
    that's why I'm asking--
  • 24:41 - 24:43
    then all the languages
    with the inflectional morphology,
  • 24:43 - 24:45
    like Russian, Serbian,
    Slovenian, et cetera,
  • 24:45 - 24:48
    they have a natural advantage
    because they have so many.
  • 24:48 - 24:52
    So, this kind of kicks in here
    on this number of forms.
  • 24:52 - 24:54
    (person 4) Yeah, that was this one.
    Thank you.
  • 24:57 - 25:00
    (person 5) So, I had
    a quick question about the...
  • 25:01 - 25:07
    When we're talking about
    the actual items and properties.
  • 25:07 - 25:09
    Like as far as I understand,
  • 25:09 - 25:12
    there is currently no way
    to give an actual source
  • 25:12 - 25:15
    to any of the labels
    and descriptions that are given.
  • 25:15 - 25:18
    So, for example,
    because when you're talking
  • 25:18 - 25:21
    about an item property,
  • 25:21 - 25:25
    like, for example,
    you can get conflicting labels.
  • 25:25 - 25:26
    Yes.
  • 25:26 - 25:28
    (person 5) So this person is like...
  • 25:28 - 25:31
    We were talking about
    indigenous things before, for example.
  • 25:31 - 25:36
    So this person is a Norwegian artist
    according to this source,
  • 25:36 - 25:39
    and a Sami artist,
    according to this source.
  • 25:40 - 25:43
    Or, for example, in Estonian,
    we had an issue
  • 25:43 - 25:48
    where we had to change terminology
    to the official use terminology
  • 25:48 - 25:49
    in official lexicons,
  • 25:49 - 25:52
    but we have no way to indicate really why,
  • 25:52 - 25:54
    like what was the source of this
  • 25:54 - 25:56
    and why this was better
    and what was there before.
  • 25:56 - 25:57
    It was just me as a random person
  • 25:57 - 26:00
    just switching the thing
    to anyone who sees it.
  • 26:00 - 26:03
    So is there a plan
    to make this possible in any way
  • 26:03 - 26:06
    so that we can actually have
    proper sources for the language data?
  • 26:07 - 26:12
    So, it is partially possible.
  • 26:12 - 26:16
    So, for example, when you have
    an item for a person,
  • 26:17 - 26:23
    you have a statement, first name,
    last name, and so on, of that person,
  • 26:23 - 26:26
    and then you can provide
    the reference for that there.
  • 26:28 - 26:33
    I'm quite hesitant to add more complexity
  • 26:33 - 26:36
    for references on labels and descriptions,
  • 26:36 - 26:39
    but if people really, really think
  • 26:39 - 26:45
    this is something that isn't covered
    by any reference on the statement,
  • 26:45 - 26:47
    then let's talk about it.
  • 26:49 - 26:53
    But I fear it will add a lot of complexity
  • 26:53 - 26:57
    for what I hope are few cases,
  • 26:57 - 27:00
    but I'm willing to be convinced otherwise
  • 27:00 - 27:04
    if people really feel
    very strongly about this.
  • 27:04 - 27:08
    (person 5) I mean, if it's added
    it probably shouldn't be the default,
  • 27:08 - 27:12
    shown to all the users in the beginner
    interface, in any case.
  • 27:12 - 27:16
    More like, "Click here if you need to say
    a specific thing about this."
  • 27:18 - 27:23
    Do we have a sense of how many times
    that would actually matter?
  • 27:25 - 27:26
    (person 5) In Estonian, for example--
  • 27:26 - 27:29
    I expect this is true
    of other languages as well--
  • 27:29 - 27:34
    for example, there is an official name
    that is the actual legitimate translation,
  • 27:34 - 27:36
    for example, into English,
  • 27:36 - 27:40
    of, say, a specific kind of municipality.
  • 27:41 - 27:42
    That was my use case, for example,
  • 27:42 - 27:44
    where we were using the word "parish"
  • 27:45 - 27:51
    where the original Estonian word
    meant something like a church parish,
  • 27:51 - 27:52
    and that was the origin,
  • 27:52 - 27:55
    but that's not the official translation
    Estonia uses right now.
  • 27:55 - 27:59
    In this case, I would just add it
    as official name statements
  • 27:59 - 28:01
    and add the reference there.
  • 28:02 - 28:03
    (person 5) Okay.
  • 28:05 - 28:07
    More questions, yes?
  • 28:08 - 28:10
    (person 6) I have two quick comments.
  • 28:10 - 28:14
    You specifically called out Asturian
    as a language that does well,
  • 28:14 - 28:16
    and I think that's an artifact.
  • 28:16 - 28:18
    Tell me about it.
  • 28:18 - 28:20
    (person 6) I think it's just a bot
  • 28:20 - 28:24
    that pasted person names,
    like proper names,
  • 28:24 - 28:27
    and said, "Well, this is exactly
    like in French or Spanish,"
  • 28:27 - 28:29
    and just massively copied it.
  • 28:29 - 28:33
    One point of evidence is that
    you don't see that energy in Asturian
  • 28:33 - 28:37
    in things that actually
    require translation, like property names,
  • 28:37 - 28:40
    or names of items
    that are not proper names.
  • 28:40 - 28:41
    Asaf, you break my heart.
  • 28:41 - 28:43
    (person 6) I know,
    I like raining on parades,
  • 28:43 - 28:48
    but I have good news as well,
    which is about the pronunciation numbers.
  • 28:49 - 28:54
    As you probably know,
    Commons is full of pronunciation files,
  • 28:54 - 28:55
    and, for example,
  • 28:55 - 29:01
    Dutch has no less than 300,000
    pronunciation files already on Commons
  • 29:02 - 29:05
    that just need to somehow be ingested.
  • 29:05 - 29:08
    So if anyone's looking for a side project,
  • 29:08 - 29:09
    there's tons and tons
  • 29:09 - 29:13
    of classified, categorized
    pronunciation files on Commons
  • 29:13 - 29:17
    under the category
    "Pronunciation" by language.
  • 29:17 - 29:23
    So that's just waiting to be matched
    to lexemes and put on Lexeme.
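[Editor's note: the ingestion side project Asaf suggests, matching Commons pronunciation files to existing lexemes by lemma, might start like this. The "Xx-word.ogg" filename convention and the file names are assumptions for illustration; real Commons names vary.]

```python
files = ["Nl-water.ogg", "Nl-fiets.ogg", "Nl-gezellig.ogg"]
dutch_lemmas = {"water", "fiets"}  # lemmas of Dutch lexemes already on Wikidata

def match(files, lemmas):
    """Map each lemma to a candidate audio file, skipping unmatched files."""
    matched = {}
    for name in files:
        word = name.split("-", 1)[-1].rsplit(".", 1)[0].lower()
        if word in lemmas:
            matched[word] = name
    return matched

print(match(files, dutch_lemmas))
# -> {'water': 'Nl-water.ogg', 'fiets': 'Nl-fiets.ogg'}
```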
  • 29:23 - 29:25
    And I was wondering
    if you could say something
  • 29:25 - 29:27
    about the road map,
  • 29:27 - 29:29
    something about how much investment
  • 29:29 - 29:32
    or what can we expect
    from Lexeme in the coming year,
  • 29:32 - 29:34
    because I, for one, can't wait.
  • 29:35 - 29:37
    You can't wait? (chuckles)
  • 29:37 - 29:39
    - (person 6) For more.
    - Yes. (chuckles)
  • 29:45 - 29:50
    Right now, we're concentrating
    more on Wikibase and data quality
  • 29:51 - 29:55
    to see how much traction this gets
  • 29:55 - 30:02
    and then getting more feedback on
    where the pain points are next,
  • 30:02 - 30:06
    and then going back to improving
    lexicographical data further.
  • 30:07 - 30:10
    And one of the things
    I'd love to hear from you
  • 30:10 - 30:14
    is where exactly do you see
    the next steps,
  • 30:14 - 30:16
    where do you want to see improvements
  • 30:16 - 30:20
    so that we can then figure out
    how to make that happen.
  • 30:21 - 30:23
    But, of course, you're right,
  • 30:23 - 30:26
    there's still so much to do
    also on the technical side.
  • 30:31 - 30:36
    (person 7) Okay, as we were uploading
    the Basque words with forms,
  • 30:36 - 30:38
    and you'll see some
    of these kinds of things,
  • 30:38 - 30:41
    we were both like, last week we said,
    "Oh, we are the first one in something."
  • 30:43 - 30:45
    It appears in the press, and it's like,
  • 30:45 - 30:49
    "Oh, Basque are the first time in some--
    they are the first in something, okay."
  • 30:49 - 30:51
    (laughs)
  • 30:51 - 30:53
    And then people ask,
    "Okay, but what is this for?"
  • 30:55 - 30:57
    We don't have a real good answer.
  • 30:57 - 30:58
    I mean it's like, okay,
  • 30:58 - 31:02
    this will help computers
    to understand our language better, yes,
  • 31:02 - 31:05
    but what kind of tools
    can we make in the future?
  • 31:05 - 31:07
    And we don't have a good answer for this.
  • 31:07 - 31:11
    So I don't know
    if you have a good answer for this.
  • 31:11 - 31:13
    (chuckles) I don't know
    if I have a good answer,
  • 31:13 - 31:15
    but I have an answer.
  • 31:15 - 31:20
    So I think right now
    as I was telling [inaudible],
  • 31:20 - 31:22
    we haven't reached that critical mass
  • 31:22 - 31:26
    where you can build a lot
    of the really interesting tools.
  • 31:26 - 31:28
    But there are already some tools.
  • 31:28 - 31:32
    Just the other day,
    Esther [Pandelia], for example,
  • 31:32 - 31:34
    released a tool where you can see,
  • 31:36 - 31:39
    I think it was the words on a globe
  • 31:39 - 31:42
    where they're spoken,
    where they're coming from.
  • 31:43 - 31:44
    I'm probably wrong about this,
  • 31:44 - 31:46
    but she had answered
    on the Project chat on Wikidata--
  • 31:46 - 31:49
    you can look it up there.
  • 31:50 - 31:52
    So we have seen these first tools,
  • 31:52 - 31:56
    just like we've seen
    back when Wikidata started.
  • 31:57 - 32:00
    First some--like just a network,
  • 32:00 - 32:03
    and like, "Hey, look, there's this thing
    that connects to this other thing."
  • 32:05 - 32:07
    And as we have more data,
  • 32:07 - 32:10
    and as we've reached some critical mass,
  • 32:12 - 32:15
    more powerful applications
    become possible,
  • 32:16 - 32:18
    things like Histropedia,
  • 32:19 - 32:22
    things like question answering
  • 32:22 - 32:27
    in your digital personal assistant,
    Platypus, and so on.
  • 32:27 - 32:30
    And we're seeing
    a similar thing with lexemes.
  • 32:31 - 32:35
    We're at the stage
    where you can build like these little,
  • 32:35 - 32:37
    "hey, look, there's a connection
    between the two things,"
  • 32:38 - 32:43
    and "there's a translation
    of this word into that language" stage,
  • 32:43 - 32:48
    and as we build it out
    and as we describe more words,
  • 32:48 - 32:50
    more becomes possible.
  • 32:50 - 32:52
    Now, what becomes possible?
  • 32:53 - 32:59
    As Ben, our keynote speaker,
    was saying earlier about translations,
  • 33:00 - 33:03
    being able to translate
    from one language to another.
  • 33:03 - 33:08
    And Jens, my colleague,
    he's always talking about
  • 33:08 - 33:11
    the European Union
    looking for a translator
  • 33:11 - 33:17
    who can translate from
    I think it was Maltese to Swedish--
  • 33:17 - 33:19
    - (person 8) Estonian.
    - Estonian.
  • 33:22 - 33:26
    And that is not a usual combination.
  • 33:27 - 33:32
    But once you have all these languages
    in one machine-readable place,
  • 33:32 - 33:33
    you can do that,
  • 33:33 - 33:37
    you can get a dictionary
  • 33:37 - 33:42
    from Estonian to Maltese and back.
  • 33:43 - 33:46
    So covering language
    combinations in dictionaries
  • 33:46 - 33:48
    that just haven't been covered before
  • 33:48 - 33:51
    because there wasn't
    enough demand for it, for example,
  • 33:51 - 33:56
    to make it financially viable
    and to justify the work.
  • 33:56 - 33:57
    Now we can do that.
  • 34:00 - 34:02
    Then text generation.
  • 34:02 - 34:04
    Lucie was earlier talking
  • 34:04 - 34:10
    about how she's working
    with Hattie on generating text
  • 34:10 - 34:15
    to get Wikipedia articles
    in minority languages started,
  • 34:15 - 34:20
    and that needs data about words,
  • 34:20 - 34:23
    and you need to understand
    the language to do that.
  • 34:24 - 34:28
    Yeah, and those are just some
    that come to my mind right now.
  • 34:29 - 34:30
    Maybe our audience has more ideas
  • 34:30 - 34:34
    what they want to do
    when we have all the glorious data.
  • 34:38 - 34:41
    (person 9) Okay, I will deviate
    from the lexemes topic.
  • 34:41 - 34:43
    I will ask the question,
  • 34:43 - 34:46
    how can I, as a member of the community,
  • 34:46 - 34:50
    influence that priority is put on the task
  • 34:50 - 34:57
    that a new user comes, and he can indicate
    what languages he wants to see and edit
  • 34:57 - 35:01
    without some secret Babel
    template knowledge.
  • 35:02 - 35:05
    Maybe this year there will be
    this technical wish list
  • 35:05 - 35:07
    without Wikipedia topics.
  • 35:07 - 35:10
    Maybe there's a hope
    we can all vote about
  • 35:10 - 35:14
    this thing we didn't fix for seven years.
  • 35:14 - 35:18
    So do you have any ideas
    and comments about this?
  • 35:18 - 35:20
    So you're talking about the fact
  • 35:20 - 35:24
    that someone who is
    not logged into Wikidata
  • 35:24 - 35:26
    can't change their language easily?
  • 35:26 - 35:28
    (person 9) No, for [inaudible] users.
  • 35:28 - 35:31
    So, if they are logged in,
  • 35:31 - 35:35
    they can just change their language
    at the top of the page,
  • 35:36 - 35:38
    and then it will appear
  • 35:40 - 35:42
    where the labels and descriptions
    [inaudible] are,
  • 35:42 - 35:43
    and they can edit it.
  • 35:46 - 35:49
    (person 9) Well, actually, usually
    many times the workflow
  • 35:49 - 35:52
    is that you want to have
    multiple languages available,
  • 35:52 - 35:55
    and it's not always the case.
  • 35:55 - 35:59
    Okay, maybe we should sit down
    after this talk and you show me.
  • 36:02 - 36:04
    Cool. More questions?
  • 36:06 - 36:07
    Yes.
  • 36:12 - 36:13
    (person 10) Thanks for the presentation.
  • 36:14 - 36:15
    Can you comment
  • 36:15 - 36:19
    on the state of the collaboration
    with the Wiktionary community?
  • 36:19 - 36:22
    As far as I've seen,
    there were some discussions
  • 36:22 - 36:26
    about importing some elements of the work,
  • 36:26 - 36:31
    but there seems to be licensing issues
    and some disagreements, et cetera.
  • 36:31 - 36:32
    Right.
  • 36:32 - 36:36
    So, Wiktionary communities
    have spent a lot of time
  • 36:37 - 36:39
    building Wiktionary.
  • 36:39 - 36:43
    They have built
  • 36:43 - 36:48
    amazingly complicated
    and complex templates
  • 36:48 - 36:54
    to build pretty tables
    that automatically generate forms for you
  • 36:54 - 36:56
    and all kinds of really impressive,
  • 36:56 - 37:01
    and kind of crazy stuff,
    if you think about it.
  • 37:02 - 37:08
    And, of course, they have invested
    a lot of time and effort into that.
  • 37:09 - 37:12
    And understandably,
  • 37:12 - 37:17
    they don't just want that to be grabbed,
  • 37:18 - 37:19
    just like that.
  • 37:19 - 37:22
    So there's some of that coming from there.
  • 37:23 - 37:25
    And that's fine, that's okay.
  • 37:26 - 37:32
    Now, the first Wiktionary communities
    are talking about turning around
  • 37:32 - 37:34
    and importing some
    of their data into Wikidata.
  • 37:34 - 37:39
    Russian, you have seen,
    for example, is one of those cases.
  • 37:40 - 37:42
    And I expect more of that to happen.
  • 37:44 - 37:47
    But it will be a slow process,
  • 37:47 - 37:49
    just like adoption
    of Wikidata's data on Wikipedia
  • 37:49 - 37:52
    has been a rather slow process.
  • 37:53 - 37:56
    On the other side, we're working
    on making it actually easier
  • 37:56 - 37:59
    to use the data that is in lexemes,
  • 37:59 - 38:02
    on Wiktionary, so that
    they can make use of that
  • 38:02 - 38:06
    and share data between
    the language Wiktionaries,
  • 38:06 - 38:09
    which is super hard
    to impossible right now,
  • 38:09 - 38:12
    which is crazy,
    just like it was on Wikipedia.
  • 38:14 - 38:16
    Wait for the birthday present. (chuckles)
  • 38:20 - 38:21
    Yes.
  • 38:23 - 38:25
    (person 11) I was thinking about it
    the other way around,
  • 38:25 - 38:28
    I actually didn't want to say it
    because I think this will be super silly,
  • 38:28 - 38:32
    but I think that Wiktionary
    already has some content,
  • 38:32 - 38:35
    and I know that
    we can't transfer it to Wikidata
  • 38:35 - 38:37
    because there's a difference in licenses.
  • 38:37 - 38:40
    But I was thinking maybe
    we can do something about that.
  • 38:40 - 38:46
    Maybe, I don't know, we can obtain
    the communities' permission
  • 38:46 - 38:51
    after like, I don't know,
    having like a public voting
  • 38:52 - 38:56
    and for the community,
    the active members of the community
  • 38:56 - 39:03
    to vote and say if they would like
    or accept to transfer the content
  • 39:03 - 39:06
    from which they may build
    the Wikidata lexemes.
  • 39:06 - 39:09
    Because I just think it is such a waste.
  • 39:10 - 39:14
    So, that's definitely
    a conversation those people
  • 39:14 - 39:18
    who are in Wiktionary communities
    are very welcome to bring up there.
  • 39:18 - 39:25
    I think it would be a bit presumptuous
    for us to go and force that.
  • 39:26 - 39:31
    But, yeah, I think it's definitely worth
    having a conversation.
  • 39:31 - 39:34
    But I think it's also important
    to understand
  • 39:34 - 39:39
    that there's a distinction between
    what is actually legally allowed
  • 39:39 - 39:43
    and what we should be doing
  • 39:43 - 39:45
    and what those people want or do not want.
  • 39:46 - 39:47
    So even if it's legally allowed,
  • 39:47 - 39:51
    if some other Wiktionary communities
    do not want that,
  • 39:51 - 39:54
    I would be careful, at least.
  • 39:59 - 40:02
    I think you need the mic
    for the stream.
  • 40:05 - 40:07
    (person 12) So, obviously,
    it's all very exciting,
  • 40:08 - 40:12
    and I immediately think
    how can I take that to my students
  • 40:12 - 40:16
    and how can I incorporate it
    with the courses,
  • 40:16 - 40:19
    the work that we're doing,
    educational settings.
  • 40:19 - 40:22
    And I don't have, at the moment,
  • 40:23 - 40:24
    first of all, enough knowledge,
  • 40:24 - 40:27
    but I think the documentation
    that we do have
  • 40:28 - 40:30
    could be maybe improved.
  • 40:30 - 40:33
    So that's a kind of request
    to make cool videos
  • 40:33 - 40:36
    that explain how it works
  • 40:36 - 40:40
    because if we have it, we can then use it,
  • 40:40 - 40:42
    and we can have students on board,
  • 40:42 - 40:47
    and we can make people understand
    how awesome it all is.
  • 40:47 - 40:52
    And yeah, just think about documentation
    and think about education, please.
  • 40:52 - 40:54
    Because I think a lot could be done.
  • 40:54 - 40:59
    These are like many tasks
    that could be done even with...
  • 41:00 - 41:02
    well, I wouldn't say primary schools,
  • 41:02 - 41:05
    but certainly, even younger students.
  • 41:06 - 41:11
    And so I would really like to see
    that potential being tapped into,
  • 41:11 - 41:15
    and, as of now, I personally
    don't understand enough
  • 41:15 - 41:20
    to be able to create tasks
    or to create like...
  • 41:20 - 41:22
    to do something practical with it.
  • 41:22 - 41:26
    So any help, any thoughts
    anyone here has about that,
  • 41:26 - 41:30
    I would be very happy to hear
    your thoughts, and yours as well.
  • 41:31 - 41:32
    Yeah, let's talk about that.
  • 41:35 - 41:37
    More questions?
  • 41:38 - 41:39
    Someone else raised a hand.
  • 41:39 - 41:40
    I forgot where it was.
  • 41:46 - 41:50
    (person 13) So, if we can't import
    from Wiktionary,
  • 41:50 - 41:56
    is there some concerted effort
    to find other public domain sources,
  • 41:56 - 41:57
    maybe all the data,
  • 41:59 - 42:03
    and kind of prefilter it, organize it
  • 42:03 - 42:08
    so that it's easy to be checked
    by people for import?
  • 42:09 - 42:11
    So there are first efforts.
  • 42:11 - 42:15
    My understanding is that Basque
    is one of those efforts.
  • 42:15 - 42:17
    Maybe you want to say
    a bit more about it?
  • 42:18 - 42:20
    (person 14) [inaudible]
  • 42:23 - 42:27
    Okay, the actual answer
    is paying for that...
  • 42:28 - 42:33
    I mean, we have an agreement
    with a contractor we usually work with.
  • 42:35 - 42:39
    They do dictionaries--
  • 42:40 - 42:42
    lots of stuff, but they do dictionaries.
  • 42:42 - 42:47
    So we agreed with them
    to make the students' dictionary free,
  • 42:47 - 42:53
    we would [cast] the most common words
    and start uploading it
  • 42:53 - 42:56
    with an external identifier
    and the scheme of things.
  • 42:56 - 43:03
    But there was some discussion
    about leaving it on CC0
  • 43:03 - 43:05
    because they have
    the dictionary under CC BY,
  • 43:07 - 43:10
    and they understood
    what the difference was.
  • 43:10 - 43:14
    So there was some discussion.
  • 43:14 - 43:20
    But I think that we can provide some tools
    or some examples in the future,
  • 43:20 - 43:22
    and I think that there will be
    other dictionaries
  • 43:22 - 43:24
    that we can handle,
  • 43:24 - 43:29
    and also I think Wiktionary
    should start moving in that direction,
  • 43:29 - 43:32
    but that's another great discussion.
  • 43:33 - 43:34
    And on top of that,
  • 43:34 - 43:39
    Lea is also in contact
    with people from Occitan
  • 43:39 - 43:42
    who work on Occitan dictionaries,
  • 43:42 - 43:45
    and they're currently working
    on a Sumerian collaboration.
  • 43:52 - 43:53
    More questions?
  • 44:01 - 44:05
    (person 15) Hi! We are the people
    who want to import Occitan data.
  • 44:05 - 44:07
    Aha! Perfect!
  • 44:07 - 44:08
    (person 15) And we have a small problem.
  • 44:09 - 44:14
    We don't know how to represent
    the variety of all lexemes.
  • 44:14 - 44:18
    We have six dialects,
  • 44:18 - 44:24
    and we want to indicate for a lexeme
    in which dialect it's used,
  • 44:24 - 44:27
    and we don't have a proper
    statement to do that.
  • 44:27 - 44:31
    So as long as the statement doesn't exist,
  • 44:32 - 44:34
    it prevents us from [inaudible]
  • 44:34 - 44:38
    because we will need to do it again
  • 44:38 - 44:42
    when we will be able
    to [export] the statement.
  • 44:42 - 44:45
    And it's complicated
    because it's a statement
  • 44:45 - 44:48
    which won't be asked by many people
  • 44:48 - 44:53
    because it's a statement
    which concerns mostly minority languages.
  • 44:53 - 44:57
    So you will have one person asking for this.
  • 44:57 - 45:00
    But as with our Basque colleagues,
  • 45:00 - 45:06
    it can be one person
    who will empower thousands of others,
  • 45:06 - 45:11
    so it might not be asking a lot,
  • 45:11 - 45:14
    but it will be very important for us.
  • 45:15 - 45:18
    Do you already have
    a new property proposal up,
  • 45:18 - 45:19
    or do you need help creating it?
  • 45:22 - 45:24
    (person 15) We asked four months ago.
  • 45:25 - 45:29
    Alright, then let's get some people
    to help out with this property proposal.
  • 45:30 - 45:33
    I'm sure there are enough people
    in this room to make this happen.
  • 45:33 - 45:35
    (person 15) Property proposal
    [speaking in French].
  • 45:35 - 45:37
    (person 16) We didn't have an answer.
  • 45:37 - 45:40
    (person 15) We didn't have any answer,
    and we don't know how to do this
  • 45:40 - 45:43
    because we aren't
    in the Wikidata community.
  • 45:45 - 45:49
    Yup, so there are people here
    who can help you.
  • 45:49 - 45:52
    Maybe someone raises their hand to take--
  • 45:53 - 45:54
    (person 14) I'm for that.
  • 45:54 - 45:56
    But I think this is quite interesting
  • 45:56 - 45:59
    that not only the variant of a form--
  • 45:59 - 46:03
    we can also handle it geographically,
  • 46:03 - 46:05
    with coordinates or some kind of mapping.
  • 46:06 - 46:08
    Also having different pronunciations,
  • 46:08 - 46:12
    and I think this is something
    that happens in lots of languages.
  • 46:13 - 46:16
    We should start making
    it happen [inaudible],
  • 46:16 - 46:19
    and I'm going to search for the property.
  • 46:20 - 46:21
    Cool.
  • 46:21 - 46:24
    So you will get backing
    for your property proposal.
  • 46:26 - 46:27
    Thank you.
  • 46:28 - 46:30
    Alright, more questions?
  • 46:32 - 46:33
    Finn.
  • 46:34 - 46:35
    Finn is one of those people
  • 46:35 - 46:38
    who builds stuff
    on top of lexicographical data.
  • 46:38 - 46:40
    (Finn) It's just a small question,
  • 46:40 - 46:44
    and that's about spelling variations.
  • 46:45 - 46:48
    It seems to be difficult to put them in...
  • 46:49 - 46:53
    You could, of course,
    have multiple forms for the same word.
  • 46:56 - 46:58
    I don't know, it seems to be...
  • 47:00 - 47:04
    If you don't do it that way,
    it seems to be difficult to specify...
  • 47:05 - 47:06
    or I don't know whether
  • 47:06 - 47:10
    this is just a minor technical issue
    or whether...
  • 47:10 - 47:11
    Let's look at it together.
  • 47:12 - 47:15
    I would love to see an example.
  • 47:17 - 47:18
    Asaf.
  • 47:27 - 47:28
    (Asaf) Thank you.
  • 47:29 - 47:34
    I can give a very concrete example
    from my mother tongue, Hebrew.
  • 47:34 - 47:39
    Hebrew has two main variants
  • 47:39 - 47:43
    for expressing almost every word
  • 47:43 - 47:48
    because the traditional spelling
  • 47:48 - 47:50
    leaves out many of the vowels.
  • 47:51 - 47:55
    And, therefore, in modern editions
    of the Bible and of poetry,
  • 47:55 - 47:57
    diacritics are used.
  • 47:57 - 48:03
    However, those diacritics
    are never used for modern prose
  • 48:03 - 48:06
    or newspaper writing or street signs.
  • 48:06 - 48:11
    So the average daily casual use
    puts in extra vowels
  • 48:12 - 48:14
    and doesn't use the diacritics
  • 48:14 - 48:16
    because they are,
    of course, more cumbersome
  • 48:16 - 48:18
    and have all kinds of rules
    and nobody knows the rules.
  • 48:19 - 48:21
    So there are basically two variants.
  • 48:21 - 48:25
    There's the everyday casual prose variant,
  • 48:25 - 48:28
    and there's the Bible or poetry,
  • 48:28 - 48:32
    which always come
    in this traditional diacriticized text.
  • 48:32 - 48:33
    To be useful,
  • 48:33 - 48:37
    Lexeme would have to recognize
    both varieties of every single word
  • 48:37 - 48:40
    and every single form
    of every single word.
  • 48:41 - 48:43
    So that's a very comprehensive use case
  • 48:43 - 48:46
    for official stable variants.
  • 48:46 - 48:49
    It's not dialect, it's not regions,
  • 48:49 - 48:54
    it's basically two coexisting
    morphological systems.
  • 48:55 - 48:59
    And I too don't know exactly
    how to express that in Lexeme today,
  • 48:59 - 49:03
    which is one thing that is keeping me--
    in partial answer to Magnus' question--
  • 49:03 - 49:05
    from uploading the parts that are ready
  • 49:05 - 49:09
    from the biggest Hebrew dictionary,
    which is public domain
  • 49:09 - 49:13
    and which I have been digitizing
    for several years now.
  • 49:13 - 49:15
    A good portion of it is ready,
  • 49:15 - 49:17
    but I'm not putting it on Lexeme right now
  • 49:17 - 49:20
    because I don't know exactly
    how to solve this problem.
  • 49:20 - 49:23
    Alright, let's solve
    this problem here. (chuckles)
  • 49:25 - 49:26
    That has to be possible.
  • 49:30 - 49:32
    Alright, more questions?
  • 49:37 - 49:40
    If not, then thank you so much.
  • 49:41 - 49:43
    (applause)
Title:
cdn.media.ccc.de/.../wikidatacon2019-2-eng-Wikidata_and_languages_hd.mp4
Video Language:
English
Duration:
49:51
