
cdn.media.ccc.de/.../wikidatacon2019-1060-eng-New_usages_of_Wikidata_to_support_underserved_language_communities_hd.mp4

  • 0:06 - 0:07
    Hi, I'm Lucie.
  • 0:07 - 0:12
    You know me from rambling about
    not enough language data in Wikidata,
  • 0:12 - 0:16
    and I thought instead of rambling today,
    which I'll leave to Lydia later today,
  • 0:17 - 0:20
    I'll just show you a bit, or give you
    an insight into the projects we did
  • 0:20 - 0:25
    using the data that we already have
    on Wikidata, for different causes.
  • 0:25 - 0:29
    So underserved languages
    compared to the keynote we just heard
  • 0:29 - 0:33
    where the person was talking about
    underserved as like minority languages,
  • 0:33 - 0:36
    underserved languages to me
    are any languages
  • 0:36 - 0:39
    that don't have
    enough representation on the web.
  • 0:39 - 0:41
    Yeah, just to get that clear.
  • 0:41 - 0:43
    So, who am I?
  • 0:43 - 0:46
    Why am I always talking
    about languages on Wikidata?
  • 0:46 - 0:48
    Not sure but...
  • 0:48 - 0:50
    I'm a Computer Science PhD student
  • 0:50 - 0:52
    at the University of Southampton.
  • 0:52 - 0:55
    I'm a research intern
    at Bloomberg in London, at the moment.
  • 0:55 - 0:58
    I'm a resident
    at Newspeak House in London.
  • 0:58 - 1:02
    I am a researcher and project manager
    for the Scribe project,
  • 1:02 - 1:03
    which I'll go into in a bit,
  • 1:03 - 1:09
    and I recently got into the idea
    of oral knowledge and oral citation.
  • 1:09 - 1:10
    Kimberly is sitting right there.
  • 1:11 - 1:13
    And then, occasionally,
    I have time to sleep
  • 1:13 - 1:16
    and do other things, but that's very rare.
  • 1:17 - 1:19
    So if you're interested
    in any of those things,
  • 1:19 - 1:20
    come talk and speak to me.
  • 1:20 - 1:23
    Generally, this is an open presentation
    with a few questions in between.
  • 1:23 - 1:27
    I'll run through a lot of things
    in a very short time now.
  • 1:27 - 1:30
    Come to me afterwards
    if you're interested in any of them.
  • 1:31 - 1:32
    Speak to me. I'm here.
  • 1:32 - 1:35
    I'm always very happy to speak to people.
  • 1:35 - 1:39
    So that's a bit of what
    we will talk about today.
  • 1:39 - 1:41
    So Wikidata, giving an introduction,
  • 1:41 - 1:44
    even though that's obviously
    not as necessary.
  • 1:45 - 1:48
    The article placeholder
    is aimed at Wikipedia readers,
  • 1:48 - 1:51
    then Scribe, which is aimed
    at Wikipedia editors,
  • 1:51 - 1:54
    and then we have one topic of my research,
  • 1:54 - 1:57
    which is completely outside of Wikipedia
  • 1:57 - 2:00
    where we use Wikidata
    for question answering.
  • 2:02 - 2:04
    So just a quick run-through.
  • 2:04 - 2:07
    Why is Wikidata so cool
    for low-resource languages
  • 2:07 - 2:11
    where we have those unique identifiers?
  • 2:11 - 2:13
    I'm speaking to people that know that
  • 2:13 - 2:15
    much better than me even.
  • 2:15 - 2:18
    And then we have labels
    in different languages.
  • 2:18 - 2:22
    Those can be in over,
    I think, 400 languages by now,
  • 2:22 - 2:24
    so we have a good option here
  • 2:24 - 2:28
    to reuse language
    in different forms and capture it.
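The label coverage described here can be inspected directly via the Wikidata Query Service. A minimal sketch that only constructs such a SPARQL query (Q42, Douglas Adams, is an arbitrary well-known example item; the query is built locally and not executed here):

```python
# Sketch: build a SPARQL query listing every language label of one
# Wikidata item. Q42 (Douglas Adams) is just an example; the query would
# normally be sent to https://query.wikidata.org/sparql, which is not
# done here.

def label_query(qid: str) -> str:
    """Return a SPARQL query selecting all labels of `qid` with their language."""
    return (
        "SELECT ?lang ?label WHERE {\n"
        f"  wd:{qid} rdfs:label ?label .\n"
        "  BIND(LANG(?label) AS ?lang)\n"
        "}\n"
        "ORDER BY ?lang"
    )

print(label_query("Q42"))
```

The `wd:` and `rdfs:` prefixes are predeclared on the Wikidata Query Service, so the query can be pasted there as-is.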
  • 2:29 - 2:33
    Yeah, so that's a little bit of me
    rambling about Wikidata
  • 2:33 - 2:35
    because I can't stop it.
  • 2:35 - 2:37
    We compared coverage in Wikidata
    to the number of native speakers,
  • 2:37 - 2:39
    so we can see, obviously,
  • 2:39 - 2:42
    there are languages
    that are widely spoken in the world.
  • 2:42 - 2:44
    There's Chinese, Hindi, or Arabic,
  • 2:44 - 2:47
    but then very low coverage on Wikidata.
  • 2:48 - 2:50
    Then the opposite.
  • 2:50 - 2:53
    So we have the Dutch
    and the Swedish communities,
  • 2:53 - 2:55
    which were super active in Wikidata,
  • 2:55 - 2:58
    which is really cool,
    and that just points out
  • 2:58 - 3:01
    that even though we have
    a low number of speakers,
  • 3:01 - 3:07
    we can have a big impact if people
    are very active in the communities,
  • 3:07 - 3:09
    which is really nice and really good.
  • 3:09 - 3:14
    But also let's try to even
    that graph out in the future.
  • 3:15 - 3:19
    So, cool. So now we have
    all this language data in Wikidata.
  • 3:19 - 3:22
    We have low-resource Wikipedias,
    so we thought, what can we do?
  • 3:22 - 3:27
    Well, my undergrad supervisor
    is sitting here,
  • 3:27 - 3:31
    and we worked back then
    in the golden days,
  • 3:31 - 3:34
    on something called
    the article placeholder
  • 3:35 - 3:39
    which takes triples from Wikidata
    and displays them on Wikipedia.
  • 3:39 - 3:42
    And that's pretty much
    relatively straightforward.
  • 3:42 - 3:46
    So you just take the content of Wikidata,
    display it on Wikipedia
  • 3:46 - 3:49
    to attract more readers
    and then eventually more editors
  • 3:49 - 3:51
    in the different low-resource languages.
  • 3:51 - 3:53
    They are dynamically generated,
  • 3:53 - 3:56
    so they're not like stubs or bot articles
  • 3:56 - 4:00
    that then flood the Wikipedia
    so people can edit them.
  • 4:00 - 4:02
    It's basically a starting point.
  • 4:02 - 4:05
    And we thought,
    well, we have that content,
  • 4:05 - 4:09
    and we have that knowledge
    somewhere already, which is Wikidata.
  • 4:09 - 4:12
    It's often already in the languages,
    but they don't have articles,
  • 4:12 - 4:15
    so at least give them
    the insight into the information.
  • 4:15 - 4:19
    The article placeholders are live
    on 14 low-resource Wikipedias.
  • 4:20 - 4:22
    If you are a Wikipedia community--
  • 4:22 - 4:25
    if you are part of a Wikipedia community
    and interested in it,
  • 4:25 - 4:26
    let us know.
  • 4:28 - 4:30
    And then I went into research,
  • 4:30 - 4:33
    and I got stuck with
    the article placeholder, though,
  • 4:33 - 4:36
    so we started to look into
    text generation from Wikidata
  • 4:36 - 4:38
    for Wikipedia and low-resource languages.
  • 4:38 - 4:40
    And text generation is really interesting
  • 4:40 - 4:43
    because research, at the point
    when we started the project,
  • 4:43 - 4:46
    was focused completely on English,
  • 4:46 - 4:49
    which is a bit pointless in my experience
  • 4:49 - 4:51
    because, I mean, you have a lot of people
    who write in English,
  • 4:51 - 4:55
    but then what we need is people
    who write in those low-resource languages.
  • 4:55 - 4:59
    And our starting point was that
    looking at triples on Wikipedia
  • 4:59 - 5:02
    is not exactly the nicest thing.
  • 5:02 - 5:04
    I mean, as much as I love
    the article placeholder,
  • 5:04 - 5:06
    it's not exactly
    what you want to see or expect
  • 5:06 - 5:08
    when you open a Wikipedia page.
  • 5:08 - 5:10
    So we try to generate text.
  • 5:10 - 5:12
    We use this beautiful
    neural network model,
  • 5:12 - 5:13
    where we encode Wikidata triples.
  • 5:13 - 5:16
    If you're interested more
    in the technical parts,
  • 5:16 - 5:17
    come and talk to me.
  • 5:17 - 5:22
    And so, realistically,
    with neural text generation,
  • 5:22 - 5:24
    you can generate one or two sentences
  • 5:24 - 5:28
    before it completely scrambles
    and becomes useless.
  • 5:28 - 5:33
    So we've generated one sentence
    that describes the topic of the triple.
  • 5:33 - 5:36
    And so this, for example, is Arabic.
  • 5:36 - 5:39
    We generate the sentence about Marrakesh,
  • 5:39 - 5:41
    where it just describes the city.
  • 5:42 - 5:46
    So for that, then, we tested this--
  • 5:46 - 5:49
    So we did studies, obviously,
    to test if our approach works,
  • 5:49 - 5:52
    and if it makes sense to use such things.
  • 5:52 - 5:56
    And because we are
    very application-focused,
  • 5:56 - 5:59
    we tested it with actual
    Wikipedia readers and editors.
  • 5:59 - 6:01
    So, first, we tested it
    with Wikipedia readers
  • 6:01 - 6:03
    in Arabic and Esperanto--
  • 6:03 - 6:06
    so use cases with Arabic and Esperanto.
  • 6:08 - 6:13
    And we can see that our model
    can generate sentences
  • 6:13 - 6:14
    that are very fluent
  • 6:14 - 6:18
    and that feel very much--
    surprisingly, a lot, actually--
  • 6:18 - 6:20
    like Wikipedia sentences.
  • 6:20 - 6:23
    So it picks up-- we train,
    for example, for Arabic,
  • 6:23 - 6:26
    only on Arabic, with the idea to say
  • 6:26 - 6:30
    we want to keep
    the cultural context of that language
  • 6:30 - 6:33
    and not let it be influenced
  • 6:33 - 6:35
    by other languages
    that have higher coverage.
  • 6:36 - 6:38
    And then we did a study
    with Wikipedia editors
  • 6:38 - 6:41
    because in the end the article placeholder
    is just a starting point
  • 6:41 - 6:43
    for people to start editing,
  • 6:43 - 6:44
    and we try to measure
  • 6:44 - 6:46
    how much of the sentences
    they would reuse.
  • 6:46 - 6:49
    How much is useful for them, basically,
  • 6:49 - 6:51
    and you can see
    that there is a high number of reuse,
  • 6:51 - 6:55
    especially in Esperanto
    when we test with editors.
  • 6:56 - 7:01
    And finally, we also did
    qualitative interviews
  • 7:01 - 7:05
    with Wikipedia editors
    across six languages.
  • 7:05 - 7:08
    I think we had
    about ten people we interviewed.
  • 7:09 - 7:12
    And we tried to get
    more of an understanding
  • 7:12 - 7:15
    what's a human perspective
    on those generated sentences.
  • 7:15 - 7:18
    So now we can have
    a very quantified way of saying,
  • 7:18 - 7:19
    yeah, they are good,
  • 7:19 - 7:21
    but we wanted to see
  • 7:21 - 7:23
    how's the interaction
  • 7:23 - 7:26
    and especially with whatever
    always happens
  • 7:26 - 7:30
    in neural machine translation
    and neural text generation,
  • 7:30 - 7:34
    that you have those missing word tokens
    which we put as "rare" in there.
  • 7:34 - 7:39
    Those are the example sentences we used.
    All of them are about Marrakesh.
  • 7:39 - 7:42
    So we wanted to see how much
    are people bothered by it,
  • 7:42 - 7:43
    what's the quality,
  • 7:43 - 7:45
    what are the things
    that stand out to them,
  • 7:45 - 7:50
    and we can see that the mistakes
    by the networks like those red tokens
  • 7:50 - 7:51
    are often just ignored.
  • 7:53 - 7:56
    There is this interesting factor
    that because we didn't tell them
  • 7:56 - 8:01
    where this happens,
    where we got the sentences from--
  • 8:01 - 8:04
    because it was on a user page of mine
  • 8:04 - 8:06
    but it looked like it was on a Wikipedia,
  • 8:06 - 8:07
    people just trusted.
  • 8:07 - 8:09
    And I think that's very important
  • 8:09 - 8:13
    when we look into those kinds
    of research directions,
  • 8:13 - 8:16
    we cannot override
    this trust in Wikipedia.
  • 8:16 - 8:20
    So if we work with Wikipedians
    and Wikipedia itself,
  • 8:20 - 8:23
    if we take things from,
    for example, Wikidata,
  • 8:23 - 8:26
    that's good
    because it's also human-curated.
  • 8:26 - 8:31
    But when we start
    with artificial intelligence projects,
  • 8:31 - 8:35
    we have to be really careful
    about what we actually expose people to
  • 8:35 - 8:38
    because they just trust
    the information that we give them.
  • 8:39 - 8:43
    So we could see, for example,
    in the Arabic version,
  • 8:43 - 8:45
    it gave the wrong location for Marrakesh,
  • 8:45 - 8:48
    and people, even the people I interviewed
  • 8:48 - 8:50
    that were living in Marrakesh
    didn't pick up on that,
  • 8:50 - 8:54
    because it's on Wikipedia,
    so it should be fine, right?
  • 8:54 - 8:55
    (chuckles)
  • 8:55 - 8:56
    Yeah.
  • 8:58 - 9:01
    We found there was a magical threshold
    for the length of the generated text,
  • 9:01 - 9:02
    so that's something we found,
  • 9:02 - 9:05
    especially in comparison
    with the content translation tool,
  • 9:05 - 9:08
    where you have a long
    automatically generated text,
  • 9:08 - 9:12
    and people were complaining
    that content translation was very hard
  • 9:12 - 9:16
    because you're just doing post-editing,
    you don't have the creativity.
  • 9:16 - 9:19
    There are other remarks
    on content translation I usually make--
  • 9:19 - 9:21
    I'll skip them for now.
  • 9:22 - 9:25
    So that one sentence was helpful
  • 9:25 - 9:30
    because even if we made mistakes,
    people were still willing to fix them
  • 9:30 - 9:34
    because it's a very short
    intervention.
  • 9:34 - 9:38
    And then, finally,
    a lot of people pointed out,
  • 9:38 - 9:40
    that it was particularly good
    for a new editor,
  • 9:40 - 9:42
    so for them to have a starting point,
  • 9:42 - 9:44
    to have those triples, to have a sentence,
  • 9:44 - 9:46
    so they have something to start from.
  • 9:46 - 9:49
    So after all those interviews were done,
  • 9:49 - 9:52
    I was like, that's very interesting.
  • 9:52 - 9:54
    What else can we do with that knowledge?
  • 9:54 - 9:59
    And so we started a new project,
    exactly because there weren't enough yet.
  • 9:59 - 10:02
    And the new project we have
    is called Scribe,
  • 10:02 - 10:07
    and Scribe focuses on new editors
    that want to write a new article,
  • 10:07 - 10:10
    and particularly people
    who haven't written
  • 10:10 - 10:11
    an article on Wikipedia yet,
  • 10:11 - 10:14
    and specifically also
    on low-resource languages.
  • 10:15 - 10:19
    So the idea is that--
    that's the pixel version of me.
  • 10:20 - 10:21
    All my slides are basically
  • 10:21 - 10:24
    references to people in this room,
    which I really love.
  • 10:24 - 10:26
    It feels like I'm home again.
  • 10:27 - 10:31
    So, yeah, I want to write a new article,
  • 10:31 - 10:34
    but I don't know where to start
    as a new editor,
  • 10:34 - 10:37
    and so we have this project Scribe.
  • 10:37 - 10:41
    Scribe is a profession--
    it was the name of someone
  • 10:41 - 10:45
    with the profession of writing
    in ancient Egypt.
  • 10:47 - 10:53
    So the Scribe project's idea
    is that we want to give people, basically,
  • 10:53 - 10:56
    a hand when they start
    writing their first articles.
  • 10:56 - 10:58
    So give them a skeleton,
  • 10:58 - 11:01
    give them a skeleton that's based
    on their language Wikipedia,
  • 11:01 - 11:05
    instead of just translating the content
    from another language Wikipedia.
  • 11:05 - 11:10
    So the first thing we want to do
    is plan section titles,
  • 11:10 - 11:14
    then select references for each section,
  • 11:14 - 11:16
    ideally in the local Wikipedia language,
  • 11:16 - 11:20
    and then summarize those references
    to give a starting point to write.
  • 11:21 - 11:25
    For the project, we have
    a Wikimedia Foundation project grant.
  • 11:25 - 11:28
    So it just started.
  • 11:28 - 11:31
    So, we are very open
    to feedback, in general.
  • 11:31 - 11:35
    That was the very first
    not so beautiful layout,
  • 11:35 - 11:37
    but just for you to get an overview.
  • 11:37 - 11:40
    So there is this idea
    of collecting references,
  • 11:40 - 11:43
    images from Commons, section titles.
  • 11:43 - 11:46
    And so the main things
    we want to use Wikidata for
  • 11:46 - 11:48
    is the sections.
  • 11:48 - 11:51
    So, basically, we want to see
    what articles
  • 11:51 - 11:55
    on similar topics
    already exist in your language,
  • 11:55 - 11:58
    so we can understand
    how the language community
  • 11:58 - 12:02
    decided on structuring articles.
  • 12:02 - 12:06
    And then we look
    for the images, obviously,
  • 12:06 - 12:10
    where Wikidata also
    is a good point to go through.
  • 12:13 - 12:16
    And then we made
    a prettier interface for it
  • 12:16 - 12:18
    because we decided to go mobile first.
  • 12:18 - 12:21
    So most of the communities
    that we aim to work with
  • 12:21 - 12:25
    are very heavy on mobile editing.
  • 12:25 - 12:30
    And so we do this mobile-first focus.
  • 12:30 - 12:34
    And then, it also forces us
    to break down into steps
  • 12:34 - 12:37
    which eventually will lead to,
    yeah, I don't know,
  • 12:37 - 12:39
    a step-by-step guide
    on how to write a new article.
  • 12:39 - 12:43
    So an editor comes,
    they can select section headers
  • 12:43 - 12:47
    based on existing articles
    in their language,
  • 12:47 - 12:49
    write one section at a time,
  • 12:49 - 12:54
    switch between the sections,
    and select references for each section.
  • 12:56 - 12:59
    Yeah, so the idea is that
    we will have an easier editing experience,
  • 12:59 - 13:01
    especially for new editors,
  • 13:01 - 13:05
    to keep them in--
    integrate Wikidata information
  • 13:05 - 13:08
    and [inaudible] images
    from Wikimedia Commons as well.
  • 13:10 - 13:12
    If you're interested in Scribe,
  • 13:12 - 13:15
    I'm working together
    on this project with Hady.
  • 13:15 - 13:19
    There are a lot of things online,
  • 13:19 - 13:23
    but then also just come and talk to us.
  • 13:23 - 13:26
    Also, if you're editing
    a low-resource Wikipedia,
  • 13:26 - 13:29
    we're still looking
    for people to interview
  • 13:29 - 13:32
    because we're trying to emulate--
  • 13:32 - 13:34
    we're trying to emulate as much as we can
  • 13:34 - 13:37
    what people already experience,
    or how they already edit.
  • 13:37 - 13:39
    I'm not big on Wikipedia editing.
  • 13:39 - 13:41
    Also, my native language is German.
  • 13:41 - 13:44
    So I need a lot of input from editors
  • 13:44 - 13:48
    that want to tell me
    what they need, what they want,
  • 13:48 - 13:51
    where they think this project can go.
  • 13:51 - 13:55
    And if you are into Wikidata,
    also come and talk to me, please.
  • 13:56 - 13:58
    Okay, so that's all the projects
  • 13:58 - 14:02
    or most of the projects we did
    inside the Wikimedia world.
  • 14:02 - 14:06
    And I want to give you one
    short overview of what's happening
  • 14:06 - 14:10
    on my end of research,
    around Wikidata as well.
  • 14:14 - 14:16
    So I was part of a project
  • 14:16 - 14:18
    that works a lot with question answering,
  • 14:18 - 14:20
    and I don't know too much
    about question answering,
  • 14:20 - 14:24
    but what I do know a lot about
    is knowledge graphs and multilinguality.
  • 14:24 - 14:26
    So, basically, what we wanted to do
  • 14:26 - 14:30
    is we have a question answering system
    that gets a question from a user,
  • 14:30 - 14:36
    and we wanted to select a knowledge graph
    that can answer the question best.
  • 14:36 - 14:40
    And again, we focused on
    a multilingual question answering system.
  • 14:40 - 14:46
    So if I want to ask something about Bach,
    for example, in Spanish and French--
  • 14:46 - 14:48
    because that's the two languages
    I know best--
  • 14:48 - 14:52
    then, which knowledge graph has the data
  • 14:52 - 14:54
    to actually answer those questions?
  • 14:55 - 14:59
    So what we did was we found a method
    to rank knowledge graphs,
  • 15:01 - 15:05
    based on the metadata of the languages
  • 15:05 - 15:08
    that appear in the knowledge graph,
  • 15:08 - 15:10
    split by class.
  • 15:10 - 15:11
    And then we look for each class
  • 15:11 - 15:14
    into what languages are covered best,
  • 15:14 - 15:18
    and then depending on the question,
    can suggest a knowledge graph.
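The ranking idea just described can be illustrated with a toy sketch. The knowledge-graph names are real, but the per-class language coverage below is invented for illustration, not measured from the actual graphs:

```python
# Toy illustration of ranking knowledge graphs by how well they cover the
# class and language of a question. The coverage sets are invented.
coverage = {
    "Wikidata": {"composer": {"en", "es", "fr", "ar"}},
    "DBpedia":  {"composer": {"en", "fr"}},
    "YAGO":     {"composer": {"en"}},
}

def rank_graphs(question_class: str, question_lang: str) -> list:
    """Rank graphs: first by whether they cover the question's language
    for that class, then by overall breadth of language coverage."""
    def score(kg: str):
        langs = coverage[kg].get(question_class, set())
        return (question_lang in langs, len(langs))
    return sorted(coverage, key=score, reverse=True)

print(rank_graphs("composer", "es"))  # Wikidata ranks first here
```

With these made-up numbers, a Spanish question about a composer would be routed to Wikidata, since the other graphs lack Spanish labels for that class.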
  • 15:19 - 15:23
    From the big knowledge graphs
    we looked into
  • 15:23 - 15:25
    and that are well known and widely used,
  • 15:25 - 15:28
    Wikidata covers the most languages
    over all knowledge graphs,
  • 15:28 - 15:32
    and we used a test bed.
  • 15:32 - 15:36
    So we used a benchmark dataset
    called QALD,
  • 15:36 - 15:39
    which we then translated--
    which was originally for DBpedia.
  • 15:39 - 15:42
    We translated it
    for those five knowledge graphs
  • 15:42 - 15:44
    into SPARQL questions.
  • 15:44 - 15:50
    And then we gave that to a crowd
    and looked into which knowledge graph
  • 15:50 - 15:55
    has the best answers
    for each of those SPARQL queries.
  • 15:55 - 15:59
    And overall, the crowd workers
    preferred Wikidata's answers
  • 15:59 - 16:02
    because they are very precise,
  • 16:03 - 16:05
    they are in most of the languages
  • 16:05 - 16:07
    that the others don't cover,
  • 16:08 - 16:11
    and they are not
    as repetitive or redundant
  • 16:11 - 16:12
    as the [inaudible].
  • 16:12 - 16:17
    So just to make a quick recap
    on the whole topic
  • 16:17 - 16:20
    of Wikidata and the future and languages.
  • 16:20 - 16:24
    So we can say that Wikidata
    is already widely used
  • 16:24 - 16:28
    for numerous applications in Wikipedia,
  • 16:28 - 16:30
    and then outside Wikipedia for research.
  • 16:30 - 16:34
    So what I talked about
    is just the things I do research on,
  • 16:34 - 16:36
    but there is still so much more.
  • 16:36 - 16:39
    So there is machine translation
    using knowledge graphs,
  • 16:39 - 16:41
    there is rule mining
    over knowledge graphs,
  • 16:41 - 16:44
    there is entity linking in text.
  • 16:44 - 16:47
    There is so much more research
    happening at the moment,
  • 16:47 - 16:51
    and Wikidata is getting
    more and more popular for these uses.
  • 16:51 - 16:55
    So I think we are at a very good stage
  • 16:55 - 16:58
    to push and connect the communities.
  • 16:59 - 17:03
    Yeah, to get the best
    from both sides, basically.
  • 17:04 - 17:05
    Thank you very much.
  • 17:05 - 17:08
    If you want to have a look
    at any of those projects,
  • 17:08 - 17:09
    they are there,
  • 17:09 - 17:11
    my slides are in Commons already.
  • 17:11 - 17:15
    If you want to read any of the papers,
    I think all of them are open access.
  • 17:15 - 17:16
    If you can't find any of them,
  • 17:16 - 17:19
    write me an email
    and I'll send it to you immediately.
  • 17:19 - 17:21
    Thank you very much.
  • 17:21 - 17:22
    (applause)
  • 17:26 - 17:28
    (moderator) Okay,
    are there any questions?
  • 17:28 - 17:32
    - (moderator) I'll come around.
    - (person 1) Shall I come to you?
  • 17:35 - 17:36
    (person 1) Hi Lucie, thank you so much,
  • 17:36 - 17:38
    I'm so glad to see
    you taking this forward.
  • 17:38 - 17:41
    Now I'm really curious about Scribe.
  • 17:42 - 17:44
    The example here within our university
  • 17:44 - 17:46
    was the idea that the person says,
  • 17:46 - 17:48
    "This is a university."
  • 17:48 - 17:49
    And then you go to Wikidata
  • 17:49 - 17:52
    and say, "Oh gosh!
    Universities have places
  • 17:52 - 17:54
    and presidents, and I don't know what,"
  • 17:54 - 17:58
    that you're using these as the parts,
    for telling the person what to do.
  • 17:58 - 18:01
    So, basically, the idea
    is that someone says,
  • 18:01 - 18:03
    "I want to write about Nile University."
  • 18:03 - 18:07
    We look into Nile University's
    Wikidata item,
  • 18:07 - 18:10
    and let's say-- I work a lot with Arabic--
  • 18:10 - 18:13
    so let's say we then go
    in Arabic Wikipedia,
  • 18:13 - 18:17
    so we can make a grid, basically,
  • 18:17 - 18:19
    of all items that are around
    Nile University.
  • 18:19 - 18:23
    So there are also universities,
    there are also universities in Cairo,
  • 18:23 - 18:25
    or there are also universities
    in Egypt, stuff like that,
  • 18:25 - 18:27
    or they have similar topics.
  • 18:27 - 18:33
    So we can look into
    all the similar items on Wikidata,
  • 18:33 - 18:36
    and if they already have
    a Wikipedia entry in Arabic Wikipedia,
  • 18:36 - 18:39
    we can look at the section titles.
  • 18:39 - 18:41
    - (person 1) (gasps)
    - Exactly, and then we can make basically,
  • 18:41 - 18:46
    the most common way
    of writing about a university
  • 18:46 - 18:50
    in Cairo on Arabic Wikipedia.
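The aggregation described in this answer can be sketched roughly like this. The section titles are invented English examples standing in for headings collected from similar Arabic Wikipedia articles:

```python
from collections import Counter

# Sketch of Scribe's section suggestion: given the section headings of
# articles about similar items (e.g. other universities that already have
# Arabic Wikipedia articles), suggest the most common headings. The
# article data below is invented for illustration.
similar_articles = [
    ["History", "Campus", "Faculties", "References"],
    ["History", "Faculties", "Notable alumni", "References"],
    ["History", "Campus", "References"],
]

def suggest_sections(articles, top_n=3):
    """Return the `top_n` most frequent section titles across articles."""
    counts = Counter(title for sections in articles for title in sections)
    return [title for title, _ in counts.most_common(top_n)]

print(suggest_sections(similar_articles))
```

In a real pipeline, the list of similar items would come from Wikidata (items sharing a class or other statements with the target item), and the headings from their existing articles in the target language.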
  • 18:50 - 18:53
    - Yeah, so that's the--
    - (person 1) Thank you, [inaudible].
  • 18:57 - 19:00
    (person 2) Hi, thank you so much
    for your inspiring talk.
  • 19:00 - 19:05
    I was wondering if this would work
    for languages in Incubator?
  • 19:05 - 19:11
    Like, I work with really low,
    low, low, low-resource languages
  • 19:11 - 19:16
    and this thing about doing it mobile
    would be a huge thing,
  • 19:16 - 19:20
    because in many communities
    they only have phones, not laptops.
  • 19:20 - 19:22
    So, would it work?
  • 19:22 - 19:26
    So I think, to an extent--
  • 19:26 - 19:32
    so the general structure, the skeleton
    of the application would work.
  • 19:32 - 19:35
    Two things that we're thinking about
    a lot at the moment
  • 19:35 - 19:37
    for exactly those use cases is,
  • 19:37 - 19:40
    how much would we want,
    for example, to say,
  • 19:40 - 19:45
    if there are no articles
    on a similar topic in your Wikipedia,
  • 19:45 - 19:47
    how much do we want
    to get it from other Wikipedias.
  • 19:47 - 19:50
    And that's why I'm basically
    doing those interviews at the moment,
  • 19:50 - 19:51
    because I try to understand
  • 19:51 - 19:55
    how much people already look
    at other language Wikipedias
  • 19:55 - 19:57
    to make the structure of an article.
  • 19:57 - 19:59
    Are they generally equal
  • 19:59 - 20:02
    or do they differ a lot
    based on cultural context?
  • 20:02 - 20:04
    So that would be something to consider,
  • 20:04 - 20:07
    but there is a possibility to say,
  • 20:07 - 20:10
    we take everything
    from all the language Wikipedias
  • 20:10 - 20:12
    and then make an average, basically.
  • 20:12 - 20:15
    And the other problem is referencing.
  • 20:15 - 20:16
    So that's something we find.
  • 20:16 - 20:21
    We make it very convenient
    because we use a lot of Arabic,
  • 20:21 - 20:24
    and Arabic actually has the problem
    that there are a lot of references,
  • 20:24 - 20:29
    but they are very little used
    or not widely used in Wikipedia.
  • 20:29 - 20:32
    That's not true, obviously,
    for all languages,
  • 20:32 - 20:34
    and that's something
    I'd be very interested--
  • 20:34 - 20:35
    like, let's talk.
  • 20:35 - 20:37
    That's what I'm trying to say,
  • 20:37 - 20:39
    I'd be very interested
    on your perspective on it
  • 20:39 - 20:42
    because I'd like to know, yeah
  • 20:42 - 20:44
    what do you think about referencing
  • 20:44 - 20:45
    done from English or any other language.
  • 20:45 - 20:47
    (person 2) Have you ever tried--
  • 20:47 - 20:52
    what we do is we normally
    reference interviews we have.
  • 20:52 - 20:56
    We put them in our repository,
    institutional repository,
  • 20:56 - 21:00
    because these languages
    don't have written references,
  • 21:00 - 21:03
    and I feel like
    that is the way to go, but--
  • 21:03 - 21:07
    I'm currently also--
    Kimberly and I are discussing a lot.
  • 21:07 - 21:11
    We made a session on Wikimania
    on oral knowledge and oral citations.
  • 21:11 - 21:14
    Yeah, we should hang out
    and have a long conversation.
  • 21:14 - 21:16
    (laughs)
  • 21:18 - 21:22
    (person 3) So [Michael Davignon],
    we'll talk about medium size,
  • 21:22 - 21:24
    which is probably around ten people,
  • 21:24 - 21:28
    so it's medium for Breton Wikipedia.
  • 21:28 - 21:31
    And I'm wondering if we can use Scribe,
  • 21:32 - 21:35
    how to find a common plan
    the other way around
  • 21:35 - 21:38
    for existing article
    to find [the outer layers],
  • 21:38 - 21:40
    that's supposed to be the best plan,
  • 21:40 - 21:42
    but I'm not aware of more or less
  • 21:42 - 21:45
    [inaudible]
    improvement existing article.
  • 21:47 - 21:49
    I think there's--
  • 21:49 - 21:51
    I forgot the name, I think,
  • 21:51 - 21:54
    [Diego] in the Wikimedia Foundation
    research team,
  • 21:54 - 21:58
    who's working a lot at the moment
    with section headings.
  • 21:58 - 22:01
    But, yes, generally, the idea is the same.
  • 22:01 - 22:05
    So instead of using them
    to make an average
  • 22:05 - 22:07
    you could say,
    this is not like the average--
  • 22:08 - 22:10
    That's very possible, yeah.
  • 22:15 - 22:18
    (person 4) Hi, Lucie. I'm Érica Azzellini
    from Wiki Movement Brazil,
  • 22:18 - 22:20
    and I'm very--
  • 22:20 - 22:22
    (Érica) Oh, can you hear me?
  • 22:22 - 22:25
    So, I'm Érica Azzellini
    from Wiki Movement Brazil,
  • 22:25 - 22:27
    and I'm really impressed with your work
  • 22:27 - 22:29
    because it's really in sync
  • 22:29 - 22:33
    with what we've been working on in Brazil
    with the Mbabel tool.
  • 22:33 - 22:34
    I don't know if you heard about it?
  • 22:34 - 22:36
    - Not yet.
    - (Érica) It's a tool that we use
  • 22:36 - 22:38
    to automatically
    generate Wikipedia entries
  • 22:38 - 22:42
    using Wikidata information
    in a simple way
  • 22:42 - 22:47
    that can be replicated
    on other Wikipedia languages.
  • 22:47 - 22:49
    So we've been working
    on Portuguese mainly,
  • 22:49 - 22:52
    and we're trying to get
    on English Wikipedia tools,
  • 22:52 - 22:56
    but it can be replicated
    on any language, basically,
  • 22:56 - 22:58
    and I think then we could talk about it.
  • 22:58 - 23:00
    Absolutely, it will be super interesting
  • 23:00 - 23:03
    because the article placeholder
    is an extension already,
  • 23:03 - 23:06
    so it might be worth
    integrating your efforts
  • 23:06 - 23:08
    into the existing extension.
  • 23:08 - 23:13
    Lydia is also fully for it,
    and... (laughs)
  • 23:13 - 23:14
    And then because--
  • 23:14 - 23:17
    so one of the problems--
    [Marius] correct me if I'm wrong--
  • 23:17 - 23:20
    we had was that
    article placeholder doesn't scale
  • 23:20 - 23:22
    as well as it should.
  • 23:22 - 23:25
    So article placeholder
    is not in Portuguese
  • 23:25 - 23:29
    because we're always afraid
    it will break everything, correct?
  • 23:29 - 23:32
    And then [Marius] is just taking a pause.
  • 23:32 - 23:35
    - (Érica) Yeah, you should be careful.
    - Don't want to say anything about this.
  • 23:35 - 23:39
    But, yeah, we should connect
    because I'd be super interested to see
  • 23:39 - 23:42
    how you solve those issues
    and how it works for you.
  • 23:42 - 23:45
    (Érica) I'm going to present
    on the second session
  • 23:45 - 23:48
    of the lightning talks about this project
    that we've been developing,
  • 23:48 - 23:51
    and we've been using it
    on GLAM-Wiki initiatives
  • 23:51 - 23:52
    and education projects already.
  • 23:52 - 23:54
    - Perfect.
    - (Érica) So let's do that.
  • 23:54 - 23:56
    Yeah, absolutely let's chat.
  • 23:57 - 23:58
    (moderator) Cool.
  • 23:58 - 24:00
    Some other questions on your projects?
  • 24:02 - 24:07
    (person 5) Hi, my name is [Alan],
    and I think this is extremely cool.
  • 24:07 - 24:09
    I had a few questions about
  • 24:09 - 24:13
    generating Wiki sentences
    from neural networks.
  • 24:13 - 24:16
    - Yeah.
    - (person 5) So I've come across
  • 24:16 - 24:19
    another project
    that was attempting to do this,
  • 24:19 - 24:23
    and it was essentially using
    triples as input and sentences as output,
  • 24:23 - 24:26
    and it was able
    to generate very fluent sentences.
  • 24:26 - 24:29
    But sometimes they weren't...
  • 24:30 - 24:34
    actually, they weren't correct,
    with regards to the triple.
  • 24:34 - 24:39
    And I was curious if you had any ways
    of doing validity checks of this site.
  • 24:39 - 24:43
    Sometimes the triple
    is "subject, predicate, object,"
  • 24:43 - 24:46
    but the language model says,
  • 24:46 - 24:49
    "Okay, this object is very rare,
  • 24:49 - 24:52
    I'm going to say you are born in San Jose,
  • 24:52 - 24:55
    instead of San Francisco or vice versa."
  • 24:55 - 24:59
    And I was curious
    if you had come across this?
  • 24:59 - 25:02
    So that's what we call hallucinations.
  • 25:02 - 25:05
    The idea that
    there's something in a sentence
  • 25:05 - 25:08
    that wasn't in the original triple
    and the data.
  • 25:08 - 25:11
    What we do--
    so we don't do anything about it,
  • 25:11 - 25:14
    we just also realized
    that that's happening.
  • 25:14 - 25:16
    It happens even more
    for the low-resource languages,
  • 25:16 - 25:20
    because we work across domains,
    so we are domain independently generating.
  • 25:20 - 25:25
    Traditional NLG work
    is usually in the biography domain.
  • 25:25 - 25:27
    So that happens a lot
  • 25:27 - 25:30
    because we just have little training data
    on the low-resource languages.
  • 25:30 - 25:33
    We have a few ideas.
  • 25:33 - 25:37
    It's one of a million topics
    I'm supposed to work on at the moment.
  • 25:39 - 25:43
    One of them is to use
    entity linking and relation extraction,
  • 25:43 - 25:44
    to align what we generate
  • 25:44 - 25:47
    with the triples
    we inputted in the first place,
  • 25:47 - 25:51
    to see if it's off or the network
    generates information it shouldn't have
  • 25:51 - 25:54
    or it cannot know about, basically.
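A very reduced sketch of that alignment idea, where plain string matching stands in for real entity linking and relation extraction, and the triple is a toy example, not data from the talk:

```python
# Sketch of a hallucination check: flag entities surfaced in the generated
# sentence that were not present in the input triples (so the network
# could not have known about them). Real systems would use entity linking
# rather than exact string matching.
triples = [("Marrakesh", "country", "Morocco")]
known = {part for triple in triples for part in triple}

def hallucinated_entities(sentence_entities):
    """Return entities in the output that are absent from the input triples."""
    return [e for e in sentence_entities if e not in known]

print(hallucinated_entities(["Marrakesh", "Spain"]))  # → ['Spain']
```

Anything flagged this way is either a hallucination or a fact the model pulled from training data rather than from the triples it was given.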
  • 25:54 - 25:59
    That's also all I can say about this
    because now time is over.
  • 25:59 - 26:01
    (person 5) I'd love to talk offline
    about this, if you have time.
  • 26:01 - 26:03
    Yeah, absolutely, let's chat about it.
  • 26:03 - 26:05
    Thank you so much,
    everyone, it was lovely.
  • 26:05 - 26:07
    (moderator) Thank you, Lucie.
  • 26:07 - 26:09
    (applause)
Video Language: English
Duration: 26:15