WEBVTT

00:00:05.840 --> 00:00:07.080
Hi, I'm Lucie.

00:00:07.080 --> 00:00:11.930
You know me from rambling about 
not enough language data in Wikidata,

00:00:11.930 --> 00:00:15.800
and I thought instead of rambling today, 
which I'll leave to Lydia later today,

00:00:16.530 --> 00:00:20.370
I'll just show you a bit, or give you
an insight into the projects we did

00:00:20.370 --> 00:00:25.170
using the data that we already have 
on Wikidata, for different causes.

00:00:25.170 --> 00:00:28.550
So underserved languages 
compared to the keynote we just heard

00:00:28.550 --> 00:00:32.530
where the person was talking about 
underserved as like minority languages,

00:00:32.530 --> 00:00:35.600
underserved languages, to me,
are any languages

00:00:35.600 --> 00:00:38.760
that don't have 
enough representation on the web.

00:00:39.420 --> 00:00:40.930
Yeah, just to get that clear.

00:00:40.930 --> 00:00:43.060
So, who am I?

00:00:43.060 --> 00:00:45.910
Why am I always talking 
about languages on Wikidata?

00:00:45.910 --> 00:00:47.593
Not sure but...

00:00:47.593 --> 00:00:50.280
I'm a Computer Science PhD student

00:00:50.280 --> 00:00:52.280
at the University of Southampton.

00:00:52.280 --> 00:00:55.420
I'm a research intern 
at Bloomberg in London, at the moment.

00:00:55.420 --> 00:00:58.340
I'm a resident
at Newspeak House in London.

00:00:58.340 --> 00:01:01.660
I am a researcher and project manager 
for the Scribe project,

00:01:01.660 --> 00:01:03.230
which I'll go into in a bit,

00:01:03.230 --> 00:01:08.530
and I recently got into the idea 
of oral knowledge and oral citation.

00:01:08.530 --> 00:01:10.170
Kimberly is sitting right there.

00:01:10.990 --> 00:01:13.330
And then, occasionally,
I have time to sleep

00:01:13.330 --> 00:01:16.010
and do other things, but that's very rare.

00:01:16.680 --> 00:01:18.620
So if you're interested
in any of those things,

00:01:18.620 --> 00:01:20.020
come talk and speak to me.

00:01:20.020 --> 00:01:23.480
Generally, this is an open presentation,
so feel free to ask questions in between.

00:01:23.480 --> 00:01:26.780
I'll run through a lot of things
in a very short time now.

00:01:27.412 --> 00:01:30.110
Come to me afterwards
if you're interested in any of them.

00:01:30.640 --> 00:01:32.170
Speak to me. I'm here.

00:01:32.170 --> 00:01:35.460
I'm always very happy to speak to people.

00:01:35.460 --> 00:01:39.110
So that's a bit of what
we will talk about today.

00:01:39.110 --> 00:01:41.480
So Wikidata, giving an introduction,

00:01:41.480 --> 00:01:44.060
even though that's obviously
not as necessary.

00:01:44.510 --> 00:01:48.130
The article placeholder,
which is aimed at Wikipedia readers,

00:01:48.130 --> 00:01:50.910
then Scribe, which is aimed
at Wikipedia editors,

00:01:50.910 --> 00:01:54.440
and then we have one topic of my research,

00:01:54.440 --> 00:01:56.880
which is completely outside of Wikipedia

00:01:56.880 --> 00:02:00.110
where we use Wikidata
for question answering.

00:02:01.530 --> 00:02:03.950
So just a quick rerun.

00:02:03.950 --> 00:02:07.040
Why is Wikidata so cool
for low-resource languages?

00:02:07.040 --> 00:02:10.820
Well, we have those unique identifiers.

00:02:10.820 --> 00:02:13.370
I'm speaking to people that know that

00:02:13.370 --> 00:02:14.930
much better than me even.

00:02:14.930 --> 00:02:17.720
And then we have labels
in different languages.

00:02:17.720 --> 00:02:21.820
Those can be in over,
I think, 400 languages by now,

00:02:21.820 --> 00:02:24.060
so we have a good option here

00:02:24.060 --> 00:02:27.820
to reuse language
in different forms and capture it.

00:02:29.310 --> 00:02:32.730
Yeah, so that's a little bit of me
rambling about Wikidata

00:02:32.730 --> 00:02:34.880
because I can't stop it.

00:02:34.880 --> 00:02:37.040
We compared Wikidata's coverage
with the number of native speakers,

00:02:37.040 --> 00:02:39.107
so we can see, obviously,

00:02:39.107 --> 00:02:41.570
there are languages
that are widely spoken in the world,

00:02:41.570 --> 00:02:43.840
like Chinese, Hindi, or Arabic,

00:02:43.840 --> 00:02:46.640
but that have very low coverage on Wikidata.

00:02:48.000 --> 00:02:50.130
Then the opposite.

00:02:50.130 --> 00:02:52.590
Sorry, I have the Dutch
and the Swedish communities,

00:02:52.590 --> 00:02:54.880
which are super active on Wikidata,

00:02:54.880 --> 00:02:58.060
which is really cool, 
and that just points out

00:02:58.060 --> 00:03:01.330
that even though we have 
a low number of speakers,

00:03:01.330 --> 00:03:06.810
we can have a big impact if people 
are very active in the communities,

00:03:06.810 --> 00:03:09.000
which is really nice and really good.

00:03:09.000 --> 00:03:13.600
But also let's try to even
that graph out in the future.

00:03:14.560 --> 00:03:18.570
So, cool. So now we have 
all this language data in Wikidata.

00:03:18.570 --> 00:03:22.280
We have low-resource Wikipedias, 
so we thought, what can we do?

00:03:22.280 --> 00:03:27.460
Well, my undergrad supervisor
is sitting here,

00:03:27.460 --> 00:03:31.070
and we worked back then
in the golden days,

00:03:31.070 --> 00:03:33.620
on something called
the article placeholder

00:03:34.730 --> 00:03:39.370
which takes triples from Wikidata
and displays them on Wikipedia.

00:03:39.370 --> 00:03:41.570
And that's relatively
straightforward.

00:03:41.570 --> 00:03:46.300
So you just take the content of Wikidata,
display it on Wikipedia

00:03:46.300 --> 00:03:49.330
to attract more readers 
and then eventually more editors

00:03:49.330 --> 00:03:51.330
in the different low-resource languages.

00:03:51.330 --> 00:03:53.130
They are dynamically generated,

00:03:53.130 --> 00:03:55.550
so they're not like stubs or bot articles

00:03:55.550 --> 00:04:00.170
that then flood the Wikipedia 
so people can edit them.

00:04:00.170 --> 00:04:02.420
It's basically a starting point.

00:04:02.420 --> 00:04:04.550
And we thought, 
well, we have that content,

00:04:04.550 --> 00:04:08.570
and we have that knowledge
somewhere already, which is Wikidata.

00:04:08.570 --> 00:04:11.600
It's often already in the languages,
but they don't have articles,

00:04:11.600 --> 00:04:15.136
so at least give them
the insight into the information.

00:04:15.136 --> 00:04:19.220
The article placeholders are live 
on 14 low-resource Wikipedias.

00:04:20.040 --> 00:04:21.770
If you are a Wikipedia community,

00:04:21.770 --> 00:04:24.805
if you are part of a Wikipedia community
and interested in it,

00:04:24.805 --> 00:04:26.110
let us know.

00:04:27.880 --> 00:04:30.080
And then I went into research,

00:04:30.080 --> 00:04:32.770
and I stuck with
the article placeholder, though,

00:04:32.770 --> 00:04:36.040
so we started to look into 
text generation from Wikidata

00:04:36.040 --> 00:04:38.060
for Wikipedia in low-resource languages.

00:04:38.060 --> 00:04:39.965
And text generation is really interesting

00:04:39.965 --> 00:04:43.310
because in research, at the point 
when we started the project,

00:04:43.310 --> 00:04:46.020
it was completely focused on English,

00:04:46.020 --> 00:04:48.880
which is a bit pointless in my experience

00:04:48.880 --> 00:04:51.440
because, I mean, you have a lot of people
who write in English,

00:04:51.440 --> 00:04:55.350
but then what we need is people 
who write in those low-resource languages.

00:04:55.350 --> 00:04:59.420
And our starting point was that,
looking at triples on Wikipedia

00:04:59.420 --> 00:05:01.600
is not exactly the nicest thing.

00:05:01.600 --> 00:05:03.680
I mean, as much as I love
the article placeholder,

00:05:03.680 --> 00:05:06.330
it's not exactly 
what you want to see or expect

00:05:06.330 --> 00:05:07.960
when you open a Wikipedia page.

00:05:07.960 --> 00:05:09.590
So we try to generate text.

00:05:09.590 --> 00:05:11.770
We use this beautiful
neural network model,

00:05:11.770 --> 00:05:13.440
where we encode Wikidata triples.

00:05:13.440 --> 00:05:15.755
If you're interested more
in the technical parts,

00:05:15.755 --> 00:05:16.970
come and talk to me.

00:05:16.970 --> 00:05:21.820
And so, realistically,
with neural text generation,

00:05:21.820 --> 00:05:23.750
you can generate one or two sentences

00:05:23.750 --> 00:05:27.530
before it completely scrambles
and becomes useless.

00:05:27.530 --> 00:05:32.660
So we've generated one sentence
that describes the topic of the triple.

00:05:32.660 --> 00:05:35.600
And so this, for example, is Arabic.

00:05:35.600 --> 00:05:38.620
We generate a sentence about Marrakesh,

00:05:38.620 --> 00:05:40.660
where it just describes the city.

00:05:42.170 --> 00:05:45.680
So for that, then, we tested this--

00:05:45.680 --> 00:05:49.330
So we did studies, obviously,
to test if our approach works,

00:05:49.330 --> 00:05:52.480
and if it makes sense to use such things.

00:05:52.480 --> 00:05:55.660
And because we are 
very application-focused,

00:05:55.660 --> 00:05:58.730
we tested it with actual 
Wikipedia readers and editors.

00:05:58.730 --> 00:06:01.302
So, first, we tested it
with Wikipedia readers

00:06:01.302 --> 00:06:03.020
in Arabic and Esperanto--

00:06:03.020 --> 00:06:06.170
so use cases with Arabic and Esperanto.

00:06:07.640 --> 00:06:12.710
And we can see that our model
can generate sentences

00:06:12.710 --> 00:06:14.493
that are very fluent

00:06:14.493 --> 00:06:18.050
and that feel very much--
surprisingly, a lot, actually--

00:06:18.050 --> 00:06:19.640
like Wikipedia sentences.

00:06:19.640 --> 00:06:22.710
So it picks up, so we train on,
for example, for Arabic,

00:06:22.710 --> 00:06:26.470
we train on Arabic with the idea to say

00:06:26.470 --> 00:06:29.880
we want to keep
the cultural context of that language

00:06:29.880 --> 00:06:32.980
and not let it be influenced

00:06:32.980 --> 00:06:35.295
by other languages
that have higher coverage.

00:06:36.150 --> 00:06:38.403
And then we did a study
with Wikipedia editors

00:06:38.403 --> 00:06:41.080
because in the end the article placeholder
is just a starting point

00:06:41.080 --> 00:06:42.515
for people to start editing,

00:06:42.515 --> 00:06:43.570
and we try to measure

00:06:43.570 --> 00:06:45.950
how much of the sentences
they would reuse.

00:06:45.950 --> 00:06:48.750
How much is useful for them, basically,

00:06:48.750 --> 00:06:51.200
and you can see 
that there is a high amount of reuse,

00:06:51.200 --> 00:06:54.880
especially in Esperanto
when we test with editors.

00:06:55.860 --> 00:07:01.150
And finally, we also did
qualitative interviews

00:07:01.150 --> 00:07:05.030
with Wikipedia editors 
across six languages.

00:07:05.030 --> 00:07:07.680
I think we had 
about ten people we interviewed.

00:07:08.680 --> 00:07:12.260
And we tried to get
more of an understanding

00:07:12.260 --> 00:07:15.310
of the human perspective
on those generated sentences.

00:07:15.310 --> 00:07:18.060
So now we can have 
a very quantified way of saying,

00:07:18.060 --> 00:07:19.285
yeah, they are good,

00:07:19.285 --> 00:07:21.284
but we wanted to see

00:07:21.284 --> 00:07:22.775
how's the interaction

00:07:22.775 --> 00:07:25.510
and especially with what
always happens

00:07:25.510 --> 00:07:30.340
in neural machine translation
and neural text generation,

00:07:30.340 --> 00:07:33.970
that you have those missing word tokens
which we put as "rare" in there.

00:07:33.970 --> 00:07:38.860
So those are the example sentences we used.
All of them are about Marrakesh.

00:07:38.860 --> 00:07:42.150
So we wanted to see how much 
people are bothered by it,

00:07:42.150 --> 00:07:43.198
what's the quality,

00:07:43.198 --> 00:07:45.350
what are the things
that stand out to them,

00:07:45.350 --> 00:07:50.080
and we can see that the mistakes
by the network, like those "rare" tokens,

00:07:50.080 --> 00:07:51.420
are often just ignored.

00:07:53.080 --> 00:07:56.080
There is this interesting factor
that because we didn't tell them

00:07:56.080 --> 00:08:00.640
where this happens,
where we got the sentences from--

00:08:00.640 --> 00:08:03.680
because it was on a user page of mine

00:08:03.680 --> 00:08:05.880
but it looked like it was on a Wikipedia,

00:08:05.880 --> 00:08:07.420
people just trusted it.

00:08:07.420 --> 00:08:09.000
And I think that's very important

00:08:09.000 --> 00:08:13.350
when we look into those kinds
of research directions:

00:08:13.350 --> 00:08:16.130
we cannot override 
this trust in Wikipedia.

00:08:16.130 --> 00:08:20.460
So if we work with Wikipedians
and Wikipedia itself,

00:08:20.460 --> 00:08:23.240
if we take things from,
for example, Wikidata,

00:08:23.240 --> 00:08:26.400
that's good
because it's also human-curated.

00:08:26.400 --> 00:08:31.050
But when we start
with artificial intelligence projects,

00:08:31.050 --> 00:08:34.680
we have to be really careful about
what we actually expose people to

00:08:34.680 --> 00:08:37.950
because they just trust
the information that we give them.

00:08:38.910 --> 00:08:42.570
So we could see, for example,
in the Arabic version,

00:08:42.570 --> 00:08:45.480
it gave the wrong location for Marrakesh,

00:08:45.480 --> 00:08:47.770
and people, even the people I interviewed

00:08:47.770 --> 00:08:50.330
that were living in Marrakesh 
didn't pick up on that,

00:08:50.330 --> 00:08:54.090
because it's on Wikipedia, 
so it should be fine, right?

00:08:54.090 --> 00:08:55.115
(chuckles)

00:08:55.115 --> 00:08:56.340
Yeah.

00:08:57.680 --> 00:09:00.750
We found there was a magical threshold
for the length of the generated text,

00:09:00.750 --> 00:09:02.000
so that's something we found,

00:09:02.000 --> 00:09:05.250
especially in comparison
with the content translation tool,

00:09:05.250 --> 00:09:08.080
where you have a long 
automatically generated text,

00:09:08.080 --> 00:09:12.130
and people were complaining
that content translation was very hard

00:09:12.130 --> 00:09:15.610
because you're just doing post-editing,
you don't have the creativity.

00:09:15.610 --> 00:09:19.330
There are other remarks 
on content translation I usually make--

00:09:19.330 --> 00:09:20.710
I'll skip them for now.

00:09:22.400 --> 00:09:25.230
So that one sentence was helpful

00:09:25.230 --> 00:09:30.360
because even if we've made mistakes,
people were still willing to fix them

00:09:30.360 --> 00:09:34.130
because it's a very short 
intervention [in that].

00:09:34.130 --> 00:09:37.950
And then, finally, 
a lot of people pointed out,

00:09:37.950 --> 00:09:40.200
that it was particularly good 
for a new editor,

00:09:40.200 --> 00:09:42.080
so for them to have a starting point,

00:09:42.080 --> 00:09:44.080
to have those triples, to have a sentence,

00:09:44.080 --> 00:09:46.150
so they have something to start from.

00:09:46.150 --> 00:09:48.720
So after all those interviews were done,

00:09:48.720 --> 00:09:51.990
I thought, that's very interesting.

00:09:51.990 --> 00:09:54.260
What else can we do with that knowledge?

00:09:54.260 --> 00:09:58.950
And so we started a new project, 
exactly because there weren't enough yet.

00:09:58.950 --> 00:10:02.310
And the new project we have
is called Scribe,

00:10:02.310 --> 00:10:07.460
and Scribe focuses on new editors
that want to write a new article,

00:10:07.460 --> 00:10:09.660
and particularly people
who haven't written

00:10:09.660 --> 00:10:11.260
an article on Wikipedia yet,

00:10:11.260 --> 00:10:14.280
and specifically also
in low-resource languages.

00:10:15.130 --> 00:10:18.910
So the idea is that-- 
that's the pixel version of me.

00:10:19.800 --> 00:10:21.170
All my slides are basically

00:10:21.170 --> 00:10:24.240
references to people in this room,
which I really love.

00:10:24.240 --> 00:10:25.750
It feels like I'm home again.

00:10:27.000 --> 00:10:30.880
So, yeah, I want to write a new article,

00:10:30.880 --> 00:10:33.570
but I don't know where to start
as a new editor,

00:10:33.570 --> 00:10:36.710
and so we have this project Scribe.

00:10:36.710 --> 00:10:41.370
Scribe is a profession, or rather
was the name for someone

00:10:41.370 --> 00:10:45.260
with the profession of writing
in ancient Egypt.

00:10:47.080 --> 00:10:52.870
So the Scribe project's idea
is that we want to give people, basically,

00:10:52.870 --> 00:10:55.710
a hand when they start
writing their first articles.

00:10:55.710 --> 00:10:57.750
So give them a skeleton,

00:10:57.750 --> 00:11:01.040
give them a skeleton that's based
on their language Wikipedia,

00:11:01.040 --> 00:11:05.330
instead of just translating the content
from another language Wikipedia.

00:11:05.330 --> 00:11:10.390
So the first thing we want to do
is plan section titles,

00:11:10.390 --> 00:11:13.640
then select references for each section,

00:11:13.640 --> 00:11:15.974
ideally in the local Wikipedia language,

00:11:15.974 --> 00:11:19.950
and then summarize those references
to give a starting point to write.

00:11:21.400 --> 00:11:25.310
For the project, we have 
a Wikimedia Foundation project grant.

00:11:25.310 --> 00:11:27.570
So it just started.

00:11:27.570 --> 00:11:30.670
So we are very open
to feedback, in general.

00:11:30.670 --> 00:11:35.170
That was the very first
not so beautiful layout,

00:11:35.170 --> 00:11:36.830
but just for you to get an overview.

00:11:36.830 --> 00:11:39.640
So there is this idea 
of collecting references,

00:11:39.640 --> 00:11:42.950
images from Commons, section titles.

00:11:42.950 --> 00:11:45.620
And so the main thing 
we want to use Wikidata for

00:11:45.620 --> 00:11:47.850
is the sections.

00:11:47.860 --> 00:11:51.400
So, basically, we want to see
which articles

00:11:51.400 --> 00:11:55.220
on similar topics 
already exist in your language,

00:11:55.220 --> 00:11:58.350
so we can understand
how the language community

00:11:58.350 --> 00:12:02.220
decided on structuring articles.

00:12:02.220 --> 00:12:06.170
And then we look
for the images, obviously,

00:12:06.170 --> 00:12:10.480
where Wikidata also
is a good point to go through.

00:12:12.550 --> 00:12:16.240
And then we made
a prettier interface for it

00:12:16.240 --> 00:12:18.420
because we decided to go mobile first.

00:12:18.420 --> 00:12:21.280
So most of the communities
that we aim to work with

00:12:21.280 --> 00:12:24.510
are very heavy on mobile editing.

00:12:24.510 --> 00:12:29.800
And so we do this mobile-first focus.

00:12:30.230 --> 00:12:34.060
And then, it also forces us
to break down into steps

00:12:34.060 --> 00:12:37.000
which eventually will lead to, 
yeah, I don't know,

00:12:37.000 --> 00:12:39.440
a step-by-step guide 
on how to write a new article.

00:12:39.440 --> 00:12:43.060
So an editor comes,
they can select section headers

00:12:43.060 --> 00:12:46.630
based on existing articles
in their language,

00:12:46.630 --> 00:12:49.150
write one section at a time,

00:12:49.150 --> 00:12:54.130
switch between the sections, 
and select references for each section.

00:12:55.600 --> 00:12:59.050
Yeah, so the idea is that
we will have an easier editing experience,

00:12:59.050 --> 00:13:00.680
especially for new editors,

00:13:00.680 --> 00:13:05.080
to keep them in-- 
integrate Wikidata information

00:13:05.080 --> 00:13:08.280
and [inaudible] images 
from Wikimedia Commons as well.

00:13:09.730 --> 00:13:12.110
If you're interested in Scribe,

00:13:12.110 --> 00:13:15.130
I'm working together
on this project with Hady.

00:13:15.130 --> 00:13:19.310
There are a lot of things online,

00:13:19.310 --> 00:13:23.240
but then also just come and talk to us.

00:13:23.240 --> 00:13:25.860
Also, if you're editing 
a low-resource Wikipedia,

00:13:25.860 --> 00:13:28.613
we're still looking
for people to interview

00:13:28.613 --> 00:13:31.570
because we're trying to emulate--

00:13:31.570 --> 00:13:33.880
we're trying to emulate as much as we can

00:13:33.880 --> 00:13:36.750
what people already experience,
or how they already edit.

00:13:36.750 --> 00:13:38.630
I'm not big on Wikipedia editing.

00:13:38.630 --> 00:13:40.510
Also, my native language is German.

00:13:40.510 --> 00:13:43.580
So I need a lot of input from editors

00:13:43.580 --> 00:13:48.080
that want to tell me
what they need, what they want,

00:13:48.080 --> 00:13:51.140
where they think this project can go.

00:13:51.140 --> 00:13:54.590
And if you are into Wikidata,
also come and talk to me, please.

00:13:55.730 --> 00:13:57.889
Okay, so those are all the projects

00:13:57.889 --> 00:14:01.880
or most of the projects we did
inside the Wikimedia world.

00:14:01.880 --> 00:14:05.780
And I want to give you one 
short overview of what's happening

00:14:05.780 --> 00:14:10.420
on my end of research,
around Wikidata as well.

00:14:14.290 --> 00:14:15.950
So I was part of a project

00:14:15.950 --> 00:14:17.820
that works a lot with question answering,

00:14:17.820 --> 00:14:20.460
and I don't know too much
about question answering,

00:14:20.460 --> 00:14:23.880
but what I do know a lot about
is knowledge graphs and multilinguality.

00:14:23.880 --> 00:14:25.680
So, basically, what we wanted to do

00:14:25.680 --> 00:14:29.963
is we have a question answering system
that gets a question from a user,

00:14:29.963 --> 00:14:35.770
and we wanted to select a knowledge graph
that can answer the question best.

00:14:35.770 --> 00:14:40.150
And again, we focused on 
a multilingual question answering system.

00:14:40.150 --> 00:14:45.750
So if I want to ask something about Bach,
for example, in Spanish and French--

00:14:45.750 --> 00:14:48.420
because those are the two languages
I know best--

00:14:48.420 --> 00:14:52.030
then which knowledge graph has the data

00:14:52.030 --> 00:14:53.950
to actually answer those questions?

00:14:55.160 --> 00:14:59.260
So what we did was we found a method
to rank knowledge graphs,

00:15:00.600 --> 00:15:04.800
based on the language metadata

00:15:04.800 --> 00:15:08.170
that appears in the knowledge graph,

00:15:08.170 --> 00:15:09.510
[which is split] by class.

00:15:09.510 --> 00:15:11.460
And then we look for each class

00:15:11.460 --> 00:15:14.440
into what languages are covered best,

00:15:14.440 --> 00:15:18.170
and then, depending on the question,
we can suggest a knowledge graph.

00:15:19.000 --> 00:15:22.510
Of the big knowledge graphs
we looked into,

00:15:22.510 --> 00:15:25.080
which are very well known and widely used,

00:15:25.080 --> 00:15:28.240
Wikidata covers the most languages
of all knowledge graphs,

00:15:28.240 --> 00:15:31.750
and we used a test bed.

00:15:31.750 --> 00:15:35.570
So we used a benchmark dataset
called QALD,

00:15:35.570 --> 00:15:39.350
which we then translated-- 
which was originally for DBpedia.

00:15:39.350 --> 00:15:41.880
We translated it
for those five knowledge graphs

00:15:41.880 --> 00:15:43.550
into SPARQL queries.

00:15:43.550 --> 00:15:49.820
And then we gave that to a crowd 
and looked into which knowledge graph

00:15:49.820 --> 00:15:54.680
has the best answers
for each of those SPARQL queries.

00:15:54.680 --> 00:15:59.370
And overall, the crowd workers
preferred Wikidata's answers

00:15:59.370 --> 00:16:01.640
because they are very precise,

00:16:02.890 --> 00:16:05.020
they are in most of the languages

00:16:05.020 --> 00:16:06.530
that the others don't cover,

00:16:07.620 --> 00:16:10.970
and they are not 
as repetitive or redundant

00:16:10.970 --> 00:16:12.480
as the [inaudible].

00:16:12.480 --> 00:16:16.680
So just to make a quick recap 
on the whole topic

00:16:16.680 --> 00:16:19.820
of Wikidata and the future and languages.

00:16:19.820 --> 00:16:23.620
So we can say that Wikidata
is already widely used

00:16:23.620 --> 00:16:27.910
for numerous applications in Wikipedia,

00:16:27.910 --> 00:16:30.080
and then outside Wikipedia for research.

00:16:30.080 --> 00:16:33.970
So what I talked about
is just the things I do research on,

00:16:33.970 --> 00:16:35.970
but there is still so much more.

00:16:35.970 --> 00:16:38.950
So there is machine translation
using knowledge graphs,

00:16:38.950 --> 00:16:40.950
there is rule mining
over knowledge graphs,

00:16:40.950 --> 00:16:43.530
there is entity linking in text.

00:16:43.530 --> 00:16:47.170
There is so much more research
happening at the moment,

00:16:47.170 --> 00:16:50.880
and Wikidata is getting more and more
popular for it.

00:16:50.880 --> 00:16:54.640
So I think we are at a very good stage

00:16:54.640 --> 00:16:57.590
to push and connect the communities.

00:16:58.640 --> 00:17:02.970
Yeah, to get the best 
from both sides, basically.

00:17:03.510 --> 00:17:04.770
Thank you very much.

00:17:04.770 --> 00:17:07.860
If you want to have a look
at any of those projects,

00:17:07.860 --> 00:17:09.285
they are there,

00:17:09.285 --> 00:17:10.710
my slides are in Commons already.

00:17:10.710 --> 00:17:14.800
If you want to read any of the papers, 
I think all of them are open access.

00:17:14.800 --> 00:17:16.260
If you can't find any of them,

00:17:16.260 --> 00:17:18.770
write me an email 
and I'll send it to you immediately.

00:17:18.770 --> 00:17:20.705
Thank you very much.

00:17:20.705 --> 00:17:22.400
(applause)

00:17:25.740 --> 00:17:28.130
(moderator) Okay, 
are there any questions?

00:17:28.130 --> 00:17:31.770
- (moderator) I'll come around.
- (person 1) Shall I come to you?

00:17:34.794 --> 00:17:36.370
(person 1) Hi Lucie, thank you so much,

00:17:36.370 --> 00:17:38.460
I'm so glad to see
you taking this forward.

00:17:38.460 --> 00:17:40.680
Now I'm really curious about Scribe.

00:17:41.510 --> 00:17:43.510
The example here with Nile University

00:17:43.510 --> 00:17:46.060
was the idea that the person says,

00:17:46.060 --> 00:17:47.540
"This is a university."

00:17:47.540 --> 00:17:49.020
And then you go to Wikidata

00:17:49.020 --> 00:17:51.930
and say, "Oh gosh! 
Universities have places

00:17:51.930 --> 00:17:54.110
and presidents, and I don't know what,"

00:17:54.110 --> 00:17:57.770
that you're using these as the parts, 
for telling the person what to do.

00:17:57.770 --> 00:18:00.840
So, basically, the idea
is that someone says,

00:18:00.840 --> 00:18:02.820
"I want to write about Nile University."

00:18:02.820 --> 00:18:07.040
We look into Nile University's
Wikidata item,

00:18:07.040 --> 00:18:09.820
and let's say-- I work a lot with Arabic--

00:18:09.820 --> 00:18:13.240
so let's say we then go
in Arabic Wikipedia,

00:18:13.240 --> 00:18:17.040
so we can make a grid, basically,

00:18:17.040 --> 00:18:19.370
of all items that are around
Nile University.

00:18:19.370 --> 00:18:23.060
So they are also universities,
they are also universities in Cairo,

00:18:23.060 --> 00:18:25.480
or they are also universities
in Egypt, stuff like that,

00:18:25.480 --> 00:18:27.350
or they have similar topics.

00:18:27.350 --> 00:18:32.530
So we can look into
all the similar items on Wikidata,

00:18:32.530 --> 00:18:36.330
and if they already have 
a Wikipedia entry in Arabic Wikipedia,

00:18:36.330 --> 00:18:38.610
we can look at the section titles.

00:18:38.610 --> 00:18:41.310
- (person 1) (gasps)
- Exactly, and then we can make basically,

00:18:41.310 --> 00:18:46.370
the most common way 
of writing about a university

00:18:46.370 --> 00:18:50.000
in Cairo on Arabic Wikipedia.

00:18:50.000 --> 00:18:52.703
- Yeah, so that's the--
- (person 1) Thank you, [inaudible].

00:18:56.880 --> 00:18:59.550
(person 2) Hi, thank you so much
for your inspiring talk.

00:18:59.550 --> 00:19:04.800
I was wondering if this would work
for languages in Incubator?

00:19:04.800 --> 00:19:10.620
Like, I work with really low,
low, low, low-resource languages

00:19:10.620 --> 00:19:16.461
and this thing about doing it mobile
would be a huge thing,

00:19:16.461 --> 00:19:20.020
because in many communities
they only have phones, not laptops.

00:19:20.020 --> 00:19:22.020
So, would it work?

00:19:22.020 --> 00:19:26.080
So I think, to an extent--

00:19:26.080 --> 00:19:32.050
so the general structure, the skeleton
of the application would work.

00:19:32.050 --> 00:19:35.280
Two things that we're thinking about
a lot at the moment

00:19:35.280 --> 00:19:37.080
for exactly those use cases is,

00:19:37.080 --> 00:19:39.970
how much would we want,
for example, to say,

00:19:39.970 --> 00:19:44.530
if there are no articles 
on a similar topic in your Wikipedia,

00:19:44.530 --> 00:19:46.930
how much do we want
to get it from other Wikipedias?

00:19:46.930 --> 00:19:49.750
And that's why I'm basically 
doing those interviews at the moment,

00:19:49.750 --> 00:19:51.420
because I try to understand

00:19:51.420 --> 00:19:54.570
how much people already look
at other language Wikipedias

00:19:54.570 --> 00:19:57.040
to make the structure of an article.

00:19:57.040 --> 00:19:58.800
Are they generally equal

00:19:58.800 --> 00:20:01.630
or do they differ a lot
based on cultural context?

00:20:01.630 --> 00:20:04.310
So that would be something to consider,

00:20:04.310 --> 00:20:06.640
but there is a possibility to say,

00:20:06.640 --> 00:20:09.550
we take everything 
from all the language Wikipedias

00:20:09.550 --> 00:20:12.040
and then make an average, basically.

00:20:12.040 --> 00:20:14.970
And the other problem is referencing.

00:20:14.970 --> 00:20:16.460
So that's something we find.

00:20:16.460 --> 00:20:20.730
We make it very convenient
because we use a lot of Arabic,

00:20:20.730 --> 00:20:24.170
and Arabic actually has the problem
that there are a lot of references,

00:20:24.170 --> 00:20:28.790
but they are very little used
or not widely used in Wikipedia.

00:20:29.260 --> 00:20:31.570
That's not true, obviously,
for all languages,

00:20:31.570 --> 00:20:34.104
and that's something
I'd be very interested--

00:20:34.104 --> 00:20:35.180
like, let's talk.

00:20:35.180 --> 00:20:36.680
That's what I'm trying to say,

00:20:36.680 --> 00:20:39.170
I'd be very interested 
in your perspective on it

00:20:39.170 --> 00:20:41.680
because I'd like to know, yeah

00:20:41.680 --> 00:20:43.750
what do you think about referencing

00:20:43.750 --> 00:20:45.460
done from English or any other language.

00:20:45.460 --> 00:20:46.840
(person 2) Have you ever tried--

00:20:46.840 --> 00:20:51.880
what we do is we normally
reference interviews we have.

00:20:51.880 --> 00:20:55.600
We put them in our repository,
institutional repository,

00:20:55.600 --> 00:20:59.570
because these languages 
don't have written references,

00:20:59.570 --> 00:21:03.240
and I feel like 
that is the way to go, but--

00:21:03.240 --> 00:21:06.910
I'm currently also--
Kimberly and I are discussing a lot.

00:21:06.910 --> 00:21:10.930
We did a session at Wikimania
on oral knowledge and oral citations.

00:21:10.930 --> 00:21:14.135
Yeah, we should hang out 
and have a long conversation.

00:21:14.135 --> 00:21:15.620
(laughs)

00:21:18.310 --> 00:21:22.040
(person 3) So [Michael Davignon], 
we'll talk about medium size,

00:21:22.040 --> 00:21:23.910
which is probably around ten people,

00:21:23.910 --> 00:21:27.750
so it's medium for Breton Wikipedia.

00:21:27.750 --> 00:21:30.600
And I'm wondering if we can use Scribe,

00:21:31.530 --> 00:21:34.880
how to find a common plan
the other way around

00:21:34.880 --> 00:21:37.770
for existing articles,
to find the outliers,

00:21:37.770 --> 00:21:39.571
that's supposed to be the best plan,

00:21:39.571 --> 00:21:42.130
but I'm not aware of more or less

00:21:42.130 --> 00:21:44.710
[inaudible] 
improvement existing article.

00:21:46.790 --> 00:21:49.440
I think there's--

00:21:49.440 --> 00:21:50.800
I forgot the name, I think,

00:21:50.800 --> 00:21:53.790
[Diego] in the Wikimedia Foundation
research team,

00:21:53.790 --> 00:21:58.407
who's working a lot at the moment
with section headings.

00:21:58.407 --> 00:22:01.420
But, yes, generally, the idea is the same.

00:22:01.420 --> 00:22:04.640
So instead of using them
to make an average

00:22:04.640 --> 00:22:07.260
you could say, 
this is not like the average.

00:22:08.170 --> 00:22:09.680
That's very possible, yeah.

00:22:14.750 --> 00:22:18.330
(person 4) Hi, Lucie. I'm Érica Azzellini
from Wiki Movement Brazil,

00:22:18.330 --> 00:22:20.130
and I'm very--

00:22:20.130 --> 00:22:21.860
(Érica) Oh, can you hear me?

00:22:21.860 --> 00:22:24.680
So, I'm Érica Azzellini
from Wiki Movement Brazil,

00:22:24.680 --> 00:22:26.560
and I'm really impressed with your work

00:22:26.570 --> 00:22:29.154
because it's really in sync

00:22:29.154 --> 00:22:32.540
with what we've been working on in Brazil
with the Mbabel tool.

00:22:32.540 --> 00:22:33.950
I don't know if you heard about it?

00:22:33.950 --> 00:22:36.000
- Not yet. 
- (Érica) It's a tool that we use

00:22:36.020 --> 00:22:38.440
to automatically
generate Wikipedia entries

00:22:38.440 --> 00:22:42.240
using Wikidata information 
in a simple way

00:22:42.240 --> 00:22:46.510
that can be replicated 
on other Wikipedia languages.

00:22:46.510 --> 00:22:48.950
So we've been working
on Portuguese mainly,

00:22:48.950 --> 00:22:51.860
and we're trying to get
on English Wikipedia too,

00:22:51.860 --> 00:22:56.200
but it can be replicated
on any language, basically,

00:22:56.200 --> 00:22:58.460
and I think then we could talk about it.

00:22:58.460 --> 00:23:00.460
Absolutely, it will be super interesting

00:23:00.460 --> 00:23:03.260
because the article placeholder
is an extension already,

00:23:03.260 --> 00:23:06.130
so it might be worth 
integrating your efforts

00:23:06.130 --> 00:23:07.950
into the existing extension.

00:23:07.950 --> 00:23:12.620
Lydia is also fully for it,
and... (laughs)

00:23:12.620 --> 00:23:13.930
And then because--

00:23:13.930 --> 00:23:17.040
so one of the problems--
[Marius] correct me if I'm wrong--

00:23:17.040 --> 00:23:20.310
we had was that
article placeholder doesn't scale

00:23:20.310 --> 00:23:22.240
as well as it should.

00:23:22.240 --> 00:23:24.860
So article placeholder
is not in Portuguese

00:23:24.860 --> 00:23:28.545
because we're always afraid
it will break everything, correct?

00:23:29.460 --> 00:23:32.286
And then [Marius] is just taking a pause.

00:23:32.286 --> 00:23:35.420
- (Érica) Yeah, you should be careful.
- Don't want to say anything about this.

00:23:35.420 --> 00:23:38.950
But, yeah, we should connect
because I'd be super interested to see

00:23:38.950 --> 00:23:42.040
how you solve those issues
and how it works for you.

00:23:42.040 --> 00:23:45.310
(Érica) I'm going to present 
on the second section

00:23:45.310 --> 00:23:48.350
of the learning talk about this project 
that we've been developing,

00:23:48.350 --> 00:23:50.620
and we've been using it 
on GLAM-Wiki initiatives

00:23:50.620 --> 00:23:52.440
and education projects already.

00:23:52.440 --> 00:23:54.480
- Perfect.
- (Érica) So let's do that.

00:23:54.480 --> 00:23:56.440
Yeah, absolutely let's chat.

00:23:57.220 --> 00:23:58.274
(moderator) Cool.

00:23:58.274 --> 00:24:00.370
Some other questions on your projects?

00:24:02.460 --> 00:24:06.820
(person 5) Hi, my name is [Alan], 
and I think this is extremely cool.

00:24:06.820 --> 00:24:09.170
I had a few questions about

00:24:09.170 --> 00:24:13.110
generating Wiki sentences
from neural networks.

00:24:13.110 --> 00:24:16.020
- Yeah.
- (person 5) So I've come across

00:24:16.020 --> 00:24:19.240
another project
that was attempting to do this,

00:24:19.240 --> 00:24:23.020
and it was essentially using 
[triples input and sentences output],

00:24:23.020 --> 00:24:25.510
and it was able 
to generate very fluent sentences.

00:24:25.510 --> 00:24:29.360
But sometimes they weren't...

00:24:30.370 --> 00:24:33.820
actually, they weren't correct,
with regards to the triple.

00:24:33.820 --> 00:24:39.420
And I was curious if you had any ways
of doing validity checks of this sort.

00:24:39.420 --> 00:24:43.040
Sometimes the triple 
is "subject, predicate, object,"

00:24:43.040 --> 00:24:46.110
but the language model says,

00:24:46.110 --> 00:24:48.565
"Okay, this object is very rare,

00:24:48.565 --> 00:24:51.740
I'm going to say you are born in San Jose,

00:24:51.740 --> 00:24:55.060
instead of San Francisco or vice versa."

00:24:55.060 --> 00:24:58.880
And I was curious
if you had come across this?

00:24:58.880 --> 00:25:01.510
So that's what we call hallucinations.

00:25:01.510 --> 00:25:05.080
The idea that 
there's something in a sentence

00:25:05.080 --> 00:25:07.690
that wasn't in the original triple
and the data.

00:25:08.400 --> 00:25:11.350
What we do-- 
so we don't do anything about it,

00:25:11.350 --> 00:25:13.910
we just also realized
that that's happening.

00:25:13.910 --> 00:25:15.910
It's even more happening
for the low-resource,

00:25:15.910 --> 00:25:19.730
because we work across domains,
so we are domain independently generating.

00:25:19.730 --> 00:25:24.670
Traditional NLG work 
is always biography domain, usually.

00:25:24.670 --> 00:25:26.620
So that happens a lot

00:25:26.620 --> 00:25:29.510
because we just have little training data 
on the low-resource languages.

00:25:30.400 --> 00:25:32.800
We have a few ideas.

00:25:32.800 --> 00:25:36.840
It's one of the million topics
I'm supposed to work on at the moment.

00:25:38.850 --> 00:25:42.550
One of them is to use 
entity linking and relation extraction,

00:25:42.550 --> 00:25:44.440
to align what we generate

00:25:44.440 --> 00:25:46.640
with the triples
we inputted in the first place,

00:25:46.640 --> 00:25:50.750
to see if it's off or the network 
generates information it shouldn't have

00:25:50.750 --> 00:25:54.090
or it cannot know about, basically.

00:25:54.090 --> 00:25:58.680
That's also all I can say about this
because now time is over.

00:25:58.680 --> 00:26:01.480
(person 5) I'd love to talk offline
about this, if you have time.

00:26:01.480 --> 00:26:03.260
Yeah, absolutely, let's chat about it.

00:26:03.260 --> 00:26:05.140
Thank you so much,
everyone, it was lovely.

00:26:05.140 --> 00:26:06.600
(moderator) Thank you, Lucie.

00:26:06.600 --> 00:26:08.610
(applause)