-
Hi, I'm Lucie.
-
You know me from rambling about
not enough language data in Wikidata,
-
and I thought instead of rambling today,
which I'll leave to Lydia later today,
-
I'll just show you a bit, or give you
an insight on the projects we did
-
using the data that we already have
on Wikidata, for different causes.
-
So underserved languages
compared to the keynote we just heard
-
where the person was talking about
underserved as like minority languages,
-
underserved languages to me
are any languages
-
that don't have
enough representation on the web.
-
Yeah, just to get that clear.
-
So, who am I?
-
Why am I always talking
about languages on Wikidata?
-
Not sure but...
-
I'm a Computer Science PhD student
-
at the University of Southampton.
-
I'm a research intern
at Bloomberg in London, at the moment.
-
I'm a resident
at Newspeak House in London.
-
I am a researcher and project manager
for the Scribe project,
-
which I'll go into in a bit,
-
and I recently got into the idea
of oral knowledge and oral citation.
-
Kimberly is sitting right there.
-
And then, occasionally,
I have time to sleep
-
and do other things, but that's very rare.
-
So if you're interested
in any of those things,
-
come and talk to me.
-
Generally, this is an open presentation,
and you can ask a few questions in between.
-
I'll run through a lot of things
in a very short time now.
-
Come to me afterwards
if you're interested in any of them.
-
Speak to me. I'm here.
-
I'm always very happy to speak to people.
-
So that's a bit of what
we will talk about today.
-
So Wikidata, giving an introduction,
-
even though that's obviously
not as necessary.
-
the article placeholder,
which is aimed at Wikipedia readers,
-
Scribe, which is aimed
at Wikipedia editors,
-
and then we have one topic of my research,
-
which is completely outside of Wikipedia
-
where we use Wikidata
for question answering.
-
So just a quick recap.
-
Why is Wikidata so cool
for low-resource languages?
-
Well, we have those unique identifiers.
-
I'm speaking to people that know that
-
even better than I do.
-
And then we have labels
in different languages.
-
Those can be in over,
I think, 400 languages by now,
-
so we have a good opportunity here
-
to capture language in different forms
and reuse it.
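A minimal sketch of what that looks like in practice, using the public wbgetentities API; the item Q42 and the language codes here are just examples:

```python
# Minimal sketch: ask Wikidata which labels one item (Q42, Douglas Adams) carries
# in a handful of languages, via the public wbgetentities API.
import requests

API = "https://www.wikidata.org/w/api.php"

def get_labels(item_id, languages):
    """Return {language code: label} for the requested languages, where present."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "labels",
        "languages": "|".join(languages),
        "format": "json",
    }
    entity = requests.get(API, params=params).json()["entities"][item_id]
    return {code: label["value"] for code, label in entity.get("labels", {}).items()}

# The same identifier resolves to whatever labels editors have added per language.
print(get_labels("Q42", ["en", "ar", "eo", "sw"]))
```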
-
Yeah, so that's a little bit of me
rambling about Wikidata
-
because I can't stop it.
-
We compared coverage on Wikidata
to the number of native speakers,
-
so we can see, obviously,
-
there are languages
that are widely spoken in the world.
-
There's Chinese, Hindi, or Arabic,
-
but then very low coverage on Wikidata.
-
Then the opposite.
-
Sorry, we have the Dutch
and the Swedish communities,
-
which were super active on Wikidata,
-
which is really cool,
and that just points out
-
that even though we have
a low number of speakers,
-
we can have a big impact if people
are very active in the communities,
-
which is really nice and really good.
-
But also, let's try to even
that graph out in the future.
-
So, cool. So now we have
all this language data in Wikidata.
-
We have low-resource Wikipedias,
so we thought, what can we do?
-
Well, my undergrad supervisor
is sitting here,
-
and we worked back then
in the golden days,
-
on something called
the article placeholder
-
which takes triples from Wikidata
and displays it on Wikipedia.
-
And that's relatively straightforward.
-
So you just take the content of Wikidata,
display it on Wikipedia
-
to attract more readers
and then eventually more editors
-
in the different low-resource languages.
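As a rough sketch of that idea (illustrative code, not the actual ArticlePlaceholder extension), one can pull an item's direct statements from the Wikidata query service and show them with labels in the reader's language, falling back to English:

```python
# Illustrative sketch of the article placeholder idea: fetch an item's direct
# statements and render them with labels in the reader's language (English fallback).
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def placeholder_rows(item_id, lang):
    query = f"""
    SELECT ?propLabel ?valueLabel WHERE {{
      wd:{item_id} ?p ?value .
      ?prop wikibase:directClaim ?p .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang},en". }}
    }}
    LIMIT 50
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "article-placeholder-sketch/0.1"},
    )
    for row in response.json()["results"]["bindings"]:
        yield row["propLabel"]["value"], row["valueLabel"]["value"]

for prop, value in placeholder_rows("Q42", "eo"):  # Q42 only as an example item
    print(f"{prop}: {value}")
```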
-
They are dynamically generated,
-
so they're not like stubs or bot articles
-
that then flood the Wikipedia
so people can edit them.
-
It's basically a starting point.
-
And we thought,
well, we have that content,
-
and we have that knowledge
somewhere already, which is Wikidata.
-
The content is often already there
in those languages, but without articles,
-
so we can at least give readers
insight into the information.
-
The article placeholders are live
on 14 low-resource Wikipedias.
-
If you are part of a Wikipedia community
-
and interested in it,
-
let us know.
-
And then I went into research,
-
and I got stuck with
the article placeholder, though,
-
so we started to look into
text generation from Wikidata
-
for Wikipedia and low-resource languages.
-
And text generation is really interesting
-
because research, at the point
when we started the project,
-
was focused entirely on English,
-
which is a bit pointless in my experience
-
because, I mean, you have a lot of people
who write in English,
-
but then what we need is people
who write in those low-resource languages.
-
And our starting point was
that looking at triples on Wikipedia
-
is not exactly the nicest thing.
-
I mean, as much as I love
the article placeholder,
-
it's not exactly
what you want to see or expect
-
when you open a Wikipedia page.
-
So we try to generate text.
-
We use this beautiful
neural network model,
-
where we encode Wikidata triples.
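Roughly, the input side works by flattening the triples into a token sequence the encoder can read; this is an illustrative sketch, with made-up special tokens rather than the exact format from the paper:

```python
# Illustrative sketch of the input side: Wikidata triples about one item are
# linearized into a flat token sequence, which a standard encoder-decoder network
# can then learn to map to a one-sentence summary in the target language.
def linearize(triples):
    """triples: list of (subject label, property label, object label) strings."""
    tokens = []
    for subject, prop, obj in triples:
        tokens += ["<s>", subject, "<p>", prop, "<o>", obj]
    return " ".join(tokens)

marrakesh = [
    ("Marrakesh", "instance of", "city"),
    ("Marrakesh", "country", "Morocco"),
]
# A seq2seq model trained on (linearized triples -> first Wikipedia sentence) pairs
# would take this string, tokenized, as its encoder input.
print(linearize(marrakesh))
```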
-
If you're interested more
in the technical parts,
-
come and talk to me.
-
And so, realistically,
with neural text generation,
-
you can generate one or two sentences
-
before it completely scrambles
and becomes useless.
-
So we've generated one sentence
that describes the topic of the triple.
-
And so this, for example, is Arabic.
-
We generate the sentence about Marrakesh,
-
where it just describes the city.
-
So for that, then, we tested this--
-
So we did studies, obviously,
to test if our approach works,
-
and if it makes sense, to use such things.
-
And because we are
very application-focused,
-
we tested it with actual
Wikipedia readers and editors.
-
So, first, we tested it
with Wikipedia readers
-
in Arabic and Esperanto--
-
so use cases with Arabic and Esperanto.
-
And we can see that our model
can generate sentences
-
that are very fluent
-
and that feel very much--
surprisingly, a lot, actually--
-
like Wikipedia sentences.
-
So it picks up on that.
For Arabic, for example,
-
we train only on Arabic, with the idea
-
that we want to keep
the cultural context of that language
-
and not let it be influenced
-
by other languages
that have higher coverage.
-
And then we did a study
with Wikipedia editors
-
because in the end the article placeholder
is just a starting point
-
for people to start editing,
-
and we tried to measure
-
how much of the sentences
they would reuse.
-
How much is useful for them, basically,
-
and you can see
that there is a high number of reuse,
-
especially in Esperanto,
when we tested with editors.
-
And finally, we did also
qualitative interviews
-
with Wikipedia editors
across six languages.
-
I think we had
about ten people we interviewed.
-
And we tried to get
more of an understanding
-
of the human perspective
on those generated sentences.
-
So we now have
a very quantified way of saying,
-
yeah, they are good,
-
but we wanted to see
-
how the interaction goes,
-
and especially with what
always happens
-
in neural machine translation
and neural text generation:
-
you have those missing word tokens,
which we mark as "rare" in there.
-
Those are the example sentences we used.
All of them are about Marrakesh.
-
So we wanted to see how much
are people bothered by it,
-
what's the quality,
-
what are the things
that stand out to them,
-
and we can see that the mistakes
by the network, like those red tokens,
-
are often just ignored.
-
There is this interesting factor
that because we didn't tell them
-
where this happens,
where we got the sentences from--
-
because it was on a user page of mine
-
but it looked like it was on Wikipedia--
-
people just trusted it.
-
And I think that's very important
-
when we look into those kinds
of research directions:
-
we cannot undermine
this trust in Wikipedia.
-
So if we work with Wikipedians
and Wikipedia itself,
-
if we take things from,
for example, Wikidata,
-
that's good
because it's also human-curated.
-
But when we start
with artificial intelligence projects,
-
we have to be really careful
what we actually expose people to,
-
because they just trust
the information that we give them.
-
So we could see, for example,
in the Arabic version,
-
it gave the wrong location for Marrakesh,
-
and people, even the people I interviewed
-
that were living in Marrakesh,
didn't pick up on that,
-
because it's on Wikipedia,
so it should be fine, right?
-
(chuckles)
-
Yeah.
-
We found there was a magical threshold
for the length of the generated text,
-
so that's something we found,
-
especially in comparison
with the content translation tool,
-
where you have a long
automatically generated text,
-
and people were complaining
that content translation was very hard
-
because you're just doing post-editing,
you don't have the creativity.
-
There are other remarks
on content translation I usually make--
-
I'll skip them for now.
-
So that one sentence was helpful
-
because even if we made mistakes,
people were still willing to fix them,
-
because it's a very short
intervention.
-
And then, finally,
a lot of people pointed out,
-
that it was particularly good
for a new editor,
-
so for them to have a starting point,
-
to have those triples, to have a sentence,
-
so they have something to start from.
-
So after all those interviews were done,
-
I thought, that's very interesting.
-
What else can we do with that knowledge?
-
And so we started a new project,
exactly because there weren't enough yet.
-
And the new project we have
is called Scribe,
-
and Scribe focuses on new editors
that want to write a new article,
-
and particularly people
who haven't written
-
an article on Wikipedia yet,
-
and specifically also
on low-resource languages.
-
So the idea is that--
that's the pixel version of me.
-
All my slides are basically
-
references to people in this room,
which I really love.
-
It feels like I'm home again.
-
So, yeah, I want to write a new article,
-
but I don't know where to start
as a new editor,
-
and so we have this project Scribe.
-
A scribe was someone
whose profession was writing,
-
for example in ancient Egypt.
-
So the Scribe project's idea
is that we want to give people, basically,
-
a hand when they start
writing their first articles.
-
So give them a skeleton,
-
give them a skeleton that's based
on their language Wikipedia,
-
instead of just translating the content
from another language Wikipedia.
-
So the first thing we want to do
is plan section titles,
-
then select references for each section,
-
ideally in the local Wikipedia language,
-
and then summarize those references
to give a starting point to write.
-
For the project, we have
a Wikimedia Foundation project grant.
-
So it just started.
-
So we are very open
to feedback, in general.
-
That was the very first
not so beautiful layout,
-
but just for you to get an overview.
-
So there is this idea
of collecting references,
-
images from Commons, section titles.
-
And so the main thing
we want to use Wikidata for
-
is the sections.
-
So, basically, we want to see
what are articles
-
on similar topics
already existing in your language,
-
so we can understand
how the language community
-
decided on structuring articles.
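A hedged sketch of how that lookup could work (illustrative code, not Scribe's actual implementation): find items that share a type with the topic the editor wants to write about and that already have an article in the target-language Wikipedia.

```python
# Illustrative sketch (not Scribe's actual code): find items of the same type as the
# target item that already have an article in a given language Wikipedia, so their
# section structure can be analysed afterwards.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def similar_articles(item_id, lang="ar", limit=50):
    query = f"""
    SELECT DISTINCT ?similar ?article WHERE {{
      wd:{item_id} wdt:P31 ?type .      # type(s) of the target item
      ?similar wdt:P31 ?type .          # other items of the same type
      ?article schema:about ?similar ;
               schema:isPartOf <https://{lang}.wikipedia.org/> .
      FILTER(?similar != wd:{item_id})
    }}
    LIMIT {limit}
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "scribe-sketch/0.1"},
    )
    return [row["article"]["value"] for row in response.json()["results"]["bindings"]]
```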
-
And then we look
for the images, obviously,
-
where Wikidata also
is a good point to go through.
-
And then we made
a prettier interface for it
-
because we decided to go mobile first.
-
So most of the communities
that we aim to work with
-
are very heavy on mobile editing.
-
And so we do this mobile-first focus.
-
And then, it also forces us
to break down into steps
-
which eventually will lead to,
yeah, I don't know,
-
a step-by-step guide
on how to write a new article.
-
So an editor comes,
they can select section headers
-
based on existing articles
in their language,
-
write one section at a time,
-
switch between the sections,
and select references for each section.
-
Yeah, so the idea is that
we will have an easier editing experience,
-
especially for new editors,
-
to keep them in--
integrate Wikidata information
-
and [inaudible] images
from Wikimedia Commons as well.
-
If you're interested in Scribe,
-
I'm working together
on this project with Hady.
-
There is a lot of things online,
-
but then also just come and talk to us.
-
Also, if you're editing
a low-resource Wikipedia,
-
we're still looking
for people to interview
-
because we're trying to emulate--
-
we're trying to emulate as much as we can
-
what people already experience,
or how they already edit.
-
I'm not big on Wikipedia editing.
-
Also, my native language is German.
-
So I need a lot of input from editors
-
that want to tell me
what they need, what they want,
-
where they think this project can go.
-
And if you are into Wikidata,
also come and talk to me, please.
-
Okay, so that's all the projects
-
or most of the projects we did
inside the Wikimedia world.
-
And I want to give you one
short overview of what's happening
-
on my end of research,
around Wikidata as well.
-
So I was part of a project
-
that works a lot with question answering,
-
and I don't know too much
about question answering,
-
but what I do know a lot about
is knowledge graphs and multilinguality.
-
So, basically, what we wanted to do
-
is we have a question answering system
that gets a question from a user,
-
and we wanted to select a knowledge graph
that can answer the question best.
-
And again, we focused on
multilingual question answering systems.
-
So if I want to ask something about Bach,
for example, in Spanish and French--
-
because those are the two languages
I know best--
-
then which knowledge graph has the data
-
to actually answer those questions?
-
So what we did was we found a method
to rank knowledge graphs,
-
based on metadata about the languages
-
that appear in the knowledge graph,
-
split by class.
-
And then we look, for each class,
-
into which languages are covered best,
-
and then, depending on the question,
we can suggest a knowledge graph.
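A toy sketch of that routing step, with made-up coverage numbers and a simplified score rather than the exact metric from the paper:

```python
# Toy sketch of the ranking idea: per knowledge graph and per class, store the share
# of entities that carry a label in each language (numbers below are made up), then
# route a question to the graph with the best coverage for its class and language.
coverage = {
    "Wikidata":    {"composer": {"es": 0.92, "fr": 0.94}},
    "other graph": {"composer": {"es": 0.35, "fr": 0.61}},
}

def rank_graphs(target_class, language):
    scores = {
        kg: classes.get(target_class, {}).get(language, 0.0)
        for kg, classes in coverage.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# A Spanish question about Bach (class "composer") would go to the top-ranked graph.
print(rank_graphs("composer", "es"))
```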
-
Of the big knowledge graphs
we looked into,
-
which are well known and widely used,
-
Wikidata covers the most languages
of all the knowledge graphs,
-
and we used a test bed.
-
So we used a benchmark dataset
called QALD,
-
which we then translated--
which was originally for DBpedia.
-
We translated it
for those five knowledge graphs
-
into SPARQL queries.
-
And then we gave that to a crowd
and looked into which knowledge graph
-
has the best answers
for each of those SPARQL queries.
-
And overall, the crowd workers
preferred Wikidata's answers
-
because they are very precise,
-
they are in most of the languages
-
that the others don't cover,
-
and they are not
as repetitive or redundant
-
as the [inaudible].
-
So just to make a quick recap
on the whole topic
-
of Wikidata and the future and languages.
-
So we can say that Wikidata
is already widely used
-
for numerous applications in Wikipedia,
-
and then outside Wikipedia for research.
-
So what I talked about
is just the things I do research on,
-
but there is still so much more.
-
So there is machine translation
using knowledge graphs,
-
there is rule mining
over knowledge graphs,
-
there is entity linking in text.
-
There is so much more research
happening at the moment,
-
and Wikidata is getting more and more
popular for this kind of usage.
-
So I think we are at a very good stage
-
to push and connect the communities.
-
Yeah, to get the best
from both sides, basically.
-
Thank you very much.
-
If you want to have a look
at any of those projects,
-
they are there,
-
my slides are on Commons already.
-
If you want to read any of the papers,
I think all of them are open access.
-
If you can't find any of them,
-
write me an email
and I'll send it to you immediately.
-
Thank you very much.
-
(applause)
-
(moderator) Okay,
are there any questions?
-
- (moderator) I'll come around.
- (person 1) Shall I come to you?
-
(person 1) Hi Lucie, thank you so much,
-
I'm so glad to see
you taking this forward.
-
Now I'm really curious about Scribe.
-
The example here within our university
-
was the idea that the person says,
-
"This is a university."
-
And then you go to Wikidata
-
and say, "Oh gosh!
Universities have places
-
and presidents, and I don't know what,"
-
so you're using these as the parts
for telling the person what to do?
-
So, basically, the idea
is that someone says,
-
"I want to write about Nile University."
-
We look into Nile University's
Wikidata item,
-
and let's say-- I work a lot with Arabic--
-
so let's say we then go
in Arabic Wikipedia,
-
so we can make a grid, basically,
-
of all items that are around
Nile University.
-
So items that are also universities,
that are also universities in Cairo,
-
or that are also universities
in Egypt, stuff like that,
-
or they have similar topics.
-
So we can look into
all the similar items on Wikidata,
-
and if they already have
a Wikipedia entry in Arabic Wikipedia,
-
we can look at the section titles.
-
- (person 1) (gasps)
- Exactly, and then we can derive, basically,
-
the most common way
of writing about a university
-
in Cairo on Arabic Wikipedia.
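As a rough sketch of that last step (illustrative code, not Scribe's implementation), the headings of those existing articles can simply be counted and the most frequent ones proposed as a skeleton:

```python
# Illustrative sketch: count section headings across existing similar articles and
# propose the most common ones as a skeleton for the new article.
from collections import Counter

def suggest_sections(section_lists, top_n=5):
    """section_lists: one list of section headings per existing similar article."""
    counts = Counter(heading for article in section_lists for heading in article)
    return [heading for heading, _ in counts.most_common(top_n)]

existing_articles = [  # made-up headings, shown in English for readability
    ["History", "Campus", "Faculties", "References"],
    ["History", "Faculties", "Notable alumni", "References"],
    ["History", "Campus", "References"],
]
print(suggest_sections(existing_articles))
# -> ['History', 'References', 'Campus', 'Faculties', 'Notable alumni']
```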
-
- Yeah, so that's the--
- (person 1) Thank you, [inaudible].
-
(person 2) Hi, thank you so much
for your inspiring talk.
-
I was wondering if this would work
for languages in Incubator?
-
Like, I work with really low,
low, low, low-resource languages
-
and this thing about doing it mobile
would be a huge thing,
-
because in many communities
they only have phones, not laptops.
-
So, would it work?
-
So I think, to an extent--
-
so the general structure, the skeleton
of the application would work.
-
Two things that we're thinking about
a lot at the moment
-
for exactly those use cases are,
-
how much would we want,
for example, to say,
-
if there are no articles
on a similar topic in your Wikipedia,
-
how much do we want
to get from other Wikipedias.
-
And that's why I'm basically
doing those interviews at the moment,
-
because I try to understand
-
how much people already look
at other language Wikipedias
-
to make the structure of an article.
-
Are they generally the same,
-
or do they differ a lot
based on cultural context?
-
So that would be something to consider,
-
but there is a possibility to say,
-
we take everything
from all the language Wikipedias
-
and then make an average, basically.
-
And the other problem is referencing.
-
So that's something we find.
-
We see it very clearly
because we work a lot with Arabic,
-
and Arabic actually has the problem
that there are a lot of references,
-
but they are rarely used,
or not widely used, on Wikipedia.
-
That's not true, obviously,
for all languages,
-
and that's something
I'd be very interested--
-
like, let's talk.
-
That's what I'm trying to say,
-
I'd be very interested
in your perspective on it,
-
because I'd like to know, yeah,
-
what you think about referencing
-
done from English or any other language.
-
(person 2) Have you ever tried--
-
what we do is we normally
reference interviews we have.
-
We put them in our repository,
institutional repository,
-
because these languages
don't have written references,
-
and I feel like
that is the way to go, but--
-
I'm currently also--
Kimberly and I are discussing a lot.
-
We made a session on Wikimania
on oral knowledge and oral citations.
-
Yeah, we should hang out
and have a long conversation.
-
(laughs)
-
(person 3) So [Michael Davignon],
we'll talk about medium size,
-
which is probably around ten people,
-
so it's medium for the Breton Wikipedia.
-
And I'm wondering if we can use Scribe,
-
how to find a common plan
the other way around,
-
for an existing article,
to find [the outer layers],
-
that's supposed to be the best plan,
-
but I'm not aware of more or less
-
[inaudible]
improving an existing article.
-
I think there's--
-
I forgot the name, I think,
-
[Diego] in the Wikimedia Foundation
research team,
-
who's working a lot at the moment
with section headings.
-
But, yes, generally, the idea is the same.
-
So instead of using them
to make an average
-
you could say,
this one is not like the average.
-
That's very possible, yeah.
-
(person 4) Hi, Lucie. I'm Érica Azzellini
from Wiki Movement Brazil,
-
and I'm very--
-
(Érica) Oh, can you hear me?
-
So, I'm Érica Azzellini
from Wiki Movement Brazil,
-
and I'm really impressed with your work
-
because it's really in sync
-
with what we've been working on in Brazil
with the Mbabel tool.
-
I don't know if you heard about it?
-
- Not yet.
- (Érica) It's a tool that we use
-
to automatically
generate Wikipedia entries
-
using Wikidata information
in a simple way
-
that can be replicated
on other Wikipedia languages.
-
So we've been working
on Portuguese mainly,
-
and we're trying to get it
on English Wikipedia too,
-
but it can be replicated
on any language, basically,
-
and I think then we could talk about it.
-
Absolutely, it will be super interesting
-
because the article placeholder
is an extension already,
-
so it might be worth
to integrate your efforts
-
into the existing extension.
-
Lydia is also fully for it,
and... (laughs)
-
And then because--
-
so one of the problems--
[Marius] correct me if I'm wrong--
-
we had was that
article placeholder doesn't scale
-
as well as it should.
-
So article placeholder
is not in Portuguese
-
because we're always afraid
it will break everything, correct?
-
And then [Marius] is just taking a pause.
-
- (Érica) Yeah, you should be careful.
- Don't want to say anything about this.
-
But, yeah, we should connect
because I'd be super interested to see
-
how you solve those issues
and how it works for you.
-
(Érica) I'm going to present
in the second session
-
of the lightning talks about this project
that we've been developing,
-
and we've been using it
on GLAM-Wiki initiatives
-
and education projects already.
-
- Perfect.
- (Érica) So let's do that.
-
Yeah, absolutely let's chat.
-
(moderator) Cool.
-
Some other questions on your projects?
-
(person 5) Hi, my name is [Alan],
and I think this is extremely cool.
-
I had a few questions about
-
generating Wiki sentences
from neural networks.
-
- Yeah.
- (person 5) So I've come across
-
another project
that was attempting to do this,
-
and it was essentially using
triples as input and sentences as output,
-
and it was able
to generate very fluent sentences.
-
But sometimes they weren't...
-
actually, they weren't correct,
with regards to the triple.
-
And I was curious if you had any ways
of doing validity checks for this.
-
Sometimes the triple
is "subject, predicate, object,"
-
but the language model says,
-
"Okay, this object is very rare,
-
I'm going to say you are born in San Jose,
-
instead of San Francisco or vice versa."
-
And I was curious
if you had come across this?
-
So that's what we call hallucinations.
-
The idea that
there's something in a sentence
-
that wasn't in the original triple
and the data.
-
What we do--
so we don't do anything about it,
-
we just also realized
that that's happening.
-
It happens even more
for the low-resource languages,
-
because we work across domains,
so we are generating domain-independently.
-
Traditional NLG work is usually
in the biography domain.
-
So that happens a lot
-
because we just have little training data
on the low-resource languages.
-
We have a few ideas.
-
It's one of the million topics
I'm supposed to work on at the moment.
-
One of them is to use
entity linking and relation extraction,
-
to align what we generate
-
with the triples
we inputted in the first place,
-
to see if it's off, or if the network
generated information it shouldn't have
-
or cannot know about, basically.
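A very rough sketch of that alignment check, as one possible heuristic rather than the implemented system:

```python
# Rough heuristic sketch: flag entities mentioned in a generated sentence that never
# appeared in the input triples, since those are likely hallucinated.
def hallucinated_mentions(sentence, triples):
    """triples: list of (subject, property, object) label strings."""
    allowed = {part.lower() for triple in triples for part in triple}
    # Stand-in for a real entity linker: naively treat capitalized words as mentions.
    mentions = [token.strip(".,") for token in sentence.split() if token[:1].isupper()]
    return [m for m in mentions if m.lower() not in allowed]

triples = [("Marrakesh", "country", "Morocco")]
print(hallucinated_mentions("Marrakesh is a city in Algeria.", triples))
# -> ['Algeria'] gets flagged for review; a real check would use entity linking and
#    relation extraction instead of this naive string match.
```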
-
That's also all I can say about this
because now time is over.
-
(person 5) I'd love to talk offline
about this, if you have time.
-
Yeah, absolutely, let's chat about it.
-
Thank you so much,
everyone, it was lovely.
-
(moderator) Thank you, Lucie.
-
(applause)