Hi, I'm Lucie.
You know me from rambling about
not enough language data in Wikidata,
and I thought instead of rambling today,
which I'll leave to Lydia later today,
I'll just show you a bit, or give you
an insight into the projects we did
using the data that we already have
on Wikidata, for different purposes.
So, underserved languages:
compared to the keynote we just heard,
where the speaker was talking about
underserved as in minority languages,
underserved languages, to me,
are any languages
that don't have
enough representation on the web.
Yeah, just to get that clear.
So, who am I?
Why am I always talking
about languages on Wikidata?
Not sure but...
I'm a Computer Science PhD student
at the University of Southampton.
I'm a research intern
at Bloomberg in London, at the moment.
I'm a resident
at Newspeak House in London.
I am a researcher and project manager
for the Scribe project,
which I'll go into in a bit,
and I recently got into the idea
of oral knowledge and oral citation.
Kimberly is sitting right there.
And then, occasionally,
I have time to sleep
and do other things, but that's very rare.
So if you're interested
in any of those things,
come and talk to me.
Generally, this is an open presentation,
and questions in between are welcome.
I'll run through a lot of things
in a very short time now.
Come to me afterwards
if you're interested in any of them.
Speak to me. I'm here.
I'm always very happy to speak to people.
So that's a bit of what
we will talk about today.
So Wikidata, giving an introduction,
even though that's obviously
not really as necessary here.
The article placeholder,
which is aimed at Wikipedia readers;
Scribe, which is aimed
at Wikipedia editors,
and then we have one topic of my research,
which is completely outside of Wikipedia
where we use Wikidata
for question answering.
So just a quick recap.
Why is Wikidata so cool
for low-resource languages?
We have those unique identifiers --
I'm speaking to people that know that
much better than me, even.
And then we have labels
in different languages.
Those can be in over,
I think, 400 languages by now,
so we have a good option here
to reuse language
in different forms and capture it.
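(As a concrete aside, not from the talk itself: this is a minimal sketch of how those multilingual labels can be pulled for a single item through the public wbgetentities API. Q42, Douglas Adams, is used only as an example item ID.)

```python
# Minimal sketch: list every language that has a label for one Wikidata item.
import requests

def get_labels(qid: str) -> dict:
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid,
                "props": "labels", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    labels = resp.json()["entities"][qid]["labels"]
    return {lang: entry["value"] for lang, entry in labels.items()}

labels = get_labels("Q42")  # Q42 = Douglas Adams, used here only as an example
print(f"{len(labels)} languages have a label for Q42")
for lang in ("ar", "eo", "sw"):
    print(lang, "->", labels.get(lang, "(no label yet)"))
```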
Yeah, so that's a little bit of me
rambling about Wikidata
because I can't stop it.
We compared coverage on Wikidata
to the number of native speakers,
so we can see, obviously,
there are languages
that are widely spoken in the world --
there's Chinese, Hindi, or Arabic --
but then have very low coverage on Wikidata.
Then the opposite:
we have the Dutch
and the Swedish communities,
which were super active on Wikidata,
which is really cool,
and that just points out
that even though we have
a low number of speakers,
we can have a big impact if people
are very active in the communities,
which is really nice and really good.
But also, let's try to even
that graph out in the future.
So, cool. So now we have
all this language data in Wikidata.
We have low-resource Wikipedias,
so we thought, what can we do?
Well, my undergrad supervisor
is sitting here,
and we worked back then
in the golden days,
on something called
the article placeholder
which takes triples from Wikidata
and displays them on Wikipedia.
And that's pretty much
relatively straightforward.
So you just take the content of Wikidata,
display it on Wikipedia
to attract more readers
and then eventually more editors
in the different low-resource languages.
The pages are dynamically generated,
so they're not like stubs or bot articles
that then flood the Wikipedia;
people can edit them.
It's basically a starting point.
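(The actual ArticlePlaceholder is a MediaWiki extension; purely as an illustration of the data flow it relies on, a sketch like the following pulls an item's direct statements from the Wikidata SPARQL endpoint, with labels in the reader's language falling back to English, and prints them as property/value pairs.)

```python
# Illustration of the article-placeholder data flow (not the extension's real code):
# fetch an item's direct statements with labels in the reader's language.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?propertyLabel ?valueLabel WHERE {
  wd:%(qid)s ?p ?value .
  ?property wikibase:directClaim ?p .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "%(lang)s,en". }
}
LIMIT 50
"""

def placeholder_rows(qid: str, lang: str) -> list:
    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY % {"qid": qid, "lang": lang}, "format": "json"},
        headers={"User-Agent": "placeholder-sketch/0.1 (example)"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["propertyLabel"]["value"], r["valueLabel"]["value"]) for r in rows]

# Render a bare-bones placeholder, e.g. in Esperanto, for an example item.
for prop, value in placeholder_rows("Q42", "eo"):
    print(f"{prop}: {value}")
```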
And we thought,
well, we have that content,
and we have that knowledge
somewhere already, which is Wikidata.
It's often already there in those languages,
but they don't have articles,
so we can at least give readers
insight into the information.
The article placeholders are live
on 14 low-resource Wikipedias.
If you are a Wikipedia community,
if you are part of a Wikipedia community
and interested in it,
let us know.
And then I went into research,
and I got stuck with
the article placeholder, though,
so we started to look into
text generation from Wikidata
for Wikipedia and low-resource languages.
And text generation is really interesting
because research at that point,
when we started the project,
was focused almost completely on English,
which is a bit pointless in my experience
because, I mean, you have a lot of people
who write in English,
but then what we need is people
who write in those low-resource languages.
And our starting point was that
looking at triples on Wikipedia
is not exactly the nicest thing.
I mean, as much as I love
the article placeholder,
it's not exactly
what you want to see or expect
when you open a Wikipedia page.
So we try to generate text.
We use this beautiful
neural network model,
where we encode Wikidata triples.
If you're interested more
in the technical parts,
come and talk to me.
And so, realistically,
with neural text generation,
you can generate one or two sentences
before it completely scrambles
and becomes useless.
So we've generated one sentence
that describes the topic of the triple.
And so this, for example, is Arabic.
We generate the sentence about Marrakesh,
where it just describes the city.
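(A hedged sketch of the input side, not the paper's exact preprocessing: triples are commonly linearized into one token sequence for a sequence-to-sequence generator, and low-frequency words are replaced with a placeholder token, which is where the "rare" tokens mentioned later come from.)

```python
# Sketch of a common triple-linearization step for neural text generation
# (not the project's exact code): words below a frequency threshold
# become a <rare> placeholder token.
from collections import Counter

def linearize(triples, vocab_counts, min_count=5):
    tokens = []
    for subj, pred, obj in triples:
        for field in (subj, pred, obj):
            for word in field.split():
                tokens.append(word if vocab_counts[word] >= min_count else "<rare>")
        tokens.append("<sep>")
    return tokens

triples = [("Marrakesh", "instance of", "city"),
           ("Marrakesh", "country", "Morocco")]
counts = Counter({"Marrakesh": 100, "instance": 50, "of": 500,
                  "city": 80, "country": 60})  # toy frequencies
print(linearize(triples, counts))
# ['Marrakesh', 'instance', 'of', 'city', '<sep>',
#  'Marrakesh', 'country', '<rare>', '<sep>']
```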
So for that, then, we tested this--
So we did studies, obviously,
to test if our approach works
and if it makes sense to use such things.
And because we are
very application-focused,
we tested it with actual
Wikipedia readers and editors.
So, first, we tested it
with Wikipedia readers
in Arabic and Esperanto--
so use cases with Arabic and Esperanto.
And we can see that our model
can generate sentences
that are very fluent
and that feel very much--
surprisingly, a lot, actually--
like Wikipedia sentences.
So it picks up the style. We train,
for example, for Arabic,
on Arabic only, with the idea that
we want to keep
the cultural context of that language
and not let it be influenced
by other languages
that have higher coverage.
And then we did a study
with Wikipedia editors,
because in the end the article placeholder
is just a starting point
for people to start editing,
and we tried to measure
how much of the sentences
they would reuse --
how much is useful for them, basically --
and you can see
that there is a high rate of reuse,
especially in Esperanto,
when we test with editors.
And finally, we did also
qualitative interviews
with Wikipedia editors
across six languages.
I think we had
about ten people we interviewed.
And we tried to get
more of an understanding
of the human perspective
on those generated sentences.
So now we have
a very quantified way of saying,
yeah, they are good,
but we wanted to see
how the interaction goes,
and especially what happens with something
that always happens
in neural machine translation
and neural text generation:
you have those missing word tokens,
which we mark as "rare" in there.
So these are the example sentences we used.
All of them are about Marrakesh.
So we wanted to see how much
people are bothered by it,
what the quality is,
what the things are
that stand out to them,
and we can see that the mistakes
by the network, like those red tokens,
are often just ignored.
There is this interesting factor
that because we didn't tell them
where we got the sentences from --
it was on a user page of mine,
but it looked like it was on Wikipedia --
people just trusted it.
And I think that's very important
when we look into these kinds
of research directions:
we cannot override
this trust in Wikipedia.
So if we work with Wikipedians
and Wikipedia itself,
if we take things from,
for example, Wikidata,
that's good
because it's also human-curated.
But when we start
with artificial intelligence projects,
we have to be really careful
what we actually expose people to,
because they just trust
the information that we give them.
So we could see, for example,
in the Arabic version,
it gave the wrong location for Marrakesh,
and people, even the people I interviewed
that were living in Marrakesh,
didn't pick up on that,
because it's on Wikipedia,
so it should be fine, right?
(chuckles)
Yeah.
We found there was a magical threshold
for the length of the generated text,
so that's something we found,
especially in comparison
with the content translation tool,
where you have a long,
automatically generated text,
and people were complaining
that content translation was very hard
because you're just doing post-editing,
you don't have the creativity.
There are other remarks
on content translation I usually make --
I'll skip them for now.
So that one sentence was helpful,
because even if we made mistakes,
people were still willing to fix them,
because it's a very short intervention.
And then, finally,
a lot of people pointed out
that it was particularly good
for a new editor,
so for them to have a starting point,
to have those triples, to have a sentence,
so they have something to start from.
So after all those interviews were done,
I was like, that's very interesting --
what else can we do with that knowledge?
And so we started a new project,
exactly because there weren't enough yet.
And the new project we have
is called Scribe,
and Scribe focuses on new editors
that want to write a new article,
and particularly people
who haven't written
an article on Wikipedia yet,
and specifically also
on low-resource languages.
So the idea is that--
that's the pixel version of me.
All my slides are basically
references to people in this room,
which I really love.
It feels like I'm home again.
So, yeah, I want to write a new article,
but I don't know where to start
as a new editor,
and so we have this project Scribe.
A scribe was the name for someone
whose profession was writing
in ancient Egypt.
So the Scribe project's idea
is that we want to give people, basically,
a hand when they start
writing their first articles.
So give them a skeleton,
give them a skeleton that's based
on their language Wikipedia,
instead of just translating the content
from another language Wikipedia.
So the first thing we want to do
is plan section titles,
then select references for each section,
ideally in the local Wikipedia language,
and then summarize those references
to give a starting point to write.
For the project, we have
a Wikimedia Foundation project grant.
It just started,
so we are very open
to feedback, in general.
This was the very first,
not-so-beautiful layout,
but just for you to get an overview.
So there is this idea
of collecting references,
images from Commons, and section titles.
And so the main thing
we want to use Wikidata for
is the sections.
So, basically, we want to see
what articles
on similar topics
already exist in your language,
so we can understand
how the language community
decided on structuring articles.
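(A rough sketch of that section-title idea, under the assumption that "similar" means sharing an "instance of" (P31) value -- which is not necessarily how Scribe itself defines similarity: find items of the same class that already have an article in the target-language Wikipedia, then count the section headings of those articles.)

```python
# Not Scribe's actual code: find articles in a language Wikipedia about items of the
# same class (P31) as the target item, then count their most common section headings.
from collections import Counter
import requests

SPARQL = "https://query.wikidata.org/sparql"

def similar_articles(qid, lang, limit=20):
    query = """
    SELECT DISTINCT ?title WHERE {
      wd:%s wdt:P31 ?class .
      ?item wdt:P31 ?class .
      ?article schema:about ?item ;
               schema:isPartOf <https://%s.wikipedia.org/> ;
               schema:name ?title .
    } LIMIT %d
    """ % (qid, lang, limit)
    resp = requests.get(SPARQL, params={"query": query, "format": "json"},
                        headers={"User-Agent": "scribe-sketch/0.1 (example)"},
                        timeout=60)
    resp.raise_for_status()
    return [b["title"]["value"] for b in resp.json()["results"]["bindings"]]

def section_headings(title, lang):
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php",
                        params={"action": "parse", "page": title,
                                "prop": "sections", "format": "json"},
                        timeout=30)
    return [s["line"] for s in resp.json().get("parse", {}).get("sections", [])]

# Example: most common headings among Arabic articles about similar items.
headings = Counter()
for title in similar_articles("Q42", "ar"):
    headings.update(section_headings(title, "ar"))
print(headings.most_common(10))
```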
And then we look
for the images, obviously,
where Wikidata is also
a good place to go through.
And then we made
a prettier interface for it
because we decided to go mobile first.
So most of the communities
that we aim to work with
are very heavy on mobile editing.
And so we do this mobile-first focus.
And then, it also forces us
to break the process down into steps,
which eventually will lead to,
yeah, I don't know,
a step-by-step guide
on how to write a new article.
So an editor comes,
they can select section headers
based on existing articles
in their language,
write one section at a time,
switch between the sections,
and select references for each section.
Yeah, so the idea is that
we will have an easier editing experience,
especially for new editors,
to keep them in--
integrate Wikidata information
and [inaudible] images
from Wikimedia Commons as well.
If you're interested in Scribe,
I'm working together
on this project with Hady.
There are a lot of things online,
but then also just come and talk to us.
Also, if you're editing
a low-resource Wikipedia,
we're still looking
for people to interview
because we're trying to emulate --
we're trying to emulate as much as we can
what people already experience,
or how they already edit.
I'm not big on Wikipedia editing.
Also, my native language is German.
So I need a lot of input from editors
that want to tell me
what they need, what they want,
where they think this project can go.
And if you are into Wikidata,
also come and talk to me, please.
Okay, so that's all the projects
or most of the projects we did
inside the Wikimedia world.
And I want to give you a
short overview of what's happening
on my end of research,
around Wikidata as well.
So I was part of a project
that works a lot with question answering,
and I don't know too much
about question answering,
but what I do know a lot about
is knowledge graphs and multilinguality.
So, basically, what we wanted to do
is: we have a question answering system
that gets a question from a user,
and we wanted to select the knowledge graph
that can answer the question best.
And again, we focused on
multilingual question answering systems.
So if I want to ask something about Bach,
for example, in Spanish and French --
because those are the two languages
I know best --
then which knowledge graph has the data
to actually answer those questions?
So what we did was we found a method
to rank knowledge graphs
based on the metadata about the languages
that appear in the knowledge graph,
split by class.
And then we look, for each class,
into which languages are covered best,
and then, depending on the question,
we can suggest a knowledge graph.
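(A toy sketch of the ranking step, not the published method: assume we have precomputed, for each knowledge graph and each class, the share of entities with a label in each language; a question's language and target class then pick the best-covered graph. The numbers below are purely illustrative.)

```python
# Toy sketch of ranking knowledge graphs by per-class label coverage
# (illustrative numbers only, not measured values).
coverage = {
    "Wikidata": {"composer": {"es": 0.93, "fr": 0.95}},
    "DBpedia":  {"composer": {"es": 0.61, "fr": 0.72}},
}

def rank_graphs(entity_class, language):
    scores = {
        graph: classes.get(entity_class, {}).get(language, 0.0)
        for graph, classes in coverage.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# "Who influenced Bach?" asked in Spanish: prefer the graph with the best
# Spanish coverage for the composer class.
print(rank_graphs("composer", "es"))  # [('Wikidata', 0.93), ('DBpedia', 0.61)]
```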
Of the big knowledge graphs
we looked into,
the ones that are well known and widely used,
Wikidata covers the most languages
of all knowledge graphs,
and we used a test bed.
So we used a benchmark dataset
called QALD,
which was originally for DBpedia,
and we translated it
for those five knowledge graphs
into SPARQL queries.
And then we gave that to a crowd
and looked into which knowledge graph
has the best answers
for each of those SPARQL queries.
And overall, the crowd workers
preferred Wikidata's answers
because they are very precise,
they are in many of the languages
that the others don't cover,
and they are not
as repetitive or redundant
as the [inaudible].
So just to make a quick recap
on the whole topic
of Wikidata and the future and languages.
So we can say that Wikidata
is already widely used
for numerous applications in Wikipedia,
and then outside Wikipedia for research.
So what I talked about
is just the things I do research on,
but there is still so much more.
So there is machine translation
using knowledge graphs,
there is rule mining
over knowledge graphs,
there is entity linking in text.
There is so much more research
happening at the moment,
and Wikidata is getting more and more
popular to use.
So I think we are at a very good stage
to push and connect the communities.
Yeah, to get the best
from both sides, basically.
Thank you very much.
If you want to have a look
at any of those projects,
they are there,
my slides are in Commons already.
If you want to read any of the papers,
I think all of them are open access.
If you can't find any of them,
write me an email
and I'll send it to you immediately.
Thank you very much.
(applause)
(moderator) Okay,
are there any questions?
- (moderator) I'll come around.
- (person 1) Shall I come to you?
(person 1) Hi Lucie, thank you so much,
I'm so glad to see
you taking this forward.
Now I'm really curious about Scribe.
The example here, within our university,
was the idea that the person says,
"This is a university."
And then you go to Wikidata
and say, "Oh gosh!
Universities have places
and presidents, and I don't know what,"
and that you're using these as the parts
for telling the person what to do.
So, basically, the idea
is that someone says,
"I want to write about Nile University."
We look into Nile University's
Wikidata item,
and let's say -- I work a lot with Arabic --
so let's say we then go
to Arabic Wikipedia,
so we can make a grid, basically,
of all items that are around
Nile University.
So there are other universities,
there are other universities in Cairo,
or there are other universities
in Egypt, stuff like that,
or items that have similar topics.
So we can look into
all the similar items on Wikidata,
and if they already have
a Wikipedia entry in Arabic Wikipedia,
we can look at the section titles.
- (person 1) (gasps)
- Exactly, and then we can basically find
the most common way
of writing about a university
in Cairo on Arabic Wikipedia.
- Yeah, so that's the--
- (person 1) Thank you, [inaudible].
(person 2) Hi, thank you so much
for your inspiring talk.
I was wondering if this would work
for languages in Incubator?
Like, I work with really low,
low, low, low-resource languages
and this thing about doing it mobile
would be a huge thing,
because in many communities
they only have phones, not laptops.
So, would it work?
So I think, to an extent--
so the general structure, the skeleton
of the application would work.
Two things that we're thinking about
a lot at the moment
for exactly those use cases:
the first is,
if there are no articles
on a similar topic in your Wikipedia,
how much do we want
to take from other language Wikipedias.
And that's why I'm basically
doing those interviews at the moment,
because I try to understand
how much people already look
at other language Wikipedias
to make the structure of an article.
Are they generally equal
or do they differ a lot
based on cultural context?
So that would be something to consider,
but there is a possibility to say,
we take everything
from all the language Wikipedias
and then make an average, basically.
And the other problem is referencing.
So that's something we found.
We have it very convenient
because we work a lot with Arabic,
and Arabic actually has the problem
that there are a lot of references,
but they are very little used,
or not widely used, on Wikipedia.
That's not true, obviously,
for all languages,
and that's something
I'd be very interested in --
like, let's talk.
That's what I'm trying to say:
I'd be very interested
in your perspective on it,
because I'd like to know, yeah,
what you think about referencing
done from English or any other language.
(person 2) Have you ever tried--
what we do is we normally
reference interviews we have.
We put them in our repository,
an institutional repository,
because these languages
don't have written references,
and I feel like
that is the way to go, but--
I'm currently also--
Kimberly and I are discussing a lot.
We did a session at Wikimania
on oral knowledge and oral citations.
Yeah, we should hang out
and have a long conversation.
(laughs)
(person 3) So [Michael Davignon],
we'll talk about medium size,
which is probably around ten people,
so it's medium for the Breton Wikipedia.
And I'm wondering if we can use Scribe
the other way around:
to find a common plan
for an existing article,
to find [the outline]
that's supposed to be the best plan,
but I'm not aware of more or less
[inaudible]
improving an existing article.
I think there's--
I forgot the name, I think,
[Diego] in the Wikimedia Foundation
research team,
who's working a lot at the moment
with section headings.
But, yes, generally, the idea is the same.
So instead of using them
to make an average,
you could say,
this article is not like the average.
That's very possible, yeah.
(person 4) Hi, Lucie. I'm Érica Azzellini
from Wiki Movement, Brazil,
and I'm very--
(Érica) Oh, can you hear me?
So, I'm Érica Azzellini
from Wiki Movement Brazil,
and I'm really impressed with your work
because it's really in sync
with what we've been working on in Brazil
with the Mbabel tool.
I don't know if you heard about it?
- Not yet.
- (Érica) It's a tool that we use
to automatically
generate Wikipedia entries
using Wikidata information
in a simple way
that can be replicated
on other Wikipedia languages.
So we've been working
on Portuguese mainly,
and we're trying to get it
onto English Wikipedia too,
but it can be replicated
in any language, basically,
and I think then we could talk about it.
Absolutely, it would be super interesting,
because the article placeholder
is an extension already,
so it might be worth
integrating your efforts
into the existing extension.
Lydia is also fully for it,
and... (laughs)
And then because--
so one of the problems--
[Marius] correct me if I'm wrong--
we had was that
article placeholder doesn't scale
as well as it should.
So article placeholder
is not in Portuguese
because we're always afraid
it will break everything, correct?
And then [Marius] is just taking a pause.
- (Érica) Yeah, you should be careful.
- Don't want to say anything about this.
But, yeah, we should connect
because I'd be super interested to see
how you solve those issues
and how it works for you.
(Érica) I'm going to present,
in the second session
of the lightning talks, about this project
that we've been developing,
and we've been using it
in [GLAM-Wiki] initiatives
and education projects already.
- Perfect.
- (Érica) So let's do that.
Yeah, absolutely let's chat.
(moderator) Cool.
Some other questions on your projects?
(person 5) Hi, my name is [Alan],
and I think this is extremely cool.
I had a few questions about
generating Wiki sentences
from neural networks.
- Yeah.
- (person 5) So I've come across
another project
that was attempting to do this,
and it was essentially using
[triples input and sentences output],
and it was able
to generate very fluent sentences.
But sometimes they weren't...
actually, they weren't correct,
with regards to the triple.
And I was curious if you had any ways
of doing validity checks of this sort.
Sometimes the triple
is "subject, predicate, object,"
but the language model says,
"Okay, this object is very rare,
I'm going to say you are born in San Jose,
instead of San Francisco or vice versa."
And I was curious
if you had come across this?
So that's what we call hallucinations --
the idea that
there's something in a sentence
that wasn't in the original triples
and the data.
What we do --
so we don't do anything about it,
we just also realized
that that's happening.
It happens even more
for the low-resource languages,
because we work across domains,
so we are generating domain-independently.
Traditional NLG work
is usually in the biography domain.
So that happens a lot,
because we just have little training data
in the low-resource languages.
We have a few ideas.
It's one of the million topics
I'm supposed to work on at the moment.
One of them is to use
entity linking and relation extraction
to align what we generate
with the triples
we inputted in the first place,
to see if it's off, or if the network
generates information it shouldn't have
or cannot know about, basically.
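(A minimal sketch of that alignment idea; the entity "linker" here is a naive gazetteer lookup standing in for a real entity linking and relation extraction step.)

```python
# Sketch of the alignment check: flag entities in a generated sentence that never
# appeared in the input triples. The gazetteer lookup is a stand-in for a real linker.
GAZETTEER = {"Marrakesh", "Morocco", "San Jose", "San Francisco"}

def link_entities(sentence, gazetteer=GAZETTEER):
    # Placeholder for a real entity linker: naive substring match.
    return {name for name in gazetteer if name in sentence}

def hallucinated(sentence, triples):
    allowed = {part for triple in triples for part in triple}
    return link_entities(sentence) - allowed

triples = [("Marrakesh", "country", "Morocco")]
print(hallucinated("Marrakesh is a city in Morocco.", triples))   # set()
print(hallucinated("Marrakesh is a city in San Jose.", triples))  # {'San Jose'}
```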
That's also all I can say about this,
because time is now over.
(person 5) I'd love to talk offline
about this, if you have time.
Yeah, absolutely, let's chat about it.
Thank you so much,
everyone, it was lovely.
(moderator) Thank you, Lucie.
(applause)