[inaudible] and I
have an effort called WikiLoop,
and this is what I'm going
to introduce to you.
We have presented the idea of WikiLoop
at several Wikimedia-related conferences.
How many of you have heard
about WikiLoop before?
Thanks.
And how many of you have interacted
with the datasets and tools
that we have provided before?
Okay, fairly new.
So this will be mostly an introduction.
So we would like to tell you
why we started this initiative,
what it intends to do,
how you can get involved,
and where it is headed.
So, to begin with,
we would like to give you an example.
This is an example of vandalism
that happened
on the Italian Wikipedia.
I know that most people here
are interested in Wikidata.
I will tell you why this is relevant too.
So basically what we found is
that someone vandalized
the Italian Wikipedia
to say, "Bezos who cannot afford a car."
And this raises
an interesting question.
To a human, this is
blatant, obvious vandalism,
but for the machines and algorithms
that try to detect vandalism
and avoid serving users bad information,
how can a computer understand
this kind of statement?
We realized that sometimes
there are limitations
to how far algorithms
and machines can go.
Here is another example.
Let's say there is a word, a label,
or a category on Wikipedia
that says a person
is a "Christian Scientist."
Now, given this label,
what facts would you infer
from this category?
Do you think the person is a "Christian,"
or do you think they are a "scientist"?
In this specific case--
it does not apply everywhere--
but in this specific case,
there is a religion
called "Christian Science,"
and people who hold that belief
are called "Christian Scientists."
And, again, for machines,
how can we know?
Many people here are big [fans]
of the idea that the more
machine-friendly we make
our data and knowledge,
the easier we can work to improve
overall knowledge accessibility
and contribute together.
But there are always things
where we believe
machines have restrictions.
So all in all, we started to realize,
coming from Internet companies
with a strong belief
in our technology
and in what machines can do,
that there is always a gap:
there is always something
for which we need to rely on human beings,
and more than that, we need
to rely on communities
who are actively contributing,
doing peer review,
and collaborating with each other.
So this is a picture
of the background effort of WikiLoop.
Human beings have the knowledge,
we have our domain expertise,
and we can cross-check each other,
but we just don't have enough time.
And there are many things
where machines can empower this,
but they have restrictions too.
So the goal is to empower editors,
to improve the productivity
of human editors.
But the other side of the formula
is that we want to loop that back
to the research and academic efforts
that improve how machines
can help in these cases.
So by raise of hand,
how many of you have used Google?
Thank you.
And how many of you
think that companies like Google
and other big knowledge companies
should contribute more
to the knowledge world?
So what happens is that
companies like Google
have a strong background
of leveraging the open knowledge world.
In Google's specific case,
the mission is to organize
the world's information.
So we help disseminate information,
which in one sense supports
the mission of this movement.
But only once in a while
do we provide sporadic help,
donating knowledge,
datasets, and tools,
and we want to see
if we can make this sustainable,
both in the technical sense
and in the business sense.
So this is
a one-sentence introduction.
We want WikiLoop
to become an umbrella program
for a series of technical projects
intended to contribute
datasets and tooling,
and hopefully a community effort
with the participation of
other like-minded people,
partners, and institutions
joining the effort.
There are several projects
that we think would be a good fit,
and these are the criteria.
First of all, the project
needs to be about source improvement;
source improvement,
by and large, is a good fit.
The second thing,
which companies like us
really cannot do well by ourselves,
is to maximize neutrality,
to avoid picking sides
in controversies,
decisions, or discussions.
And the third thing
is long-term sustainability,
keeping the effort
supported by the industry.
We want to see productivity
and scalability
in our contributions and efforts.
To explain a little bit more:
for example, we are trying
to extract facts from Wikipedia.
And while we can do
the segmentation and labeling
fairly well,
beyond a certain point
the bottleneck is no longer
how good the machine
or the algorithm is;
sometimes there is
noise in the source,
and if we do not remove
or minimize that source noise,
that is as far as the machine can go.
So that's the first criterion.
And the second criterion is
that we don't want to be seen as biased,
or to introduce potential biases.
We want to rely on
governance that is peer reviewed
and that is done by the community,
so that we can avoid picking sides
on controversial questions.
And the third thing
is probably not so intuitive,
so let me give you an example
of the projects we have in mind.
Let's say there is a smaller,
minority language.
I heard a very good talk
about this earlier this morning.
But one idea we have here is,
let's say you are a minority-language
contributor, very active,
and you want to advocate for your culture
and support your knowledge creation.
But companies like Google,
or other consumer companies,
have a quality bar
for releasing a translation
and making it available.
They want the precision to be high enough
that they can use it to serve users.
Internally, they may have
experimental AI models
that don't meet that bar
because of a lack of training data,
so the translation is not made available.
But the community is doing
the translation by hand anyway.
Now, one of the things
we are thinking of is:
what if we could provide
some of these experimental models
that are not good enough
to serve general users
but are still good for the community
and somewhat improve its productivity?
That would, one,
improve the speed at which
a community can contribute,
and two, what the community
is creating anyway
can come back as training data
that keeps bootstrapping the machines.
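As a rough sketch of that loop, with every name here illustrative rather than a real WikiLoop or Google API:

```python
# Illustrative sketch of the bootstrapping loop described above.
# All function names are placeholders, not a real WikiLoop or Google API.
def bootstrap_round(sentences, translate, human_postedit, training_set):
    """One round: machine drafts, humans fix, fixes become training data."""
    for src in sentences:
        draft = translate(src)             # experimental, below-launch-bar MT
        final = human_postedit(draft)      # the community fixes it anyway
        training_set.append((src, final))  # loops back as training data
    return training_set
```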
So over time, through this effort,
we hope to generate a model
that both helps
the human beings, the editors,
and also helps the research
that improves AI and other approaches.
And this is a big overview
of a few projects
we are going to introduce.
Due to time limitations
I will feature only a few.
The WikiLoop Game, which you can look up,
leverages a platform
created by Magnus called Wikidata Game.
We provide several datasets there
to be played through
and committed to Wikidata,
but only after human review.
So Google doesn't get
to contribute data directly
to Wikipedia or Wikidata;
instead, non-biased individuals
review it and do so.
And the second one I'm going to feature
is WikiLoop Battlefield,
the counter-vandalism platform
that you have seen just now.
This one meets the same criteria:
source improvement,
empowering machines
by looping reviews back
into the training data,
and keeping companies like us
from picking sides
by relying
on the community's assessment.
And the third one is CitePool,
where we're trying to help create
a pool of citation candidates
to improve the productivity of people
who want to add citations,
and also to see if we can make that
into training data
accessible to researchers.
So let me use WikiLoop Battlefield
as an example.
You can try it on your phone--
battlefield.wikiloop.org.
By the way, I want to highlight,
the name is subject to change
because some friendly community members
have come to me and suggested
that Battlefield might not be
the best name for a project
serving the Wikimedia movement.
So if you don't like this name,
come join us in the discussion
and provide your suggestion;
we will be very happy
to converge on a name
that has community consensus
and popularity.
But let's use this one
as a placeholder here.
I don't need to introduce
the typical vandalism workflow
to this group of people,
but if you have ever tried to conduct
some counter-vandalism activity,
you might know that it's not very trivial.
How many of you have seen vandalism
on Wikipedia and Wikidata?
Okay, how many of you
have reverted, by hand, some of them?
How many of you have used certain tools,
or gone out and found certain tools,
to patrol or revert vandalism?
Okay.
Cool, this is
the highest density of people
who have tried to revert vandalism
that I have ever spoken to.
So maybe some of you have been
very comfortable doing that,
but for me, as someone
who started editing actively
only about three years ago,
and who only got very serious
about vandalism detection and patrolling
since about last year,
I found that doing so is not super easy
in the world of the Wikimedia movement.
If we look at the existing alternatives,
there are tools that are built
for desktops,
and there are tools that rely
on users who have rollback permissions,
which is itself a big barrier to get.
We want to make this
a super-easy-to-use platform
for all three roles.
The first one is the user, reviewer,
or editor, whatever you call it.
The second one is the researcher
who is trying to create
vandalism-detection algorithms or systems.
And the third one is the developer
who is trying to improve
the WikiLoop Battlefield tooling itself.
We want it to be
super easy for users.
You can pull up your phone,
you don't have to install anything,
and you can do it on your laptop.
And we also want
to lower the barrier to review.
The reason why other tools
limit access
is that there needs to be
a base trust level for people to use them.
You don't want someone
to come to a counter-vandalism tool
and use it to vandalize.
So what we are trying to do,
to begin with, is
to make it super easy,
but also to allow multiple people
to label the same thing.
We also want to make it super convenient
to see the [inaudible],
to see others' labels, all in real time.
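As a minimal sketch of how several labels on the same revision could be combined -- the majority rule here is my illustration, not the tool's actual logic:

```python
from collections import Counter

def verdict(labels):
    """Aggregate several reviewers' labels on one revision by simple majority."""
    counts = Counter(labels)
    winner, n = counts.most_common(1)[0]
    # Require a strict majority; otherwise leave the revision undecided.
    return winner if n > len(labels) / 2 else "undecided"

print(verdict(["vandalism", "ok", "vandalism"]))  # -> vandalism
```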
We also want to make it
super easy for researchers to use.
With one click you can download the labels
and maybe start playing with the data
to see how they fit in your model.
And we provide APIs
with access to real-time data.
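As a hedged sketch, assuming a hypothetical JSON endpoint -- the actual API paths and field names are not given in this talk -- a researcher's download might look like:

```python
import requests

# Hypothetical endpoint and fields; the real WikiLoop Battlefield API
# is not spelled out in this talk, so treat these as placeholders.
API_URL = "https://battlefield.wikiloop.org/api/labels"

response = requests.get(API_URL, params={"limit": 100})
response.raise_for_status()

for record in response.json():
    # Each record might pair a revision ID with a reviewer's judgement.
    print(record.get("revisionId"), record.get("judgement"))
```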
And for developers,
we make it very easy to pick up--
with one click
you can deploy your trial instances,
things like that.
This is an example
of building projects
under an umbrella like WikiLoop.
We want to make sure
that community trust comes first.
We believe open source
is usually best.
We want to avoid proprietary tech,
we want to avoid tech lock-in,
and we rely on community approval
for certain features.
And as you can see here--
these are the components we rely on--
it's still a very early stage, but you get
the principles behind the design.
So what's next? We are trying
to grow our usage.
Hopefully you can try it out yourself--
and promise me
that you don't click on the login.
There is a login button--
there will be some good features
that make it super easy
to even revert something.
Currently it's still
a jump away to revert.
But we are building features,
and we are also trying
to let you choose the categories
or the watchlist
that you will be watching,
the ones you care about patrolling.
And also, if you are a researcher
doing related vandalism detection work,
try our data and give us feedback.
And I will go quickly through
a few other projects
that we are featuring here,
and we will look for questions
and feedback from you
about what you think should be there,
or how we should fix things
if they don't work right.
Wikidata Game is a platform
built by a community member, Magnus,
a celebrity in this community, I think.
And by showing this,
we are providing datasets,
but we also want to let people know
that we are not reinventing the wheel.
When we come up with an idea,
we look into it with the community,
see if there are
existing tools out there,
and ask how we can be
a part of the ecosystem
rather than building everything
independently and separately.
And this is the current status.
Early results show that
a few games that we released
have triggered and improved activity
on the related entities,
with a few follow-ups.
One thing that we have come up with,
as I have talked
to a few community members,
is the PreCheck idea:
basically a sampled preliminary check
of bulk uploads by community members,
used to generate a report
that makes it easier to discuss
whether a big block
of Wikidata statements
should be included
or uploaded to wikidata.org,
or should be rechecked or fixed first.
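A minimal sketch of that idea, with all names hypothetical: sample a bulk upload, collect reviewer verdicts, and summarize them into a report:

```python
import random

def precheck_report(statements, review, sample_size=50):
    """Sample a bulk upload and summarize reviewer verdicts.

    `statements` is the full candidate dataset; `review` stands in for
    a human community member returning True when an item looks correct.
    """
    sample = random.sample(statements, min(sample_size, len(statements)))
    if not sample:
        return {"sampled": 0, "approved": 0, "approval_rate": None}
    approved = sum(1 for s in sample if review(s))
    return {
        "sampled": len(sample),
        "approved": approved,
        "approval_rate": approved / len(sample),
    }
```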
And there is another project,
mostly a dataset project,
called CatFacts.
CatFacts is a set of datasets we generate
of facts derived from categories;
the "Christian Scientist" example
that you saw just now
is actually an interesting outlier
among the data points
from this effort.
The goal is to generate
facts from categories,
which we think are a very rich
source of facts online
that has been under-leveraged.
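To make the "Christian Scientist" pitfall concrete, here is a toy illustration -- not the actual CatFacts pipeline -- of why naive category-to-fact extraction needs a quality check:

```python
# Toy illustration only, not the actual CatFacts pipeline.
def naive_facts(person, category):
    """Naively split a category like 'English mathematicians' into two facts."""
    attribute, occupation = category.rsplit(" ", 1)
    return [(person, "attribute", attribute),
            (person, "occupation", occupation)]

print(naive_facts("Ada Lovelace", "English mathematicians"))  # reasonable
# Wrong: "Christian Scientists" names a religion as a whole phrase,
# so neither "Christian" nor "Scientists" is a safe inference on its own.
print(naive_facts("Some Person", "Christian Scientists"))
```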
But before it can be fully leveraged,
we need to make sure
the quality is good enough as well.
There are efforts
to put it onto the Wikidata Game,
and we're thinking
that building PreCheck
might help as well.
And it's still at an early stage.
Feel free to come talk to us
about other efforts,
or other ideas you have
about datasets we could provide.
The Bot is a communication tool.
We know that bots can do many things,
like writing Wikipedia articles,
but we promise
that we don't write actual articles;
we mostly use it
as a way to communicate,
on, let's say, user talk pages,
to give us access
to large-scale conversations
with community members.
Explorer is going to show
all our datasets
and tools, their stats,
and queries you can run on them.
Stay tuned, this one is releasing soon.
And we have several other ideas,
but I will jump
to this overall portfolio.
It will be several projects,
beginning with datasets and tooling.
What we are doing currently
is Explorer, Battlefield,
CatFacts, and PageRank,
and there are some other upcoming ideas
like PreCheck, CitePool, and Bubbles.
And this is one of the diagrams
that I want to show you.
Not only do we want to use
each individual project
to contribute to the community
and to generate training data
for research and academia,
we also have the idea
that these projects may work together.
For example, CitePool,
the system that we want to build
to let people more easily find citations
for Wikipedia articles or Wikidata,
would also use Explorer
to display the results.
It would depend on the PageRank
scores of the datasets
to determine how to rank
the citation pages we recommend,
use PreCheck
to do quality and sanity checks,
and maybe create
bulk batch reports through the Bot,
and PreCheck will depend
on the Game as well.
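A hedged sketch of that interplay, with every name illustrative rather than a real WikiLoop API: rank citation candidates by a PageRank-style score of their source, then keep only those that pass a PreCheck-style quality gate:

```python
# Illustrative sketch of the described interplay; none of these names
# correspond to a real WikiLoop API.
def recommend_citations(candidates, pagerank_scores, precheck_ok):
    """Rank citation candidates by source score, then filter by PreCheck."""
    ranked = sorted(
        candidates,
        key=lambda c: pagerank_scores.get(c["source"], 0.0),
        reverse=True,
    )
    return [c for c in ranked if precheck_ok(c)]
```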
For those of our community friends
who have been following
the progress of WikiLoop:
we have been through an ice-breaking phase,
trying to earn the community's trust,
because we know how cautious
we need to be
when coming to contribute to a movement
that relies so much
on neutrality and non-bias policies.
And we have gradually started
to form ideas about tools and data,
and to find the direction
of how we can possibly
make this sustainable.
And starting from next quarter,
we are looking into creating
long-term sustainability,
both internally and externally:
internally, in terms of getting resources
and getting support,
and externally, in terms
of getting engagement,
getting usage, and getting contributors.
I want to quote Evan You,
the creator of the popular
frontend framework Vue.js:
"Software development
gets tremendously harder
when you start to have to convince people
instead of just writing the code."
This applies to editing
Wikipedia or Wikidata too.
It's very easy to click a button
and add individual articles,
but it's very hard
when you need to convince people.
I hope to leave some time for questions,
although we only have a few minutes,
probably one or two.
Yes, so we have about two minutes.
So if people want to shout questions out,
I'll bring the mic over.
Hands up maybe.
(person 1) So where would I go
at this moment if I would like to use this
to solve some of the problems
with chemicals,
where some Wikipedia pages
about chemicals
have a chembox
about a specific chemical
but are otherwise about
a class of chemicals?
Is that something
where WikiLoop could help?
I think that's the individual
domain expertise part, right?
You are talking
about articles
that are associated with specific topics.
We might be able to help,
but currently we are trying
to tackle problems
that are more general.
And overall, the goal is
to find possibilities for
empowering human beings' productivity,
and also to generate
the training data that potentially
helps the algorithms.
(person 2) I think we have time
for a very quick one.
(person 3) Are you also going to do this
for search of data on Commons?
Yeah, we hope to.
If you are referring to Battlefield
or the counter-vandalism tools,
yes, we are planning
to expand them to other Wiki projects,
including Commons and Wikidata.
(person 2) I think that's all the questions
we have time for
but if you'd like to show
your appreciation for [Victor].
Thank you.
(applause)