36C3 preroll music
Daniel: Good morning! I'm glad you all
made it here this early on the last day. I
know it can't be easy; it wasn't easy for
me. I have to warn you that the way I
prepared for this talk is a bit
experimental: I didn't make a slide set, I
just made a mind map and I'll click
through it while I talk to you. So,
this talk is about modernizing Wikipedia.
As you probably have noticed, visiting
Wikipedia can feel a bit like visiting a
website from 10 to 15 years ago. But before
I talk about any problems or things to
improve, I first want to point out that the
software and the infrastructure we built
around it have been running Wikipedia
and its sister sites for the last... well,
nearly 19 years now, and it's extremely
successful. We serve 17 billion page
views a month. Yes?
Person in the audience: Could you make it
louder or speak up and also make the image
bigger?
inaudible dialogue
Daniel: Is this better? If I speak up
I will lose my voice in 10 minutes, it's
already going. No, it's fine. We have
technology for this. I can... the light
doesn't help; yeah, the contrast could be
better. Is it better like this? Okay, cool.
All right so yeah we are serving 17
billion page views a month, which is quite
a lot. Wikipedia exists in about 100
languages. If you attended the talk about
the Wikimedia infrastructure yesterday, we
talked about 300 languages. We actually
support 300 languages for localization but
we have Wikipedia in about 100, if I'm not
completely off. I find this picture quite
fascinating. This is a visualization of
all the places in the world that are
described on Wikipedia and sister projects
and I find this quite impressive although
it's also a nice display of cultural bias
of course. We, that is the Wikimedia
Foundation, run about 900 to 1,000 wikis,
depending on how you count, but there are
many, many more MediaWiki installations
out there, some of them big and many, many
of them small. We actually have no idea
how many small instances there are. So
it's a very powerful, very flexible and
versatile piece of software. But, you know,
sometimes it can feel like... you can do a
lot of things with it, right, but
sometimes it feels like it's a bit
overburdened and maybe we should look at
improving the foundations. So one of the
things that make MediaWiki great but also
sometimes hard to use is that, kind of,
everything is text, everything is markup,
everything is done with wikitext,
which has grown in complexity over the
years. So if you look at the anatomy of a
wiki page, it can be a bit daunting. You
have different syntax for markup,
different kinds of transclusion for
templates and media, and some things
actually, you know, get displayed in
place while some show up in a completely
different place on the page. It can be
rather confusing and daunting for
newcomers. And then things like having a
conversation, just talking to people, you
know: a conversation thread
looks like this. You open the page, you
look through the markup, you indent to
make a conversation thread, and then you
get confused about the indenting and
someone messes with the formatting and
it's all excellent. There have been many
attempts over the years to improve the
situation. We have things like Echo, which
notifies you, for instance, when someone
mentions your name. It is also used to
welcome people and to do these kinds of
achievement-unlocked notifications: hey,
you did your first edit, this is great,
welcome! This makes people a bit more
engaged with the system, but it's really
mostly improvements around the fringes.
We have had a system called Flow for a
while to improve the way conversations
work, so that you have more of a thread
structure that the software actually knows
about. But then there are quite a few
people who have been around for a while
and are very used to the manual system,
and there are also a lot of tools
supporting this manual system, which of
course are incompatible with making things
more modern. So we use this, for instance,
on MediaWiki.org, which is basically the
self-documentation site of MediaWiki, but
on most Wikipedias this is not enabled, or
at least not used by default everywhere. The
biggest attempt to move away from the
text-only approach is Wikidata, which we
started in 2012. The idea of Wikidata of
course, if you didn't attend the many
great talks we had about it here over the
course of the Congress, is a way to
basically model the world using structured
data, using a semantic approach instead of
natural language. That has its own
complexities, but at least it's a way to
represent the knowledge of the world in a
way that machines can understand. So this
would be an alternative to wikitext, but
still the vast majority of things,
especially on Wikipedia, are just markup.
And this markup is pretty powerful and
there's lots of ways to extend it and to
do things with it. So a lot of things on
MediaWiki are just DIY, do-it-yourself.
Templates are a great example of this.
Infoboxes, of course, the nice blue
boxes here on the right side of pages, are
done using templates, but these templates
are just for formatting: there is no data
processing, no database or structured data
backing them. It's still basically, you
know, just markup. You have a predefined
layout, but you're still feeding it text,
not data. You have parameters, but the
values of the parameters may again be
templates or links, or have markup in
them, like, you know, HTML line breaks and
stuff. So it's kind of semi-structured.
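To illustrate the semi-structured nature of template parameters, here is a deliberately naive toy parser (a Python sketch; the template text is invented and this is nothing like MediaWiki's actual parser):

```python
# Toy illustration of why infobox template parameters are only
# semi-structured: the "values" are still wikitext, which may contain
# links, nested markup, or HTML. Deliberately naive; MediaWiki's real
# parser is far more involved.

def parse_template_params(wikitext):
    """Split {{Name|key=value|...}} into a dict (naively)."""
    body = wikitext.strip().removeprefix("{{").removesuffix("}}")
    parts = body.split("|")
    params = {}
    for part in parts[1:]:  # parts[0] is the template name
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return params

infobox = "{{Infobox writer|name=[[Ada Lovelace]]|born=1815<br/>London}}"
print(parse_template_params(infobox))
# The extracted "values" are still markup (a wiki link, an HTML line
# break), so a machine cannot treat them as data without further
# interpretation.
```

Even this tiny example shows the problem: the layout is predefined, but every parameter value is free-form markup rather than typed data.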
And this of course is also used to do
things like workflows. The template... oh
no, this was actually an infobox, wrong
picture, wrong caption. This is also used
to do workflows: if a page on Wikipedia
gets nominated for deletion, you manually
put a template on the page that states
why it is supposed to be deleted, and
then you have to go to a different page
and put a different template there, giving
more explanation, and this again is used
for discussion. It's a lot of structure
created by the community and maintained by
the community, using conventions and tools
built on top of what is essentially just a
pile of markup. And because doing all this
manually is kind of painful, early on we
created a system that allows people to add
JavaScript to the site, which is then
maintained on wiki pages by the community
and can tweak and automate things. But again,
it doesn't really have much to work with,
right? It basically messes with whatever
it can; it directly interacts with the DOM
of the page, and whenever the layout of the
software changes, things break. So this is
not great for compatibility, but it's
used a lot and it is very important for
the community to have this power. Sorry, I
wish there was a better way to show these
pictures. Okay, that's just to give you an
idea of what kind of thing is implemented
that way and maintained by the community
on their site. One of the problems we have
with that is: these are bound to a wiki,
and I just told you that we run over 900
of these, not over 9,000, and it would be
great if you could just share them between
wikis, but we can't. And again, there have
been... we have been talking about this a
lot, and it seems like it shouldn't be so
hard, but you kind of need to write these
tools differently if you want to share
them across sites, because different sites
use different conventions, they use
different templates. Then it just doesn't
work, and you actually have to write decent
software that uses internationalization if
you want to use it across wikis. While
these are usually just, you know, one-off
hacks with everything hard-coded, we would
have to put an internationalization system
in place, and that's actually a lot of
effort, and there are a lot of things that
are actually unclear about it. So, before
I dive more deeply into the
different things that make it hard to
improve on the current situation and the
things that we are doing to improve it:
do we have any questions, or do you have
any things you find particularly, well,
annoying or particularly outdated when
interacting with Wikipedia? Any thoughts
on that? Beyond what I just said?
Microphone: The strict separation, just in
Wikipedia, between mobile layout and
desktop layout.
Daniel: Yeah. So, actually having a
responsive layout system that would just
work for mobile and desktop in the same
way, allowing the designers and UX
experts who work on the system to do
this once and not two or maybe even three
times - because of course we also have
native applications for different
platforms - would be great, and it's
something that we're looking into at the
moment. But it's not, you know, it's not
that easy. We could build a completely new
system that does this, but then you
would be telling people: "You can no
longer use the old system", and now they
have built all these tools that rely on
how the old system works, and you have to
port all of this over, so there's a lot of
inertia. Any other thoughts? Everyone is
still asleep, that's excellent. So I can
continue. So, another thing that makes it
difficult to change how MediaWiki works or
to improve it is that we are trying to be
at least two things at once: on the one
hand we are running a top-5 website,
serving over 100,000 requests per second
with the system, and on the other hand, at
least until now, we have always made sure
that you can just download MediaWiki and
install it on a shared hosting platform.
You don't even need root on the system,
right? You don't even need administrative
privileges; you can just set it up and run
it in your web space and it will work.
And having the
same piece of software do both, run in a
minimal environment and run at scale, is
rather difficult, and it also means that
there are a lot of things that we can't
easily do, right? All this modern
microservice architecture, separate
front-end and back-end systems, all of
that means that it's a lot more
complicated to set up and needs more
knowledge or more infrastructure, and so
far that meant we couldn't do it, because
so far there was this requirement that you
should really be able to just run it on
your shared hosting. And we are currently
considering to what extent we can continue
this. I mean, container-based hosting is
picking up. Maybe this is an alternative,
it's still unclear, but it seems like this
is something that we need to reconsider.
Yeah, but if we make this harder to do,
then a lot of current users of MediaWiki
would maybe, well, no longer exist, or at
least would not exist as they do now,
right? You probably have seen
this nice MediaWiki instance, the Congress
wiki, which, with a completely customized
skin and a lot of extensions installed,
allows people to define their sessions
there and makes sure these sessions
automatically get listed and put into a
calendar. This is all done using
extensions, like Semantic MediaWiki, that
allow you to basically define queries in
the wikitext markup. Yeah, another thing
that, of course, slows down development is
that Wikimedia does engineering on, well,
comparatively, a shoestring budget, right?
The annual budget of the Wikimedia
Foundation is something like a hundred
million dollars. That sounds like a lot of
money, but if you compare it to other
companies running a top-five or top-ten
website, it's like two percent of their
budget or something like that, right? I
mean, 100 million is not peanuts, but
compared to what other companies invest to
achieve this kind of goal, it kind of is.
So, what this budget translates into is,
depending on how you count, between three
hundred and four hundred staff. These are
the people who run all of this, including
all the community outreach, all the social
aspects, all the administrative aspects.
Less than half of these are the engineers
who do all this.
And we have something like 2,500
bare-metal servers, which is not a lot for
this kind of thing. That also means we
have to design the software to be not just
scalable but also quite efficient. The
modern approach to scaling is usually to
scale horizontally: make it so you can
just spin up another virtual machine in
some cloud service. But, yeah, we run our
own servers, so we can design to scale
horizontally, but it means ordering
hardware and setting it up, and that's
going to take half a year or so. And we
don't actually have that many people who
do this. So scalability and performance
are also important factors when designing
the software. Okay. Before I dive into what we
are actually doing - any questions? This
one in the back. Wait for the mic, please.
In the very...
Q: Hi!
Daniel: Hello.
Q: So, you said you don't have that many
people, but how many do you actually have?
Daniel: For... it's something like 150
engineers worldwide. It always depends on
what you count, right? Do you count
engineers who work on the native apps, do
you count engineers who work on the
Wikimedia cloud services? Actually, we do
have cloud services, we offer them to the
community to run their own things, but we
don't run our stuff on other people's
cloud. Yeah, so depending on how you
count, and whether you count the people
working here in Germany for Wikimedia
Germany, which is technically a separate
organization, it's something like 150
engineers.
Q: Thanks!
Q: I'm interested: what are the reasons
that you don't run on other people's
services, like on the cloud? I mean, then
it would be easy to scale horizontally,
right?
Daniel: Well, one reason is being
independent, right? Imagine we ran all our
stuff on Amazon's infrastructure, and then
maybe Amazon doesn't like the way the
Wikipedia article about Amazon is written
- what do we do, right? Maybe they shut us
down, maybe they make things very
expensive, maybe they make things very
painful for us, maybe there is at least
some kind of self-censorship mechanism
happening, and we want to avoid that.
There are thoughts about this, thoughts
like: maybe we can do this at least for
development infrastructure and CI, not for
production; or maybe we can run our stuff
on cloud services from more than one
vendor, so we basically spread out and are
not reliant on a single company. We are
thinking about these things, but so far
the way to actually stay independent has
been to run our own servers.
Q: You've been talking about scalability
and changing the architecture. That kind
of seems to imply to me that there's a
problem with scaling at the moment, or
that it's foreseeable that things are not
gonna work out if you just keep doing what
you're doing. Can you maybe elaborate on
that?
Daniel: So, I think there are two sides to
this. On the one hand, the reason I
mentioned it is just that a lot of things
that are really easy to do basically for
me, right, "works on my machine", are
really hard to do if you want to do them
at scale. That's one aspect. The other
aspect is that MediaWiki is pretty much a
PHP monolith, and that means scaling it
always means copying the whole monolith.
Breaking it down, so you have smaller
units that you can scale and can just say,
I don't know, "I need more instances for
authentication handling" or something like
that, would be more efficient, right?
Because you have higher granularity, you
can scale just the things that you
actually need, but that of course needs
rearchitecting. It's not like things are
going to explode if we don't do that very
soon, so there's not an urgent problem
there. The reason for us to rearchitect is
more to gain flexibility in development,
because if you have a monolith that is
pretty entangled, code changes are risky
and take a long time.
Q: How many people work on product design
or, like, user experience research, to,
like, sit down with users and try to
understand what their needs are and
proceed from there?
Daniel: Across... I don't have an exact
number, something like five.
Audience: Do you think that's sufficient?
Herald: The question was whether it's
sufficient.
Daniel: Probably not? But that's more
people than we have for database
administration, and that's also not
sufficient.
Herald: Are there further questions? I
don't think so.
Daniel: Okay. So, one of the things that
holds us back a bit is that there are
literally thousands of extensions for
MediaWiki, and the extension mechanism is
heavily reliant on hooks, so basically on
callbacks. And we have - I don't have a
picture, I have a link here - a great
number of these. So, you see, each
paragraph is basically documenting one
callback that you can use to modify the
behavior of the software, and, I mean, I
have never counted, but something like a
thousand? And all of them are of course
interfaces to software that is maintained
externally, so they have to be kept
stable, and if you have a large chunk of
software that you want to restructure but
you have a thousand fixed points that you
can't change, things become rather
difficult.
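The hook mechanism being described can be sketched roughly like this (a simplified Python illustration, not MediaWiki's actual PHP implementation; the hook name and the data passed to it are invented):

```python
# Minimal sketch of a hook/callback extension mechanism. Whatever the
# core passes to a hook becomes a de-facto stable interface: changing
# it breaks externally maintained extensions.

hooks = {}  # hook name -> list of registered callbacks

def register_hook(name, callback):
    """An extension registers a callback for a named hook point."""
    hooks.setdefault(name, []).append(callback)

def run_hook(name, *args):
    """Core code invokes every callback registered for a hook point."""
    for callback in hooks.get(name, []):
        callback(*args)

# A hypothetical extension tweaking page output:
def add_footer(page):
    page["html"] += "<footer>my extension was here</footer>"

register_hook("PageRenderComplete", add_footer)

page = {"title": "Example", "html": "<p>Hello</p>"}
run_hook("PageRenderComplete", page)
print(page["html"])  # the extension modified core's data in place
```

Because every callback receives core's internal data structures directly, each of the thousand hook points pins down part of the internal design, which is exactly the "nails in the architecture" problem.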
Yeah, these hook points kind of act like
nails in the architecture, and then you
have to wiggle around them - it's fun. We are
working to change that. We want to
architect it so that the interface exposed
to these hooks becomes much narrower and
the things that these hooks, these
callback functions, can do are much more
restricted. There's currently an RFC open
for this; it has been open for a while,
actually. The problem is that in order to
assess whether the proposal is actually
viable, you have to survey all the current
users of these hooks and make sure that
their use cases are still covered in the
new system, and, yeah, we have like a
thousand hook points and like a thousand
extensions, so that's quite a bit of
work. Another thing that I'm currently
working on is establishing a stable
interface policy. This may sound pretty
obvious - it has a lot of pretty obvious
things in it, like: if you have a class
and there's a public method, then that's a
stable interface, it will not just change
without notice, we have a deprecation
policy and all that. But if you have
worked with extensible systems that rely
on the mechanisms of object-oriented
programming, you may have come across the
question whether a protected method is
part of the stable interface of the
software or not, or maybe the constructor.
If you have worked in environments that
use dependency injection, the idea is
basically that the constructor signature
should be able to change at any time, but
then you have extensions that are
subclassing, and things break. So this is
why we are trying to establish a much more
restrictive stable interface policy, one
that would make explicit that things like
constructor signatures are actually not
stable, and that gives us a lot more
wiggle room to restructure the software.
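The subclassing problem mentioned here can be shown in a small sketch (Python rather than MediaWiki's PHP; all class and parameter names are invented for illustration):

```python
# Sketch of why constructor signatures are a fragile extension
# interface: an extension subclass hard-codes the constructor of a
# core class, and breaks when core adds a dependency.

class PageRenderer:
    """Core class, version 1: one constructor argument."""
    def __init__(self, parser):
        self.parser = parser

class FancyRenderer(PageRenderer):
    """Extension subclass relying on the v1 constructor signature."""
    def __init__(self, parser):
        super().__init__(parser)

# Later, core adopts dependency injection and adds a dependency:
class PageRenderer2:
    def __init__(self, parser, link_cache):  # signature changed!
        self.parser = parser
        self.link_cache = link_cache

class FancyRenderer2(PageRenderer2):
    def __init__(self, parser):
        super().__init__(parser)  # missing link_cache -> TypeError

try:
    FancyRenderer2("wikitext-parser")
except TypeError as exc:
    print("extension broke:", exc)
```

Declaring constructors explicitly non-stable, and offering stable factory or service entry points instead, lets the core rewire its internals without this kind of breakage.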
MediaWiki itself has grown as a piece of
software for the last 18 years or so and,
at least in the beginning, was mostly
created by volunteers. And in a monolithic
architecture there's a great tendency to
just, you know, find and grab the thing
that you want to use and just use it.
Which leads to structures like this one:
everything depends on everything. And if
you change one bit of code, everything
else may or may not break. And if you
don't have great test coverage at the same
time, this just makes it so that any
change becomes very risky: you have to do
a lot of manual testing, a lot of manual
digging around, touching a lot of files.
So for the last year, year and a half, we
have started a concerted effort to cut the
worst ties, to decouple the things that
have the most impact. There are a few
objects in the software, for instance one
that represents the user and one that
represents a title, that are used
everywhere, and the way they're
implemented currently also means that they
depend on everything, and that of course
is not a good situation. On a,
well, a similar idea on a higher level is
the decomposition of the software. The
decoupling was about the software
architecture; this is about the system
architecture: breaking up the monolith
itself into multiple services that serve
different purposes. The specifics of this
diagram are not really relevant to this
talk; it's more to, you know, give you an
impression of the complexity and the sort
of work we are doing there. The idea is
that perhaps we could split out certain
functionality into its own service, into a
separate application, like maybe move all
the search functionality into something
separate and self-contained. But then the
question is how do you, again, compose
this into the final user interface - at
some point these things have to get
composed together again - and again this
is a very trivial issue if you only want
this to work on your machine or you only
need to serve a hundred users or
something. But doing it at scale, at the
rate of something like 10,000 page views a
second - I said a hundred thousand
requests earlier, but that includes
resources, icons, CSS and all that - yeah,
then you have to think pretty hard about
what you can cache and, thank you, how you
can recombine things without having to
recompute everything, and this is
something that we are currently looking
into: coming up with an architecture that
allows us to compose and recombine the
output of different background services.
Okay. Before I
started this talk, I said I would probably
use roughly half of my time going through
the presentation, and I guess I hit that
spot on. So, this is all I have prepared,
but I'm happy to talk to you more about
the things I said, or maybe any other
aspects of this that you may be interested
in, if you have any comments or questions.
Oh! Three already.
Q: First of all, thanks a lot for the
presentation, such a really interesting
case of a legacy system, and thanks for
the honesty. It was really interesting as
a, you know, software engineer to see how
that works. I have a question about
decoupling: your system is probably
enormous, so how do you find, so to say,
the most evil parts, the ones that most
need to be decoupled? Do you use some
software for this, you know, like metrics
and stuff, or do you just know, kind of
intuitively...
Daniel: Yeah, this is actually quite
interesting, and maybe we can talk about
it a bit more in depth later. Very
quickly: it's a combination. On the one
hand you just have the anecdotal
experience of what is actually annoying
when you work with the software and try to
fix it, and on the other hand I try to
find good tooling for this, and the
existing tooling tends to die when you
just run it against our code base. So, one
of the things that you are looking for is
cyclic dependencies, but the number of
possible cycles in a graph grows
exponentially with the number of nodes.
And if you have a pretty tightly knit
graph, that number quickly goes into the
millions. And, yeah, the tool just goes to
100% CPU and never returns. So I spent
quite a bit of time trying to find
heuristics to get around that - it was a
lot of fun. We can talk about that later,
if you like. Okay, thanks.
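One common heuristic of the kind alluded to here is to avoid enumerating every cycle (their number can blow up combinatorially) and instead compute strongly connected components, which takes only linear time; any component with more than one class is a tangle to decouple. A Python sketch, with an invented dependency graph (this is illustrative, not the actual tooling used):

```python
# Instead of enumerating all dependency cycles, find strongly
# connected components (Tarjan's algorithm, linear in nodes + edges);
# every component larger than one node contains cycles.

def strongly_connected_components(graph):
    index, lowlink, on_stack = {}, {}, set()
    stack, components, counter = [], [], [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components

# Hypothetical class dependency graph: User, Title and Database all
# depend on each other; Logger stands alone.
deps = {
    "User": ["Title", "Database"],
    "Title": ["User"],
    "Database": ["User"],
    "Logger": [],
}
tangles = [c for c in strongly_connected_components(deps) if len(c) > 1]
print(tangles)  # one tangle containing User, Title and Database
```

Reporting tangles rather than individual cycles keeps the analysis tractable even on a tightly knit graph where the cycle count would be in the millions.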
Q: So what exactly is this Wikidata you
mentioned before? Is it like an extension,
or is it a completely different project?
Daniel: So, there's an extension called
Wikibase that implements this, well, I
would say, ontological modeling interface
for MediaWiki, and that is used to run a
website called Wikidata, which has
something like 30 million items modeled
that describe the world and serve as a
machine-readable data back-end to the
other Wikimedia projects. Yeah, I used to
work on that project for Wikimedia
Germany. I have moved on to do different
things now for a couple of years. Lukas
here in front is probably the person most
knowledgeable about the latest and
greatest in Wikidata development.
Q: You shortly talked about test coverage.
I would be interested...
Daniel: Sorry?
Q: You talked about test coverage.
Daniel: Yes.
Q: I would be interested in whether you
ramped up your efforts there to help you
modernize, and what your current situation
is with test coverage.
Daniel: Test coverage for MediaWiki core
is below 50%; in some parts it's below
10%, which is very worrying. One thing
that we started to look into, like half a
year ago, is, instead of writing unit
tests for all the code that we actually
want to throw away before we touch it, to
improve the test coverage using
integration tests on the API level. So we
are currently in the process of writing a
suite of tests, not just for the API
modules, but for all the functionality,
all the application logic behind the API.
That will hopefully cover most of the
relevant code paths and will give us
confidence when we refactor the code.
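The shape of such an API-level test might look roughly like this sketch (Python with unittest; the in-memory "wiki", its actions, and parameters are invented for illustration - this is not MediaWiki's actual test code, though the action names mimic its action-API style):

```python
# Sketch: test application logic only through the API surface, so the
# internals underneath can be refactored freely. FakeWikiApi is a tiny
# invented stand-in for an api.php endpoint.
import unittest

class FakeWikiApi:
    """In-memory stand-in for an api.php endpoint."""
    def __init__(self):
        self.pages = {}

    def call(self, action, **params):
        if action == "edit":
            self.pages[params["title"]] = params["text"]
            return {"edit": {"result": "Success"}}
        if action == "query":
            title = params["title"]
            if title not in self.pages:
                return {"query": {"missing": True}}
            return {"query": {"title": title, "text": self.pages[title]}}
        return {"error": {"code": "unknownaction"}}

class EditAndQueryTest(unittest.TestCase):
    """Exercises edit/query behavior via the API surface only."""
    def setUp(self):
        self.api = FakeWikiApi()

    def test_roundtrip(self):
        result = self.api.call("edit", title="Sandbox", text="Hello")
        self.assertEqual(result["edit"]["result"], "Success")
        result = self.api.call("query", title="Sandbox")
        self.assertEqual(result["query"]["text"], "Hello")

    def test_missing_page(self):
        result = self.api.call("query", title="Nope")
        self.assertTrue(result["query"]["missing"])

# Run the suite programmatically:
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EditAndQueryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the assertions only touch the request/response contract, the code behind the API can be rearchitected without rewriting these tests.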
Q: Thanks.
Herald: Other questions?
Q: So you said that you have this legacy
system and eventually you have to move
away from it, but are there any, like, I
don't know, plans for the near future? At
some point you have to cut off the current
infrastructure for your extensions and so
on, and it's a hard cut, I see. But are
there any plans to build it up from
scratch, or what are the plans?
Daniel: Yeah, we are not going to rewrite
from scratch - that's a pretty surefire
way to just kill the system. We will have
to make some tough decisions about
backwards compatibility and probably
reconsider some of the requirements and
constraints we have with respect to the
platforms we run on and also the platforms
we serve. One of the things that we have
been very careful to do in the past, for
instance, is to make sure that you can do
pretty much everything with MediaWiki with
no JavaScript on the client side. That
requirement is likely to drop. You will
still be able to read, of course, without
any JavaScript or anything, but the extent
of functionality you will have without
JavaScript on the client side is likely to
be greatly reduced - that kind of thing.
Also, we will probably end up breaking
compatibility with at least some of the
user-created tools. Hopefully we can offer
good alternatives: good APIs, good
libraries that people can actually port
to, that are less brittle. I hope that
will motivate people and maybe repay them
a bit for the pain of having their tool
broken, if we can give them something that
is more stable, more reliable, and
hopefully even nicer to use. Yeah, so it's
small increments, bits and pieces all over
the system; there's no, you know, great
master plan, no big change to point to,
really.
Herald: Okay, okay, further questions?
Daniel: I plan to just sit outside here at
the table later, if you want to come and
chat, so we can also do that there.
Herald: Okay, so, last call: are there any
other questions? It does not appear so,
so I'd like to ask for a huge applause for
Daniel for this talk.
Applause
36C3 postroll music
Subtitles created by c3subtitles.de
in the year 2020. Join, and help us!