WEBVTT
00:00:00.000 --> 00:00:20.510
36C3 preroll music
00:00:20.510 --> 00:00:24.750
Daniel: Good morning! I'm glad you all
made it here this early on the last day. I
00:00:24.750 --> 00:00:32.439
know it can't be easy. It wasn't easy for
me. I have to warn you that the way I
00:00:32.439 --> 00:00:36.160
prepared for this talk is a bit
experimental. I didn't make a slide set, I
00:00:36.160 --> 00:00:44.559
just made a mind map and I'll just click
through it while I talk to you. So,
00:00:44.559 --> 00:00:51.180
this talk is about modernizing Wikipedia.
As you probably have noticed, visiting
00:00:51.180 --> 00:00:58.500
Wikipedia can feel a bit like visiting a
website from 10-15 years ago. But before I
00:00:58.500 --> 00:01:05.280
talk about any problems or things to
improve, I first want to revisit that the
00:01:05.280 --> 00:01:11.619
software and the infrastructure we
built around it has been running Wikipedia
00:01:11.619 --> 00:01:20.160
and its sister sites for the last... well
nearly 19 years now and it's extremely
00:01:20.160 --> 00:01:32.200
successful. We serve 17 billion page
views a month, yes?
00:01:32.200 --> 00:01:40.870
Person in the audience: Could you make it
louder or speak up and also make the image
00:01:40.870 --> 00:01:42.870
bigger?
00:01:42.870 --> 00:01:43.870
inaudible dialogue
00:01:43.870 --> 00:01:45.870
Daniel: Is this better? Like if I speak up
I will lose my voice in 10 minutes, it's
00:01:45.870 --> 00:01:55.720
already in it, no it's fine. We have
technology for this. I can... the light
00:01:55.720 --> 00:02:05.490
doesn't help, yeah the contrast could be
better. Is it better like this? Okay cool.
00:02:05.490 --> 00:02:13.840
All right so yeah we are serving 17
billion page views a month, which is quite
00:02:13.840 --> 00:02:19.560
a lot. Wikipedia exists in about 100
languages. If you attended the talk about
00:02:19.560 --> 00:02:24.250
the Wikimedia infrastructure yesterday, we
talked about 300 languages. We actually
00:02:24.250 --> 00:02:29.989
support 300 languages for localization but
we have Wikipedia in about 100, if I'm not
00:02:29.989 --> 00:02:38.689
completely off. I find this picture quite
fascinating. This is a visualization of
00:02:38.689 --> 00:02:43.719
all the places in the world that are
described on Wikipedia and sister projects
00:02:43.719 --> 00:02:49.319
and I find this quite impressive although
it's also a nice display of cultural bias
00:02:49.319 --> 00:03:00.790
of course. We, that is Wikimedia
Foundation, run about 900 to 1,000 wikis
00:03:00.790 --> 00:03:06.680
depending on how you count, but there are
many, many more MediaWiki installations
00:03:06.680 --> 00:03:11.459
out there, some of them big and many many
of them small. We have actually no idea
00:03:11.459 --> 00:03:17.150
how many small instances there are. So
it's a very powerful very flexible and
00:03:17.150 --> 00:03:23.730
versatile piece of software but, you know,
sometimes it can feel like... you can do a
00:03:23.730 --> 00:03:28.329
lot of things with it, right, but
sometimes it feels like it's a bit
00:03:28.329 --> 00:03:42.180
overburdened and maybe you should look at
improving the foundations. So one of the
00:03:42.180 --> 00:03:47.829
things that make MediaWiki great but also
sometimes hard to use is that kind of
00:03:47.829 --> 00:03:52.609
everything is text, everything is markup,
everything is done with wikitext,
00:03:52.609 --> 00:04:02.529
which has grown in complexity over the
years. So if you look at the anatomy of a
00:04:02.529 --> 00:04:09.159
wiki page it can be a bit daunting. You
have different syntax for markup and
00:04:09.159 --> 00:04:16.150
different kinds of transclusion or
templates and media and some things
00:04:16.150 --> 00:04:21.739
actually, you know, get displayed in
place, some things show up in a completely
00:04:21.739 --> 00:04:26.340
different place on the page it can be
rather confusing and daunting for
00:04:26.340 --> 00:04:31.720
newcomers. And also things like having a
conversation just talking to people like,
00:04:31.720 --> 00:04:35.540
you know, having a conversation thread
looks like this. You open the page you
00:04:35.540 --> 00:04:40.510
look through the markup and you indent to
make a conversation thread and then you
00:04:40.510 --> 00:04:43.480
get confused about the indenting and
someone messes with the formatting and
00:04:43.480 --> 00:04:52.120
it's all excellent. There have been many
attempts over the years to improve the
00:04:52.120 --> 00:05:00.290
situation. We have things like Echo, which
notifies you, for instance when someone
00:05:00.290 --> 00:05:09.130
mentions your name or someone... It is
also used to welcome people and do this
00:05:09.130 --> 00:05:12.400
kind of achievement unlocked
notifications: hey, you did your first
00:05:12.400 --> 00:05:19.900
edit, this is great welcome! To make
people a bit more engaged with the system
00:05:19.900 --> 00:05:24.380
but it's really mostly improvements around
the fringes. We have had a system called
00:05:24.380 --> 00:05:31.350
Flow for a while to improve the way
conversations work. So you have more like
00:05:31.350 --> 00:05:37.960
a thread structure that the software
actually knows about but then there are
00:05:37.960 --> 00:05:42.160
many, well quite a few people who have
been around for a while that are very used
00:05:42.160 --> 00:05:46.900
to the manual system and also there's a
lot of tools to support this manual system
00:05:46.900 --> 00:05:52.780
which of course are incompatible with
making things more modern. So we use this
00:05:52.780 --> 00:05:56.250
for instance on MediaWiki.org which is a
site which is basically a self
00:05:56.250 --> 00:06:03.000
documentation site of MediaWiki but on
most Wikipedias this is not enabled or at
00:06:03.000 --> 00:06:14.530
least not used by default everywhere. The
biggest attempt to move away from the text
00:06:14.530 --> 00:06:23.050
only approach is Wikidata, which we
started in 2012. The idea of Wikidata of
00:06:23.050 --> 00:06:29.580
course, if you didn't attend many great
talks we had about it here over the
00:06:29.580 --> 00:06:36.470
course of the Congress, is a way to
basically model the world using structured
00:06:36.470 --> 00:06:45.470
data, using a semantic approach instead of
natural language which has its own
00:06:45.470 --> 00:06:50.740
complexities but at least it's a way to
represent the knowledge of the world in a
00:06:50.740 --> 00:06:56.790
way that machines can understand. So this
would be an alternative to wikitext but
00:06:56.790 --> 00:07:09.389
still the vast majority of things
especially on Wikipedia are just markup.
00:07:09.389 --> 00:07:13.800
And this markup is pretty powerful and
there's lots of ways to extend it and to
00:07:13.800 --> 00:07:21.050
do things with it. So a lot of things on
MediaWiki are just DIY, just do it
00:07:21.050 --> 00:07:29.250
yourself. Templates are a great example of
this. Infoboxes of course, the nice blue
00:07:29.250 --> 00:07:34.730
boxes here on the right side of pages, are
done using templates but these templates
00:07:34.730 --> 00:07:39.090
are just for formatting, there is no data
processing, there's no database or
00:07:39.090 --> 00:07:47.530
structured data backing them. It's just
basically, you know, it's still just
00:07:47.530 --> 00:07:56.630
markup. It's still... you have a predefined
layout but you're still feeding it text, not
00:07:56.630 --> 00:08:04.520
data. You have parameters but the values
of the parameters are still again maybe
00:08:04.520 --> 00:08:11.610
templates or links or you have markup in
them, like you know HTML line breaks and
00:08:11.610 --> 00:08:18.860
stuff. So it's kind of semi structured.
And this of course is also used to do
00:08:18.860 --> 00:08:24.100
things like workflow. The template... Oh
no, this was actually an infobox, wrong
00:08:24.100 --> 00:08:34.229
picture, wrong capture. This is also used
to do workflows, so if a page on Wikipedia
00:08:34.229 --> 00:08:39.789
gets nominated for deletion you manually
put a template on the page that defines
00:08:39.789 --> 00:08:44.870
why this is supposed to be deleted and
then you have to go to a different page
00:08:44.870 --> 00:08:49.390
and put a different template there, giving
more explanation and this again is used
00:08:49.390 --> 00:08:55.149
for discussion. It's a lot of structure
created by the community and maintained by
00:08:55.149 --> 00:09:02.730
the community, using conventions and tools
built on top of what is essentially just a
00:09:02.730 --> 00:09:10.620
pile of markup. And because doing all this
manually is kind of painful, early on
00:09:10.620 --> 00:09:17.360
we created a system to allow people to add
JavaScript to the site, which is then
00:09:17.360 --> 00:09:27.019
maintained on wiki pages by the community
and it can tweak and automate. But again,
00:09:27.019 --> 00:09:30.589
it doesn't really have much to work with,
right? It basically messes with whatever
00:09:30.589 --> 00:09:35.470
it can, it directly interacts with the DOM
of the page, whenever the layout of the
00:09:35.470 --> 00:09:41.040
software changes, things break. So this is
not great for compatibility but it's
00:09:41.040 --> 00:09:54.730
used a lot and it is very important for
the community to have this power. Sorry, I
00:09:54.730 --> 00:10:00.110
wish there was a better way to show these
pictures. Okay, that's just to give you an
00:10:00.110 --> 00:10:05.220
idea of what kind of thing is implemented
that way and maintained by the community
00:10:05.220 --> 00:10:10.189
on their site. One of the problems we have
with that is: these are bound to a wiki
00:10:10.189 --> 00:10:19.410
and I just told you that we run over 900
of these not over 9,000 and it would be
00:10:19.410 --> 00:10:26.300
great if you could just share them between
wikis but we can't. And again, there have
00:10:26.300 --> 00:10:30.790
been... we have been talking about it a
lot and it seems like it shouldn't be so
00:10:30.790 --> 00:10:36.759
hard, but you kind of need to write these
tools differently, if you want to share
00:10:36.759 --> 00:10:39.899
them across sites, because different sites
use different conventions, they use
00:10:39.899 --> 00:10:45.529
different templates. Then it just doesn't
work and you actually have to write decent
00:10:45.529 --> 00:10:50.970
software that uses internationalization if
you want to use it across wikis. While
00:10:50.970 --> 00:10:55.019
these are usually just, you know, one-off
hacks with everything hard-coded, we would
00:10:55.019 --> 00:10:58.450
have to put in place an
internationalization system and it's
00:10:58.450 --> 00:11:02.910
actually a lot of effort and there's a lot
of things that are actually unclear about
00:11:02.910 --> 00:11:15.260
it. So, before I dive more deeply into the
different things that will make it hard to
00:11:15.260 --> 00:11:20.529
improve on the current situation and the
things that we are doing to improve it, do
00:11:20.529 --> 00:11:27.309
we have any questions or do you have any
other - do you have any things you may
00:11:27.309 --> 00:11:34.519
find particularly, well, annoying or
particularly outdated, when interacting
00:11:34.519 --> 00:11:40.920
with Wikipedia? Any thoughts on that?
Beyond what I just said?
00:11:40.920 --> 00:11:48.769
Microphone: The strict separation, just in
Wikipedia, between mobile layout and
00:11:48.769 --> 00:11:54.259
desktop layout.
Daniel: Yeah. So, actually having a
00:11:54.259 --> 00:12:02.069
reactive layout system that would just
work for mobile and desktop in the same
00:12:02.069 --> 00:12:09.130
way and allowing the designers and UX
experts, who work on the system to just do
00:12:09.130 --> 00:12:15.180
this once and not two or maybe even three
times - because of course we also have
00:12:15.180 --> 00:12:20.550
native applications for different
platforms - would be great and it's
00:12:20.550 --> 00:12:24.360
something that we're looking into at the
moment. But it's not, you know, it's not
00:12:24.360 --> 00:12:29.519
that easy. We could build a completely new
system, that does this but then again you
00:12:29.519 --> 00:12:33.249
would be telling people: "You can no
longer use the old system", but now they
00:12:33.249 --> 00:12:39.019
have built all these tools that rely on
how the old system works and you have to
00:12:39.019 --> 00:12:52.089
port all of this over so there's a lot of
inertia. Any other thoughts? Everyone is
00:12:52.089 --> 00:13:03.720
still asleep, that's excellent. So I can
continue. So, another thing that makes it
00:13:03.720 --> 00:13:10.879
difficult to change how MediaWiki works or
to improve it is that we are trying to do
00:13:10.879 --> 00:13:19.180
well to be at least two things at once: on
the one hand we are running a top 5
00:13:19.180 --> 00:13:24.360
website and serving over 100,000 requests
per second using the system and on the
00:13:24.360 --> 00:13:30.540
other hand, at least until now, we have
always made sure that you can just
00:13:30.540 --> 00:13:33.800
download MediaWiki and install it on a
shared hosting platform. You don't even
00:13:33.800 --> 00:13:38.920
need root on the system, right? You don't
even need administrative privileges you
00:13:38.920 --> 00:13:44.769
can just set it up and run it in your web
space and it will work. And, having the
00:13:44.769 --> 00:13:51.779
same piece of software do both, run in a
minimal environment and run at scale, is
00:13:51.779 --> 00:13:55.040
rather difficult and it also means that
there's a lot of things that we can't
00:13:55.040 --> 00:14:02.110
easily do, right? All this modern
microservice architecture, separate front-end
00:14:02.110 --> 00:14:09.309
and back-end systems, all of that means
that it's a lot more complicated to set up
00:14:09.309 --> 00:14:15.720
and needs more knowledge or more
infrastructure to set up and so far that
00:14:15.720 --> 00:14:19.500
meant we can't do it, because so far there
was this requirement that you should
00:14:19.500 --> 00:14:23.569
really be able to just run it on your
shared hosting. And we are currently
00:14:23.569 --> 00:14:29.639
considering to what extent we can continue
this, I mean, container based hosting is
00:14:29.639 --> 00:14:34.620
picking up. Maybe this is an alternative.
It's still unclear but it seems like this
00:14:34.620 --> 00:14:45.999
is something that we need to reconsider.
Yeah, but if we make this harder to do
00:14:45.999 --> 00:14:52.739
then a lot of current users of MediaWiki
would maybe not, well, maybe no longer
00:14:52.739 --> 00:14:57.230
exist or at least would not exist as they
do now, right. You probably have seen
00:14:57.230 --> 00:15:05.259
this nice MediaWiki instance, the Congress
wiki. Which - with a completely customized
00:15:05.259 --> 00:15:09.689
skin and a lot of extensions installed to
allow people to define their sessions
00:15:09.689 --> 00:15:14.410
there and making sure these sessions
automatically get listed and get put into
00:15:14.410 --> 00:15:20.660
a calendar - this is all done using
extensions, like Semantic MediaWiki, that
00:15:20.660 --> 00:15:34.279
allow you to basically define queries in
the wikitext markup. Yeah, another thing
00:15:34.279 --> 00:15:42.079
that, of course, slows down development is
that Wikimedia does engineering on a,
00:15:42.079 --> 00:15:48.130
well, comparatively a shoestring budget,
right? The budget of the Wikimedia
00:15:48.130 --> 00:15:52.199
Foundation, the annual budget is something
like a hundred million dollars, that
00:15:52.199 --> 00:15:58.009
sounds like a lot of money, but if you
compare it to other companies running a
00:15:58.009 --> 00:16:03.209
top five or top ten website it's like two
percent of their budget or something like
00:16:03.209 --> 00:16:10.769
that, right? It's really, I mean, 100
million is not peanuts but compared to
00:16:10.769 --> 00:16:16.699
what other companies invest to achieve
this kind of goal it kind of is. So, what
00:16:16.699 --> 00:16:22.230
this budget translates into is something
like 300, depending on how you count,
00:16:22.230 --> 00:16:28.800
between three hundred and four hundred
staff. So, this is the people who run all
00:16:28.800 --> 00:16:32.189
of this, including all the community
outreach all the social aspects all the
00:16:32.189 --> 00:16:40.920
administrative aspects. Less than half of
these are the engineers who do all this.
00:16:40.920 --> 00:16:50.989
And we have like, something like 2,500
servers, bare-metal, so, which is not a
00:16:50.989 --> 00:16:57.619
lot for this kind of thing. Which also
means that we have to design the software
00:16:57.619 --> 00:17:07.079
to be not just scalable but also quite
efficient. The modern approach to scaling
00:17:07.079 --> 00:17:11.640
is usually to scale horizontally: make it so
you can just spin up another virtual
00:17:11.640 --> 00:17:19.280
machine in some cloud service, but, yeah,
we run our own service, we run our own
00:17:19.280 --> 00:17:24.440
servers, so we can design to scale
horizontally, but it means ordering
00:17:24.440 --> 00:17:32.070
hardware and setting it up and it's going
to take half a year or so. And we don't
00:17:32.070 --> 00:17:38.390
actually have that many people who do
this, so, scalability and performance are
00:17:38.390 --> 00:17:49.000
also important factors when designing the
software. Okay. Before I dive into what we
00:17:49.000 --> 00:18:03.860
are actually doing - any questions? This
one in the back. Wait for the mic, please.
00:18:03.860 --> 00:18:07.330
In the very...
Q: Hi!
00:18:07.330 --> 00:18:12.950
Daniel: Hello.
Q: So, you said you don't have that many
00:18:12.950 --> 00:18:22.990
people, but how many do you actually have?
Daniel: For... it's something like 150 engineers
00:18:22.990 --> 00:18:27.170
worldwide. It always depends on what you
count, right? So you count the people, who
00:18:27.170 --> 00:18:32.260
- do you count engineers, who work on the
native apps, do you count engineers, who
00:18:32.260 --> 00:18:36.980
work on the Wikimedia cloud services -
actually we do have cloud services, we
00:18:36.980 --> 00:18:41.190
offer them to the community to run their
own things, but we don't run our stuff on
00:18:41.190 --> 00:18:45.560
other people's cloud. Yeah, so depending
on how you count or something and whether
00:18:45.560 --> 00:18:50.210
you count the people working here in
Germany for Wikimedia Germany, which is a
00:18:50.210 --> 00:18:57.760
separate organization technically - it's
something like 150 engineers.
00:18:57.760 --> 00:19:08.210
Q: Thanks!
Q: I'm interested: What are the reasons
00:19:08.210 --> 00:19:13.880
that you don't run on other people's
services like on the cloud. I mean, then
00:19:13.880 --> 00:19:17.090
it will be easy to scale horizontally,
right?
00:19:17.090 --> 00:19:25.330
Daniel: There's, well, one reason is being
independent, right? If we, yeah, imagine
00:19:25.330 --> 00:19:32.350
we ran all our stuff on Amazon's
infrastructure and then maybe Amazon
00:19:32.350 --> 00:19:38.060
doesn't like the way that the Wikipedia
article about Amazon is written - what do
00:19:38.060 --> 00:19:42.050
we do, right? Maybe they shut us down,
maybe they make things very expensive,
00:19:42.050 --> 00:19:47.360
maybe they make things very painful for
us, maybe there is at least some kind of
00:19:47.360 --> 00:19:54.070
self-censorship mechanism happening and we
want to avoid that. There are
00:19:54.070 --> 00:19:58.440
thoughts about this, there are thoughts
like maybe we can do this at least for
00:19:58.440 --> 00:20:04.270
development infrastructure and CI, not for
production or maybe we can make it so that
00:20:04.270 --> 00:20:12.200
we run stuff in the cloud services by more
than one vendor, so we basically spread
00:20:12.200 --> 00:20:17.860
out so we are not reliant on a single
company. We are thinking about these
00:20:17.860 --> 00:20:21.820
things but so far the way to actually stay
independent has been to run our own
00:20:21.820 --> 00:20:28.300
servers.
Q: You've been talking about scalability
00:20:28.300 --> 00:20:35.490
and changing the architecture, that kind
of seems to imply to me that there's a
00:20:35.490 --> 00:20:42.270
problem with scaling at the moment or that
it's foreseeable that things are not gonna
00:20:42.270 --> 00:20:46.580
work out if you just keep doing what
you're doing at the moment. Can you maybe
00:20:46.580 --> 00:20:52.480
elaborate on that.
Daniel: So, there's, I think there's two sides
00:20:52.480 --> 00:20:56.850
to this. On the one hand the reason I
mentioned it is just that a lot of things
00:20:56.850 --> 00:21:01.610
that are really easy to do basically for
me, right, "works on my machine", are really
00:21:01.610 --> 00:21:08.920
hard to do if you want to do them at
scale. That's one aspect. The other aspect
00:21:08.920 --> 00:21:16.670
is MediaWiki is pretty much a PHP monolith
and that means scaling it always means
00:21:16.670 --> 00:21:23.680
copying the monolith and breaking it down
so you have smaller units that you can
00:21:23.680 --> 00:21:29.040
scale and just say, yeah, I don't know, I
need more instances for authentication
00:21:29.040 --> 00:21:33.910
handling or something like that. That
would be more efficient, right, because
00:21:33.910 --> 00:21:40.730
you have higher granularity, you can just
scale the things that you actually need
00:21:40.730 --> 00:21:47.530
but that of course needs rearchitecting.
It's not like things are going to explode
00:21:47.530 --> 00:21:52.910
if we don't do that very soon, it's not,
so there's not like an urgent problem
00:21:52.910 --> 00:21:58.400
there. The reason for us to rearchitect is
more, to gain more flexibility in
00:21:58.400 --> 00:22:03.330
development, because if you have a
monolith that is pretty entangled, code
00:22:03.330 --> 00:22:16.130
changes are risky and take a long time.
Q: How many people work on product design
00:22:16.130 --> 00:22:25.460
or like user experience research to, like,
sit down with users and try to understand
00:22:25.460 --> 00:22:28.440
what their needs are and from there
proceed.
00:22:28.440 --> 00:22:33.230
Daniel: Across... I don't have an exact number,
something like five.
00:22:33.230 --> 00:22:37.930
Audience: Do you think that's sufficient?
Herald: The question was, whether it's
00:22:37.930 --> 00:22:46.800
sufficient. So just...
Daniel: Probably not? But it's more than,
00:22:46.800 --> 00:22:50.310
that's more people than we have for
database administration, and that's also
00:22:50.310 --> 00:23:06.040
not sufficient.
Herald: Are there further questions? I don't
00:23:06.040 --> 00:23:16.270
think so.
Daniel: Okay. So, one of the things, that
00:23:16.270 --> 00:23:20.320
holds us back a bit, is that there's
literally thousands of extensions for
00:23:20.320 --> 00:23:26.870
MediaWiki and the extension mechanism is
heavily reliant on hooks, so basically on
00:23:26.870 --> 00:23:39.600
callbacks. And, we have - I don't have a
picture, I have a link here - we have a
00:23:39.600 --> 00:23:44.500
great number of these. So, you see, each
paragraph is basically documenting one
00:23:44.500 --> 00:23:51.970
callback that you can use to modify the
behavior of the software and, I mean,
00:23:51.970 --> 00:23:59.240
there's, I have never counted, but
something like a thousand? And all of them
00:23:59.240 --> 00:24:07.520
are of course interfaces to extra - to
software that is maintained externally, so
00:24:07.520 --> 00:24:12.611
they have to be kept stable and if you
have a large chunk of software that you
00:24:12.611 --> 00:24:16.730
want to restructure but you have a
thousand fixed points that you can't
00:24:16.730 --> 00:24:22.960
change, things become rather difficult.
It's basi.. yeah, these hook points kind
00:24:22.960 --> 00:24:27.640
of, like, they act like nails in the
architecture and then you kind of have to
00:24:27.640 --> 00:24:36.650
wiggle around them - it's fun. We are
working to change that. We want to
00:24:36.650 --> 00:24:43.950
architect it so the interface that is
exposed to these hooks becomes much more
00:24:43.950 --> 00:24:51.360
narrow and the things that these hooks or
these callback functions can do are much
00:24:51.360 --> 00:24:58.690
more restricted. There's currently an RFC
open for this, has been open for a while
00:24:58.690 --> 00:25:04.690
actually. The problem is that in order to
assess whether the proposal is actually
00:25:04.690 --> 00:25:11.530
viable you have to survey all the current
users of these hooks and make sure that we
00:25:11.530 --> 00:25:15.660
can, the use case is still covered in the
new system and, yeah, we have like a
00:25:15.660 --> 00:25:21.030
thousand hook points and we have like a
thousand extensions, that's quite a bit of
00:25:21.030 --> 00:25:31.060
work. Another thing that I'm currently
working on is establishing a stable
00:25:31.060 --> 00:25:36.990
interface policy. This may sound pretty
obvious - it has a lot of pretty obvious
00:25:36.990 --> 00:25:42.430
things like, yeah, if you have a class and
there's a public method then that's a
00:25:42.430 --> 00:25:46.410
stable interface it will not just change
without notice, we have a deprecation policy
00:25:46.410 --> 00:25:53.020
and all that. But if you have worked with
extensible systems that rely on the
00:25:53.020 --> 00:25:58.350
mechanisms of object-oriented programming,
you may have come across the question
00:25:58.350 --> 00:26:05.040
whether a protected method is part of this
stable interface of the software or not,
00:26:05.040 --> 00:26:10.010
or maybe the constructor? I don't know, if
you have worked in environments that use
00:26:10.010 --> 00:26:15.860
dependency injection the idea is basically
that the constructor signature should be
00:26:15.860 --> 00:26:21.270
able to change at any time but then you
have extensions that are subclassing and
00:26:21.270 --> 00:26:25.640
things break. So, this is why we are
trying to establish a much more
00:26:25.640 --> 00:26:32.750
restrictive stable interface policy, that
would make explicit things like
00:26:32.750 --> 00:26:36.650
constructor signatures actually not being
stable and that gives us a lot more wiggle
00:26:36.650 --> 00:26:51.030
room to restructure the software.
MediaWiki itself has grown as a software
00:26:51.030 --> 00:26:58.750
for the last 18 years or so and, at least
in the beginning, was mostly created by
00:26:58.750 --> 00:27:06.330
volunteers. And in a monolithic
architecture there's a great tendency to
00:27:06.330 --> 00:27:11.070
just, you know, find and grab the thing
that you want to use and just use it.
00:27:11.070 --> 00:27:19.100
Which leads to, well, structures like this
one: everything depends on everything. And
00:27:19.100 --> 00:27:26.360
if you change one bit of code everything
else may or may not break. And with, yeah.
00:27:26.360 --> 00:27:31.350
And if you don't have great test coverage
at the same time this just makes it so
00:27:31.350 --> 00:27:35.312
that any change becomes very risky and you
have to do a lot of manual testing a lot
00:27:35.312 --> 00:27:43.690
of manual digging around, touching a lot
of files. For the last year,
00:27:43.690 --> 00:27:50.510
year and a half we have started a
concerted effort to tie the worst - to cut
00:27:50.510 --> 00:27:57.760
the worst ties, to decouple these things
that are, basically that have most impact
00:27:57.760 --> 00:28:03.320
there's a few objects in the software that
rep... - for instance one that represents
00:28:03.320 --> 00:28:08.280
the user and one that represents a title
that are used everywhere and the way
00:28:08.280 --> 00:28:14.240
they're implemented currently also means
that they depend on everything and that of
00:28:14.240 --> 00:28:29.620
course is not a good situation. On a,
well, a similar idea on a higher level is
00:28:29.620 --> 00:28:34.400
decomposition of the software. So the
decoupling was about the software
00:28:34.400 --> 00:28:39.990
architecture, this is about the system
architecture: breaking up the
00:28:39.990 --> 00:28:45.490
monolith itself into multiple services that
serve different purposes. The specifics of
00:28:45.490 --> 00:28:50.281
this diagram are not really relevant to
this talk. This is more to, you know, give
00:28:50.281 --> 00:28:57.710
you an impression of the complexity and
the sort of work we are doing there. The
00:28:57.710 --> 00:29:05.580
idea is that perhaps we could split out
certain functionality into its own service
00:29:05.580 --> 00:29:11.160
into a separate application, like maybe
move all the search functionality into
00:29:11.160 --> 00:29:17.150
something separate and self-contained, but
then the question is how do you, again,
00:29:17.150 --> 00:29:23.280
compose this into the final user interface
- at some point these things have to get
00:29:23.280 --> 00:29:28.420
composed together again - and again this
is a very trivial issue if you
00:29:28.420 --> 00:29:32.470
only want this to work on your
machine or you only need to serve a
00:29:32.470 --> 00:29:39.680
hundred users or something. But doing this
at scale, doing it at the rate of something
00:29:39.680 --> 00:29:45.230
like 10,000 page views a second, I said a
hundred thousand requests earlier but that
00:29:45.230 --> 00:29:51.790
includes resources, icons, CSS and all
that. So, yeah, then you have to think
00:29:51.790 --> 00:29:58.470
pretty hard about what you can cache and,
thank you, how you can recombine things
00:29:58.470 --> 00:30:02.760
without having to recompute everything and
this is something that we are currently
00:30:02.760 --> 00:30:08.580
looking into - coming up with an
architecture that allows us to compose and
00:30:08.580 --> 00:30:23.220
recombine the output of different
background services. Okay. Before I
00:30:23.220 --> 00:30:27.600
started this talk I said I would probably
roughly use half of my time going through
00:30:27.600 --> 00:30:33.310
the presentation and I guess I just hit
that spot on. So, this is all I have
00:30:33.310 --> 00:30:41.070
prepared but I'm happy to talk to you more
about the things I said or maybe any other
00:30:41.070 --> 00:30:48.050
aspects of this that you may be interested
in. Any comments or questions? Oh!
00:30:48.050 --> 00:30:56.800
Three already.
Q: First of all thanks a lot for the
00:30:56.800 --> 00:31:03.150
presentation, such a really interesting
case of a legacy system and thanks for the
00:31:03.150 --> 00:31:10.130
honesty. It was really interesting as a,
you know, software engineer to see how
00:31:10.130 --> 00:31:15.101
that works. I have a question about
decoupling, so, I mean, I kind of, you
00:31:15.101 --> 00:31:23.190
have like, probably your system is
enormous and how do you find, so to say,
00:31:23.190 --> 00:31:29.100
the most evil, you know, parts which
sort of have to be decoupled. Do you use other
00:31:29.100 --> 00:31:34.820
software, with, you know, like,
metrics and stuff, or do you just know,
00:31:34.820 --> 00:31:38.370
kind of intuitively..
Daniel: Yeah, it's actually, this is quite
00:31:38.370 --> 00:31:44.970
interesting and maybe I can, maybe we can
talk about it a bit more in depth later.
00:31:44.970 --> 00:31:49.020
Very quickly: it's a combination. On the
one hand you just have the anecdotal
00:31:49.020 --> 00:31:53.280
experience of what is actually annoying
when you work with the software and you
00:31:53.280 --> 00:31:59.111
try to fix it and on the other hand I try
to find good tooling for this and the
00:31:59.111 --> 00:32:05.440
existing tooling tends to die when you
just run it against our code base. So, one
00:32:05.440 --> 00:32:09.930
of the things that you are looking for are
cyclic dependencies but the number of
00:32:09.930 --> 00:32:15.080
possible cycles in a graph grows
exponentially with the number of nodes. And
00:32:15.080 --> 00:32:17.710
if you have a pretty tightly knit graph
that number quickly goes into the
00:32:17.710 --> 00:32:26.580
millions. And, yeah, the tool just goes to
100% CPU and never returns. So, I spent
00:32:26.580 --> 00:32:33.600
quite a bit of time trying to find
heuristics to get around that - was a lot
00:32:33.600 --> 00:32:41.550
of fun. I can, yeah, we can talk about
that later, if you like. Okay, thanks.
00:32:41.550 --> 00:32:49.221
Q: So what exactly is this Wikidata you
mentioned before? Is it like an extension
00:32:49.221 --> 00:32:55.580
or is it a completely different project?
Daniel: Wiki - so there's an extension called
00:32:55.580 --> 00:33:04.630
Wikibase, that implements this, well I
would say, ontological modeling interface
00:33:04.630 --> 00:33:11.980
for MediaWiki and that is used to run a
website called Wikidata which has
00:33:11.980 --> 00:33:19.500
something like 30 million items modeled
that describe the world and serve as a
00:33:19.500 --> 00:33:25.610
machine-readable data back-end to other
wiki projects, other Wikimedia projects.
00:33:25.610 --> 00:33:32.890
Yeah, I used to work on that project for
Wikimedia Germany. I moved on to do
00:33:32.890 --> 00:33:41.150
different things now for a couple of
years. Lukas here in front is probably the
00:33:41.150 --> 00:33:50.190
person most knowledgeable about the latest
and greatest in the Wikidata development.
00:33:50.190 --> 00:33:56.240
Q: You've shortly talked about test
coverage. I would be interested...
00:33:56.240 --> 00:33:58.650
Daniel: Sorry?
Q: You talked about test coverage.
00:33:58.650 --> 00:34:02.010
Daniel: Yes.
Q: I would be interested in if you amped
00:34:02.010 --> 00:34:07.660
your efforts to help you modernize it and
how your current situation is with test
00:34:07.660 --> 00:34:11.809
coverage.
Daniel: Test coverage for MediaWiki core is below
00:34:11.809 --> 00:34:21.809
50%. In some parts it's below 10% which is
very worrying. One thing that we started
00:34:21.809 --> 00:34:30.050
to look into, like half a year ago, is
instead of writing unit tests for all the
00:34:30.050 --> 00:34:36.010
code that we actually want to throw away,
before we touch it, we tried to improve
00:34:36.010 --> 00:34:40.900
the test coverage using integration tests
on the API level. So we are currently in
00:34:40.900 --> 00:34:48.240
the process of writing a suite of tests,
not just for the API modules, but for all
00:34:48.240 --> 00:34:54.540
the functionality, all the application
logic behind the API. And that will
00:34:54.540 --> 00:35:01.070
hopefully cover most of the relevant code
paths and will give us confidence when we
00:35:01.070 --> 00:35:12.420
refactor the code.
Q: Thanks.
00:35:12.420 --> 00:35:26.280
Herald: Other questions?
Q: So you said that you have this legacy
00:35:26.280 --> 00:35:32.240
system and eventually you have to move
away from it but are there any, like, I
00:35:32.240 --> 00:35:39.820
don't know, plans for the near future to,
I don't know. At some point you have to
00:35:39.820 --> 00:35:47.310
cut the current infrastructure to your
extensions and so on and it's a hard cut, I
00:35:47.310 --> 00:35:53.330
see. But are there any plans to build it
up from scratch or what are the plans?
00:35:53.330 --> 00:35:58.060
Daniel: Yeah, we are not going to rewrite from
scratch - that's a pretty surefire way to
00:35:58.060 --> 00:36:05.370
just kill the system. We will have to make
some tough decisions about backwards
00:36:05.370 --> 00:36:11.340
compatibility and probably reconsider some
of the requirements and constraints we
00:36:11.340 --> 00:36:17.100
have, well, with respect to the platforms
we run on and also the platforms we serve.
00:36:17.100 --> 00:36:21.130
One of the things that we have been very
careful to do in the past for instance is
00:36:21.130 --> 00:36:26.530
to make sure that you can do pretty much
everything with MediaWiki with no
00:36:26.530 --> 00:36:32.800
JavaScript on the client side. And that
requirement is likely to drop. You will
00:36:32.800 --> 00:36:40.010
still be able to read, of course, without
any JavaScript or anything, but the extent
00:36:40.010 --> 00:36:45.910
of functionality you will have without
JavaScript on the client side is likely to
00:36:45.910 --> 00:36:51.140
be greatly reduced - that kind of thing.
Also we will probably end up breaking
00:36:51.140 --> 00:36:57.660
compatibility with at least some of the
user-created tools. Hopefully we can offer
00:36:57.660 --> 00:37:02.390
good alternatives, good APIs, good
libraries that people can actually port
00:37:02.390 --> 00:37:11.070
to, that are less brittle. I hope that
will motivate people and maybe repay them
00:37:11.070 --> 00:37:15.950
a bit for the pain of having their tool
broken, if we can give them something that
00:37:15.950 --> 00:37:21.119
is more stable, more reliable, and
hopefully even nicer to use. Yeah, so,
00:37:21.119 --> 00:37:25.930
it's small increments, bits and pieces
all over the system; there's no, you know,
00:37:25.930 --> 00:37:32.550
no great master plan, no big change to
point to really.
00:37:32.550 --> 00:37:45.470
Herald: Okay, okay, further questions?
Daniel: I plan to just sit outside here at
00:37:45.470 --> 00:37:54.800
the table later if you just want to come
and chat so we can also do that there.
00:37:54.800 --> 00:38:01.250
Herald: Okay, so, last call are there any
other questions? It does not appear so,
00:38:01.250 --> 00:38:08.110
so, I'd like to ask for a huge applause for
Daniel for this talk.
00:38:08.110 --> 00:38:12.627
Applause
00:38:12.627 --> 00:38:14.730
36C3 postroll music
00:38:14.730 --> 00:38:38.320
Subtitles created by c3subtitles.de
in the year 2020. Join, and help us!