1
00:00:00,000 --> 00:00:20,510
36C3 preroll music
2
00:00:20,510 --> 00:00:24,750
Daniel: Good morning! I'm glad you all
made it here this early on the last day. I
3
00:00:24,750 --> 00:00:32,439
know it can't be easy; it wasn't easy for
me. I have to warn you that the way I
4
00:00:32,439 --> 00:00:36,160
prepared for this talk is a bit
experimental. I didn't make a slide set, I
5
00:00:36,160 --> 00:00:44,559
just made a mind map and I'll just click
through it while I talk to you. So,
6
00:00:44,559 --> 00:00:51,180
this talk is about modernizing Wikipedia.
As you probably have noticed, visiting
7
00:00:51,180 --> 00:00:58,500
Wikipedia can feel a bit like visiting a
website from 10-15 years ago. But before I
8
00:00:58,500 --> 00:01:05,280
talk about any problems or things to
improve, I first want to revisit that the
9
00:01:05,280 --> 00:01:11,619
software and the infrastructure we
build around it has been running Wikipedia
10
00:01:11,619 --> 00:01:20,160
and its sister sites for the last... well
nearly 19 years now and it's extremely
11
00:01:20,160 --> 00:01:32,200
successful. We serve 17 billion page
views a month, yes?
12
00:01:32,200 --> 00:01:40,870
Person in the audience: Could you make it
louder or speak up and also make the image
13
00:01:40,870 --> 00:01:42,870
bigger?
14
00:01:42,870 --> 00:01:43,870
inaudible dialogue
15
00:01:43,870 --> 00:01:45,870
Daniel: Is this better? Like if I speak up
I will lose my voice in 10 minutes, it's
16
00:01:45,870 --> 00:01:55,720
already in it, no it's fine. We have
technology for this. I can... the light
17
00:01:55,720 --> 00:02:05,490
doesn't help, yeah the contrast could be
better. Is it better like this? Okay cool.
18
00:02:05,490 --> 00:02:13,840
All right so yeah we are serving 17
billion page views a month, which is quite
19
00:02:13,840 --> 00:02:19,560
a lot. Wikipedia exists in about 100
languages. If you attended the talk about
20
00:02:19,560 --> 00:02:24,250
the Wikimedia infrastructure yesterday, we
talked about 300 languages. We actually
21
00:02:24,250 --> 00:02:29,989
support 300 languages for localization but
we have Wikipedia in about 100, if I'm not
22
00:02:29,989 --> 00:02:38,689
completely off. I find this picture quite
fascinating. This is a visualization of
23
00:02:38,689 --> 00:02:43,719
all the places in the world that are
described on Wikipedia and sister projects
24
00:02:43,719 --> 00:02:49,319
and I find this quite impressive although
it's also a nice display of cultural bias
25
00:02:49,319 --> 00:03:00,790
of course. We, that is Wikimedia
Foundation, run about 900 to 1,000 wikis
26
00:03:00,790 --> 00:03:06,680
depending on how you count, but there are
many many more MediaWiki installations
27
00:03:06,680 --> 00:03:11,459
out there, some of them big and many many
of them small. We have actually no idea
28
00:03:11,459 --> 00:03:17,150
how many small instances there are. So
it's a very powerful very flexible and
29
00:03:17,150 --> 00:03:23,730
versatile piece of software but, you know,
sometimes it can feel like... you can do a
30
00:03:23,730 --> 00:03:28,329
lot of things with it, right, but
sometimes it feels like it's a bit
31
00:03:28,329 --> 00:03:42,180
overburdened and maybe you should look at
improving the foundations. So one of the
32
00:03:42,180 --> 00:03:47,829
things that make MediaWiki great but also
sometimes hard to use is that kind of
33
00:03:47,829 --> 00:03:52,609
everything is text, everything is markup,
everything is done with wikitext,
34
00:03:52,609 --> 00:04:02,529
which has grown in complexity over the
years. So if you look at the anatomy of a
35
00:04:02,529 --> 00:04:09,159
wiki page it can be a bit daunting. You
have different syntax for markup and
36
00:04:09,159 --> 00:04:16,150
different kinds of transclusion or
templates and media and some things
37
00:04:16,150 --> 00:04:21,739
actually, you know, get displayed in
place, some things show up in a completely
38
00:04:21,739 --> 00:04:26,340
different place on the page. It can be
rather confusing and daunting for
39
00:04:26,340 --> 00:04:31,720
newcomers. And also things like having a
conversation just talking to people like,
40
00:04:31,720 --> 00:04:35,540
you know, having a conversation thread
looks like this. You open the page you
41
00:04:35,540 --> 00:04:40,510
look through the markup and you indent to
make a conversation thread and then you
42
00:04:40,510 --> 00:04:43,480
get confused about the indenting and
someone messes with the formatting and
43
00:04:43,480 --> 00:04:52,120
it's all excellent. There have been many
attempts over the years to improve the
44
00:04:52,120 --> 00:05:00,290
situation, we have things like echo which
notifies you, for instance when someone
45
00:05:00,290 --> 00:05:09,130
mentions your name or someone... It is
also used to welcome people and do this
46
00:05:09,130 --> 00:05:12,400
kind of achievement unlocked
notifications: hey, you did your first
47
00:05:12,400 --> 00:05:19,900
edit, this is great welcome! To make
people a bit more engaged with the system
48
00:05:19,900 --> 00:05:24,380
but it's really mostly improvements around
the fringes. We have had a system called
49
00:05:24,380 --> 00:05:31,350
Flow for a while to improve the way
conversations work. So you have more like
50
00:05:31,350 --> 00:05:37,960
a thread structure that the software
actually knows about but then there are
51
00:05:37,960 --> 00:05:42,160
many, well quite a few people who have
been around for a while that are very used
52
00:05:42,160 --> 00:05:46,900
to the manual system and also there's a
lot of tools to support this manual system
53
00:05:46,900 --> 00:05:52,780
which of course are incompatible with
making things more modern. So we use this
54
00:05:52,780 --> 00:05:56,250
for instance on MediaWiki.org which is a
site which is basically a self
55
00:05:56,250 --> 00:06:03,000
documentation site of MediaWiki but on
most Wikipedias this is not enabled or at
56
00:06:03,000 --> 00:06:14,530
least not used by default everywhere. The
biggest attempt to move away from the text
57
00:06:14,530 --> 00:06:23,050
only approach is Wikidata, which we
started in 2012. The idea of Wikidata of
58
00:06:23,050 --> 00:06:29,580
course, if you didn't attend many great
talks we had about it here over the
59
00:06:29,580 --> 00:06:36,470
course of the Congress, is a way to
basically model the world using structured
60
00:06:36,470 --> 00:06:45,470
data, using a semantic approach instead of
natural language which has its own
61
00:06:45,470 --> 00:06:50,740
complexities but at least it's a way to
represent the knowledge of the world in a
62
00:06:50,740 --> 00:06:56,790
way that machines can understand. So this
would be an alternative to wikitext but
63
00:06:56,790 --> 00:07:09,389
still the vast majority of things
especially on Wikipedia are just markup.
64
00:07:09,389 --> 00:07:13,800
And this markup is pretty powerful and
there's lots of ways to extend it and to
65
00:07:13,800 --> 00:07:21,050
do things with it. So a lot of things on
MediaWiki are just DIY, just do it
66
00:07:21,050 --> 00:07:29,250
yourself. Templates are a great example of
this. Infoboxes of course, the nice blue
67
00:07:29,250 --> 00:07:34,730
boxes here on the right side of pages, are
done using templates but these templates
68
00:07:34,730 --> 00:07:39,090
are just for formatting, there is no data
processing, there's no database or
69
00:07:39,090 --> 00:07:47,530
structured data backing them. It's just
basically, you know, it's still just
70
00:07:47,530 --> 00:07:56,630
markup. It's still... you have a predefined
layout but you're still feeding it text, not
71
00:07:56,630 --> 00:08:04,520
data. You have parameters but the values
of the parameters are still again maybe
72
00:08:04,520 --> 00:08:11,610
templates or links or you have markup in
them, like you know HTML line breaks and
73
00:08:11,610 --> 00:08:18,860
stuff. So it's kind of semi structured.
And this of course is also used to do
74
00:08:18,860 --> 00:08:24,100
things like workflow. The template... Oh
no, this was actually an infobox, wrong
75
00:08:24,100 --> 00:08:34,229
picture, wrong capture. This is also used
to do workflows, so if a page on Wikipedia
76
00:08:34,229 --> 00:08:39,789
gets nominated for deletion you manually
put a template on the page that defines
77
00:08:39,789 --> 00:08:44,870
why this is supposed to be deleted and
then you have to go to a different page
78
00:08:44,870 --> 00:08:49,390
and put a different template there, giving
more explanation and this again is used
79
00:08:49,390 --> 00:08:55,149
for discussion. It's a lot of structure
created by the community and maintained by
80
00:08:55,149 --> 00:09:02,730
the community, using conventions and tools
built on top of what is essentially just a
81
00:09:02,730 --> 00:09:10,620
pile of markup. And because doing all this
manually is kind of painful, early on
82
00:09:10,620 --> 00:09:17,360
we created a system to allow people to add
JavaScript to the site, which is then
83
00:09:17,360 --> 00:09:27,019
maintained on wiki pages by the community
and it can tweak and automate. But again,
84
00:09:27,019 --> 00:09:30,589
it doesn't really have much to work with,
right? It basically messes with whatever
85
00:09:30,589 --> 00:09:35,470
it can, it directly interacts with the DOM
of the page, whenever the layout of the
86
00:09:35,470 --> 00:09:41,040
software changes, things break. So this is
not great for compatibility but it's
87
00:09:41,040 --> 00:09:54,730
used a lot and it is very important for
the community to have this power. Sorry, I
88
00:09:54,730 --> 00:10:00,110
wish there was a better way to show these
pictures. Okay, that's just to give you an
89
00:10:00,110 --> 00:10:05,220
idea of what kind of thing is implemented
that way and maintained by the community
90
00:10:05,220 --> 00:10:10,189
on their site. One of the problems we have
with that is: these are bound to a wiki
91
00:10:10,189 --> 00:10:19,410
and I just told you that we run over 900
of these not over 9,000 and it would be
92
00:10:19,410 --> 00:10:26,300
great if you could just share them between
wikis but we can't. And again, there have
93
00:10:26,300 --> 00:10:30,790
been... we have been talking about it a
lot and it seems like it shouldn't be so
94
00:10:30,790 --> 00:10:36,759
hard, but you kind of need to write these
tools differently, if you want to share
95
00:10:36,759 --> 00:10:39,899
them across sites, because different sites
use different conventions, they use
96
00:10:39,899 --> 00:10:45,529
different templates. Then it just doesn't
work and you actually have to write decent
97
00:10:45,529 --> 00:10:50,970
software that uses internationalization if
you want to use it across wikis. While
98
00:10:50,970 --> 00:10:55,019
these are usually just you know one-off
hacks with everything hard-coded we would
99
00:10:55,019 --> 00:10:58,450
have to put in place an
internationalization system and it's
100
00:10:58,450 --> 00:11:02,910
actually a lot of effort and there's a lot
of things that are actually unclear about
101
00:11:02,910 --> 00:11:15,260
it. So, before I dive more deeply into the
different things that will make it hard to
102
00:11:15,260 --> 00:11:20,529
improve on the current situation and the
things that we are doing to improve it do
103
00:11:20,529 --> 00:11:27,309
we have any questions or do you have any
other - do you have any things you may
104
00:11:27,309 --> 00:11:34,519
find particularly, well, annoying or
particularly outdated, when interacting
105
00:11:34,519 --> 00:11:40,920
with Wikipedia? Any thoughts on that?
Beyond what I just said?
106
00:11:40,920 --> 00:11:48,769
Microphone: The strict separation, just in
Wikipedia, between mobile layout and
107
00:11:48,769 --> 00:11:54,259
desktop layout.
Daniel: Yeah. So, actually having a
108
00:11:54,259 --> 00:12:02,069
reactive layout system that would just
work for mobile and desktop in the same
109
00:12:02,069 --> 00:12:09,130
way and allowing the designers and UX
experts, who work on the system to just do
110
00:12:09,130 --> 00:12:15,180
this once and not two or maybe even three
times - because of course we also have
111
00:12:15,180 --> 00:12:20,550
native applications for different
platforms - would be great and it's
112
00:12:20,550 --> 00:12:24,360
something that we're looking into at the
moment. But it's not, you know, it's not
113
00:12:24,360 --> 00:12:29,519
that easy. We could build a completely new
system, that does this but then again you
114
00:12:29,519 --> 00:12:33,249
would be telling people: "You can no
longer use the old system", but now they
115
00:12:33,249 --> 00:12:39,019
have built all these tools that rely on
how the old system works and you have to
116
00:12:39,019 --> 00:12:52,089
port all of this over so there's a lot of
inertia. Any other thoughts? Everyone is
117
00:12:52,089 --> 00:13:03,720
still asleep that's excellent. So I can
continue. So, another thing that makes it
118
00:13:03,720 --> 00:13:10,879
difficult to change how MediaWiki works or
to improve it is that we are trying to do
119
00:13:10,879 --> 00:13:19,180
well to be at least two things at once: on
the one hand we are running a top 5
120
00:13:19,180 --> 00:13:24,360
website and serving over 100,000 requests
per second using the system and on the
121
00:13:24,360 --> 00:13:30,540
other hand, at least until now, we have
always made sure that you can just
122
00:13:30,540 --> 00:13:33,800
download MediaWiki and install it on a
shared hosting platform. You don't even
123
00:13:33,800 --> 00:13:38,920
need root on the system, right? You don't
even need administrative privileges you
124
00:13:38,920 --> 00:13:44,769
can just set it up and run it in your web
space and it will work. And, having the
125
00:13:44,769 --> 00:13:51,779
same piece of software do both, run in a
minimal environment and run at scale, is
126
00:13:51,779 --> 00:13:55,040
rather difficult and it also means that
there's a lot of things that we can't
127
00:13:55,040 --> 00:14:02,110
easily do, right? All this modern
microservice architecture, separate front-end
128
00:14:02,110 --> 00:14:09,309
and back-end systems, all of that means
that it's a lot more complicated to set up
129
00:14:09,309 --> 00:14:15,720
and needs more knowledge or more
infrastructure to set up and so far that
130
00:14:15,720 --> 00:14:19,500
meant we can't do it, because so far there
was this requirement that you should
131
00:14:19,500 --> 00:14:23,569
really be able to just run it on your
shared hosting. And we are currently
132
00:14:23,569 --> 00:14:29,639
considering to what extent we can continue
this, I mean, container based hosting is
133
00:14:29,639 --> 00:14:34,620
picking up. Maybe this is an alternative
it's still unclear but it seems like this
134
00:14:34,620 --> 00:14:45,999
is something that we need to reconsider.
Yeah, but if we make this harder to do
135
00:14:45,999 --> 00:14:52,739
then a lot of current users of MediaWiki
would maybe not, well, maybe no longer
136
00:14:52,739 --> 00:14:57,230
exist or at least would not exist as they
do now, right. You probably have seen
137
00:14:57,230 --> 00:15:05,259
this nice MediaWiki instance the Congress
wiki. Which - with a completely customized
138
00:15:05,259 --> 00:15:09,689
skin and a lot of extensions installed to
allow people to define their sessions
139
00:15:09,689 --> 00:15:14,410
there and making sure these sessions
automatically get listed and get put into
140
00:15:14,410 --> 00:15:20,660
a calendar - this is all done using
extensions, like Semantic MediaWiki, that
141
00:15:20,660 --> 00:15:34,279
allow you to basically define queries in
the wikitext markup. Yeah, another thing
142
00:15:34,279 --> 00:15:42,079
that, of course, slows down development is
that Wikimedia does engineering on a,
143
00:15:42,079 --> 00:15:48,130
well, comparatively a shoestring budget,
right? The budget of the Wikimedia
144
00:15:48,130 --> 00:15:52,199
Foundation, the annual budget is something
like a hundred million dollars, that
145
00:15:52,199 --> 00:15:58,009
sounds like a lot of money, but if you
compare it to other companies running a
146
00:15:58,009 --> 00:16:03,209
top five or top ten website it's like two
percent of their budget or something like
147
00:16:03,209 --> 00:16:10,769
that, right? It's really, I mean, 100
million is not peanuts but compared to
148
00:16:10,769 --> 00:16:16,699
what other companies invest to achieve
this kind of goal it kind of is. So, what
149
00:16:16,699 --> 00:16:22,230
this budget translates into is something
like 300, depending on how you count,
150
00:16:22,230 --> 00:16:28,800
between three hundred and four hundred
staff. So, this is the people who run all
151
00:16:28,800 --> 00:16:32,189
of this, including all the community
outreach, all the social aspects, all the
152
00:16:32,189 --> 00:16:40,920
administrative aspects. Less than half of
these are the engineers who do all this.
153
00:16:40,920 --> 00:16:50,989
And we have like, something like 2,500
servers, bare-metal, so, which is not a
154
00:16:50,989 --> 00:16:57,619
lot for this kind of thing. Which also
means that we have to design the software
155
00:16:57,619 --> 00:17:07,079
to be not just scalable but also quite
efficient. The modern approach to scaling
156
00:17:07,079 --> 00:17:11,640
is usually: scale horizontally, make it so
you can just spin up another virtual
157
00:17:11,640 --> 00:17:19,280
machine in some cloud service, but, yeah,
we run our own service, we run our own
158
00:17:19,280 --> 00:17:24,440
servers, so we can design to scale
horizontally, but it means ordering
159
00:17:24,440 --> 00:17:32,070
hardware and setting it up and it's going
to take half a year or so. And we don't
160
00:17:32,070 --> 00:17:38,390
actually have that many people who do
this, so, scalability and performance are
161
00:17:38,390 --> 00:17:49,000
also important factors when designing the
software. Okay. Before I dive into what we
162
00:17:49,000 --> 00:18:03,860
are actually doing - any questions? This
one in the back. Wait for the mic, please.
163
00:18:03,860 --> 00:18:07,330
In the very...
Q: Hi!
164
00:18:07,330 --> 00:18:12,950
Daniel: Hello.
Q: So, you said you don't have that many
165
00:18:12,950 --> 00:18:22,990
people, but how many do you actually have?
Daniel: For... it's something like 150 engineers
166
00:18:22,990 --> 00:18:27,170
worldwide. It always depends on what you
count, right? So you count the people, who
167
00:18:27,170 --> 00:18:32,260
- do you count engineers, who work on the
native apps, do you count engineers, who
168
00:18:32,260 --> 00:18:36,980
work on the Wikimedia cloud services -
actually we do have cloud services, we
169
00:18:36,980 --> 00:18:41,190
offer them to the community to run their
own things, but we don't run our stuff on
170
00:18:41,190 --> 00:18:45,560
other people's cloud. Yeah, so depending
on how you count or something and whether
171
00:18:45,560 --> 00:18:50,210
you count the people working here in
Germany for Wikimedia Germany, which is a
172
00:18:50,210 --> 00:18:57,760
separate organization technically - it's
something like 150 engineers.
173
00:18:57,760 --> 00:19:08,210
Q: Thanks!
Q: I'm interested: What are the reasons
174
00:19:08,210 --> 00:19:13,880
that you don't run on other people's
services like on the cloud. I mean, then
175
00:19:13,880 --> 00:19:17,090
it will be easy to scale horizontally,
right?
176
00:19:17,090 --> 00:19:25,330
Daniel: There's, well, one reason is being
independent, right? If we, yeah, I imagine
177
00:19:25,330 --> 00:19:32,350
we ran all our stuff on Amazon's
infrastructure and then maybe Amazon
178
00:19:32,350 --> 00:19:38,060
doesn't like the way that the Wikipedia
article about Amazon is written - what do
179
00:19:38,060 --> 00:19:42,050
we do, right? Maybe they shut us down,
maybe they make things very expensive,
180
00:19:42,050 --> 00:19:47,360
maybe they make things very painful for
us, maybe there is at least some kind of
181
00:19:47,360 --> 00:19:54,070
self-censorship mechanism happening and we
want to avoid that. There are
182
00:19:54,070 --> 00:19:58,440
thoughts about this, there are thoughts
like maybe we can do this at least for
183
00:19:58,440 --> 00:20:04,270
development infrastructure and CI, not for
production or maybe we can make it so that
184
00:20:04,270 --> 00:20:12,200
we run stuff in the cloud services by more
than one vendor, so we basically we spread
185
00:20:12,200 --> 00:20:17,860
out so we are not reliant on a single
company. We are thinking about these
186
00:20:17,860 --> 00:20:21,820
things but so far the way to actually stay
independent has been to run our own
187
00:20:21,820 --> 00:20:28,300
servers.
Q: You've been talking about scalability
188
00:20:28,300 --> 00:20:35,490
and changing the architecture, that kind
of seems to imply to me that there's a
189
00:20:35,490 --> 00:20:42,270
problem with scaling at the moment or that
it's foreseeable that things are not gonna
190
00:20:42,270 --> 00:20:46,580
work out if you just keep doing what
you're doing at the moment. Can you maybe
191
00:20:46,580 --> 00:20:52,480
elaborate on that.
Daniel: So, there's, I think there's two sides
192
00:20:52,480 --> 00:20:56,850
to this. On the one hand the reason I
mentioned it is just that a lot of things
193
00:20:56,850 --> 00:21:01,610
that are really easy to do basically for
me, right? Works on my machine are really
194
00:21:01,610 --> 00:21:08,920
hard to do if you want to do them at
scale. That's one aspect. The other aspect
195
00:21:08,920 --> 00:21:16,670
is MediaWiki is pretty much a PHP monolith
and that means scaling it always means
196
00:21:16,670 --> 00:21:23,680
copying the monolith and breaking it down
so you have smaller units that you can
197
00:21:23,680 --> 00:21:29,040
scale and just say, yeah, I don't know, I
need more instances for authentication
198
00:21:29,040 --> 00:21:33,910
handling or something like that. That
would be more efficient, right, because
199
00:21:33,910 --> 00:21:40,730
you have higher granularity, you can just
scale the things that you actually need
200
00:21:40,730 --> 00:21:47,530
but that of course needs rearchitecting.
It's not like things are going to explode
201
00:21:47,530 --> 00:21:52,910
if we don't do that very soon, it's not,
so there's not like an urgent problem
202
00:21:52,910 --> 00:21:58,400
there. The reason for us to rearchitect is
more, to gain more flexibility in
203
00:21:58,400 --> 00:22:03,330
development, because if you have a
monolith that is pretty entangled, code
204
00:22:03,330 --> 00:22:16,130
changes are risky and take a long time.
Q: How many people work on product design
205
00:22:16,130 --> 00:22:25,460
or like user experience research to, like,
sit down with users and try to understand
206
00:22:25,460 --> 00:22:28,440
what their needs are and from there
proceed.
207
00:22:28,440 --> 00:22:33,230
A: Across... I don't have an exact number,
something like five.
208
00:22:33,230 --> 00:22:37,930
Audience: Do you think that's sufficient?
Herald: The question was, whether it's
209
00:22:37,930 --> 00:22:46,800
sufficient. So just...
Daniel: Probably not? But it's more than,
210
00:22:46,800 --> 00:22:50,310
that's more people than we have for
database administration, and that's also
211
00:22:50,310 --> 00:23:06,040
not sufficient.
Herald: Are there further questions? I don't
212
00:23:06,040 --> 00:23:16,270
think.
Daniel: Okay. So, one of the things, that
213
00:23:16,270 --> 00:23:20,320
holds us back a bit, is that there's
literally thousands of extensions for
214
00:23:20,320 --> 00:23:26,870
MediaWiki and the extension mechanism is
heavily reliant on hooks, so basically on
215
00:23:26,870 --> 00:23:39,600
callbacks. And, we have - I don't have a
picture, I have a link here - we have a
216
00:23:39,600 --> 00:23:44,500
great number of these. So, you see, each
paragraph is basically documenting one
217
00:23:44,500 --> 00:23:51,970
callback that you can use to modify the
behavior of the software and, I mean,
218
00:23:51,970 --> 00:23:59,240
there's, I have never counted, but
something like a thousand? And all of them
219
00:23:59,240 --> 00:24:07,520
are of course interfaces to extra - to
software that is maintained externally, so
220
00:24:07,520 --> 00:24:12,611
they have to be kept stable and if you
have a large chunk of software that you
221
00:24:12,611 --> 00:24:16,730
want to restructure but you have a
thousand fixed points that you can't
222
00:24:16,730 --> 00:24:22,960
change, things become rather difficult.
It's basi.. yeah, these hook points kind
223
00:24:22,960 --> 00:24:27,640
of, like, they act like nails in the
architecture and then you kind of have to
224
00:24:27,640 --> 00:24:36,650
wiggle around them - it's fun. We are
working to change that. We want to
225
00:24:36,650 --> 00:24:43,950
architect it so the interface that is
exposed to these hooks becomes much more
226
00:24:43,950 --> 00:24:51,360
narrow and the things that these hooks or
these callback functions can do are much
227
00:24:51,360 --> 00:24:58,690
more restricted. There's currently an RFC
open for this, has been open for a while
228
00:24:58,690 --> 00:25:04,690
actually. The problem is that in order to
assess whether the proposal is actually
229
00:25:04,690 --> 00:25:11,530
viable you have to survey all the current
users of these hooks and make sure that we
230
00:25:11,530 --> 00:25:15,660
can, the use case is still covered in the
new system and, yeah, we have like a
231
00:25:15,660 --> 00:25:21,030
thousand hook points and we have like a
thousand extensions that's quite a bit of
232
00:25:21,030 --> 00:25:31,060
work. Another thing that I'm currently
working on is establishing a stable
233
00:25:31,060 --> 00:25:36,990
interface policy. This may sound pretty
obvious - it has a lot of pretty obvious
234
00:25:36,990 --> 00:25:42,430
things like, yeah, if you have a class and
there's a public method then that's a
235
00:25:42,430 --> 00:25:46,410
stable interface it will not just change
without notice, we have deprecation policy
236
00:25:46,410 --> 00:25:53,020
and all that. But if you have worked with
extensible systems that rely on the
237
00:25:53,020 --> 00:25:58,350
mechanisms of object-oriented programming,
you may have come across the question
238
00:25:58,350 --> 00:26:05,040
whether a protected method is part of this
stable interface of the software or not,
239
00:26:05,040 --> 00:26:10,010
or maybe the constructor? I don't know, if
you have worked in environments that use
240
00:26:10,010 --> 00:26:15,860
dependency injection the idea is basically
that the constructor signature should be
241
00:26:15,860 --> 00:26:21,270
able to change at any time but then you
have extensions that are subclassing and
242
00:26:21,270 --> 00:26:25,640
things break. So, this is why we are
trying to establish a much more
243
00:26:25,640 --> 00:26:32,750
restrictive stable interface policy, that
would make explicit things like
244
00:26:32,750 --> 00:26:36,650
constructor signatures actually not being
stable and that gives us a lot more wiggle
245
00:26:36,650 --> 00:26:51,030
room to restructure the software.
MediaWiki itself has grown as a software
246
00:26:51,030 --> 00:26:58,750
for the last 18 years or so and, at least
in the beginning, was mostly created by
247
00:26:58,750 --> 00:27:06,330
volunteers. And in a monolithic
architecture there's a great tendency to
248
00:27:06,330 --> 00:27:11,070
just, you know, find and grab the thing
that you want to use and just use it.
249
00:27:11,070 --> 00:27:19,100
Which leads to, well, structures like this
one: everything depends on everything. And
250
00:27:19,100 --> 00:27:26,360
if you change one bit of code everything
else may or may not break. And with, yeah.
251
00:27:26,360 --> 00:27:31,350
And if you don't have great test coverage
at the same time this just makes it so
252
00:27:31,350 --> 00:27:35,312
that any change becomes very risky and you
have to do a lot of manual testing a lot
253
00:27:35,312 --> 00:27:43,690
of manual digging around, touching a lot
of files and we are for the last year,
254
00:27:43,690 --> 00:27:50,510
year and a half we have started a
concerted effort to tie the worst - to cut
255
00:27:50,510 --> 00:27:57,760
the worst ties, to decouple these things
that are, basically that have most impact
256
00:27:57,760 --> 00:28:03,320
there's a few objects in the software that
rep... - for instance one that represents
257
00:28:03,320 --> 00:28:08,280
the user and one that represents a title
that are used everywhere and the way
258
00:28:08,280 --> 00:28:14,240
they're implemented currently also means
that they depend on everything and that of
259
00:28:14,240 --> 00:28:29,620
course is not a good situation. On a,
well, a similar idea on a higher level is
260
00:28:29,620 --> 00:28:34,400
decomposition of the software. So the
decoupling was about the software
261
00:28:34,400 --> 00:28:39,990
architecture, this is about the system
architecture: breaking up the
262
00:28:39,990 --> 00:28:45,490
monolith itself into multiple services that
serve different purposes. The specifics of
263
00:28:45,490 --> 00:28:50,281
this diagram are not really relevant to
this talk. This is more to, you know, give
264
00:28:50,281 --> 00:28:57,710
you an impression of the complexity and
the sort of work we are doing there. The
265
00:28:57,710 --> 00:29:05,580
idea is that perhaps we could split out
certain functionality into its own service
266
00:29:05,580 --> 00:29:11,160
into a separate application, like maybe
move all the search functionality into
267
00:29:11,160 --> 00:29:17,150
something separate and self-contained, but
then the question is how do you, again,
268
00:29:17,150 --> 00:29:23,280
compose this into the final user interface
- at some point these things have to get
269
00:29:23,280 --> 00:29:28,420
composed together again - and again this
is a very trivial issue if you
270
00:29:28,420 --> 00:29:32,470
only want this to work on your
machine or you only need to serve a
271
00:29:32,470 --> 00:29:39,680
hundred users or something. But doing this
at scale doing it at the rate of something
272
00:29:39,680 --> 00:29:45,230
like 10,000 page views a second, I said a
hundred thousand requests earlier but that
273
00:29:45,230 --> 00:29:51,790
includes resources, icons, CSS and all
that. So, yeah, then you have to think
274
00:29:51,790 --> 00:29:58,470
pretty hard about what you can cache and,
thank you, how you can recombine things
275
00:29:58,470 --> 00:30:02,760
without having to recompute everything and
this is something that we are currently
276
00:30:02,760 --> 00:30:08,580
looking into - coming up with an
architecture that allows us to compose and
277
00:30:08,580 --> 00:30:23,220
recombine the output of different
background services. Okay. Before I
278
00:30:23,220 --> 00:30:27,600
started this talk I said I would probably
roughly use half of my time going through
279
00:30:27,600 --> 00:30:33,310
the presentation and I guess I just hit
that spot on. So, this is all I have
280
00:30:33,310 --> 00:30:41,070
prepared but I'm happy to talk to you more
about the things I said or maybe any other
281
00:30:41,070 --> 00:30:48,050
aspects of this that you may be interested
in. Any comments or questions? Oh!
282
00:30:48,050 --> 00:30:56,800
Three already.
Q: First of all thanks a lot for the
283
00:30:56,800 --> 00:31:03,150
presentation, such a really interesting
case of a legacy system and thanks for the
284
00:31:03,150 --> 00:31:10,130
honesty. It was really interesting as a,
you know, software engineer to see how
285
00:31:10,130 --> 00:31:15,101
that works. I have a question about
decoupling, so, I mean, I kind of, you
286
00:31:15,101 --> 00:31:23,190
have like, probably your system is
enormous and how do you find, so to say,
287
00:31:23,190 --> 00:31:29,100
the most evil, you know, parts which
sort of have to be decoupled. Do you use other
288
00:31:29,100 --> 00:31:34,820
software, with, you know, this, like, what
a metrics and stuff or do you just know,
289
00:31:34,820 --> 00:31:38,370
kind of intuitively..
Daniel: Yeah, it's actually, this is quite
290
00:31:38,370 --> 00:31:44,970
interesting and maybe I can, maybe we can
talk about it a bit more in depth later.
291
00:31:44,970 --> 00:31:49,020
Very quickly: it's a combination. On the
one hand you just have the anecdotal
292
00:31:49,020 --> 00:31:53,280
experience of what is actually annoying
when you work with the software and you
293
00:31:53,280 --> 00:31:59,111
try to fix it and on the other hand I try
to find good tooling for this and the
294
00:31:59,111 --> 00:32:05,440
existing tooling tends to die when you
just run it against our code base. So, one
295
00:32:05,440 --> 00:32:09,930
of the things that you are looking for is
cyclic dependencies but the number of
296
00:32:09,930 --> 00:32:15,080
possible cycles in a graph grows
exponentially with the number of nodes. And
297
00:32:15,080 --> 00:32:17,710
if you have a pretty tightly knit graph
that number quickly goes into the
298
00:32:17,710 --> 00:32:26,580
millions. And, yeah, the tool just goes to
100% CPU and never returns. So, I spent
299
00:32:26,580 --> 00:32:33,600
quite a bit of time trying to find
heuristics to get around that - was a lot
300
00:32:33,600 --> 00:32:41,550
of fun. I can, yeah, we can talk about
that later, if you like. Okay, thanks.
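[Editor's sketch] The talk does not say which heuristics were used. One common way to avoid enumerating every cycle, offered here only as an illustration, is to compute strongly connected components with Tarjan's algorithm: this runs in time linear in the number of nodes and edges, and every component with more than one node is a group of classes that depend on each other cyclically. The class names in `deps` are hypothetical, not taken from the MediaWiki code base.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm: returns a list of sets of mutually reachable nodes."""
    index = {}        # discovery order of each visited node
    lowlink = {}      # lowest discovery index reachable from the node
    stack = []        # nodes of the components still being built
    on_stack = set()
    result = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return result


def tangled_groups(graph):
    """Keep only components that contain at least one dependency cycle."""
    return [
        component
        for component in strongly_connected_components(graph)
        if len(component) > 1
        or any(node in graph.get(node, ()) for node in component)
    ]


# Hypothetical dependency graph: class name -> classes it depends on.
deps = {
    "Title": ["User", "Database"],
    "User": ["Title", "Session"],
    "Session": ["User"],
    "Database": [],
    "Parser": ["Title"],
}
groups = tangled_groups(deps)
print(groups)
```

The recursive form is fine for a sketch. A very large or deep graph would need an iterative version to stay within Python's recursion limit.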
301
00:32:41,550 --> 00:32:49,221
Q: So what exactly is this Wikidata you
mentioned before. Is it like an extension
302
00:32:49,221 --> 00:32:55,580
or is it a completely different project?
Daniel: Wiki - so there's an extension called
303
00:32:55,580 --> 00:33:04,630
Wikibase, that implements this, well I
would say, ontological modeling interface
304
00:33:04,630 --> 00:33:11,980
for MediaWiki and that is used to run a
website called Wikidata which has
305
00:33:11,980 --> 00:33:19,500
something like 30 million items modeled
that describe the world and serve as a
306
00:33:19,500 --> 00:33:25,610
machine-readable data back-end to other
wiki project, other Wikimedia projects.
307
00:33:25,610 --> 00:33:32,890
Yeah, I used to work on that project for
Wikimedia Germany. I moved on to do
308
00:33:32,890 --> 00:33:41,150
different things now for a couple of
years. Lukas here in front is probably the
309
00:33:41,150 --> 00:33:50,190
person most knowledgeable about the latest
and greatest in the Wikidata development.
310
00:33:50,190 --> 00:33:56,240
Q: You've briefly talked about test
coverage. I will be into history..
311
00:33:56,240 --> 00:33:58,650
Daniel: Sorry?
Q: You talked about test coverage.
312
00:33:58,650 --> 00:34:02,010
Daniel: Yes.
Q: I would be interested in if you amped
313
00:34:02,010 --> 00:34:07,660
your efforts to help you modernize it and
how your current situation is with test
314
00:34:07,660 --> 00:34:11,809
coverage.
Daniel: Test coverage for MediaWiki core is below
315
00:34:11,809 --> 00:34:21,809
50%. In some parts it's below 10% which is
very worrying. One thing that we started
316
00:34:21,809 --> 00:34:30,050
to look into, like half a year ago, is
instead of writing unit tests for all the
317
00:34:30,050 --> 00:34:36,010
code that we actually want to throw away,
before we touch it, we tried to improve
318
00:34:36,010 --> 00:34:40,900
the test coverage using integration tests
on the API level. So we are currently in
319
00:34:40,900 --> 00:34:48,240
the process of writing a suite of tests,
not just for the API modules, but for all
320
00:34:48,240 --> 00:34:54,540
the functionality, all the application
logic behind the API. And that will
321
00:34:54,540 --> 00:35:01,070
hopefully cover most of the relevant code
paths and will give us confidence when we
322
00:35:01,070 --> 00:35:12,420
refactor the code.
Q: Thanks.
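[Editor's sketch] A minimal illustration of what testing at the API level means, assuming a toy in-memory application. `ToyWiki`, its `edit` and `read` actions and the response shapes are invented for this example and are not MediaWiki's actual API. The point is that the tests only talk to the single API entry point, so the code behind it can be restructured without rewriting the tests.

```python
import unittest


class ToyWiki:
    """Hypothetical stand-in for the application behind an API."""

    def __init__(self):
        self._pages = {}  # title -> list of revision texts

    def api(self, params):
        """Single entry point; the tests never reach past this method."""
        action = params.get("action")
        if action == "edit":
            revisions = self._pages.setdefault(params["title"], [])
            revisions.append(params["text"])
            return {"edit": {"result": "Success", "revisions": len(revisions)}}
        if action == "read":
            revisions = self._pages.get(params["title"])
            if revisions is None:
                return {"error": {"code": "missingtitle"}}
            return {"read": {"text": revisions[-1]}}
        return {"error": {"code": "badaction"}}


class ApiLevelTests(unittest.TestCase):
    def setUp(self):
        self.wiki = ToyWiki()

    def test_edit_then_read_returns_latest_text(self):
        self.wiki.api({"action": "edit", "title": "Foo", "text": "first"})
        self.wiki.api({"action": "edit", "title": "Foo", "text": "second"})
        response = self.wiki.api({"action": "read", "title": "Foo"})
        self.assertEqual(response["read"]["text"], "second")

    def test_reading_missing_page_reports_error(self):
        response = self.wiki.api({"action": "read", "title": "Nope"})
        self.assertEqual(response["error"]["code"], "missingtitle")

    def test_unknown_action_reports_error(self):
        response = self.wiki.api({"action": "explode"})
        self.assertEqual(response["error"]["code"], "badaction")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApiLevelTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

If the storage inside `ToyWiki` were replaced by something else entirely, these three tests would still apply unchanged, which is the kind of confidence during refactoring that the answer describes.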
323
00:35:12,420 --> 00:35:26,280
Herald: Other questions?
Q: So you said that you have this legacy
324
00:35:26,280 --> 00:35:32,240
system and eventually you have to move
away from it but are there any, like, I
325
00:35:32,240 --> 00:35:39,820
don't know, plans for the near future to,
I don't know. At some point you have to
326
00:35:39,820 --> 00:35:47,310
cut the current infrastructure to your
extensions and so on and it's a hard cut, I
327
00:35:47,310 --> 00:35:53,330
see. But are there any plans to build it
up from scratch or what are the plans?
328
00:35:53,330 --> 00:35:58,060
Daniel: Yeah, we are not going to rewrite from
scratch - that's a pretty surefire way to
329
00:35:58,060 --> 00:36:05,370
just kill the system. We will have to make
some tough decisions about backwards
330
00:36:05,370 --> 00:36:11,340
compatibility and probably reconsider some
of the requirements and constraints we
331
00:36:11,340 --> 00:36:17,100
have, well, with respect to the platforms
we run on and also the platforms we serve.
332
00:36:17,100 --> 00:36:21,130
One of the things that we have been very
careful to do in the past for instance is
333
00:36:21,130 --> 00:36:26,530
to make sure that you can do pretty much
everything with MediaWiki with no
334
00:36:26,530 --> 00:36:32,800
JavaScript on the client side. And that
requirement is likely to drop. You will
335
00:36:32,800 --> 00:36:40,010
still be able to read of course, without
any JavaScript or anything, but the extent
336
00:36:40,010 --> 00:36:45,910
of functionality you will have without
JavaScript on the client side is likely to
337
00:36:45,910 --> 00:36:51,140
be greatly reduced - that kind of thing.
Also we will probably end up breaking
338
00:36:51,140 --> 00:36:57,660
compatibility with at least some of the
user-created tools. Hopefully we can offer
339
00:36:57,660 --> 00:37:02,390
good alternatives, good APIs, good
libraries that people can actually port
340
00:37:02,390 --> 00:37:11,070
to, that are less brittle. I hope that
will motivate people and maybe repay them
341
00:37:11,070 --> 00:37:15,950
a bit for the pain of having their tool
broken, if we can give them something that
342
00:37:15,950 --> 00:37:21,119
is more stable, more reliable, and
hopefully even nicer to use. Yeah, so,
343
00:37:21,119 --> 00:37:25,930
it's small increments, bits, and pieces
all over the system; there's no, you know,
344
00:37:25,930 --> 00:37:32,550
no great master plan, no big change to
point to really.
345
00:37:32,550 --> 00:37:45,470
Herald: Okay, okay, further questions?
Daniel: I plan to just sit outside here at
346
00:37:45,470 --> 00:37:54,800
the table later if you just want to come
and chat so we can also do that there.
347
00:37:54,800 --> 00:38:01,250
Herald: Okay, so, last call: are there any
other questions? It does not appear so,
348
00:38:01,250 --> 00:38:08,110
so, I'd like to ask for a huge applause for
Daniel for this talk.
349
00:38:08,110 --> 00:38:12,627
Applause
350
00:38:12,627 --> 00:38:14,730
36C3 postroll music
351
00:38:14,730 --> 00:38:38,320
Subtitles created by c3subtitles.de
in the year 2020. Join, and help us!