WEBVTT 00:00:00.000 --> 00:00:20.510 36C3 preroll music 00:00:20.510 --> 00:00:24.750 Daniel: Good morning! I'm glad you all made it here this early on the last day. I 00:00:24.750 --> 00:00:32.439 know it can't be easy, it wasn't easy for me. I have to warn you that the way I 00:00:32.439 --> 00:00:36.160 prepared for this talk is a bit experimental. I didn't make a slide set, I 00:00:36.160 --> 00:00:44.559 just made a mind map and I'll just click through it while I talk to you. So, 00:00:44.559 --> 00:00:51.180 this talk is about modernizing Wikipedia. As you probably have noticed, visiting 00:00:51.180 --> 00:00:58.500 Wikipedia can feel a bit like visiting a website from 10-15 years ago, but before I 00:00:58.500 --> 00:01:05.280 talk about any problems or things to improve, I first want to revisit that the 00:01:05.280 --> 00:01:11.619 software and the infrastructure we built around it has been running Wikipedia 00:01:11.619 --> 00:01:20.160 and its sister sites for the last... well nearly 19 years now and it's extremely 00:01:20.160 --> 00:01:32.200 successful. We serve 17 billion page views a month, yes? 00:01:32.200 --> 00:01:40.870 Person in the audience: Could you make it louder or speak up and also make the image 00:01:40.870 --> 00:01:42.870 bigger? 00:01:42.870 --> 00:01:43.870 inaudible dialogue 00:01:43.870 --> 00:01:45.870 Daniel: Is this better? Like if I speak up I will lose my voice in 10 minutes, it's 00:01:45.870 --> 00:01:55.720 already in it, no it's fine. We have technology for this. I can... the light 00:01:55.720 --> 00:02:05.490 doesn't help, yeah the contrast could be better. Is it better like this? Okay cool. 00:02:05.490 --> 00:02:13.840 All right so yeah we are serving 17 billion page views a month, which is quite 00:02:13.840 --> 00:02:19.560 a lot. Wikipedia exists in about 100 languages. If you attended the talk about 00:02:19.560 --> 00:02:24.250 the Wikimedia infrastructure yesterday, we talked about 300 languages.
We actually 00:02:24.250 --> 00:02:29.989 support 300 languages for localization but we have Wikipedia in about 100, if I'm not 00:02:29.989 --> 00:02:38.689 completely off. I find this picture quite fascinating. This is a visualization of 00:02:38.689 --> 00:02:43.719 all the places in the world that are described on Wikipedia and sister projects 00:02:43.719 --> 00:02:49.319 and I find this quite impressive although it's also a nice display of cultural bias 00:02:49.319 --> 00:03:00.790 of course. We, that is the Wikimedia Foundation, run about 900 to 1,000 wikis 00:03:00.790 --> 00:03:06.680 depending on how you count, but there are many many more MediaWiki installations 00:03:06.680 --> 00:03:11.459 out there, some of them big and many many of them small. We have actually no idea 00:03:11.459 --> 00:03:17.150 how many small instances there are. So it's a very powerful, very flexible and 00:03:17.150 --> 00:03:23.730 versatile piece of software but, you know, sometimes it can feel like... you can do a 00:03:23.730 --> 00:03:28.329 lot of things with it, right, but sometimes it feels like it's a bit 00:03:28.329 --> 00:03:42.180 overburdened and maybe you should look at improving the foundations. So one of the 00:03:42.180 --> 00:03:47.829 things that make MediaWiki great but also sometimes hard to use is that kind of 00:03:47.829 --> 00:03:52.609 everything is text, everything is markup, everything is done with wikitext, 00:03:52.609 --> 00:04:02.529 which has grown in complexity over the years, so if you look at the anatomy of a 00:04:02.529 --> 00:04:09.159 wiki page it can be a bit daunting.
You have different syntax for markup and 00:04:09.159 --> 00:04:16.150 different kinds of transclusion or templates and media and some things 00:04:16.150 --> 00:04:21.739 actually, you know, get displayed in place, some things show up in a completely 00:04:21.739 --> 00:04:26.340 different place on the page, it can be rather confusing and daunting for 00:04:26.340 --> 00:04:31.720 newcomers. And also things like having a conversation, just talking to people, like, 00:04:31.720 --> 00:04:35.540 you know, having a conversation thread looks like this. You open the page, you 00:04:35.540 --> 00:04:40.510 look through the markup and you indent to make a conversation thread and then you 00:04:40.510 --> 00:04:43.480 get confused about the indenting and someone messes with the formatting and 00:04:43.480 --> 00:04:52.120 it's all excellent. There have been many attempts over the years to improve the 00:04:52.120 --> 00:05:00.290 situation, we have things like Echo which notifies you, for instance when someone 00:05:00.290 --> 00:05:09.130 mentions your name or someone... It is also used to welcome people and do this 00:05:09.130 --> 00:05:12.400 kind of achievement unlocked notifications: hey, you did your first 00:05:12.400 --> 00:05:19.900 edit, this is great, welcome! To make people a bit more engaged with the system 00:05:19.900 --> 00:05:24.380 but it's really mostly improvements around the fringes. We have had a system called 00:05:24.380 --> 00:05:31.350 Flow for a while to improve the way conversations work. So you have more like 00:05:31.350 --> 00:05:37.960 a thread structure that the software actually knows about but then there are 00:05:37.960 --> 00:05:42.160 many, well quite a few people who have been around for a while that are very used 00:05:42.160 --> 00:05:46.900 to the manual system and also there's a lot of tools to support this manual system 00:05:46.900 --> 00:05:52.780 which of course are incompatible with making things more modern.
So we use this 00:05:52.780 --> 00:05:56.250 for instance on MediaWiki.org which is a site which is basically a self 00:05:56.250 --> 00:06:03.000 documentation site of MediaWiki but on most Wikipedias this is not enabled or at 00:06:03.000 --> 00:06:14.530 least not used by default everywhere. The biggest attempt to move away from the text 00:06:14.530 --> 00:06:23.050 only approach is Wikidata, which we started in 2012. The idea of Wikidata of 00:06:23.050 --> 00:06:29.580 course, if you didn't attend the many great talks we had about it here over the 00:06:29.580 --> 00:06:36.470 course of the Congress, is a way to basically model the world using structured 00:06:36.470 --> 00:06:45.470 data, using a semantic approach instead of natural language, which has its own 00:06:45.470 --> 00:06:50.740 complexities but at least it's a way to represent the knowledge of the world in a 00:06:50.740 --> 00:06:56.790 way that machines can understand. So this would be an alternative to wikitext but 00:06:56.790 --> 00:07:09.389 still the vast majority of things especially on Wikipedia are just markup. 00:07:09.389 --> 00:07:13.800 And this markup is pretty powerful and there's lots of ways to extend it and to 00:07:13.800 --> 00:07:21.050 do things with it. So a lot of things on MediaWiki are just DIY, just do it 00:07:21.050 --> 00:07:29.250 yourself. Templates are a great example of this. Infoboxes of course, the nice blue 00:07:29.250 --> 00:07:34.730 boxes here on the right side of pages, are done using templates but these templates 00:07:34.730 --> 00:07:39.090 are just for formatting, there is no data processing, there's no database or 00:07:39.090 --> 00:07:47.530 structured data backing them. It's just basically, you know, it's still just 00:07:47.530 --> 00:07:56.630 markup. It's still... you have a predefined layout but you're still feeding it text, not 00:07:56.630 --> 00:08:04.520 data.
You have parameters but the values of the parameters are still again maybe 00:08:04.520 --> 00:08:11.610 templates or links or you have markup in them, like you know HTML line breaks and 00:08:11.610 --> 00:08:18.860 stuff. So it's kind of semi-structured. And this of course is also used to do 00:08:18.860 --> 00:08:24.100 things like workflow. The template... Oh no, this was actually an infobox, wrong 00:08:24.100 --> 00:08:34.229 picture, wrong caption. This is also used to do workflows, so if a page on Wikipedia 00:08:34.229 --> 00:08:39.789 gets nominated for deletion you manually put a template on the page that defines 00:08:39.789 --> 00:08:44.870 why this is supposed to be deleted and then you have to go to a different page 00:08:44.870 --> 00:08:49.390 and put a different template there, giving more explanation and this again is used 00:08:49.390 --> 00:08:55.149 for discussion. It's a lot of structure created by the community and maintained by 00:08:55.149 --> 00:09:02.730 the community, using conventions and tools built on top of what is essentially just a 00:09:02.730 --> 00:09:10.620 pile of markup. And because doing all this manually is kind of painful, early on 00:09:10.620 --> 00:09:17.360 we created a system to allow people to add JavaScript to the site, which is then 00:09:17.360 --> 00:09:27.019 maintained on wiki pages by the community and it can tweak and automate. But again, 00:09:27.019 --> 00:09:30.589 it doesn't really have much to work with, right? It basically messes with whatever 00:09:30.589 --> 00:09:35.470 it can, it directly interacts with the DOM of the page, whenever the layout of the 00:09:35.470 --> 00:09:41.040 software changes, things break. So this is not great for compatibility but it's 00:09:41.040 --> 00:09:54.730 used a lot and it is very important for the community to have this power. Sorry, I 00:09:54.730 --> 00:10:00.110 wish there was a better way to show these pictures.
Okay, that's just to give you an 00:10:00.110 --> 00:10:05.220 idea of what kind of thing is implemented that way and maintained by the community 00:10:05.220 --> 00:10:10.189 on their site. One of the problems we have with that is: these are bound to a wiki 00:10:10.189 --> 00:10:19.410 and I just told you that we run over 900 of these not over 9,000 and it would be 00:10:19.410 --> 00:10:26.300 great if you could just share them between wikis but we can't. And again, there have 00:10:26.300 --> 00:10:30.790 been... we have been talking about it a lot and it seems like it shouldn't be so 00:10:30.790 --> 00:10:36.759 hard, but you kind of need to write these tools differently, if you want to share 00:10:36.759 --> 00:10:39.899 them across sites, because different sites use different conventions, they use 00:10:39.899 --> 00:10:45.529 different templates. Then it just doesn't work and you actually have to write decent 00:10:45.529 --> 00:10:50.970 software that uses internationalization if you want to use it across wikis. While 00:10:50.970 --> 00:10:55.019 these are usually just you know one-off hacks with everything hard-coded we would 00:10:55.019 --> 00:10:58.450 have to put in place an internationalization system and it's 00:10:58.450 --> 00:11:02.910 actually a lot of effort and there's a lot of things that are actually unclear about 00:11:02.910 --> 00:11:15.260 it. So, before I dive more deeply into the different things that will make it hard to 00:11:15.260 --> 00:11:20.529 improve on the current situation and the things that we are doing to improve it do 00:11:20.529 --> 00:11:27.309 we have any questions or do you have any other - do you have any things you may 00:11:27.309 --> 00:11:34.519 find particularly, well, annoying or particularly outdated, when interacting 00:11:34.519 --> 00:11:40.920 with Wikipedia? Any thoughts on that? Beyond what I just said? 
00:11:40.920 --> 00:11:48.769 Microphone: The strict separation, just in Wikipedia, between mobile layout and 00:11:48.769 --> 00:11:54.259 desktop layout. Daniel: Yeah. So, actually having a 00:11:54.259 --> 00:12:02.069 reactive layout system that would just work for mobile and desktop in the same 00:12:02.069 --> 00:12:09.130 way and allowing the designers and UX experts, who work on the system, to just do 00:12:09.130 --> 00:12:15.180 this once and not two or maybe even three times - because of course we also have 00:12:15.180 --> 00:12:20.550 native applications for different platforms - would be great and it's 00:12:20.550 --> 00:12:24.360 something that we're looking into at the moment. But it's not, you know, it's not 00:12:24.360 --> 00:12:29.519 that easy. We could build a completely new system that does this, but then again you 00:12:29.519 --> 00:12:33.249 would be telling people: "You can no longer use the old system", but now they 00:12:33.249 --> 00:12:39.019 have built all these tools that rely on how the old system works and you have to 00:12:39.019 --> 00:12:52.089 port all of this over, so there's a lot of inertia. Any other thoughts? Everyone is 00:12:52.089 --> 00:13:03.720 still asleep, that's excellent. So I can continue. So, another thing that makes it 00:13:03.720 --> 00:13:10.879 difficult to change how MediaWiki works or to improve it is that we are trying to do, 00:13:10.879 --> 00:13:19.180 well, to be at least two things at once: on the one hand we are running a top 5 00:13:19.180 --> 00:13:24.360 website and serving over 100,000 requests per second using the system and on the 00:13:24.360 --> 00:13:30.540 other hand, at least until now, we have always made sure that you can just 00:13:30.540 --> 00:13:33.800 download MediaWiki and install it on a shared hosting platform, you don't even 00:13:33.800 --> 00:13:38.920 need root on the system, right?
You don't even need administrative privileges you 00:13:38.920 --> 00:13:44.769 can just set it up and run it in your web space and it will work. And, having the 00:13:44.769 --> 00:13:51.779 same piece of software do both, run in a minimal environment and run at scale, is 00:13:51.779 --> 00:13:55.040 rather difficult and it also means that there's a lot of things that we can't 00:13:55.040 --> 00:14:02.110 easily do, right? All this modern micro service architecture separate front-end 00:14:02.110 --> 00:14:09.309 and back-end systems, all of that means that it's a lot more complicated to set up 00:14:09.309 --> 00:14:15.720 and needs more knowledge or more infrastructure to set up and so far that 00:14:15.720 --> 00:14:19.500 meant we can't do it, because so far there was this requirement that you should 00:14:19.500 --> 00:14:23.569 really be able to just run it on your shared hosting. And we are currently 00:14:23.569 --> 00:14:29.639 considering to what extent we can continue this, I mean, container based hosting is 00:14:29.639 --> 00:14:34.620 picking up. Maybe this is an alternative it's still unclear but it seems like this 00:14:34.620 --> 00:14:45.999 is something that we need to reconsider. Yeah, but if we make this harder to do 00:14:45.999 --> 00:14:52.739 then a lot of current users of MediaWiki would maybe not, well, maybe no longer 00:14:52.739 --> 00:14:57.230 exist or at least would not exist as they do now, right. You probably have seen 00:14:57.230 --> 00:15:05.259 this nice MediaWiki instance the Congress wiki. 
Which - with a completely customized 00:15:05.259 --> 00:15:09.689 skin and a lot of extensions installed to allow people to define their sessions 00:15:09.689 --> 00:15:14.410 there and making sure these sessions automatically get listed and get put into 00:15:14.410 --> 00:15:20.660 a calendar - this is all done using extensions, like Semantic MediaWiki, that 00:15:20.660 --> 00:15:34.279 allow you to basically define queries in the wiki text markup. Yeah, another thing 00:15:34.279 --> 00:15:42.079 that, of course, slows down development is that Wikimedia does engineering on a, 00:15:42.079 --> 00:15:48.130 well, comparatively a shoestring budget, right? The budget of the Wikimedia 00:15:48.130 --> 00:15:52.199 Foundation, the annual budget is something like a hundred million dollars, that 00:15:52.199 --> 00:15:58.009 sounds like a lot of money, but if you compare it to other companies running a 00:15:58.009 --> 00:16:03.209 top five or top ten website it's like two percent of their budget or something like 00:16:03.209 --> 00:16:10.769 that, right? It's really, I mean, 100 million is not peanuts but compared to 00:16:10.769 --> 00:16:16.699 what other companies invest to achieve this kind of goal it kind of is, so , what 00:16:16.699 --> 00:16:22.230 this budget translates into is something like 300, depending on how you count, 00:16:22.230 --> 00:16:28.800 between three hundred and four hundred staff. So, this is the people who run all 00:16:28.800 --> 00:16:32.189 of this, including all the community outreach all the social aspects all the 00:16:32.189 --> 00:16:40.920 administrative aspects. Less than half of these are the engineers who do all this. 00:16:40.920 --> 00:16:50.989 And we have like, something like 2,500 servers, bare-metal, so, which is not a 00:16:50.989 --> 00:16:57.619 lot for this kind of thing. Which also means that we have to design the software 00:16:57.619 --> 00:17:07.079 to be not just scalable but also quite efficient. 
The modern approach to scaling 00:17:07.079 --> 00:17:11.640 is usually: scale horizontally, make it so you can just spin up another virtual 00:17:11.640 --> 00:17:19.280 machine in some cloud service, but, yeah, we run our own service, we run our own 00:17:19.280 --> 00:17:24.440 servers, so we can design to scale horizontally, but it means ordering 00:17:24.440 --> 00:17:32.070 hardware and setting it up and it's going to take half a year or so. And we don't 00:17:32.070 --> 00:17:38.390 actually have that many people who do this, so, scalability and performance are 00:17:38.390 --> 00:17:49.000 also important factors when designing the software. Okay. Before I dive into what we 00:17:49.000 --> 00:18:03.860 are actually doing - any questions? This one in the back. Wait for the mic, please. 00:18:03.860 --> 00:18:07.330 In the very... Q: Hi! 00:18:07.330 --> 00:18:12.950 Daniel: Hello. Q: So, you said you don't have that many 00:18:12.950 --> 00:18:22.990 people, but how many do you actually have? Daniel: For... it's something like 150 engineers 00:18:22.990 --> 00:18:27.170 worldwide. It always depends on what you count, right? So you count the people, who 00:18:27.170 --> 00:18:32.260 - do you count engineers, who work on the native apps, do you count engineers, who 00:18:32.260 --> 00:18:36.980 work on the Wikimedia cloud services - actually we do have cloud services, we 00:18:36.980 --> 00:18:41.190 offer them to the community to run their own things, but we don't run our stuff on 00:18:41.190 --> 00:18:45.560 other people's cloud. Yeah, so depending on how you count or something and whether 00:18:45.560 --> 00:18:50.210 you count the people working here in Germany for Wikimedia Germany, which is a 00:18:50.210 --> 00:18:57.760 separate organization technically - it's something like 150 engineers. 00:18:57.760 --> 00:19:08.210 Q: Thanks!
Q: I'm interested: What are the reasons 00:19:08.210 --> 00:19:13.880 that you don't run on other people's services, like on the cloud? I mean, then 00:19:13.880 --> 00:19:17.090 it will be easy to scale horizontally, right? 00:19:17.090 --> 00:19:25.330 Daniel: There's, well, one reason is being independent, right? If we, yeah, imagine 00:19:25.330 --> 00:19:32.350 we ran all our stuff on Amazon's infrastructure and then maybe Amazon 00:19:32.350 --> 00:19:38.060 doesn't like the way that the Wikipedia article about Amazon is written - what do 00:19:38.060 --> 00:19:42.050 we do, right? Maybe they shut us down, maybe they make things very expensive, 00:19:42.050 --> 00:19:47.360 maybe they make things very painful for us, maybe there is at least some kind of 00:19:47.360 --> 00:19:54.070 self-censorship mechanism happening and we want to avoid that. There are 00:19:54.070 --> 00:19:58.440 thoughts about this, there are thoughts like maybe we can do this at least for 00:19:58.440 --> 00:20:04.270 development infrastructure and CI, not for production, or maybe we can make it so that 00:20:04.270 --> 00:20:12.200 we run stuff in the cloud services by more than one vendor, so we basically spread 00:20:12.200 --> 00:20:17.860 out so we are not reliant on a single company. We are thinking about these 00:20:17.860 --> 00:20:21.820 things but so far the way to actually stay independent has been to run our own 00:20:21.820 --> 00:20:28.300 servers. Q: You've been talking about scalability 00:20:28.300 --> 00:20:35.490 and changing the architecture, that kind of seems to imply to me that there's a 00:20:35.490 --> 00:20:42.270 problem with scaling at the moment or that it's foreseeable that things are not gonna 00:20:42.270 --> 00:20:46.580 work out if you just keep doing what you're doing at the moment. Can you maybe 00:20:46.580 --> 00:20:52.480 elaborate on that? Daniel: So, there's, I think there's two sides 00:20:52.480 --> 00:20:56.850 to this.
On the one hand the reason I mentioned it is just that a lot of things 00:20:56.850 --> 00:21:01.610 that are really easy to do basically "for me", right? "Works on my machine", are really 00:21:01.610 --> 00:21:08.920 hard to do if you want to do them at scale. That's one aspect. The other aspect 00:21:08.920 --> 00:21:16.670 is MediaWiki is pretty much a PHP monolith and that means scaling it always means 00:21:16.670 --> 00:21:23.680 copying the monolith, rather than breaking it down so you have smaller units that you can 00:21:23.680 --> 00:21:29.040 scale and just say, yeah, I don't know, I need more instances for authentication 00:21:29.040 --> 00:21:33.910 handling or something like that. That would be more efficient, right, because 00:21:33.910 --> 00:21:40.730 you have higher granularity, you can just scale the things that you actually need 00:21:40.730 --> 00:21:47.530 but that of course needs rearchitecting. It's not like things are going to explode 00:21:47.530 --> 00:21:52.910 if we don't do that very soon, it's not, so there's not like an urgent problem 00:21:52.910 --> 00:21:58.400 there. The reason for us to rearchitect is more to gain more flexibility in 00:21:58.400 --> 00:22:03.330 development, because if you have a monolith that is pretty entangled, code 00:22:03.330 --> 00:22:16.130 changes are risky and take a long time. Q: How many people work on product design 00:22:16.130 --> 00:22:25.460 or like user experience research to, like, sit down with users and try to understand 00:22:25.460 --> 00:22:28.440 what their needs are and from there proceed? 00:22:28.440 --> 00:22:33.230 Daniel: Across... I don't have an exact number, something like five. 00:22:33.230 --> 00:22:37.930 Audience: Do you think that's sufficient? Herald: The question was, whether it's 00:22:37.930 --> 00:22:46.800 sufficient. So just... Daniel: Probably not?
But it's more than, 00:22:46.800 --> 00:22:50.310 that's more people than we have for database administration, and that's also 00:22:50.310 --> 00:23:06.040 not sufficient. Herald: Are there further questions? I don't 00:23:06.040 --> 00:23:16.270 think so. Daniel: Okay. So, one of the things that 00:23:16.270 --> 00:23:20.320 holds us back a bit is that there's literally thousands of extensions for 00:23:20.320 --> 00:23:26.870 MediaWiki and the extension mechanism is heavily reliant on hooks, so basically on 00:23:26.870 --> 00:23:39.600 callbacks. And, we have - I don't have a picture, I have a link here - we have a 00:23:39.600 --> 00:23:44.500 great number of these. So, you see, each paragraph is basically documenting one 00:23:44.500 --> 00:23:51.970 callback that you can use to modify the behavior of the software and, I mean, 00:23:51.970 --> 00:23:59.240 there's, I have never counted, but something like a thousand? And all of them 00:23:59.240 --> 00:24:07.520 are of course interfaces to extra - to software that is maintained externally, so 00:24:07.520 --> 00:24:12.611 they have to be kept stable and if you have a large chunk of software that you 00:24:12.611 --> 00:24:16.730 want to restructure but you have a thousand fixed points that you can't 00:24:16.730 --> 00:24:22.960 change, things become rather difficult. It's basi.. yeah, these hook points kind 00:24:22.960 --> 00:24:27.640 of, like, they act like nails in the architecture and then you kind of have to 00:24:27.640 --> 00:24:36.650 wiggle around them - it's fun. We are working to change that. We want to 00:24:36.650 --> 00:24:43.950 architect it so the interface that is exposed to these hooks becomes much 00:24:43.950 --> 00:24:51.360 narrower and the things that these hooks or these callback functions can do are much 00:24:51.360 --> 00:24:58.690 more restricted. There's currently an RFC open for this, has been open for a while 00:24:58.690 --> 00:25:04.690 actually.
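The idea of a narrower hook interface can be illustrated with a small sketch. This is not MediaWiki's code (MediaWiki is written in PHP) and not the proposal in the RFC; the names `HookRegistry`, `PageSavedEvent` and the hook name `page_saved` are hypothetical, chosen only for illustration. The point is that a callback receives a small, read-only value object instead of the application's internal objects, so the internals can be restructured without breaking the callback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PageSavedEvent:
    """Narrow, read-only view handed to callbacks (hypothetical type)."""
    title: str
    author: str
    text: str


Handler = Callable[[PageSavedEvent], None]


class HookRegistry:
    """Maps hook names to the callbacks that extensions registered."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, name: str, handler: Handler) -> None:
        self._handlers.setdefault(name, []).append(handler)

    def run(self, name: str, event: PageSavedEvent) -> None:
        # Callbacks run in registration order and only see the event object.
        for handler in self._handlers.get(name, []):
            handler(event)


# "Extension" code: depends only on the event type, not on core internals.
audit_log: List[str] = []


def log_save(event: PageSavedEvent) -> None:
    audit_log.append(f"{event.author} saved {event.title}")


hooks = HookRegistry()
hooks.register("page_saved", log_save)

# "Core" code: fires the hook after a (pretend) save.
hooks.run("page_saved", PageSavedEvent("Main Page", "Alice", "Hello"))
print(audit_log)
```

Because the event is a frozen dataclass, a callback cannot modify it, which is one way to restrict what a callback is able to do.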
The problem is that in order to assess whether the proposal is actually 00:25:04.690 --> 00:25:11.530 viable you have to survey all the current users of these hooks and make sure that we 00:25:11.530 --> 00:25:15.660 can, the use case is still covered in the new system and, yeah, we have like a 00:25:15.660 --> 00:25:21.030 thousand hook points and we have like a thousand extensions, that's quite a bit of 00:25:21.030 --> 00:25:31.060 work. Another thing that I'm currently working on is establishing a stable 00:25:31.060 --> 00:25:36.990 interface policy. This may sound pretty obvious - it has a lot of pretty obvious 00:25:36.990 --> 00:25:42.430 things like, yeah, if you have a class and there's a public method then that's a 00:25:42.430 --> 00:25:46.410 stable interface, it will not just change without notice, we have a deprecation policy 00:25:46.410 --> 00:25:53.020 and all that. But if you have worked with extensible systems that rely on the 00:25:53.020 --> 00:25:58.350 mechanisms of object-oriented programming, you may have come across the question 00:25:58.350 --> 00:26:05.040 whether a protected method is part of this stable interface of the software or not, 00:26:05.040 --> 00:26:10.010 or maybe the constructor? I don't know, if you have worked in environments that use 00:26:10.010 --> 00:26:15.860 dependency injection the idea is basically that the constructor signature should be 00:26:15.860 --> 00:26:21.270 able to change at any time but then you have extensions that are subclassing and 00:26:21.270 --> 00:26:25.640 things break. So, this is why we are trying to establish a much more 00:26:25.640 --> 00:26:32.750 restrictive stable interface policy, that would make explicit things like 00:26:32.750 --> 00:26:36.650 constructor signatures actually not being stable and that gives us a lot more wiggle 00:26:36.650 --> 00:26:51.030 room to restructure the software.
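The constructor question can be made concrete with a small sketch. It is an illustration under assumptions, not MediaWiki code: `PageRenderer`, `ServiceContainer` and `ShoutingRenderer` are hypothetical names. Core code owns the constructor and may add dependencies to it at any time; extension code obtains instances from a container and wraps them, relying only on the public `render` method, so a changed constructor signature does not break it.

```python
from typing import Dict, Tuple


class PageRenderer:
    """Core service. Its constructor is NOT part of the stable interface."""

    def __init__(self, language: str, cache: Dict[Tuple[str, str], str]) -> None:
        # Core may add or reorder constructor parameters in a later release.
        self._language = language
        self._cache = cache

    def render(self, title: str) -> str:
        """Public method: this is the part extensions may rely on."""
        key = (self._language, title)
        if key not in self._cache:
            self._cache[key] = f"[{self._language}] {title}"
        return self._cache[key]


class ServiceContainer:
    """The only place that knows how core services are constructed."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], str] = {}

    def page_renderer(self) -> PageRenderer:
        return PageRenderer("en", self._cache)


class ShoutingRenderer:
    """Extension code: wraps the service instead of subclassing it."""

    def __init__(self, inner: PageRenderer) -> None:
        self._inner = inner

    def render(self, title: str) -> str:
        return self._inner.render(title).upper()


services = ServiceContainer()
renderer = ShoutingRenderer(services.page_renderer())
print(renderer.render("Main Page"))  # prints "[EN] MAIN PAGE"
```

Had `ShoutingRenderer` subclassed `PageRenderer` and called its constructor directly, any new constructor parameter in core would have broken the extension, which is the situation described above.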
MediaWiki itself has grown as a software 00:26:51.030 --> 00:26:58.750 for the last 18 years or so and, at least in the beginning, was mostly created by 00:26:58.750 --> 00:27:06.330 volunteers. And in a monolithic architecture there's a great tendency to 00:27:06.330 --> 00:27:11.070 just, you know, find and grab the thing that you want to use and just use it. 00:27:11.070 --> 00:27:19.100 Which leads to, well, structures like this one: everything depends on everything. And 00:27:19.100 --> 00:27:26.360 if you change one bit of code everything else may or may not break. And with, yeah. 00:27:26.360 --> 00:27:31.350 And if you don't have great test coverage at the same time this just makes it so 00:27:31.350 --> 00:27:35.312 that any change becomes very risky and you have to do a lot of manual testing a lot 00:27:35.312 --> 00:27:43.690 of manual digging around, touching a lot of files and we are for the last year, 00:27:43.690 --> 00:27:50.510 year and a half we have started a concerted effort to tie the worst - to cut 00:27:50.510 --> 00:27:57.760 the worst ties, to decouple these things that are, basically that have most impact 00:27:57.760 --> 00:28:03.320 there's a few objects in the software that rep... - for instance one that represents 00:28:03.320 --> 00:28:08.280 the user and one that represents a title that are used everywhere and the way 00:28:08.280 --> 00:28:14.240 they're implemented currently also means that they depend on everything and that of 00:28:14.240 --> 00:28:29.620 course is not a good situation. On a, well, a similar idea on a higher level is 00:28:29.620 --> 00:28:34.400 decomposition of the software so the decoupling was about the software 00:28:34.400 --> 00:28:39.990 architecture this is about the system architecture breaking up the 00:28:39.990 --> 00:28:45.490 monolith itself into multiple services that serve different purposes. The specifics of 00:28:45.490 --> 00:28:50.281 this diagram are not really relevant to this talk. 
This is more to, you know, give 00:28:50.281 --> 00:28:57.710 you an impression of the complexity and the sort of work we are doing there. The 00:28:57.710 --> 00:29:05.580 idea is that perhaps we could split out certain functionality into its own service, 00:29:05.580 --> 00:29:11.160 into a separate application, like maybe move all the search functionality into 00:29:11.160 --> 00:29:17.150 something separate and self-contained, but then the question is how do you, again, 00:29:17.150 --> 00:29:23.280 compose this into the final user interface - at some point these things have to get 00:29:23.280 --> 00:29:28.420 composed together again - and again this is a very trivial issue if you 00:29:28.420 --> 00:29:32.470 only want this to work on your machine or you only need to serve a 00:29:32.470 --> 00:29:39.680 hundred users or something. But doing this at scale, doing it at the rate of something 00:29:39.680 --> 00:29:45.230 like 10,000 page views a second, I said a hundred thousand requests earlier but that 00:29:45.230 --> 00:29:51.790 includes resources, icons, CSS and all that. So, yeah, then you have to think 00:29:51.790 --> 00:29:58.470 pretty hard about what you can cache and - thank you - how you can recombine things 00:29:58.470 --> 00:30:02.760 without having to recompute everything and this is something that we are currently 00:30:02.760 --> 00:30:08.580 looking into - coming up with an architecture that allows us to compose and 00:30:08.580 --> 00:30:23.220 recombine the output of different background services. Okay. Before I 00:30:23.220 --> 00:30:27.600 started this talk I said I would probably roughly use half of my time going through 00:30:27.600 --> 00:30:33.310 the presentation and I guess I just hit that spot on. So, this is all I have 00:30:33.310 --> 00:30:41.070 prepared but I'm happy to talk to you more about the things I said or maybe any other 00:30:41.070 --> 00:30:48.050 aspects of this that you may be interested in.
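The composition problem described above, caching the output of separate services and recombining it without recomputing everything, can be sketched in a few lines. This is a toy model under assumptions, not the architecture being designed: the two "services" are plain functions, and `FragmentCache` and `compose_page` are hypothetical names. Each fragment is cached under its own key, so invalidating one fragment leaves the others untouched.

```python
from typing import Callable, Dict, Tuple

# Counters that show how often each (pretend) backend service was asked to render.
render_calls = {"content": 0, "sidebar": 0}


def render_content(title: str) -> str:
    render_calls["content"] += 1
    return f"<article>{title}</article>"


def render_sidebar(_key: str) -> str:
    render_calls["sidebar"] += 1
    return "<nav>search | random</nav>"


class FragmentCache:
    """Caches each service's output separately, keyed by (service, key)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get(self, service: str, key: str, produce: Callable[[str], str]) -> str:
        cache_key = (service, key)
        if cache_key not in self._store:
            self._store[cache_key] = produce(key)
        return self._store[cache_key]

    def invalidate(self, service: str, key: str) -> None:
        self._store.pop((service, key), None)


def compose_page(cache: FragmentCache, title: str) -> str:
    # The sidebar does not depend on the page, so it shares one cache entry.
    content = cache.get("content", title, render_content)
    sidebar = cache.get("sidebar", "global", render_sidebar)
    return f"<main>{content}{sidebar}</main>"


cache = FragmentCache()
compose_page(cache, "Main Page")
compose_page(cache, "Main Page")          # served entirely from cache
cache.invalidate("content", "Main Page")  # e.g. the page was edited
page = compose_page(cache, "Main Page")   # only the content is re-rendered
print(render_calls)  # prints {'content': 2, 'sidebar': 1}
```

At the request rates mentioned in the talk, the hard part is choosing the cache keys and invalidation rules; the sketch only shows why fragment-level granularity avoids recomputing unchanged parts.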
Any comments or questions? Oh! 00:30:48.050 --> 00:30:56.800 Three already. Q: First of all thanks a lot for the 00:30:56.800 --> 00:31:03.150 presentation, such a really interesting case of a legacy system and thanks for the 00:31:03.150 --> 00:31:10.130 honesty. It was really interesting as a, you know, software engineer to see how 00:31:10.130 --> 00:31:15.101 that works. I have a question about decoupling, so, I mean, I kind of, you 00:31:15.101 --> 00:31:23.190 have like, probably your system is enormous and how do you find, so to say, 00:31:23.190 --> 00:31:29.100 the most evil, you know, parts which sort of have to be decoupled. Do you use other 00:31:29.100 --> 00:31:34.820 software, with, you know, like, metrics and stuff or do you just know, 00:31:34.820 --> 00:31:38.370 kind of intuitively.. Daniel: Yeah, it's actually, this is quite 00:31:38.370 --> 00:31:44.970 interesting and maybe I can, maybe we can talk about it a bit more in depth later. 00:31:44.970 --> 00:31:49.020 Very quickly: it's a combination. On the one hand you just have the anecdotal 00:31:49.020 --> 00:31:53.280 experience of what is actually annoying when you work with the software and you 00:31:53.280 --> 00:31:59.111 try to fix it and on the other hand I try to find good tooling for this and the 00:31:59.111 --> 00:32:05.440 existing tooling tends to die when you just run it against our code base. So, one 00:32:05.440 --> 00:32:09.930 of the things that you are looking for are cyclic dependencies but the number of 00:32:09.930 --> 00:32:15.080 possible cycles in a graph grows exponentially with the number of nodes. And 00:32:15.080 --> 00:32:17.710 if you have a pretty tightly knit graph that number quickly goes into the 00:32:17.710 --> 00:32:26.580 millions. And, yeah, the tool just goes to 100% CPU and never returns.
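One common way around the blow-up described above is to avoid enumerating cycles at all and instead compute strongly connected components, which takes time linear in the number of nodes and edges. The sketch below uses Tarjan's algorithm for that. It is a stand-in technique, not necessarily the heuristic the speaker used, and the dependency graph is made up for illustration (the class names echo the user and title objects mentioned in the talk, but the edges are invented).

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Tarjan's algorithm: groups nodes that can all reach each other."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    result: List[List[str]] = []

    def visit(node: str) -> None:
        index[node] = lowlink[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, []):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack.
            component: List[str] = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            result.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return result


def tangles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Components with more than one member: each member lies on some cycle."""
    return [c for c in strongly_connected_components(graph) if len(c) > 1]


# Hypothetical "class A depends on class B" edges.
deps = {
    "User": ["Title", "Database"],
    "Title": ["User", "Language"],
    "Language": ["Title"],
    "Database": [],
}
print(tangles(deps))  # prints [['Language', 'Title', 'User']]
```

The size of each tangle is a cheap proxy for how entangled a part of the code is, without counting individual cycles. The recursive form is fine for a sketch; a very large graph would need an iterative version to stay within Python's recursion limit.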
So, I spent 00:32:26.580 --> 00:32:33.600 quite a bit of time trying to find heuristics to get around that - it was a lot 00:32:33.600 --> 00:32:41.550 of fun. I can, yeah, we can talk about that later, if you like. Okay, thanks. 00:32:41.550 --> 00:32:49.221 Q: So what exactly is this Wikidata you mentioned before? Is it like an extension 00:32:49.221 --> 00:32:55.580 or is it a completely different project? Daniel: Wiki - so there's an extension called 00:32:55.580 --> 00:33:04.630 Wikibase, that implements this, well I would say, ontological modeling interface 00:33:04.630 --> 00:33:11.980 for MediaWiki and that is used to run a website called Wikidata which has 00:33:11.980 --> 00:33:19.500 something like 30 million items modeled that describe the world and serve as a 00:33:19.500 --> 00:33:25.610 machine-readable data back-end to other wiki projects, other Wikimedia projects. 00:33:25.610 --> 00:33:32.890 Yeah, I used to work on that project for Wikimedia Germany. I moved on to do 00:33:32.890 --> 00:33:41.150 different things now for a couple of years. Lukas here in front is probably the 00:33:41.150 --> 00:33:50.190 person most knowledgeable about the latest and greatest in the Wikidata development. 00:33:50.190 --> 00:33:56.240 Q: You've briefly talked about test coverage. I would be interested.. 00:33:56.240 --> 00:33:58.650 Daniel: Sorry? Q: You talked about test coverage. 00:33:58.650 --> 00:34:02.010 Daniel: Yes. Q: I would be interested in if you amped 00:34:02.010 --> 00:34:07.660 your efforts to help you modernize it and how your current situation is with test 00:34:07.660 --> 00:34:11.809 coverage. Daniel: Test coverage for MediaWiki core is below 00:34:11.809 --> 00:34:21.809 50%. In some parts it's below 10%, which is very worrying. 
One thing that we started 00:34:21.809 --> 00:34:30.050 to look into, like half a year ago, is: instead of writing unit tests for all the 00:34:30.050 --> 00:34:36.010 code that we actually want to throw away, before we touch it, we try to improve 00:34:36.010 --> 00:34:40.900 the test coverage using integration tests on the API level. So we are currently in 00:34:40.900 --> 00:34:48.240 the process of writing a suite of tests, not just for the API modules, but for all 00:34:48.240 --> 00:34:54.540 the functionality, all the application logic behind the API. And that will 00:34:54.540 --> 00:35:01.070 hopefully cover most of the relevant code paths and will give us confidence when we 00:35:01.070 --> 00:35:12.420 refactor the code. Q: Thanks. 00:35:12.420 --> 00:35:26.280 Herald: Other questions? Q: So you said that you have this legacy 00:35:26.280 --> 00:35:32.240 system and eventually you have to move away from it but are there any, like, I 00:35:32.240 --> 00:35:39.820 don't know, plans for the near future to, I don't know. At some point you have to 00:35:39.820 --> 00:35:47.310 cut the current infrastructure to your extensions and so on and it's a hard cut, I 00:35:47.310 --> 00:35:53.330 see. But are there any plans to build it up from scratch or what are the plans? 00:35:53.330 --> 00:35:58.060 Daniel: Yeah, we are not going to rewrite from scratch - that's a pretty sure-fire way to 00:35:58.060 --> 00:36:05.370 just kill the system. We will have to make some tough decisions about backwards 00:36:05.370 --> 00:36:11.340 compatibility and probably reconsider some of the requirements and constraints we 00:36:11.340 --> 00:36:17.100 have, well, with respect to the platforms we run on and also the platforms we serve. 
00:36:17.100 --> 00:36:21.130 One of the things that we have been very careful to do in the past, for instance, is 00:36:21.130 --> 00:36:26.530 to make sure that you can do pretty much everything with MediaWiki with no 00:36:26.530 --> 00:36:32.800 JavaScript on the client side. And that requirement is likely to drop. You will 00:36:32.800 --> 00:36:40.010 still be able to read, of course, without any JavaScript or anything, but the extent 00:36:40.010 --> 00:36:45.910 of functionality you will have without JavaScript on the client side is likely to 00:36:45.910 --> 00:36:51.140 be greatly reduced - that kind of thing. Also we will probably end up breaking 00:36:51.140 --> 00:36:57.660 compatibility with at least some of the user-created tools. Hopefully we can offer 00:36:57.660 --> 00:37:02.390 good alternatives, good APIs, good libraries that people can actually port 00:37:02.390 --> 00:37:11.070 to, that are less brittle. I hope that will motivate people and maybe repay them 00:37:11.070 --> 00:37:15.950 a bit for the pain of having their tool broken, if we can give them something that 00:37:15.950 --> 00:37:21.119 is more stable, more reliable, and hopefully even nicer to use. Yeah, so, 00:37:21.119 --> 00:37:25.930 it's small increments, bits and pieces all over the system; there's no, you know, 00:37:25.930 --> 00:37:32.550 no great master plan, no big change to point to really. 00:37:32.550 --> 00:37:45.470 Herald: Okay, okay, further questions? Daniel: I plan to just sit outside here at 00:37:45.470 --> 00:37:54.800 the table later if you just want to come and chat so we can also do that there. 00:37:54.800 --> 00:38:01.250 Herald: Okay, so, last call: are there any other questions? It does not appear so, 00:38:01.250 --> 00:38:08.110 so, I'd like to ask for a huge applause for Daniel for this talk. 
00:38:08.110 --> 00:38:12.627 Applause 00:38:12.627 --> 00:38:14.730 36C3 postroll music 00:38:14.730 --> 00:38:38.320 Subtitles created by c3subtitles.de in the year 2020. Join, and help us!