Music

Herald: Hi! Welcome, welcome to the Wikipaka-WG, to this extremely crowded Esszimmer. I'm Jakob, I'm your Herald for tonight until 10:00, and I'm here to welcome you and to welcome these three wonderful guys on the stage. They're going to talk about the infrastructure of Wikipedia. They are Lucas, Amir, and Daniel, and I hope you'll have fun!

Applause

Amir Sarabadani: Hello, my name is Amir. I'm a software engineer at Wikimedia Deutschland, which is the German chapter of the Wikimedia Foundation; the Wikimedia Foundation runs Wikipedia. Here is Lucas — Lucas is also a software engineer at Wikimedia Deutschland — and Daniel here is a software architect at the Wikimedia Foundation. We are all based in Germany, Daniel in Leipzig, the two of us in Berlin. Today we want to talk about how we run Wikipedia using donors' money, without lots of advertising and without collecting data.

In this talk we first take an inside-out approach: we start with the application layer and then move to the outer layers. Then we switch to an outside-in approach and talk about what happens when you hit Wikipedia from the outside.

First of all, let me give you some background. All of the Wikipedia infrastructure is run by the Wikimedia Foundation, an American non-profit charitable organization. We don't run any ads, and we are only 370 people. If you count Wikimedia Deutschland and all the other chapters, it's around 500 people in total — nothing compared to the big companies out there. But all of the content is managed by volunteers; even our staff don't add content to Wikipedia. We support 300 languages, which is a very large number, and Wikipedia is eighteen years old, so it can vote now.

Also, Wikipedia has some really, really weird articles. I want to ask you: have you encountered any really weird article on Wikipedia? My favorite is the list of people who died on the toilet. If you know one, raise your hand. Do you know any weird articles on Wikipedia?

Daniel Kinzler: Oh, the classic one…

Amir: You need to unmute yourself. Oh, okay.

Daniel: This is technology, I don't know anything about technology. OK, no. My favorite example is "people killed by their own invention". That's a lot of fun — look it up, it's amazing.

Lucas Werkmeister: There is also a list of prison escapes using helicopters. I almost said helicopter escapes using prisons, which doesn't make any sense. But that is also a very interesting list.

Daniel: I think we also have a category of lists of lists of lists.

Amir: That's a page.

Lucas: And every few months someone thinks it's funny to redirect it to Russell's paradox or so.

Daniel: Yeah.

Amir: Besides that, people cannot read Wikipedia in Turkey or China. Three days ago, actually, the block in Turkey was ruled unconstitutional, but it hasn't been lifted yet — hopefully they will lift it soon. And Wikimedia is not just Wikipedia; there are lots and lots of projects. Some of them are not as successful as Wikipedia, like Wikinews. Wikipedia is the most successful one, and then there is Wikidata, which is developed by Wikimedia Deutschland — the Wikidata team, including Lucas. It's used in infoboxes, and it holds the data that Wikipedia, the Google Knowledge Graph, Siri or Alexa use.
It's basically sort of the backbone of all that data across the whole Internet.

So, our infrastructure. First of all, our infrastructure is all open source. As a matter of principle, we never use any commercial software. We could use a lot of things — sometimes they were even offered to us for free — but we refused to use them. Second, we have two primary data centers, for failovers: when, for example, a whole data center goes offline, we can fail over to the other one. We have three caching points of presence, CDN sites, all over the world — it is our own CDN. We don't use Cloudflare, because we care about the privacy of our users, and that is very important for people who edit from countries where editing Wikipedia might be dangerous for them. So we really care about keeping that data as protected as possible.

Applause

Amir: We have 17 billion page views per month, which goes up and down with the season and everything, and we have around 100 to 200 thousand requests per second. That is different from page views, because requests can be requests for objects, API calls, lots of things. We get 300,000 new editors per month, and we run all of this with 1300 bare-metal servers. Now Daniel is going to talk about the application layer and the inside of the infrastructure.

Daniel: Thanks, Amir. Oh, the clicky thing. Thank you. So the application layer is basically the software that actually does what a wiki does, right? It lets you edit pages, create or update pages, and then serves the page views. (interference noise) The challenge for Wikipedia, of course, is serving the many page views that Amir just described. The core of the application is a classic LAMP application. (interference noise) I have to stop moving. Yes? Is that it? It's a classic LAMP-stack application: it's written in PHP, it runs on an Apache server, and it uses MySQL as the database in the backend. We used to use HHVM instead of the… yeah, we…

Herald: Here. Sorry. Take this one.

Daniel: Hello. We used to use HHVM as the PHP engine, but we just switched back to mainstream PHP — we're using PHP 7.2 now — because Facebook decided that HHVM was going to be incompatible with the standard and they were basically just developing it for themselves.

Right. So we have separate clusters of servers for serving different requests: page views on the one hand, and also handling edits. Then we have a cluster for handling API calls, and we have a bunch of servers set up to handle asynchronous jobs, things that happen in the background — the job runners. Video scaling is a very obvious example of that; it just takes too long to do it on the fly. But we use it for many other things as well.

MediaWiki is kind of an amazing thing, because you can just install it on your own ten-bucks-a-month shared-hosting webspace and it will run — but you can also use it to, you know, serve half the world. So it's a very powerful and versatile system, and this wide span of different applications also creates problems. That's something I will talk about tomorrow. But for now, let's look at the fun things. If you want to serve a lot of page views, you have to do a lot of caching, and so we have a whole set of different caching systems. The most important one is probably the parser cache.
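The pattern behind the parser cache Daniel is about to describe is the classic cache-aside lookup: try the cache first and only re-render on a miss. Here is a minimal sketch of that idea, assuming a memcached-like client with get/set — illustrative pseudocode, not MediaWiki's actual implementation:

```python
# Minimal cache-aside sketch of the parser-cache idea. The cache client and
# render function are stand-ins, not MediaWiki's real classes.
from typing import Callable, Optional

class ParserCacheSketch:
    def __init__(self, cache, render: Callable[[str, int], str], ttl: int = 24 * 3600):
        self.cache = cache      # any memcached-like client exposing get()/set()
        self.render = render    # the expensive step: wikitext + templates -> HTML
        self.ttl = ttl          # retention is a tunable; the talk describes the real cache as semi-persistent

    def get_html(self, title: str, revision: int) -> str:
        key = f"parser-cache:{title}:{revision}"
        html: Optional[str] = self.cache.get(key)
        if html is not None:
            return html                       # cache hit: no parsing needed
        html = self.render(title, revision)   # cache miss: parse, then store
        self.cache.set(key, html, self.ttl)
        return html
```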
So, as you probably know, wiki pages are written in a markup language, wikitext, and they need to be parsed and turned into HTML. The result of that parsing is, of course, cached, and that cache is semi-persistent — nothing really ever drops out of it. It's a huge thing, and it lives in a dedicated MySQL database system. We also use memcached a lot, for all kinds of miscellaneous things, anything that we need to keep around and share between server instances. And we have been using Redis for a while, for anything that we want to have available not just between different servers but also between different data centers, because Redis is a bit better at synchronizing things between different systems. We still use it for session storage especially, though we are about to move away from that and will be using Cassandra for session storage.

We have a bunch of additional services running for specialized purposes, like scaling images or rendering math formulas. ORES is pretty interesting: ORES is a system for automatically detecting vandalism and rating edits. It's a machine-learning-based system for detecting problems and highlighting edits that may not be great and need more attention. We have some additional services that process our content for consumption on mobile devices, chopping pages up into bits and pieces that can then be consumed individually, and many, many more. In the background we also have to manage events: we use Kafka for message queuing, and we use that to notify different parts of the system about changes. On the one hand we use it to feed the job runners I just mentioned, but we also use it, for instance, to purge entries in the CDN when pages get updated, and things like that.

OK, the next section is going to be about the databases. Very quickly — we will have quite a bit of time for discussion afterwards — but are there any questions right now about what we've said so far? Everything extremely crystal clear? OK, no clarity is left? I see. Oh, one question, in the back.

Q: Can you maybe turn the volume up a little bit? Thank you.

Daniel: Yeah. I think this is your section, right? Oh, it's Amir again. Sorry.

Amir: So I want to talk about my favorite topic, the dungeon of every production system: databases. The database setup of Wikipedia is really interesting and complicated in its own right. We use MariaDB; we switched from MySQL in 2013 for lots of complicated reasons. As I said, because we are really open source, you can not only go and check our database tree, which shows how it looks and which servers are the replicas and masters — you can actually even query Wikipedia's database live. You can just go to that address, log in with your Wikipedia account, and query whatever you want. It was a funny thing: a couple of months ago someone sent me a message like, "oh, I found a security issue, you can just query Wikipedia's database", and I was like, no, we let this happen on purpose. It's sanitized — we removed the password hashes and everything — but you can still use it.

As for how the database clusters work: because the data got too big, it was first split up, and now we have "sections", which are basically different clusters. Really large wikis have their own section: for example, English Wikipedia is s1, and German Wikipedia together with two or three other small wikis is in s5.
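As a toy illustration of those sections: routing a query means looking up the wiki's section and then picking the master (for writes) or one of its replicas (for reads). The hostnames below are made up for illustration; the real mapping is part of the public configuration Amir mentions.

```python
# Toy sketch of routing a query to the right database section. The hostnames
# are made up; only enwiki -> s1 and dewiki -> s5 follow the talk.
# Writes go to the section master, reads are spread over the replicas.
import random

SECTION_BY_WIKI = {
    "enwiki": "s1",   # English Wikipedia has a section of its own
    "dewiki": "s5",   # German Wikipedia shares s5 with a few smaller wikis
}

DB_SERVERS = {
    "s1": {"master": "db-s1-master.example", "replicas": ["db-s1-r1.example", "db-s1-r2.example"]},
    "s5": {"master": "db-s5-master.example", "replicas": ["db-s5-r1.example", "db-s5-r2.example"]},
}

def pick_db(wiki: str, write: bool = False) -> str:
    """Return the database host a query for this wiki should go to."""
    servers = DB_SERVERS[SECTION_BY_WIKI[wiki]]
    return servers["master"] if write else random.choice(servers["replicas"])

print(pick_db("enwiki"))              # some s1 replica
print(pick_db("dewiki", write=True))  # the s5 master
```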
Wikidata is on s8, and so on. Each section has a master and several replicas — but one of the replicas is actually the master in the other data center, because of the failover I told you about, so there are basically two layers of replication.

What I've been telling you about is the metadata. For the wikitext itself we need a completely different set of databases, but there we use consistent hashing, so we can scale it horizontally and just add more databases. I don't know if you're aware of it, but Wikipedia stores every edit: the wikitext of every revision in the whole history is in the database. We also have the parser cache that Daniel explained, and the parser cache also uses consistent hashing, so we can scale it horizontally too. For the metadata it's slightly more complicated, because the metadata is what's used to render the page.

So, this is, for example, a very short version of the database tree I showed you — you can go and look at the other ones, but this is s1. s1 in eqiad, the main data center: the master is this number, and it replicates to some of these. And the second one, the one numbered in the 2000s because it's in the second data center, is the master of the other data center, with its own replication there. So there is cross-data-center replication, because the master data center is in Ashburn, Virginia, and the second data center is in Dallas, Texas, and that replication happens over TLS, to make sure that no one can listen in between the two.

We also have snapshots, and even dumps of the whole history of Wikipedia. You can go to dumps.wikimedia.org and download the whole history of every wiki you want, except the parts we had to remove for privacy reasons — plus lots and lots of backups; I recently realized we have a lot of backups. In total it's 570 TB of data on 150 database servers, they get around 350,000 queries per second, and altogether it takes 70 terabytes of RAM.

We also have another storage system called Elasticsearch, which, as you can guess, is used for search — the search box at the top right, if you're using desktop; it's different on mobile, I think, and it also depends on whether you're using an RTL language. It is run by a team called Search Platform, and since none of us is on the Search Platform team, we can't explain it in much detail; we only roughly know how it works.

We also have media storage for all the free pictures that are uploaded to Wikimedia. Commons is our wiki that holds all the free media — for example, there is a category on Commons called "cats looking at left" and a category "cats looking at right" — so we have lots and lots of images. It's 390 terabytes of media, about 1 billion objects, and it uses Swift. Swift is the object storage component of OpenStack, and it has several layers of caching, frontend and backend. Yeah, that's mostly it.
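Before the traffic part, here is a minimal sketch of the consistent hashing Amir mentioned for the wikitext storage and the parser cache: each key is placed on a hash ring and stored on the next node along the ring, so adding or removing a node only moves a small share of the keys. Node names are illustrative, and the real clusters use dedicated client libraries with virtual nodes for better balance.

```python
# Minimal consistent-hashing ring. Illustrative only: node names are made up,
# and production setups add many virtual nodes per server to even out load.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # first node clockwise from the key's position, wrapping around
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["pc-node1", "pc-node2", "pc-node3"])
print(ring.node_for("parser-cache:Barack_Obama:12345"))  # always the same node
```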
Now we want to talk about traffic. This picture is from 1967, when Sweden switched from driving on the left to driving on the right — and that is basically what happens in the Wikipedia infrastructure as well. We have five caching sites; the most recent one is eqsin, which is in Singapore. Three of them are caching-only sites — ulsfo, codfw, esams and eqsin — sorry: ulsfo, esams and eqsin are just CDNs. We also have two network points of presence, one in Chicago and one in Amsterdam as well, but we won't get into that.

So, as I said, we have our own content delivery network. Traffic allocation is done by GeoDNS, which is actually written and maintained by one of our traffic people, and with it we can pool and depool data centers. It has a time to live of 10 minutes, so if a data center goes down, it takes up to 10 minutes for the depooling to propagate, and the same again for repooling. We use LVS as the load balancer — a layer 3/4 load balancer for Linux that supports consistent hashing — and we grew so big that we needed something to manage the load balancers, so we wrote our own system, called PyBal. Also, lots of companies peer with us directly; for example, we connect directly to AMS-IX in Amsterdam.

So this is how the caching works — there are lots of reasons for this setup, let's just get started. We use TLS; we support TLS 1.2. The first layer is nginx-. Does anyone know what nginx- means? …That's related, but not quite correct. There is nginx, which is the free version, and nginx plus, which is the commercial version. But we don't use nginx for load balancing or anything like that — we stripped everything out of it and only use it for TLS termination — so we call it "nginx minus"; it's an internal joke.

Then we have the Varnish frontend. Varnish is also a caching layer: the frontend lives in memory, which is very, very fast, and the backend lives on storage, on hard disks, which is slower. The fun thing is that the CDN caching layer alone handles 90% of our requests: the request just reaches Varnish and gets a response, and only when that doesn't work does it go through to the application layer. Varnish has a TTL of 24 hours, but entries also get invalidated by the application: if someone edits an article, the CDN actually purges the result.

And the thing is, the frontend layer is spread by request: the load balancer just sends your request randomly to some frontend. But if the frontend can't find the object, it sends it to a backend, and the backends are — how is it called? — hashed by request, so, for example, the article on Barack Obama is only ever served from one node in that data center's CDN. If none of this works, the request actually hits the other data center.

So, yeah, I actually explained all of this already. We have two caching clusters: one is called "text" and the other one is called "upload" — not confusing at all. If you want to see it, you can just run mtr en.wikipedia.org, and the end node is text-lb.wikimedia.org, which is our text cluster; if you go to upload.wikimedia.org, you hit the upload cluster.
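To make the frontend/backend split concrete: the load balancer spreads requests randomly over the in-memory frontends, while each URL is hashed to a single disk-backed backend per caching site. A rough sketch, with made-up hostnames and a simple modulo hash standing in for the real hashing director:

```python
# Rough sketch of the two-tier cache routing: random frontend, URL-hashed
# backend. Hostnames are made up; the modulo hash stands in for the real
# hashing director used in the CDN configuration.
import hashlib
import random

FRONTENDS = ["cp-fe1", "cp-fe2", "cp-fe3"]           # in-memory caches
BACKENDS = ["cp-be1", "cp-be2", "cp-be3", "cp-be4"]  # disk-backed caches

def pick_frontend() -> str:
    # any frontend will do; the load balancer picks one at random
    return random.choice(FRONTENDS)

def pick_backend(url: str) -> str:
    # the same URL always maps to the same backend within a site
    digest = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]

url = "https://en.wikipedia.org/wiki/Barack_Obama"
print(pick_frontend())    # varies from request to request
print(pick_backend(url))  # always the same node for this article
```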
So far, so good, but this setup has lots of problems, because a) Varnish is open core: the version we use is open source — we don't use the commercial one — but the open-source one doesn't support TLS. What? What happened? Okay. No, no, no! You're not supposed to see this yet. Okay, sorry — huh? Okay, okay, sorry. So Varnish has lots of problems: it is open core and it doesn't support TLS termination, which forces us to have this nginx- system just for TLS termination and makes our setup complicated. It also doesn't behave well over long uptimes, which forces us to have a cron job that restarts every Varnish node twice a week — which is embarrassing. And on the other end, when the Varnish backend wants to talk to the application layer, it also doesn't support TLS termination, so we use IPsec there, which is even more embarrassing, but we are changing it.

We are moving to Apache Traffic Server (ATS), which is very, very nice and fully open source — a proper Apache Foundation project. ATS does the TLS termination; for now the Varnish frontend still exists, but the backend is also going to change to ATS, so we call this the "ATS sandwich": two layers of ATS with Varnish in the middle. The good thing is that once TLS termination moves to ATS, we can actually use TLS 1.3, which is more modern, more secure and even quite a bit faster — it basically shaves 100 milliseconds off every request that goes to Wikipedia, and that translates to centuries of our users' time every month. The ATS migration is ongoing and hopefully it will go live soon. So this is the new version, and, as I said, once this is done we can also use TLS instead of IPsec for the traffic between the data centers. Yes. And now it's time for Lucas to talk about what happens when you type in en.wikipedia.org.

Lucas: Yes, this makes sense, thank you. So, first of all, the image you see on the slide here doesn't really have anything to do with what happens when you type in wikipedia.org, because it's an offline Wikipedia reader, but it's a nice image.

This is basically a summary of everything they already said. If — which is the most common case — you are lucky and request a URL which is cached, then first your computer asks for the IP address of en.wikipedia.org. That request reaches our GeoDNS daemon, and because we're at Congress here, it tells you the closest data center is the one in Amsterdam, esams. Your request hits the edge — what we call the load balancers/routers there — goes through TLS termination in nginx-, and then hits the Varnish caching servers, either frontend or backend, and then you get a response. That's already it, and nothing else is ever bothered; it doesn't even reach any other data center, which is very nice. That's the roughly 90% of requests we mentioned.

If you're unlucky and the URL you requested is not in the Varnish cache in the Amsterdam data center, it gets forwarded to the eqiad data center, which is the primary one. There it still has a chance to hit the cache, and perhaps this time it's there; then the response also gets cached in the Amsterdam Varnish, you get your response, and we still don't have to run any application code.

If we do have to hit the application layer, Varnish forwards the request: if it's upload.wikimedia.org, it goes to the Swift media storage; for any other domain it goes to MediaWiki. And then MediaWiki does a ton of work: it connects to the database — in this case s1, the section for English Wikipedia — gets the wikitext from there, gets the wikitext of all the related pages and templates… no, wait, I forgot something.
First it checks whether the HTML for this page is available in the parser cache — that's another caching layer; this parser cache might be memcached or the database-backed cache behind it. Only if it's not there does it get the wikitext, get all the related things, and render them into HTML, which takes a long time and goes through some pretty ancient code.

If you are doing an edit or an upload, it's even worse, because then the request always has to go to MediaWiki, and MediaWiki not only has to store the new edit, either in the media backend or in the database, it also has to update a bunch of stuff. First of all, it has to purge the cache: it has to tell all the Varnish servers that there's a new version of this URL available, so that it doesn't take a full day until the time-to-live expires. It also has to update a bunch of other things. For example, if you edited a template, it might be used in a million pages, and the next time anyone requests one of those million pages, they should actually be rendered again using the new version of the template, so it has to invalidate the cache for all of those — and all of that is deferred through the job queue. It might also have to generate thumbnails if you uploaded a file, or re-transcode media files, because maybe you uploaded — what do we support? — you upload WebM and the browser only supports some other codec, so we transcode that and also encode it down to different resolutions. So it goes through that whole dance and, yeah, that was already those slides. Is Amir going to talk again about how we manage—

Amir: Okay, yeah, I'll quickly come back for a short break to talk about how we manage all of this, because managing 1300 bare-metal servers plus a Kubernetes cluster is not easy. We use Puppet for configuration management on our bare-metal systems — it's fun, around 50,000 lines of Puppet code. Lines of code is not a great indicator, but it gives you a rough estimate of how things work; and we have 100,000 lines of Ruby. We have our own CI and CD cluster: we don't host anything on GitHub or GitLab, we have our own system, which is based on Gerrit, and for that we have Jenkins, and Jenkins does all of these kinds of things. And because we have a Kubernetes cluster for some of our services, if you merge a change in Gerrit it also builds the Docker files and containers and pushes them up to production. Also, to run remote SSH commands we have Cumin, which is our in-house automation tool that we built for our systems: you go there and say, OK, depool this node, or run this command on all of the Varnish nodes I told you about — for example when you want to restart them.
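The remote-command fan-out Amir attributes to Cumin can be pictured as plain parallel SSH. The sketch below is a generic stand-in, not Cumin's actual API (which has its own host-selection grammar and transports), and the hostnames are placeholders.

```python
# Generic "run a command on many hosts" sketch in the spirit of Cumin.
# NOT Cumin's real API; hostnames are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"cache-node{i}.example.org" for i in range(1, 7)]

def run_remote(host: str, command: str):
    """Run one command on one host over ssh and return (host, exit code, output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=60,
    )
    return host, proc.returncode, proc.stdout.strip()

def run_everywhere(command: str, parallelism: int = 5) -> None:
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for host, code, out in pool.map(lambda h: run_remote(h, command), HOSTS):
            print(f"{host}: exit={code} {out[:60]!r}")

# The kind of one-off task mentioned in the talk, e.g. restarting a cache
# service everywhere (shown for illustration only):
# run_everywhere("sudo systemctl restart varnish")
```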
And with this I get back to Lucas.

Lucas: So, I am going to talk a bit more about Wikimedia Cloud Services, which is a bit different in that it's not really our production stuff; it's where you, the volunteers of the Wikimedia movement, can run your own code. You can request a project, which is kind of a group of users, and you get assigned a quota — this much CPU and this much RAM — and you can create virtual machines with those resources and then do stuff there and run basically whatever you want. To create, boot and shut down the VMs and so on we use OpenStack, and there's a Horizon frontend for that which you use through the browser; it can be a bit annoying at times, but otherwise it works pretty well. Internally, ideally you manage the VMs using Puppet, but a lot of people just SSH in and set up the VM manually, and that happens, well. There are a few big projects, like Toolforge, where you can run your own web-based tools, or the beta cluster, which is basically a copy of some of the biggest wikis — there's a beta English Wikipedia, a beta Wikidata, a beta Wikimedia Commons — using mostly the same configuration as production but running the current master version of the software instead of whatever we deploy once a week. So if there's a bug, we hopefully see it earlier, even if we didn't catch it locally, because the beta cluster is more similar to the production environment. The continuous integration services run in Wikimedia Cloud Services as well. And you have to have Kubernetes somewhere on these slides, right? So you can use that to distribute work between the tools in Toolforge, or you can use the grid engine, which does a similar thing but is like three decades old and has gone through about five forks by now — I think the current fork we use is Son of Grid Engine, and I don't know what it was called before. But that's Cloud Services.

Amir: So, in a nutshell, this is our system. We have 1300 bare-metal servers, with lots and lots of caching — lots of layers of caching — because we mostly serve reads, and we can just serve those from a cached version. All of this is open source; you can contribute to it if you want to, and a lot of the configuration is open as well. This is actually the way I got hired: I started contributing to the system, and at some point it was like, yeah, come and work for us.

Daniel: That's actually how all of us got hired.

Amir: So yeah, this is the whole thing that happens in Wikimedia, and if you want to help us, we are hiring. You can just go to jobs at wikimedia.org if you want to work for the Wikimedia Foundation. If you want to work with Wikimedia Deutschland, you can go to wikimedia.de, and at the bottom there's a link for jobs, because the full link got too long. And if you want to contribute, there are so many ways to contribute. As I said, there are so many open bugs; we have our own Grafana instance, so you can just look at the monitoring, and Phabricator is our bug tracker — you can just go there, find a bug, and fix things. Actually, we have one repository that is private, but it only holds the TLS certificates and things that are really, really private, which we cannot publish. There is also documentation: the documentation for the infrastructure is at wikitech.wikimedia.org, the documentation for the configuration is at noc.wikimedia.org, plus the documentation of our codebase, and the documentation for MediaWiki itself is at mediawiki.org.
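Tying this back to the live database replicas Amir mentioned earlier: from a Toolforge or Cloud VPS account, querying the sanitized replicas looks roughly like this. Treat the hostname and the replica.my.cnf credentials file as assumptions about the Cloud Services conventions; the page table and its columns are part of the standard MediaWiki schema.

```python
# Sketch of querying the sanitized wiki replicas from Toolforge / Cloud VPS.
# The hostname and credentials file reflect Cloud Services conventions as I
# understand them (assumptions); `page` is a standard MediaWiki table.
import os
import pymysql  # pip install pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",             # assumed replica host
    database="enwiki_p",                                        # "_p" = public, sanitized views
    read_default_file=os.path.expanduser("~/replica.my.cnf"),   # per-account credentials
)

with conn.cursor() as cur:
    # Five article titles from the main namespace (namespace 0).
    cur.execute("SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 5")
    for (title,) in cur.fetchall():
        print(title.decode() if isinstance(title, bytes) else title)

conn.close()
```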
We also have our own URL shortener: you can go to w.wiki and shorten any URL within the Wikimedia infrastructure — we reserved the dollar sign for the donation site. And if you have any questions, please!

Applause

Daniel: We have quite a bit of time for questions, so if anything wasn't clear or you're curious about anything, please ask.

AM: One question about something that is not in the presentation: do you have to deal with hacking attacks?

Amir: So, the first rule of security issues is that we don't talk about security issues. But let's say this baby gets all sorts of attacks. We usually have DDoS attacks; there was one a couple of months ago that was very "successful" — I don't know if you read the news about that. But we have infrastructure to handle this, and we have a security team that handles these cases, yes.

AM: Hello. How do you manage access to your infrastructure for your employees?

Amir: We have LDAP groups, and LDAP is used for the web-based systems, but for SSH we have strict protocols: you get a private key — people usually protect their private key with YubiKeys — and then you can SSH into the system, basically.

Lucas: Yeah, there's also some firewalling set up, so there's only one server per data center that you can actually reach through SSH, and then you have to tunnel through that to get to any other server.

Amir: And we also have an internal firewall: from inside production you cannot talk to the outside. If you do, for example, git clone from github.com, it doesn't work. You can only access hosts that are inside the Wikimedia Foundation infrastructure.

AM: Okay, hi. You said you do TLS termination with nginx; do you still allow non-HTTPS, so non-secure, access?

Amir: No, we dropped that a really long time ago.

Lucas: 2013 or so?

Amir: Yeah, 2015.

Lucas: 2015.

Amir: In 2013 we started serving most of the traffic over HTTPS, but in 2015 we dropped all of the non-HTTPS protocols, and recently we even stopped serving any SSL requests at all. TLS 1.1 is also being phased out, so we are sending a warning to those users: you're using TLS 1.1, please migrate to these new things that came out around 10 years ago.

Lucas: Yeah, I think the deadline for that is February 2020 or something; then we'll only have TLS 1.2.

Amir: And soon we are going to support TLS 1.3.

Lucas: Yeah. Are there any other questions?

Q: Does read-only traffic from logged-in users hit all the way through to the parser cache, or is there another layer of caching for that?

Amir: Yes — as a logged-in user you bypass all of that.

Daniel: We need one more microphone. Yes, it actually does, and this is a pretty big problem and something we want to look into (clears throat), but it requires quite a bit of rearchitecting. If you are interested in this kind of thing, maybe come to my talk tomorrow at noon.

Amir: Yeah, one thing we are planning to do is active-active, so we have two primary data centers and read requests from users can hit the secondary data center instead of the main one.

Lucas: I think there was a question way in the back there, for some time already.

AM: Hi, I have a question.
I read on Wikitech that you are using Ganeti as a virtualization platform for some parts. Can you tell us something about that — which parts of Wikipedia or Wikimedia are hosted on that platform?

Amir: I'm not very sure about this, so take it with a grain of salt, but as far as I know, Ganeti is used to build a few very small VMs in production that we need for very, very small microsites that we serve to users. So we build just one or two VMs with it; we don't use it very often, as far as I know.

AM: Do you also think about open hardware?

Amir: I don't — you can take this one.

Daniel: Not for servers. I think for the offline reader project — which is not actually run by the Foundation; it's supported, but it's not something the Foundation does — they were sort of thinking about open hardware. But really open hardware in practice usually means… you know, if you really want to go down to the chip design, it's pretty tough. So it's usually not practical, sadly.

Amir: One thing I can say is that we have some machines that are really powerful, which we give to researchers to run analyses on, and we needed GPUs for those. But the problem was that there wasn't any open source driver for them, so we migrated and used AMD, I think — but the AMD card didn't fit in the rack; it was quite an endeavor to get GPUs working for our researchers.

AM: I'm still impressed that you answer 90% of requests out of the cache. Do all people access the same pages, or is the cache that huge? What percentage of the whole database is in the cache, then?

Daniel: I don't have the exact numbers, to be honest, but a large percentage of the whole database is in the cache. I mean, it expires after 24 hours, so really obscure stuff isn't there. But it's a power-law distribution, right? You have a few pages that are accessed a lot, and you have many, many pages that are not actually accessed at all for a week or so, except maybe by a crawler. So I don't know the number; my guess would be that less than 50% is actually cached, but that still covers 90% — it's probably the top 10% of pages that cover 90% of the page views. I should look this up, it would be an interesting number to have, yes.

Lucas: Do you know if this is 90% of the page views or 90% of the GET requests? Because requests for the JavaScript, for example, would also be cached more often, I assume.

Daniel: I would expect that for non-page-views it's even higher.

Lucas: Yeah.

Daniel: Yeah, because, you know, all the icons and JavaScript bundles and CSS and stuff don't ever change.

Lucas: Let's say roughly 90% either way — but there's a question back there.

AM: Hey. Do your data centers run on green energy?

Amir: Very valid question. So, the Amsterdam one is fully green, but the other ones are partially green, partially coal and gas.
As far as I know there are some plans to move away from that. On the other hand, we realized that we don't produce that much carbon emission, because we don't have that many servers and we don't process that much data. Someone added it up, and our carbon emissions — the data centers plus all the travel that all of us have to do, plus all the events — are about the same as 250 households. It's very, very small; I think it's a thousandth of what comparable traffic would cost Facebook, even if you only counted the same amount of traffic, because Facebook collects data and runs very sophisticated machine learning algorithms, which is really complicated — but Wikimedia doesn't do that, so we don't need much energy. Does that answer your question?

Herald: Do we have any other questions left? Yeah, sorry.

AM: Hi. How many developers do you need to maintain the whole infrastructure, and how many developer hours did it take to build it? The reason I'm asking is that what I find very interesting about this talk is that it's a non-profit, so as an example for other non-profits: how much money are we talking about to build something like this as a digital common?

Daniel: If this is just about actually running all of this, then operations is less than 20 people, I think. If you basically divide the requests per second by the number of people, you get to something like 8,000 requests per second per operations engineer, which I think is a pretty impressive number — I would really like to know if there's any organization that tops that. I don't know the actual operations budget; I know it's in the two-digit millions annually. Total hours for building this over the last 18 years — I have no idea. For the first five or so years, the people doing it were actually volunteers; we still had volunteer database administrators and so on until maybe eight or ten years ago. So, yeah, nobody ever did any accounting of this; I can only guess.

AM: Hello, a tools question. A few years back I saw some interesting examples of SaltStack use at Wikimedia, but right now I only see Puppet and Cumin mentioned, so what happened with that?

Amir: I think we ditched SaltStack. I can't say for sure, because none of us is on that team, but if you look at wikitech.wikimedia.org — last time I checked it says it's deprecated and obsolete; we don't use it anymore.

AM: Do you use the job runners to fill spare capacity on the web-serving servers, or do you have dedicated servers for those roles?

Lucas: I think they're dedicated.

Amir: The job runners, if that's what you're asking about, are dedicated, yes — I think 5 per primary data center, so—

Daniel: Yeah, and I mean, do we actually have any spare capacity on anything? We don't have that much hardware; everything is pretty much at a hundred percent.

Lucas: I think we still have some server that is just called misc1111 or something, which runs five different things at once; you can look for those on Wikitech.

Amir: Oh sorry, it's not five, it's 20 per primary data center — those are our job runners, and they run 700 jobs per second.

Lucas: And I think that does not include the video scalers, so those are separate again.

Amir: No, they merged them, like a month ago.

Lucas: Okay, cool.
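As a toy picture of what those dedicated job runners do, here is the worker-pulls-from-a-queue pattern in miniature. This is not MediaWiki's actual JobQueue code; the job types and handlers are made up, echoing the deferred work described earlier (cache purges, template re-renders, transcodes).

```python
# Toy sketch of the job-runner pattern: dedicated workers pull deferred jobs
# from a queue and execute them in the background. Job types are illustrative.
import queue
import threading

job_queue: "queue.Queue[dict]" = queue.Queue()

HANDLERS = {
    "purge_cdn":       lambda job: print(f"purging {job['url']} from the CDN"),
    "reparse_page":    lambda job: print(f"re-rendering {job['title']} with new templates"),
    "transcode_video": lambda job: print(f"transcoding {job['file']} to {job['format']}"),
}

def worker() -> None:
    while True:
        job = job_queue.get()
        try:
            HANDLERS[job["type"]](job)
        finally:
            job_queue.task_done()

# A handful of worker threads stands in for the dedicated job-runner servers.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# An edit fans out into deferred jobs like these:
job_queue.put({"type": "purge_cdn", "url": "https://en.wikipedia.org/wiki/Example"})
job_queue.put({"type": "reparse_page", "title": "Example"})
job_queue.put({"type": "transcode_video", "file": "Example.webm", "format": "480p"})
job_queue.join()
```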
AM: Maybe a little bit off-topic, but can you tell us a little bit about the decision-making process for technical decisions, architecture decisions — how does that work in an organization like this?

Daniel: Yeah. Wikimedia has a committee for making high-level technical decisions; it's called the Wikimedia Technical Committee, TechCom, and we run an RFC process. Any decision that is cross-cutting, strategic, or especially hard to undo should go through this process. It's pretty informal: basically you file a ticket and start the process, it gets announced on the mailing list, hopefully you get input and feedback, and at some point it gets approved for implementation. We're currently looking into improving this process; sometimes it works pretty well, sometimes things don't get that much feedback, but it still makes sure that people are aware of these high-level decisions.

Amir: Daniel is the chair of that committee.

Daniel: Yeah — if you want to complain about the process, please do.

AM: Yes, regarding CI and CD along the pipeline: with that much traffic you of course want to keep everything consistent. So are there any testing strategies you have settled on internally — of course unit tests and integration tests, but do you do something like continuous end-to-end testing on beta instances?

Amir: So we have the beta cluster, but we also do a deploy we call the "train", once a week. All of the changes get merged into a branch, the branch gets cut every Tuesday, and it first goes to the test wikis, then to all of the wikis that are not Wikipedias, plus Catalan and Hebrew Wikipedia — basically, Hebrew and Catalan Wikipedia volunteered to be the guinea pigs for the other Wikipedias. If everything works fine, it goes on; if not, we see the fatal errors — we have logging — and then it's like, okay, we need to fix this, and we fix it immediately, and then it goes live to all wikis. That's one way of looking at it.

Daniel: So, our test coverage is not as great as it should be, and so we kind of, you know, abuse our users for this. We are, of course, working to improve it, and one thing we started recently is a program for creating end-to-end tests for all the API modules we have, in the hope that we can thereby cover pretty much all of the application logic while bypassing the user interface. I mean, full end-to-end testing should of course include the user interface, but user interface tests are pretty brittle and often test, you know, where things are on the screen, and it just seems to make a lot of sense to have tests that actually exercise the application logic — what the system should be doing, rather than what it should look like. So far this has been a proof of concept, and we're currently working to actually integrate it into CI. That should land once everyone is back from vacation, and then we have to write about a thousand or so tests, I guess.
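To give a flavor of such an API-level end-to-end test, here is a minimal example against the public MediaWiki action API. This is a sketch, not the Foundation's actual test suite; it only uses the read-only action=query meta=siteinfo module.

```python
# Minimal end-to-end style test against the MediaWiki action API, as a flavor
# of the API-level tests described in the talk. Read-only, public endpoint.
import requests

def test_siteinfo_reports_the_right_wiki():
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        headers={"User-Agent": "example-e2e-test/0.1"},
        timeout=10,
    )
    resp.raise_for_status()
    general = resp.json()["query"]["general"]

    # Assert on application behaviour, not on how the page looks.
    assert general["sitename"] == "Wikipedia"
    assert general["wikiid"] == "enwiki"

if __name__ == "__main__":
    test_siteinfo_reports_the_right_wiki()
    print("ok")
```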
Lucas: I think there's also a plan to move to a system where we basically deploy after every commit and can immediately roll back if something goes wrong, but that's more mid-term stuff and I'm not sure what the current status of that proposal is.

Amir: And it will be on Kubernetes, so it will be completely different.

Daniel: That would be amazing.

Lucas: But right now we are on this weekly cadence, and if something goes wrong, we roll back to last week's version of the code.

Herald: Are there any questions left? Sorry. Yeah. Okay, I don't think so. So, yeah, thank you for this wonderful talk, and thank you for all your questions. I hope you liked it. See you around.

Applause

Music

Subtitles created by c3subtitles.de in the year 2021. Join, and help us!