
36C3 Wikipaka WG: Infrastructure of Wikipedia

  • 0:00 - 0:19
    Music
  • 0:19 - 0:25
    Herald: Hi! Welcome, welcome to the Wikipaka-
    WG, in this extremely crowded Esszimmer.
  • 0:25 - 0:32
    I'm Jakob, I'm your Herald for tonight
    until 10:00 and I'm here to welcome you
  • 0:32 - 0:37
    and to welcome these wonderful three guys
    on the stage. They're going to talk about
  • 0:37 - 0:45
    the infrastructure of Wikipedia.
    And yeah, they are Lucas, Amir, and Daniel
  • 0:45 - 0:53
    and I hope you'll have fun!
    Applause
  • 0:53 - 0:57
    Amir Sarabadani: Hello, my name is
    Amir, um, I'm a software engineer at
  • 0:57 - 1:01
    Wikimedia Deutschland, which is the German
    chapter of Wikimedia Foundation. Wikimedia
  • 1:01 - 1:07
    Foundation runs Wikipedia. Here is Lucas.
    Lucas is also a software engineer, at
  • 1:07 - 1:10
    Wikimedia Deutschland, and Daniel here is
    a software architect at Wikimedia
  • 1:10 - 1:15
    Foundation. We are all based in Germany,
    Daniel in Leipzig, we are in Berlin. And
  • 1:15 - 1:21
    today we want to talk about how we run
    Wikipedia using donors' money and
  • 1:21 - 1:30
    not by running advertisements and collecting
    data. So in this talk, first we are going
  • 1:30 - 1:35
    to go on an inside-out approach. So we are
    going to first talk about the application
  • 1:35 - 1:40
    layer and then the outside layers, and
    then we go to an outside-in approach and
  • 1:40 - 1:49
    then talk about how you're going to hit
    Wikipedia from the outside.
  • 1:49 - 1:53
    So first of all, let me give
    you some information. First of
  • 1:53 - 1:57
    all, all of the Wikimedia and Wikipedia
    infrastructure is run by the Wikimedia
  • 1:57 - 2:02
    Foundation, an American nonprofit
    charitable organization. We don't run any
  • 2:02 - 2:08
    ads and we are only 370 people. If you
    count Wikimedia Deutschland or all other
  • 2:08 - 2:12
    chapters, it's around 500 people in total.
    It's nothing compared to the companies
  • 2:12 - 2:20
    outside. But all of the content is
    managed by volunteers. Even our staff
  • 2:20 - 2:24
    doesn't edit or add content to
    Wikipedia. And we support 300 languages,
  • 2:24 - 2:30
    which is a very large number. And
    Wikipedia, it's eighteen years old, so it
  • 2:30 - 2:38
    can vote now. And also, Wikipedia has some
    really, really weird articles. Um, I want
  • 2:38 - 2:43
    to ask you, what is your, if you have
    encountered any really weird article
  • 2:43 - 2:48
    in Wikipedia? My favorite is a list of
    people who died on the toilet. But if you
  • 2:48 - 2:55
    know anything, raise your hands. Uh, do
    you know any weird articles in Wikipedia?
  • 2:55 - 2:59
    Do you know some?
    Daniel Kinzler: Oh, the classic one….
  • 2:59 - 3:04
    Amir: You need to unmute yourself. Oh,
    okay.
  • 3:04 - 3:10
    Daniel: This is technology. I don't know
    anything about technology. OK, no. The, my
  • 3:10 - 3:14
    favorite example is "people killed by
    their own invention". That's yeah. That's
  • 3:14 - 3:21
    a lot of fun. Look it up. It's amazing.
    Lucas Werkmeister: There's also a list,
  • 3:21 - 3:25
    there is also a list of prison escapes
    using helicopters. I almost said
  • 3:25 - 3:29
    helicopter escapes using prisons, which
    doesn't make any sense. But that was also
  • 3:29 - 3:32
    a very interesting list.
    Daniel: I think we also have a category of
  • 3:32 - 3:35
    lists of lists of lists.
    Amir: That's a page.
  • 3:35 - 3:39
    Lucas: And every few months someone thinks
    it's funny to redirect it to Russell's
  • 3:39 - 3:43
    paradox or so.
    Daniel: Yeah.
  • 3:43 - 3:49
    Amir: But also, besides that, people cannot
    read Wikipedia in Turkey or China. But
  • 3:49 - 3:54
    three days ago, actually, the block in
    Turkey was ruled unconstitutional, but
  • 3:54 - 4:01
    it's not lifted yet. Hopefully they will
    lift it soon. Um, so the Wikimedia
  • 4:01 - 4:06
    projects are not just Wikipedia. There are lots
    and lots of projects. Some of them are not
  • 4:06 - 4:12
    as successful as Wikipedia. Um, uh,
    like Wikinews. But uh, for example,
  • 4:12 - 4:16
    Wikipedia is the most successful one, and
    there's another one, that's Wikidata. It's
  • 4:16 - 4:22
    being developed by Wikimedia Deutschland.
    I mean the Wikidata team, with Lucas, um,
  • 4:22 - 4:27
    and it's being used for infoboxes - it
    has the data that Wikipedia or the Google
  • 4:27 - 4:31
    Knowledge Graph or Siri or Alexa uses.
    It's basically, it's sort of a backbone of
  • 4:31 - 4:38
    all of the data, uh, through the whole
    Internet. Um, so our infrastructure. Let
  • 4:38 - 4:43
    me… So first of all, our infrastructure is
    all Open Source. By principle, we never
  • 4:43 - 4:48
    use any commercial software. Uh, we could
    use lots of things - they were even
  • 4:48 - 4:54
    sometimes given to us for free - but we
    refused to use them. The second
  • 4:54 - 4:59
    thing is we have two primary data centers
    for failovers, when, for example, a
  • 4:59 - 5:04
    whole datacenter goes offline, so we can
    failover to another data center. We have
  • 5:04 - 5:11
    three caching points of presence or
    CDNs. Our CDNs are all over the world. Uh,
  • 5:11 - 5:15
    also, we have our own CDN. We don't
    use Cloudflare, because
  • 5:15 - 5:21
    we care about the privacy of
    the users, and it is very important because, for
  • 5:21 - 5:25
    example, people edit from countries where it
    might be, uh, dangerous for them to edit
  • 5:25 - 5:30
    Wikipedia. So we really care to keep the
    data as protected as possible.
  • 5:30 - 5:32
    Applause
  • 5:32 - 5:39
    Amir: Uh, we have 17 billion page views
    per month, which goes up and down
  • 5:39 - 5:44
    based on the season and everything, we
    have around 100 to 200 thousand requests
  • 5:44 - 5:48
    per second. It's different from the
    pageview because requests can be requests
  • 5:48 - 5:55
    for objects, can be API calls, can be lots of
    things. And we have 300,000 new editors
  • 5:55 - 6:03
    per month and we run all of this with 1300
    bare metal servers. So right now, Daniel
  • 6:03 - 6:07
    is going to talk about the application
    layer and the inside of that
  • 6:07 - 6:12
    infrastructure.
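
    As a rough sanity check of the figures Amir just quoted, here is a
    small back-of-the-envelope calculation. These are approximations
    built only from the numbers mentioned above, not official statistics:

        # Rough arithmetic with the figures quoted above.
        SECONDS_PER_MONTH = 30 * 24 * 3600        # ~2.6 million seconds

        pageviews_per_month = 17_000_000_000      # ~17 billion page views
        requests_per_second = 150_000             # midpoint of "100 to 200 thousand"

        pageviews_per_second = pageviews_per_month / SECONDS_PER_MONTH
        requests_per_pageview = requests_per_second / pageviews_per_second

        print(f"~{pageviews_per_second:,.0f} page views per second")   # roughly 6,500
        print(f"~{requests_per_pageview:.0f} requests per page view")  # roughly 20-25
                                                  # (images, CSS, JS, API calls, ...)
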
    Daniel: Thanks, Amir. Oh, the clicky
  • 6:12 - 6:20
    thing. Thank you. So the application layer
    is basically the software that actually
  • 6:20 - 6:25
    does what a wiki does, right? It lets you
    edit pages, create or update pages and
  • 6:25 - 6:30
    then serve the page views. interference
    noise
    The challenge for Wikipedia, of
  • 6:30 - 6:37
    course, is serving all the many page views
    that Amir just described. The core of the
  • 6:37 - 6:43
    application is a classic LAMP application.
    interference noise I have to stop
  • 6:43 - 6:50
    moving. Yes? Is that it? It's a classic
    LAMP stack application. So it's written in
  • 6:50 - 6:57
    PHP, it runs on an Apache server. It uses
    MySQL as a database in the backend. We
  • 6:57 - 7:02
    used to use HHVM instead of the… Yeah,
    we…
  • 7:02 - 7:14
    Herald: Here. Sorry. Take this one.
    Daniel: Hello. We used to use HHVM as the
  • 7:14 - 7:21
    PHP engine, but we just switched back to
    the mainstream PHP, using PHP 7.2 now,
  • 7:21 - 7:25
    because Facebook decided that HHVM is
    going to be incompatible with the standard
  • 7:25 - 7:35
    and they were just basically developing it
    for, for themselves. Right. So we have
  • 7:35 - 7:43
    separate clusters of servers for serving
    different kinds of requests:
  • 7:43 - 7:48
    page views on the one hand, and also
    handling edits. Then we have a cluster for
  • 7:48 - 7:55
    handling API calls and then we have a
    bunch of servers set up to handle
  • 7:55 - 8:01
    asynchronous jobs, things that happen in
    the background, the job runners, and…
  • 8:01 - 8:05
    I guess video scaling is a very obvious
    example of that. It just takes too long to
  • 8:05 - 8:12
    do it on the fly. But we use it for many
    other things as well. MediaWiki, MediaWiki
  • 8:12 - 8:16
    is kind of an amazing thing because you
    can just install it on your own shared-
  • 8:16 - 8:23
    hosting, 10-bucks-a-month webspace and
    it will run. But you can also use it to,
  • 8:23 - 8:29
    you know, serve half the world. And so
    it's a very powerful and versatile system,
  • 8:29 - 8:34
    which also… I mean, this, this wide span
    of different applications also creates
  • 8:34 - 8:41
    problems. That's something that I will
    talk about tomorrow. But for now, let's
  • 8:41 - 8:49
    look at the fun things. So if you want to
    serve a lot of page views, you have to do
  • 8:49 - 8:56
    a lot of caching. And so we have a whole…
    yeah, a whole set of different caching
  • 8:56 - 9:01
    systems. The most important one is
    probably the parser cache. So as you
  • 9:01 - 9:07
    probably know, wiki pages are created in,
    in a markup language, Wikitext, and they
  • 9:07 - 9:13
    need to be parsed and turned into HTML.
    And the result of that parsing is, of
  • 9:13 - 9:20
    course, cached. And that cache is semi-
    persistent, it… nothing really ever drops
  • 9:20 - 9:25
    out of it. It's a huge thing. And it
    lives in a dedicated MySQL database
  • 9:25 - 9:33
    system. Yeah. We use memcached a lot for
    all kinds of miscellaneous things,
  • 9:33 - 9:39
    anything that we need to keep around and
    share between server instances. And we
  • 9:39 - 9:44
    have been using redis for a while, for
    anything that we want to have available,
  • 9:44 - 9:48
    not just between different servers, but
    also between different data centers,
  • 9:48 - 9:53
    because redis is a bit better about
    synchronizing things between, between
  • 9:53 - 10:00
    different systems, we still use it for
    session storage, especially, though we are
  • 10:00 - 10:10
    about to move away from that and we'll be
    using Cassandra for session storage. We
  • 10:10 - 10:19
    have a bunch of additional services
    running for specialized purposes, like
  • 10:19 - 10:27
    scaling images, rendering formulas, math
    formulas, ORES is pretty interesting. ORES
  • 10:27 - 10:33
    is a system for automatically detecting
    vandalism or rating edits. So this is a
  • 10:33 - 10:38
    machine learning based system for
    detecting problems and highlighting edits
  • 10:38 - 10:45
    that may not be, may not be great and need
    more attention. We have some additional
  • 10:45 - 10:51
    services that process our content for
    consumption on mobile devices, chopping
  • 10:51 - 10:56
    pages up into bits and pieces that then
    can be consumed individually and many,
  • 10:56 - 11:08
    many more. In the background, we also have
    to manage events, right, we use Kafka for
  • 11:08 - 11:15
    message queuing, and we use that to notify
    different parts of the system about
  • 11:15 - 11:20
    changes. On the one hand, we use that to
    feed the job runners that I just
  • 11:20 - 11:28
    mentioned. But we also use it, for
    instance, to purge the entries in the
  • 11:28 - 11:35
    CDN when pages become updated and things
    like that. OK, the next session is going
  • 11:35 - 11:40
    to be about the databases. Are there, very
    quickly, we will have quite a bit of time
  • 11:40 - 11:45
    for discussion afterwards. But are there
    any questions right now about what we said
  • 11:45 - 11:57
    so far? Everything extremely crystal
    clear. OK, no clarity is left? I see. Oh,
  • 11:57 - 12:08
    one question, in the back.
    Q: Can you maybe turn the volume up a
  • 12:08 - 12:20
    little bit? Thank you.
    Daniel: Yeah, I think this is your
  • 12:20 - 12:28
    section, right? Oh, it's Amir again. Sorry.
    Amir: So I want to talk about my favorite
  • 12:28 - 12:32
    topic, the dungeons of, dungeons of every
    production system, databases. The database
  • 12:32 - 12:40
    of Wikipedia is really interesting and
    complicated on its own. We use MariaDB, we
  • 12:40 - 12:46
    switched from MySQL in 2013 for lots of
    complicated reasons. As, as I said,
  • 12:46 - 12:50
    because we are really open source, you can
    go and not just check our database tree,
  • 12:50 - 12:55
    that shows, like, how it looks and what
    the replicas and masters are. Actually, you
  • 12:55 - 13:00
    can even query Wikipedia's database
    live. You can just go
  • 13:00 - 13:03
    to that address, log in with your
    Wikipedia account, and do whatever
  • 13:03 - 13:07
    you want. Like, it was a funny thing that
    a couple of months ago, someone sent me a
  • 13:07 - 13:13
    message, sent me a message like, oh, I
    found a security issue. You can just query
  • 13:13 - 13:18
    Wikipedia's database. I was like, no, no,
    it's actually, we, we let this happen.
  • 13:18 - 13:22
    It's like, it's sanitized. We removed the
    password hashes and everything. But still,
  • 13:22 - 13:28
    you can use this. And, if you want to
    know, like, how the clusters work, the
  • 13:28 - 13:32
    database clusters: because it got too
    big, we first started sharding, but now
  • 13:32 - 13:36
    we have sections that are basically
    different clusters. Uh, really large wikis
  • 13:36 - 13:43
    have their own section. For example,
    English Wikipedia is s1. German Wikipedia
  • 13:43 - 13:51
    with two or three other small wikis are in
    s5. Wikidata is on s8, and so on. And
  • 13:51 - 13:56
    each section has a master and several
    replicas. But one of the replicas is
  • 13:56 - 14:02
    actually a master in another data center
    because of the failover that I told you.
  • 14:02 - 14:08
    So basically two layers of
    replication exist. This, what I'm
  • 14:08 - 14:13
    telling you, is about metadata. But for
    Wikitext, we also need to have a completely
  • 14:13 - 14:19
    different set of databases. For that,
    we use consistent hashing to just scale it
  • 14:19 - 14:28
    horizontally, so we can just put more
    databases on it. Uh, but I don't
  • 14:28 - 14:32
    know if you know it, but Wikipedia stores
    every edit. So you have the text of,
  • 14:32 - 14:37
    Wikitext of every edit in the whole
    history in the database. Uhm, also we have
  • 14:37 - 14:42
    the parser cache that Daniel explained, and
    the parser cache also uses consistent hashing.
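
    A minimal illustration of the consistent-hashing idea described
    here - this is not the actual MediaWiki external-store or
    parser-cache code, and the host names are invented:

        import bisect
        import hashlib

        # Toy consistent-hash ring: a key always maps to the same host,
        # so text/parser-cache storage can be scaled out horizontally.
        class HashRing:
            def __init__(self, hosts, vnodes=100):
                self.ring = []                      # sorted list of (hash, host)
                for host in hosts:
                    for i in range(vnodes):
                        self.ring.append((self._hash(f"{host}#{i}"), host))
                self.ring.sort()

            @staticmethod
            def _hash(value):
                return int(hashlib.md5(value.encode()).hexdigest(), 16)

            def host_for(self, key):
                hashes = [entry[0] for entry in self.ring]
                idx = bisect.bisect(hashes, self._hash(key)) % len(self.ring)
                return self.ring[idx][1]

        ring = HashRing(["es1001", "es1002", "es1003", "es1004"])  # invented names
        print(ring.host_for("enwiki:revision:123456"))  # always the same host

    Adding a host to the ring only remaps a small fraction of the keys,
    which is what makes this kind of storage cheap to scale out.
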
  • 14:42 - 14:47
    So we just can horizontally scale it. But
    for metadata, it is slightly more
  • 14:47 - 14:56
    complicated. Um, metadata is what is
    being used to render the page. So in order
  • 14:56 - 15:02
    to do this, this is, for example, a very
    short version of the database tree that I
  • 15:02 - 15:07
    showed you. You can even go and look at
    other ones, but this is s1. This is s1 in eqiad,
  • 15:07 - 15:12
    which is the main data center; the master is this
    number, and it replicates to some of these,
  • 15:12 - 15:17
    and then this one, the second one, starts
    with 2000 because it's in the second data
  • 15:17 - 15:25
    center, and it's a master of the other ones.
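
    Conceptually, one such section looks roughly like the sketch below;
    the host names are placeholders, not the real database servers shown
    on the slide:

        # Illustrative only: one database "section" with a master per data center.
        # The secondary master is itself a replica of the primary master,
        # so two layers of replication exist.
        s1 = {
            "eqiad": {                    # primary data center
                "master": "db-eqiad-01",
                "replicas": ["db-eqiad-02", "db-eqiad-03", "db-eqiad-04"],
            },
            "codfw": {                    # secondary data center
                "master": "db-codfw-01",  # replicates from db-eqiad-01
                "replicas": ["db-codfw-02", "db-codfw-03"],
            },
        }
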
    And it has its own replication,
  • 15:25 - 15:31
    cross-DC replication, because
    that master data center is in
  • 15:31 - 15:37
    Ashburn, Virginia. The second data center
    is in Dallas, Texas. So they need to have
  • 15:37 - 15:43
    cross-DC replication, and that happens
    over TLS to make sure that no one starts
  • 15:43 - 15:49
    to listen in between these two. And we
    have snapshots and even dumps of the whole
  • 15:49 - 15:53
    history of Wikipedia. You can go to
    dumps.wikimedia.org and download the whole
  • 15:53 - 15:59
    history of every wiki you want, except the
    ones that we had to remove for privacy
  • 15:59 - 16:05
    reasons. And we have lots and lots of
    backups. I recently realized we have lots
  • 16:05 - 16:15
    of backups. And in total it is 570 TB of data
    and in total 150 database servers, and the
  • 16:15 - 16:20
    queries that hit them are around
    350,000 queries per second and, in total,
  • 16:20 - 16:29
    it requires 70 terabytes of RAM. Also,
    we have another storage section
  • 16:29 - 16:35
    called Elasticsearch which, as you can guess,
    is being used for search - on the top
  • 16:35 - 16:39
    right, if you're using desktop. It's
    different on mobile, I think. And also it
  • 16:39 - 16:45
    depends on whether you're using an RTL language.
    It is run by a team called Search
  • 16:45 - 16:48
    Platform; because none of us are from
    Search Platform, we cannot explain it in
  • 16:48 - 16:54
    much detail, we only know slightly how it
    works. Also we have a media storage for
  • 16:54 - 16:58
    all of the free pictures that are being
    uploaded to Wikimedia. Like, for example,
  • 16:58 - 17:02
    if you have a category in Commons - Commons
    is our wiki that holds all of the free
  • 17:02 - 17:08
    media - and we have a category in Commons
    called cats looking at left and you have a
  • 17:08 - 17:16
    category cats looking at right, so we have
    lots and lots of images. It's 390 terabytes
  • 17:16 - 17:21
    of media, 1 billion objects, and it uses Swift.
    Swift is the object storage component
  • 17:21 - 17:29
    of OpenStack, and it has several
    layers of caching, frontend and backend.
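
    As an illustration of what talking to a Swift media store looks like
    with the python-swiftclient library - the endpoint, credentials,
    container and file names here are made up, and this is not
    Wikimedia's actual media-handling code:

        from swiftclient.client import Connection   # pip install python-swiftclient

        # Hypothetical credentials and endpoint, for illustration only.
        conn = Connection(
            authurl="https://swift.example.org/auth/v1.0",
            user="media:uploader",
            key="secret",
        )

        # Store an uploaded file as an object in a container...
        with open("Cat_looking_left.jpg", "rb") as f:
            conn.put_object("media-originals", "Cat_looking_left.jpg",
                            contents=f, content_type="image/jpeg")

        # ...and read it back later, e.g. when a thumbnail needs rendering.
        headers, body = conn.get_object("media-originals", "Cat_looking_left.jpg")
        print(headers.get("content-length"), "bytes")
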
  • 17:29 - 17:37
    Yeah, that's mostly it. And we want to
    talk about traffic now and so this picture
  • 17:37 - 17:44
    is from when Sweden in 1967 switched from
    driving on the left to driving on the
  • 17:44 - 17:49
    right. This is basically what happens in
    Wikipedia infrastructure as well. So we
  • 17:49 - 17:55
    have five caching data centers and the most
    recent one is eqsin, which is in Singapore;
  • 17:55 - 17:59
    three of them are just CDNs: ulsfo, codfw,
    esams and eqsin. Sorry, ulsfo, esams and
  • 17:59 - 18:07
    eqsin are just CDNs. We also have two
    points of presence, one in Chicago and the
  • 18:07 - 18:15
    other one is also in Amsterdam, but we
    don't get to that. So, we have, as I said,
  • 18:15 - 18:20
    we have our own content delivery network,
    and our traffic allocation is done by
  • 18:20 - 18:27
    GeoDNS, which is actually written and
    maintained by one of the traffic people,
  • 18:27 - 18:32
    and we can pool and depool DCs. It has a
    time to live of 10 minutes, so
  • 18:32 - 18:38
    if a data center goes down, it
    takes 10 minutes to actually propagate its
  • 18:38 - 18:47
    being depooled and repooled again. And we
    use LVS at the transport layer; it's a layer
  • 18:47 - 18:56
    3 and 4 load balancer for
    Linux and supports consistent hashing. And
  • 18:56 - 19:01
    also, we grew so big that we
    needed to have something that manages the
  • 19:01 - 19:07
    load balancers, so we wrote our
    own system, called PyBal. And also -
  • 19:07 - 19:11
    lots of companies actually peer with us. We
    for example directly connect to
  • 19:11 - 19:20
    AMS-IX, the Amsterdam Internet Exchange. So this is how the
    caching works - anyway,
  • 19:20 - 19:25
    there are lots of reasons for this. Let's
    just get started. We use TLS, we
  • 19:25 - 19:31
    support TLS 1.2, and then in
    the first layer we have nginx-. Do you
  • 19:31 - 19:40
    know it - does anyone know what nginx-
    means? And so that's related but not - not
  • 19:40 - 19:47
    correct. So there is nginx, which is the free
    version, and there is nginx plus, which is
  • 19:47 - 19:52
    the commercial version. But we
    don't use nginx to do load balancing or
  • 19:52 - 19:56
    anything so we stripped out everything
    from it, and we just use it for TLS
  • 19:56 - 20:02
    termination, so we call it nginx-, it's an
    internal joke. And then we have the Varnish
  • 20:02 - 20:10
    frontend. Varnish is also a caching layer,
    and the frontend is in memory,
  • 20:10 - 20:15
    which is very, very fast, and you have the
    backend, which is on storage, on
  • 20:15 - 20:23
    hard disks, but that is slow. The fun thing
    is, just the CDN caching layer takes 90%
  • 20:23 - 20:27
    of our requests: 90% of requests
    just get to the Varnish and just
  • 20:27 - 20:35
    return, and when that doesn't work, it goes
    through to the application layer. The Varnish
  • 20:35 - 20:41
    cache has a TTL of 24 hours, and if you
    change an article, it also gets invalidated
  • 20:41 - 20:47
    by the application. So if someone edits, the
    CDN actually purges the result. And the
  • 20:47 - 20:52
    thing is, the frontend is sharded so it can
    handle spikes: you come here, the load
  • 20:52 - 20:56
    balancer just randomly sends your request
    to a frontend, but then,
  • 20:56 - 21:01
    if the frontend can't find it,
    it sends it to the backend, and the backend
  • 21:01 - 21:10
    is actually sort of - how is it called? -
    it's hashed by request, so, for
  • 21:10 - 21:15
    example, the article on Barack Obama is only
    being served from one node in the data
  • 21:15 - 21:22
    center in the CDN. If none of this works it
    actually hits the other data center. So,
  • 21:22 - 21:30
    yeah, I actually explained all of this. So
    we have two - two caching clusters and one
  • 21:30 - 21:36
    is called text and the other one is called
    upload, it's not confusing at all, and if
  • 21:36 - 21:43
    you want to find out, you can just do mtr
    en.wikipedia.org and you - you're - the end
  • 21:43 - 21:50
    node is text-lb.wikimedia.org, which is
    our text cluster, but if you go to
  • 21:50 - 21:58
    upload.wikimedia.org, you get to hit the
    upload cluster. Yeah, this is it so
  • 21:58 - 22:04
    far, and it has lots of problems, because
    a) Varnish is open core, so the version
  • 22:04 - 22:09
    that we use is open source, we don't use
    the commercial one, but the open core one
  • 22:09 - 22:21
    doesn't support TLS. What? What happened?
    Okay. No, no, no! You should I just-
  • 22:21 - 22:36
    you're not supposed to see this. Okay,
    sorry for the- huh? Okay, okay sorry. So
  • 22:36 - 22:40
    Varnish has lots of problems, Varnish is
    open core, it doesn't support TLS
  • 22:40 - 22:45
    termination, which forces us to have this
    nginx- system just to do TLS
  • 22:45 - 22:50
    termination, and makes our system complicated.
    It doesn't work very well, which
  • 22:50 - 22:56
    causes us to have a cron job to restart
    every Varnish node twice a week. We have a
  • 22:56 - 23:04
    cron job that restarts every Varnish
    node, which is embarrassing, but also, on
  • 23:04 - 23:09
    the other hand, when the Varnish
    backend wants to talk to the
  • 23:09 - 23:13
    application layer, it also doesn't support
    TLS termination, so we use
  • 23:13 - 23:20
    IPsec, which is even more embarrassing, but
    we are changing it. So now we are
  • 23:20 - 23:25
    using Apache Traffic Server, ATS, which
    is very, very nice and it's also open
  • 23:25 - 23:31
    source, fully open source, with the
    Apache Foundation. ATS does the TLS,
  • 23:31 - 23:37
    does the TLS termination, and still,
    for now, we have a Varnish frontend that
  • 23:37 - 23:45
    still exists, but the backend is also going
    to change to ATS, so we call this the ATS
  • 23:45 - 23:50
    sandwich: two ATS layers, and
    there in the middle there's a Varnish. The
  • 23:50 - 23:55
    good thing is that the TLS termination
    when it moves to ATS, you can actually use
  • 23:55 - 24:01
    TLS 1.3, which is more modern and more
    secure and even faster, so it
  • 24:01 - 24:06
    basically drops 100 milliseconds from
    every request that goes to Wikipedia.
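
    A quick back-of-the-envelope check of what that saving adds up to,
    using the request rate quoted earlier (rough numbers, for
    illustration only):

        # ~100 ms saved per request, at roughly 150,000 requests per second.
        saved_per_request_s = 0.100
        requests_per_second = 150_000
        seconds_per_month = 30 * 24 * 3600

        saved_seconds = saved_per_request_s * requests_per_second * seconds_per_month
        saved_years = saved_seconds / (365 * 24 * 3600)
        print(f"~{saved_years:,.0f} years of waiting avoided per month")
        # on the order of a thousand years, i.e. "centuries" per month
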
  • 24:06 - 24:12
    That translates to centuries of our
    users' time every month. The ATS migration is
  • 24:12 - 24:19
    ongoing and hopefully it will go live soon, and
    once this is done - so this is the new
  • 24:19 - 24:26
    version - and, as I said, when
    we can do this, we can actually use TLS, which is
  • 24:26 - 24:37
    more secure, instead of IPsec to talk
    between data centers. Yes. And now it's
  • 24:37 - 24:42
    time that Lucas talks about what happens
    when you type in en.wikipedia.org.
  • 24:42 - 24:45

    Lucas: Yes, this makes sense, thank you.
  • 24:45 - 24:49
    So, first of all, what you see on the
    slide here as the image doesn't really
  • 24:49 - 24:52
    have anything to do with what happens when
    you type in wikipedia.org because it's an
  • 24:52 - 24:57
    offline Wikipedia reader but it's just a
    nice image. So this is basically a summary
  • 24:57 - 25:03
    of everything they already said, so if,
    which is the most common case, you are
  • 25:03 - 25:11
    lucky and get a URL which is cached, then,
    so, first your computer asks for the IP
  • 25:11 - 25:16
    address of en.wikipedia.org; it reaches
    the GeoDNS daemon, and because we're at
  • 25:16 - 25:19
    Congress here, it tells you the closest
    data center is the one in Amsterdam, so
  • 25:19 - 25:26
    esams, and it's going to hit the edge, what
    we call the load balancers/routers there, then
  • 25:26 - 25:32
    going through TLS termination through
    nginx- and then it's going to hit the
  • 25:32 - 25:37
    Varnish caching server, either frontend or
    backends and then you get a response and
  • 25:37 - 25:41
    that's already it and nothing else is ever
    bothered again. It doesn't even reach any
  • 25:41 - 25:46
    other data center which is very nice and
    so that's, you said around 90% of the
  • 25:46 - 25:52
    requests we get, and if you're unlucky and
    the URL you requested is not in the
  • 25:52 - 25:57
    Varnish in the Amsterdam data center then
    it gets forwarded to the eqiad data
  • 25:57 - 26:02
    center, which is the primary one and there
    it still has a chance to hit the cache and
  • 26:02 - 26:05
    perhaps this time it's there and then the
    response is going to get cached in the
  • 26:05 - 26:10
    frontend, no, in the Amsterdam Varnish and
    you're also going to get a response and we
  • 26:10 - 26:14
    still don't have to run any application
    stuff. If we do have to hit any
  • 26:14 - 26:17
    application stuff, then Varnish is
    going to forward that, if it's
  • 26:17 - 26:23
    upload.wikimedia.org, it goes to the media
    storage Swift, if it's any other domain it
  • 26:23 - 26:28
    goes to MediaWiki and then MediaWiki does
    a ton of work to connect to the database,
  • 26:28 - 26:34
    in this case the first shard for English
    Wikipedia, get the wiki text from there,
  • 26:34 - 26:39
    get the wiki text of all the related pages
    and templates. No, wait I forgot
  • 26:39 - 26:44
    something. First it checks if the HTML for
    this page is available in parser cache, so
  • 26:44 - 26:47
    that's another caching layer, and this
    application cache - this parser cache
  • 26:47 - 26:54
    might either be memcached or the database
    cache behind it and if it's not there,
  • 26:54 - 26:58
    then it has to go get the wikitext, get
    all the related things and render that
  • 26:58 - 27:04
    into HTML which takes a long time and goes
    through some pretty ancient code and if
  • 27:04 - 27:08
    you are doing an edit or an upload, it's
    even worse, because then it always has to go
  • 27:08 - 27:14
    to MediaWiki and then it not only has to
    store this new edit, either in the media
  • 27:14 - 27:20
    back-end or in the database, it also has to
    update a bunch of stuff, like, especially
  • 27:20 - 27:25
    if you-- first of all, it has to purge the
    cache, it has to tell all the Varnish
  • 27:25 - 27:29
    servers that there's a new version of this
    URL available so that it doesn't take a
  • 27:29 - 27:34
    full day until the time-to-live expires.
    It also has to update a bunch of things,
  • 27:34 - 27:39
    for example, if you edited a template, it
    might have been used in a million pages
  • 27:39 - 27:44
    and the next time anyone requests one of
    those million pages, those should also
  • 27:44 - 27:49
    actually be rendered again using the new
    version of the template so it has to
  • 27:49 - 27:54
    invalidate the cache for all of those and
    all that is deferred through the job queue
  • 27:54 - 28:01
    and it might have to calculate thumbnails
    if you uploaded the file or create a -
  • 28:01 - 28:07
    retranscode media files because maybe you
    uploaded in - what do we support? - you
  • 28:07 - 28:10
    upload in WebM and the browser only
    supports some other media codec or
  • 28:10 - 28:13
    something, we transcode that and also
    encode it down to the different
  • 28:13 - 28:20
    resolutions, so then it goes through that
    whole dance and, yeah, that was already
  • 28:20 - 28:24
    those slides. Is Amir going to talk again
    about how we manage -
  • 28:24 - 28:30
    Amir: I mean, okay, yeah, I quickly come back
    just for a short break to talk about
  • 28:30 - 28:37
    how we manage all this, because managing
    1300 bare metal servers plus a Kubernetes
  • 28:37 - 28:43
    cluster is not easy. So what we do is that
    we use Puppet for configuration
  • 28:43 - 28:48
    management on our bare metal systems. It's
    fun, like 50,000 lines of Puppet code. I
  • 28:48 - 28:52
    mean, lines of code is not a great
    indicator but you can roughly get an
  • 28:52 - 28:59
    estimate of how things work. And we
    have 100,000 lines of Ruby, and we have our
  • 28:59 - 29:04
    CI and CD cluster - we don't
    store anything in GitHub or GitLab, we
  • 29:04 - 29:11
    have our own system, which is based on
    Gerrit, and for that we have a system of
  • 29:11 - 29:16
    Jenkins, and Jenkins does all of these
    kinds of things. And also, because we have a
  • 29:16 - 29:22
    Kubernetes cluster for some of
    our services, if you merge a change
  • 29:22 - 29:26
    in Gerrit, it also builds the Docker
    files and containers and pushes them up to
  • 29:26 - 29:35
    production. And also, in order to run remote
    SSH commands, we have cumin, which is like our
  • 29:35 - 29:39
    in-house automation, and we built this
    for our systems; for example, you
  • 29:39 - 29:46
    go there and say, okay, depool this node, or
    run this command on all of the
  • 29:46 - 29:53
    Varnish nodes that I told you about, like when you
    want to restart them. And with this I get
  • 29:53 - 29:58
    back to Lucas.
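
    Cumin has its own query language and API; purely as an illustration
    of the idea - fanning one command out to many hosts over SSH - a toy
    sketch could look like this, with invented host names (this is not
    cumin itself):

        import subprocess
        from concurrent.futures import ThreadPoolExecutor

        # Toy fan-out of a single command to a set of hosts over SSH.
        HOSTS = [f"cache-text{n:02d}.example.org" for n in range(1, 5)]  # invented
        COMMAND = "sudo systemctl restart varnish"

        def run(host):
            result = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", host, COMMAND],
                capture_output=True, text=True, timeout=60,
            )
            return host, result.returncode

        with ThreadPoolExecutor(max_workers=10) as pool:
            for host, rc in pool.map(run, HOSTS):
                status = "OK" if rc == 0 else f"failed ({rc})"
                print(f"{host}: {status}")
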
    Lucas: So, I am going to talk a bit more
  • 29:58 - 30:02
    about Wikimedia Cloud Services which is a
    bit different in that it's not really our
  • 30:02 - 30:06
    production stuff but it's where you
    people, the volunteers of the Wikimedia
  • 30:06 - 30:11
    movement can run their own code, so you
    can request a project which is kind of a
  • 30:11 - 30:16
    group of users and then you get assigned a
    pool of, like, this much CPU and this
  • 30:16 - 30:21
    much RAM and you can create virtual
    machines with those resources and then do
  • 30:21 - 30:29
    stuff there and run basically whatever you
    want, to create and boot and shut down the
  • 30:29 - 30:33
    VMs and stuff we use OpenStack and there's
    a Horizon frontend for that which you use
  • 30:33 - 30:36
    through the browser, and it logs you out
    all the time, but otherwise it works pretty
  • 30:36 - 30:43
    well. Internally, ideally you manage the
    VMs using Puppet but a lot of people just
  • 30:43 - 30:48
    SSH in and then do whatever they need to
    set up the VM manually, and that happens as
  • 30:48 - 30:53
    well. And there's a few big projects like
    Toolforge where you can run your own web-
  • 30:53 - 30:57
    based tools or the beta cluster which is
    basically a copy of some of the biggest
  • 30:57 - 31:02
    wikis like there's a beta English
    Wikipedia, beta Wikidata, beta Wikimedia
  • 31:02 - 31:08
    Commons using mostly the same
    configuration as production but using the
  • 31:08 - 31:12
    current master version of the software
    instead of whatever we deploy once a week so
  • 31:12 - 31:16
    if there's a bug, we see it earlier
    hopefully, even if we didn't catch it
  • 31:16 - 31:20
    locally, because the beta cluster is more
    similar to the production environment and
  • 31:20 - 31:24
    also the continuous
    integration services run in Wikimedia Cloud
  • 31:24 - 31:29
    Services as well. Yeah and also you have
    to have Kubernetes somewhere on these
  • 31:29 - 31:34
    slides right, so you can use that to
    distribute work between the tools in
  • 31:34 - 31:37
    Toolforge or you can use the grid engine
    which does a similar thing but it's like
  • 31:37 - 31:43
    three decades old and through five forks
    now. I think the current fork we use is Son
  • 31:43 - 31:47
    of Grid Engine, and I don't know what it
    was called before, but that's Cloud
  • 31:47 - 31:55
    Services.
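
    For illustration, creating a VM against an OpenStack API with the
    openstacksdk library might look roughly like this; the cloud name,
    image, flavor and network are placeholders, and Cloud Services users
    would normally do this through the Horizon web UI instead:

        import openstack   # pip install openstacksdk

        # Connects using a cloud entry from clouds.yaml; "mycloud" is a placeholder.
        conn = openstack.connect(cloud="mycloud")

        image = conn.compute.find_image("debian-11")      # placeholder image name
        flavor = conn.compute.find_flavor("m1.small")     # placeholder flavor name
        network = conn.network.find_network("lan-flat")   # placeholder network name

        server = conn.compute.create_server(
            name="toolbox-test-01",
            image_id=image.id,
            flavor_id=flavor.id,
            networks=[{"uuid": network.id}],
        )
        server = conn.compute.wait_for_server(server)
        print(server.status)
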
    Amir: So in a nutshell, this is our - our
  • 31:55 - 32:01
    systems. We have 1300 bare metal servers
    with lots and lots of caching, like lots
  • 32:01 - 32:07
    of layers of caching, because mostly we
    serve reads and we can just keep them as a
  • 32:07 - 32:12
    cached version. And all of this is open
    source, you can contribute to it if you
  • 32:12 - 32:18
    want to, and a lot of the configuration
    is also open. And this is the way I got
  • 32:18 - 32:22
    hired: I started contributing
    to the system and they were like, yeah, you can
  • 32:22 - 32:32
    come and work for us. So this is a -
    Daniel: That's actually how all of us got
  • 32:32 - 32:38
    hired.
    Amir: So yeah, and this is the whole thing
  • 32:38 - 32:48
    that happens in Wikimedia and if you want
    to - no, if you want to help us, we are
  • 32:48 - 32:51
    hiring. You can just go to jobs at
    wikimedia.org, if you want to work for
  • 32:51 - 32:54
    Wikimedia Foundation. If you want to work
    with Wikimedia Deutschland, you can go to
  • 32:54 - 32:59
    wikimedia.de and at the bottom there's a
    link for jobs because the links got too
  • 32:59 - 33:03
    long. If you want
    to contribute to us, there are so many ways
  • 33:03 - 33:08
    to contribute. As I said, there are so many
    bugs; we have our own graphing system,
  • 33:08 - 33:13
    you can just look at the monitoring, and
    Phabricator is our bug tracker, you can
  • 33:13 - 33:21
    just go there and find a bug and fix
    things. Actually, we have one repository
  • 33:21 - 33:26
    that is private, but it only holds the
    certificates for TLS and things that are
  • 33:26 - 33:31
    really, really private that we cannot
    make public. But also there is
  • 33:31 - 33:34
    documentation: the documentation for the
    infrastructure is at
  • 33:34 - 33:40
    wikitech.wikimedia.org and documentation
    for configuration is at noc.wikimedia.org
  • 33:40 - 33:47
    plus the documentation of our codebase.
    The documentation for MediaWiki itself is
  • 33:47 - 33:53
    at mediawiki.org. And also we have our
    own URL shortener: you can go to
  • 33:53 - 33:59
    w.wiki and shorten any URL in the
    Wikimedia infrastructure; we reserved the
  • 33:59 - 34:09
    dollar sign for the donate site. And yeah,
    if you have any questions, please.
  • 34:09 - 34:17
    Applause
  • 34:17 - 34:22
    Daniel: So, you know, we have quite a bit of
    time for questions, so if anything wasn't
  • 34:22 - 34:27
    clear or you're curious about anything,
    please, please ask.
  • 34:27 - 34:37
    AM: So, one question about what is not in the
    presentation: do you have any issues with
  • 34:37 - 34:42
    hacking attacks?
    Amir: So the first rule of security issues
  • 34:42 - 34:49
    is that we don't talk about security issues
    but let's say we have all sorts of
  • 34:49 - 34:56
    attacks happening. We usually have
    DDoS attacks. One happened a couple of
  • 34:56 - 35:00
    months ago that was very successful. I
    don't know if you read the news about
  • 35:00 - 35:05
    that. But we also have an infrastructure
    to handle this, we have a security team
  • 35:05 - 35:13
    that handles these cases, and yes.
    AM: Hello, how do you manage access to your
  • 35:13 - 35:20
    infrastructure for your employees?
    Amir: So it's SSH - so we have an LDAP
  • 35:20 - 35:25
    group, and LDAP for the web-based
    systems, but for SSH we
  • 35:25 - 35:31
    have strict protocols, and then you get a
    private key - some people
  • 35:31 - 35:35
    protect their private key using YubiKeys -
    and then you can SSH to the
  • 35:35 - 35:40
    systems, basically.
    Lucas: Yeah, well, there's some
  • 35:40 - 35:45
    firewalling set up, but there's only one
    server per data center that you can
  • 35:45 - 35:48
    actually reach through SSH and then you
    have to tunnel through that to get to any
  • 35:48 - 35:51
    other server.
    Amir: And also, like, we have an
  • 35:51 - 35:56
    internal firewall, and basically, if
    you are inside of production, you
  • 35:56 - 36:01
    cannot talk to the outside. If you,
    for example, do git clone from github.com, it
  • 36:01 - 36:07
    doesn't work. It
    can only access tools that are inside the
  • 36:07 - 36:13
    Wikimedia Foundation infrastructure.
    AM: Okay, hi, you said you do TLS
  • 36:13 - 36:19
    termination through nginx, do you still
    allow non-HTTPS, so non-secure, access?
  • 36:19 - 36:23
    Amir: No we dropped it a really long
    time ago but also
  • 36:23 - 36:25
    Lucas: 2013 or so
    Amir: Yeah, 2015
  • 36:25 - 36:29
    Lucas: 2015
    Amir: In 2013 we started serving most of the
  • 36:29 - 36:36
    traffic over HTTPS, but in '15 we dropped all of the
    non-HTTPS protocols, and recently we even
  • 36:36 - 36:44
    stopped serving any SSL
    requests, and TLS 1.1 is also being
  • 36:44 - 36:48
    phased out, so we are sending a warning
    to the users like: you're using TLS 1.1,
  • 36:48 - 36:55
    please migrate to these new things that
    came out around 10 years ago, so yeah
  • 36:55 - 37:00
    Lucas: Yeah I think the deadline for that
    is like February 2020 or something then
  • 37:00 - 37:05
    we'll only have TLS 1.2
    Amir: And soon we are going to support TLS
  • 37:05 - 37:07
    1.3
    Lucas: Yeah
  • 37:07 - 37:12
    Are there any questions?
    Q: so does read-only traffic
  • 37:12 - 37:18
    from logged in users hit all the way
    through to the parser cache or is there
  • 37:18 - 37:22
    another layer of caching for that?
    Amir: Yes we, you bypass all of
  • 37:22 - 37:28
    that, you can.
    Daniel: We need one more microphone. Yes,
  • 37:28 - 37:34
    it actually does and this is a pretty big
    problem and something we want to look into
  • 37:34 - 37:39
    clears throat but it requires quite a
    bit of rearchitecting. If you are
  • 37:39 - 37:44
    interested in this kind of thing, maybe
    come to my talk tomorrow at noon.
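
    The lookup pattern being discussed here is essentially cache-aside.
    A minimal sketch - not MediaWiki's real ParserCache code, and the
    key scheme below is invented:

        import hashlib

        parser_cache = {}   # stand-in for the memcached/MySQL-backed parser cache

        def parser_cache_key(page_title, options):
            # Real keys also include things like language and parser options;
            # this is a simplified, invented scheme.
            raw = f"{page_title}|{sorted(options.items())}"
            return hashlib.sha1(raw.encode()).hexdigest()

        def render_wikitext(page_title):
            # Placeholder for the expensive wikitext -> HTML parse.
            return f"<html>rendered {page_title}</html>"

        def get_html(page_title, options):
            key = parser_cache_key(page_title, options)
            html = parser_cache.get(key)
            if html is None:                  # cache miss: parse and store
                html = render_wikitext(page_title)
                parser_cache[key] = html
            return html

        print(get_html("Berlin", {"lang": "en"}))

    Logged-in page views skip the CDN, but, as Daniel says, they still
    end up consulting this cache inside MediaWiki before any wikitext is
    re-parsed.
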
  • 37:44 - 37:49
    Amir: Yeah, one thing we are
    planning to do is active-active, so we have
  • 37:49 - 37:56
    two primaries, and for read requests
    the users can hit
  • 37:56 - 37:58
    their secondary data center instead of the
    main one.
  • 37:58 - 38:04
    Lucas: I think there was a question way in
    the back there, for some time already
  • 38:04 - 38:14
    AM: Hi, I got a question. I read on the
    Wikitech that you are using Ganeti as a
  • 38:14 - 38:19
    virtualization platform for some parts, can
    you tell us something about this or what
  • 38:19 - 38:25
    parts of Wikipedia or Wikimedia are hosted
    on this platform?
  • 38:25 - 38:30
    Amir: I'm not- oh sorry- so I don't
    know this for very, very sure, so take
  • 38:30 - 38:34
    it with a grain of salt, but as far as I
    know Ganeti is used to build the very small
  • 38:34 - 38:40
    VMs in production that we need for very,
    very small micro sites that we serve to
  • 38:40 - 38:46
    the users. So we build just one or two VMs;
    we don't use it very often, I
  • 38:46 - 38:55
    think.
    AM: Do you also think about open hardware?
  • 38:55 - 39:04
    Amir: I don't, you can
    Daniel: Not - not for servers. I think for
  • 39:04 - 39:08
    the offline Reader project, but this is not
    actually run by the Foundation, it's
  • 39:08 - 39:10
    supported but it's not something that the
    Foundation does. They were sort of
  • 39:10 - 39:15
    thinking about open hardware but really
    open hardware in practice usually means,
  • 39:15 - 39:20
    you - you don't, you know, if you really
    want to go down to the chip design, it's
  • 39:20 - 39:25
    pretty tough, so yeah, it's
    usually not practical, sadly.
  • 39:25 - 39:32
    Amir: And one thing I can say about this is
    that we have some machines that
  • 39:32 - 39:37
    are really powerful that we give to the
    researchers to run analysis on the data
  • 39:37 - 39:43
    itself, and we needed to have GPUs for
    those, but the problem was there
  • 39:43 - 39:49
    wasn't any open source driver for them, so
    we migrated to AMD, I think, but the AMD card
  • 39:49 - 39:54
    didn't fit in the rack - it was quite an
    endeavor to get it to work for our
  • 39:54 - 40:04
    researchers to have a GPU.
    AM: I'm still impressed that you answer
  • 40:04 - 40:11
    90% out of the cache. Do all people access
    the same pages or is the cache that huge?
  • 40:11 - 40:21
    So what percentage of - of the whole
    database is in the cache then?
  • 40:21 - 40:30
    Daniel: I don't have the exact numbers to
    be honest, but a large percentage of the
  • 40:30 - 40:37
    whole database is in the cache. I mean it
    expires after 24 hours so really obscure
  • 40:37 - 40:43
    stuff isn't there, but I mean it's a
    power-law distribution
  • 40:43 - 40:48
    right? You have a few pages that are
    accessed a lot and you have many many many
  • 40:48 - 40:55
    pages that are not actually accessed
    at all for a week or so except maybe for a
  • 40:55 - 41:02
    crawler, so I don't know a number. My
    guess would be it's less than 50% that is
  • 41:02 - 41:07
    actually cached but, you know, that still
    covers 90%-- it's probably the top 10% of
  • 41:07 - 41:12
    pages would still cover 90% of the
    pageviews, but I don't-- this would be
  • 41:12 - 41:16
    actually-- I should look this up, it would
    be interesting numbers to have, yes.
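
    Daniel's power-law intuition can be sanity-checked with a toy Zipf
    model; the exponent and page count below are made up, not measured
    Wikipedia data:

        # Toy model: page popularity follows Zipf's law, rank r gets weight 1/r.
        # What share of page views do the most popular pages cover?
        N_PAGES = 6_000_000          # order of magnitude of English Wikipedia articles
        TOP_FRACTION = 0.10          # "top 10% of pages"

        weights = [1 / r for r in range(1, N_PAGES + 1)]
        total = sum(weights)
        top_k = int(N_PAGES * TOP_FRACTION)
        share = sum(weights[:top_k]) / total

        print(f"top {TOP_FRACTION:.0%} of pages cover about {share:.0%} "
              f"of views under this toy model")
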
  • 41:16 - 41:21
    Lucas: Do you know if this is 90% of the
    pageviews or 90% of the get requests
  • 41:21 - 41:24
    because, like, requests for the JavaScript
    would also be cached more often, I assume
  • 41:24 - 41:28
    Daniel: I would expect that for non-
    pageviews, it's even higher
  • 41:28 - 41:30
    Lucas: Yeah
    Daniel: Yeah, because you know all the
  • 41:30 - 41:34
    icons and- and, you know, JavaScript
    bundles and CSS and stuff doesn't ever
  • 41:34 - 41:40
    change
    Lucas: I'm gonna say 90% for everything -
  • 41:40 - 41:51
    but there's a question back there
    AM: Hey. Do your data centers run on green
  • 41:51 - 41:55
    energy?
    Amir: Very valid question. So, the
  • 41:55 - 42:03
    Amsterdam one is fully green, but the
    other ones are partially green, partially
  • 42:03 - 42:11
    coal and, like, gas. As far as I know, there
    are some plans to move them away from
  • 42:11 - 42:15
    it, but on the other hand we realized that
    we don't produce that much carbon
  • 42:15 - 42:21
    emission, because we don't have many servers
    and we don't use much data. There was a
  • 42:21 - 42:27
    calculation, and we realized our carbon
    emission - the
  • 42:27 - 42:35
    data centers plus all of the
    travel that all of us have to do and all of
  • 42:35 - 42:38
    the events - is about 250 households. It's very,
    very small; I think it's one
  • 42:38 - 42:45
    thousandth of the comparable
    traffic with Facebook, even if you just
  • 42:45 - 42:51
    compare the same traffic, because
    Facebook collects the data, it runs very
  • 42:51 - 42:54
    sophisticated machine learning algorithms,
    that's really complicated, but for
  • 42:54 - 43:01
    Wikimedia, we don't do this, so we don't
    need much energy. Does that answer
  • 43:01 - 43:05
    your question?
    Herald: Do we have any other
  • 43:05 - 43:16
    questions left? Yeah sorry
    AM: Hi, how many developers do you need to
  • 43:16 - 43:20
    maintain the whole infrastructure, and how
    many developers, or let's say
  • 43:20 - 43:24
    developer hours, did you need to build the
    whole infrastructure? The question is
  • 43:24 - 43:29
    because what I find very interesting about
    the talk is that it's a non-profit, so as an
  • 43:29 - 43:34
    example for other nonprofits: how much
    money are we talking about in order to
  • 43:34 - 43:39
    build something like this as a digital
    common?
  • 43:46 - 43:49
    Daniel: If this is just about actually
    running all this so just operations is
  • 43:49 - 43:54
    less than 20 people, I think, which means if
    you basically divide the requests
  • 43:54 - 44:00
    per second by people you get to something
    like 8,000 requests per second per
  • 44:00 - 44:04
    operations engineer which I think is a
    pretty impressive number. This is probably
  • 44:04 - 44:10
    a lot higher - I would really like
    to know if there's any organization that
  • 44:10 - 44:17
    tops that. I don't actually know the
    actual operations budget; I know it's in the
  • 44:17 - 44:25
    two-digit millions annually. Total
    hours for building this over the last 18
  • 44:25 - 44:29
    years, I have no idea. For the
    first five or so years, the people doing
  • 44:29 - 44:35
    it were actually volunteers. We still had
    volunteer database administrators and
  • 44:35 - 44:42
    stuff until maybe ten years ago, eight
    years ago, so yeah it's really nobody
  • 44:42 - 44:45
    did any accounting of this I can only
    guess.
  • 44:57 - 45:04
    AM: Hello, a tools question. A few years
    back I saw some interesting examples of
  • 45:04 - 45:09
    saltstack use at Wikimedia, but right now
    I see only Puppet and cumin mentioned,
  • 45:09 - 45:18
    so, kind of, what happened with that?
    Amir: I think we ditched saltstack -
  • 45:18 - 45:23
    I cannot say for sure, because none of us are in
    the Cloud Services team and I don't think
  • 45:23 - 45:27
    I can answer you, but if you look at
    wikitech.wikimedia.org,
  • 45:27 - 45:31
    last time I checked it says
    it's deprecated and obsolete, we don't use
  • 45:31 - 45:32
    it anymore.
  • 45:37 - 45:40
    AM: Do you use the batch jobs, like the job
  • 45:40 - 45:46
    runners, to fill spare capacity on the web-
    serving servers, or do you have dedicated
  • 45:46 - 45:52
    servers for those roles?
    Lucas: I think they're dedicated.
  • 45:52 - 45:56
    Amir: The job runners, if you're asking about job runners,
    are dedicated, yes. They are, I
  • 45:56 - 46:03
    think, 5 per primary data center, so
    Daniel: Yeah, they don't - I mean, do we
    actually have any spare capacity on
    actually have any spare capacity on
    anything? We don't have that much hardware
  • 46:07 - 46:09
    everything is pretty much at a hundred
    percent.
  • 46:09 - 46:14
    Lucas: I think we still have some server
    that is just called misc1111 or something
  • 46:14 - 46:19
    which runs five different things at once,
    you can look for those on wikitech.
  • 46:19 - 46:26
    Amir: But - oh sorry, it's not five,
    it's 20 per primary
  • 46:26 - 46:31
    data center, those are our job runners, and they
    run 700 jobs per second.
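
    As a sketch of what a job runner conceptually does - this is not
    MediaWiki's actual JobQueue code, and the job names are only labels:

        import queue

        # Toy job runner: pops background jobs (cache purges, link-table
        # updates, video transcodes, ...) off a queue and executes them.
        jobs = queue.Queue()
        jobs.put({"type": "purge_html_cache", "page": "Template:Infobox"})
        jobs.put({"type": "refresh_links", "page": "Berlin"})

        HANDLERS = {
            "purge_html_cache": lambda job: print("purging pages using", job["page"]),
            "refresh_links": lambda job: print("re-rendering links for", job["page"]),
        }

        def run_forever(poll_interval=1.0):
            while True:
                try:
                    job = jobs.get(timeout=poll_interval)
                except queue.Empty:
                    continue
                HANDLERS[job["type"]](job)
                jobs.task_done()

        # run_forever()   # would loop forever; shown here only as a sketch
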
  • 46:31 - 46:36
    Lucas: And I think that does not include
    the video scaler so those are separate
  • 46:36 - 46:38
    again
    Amir: No, they merged them in like a month
  • 46:38 - 46:40
    ago
    Lucas: Okay, cool
  • 46:47 - 46:51
    AM: Maybe a little bit off topic: can you
    tell us a little bit about the decision-making
  • 46:51 - 46:56
    process for technical decisions,
    architecture decisions? How does it work
  • 46:56 - 47:02
    in an organization like this - the decision-
    making process for architectural
  • 47:02 - 47:03
    decisions, for example.
  • 47:08 - 47:11
    Daniel: Yeah so Wikimedia has a
  • 47:11 - 47:17
    committee for making high-level technical
    decisions, it's called the Wikimedia
  • 47:17 - 47:24
    Technical Committee, TechCom, and we run an
    RFC process, so any decision that is
  • 47:24 - 47:28
    cross-cutting, strategic, or especially
    hard to undo should go through this
  • 47:28 - 47:34
    process and it's pretty informal,
    basically you file a ticket and start
  • 47:34 - 47:38
    this process. It gets announced
    in the mailing list, hopefully you get
  • 47:38 - 47:45
    input and feedback, and at some point
    it's approved for implementation. We're
  • 47:45 - 47:49
    currently looking into improving this
    process, it's not- sometimes it works
  • 47:49 - 47:52
    pretty well, sometimes things don't get
    that much feedback but it still it makes
  • 47:52 - 47:56
    sure that people are aware of these high-
    level decisions
  • 47:56 - 48:00
    Amir: Daniel is the chair of that
    committee
  • 48:02 - 48:08
    Daniel: Yeah, if you want to complain
    about the process, please do.
  • 48:14 - 48:21
    AM: Yes, regarding CI and CD along the
    pipeline: of course, with that much traffic
  • 48:21 - 48:27
    you want to keep everything consistent,
    right. So are there any testing
  • 48:27 - 48:32
    strategies that you have set internally,
    like, of course, unit tests, integration
  • 48:32 - 48:36
    tests, but do you do something like
    continuous end-to-end testing on beta
  • 48:36 - 48:40
    instances?
    Amir: So, we have the beta cluster, but also
  • 48:40 - 48:45
    we do a deploy, we call it the train, and so
    we deploy once a week: all of the changes
  • 48:45 - 48:50
    get merged to one, like, a branch, and the
    branch gets cut every Tuesday, and it
  • 48:50 - 48:55
    first goes to the test wikis and
    then it goes to all of the wikis that are
  • 48:55 - 48:59
    not Wikipedia, except Catalan and Hebrew
    Wikipedia. So basically Hebrew and Catalan
  • 48:59 - 49:04
    Wikipedia volunteered to be the guinea pigs
    for the next wikis, and if everything works
  • 49:04 - 49:08
    fine it usually goes there, and if there are,
    like, fatal errors, we have logging, and
  • 49:08 - 49:13
    then it's like, okay, we need to fix this,
    and we fix it immediately, and then it goes
  • 49:13 - 49:19
    live to all wikis. This is one way of
    looking at it, well, so okay, yeah.
  • 49:19 - 49:23
    Daniel: So, our test coverage is not as
    great as it should be and so we kind of,
  • 49:23 - 49:31
    you know, abuse our users for this. We
    are, of course, working to improve this
  • 49:31 - 49:37
    and one thing that we started recently is
    a program for creating end-to-end tests
  • 49:37 - 49:43
    for all the API modules we have, in the
    hope that we can thereby cover pretty much
  • 49:43 - 49:50
    all of the application logic bypassing the
    user interface. I mean, full end-to-end
  • 49:50 - 49:53
    should, of course, include the user
    interface but user interface tests are
  • 49:53 - 49:58
    pretty brittle and often tests you know
    where things are on the screen and it just
  • 49:58 - 50:03
    seems to us that it makes a lot of sense
    to have more- to have tests that actually
  • 50:03 - 50:07
    test the application logic for what the
    system actually should be doing, rather
  • 50:07 - 50:16
    than what it should look like and, yeah,
    we are currently working on making- so
  • 50:16 - 50:20
    yeah, basically this has been a proof of
    concept and we're currently working to
  • 50:20 - 50:27
    actually integrate it in- in CI. That
    perhaps should land once everyone is back
  • 50:27 - 50:35
    from the vacations and then we have to
    write about a thousand or so tests, I
  • 50:35 - 50:38
    guess.
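
    A sketch of what such an end-to-end API test could look like,
    hitting the MediaWiki action API of a test wiki with pytest and
    requests; the target URL is a placeholder and the real test
    framework is more involved than this:

        import requests   # pip install requests

        API_URL = "https://test.example.org/w/api.php"   # placeholder test wiki

        def test_siteinfo_returns_general_metadata():
            resp = requests.get(API_URL, params={
                "action": "query",
                "meta": "siteinfo",
                "format": "json",
            }, timeout=10)
            assert resp.status_code == 200
            data = resp.json()
            # The action API wraps results in a "query" object; siteinfo exposes
            # general site metadata such as the MediaWiki version ("generator").
            assert "query" in data
            assert "general" in data["query"]
            assert "generator" in data["query"]["general"]
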
    Lucas: I think there's also a plan to move
  • 50:38 - 50:43
    to a system where we actually deploy
    basically after every commit and can
  • 50:43 - 50:46
    immediately roll back if something goes
    wrong but that's more midterm stuff and
  • 50:46 - 50:48
    I'm not sure what the current status of
    that proposal is
  • 50:48 - 50:50
    Amir: And it will be in Kubernetes, so it
    will be completely different
  • 50:50 - 50:56
    Daniel: That would be amazing
    Lucas: But right now, we are on this
  • 50:56 - 51:00
    weekly basis, if something goes wrong, we
    roll back to the last week's version of
  • 51:00 - 51:06
    the code
    Herald: Are there any questions-
  • 51:06 - 51:19
    questions left? Sorry. Yeah. Okay, um, I
    don't think so. So, yeah, thank you for
  • 51:19 - 51:25
    this wonderful talk. Thank you for all
    your questions. Um, yeah, I hope you liked
  • 51:25 - 51:30
    it. Um, see you around, yeah.
  • 51:30 - 51:34
    Applause
  • 51:34 - 51:39
    Music
  • 51:39 - 52:01
    Subtitles created by c3subtitles.de
    in the year 2021. Join, and help us!