36C3 Wikipaka WG: Modernizing Wikipedia

  • 0:00 - 0:21
    36C3 preroll music
  • 0:21 - 0:25
    Daniel: Good morning! I'm glad you all
    made it here this early on the last day. I
  • 0:25 - 0:32
    know it can't be easy; it wasn't easy for
    me. I have to warn you that the way I
  • 0:32 - 0:36
    prepared for this talk is a bit
    experimental. I didn't make a slide set; I
  • 0:36 - 0:45
    just made a mind map and I'll just click
    through it while I talk to you. So,
  • 0:45 - 0:51
    this talk is about modernizing Wikipedia.
    As you have probably noticed, visiting
  • 0:51 - 0:58
    Wikipedia can feel a bit like visiting a
    website from 10-15 years ago. But before I
  • 0:58 - 1:05
    talk about any problems or things to
    improve, I first want to revisit that the
  • 1:05 - 1:12
    software and the infrastructure we
    build around it has been running Wikipedia
  • 1:12 - 1:20
    and its sister sites for the last... well
    nearly 19 years now and it's extremely
  • 1:20 - 1:32
    successful. We serve 17 billion page
    views a month. Yes?
  • 1:32 - 1:41
    Person in the audience: Could you make it
    louder or speak up and also make the image
  • 1:41 - 1:43
    bigger?
  • 1:43 - 1:44
    inaudible dialogue
  • 1:44 - 1:46
    Daniel: Is this better? Like if I speak up
    I will lose my voice in 10 minutes, it's
  • 1:46 - 1:56
    already in it, no it's fine. We have
    technology for this. I can... the light
  • 1:56 - 2:05
    doesn't help, yeah the contrast could be
    better. Is it better like this? Okay cool.
  • 2:05 - 2:14
    All right so yeah we are serving 17
    billion page views a month, which is quite
  • 2:14 - 2:20
    a lot. Wikipedia exists in about 100
    languages. If you attended the talk about
  • 2:20 - 2:24
    the Wikimedia infrastructure yesterday, we
    talked about 300 languages. We actually
  • 2:24 - 2:30
    support 300 languages for localization but
    we have Wikipedia in about 100, if I'm not
  • 2:30 - 2:39
    completely off. I find this picture quite
    fascinating. This is a visualization of
  • 2:39 - 2:44
    all the places in the world that are
    described on Wikipedia and sister projects
  • 2:44 - 2:49
    and I find this quite impressive although
    it's also a nice display of cultural bias
  • 2:49 - 3:01
    of course. We, that is Wikimedia
    Foundation, run about 900 to 1,000 wikis
  • 3:01 - 3:07
    depending on how you count, but there are
    many, many more MediaWiki installations
  • 3:07 - 3:11
    out there, some of them big and many many
    of them small. We have actually no idea
  • 3:11 - 3:17
    how many small instances there are. So
    it's a very powerful very flexible and
  • 3:17 - 3:24
    versatile piece of software but, you know, but
    sometimes it can feel like... you can do a
  • 3:24 - 3:28
    lot of things with it, right, but
    sometimes it feels like it's a bit
  • 3:28 - 3:42
    overburdened and maybe you should look at
    improving the foundations. So one of the
  • 3:42 - 3:48
    things that make MediaWiki great but also
    sometimes hard to use is that kind of
  • 3:48 - 3:53
    everything is text, everything is markup,
    everything is done with wikitext,
  • 3:53 - 4:03
    which has grown in complexity over the
    years. So if you look at the anatomy of a
  • 4:03 - 4:09
    wiki page it can be a bit daunting. You
    have different syntax for markup and
  • 4:09 - 4:16
    different kinds of transclusion or
    templates and media and some things
  • 4:16 - 4:22
    actually, you know, get displayed in
    place, some things show up in a completely
  • 4:22 - 4:26
    different place on the page it can be
    rather confusing and daunting for
  • 4:26 - 4:32
    newcomers. And also things like having a
    conversation just talking to people like,
  • 4:32 - 4:36
    you know, having a conversation thread
    looks like this. You open the page you
  • 4:36 - 4:41
    look through the markup and you indent to
    make a conversation thread and then you
  • 4:41 - 4:43
    get confused about the indenting and
    someone messes with the formatting and
  • 4:43 - 4:52
    it's all excellent. There have been many
    attempts over the years to improve the
  • 4:52 - 5:00
    situation. We have things like Echo, which
    notifies you, for instance when someone
  • 5:00 - 5:09
    mentions your name or someone... It is
    also used to welcome people and do this
  • 5:09 - 5:12
    kind of achievement unlocked
    notifications: hey, you did your first
  • 5:12 - 5:20
    edit, this is great welcome! To make
    people a bit more engaged with the system
  • 5:20 - 5:24
    but it's really mostly improvements around
    the fringes. We have had a system called
  • 5:24 - 5:31
    Flow for a while to improve the way
    conversations work. So you have more like
  • 5:31 - 5:38
    a thread structure that the software
    actually knows about but then there are
  • 5:38 - 5:42
    many, well quite a few people who have
    been around for a while that are very used
  • 5:42 - 5:47
    to the manual system and also there's a
    lot of tools to support this manual system
  • 5:47 - 5:53
    which of course are incompatible with
    making things more modern. So we use this
  • 5:53 - 5:56
    for instance on MediaWiki.org which is a
    site which is basically a self
  • 5:56 - 6:03
    documentation site of MediaWiki but on
    most Wikipedias this is not enabled, or at
  • 6:03 - 6:15
    least not used by default everywhere. The
    biggest attempt to move away from the text
  • 6:15 - 6:23
    only approach is Wikidata, which we
    started in 2012. The idea of Wikidata of
  • 6:23 - 6:30
    course, if you didn't attend the many great
    talks we had about it here over the
  • 6:30 - 6:36
    course of the Congress, is a way to
    basically model the world using structured
  • 6:36 - 6:45
    data, using a semantic approach instead of
    natural language which has its own
  • 6:45 - 6:51
    complexities but at least it's a way to
    represent the knowledge of the world in a
  • 6:51 - 6:57
    way that machines can understand. So this
    would be an alternative to wikitext, but
  • 6:57 - 7:09
    still the vast majority of things
    especially on Wikipedia are just markup.
  • 7:09 - 7:14
    And this markup is pretty powerful and
    there's lots of ways to extend it and to
  • 7:14 - 7:21
    do things with it. So a lot of things on
    MediaWiki are just DIY, just do it
  • 7:21 - 7:29
    yourself. Templates are a great example of
    this. Infoboxes of course, the nice blue
  • 7:29 - 7:35
    boxes here on the right side of pages, are
    done using templates but these templates
  • 7:35 - 7:39
    are just for formatting; there is no data
    processing, there's no database or
  • 7:39 - 7:48
    structured data backing them. It's just
    basically, you know, it's still just
  • 7:48 - 7:57
    markup. It's still... you have a predefined
    layout, but you're still feeding it text, not
  • 7:57 - 8:05
    data. You have parameters but the values
    of the parameters are still again maybe
  • 8:05 - 8:12
    templates or links or you have markup in
    them, like you know HTML line breaks and
  • 8:12 - 8:19
    stuff. So it's kind of semi-structured.
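    To illustrate the "semi-structured" point, here is a minimal Python sketch
    (MediaWiki itself is PHP). The parser, the template call, and all values in
    it are made up for illustration; this is not how MediaWiki parses
    templates. It only shows that even after splitting out the parameters, the
    values are still strings of markup rather than typed data.

```python
from typing import Dict


def parse_template_params(wikitext: str) -> Dict[str, str]:
    """Naively split a one-parameter-per-line template call into name/value strings."""
    inner = wikitext.strip()[2:-2]   # drop the surrounding {{ and }}
    parts = inner.split("\n|")       # parts[0] is the template name
    params = {}
    for part in parts[1:]:
        name, _, value = part.partition("=")
        params[name.strip()] = value.strip()
    return params


# Made-up example data, not taken from any real article.
infobox = """{{Infobox settlement
| name = Exampletown
| population = 12,345<ref>census</ref>
| country = [[Exampleland]]
| area = 10 km2<br />(4 sq mi)
}}"""

params = parse_template_params(infobox)

# The "population" is not a number: it is text with a footnote tag inside,
# so int(params["population"]) would raise ValueError.
print(params["population"])
```

    Any tool that wants the population as a number has to strip the markup
    itself, and this naive split would already fail on a value containing a
    piped link such as [[Exampleland|the country]] on a new line.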
    And this of course is also used to do
  • 8:19 - 8:24
    things like workflow. The template... Oh
    no, this was actually an infobox, wrong
  • 8:24 - 8:34
    picture, wrong capture. This is also used
    to do workflows, so if a page on Wikipedia
  • 8:34 - 8:40
    gets nominated for deletion, you manually
    put a template on the page that defines
  • 8:40 - 8:45
    why this is supposed to be deleted and
    then you have to go to a different page
  • 8:45 - 8:49
    and put a different template there, giving
    more explanation and this again is used
  • 8:49 - 8:55
    for discussion. It's a lot of structure
    created by the community and maintained by
  • 8:55 - 9:03
    the community, using conventions and tools
    built on top of what is essentially just a
  • 9:03 - 9:11
    pile of markup. And because doing all this
    manually is kind of painful, early on
  • 9:11 - 9:17
    we created a system to allow people to add
    JavaScript to the site, which is then
  • 9:17 - 9:27
    maintained on wiki pages by the community
    and it can tweak and automate. But again,
  • 9:27 - 9:31
    it doesn't really have much to work with,
    right? It basically messes with whatever
  • 9:31 - 9:35
    it can, it directly interacts with the DOM
    of the page, whenever the layout of the
  • 9:35 - 9:41
    software changes, things break. So this is
    not great for compatibility but it's
  • 9:41 - 9:55
    used a lot and it is very important for
    the community to have this power. Sorry, I
  • 9:55 - 10:00
    wish there was a better way to show these
    pictures. Okay, that's just to give you an
  • 10:00 - 10:05
    idea of what kind of thing is implemented
    that way and maintained by the community
  • 10:05 - 10:10
    on their site. One of the problems we have
    with that is: these are bound to a wiki
  • 10:10 - 10:19
    and I just told you that we run over 900
    of these (not over 9,000), and it would be
  • 10:19 - 10:26
    great if you could just share them between
    wikis but we can't. And again, there have
  • 10:26 - 10:31
    been... we have been talking about it a
    lot and it seems like it shouldn't be so
  • 10:31 - 10:37
    hard, but you kind of need to write these
    tools differently, if you want to share
  • 10:37 - 10:40
    them across sites, because different sites
    use different conventions, they use
  • 10:40 - 10:46
    different templates. Then it just doesn't
    work and you actually have to write decent
  • 10:46 - 10:51
    software that uses internationalization if
    you want to use it across wikis. While
  • 10:51 - 10:55
    these are usually just you know one-off
    hacks with everything hard-coded we would
  • 10:55 - 10:58
    have to put in place an
    internationalization system and it's
  • 10:58 - 11:03
    actually a lot of effort and there's a lot
    of things that are actually unclear about
  • 11:03 - 11:15
    it. So, before I dive more deeply into the
    different things that will make it hard to
  • 11:15 - 11:21
    improve on the current situation and the
    things that we are doing to improve it do
  • 11:21 - 11:27
    we have any questions or do you have any
    other - do you have any things you may
  • 11:27 - 11:35
    find particularly, well, annoying or
    particularly outdated, when interacting
  • 11:35 - 11:41
    with Wikipedia? Any thoughts on that?
    Beyond what I just said?
  • 11:41 - 11:49
    Microphone: The strict separation, just in
    Wikipedia, between mobile layout and
  • 11:49 - 11:54
    desktop layout.
    Daniel: Yeah. So, actually having a
  • 11:54 - 12:02
    reactive layout system that would just
    work for mobile and desktop in the same
  • 12:02 - 12:09
    way and allowing the designers and UX
    experts, who work on the system to just do
  • 12:09 - 12:15
    this once and not two or maybe even three
    times - because of course we also have
  • 12:15 - 12:21
    native applications for different
    platforms - would be great and it's
  • 12:21 - 12:24
    something that we're looking into at the
    moment. But it's not, you know, it's not
  • 12:24 - 12:30
    that easy. We could build a completely new
    system that does this, but then again you
  • 12:30 - 12:33
    would be telling people: "You can no
    longer use the old system", but now they
  • 12:33 - 12:39
    have built all these tools that rely on
    how the old system works and you have to
  • 12:39 - 12:52
    port all of this over so there's a lot of
    inertia. Any other thoughts? Everyone is
  • 12:52 - 13:04
    still asleep that's excellent. So I can
    continue. So, another thing that makes it
  • 13:04 - 13:11
    difficult to change how MediaWiki works or
    to improve it is that we are trying to do
  • 13:11 - 13:19
    well to be at least two things at once: on
    the one hand we are running a top 5
  • 13:19 - 13:24
    website and serving over 100,000 requests
    per second using the system, and on the
  • 13:24 - 13:31
    other hand, at least until now, we have
    always made sure that you can just
  • 13:31 - 13:34
    download MediaWiki and install it on a
    shared hosting platform. You don't even
  • 13:34 - 13:39
    need root on the system, right? You don't
    even need administrative privileges you
  • 13:39 - 13:45
    can just set it up and run it in your web
    space and it will work. And, having the
  • 13:45 - 13:52
    same piece of software do both, run in a
    minimal environment and run at scale, is
  • 13:52 - 13:55
    rather difficult and it also means that
    there's a lot of things that we can't
  • 13:55 - 14:02
    easily do, right? All this modern microservice
    architecture, separate front-end
  • 14:02 - 14:09
    and back-end systems, all of that means
    that it's a lot more complicated to set up
  • 14:09 - 14:16
    and needs more knowledge or more
    infrastructure to set up and so far that
  • 14:16 - 14:20
    meant we can't do it, because so far there
    was this requirement that you should
  • 14:20 - 14:24
    really be able to just run it on your
    shared hosting. And we are currently
  • 14:24 - 14:30
    considering to what extent we can continue
    this, I mean, container based hosting is
  • 14:30 - 14:35
    picking up. Maybe this is an alternative
    it's still unclear but it seems like this
  • 14:35 - 14:46
    is something that we need to reconsider.
    Yeah, but if we make this harder to do
  • 14:46 - 14:53
    then a lot of current users of MediaWiki
    would maybe not, well, maybe no longer
  • 14:53 - 14:57
    exist or at least would not exist as they
    do now, right. You probably have seen
  • 14:57 - 15:05
    this nice MediaWiki instance the Congress
    wiki. Which - with a completely customized
  • 15:05 - 15:10
    skin and a lot of extensions installed to
    allow people to define their sessions
  • 15:10 - 15:14
    there and making sure these sessions
    automatically get listed and get put into
  • 15:14 - 15:21
    a calendar - this is all done using
    extensions, like Semantic MediaWiki, that
  • 15:21 - 15:34
    allow you to basically define queries in
    the wiki text markup. Yeah, another thing
  • 15:34 - 15:42
    that, of course, slows down development is
    that Wikimedia does engineering on a,
  • 15:42 - 15:48
    well, comparatively a shoestring budget,
    right? The budget of the Wikimedia
  • 15:48 - 15:52
    Foundation, the annual budget is something
    like a hundred million dollars, that
  • 15:52 - 15:58
    sounds like a lot of money, but if you
    compare it to other companies running a
  • 15:58 - 16:03
    top five or top ten website it's like two
    percent of their budget or something like
  • 16:03 - 16:11
    that, right? It's really, I mean, 100
    million is not peanuts but compared to
  • 16:11 - 16:17
    what other companies invest to achieve
    this kind of goal, it kind of is. So, what
  • 16:17 - 16:22
    this budget translates into is something
    like 300, depending on how you count,
  • 16:22 - 16:29
    between three hundred and four hundred
    staff. So, this is the people who run all
  • 16:29 - 16:32
    of this, including all the community
    outreach all the social aspects all the
  • 16:32 - 16:41
    administrative aspects. Less than half of
    these are the engineers who do all this.
  • 16:41 - 16:51
    And we have like, something like 2,500
    servers, bare-metal, so, which is not a
  • 16:51 - 16:58
    lot for this kind of thing. Which also
    means that we have to design the software
  • 16:58 - 17:07
    to be not just scalable but also quite
    efficient. The modern approach to scaling
  • 17:07 - 17:12
    is usually scale horizontally make it so
    you can just spin up another virtual
  • 17:12 - 17:19
    machine in some cloud service, but, yeah,
    we run our own service, we run our own
  • 17:19 - 17:24
    servers, so we can design to scale
    horizontally, but it means ordering
  • 17:24 - 17:32
    hardware and setting it up and it's going
    to take half a year or so. And we don't
  • 17:32 - 17:38
    actually have that many people who do
    this, so, scalability and performance are
  • 17:38 - 17:49
    also important factors when designing the
    software. Okay. Before I dive into what we
  • 17:49 - 18:04
    are actually doing - any questions? This
    one in the back. Wait for the mic, please.
  • 18:04 - 18:07
    In the very...
    Q: Hi!
  • 18:07 - 18:13
    Daniel: Hello.
    Q: So, you said you don't have that many
  • 18:13 - 18:23
    people, but how many do you actually have?
    Daniel: For... it's something like 150 engineers
  • 18:23 - 18:27
    worldwide. It always depends on what you
    count, right? So you count the people, who
  • 18:27 - 18:32
    - do you count engineers, who work on the
    native apps, do you count engineers, who
  • 18:32 - 18:37
    work on the Wikimedia cloud services -
    actually we do have cloud services, we
  • 18:37 - 18:41
    offer them to the community to run their
    own things, but we don't run our stuff on
  • 18:41 - 18:46
    other people's cloud. Yeah, so depending
    on how you count or something and whether
  • 18:46 - 18:50
    you count the people working here in
    Germany for Wikimedia Germany, which is a
  • 18:50 - 18:58
    separate organization technically - it's
    something like 150 engineers.
  • 18:58 - 19:08
    Q: Thanks!
    Q: I'm interested: What are the reasons
  • 19:08 - 19:14
    that you don't run on other people's
    services like on the cloud. I mean, then
  • 19:14 - 19:17
    it will be easy to scale horizontally,
    right?
  • 19:17 - 19:25
    Daniel: There's, well, one reason is being
    independent, right? If we, yeah, I imagine
  • 19:25 - 19:32
    we ran all our stuff on Amazon's
    infrastructure and then maybe Amazon
  • 19:32 - 19:38
    doesn't like the way that the Wikipedia
    article about Amazon is written - what do
  • 19:38 - 19:42
    we do, right? Maybe they shut us down,
    maybe they make things very expensive,
  • 19:42 - 19:47
    maybe they make things very painful for
    us, maybe there is at least some kind of
  • 19:47 - 19:54
    self-censorship mechanism happening and we
    want to avoid that. There are
  • 19:54 - 19:58
    thoughts about this. There are thoughts
    like maybe we can do this at least for
  • 19:58 - 20:04
    development infrastructure and CI, not for
    production or maybe we can make it so that
  • 20:04 - 20:12
    we run stuff in the cloud services by more
    than one vendor, so we basically we spread
  • 20:12 - 20:18
    out so we are not reliant on a single
    company. We are thinking about these
  • 20:18 - 20:22
    things but so far the way to actually stay
    independent has been to run our own
  • 20:22 - 20:28
    servers.
    Q: You've been talking about scalability
  • 20:28 - 20:35
    and changing the architecture, that kind
    of seems to imply to me that there's a
  • 20:35 - 20:42
    problem with scaling at the moment or that
    it's foreseeable that things are not gonna
  • 20:42 - 20:47
    work out if you just keep doing what
    you're doing at the moment. Can you maybe
  • 20:47 - 20:52
    elaborate on that.
    Daniel: So, there's, I think there's two sides
  • 20:52 - 20:57
    to this. On the one hand the reason I
    mentioned it is just that a lot of things
  • 20:57 - 21:02
    that are really easy to do basically for
    me, right ("works on my machine"), are really
  • 21:02 - 21:09
    hard to do if you want to do them at
    scale. That's one aspect. The other aspect
  • 21:09 - 21:17
    is MediaWiki is pretty much a PHP monolith
    and that means scaling it always means
  • 21:17 - 21:24
    copying the monolith and breaking it down
    so you have smaller units that you can
  • 21:24 - 21:29
    scale and just say, yeah, I don't know, I
    need more instances for authentication
  • 21:29 - 21:34
    handling or something like that. That
    would be more efficient, right, because
  • 21:34 - 21:41
    you have higher granularity, you can just
    scale the things that you actually need
  • 21:41 - 21:48
    but that of course needs rearchitecting.
    It's not like things are going to explode
  • 21:48 - 21:53
    if we don't do that very soon, it's not,
    so there's not like an urgent problem
  • 21:53 - 21:58
    there. The reason for us to rearchitect is
    more, to gain more flexibility in
  • 21:58 - 22:03
    development, because if you have a
    monolith that is pretty entangled, code
  • 22:03 - 22:16
    changes are risky and take a long time.
    Q: How many people work on product design
  • 22:16 - 22:25
    or like user experience research to, like,
    sit down with users and try to understand
  • 22:25 - 22:28
    what their needs are and from there
    proceed.
  • 22:28 - 22:33
    Daniel: Across... I don't have an exact number,
    something like five.
  • 22:33 - 22:38
    Audience: Do you think that's sufficient?
    Herald: The question was, whether it's
  • 22:38 - 22:47
    sufficient. So just...
    Daniel: Probably not? But it's more than,
  • 22:47 - 22:50
    that's more people than we have for
    database administration, and that's also
  • 22:50 - 23:06
    not sufficient.
    Herald: Are there further questions? I don't
  • 23:06 - 23:16
    think so.
    Daniel: Okay. So, one of the things, that
  • 23:16 - 23:20
    holds us back a bit, is that there's
    literally thousands of extensions for
  • 23:20 - 23:27
    MediaWiki and the extension mechanism is
    heavily reliant on hooks, so basically on
  • 23:27 - 23:40
    callbacks. And, we have - I don't have a
    picture, I have a link here - we have a
  • 23:40 - 23:44
    great number of these. So, you see, each
    paragraph is basically documenting one
  • 23:44 - 23:52
    callback that you can use to modify the
    behavior of the software and, I mean,
  • 23:52 - 23:59
    there's, I have never counted, but
    something like a thousand? And all of them
  • 23:59 - 24:08
    are of course interfaces to extra - to
    software that is maintained externally, so
  • 24:08 - 24:13
    they have to be kept stable and if you
    have a large chunk of software that you
  • 24:13 - 24:17
    want to restructure but you have a
    thousand fixed points that you can't
  • 24:17 - 24:23
    change, things become rather difficult.
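    The hook mechanism described here can be pictured with a small sketch in
    Python (MediaWiki itself is PHP). The registry, the hook name
    "PageSaveStart", and the convention that a handler returning False aborts
    the operation are all hypothetical names and assumptions for this
    illustration, not MediaWiki's actual API.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical hook registry: hook name -> list of callbacks.
_handlers: Dict[str, List[Callable[..., Any]]] = defaultdict(list)


def register_hook(name: str, handler: Callable[..., Any]) -> None:
    """Called by extensions to attach a callback to a named hook point."""
    _handlers[name].append(handler)


def run_hook(name: str, *args: Any) -> bool:
    """Called by core code; a handler returning False aborts the operation."""
    for handler in _handlers[name]:
        if handler(*args) is False:
            return False
    return True


# --- "core" code -----------------------------------------------------------
def save_page(page: dict) -> bool:
    # The hook exposes the mutable page dict, so its shape becomes a fixed
    # point that core can no longer change freely.
    if not run_hook("PageSaveStart", page):
        return False
    page["saved"] = True
    return True


# --- "extension" code, maintained by someone else ----------------------------
def block_empty_pages(page: dict):
    if not page["text"].strip():
        return False  # veto the save
    return None


register_hook("PageSaveStart", block_empty_pages)

print(save_page({"text": "Hello"}))
print(save_page({"text": "   "}))
```

    Every such hook point passes internal objects to external code, which is
    why each one constrains how the code around it can be restructured.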
    It's basi.. yeah, these hook points kind
  • 24:23 - 24:28
    of, like, they act like nails in the
    architecture and then you kind of have to
  • 24:28 - 24:37
    wiggle around them - it's fun. We are
    working to change that. We want to
  • 24:37 - 24:44
    architect it so the interface that is
    exposed to these hooks becomes much more
  • 24:44 - 24:51
    narrow and the things that these hooks or
    these callback functions can do are much
  • 24:51 - 24:59
    more restricted. There's currently an RFC
    open for this, has been open for a while
  • 24:59 - 25:05
    actually. The problem is that in order to
    assess whether the proposal is actually
  • 25:05 - 25:12
    viable you have to survey all the current
    users of these hooks and make sure that
  • 25:12 - 25:16
    their use cases are still covered in the
    new system and, yeah, we have like a
  • 25:16 - 25:21
    thousand hook points and we have like a
    thousand extensions. That's quite a bit of
  • 25:21 - 25:31
    work. Another thing that I'm currently
    working on is establishing a stable
  • 25:31 - 25:37
    interface policy. This may sound pretty
    obvious - it has a lot of pretty obvious
  • 25:37 - 25:42
    things like, yeah, if you have a class and
    there's a public method then that's a
  • 25:42 - 25:46
    stable interface it will not just change
    without notice, we have a deprecation policy
  • 25:46 - 25:53
    and all that. But if you have worked with
    extensible systems that rely on the
  • 25:53 - 25:58
    mechanisms of object-oriented programming,
    you may have come across the question
  • 25:58 - 26:05
    whether a protected method is part of this
    stable interface of the software or not,
  • 26:05 - 26:10
    or maybe the constructor? I don't know, if
    you have worked in environments that use
  • 26:10 - 26:16
    dependency injection the idea is basically
    that the construction signature should be
  • 26:16 - 26:21
    able to change at any time but then you
    have extensions that are subclassing and
  • 26:21 - 26:26
    things break. So, this is why we are
    trying to establish a much more
  • 26:26 - 26:33
    restrictive stable interface policy, that
    would make explicit things like
  • 26:33 - 26:37
    constructor signatures actually not being
    stable and that gives us a lot more wiggle
  • 26:37 - 26:51
    room to restructure the software.
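    The constructor problem can be shown with a short Python sketch (MediaWiki
    itself is PHP). All class names here are hypothetical. The assumption is a
    core class whose constructor gains a new injected dependency between two
    releases, and an extension that subclasses it and calls the old
    constructor signature.

```python
class Database:
    def query(self, sql: str) -> str:
        return f"rows for {sql}"


class Cache:
    def get(self, key: str):
        return None


class PageStoreV1:
    """Core class in the old release: one injected dependency."""

    def __init__(self, db: Database):
        self.db = db

    def load(self, title: str) -> str:
        return self.db.query(f"page {title}")


class PageStoreV2(PageStoreV1):
    """Same core class after refactoring: the constructor now also needs a cache."""

    def __init__(self, db: Database, cache: Cache):
        super().__init__(db)
        self.cache = cache


def make_audited_store(base):
    """Stands in for an extension that subclasses whatever core ships."""

    class AuditedPageStore(base):
        def __init__(self, db: Database):
            super().__init__(db)  # written against the V1 signature
            self.log = []

        def load(self, title: str) -> str:
            self.log.append(title)
            return super().load(title)

    return AuditedPageStore


# Works against the old core class...
store = make_audited_store(PageStoreV1)(Database())
print(store.load("Main_Page"))

# ...and breaks as soon as core changes the constructor.
try:
    make_audited_store(PageStoreV2)(Database())
except TypeError as error:
    print("extension broke:", error)
```

    Declaring constructor signatures explicitly unstable tells extension
    authors not to rely on them, which leaves core free to change what gets
    injected.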
    MediaWiki itself has grown as a software
  • 26:51 - 26:59
    for the last 18 years or so and, at least
    in the beginning, was mostly created by
  • 26:59 - 27:06
    volunteers. And in a monolithic
    architecture there's a great tendency to
  • 27:06 - 27:11
    just, you know, find and grab the thing
    that you want to use and just use it.
  • 27:11 - 27:19
    Which leads to, well, structures like this
    one: everything depends on everything. And
  • 27:19 - 27:26
    if you change one bit of code everything
    else may or may not break. And with, yeah.
  • 27:26 - 27:31
    And if you don't have great test coverage
    at the same time this just makes it so
  • 27:31 - 27:35
    that any change becomes very risky and you
    have to do a lot of manual testing a lot
  • 27:35 - 27:44
    of manual digging around, touching a lot
    of files. And for the last year,
  • 27:44 - 27:51
    year and a half we have started a
    concerted effort to tie the worst - to cut
  • 27:51 - 27:58
    the worst ties, to decouple these things
    that are, basically that have most impact
  • 27:58 - 28:03
    there's a few objects in the software that
    rep... - for instance one that represents
  • 28:03 - 28:08
    the user and one that represents a title
    that are used everywhere and the way
  • 28:08 - 28:14
    they're implemented currently also means
    that they depend on everything and that of
  • 28:14 - 28:30
    course is not a good situation. On a,
    well, a similar idea on a higher level is
  • 28:30 - 28:34
    decomposition of the software so the
    decoupling was about the software
  • 28:34 - 28:40
    architecture this is about the system
    architecture breaking up the
  • 28:40 - 28:45
    monolith itself into multiple services that
    serve different purposes. The specifics of
  • 28:45 - 28:50
    this diagram are not really relevant to
    this talk. This is more to, you know, give
  • 28:50 - 28:58
    you an impression of the complexity and
    the sort of work we are doing there. The
  • 28:58 - 29:06
    idea is that perhaps we could split out
    certain functionality into its own service
  • 29:06 - 29:11
    into a separate application, like maybe
    move all the search functionality into
  • 29:11 - 29:17
    something separate and self-contained, but
    then the question is how do you, again,
  • 29:17 - 29:23
    compose this into the final user interface
    - at some point these things have to get
  • 29:23 - 29:28
    composed together again - and again this
    is a very trivial issue if you
  • 29:28 - 29:32
    only want this to work on your
    machine or you only need to serve a
  • 29:32 - 29:40
    hundred users or something. But doing this
    at scale doing it at the rate of something
  • 29:40 - 29:45
    like 10,000 page views a second, I said a
    hundred thousand requests earlier but that
  • 29:45 - 29:52
    includes resources, icons, CSS and all
    that. So, yeah, then you have to think
  • 29:52 - 29:58
    pretty hard about what you can cache and,
    thank you, how you can recombine things
  • 29:58 - 30:03
    without having to recompute everything and
    this is something that we are currently
  • 30:03 - 30:09
    looking into - coming up with an
    architecture that allows us to compose and
  • 30:09 - 30:23
    recombine the output of different
    background services. Okay. Before I
  • 30:23 - 30:28
    started this talk I said I would probably
    roughly use half of my time going through
  • 30:28 - 30:33
    the presentation and I guess I just hit
    that spot on. So, this is all I have
  • 30:33 - 30:41
    prepared but I'm happy to talk to you more
    about the things I said or maybe any other
  • 30:41 - 30:48
    aspects of this that you may be interested
    in. Any comments or questions? Oh!
  • 30:48 - 30:57
    Three already.
    Q: First of all thanks a lot for the
  • 30:57 - 31:03
    presentation, such a really interesting
    case of a legacy system and thanks for the
  • 31:03 - 31:10
    honesty. It was really interesting as a,
    you know, software engineer to see how
  • 31:10 - 31:15
    that works. I have a question about
    decoupling, so, I mean, I kind of, you
  • 31:15 - 31:23
    have like, probably your system is
    enormous and how do you find, so to say,
  • 31:23 - 31:29
    the most evil, you know, parts which
    sort of have to be decoupled. Do you use other
  • 31:29 - 31:35
    software with, you know, like,
    metrics and stuff, or do you just know,
  • 31:35 - 31:38
    kind of intuitively..
    Daniel: Yeah, it's actually, this is quite
  • 31:38 - 31:45
    interesting and maybe I can, maybe we can
    talk about it a bit more in depth later.
  • 31:45 - 31:49
    Very quickly: it's a combination. On the
    one hand you just have the anecdotal
  • 31:49 - 31:53
    experience of what is actually annoying
    when you work with the software and you
  • 31:53 - 31:59
    try to fix it and on the other hand I try
    to find good tooling for this and the
  • 31:59 - 32:05
    existing tooling tends to die when you
    just run it against our code base. So, one
  • 32:05 - 32:10
    of the things that you are looking for are
    cyclic dependencies but the number of
  • 32:10 - 32:15
    possible cycles in a graph grows
    exponentially with the number of nodes. And
  • 32:15 - 32:18
    if you have a pretty tightly knit graph
    that number quickly goes into the
  • 32:18 - 32:27
    millions. And, yeah, the tool just goes to
    100% CPU and never returns. So, I spent
  • 32:27 - 32:34
    quite a bit of time trying to find
    heuristics to get around that - was a lot
  • 32:34 - 32:42
    of fun. I can, yeah, we can talk about
    that later, if you like. Okay, thanks.
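    The talk does not say which heuristics were used. One common way to avoid
    enumerating every cycle is to compute strongly connected components
    instead, which takes time linear in the size of the graph. The Python
    sketch below uses Tarjan's algorithm; the class names in the example graph
    are made up, and this is an assumed stand-in technique, not the speaker's
    actual tooling.

```python
from typing import Dict, List


def strongly_connected_components(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Tarjan's algorithm. Recursive, so very deep graphs would need an
    iterative variant or a higher recursion limit."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    components: List[List[str]] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, []):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def tangles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Groups of two or more classes that all depend on each other."""
    return [sorted(c) for c in strongly_connected_components(graph) if len(c) > 1]


# Hypothetical dependency graph: class -> classes it depends on.
deps = {
    "User": ["Title", "Database"],
    "Title": ["User", "Language"],
    "Language": ["Title"],
    "Database": [],
}
print(tangles(deps))  # [['Language', 'Title', 'User']]
```

    Each reported group contains at least one cycle, so the size of the
    largest group is a cheap measure of how entangled the code is, without
    listing the individual cycles.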
  • 32:42 - 32:49
    Q: So what exactly is this Wikidata you
    mentioned before. Is it like an extension
  • 32:49 - 32:56
    or is it a completely different project?
    Daniel: Wiki - so there's an extension called
  • 32:56 - 33:05
    Wikibase, that implements this, well I
    would say, ontological modeling interface
  • 33:05 - 33:12
    for MediaWiki and that is used to run a
    website called Wikidata which has
  • 33:12 - 33:20
    something like 30 million items modeled
    that describe the world and serve as a
  • 33:20 - 33:26
    machine-readable data back-end to other
    wiki project, other Wikimedia projects.
  • 33:26 - 33:33
    Yeah, I used to work on that project for
    Wikimedia Germany. I moved on to do
  • 33:33 - 33:41
    different things now for a couple of
    years. Lukas here in front is probably the
  • 33:41 - 33:50
    person most knowledgeable about the latest
    and greatest in the Wikidata development.
  • 33:50 - 33:56
    Q: You've briefly talked about test
    coverage. I would be interested..
  • 33:56 - 33:59
    Daniel: Sorry?
    Q: You talked about test coverage.
  • 33:59 - 34:02
    Daniel: Yes.
    Q: I would be interested in if you amped up
  • 34:02 - 34:08
    your efforts to help you modernize it and
    how your current situation is with test
  • 34:08 - 34:12
    coverage.
    Daniel: Test coverage for MediaWiki core is below
  • 34:12 - 34:22
    50%. In some parts it's below 10% which is
    very worrying. One thing that we started
  • 34:22 - 34:30
    to look into, like half a year ago, is
    instead of writing unit tests for all the
  • 34:30 - 34:36
    code that we actually want to throw away,
    before we touch it, we tried to improve
  • 34:36 - 34:41
    the test coverage using integration tests
    on the API level. So we are currently in
  • 34:41 - 34:48
    the process of writing a suite of tests,
    not just for the API modules, but for all
  • 34:48 - 34:55
    the functionality, all the application
    logic behind the API. And that will
  • 34:55 - 35:01
    hopefully cover most of the relevant code
    paths and will give us confidence when we
  • 35:01 - 35:12
    refactor the code.
    Q: Thanks.
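
The approach described in this answer can be sketched roughly as follows: the checks only go through a single API entry point and assert on the responses, so the code behind that entry point can be restructured without rewriting the tests. `TinyWiki`, its `api` method, and the `edit`/`read` actions are hypothetical stand-ins invented for this sketch; they do not mirror MediaWiki's actual API or its test suite.

```python
class TinyWiki:
    """Hypothetical application whose internals we expect to refactor."""

    def __init__(self):
        self._revisions = {}  # title -> list of revision texts

    def api(self, params):
        """Single entry point; the tests below never reach past it."""
        action = params.get("action")
        if action == "edit":
            history = self._revisions.setdefault(params["title"], [])
            history.append(params["text"])
            return {"result": "ok", "revision": len(history)}
        if action == "read":
            history = self._revisions.get(params["title"])
            if not history:
                return {"error": "missing"}
            return {"result": "ok", "text": history[-1], "revision": len(history)}
        return {"error": "unknown-action"}


def test_edit_then_read_round_trip():
    wiki = TinyWiki()
    first = wiki.api({"action": "edit", "title": "Sandbox", "text": "v1"})
    assert first["revision"] == 1
    wiki.api({"action": "edit", "title": "Sandbox", "text": "v2"})
    reply = wiki.api({"action": "read", "title": "Sandbox"})
    assert reply == {"result": "ok", "text": "v2", "revision": 2}


def test_reading_missing_page_reports_error():
    reply = TinyWiki().api({"action": "read", "title": "Nope"})
    assert reply == {"error": "missing"}


# Run the checks directly; a test runner such as pytest would also collect them.
for check in (test_edit_then_read_round_trip, test_reading_missing_page_reports_error):
    check()
print("all API-level checks passed")
```

Because the assertions describe observable behaviour only, replacing the dictionary inside `TinyWiki` with some other storage would leave both tests unchanged, which is the kind of confidence during refactoring that the answer refers to.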
  • 35:12 - 35:26
    Herald: Other questions?
    Q: So you said that you have this legacy
  • 35:26 - 35:32
    system and eventually you have to move
    away from it but are there any, like, I
  • 35:32 - 35:40
    don't know, plans for the near future to,
    I don't know. At some point you have to
  • 35:40 - 35:47
    cut the current infrastructure to your
    extensions and so on and it's a hard cut, I
  • 35:47 - 35:53
    see. But are there any plans to build it
    up from scratch or what are the plans?
  • 35:53 - 35:58
    Daniel: Yeah, we are not going to rewrite from
    scratch - that's a pretty surefire way to
  • 35:58 - 36:05
    just kill the system. We will have to make
    some tough decisions about backwards
  • 36:05 - 36:11
    compatibility and probably reconsider some
    of the requirements and constraints we
  • 36:11 - 36:17
    have, well, with respect to the platforms
    we run on and also the platforms we serve.
  • 36:17 - 36:21
    One of the things that we have been very
    careful to do in the past for instance is
  • 36:21 - 36:27
    to make sure that you can do pretty much
    everything with MediaWiki with no
  • 36:27 - 36:33
    JavaScript on the client side. And that
    requirement is likely to drop. You will
  • 36:33 - 36:40
    still be able to read of course, without
    any JavaScript or anything, but the extent
  • 36:40 - 36:46
    of functionality you will have without
    JavaScript on the client side is likely to
  • 36:46 - 36:51
    be greatly reduced - that kind of thing.
    Also we will probably end up breaking
  • 36:51 - 36:58
    compatibility with at least some of the
    user-created tools. Hopefully we can offer
  • 36:58 - 37:02
    good alternatives, good APIs, good
    libraries that people can actually port
  • 37:02 - 37:11
    to, that are less brittle. I hope that
    will motivate people and maybe repay them
  • 37:11 - 37:16
    a bit for the pain of having their tool
    broken, if we can give them something that
  • 37:16 - 37:21
    is more stable, more reliable, and
    hopefully even nicer to use. Yeah, so,
  • 37:21 - 37:26
    it's small increments, bits and pieces
    all over the system; there's no, you know,
  • 37:26 - 37:33
    no great master plan, no big change to
    point to really.
  • 37:33 - 37:45
    Herald: Okay, okay, further questions?
    Daniel: I plan to just sit outside here at
  • 37:45 - 37:55
    the table later if you just want to come
    and chat so we can also do that there.
  • 37:55 - 38:01
    Herald: Okay, so, last call: are there any
    other questions? It does not appear so,
  • 38:01 - 38:08
    so, I'd like to ask for a huge applause for
    Daniel for this talk.
  • 38:08 - 38:13
    Applause
  • 38:13 - 38:15
    36C3 postroll music
  • 38:15 - 38:38
    Subtitles created by c3subtitles.de
    in the year 2020. Join, and help us!
Title:
36C3 Wikipaka WG: Modernizing Wikipedia
Video Language:
English
Duration:
38:40
