
36C3 Wikipaka WG: Measuring Code Contributions in Wikimedia's Technical Community

  • 0:00 - 0:12
    36C3 preroll music
  • 0:12 - 0:23
    Andre Klapper: Alright, thank you. Thanks
    for your interest. I'm Andre, I'm with the
  • 0:23 - 0:28
    Wikimedia Foundation, and one of the
    things I'm currently trying to find out is
  • 0:28 - 0:37
    how to measure activity, people in our
    technical communities. And you probably
  • 0:37 - 0:42
    know that Wikimedia is a large, large
    project. There's like more than 900
  • 0:42 - 0:48
    websites, and there's many areas where you
    can contribute, technically, in different
  • 0:48 - 0:53
    ways. And we're currently trying to get an
    overview. And even that is hard.
  • 0:53 - 1:02
    So, it is a complex task. And in this talk, I would
    like to quickly show you what we already
  • 1:02 - 1:08
    have in place, and what we want to get in
    place, and maybe also little bits of the
  • 1:08 - 1:14
    problems and the complexity. So, it's more
    like, for your interest, or if you're
  • 1:14 - 1:24
    curious also to play with technical
    metrics, statistics, things like these.
  • 1:24 - 1:31
    What we have currently is, mostly is about
    git repositories, code repositories, and
  • 1:31 - 1:35
    we mostly use Gerrit for code review. We
    have our own Gerrit instance at
  • 1:35 - 1:43
    gerrit.wikimedia.org. And for this we've
    been having a platform called
  • 1:43 - 1:52
    wikimedia.biterg.io. If you've seen an
    Elasticsearch, Kibana, standard platform
  • 1:52 - 1:59
    thingy, this might be familiar to you. It
    is all Free and Open Source, it's actually
  • 1:59 - 2:03
    a Linux Foundation project, you can find
    it under chaoss.community, chaoss with
  • 2:03 - 2:09
    double s, and the code base is public on
    GitHub. So any other free and open source
  • 2:09 - 2:15
    software project can also set this up for
    themselves. We have it hosted by Bitergia,
  • 2:15 - 2:19
    but this is also possible to set up
    yourself, if you're interested in
  • 2:19 - 2:27
    gathering statistics about your Free and
    Open Source project. And there's also a
  • 2:27 - 2:36
    documentation page on MediaWiki.org which
    is called community metrics. I think I
  • 2:36 - 2:41
    have screenshots here, because I never
    trust the Internet at conferences, but I
  • 2:41 - 2:47
    could also show you live… so this is the
    GitHub page of the chaoss project by the
  • 2:47 - 2:55
    Linux foundation where you could get the
    code. This is, I hope the zoom is
  • 2:55 - 3:04
    sufficient, wikimedia.biterg.io So this is
    the overview page. You can see the
  • 3:04 - 3:13
    navigation up here, and you get some basic
    statistics about the most active people in
  • 3:13 - 3:18
    the git repositories, which organizations
    we have, so here you can see Wikimedia
  • 3:18 - 3:26
    Foundation, individuals, Hallo Welt,
    Wikimedia Deutschland. So these are, this
  • 3:26 - 3:32
    is the contributor base we have, by
    organization, by affiliation. And down
  • 3:32 - 3:38
    here there's way more statistics, git,
    Gerrit, mailing lists, we index a lot of
  • 3:38 - 3:43
    things. We also index a little bit our
    issue tracking system, which is
  • 3:43 - 3:51
    Phabricator, and some edits on
    MediaWiki.org. And, for example, now, if I
  • 3:51 - 3:59
    go to Gerrit and the overview page,
    because we use Gerrit for code review,
  • 3:59 - 4:06
    they have more specific statistics, and as
    it's Elasticsearch, Kibana based, you
  • 4:06 - 4:10
    might know this if you've played with
    this, whenever you click on a certain
  • 4:10 - 4:15
    value, you can filter by that value. So,
    for example, if I use the pie chart here,
  • 4:15 - 4:20
    and only want to see the numbers for
    independent volunteer contributors,
  • 4:20 - 4:26
    I click it, and you see the numbers now
    change. Obviously a bit lower, and you see
  • 4:26 - 4:31
    up here, that a filter has been applied,
    and you can continue with these things.
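A minimal sketch of what such a click-to-filter step amounts to underneath, assuming the dashboard sits on an Elasticsearch index. The field names `author_org_name` and `repository` and the value `Independent` are assumptions for illustration and may not match the actual index. Each clicked value becomes one more `term` clause in a `bool` filter, so stacking filters narrows the result further.

```python
import json


def build_filter_query(filters):
    """Build an Elasticsearch bool query that keeps only documents
    matching every field/value pair in `filters`."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ]
            }
        }
    }


# Hypothetical field names and values, chosen only for illustration.
query = build_filter_query({"author_org_name": "Independent"})

# Clicking a second value (here a repository) adds a second clause.
query_repo = build_filter_query(
    {"author_org_name": "Independent", "repository": "mediawiki/core"}
)

print(json.dumps(query_repo, indent=2))
```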
  • 4:31 - 4:36
    Then you can go filter here also via code
    repository, for example, the MediaWiki
  • 4:36 - 4:42
    core repository. If I click on that one,
    it also filters for the value, and you can
  • 4:42 - 4:50
    basically drill down the statistics you
    want to gather here. And there's, as I
  • 4:50 - 4:54
    only have 15 minutes, there's way more
    things you can find out here, also, for
  • 4:54 - 5:03
    example, who reviews patches in Gerrit,
    how long patches have been open, median
  • 5:03 - 5:09
    time, all these things you might want to
    gather to find out how well are we doing
  • 5:09 - 5:16
    as a project, when it comes to both
    involving volunteers, and also giving them
  • 5:16 - 5:21
    the feedback when it comes to code review,
    and engagement, that you would like to
  • 5:21 - 5:26
    give. Or, also, areas for improvement. For
    example, in Wikimedia Foundation obviously
  • 5:26 - 5:33
    we have engineering teams, and some of
    them maintain certain code repositories,
  • 5:33 - 5:39
    so you can filter the view for certain
    code repositories, and then see, for
  • 5:39 - 5:45
    example, you realize sometimes that
    patches written by volunteers, it takes
  • 5:45 - 5:49
    longer to review them than patches written
    by your coworkers. And these kinds of
  • 5:49 - 5:54
    things which you maybe already assumed,
    but it's nice to actually have data.
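A rough sketch of the comparison described above: median days from patch creation to merge, split by who wrote the patch. The records are made up for illustration and the field names (`author_group`, `created`, `merged`) are assumptions, not the dashboard's actual schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Made-up sample records, not real Wikimedia data.
changes = [
    {"author_group": "volunteer", "created": "2019-11-01", "merged": "2019-11-11"},
    {"author_group": "volunteer", "created": "2019-11-03", "merged": "2019-11-07"},
    {"author_group": "volunteer", "created": "2019-11-05", "merged": "2019-11-25"},
    {"author_group": "staff", "created": "2019-11-02", "merged": "2019-11-03"},
    {"author_group": "staff", "created": "2019-11-04", "merged": "2019-11-07"},
]


def median_days_to_merge(records):
    """Return the median number of days between creation and merge,
    keyed by author group."""
    days = defaultdict(list)
    for record in records:
        created = datetime.strptime(record["created"], "%Y-%m-%d")
        merged = datetime.strptime(record["merged"], "%Y-%m-%d")
        days[record["author_group"]].append((merged - created).days)
    return {group: median(values) for group, values in days.items()}


result = median_days_to_merge(changes)
print(result)
```

The median is used rather than the mean so that a single patch left open for a very long time does not dominate the figure.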
  • 5:54 - 6:03
    There's also a few caveats here. So, for
    example, I usually don't use the git
  • 6:03 - 6:10
    statistics, because Gerrit is where the
    code review happens. And once a patch
  • 6:10 - 6:15
    proposed in Gerrit has been accepted and
    merged in the git repository, you would
  • 6:15 - 6:21
    also see that in the git repository, but
    as all our software is Open Source, Free
  • 6:21 - 6:26
    Software, we also of course pull in a lot
    of git repositories from other upstream
  • 6:26 - 6:31
    projects, because we use a lot of software
    invented and maintained somewhere else to
  • 6:31 - 6:39
    run our servers. So the git statistics
    also include activity that we've imported
  • 6:39 - 6:44
    within the git repositories from other
    companies. So, that's kind of misleading.
  • 6:44 - 6:49
    And there's a few more caveats, which are
    actually, I hope all of them are listed on
  • 6:49 - 6:54
    the community metrics page on
    MediaWiki.org, because at some point I had
  • 6:54 - 7:01
    to create a section "behavior that might
    surprise you". It also, that page also has
  • 7:01 - 7:06
    some examples like, how can I, for the
    most common questions I get from
  • 7:06 - 7:13
    interested people, and also co-workers,
    or, you want to publish an annual report,
  • 7:13 - 7:16
    and show how many volunteer contributors
    you have in the code bases and these
  • 7:16 - 7:28
    things. So that is what we have. These
    were the screenshots in case the Wi-Fi
  • 7:28 - 7:36
    doesn't work. And now the section, what is
    patchwork. A spoiler: Basically everything
  • 7:36 - 7:43
    else. Because this was the look at git and
    git repositories and Gerrit for code
  • 7:43 - 7:49
    review. But there is way more going on
    when it comes to technical contributions
  • 7:49 - 7:59
    and code in Wikimedia. There is GitHub.
    So, we have some projects, quite a few,
  • 7:59 - 8:02
    that don't use Wikimedia git, Wikimedia
    Gerrit, but they prefer GitHub, because
  • 8:02 - 8:11
    it's a different contribution system or
    workflow. So, we already track some of
  • 8:11 - 8:16
    that, but we still have to improve even
    finding a way how to find all the
  • 8:16 - 8:20
    repositories related to Wikimedia
    Development on GitHub. Because they're not
  • 8:20 - 8:27
    all under the same organization. When it
    comes to what I just showed you,
  • 8:27 - 8:34
    wikimedia.biterg.io, we define what is
    being indexed in a public JSON file,
  • 8:34 - 8:38
    "projects". So, this is also linked from
    the community metrics page on
  • 8:38 - 8:43
    mediawiki.org, where we define basically
    what's, what gets indexed. And it's a long
  • 8:43 - 8:51
    list, as you can see, also some
    mailing lists, but there's a lot of code
  • 8:51 - 8:57
    actually on the Wikis. Inside of Wiki
    pages. So, there are user scripts, there
  • 8:57 - 9:03
    are gadgets, like small JavaScript things
    that enhance functionality, and they're
  • 9:03 - 9:09
    actually quite common. So, for example,
    Wikimedia Commons, or English or German
  • 9:09 - 9:15
    Wikipedia, they have a lot of gadgets even
    enabled by default, which makes some
  • 9:15 - 9:22
    behavior easier. For example, on Commons a
    common gadget is adding a category to a
  • 9:22 - 9:27
    photo or image that has been uploaded.
    That's way easier if you use a gadget
  • 9:27 - 9:34
    which is enabled by default. There are Lua
    modules, and there's templates. For
  • 9:34 - 9:39
    example the info boxes that you see in
    many Wikipedia articles on the side, for
  • 9:39 - 9:44
    example, if you look up a Wikipedia
    article about a person. These are all
  • 9:44 - 9:51
    templates. And they're all stored on Wiki.
    So, this is harder to track, to get a full
  • 9:51 - 10:00
    overview of that. And some extension code,
    even we have about 130 MediaWiki
  • 10:00 - 10:06
    extensions deployed on Wikimedia servers.
    But if you take a look only at the
  • 10:06 - 10:12
    extension home pages on MediaWiki.org,
    there is more than 2000. So there's a lot
  • 10:12 - 10:16
    of code out there, and sometimes this code
    is even stored just by copy and paste
  • 10:16 - 10:21
    putting it on a Wiki page, and saying:
    here, copy and paste this, and it should
  • 10:21 - 10:27
    work. Which might not be the best revision
    system when it comes to maintaining code,
  • 10:27 - 10:33
    ever, but it's a quick and dirty way, so
    these things exist. And one other example,
  • 10:33 - 10:40
    unknown code repository locations. We also
    have something called Toolforge. That's
  • 10:40 - 10:45
    what some people call "cloud services"
    nowadays. So you can host your own little
  • 10:45 - 10:51
    helper tools which other people then can
    also use, on a cloud services platform
  • 10:51 - 10:55
    called Toolforge that we offer. One
    example would be, for example, page views.
  • 10:55 - 11:03
    So, if you want to see which pages are the
    most popular on some Wiki, that's one
  • 11:03 - 11:08
    example out of, also thousands of tools
    now actually. And though, of course, the
  • 11:08 - 11:14
    rules are that you must publish the source
    code, it's sometimes really hard to also
  • 11:14 - 11:18
    make sure that this happens, and where it
    happens. So for most repositories, we
  • 11:18 - 11:23
    know, we have an index, but for some we
    actually don't know, which is also
  • 11:23 - 11:32
    something to work out. So, recently, even
    getting a number of things, or getting an
  • 11:32 - 11:39
    idea, like, what can we measure, what
    do we have, how much do we have, I started
  • 11:39 - 11:44
    to create a table, and even visualizing
    that was, was an interesting task. I'm
  • 11:44 - 11:49
    still not sure if anybody understands
    this, but black basically means doesn't
  • 11:49 - 11:56
    exist. You don't need to, there is nothing
    to, to measure, to index. Green means, yes
  • 11:56 - 12:03
    we do measure this already. And
    yellow means, it's tricky, but
  • 12:03 - 12:09
    it's kind of possible via some scripts or
    using the API to get numbers out of the
  • 12:09 - 12:15
    Wikis, in certain name spaces, for example
    the module name space. And red means, it's
  • 12:15 - 12:23
    very hard, but we'd like to get this data
    at some point. Plus, also the complexity,
  • 12:23 - 12:29
    so the numbers you see here are sometimes
    correct numbers, sometimes more of a
  • 12:29 - 12:35
    ballpark vague figure about how many
    items, code repositories, projects we're
  • 12:35 - 12:39
    actually talking about. And with some
    numbers, we're even wondering. For
  • 12:39 - 12:46
    example, it says 270 000 modules and
    templates on the 900 sites, websites
  • 12:46 - 12:53
    we have on Wikimedia servers, and this is
    what the database query says on Hive, but
  • 12:53 - 12:58
    we're not really trusting that number yet.
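One way to cross-check such a figure for a single wiki is the API route mentioned earlier: list all pages in a namespace through the MediaWiki Action API (`list=allpages`) and count them. This is only a sketch under assumptions: the helper names `fetch_json` and `count_pages` are made up here, the count includes redirects, and it covers one wiki at a time rather than all sites.

```python
import json
import urllib.parse
import urllib.request

MODULE_NAMESPACE = 828   # "Module:" namespace, provided by the Scribunto extension
TEMPLATE_NAMESPACE = 10  # "Template:" namespace


def fetch_json(api_url, params):
    """Send one GET request to the Action API and return the parsed JSON."""
    url = api_url + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "namespace-count-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_pages(api_url, namespace, fetch=fetch_json):
    """Count pages in one namespace, following API continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    total = 0
    while True:
        data = fetch(api_url, params)
        total += len(data["query"]["allpages"])
        if "continue" not in data:
            return total
        # Feed the continuation values back into the next request.
        params = {**params, **data["continue"]}


# Example call (performs network requests, so it is left commented out):
# count_pages("https://commons.wikimedia.org/w/api.php", MODULE_NAMESPACE)
```

The `fetch` parameter exists so the paging logic can be exercised with canned responses instead of live requests.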
    So, this is actually what we're going to
  • 12:58 - 13:03
    be after over the next months to also have
    way better data, and a way better overview
  • 13:03 - 13:08
    of where our developers actually are.
    Because we know, in code repositories, we
  • 13:08 - 13:17
    have about 200 to 400 code contributors,
    in Gerrit code review, per month.
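A small sketch of how a per-month contributor figure like this can be derived: group changes by month and count distinct authors. The author names and dates below are made up, and real data would first need identities merged, since one person can appear under several accounts or email addresses.

```python
from collections import defaultdict

# Made-up (author, date) pairs, not real Gerrit data.
changes = [
    ("alice", "2019-10-03"),
    ("bob", "2019-10-17"),
    ("alice", "2019-10-29"),
    ("carol", "2019-11-02"),
    ("alice", "2019-11-20"),
    ("bob", "2019-11-21"),
]


def contributors_per_month(records):
    """Return the number of distinct authors per YYYY-MM month."""
    authors = defaultdict(set)
    for author, date in records:
        authors[date[:7]].add(author)  # "2019-10-03" -> "2019-10"
    return {month: len(names) for month, names in sorted(authors.items())}


monthly = contributors_per_month(changes)
print(monthly)  # {'2019-10': 2, '2019-11': 3}
```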
  • 13:17 - 13:24
    And we now also know that we have about 500,
    600 people who work on user scripts and
  • 13:24 - 13:31
    gadgets, per year. But for many other
    things, we don't know yet, and that's what
  • 13:31 - 13:36
    I'm trying to improve over the next
    months, or, maybe realistically, years.
  • 13:36 - 13:45
    Let's see. But, yeah. So, that's basically
    it. I hope this was a bit interesting.
  • 13:45 - 13:51
    If you have any comments, questions, feel
    free to catch me here. I'm sometimes
  • 13:51 - 13:56
    around the table. Feel free to catch me
    after this talk. These are links with more
  • 13:56 - 14:03
    information, or, if you don't manage to
    catch me, feel also free on the community
  • 14:03 - 14:09
    metrics page on MediaWiki.org, the first
    link, there is a discussion page, and
  • 14:09 - 14:15
    there you can also bring up anything,
    ideas, ask questions, I watch that page,
  • 14:15 - 14:18
    and, usually, reply. Thank you!
  • 14:18 - 14:21
    applause
  • 14:21 - 14:25
    postroll music
  • 14:25 - 14:48
    Subtitles created by c3subtitles.de
    in the year 2021. Join, and help us!
Video Language:
English
Duration:
14:52