1
00:00:00,000 --> 00:00:12,010
36C3 preroll music
2
00:00:12,010 --> 00:00:22,720
Andre Klapper: Alright, thank you. Thanks
for your interest. I'm Andre, I'm with the
3
00:00:22,720 --> 00:00:28,130
Wikimedia Foundation, and one of the
things I'm currently trying to find out is
4
00:00:28,130 --> 00:00:37,090
how to measure activity, people in our
technical communities. And you probably
5
00:00:37,090 --> 00:00:42,020
know that Wikimedia is a large, large
project. There's like more than 900
6
00:00:42,020 --> 00:00:47,680
websites, and there's many areas where you
can contribute, technically, in different
7
00:00:47,680 --> 00:00:53,330
ways. And we're currently trying to get an
overview. And even that is hard.
8
00:00:53,330 --> 00:01:02,280
So, it is a complex task. And in this talk, I would
like to quickly show you what we already
9
00:01:02,280 --> 00:01:08,220
have in place, and what we want to get in
place, and maybe also little bits of the
10
00:01:08,220 --> 00:01:14,030
problems and the complexity. So, it's more
like, for your interest, or if you're
11
00:01:14,030 --> 00:01:24,260
curious also to play with technical
metrics, statistics, things like these.
12
00:01:24,260 --> 00:01:30,830
What we have currently is mostly about
git repositories, code repositories, and
13
00:01:30,830 --> 00:01:35,030
we mostly use Gerrit for code review. We
have our own Gerrit instance at
14
00:01:35,030 --> 00:01:43,320
gerrit.wikimedia.org. And for this we've
had a platform called
15
00:01:43,320 --> 00:01:52,070
wikimedia.biterg.io. If you've seen an
ElasticSearch, Kibana, standard platform
16
00:01:52,070 --> 00:01:58,979
thingy, this might be familiar to you. It
is all Free and Open Source, it's actually
17
00:01:58,979 --> 00:02:03,259
a Linux Foundation project, you can find
it under chaoss.community, chaoss with
18
00:02:03,259 --> 00:02:09,399
double s, and the code base is public on
GitHub. So any other free and open source
19
00:02:09,399 --> 00:02:14,859
software project can also set this up for
themselves. We have it hosted by Bitergia,
20
00:02:14,859 --> 00:02:19,019
but it is also possible to set this up
yourself, if you're interested in
21
00:02:19,019 --> 00:02:27,150
gathering statistics about your Free and
Open Source project. And there's also a
22
00:02:27,150 --> 00:02:36,269
documentation page on MediaWiki.org which
is called community metrics. I think I
23
00:02:36,269 --> 00:02:40,959
have screenshots here, because I never
trust the Internet at conferences, but I
24
00:02:40,959 --> 00:02:47,319
could also show you live… so this is the
GitHub page of the chaoss project by the
25
00:02:47,319 --> 00:02:55,010
Linux Foundation where you could get the
code. This is, I hope the zoom is
26
00:02:55,010 --> 00:03:03,699
sufficient, wikimedia.biterg.io. So this is
the overview page. You can see the
27
00:03:03,699 --> 00:03:12,790
navigation up here, and you get some basic
statistics about the most active people in
28
00:03:12,790 --> 00:03:18,260
the git repositories, which organizations
we have, so here you can see Wikimedia
29
00:03:18,260 --> 00:03:26,080
Foundation, individuals, Hallo Welt!,
Wikimedia Deutschland. So these are, this
30
00:03:26,080 --> 00:03:31,619
is the contributor base we have, by
organization, by affiliation. And down
31
00:03:31,619 --> 00:03:37,620
here there's way more statistics, git,
Gerrit, mailing lists, we index a lot of
32
00:03:37,620 --> 00:03:43,230
things. We also index a little bit of our
issue tracking system, which is
33
00:03:43,230 --> 00:03:51,469
Phabricator, and some edits on
MediaWiki.org. And, for example, now, if I
34
00:03:51,469 --> 00:03:58,999
go to Gerrit and the overview page,
because we use Gerrit for code review,
35
00:03:58,999 --> 00:04:06,109
they have more specific statistics, and as
it's ElasticSearch, Kibana based, you
36
00:04:06,109 --> 00:04:09,930
might know this if you've played with
this, whenever you click on a certain
37
00:04:09,930 --> 00:04:15,029
value, you can filter by that value. So,
for example, if I use the pie chart here,
38
00:04:15,029 --> 00:04:19,590
and only want to see the numbers for
independent volunteer contributors,
39
00:04:19,590 --> 00:04:26,400
I click it, and you see the numbers now
change. Obviously a bit lower, and you see
40
00:04:26,400 --> 00:04:30,530
up here, that a filter has been applied,
and you can continue with these things.
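A rough sketch of what this click-to-filter drill-down amounts to underneath: each clicked value becomes one more exact-match clause in an Elasticsearch bool query. The field names (`author_org_name`, `repository`) and the values are assumptions for illustration and are not taken from the dashboard's actual index.

```python
import json


def add_filter(query, field, value):
    """Return a copy of the query with one more exact-match filter clause."""
    clauses = list(query["query"]["bool"]["filter"])
    clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}


# Start with no filters, like the unfiltered overview page.
base = {"query": {"bool": {"filter": []}}}

# Hypothetical field names and values; the real index may differ.
by_org = add_filter(base, "author_org_name", "Independent")
by_org_and_repo = add_filter(by_org, "repository", "mediawiki/core")

print(json.dumps(by_org_and_repo, indent=2))
```

Each call leaves the previous query untouched, which mirrors how a filter in the dashboard can be removed again without losing the others.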
41
00:04:30,530 --> 00:04:36,250
Then you can go filter here also via code
repository, for example, the MediaWiki
42
00:04:36,250 --> 00:04:42,500
core repository. If I click on that one,
it also filters for the value, and you can
43
00:04:42,500 --> 00:04:49,510
basically drill down the statistics you
want to gather here. And there's, as I
44
00:04:49,510 --> 00:04:53,871
only have 15 minutes, there's way more
things you can find out here, also, for
45
00:04:53,871 --> 00:05:02,600
example, who reviews patches in Gerrit,
how long patches have been open, median
46
00:05:02,600 --> 00:05:08,870
time, all these things you might want to
gather to find out how well are we doing
47
00:05:08,870 --> 00:05:15,540
as a project, when it comes to both
involving volunteers, and also giving them
48
00:05:15,540 --> 00:05:21,350
the feedback when it comes to code review,
and engagement, that you would like to
49
00:05:21,350 --> 00:05:26,470
give. Or, also, areas for improvement. For
example, in Wikimedia Foundation obviously
50
00:05:26,470 --> 00:05:33,100
we have engineering teams, and some of
them maintain certain code repositories,
51
00:05:33,100 --> 00:05:39,261
so you can filter the view for certain
code repositories, and then see, for
52
00:05:39,261 --> 00:05:44,640
example, you realize sometimes that
patches written by volunteers, it takes
53
00:05:44,640 --> 00:05:49,130
longer to review them than patches written
by your coworkers. And these kinds of
54
00:05:49,130 --> 00:05:54,180
things which you maybe already assumed,
but it's nice to actually have data.
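The comparison described here, review waiting time for volunteers versus coworkers, can be sketched with a median per affiliation. The records below are made up, and the two affiliation labels are assumptions; real data would come from Gerrit or the dashboard.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Made-up sample records: (author affiliation, opened, first review).
changes = [
    ("staff", "2019-11-01T10:00", "2019-11-01T16:00"),
    ("staff", "2019-11-02T09:00", "2019-11-03T09:00"),
    ("volunteer", "2019-11-01T10:00", "2019-11-04T10:00"),
    ("volunteer", "2019-11-05T08:00", "2019-11-10T08:00"),
    ("volunteer", "2019-11-06T12:00", "2019-11-07T12:00"),
]


def median_hours_to_review(records):
    """Group waiting times by affiliation and return the median in hours."""
    waits = defaultdict(list)
    for affiliation, opened, reviewed in records:
        delta = datetime.fromisoformat(reviewed) - datetime.fromisoformat(opened)
        waits[affiliation].append(delta.total_seconds() / 3600)
    return {aff: median(hours) for aff, hours in waits.items()}


result = median_hours_to_review(changes)
print(result)
```

The median is used rather than the mean so that a single patch left open for months does not dominate the figure.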
55
00:05:54,180 --> 00:06:02,810
There's also a few caveats here. So, for
example, I usually don't use the git
56
00:06:02,810 --> 00:06:10,310
statistics, because Gerrit is where the
code review happens. And once a patch
57
00:06:10,310 --> 00:06:15,430
proposed in Gerrit has been accepted and
merged in the git repository, you would
58
00:06:15,430 --> 00:06:20,700
also see that in the git repository, but
as all our software is Open Source, Free
59
00:06:20,700 --> 00:06:26,420
Software, we also of course pull in a lot
of git repositories from other upstream
60
00:06:26,420 --> 00:06:31,020
projects, because we use a lot of software
invented and maintained somewhere else to
61
00:06:31,020 --> 00:06:38,550
run our servers. So the git statistics
also include activity that we've imported
62
00:06:38,550 --> 00:06:43,790
within the git repositories from other
companies. So, that's kind of misleading.
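A small sketch of why raw git numbers mislead, and one way to correct for it: leave out repositories known to be imported from upstream before counting authors. The repository names, authors, and the `UPSTREAM_REPOS` list are all hypothetical.

```python
# Made-up commit records: (repository, author).
commits = [
    ("mediawiki/core", "alice"),
    ("mediawiki/core", "bob"),
    ("upstream/some-library", "carol"),
    ("upstream/some-library", "dave"),
    ("mediawiki/extensions/Example", "alice"),
]

# Hypothetical list of repositories known to be imported from elsewhere.
UPSTREAM_REPOS = {"upstream/some-library"}


def own_authors(records, upstream):
    """Return the set of authors with commits outside imported repositories."""
    return {author for repo, author in records if repo not in upstream}


all_authors = {author for _, author in commits}
print(len(all_authors), "authors in raw git data")
print(len(own_authors(commits, UPSTREAM_REPOS)), "authors after excluding imports")
```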
63
00:06:43,790 --> 00:06:48,820
And there's a few more caveats, which are
actually, I hope all of them are listed on
64
00:06:48,820 --> 00:06:54,350
the community metrics page on
MediaWiki.org, because at some point I had
65
00:06:54,350 --> 00:07:01,230
to create a section "behavior that might
surprise you". That page also has
66
00:07:01,230 --> 00:07:05,820
some examples like, how can I, for the
most common questions I get from
67
00:07:05,820 --> 00:07:12,820
interested people, and also co-workers,
or, you want to publish an annual report,
68
00:07:12,820 --> 00:07:16,300
and show how many volunteer contributors
you have in the code bases and these
69
00:07:16,300 --> 00:07:27,870
things. So that is what we have. These
were the screenshots in case the Wi-Fi
70
00:07:27,870 --> 00:07:35,990
doesn't work. And now the section, what is
patchwork. A spoiler: Basically everything
71
00:07:35,990 --> 00:07:43,120
else. Because this was the look at git and
git repositories and Gerrit for code
72
00:07:43,120 --> 00:07:49,480
review. But there is way more going on
when it comes to technical contributions
73
00:07:49,480 --> 00:07:58,590
and code in Wikimedia. There is GitHub.
So, we have some projects, quite a few,
74
00:07:58,590 --> 00:08:02,461
that don't use Wikimedia git, Wikimedia
Gerrit, but they prefer GitHub, because
75
00:08:02,461 --> 00:08:10,860
it's a different contribution system or
workflow. So, we already track some of
76
00:08:10,860 --> 00:08:15,840
that, but we still have to improve even
finding a way to find all the
77
00:08:15,840 --> 00:08:20,100
repositories related to Wikimedia
development on GitHub. Because they're not
78
00:08:20,100 --> 00:08:27,090
all under the same organization. When it
comes to what I just showed you,
79
00:08:27,090 --> 00:08:33,650
wikimedia.biterg.io, we define what is
being indexed in a public JSON file,
80
00:08:33,650 --> 00:08:38,409
"projects". So, this is also linked from
the community metrics page on
81
00:08:38,409 --> 00:08:43,379
MediaWiki.org, where we define basically
what gets indexed. And it's a long
82
00:08:43,379 --> 00:08:50,579
list, as you can see, also some
mailing lists, but there's a lot of code
83
00:08:50,579 --> 00:08:57,149
actually on the Wikis. Inside of Wiki
pages. So, there are user scripts, there
84
00:08:57,149 --> 00:09:02,830
are gadgets, like small JavaScript things
that enhance functionality, and they're
85
00:09:02,830 --> 00:09:08,759
actually quite common. So, for example,
Wikimedia Commons, or English or German
86
00:09:08,759 --> 00:09:15,059
Wikipedia, they have a lot of gadgets even
enabled by default, which makes some
87
00:09:15,059 --> 00:09:22,279
behavior easier. For example, on Commons a
common gadget is adding a category to a
88
00:09:22,279 --> 00:09:26,640
photo or image that has been uploaded.
That's way easier if you use a gadget
89
00:09:26,640 --> 00:09:34,240
which is enabled by default. There are Lua
modules, and there's templates. For
90
00:09:34,240 --> 00:09:39,241
example the info boxes that you see in
many Wikipedia articles on the side, for
91
00:09:39,241 --> 00:09:43,839
example, if you look up a Wikipedia
article about a person. These are all
92
00:09:43,839 --> 00:09:51,009
templates. And they're all stored on Wiki.
So, this is harder to track, to get a full
93
00:09:51,009 --> 00:10:00,079
overview of that. And some extension code,
even. We have about 130 MediaWiki
94
00:10:00,079 --> 00:10:06,449
extensions deployed on Wikimedia servers.
But if you take a look only at the
95
00:10:06,449 --> 00:10:11,860
extension home pages on MediaWiki.org,
there are more than 2000. So there's a lot
96
00:10:11,860 --> 00:10:16,100
of code out there, and sometimes this code
is even stored just by copy and paste
97
00:10:16,100 --> 00:10:20,510
putting it on a Wiki page, and saying:
here, copy and paste this, and it should
98
00:10:20,510 --> 00:10:26,720
work. Which might not be the best revision
system when it comes to maintaining code,
99
00:10:26,720 --> 00:10:33,139
ever, but it's a quick and dirty way, so
these things exist. And one other example,
100
00:10:33,139 --> 00:10:40,199
unknown code repository locations. We also
have something called Toolforge. That's
101
00:10:40,199 --> 00:10:44,920
what some people call "cloud services"
nowadays. So you can host your own little
102
00:10:44,920 --> 00:10:50,579
helper tools which other people then can
also use, on a cloud services platform
103
00:10:50,579 --> 00:10:55,069
called Toolforge that we offer. One
example would be page views.
104
00:10:55,069 --> 00:11:02,770
So, if you want to see which pages are the
most popular on some Wiki, that's one
105
00:11:02,770 --> 00:11:08,319
example out of, also thousands of tools
now actually. And though, of course, the
106
00:11:08,319 --> 00:11:14,019
rules are that you must publish the source
code, it's sometimes really hard to also
107
00:11:14,019 --> 00:11:18,249
make sure that this happens, and where it
happens. So for most repositories, we
108
00:11:18,249 --> 00:11:23,329
know, we have an index, but for some we
actually don't know, which is also
109
00:11:23,329 --> 00:11:31,790
something to work out. So, recently, even
getting a number of things, or getting an
110
00:11:31,790 --> 00:11:38,790
idea, like, what can we measure, what
do we have, how much do we have, I started
111
00:11:38,790 --> 00:11:43,829
to create a table, and even visualizing
that was an interesting task. I'm
112
00:11:43,829 --> 00:11:49,439
still not sure if anybody understands
this, but black basically means doesn't
113
00:11:49,439 --> 00:11:55,970
exist. You don't need to, there is nothing
to measure, to index. Green means, yes
114
00:11:55,970 --> 00:12:02,830
we do measure this already. And
yellow means, it's tricky, but
115
00:12:02,830 --> 00:12:09,459
it's kind of possible via some scripts or
using the API to get numbers out of the
116
00:12:09,459 --> 00:12:15,420
Wikis, in certain namespaces, for example
the Module namespace. And red means, it's
117
00:12:15,420 --> 00:12:22,600
very hard, but we'd like to get this data
at some point. Plus, also the complexity,
118
00:12:22,600 --> 00:12:28,579
so the numbers you see here are sometimes
correct numbers, sometimes more of a
119
00:12:28,579 --> 00:12:34,670
ballpark vague figure about how many
items, code repositories, projects we're
120
00:12:34,670 --> 00:12:39,089
actually talking about. And with some
numbers, we're even wondering. For
121
00:12:39,089 --> 00:12:46,199
example, it says 270 000 modules and
templates on the 900 sites, websites
122
00:12:46,199 --> 00:12:53,019
we have on Wikimedia servers, and this is
what the database query says on Hive, but
123
00:12:53,019 --> 00:12:58,179
we're not really trusting that number yet.
So, this is actually what we're going to
124
00:12:58,179 --> 00:13:03,139
be after over the next months to also have
way better data, and a way better overview
125
00:13:03,139 --> 00:13:07,890
of where our developers actually are.
Because we know, in code repositories, we
126
00:13:07,890 --> 00:13:17,209
have about 200 to 400 code contributors,
in Gerrit code review, per month.
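A figure like "contributors per month" boils down to counting distinct authors per calendar month. A minimal sketch with made-up records follows; the dates and names are placeholders, not real data.

```python
from collections import defaultdict

# Made-up records: (ISO date of a change, author).
changes = [
    ("2019-10-03", "alice"),
    ("2019-10-17", "alice"),
    ("2019-10-21", "bob"),
    ("2019-11-02", "alice"),
    ("2019-11-09", "carol"),
    ("2019-11-30", "dave"),
]


def contributors_per_month(records):
    """Count distinct authors per calendar month (YYYY-MM)."""
    authors = defaultdict(set)
    for date, author in records:
        authors[date[:7]].add(author)
    return {month: len(names) for month, names in sorted(authors.items())}


monthly = contributors_per_month(changes)
print(monthly)
```

Using a set per month means somebody with many patches in one month is still counted once for that month.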
127
00:13:17,209 --> 00:13:24,480
And we now also know that we have about 500,
600 people who work on user scripts and
128
00:13:24,480 --> 00:13:30,619
gadgets, per year. But for many other
things, we don't know yet, and that's what
129
00:13:30,619 --> 00:13:36,199
I'm trying to improve over the next
months, or, maybe realistically, years.
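For the "tricky, but kind of possible" category mentioned for the table, a count via the MediaWiki Action API might look roughly like this. It is only a sketch: it builds a `list=allpages` request for the Module namespace (number 828) and counts entries in a canned response, so nothing is fetched over the network. The sample titles are made up, and a real count would have to follow the `apcontinue` value through all batches.

```python
from urllib.parse import urlencode

MODULE_NAMESPACE = 828  # "Module:" namespace on wikis with Lua modules


def allpages_url(api_base, namespace, continue_from=None):
    """Build an Action API URL listing pages in one namespace."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    if continue_from is not None:
        params["apcontinue"] = continue_from
    return api_base + "?" + urlencode(params)


def count_pages(response):
    """Count page entries in one decoded API response batch."""
    return len(response.get("query", {}).get("allpages", []))


# Canned response in the shape the API returns; titles are made up.
sample = {"query": {"allpages": [{"title": "Module:A"}, {"title": "Module:B"}]}}
url = allpages_url("https://www.mediawiki.org/w/api.php", MODULE_NAMESPACE)
print(url)
print(count_pages(sample))
```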
130
00:13:36,199 --> 00:13:45,299
Let's see. But, yeah. So, that's basically
it. I hope this was a bit interesting.
131
00:13:45,299 --> 00:13:51,089
If you have any comments, questions, feel
free to catch me here. I'm sometimes
132
00:13:51,089 --> 00:13:56,329
around the table. Feel free to catch me
after this talk. These are links with more
133
00:13:56,329 --> 00:14:03,019
information, or, if you don't manage to
catch me, also feel free: on the community
134
00:14:03,019 --> 00:14:09,110
metrics page on MediaWiki.org, the first
link, there is a discussion page, and
135
00:14:09,110 --> 00:14:14,939
there you can also bring up anything,
ideas, ask questions, I watch that page,
136
00:14:14,939 --> 00:14:18,149
and, usually, reply. Thank you!
137
00:14:18,149 --> 00:14:21,049
applause
138
00:14:21,049 --> 00:14:24,809
postroll music
139
00:14:24,809 --> 00:14:48,000
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!