So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that?

The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so, my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open-source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.
So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, and it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service: you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, this requires whitebox instrumentation. So, basically, every event that happens inside your software, inside a serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.

There are different examples of, like, the kind of software that does blackbox and whitebox monitoring.
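The blackbox probe described above can be sketched in a few lines. This is only an illustration, not any particular monitoring tool; the URL and timeout are hypothetical, and all the probe learns is whether this one request made it through the stack and back:

```python
import urllib.request

def blackbox_probe(url, timeout=5.0):
    """Probe a service from the outside, with no knowledge of its
    internals: one HTTP request, end to end, through the whole stack."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # All we can observe from outside is the response status.
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure: the probe failed.
        return False
```

A scheduler (cron, Nagios, Pingdom) would run such a probe periodically and alert when it returns False.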
So you have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open-source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to this push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages for polling over pushing. With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running.
If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's like the disappearing: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's really easy testing. With push-based metrics, you have to figure out, if you want to test a new version of the monitoring system, or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. Or it doesn't even have to be monitoring: you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is.
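Polling a metrics endpoint for testing really is this simple. A minimal sketch of a scraper, assuming a Prometheus-style plain-text endpoint (the endpoint URL and metric names are hypothetical); it does the same thing as `curl http://host:port/metrics`, just parsed into values:

```python
import urllib.request

def scrape(url, timeout=5.0):
    """Fetch a Prometheus-style /metrics endpoint and parse the
    well-formed sample lines into a dict of {series: value}."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        # Each sample line is "<series> <value>"; split on the last space.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

Because the scrape is just an HTTP GET, a second test instance of the monitoring system (or a one-off script like this) can read the exact same data as production without any coordination.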
It doesn't have to know about ???. It just has to sit and collect the data about itself, so it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's the Prometheus server, and it deals with all the data collection and analytics. Basically, it's this one binary; it's all written in golang. It's a single binary. It knows how to read from your inventory, and there are a bunch of different methods, whether you've got a Kubernetes cluster, or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data.
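To make the client-side simplicity concrete, here is a toy sketch of what "just collect data about yourself" looks like. This is a hypothetical stand-in, not the real client library; a real application would use an official Prometheus client (such as prometheus_client for Python), and the metric name `myapp_requests_total` is invented for illustration:

```python
import http.server
import threading

class Counter:
    """Toy counter that renders itself in the Prometheus text format.
    A real client library also handles labels, metric types, and more."""
    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    def expose(self):
        # Text exposition format: # HELP, # TYPE, then the sample line.
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self._value}\n")

requests_total = Counter("myapp_requests_total", "Requests handled.")

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """The application only exposes data about itself; it needs no
    knowledge of where (or whether) a monitoring server exists."""
    def do_GET(self):
        body = requests_total.expose().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet
```

Serving `MetricsHandler` on some port is all the application does; whatever polls it, and how often, is entirely the monitoring system's business.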
It has storage and a time series database to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL, you can add frontends: whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and it will generate new data from that query, or you can take a query and, if it returns results, it will fire an alert. That alert is a push message to the alert manager. This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and it goes into an alert manager cluster, which does the deduplication and the routing to the human.

Because, of course, the thing that we want is: we had dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph.
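The two rule modes described above live in a Prometheus rules file. A minimal sketch, assuming hypothetical metric names (`http_requests_total`, `http_errors_total`); the YAML rule-file structure itself is Prometheus's own:

```
groups:
  - name: example
    rules:
      # Recording rule: continuously evaluate a query and store the
      # result as a new time series under its own name.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: if the query returns results for 10 minutes,
      # fire an alert, which Prometheus pushes to the Alertmanager
      # for deduplication and routing.
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 10m
```

Because alerts are generated here but routed by the Alertmanager, many Prometheus servers can share one Alertmanager cluster.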
With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. This is a very simple thing. If you know much about the Linux kernel, the Linux kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. So this is a self-describing metric: you can just read the metric's name and you understand a little bit about what's going on here.
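For reference, this is roughly what that metric looks like in the text exposition format as exposed by the node_exporter; the sample values here are illustrative, not real measurements:

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 18214.19
node_cpu_seconds_total{cpu="0",mode="system"} 143.21
node_cpu_seconds_total{cpu="0",mode="user"} 512.33
```

The name says what is measured (CPU time), the unit (seconds), and that it only ever goes up (a total, i.e. a counter), and the labels break it out per CPU and per mode.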