So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know it's right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.

So, blackbox monitoring is a probe. It just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service: you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, you need whitebox instrumentation. Basically, every event that happens inside your software, inside your serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database.
You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.

There are different examples of the kind of software that does blackbox and whitebox monitoring. You have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph: we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications. There are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages to polling over pushing.

With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the same disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's also really easy testing. With push-based metrics, if you want to test a new version of the monitoring system or you want to test something new, you have to figure out how to ??? a copy of the data.
With polling, you can just set up another instance of your monitoring and test it. Or it doesn't even have to be monitoring: you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about the monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core there's the Prometheus server, and it deals with all the data collection and analytics. It's basically this one binary, all written in Go. It's a single binary. It knows how to read from your inventory, and there are a bunch of different methods for that, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has a storage engine, a time series database, to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of that PromQL you can add frontends, whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.
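To make the inventory idea concrete, here is a minimal sketch of a Prometheus scrape configuration, assuming a file-based inventory that something like Ansible renders out; the job names, targets, and file path are invented for illustration.

```yaml
# prometheus.yml -- minimal sketch; job names, targets and paths are placeholders
global:
  scrape_interval: 15s        # how often Prometheus polls each target

scrape_configs:
  # A statically listed set of node exporters
  - job_name: node
    static_configs:
      - targets: ['host1.example.com:9100', 'host2.example.com:9100']

  # File-based discovery: Ansible (or anything else) drops target lists
  # into this directory and Prometheus picks them up automatically
  - job_name: app
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
```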
And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can either record: you take a query and it will generate new data from that query. Or you can take a query, and if it returns results, it will fire an alert. That alert is a push message to the alertmanager. This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an alertmanager cluster, which does the deduplication and the routing to the human.

Because, of course, what we want is this: we used to have dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. It's a very simple thing. If you know much about the Linux kernel: the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by giving the metric a name, which is 'node_cpu_seconds_total'. So this is a self-describing metric: you can just read the metric name and you understand a little bit about what's going on here. The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. Actually, the kernel tracks how much idle time it has. It also tracks it by the number of CPUs. Other monitoring systems used to do this with a tree structure, and that caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics.
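For reference, the text exposition format for that metric looks roughly like this when you scrape a node exporter; the numbers are placeholders.

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"}   81625.32
node_cpu_seconds_total{cpu="0",mode="user"}    1432.71
node_cpu_seconds_total{cpu="0",mode="system"}   512.04
node_cpu_seconds_total{cpu="0",mode="iowait"}    23.96
node_cpu_seconds_total{cpu="1",mode="idle"}   81201.11
node_cpu_seconds_total{cpu="1",mode="user"}    1519.58
```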
Here's a nice example of taking those CPU seconds counters and converting them into a graph by using PromQL.

Now we can get into metrics-based alerting. Now that we have this graph, we have this thing we can look at and see, "Oh, there's a little spike here, we might want to know about that." Now we can get into metrics-based alerting.

I used to be a site reliability engineer, I'm still a site reliability engineer at heart, and we have this concept of the things that you need to run a site or a service reliably. The most important thing you need is down at the bottom: monitoring. Because if you don't have monitoring of your service, how do you know it's even working?

There are a couple of techniques here, and we want to alert based on data and not just those end-to-end tests. There's a thing called the RED method and there's a thing called the USE method, and there are some nice blog posts about these. Basically, the RED method, for example, talks about the requests that your system is handling. There are three things: there's the number of requests, there's the number of errors, and there's the duration, how long each request takes. With the combination of these three things, you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" Most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.

But we can go back to some more traditional, sysadmin-style alerts. This one is basically taking the filesystem available space divided by the filesystem size; that becomes the ratio of filesystem availability, from 0 to 1. Multiply it by 100 and we now have a percentage, and if it's less than or equal to 1% for 15 minutes, that is, less than 1% free space, we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple.
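As a sketch, that filesystem check could be written as a Prometheus rule file along these lines, assuming node-exporter-style metric names (node_filesystem_avail_bytes and node_filesystem_size_bytes); the rule names and threshold are illustrative.

```yaml
groups:
  - name: filesystem.rules
    rules:
      # Recording rule: continuously compute the free-space percentage
      - record: instance:node_filesystem_avail:percent
        expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

      # Alerting rule: fire if a filesystem has been at or below 1% free for 15 minutes
      - alert: FilesystemAlmostFull
        expr: instance:node_filesystem_avail:percent <= 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} has less than 1% free space"
```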
We can also tag, we can include... every alert includes all the extra labels that Prometheus adds to your metrics. When you add a metric in Prometheus, if we go back and look at this metric, the metric only contains information about the internals of the application. Anything about, like, what server it's on, whether it's running in a container, what cluster it comes from, what ??? it's on, that's all extra annotation that is added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information. That location information also comes through as labels in the alert. So if you have a message coming into your alertmanager, the alertmanager can look at it and go, "Oh, that's coming from this datacenter," and it can include that in the email or IRC message or SMS message. So you can include: "Filesystem is out of space on this host in this datacenter." All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course this is how you define getting the message from the monitoring to the human. You can even include nice things like: if you've got documentation, you can include a link to the documentation as an annotation, and the alertmanager can take that base URL and, you know, massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things. Since we're not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says, "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than 0, which means full." We know that within the next four hours the disk is gonna be full, so we can tell the operator ahead of time that it's gonna be full, and not just tell them that it's full right now.
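A sketch of that predictive alert, again assuming node-exporter metric names; the runbook URL is a made-up placeholder for the documentation-link annotation mentioned above.

```yaml
groups:
  - name: filesystem-predict.rules
    rules:
      - alert: FilesystemWillFillSoon
        # Linear regression over the last hour, projected 14400 s (4 hours) ahead:
        # if the predicted free space drops below zero, the disk will be full.
        expr: predict_linear(node_filesystem_avail_bytes[1h], 14400) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 4 hours"
          runbook: "https://wiki.example.com/runbooks/filesystem-full"  # placeholder URL
```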
They have some window of ability to fix it before it fails. This is really important, because if you're running a site, you want alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA, your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that.

And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, a good example from the USE method. This is the rate of 500 errors coming from an application, and you can simply alert if there are more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors into a percentage of errors. So you could say, "I have an SLA of three nines," and then say, "If the rate of errors divided by the rate of requests is more than .01, that's a problem." You can include that level of error granularity.

If you're just doing a blackbox test, you wouldn't know this. You would only get: you got an error from the system, then you got another error from the system, then you fire an alert. But if those checks are one minute apart and you're serving 1000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it, because what if you only get one random error, and then the next time you're serving 25% errors, you only have a 25% chance of that check failing again. You really need these metrics in order to get proper reports of the status of your system.

There are even more options: you can slice and dice those labels. If you have a label on all of your applications called 'service', you can send that 'service' label through to the message and you can say, "Hey, this service is broken." You can include that service label in your alert messages.

And that's it. I can go to a demo and Q&A.
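As a sketch, an error-ratio alert along those lines might look like this; the metric name http_requests_total and its code label are assumptions about how the application is instrumented.

```yaml
groups:
  - name: errors.rules
    rules:
      - alert: HighErrorRatio
        # Fraction of requests answered with HTTP 500 over the last 5 minutes,
        # compared against a 1% error threshold.
        expr: >
          sum(rate(http_requests_total{code="500"}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of requests are returning errors"
```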
[Applause]

Any questions so far? Or does anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called cAdvisor, and cAdvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally, in its own data storage on the local disk. But that's only as durable as that server is durable. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and because the language on our website said "it's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's just only as durable as the server underneath it.
So if you've got a very large cluster and you want really high durability, you need some kind of cluster software. But we want Prometheus to be simple to deploy, very simple to operate, and also very robust, so we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that happens is all of your distributed systems fail. And building distributed systems to be really robust is really hard, so Prometheus is what we call an "uncoordinated distributed system". If you've got two Prometheus servers monitoring all your targets in an HA mode in a cluster, and there's a split brain, each Prometheus can see half of the cluster, and it can see that the other half of the cluster is down. They can both try to get alerts out to the alertmanager, and this is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super, super robust, and the two individual Prometheus servers in your cluster don't have to talk to each other to do this; they can just do it independently.

But if you want to be able to correlate data between many different Prometheus servers, you need an external data store to do this. And also, you may not have very big servers; you might be running your Prometheus in a container with only a little bit of local storage space, so you want to send all that data up to a big cluster datastore for ???.

We have several different ways of doing this. There's the classic way, which is called federation, where you have one Prometheus server pulling in summary data from each of the individual Prometheus servers. This is useful if you want to run alerts against data coming from multiple Prometheus servers. But federation is not replication; it can only pull in a little bit of data from each Prometheus server.
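Configuration-wise, a federation scrape is just an ordinary scrape job pointed at the /federate endpoint, with match[] selectors choosing which summary series to pull; the job name, selector, and targets here are only an example.

```yaml
scrape_configs:
  - job_name: federate
    metrics_path: /federate
    honor_labels: true              # keep the labels as set by the source servers
    params:
      'match[]':
        - '{__name__=~"job:.*"}'    # e.g. pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-dc1.example.com:9090'
          - 'prometheus-dc2.example.com:9090'
```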
If you've got a million metrics on each Prometheus server, you can't pull in a million metrics and do... if you've got ten of those, you can't pull ten million metrics simultaneously into one Prometheus server. It's just too much data.

There are a couple of other nice options. There's a piece of software called Cortex. Cortex is a Prometheus server that stores its data in a database, specifically a distributed database, things that are based on the Google Bigtable model, like Cassandra or... what's the Amazon one? Yeah, DynamoDB. If you have a DynamoDB or a Cassandra cluster, or one of these other really big distributed storage clusters, Cortex can run there, the Prometheus servers will stream their data up to Cortex, and it will keep a copy of the data from all of your Prometheus servers. And because it's based on things like Cassandra, it's super scalable. But it's a little complex to run, and many people don't want to run that complex infrastructure.

We have another new one; we just blogged about it yesterday. It's a thing called Thanos. Thanos is Prometheus at scale. Basically, the way it works... actually, why don't I bring that up? This was developed by a company called Improbable, and they wanted to... they had billions of metrics coming from hundreds of Prometheus servers. They developed this in collaboration with the Prometheus team to build a super highly scalable Prometheus setup. Prometheus itself stores the incoming metrics data in a write-ahead log, and then every two hours it runs a compaction cycle and creates an immutable block of data, which is all the time series data itself plus an index into that data. Those two-hour windows are all immutable, so Thanos has a little sidecar binary that watches for those new directories and uploads them into a blob store. So you could put them in S3 or Minio or some other simple object storage.
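The sidecar just needs to be told where that object storage lives. A rough sketch of the kind of bucket configuration it consumes, using S3-style settings; the bucket name, endpoint, and credentials are placeholders, and the exact keys should be checked against the Thanos documentation.

```yaml
# Object storage configuration handed to the Thanos sidecar (placeholders throughout)
type: S3
config:
  bucket: "thanos-blocks"
  endpoint: "s3.example.com"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
```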
And now you have all of your data, all of this indexed data, ready to go, and then the sidecar creates a little mesh cluster that can read from all of those S3 blocks. Now you have this super global view, all stored in big bucket storage, and things like S3 or Minio... bucket storage is not a database, so operationally it's a little easier to operate. Plus, now that we have all this data in a bucket store and the Thanos sidecars can talk to each other, we can have a single entry point. You can query Thanos, and Thanos will distribute your query across all your Prometheus servers. So now you can do global queries across all of your servers. But it's very new; they just released their first release candidate yesterday. It is looking to be the coolest thing ever for running large-scale Prometheus. Here's an example of how that is laid out. This will ??? let you have a billion-metric Prometheus cluster. And it's got a bunch of other cool features.

Any more questions?

Alright, maybe I'll do a quick little demo. Here is a Prometheus server that is provided by ???, which just does an Ansible deployment of Prometheus. You can simply query for something like 'node_cpu'; this is actually the old name for that metric. And you can see, here are exactly the CPU metrics from some servers. It's just a bunch of stuff. There are actually two servers here: there's an "influx" cloud alchemy instance and there's a "demo" cloud alchemy instance.

[Q] Can you zoom in?

[A] Oh yeah, sure. So you can see all the extra labels. We can also do some things like... let's take a look at, say, the last 30 seconds. We can just add this little time window; it's called a range request, and you can see the individual samples. You can see that all Prometheus is doing is storing the sample and a timestamp. All the timestamps are in milliseconds and it's all epoch time, so it's super easy to manipulate.
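In PromQL terms, those two demo queries are just an instant query and a range query over the same series (node_cpu being the older metric name this demo server still exposes).

```promql
# Instant query: the current value of every node_cpu series, with all its labels
node_cpu

# Range query: the raw samples from the last 30 seconds,
# each one a value plus a millisecond epoch timestamp
node_cpu[30s]
```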
But looking at the individual samples and looking at this, you can see that if we go back and just take the raw data, and we graph the raw data... oops, that's a syntax error... and we look at this graph... come on... here we go. Well, that's kind of boring: it's just a flat line, because it's just a counter going up very slowly. What we really want to do is take this counter and apply a rate function to it. So let's look at the rate over the last one minute. There we go, now we get a nice little graph. And you can see that this is 0.6 CPU seconds per second for that set of labels.

But this is pretty noisy, there are a lot of lines on this graph and there's still a lot of data here, so let's start doing some filtering. One of the things we see here is, well, there's idle. We don't really care about the machine being idle, so let's just add a label filter: 'mode', that's the label name, and it's not equal to 'idle'. Done. And if I could type... what did I miss? Here we go. So now we've removed idle from the graph. That looks a little more sane. Oh wow, look at that, that's a nice big spike in user space on the influx server. Okay... well, that's pretty cool.

What about... this is still quite a lot of lines. How much CPU is in use, in total, across all the servers that we have? We can just sum up that rate. We can see that there is a sum total of 0.6 CPU seconds per second across the servers we have. But that's a little too coarse. What if we want to see it by instance? Now we can see the two servers; we're left with just that label. The instance labels are the influx instance and the demo instance. That's a super easy way to see it, but we can also do this the other way around. We can say 'without (mode, cpu)', so we drop those labels and see all the labels that we have left. We can still see the environment label and the job label on the resulting data. You can go either way with the summary functions.
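Written out, the sequence of demo queries is roughly the following.

```promql
# Per-CPU, per-mode usage in CPU seconds per second, averaged over the last minute
rate(node_cpu[1m])

# The same, with the idle mode filtered out
rate(node_cpu{mode!="idle"}[1m])

# Total non-idle CPU usage summed across everything
sum(rate(node_cpu{mode!="idle"}[1m]))

# Broken down per server
sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))

# Or the other way around: aggregate away mode and cpu, keep every other label
sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))
```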
There's a whole bunch of different functions, and it's all in our documentation. But what if we want to see... what if we want to see which CPUs are in use? Now we can see that it's only CPU 0, because apparently these are only 1-core instances. You can add and remove labels and do all these queries.

Any other questions so far?

[Q] I don't have a question, but I have something to add. Prometheus is really nice, but it's a lot better if you combine it with Grafana.

[A] Yes, yes. In the beginning, when we were creating Prometheus, we actually built a piece of dashboard software called Promdash. It was a simple little Ruby on Rails app to create dashboards, and it had a bunch of JavaScript. And then Grafana came out, and we were like, "Oh, that's interesting. It doesn't support Prometheus," so we were like, "Hey, can you support Prometheus?" and they were like, "Yeah, we've got a REST API, get the data, done." Now Grafana supports Prometheus, and we were like, "Well, Promdash, this is crap, delete." The Prometheus development team, we're all backend developers and SREs, and we have no JavaScript skills at all. So we're like, "Let somebody else deal with that." One of the nice things about working on this kind of project is that we can do the things that we're good at, and we don't try... we don't have any marketing people; it's just an open source project, and there's no single company behind Prometheus. I work for GitLab, Improbable paid for the Thanos system, and other companies like Red Hat now pay people who used to work at CoreOS to work on Prometheus. There's lots and lots of collaboration between many companies to build the Prometheus ecosystem. But yeah, Grafana is great. Actually, Grafana now has two full-time Prometheus developers.

Alright, that's it.

[Applause]