So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's working right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring treats your software like a black box. It just checks to see, like, is it responding, does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better. So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, this requires whitebox instrumentation. Basically, every event that happens inside your software, inside the serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application server, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.
There are different examples of the kind of software that does blackbox and whitebox monitoring. You have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages to polling over pushing.

With polling, you get a free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based monitoring, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the disappearing-process problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, so that's a bit of an advantage.

With polling, there's really easy testing. With push-based metrics, if you want to test a new version of the monitoring system or you want to test something new, you have to figure out how to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it.
Or you don't even have to; it doesn't even have to be a monitoring system, you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ??? It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about the monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's the Prometheus server, and it deals with all the data collection and analytics. It's basically this one binary, all written in Go; it's a single binary. It knows how to read from your inventory, and there are a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has storage and a time series database to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and record it, so it will generate new data from that query, or you can take a query, and if it returns results, it will generate an alert.
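To make that concrete, here is a rough sketch of a scrape configuration that reads targets from files (say, dropped in by Ansible) plus one recording rule and one alerting rule. The file paths, job name and rule names are made up for the example; only the configuration keys and the PromQL are standard Prometheus.

    # prometheus.yml (sketch): read the target inventory from files
    scrape_configs:
      - job_name: "node"
        file_sd_configs:
          - files:
              - "/etc/prometheus/targets/*.json"   # e.g. generated by Ansible

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    # rules file (sketch): one recording rule, one alerting rule
    groups:
      - name: example
        rules:
          - record: instance:node_cpu_utilisation:rate5m   # continuously writes a new series
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
          - alert: InstanceDown                             # fires when the query returns results
            expr: up == 0
            for: 5m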
That alert is a push message to the Alertmanager. This allows us to separate the generation of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and it goes into an Alertmanager cluster, which does the deduplication and the routing to the human. Because, of course, the thing that we want is: we used to have dashboards with graphs, but in order to find out if something was broken you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric; this is a very simple thing. If you know much about the Linux kernel, the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. So this is a self-describing metric: you can just read the metric name and you understand a little bit about what's going on here. The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. Actually, the kernel tracks how much idle time it has. It also tracks this per CPU.

Other monitoring systems used to do this with a tree structure, and that caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics. Here's a nice example of taking those CPU seconds counters and converting them into a graph by using PromQL. Now we have this graph, we have this thing we can look at and see, "Oh, there is some little spike here, we might want to know about that." Now we can get into metrics-based alerting.
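Before getting into the alerting part, here is a rough sketch of the metric output and query just described: roughly what you would see if you curl the metrics endpoint of the node exporter (default port 9100), with made-up numbers, plus one common PromQL expression that turns those ever-growing counters into per-mode CPU usage suitable for graphing.

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 27183.24
    node_cpu_seconds_total{cpu="0",mode="user"} 1234.56
    node_cpu_seconds_total{cpu="0",mode="system"} 321.09

    # PromQL: per-second rate over 5 minutes, summed per mode, for graphing
    sum by (mode) (rate(node_cpu_seconds_total[5m]))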
I used to be a site reliability engineer; I'm still a site reliability engineer at heart. We have this concept of the things you need to run a site or a service reliably, and the most important thing is down at the bottom: monitoring. Because if you don't have monitoring of your service, how do you know it's even working?

There are a couple of techniques here, and we want to alert based on data, not just those end-to-end tests. There's a thing called the RED method and there's a thing called the USE method, and there are a couple of nice blog posts about these. Basically, the RED method, for example, talks about the requests that your system is handling. There are three things: there's the number of requests, there's the number of errors, and there's how long each request takes, the duration. With the combination of these three things you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" For most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.

But we can go back to some more traditional, sysadmin-style alerts. This is basically taking the filesystem available space, divided by the filesystem size; that becomes the ratio of filesystem availability, from 0 to 1. Multiply it by 100 and we now have a percentage, and if it's less than or equal to 1% for 15 minutes, so less than 1% space left, we should tell a sysadmin to go check the ??? filesystem ??? It's super nice and simple.

We can also tag, we can include... every alert includes all the extra labels that Prometheus adds to your metrics. When you add a metric in Prometheus, if we go back and look at this metric, the metric only contains information about the internals of the application; anything about, like, what server it's on, is it running in a container, what cluster does it come from, what ??? is it on, that's all extra annotations that are added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information.
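Written out as a Prometheus alerting rule, the filesystem check described above might look roughly like this. This is a sketch: the rule and group names are invented, and the metric names are the node exporter's current ones, which may differ between exporter versions.

    groups:
      - name: filesystem
        rules:
          - alert: FilesystemAlmostFull
            # available space / total size, as a percentage, at or below 1% for 15 minutes
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
            for: 15m
            labels:
              severity: critical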
That location information also comes through as labels on the alert. So if a message comes into your Alertmanager, the Alertmanager can look and go "Oh, that's coming from this datacenter", and it can include that in the email or IRC message or SMS message. So you can include "Filesystem is out of space on this host in this datacenter". All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course this is how you define getting the message from the monitoring to the human. You can even include nice things like, if you've got documentation, a link to the documentation as an annotation, and the Alertmanager can take that base URL and, you know, massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things: since we're not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says, "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than 0, which means full."
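And the 'predict_linear' version of that alert, again as a sketch that would sit inside a rule group like the earlier one. The severity and the documentation link are illustrative placeholders of the kind described above, not values from the talk.

    - alert: FilesystemWillFillSoon
      # linear regression over the last hour of data, extrapolated 4 hours ahead
      expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
      labels:
        severity: warning                    # illustrative severity, not from the talk
      annotations:
        documentation: "https://example.org/runbooks/filesystem-full"   # placeholder runbook link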