So, we had a talk by a non-GitLab person
about GitLab.
Now, we have a talk by a GitLab person
on non-GitLab.
Something like that?
The CCCHH hackerspace is now open,
from now on if you want to go there,
that's the announcement.
And the next talk will be by Ben Kochie
on metrics-based monitoring
with Prometheus.
Welcome.
[Applause]
Alright, so
my name is Ben Kochie
I work on DevOps features for GitLab
and apart from working for GitLab, I also work
on the open source Prometheus project.
I live in Berlin and I've been using
Debian since ???
yes, quite a long time.
So, what is Metrics-based Monitoring?
If you're running software in production,
you probably want to monitor it,
because if you don't monitor it, you don't
know if it's working right.
Monitoring breaks down into two categories:
there's blackbox monitoring and
there's whitebox monitoring.
Blackbox monitoring is treating
your software like a blackbox.
It's just checks to see, like,
is it responding, does it ping,
does it answer HTTP requests?
[mic turned on]
Ah, there we go, much better.
So, blackbox monitoring is a probe,
it just kind of looks from the outside
at your software
and it has no knowledge of the internals
and it's really good for end to end testing.
So if you've got a fairly complicated
service,
you come in from the outside, you go
through the load balancer,
you hit the API server,
the API server might hit a database,
and you go all the way through
to the back of the stack
and all the way back out
so you know that everything is working
end to end.
But you only know about it
for that one request.
So in order to find out if your service
is working,
end to end, for every single
request,
you need whitebox instrumentation.
So, basically, every event that happens
inside your software,
inside a serving stack,
gets collected and gets counted,
so you know that every request hits
the load balancer,
every request hits your application
service,
every request hits the database.
You know that everything matches up
and this is called whitebox, or
metrics-based monitoring.
There are different examples of, like,
the kind of software that does blackbox
and whitebox monitoring.
So you have software like Nagios,
where you can configure checks,
or Pingdom;
Pingdom will ping your website.
And then there is metrics-based monitoring,
things like Prometheus, things like
the TICK stack from InfluxData,
New Relic and other commercial solutions,
but of course I like to talk about
the open source solutions.
We're gonna talk a little bit about
Prometheus.
Prometheus came out of the idea that
we needed a monitoring system that could
collect all this whitebox metric data
and do something useful with it.
Not just give us a pretty graph, but
we also want to be able to
alert on it.
So we needed both
a data gathering and an analytics system
in the same instance.
To do this, we built this thing and
we looked at the way that
data was being generated
by the applications
and there are advantages and
disadvantages to this
push vs. poll model for metrics.
We decided to go with the polling model
because there are some slight advantages
for polling over pushing.
With polling, you get this free
blackbox check
that the application is running.
When you poll your application, you know
that the process is running.
If you are doing push-based, you can't
tell the difference between
your application doing no work and
your application not running.
So you don't know if it's stuck,
or just doesn't have any work to do.
With polling, the polling system knows
the state of your network.
If you have a defined set of services,
that inventory drives what should be there.
Again, it's like the disappearing process:
is the process dead, or is it just
not doing anything?
With polling, you know for a fact
what processes should be there,
and it's a bit of an advantage there.
With polling, there's really easy testing.
With push-based metrics,
if you want to test a new version of
the monitoring system or
you want to test something new,
you have to figure out how to send
a copy of the data to it.
With polling, you can just set up
another instance of your monitoring
and just test it.
Or it doesn't even have to be
a monitoring system,
you can just use curl
to poll the metrics endpoint.
It's significantly easier to test.
The other nice thing is that
the client is really simple.
The client doesn't have to know
where the monitoring system is.
It doesn't have to know about ???
It just has to sit and collect the data
about itself.
So it doesn't have to know anything about
the topology of the network.
As an application developer, if you're
writing a DNS server or
some other piece of software,
you don't have to know anything about
monitoring software,
you can just implement it inside
your application and
the monitoring software, whether it's
Prometheus or something else,
can just come and collect that data for you.
That's kind of similar to a very old
monitoring system called SNMP,
but SNMP has a significantly less friendly
data model for developers.
This is the basic layout
of a Prometheus server.
At the core, there's a Prometheus server
and it deals with all the data collection
and analytics.
Basically, this one binary,
it's all written in Go.
It's a single binary.
It knows how to read from your inventory,
there's a bunch of different methods,
whether you've got
a Kubernetes cluster or a cloud platform
or you have your own customized thing
with Ansible.
Ansible can take your layout, drop that
into a config file and
Prometheus can pick that up.
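As a rough sketch of what that looks like
(the job name and file path here are just placeholders),
Prometheus can read targets from files
that a tool like Ansible writes out:

    scrape_configs:
      - job_name: 'node'                         # placeholder job name
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/*.json' # target files generated by Ansible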
Once it has the layout, it goes out and
collects all the data.
It has a storage and a time series
database to store all that data locally.
It has a thing called PromQL, which is
a query language designed
for metrics and analytics.
On top of that PromQL, you can add frontends:
whether it's a simple API client
to run reports,
or things like Grafana
for creating dashboards,
and it's got a simple web UI built in.
You can plug in anything you want
on that side.
And then, it also has the ability to
continuously execute queries
called "recording rules"
and these recording rules have
two different modes.
You can either take a query
and record new data
generated from that query,
or you can take a query and,
if it returns results,
it will fire an alert.
That alert is a push message
to the alert manager.
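As a small sketch of what those two modes
look like in a rule file
(the metric and rule names here are made up):

    groups:
      - name: example
        rules:
          # recording rule: continuously evaluates the query
          # and stores the result as new data
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))
          # alerting rule: if the query returns results,
          # an alert is pushed to the alert manager
          - alert: HighErrorRate
            expr: sum by (job) (rate(http_requests_total{status="500"}[5m])) > 1
            for: 10m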
This allows us to separate the generating
of alerts from the routing of alerts.
You can have one or hundreds of Prometheus
services, all generating alerts
and they go into an alert manager cluster,
which does the deduplication
and the routing to a human,
because, of course, that's the thing we want.
We had dashboards with graphs, but
in order to find out if something was broken
you had to have a human
looking at the graph.
With Prometheus, we don't have to do that
anymore,
we can simply let the software tell us
that we need to go investigate
our problems.
We don't have to sit there and
stare at dashboards all day,
because that's really boring.
What does it look like to actually
get data into Prometheus?
This is a very basic output
of a Prometheus metric.
This is a very simple thing.
If you know much about
the Linux kernel,
the Linux kernel tracks CPU stats,
all the state of all the CPUs
in your system
and we express this by having
the name of the metric, which is
'node_cpu_seconds_total' and so
this is a self-describing metric,
like you can just read the metric's name
and you understand a little bit about
what's going on here.
The Linux kernel and other kernels track
their usage by the number of seconds
spent doing different things,
whether that's in system or
user space, or IRQs,
or iowait, or idle.
Actually, the kernel tracks how much
idle time it has.
It also tracks it by the number of CPUs.
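The slide isn't reproduced in this transcript,
but that output looks roughly like this
(the numbers are made up):

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 183472.71
    node_cpu_seconds_total{cpu="0",mode="system"} 1337.42
    node_cpu_seconds_total{cpu="0",mode="user"} 4221.05
    node_cpu_seconds_total{cpu="1",mode="idle"} 183001.54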
Other monitoring systems used
to do this with a tree structure,
and this caused a lot of problems,
like:
how do you mix and match data?
So by switching from
a tree structure to a tag-based structure,
we can do some really interesting,
powerful data analytics.
Here's a nice example of taking
those CPU seconds counters
and then converting them into a graph
by using PromQL.
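The exact query from the slide isn't captured
in the transcript, but a typical version
would be something like:

    # per-second CPU usage, broken down by mode,
    # averaged over the last 5 minutes
    avg by (mode) (rate(node_cpu_seconds_total[5m]))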
Now we have this graph, we have this thing
we can look and see here
"Oh there is some little spike here,
we might want to know about that."
Now we can get into Metrics-Based
Alerting.
I used to be a site reliability engineer,
I'm still a site reliability engineer at heart
and we have this concept of the things that
you need to run a site or a service reliably.
The most important thing you need is
down at the bottom,
Monitoring, because if you don't have
monitoring of your service,
how do you know it's even working?
There are a couple of techniques here, and
we want to alert based on data
and not just those end to end tests.
There's a thing
called the RED method
and there's a thing called the USE method,
and there are some nice
blog posts about this.
Basically it defines that, for example,
the RED method talks about the requests
that your system is handling
There are three things:
There's the number of requests, there's
the number of errors
and there's how long each request takes, the duration.
With the combination of these three things
you can determine most of
what your users see
"Did my request go through? Did it
return an error? Was it fast?"
Most people, that's all they care about.
"I made a request to a website and
it came back and it was fast."
It's a very simple method:
those are the important things to
determine if your site is healthy.
But we can go back to some more
traditional, sysadmin-style alerts:
this is basically taking the filesystem
available space,
divided by the filesystem size; that becomes
the ratio of filesystem availability,
from 0 to 1.
Multiply it by 100, we now have
a percentage
and if it's less than or equal to 1%
for 15 minutes,
so less than 1% space left, we should tell
a sysadmin to go check
the filesystem.
It's super nice and simple.
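Written out as an alerting rule, that looks
roughly like this (assuming node_exporter-style
metric names; the alert name is made up):

    - alert: FilesystemAlmostFull
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
      for: 15m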
We can also include tags:
every alert includes all the extra
labels that Prometheus adds to your metrics.
When you add a metric in Prometheus, if
we go back and look at this metric,
this metric only contains information
about the internals of the application.
Anything about, like, what server it's on,
is it running in a container,
what cluster does it come from,
what ??? is it on,
that's all extra annotations that are
added by the Prometheus server
at discovery time.
I don't have a good example of what
those labels look like
but every metric gets annotated
with location information.
That location information also comes through
as labels in the alert
so, if you have a message coming
into your alert manager,
the alert manager can look and go
"Oh, that's coming from this datacenter"
and it can include that in the email or
IRC message or SMS message.
So you can include
"Filesystem is out of space on this host
from this datacenter"
All these labels get passed through and
then you can append
"severity: critical" to that alert and
include that in the message to the human
because, of course, this is how you get
the message from the monitoring
to the human.
You can even include nice things like,
if you've got documentation, you can
include a link to the documentation
as an annotation
and the alert manager can take that
basic url and, you know,
massage it into whatever it needs
to look like to actually get
the operator to the correct documentation.
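As a sketch, those labels and annotations sit
right on the alerting rule (the runbook URL is
just a placeholder, and the alert is the same
hypothetical one as above):

    - alert: FilesystemAlmostFull
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: 'Filesystem almost full on {{ $labels.instance }}'
        runbook: 'https://wiki.example.org/runbooks/filesystem-full'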
We can also do more fun things:
since we actually are not just checking
what is the space right now,
we're tracking data over time,
we can use 'predict_linear'.
'predict_linear' just takes and does
a simple linear regression.
This example takes the filesystem
available space over the last hour and
does a linear regression.
Prediction says "Well, it's going that way
and four hours from now,
based on one hour of history, it's gonna
be less than 0, which means full".
We know that within the next four hours,
the disc is gonna be full
so we can tell the operator ahead of time
that it's gonna be full
and not just tell them that it's full
right now.
They have some window of ability
to fix it before it fails.
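That expression looks roughly like this
(again assuming node_exporter-style metric names):

    # based on the last hour of history, will the available
    # space drop below zero within the next 4 hours?
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0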
This is really important because
if you're running a site
you want to be able to have alerts
that tell you that your system is failing
before it actually fails.
Because if it fails, you're out of SLO
or SLA and
your users are gonna be unhappy
and you don't want the users to tell you
that your site is down
you want to know about it before
your users can even tell.
This allows you to do that.
And also of course, Prometheus being
a modern system,
we fully support UTF-8 in all of our labels.
Here's another one, a good example
from the USE method.
This is a rate of 500 errors coming from
an application
and you can simply alert that
there's more than 500 errors per second
coming out of the application
if that's your threshold for ???
And you can do other things,
you can convert that from just
a rate of errors
to a percentage of errors.
So you could say
"I have an SLA of 3 9" and so you can say
"If the rate of errors divided by the rate
of requests is .01,
or is more than .01, then
that's a problem."
You can include that level of
error granularity.
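As a sketch (the metric name and label here are
assumptions about how the application is
instrumented), that ratio could be expressed as:

    # fraction of requests that returned a 5xx error over the last 5 minutes
      sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
      sum(rate(http_requests_total[5m]))
    > 0.01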
And if you're just doing a blackbox test,
you wouldn't know this; you would only
fire an alert
if you got an error from the system
and then another error from the system.
But if those checks are one minute apart
and you're serving 1000 requests per second
you could be serving 10,000 errors before
you even get an alert.
And you might miss it, because
what if you only get one random error
and then the next time, you're serving
25% errors,
you only have a 25% chance of that check
failing again.
You really need these metrics in order
to get
proper reports of the status of your system.
There are even options
to slice and dice those labels.
If you have a label on all of
your applications called 'service'
you can send that 'service' label through
to the message
and you can say
"Hey, this service is broken".
You can include that service label
in your alert messages.
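Slicing that by a 'service' label is just
a matter of grouping the same query
(still assuming the hypothetical
http_requests_total metric from above):

    # error ratio per service, so the alert can say which service is broken
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    /
      sum by (service) (rate(http_requests_total[5m]))
    > 0.01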
And that's it, I can go to a demo and Q&A.
[Applause]
Any questions so far?
Or anybody want to see a demo?
[Q] Hi. Does Prometheus do metric
discovery inside containers
or do I have to implement the metrics
myself?
[A] For metrics in containers, there are
already things that expose
the metrics of the container system
itself.
There's a utility called 'cAdvisor', and
cAdvisor takes the Linux cgroup data
and exposes it as metrics
so you can get data about
how much CPU time is being
spent in your container,
how much memory is being used
by your container.
[Q] But not about the application,
just about the container usage ?
[A] Right. Because the container
has no idea
whether your application is written
in Ruby or Go or Python or whatever,
you have to build that into
your application in order to get the data.
So for Prometheus,
we've written client libraries that can be
included in your application directly
so you can get that data out.
If you go to the Prometheus website,
we have a whole series of client libraries
and we cover a pretty good selection
of popular software.
[Q] What is the current state of
long-term data storage?
[A] Very good question.
There are actually several different methods
of doing this.
Prometheus stores all this data locally
in its own data storage
on the local disk.
But that's only as durable as
that server is durable.
So if you've got a really durable server,
you can store as much data as you want,
you can store years and years of data
locally on the Prometheus server.
That's not a problem.
There are a bunch of misconceptions because
of our defaults,
and the language on our website said
"it's not long-term storage"
simply because we leave that problem
up to the person running the server.
But the time series database
that Prometheus includes
is actually quite durable.
But it's only as durable as the server
underneath it.
So if you've got a very large cluster and
you want really high durability,
you need to have some kind of
cluster software,
but we want Prometheus to be
simple to deploy,
very simple to operate
and also very robust,
so we didn't want to include any clustering
in Prometheus itself,
because anytime you have clustered
software,
what happens if your network is
a little wonky?
The first thing that happens is that
all of your distributed systems fail.
And building distributed systems to be
really robust is really hard
so Prometheus is what we call
an "uncoordinated distributed system".
If you've got two Prometheus servers
monitoring all your targets in an HA mode
in a cluster, and there's a split brain,
each Prometheus can see
half of the cluster and
it can see that the other half
of the cluster is down.
They can both try to get alerts out
to the alert manager
and this is a really really robust way of
handling split brains
and bad network failures and bad problems
in a cluster.
It's designed to be super, super robust,
and so the two individual
Prometheus servers in your cluster
don't have to talk to each other
to do this,
they can just do it independently.
But if you want to be able
to correlate data
between many different Prometheus servers
you need an external data storage
to do this.
And also you may not have
very big servers,
you might be running your Prometheus
in a container
and it's only got a little bit of local
storage space
so you want to send all that data up
to a big cluster datastore
for long-term storage.
We have several different ways of
doing this.
There's the classic way which is called
federation
where you have one Prometheus server
polling in summary data from
each of the individual Prometheus servers
and this is useful if you want to run
alerts against data coming
from multiple Prometheus servers.
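A federation scrape config, as a rough sketch
(the match selector and target addresses are
placeholders), looks something like this:

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'   # only pull aggregated recording-rule series
        static_configs:
          - targets:
              - 'prometheus-dc1:9090'
              - 'prometheus-dc2:9090'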
But federation is not replication.
It can only pull a little bit of data from
each Prometheus server.
If you've got a million metrics on
each Prometheus server,
you can't poll in a million metrics
and do…
If you've got 10 of those, you can't
poll in 10 million metrics
simultaneously into one Prometheus
server.
It's just too much data.