Metrics-Based Monitoring with Prometheus

So, we had a talk by a non-GitLab person about GitLab. Now, we have a talk by a GitLab person on non-GitLab. Something like that?

The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement.

And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab and, apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring. Blackbox monitoring is treating your software like a blackbox. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.

So, blackbox monitoring is a probe. It just kind of looks from the outside at your software, it has no knowledge of the internals, and it's really good for end-to-end testing.

So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.
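
A blackbox check of the kind described here can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `probe` and its timeout default are assumptions, not something shown in the talk.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if one HTTP GET to `url` answers with a 2xx status.

    This is a blackbox check: it sees a single request from the outside
    and knows nothing about what happened inside the service.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A single `True` only says that this one request made it through the whole stack and back.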

So in order to find out if your service is working end to end, for every single request, this requires whitebox instrumentation. Basically, every event that happens inside your software, inside a serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.
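
As a rough sketch of what "every event gets counted" means, the following Python counts a request at each layer it reaches. The layer names and the simulated failure are hypothetical and only serve to show how the counters can be compared.

```python
from collections import Counter

# One counter per layer of a hypothetical serving stack.
events = Counter()


def handle_request(fail_before_db: bool = False) -> None:
    """Count a request at every layer it actually reaches."""
    events["load_balancer_requests_total"] += 1
    events["app_requests_total"] += 1
    if fail_before_db:
        events["app_errors_total"] += 1
        return  # the request never reaches the database
    events["db_queries_total"] += 1


# Simulate 100 requests, every tenth of which fails in the application.
for i in range(100):
    handle_request(fail_before_db=(i % 10 == 0))

# Every request is accounted for: the layers either match up, or the
# difference is explained by an error counter.
missing = events["app_requests_total"] - events["db_queries_total"]
print(events["load_balancer_requests_total"], missing, events["app_errors_total"])
```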

There are different examples of the kind of software that does blackbox and whitebox monitoring. You have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions. But of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to this push vs. poll model for metrics.

We decided to go with the polling model because there are some slight advantages for polling over pushing. With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it's just not having to do any work.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's like the disappearing: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and it's a bit of an advantage there.
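
The difference between "idle" and "dead" can be illustrated with a toy poller. In this sketch a target is just a Python function, and the metric name `jobs_done_total` is made up; Prometheus itself records the success of each scrape in a series called `up`, which is what the `up` key imitates here.

```python
from typing import Callable, Dict, Optional

# A target is modelled as a function that returns its current counter value,
# or raises ConnectionError when the process is not running.
Target = Callable[[], float]


def scrape(target: Target) -> Dict[str, Optional[float]]:
    """Poll one target and record whether the poll itself succeeded."""
    try:
        return {"up": 1.0, "jobs_done_total": target()}
    except ConnectionError:
        return {"up": 0.0, "jobs_done_total": None}


def idle_worker() -> float:
    return 42.0  # alive, but the counter has not moved


def dead_worker() -> float:
    raise ConnectionError("connection refused")


first, second = scrape(idle_worker), scrape(idle_worker)
# Idle: the poll succeeds and the counter shows no new work.
print(second["up"], second["jobs_done_total"] - first["jobs_done_total"])
# Dead: the poll itself fails, which a push-based system would never see.
print(scrape(dead_worker)["up"])
```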

With polling, there's really easy testing. With push-based metrics, if you want to test a new version of the monitoring system or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. It doesn't even have to be monitoring, you can just use curl to poll the metrics endpoint. It's significantly easier to test.
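
To make the "just poll it" point concrete, here is a minimal sketch that serves one metric over HTTP and then fetches it the way curl would. It uses only the Python standard library; the metric name `myapp_requests_total` and its value are invented for the example.

```python
import http.server
import threading
import urllib.request

REQUESTS_HANDLED = 7  # stand-in for a counter the application maintains


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Plain-text exposition: one "name value" line per metric.
        body = f"myapp_requests_total {REQUESTS_HANDLED}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Port 0 lets the OS pick a free port for this demonstration.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Anything that speaks HTTP can poll the endpoint, as curl would.
url = f"http://127.0.0.1:{server.server_port}/metrics"
with urllib.request.urlopen(url, timeout=5) as response:
    scraped = response.read().decode()
server.shutdown()
print(scraped, end="")
```

The application side never learns who polled it, which is why a second monitoring instance or a one-off manual request costs nothing extra.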

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network.

As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's a Prometheus server, and it deals with all the data collection and analytics. Basically, it's this one binary, all written in Go. It's a single binary.

It knows how to read from your inventory. There's a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data.
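
One way the "drop the layout into a file" step can look is Prometheus's file-based service discovery, which reads a JSON list of target groups. The talk does not say which mechanism is meant, so treat that as an assumption; the inventory, host names and the `service` label below are hypothetical, and port 9100 is simply the node exporter's usual default.

```python
import json

# Hypothetical inventory, e.g. what a configuration management tool knows.
inventory = {
    "web": ["web1.example.org", "web2.example.org"],
    "db": ["db1.example.org"],
}


def to_file_sd(inventory: dict, port: int = 9100) -> list:
    """Turn {group: [hosts]} into file-based discovery target groups."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"service": group},
        }
        for group, hosts in sorted(inventory.items())
    ]


# The resulting JSON is what would be written to the discovery file.
print(json.dumps(to_file_sd(inventory), indent=2))
```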

It has storage and a time series database to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL, you can add frontends, whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then, it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and it will generate new data from that query, or you can take a query and, if it returns results, it will generate an alert. That alert is a push message to the Alertmanager.

This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus services, all generating alerts, and they go into an Alertmanager cluster, which does the deduplication and the routing to the human.

Because, of course, the thing is: we had dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. This is a very simple thing. If you know much about the Linux kernel, the Linux kernel tracks ??? stats, all the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric's name and you understand a little bit about what's going on here.

The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether it's in system or user space, or IRQs, or iowait, or idle. Actually, the kernel tracks how much idle time it has. It also tracks it by the number of CPUs.
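
The slide itself is not part of the transcript, but output of this kind typically has one line per CPU and mode, with the labels in braces after the metric name. The sketch below parses a few such lines; the sample values are made up, and the parser only handles simple lines like these (no comments, escapes or timestamps).

```python
import re

# Illustrative sample values; real numbers come from the kernel's counters.
EXPOSITION = '''\
node_cpu_seconds_total{cpu="0",mode="idle"} 18000.5
node_cpu_seconds_total{cpu="0",mode="user"} 1200.25
node_cpu_seconds_total{cpu="1",mode="idle"} 17950.0
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text: str) -> list:
    """Parse simple labelled samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


for name, labels, value in parse(EXPOSITION):
    print(name, labels["cpu"], labels["mode"], value)
```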

Other monitoring systems used to do this with a tree structure, and this caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics.
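
A small sketch of why tags help: with labels attached to every sample, the same data can be grouped along any dimension without deciding on a hierarchy up front. The helper `sum_by` is a made-up Python stand-in for what PromQL expresses as `sum by (mode) (...)`, and the numbers are illustrative.

```python
from collections import defaultdict

# (labels, value) pairs; the numbers are made up for illustration.
samples = [
    ({"cpu": "0", "mode": "idle"}, 18000.0),
    ({"cpu": "0", "mode": "user"}, 1200.0),
    ({"cpu": "1", "mode": "idle"}, 17950.0),
    ({"cpu": "1", "mode": "user"}, 1250.0),
]


def sum_by(samples, label):
    """Sum sample values grouped by one label, ignoring all other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


print(sum_by(samples, "mode"))  # grouped across CPUs
print(sum_by(samples, "cpu"))   # the same data, grouped across modes
```

In a tree such as `host.cpu0.idle`, the second grouping would need a different path layout; with labels it is only a different argument.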

Here's a nice example of taking those CPU seconds counters and then converting them into a graph by using PromQL.
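
Turning an ever-growing counter into something you can graph means computing how fast it grows. The following is a simplified sketch of that idea in Python; PromQL's own `rate()` function additionally extrapolates to the edges of the query window, which is left out here, and the sample data is invented.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (the process restarted),
    so counting continues from zero instead of going negative.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# 'user' CPU seconds sampled every 15 s (made-up values).
cpu_user = [(0, 100.0), (15, 103.0), (30, 106.0), (45, 109.0), (60, 112.0)]
print(per_second_rate(cpu_user))
```

For CPU seconds, the result reads as the fraction of one CPU spent in that mode over the window.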

Now we have this graph, we have this thing, we can look and see here: "Oh, there is some little spike here, we might want to know about that." Now we can get into Metrics-Based Alerting.

I used to be a site reliability engineer, I'm still a site reliability engineer at heart, and we have this concept of the things that you need to run a site or a service reliably. The most important thing you need is down at the bottom: monitoring, because if you don't have monitoring of your service, how do you know it's even working?

There's a couple of techniques here, and we want to alert based on data and not just those end-to-end tests. There's a thing called the RED method and there's a thing called the USE method, and there are a couple of nice blog posts about this. For example, the RED method talks about the requests that your system is handling. There are three things: there's the number of requests, there's the number of errors, and there's how long it takes, the duration.

With the combination of these three things, you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" Most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.
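
The three RED numbers can be computed from nothing more than a list of handled requests. This sketch assumes a hypothetical `Request` record and treats status codes of 500 and above as errors; the choice of the middle value as the duration summary is also just one simple option.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request


def red(requests, window_s):
    """Summarise a window of requests as Rate, Errors and Duration."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        # Middle value as a simple duration summary.
        "median_duration_s": durations[len(durations) // 2] if durations else None,
    }


# 97 fast successes and 3 slow errors within a 10 second window.
window = [Request(200, 0.05)] * 97 + [Request(500, 0.9)] * 3
print(red(window, window_s=10))
```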

But we can go back to some more traditional, sysadmin-style alerts. This is basically taking the filesystem available space, divided by the filesystem size; that becomes the ratio of filesystem availability, from 0 to 1. Multiply it by 100, and we now have a percentage. If it's less than or equal to 1% for 15 minutes, this is less than 1% space, and we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple.
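
The rule described here has two parts: the percentage calculation and the "for 15 minutes" condition. The rule on the slide is not in the transcript, so the following Python is only a sketch of the logic, with invented function names; with the node exporter the inputs would presumably be `node_filesystem_avail_bytes` and `node_filesystem_size_bytes`.

```python
def percent_available(avail_bytes: float, size_bytes: float) -> float:
    """Available space as a percentage of filesystem size."""
    return avail_bytes / size_bytes * 100


def should_alert(samples, threshold_pct=1.0, for_seconds=15 * 60):
    """True if the threshold has been breached without interruption for at
    least `for_seconds`, up to and including the latest sample.

    `samples` is a time-ordered list of (timestamp_s, avail_bytes, size_bytes).
    """
    if not samples:
        return False
    now = samples[-1][0]
    breached_since = None
    for ts, avail, size in samples:
        if percent_available(avail, size) <= threshold_pct:
            if breached_since is None:
                breached_since = ts
        else:
            breached_since = None  # recovered, so the clock starts over
    return breached_since is not None and now - breached_since >= for_seconds


size = 100e9
low = [(t, 0.5e9, size) for t in range(0, 16 * 60, 60)]  # 0.5% free for 15 min
blip = [(0, 50e9, size)] + [(t, 0.5e9, size) for t in range(60, 10 * 60, 60)]
print(should_alert(low), should_alert(blip))
```

The waiting period is what keeps a short dip from paging anyone.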

Every alert includes all the extraneous labels that Prometheus adds to your metrics. If we go back and look at this metric, it only contains the information about the internals of the application. Anything about, like, what server it's on, is it running in a container, what cluster does it come from, what ??? is it on, those are all extra annotations that are added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information.

That location information also comes through as labels in the alert. So, if you have a message coming into your Alertmanager, the Alertmanager can look and go "Oh, that's coming from this datacenter", and it can include that in the email or IRC message or SMS message. So you can include "Filesystem is out of space on this host from this datacenter". All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course, this is how you define getting the message from the monitoring to the human.

You can even include nice things like, if you've got documentation, a link to the documentation as an annotation, and the Alertmanager can take that basic URL and, you know, massage it into whatever it needs to look like to actually get the operator to the correct documentation.
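
To picture how labels and annotations end up in the message to the human, here is a toy renderer. All label values, the `runbook` annotation and the URL are hypothetical, and the real Alertmanager uses its own template language rather than Python string formatting.

```python
alert = {
    "labels": {
        # Added by the rule and by discovery; all values here are hypothetical.
        "alertname": "FilesystemAlmostFull",
        "severity": "critical",
        "instance": "db1.example.org:9100",
        "datacenter": "ham1",
    },
    "annotations": {
        # A base URL that is completed from the alert's own labels.
        "runbook": "https://docs.example.org/runbooks/{alertname}",
    },
}


def render(alert: dict) -> str:
    """Build a human-readable notification from an alert's labels."""
    labels = alert["labels"]
    runbook = alert["annotations"]["runbook"].format(**labels)
    return (
        f"[{labels['severity']}] {labels['alertname']} on "
        f"{labels['instance']} in {labels['datacenter']}\n"
        f"Runbook: {runbook}"
    )


print(render(alert))
```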

We can also do more fun things. Since we are not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says "Well, it's going that way and, four hours from now, based on one hour of history, it's gonna be less than 0, which means full". We know that within the next four hours, the disk is gonna be full, so we can tell the operator ahead of time that it's gonna be full, and not just tell them that it's full right now. They have some window of ability to fix it before it fails.
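
The regression behind this kind of prediction is ordinary least squares on (time, value) pairs. The sketch below reuses the name `predict_linear` for a plain Python function; it evaluates the fitted line relative to the last sample, which is a simplification, and the disk numbers are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it `seconds_ahead` after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# One hour of free-space samples, one per minute, losing 1 GB per minute
# starting from 200 GB (made-up numbers).
history = [(60 * m, 200e9 - 1e9 * m) for m in range(61)]
predicted = predict_linear(history, seconds_ahead=4 * 3600)
print(predicted < 0)  # a negative prediction means the disk would be full
```

With 140 GB left after the hour and 1 GB lost per minute, the line crosses zero well inside the four-hour horizon, so the alert condition holds.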

This is really important because, if you're running a site, you want to be able to have alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA and your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that.

And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, here's a good example from the USE method. This is a rate of 500 errors coming from an application, and you can simply alert that there's more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors to a percentage of errors. So you could say "I have an SLA of three nines", and so you can say "If the rate of errors divided by the rate of requests is more than .01, then that's a problem." You can include that level of error granularity.
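
The ratio check is simple arithmetic, sketched below with made-up function names. One detail worth keeping in mind when picking the threshold: an error ratio of .01 corresponds to two nines (99%), while three nines (99.9%) leaves a budget of .001.

```python
def error_budget(nines: int) -> float:
    """Allowed error ratio for an availability target of `nines` nines."""
    return 10.0 ** -nines


def breaches_slo(error_rate: float, request_rate: float, nines: int) -> bool:
    """Compare errors/requests (both per second) against the error budget."""
    if request_rate == 0:
        return False  # no traffic, nothing to judge
    return error_rate / request_rate > error_budget(nines)


# 5 errors per second out of 1000 requests per second is a ratio of .005:
# fine for a two-nines target, a breach for a three-nines target.
print(breaches_slo(error_rate=5, request_rate=1000, nines=2))
print(breaches_slo(error_rate=5, request_rate=1000, nines=3))
```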

And if you're just doing a blackbox test, you wouldn't know this. You would only get: you got an error from the system, then you got another error from the system, then you fire an alert. But if those checks are one minute apart and you're serving 1000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it, because what if you only get one random error, and then the next time, you're serving 25% errors, so you only have a 25% chance of that check failing again? You really need these metrics in order to get proper reports of the status of your system.
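
The arithmetic behind this argument can be written out, assuming each probe hits an error independently with probability equal to the error ratio. The function names are invented for the sketch; the inputs are the figures mentioned in the talk.

```python
def requests_between_checks(requests_per_s: float, check_interval_s: float) -> float:
    """Requests served between two consecutive blackbox checks."""
    return requests_per_s * check_interval_s


def chance_all_checks_fail(error_ratio: float, checks: int) -> float:
    """Probability that `checks` independent probes all hit an error."""
    return error_ratio ** checks


# 1000 requests per second with checks one minute apart.
print(requests_between_checks(1000, 60))
# With 25% of requests failing: chance that one probe fails, and that two
# probes in a row fail.
print(chance_all_checks_fail(0.25, 1), chance_all_checks_fail(0.25, 2))
```

So 60,000 requests go by between two checks, and an alert that needs two failed probes in a row would fire on only about 6% of check pairs while a quarter of all users see errors.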

You can also slice and dice those labels. If you have a label on all of your applications called 'service', you can send that 'service' label through to the message and you can say "Hey, this service is broken". You can include that service label in your alert messages.

And that's it, I can go to a demo and Q&A.

[Applause]

Any questions so far? Or anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called 'cadvisor', and cadvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally in its own data storage on the local disk. But that's only as durable as that server is durable. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem.

There's a bunch of misconceptions because of our default, and because the language on our website said "It's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's only as durable as the server underneath it, though.

So if you've got a very large cluster and you want really high durability, you need to have some kind of cluster software. But because we want Prometheus to be simple to deploy, very simple to operate, and also very robust, we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that goes down is all of your distributed systems.

Building distributed systems to be really robust is really hard, so Prometheus is what we call an "uncoordinated distributed system". If you've got two Prometheus servers monitoring all your targets in an HA mode in a cluster, and there's a split brain, each Prometheus can see half of the cluster and it can see that the other half of the cluster is down. They can both try to get alerts out to the Alertmanager, and this is a really, really robust way of handling split brains, bad network failures and bad problems in a cluster. It's designed to be super, super robust.
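
Why two uncoordinated servers do not produce two pages can be sketched as deduplication by label set. This is only a toy model of the idea, not how the Alertmanager is implemented; the sender names and labels are hypothetical, and it assumes both servers attach identical labels to the same alert.

```python
def fingerprint(labels: dict) -> frozenset:
    """Identity of an alert: its full label set, independent of the sender."""
    return frozenset(labels.items())


def deduplicate(incoming):
    """Keep one alert per label set from (sender, labels) pairs."""
    unique = {}
    for sender, labels in incoming:
        unique.setdefault(fingerprint(labels), labels)
    return list(unique.values())


same_alert = {"alertname": "InstanceDown", "instance": "web1.example.org:9100"}
incoming = [
    ("prometheus-a", dict(same_alert)),
    ("prometheus-b", dict(same_alert)),  # the HA twin reports the same thing
    ("prometheus-b", {"alertname": "InstanceDown", "instance": "web2.example.org:9100"}),
]
print(len(deduplicate(incoming)))
```

Neither server needs to know the other exists; agreement happens only where the alerts arrive.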

Title:: Metrics-Based Monitoring with Prometheus
Video Language:: English
Team:: Debconf
Project:: 2018_mini-debconf-hamburg
Duration:: 34:03
