Metrics-Based Monitoring with Prometheus

  • 0:06 - 0:11
    So, we had a talk by a non-GitLab person
    about GitLab.
  • 0:11 - 0:13
    Now, we have a talk by a GitLab person
    on non-GitLab.
  • 0:13 - 0:15
    Something like that?
  • 0:16 - 0:19
    The CCCHH hackerspace is now open,
  • 0:20 - 0:22
    from now on if you want to go there,
    that's the announcement.
  • 0:22 - 0:26
    And the next talk will be by Ben Kochie
  • 0:26 - 0:28
    on metrics-based monitoring
    with Prometheus.
  • 0:29 - 0:30
    Welcome.
  • 0:31 - 0:33
    [Applause]
  • 0:35 - 0:37
    Alright, so
  • 0:37 - 0:39
    my name is Ben Kochie
  • 0:40 - 0:44
    I work on DevOps features for GitLab
  • 0:44 - 0:48
    and apart from working for GitLab, I also work
    on the open source Prometheus project.
  • 0:51 - 0:54
    I live in Berlin and I've been using
    Debian since ???
  • 0:54 - 0:57
    yes, quite a long time.
  • 0:59 - 1:01
    So, what is Metrics-based Monitoring?
  • 1:03 - 1:05
    If you're running software in production,
  • 1:06 - 1:08
    you probably want to monitor it,
  • 1:08 - 1:11
    because if you don't monitor it, you don't
    know if it's working right.
  • 1:13 - 1:16
    Monitoring breaks down into two categories:
  • 1:16 - 1:19
    there's blackbox monitoring and
    there's whitebox monitoring.
  • 1:20 - 1:25
    Blackbox monitoring is treating
    your software like a blackbox.
  • 1:25 - 1:26
    It's just checks to see, like,
  • 1:26 - 1:29
    is it responding, or does it ping
  • 1:30 - 1:34
    or ??? HTTP requests
  • 1:34 - 1:36
    [mic turned on]
  • 1:38 - 1:41
    Ah, there we go, that's better.
  • 1:47 - 1:52
    So, blackbox monitoring is a probe,
  • 1:52 - 1:55
    it just kind of looks from the outside
    to your software
  • 1:55 - 1:57
    and it has no knowledge of the internals
  • 1:58 - 2:01
    and it's really good for end to end testing.
  • 2:01 - 2:04
    So if you've got a fairly complicated
    service,
  • 2:04 - 2:06
    you come in from the outside, you go
    through the load balancer,
  • 2:07 - 2:08
    you hit the API server,
  • 2:08 - 2:10
    the API server might hit a database,
  • 2:10 - 2:13
    and you go all the way through
    to the back of the stack
  • 2:13 - 2:15
    and all the way back out
  • 2:15 - 2:16
    so you know that everything is working
    end to end.
  • 2:16 - 2:19
    But you only know about it
    for that one request.
  • 2:19 - 2:22
    So in order to find out if your service
    is working,
  • 2:23 - 2:27
    from the end to end, for every single
    request,
  • 2:27 - 2:30
    this requires whitebox instrumentation.
  • 2:30 - 2:34
    So, basically, every event that happens
    inside your software,
  • 2:34 - 2:37
    inside a serving stack,
  • 2:37 - 2:40
    gets collected and gets counted,
  • 2:40 - 2:43
    so you know that every request hits
    the load balancer,
  • 2:43 - 2:46
    every request hits your application
    service,
  • 2:46 - 2:47
    every request hits the database.
  • 2:48 - 2:51
    You know that everything matches up
  • 2:51 - 2:56
    and this is called whitebox, or
    metrics-based monitoring.
  • 2:56 - 2:58
    There are different examples of, like,
  • 2:58 - 3:02
    the kind of software that does blackbox
    and whitebox monitoring.
  • 3:03 - 3:07
    So you have software like Nagios that
    you can configure checks
  • 3:09 - 3:10
    or Pingdom,
  • 3:10 - 3:12
    Pingdom will do pings of your website.
  • 3:13 - 3:15
    And then there is metrics-based monitoring,
  • 3:16 - 3:19
    things like Prometheus, things like
    the TICK stack from InfluxData,
  • 3:20 - 3:23
    New Relic and other commercial solutions
  • 3:23 - 3:25
    but of course I like to talk about
    the open source solutions.
  • 3:26 - 3:28
    We're gonna talk a little bit about
    Prometheus.
  • 3:29 - 3:32
    Prometheus came out of the idea that
  • 3:32 - 3:38
    we needed a monitoring system that could
    collect all this whitebox metric data
  • 3:38 - 3:41
    and do something useful with it.
  • 3:41 - 3:43
    Not just give us a pretty graph, but
    we also want to be able to
  • 3:43 - 3:44
    alert on it.
  • 3:44 - 3:46
    So we needed both
  • 3:50 - 3:54
    a data gathering and an analytics system
    in the same instance.
  • 3:54 - 3:59
    To do this, we built this thing and
    we looked at the way that
  • 3:59 - 4:02
    data was being generated
    by the applications
  • 4:02 - 4:05
    and there are advantages and
    disadvantages to this
  • 4:05 - 4:07
    push vs. pull model for metrics.
  • 4:07 - 4:10
    We decided to go with the pulling model
  • 4:10 - 4:14
    because there is some slight advantages
    for pulling over pushing.
  • 4:16 - 4:18
    With pulling, you get this free
    blackbox check
  • 4:18 - 4:20
    that the application is running.
  • 4:21 - 4:24
    When you pull your application, you know
    that the process is running.
  • 4:25 - 4:28
    If you are doing push-based, you can't
    tell the difference between
  • 4:28 - 4:32
    your application doing no work and
    your application not running.
  • 4:32 - 4:34
    So you don't know if it's stuck,
  • 4:34 - 4:38
    or if it just doesn't have any work to do.
  • 4:43 - 4:49
    With pulling, the pulling system knows
    the state of your network.
  • 4:50 - 4:53
    If you have a defined set of services,
  • 4:53 - 4:57
    that inventory drives what should be there.
  • 4:58 - 5:00
    Again, it's like the disappearing-process question:
  • 5:00 - 5:04
    is the process dead, or is it just
    not doing anything?
  • 5:04 - 5:07
    With polling, you know for a fact
    what processes should be there,
  • 5:08 - 5:11
    and it's a bit of an advantage there.
  • 5:11 - 5:13
    With pulling, there's really easy testing.
  • 5:13 - 5:16
    With push-based metrics, you have to
    figure out
  • 5:17 - 5:19
    if you want to test a new version of
    the monitoring system or
  • 5:19 - 5:21
    you want to test something new,
  • 5:21 - 5:24
    you have to split off a copy of the data.
  • 5:24 - 5:28
    With pulling, you can just set up
    another instance of your monitoring
  • 5:28 - 5:29
    and just test it.
  • 5:30 - 5:31
    Or you don't even have,
  • 5:31 - 5:33
    it doesn't even have to be monitoring,
    you can just use curl
  • 5:33 - 5:35
    to pull the metrics endpoint.
  • 5:38 - 5:40
    It's significantly easier to test.
  • 5:40 - 5:43
    The other thing with the…
  • 5:46 - 5:48
    The other nice thing is that
    the client is really simple.
  • 5:48 - 5:51
    The client doesn't have to know
    where the monitoring system is.
  • 5:51 - 5:54
    It doesn't have to know about HA
  • 5:54 - 5:56
    It just has to sit and collect the data
    about itself.
  • 5:56 - 5:59
    So it doesn't have to know anything about
    the topology of the network.
  • 5:59 - 6:03
    As an application developer, if you're
    writing a DNS server or
  • 6:04 - 6:06
    some other piece of software,
  • 6:06 - 6:10
    you don't have to know anything about
    monitoring software,
  • 6:10 - 6:12
    you can just implement it inside
    your application and
  • 6:13 - 6:17
    the monitoring software, whether it's
    Prometheus or something else,
  • 6:17 - 6:19
    can just come and collect that data for you.
  • 6:20 - 6:24
    That's kind of similar to a very old
    monitoring system called SNMP,
  • 6:24 - 6:29
    but SNMP has a significantly less friendly
    data model for developers.
  • 6:30 - 6:34
    This is the basic layout
    of a Prometheus server.
  • 6:34 - 6:36
    At the core, there's a Prometheus server
  • 6:36 - 6:40
    and it deals with all the data collection
    and analytics.
  • 6:43 - 6:47
    Basically, it's this one binary,
    all written in Go.
  • 6:47 - 6:49
    It's a single binary.
  • 6:49 - 6:51
    It knows how to read from your inventory,
  • 6:51 - 6:53
    there's a bunch of different methods,
    whether you've got
  • 6:53 - 6:59
    a Kubernetes cluster or a cloud platform
  • 7:00 - 7:04
    or you have your own customized thing
    with Ansible.
  • 7:05 - 7:10
    Ansible can take your layout, drop that
    into a config file and
  • 7:11 - 7:12
    Prometheus can pick that up.
  • 7:16 - 7:19
    Once it has the layout, it goes out and
    collects all the data.
  • 7:19 - 7:24
    It has a storage and a time series
    database to store all that data locally.
  • 7:24 - 7:28
    It has a thing called PromQL, which is
    a query language designed
  • 7:28 - 7:31
    for metrics and analytics.
  • 7:32 - 7:37
    From that PromQL, you can add frontends,
  • 7:37 - 7:39
    whether it's a simple API client
    to run reports,
  • 7:40 - 7:43
    you can use things like Grafana
    for creating dashboards,
  • 7:43 - 7:45
    it's got a simple web UI built in.
  • 7:45 - 7:47
    You can plug in anything you want
    on that side.
  • 7:49 - 7:54
    And then, it also has the ability to
    continuously execute queries
  • 7:55 - 7:56
    called "recording rules"
  • 7:57 - 7:59
    and these recording rules have
    two different modes.
  • 7:59 - 8:02
    You can either record, you can take
    a query
  • 8:02 - 8:04
    and it will generate new data
    from that query
  • 8:04 - 8:07
    or you can take a query, and
    if it returns results,
  • 8:07 - 8:09
    it will fire an alert.
  • 8:09 - 8:13
    That alert is a push message
    to the alert manager.
  • 8:13 - 8:19
    This allows us to separate the generating
    of alerts from the routing of alerts.
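
As a rough sketch of what those two rule types look like in a rule file (rule names, expressions, and thresholds here are illustrative, not taken from the talk):

    groups:
      - name: example
        rules:
          # Recording rule: continuously evaluate a query and store
          # the result as a new time series.
          - record: instance:node_cpu_utilisation:rate5m
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

          # Alerting rule: if the query returns results, fire an alert
          # that gets pushed to the alert manager.
          - alert: HighCPUUtilisation
            expr: instance:node_cpu_utilisation:rate5m > 0.9
            for: 15m
            labels:
              severity: warning
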
  • 8:19 - 8:24
    You can have one or hundreds of Prometheus
    services, all generating alerts
  • 8:25 - 8:29
    and it goes into an alert manager cluster
    which does the deduplication
  • 8:29 - 8:31
    and the routing to the human
  • 8:31 - 8:34
    because, of course, the thing
    that we want is
  • 8:35 - 8:39
    we had dashboards with graphs, but
    in order to find out if something is broken
  • 8:39 - 8:41
    you had to have a human
    looking at the graph.
  • 8:41 - 8:43
    With Prometheus, we don't have to do that
    anymore,
  • 8:43 - 8:48
    we can simply let the software tell us
    that we need to go investigate
  • 8:48 - 8:49
    our problems.
  • 8:49 - 8:51
    We don't have to sit there and
    stare at dashboards all day,
  • 8:51 - 8:52
    because that's really boring.
  • 8:55 - 8:58
    What does it look like to actually
    get data into Prometheus?
  • 8:58 - 9:02
    This is a very basic output
    of a Prometheus metric.
  • 9:03 - 9:04
    This is a very simple thing.
  • 9:04 - 9:08
    If you know much about
    the Linux kernel,
  • 9:07 - 9:13
    the Linux kernel tracks, in /proc/stat,
    the state of all the CPUs
  • 9:13 - 9:14
    in your system
  • 9:15 - 9:18
    and we express this by having
    the name of the metric, which is
  • 9:22 - 9:26
    'node_cpu_seconds_total' and so
    this is a self-describing metric,
  • 9:27 - 9:28
    like you can just read the metrics name
  • 9:29 - 9:31
    and you understand a little bit about
    what's going on here.
  • 9:33 - 9:39
    The Linux kernel and other kernels track
    their usage by the number of seconds
  • 9:39 - 9:41
    spent doing different things and
  • 9:41 - 9:47
    that could be, whether it's in system or
    user space or IRQs
  • 9:47 - 9:49
    or iowait or idle.
  • 9:49 - 9:51
    Actually, the kernel tracks how much
    idle time it has.
  • 9:54 - 9:55
    It also tracks it by the number of CPUs.
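
For reference, the exposition text for that metric looks roughly like this (the values are invented; the label set is the one the node exporter uses):

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 187093.21
    node_cpu_seconds_total{cpu="0",mode="user"} 2384.56
    node_cpu_seconds_total{cpu="0",mode="system"} 911.42
    node_cpu_seconds_total{cpu="0",mode="iowait"} 31.88
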
  • 9:56 - 10:00
    With other monitoring systems, they used
    to do this with a tree structure
  • 10:01 - 10:04
    and this caused a lot of problems,
    for like
  • 10:04 - 10:09
    how do you mix and match data. So,
    by switching from
  • 10:10 - 10:12
    a tree structure to a tag-based structure,
  • 10:13 - 10:17
    we can do some really interesting
    powerful data analytics.
  • 10:18 - 10:25
    Here's a nice example of taking
    those CPU seconds counters
  • 10:26 - 10:30
    and then converting them into a graph
    by using PromQL.
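
The kind of PromQL behind such a graph is roughly this (a sketch, not the exact query from the slide):

    # Per-instance CPU usage in CPU-seconds per second, ignoring idle time
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
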
  • 10:33 - 10:35
    Now we can get into
    Metrics-Based Alerting.
  • 10:35 - 10:38
    Now we have this graph, we have this thing
  • 10:38 - 10:39
    we can look and see here
  • 10:40 - 10:43
    "Oh there is some little spike here,
    we might want to know about that."
  • 10:43 - 10:46
    Now we can get into Metrics-Based
    Alerting.
  • 10:46 - 10:51
    I used to be a site reliability engineer,
    I'm still a site reliability engineer at heart
  • 10:52 - 11:00
    and we have this concept of the things that
    you need to run a site or a service reliably.
  • 11:01 - 11:03
    The most important thing you need is
    down at the bottom,
  • 11:04 - 11:07
    Monitoring, because if you don't have
    monitoring of your service,
  • 11:07 - 11:09
    how do you know it's even working?
  • 11:12 - 11:15
    There's a couple of techniques here, and
    we want to alert based on data
  • 11:16 - 11:18
    and not just those end to end tests.
  • 11:19 - 11:23
    There's a couple of techniques, a thing
    called the RED method
  • 11:24 - 11:25
    and there's a thing called the USE method
  • 11:26 - 11:28
    and there are a couple of nice
    blog posts about this
  • 11:29 - 11:31
    and basically it defines that, for example,
  • 11:31 - 11:35
    the RED method talks about the requests
    that your system is handling
  • 11:36 - 11:38
    There are three things:
  • 11:38 - 11:40
    There's the number of requests, there's
    the number of errors
  • 11:40 - 11:42
    and there's how long it takes, the duration.
  • 11:43 - 11:45
    With the combination of these three things
  • 11:45 - 11:48
    you can determine most of
    what your users see
  • 11:49 - 11:54
    "Did my request go through? Did it
    return an error? Was it fast?"
  • 11:55 - 11:58
    Most people, that's all they care about.
  • 11:58 - 12:02
    "I made a request to a website and
    it came back and it was fast."
  • 12:05 - 12:07
    It's a very simple method of just, like,
  • 12:07 - 12:10
    those are the important things to
    determine if your site is healthy.
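
A minimal PromQL sketch of the RED method, assuming the application exports a histogram named http_request_duration_seconds with a 'code' label (a common convention, not something named in the talk):

    # Rate: requests per second
    sum(rate(http_request_duration_seconds_count[5m]))

    # Errors: requests per second that returned a 5xx status
    sum(rate(http_request_duration_seconds_count{code=~"5.."}[5m]))

    # Duration: 95th percentile request latency
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
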
  • 12:12 - 12:17
    But we can go back to some more
    traditional, sysadmin style alerts
  • 12:17 - 12:21
    this is basically taking the filesystem
    available space,
  • 12:21 - 12:27
    divided by the filesystem size, that becomes
    the ratio of filesystem availability
  • 12:27 - 12:28
    from 0 to 1.
  • 12:28 - 12:31
    Multiply it by 100, we now have
    a percentage
  • 12:31 - 12:36
    and if it's less than or equal to 1%
    for 15 minutes,
  • 12:36 - 12:42
    this is less than 1% space, we should tell
    a sysadmin to go check
  • 12:42 - 12:44
    to find out why the filesystem
    has filled up.
  • 12:45 - 12:46
    It's super nice and simple.
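
Written out as an alerting rule, that check might look something like this (the metric names are the node exporter's current ones; the alert name, severity, and summary text are illustrative):

    - alert: FilesystemAlmostFull
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Less than 1% space left on {{ $labels.device }} at {{ $labels.instance }}"
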
  • 12:46 - 12:50
    We can also tag, we can include…
  • 12:51 - 12:58
    Every alert includes all the extraneous
    labels that Prometheus adds to your metrics
  • 12:59 - 13:05
    When you add a metric in Prometheus, if
    we go back and we look at this metric.
  • 13:06 - 13:11
    This metric only contains information
    about the internals of the application
  • 13:13 - 13:15
    anything about, like, what server it's on,
    is it running in a container,
  • 13:15 - 13:19
    what cluster does it come from,
    what continent is it on,
  • 13:18 - 13:22
    that's all extra annotations that are
    added by the Prometheus server
  • 13:23 - 13:24
    at discovery time.
  • 13:25 - 13:28
    Unfortunately I don't have a good example
    of what those labels look like
  • 13:29 - 13:34
    but every metric gets annotated
    with location information.
  • 13:37 - 13:41
    That location information also comes through
    as labels in the alert
  • 13:41 - 13:48
    so, if you have a message coming
    into your alert manager,
  • 13:48 - 13:50
    the alert manager can look and go
  • 13:50 - 13:52
    "Oh, that's coming from this datacenter"
  • 13:52 - 13:59
    and it can include that in the email or
    IRC message or SMS message.
  • 13:59 - 14:01
    So you can include
  • 13:59 - 14:04
    "Filesystem is out of space on this host
    from this datacenter"
  • 14:05 - 14:07
    All these labels get passed through and
    then you can append
  • 14:07 - 14:13
    "severity: critical" to that alert and
    include that in the message to the human
  • 14:14 - 14:17
    because of course, this is how you define…
  • 14:17 - 14:21
    Getting the message from the monitoring
    to the human.
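
On the alert manager side, the routing the speaker describes might be configured along these lines (receiver names, addresses, and URLs are placeholders):

    route:
      group_by: [alertname, datacenter]
      receiver: team-email
      routes:
        # Critical alerts additionally go to the on-call pager.
        - match:
            severity: critical
          receiver: oncall-pager

    receivers:
      - name: team-email
        email_configs:
          - to: ops@example.org
      - name: oncall-pager
        webhook_configs:
          - url: http://pager-gateway.example.org/alert
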
  • 14:22 - 14:24
    You can even include nice things like,
  • 14:24 - 14:28
    if you've got documentation, you can
    include a link to the documentation
  • 14:28 - 14:29
    as an annotation
  • 14:29 - 14:33
    and the alert manager can take that
    basic url and, you know,
  • 14:33 - 14:37
    massage it into whatever it needs
    to look like to actually get
  • 14:37 - 14:40
    the operator to the correct documentation.
  • 14:42 - 14:43
    We can also do more fun things:
  • 14:44 - 14:46
    since we actually are not just checking
  • 14:46 - 14:49
    what is the space right now,
    we're tracking data over time,
  • 14:49 - 14:51
    we can use 'predict_linear'.
  • 14:52 - 14:55
    'predict_linear' just takes and does
    a simple linear regression.
  • 14:56 - 15:00
    This example takes the filesystem
    available space over the last hour and
  • 15:01 - 15:02
    does a linear regression.
  • 15:03 - 15:09
    Prediction says "Well, it's going that way
    and four hours from now,
  • 15:09 - 15:13
    based on one hour of history, it's gonna
    be less than 0, which means full".
  • 15:14 - 15:21
    We know that within the next four hours,
    the disc is gonna be full
  • 15:21 - 15:25
    so we can tell the operator ahead of time
    that it's gonna be full
  • 15:25 - 15:27
    and not just tell them that it's full
    right now.
  • 15:27 - 15:32
    They have some window of ability
    to fix it before it fails.
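
A sketch of that predict_linear rule, with one hour of history and four hours of lookahead as described (the alert name and severity are made up):

    - alert: FilesystemPredictedFull
      # Linear regression over the last hour, extrapolated 4 hours ahead
      expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
      labels:
        severity: warning
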
  • 15:33 - 15:35
    This is really important because
    if you're running a site
  • 15:36 - 15:41
    you want to be able to have alerts
    that tell you that your system is failing
  • 15:42 - 15:43
    before it actually fails.
  • 15:44 - 15:48
    Because if it fails, you're out of SLO
    or SLA and
  • 15:48 - 15:50
    your users are gonna be unhappy
  • 15:51 - 15:52
    and you don't want the users to tell you
    that your site is down
  • 15:53 - 15:55
    you want to know about it before
    your users can even tell.
  • 15:55 - 15:58
    This allows you to do that.
  • 15:59 - 16:02
    And also of course, Prometheus being
    a modern system,
  • 16:03 - 16:06
    we fully support UTF-8 in all of our labels.
  • 16:08 - 16:12
    Here's another one, here's a good example
    from the USE method.
  • 16:12 - 16:16
    This is a rate of 500 errors coming from
    an application
  • 16:16 - 16:18
    and you can simply alert that
  • 16:18 - 16:23
    there's more than 500 errors per second
    coming out of the application
  • 16:23 - 16:26
    if that's your threshold for pain
  • 16:26 - 16:27
    And you can do other things,
  • 16:28 - 16:29
    you can convert that from just
    a rate of errors
  • 16:30 - 16:31
    to a percentage of errors.
  • 16:31 - 16:33
    So you could say
  • 16:33 - 16:37
    "I have an SLA of 3 9" and so you can say
  • 16:38 - 16:47
    "If the rate of errors divided by the rate
    of requests is .01,
  • 16:47 - 16:49
    or is more than .01, then
    that's a problem."
  • 16:50 - 16:55
    You can include that level of
    error granularity.
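
As a PromQL sketch of that ratio (metric and label names are assumed, not from the talk):

    # More than 1% of requests are errors over the last 5 minutes
      sum(rate(http_requests_total{code=~"5.."}[5m]))
    /
      sum(rate(http_requests_total[5m]))
    > 0.01
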
  • 16:55 - 16:58
    And if you're just doing a blackbox test,
  • 16:58 - 17:04
    you wouldn't know this. You would only know
    if you got an error from the system,
  • 17:04 - 17:06
    then you got another error from the system
  • 17:06 - 17:07
    then you fire an alert.
  • 17:07 - 17:12
    But if those checks are one minute apart
    and you're serving 1000 requests per second
  • 17:13 - 17:21
    you could be serving 10,000 errors before
    you even get an alert.
  • 17:22 - 17:23
    And you might miss it, because
  • 17:23 - 17:25
    what if you only get one random error
  • 17:25 - 17:29
    and then the next time, you're serving
    25% errors,
  • 17:29 - 17:32
    you only have a 25% chance of that check
    failing again.
  • 17:32 - 17:36
    You really need these metrics in order
    to get
  • 17:36 - 17:39
    proper reports of the status of your system
  • 17:43 - 17:44
    There are even options:
  • 17:44 - 17:46
    You can slice and dice those labels.
  • 17:46 - 17:50
    If you have a label on all of
    your applications called 'service'
  • 17:50 - 17:53
    you can send that 'service' label through
    to the message
  • 17:54 - 17:56
    and you can say
    "Hey, this service is broken".
  • 17:56 - 18:00
    You can include that service label
    in your alert messages.
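
For example, aggregating that same error ratio by the (assumed) 'service' label keeps that label on the result, so it flows through into the alert message:

      sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
    /
      sum by (service) (rate(http_requests_total[5m]))
    > 0.01
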
  • 18:01 - 18:07
    And that's it, I can go to a demo and Q&A.
  • 18:10 - 18:14
    [Applause]
  • 18:17 - 18:18
    Any questions so far?
  • 18:19 - 18:20
    Or anybody want to see a demo?
  • 18:30 - 18:35
    [Q] Hi. Does Prometheus do metric
    discovery inside containers
  • 18:35 - 18:37
    or do I have to implement the metrics
    myself?
  • 18:38 - 18:46
    [A] For metrics in containers, there are
    already things that expose
  • 18:46 - 18:49
    the metrics of the container system
    itself.
  • 18:50 - 18:52
    There's a utility called 'cAdvisor' and
  • 18:52 - 18:57
    cAdvisor takes the Linux cgroup data
    and exposes it as metrics
  • 18:57 - 19:01
    so you can get data about
    how much CPU time is being
  • 19:01 - 19:02
    spent in your container,
  • 19:03 - 19:04
    how much memory is being spent
    by your container.
  • 19:05 - 19:08
    [Q] But not about the application,
    just about the container usage?
  • 19:09 - 19:11
    [A] Right. Because the container
    has no idea
  • 19:12 - 19:15
    whether your application is written
    in Ruby or Go or Python or whatever,
  • 19:19 - 19:22
    you have to build that into
    your application in order to get the data.
  • 19:24 - 19:24
    So for Prometheus,
  • 19:28 - 19:35
    we've written client libraries that can be
    included in your application directly
  • 19:35 - 19:36
    so you can get that data out.
  • 19:37 - 19:41
    If you go to the Prometheus website,
    we have a whole series of client libraries
  • 19:45 - 19:49
    and we cover a pretty good selection
    of popular software.
  • 19:57 - 20:00
    [Q] What is the current state of
    long-term data storage?
  • 20:01 - 20:02
    [A] Very good question.
  • 20:03 - 20:05
    There's been several…
  • 20:05 - 20:07
    There's actually several different methods
    of doing this.
  • 20:10 - 20:15
    Prometheus stores all this data locally
    in its own data storage
  • 20:15 - 20:16
    on the local disk.
  • 20:17 - 20:19
    But that's only as durable as
    that server is durable.
  • 20:19 - 20:22
    So if you've got a really durable server,
  • 20:22 - 20:23
    you can store as much data as you want,
  • 20:24 - 20:27
    you can store years and years of data
    locally on the Prometheus server.
  • 20:27 - 20:28
    That's not a problem.
  • 20:29 - 20:32
    There's a bunch of misconceptions because
    of our default
  • 20:32 - 20:34
    and the language on our website said
  • 20:35 - 20:36
    "It's not long-term storage"
  • 20:37 - 20:42
    simply because we leave that problem
    up to the person running the server.
  • 20:43 - 20:46
    But the time series database
    that Prometheus includes
  • 20:47 - 20:48
    is actually quite durable.
  • 20:49 - 20:51
    But it's only as durable as the server
    underneath it.
  • 20:52 - 20:55
    So if you've got a very large cluster and
    you want really high durability,
  • 20:56 - 20:58
    you need to have some kind of
    cluster software,
  • 20:58 - 21:01
    but because we want Prometheus to be
    simple to deploy
  • 21:02 - 21:03
    and very simple to operate
  • 21:03 - 21:07
    and also very robust.
  • 21:07 - 21:09
    We didn't want to include any clustering
    in Prometheus itself,
  • 21:10 - 21:12
    because anytime you have a clustered
    software,
  • 21:12 - 21:15
    what happens if your network is
    a little wonky.
  • 21:16 - 21:19
    The first thing that goes down is
    all of your distributed systems fail.
  • 21:20 - 21:23
    And building distributed systems to be
    really robust is really hard
  • 21:23 - 21:29
    so Prometheus is what we call an
    "uncoordinated distributed system".
  • 21:29 - 21:34
    If you've got two Prometheus servers
    monitoring all your targets in an HA mode
  • 21:34 - 21:37
    in a cluster, and there's a split brain,
  • 21:37 - 21:40
    each Prometheus can see
    half of the cluster and
  • 21:41 - 21:44
    it can see that the other half
    of the cluster is down.
  • 21:44 - 21:47
    They can both try to get alerts out
    to the alert manager
  • 21:47 - 21:50
    and this is a really really robust way of
    handling split brains
  • 21:51 - 21:54
    and bad network failures and bad problems
    in a cluster.
  • 21:54 - 21:57
    It's designed to be super super robust
  • 21:57 - 22:00
    and so the two individual
    Prometheus servers in your cluster
  • 22:00 - 22:02
    don't have to talk to each other
    to do this,
  • 22:02 - 22:04
    they can just do it independently.
  • 22:04 - 22:07
    But if you want to be able
    to correlate data
  • 22:08 - 22:09
    between many different Prometheus servers
  • 22:09 - 22:12
    you need an external data storage
    to do this.
  • 22:13 - 22:15
    And also you may not have
    very big servers,
  • 22:15 - 22:17
    you might be running your Prometheus
    in a container
  • 22:17 - 22:19
    and it's only got a little bit of local
    storage space
  • 22:20 - 22:23
    so you want to send all that data up
    to a big cluster datastore
  • 22:23 - 22:25
    for a bigger use
  • 22:26 - 22:28
    We have several different ways of
    doing this.
  • 22:28 - 22:31
    There's the classic way which is called
    federation
  • 22:31 - 22:35
    where you have one Prometheus server
    polling in summary data from
  • 22:35 - 22:37
    each of the individual Prometheus servers
  • 22:37 - 22:40
    and this is useful if you want to run
    alerts against data coming
  • 22:40 - 22:42
    from multiple Prometheus servers.
  • 22:42 - 22:44
    But federation is not replication.
  • 22:45 - 22:47
    It can only pull a little bit of data from
    each Prometheus server.
  • 22:48 - 22:51
    If you've got a million metrics on
    each Prometheus server,
  • 22:52 - 22:56
    you can't poll in a million metrics
    and do…
  • 22:56 - 22:59
    If you've got 10 of those, you can't
    poll in 10 million metrics
  • 22:59 - 23:01
    simultaneously into one Prometheus
    server.
  • 23:01 - 23:02
    It's just too much data.
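
A federation scrape job looks roughly like this (the match[] selector shown pulls only aggregated 'job:'-prefixed series; the hostnames are placeholders):

    scrape_configs:
      - job_name: federate
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'
        static_configs:
          - targets:
              - prometheus-dc1.example.org:9090
              - prometheus-dc2.example.org:9090
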
  • 23:03 - 23:06
    There are two other
    nice options.
  • 23:07 - 23:09
    There's a piece of software called
    Cortex.
  • 23:09 - 23:16
    Cortex is a Prometheus server that
    stores its data in a database.
  • 23:17 - 23:19
    Specifically, a distributed database.
  • 23:19 - 23:24
    Things that are based on the Google
    Bigtable model, like Cassandra or…
  • 23:26 - 23:27
    What's the Amazon one?
  • 23:30 - 23:33
    Yeah.
  • 23:33 - 23:34
    DynamoDB.
  • 23:34 - 23:37
    If you have a DynamoDB or a Cassandra
    cluster, or one of these other
  • 23:37 - 23:39
    really big distributed storage clusters,
  • 23:40 - 23:45
    Cortex can run and the Prometheus servers
    will stream their data up to Cortex
  • 23:45 - 23:49
    and it will keep a copy of that across
    all of your Prometheus servers.
  • 23:50 - 23:51
    And because it's based on things
    like Cassandra,
  • 23:52 - 23:53
    it's super scalable.
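
On each Prometheus server, streaming data up to something like Cortex is just a remote_write entry (the URL is a placeholder; the exact path depends on your Cortex setup):

    remote_write:
      - url: http://cortex.example.org/api/prom/push
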
  • 23:53 - 23:58
    But it's a little complex to run and
  • 23:58 - 24:01
    many people don't want to run that
    complex infrastructure.
  • 24:01 - 24:06
    We have another new one, we just blogged
    about it yesterday.
  • 24:02 - 24:07
    It's a thing called Thanos.
  • 24:07 - 24:11
    Thanos is Prometheus at scale.
  • 24:11 - 24:12
    Basically, the way it works…
  • 24:13 - 24:15
    Actually, why don't I bring that up?
  • 24:24 - 24:31
    This was developed by a company
    called Improbable
  • 24:31 - 24:33
    and they wanted to…
  • 24:35 - 24:40
    They had billions of metrics coming from
    hundreds of Prometheus servers.
  • 24:41 - 24:47
    They developed this in collaboration with
    the Prometheus team to build
  • 24:47 - 24:49
    a super highly scalable Prometheus server.
  • 24:50 - 24:56
    Prometheus itself stores the incoming
    metrics data in a write ahead log
  • 24:56 - 25:00
    and then every two hours, it creates
    a compaction cycle
  • 25:00 - 25:03
    and it creates an immutable time series block
    of data which is
  • 25:04 - 25:07
    all the time series blocks themselves
  • 25:07 - 25:10
    and then an index into that data.
  • 25:11 - 25:14
    Those two-hour windows are all immutable
  • 25:14 - 25:16
    so what Thanos does,
    it has a little sidecar binary that
  • 25:16 - 25:19
    watches for those new directories and
  • 25:19 - 25:21
    uploads them into a blob store.
  • 25:21 - 25:26
    So you could put them in S3 or Minio or
    some other simple object storage.
  • 25:26 - 25:33
    And then now you have all of your data,
    all of this index data already
  • 25:33 - 25:35
    ready to go
  • 25:35 - 25:38
    and then the final sidecar creates
    a little mesh cluster that can read from
  • 25:38 - 25:40
    all of those S3 blocks.
  • 25:40 - 25:48
    Now, you have this super global view
    all stored in a big bucket storage and
  • 25:50 - 25:52
    things like S3 or Minio are…
  • 25:53 - 25:58
    Bucket storage is not a database, so it's
    operationally a little easier to operate.
  • 25:58 - 26:02
    Plus, now we have all this data in
    a bucket store and
  • 26:03 - 26:06
    the Thanos sidecars can talk to each other.
  • 26:07 - 26:08
    We can now have a single entry point.
  • 26:08 - 26:12
    You can query Thanos and Thanos will
    distribute your query
  • 26:12 - 26:14
    across all your Prometheus servers.
  • 26:14 - 26:16
    So now you can do global queries across
    all of your servers.
  • 26:18 - 26:22
    But it's very new, they just released
    their first release candidate yesterday.
  • 26:24 - 26:27
    It is looking to be like
    the coolest thing ever
  • 26:27 - 26:29
    for running large scale Prometheus.
  • 26:30 - 26:35
    Here's an example of how that is laid out.
  • 26:37 - 26:39
    This will let you have
    a billion-metric Prometheus cluster.
  • 26:43 - 26:44
    And it's got a bunch of other
    cool features.
  • 26:45 - 26:47
    Any more questions?
  • 26:55 - 26:57
    Alright, maybe I'll do
    a quick little demo.
  • 27:05 - 27:11
    Here is a Prometheus server that is
    provided by this group
  • 27:11 - 27:14
    that just does an Ansible deployment
    for Prometheus.
  • 27:15 - 27:20
    And you can just simply query
    for something like 'node_cpu'.
  • 27:21 - 27:23
    This is actually the old name for
    that metric.
  • 27:24 - 27:26
    And you can see, here's exactly
  • 27:28 - 27:31
    the CPU metrics from some servers.
  • 27:33 - 27:35
    It's just a bunch of stuff.
  • 27:35 - 27:37
    There's actually two servers here,
  • 27:37 - 27:41
    there's an influx cloud alchemy and
    there is a demo cloud alchemy.
  • 27:42 - 27:44
    [Q] Can you zoom in?
    [A] Oh yeah sure.
  • 27:53 - 27:58
    So you can see all the extra labels.
  • 28:00 - 28:02
    We can also do some things like…
  • 28:02 - 28:04
    Let's take a look at, say,
    the last 30 seconds.
  • 28:05 - 28:07
    We can just add this little time window.
  • 28:08 - 28:11
    It's called a range request,
    and you can see
  • 28:11 - 28:12
    the individual samples.
  • 28:13 - 28:15
    You can see that all Prometheus is doing
  • 28:15 - 28:18
    is storing the sample and a timestamp.
  • 28:18 - 28:23
    All the timestamps are in milliseconds
    and it's all epoch
  • 28:23 - 28:25
    so it's super easy to manipulate.
  • 28:26 - 28:30
    But, looking at the individual samples and
    looking at this, you can see that
  • 28:30 - 28:36
    if we go back and just take…
    and look at the raw data, and
  • 28:36 - 28:38
    we graph the raw data…
  • 28:40 - 28:43
    Oops, that's a syntax error.
  • 28:44 - 28:47
    And we look at this graph…
    Come on.
  • 28:47 - 28:48
    Here we go.
  • 28:48 - 28:50
    Well, that's kind of boring, it's just
    a flat line because
  • 28:51 - 28:53
    it's just a counter going up very slowly.
  • 28:53 - 28:56
    What we really want to do, is we want to
    take, and we want to apply
  • 28:57 - 28:59
    a rate function to this counter.
  • 29:00 - 29:04
    So let's look at the rate over
    the last one minute.
  • 29:04 - 29:07
    There we go, now we get
    a nice little graph.
  • 29:08 - 29:14
    And so you can see that this is
    0.6 CPU seconds per second
  • 29:15 - 29:18
    for that set of labels.
  • 29:19 - 29:21
    But this is pretty noisy, there's a lot
    of lines on this graph and
  • 29:21 - 29:23
    there's still a lot of data here.
  • 29:23 - 29:26
    So let's start doing some filtering.
  • 29:26 - 29:29
    One of the things we see here is,
    well, there's idle.
  • 29:30 - 29:32
    We don't really care about
    the machine being idle,
  • 29:33 - 29:35
    so let's just add a label filter
    so we can say
  • 29:36 - 29:42
    'mode', it's the label name, and it's not
    equal to 'idle'. Done.
  • 29:45 - 29:48
    And if I could type…
    What did I miss?
  • 29:51 - 29:51
    Here we go.
  • 29:51 - 29:54
    So now we've removed idle from the graph.
  • 29:54 - 29:56
    That looks a little more sane.
  • 29:57 - 30:01
    Oh, wow, look at that, that's a nice
    big spike in user space on the influx server
  • 30:01 - 30:02
    Okay…
  • 30:04 - 30:05
    Well, that's pretty cool.
  • 30:06 - 30:06
    What about…
  • 30:07 - 30:09
    This is still quite a lot of lines.
  • 30:09 - 30:14
    We can just sum up that rate.
  • 30:11 - 30:14
    How much CPU is in use total across
    all the servers that we have.
  • 30:14 - 30:24
    We can just see that there is
    a sum total of 0.6 CPU seconds/s
  • 30:25 - 30:28
    across the servers we have.
  • 30:28 - 30:31
    But that's a little too coarse.
  • 30:32 - 30:37
    What if we want to see it by instance?
  • 30:39 - 30:42
    Now, we can see the two servers,
    we can see
  • 30:43 - 30:45
    that we're left with just that label.
  • 30:46 - 30:50
    The influx labels are the influx instance
    and the influx demo.
  • 30:50 - 30:53
    That's a super easy way to see that,
  • 30:54 - 30:57
    but we can also do this
    the other way around.
  • 30:57 - 31:03
    We can say 'without (mode,cpu)' so
    we can drop those modes and
  • 31:03 - 31:05
    see all the labels that we have.
  • 31:05 - 31:12
    We can still see the environment label
    and the job label on our list data.
  • 31:12 - 31:16
    You can go either way
    with the summary functions.
  • 31:16 - 31:20
    There's a whole bunch of different functions
  • 31:21 - 31:23
    and it's all in our documentation.
  • 31:25 - 31:30
    But what if we want to see it…
  • 31:31 - 31:34
    What if we want to see which CPUs
    are in use?
  • 31:34 - 31:37
    Now we can see that it's only CPU0
  • 31:37 - 31:40
    because apparently these are only
    1-core instances.
  • 31:42 - 31:47
    You can add/remove labels and do
    all these queries.
  • 31:50 - 31:52
    Any other questions so far?
  • 31:54 - 31:59
    [Q] I don't have a question, but I have
    something to add.
  • 31:59 - 32:03
    Prometheus is really nice, but it's
    a lot better if you combine it
  • 32:03 - 32:05
    with Grafana.
  • 32:05 - 32:06
    [A] Yes, yes.
  • 32:07 - 32:12
    In the beginning, when we were creating
    Prometheus, we actually built
  • 32:13 - 32:15
    a piece of dashboard software called
    PromDash.
  • 32:16 - 32:21
    It was a simple little Ruby on Rails app
    to create dashboards
  • 32:21 - 32:23
    and it had a bunch of JavaScript.
  • 32:23 - 32:24
    And then Grafana came out.
  • 32:25 - 32:26
    And we're like
  • 32:26 - 32:30
    "Oh, that's interesting. It doesn't support
    Prometheus" so we were like
  • 32:30 - 32:32
    "Hey, can you support Prometheus"
  • 32:32 - 32:34
    and they're like "Yeah, we've got
    a REST API, get the data, done"
  • 32:36 - 32:38
    Now Grafana supports Prometheus and
    we're like
  • 32:40 - 32:42
    "Well, promdash, this is crap, delete".
  • 32:44 - 32:46
    The Prometheus development team,
  • 32:46 - 32:49
    we're all backend developers
    and SREs and
  • 32:50 - 32:51
    we have no JavaScript skills at all.
  • 32:53 - 32:55
    So we're like "Let somebody deal
    with that".
  • 32:55 - 32:58
    One of the nice things about working on
    this kind of project is
  • 32:58 - 33:02
    we can do the things that we're good at
    and we don't try…
  • 33:02 - 33:05
    We don't have any marketing people,
    it's just an opensource project,
  • 33:06 - 33:09
    there's no single company behind Prometheus.
  • 33:10 - 33:14
    I work for GitLab, Improbable paid for
    the Thanos system,
  • 33:16 - 33:25
    other companies like Red Hat now pay
    people who used to work at CoreOS to
  • 33:25 - 33:27
    work on Prometheus.
  • 33:27 - 33:30
    There's lots and lots of collaboration
    between many companies
  • 33:30 - 33:33
    to build the Prometheus ecosystem.
  • 33:36 - 33:37
    But yeah, Grafana is great.
  • 33:39 - 33:45
    Actually, Grafana now has
    two full-time Prometheus developers.
  • 33:49 - 33:51
    Alright, that's it.
  • 33:53 - 33:57
    [Applause]
Title:
Metrics-Based Monitoring with Prometheus
Description:

Talk given by Ben Kochie at MiniDebConf Hamburg 2018
https://meetings-archive.debian.net/pub/debian-meetings/2018/miniconf-hamburg/2018-05-19/metrics_based_monitoring.webm

Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
34:03
