
Metrics-Based Monitoring with Prometheus

  • 0:06 - 0:11
    So, we had a talk by a non-GitLab person
    about GitLab.
  • 0:11 - 0:13
    Now, we have a talk by a GitLab person
    on non-GitLab.
  • 0:13 - 0:15
    Something like that?
  • 0:16 - 0:19
    The CCCHH hackerspace is now open,
  • 0:20 - 0:22
    from now on if you want to go there,
    that's the announcement.
  • 0:22 - 0:26
    And the next talk will be by Ben Kochie
  • 0:26 - 0:28
    on metrics-based monitoring
    with Prometheus.
  • 0:29 - 0:30
    Welcome.
  • 0:31 - 0:33
    [Applause]
  • 0:35 - 0:37
    Alright, so
  • 0:37 - 0:39
    my name is Ben Kochie
  • 0:40 - 0:44
    I work on DevOps features for GitLab
  • 0:44 - 0:48
    and apart from working for GitLab, I also work
    on the open source Prometheus project.
  • 0:51 - 0:54
    I live in Berlin and I've been using
    Debian since ???
  • 0:54 - 0:57
    yes, quite a long time.
  • 0:59 - 1:01
    So, what is Metrics-based Monitoring?
  • 1:03 - 1:05
    If you're running software in production,
  • 1:06 - 1:08
    you probably want to monitor it,
  • 1:08 - 1:11
    because if you don't monitor it, you don't
    know whether it's working right.
  • 1:13 - 1:16
    Monitoring breaks down into two categories:
  • 1:16 - 1:19
    there's blackbox monitoring and
    there's whitebox monitoring.
  • 1:20 - 1:25
    Blackbox monitoring is treating
    your software like a blackbox.
  • 1:25 - 1:27
    It's just checks to see, like,
  • 1:27 - 1:29
    is it responding, or does it ping
  • 1:30 - 1:34
    or ??? HTTP requests
  • 1:34 - 1:36
    [mic turned on]
  • 1:38 - 1:41
    Ah, there we go, much better.
  • 1:47 - 1:52
    So, blackbox monitoring is a probe,
  • 1:52 - 1:55
    it just kind of looks from the outside
    to your software
  • 1:55 - 1:57
    and it has no knowledge of the internals
  • 1:58 - 2:01
    and it's really good for end to end testing.
  • 2:01 - 2:04
    So if you've got a fairly complicated
    service,
  • 2:04 - 2:06
    you come in from the outside, you go
    through the load balancer,
  • 2:07 - 2:08
    you hit the API server,
  • 2:08 - 2:10
    the API server might hit a database,
  • 2:11 - 2:13
    and you go all the way through
    to the back of the stack
  • 2:13 - 2:15
    and all the way back out
  • 2:15 - 2:16
    so you know that everything is working
    end to end.
  • 2:17 - 2:19
    But you only know about it
    for that one request.
  • 2:19 - 2:22
    So in order to find out if your service
    is working,
  • 2:23 - 2:27
    from the end to end, for every single
    request,
  • 2:27 - 2:30
    this requires whitebox instrumentation.
  • 2:30 - 2:34
    So, basically, every event that happens
    inside your software,
  • 2:34 - 2:37
    inside a serving stack,
  • 2:37 - 2:40
    gets collected and gets counted,
  • 2:40 - 2:44
    so you know that every request hits
    the load balancer,
  • 2:45 - 2:46
    every request hits your application
    service,
  • 2:46 - 2:47
    every request hits the database.
  • 2:48 - 2:51
    You know that everything matches up
  • 2:51 - 2:56
    and this is called whitebox, or
    metrics-based monitoring.
  • 2:56 - 2:58
    There are different examples of, like,
  • 2:58 - 3:02
    the kind of software that does blackbox
    and whitebox monitoring.
  • 3:03 - 3:07
    So you have software like Nagios that
    you can configure checks
  • 3:09 - 3:10
    or Pingdom,
  • 3:10 - 3:12
    Pingdom will ping your website.
  • 3:13 - 3:15
    And then there is metrics-based monitoring,
  • 3:16 - 3:19
    things like Prometheus, things like
    the TICK stack from InfluxData,
  • 3:20 - 3:23
    New Relic and other commercial solutions
  • 3:23 - 3:25
    but of course I like to talk about
    the open source solutions.
  • 3:26 - 3:28
    We're gonna talk a little bit about
    Prometheus.
  • 3:29 - 3:32
    Prometheus came out of the idea that
  • 3:32 - 3:38
    we needed a monitoring system that could
    collect all this whitebox metric data
  • 3:38 - 3:41
    and do something useful with it.
  • 3:41 - 3:43
    Not just give us a pretty graph, but
    we also want to be able to
  • 3:43 - 3:44
    alert on it.
  • 3:44 - 3:46
    So we needed both
  • 3:50 - 3:54
    a data gathering and an analytics system
    in the same instance.
  • 3:54 - 3:59
    To do this, we built this thing and
    we looked at the way that
  • 3:59 - 4:02
    data was being generated
    by the applications
  • 4:02 - 4:05
    and there are advantages and
    disadvantages to this
  • 4:05 - 4:07
    push vs. poll model for metrics.
  • 4:07 - 4:10
    We decided to go with the polling model
  • 4:10 - 4:14
    because there is some slight advantages
    for polling over pushing.
  • 4:16 - 4:18
    With polling, you get this free
    blackbox check
  • 4:18 - 4:20
    that the application is running.
  • 4:21 - 4:24
    When you poll your application, you know
    that the process is running.
  • 4:25 - 4:28
    If you are doing push-based, you can't
    tell the difference between
  • 4:28 - 4:32
    your application doing no work and
    your application not running.
  • 4:32 - 4:34
    So you don't know if it's stuck,
  • 4:34 - 4:38
    or is it just not having to do any work.
  • 4:43 - 4:49
    With polling, the polling system knows
    the state of your network.
  • 4:50 - 4:53
    If you have a defined set of services,
  • 4:53 - 4:57
    that inventory drives what should be there.
  • 4:58 - 5:00
    Again, it's like the disappearing process question:
  • 5:00 - 5:04
    is the process dead, or is it just
    not doing anything?
  • 5:04 - 5:07
    With polling, you know for a fact
    what processes should be there,
  • 5:08 - 5:11
    and it's a bit of an advantage there.
  • 5:11 - 5:13
    With polling, there's really easy testing.
  • 5:13 - 5:16
    With push-based metrics, you have to
    figure out
  • 5:17 - 5:19
    if you want to test a new version of
    the monitoring system or
  • 5:19 - 5:21
    you want to test something new,
  • 5:21 - 5:24
    you have to ??? a copy of the data.
  • 5:24 - 5:28
    With polling, you can just set up
    another instance of your monitoring
  • 5:28 - 5:29
    and just test it.
  • 5:30 - 5:31
    Or you don't even have,
  • 5:31 - 5:33
    it doesn't even have to be monitoring,
    you can just use curl
  • 5:33 - 5:36
    to poll the metrics endpoint.
  • 5:38 - 5:40
    It's significantly easier to test.
  • 5:40 - 5:43
    The other thing with the…
  • 5:46 - 5:48
    The other nice thing is that
    the client is really simple.
  • 5:48 - 5:51
    The client doesn't have to know
    where the monitoring system is.
  • 5:51 - 5:54
    It doesn't have to know about ???
  • 5:54 - 5:56
    It just has to sit and collect the data
    about itself.
  • 5:56 - 5:59
    So it doesn't have to know anything about
    the topology of the network.
  • 5:59 - 6:03
    As an application developer, if you're
    writing a DNS server or
  • 6:04 - 6:06
    some other piece of software,
  • 6:06 - 6:10
    you don't have to know anything about
    monitoring software,
  • 6:10 - 6:12
    you can just implement it inside
    your application and
  • 6:13 - 6:17
    the monitoring software, whether it's
    Prometheus or something else,
  • 6:17 - 6:19
    can just come and collect that data for you.
  • 6:20 - 6:24
    That's kind of similar to a very old
    monitoring system called SNMP,
  • 6:24 - 6:29
    but SNMP has a significantly less friendly
    data model for developers.
  • 6:30 - 6:34
    This is the basic layout
    of a Prometheus server.
  • 6:34 - 6:36
    At the core, there's a Prometheus server
  • 6:36 - 6:40
    and it deals with all the data collection
    and analytics.
  • 6:43 - 6:47
    Basically, this one binary,
    it's all written in golang.
  • 6:47 - 6:49
    It's a single binary.
  • 6:49 - 6:51
    It knows how to read from your inventory,
  • 6:51 - 6:53
    there's a bunch of different methods,
    whether you've got
  • 6:53 - 6:59
    a Kubernetes cluster or a cloud platform
  • 7:00 - 7:04
    or you have your own customized thing
    with Ansible.
  • 7:05 - 7:10
    Ansible can take your layout, drop that
    into a config file and
  • 7:11 - 7:12
    Prometheus can pick that up.
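
    As a rough sketch of how that file-based inventory hookup can look (the file paths and host names here are hypothetical placeholders, not from the talk), a scrape configuration using file-based service discovery might be:

        scrape_configs:
          - job_name: node
            file_sd_configs:
              - files:
                  - /etc/prometheus/targets/*.yml   # target files written by Ansible

        # e.g. /etc/prometheus/targets/nodes.yml, generated from the inventory:
        - targets: ['web1.example.org:9100', 'db1.example.org:9100']
          labels:
            environment: production
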
  • 7:16 - 7:19
    Once it has the layout, it goes out and
    collects all the data.
  • 7:19 - 7:24
    It has a storage and a time series
    database to store all that data locally.
  • 7:24 - 7:28
    It has a thing called PromQL, which is
    a query language designed
  • 7:28 - 7:31
    for metrics and analytics.
  • 7:32 - 7:37
    From that PromQL, you can add frontends
    that will,
  • 7:37 - 7:39
    whether it's a simple API client
    to run reports,
  • 7:40 - 7:43
    you can use things like Grafana
    for creating dashboards,
  • 7:43 - 7:45
    it's got a simple webUI built in.
  • 7:45 - 7:47
    You can plug in anything you want
    on that side.
  • 7:49 - 7:54
    And then, it also has the ability to
    continuously execute queries
  • 7:55 - 7:56
    called "recording rules"
  • 7:57 - 7:59
    and these recording rules have
    two different modes.
  • 7:59 - 8:02
    You can either record, you can take
    a query
  • 8:02 - 8:04
    and it will generate new data
    from that query
  • 8:04 - 8:07
    or you can take a query, and
    if it returns results,
  • 8:07 - 8:09
    it will return an alert.
  • 8:09 - 8:13
    That alert is a push message
    to the alert manager.
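
    A rough sketch of what those two kinds of rules look like in a Prometheus rule file (the rule names and expressions are illustrative, not taken from the talk):

        groups:
          - name: example
            rules:
              # recording rule: continuously evaluates a query and stores
              # the result as a new time series
              - record: instance:node_cpu:rate1m
                expr: sum without (cpu, mode) (rate(node_cpu_seconds_total{mode!="idle"}[1m]))
              # alerting rule: if the query returns results for 5 minutes,
              # an alert is pushed to the alert manager
              - alert: InstanceDown
                expr: up == 0
                for: 5m
                labels:
                  severity: critical
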
  • 8:13 - 8:19
    This allows us to separate the generating
    of alerts from the routing of alerts.
  • 8:19 - 8:24
    You can have one or hundreds of Prometheus
    services, all generating alerts
  • 8:25 - 8:29
    and it goes into an alert manager cluster
    and it does the deduplication
  • 8:29 - 8:31
    and the routing to the human
  • 8:31 - 8:34
    because, of course, the thing
    that we want is
  • 8:35 - 8:39
    we used to have dashboards with graphs, but
    in order to find out if something was broken
  • 8:39 - 8:41
    you had to have a human
    looking at the graph.
  • 8:41 - 8:43
    With Prometheus, we don't have to do that
    anymore,
  • 8:43 - 8:48
    we can simply let the software tell us
    that we need to go investigate
  • 8:48 - 8:49
    our problems.
  • 8:49 - 8:51
    We don't have to sit there and
    stare at dashboards all day,
  • 8:51 - 8:52
    because that's really boring.
  • 8:55 - 8:58
    What does it look like to actually
    get data into Prometheus?
  • 8:58 - 9:02
    This is a very basic output
    of a Prometheus metric.
  • 9:03 - 9:04
    This is a very simple thing.
  • 9:04 - 9:08
    If you know much about
    the Linux kernel,
  • 9:07 - 9:13
    the Linux kernel tracks, in /proc/stat,
    all the state of all the CPUs
  • 9:13 - 9:14
    in your system
  • 9:15 - 9:18
    and we express this by having
    the name of the metric, which is
  • 9:22 - 9:26
    'node_cpu_seconds_total' and so
    this is a self-describing metric,
  • 9:27 - 9:28
    like you can just read the metrics name
  • 9:29 - 9:31
    and you understand a little bit about
    what's going on here.
  • 9:33 - 9:39
    The Linux kernel and other kernels track
    their usage by the number of seconds
  • 9:39 - 9:41
    spent doing different things and
  • 9:41 - 9:47
    that could be, whether it's in system or
    user space or IRQs
  • 9:47 - 9:49
    or iowait or idle.
  • 9:49 - 9:51
    Actually, the kernel tracks how much
    idle time it has.
  • 9:54 - 9:55
    It also tracks it by the number of CPUs.
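
    The exposition format being described looks roughly like this (the numbers are made up; the metric name and the 'cpu' and 'mode' labels are the ones discussed here):

        # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
        # TYPE node_cpu_seconds_total counter
        node_cpu_seconds_total{cpu="0",mode="idle"} 18093.82
        node_cpu_seconds_total{cpu="0",mode="user"} 412.55
        node_cpu_seconds_total{cpu="0",mode="system"} 187.16
        node_cpu_seconds_total{cpu="0",mode="iowait"} 12.43
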
  • 9:56 - 10:00
    With other monitoring systems, they used
    to do this with a tree structure
  • 10:01 - 10:04
    and this caused a lot of problems,
    for example:
  • 10:04 - 10:09
    how do you mix and match data? So,
    by switching from
  • 10:10 - 10:12
    a tree structure to a tag-based structure,
  • 10:13 - 10:17
    we can do some really interesting
    powerful data analytics.
  • 10:18 - 10:25
    Here's a nice example of taking
    those CPU seconds counters
  • 10:26 - 10:30
    and then converting them into a graph
    by using PromQL.
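
    The exact query on the slide isn't in the transcript, but a typical PromQL expression for this kind of CPU graph would be something like:

        sum without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
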
  • 10:33 - 10:35
    Now we can get into
    Metrics-Based Alerting.
  • 10:35 - 10:38
    Now we have this graph, we have this thing
  • 10:38 - 10:39
    we can look and see here
  • 10:40 - 10:43
    "Oh there is some little spike here,
    we might want to know about that."
  • 10:43 - 10:46
    Now we can get into Metrics-Based
    Alerting.
  • 10:46 - 10:51
    I used to be a site reliability engineer,
    I'm still a site reliability engineer at heart
  • 10:52 - 11:00
    and we have this concept of the things that
    you need to run a site or a service reliably.
  • 11:01 - 11:03
    The most important thing you need is
    down at the bottom,
  • 11:04 - 11:07
    Monitoring, because if you don't have
    monitoring of your service,
  • 11:07 - 11:09
    how do you know it's even working?
  • 11:12 - 11:15
    There's a couple of techniques here, and
    we want to alert based on data
  • 11:16 - 11:18
    and not just those end to end tests.
  • 11:19 - 11:23
    There's a couple of techniques, a thing
    called the RED method
  • 11:24 - 11:25
    and there's a thing called the USE method
  • 11:26 - 11:28
    and there are a couple of nice
    blog posts about this
  • 11:29 - 11:31
    and basically it defines that, for example,
  • 11:31 - 11:35
    the RED method talks about the requests
    that your system is handling
  • 11:36 - 11:38
    There are three things:
  • 11:38 - 11:40
    There's the number of requests, there's
    the number of errors
  • 11:40 - 11:42
    and there's the duration, how long requests take.
  • 11:43 - 11:45
    With the combination of these three things
  • 11:45 - 11:48
    you can determine most of
    what your users see
  • 11:49 - 11:54
    "Did my request go through? Did it
    return an error? Was it fast?"
  • 11:55 - 11:58
    Most people, that's all they care about.
  • 11:58 - 12:02
    "I made a request to a website and
    it came back and it was fast."
  • 12:05 - 12:07
    It's a very simple method of just, like,
  • 12:07 - 12:10
    those are the important things to
    determine if your site is healthy.
  • 12:12 - 12:17
    But we can go back to some more
    traditional, sysadmin style, alerts
  • 12:17 - 12:21
    this is basically taking the filesystem
    available space,
  • 12:21 - 12:27
    divided by the filesystem size, that becomes
    the ratio of filesystem availability
  • 12:27 - 12:28
    from 0 to 1.
  • 12:28 - 12:31
    Multiply it by 100, we now have
    a percentage
  • 12:31 - 12:36
    and if it's less than or equal to 1%
    for 15 minutes,
  • 12:36 - 12:42
    this is less than 1% space, we should tell
    a sysadmin to go check
  • 12:42 - 12:44
    to find out why the filesystem
    has filled up.
  • 12:45 - 12:46
    It's super nice and simple.
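
    As a sketch, an alerting rule along these lines might look like the following (the threshold and duration follow the talk; the metric names are the node_exporter's and depend on its version):

        groups:
          - name: filesystem
            rules:
              - alert: FilesystemAlmostOutOfSpace
                # available space as a percentage of filesystem size
                expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
                for: 15m
                labels:
                  severity: critical
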
  • 12:46 - 12:50
    We can also tag, we can include…
  • 12:51 - 12:58
    Every alert includes all the extra
    labels that Prometheus adds to your metrics
  • 12:59 - 13:05
    When you add a metric in Prometheus, if
    we go back and we look at this metric.
  • 13:06 - 13:11
    This metric only contains the information
    about the internals of the application
  • 13:13 - 13:15
    anything about, like, what server it's on,
    is it running in a container,
  • 13:15 - 13:19
    what cluster does it come from,
    what continent is it on,
  • 13:18 - 13:22
    that's all extra annotations that are
    added by the Prometheus server
  • 13:23 - 13:24
    at discovery time.
  • 13:25 - 13:28
    Unfortunately, I don't have a good example
    of what those labels look like
  • 13:29 - 13:34
    but every metric gets annotated
    with location information.
  • 13:37 - 13:41
    That location information also comes through
    as labels in the alert
  • 13:41 - 13:48
    so, if you have a message coming
    into your alert manager,
  • 13:48 - 13:50
    the alert manager can look and go
  • 13:50 - 13:52
    "Oh, that's coming from this datacenter"
  • 13:52 - 13:59
    and it can include that in the email or
    IRC message or SMS message.
  • 13:59 - 14:01
    So you can include
  • 13:59 - 14:04
    "Filesystem is out of space on this host
    from this datacenter"
  • 14:05 - 14:07
    All these labels get passed through and
    then you can append
  • 14:07 - 14:13
    "severity: critical" to that alert and
    include that in the message to the human
  • 14:14 - 14:17
    because of course, this is how you define…
  • 14:17 - 14:21
    Getting the message from the monitoring
    to the human.
  • 14:22 - 14:24
    You can even include nice things like,
  • 14:24 - 14:28
    if you've got documentation, you can
    include a link to the documentation
  • 14:28 - 14:29
    as an annotation
  • 14:29 - 14:33
    and the alert manager can take that
    basic url and, you know,
  • 14:33 - 14:37
    massaging it into whatever it needs
    to look like to actually get
  • 14:37 - 14:40
    the operator to the correct documentation.
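
    A minimal sketch of that routing side, in the alert manager's configuration format (the receiver names, address and URL are placeholders):

        route:
          receiver: team-email
          routes:
            - match:
                severity: critical      # label attached by the alerting rule
              receiver: oncall-pager
        receivers:
          - name: team-email
            email_configs:
              - to: ops@example.org
          - name: oncall-pager
            webhook_configs:
              - url: http://pager.example.org/alert

    The documentation link itself would typically be attached to the alerting rule as an annotation and passed through to whatever receiver formats the message.
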
  • 14:42 - 14:43
    We can also do more fun things:
  • 14:44 - 14:46
    since we actually are not just checking
  • 14:46 - 14:49
    what is the space right now,
    we're tracking data over time,
  • 14:49 - 14:51
    we can use 'predict_linear'.
  • 14:52 - 14:55
    'predict_linear' just takes and does
    a simple linear regression.
  • 14:56 - 15:00
    This example takes the filesystem
    available space over the last hour and
  • 15:01 - 15:02
    does a linear regression.
  • 15:03 - 15:09
    Prediction says "Well, it's going that way
    and four hours from now,
  • 15:09 - 15:13
    based on one hour of history, it's gonna
    be less than 0, which means full".
  • 15:14 - 15:21
    We know that within the next four hours,
    the disk is gonna be full
  • 15:21 - 15:25
    so we can tell the operator ahead of time
    that it's gonna be full
  • 15:25 - 15:27
    and not just tell them that it's full
    right now.
  • 15:27 - 15:32
    They have some window of ability
    to fix it before it fails.
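
    A sketch of that predictive rule, again assuming the node_exporter's filesystem metric name:

        - alert: FilesystemWillFillInFourHours
          # linear regression over the last hour, projected 4 hours (14400s) ahead
          expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: warning
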
  • 15:33 - 15:35
    This is really important because
    if you're running a site
  • 15:36 - 15:41
    you want to be able to have alerts
    that tell you that your system is failing
  • 15:42 - 15:43
    before it actually fails.
  • 15:44 - 15:48
    Because if it fails, you're out of SLO
    or SLA and
  • 15:48 - 15:50
    your users are gonna be unhappy
  • 15:51 - 15:52
    and you don't want the users to tell you
    that your site is down
  • 15:53 - 15:55
    you want to know about it before
    your users can even tell.
  • 15:55 - 15:58
    This allows you to do that.
  • Not Synced
    And also of course, Prometheus being
    a modern system,
  • Not Synced
    we fully support UTF-8 in all of our labels.
  • Not Synced
    Here's an other one, here's a good example
    from the USE method.
  • Not Synced
    This is a rate of 500 errors coming from
    an application
  • Not Synced
    and you can simply alert that
  • Not Synced
    there's more than 500 errors per second
    coming out of the application
  • Not Synced
    if that's your threshold for ???
  • Not Synced
    And you can do other things,
  • Not Synced
    you can convert that from just
    a rate of errors
  • Not Synced
    to a percentage of errors.
  • Not Synced
    So you could say
  • Not Synced
    "I have an SLA of 3 9" and so you can say
  • Not Synced
    "If the rate of errors divided by the rate
    of requests is .01,
  • Not Synced
    or is more than .01, then
    that's a problem."
  • Not Synced
    You can include that level of
    error granularity.
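
    As a sketch, with a hypothetical 'http_requests_total' counter carrying a 'status' label, that error-ratio alert could be expressed as:

        sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
        sum(rate(http_requests_total[5m]))
          > 0.01
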
  • Not Synced
    And if you're just doing a blackbox test,
  • Not Synced
    you wouldn't know this, you would only know
    that you got an error from the system,
  • Not Synced
    then you got another error from the system
  • Not Synced
    then you fire an alert.
  • Not Synced
    But if those checks are one minute apart
    and you're serving 1000 requests per second
  • Not Synced
    you could be serving 10,000 errors before
    you even get an alert.
  • Not Synced
    And you might miss it, because
  • Not Synced
    what if you only get one random error
  • Not Synced
    and then the next time, you're serving
    25% errors,
  • Not Synced
    you only have a 25% chance of that check
    failing again.
  • Not Synced
    You really need these metrics in order
    to get
  • Not Synced
    proper reports of the status of your system
  • Not Synced
    There are even more options.
  • Not Synced
    You can slice and dice those labels.
  • Not Synced
    If you have a label on all of
    your applications called 'service'
  • Not Synced
    you can send that 'service' label through
    to the message
  • Not Synced
    and you can say
    "Hey, this service is broken".
  • Not Synced
    You can include that service label
    in your alert messages.
  • Not Synced
    And that's it, I can go to a demo and Q&A.
  • Not Synced
    [Applause]
  • Not Synced
    Any questions so far?
  • Not Synced
    Or anybody want to see a demo?
  • Not Synced
    [Q] Hi. Does Prometheus do metric
    discovery inside containers
  • Not Synced
    or do I have to implement the metrics
    myself?
  • Not Synced
    [A] For metrics in containers, there are
    already things that expose
  • Not Synced
    the metrics of the container system
    itself.
  • Not Synced
    There's a utility called 'cAdvisor' and
  • Not Synced
    cAdvisor takes the Linux cgroup data
    and exposes it as metrics
  • Not Synced
    so you can get data about
    how much CPU time is being
  • Not Synced
    spent in your container,
  • Not Synced
    how much memory is being used
    by your container.
  • Not Synced
    [Q] But not about the application,
    just about the container usage?
  • Not Synced
    [A] Right. Because the container
    has no idea
  • Not Synced
    whether your application is written
    in Ruby or Go or Python or whatever,
  • Not Synced
    you have to build that into
    your application in order to get the data.
  • Not Synced
    So for Prometheus,
  • Not Synced
    we've written client libraries that can be
    included in your application directly
  • Not Synced
    so you can get that data out.
  • Not Synced
    If you go to the Prometheus website,
    we have a whole series of client libraries
  • Not Synced
    and we cover a pretty good selection
    of popular software.
  • Not Synced
    [Q] What is the current state of
    long-term data storage?
  • Not Synced
    [A] Very good question.
  • Not Synced
    There's been several…
  • Not Synced
    There's actually several different methods
    of doing this.
  • Not Synced
    Prometheus stores all this data locally
    in its own data storage
  • Not Synced
    on the local disk.
  • Not Synced
    But that's only as durable as
    that server is durable.
  • Not Synced
    So if you've got a really durable server,
  • Not Synced
    you can store as much data as you want,
  • Not Synced
    you can store years and years of data
    locally on the Prometheus server.
  • Not Synced
    That's not a problem.
  • Not Synced
    There's a bunch of misconceptions because
    of our default
  • Not Synced
    and the language on our website said
  • Not Synced
    "It's not long-term storage"
  • Not Synced
    simply because we leave that problem
    up to the person running the server.
  • Not Synced
    But the time series database
    that Prometheus includes
  • Not Synced
    is actually quite durable.
  • Not Synced
    But it's only as durable as the server
    underneath it.
  • Not Synced
    So if you've got a very large cluster and
    you want really high durability,
  • Not Synced
    you need to have some kind of
    cluster software,
  • Not Synced
    but because we want Prometheus to be
    simple to deploy
  • Not Synced
    and very simple to operate
  • Not Synced
    and also very robust.
  • Not Synced
    We didn't want to include any clustering
    in Prometheus itself,
  • Not Synced
    because anytime you have a clustered
    software,
  • Not Synced
    what happens if your network is
    a little wonky?
  • Not Synced
    The first thing that goes down is
    all of your distributed systems fail.
  • Not Synced
    And building distributed systems to be
    really robust is really hard
  • Not Synced
    so Prometheus is what we call
    "uncoordinated distributed systems".
  • Not Synced
    If you've got two Prometheus servers
    monitoring all your targets in an HA mode
  • Not Synced
    in a cluster, and there's a split brain,
  • Not Synced
    each Prometheus can see
    half of the cluster and
  • Not Synced
    it can see that the other half
    of the cluster is down.
  • Not Synced
    They can both try to get alerts out
    to the alert manager
  • Not Synced
    and this is a really really robust way of
    handling split brains
  • Not Synced
    and bad network failures and bad problems
    in a cluster.
  • Not Synced
    It's designed to be super super robust
  • Not Synced
    and so the two individual
    Prometheus servers in your cluster
  • Not Synced
    don't have to talk to each other
    to do this,
  • Not Synced
    they can just do it independently.
  • Not Synced
    But if you want to be able
    to correlate data
  • Not Synced
    between many different Prometheus servers
  • Not Synced
    you need an external data storage
    to do this.
  • Not Synced
    And also you may not have
    very big servers,
  • Not Synced
    you might be running your Prometheus
    in a container
  • Not Synced
    and it's only got a little bit of local
    storage space
  • Not Synced
    so you want to send all that data up
    to a big cluster datastore
  • Not Synced
    for ???
  • Not Synced
    We have several different ways of
    doing this.
  • Not Synced
    There's the classic way which is called
    federation
  • Not Synced
    where you have one Prometheus server
    polling in summary data from
  • Not Synced
    each of the individual Prometheus servers
  • Not Synced
    and this is useful if you want to run
    alerts against data coming
  • Not Synced
    from multiple Prometheus servers.
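
    A sketch of what such a federation scrape job looks like (the target names are placeholders; the 'match[]' selectors pick which summary series get pulled):

        scrape_configs:
          - job_name: federate
            honor_labels: true
            metrics_path: /federate
            params:
              'match[]':
                - '{__name__=~"job:.*"}'   # only aggregated recording-rule series
            static_configs:
              - targets: ['prometheus-dc1:9090', 'prometheus-dc2:9090']
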
  • Not Synced
    But federation is not replication.
  • Not Synced
    It only can do a little bit of data from
    each Prometheus server.
  • Not Synced
    If you've got a million metrics on
    each Prometheus server,
  • Not Synced
    you can't poll in a million metrics
    and do…
  • Not Synced
    If you've got 10 of those, you can't
    poll in 10 million metrics
  • Not Synced
    simultaneously into one Prometheus
    server.
  • Not Synced
    It's just too much data.
  • Not Synced
    There are a couple of other
    nice options.
  • Not Synced
    There's a piece of software called
    Cortex.
  • Not Synced
    Cortex is a Prometheus server that
    stores its data in a database.
  • Not Synced
    Specifically, a distributed database.
  • Not Synced
    Things that are based on the Google
    Bigtable model, like Cassandra or…
  • Not Synced
    What's the Amazon one?
  • Not Synced
    Yeah.
  • Not Synced
    DynamoDB.
  • Not Synced
    If you have a DynamoDB or a Cassandra
    cluster, or one of these other
  • Not Synced
    really big distributed storage clusters,
  • Not Synced
    Cortex can run and the Prometheus servers
    will stream their data up to Cortex
  • Not Synced
    and it will keep a copy of that across
    all of your Prometheus servers.
  • Not Synced
    And because it's based on things
    like Cassandra,
  • Not Synced
    it's super scalable.
  • Not Synced
    But it's a little complex to run and
  • Not Synced
    many people don't want to run that
    complex infrastructure.
  • Not Synced
    We have another new one, we just blogged
    about it yesterday.
  • Not Synced
    It's a thing called Thanos.
  • Not Synced
    Thanos is Prometheus at scale.
  • Not Synced
    Basically, the way it works…
  • Not Synced
    Actually, why don't I bring that up?
  • Not Synced
    This was developed by a company
    called Improbable
  • Not Synced
    and they wanted to…
  • Not Synced
    They had billions of metrics coming from
    hundreds of Prometheus servers.
  • Not Synced
    They developed this in collaboration with
    the Prometheus team to build
  • Not Synced
    a super highly scalable Prometheus server.
  • Not Synced
    Prometheus itself stores the incoming
    metrics data in a write-ahead log
  • Not Synced
    and then every two hours, it creates
    a compaction cycle
  • Not Synced
    and it creates an immutable series block
    of data which is
  • Not Synced
    all the time series blocks themselves
  • Not Synced
    and then an index into that data.
  • Not Synced
    Those two hour windows are all immutable
  • Not Synced
    so Thanos has a little sidecar binary that
    watches for those new directories and
  • Not Synced
    uploads them into a blob store.
  • Not Synced
    So you could put them in S3 or minio or
    some other simple object storage.
  • Not Synced
    And then now you have all of your data,
    all of this index data already
  • Not Synced
    ready to go
  • Not Synced
    and then the final sidecar creates
    a little mesh cluster that can read from
  • Not Synced
    all of those S3 blocks.
  • Not Synced
    Now, you have this super global view
    all stored in a big bucket storage and
  • Not Synced
    things like S3 or minio are…
  • Not Synced
    Bucket stores are not databases, so they're
    operationally a little easier to run.
  • Not Synced
    Plus, now we have all this data in
    a bucket store and
  • Not Synced
    the Thanos sidecars can talk to each other
  • Not Synced
    We can now have a single entry point.
  • Not Synced
    You can query Thanos and Thanos will
    distribute your query
  • Not Synced
    across all your Prometheus servers.
  • Not Synced
    So now you can do global queries across
    all of your servers.
  • Not Synced
    But it's very new, they just released
    their first release candidate yesterday.
  • Not Synced
    It is looking to be like
    the coolest thing ever
  • Not Synced
    for running large scale Prometheus.
  • Not Synced
    Here's an example of how that is laid out.
  • Not Synced
    This will ??? let you have
    a billion metric Prometheus cluster.
  • Not Synced
    And it's got a bunch of other
    cool features.
  • Not Synced
    Any more questions?
  • Not Synced
    Alright, maybe I'll do
    a quick little demo.
  • Not Synced
    Here is a Prometheus server that is
    provided by Cloud Alchemy
  • Not Synced
    that just does an Ansible deployment
    for Prometheus.
  • Not Synced
    And you can just simply query
    for something like 'node_cpu'.
  • Not Synced
    This is actually the old name for
    that metric.
  • Not Synced
    And you can see, here's exactly
  • Not Synced
    the CPU metrics from some servers.
  • Not Synced
    It's just a bunch of stuff.
  • Not Synced
    There's actually two servers here,
  • Not Synced
    there's an influx cloud alchemy and
    there is a demo cloud alchemy.
  • Not Synced
    [Q] Can you zoom in?
    [A] Oh yeah sure.
  • Not Synced
    So you can see all the extra labels.
  • Not Synced
    We can also do some things like…
  • Not Synced
    Let's take a look at, say,
    the last 30 seconds.
  • Not Synced
    We can just add this little time window.
  • Not Synced
    It's called a range request,
    and you can see
  • Not Synced
    the individual samples.
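
    The query being typed here is presumably something along the lines of:

        node_cpu[30s]
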
  • Not Synced
    You can see that all Prometheus is doing
  • Not Synced
    is storing the sample and a timestamp.
  • Not Synced
    All the timestamps are in milliseconds
    and it's all epoch
  • Not Synced
    so it's super easy to manipulate.
  • Not Synced
    But, looking at the individual samples and
    looking at this, you can see that
  • Not Synced
    if we go back and just take…
    and look at the raw data, and
  • Not Synced
    we graph the raw data…
  • Not Synced
    Oops, that's a syntax error.
  • Not Synced
    And we look at this graph…
    Come on.
  • Not Synced
    Here we go.
  • Not Synced
    Well, that's kind of boring, it's just
    a flat line because
  • Not Synced
    it's just a counter going up very slowly.
  • Not Synced
    What we really want to do, is we want to
    take, and we want to apply
  • Not Synced
    a rate function to this counter.
  • Not Synced
    So let's look at the rate over
    the last one minute.
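
    That is, something like:

        rate(node_cpu[1m])
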
  • Not Synced
    There we go, now we get
    a nice little graph.
  • Not Synced
    And so you can see that this is
    0.6 CPU seconds per second
  • Not Synced
    for that set of labels.
  • Not Synced
    But this is pretty noisy, there's a lot
    of lines on this graph and
  • Not Synced
    there's still a lot of data here.
  • Not Synced
    So let's start doing some filtering.
  • Not Synced
    One of the things we see here is,
    well, there's idle.
  • Not Synced
    We don't really care about
    the machine being idle,
  • Not Synced
    so let's just add a label filter
    so we can say
  • Not Synced
    'mode', it's the label name, and it's not
    equal to 'idle'. Done.
  • Not Synced
    And if I could type…
    What did I miss?
  • Not Synced
    Here we go.
  • Not Synced
    So now we've removed idle from the graph.
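
    So the query at this point is presumably:

        rate(node_cpu{mode!="idle"}[1m])
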
  • Not Synced
    That looks a little more sane.
  • Not Synced
    Oh, wow, look at that, that's a nice
    big spike in user space on the influx server
  • Not Synced
    Okay…
  • Not Synced
    Well, that's pretty cool.
  • Not Synced
    What about…
  • Not Synced
    This is still quite a lot of lines.
  • Not Synced
    How much CPU is in use in total across
    all the servers that we have?
  • Not Synced
    We can just sum up that rate.
  • Not Synced
    We can just see that there is
    a sum total of 0.6 CPU seconds/s
  • Not Synced
    across the servers we have.
  • Not Synced
    But that's a little too coarse.
  • Not Synced
    What if we want to see it by instance?
  • Not Synced
    Now, we can see the two servers,
    we can see
  • Not Synced
    that we're left with just that label.
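
    Presumably something like:

        sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))
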
  • Not Synced
    The instance labels are the influx instance
    and the demo instance.
  • Not Synced
    That's a super easy way to see that,
  • Not Synced
    but we can also do this
    the other way around.
  • Not Synced
    We can say 'without (mode,cpu)' so
    we can drop those modes and
  • Not Synced
    see all the labels that we have.
  • Not Synced
    We can still see the environment label
    and the job label on the resulting data.
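
    That is, roughly:

        sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))
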
  • Not Synced
    You can go either way
    with the summary functions.
  • Not Synced
    There's a whole bunch of different functions
  • Not Synced
    and it's all in our documentation.
  • Not Synced
    But what if we want to see it…
  • Not Synced
    What if we want to see which CPUs
    are in use?
  • Not Synced
    Now we can see that it's only CPU0
  • Not Synced
    because apparently these are only
    1-core instances.
  • Not Synced
    You can add/remove labels and do
    all these queries.
  • Not Synced
    Any other questions so far?
  • Not Synced
    [Q] I don't have a question, but I have
    something to add.
  • Not Synced
    Prometheus is really nice, but it's
    a lot better if you combine it
  • Not Synced
    with Grafana.
  • Not Synced
    [A] Yes, yes.
  • Not Synced
    In the beginning, when we were creating
    Prometheus, we actually built
  • Not Synced
    a piece of dashboard software called
    PromDash.
  • Not Synced
    It was a simple little Ruby on Rails app
    to create dashboards
  • Not Synced
    and it had a bunch of JavaScript.
  • Not Synced
    And then Grafana came out.
  • Not Synced
    And we're like
  • Not Synced
    "Oh, that's interesting. It doesn't support
    Prometheus" so we were like
  • Not Synced
    "Hey, can you support Prometheus"
  • Not Synced
    and they're like "Yeah, we've got
    a REST API, get the data, done"
  • Not Synced
    Now Grafana supports Prometheus and
    we're like
  • Not Synced
    "Well, promdash, this is crap, delete".
  • Not Synced
    The Prometheus development team,
  • Not Synced
    we're all backend developers
    and SREs and
  • Not Synced
    we have no JavaScript skills at all.
  • Not Synced
    So we're like "Let somebody deal
    with that".
  • Not Synced
    One of the nice things about working on
    this kind of project is
  • Not Synced
    we can do things that we're good at
    and we don't try…
  • Not Synced
    We don't have any marketing people,
    it's just an open source project,
  • Not Synced
    there's no single company behind Prometheus.
  • Not Synced
    I work for GitLab, Improbable paid for
    the Thanos system,
  • Not Synced
    other companies like Red Hat now pay
    people who used to work at CoreOS to
  • Not Synced
    work on Prometheus.
  • Not Synced
    There's lots and lots of collaboration
    between many companies
  • Not Synced
    to build the Prometheus ecosystem.
  • Not Synced
    But yeah, Grafana is great.
  • Not Synced
    Actually, Grafana now has
    two full-time Prometheus developers.
  • Not Synced
    Alright, that's it.
  • Not Synced
    [Applause]
Title:
Metrics-Based Monitoring with Prometheus
Description:

Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
34:03
