Metrics-Based Monitoring with Prometheus
-
0:06 - 0:11So, we had a talk by a non-GitLab person
about GitLab. -
0:11 - 0:13Now, we have a talk by a GitLab person
on non-GitLab. -
0:13 - 0:15Something like that?
-
0:16 - 0:19The CCCHH hackerspace is now open,
-
0:20 - 0:22from now on if you want to go there,
that's the announcement. -
0:22 - 0:26And the next talk will be by Ben Kochie
-
0:26 - 0:28on metrics-based monitoring
with Prometheus. -
0:29 - 0:30Welcome.
-
0:31 - 0:33[Applause]
-
0:35 - 0:37Alright, so
-
0:37 - 0:39my name is Ben Kochie
-
0:40 - 0:44I work on DevOps features for GitLab
-
0:44 - 0:48and apart from working for GitLab, I also work
on the open source Prometheus project. -
0:51 - 0:54I live in Berlin and I've been using
Debian since ??? -
0:54 - 0:57yes, quite a long time.
-
0:59 - 1:01So, what is Metrics-based Monitoring?
-
1:03 - 1:05If you're running software in production,
-
1:06 - 1:08you probably want to monitor it,
-
1:08 - 1:11because if you don't monitor it, you don't
know if it's working right. -
1:13 - 1:16Monitoring breaks down into two categories:
-
1:16 - 1:19there's blackbox monitoring and
there's whitebox monitoring. -
1:20 - 1:25Blackbox monitoring is treating
your software like a blackbox. -
1:25 - 1:26It's just checks to see, like,
-
1:26 - 1:29is it responding, or does it ping
-
1:30 - 1:34or ??? HTTP requests
-
1:34 - 1:36[mic turned on]
-
1:38 - 1:41Ah, there we go, that's better.
-
1:47 - 1:52So, blackbox monitoring is a probe,
-
1:52 - 1:55it just kind of looks from the outside
at your software -
1:55 - 1:57and it has no knowledge of the internals
-
1:58 - 2:01and it's really good for end to end testing.
-
2:01 - 2:04So if you've got a fairly complicated
service, -
2:04 - 2:06you come in from the outside, you go
through the load balancer, -
2:07 - 2:08you hit the API server,
-
2:08 - 2:10the API server might hit a database,
-
2:10 - 2:13and you go all the way through
to the back of the stack -
2:13 - 2:15and all the way back out
-
2:15 - 2:16so you know that everything is working
end to end. -
2:16 - 2:19But you only know about it
for that one request. -
2:19 - 2:22So in order to find out if your service
is working, -
2:23 - 2:27from the end to end, for every single
request, -
2:27 - 2:30this requires whitebox instrumentation.
-
2:30 - 2:34So, basically, every event that happens
inside your software, -
2:34 - 2:37inside a serving stack,
-
2:37 - 2:40gets collected and gets counted,
-
2:40 - 2:43so you know that every request hits
the load balancer, -
2:43 - 2:46every request hits your application
service, -
2:46 - 2:47every request hits the database.
-
2:48 - 2:51You know that everything matches up
-
2:51 - 2:56and this is called whitebox, or
metrics-based monitoring. -
2:56 - 2:58There are different examples of, like,
-
2:58 - 3:02the kind of software that does blackbox
and whitebox monitoring. -
3:03 - 3:07So you have software like Nagios that
you can configure checks -
3:09 - 3:10or pingdom,
-
3:10 - 3:12Pingdom will ping your website.
-
3:13 - 3:15And then there is metrics-based monitoring,
-
3:16 - 3:19things like Prometheus, things like
the TICK stack from influx data, -
3:20 - 3:23New Relic and other commercial solutions
-
3:23 - 3:25but of course I like to talk about
the open source solutions. -
3:26 - 3:28We're gonna talk a little bit about
Prometheus. -
3:29 - 3:32Prometheus came out of the idea that
-
3:32 - 3:38we needed a monitoring system that could
collect all this whitebox metric data -
3:38 - 3:41and do something useful with it.
-
3:41 - 3:43Not just give us a pretty graph, but
we also want to be able to -
3:43 - 3:44alert on it.
-
3:44 - 3:46So we needed both
-
3:50 - 3:54a data gathering and an analytics system
in the same instance. -
3:54 - 3:59To do this, we built this thing and
we looked at the way that -
3:59 - 4:02data was being generated
by the applications -
4:02 - 4:05and there are advantages and
disadvantages to this -
4:05 - 4:07push vs. pull model for metrics.
-
4:07 - 4:10We decided to go with the pulling model
-
4:10 - 4:14because there are some slight advantages
for pulling over pushing. -
4:16 - 4:18With pulling, you get this free
blackbox check -
4:18 - 4:20that the application is running.
-
4:21 - 4:24When you pull your application, you know
that the process is running. -
4:25 - 4:28If you are doing push-based, you can't
tell the difference between -
4:28 - 4:32your application doing no work and
your application not running. -
4:32 - 4:34So you don't know if it's stuck,
-
4:34 - 4:38or is it just not having to do any work.
-
4:43 - 4:49With pulling, the pulling system knows
the state of your network. -
4:50 - 4:53If you have a defined set of services,
-
4:53 - 4:57that inventory drives what should be there.
-
4:58 - 5:00Again, it's that same disappearing problem:
-
5:00 - 5:04is the process dead, or is it just
not doing anything? -
5:04 - 5:07With polling, you know for a fact
what processes should be there, -
5:08 - 5:11and it's a bit of an advantage there.
-
5:11 - 5:13With pulling, there's really easy testing.
-
5:13 - 5:16With push-based metrics, you have to
figure out -
5:17 - 5:19if you want to test a new version of
the monitoring system or -
5:19 - 5:21you want to test something new,
-
5:21 - 5:24you have to tear off a copy of the data.
-
5:24 - 5:28With pulling, you can just set up
another instance of your monitoring -
5:28 - 5:29and just test it.
-
5:30 - 5:31Or you don't even have,
-
5:31 - 5:33it doesn't even have to be monitoring,
you can just use curl -
5:33 - 5:35to pull the metrics endpoint.
-
5:38 - 5:40It's significantly easier to test.
-
5:40 - 5:43The other thing with the…
-
5:46 - 5:48The other nice thing is that
the client is really simple. -
5:48 - 5:51The client doesn't have to know
where the monitoring system is. -
5:51 - 5:54It doesn't have to know about HA
-
5:54 - 5:56It just has to sit and collect the data
about itself. -
5:56 - 5:59So it doesn't have to know anything about
the topology of the network. -
5:59 - 6:03As an application developer, if you're
writing a DNS server or -
6:04 - 6:06some other piece of software,
-
6:06 - 6:10you don't have to know anything about
monitoring software, -
6:10 - 6:12you can just implement it inside
your application and -
6:13 - 6:17the monitoring software, whether it's
Prometheus or something else, -
6:17 - 6:19can just come and collect that data for you.
-
6:20 - 6:24That's kind of similar to a very old
monitoring system called SNMP, -
6:24 - 6:29but SNMP has a significantly less friendly
data model for developers. -
6:30 - 6:34This is the basic layout
of a Prometheus server. -
6:34 - 6:36At the core, there's a Prometheus server
-
6:36 - 6:40and it deals with all the data collection
and analytics. -
6:43 - 6:47Basically, this one binary,
it's all written in golang. -
6:47 - 6:49It's a single binary.
-
6:49 - 6:51It knows how to read from your inventory,
-
6:51 - 6:53there's a bunch of different methods,
whether you've got -
6:53 - 6:59a Kubernetes cluster or a cloud platform
-
7:00 - 7:04or you have your own customized thing
with ansible. -
7:05 - 7:10Ansible can take your layout, drop that
into a config file and -
7:11 - 7:12Prometheus can pick that up.
-
7:16 - 7:19Once it has the layout, it goes out and
collects all the data. -
7:19 - 7:24It has a storage and a time series
database to store all that data locally. -
7:24 - 7:28It has a thing called PromQL, which is
a query language designed -
7:28 - 7:31for metrics and analytics.
-
7:32 - 7:37From that PromQL, you can add frontends,
-
7:37 - 7:39whether it's a simple API client
to run reports, -
7:40 - 7:43things like Grafana
for creating dashboards, -
7:43 - 7:45or the simple web UI built in.
-
7:45 - 7:47You can plug in anything you want
on that side. -
7:49 - 7:54And then, it also has the ability to
continuously execute queries -
7:55 - 7:56called "recording rules"
-
7:57 - 7:59and these recording rules have
two different modes. -
7:59 - 8:02You can either record, you can take
a query -
8:02 - 8:04and it will generate new data
from that query -
8:04 - 8:07or you can take a query, and
if it returns results, -
8:07 - 8:09it will fire an alert.
-
8:09 - 8:13That alert is a push message
to the alert manager. -
8:13 - 8:19This allows us to separate the generating
of alerts from the routing of alerts. -
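As a rough sketch (not from the talk's slides), the two rule modes come down to PromQL expressions like these; in a rule file the first would get a record: name and the second an alert: name plus a for: duration:

    # Recording rule expression: continuously evaluate a rate and
    # store the result as a new time series.
    sum without (cpu, mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

    # Alerting rule expression: an alert fires while this returns
    # results, e.g. while a scrape target is down.
    up == 0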
8:19 - 8:24You can have one or hundreds of Prometheus
services, all generating alerts -
8:25 - 8:29and it goes into an alert manager cluster
which does the deduplication -
8:29 - 8:31and the routing to the human
-
8:31 - 8:34because, of course, the thing
that we want is -
8:35 - 8:39we used to have dashboards with graphs, but
in order to find out if something was broken, -
8:39 - 8:41you had to have a human
looking at the graph. -
8:41 - 8:43With Prometheus, we don't have to do that
anymore, -
8:43 - 8:48we can simply let the software tell us
that we need to go investigate -
8:48 - 8:49our problems.
-
8:49 - 8:51We don't have to sit there and
stare at dashboards all day, -
8:51 - 8:52because that's really boring.
-
8:55 - 8:58What does it look like to actually
get data into Prometheus? -
8:58 - 9:02This is a very basic output
of a Prometheus metric. -
9:03 - 9:04This is a very simple thing.
-
9:04 - 9:08If you know much about
the Linux kernel, -
9:07 - 9:13the Linux kernel tracks, in /proc/stat,
the state of all the CPUs -
9:13 - 9:14in your system
-
9:15 - 9:18and we express this by having
the name of the metric, which is -
9:22 - 9:26'node_cpu_seconds_total' and so
this is a self-describing metric, -
9:27 - 9:28like you can just read the metrics name
-
9:29 - 9:31and you understand a little bit about
what's going on here. -
9:33 - 9:39The Linux kernel and other kernels track
their usage by the number of seconds -
9:39 - 9:41spent doing different things and
-
9:41 - 9:47that could be, whether it's in system or
user space or IRQs -
9:47 - 9:49or iowait or idle.
-
9:49 - 9:51Actually, the kernel tracks how much
idle time it has. -
9:54 - 9:55It also tracks it by the number of CPUs.
-
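For reference, the raw exposition text for this metric looks roughly like the following; the sample values and label sets here are illustrative, not from the talk:

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 18436.52
    node_cpu_seconds_total{cpu="0",mode="user"} 1345.88
    node_cpu_seconds_total{cpu="0",mode="system"} 312.46
    node_cpu_seconds_total{cpu="0",mode="iowait"} 51.30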
9:56 - 10:00Other monitoring systems used
to do this with a tree structure -
10:01 - 10:04and this caused a lot of problems,
like: -
10:04 - 10:09how do you mix and match data? So,
by switching from -
10:10 - 10:12a tree structure to a tag-based structure,
-
10:13 - 10:17we can do some really interesting
powerful data analytics. -
10:18 - 10:25Here's a nice example of taking
those CPU seconds counters -
10:26 - 10:30and then converting them into a graph
by using PromQL. -
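The query behind a graph like that is typically a rate over the counter, something like this sketch (the window size is an arbitrary choice):

    rate(node_cpu_seconds_total[5m])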
10:33 - 10:35Now we can get into
Metrics-Based Alerting. -
10:35 - 10:38Now we have this graph, we have this thing
-
10:38 - 10:39we can look and see here
-
10:40 - 10:43"Oh there is some little spike here,
we might want to know about that." -
10:43 - 10:46Now we can get into Metrics-Based
Alerting. -
10:46 - 10:51I used to be a site reliability engineer,
I'm still a site reliability engineer at heart -
10:52 - 11:00and we have this concept of the things that
you need to run a site or a service reliably. -
11:01 - 11:03The most important thing you need is
down at the bottom, -
11:04 - 11:07Monitoring, because if you don't have
monitoring of your service, -
11:07 - 11:09how do you know it's even working?
-
11:12 - 11:15There's a couple of techniques here, and
we want to alert based on data -
11:16 - 11:18and not just those end to end tests.
-
11:19 - 11:23There's a couple of techniques, a thing
called the RED method -
11:24 - 11:25and there's a thing called the USE method
-
11:26 - 11:28and there are a couple of nice
blog posts about this -
11:29 - 11:31and basically it defines that, for example,
-
11:31 - 11:35the RED method talks about the requests
that your system is handling -
11:36 - 11:38There are three things:
-
11:38 - 11:40There's the number of requests, there's
the number of errors -
11:40 - 11:42and there's the duration, how long it takes.
-
11:43 - 11:45With the combination of these three things
-
11:45 - 11:48you can determine most of
what your users see -
11:49 - 11:54"Did my request go through? Did it
return an error? Was it fast?" -
11:55 - 11:58Most people, that's all they care about.
-
11:58 - 12:02"I made a request to a website and
it came back and it was fast." -
12:05 - 12:07It's a very simple method of just, like,
-
12:07 - 12:10those are the important things to
determine if your site is healthy. -
12:12 - 12:17But we can go back to some more
traditional, sysadmin style alerts -
12:17 - 12:21this is basically taking the filesystem
available space, -
12:21 - 12:27divided by the filesystem size, that becomes
the ratio of filesystem availability -
12:27 - 12:28from 0 to 1.
-
12:28 - 12:31Multiply it by 100, we now have
a percentage -
12:31 - 12:36and if it's less than or equal to 1%
for 15 minutes, -
12:36 - 12:42this is less than 1% space, we should tell
a sysadmin to go check -
12:42 - 12:44to find out why the filesystem
has filled up. -
12:45 - 12:46It's super nice and simple.
-
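A hedged sketch of the expression being described, assuming current node_exporter metric names; in a rule file this would also carry the for: 15m clause mentioned above:

    # Free space as a percentage of filesystem size;
    # fires when it is at or below 1%.
    100 * node_filesystem_avail_bytes / node_filesystem_size_bytes <= 1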
12:46 - 12:50We can also tag, we can include…
-
12:51 - 12:58Every alert includes all the extraneous
labels that Prometheus adds to your metrics -
12:59 - 13:05When you add a metric in Prometheus, if
we go back and we look at this metric. -
13:06 - 13:11This metric only contains the information
about the internals of the application -
13:13 - 13:15anything about, like, what server it's on,
is it running in a container, -
13:15 - 13:19what cluster does it come from,
what continent is it on, -
13:18 - 13:22that's all extra annotations that are
added by the Prometheus server -
13:23 - 13:24at discovery time.
-
13:25 - 13:28Unfortunately I don't have a good example
of what those labels look like -
13:29 - 13:34but every metric gets annotated
with location information. -
13:37 - 13:41That location information also comes through
as labels in the alert -
13:41 - 13:48so, if you have a message coming
into your alert manager, -
13:48 - 13:50the alert manager can look and go
-
13:50 - 13:52"Oh, that's coming from this datacenter"
-
13:52 - 13:59and it can include that in the email or
IRC message or SMS message. -
13:59 - 14:01So you can include
-
13:59 - 14:04"Filesystem is out of space on this host
from this datacenter" -
14:05 - 14:07All these labels get passed through and
then you can append -
14:07 - 14:13"severity: critical" to that alert and
include that in the message to the human -
14:14 - 14:17because of course, this is how you define…
-
14:17 - 14:21Getting the message from the monitoring
to the human. -
14:22 - 14:24You can even include nice things like,
-
14:24 - 14:28if you've got documentation, you can
include a link to the documentation -
14:28 - 14:29as an annotation
-
14:29 - 14:33and the alert manager can take that
basic url and, you know, -
14:33 - 14:37massage it into whatever it needs
to look like to actually get -
14:37 - 14:40the operator to the correct documentation.
-
14:42 - 14:43We can also do more fun things:
-
14:44 - 14:46since we actually are not just checking
-
14:46 - 14:49what is the space right now,
we're tracking data over time, -
14:49 - 14:51we can use 'predict_linear'.
-
14:52 - 14:55'predict_linear' just does
a simple linear regression. -
14:56 - 15:00This example takes the filesystem
available space over the last hour and -
15:01 - 15:02does a linear regression.
-
15:03 - 15:09Prediction says "Well, it's going that way
and four hours from now, -
15:09 - 15:13based on one hour of history, it's gonna
be less than 0, which means full". -
15:14 - 15:21We know that within the next four hours,
the disk is gonna be full -
15:21 - 15:25so we can tell the operator ahead of time
that it's gonna be full -
15:25 - 15:27and not just tell them that it's full
right now. -
15:27 - 15:32They have some window of ability
to fix it before it fails. -
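A sketch of that expression (metric name assumed from node_exporter):

    # Linear regression over the last hour of free-space data,
    # projected 4 hours (4 * 3600 seconds) ahead; fires if the
    # prediction drops below zero, i.e. the disk will be full.
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0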
15:33 - 15:35This is really important because
if you're running a site -
15:36 - 15:41you want to be able to have alerts
that tell you that your system is failing -
15:42 - 15:43before it actually fails.
-
15:44 - 15:48Because if it fails, you're out of SLO
or SLA and -
15:48 - 15:50your users are gonna be unhappy
-
15:51 - 15:52and you don't want the users to tell you
that your site is down -
15:53 - 15:55you want to know about it before
your users can even tell. -
15:55 - 15:58This allows you to do that.
-
15:59 - 16:02And also of course, Prometheus being
a modern system, -
16:03 - 16:06we fully support UTF-8 in all of our labels.
-
16:08 - 16:12Here's another one, here's a good example
from the USE method. -
16:12 - 16:16This is a rate of 500 errors coming from
an application -
16:16 - 16:18and you can simply alert that
-
16:18 - 16:23there's more than 500 errors per second
coming out of the application -
16:23 - 16:26if that's your threshold for pain
-
16:26 - 16:27And you can do other things,
-
16:28 - 16:29you can convert that from just
a rate of errors -
16:30 - 16:31to a percentage of errors.
-
16:31 - 16:33So you could say
-
16:33 - 16:37"I have an SLA of 3 9" and so you can say
-
16:38 - 16:47"If the rate of errors divided by the rate
of requests is .01, -
16:47 - 16:49or is more than .01, then
that's a problem." -
16:50 - 16:55You can include that level of
error granularity. -
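A sketch of that error-ratio expression; the metric name and the code label are assumptions for illustration:

    # Fraction of requests answered with a 5xx over 5 minutes;
    # fires when the error ratio exceeds 1%.
    sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01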
16:55 - 16:58And if you're just doing a blackbox test,
-
16:58 - 17:04you wouldn't know this; you would only see it
if you got an error from the system, -
17:04 - 17:06then you got another error from the system,
-
17:06 - 17:07and then you fire an alert.
-
17:07 - 17:12But if those checks are one minute apart
and you're serving 1000 requests per second -
17:13 - 17:21you could be serving 10,000 errors before
you even get an alert. -
17:22 - 17:23And you might miss it, because
-
17:23 - 17:25what if you only get one random error
-
17:25 - 17:29and then the next time, you're serving
25% errors, -
17:29 - 17:32you only have a 25% chance of that check
failing again. -
17:32 - 17:36You really need these metrics in order
to get -
17:36 - 17:39proper reports of the status of your system
-
17:43 - 17:44There are even options:
-
17:44 - 17:46You can slice and dice those labels.
-
17:46 - 17:50If you have a label on all of
your applications called 'service' -
17:50 - 17:53you can send that 'service' label through
to the message -
17:54 - 17:56and you can say
"Hey, this service is broken". -
17:56 - 18:00You can include that service label
in your alert messages. -
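For example, a hedged sketch of keeping a 'service' label through an aggregation so it arrives as an alert label (names are illustrative):

    # Per-service error rate; the 'service' label survives the sum
    # and is passed along with the alert to the alert manager.
    sum by (service) (rate(http_requests_total{code=~"5.."}[5m])) > 5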
18:01 - 18:07And that's it, I can go to a demo and Q&A.
-
18:10 - 18:14[Applause]
-
18:17 - 18:18Any questions so far?
-
18:19 - 18:20Or anybody want to see a demo?
-
18:30 - 18:35[Q] Hi. Does Prometheus do metric
discovery inside containers -
18:35 - 18:37or do I have to implement the metrics
myself? -
18:38 - 18:46[A] For metrics in containers, there are
already things that expose -
18:46 - 18:49the metrics of the container system
itself. -
18:50 - 18:52There's a utility called 'cadvisor' and
-
18:52 - 18:57cadvisor takes the Linux cgroup data
and exposes it as metrics -
18:57 - 19:01so you can get data about
how much CPU time is being -
19:01 - 19:02spent in your container,
-
19:03 - 19:04how much memory is being used
by your container. -
19:05 - 19:08[Q] But not about the application,
just about the container usage? -
19:09 - 19:11[A] Right. Because the container
has no idea -
19:12 - 19:15whether your application is written
in Ruby or Go or Python or whatever, -
19:19 - 19:22you have to build that into
your application in order to get the data. -
19:24 - 19:24So for Prometheus,
-
19:28 - 19:35we've written client libraries that can be
included in your application directly -
19:35 - 19:36so you can get that data out.
-
19:37 - 19:41If you go to the Prometheus website,
we have a whole series of client libraries -
19:45 - 19:49and we cover a pretty good selection
of popular software. -
19:57 - 20:00[Q] What is the current state of
long-term data storage? -
20:01 - 20:02[A] Very good question.
-
20:03 - 20:05There's been several…
-
20:05 - 20:07There's actually several different methods
of doing this. -
20:10 - 20:15Prometheus stores all this data locally
in its own data storage -
20:15 - 20:16on the local disk.
-
20:17 - 20:19But that's only as durable as
that server is durable. -
20:19 - 20:22So if you've got a really durable server,
-
20:22 - 20:23you can store as much data as you want,
-
20:24 - 20:27you can store years and years of data
locally on the Prometheus server. -
20:27 - 20:28That's not a problem.
-
20:29 - 20:32There's a bunch of misconceptions because
of our default -
20:32 - 20:34and the language on our website said
-
20:35 - 20:36"It's not long-term storage"
-
20:37 - 20:42simply because we leave that problem
up to the person running the server. -
20:43 - 20:46But the time series database
that Prometheus includes -
20:47 - 20:48is actually quite durable.
-
20:49 - 20:51But it's only as durable as the server
underneath it. -
20:52 - 20:55So if you've got a very large cluster and
you want really high durability, -
20:56 - 20:58you need to have some kind of
cluster software, -
20:58 - 21:01but because we want Prometheus to be
simple to deploy -
21:02 - 21:03and very simple to operate
-
21:03 - 21:07and also very robust,
-
21:07 - 21:09we didn't want to include any clustering
in Prometheus itself, -
21:10 - 21:12because anytime you have a clustered
software, -
21:12 - 21:15what happens if your network is
a little wonky. -
21:16 - 21:19The first thing to go down is
all of your distributed systems. -
21:20 - 21:23And building distributed systems to be
really robust is really hard -
21:23 - 21:29so Prometheus is what we call
"uncoordinated distributed systems". -
21:29 - 21:34If you've got two Prometheus servers
monitoring all your targets in an HA mode -
21:34 - 21:37in a cluster, and there's a split brain,
-
21:37 - 21:40each Prometheus can see
half of the cluster and -
21:41 - 21:44it can see that the other half
of the cluster is down. -
21:44 - 21:47They can both try to get alerts out
to the alert manager -
21:47 - 21:50and this is a really really robust way of
handling split brains -
21:51 - 21:54and bad network failures and bad problems
in a cluster. -
21:54 - 21:57It's designed to be super super robust
-
21:57 - 22:00and so the two individual
Prometheus servers in your cluster -
22:00 - 22:02don't have to talk to each other
to do this, -
22:02 - 22:04they can just do it independently.
-
22:04 - 22:07But if you want to be able
to correlate data -
22:08 - 22:09between many different Prometheus servers
-
22:09 - 22:12you need an external data storage
to do this. -
22:13 - 22:15And also you may not have
very big servers, -
22:15 - 22:17you might be running your Prometheus
in a container -
22:17 - 22:19and it's only got a little bit of local
storage space -
22:20 - 22:23so you want to send all that data up
to a big cluster datastore -
22:23 - 22:25for longer-term use.
-
22:26 - 22:28We have several different ways of
doing this. -
22:28 - 22:31There's the classic way which is called
federation -
22:31 - 22:35where you have one Prometheus server
polling in summary data from -
22:35 - 22:37each of the individual Prometheus servers
-
22:37 - 22:40and this is useful if you want to run
alerts against data coming -
22:40 - 22:42from multiple Prometheus servers.
-
22:42 - 22:44But federation is not replication.
-
22:45 - 22:47It can only pull a little bit of data from
each Prometheus server. -
22:48 - 22:51If you've got a million metrics on
each Prometheus server, -
22:52 - 22:56you can't poll in a million metrics
and do… -
22:56 - 22:59If you've got 10 of those, you can't
poll in 10 million metrics -
22:59 - 23:01simultaneously into one Prometheus
server. -
23:01 - 23:02It's just too much data.
-
23:03 - 23:06There is two others, a couple of other
nice options. -
23:07 - 23:09There's a piece of software called
Cortex. -
23:09 - 23:16Cortex is a Prometheus server that
stores its data in a database. -
23:17 - 23:19Specifically, a distributed database.
-
23:19 - 23:24Things that are based on the Google
Bigtable model, like Cassandra or… -
23:26 - 23:27What's the Amazon one?
-
23:30 - 23:33Yeah.
-
23:33 - 23:34DynamoDB.
-
23:34 - 23:37If you have a DynamoDB or a Cassandra
cluster, or one of these other -
23:37 - 23:39really big distributed storage clusters,
-
23:40 - 23:45Cortex can run and the Prometheus servers
will stream their data up to Cortex -
23:45 - 23:49and it will keep a copy of that across
all of your Prometheus servers. -
23:50 - 23:51And because it's based on things
like Cassandra, -
23:52 - 23:53it's super scalable.
-
23:53 - 23:58But it's a little complex to run and
-
23:58 - 24:01many people don't want to run that
complex infrastructure. -
24:01 - 24:06We have another new one, we just blogged
about it yesterday. -
24:02 - 24:07It's a thing called Thanos.
-
24:07 - 24:11Thanos is Prometheus at scale.
-
24:11 - 24:12Basically, the way it works…
-
24:13 - 24:15Actually, why don't I bring that up?
-
24:24 - 24:31This was developed by a company
called Improbable -
24:31 - 24:33and they wanted to…
-
24:35 - 24:40They had billions of metrics coming from
hundreds of Prometheus servers. -
24:41 - 24:47They developed this in collaboration with
the Prometheus team to build -
24:47 - 24:49a super highly scalable Prometheus server.
-
24:50 - 24:56Prometheus itself stores the incoming
metrics data in a write-ahead log -
24:56 - 25:00and then every two hours, it creates
a compaction cycle -
25:00 - 25:03and it creates an immutable time series block
of data which is -
25:04 - 25:07all the time series blocks themselves
-
25:07 - 25:10and then an index into that data.
-
25:11 - 25:14Those two-hour windows are all immutable
-
25:14 - 25:16so what Thanos does,
it has a little sidecar binary that -
25:16 - 25:19
watches for those new directories and -
25:19 - 25:21uploads them into a blob store.
-
25:21 - 25:26So you could put them in S3 or Minio or
some other simple object storage. -
25:26 - 25:33And then now you have all of your data,
all of this index data already -
25:33 - 25:35ready to go
-
25:35 - 25:38and then the final sidecar creates
a little mesh cluster that can read from -
25:38 - 25:40all of those S3 blocks.
-
25:40 - 25:48Now, you have this super global view
all stored in a big bucket storage and -
25:50 - 25:52things like S3 or Minio are…
-
25:53 - 25:58Bucket storage is not a database, so it's
operationally a little easier to operate. -
25:58 - 26:02Plus, now we have all this data in
a bucket store and -
26:03 - 26:06the Thanos sidecars can talk to each other
-
26:07 - 26:08We can now have a single entry point.
-
26:08 - 26:12You can query Thanos and Thanos will
distribute your query -
26:12 - 26:14across all your Prometheus servers.
-
26:14 - 26:16So now you can do global queries across
all of your servers. -
26:18 - 26:22But it's very new, they just released
their first release candidate yesterday. -
26:24 - 26:27It is looking to be like
the coolest thing ever -
26:27 - 26:29for running large scale Prometheus.
-
26:30 - 26:35Here's an example of how that is laid out.
-
26:37 - 26:39This will let you have
a billion-metric Prometheus cluster. -
26:43 - 26:44And it's got a bunch of other
cool features. -
26:45 - 26:47Any more questions?
-
26:55 - 26:57Alright, maybe I'll do
a quick little demo. -
27:05 - 27:11Here is a Prometheus server that is
provided by this group -
27:11 - 27:14that just does an Ansible deployment
for Prometheus. -
27:15 - 27:20And you can just simply query
for something like 'node_cpu'. -
27:21 - 27:23This is actually the old name for
that metric. -
27:24 - 27:26And you can see, here's exactly
-
27:28 - 27:31the CPU metrics from some servers.
-
27:33 - 27:35It's just a bunch of stuff.
-
27:35 - 27:37There's actually two servers here,
-
27:37 - 27:41there's an influx cloud alchemy and
there is a demo cloud alchemy. -
27:42 - 27:44[Q] Can you zoom in?
[A] Oh yeah sure. -
27:53 - 27:58So you can see all the extra labels.
-
28:00 - 28:02We can also do some things like…
-
28:02 - 28:04Let's take a look at, say,
the last 30 seconds. -
28:05 - 28:07We can just add this little time window.
-
28:08 - 28:11It's called a range request,
and you can see -
28:11 - 28:12the individual samples.
-
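A range request is written by adding a time window in square brackets, e.g. (using the old metric name from this demo):

    node_cpu[30s]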
28:13 - 28:15You can see that all Prometheus is doing
-
28:15 - 28:18is storing the sample and a timestamp.
-
28:18 - 28:23All the timestamps are in milliseconds
and it's all epoch -
28:23 - 28:25so it's super easy to manipulate.
-
28:26 - 28:30But, looking at the individual samples and
looking at this, you can see that -
28:30 - 28:36if we go back and just take…
and look at the raw data, and -
28:36 - 28:38we graph the raw data…
-
28:40 - 28:43Oops, that's a syntax error.
-
28:44 - 28:47And we look at this graph…
Come on. -
28:47 - 28:48Here we go.
-
28:48 - 28:50Well, that's kind of boring, it's just
a flat line because -
28:51 - 28:53it's just a counter going up very slowly.
-
28:53 - 28:56What we really want to do, is we want to
take, and we want to apply -
28:57 - 28:59a rate function to this counter.
-
29:00 - 29:04So let's look at the rate over
the last one minute. -
29:04 - 29:07There we go, now we get
a nice little graph. -
29:08 - 29:14And so you can see that this is
0.6 CPU seconds per second -
29:15 - 29:18for that set of labels.
-
29:19 - 29:21But this is pretty noisy, there's a lot
of lines on this graph and -
29:21 - 29:23there's still a lot of data here.
-
29:23 - 29:26So let's start doing some filtering.
-
29:26 - 29:29One of the things we see here is,
well, there's idle. -
29:30 - 29:32We don't really care about
the machine being idle, -
29:33 - 29:35so let's just add a label filter
so we can say -
29:36 - 29:42'mode', it's the label name, and it's not
equal to 'idle'. Done. -
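The query at this point of the demo looks like:

    rate(node_cpu{mode!="idle"}[1m])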
29:45 - 29:48And if I could type…
What did I miss? -
29:51 - 29:51Here we go.
-
29:51 - 29:54So now we've removed idle from the graph.
-
29:54 - 29:56That looks a little more sane.
-
29:57 - 30:01Oh, wow, look at that, that's a nice
big spike in user space on the influx server -
30:01 - 30:02Okay…
-
30:04 - 30:05Well, that's pretty cool.
-
30:06 - 30:06What about…
-
30:07 - 30:09This is still quite a lot of lines.
-
30:09 - 30:14We can just sum up that rate.
-
30:11 - 30:14How much CPU is in use total across
all the servers that we have. -
30:14 - 30:24We can just see that there is
a sum total of 0.6 CPU seconds/s -
30:25 - 30:28across the servers we have.
-
30:28 - 30:31But that's a little too coarse.
-
30:32 - 30:37What if we want to see it by instance?
-
30:39 - 30:42Now, we can see the two servers,
we can see -
30:43 - 30:45that we're left with just that label.
-
30:46 - 30:50The instance labels are the influx instance
and the demo instance. -
30:50 - 30:53That's a super easy way to see that,
-
30:54 - 30:57but we can also do this
the other way around. -
30:57 - 31:03We can say 'without (mode,cpu)' so
we can drop those modes and -
31:03 - 31:05see all the labels that we have.
-
31:05 - 31:12We can still see the environment label
and the job label on the resulting data. -
31:12 - 31:16You can go either way
with the summary functions. -
31:16 - 31:20There's a whole bunch of different functions
-
31:21 - 31:23and it's all in our documentation.
-
31:25 - 31:30But what if we want to see it…
-
31:31 - 31:34What if we want to see which CPUs
are in use? -
31:34 - 31:37Now we can see that it's only CPU0
-
31:37 - 31:40because apparently these are only
1-core instances. -
31:42 - 31:47You can add/remove labels and do
all these queries. -
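To recap the query progression from this demo (all using the old node_cpu metric name shown on screen):

    node_cpu                                                # raw counter values
    node_cpu[30s]                                           # range request: raw samples
    rate(node_cpu[1m])                                      # per-second rate
    rate(node_cpu{mode!="idle"}[1m])                        # drop idle time
    sum(rate(node_cpu{mode!="idle"}[1m]))                   # one total across everything
    sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))    # per-server totals
    sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))  # keep all other labels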
31:50 - 31:52Any other questions so far?
-
31:54 - 31:59[Q] I don't have a question, but I have
something to add. -
31:59 - 32:03Prometheus is really nice, but it's
a lot better if you combine it -
with Grafana. -
-
32:05 - 32:06[A] Yes, yes.
-
32:07 - 32:12In the beginning, when we were creating
Prometheus, we actually built -
32:13 - 32:15a piece of dashboard software called
PromDash. -
32:16 - 32:21It was a simple little Ruby on Rails app
to create dashboards -
32:21 - 32:23and it had a bunch of JavaScript.
-
32:23 - 32:24And then Grafana came out.
-
32:25 - 32:26And we're like
-
32:26 - 32:30"Oh, that's interesting. It doesn't support
Prometheus" so we were like -
32:30 - 32:32"Hey, can you support Prometheus"
-
32:32 - 32:34and they're like "Yeah, we've got
a REST API, get the data, done" -
32:36 - 32:38Now Grafana supports Prometheus and
we're like -
32:40 - 32:42"Well, promdash, this is crap, delete".
-
32:44 - 32:46The Prometheus development team,
-
32:46 - 32:49we're all backend developers
and SREs and -
32:50 - 32:51we have no JavaScript skills at all.
-
32:53 - 32:55So we're like "Let somebody deal
with that". -
32:55 - 32:58One of the nice things about working on
this kind of project is -
32:58 - 33:02we can do things that we're good at,
and we don't try… -
33:02 - 33:05We don't have any marketing people,
it's just an open source project, -
33:06 - 33:09there's no single company behind Prometheus.
-
33:10 - 33:14I work for GitLab, Improbable paid for
the Thanos system, -
33:16 - 33:25other companies like Red Hat now pay
people that used to work on CoreOS to -
33:25 - 33:27work on Prometheus.
-
33:27 - 33:30There's lots and lots of collaboration
between many companies -
33:30 - 33:33to build the Prometheus ecosystem.
-
33:36 - 33:37But yeah, Grafana is great.
-
33:39 - 33:45Actually, Grafana now has
two full-time Prometheus developers. -
33:49 - 33:51Alright, that's it.
-
33:53 - 33:57[Applause]
Title: Metrics-Based Monitoring with Prometheus
Description: Talk given by Ben Kochie at MiniDebConf Hamburg 18
https://meetings-archive.debian.net/pub/debian-meetings/2018/miniconf-hamburg/2018-05-19/metrics_based_monitoring.webm
Video Language: English
Team: Debconf
Project: 2018_mini-debconf-hamburg
Duration: 34:03