WEBVTT

00:00:05.901 --> 00:00:10.531
So, we had a talk by a non-GitLab person
about GitLab.

00:00:10.531 --> 00:00:13.057
Now, we have a talk by a GitLab person
on non-GitLab.

00:00:13.202 --> 00:00:14.603
Something like that?

00:00:15.894 --> 00:00:19.393
The CCCHH hackerspace is now open,

00:00:19.946 --> 00:00:22.118
from now on if you want to go there,
that's the announcement.

00:00:22.471 --> 00:00:25.871
And the next talk will be by Ben Kochie

00:00:26.009 --> 00:00:28.265
on metrics-based monitoring
with Prometheus.

00:00:28.748 --> 00:00:30.212
Welcome.

00:00:30.545 --> 00:00:33.133
[Applause]

00:00:35.395 --> 00:00:36.578
Alright, so

00:00:36.886 --> 00:00:39.371
my name is Ben Kochie

00:00:39.845 --> 00:00:43.870
I work on DevOps features for GitLab

00:00:44.327 --> 00:00:48.293
and apart from working for GitLab, I also work
on the open source Prometheus project.

00:00:51.163 --> 00:00:53.595
I live in Berlin and I've been using
Debian since ???

00:00:54.353 --> 00:00:56.797
yes, quite a long time.

00:00:58.806 --> 00:01:01.018
So, what is Metrics-based Monitoring?

00:01:02.638 --> 00:01:05.165
If you're running software in production,

00:01:05.885 --> 00:01:07.826
you probably want to monitor it,

00:01:08.212 --> 00:01:10.547
because if you don't monitor it, you don't
know if it's working right.

00:01:13.278 --> 00:01:16.112
??? break down into two categories:

00:01:16.112 --> 00:01:19.146
there's blackbox monitoring and
there's whitebox monitoring.

00:01:19.500 --> 00:01:24.582
Blackbox monitoring is treating
your software like a blackbox.

00:01:24.757 --> 00:01:27.158
It's just checks to see, like,

00:01:27.447 --> 00:01:29.483
is it responding, or does it ping

00:01:30.023 --> 00:01:33.588
or ??? HTTP requests

00:01:34.348 --> 00:01:35.669
[mic turned on]

00:01:37.760 --> 00:01:41.379
Ah, there we go, much better.

00:01:46.592 --> 00:01:51.898
So, blackbox monitoring is a probe,

00:01:51.898 --> 00:01:54.684
it just kind of looks from the outside
to your software

00:01:55.454 --> 00:01:57.432
and it has no knowledge of the internals

00:01:58.133 --> 00:02:00.699
and it's really good for end to end testing.

00:02:00.942 --> 00:02:03.560
So if you've got a fairly complicated
service,

00:02:03.990 --> 00:02:06.426
you come in from the outside, you go
through the load balancer,

00:02:06.721 --> 00:02:07.975
you hit the API server,

00:02:07.975 --> 00:02:10.152
the API server might hit a database,

00:02:10.675 --> 00:02:13.054
and you go all the way through
to the back of the stack

00:02:13.224 --> 00:02:14.536
and all the way back out

00:02:14.560 --> 00:02:16.294
so you know that everything is working
end to end.

00:02:16.518 --> 00:02:18.768
But you only know about it
for that one request.

00:02:19.036 --> 00:02:22.429
So in order to find out if your service
is working,

00:02:22.831 --> 00:02:27.128
from the end to end, for every single
request,

00:02:27.475 --> 00:02:29.523
this requires whitebox instrumentation.

00:02:29.836 --> 00:02:33.965
So, basically, every event that happens
inside your software,

00:02:33.973 --> 00:02:36.517
inside a serving stack,

00:02:36.817 --> 00:02:39.807
gets collected and gets counted,

00:02:40.037 --> 00:02:43.466
so you know that every request hits
the load balancer,

00:02:43.466 --> 00:02:45.656
every request hits your application
service,

00:02:45.972 --> 00:02:47.329
every request hits the database.

00:02:47.789 --> 00:02:50.832
You know that everything matches up

00:02:50.997 --> 00:02:55.764
and this is called whitebox, or
metrics-based monitoring.

00:02:56.010 --> 00:02:57.688
There are different examples of, like,

00:02:57.913 --> 00:03:02.392
the kind of software that does blackbox
and whitebox monitoring.

00:03:02.572 --> 00:03:06.680
So you have software like Nagios, where
you can configure checks,

00:03:08.826 --> 00:03:10.012
or Pingdom,

00:03:10.211 --> 00:03:12.347
Pingdom will ping your website.

00:03:12.971 --> 00:03:15.307
And then there is metrics-based monitoring,

00:03:15.517 --> 00:03:19.293
things like Prometheus, things like
the TICK stack from InfluxData,

00:03:19.610 --> 00:03:22.728
New Relic and other commercial solutions

00:03:23.027 --> 00:03:25.480
but of course I like to talk about
the open source solutions.

00:03:25.748 --> 00:03:28.379
We're gonna talk a little bit about
Prometheus.

00:03:28.819 --> 00:03:31.955
Prometheus came out of the idea that

00:03:32.343 --> 00:03:37.555
we needed a monitoring system that could
collect all this whitebox metric data

00:03:37.941 --> 00:03:40.786
and do something useful with it.

00:03:40.915 --> 00:03:42.667
Not just give us a pretty graph, but
we also want to be able to

00:03:42.985 --> 00:03:44.189
alert on it.

00:03:44.189 --> 00:03:45.988
So we needed both

00:03:49.872 --> 00:03:54.068
a data gathering and an analytics system
in the same instance.

00:03:54.148 --> 00:03:58.821
To do this, we built this thing and
we looked at the way that

00:03:59.014 --> 00:04:01.835
data was being generated
by the applications

00:04:02.369 --> 00:04:05.204
and there are advantages and
disadvantages to this

00:04:05.204 --> 00:04:07.250
push vs. pull model for metrics.

00:04:07.384 --> 00:04:09.701
We decided to go with the pulling model

00:04:09.938 --> 00:04:13.953
because there are some slight advantages
for pulling over pushing.

00:04:16.323 --> 00:04:18.163
With pulling, you get this free
blackbox check

00:04:18.471 --> 00:04:20.151
that the application is running.

00:04:20.527 --> 00:04:24.319
When you pull your application, you know
that the process is running.

00:04:24.532 --> 00:04:27.529
If you are doing push-based, you can't
tell the difference between

00:04:27.851 --> 00:04:31.521
your application doing no work and
your application not running.

00:04:32.416 --> 00:04:33.900
So you don't know if it's stuck,

00:04:34.140 --> 00:04:37.878
or is it just not having to do any work.

00:04:42.671 --> 00:04:48.940
With pulling, the pulling system knows
the state of your network.

00:04:49.850 --> 00:04:52.522
If you have a defined set of services,

00:04:52.887 --> 00:04:56.788
that inventory drives what should be there.

00:04:58.274 --> 00:05:00.080
Again, it's like, the disappearing,

00:05:00.288 --> 00:05:03.950
is the process dead, or is it just
not doing anything?

00:05:04.205 --> 00:05:07.117
With polling, you know for a fact
what processes should be there,

00:05:07.593 --> 00:05:10.900
and it's a bit of an advantage there.

00:05:11.138 --> 00:05:12.913
With pulling, there's really easy testing.

00:05:13.117 --> 00:05:16.295
With push-based metrics, you have to
figure out

00:05:16.505 --> 00:05:18.843
if you want to test a new version of
the monitoring system or

00:05:19.058 --> 00:05:21.262
you want to test something new,

00:05:21.420 --> 00:05:24.129
you have to tear off a copy of the data.

00:05:24.370 --> 00:05:27.652
With pulling, you can just set up
another instance of your monitoring

00:05:27.856 --> 00:05:29.189
and just test it.

00:05:29.714 --> 00:05:31.321
Or you don't even have,

00:05:31.473 --> 00:05:33.194
it doesn't even have to be monitoring,
you can just use curl

00:05:33.199 --> 00:05:35.977
to pull the metrics endpoint.

00:05:38.417 --> 00:05:40.436
It's significantly easier to test.

00:05:40.436 --> 00:05:42.977
The other thing with the…

00:05:45.999 --> 00:05:48.109
The other nice thing is that
the client is really simple.

00:05:48.481 --> 00:05:51.068
The client doesn't have to know
where the monitoring system is.

00:05:51.272 --> 00:05:53.669
It doesn't have to know about ???

00:05:53.820 --> 00:05:55.720
It just has to sit and collect the data
about itself.

00:05:55.882 --> 00:05:58.708
So it doesn't have to know anything about
the topology of the network.

00:05:59.134 --> 00:06:03.363
As an application developer, if you're
writing a DNS server or

00:06:03.724 --> 00:06:05.572
some other piece of software,

00:06:05.896 --> 00:06:09.562
you don't have to know anything about
monitoring software,

00:06:09.803 --> 00:06:12.217
you can just implement it inside
your application and

00:06:12.683 --> 00:06:17.058
the monitoring software, whether it's
Prometheus or something else,

00:06:17.414 --> 00:06:19.332
can just come and collect that data for you.
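
NOTE
A minimal sketch of the pull model described above, in Python for illustration only
(Prometheus itself is written in Go). The metric name demo_requests_total, the
loopback address and all helper names are hypothetical. The application only
serves text about itself on /metrics; the scraper, standing in for Prometheus
or curl, decides when and where to collect.
```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
REQUESTS_TOTAL = 0  # hypothetical counter the application keeps about itself
class MetricsHandler(BaseHTTPRequestHandler):
    """Application side: knows nothing about who scrapes it."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_TOTAL}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, format, *args):
        pass  # keep the demo quiet
def scrape(url):
    """Monitoring side: a plain HTTP GET, the same thing curl would do."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode()
def demo():
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```
If the scrape raises a connection error, that alone says the process is not
running, which is the free blackbox check mentioned earlier in the talk.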

00:06:20.210 --> 00:06:23.611
That's kind of similar to a very old
monitoring system called SNMP,

00:06:23.832 --> 00:06:28.530
but SNMP has a significantly less friendly
data model for developers.

00:06:30.010 --> 00:06:33.556
This is the basic layout
of a Prometheus server.

00:06:33.921 --> 00:06:35.918
At the core, there's a Prometheus server

00:06:36.278 --> 00:06:40.302
and it deals with all the data collection
and analytics.

00:06:42.941 --> 00:06:46.697
Basically, this one binary,
it's all written in golang.

00:06:46.867 --> 00:06:48.559
It's a single binary.

00:06:48.559 --> 00:06:50.823
It knows how to read from your inventory,

00:06:50.823 --> 00:06:52.659
there's a bunch of different methods,
whether you've got

00:06:53.121 --> 00:06:58.843
a Kubernetes cluster or a cloud platform

00:07:00.234 --> 00:07:03.800
or you have your own customized thing
with Ansible.

00:07:05.380 --> 00:07:09.750
Ansible can take your layout, drop that
into a config file and

00:07:10.639 --> 00:07:11.902
Prometheus can pick that up.

00:07:15.594 --> 00:07:18.812
Once it has the layout, it goes out and
collects all the data.

00:07:18.844 --> 00:07:24.254
It has a storage and a time series
database to store all that data locally.

00:07:24.462 --> 00:07:28.228
It has a thing called PromQL, which is
a query language designed

00:07:28.452 --> 00:07:31.033
for metrics and analytics.

00:07:31.500 --> 00:07:36.779
From that PromQL, you can add frontends
that will,

00:07:36.985 --> 00:07:39.319
whether it's a simple API client
to run reports,

00:07:40.019 --> 00:07:42.942
you can use things like Grafana
for creating dashboards,

00:07:43.124 --> 00:07:44.834
it's got a simple web UI built in.

00:07:45.031 --> 00:07:46.920
You can plug in anything you want
on that side.

00:07:48.693 --> 00:07:54.478
And then, it also has the ability to
continuously execute queries

00:07:54.625 --> 00:07:56.191
called "recording rules"

00:07:56.832 --> 00:07:59.103
and these recording rules have
two different modes.

00:07:59.103 --> 00:08:01.871
You can either record, you can take
a query

00:08:02.150 --> 00:08:03.711
and it will generate new data
from that query

00:08:04.072 --> 00:08:06.967
or you can take a query, and
if it returns results,

00:08:07.354 --> 00:08:08.910
it will return an alert.

00:08:09.176 --> 00:08:12.506
That alert is a push message
to the alert manager.

00:08:12.813 --> 00:08:18.969
This allows us to separate the generating
of alerts from the routing of alerts.

00:08:19.153 --> 00:08:24.259
You can have one or hundreds of Prometheus
services, all generating alerts

00:08:24.599 --> 00:08:28.807
and it goes into an alert manager cluster
and sends, does the deduplication

00:08:29.329 --> 00:08:30.684
and the routing to the human

00:08:30.879 --> 00:08:34.138
because, of course, the thing
that we want is

00:08:34.927 --> 00:08:38.797
we had dashboards with graphs, but
in order to find out if something is broken

00:08:38.966 --> 00:08:40.650
you had to have a human
looking at the graph.

00:08:40.830 --> 00:08:42.942
With Prometheus, we don't have to do that
anymore,

00:08:43.103 --> 00:08:47.638
we can simply let the software tell us
that we need to go investigate

00:08:47.638 --> 00:08:48.650
our problems.

00:08:48.778 --> 00:08:50.831
We don't have to sit there and
stare at dashboards all day,

00:08:51.035 --> 00:08:52.380
because that's really boring.

00:08:54.519 --> 00:08:57.556
What does it look like to actually
get data into Prometheus?

00:08:57.587 --> 00:09:02.140
This is a very basic output
of a Prometheus metric.

00:09:02.613 --> 00:09:03.930
This is a very simple thing.

00:09:04.086 --> 00:09:07.572
If you know much about
the Linux kernel,

00:09:06.883 --> 00:09:12.779
the Linux kernel tracks, in /proc/stat,
the state of all the CPUs

00:09:12.779 --> 00:09:14.459
in your system

00:09:14.662 --> 00:09:18.078
and we express this by having
the name of the metric, which is

00:09:22.449 --> 00:09:26.123
'node_cpu_seconds_total' and so
this is a self-describing metric,

00:09:26.547 --> 00:09:28.375
like you can just read the metric's name

00:09:28.530 --> 00:09:30.845
and you understand a little bit about
what's going on here.

00:09:33.241 --> 00:09:38.521
The Linux kernel and other kernels track
their usage by the number of seconds

00:09:38.859 --> 00:09:41.004
spent doing different things and

00:09:41.199 --> 00:09:46.721
that could be, whether it's in system or
user space or IRQs

00:09:47.065 --> 00:09:48.690
or iowait or idle.

00:09:48.908 --> 00:09:51.280
Actually, the kernel tracks how much
idle time it has.

00:09:53.660 --> 00:09:55.309
It also tracks it by the number of CPUs.

00:09:55.997 --> 00:10:00.067
With other monitoring systems, they used
to do this with a tree structure

00:10:01.021 --> 00:10:03.688
and this caused a lot of problems,
like,

00:10:03.854 --> 00:10:09.291
how do you mix and match data? So,
by switching from

00:10:10.043 --> 00:10:12.484
a tree structure to a tag-based structure,

00:10:12.985 --> 00:10:16.896
we can do some really interesting
powerful data analytics.
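
NOTE
A small sketch of why tags help, assuming a few made-up sample lines in the
Prometheus text format (the numbers are invented; the parser is a toy that
only handles lines shaped like these). Because every sample carries its
labels, the same data can be summed by mode or by cpu without reshaping it.
```python
import re
from collections import defaultdict
# Invented sample values, for illustration only.
SAMPLE = (
    'node_cpu_seconds_total{cpu="0",mode="idle"} 1000.0\n'
    'node_cpu_seconds_total{cpu="0",mode="user"} 50.0\n'
    'node_cpu_seconds_total{cpu="1",mode="idle"} 900.0\n'
    'node_cpu_seconds_total{cpu="1",mode="user"} 150.0\n'
)
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')
def parse(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels)), float(value)))
    return samples
def sum_by(samples, label):
    """Group samples by one label and add up their values."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)
```
With a tree-shaped name such as host.cpu0.idle, the grouping key is fixed by
its position in the path; here sum_by(parse(SAMPLE), "mode") and
sum_by(parse(SAMPLE), "cpu") are equally easy.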

00:10:18.170 --> 00:10:25.170
Here's a nice example of taking
those CPU seconds counters

00:10:26.101 --> 00:10:30.198
and then converting them into a graph
by using PromQL.
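
NOTE
The slide is not visible in the transcript; a query along the lines of
rate(node_cpu_seconds_total[5m]) is an assumption about what it showed.
The sketch below is a simplified Python version of the idea: turn an
ever-growing counter into a per-second rate, treating a drop in value as a
counter reset. PromQL's real rate() also extrapolates to the window edges,
which this sketch does not do.
```python
def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter restarted from zero, so count what it reached since.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```
For CPU counters the result is "seconds of CPU used per second", which is
what ends up on the graph as a utilisation between 0 and 1 per core.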

00:10:32.724 --> 00:10:34.830
Now we can get into
Metrics-Based Alerting.

00:10:35.315 --> 00:10:37.665
Now we have this graph, we have this thing

00:10:37.847 --> 00:10:39.497
we can look and see here

00:10:39.999 --> 00:10:42.920
"Oh there is some little spike here,
we might want to know about that."

00:10:43.191 --> 00:10:45.849
Now we can get into Metrics-Based
Alerting.

00:10:46.281 --> 00:10:51.128
I used to be a site reliability engineer,
I'm still a site reliability engineer at heart

00:10:52.371 --> 00:11:00.362
and we have this concept of the things that
you need to run a site or a service reliably.

00:11:00.910 --> 00:11:03.231
The most important thing you need is
down at the bottom,

00:11:03.569 --> 00:11:06.869
Monitoring, because if you don't have
monitoring of your service,

00:11:07.108 --> 00:11:08.688
how do you know it's even working?

00:11:11.628 --> 00:11:15.235
There's a couple of techniques here, and
we want to alert based on data

00:11:15.693 --> 00:11:17.644
and not just those end to end tests.

00:11:18.796 --> 00:11:23.387
There's a couple of techniques, a thing
called the RED method

00:11:23.555 --> 00:11:25.141
and there's a thing called the USE method

00:11:25.588 --> 00:11:28.400
and there are a couple of nice
blog posts about this

00:11:28.695 --> 00:11:31.306
and basically it defines that, for example,

00:11:31.484 --> 00:11:35.000
the RED method talks about the requests
that your system is handling

00:11:36.421 --> 00:11:37.604
There are three things:

00:11:37.775 --> 00:11:40.073
There's the number of requests, there's
the number of errors

00:11:40.268 --> 00:11:42.306
and there's how long it takes, the duration.

00:11:42.868 --> 00:11:45.000
With the combination of these three things

00:11:45.341 --> 00:11:48.368
you can determine most of
what your users see

00:11:48.712 --> 00:11:53.616
"Did my request go through? Did it
return an error? Was it fast?"

00:11:55.492 --> 00:11:57.971
Most people, that's all they care about.

00:11:58.205 --> 00:12:01.965
"I made a request to a website and
it came back and it was fast."

00:12:04.975 --> 00:12:06.517
It's a very simple method of just, like,

00:12:07.162 --> 00:12:10.109
those are the important things to
determine if your site is healthy.

00:12:12.193 --> 00:12:17.045
But we can go back to some more
traditional, sysadmin style alerts

00:12:17.309 --> 00:12:20.553
this is basically taking the filesystem
available space,

00:12:20.824 --> 00:12:26.522
divided by the filesystem size, that becomes
the ratio of filesystem availability

00:12:26.697 --> 00:12:27.523
from 0 to 1.

00:12:28.241 --> 00:12:30.759
Multiply it by 100, we now have
a percentage

00:12:31.016 --> 00:12:35.659
and if it's less than or equal to 1%
for 15 minutes,

00:12:35.940 --> 00:12:41.782
this is less than 1% space, we should tell
a sysadmin to go check

00:12:41.957 --> 00:12:44.290
to find out why the filesystem
has filled up.

00:12:44.635 --> 00:12:46.168
It's super nice and simple.
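
NOTE
A sketch of the alert logic just described, in Python rather than as a
Prometheus rule. The slide is not in the transcript; with node_exporter
metric names the expression would plausibly be
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100, compared
against 1 and held for 15 minutes, but that is an assumption. The function
names and history layout below are hypothetical.
```python
def free_percent(avail_bytes, size_bytes):
    """Available space as a percentage of the filesystem size."""
    return avail_bytes / size_bytes * 100
def should_fire(history, threshold_percent=1.0, hold_seconds=15 * 60):
    """history: list of (timestamp_seconds, avail_bytes, size_bytes), oldest first.
    Fire only if the condition has held continuously for hold_seconds."""
    breach_start = None
    for timestamp, avail, size in history:
        if free_percent(avail, size) <= threshold_percent:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # condition cleared, so the clock restarts
    return breach_start is not None and history[-1][0] - breach_start >= hold_seconds
```
The hold time is what keeps a brief dip below the threshold from paging
anyone.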

00:12:46.494 --> 00:12:49.685
We can also tag, we can include…

00:12:51.418 --> 00:12:58.232
Every alert includes all the extraneous
labels that Prometheus adds to your metrics

00:12:59.488 --> 00:13:05.461
When you add a metric in Prometheus, if
we go back and we look at this metric.

00:13:06.009 --> 00:13:10.803
This metric only contains the information
about the internals of the application

00:13:12.942 --> 00:13:14.995
anything about, like, what server it's on,
is it running in a container,

00:13:15.186 --> 00:13:18.724
what cluster does it come from,
what continent is it on,

00:13:17.702 --> 00:13:22.280
that's all extra annotations that are
added by the Prometheus server

00:13:22.619 --> 00:13:23.949
at discovery time.

00:13:24.514 --> 00:13:28.347
Unfortunately I don't have a good example
of what those labels look like

00:13:28.514 --> 00:13:34.180
but every metric gets annotated
with location information.

00:13:36.904 --> 00:13:41.121
That location information also comes through
as labels in the alert

00:13:41.300 --> 00:13:48.074
so, if you have a message coming
into your alert manager,

00:13:48.269 --> 00:13:49.899
the alert manager can look and go

00:13:50.093 --> 00:13:51.621
"Oh, that's coming from this datacenter"

00:13:52.007 --> 00:13:58.905
and it can include that in the email or
IRC message or SMS message.

00:13:59.069 --> 00:14:00.772
So you can include

00:13:59.271 --> 00:14:04.422
"Filesystem is out of space on this host
from this datacenter"

00:14:04.557 --> 00:14:07.340
All these labels get passed through and
then you can append

00:14:07.491 --> 00:14:13.292
"severity: critical" to that alert and
include that in the message to the human

00:14:13.693 --> 00:14:16.775
because of course, this is how you define…

00:14:16.940 --> 00:14:20.857
Getting the message from the monitoring
to the human.
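
NOTE
A sketch of how labels travel from the target to the message, assuming
hypothetical label names (instance, datacenter, mountpoint) and Python's
str.format as a stand-in for the template language the alert manager
actually uses. Target labels come from discovery, rule labels such as
severity are appended by the alerting rule.
```python
def build_alert(alertname, target_labels, rule_labels, summary_template):
    """Merge discovery labels with rule labels and render a summary from them."""
    labels = {"alertname": alertname, **target_labels, **rule_labels}
    return {
        "labels": labels,
        "annotations": {"summary": summary_template.format(**labels)},
    }
# Hypothetical values, for illustration only.
EXAMPLE = build_alert(
    "FilesystemFull",
    {"instance": "db1:9100", "datacenter": "fra1", "mountpoint": "/var"},
    {"severity": "critical"},
    "Filesystem {mountpoint} is out of space on {instance} in {datacenter}",
)
```
The application never had to know which datacenter it runs in; that
information was attached on the way through.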

00:14:22.197 --> 00:14:23.850
You can even include nice things like,

00:14:24.027 --> 00:14:27.508
if you've got documentation, you can
include a link to the documentation

00:14:27.620 --> 00:14:28.686
as an annotation

00:14:29.079 --> 00:14:33.438
and the alert manager can take that
basic URL and, you know,

00:14:33.467 --> 00:14:36.806
massage it into whatever it needs
to look like to actually get
to look like to actually get

00:14:37.135 --> 00:14:40.417
the operator to the correct documentation.

00:14:42.117 --> 00:14:43.450
We can also do more fun things:

00:14:43.657 --> 00:14:45.567
since we actually are not just checking

00:14:45.746 --> 00:14:48.523
what is the space right now,
we're tracking data over time,

00:14:49.232 --> 00:14:50.827
we can use 'predict_linear'.

00:14:52.406 --> 00:14:55.255
'predict_linear' just takes and does
a simple linear regression.

00:14:55.749 --> 00:15:00.270
This example takes the filesystem
available space over the last hour and

00:15:00.865 --> 00:15:02.453
does a linear regression.

00:15:02.785 --> 00:15:08.536
Prediction says "Well, it's going that way
and four hours from now,

00:15:08.749 --> 00:15:13.112
based on one hour of history, it's gonna
be less than 0, which means full".

00:15:13.667 --> 00:15:20.645
We know that within the next four hours,
the disk is gonna be full

00:15:20.874 --> 00:15:24.658
so we can tell the operator ahead of time
that it's gonna be full

00:15:24.833 --> 00:15:26.517
and not just tell them that it's full
right now.

00:15:27.113 --> 00:15:32.303
They have some window of ability
to fix it before it fails.
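
NOTE
A sketch of the regression idea behind 'predict_linear', in plain Python.
It fits an ordinary least-squares line through one hour of samples and
evaluates it four hours past the newest sample. The history is invented,
and PromQL's own function evaluates relative to the query time rather than
the newest sample, so treat this as an approximation of the concept.
```python
def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp_seconds, value), oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    newest = samples[-1][0]
    xs = [t - newest for t, _ in samples]  # time relative to the newest sample
    ys = [v for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead
# Invented history: a sample every 10 minutes for an hour, losing 1 MiB per second.
MIB = 1024 * 1024
HISTORY = [(t, (10_000 - t) * MIB) for t in range(0, 3601, 600)]
WILL_FILL = predict_linear(HISTORY, 4 * 3600) < 0
```
A predicted value below zero means the line crosses "no space left" inside
the four-hour window, so the alert can go out while there is still room to
act.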

00:15:32.674 --> 00:15:35.369
This is really important because
if you're running a site

00:15:35.689 --> 00:15:41.370
you want to be able to have alerts
that tell you that your system is failing

00:15:41.573 --> 00:15:42.994
before it actually fails.

00:15:43.667 --> 00:15:48.254
Because if it fails, you're out of SLO
or SLA and

00:15:48.404 --> 00:15:50.322
your users are gonna be unhappy

00:15:50.729 --> 00:15:52.493
and you don't want the users to tell you
that your site is down

00:15:52.682 --> 00:15:54.953
you want to know about it before
your users can even tell.

00:15:55.193 --> 00:15:58.491
This allows you to do that.

00:15:58.693 --> 00:16:02.232
And also of course, Prometheus being
a modern system,

00:16:02.735 --> 00:16:05.633
we fully support UTF-8 in all of our labels.

00:16:08.283 --> 00:16:12.101
Here's another one, here's a good example
from the USE method.

00:16:12.490 --> 00:16:16.036
This is a rate of 500 errors coming from
an application

00:16:16.423 --> 00:16:17.813
and you can simply alert that

00:16:17.977 --> 00:16:22.555
there's more than 500 errors per second
coming out of the application

00:16:22.568 --> 00:16:25.670
if that's your threshold for pain

00:16:26.041 --> 00:16:27.298
And you can do other things,

00:16:27.501 --> 00:16:29.338
you can convert that from just
a rate of errors

00:16:29.723 --> 00:16:31.054
to a percentage of errors.

00:16:31.304 --> 00:16:32.605
So you could say

00:16:33.053 --> 00:16:37.336
"I have an SLA of three nines" and so you can say

00:16:37.574 --> 00:16:46.710
"If the rate of errors divided by the rate
of requests is .01,

00:16:47.265 --> 00:16:49.335
or is more than .01, then
that's a problem."

00:16:49.725 --> 00:16:54.589
You can include that level of
error granularity.
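
NOTE
A sketch of the ratio alert described above, with hypothetical function
names. The error budget is simply 1 minus the availability target, so a
threshold of 0.01 matches a 99% target, while three nines (99.9%) leaves a
budget of 0.001.
```python
def error_ratio(errors_per_second, requests_per_second):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if requests_per_second == 0:
        return 0.0
    return errors_per_second / requests_per_second
def error_budget(availability_target):
    """Largest error ratio that still meets the target, e.g. 0.999 gives about 0.001."""
    return 1.0 - availability_target
def breaches_target(errors_per_second, requests_per_second, availability_target):
    ratio = error_ratio(errors_per_second, requests_per_second)
    return ratio > error_budget(availability_target)
```
Alerting on the ratio rather than the raw error rate keeps the threshold
meaningful whether the service handles ten requests per second or ten
thousand.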

00:16:54.797 --> 00:16:57.622
And if you're just doing a blackbox test,

00:16:58.185 --> 00:17:03.727
you wouldn't know this, you would only get
if you got an error from the system,

00:17:04.188 --> 00:17:05.601
then you got another error from the system

00:17:05.826 --> 00:17:06.938
then you fire an alert.

00:17:07.307 --> 00:17:11.847
But if those checks are one minute apart
and you're serving 1000 requests per second

00:17:13.324 --> 00:17:20.987
you could be serving 10,000 errors before
you even get an alert.

00:17:21.579 --> 00:17:22.876
And you might miss it, because

00:17:23.104 --> 00:17:24.993
what if you only get one random error

00:17:25.327 --> 00:17:28.898
and then the next time, you're serving
25% errors,

00:17:29.094 --> 00:17:31.571
you only have a 25% chance of that check
failing again.

00:17:31.800 --> 00:17:36.230
You really need these metrics in order
to get

00:17:36.430 --> 00:17:38.867
proper reports of the status of your system
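
NOTE
The arithmetic behind this argument, as a sketch. It assumes each blackbox
probe is a single independent request and that the probe alerts only after
consecutive failures; the function names are hypothetical.
```python
def errors_between_probes(requests_per_second, probe_interval_seconds, error_fraction):
    """Failed user requests served during one gap between probes."""
    return requests_per_second * probe_interval_seconds * error_fraction
def chance_of_consecutive_failures(error_fraction, probes):
    """Probability that this many independent probes in a row all hit an error."""
    return error_fraction ** probes
```
At 1000 requests per second, a one-minute probe interval and 25% errors,
errors_between_probes(1000, 60, 0.25) is 15000 failed requests per gap,
while two probes in a row both fail with probability 0.25 squared, about
6%. Counting every request inside the application removes that sampling
luck.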

00:17:43.176 --> 00:17:43.850
There are even more options.

00:17:44.051 --> 00:17:45.816
You can slice and dice those labels.

00:17:46.225 --> 00:17:50.056
If you have a label on all of
your applications called 'service'

00:17:50.322 --> 00:17:53.251
you can send that 'service' label through
to the message

00:17:53.523 --> 00:17:55.857
and you can say
"Hey, this service is broken".

00:17:56.073 --> 00:18:00.363
You can include that service label
in your alert messages.

00:18:01.426 --> 00:18:06.723
And that's it, I can go to a demo and Q&amp;A.

00:18:09.881 --> 00:18:13.687
[Applause]

00:18:16.877 --> 00:18:18.417
Any questions so far?

00:18:18.811 --> 00:18:20.071
Or anybody want to see a demo?

00:18:29.517 --> 00:18:35.065
[Q] Hi. Does Prometheus do metric
discovery inside containers

00:18:35.364 --> 00:18:37.476
or do I have to implement the metrics
myself?

00:18:38.184 --> 00:18:45.743
[A] For metrics in containers, there are
already things that expose

00:18:45.887 --> 00:18:49.214
the metrics of the container system
itself.

00:18:49.512 --> 00:18:52.174
There's a utility called 'cAdvisor' and

00:18:52.395 --> 00:18:57.172
cAdvisor takes the Linux cgroup data
and exposes it as metrics

00:18:57.416 --> 00:19:01.164
so you can get data about
how much CPU time is being

00:19:01.164 --> 00:19:02.421
spent in your container,

00:19:02.683 --> 00:19:04.139
how much memory is being spent
by your container.

00:19:04.775 --> 00:19:08.411
[Q] But not about the application,
just about the container usage?

00:19:08.597 --> 00:19:11.355
[A] Right. Because the container
has no idea

00:19:11.698 --> 00:19:15.451
whether your application is written
in Ruby or Go or Python or whatever,

00:19:18.698 --> 00:19:21.602
you have to build that into
your application in order to get the data.

00:19:24.057 --> 00:19:24.307
So for Prometheus,

00:19:27.890 --> 00:19:35.031
we've written client libraries that can be
included in your application directly

00:19:35.195 --> 00:19:36.413
so you can get that data out.

00:19:36.602 --> 00:19:41.460
If you go to the Prometheus website,
we have a whole series of client libraries

00:19:44.936 --> 00:19:48.913
and we cover a pretty good selection
of popular software.

00:19:56.569 --> 00:19:59.537
[Q] What is the current state of
long-term data storage?

00:20:00.803 --> 00:20:01.678
[A] Very good question.

00:20:02.697 --> 00:20:04.513
There's been several…

00:20:04.913 --> 00:20:06.521
There's actually several different methods
of doing this.

00:20:09.653 --> 00:20:14.667
Prometheus stores all this data locally
in its own data storage

00:20:14.667 --> 00:20:15.711
on the local disk.

00:20:16.609 --> 00:20:19.156
But that's only as durable as
that server is durable.

00:20:19.423 --> 00:20:21.627
So if you've got a really durable server,

00:20:21.812 --> 00:20:23.357
you can store as much data as you want,

00:20:23.551 --> 00:20:26.521
you can store years and years of data
locally on the Prometheus server.

00:20:26.653 --> 00:20:28.088
That's not a problem.

00:20:28.781 --> 00:20:32.244
There's a bunch of misconceptions because
of our default

00:20:32.464 --> 00:20:34.492
and the language on our website said

00:20:34.698 --> 00:20:36.160
"It's not long-term storage"

00:20:36.707 --> 00:20:41.841
simply because we leave that problem
up to the person running the server.

00:20:43.389 --> 00:20:46.389
But the time series database
that Prometheus includes

00:20:46.562 --> 00:20:47.739
is actually quite durable.

00:20:49.157 --> 00:20:51.069
But it's only as durable as the server
underneath it.

00:20:51.642 --> 00:20:55.172
So if you've got a very large cluster and
you want really high durability,

00:20:55.800 --> 00:20:57.705
you need to have some kind of
cluster software,

00:20:58.217 --> 00:21:01.106
but because we want Prometheus to be
simple to deploy

00:21:01.701 --> 00:21:02.911
and very simple to operate

00:21:03.355 --> 00:21:06.774
and also very robust.

00:21:06.950 --> 00:21:09.370
We didn't want to include any clustering
in Prometheus itself,

00:21:09.787 --> 00:21:12.078
because anytime you have a clustered
software,

00:21:12.294 --> 00:21:15.100
what happens if your network is
a little wonky?

00:21:15.586 --> 00:21:19.470
The first thing that goes down is
all of your distributed systems fail.

00:21:20.328 --> 00:21:23.048
And building distributed systems to be
really robust is really hard

00:21:23.445 --> 00:21:29.142
so Prometheus is what we call
"uncoordinated distributed systems".

00:21:29.348 --> 00:21:34.048
If you've got two Prometheus servers
monitoring all your targets in an HA mode

00:21:34.273 --> 00:21:36.890
in a cluster, and there's a split brain,

00:21:37.131 --> 00:21:40.363
each Prometheus can see
half of the cluster and

00:21:40.768 --> 00:21:43.557
it can see that the other half
of the cluster is down.

00:21:43.846 --> 00:21:46.740
They can both try to get alerts out
to the alert manager

00:21:46.945 --> 00:21:50.466
and this is a really really robust way of
handling split brains

00:21:50.734 --> 00:21:54.069
and bad network failures and bad problems
in a cluster.

00:21:54.294 --> 00:21:57.163
It's designed to be super super robust

00:21:57.342 --> 00:21:59.844
and so the two individual
Prometheus servers in your cluster

00:22:00.079 --> 00:22:02.009
don't have to talk to each other
to do this,

00:22:02.193 --> 00:22:03.994
they can just do it independently.
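
NOTE
A conceptual sketch of why two uncoordinated servers do not double-page
anyone, assuming both attach the same label set to the same alert. This is
not the alert manager's actual implementation, only the idea: an alert is
identified by its labels, so identical copies collapse into one.
```python
def fingerprint(labels):
    """Identity of an alert: its full set of label name/value pairs."""
    return frozenset(labels.items())
def deduplicate(alerts):
    """Keep the first alert seen for each distinct label set."""
    unique = {}
    for alert in alerts:
        unique.setdefault(fingerprint(alert), alert)
    return list(unique.values())
# Hypothetical alerts sent by two independent servers watching the same target.
FROM_SERVER_A = {"alertname": "InstanceDown", "instance": "api1:8080"}
FROM_SERVER_B = {"instance": "api1:8080", "alertname": "InstanceDown"}
```
Neither server has to know the other exists; the merge happens on the
receiving side.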

00:22:04.377 --> 00:22:07.392
But if you want to be able
to correlate data

00:22:07.604 --> 00:22:09.255
between many different Prometheus servers

00:22:09.439 --> 00:22:12.185
you need an external data storage
to do this.

00:22:12.777 --> 00:22:15.008
And also you may not have
very big servers,

00:22:15.164 --> 00:22:17.126
you might be running your Prometheus
in a container

00:22:17.293 --> 00:22:19.373
and it's only got a little bit of local
storage space

00:22:19.543 --> 00:22:23.217
so you want to send all that data up
to a big cluster datastore

00:22:23.439 --> 00:22:25.124
for a bigger use

00:22:25.707 --> 00:22:27.913
We have several different ways of
doing this.

00:22:28.383 --> 00:22:30.941
There's the classic way which is called
federation

00:22:31.156 --> 00:22:34.875
where you have one Prometheus server
polling in summary data from

00:22:35.083 --> 00:22:36.604
each of the individual Prometheus servers

00:22:36.823 --> 00:22:40.266
and this is useful if you want to run
alerts against data coming

00:22:40.363 --> 00:22:41.578
from multiple Prometheus servers.

00:22:42.488 --> 00:22:44.240
But federation is not replication.

00:22:44.870 --> 00:22:47.488
It only can do a little bit of data from
each Prometheus server.

00:22:47.715 --> 00:22:51.078
If you've got a million metrics on
each Prometheus server,

00:22:51.683 --> 00:22:55.725
you can't poll in a million metrics
and do…

00:22:55.725 --> 00:22:58.850
If you've got 10 of those, you can't
poll in 10 million metrics

00:22:59.011 --> 00:23:00.635
simultaneously into one Prometheus
server.

00:23:00.919 --> 00:23:01.890
It's just too much data.

00:23:02.875 --> 00:23:06.006
There are two others, a couple of other
nice options.

00:23:06.618 --> 00:23:08.923
There's a piece of software called
Cortex.

00:23:09.132 --> 00:23:16.033
Cortex is a Prometheus server that
stores its data in a database.

00:23:16.570 --> 00:23:19.127
Specifically, a distributed database.

00:23:19.395 --> 00:23:24.136
Things that are based on the Google
Bigtable model, like Cassandra or…

00:23:25.892 --> 00:23:27.166
What's the Amazon one?

00:23:30.332 --> 00:23:32.667
Yeah.

00:23:32.682 --> 00:23:33.700
DynamoDB.

00:23:34.193 --> 00:23:37.137
If you have a DynamoDB or a Cassandra
cluster, or one of these other

00:23:37.350 --> 00:23:39.298
really big distributed storage clusters,

00:23:39.713 --> 00:23:44.615
Cortex can run and the Prometheus servers
will stream their data up to Cortex

00:23:44.907 --> 00:23:49.384
and it will keep a copy of that across
all of your Prometheus servers.

00:23:49.596 --> 00:23:51.373
And because it's based on things
like Cassandra,

00:23:51.709 --> 00:23:53.150
it's super scalable.

00:23:53.436 --> 00:23:57.862
But it's a little complex to run and

00:23:57.536 --> 00:24:00.836
many people don't want to run that
complex infrastructure.

00:24:01.254 --> 00:24:06.080
We have another new one, we just blogged
about it yesterday.

00:24:01.564 --> 00:24:06.513
It's a thing called Thanos.

00:24:06.513 --> 00:24:10.596
Thanos is Prometheus at scale.

00:24:11.143 --> 00:24:12.356
Basically, the way it works…

00:24:12.761 --> 00:24:15.063
Actually, why don't I bring that up?

00:24:24.122 --> 00:24:30.519
This was developed by a company
called Improbable

00:24:30.935 --> 00:24:32.632
and they wanted to…

00:24:35.489 --> 00:24:40.063
They had billions of metrics coming from
hundreds of Prometheus servers.

00:24:40.604 --> 00:24:46.645
They developed this in collaboration with
the Prometheus team to build

00:24:47.000 --> 00:24:48.581
a super highly scalable Prometheus server.

00:24:49.877 --> 00:24:55.518
Prometheus itself stores the incoming
metrics data in a write ahead log

00:24:56.008 --> 00:24:59.560
and then every two hours, it creates
a compaction cycle

00:24:59.982 --> 00:25:03.177
and it creates an immutable time series
block of data which is

00:25:03.606 --> 00:25:06.718
all the time series blocks themselves

00:25:07.131 --> 00:25:10.319
and then an index into that data.

00:25:10.849 --> 00:25:13.678
Those two-hour windows are all immutable

00:25:14.037 --> 00:25:19.428
so Thanos has a little sidecar binary that
watches for those new directories and

00:25:19.594 --> 00:25:20.843
uploads them into a blob store.

00:25:21.121 --> 00:25:25.819
So you could put them in S3 or Minio or
some other simple object storage.

00:25:26.301 --> 00:25:32.916
And then now you have all of your data,
all of this index data already

00:25:32.916 --> 00:25:34.816
ready to go

00:25:34.816 --> 00:25:38.489
and then the final sidecar creates
a little mesh cluster that can read from

00:25:38.489 --> 00:25:39.616
all of those S3 blocks.

00:25:40.123 --> 00:25:48.470
Now, you have this super global view
all stored in a big bucket storage and

00:25:49.621 --> 00:25:52.404
things like S3 or Minio are…

00:25:52.995 --> 00:25:57.669
Bucket stores are not databases, so they're
operationally a little easier to run.

00:25:58.405 --> 00:26:02.183
Plus, now we have all this data in
a bucket store and

00:26:02.600 --> 00:26:06.081
the Thanos sidecars can talk to each other.

00:26:06.526 --> 00:26:08.150
We can now have a single entry point.

00:26:08.418 --> 00:26:11.915
You can query Thanos and Thanos will
distribute your query

00:26:12.131 --> 00:26:13.577
across all your Prometheus servers.

00:26:13.792 --> 00:26:16.181
So now you can do global queries across
all of your servers.

00:26:17.696 --> 00:26:22.246
But it's very new, they just released
their first release candidate yesterday.

00:26:23.926 --> 00:26:26.875
It is looking to be like
the coolest thing ever

00:26:27.448 --> 00:26:29.341
for running large scale Prometheus.

00:26:30.315 --> 00:26:34.779
Here's an example of how that is laid out.

00:26:36.840 --> 00:26:39.469
This will let you have
a billion-metric Prometheus cluster.

00:26:42.607 --> 00:26:44.261
And it's got a bunch of other
cool features.

00:26:45.376 --> 00:26:46.672
Any more questions?

00:26:55.353 --> 00:26:57.436
Alright, maybe I'll do
a quick little demo.

00:27:05.407 --> 00:27:10.547
Here is a Prometheus server that is
provided by this group

00:27:10.736 --> 00:27:14.141
that just does an Ansible deployment
for Prometheus.

00:27:15.342 --> 00:27:19.597
And you can just simply query
for something like 'node_cpu'.

00:27:21.077 --> 00:27:23.073
This is actually the old name for
that metric.

00:27:24.083 --> 00:27:25.659
And you can see, here's exactly

00:27:28.078 --> 00:27:31.250
the CPU metrics from some servers.

00:27:32.907 --> 00:27:34.634
It's just a bunch of stuff.

00:27:35.008 --> 00:27:37.060
There's actually two servers here,

00:27:37.445 --> 00:27:40.660
there's an influx Cloud Alchemy and
there is a demo Cloud Alchemy.

00:27:42.011 --> 00:27:43.666
[Q] Can you zoom in?
[A] Oh yeah sure.

00:27:53.135 --> 00:27:57.617
So you can see all the extra labels.

00:28:00.067 --> 00:28:01.644
We can also do some things like…

00:28:02.176 --> 00:28:04.247
Let's take a look at, say,
the last 30 seconds.

00:28:04.614 --> 00:28:07.226
We can just add this little time window.

00:28:07.755 --> 00:28:11.033
It's called a range request,
and you can see

00:28:11.257 --> 00:28:12.398
the individual samples.

00:28:12.651 --> 00:28:14.671
You can see that all Prometheus is doing

00:28:14.825 --> 00:28:17.899
is storing the sample and a timestamp.

00:28:18.472 --> 00:28:23.029
All the timestamps are in milliseconds
and it's all epoch

00:28:23.238 --> 00:28:25.395
so it's super easy to manipulate.
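
NOTE
Editor's sketch, not from the talk: one way to read the point about
millisecond epoch timestamps. The sample pairs and the helper name
to_utc are hypothetical and invented for illustration.
```python
from datetime import datetime, timezone
# Hypothetical (timestamp_ms, value) pairs; the numbers are made up.
samples = [(1532704500000, 1027.5), (1532704515000, 1036.5)]
def to_utc(timestamp_ms):
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
# Plain integer arithmetic gives the gap between two samples in seconds.
elapsed_seconds = (samples[1][0] - samples[0][0]) / 1000
print(to_utc(samples[0][0]).isoformat(), elapsed_seconds)
```
Because the timestamps are plain integers, the spacing between samples
falls out of a subtraction and a division.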

00:28:25.600 --> 00:28:30.169
But, looking at the individual samples and
looking at this, you can see that

00:28:30.493 --> 00:28:36.333
if we go back and just take…
and look at the raw data, and

00:28:36.493 --> 00:28:37.859
we graph the raw data…

00:28:39.961 --> 00:28:43.026
Oops, that's a syntax error.

00:28:44.500 --> 00:28:46.968
And we look at this graph…
Come on.

00:28:47.221 --> 00:28:48.282
Here we go.

00:28:48.481 --> 00:28:50.329
Well, that's kind of boring, it's just
a flat line because

00:28:50.600 --> 00:28:52.795
it's just a counter going up very slowly.

00:28:52.992 --> 00:28:55.999
What we really want to do, is we want to
take, and we want to apply

00:28:57.128 --> 00:28:59.046
a rate function to this counter.

00:28:59.569 --> 00:29:03.635
So let's look at the rate over
the last one minute.

00:29:04.493 --> 00:29:06.772
There we go, now we get
a nice little graph.

00:29:08.308 --> 00:29:14.056
And so you can see that this is
0.6 CPU seconds per second

00:29:15.223 --> 00:29:18.118
for that set of labels.
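
NOTE
Editor's sketch, not from the talk: a simplified view of what taking
a rate over a counter means. The samples and the helper simple_rate
are hypothetical. The real rate() function also handles counter
resets and extrapolates to the window edges, which this sketch skips.
```python
# Hypothetical (timestamp_ms, counter_value) samples for one series; values are invented.
samples = [(0, 100.0), (15000, 109.0), (30000, 118.0), (45000, 127.0), (60000, 136.0)]
def simple_rate(window):
    """Per-second increase between the first and last sample in the window.
    Simplified: assumes at least two samples and no counter resets."""
    (first_ts, first_value), (last_ts, last_value) = window[0], window[-1]
    return (last_value - first_value) / ((last_ts - first_ts) / 1000)
# A counter that grew by 36 CPU seconds over a 60 second window.
cpu_seconds_per_second = simple_rate(samples)
print(cpu_seconds_per_second)
```
The raw counter only ever climbs, so the per-second slope is the
number that is actually worth graphing.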

00:29:18.529 --> 00:29:21.034
But this is pretty noisy, there's a lot
of lines on this graph and

00:29:21.235 --> 00:29:22.621
there's still a lot of data here.

00:29:23.137 --> 00:29:25.842
So let's start doing some filtering.

00:29:26.194 --> 00:29:29.434
One of the things we see here is,
well, there's idle.

00:29:29.720 --> 00:29:32.296
We don't really care about
the machine being idle,

00:29:32.593 --> 00:29:35.492
so let's just add a label filter
so we can say

00:29:35.673 --> 00:29:42.354
'mode', it's the label name, and it's not
equal to 'idle'. Done.

00:29:45.089 --> 00:29:47.560
And if I could type…
What did I miss?

00:29:50.555 --> 00:29:51.126
Here we go.

00:29:51.438 --> 00:29:53.911
So now we've removed idle from the graph.
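
NOTE
Editor's sketch, not from the talk: the effect of a not-equal label
filter, modelled on plain Python data. The series list, its values
and the helper exclude_label are hypothetical; the label names mode,
cpu and instance follow the demo.
```python
# Hypothetical series: a label set plus a per-second rate; values are invented.
series = [
    ({"instance": "influx", "cpu": "cpu0", "mode": "idle"}, 0.35),
    ({"instance": "influx", "cpu": "cpu0", "mode": "user"}, 0.5),
    ({"instance": "demo", "cpu": "cpu0", "mode": "system"}, 0.1),
]
def exclude_label(series, name, value):
    """Keep only the series whose label `name` is not equal to `value`."""
    return [(labels, rate) for labels, rate in series if labels.get(name) != value]
# Drop every series where mode is idle, as in the demo.
busy = exclude_label(series, "mode", "idle")
print(busy)
```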

00:29:54.164 --> 00:29:55.907
That looks a little more sane.

00:29:56.659 --> 00:30:01.094
Oh, wow, look at that, that's a nice
big spike in user space on the influx server.

00:30:01.363 --> 00:30:02.310
Okay…

00:30:03.672 --> 00:30:05.252
Well, that's pretty cool.

00:30:05.654 --> 00:30:06.479
What about…

00:30:06.940 --> 00:30:08.625
This is still quite a lot of lines.

00:30:10.637 --> 00:30:14.194
How much CPU is in use in total across
all the servers that we have?

00:30:09.217 --> 00:30:14.378
We can just sum up that rate.

00:30:14.378 --> 00:30:24.457
We can just see that there is
a sum total of 0.6 CPU seconds/s

00:30:25.000 --> 00:30:27.515
across the servers we have.

00:30:27.715 --> 00:30:31.379
But that's a little too coarse.

00:30:31.733 --> 00:30:36.698
What if we want to see it by instance?

00:30:39.155 --> 00:30:42.156
Now, we can see the two servers,
we can see

00:30:42.527 --> 00:30:45.395
that we're left with just that label.

00:30:45.959 --> 00:30:50.229
The instance labels are the influx instance
and the demo instance.

00:30:50.229 --> 00:30:53.334
That's a super easy way to see that,

00:30:53.854 --> 00:30:56.817
but we can also do this
the other way around.

00:30:57.060 --> 00:31:03.022
We can say 'without (mode,cpu)' so
we can drop those labels and

00:31:03.367 --> 00:31:05.243
see all the labels that we have.

00:31:05.438 --> 00:31:11.563
We can still see the environment label
and the job label on our list data.

00:31:12.182 --> 00:31:15.640
You can go either way
with the summary functions.
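
NOTE
Editor's sketch, not from the talk: the two directions of label
aggregation described above, on plain Python data. Grouping "by"
keeps only the named labels; grouping "without" keeps everything
except them. The series values and the helpers sum_by and
sum_without are hypothetical.
```python
from collections import defaultdict
# Hypothetical series: a label set plus a per-second rate; values are invented.
series = [
    ({"instance": "influx", "job": "node", "cpu": "cpu0", "mode": "user"}, 0.5),
    ({"instance": "influx", "job": "node", "cpu": "cpu0", "mode": "system"}, 0.25),
    ({"instance": "demo", "job": "node", "cpu": "cpu0", "mode": "user"}, 0.125),
]
def sum_by(series, keep):
    """Sum values, grouping on only the labels listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        totals[key] += value
    return dict(totals)
def sum_without(series, drop):
    """Sum values, grouping on every label except those listed in `drop`."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)
by_instance = sum_by(series, {"instance"})
without_mode_cpu = sum_without(series, {"mode", "cpu"})
print(by_instance)
print(without_mode_cpu)
```
Both calls produce the same totals here; the difference is which
labels survive on the result, instance alone versus instance and job.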

00:31:15.812 --> 00:31:20.210
There's a whole bunch of different functions

00:31:20.558 --> 00:31:22.730
and it's all in our documentation.

00:31:25.124 --> 00:31:30.113
But what if we want to see it…

00:31:30.572 --> 00:31:33.726
What if we want to see which CPUs
are in use?

00:31:34.154 --> 00:31:36.937
Now we can see that it's only CPU0

00:31:37.203 --> 00:31:39.587
because apparently these are only
1-core instances.

00:31:42.276 --> 00:31:46.660
You can add/remove labels and do
all these queries.

00:31:49.966 --> 00:31:51.833
Any other questions so far?

00:31:53.965 --> 00:31:59.056
[Q] I don't have a question, but I have
something to add.

00:31:59.427 --> 00:32:03.063
Prometheus is really nice, but it's
a lot better if you combine it

00:32:03.389 --> 00:32:04.954
with Grafana.

00:32:05.222 --> 00:32:06.330
[A] Yes, yes.

00:32:06.537 --> 00:32:12.332
In the beginning, when we were creating
Prometheus, we actually built

00:32:12.851 --> 00:32:14.698
a piece of dashboard software called
PromDash.

00:32:16.029 --> 00:32:20.566
It was a simple little Ruby on Rails app
to create dashboards

00:32:20.733 --> 00:32:22.744
and it had a bunch of JavaScript.

00:32:22.936 --> 00:32:24.195
And then Grafana came out.

00:32:25.157 --> 00:32:25.880
And we're like

00:32:25.997 --> 00:32:29.590
"Oh, that's interesting. It doesn't support
Prometheus" so we were like

00:32:29.826 --> 00:32:31.806
"Hey, can you support Prometheus"

00:32:32.217 --> 00:32:34.375
and they're like "Yeah, we've got
a REST API, get the data, done"

00:32:36.035 --> 00:32:37.867
Now Grafana supports Prometheus and
we're like

00:32:39.761 --> 00:32:41.991
"Well, PromDash, this is crap, delete".

00:32:44.390 --> 00:32:46.171
The Prometheus development team,

00:32:46.395 --> 00:32:49.485
we're all backend developers
and SREs and

00:32:49.731 --> 00:32:51.463
we have no JavaScript skills at all.

00:32:52.589 --> 00:32:54.879
So we're like "Let somebody deal
with that".

00:32:55.393 --> 00:32:57.647
One of the nice things about working on
this kind of project is

00:32:57.862 --> 00:33:01.648
we can do things that we're good at and
we don't, we don't try…

00:33:02.398 --> 00:33:05.317
We don't have any marketing people,
it's just an open source project,

00:33:06.320 --> 00:33:09.111
there's no single company behind Prometheus.

00:33:09.914 --> 00:33:14.452
I work for GitLab, Improbable paid for
the Thanos system,

00:33:15.594 --> 00:33:25.286
other companies like Red Hat now pay
people that used to work on CoreOS to

00:33:25.471 --> 00:33:26.517
work on Prometheus.

00:33:27.211 --> 00:33:30.283
There's lots and lots of collaboration
between many companies

00:33:30.467 --> 00:33:32.609
to build the Prometheus ecosystem.

00:33:35.864 --> 00:33:37.455
But yeah, Grafana is great.

00:33:38.835 --> 00:33:44.983
Actually, Grafana now has
two full-time Prometheus developers.

00:33:49.185 --> 00:33:51.031
Alright, that's it.

00:33:52.637 --> 00:33:57.044
[Applause]