So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person about non-GitLab. Something like that? The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better. So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working end to end for every single request, you need whitebox instrumentation. Basically, every event that happens inside your software, inside the serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application server, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring. There are different examples of the kind of software that does blackbox and whitebox monitoring.
You have software like Nagios where you can configure checks, or Pingdom; Pingdom will ping your website. And then there's metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions. We're gonna talk a little bit about Prometheus.

Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph; we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications. There are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model because there are some slight advantages to polling over pushing.

With polling, you get a free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based monitoring, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's really easy testing. With push-based metrics, if you want to test a new version of the monitoring system, or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and test it. Or it doesn't even have to be monitoring; you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is.
It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core there's the Prometheus server, and it deals with all the data collection and analytics. It's basically this one binary, all written in Go; it's a single binary. It knows how to read from your inventory, and there are a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop it into a config file, and Prometheus can pick that up.

Once it has the layout, it goes out and collects all the data. It has a storage layer, a time series database, to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether it's a simple API client to run reports, or something like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and record it, so it generates new data from that query, or you can take a query and, if it returns results, it fires an alert. That alert is a push message to the Alertmanager. This allows us to separate the generating of alerts from the routing of alerts.
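To make that wiring concrete, here is a minimal sketch of a prometheus.yml that reads its inventory from a file (for example one that an Ansible run writes out), loads rule files, and pushes alerts to an Alertmanager. The paths, ports, and addresses are purely illustrative.

    # prometheus.yml, minimal sketch; paths and addresses are illustrative
    global:
      scrape_interval: 15s            # how often each target gets polled

    rule_files:
      - /etc/prometheus/rules/*.yml   # recording and alerting rules

    scrape_configs:
      - job_name: node
        file_sd_configs:              # inventory dropped in by config management
          - files:
              - /etc/prometheus/targets/node.yml

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager.example.com:9093']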
You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an Alertmanager cluster, which does the deduplication and the routing to the human. Because, of course, what we used to have was dashboards with graphs, but in order to find out if something was broken you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric, a very simple thing. If you know much about the Linux kernel, the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric name and you understand a little bit about what's going on here. The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. The kernel actually tracks how much idle time it has. It also tracks it by the number of CPUs.

Other monitoring systems used to do this with a tree structure, and that caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics. Here's a nice example of taking those CPU seconds counters and then converting them into a graph by using PromQL (a sketch of both the metric output and such a query follows at the end of this passage).

Now we can get into metrics-based alerting. Now that we have this graph, we have something we can look at and see, "Oh, there's a little spike here, we might want to know about that."

I used to be a site reliability engineer, and I'm still a site reliability engineer at heart, and we have this concept of the things that you need to run a site or a service reliably. The most important thing you need is down at the bottom: monitoring. Because if you don't have monitoring of your service, how do you know it's even working?
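As referenced above, here is roughly what that metric looks like on a node exporter's metrics endpoint, and the kind of PromQL query that turns those counters into a CPU usage graph. The sample values are made up.

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"}   18423.71
    node_cpu_seconds_total{cpu="0",mode="user"}     583.16
    node_cpu_seconds_total{cpu="0",mode="system"}   232.45

    # PromQL: per-mode CPU usage as a per-second rate over the last 5 minutes
    rate(node_cpu_seconds_total{mode!="idle"}[5m])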
There are a couple of techniques here, and we want to alert based on data, not just those end-to-end tests. There's a thing called the RED method and a thing called the USE method, and there are some nice blog posts about this. Basically, the RED method, for example, talks about the requests that your system is handling. There are three things: the number of requests, the number of errors, and how long they take, the duration. With the combination of these three things you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" For most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.

But we can go back to some more traditional, sysadmin-style alerts. This one basically takes the filesystem available space, divided by the filesystem size, which gives the ratio of filesystem availability, from 0 to 1. Multiply it by 100 and we now have a percentage, and if it's less than or equal to 1% for 15 minutes, so less than 1% of space left, we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple (a sketch of this as an alerting rule follows below).

We can also include labels: every alert includes all the extraneous labels that Prometheus adds to your metrics. When you add a metric in Prometheus, if we go back and look at that metric, it only contains information about the internals of the application. Anything about what server it's on, whether it's running in a container, what cluster it comes from, what ??? it is on, that's all extra annotation added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information. That location information also comes through as labels on the alert, so if you have a message coming into your Alertmanager, the Alertmanager can look at it and go "Oh, that's coming from this datacenter" and include that in the email or IRC message or SMS message.
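Here is that filesystem check sketched as a Prometheus alerting rule, assuming the standard node_exporter metric names; the severity label and the summary template are illustrative.

    groups:
      - name: node.rules
        rules:
          - alert: FilesystemAlmostFull
            # available space as a percentage of total size, per filesystem
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
            for: 15m                  # must stay at or below 1% for 15 minutes
            labels:
              severity: critical
            annotations:
              summary: "Less than 1% space left on {{ $labels.mountpoint }} ({{ $labels.instance }})"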
So you can include "Filesystem is out of space on this host in this datacenter." All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course this is how you define getting the message from the monitoring to the human. You can even include nice things like a link to your documentation as an annotation, and the Alertmanager can take that base URL and massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things. Since we're not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than zero, which means full." So we know that within the next four hours the disk is gonna be full, and we can tell the operator ahead of time that it's going to be full, not just tell them that it's full right now. They have some window of ability to fix it before it fails.

This is really important, because if you're running a site, you want alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA, your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that. And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, a good example from the USE method. This is a rate of 500 errors coming from an application, and you can simply alert if there are more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors to a percentage of errors. So you could say "I have an SLA of three nines," and then say "If the rate of errors divided by the rate of requests is 0.01, or more than 0.01, that's a problem."
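Sketched as alerting rules, the predict_linear check and the error-ratio check might look like the following. node_filesystem_avail_bytes is the standard node_exporter metric; http_requests_total with a status label is a hypothetical application counter standing in for whatever your instrumentation exposes.

    groups:
      - name: capacity-and-errors
        rules:
          - alert: FilesystemWillFillInFourHours
            # linear regression over the last hour, projected 4 hours ahead
            expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
            labels:
              severity: warning

          - alert: HighErrorRatio
            # errors as a fraction of all requests over the last 5 minutes
            expr: |
              sum(rate(http_requests_total{status="500"}[5m]))
                /
              sum(rate(http_requests_total[5m])) > 0.01
            labels:
              severity: critical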
You can include that level of error granularity. And if you're just doing a blackbox test, you wouldn't know this. You would only get: you got an error from the system, then you got another error from the system, then you fire an alert. But if those checks are one minute apart and you're serving 1,000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it entirely, because what if you only get one random error, and then the next time you're serving 25% errors? You only have a 25% chance of that check failing again. You really need these metrics in order to get proper reports of the status of your system.

There are even more options: you can slice and dice those labels. If you have a label on all of your applications called 'service', you can send that 'service' label through to the message and say "Hey, this service is broken." You can include that service label in your alert messages.

And that's it. I can go to a demo and Q&A.

[Applause]

Any questions so far? Or does anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called 'cadvisor', and cadvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally in its own data storage on the local disk.
But that's only as durable as that server is durable. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and the language on our website said "it's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's just only as durable as the server underneath it.

So if you've got a very large cluster and you want really high durability, you need some kind of cluster software. But we want Prometheus to be simple to deploy, very simple to operate, and also very robust, so we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that goes down is all of your distributed systems fail. And building distributed systems to be really robust is really hard, so Prometheus is what we call an uncoordinated distributed system.

If you've got two Prometheus servers monitoring all your targets in an HA mode in a cluster, and there's a split brain, each Prometheus can see half of the cluster, and it can see that the other half of the cluster is down. They can both try to get alerts out to the Alertmanager, and this is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super, super robust, and the two individual Prometheus servers in your cluster don't have to talk to each other to do this; they can just do it independently.

But if you want to be able to correlate data between many different Prometheus servers, you need an external data store to do this. And also, you may not have very big servers; you might be running your Prometheus in a container, and it's only got a little bit of local storage space, so you want to send all that data up to a big cluster datastore for ???. We have several different ways of doing this.
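One of those ways is Prometheus's remote write and remote read interface, which streams samples out to an external store and lets queries reach back into it. A minimal sketch, with a hypothetical endpoint, looks like this in prometheus.yml:

    remote_write:
      - url: http://long-term-store.example.com/api/v1/write   # hypothetical remote storage endpoint

    remote_read:
      - url: http://long-term-store.example.com/api/v1/read    # lets PromQL read data back from the remote store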
There's the classic way, which is called federation, where you have one Prometheus server polling in summary data from each of the individual Prometheus servers. This is useful if you want to run alerts against data coming from multiple Prometheus servers. But federation is not replication. It can only pull a little bit of data from each Prometheus server. If you've got a million metrics on each Prometheus server, you can't pull in a million metrics, and if you've got 10 of those, you can't pull in 10 million metrics simultaneously into one Prometheus server. It's just too much data.
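For reference, a federation scrape job on the global Prometheus looks roughly like this; the match[] selector and the target addresses are illustrative, and the usual practice is to pull only pre-aggregated summary series rather than everything.

    scrape_configs:
      - job_name: federate
        honor_labels: true              # keep labels as set by the source Prometheus servers
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'    # e.g. only series produced by recording rules
        static_configs:
          - targets:
              - prometheus-dc1.example.com:9090
              - prometheus-dc2.example.com:9090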