So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so, my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring. Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.

So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, you need whitebox instrumentation. So, basically, every event that happens inside your software, inside the serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.
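To make the whitebox idea concrete (this example is mine, not from the talk; the application name, metric name, and port are all made up), here is a minimal sketch of an instrumented service using the Go Prometheus client library. Every request that comes in gets counted, and the running totals are exposed on a /metrics endpoint for the monitoring system to collect:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter: one increment per request, labelled by URL path.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total number of HTTP requests handled.",
	},
	[]string{"path"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Every request that hits this handler is counted, so the
	// monitoring system can later check that everything matches up.
	httpRequestsTotal.WithLabelValues(r.URL.Path).Inc()
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handler)
	// The monitoring system polls this endpoint to collect the counts.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```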
There are different examples of the kind of software that does blackbox and whitebox monitoring. So you have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages to polling over pushing.

With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the same disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's also really easy testing. With push-based metrics, if you want to test a new version of the monitoring system, or you want to test something new, you have to figure out how to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. Or it doesn't even have to be monitoring; you can just use curl to poll the metrics endpoint. It's significantly easier to test.
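As a rough illustration of that testing workflow (again my assumption, continuing the hypothetical service sketched above), polling the /metrics endpoint by hand, for instance with curl, returns plain text you can read directly; the value here is invented:

```
# HELP myapp_http_requests_total Total number of HTTP requests handled.
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{path="/"} 42
```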
The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you.

That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's a Prometheus server, and it deals with all the data collection and analytics. Basically, this one binary, it's all written in Go. It's a single binary. It knows how to read from your inventory; there's a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has a storage layer and a time series database to store all that data locally.

It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether it's a simple API client to run reports, something like Grafana for creating dashboards, or the simple web UI that's built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these recording rules have two different modes. You can either take a query and it will generate new data from that query, or you can take a query and, if it returns results, it will generate an alert. That alert is a push message to the Alertmanager. This allows us to separate the generating of alerts from the routing of alerts.
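To give a feel for what those queries and rules look like (these expressions are illustrative, not from the talk, and reuse the hypothetical counter from the earlier sketch), PromQL turns raw counters into rates, and a rule is just a query that is either recorded as new data or, if it returns results, fired as an alert:

```
# Could be saved as a recording rule: per-path request rate over 5 minutes.
sum by (path) (rate(myapp_http_requests_total[5m]))

# Could be used in an alerting rule: it only returns a result (and therefore
# fires) when the service has handled no requests at all for 5 minutes.
sum(rate(myapp_http_requests_total[5m])) == 0
```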
You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an Alertmanager cluster, which does the deduplication and the routing to the human. Because, of course, the thing that we want is: we had dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. This is a very simple thing. If you know much about the Linux kernel, the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric's name and you understand a little bit about what's going on here.
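The slide itself isn't captured in the transcript, but the kind of output being described, as exposed by the Prometheus node_exporter, generally looks something like this (the numbers are placeholders):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.3245786e+06
node_cpu_seconds_total{cpu="0",mode="system"} 5312.41
node_cpu_seconds_total{cpu="0",mode="user"} 24187.93
```

Each line is one time series: the metric name, a set of labels (which CPU, which mode), and the current counter value.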