-
Not Synced
So, we had a talk by a non-GitLab person
about GitLab.
-
Not Synced
Now, we have a talk by a GitLab person
on non-GitLab.
-
Not Synced
Something like that?
-
Not Synced
The CCCHH hackerspace is now open,
-
Not Synced
from now on if you want to go there,
that's the announcement.
-
Not Synced
And the next talk will be by Ben Kochie
-
Not Synced
on metrics-based monitoring
with Prometheus.
-
Not Synced
Welcome.
-
Not Synced
[Applause]
-
Not Synced
Alright, so
-
Not Synced
my name is Ben Kochie
-
Not Synced
I work on DevOps features for GitLab
-
Not Synced
and apart from working for GitLab, I also work
on the open source Prometheus project.
-
Not Synced
I live in Berlin and I've been using
Debian since ???
-
Not Synced
yes, quite a long time.
-
Not Synced
So, what is Metrics-based Monitoring?
-
Not Synced
If you're running software in production,
-
Not Synced
you probably want to monitor it,
-
Not Synced
because if you don't monitor it, you don't
know it's right.
-
Not Synced
??? break down into two categories:
-
Not Synced
there's blackbox monitoring and
there's whitebox monitoring.
-
Not Synced
Blackbox monitoring is treating
your software like a blackbox.
-
Not Synced
It's just checks to see, like,
-
Not Synced
is it responding, or does it ping
-
Not Synced
or ??? HTTP requests
-
Not Synced
[mic turned on]
-
Not Synced
Ah, there we go, much better.
-
Not Synced
So, blackbox monitoring is a probe,
-
Not Synced
it just kind of looks from the outside
to your software
-
Not Synced
and it has no knowledge of the internals
-
Not Synced
and it's really good for end to end testing.
-
Not Synced
So if you've got a fairly complicated
service,
-
Not Synced
you come in from the outside, you go
through the load balancer,
-
Not Synced
you hit the API server,
-
Not Synced
the API server might hit a database,
-
Not Synced
and you go all the way through
to the back of the stack
-
Not Synced
and all the way back out
-
Not Synced
so you know that everything is working
end to end.
-
Not Synced
But you only know about it
for that one request.
-
Not Synced
So in order to find out if your service
is working,
-
Not Synced
from the end to end, for every single
request,
-
Not Synced
this requires whitebox instrumentation.
-
Not Synced
So, basically, every event that happens
inside your software,
-
Not Synced
inside a serving stack,
-
Not Synced
gets collected and gets counted,
-
Not Synced
so you know that every request hits
the load balancer,
-
Not Synced
every request hits your application
service,
-
Not Synced
every request hits the database.
-
Not Synced
You know that everything matches up
-
Not Synced
and this is called whitebox, or
metrics-based monitoring.
-
Not Synced
There are different examples of, like,
-
Not Synced
the kind of software that does blackbox
and whitebox monitoring.
-
Not Synced
So you have software like Nagios that
you can configure checks
-
Not Synced
or Pingdom,
-
Not Synced
Pingdom will ping your website.
-
Not Synced
And then there is metrics-based monitoring,
-
Not Synced
things like Prometheus, things like
the TICK stack from InfluxData,
-
Not Synced
New Relic and other commercial solutions
-
Not Synced
but of course I like to talk about
the open source solutions.
-
Not Synced
We're gonna talk a little bit about
Prometheus.
-
Not Synced
Prometheus came out of the idea that
-
Not Synced
we needed a monitoring system that could
collect all this whitebox metric data
-
Not Synced
and do something useful with it.
-
Not Synced
Not just give us a pretty graph, but
we also want to be able to
-
Not Synced
alert on it.
-
Not Synced
So we needed both
-
Not Synced
a data gathering and an analytics system
in the same instance.
-
Not Synced
To do this, we built this thing and
we looked at the way that
-
Not Synced
data was being generated
by the applications
-
Not Synced
and there are advantages and
disadvantages to this
-
Not Synced
push vs. poll model for metrics.
-
Not Synced
We decided to go with the polling model
-
Not Synced
because there are some slight advantages
for polling over pushing.
-
Not Synced
With polling, you get this free
blackbox check
-
Not Synced
that the application is running.
-
Not Synced
When you poll your application, you know
that the process is running.
-
Not Synced
If you are doing push-based, you can't
tell the difference between
-
Not Synced
your application doing no work and
your application not running.
-
Not Synced
So you don't know if it's stuck,
-
Not Synced
or is it just not having to do any work.
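[Editor's sketch] The point about the free blackbox check can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: `scrape`, `idle_app`, `dead_app` and the metric name `requests_total` are hypothetical names invented for the illustration, not part of Prometheus. The idea is that a failed poll is itself a signal, so an idle process and a dead process look different to the poller.

```python
def scrape(fetch):
    """Poll one target.

    Returns (up, body): up is 1 if the poll succeeded, 0 if it failed.
    """
    try:
        body = fetch()
    except OSError:
        # The poll itself failed, so the process is unreachable.
        return 0, None
    return 1, body


def idle_app():
    # Hypothetical target: running, but it has handled no requests yet.
    return "requests_total 0\n"


def dead_app():
    # Hypothetical target: the process is gone, so connecting fails.
    raise ConnectionRefusedError("connection refused")


idle_result = scrape(idle_app)
dead_result = scrape(dead_app)

print(idle_result)  # (1, 'requests_total 0\n')
print(dead_result)  # (0, None)
```

With a push model, both cases would simply look like "no data arriving"; here the poller can tell them apart because the second poll fails.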
-
Not Synced
With polling, the polling system knows
the state of your network.
-
Not Synced
If you have a defined set of services,
-
Not Synced
that inventory drives what should be there.
-
Not Synced
Again, it's like, the disappearing,
-
Not Synced
is the process dead, or is it just
not doing anything?
-
Not Synced
With polling, you know for a fact
what processes should be there,
-
Not Synced
and it's a bit of an advantage there.
-
Not Synced
With polling, there's really easy testing.
-
Not Synced
With push-based metrics, you have to
figure out
-
Not Synced
if you want to test a new version of
the monitoring system or
-
Not Synced
you want to test something new,
-
Not Synced
you have to ??? a copy of the data.
-
Not Synced
With polling, you can just set up
another instance of your monitoring
-
Not Synced
and just test it.
-
Not Synced
Or you don't even have,
-
Not Synced
it doesn't even have to be monitoring,
you can just use curl
-
Not Synced
to poll the metrics endpoint.
-
Not Synced
It's significantly easier to test.
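[Editor's sketch] As a rough illustration of why polling is easy to test, here is a minimal Python sketch of an application that exposes a metrics endpoint, plus a plain HTTP GET against it, which is roughly what curl would do. The counter `REQUESTS_HANDLED`, the metric name `app_requests_total` and the helper `fetch_metrics` are assumptions made up for the example; a real application would normally use a Prometheus client library instead of hand-writing the output.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_HANDLED = 0  # hypothetical counter the application keeps about itself


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # One sample in plain text: metric name, a space, the value.
        body = f"app_requests_total {REQUESTS_HANDLED}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_metrics(host, port):
    """Do what `curl http://host:port/metrics` would do: one GET request."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode()
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

text = fetch_metrics("127.0.0.1", server.server_port)
print(text)

server.shutdown()
server.server_close()
```

Note that the application never needs to know who is polling it; any number of test pollers can fetch the same endpoint.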
-
Not Synced
The other thing with the…
-
Not Synced
The other nice thing is that
the client is really simple.
-
Not Synced
The client doesn't have to know
where the monitoring system is.
-
Not Synced
It doesn't have to know about ???
-
Not Synced
It just has to sit and collect the data
about itself.
-
Not Synced
So it doesn't have to know anything about
the topology of the network.
-
Not Synced
As an application developer, if you're
writing a DNS server or
-
Not Synced
some other piece of software,
-
Not Synced
you don't have to know anything about
monitoring software,
-
Not Synced
you can just implement it inside
your application and
-
Not Synced
the monitoring software, whether it's
Prometheus or something else,
-
Not Synced
can just come and collect that data for you.
-
Not Synced
That's kind of similar to a very old
monitoring system called SNMP,
-
Not Synced
but SNMP has a significantly less friendly
data model for developers.
-
Not Synced
This is the basic layout
of a Prometheus server.
-
Not Synced
At the core, there's a Prometheus server
-
Not Synced
and it deals with all the data collection
and analytics.
-
Not Synced
Basically, this one binary,
it's all written in golang.
-
Not Synced
It's a single binary.
-
Not Synced
It knows how to read from your inventory,
-
Not Synced
there's a bunch of different methods,
whether you've got
-
Not Synced
a Kubernetes cluster or a cloud platform
-
Not Synced
or you have your own customized thing
with Ansible.
-
Not Synced
Ansible can take your layout, drop that
into a config file and
-
Not Synced
Prometheus can pick that up.
-
Not Synced
Once it has the layout, it goes out and
collects all the data.
-
Not Synced
It has a storage and a time series
database to store all that data locally.
-
Not Synced
It has a thing called PromQL, which is
a query language designed
-
Not Synced
for metrics and analytics.
-
Not Synced
From that PromQL, you can add frontends
that will,
-
Not Synced
whether it's a simple API client
to run reports,
-
Not Synced
you can use things like Grafana
for creating dashboards,
-
Not Synced
it's got a simple web UI built in.
-
Not Synced
You can plug in anything you want
on that side.
-
Not Synced
And then, it also has the ability to
continuously execute queries
-
Not Synced
called "recording rules"
-
Not Synced
and these recording rules have
two different modes.
-
Not Synced
You can either record, you can take
a query
-
Not Synced
and it will generate new data
from that query
-
Not Synced
or you can take a query, and
if it returns results,
-
Not Synced
it will return an alert.
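[Editor's sketch] The two rule modes described here can be modelled in a few lines of Python. This is a toy model, not how Prometheus is implemented: `evaluate_rule`, the rule dictionaries, the names `job:error_ratio` and `HighErrorRatio`, and the 0.05 threshold are all hypothetical, and the "query result" is just a hand-written dictionary instead of a real PromQL evaluation.

```python
def evaluate_rule(rule, query_result, storage, alerts):
    """Apply one rule to the result of its query.

    query_result maps a label set (here a plain string) to a value.
    """
    if rule["mode"] == "record":
        # Recording mode: store the result as a new series under a new name.
        for labels, value in query_result.items():
            storage[(rule["name"], labels)] = value
    elif rule["mode"] == "alert":
        # Alerting mode: each returned element becomes one alert.
        # An empty result means nothing fires.
        for labels, value in query_result.items():
            alerts.append({"alert": rule["name"], "labels": labels, "value": value})


storage, alerts = {}, []

# Pretend this is what a query for the error ratio per job returned.
error_ratio = {"job=api": 0.2}

# Mode 1: record the query result as new data.
evaluate_rule({"mode": "record", "name": "job:error_ratio"}, error_ratio, storage, alerts)

# Mode 2: a query with a condition; only elements above the threshold remain.
too_high = {k: v for k, v in error_ratio.items() if v > 0.05}
evaluate_rule({"mode": "alert", "name": "HighErrorRatio"}, too_high, storage, alerts)

# A query that returns nothing produces no alert.
evaluate_rule({"mode": "alert", "name": "NeverFires"}, {}, storage, alerts)

print(storage)
print(alerts)
```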
-
Not Synced
That alert is a push message
to the Alertmanager.
-
Not Synced
This allows us to separate the generating
of alerts from the routing of alerts.
-
Not Synced
You can have one or hundreds of Prometheus
services, all generating alerts
-
Not Synced
and it goes into an Alertmanager cluster
and sends, does the deduplication
-
Not Synced
and the routing to the human
-
Not Synced
because, of course, the thing
that we want is
-
Not Synced
we had dashboards with graphs, but
in order to find out if something is broken
-
Not Synced
you had to have a human
looking at the graph.
-
Not Synced
With Prometheus, we don't have to do that
anymore,
-
Not Synced
we can simply let the software tell us
that we need to go investigate
-
Not Synced
our problems.
-
Not Synced
We don't have to sit there and
stare at dashboards all day,
-
Not Synced
because that's really boring.
-
Not Synced
What does it look like to actually
get data into Prometheus?
-
Not Synced
This is a very basic output
of a Prometheus metric.
-
Not Synced
This is a very simple thing.
-
Not Synced
If you know much about
the Linux kernel,
-
Not Synced
the Linux kernel tracks ??? stats,
all the state of all the CPUs
-
Not Synced
in your system
-
Not Synced
and we express this by having
the name of the metric, which is
-
Not Synced
'node_cpu_seconds_total' and so
this is a self describing metric,
-
Not Synced
like you can just read the metric's name
-
Not Synced
and you understand a little bit about
what's going on here.
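[Editor's sketch] To make the output format concrete, here is a short Python sketch that reads a few lines in the Prometheus text format. The sample values and the `cpu` and `mode` labels are illustrative assumptions modelled on typical node exporter output, not taken from the slide. The parser is deliberately simplified: it ignores comment lines and does not handle escaped quotes in label values or trailing timestamps.

```python
import re

# Illustrative sample; the numbers are made up.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.25
"""

# metric_name, optional {labels}, whitespace, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (name, labels, value) tuples, one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a line this simplified parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

The name alone already says what is measured (CPU), in which unit (seconds), and that it is a running total.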