So, we had a talk by a non-GitLab person
about GitLab.
Now, we have a talk by a GitLab person
on non-GitLab.
Something like that?
The CCCHH hackerspace is now open,
from now on if you want to go there,
that's the announcement.
And the next talk will be by Ben Kochie
on metrics-based monitoring
with Prometheus.
Welcome.
[Applause]
Alright, so
my name is Ben Kochie
I work on DevOps features for GitLab
and apart from working for GitLab, I also work
on the open source Prometheus project.
I live in Berlin and I've been using
Debian since ???
yes, quite a long time.
So, what is Metrics-based Monitoring?
If you're running software in production,
you probably want to monitor it,
because if you don't monitor it, you don't
know whether it's working right.
Monitoring breaks down into two categories:
there's blackbox monitoring and
there's whitebox monitoring.
Blackbox monitoring is treating
your software like a blackbox.
It just checks to see, like,
is it responding, does it answer ping
or HTTP requests?
[mic turned on]
Ah, there we go, much better.
So, blackbox monitoring is a probe,
it just kind of looks from the outside
at your software
and it has no knowledge of the internals
and it's really good for end to end testing.
So if you've got a fairly complicated
service,
you come in from the outside, you go
through the load balancer,
you hit the API server,
the API server might hit a database,
and you go all the way through
to the back of the stack
and all the way back out
so you know that everything is working
end to end.
But you only know about it
for that one request.
So in order to find out if your service
is working end to end,
for every single request,
you need whitebox instrumentation.
So, basically, every event that happens
inside your software,
inside a serving stack,
gets collected and gets counted,
so you know that every request hits
the load balancer,
every request hits your application
service,
every request hits the database.
You know that everything matches up
and this is called whitebox, or
metrics-based monitoring.
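The counting idea above can be sketched in a few lines of Python; the stack stages and counter names here are made up for illustration, not part of any real monitoring library:

```python
from collections import Counter

# A minimal sketch of whitebox monitoring: every stage of a
# hypothetical serving stack counts every event it handles.
# The counter names are made up for illustration.
events = Counter()

def handle_request(request):
    events["loadbalancer_requests_total"] += 1  # request hits the load balancer
    events["api_requests_total"] += 1           # then the API server
    events["db_queries_total"] += 1             # which queries the database
    return "ok"

# Serve three requests; every stage's counter should match up.
for _ in range(3):
    handle_request({})

print(events["loadbalancer_requests_total"], events["db_queries_total"])
```

If the counters at each stage don't match up, you know exactly where requests are being lost.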
There are different examples of, like,
the kind of software that does blackbox
and whitebox monitoring.
So you have software like Nagios, where
you can configure checks,
or Pingdom;
Pingdom will ping your website.
And then there is metrics-based monitoring,
things like Prometheus, things like
the TICK stack from InfluxData,
New Relic and other commercial solutions,
but of course I like to talk about
the open source solutions.
We're gonna talk a little bit about
Prometheus.
Prometheus came out of the idea that
we needed a monitoring system that could
collect all this whitebox metric data
and do something useful with it.
Not just give us a pretty graph, but
we also want to be able to
alert on it.
So we needed both
a data gathering and an analytics system
in the same instance.
To do this, we built this thing and
we looked at the way that
data was being generated
by the applications
and there are advantages and
disadvantages to this
push vs. poll model for metrics.
We decided to go with the polling model
because there are some slight advantages
for polling over pushing.
With polling, you get this free
blackbox check
that the application is running.
When you poll your application, you know
that the process is running.
If you are doing push-based, you can't
tell the difference between
your application doing no work and
your application not running.
So you don't know: is it stuck,
or does it just not have any work to do?
With polling, the polling system knows
the state of your network.
If you have a defined set of services,
that inventory drives what should be there.
Again, it's that same disappearing-process
question: is the process dead, or is it
just not doing anything?
With polling, you know for a fact
what processes should be there,
and it's a bit of an advantage there.
With polling, there's really easy testing.
With push-based metrics, if you want to
test a new version of
the monitoring system, or
you want to test something new,
you have to figure out how to
send a copy of the data to it.
With polling, you can just set up
another instance of your monitoring
and just test it.
It doesn't even have to be
a monitoring system,
you can just use curl
to poll the metrics endpoint.
It's significantly easier to test.
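That testability can be sketched end to end in plain Python: the app exposes a toy /metrics endpoint in the Prometheus text exposition format, and the "monitoring system" is just an HTTP GET, exactly what curl would do. The metric name and value are invented for the sketch; a real application would use a Prometheus client library:

```python
import http.server
import threading
import urllib.request

# A toy /metrics endpoint in the Prometheus text exposition format.
# The metric name and value are made up for illustration.
METRICS_BODY = (
    "# HELP app_requests_total Total requests handled.\n"
    "# TYPE app_requests_total counter\n"
    "app_requests_total 42\n"
)

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS_BODY.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep output quiet
        pass

def poll_metrics(port):
    """Poll the endpoint the same way curl would."""
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as resp:
        return resp.read().decode()

# The app only serves data about itself; it knows nothing
# about who polls it or where the monitoring system lives.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
scraped = poll_metrics(server.server_address[1])
server.shutdown()
print(scraped)
```

The same endpoint could be checked from a shell with `curl http://127.0.0.1:PORT/metrics`.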
The other nice thing is that
the client is really simple.
The client doesn't have to know
where the monitoring system is.
It doesn't have to know about ???
It just has to sit and collect the data
about itself.
So it doesn't have to know anything about
the topology of the network.
As an application developer, if you're
writing a DNS server or
some other piece of software,
you don't have to know anything about
monitoring software,
you can just implement it inside
your application and
the monitoring software, whether it's
Prometheus or something else,
can just come and collect that data for you.
That's kind of similar to a very old
monitoring system called SNMP,
but SNMP has a significantly less friendly
data model for developers.
This is the basic layout
of a Prometheus server.
At the core, there's a Prometheus server
and it deals with all the data collection
and analytics.
Basically, it's all one binary,
written in Go.
It knows how to read from your inventory,
there's a bunch of different methods,
whether you've got
a Kubernetes cluster or a cloud platform
or you have your own customized thing
with Ansible.
Ansible can take your layout, drop that
into a config file and
Prometheus can pick that up.
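That Ansible-to-config-file handoff might look something like Prometheus's file-based service discovery; the paths and target names here are invented for illustration:

```yaml
# prometheus.yml: scrape whatever targets appear in the files below.
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.yml"
```

```yaml
# /etc/prometheus/targets/node.yml: written out by Ansible
# (hypothetical hosts); Prometheus picks up changes automatically.
- targets: ["web1:9100", "web2:9100"]
  labels:
    env: "production"
```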
Once it has the layout, it goes out and
collects all the data.
It has a storage layer, a time series
database, to store all that data locally.
It has a thing called PromQL, which is
a query language designed
for metrics and analytics.
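A PromQL query over the kind of CPU data discussed later in this talk might look like this (assuming the standard node exporter metric):

```
# Per-second rate of non-idle CPU time, averaged over 5 minutes,
# summed per machine.
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```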
On top of PromQL, you can add frontends,
whether it's a simple API client
to run reports,
or things like Grafana
for creating dashboards;
it's also got a simple web UI built in.
You can plug in anything you want
on that side.
And then, it also has the ability to
continuously execute queries
called "recording rules"
and these recording rules have
two different modes.
You can take a query
and it will record new data
generated from that query,
or you can take a query, and
if it returns results,
it will fire an alert.
That alert is a push message
to the alert manager.
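The two rule modes described above might look something like this in a Prometheus rule file; the rule names, expressions, and thresholds are made up for illustration:

```yaml
groups:
  - name: example
    rules:
      # Recording rule: continuously evaluate the query and
      # store the result as a new time series.
      - record: instance:node_cpu_utilisation:rate5m
        expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      # Alerting rule: if the query keeps returning results for
      # 10 minutes, push an alert to the Alertmanager.
      - alert: HighCPU
        expr: instance:node_cpu_utilisation:rate5m > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is above 90%"
```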
This allows us to separate the generating
of alerts from the routing of alerts.
You can have one or hundreds of Prometheus
servers, all generating alerts,
and they go into an Alertmanager cluster
which does the deduplication
and the routing to a human
because, of course, the thing is,
we used to have dashboards with graphs, but
in order to find out if something was broken
you had to have a human
looking at the graph.
With Prometheus, we don't have to do that
anymore,
we can simply let the software tell us
that we need to go investigate
our problems.
We don't have to sit there and
stare at dashboards all day,
because that's really boring.
What does it look like to actually
get data into Prometheus?
This is a very basic output
of a Prometheus metric.
This is a very simple thing.
If you know much about
the Linux kernel,
the Linux kernel tracks CPU stats,
all the state of all the CPUs
in your system,
and we express this by having
the name of the metric, which is
'node_cpu_seconds_total' and so
this is a self-describing metric:
you can just read the metric's name
and you understand a little bit about
what's going on here.
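The output being described looks roughly like this in the Prometheus text exposition format (sample values invented):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 198843.17
node_cpu_seconds_total{cpu="0",mode="user"} 1337.42
```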