-
Not Synced
So, we had a talk by a non-GitLab person
about GitLab.
-
Not Synced
Now, we have a talk by a GitLab person
on non-GitLab.
-
Not Synced
Something like that?
-
Not Synced
The CCCHH hackerspace is now open,
-
Not Synced
from now on if you want to go there,
that's the announcement.
-
Not Synced
And the next talk will be by Ben Kochie
-
Not Synced
on metrics-based monitoring
with Prometheus.
-
Not Synced
Welcome.
-
Not Synced
[Applause]
-
Not Synced
Alright, so
-
Not Synced
my name is Ben Kochie
-
Not Synced
I work on DevOps features for GitLab
-
Not Synced
and apart from working for GitLab, I also
work on the open source Prometheus project.
-
Not Synced
I live in Berlin and I've been using
Debian since ???
-
Not Synced
yes, quite a long time.
-
Not Synced
So, what is Metrics-based Monitoring?
-
Not Synced
If you're running software in production,
-
Not Synced
you probably want to monitor it,
-
Not Synced
because if you don't monitor it, you don't
know whether it's working right.
-
Not Synced
Monitoring breaks down into two categories:
-
Not Synced
there's blackbox monitoring and
there's whitebox monitoring.
-
Not Synced
Blackbox monitoring is treating
your software like a blackbox.
-
Not Synced
It's just checks to see, like,
-
Not Synced
is it responding, or does it ping
-
Not Synced
or ??? HTTP requests
-
Not Synced
[mic turned on]
-
Not Synced
Ah, there we go, much better.
-
Not Synced
So, blackbox monitoring is a probe,
-
Not Synced
it just kind of looks from the outside
to your software
-
Not Synced
and it has no knowledge of the internals
-
Not Synced
and it's really good for end to end testing.
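A blackbox probe in the spirit described here can be sketched in a few lines of Python. The probe knows nothing about the service's internals; the local `http.server` below merely stands in for a real service, and the `probe` helper is a hypothetical name, not part of any monitoring tool.

```python
# Minimal blackbox probe: look at the service purely from the outside --
# did one end-to-end request succeed, and how long did it take?
import http.server
import threading
import time
import urllib.request

def probe(url, timeout=5):
    """One end-to-end check: (success, latency in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        ok = False
    return ok, time.monotonic() - start

# Throwaway local HTTP server standing in for a real service.
server = http.server.HTTPServer(("127.0.0.1", 0),
                                http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

ok, latency = probe(f"http://127.0.0.1:{server.server_port}/")
print(ok, f"{latency:.3f}s")
```

As the talk notes, this tells you the whole stack answered that one request, but nothing about any other request.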
-
Not Synced
So if you've got a fairly complicated
service,
-
Not Synced
you come in from the outside, you go
through the load balancer,
-
Not Synced
you hit the API server,
-
Not Synced
the API server might hit a database,
-
Not Synced
and you go all the way through
to the back of the stack
-
Not Synced
and all the way back out
-
Not Synced
so you know that everything is working
end to end.
-
Not Synced
But you only know about it
for that one request.
-
Not Synced
So in order to find out if your service
is working,
-
Not Synced
from the end to end, for every single
request,
-
Not Synced
this requires whitebox instrumentation.
-
Not Synced
So, basically, every event that happens
inside your software,
-
Not Synced
inside a serving stack,
-
Not Synced
gets collected and gets counted,
-
Not Synced
so you know that every request hits
the load balancer,
-
Not Synced
every request hits your application
service,
-
Not Synced
every request hits the database.
-
Not Synced
You know that everything matches up
-
Not Synced
and this is called whitebox, or
metrics-based monitoring.
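The idea of counting every event at every layer can be sketched with plain in-process counters. This is a toy illustration, not the real Prometheus client library, and the metric names and the `/boom` endpoint are made up.

```python
# Toy whitebox instrumentation: every event inside the serving stack
# increments a counter, so counts at each layer can be matched up.
from collections import Counter

events = Counter()

def handle_request(path):
    events["loadbalancer_requests_total"] += 1   # request hits the load balancer
    events["app_requests_total"] += 1            # ...then the application server
    if path == "/boom":                          # hypothetical failing endpoint
        events["app_errors_total"] += 1
        return 500
    events["db_queries_total"] += 1              # ...then the database
    return 200

for path in ["/", "/users", "/boom"]:
    handle_request(path)

# 3 requests reached the load balancer and the app, but only 2 reached
# the database: the mismatch shows exactly where one request stopped.
print(dict(events))
```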
-
Not Synced
There are different examples of, like,
-
Not Synced
the kind of software that does blackbox
and whitebox monitoring.
-
Not Synced
So you have software like Nagios that
you can configure checks
-
Not Synced
or Pingdom,
-
Not Synced
Pingdom will ping your website.
-
Not Synced
And then there is metrics-based monitoring,
-
Not Synced
things like Prometheus, things like
the TICK stack from influx data,
-
Not Synced
New Relic and other commercial solutions
-
Not Synced
but of course I like to talk about
the open source solutions.
-
Not Synced
We're gonna talk a little bit about
Prometheus.
-
Not Synced
Prometheus came out of the idea that
-
Not Synced
we needed a monitoring system that could
collect all this whitebox metric data
-
Not Synced
and do something useful with it.
-
Not Synced
Not just give us a pretty graph, but
we also want to be able to
-
Not Synced
alert on it.
-
Not Synced
So we needed both
-
Not Synced
a data gathering and an analytics system
in the same instance.
-
Not Synced
To do this, we built this thing and
we looked at the way that
-
Not Synced
data was being generated
by the applications
-
Not Synced
and there are advantages and
disadvantages to this
-
Not Synced
push vs. poll model for metrics.
-
Not Synced
We decided to go with the polling model
-
Not Synced
because there is some slight advantages
for polling over pushing.
-
Not Synced
With polling, you get this free
blackbox check
-
Not Synced
that the application is running.
-
Not Synced
When you poll your application, you know
that the process is running.
-
Not Synced
If you are doing push-based, you can't
tell the difference between
-
Not Synced
your application doing no work and
your application not running.
-
Not Synced
So you don't know if it's stuck,
-
Not Synced
or is it just not having to do any work.
-
Not Synced
With polling, the polling system knows
the state of your network.
-
Not Synced
If you have a defined set of services,
-
Not Synced
that inventory drives what should be there.
-
Not Synced
Again, it's that same disappearing-process question:
-
Not Synced
is the process dead, or is it just
not doing anything?
-
Not Synced
With polling, you know for a fact
what processes should be there,
-
Not Synced
and it's a bit of an advantage there.
-
Not Synced
With polling, there's really easy testing.
-
Not Synced
With push-based metrics, you have to
figure out
-
Not Synced
if you want to test a new version of
the monitoring system or
-
Not Synced
you want to test something new,
-
Not Synced
you have to ??? a copy of the data.
-
Not Synced
With polling, you can just set up
another instance of your monitoring
-
Not Synced
and just test it.
-
Not Synced
Or you don't even have,
-
Not Synced
it doesn't even have to be monitoring,
you can just use curl
-
Not Synced
to poll the metrics endpoint.
-
Not Synced
It's significantly easier to test.
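Since the metrics endpoint is plain text over HTTP, "testing" it really can be a `curl` plus a glance at the output. The parser below is a simplified sketch of reading that exposition format; it skips `HELP`/`TYPE` comments and does not handle every corner case of the real format.

```python
# Fetch-and-parse testing of a polled metrics endpoint, no monitoring
# system required. parse_metrics is a simplified sketch.

def parse_metrics(text):
    """Parse 'name{labels} value' lines into a {series: float} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        series, value = line.rsplit(None, 1)   # split off the trailing value
        samples[series] = float(value)
    return samples

# What a scrape (or a curl of /metrics) might return; values are made up.
exposition = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
"""

print(parse_metrics(exposition))
```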
-
Not Synced
The other thing with the…
-
Not Synced
The other nice thing is that
the client is really simple.
-
Not Synced
The client doesn't have to know
where the monitoring system is.
-
Not Synced
It doesn't have to know about ???
-
Not Synced
It just has to sit and collect the data
about itself.
-
Not Synced
So it doesn't have to know anything about
the topology of the network.
-
Not Synced
As an application developer, if you're
writing a DNS server or
-
Not Synced
some other piece of software,
-
Not Synced
you don't have to know anything about
monitoring software,
-
Not Synced
you can just implement it inside
your application and
-
Not Synced
the monitoring software, whether it's
Prometheus or something else,
-
Not Synced
can just come and collect that data for you.
-
Not Synced
That's kind of similar to a very old
monitoring system called SNMP,
-
Not Synced
but SNMP has a significantly less friendly
data model for developers.
-
Not Synced
This is the basic layout
of a Prometheus server.
-
Not Synced
At the core, there's a Prometheus server
-
Not Synced
and it deals with all the data collection
and analytics.
-
Not Synced
Basically, this one binary,
it's all written in golang.
-
Not Synced
It's a single binary.
-
Not Synced
It knows how to read from your inventory,
-
Not Synced
there's a bunch of different methods,
whether you've got
-
Not Synced
a kubernetes cluster or a cloud platform
-
Not Synced
or you have your own customized thing
with Ansible.
-
Not Synced
Ansible can take your layout, drop that
into a config file and
-
Not Synced
Prometheus can pick that up.
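The "drop a config file, Prometheus picks it up" flow is Prometheus's file-based service discovery. A minimal fragment might look like the following; the file paths and target names are illustrative.

```yaml
# prometheus.yml (fragment): scrape targets come from files that a tool
# like Ansible writes out; Prometheus re-reads them on change.
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json

# A target file such as /etc/prometheus/targets/web.json would contain:
# [{"targets": ["web-1:9100", "web-2:9100"], "labels": {"env": "prod"}}]
```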
-
Not Synced
Once it has the layout, it goes out and
collects all the data.
-
Not Synced
It has a storage and a time series
database to store all that data locally.
-
Not Synced
It has a thing called PromQL, which is
a query language designed
-
Not Synced
for metrics and analytics.
-
Not Synced
From that PromQL, you can add frontends
that will,
-
Not Synced
whether it's a simple API client
to run reports,
-
Not Synced
you can use things like Grafana
for creating dashboards,
-
Not Synced
it's got a simple web UI built in.
-
Not Synced
You can plug in anything you want
on that side.
-
Not Synced
And then, it also has the ability to
continuously execute queries
-
Not Synced
called "recording rules"
-
Not Synced
and these recording rules have
two different modes.
-
Not Synced
You can either record, you can take
a query
-
Not Synced
and it will generate new data
from that query
-
Not Synced
or you can take a query, and
if it returns results,
-
Not Synced
it will fire an alert.
-
Not Synced
That alert is a push message
to the alert manager.
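The two modes of recording rules described here live in the same rules file. A small fragment, with illustrative names and thresholds:

```yaml
# Rules file (fragment): 'record' continuously writes new series from a
# query; 'alert' pushes to the alert manager when the query returns results.
groups:
  - name: example
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```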
-
Not Synced
This allows us to separate the generating
of alerts from the routing of alerts.
-
Not Synced
You can have one or hundreds of Prometheus
servers, all generating alerts
-
Not Synced
and it goes into an alert manager cluster,
which does the deduplication
-
Not Synced
and the routing to the human
-
Not Synced
because, of course, the thing
that we want is
-
Not Synced
we had dashboards with graphs, but
in order to find out if something is broken
-
Not Synced
you had to have a human
looking at the graph.
-
Not Synced
With Prometheus, we don't have to do that
anymore,
-
Not Synced
we can simply let the software tell us
that we need to go investigate
-
Not Synced
our problems.
-
Not Synced
We don't have to sit there and
stare at dashboards all day,
-
Not Synced
because that's really boring.
-
Not Synced
What does it look like to actually
get data into Prometheus?
-
Not Synced
This is a very basic output
of a Prometheus metric.
-
Not Synced
This is a very simple thing.
-
Not Synced
If you know much about
the linux kernel,
-
Not Synced
the linux kernel tracks ??? stats,
all the state of all the CPUs
-
Not Synced
in your system
-
Not Synced
and we express this by having
the name of the metric, which is
-
Not Synced
'node_cpu_seconds_total' and so
this is a self-describing metric,
-
Not Synced
like you can just read the metrics name
-
Not Synced
and you understand a little bit about
what's going on here.
-
Not Synced
The linux kernel and other kernels track
their usage by the number of seconds
-
Not Synced
spent doing different things and
-
Not Synced
that could be, whether it's in system or
user space or IRQs
-
Not Synced
or iowait or idle.
-
Not Synced
Actually, the kernel tracks how much
idle time it has.
-
Not Synced
It also tracks it by the number of CPUs.
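What that looks like on the wire is the text exposition format: the metric name plus `mode` and `cpu` labels. The sample values below are made up.

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 198792.04
node_cpu_seconds_total{cpu="0",mode="user"} 3342.51
node_cpu_seconds_total{cpu="0",mode="system"} 812.39
node_cpu_seconds_total{cpu="0",mode="iowait"} 54.20
node_cpu_seconds_total{cpu="1",mode="idle"} 198654.00
node_cpu_seconds_total{cpu="1",mode="user"} 3398.76
```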
-
Not Synced
With other monitoring systems, they used
to do this with a tree structure
-
Not Synced
and this caused a lot of problems, like
-
Not Synced
how do you mix and match data? So by
switching from
-
Not Synced
a tree structure to a tag-based structure,
-
Not Synced
we can do some really interesting
powerful data analytics.
-
Not Synced
Here's a nice example of taking
those CPU seconds counters
-
Not Synced
and then converting them into a graph
by using PromQL.
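The slide's exact query isn't in the transcript, but a typical PromQL expression for turning those per-mode CPU counters into a graphable rate looks like:

```
# Per-second CPU time by mode, averaged over a 5-minute window
sum by (mode) (rate(node_cpu_seconds_total[5m]))
```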
-
Not Synced
Now we can get into
Metrics-Based Alerting.
-
Not Synced
Now we have this graph, we have this thing
-
Not Synced
we can look and see here
-
Not Synced
"Oh there is some little spike here,
we might want to know about that."
-
Not Synced
Now we can get into Metrics-Based
Alerting.
-
Not Synced
I used to be a site reliability engineer,
I'm still a site reliability engineer at heart
-
Not Synced
and we have this concept of the things you
need to run a site or a service reliably.
-
Not Synced
The most important thing you need is
down at the bottom,
-
Not Synced
Monitoring, because if you don't have
monitoring of your service,
-
Not Synced
how do you know it's even working?
-
Not Synced
There's a couple of techniques here, and
we want to alert based on data
-
Not Synced
and not just those end to end tests.
-
Not Synced
There's a couple of techniques, a thing
called the RED method
-
Not Synced
and there's a thing called the USE method
-
Not Synced
and there are a couple of nice
blog posts about this
-
Not Synced
and basically it defines that, for example,
-
Not Synced
the RED method talks about the requests
that your system is handling
-
Not Synced
There are three things:
-
Not Synced
There's the number of requests, there's
the number of errors
-
Not Synced
and there's how long each request takes,
the duration.
-
Not Synced
With the combination of these three things
-
Not Synced
you can determine most of
what your users see
-
Not Synced
"Did my request go through? Did it
return an error? Was it fast?"
-
Not Synced
Most people, that's all they care about.
-
Not Synced
"I made a request to a website and
it came back and it was fast."
-
Not Synced
It's a very simple method of just, like,
-
Not Synced
those are the important things to
determine if your site is healthy.
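The three RED signals map onto three PromQL queries. The metric names here are conventional examples, not ones from the talk.

```
# Rate: requests per second, over a 5-minute window
sum(rate(http_requests_total[5m]))

# Errors: requests answered with a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: 99th percentile latency from a histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```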
-
Not Synced
But we can go back to some more
traditional, sysadmin-style alerts:
-
Not Synced
this is basically taking the filesystem
available space,
-
Not Synced
divided by the filesystem size, that becomes
the ratio of filesystem availability
-
Not Synced
from 0 to 1.
-
Not Synced
Multiply it by 100, we now have
a percentage
-
Not Synced
and if it's less than or equal to 1%
for 15 minutes,
-
Not Synced
this is less than 1% space, we should tell
a sysadmin to go check
-
Not Synced
the ??? filesystem ???
-
Not Synced
It's super nice and simple.
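Written out as an alerting rule, that filesystem check might look like the following; the metric names come from the node exporter, and the alert name and severity are illustrative.

```yaml
# Available space as a percentage of filesystem size; alert if it has
# been at or below 1% for 15 minutes.
- alert: FilesystemAlmostFull
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
  for: 15m
  labels:
    severity: warning
```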
-
Not Synced
We can also tag, we can include…
-
Not Synced
Every alert includes all the extraneous
labels that Prometheus adds to your metrics
-
Not Synced
When you add a metric in Prometheus, if
we go back and we look at this metric.
-
Not Synced
This metric only contains the information
about the internals of the application;
-
Not Synced
anything about, like, what server it's on,
is it running in a container,
-
Not Synced
what cluster does it come from,
what ??? is it on,
-
Not Synced
that's all extra annotations that are
added by the Prometheus server
-
Not Synced
at discovery time.
-
Not Synced
I don't have a good example of what
those labels look like
-
Not Synced
but every metric gets annotated
with location information.
-
Not Synced
That location information also comes through
as labels in the alert
-
Not Synced
so, if you have a message coming
into your alert manager,
-
Not Synced
the alert manager can look and go
-
Not Synced
"Oh, that's coming from this datacenter"
-
Not Synced
and it can include that in the email or
IRC message or SMS message.
-
Not Synced
So you can include
-
Not Synced
"Filesystem is out of space on this host
from this datacenter"
-
Not Synced
All these labels get passed through and
then you can append
-
Not Synced
"severity: critical" to that alert and
include that in the message to the human
-
Not Synced
because of course, this is how you define…
-
Not Synced
Getting the message from the monitoring
to the human.
-
Not Synced
You can even include nice things like,
-
Not Synced
if you've got documentation, you can
include a link to the documentation
-
Not Synced
as an annotation
-
Not Synced
and the alert manager can take that
basic url and, you know,
-
Not Synced
massage it into whatever it needs
to look like to actually get
-
Not Synced
the operator to the correct documentation.
-
Not Synced
We can also do more fun things:
-
Not Synced
since we actually are not just checking
-
Not Synced
what is the space right now,
we're tracking data over time,
-
Not Synced
we can use 'predict_linear'.
-
Not Synced
'predict_linear' just takes and does
a simple linear regression.
-
Not Synced
This example takes the filesystem
available space over the last hour and
-
Not Synced
does a linear regression.
-
Not Synced
Prediction says "Well, it's going that way
and four hours from now,
-
Not Synced
based on one hour of history, it's gonna
be less than 0, which means full".
-
Not Synced
We know that within the next four hours,
the disc is gonna be full
-
Not Synced
so we can tell the operator ahead of time
that it's gonna be full
-
Not Synced
and not just tell them that it's full
right now.
-
Not Synced
They have some window of ability
to fix it before it fails.
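That prediction is a one-line PromQL expression:

```
# Extrapolate the last 1 hour of data 4 hours into the future;
# a predicted value below zero means the filesystem will be full.
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
```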
-
Not Synced
This is really important because
if you're running a site
-
Not Synced
you want to be able to have alerts
that tell you that your system is failing
-
Not Synced
before it actually fails.
-
Not Synced
Because if it fails, you're out of SLO
or SLA and
-
Not Synced
your users are gonna be unhappy
-
Not Synced
and you don't want the users to tell you
that your site is down
-
Not Synced
you want to know about it before
your users can even tell.
-
Not Synced
This allows you to do that.
-
Not Synced
And also of course, Prometheus being
a modern system,
-
Not Synced
we fully support UTF-8 in all of our labels.
-
Not Synced
Here's another one, here's a good example
from the USE method.
-
Not Synced
This is a rate of 500 errors coming from
an application
-
Not Synced
and you can simply alert that
-
Not Synced
there are more than 500 errors per second
coming out of the application,
-
Not Synced
if that's your threshold for ???
-
Not Synced
And you can do other things,
-
Not Synced
you can convert that from just
a rate of errors
-
Not Synced
to a percentage of errors.
-
Not Synced
So you could say
-
Not Synced
"I have an SLA of three nines (99.9%)" and so you can say
-
Not Synced
"If the rate of errors divided by the rate
of requests is .01,
-
Not Synced
or is more than .01, then
that's a problem."
-
Not Synced
You can include that level of
error granularity.
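That ratio alert is another short PromQL expression. The metric name is a conventional example, and the threshold follows the .01 the speaker quotes:

```
# Fraction of requests that return 5xx; alert when it exceeds 1%.
  rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m])
> 0.01
```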
-
Not Synced
And if you're just doing a blackbox test,
-
Not Synced
you wouldn't know this. You would only know
if you got an error from the system,
-
Not Synced
then you got another error from the system
-
Not Synced
then you fire an alert.
-
Not Synced
But if those checks are one minute apart
and you're serving 1000 requests per second
-
Not Synced
you could be serving 10,000 errors before
you even get an alert.
-
Not Synced
And you might miss it, because
-
Not Synced
what if you only get one random error
-
Not Synced
and then the next time, you're serving
25% errors,
-
Not Synced
you only have a 25% chance of that check
failing again.
-
Not Synced
You really need these metrics in order
to get
-
Not Synced
proper reports of the status of your system
-
Not Synced
There are even more options.
-
Not Synced
You can slice and dice those labels.
-
Not Synced
If you have a label on all of
your applications called 'service'
-
Not Synced
you can send that 'service' label through
to the message
-
Not Synced
and you can say
"Hey, this service is broken".
-
Not Synced
You can include that service label
in your alert messages.
-
Not Synced
And that's it, I can go to a demo and Q&A.
-
Not Synced
[Applause]
-
Not Synced
Any questions so far?
-
Not Synced
Or anybody want to see a demo?
-
Not Synced
[Q] Hi. Does Prometheus do metric
discovery inside containers
-
Not Synced
or do I have to implement the metrics
myself?
-
Not Synced
[A] For metrics in containers, there are
already things that expose
-
Not Synced
the metrics of the container system
itself.
-
Not Synced
There's a utility called 'cAdvisor' and
Not Synced
cAdvisor takes the Linux cgroup data
and exposes it as metrics,
-
Not Synced
so you can get data about
how much CPU time is being
-
Not Synced
spent in your container,
-
Not Synced
how much memory is being used
by your container.
-
Not Synced
[Q] But not about the application,
just about the container usage ?
-
Not Synced
[A] Right. Because the container
has no idea
-
Not Synced
whether your application is written
in Ruby or Go or Python or whatever,
-
Not Synced
you have to build that into
your application in order to get the data.
-
Not Synced
So for Prometheus,
-
Not Synced
we've written client libraries that can be
included in your application directly
-
Not Synced
so you can get that data out.
-
Not Synced
If you go to the Prometheus website,
we have a whole series of client libraries
-
Not Synced
and we cover a pretty good selection
of popular software.
-
Not Synced
[Q] What is the current state of
long-term data storage?
-
Not Synced
[A] Very good question.
-
Not Synced
There's been several…
-
Not Synced
There's actually several different methods
of doing this.
-
Not Synced
Prometheus stores all this data locally
in its own data storage
-
Not Synced
on the local disk.
-
Not Synced
But that's only as durable as
that server is durable.
-
Not Synced
So if you've got a really durable server,
-
Not Synced
you can store as much data as you want,
-
Not Synced
you can store years and years of data
locally on the Prometheus server.
-
Not Synced
That's not a problem.
-
Not Synced
There's a bunch of misconceptions because
of our default
-
Not Synced
and the language on our website said
-
Not Synced
"It's not long-term storage"
-
Not Synced
simply because we leave that problem
up to the person running the server.
-
Not Synced
But the time series database
that Prometheus includes
-
Not Synced
is actually quite durable.
-
Not Synced
But it's only as durable as the server
underneath it.
-
Not Synced
So if you've got a very large cluster and
you want really high durability,
-
Not Synced
you need to have some kind of
cluster software,
-
Not Synced
but because we want Prometheus to be
simple to deploy
-
Not Synced
and very simple to operate
-
Not Synced
and also very robust.
-
Not Synced
We didn't want to include any clustering
in Prometheus itself,
-
Not Synced
because anytime you have a clustered
software,
-
Not Synced
what happens if your network is
a little wonky?
-
Not Synced
The first thing that happens is
all of your distributed systems fail.
-
Not Synced
And building distributed systems to be
really robust is really hard
-
Not Synced
so Prometheus is what we call
an "uncoordinated distributed system".
-
Not Synced
If you've got two Prometheus servers
monitoring all your targets in an HA mode
-
Not Synced
in a cluster, and there's a split brain,
-
Not Synced
each Prometheus can see
half of the cluster and
-
Not Synced
it can see that the other half
of the cluster is down.
-
Not Synced
They can both try to get alerts out
to the alert manager
-
Not Synced
and this is a really really robust way of
handling split brains
-
Not Synced
and bad network failures and bad problems
in a cluster.
-
Not Synced
It's designed to be super super robust
-
Not Synced
and so the two individual
Prometheus servers in your cluster
-
Not Synced
don't have to talk to each other
to do this,
-
Not Synced
they can just do it independently.
-
Not Synced
But if you want to be able
to correlate data
-
Not Synced
between many different Prometheus servers
-
Not Synced
you need an external data storage
to do this.
-
Not Synced
And also you may not have
very big servers,
-
Not Synced
you might be running your Prometheus
in a container
-
Not Synced
and it's only got a little bit of local
storage space
-
Not Synced
so you want to send all that data up
to a big cluster datastore
-
Not Synced
for ???
-
Not Synced
We have several different ways of
doing this.
-
Not Synced
There's the classic way which is called
federation
-
Not Synced
where you have one Prometheus server
polling in summary data from
-
Not Synced
each of the individual Prometheus servers
-
Not Synced
and this is useful if you want to run
alerts against data coming
-
Not Synced
from multiple Prometheus servers.
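Federation is configured as an ordinary scrape of the special `/federate` endpoint, with match selectors choosing which summary series to pull in. Target names here are illustrative.

```yaml
# Global Prometheus (fragment): pull selected series from the
# per-datacenter Prometheus servers via their /federate endpoints.
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the original servers' labels
    metrics_path: /federate
    params:
      "match[]":
        - '{job="node"}'        # only federate matching series
    static_configs:
      - targets:
          - prometheus-eu:9090
          - prometheus-us:9090
```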
-
Not Synced
But federation is not replication.
-
Not Synced
It can only pull in a little bit of data
from each Prometheus server.
-
Not Synced
If you've got a million metrics on
each Prometheus server,
-
Not Synced
you can't pull in a million metrics
and do…
-
Not Synced
If you've got 10 of those, you can't
pull in 10 million metrics
-
Not Synced
simultaneously into one Prometheus
server.
-
Not Synced
It's just too much data.
-
Not Synced
There is two others, a couple of other
nice options.
-
Not Synced
There's a piece of software called
Cortex.
-
Not Synced
Cortex is a Prometheus server that
stores its data in a database.
-
Not Synced
Specifically, a distributed database.
-
Not Synced
Things that are based on the Google
Bigtable model, like Cassandra or…
-
Not Synced
What's the Amazon one?
-
Not Synced
Yeah.
-
Not Synced
DynamoDB.
-
Not Synced
If you have a DynamoDB or a Cassandra
cluster, or one of these other
-
Not Synced
really big distributed storage clusters,
-
Not Synced
Cortex can run and the Prometheus servers
will stream their data up to Cortex
-
Not Synced
and it will keep a copy of that across
all of your Prometheus servers.
-
Not Synced
And because it's based on things
like Cassandra,
-
Not Synced
it's super scalable.
-
Not Synced
But it's a little complex to run and
-
Not Synced
many people don't want to run that
complex infrastructure.
-
Not Synced
We have another new one, we just blogged
about it yesterday.
-
Not Synced
It's a thing called Thanos.
-
Not Synced
Thanos is Prometheus at scale.
-
Not Synced
Basically, the way it works…
-
Not Synced
Actually, why don't I bring that up?
-
Not Synced
This was developed by a company
called Improbable
-
Not Synced
and they wanted to…
-
Not Synced
They had billions of metrics coming from
hundreds of Prometheus servers.
-
Not Synced
They developed this in collaboration with
the Prometheus team to build
-
Not Synced
a super highly scalable Prometheus server.
-
Not Synced
Prometheus itself stores the incoming
metrics data in a write-ahead log
-
Not Synced
and then every two hours, it creates
a compaction cycle
-
Not Synced
and it creates an immutable time series
block of data, which is
-
Not Synced
all the time series blocks themselves
-
Not Synced
and then an index into that data.
-
Not Synced
Those two-hour windows are all immutable,
-
Not Synced
so Thanos has a little sidecar binary that
watches for those new directories and
-
Not Synced
uploads them into a blob store.
-
Not Synced
So you could put them in S3 or minio or
some other simple object storage.
-
Not Synced
And then now you have all of your data,
all of this index data already
-
Not Synced
ready to go
-
Not Synced
and then the final sidecar creates
a little mesh cluster that can read from
-
Not Synced
all of those S3 blocks.
-
Not Synced
Now, you have this super global view
all stored in a big bucket storage and
-
Not Synced
things like S3 or minio are…
-
Not Synced
Bucket storage is not a database, so it's
operationally a little easier to run.
-
Not Synced
Plus, now we have all this data in
a bucket store and
-
Not Synced
the Thanos sidecars can talk to each other
-
Not Synced
We can now have a single entry point.
-
Not Synced
You can query Thanos and Thanos will
distribute your query
-
Not Synced
across all your Prometheus servers.
-
Not Synced
So now you can do global queries across
all of your servers.
-
Not Synced
But it's very new, they just released
their first release candidate yesterday.
-
Not Synced
It is looking to be like
the coolest thing ever
-
Not Synced
for running large scale Prometheus.
-
Not Synced
Here's an example of how that is laid out.