[Script Info]
Title: 
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, we had a talk by a non-GitLab person\Nabout GitLab.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, we have a talk by a GitLab person\Non non-GitLab.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Something like that?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The CCCHH hackerspace is now open,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from now on if you want to go there,\Nthat's the announcement.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And the next talk will be by Ben Kochie
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,on metrics-based monitoring\Nwith Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Welcome.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause]
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Alright, so
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,my name is Ben Kochie
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I work on DevOps features for GitLab
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and apart from working for GitLab, I also work\Non the open source Prometheus project.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I live in Berlin and I've been using\NDebian since ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,yes, quite a long time.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, what is Metrics-based Monitoring?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you're running software in production,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you probably want to monitor it,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because if you don't monitor it, you don't\Nknow it's right.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,??? break down into two categories:
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's blackbox monitoring and\Nthere's whitebox monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Blackbox monitoring is treating\Nyour software like a blackbox.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just checks to see, like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is it responding, or does it ping
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or ??? HTTP requests
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[mic turned on]
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ah, there we go, much better.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, blackbox monitoring is a probe,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it just kind of looks from the outside\Nto your software
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it has no knowledge of the internals
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's really good for end to end testing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a fairly complicated\Nservice,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you come in from the outside, you go\Nthrough the load balancer,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you hit the API server,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the API server might hit a database,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you go all the way through\Nto the back of the stack
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and all the way back out
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that everything is working\Nend to end.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But you only know about it\Nfor that one request.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So in order to find out if your service\Nis working,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from the end to end, for every single\Nrequest,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this requires whitebox instrumentation.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, basically, every event that happens\Ninside your software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,inside a serving stack,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,gets collected and gets counted,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that every request hits\Nthe load balancer,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits your application\Nservice,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits the database.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You know that everything matches up
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is called whitebox, or\Nmetrics-based monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are different examples of, like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the kind of software that does blackbox\Nand whitebox monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you have software like Nagios that\Nyou can configure checks
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or Pingdom,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Pingdom will do a ping of your website.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then there is metrics-based monitoring,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,things like Prometheus, things like\Nthe TICK stack from InfluxData,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,New Relic and other commercial solutions
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but of course I like to talk about\Nthe open source solutions.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We're gonna talk a little bit about\NPrometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus came out of the idea that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we needed a monitoring system that could\Ncollect all this whitebox metric data
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and do something useful with it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Not just give us a pretty graph, but\Nwe also want to be able to
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,alert on it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So we needed both
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a data gathering and an analytics system\Nin the same instance.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,To do this, we built this thing and\Nwe looked at the way that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,data was being generated\Nby the applications
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there are advantages and\Ndisadvantages to this
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,push vs. poll model for metrics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We decided to go with the polling model
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because there are some slight advantages\Nfor polling over pushing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you get this free\Nblackbox check
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that the application is running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,When you poll your application, you know\Nthat the process is running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you are doing push-based, you can't\Ntell the difference between
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,your application doing no work and\Nyour application not running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you don't know if it's stuck,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or is it just not having to do any work.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, the polling system knows\Nthe state of your network.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a defined set of services,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that inventory drives what should be there.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Again, it's like, the disappearing,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is the process dead, or is it just\Nnot doing anything?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you know for a fact\Nwhat processes should be there,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's a bit of an advantage there.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, there's really easy testing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With push-based metrics, you have to\Nfigure out
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if you want to test a new version of\Nthe monitoring system or
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you want to test something new,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you have to ??? a copy of the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you can just set up\Nanother instance of your monitoring
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and just test it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Or you don't even have,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it doesn't even have to be monitoring,\Nyou can just use curl
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to poll the metrics endpoint.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's significantly easier to test.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other thing with the…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other nice thing is that\Nthe client is really simple.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The client doesn't have to know\Nwhere the monitoring system is.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It doesn't have to know about ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It just has to sit and collect the data\Nabout itself.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So it doesn't have to know anything about\Nthe topology of the network.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,As an application developer, if you're\Nwriting a DNS server or
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,some other piece of software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you don't have to know anything about\Nmonitoring software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can just implement it inside\Nyour application and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the monitoring software, whether it's\NPrometheus or something else,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,can just come and collect that data for you.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's kind of similar to a very old\Nmonitoring system called SNMP,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but SNMP has a significantly less friendly\Ndata model for developers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is the basic layout\Nof a Prometheus server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,At the core, there's a Prometheus server
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it deals with all the data collection\Nand analytics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Basically, this one binary,\Nit's all written in golang.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a single binary.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It knows how to read from your inventory,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's a bunch of different methods,\Nwhether you've got
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a Kubernetes cluster or a cloud platform
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you have your own customized thing\Nwith Ansible.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ansible can take your layout, drop that\Ninto a config file and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus can pick that up.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Once it has the layout, it goes out and\Ncollects all the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a storage and a time series\Ndatabase to store all that data locally.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a thing called PromQL, which is\Na query language designed
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for metrics and analytics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,From that PromQL, you can add frontends\Nthat will,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,whether it's a simple API client\Nto run reports,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can use things like Grafana\Nfor creating dashboards,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's got a simple web UI built in.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can plug in anything you want\Non that side.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then, it also has the ability to\Ncontinuously execute queries
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,called "recording rules"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and these recording rules have\Ntwo different modes.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can either record, you can take\Na query
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it will generate new data\Nfrom that query
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you can take a query, and\Nif it returns results,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it will return an alert.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That alert is a push message\Nto the Alertmanager.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This allows us to separate the generating\Nof alerts from the routing of alerts.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can have one or hundreds of Prometheus\Nservices, all generating alerts
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it goes into an Alertmanager cluster\Nand sends, does the deduplication
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the routing to the human
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because, of course, the thing\Nthat we want is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we had dashboards with graphs, but\Nin order to find out if something is broken
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you had to have a human\Nlooking at the graph.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With Prometheus, we don't have to do that\Nanymore,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can simply let the software tell us\Nthat we need to go investigate
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,our problems.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We don't have to sit there and\Nstare at dashboards all day,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because that's really boring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What does it look like to actually\Nget data into Prometheus?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very basic output\Nof a Prometheus metric.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very simple thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you know much about\Nthe Linux kernel,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the Linux kernel tracks ??? stats,\Nall the state of all the CPUs
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,in your system
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we express this by having\Nthe name of the metric, which is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'node_cpu_seconds_total' and so\Nthis is a self-describing metric,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,like you can just read the metric's name
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you understand a little bit about\Nwhat's going on here.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The Linux kernel and other kernels track\Ntheir usage by the number of seconds
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,spent doing different things and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that could be, whether it's in system or\Nuser space or IRQs
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or iowait or idle.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Actually, the kernel tracks how much\Nidle time it has.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It also tracks it by the number of CPUs.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With other monitoring systems, they used\Nto do this with a tree structure
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this caused a lot of problems,\Nfor like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,how do you mix and match data? So\Nby switching from
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a tree structure to a tag-based structure,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can do some really interesting\Npowerful data analytics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's a nice example of taking\Nthose CPU seconds counters
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then converting them into a graph\Nby using PromQL.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we have this graph, we have this thing
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can look and see here
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Oh there is some little spike here,\Nwe might want to know about that."
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we can get into Metrics-Based\NAlerting.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I used to be a site reliability engineer,\NI'm still a site reliability engineer at heart
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we have this concept of things that\Nyou need on a site or a service reliably
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The most important thing you need is\Ndown at the bottom,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Monitoring, because if you don't have\Nmonitoring of your service,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,how do you know it's even working?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a couple of techniques here, and\Nwe want to alert based on data
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and not just those end to end tests.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a couple of techniques, a thing\Ncalled the RED method
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there's a thing called the USE method
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there are a couple of nice\Nblog posts about this
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and basically it defines that, for example,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the RED method talks about the requests\Nthat your system is handling
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are three things:
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's the number of requests, there's\Nthe number of errors
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there's how long it takes, the duration.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With the combination of these three things
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can determine most of\Nwhat your users see
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Did my request go through? Did it\Nreturn an error? Was it fast?"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Most people, that's all they care about.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"I made a request to a website and\Nit came back and it was fast."
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a very simple method of just, like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,those are the important things to\Ndetermine if your site is healthy.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But we can go back to some more\Ntraditional, sysadmin-style alerts:
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this is basically taking the filesystem\Navailable space,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,divided by the filesystem size, that becomes\Nthe ratio of filesystem availability
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from 0 to 1.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Multiply it by 100, we now have\Na percentage
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and if it's less than or equal to 1%\Nfor 15 minutes,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this is less than 1% space, we should tell\Na sysadmin to go check
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the ??? filesystem ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's super nice and simple.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can also tag, we can include…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Every alert includes all the extraneous\Nlabels that Prometheus adds to your metrics
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,When you add a metric in Prometheus, if\Nwe go back and we look at this metric.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This metric only contains the information\Nabout the internals of the application.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,anything about, like, what server it's on,\Nis it running in a container,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what cluster does it come from,\Nwhat ??? is it on,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that's all extra annotations that are\Nadded by the Prometheus server
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,at discovery time.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I don't have a good example of what\Nthose labels look like
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but every metric gets annotated\Nwith location information.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That location information also comes through\Nas labels in the alert
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so, if you have a message coming\Ninto your Alertmanager,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the Alertmanager can look and go
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Oh, that's coming from this datacenter"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it can include that in the email or\NIRC message or SMS message.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you can include
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Filesystem is out of space on this host\Nfrom this datacenter"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,All these labels get passed through and\Nthen you can append
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"severity: critical" to that alert and\Ninclude that in the message to the human
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because of course, this is how you define…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Getting the message from the monitoring\Nto the human.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can even include nice things like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if you've got documentation, you can\Ninclude a link to the documentation
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,as an annotation
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the Alertmanager can take that\Nbasic URL and, you know,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,massage it into whatever it needs\Nto look like to actually get
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the operator to the correct documentation.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can also do more fun things:
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,since we actually are not just checking
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what is the space right now,\Nwe're tracking data over time,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can use 'predict_linear'.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'predict_linear' just takes and does\Na simple linear regression.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This example takes the filesystem\Navailable space over the last hour and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,does a linear regression.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prediction says "Well, it's going that way\Nand four hours from now,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,based on one hour of history, it's gonna\Nbe less than 0, which means full".