So, we had a talk by a non-GitLab person about GitLab. Now, we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open; from now on, if you want to go there, you can. That's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome. [Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open-source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know whether it's working right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring. Blackbox monitoring treats your software like a black box. It's just checks to see, like, is it responding, does it answer pings or ??? HTTP requests. [mic turned on] Ah, there we go, much better.

So, blackbox monitoring is a probe. It just looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. If you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request. In order to find out if your service is working, end to end, for every single request, you need whitebox instrumentation. Basically, every event that happens inside your software, inside the serving stack, gets collected and counted, so you know that every request hits the load balancer, every request hits your application server, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.

There are different examples of the kinds of software that do blackbox and whitebox monitoring. For blackbox, you have software like Nagios, where you can configure checks, or Pingdom, which will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic, and other commercial solutions, but of course I like to talk about the open-source solutions. So we're gonna talk a little bit about Prometheus.

Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but also let us alert on it. So we needed both a data-gathering and an analytics system in the same instance. To build this, we looked at the way data was being generated by applications, and there are advantages and disadvantages to the push vs. poll models for metrics. We decided to go with the polling model because there are some slight advantages to polling over pushing.

With polling, you get a free blackbox check that the application is running: when you poll your application, you know that the process is running. If you are doing push-based monitoring, you can't tell the difference between your application doing no work and your application not running. You don't know if it's stuck, or if it just doesn't have any work to do. With polling, the polling system knows the state of your network.
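(A quick aside to make that "free blackbox check" concrete, not something from the talk itself: every time Prometheus polls a target it records a synthetic metric called 'up', which is 1 when the scrape succeeded and 0 when it failed, so a minimal alerting rule along these lines catches the "process not running" case that a push-based system can miss. The job name, duration, and wording are invented for the sketch.)

    groups:
      - name: availability
        rules:
          - alert: InstanceDown
            # 'up' is generated by Prometheus itself for every scrape:
            # 1 = the target answered the poll, 0 = the poll failed.
            expr: up{job="my-application"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.instance }} has failed scrapes for 5 minutes"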
If you have a defined set of services, that inventory drives what should be there. Again, it's the disappearing-process problem: is the process dead, or is it just not doing anything? With polling, you know for a fact which processes should be there, and that's a bit of an advantage.

With polling, testing is also really easy. With push-based metrics, if you want to test a new version of the monitoring system, or test something new, you have to send a copy of the data to it. With polling, you can just set up another instance of your monitoring and test against that. It doesn't even have to be monitoring: you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit there and collect data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement the metrics inside your application, and the monitoring software, whether it's Prometheus or something else, can come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's the Prometheus server itself, and it deals with all the data collection and analytics. It's one binary, all written in Go. It knows how to read from your inventory, and there are a bunch of different methods for that, whether you've got a Kubernetes cluster, a cloud platform, or your own customized thing with Ansible. Ansible can take your layout, drop it into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has a storage layer, a time series database, to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether that's a simple API client to run reports or something like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

It also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and have it generate new data, or you can take a query and, if it returns results, it fires an alert. That alert is a push message to the Alertmanager. This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and they all go into an Alertmanager cluster, which does the deduplication and the routing to a human. Because, of course, what we used to have was dashboards with graphs, but in order to find out if something was broken you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.
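(To make that layout a little more concrete, here is a minimal sketch of what the configuration might look like when something like Ansible writes your inventory into a file of targets; the hostnames, ports, and file paths are invented for the example, this is not the speaker's actual config.)

    # prometheus.yml (sketch)
    global:
      scrape_interval: 15s            # how often every target gets polled

    scrape_configs:
      - job_name: node                # host-level metrics, e.g. from node_exporter
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/nodes.yml   # e.g. written out by Ansible

    rule_files:
      - /etc/prometheus/rules/*.yml   # recording and alerting rules

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager.example.com:9093']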
What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. It's a very simple thing. If you know much about the Linux kernel: the kernel tracks CPU stats, the state of all the CPUs in your system, and we express this with the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric's name and understand a little bit about what's going on here. The Linux kernel, and other kernels, track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. The kernel actually tracks how much idle time it has. It also tracks this per CPU. Other monitoring systems used to do this with a tree structure, and that caused a lot of problems for mixing and matching data, so by switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics.

Here's a nice example of taking those CPU seconds counters and converting them into a graph by using PromQL. Now we have this graph, and we can look at it and say "Oh, there is some little spike here, we might want to know about that." So now we can get into metrics-based alerting.

I used to be a site reliability engineer, and I'm still a site reliability engineer at heart, and we have this concept of a hierarchy of things that you need to run a site or a service reliably. The most important thing, down at the bottom, is monitoring, because if you don't have monitoring of your service, how do you know it's even working?

We want to alert based on data, and not just those end-to-end tests, and there are a couple of techniques for that: a thing called the RED method and a thing called the USE method, and there are some nice blog posts about them. For example, the RED method talks about the requests that your system is handling. There are three things: the number of requests, the number of errors, and how long each request takes, the duration. With the combination of these three things you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" For most people, that's all they care about: "I made a request to a website, it came back, and it was fast." It's a very simple method: those are the important things for determining whether your site is healthy.

But we can also go back to some more traditional, sysadmin-style alerts. This one basically takes the filesystem available space, divided by the filesystem size, which gives you the ratio of filesystem availability from 0 to 1. Multiply it by 100 and you have a percentage, and if it's less than or equal to 1% for 15 minutes, meaning less than 1% free space, we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple.

We can also include labels. Every alert includes all the extra labels that Prometheus adds to your metrics. If we go back and look at the metric from before: the metric itself only contains information about the internals of the application. Anything about, like, what server it's on, whether it's running in a container, what cluster it comes from, what ??? it is on, those are all extra annotations that are added by the Prometheus server at discovery time.
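(A sketch of the two things just described, not the exact slides: first the PromQL that turns those CPU-seconds counters into a per-mode usage graph, then the "less than 1% free for 15 minutes" check written as an alerting rule. The metric names assume a current node_exporter, i.e. node_cpu_seconds_total, node_filesystem_avail_bytes and node_filesystem_size_bytes.)

    # Per-mode CPU usage: convert the raw seconds counters into a rate,
    # summed across all CPUs, one series per mode (user, system, iowait, idle, ...).
    sum by (mode) (rate(node_cpu_seconds_total[5m]))

    # The traditional sysadmin-style disk alert as a rule file entry.
    groups:
      - name: filesystem
        rules:
          - alert: FilesystemAlmostFull
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.instance }}: less than 1% free on {{ $labels.mountpoint }}"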
I don't have a good example of what those labels look like, but every metric gets annotated with location information. That location information also comes through as labels on the alert, so if you have a message coming into your Alertmanager, the Alertmanager can look at it and go "Oh, that's coming from this datacenter" and it can include that in the email or IRC message or SMS message. So you can say "Filesystem is out of space on this host in this datacenter." All these labels get passed through, and then you can append "severity: critical" to the alert and include that in the message to the human, because, of course, this is how you define getting the message from the monitoring to the human. You can even include nice things like documentation: if you've got documentation, you can include a link to it as an annotation, and the Alertmanager can take that base URL and massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things. Since we are not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression on it. The prediction says "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than 0, which means full." So we know that within the next four hours the disk is gonna be full, and we can tell the operator ahead of time that it's going to be full, not just that it's full right now. They have some window in which to fix it before it fails. This is really important, because if you're running a site, you want alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA and your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that. And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, a good example from the USE method. This is a rate of HTTP 500 errors coming from an application, and you can simply alert if there are more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors to a percentage of errors. So you could say "I have an SLA of three nines", and then you can say "If the rate of errors divided by the rate of requests is more than .01, then that's a problem." You can include that level of error granularity. If you're just doing a blackbox test, you wouldn't know this; you would only know if you got an error from one check, then another error from the next check, and then you fire an alert. But if those checks are one minute apart and you're serving 1,000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it entirely, because if you only get one random error and then, the next time, you're serving 25% errors, you only have a 25% chance of that check failing again. You really need these metrics in order to get proper reports of the status of your system.
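(Roughly what those last examples look like in PromQL; the metric names, an http_requests_total counter with a status label and the node_exporter filesystem metric, are assumptions for the sketch rather than what was on the slides.)

    # Linear regression over the last hour of free-space data: fire if the
    # filesystem is predicted to hit zero bytes free within four hours.
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

    # Absolute threshold: more than 500 HTTP 500 responses per second.
    rate(http_requests_total{status="500"}[5m]) > 500

    # Error ratio: the rate of 500s divided by the rate of all requests,
    # compared against the error budget mentioned above.
      sum(rate(http_requests_total{status="500"}[5m]))
    /
      sum(rate(http_requests_total[5m]))
    > 0.01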
There are even more options: you can slice and dice those labels. If you have a label called 'service' on all of your applications, you can send that 'service' label through to the message and say "Hey, this service is broken." You can include that service label in your alert messages. And that's it; I can go on to a demo and Q&A. [Applause]

Any questions so far? Or does anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called cAdvisor, and cAdvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally, in its own storage on the local disk. But that's only as durable as that server is. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and because the language on our website said "it's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's just only as durable as the server underneath it. So if you've got a very large cluster and you want really high durability, you need some kind of clustered software. But we want Prometheus to be simple to deploy, very simple to operate, and also very robust, so we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that goes down is all of your distributed systems. And building distributed systems to be really robust is really hard, so Prometheus is what we call an uncoordinated distributed system. If you've got two Prometheus servers monitoring all your targets in an HA mode, and there's a split brain, each Prometheus can see half of the cluster and it can see that the other half of the cluster is down. They can both try to get alerts out to the Alertmanager, and this is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super robust.
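(For reference, the cAdvisor answer above comes down to just another scrape job; this sketch assumes cAdvisor is reachable on its default port 8080 on each container host, and the hostnames are invented.)

    scrape_configs:
      - job_name: cadvisor
        static_configs:
          - targets:
              - 'container-host-1.example.com:8080'
              - 'container-host-2.example.com:8080'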