WEBVTT 99:59:59.999 --> 99:59:59.999 So, we had a talk by a non-GitLab person about GitLab. 99:59:59.999 --> 99:59:59.999 Now, we have a talk by a GitLab person on non-GitLab. 99:59:59.999 --> 99:59:59.999 Something like that? 99:59:59.999 --> 99:59:59.999 The CCCHH hackerspace is now open, 99:59:59.999 --> 99:59:59.999 from now on if you want to go there, that's the announcement. 99:59:59.999 --> 99:59:59.999 And the next talk will be by Ben Kochie 99:59:59.999 --> 99:59:59.999 on metrics-based monitoring with Prometheus. 99:59:59.999 --> 99:59:59.999 Welcome. 99:59:59.999 --> 99:59:59.999 [Applause] 99:59:59.999 --> 99:59:59.999 Alright, so 99:59:59.999 --> 99:59:59.999 my name is Ben Kochie 99:59:59.999 --> 99:59:59.999 I work on DevOps features for GitLab 99:59:59.999 --> 99:59:59.999 and apart from working for GitLab, I also work on the open source Prometheus project. 99:59:59.999 --> 99:59:59.999 I live in Berlin and I've been using Debian since ??? 99:59:59.999 --> 99:59:59.999 yes, quite a long time. 99:59:59.999 --> 99:59:59.999 So, what is Metrics-based Monitoring? 99:59:59.999 --> 99:59:59.999 If you're running software in production, 99:59:59.999 --> 99:59:59.999 you probably want to monitor it, 99:59:59.999 --> 99:59:59.999 because if you don't monitor it, you don't know it's right. 99:59:59.999 --> 99:59:59.999 ??? break down into two categories: 99:59:59.999 --> 99:59:59.999 there's blackbox monitoring and there's whitebox monitoring. 99:59:59.999 --> 99:59:59.999 Blackbox monitoring is treating your software like a blackbox. 99:59:59.999 --> 99:59:59.999 It's just checks to see, like, 99:59:59.999 --> 99:59:59.999 is it responding, or does it ping 99:59:59.999 --> 99:59:59.999 or ??? HTTP requests 99:59:59.999 --> 99:59:59.999 [mic turned on] 99:59:59.999 --> 99:59:59.999 Ah, there we go, much better.
99:59:59.999 --> 99:59:59.999 So, blackbox monitoring is a probe, 99:59:59.999 --> 99:59:59.999 it just kind of looks from the outside to your software 99:59:59.999 --> 99:59:59.999 and it has no knowledge of the internals 99:59:59.999 --> 99:59:59.999 and it's really good for end to end testing. 99:59:59.999 --> 99:59:59.999 So if you've got a fairly complicated service, 99:59:59.999 --> 99:59:59.999 you come in from the outside, you go through the load balancer, 99:59:59.999 --> 99:59:59.999 you hit the API server, 99:59:59.999 --> 99:59:59.999 the API server might hit a database, 99:59:59.999 --> 99:59:59.999 and you go all the way through to the back of the stack 99:59:59.999 --> 99:59:59.999 and all the way back out 99:59:59.999 --> 99:59:59.999 so you know that everything is working end to end. 99:59:59.999 --> 99:59:59.999 But you only know about it for that one request. 99:59:59.999 --> 99:59:59.999 So in order to find out if your service is working, 99:59:59.999 --> 99:59:59.999 from the end to end, for every single request, 99:59:59.999 --> 99:59:59.999 this requires whitebox instrumentation. 99:59:59.999 --> 99:59:59.999 So, basically, every event that happens inside your software, 99:59:59.999 --> 99:59:59.999 inside a serving stack, 99:59:59.999 --> 99:59:59.999 gets collected and gets counted, 99:59:59.999 --> 99:59:59.999 so you know that every request hits the load balancer, 99:59:59.999 --> 99:59:59.999 every request hits your application service, 99:59:59.999 --> 99:59:59.999 every request hits the database. 99:59:59.999 --> 99:59:59.999 You know that everything matches up 99:59:59.999 --> 99:59:59.999 and this is called whitebox, or metrics-based monitoring. 99:59:59.999 --> 99:59:59.999 There are different examples of, like, 99:59:59.999 --> 99:59:59.999 the kind of software that does blackbox and whitebox monitoring.
99:59:59.999 --> 99:59:59.999 So you have software like Nagios that you can configure checks 99:59:59.999 --> 99:59:59.999 or Pingdom, 99:59:59.999 --> 99:59:59.999 Pingdom will do ping of your website. 99:59:59.999 --> 99:59:59.999 And then there is metrics-based monitoring, 99:59:59.999 --> 99:59:59.999 things like Prometheus, things like the TICK stack from InfluxData, 99:59:59.999 --> 99:59:59.999 New Relic and other commercial solutions 99:59:59.999 --> 99:59:59.999 but of course I like to talk about the open source solutions. 99:59:59.999 --> 99:59:59.999 We're gonna talk a little bit about Prometheus. 99:59:59.999 --> 99:59:59.999 Prometheus came out of the idea that 99:59:59.999 --> 99:59:59.999 we needed a monitoring system that could collect all this whitebox metric data 99:59:59.999 --> 99:59:59.999 and do something useful with it. 99:59:59.999 --> 99:59:59.999 Not just give us a pretty graph, but we also want to be able to 99:59:59.999 --> 99:59:59.999 alert on it. 99:59:59.999 --> 99:59:59.999 So we needed both 99:59:59.999 --> 99:59:59.999 a data gathering and an analytics system in the same instance. 99:59:59.999 --> 99:59:59.999 To do this, we built this thing and we looked at the way that 99:59:59.999 --> 99:59:59.999 data was being generated by the applications 99:59:59.999 --> 99:59:59.999 and there are advantages and disadvantages to this 99:59:59.999 --> 99:59:59.999 push vs. poll model for metrics. 99:59:59.999 --> 99:59:59.999 We decided to go with the polling model 99:59:59.999 --> 99:59:59.999 because there are some slight advantages for polling over pushing. 99:59:59.999 --> 99:59:59.999 With polling, you get this free blackbox check 99:59:59.999 --> 99:59:59.999 that the application is running. 99:59:59.999 --> 99:59:59.999 When you poll your application, you know that the process is running.
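NOTE The "free blackbox check" that polling gives you can be sketched in Python with only the standard library. This is a minimal illustration, not how Prometheus itself is implemented: the metric name `demo_requests_total`, the handler class and the `scrape` helper are all made up for the example. The only idea taken from Prometheus is that each poll yields an `up` value of 1 or 0 alongside the metrics.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_HANDLED = 42  # hypothetical counter the application keeps about itself


class MetricsHandler(BaseHTTPRequestHandler):
    """The application side: it only answers when somebody asks."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUESTS_HANDLED}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url, timeout=2.0):
    """The monitoring side: return (up, text), where up is 1 if the target answered."""
    # An empty ProxyHandler makes the demo ignore any proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 1, response.read().decode()
    except OSError:
        return 0, ""


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"

up_while_running, text = scrape(url)
server.shutdown()
server.server_close()
up_after_stop, _ = scrape(url)
print(up_while_running, text.strip(), up_after_stop)
```

The application never needs to know who is polling it. A failed poll is itself a signal that the process is gone, which a push-only setup cannot distinguish from an idle process.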
99:59:59.999 --> 99:59:59.999 If you are doing push-based, you can't tell the difference between 99:59:59.999 --> 99:59:59.999 your application doing no work and your application not running. 99:59:59.999 --> 99:59:59.999 So you don't know if it's stuck, 99:59:59.999 --> 99:59:59.999 or is it just not having to do any work. 99:59:59.999 --> 99:59:59.999 With polling, the polling system knows the state of your network. 99:59:59.999 --> 99:59:59.999 If you have a defined set of services, 99:59:59.999 --> 99:59:59.999 that inventory drives what should be there. 99:59:59.999 --> 99:59:59.999 Again, it's like, the disappearing, 99:59:59.999 --> 99:59:59.999 is the process dead, or is it just not doing anything? 99:59:59.999 --> 99:59:59.999 With polling, you know for a fact what processes should be there, 99:59:59.999 --> 99:59:59.999 and it's a bit of an advantage there. 99:59:59.999 --> 99:59:59.999 With polling, there's really easy testing. 99:59:59.999 --> 99:59:59.999 With push-based metrics, you have to figure out 99:59:59.999 --> 99:59:59.999 if you want to test a new version of the monitoring system or 99:59:59.999 --> 99:59:59.999 you want to test something new, 99:59:59.999 --> 99:59:59.999 you have to ??? a copy of the data. 99:59:59.999 --> 99:59:59.999 With polling, you can just set up another instance of your monitoring 99:59:59.999 --> 99:59:59.999 and just test it. 99:59:59.999 --> 99:59:59.999 Or you don't even have, 99:59:59.999 --> 99:59:59.999 it doesn't even have to be monitoring, you can just use curl 99:59:59.999 --> 99:59:59.999 to poll the metrics endpoint. 99:59:59.999 --> 99:59:59.999 It's significantly easier to test. 99:59:59.999 --> 99:59:59.999 The other thing with the… 99:59:59.999 --> 99:59:59.999 The other nice thing is that the client is really simple. 99:59:59.999 --> 99:59:59.999 The client doesn't have to know where the monitoring system is. 99:59:59.999 --> 99:59:59.999 It doesn't have to know about ??? 
99:59:59.999 --> 99:59:59.999 It just has to sit and collect the data about itself. 99:59:59.999 --> 99:59:59.999 So it doesn't have to know anything about the topology of the network. 99:59:59.999 --> 99:59:59.999 As an application developer, if you're writing a DNS server or 99:59:59.999 --> 99:59:59.999 some other piece of software, 99:59:59.999 --> 99:59:59.999 you don't have to know anything about monitoring software, 99:59:59.999 --> 99:59:59.999 you can just implement it inside your application and 99:59:59.999 --> 99:59:59.999 the monitoring software, whether it's Prometheus or something else, 99:59:59.999 --> 99:59:59.999 can just come and collect that data for you. 99:59:59.999 --> 99:59:59.999 That's kind of similar to a very old monitoring system called SNMP, 99:59:59.999 --> 99:59:59.999 but SNMP has a significantly less friendly data model for developers. 99:59:59.999 --> 99:59:59.999 This is the basic layout of a Prometheus server. 99:59:59.999 --> 99:59:59.999 At the core, there's a Prometheus server 99:59:59.999 --> 99:59:59.999 and it deals with all the data collection and analytics. 99:59:59.999 --> 99:59:59.999 Basically, this one binary, it's all written in golang. 99:59:59.999 --> 99:59:59.999 It's a single binary. 99:59:59.999 --> 99:59:59.999 It knows how to read from your inventory, 99:59:59.999 --> 99:59:59.999 there's a bunch of different methods, whether you've got 99:59:59.999 --> 99:59:59.999 a kubernetes cluster or a cloud platform 99:59:59.999 --> 99:59:59.999 or you have your own customized thing with ansible. 99:59:59.999 --> 99:59:59.999 Ansible can take your layout, drop that into a config file and 99:59:59.999 --> 99:59:59.999 Prometheus can pick that up. 99:59:59.999 --> 99:59:59.999 Once it has the layout, it goes out and collects all the data. 99:59:59.999 --> 99:59:59.999 It has a storage and a time series database to store all that data locally. 
99:59:59.999 --> 99:59:59.999 It has a thing called PromQL, which is a query language designed 99:59:59.999 --> 99:59:59.999 for metrics and analytics. 99:59:59.999 --> 99:59:59.999 From that PromQL, you can add frontends that will, 99:59:59.999 --> 99:59:59.999 whether it's a simple API client to run reports, 99:59:59.999 --> 99:59:59.999 you can use things like Grafana for creating dashboards, 99:59:59.999 --> 99:59:59.999 it's got a simple webUI built in. 99:59:59.999 --> 99:59:59.999 You can plug in anything you want on that side. 99:59:59.999 --> 99:59:59.999 And then, it also has the ability to continuously execute queries 99:59:59.999 --> 99:59:59.999 called "recording rules" 99:59:59.999 --> 99:59:59.999 and these recording rules have two different modes. 99:59:59.999 --> 99:59:59.999 You can either record, you can take a query 99:59:59.999 --> 99:59:59.999 and it will generate new data from that query 99:59:59.999 --> 99:59:59.999 or you can take a query, and if it returns results, 99:59:59.999 --> 99:59:59.999 it will return an alert. 99:59:59.999 --> 99:59:59.999 That alert is a push message to the alert manager. 99:59:59.999 --> 99:59:59.999 This allows us to separate the generating of alerts from the routing of alerts. 99:59:59.999 --> 99:59:59.999 You can have one or hundreds of Prometheus services, all generating alerts 99:59:59.999 --> 99:59:59.999 and it goes into an alert manager cluster and sends, does the deduplication 99:59:59.999 --> 99:59:59.999 and the routing to the human 99:59:59.999 --> 99:59:59.999 because, of course, the thing that we want is 99:59:59.999 --> 99:59:59.999 we had dashboards with graphs, but in order to find out if something is broken 99:59:59.999 --> 99:59:59.999 you had to have a human looking at the graph. 
99:59:59.999 --> 99:59:59.999 With Prometheus, we don't have to do that anymore, 99:59:59.999 --> 99:59:59.999 we can simply let the software tell us that we need to go investigate 99:59:59.999 --> 99:59:59.999 our problems. 99:59:59.999 --> 99:59:59.999 We don't have to sit there and stare at dashboards all day, 99:59:59.999 --> 99:59:59.999 because that's really boring. 99:59:59.999 --> 99:59:59.999 What does it look like to actually get data into Prometheus? 99:59:59.999 --> 99:59:59.999 This is a very basic output of a Prometheus metric. 99:59:59.999 --> 99:59:59.999 This is a very simple thing. 99:59:59.999 --> 99:59:59.999 If you know much about the linux kernel, 99:59:59.999 --> 99:59:59.999 the linux kernel tracks ??? stats, all the state of all the CPUs 99:59:59.999 --> 99:59:59.999 in your system 99:59:59.999 --> 99:59:59.999 and we express this by having the name of the metric, which is 99:59:59.999 --> 99:59:59.999 'node_cpu_seconds_total' and so this is a self-describing metric, 99:59:59.999 --> 99:59:59.999 like you can just read the metrics name 99:59:59.999 --> 99:59:59.999 and you understand a little bit about what's going on here. 99:59:59.999 --> 99:59:59.999 The linux kernel and other kernels track their usage by the number of seconds 99:59:59.999 --> 99:59:59.999 spent doing different things and 99:59:59.999 --> 99:59:59.999 that could be, whether it's in system or user space or IRQs 99:59:59.999 --> 99:59:59.999 or iowait or idle. 99:59:59.999 --> 99:59:59.999 Actually, the kernel tracks how much idle time it has. 99:59:59.999 --> 99:59:59.999 It also tracks it by the number of CPUs. 
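NOTE The metric described here, a name plus labels for the CPU and the mode, can be sketched in Python as follows. The counter values are invented for illustration, and `format_sample` is a hypothetical helper, not part of any Prometheus client library. Escaping of special characters in label values is left out to keep the sketch short.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as name{label="value",...} value."""
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


# Made-up counter values: seconds each CPU has spent in each mode.
samples = [
    ({"cpu": "0", "mode": "idle"}, 48649.88),
    ({"cpu": "0", "mode": "user"}, 3120.5),
    ({"cpu": "1", "mode": "idle"}, 48701.02),
]

lines = [format_sample("node_cpu_seconds_total", labels, value) for labels, value in samples]
print("\n".join(lines))
```

Because the CPU number and the mode are labels rather than parts of the name, a query can later group or filter by either one without knowing the other.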
99:59:59.999 --> 99:59:59.999 With other monitoring systems, they used to do this with a tree structure 99:59:59.999 --> 99:59:59.999 and this caused a lot of problems, for like 99:59:59.999 --> 99:59:59.999 how do you mix and match data? So by switching from 99:59:59.999 --> 99:59:59.999 a tree structure to a tag-based structure, 99:59:59.999 --> 99:59:59.999 we can do some really interesting powerful data analytics. 99:59:59.999 --> 99:59:59.999 Here's a nice example of taking those CPU seconds counters 99:59:59.999 --> 99:59:59.999 and then converting them into a graph by using PromQL. 99:59:59.999 --> 99:59:59.999 Now we have this graph, we have this thing 99:59:59.999 --> 99:59:59.999 we can look and see here 99:59:59.999 --> 99:59:59.999 "Oh there is some little spike here, we might want to know about that." 99:59:59.999 --> 99:59:59.999 Now we can get into Metrics-Based Alerting. 99:59:59.999 --> 99:59:59.999 I used to be a site reliability engineer, I'm still a site reliability engineer at heart 99:59:59.999 --> 99:59:59.999 and we have this concept of things that you need to run a site or a service reliably. 99:59:59.999 --> 99:59:59.999 The most important thing you need is down at the bottom, 99:59:59.999 --> 99:59:59.999 Monitoring, because if you don't have monitoring of your service, 99:59:59.999 --> 99:59:59.999 how do you know it's even working? 99:59:59.999 --> 99:59:59.999 There's a couple of techniques here, and we want to alert based on data 99:59:59.999 --> 99:59:59.999 and not just those end to end tests.
99:59:59.999 --> 99:59:59.999 There's a couple of techniques, a thing called the RED method 99:59:59.999 --> 99:59:59.999 and there's a thing called the USE method 99:59:59.999 --> 99:59:59.999 and there are a couple of nice blog posts about this 99:59:59.999 --> 99:59:59.999 and basically it defines that, for example, 99:59:59.999 --> 99:59:59.999 the RED method talks about the requests that your system is handling 99:59:59.999 --> 99:59:59.999 There are three things: 99:59:59.999 --> 99:59:59.999 There's the number of requests, there's the number of errors 99:59:59.999 --> 99:59:59.999 and there's how long it takes, the duration. 99:59:59.999 --> 99:59:59.999 With the combination of these three things 99:59:59.999 --> 99:59:59.999 you can determine most of what your users see 99:59:59.999 --> 99:59:59.999 "Did my request go through? Did it return an error? Was it fast?" 99:59:59.999 --> 99:59:59.999 Most people, that's all they care about. 99:59:59.999 --> 99:59:59.999 "I made a request to a website and it came back and it was fast." 99:59:59.999 --> 99:59:59.999 It's a very simple method of just, like, 99:59:59.999 --> 99:59:59.999 those are the important things to determine if your site is healthy. 99:59:59.999 --> 99:59:59.999 But we can go back to some more traditional, sysadmin style, alerts 99:59:59.999 --> 99:59:59.999 this is basically taking the filesystem available space, 99:59:59.999 --> 99:59:59.999 divided by the filesystem size, that becomes the ratio of filesystem availability 99:59:59.999 --> 99:59:59.999 from 0 to 1. 99:59:59.999 --> 99:59:59.999 Multiply it by 100, we now have a percentage 99:59:59.999 --> 99:59:59.999 and if it's less than or equal to 1% for 15 minutes, 99:59:59.999 --> 99:59:59.999 this is less than 1% space, we should tell a sysadmin to go check 99:59:59.999 --> 99:59:59.999 the ??? filesystem ??? 99:59:59.999 --> 99:59:59.999 It's super nice and simple.
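NOTE The filesystem alert described above (available space divided by size, times 100, at or below 1% for 15 minutes) can be sketched in plain Python. This mirrors the arithmetic from the talk, not Prometheus's actual rule evaluator. The function names and the sample data are assumptions made for the example.

```python
def percent_available(avail_bytes: float, size_bytes: float) -> float:
    """Ratio of available space to total size, scaled from 0-1 to 0-100."""
    return avail_bytes / size_bytes * 100


def is_firing(samples, threshold_percent=1.0, hold_seconds=15 * 60):
    """samples: list of (unix_time, avail_bytes, size_bytes), oldest first.

    The alert fires only if every sample since some point at least
    hold_seconds before the newest sample was at or below the threshold.
    """
    breached_since = None
    last_time = None
    for timestamp, avail, size in samples:
        last_time = timestamp
        if percent_available(avail, size) <= threshold_percent:
            if breached_since is None:
                breached_since = timestamp
        else:
            breached_since = None  # condition cleared, start over
    return breached_since is not None and last_time - breached_since >= hold_seconds


SIZE = 100e9  # hypothetical 100 GB filesystem
nearly_full = [(t, 0.5e9, SIZE) for t in (0, 300, 600, 900)]  # 0.5% free for 15 minutes
brief_dip = [(0, 0.5e9, SIZE), (300, 20e9, SIZE), (600, 0.5e9, SIZE), (900, 0.5e9, SIZE)]

print(is_firing(nearly_full), is_firing(brief_dip))
```

The hold time is what keeps a short dip below the threshold from paging anyone.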
99:59:59.999 --> 99:59:59.999 We can also tag, we can include… 99:59:59.999 --> 99:59:59.999 Every alert includes all the extraneous labels that Prometheus adds to your metrics 99:59:59.999 --> 99:59:59.999 When you add a metric in Prometheus, if we go back and we look at this metric. 99:59:59.999 --> 99:59:59.999 This metric only contains the information about the internals of the application 99:59:59.999 --> 99:59:59.999 anything about, like, what server it's on, is it running in a container, 99:59:59.999 --> 99:59:59.999 what cluster does it come from, what ??? is it on, 99:59:59.999 --> 99:59:59.999 that's all extra annotations that are added by the Prometheus server 99:59:59.999 --> 99:59:59.999 at discovery time. 99:59:59.999 --> 99:59:59.999 I don't have a good example of what those labels look like 99:59:59.999 --> 99:59:59.999 but every metric gets annotated with location information. 99:59:59.999 --> 99:59:59.999 That location information also comes through as labels in the alert 99:59:59.999 --> 99:59:59.999 so, if you have a message coming into your alert manager, 99:59:59.999 --> 99:59:59.999 the alert manager can look and go 99:59:59.999 --> 99:59:59.999 "Oh, that's coming from this datacenter" 99:59:59.999 --> 99:59:59.999 and it can include that in the email or IRC message or SMS message. 99:59:59.999 --> 99:59:59.999 So you can include 99:59:59.999 --> 99:59:59.999 "Filesystem is out of space on this host from this datacenter" 99:59:59.999 --> 99:59:59.999 All these labels get passed through and then you can append 99:59:59.999 --> 99:59:59.999 "severity: critical" to that alert and include that in the message to the human 99:59:59.999 --> 99:59:59.999 because of course, this is how you define… 99:59:59.999 --> 99:59:59.999 Getting the message from the monitoring to the human.
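NOTE How labels travel from the metric into the alert message can be sketched with plain Python dictionaries. Every label name and value below (`instance`, `datacenter`, `mountpoint`, the alert name) is a made-up example, and the merge is only an illustration of the idea, not the Alertmanager's actual templating.

```python
# Labels attached when the target was discovered (hypothetical values).
target_labels = {"instance": "db1.example.org", "datacenter": "fra1"}
# Labels that came from the metric itself.
metric_labels = {"mountpoint": "/var"}
# Labels appended by the alerting rule, as in "severity: critical".
rule_labels = {"alertname": "FilesystemAlmostFull", "severity": "critical"}

# The alert carries the union of all three sets.
alert_labels = {**target_labels, **metric_labels, **rule_labels}

message = (
    "[{severity}] Filesystem {mountpoint} is out of space "
    "on {instance} in {datacenter}"
).format(**alert_labels)
print(message)
```

Since the location labels arrive with the alert, the routing side can pick a recipient by datacenter or severity without the application knowing about either.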
99:59:59.999 --> 99:59:59.999 You can even include nice things like, 99:59:59.999 --> 99:59:59.999 if you've got documentation, you can include a link to the documentation 99:59:59.999 --> 99:59:59.999 as an annotation 99:59:59.999 --> 99:59:59.999 and the alert manager can take that basic url and, you know, 99:59:59.999 --> 99:59:59.999 massage it into whatever it needs to look like to actually get 99:59:59.999 --> 99:59:59.999 the operator to the correct documentation. 99:59:59.999 --> 99:59:59.999 We can also do more fun things: 99:59:59.999 --> 99:59:59.999 since we actually are not just checking 99:59:59.999 --> 99:59:59.999 what is the space right now, we're tracking data over time, 99:59:59.999 --> 99:59:59.999 we can use 'predict_linear'. 99:59:59.999 --> 99:59:59.999 'predict_linear' just takes and does a simple linear regression. 99:59:59.999 --> 99:59:59.999 This example takes the filesystem available space over the last hour and 99:59:59.999 --> 99:59:59.999 does a linear regression. 99:59:59.999 --> 99:59:59.999 Prediction says "Well, it's going that way and four hours from now, 99:59:59.999 --> 99:59:59.999 based on one hour of history, it's gonna be less than 0, which means full". 99:59:59.999 --> 99:59:59.999 We know that within the next four hours, the disk is gonna be full 99:59:59.999 --> 99:59:59.999 so we can tell the operator ahead of time that it's gonna be full 99:59:59.999 --> 99:59:59.999 and not just tell them that it's full right now. 99:59:59.999 --> 99:59:59.999 They have some window of ability to fix it before it fails. 99:59:59.999 --> 99:59:59.999 This is really important because if you're running a site 99:59:59.999 --> 99:59:59.999 you want to be able to have alerts that tell you that your system is failing 99:59:59.999 --> 99:59:59.999 before it actually fails.
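NOTE The idea behind 'predict_linear', a least-squares line through one hour of samples extended four hours ahead, can be sketched in Python. This is a simplified stand-in written for illustration: it extrapolates from the newest sample, and the Python function below only shares its name with the PromQL function. The sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = intercept + slope * time by least squares and extrapolate.

    samples: list of (unix_time, value), oldest first, at least two distinct times.
    """
    count = len(samples)
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    newest_time = samples[-1][0]
    return intercept + slope * (newest_time + seconds_ahead)


# One hour of history, one sample per minute: 10 GB free, shrinking by 1 MB per second.
history = [(t, 10e9 - 1e6 * t) for t in range(0, 3601, 60)]
predicted = predict_linear(history, 4 * 3600)
will_fill_up = predicted < 0
print(predicted, will_fill_up)
```

The filesystem in this example still has 6.4 GB free at the newest sample, so a plain threshold check would stay quiet while the prediction already reports trouble.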
99:59:59.999 --> 99:59:59.999 Because if it fails, you're out of SLO or SLA and 99:59:59.999 --> 99:59:59.999 your users are gonna be unhappy 99:59:59.999 --> 99:59:59.999 and you don't want the users to tell you that your site is down 99:59:59.999 --> 99:59:59.999 you want to know about it before your users can even tell. 99:59:59.999 --> 99:59:59.999 This allows you to do that. 99:59:59.999 --> 99:59:59.999 And also of course, Prometheus being a modern system, 99:59:59.999 --> 99:59:59.999 we fully support UTF-8 in all of our labels. 99:59:59.999 --> 99:59:59.999 Here's another one, here's a good example from the USE method. 99:59:59.999 --> 99:59:59.999 This is a rate of 500 errors coming from an application 99:59:59.999 --> 99:59:59.999 and you can simply alert that 99:59:59.999 --> 99:59:59.999 there's more than 500 errors per second coming out of the application 99:59:59.999 --> 99:59:59.999 if that's your threshold for ??? 99:59:59.999 --> 99:59:59.999 And you can do other things, 99:59:59.999 --> 99:59:59.999 you can convert that from just a rate of errors 99:59:59.999 --> 99:59:59.999 to a percentage of errors. 99:59:59.999 --> 99:59:59.999 So you could say 99:59:59.999 --> 99:59:59.999 "I have an SLA of three nines" and so you can say 99:59:59.999 --> 99:59:59.999 "If the rate of errors divided by the rate of requests is .01, 99:59:59.999 --> 99:59:59.999 or is more than .01, then that's a problem." 99:59:59.999 --> 99:59:59.999 You can include that level of error granularity. 99:59:59.999 --> 99:59:59.999 And if you're just doing a blackbox test, 99:59:59.999 --> 99:59:59.999 you wouldn't know this, you would only get if you got an error from the system, 99:59:59.999 --> 99:59:59.999 then you got another error from the system 99:59:59.999 --> 99:59:59.999 then you fire an alert.
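NOTE The error-ratio alert, the rate of errors divided by the rate of requests compared against the .01 threshold mentioned above, can be sketched in Python. The counter readings are invented, the helper names are hypothetical, and counter resets (a process restart setting a counter back to zero) are ignored to keep the sketch short.

```python
def per_second_rate(first, last):
    """Average increase per second between two (unix_time, counter_value) readings."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def error_ratio(request_readings, error_readings):
    """Rate of errors divided by rate of requests over the same window."""
    request_rate = per_second_rate(request_readings[0], request_readings[-1])
    error_rate = per_second_rate(error_readings[0], error_readings[-1])
    return error_rate / request_rate


# Two readings one minute apart (made-up numbers): 1000 requests/s, 20 errors/s.
requests_total = [(0, 100_000), (60, 160_000)]
errors_total = [(0, 500), (60, 1_700)]

ratio = error_ratio(requests_total, errors_total)
should_alert = ratio > 0.01
print(ratio, should_alert)
```

Because both counters cover every request in the window, the ratio reflects all traffic rather than the outcome of a single probe.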
99:59:59.999 --> 99:59:59.999 But if those checks are one minute apart and you're serving 1000 requests per second 99:59:59.999 --> 99:59:59.999 you could be serving 10,000 errors before you even get an alert. 99:59:59.999 --> 99:59:59.999 And you might miss it, because 99:59:59.999 --> 99:59:59.999 what if you only get one random error 99:59:59.999 --> 99:59:59.999 and then the next time, you're serving 25% errors, 99:59:59.999 --> 99:59:59.999 you only have a 25% chance of that check failing again. 99:59:59.999 --> 99:59:59.999 You really need these metrics in order to get 99:59:59.999 --> 99:59:59.999 proper reports of the status of your system 99:59:59.999 --> 99:59:59.999 There's even options 99:59:59.999 --> 99:59:59.999 You can slice and dice those labels. 99:59:59.999 --> 99:59:59.999 If you have a label on all of your applications called 'service' 99:59:59.999 --> 99:59:59.999 you can send that 'service' label through to the message 99:59:59.999 --> 99:59:59.999 and you can say "Hey, this service is broken". 99:59:59.999 --> 99:59:59.999 You can include that service label in your alert messages. 99:59:59.999 --> 99:59:59.999 And that's it, I can go to a demo and Q&A. 99:59:59.999 --> 99:59:59.999 [Applause] 99:59:59.999 --> 99:59:59.999 Any questions so far? 99:59:59.999 --> 99:59:59.999 Or anybody want to see a demo? 99:59:59.999 --> 99:59:59.999 [Q] Hi. Does Prometheus make metric discovery inside containers 99:59:59.999 --> 99:59:59.999 or do I have to implement the metrics myself? 99:59:59.999 --> 99:59:59.999 [A] For metrics in containers, there are already things that expose 99:59:59.999 --> 99:59:59.999 the metrics of the container system itself. 
99:59:59.999 --> 99:59:59.999 There's a utility called 'cadvisor' and 99:59:59.999 --> 99:59:59.999 cadvisor takes the Linux cgroup data and exposes it as metrics 99:59:59.999 --> 99:59:59.999 so you can get data about how much CPU time is being 99:59:59.999 --> 99:59:59.999 spent in your container, 99:59:59.999 --> 99:59:59.999 how much memory is being used by your container. 99:59:59.999 --> 99:59:59.999 [Q] But not about the application, just about the container usage? 99:59:59.999 --> 99:59:59.999 [A] Right. Because the container has no idea 99:59:59.999 --> 99:59:59.999 whether your application is written in Ruby or Go or Python or whatever, 99:59:59.999 --> 99:59:59.999 you have to build that into your application in order to get the data. 99:59:59.999 --> 99:59:59.999 So for Prometheus, 99:59:59.999 --> 99:59:59.999 we've written client libraries that can be included in your application directly 99:59:59.999 --> 99:59:59.999 so you can get that data out. 99:59:59.999 --> 99:59:59.999 If you go to the Prometheus website, we have a whole series of client libraries 99:59:59.999 --> 99:59:59.999 and we cover a pretty good selection of popular software. 99:59:59.999 --> 99:59:59.999 [Q] What is the current state of long-term data storage? 99:59:59.999 --> 99:59:59.999 [A] Very good question. 99:59:59.999 --> 99:59:59.999 There's been several… 99:59:59.999 --> 99:59:59.999 There's actually several different methods of doing this. 99:59:59.999 --> 99:59:59.999 Prometheus stores all this data locally in its own data storage 99:59:59.999 --> 99:59:59.999 on the local disk. 99:59:59.999 --> 99:59:59.999 But that's only as durable as that server is durable. 99:59:59.999 --> 99:59:59.999 So if you've got a really durable server, 99:59:59.999 --> 99:59:59.999 you can store as much data as you want, 99:59:59.999 --> 99:59:59.999 you can store years and years of data locally on the Prometheus server. 99:59:59.999 --> 99:59:59.999 That's not a problem.
99:59:59.999 --> 99:59:59.999 There's a bunch of misconceptions because of our default 99:59:59.999 --> 99:59:59.999 and the language on our website said 99:59:59.999 --> 99:59:59.999 "It's not long-term storage" 99:59:59.999 --> 99:59:59.999 simply because we leave that problem up to the person running the server. 99:59:59.999 --> 99:59:59.999 But the time series database that Prometheus includes 99:59:59.999 --> 99:59:59.999 is actually quite durable. 99:59:59.999 --> 99:59:59.999 But it's only as durable as the server underneath it. 99:59:59.999 --> 99:59:59.999 So if you've got a very large cluster and you want really high durability, 99:59:59.999 --> 99:59:59.999 you need to have some kind of cluster software, 99:59:59.999 --> 99:59:59.999 but because we want Prometheus to be simple to deploy 99:59:59.999 --> 99:59:59.999 and very simple to operate 99:59:59.999 --> 99:59:59.999 and also very robust. 99:59:59.999 --> 99:59:59.999 We didn't want to include any clustering in Prometheus itself, 99:59:59.999 --> 99:59:59.999 because anytime you have a clustered software, 99:59:59.999 --> 99:59:59.999 what happens if your network is a little wanky. 99:59:59.999 --> 99:59:59.999 The first thing that goes down is all of your distributed systems fail. 99:59:59.999 --> 99:59:59.999 And building distributed systems to be really robust is really hard 99:59:59.999 --> 99:59:59.999 so Prometheus is what we call "uncoordinated distributed systems". 99:59:59.999 --> 99:59:59.999 If you've got two Prometheus servers monitoring all your targets in an HA mode 99:59:59.999 --> 99:59:59.999 in a cluster, and there's a split brain, 99:59:59.999 --> 99:59:59.999 each Prometheus can see half of the cluster and 99:59:59.999 --> 99:59:59.999 it can see that the other half of the cluster is down. 
99:59:59.999 --> 99:59:59.999 They can both try to get alerts out to the alert manager 99:59:59.999 --> 99:59:59.999 and this is a really really robust way of handling split brains 99:59:59.999 --> 99:59:59.999 and bad network failures and bad problems in a cluster. 99:59:59.999 --> 99:59:59.999 It's designed to be super super robust