WEBVTT 00:00:05.901 --> 00:00:10.531 So, we had a talk by a non-GitLab person about GitLab. 00:00:10.531 --> 00:00:13.057 Now, we have a talk by a GitLab person on non-GitLab. 00:00:13.202 --> 00:00:14.603 Something like that? 00:00:15.894 --> 00:00:19.393 The CCCHH hackerspace is now open, 00:00:19.946 --> 00:00:22.118 from now on, if you want to go there, that's the announcement. 00:00:22.471 --> 00:00:25.871 And the next talk will be by Ben Kochie 00:00:26.009 --> 00:00:28.265 on metrics-based monitoring with Prometheus. 00:00:28.748 --> 00:00:30.212 Welcome. 00:00:30.545 --> 00:00:33.133 [Applause] 00:00:35.395 --> 00:00:36.578 Alright, so 00:00:36.886 --> 00:00:39.371 my name is Ben Kochie, 00:00:39.845 --> 00:00:43.870 I work on DevOps features for GitLab 00:00:44.327 --> 00:00:48.293 and apart from working for GitLab, I also work on the opensource Prometheus project. 00:00:51.163 --> 00:00:53.595 I live in Berlin and I've been using Debian since ??? 00:00:54.353 --> 00:00:56.797 yes, quite a long time. 00:00:58.806 --> 00:01:01.018 So, what is Metrics-based Monitoring? 00:01:02.638 --> 00:01:05.165 If you're running software in production, 00:01:05.885 --> 00:01:07.826 you probably want to monitor it, 00:01:08.212 --> 00:01:10.547 because if you don't monitor it, you don't know whether it's working right. 00:01:13.278 --> 00:01:16.112 ??? break down into two categories: 00:01:16.112 --> 00:01:19.146 there's blackbox monitoring and there's whitebox monitoring. 00:01:19.500 --> 00:01:24.582 Blackbox monitoring is treating your software like a blackbox. 00:01:24.757 --> 00:01:27.158 It's just checks to see, like, 00:01:27.447 --> 00:01:29.483 is it responding, or does it ping 00:01:30.023 --> 00:01:33.588 or ??? HTTP requests 00:01:34.348 --> 00:01:35.669 [mic turned on] 00:01:37.760 --> 00:01:41.379 Ah, there we go, much better. 00:01:46.592 --> 00:01:51.898 So, blackbox monitoring is a probe, 00:01:51.898 --> 00:01:54.684 it just kind of looks at your software from the outside 00:01:55.454 --> 00:01:57.432 and it has no knowledge of the internals 00:01:58.133 --> 00:02:00.699 and it's really good for end-to-end testing. 00:02:00.942 --> 00:02:03.560 So if you've got a fairly complicated service, 00:02:03.990 --> 00:02:06.426 you come in from the outside, you go through the load balancer, 00:02:06.721 --> 00:02:07.975 you hit the API server, 00:02:07.975 --> 00:02:10.152 the API server might hit a database, 00:02:10.675 --> 00:02:13.054 and you go all the way through to the back of the stack 00:02:13.224 --> 00:02:14.536 and all the way back out, 00:02:14.560 --> 00:02:16.294 so you know that everything is working end to end. 00:02:16.518 --> 00:02:18.768 But you only know about it for that one request. 00:02:19.036 --> 00:02:22.429 So in order to find out if your service is working, 00:02:22.831 --> 00:02:27.128 end to end, for every single request, 00:02:27.475 --> 00:02:29.523 this requires whitebox instrumentation. 00:02:29.836 --> 00:02:33.965 So, basically, every event that happens inside your software, 00:02:33.973 --> 00:02:36.517 inside a serving stack, 00:02:36.817 --> 00:02:39.807 gets collected and gets counted, 00:02:40.037 --> 00:02:43.676 so you know that every request hits the load balancer, 00:02:45.216 --> 00:02:45.656 every request hits your application service, 00:02:45.972 --> 00:02:47.329 every request hits the database. 00:02:47.789 --> 00:02:50.832 You know that everything matches up 00:02:50.997 --> 00:02:55.764 and this is called whitebox, or metrics-based, monitoring.
00:02:56.010 --> 00:02:57.688 There are different examples of, like, 00:02:57.913 --> 00:03:02.392 the kind of software that does blackbox and whitebox monitoring. 00:03:02.572 --> 00:03:06.680 So you have software like Nagios, where you can configure checks, 00:03:08.826 --> 00:03:10.012 or pingdom; 00:03:10.211 --> 00:03:12.347 pingdom will ping your website. 00:03:12.971 --> 00:03:15.307 And then there is metrics-based monitoring, 00:03:15.517 --> 00:03:19.293 things like Prometheus, things like the TICK stack from InfluxData, 00:03:19.610 --> 00:03:22.728 New Relic and other commercial solutions, 00:03:23.027 --> 00:03:25.480 but of course I like to talk about the opensource solutions. 00:03:25.748 --> 00:03:28.379 We're gonna talk a little bit about Prometheus. 00:03:28.819 --> 00:03:31.955 Prometheus came out of the idea that 00:03:32.343 --> 00:03:37.555 we needed a monitoring system that could collect all this whitebox metric data 00:03:37.941 --> 00:03:40.786 and do something useful with it. 00:03:40.915 --> 00:03:42.667 Not just give us a pretty graph, but we also want to be able to 00:03:42.985 --> 00:03:44.189 alert on it. 00:03:44.189 --> 00:03:45.988 So we needed both 00:03:49.872 --> 00:03:54.068 a data gathering and an analytics system in the same instance. 00:03:54.148 --> 00:03:58.821 To do this, we built this thing and we looked at the way that 00:03:59.014 --> 00:04:01.835 data was being generated by the applications 00:04:02.369 --> 00:04:05.204 and there are advantages and disadvantages to this 00:04:05.204 --> 00:04:07.250 push vs. poll model for metrics. 00:04:07.384 --> 00:04:09.701 We decided to go with the polling model 00:04:09.938 --> 00:04:13.953 because there are some slight advantages for polling over pushing. 00:04:16.323 --> 00:04:18.163 With polling, you get this free blackbox check 00:04:18.471 --> 00:04:20.151 that the application is running. 00:04:20.527 --> 00:04:24.319 When you poll your application, you know that the process is running. 00:04:24.532 --> 00:04:27.529 If you are doing push-based, you can't tell the difference between 00:04:27.851 --> 00:04:31.521 your application doing no work and your application not running. 00:04:32.416 --> 00:04:33.900 So you don't know if it's stuck, 00:04:34.140 --> 00:04:37.878 or if it just doesn't have any work to do. 00:04:42.671 --> 00:04:48.940 With polling, the polling system knows the state of your network. 00:04:49.850 --> 00:04:52.522 If you have a defined set of services, 00:04:52.887 --> 00:04:56.788 that inventory drives what should be there. 00:04:58.274 --> 00:05:00.080 Again, it's like the disappearing problem: 00:05:00.288 --> 00:05:03.950 is the process dead, or is it just not doing anything? 00:05:04.205 --> 00:05:07.117 With polling, you know for a fact what processes should be there, 00:05:07.593 --> 00:05:10.900 and it's a bit of an advantage there. 00:05:11.138 --> 00:05:12.913 With polling, there's really easy testing. 00:05:13.117 --> 00:05:16.295 With push-based metrics, you have to figure out, 00:05:16.505 --> 00:05:18.843 if you want to test a new version of the monitoring system or 00:05:19.058 --> 00:05:21.262 you want to test something new, 00:05:21.420 --> 00:05:24.129 you have to ??? a copy of the data. 00:05:24.370 --> 00:05:27.652 With polling, you can just set up another instance of your monitoring 00:05:27.856 --> 00:05:29.189 and just test it.
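[Editor's note: a minimal sketch of what that poll model looks like in a Prometheus scrape configuration; the hostnames and port are illustrative and not from the talk.]

    # prometheus.yml -- the server is handed an inventory of targets
    # and polls (scrapes) each one on a fixed interval.
    global:
      scrape_interval: 15s          # how often every target is polled

    scrape_configs:
      - job_name: 'node'            # becomes the 'job' label on every scraped metric
        static_configs:
          - targets:                # hypothetical hosts; in practice this list
              - 'web-1.example.com:9100'   # comes from your inventory
              - 'web-2.example.com:9100'

A second, test instance of the monitoring just needs a copy of this file pointed at the same targets.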
00:05:29.714 --> 00:05:31.321 Or you don't even have, 00:05:31.473 --> 00:05:33.194 it doesn't even have to be monitoring, you can just use curl 00:05:33.199 --> 00:05:35.977 to poll the metrics endpoint. 00:05:38.417 --> 00:05:40.436 It's significantly easier to test. 00:05:40.436 --> 00:05:42.977 The other thing with the… 00:05:45.999 --> 00:05:48.109 The other nice thing is that the client is really simple. 00:05:48.481 --> 00:05:51.068 The client doesn't have to know where the monitoring system is. 00:05:51.272 --> 00:05:53.669 It doesn't have to know about ??? 00:05:53.820 --> 00:05:55.720 It just has to sit and collect the data about itself. 00:05:55.882 --> 00:05:58.708 So it doesn't have to know anything about the topology of the network. 00:05:59.134 --> 00:06:03.363 As an application developer, if you're writing a DNS server or 00:06:03.724 --> 00:06:05.572 some other piece of software, 00:06:05.896 --> 00:06:09.562 you don't have to know anything about monitoring software, 00:06:09.803 --> 00:06:12.217 you can just implement it inside your application and 00:06:12.683 --> 00:06:17.058 the monitoring software, whether it's Prometheus or something else, 00:06:17.414 --> 00:06:19.332 can just come and collect that data for you. 00:06:20.210 --> 00:06:23.611 That's kind of similar to a very old monitoring system called SNMP, 00:06:23.832 --> 00:06:28.530 but SNMP has a significantly less friendly data model for developers. 00:06:30.010 --> 00:06:33.556 This is the basic layout of a Prometheus server. 00:06:33.921 --> 00:06:35.918 At the core, there's a Prometheus server 00:06:36.278 --> 00:06:40.302 and it deals with all the data collection and analytics. 00:06:42.941 --> 00:06:46.697 Basically, this one binary, it's all written in golang. 00:06:46.867 --> 00:06:48.559 It's a single binary. 00:06:48.559 --> 00:06:50.823 It knows how to read from your inventory, 00:06:50.823 --> 00:06:52.659 there's a bunch of different methods, whether you've got 00:06:53.121 --> 00:06:58.843 a kubernetes cluster or a cloud platform 00:07:00.234 --> 00:07:03.800 or you have your own customized thing with ansible. 00:07:05.380 --> 00:07:09.750 Ansible can take your layout, drop that into a config file and 00:07:10.639 --> 00:07:11.902 Prometheus can pick that up. 00:07:15.594 --> 00:07:18.812 Once it has the layout, it goes out and collects all the data. 00:07:18.844 --> 00:07:24.254 It has a storage and a time series database to store all that data locally. 00:07:24.462 --> 00:07:28.228 It has a thing called PromQL, which is a query language designed 00:07:28.452 --> 00:07:31.033 for metrics and analytics. 00:07:31.500 --> 00:07:36.779 From that PromQL, you can add frontends that will, 00:07:36.985 --> 00:07:39.319 whether it's a simple API client to run reports, 00:07:40.019 --> 00:07:42.942 you can use things like Grafana for creating dashboards, 00:07:43.124 --> 00:07:44.834 it's got a simple webUI built in. 00:07:45.031 --> 00:07:46.920 You can plug in anything you want on that side. 00:07:48.693 --> 00:07:54.478 And then, it also has the ability to continuously execute queries 00:07:54.625 --> 00:07:56.191 called "recording rules" 00:07:56.832 --> 00:07:59.103 and these recording rules have two different modes. 00:07:59.103 --> 00:08:01.871 You can either record, you can take a query 00:08:02.150 --> 00:08:03.711 and it will generate new data from that query 00:08:04.072 --> 00:08:06.967 or you can take a query, and if it returns results, 00:08:07.354 --> 00:08:08.910 it will return an alert. 
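[Editor's note: a sketch of the two rule modes just described, as they would appear in a Prometheus rules file; the rule names, expressions and threshold are illustrative.]

    # rules.yml -- continuously evaluated by the Prometheus server
    groups:
      - name: example
        rules:
          # recording rule: evaluate a query on a schedule and store the
          # result as a new time series
          - record: instance:node_cpu_utilisation:rate5m
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

          # alerting rule: if the query returns any results, fire an alert
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical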
00:08:09.176 --> 00:08:12.506 That alert is a push message to the alert manager. 00:08:12.813 --> 00:08:18.969 This allows us to separate the generating of alerts from the routing of alerts. 00:08:19.153 --> 00:08:24.259 You can have one or hundreds of Prometheus services, all generating alerts, 00:08:24.599 --> 00:08:28.807 and it goes into an alert manager cluster, which does the deduplication 00:08:29.329 --> 00:08:30.684 and the routing to the human, 00:08:30.879 --> 00:08:34.138 because, of course, the thing that we want is… 00:08:34.927 --> 00:08:38.797 we had dashboards with graphs, but in order to find out if something was broken, 00:08:38.966 --> 00:08:40.650 you had to have a human looking at the graph. 00:08:40.830 --> 00:08:42.942 With Prometheus, we don't have to do that anymore, 00:08:43.103 --> 00:08:47.638 we can simply let the software tell us that we need to go investigate 00:08:47.638 --> 00:08:48.650 our problems. 00:08:48.778 --> 00:08:50.831 We don't have to sit there and stare at dashboards all day, 00:08:51.035 --> 00:08:52.380 because that's really boring. 00:08:54.519 --> 00:08:57.556 What does it look like to actually get data into Prometheus? 00:08:57.587 --> 00:09:02.140 This is a very basic output of a Prometheus metric. 00:09:02.613 --> 00:09:03.930 This is a very simple thing. 00:09:04.086 --> 00:09:07.572 If you know much about the linux kernel, 00:09:06.883 --> 00:09:12.779 the linux kernel tracks, in proc stats, all the state of all the CPUs 00:09:12.779 --> 00:09:14.459 in your system 00:09:14.662 --> 00:09:18.078 and we express this by having the name of the metric, which is 00:09:22.449 --> 00:09:26.123 'node_cpu_seconds_total', and so this is a self-describing metric, 00:09:26.547 --> 00:09:28.375 like you can just read the metric's name 00:09:28.530 --> 00:09:30.845 and you understand a little bit about what's going on here. 00:09:33.241 --> 00:09:38.521 The linux kernel and other kernels track their usage by the number of seconds 00:09:38.859 --> 00:09:41.004 spent doing different things and 00:09:41.199 --> 00:09:46.721 that could be, whether it's in system or user space or IRQs 00:09:47.065 --> 00:09:48.690 or iowait or idle. 00:09:48.908 --> 00:09:51.280 Actually, the kernel tracks how much idle time it has. 00:09:53.660 --> 00:09:55.309 It also tracks it per CPU. 00:09:55.997 --> 00:10:00.067 With other monitoring systems, they used to do this with a tree structure 00:10:01.021 --> 00:10:03.688 and this caused a lot of problems, like: 00:10:03.854 --> 00:10:09.291 how do you mix and match data? So by switching from 00:10:10.043 --> 00:10:12.484 a tree structure to a tag-based structure, 00:10:12.985 --> 00:10:16.896 we can do some really interesting, powerful data analytics. 00:10:18.170 --> 00:10:25.170 Here's a nice example of taking those CPU seconds counters 00:10:26.101 --> 00:10:30.198 and then converting them into a graph by using PromQL.
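[Editor's note: a sketch of the exposition format being described, followed by the kind of PromQL used to turn those counters into a graph; the sample values are made up and the HELP wording is approximate.]

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 176034.12
    node_cpu_seconds_total{cpu="0",mode="user"} 5312.44
    node_cpu_seconds_total{cpu="0",mode="system"} 1093.71
    node_cpu_seconds_total{cpu="0",mode="iowait"} 42.57

    # PromQL: per-second CPU usage over the last five minutes
    rate(node_cpu_seconds_total[5m])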
00:10:32.724 --> 00:10:34.830 Now we can get into Metrics-Based Alerting. 00:10:35.315 --> 00:10:37.665 Now we have this graph, we have this thing, 00:10:37.847 --> 00:10:39.497 we can look and see here: 00:10:39.999 --> 00:10:42.920 "Oh, there is some little spike here, we might want to know about that." 00:10:43.191 --> 00:10:45.849 Now we can get into Metrics-Based Alerting. 00:10:46.281 --> 00:10:51.128 I used to be a site reliability engineer, I'm still a site reliability engineer at heart, 00:10:52.371 --> 00:11:00.362 and we have this concept of the things that you need to run a site or a service reliably. 00:11:00.910 --> 00:11:03.231 The most important thing you need is down at the bottom: 00:11:03.569 --> 00:11:06.869 Monitoring, because if you don't have monitoring of your service, 00:11:07.108 --> 00:11:08.688 how do you know it's even working? 00:11:11.628 --> 00:11:15.235 There's a couple of techniques here, and we want to alert based on data 00:11:15.693 --> 00:11:17.644 and not just those end-to-end tests. 00:11:18.796 --> 00:11:23.387 There's a couple of techniques: a thing called the RED method 00:11:23.555 --> 00:11:25.141 and there's a thing called the USE method, 00:11:25.588 --> 00:11:28.400 and there are a couple of nice blog posts about this, 00:11:28.695 --> 00:11:31.306 and basically it defines that, for example, 00:11:31.484 --> 00:11:35.000 the RED method talks about the requests that your system is handling. 00:11:36.421 --> 00:11:37.604 There are three things: 00:11:37.775 --> 00:11:40.073 there's the number of requests, there's the number of errors 00:11:40.268 --> 00:11:42.306 and there's how long each one takes, the duration. 00:11:42.868 --> 00:11:45.000 With the combination of these three things, 00:11:45.341 --> 00:11:48.368 you can determine most of what your users see: 00:11:48.712 --> 00:11:53.616 "Did my request go through? Did it return an error? Was it fast?" 00:11:55.492 --> 00:11:57.971 Most people, that's all they care about: 00:11:58.205 --> 00:12:01.965 "I made a request to a website and it came back and it was fast." 00:12:04.975 --> 00:12:06.517 It's a very simple method of just, like, 00:12:07.162 --> 00:12:10.109 those are the important things to determine if your site is healthy. 00:12:12.193 --> 00:12:17.045 But we can go back to some more traditional, sysadmin-style alerts: 00:12:17.309 --> 00:12:20.553 this is basically taking the filesystem available space, 00:12:20.824 --> 00:12:26.522 divided by the filesystem size, and that becomes the ratio of filesystem availability, 00:12:26.697 --> 00:12:27.523 from 0 to 1. 00:12:28.241 --> 00:12:30.759 Multiply it by 100, we now have a percentage, 00:12:31.016 --> 00:12:35.659 and if it's less than or equal to 1% for 15 minutes, 00:12:35.940 --> 00:12:41.782 this is less than 1% space, we should tell a sysadmin to go check 00:12:41.957 --> 00:12:44.290 to find out why the filesystem has filled up. 00:12:44.635 --> 00:12:46.168 It's super nice and simple.
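[Editor's note: a sketch of that filesystem alert as a Prometheus alerting rule, using the current node exporter metric names; the summary wording is illustrative.]

    groups:
      - name: filesystem
        rules:
          - alert: FilesystemAlmostFull
            # available space / total size, as a percentage, below 1% for 15 minutes
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "Filesystem on {{ $labels.instance }} has less than 1% space left"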
00:12:46.494 --> 00:12:49.685 We can also tag, we can include… 00:12:51.418 --> 00:12:58.232 Every alert includes all the extra labels that Prometheus adds to your metrics. 00:12:59.488 --> 00:13:05.461 When you add a metric in Prometheus, if we go back and we look at this metric… 00:13:06.009 --> 00:13:10.803 This metric only contains information about the internals of the application; 00:13:12.942 --> 00:13:14.995 anything about, like, what server it's on, is it running in a container, 00:13:15.186 --> 00:13:18.724 what cluster does it come from, what continent is it on, 00:13:17.702 --> 00:13:22.280 that's all extra annotations that are added by the Prometheus server 00:13:22.619 --> 00:13:23.949 at discovery time. 00:13:24.514 --> 00:13:28.347 Unfortunately, I don't have a good example of what those labels look like, 00:13:28.514 --> 00:13:34.180 but every metric gets annotated with location information. 00:13:36.904 --> 00:13:41.121 That location information also comes through as labels in the alert, 00:13:41.300 --> 00:13:48.074 so, if you have a message coming into your alert manager, 00:13:48.269 --> 00:13:49.899 the alert manager can look and go 00:13:50.093 --> 00:13:51.621 "Oh, that's coming from this datacenter" 00:13:52.007 --> 00:13:58.905 and it can include that in the email or IRC message or SMS message. 00:13:59.069 --> 00:14:00.772 So you can include 00:13:59.271 --> 00:14:04.422 "Filesystem is out of space on this host from this datacenter". 00:14:04.557 --> 00:14:07.340 All these labels get passed through and then you can append 00:14:07.491 --> 00:14:13.292 "severity: critical" to that alert and include that in the message to the human, 00:14:13.693 --> 00:14:16.775 because of course, this is how you define… 00:14:16.940 --> 00:14:20.857 getting the message from the monitoring to the human. 00:14:22.197 --> 00:14:23.850 You can even include nice things like, 00:14:24.027 --> 00:14:27.508 if you've got documentation, you can include a link to the documentation 00:14:27.620 --> 00:14:28.686 as an annotation, 00:14:29.079 --> 00:14:33.438 and the alert manager can take that basic URL and, you know, 00:14:33.467 --> 00:14:36.806 massage it into whatever it needs to look like to actually get 00:14:37.135 --> 00:14:40.417 the operator to the correct documentation. 00:14:42.117 --> 00:14:43.450 We can also do more fun things: 00:14:43.657 --> 00:14:45.567 since we're actually not just checking 00:14:45.746 --> 00:14:48.523 what the space is right now, we're tracking data over time, 00:14:49.232 --> 00:14:50.827 we can use 'predict_linear'. 00:14:52.406 --> 00:14:55.255 'predict_linear' just takes and does a simple linear regression. 00:14:55.749 --> 00:15:00.270 This example takes the filesystem available space over the last hour and 00:15:00.865 --> 00:15:02.453 does a linear regression. 00:15:02.785 --> 00:15:08.536 The prediction says "Well, it's going that way and four hours from now, 00:15:08.749 --> 00:15:13.112 based on one hour of history, it's gonna be less than 0, which means full". 00:15:13.667 --> 00:15:20.645 We know that within the next four hours, the disk is gonna be full, 00:15:20.874 --> 00:15:24.658 so we can tell the operator ahead of time that it's gonna be full 00:15:24.833 --> 00:15:26.517 and not just tell them that it's full right now. 00:15:27.113 --> 00:15:32.303 They have some window of ability to fix it before it fails. 00:15:32.674 --> 00:15:35.369 This is really important because if you're running a site, 00:15:35.689 --> 00:15:41.370 you want to be able to have alerts that tell you that your system is failing 00:15:41.573 --> 00:15:42.994 before it actually fails. 00:15:43.667 --> 00:15:48.254 Because if it fails, you're out of SLO or SLA and 00:15:48.404 --> 00:15:50.322 your users are gonna be unhappy, 00:15:50.729 --> 00:15:52.493 and you don't want the users to tell you that your site is down, 00:15:52.682 --> 00:15:54.953 you want to know about it before your users can even tell. 00:15:55.193 --> 00:15:58.491 This allows you to do that. 99:59:59.999 --> 99:59:59.999 And also of course, Prometheus being a modern system, 99:59:59.999 --> 99:59:59.999 we fully support UTF-8 in all of our labels.
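[Editor's note: a sketch of that predictive alert; predict_linear does the linear regression over the range you give it, and the runbook URL is hypothetical.]

    groups:
      - name: filesystem-predictive
        rules:
          - alert: FilesystemWillFillInFourHours
            # regress over the last hour of data, extrapolate 4 hours ahead
            expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Filesystem on {{ $labels.instance }} is predicted to fill within 4 hours"
              runbook: "https://docs.example.com/runbooks/disk-full"   # hypothetical documentation link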
99:59:59.999 --> 99:59:59.999 Here's another one, here's a good example from the USE method. 99:59:59.999 --> 99:59:59.999 This is a rate of 500 errors coming from an application 99:59:59.999 --> 99:59:59.999 and you can simply alert that 99:59:59.999 --> 99:59:59.999 there's more than 500 errors per second coming out of the application, 99:59:59.999 --> 99:59:59.999 if that's your threshold for ??? 99:59:59.999 --> 99:59:59.999 And you can do other things: 99:59:59.999 --> 99:59:59.999 you can convert that from just a rate of errors 99:59:59.999 --> 99:59:59.999 to a percentage of errors. 99:59:59.999 --> 99:59:59.999 So you could say 99:59:59.999 --> 99:59:59.999 "I have an SLA of three nines", and so you can say 99:59:59.999 --> 99:59:59.999 "If the rate of errors divided by the rate of requests is .01, 99:59:59.999 --> 99:59:59.999 or is more than .01, then that's a problem." 99:59:59.999 --> 99:59:59.999 You can include that level of error granularity. 99:59:59.999 --> 99:59:59.999 And if you're just doing a blackbox test, 99:59:59.999 --> 99:59:59.999 you wouldn't know this; you would only know if you got an error from the system, 99:59:59.999 --> 99:59:59.999 then you got another error from the system, 99:59:59.999 --> 99:59:59.999 then you fire an alert. 99:59:59.999 --> 99:59:59.999 But if those checks are one minute apart and you're serving 1000 requests per second, 99:59:59.999 --> 99:59:59.999 you could be serving 10,000 errors before you even get an alert. 99:59:59.999 --> 99:59:59.999 And you might miss it, because 99:59:59.999 --> 99:59:59.999 what if you only get one random error 99:59:59.999 --> 99:59:59.999 and then the next time, you're serving 25% errors, 99:59:59.999 --> 99:59:59.999 you only have a 25% chance of that check failing again. 99:59:59.999 --> 99:59:59.999 You really need these metrics in order to get 99:59:59.999 --> 99:59:59.999 proper reports of the status of your system. 99:59:59.999 --> 99:59:59.999 There are even options… 99:59:59.999 --> 99:59:59.999 You can slice and dice those labels. 99:59:59.999 --> 99:59:59.999 If you have a label on all of your applications called 'service', 99:59:59.999 --> 99:59:59.999 you can send that 'service' label through to the message 99:59:59.999 --> 99:59:59.999 and you can say "Hey, this service is broken". 99:59:59.999 --> 99:59:59.999 You can include that service label in your alert messages.
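[Editor's note: a sketch of those two alerts, reading "500 errors" as HTTP 500 responses; the metric name http_requests_total, its status label and the thresholds are assumptions, not from the talk.]

    groups:
      - name: errors
        rules:
          # absolute error rate: more errors per second than your threshold
          - alert: HighErrorRate
            expr: rate(http_requests_total{status="500"}[5m]) > 10
            labels:
              severity: critical

          # error ratio: errors divided by total requests, here more than 1%
          - alert: HighErrorRatio
            expr: |
              sum by (instance) (rate(http_requests_total{status="500"}[5m]))
                / sum by (instance) (rate(http_requests_total[5m])) > 0.01
            labels:
              severity: critical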
99:59:59.999 --> 99:59:59.999 And that's it, I can go to a demo and Q&A. 99:59:59.999 --> 99:59:59.999 [Applause] 99:59:59.999 --> 99:59:59.999 Any questions so far? 99:59:59.999 --> 99:59:59.999 Or anybody want to see a demo? 99:59:59.999 --> 99:59:59.999 [Q] Hi. Does Prometheus do metric discovery inside containers 99:59:59.999 --> 99:59:59.999 or do I have to implement the metrics myself? 99:59:59.999 --> 99:59:59.999 [A] For metrics in containers, there are already things that expose 99:59:59.999 --> 99:59:59.999 the metrics of the container system itself. 99:59:59.999 --> 99:59:59.999 There's a utility called 'cadvisor' and 99:59:59.999 --> 99:59:59.999 cadvisor takes the Linux cgroup data and exposes it as metrics, 99:59:59.999 --> 99:59:59.999 so you can get data about how much CPU time is being 99:59:59.999 --> 99:59:59.999 spent in your container, 99:59:59.999 --> 99:59:59.999 how much memory is being used by your container. 99:59:59.999 --> 99:59:59.999 [Q] But not about the application, just about the container usage? 99:59:59.999 --> 99:59:59.999 [A] Right. Because the container has no idea 99:59:59.999 --> 99:59:59.999 whether your application is written in Ruby or Go or Python or whatever, 99:59:59.999 --> 99:59:59.999 you have to build that into your application in order to get the data. 99:59:59.999 --> 99:59:59.999 So for Prometheus, 99:59:59.999 --> 99:59:59.999 we've written client libraries that can be included in your application directly 99:59:59.999 --> 99:59:59.999 so you can get that data out. 99:59:59.999 --> 99:59:59.999 If you go to the Prometheus website, we have a whole series of client libraries 99:59:59.999 --> 99:59:59.999 and we cover a pretty good selection of popular software. 99:59:59.999 --> 99:59:59.999 [Q] What is the current state of long-term data storage? 99:59:59.999 --> 99:59:59.999 [A] Very good question. 99:59:59.999 --> 99:59:59.999 There's been several… 99:59:59.999 --> 99:59:59.999 There are actually several different methods of doing this. 99:59:59.999 --> 99:59:59.999 Prometheus stores all this data locally in its own data storage 99:59:59.999 --> 99:59:59.999 on the local disk. 99:59:59.999 --> 99:59:59.999 But that's only as durable as that server is durable. 99:59:59.999 --> 99:59:59.999 So if you've got a really durable server, 99:59:59.999 --> 99:59:59.999 you can store as much data as you want, 99:59:59.999 --> 99:59:59.999 you can store years and years of data locally on the Prometheus server. 99:59:59.999 --> 99:59:59.999 That's not a problem. 99:59:59.999 --> 99:59:59.999 There's a bunch of misconceptions because of our defaults 99:59:59.999 --> 99:59:59.999 and the language on our website that said 99:59:59.999 --> 99:59:59.999 "It's not long-term storage", 99:59:59.999 --> 99:59:59.999 simply because we leave that problem up to the person running the server. 99:59:59.999 --> 99:59:59.999 But the time series database that Prometheus includes 99:59:59.999 --> 99:59:59.999 is actually quite durable. 99:59:59.999 --> 99:59:59.999 But it's only as durable as the server underneath it. 99:59:59.999 --> 99:59:59.999 So if you've got a very large cluster and you want really high durability, 99:59:59.999 --> 99:59:59.999 you need to have some kind of cluster software, 99:59:59.999 --> 99:59:59.999 but because we want Prometheus to be simple to deploy 99:59:59.999 --> 99:59:59.999 and very simple to operate 99:59:59.999 --> 99:59:59.999 and also very robust, 99:59:59.999 --> 99:59:59.999 we didn't want to include any clustering in Prometheus itself, 99:59:59.999 --> 99:59:59.999 because anytime you have clustered software, 99:59:59.999 --> 99:59:59.999 what happens if your network is a little wonky? 99:59:59.999 --> 99:59:59.999 The first thing that happens is all of your distributed systems fail. 99:59:59.999 --> 99:59:59.999 And building distributed systems to be really robust is really hard, 99:59:59.999 --> 99:59:59.999 so Prometheus is what we call an "uncoordinated distributed system". 99:59:59.999 --> 99:59:59.999 If you've got two Prometheus servers monitoring all your targets in an HA mode 99:59:59.999 --> 99:59:59.999 in a cluster, and there's a split brain, 99:59:59.999 --> 99:59:59.999 each Prometheus can see half of the cluster and 99:59:59.999 --> 99:59:59.999 it can see that the other half of the cluster is down. 99:59:59.999 --> 99:59:59.999 They can both try to get alerts out to the alert manager 99:59:59.999 --> 99:59:59.999 and this is a really, really robust way of handling split brains 99:59:59.999 --> 99:59:59.999 and bad network failures and bad problems in a cluster.
99:59:59.999 --> 99:59:59.999 It's designed to be super super robust 99:59:59.999 --> 99:59:59.999 and so the two individual Prometheus servers in your cluster 99:59:59.999 --> 99:59:59.999 don't have to talk to each other to do this, 99:59:59.999 --> 99:59:59.999 they can just do it independently. 99:59:59.999 --> 99:59:59.999 But if you want to be able to correlate data 99:59:59.999 --> 99:59:59.999 between many different Prometheus servers, 99:59:59.999 --> 99:59:59.999 you need an external data storage to do this. 99:59:59.999 --> 99:59:59.999 And also you may not have very big servers, 99:59:59.999 --> 99:59:59.999 you might be running your Prometheus in a container 99:59:59.999 --> 99:59:59.999 and it's only got a little bit of local storage space, 99:59:59.999 --> 99:59:59.999 so you want to send all that data up to a big cluster datastore 99:59:59.999 --> 99:59:59.999 for ??? 99:59:59.999 --> 99:59:59.999 We have several different ways of doing this. 99:59:59.999 --> 99:59:59.999 There's the classic way, which is called federation, 99:59:59.999 --> 99:59:59.999 where you have one Prometheus server polling in summary data from 99:59:59.999 --> 99:59:59.999 each of the individual Prometheus servers, 99:59:59.999 --> 99:59:59.999 and this is useful if you want to run alerts against data coming 99:59:59.999 --> 99:59:59.999 from multiple Prometheus servers. 99:59:59.999 --> 99:59:59.999 But federation is not replication. 99:59:59.999 --> 99:59:59.999 It can only pull in a little bit of data from each Prometheus server. 99:59:59.999 --> 99:59:59.999 If you've got a million metrics on each Prometheus server, 99:59:59.999 --> 99:59:59.999 you can't poll in a million metrics and do… 99:59:59.999 --> 99:59:59.999 If you've got 10 of those, you can't poll in 10 million metrics 99:59:59.999 --> 99:59:59.999 simultaneously into one Prometheus server. 99:59:59.999 --> 99:59:59.999 It's just too much data.
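[Editor's note: a sketch of a federation scrape job as described above, where a global Prometheus pulls only pre-aggregated series from per-site Prometheus servers; the match expression and hostnames are illustrative.]

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true            # keep the labels as set by the source servers
        metrics_path: '/federate'
        params:
          'match[]':
            - '{__name__=~"job:.*"}'  # only pull summary series produced by recording rules
        static_configs:
          - targets:
              - 'prometheus-dc1.example.com:9090'
              - 'prometheus-dc2.example.com:9090'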
99:59:59.999 --> 99:59:59.999 There are two other nice options. 99:59:59.999 --> 99:59:59.999 There's a piece of software called Cortex. 99:59:59.999 --> 99:59:59.999 Cortex is a Prometheus server that stores its data in a database. 99:59:59.999 --> 99:59:59.999 Specifically, a distributed database. 99:59:59.999 --> 99:59:59.999 Things that are based on the Google Bigtable model, like Cassandra or… 99:59:59.999 --> 99:59:59.999 What's the Amazon one? 99:59:59.999 --> 99:59:59.999 Yeah. 99:59:59.999 --> 99:59:59.999 DynamoDB. 99:59:59.999 --> 99:59:59.999 If you have a DynamoDB or a Cassandra cluster, or one of these other 99:59:59.999 --> 99:59:59.999 really big distributed storage clusters, 99:59:59.999 --> 99:59:59.999 Cortex can run and the Prometheus servers will stream their data up to Cortex 99:59:59.999 --> 99:59:59.999 and it will keep a copy of that across all of your Prometheus servers. 99:59:59.999 --> 99:59:59.999 And because it's based on things like Cassandra, 99:59:59.999 --> 99:59:59.999 it's super scalable. 99:59:59.999 --> 99:59:59.999 But it's a little complex to run and 99:59:59.999 --> 99:59:59.999 many people don't want to run that complex infrastructure. 99:59:59.999 --> 99:59:59.999 We have another new one, we just blogged about it yesterday. 99:59:59.999 --> 99:59:59.999 It's a thing called Thanos. 99:59:59.999 --> 99:59:59.999 Thanos is Prometheus at scale. 99:59:59.999 --> 99:59:59.999 Basically, the way it works… 99:59:59.999 --> 99:59:59.999 Actually, why don't I bring that up? 99:59:59.999 --> 99:59:59.999 This was developed by a company called Improbable 99:59:59.999 --> 99:59:59.999 and they wanted to… 99:59:59.999 --> 99:59:59.999 They had billions of metrics coming from hundreds of Prometheus servers. 99:59:59.999 --> 99:59:59.999 They developed this in collaboration with the Prometheus team to build 99:59:59.999 --> 99:59:59.999 a super highly scalable Prometheus server. 99:59:59.999 --> 99:59:59.999 Prometheus itself stores the incoming metrics data in ??? log 99:59:59.999 --> 99:59:59.999 and then every two hours, it runs a compaction cycle 99:59:59.999 --> 99:59:59.999 and it creates an immutable series block of data, which is 99:59:59.999 --> 99:59:59.999 all the time series blocks themselves 99:59:59.999 --> 99:59:59.999 and then an index into that data. 99:59:59.999 --> 99:59:59.999 Those two-hour windows are all immutable, 99:59:59.999 --> 99:59:59.999 so ??? has a little sidecar binary that watches for those new directories and 99:59:59.999 --> 99:59:59.999 uploads them into a blob store. 99:59:59.999 --> 99:59:59.999 So you could put them in S3 or minio or some other simple object storage. 99:59:59.999 --> 99:59:59.999 And then now you have all of your data, all of this index data, already 99:59:59.999 --> 99:59:59.999 ready to go, 99:59:59.999 --> 99:59:59.999 and then the final sidecar creates a little mesh cluster that can read from 99:59:59.999 --> 99:59:59.999 all of those S3 blocks. 99:59:59.999 --> 99:59:59.999 Now, you have this super global view all stored in a big bucket storage and 99:59:59.999 --> 99:59:59.999 things like S3 or minio are… 99:59:59.999 --> 99:59:59.999 Bucket storage is not a database, so it's operationally a little easier to operate. 99:59:59.999 --> 99:59:59.999 Plus, now we have all this data in a bucket store and 99:59:59.999 --> 99:59:59.999 the Thanos sidecars can talk to each other. 99:59:59.999 --> 99:59:59.999 We can now have a single entry point. 99:59:59.999 --> 99:59:59.999 You can query Thanos and Thanos will distribute your query 99:59:59.999 --> 99:59:59.999 across all your Prometheus servers. 99:59:59.999 --> 99:59:59.999 So now you can do global queries across all of your servers. 99:59:59.999 --> 99:59:59.999 But it's very new, they just released their first release candidate yesterday. 99:59:59.999 --> 99:59:59.999 It is looking to be like the coolest thing ever 99:59:59.999 --> 99:59:59.999 for running large-scale Prometheus. 99:59:59.999 --> 99:59:59.999 Here's an example of how that is laid out. 99:59:59.999 --> 99:59:59.999 This will ??? let you have a billion-metric Prometheus cluster. 99:59:59.999 --> 99:59:59.999 And it's got a bunch of other cool features. 99:59:59.999 --> 99:59:59.999 Any more questions? 99:59:59.999 --> 99:59:59.999 Alright, maybe I'll do a quick little demo. 99:59:59.999 --> 99:59:59.999 Here is a Prometheus server that is provided by ??? 99:59:59.999 --> 99:59:59.999 that just does an ansible deployment for Prometheus. 99:59:59.999 --> 99:59:59.999 And you can just simply query for something like 'node_cpu'. 99:59:59.999 --> 99:59:59.999 This is actually the old name for that metric. 99:59:59.999 --> 99:59:59.999 And you can see, here are exactly 99:59:59.999 --> 99:59:59.999 the CPU metrics from some servers. 99:59:59.999 --> 99:59:59.999 It's just a bunch of stuff. 99:59:59.999 --> 99:59:59.999 There are actually two servers here, 99:59:59.999 --> 99:59:59.999 there's an influx cloud alchemy and there is a demo cloud alchemy. 99:59:59.999 --> 99:59:59.999 [Q] Can you zoom in?
[A] Oh yeah, sure. 99:59:59.999 --> 99:59:59.999 So you can see all the extra labels. 99:59:59.999 --> 99:59:59.999 We can also do some things like… 99:59:59.999 --> 99:59:59.999 Let's take a look at, say, the last 30 seconds. 99:59:59.999 --> 99:59:59.999 We can just add this little time window. 99:59:59.999 --> 99:59:59.999 It's called a range request, and you can see 99:59:59.999 --> 99:59:59.999 the individual samples. 99:59:59.999 --> 99:59:59.999 You can see that all Prometheus is doing 99:59:59.999 --> 99:59:59.999 is storing the sample and a timestamp. 99:59:59.999 --> 99:59:59.999 All the timestamps are in milliseconds and it's all epoch, 99:59:59.999 --> 99:59:59.999 so it's super easy to manipulate. 99:59:59.999 --> 99:59:59.999 But, looking at the individual samples and looking at this, you can see that 99:59:59.999 --> 99:59:59.999 if we go back and just take… and look at the raw data, and 99:59:59.999 --> 99:59:59.999 we graph the raw data… 99:59:59.999 --> 99:59:59.999 Oops, that's a syntax error. 99:59:59.999 --> 99:59:59.999 And we look at this graph… Come on. 99:59:59.999 --> 99:59:59.999 Here we go. 99:59:59.999 --> 99:59:59.999 Well, that's kind of boring, it's just a flat line because 99:59:59.999 --> 99:59:59.999 it's just a counter going up very slowly. 99:59:59.999 --> 99:59:59.999 What we really want to do is we want to take, and we want to apply, 99:59:59.999 --> 99:59:59.999 a rate function to this counter. 99:59:59.999 --> 99:59:59.999 So let's look at the rate over the last one minute. 99:59:59.999 --> 99:59:59.999 There we go, now we get a nice little graph. 99:59:59.999 --> 99:59:59.999 And so you can see that this is 0.6 CPU seconds per second 99:59:59.999 --> 99:59:59.999 for that set of labels. 99:59:59.999 --> 99:59:59.999 But this is pretty noisy, there are a lot of lines on this graph and 99:59:59.999 --> 99:59:59.999 there's still a lot of data here. 99:59:59.999 --> 99:59:59.999 So let's start doing some filtering. 99:59:59.999 --> 99:59:59.999 One of the things we see here is, well, there's idle. 99:59:59.999 --> 99:59:59.999 We don't really care about the machine being idle, 99:59:59.999 --> 99:59:59.999 so let's just add a label filter, so we can say 99:59:59.999 --> 99:59:59.999 'mode', that's the label name, and it's not equal to 'idle'. Done. 99:59:59.999 --> 99:59:59.999 And if I could type… What did I miss? 99:59:59.999 --> 99:59:59.999 Here we go. 99:59:59.999 --> 99:59:59.999 So now we've removed idle from the graph. 99:59:59.999 --> 99:59:59.999 That looks a little more sane. 99:59:59.999 --> 99:59:59.999 Oh, wow, look at that, that's a nice big spike in user space on the influx server. 99:59:59.999 --> 99:59:59.999 Okay… 99:59:59.999 --> 99:59:59.999 Well, that's pretty cool. 99:59:59.999 --> 99:59:59.999 What about… 99:59:59.999 --> 99:59:59.999 This is still quite a lot of lines. 99:59:59.999 --> 99:59:59.999 How much CPU is in use in total across all the servers that we have? 99:59:59.999 --> 99:59:59.999 We can just sum up that rate. 99:59:59.999 --> 99:59:59.999 We can just see that there is a sum total of 0.6 CPU seconds/s 99:59:59.999 --> 99:59:59.999 across the servers we have. 99:59:59.999 --> 99:59:59.999 But that's a little too coarse. 99:59:59.999 --> 99:59:59.999 What if we want to see it by instance? 99:59:59.999 --> 99:59:59.999 Now, we can see the two servers, we can see 99:59:59.999 --> 99:59:59.999 that we're left with just that label. 99:59:59.999 --> 99:59:59.999 The influx labels are the influx instance and the influx demo. 99:59:59.999 --> 99:59:59.999 That's a super easy way to see that, 99:59:59.999 --> 99:59:59.999 but we can also do this the other way around. 99:59:59.999 --> 99:59:59.999 We can say 'without (mode,cpu)', so we can drop those modes and 99:59:59.999 --> 99:59:59.999 see all the labels that we have. 99:59:59.999 --> 99:59:59.999 We can still see the environment label and the job label on our list data. 99:59:59.999 --> 99:59:59.999 You can go either way with the summary functions. 99:59:59.999 --> 99:59:59.999 There's a whole bunch of different functions 99:59:59.999 --> 99:59:59.999 and it's all in our documentation. 99:59:59.999 --> 99:59:59.999 But what if we want to see it… 99:59:59.999 --> 99:59:59.999 What if we want to see which CPUs are in use? 99:59:59.999 --> 99:59:59.999 Now we can see that it's only CPU0, 99:59:59.999 --> 99:59:59.999 because apparently these are only 1-core instances. 99:59:59.999 --> 99:59:59.999 You can add/remove labels and do all these queries.
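[Editor's note: the queries walked through in this demo, reconstructed from the narration above; 'node_cpu' is the old metric name, current node exporters call it node_cpu_seconds_total.]

    node_cpu                                                    # the raw counters
    node_cpu[30s]                                               # range request: raw samples with timestamps
    rate(node_cpu[1m])                                          # per-second rate over the last minute
    rate(node_cpu{mode!="idle"}[1m])                            # filter out the idle mode
    sum(rate(node_cpu{mode!="idle"}[1m]))                       # total across everything
    sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))        # broken out per server
    sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))  # drop mode/cpu, keep the other labels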
99:59:59.999 --> 99:59:59.999 Any other questions so far? 99:59:59.999 --> 99:59:59.999 [Q] I don't have a question, but I have something to add. 99:59:59.999 --> 99:59:59.999 Prometheus is really nice, but it's a lot better if you combine it 99:59:59.999 --> 99:59:59.999 with grafana. 99:59:59.999 --> 99:59:59.999 [A] Yes, yes. 99:59:59.999 --> 99:59:59.999 In the beginning, when we were creating Prometheus, we actually built 99:59:59.999 --> 99:59:59.999 a piece of dashboard software called promdash. 99:59:59.999 --> 99:59:59.999 It was a simple little Ruby on Rails app to create dashboards 99:59:59.999 --> 99:59:59.999 and it had a bunch of JavaScript. 99:59:59.999 --> 99:59:59.999 And then grafana came out. 99:59:59.999 --> 99:59:59.999 And we're like 99:59:59.999 --> 99:59:59.999 "Oh, that's interesting. It doesn't support Prometheus", so we were like 99:59:59.999 --> 99:59:59.999 "Hey, can you support Prometheus?" 99:59:59.999 --> 99:59:59.999 and they're like "Yeah, we've got a REST API, get the data, done". 99:59:59.999 --> 99:59:59.999 Now grafana supports Prometheus and we're like 99:59:59.999 --> 99:59:59.999 "Well, promdash, this is crap, delete". 99:59:59.999 --> 99:59:59.999 The Prometheus development team, 99:59:59.999 --> 99:59:59.999 we're all backend developers and SREs and 99:59:59.999 --> 99:59:59.999 we have no JavaScript skills at all. 99:59:59.999 --> 99:59:59.999 So we're like "Let somebody else deal with that". 99:59:59.999 --> 99:59:59.999 One of the nice things about working on this kind of project is 99:59:59.999 --> 99:59:59.999 we can do the things that we're good at, and we don't try… 99:59:59.999 --> 99:59:59.999 We don't have any marketing people, it's just an opensource project, 99:59:59.999 --> 99:59:59.999 there's no single company behind Prometheus. 99:59:59.999 --> 99:59:59.999 I work for GitLab, Improbable paid for the Thanos system, 99:59:59.999 --> 99:59:59.999 other companies like Red Hat now pay people that used to work on CoreOS to 99:59:59.999 --> 99:59:59.999 work on Prometheus. 99:59:59.999 --> 99:59:59.999 There's lots and lots of collaboration between many companies 99:59:59.999 --> 99:59:59.999 to build the Prometheus ecosystem. 99:59:59.999 --> 99:59:59.999 But yeah, grafana is great. 99:59:59.999 --> 99:59:59.999 Actually, grafana now has two full-time Prometheus developers. 99:59:59.999 --> 99:59:59.999 Alright, that's it. 99:59:59.999 --> 99:59:59.999 [Applause]