1 00:00:05,901 --> 00:00:10,531 So, we had a talk by a non-GitLab person about GitLab. 2 00:00:10,531 --> 00:00:13,057 Now, we have a talk by a GitLab person on non-GitLab. 3 00:00:13,202 --> 00:00:14,603 Something like that? 4 00:00:15,894 --> 00:00:19,393 The CCCHH hackerspace is now open, 5 00:00:19,946 --> 00:00:22,118 from now on if you want to go there, that's the announcement. 6 00:00:22,471 --> 00:00:25,871 And the next talk will be by Ben Kochie 7 00:00:26,009 --> 00:00:28,265 on metrics-based monitoring with Prometheus. 8 00:00:28,748 --> 00:00:30,212 Welcome. 9 00:00:30,545 --> 00:00:33,133 [Applause] 10 00:00:35,395 --> 00:00:36,578 Alright, so 11 00:00:36,886 --> 00:00:39,371 my name is Ben Kochie 12 00:00:39,845 --> 00:00:43,870 I work on DevOps features for GitLab 13 00:00:44,327 --> 00:00:48,293 and apart from working for GitLab, I also work on the open source Prometheus project. 14 00:00:51,163 --> 00:00:53,595 I live in Berlin and I've been using Debian since ??? 15 00:00:54,353 --> 00:00:56,797 yes, quite a long time. 16 00:00:58,806 --> 00:01:01,018 So, what is Metrics-based Monitoring? 17 00:01:02,638 --> 00:01:05,165 If you're running software in production, 18 00:01:05,885 --> 00:01:07,826 you probably want to monitor it, 19 00:01:08,212 --> 00:01:10,547 because if you don't monitor it, you don't know if it's right. 20 00:01:13,278 --> 00:01:16,112 It breaks down into two categories: 21 00:01:16,112 --> 00:01:19,146 there's blackbox monitoring and there's whitebox monitoring. 22 00:01:19,500 --> 00:01:24,582 Blackbox monitoring is treating your software like a blackbox. 23 00:01:24,757 --> 00:01:27,158 It's just checks to see, like, 24 00:01:27,447 --> 00:01:29,483 is it responding, or does it ping, 25 00:01:30,023 --> 00:01:33,588 or does it answer HTTP requests 26 00:01:34,348 --> 00:01:35,669 [mic turned on] 27 00:01:37,760 --> 00:01:41,379 Ah, there we go, much better. 28 00:01:46,592 --> 00:01:51,898 So, blackbox monitoring is a probe, 29 00:01:51,898 --> 00:01:54,684 it just kind of looks at your software from the outside 30 00:01:55,454 --> 00:01:57,432 and it has no knowledge of the internals 31 00:01:58,133 --> 00:02:00,699 and it's really good for end to end testing. 32 00:02:00,942 --> 00:02:03,560 So if you've got a fairly complicated service, 33 00:02:03,990 --> 00:02:06,426 you come in from the outside, you go through the load balancer, 34 00:02:06,721 --> 00:02:07,975 you hit the API server, 35 00:02:07,975 --> 00:02:10,152 the API server might hit a database, 36 00:02:10,675 --> 00:02:13,054 and you go all the way through to the back of the stack 37 00:02:13,224 --> 00:02:14,536 and all the way back out 38 00:02:14,560 --> 00:02:16,294 so you know that everything is working end to end. 39 00:02:16,518 --> 00:02:18,768 But you only know about it for that one request. 40 00:02:19,036 --> 00:02:22,429 So in order to find out if your service is working, 41 00:02:22,831 --> 00:02:27,128 from end to end, for every single request, 42 00:02:27,475 --> 00:02:29,523 this requires whitebox instrumentation. 43 00:02:29,836 --> 00:02:33,965 So, basically, every event that happens inside your software, 44 00:02:33,973 --> 00:02:36,517 inside a serving stack, 45 00:02:36,817 --> 00:02:39,807 gets collected and gets counted, 46 00:02:40,037 --> 00:02:43,466 so you know that every request hits the load balancer, 47 00:02:43,466 --> 00:02:45,656 every request hits your application service, 48 00:02:45,972 --> 00:02:47,329 every request hits the database.
49 00:02:47,789 --> 00:02:50,832 You know that everything matches up 50 00:02:50,997 --> 00:02:55,764 and this is called whitebox, or metrics-based monitoring. 51 00:02:56,010 --> 00:02:57,688 There are different examples of, like, 52 00:02:57,913 --> 00:03:02,392 the kind of software that does blackbox and whitebox monitoring. 53 00:03:02,572 --> 00:03:06,680 So you have software like Nagios where you can configure checks 54 00:03:08,826 --> 00:03:10,012 or Pingdom, 55 00:03:10,211 --> 00:03:12,347 Pingdom will ping your website. 56 00:03:12,971 --> 00:03:15,307 And then there is metrics-based monitoring, 57 00:03:15,517 --> 00:03:19,293 things like Prometheus, things like the TICK stack from InfluxData, 58 00:03:19,610 --> 00:03:22,728 New Relic and other commercial solutions 59 00:03:23,027 --> 00:03:25,480 but of course I like to talk about the open source solutions. 60 00:03:25,748 --> 00:03:28,379 We're gonna talk a little bit about Prometheus. 61 00:03:28,819 --> 00:03:31,955 Prometheus came out of the idea that 62 00:03:32,343 --> 00:03:37,555 we needed a monitoring system that could collect all this whitebox metric data 63 00:03:37,941 --> 00:03:40,786 and do something useful with it. 64 00:03:40,915 --> 00:03:42,667 Not just give us a pretty graph, but we also want to be able to 65 00:03:42,985 --> 00:03:44,189 alert on it. 66 00:03:44,189 --> 00:03:45,988 So we needed both 67 00:03:49,872 --> 00:03:54,068 a data gathering and an analytics system in the same instance. 68 00:03:54,148 --> 00:03:58,821 To do this, we built this thing and we looked at the way that 69 00:03:59,014 --> 00:04:01,835 data was being generated by the applications 70 00:04:02,369 --> 00:04:05,204 and there are advantages and disadvantages to this 71 00:04:05,204 --> 00:04:07,250 push vs. pull model for metrics. 72 00:04:07,384 --> 00:04:09,701 We decided to go with the pulling model 73 00:04:09,938 --> 00:04:13,953 because there are some slight advantages for pulling over pushing. 74 00:04:16,323 --> 00:04:18,163 With pulling, you get this free blackbox check 75 00:04:18,471 --> 00:04:20,151 that the application is running. 76 00:04:20,527 --> 00:04:24,319 When you pull your application, you know that the process is running. 77 00:04:24,532 --> 00:04:27,529 If you are doing push-based, you can't tell the difference between 78 00:04:27,851 --> 00:04:31,521 your application doing no work and your application not running. 79 00:04:32,416 --> 00:04:33,900 So you don't know if it's stuck, 80 00:04:34,140 --> 00:04:37,878 or if it just doesn't have any work to do. 81 00:04:42,671 --> 00:04:48,940 With pulling, the pulling system knows the state of your network. 82 00:04:49,850 --> 00:04:52,522 If you have a defined set of services, 83 00:04:52,887 --> 00:04:56,788 that inventory drives what should be there. 84 00:04:58,274 --> 00:05:00,080 Again, it's like that disappearing process problem: 85 00:05:00,288 --> 00:05:03,950 is the process dead, or is it just not doing anything? 86 00:05:04,205 --> 00:05:07,117 With polling, you know for a fact what processes should be there, 87 00:05:07,593 --> 00:05:10,900 and it's a bit of an advantage there. 88 00:05:11,138 --> 00:05:12,913 With pulling, there's really easy testing. 89 00:05:13,117 --> 00:05:16,295 With push-based metrics, if you want to test 90 00:05:16,505 --> 00:05:18,843 a new version of the monitoring system or 91 00:05:19,058 --> 00:05:21,262 you want to test something new, 92 00:05:21,420 --> 00:05:24,129 you have to tear off a copy of the data.
93 00:05:24,370 --> 00:05:27,652 With pulling, you can just set up another instance of your monitoring 94 00:05:27,856 --> 00:05:29,189 and just test it. 95 00:05:29,714 --> 00:05:31,321 Or you don't even have, 96 00:05:31,473 --> 00:05:33,194 it doesn't even have to be monitoring, you can just use curl 97 00:05:33,199 --> 00:05:35,977 to pull the metrics endpoint. 98 00:05:38,417 --> 00:05:40,436 It's significantly easier to test. 99 00:05:40,436 --> 00:05:42,977 The other thing with the… 100 00:05:45,999 --> 00:05:48,109 The other nice thing is that the client is really simple. 101 00:05:48,481 --> 00:05:51,068 The client doesn't have to know where the monitoring system is. 102 00:05:51,272 --> 00:05:53,669 It doesn't have to know about ??? 103 00:05:53,820 --> 00:05:55,720 It just has to sit and collect the data about itself. 104 00:05:55,882 --> 00:05:58,708 So it doesn't have to know anything about the topology of the network. 105 00:05:59,134 --> 00:06:03,363 As an application developer, if you're writing a DNS server or 106 00:06:03,724 --> 00:06:05,572 some other piece of software, 107 00:06:05,896 --> 00:06:09,562 you don't have to know anything about monitoring software, 108 00:06:09,803 --> 00:06:12,217 you can just implement it inside your application and 109 00:06:12,683 --> 00:06:17,058 the monitoring software, whether it's Prometheus or something else, 110 00:06:17,414 --> 00:06:19,332 can just come and collect that data for you. 111 00:06:20,210 --> 00:06:23,611 That's kind of similar to a very old monitoring system called SNMP, 112 00:06:23,832 --> 00:06:28,530 but SNMP has a significantly less friendly data model for developers. 113 00:06:30,010 --> 00:06:33,556 This is the basic layout of a Prometheus server. 114 00:06:33,921 --> 00:06:35,918 At the core, there's a Prometheus server 115 00:06:36,278 --> 00:06:40,302 and it deals with all the data collection and analytics. 116 00:06:42,941 --> 00:06:46,697 Basically, this one binary, it's all written in Go. 117 00:06:46,867 --> 00:06:48,559 It's a single binary. 118 00:06:48,559 --> 00:06:50,823 It knows how to read from your inventory, 119 00:06:50,823 --> 00:06:52,659 there's a bunch of different methods, whether you've got 120 00:06:53,121 --> 00:06:58,843 a Kubernetes cluster or a cloud platform 121 00:07:00,234 --> 00:07:03,800 or you have your own customized thing with Ansible. 122 00:07:05,380 --> 00:07:09,750 Ansible can take your layout, drop that into a config file and 123 00:07:10,639 --> 00:07:11,902 Prometheus can pick that up. 124 00:07:15,594 --> 00:07:18,812 Once it has the layout, it goes out and collects all the data. 125 00:07:18,844 --> 00:07:24,254 It has a storage layer, a time series database, to store all that data locally. 126 00:07:24,462 --> 00:07:28,228 It has a thing called PromQL, which is a query language designed 127 00:07:28,452 --> 00:07:31,033 for metrics and analytics. 128 00:07:31,500 --> 00:07:36,779 On top of that PromQL, you can add frontends, 129 00:07:36,985 --> 00:07:39,319 whether it's a simple API client to run reports, 130 00:07:40,019 --> 00:07:42,942 or things like Grafana for creating dashboards; 131 00:07:43,124 --> 00:07:44,834 it's got a simple web UI built in. 132 00:07:45,031 --> 00:07:46,920 You can plug in anything you want on that side.
133 00:07:48,693 --> 00:07:54,478 And then, it also has the ability to continuously execute queries 134 00:07:54,625 --> 00:07:56,191 called "recording rules" 135 00:07:56,832 --> 00:07:59,103 and these recording rules have two different modes. 136 00:07:59,103 --> 00:08:01,871 You can either record: you take a query 137 00:08:02,150 --> 00:08:03,711 and it will generate new data from that query; 138 00:08:04,072 --> 00:08:06,967 or you can take a query, and if it returns results, 139 00:08:07,354 --> 00:08:08,910 it will fire an alert. 140 00:08:09,176 --> 00:08:12,506 That alert is a push message to the alert manager. 141 00:08:12,813 --> 00:08:18,969 This allows us to separate the generating of alerts from the routing of alerts. 142 00:08:19,153 --> 00:08:24,259 You can have one or hundreds of Prometheus servers, all generating alerts 143 00:08:24,599 --> 00:08:28,807 and it goes into an alert manager cluster, which does the deduplication 144 00:08:29,329 --> 00:08:30,684 and the routing to the human 145 00:08:30,879 --> 00:08:34,138 because, of course, the thing that we want is 146 00:08:34,927 --> 00:08:38,797 we had dashboards with graphs, but in order to find out if something is broken 147 00:08:38,966 --> 00:08:40,650 you had to have a human looking at the graph. 148 00:08:40,830 --> 00:08:42,942 With Prometheus, we don't have to do that anymore, 149 00:08:43,103 --> 00:08:47,638 we can simply let the software tell us that we need to go investigate 150 00:08:47,638 --> 00:08:48,650 our problems. 151 00:08:48,778 --> 00:08:50,831 We don't have to sit there and stare at dashboards all day, 152 00:08:51,035 --> 00:08:52,380 because that's really boring. 153 00:08:54,519 --> 00:08:57,556 What does it look like to actually get data into Prometheus? 154 00:08:57,587 --> 00:09:02,140 This is a very basic output of a Prometheus metric. 155 00:09:02,613 --> 00:09:03,930 This is a very simple thing. 156 00:09:04,086 --> 00:09:07,572 If you know much about the Linux kernel, 157 00:09:06,883 --> 00:09:12,779 the Linux kernel tracks, in proc stats, the state of all the CPUs 158 00:09:12,779 --> 00:09:14,459 in your system 159 00:09:14,662 --> 00:09:18,078 and we express this by having the name of the metric, which is 160 00:09:22,449 --> 00:09:26,123 'node_cpu_seconds_total' and so this is a self-describing metric, 161 00:09:26,547 --> 00:09:28,375 like you can just read the metric's name 162 00:09:28,530 --> 00:09:30,845 and you understand a little bit about what's going on here. 163 00:09:33,241 --> 00:09:38,521 The Linux kernel and other kernels track their usage by the number of seconds 164 00:09:38,859 --> 00:09:41,004 spent doing different things and 165 00:09:41,199 --> 00:09:46,721 that could be, whether it's in system or user space or IRQs 166 00:09:47,065 --> 00:09:48,690 or iowait or idle. 167 00:09:48,908 --> 00:09:51,280 Actually, the kernel tracks how much idle time it has. 168 00:09:53,660 --> 00:09:55,309 It also tracks it by the number of CPUs. 169 00:09:55,997 --> 00:10:00,067 With other monitoring systems, they used to do this with a tree structure 170 00:10:01,021 --> 00:10:03,688 and this caused a lot of problems, like: 171 00:10:03,854 --> 00:10:09,291 how do you mix and match data? So by switching from 172 00:10:10,043 --> 00:10:12,484 a tree structure to a tag-based structure, 173 00:10:12,985 --> 00:10:16,896 we can do some really interesting powerful data analytics.
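For reference, the slide being described here is the plain-text exposition format that Prometheus scrapes: one line per label combination, with the current counter value at the end. A rough sketch of what that output looks like (the HELP text and the numbers are made up for illustration):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 4.59098e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 1204.31
node_cpu_seconds_total{cpu="0",mode="system"} 6087.21
node_cpu_seconds_total{cpu="0",mode="user"} 32462.94
```

Each mode and each CPU is just another label pair on the same metric name, which is the tag-based structure the talk contrasts with the old tree-style naming.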
174 00:10:18,170 --> 00:10:25,170 Here's a nice example of taking those CPU seconds counters 175 00:10:26,101 --> 00:10:30,198 and then converting them into a graph by using PromQL. 176 00:10:32,724 --> 00:10:34,830 Now we can get into Metrics-Based Alerting. 177 00:10:35,315 --> 00:10:37,665 Now we have this graph, we have this thing 178 00:10:37,847 --> 00:10:39,497 we can look and see here 179 00:10:39,999 --> 00:10:42,920 "Oh there is some little spike here, we might want to know about that." 180 00:10:43,191 --> 00:10:45,849 Now we can get into Metrics-Based Alerting. 181 00:10:46,281 --> 00:10:51,128 I used to be a site reliability engineer, I'm still a site reliability engineer at heart 182 00:10:52,371 --> 00:11:00,362 and we have this concept of the things that you need to run a site or a service reliably. 183 00:11:00,910 --> 00:11:03,231 The most important thing you need is down at the bottom, 184 00:11:03,569 --> 00:11:06,869 Monitoring, because if you don't have monitoring of your service, 185 00:11:07,108 --> 00:11:08,688 how do you know it's even working? 186 00:11:11,628 --> 00:11:15,235 There's a couple of techniques here, and we want to alert based on data 187 00:11:15,693 --> 00:11:17,644 and not just those end to end tests. 188 00:11:18,796 --> 00:11:23,387 There's a couple of techniques, a thing called the RED method 189 00:11:23,555 --> 00:11:25,141 and there's a thing called the USE method 190 00:11:25,588 --> 00:11:28,400 and there are some nice blog posts about this 191 00:11:28,695 --> 00:11:31,306 and basically it defines that, for example, 192 00:11:31,484 --> 00:11:35,000 the RED method talks about the requests that your system is handling 193 00:11:36,421 --> 00:11:37,604 There are three things: 194 00:11:37,775 --> 00:11:40,073 There's the number of requests, there's the number of errors 195 00:11:40,268 --> 00:11:42,306 and there's how long it takes, the duration. 196 00:11:42,868 --> 00:11:45,000 With the combination of these three things 197 00:11:45,341 --> 00:11:48,368 you can determine most of what your users see 198 00:11:48,712 --> 00:11:53,616 "Did my request go through? Did it return an error? Was it fast?" 199 00:11:55,492 --> 00:11:57,971 Most people, that's all they care about. 200 00:11:58,205 --> 00:12:01,965 "I made a request to a website and it came back and it was fast." 201 00:12:04,975 --> 00:12:06,517 It's a very simple method of just, like, 202 00:12:07,162 --> 00:12:10,109 those are the important things to determine if your site is healthy. 203 00:12:12,193 --> 00:12:17,045 But we can go back to some more traditional, sysadmin style alerts 204 00:12:17,309 --> 00:12:20,553 this is basically taking the filesystem available space, 205 00:12:20,824 --> 00:12:26,522 divided by the filesystem size, that becomes the ratio of filesystem availability 206 00:12:26,697 --> 00:12:27,523 from 0 to 1. 207 00:12:28,241 --> 00:12:30,759 Multiply it by 100, we now have a percentage 208 00:12:31,016 --> 00:12:35,659 and if it's less than or equal to 1% for 15 minutes, 209 00:12:35,940 --> 00:12:41,782 that is, less than 1% free space, we should tell a sysadmin to go check 210 00:12:41,957 --> 00:12:44,290 to find out why the filesystem has filled up. 211 00:12:44,635 --> 00:12:46,168 It's super nice and simple.
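The alert expression being walked through above would look roughly like this in PromQL; the node_exporter metric names are an assumption about what was on the slide, and the 15-minute condition lives in the alerting rule's `for:` clause rather than in the expression itself:

```
# Free space as a percentage of the filesystem size, alert at <= 1%.
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 <= 1
```

In the rule file this would be paired with `for: 15m`, so the alert only fires after the condition has held for 15 minutes.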
212 00:12:46,494 --> 00:12:49,685 We can also tag, we can include… 213 00:12:51,418 --> 00:12:58,232 Every alert includes all the extra labels that Prometheus adds to your metrics. 214 00:12:59,488 --> 00:13:05,461 When you add a metric in Prometheus, if we go back and we look at this metric: 215 00:13:06,009 --> 00:13:10,803 this metric only contains information about the internals of the application; 216 00:13:12,942 --> 00:13:14,995 anything about, like, what server it's on, is it running in a container, 217 00:13:15,186 --> 00:13:18,724 what cluster does it come from, what continent is it on, 218 00:13:17,702 --> 00:13:22,280 that's all extra annotations that are added by the Prometheus server 219 00:13:22,619 --> 00:13:23,949 at discovery time. 220 00:13:24,514 --> 00:13:28,347 Unfortunately I don't have a good example of what those labels look like 221 00:13:28,514 --> 00:13:34,180 but every metric gets annotated with location information. 222 00:13:36,904 --> 00:13:41,121 That location information also comes through as labels in the alert 223 00:13:41,300 --> 00:13:48,074 so, if you have a message coming into your alert manager, 224 00:13:48,269 --> 00:13:49,899 the alert manager can look and go 225 00:13:50,093 --> 00:13:51,621 "Oh, that's coming from this datacenter" 226 00:13:52,007 --> 00:13:58,905 and it can include that in the email or IRC message or SMS message. 227 00:13:59,069 --> 00:14:00,772 So you can include 228 00:13:59,271 --> 00:14:04,422 "Filesystem is out of space on this host from this datacenter" 229 00:14:04,557 --> 00:14:07,340 All these labels get passed through and then you can append 230 00:14:07,491 --> 00:14:13,292 "severity: critical" to that alert and include that in the message to the human 231 00:14:13,693 --> 00:14:16,775 because of course, this is how you define… 232 00:14:16,940 --> 00:14:20,857 Getting the message from the monitoring to the human. 233 00:14:22,197 --> 00:14:23,850 You can even include nice things like, 234 00:14:24,027 --> 00:14:27,508 if you've got documentation, you can include a link to the documentation 235 00:14:27,620 --> 00:14:28,686 as an annotation 236 00:14:29,079 --> 00:14:33,438 and the alert manager can take that base URL and, you know, 237 00:14:33,467 --> 00:14:36,806 massage it into whatever it needs to look like to actually get 238 00:14:37,135 --> 00:14:40,417 the operator to the correct documentation. 239 00:14:42,117 --> 00:14:43,450 We can also do more fun things: 240 00:14:43,657 --> 00:14:45,567 since we actually are not just checking 241 00:14:45,746 --> 00:14:48,523 what the space is right now, we're tracking data over time, 242 00:14:49,232 --> 00:14:50,827 we can use 'predict_linear'. 243 00:14:52,406 --> 00:14:55,255 'predict_linear' just takes and does a simple linear regression. 244 00:14:55,749 --> 00:15:00,270 This example takes the filesystem available space over the last hour and 245 00:15:00,865 --> 00:15:02,453 does a linear regression. 246 00:15:02,785 --> 00:15:08,536 Prediction says "Well, it's going that way and four hours from now, 247 00:15:08,749 --> 00:15:13,112 based on one hour of history, it's gonna be less than 0, which means full". 248 00:15:13,667 --> 00:15:20,645 We know that within the next four hours, the disk is gonna be full 249 00:15:20,874 --> 00:15:24,658 so we can tell the operator ahead of time that it's gonna be full 250 00:15:24,833 --> 00:15:26,517 and not just tell them that it's full right now.
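A minimal sketch of that predict_linear alert, again assuming the node_exporter metric name: it does a linear regression over the last hour of free-space samples and extrapolates four hours (14400 seconds) ahead.

```
# Alert if, at the current trend, the filesystem will be full within 4 hours.
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
```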
251 00:15:27,113 --> 00:15:32,303 They have some window of ability to fix it before it fails. 252 00:15:32,674 --> 00:15:35,369 This is really important because if you're running a site 253 00:15:35,689 --> 00:15:41,370 you want to be able to have alerts that tell you that your system is failing 254 00:15:41,573 --> 00:15:42,994 before it actually fails. 255 00:15:43,667 --> 00:15:48,254 Because if it fails, you're out of SLO or SLA and 256 00:15:48,404 --> 00:15:50,322 your users are gonna be unhappy 257 00:15:50,729 --> 00:15:52,493 and you don't want the users to tell you that your site is down 258 00:15:52,682 --> 00:15:54,953 you want to know about it before your users can even tell. 259 00:15:55,193 --> 00:15:58,491 This allows you to do that. 260 00:15:58,693 --> 00:16:02,232 And also of course, Prometheus being a modern system, 261 00:16:02,735 --> 00:16:05,633 we fully support UTF-8 in all of our labels. 262 00:16:08,283 --> 00:16:12,101 Here's another one, a good example from the USE method. 263 00:16:12,490 --> 00:16:16,036 This is a rate of 500 errors coming from an application 264 00:16:16,423 --> 00:16:17,813 and you can simply alert that 265 00:16:17,977 --> 00:16:22,555 there's more than 500 errors per second coming out of the application 266 00:16:22,568 --> 00:16:25,670 if that's your threshold for pain 267 00:16:26,041 --> 00:16:27,298 And you can do other things, 268 00:16:27,501 --> 00:16:29,338 you can convert that from just a rate of errors 269 00:16:29,723 --> 00:16:31,054 to a percentage of errors. 270 00:16:31,304 --> 00:16:32,605 So you could say 271 00:16:33,053 --> 00:16:37,336 "I have an SLA of three nines" and so you can say 272 00:16:37,574 --> 00:16:46,710 "If the rate of errors divided by the rate of requests is .01, 273 00:16:47,265 --> 00:16:49,335 or is more than .01, then that's a problem." 274 00:16:49,725 --> 00:16:54,589 You can include that level of error granularity. 275 00:16:54,797 --> 00:16:57,622 And if you're just doing a blackbox test, 276 00:16:58,185 --> 00:17:03,727 you wouldn't know this; you would only know if you got an error from the system, 277 00:17:04,188 --> 00:17:05,601 then you got another error from the system, 278 00:17:05,826 --> 00:17:06,938 and then you fire an alert. 279 00:17:07,307 --> 00:17:11,847 But if those checks are one minute apart and you're serving 1000 requests per second 280 00:17:13,324 --> 00:17:20,987 you could be serving 10,000 errors before you even get an alert. 281 00:17:21,579 --> 00:17:22,876 And you might miss it, because 282 00:17:23,104 --> 00:17:24,993 what if you only get one random error 283 00:17:25,327 --> 00:17:28,898 and then the next time, you're serving 25% errors, 284 00:17:29,094 --> 00:17:31,571 you only have a 25% chance of that check failing again. 285 00:17:31,800 --> 00:17:36,230 You really need these metrics in order to get 286 00:17:36,430 --> 00:17:38,867 proper reports of the status of your system 287 00:17:43,176 --> 00:17:43,850 There are even more options. 288 00:17:44,051 --> 00:17:45,816 You can slice and dice those labels. 289 00:17:46,225 --> 00:17:50,056 If you have a label on all of your applications called 'service' 290 00:17:50,322 --> 00:17:53,251 you can send that 'service' label through to the message 291 00:17:53,523 --> 00:17:55,857 and you can say "Hey, this service is broken". 292 00:17:56,073 --> 00:18:00,363 You can include that service label in your alert messages. 293 00:18:01,426 --> 00:18:06,723 And that's it, I can go to a demo and Q&A.
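For reference, the two error alerts described above might look roughly like this in PromQL; the metric and label names (http_requests_total, status) are hypothetical, following common instrumentation conventions rather than the actual slide:

```
# Simple threshold: more than 500 errors per second out of the application.
sum(rate(http_requests_total{status=~"5.."}[5m])) > 500

# Error ratio: more than 1% of all requests are failing.
  sum(rate(http_requests_total{status=~"5.."}[5m]))
/
  sum(rate(http_requests_total[5m]))
> 0.01
```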
294 00:18:09,881 --> 00:18:13,687 [Applause] 295 00:18:16,877 --> 00:18:18,417 Any questions so far? 296 00:18:18,811 --> 00:18:20,071 Or anybody want to see a demo? 297 00:18:29,517 --> 00:18:35,065 [Q] Hi. Does Prometheus do metric discovery inside containers 298 00:18:35,364 --> 00:18:37,476 or do I have to implement the metrics myself? 299 00:18:38,184 --> 00:18:45,743 [A] For metrics in containers, there are already things that expose 300 00:18:45,887 --> 00:18:49,214 the metrics of the container system itself. 301 00:18:49,512 --> 00:18:52,174 There's a utility called 'cAdvisor' and 302 00:18:52,395 --> 00:18:57,172 cAdvisor takes the Linux cgroup data and exposes it as metrics 303 00:18:57,416 --> 00:19:01,164 so you can get data about how much CPU time is being 304 00:19:01,164 --> 00:19:02,421 spent in your container, 305 00:19:02,683 --> 00:19:04,139 how much memory is being used by your container. 306 00:19:04,775 --> 00:19:08,411 [Q] But not about the application, just about the container usage? 307 00:19:08,597 --> 00:19:11,355 [A] Right. Because the container has no idea 308 00:19:11,698 --> 00:19:15,451 whether your application is written in Ruby or Go or Python or whatever, 309 00:19:18,698 --> 00:19:21,602 you have to build that into your application in order to get the data. 310 00:19:24,057 --> 00:19:24,307 So for Prometheus, 311 00:19:27,890 --> 00:19:35,031 we've written client libraries that can be included in your application directly 312 00:19:35,195 --> 00:19:36,413 so you can get that data out. 313 00:19:36,602 --> 00:19:41,460 If you go to the Prometheus website, we have a whole series of client libraries 314 00:19:44,936 --> 00:19:48,913 and we cover a pretty good selection of popular software. 315 00:19:56,569 --> 00:19:59,537 [Q] What is the current state of long-term data storage? 316 00:20:00,803 --> 00:20:01,678 [A] Very good question. 317 00:20:02,697 --> 00:20:04,513 There have been several… 318 00:20:04,913 --> 00:20:06,521 There are actually several different methods of doing this. 319 00:20:09,653 --> 00:20:14,667 Prometheus stores all this data locally in its own data storage 320 00:20:14,667 --> 00:20:15,711 on the local disk. 321 00:20:16,609 --> 00:20:19,156 But that's only as durable as that server is durable. 322 00:20:19,423 --> 00:20:21,627 So if you've got a really durable server, 323 00:20:21,812 --> 00:20:23,357 you can store as much data as you want, 324 00:20:23,551 --> 00:20:26,521 you can store years and years of data locally on the Prometheus server. 325 00:20:26,653 --> 00:20:28,088 That's not a problem. 326 00:20:28,781 --> 00:20:32,244 There are a bunch of misconceptions because of our defaults, 327 00:20:32,464 --> 00:20:34,492 and the language on our website said 328 00:20:34,698 --> 00:20:36,160 "It's not long-term storage" 329 00:20:36,707 --> 00:20:41,841 simply because we leave that problem up to the person running the server. 330 00:20:43,389 --> 00:20:46,389 But the time series database that Prometheus includes 331 00:20:46,562 --> 00:20:47,739 is actually quite durable. 332 00:20:49,157 --> 00:20:51,069 But it's only as durable as the server underneath it.
333 00:20:51,642 --> 00:20:55,172 So if you've got a very large cluster and you want really high durability, 334 00:20:55,800 --> 00:20:57,705 you need to have some kind of cluster software, 335 00:20:58,217 --> 00:21:01,106 but because we want Prometheus to be simple to deploy 336 00:21:01,701 --> 00:21:02,911 and very simple to operate 337 00:21:03,355 --> 00:21:06,774 and also very robust, 338 00:21:06,950 --> 00:21:09,370 we didn't want to include any clustering in Prometheus itself, 339 00:21:09,787 --> 00:21:12,078 because anytime you have clustered software, 340 00:21:12,294 --> 00:21:15,100 what happens if your network is a little wonky? 341 00:21:15,586 --> 00:21:19,470 The first thing that goes is that all of your distributed systems fail. 342 00:21:20,328 --> 00:21:23,048 And building distributed systems to be really robust is really hard 343 00:21:23,445 --> 00:21:29,142 so Prometheus is what we call an "uncoordinated distributed system". 344 00:21:29,348 --> 00:21:34,048 If you've got two Prometheus servers monitoring all your targets in an HA mode 345 00:21:34,273 --> 00:21:36,890 in a cluster, and there's a split brain, 346 00:21:37,131 --> 00:21:40,363 each Prometheus can see half of the cluster and 347 00:21:40,768 --> 00:21:43,557 it can see that the other half of the cluster is down. 348 00:21:43,846 --> 00:21:46,740 They can both try to get alerts out to the alert manager 349 00:21:46,945 --> 00:21:50,466 and this is a really really robust way of handling split brains 350 00:21:50,734 --> 00:21:54,069 and bad network failures and bad problems in a cluster. 351 00:21:54,294 --> 00:21:57,163 It's designed to be super super robust 352 00:21:57,342 --> 00:21:59,844 and so the two individual Prometheus servers in your cluster 353 00:22:00,079 --> 00:22:02,009 don't have to talk to each other to do this, 354 00:22:02,193 --> 00:22:03,994 they can just do it independently. 355 00:22:04,377 --> 00:22:07,392 But if you want to be able to correlate data 356 00:22:07,604 --> 00:22:09,255 between many different Prometheus servers 357 00:22:09,439 --> 00:22:12,185 you need an external data storage to do this. 358 00:22:12,777 --> 00:22:15,008 And also you may not have very big servers, 359 00:22:15,164 --> 00:22:17,126 you might be running your Prometheus in a container 360 00:22:17,293 --> 00:22:19,373 and it's only got a little bit of local storage space 361 00:22:19,543 --> 00:22:23,217 so you want to send all that data up to a big cluster datastore 362 00:22:23,439 --> 00:22:25,124 for a bigger use. 363 00:22:25,707 --> 00:22:27,913 We have several different ways of doing this. 364 00:22:28,383 --> 00:22:30,941 There's the classic way which is called federation 365 00:22:31,156 --> 00:22:34,875 where you have one Prometheus server polling in summary data from 366 00:22:35,083 --> 00:22:36,604 each of the individual Prometheus servers 367 00:22:36,823 --> 00:22:40,266 and this is useful if you want to run alerts against data coming 368 00:22:40,363 --> 00:22:41,578 from multiple Prometheus servers. 369 00:22:42,488 --> 00:22:44,240 But federation is not replication. 370 00:22:44,870 --> 00:22:47,488 It can only pull in a little bit of data from each Prometheus server.
371 00:22:47,715 --> 00:22:51,078 If you've got a million metrics on each Prometheus server, 372 00:22:51,683 --> 00:22:55,725 you can't poll in a million metrics and do… 373 00:22:55,725 --> 00:22:58,850 If you've got 10 of those, you can't poll in 10 million metrics 374 00:22:59,011 --> 00:23:00,635 simultaneously into one Prometheus server. 375 00:23:00,919 --> 00:23:01,890 It's just too much data. 376 00:23:02,875 --> 00:23:06,006 There are a couple of other nice options. 377 00:23:06,618 --> 00:23:08,923 There's a piece of software called Cortex. 378 00:23:09,132 --> 00:23:16,033 Cortex is a Prometheus server that stores its data in a database. 379 00:23:16,570 --> 00:23:19,127 Specifically, a distributed database. 380 00:23:19,395 --> 00:23:24,136 Things that are based on the Google Bigtable model, like Cassandra or… 381 00:23:25,892 --> 00:23:27,166 What's the Amazon one? 382 00:23:30,332 --> 00:23:32,667 Yeah. 383 00:23:32,682 --> 00:23:33,700 DynamoDB. 384 00:23:34,193 --> 00:23:37,137 If you have a DynamoDB or a Cassandra cluster, or one of these other 385 00:23:37,350 --> 00:23:39,298 really big distributed storage clusters, 386 00:23:39,713 --> 00:23:44,615 Cortex can run and the Prometheus servers will stream their data up to Cortex 387 00:23:44,907 --> 00:23:49,384 and it will keep a copy of that across all of your Prometheus servers. 388 00:23:49,596 --> 00:23:51,373 And because it's based on things like Cassandra, 389 00:23:51,709 --> 00:23:53,150 it's super scalable. 390 00:23:53,436 --> 00:23:57,862 But it's a little complex to run and 391 00:23:57,536 --> 00:24:00,836 many people don't want to run that complex infrastructure. 392 00:24:01,254 --> 00:24:06,080 We have another new one, we just blogged about it yesterday. 393 00:24:01,564 --> 00:24:06,513 It's a thing called Thanos. 394 00:24:06,513 --> 00:24:10,596 Thanos is Prometheus at scale. 395 00:24:11,143 --> 00:24:12,356 Basically, the way it works… 396 00:24:12,761 --> 00:24:15,063 Actually, why don't I bring that up? 397 00:24:24,122 --> 00:24:30,519 This was developed by a company called Improbable 398 00:24:30,935 --> 00:24:32,632 and they wanted to… 399 00:24:35,489 --> 00:24:40,063 They had billions of metrics coming from hundreds of Prometheus servers. 400 00:24:40,604 --> 00:24:46,645 They developed this in collaboration with the Prometheus team to build 401 00:24:47,000 --> 00:24:48,581 a super highly scalable Prometheus server. 402 00:24:49,877 --> 00:24:55,518 Prometheus itself stores the incoming metrics data in a write-ahead log 403 00:24:56,008 --> 00:24:59,560 and then every two hours, it runs a compaction cycle 404 00:24:59,982 --> 00:25:03,177 and it creates an immutable series block of data which is 405 00:25:03,606 --> 00:25:06,718 all the time series blocks themselves 406 00:25:07,131 --> 00:25:10,319 and then an index into that data. 407 00:25:10,849 --> 00:25:13,678 Those two hour windows are all immutable 408 00:25:14,037 --> 00:25:19,428 so Thanos has a little sidecar binary that watches for those new directories and 409 00:25:19,594 --> 00:25:20,843 uploads them into a blob store. 410 00:25:21,121 --> 00:25:25,819 So you could put them in S3 or MinIO or some other simple object storage.
411 00:25:26,301 --> 00:25:32,916 And then now you have all of your data, all of this index data already 412 00:25:32,916 --> 00:25:34,816 ready to go 413 00:25:34,816 --> 00:25:38,489 and then the final sidecar creates a little mesh cluster that can read from 414 00:25:38,489 --> 00:25:39,616 all of those S3 blocks. 415 00:25:40,123 --> 00:25:48,470 Now, you have this super global view all stored in a big bucket storage and 416 00:25:49,621 --> 00:25:52,404 things like S3 or MinIO are… 417 00:25:52,995 --> 00:25:57,669 Bucket storage is not a database, so it's operationally a little easier to operate. 418 00:25:58,405 --> 00:26:02,183 Plus, now we have all this data in a bucket store and 419 00:26:02,600 --> 00:26:06,081 the Thanos sidecars can talk to each other. 420 00:26:06,526 --> 00:26:08,150 We can now have a single entry point. 421 00:26:08,418 --> 00:26:11,915 You can query Thanos and Thanos will distribute your query 422 00:26:12,131 --> 00:26:13,577 across all your Prometheus servers. 423 00:26:13,792 --> 00:26:16,181 So now you can do global queries across all of your servers. 424 00:26:17,696 --> 00:26:22,246 But it's very new, they just released their first release candidate yesterday. 425 00:26:23,926 --> 00:26:26,875 It is looking to be like the coolest thing ever 426 00:26:27,448 --> 00:26:29,341 for running large scale Prometheus. 427 00:26:30,315 --> 00:26:34,779 Here's an example of how that is laid out. 428 00:26:36,840 --> 00:26:39,469 This will let you have a billion-metric Prometheus cluster. 429 00:26:42,607 --> 00:26:44,261 And it's got a bunch of other cool features. 430 00:26:45,376 --> 00:26:46,672 Any more questions? 431 00:26:55,353 --> 00:26:57,436 Alright, maybe I'll do a quick little demo. 432 00:27:05,407 --> 00:27:10,547 Here is a Prometheus server that is provided by this group 433 00:27:10,736 --> 00:27:14,141 that just does an Ansible deployment for Prometheus. 434 00:27:15,342 --> 00:27:19,597 And you can just simply query for something like 'node_cpu'. 435 00:27:21,077 --> 00:27:23,073 This is actually the old name for that metric. 436 00:27:24,083 --> 00:27:25,659 And you can see, here's exactly 437 00:27:28,078 --> 00:27:31,250 the CPU metrics from some servers. 438 00:27:32,907 --> 00:27:34,634 It's just a bunch of stuff. 439 00:27:35,008 --> 00:27:37,060 There's actually two servers here, 440 00:27:37,445 --> 00:27:40,660 there's an influx cloudalchemy and there is a demo cloudalchemy. 441 00:27:42,011 --> 00:27:43,666 [Q] Can you zoom in? [A] Oh yeah sure. 442 00:27:53,135 --> 00:27:57,617 So you can see all the extra labels. 443 00:28:00,067 --> 00:28:01,644 We can also do some things like… 444 00:28:02,176 --> 00:28:04,247 Let's take a look at, say, the last 30 seconds. 445 00:28:04,614 --> 00:28:07,226 We can just add this little time window. 446 00:28:07,755 --> 00:28:11,033 It's called a range request, and you can see 447 00:28:11,257 --> 00:28:12,398 the individual samples. 448 00:28:12,651 --> 00:28:14,671 You can see that all Prometheus is doing 449 00:28:14,825 --> 00:28:17,899 is storing the sample and a timestamp. 450 00:28:18,472 --> 00:28:23,029 All the timestamps are in milliseconds and it's all epoch 451 00:28:23,238 --> 00:28:25,395 so it's super easy to manipulate.
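In PromQL terms, the two query shapes just shown are an instant vector selector and a range vector selector (using the old metric name from this demo):

```
node_cpu        # instant vector: the latest sample for each series
node_cpu[30s]   # range vector: the raw samples from the last 30 seconds
```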
452 00:28:25,600 --> 00:28:30,169 But, looking at the individual samples and looking at this, you can see that 453 00:28:30,493 --> 00:28:36,333 if we go back and just take… and look at the raw data, and 454 00:28:36,493 --> 00:28:37,859 we graph the raw data… 455 00:28:39,961 --> 00:28:43,026 Oops, that's a syntax error. 456 00:28:44,500 --> 00:28:46,968 And we look at this graph… Come on. 457 00:28:47,221 --> 00:28:48,282 Here we go. 458 00:28:48,481 --> 00:28:50,329 Well, that's kind of boring, it's just a flat line because 459 00:28:50,600 --> 00:28:52,795 it's just a counter going up very slowly. 460 00:28:52,992 --> 00:28:55,999 What we really want to do is take this and apply 461 00:28:57,128 --> 00:28:59,046 a rate function to this counter. 462 00:28:59,569 --> 00:29:03,635 So let's look at the rate over the last one minute. 463 00:29:04,493 --> 00:29:06,772 There we go, now we get a nice little graph. 464 00:29:08,308 --> 00:29:14,056 And so you can see that this is 0.6 CPU seconds per second 465 00:29:15,223 --> 00:29:18,118 for that set of labels. 466 00:29:18,529 --> 00:29:21,034 But this is pretty noisy, there's a lot of lines on this graph and 467 00:29:21,235 --> 00:29:22,621 there's still a lot of data here. 468 00:29:23,137 --> 00:29:25,842 So let's start doing some filtering. 469 00:29:26,194 --> 00:29:29,434 One of the things we see here is, well, there's idle. 470 00:29:29,720 --> 00:29:32,296 We don't really care about the machine being idle, 471 00:29:32,593 --> 00:29:35,492 so let's just add a label filter so we can say 472 00:29:35,673 --> 00:29:42,354 'mode', it's the label name, and it's not equal to 'idle'. Done. 473 00:29:45,089 --> 00:29:47,560 And if I could type… What did I miss? 474 00:29:50,555 --> 00:29:51,126 Here we go. 475 00:29:51,438 --> 00:29:53,911 So now we've removed idle from the graph. 476 00:29:54,164 --> 00:29:55,907 That looks a little more sane. 477 00:29:56,659 --> 00:30:01,094 Oh, wow, look at that, that's a nice big spike in user space on the influx server 478 00:30:01,363 --> 00:30:02,310 Okay… 479 00:30:03,672 --> 00:30:05,252 Well, that's pretty cool. 480 00:30:05,654 --> 00:30:06,479 What about… 481 00:30:06,940 --> 00:30:08,625 This is still quite a lot of lines. 482 00:30:10,637 --> 00:30:14,194 How much CPU is in use in total across all the servers that we have? 483 00:30:09,217 --> 00:30:14,378 We can just sum up that rate. 484 00:30:14,378 --> 00:30:24,457 We can just see that there is a sum total of 0.6 CPU seconds/s 485 00:30:25,000 --> 00:30:27,515 across the servers we have. 486 00:30:27,715 --> 00:30:31,379 But that's a little too coarse. 487 00:30:31,733 --> 00:30:36,698 What if we want to see it by instance? 488 00:30:39,155 --> 00:30:42,156 Now, we can see the two servers, we can see 489 00:30:42,527 --> 00:30:45,395 that we're left with just that label. 490 00:30:45,959 --> 00:30:50,229 The instance labels are the influx instance and the demo instance. 491 00:30:50,229 --> 00:30:53,334 That's a super easy way to see that, 492 00:30:53,854 --> 00:30:56,817 but we can also do this the other way around. 493 00:30:57,060 --> 00:31:03,022 We can say 'without (mode,cpu)' so we can drop those modes and 494 00:31:03,367 --> 00:31:05,243 see all the labels that we have. 495 00:31:05,438 --> 00:31:11,563 We can still see the environment label and the job label on our resulting data. 496 00:31:12,182 --> 00:31:15,640 You can go either way with the summary functions.
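The query progression from the demo, collected in one place; these are standard PromQL forms, with the label names coming from that particular demo setup:

```
# Per-CPU, per-mode usage as a rate, ignoring idle time.
rate(node_cpu{mode!="idle"}[1m])

# Total usage summed across everything.
sum(rate(node_cpu{mode!="idle"}[1m]))

# Aggregate but keep the instance label.
sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))

# Or drop just mode and cpu and keep every other label.
sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))
```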
497 00:31:15,812 --> 00:31:20,210 There's a whole bunch of different functions 498 00:31:20,558 --> 00:31:22,730 and it's all in our documentation. 499 00:31:25,124 --> 00:31:30,113 But what if we want to see it… 500 00:31:30,572 --> 00:31:33,726 What if we want to see which CPUs are in use? 501 00:31:34,154 --> 00:31:36,937 Now we can see that it's only CPU0 502 00:31:37,203 --> 00:31:39,587 because apparently these are only 1-core instances. 503 00:31:42,276 --> 00:31:46,660 You can add/remove labels and do all these queries. 504 00:31:49,966 --> 00:31:51,833 Any other questions so far? 505 00:31:53,965 --> 00:31:59,056 [Q] I don't have a question, but I have something to add. 506 00:31:59,427 --> 00:32:03,063 Prometheus is really nice, but it's a lot better if you combine it 507 00:32:03,389 --> 00:32:04,954 with Grafana. 508 00:32:05,222 --> 00:32:06,330 [A] Yes, yes. 509 00:32:06,537 --> 00:32:12,332 In the beginning, when we were creating Prometheus, we actually built 510 00:32:12,851 --> 00:32:14,698 a piece of dashboard software called PromDash. 511 00:32:16,029 --> 00:32:20,566 It was a simple little Ruby on Rails app to create dashboards 512 00:32:20,733 --> 00:32:22,744 and it had a bunch of JavaScript. 513 00:32:22,936 --> 00:32:24,195 And then Grafana came out. 514 00:32:25,157 --> 00:32:25,880 And we're like 515 00:32:25,997 --> 00:32:29,590 "Oh, that's interesting. It doesn't support Prometheus" so we were like 516 00:32:29,826 --> 00:32:31,806 "Hey, can you support Prometheus" 517 00:32:32,217 --> 00:32:34,375 and they're like "Yeah, we've got a REST API, get the data, done" 518 00:32:36,035 --> 00:32:37,867 Now Grafana supports Prometheus and we're like 519 00:32:39,761 --> 00:32:41,991 "Well, PromDash, this is crap, delete". 520 00:32:44,390 --> 00:32:46,171 The Prometheus development team, 521 00:32:46,395 --> 00:32:49,485 we're all backend developers and SREs and 522 00:32:49,731 --> 00:32:51,463 we have no JavaScript skills at all. 523 00:32:52,589 --> 00:32:54,879 So we're like "Let somebody deal with that". 524 00:32:55,393 --> 00:32:57,647 One of the nice things about working on this kind of project is 525 00:32:57,862 --> 00:33:01,648 we can do things that we're good at, and we don't try… 526 00:33:02,398 --> 00:33:05,317 We don't have any marketing people, it's just an open source project, 527 00:33:06,320 --> 00:33:09,111 there's no single company behind Prometheus. 528 00:33:09,914 --> 00:33:14,452 I work for GitLab, Improbable paid for the Thanos system, 529 00:33:15,594 --> 00:33:25,286 other companies like Red Hat now pay people who used to work on CoreOS to 530 00:33:25,471 --> 00:33:26,517 work on Prometheus. 531 00:33:27,211 --> 00:33:30,283 There's lots and lots of collaboration between many companies 532 00:33:30,467 --> 00:33:32,609 to build the Prometheus ecosystem. 533 00:33:35,864 --> 00:33:37,455 But yeah, Grafana is great. 534 00:33:38,835 --> 00:33:44,983 Actually, Grafana now has two full-time Prometheus developers. 535 00:33:49,185 --> 00:33:51,031 Alright, that's it. 536 00:33:52,637 --> 00:33:57,044 [Applause]