So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know it's right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.

So, blackbox monitoring is a probe. It just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service: you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, you need whitebox instrumentation. Basically, every event that happens inside your software, inside your serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database.
You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.

There are different examples of the kind of software that does blackbox and whitebox monitoring. You have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph: we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications. There are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages to polling over pushing.

With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the same disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's also really easy testing. With push-based metrics, if you want to test a new version of the monitoring system or you want to test something new, you have to figure out how to ??? a copy of the data.
With polling, you can just set up another instance of your monitoring and test it. Or it doesn't even have to be monitoring: you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about the monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core there's the Prometheus server, and it deals with all the data collection and analytics. It's basically this one binary, all written in Go. It's a single binary. It knows how to read from your inventory, and there are a bunch of different methods for that, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has a storage engine, a time series database, to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of that PromQL you can add frontends, whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.
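To make the inventory idea concrete, here is a minimal sketch of a Prometheus scrape configuration, assuming a file-based inventory that something like Ansible renders out; the job names, targets, and file path are invented for illustration.

```yaml
# prometheus.yml -- minimal sketch; job names, targets and paths are placeholders
global:
  scrape_interval: 15s        # how often Prometheus polls each target

scrape_configs:
  # A statically listed set of node exporters
  - job_name: node
    static_configs:
      - targets: ['host1.example.com:9100', 'host2.example.com:9100']

  # File-based discovery: Ansible (or anything else) drops target lists
  # into this directory and Prometheus picks them up automatically
  - job_name: app
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
```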
And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can either record: you take a query and it will generate new data from that query. Or you can take a query, and if it returns results, it will fire an alert. That alert is a push message to the alertmanager. This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an alertmanager cluster, which does the deduplication and the routing to the human.

Because, of course, what we want is this: we used to have dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. It's a very simple thing. If you know much about the Linux kernel: the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by giving the metric a name, which is 'node_cpu_seconds_total'. So this is a self-describing metric: you can just read the metric name and you understand a little bit about what's going on here. The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. Actually, the kernel tracks how much idle time it has. It also tracks it by the number of CPUs. Other monitoring systems used to do this with a tree structure, and that caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics.
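For reference, the text exposition format for that metric looks roughly like this when you scrape a node exporter; the numbers are placeholders.

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"}   81625.32
node_cpu_seconds_total{cpu="0",mode="user"}    1432.71
node_cpu_seconds_total{cpu="0",mode="system"}   512.04
node_cpu_seconds_total{cpu="0",mode="iowait"}    23.96
node_cpu_seconds_total{cpu="1",mode="idle"}   81201.11
node_cpu_seconds_total{cpu="1",mode="user"}    1519.58
```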
Here's a nice example of taking those CPU seconds counters and converting them into a graph by using PromQL.

Now we can get into metrics-based alerting. Now that we have this graph, we have this thing we can look at and see, "Oh, there's a little spike here, we might want to know about that." Now we can get into metrics-based alerting.

I used to be a site reliability engineer, I'm still a site reliability engineer at heart, and we have this concept of the things that you need to run a site or a service reliably. The most important thing you need is down at the bottom: monitoring. Because if you don't have monitoring of your service, how do you know it's even working?

There are a couple of techniques here, and we want to alert based on data and not just those end-to-end tests. There's a thing called the RED method and there's a thing called the USE method, and there are some nice blog posts about these. Basically, the RED method, for example, talks about the requests that your system is handling. There are three things: there's the number of requests, there's the number of errors, and there's the duration, how long each request takes. With the combination of these three things, you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" Most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.

But we can go back to some more traditional, sysadmin-style alerts. This one is basically taking the filesystem available space divided by the filesystem size; that becomes the ratio of filesystem availability, from 0 to 1. Multiply it by 100 and we now have a percentage, and if it's less than or equal to 1% for 15 minutes, that is, less than 1% free space, we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple.
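As a sketch, that filesystem check could be written as a Prometheus rule file along these lines, assuming node-exporter-style metric names (node_filesystem_avail_bytes and node_filesystem_size_bytes); the rule names and threshold are illustrative.

```yaml
groups:
  - name: filesystem.rules
    rules:
      # Recording rule: continuously compute the free-space percentage
      - record: instance:node_filesystem_avail:percent
        expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes

      # Alerting rule: fire if a filesystem has been at or below 1% free for 15 minutes
      - alert: FilesystemAlmostFull
        expr: instance:node_filesystem_avail:percent <= 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} has less than 1% free space"
```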
We can also tag, we can include... every alert includes all the extra labels that Prometheus adds to your metrics. When you add a metric in Prometheus, if we go back and look at this metric, the metric only contains information about the internals of the application. Anything about, like, what server it's on, whether it's running in a container, what cluster it comes from, what ??? it's on, that's all extra annotation that is added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information. That location information also comes through as labels in the alert. So if you have a message coming into your alertmanager, the alertmanager can look at it and go, "Oh, that's coming from this datacenter," and it can include that in the email or IRC message or SMS message. So you can include: "Filesystem is out of space on this host in this datacenter." All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course this is how you define getting the message from the monitoring to the human. You can even include nice things like: if you've got documentation, you can include a link to the documentation as an annotation, and the alertmanager can take that base URL and, you know, massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things. Since we're not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says, "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than 0, which means full." We know that within the next four hours the disk is gonna be full, so we can tell the operator ahead of time that it's gonna be full, and not just tell them that it's full right now.
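A sketch of that predictive alert, again assuming node-exporter metric names; the runbook URL is a made-up placeholder for the documentation-link annotation mentioned above.

```yaml
groups:
  - name: filesystem-predict.rules
    rules:
      - alert: FilesystemWillFillSoon
        # Linear regression over the last hour, projected 14400 s (4 hours) ahead:
        # if the predicted free space drops below zero, the disk will be full.
        expr: predict_linear(node_filesystem_avail_bytes[1h], 14400) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 4 hours"
          runbook: "https://wiki.example.com/runbooks/filesystem-full"  # placeholder URL
```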
They have some window of ability to fix it before it fails. This is really important, because if you're running a site, you want alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA, your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that.

And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, a good example from the USE method. This is the rate of 500 errors coming from an application, and you can simply alert if there are more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors into a percentage of errors. So you could say, "I have an SLA of three nines," and then say, "If the rate of errors divided by the rate of requests is more than .01, that's a problem." You can include that level of error granularity.

If you're just doing a blackbox test, you wouldn't know this. You would only get: you got an error from the system, then you got another error from the system, then you fire an alert. But if those checks are one minute apart and you're serving 1000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it, because what if you only get one random error, and then the next time you're serving 25% errors, you only have a 25% chance of that check failing again. You really need these metrics in order to get proper reports of the status of your system.

There are even more options: you can slice and dice those labels. If you have a label on all of your applications called 'service', you can send that 'service' label through to the message and you can say, "Hey, this service is broken." You can include that service label in your alert messages.

And that's it. I can go to a demo and Q&A.
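As a sketch, an error-ratio alert along those lines might look like this; the metric name http_requests_total and its code label are assumptions about how the application is instrumented.

```yaml
groups:
  - name: errors.rules
    rules:
      - alert: HighErrorRatio
        # Fraction of requests answered with HTTP 500 over the last 5 minutes,
        # compared against a 1% error threshold.
        expr: >
          sum(rate(http_requests_total{code="500"}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of requests are returning errors"
```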
[Applause]

Any questions so far? Or does anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called cAdvisor, and cAdvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally, in its own data storage on the local disk. But that's only as durable as that server is durable. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and because the language on our website said "it's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's just only as durable as the server underneath it.
So if you've got a very large cluster and you want really high durability, you need some kind of cluster software. But we want Prometheus to be simple to deploy, very simple to operate, and also very robust, so we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that happens is all of your distributed systems fail. And building distributed systems to be really robust is really hard, so Prometheus is what we call an "uncoordinated distributed system". If you've got two Prometheus servers monitoring all your targets in an HA mode in a cluster, and there's a split brain, each Prometheus can see half of the cluster, and it can see that the other half of the cluster is down. They can both try to get alerts out to the alertmanager, and this is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super, super robust, and the two individual Prometheus servers in your cluster don't have to talk to each other to do this; they can just do it independently.

But if you want to be able to correlate data between many different Prometheus servers, you need an external data store to do this. And also, you may not have very big servers; you might be running your Prometheus in a container with only a little bit of local storage space, so you want to send all that data up to a big cluster datastore for ???.

We have several different ways of doing this. There's the classic way, which is called federation, where you have one Prometheus server pulling in summary data from each of the individual Prometheus servers. This is useful if you want to run alerts against data coming from multiple Prometheus servers. But federation is not replication; it can only pull in a little bit of data from each Prometheus server.
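Configuration-wise, a federation scrape is just an ordinary scrape job pointed at the /federate endpoint, with match[] selectors choosing which summary series to pull; the job name, selector, and targets here are only an example.

```yaml
scrape_configs:
  - job_name: federate
    metrics_path: /federate
    honor_labels: true              # keep the labels as set by the source servers
    params:
      'match[]':
        - '{__name__=~"job:.*"}'    # e.g. pull only pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-dc1.example.com:9090'
          - 'prometheus-dc2.example.com:9090'
```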
If you've got a million metrics on each Prometheus server, you can't pull in a million metrics and do... if you've got ten of those, you can't pull ten million metrics simultaneously into one Prometheus server. It's just too much data.

There are a couple of other nice options. There's a piece of software called Cortex. Cortex is a Prometheus server that stores its data in a database, specifically a distributed database, things that are based on the Google Bigtable model, like Cassandra or... what's the Amazon one? Yeah, DynamoDB. If you have a DynamoDB or a Cassandra cluster, or one of these other really big distributed storage clusters, Cortex can run there, the Prometheus servers will stream their data up to Cortex, and it will keep a copy of the data from all of your Prometheus servers. And because it's based on things like Cassandra, it's super scalable. But it's a little complex to run, and many people don't want to run that complex infrastructure.

We have another new one; we just blogged about it yesterday. It's a thing called Thanos. Thanos is Prometheus at scale. Basically, the way it works... actually, why don't I bring that up? This was developed by a company called Improbable, and they wanted to... they had billions of metrics coming from hundreds of Prometheus servers. They developed this in collaboration with the Prometheus team to build a super highly scalable Prometheus setup. Prometheus itself stores the incoming metrics data in a write-ahead log, and then every two hours it runs a compaction cycle and creates an immutable block of data, which is all the time series data itself plus an index into that data. Those two-hour windows are all immutable, so Thanos has a little sidecar binary that watches for those new directories and uploads them into a blob store. So you could put them in S3 or Minio or some other simple object storage.
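The sidecar just needs to be told where that object storage lives. A rough sketch of the kind of bucket configuration it consumes, using S3-style settings; the bucket name, endpoint, and credentials are placeholders, and the exact keys should be checked against the Thanos documentation.

```yaml
# Object storage configuration handed to the Thanos sidecar (placeholders throughout)
type: S3
config:
  bucket: "thanos-blocks"
  endpoint: "s3.example.com"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
```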
And now you have all of your data, all of this indexed data, ready to go, and then the sidecar creates a little mesh cluster that can read from all of those S3 blocks. Now you have this super global view, all stored in big bucket storage, and things like S3 or Minio... bucket storage is not a database, so operationally it's a little easier to operate. Plus, now that we have all this data in a bucket store and the Thanos sidecars can talk to each other, we can have a single entry point. You can query Thanos, and Thanos will distribute your query across all your Prometheus servers. So now you can do global queries across all of your servers. But it's very new; they just released their first release candidate yesterday. It is looking to be the coolest thing ever for running large-scale Prometheus. Here's an example of how that is laid out. This will ??? let you have a billion-metric Prometheus cluster. And it's got a bunch of other cool features.

Any more questions?

Alright, maybe I'll do a quick little demo. Here is a Prometheus server that is provided by ???, which just does an Ansible deployment of Prometheus. You can simply query for something like 'node_cpu'; this is actually the old name for that metric. And you can see, here are exactly the CPU metrics from some servers. It's just a bunch of stuff. There are actually two servers here: there's an "influx" cloud alchemy instance and there's a "demo" cloud alchemy instance.

[Q] Can you zoom in?

[A] Oh yeah, sure. So you can see all the extra labels. We can also do some things like... let's take a look at, say, the last 30 seconds. We can just add this little time window; it's called a range request, and you can see the individual samples. You can see that all Prometheus is doing is storing the sample and a timestamp. All the timestamps are in milliseconds and it's all epoch time, so it's super easy to manipulate.
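In PromQL terms, those two demo queries are just an instant query and a range query over the same series (node_cpu being the older metric name this demo server still exposes).

```promql
# Instant query: the current value of every node_cpu series, with all its labels
node_cpu

# Range query: the raw samples from the last 30 seconds,
# each one a value plus a millisecond epoch timestamp
node_cpu[30s]
```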
But looking at the individual samples and looking at this, you can see that if we go back and just take the raw data, and we graph the raw data... oops, that's a syntax error... and we look at this graph... come on... here we go. Well, that's kind of boring: it's just a flat line, because it's just a counter going up very slowly. What we really want to do is take this counter and apply a rate function to it. So let's look at the rate over the last one minute. There we go, now we get a nice little graph. And you can see that this is 0.6 CPU seconds per second for that set of labels.

But this is pretty noisy, there are a lot of lines on this graph and there's still a lot of data here, so let's start doing some filtering. One of the things we see here is, well, there's idle. We don't really care about the machine being idle, so let's just add a label filter: 'mode', that's the label name, and it's not equal to 'idle'. Done. And if I could type... what did I miss? Here we go. So now we've removed idle from the graph. That looks a little more sane. Oh wow, look at that, that's a nice big spike in user space on the influx server. Okay... well, that's pretty cool.

What about... this is still quite a lot of lines. How much CPU is in use, in total, across all the servers that we have? We can just sum up that rate. We can see that there is a sum total of 0.6 CPU seconds per second across the servers we have. But that's a little too coarse. What if we want to see it by instance? Now we can see the two servers; we're left with just that label. The instance labels are the influx instance and the demo instance. That's a super easy way to see it, but we can also do this the other way around. We can say 'without (mode, cpu)', so we drop those labels and see all the labels that we have left. We can still see the environment label and the job label on the resulting data. You can go either way with the summary functions.
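Written out, the sequence of demo queries is roughly the following.

```promql
# Per-CPU, per-mode usage in CPU seconds per second, averaged over the last minute
rate(node_cpu[1m])

# The same, with the idle mode filtered out
rate(node_cpu{mode!="idle"}[1m])

# Total non-idle CPU usage summed across everything
sum(rate(node_cpu{mode!="idle"}[1m]))

# Broken down per server
sum by (instance) (rate(node_cpu{mode!="idle"}[1m]))

# Or the other way around: aggregate away mode and cpu, keep every other label
sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))
```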
There's a whole bunch of different functions, and it's all in our documentation. But what if we want to see... what if we want to see which CPUs are in use? Now we can see that it's only CPU 0, because apparently these are only 1-core instances. You can add and remove labels and do all these queries.

Any other questions so far?

[Q] I don't have a question, but I have something to add. Prometheus is really nice, but it's a lot better if you combine it with Grafana.

[A] Yes, yes. In the beginning, when we were creating Prometheus, we actually built a piece of dashboard software called Promdash. It was a simple little Ruby on Rails app to create dashboards, and it had a bunch of JavaScript. And then Grafana came out, and we were like, "Oh, that's interesting. It doesn't support Prometheus," so we were like, "Hey, can you support Prometheus?" and they were like, "Yeah, we've got a REST API, get the data, done." Now Grafana supports Prometheus, and we were like, "Well, Promdash, this is crap, delete." The Prometheus development team, we're all backend developers and SREs, and we have no JavaScript skills at all. So we're like, "Let somebody else deal with that." One of the nice things about working on this kind of project is that we can do the things that we're good at, and we don't try... we don't have any marketing people; it's just an open source project, and there's no single company behind Prometheus. I work for GitLab, Improbable paid for the Thanos system, and other companies like Red Hat now pay people who used to work at CoreOS to work on Prometheus. There's lots and lots of collaboration between many companies to build the Prometheus ecosystem. But yeah, Grafana is great. Actually, Grafana now has two full-time Prometheus developers.

Alright, that's it.

[Applause]