So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that?

The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so, my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open-source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.
So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, and it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service: you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, this requires whitebox instrumentation. So, basically, every event that happens inside your software, inside a serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.

There are different examples of, like, the kind of software that does blackbox and whitebox monitoring.
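The blackbox probe described above can be sketched in a few lines. This is only an illustration, not any particular monitoring tool; the URL and timeout are hypothetical, and all the probe learns is whether this one request made it through the stack and back:

```python
import urllib.request

def blackbox_probe(url, timeout=5.0):
    """Probe a service from the outside, with no knowledge of its
    internals: one HTTP request, end to end, through the whole stack."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # All we can observe from outside is the response status.
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure: the probe failed.
        return False
```

A scheduler (cron, Nagios, Pingdom) would run such a probe periodically and alert when it returns False.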
So you have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open-source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to this push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages for polling over pushing. With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running.
If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's like the disappearing: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's really easy testing. With push-based metrics, you have to figure out, if you want to test a new version of the monitoring system, or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. Or it doesn't even have to be monitoring: you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is.
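Polling a metrics endpoint for testing really is this simple. A minimal sketch of a scraper, assuming a Prometheus-style plain-text endpoint (the endpoint URL and metric names are hypothetical); it does the same thing as `curl http://host:port/metrics`, just parsed into values:

```python
import urllib.request

def scrape(url, timeout=5.0):
    """Fetch a Prometheus-style /metrics endpoint and parse the
    well-formed sample lines into a dict of {series: value}."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        # Each sample line is "<series> <value>"; split on the last space.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

Because the scrape is just an HTTP GET, a second test instance of the monitoring system (or a one-off script like this) can read the exact same data as production without any coordination.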
It doesn't have to know about ???. It just has to sit and collect the data about itself, so it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's the Prometheus server, and it deals with all the data collection and analytics. Basically, it's this one binary; it's all written in golang. It's a single binary. It knows how to read from your inventory, and there are a bunch of different methods, whether you've got a Kubernetes cluster, or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data.
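To make the client-side simplicity concrete, here is a toy sketch of what "just collect data about yourself" looks like. This is a hypothetical stand-in, not the real client library; a real application would use an official Prometheus client (such as prometheus_client for Python), and the metric name `myapp_requests_total` is invented for illustration:

```python
import http.server
import threading

class Counter:
    """Toy counter that renders itself in the Prometheus text format.
    A real client library also handles labels, metric types, and more."""
    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    def expose(self):
        # Text exposition format: # HELP, # TYPE, then the sample line.
        return (f"# HELP {self.name} {self.help_text}\n"
                f"# TYPE {self.name} counter\n"
                f"{self.name} {self._value}\n")

requests_total = Counter("myapp_requests_total", "Requests handled.")

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """The application only exposes data about itself; it needs no
    knowledge of where (or whether) a monitoring server exists."""
    def do_GET(self):
        body = requests_total.expose().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet
```

Serving `MetricsHandler` on some port is all the application does; whatever polls it, and how often, is entirely the monitoring system's business.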
It has storage and a time series database to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL, you can add frontends: whether it's a simple API client to run reports, or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and it will generate new data from that query, or you can take a query and, if it returns results, it will fire an alert. That alert is a push message to the alert manager. This allows us to separate the generating of alerts from the routing of alerts. You can have one or hundreds of Prometheus servers, all generating alerts, and it goes into an alert manager cluster, which does the deduplication and the routing to the human.

Because, of course, the thing that we want is: we had dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph.
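The two rule modes described above live in a Prometheus rules file. A minimal sketch, assuming hypothetical metric names (`http_requests_total`, `http_errors_total`); the YAML rule-file structure itself is Prometheus's own:

```
groups:
  - name: example
    rules:
      # Recording rule: continuously evaluate a query and store the
      # result as a new time series under its own name.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: if the query returns results for 10 minutes,
      # fire an alert, which Prometheus pushes to the Alertmanager
      # for deduplication and routing.
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 10m
```

Because alerts are generated here but routed by the Alertmanager, many Prometheus servers can share one Alertmanager cluster.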
With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. This is a very simple thing. If you know much about the Linux kernel, the Linux kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. So this is a self-describing metric: you can just read the metric's name and you understand a little bit about what's going on here.
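For reference, this is roughly what that metric looks like in the text exposition format as exposed by the node_exporter; the sample values here are illustrative, not real measurements:

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 18214.19
node_cpu_seconds_total{cpu="0",mode="system"} 143.21
node_cpu_seconds_total{cpu="0",mode="user"} 512.33
```

The name says what is measured (CPU time), the unit (seconds), and that it only ever goes up (a total, i.e. a counter), and the labels break it out per CPU and per mode.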