1
99:59:59,999 --> 99:59:59,999
So, we had a talk by a non-GitLab person about GitLab.

2
99:59:59,999 --> 99:59:59,999
Now, we have a talk by a GitLab person on non-GitLab.

3
99:59:59,999 --> 99:59:59,999
Something like that?

4
99:59:59,999 --> 99:59:59,999
The CCCHH hackerspace is now open,

5
99:59:59,999 --> 99:59:59,999
from now on if you want to go there, that's the announcement.

6
99:59:59,999 --> 99:59:59,999
And the next talk will be by Ben Kochie

7
99:59:59,999 --> 99:59:59,999
on metrics-based monitoring with Prometheus.

8
99:59:59,999 --> 99:59:59,999
Welcome.

9
99:59:59,999 --> 99:59:59,999
[Applause]

10
99:59:59,999 --> 99:59:59,999
Alright, so

11
99:59:59,999 --> 99:59:59,999
my name is Ben Kochie

12
99:59:59,999 --> 99:59:59,999
I work on DevOps features for GitLab

13
99:59:59,999 --> 99:59:59,999
and apart from working for GitLab, I also work on the open source Prometheus project.

14
99:59:59,999 --> 99:59:59,999
I live in Berlin and I've been using Debian since ???

15
99:59:59,999 --> 99:59:59,999
yes, quite a long time.

16
99:59:59,999 --> 99:59:59,999
So, what is Metrics-based Monitoring?

17
99:59:59,999 --> 99:59:59,999
If you're running software in production,

18
99:59:59,999 --> 99:59:59,999
you probably want to monitor it,

19
99:59:59,999 --> 99:59:59,999
because if you don't monitor it, you don't know it's right.

20
99:59:59,999 --> 99:59:59,999
??? break down into two categories:

21
99:59:59,999 --> 99:59:59,999
there's blackbox monitoring and there's whitebox monitoring.

22
99:59:59,999 --> 99:59:59,999
Blackbox monitoring is treating your software like a blackbox.

23
99:59:59,999 --> 99:59:59,999
It's just checks to see, like,

24
99:59:59,999 --> 99:59:59,999
is it responding, or does it ping

25
99:59:59,999 --> 99:59:59,999
or ??? HTTP requests

26
99:59:59,999 --> 99:59:59,999
[mic turned on]

27
99:59:59,999 --> 99:59:59,999
Ah, there we go, much better.

28
99:59:59,999 --> 99:59:59,999
So, blackbox monitoring is a probe,

29
99:59:59,999 --> 99:59:59,999
it just kind of looks from the outside to your software

30
99:59:59,999 --> 99:59:59,999
and it has no knowledge of the internals

31
99:59:59,999 --> 99:59:59,999
and it's really good for end to end testing.

32
99:59:59,999 --> 99:59:59,999
So if you've got a fairly complicated service,

33
99:59:59,999 --> 99:59:59,999
you come in from the outside, you go through the load balancer,

34
99:59:59,999 --> 99:59:59,999
you hit the API server,

35
99:59:59,999 --> 99:59:59,999
the API server might hit a database,

36
99:59:59,999 --> 99:59:59,999
and you go all the way through to the back of the stack

37
99:59:59,999 --> 99:59:59,999
and all the way back out

38
99:59:59,999 --> 99:59:59,999
so you know that everything is working end to end.

39
99:59:59,999 --> 99:59:59,999
But you only know about it for that one request.

40
99:59:59,999 --> 99:59:59,999
So in order to find out if your service is working,

41
99:59:59,999 --> 99:59:59,999
from the end to end, for every single request,

42
99:59:59,999 --> 99:59:59,999
this requires whitebox instrumentation.

43
99:59:59,999 --> 99:59:59,999
So, basically, every event that happens inside your software,

44
99:59:59,999 --> 99:59:59,999
inside a serving stack,

45
99:59:59,999 --> 99:59:59,999
gets collected and gets counted,

46
99:59:59,999 --> 99:59:59,999
so you know that every request hits the load balancer,

47
99:59:59,999 --> 99:59:59,999
every request hits your application service,

48
99:59:59,999 --> 99:59:59,999
every request hits the database.

49
99:59:59,999 --> 99:59:59,999
You know that everything matches up

50
99:59:59,999 --> 99:59:59,999
and this is called whitebox, or metrics-based monitoring.

51
99:59:59,999 --> 99:59:59,999
There are different examples of, like,

52
99:59:59,999 --> 99:59:59,999
the kind of software that does blackbox and whitebox monitoring.

53
99:59:59,999 --> 99:59:59,999
So you have software like Nagios that you can configure checks

54
99:59:59,999 --> 99:59:59,999
or Pingdom,

55
99:59:59,999 --> 99:59:59,999
Pingdom will do a ping of your website.

56
99:59:59,999 --> 99:59:59,999
And then there is metrics-based monitoring,

57
99:59:59,999 --> 99:59:59,999
things like Prometheus, things like the TICK stack from InfluxData,

58
99:59:59,999 --> 99:59:59,999
New Relic and other commercial solutions

59
99:59:59,999 --> 99:59:59,999
but of course I like to talk about the open source solutions.

60
99:59:59,999 --> 99:59:59,999
We're gonna talk a little bit about Prometheus.

61
99:59:59,999 --> 99:59:59,999
Prometheus came out of the idea that

62
99:59:59,999 --> 99:59:59,999
we needed a monitoring system that could collect all this whitebox metric data

63
99:59:59,999 --> 99:59:59,999
and do something useful with it.

64
99:59:59,999 --> 99:59:59,999
Not just give us a pretty graph, but we also want to be able to

65
99:59:59,999 --> 99:59:59,999
alert on it.

66
99:59:59,999 --> 99:59:59,999
So we needed both

67
99:59:59,999 --> 99:59:59,999
a data gathering and an analytics system in the same instance.

68
99:59:59,999 --> 99:59:59,999
To do this, we built this thing and we looked at the way that

69
99:59:59,999 --> 99:59:59,999
data was being generated by the applications

70
99:59:59,999 --> 99:59:59,999
and there are advantages and disadvantages to this

71
99:59:59,999 --> 99:59:59,999
push vs. poll model for metrics.

72
99:59:59,999 --> 99:59:59,999
We decided to go with the polling model

73
99:59:59,999 --> 99:59:59,999
because there are some slight advantages for polling over pushing.

74
99:59:59,999 --> 99:59:59,999
With polling, you get this free blackbox check

75
99:59:59,999 --> 99:59:59,999
that the application is running.

76
99:59:59,999 --> 99:59:59,999
When you poll your application, you know that the process is running.

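[Editor's sketch] The "free blackbox check" that polling gives you can be illustrated roughly as follows. This is a minimal sketch using only the Python standard library, not Prometheus itself: the metric name `demo_requests_total`, the handler class, and the `scrape` and `demo` helpers are all made up for illustration. The application only exposes its own counter as plain text; the poller records `up = 1` when the poll succeeds and `up = 0` when it does not.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state: a counter the app would increment itself.
REQUESTS_TOTAL = 0

# Opener that ignores any proxy settings, so local polls go straight to the app.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the application's own counter as plain text on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {REQUESTS_TOTAL}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def scrape(url, timeout=2.0):
    """Poll one target. Returns (up, text); up is 1 if the poll succeeded, else 0."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return 1, resp.read().decode()
    except OSError:  # connection refused, timeout, HTTP error, ...
        return 0, ""


def demo():
    """Poll the app once while it runs and once after it has stopped."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{port}/metrics"

    running = scrape(url)
    server.shutdown()
    server.server_close()
    stopped = scrape(url)
    return running, stopped
```

Calling `demo()` polls the same address twice. The second poll fails because the process is no longer listening, so the poller can tell "not running" apart from "running but idle" without the application reporting anything itself.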
77
99:59:59,999 --> 99:59:59,999
If you are doing push-based, you can't tell the difference between

78
99:59:59,999 --> 99:59:59,999
your application doing no work and your application not running.

79
99:59:59,999 --> 99:59:59,999
So you don't know if it's stuck,

80
99:59:59,999 --> 99:59:59,999
or is it just not having to do any work.

81
99:59:59,999 --> 99:59:59,999
With polling, the polling system knows the state of your network.

82
99:59:59,999 --> 99:59:59,999
If you have a defined set of services,

83
99:59:59,999 --> 99:59:59,999
that inventory drives what should be there.

84
99:59:59,999 --> 99:59:59,999
Again, it's like, the disappearing,

85
99:59:59,999 --> 99:59:59,999
is the process dead, or is it just not doing anything?

86
99:59:59,999 --> 99:59:59,999
With polling, you know for a fact what processes should be there,

87
99:59:59,999 --> 99:59:59,999
and it's a bit of an advantage there.

88
99:59:59,999 --> 99:59:59,999
With polling, there's really easy testing.

89
99:59:59,999 --> 99:59:59,999
With push-based metrics, you have to figure out

90
99:59:59,999 --> 99:59:59,999
if you want to test a new version of the monitoring system or

91
99:59:59,999 --> 99:59:59,999
you want to test something new,

92
99:59:59,999 --> 99:59:59,999
you have to ??? a copy of the data.

93
99:59:59,999 --> 99:59:59,999
With polling, you can just set up another instance of your monitoring

94
99:59:59,999 --> 99:59:59,999
and just test it.

95
99:59:59,999 --> 99:59:59,999
Or you don't even have,

96
99:59:59,999 --> 99:59:59,999
it doesn't even have to be monitoring, you can just use curl

97
99:59:59,999 --> 99:59:59,999
to poll the metrics endpoint.

98
99:59:59,999 --> 99:59:59,999
It's significantly easier to test.

99
99:59:59,999 --> 99:59:59,999
The other thing with the…

100
99:59:59,999 --> 99:59:59,999
The other nice thing is that the client is really simple.

101
99:59:59,999 --> 99:59:59,999
The client doesn't have to know where the monitoring system is.

102
99:59:59,999 --> 99:59:59,999
It doesn't have to know about ???

103
99:59:59,999 --> 99:59:59,999
It just has to sit and collect the data about itself.

104
99:59:59,999 --> 99:59:59,999
So it doesn't have to know anything about the topology of the network.

105
99:59:59,999 --> 99:59:59,999
As an application developer, if you're writing a DNS server or

106
99:59:59,999 --> 99:59:59,999
some other piece of software,

107
99:59:59,999 --> 99:59:59,999
you don't have to know anything about monitoring software,

108
99:59:59,999 --> 99:59:59,999
you can just implement it inside your application and

109
99:59:59,999 --> 99:59:59,999
the monitoring software, whether it's Prometheus or something else,

110
99:59:59,999 --> 99:59:59,999
can just come and collect that data for you.

111
99:59:59,999 --> 99:59:59,999
That's kind of similar to a very old monitoring system called SNMP,

112
99:59:59,999 --> 99:59:59,999
but SNMP has a significantly less friendly data model for developers.

113
99:59:59,999 --> 99:59:59,999
This is the basic layout of a Prometheus server.

114
99:59:59,999 --> 99:59:59,999
At the core, there's a Prometheus server

115
99:59:59,999 --> 99:59:59,999
and it deals with all the data collection and analytics.

116
99:59:59,999 --> 99:59:59,999
Basically, this one binary, it's all written in golang.

117
99:59:59,999 --> 99:59:59,999
It's a single binary.

118
99:59:59,999 --> 99:59:59,999
It knows how to read from your inventory,

119
99:59:59,999 --> 99:59:59,999
there's a bunch of different methods, whether you've got

120
99:59:59,999 --> 99:59:59,999
a Kubernetes cluster or a cloud platform

121
99:59:59,999 --> 99:59:59,999
or you have your own customized thing with Ansible.

122
99:59:59,999 --> 99:59:59,999
Ansible can take your layout, drop that into a config file and

123
99:59:59,999 --> 99:59:59,999
Prometheus can pick that up.

124
99:59:59,999 --> 99:59:59,999
Once it has the layout, it goes out and collects all the data.

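[Editor's sketch] One way a tool like Ansible could "drop the layout into a config file" is Prometheus's file-based service discovery (`file_sd_configs`), which reads JSON or YAML files listing target groups. The snippet below is only an assumed illustration of rendering such a file: the host names, the `group` label, and port 9100 are made up, and the talk does not say which discovery mechanism the speaker had in mind.

```python
import json


def render_targets(inventory, port=9100):
    """Turn {group: [hosts]} into a file_sd-style list of target groups.

    The port and the "group" label name are illustrative assumptions.
    """
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"group": group},
        }
        for group, hosts in sorted(inventory.items())
    ]


# Hypothetical inventory, standing in for whatever the config tool knows.
inventory = {
    "web": ["web1.example.org", "web2.example.org"],
    "db": ["db1.example.org"],
}

# The text a config-management run would write to a targets file.
targets_json = json.dumps(render_targets(inventory), indent=2)
```

Because the inventory is the source of truth, the monitoring side knows which processes should exist, which is the advantage described earlier in the talk.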
125
99:59:59,999 --> 99:59:59,999
It has a storage and a time series database to store all that data locally.

126
99:59:59,999 --> 99:59:59,999
It has a thing called PromQL, which is a query language designed

127
99:59:59,999 --> 99:59:59,999
for metrics and analytics.

128
99:59:59,999 --> 99:59:59,999
From that PromQL, you can add frontends that will,

129
99:59:59,999 --> 99:59:59,999
whether it's a simple API client to run reports,

130
99:59:59,999 --> 99:59:59,999
you can use things like Grafana for creating dashboards,

131
99:59:59,999 --> 99:59:59,999
it's got a simple web UI built in.

132
99:59:59,999 --> 99:59:59,999
You can plug in anything you want on that side.

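[Editor's sketch] To give a feel for the kind of analytics a metrics query language performs on stored counters, here is a simplified per-second rate calculation over timestamped samples. It is loosely modelled on what PromQL's `rate()` function does, but it is not that implementation: the real function also extrapolates to the edges of the query window, which this sketch leaves out. The function name and sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset, i.e. the process
    restarted and began counting from zero again.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Normal case: add the difference. After a reset: add the new value.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples polled every 15 seconds.
steady = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
with_reset = [(0, 100), (15, 130), (30, 10), (45, 40)]
```

With the `steady` samples the counter grows by 120 over 60 seconds, so `simple_rate(steady)` is 2.0 per second. With `with_reset` the restart between the second and third sample is counted as a fresh start rather than a negative increase.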