WEBVTT 99:59:59.999 --> 99:59:59.999 So, we had a talk by a non-GitLab person about GitLab. 99:59:59.999 --> 99:59:59.999 Now, we have a talk by a GitLab person on non-GitLab. 99:59:59.999 --> 99:59:59.999 Something like that? 99:59:59.999 --> 99:59:59.999 The CCCHH hackerspace is now open, 99:59:59.999 --> 99:59:59.999 from now on if you want to go there, that's the announcement. 99:59:59.999 --> 99:59:59.999 And the next talk will be by Ben Kochie 99:59:59.999 --> 99:59:59.999 on metrics-based monitoring with Prometheus. 99:59:59.999 --> 99:59:59.999 Welcome. 99:59:59.999 --> 99:59:59.999 [Applause] 99:59:59.999 --> 99:59:59.999 Alright, so 99:59:59.999 --> 99:59:59.999 my name is Ben Kochie 99:59:59.999 --> 99:59:59.999 I work on DevOps features for GitLab 99:59:59.999 --> 99:59:59.999 and apart from working for GitLab, I also work on the open source Prometheus project. 99:59:59.999 --> 99:59:59.999 I live in Berlin and I've been using Debian since ??? 99:59:59.999 --> 99:59:59.999 yes, quite a long time. 99:59:59.999 --> 99:59:59.999 So, what is Metrics-based Monitoring? 99:59:59.999 --> 99:59:59.999 If you're running software in production, 99:59:59.999 --> 99:59:59.999 you probably want to monitor it, 99:59:59.999 --> 99:59:59.999 because if you don't monitor it, you don't know it's right. 99:59:59.999 --> 99:59:59.999 ??? break down into two categories: 99:59:59.999 --> 99:59:59.999 there's blackbox monitoring and there's whitebox monitoring. 99:59:59.999 --> 99:59:59.999 Blackbox monitoring is treating your software like a blackbox. 99:59:59.999 --> 99:59:59.999 It's just checks to see, like, 99:59:59.999 --> 99:59:59.999 is it responding, or does it ping 99:59:59.999 --> 99:59:59.999 or ??? HTTP requests 99:59:59.999 --> 99:59:59.999 [mic turned on] 99:59:59.999 --> 99:59:59.999 Ah, there we go, much better.
99:59:59.999 --> 99:59:59.999 So, blackbox monitoring is a probe, 99:59:59.999 --> 99:59:59.999 it just kind of looks from the outside to your software 99:59:59.999 --> 99:59:59.999 and it has no knowledge of the internals 99:59:59.999 --> 99:59:59.999 and it's really good for end to end testing. 99:59:59.999 --> 99:59:59.999 So if you've got a fairly complicated service, 99:59:59.999 --> 99:59:59.999 you come in from the outside, you go through the load balancer, 99:59:59.999 --> 99:59:59.999 you hit the API server, 99:59:59.999 --> 99:59:59.999 the API server might hit a database, 99:59:59.999 --> 99:59:59.999 and you go all the way through to the back of the stack 99:59:59.999 --> 99:59:59.999 and all the way back out 99:59:59.999 --> 99:59:59.999 so you know that everything is working end to end. 99:59:59.999 --> 99:59:59.999 But you only know about it for that one request. 99:59:59.999 --> 99:59:59.999 So in order to find out if your service is working, 99:59:59.999 --> 99:59:59.999 from end to end, for every single request, 99:59:59.999 --> 99:59:59.999 this requires whitebox instrumentation. 99:59:59.999 --> 99:59:59.999 So, basically, every event that happens inside your software, 99:59:59.999 --> 99:59:59.999 inside a serving stack, 99:59:59.999 --> 99:59:59.999 gets collected and gets counted, 99:59:59.999 --> 99:59:59.999 so you know that every request hits the load balancer, 99:59:59.999 --> 99:59:59.999 every request hits your application service, 99:59:59.999 --> 99:59:59.999 every request hits the database. 99:59:59.999 --> 99:59:59.999 You know that everything matches up 99:59:59.999 --> 99:59:59.999 and this is called whitebox, or metrics-based monitoring. 99:59:59.999 --> 99:59:59.999 There are different examples of, like, 99:59:59.999 --> 99:59:59.999 the kind of software that does blackbox and whitebox monitoring.
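The idea of counting every event at every layer, so that the numbers can be checked against each other, can be sketched in a few lines. This is only an illustration: the layer names, the `serve` function, and the simulated failure rate are assumptions made for the example, not something taken from the talk.

```python
from collections import Counter

# Hypothetical event counters keyed by (layer, outcome).
events = Counter()

def serve(request_id: int) -> None:
    """Count one request at every layer of the stack it reaches."""
    events[("load_balancer", "ok")] += 1
    # Assumption for the demo: every 10th request fails in the API server
    # and therefore never reaches the database.
    if request_id % 10 == 0:
        events[("api_server", "error")] += 1
        return
    events[("api_server", "ok")] += 1
    events[("database", "ok")] += 1

for request_id in range(1, 101):
    serve(request_id)

# Whitebox view: the per-layer totals can be compared with each other.
api_total = events[("api_server", "ok")] + events[("api_server", "error")]
missing_at_database = events[("load_balancer", "ok")] - events[("database", "ok")]
```

A single blackbox probe would have seen one of these requests; the counters account for all of them and show how many never reached the database.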
99:59:59.999 --> 99:59:59.999 So you have software like Nagios that you can configure checks 99:59:59.999 --> 99:59:59.999 or Pingdom, 99:59:59.999 --> 99:59:59.999 Pingdom will do ping of your website. 99:59:59.999 --> 99:59:59.999 And then there is metrics-based monitoring, 99:59:59.999 --> 99:59:59.999 things like Prometheus, things like the TICK stack from InfluxData, 99:59:59.999 --> 99:59:59.999 New Relic and other commercial solutions 99:59:59.999 --> 99:59:59.999 but of course I like to talk about the open source solutions. 99:59:59.999 --> 99:59:59.999 We're gonna talk a little bit about Prometheus. 99:59:59.999 --> 99:59:59.999 Prometheus came out of the idea that 99:59:59.999 --> 99:59:59.999 we needed a monitoring system that could collect all this whitebox metric data 99:59:59.999 --> 99:59:59.999 and do something useful with it. 99:59:59.999 --> 99:59:59.999 Not just give us a pretty graph, but we also want to be able to 99:59:59.999 --> 99:59:59.999 alert on it. 99:59:59.999 --> 99:59:59.999 So we needed both 99:59:59.999 --> 99:59:59.999 a data gathering and an analytics system in the same instance. 99:59:59.999 --> 99:59:59.999 To do this, we built this thing and we looked at the way that 99:59:59.999 --> 99:59:59.999 data was being generated by the applications 99:59:59.999 --> 99:59:59.999 and there are advantages and disadvantages to this 99:59:59.999 --> 99:59:59.999 push vs. poll model for metrics. 99:59:59.999 --> 99:59:59.999 We decided to go with the polling model 99:59:59.999 --> 99:59:59.999 because there are some slight advantages for polling over pushing. 99:59:59.999 --> 99:59:59.999 With polling, you get this free blackbox check 99:59:59.999 --> 99:59:59.999 that the application is running. 99:59:59.999 --> 99:59:59.999 When you poll your application, you know that the process is running.
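A minimal sketch of that "free blackbox check", using only the Python standard library. The toy application, the metric name `jobs_processed_total`, and the `poll` helper are assumptions for illustration and are not Prometheus code; the point is only that a poller can tell "alive but idle" apart from "not running".

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    """Toy application endpoint; the metric name is hypothetical."""

    def do_GET(self):
        body = b"# TYPE jobs_processed_total counter\njobs_processed_total 0\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Ignore any proxy settings, since this example only talks to localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def poll(url):
    """Return (up, samples): up is 1 if the target answered, else 0."""
    try:
        with opener.open(url, timeout=2) as resp:
            text = resp.read().decode()
    except OSError:  # connection refused, timeout, HTTP error, ...
        return 0, {}
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return 1, samples

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/metrics"

idle = poll(url)   # process is alive but has done no work yet
server.shutdown()
server.server_close()
dead = poll(url)   # process is gone, so the poll itself fails
```

In the idle case the counter is simply zero; in the dead case the poll fails, so the two situations stay distinguishable.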
99:59:59.999 --> 99:59:59.999 If you are doing push-based, you can't tell the difference between 99:59:59.999 --> 99:59:59.999 your application doing no work and your application not running. 99:59:59.999 --> 99:59:59.999 So you don't know if it's stuck, 99:59:59.999 --> 99:59:59.999 or is it just not having to do any work. 99:59:59.999 --> 99:59:59.999 With polling, the polling system knows the state of your network. 99:59:59.999 --> 99:59:59.999 If you have a defined set of services, 99:59:59.999 --> 99:59:59.999 that inventory drives what should be there. 99:59:59.999 --> 99:59:59.999 Again, it's like, the disappearing, 99:59:59.999 --> 99:59:59.999 is the process dead, or is it just not doing anything? 99:59:59.999 --> 99:59:59.999 With polling, you know for a fact what processes should be there, 99:59:59.999 --> 99:59:59.999 and it's a bit of an advantage there. 99:59:59.999 --> 99:59:59.999 With polling, there's really easy testing. 99:59:59.999 --> 99:59:59.999 With push-based metrics, you have to figure out 99:59:59.999 --> 99:59:59.999 if you want to test a new version of the monitoring system or 99:59:59.999 --> 99:59:59.999 you want to test something new, 99:59:59.999 --> 99:59:59.999 you have to ??? a copy of the data. 99:59:59.999 --> 99:59:59.999 With polling, you can just set up another instance of your monitoring 99:59:59.999 --> 99:59:59.999 and just test it. 99:59:59.999 --> 99:59:59.999 Or you don't even have, 99:59:59.999 --> 99:59:59.999 it doesn't even have to be monitoring, you can just use curl 99:59:59.999 --> 99:59:59.999 to poll the metrics endpoint. 99:59:59.999 --> 99:59:59.999 It's significantly easier to test. 99:59:59.999 --> 99:59:59.999 The other thing with the… 99:59:59.999 --> 99:59:59.999 The other nice thing is that the client is really simple. 99:59:59.999 --> 99:59:59.999 The client doesn't have to know where the monitoring system is. 99:59:59.999 --> 99:59:59.999 It doesn't have to know about ???
99:59:59.999 --> 99:59:59.999 It just has to sit and collect the data about itself. 99:59:59.999 --> 99:59:59.999 So it doesn't have to know anything about the topology of the network. 99:59:59.999 --> 99:59:59.999 As an application developer, if you're writing a DNS server or 99:59:59.999 --> 99:59:59.999 some other piece of software, 99:59:59.999 --> 99:59:59.999 you don't have to know anything about monitoring software, 99:59:59.999 --> 99:59:59.999 you can just implement it inside your application and 99:59:59.999 --> 99:59:59.999 the monitoring software, whether it's Prometheus or something else, 99:59:59.999 --> 99:59:59.999 can just come and collect that data for you. 99:59:59.999 --> 99:59:59.999 That's kind of similar to a very old monitoring system called SNMP, 99:59:59.999 --> 99:59:59.999 but SNMP has a significantly less friendly data model for developers. 99:59:59.999 --> 99:59:59.999 This is the basic layout of a Prometheus server. 99:59:59.999 --> 99:59:59.999 At the core, there's a Prometheus server 99:59:59.999 --> 99:59:59.999 and it deals with all the data collection and analytics. 99:59:59.999 --> 99:59:59.999 Basically, this one binary, it's all written in Go. 99:59:59.999 --> 99:59:59.999 It's a single binary. 99:59:59.999 --> 99:59:59.999 It knows how to read from your inventory, 99:59:59.999 --> 99:59:59.999 there's a bunch of different methods, whether you've got 99:59:59.999 --> 99:59:59.999 a Kubernetes cluster or a cloud platform 99:59:59.999 --> 99:59:59.999 or you have your own customized thing with Ansible. 99:59:59.999 --> 99:59:59.999 Ansible can take your layout, drop that into a config file and 99:59:59.999 --> 99:59:59.999 Prometheus can pick that up. 99:59:59.999 --> 99:59:59.999 Once it has the layout, it goes out and collects all the data. 99:59:59.999 --> 99:59:59.999 It has a storage and a time series database to store all that data locally.
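The "drop the layout into a config file" step can be sketched as follows. Prometheus's file-based service discovery reads JSON files containing a list of target groups, each with `targets` and optional `labels`. The inventory contents, the host names, and the port below are assumptions made for the example, and the talk does not say which discovery method the speaker's own setup uses.

```python
import json

# Hypothetical inventory, e.g. what a configuration-management tool knows.
inventory = {
    "dns": ["ns1.example.org", "ns2.example.org"],
    "api": ["api1.example.org"],
}
METRICS_PORT = 9100  # assumed port on which each host exposes metrics

def to_target_groups(inv, port):
    """Turn {job: [hosts]} into a list of file-SD style target groups."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"job": job},
        }
        for job, hosts in sorted(inv.items())
    ]

target_groups = to_target_groups(inventory, METRICS_PORT)

# This JSON text is what would be written to the file the server watches.
rendered = json.dumps(target_groups, indent=2)
```

Because the inventory drives the target list, the monitoring side knows which processes should exist even before any of them has answered a poll.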
99:59:59.999 --> 99:59:59.999 It has a thing called PromQL, which is a query language designed 99:59:59.999 --> 99:59:59.999 for metrics and analytics. 99:59:59.999 --> 99:59:59.999 From that PromQL, you can add frontends that will, 99:59:59.999 --> 99:59:59.999 whether it's a simple API client to run reports, 99:59:59.999 --> 99:59:59.999 you can use things like Grafana for creating dashboards, 99:59:59.999 --> 99:59:59.999 it's got a simple web UI built in. 99:59:59.999 --> 99:59:59.999 You can plug in anything you want on that side.
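A "simple API client to run reports" might look roughly like this sketch. It builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and flattens a vector result into rows. The server address, the metric `http_requests_total`, and the sample response body are assumptions for illustration; no request is actually sent here.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def vector_to_rows(payload):
    """Flatten a successful 'vector' result into (labels, value) pairs."""
    if payload["status"] != "success" or payload["data"]["resultType"] != "vector":
        raise ValueError("expected a successful instant-vector result")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]

# Hypothetical server address and metric name.
url = instant_query_url(
    "http://localhost:9090",
    "sum by (job) (rate(http_requests_total[5m]))",
)

# Made-up response body in the shape the API uses for instant vectors.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "api"}, "value": [1700000000, "12.5"]},
      {"metric": {"job": "dns"}, "value": [1700000000, "3"]}
    ]
  }
}
""")
rows = vector_to_rows(sample_response)
```

A dashboard tool or a reporting script would do the same thing at larger scale: send a PromQL expression, then render the labelled values it gets back.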