So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so, my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. Monitoring breaks down into two categories: there's blackbox monitoring and there's whitebox monitoring. Blackbox monitoring is treating your software like a black box. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better.

So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working, end to end, for every single request, you need whitebox instrumentation. So, basically, every event that happens inside your software, inside the serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring.
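To make the whitebox idea concrete (this example is mine, not from the talk; the application name, metric name, and port are all made up), here is a minimal sketch of an instrumented service using the Go Prometheus client library. Every request that comes in gets counted, and the running totals are exposed on a /metrics endpoint for the monitoring system to collect:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter: one increment per request, labelled by URL path.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total number of HTTP requests handled.",
	},
	[]string{"path"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Every request that hits this handler is counted, so the
	// monitoring system can later check that everything matches up.
	httpRequestsTotal.WithLabelValues(r.URL.Path).Inc()
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handler)
	// The monitoring system polls this endpoint to collect the counts.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```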
There are different examples of the kind of software that does blackbox and whitebox monitoring. So you have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model, because there are some slight advantages to polling over pushing.

With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the same disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's also really easy testing. With push-based metrics, if you want to test a new version of the monitoring system, or you want to test something new, you have to figure out how to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. Or it doesn't even have to be monitoring; you can just use curl to poll the metrics endpoint. It's significantly easier to test.
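As a rough illustration of that testing workflow (again my assumption, continuing the hypothetical service sketched above), polling the /metrics endpoint by hand, for instance with curl, returns plain text you can read directly; the value here is invented:

```
# HELP myapp_http_requests_total Total number of HTTP requests handled.
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{path="/"} 42
```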
The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you.

That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's a Prometheus server, and it deals with all the data collection and analytics. Basically, this one binary, it's all written in Go. It's a single binary. It knows how to read from your inventory; there's a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has a storage layer and a time series database to store all that data locally.

It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether it's a simple API client to run reports, something like Grafana for creating dashboards, or the simple web UI that's built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these recording rules have two different modes. You can either take a query and it will generate new data from that query, or you can take a query and, if it returns results, it will generate an alert. That alert is a push message to the Alertmanager. This allows us to separate the generating of alerts from the routing of alerts.
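To give a feel for what those queries and rules look like (these expressions are illustrative, not from the talk, and reuse the hypothetical counter from the earlier sketch), PromQL turns raw counters into rates, and a rule is just a query that is either recorded as new data or, if it returns results, fired as an alert:

```
# Could be saved as a recording rule: per-path request rate over 5 minutes.
sum by (path) (rate(myapp_http_requests_total[5m]))

# Could be used in an alerting rule: it only returns a result (and therefore
# fires) when the service has handled no requests at all for 5 minutes.
sum(rate(myapp_http_requests_total[5m])) == 0
```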
You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an Alertmanager cluster, which does the deduplication and the routing to the human. Because, of course, the thing that we want is: we had dashboards with graphs, but in order to find out if something was broken, you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric. This is a very simple thing. If you know much about the Linux kernel, the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric's name and you understand a little bit about what's going on here.
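The slide itself isn't captured in the transcript, but the kind of output being described, as exposed by the Prometheus node_exporter, generally looks something like this (the numbers are placeholders):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.3245786e+06
node_cpu_seconds_total{cpu="0",mode="system"} 5312.41
node_cpu_seconds_total{cpu="0",mode="user"} 24187.93
```

Each line is one time series: the metric name, a set of labels (which CPU, which mode), and the current counter value.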