[Script Info]
Title:

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, we had a talk by a non-GitLab person\Nabout GitLab.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, we have a talk by a GitLab person\Non non-GitLab.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Something like that?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The CCCHH hackerspace is now open,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from now on if you want to go there,\Nthat's the announcement.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And the next talk will be by Ben Kochie
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,on metrics-based monitoring\Nwith Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Welcome.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause]
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Alright, so
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,my name is Ben Kochie
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I work on DevOps features for GitLab
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and apart from working for GitLab, I also work\Non the opensource Prometheus project.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I live in Berlin and I've been using\NDebian since ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,yes, quite a long time.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, what is Metrics-based Monitoring?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you're running software in production,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you probably want to monitor it,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because if you don't monitor it, you don't\Nknow it's right.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,??? break down into two categories:
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's blackbox monitoring and\Nthere's whitebox monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Blackbox monitoring is treating\Nyour software like a blackbox.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just checks to see, like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is it responding, or does it ping
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or ??? HTTP requests
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[mic turned on]
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ah, there we go, much better.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, blackbox monitoring is a probe,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it just kind of looks from the outside\Nto your software
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it has no knowledge of the internals
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's really good for end to end testing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a fairly complicated\Nservice,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you come in from the outside, you go\Nthrough the load balancer,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you hit the API server,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the API server might hit a database,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you go all the way through\Nto the back of the stack
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and all the way back out
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that everything is working\Nend to end.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But you only know about it\Nfor that one request.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So in order to find out if your service\Nis working,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from the end to end, for every single\Nrequest,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this requires whitebox instrumentation.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, basically, every event that happens\Ninside your software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,inside a serving stack,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,gets collected and gets counted,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that every request hits\Nthe load balancer,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits your application\Nservice,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits the database.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You know that everything matches up
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is called whitebox, or\Nmetrics-based monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are different examples of, like,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the kind of software that does blackbox\Nand whitebox monitoring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you have software like Nagios that\Nyou can configure checks
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or Pingdom,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Pingdom will do ping of your website.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then there is metrics-based monitoring,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,things like Prometheus, things like\Nthe TICK stack from InfluxData,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,New Relic and other commercial solutions
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but of course I like to talk about\Nthe opensource solutions.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We're gonna talk a little bit about\NPrometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus came out of the idea that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we needed a monitoring system that could\Ncollect all this whitebox metric data
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and do something useful with it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Not just give us a pretty graph, but\Nwe also want to be able to
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,alert on it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So we needed both
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a data gathering and an analytics system\Nin the same instance.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,To do this, we built this thing and\Nwe looked at the way that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,data was being generated\Nby the applications
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there are advantages and\Ndisadvantages to this
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,push vs. poll model for metrics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We decided to go with the polling model
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because there are some slight advantages\Nfor polling over pushing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you get this free\Nblackbox check
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that the application is running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,When you poll your application, you know\Nthat the process is running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you are doing push-based, you can't\Ntell the difference between
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,your application doing no work and\Nyour application not running.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you don't know if it's stuck,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or is it just not having to do any work.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, the polling system knows\Nthe state of your network.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a defined set of services,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that inventory drives what should be there.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Again, it's like, the disappearing,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is the process dead, or is it just\Nnot doing anything?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you know for a fact\Nwhat processes should be there,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's a bit of an advantage there.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, there's really easy testing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With push-based metrics, you have to\Nfigure out
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if you want to test a new version of\Nthe monitoring system or
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you want to test something new,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you have to ??? a copy of the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you can just set up\Nanother instance of your monitoring
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and just test it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Or you don't even have,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it doesn't even have to be monitoring,\Nyou can just use curl
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to poll the metrics endpoint.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's significantly easier to test.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other thing with the…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other nice thing is that\Nthe client is really simple.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The client doesn't have to know\Nwhere the monitoring system is.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It doesn't have to know about ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It just has to sit and collect the data\Nabout itself.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So it doesn't have to know anything about\Nthe topology of the network.
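The polling idea described above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not how any Prometheus client library is implemented: the metric name `demo_requests_handled_total`, the `/metrics` path handling, and the helper names are all hypothetical. The application only keeps a counter about itself and serves it as plain text; `scrape` plays the role of the poller (or of curl) and is the only side that needs to know an address.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; the metric name is illustrative only.
REQUESTS_HANDLED = 0


def render_metrics() -> str:
    """Render the application's own state as one plain-text sample line."""
    return f"demo_requests_handled_total {REQUESTS_HANDLED}\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the metrics text; knows nothing about who is polling it."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url: str) -> str:
    """What a poller (or curl) does: a single plain HTTP GET."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


def demo() -> str:
    """Start the endpoint on a free local port, poll it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo(), end="")
```

If the GET fails, the poller learns that the process is not reachable, which is the free blackbox check mentioned earlier; if it succeeds, it gets the counters as well.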
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,As an application developer, if you're\Nwriting a DNS server or
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,some other piece of software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you don't have to know anything about\Nmonitoring software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can just implement it inside\Nyour application and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the monitoring software, whether it's\NPrometheus or something else,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,can just come and collect that data for you.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's kind of similar to a very old\Nmonitoring system called SNMP,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but SNMP has a significantly less friendly\Ndata model for developers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is the basic layout\Nof a Prometheus server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,At the core, there's a Prometheus server
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it deals with all the data collection\Nand analytics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Basically, this one binary,\Nit's all written in golang.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a single binary.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It knows how to read from your inventory,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's a bunch of different methods,\Nwhether you've got
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a Kubernetes cluster or a cloud platform
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you have your own customized thing\Nwith Ansible.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ansible can take your layout, drop that\Ninto a config file and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus can pick that up.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Once it has the layout, it goes out and\Ncollects all the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a storage and a time series\Ndatabase to store all that data locally.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a thing called PromQL, which is\Na query language designed
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for metrics and analytics.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,From that PromQL, you can add frontends\Nthat will,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,whether it's a simple API client\Nto run reports,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can use things like Grafana\Nfor creating dashboards,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's got a simple web UI built in.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can plug in anything you want\Non that side.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then, it also has the ability to\Ncontinuously execute queries
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,called "recording rules"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and these recording rules have\Ntwo different modes.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can either record, you can take\Na query
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it will generate new data\Nfrom that query
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you can take a query, and\Nif it returns results,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it will return an alert.
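The two rule modes just described can be modelled in a few lines of Python. This is only a conceptual sketch: the queries are ordinary Python functions standing in for PromQL expressions, and every name here (`run_recording_rule`, `run_alerting_rule`, the instance names, the threshold) is hypothetical rather than part of Prometheus.

```python
from typing import Callable, Dict, List

# A toy "database": series identifier -> current value.
Series = Dict[str, int]
Query = Callable[[Series], Series]


def run_recording_rule(name: str, query: Query, data: Series,
                       store: Dict[str, Series]) -> None:
    """Record mode: the query result is saved as new data under `name`."""
    store[name] = query(data)


def run_alerting_rule(name: str, query: Query, data: Series) -> List[str]:
    """Alert mode: every series the query returns becomes one alert."""
    return [f"{name} firing for {key}" for key in sorted(query(data))]


# Hypothetical input: error counts per instance.
errors = {"api-1": 2, "api-2": 40}

# Record mode: derive a new series holding the total across instances.
recorded: Dict[str, Series] = {}
run_recording_rule("errors_total_all", lambda d: {"all": sum(d.values())},
                   errors, recorded)

# Alert mode: the query returns results only for instances above a threshold.
alerts = run_alerting_rule("HighErrorCount",
                           lambda d: {k: v for k, v in d.items() if v > 10},
                           errors)
```

An alerting query that returns nothing produces no alerts, which matches the "if it returns results" condition in the talk.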
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That alert is a push message\Nto the alert manager.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This allows us to separate the generating\Nof alerts from the routing of alerts.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can have one or hundreds of Prometheus\Nservices, all generating alerts
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it goes into an alert manager cluster\Nand sends, does the deduplication
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the routing to the human
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because, of course, the thing\Nthat we want is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we had dashboards with graphs, but\Nin order to find out if something is broken
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you had to have a human\Nlooking at the graph.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With Prometheus, we don't have to do that\Nanymore,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can simply let the software tell us\Nthat we need to go investigate
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,our problems.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We don't have to sit there and\Nstare at dashboards all day,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because that's really boring.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What does it look like to actually\Nget data into Prometheus?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very basic output\Nof a Prometheus metric.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very simple thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you know much about\Nthe Linux kernel,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the Linux kernel tracks ??? stats,\Nall the state of all the CPUs
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,in your system
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we express this by having\Nthe name of the metric, which is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'node_cpu_seconds_total' and so\Nthis is a self describing metric,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,like you can just read the metrics name
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you understand a little bit about\Nwhat's going on here.
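To make the "self describing" point concrete, here is a small sketch that splits one line of text-format metric output into its name, labels and value. The sample line is illustrative: the `cpu` and `mode` labels and the value `1234.5` are assumptions, not taken from the slide. The parser is deliberately simplified (it ignores optional timestamps and escaped characters inside label values) and `parse_sample` is a hypothetical helper, not part of any Prometheus library.

```python
import re
from typing import Dict, Tuple

# name, optional {label="value",...} block, then the sample value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample line into (metric name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Illustrative line; the label set and value are assumed for the example.
name, labels, value = parse_sample(
    'node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5'
)
```

Reading the parts back: `node` is the source, `cpu` is what is measured, `seconds` is the unit, and `total` marks an ever-increasing counter.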