So, we had a talk by a non-GitLab person about GitLab. Now, we have a talk by a GitLab person on non-GitLab. Something like that? The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab and, apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ??? yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring. Blackbox monitoring is treating your software like a blackbox. It's just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better. So, blackbox monitoring is a probe. It just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working end to end for every single request, you need whitebox instrumentation. So, basically, every event that happens inside your software, inside a serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application service, every request hits the database.
You know that everything matches up, and this is called whitebox, or metrics-based, monitoring. There are different examples of the kind of software that does blackbox and whitebox monitoring. So you have software like Nagios, where you can configure checks, or Pingdom; Pingdom will ping your website. And then there is metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions. But of course I like to talk about the open source solutions.

We're gonna talk a little bit about Prometheus. Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph, but we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance. To do this, we built this thing and we looked at the way that data was being generated by the applications, and there are advantages and disadvantages to this push vs. poll model for metrics. We decided to go with the polling model because there are some slight advantages for polling over pushing.

With polling, you get this free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's like, when something disappears: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and it's a bit of an advantage there.

With polling, there's really easy testing.
With push-based metrics, if you want to test a new version of the monitoring system or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and just test it. Or it doesn't even have to be monitoring; you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other thing with the… The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is. It doesn't have to know about ??? It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software. You can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core, there's a Prometheus server, and it deals with all the data collection and analytics. Basically, it's this one binary, all written in Go. It's a single binary. It knows how to read from your inventory. There's a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform or you have your own customized thing with Ansible. Ansible can take your layout and drop that into a config file, and Prometheus can pick that up. Once it has the layout, it goes out and collects all the data. It has storage and a time series database to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics.
From that PromQL, you can add frontends, whether it's a simple API client to run reports or things like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.
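
[Editor's sketch] The polling model described in the talk, where the application only counts its own events and something else comes to collect them, can be illustrated with a small standard-library Python example. This is a minimal sketch, not how Prometheus or its client libraries are implemented: the metric name `demo_requests_total`, the `/metrics` path on a loopback port, and the helpers `Counter`, `handle_request`, `start_metrics_server` and `scrape` are all made up for illustration. A real application would more likely use an official Prometheus client library. The `scrape` function stands in for the poller (or for a manual `curl` against the endpoint).

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Hypothetical thread-safe counter that only ever goes up."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self.value += amount

    def expose(self):
        # Plain-text lines in the style of the Prometheus text format.
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


# Assumed metric name, chosen only for this example.
REQUESTS = Counter("demo_requests_total", "Requests handled by the demo app.")


def handle_request():
    """Stand-in for real application work: every event gets counted."""
    REQUESTS.inc()


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter value to whoever polls /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.expose().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_metrics_server(port=0):
    """Start the endpoint on loopback; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(url):
    """What the poller does: fetch the metrics page over HTTP."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def main():
    server = start_metrics_server()
    for _ in range(3):
        handle_request()
    port = server.server_address[1]
    print(scrape(f"http://127.0.0.1:{port}/metrics"))
    server.shutdown()
    server.server_close()


if __name__ == "__main__":
    main()
```

Note that the application side never learns who is polling it or where the monitoring system lives. If the scrape fails outright, the poller also learns that the process is not reachable, which is the "free blackbox check" mentioned earlier in the talk.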