[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, we had a talk by a non-GitLab person\Nabout GitLab. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, we have a talk by a GitLab person\Non non-GitLab. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Something like that? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The CCCHH hackerspace is now open, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from now on if you want to go there,\Nthat's the announcement. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And the next talk will be by Ben Kochie Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,on metrics-based monitoring\Nwith Prometheus. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Welcome. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause] Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Alright, so Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,my name is Ben Kochie Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I work on DevOps features for GitLab Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and apart from working for GitLab, I also work\Non the open-source Prometheus project. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I live in Berlin and I've been using\NDebian since ??? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,yes, quite a long time. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, what is Metrics-based Monitoring? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you're running software in production, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you probably want to monitor it, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because if you don't monitor it, you don't\Nknow it's right.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,??? break down into two categories: Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's blackbox monitoring and\Nthere's whitebox monitoring. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Blackbox monitoring is treating\Nyour software like a blackbox. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just checks to see, like, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is it responding, or does it ping Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or ??? HTTP requests Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[mic turned on] Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ah, there we go, much better. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, blackbox monitoring is a probe, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it just kind of looks from the outside\Nto your software Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it has no knowledge of the internals Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's really good for end to end testing. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a fairly complicated\Nservice, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you come in from the outside, you go\Nthrough the load balancer, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you hit the API server, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the API server might hit a database, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you go all the way through\Nto the back of the stack Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and all the way back out Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that everything is working\Nend to end. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But you only know about it\Nfor that one request. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So in order to find out if your service\Nis working, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from the end to end, for every single\Nrequest, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this requires whitebox instrumentation. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So, basically, every event that happens\Ninside your software, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,inside a serving stack, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,gets collected and gets counted, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you know that every request hits\Nthe load balancer, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits your application\Nservice, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,every request hits the database. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You know that everything matches up Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is called whitebox, or\Nmetrics-based monitoring. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are different examples of, like, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the kind of software that does blackbox\Nand whitebox monitoring. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you have software like Nagios that\Nyou can configure checks in Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or Pingdom, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Pingdom will ping your website.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then there is metrics-based monitoring, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,things like Prometheus, things like\Nthe TICK stack from InfluxData, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,New Relic and other commercial solutions Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but of course I like to talk about\Nthe open-source solutions. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We're gonna talk a little bit about\NPrometheus. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus came out of the idea that Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we needed a monitoring system that could\Ncollect all this whitebox metric data Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and do something useful with it. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Not just give us a pretty graph, but\Nwe also want to be able to Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,alert on it. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So we needed both Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a data gathering and an analytics system\Nin the same instance. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,To do this, we built this thing and\Nwe looked at the way that Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,data was being generated\Nby the applications Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there are advantages and\Ndisadvantages to this Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,push vs. poll model for metrics. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We decided to go with the polling model Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because there are some slight advantages\Nfor polling over pushing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you get this free\Nblackbox check Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that the application is running. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,When you poll your application, you know\Nthat the process is running. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you are doing push-based, you can't\Ntell the difference between Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,your application doing no work and\Nyour application not running. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you don't know if it's stuck, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or is it just not having to do any work. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, the polling system knows\Nthe state of your network. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a defined set of services, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that inventory drives what should be there. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Again, it's like, the disappearing, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is the process dead, or is it just\Nnot doing anything? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you know for a fact\Nwhat processes should be there, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's a bit of an advantage there. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, there's really easy testing. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With push-based metrics, you have to\Nfigure out Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if you want to test a new version of\Nthe monitoring system or Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you want to test something new, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you have to ??? a copy of the data. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With polling, you can just set up\Nanother instance of your monitoring Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and just test it. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Or you don't even have, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it doesn't even have to be monitoring,\Nyou can just use curl Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to poll the metrics endpoint. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's significantly easier to test. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other thing with the… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The other nice thing is that\Nthe client is really simple. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The client doesn't have to know\Nwhere the monitoring system is. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It doesn't have to know about ??? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It just has to sit and collect the data\Nabout itself. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So it doesn't have to know anything about\Nthe topology of the network. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,As an application developer, if you're\Nwriting a DNS server or Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,some other piece of software, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you don't have to know anything about\Nmonitoring software, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can just implement it inside\Nyour application and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the monitoring software, whether it's\NPrometheus or something else, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,can just come and collect that data for you. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's kind of similar to a very old\Nmonitoring system called SNMP, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but SNMP has a significantly less friendly\Ndata model for developers. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is the basic layout\Nof a Prometheus server. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,At the core, there's a Prometheus server Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it deals with all the data collection\Nand analytics. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Basically, this one binary,\Nit's all written in golang. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a single binary. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It knows how to read from your inventory, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's a bunch of different methods,\Nwhether you've got Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a kubernetes cluster or a cloud platform Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you have your own customized thing\Nwith ansible. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Ansible can take your layout, drop that\Ninto a config file and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus can pick that up. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Once it has the layout, it goes out and\Ncollects all the data. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a storage and a time series\Ndatabase to store all that data locally. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It has a thing called PromQL, which is\Na query language designed Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for metrics and analytics. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,From that PromQL, you can add frontends\Nthat will, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,whether it's a simple API client\Nto run reports, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can use things like Grafana\Nfor creating dashboards, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's got a simple webUI built in. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can plug in anything you want\Non that side. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then, it also has the ability to\Ncontinuously execute queries Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,called "recording rules" Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and these recording rules have\Ntwo different modes. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can either record, you can take\Na query Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it will generate new data\Nfrom that query Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or you can take a query, and\Nif it returns results, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it will return an alert. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That alert is a push message\Nto the alert manager. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This allows us to separate the generating\Nof alerts from the routing of alerts. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can have one or hundreds of Prometheus\Nservices, all generating alerts Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it goes into an alert manager cluster\Nand sends, does the deduplication Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the routing to the human Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because, of course, the thing\Nthat we want is Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we had dashboards with graphs, but\Nin order to find out if something is broken Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you had to have a human\Nlooking at the graph. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With Prometheus, we don't have to do that\Nanymore, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can simply let the software tell us\Nthat we need to go investigate Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,our problems. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We don't have to sit there and\Nstare at dashboards all day, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because that's really boring. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What does it look like to actually\Nget data into Prometheus? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very basic output\Nof a Prometheus metric. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a very simple thing. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you know much about\Nthe linux kernel, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the linux kernel tracks ??? 
stats,\Nall the state of all the CPUs Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,in your system Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we express this by having\Nthe name of the metric, which is Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'node_cpu_seconds_total' and so\Nthis is a self-describing metric, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,like you can just read the metrics name Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you understand a little bit about\Nwhat's going on here. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The linux kernel and other kernels track\Ntheir usage by the number of seconds Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,spent doing different things and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that could be, whether it's in system or\Nuser space or IRQs Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or iowait or idle. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Actually, the kernel tracks how much\Nidle time it has. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It also tracks it by the number of CPUs. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With other monitoring systems, they used\Nto do this with a tree structure Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this caused a lot of problems,\Nfor like Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,How do you mix and match data so\Nby switching from Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a tree structure to a tag-based structure, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can do some really interesting\Npowerful data analytics. 
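The difference between a tree structure and a tag-based structure can be sketched in a few lines of Python. This is only an illustration, not Prometheus code: the metric name `node_cpu_seconds_total` and the modes come from the talk, while the label names `cpu` and `mode`, the sample values, and the helper `sum_by` are assumptions made up for the example.

```python
from collections import defaultdict

# Each sample is (metric name, labels, value). The values are invented.
SAMPLES = [
    ("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}, 900.0),
    ("node_cpu_seconds_total", {"cpu": "0", "mode": "user"}, 60.0),
    ("node_cpu_seconds_total", {"cpu": "0", "mode": "system"}, 40.0),
    ("node_cpu_seconds_total", {"cpu": "1", "mode": "idle"}, 850.0),
    ("node_cpu_seconds_total", {"cpu": "1", "mode": "user"}, 100.0),
    ("node_cpu_seconds_total", {"cpu": "1", "mode": "system"}, 50.0),
]


def sum_by(samples, label):
    """Sum sample values grouped by one label, ignoring all other labels."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


# The same flat data can be grouped along either dimension.
per_mode = sum_by(SAMPLES, "mode")  # seconds per mode, across all CPUs
per_cpu = sum_by(SAMPLES, "cpu")    # seconds per CPU, across all modes
print(per_mode)
print(per_cpu)
```

With a tree such as `host.cpu0.idle`, the grouping order is fixed by the path; with labels, either grouping is the same one-line operation.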
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's a nice example of taking\Nthose CPU seconds counters Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then converting them into a graph\Nby using PromQL. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we can get into\NMetrics-Based Alerting. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we have this graph, we have this thing Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can look and see here Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Oh there is some little spike here,\Nwe might want to know about that." Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we can get into Metrics-Based\NAlerting. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I used to be a site reliability engineer,\NI'm still a site reliability engineer at heart Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we have this concept of things that\Nyou need on a site or a service reliably Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The most important thing you need is\Ndown at the bottom, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Monitoring, because if you don't have\Nmonitoring of your service, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,how do you know it's even working? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a couple of techniques here, and\Nwe want to alert based on data Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and not just those end to end tests. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a couple of techniques, a thing\Ncalled the RED method Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there's a thing called the USE method Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there are a couple of nice\Nblog posts about this Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and basically it defines that, for example, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the RED method talks about the requests\Nthat your system is handling. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are three things: Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's the number of requests, there's\Nthe number of errors Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and there's how long each one takes, the duration. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,With the combination of these three things Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can determine most of\Nwhat your users see: Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Did my request go through? Did it\Nreturn an error? Was it fast?" Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Most people, that's all they care about. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"I made a request to a website and\Nit came back and it was fast." Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a very simple method of just, like, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,those are the important things to\Ndetermine if your site is healthy.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But we can go back to some more\Ntraditional, sysadmin style, alerts. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is basically taking the filesystem\Navailable space, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,divided by the filesystem size, that becomes\Nthe ratio of filesystem availability Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from 0 to 1. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Multiply it by 100, we now have\Na percentage Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and if it's less than or equal to 1%\Nfor 15 minutes, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,this is less than 1% space, we should tell\Na sysadmin to go check Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the ??? filesystem ??? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's super nice and simple. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can also tag, we can include… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Every alert includes all the extraneous\Nlabels that Prometheus adds to your metrics. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,When you add a metric in Prometheus, if\Nwe go back and we look at this metric. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This metric only contains the information\Nabout the internals of the application. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Anything about, like, what server it's on,\Nis it running in a container, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what cluster does it come from,\Nwhat ??? is it on, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that's all extra annotations that are\Nadded by the Prometheus server Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,at discovery time.
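The arithmetic behind the filesystem alert described above (available space divided by size, times 100, at or below 1% for 15 minutes) can be sketched in Python. This is a simplified stand-in for how an alerting rule is evaluated, not Prometheus's actual implementation; the function names and the sample layout are assumptions for the example.

```python
def available_percent(avail_bytes, size_bytes):
    """Ratio of available space to total size (0 to 1), scaled to a percentage."""
    return avail_bytes / size_bytes * 100


def should_alert(samples, threshold_percent=1.0, hold_seconds=15 * 60):
    """Return True if the condition held for the whole trailing window.

    samples: list of (timestamp_seconds, avail_bytes, size_bytes), oldest first.
    """
    if not samples:
        return False
    newest = samples[-1][0]
    # The history must reach back at least hold_seconds before we can alert.
    if newest - samples[0][0] < hold_seconds:
        return False
    window = [s for s in samples if newest - s[0] <= hold_seconds]
    return all(
        available_percent(avail, size) <= threshold_percent
        for _ts, avail, size in window
    )


# One sample per minute for 15 minutes, always at 0.5% free (invented data).
low_for_15_min = [(minute * 60, 5, 1000) for minute in range(16)]
print(should_alert(low_for_15_min))
```

Requiring the condition to hold over the whole window is what keeps a single brief dip from paging anyone.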
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I don't have a good example of what\Nthose labels look like Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but every metric gets annotated\Nwith location information. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That location information also comes through\Nas labels in the alert Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so, if you have a message coming\Ninto your alert manager, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the alert manager can look and go Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Oh, that's coming from this datacenter" Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it can include that in the email or\NIRC message or SMS message. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you can include Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Filesystem is out of space on this host\Nfrom this datacenter" Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,All these labels get passed through and\Nthen you can append Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"severity: critical" to that alert and\Ninclude that in the message to the human Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because of course, this is how you define… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Getting the message from the monitoring\Nto the human. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can even include nice things like, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if you've got documentation, you can\Ninclude a link to the documentation Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,as an annotation Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the alert manager can take that\Nbasic url and, you know, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,massage it into whatever it needs\Nto look like to actually get Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the operator to the correct documentation. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can also do more fun things: Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,since we actually are not just checking Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what is the space right now,\Nwe're tracking data over time, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can use 'predict_linear'. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'predict_linear' just takes and does\Na simple linear regression. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This example takes the filesystem\Navailable space over the last hour and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,does a linear regression. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prediction says "Well, it's going that way\Nand four hours from now, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,based on one hour of history, it's gonna\Nbe less than 0, which means full".
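The idea behind 'predict_linear' can be sketched as an ordinary least-squares fit over the recent samples, extrapolated forward. The Python below is a rough illustration under stated assumptions, not Prometheus's implementation: the function name `extrapolate_linear` is hypothetical, the history is invented, and it extrapolates from the newest sample rather than from the rule's evaluation time.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and extrapolate.

    Returns the predicted value seconds_ahead after the newest sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    newest_t = samples[-1][0]
    return slope * (newest_t + seconds_ahead) + intercept


# One hour of invented history: free space drops by 1 GiB every 10 minutes.
GIB = 1024 ** 3
history = [(minute * 60, (20 - minute // 10) * GIB) for minute in range(0, 61, 10)]

predicted = extrapolate_linear(history, 4 * 3600)
# A negative prediction means the disk is expected to fill within four hours.
print(predicted < 0)
```

An alert built this way compares the predicted value against zero, so it fires on the trend rather than on the current level.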
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We know that within the next four hours,\Nthe disk is gonna be full Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so we can tell the operator ahead of time\Nthat it's gonna be full Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and not just tell them that it's full\Nright now. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They have some window of ability\Nto fix it before it fails. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is really important because\Nif you're running a site Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you want to be able to have alerts\Nthat tell you that your system is failing Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,before it actually fails. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Because if it fails, you're out of SLO\Nor SLA and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,your users are gonna be unhappy Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you don't want the users to tell you\Nthat your site is down, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you want to know about it before\Nyour users can even tell. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This allows you to do that. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And also of course, Prometheus being\Na modern system, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we fully support UTF-8 in all of our labels. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's another one, here's a good example\Nfrom the USE method.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a rate of 500 errors coming from\Nan application Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you can simply alert that Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's more than 500 errors per second\Ncoming out of the application Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if that's your threshold for ??? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you can do other things, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can convert that from just\Na rate of errors Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to a percentage of errors. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you could say Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"I have an SLA of three nines" and so you can say Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"If the rate of errors divided by the rate\Nof requests is .01, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or is more than .01, then\Nthat's a problem." Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can include that level of\Nerror granularity. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And if you're just doing a blackbox test, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you wouldn't know this, you would only get\Nif you got an error from the system, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,then you got another error from the system Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,then you fire an alert. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But if those checks are one minute apart\Nand you're serving 1000 requests per second Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you could be serving 10,000 errors before\Nyou even get an alert.
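The ratio-based alert and the probe-interval arithmetic above can be written out as a small Python sketch. The function names and the 25 errors per second figure are invented for illustration; the .01 threshold, the 1000 requests per second and the one-minute probe interval are the numbers quoted in the talk.

```python
def error_ratio(errors_per_second, requests_per_second):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if requests_per_second == 0:
        return 0.0
    return errors_per_second / requests_per_second


def requests_between_probes(requests_per_second, probe_interval_seconds):
    """How many requests are served between two consecutive blackbox probes."""
    return requests_per_second * probe_interval_seconds


THRESHOLD = 0.01  # the ".01" from the talk, i.e. 1% of requests

# Hypothetical traffic: 25 failing requests per second out of 1000.
ratio = error_ratio(errors_per_second=25, requests_per_second=1000)
print(ratio > THRESHOLD)

# Requests that go by unobserved between two probes one minute apart.
print(requests_between_probes(1000, 60))
```

Because the ratio is computed from counters covering every request, it reflects all the traffic between probes and not just the one request a blackbox check happens to make.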
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you might miss it, because Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what if you only get one random error Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then the next time, you're serving\N25% errors, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you only have a 25% chance of that check\Nfailing again. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You really need these metrics in order\Nto get Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,proper reports of the status of your system Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's even options Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can slice and dice those labels. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a label on all of\Nyour applications called 'service' Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can send that 'service' label through\Nto the message Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you can say\N"Hey, this service is broken". Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can include that service label\Nin your alert messages. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And that's it, I can go to a demo and Q&A. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause] Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Any questions so far? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Or anybody want to see a demo? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] Hi. Does Prometheus make metric\Ndiscovery inside containers Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or do I have to implement the metrics\Nmyself? 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] For metrics in containers, there are\Nalready things that expose Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the metrics of the container system\Nitself. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a utility called 'cAdvisor' and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,cAdvisor takes the Linux cgroup data\Nand exposes it as metrics Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you can get data about\Nhow much CPU time is being Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,spent in your container, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,how much memory is being used\Nby your container. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] But not about the application,\Njust about the container usage? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] Right. Because the container\Nhas no idea Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,whether your application is written\Nin Ruby or Go or Python or whatever, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you have to build that into\Nyour application in order to get the data. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So for Prometheus, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we've written client libraries that can be\Nincluded in your application directly Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you can get that data out. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you go to the Prometheus website,\Nwe have a whole series of client libraries Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we cover a pretty good selection\Nof popular software. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] What is the current state of\Nlong-term data storage?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] Very good question. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's been several… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's actually several different methods\Nof doing this. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus stores all this data locally\Nin its own data storage Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,on the local disk. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But that's only as durable as\Nthat server is durable. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a really durable server, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can store as much data as you want, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can store years and years of data\Nlocally on the Prometheus server. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's not a problem. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a bunch of misconceptions because\Nof our default Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the language on our website said Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"It's not long-term storage" Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,simply because we leave that problem\Nup to the person running the server. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But the time series database\Nthat Prometheus includes Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is actually quite durable. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's only as durable as the server\Nunderneath it. 
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a very large cluster and\Nyou want really high durability, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you need to have some kind of\Ncluster software, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but because we want Prometheus to be\Nsimple to deploy Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and very simple to operate Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and also very robust. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We didn't want to include any clustering\Nin Prometheus itself, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because anytime you have a clustered\Nsoftware, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what happens if your network is\Na little wonky? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The first thing that goes down is\Nall of your distributed systems fail. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And building distributed systems to be\Nreally robust is really hard Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so Prometheus is what we call\N"uncoordinated distributed systems". Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got two Prometheus servers\Nmonitoring all your targets in an HA mode Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,in a cluster, and there's a split brain, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,each Prometheus can see\Nhalf of the cluster and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it can see that the other half\Nof the cluster is down.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They can both try to get alerts out\Nto the alert manager Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is a really really robust way of\Nhandling split brains Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and bad network failures and bad problems\Nin a cluster. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's designed to be super super robust Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and so the two individual\NPrometheus servers in your cluster Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,don't have to talk to each other\Nto do this, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,they can just do it independently. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But if you want to be able\Nto correlate data Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,between many different Prometheus servers Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you need an external data storage\Nto do this. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And also you may not have\Nvery big servers, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you might be running your Prometheus\Nin a container Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's only got a little bit of local\Nstorage space Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you want to send all that data up\Nto a big cluster datastore Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for ??? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We have several different ways of\Ndoing this.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's the classic way which is called\Nfederation Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,where you have one Prometheus server\Npulling in summary data from Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,each of the individual Prometheus servers Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is useful if you want to run\Nalerts against data coming Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from multiple Prometheus servers. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But federation is not replication. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It only can do a little bit of data from\Neach Prometheus server. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got a million metrics on\Neach Prometheus server, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can't pull in a million metrics\Nand do… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got 10 of those, you can't\Npull in 10 million metrics Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,simultaneously into one Prometheus\Nserver. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just too much data. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There is two others, a couple of other\Nnice options. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a piece of software called\NCortex. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Cortex is a Prometheus server that\Nstores its data in a database. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Specifically, a distributed database. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Things that are based on the Google\NBigtable model, like Cassandra or… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What's the Amazon one?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Yeah. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,DynamoDB. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a DynamoDB or a Cassandra\Ncluster, or one of these other Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,really big distributed storage clusters, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Cortex can run and the Prometheus servers\Nwill stream their data up to Cortex Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it will keep a copy of that across\Nall of your Prometheus servers. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And because it's based on things\Nlike Cassandra, Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's super scalable. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's a little complex to run and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,many people don't want to run that\Ncomplex infrastructure. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We have another new one, we just blogged\Nabout it yesterday. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a thing called Thanos. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Thanos is Prometheus at scale. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Basically, the way it works… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Actually, why don't I bring that up? Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This was developed by a company\Ncalled Improbable Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and they wanted to… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They had billions of metrics coming from\Nhundreds of Prometheus servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They developed this in collaboration with\Nthe Prometheus team to build Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a super highly scalable Prometheus server. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus itself stores the incoming\Nmetrics data in ??? log Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then every two hours, it creates\Na compaction cycle Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it creates an immutable series block\Nof data which is Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,all the time series blocks themselves Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then an index into that data. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Those two hour windows are all immutable Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so ??? has a little sidecar binary that\Nwatches for those new directories and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,uploads them into a blob store. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you could put them in S3 or minio or\Nsome other simple object storage. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then now you have all of your data,\Nall of this index data already Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,ready to go Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then the final sidecar creates\Na little mesh cluster that can read from Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,all of those S3 blocks.
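The "watch for new block directories and upload them" idea described above might look roughly like the following Python sketch. It is a toy under stated assumptions, not the real sidecar: a second local directory stands in for the S3 or minio bucket, the block names are made up, and "a directory that contains an index file" is a simplified stand-in for "a finished block".

```python
import shutil
import tempfile
from pathlib import Path


def find_new_blocks(data_dir, uploaded):
    """Return finished block directories that have not been uploaded yet.

    Assumption: a finished block is a directory containing an 'index' file.
    """
    return [
        entry
        for entry in sorted(data_dir.iterdir())
        if entry.is_dir()
        and (entry / "index").exists()
        and entry.name not in uploaded
    ]


def upload_new_blocks(data_dir, bucket_dir, uploaded):
    """Copy each new block into the stand-in bucket and remember its name."""
    shipped = []
    for block in find_new_blocks(data_dir, uploaded):
        # Blocks never change once written, so a plain copy is enough.
        shutil.copytree(block, bucket_dir / block.name)
        uploaded.add(block.name)
        shipped.append(block.name)
    return shipped


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    bucket_dir = Path(tmp) / "bucket"
    bucket_dir.mkdir()

    # Two finished blocks (hypothetical names) and one in-progress directory.
    for name in ("block-0001", "block-0002"):
        (data_dir / name).mkdir(parents=True)
        (data_dir / name / "index").write_text("index placeholder")
    (data_dir / "wal").mkdir()

    uploaded = set()
    first_pass = upload_new_blocks(data_dir, bucket_dir, uploaded)
    second_pass = upload_new_blocks(data_dir, bucket_dir, uploaded)
    in_bucket = sorted(p.name for p in bucket_dir.iterdir())

print(first_pass, second_pass, in_bucket)
```

Because finished blocks are immutable, the uploader only has to track which names it has already shipped; it never needs to re-sync a block.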
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, you have this super global view\Nall stored in a big bucket storage and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,things like S3 or minio are… Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Bucket storage is not databases so they're\Noperationally a little easier to operate. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Plus, now we have all this data in\Na bucket store and Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the Thanos sidecars can talk to each other Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can now have a single entry point. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can query Thanos and Thanos will\Ndistribute your query Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,across all your Prometheus servers. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So now you can do global queries across\Nall of your servers. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's very new, they just released\Ntheir first release candidate yesterday. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It is looking to be like\Nthe coolest thing ever Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for running large scale Prometheus. Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's an example of how that is laid out.
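The single entry point that distributes a query across all servers can be pictured with a toy fan-out in Python. This is a sketch of the general idea only, not how Thanos is implemented: the servers are plain in-memory dictionaries, and the server names, metric name, and label values are all hypothetical.

```python
from collections import defaultdict


def fan_out_query(servers, metric):
    """Ask every server for one metric and collect all samples in one list."""
    results = []
    for server_name, samples in servers.items():
        for labels, value in samples.get(metric, []):
            results.append((server_name, labels, value))
    return results


def sum_by_label(results, label):
    """Aggregate the merged samples by a single label, across all servers."""
    totals = defaultdict(float)
    for _server, labels, value in results:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical data: each server only knows about its own targets.
servers = {
    "prometheus-a": {
        "http_requests_total": [
            ({"service": "api"}, 120.0),
            ({"service": "web"}, 30.0),
        ]
    },
    "prometheus-b": {
        "http_requests_total": [({"service": "api"}, 80.0)],
    },
}

global_view = fan_out_query(servers, "http_requests_total")
totals = sum_by_label(global_view, "service")
print(totals)
```

The caller sees one merged answer and does not need to know which server held which series, which is the point of having a single entry point.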