So, we had a talk by a non-GitLab person about GitLab. Now we have a talk by a GitLab person about non-GitLab. Something like that? The CCCHH hackerspace is now open, from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

[Applause]

Alright, so my name is Ben Kochie. I work on DevOps features for GitLab, and apart from working for GitLab, I also work on the open source Prometheus project. I live in Berlin and I've been using Debian since ???, yes, quite a long time.

So, what is metrics-based monitoring? If you're running software in production, you probably want to monitor it, because if you don't monitor it, you don't know if it's right. ??? break down into two categories: there's blackbox monitoring and there's whitebox monitoring.

Blackbox monitoring is treating your software like a black box. It just checks to see, like, is it responding, or does it ping, or ??? HTTP requests.

[mic turned on]

Ah, there we go, much better. So, blackbox monitoring is a probe; it just kind of looks at your software from the outside, it has no knowledge of the internals, and it's really good for end-to-end testing. So if you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, and you go all the way through to the back of the stack and all the way back out, so you know that everything is working end to end. But you only know about it for that one request.

So in order to find out if your service is working end to end for every single request, you need whitebox instrumentation. Basically, every event that happens inside your software, inside the serving stack, gets collected and gets counted, so you know that every request hits the load balancer, every request hits your application server, every request hits the database. You know that everything matches up, and this is called whitebox, or metrics-based, monitoring. There are different examples of the kind of software that does blackbox and whitebox monitoring.
You have software like Nagios where you can configure checks, or Pingdom; Pingdom will ping your website. And then there's metrics-based monitoring: things like Prometheus, things like the TICK stack from InfluxData, New Relic and other commercial solutions, but of course I like to talk about the open source solutions. We're gonna talk a little bit about Prometheus.

Prometheus came out of the idea that we needed a monitoring system that could collect all this whitebox metric data and do something useful with it. Not just give us a pretty graph; we also want to be able to alert on it. So we needed both a data gathering and an analytics system in the same instance.

To do this, we built this thing, and we looked at the way that data was being generated by the applications. There are advantages and disadvantages to the push vs. poll model for metrics. We decided to go with the polling model because there are some slight advantages to polling over pushing.

With polling, you get a free blackbox check that the application is running. When you poll your application, you know that the process is running. If you are doing push-based monitoring, you can't tell the difference between your application doing no work and your application not running. So you don't know if it's stuck, or if it just doesn't have any work to do.

With polling, the polling system knows the state of your network. If you have a defined set of services, that inventory drives what should be there. Again, it's the disappearing problem: is the process dead, or is it just not doing anything? With polling, you know for a fact what processes should be there, and that's a bit of an advantage.

With polling, there's really easy testing. With push-based metrics, if you want to test a new version of the monitoring system, or you want to test something new, you have to ??? a copy of the data. With polling, you can just set up another instance of your monitoring and test it. Or it doesn't even have to be monitoring; you can just use curl to poll the metrics endpoint. It's significantly easier to test.

The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is.
It doesn't have to know about ???. It just has to sit and collect the data about itself. So it doesn't have to know anything about the topology of the network. As an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software; you can just implement it inside your application, and the monitoring software, whether it's Prometheus or something else, can just come and collect that data for you. That's kind of similar to a very old monitoring system called SNMP, but SNMP has a significantly less friendly data model for developers.

This is the basic layout of a Prometheus server. At the core there's the Prometheus server, and it deals with all the data collection and analytics. It's basically this one binary, all written in Go; it's a single binary. It knows how to read from your inventory, and there are a bunch of different methods, whether you've got a Kubernetes cluster or a cloud platform, or you have your own customized thing with Ansible. Ansible can take your layout, drop it into a config file, and Prometheus can pick that up.

Once it has the layout, it goes out and collects all the data. It has a storage layer, a time series database, to store all that data locally. It has a thing called PromQL, which is a query language designed for metrics and analytics. On top of PromQL you can add frontends, whether it's a simple API client to run reports, or something like Grafana for creating dashboards; it's also got a simple web UI built in. You can plug in anything you want on that side.

And then it also has the ability to continuously execute queries, called "recording rules", and these rules have two different modes. You can take a query and record it, so it generates new data from that query, or you can take a query and, if it returns results, it fires an alert. That alert is a push message to the Alertmanager. This allows us to separate the generating of alerts from the routing of alerts.
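To make that wiring concrete, here is a minimal sketch of a prometheus.yml that reads its inventory from a file (for example one that an Ansible run writes out), loads rule files, and pushes alerts to an Alertmanager. The paths, ports, and addresses are purely illustrative.

    # prometheus.yml, minimal sketch; paths and addresses are illustrative
    global:
      scrape_interval: 15s            # how often each target gets polled

    rule_files:
      - /etc/prometheus/rules/*.yml   # recording and alerting rules

    scrape_configs:
      - job_name: node
        file_sd_configs:              # inventory dropped in by config management
          - files:
              - /etc/prometheus/targets/node.yml

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager.example.com:9093']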
You can have one or hundreds of Prometheus servers, all generating alerts, and it all goes into an Alertmanager cluster, which does the deduplication and the routing to the human. Because, of course, what we used to have was dashboards with graphs, but in order to find out if something was broken you had to have a human looking at the graph. With Prometheus, we don't have to do that anymore; we can simply let the software tell us that we need to go investigate our problems. We don't have to sit there and stare at dashboards all day, because that's really boring.

What does it look like to actually get data into Prometheus? This is a very basic output of a Prometheus metric, a very simple thing. If you know much about the Linux kernel, the kernel tracks ??? stats, the state of all the CPUs in your system, and we express this by having the name of the metric, which is 'node_cpu_seconds_total'. This is a self-describing metric: you can just read the metric name and you understand a little bit about what's going on here. The Linux kernel and other kernels track their usage by the number of seconds spent doing different things, whether that's in system or user space, or IRQs, or iowait, or idle. The kernel actually tracks how much idle time it has. It also tracks it by the number of CPUs.

Other monitoring systems used to do this with a tree structure, and that caused a lot of problems, like: how do you mix and match data? By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics. Here's a nice example of taking those CPU seconds counters and then converting them into a graph by using PromQL (a sketch of both the metric output and such a query follows at the end of this passage).

Now we can get into metrics-based alerting. Now that we have this graph, we have something we can look at and see, "Oh, there's a little spike here, we might want to know about that."

I used to be a site reliability engineer, and I'm still a site reliability engineer at heart, and we have this concept of the things that you need to run a site or a service reliably. The most important thing you need is down at the bottom: monitoring. Because if you don't have monitoring of your service, how do you know it's even working?
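As referenced above, here is roughly what that metric looks like on a node exporter's metrics endpoint, and the kind of PromQL query that turns those counters into a CPU usage graph. The sample values are made up.

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"}   18423.71
    node_cpu_seconds_total{cpu="0",mode="user"}     583.16
    node_cpu_seconds_total{cpu="0",mode="system"}   232.45

    # PromQL: per-mode CPU usage as a per-second rate over the last 5 minutes
    rate(node_cpu_seconds_total{mode!="idle"}[5m])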
There are a couple of techniques here, and we want to alert based on data, not just those end-to-end tests. There's a thing called the RED method and a thing called the USE method, and there are some nice blog posts about this. Basically, the RED method, for example, talks about the requests that your system is handling. There are three things: the number of requests, the number of errors, and how long they take, the duration. With the combination of these three things you can determine most of what your users see: "Did my request go through? Did it return an error? Was it fast?" For most people, that's all they care about: "I made a request to a website and it came back and it was fast." It's a very simple method; those are the important things to determine if your site is healthy.

But we can go back to some more traditional, sysadmin-style alerts. This one basically takes the filesystem available space, divided by the filesystem size, which gives the ratio of filesystem availability, from 0 to 1. Multiply it by 100 and we now have a percentage, and if it's less than or equal to 1% for 15 minutes, so less than 1% of space left, we should tell a sysadmin to go check the ??? filesystem ???. It's super nice and simple (a sketch of this as an alerting rule follows below).

We can also include labels: every alert includes all the extraneous labels that Prometheus adds to your metrics. When you add a metric in Prometheus, if we go back and look at that metric, it only contains information about the internals of the application. Anything about what server it's on, whether it's running in a container, what cluster it comes from, what ??? it is on, that's all extra annotation added by the Prometheus server at discovery time. I don't have a good example of what those labels look like, but every metric gets annotated with location information. That location information also comes through as labels on the alert, so if you have a message coming into your Alertmanager, the Alertmanager can look at it and go "Oh, that's coming from this datacenter" and include that in the email or IRC message or SMS message.
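Here is that filesystem check sketched as a Prometheus alerting rule, assuming the standard node_exporter metric names; the severity label and the summary template are illustrative.

    groups:
      - name: node.rules
        rules:
          - alert: FilesystemAlmostFull
            # available space as a percentage of total size, per filesystem
            expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
            for: 15m                  # must stay at or below 1% for 15 minutes
            labels:
              severity: critical
            annotations:
              summary: "Less than 1% space left on {{ $labels.mountpoint }} ({{ $labels.instance }})"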
So you can include "Filesystem is out of space on this host in this datacenter." All these labels get passed through, and then you can append "severity: critical" to that alert and include that in the message to the human, because of course this is how you define getting the message from the monitoring to the human. You can even include nice things like a link to your documentation as an annotation, and the Alertmanager can take that base URL and massage it into whatever it needs to look like to actually get the operator to the correct documentation.

We can also do more fun things. Since we're not just checking what the space is right now, we're tracking data over time, we can use 'predict_linear'. 'predict_linear' just does a simple linear regression. This example takes the filesystem available space over the last hour and does a linear regression. The prediction says "Well, it's going that way, and four hours from now, based on one hour of history, it's gonna be less than zero, which means full." So we know that within the next four hours the disk is gonna be full, and we can tell the operator ahead of time that it's going to be full, not just tell them that it's full right now. They have some window of ability to fix it before it fails.

This is really important, because if you're running a site, you want alerts that tell you that your system is failing before it actually fails. Because if it fails, you're out of SLO or SLA, your users are gonna be unhappy, and you don't want the users to tell you that your site is down; you want to know about it before your users can even tell. This allows you to do that. And also, of course, Prometheus being a modern system, we fully support UTF-8 in all of our labels.

Here's another one, a good example from the USE method. This is a rate of 500 errors coming from an application, and you can simply alert if there are more than 500 errors per second coming out of the application, if that's your threshold for ???. And you can do other things: you can convert that from just a rate of errors to a percentage of errors. So you could say "I have an SLA of three nines," and then say "If the rate of errors divided by the rate of requests is 0.01, or more than 0.01, that's a problem."
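Sketched as alerting rules, the predict_linear check and the error-ratio check might look like the following. node_filesystem_avail_bytes is the standard node_exporter metric; http_requests_total with a status label is a hypothetical application counter standing in for whatever your instrumentation exposes.

    groups:
      - name: capacity-and-errors
        rules:
          - alert: FilesystemWillFillInFourHours
            # linear regression over the last hour, projected 4 hours ahead
            expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
            labels:
              severity: warning

          - alert: HighErrorRatio
            # errors as a fraction of all requests over the last 5 minutes
            expr: |
              sum(rate(http_requests_total{status="500"}[5m]))
                /
              sum(rate(http_requests_total[5m])) > 0.01
            labels:
              severity: critical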
You can include that level of error granularity. And if you're just doing a blackbox test, you wouldn't know this. You would only get: you got an error from the system, then you got another error from the system, then you fire an alert. But if those checks are one minute apart and you're serving 1,000 requests per second, you could be serving 10,000 errors before you even get an alert. And you might miss it entirely, because what if you only get one random error, and then the next time you're serving 25% errors? You only have a 25% chance of that check failing again. You really need these metrics in order to get proper reports of the status of your system.

There are even more options: you can slice and dice those labels. If you have a label on all of your applications called 'service', you can send that 'service' label through to the message and say "Hey, this service is broken." You can include that service label in your alert messages.

And that's it. I can go to a demo and Q&A.

[Applause]

Any questions so far? Or does anybody want to see a demo?

[Q] Hi. Does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

[A] For metrics in containers, there are already things that expose the metrics of the container system itself. There's a utility called 'cadvisor', and cadvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time is being spent in your container and how much memory is being used by your container.

[Q] But not about the application, just about the container usage?

[A] Right. Because the container has no idea whether your application is written in Ruby or Go or Python or whatever, you have to build that into your application in order to get the data. So for Prometheus, we've written client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, we have a whole series of client libraries, and we cover a pretty good selection of popular software.

[Q] What is the current state of long-term data storage?

[A] Very good question. There are actually several different methods of doing this. Prometheus stores all this data locally in its own data storage on the local disk.
But that's only as durable as that server is durable. So if you've got a really durable server, you can store as much data as you want; you can store years and years of data locally on the Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and the language on our website said "it's not long-term storage", simply because we leave that problem up to the person running the server. But the time series database that Prometheus includes is actually quite durable. It's just only as durable as the server underneath it.

So if you've got a very large cluster and you want really high durability, you need some kind of cluster software. But we want Prometheus to be simple to deploy, very simple to operate, and also very robust, so we didn't want to include any clustering in Prometheus itself, because any time you have clustered software, what happens if your network is a little wonky? The first thing that goes down is all of your distributed systems fail. And building distributed systems to be really robust is really hard, so Prometheus is what we call an uncoordinated distributed system.

If you've got two Prometheus servers monitoring all your targets in an HA mode in a cluster, and there's a split brain, each Prometheus can see half of the cluster, and it can see that the other half of the cluster is down. They can both try to get alerts out to the Alertmanager, and this is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super, super robust, and the two individual Prometheus servers in your cluster don't have to talk to each other to do this; they can just do it independently.

But if you want to be able to correlate data between many different Prometheus servers, you need an external data store to do this. And also, you may not have very big servers; you might be running your Prometheus in a container, and it's only got a little bit of local storage space, so you want to send all that data up to a big cluster datastore for ???. We have several different ways of doing this.
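One of those ways is Prometheus's remote write and remote read interface, which streams samples out to an external store and lets queries reach back into it. A minimal sketch, with a hypothetical endpoint, looks like this in prometheus.yml:

    remote_write:
      - url: http://long-term-store.example.com/api/v1/write   # hypothetical remote storage endpoint

    remote_read:
      - url: http://long-term-store.example.com/api/v1/read    # lets PromQL read data back from the remote store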
There's the classic way, which is called federation, where you have one Prometheus server polling in summary data from each of the individual Prometheus servers. This is useful if you want to run alerts against data coming from multiple Prometheus servers. But federation is not replication. It can only pull a little bit of data from each Prometheus server. If you've got a million metrics on each Prometheus server, you can't pull in a million metrics, and if you've got 10 of those, you can't pull in 10 million metrics simultaneously into one Prometheus server. It's just too much data.
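For reference, a federation scrape job on the global Prometheus looks roughly like this; the match[] selector and the target addresses are illustrative, and the usual practice is to pull only pre-aggregated summary series rather than everything.

    scrape_configs:
      - job_name: federate
        honor_labels: true              # keep labels as set by the source Prometheus servers
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'    # e.g. only series produced by recording rules
        static_configs:
          - targets:
              - prometheus-dc1.example.com:9090
              - prometheus-dc2.example.com:9090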