[Script Info]
Title: 
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:05.90,0:00:10.53,Default,,0000,0000,0000,,So, we had a talk by a non-GitLab person\Nabout GitLab.
Dialogue: 0,0:00:10.53,0:00:13.06,Default,,0000,0000,0000,,Now, we have a talk by a GitLab person\Non non-GitLab.
Dialogue: 0,0:00:13.20,0:00:14.60,Default,,0000,0000,0000,,Something like that?
Dialogue: 0,0:00:15.89,0:00:19.39,Default,,0000,0000,0000,,The CCCHH hackerspace is now open,
Dialogue: 0,0:00:19.95,0:00:22.12,Default,,0000,0000,0000,,from now on if you want to go there,\Nthat's the announcement.
Dialogue: 0,0:00:22.47,0:00:25.87,Default,,0000,0000,0000,,And the next talk will be by Ben Kochie
Dialogue: 0,0:00:26.01,0:00:28.26,Default,,0000,0000,0000,,on metrics-based monitoring\Nwith Prometheus.
Dialogue: 0,0:00:28.75,0:00:30.21,Default,,0000,0000,0000,,Welcome.
Dialogue: 0,0:00:30.54,0:00:33.13,Default,,0000,0000,0000,,[Applause]
Dialogue: 0,0:00:35.40,0:00:36.58,Default,,0000,0000,0000,,Alright, so
Dialogue: 0,0:00:36.89,0:00:39.37,Default,,0000,0000,0000,,my name is Ben Kochie
Dialogue: 0,0:00:39.84,0:00:43.87,Default,,0000,0000,0000,,I work on DevOps features for GitLab
Dialogue: 0,0:00:44.33,0:00:48.29,Default,,0000,0000,0000,,and apart from working for GitLab, I also work\Non the open source Prometheus project.
Dialogue: 0,0:00:51.16,0:00:53.60,Default,,0000,0000,0000,,I live in Berlin and I've been using\NDebian since ???
Dialogue: 0,0:00:54.35,0:00:56.80,Default,,0000,0000,0000,,yes, quite a long time.
Dialogue: 0,0:00:58.81,0:01:01.02,Default,,0000,0000,0000,,So, what is Metrics-based Monitoring?
Dialogue: 0,0:01:02.64,0:01:05.16,Default,,0000,0000,0000,,If you're running software in production,
Dialogue: 0,0:01:05.88,0:01:07.83,Default,,0000,0000,0000,,you probably want to monitor it,
Dialogue: 0,0:01:08.21,0:01:10.55,Default,,0000,0000,0000,,because if you don't monitor it, you don't\Nknow if it's right.
Dialogue: 0,0:01:13.28,0:01:16.11,Default,,0000,0000,0000,,??? break down into two categories:
Dialogue: 0,0:01:16.11,0:01:19.15,Default,,0000,0000,0000,,there's blackbox monitoring and\Nthere's whitebox monitoring.
Dialogue: 0,0:01:19.50,0:01:24.58,Default,,0000,0000,0000,,Blackbox monitoring is treating\Nyour software like a blackbox.
Dialogue: 0,0:01:24.76,0:01:27.16,Default,,0000,0000,0000,,It's just checks to see, like,
Dialogue: 0,0:01:27.45,0:01:29.48,Default,,0000,0000,0000,,is it responding, or does it ping
Dialogue: 0,0:01:30.02,0:01:33.59,Default,,0000,0000,0000,,or ??? HTTP requests
Dialogue: 0,0:01:34.35,0:01:35.67,Default,,0000,0000,0000,,[mic turned on]
Dialogue: 0,0:01:37.76,0:01:41.38,Default,,0000,0000,0000,,Ah, there we go, much better.
Dialogue: 0,0:01:46.59,0:01:51.90,Default,,0000,0000,0000,,So, blackbox monitoring is a probe,
Dialogue: 0,0:01:51.90,0:01:54.68,Default,,0000,0000,0000,,it just kind of looks from the outside\Nto your software
Dialogue: 0,0:01:55.45,0:01:57.43,Default,,0000,0000,0000,,and it has no knowledge of the internals
Dialogue: 0,0:01:58.13,0:02:00.70,Default,,0000,0000,0000,,and it's really good for end to end testing.
Dialogue: 0,0:02:00.94,0:02:03.56,Default,,0000,0000,0000,,So if you've got a fairly complicated\Nservice,
Dialogue: 0,0:02:03.99,0:02:06.43,Default,,0000,0000,0000,,you come in from the outside, you go\Nthrough the load balancer,
Dialogue: 0,0:02:06.72,0:02:07.98,Default,,0000,0000,0000,,you hit the API server,
Dialogue: 0,0:02:07.98,0:02:10.15,Default,,0000,0000,0000,,the API server might hit a database,
Dialogue: 0,0:02:10.68,0:02:13.05,Default,,0000,0000,0000,,and you go all the way through\Nto the back of the stack
Dialogue: 0,0:02:13.22,0:02:14.54,Default,,0000,0000,0000,,and all the way back out
Dialogue: 0,0:02:14.56,0:02:16.29,Default,,0000,0000,0000,,so you know that everything is working\Nend to end.
Dialogue: 0,0:02:16.52,0:02:18.77,Default,,0000,0000,0000,,But you only know about it\Nfor that one request.
Dialogue: 0,0:02:19.04,0:02:22.43,Default,,0000,0000,0000,,So in order to find out if your service\Nis working,
Dialogue: 0,0:02:22.83,0:02:27.13,Default,,0000,0000,0000,,from the end to end, for every single\Nrequest,
Dialogue: 0,0:02:27.48,0:02:29.52,Default,,0000,0000,0000,,this requires whitebox instrumentation.
Dialogue: 0,0:02:29.84,0:02:33.96,Default,,0000,0000,0000,,So, basically, every event that happens\Ninside your software,
Dialogue: 0,0:02:33.97,0:02:36.52,Default,,0000,0000,0000,,inside a serving stack,
Dialogue: 0,0:02:36.82,0:02:39.81,Default,,0000,0000,0000,,gets collected and gets counted,
Dialogue: 0,0:02:40.04,0:02:43.68,Default,,0000,0000,0000,,so you know that every request hits\Nthe load balancer,
Dialogue: 0,0:02:45.22,0:02:45.66,Default,,0000,0000,0000,,every request hits your application\Nservice,
Dialogue: 0,0:02:45.97,0:02:47.33,Default,,0000,0000,0000,,every request hits the database.
Dialogue: 0,0:02:47.79,0:02:50.83,Default,,0000,0000,0000,,You know that everything matches up
Dialogue: 0,0:02:51.00,0:02:55.76,Default,,0000,0000,0000,,and this is called whitebox, or\Nmetrics-based monitoring.
Dialogue: 0,0:02:56.01,0:02:57.69,Default,,0000,0000,0000,,There are different examples of, like,
Dialogue: 0,0:02:57.91,0:03:02.39,Default,,0000,0000,0000,,the kind of software that does blackbox\Nand whitebox monitoring.
Dialogue: 0,0:03:02.57,0:03:06.68,Default,,0000,0000,0000,,So you have software like Nagios that\Nyou can configure checks
Dialogue: 0,0:03:08.83,0:03:10.01,Default,,0000,0000,0000,,or Pingdom,
Dialogue: 0,0:03:10.21,0:03:12.35,Default,,0000,0000,0000,,Pingdom will ping your website.
Dialogue: 0,0:03:12.97,0:03:15.31,Default,,0000,0000,0000,,And then there is metrics-based monitoring,
Dialogue: 0,0:03:15.52,0:03:19.29,Default,,0000,0000,0000,,things like Prometheus, things like\Nthe TICK stack from InfluxData,
Dialogue: 0,0:03:19.61,0:03:22.73,Default,,0000,0000,0000,,New Relic and other commercial solutions
Dialogue: 0,0:03:23.03,0:03:25.48,Default,,0000,0000,0000,,but of course I like to talk about\Nthe open source solutions.
Dialogue: 0,0:03:25.75,0:03:28.38,Default,,0000,0000,0000,,We're gonna talk a little bit about\NPrometheus.
Dialogue: 0,0:03:28.82,0:03:31.96,Default,,0000,0000,0000,,Prometheus came out of the idea that
Dialogue: 0,0:03:32.34,0:03:37.56,Default,,0000,0000,0000,,we needed a monitoring system that could\Ncollect all this whitebox metric data
Dialogue: 0,0:03:37.94,0:03:40.79,Default,,0000,0000,0000,,and do something useful with it.
Dialogue: 0,0:03:40.92,0:03:42.67,Default,,0000,0000,0000,,Not just give us a pretty graph, but\Nwe also want to be able to
Dialogue: 0,0:03:42.98,0:03:44.19,Default,,0000,0000,0000,,alert on it.
Dialogue: 0,0:03:44.19,0:03:45.99,Default,,0000,0000,0000,,So we needed both
Dialogue: 0,0:03:49.87,0:03:54.07,Default,,0000,0000,0000,,a data gathering and an analytics system\Nin the same instance.
Dialogue: 0,0:03:54.15,0:03:58.82,Default,,0000,0000,0000,,To do this, we built this thing and\Nwe looked at the way that
Dialogue: 0,0:03:59.01,0:04:01.84,Default,,0000,0000,0000,,data was being generated\Nby the applications
Dialogue: 0,0:04:02.37,0:04:05.20,Default,,0000,0000,0000,,and there are advantages and\Ndisadvantages to this
Dialogue: 0,0:04:05.20,0:04:07.25,Default,,0000,0000,0000,,push vs. poll model for metrics.
Dialogue: 0,0:04:07.38,0:04:09.70,Default,,0000,0000,0000,,We decided to go with the polling model
Dialogue: 0,0:04:09.94,0:04:13.95,Default,,0000,0000,0000,,because there are some slight advantages\Nfor polling over pushing.
Dialogue: 0,0:04:16.32,0:04:18.16,Default,,0000,0000,0000,,With polling, you get this free\Nblackbox check
Dialogue: 0,0:04:18.47,0:04:20.15,Default,,0000,0000,0000,,that the application is running.
Dialogue: 0,0:04:20.53,0:04:24.32,Default,,0000,0000,0000,,When you poll your application, you know\Nthat the process is running.
Dialogue: 0,0:04:24.53,0:04:27.53,Default,,0000,0000,0000,,If you are doing push-based, you can't\Ntell the difference between
Dialogue: 0,0:04:27.85,0:04:31.52,Default,,0000,0000,0000,,your application doing no work and\Nyour application not running.
Dialogue: 0,0:04:32.42,0:04:33.90,Default,,0000,0000,0000,,So you don't know if it's stuck,
Dialogue: 0,0:04:34.14,0:04:37.88,Default,,0000,0000,0000,,or is it just not having to do any work.
Dialogue: 0,0:04:42.67,0:04:48.94,Default,,0000,0000,0000,,With polling, the polling system knows\Nthe state of your network.
Dialogue: 0,0:04:49.85,0:04:52.52,Default,,0000,0000,0000,,If you have a defined set of services,
Dialogue: 0,0:04:52.89,0:04:56.79,Default,,0000,0000,0000,,that inventory drives what should be there.
Dialogue: 0,0:04:58.27,0:05:00.08,Default,,0000,0000,0000,,Again, it's like, the disappearing,
Dialogue: 0,0:05:00.29,0:05:03.95,Default,,0000,0000,0000,,is the process dead, or is it just\Nnot doing anything?
Dialogue: 0,0:05:04.20,0:05:07.12,Default,,0000,0000,0000,,With polling, you know for a fact\Nwhat processes should be there,
Dialogue: 0,0:05:07.59,0:05:10.90,Default,,0000,0000,0000,,and it's a bit of an advantage there.
Dialogue: 0,0:05:11.14,0:05:12.91,Default,,0000,0000,0000,,With polling, there's really easy testing.
Dialogue: 0,0:05:13.12,0:05:16.30,Default,,0000,0000,0000,,With push-based metrics, you have to\Nfigure out
Dialogue: 0,0:05:16.50,0:05:18.84,Default,,0000,0000,0000,,if you want to test a new version of\Nthe monitoring system or
Dialogue: 0,0:05:19.06,0:05:21.26,Default,,0000,0000,0000,,you want to test something new,
Dialogue: 0,0:05:21.42,0:05:24.13,Default,,0000,0000,0000,,you have to ??? a copy of the data.
Dialogue: 0,0:05:24.37,0:05:27.65,Default,,0000,0000,0000,,With polling, you can just set up\Nanother instance of your monitoring
Dialogue: 0,0:05:27.86,0:05:29.19,Default,,0000,0000,0000,,and just test it.
Dialogue: 0,0:05:29.71,0:05:31.32,Default,,0000,0000,0000,,Or you don't even have,
Dialogue: 0,0:05:31.47,0:05:33.19,Default,,0000,0000,0000,,it doesn't even have to be monitoring,\Nyou can just use curl
Dialogue: 0,0:05:33.20,0:05:35.98,Default,,0000,0000,0000,,to poll the metrics endpoint.
Dialogue: 0,0:05:38.42,0:05:40.44,Default,,0000,0000,0000,,It's significantly easier to test.
Dialogue: 0,0:05:40.44,0:05:42.98,Default,,0000,0000,0000,,The other thing with the…
Dialogue: 0,0:05:46.00,0:05:48.11,Default,,0000,0000,0000,,The other nice thing is that\Nthe client is really simple.
Dialogue: 0,0:05:48.48,0:05:51.07,Default,,0000,0000,0000,,The client doesn't have to know\Nwhere the monitoring system is.
Dialogue: 0,0:05:51.27,0:05:53.67,Default,,0000,0000,0000,,It doesn't have to know about ???
Dialogue: 0,0:05:53.82,0:05:55.72,Default,,0000,0000,0000,,It just has to sit and collect the data\Nabout itself.
Dialogue: 0,0:05:55.88,0:05:58.71,Default,,0000,0000,0000,,So it doesn't have to know anything about\Nthe topology of the network.
Dialogue: 0,0:05:59.13,0:06:03.36,Default,,0000,0000,0000,,As an application developer, if you're\Nwriting a DNS server or
Dialogue: 0,0:06:03.72,0:06:05.57,Default,,0000,0000,0000,,some other piece of software,
Dialogue: 0,0:06:05.90,0:06:09.56,Default,,0000,0000,0000,,you don't have to know anything about\Nmonitoring software,
Dialogue: 0,0:06:09.80,0:06:12.22,Default,,0000,0000,0000,,you can just implement it inside\Nyour application and
Dialogue: 0,0:06:12.68,0:06:17.06,Default,,0000,0000,0000,,the monitoring software, whether it's\NPrometheus or something else,
Dialogue: 0,0:06:17.41,0:06:19.33,Default,,0000,0000,0000,,can just come and collect that data for you.
Dialogue: 0,0:06:20.21,0:06:23.61,Default,,0000,0000,0000,,That's kind of similar to a very old\Nmonitoring system called SNMP,
Dialogue: 0,0:06:23.83,0:06:28.53,Default,,0000,0000,0000,,but SNMP has a significantly less friendly\Ndata model for developers.
Dialogue: 0,0:06:30.01,0:06:33.56,Default,,0000,0000,0000,,This is the basic layout\Nof a Prometheus server.
Dialogue: 0,0:06:33.92,0:06:35.92,Default,,0000,0000,0000,,At the core, there's a Prometheus server
Dialogue: 0,0:06:36.28,0:06:40.30,Default,,0000,0000,0000,,and it deals with all the data collection\Nand analytics.
Dialogue: 0,0:06:42.94,0:06:46.70,Default,,0000,0000,0000,,Basically, this one binary,\Nit's all written in golang.
Dialogue: 0,0:06:46.87,0:06:48.56,Default,,0000,0000,0000,,It's a single binary.
Dialogue: 0,0:06:48.56,0:06:50.82,Default,,0000,0000,0000,,It knows how to read from your inventory,
Dialogue: 0,0:06:50.82,0:06:52.66,Default,,0000,0000,0000,,there's a bunch of different methods,\Nwhether you've got
Dialogue: 0,0:06:53.12,0:06:58.84,Default,,0000,0000,0000,,a Kubernetes cluster or a cloud platform
Dialogue: 0,0:07:00.23,0:07:03.80,Default,,0000,0000,0000,,or you have your own customized thing\Nwith Ansible.
Dialogue: 0,0:07:05.38,0:07:09.75,Default,,0000,0000,0000,,Ansible can take your layout, drop that\Ninto a config file and
Dialogue: 0,0:07:10.64,0:07:11.90,Default,,0000,0000,0000,,Prometheus can pick that up.
Dialogue: 0,0:07:15.59,0:07:18.81,Default,,0000,0000,0000,,Once it has the layout, it goes out and\Ncollects all the data.
Dialogue: 0,0:07:18.84,0:07:24.25,Default,,0000,0000,0000,,It has a storage and a time series\Ndatabase to store all that data locally.
Dialogue: 0,0:07:24.46,0:07:28.23,Default,,0000,0000,0000,,It has a thing called PromQL, which is\Na query language designed
Dialogue: 0,0:07:28.45,0:07:31.03,Default,,0000,0000,0000,,for metrics and analytics.
Dialogue: 0,0:07:31.50,0:07:36.78,Default,,0000,0000,0000,,From that PromQL, you can add frontends\Nthat will,
Dialogue: 0,0:07:36.98,0:07:39.32,Default,,0000,0000,0000,,whether it's a simple API client\Nto run reports,
Dialogue: 0,0:07:40.02,0:07:42.94,Default,,0000,0000,0000,,you can use things like Grafana\Nfor creating dashboards,
Dialogue: 0,0:07:43.12,0:07:44.83,Default,,0000,0000,0000,,it's got a simple webUI built in.
Dialogue: 0,0:07:45.03,0:07:46.92,Default,,0000,0000,0000,,You can plug in anything you want\Non that side.
Dialogue: 0,0:07:48.69,0:07:54.48,Default,,0000,0000,0000,,And then, it also has the ability to\Ncontinuously execute queries
Dialogue: 0,0:07:54.62,0:07:56.19,Default,,0000,0000,0000,,called "recording rules"
Dialogue: 0,0:07:56.83,0:07:59.10,Default,,0000,0000,0000,,and these recording rules have\Ntwo different modes.
Dialogue: 0,0:07:59.10,0:08:01.87,Default,,0000,0000,0000,,You can either record, you can take\Na query
Dialogue: 0,0:08:02.15,0:08:03.71,Default,,0000,0000,0000,,and it will generate new data\Nfrom that query
Dialogue: 0,0:08:04.07,0:08:06.97,Default,,0000,0000,0000,,or you can take a query, and\Nif it returns results,
Dialogue: 0,0:08:07.35,0:08:08.91,Default,,0000,0000,0000,,it will return an alert.
Dialogue: 0,0:08:09.18,0:08:12.51,Default,,0000,0000,0000,,That alert is a push message\Nto the alert manager.
Dialogue: 0,0:08:12.81,0:08:18.97,Default,,0000,0000,0000,,This allows us to separate the generating\Nof alerts from the routing of alerts.
Dialogue: 0,0:08:19.15,0:08:24.26,Default,,0000,0000,0000,,You can have one or hundreds of Prometheus\Nservices, all generating alerts
Dialogue: 0,0:08:24.60,0:08:28.81,Default,,0000,0000,0000,,and it goes into an alert manager cluster,\Nwhich does the deduplication
Dialogue: 0,0:08:29.33,0:08:30.68,Default,,0000,0000,0000,,and the routing to the human
Dialogue: 0,0:08:30.88,0:08:34.14,Default,,0000,0000,0000,,because, of course, the thing\Nthat we want is
Dialogue: 0,0:08:34.93,0:08:38.80,Default,,0000,0000,0000,,we had dashboards with graphs, but\Nin order to find out if something is broken
Dialogue: 0,0:08:38.97,0:08:40.65,Default,,0000,0000,0000,,you had to have a human\Nlooking at the graph.
Dialogue: 0,0:08:40.83,0:08:42.94,Default,,0000,0000,0000,,With Prometheus, we don't have to do that\Nanymore,
Dialogue: 0,0:08:43.10,0:08:47.64,Default,,0000,0000,0000,,we can simply let the software tell us\Nthat we need to go investigate
Dialogue: 0,0:08:47.64,0:08:48.65,Default,,0000,0000,0000,,our problems.
Dialogue: 0,0:08:48.78,0:08:50.83,Default,,0000,0000,0000,,We don't have to sit there and\Nstare at dashboards all day,
Dialogue: 0,0:08:51.04,0:08:52.38,Default,,0000,0000,0000,,because that's really boring.
Dialogue: 0,0:08:54.52,0:08:57.56,Default,,0000,0000,0000,,What does it look like to actually\Nget data into Prometheus?
Dialogue: 0,0:08:57.59,0:09:02.14,Default,,0000,0000,0000,,This is a very basic output\Nof a Prometheus metric.
Dialogue: 0,0:09:02.61,0:09:03.93,Default,,0000,0000,0000,,This is a very simple thing.
Dialogue: 0,0:09:04.09,0:09:07.57,Default,,0000,0000,0000,,If you know much about\Nthe Linux kernel,
Dialogue: 0,0:09:06.88,0:09:12.78,Default,,0000,0000,0000,,the Linux kernel tracks, in /proc/stat,\Nall the state of all the CPUs
Dialogue: 0,0:09:12.78,0:09:14.46,Default,,0000,0000,0000,,in your system
Dialogue: 0,0:09:14.66,0:09:18.08,Default,,0000,0000,0000,,and we express this by having\Nthe name of the metric, which is
Dialogue: 0,0:09:22.45,0:09:26.12,Default,,0000,0000,0000,,'node_cpu_seconds_total' and so\Nthis is a self-describing metric,
Dialogue: 0,0:09:26.55,0:09:28.38,Default,,0000,0000,0000,,like you can just read the metric's name
Dialogue: 0,0:09:28.53,0:09:30.84,Default,,0000,0000,0000,,and you understand a little bit about\Nwhat's going on here.
Dialogue: 0,0:09:33.24,0:09:38.52,Default,,0000,0000,0000,,The Linux kernel and other kernels track\Ntheir usage by the number of seconds
Dialogue: 0,0:09:38.86,0:09:41.00,Default,,0000,0000,0000,,spent doing different things and
Dialogue: 0,0:09:41.20,0:09:46.72,Default,,0000,0000,0000,,that could be, whether it's in system or\Nuser space or IRQs
Dialogue: 0,0:09:47.06,0:09:48.69,Default,,0000,0000,0000,,or iowait or idle.
Dialogue: 0,0:09:48.91,0:09:51.28,Default,,0000,0000,0000,,Actually, the kernel tracks how much\Nidle time it has.
Dialogue: 0,0:09:53.66,0:09:55.31,Default,,0000,0000,0000,,It also tracks it by the number of CPUs.
Dialogue: 0,0:09:56.00,0:10:00.07,Default,,0000,0000,0000,,With other monitoring systems, they used\Nto do this with a tree structure
Dialogue: 0,0:10:01.02,0:10:03.69,Default,,0000,0000,0000,,and this caused a lot of problems,\Nlike,
Dialogue: 0,0:10:03.85,0:10:09.29,Default,,0000,0000,0000,,how do you mix and match data? So\Nby switching from
Dialogue: 0,0:10:10.04,0:10:12.48,Default,,0000,0000,0000,,a tree structure to a tag-based structure,
Dialogue: 0,0:10:12.98,0:10:16.90,Default,,0000,0000,0000,,we can do some really interesting\Npowerful data analytics.
Dialogue: 0,0:10:18.17,0:10:25.17,Default,,0000,0000,0000,,Here's a nice example of taking\Nthose CPU seconds counters
Dialogue: 0,0:10:26.10,0:10:30.20,Default,,0000,0000,0000,,and then converting them into a graph\Nby using PromQL.
Dialogue: 0,0:10:32.72,0:10:34.83,Default,,0000,0000,0000,,Now we can get into\NMetrics-Based Alerting.
Dialogue: 0,0:10:35.32,0:10:37.66,Default,,0000,0000,0000,,Now we have this graph, we have this thing
Dialogue: 0,0:10:37.85,0:10:39.50,Default,,0000,0000,0000,,we can look and see here
Dialogue: 0,0:10:40.00,0:10:42.92,Default,,0000,0000,0000,,"Oh there is some little spike here,\Nwe might want to know about that."
Dialogue: 0,0:10:43.19,0:10:45.85,Default,,0000,0000,0000,,Now we can get into Metrics-Based\NAlerting.
Dialogue: 0,0:10:46.28,0:10:51.13,Default,,0000,0000,0000,,I used to be a site reliability engineer,\NI'm still a site reliability engineer at heart
Dialogue: 0,0:10:52.37,0:11:00.36,Default,,0000,0000,0000,,and we have this concept of things that\Nyou need to run a site or a service reliably.
Dialogue: 0,0:11:00.91,0:11:03.23,Default,,0000,0000,0000,,The most important thing you need is\Ndown at the bottom,
Dialogue: 0,0:11:03.57,0:11:06.87,Default,,0000,0000,0000,,Monitoring, because if you don't have\Nmonitoring of your service,
Dialogue: 0,0:11:07.11,0:11:08.69,Default,,0000,0000,0000,,how do you know it's even working?
Dialogue: 0,0:11:11.63,0:11:15.24,Default,,0000,0000,0000,,There's a couple of techniques here, and\Nwe want to alert based on data
Dialogue: 0,0:11:15.69,0:11:17.64,Default,,0000,0000,0000,,and not just those end to end tests.
Dialogue: 0,0:11:18.80,0:11:23.39,Default,,0000,0000,0000,,There's a couple of techniques, a thing\Ncalled the RED method
Dialogue: 0,0:11:23.56,0:11:25.14,Default,,0000,0000,0000,,and there's a thing called the USE method
Dialogue: 0,0:11:25.59,0:11:28.40,Default,,0000,0000,0000,,and there are a couple of nice\Nblog posts about this
Dialogue: 0,0:11:28.70,0:11:31.31,Default,,0000,0000,0000,,and basically it defines that, for example,
Dialogue: 0,0:11:31.48,0:11:35.00,Default,,0000,0000,0000,,the RED method talks about the requests\Nthat your system is handling
Dialogue: 0,0:11:36.42,0:11:37.60,Default,,0000,0000,0000,,There are three things:
Dialogue: 0,0:11:37.78,0:11:40.07,Default,,0000,0000,0000,,There's the number of requests, there's\Nthe number of errors
Dialogue: 0,0:11:40.27,0:11:42.31,Default,,0000,0000,0000,,and there's how long it takes, the duration.
Dialogue: 0,0:11:42.87,0:11:45.00,Default,,0000,0000,0000,,With the combination of these three things
Dialogue: 0,0:11:45.34,0:11:48.37,Default,,0000,0000,0000,,you can determine most of\Nwhat your users see
Dialogue: 0,0:11:48.71,0:11:53.62,Default,,0000,0000,0000,,"Did my request go through? Did it\Nreturn an error? Was it fast?"
Dialogue: 0,0:11:55.49,0:11:57.97,Default,,0000,0000,0000,,Most people, that's all they care about.
Dialogue: 0,0:11:58.20,0:12:01.96,Default,,0000,0000,0000,,"I made a request to a website and\Nit came back and it was fast."
Dialogue: 0,0:12:04.98,0:12:06.52,Default,,0000,0000,0000,,It's a very simple method of just, like,
Dialogue: 0,0:12:07.16,0:12:10.11,Default,,0000,0000,0000,,those are the important things to\Ndetermine if your site is healthy.
Dialogue: 0,0:12:12.19,0:12:17.04,Default,,0000,0000,0000,,But we can go back to some more\Ntraditional, sysadmin style, alerts
Dialogue: 0,0:12:17.31,0:12:20.55,Default,,0000,0000,0000,,this is basically taking the filesystem\Navailable space,
Dialogue: 0,0:12:20.82,0:12:26.52,Default,,0000,0000,0000,,divided by the filesystem size, that becomes\Nthe ratio of filesystem availability
Dialogue: 0,0:12:26.70,0:12:27.52,Default,,0000,0000,0000,,from 0 to 1.
Dialogue: 0,0:12:28.24,0:12:30.76,Default,,0000,0000,0000,,Multiply it by 100, we now have\Na percentage
Dialogue: 0,0:12:31.02,0:12:35.66,Default,,0000,0000,0000,,and if it's less than or equal to 1%\Nfor 15 minutes,
Dialogue: 0,0:12:35.94,0:12:41.78,Default,,0000,0000,0000,,this is less than 1% space, we should tell\Na sysadmin to go check
Dialogue: 0,0:12:41.96,0:12:44.29,Default,,0000,0000,0000,,to find out why the filesystem\Nhas filled up.
Dialogue: 0,0:12:44.64,0:12:46.17,Default,,0000,0000,0000,,It's super nice and simple.
Dialogue: 0,0:12:46.49,0:12:49.68,Default,,0000,0000,0000,,We can also tag, we can include…
Dialogue: 0,0:12:51.42,0:12:58.23,Default,,0000,0000,0000,,Every alert includes all the extraneous\Nlabels that Prometheus adds to your metrics
Dialogue: 0,0:12:59.49,0:13:05.46,Default,,0000,0000,0000,,When you add a metric in Prometheus, if\Nwe go back and we look at this metric.
Dialogue: 0,0:13:06.01,0:13:10.80,Default,,0000,0000,0000,,This metric only contains the information\Nabout the internals of the application,
Dialogue: 0,0:13:12.94,0:13:15.00,Default,,0000,0000,0000,,anything about, like, what server it's on,\Nis it running in a container,
Dialogue: 0,0:13:15.19,0:13:18.72,Default,,0000,0000,0000,,what cluster does it come from,\Nwhat continent is it on,
Dialogue: 0,0:13:17.70,0:13:22.28,Default,,0000,0000,0000,,that's all extra annotations that are\Nadded by the Prometheus server
Dialogue: 0,0:13:22.62,0:13:23.95,Default,,0000,0000,0000,,at discovery time.
Dialogue: 0,0:13:24.51,0:13:28.35,Default,,0000,0000,0000,,Unfortunately, I don't have a good example\Nof what those labels look like
Dialogue: 0,0:13:28.51,0:13:34.18,Default,,0000,0000,0000,,but every metric gets annotated\Nwith location information.
Dialogue: 0,0:13:36.90,0:13:41.12,Default,,0000,0000,0000,,That location information also comes through\Nas labels in the alert
Dialogue: 0,0:13:41.30,0:13:48.07,Default,,0000,0000,0000,,so, if you have a message coming\Ninto your alert manager,
Dialogue: 0,0:13:48.27,0:13:49.90,Default,,0000,0000,0000,,the alert manager can look and go
Dialogue: 0,0:13:50.09,0:13:51.62,Default,,0000,0000,0000,,"Oh, that's coming from this datacenter"
Dialogue: 0,0:13:52.01,0:13:58.90,Default,,0000,0000,0000,,and it can include that in the email or\NIRC message or SMS message.
Dialogue: 0,0:13:59.07,0:14:00.77,Default,,0000,0000,0000,,So you can include
Dialogue: 0,0:13:59.27,0:14:04.42,Default,,0000,0000,0000,,"Filesystem is out of space on this host\Nfrom this datacenter"
Dialogue: 0,0:14:04.56,0:14:07.34,Default,,0000,0000,0000,,All these labels get passed through and\Nthen you can append
Dialogue: 0,0:14:07.49,0:14:13.29,Default,,0000,0000,0000,,"severity: critical" to that alert and\Ninclude that in the message to the human
Dialogue: 0,0:14:13.69,0:14:16.78,Default,,0000,0000,0000,,because of course, this is how you define…
Dialogue: 0,0:14:16.94,0:14:20.86,Default,,0000,0000,0000,,Getting the message from the monitoring\Nto the human.
Dialogue: 0,0:14:22.20,0:14:23.85,Default,,0000,0000,0000,,You can even include nice things like,
Dialogue: 0,0:14:24.03,0:14:27.51,Default,,0000,0000,0000,,if you've got documentation, you can\Ninclude a link to the documentation
Dialogue: 0,0:14:27.62,0:14:28.69,Default,,0000,0000,0000,,as an annotation
Dialogue: 0,0:14:29.08,0:14:33.44,Default,,0000,0000,0000,,and the alert manager can take that\Nbasic URL and, you know,
Dialogue: 0,0:14:33.47,0:14:36.81,Default,,0000,0000,0000,,massage it into whatever it needs\Nto look like to actually get
Dialogue: 0,0:14:37.14,0:14:40.42,Default,,0000,0000,0000,,the operator to the correct documentation.
Dialogue: 0,0:14:42.12,0:14:43.45,Default,,0000,0000,0000,,We can also do more fun things:
Dialogue: 0,0:14:43.66,0:14:45.57,Default,,0000,0000,0000,,since we actually are not just checking
Dialogue: 0,0:14:45.75,0:14:48.52,Default,,0000,0000,0000,,what is the space right now,\Nwe're tracking data over time,
Dialogue: 0,0:14:49.23,0:14:50.83,Default,,0000,0000,0000,,we can use 'predict_linear'.
Dialogue: 0,0:14:52.41,0:14:55.26,Default,,0000,0000,0000,,'predict_linear' just takes and does\Na simple linear regression.
Dialogue: 0,0:14:55.75,0:15:00.27,Default,,0000,0000,0000,,This example takes the filesystem\Navailable space over the last hour and
Dialogue: 0,0:15:00.86,0:15:02.45,Default,,0000,0000,0000,,does a linear regression.
Dialogue: 0,0:15:02.78,0:15:08.54,Default,,0000,0000,0000,,Prediction says "Well, it's going that way\Nand four hours from now,
Dialogue: 0,0:15:08.75,0:15:13.11,Default,,0000,0000,0000,,based on one hour of history, it's gonna\Nbe less than 0, which means full".
Dialogue: 0,0:15:13.67,0:15:20.64,Default,,0000,0000,0000,,We know that within the next four hours,\Nthe disk is gonna be full
Dialogue: 0,0:15:20.87,0:15:24.66,Default,,0000,0000,0000,,so we can tell the operator ahead of time\Nthat it's gonna be full
Dialogue: 0,0:15:24.83,0:15:26.52,Default,,0000,0000,0000,,and not just tell them that it's full\Nright now.
Dialogue: 0,0:15:27.11,0:15:32.30,Default,,0000,0000,0000,,They have some window of ability\Nto fix it before it fails.
Dialogue: 0,0:15:32.67,0:15:35.37,Default,,0000,0000,0000,,This is really important because\Nif you're running a site
Dialogue: 0,0:15:35.69,0:15:41.37,Default,,0000,0000,0000,,you want to be able to have alerts\Nthat tell you that your system is failing
Dialogue: 0,0:15:41.57,0:15:42.99,Default,,0000,0000,0000,,before it actually fails.
Dialogue: 0,0:15:43.67,0:15:48.25,Default,,0000,0000,0000,,Because if it fails, you're out of SLO\Nor SLA and
Dialogue: 0,0:15:48.40,0:15:50.32,Default,,0000,0000,0000,,your users are gonna be unhappy
Dialogue: 0,0:15:50.73,0:15:52.49,Default,,0000,0000,0000,,and you don't want the users to tell you\Nthat your site is down
Dialogue: 0,0:15:52.68,0:15:54.95,Default,,0000,0000,0000,,you want to know about it before\Nyour users can even tell.
Dialogue: 0,0:15:55.19,0:15:58.49,Default,,0000,0000,0000,,This allows you to do that.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And also of course, Prometheus being\Na modern system,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we fully support UTF-8 in all of our labels.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's another one, here's a good example\Nfrom the USE method.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is a rate of 500 errors coming from\Nan application
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you can simply alert that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's more than 500 errors per second\Ncoming out of the application
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if that's your threshold for ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you can do other things,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can convert that from just\Na rate of errors
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to a percentage of errors.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you could say
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"I have an SLA of three nines" and so you can say
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"If the rate of errors divided by the rate\Nof requests is .01,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or is more than .01, then\Nthat's a problem."
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can include that level of\Nerror granularity.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And if you're just doing a blackbox test,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you wouldn't know this; you would only know\Nif you got an error from the system,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,then you got another error from the system
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,then you fire an alert.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But if those checks are one minute apart\Nand you're serving 1000 requests per second
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you could be serving 10,000 errors before\Nyou even get an alert.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you might miss it, because
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what if you only get one random error
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then the next time, you're serving\N25% errors,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you only have a 25% chance of that check\Nfailing again.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You really need these metrics in order\Nto get
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,proper reports of the status of your system
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's even options
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can slice and dice those labels.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a label on all of\Nyour applications called 'service'
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can send that 'service' label through\Nto the message
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and you can say\N"Hey, this service is broken".
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can include that service label\Nin your alert messages.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And that's it, I can go to a demo and Q&A.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause]
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Any questions so far?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Or anybody want to see a demo?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] Hi. Does Prometheus make metric\Ndiscovery inside containers
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,or do I have to implement the metrics\Nmyself?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] For metrics in containers, there are\Nalready things that expose
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the metrics of the container system\Nitself.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a utility called 'cAdvisor' and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,cAdvisor takes the Linux cgroup data\Nand exposes it as metrics
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you can get data about\Nhow much CPU time is being
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,spent in your container,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,how much memory is being used\Nby your container.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] But not about the application,\Njust about the container usage?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] Right. Because the container\Nhas no idea
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,whether your application is written\Nin Ruby or Go or Python or whatever,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you have to build that into\Nyour application in order to get the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So for Prometheus,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we've written client libraries that can be\Nincluded in your application directly
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you can get that data out.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you go to the Prometheus website,\Nwe have a whole series of client libraries
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and we cover a pretty good selection\Nof popular software.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] What is the current state of\Nlong-term data storage?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] Very good question.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's been several…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's actually several different methods\Nof doing this.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus stores all this data locally\Nin its own data storage
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,on the local disk.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But that's only as durable as\Nthat server is durable.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a really durable server,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can store as much data as you want,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can store years and years of data\Nlocally on the Prometheus server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's not a problem.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a bunch of misconceptions because\Nof our default
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and the language on our website said
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"It's not long-term storage"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,simply because we leave that problem\Nup to the person running the server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But the time series database\Nthat Prometheus includes
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is actually quite durable.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's only as durable as the server\Nunderneath it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So if you've got a very large cluster and\Nyou want really high durability,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you need to have some kind of\Ncluster software,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but because we want Prometheus to be\Nsimple to deploy
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and very simple to operate
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and also very robust,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we didn't want to include any clustering\Nin Prometheus itself,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because anytime you have a clustered\Nsoftware,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,what happens if your network is\Na little wonky?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The first thing that goes down is\Nall of your distributed systems fail.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And building distributed systems to be\Nreally robust is really hard
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so Prometheus is what we call\N"uncoordinated distributed systems".
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got two Prometheus servers\Nmonitoring all your targets in an HA mode
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,in a cluster, and there's a split brain,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,each Prometheus can see\Nhalf of the cluster and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it can see that the other half\Nof the cluster is down.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They can both try to get alerts out\Nto the alert manager
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is a really really robust way of\Nhandling split brains
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and bad network failures and bad problems\Nin a cluster.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's designed to be super super robust
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and so the two individual\NPrometheus servers in your cluster
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,don't have to talk to each other\Nto do this,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,they can just do it independently.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But if you want to be able\Nto correlate data
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,between many different Prometheus servers
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you need an external data storage\Nto do this.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And also you may not have\Nvery big servers,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you might be running your Prometheus\Nin a container
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's only got a little bit of local\Nstorage space
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so you want to send all that data up\Nto a big cluster datastore
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We have several different ways of\Ndoing this.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's the classic way which is called\Nfederation
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,where you have one Prometheus server\Npulling in summary data from
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,each of the individual Prometheus servers
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and this is useful if you want to run\Nalerts against data coming
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,from multiple Prometheus servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But federation is not replication.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It only can do a little bit of data from\Neach Prometheus server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got a million metrics on\Neach Prometheus server,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,you can't pull in a million metrics\Nand do…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you've got 10 of those, you can't\Npull in 10 million metrics
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,simultaneously into one Prometheus\Nserver.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just too much data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There are two others, a couple of other\Nnice options.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a piece of software called\NCortex.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Cortex is a Prometheus server that\Nstores its data in a database.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Specifically, a distributed database.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Things that are based on the Google\NBigtable model, like Cassandra or…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What's the Amazon one?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Yeah.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,DynamoDB.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,If you have a DynamoDB or a Cassandra\Ncluster, or one of these other
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,really big distributed storage clusters,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Cortex can run and the Prometheus servers\Nwill stream their data up to Cortex
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it will keep a copy of that across\Nall of your Prometheus servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And because it's based on things\Nlike Cassandra,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's super scalable.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's a little complex to run and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,many people don't want to run that\Ncomplex infrastructure.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We have another new one, we just blogged\Nabout it yesterday.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's a thing called Thanos.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Thanos is Prometheus at scale.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Basically, the way it works…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Actually, why don't I bring that up?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This was developed by a company\Ncalled Improbable
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and they wanted to…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They had billions of metrics coming from\Nhundreds of Prometheus servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,They developed this in collaboration with\Nthe Prometheus team to build
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a super highly scalable Prometheus server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus itself stores the incoming\Nmetrics data in ??? log
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then every two hours, it creates\Na compaction cycle
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it creates an immutable series block\Nof data which is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,all the time series blocks themselves
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then an index into that data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Those two-hour windows are all immutable
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so Thanos has a little sidecar binary that\Nwatches for those new directories and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,uploads them into a blob store.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you could put them in S3 or minio or\Nsome other simple object storage.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then now you have all of your data,\Nall of this index data already
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,ready to go
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and then the final sidecar creates\Na little mesh cluster that can read from
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,all of those S3 blocks.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, you have this super global view\Nall stored in a big bucket storage and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,things like S3 or minio are…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Bucket stores are not databases, so they're\Noperationally a little easier to run.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Plus, now we have all this data in\Na bucket store and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the Thanos sidecars can talk to each other.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can now have a single entry point.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can query Thanos and Thanos will\Ndistribute your query
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,across all your Prometheus servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So now you can do global queries across\Nall of your servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But it's very new, they just released\Ntheir first release candidate yesterday.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It is looking to be like\Nthe coolest thing ever
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for running large scale Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here's an example of how that is laid out.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This will ??? let you have\Na billion metric Prometheus cluster.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And it's got a bunch of other\Ncool features.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Any more questions?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Alright, maybe I'll do\Na quick little demo.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here is a Prometheus server that is\Nprovided by ???
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that just does an Ansible deployment\Nfor Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you can just simply query\Nfor something like 'node_cpu'.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is actually the old name for\Nthat metric.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And you can see, here's exactly
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the CPU metrics from some servers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's just a bunch of stuff.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's actually two servers here,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's an influx cloud alchemy and\Nthere is a demo cloud alchemy.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] Can you zoom in?\N[A] Oh yeah sure.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So you can see all the extra labels.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can also do some things like…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Let's take a look at, say,\Nthe last 30 seconds.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can just add this little time window.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It's called a range request,\Nand you can see
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,the individual samples.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can see that all Prometheus is doing
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,is storing the sample and a timestamp.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,All the timestamps are in milliseconds\Nand it's all epoch
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so it's super easy to manipulate.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But, looking at the individual samples and\Nlooking at this, you can see that
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,if we go back and just take…\Nand look at the raw data, and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we graph the raw data…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Oops, that's a syntax error.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And we look at this graph…\NCome on.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here we go.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Well, that's kind of boring, it's just\Na flat line because
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,it's just a counter going up very slowly.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What we really want to do, is we want to\Ntake, and we want to apply
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a rate function to this counter.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So let's look at the rate over\Nthe last one minute.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There we go, now we get\Na nice little graph.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And so you can see that this is\N0.6 CPU seconds per second
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,for that set of labels.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But this is pretty noisy, there's a lot\Nof lines on this graph and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's still a lot of data here.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So let's start doing some filtering.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,One of the things we see here is,\Nwell, there's idle.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We don't really care about\Nthe machine being idle,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,so let's just add a label filter\Nso we can say
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,'mode', it's the label name, and it's not\Nequal to 'idle'. Done.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And if I could type…\NWhat did I miss?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Here we go.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So now we've removed idle from the graph.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That looks a little more sane.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Oh, wow, look at that, that's a nice\Nbig spike in user space on the influx server.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Okay…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Well, that's pretty cool.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What about…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,This is still quite a lot of lines.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,How much CPU is in use in total across\Nall the servers that we have?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can just sum up that rate.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can just see that there is\Na sum total of 0.6 CPU seconds/s
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,across the servers we have.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But that's a little too coarse.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What if we want to see it by instance?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now, we can see the two servers,\Nwe can see
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,that we're left with just that label.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The influx labels are the influx instance\Nand the influx demo.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,That's a super easy way to see that,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,but we can also do this\Nthe other way around.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can say 'without (mode,cpu)' so\Nwe can drop those modes and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,see all the labels that we have.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We can still see the environment label\Nand the job label on our list data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can go either way\Nwith the summary functions.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's a whole bunch of different functions
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it's all in our documentation.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But what if we want to see it…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,What if we want to see which CPUs\Nare in use?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now we can see that it's only CPU0
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,because apparently these are only\N1-core instances.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,You can add/remove labels and do\Nall these queries.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Any other questions so far?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Q] I don't have a question, but I have\Nsomething to add.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Prometheus is really nice, but it's\Na lot better if you combine it
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,with Grafana.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[A] Yes, yes.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,In the beginning, when we were creating\NPrometheus, we actually built
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,a piece of dashboard software called\Npromdash.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,It was a simple little Ruby on Rails app\Nto create dashboards
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and it had a bunch of JavaScript.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And then Grafana came out.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,And we're like
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Oh, that's interesting. It doesn't support\NPrometheus" so we were like
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Hey, can you support Prometheus?"
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,and they're like "Yeah, we've got\Na REST API, get the data, done".
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Now Grafana supports Prometheus and\Nwe're like
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,"Well, promdash, this is crap, delete".
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,The Prometheus development team,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we're all backend developers\Nand SREs and
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we have no JavaScript skills at all.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,So we're like "Let somebody deal\Nwith that".
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,One of the nice things about working on\Nthis kind of project is
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,we can do things that we're good at\Nand we don't, we don't try…
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,We don't have any marketing people,\Nit's just an open source project,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,there's no single company behind Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,I work for GitLab, Improbable paid for\Nthe Thanos system,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,other companies like Red Hat now pay\Npeople that used to work on CoreOS to
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,work on Prometheus.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,There's lots and lots of collaboration\Nbetween many companies
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,to build the Prometheus ecosystem.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,But yeah, Grafana is great.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Actually, Grafana now has\Ntwo full-time Prometheus developers.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,Alright, that's it.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,[Applause]