Metrics-Based Monitoring with Prometheus

  • 0:06 - 0:11
    So, we had a talk by a non-GitLab person
    about GitLab.
  • 0:11 - 0:13
    Now, we have a talk by a GitLab person
    on non-GitLab.
  • 0:13 - 0:15
    Something like that?
  • 0:16 - 0:19
    The CCCHH hackerspace is now open,
  • 0:20 - 0:22
    from now on if you want to go there,
    that's the announcement.
  • 0:22 - 0:26
    And the next talk will be by Ben Kochie
  • 0:26 - 0:28
    on metrics-based monitoring
    with Prometheus.
  • 0:29 - 0:30
    Welcome.
  • 0:31 - 0:33
    [Applause]
  • 0:35 - 0:37
    Alright, so
  • 0:37 - 0:39
    my name is Ben Kochie
  • 0:40 - 0:44
    I work on DevOps features for GitLab
  • 0:44 - 0:48
    and apart from working for GitLab, I also work
    on the open source Prometheus project.
  • 0:51 - 0:54
    I live in Berlin and I've been using
    Debian since ???
  • 0:54 - 0:57
    yes, quite a long time.
  • 0:59 - 1:01
    So, what is Metrics-based Monitoring?
  • 1:03 - 1:05
    If you're running software in production,
  • 1:06 - 1:08
    you probably want to monitor it,
  • 1:08 - 1:11
    because if you don't monitor it, you don't
    know if it's working right.
  • 1:13 - 1:16
    Monitoring breaks down into two categories:
  • 1:16 - 1:19
    there's blackbox monitoring and
    there's whitebox monitoring.
  • 1:20 - 1:25
    Blackbox monitoring is treating
    your software like a blackbox.
  • 1:25 - 1:26
    It's just checks to see, like,
  • 1:26 - 1:29
    is it responding, or does it ping
  • 1:30 - 1:34
    or ??? HTTP requests
  • 1:34 - 1:36
    [mic turned on]
  • 1:38 - 1:41
    Ah, there we go, that's better.
  • 1:47 - 1:52
    So, blackbox monitoring is a probe,
  • 1:52 - 1:55
    it just kind of looks from the outside
    to your software
  • 1:55 - 1:57
    and it has no knowledge of the internals
  • 1:58 - 2:01
    and it's really good for end to end testing.
  • 2:01 - 2:04
    So if you've got a fairly complicated
    service,
  • 2:04 - 2:06
    you come in from the outside, you go
    through the load balancer,
  • 2:07 - 2:08
    you hit the API server,
  • 2:08 - 2:10
    the API server might hit a database,
  • 2:10 - 2:13
    and you go all the way through
    to the back of the stack
  • 2:13 - 2:15
    and all the way back out
  • 2:15 - 2:16
    so you know that everything is working
    end to end.
  • 2:16 - 2:19
    But you only know about it
    for that one request.
  • 2:19 - 2:22
    So in order to find out if your service
    is working,
  • 2:23 - 2:27
    from the end to end, for every single
    request,
  • 2:27 - 2:30
    this requires whitebox instrumentation.
  • 2:30 - 2:34
    So, basically, every event that happens
    inside your software,
  • 2:34 - 2:37
    inside a serving stack,
  • 2:37 - 2:40
    gets collected and gets counted,
  • 2:40 - 2:43
    so you know that every request hits
    the load balancer,
  • 2:43 - 2:46
    every request hits your application
    service,
  • 2:46 - 2:47
    every request hits the database.
  • 2:48 - 2:51
    You know that everything matches up
  • 2:51 - 2:56
    and this is called whitebox, or
    metrics-based monitoring.
  • 2:56 - 2:58
    There are different examples of, like,
  • 2:58 - 3:02
    the kind of software that does blackbox
    and whitebox monitoring.
  • 3:03 - 3:07
    So you have software like Nagios that
    you can configure checks
  • 3:09 - 3:10
    or Pingdom,
  • 3:10 - 3:12
    Pingdom will do pings of your website.
  • 3:13 - 3:15
    And then there is metrics-based monitoring,
  • 3:16 - 3:19
    things like Prometheus, things like
    the TICK stack from InfluxData,
  • 3:20 - 3:23
    New Relic and other commercial solutions
  • 3:23 - 3:25
    but of course I like to talk about
    the open source solutions.
  • 3:26 - 3:28
    We're gonna talk a little bit about
    Prometheus.
  • 3:29 - 3:32
    Prometheus came out of the idea that
  • 3:32 - 3:38
    we needed a monitoring system that could
    collect all this whitebox metric data
  • 3:38 - 3:41
    and do something useful with it.
  • 3:41 - 3:43
    Not just give us a pretty graph, but
    we also want to be able to
  • 3:43 - 3:44
    alert on it.
  • 3:44 - 3:46
    So we needed both
  • 3:50 - 3:54
    a data gathering and an analytics system
    in the same instance.
  • 3:54 - 3:59
    To do this, we built this thing and
    we looked at the way that
  • 3:59 - 4:02
    data was being generated
    by the applications
  • 4:02 - 4:05
    and there are advantages and
    disadvantages to this
  • 4:05 - 4:07
    push vs. pull model for metrics.
  • 4:07 - 4:10
    We decided to go with the pulling model
  • 4:10 - 4:14
    because there is some slight advantages
    for pulling over pushing.
  • 4:16 - 4:18
    With pulling, you get this free
    blackbox check
  • 4:18 - 4:20
    that the application is running.
  • 4:21 - 4:24
    When you pull your application, you know
    that the process is running.
  • 4:25 - 4:28
    If you are doing push-based, you can't
    tell the difference between
  • 4:28 - 4:32
    your application doing no work and
    your application not running.
  • 4:32 - 4:34
    So you don't know if it's stuck,
  • 4:34 - 4:38
    or if it just doesn't have any work to do.
  • 4:43 - 4:49
    With pulling, the pulling system knows
    the state of your network.
  • 4:50 - 4:53
    If you have a defined set of services,
  • 4:53 - 4:57
    that inventory drives what should be there.
  • 4:58 - 5:00
    Again, it's like the disappearing-process question:
  • 5:00 - 5:04
    is the process dead, or is it just
    not doing anything?
  • 5:04 - 5:07
    With polling, you know for a fact
    what processes should be there,
  • 5:08 - 5:11
    and it's a bit of an advantage there.
  • 5:11 - 5:13
    With pulling, there's really easy testing.
  • 5:13 - 5:16
    With push-based metrics, you have to
    figure out
  • 5:17 - 5:19
    if you want to test a new version of
    the monitoring system or
  • 5:19 - 5:21
    you want to test something new,
  • 5:21 - 5:24
    you have to split off a copy of the data.
  • 5:24 - 5:28
    With pulling, you can just set up
    another instance of your monitoring
  • 5:28 - 5:29
    and just test it.
  • 5:30 - 5:31
    Or you don't even have,
  • 5:31 - 5:33
    it doesn't even have to be monitoring,
    you can just use curl
  • 5:33 - 5:35
    to pull the metrics endpoint.
  • 5:38 - 5:40
    It's significantly easier to test.
  • 5:40 - 5:43
    The other thing with the…
  • 5:46 - 5:48
    The other nice thing is that
    the client is really simple.
  • 5:48 - 5:51
    The client doesn't have to know
    where the monitoring system is.
  • 5:51 - 5:54
    It doesn't have to know about HA
  • 5:54 - 5:56
    It just has to sit and collect the data
    about itself.
  • 5:56 - 5:59
    So it doesn't have to know anything about
    the topology of the network.
  • 5:59 - 6:03
    As an application developer, if you're
    writing a DNS server or
  • 6:04 - 6:06
    some other piece of software,
  • 6:06 - 6:10
    you don't have to know anything about
    monitoring software,
  • 6:10 - 6:12
    you can just implement it inside
    your application and
  • 6:13 - 6:17
    the monitoring software, whether it's
    Prometheus or something else,
  • 6:17 - 6:19
    can just come and collect that data for you.
  • 6:20 - 6:24
    That's kind of similar to a very old
    monitoring system called SNMP,
  • 6:24 - 6:29
    but SNMP has a significantly less friendly
    data model for developers.
  • 6:30 - 6:34
    This is the basic layout
    of a Prometheus server.
  • 6:34 - 6:36
    At the core, there's a Prometheus server
  • 6:36 - 6:40
    and it deals with all the data collection
    and analytics.
  • 6:43 - 6:47
    Basically, it's this one binary,
    all written in Go.
  • 6:47 - 6:49
    It's a single binary.
  • 6:49 - 6:51
    It knows how to read from your inventory,
  • 6:51 - 6:53
    there's a bunch of different methods,
    whether you've got
  • 6:53 - 6:59
    a Kubernetes cluster or a cloud platform
  • 7:00 - 7:04
    or you have your own customized thing
    with Ansible.
  • 7:05 - 7:10
    Ansible can take your layout, drop that
    into a config file and
  • 7:11 - 7:12
    Prometheus can pick that up.
  • 7:16 - 7:19
    Once it has the layout, it goes out and
    collects all the data.
  • 7:19 - 7:24
    It has a storage and a time series
    database to store all that data locally.
  • 7:24 - 7:28
    It has a thing called PromQL, which is
    a query language designed
  • 7:28 - 7:31
    for metrics and analytics.
  • 7:32 - 7:37
    From that PromQL, you can add frontends,
  • 7:37 - 7:39
    whether it's a simple API client
    to run reports,
  • 7:40 - 7:43
    you can use things like Grafana
    for creating dashboards,
  • 7:43 - 7:45
    it's got a simple web UI built in.
  • 7:45 - 7:47
    You can plug in anything you want
    on that side.
  • 7:49 - 7:54
    And then, it also has the ability to
    continuously execute queries
  • 7:55 - 7:56
    called "recording rules"
  • 7:57 - 7:59
    and these recording rules have
    two different modes.
  • 7:59 - 8:02
    You can either record, you can take
    a query
  • 8:02 - 8:04
    and it will generate new data
    from that query
  • 8:04 - 8:07
    or you can take a query, and
    if it returns results,
  • 8:07 - 8:09
    it will fire an alert.
  • 8:09 - 8:13
    That alert is a push message
    to the alert manager.
  • 8:13 - 8:19
    This allows us to separate the generating
    of alerts from the routing of alerts.
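
As a rough sketch of what those two rule types look like in a rule file (rule names, expressions, and thresholds here are illustrative, not taken from the talk):

    groups:
      - name: example
        rules:
          # Recording rule: continuously evaluate a query and store
          # the result as a new time series.
          - record: instance:node_cpu_utilisation:rate5m
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

          # Alerting rule: if the query returns results, fire an alert
          # that gets pushed to the alert manager.
          - alert: HighCPUUtilisation
            expr: instance:node_cpu_utilisation:rate5m > 0.9
            for: 15m
            labels:
              severity: warning
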
  • 8:19 - 8:24
    You can have one or hundreds of Prometheus
    services, all generating alerts
  • 8:25 - 8:29
    and it goes into an alert manager cluster
    which does the deduplication
  • 8:29 - 8:31
    and the routing to the human
  • 8:31 - 8:34
    because, of course, the thing
    that we want is
  • 8:35 - 8:39
    we had dashboards with graphs, but
    in order to find out if something is broken
  • 8:39 - 8:41
    you had to have a human
    looking at the graph.
  • 8:41 - 8:43
    With Prometheus, we don't have to do that
    anymore,
  • 8:43 - 8:48
    we can simply let the software tell us
    that we need to go investigate
  • 8:48 - 8:49
    our problems.
  • 8:49 - 8:51
    We don't have to sit there and
    stare at dashboards all day,
  • 8:51 - 8:52
    because that's really boring.
  • 8:55 - 8:58
    What does it look like to actually
    get data into Prometheus?
  • 8:58 - 9:02
    This is a very basic output
    of a Prometheus metric.
  • 9:03 - 9:04
    This is a very simple thing.
  • 9:04 - 9:08
    If you know much about
    the Linux kernel,
  • 9:07 - 9:13
    the Linux kernel tracks, in /proc/stat,
    the state of all the CPUs
  • 9:13 - 9:14
    in your system
  • 9:15 - 9:18
    and we express this by having
    the name of the metric, which is
  • 9:22 - 9:26
    'node_cpu_seconds_total' and so
    this is a self-describing metric,
  • 9:27 - 9:28
    like you can just read the metrics name
  • 9:29 - 9:31
    and you understand a little bit about
    what's going on here.
  • 9:33 - 9:39
    The Linux kernel and other kernels track
    their usage by the number of seconds
  • 9:39 - 9:41
    spent doing different things and
  • 9:41 - 9:47
    that could be, whether it's in system or
    user space or IRQs
  • 9:47 - 9:49
    or iowait or idle.
  • 9:49 - 9:51
    Actually, the kernel tracks how much
    idle time it has.
  • 9:54 - 9:55
    It also tracks it by the number of CPUs.
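
For reference, the exposition text for that metric looks roughly like this (the values are invented; the label set is the one the node exporter uses):

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 187093.21
    node_cpu_seconds_total{cpu="0",mode="user"} 2384.56
    node_cpu_seconds_total{cpu="0",mode="system"} 911.42
    node_cpu_seconds_total{cpu="0",mode="iowait"} 31.88
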
  • 9:56 - 10:00
    With other monitoring systems, they used
    to do this with a tree structure
  • 10:01 - 10:04
    and this caused a lot of problems,
    for like
  • 10:04 - 10:09
    how do you mix and match data. So,
    by switching from
  • 10:10 - 10:12
    a tree structure to a tag-based structure,
  • 10:13 - 10:17
    we can do some really interesting
    powerful data analytics.
  • 10:18 - 10:25
    Here's a nice example of taking
    those CPU seconds counters
  • 10:26 - 10:30
    and then converting them into a graph
    by using PromQL.
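
The kind of PromQL behind such a graph is roughly this (a sketch, not the exact query from the slide):

    # Per-instance CPU usage in CPU-seconds per second, ignoring idle time
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
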
  • 10:33 - 10:35
    Now we can get into
    Metrics-Based Alerting.
  • 10:35 - 10:38
    Now we have this graph, we have this thing
  • 10:38 - 10:39
    we can look and see here
  • 10:40 - 10:43
    "Oh there is some little spike here,
    we might want to know about that."
  • 10:43 - 10:46
    Now we can get into Metrics-Based
    Alerting.
  • 10:46 - 10:51
    I used to be a site reliability engineer,
    I'm still a site reliability engineer at heart
  • 10:52 - 11:00
    and we have this concept of the things that
    you need to run a site or a service reliably.
  • 11:01 - 11:03
    The most important thing you need is
    down at the bottom,
  • 11:04 - 11:07
    Monitoring, because if you don't have
    monitoring of your service,
  • 11:07 - 11:09
    how do you know it's even working?
  • 11:12 - 11:15
    There's a couple of techniques here, and
    we want to alert based on data
  • 11:16 - 11:18
    and not just those end to end tests.
  • 11:19 - 11:23
    There's a couple of techniques, a thing
    called the RED method
  • 11:24 - 11:25
    and there's a thing called the USE method
  • 11:26 - 11:28
    and there are a couple of nice
    blog posts about this
  • 11:29 - 11:31
    and basically it defines that, for example,
  • 11:31 - 11:35
    the RED method talks about the requests
    that your system is handling
  • 11:36 - 11:38
    There are three things:
  • 11:38 - 11:40
    There's the number of requests, there's
    the number of errors
  • 11:40 - 11:42
    and there's how long it takes, the duration.
  • 11:43 - 11:45
    With the combination of these three things
  • 11:45 - 11:48
    you can determine most of
    what your users see
  • 11:49 - 11:54
    "Did my request go through? Did it
    return an error? Was it fast?"
  • 11:55 - 11:58
    Most people, that's all they care about.
  • 11:58 - 12:02
    "I made a request to a website and
    it came back and it was fast."
  • 12:05 - 12:07
    It's a very simple method of just, like,
  • 12:07 - 12:10
    those are the important things to
    determine if your site is healthy.
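
A minimal PromQL sketch of the RED method, assuming the application exports a histogram named http_request_duration_seconds with a 'code' label (a common convention, not something named in the talk):

    # Rate: requests per second
    sum(rate(http_request_duration_seconds_count[5m]))

    # Errors: requests per second that returned a 5xx status
    sum(rate(http_request_duration_seconds_count{code=~"5.."}[5m]))

    # Duration: 95th percentile request latency
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
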
  • 12:12 - 12:17
    But we can go back to some more
    traditional, sysadmin style alerts
  • 12:17 - 12:21
    this is basically taking the filesystem
    available space,
  • 12:21 - 12:27
    divided by the filesystem size, that becomes
    the ratio of filesystem availability
  • 12:27 - 12:28
    from 0 to 1.
  • 12:28 - 12:31
    Multiply it by 100, we now have
    a percentage
  • 12:31 - 12:36
    and if it's less than or equal to 1%
    for 15 minutes,
  • 12:36 - 12:42
    this is less than 1% space, we should tell
    a sysadmin to go check
  • 12:42 - 12:44
    to find out why the filesystem
    has filled up.
  • 12:45 - 12:46
    It's super nice and simple.
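
Written out as an alerting rule, that check might look something like this (the metric names are the node exporter's current ones; the alert name, severity, and summary text are illustrative):

    - alert: FilesystemAlmostFull
      expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Less than 1% space left on {{ $labels.device }} at {{ $labels.instance }}"
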
  • 12:46 - 12:50
    We can also tag, we can include…
  • 12:51 - 12:58
    Every alert includes all the extraneous
    labels that Prometheus adds to your metrics
  • 12:59 - 13:05
    When you add a metric in Prometheus, if
    we go back and we look at this metric.
  • 13:06 - 13:11
    This metric only contains information
    about the internals of the application
  • 13:13 - 13:15
    anything about, like, what server it's on,
    is it running in a container,
  • 13:15 - 13:19
    what cluster does it come from,
    what continent is it on,
  • 13:18 - 13:22
    that's all extra annotations that are
    added by the Prometheus server
  • 13:23 - 13:24
    at discovery time.
  • 13:25 - 13:28
    Unfortunately I don't have a good example
    of what those labels look like
  • 13:29 - 13:34
    but every metric gets annotated
    with location information.
  • 13:37 - 13:41
    That location information also comes through
    as labels in the alert
  • 13:41 - 13:48
    so, if you have a message coming
    into your alert manager,
  • 13:48 - 13:50
    the alert manager can look and go
  • 13:50 - 13:52
    "Oh, that's coming from this datacenter"
  • 13:52 - 13:59
    and it can include that in the email or
    IRC message or SMS message.
  • 13:59 - 14:01
    So you can include
  • 13:59 - 14:04
    "Filesystem is out of space on this host
    from this datacenter"
  • 14:05 - 14:07
    All these labels get passed through and
    then you can append
  • 14:07 - 14:13
    "severity: critical" to that alert and
    include that in the message to the human
  • 14:14 - 14:17
    because of course, this is how you define…
  • 14:17 - 14:21
    Getting the message from the monitoring
    to the human.
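
On the alert manager side, the routing the speaker describes might be configured along these lines (receiver names, addresses, and URLs are placeholders):

    route:
      group_by: [alertname, datacenter]
      receiver: team-email
      routes:
        # Critical alerts additionally go to the on-call pager.
        - match:
            severity: critical
          receiver: oncall-pager

    receivers:
      - name: team-email
        email_configs:
          - to: ops@example.org
      - name: oncall-pager
        webhook_configs:
          - url: http://pager-gateway.example.org/alert
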
  • 14:22 - 14:24
    You can even include nice things like,
  • 14:24 - 14:28
    if you've got documentation, you can
    include a link to the documentation
  • 14:28 - 14:29
    as an annotation
  • 14:29 - 14:33
    and the alert manager can take that
    basic url and, you know,
  • 14:33 - 14:37
    massage it into whatever it needs
    to look like to actually get
  • 14:37 - 14:40
    the operator to the correct documentation.
  • 14:42 - 14:43
    We can also do more fun things:
  • 14:44 - 14:46
    since we actually are not just checking
  • 14:46 - 14:49
    what is the space right now,
    we're tracking data over time,
  • 14:49 - 14:51
    we can use 'predict_linear'.
  • 14:52 - 14:55
    'predict_linear' just takes and does
    a simple linear regression.
  • 14:56 - 15:00
    This example takes the filesystem
    available space over the last hour and
  • 15:01 - 15:02
    does a linear regression.
  • 15:03 - 15:09
    Prediction says "Well, it's going that way
    and four hours from now,
  • 15:09 - 15:13
    based on one hour of history, it's gonna
    be less than 0, which means full".
  • 15:14 - 15:21
    We know that within the next four hours,
    the disc is gonna be full
  • 15:21 - 15:25
    so we can tell the operator ahead of time
    that it's gonna be full
  • 15:25 - 15:27
    and not just tell them that it's full
    right now.
  • 15:27 - 15:32
    They have some window of ability
    to fix it before it fails.
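
A sketch of that predict_linear rule, with one hour of history and four hours of lookahead as described (the alert name and severity are made up):

    - alert: FilesystemPredictedFull
      # Linear regression over the last hour, extrapolated 4 hours ahead
      expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
      labels:
        severity: warning
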
  • 15:33 - 15:35
    This is really important because
    if you're running a site
  • 15:36 - 15:41
    you want to be able to have alerts
    that tell you that your system is failing
  • 15:42 - 15:43
    before it actually fails.
  • 15:44 - 15:48
    Because if it fails, you're out of SLO
    or SLA and
  • 15:48 - 15:50
    your users are gonna be unhappy
  • 15:51 - 15:52
    and you don't want the users to tell you
    that your site is down
  • 15:53 - 15:55
    you want to know about it before
    your users can even tell.
  • 15:55 - 15:58
    This allows you to do that.
  • 15:59 - 16:02
    And also of course, Prometheus being
    a modern system,
  • 16:03 - 16:06
    we fully support UTF-8 in all of our labels.
  • 16:08 - 16:12
    Here's another one, here's a good example
    from the USE method.
  • 16:12 - 16:16
    This is a rate of 500 errors coming from
    an application
  • 16:16 - 16:18
    and you can simply alert that
  • 16:18 - 16:23
    there's more than 500 errors per second
    coming out of the application
  • 16:23 - 16:26
    if that's your threshold for pain
  • 16:26 - 16:27
    And you can do other things,
  • 16:28 - 16:29
    you can convert that from just
    a rate of errors
  • 16:30 - 16:31
    to a percentage of errors.
  • 16:31 - 16:33
    So you could say
  • 16:33 - 16:37
    "I have an SLA of 3 9" and so you can say
  • 16:38 - 16:47
    "If the rate of errors divided by the rate
    of requests is .01,
  • 16:47 - 16:49
    or is more than .01, then
    that's a problem."
  • 16:50 - 16:55
    You can include that level of
    error granularity.
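
As a PromQL sketch of that ratio (metric and label names are assumed, not from the talk):

    # More than 1% of requests are errors over the last 5 minutes
      sum(rate(http_requests_total{code=~"5.."}[5m]))
    /
      sum(rate(http_requests_total[5m]))
    > 0.01
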
  • 16:55 - 16:58
    And if you're just doing a blackbox test,
  • 16:58 - 17:04
    you wouldn't know this. You would only know
    if you got an error from the system,
  • 17:04 - 17:06
    then you got another error from the system
  • 17:06 - 17:07
    then you fire an alert.
  • 17:07 - 17:12
    But if those checks are one minute apart
    and you're serving 1000 requests per second
  • 17:13 - 17:21
    you could be serving 10,000 errors before
    you even get an alert.
  • 17:22 - 17:23
    And you might miss it, because
  • 17:23 - 17:25
    what if you only get one random error
  • 17:25 - 17:29
    and then the next time, you're serving
    25% errors,
  • 17:29 - 17:32
    you only have a 25% chance of that check
    failing again.
  • 17:32 - 17:36
    You really need these metrics in order
    to get
  • 17:36 - 17:39
    proper reports of the status of your system
  • 17:43 - 17:44
    There are even options:
  • 17:44 - 17:46
    You can slice and dice those labels.
  • 17:46 - 17:50
    If you have a label on all of
    your applications called 'service'
  • 17:50 - 17:53
    you can send that 'service' label through
    to the message
  • 17:54 - 17:56
    and you can say
    "Hey, this service is broken".
  • 17:56 - 18:00
    You can include that service label
    in your alert messages.
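
For example, aggregating that same error ratio by the (assumed) 'service' label keeps that label on the result, so it flows through into the alert message:

      sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
    /
      sum by (service) (rate(http_requests_total[5m]))
    > 0.01
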
  • 18:01 - 18:07
    And that's it, I can go to a demo and Q&A.
  • 18:10 - 18:14
    [Applause]
  • 18:17 - 18:18
    Any questions so far?
  • 18:19 - 18:20
    Or anybody want to see a demo?
  • 18:30 - 18:35
    [Q] Hi. Does Prometheus do metric
    discovery inside containers
  • 18:35 - 18:37
    or do I have to implement the metrics
    myself?
  • 18:38 - 18:46
    [A] For metrics in containers, there are
    already things that expose
  • 18:46 - 18:49
    the metrics of the container system
    itself.
  • 18:50 - 18:52
    There's a utility called 'cAdvisor' and
  • 18:52 - 18:57
    cAdvisor takes the Linux cgroup data
    and exposes it as metrics
  • 18:57 - 19:01
    so you can get data about
    how much CPU time is being
  • 19:01 - 19:02
    spent in your container,
  • 19:03 - 19:04
    how much memory is being spent
    by your container.
  • 19:05 - 19:08
    [Q] But not about the application,
    just about the container usage?
  • 19:09 - 19:11
    [A] Right. Because the container
    has no idea
  • 19:12 - 19:15
    whether your application is written
    in Ruby or Go or Python or whatever,
  • 19:19 - 19:22
    you have to build that into
    your application in order to get the data.
  • 19:24 - 19:24
    So for Prometheus,
  • 19:28 - 19:35
    we've written client libraries that can be
    included in your application directly
  • 19:35 - 19:36
    so you can get that data out.
  • 19:37 - 19:41
    If you go to the Prometheus website,
    we have a whole series of client libraries
  • 19:45 - 19:49
    and we cover a pretty good selection
    of popular software.
  • 19:57 - 20:00
    [Q] What is the current state of
    long-term data storage?
  • 20:01 - 20:02
    [A] Very good question.
  • 20:03 - 20:05
    There's been several…
  • 20:05 - 20:07
    There's actually several different methods
    of doing this.
  • 20:10 - 20:15
    Prometheus stores all this data locally
    in its own data storage
  • 20:15 - 20:16
    on the local disk.
  • 20:17 - 20:19
    But that's only as durable as
    that server is durable.
  • 20:19 - 20:22
    So if you've got a really durable server,
  • 20:22 - 20:23
    you can store as much data as you want,
  • 20:24 - 20:27
    you can store years and years of data
    locally on the Prometheus server.
  • 20:27 - 20:28
    That's not a problem.
  • 20:29 - 20:32
    There's a bunch of misconceptions because
    of our default
  • 20:32 - 20:34
    and the language on our website said
  • 20:35 - 20:36
    "It's not long-term storage"
  • 20:37 - 20:42
    simply because we leave that problem
    up to the person running the server.
  • 20:43 - 20:46
    But the time series database
    that Prometheus includes
  • 20:47 - 20:48
    is actually quite durable.
  • 20:49 - 20:51
    But it's only as durable as the server
    underneath it.
  • 20:52 - 20:55
    So if you've got a very large cluster and
    you want really high durability,
  • 20:56 - 20:58
    you need to have some kind of
    cluster software,
  • 20:58 - 21:01
    but because we want Prometheus to be
    simple to deploy
  • 21:02 - 21:03
    and very simple to operate
  • 21:03 - 21:07
    and also very robust.
  • 21:07 - 21:09
    We didn't want to include any clustering
    in Prometheus itself,
  • 21:10 - 21:12
    because anytime you have a clustered
    software,
  • 21:12 - 21:15
    what happens if your network is
    a little wonky.
  • 21:16 - 21:19
    The first thing that goes down is
    all of your distributed systems fail.
  • 21:20 - 21:23
    And building distributed systems to be
    really robust is really hard
  • 21:23 - 21:29
    so Prometheus is what we call an
    "uncoordinated distributed system".
  • 21:29 - 21:34
    If you've got two Prometheus servers
    monitoring all your targets in an HA mode
  • 21:34 - 21:37
    in a cluster, and there's a split brain,
  • 21:37 - 21:40
    each Prometheus can see
    half of the cluster and
  • 21:41 - 21:44
    it can see that the other half
    of the cluster is down.
  • 21:44 - 21:47
    They can both try to get alerts out
    to the alert manager
  • 21:47 - 21:50
    and this is a really really robust way of
    handling split brains
  • 21:51 - 21:54
    and bad network failures and bad problems
    in a cluster.
  • 21:54 - 21:57
    It's designed to be super super robust
  • 21:57 - 22:00
    and so the two individual
    Prometheus servers in your cluster
  • 22:00 - 22:02
    don't have to talk to each other
    to do this,
  • 22:02 - 22:04
    they can just do it independently.
  • 22:04 - 22:07
    But if you want to be able
    to correlate data
  • 22:08 - 22:09
    between many different Prometheus servers
  • 22:09 - 22:12
    you need an external data storage
    to do this.
  • 22:13 - 22:15
    And also you may not have
    very big servers,
  • 22:15 - 22:17
    you might be running your Prometheus
    in a container
  • 22:17 - 22:19
    and it's only got a little bit of local
    storage space
  • 22:20 - 22:23
    so you want to send all that data up
    to a big cluster datastore
  • 22:23 - 22:25
    for a bigger use
  • 22:26 - 22:28
    We have several different ways of
    doing this.
  • 22:28 - 22:31
    There's the classic way which is called
    federation
  • 22:31 - 22:35
    where you have one Prometheus server
    polling in summary data from
  • 22:35 - 22:37
    each of the individual Prometheus servers
  • 22:37 - 22:40
    and this is useful if you want to run
    alerts against data coming
  • 22:40 - 22:42
    from multiple Prometheus servers.
  • 22:42 - 22:44
    But federation is not replication.
  • 22:45 - 22:47
    It can only pull a little bit of data from
    each Prometheus server.
  • 22:48 - 22:51
    If you've got a million metrics on
    each Prometheus server,
  • 22:52 - 22:56
    you can't poll in a million metrics
    and do…
  • 22:56 - 22:59
    If you've got 10 of those, you can't
    poll in 10 million metrics
  • 22:59 - 23:01
    simultaneously into one Prometheus
    server.
  • 23:01 - 23:02
    It's just too much data.
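
A federation scrape job looks roughly like this (the match[] selector shown pulls only aggregated 'job:'-prefixed series; the hostnames are placeholders):

    scrape_configs:
      - job_name: federate
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"job:.*"}'
        static_configs:
          - targets:
              - prometheus-dc1.example.org:9090
              - prometheus-dc2.example.org:9090
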
  • 23:03 - 23:06
    There are two other
    nice options.
  • 23:07 - 23:09
    There's a piece of software called
    Cortex.
  • 23:09 - 23:16
    Cortex is a Prometheus server that
    stores its data in a database.
  • 23:17 - 23:19
    Specifically, a distributed database.
  • 23:19 - 23:24
    Things that are based on the Google
    Bigtable model, like Cassandra or…
  • 23:26 - 23:27
    What's the Amazon one?
  • 23:30 - 23:33
    Yeah.
  • 23:33 - 23:34
    DynamoDB.
  • 23:34 - 23:37
    If you have a DynamoDB or a Cassandra
    cluster, or one of these other
  • 23:37 - 23:39
    really big distributed storage clusters,
  • 23:40 - 23:45
    Cortex can run and the Prometheus servers
    will stream their data up to Cortex
  • 23:45 - 23:49
    and it will keep a copy of that across
    all of your Prometheus servers.
  • 23:50 - 23:51
    And because it's based on things
    like Cassandra,
  • 23:52 - 23:53
    it's super scalable.
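
On each Prometheus server, streaming data up to something like Cortex is just a remote_write entry (the URL is a placeholder; the exact path depends on your Cortex setup):

    remote_write:
      - url: http://cortex.example.org/api/prom/push
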
  • 23:53 - 23:58
    But it's a little complex to run and
  • 23:58 - 24:01
    many people don't want to run that
    complex infrastructure.
  • 24:01 - 24:06
    We have another new one, we just blogged
    about it yesterday.
  • 24:02 - 24:07
    It's a thing called Thanos.
  • 24:07 - 24:11
    Thanos is Prometheus at scale.
  • 24:11 - 24:12
    Basically, the way it works…
  • 24:13 - 24:15
    Actually, why don't I bring that up?
  • 24:24 - 24:31
    This was developed by a company
    called Improbable
  • 24:31 - 24:33
    and they wanted to…
  • 24:35 - 24:40
    They had billions of metrics coming from
    hundreds of Prometheus servers.
  • 24:41 - 24:47
    They developed this in collaboration with
    the Prometheus team to build
  • 24:47 - 24:49
    a super highly scalable Prometheus server.
  • 24:50 - 24:56
    Prometheus itself stores the incoming
    metrics data in a write ahead log
  • 24:56 - 25:00
    and then every two hours, it creates
    a compaction cycle
  • 25:00 - 25:03
    and it creates an immutable time series block
    of data which is
  • 25:04 - 25:07
    all the time series blocks themselves
  • 25:07 - 25:10
    and then an index into that data.
  • 25:11 - 25:14
    Those two-hour windows are all immutable
  • 25:14 - 25:16
    so what Thanos does,
    it has a little sidecar binary that
  • 25:16 - 25:19
    watches for those new directories and
  • 25:19 - 25:21
    uploads them into a blob store.
  • 25:21 - 25:26
    So you could put them in S3 or Minio or
    some other simple object storage.
  • 25:26 - 25:33
    And then now you have all of your data,
    all of this index data already
  • 25:33 - 25:35
    ready to go
  • 25:35 - 25:38
    and then the final sidecar creates
    a little mesh cluster that can read from
  • 25:38 - 25:40
    all of those S3 blocks.
  • 25:40 - 25:48
    Now, you have this super global view
    all stored in a big bucket storage and
  • 25:50 - 25:52
    things like S3 or Minio are…
  • 25:53 - 25:58
    Bucket storage is not a database, so it's
    operationally a little easier to operate.
  • 25:58 - 26:02
    Plus, now we have all this data in
    a bucket store and
  • 26:03 - 26:06
    the Thanos sidecars can talk to each other.
  • 26:07 - 26:08
    We can now have a single entry point.
  • 26:08 - 26:12
    You can query Thanos and Thanos will
    distribute your query
  • 26:12 - 26:14
    across all your Prometheus servers.
  • 26:14 - 26:16
    So now you can do global queries across
    all of your servers.
  • 26:18 - 26:22
    But it's very new, they just released
    their first release candidate yesterday.
  • 26:24 - 26:27
    It is looking to be like
    the coolest thing ever
  • 26:27 - 26:29
    for running large scale Prometheus.
  • 26:30 - 26:35
    Here's an example of how that is laid out.
  • 26:37 - 26:39
    This will let you have
    a billion-metric Prometheus cluster.
  • 26:43 - 26:44
    And it's got a bunch of other
    cool features.
  • 26:45 - 26:47
    Any more questions?
  • 26:55 - 26:57
    Alright, maybe I'll do
    a quick little demo.
  • 27:05 - 27:11
    Here is a Prometheus server that is
    provided by this group
  • 27:11 - 27:14
    that just does an Ansible deployment
    for Prometheus.
  • 27:15 - 27:20
    And you can just simply query
    for something like 'node_cpu'.
  • 27:21 - 27:23
    This is actually the old name for
    that metric.
  • 27:24 - 27:26
    And you can see, here's exactly
  • 27:28 - 27:31
    the CPU metrics from some servers.
  • 27:33 - 27:35
    It's just a bunch of stuff.
  • 27:35 - 27:37
    There's actually two servers here,
  • 27:37 - 27:41
    there's an influx cloud alchemy and
    there is a demo cloud alchemy.
  • 27:42 - 27:44
    [Q] Can you zoom in?
    [A] Oh yeah sure.
  • 27:53 - 27:58
    So you can see all the extra labels.
  • 28:00 - 28:02
    We can also do some things like…
  • 28:02 - 28:04
    Let's take a look at, say,
    the last 30 seconds.
  • 28:05 - 28:07
    We can just add this little time window.
  • 28:08 - 28:11
    It's called a range request,
    and you can see
  • 28:11 - 28:12
    the individual samples.
  • 28:13 - 28:15
    You can see that all Prometheus is doing
  • 28:15 - 28:18
    is storing the sample and a timestamp.
  • 28:18 - 28:23
    All the timestamps are in milliseconds
    and it's all epoch
  • 28:23 - 28:25
    so it's super easy to manipulate.
  • 28:26 - 28:30
    But, looking at the individual samples and
    looking at this, you can see that
  • 28:30 - 28:36
    if we go back and just take…
    and look at the raw data, and
  • 28:36 - 28:38
    we graph the raw data…
  • 28:40 - 28:43
    Oops, that's a syntax error.
  • 28:44 - 28:47
    And we look at this graph…
    Come on.
  • 28:47 - 28:48
    Here we go.
  • 28:48 - 28:50
    Well, that's kind of boring, it's just
    a flat line because
  • 28:51 - 28:53
    it's just a counter going up very slowly.
  • 28:53 - 28:56
    What we really want to do, is we want to
    take, and we want to apply
  • 28:57 - 28:59
    a rate function to this counter.
  • 29:00 - 29:04
    So let's look at the rate over
    the last one minute.
  • 29:04 - 29:07
    There we go, now we get
    a nice little graph.
  • 29:08 - 29:14
    And so you can see that this is
    0.6 CPU seconds per second
  • 29:15 - 29:18
    for that set of labels.
  • 29:19 - 29:21
    But this is pretty noisy, there's a lot
    of lines on this graph and
  • 29:21 - 29:23
    there's still a lot of data here.
  • 29:23 - 29:26
    So let's start doing some filtering.
  • 29:26 - 29:29
    One of the things we see here is,
    well, there's idle.
  • 29:30 - 29:32
    We don't really care about
    the machine being idle,
  • 29:33 - 29:35
    so let's just add a label filter
    so we can say
  • 29:36 - 29:42
    'mode', it's the label name, and it's not
    equal to 'idle'. Done.
  • 29:45 - 29:48
    And if I could type…
    What did I miss?
  • 29:51 - 29:51
    Here we go.
  • 29:51 - 29:54
    So now we've removed idle from the graph.
  • 29:54 - 29:56
    That looks a little more sane.
  • 29:57 - 30:01
    Oh, wow, look at that, that's a nice
    big spike in user space on the influx server
  • 30:01 - 30:02
    Okay…
  • 30:04 - 30:05
    Well, that's pretty cool.
  • 30:06 - 30:06
    What about…
  • 30:07 - 30:09
    This is still quite a lot of lines.
  • 30:09 - 30:14
    We can just sum up that rate.
  • 30:11 - 30:14
    How much CPU is in use total across
    all the servers that we have.
  • 30:14 - 30:24
    We can just see that there is
    a sum total of 0.6 CPU seconds/s
  • 30:25 - 30:28
    across the servers we have.
  • 30:28 - 30:31
    But that's a little too coarse.
  • 30:32 - 30:37
    What if we want to see it by instance?
  • 30:39 - 30:42
    Now, we can see the two servers,
    we can see
  • 30:43 - 30:45
    that we're left with just that label.
  • 30:46 - 30:50
    The influx labels are the influx instance
    and the influx demo.
  • 30:50 - 30:53
    That's a super easy way to see that,
  • 30:54 - 30:57
    but we can also do this
    the other way around.
  • 30:57 - 31:03
    We can say 'without (mode,cpu)' so
    we can drop those modes and
  • 31:03 - 31:05
    see all the labels that we have.
  • 31:05 - 31:12
    We can still see the environment label
    and the job label on our list data.
  • 31:12 - 31:16
    You can go either way
    with the summary functions.
  • 31:16 - 31:20
    There's a whole bunch of different functions
  • 31:21 - 31:23
    and it's all in our documentation.
  • 31:25 - 31:30
    But what if we want to see it…
  • 31:31 - 31:34
    What if we want to see which CPUs
    are in use?
  • 31:34 - 31:37
    Now we can see that it's only CPU0
  • 31:37 - 31:40
    because apparently these are only
    1-core instances.
  • 31:42 - 31:47
    You can add/remove labels and do
    all these queries.
  • 31:50 - 31:52
    Any other questions so far?
  • 31:54 - 31:59
    [Q] I don't have a question, but I have
    something to add.
  • 31:59 - 32:03
    Prometheus is really nice, but it's
    a lot better if you combine it
  • 32:03 - 32:05
    with Grafana.
  • 32:05 - 32:06
    [A] Yes, yes.
  • 32:07 - 32:12
    In the beginning, when we were creating
    Prometheus, we actually built
  • 32:13 - 32:15
    a piece of dashboard software called
    PromDash.
  • 32:16 - 32:21
    It was a simple little Ruby on Rails app
    to create dashboards
  • 32:21 - 32:23
    and it had a bunch of JavaScript.
  • 32:23 - 32:24
    And then Grafana came out.
  • 32:25 - 32:26
    And we're like
  • 32:26 - 32:30
    "Oh, that's interesting. It doesn't support
    Prometheus" so we were like
  • 32:30 - 32:32
    "Hey, can you support Prometheus"
  • 32:32 - 32:34
    and they're like "Yeah, we've got
    a REST API, get the data, done"
  • 32:36 - 32:38
    Now Grafana supports Prometheus and
    we're like
  • 32:40 - 32:42
    "Well, promdash, this is crap, delete".
  • 32:44 - 32:46
    The Prometheus development team,
  • 32:46 - 32:49
    we're all backend developers
    and SREs and
  • 32:50 - 32:51
    we have no JavaScript skills at all.
  • 32:53 - 32:55
    So we're like "Let somebody deal
    with that".
  • 32:55 - 32:58
    One of the nice things about working on
    this kind of project is
  • 32:58 - 33:02
    we can do the things that we're good at
    and we don't try…
  • 33:02 - 33:05
    We don't have any marketing people,
    it's just an opensource project,
  • 33:06 - 33:09
    there's no single company behind Prometheus.
  • 33:10 - 33:14
    I work for GitLab, Improbable paid for
    the Thanos system,
  • 33:16 - 33:25
    other companies like Red Hat now pay
    people who used to work at CoreOS to
  • 33:25 - 33:27
    work on Prometheus.
  • 33:27 - 33:30
    There's lots and lots of collaboration
    between many companies
  • 33:30 - 33:33
    to build the Prometheus ecosystem.
  • 33:36 - 33:37
    But yeah, Grafana is great.
  • 33:39 - 33:45
    Actually, Grafana now has
    two full-time Prometheus developers.
  • 33:49 - 33:51
    Alright, that's it.
  • 33:53 - 33:57
    [Applause]
Title:
Metrics-Based Monitoring with Prometheus
Description:

Talk given by Ben Kochie at MiniDebConf Hamburg 2018
https://meetings-archive.debian.net/pub/debian-meetings/2018/miniconf-hamburg/2018-05-19/metrics_based_monitoring.webm

Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
34:03
