
Metrics-Based Monitoring with Prometheus

  • Not Synced
    So, we had a talk by a non-GitLab person
    about GitLab.
  • Not Synced
    Now, we have a talk by a GitLab person
    on non-GitLab.
  • Not Synced
    Something like that?
  • Not Synced
    The CCCHH hackerspace is now open,
  • Not Synced
    from now on if you want to go there,
    that's the announcement.
  • Not Synced
    And the next talk will be by Ben Kochie
  • Not Synced
    on metrics-based monitoring
    with Prometheus.
  • Not Synced
    Welcome.
  • Not Synced
    [Applause]
  • Not Synced
    Alright, so
  • Not Synced
    my name is Ben Kochie
  • Not Synced
    I work on DevOps features for GitLab
  • Not Synced
    and apart from working for GitLab, I also work
    on the open source Prometheus project.
  • Not Synced
    I live in Berlin and I've been using
    Debian since ???
  • Not Synced
    yes, quite a long time.
  • Not Synced
    So, what is Metrics-based Monitoring?
  • Not Synced
    If you're running software in production,
  • Not Synced
    you probably want to monitor it,
  • Not Synced
    because if you don't monitor it, you don't
    know it's right.
  • Not Synced
    ??? break down into two categories:
  • Not Synced
    there's blackbox monitoring and
    there's whitebox monitoring.
  • Not Synced
    Blackbox monitoring is treating
    your software like a blackbox.
  • Not Synced
    It's just checks to see, like,
  • Not Synced
    is it responding, or does it ping
  • Not Synced
    or ??? HTTP requests
  • Not Synced
    [mic turned on]
  • Not Synced
    Ah, there we go, much better.
  • Not Synced
    So, blackbox monitoring is a probe,
  • Not Synced
    it just kind of looks from the outside
    to your software
  • Not Synced
    and it has no knowledge of the internals
  • Not Synced
    and it's really good for end to end testing.
  • Not Synced
    So if you've got a fairly complicated
    service,
  • Not Synced
    you come in from the outside, you go
    through the load balancer,
  • Not Synced
    you hit the API server,
  • Not Synced
    the API server might hit a database,
  • Not Synced
    and you go all the way through
    to the back of the stack
  • Not Synced
    and all the way back out
  • Not Synced
    so you know that everything is working
    end to end.
  • Not Synced
    But you only know about it
    for that one request.
  • Not Synced
    So in order to find out if your service
    is working,
  • Not Synced
    from the end to end, for every single
    request,
  • Not Synced
    this requires whitebox instrumentation.
  • Not Synced
    So, basically, every event that happens
    inside your software,
  • Not Synced
    inside a serving stack,
  • Not Synced
    gets collected and gets counted,
  • Not Synced
    so you know that every request hits
    the load balancer,
  • Not Synced
    every request hits your application
    service,
  • Not Synced
    every request hits the database.
  • Not Synced
    You know that everything matches up
  • Not Synced
    and this is called whitebox, or
    metrics-based monitoring.
  • Not Synced
    There are different examples of, like,
  • Not Synced
    the kind of software that does blackbox
    and whitebox monitoring.
  • Not Synced
    So you have software like Nagios that
    you can configure checks
  • Not Synced
    or Pingdom,
  • Not Synced
    Pingdom will ping your website.
  • Not Synced
    And then there is metrics-based monitoring,
  • Not Synced
    things like Prometheus, things like
    the TICK stack from InfluxData,
  • Not Synced
    New Relic and other commercial solutions
  • Not Synced
    but of course I like to talk about
    the open source solutions.
  • Not Synced
    We're gonna talk a little bit about
    Prometheus.
  • Not Synced
    Prometheus came out of the idea that
  • Not Synced
    we needed a monitoring system that could
    collect all this whitebox metric data
  • Not Synced
    and do something useful with it.
  • Not Synced
    Not just give us a pretty graph, but
    we also want to be able to
  • Not Synced
    alert on it.
  • Not Synced
    So we needed both
  • Not Synced
    a data gathering and an analytics system
    in the same instance.
  • Not Synced
    To do this, we built this thing and
    we looked at the way that
  • Not Synced
    data was being generated
    by the applications
  • Not Synced
    and there are advantages and
    disadvantages to this
  • Not Synced
    push vs. poll model for metrics.
  • Not Synced
    We decided to go with the polling model
  • Not Synced
    because there are some slight advantages
    for polling over pushing.
  • Not Synced
    With polling, you get this free
    blackbox check
  • Not Synced
    that the application is running.
  • Not Synced
    When you poll your application, you know
    that the process is running.
  • Not Synced
    If you are doing push-based, you can't
    tell the difference between
  • Not Synced
    your application doing no work and
    your application not running.
  • Not Synced
    So you don't know if it's stuck,
  • Not Synced
    or is it just not having to do any work.
  • Not Synced
    With polling, the polling system knows
    the state of your network.
  • Not Synced
    If you have a defined set of services,
  • Not Synced
    that inventory drives what should be there.
  • Not Synced
    Again, it's like, the disappearing,
  • Not Synced
    is the process dead, or is it just
    not doing anything?
  • Not Synced
    With polling, you know for a fact
    what processes should be there,
  • Not Synced
    and it's a bit of an advantage there.
  • Not Synced
    With polling, there's really easy testing.
  • Not Synced
    With push-based metrics, you have to
    figure out
  • Not Synced
    if you want to test a new version of
    the monitoring system or
  • Not Synced
    you want to test something new,
  • Not Synced
    you have to ??? a copy of the data.
  • Not Synced
    With polling, you can just set up
    another instance of your monitoring
  • Not Synced
    and just test it.
  • Not Synced
    Or you don't even have,
  • Not Synced
    it doesn't even have to be monitoring,
    you can just use curl
  • Not Synced
    to poll the metrics endpoint.
  • Not Synced
    It's significantly easier to test.
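
The pull model described above can be sketched with only the Python standard library: the application exposes its own numbers on a `/metrics` path, and anything that speaks HTTP (a monitoring server, a second test instance, or curl) can read them. The metric name `demo_requests_total`, the `STATE` dictionary, and the handler are illustrative assumptions, not something shown in the talk; a real application would more likely use one of the Prometheus client libraries.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state: the application only knows about itself.
STATE = {"requests_total": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter value as plain text on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {STATE['requests_total']}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet; no access log on stderr.
        pass


def scrape(url):
    """What a poller does: one plain HTTP GET against the endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def demo():
    # Port 0 asks the OS for any free port, so the demo does not collide.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        STATE["requests_total"] += 3  # pretend the app handled three requests
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Note that the application side never learns who is polling it, which is the point made about the client staying simple.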
  • Not Synced
    The other thing with the…
  • Not Synced
    The other nice thing is that
    the client is really simple.
  • Not Synced
    The client doesn't have to know
    where the monitoring system is.
  • Not Synced
    It doesn't have to know about ???
  • Not Synced
    It just has to sit and collect the data
    about itself.
  • Not Synced
    So it doesn't have to know anything about
    the topology of the network.
  • Not Synced
    As an application developer, if you're
    writing a DNS server or
  • Not Synced
    some other piece of software,
  • Not Synced
    you don't have to know anything about
    monitoring software,
  • Not Synced
    you can just implement it inside
    your application and
  • Not Synced
    the monitoring software, whether it's
    Prometheus or something else,
  • Not Synced
    can just come and collect that data for you.
  • Not Synced
    That's kind of similar to a very old
    monitoring system called SNMP,
  • Not Synced
    but SNMP has a significantly less friendly
    data model for developers.
  • Not Synced
    This is the basic layout
    of a Prometheus server.
  • Not Synced
    At the core, there's a Prometheus server
  • Not Synced
    and it deals with all the data collection
    and analytics.
  • Not Synced
    Basically, this one binary,
    it's all written in Go.
  • Not Synced
    It's a single binary.
  • Not Synced
    It knows how to read from your inventory,
  • Not Synced
    there's a bunch of different methods,
    whether you've got
  • Not Synced
    a Kubernetes cluster or a cloud platform
  • Not Synced
    or you have your own customized thing
    with Ansible.
  • Not Synced
    Ansible can take your layout, drop that
    into a config file and
  • Not Synced
    Prometheus can pick that up.
  • Not Synced
    Once it has the layout, it goes out and
    collects all the data.
  • Not Synced
    It has a storage and a time series
    database to store all that data locally.
  • Not Synced
    It has a thing called PromQL, which is
    a query language designed
  • Not Synced
    for metrics and analytics.
  • Not Synced
    From that PromQL, you can add frontends
    that will,
  • Not Synced
    whether it's a simple API client
    to run reports,
  • Not Synced
    you can use things like Grafana
    for creating dashboards,
  • Not Synced
    it's got a simple web UI built in.
  • Not Synced
    You can plug in anything you want
    on that side.
  • Not Synced
    And then, it also has the ability to
    continuously execute queries
  • Not Synced
    called "recording rules"
  • Not Synced
    and these recording rules have
    two different modes.
  • Not Synced
    You can either record, you can take
    a query
  • Not Synced
    and it will generate new data
    from that query
  • Not Synced
    or you can take a query, and
    if it returns results,
  • Not Synced
    it will fire an alert.
  • Not Synced
    That alert is a push message
    to the Alertmanager.
  • Not Synced
    This allows us to separate the generating
    of alerts from the routing of alerts.
  • Not Synced
    You can have one or hundreds of Prometheus
    services, all generating alerts
  • Not Synced
    and it goes into an Alertmanager cluster
    and does the deduplication
  • Not Synced
    and the routing to the human
  • Not Synced
    because, of course, the thing
    that we want is
  • Not Synced
    we had dashboards with graphs, but
    in order to find out if something is broken
  • Not Synced
    you had to have a human
    looking at the graph.
  • Not Synced
    With Prometheus, we don't have to do that
    anymore,
  • Not Synced
    we can simply let the software tell us
    that we need to go investigate
  • Not Synced
    our problems.
  • Not Synced
    We don't have to sit there and
    stare at dashboards all day,
  • Not Synced
    because that's really boring.
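
The split between generating alerts and routing them can be illustrated with a small sketch. This is not how the Alertmanager is implemented; it only shows the idea of collapsing identical alerts sent by several servers and then choosing a receiver from the labels. The receiver names, label names, and the first-match routing rule are assumptions made for the example.

```python
def deduplicate(alerts):
    """Collapse alerts that carry exactly the same label set.

    alerts: iterable of label dicts, possibly sent by several servers.
    """
    unique = {}
    for labels in alerts:
        key = tuple(sorted(labels.items()))
        unique.setdefault(key, labels)
    return list(unique.values())


def route(labels, routes, default="default-receiver"):
    """Return the receiver of the first route whose matchers all apply."""
    for matchers, receiver in routes:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return default


# Two servers in an HA pair report the same problem.
incoming = [
    {"alertname": "DiskFull", "instance": "host1", "severity": "critical"},
    {"alertname": "DiskFull", "instance": "host1", "severity": "critical"},
    {"alertname": "HighLatency", "instance": "host2", "severity": "warning"},
]
# Hypothetical routing table: critical alerts page someone, the rest go to chat.
routes = [({"severity": "critical"}, "pager"), ({"severity": "warning"}, "chat")]

notifications = [(a["alertname"], route(a, routes)) for a in deduplicate(incoming)]
```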
  • Not Synced
    What does it look like to actually
    get data into Prometheus?
  • Not Synced
    This is a very basic output
    of a Prometheus metric.
  • Not Synced
    This is a very simple thing.
  • Not Synced
    If you know much about
    the Linux kernel,
  • Not Synced
    the Linux kernel tracks ??? stats,
    all the state of all the CPUs
  • Not Synced
    in your system
  • Not Synced
    and we express this by having
    the name of the metric, which is
  • Not Synced
    'node_cpu_seconds_total' and so
    this is a self-describing metric,
  • Not Synced
    like you can just read the metric's name
  • Not Synced
    and you understand a little bit about
    what's going on here.
  • Not Synced
    The Linux kernel and other kernels track
    their usage by the number of seconds
  • Not Synced
    spent doing different things and
  • Not Synced
    that could be, whether it's in system or
    user space or IRQs
  • Not Synced
    or iowait or idle.
  • Not Synced
    Actually, the kernel tracks how much
    idle time it has.
  • Not Synced
    It also tracks it by the number of CPUs.
  • Not Synced
    With other monitoring systems, they used
    to do this with a tree structure
  • Not Synced
    and this caused a lot of problems,
    like,
  • Not Synced
    how do you mix and match data? So
    by switching from
  • Not Synced
    a tree structure to a tag-based structure,
  • Not Synced
    we can do some really interesting,
    powerful data analytics.
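
To make the tag-based model concrete, here is a minimal sketch that parses a few lines in the text exposition style and sums them along any label. The sample values are made up for illustration, and the parser handles only the simple `name{labels} value` shape; it is not a full parser for the format.

```python
import re
from collections import defaultdict

# Illustrative samples only; real values come from the kernel's CPU accounting.
SAMPLE = """\
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.0
node_cpu_seconds_total{cpu="0",mode="user"} 50.0
node_cpu_seconds_total{cpu="1",mode="idle"} 990.0
node_cpu_seconds_total{cpu="1",mode="user"} 60.0
"""

LINE_RE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blanks, and shapes this sketch ignores
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def sum_by(samples, label):
    """Aggregate values along one label, ignoring all the others."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


by_mode = sum_by(parse(SAMPLE), "mode")  # totals per mode across all CPUs
by_cpu = sum_by(parse(SAMPLE), "cpu")    # totals per CPU across all modes
```

With a tree-shaped name such as `host.cpu.0.idle`, the grouping dimension is fixed by the position in the path; with labels, the same samples can be regrouped by `mode` or by `cpu` without renaming anything.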
  • Not Synced
    Here's a nice example of taking
    those CPU seconds counters
  • Not Synced
    and then converting them into a graph
    by using PromQL.
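
Turning an ever-increasing counter into something worth graphing means taking its per-second rate over a window. The sketch below shows only the core arithmetic on two samples; it is a simplification and, unlike PromQL's `rate()`, does not handle counter resets or extrapolation to the window edges. The function name and sample numbers are illustrative.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when the window does not contain enough data.
    """
    if len(samples) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        return None
    return (v_last - v_first) / (t_last - t_first)


# A counter of CPU seconds spent in user mode, scraped one minute apart:
# 30 CPU-seconds were used in 60 wall-clock seconds, i.e. half a core.
user_cpu_rate = per_second_rate([(0, 100.0), (60, 130.0)])
```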
  • Not Synced
    Now we can get into
    Metrics-Based Alerting.
  • Not Synced
    Now we have this graph, we have this thing
  • Not Synced
    we can look and see here
  • Not Synced
    "Oh there is some little spike here,
    we might want to know about that."
  • Not Synced
    Now we can get into Metrics-Based
    Alerting.
  • Not Synced
    I used to be a site reliability engineer,
    I'm still a site reliability engineer at heart
  • Not Synced
    and we have this concept of things that
    you need to run a site or a service reliably.
  • Not Synced
    The most important thing you need is
    down at the bottom,
  • Not Synced
    Monitoring, because if you don't have
    monitoring of your service,
  • Not Synced
    how do you know it's even working?
  • Not Synced
    There's a couple of techniques here, and
    we want to alert based on data
  • Not Synced
    and not just those end to end tests.
  • Not Synced
    There's a couple of techniques, a thing
    called the RED method
  • Not Synced
    and there's a thing called the USE method
  • Not Synced
    and there are a couple of nice
    blog posts about this
  • Not Synced
    and basically it defines that, for example,
  • Not Synced
    the RED method talks about the requests
    that your system is handling
  • Not Synced
    There are three things:
  • Not Synced
    There's the number of requests, there's
    the number of errors
  • Not Synced
    and there's how long it takes, the duration.
  • Not Synced
    With the combination of these three things
  • Not Synced
    you can determine most of
    what your users see
  • Not Synced
    "Did my request go through? Did it
    return an error? Was it fast?"
  • Not Synced
    Most people, that's all they care about.
  • Not Synced
    "I made a request to a website and
    it came back and it was fast."
  • Not Synced
    It's a very simple method of just, like,
  • Not Synced
    those are the important things to
    determine if your site is healthy.
  • Not Synced
    But we can go back to some more
    traditional, sysadmin style, alerts
  • Not Synced
    this is basically taking the filesystem
    available space,
  • Not Synced
    divided by the filesystem size, that becomes
    the ratio of filesystem availability
  • Not Synced
    from 0 to 1.
  • Not Synced
    Multiply it by 100, we now have
    a percentage
  • Not Synced
    and if it's less than or equal to 1%
    for 15 minutes,
  • Not Synced
    this is less than 1% space, we should tell
    a sysadmin to go check
  • Not Synced
    the ??? filesystem ???
  • Not Synced
    It's super nice and simple.
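
The alert just described has two parts: an expression (available space as a percentage of size) and a hold time (the condition must stay true for 15 minutes). A minimal sketch of that logic, with hypothetical function names and made-up byte counts:

```python
def percent_available(avail_bytes, size_bytes):
    """Available space as a percentage of the filesystem size."""
    return avail_bytes / size_bytes * 100


def should_fire(history, threshold_percent=1.0, hold_seconds=15 * 60):
    """True if the condition has held continuously for hold_seconds.

    history: list of (timestamp_seconds, avail_bytes, size_bytes), oldest first.
    """
    breach_start = None
    for timestamp, avail, size in history:
        if percent_available(avail, size) <= threshold_percent:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
        else:
            breach_start = None  # any recovery resets the clock
    if breach_start is None:
        return False
    return history[-1][0] - breach_start >= hold_seconds


# One sample per minute for 15 minutes, 0.5% free the whole time.
steady = [(t, 5, 1000) for t in range(0, 901, 60)]
# Same, but the filesystem briefly recovered at the 10-minute mark.
flapping = [(t, 500 if t == 600 else 5, 1000) for t in range(0, 901, 60)]
```

The hold time is what keeps a brief dip from paging anyone: `steady` fires, `flapping` does not.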
  • Not Synced
    We can also tag, we can include…
  • Not Synced
    Every alert includes all the extraneous
    labels that Prometheus adds to your metrics
  • Not Synced
    When you add a metric in Prometheus, if
    we go back and we look at this metric.
  • Not Synced
    This metric only contains the information
    about the internals of the application
  • Not Synced
    anything about, like, what server it's on,
    is it running in a container,
  • Not Synced
    what cluster does it come from,
    what ??? is it on,
  • Not Synced
    that's all extra annotations that are
    added by the Prometheus server
  • Not Synced
    at discovery time.
  • Not Synced
    I don't have a good example of what
    those labels look like
  • Not Synced
    but every metric gets annotated
    with location information.
  • Not Synced
    That location information also comes through
    as labels in the alert
  • Not Synced
    so, if you have a message coming
    into your Alertmanager,
  • Not Synced
    the Alertmanager can look and go
  • Not Synced
    "Oh, that's coming from this datacenter"
  • Not Synced
    and it can include that in the email or
    IRC message or SMS message.
  • Not Synced
    So you can include
  • Not Synced
    "Filesystem is out of space on this host
    from this datacenter"
  • Not Synced
    All these labels get passed through and
    then you can append
  • Not Synced
    "severity: critical" to that alert and
    include that in the message to the human
  • Not Synced
    because of course, this is how you define…
  • Not Synced
    Getting the message from the monitoring
    to the human.
  • Not Synced
    You can even include nice things like,
  • Not Synced
    if you've got documentation, you can
    include a link to the documentation
  • Not Synced
    as an annotation
  • Not Synced
    and the Alertmanager can take that
    basic url and, you know,
  • Not Synced
    massage it into whatever it needs
    to look like to actually get
  • Not Synced
    the operator to the correct documentation.
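
As a sketch of how labels and annotations end up in the message a human reads, the function below builds one line of text from them. The label names (`instance`, `datacenter`, `severity`), the `runbook` annotation key, and the URL are assumptions for illustration; the Alertmanager has its own templating, which this does not reproduce.

```python
def render_message(labels, annotations):
    """Build a one-line notification from an alert's labels and annotations."""
    summary = annotations.get("summary", labels.get("alertname", "alert"))
    parts = [f"[{labels.get('severity', 'unknown')}] {summary}"]
    # Location labels attached at discovery time travel with the alert.
    for key in ("instance", "datacenter"):
        if key in labels:
            parts.append(f"{key}={labels[key]}")
    if "runbook" in annotations:
        parts.append(f"docs: {annotations['runbook']}")
    return " ".join(parts)


message = render_message(
    {
        "alertname": "FilesystemAlmostFull",
        "severity": "critical",
        "instance": "host1",
        "datacenter": "dc1",
    },
    {
        "summary": "Filesystem is out of space",
        "runbook": "https://docs.example.invalid/disk-full",  # placeholder URL
    },
)
```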
  • Not Synced
    We can also do more fun things:
  • Not Synced
    since we actually are not just checking
  • Not Synced
    what is the space right now,
    we're tracking data over time,
  • Not Synced
    we can use 'predict_linear'.
  • Not Synced
    'predict_linear' just takes and does
    a simple linear regression.
  • Not Synced
    This example takes the filesystem
    available space over the last hour and
  • Not Synced
    does a linear regression.
  • Not Synced
    Prediction says "Well, it's going that way
    and four hours from now,
  • Not Synced
    based on one hour of history, it's gonna
    be less than 0, which means full".
  • Not Synced
    We know that within the next four hours,
    the disk is gonna be full
  • Not Synced
    so we can tell the operator ahead of time
    that it's gonna be full
  • Not Synced
    and not just tell them that it's full
    right now.
  • Not Synced
    They have some window of ability
    to fix it before it fails.
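
The prediction step is an ordinary least-squares line fitted to the recent samples and extended into the future. The sketch below shows that arithmetic in plain Python; the function name `linear_forecast` and the sample numbers are illustrative, and it is not a reimplementation of PromQL's `predict_linear`.

```python
def linear_forecast(samples, seconds_ahead):
    """Fit a least-squares line and evaluate it seconds_ahead past the last sample.

    samples: list of (timestamp_seconds, value), oldest first.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# One hour of history: free space drops from 500 to 300 units.
history = [(0, 500.0), (1800, 400.0), (3600, 300.0)]
in_four_hours = linear_forecast(history, 4 * 3600)
will_fill_up = in_four_hours < 0  # below zero free space means full before then
```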
  • Not Synced
    This is really important because
    if you're running a site
  • Not Synced
    you want to be able to have alerts
    that tell you that your system is failing
  • Not Synced
    before it actually fails.
  • Not Synced
    Because if it fails, you're out of SLO
    or SLA and
  • Not Synced
    your users are gonna be unhappy
  • Not Synced
    and you don't want the users to tell you
    that your site is down
  • Not Synced
    you want to know about it before
    your users can even tell.
  • Not Synced
    This allows you to do that.
  • Not Synced
    And also of course, Prometheus being
    a modern system,
  • Not Synced
    we fully support UTF-8 in all of our labels.
  • Not Synced
    Here's another one, here's a good example
    from the USE method.
  • Not Synced
    This is a rate of 500 errors coming from
    an application
  • Not Synced
    and you can simply alert that
  • Not Synced
    there's more than 500 errors per second
    coming out of the application
  • Not Synced
    if that's your threshold for ???
  • Not Synced
    And you can do other things,
  • Not Synced
    you can convert that from just
    a rate of errors
  • Not Synced
    to a percentage of errors.
  • Not Synced
    So you could say
  • Not Synced
    "I have an SLA of three nines" and so you can say
  • Not Synced
    "If the rate of errors divided by the rate
    of requests is .01,
  • Not Synced
    or is more than .01, then
    that's a problem."
  • Not Synced
    You can include that level of
    error granularity.
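
Alerting on a ratio rather than a raw count is a one-line division once both rates are available. A minimal sketch, with hypothetical function names; the 0.01 default is the figure quoted in the talk, and a stricter availability target would use a smaller value.

```python
def error_ratio(errors_per_second, requests_per_second):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if requests_per_second <= 0:
        return 0.0
    return errors_per_second / requests_per_second


def breaches_budget(errors_per_second, requests_per_second, max_ratio=0.01):
    """True when the error ratio is above the allowed maximum."""
    return error_ratio(errors_per_second, requests_per_second) > max_ratio


# 20 errors/s out of 1000 requests/s is 2%, which is over a 1% limit.
busy_and_failing = breaches_budget(20, 1000)
# 5 errors/s out of 1000 requests/s is 0.5%, which is within it.
busy_and_fine = breaches_budget(5, 1000)
```

A fixed threshold on errors per second would treat both cases differently depending on traffic volume; the ratio stays meaningful whether the service handles ten requests per second or ten thousand.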
  • Not Synced
    And if you're just doing a blackbox test,
  • Not Synced
    you wouldn't know this, you would only get
    if you got an error from the system,
  • Not Synced
    then you got another error from the system
  • Not Synced
    then you fire an alert.
  • Not Synced
    But if those checks are one minute apart
    and you're serving 1000 requests per second
  • Not Synced
    you could be serving 10,000 errors before
    you even get an alert.
  • Not Synced
    And you might miss it, because
  • Not Synced
    what if you only get one random error
  • Not Synced
    and then the next time, you're serving
    25% errors,
  • Not Synced
    you only have a 25% chance of that check
    failing again.
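
The arithmetic behind this point can be checked directly. The sketch assumes each probe is an independent request that fails with the same probability as any other request, which is a simplification; the function names are illustrative.

```python
def chance_all_probes_fail(error_fraction, consecutive_probes):
    """Probability that every one of several independent probes hits an error."""
    return error_fraction ** consecutive_probes


def requests_between_probes(requests_per_second, probe_interval_seconds):
    """How many real requests are served between two blackbox checks."""
    return requests_per_second * probe_interval_seconds


# With 25% of requests failing, two probes in a row both fail only 6.25% of the time.
two_in_a_row = chance_all_probes_fail(0.25, 2)
# At 1000 requests per second and one probe per minute, 60,000 requests go unobserved.
unobserved = requests_between_probes(1000, 60)
```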
  • Not Synced
    You really need these metrics in order
    to get
  • Not Synced
    proper reports of the status of your system
  • Not Synced
    There's even options
  • Not Synced
    You can slice and dice those labels.
  • Not Synced
    If you have a label on all of
    your applications called 'service'
  • Not Synced
    you can send that 'service' label through
    to the message
  • Not Synced
    and you can say
    "Hey, this service is broken".
  • Not Synced
    You can include that service label
    in your alert messages.
  • Not Synced
    And that's it, I can go to a demo and Q&A.
  • Not Synced
    [Applause]
  • Not Synced
    Any questions so far?
  • Not Synced
    Or anybody want to see a demo?
  • Not Synced
    [Q] Hi. Does Prometheus make metric
    discovery inside containers
  • Not Synced
    or do I have to implement the metrics
    myself?
  • Not Synced
    [A] For metrics in containers, there are
    already things that expose
  • Not Synced
    the metrics of the container system
    itself.
  • Not Synced
    There's a utility called 'cAdvisor' and
  • Not Synced
    cAdvisor takes the Linux cgroup data
    and exposes it as metrics
  • Not Synced
    so you can get data about
    how much CPU time is being
  • Not Synced
    spent in your container,
  • Not Synced
    how much memory is being used
    by your container.
  • Not Synced
    [Q] But not about the application,
    just about the container usage?
  • Not Synced
    [A] Right. Because the container
    has no idea
  • Not Synced
    whether your application is written
    in Ruby or Go or Python or whatever,
  • Not Synced
    you have to build that into
    your application in order to get the data.
  • Not Synced
    So for Prometheus,
  • Not Synced
    we've written client libraries that can be
    included in your application directly
  • Not Synced
    so you can get that data out.
  • Not Synced
    If you go to the Prometheus website,
    we have a whole series of client libraries
  • Not Synced
    and we cover a pretty good selection
    of popular software.
  • Not Synced
    [Q] What is the current state of
    long-term data storage?
  • Not Synced
    [A] Very good question.
  • Not Synced
    There's been several…
  • Not Synced
    There's actually several different methods
    of doing this.
  • Not Synced
    Prometheus stores all this data locally
    in its own data storage
  • Not Synced
    on the local disk.
  • Not Synced
    But that's only as durable as
    that server is durable.
  • Not Synced
    So if you've got a really durable server,
  • Not Synced
    you can store as much data as you want,
  • Not Synced
    you can store years and years of data
    locally on the Prometheus server.
  • Not Synced
    That's not a problem.
  • Not Synced
    There's a bunch of misconceptions because
    of our default
  • Not Synced
    and the language on our website said
  • Not Synced
    "It's not long-term storage"
  • Not Synced
    simply because we leave that problem
    up to the person running the server.
  • Not Synced
    But the time series database
    that Prometheus includes
  • Not Synced
    is actually quite durable.
  • Not Synced
    But it's only as durable as the server
    underneath it.
  • Not Synced
    So if you've got a very large cluster and
    you want really high durability,
  • Not Synced
    you need to have some kind of
    cluster software,
  • Not Synced
    but because we want Prometheus to be
    simple to deploy
  • Not Synced
    and very simple to operate
  • Not Synced
    and also very robust.
  • Not Synced
    We didn't want to include any clustering
    in Prometheus itself,
  • Not Synced
    because anytime you have a clustered
    software,
  • Not Synced
    what happens if your network is
    a little wonky?
  • Not Synced
    The first thing that goes down is
    all of your distributed systems fail.
  • Not Synced
    And building distributed systems to be
    really robust is really hard
  • Not Synced
    so Prometheus is what we call
    "uncoordinated distributed systems".
  • Not Synced
    If you've got two Prometheus servers
    monitoring all your targets in an HA mode
  • Not Synced
    in a cluster, and there's a split brain,
  • Not Synced
    each Prometheus can see
    half of the cluster and
  • Not Synced
    it can see that the other half
    of the cluster is down.
  • Not Synced
    They can both try to get alerts out
    to the Alertmanager
  • Not Synced
    and this is a really really robust way of
    handling split brains
  • Not Synced
    and bad network failures and bad problems
    in a cluster.
  • Not Synced
    It's designed to be super super robust
Title:
Metrics-Based Monitoring with Prometheus
Description:

Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
34:03

English subtitles

Incomplete
