Metrics-Based Monitoring with Prometheus

  • Not Synced
    So, we had a talk by a non-GitLab person
    about GitLab.
  • Not Synced
    Now, we have a talk by a GitLab person
    on non-GitLab.
  • Not Synced
    Something like that?
  • Not Synced
    The CCCHH hackerspace is now open,
  • Not Synced
    from now on if you want to go there,
    that's the announcement.
  • Not Synced
    And the next talk will be by Ben Kochie
  • Not Synced
    on metrics-based monitoring
    with Prometheus.
  • Not Synced
    Welcome.
  • Not Synced
    [Applause]
  • Not Synced
    Alright, so
  • Not Synced
    my name is Ben Kochie
  • Not Synced
    I work on DevOps features for GitLab
  • Not Synced
    and apart from working for GitLab, I also work
    on the open source Prometheus project.
  • Not Synced
    I live in Berlin and I've been using
    Debian since ???
  • Not Synced
    yes, quite a long time.
  • Not Synced
    So, what is Metrics-based Monitoring?
  • Not Synced
    If you're running software in production,
  • Not Synced
    you probably want to monitor it,
  • Not Synced
    because if you don't monitor it, you don't
    know it's right.
  • Not Synced
    Monitoring breaks down into two categories:
  • Not Synced
    there's blackbox monitoring and
    there's whitebox monitoring.
  • Not Synced
    Blackbox monitoring is treating
    your software like a blackbox.
  • Not Synced
    It's just checks to see, like,
  • Not Synced
    is it responding, or does it ping
  • Not Synced
    or ??? HTTP requests
  • Not Synced
    [mic turned on]
  • Not Synced
    Ah, there we go, much better.
  • Not Synced
    So, blackbox monitoring is a probe,
  • Not Synced
    it just kind of looks from the outside
    to your software
  • Not Synced
    and it has no knowledge of the internals
  • Not Synced
    and it's really good for end-to-end testing.
  • Not Synced
    So if you've got a fairly complicated
    service,
  • Not Synced
    you come in from the outside, you go
    through the load balancer,
  • Not Synced
    you hit the API server,
  • Not Synced
    the API server might hit a database,
  • Not Synced
    and you go all the way through
    to the back of the stack
  • Not Synced
    and all the way back out
  • Not Synced
    so you know that everything is working
    end to end.
  • Not Synced
    But you only know about it
    for that one request.
  • Not Synced
    So in order to find out if your service
    is working,
  • Not Synced
    from the end to end, for every single
    request,
  • Not Synced
    this requires whitebox instrumentation.
  • Not Synced
    So, basically, every event that happens
    inside your software,
  • Not Synced
    inside a serving stack,
  • Not Synced
    gets collected and gets counted,
  • Not Synced
    so you know that every request hits
    the load balancer,
  • Not Synced
    every request hits your application
    service,
  • Not Synced
    every request hits the database.
  • Not Synced
    You know that everything matches up
  • Not Synced
    and this is called whitebox, or
    metrics-based monitoring.
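
As a rough illustration of counting every event at every layer, here is a minimal Python sketch. The layer and counter names are hypothetical and the "stack" is simulated in one process; it only shows the idea that per-layer counters let you check that the numbers match up.

```python
from collections import Counter

# Hypothetical counter names, one per layer of the serving stack.
LAYERS = (
    "load_balancer_requests_total",
    "app_requests_total",
    "database_requests_total",
)

counts = Counter()


def serve(request_id):
    """Pass one request through every layer, counting it at each hop."""
    for layer in LAYERS:
        counts[layer] += 1
    return f"response for {request_id}"


def layers_match():
    """True when every layer has seen the same number of requests."""
    return len({counts[layer] for layer in LAYERS}) == 1


for request_id in range(5):
    serve(request_id)

print(dict(counts), "match:", layers_match())
```

If one layer dropped requests, its counter would fall behind the others and `layers_match()` would return False, which a single end-to-end probe could easily miss.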
  • Not Synced
    There are different examples of, like,
  • Not Synced
    the kind of software that does blackbox
    and whitebox monitoring.
  • Not Synced
    So you have software like Nagios that
    you can configure checks
  • Not Synced
    or Pingdom,
  • Not Synced
    Pingdom will ping your website.
  • Not Synced
    And then there is metrics-based monitoring,
  • Not Synced
    things like Prometheus, things like
    the TICK stack from InfluxData,
  • Not Synced
    New Relic and other commercial solutions
  • Not Synced
    but of course I like to talk about
    the open source solutions.
  • Not Synced
    We're gonna talk a little bit about
    Prometheus.
  • Not Synced
    Prometheus came out of the idea that
  • Not Synced
    we needed a monitoring system that could
    collect all this whitebox metric data
  • Not Synced
    and do something useful with it.
  • Not Synced
    Not just give us a pretty graph, but
    we also want to be able to
  • Not Synced
    alert on it.
  • Not Synced
    So we needed both
  • Not Synced
    a data gathering and an analytics system
    in the same instance.
  • Not Synced
    To do this, we built this thing and
    we looked at the way that
  • Not Synced
    data was being generated
    by the applications
  • Not Synced
    and there are advantages and
    disadvantages to this
  • Not Synced
    push vs. poll model for metrics.
  • Not Synced
    We decided to go with the polling model
  • Not Synced
    because there are some slight advantages
    for polling over pushing.
  • Not Synced
    With polling, you get this free
    blackbox check
  • Not Synced
    that the application is running.
  • Not Synced
    When you poll your application, you know
    that the process is running.
  • Not Synced
    If you are doing push-based, you can't
    tell the difference between
  • Not Synced
    your application doing no work and
    your application not running.
  • Not Synced
    So you don't know if it's stuck,
  • Not Synced
    or is it just not having to do any work.
  • Not Synced
    With polling, the polling system knows
    the state of your network.
  • Not Synced
    If you have a defined set of services,
  • Not Synced
    that inventory drives what should be there.
  • Not Synced
    Again, it's like, the disappearing,
  • Not Synced
    is the process dead, or is it just
    not doing anything?
  • Not Synced
    With polling, you know for a fact
    what processes should be there,
  • Not Synced
    and it's a bit of an advantage there.
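
The point about polling against a known inventory can be sketched in a few lines of Python. This is a toy model, not how Prometheus is implemented: the target names are made up, and `fake_fetch` stands in for a real HTTP request. Prometheus does record a comparable per-target `up` value for each scrape.

```python
def poll(inventory, fetch):
    """Scrape every target in the inventory and return 1 (up) or 0 (down) for each.

    fetch(target) returns the metrics text, or raises OSError if the
    target cannot be reached.
    """
    up = {}
    for target in inventory:
        try:
            fetch(target)
            up[target] = 1
        except OSError:
            up[target] = 0
    return up


# Hypothetical inventory: the list of processes that should exist.
inventory = ["app-1:8080", "app-2:8080"]


def fake_fetch(target):
    """Stand-in for an HTTP GET: app-1 is idle but alive, app-2 is dead."""
    if target == "app-2:8080":
        raise ConnectionRefusedError("process not running")
    return "requests_total 0\n"


status = poll(inventory, fake_fetch)
print(status)
```

Here `app-1` reports zero work but still answers the poll, while `app-2` does not answer at all, so the two cases are distinguishable.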
  • Not Synced
    With polling, there's really easy testing.
  • Not Synced
    With push-based metrics, you have to
    figure out
  • Not Synced
    if you want to test a new version of
    the monitoring system or
  • Not Synced
    you want to test something new,
  • Not Synced
    you have to ??? a copy of the data.
  • Not Synced
    With polling, you can just set up
    another instance of your monitoring
  • Not Synced
    and just test it.
  • Not Synced
    Or you don't even have,
  • Not Synced
    it doesn't even have to be monitoring,
    you can just use curl
  • Not Synced
    to poll the metrics endpoint.
  • Not Synced
    It's significantly easier to test.
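
To make the "just use curl" point concrete, here is a minimal sketch of an application exposing a metrics endpoint with only the Python standard library. The metric name `myapp_requests_total` and the helper `scrape_once` are assumptions for illustration; a real application would more likely use an official Prometheus client library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0  # hypothetical application counter


def render_metrics():
    """Render the current counter in the Prometheus text format."""
    return (
        "# TYPE myapp_requests_total counter\n"
        f"myapp_requests_total {REQUESTS_SERVED}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape_once():
    """Start the server on a free local port, fetch /metrics once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `scrape_once()` plays the role of the monitoring system (or of curl): anything that can issue an HTTP GET can read the metrics, and the application needs no knowledge of who is asking.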
  • Not Synced
    The other thing with the…
  • Not Synced
    The other nice thing is that
    the client is really simple.
  • Not Synced
    The client doesn't have to know
    where the monitoring system is.
  • Not Synced
    It doesn't have to know about ???
  • Not Synced
    It just has to sit and collect the data
    about itself.
  • Not Synced
    So it doesn't have to know anything about
    the topology of the network.
  • Not Synced
    As an application developer, if you're
    writing a DNS server or
  • Not Synced
    some other piece of software,
  • Not Synced
    you don't have to know anything about
    monitoring software,
  • Not Synced
    you can just implement it inside
    your application and
  • Not Synced
    the monitoring software, whether it's
    Prometheus or something else,
  • Not Synced
    can just come and collect that data for you.
  • Not Synced
    That's kind of similar to a very old
    monitoring system called SNMP,
  • Not Synced
    but SNMP has a significantly less friendly
    data model for developers.
  • Not Synced
    This is the basic layout
    of a Prometheus server.
  • Not Synced
    At the core, there's a Prometheus server
  • Not Synced
    and it deals with all the data collection
    and analytics.
  • Not Synced
    Basically, this one binary,
    it's all written in golang.
  • Not Synced
    It's a single binary.
  • Not Synced
    It knows how to read from your inventory,
  • Not Synced
    there's a bunch of different methods,
    whether you've got
  • Not Synced
    a Kubernetes cluster or a cloud platform
  • Not Synced
    or you have your own customized thing
    with Ansible.
  • Not Synced
    Ansible can take your layout, drop that
    into a config file and
  • Not Synced
    Prometheus can pick that up.
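
One way this hand-off can work is Prometheus's file-based service discovery, which reads target lists from JSON or YAML files. The sketch below generates such a JSON document in Python; the host names, port, label, and the helper `to_file_sd` are hypothetical, and in practice a tool like Ansible would render the file from its own inventory.

```python
import json


def to_file_sd(hosts, port, env):
    """Build a file-based service discovery document: a list of target groups."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"env": env},
        }
    ]


# Hypothetical inventory, as a configuration management tool might hold it.
inventory = ["web1.example.org", "web2.example.org"]

document = to_file_sd(inventory, 9100, "production")
text = json.dumps(document, indent=2)
print(text)
```

Writing `text` to a file that the Prometheus configuration lists under `file_sd_configs` would let the server pick up the targets; the file path is left out here because it depends on the deployment.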
  • Not Synced
    Once it has the layout, it goes out and
    collects all the data.
  • Not Synced
    It has a storage and a time series
    database to store all that data locally.
  • Not Synced
    It has a thing called PromQL, which is
    a query language designed
  • Not Synced
    for metrics and analytics.
  • Not Synced
    From that PromQL, you can add frontends
    that will,
  • Not Synced
    whether it's a simple API client
    to run reports,
  • Not Synced
    you can use things like Grafana
    for creating dashboards,
  • Not Synced
    it's got a simple web UI built in.
  • Not Synced
    You can plug in anything you want
    on that side.
  • Not Synced
    And then, it also has the ability to
    continuously execute queries
  • Not Synced
    called "recording rules"
  • Not Synced
    and these recording rules have
    two different modes.
  • Not Synced
    You can either record, you can take
    a query
  • Not Synced
    and it will generate new data
    from that query
  • Not Synced
    or you can take a query, and
    if it returns results,
  • Not Synced
    it will return an alert.
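
The two modes described here can be modelled with a toy example. This is not PromQL and not Prometheus's rule engine: the in-memory `samples` dictionary, the series names, and both helper functions are assumptions made purely to show the difference between recording new data and alerting on query results.

```python
# Toy in-memory "database": series name -> {job label -> latest sample}.
samples = {"errors_per_second": {"api": 0.5, "web": 8.0}}


def run_recording_rule(new_name, query):
    """Evaluate the query and store its result as a new series."""
    samples[new_name] = query(samples)


def run_alerting_rule(alert_name, query):
    """Evaluate the query; every result it returns becomes one alert."""
    return [
        {"alert": alert_name, "job": job, "value": value}
        for job, value in query(samples).items()
    ]


# Mode 1: record a derived series.
run_recording_rule(
    "errors_per_minute",
    lambda s: {job: v * 60 for job, v in s["errors_per_second"].items()},
)

# Mode 2: alert whenever the query returns anything.
alerts = run_alerting_rule(
    "HighErrorRate",
    lambda s: {job: v for job, v in s["errors_per_second"].items() if v > 5},
)

print(samples["errors_per_minute"])
print(alerts)
```

The alerting query filters down to the series that are in a bad state, so an empty result means no alert.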
  • Not Synced
    That alert is a push message
    to the Alertmanager.
  • Not Synced
    This allows us to separate the generating
    of alerts from the routing of alerts.
  • Not Synced
    You can have one or hundreds of Prometheus
    services, all generating alerts
  • Not Synced
    and they go into an Alertmanager cluster,
    which does the deduplication
  • Not Synced
    and the routing to the human
  • Not Synced
    because, of course, the thing
    that we want is
  • Not Synced
    we had dashboards with graphs, but
    in order to find out if something is broken
  • Not Synced
    you had to have a human
    looking at the graph.
  • Not Synced
    With Prometheus, we don't have to do that
    anymore,
  • Not Synced
    we can simply let the software tell us
    that we need to go investigate
  • Not Synced
    our problems.
  • Not Synced
    We don't have to sit there and
    stare at dashboards all day,
  • Not Synced
    because that's really boring.
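
The deduplication and routing step mentioned above can be sketched as follows. This is a simplified model under stated assumptions, not Alertmanager's actual algorithm: alerts are plain dictionaries, identity is the full label set, and the receiver names and the `route` rule format are invented for the example.

```python
def deduplicate(alerts):
    """Collapse alerts with identical label sets, keeping first-seen order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = frozenset(alert["labels"].items())
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def route(alert, routes, default):
    """Return the receiver of the first (label, value, receiver) rule that matches."""
    for label, value, receiver in routes:
        if alert["labels"].get(label) == value:
            return receiver
    return default


# Two Prometheus servers watching the same target send the same alert.
incoming = [
    {"labels": {"alertname": "HighErrorRate", "job": "web", "severity": "page"}},
    {"labels": {"alertname": "HighErrorRate", "job": "web", "severity": "page"}},
    {"labels": {"alertname": "DiskFilling", "job": "db", "severity": "ticket"}},
]

routes = [("severity", "page", "on-call-pager")]  # hypothetical receivers
unique = deduplicate(incoming)
receivers = [route(alert, routes, "team-email") for alert in unique]
print(receivers)
```

Keeping this step separate from rule evaluation means that any number of servers can generate alerts while a human still gets one notification per problem.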
  • Not Synced
    What does it look like to actually
    get data into Prometheus?
  • Not Synced
    This is a very basic output
    of a Prometheus metric.
  • Not Synced
    This is a very simple thing.
  • Not Synced
    If you know much about
    the Linux kernel,
  • Not Synced
    the Linux kernel tracks ??? stats,
    all the state of all the CPUs
  • Not Synced
    in your system
  • Not Synced
    and we express this by having
    the name of the metric, which is
  • Not Synced
    'node_cpu_seconds_total' and so
    this is a self describing metric,
  • Not Synced
    like you can just read the metric's name
  • Not Synced
    and you understand a little bit about
    what's going on here.
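
For readers without the slide, the sketch below shows roughly what such output looks like in the Prometheus text format and how little code is needed to read it. The sample values and help text are made up, and the parser is deliberately simplified: it ignores timestamps and escaped characters in label values.

```python
import re

# Illustrative sample only; the numbers are invented.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.25
"""

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>.*)\})?'              # optional {label="value",...}
    r'\s+(?P<value>\S+)$'                   # sample value
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (metric name, labels dict, float value) tuples."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        parsed.append((match.group("name"), labels, float(match.group("value"))))
    return parsed


for name, labels, value in parse(SAMPLE):
    print(name, labels, value)
```

The name carries the subject (`node`, `cpu`), the unit (`seconds`), and the fact that it is an accumulating counter (`total`), which is what makes the metric self-describing.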
Video Language:
English
Team:
Debconf
Project:
2018_mini-debconf-hamburg
Duration:
34:03