
RailsConf 2014 - How to Build a Smart Profiler for Rails by Tom Dale and Yehuda Katz

  • 0:17 - 0:18
    TOM DALE: Hey, you guys ready?
  • 0:18 - 0:19
    Thank you guys so much for coming.
  • 0:19 - 0:22
    This is awesome. I was really,
  • 0:22 - 0:24
    I, when they were putting together the schedule,
  • 0:24 - 0:25
    I said, make sure that you put us down
  • 0:25 - 0:27
    in the Caves of Moria. So thank you
  • 0:27 - 0:32
    guys for coming down and making it.
  • 0:32 - 0:33
    I'm Tom. This is Yehuda.
  • 0:33 - 0:35
    YEHUDA KATZ: When people told me I was signed
    up
  • 0:35 - 0:41
    to do a back-to-back talk, I don't know what
  • 0:41 - 0:41
    I was thinking.
  • 0:41 - 0:41
    T.D.: Yup. So. We want to talk to you
  • 0:41 - 0:44
    today about Skylight. So, just a little
    bit
  • 0:44 - 0:46
    before we talk about that, I want to talk
  • 0:46 - 0:48
    about us a little bit.
  • 0:48 - 0:50
    So, in 2011 we started a company called Tilde.
  • 0:50 - 0:53
    It's this shirt. It may have made me self-conscious,
  • 0:53 - 0:55
    because this is actually a first edition and
    it's
  • 0:55 - 0:58
    printed off-center. Well, either I'm off-center
    or the shirt's
  • 0:58 - 1:01
    off-center. One of the two.
  • 1:01 - 1:03
    So we started Tilde in 2011, and we had
  • 1:03 - 1:06
    all just left a venture-backed company, and
    that was
  • 1:06 - 1:08
    a pretty traumatic experience for us because
    we spent
  • 1:08 - 1:10
    a lot of time building the company and then
  • 1:10 - 1:11
    we ran out of money and sold to Facebook
  • 1:11 - 1:14
    and we really didn't want to repeat that experience.
  • 1:14 - 1:16
    So, we decided to start Tilde, and when we
  • 1:16 - 1:20
    did it, we decided to be- DHH and the
  • 1:20 - 1:22
    other people at Basecamp were talking about,
    you know,
  • 1:22 - 1:24
    being bootstrapped and proud. And that was
    a message
  • 1:24 - 1:26
    that really resonated with us, and so we wanted
  • 1:26 - 1:28
    to capture the same thing.
  • 1:28 - 1:30
    There's only one problem with being bootstrapped
    and proud, and
  • 1:30 - 1:32
    that is, in order to be both of those
  • 1:32 - 1:33
    things, you actually need money, it turns
    out. It's
  • 1:33 - 1:35
    not like you just say it in a blog
  • 1:35 - 1:38
    post and then all of the sudden you are
  • 1:38 - 1:39
    in business.
  • 1:39 - 1:40
    So, we had to think a lot about, OK,
  • 1:40 - 1:42
    well, how do we make money? How do we
  • 1:42 - 1:45
    make money? How do we make a profitable and,
  • 1:45 - 1:47
    most importantly, sustainable business? Because
    we didn't want to
  • 1:47 - 1:49
    just flip to Facebook in a couple years.
  • 1:49 - 1:53
    So, looking around, I think the most obvious
    thing
  • 1:53 - 1:55
    that people suggested to us is, well, why
    don't
  • 1:55 - 1:58
    you guys just become Ember, Inc.? Raise a
    few
  • 1:58 - 2:01
    million dollars, you know, build a bunch of-
    business
  • 2:01 - 2:06
    model: mostly prayer. But that's not really
    how we
  • 2:06 - 2:09
    want to think about building open source communities.
  • 2:09 - 2:11
    We don't really think that that necessarily
    leads to
  • 2:11 - 2:14
    the best open source communities. And if you're
    interested
  • 2:14 - 2:17
    more in that, I recommend Leah Silber, who
    is
  • 2:17 - 2:20
    one of our co-founders. She's giving a talk
    this
  • 2:20 - 2:22
    afternoon. Oh, sorry. Friday afternoon, about
    how to build
  • 2:22 - 2:25
    a company that is centered on open source.
    So
  • 2:25 - 2:26
    if you want to learn more about how we've
  • 2:26 - 2:29
    done that, I would really suggest you go check
  • 2:29 - 2:30
    out her talk.
  • 2:30 - 2:34
    So, no. So, no Ember, Inc. Not allowed.
  • 2:34 - 2:38
    So, we really want to build something that
    leveraged
  • 2:38 - 2:40
    the strengths that we thought that we had.
    One,
  • 2:40 - 2:43
    I think most importantly, a really deep knowledge
    of
  • 2:43 - 2:45
    open source and a deep knowledge of the Rails
  • 2:45 - 2:47
    stack, and also Carl, it turns out, is really,
  • 2:47 - 2:51
    really good at building highly scalable big
    data sys-
  • 2:51 - 2:54
    big data systems. Lots of Hadoop in there.
  • 2:54 - 2:58
    So, last year at RailsConf, we announced the
    private
  • 2:58 - 3:01
    beta of Skylight. How many of you have used
  • 3:01 - 3:02
    Skylight? Can you raise your hand if you have
  • 3:02 - 3:05
    used it? OK. Many of you. Awesome.
  • 3:05 - 3:08
    So, so Skylight is a tool for profiling and
  • 3:08 - 3:12
    measuring the performance of your Rails applications
    in production.
  • 3:12 - 3:15
    And, as a product, Skylight, I think, was
    built
  • 3:15 - 3:20
    on three key breakthroughs.
    Three key
  • 3:20 - 3:22
    breakthroughs. We didn't want to
    ship a
  • 3:22 - 3:26
    product that was incrementally better than
    the competition. We
  • 3:26 - 3:28
    wanted to ship a product that was dramatically
    better.
  • 3:28 - 3:30
    Quantum leap. An order of magnitude better.
  • 3:30 - 3:32
    And, in order to do that, we spent a
  • 3:32 - 3:34
    lot of time thinking about it, about how we
  • 3:34 - 3:36
    could solve most of the problems that we saw
  • 3:36 - 3:39
    in the existing landscape. And so those
    breakthroughs
  • 3:39 - 3:42
    are predicated- sorry, delivering a
    product that does
  • 3:42 - 3:45
    that is predicated on these three breakthroughs.
  • 3:45 - 3:47
    So, the first one I want to talk about
  • 3:47 - 3:54
    is, honest response times. Honest response
    times. So, DHH
  • 3:54 - 3:56
    wrote a blog post on what was then the
  • 3:56 - 3:59
    37Signals blog, now the Basecamp blog, called
    The problem
  • 3:59 - 4:02
    with averages. How many of you have read this?
  • 4:02 - 4:02
    Awesome.
  • 4:02 - 4:04
    For those of you that have not, how many
  • 4:04 - 4:08
    of you hate raising your hands at presentations?
  • 4:08 - 4:11
    So, for those of you that-
  • 4:11 - 4:12
    Y.K.: Just put a button in every seat. Press
  • 4:12 - 4:12
    this button-
  • 4:12 - 4:15
    T.D.: Press the button if you have. Yes. Great.
  • 4:15 - 4:20
    So, if you read this blog post, the way
  • 4:20 - 4:23
    it opens is, Our average response time for
    Basecamp
  • 4:23 - 4:27
    right now is 87ms... That sounds fantastic.
    And it
  • 4:27 - 4:30
    easily leads you to believe that all is well
  • 4:30 - 4:32
    and that we wouldn't need to spend any more
  • 4:32 - 4:34
    time optimizing performance.
  • 4:34 - 4:39
    But that's actually wrong. The average number
    is completely
  • 4:39 - 4:42
    skewed by tons of fast responses to feed requests
  • 4:42 - 4:46
    and other cached replies. If you have 1000
    requests
  • 4:46 - 4:49
    that return in 5ms, and then you have
  • 4:49 - 4:54
    200 requests taking 2000ms, or two seconds,
    you can
  • 4:54 - 4:58
    still report a respectable 170ms
    average.
  • 4:58 - 5:00
    That's useless.
  • 5:00 - 5:03
    So what does DHH say that we need? DHH
  • 5:03 - 5:07
    says the solution is histograms. So, for those
    of
  • 5:07 - 5:09
    you like me who were sleeping through your
    statistics
  • 5:09 - 5:12
    class in high school, and college, a brief
    primer
  • 5:12 - 5:15
    on histograms. So a histogram is very simple.
    Basically,
  • 5:15 - 5:18
    you have a series of numbers
  • 5:18 - 5:22
    along some axis, and every time you're
    in
  • 5:22 - 5:24
    that bucket, you basically
    increment
  • 5:24 - 5:25
    that bar by one.
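
    For illustration, a minimal Ruby sketch of that bucketing idea (the
    bucket width and sample numbers here are invented):

        response_times_ms = [5, 7, 480, 492, 510, 2100]

        bucket_width = 50
        histogram = Hash.new(0)
        response_times_ms.each do |ms|
          bucket = (ms / bucket_width) * bucket_width  # 492 lands in the 450ms bucket
          histogram[bucket] += 1                       # increment that bar by one
        end

        histogram.sort.each { |bucket, count| puts "#{bucket}ms: #{'#' * count}" }
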
  • 5:25 - 5:28
    So, this is an example of a histogram of
  • 5:28 - 5:31
    response times in a Rails application. So
    you can
  • 5:31 - 5:32
    see that there's a big cluster in the middle
  • 5:32 - 5:36
    around 488ms, 500ms. This isn't a super speedy
    app
  • 5:36 - 5:39
    but it's not the worst thing in the world.
  • 5:39 - 5:40
    And they're all clustered, and then as you
    kind
  • 5:40 - 5:41
    of move to the right you can see that
  • 5:41 - 5:42
    the response times get longer and longer and
    longer,
  • 5:42 - 5:44
    and as you move to the left, response times
  • 5:44 - 5:46
    get shorter and shorter and shorter.
  • 5:46 - 5:48
    So, why do you want a histogram? What's the,
  • 5:48 - 5:49
    what's the most important thing about a histogram?
  • 5:49 - 5:53
    Y.K.: Well, I think it's because most requests
    don't
  • 5:53 - 5:53
    actually look like this.
  • 5:53 - 5:54
    T.D.: Yes.
  • 5:54 - 5:55
    Y.K.: Most end points don't actually look
    like this.
  • 5:55 - 5:56
    T.D.: Right. If you think about what your
    Rails
  • 5:56 - 5:59
    app is doing, it's a complicated beast, right.
    Turns
  • 5:59 - 6:02
    out, in Ruby, frankly, you can do branching
  • 6:02 - 6:04
    logic. You can do a lot of things.
  • 6:04 - 6:06
    And so what that means is that one end
  • 6:06 - 6:09
    point, if you represent that with a single
    number,
  • 6:09 - 6:12
    you are losing a lot of fidelity, to the
  • 6:12 - 6:15
    point where it becomes, as DHH said, useless.
    So,
  • 6:15 - 6:18
    for example, in a histogram, you can easily
    see,
  • 6:18 - 6:20
    oh, here's a group of requests and response
    times
  • 6:20 - 6:22
    where I'm hitting the cache, and here's another
    group
  • 6:22 - 6:24
    where I'm missing it. And you can see that
  • 6:24 - 6:28
    that cluster is significantly slower than
    the faster cache-hitting
  • 6:28 - 6:28
    cluster.
  • 6:28 - 6:31
    And the other thing that you get when you
  • 6:31 - 6:33
    have a, a distribution, when you keep the
    whole
  • 6:33 - 6:35
    distribution in the histogram, is you can
    look at
  • 6:35 - 6:39
    this number at the 95th percentile, right.
    So
  • 6:39 - 6:42
    the way to think about the performance
    of
  • 6:42 - 6:47
    your web application is not the average, because
    the
  • 6:47 - 6:50
    average doesn't really tell you anything.
    You want to
  • 6:50 - 6:52
    think about the 95th percentile, because that's
    not the
  • 6:52 - 6:55
    average response time, that's the average
    worst response time
  • 6:55 - 6:58
    that a user is likely to hit.
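
    A quick Ruby sketch of why the two numbers tell different stories (the
    request counts here are invented, not Basecamp's):

        times = Array.new(900) { 50 } + Array.new(100) { 2000 }  # ms

        mean = times.sum.to_f / times.size
        p95  = times.sort[(times.size * 0.95).ceil - 1]

        puts "mean: #{mean.round}ms"  # 245ms, dragged around by the slow tail
        puts "p95:  #{p95}ms"         # 2000ms, the "average worst" a user sees
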
  • 6:58 - 6:59
    And the thing to keep in mind is that
  • 6:59 - 7:02
    it's not as though a customer comes to your
  • 7:02 - 7:05
    site, they issue one request, and then they're
    done,
  • 7:05 - 7:08
    right. As someone is using your website, they're
    gonna
  • 7:08 - 7:10
    be generating a lot of requests. And you need
  • 7:10 - 7:15
    to look at the 95th percentile, because otherwise
    every
  • 7:15 - 7:17
    request is basically you rolling the dice
    that they're
  • 7:17 - 7:19
    not gonna hit one of those two second, three
  • 7:19 - 7:21
    second, four second responses, close the tab
    and go
  • 7:21 - 7:24
    to your competitor.
  • 7:24 - 7:25
    So we look at this as, here's the crazy
  • 7:25 - 7:28
    thing. Here's what I think is crazy. That
    blog
  • 7:28 - 7:33
    post that DHH wrote, it's from 2009. It's
    been
  • 7:33 - 7:35
    five years, and there's still no tool that
    does
  • 7:35 - 7:37
    what DHH was asking for. So, we, frankly,
    we
  • 7:37 - 7:39
    smelled money. We were like, holy crap.
  • 7:39 - 7:41
    Y.K.: Yeah, why isn't that slide green?
  • 7:41 - 7:43
    T.D.: Yeah. It should be green and dollars.
    I
  • 7:43 - 7:45
    think Keynote has the dollars, the make-it-rain
  • 7:45 - 7:50
    effect I should have used. So we smelled blood
  • 7:50 - 7:53
    in the water. We're like, this is awesome.
    There's
  • 7:53 - 7:57
    only one problem that we discovered, and that
    is,
  • 7:57 - 7:58
    it turns out that building this thing is actually
  • 7:58 - 8:01
    really, really freaky hard. Really, really
    hard.
  • 8:01 - 8:06
    So, we announced the private beta at RailsConf
    last
  • 8:06 - 8:09
    year. Before doing that, we spent a year of
  • 8:09 - 8:13
    research spiking out prototypes, building
    prototypes, building out the
  • 8:13 - 8:17
    beta. We launched at RailsConf, and we realized,
    we
  • 8:17 - 8:19
    made a lot of problems. We made a lot
  • 8:19 - 8:22
    of errors when we were building this system.
  • 8:22 - 8:26
    So then, after RailsConf last year, we basically
    took
  • 8:26 - 8:30
    six months to completely rewrite the backend
    from the
  • 8:30 - 8:32
    ground up. And I think tying into your keynote,
  • 8:32 - 8:36
    Yehuda, we, we were like, oh. We clearly have
  • 8:36 - 8:39
    a bespoke problem. No one else is doing this.
  • 8:39 - 8:42
    So we rewrote our own custom backend. And
    then
  • 8:42 - 8:44
    we had all these problems, and we realized
    that
  • 8:44 - 8:45
    they had actually already all been solved
    by the
  • 8:45 - 8:48
    open source community. And so we benefited
    tremendously by
  • 8:48 - 8:48
    having a shared solution.
  • 8:48 - 8:51
    Y.K.: Yeah. So our first release of this was
  • 8:51 - 8:55
    really very bespoke, and the current release
    uses a
  • 8:55 - 9:00
    tremendous amount of very off-the-shelf open
    source projects that
  • 9:00 - 9:05
    just solved the particular problem very effectively,
    very well.
  • 9:05 - 9:06
    None of which are as easy to use as
  • 9:06 - 9:09
    Rails, but all of which solve really thorny
    problems
  • 9:09 - 9:10
    very effectively.
  • 9:10 - 9:13
    T.D.: So, so let's just talk, just for your
  • 9:13 - 9:16
    own understanding, let's talk about how most
    performance monitoring
  • 9:16 - 9:18
    tools work. So the way that most of these
  • 9:18 - 9:20
    work is that you run your Rails app, and
  • 9:20 - 9:22
    running inside of your Rails app is some gem,
  • 9:22 - 9:25
    some agent that you install. And every time
    the
  • 9:25 - 9:29
    Rails app handles a request, it generates
    events, and
  • 9:29 - 9:32
    those events, which include information about
    performance data, those
  • 9:32 - 9:35
    events are passed into the agent.
  • 9:35 - 9:38
    And then the agent sends that data to some
  • 9:38 - 9:41
    kind of centralized server. Now, it turns
    out that
  • 9:41 - 9:44
    doing a running average is actually really
    simple. Which
  • 9:44 - 9:47
    is why everyone does it. Basically you can
    do
  • 9:47 - 9:48
    it in a single SQL query, right. All you
  • 9:48 - 9:50
    do is you have three columns in the database.
    The
  • 9:50 - 9:53
    end point, the running average, and the number
    of
  • 9:53 - 9:56
    requests, and then so, you can, those are
    the
  • 9:56 - 9:57
    two things that you need to keep a running
  • 9:57 - 9:57
    average, right.
  • 9:57 - 9:59
    So keeping a running average is actually really
    simple
  • 9:59 - 10:01
    from a technical point of view.
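
    A sketch of that single-query idea (table and column names invented;
    the arithmetic is the standard incremental mean):

        require "pg"

        # new_avg = (avg * n + x) / (n + 1), computed entirely in the database.
        def record_request(conn, endpoint, duration_ms)
          conn.exec_params(<<~SQL, [endpoint, duration_ms])
            UPDATE endpoint_stats
               SET running_average = (running_average * request_count + $2)
                                     / (request_count + 1.0),
                   request_count   = request_count + 1
             WHERE endpoint = $1
          SQL
        end
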
  • 10:01 - 10:04
    Y.K.: I don't think you could even do it in
    JavaScript,
  • 10:04 - 10:05
    due to the lack of integers.
  • 10:05 - 10:06
    T.D.: Yes. You probably wouldn't want to do
    any
  • 10:06 - 10:08
    math in JavaScript, it turns out. So, so we
  • 10:08 - 10:10
    took a little bit different approach. Yehuda,
    do you
  • 10:10 - 10:12
    want to go over the next section?
  • 10:12 - 10:15
    Y.K.: Yeah. Sure. So, when we first started,
    right
  • 10:15 - 10:18
    at the beginning, we basically did a similar
    thing
  • 10:18 - 10:20
    where we had a bunch - your app creates
  • 10:20 - 10:23
    events. Most of those start off as being ActiveSupport::Notifications,
  • 10:23 - 10:26
    although it turns out that there's very limited
    use
  • 10:26 - 10:28
    of ActiveSupport::Notifications so we had
    to do some normalization
  • 10:28 - 10:30
    work to get them sane, which we're gonna be
  • 10:30 - 10:33
    upstreaming back into Rails.
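
    Concretely, consuming those events looks something like this
    ("sql.active_record" is a real notification Rails emits; the handler
    body is just a sketch of what an agent might do):

        require "active_support/notifications"

        ActiveSupport::Notifications.subscribe("sql.active_record") do |name, start, finish, id, payload|
          duration_ms = (finish - start) * 1000
          # An agent would append { sql: payload[:sql], duration: duration_ms }
          # to the current in-process trace here, doing no network I/O inline.
        end
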
  • 10:33 - 10:35
    But, one thing that's kind of unfortunate
    about having
  • 10:35 - 10:37
    every single Rails app have an agent is that
  • 10:37 - 10:39
    you end up having to do a lot of
  • 10:39 - 10:40
    the same kind of work over and over again,
  • 10:40 - 10:42
    and use up a lot of memory. So, for
  • 10:42 - 10:44
    example, every one of these things is making
    HTTP
  • 10:44 - 10:46
    requests. So now you have a queue of things
  • 10:46 - 10:49
    that you're sending over HTTP in every single
    one
  • 10:49 - 10:51
    of your Rails processes. And, of course, you
    probably
  • 10:51 - 10:52
    don't notice this. People are used to Rails
    taking
  • 10:52 - 10:54
    up hundreds and hundreds of megabytes, so
    you probably
  • 10:54 - 10:56
    don't notice if you install some agent and
    it
  • 10:56 - 10:59
    suddenly starts taking twenty, thirty, forty,
    fifty more megabytes.
  • 10:59 - 11:02
    But we really wanted to keep the actual memory
  • 11:02 - 11:05
    per process down to a small amount. So one
  • 11:05 - 11:06
    of the very first things that we did, we
  • 11:06 - 11:08
    even did it before last year, is that we
  • 11:08 - 11:10
    pulled out all that shared logic into a
  • 11:10 - 11:13
    separate process called the coordinator. And
    the agent is
  • 11:13 - 11:17
    basically responsible simply for collecting
    the trace, and
  • 11:17 - 11:19
    it's not responsible for actually talking
    to our server
  • 11:19 - 11:21
    at all. And that means that the coordinator
    only
  • 11:21 - 11:23
    has to keep this queue, this
  • 11:23 - 11:26
    bunch of work, in one place,
  • 11:26 - 11:28
    and it doesn't end up using up as much
  • 11:28 - 11:29
    memory.
  • 11:29 - 11:31
    And I think this, this ended up being very
  • 11:31 - 11:32
    effective for us.
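
    A loose sketch of that split (the socket path, framing, and helper
    names are invented): each Rails worker hands traces to a local socket,
    and a single coordinator process owns the one outbound queue.

        require "socket"
        require "json"

        SOCKET_PATH = "/tmp/coordinator.sock"  # hypothetical

        # In each Rails worker: hand the trace off and return immediately.
        def submit_trace(trace)
          UNIXSocket.open(SOCKET_PATH) { |sock| sock.puts(JSON.dump(trace)) }
        end

        # Hypothetical uploader: batch the traces and send them upstream.
        def ship_batch(batch)
        end

        # In the one coordinator process: a single queue for the whole box.
        def run_coordinator
          server = UNIXServer.new(SOCKET_PATH)
          queue  = Queue.new
          Thread.new { loop { ship_batch(queue.pop) } }
          loop do
            client = server.accept
            queue << JSON.parse(client.gets)
            client.close
          end
        end
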
  • 11:32 - 11:33
    T.D.: And I think that low overhead also allows
  • 11:33 - 11:36
    us to just collect more information, in general.
  • 11:36 - 11:37
    Y.K.: Yeah.
  • 11:37 - 11:40
    Now, after our first attempt, we started getting
    a
  • 11:40 - 11:42
    bunch of customers that were telling us that
    even
  • 11:42 - 11:44
    the separate - so the separate coordinator,
    started as
  • 11:44 - 11:45
    a good thing and a bad thing. On the
  • 11:45 - 11:47
    one hand, there's only one of them, so it
  • 11:47 - 11:50
    uses up only one set of memory. On the
  • 11:50 - 11:51
    other hand, it's really easy for someone to
    go
  • 11:51 - 11:53
    in and ps that process and see how many
  • 11:53 - 11:54
    megabytes of memory it's using.
  • 11:54 - 11:57
    So, we got a lot of additional complaints
    that
  • 11:57 - 11:58
    said oh, your process is using a lot of
  • 11:58 - 12:01
    memory. And, I spent a few weeks, I, I
  • 12:01 - 12:03
    know Ruby pretty well. I spent a couple of
  • 12:03 - 12:06
    weeks. I actually wrote a gem called Allocation
    Counter
  • 12:06 - 12:08
    that basically went in to try to pinpoint
  • 12:08 - 12:09
    exactly where the allocations were coming
    from. But
  • 12:09 - 12:12
    it turns out that it's actually really, really
    hard
  • 12:12 - 12:14
    to track down exactly where allocations are
    coming from
  • 12:14 - 12:15
    in Ruby, because something as simple as using
    a
  • 12:15 - 12:19
    regular expression in Ruby can allocate match
    objects that
  • 12:19 - 12:19
    get put back on the stack.
  • 12:19 - 12:21
    And so I was able to pare this down
  • 12:21 - 12:24
    to some degree. But I really discovered quickly
    that,
  • 12:24 - 12:27
    trying to keep a lid on the memory allocation
  • 12:27 - 12:30
    by doing all the stuff in Ruby, is mostly
  • 12:30 - 12:32
    fine. But for our specific use case where
    we
  • 12:32 - 12:33
    really wanna be telling you, you
    can
  • 12:33 - 12:35
    run the agent on your process, on your box,
  • 12:35 - 12:37
    and it's not gonna use a lot of memory.
  • 12:37 - 12:40
    We really needed something more efficient,
    and our first
  • 12:40 - 12:43
    thought was, we'll use C++ or C. No problem.
  • 12:43 - 12:45
    C is, is native. It's great. And Carl did
  • 12:45 - 12:48
    the work. Carl is very smart. And then he
  • 12:48 - 12:50
    said, Yehuda. It is now your turn. You need
  • 12:50 - 12:51
    to start maintaining this. And I said, I don't
  • 12:51 - 12:54
    trust myself to write C++ code that's running
    in
  • 12:54 - 12:56
    all of your guys's boxes, and not seg-fault.
    So
  • 12:56 - 13:00
    I don't think that, that doesn't work for
    me.
  • 13:00 - 13:02
    And so I noticed that Rust was coming
  • 13:02 - 13:04
    along, and what Rust really gives you is it
  • 13:04 - 13:06
    gives you the ability to write low-level code
    a
  • 13:06 - 13:09
    la C or C++ with manual memory management,
    which
  • 13:09 - 13:12
    keeps your memory allocation low and keeps
    things speedy.
  • 13:12 - 13:15
    Low resource utilization. While also giving
    you compile time
  • 13:15 - 13:18
    guarantees about not seg-faulting. So, again,
    if your processes
  • 13:18 - 13:20
    randomly started seg-faulting because you
    installed the agent, I
  • 13:20 - 13:22
    think you would stop being our customer very
    quickly.
  • 13:22 - 13:25
    So having what, pretty much 100% guarantees
    about that
  • 13:25 - 13:27
    was very important to us. And so that's why
  • 13:27 - 13:28
    we decided to use Rust.
  • 13:28 - 13:30
    I'll just keep going.
  • 13:30 - 13:31
    T.D.: Keep going.
  • 13:31 - 13:33
    Y.K.: So, we had this coordinator object.
    And basically
  • 13:33 - 13:36
    the coordinator object is receiving events.
    So the events
  • 13:36 - 13:40
    basically end up being these traces that describe
    what's
  • 13:40 - 13:42
    happening in your application. And the next
    thing, I
  • 13:42 - 13:44
    think in our initial work on this we used JSON
  • 13:44 - 13:46
    just to send the payload to the server,
  • 13:46 - 13:48
    but we noticed that a lot of people have
  • 13:48 - 13:50
    really big requests. So you may have a big
  • 13:50 - 13:52
    request with a big SQL query in it, or
  • 13:52 - 13:53
    a lot of big SQL queries in it. Some
  • 13:53 - 13:55
    people have traces that are hundreds and hundreds
    of
  • 13:55 - 13:58
    nodes long. And so we really wanted to figure
  • 13:58 - 14:00
    out how to shrink down the payload size to
  • 14:00 - 14:03
    something that we could be, you know, pumping
    out
  • 14:03 - 14:05
    of your box on a regular basis without tearing
  • 14:05 - 14:07
    up your bandwidth costs.
  • 14:07 - 14:09
    So, one of the first things that we did
  • 14:09 - 14:11
    early on was we switched to using protobuf as
    the
  • 14:11 - 14:14
    transport mechanism, and that really shrunk
    down the
  • 14:14 - 14:17
    payloads a lot. Our earlier prototypes for
    actually collecting
  • 14:17 - 14:19
    the data were written in Ruby, but I think
  • 14:19 - 14:21
    Carl did, like, a weekend hack to just port
  • 14:21 - 14:24
    it over to Java and got, like, 200x performance.
  • 14:24 - 14:26
    And you don't always get 200x performance;
    if mostly
  • 14:26 - 14:28
    what you're doing is database queries, you're
    not gonna
  • 14:28 - 14:29
    get a huge performance swing.
  • 14:29 - 14:32
    But mostly what we're doing is math. And algorithms
  • 14:32 - 14:34
    and data structures. And for that, Ruby,
    it
  • 14:34 - 14:36
    could, in theory, one day, have a good JIT
  • 14:36 - 14:39
    or something, but today, writing that code
    in Java
  • 14:39 - 14:41
    didn't end up being significantly more code,
    'cause it's
  • 14:41 - 14:43
    just, you know, algorithms and data structures.
  • 14:43 - 14:44
    T.D.: And I'll just note that standardizing
    on
  • 14:44 - 14:47
    protobufs in our stack is actually
    a
  • 14:47 - 14:52
    huge win, because we realized, hey, browsers,
    as
  • 14:52 - 14:53
    it turns out, are pretty powerful these days.
    They've
  • 14:53 - 14:54
    got, you know, they can allocate memory, they
    can
  • 14:54 - 14:56
    do all these types of computation. And
    protobuf
  • 14:56 - 14:59
    libraries exist everywhere. So we save ourselves
    a lot
  • 14:59 - 15:02
    of computation and a lot of time by just
  • 15:02 - 15:04
    treating protobuf as the canonical serialization
    form, and then
  • 15:04 - 15:06
    you can move payloads around the entire stack
    and
  • 15:06 - 15:09
    everything speaks the same language, so you've
    saved the
  • 15:09 - 15:09
    serialization and deserialization.
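
    As a sketch with the google-protobuf gem (the message definition is
    invented, not Skylight's schema), one message type can serve as the
    canonical form on every side of the stack:

        require "google/protobuf"

        Google::Protobuf::DescriptorPool.generated_pool.build do
          add_file("demo.proto", syntax: :proto3) do
            add_message "demo.Span" do
              optional :category,    :string, 1
              optional :duration_us, :uint64, 2
            end
          end
        end

        Span = Google::Protobuf::DescriptorPool.generated_pool.lookup("demo.Span").msgclass

        span  = Span.new(category: "db.sql.query", duration_us: 4200)
        bytes = Span.encode(span)   # compact binary payload for the wire
        again = Span.decode(bytes)  # any service (or the browser) reads the same shape
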
  • 15:09 - 15:12
    Y.K.: And JavaScript is actually surprisingly
    effective at
  • 15:12 - 15:14
    taking protobufs and converting them to
    the format
  • 15:14 - 15:18
    that we need efficiently. So, so we basically
    take
  • 15:18 - 15:21
    this data. The Java collector is basically
    collecting all
  • 15:21 - 15:24
    these protobufs, and pretty much it just
    turns around,
  • 15:24 - 15:25
    and this is sort of where we got into
  • 15:25 - 15:28
    bespoke territory before we started rolling
    our own, but
  • 15:28 - 15:31
    we realized that when you write a big, distributed,
  • 15:31 - 15:33
    fault-tolerant system, there's a lot of problems
    that you
  • 15:33 - 15:35
    really just want someone else to have thought
    about.
  • 15:35 - 15:38
    So, what we do is we basically take these,
  • 15:38 - 15:40
    take these payloads that are coming in. We
    convert
  • 15:40 - 15:42
    them into batches and we send the batches
    down
  • 15:42 - 15:45
    into the Kafka queue. And the next thing
  • 15:45 - 15:48
    that happens, so the Kafka, sorry, Kafka's
    basically just
  • 15:48 - 15:50
    a queue that allows you to throw things into,
  • 15:50 - 15:53
    I guess, it might be considered similar to,
    like,
  • 15:53 - 15:55
    something like AMQP. It has some nice fault-tolerance
    properties
  • 15:55 - 15:58
    and integrates well with Storm. But most important,
    it's
  • 15:58 - 16:00
    just super, super high-throughput.
  • 16:00 - 16:02
    So we basically didn't want to put any barrier
    between
  • 16:02 - 16:04
    you giving us the data and us getting it
  • 16:04 - 16:04
    to disk as soon as possible.
  • 16:04 - 16:06
    T.D.: Yeah. Which we'll, I think, talk about
    in
  • 16:06 - 16:06
    a bit.
  • 16:06 - 16:09
    Y.K.: So basically Kafka takes the
  • 16:09 - 16:11
    data and starts sending it into Storm. And
    if
  • 16:11 - 16:13
    you think about what has to happen in order
  • 16:13 - 16:15
    to process some request. So, you have these requests.
  • 16:15 - 16:18
    There's, you know, maybe traces that have
    a bunch
  • 16:18 - 16:20
    of SQL queries, and our job is basically to
  • 16:20 - 16:21
    take all those SQL queries and say, OK, I
  • 16:21 - 16:23
    can see that in all of your requests, you
  • 16:23 - 16:24
    had the SQL query and it took around this
  • 16:24 - 16:25
    amount of time and it happened as a child
  • 16:25 - 16:27
    of this other node. And the way to think
  • 16:27 - 16:30
    about that is basically just a processing
    pipeline. Right.
  • 16:30 - 16:31
    So you have these traces that come in one
  • 16:31 - 16:33
    side. You start passing them through a bunch
    of
  • 16:33 - 16:35
    processing steps, and then you end up on the
  • 16:35 - 16:37
    other side with the data.
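
    The shape of that pipeline, sketched in Ruby rather than Storm: each
    step is a pure function over the previous step's output, which is what
    lets a framework distribute the steps (the trace fields are invented):

        steps = [
          ->(trace)  { trace[:spans] },                   # explode into spans
          ->(spans)  { spans.group_by { |s| s[:sql] } },  # group identical queries
          ->(groups) { groups.transform_values { |ss| ss.sum { |s| s[:ms] } } }
        ]

        trace = { spans: [{ sql: "SELECT ... WHERE id = ?", ms: 3 },
                          { sql: "SELECT ... WHERE id = ?", ms: 4 }] }

        steps.reduce(trace) { |data, step| step.call(data) }
        # => { "SELECT ... WHERE id = ?" => 7 }
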
  • 16:37 - 16:39
    And Storm is actually a way of describing
    that
  • 16:39 - 16:42
    processing pipeline in sort of functional
    style, and then
  • 16:42 - 16:44
    you tell it, OK. Here's how many servers I
  • 16:44 - 16:47
    need. Here's how, here's how I'm gonna handle
    failures.
  • 16:47 - 16:50
    And it basically deals with distribution and
    scaling and
  • 16:50 - 16:52
    all that stuff for you. And part of that
  • 16:52 - 16:55
    is because you wrote everything using functional
    style.
  • 16:55 - 16:57
    And so what happens is Kafka sends the data
  • 16:57 - 17:01
    into the entry spout, which is sort of the
  • 17:01 - 17:04
    terminology in Storm for these streams
    that get
  • 17:04 - 17:07
    created. And they basically go into these
    processing things,
  • 17:07 - 17:10
    which very cleverly, cutely, are called bolts.
    This is
  • 17:10 - 17:13
    definitely not the naming I would have used,
    but.
  • 17:13 - 17:15
    So they're called bolts. And the idea is that
  • 17:15 - 17:17
    basically every request may have several things.
  • 17:17 - 17:20
    So, for example, we now automatically detect
    n +
  • 17:20 - 17:22
    1 queries and that's sort of a different kind
  • 17:22 - 17:25
    of processing from just, make a picture of
    the
  • 17:25 - 17:27
    entire request. Or what is the 95th percentile
    across
  • 17:27 - 17:29
    your entire app, right. These are all different
    kinds
  • 17:29 - 17:31
    of processing. So we take the data and we
  • 17:31 - 17:34
    send them into a bunch of bolts, and the
  • 17:34 - 17:36
    cool thing about bolts is that, again, because
    they're
  • 17:36 - 17:39
    just functional chaining, you can take the
    output from
  • 17:39 - 17:41
    one bolt and feed it into another bolt. And
  • 17:41 - 17:44
    that works, that works pretty well. And, and
    you
  • 17:44 - 17:45
    don't have to worry about - I mean, you
  • 17:45 - 17:47
    have to worry a little bit about things like
  • 17:47 - 17:50
    fault tolerance, failure, idempotence. But
    you worry about
  • 17:50 - 17:52
    them at, at the abstraction level, and then
    the
  • 17:52 - 17:54
    operational part is handled for you.
  • 17:54 - 17:56
    T.D.: So it's just like a very declarative
    way
  • 17:56 - 17:58
    of describing how this computation works,
    in a
  • 17:58 - 17:59
    way that's easy to scale.
  • 17:59 - 18:02
    Y.K.: And Carl actually talked about this
    at very
  • 18:02 - 18:05
    high speed yesterday, and you, some of you
    may
  • 18:05 - 18:07
    have been there. I would recommend watching
    the video
  • 18:07 - 18:09
    when it comes out if you want to make
  • 18:09 - 18:11
    use of this stuff in your own applications.
  • 18:11 - 18:13
    And then when you're finally done with all
    the
  • 18:13 - 18:15
    processing, you need to actually do something
    with it.
  • 18:15 - 18:16
    You need to put it somewhere so that the
  • 18:16 - 18:18
    web app can get access to it, and that
  • 18:18 - 18:21
    is basically, we use Cassandra for this. And
    Cassandra
  • 18:21 - 18:25
    again is mostly, it's a dumb database, but
    it
  • 18:25 - 18:28
    has high capacity. It has some
    of
  • 18:28 - 18:29
    the fault-tolerance capabilities that we want.
  • 18:29 - 18:31
    T.D.: We're just very, very write-heavy,
    right.
  • 18:31 - 18:33
    Like, we tend to be writing more than we're
  • 18:33 - 18:34
    ever reading.
  • 18:34 - 18:36
    Y.K.: Yup. And then when we're done, when
    we're
  • 18:36 - 18:39
    done with a particular batch, Cassandra basically
    kicks off
  • 18:39 - 18:41
    the process over again. So we're basically
    doing these
  • 18:41 - 18:41
    things as batches.
  • 18:41 - 18:43
    T.D.: So these are, these are roll-ups, is
    what's
  • 18:43 - 18:45
    happening here. So basically every minute,
    every ten minutes,
  • 18:45 - 18:48
    and then at every hour, we reprocess and we
  • 18:48 - 18:50
    re-aggregate, so that when you query us we
    know
  • 18:50 - 18:51
    exactly what to give you.
  • 18:51 - 18:53
    Y.K.: Yup. So we sort of have this cycle
  • 18:53 - 18:55
    where we start off, obviously, in the first
    five
  • 18:55 - 18:57
    second, the first minute, you really want
    high granularity.
  • 18:57 - 18:59
    You want to see what's happening right now.
    But,
  • 18:59 - 19:00
    if you want to go back and look at
  • 19:00 - 19:02
    data from three months ago, you probably care
    about
  • 19:02 - 19:05
    it, like the day granularity or maybe the
    hour
  • 19:05 - 19:07
    granularity. So, we basically do these roll-ups
    and we
  • 19:07 - 19:09
    cycle through the process.
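
    A sketch of the time bucketing behind those roll-ups (the three
    granularities mirror the ones just mentioned):

        GRANULARITIES = { minute: 60, ten_minutes: 600, hour: 3600 }  # seconds

        def bucket_for(time, granularity)
          Time.at((time.to_i / granularity) * granularity)
        end

        now = Time.now
        bucket_for(now, GRANULARITIES[:minute])  # fine-grained, for "right now"
        bucket_for(now, GRANULARITIES[:hour])    # coarse, for months-old data
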
  • 19:09 - 19:12
    T.D.: So this, it turns out, building the
    system
  • 19:12 - 19:15
    required an intense amount of work. Carl spent
    probably
  • 19:15 - 19:17
    six months reading PhD thesises to find-
  • 19:17 - 19:18
    Y.K.: Theses.
  • 19:18 - 19:24
    T.D.: Theses. To find, to find data structures
    and
  • 19:24 - 19:26
    algorithms that we could use. Because this
    is a
  • 19:26 - 19:28
    huge amount of data. Like, I think even a
  • 19:28 - 19:31
    few months after we were in
    private
  • 19:31 - 19:34
    beta, we were already handling over a billion
    requests
  • 19:34 - 19:36
    per month. And obviously there's no way that
    we-
  • 19:36 - 19:38
    Y.K.: Basically the number of requests that
    we handle
  • 19:38 - 19:40
    is the sum of all of the requests that
  • 19:40 - 19:40
    you handle.
  • 19:40 - 19:40
    T.D.: Right.
  • 19:40 - 19:41
    Y.K.: And all of our customers handle.
  • 19:41 - 19:42
    T.D.: Right. Right. So.
  • 19:42 - 19:43
    Y.K.: So that's a lot of requests.
  • 19:43 - 19:46
    T.D.: So obviously we can't provide a service,
    at
  • 19:46 - 19:48
    least not an
    affordable
  • 19:48 - 19:51
    service, an accessible service, if we have
    to store
  • 19:51 - 19:53
    terabytes or exabytes of data just to tell
    you
  • 19:53 - 19:54
    how your app is running.
  • 19:54 - 19:57
    Y.K.: And I think it's also problematic
  • 19:57 - 19:58
    if you store all the data in a database
  • 19:58 - 20:00
    and then every single time someone wants to
    learn
  • 20:00 - 20:02
    something about that, you have to do a query.
  • 20:02 - 20:03
    Those queries can take a very long time. They
  • 20:03 - 20:05
    can take minutes. And I think we really wanted
  • 20:05 - 20:07
    to have something
  • 20:07 - 20:10
    where the feedback loop would be fast. So
    we
  • 20:10 - 20:12
    wanted to find algorithms that let us handle
    the
  • 20:12 - 20:14
    data at, at real time, and then provide it
  • 20:14 - 20:16
    to you at real time instead of these, like,
  • 20:16 - 20:18
    dump the data somewhere and then do these
    complicated
  • 20:18 - 20:19
    queries.
  • 20:19 - 20:21
    T.D.: So, hold on. So this slide was not
  • 20:21 - 20:24
    supposed to be here. It was supposed to be
  • 20:24 - 20:28
    a Rails slide. So, whoa. I went too far.
  • 20:28 - 20:30
    K. We'll watch that again. That's pretty cool.
    So
  • 20:30 - 20:32
    then the last thing I want to say is,
  • 20:32 - 20:34
    perhaps your take away from looking at this
    architecture
  • 20:34 - 20:37
    diagram is, oh my gosh, these Rails guys completely-
  • 20:37 - 20:38
    Y.K.: They really jumped the shark.
  • 20:38 - 20:41
    T.D.: They jumped the shark. They ditched
    Rails. I
  • 20:41 - 20:42
    saw, like, three Tweets yesterday - I wasn't
    here,
  • 20:42 - 20:43
    I was in Portland yesterday, but I saw, like,
  • 20:43 - 20:44
    three Tweets that were like, I'm at RailsConf
    and
  • 20:44 - 20:49
    I haven't seen a single talk about, like,
    Rails.
  • 20:49 - 20:52
    So that's true here, too. But, I want to
  • 20:52 - 20:55
    assure you that we are only using this stack
  • 20:55 - 20:58
    for the heavy computation. We started in Rails.
    We
  • 20:58 - 21:01
    started, we were like, hey, what do we need.
  • 21:01 - 21:02
    Ah, well, people probably need to authenticate
    and log
  • 21:02 - 21:05
    in, and we probably need to do billing. And
  • 21:05 - 21:06
    those are all things that Rails is really,
    really
  • 21:06 - 21:09
    good at. So we started with Rails as, basically,
  • 21:09 - 21:11
    the starting point, and then when we realized
    oh
  • 21:11 - 21:14
    my gosh, computation is really slow. There's
    no way
  • 21:14 - 21:15
    we're gonna be able to offer this service.
    OK.
  • 21:15 - 21:16
    Now let's think about how we can do all
  • 21:16 - 21:16
    of that.
  • 21:16 - 21:19
    Y.K.: And I think notably, a lot of people
  • 21:19 - 21:20
    who look at Rails are like, there's a lot
  • 21:20 - 21:22
    of companies that have built big stuff on
    Rails,
  • 21:22 - 21:24
    and their attitude is, like, oh, this legacy
    terrible
  • 21:24 - 21:25
    Rails app. I really wish we could get rid
  • 21:25 - 21:27
    of it. If we could just write everything in
  • 21:27 - 21:31
    Scala or Clojure or Go, everything would be
    amazing.
  • 21:31 - 21:32
    That is definitely not our attitude. Our attitude
    is
  • 21:32 - 21:34
    that Rails is really amazing, at particular,
    at the
  • 21:34 - 21:37
    kinds of things that are really common across
    everyone's
  • 21:37 - 21:40
    web applications - authentication, billing,
    et cetera. And we
  • 21:40 - 21:41
    really want to be using Rails for the parts
  • 21:41 - 21:43
    of our app- even things like error-tracking,
    we do
  • 21:43 - 21:45
    through the Rails app. We want to be using
  • 21:45 - 21:47
    Rails because it's very productive at doing
    those things.
  • 21:47 - 21:49
    It happens to be very slow with doing data
  • 21:49 - 21:50
    crunching, so we're gonna use a different
    tool for
  • 21:50 - 21:51
    that.
  • 21:51 - 21:52
    But I don't think you'll ever see me getting
  • 21:52 - 21:54
    up and saying, ah, I really wish we had
  • 21:54 - 21:55
    just started writing, you know, the Rails
    app in
  • 21:55 - 21:55
    rust.
  • 21:55 - 21:55
    T.D.: Yeah.
  • 21:55 - 21:58
    Y.K.: That would be terrible.
  • 21:58 - 22:02
    T.D.: So that's number one: honest
    response
  • 22:02 - 22:04
    times, which, it turns out, seems
    like
  • 22:04 - 22:08
    it should be easy, but requires storing an insane
    amount of
  • 22:08 - 22:09
    data.
  • 22:09 - 22:11
    So the second thing that we realized when
    we
  • 22:11 - 22:12
    were looking at a lot of these tools, is
  • 22:12 - 22:14
    that most of them focus on data. They focus
  • 22:14 - 22:17
    on giving you the raw data. But I'm not
  • 22:17 - 22:19
    a machine. I'm not a computer. I don't enjoy
  • 22:19 - 22:21
    sifting through data. That's what computers
    are good for.
  • 22:21 - 22:23
    I would rather be drinking a beer. It's really
  • 22:23 - 22:25
    nice in Portland, this time of year.
  • 22:25 - 22:27
    So, we wanted to think about, if you're trying
  • 22:27 - 22:31
    to solve the performance problems in your
    application, what
  • 22:31 - 22:33
    are the things that you would suss out with
  • 22:33 - 22:36
    the existing tools after spending, like, four
    hours depleting
  • 22:36 - 22:38
    your ego to get there?
  • 22:38 - 22:39
    Y.K.: And I think part of this is just
  • 22:39 - 22:42
    people like to think
    that
  • 22:42 - 22:44
    they're gonna use these tools, but when the
    tools
  • 22:44 - 22:45
    require you to dig through a lot of data,
  • 22:45 - 22:47
    people just don't use them very much. So,
    the
  • 22:47 - 22:48
    goal here was to build a tool that people
  • 22:48 - 22:51
    actually use and actually like using, and
    not to
  • 22:51 - 22:55
    build a tool that happens to provide a lot
  • 22:55 - 22:55
    of data you can sift through.
  • 22:55 - 22:55
    T.D.: Yes.
  • 22:55 - 22:56
    So, probably one of the first things
    that
  • 22:56 - 22:59
    we realized is what we don't want to provide.
  • 22:59 - 23:00
    This is a trace of a request; you've probably
  • 23:00 - 23:04
    seen similar UIs in other tools,
    for example
  • 23:04 - 23:07
    the inspector in Chrome or Safari,
    and
  • 23:07 - 23:09
    this is basically
    a visual
  • 23:09 - 23:11
    stack trace of where your application is spending
    its
  • 23:11 - 23:12
    time.
  • 23:12 - 23:14
    But I think what was important for us is
  • 23:14 - 23:18
    showing not just a single request, because
    your app
  • 23:18 - 23:21
    handles, you know, hundreds of thousands of
    requests, or
  • 23:21 - 23:23
    millions of requests. So looking at a single
    request
  • 23:23 - 23:25
    statistically is just noise.
  • 23:25 - 23:26
    Y.K.: And it's especially bad if it's the
    worst
  • 23:26 - 23:29
    request, because the worst request is, is
    really noise.
  • 23:29 - 23:31
    It's like, a hiccup in the network, right.
  • 23:31 - 23:31
    T.D.: It's the outlier. Yeah.
  • 23:31 - 23:32
    Y.K.: It's literally the outlier.
  • 23:32 - 23:36
    T.D.: It's literally the outlier. Yup. So,
    what we
  • 23:36 - 23:39
    present in Skylight is something a little
    bit different,
  • 23:39 - 23:42
    and it's something that we call the aggregate
    trace.
  • 23:42 - 23:46
    So the aggregate trace is basically us taking
    all
  • 23:46 - 23:50
    of your requests, averaging them out where
    each of
  • 23:50 - 23:52
    these things spends their time, and then showing
    you
  • 23:52 - 23:55
    that. So this is basically like, this is like,
  • 23:55 - 23:58
    this is like the statue of David. It is
  • 23:58 - 24:01
    the idealized form of the stack trace of how
  • 24:01 - 24:02
    your application's behaving.
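
    In miniature, that aggregation might look like this (toy traces; the
    real system groups spans by their position in the trace tree, not just
    by name):

        traces = [
          [{ node: "controller", ms: 100 }, { node: "db.sql", ms: 40 }],
          [{ node: "controller", ms: 140 }, { node: "db.sql", ms: 80 }]
        ]

        traces.flatten.group_by { |s| s[:node] }.map do |node, spans|
          { node: node, avg_ms: spans.sum { |s| s[:ms] }.to_f / spans.size }
        end
        # => [{ node: "controller", avg_ms: 120.0 }, { node: "db.sql", avg_ms: 60.0 }]
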
  • 24:02 - 24:05
    But, of course, you have the same problem
    as
  • 24:05 - 24:08
    before, which is, if this is all that we
  • 24:08 - 24:11
    were showing you, it would be obscuring a
    lot
  • 24:11 - 24:13
    of information. You want to actually be able
    to
  • 24:13 - 24:14
    tell the difference between, OK, what's my
    stack trace
  • 24:14 - 24:16
    look like for fast requests, and how does
    that
  • 24:16 - 24:19
    differ from requests that are slower.
  • 24:19 - 24:21
    So what we've got, I've got a little video
  • 24:21 - 24:22
    here. You can see that when I move the
  • 24:22 - 24:26
    slider, that this trace below it is actually
    updating
  • 24:26 - 24:29
    in real time. As I move the slider around,
  • 24:29 - 24:32
    you can see that the aggregate trace actually
    updates
  • 24:32 - 24:34
    with it. And that's because we're collecting
    all this
  • 24:34 - 24:36
    information. We're collecting, like I said,
    a lot of
  • 24:36 - 24:39
    data. We can recompute this aggregate trace
    on the
  • 24:39 - 24:39
    fly.
  • 24:39 - 24:41
    Basically, for each bucket, we're storing
    a different trace,
  • 24:41 - 24:43
    and then on the client we're reassembling
    that. We'll
  • 24:43 - 24:44
    go into that a little bit.
  • 24:44 - 24:46
    Y.K.: And I think it's really important that
    you
  • 24:46 - 24:48
    be able to do these experiments quickly. If
    every
  • 24:48 - 24:50
    time you think, oh, I wonder what happens
    if
  • 24:50 - 24:52
    I add another histogram bucket, if it requires
    a
  • 24:52 - 24:55
    whole full-page refresh, then that would basically
    make
  • 24:55 - 24:56
    people not want to use the tool. Not able
  • 24:56 - 24:59
    to use the tool. So, actually building something
    which
  • 24:59 - 25:00
    is real time and fast, gets the data as
  • 25:00 - 25:00
    it comes, was really important to us.
  • 25:00 - 25:01
    T.D.: So that's number one.
  • 25:01 - 25:05
    And the second thing. So we built that, and
  • 25:05 - 25:08
    we're like, OK, well what's next? And I think
  • 25:08 - 25:09
    that the big problem with this is that you
  • 25:09 - 25:12
    need to know that there's a problem before
    you
  • 25:12 - 25:14
    go look at it, right. So we have been
  • 25:14 - 25:16
    working for the past few months, and the Storm
  • 25:16 - 25:18
    infrastructure that we built makes it pretty
    straight-forward to
  • 25:18 - 25:21
    start building more abstractions on top of
    the data
  • 25:21 - 25:22
    that we've already collected.
  • 25:22 - 25:24
    It's a very declarative system. So we've been
    working
  • 25:24 - 25:27
    on a feature called inspections. And what's
    cool about
  • 25:27 - 25:29
    inspections is that we can look at this tremendous
  • 25:29 - 25:31
    volume of data that we've collected from your
    app,
  • 25:31 - 25:34
    and we can automatically tease out what the
    problems
  • 25:34 - 25:35
    are. So the first one that we shipped, this
  • 25:35 - 25:37
    is in beta right now. It's not
  • 25:37 - 25:40
    out and enabled by default, but it's
    behind
    behind
  • 25:40 - 25:42
    a feature flag that we've had some users turning
  • 25:42 - 25:43
    on.
  • 25:43 - 25:44
    And trying out. And so what we can
  • 25:44 - 25:46
    do in this case, is because we have information
  • 25:46 - 25:49
    about all of the database queries in your
    app,
  • 25:49 - 25:51
    we can look and see if you have n
  • 25:51 - 25:52
    plus one queries. Can you maybe explain what
    an
  • 25:52 - 25:53
    n plus one query is?
  • 25:53 - 25:55
    Y.K.: Yeah. So, people know, hopefully,
    what n
  • 25:55 - 25:57
    plus one queries are. But it's the idea that
  • 25:57 - 25:59
    you, by accident, for some reason, instead
    of making
  • 25:59 - 26:02
    one query, you ask for like all the posts
  • 26:02 - 26:03
    and then you iterated through all of them
    and
  • 26:03 - 26:05
    got all the comments and now you, instead
    of
  • 26:05 - 26:09
    having one query, you have one query per post,
  • 26:09 - 26:10
    right. And what you'd like
    to
  • 26:10 - 26:13
    do is eager loading, where you say include
  • 26:13 - 26:15
    comments, right. But you have to know that
    you
  • 26:15 - 26:15
    have to do that.
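
    In ActiveRecord terms, the difference is:

        # N+1: one query for the posts, then one more query per post.
        Post.all.each { |post| post.comments.to_a }

        # Eager loading: "include comments" up front, two queries total.
        Post.includes(:comments).each { |post| post.comments.to_a }
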
  • 26:15 - 26:17
    So there's some tools that will run in development
  • 26:17 - 26:18
    mode, if you happen to catch it, like
  • 26:18 - 26:20
    Bullet. This is basically a tool that's looking
    at
  • 26:20 - 26:22
    every single one of your classes and has some
  • 26:22 - 26:24
    thresholds that, once we see that a bunch
    of
  • 26:24 - 26:27
    your requests have the same exact query, so
    we
  • 26:27 - 26:30
    do some work to pull out binds. So if
  • 26:30 - 26:32
    it's, like, where something equals one, we
    will automatically
  • 26:32 - 26:34
    pull out the one and replace it with a
  • 26:34 - 26:35
    question mark.
  • 26:35 - 26:36
    And then we basically take all those queries,
    if
  • 26:36 - 26:40
    they're the exact same query repeated multiple
    times, subject
  • 26:40 - 26:41
    to some thresholds, we'll start showing you
    hey, there's
  • 26:41 - 26:42
    an n plus one query.
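
    A rough sketch of that detection (the regexes are simplified and the
    threshold invented; the real agent's bind handling is more involved):

        def normalize(sql)
          sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")  # 'foo' -> ?, 1 -> ?
        end

        THRESHOLD = 3

        queries = ["SELECT * FROM comments WHERE post_id = 1",
                   "SELECT * FROM comments WHERE post_id = 2",
                   "SELECT * FROM comments WHERE post_id = 3"]

        queries.group_by { |q| normalize(q) }.each do |shape, group|
          if group.size >= THRESHOLD
            puts "possible n+1 query: #{shape} (seen #{group.size} times)"
          end
        end
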
  • 26:42 - 26:44
    And you can imagine this same sort of thing
  • 26:44 - 26:46
    being done for things, like, are you missing
    an
  • 26:46 - 26:50
    index, right. Or, are you using the Ruby version
  • 26:50 - 26:51
    of JSON when you should be using the native
  • 26:51 - 26:52
    version of JSON. These are all things that
    we
  • 26:52 - 26:55
    can start detecting just because we're consuming
    an enormous
  • 26:55 - 26:58
    amount of information, and we can start writing
    some
  • 26:58 - 26:59
    heuristics for bubbling it up.
  • 26:59 - 27:02
    So, third and final breakthrough, we realized
    that we
  • 27:02 - 27:05
    really, really needed a lightning fast UI.
    Something really
  • 27:05 - 27:08
    responsive. So, in particular, the feedback
    loop is critical,
  • 27:08 - 27:10
    right. You can imagine, if the way that you
  • 27:10 - 27:12
    dug into data was you clicked and you wait
  • 27:12 - 27:14
    an hour, and then you get your results, no
  • 27:14 - 27:16
    one would do it. No one would ever do
  • 27:16 - 27:16
    it.
  • 27:16 - 27:19
    And the existing tools are OK, but you click
  • 27:19 - 27:20
    and you wait. You look at it and you're
  • 27:20 - 27:22
    like, oh, I want a different view, so then
  • 27:22 - 27:23
    you go edit your query and then you click
  • 27:23 - 27:25
    and you wait and it's just not a pleasant
  • 27:25 - 27:27
    experience.
  • 27:27 - 27:29
    So, we use Ember. The UI that
  • 27:29 - 27:31
    you're using when you log into Skylight, even
    though
  • 27:31 - 27:33
    it feels just like a regular website (it doesn't
  • 27:33 - 27:36
    feel like a native app), is powered, all of
  • 27:36 - 27:38
    the routing, all of the rendering, all of
    the
  • 27:38 - 27:41
    decision making, as an Ember.js
    app,
  • 27:41 - 27:43
    and we pair that with D3. So all of
  • 27:43 - 27:45
    the charts, the charts that you saw there
    in
  • 27:45 - 27:48
    the aggregate trace, that is all Ember components
    powered
  • 27:48 - 27:49
    by D3.
  • 27:49 - 27:53
    So, this has actually significantly cleaned
    up our client-side
  • 27:53 - 27:56
    code. It makes re-usability really, really
    awesome. So to
  • 27:56 - 27:57
    give you an example, this is from our billing
  • 27:57 - 27:59
    page. The designer came and they had,
  • 27:59 - 28:01
    they had a component that was like, the date
  • 28:01 - 28:02
    component.
  • 28:02 - 28:03
    And, the-
  • 28:03 - 28:06
    T.D.: It seems really boring at first.
  • 28:06 - 28:07
    Y.K.: It seemed really boring. But, this is
    the
  • 28:07 - 28:09
    implementation, right. So you could copy and
    paste this
  • 28:09 - 28:11
    code over and over again, everywhere you go.
    Just
  • 28:11 - 28:13
    remember to format it correctly. If you forget
    to
  • 28:13 - 28:15
    format it, it's not gonna look the same everywhere.
  • 28:15 - 28:17
    But I was like, hey, we're using this all
  • 28:17 - 28:18
    over the place. Why don't we bundle this up
  • 28:18 - 28:20
    into a component? And so with Ember, it was
  • 28:20 - 28:22
    super easy. We basically just said, OK, here's
    a new
  • 28:22 - 28:25
    calendar date component. It has a property
    on it
  • 28:25 - 28:26
    called date. Just set that to any JavaScript
    Date
  • 28:26 - 28:28
    object. You don't have to remember
    about
  • 28:28 - 28:30
    converting it or formatting it. Here's the
    component. Set
  • 28:30 - 28:32
    the date and it will render the correct thing
  • 28:32 - 28:33
    automatically.
  • 28:33 - 28:36
    And, so the architecture of the Ember app
    looks
  • 28:36 - 28:38
    a little bit, something like this, where you
    have
  • 28:38 - 28:40
    many, many different components, most of them
    just driven
  • 28:40 - 28:42
    by D3, and then they're plugged into the model
  • 28:42 - 28:43
    and the controller.
  • 28:43 - 28:45
    And the Ember app will go fetch those models
  • 28:45 - 28:47
    from the cloud, and the cloud from the Java
  • 28:47 - 28:50
    app, which just queries Cassandra, and render
    them. And
  • 28:50 - 28:53
    what's neat about this model is turning on
    web
  • 28:53 - 28:56
    sockets is super easy, right. Because all
    of these
  • 28:56 - 28:59
    components are bound to a single place. So
    when
  • 28:59 - 29:01
    the web socket says, hey, we have updated
    information
  • 29:01 - 29:03
    for you to show, it just pushes it onto
  • 29:03 - 29:05
    the model or onto the controller, and the
    whole
  • 29:05 - 29:06
    UI updates automatically.
  • 29:06 - 29:07
    It's like magic.
  • 29:07 - 29:07
    And-
  • 29:07 - 29:08
    Y.K.: Like magic.
  • 29:08 - 29:10
    T.D.: It's like magic. And, and when debugging,
    this
  • 29:10 - 29:12
    is especially awesome too, because, and I'll
    maybe show
  • 29:12 - 29:15
    a demo of the Ember inspector. It's nice.
  • 29:15 - 29:18
    So. Yeah. So, lightning fast UI. Reducing
    the feedback
  • 29:18 - 29:20
    loop so that you can quickly play with your
  • 29:20 - 29:22
    data, makes it go from a chore to something
  • 29:22 - 29:24
    that actually feels kind of fun.
  • 29:24 - 29:27
    So, these were the breakthroughs that we had
    when
  • 29:27 - 29:28
    we were building Skylight. The things that
    made us
  • 29:28 - 29:30
    think, yes, this is actually a product that
    we
  • 29:30 - 29:32
    think deserves to be on the market. So, one,
  • 29:32 - 29:34
    honest response times. Collect data that no
    one else
  • 29:34 - 29:37
    can collect. Focus on answers instead of just
    dumping
  • 29:37 - 29:38
    data, and have a lightning fast UI to do
  • 29:38 - 29:38
    it.
  • 29:38 - 29:40
    So we like to think of Skylight as basically
  • 29:40 - 29:43
    a smart profiler. It's a smart profiler that
    runs
  • 29:43 - 29:44
    in production. It's like the profiler that
    you run
  • 29:44 - 29:47
    on your local development machine, but instead
    of being
  • 29:47 - 29:49
    on your local dev box which has nothing to
  • 29:49 - 29:52
    do with the performance characteristics of
    what your users
  • 29:52 - 29:53
    are experiencing, we're actually running in
    production.
  • 29:53 - 29:59
    So, let me just give you guys a quick
  • 29:59 - 30:00
    demo.
  • 30:00 - 30:03
    So, this is what the Skylight, this is what
  • 30:03 - 30:08
    Skylight looks like. What's under this? There
    we go.
  • 30:08 - 30:10
    So, the first thing here is we've got the
  • 30:10 - 30:13
    app dash board. So this, it's like our, 95th
  • 30:13 - 30:16
    responsile- 95th percentile response time
    has peaked. Maybe you're
  • 30:16 - 30:18
    all hammering it right now. That would be
    nice.
  • 30:18 - 30:20
    So, this is a graph of your response time
  • 30:20 - 30:22
    over time, and then on the right, this is
  • 30:22 - 30:25
    the graph of the RPMs, the requests per minute
  • 30:25 - 30:27
    that your app is handling. So this is app-wide.
  • 30:27 - 30:29
    And this is live. This updates every minute.
  • 30:29 - 30:31
    Then down below, you have a list of the
  • 30:31 - 30:34
    end points in your application. So you can
    see,
  • 30:34 - 30:36
    actually, the top, the slowest ones for us
    were,
  • 30:36 - 30:38
    we have an instrumentation API, and we've
    gone and
  • 30:38 - 30:40
    instrumented our background workers. So we
    can see them
  • 30:40 - 30:42
    here, and their response time plays in. So
    we
  • 30:42 - 30:44
    can see that we have this reporting worker
    that's
  • 30:44 - 30:47
    taking, at the 95th percentile, thirteen seconds.
  • 30:47 - 30:49
    Y.K.: So all that time used to be inside
  • 30:49 - 30:52
    of some request somewhere, and we discovered
    that there
  • 30:52 - 30:53
    was a lot of time being spent in things
  • 30:53 - 30:55
    that we could push to the background. We probably
  • 30:55 - 30:57
    need to update the agony index so that it
  • 30:57 - 30:59
    doesn't make workers very high, because spending
    some time
  • 30:59 - 31:02
    in your workers is not that big of a
  • 31:02 - 31:02
    deal.
  • 31:02 - 31:03
    T.D.: So, so then, if we dive into one
  • 31:03 - 31:05
    of these, you can see that for this request,
  • 31:05 - 31:07
    we've got the time explorer up above, and
    that
  • 31:07 - 31:10
    shows a graph of response time at, again,
    95th
  • 31:10 - 31:12
    percentile, and you can, if you want to go
  • 31:12 - 31:14
    back and look at historical data, you just
    drag
  • 31:14 - 31:15
    it like this. And this has got a brush,
  • 31:15 - 31:17
    so you can zoom in and out on different
  • 31:17 - 31:18
    times.
  • 31:18 - 31:20
    And every time you change the range, you can
  • 31:20 - 31:21
    see that it's very responsive. It's never
    waiting for
  • 31:21 - 31:23
    the server. But it is going back and fetching
  • 31:23 - 31:25
    data from the server and then when the data
  • 31:25 - 31:29
    comes back, you see the whole UI just updates.
  • 31:29 - 31:29
    And we get that for free with Ember.
  • 31:29 - 31:31
    And then down below, as we discussed, you
    actually
  • 31:31 - 31:34
    have a real histogram. And this histogram,
    in this
  • 31:34 - 31:37
    case, is showing fifty-seven
    requests.
  • 31:37 - 31:39
    And if we click and drag, we could just
  • 31:39 - 31:40
    move this. And you can see that the aggregate
  • 31:40 - 31:43
    trace below updates in response to us dragging
    this.
  • 31:43 - 31:45
    And if we want to look at the fastest
  • 31:45 - 31:48
    quartile, we just click "faster" and we'll just
    choose
  • 31:48 - 31:48
    that range on the histogram.
  • 31:48 - 31:49
    Y.K.: I think it's the fastest load.
  • 31:49 - 31:51
    T.D.: The fastest load. And then if you click
  • 31:51 - 31:53
    on "slower," you can see the slower requests. So
    So
  • 31:53 - 31:55
    this makes it really easy to compare and contrast.
  • 31:55 - 31:57
    OK. Why are certain requests faster and why
    are
  • 31:57 - 31:59
    certain requests slow?
  • 31:59 - 32:01
    You can see these blue areas. This
  • 32:01 - 32:04
    is Ruby code. So, right now it's not super
  • 32:04 - 32:06
    granular. It would be nice if you could actually
  • 32:06 - 32:08
    know what was going on here. But, it'll at
  • 32:08 - 32:10
    least tell you where in your controller action
    this
  • 32:10 - 32:13
    is happening, and then you can actually see
    which
  • 32:13 - 32:16
    database queries are being executed, and what
    their duration
  • 32:16 - 32:16
    is.
  • 32:16 - 32:18
    And you can see that we actually extract the
  • 32:18 - 32:20
    SQL and we normalize it, so you
  • 32:20 - 32:22
    can see exactly
  • 32:22 - 32:24
    what those requests are even if the values
    are
  • 32:24 - 32:25
    totally different between them.
  • 32:25 - 32:28
    Y.K.: Yeah. So the real query, courtesy of
    Rails,
  • 32:28 - 32:30
    not yet supporting bind extraction, is like,
    where id
  • 32:30 - 32:32
    equals one, or ten, or whatever.
  • 32:32 - 32:34
    T.D.: Yup. So that's pretty cool.
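
To make the normalization concrete, here is a toy version of the idea: literal values get replaced with placeholders, so queries that differ only in their bind values bucket together. Skylight's real normalizer lives in the Rust agent and handles far more SQL than this sketch:

```ruby
# Toy SQL normalization: strip literal values so queries differing
# only in their bind values aggregate into the same bucket.
def normalize_sql(sql)
  sql
    .gsub(/'(?:[^']|'')*'/, "?") # string literals -> ?
    .gsub(/\b\d+\b/, "?")        # numeric literals -> ?
end

normalize_sql("SELECT * FROM users WHERE id = 1")
# => "SELECT * FROM users WHERE id = ?"
normalize_sql("SELECT * FROM users WHERE id = 10")
# => "SELECT * FROM users WHERE id = ?"  (same bucket as above)
```
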
  • 32:34 - 32:37
    Y.K.: So one, one other thing is, initially,
    we
  • 32:37 - 32:39
    actually just showed the whole trace, but
    we discovered
  • 32:39 - 32:42
    that, obviously when you show whole traces
    you have
  • 32:42 - 32:44
    information that doesn't really matter that
    much. So we
  • 32:44 - 32:47
    started off showing everything, but we've recently started
    to collapse
  • 32:47 - 32:49
    things that don't matter so much, so that you
  • 32:49 - 32:51
    can expand or condense the trace.
  • 32:51 - 32:53
    And we wanted to make it so you don't
  • 32:53 - 32:56
    have to think about expanding or condensing
    individual areas;
  • 32:56 - 32:58
    you just see what matters the most and
  • 32:58 - 32:59
    then you can expand the trivial areas.
  • 32:59 - 33:02
    T.D.: Yup. So, so that's the demo of Skylight.
  • 33:02 - 33:04
    We'd really like it if you checked it out.
  • 33:04 - 33:06
    There is one more thing I want to show
  • 33:06 - 33:08
    you that is, like, really freaking cool. This
    is
  • 33:08 - 33:11
    coming out of Tilde labs. Carl has
  • 33:11 - 33:14
    been hacking, staying up past midnight,
    getting
  • 33:14 - 33:16
    almost no sleep for the past month trying
    to
  • 33:16 - 33:17
    have this ready.
  • 33:17 - 33:19
    I don't know how many of you know this,
  • 33:19 - 33:24
    but Ruby 2.1 has a new
  • 33:24 - 33:28
    stack sampling feature. So you can get really
  • 33:28 - 33:31
    granular information about how your Ruby code
    is performing.
  • 33:31 - 33:33
    So I want to show you, I just mentioned
  • 33:33 - 33:35
    how it would be nice if we could get
  • 33:35 - 33:37
    more information out of what your Ruby code
    is
  • 33:37 - 33:39
    doing. And now we can do that.
  • 33:39 - 33:42
    Basically, every few milliseconds, this code
    that Carl wrote
  • 33:42 - 33:44
    is going into MRI,
  • 33:44 - 33:47
    and it's taking a snapshot of the stack.
  • 33:47 - 33:51
    And because this is built-in, it's very low-impact.
    It's
  • 33:51 - 33:54
    not allocating any new memory. There's very little
    performance
  • 33:54 - 33:56
    hit. Basically you wouldn't even notice it.
    And so
  • 33:56 - 33:58
    every few milliseconds it's sampling, and
    we take that
  • 33:58 - 34:00
    information and we send it up to our servers.
  • 34:00 - 34:02
    So it's almost like you're running a Ruby profiler
    on
  • 34:02 - 34:05
    your local dev box, where you get extremely
    granular
  • 34:05 - 34:07
    information about where your code is spending
    its time
  • 34:07 - 34:09
    in Ruby, per method, all of these things.
  • 34:09 - 34:12
    But it's happening in production.
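
The feature being described is Ruby 2.1's frame-profiling C API, rb_profile_frames. For a feel of what sampling on top of it looks like, here's a sketch using the open-source stackprof gem, which builds on the same API; Skylight's Rust agent hooks MRI directly, so treat this as an analogy rather than their implementation:

```ruby
require 'stackprof'

# Sampling, not tracing: every `interval` microseconds the running
# stack is captured via Ruby 2.1's rb_profile_frames, with almost no
# allocation. Here we profile a deliberately busy loop.
profile = StackProf.run(mode: :wall, interval: 1000) do
  100_000.times { |i| Math.sqrt(i).to_s }
end

# Frames with the most samples are where the wall-clock time went.
top = profile[:frames].values.max_by { |f| f[:samples] }
puts "#{top[:name]} (#{top[:samples]} samples)"
```
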
  • 34:12 - 34:16
    So, we enabled
  • 34:16 - 34:18
    it in staging. You can see that we've got
  • 34:18 - 34:20
    some rendering bugs. It's still in beta.
  • 34:20 - 34:22
    Y.K.: Yeah, and we haven't yet collapsed things
    that
  • 34:22 - 34:22
    are not important-
  • 34:22 - 34:22
    T.D.: Yes.
  • 34:22 - 34:23
    Y.K.: -for this particular feature.
  • 34:23 - 34:24
    T.D.: So we want to
  • 34:24 - 34:28
    hide things like framework code, obviously.
    But this
  • 34:28 - 34:31
    gives you an incredibly, incredibly granular
    view of what
  • 34:31 - 34:36
    your app is doing in production.
  • 34:36 - 34:39
    This is an API that's built into
  • 34:39 - 34:43
    Ruby 2.1. Because our agent is running so
    low-level,
  • 34:43 - 34:45
    because we wrote it in Rust, we have the
  • 34:45 - 34:47
    ability to do things like this, and Carl thinks
  • 34:47 - 34:48
    that we may be able to actually backport
  • 34:48 - 34:48
    this to older Rubies, too. So if you're not
  • 34:48 - 34:50
    on Ruby 2.1, we think that we can actually
  • 34:50 - 34:53
    bring this. But that's TBD.
  • 34:53 - 34:55
    Y.K.: Yeah, I- so I think the cool thing
  • 34:55 - 34:58
    about this, in general, is when you run a
  • 34:58 - 34:59
    sampling- so this is a sampling profiler,
    right, we
  • 34:59 - 35:01
    don't want to be burdening every single thing
    that
  • 35:01 - 35:04
    you do in your program with tracing, right.
    That
  • 35:04 - 35:05
    would be very slow.
  • 35:05 - 35:07
    So when you normally run a sampling profiler,
    you
  • 35:07 - 35:09
    have to basically make a loop:
  • 35:09 - 35:11
    run this code a million
  • 35:11 - 35:13
    times and keep sampling it. Eventually we'll
    get enough
  • 35:13 - 35:15
    samples to get the information. But it turns
    out
  • 35:15 - 35:17
    that your production server is a loop. Your
    production
  • 35:17 - 35:21
    server is serving tons and tons of requests.
    So,
  • 35:21 - 35:23
    by simply taking a few microseconds
  • 35:23 - 35:26
    out of every request and collecting a couple
    of
  • 35:26 - 35:27
    samples, over time we can actually get this
    really
  • 35:27 - 35:30
    high fidelity picture with basically no cost.
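
In other words, the benchmark loop gets replaced by the request loop. A minimal, hypothetical illustration of the idea in plain Ruby — a background thread snapshotting another thread's stack and tallying frames; `run_request_loop` is a stand-in for your server, and the real agent does this natively in MRI far more cheaply:

```ruby
# Hypothetical sketch: sample the main thread's stack every few
# milliseconds and tally which source lines it is sitting on. Over
# thousands of requests the tallies converge on the hot spots.
samples = Hash.new(0)
target  = Thread.main

sampler = Thread.new do
  loop do
    (target.backtrace_locations || []).each do |loc|
      samples["#{loc.path}:#{loc.lineno}"] += 1
    end
    sleep 0.005 # every few milliseconds, as in the talk
  end
end

run_request_loop # hypothetical: your server handling requests forever
sampler.kill
samples.sort_by { |_, n| -n }.first(10).each { |line, n| puts "#{n}\t#{line}" }
```
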
  • 35:30 - 35:31
    And that's pretty mind-blowing. And this is
    the kind
  • 35:31 - 35:35
    of stuff that we can start doing by really
  • 35:35 - 35:37
    caring about both the user experience and the
    and the
  • 35:37 - 35:41
    implementation and getting really scary about
    it. And I'm
  • 35:41 - 35:43
    really, like, honestly this is a really exciting
    feature
  • 35:43 - 35:45
    that really shows what we can do as we
  • 35:45 - 35:46
    start building this out.
  • 35:46 - 35:47
    T.D.: Once we've got
    that
  • 35:47 - 35:48
    groundwork.
  • 35:48 - 35:50
    So if you guys want to check it out,
  • 35:50 - 35:52
    Skylight dot io, it's available today. It's
    no longer
  • 35:52 - 35:54
    in private beta. Everyone can sign up. No
    invitation
  • 35:54 - 35:57
    token necessary. And you can get a thirty-day
    free
  • 35:57 - 35:58
    trial if you haven't started one already.
    So if
  • 35:58 - 36:00
    you have any questions, please come see us
    right
  • 36:00 - 36:01
    now, or we have a booth in the vendor
  • 36:01 - 36:03
    hall. Thank you guys very much.