-
TOM DALE: Hey, you guys ready?
-
Thank you guys so much for coming.
-
This is awesome. I was really,
-
I, when they were putting together the schedule,
-
I said, make sure that you put us down
-
in the Caves of Moria. So thank you
-
guys for coming down and making it.
-
I'm Tom. This is Yehuda.
-
YEHUDA KATZ: When people told me I was signed up
-
to do a back-to-back talk, I don't know what
-
I was thinking.
-
T.D.: Yup. So. We want to talk to you
-
today about, about Skylight. So, just a little
bit
-
before we talk about that, I want to talk
-
about us a little bit.
-
So, in 2011 we started a company called Tilde.
-
It's this shirt. It may have made me self-conscious,
-
because this is actually a first edition and
it's
-
printed off-center. Well, either I'm off-center
or the shirt's
-
off-center. One of the two.
-
So we started Tilde in 2011, and we had
-
all just left a venture-backed company, and
that was
-
a pretty traumatic experience for us because
we spent
-
a lot of time building the company and then
-
we ran out of money and sold to Facebook
-
and we really didn't want to repeat that experience.
-
So, we decided to start Tilde, and when we
-
did it, we decided to be. DHH and the
-
other people at Basecamp were talking about,
you know,
-
being bootstrapped and proud. And that was
a message
-
that really resonated with us, and so we wanted
-
to capture the same thing.
-
There's only one problem with being bootstrapped
and proud, and
-
that is, in order to be both of those
-
things, you actually need money, it turns
out. It's
-
not like you just say it in a blog
-
post and then all of the sudden you are
-
in business.
-
So, we had to think a lot about, OK,
-
well, how do we make money? How do we
-
make money? How do we make a profitable and,
-
most importantly, sustainable business? Because
we didn't want to
-
just flip to Facebook in a couple years.
-
So, looking around, I think the most obvious
thing
-
that people suggested to us is, well, why
don't
-
you guys just become Ember, Inc.? Raise a
few
-
million dollars, you know, build a business
-
model that's mostly prayer. But that's not really
how we
-
want to think about building open source communities.
-
We don't really think that that necessarily
leads to
-
the best open source communities. And if you're
interested
-
more in that, I recommend Leah Silber, who
is
-
one of our co-founders. She's giving a talk
this
-
afternoon. Oh, sorry. Friday afternoon, about
how to build
-
a company that is centered on open source.
So
-
if you want to learn more about how we've
-
done that, I would really suggest you go check
-
out her talk.
-
So, no. So, no Ember, Inc. Not allowed.
-
So, we really want to build something that
leveraged
-
the strengths that we thought that we had.
One,
-
I think most importantly, a really deep knowledge
of
-
open source and a deep knowledge of the Rails
-
stack, and also Carl, it turns out, is really,
-
really good at building highly scalable big
data sys-
-
big data systems. Lots of Hadoop in there.
-
So, last year at RailsConf, we announced the
private
-
beta of Skylight. How many of you have used
-
Skylight? Can you raise your hand if you have
-
used it? OK. Many of you. Awesome.
-
So, so Skylight is a tool for profiling and
-
measuring the performance of your Rails applications
in production.
-
And, as a product, Skylight, I think, was
built
-
on three key breakthroughs.
-
We didn't want to
ship a
-
product that was incrementally better than
the competition. We
-
wanted to ship a product that was dramatically
better.
-
Quantum leap. An order of magnitude better.
-
And, in order to do that, we spent a
-
lot of time thinking about it, about how we
-
could solve most of the problems that we saw
-
in the existing landscape. And delivering a
product that does
-
that is predicated on these three breakthroughs.
-
So, the first one I want to talk about
-
is, honest response times. Honest response
times. So, DHH
-
wrote a blog post on what was then the
-
37Signals blog, now the Basecamp blog, called
The problem
-
with averages. How many of you have read this?
-
Awesome.
-
For those of you that have not, how many
-
of you hate raising your hands at presentations?
-
So, for those of you that-
-
Y.K.: Just put a button in every seat. Press
-
this button-
-
T.D.: Press the button if you have. Yes. Great.
-
So, if you read this blog post, the way
-
it opens is, Our average response time for
Basecamp
-
right now is 87ms... That sounds fantastic.
And it
-
easily leads you to believe that all is well
-
and that we wouldn't need to spend any more
-
time optimizing performance.
-
But that's actually wrong. The average number
is completely
-
skewed by tons of fast responses to feed requests
-
and other cached replies. If you have 1000
requests
-
that return in 5ms, and then you have
-
200 requests taking 2000ms, or two seconds,
you can
-
still report a respectable average of 337ms.
-
That's useless.
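-
A quick back-of-the-envelope version of that example, as a minimal Ruby sketch (numbers taken from the quote above):

    fast_ms = 1000 * 5     # 1000 requests at 5ms each
    slow_ms = 200 * 2000   # 200 requests at 2000ms each
    average = (fast_ms + slow_ms) / 1200.0
    # => 337.5 -- a "respectable" average, even though one in six
    #    requests took two full seconds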
-
So what does DHH say that we need? DHH
-
says the solution is histograms. So, for those
of
-
you like me who were sleeping through your
statistics
-
class in high school, and college, a brief
primer
-
on histograms. So a histogram is very simple.
Basically,
-
you have a, you have a series of numbers
-
along some axis, and every time you, you're
in
-
that number, you're in that bucket, you basically
increment
-
that bar by one.
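-
A minimal sketch of that bucketing idea in Ruby (the bucket boundaries here are made up for illustration):

    BUCKETS = [0, 100, 250, 500, 1000, 2000]  # ms

    def bucket_for(duration_ms)
      BUCKETS.reverse.find { |b| duration_ms >= b }
    end

    histogram = Hash.new(0)
    [42, 480, 495, 470, 1900].each do |ms|   # sample response times
      histogram[bucket_for(ms)] += 1         # "increment that bar by one"
    end
    # histogram => {0=>1, 250=>3, 1000=>1}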
-
So, this is an example of a histogram of
-
response times in a Rails application. So
you can
-
see that there's a big cluster in the middle
-
around 488ms, 500ms. This isn't a super speedy
app
-
but it's not the worst thing in the world.
-
And they're all clustered, and then as you
kind
-
of move to the right you can see that
-
the response times get longer and longer and
longer,
-
and as you move to the left, response times
-
get shorter and shorter and shorter.
-
So, why do you want a histogram? What's the,
-
what's the most important thing about a histogram?
-
Y.K.: Well, I think it's because most requests
don't
-
actually look like this.
-
T.D.: Yes.
-
Y.K.: Most end points don't actually look
like this.
-
T.D.: Right. If you think about what your
Rails
-
app is doing, it's a complicated beast, right.
Turns
-
out, Ruby frankly, you can, you can do branching
-
logic. You can do a lot of things.
-
And so what that means is that one end
-
point, if you represent that with a single
number,
-
you are losing a lot of fidelity, to the
-
point where it becomes, as DHH said, useless.
So,
-
for example, in a histogram, you can easily
see,
-
oh, here's a group of requests and response
times
-
where I'm hitting the cache, and here's another
group
-
where I'm missing it. And you can see that
-
that cluster is significantly slower than
the faster cache-hitting
-
cluster.
-
And the other thing that you get when you
-
have a, a distribution, when you keep the
whole
-
distribution in the histogram, is you can
look at
-
this number at the 95th percentile, right.
So the
-
way to think about the performance
of
-
your web application is not the average, because
the
-
average doesn't really tell you anything.
You want to
-
think about the 95th percentile, because that's
not the
-
average response time, that's the average
worst response time
-
that a user is likely to hit.
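-
A rough sketch of reading the 95th percentile off a histogram, reusing the bucket shape from the earlier sketch: walk the buckets in order until 95% of the requests are covered.

    def percentile_bucket(histogram, pct)
      total  = histogram.values.inject(:+)
      target = total * pct
      seen   = 0
      histogram.sort.each do |bucket, count|  # walk buckets in order
        seen += count
        return bucket if seen >= target
      end
    end

    histogram = { 0 => 920, 250 => 60, 1000 => 15, 2000 => 5 }
    percentile_bucket(histogram, 0.95)  # => 250 (the 250ms bucket)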
-
And the thing to keep in mind is that
-
it's not as though a customer comes to your
-
site, they issue one request, and then they're
done,
-
right. As someone is using your website, they're
gonna
-
be generating a lot of requests. And you need
-
to look at the 95th percentile, because otherwise
every
-
request is basically you rolling the dice
that they're
-
not gonna hit one of those two second, three
-
second, four second responses, close the tab
and go
-
to your competitor.
-
So we look at this as, here's the crazy
-
thing. Here's what I think is crazy. That
blog
-
post that DHH wrote, it's from 2009. It's
been
-
five years, and there's still no tool that
does
-
what DHH was asking for. So, we, frankly,
we
-
smelled money. We were like, holy crap.
-
Y.K.: Yeah, why isn't that slide green?
-
T.D.: Yeah. It should be green and dollars.
I
-
think Keynote has the dollars, the make it
rain
-
effect I should have used. So we smelled blood
-
in the water. We're like, this is awesome.
There's
-
only one problem that we discovered, and that
is,
-
it turns out that building this thing is actually
-
really, really freaky hard. Really, really
hard.
-
So, we announced the private beta at RailsConf
last
-
year. Before doing that, we spent a year of
-
research spiking out prototypes, building
prototypes, building out the
-
beta. We launched at RailsConf, and we realized,
we
-
made a lot of mistakes. We made a lot
-
of errors when we were building this system.
-
So then, after RailsConf last year, we basically
took
-
six months to completely rewrite the backend
from the
-
ground up. And I think tying into your keynote,
-
Yehuda, we, we were like, oh. We clearly have
-
a bespoke problem. No one else is doing this.
-
So we rewrote our own custom backend. And
then
-
we had all these problems, and we realized
that
-
they had actually already all been solved
by the
-
open source community. And so we benefited
tremendously by
-
having a shared solution.
-
Y.K.: Yeah. So our first release of this was
-
really very bespoke, and the current release
uses a
-
tremendous amount of very off-the-shelf open
source projects that
-
just solved the particular problem very effectively,
very well.
-
None of which are as easy to use as
-
Rails, but all of which solve really thorny
problems
-
very effectively.
-
T.D.: So, so let's just talk, just for your
-
own understanding, let's talk about how most
performance monitoring
-
tools work. So the way that most of these
-
work is that you run your Rails app, and
-
running inside of your Rails app is some gem,
-
some agent that you install. And every time
the
-
Rails app handles a request, it generates
events, and
-
those events, which include information about
performance data, those
-
events are passed into the agent.
-
And then the agent sends that data to some
-
kind of centralized server. Now, it turns
out that
-
doing a running average is actually really
simple. Which
-
is why everyone does it. Basically you can
do
-
it in a single SQL query, right. All you
-
do is you have three columns in the database:
the
-
end point, the running average, and the number
of
-
requests. The average and the count are
the
-
two things that you need to keep a running
-
average, right.
-
So keeping a running average is actually really
simple
-
from a technical point of view.
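-
A minimal sketch of that running average in plain Ruby (names invented for illustration); note that the whole distribution is thrown away:

    def record(stats, endpoint, duration_ms)
      avg, count = stats.fetch(endpoint, [0.0, 0])
      stats[endpoint] = [(avg * count + duration_ms) / (count + 1), count + 1]
    end

    stats = {}
    record(stats, "PostsController#index", 45)
    record(stats, "PostsController#index", 2000)
    stats  # => {"PostsController#index"=>[1022.5, 2]}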
-
Y.K.: I don't think you could even do it in
JavaScript,
-
due to the lack of integers.
-
T.D.: Yes. You probably wouldn't want to do
any
-
math in JavaScript, it turns out. So, so we
-
took a little bit different approach. Yehuda,
do you
-
want to go over the next section?
-
Y.K.: Yeah. Sure. So, when we first started,
right
-
at the beginning, we basically did a similar
thing
-
where we had a bunch - your app creates
-
events. Most of those start off as being ActiveSupport::Notifications,
-
although it turns out that there's very limited
use
-
of ActiveSupport::Notifications so we had
to do some normalization
-
work to get them sane, which we're gonna be
-
upstreaming back into, into Rails.
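-
For context, a minimal sketch of consuming one of those events with the public ActiveSupport::Notifications API. The helper at the end is hypothetical, and none of Skylight's own normalization is shown:

    ActiveSupport::Notifications.subscribe("sql.active_record") do |name, started, finished, id, payload|
      duration_ms = (finished - started) * 1000
      # payload[:sql] holds the query text for this event
      record_trace_node(name, duration_ms, payload[:sql])  # hypothetical helper
    end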
-
But, one thing that's kind of unfortunate
about having
-
every single Rails app have an agent is that
-
you end up having to do a lot of
-
the same kind of work over and over again,
-
and use up a lot of memory. So, for
-
example, every one of these things is making
HTTP
-
requests. So now you have a queue of things
-
that you're sending over HTTP in every single
one
-
of your Rails processes. And, of course, you
probably
-
don't notice this. People are used to Rails
taking
-
up hundreds and hundreds of megabytes, so
you probably
-
don't notice if you install some agent and
it
-
suddenly starts taking twenty, thirty, forty,
fifty more megabytes.
-
But we really wanted to keep the actual memory
-
per process down to a small amount. So one
-
of the very first things that we did, we
-
even did it before last year, is that we
-
pulled out all that shared logic into a, a
-
separate process called the coordinator. And
the agent is
-
basically responsible simply for collecting
the, the trace, and
-
it's not responsible for actually talking
to our server
-
at all. And that means that the coordinator
only
-
has to do this queueing, keeping
-
a bunch of work in one place,
-
and it doesn't end up using up as much
-
memory.
-
And I think this, this ended up being very
-
effective for us.
-
T.D.: And I think that low overhead also allows
-
us to just collect more information, in general.
-
Y.K.: Yeah.
-
Now, after our first attempt, we started getting
a
-
bunch of customers that were telling us that
even
-
the separate - so the separate coordinator,
started as
-
a good thing and a bad thing. On the
-
one hand, there's only one of them, so it
-
uses up only one set of memory. On the
-
other hand, it's really easy for someone to
go
-
in and ps that process and see how many
-
megabytes of memory it's using.
-
So, we got a lot of additional complaints
that
-
said oh, your process is using a lot of
-
memory. And, I spent a few weeks, I, I
-
know Ruby pretty well. I spent a couple of
-
weeks. I actually wrote a gem called Allocation
Counter
-
that basically went in to try to pin point
-
exactly where the allocations were hap- coming
from. But
-
it turns out that it's actually really, really
hard
-
to track down exactly where allocations are
coming from
-
in Ruby, because something as simple as using
a
-
regular expression in Ruby can allocate match
objects that
-
get put back on the stack.
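-
A tiny illustration of that kind of hidden allocation: a successful Regexp match allocates a MatchData object behind the scenes even if you never ask for it.

    "GET /posts/1" =~ /(\d+)/
    $~        # => #<MatchData "1" 1:"1">  -- allocated for you
    $~.class  # => MatchData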
-
And so I was able to pare this down
-
to some degree. But I really discovered quickly
that,
-
trying to keep a lid on the memory allocation
-
by doing all the stuff in Ruby, is mostly
-
fine. But for our specific use case where
we
-
really wanna, we wanna be telling you, you
can
-
run the agent on your process, on your box,
-
and it's not gonna use a lot of memory.
-
We really needed something more efficient,
and our first
-
thought was, we'll use C++ or C. No problem.
-
C is, is native. It's great. And Carl did
-
the work. Carl is very smart. And then he
-
said, Yehuda. It is now your turn. You need
-
to start maintaining this. And I said, I don't
-
trust myself to write C++ code that's running
in
-
all of your guys's boxes, and not seg-fault.
So
-
I don't think that, that doesn't work for
me.
-
And so I, I noticed that Rust was coming
-
along, and what Rust really gives you is it
-
gives you the ability to write low-level code
a
-
la C or C++ with manual memory management,
that
-
keeps your memory allocation low and keeps
things speedy.
-
Low resource utilization. While also giving
you compile time
-
guarantees about not seg-faulting. So, again,
if your processes
-
randomly started seg-faulting because you
installed the agent, I
-
think you would stop being our customer very
quickly.
-
So having what, pretty much 100% guarantees
about that
-
was very important to us. And so that's why
-
we decided to use Rust.
-
I'll just keep going.
-
T.D.: Keep going.
-
Y.K.: So, we had this coordinator object.
And basically
-
the coordinator object is receiving events.
So the events
-
basically end up being these traces that describe
what's
-
happening in your application. And the next
thing, I
-
think our initial work on this we used JSON
-
just to send the payload to the server,
-
but we noticed that a lot of people have
-
really big requests. So you may have a big
-
request with a big SQL query in it, or
-
a lot of big SQL queries in it. Some
-
people have traces that are hundreds and hundreds
of
-
nodes long. And so we really wanted to figure
-
out how to shrink down the payload size to
-
something that we could be, you know, pumping
out
-
of your box on a regular basis without running
-
up your bandwidth costs.
-
So, one of the first things that we did
-
early on was we switched to using protobuf as
the
-
transport mechanism, and that really shrunk,
shrunk down the
-
payloads a lot. Our earlier prototypes for
actually collecting
-
the data were written in Ruby, but I think
-
Carl did, like, a weekend hack to just port
-
it over to Java and got, like, 200x performance.
-
And you don't always get 200x performance,
if mostly
-
what you're doing is database queries, you're
not gonna
-
get a huge performance swing.
-
But mostly what we're doing is math. And algorithms
-
and data structures. And for that, Ruby is,
it
-
could, in theory, one day, have a good JIT
-
or something, but today, writing that code
in Java
-
didn't end up being significantly more code
because it's
-
just, you know, algorithms and data structures.
-
T.D.: And I'll just note something about standardizing
on
-
protobufs in our, in our stack, is actually
a
-
huge win, because we, we realized, hey, browsers,
as
-
it turns out are pretty powerful these days.
They've
-
got, you know, they can allocate memory, they
can
-
do all these types of computation. So, and
protobuf
-
libraries exist everywhere. So we save ourselves
a lot
-
of computation and a lot of time by just
-
treating protobuf as the canonical serialization
form, and then
-
you can move payloads around the entire stack
and
-
everything speaks the same language, so you've
saved the
-
serialization and deserialization.
-
Y.K.: And JavaScript is actually surprisingly
effective at des-
-
at taking protobufs and converting them to
the format
-
that we need efficiently. So, so we basically
take
-
this data. The Java collector is basically
collecting all
-
these protobufs, and pretty much it just
turns around,
-
and this is sort of where we got into
-
bespoke territory before we started rolling
our own, but
-
we realized that when you write a big, distributed,
-
fault-tolerant system, there's a lot of problems
that you
-
really just want someone else to have thought
about.
-
So, what we do is we basically take these,
-
take these payloads that are coming in. We
convert
-
them into batches and we send the batches
down
-
into the Kafka queue. And the, the next thing
-
that happens, so the Kafka, sorry, Kafka's
basically just
-
a queue that allows you to throw things into,
-
I guess, it might be considered similar to
like,
-
something like AMQP. It has some nice fault-tolerance
properties
-
and integrates well with Storm. But most importantly,
it's
-
just super, super high throughput.
-
So we basically didn't want to put any barrier
between
-
you giving us the data and us getting it
-
to disk as soon as possible.
-
T.D.: Yeah. Which we'll, I think, talk about
in
-
a bit.
-
Y.K.: So, so basically Kafka takes the
-
data and starts sending it into Storm. And
if
-
you think about what has to happen in order
-
to get some request. So, you have these requests.
-
There's, you know, maybe traces that have
a bunch
-
of SQL queries, and our job is basically to
-
take all those SQL queries and say, OK, I
-
can see that in all of your requests, you
-
had the SQL query and it took around this
-
amount of time and it happened as a child
-
of this other node. And the way to think
-
about that is basically just a processing
pipeline. Right.
-
So you have these traces that come in one
-
side. You start passing them through a bunch
of
-
processing steps, and then you end up on the
-
other side with the data.
-
And Storm is actually a way of describing
that
-
processing pipeline in sort of functional
style, and then
-
you tell it, OK. Here's how many servers I
-
need. Here's how, here's how I'm gonna handle
failures.
-
And it basically deals with distribution and
scaling and
-
all that stuff for you. And part of that
-
is because you wrote everything using functional
style.
-
And so what happens is Kafka sends the data
-
into the entry spout, which is sort of terminology
-
in, terminology in Storm for these streams
that get
-
created. And they basically go into these
processing things,
-
which very clever- cutely are called bolts.
This is
-
definitely not the naming I would have used,
but.
-
So they're called bolts. And the idea is that
-
basically every request may have several things.
-
So, for example, we now automatically detect
n +
-
1 queries and that's sort of a different kind
-
of processing from just, make a picture of
the
-
entire request. Or what is the 95th percentile
across
-
your entire app, right. These are all different
kinds
-
of processing. So we take the data and we
-
send them into a bunch of bolts, and the
-
cool thing about bolts is that, again, because
they're
-
just functional chaining, you can take the
output from
-
one bolt and feed it into another bolt. And
-
that works, that works pretty well. And, and
you
-
don't have to worry about - I mean, you
-
have to worry a little bit about things like
-
fault tolerance, failure, idempotence. But
you worry about
-
them at, at the abstraction level, and then
the
-
operational part is handled for you.
-
T.D.: So it's just like a very declarative
way
-
of describing how this computation works in,
in a
-
way that's easy to scale.
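-
As a loose Ruby analogy only (Storm itself is a JVM system, and this is not its API): each step is a pure function over a batch of traces, and steps can be chained the way bolts are.

    extract_queries = ->(traces) { traces.flat_map { |t| t[:sql_nodes] } }
    group_by_query  = ->(nodes)  { nodes.group_by { |n| n[:normalized_sql] } }
    flag_n_plus_1   = ->(groups) { groups.select { |_sql, nodes| nodes.size > 10 } }

    traces = [
      { sql_nodes: [{ normalized_sql: "SELECT * FROM comments WHERE post_id = ?" }] * 12 }
    ]

    pipeline = [extract_queries, group_by_query, flag_n_plus_1]
    pipeline.reduce(traces) { |data, step| step.call(data) }
    # => {"SELECT * FROM comments WHERE post_id = ?" => [...12 nodes...]}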
-
Y.K.: And Carl actually talked about this
at very
-
high speed yesterday, and you, some of you
may
-
have been there. I would recommend watching
the video
-
when it comes out if you want to make
-
use of this stuff in your own applications.
-
And then when you're finally done with all
the
-
processing, you need to actually do something
with it.
-
You need to put it somewhere so that the
-
web app can get access to it, and that
-
is basically, we use Cassandra for this. And
Cassandra
-
again is mostly a dumb database, but
it
-
has high capacity. It has some
of
-
the fault-tolerance capabilities that we want.
-
T.D.: We're very, we're just very, very write-heavy,
right.
-
Like, we tend to be writing more than we're
-
ever reading.
-
Y.K.: Yup. And then when we're done, when
we're
-
done with a particular batch, Cassandra basically
kicks off
-
the process over again. So we're basically
doing these
-
things as batches.
-
T.D.: So these are, these are roll-ups, is
what's
-
happening here. So basically every minute,
every ten minutes,
-
and then at every hour, we reprocess and we
-
re-aggregate, so that when you query us we
know
-
exactly what to give you.
-
Y.K.: Yup. So we sort of have this cycle
-
where we start off, obviously, in the first
five
-
seconds, the first minute, you really want
high granularity.
-
You want to see what's happening right now.
But,
-
if you want to go back and look at
-
data from three months ago, you probably care
about
-
it, like the day granularity or maybe the
hour
-
granularity. So, we basically do these roll-ups
and we
-
cycle through the process.
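-
A minimal sketch of that roll-up idea in Ruby (the bucket shape is invented for illustration): merge many minute buckets into one coarser bucket so old data stays cheap to store and fast to query.

    def roll_up(minute_buckets)
      {
        count: minute_buckets.map { |b| b[:count] }.inject(:+),
        histogram: minute_buckets.each_with_object(Hash.new(0)) do |b, merged|
          b[:histogram].each { |bucket, n| merged[bucket] += n }
        end
      }
    end

    roll_up([
      { count: 40, histogram: { 0 => 38, 500 => 2 } },
      { count: 25, histogram: { 0 => 20, 500 => 4, 2000 => 1 } }
    ])
    # => {:count=>65, :histogram=>{0=>58, 500=>6, 2000=>1}}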
-
T.D.: So this, it turns out, building the
system
-
required an intense amount of work. Carl spent
probably
-
six months reading PhD thesises to find-
-
Y.K.: Theses.
-
T.D.: Theses. To find, to find data structures
and
-
algorithms that we could use. Because this
is a
-
huge amount of data. Like, I think even a
-
few months after we were in private data,
private
-
beta, we were already handling over a billion
requests
-
per month. And obviously there's no way that
we-
-
Y.K.: Basically the number of requests that
we handle
-
is the sum of all of the requests that
-
you handle.
-
T.D.: Right.
-
Y.K.: And all of our customers handle.
-
T.D.: Right. Right. So.
-
Y.K.: So that's a lot of requests.
-
T.D.: So obviously we can't provide a service,
at
-
least one that's not, we can't provide an
affordable
-
service, an accessible service, if we have
to store
-
terabytes or exabytes of data just to tell
you
-
how your app is running.
-
Y.K.: And I think it's also problematic
-
if you store all the data in a database
-
and then every single time someone wants to
learn
-
something about that, you have to do a query.
-
Those queries can take a very long time. They
-
can take minutes. And I think we really wanted
-
to have something that would be very, that
would,
-
where the feedback loop would be fast. So
we
-
wanted to find algorithms that let us handle
the
-
data at, at real time, and then provide it
-
to you at real time instead of these, like,
-
dump the data somewhere and then do these
complicated
-
queries.
-
T.D.: So, hold on. So this slide was not
-
supposed to be here. It was supposed to be
-
a Rails slide. So, whoa. I went too far.
-
K. We'll watch that again. That's pretty cool.
So
-
then the last thing I want to say is,
-
perhaps your take away from looking at this
architecture
-
diagram is, oh my gosh, these Rails guys completely-
-
Y.K.: They really jumped the shark.
-
T.D.: They jumped the shark. They ditched
Rails. I
-
saw, like, three Tweets yesterday - I wasn't
here,
-
I was in Portland yesterday, but I saw, like,
-
three Tweets that were like, I'm at RailsConf
and
-
I haven't seen a single talk about, like,
Rails.
-
So that's true here, too. But, I want to
-
assure you that we are only using this stack
-
for the heavy computation. We started in Rails.
We
-
started, we were like, hey, what do we need.
-
Ah, well, people probably need to authenticate
and log
-
in, and we probably need to do billing. And
-
those are all things that Rails is really,
really
-
good at. So we started with Rails as, basically,
-
the starting point, and then when we realized
oh
-
my gosh, computation is really slow. There's
no way
-
we're gonna be able to offer this service.
OK.
-
Now let's think about how we can do all
-
of that.
-
Y.K.: And I think notably, a lot of people
-
who look at Rails are like, there's a lot
-
of companies that have built big stuff on
Rails,
-
and their attitude is, like, oh, this legacy
terrible
-
Rails app. I really wish we could get rid
-
of it. If we could just write everything in
-
Scala or Clojure or Go, everything would be
amazing.
-
That is definitely not our attitude. Our attitude
is
-
that Rails is really amazing, in particular,
at the
-
kinds of things that are really common across
everyone's
-
web applications - authentication, billing,
et cetera. And we
-
really want to be using Rails for the parts
-
of our app- even things like error-tracking,
we do
-
through the Rails app. We want to be using
-
Rails because it's very productive at doing
those things.
-
It happens to be very slow with doing data
-
crunching, so we're gonna use a different
tool for
-
that.
-
But I don't think you'll ever see me getting
-
up and saying, ah, I really wish we had
-
just started writing, you know, the Rails
app in
-
rust.
-
T.D.: Yeah.
-
Y.K.: That would be terrible.
-
T.D.: So that's number one, is, is, honest
response
-
times, which, it turns out, seems
like
-
it should be easy, but requires storing an insane
amount of
-
data.
-
So the second thing that we realized when
we
-
were looking at a lot of these tools, is
-
that most of them focus on data. They focus
-
on giving you the raw data. But I'm not
-
a machine. I'm not a computer. I don't enjoy
-
sifting through data. That's what computers
are good for.
-
I would rather be drinking a beer. It's really
-
nice in Portland, this time of year.
-
So, we wanted to think about, if you're trying
-
to solve the performance problems in your
application, what
-
are the things that you would suss out with
-
the existing tools after spending, like, four
hours depleting
-
your ego to get there?
-
Y.K.: And I think part of this is just
-
people are actually very, people like to think
that
-
they're gonna use these tools, but when the
tools
-
require you to dig through a lot of data,
-
people just don't use them very much. So,
the
-
goal here was to build a tool that people
-
actually use and actually like using, and
not to
-
build a tool that happens to provide a lot
-
of data you can sift through.
-
T.D.: Yes.
-
So, probably the, one of the first things
that
-
we realized is that we don't want to provide this.
-
This is a trace of a request, you've probably
-
seen similar UIs in other tools, using,
for example,
-
the inspector in, in like Chrome or Safari,
and
-
this is just showing basically, it's basically
a visual
-
stack trace of where your application is spending
its
-
time.
-
But I think what was important for us is
-
showing not just a single request, because
your app
-
handles, you know, hundreds of thousands of
requests, or
-
millions of requests. So looking at a single
request
-
statistically, is just noise.
-
Y.K.: And it's especially bad if it's the
worst
-
request, because the worst request is, is
really noise.
-
It's like, a hiccup in the network, right.
-
T.D.: It's the outlier. Yeah.
-
Y.K.: It's literally the outlier.
-
T.D.: It's literally the outlier. Yup. So,
what we
-
present in Skylight is something a little
bit different,
-
and it's something that we call the aggregate
trace.
-
So the aggregate trace is basically us taking
all
-
of your requests, averaging them out where
each of
-
these things spends their time, and then showing
you
-
that. So this is basically like, this is like,
-
this is like the statue of David. It is
-
the idealized form of the stack trace of how
-
your application's behaving.
-
But, of course, you have the same problem
as
-
before, which is, if this is all that we
-
were showing you, it would be obscuring a
lot
-
of information. You want to actually be able
to
-
tell the difference between, OK, what's my
stack trace
-
look like for fast requests, and how does
that
-
differ from requests that are slower.
-
So what we've got, I've got a little video
-
here. You can see that when I move the
-
slider, that this trace below it is actually
updating
-
in real time. As I move the slider around,
-
you can see that the aggregate trace actually
updates
-
with it. And that's because we're collecting
all this
-
information. We're collecting, like I said,
a lot of
-
data. We can recompute this aggregate trace
on the
-
fly.
-
Basically, for each bucket, we're storing
a different trace,
-
and then on the client we're reassembling
that. We'll
-
go into that a little bit.
-
Y.K.: And I think it's really important that
you
-
be able to do these experiments quickly. If
every
-
time you think, oh, I wonder what happens
if
-
I add another histogram bucket, if it requires
a
-
whole full page refresh. Then that would basically
make
-
people not want to use the tool. Not able
-
to use the tool. So, actually building something
which
-
is real time and fast, gets the data as
-
it comes, was really important to us.
-
T.D.: So that's number one.
-
And the second thing. So we built that, and
-
we're like, OK, well what's next? And I think
-
that the big problem with this is that you
-
need to know that there's a problem before
you
-
go look at it, right. So we have been
-
working for the past few months, and the Storm
-
infrastructure that we built makes it pretty
straight-forward to
-
start building more abstractions on top of
the data
-
that we've already collected.
-
It's a very declarative system. So we've been
working
-
on a feature called inspections. And what's
cool about
-
inspections is that we can look at this tremendous
-
volume of data that we've collected from your
app,
-
and we can automatically tease out what the
problems
-
are. So the first one that we shipped, this
-
is in beta right now. It's not, it's not
-
out and enabled by default, but there, it's
behind
-
a feature flag that we've had some users turning
-
on.
-
And, and trying out. And so what we can
-
do in this case, is because we have information
-
about all of the database queries in your
app,
-
we can look and see if you have n
-
plus one queries. Can you maybe explain what
an
-
n plus one query is?
-
Y.K.: Yeah. So, I'm, people know, hopefully,
what n
-
plus one queries are. But the, it's the idea that
-
you, by accident, for some reason, instead
of making
-
one query, you ask for like all the posts
-
and then you iterated through all of them
and
-
got all the comments and now you, instead
of
-
having one query, you have one query per post,
-
right. And what I'd like
to
-
do is eager loading, where you say include
-
comments, right. But you have to know that
you
-
have to do that.
-
So there's some tools that will run in development
-
mode, if you happen to catch it, like
-
Bullet. This is basically a tool that's looking
at
-
every single one of your classes and has some
-
thresholds that, once we see that a bunch
of
-
your requests have the same exact query, so
we
-
do some work to pull out binds. So if
-
it's, like, where something equals one, we
will automatically
-
pull out the one and replace it with a
-
question mark.
-
And then we basically take all those queries,
if
-
they're the exact same query repeated multiple
times, subject
-
to some thresholds, we'll start showing you
hey, there's
-
an n plus one query.
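-
A rough sketch of that normalization and thresholding in Ruby; the regexes and the threshold here are simplified stand-ins, not what Skylight actually ships:

    def normalize(sql)
      sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")  # strip literal binds
    end

    def n_plus_one?(queries, threshold = 10)
      counts = Hash.new(0)
      queries.each { |q| counts[normalize(q)] += 1 }
      counts.any? { |_shape, n| n >= threshold }
    end

    queries = (1..12).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
    n_plus_one?(queries)  # => true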
-
And you can imagine this same sort of thing
-
being done for things, like, are you missing
an
-
index, right. Or, are you using the Ruby version
-
of JSON when you should be using the native
-
version of JSON. These are all things that
we
-
can start detecting just because we're consuming
an enormous
-
amount of information, and we can start writing
some
-
heuristics for bubbling it up.
-
So, third and final breakthrough, we realized
that we
-
really, really needed a lightning fast UI.
Something really
-
responsive. So, in particular, the feedback
loop is critical,
-
right. You can imagine, if the way that you
-
dug into data was you clicked and you wait
-
an hour, and then you get your results, no
-
one would do it. No one would ever do
-
it.
-
And the existing tools are OK, but you click
-
and you wait. You look at it and you're
-
like, oh, I want a different view, so then
-
you go edit your query and then you click
-
and you wait and it's just not a pleasant
-
experience.
-
So, so we use Ember. The UI that
-
you're using when you log into Skylight, even
though
-
it feels just like a regular website, not
-
like a native app: all of
-
the routing, all of the rendering, all of
the
-
decision making is happening in an Ember.js
app,
-
and we pair that with D3. So all of
-
the charts, the charts that you saw there
in
-
the aggregate trace, that is all Ember components
powered
-
by D3.
-
So, this has actually significantly cleaned
up our client-side
-
code. It makes re-usability really, really
awesome. So to
-
give you an example, this is from our billing
-
page that I, the designer came and they had,
-
they had a component that was like, the date
-
component.
-
And, the-
-
T.D.: It seems really boring at first.
-
Y.K.: It seemed really boring. But, this is
the
-
implementation, right. So you could copy and
paste this
-
code over and over again, everywhere you go.
Just
-
remember to format it correctly. If you forget
to
-
format it, it's not gonna look the same everywhere.
-
But I was like, hey, we're using this all
-
over the place. Why don't we bundle this up
-
into a component? And so with Ember, it was
-
super easy. We basically just said, OK, here's
new
-
calendar date component. It has a property
on it
-
called date. Just set that to any JavaScript
Date
-
object. Just set, you don't have to remember
about
-
converting it or formatting it. Here's the
component. Set
-
the date and it will render the correct thing
-
automatically.
-
And, so the architecture of the Ember app
looks
-
a little bit, something like this, where you
have
-
many, many different components, most of them
just driven
-
by D3, and then they're plugged into the model
-
and the controller.
-
And the Ember app will go fetch those models
-
from the cloud, and the cloud is the Java
-
app, which just queries Cassandra, and render
them. And
-
what's neat about this model is turning on
web
-
sockets is super easy, right. Because all
of these
-
components are bound to a single place. So
when
-
the web socket says, hey, we have updated
information
-
for you to show, it just pushes it onto
-
the model or onto the controller, and the
whole
-
UI updates automatically.
-
It's like magic.
-
And-
-
Y.K.: Like magic.
-
T.D.: It's like magic. And, and when debugging,
this
-
is especially awesome too, because, and I'll
maybe show
-
a demo of the Ember inspector. It's nice.
-
So. Yeah. So, lightning fast UI. Reducing
the feedback
-
loop so that you can quickly play with your
-
data, makes it go from a chore to something
-
that actually feels kind of fun.
-
So, these were the breakthroughs that we had
when
-
we were building Skylight. The things that
made us
-
think, yes, this is actually a product that
we
-
think deserves to be on the market. So, one,
-
honest response times. Collect data that no
one else
-
can collect. Focus on answers instead of just
dumping
-
data, and have a lightning fast UI to do
-
it.
-
So we like to think of Skylight as basically
-
a smart profiler. It's a smart profiler that
runs
-
in production. It's like the profiler that
you run
-
on your local development machine, but instead
of being
-
on your local dev box which has nothing to
-
do with the performance characteristics of
what your users
-
are experiencing, we're actually running in
production.
-
So, let me just give you guys a quick
-
demo.
-
So, this is what the Skylight, this is what
-
Skylight looks like. What's under this? There
we go.
-
So, the first thing here is we've got the
-
app dashboard. So this, it's like our 95th
-
percentile response time
has peaked. Maybe you're
-
all hammering it right now. That would be
nice.
-
So, this is a graph of your response time
-
over time, and then on the right, this is
-
the graph of the RPMs, the requests per minute
-
that your app is handling. So this is app-wide.
-
And this is live. This updates every minute.
-
Then down below, you have a list of the
-
end points in your application. So you can
see,
-
actually, the top, the slowest ones for us
were,
-
we have an instrumentation API, and we've
gone and
-
instrumented our background workers. So we
can see them
-
here, and their response time plays in. So
we
-
can see that we have this reporting worker
that's
-
taking 95th percentile, thirteen seconds.
-
Y.K.: So all that time used to be inside
-
of some request somewhere, and we discovered
that there
-
was a lot of time being spent in things
-
that we could push to the background. We probably
-
need to update the agony index so that it
-
doesn't rank workers very high, because spending
some time
-
in your workers is not that big of a
-
deal.
-
T.D.: So, so then, if we dive into one
-
of these, you can see that for this request,
-
we've got the time explorer up above, and
that
-
shows a graph of response time at, again,
95th
-
percentile, and you can, if you want to go
-
back and look at historical data, you just
drag
-
it like this. And this has got a brush,
-
so you can zoom in and out on different
-
times.
-
And every time you change the range, you can
-
see that it's very responsive. It's never
waiting for
-
the server. But it is going back and fetching
-
data from the server and then when the data
-
comes back, you see the whole UI just updates.
-
And we get that for free with Ember.
-
And then down below, as we discussed, you
actually
-
have a real histogram. And this histogram,
in this
-
case, is showing. So this is for fifty-seven
requests.
-
And if we click and drag, we could just
-
move this. And you can see that the aggregate
-
trace below updates in response to us dragging
this.
-
And if we want to look at the fastest
-
quartile, we just click faster and we'll just
choose
-
that range on the histogram.
-
Y.K.: I think it's the fastest load.
-
T.D.: The fastest load. And then if you click
-
on slower, you can see the slower requests.
So
-
this makes it really easy to compare and contrast.
-
OK. Why are certain requests faster and why
are
-
certain requests slow?
-
You can see the blue, these blue areas. This
-
is Ruby code. So, right now it's not super
-
granular. It would be nice if you could actually
-
know what was going on here. But, it'll at
-
least tell you where in your controller action
this
-
is happening, and then you can actually see
which
-
database queries are being executed, and what
their duration
-
is.
-
And you can see that we actually extract the
-
SQL and we denormalize it so we, so you,
-
or, we normalize it so you can see exactly
-
what those requests are even if the values
are
-
totally different between them.
-
Y.K.: Yeah. So the real query, courtesy of
Rails,
-
not yet supporting bind extraction is like,
where id
-
equals one or, ten or whatever.
-
T.D.: Yup. So that's pretty cool.
-
Y.K.: So one, one other thing is, initially,
we
-
actually just showed the whole trace, but
we discovered
-
that, obviously when you show whole traces
you have
-
information that doesn't really matter that
much. So we
-
started off by, we've recently basically started
to collapse
-
things that don't matter so much so that you
-
can basically expand or condense the trace.
-
And we wanted to make it so you don't
-
have to think about expanding or condensing
individual areas,
-
but just see what matters the most and
-
then expand to see the trivial areas.
-
T.D.: Yup. So, so that's the demo of Skylight.
-
We'd really like it if you checked it out.
-
There is one more thing I want to show
-
you that is, like, really freaking cool. This
is
-
coming out of Tilde labs. Carl was like, has
-
been hacking, he's been up until past midnight,
getting
-
almost no sleep for the past month trying
to
-
have this ready.
-
I don't know how many of you know this,
-
but Ruby 2.1 has a new,
-
a stack sampling feature. So you can get really
-
granular information about how your Ruby code
is performing.
-
So I want to show you, I just mentioned
-
how it would be nice if we could get
-
more information out of what your Ruby code
is
-
doing. And now we can do that.
-
Basically, every few milliseconds, this code
that Carl wrote
-
is going into the, to the Ruby, into MRI,
-
and it's taking a snapshot of the stack.
-
And because this is built-in, it's very low-impact.
It's
-
not allocating any new memory. It's very little
performance
-
hit. Basically you wouldn't even notice it.
And so
-
every few milliseconds it's sampling, and
we take that
-
information and we send it up to our servers.
-
So it's almost like you're running Ruby profiler
on
-
your local dev box, where you get extremely
granular
-
information about where your code is spending
its time
-
in Ruby, per method, per all of these things.
-
But it's happening in production.
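-
To illustrate the sampling idea only: the hook being described is a C-level profiling API new in Ruby 2.1 (the same one stackprof builds on), not the pure-Ruby sketch below, but the principle is the same: wake up every few milliseconds, grab the current stack, and tally which frames show up.

    samples = Hash.new(0)

    sampler = Thread.new do
      loop do
        (Thread.main.backtrace || []).first(20).each { |frame| samples[frame] += 1 }
        sleep 0.005   # every few milliseconds
      end
    end

    do_some_work   # hypothetical: the code you actually care about
    sampler.kill

    samples.sort_by { |_frame, n| -n }.first(5)  # the hottest frames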
-
So, this is, so this is a, we enabled
-
it in staging. You can see that we've got
-
some rendering bugs. It's still in beta.
-
Y.K.: Yeah, and we haven't yet collapsed things
that
-
are not important-
-
T.D.: Yes.
-
Y.K.: -for this particular feature.
-
T.D.: So we want to show, we want to
-
hide things like, like framework code, obviously.
But this
-
gives you an incredibly, incredibly granular
view of what
-
your app is doing in production. And we think.
-
This is a, an API that's built into, into
-
Ruby 2.1.1. Because our agent is running so
low-level,
-
because we wrote it in Rust, we have the
-
ability to do things like this, and Carl thinks
-
that we may be able to actually backport
-
this to older Rubies, too. So if you're not
-
on Ruby 2.1, we think that we can actually
-
bring this. But that's TBD.
-
Y.K.: Yeah, I- so I think the cool thing
-
about this, in general, is when you run a
-
sampling- so this is a sampling profiler,
right, we
-
don't want to be burdening every single thing
that
-
you do in your program with tracing, right.
That
-
would be very slow.
-
So when you normally run a sampling profiler,
you
-
have to basically make a loop. You have to
-
basically create a loop, run this code a million
-
times and keep sampling it. Eventually we'll
get enough
-
samples to get the information. But it turns
out
-
that your production server is a loop. Your
production
-
server is serving tons and tons of requests.
So,
-
by simply tak- you know, taking a few microseconds
-
out of every request and collecting a couple
of
-
samples, over time we can actually get this
really
-
high fidelity picture with basically no cost.
-
And that's pretty mind-blowing. And this is
the kind
-
of stuff that we can start doing by really
-
caring about, about both the user experience
and the
-
implementation and getting really scary about
it. And I'm
-
really, like, honestly this is a really exciting
feature
-
that really shows what we can do as we
-
start building this out.
-
T.D.: Once we've got that, once we've got
that
-
groundwork.
-
So if you guys want to check it out,
-
Skylight dot io, it's available today. It's
no longer
-
in private beta. Everyone can sign up. No
invitation
-
token necessary. And you can get a thirty-day
free
-
trial if you haven't started one already.
So if
-
you have any questions, please come see us
right
-
now, or we have a booth in the vendor
-
hall. Thank you guys very much.