TOM DALE: Hey, you guys ready? Thank you guys so much for coming. This is awesome. I was really, I, when they were putting together the schedule, I said, make sure that you put us down in the Caves of Moria. So thank you guys for coming down and making it. I'm Tom. This is Yehuda.

YEHUDA KATZ: When people told me I was signed up to do a back-to-back talk, I don't know what I was thinking.

T.D.: Yup. So. We want to talk to you today about, about Skylight. So, just a little bit before we talk about that, I want to talk about us a little bit.

So, in 2011 we started a company called Tilde. It's this shirt. It may have made me self-conscious, because this is actually a first edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center. One of the two.

So we started Tilde in 2011, and we had all just left a venture-backed company, and that was a pretty traumatic experience for us, because we spent a lot of time building the company and then we ran out of money and sold to Facebook, and we really didn't want to repeat that experience. So, we decided to start Tilde, and when we did it, we decided to be.
DHH and the other people at Basecamp were talking about, you know, being bootstrapped and proud. And that was a message that really resonated with us, and so we wanted to capture the same thing.

There's only one problem with being bootstrapped and proud, and that is, in order to be both of those things, you actually need money, it turns out. It's not like you just say it in a blog post and then all of a sudden you are in business.

So, we had to think a lot about, OK, well, how do we make money? How do we make money? How do we make a profitable and, most importantly, sustainable business? Because we didn't want to just flip to Facebook in a couple years.

So, looking around, I think the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few million dollars, you know, build a business model that's mostly prayer. But that's not really how we want to think about building open source communities.

We don't really think that that necessarily leads to the best open source communities. And if you're interested more in that, I recommend Leah Silber, who is one of our co-founders. She's giving a talk this afternoon. Oh, sorry.
Friday afternoon, about how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would really suggest you go check out her talk.

So, no. So, no Ember, Inc. Not allowed.

So, we really wanted to build something that leveraged the strengths that we thought that we had. One, I think most importantly, a really deep knowledge of open source and a deep knowledge of the Rails stack. And also Carl, it turns out, is really, really good at building highly scalable big data sys- big data systems. Lots of Hadoop in there.

So, last year at RailsConf, we announced the private beta of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? OK. Many of you. Awesome.

So, so Skylight is a tool for profiling and measuring the performance of your Rails applications in production. And, as a product, Skylight, I think, was built on three really, three key breakthroughs. There were three key breakthroughs. We didn't want to ship a product that was incrementally better than the competition. We wanted to ship a product that was dramatically better. A quantum leap. An order of magnitude better.
And, in order to do that, we spent a lot of time thinking about it, about how we could solve most of the problems that we saw in the existing landscape. And so those, those breakthroughs are predicated- sorry, delivering a product that does that is predicated on these three breakthroughs.

So, the first one I want to talk about is honest response times. Honest response times. So, DHH wrote a blog post on what was then the 37signals blog, now the Basecamp blog, called The problem with averages. How many of you have read this? Awesome.

For those of you that have not, how many of you hate raising your hands at presentations? So, for those of you that-

Y.K.: Just put a button in every seat. Press this button-

T.D.: Press the button if you have. Yes. Great.

So, if you read this blog post, the way it opens is: Our average response time for Basecamp right now is 87ms... That sounds fantastic. And it easily leads you to believe that all is well and that we wouldn't need to spend any more time optimizing performance.

But that's actually wrong. The average number is completely skewed by tons of fast responses to feed requests and other cached replies.
If you have 1000 requests that return in 5ms, and then you have 200 requests taking 2000ms, or two seconds, you can still report an av- a respectable 170ms average. That's useless.

So what does DHH say that we need? DHH says the solution is histograms. So, for those of you like me who were sleeping through your statistics class in high school, and college, a brief primer on histograms. So a histogram is very simple. Basically, you have a series of numbers along some axis, and every time you're in that number, you're in that bucket, you basically increment that bar by one.

So, this is an example of a histogram of response times in a Rails application. So you can see that there's a big cluster in the middle around 488ms, 500ms. This isn't a super speedy app, but it's not the worst thing in the world. And they're all clustered, and then as you kind of move to the right you can see that the response times get longer and longer and longer, and as you move to the left, response times get shorter and shorter and shorter.

So, why do you want a histogram? What's the, what's the most important thing about a histogram?
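The skew DHH describes is easy to reproduce. A minimal Ruby sketch using the two cluster sizes from the quote (note that the exact mean of these illustrative numbers comes out higher than the 170ms quoted, but the conclusion is the same: the mean describes neither cluster):

```ruby
# Two clusters of response times: many fast cached responses and a
# smaller group of slow ones. The mean lands between the clusters and
# describes neither of them.
fast_ms = Array.new(1000, 5)     # 1000 requests at 5ms
slow_ms = Array.new(200, 2000)   # 200 requests at 2000ms
all_ms  = fast_ms + slow_ms

mean = all_ms.sum.to_f / all_ms.size
mean  # => 337.5 -- a "respectable" number, yet no request took ~340ms
```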
Y.K.: Well, I think it's because most requests don't actually look like this.

T.D.: Yes.

Y.K.: Most endpoints don't actually look like this.

T.D.: Right. If you think about what your Rails app is doing, it's a complicated beast, right. Turns out, Ruby, frankly, you can, you can do branching logic. You can do a lot of things.

And so what that means is that one endpoint, if you represent that with a single number, you are losing a lot of fidelity, to the point where it becomes, as DHH said, useless. So, for example, in a histogram, you can easily see, oh, here's a group of requests and response times where I'm hitting the cache, and here's another group where I'm missing it. And you can see that that cluster is significantly slower than the faster cache-hitting cluster.

And the other thing that you get when you have a, a distribution, when you keep the whole distribution in the histogram, is you can look at this number at the 95th percentile, right. So the right way to think about the performance of your web application is not the average, because the average doesn't really tell you anything.
You want to think about the 95th percentile, because that's not the average response time, that's the average worst response time that a user is likely to hit.

And the thing to keep in mind is that it's not as though a customer comes to your site, they issue one request, and then they're done, right. As someone is using your website, they're gonna be generating a lot of requests. And you need to look at the 95th percentile, because otherwise every request is basically you rolling the dice that they're not gonna hit one of those two second, three second, four second responses, close the tab and go to your competitor.

So we look at this as, here's the crazy thing. Here's what I think is crazy. That blog post that DHH wrote, it's from 2009. It's been five years, and there's still no tool that does what DHH was asking for. So, we, frankly, we smelled money. We were like, holy crap.

Y.K.: Yeah, why isn't that slide green?

T.D.: Yeah. It should be green and dollars. I think Keynote has the dollars, the make-it-rain effect I should have used. So we smelled blood in the water. We're like, this is awesome.
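The histogram-and-percentile idea above can be sketched in a few lines of Ruby (hypothetical response times; the bucket width is chosen arbitrarily):

```ruby
# Build a histogram by bucketing response times, then read off the
# 95th percentile: the response time that 95% of requests come in under.
times_ms = [12, 15, 18, 20, 22, 25, 30, 45, 480, 2100] * 10

bucket_width = 100
histogram = times_ms.group_by { |t| (t / bucket_width) * bucket_width }
                    .transform_values(&:size)
                    .sort.to_h
# e.g. { 0 => 80, 400 => 10, 2100 => 10 } -- the clusters stay visible

sorted = times_ms.sort
p95 = sorted[(0.95 * sorted.size).ceil - 1]
p95  # => 2100, the "average worst" time the mean would have hidden
```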
There's only one problem that we discovered, and that is, it turns out that building this thing is actually really, really freaky hard. Really, really hard.

So, we announced the private beta at RailsConf last year. Before doing that, we spent a year of research spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized, we made a lot of problems. We made a lot of errors when we were building this system.

So then, after RailsConf last year, we basically took six months to completely rewrite the backend from the ground up. And I think, tying into your keynote, Yehuda, we, we were like, oh, we clearly have a bespoke problem. No one else is doing this. So we rewrote our own custom backend. And then we had all these problems, and we realized that they had actually already all been solved by the open source community. And so we benefited tremendously by having a shared solution.

Y.K.: Yeah. So our first release of this was really very bespoke, and the current release uses a tremendous amount of very off-the-shelf open source projects that just solved the particular problem very effectively, very well.
None of which are as easy to use as Rails, but all of which solve really thorny problems very effectively.

T.D.: So, so let's just talk, just for your own understanding, let's talk about how most performance monitoring tools work. So the way that most of these work is that you run your Rails app, and running inside of your Rails app is some gem, some agent that you install. And every time the Rails app handles a request, it generates events, and those events, which include information about performance data, those events are passed into the agent.

And then the agent sends that data to some kind of centralized server. Now, it turns out that doing a running average is actually really simple. Which is why everyone does it. Basically you can do it in a single SQL query, right. All you do is you have three columns in the database: the endpoint, the running average, and the number of requests. And so those are the two things that you need to keep a running average, right.

So keeping a running average is actually really simple from a technical point of view.

Y.K.: I don't think you could even do this in JavaScript, due to the lack of integers.

T.D.: Yes.
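The single-query running average Tom describes needs only the current average and the request count per endpoint. A minimal sketch of the update rule (the endpoint name is hypothetical; in SQL this would be one UPDATE statement):

```ruby
# Running average: store (average, count) per endpoint and fold each
# new duration in. No history is kept, which is exactly why the
# distribution is lost.
Stat = Struct.new(:endpoint, :average, :count) do
  def record(duration_ms)
    self.average = (average * count + duration_ms) / (count + 1.0)
    self.count  += 1
  end
end

stat = Stat.new("PostsController#index", 0.0, 0)
[100, 200, 600].each { |ms| stat.record(ms) }
stat.average  # => 300.0, hiding that one request was 6x another
```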
You probably wouldn't want to do any math in JavaScript, it turns out. So, so we took a little bit different approach. Yehuda, do you want to go over the next section?

Y.K.: Yeah. Sure. So, when we first started, right at the beginning, we basically did a similar thing where we had a bunch - your app creates events. Most of those start off as being ActiveSupport::Notifications, although it turns out that there's very limited use of ActiveSupport::Notifications, so we had to do some normalization work to get them sane, which we're gonna be upstreaming back into, into Rails.

But, one thing that's kind of unfortunate about having every single Rails app have an agent is that you end up having to do a lot of the same kind of work over and over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests. So now you have a queue of things that you're sending over HTTP in every single one of your Rails processes. And, of course, you probably don't notice this. People are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it suddenly starts taking twenty, thirty, forty, fifty more megabytes.
But we really wanted to keep the actual memory per process down to a small amount. So one of the very first things that we did, we even did it before last year, is that we pulled out all that shared logic into a, a separate process called the coordinator. And the agent is basically responsible simply for collecting the, the trace, and it's not responsible for actually talking to our server at all. And that means that the coordinator only has to do this queue, this keeping a, a bunch of work in one place, and it doesn't end up using up as much memory.

And I think this, this ended up being very effective for us.

T.D.: And I think that low overhead also allows us to just collect more information, in general.

Y.K.: Yeah.

Now, after our first attempt, we started getting a bunch of customers that were telling us that even the separate - so the separate coordinator started as a good thing and a bad thing. On the one hand, there's only one of them, so it uses up only one set of memory. On the other hand, it's really easy for someone to go in and ps that process and see how many megabytes of memory it's using.
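The agent/coordinator split above can be sketched with threads standing in for Rails worker processes (the real coordinator is a separate OS process; this only shows the hand-off shape):

```ruby
# Each "agent" only hands traces to a shared queue; one coordinator
# drains the queue and batches traces for upload, so the buffering
# memory and the HTTP machinery exist once, not once per worker.
queue = Queue.new

workers = 3.times.map do |i|
  Thread.new do
    5.times { |n| queue << { worker: i, trace: "request-#{n}" } }
  end
end
workers.each(&:join)

# The coordinator's side: drain everything into one outbound batch.
batch = []
batch << queue.pop until queue.empty?
batch.size  # => 15 traces buffered in one place instead of three
```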
So, we got a lot of additional complaints that said, oh, your process is using a lot of memory. And, I spent a few weeks, I, I know Ruby pretty well. I spent a couple of weeks. I actually wrote a gem called Allocation Counter that basically went in to try to pinpoint exactly where the allocations were hap- coming from. But it turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a regular expression in Ruby can allocate match objects that get put back on the stack.

And so I was able to pare this down to some degree. But I really discovered quickly that trying to keep a lid on the memory allocation by doing all the stuff in Ruby is mostly fine. But for our specific use case, where we really wanna, we wanna be telling you, you can run the agent on your process, on your box, and it's not gonna use a lot of memory, we really needed something more efficient. And our first thought was, we'll use C++ or C. No problem. C is, is native. It's great. And Carl did the work. Carl is very smart. And then he said, Yehuda. It is now your turn. You need to start maintaining this.
And I said, I don't trust myself to write C++ code that's running in all of your guys's boxes and not seg-fault. So I don't think that, that doesn't work for me.

And so I, I noticed that Rust was coming along, and what Rust really gives you is it gives you the ability to write low-level code a la C or C++, with manual memory management that keeps your memory allocation low and keeps things speedy. Low resource utilization. While also giving you compile-time guarantees about not seg-faulting. So, again, if your processes randomly started seg-faulting because you installed the agent, I think you would stop being our customer very quickly. So having, what, pretty much 100% guarantees about that was very important to us. And so that's why we decided to use Rust.

I'll just keep going.

T.D.: Keep going.

Y.K.: So, we had this coordinator object. And basically the coordinator object is receiving events. So the events basically end up being these traces that describe what's happening in your application. And the next thing, I think in our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have really big requests.
So you may have a big request with a big SQL query in it, or a lot of big SQL queries in it. Some people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs.

So, one of the first things that we did early on was we switched to using protobuf as the transport mechanism, and that really shrunk, shrunk down the payloads a lot. Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did, like, a weekend hack to just port it over to Java and got, like, 200x performance. And you don't always get 200x performance; if mostly what you're doing is database queries, you're not gonna get a huge performance swing.

But mostly what we're doing is math. And algorithms and data structures. And for that, Ruby is, it could, in theory, one day, have a good JIT or something, but today, writing that code in Java didn't end up being significantly more code, cause it's just, you know, algorithms and data structures.
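The payload shrink from moving off JSON comes largely from how numbers and repeated keys are encoded. This sketch is not protobuf itself (protobuf uses tagged varints); it just contrasts JSON with a naive fixed-width binary packing to show the order of magnitude, with made-up durations:

```ruby
require "json"

# Trace payloads are mostly numbers. JSON repeats the key name for
# every value and spells each number out in ASCII; a binary encoding
# stores a few bytes per value and no key names at all.
durations_us = Array.new(100) { |i| 4_800 + i }  # 100 span durations

as_json   = JSON.generate(durations_us.map { |d| { "duration_us" => d } })
as_binary = durations_us.pack("V*")  # 4 little-endian bytes per value

as_json.bytesize    # => 2101
as_binary.bytesize  # => 400
```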
T.D.: And I'll just note, something about standardizing on protobufs in our, in our stack is actually a huge win, because we, we realized, hey, browsers, as it turns out, are pretty powerful these days. They've got, you know, they can allocate memory, they can do all these types of computation. So, and protobuf libraries exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical serialization form, and then you can move payloads around the entire stack and everything speaks the same language, so you've saved the serialization and deserialization.

Y.K.: And JavaScript is actually surprisingly effective at des- at taking protobufs and converting them to the format that we need efficiently. So, so we basically take this data. The Java collector is basically collecting all these protobufs, and pretty much it just turns around, and this is sort of where we got into bespoke territory before, when we started rolling our own. But we realized that when you write a big, distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about.

So, what we do is we basically take these, take these payloads that are coming in.
We convert them into batches and we send the batches down into the Kafka queue. And the, the next thing that happens, so the Kafka, sorry, Kafka's basically just a queue that allows you to throw things into, I guess it might be considered similar to, like, something like AMQP. It has some nice fault-tolerance properties and integrates well with Storm. But most important, it's just super, super high-throughput.

So we basically didn't want to put any barrier between you giving us the data and us getting it to disk as soon as possible.

T.D.: Yeah. Which we'll, I think, talk about in a bit.

Y.K.: So we, so then basically Kafka takes the data and starts sending it into Storm. And if you think about what has to happen in order to get some request. So, you have these requests. There's, you know, maybe traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, OK, I can see that in all of your requests, you had this SQL query and it took around this amount of time and it happened as a child of this other node. And the way to think about that is basically just a processing pipeline. Right. So you have these traces that come in one side.
You start passing them through a bunch of processing steps, and then you end up on the other side with the data.

And Storm is actually a way of describing that processing pipeline in sort of functional style, and then you tell it, OK, here's how many servers I need. Here's how, here's how I'm gonna handle failures. And it basically deals with distribution and scaling and all that stuff for you. And part of that is because you wrote everything using functional style.

And so what happens is Kafka sends the data into the entry spout, which is sort of terminology in, terminology in Storm for these streams that get created. And they basically go into these processing things, which very clever- cutely are called bolts. This is definitely not the naming I would have used, but. So they're called bolts. And the idea is that basically every request may have several things.

So, for example, we now automatically detect n + 1 queries, and that's sort of a different kind of processing from just, make a picture of the entire request. Or, what is the 95th percentile across your entire app, right. These are all different kinds of processing.
So we take the data and we 00:17:30.940 --> 00:17:33.580 send them into a bunch of bolts, and the 00:17:33.580 --> 00:17:35.750 cool thing about bolts is that, again, because they're 00:17:35.750 --> 00:17:38.890 just functional chaining, you can take the output from 00:17:38.890 --> 00:17:41.130 one bolt and feed it into another bolt. And 00:17:41.130 --> 00:17:43.510 that works, that works pretty well. And, and you 00:17:43.510 --> 00:17:44.730 don't have to worry about - I mean, you 00:17:44.730 --> 00:17:46.600 have to worry a little bit about things like 00:17:46.600 --> 00:17:49.960 fault tolerance, failure, idempotence. But you worry about 00:17:49.960 --> 00:17:52.230 them at, at the abstraction level, and then the 00:17:52.230 --> 00:17:54.159 operational part is handled for you. 00:17:54.159 --> 00:17:55.740 T.D.: So it's just like a very declarative way 00:17:55.740 --> 00:17:58.179 of describing how this computation works in, in a 00:17:58.179 --> 00:17:59.230 way that's easy to scale. 00:17:59.230 --> 00:18:01.860 Y.K.: And Carl actually talked about this at very 00:18:01.860 --> 00:18:04.909 high speed yesterday, and some of you may 00:18:04.909 --> 00:18:06.620 have been there. I would recommend watching the video 00:18:06.620 --> 00:18:09.020 when it comes out if you want to make 00:18:09.020 --> 00:18:11.159 use of this stuff in your own applications. 00:18:11.159 --> 00:18:13.250 And then when you're finally done with all the 00:18:13.250 --> 00:18:14.970 processing, you need to actually do something with it. 00:18:14.970 --> 00:18:16.289 You need to put it somewhere so that the 00:18:16.289 --> 00:18:18.360 web app can get access to it, and that 00:18:18.360 --> 00:18:21.350 is basically, we use Cassandra for this. And Cassandra, 00:18:21.350 --> 00:18:24.929 again, is mostly, it's a dumb database, but it 00:18:24.929 --> 00:18:27.780 has, it has high capacity.
It has some of 00:18:27.780 --> 00:18:29.080 the fault-tolerance capabilities that we want. 00:18:29.080 --> 00:18:30.770 T.D.: We're very, we're just very, very write-heavy, right. 00:18:30.770 --> 00:18:32.820 Like, we tend to be writing more than we're 00:18:32.820 --> 00:18:33.730 ever reading. 00:18:33.730 --> 00:18:36.270 Y.K.: Yup. And then when we're done, when we're 00:18:36.270 --> 00:18:38.780 done with a particular batch, Cassandra basically kicks off 00:18:38.780 --> 00:18:40.720 the process over again. So we're basically doing these 00:18:40.720 --> 00:18:41.200 things as batches. 00:18:41.200 --> 00:18:42.919 T.D.: So these are, these are roll-ups, is what's 00:18:42.919 --> 00:18:45.360 happening here. So basically every minute, every ten minutes, 00:18:45.360 --> 00:18:48.140 and then at every hour, we reprocess and we 00:18:48.140 --> 00:18:49.970 re-aggregate, so that when you query us we know 00:18:49.970 --> 00:18:51.010 exactly what to give you. 00:18:51.010 --> 00:18:52.830 Y.K.: Yup. So we sort of have this cycle 00:18:52.830 --> 00:18:55.049 where we start off, obviously, in the first five 00:18:55.049 --> 00:18:56.890 seconds, the first minute, you really want high granularity. 00:18:56.890 --> 00:18:59.140 You want to see what's happening right now. But, 00:18:59.140 --> 00:19:00.110 if you want to go back and look at 00:19:00.110 --> 00:19:02.460 data from three months ago, you probably care about 00:19:02.460 --> 00:19:04.830 it at, like, the day granularity or maybe the hour 00:19:04.830 --> 00:19:07.490 granularity. So, we basically do these roll-ups and we 00:19:07.490 --> 00:19:09.200 cycle through the process. 00:19:09.200 --> 00:19:11.880 T.D.: So this, it turns out, building the system 00:19:11.880 --> 00:19:15.169 required an intense amount of work. Carl spent probably 00:19:15.169 --> 00:19:17.490 six months reading PhD thesises to find- 00:19:17.490 --> 00:19:18.140 Y.K.: Theses. 00:19:18.140 --> 00:19:23.850 T.D.: Theses.
To find, to find data structures and 00:19:23.850 --> 00:19:25.870 algorithms that we could use. Because this is a 00:19:25.870 --> 00:19:28.340 huge amount of data. Like, I think even a 00:19:28.340 --> 00:19:31.049 few months after we were in private data, private 00:19:31.049 --> 00:19:34.179 beta, we were already handling over a billion requests 00:19:34.179 --> 00:19:36.200 per month. And obviously there's no way that we- 00:19:36.200 --> 00:19:37.970 Y.K.: Basically the number of requests that we handle 00:19:37.970 --> 00:19:39.770 is the sum of all of the requests that 00:19:39.770 --> 00:19:40.010 you handle. 00:19:40.010 --> 00:19:40.110 T.D.: Right. 00:19:40.110 --> 00:19:41.200 Y.K.: And all of our customers handle. 00:19:41.200 --> 00:19:41.909 T.D.: Right. Right. So. 00:19:41.909 --> 00:19:43.120 Y.K.: So that's a lot of requests. 00:19:43.120 --> 00:19:45.760 T.D.: So obviously we can't provide a service, at 00:19:45.760 --> 00:19:48.480 least one that's not, we can't provide an affordable 00:19:48.480 --> 00:19:51.130 service, an accessible service, if we have to store 00:19:51.130 --> 00:19:53.190 terabytes or exabytes of data just to tell you 00:19:53.190 --> 00:19:53.990 how your app is running. 00:19:53.990 --> 00:19:56.630 Y.K.: And I think, also a problem, it's problematic 00:19:56.630 --> 00:19:58.429 if you store all the data in a database 00:19:58.429 --> 00:19:59.760 and then every single time someone wants to learn 00:19:59.760 --> 00:20:01.590 something about that, you have to do a query. 00:20:01.590 --> 00:20:03.159 Those queries can take a very long time. They 00:20:03.159 --> 00:20:04.700 can take minutes. And I think we really wanted 00:20:04.700 --> 00:20:07.159 to have something that would be very, that would, 00:20:07.159 --> 00:20:09.580 where the feedback loop would be fast. 
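The minute/ten-minute/hour roll-ups described a moment ago can be sketched as a simple re-aggregation over time buckets: keep only counts and sums per bucket, and fold fine buckets into coarser ones as data ages. The bucket layout below is invented for illustration; it is not Skylight's actual Cassandra schema.

```ruby
# Sketch of time-series roll-ups: per-minute buckets are folded into
# per-hour buckets so old data stays cheap to store and fast to
# query, with no raw requests retained. Layout invented.

# Per-minute buckets: epoch-minute => { count:, total_ms: }
minute_buckets = {
  0  => { count: 10, total_ms: 1200.0 },
  1  => { count: 30, total_ms: 2400.0 },
  61 => { count: 20, total_ms: 1000.0 },
}

# Fold fine buckets into coarser ones (60 minutes per hour).
def roll_up(buckets, factor)
  out = Hash.new { |h, k| h[k] = { count: 0, total_ms: 0.0 } }
  buckets.each do |t, b|
    coarse = t / factor
    out[coarse][:count]    += b[:count]
    out[coarse][:total_ms] += b[:total_ms]
  end
  out
end

hour_buckets = roll_up(minute_buckets, 60)
# Mean response time for hour 0 is recoverable from the aggregates.
hour0_mean = hour_buckets[0][:total_ms] / hour_buckets[0][:count]
```

Because only sums and counts are stored, answering "what was the mean last month, by hour" never requires touching raw request data.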
So we 00:20:09.580 --> 00:20:11.860 wanted to find algorithms that let us handle the 00:20:11.860 --> 00:20:13.940 data at, at real time, and then provide it 00:20:13.940 --> 00:20:16.020 to you at real time instead of these, like, 00:20:16.020 --> 00:20:18.309 dump the data somewhere and then do these complicated 00:20:18.309 --> 00:20:18.549 queries. 00:20:18.549 --> 00:20:20.820 T.D.: So, hold on. So this slide was not 00:20:20.820 --> 00:20:24.440 supposed to be here. It was supposed to be 00:20:24.440 --> 00:20:27.669 a Rails slide. So, whoa. I went too far. 00:20:27.669 --> 00:20:29.970 K. We'll watch that again. That's pretty cool. So 00:20:29.970 --> 00:20:32.460 then the last thing I want to say is, 00:20:32.460 --> 00:20:34.330 perhaps your take away from looking at this architecture 00:20:34.330 --> 00:20:36.750 diagram is, oh my gosh, these Rails guys completely- 00:20:36.750 --> 00:20:38.309 Y.K.: They really jumped the shark. 00:20:38.309 --> 00:20:40.870 T.D.: They jumped the shark. They ditched Rails. I 00:20:40.870 --> 00:20:42.330 saw, like, three Tweets yesterday - I wasn't here, 00:20:42.330 --> 00:20:43.450 I was in Portland yesterday, but I saw, like, 00:20:43.450 --> 00:20:44.340 three Tweets that were like, I'm at RailsConf and 00:20:44.340 --> 00:20:48.809 I haven't seen a single talk about, like, Rails. 00:20:48.809 --> 00:20:51.950 So that's true here, too. But, I want to 00:20:51.950 --> 00:20:54.940 assure you that we are only using this stack 00:20:54.940 --> 00:20:58.070 for the heavy computation. We started in Rails. We 00:20:58.070 --> 00:21:00.730 started, we were like, hey, what do we need. 00:21:00.730 --> 00:21:01.929 Ah, well, people probably need to authenticate and log 00:21:01.929 --> 00:21:05.090 in, and we probably need to do billing. And 00:21:05.090 --> 00:21:06.220 those are all things that Rails is really, really 00:21:06.220 --> 00:21:08.830 good at. 
So we started with Rails as, basically, 00:21:08.830 --> 00:21:11.110 the starting point, and then when we realized oh 00:21:11.110 --> 00:21:14.039 my gosh, computation is really slow. There's no way 00:21:14.039 --> 00:21:15.240 we're gonna be able to offer this service. OK. 00:21:15.240 --> 00:21:16.059 Now let's think about how we can do all 00:21:16.059 --> 00:21:16.270 of that. 00:21:16.270 --> 00:21:18.510 Y.K.: And I think notably, a lot of people 00:21:18.510 --> 00:21:20.049 who look at Rails are like, there's a lot 00:21:20.049 --> 00:21:21.750 of companies that have built big stuff on Rails, 00:21:21.750 --> 00:21:24.090 and their attitude is, like, oh, this legacy terrible 00:21:24.090 --> 00:21:25.409 Rails app. I really wish we could get rid 00:21:25.409 --> 00:21:26.850 of it. If we could just write everything in 00:21:26.850 --> 00:21:30.760 Scala or Clojure or Go, everything would be amazing. 00:21:30.760 --> 00:21:31.500 That is definitely not our attitude. Our attitude is 00:21:31.500 --> 00:21:34.320 that Rails is really amazing, at particular, at the 00:21:34.320 --> 00:21:36.740 kinds of things that are really common across everyone's 00:21:36.740 --> 00:21:39.900 web applications - authentication, billing, et cetera. And we 00:21:39.900 --> 00:21:41.429 really want to be using Rails for the parts 00:21:41.429 --> 00:21:43.360 of our app- even things like error-tracking, we do 00:21:43.360 --> 00:21:45.039 through the Rails app. We want to be using 00:21:45.039 --> 00:21:47.470 Rails because it's very productive at doing those things. 00:21:47.470 --> 00:21:48.789 It happens to be very slow with doing data 00:21:48.789 --> 00:21:50.330 crunching, so we're gonna use a different tool for 00:21:50.330 --> 00:21:50.539 that. 
00:21:50.539 --> 00:21:51.909 But I don't think you'll ever see me getting 00:21:51.909 --> 00:21:54.210 up and saying, ah, I really wish we had 00:21:54.210 --> 00:21:55.080 just started writing, you know, the Rails app in 00:21:55.080 --> 00:21:55.159 rust. 00:21:55.159 --> 00:21:55.309 T.D.: Yeah. 00:21:55.309 --> 00:21:58.090 Y.K.: That would be terrible. 00:21:58.090 --> 00:22:02.429 T.D.: So that's number one, is, is, honest response 00:22:02.429 --> 00:22:04.390 times, which we're, which it turns out, seems like 00:22:04.390 --> 00:22:08.289 it should be easy, requires storing insane amount of 00:22:08.289 --> 00:22:09.169 data. 00:22:09.169 --> 00:22:10.620 So the second thing that we realized when we 00:22:10.620 --> 00:22:12.059 were looking at a lot of these tools, is 00:22:12.059 --> 00:22:14.360 that most of them focus on data. They focus 00:22:14.360 --> 00:22:16.590 on giving you the raw data. But I'm not 00:22:16.590 --> 00:22:19.130 a machine. I'm not a computer. I don't enjoy 00:22:19.130 --> 00:22:21.320 sifting through data. That's what computers are good for. 00:22:21.320 --> 00:22:23.289 I would rather be drinking a beer. It's really 00:22:23.289 --> 00:22:24.830 nice in Portland, this time of year. 00:22:24.830 --> 00:22:27.409 So, we wanted to think about, if you're trying 00:22:27.409 --> 00:22:31.179 to solve the performance problems in your application, what 00:22:31.179 --> 00:22:32.840 are the things that you would suss out with 00:22:32.840 --> 00:22:35.760 the existing tools after spending, like, four hours depleting 00:22:35.760 --> 00:22:37.510 your ego to get there? 00:22:37.510 --> 00:22:38.929 Y.K.: And I think part of this is just 00:22:38.929 --> 00:22:42.260 people are actually very, people like to think that 00:22:42.260 --> 00:22:43.880 they're gonna use these tools, but when the tools 00:22:43.880 --> 00:22:45.320 require you to dig through a lot of data, 00:22:45.320 --> 00:22:47.090 people just don't use them very much. 
So, the 00:22:47.090 --> 00:22:48.330 goal here was to build a tool that people 00:22:48.330 --> 00:22:50.809 actually use and actually like using, and not to 00:22:50.809 --> 00:22:54.870 build a tool that happens to provide a lot 00:22:54.870 --> 00:22:55.039 of data you can sift through. 00:22:55.039 --> 00:22:55.059 T.D.: Yes. 00:22:55.059 --> 00:22:55.929 So, probably the, one of the first things that 00:22:55.929 --> 00:22:58.529 we realized is that we don't want to provide. 00:22:58.529 --> 00:23:00.400 This is a trace of a request, you've probably 00:23:00.400 --> 00:23:04.070 seen similar UIs using other tools, using, for example, 00:23:04.070 --> 00:23:07.059 the inspector in, in like Chrome or Safari, and 00:23:07.059 --> 00:23:08.700 this is just showing basically, it's basically a visual 00:23:08.700 --> 00:23:10.830 stack trace of where your application is spending its 00:23:10.830 --> 00:23:11.600 time. 00:23:11.600 --> 00:23:13.950 But I think what was important for us is 00:23:13.950 --> 00:23:17.809 showing not just a single request, because your app 00:23:17.809 --> 00:23:20.570 handles, you know, hundreds of thousands of requests, or 00:23:20.570 --> 00:23:22.679 millions of requests. So looking at a single request 00:23:22.679 --> 00:23:24.630 statistically is complete, it's just noise. 00:23:24.630 --> 00:23:26.500 Y.K.: And it's especially bad if it's the worst 00:23:26.500 --> 00:23:28.659 request, because the worst request is, is really noise. 00:23:28.659 --> 00:23:30.850 It's like, a hiccup in the network, right. 00:23:30.850 --> 00:23:31.250 T.D.: It's the outlier. Yeah. 00:23:31.250 --> 00:23:32.150 Y.K.: It's literally the outlier. 00:23:32.150 --> 00:23:35.659 T.D.: It's literally the outlier. Yup. So, what we 00:23:35.659 --> 00:23:38.700 present in Skylight is something a little bit different, 00:23:38.700 --> 00:23:41.770 and it's something that we call the aggregate trace. 
00:23:41.770 --> 00:23:46.260 So the aggregate trace is basically us taking all 00:23:46.260 --> 00:23:49.559 of your requests, averaging them out where each of 00:23:49.559 --> 00:23:51.750 these things spends their time, and then showing you 00:23:51.750 --> 00:23:54.899 that. So this is basically like, this is like, 00:23:54.899 --> 00:23:57.929 this is like the statue of David. It is 00:23:57.929 --> 00:24:00.880 the idealized form of the stack trace of how 00:24:00.880 --> 00:24:02.270 your application's behaving. 00:24:02.270 --> 00:24:05.330 But, of course, you have the same problem as 00:24:05.330 --> 00:24:07.500 before, which is, if this is all that we 00:24:07.500 --> 00:24:10.580 were showing you, it would be obscuring a lot 00:24:10.580 --> 00:24:12.870 of information. You want to actually be able to 00:24:12.870 --> 00:24:13.990 tell the difference between, OK, what's my stack trace 00:24:13.990 --> 00:24:16.419 look like for fast requests, and how does that 00:24:16.419 --> 00:24:18.539 differ from requests that are slower. 00:24:18.539 --> 00:24:20.860 So what we've got, I've got a little video 00:24:20.860 --> 00:24:22.320 here. You can see that when I move the 00:24:22.320 --> 00:24:26.490 slider, that this trace below it is actually updating 00:24:26.490 --> 00:24:29.130 in real time. As I move the slider around, 00:24:29.130 --> 00:24:31.770 you can see that the aggregate trace actually updates 00:24:31.770 --> 00:24:34.240 with it. And that's because we're collecting all this 00:24:34.240 --> 00:24:36.159 information. We're collecting, like I said, a lot of 00:24:36.159 --> 00:24:38.669 data. We can recompute this aggregate trace on the 00:24:38.669 --> 00:24:38.909 fly. 00:24:38.909 --> 00:24:41.200 Basically, for each bucket, we're storing a different trace, 00:24:41.200 --> 00:24:42.880 and then on the client we're reassembling that. We'll 00:24:42.880 --> 00:24:43.899 go into that a little bit. 
00:24:43.899 --> 00:24:45.799 Y.K.: And I think it's really important that you 00:24:45.799 --> 00:24:48.370 be able to do these experiments quickly. If every 00:24:48.370 --> 00:24:50.059 time you think, oh, I wonder what happens if 00:24:50.059 --> 00:24:52.260 I add another histogram bucket, if it requires a 00:24:52.260 --> 00:24:54.830 whole full page refresh. Then that would basically make 00:24:54.830 --> 00:24:56.309 people not want to use the tool. Not able 00:24:56.309 --> 00:24:58.580 to use the tool. So, actually building something which 00:24:58.580 --> 00:24:59.649 is real time and fast, gets the data as 00:24:59.649 --> 00:25:00.110 it comes, was really important to us. 00:25:00.110 --> 00:25:01.220 T.D.: So that's number one. 00:25:01.220 --> 00:25:04.850 And the second thing. So we built that, and 00:25:04.850 --> 00:25:07.929 we're like, OK, well what's next? And I think 00:25:07.929 --> 00:25:09.250 that the big problem with this is that you 00:25:09.250 --> 00:25:12.020 need to know that there's a problem before you 00:25:12.020 --> 00:25:14.429 go look at it, right. So we have been 00:25:14.429 --> 00:25:16.080 working for the past few months, and the Storm 00:25:16.080 --> 00:25:18.390 infrastructure that we built makes it pretty straight-forward to 00:25:18.390 --> 00:25:21.149 start building more abstractions on top of the data 00:25:21.149 --> 00:25:21.559 that we've already collected. 00:25:21.559 --> 00:25:24.120 It's a very declarative system. So we've been working 00:25:24.120 --> 00:25:26.679 on a feature called inspections. And what's cool about 00:25:26.679 --> 00:25:29.279 inspections is that we can look at this tremendous 00:25:29.279 --> 00:25:31.270 volume of data that we've collected from your app, 00:25:31.270 --> 00:25:33.840 and we can automatically tease out what the problems 00:25:33.840 --> 00:25:35.210 are. So the first one that we shipped, this 00:25:35.210 --> 00:25:37.399 is in beta right now. 
It's not, it's not 00:25:37.399 --> 00:25:39.840 out and enabled by default, but it's behind 00:25:39.840 --> 00:25:42.440 a feature flag that we've had some users turning 00:25:42.440 --> 00:25:42.730 on. 00:25:42.730 --> 00:25:44.419 And, and trying out. And so what we can 00:25:44.419 --> 00:25:46.450 do in this case, is because we have information 00:25:46.450 --> 00:25:48.730 about all of the database queries in your app, 00:25:48.730 --> 00:25:50.840 we can look and see if you have n 00:25:50.840 --> 00:25:52.390 plus one queries. Can you maybe explain what an 00:25:52.390 --> 00:25:53.250 n plus one query is? 00:25:53.250 --> 00:25:54.600 Y.K.: Yeah. So, people know, hopefully, what n 00:25:54.600 --> 00:25:56.770 plus one queries are. But it's the idea that 00:25:56.770 --> 00:25:59.260 you, by accident, for some reason, instead of making 00:25:59.260 --> 00:26:01.970 one query, you asked for, like, all the posts 00:26:01.970 --> 00:26:02.940 and then you iterated through all of them and 00:26:02.940 --> 00:26:04.940 got all the comments and now you, instead of 00:26:04.940 --> 00:26:08.679 having one query, you have one query per post, 00:26:08.679 --> 00:26:10.309 right. And what I like to 00:26:10.309 --> 00:26:12.549 do is eager loading, where you say include 00:26:12.549 --> 00:26:14.559 comments, right. But you have to know that you 00:26:14.559 --> 00:26:15.039 have to do that. 00:26:15.039 --> 00:26:16.899 So there's some tools that will run in development 00:26:16.899 --> 00:26:18.380 mode, if you happen to catch it, like Bullet. 00:26:18.380 --> 00:26:20.460 This is basically a tool that's looking at 00:26:20.460 --> 00:26:22.210 every single one of your classes and has some 00:26:22.210 --> 00:26:24.169 thresholds that, once we see that a bunch of 00:26:24.169 --> 00:26:27.429 your requests have the same exact query, so we 00:26:27.429 --> 00:26:29.549 do some work to pull out binds.
So if 00:26:29.549 --> 00:26:32.200 it's, like, where something equals one, we will automatically 00:26:32.200 --> 00:26:34.110 pull out the one and replace it with a 00:26:34.110 --> 00:26:34.740 question mark. 00:26:34.740 --> 00:26:36.230 And then we basically take all those queries, if 00:26:36.230 --> 00:26:39.529 they're the exact same query repeated multiple times, subject 00:26:39.529 --> 00:26:41.390 to some thresholds, we'll start showing you hey, there's 00:26:41.390 --> 00:26:42.450 an n plus one query. 00:26:42.450 --> 00:26:43.799 And you can imagine this same sort of thing 00:26:43.799 --> 00:26:46.320 being done for things, like, are you missing an 00:26:46.320 --> 00:26:49.690 index, right. Or, are you using the Ruby version 00:26:49.690 --> 00:26:50.950 of JSON when you should be using the native 00:26:50.950 --> 00:26:52.179 version of JSON. These are all things that we 00:26:52.179 --> 00:26:55.140 can start detecting just because we're consuming an enormous 00:26:55.140 --> 00:26:57.510 amount of information, and we can start writing some 00:26:57.510 --> 00:26:59.330 heuristics for bubbling it up. 00:26:59.330 --> 00:27:02.330 So, third and final breakthrough, we realized that we 00:27:02.330 --> 00:27:05.289 really, really needed a lightning fast UI. Something really 00:27:05.289 --> 00:27:08.279 responsive. So, in particular, the feedback loop is critical, 00:27:08.279 --> 00:27:09.929 right. You can imagine, if the way that you 00:27:09.929 --> 00:27:12.279 dug into data was you clicked and you wait 00:27:12.279 --> 00:27:14.320 an hour, and then you get your results, no 00:27:14.320 --> 00:27:15.730 one would do it. No one would ever do 00:27:15.730 --> 00:27:15.890 it. 00:27:15.890 --> 00:27:19.090 And the existing tools are OK, but you click 00:27:19.090 --> 00:27:20.470 and you wait. 
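The bind extraction and n + 1 detection described above can be sketched as a naive normalize-and-tally pass: replace literal values with `?` so repeated queries collapse to one shape, then flag shapes that recur in a single request. The regexes and the threshold below are invented for illustration; a real implementation works from parsed queries, not regexes.

```ruby
# Naive sketch of bind extraction + n+1 detection.
N_PLUS_ONE_THRESHOLD = 5  # invented threshold for illustration

def normalize(sql)
  sql.gsub(/'[^']*'/, "?")   # replace string literals with ?
     .gsub(/\b\d+\b/, "?")   # replace numeric literals with ?
end

# Given all queries from one request, return the query shapes that
# repeat often enough to look like an n+1.
def n_plus_one_shapes(queries, threshold: N_PLUS_ONE_THRESHOLD)
  queries.map { |q| normalize(q) }
         .tally
         .select { |_, count| count >= threshold }
         .keys
end

queries = (1..6).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" } +
          ["SELECT * FROM posts"]
n_plus_one_shapes(queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```

The same normalized shapes are what make "this query appeared in all your requests and took around this long" possible in the aggregate trace.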
You look at it and you're 00:27:20.470 --> 00:27:21.730 like, oh, I want a different view, so then 00:27:21.730 --> 00:27:23.240 you go edit your query and then you click 00:27:23.240 --> 00:27:25.360 and you wait and it's just not a pleasant 00:27:25.360 --> 00:27:26.600 experience. 00:27:26.600 --> 00:27:28.850 So, so we use Ember. The, the UI that 00:27:28.850 --> 00:27:31.250 you're using when you log into Skylight, even though 00:27:31.250 --> 00:27:33.289 it feels just like a regular website, it doesn't 00:27:33.289 --> 00:27:35.940 feel like a native app, is powered, all of 00:27:35.940 --> 00:27:37.679 the routing, all of the rendering, all of the 00:27:37.679 --> 00:27:40.769 decision making, is happening in, as an Ember.js app, 00:27:40.769 --> 00:27:43.049 and we pair that with D3. So all of 00:27:43.049 --> 00:27:44.830 the charts, the charts that you saw there in 00:27:44.830 --> 00:27:48.039 the aggregate trace, that is all Ember components powered 00:27:48.039 --> 00:27:48.970 by D3. 00:27:48.970 --> 00:27:52.860 So, this has actually significantly cleaned up our client-side 00:27:52.860 --> 00:27:55.679 code. It makes re-usability really, really awesome. So to 00:27:55.679 --> 00:27:57.039 give you an example, this is from our billing 00:27:57.039 --> 00:27:58.789 page. The designer came and they had, 00:27:58.789 --> 00:28:01.260 they had a component that was like, the date 00:28:01.260 --> 00:28:01.809 component. 00:28:01.809 --> 00:28:02.919 And, the- 00:28:02.919 --> 00:28:05.899 T.D.: It seems really boring at first. 00:28:05.899 --> 00:28:06.799 Y.K.: It seemed really boring. But, this is the 00:28:06.799 --> 00:28:08.950 implementation, right. So you could copy and paste this 00:28:08.950 --> 00:28:11.059 code over and over again, everywhere you go. Just 00:28:11.059 --> 00:28:12.750 remember to format it correctly. If you forget to 00:28:12.750 --> 00:28:15.070 format it, it's not gonna look the same everywhere.
00:28:15.070 --> 00:28:17.460 But I was like, hey, we're using this all 00:28:17.460 --> 00:28:18.010 over the place. Why don't we bundle this up 00:28:18.010 --> 00:28:20.070 into a component? And so with Ember, it was 00:28:20.070 --> 00:28:22.230 super easy. We basically just said, OK, here's a new 00:28:22.230 --> 00:28:24.590 calendar date component. It has a property on it 00:28:24.590 --> 00:28:26.460 called date. Just set that to any JavaScript Date 00:28:26.460 --> 00:28:28.059 object. You don't have to remember about 00:28:28.059 --> 00:28:30.450 converting it or formatting it. Here's the component. Set 00:28:30.450 --> 00:28:31.840 the date and it will render the correct thing 00:28:31.840 --> 00:28:32.760 automatically. 00:28:32.760 --> 00:28:36.039 And, so the architecture of the Ember app looks 00:28:36.039 --> 00:28:37.640 a little bit, something like this, where you have 00:28:37.640 --> 00:28:39.919 many, many different components, most of them just driven 00:28:39.919 --> 00:28:42.370 by D3, and then they're plugged into the model 00:28:42.370 --> 00:28:43.480 and the controller. 00:28:43.480 --> 00:28:44.909 And the Ember app will go fetch those models 00:28:44.909 --> 00:28:46.750 from the cloud, and the cloud from the Java 00:28:46.750 --> 00:28:50.190 app, which just queries Cassandra, and render them. And 00:28:50.190 --> 00:28:53.429 what's neat about this model is turning on web 00:28:53.429 --> 00:28:56.360 sockets is super easy, right. Because all of these 00:28:56.360 --> 00:28:58.860 components are bound to a single place. So when 00:28:58.860 --> 00:29:00.890 the web socket says, hey, we have updated information 00:29:00.890 --> 00:29:02.630 for you to show, it just pushes it onto 00:29:02.630 --> 00:29:04.980 the model or onto the controller, and the whole 00:29:04.980 --> 00:29:06.159 UI updates automatically. 00:29:06.159 --> 00:29:06.890 It's like magic.
00:29:06.890 --> 00:29:07.230 And- 00:29:07.230 --> 00:29:08.250 Y.K.: Like magic. 00:29:08.250 --> 00:29:09.679 T.D.: It's like magic. And, and when debugging, this 00:29:09.679 --> 00:29:11.559 is especially awesome too, because, and I'll maybe show 00:29:11.559 --> 00:29:15.080 a demo of the Ember inspector. It's nice. 00:29:15.080 --> 00:29:17.830 So. Yeah. So, lightning fast UI. Reducing the feedback 00:29:17.830 --> 00:29:19.510 loop so that you can quickly play with your 00:29:19.510 --> 00:29:21.880 data makes it go from a chore to something 00:29:21.880 --> 00:29:23.620 that actually feels kind of fun. 00:29:23.620 --> 00:29:27.039 So, these were the breakthroughs that we had when 00:29:27.039 --> 00:29:28.440 we were building Skylight. The things that made us 00:29:28.440 --> 00:29:29.980 think, yes, this is actually a product that we 00:29:29.980 --> 00:29:31.940 think deserves to be on the market. So, one, 00:29:31.940 --> 00:29:33.860 honest response times. Collect data that no one else 00:29:33.860 --> 00:29:36.549 can collect. Focus on answers instead of just dumping 00:29:36.549 --> 00:29:38.289 data, and have a lightning fast UI to do 00:29:38.289 --> 00:29:38.409 it. 00:29:38.409 --> 00:29:40.100 So we like to think of Skylight as basically 00:29:40.100 --> 00:29:42.690 a smart profiler. It's a smart profiler that runs 00:29:42.690 --> 00:29:44.350 in production. It's like the profiler that you run 00:29:44.350 --> 00:29:47.230 on your local development machine, but instead of being 00:29:47.230 --> 00:29:49.179 on your local dev box, which has nothing to 00:29:49.179 --> 00:29:51.610 do with the performance characteristics of what your users 00:29:51.610 --> 00:29:53.450 are experiencing, we're actually running in production. 00:29:53.450 --> 00:29:58.919 So, let me just give you guys a quick 00:29:58.919 --> 00:30:00.390 demo.
00:30:00.390 --> 00:30:03.120 So, this is what the Skylight, this is what 00:30:03.120 --> 00:30:07.610 Skylight looks like. What's under this? There we go. 00:30:07.610 --> 00:30:09.620 So, the first thing here is we've got the 00:30:09.620 --> 00:30:12.669 app dash board. So this, it's like our, 95th 00:30:12.669 --> 00:30:15.500 responsile- 95th percentile response time has peaked. Maybe you're 00:30:15.500 --> 00:30:17.970 all hammering it right now. That would be nice. 00:30:17.970 --> 00:30:19.940 So, this is a graph of your response time 00:30:19.940 --> 00:30:22.010 over time, and then on the right, this is 00:30:22.010 --> 00:30:24.700 the graph of the RPMs, the requests per minute 00:30:24.700 --> 00:30:26.750 that your app is handling. So this is app-wide. 00:30:26.750 --> 00:30:29.440 And this is live. This updates every minute. 00:30:29.440 --> 00:30:31.039 Then down below, you have a list of the 00:30:31.039 --> 00:30:33.730 end points in your application. So you can see, 00:30:33.730 --> 00:30:35.700 actually, the top, the slowest ones for us were, 00:30:35.700 --> 00:30:37.789 we have an instrumentation API, and we've gone and 00:30:37.789 --> 00:30:39.929 instrumented our background workers. So we can see them 00:30:39.929 --> 00:30:42.010 here, and their response time plays in. So we 00:30:42.010 --> 00:30:44.220 can see that we have this reporting worker that's 00:30:44.220 --> 00:30:46.899 taking 95th percentile, thirteen seconds. 00:30:46.899 --> 00:30:48.880 Y.K.: So all that time used to be inside 00:30:48.880 --> 00:30:51.500 of some request somewhere, and we discovered that there 00:30:51.500 --> 00:30:52.840 was a lot of time being spent in things 00:30:52.840 --> 00:30:54.679 that we could push to the background. 
We probably 00:30:54.679 --> 00:30:56.789 need to update the agony index so that it 00:30:56.789 --> 00:30:59.190 doesn't rank workers very high, because spending some time 00:30:59.190 --> 00:31:02.120 in your workers is not that big of a 00:31:02.120 --> 00:31:02.130 deal. 00:31:02.130 --> 00:31:03.000 T.D.: So, so then, if we dive into one 00:31:03.000 --> 00:31:05.299 of these, you can see that for this request, 00:31:05.299 --> 00:31:07.000 we've got the time explorer up above, and that 00:31:07.000 --> 00:31:10.429 shows a graph of response time at, again, 95th 00:31:10.429 --> 00:31:11.840 percentile, and you can, if you want to go 00:31:11.840 --> 00:31:13.549 back and look at historical data, you just drag 00:31:13.549 --> 00:31:15.250 it like this. And this has got a brush, 00:31:15.250 --> 00:31:16.980 so you can zoom in and out on different 00:31:16.980 --> 00:31:17.760 times. 00:31:17.760 --> 00:31:19.649 And every time you change the range, you can 00:31:19.649 --> 00:31:21.360 see that it's very responsive. It's never waiting for 00:31:21.360 --> 00:31:23.039 the server. But it is going back and fetching 00:31:23.039 --> 00:31:25.080 data from the server, and then when the data 00:31:25.080 --> 00:31:29.210 comes back, you see the whole UI just updates. 00:31:29.210 --> 00:31:29.250 And we get that for free with Ember. 00:31:29.250 --> 00:31:31.190 And then down below, as we discussed, you actually 00:31:31.190 --> 00:31:33.760 have a real histogram. And this histogram, in this 00:31:33.760 --> 00:31:37.159 case, is showing. So this is for fifty-seven requests. 00:31:37.159 --> 00:31:39.019 And if we click and drag, we could just 00:31:39.019 --> 00:31:40.429 move this. And you can see that the aggregate 00:31:40.429 --> 00:31:43.360 trace below updates in response to us dragging this.
00:31:43.360 --> 00:31:44.919 And if we want to look at the fastest 00:31:44.919 --> 00:31:47.500 quartile, we just click faster and we'll just choose 00:31:47.500 --> 00:31:48.149 that range on the histogram. 00:31:48.149 --> 00:31:49.210 Y.K.: I think it's the fastest load. 00:31:49.210 --> 00:31:50.899 T.D.: The fastest load. And then if you click 00:31:50.899 --> 00:31:52.899 on slower, you can see the slower requests. So 00:31:52.899 --> 00:31:54.669 this makes it really easy to compare and contrast. 00:31:54.669 --> 00:31:56.710 OK. Why are certain requests faster and why are 00:31:56.710 --> 00:31:58.529 certain requests slow? 00:31:58.529 --> 00:32:00.779 You can see the blue, these blue areas. This 00:32:00.779 --> 00:32:03.559 is Ruby code. So, right now it's not super 00:32:03.559 --> 00:32:05.820 granular. It would be nice if you could actually 00:32:05.820 --> 00:32:07.940 know what was going on here. But, it'll at 00:32:07.940 --> 00:32:09.940 least tell you where in your controller action this 00:32:09.940 --> 00:32:12.690 is happening, and then you can actually see which 00:32:12.690 --> 00:32:15.919 database queries are being executed, and what their duration 00:32:15.919 --> 00:32:16.080 is. 00:32:16.080 --> 00:32:17.889 And you can see that we actually extract the 00:32:17.889 --> 00:32:20.419 SQL and we denormalize it so we, so you, 00:32:20.419 --> 00:32:22.159 or, we normalize it so you can see exactly 00:32:22.159 --> 00:32:24.019 what those requests are even if the values are 00:32:24.019 --> 00:32:24.820 totally different between them. 00:32:24.820 --> 00:32:27.649 Y.K.: Yeah. So the real query, courtesy of Rails, 00:32:27.649 --> 00:32:29.730 not yet supporting bind extraction is like, where id 00:32:29.730 --> 00:32:32.169 equals one or, ten or whatever. 00:32:32.169 --> 00:32:33.659 T.D.: Yup. So that's pretty cool. 
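The histogram and 95th-percentile views in the demo come down to simple statistics over the collected response times. A rough sketch; the nearest-rank percentile method and the 50 ms bucket width are illustrative choices, not Skylight's actual ones:

```ruby
# Sketch: 95th percentile and histogram bucketing over response
# times (in milliseconds), the two views the demo slider drives.

# Nearest-rank percentile on a sorted copy of the samples.
def percentile(times, p)
  sorted = times.sort
  rank = ((p / 100.0) * (sorted.size - 1)).round
  sorted[rank]
end

# Count samples per fixed-width bucket (bucket start => count).
def histogram(times, bucket_ms)
  times.group_by { |t| (t / bucket_ms) * bucket_ms }
       .transform_values(&:size)
end

times = [12, 15, 18, 22, 25, 31, 48, 95, 120, 400]
p95  = percentile(times, 95)  # dominated by the slowest requests
hist = histogram(times, 50)   # counts per 50 ms bucket
```

Selecting "faster" or "slower" in the UI then amounts to filtering requests to a range of these buckets before re-aggregating the trace.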
00:32:33.659 --> 00:32:37.429 Y.K.: So one, one other thing is, initially, we 00:32:37.429 --> 00:32:39.269 actually just showed the whole trace, but we discovered 00:32:39.269 --> 00:32:41.659 that, obviously when you show whole traces you have 00:32:41.659 --> 00:32:43.639 information that doesn't really matter that much. So we 00:32:43.639 --> 00:32:47.340 started off by, we've recently basically started to collapse 00:32:47.340 --> 00:32:48.850 things that don't matter so much so that you 00:32:48.850 --> 00:32:51.090 can basically expand or condense the trace. 00:32:51.090 --> 00:32:52.519 And we wanted to make it so you don't 00:32:52.519 --> 00:32:55.690 have to think about expanding or condensing individual areas, 00:32:55.690 --> 00:32:57.960 but just, you see what matters the most and 00:32:57.960 --> 00:32:59.100 then you can see the trivial areas. 00:32:59.100 --> 00:33:02.179 T.D.: Yup. So, so that's the demo of Skylight. 00:33:02.179 --> 00:33:04.190 We'd really like it if you checked it out. 00:33:04.190 --> 00:33:05.899 There is one more thing I want to show 00:33:05.899 --> 00:33:07.720 you that is, like, really freaking cool. This is 00:33:07.720 --> 00:33:10.529 coming out of Tilde labs. Carl was like, has 00:33:10.529 --> 00:33:13.730 been hacking, he's been up until past midnight, getting 00:33:13.730 --> 00:33:15.769 almost no sleep for the past month trying to 00:33:15.769 --> 00:33:16.730 have this ready. 00:33:16.730 --> 00:33:19.090 I don't know how many of you know this, 00:33:19.090 --> 00:33:23.630 but Ruby 2 point 1 has a new, a, 00:33:23.630 --> 00:33:27.950 a stack sampling feature. So you can get really 00:33:27.950 --> 00:33:31.149 granular information about how your Ruby code is performing.
00:33:31.149 --> 00:33:33.450 So I want to show you, I just mentioned 00:33:33.450 --> 00:33:34.570 how it would be nice if we could get 00:33:34.570 --> 00:33:36.830 more information out of what your Ruby code is 00:33:36.830 --> 00:33:38.760 doing. And now we can do that. 00:33:38.760 --> 00:33:42.039 Basically, every few milliseconds, this code that Carl wrote 00:33:42.039 --> 00:33:44.399 is going into the, to the Ruby, into MRI, 00:33:44.399 --> 00:33:47.419 and it's taking a snapshot of the stack. 00:33:47.419 --> 00:33:50.769 And because this is built-in, it's very low-impact. It's 00:33:50.769 --> 00:33:53.570 not allocating any new memory. There's very little performance 00:33:53.570 --> 00:33:55.769 hit. Basically you wouldn't even notice it. And so 00:33:55.769 --> 00:33:58.149 every few milliseconds it's sampling, and we take that 00:33:58.149 --> 00:34:00.260 information and we send it up to our servers. 00:34:00.260 --> 00:34:02.260 So it's almost like you're running a Ruby profiler on 00:34:02.260 --> 00:34:05.220 your local dev box, where you get extremely granular 00:34:05.220 --> 00:34:07.159 information about where your code is spending its time 00:34:07.159 --> 00:34:09.010 in Ruby, per method, per all of these things. 00:34:09.010 --> 00:34:11.909 But it's happening in production. 00:34:11.909 --> 00:34:16.409 So, this is, so this is a, we enabled 00:34:16.409 --> 00:34:18.399 it in staging. You can see that we've got 00:34:18.399 --> 00:34:19.600 some rendering bugs. It's still in beta. 00:34:19.600 --> 00:34:21.918 Y.K.: Yeah, and we haven't yet collapsed things that 00:34:21.918 --> 00:34:21.980 are not important- 00:34:21.980 --> 00:34:22.020 T.D.: Yes. 00:34:22.020 --> 00:34:23.270 Y.K.: -for this particular feature. 00:34:23.270 --> 00:34:24.170 T.D.: So we want to show, we want to 00:34:24.170 --> 00:34:27.610 hide things like, like framework code, obviously.
But this 00:34:27.610 --> 00:34:31.070 gives you an incredibly, incredibly granular view of what 00:34:31.070 --> 00:34:35.659 your app is doing in production. And we think. 00:34:35.659 --> 00:34:39.230 This is a, an API that's built into, into 00:34:39.230 --> 00:34:43.159 Ruby 2.1.1. Because our agent is running so low-level, 00:34:43.159 --> 00:34:44.659 because we wrote it in Rust, we have the 00:34:44.659 --> 00:34:47.409 ability to do things like this, and Carl thinks 00:34:47.409 --> 00:34:48.370 that we may be able to actually backport 00:34:48.370 --> 00:34:48.480 this to older Rubies, too. So if you're not 00:34:48.480 --> 00:34:50.130 on Ruby 2.1, we think that we can actually 00:34:50.130 --> 00:34:52.790 bring this. But that's TBD. 00:34:52.790 --> 00:34:55.480 Y.K.: Yeah, I- so I think the cool thing 00:34:55.480 --> 00:34:57.940 about this, in general, is when you run a 00:34:57.940 --> 00:34:59.430 sampling- so this is a sampling profiler, right, we 00:34:59.430 --> 00:35:01.260 don't want to be burdening every single thing that 00:35:01.260 --> 00:35:03.790 you do in your program with tracing, right. That 00:35:03.790 --> 00:35:05.380 would be very slow. 00:35:05.380 --> 00:35:06.920 So when you normally run a sampling profiler, you 00:35:06.920 --> 00:35:08.760 have to basically make a loop. You have to 00:35:08.760 --> 00:35:11.090 basically create a loop, run this code a million 00:35:11.090 --> 00:35:12.970 times and keep sampling it. Eventually we'll get enough 00:35:12.970 --> 00:35:15.030 samples to get the information. But it turns out 00:35:15.030 --> 00:35:17.280 that your production server is a loop. Your production 00:35:17.280 --> 00:35:20.560 server is serving tons and tons of requests.
So, 00:35:20.560 --> 00:35:22.880 by simply tak- you know, taking a few microseconds 00:35:22.880 --> 00:35:25.580 out of every request and collecting a couple of 00:35:25.580 --> 00:35:27.210 samples, over time we can actually get this really 00:35:27.210 --> 00:35:29.700 high fidelity picture with basically no cost. 00:35:29.700 --> 00:35:31.150 And that's pretty mind-blowing. And this is the kind 00:35:31.150 --> 00:35:34.650 of stuff that we can start doing by really 00:35:34.650 --> 00:35:37.250 caring about, about both the user experience and the 00:35:37.250 --> 00:35:40.830 implementation and getting really scary about it. And I'm 00:35:40.830 --> 00:35:42.700 really, like, honestly this is a really exciting feature 00:35:42.700 --> 00:35:45.330 that really shows what we can do as we 00:35:45.330 --> 00:35:46.130 start building this out. 00:35:46.130 --> 00:35:47.140 T.D.: Once we've got that, once we've got that 00:35:47.140 --> 00:35:48.380 groundwork. 00:35:48.380 --> 00:35:49.820 So if you guys want to check it out, 00:35:49.820 --> 00:35:51.760 Skylight dot io, it's available today. It's no longer 00:35:51.760 --> 00:35:54.040 in private beta. Everyone can sign up. No invitation 00:35:54.040 --> 00:35:56.630 token necessary. And you can get a thirty-day free 00:35:56.630 --> 00:35:58.240 trial if you haven't started one already. So if 00:35:58.240 --> 00:35:59.620 you have any questions, please come see us right 00:35:59.620 --> 00:36:00.980 now, or we have a booth in the vendor 00:36:00.980 --> 00:36:03.140 hall. Thank you guys very much.
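[Editor's note: the sampling approach described in the talk can be sketched in pure Ruby. The real agent hooks MRI's C-level frame-sampling support added in Ruby 2.1 (the `rb_profile_frames` API) and is far cheaper; `Thread#backtrace` here is only a stand-in to show the idea — snapshot the stack every few milliseconds and count how often each frame appears. The `sample` helper, its parameters, and the worker thread are all illustrative assumptions, not Skylight code.]

```ruby
# Pure-Ruby approximation of a sampling profiler: periodically snapshot
# a target thread's stack and tally frame occurrences. Frames that show
# up most often are where the thread spends its time.
def sample(target, interval: 0.005, duration: 0.5)
  counts = Hash.new(0)
  sampler = Thread.new do
    stop_at = Time.now + duration
    while Time.now < stop_at
      # Thread#backtrace returns the target's current stack as strings.
      target.backtrace&.each { |frame| counts[frame] += 1 }
      sleep interval
    end
  end
  sampler.join
  # Sort hottest frames first: [frame, sample_count] pairs.
  counts.sort_by { |_, n| -n }
end

worker = Thread.new { loop { Math.sqrt(rand) } }
hot_frames = sample(worker)
worker.kill
```

Because a production server is already "a loop," taking a couple of samples per request and aggregating them over thousands of requests yields the same high-fidelity picture without ever running a tight profiling loop yourself.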