TOM DALE: Hey, you guys ready? Thank you guys so much for coming. This is awesome. When they were putting together the schedule, I said, make sure that you put us down in the Caves of Moria. So thank you guys for coming down and making it. I'm Tom. This is Yehuda.

YEHUDA KATZ: When people told me I was signed up to do a back-to-back talk, I don't know what I was thinking.

T.D.: Yup. So. We want to talk to you today about Skylight. But just before we talk about that, I want to talk about us a little bit.

So, in 2011 we started a company called Tilde. It's this shirt. It may have made me self-conscious, because this is actually a first edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center. One of the two.

So we started Tilde in 2011, and we had all just left a venture-backed company. That was a pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and sold to Facebook, and we really didn't want to repeat that experience.
So, we decided to start Tilde, and when we did, we decided to be bootstrapped. DHH and the other people at Basecamp were talking about, you know, being bootstrapped and proud. That was a message that really resonated with us, and we wanted to capture the same thing.

There's only one problem with being bootstrapped and proud, and that is, in order to be both of those things, you actually need money, it turns out. It's not like you just say it in a blog post and then all of a sudden you are in business.

So, we had to think a lot about, OK, well, how do we make money? How do we make a profitable and, most importantly, sustainable business? Because we didn't want to just flip to Facebook in a couple of years.

So, looking around, the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few million dollars, you know, build a bunch of business model, mostly prayer. But that's not really how we want to think about building open source communities. We don't really think that that necessarily leads to the best open source communities.
And if you're interested in more on that, I recommend Leia Silver, who is one of our co-founders. She's giving a talk this afternoon. Oh, sorry, Friday afternoon, about how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would really suggest you go check out her talk.

So, no. No Ember, Inc. Not allowed.

So, we really wanted to build something that leveraged the strengths we thought we had. One, I think most importantly, a really deep knowledge of open source and a deep knowledge of the Rails stack. And also Carl, it turns out, is really, really good at building highly scalable big data systems. Lots of Hadoop in there.

So, last year at RailsConf, we announced the private beta of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? OK. Many of you. Awesome.

So, Skylight is a tool for profiling and measuring the performance of your Rails applications in production. And, as a product, Skylight, I think, was built on three key breakthroughs.
We didn't want to ship a product that was incrementally better than the competition. We wanted to ship a product that was dramatically better. A quantum leap. An order of magnitude better.

And, in order to do that, we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. Delivering a product that does that is predicated on these three breakthroughs.

So, the first one I want to talk about is honest response times. So, DHH wrote a blog post on what was then the 37signals blog, now the Basecamp blog, called The problem with averages. How many of you have read this? Awesome.

For those of you that have not - how many of you hate raising your hands at presentations?

So, for those of you that-

Y.K.: Just put a button in every seat. Press this button-

T.D.: Press the button if you have. Yes. Great.

So, if you read this blog post, the way it opens is: "Our average response time for Basecamp right now is 87ms... That sounds fantastic."
"And it easily leads you to believe that all is well and that we wouldn't need to spend any more time optimizing performance.

But that's actually wrong. The average number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1000 requests that return in 5ms, and then you have 200 requests taking 2000ms, or two seconds, you can still report a respectable 170ms average. That's useless."

So what does DHH say that we need? DHH says the solution is histograms. So, for those of you like me who were sleeping through your statistics class in high school and college, a brief primer on histograms. A histogram is very simple. Basically, you have a series of buckets along some axis, and every time a value falls into a bucket, you increment that bar by one.

So, this is an example of a histogram of response times in a Rails application. You can see that there's a big cluster in the middle around 488ms, 500ms. This isn't a super speedy app, but it's not the worst thing in the world.
And they're all clustered, and then as you move to the right you can see that the response times get longer and longer, and as you move to the left, the response times get shorter and shorter.

So, why do you want a histogram? What's the most important thing about a histogram?

Y.K.: Well, I think it's because most requests don't actually look like this.

T.D.: Yes.

Y.K.: Most end points don't actually look like this.

T.D.: Right. If you think about what your Rails app is doing, it's a complicated beast, right? Turns out, in Ruby, frankly, you can do branching logic. You can do a lot of things.

And so what that means is that if you represent one end point with a single number, you are losing a lot of fidelity, to the point where it becomes, as DHH said, useless. So, for example, in a histogram, you can easily see: oh, here's a group of requests and response times where I'm hitting the cache, and here's another group where I'm missing it. And you can see that that cluster is significantly slower than the faster cache-hitting cluster.
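The bucketing described here takes only a few lines. A minimal sketch in Ruby - the 100ms bucket width and the sample times are invented for illustration, not Skylight's actual values:

```ruby
# Build a histogram of response times (in ms) with fixed-width buckets.
BUCKET_WIDTH = 100

def histogram(times_ms)
  times_ms.each_with_object(Hash.new(0)) do |t, buckets|
    # Integer division floors each time to the start of its bucket.
    buckets[(t / BUCKET_WIDTH) * BUCKET_WIDTH] += 1
  end
end

times = [45, 120, 480, 495, 510, 530, 2200]
histogram(times).sort.each do |bucket, count|
  puts format("%4d-%4dms | %s", bucket, bucket + BUCKET_WIDTH - 1, "#" * count)
end
```

Each bar is just a counter, which is why a histogram can be maintained incrementally as requests stream in.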
And the other thing that you get when you keep the whole distribution in the histogram is that you can look at this number at the 95th percentile. The way to think about the performance of your web application is not the average, because the average doesn't really tell you anything. You want to think about the 95th percentile, because that's not the average response time, that's the average worst response time that a user is likely to hit.

And the thing to keep in mind is that it's not as though a customer comes to your site, they issue one request, and then they're done. As someone is using your website, they're gonna be generating a lot of requests. And you need to look at the 95th percentile, because otherwise every request is basically you rolling the dice that they're not gonna hit one of those two second, three second, four second responses, close the tab, and go to your competitor.

So we look at this as - here's the crazy thing. Here's what I think is crazy. That blog post that DHH wrote, it's from 2009.
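The gap between the two numbers is easy to demonstrate. A sketch in Ruby, using a nearest-rank percentile and data shaped like DHH's example (the exact figures are invented):

```ruby
# Mean versus 95th percentile for a skewed set of response times.
def mean(times)
  times.sum.to_f / times.size
end

# Nearest-rank percentile: the smallest value such that p% of
# samples are at or below it.
def percentile(times, p)
  sorted = times.sort
  sorted[(p / 100.0 * sorted.size).ceil - 1]
end

times = [5] * 1000 + [2000] * 200  # many fast cache hits plus a slow tail
mean(times)            # => 337.5 - hides the slow tail entirely
percentile(times, 95)  # => 2000 - what an unlucky user actually sees
```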
It's been five years, and there's still no tool that does what DHH was asking for. So, frankly, we smelled money. We were like, holy crap.

Y.K.: Yeah, why isn't that slide green?

T.D.: Yeah. It should be green and dollars. I think Keynote has the dollars, the make-it-rain effect I should have used. So we smelled blood in the water. We were like, this is awesome. There's only one problem that we discovered, and that is, it turns out that building this thing is actually really, really freaky hard. Really, really hard.

So, we announced the private beta at RailsConf last year. Before doing that, we spent a year of research: spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized we had made a lot of errors when we were building this system.

So then, after RailsConf last year, we basically took six months to completely rewrite the backend from the ground up. And I think, tying into your keynote, Yehuda, we were like, oh, we clearly have a bespoke problem. No one else is doing this. So we rewrote our own custom backend.
And then we had all these problems, and we realized that they had actually already all been solved by the open source community. And so we benefited tremendously by having a shared solution.

Y.K.: Yeah. So our first release of this was really very bespoke, and the current release uses a tremendous amount of off-the-shelf open source projects that each solve a particular problem very effectively. None of which are as easy to use as Rails, but all of which solve really thorny problems very effectively.

T.D.: So, just for your own understanding, let's talk about how most performance monitoring tools work. The way that most of these work is that you run your Rails app, and running inside of your Rails app is some gem, some agent that you install. And every time the Rails app handles a request, it generates events, and those events, which include information about performance data, are passed into the agent.

And then the agent sends that data to some kind of centralized server. Now, it turns out that doing a running average is actually really simple. Which is why everyone does it.
Basically, you can do it in a single SQL query. All you do is have three columns in the database - the end point, the running average, and the number of requests - and the average and the count are the two things you need to keep a running average.

So keeping a running average is actually really simple from a technical point of view.

Y.K.: I don't think you could even do it in JavaScript, due to the lack of integers.

T.D.: Yes. You probably wouldn't want to do any math in JavaScript, it turns out. So, we took a little bit different approach. Yehuda, do you want to go over the next section?

Y.K.: Yeah, sure. So, when we first started, right at the beginning, we basically did a similar thing, where your app creates events. Most of those start off as ActiveSupport::Notifications, although it turns out that there's very limited use of ActiveSupport::Notifications, so we had to do some normalization work to get them sane, which we're gonna be upstreaming back into Rails.
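The "two things" are what make the update cheap: given the old average and the count, each new sample adjusts the average in constant space. A sketch in Ruby - the endpoint name is made up, and a real agent would keep this state in the database row described above:

```ruby
# Incremental running average per endpoint: one (average, count)
# pair per endpoint, updated in place as requests come in.
class RunningAverage
  def initialize
    @stats = Hash.new { |h, k| h[k] = { avg: 0.0, count: 0 } }
  end

  def record(endpoint, duration_ms)
    s = @stats[endpoint]
    s[:count] += 1
    # new_avg = old_avg + (sample - old_avg) / new_count
    s[:avg] += (duration_ms - s[:avg]) / s[:count]
  end

  def average(endpoint)
    @stats[endpoint][:avg]
  end
end

stats = RunningAverage.new
[100, 200, 300].each { |ms| stats.record("PostsController#index", ms) }
stats.average("PostsController#index")  # => 200.0
```

Exactly this simplicity is the trap: the whole distribution is collapsed into one number before it ever reaches the server.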
But one thing that's kind of unfortunate about having every single Rails app have an agent is that you end up having to do a lot of the same kind of work over and over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests. So now you have a queue of things that you're sending over HTTP in every single one of your Rails processes. And, of course, you probably don't notice this. People are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it suddenly starts taking twenty, thirty, forty, fifty more megabytes.

But we really wanted to keep the actual memory per process down to a small amount. So one of the very first things that we did - we even did it before last year - is that we pulled out all that shared logic into a separate process called the coordinator. And the agent is basically responsible simply for collecting the trace, and it's not responsible for actually talking to our server at all.
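The agent side of this - subscribing to events the app emits - follows the ActiveSupport::Notifications pattern. Here is a self-contained sketch of that pattern in plain Ruby; it mimics the API shape rather than using the real library, though "sql.active_record" is a real Rails event name:

```ruby
# A miniature publish/subscribe instrumenter in the style of
# ActiveSupport::Notifications: code instruments named events,
# and subscribers (like a profiling agent) receive them.
class Instrumenter
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(name, &block)
    @subscribers[name] << block
  end

  def instrument(name, payload = {})
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield if block_given?
    finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @subscribers[name].each { |sub| sub.call(name, start, finish, payload) }
    result
  end
end

events = Instrumenter.new
trace = []

# The agent's side: record every event it sees into a trace.
events.subscribe("sql.active_record") do |name, start, finish, payload|
  trace << { name: name, sql: payload[:sql], duration_ms: (finish - start) * 1000.0 }
end

# The app's side: instrument a (fake) query.
events.instrument("sql.active_record", sql: "SELECT 1") { :result }
trace.size  # => 1
```

In the design described here, that trace would be handed off to the coordinator process instead of accumulating inside the Rails process.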
And that means that the coordinator only has to keep this queue, this bunch of work, in one place, and it doesn't end up using as much memory. And I think this ended up being very effective for us.

T.D.: And I think that low overhead also allows us to just collect more information in general.

Y.K.: Yeah. Now, after our first attempt, we started getting a bunch of customers that were telling us - so, the separate coordinator turned out to be a good thing and a bad thing. On the one hand, there's only one of them, so it uses up only one set of memory. On the other hand, it's really easy for someone to go in and ps that process and see how many megabytes of memory it's using.

So, we got a lot of additional complaints that said, oh, your process is using a lot of memory. And I spent a few weeks - I know Ruby pretty well. I spent a couple of weeks. I actually wrote a gem called Allocation Counter that basically went in to try to pinpoint exactly where the allocations were coming from.
But it turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a regular expression in Ruby can allocate match objects that get put back on the stack.

And so I was able to pare this down to some degree. But I discovered pretty quickly that trying to keep a lid on the memory allocation by doing all this stuff in Ruby is mostly fine. But for our specific use case - where we wanna be telling you, you can run the agent on your box and it's not gonna use a lot of memory - we really needed something more efficient. And our first thought was, we'll use C++ or C. No problem. C is native. It's great. And Carl did the work. Carl is very smart. And then he said, Yehuda, it is now your turn. You need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all of your guys's boxes and not seg-fault. So that doesn't work for me.
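Those hidden allocations are easy to observe with core Ruby alone. A sketch - the counting helper below is invented for illustration and is not the Allocation Counter gem itself:

```ruby
# Count roughly how many object slots a block consumes, with GC
# paused so nothing is reclaimed mid-measurement.
def allocations_in
  GC.disable
  before = ObjectSpace.count_objects.dup
  yield
  after = ObjectSpace.count_objects
  GC.enable
  (after[:TOTAL] - after[:FREE]) - (before[:TOTAL] - before[:FREE])
end

# A regexp match quietly allocates objects (MatchData among them)
# on every iteration - exactly the kind of cost that is hard to spot.
n = allocations_in { 100.times { "hello world" =~ /world/ } }
puts n  # a surprisingly large positive number
```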
And so I noticed that Rust was coming along, and what Rust really gives you is the ability to write low-level code a la C or C++, with manual memory management that keeps your memory allocation low and keeps things speedy. Low resource utilization. While also giving you compile-time guarantees about not seg-faulting. So, again, if your processes randomly started seg-faulting because you installed the agent, I think you would stop being our customer very quickly. So having pretty much 100% guarantees about that was very important to us. And that's why we decided to use Rust.

I'll just keep going.

T.D.: Keep going.

Y.K.: So, we had this coordinator object. And basically the coordinator object is receiving events. So the events basically end up being these traces that describe what's happening in your application. And in our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have really big requests. So you may have a big request with a big SQL query in it, or a lot of big SQL queries in it.
Some people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs.

So, one of the first things that we did early on was we switched to using protobuf as the transport mechanism, and that shrunk down the payloads a lot. Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did, like, a weekend hack to just port it over to Java and got, like, 200x performance. And you don't always get 200x performance - if mostly what you're doing is database queries, you're not gonna get a huge performance swing.

But mostly what we're doing is math. And algorithms and data structures. And for that - Ruby could, in theory, one day have a good JIT or something, but today, writing that code in Java didn't end up being significantly more code, cause it's just, you know, algorithms and data structures.
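The size win from a binary encoding is easy to see without protobuf itself. A toy comparison in Ruby - this invented format is not Skylight's wire format, just the same flavor of compactness (protobuf's varints and field tags do better still):

```ruby
require "json"

# A toy trace: 200 nodes, each a named event with a duration.
trace = Array.new(200) { |i| { "name" => "sql.active_record", "duration_ms" => 100 + i } }

json_bytes = JSON.generate(trace).bytesize

# A compact binary encoding: intern event names once in a string
# table, then emit each node as (name index, duration) packed into
# 16-bit and 32-bit little-endian integers.
names = trace.map { |n| n["name"] }.uniq
body = trace.map { |n| [names.index(n["name"]), n["duration_ms"]].pack("s<l<") }.join
packed_bytes = names.join("\0").bytesize + body.bytesize

puts json_bytes    # JSON repeats every key and name in full
puts packed_bytes  # the string table amortizes them away
```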
T.D.: And I'll just note that standardizing on protobufs across our stack was actually a huge win. Because we realized, hey, browsers, as it turns out, are pretty powerful these days. They can allocate memory, they can do all these kinds of computation. And protobuf libraries exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical serialization format: you can move payloads around the entire stack and everything speaks the same language, so you save on serialization and deserialization.

Y.K.: And JavaScript is actually surprisingly effective at taking protobufs and converting them to the format that we need efficiently. So, we basically take this data. The Java collector is basically collecting all these protobufs, and pretty much it just turns around - and this is sort of where we got into bespoke territory before we started rolling our own - but we realized that when you write a big, distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about.
So, what we do is we basically take these payloads that are coming in, we convert them into batches, and we send the batches down into the Kafka queue. And the next thing that happens - so, Kafka's basically just a queue that allows you to throw things into it. I guess it might be considered similar to something like AMQP. It has some nice fault-tolerance properties and integrates well with Storm. But most important, it's just super, super high-throughput.

So we basically didn't want to put any barrier between you giving us the data and us getting it to disk as soon as possible.

T.D.: Yeah. Which we'll, I think, talk about in a bit.

Y.K.: So Kafka takes the data and starts sending it into Storm. And if you think about what has to happen in order to process a request: you have these requests - there's, you know, maybe traces that have a bunch of SQL queries - and our job is basically to take all those SQL queries and say, OK, I can see that in all of your requests, you had this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just a processing pipeline. You have these traces that come in one side, you start passing them through a bunch of processing steps, and then you end up on the other side with the data.

And Storm is actually a way of describing that processing pipeline in sort of a functional style. And then you tell it: OK, here's how many servers I need; here's how I'm gonna handle failures. And it basically deals with distribution and scaling and all that stuff for you. And part of that is because you wrote everything using functional style.

And so what happens is Kafka sends the data into the entry spout - which is terminology in Storm for these streams that get created - and they basically go into these processing things, which very cutely are called bolts.
This is 421 00:17:09,839 --> 00:17:12,980 definitely not the naming I would have used, but. 422 00:17:12,980 --> 00:17:15,130 So they're called bolts. And the idea is that 423 00:17:15,130 --> 00:17:16,970 basically every request may have several things. 424 00:17:16,970 --> 00:17:20,089 So, for example, we now automatically detect n + 425 00:17:20,089 --> 00:17:22,220 1 queries and that's sort of a different kind 426 00:17:22,220 --> 00:17:25,059 of processing from just, make a picture of the 427 00:17:25,059 --> 00:17:26,980 entire request. Or what is the 95th percentile across 428 00:17:26,980 --> 00:17:29,090 your entire app, right. These are all different kinds 429 00:17:29,090 --> 00:17:30,940 of processing. So we take the data and we 430 00:17:30,940 --> 00:17:33,580 send them into a bunch of bolts, and the 431 00:17:33,580 --> 00:17:35,750 cool thing about bolts is that, again, because they're 432 00:17:35,750 --> 00:17:38,890 just functional chaining, you can take the output from 433 00:17:38,890 --> 00:17:41,130 one bolt and feed it into another bolt. And 434 00:17:41,130 --> 00:17:43,510 that works, that works pretty well. And, and you 435 00:17:43,510 --> 00:17:44,730 don't have to worry about - I mean, you 436 00:17:44,730 --> 00:17:46,600 have to worry a little bit about things like 437 00:17:46,600 --> 00:17:49,960 fault tolerance, failure, idempotence. But you worry about 438 00:17:49,960 --> 00:17:52,230 them at, at the abstraction level, and then the 439 00:17:52,230 --> 00:17:54,159 operational part is handled for you. 440 00:17:54,159 --> 00:17:55,740 T.D.: So it's just like a very declarative way 441 00:17:55,740 --> 00:17:58,179 of describing how this computation works in, in a 442 00:17:58,179 --> 00:17:59,230 way that's easy to scale.
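(Editor's note: the pipeline idea described here can be sketched in plain Ruby. This is a toy illustration of the functional style Storm encourages — traces in one side, a chain of pure "bolt" steps, aggregated timings out the other — not Skylight's actual topology; all names and numbers are invented.)

```ruby
# Toy sketch of a Storm-style functional pipeline in plain Ruby.
# Traces enter on one side, flow through a chain of "bolts" (pure
# functions), and aggregated timings come out the other side.

# Each trace is a list of nodes: [name, duration_ms, parent_name].
TRACES = [
  [["ProcessAction", 120, nil], ["SELECT * FROM posts", 80, "ProcessAction"]],
  [["ProcessAction", 100, nil], ["SELECT * FROM posts", 60, "ProcessAction"]],
]

# Three "bolts": each one takes a stream and emits a transformed stream.
flatten_nodes  = ->(traces) { traces.flat_map { |trace| trace } }
group_by_name  = ->(nodes)  { nodes.group_by { |node| node[0] } }
average_timing = lambda do |groups|
  groups.transform_values do |nodes|
    durations = nodes.map { |node| node[1] }
    durations.sum.to_f / durations.size
  end
end

# Wiring bolts together is just function composition over the stream.
pipeline = [flatten_nodes, group_by_name, average_timing]
result = pipeline.reduce(TRACES) { |data, bolt| bolt.call(data) }
# result => average duration per node name, across every request
```

Because each step is a pure function of its input, a framework like Storm can distribute and retry the steps across machines without the steps themselves knowing about it.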
443 00:17:59,230 --> 00:18:01,860 Y.K.: And Carl actually talked about this at very 444 00:18:01,860 --> 00:18:04,909 high speed yesterday, and you, some of you may 445 00:18:04,909 --> 00:18:06,620 have been there. I would recommend watching the video 446 00:18:06,620 --> 00:18:09,020 when it comes out if you want to make 447 00:18:09,020 --> 00:18:11,159 use of this stuff in your own applications. 448 00:18:11,159 --> 00:18:13,250 And then when you're finally done with all the 449 00:18:13,250 --> 00:18:14,970 processing, you need to actually do something with it. 450 00:18:14,970 --> 00:18:16,289 You need to put it somewhere so that the 451 00:18:16,289 --> 00:18:18,360 web app can get access to it, and that 452 00:18:18,360 --> 00:18:21,350 is basically, we use Cassandra for this. And Cassandra 453 00:18:21,350 --> 00:18:24,929 again is mostly, it's a dumb database, but it 454 00:18:24,929 --> 00:18:27,780 has, it's, has high capacity. It has some of 455 00:18:27,780 --> 00:18:29,080 the fault-tolerance capabilities that we want. 456 00:18:29,080 --> 00:18:30,770 T.D.: We're very, we're just very, very write-heavy, right. 457 00:18:30,770 --> 00:18:32,820 Like, we tend to be writing more than we're 458 00:18:32,820 --> 00:18:33,730 ever reading. 459 00:18:33,730 --> 00:18:36,270 Y.K.: Yup. And then when we're done, when we're 460 00:18:36,270 --> 00:18:38,780 done with a particular batch, Cassandra basically kicks off 461 00:18:38,780 --> 00:18:40,720 the process over again. So we're basically doing these 462 00:18:40,720 --> 00:18:41,200 things as batches. 463 00:18:41,200 --> 00:18:42,919 T.D.: So these are, these are roll-ups, is what's 464 00:18:42,919 --> 00:18:45,360 happening here. So basically every minute, every ten minutes, 465 00:18:45,360 --> 00:18:48,140 and then at every hour, we reprocess and we 466 00:18:48,140 --> 00:18:49,970 re-aggregate, so that when you query us we know 467 00:18:49,970 --> 00:18:51,010 exactly what to give you.
468 00:18:51,010 --> 00:18:52,830 Y.K.: Yup. So we sort of have this cycle 469 00:18:52,830 --> 00:18:55,049 where we start off, obviously, in the first five 470 00:18:55,049 --> 00:18:56,890 seconds, the first minute, you really want high granularity. 471 00:18:56,890 --> 00:18:59,140 You want to see what's happening right now. But, 472 00:18:59,140 --> 00:19:00,110 if you want to go back and look at 473 00:19:00,110 --> 00:19:02,460 data from three months ago, you probably care about 474 00:19:02,460 --> 00:19:04,830 it, like the day granularity or maybe the hour 475 00:19:04,830 --> 00:19:07,490 granularity. So, we basically do these roll-ups and we 476 00:19:07,490 --> 00:19:09,200 cycle through the process. 477 00:19:09,200 --> 00:19:11,880 T.D.: So this, it turns out, building the system 478 00:19:11,880 --> 00:19:15,169 required an intense amount of work. Carl spent probably 479 00:19:15,169 --> 00:19:17,490 six months reading PhD thesises to find- 480 00:19:17,490 --> 00:19:18,140 Y.K.: Theses. 481 00:19:18,140 --> 00:19:23,850 T.D.: Theses. To find, to find data structures and 482 00:19:23,850 --> 00:19:25,870 algorithms that we could use. Because this is a 483 00:19:25,870 --> 00:19:28,340 huge amount of data. Like, I think even a 484 00:19:28,340 --> 00:19:31,049 few months after we were in private data, private 485 00:19:31,049 --> 00:19:34,179 beta, we were already handling over a billion requests 486 00:19:34,179 --> 00:19:36,200 per month. And obviously there's no way that we- 487 00:19:36,200 --> 00:19:37,970 Y.K.: Basically the number of requests that we handle 488 00:19:37,970 --> 00:19:39,770 is the sum of all of the requests that 489 00:19:39,770 --> 00:19:40,010 you handle. 490 00:19:40,010 --> 00:19:40,110 T.D.: Right. 491 00:19:40,110 --> 00:19:41,200 Y.K.: And all of our customers handle. 492 00:19:41,200 --> 00:19:41,909 T.D.: Right. Right. So. 493 00:19:41,909 --> 00:19:43,120 Y.K.: So that's a lot of requests.
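(Editor's note: the roll-up cycle just described might look something like this in miniature. The bucket widths come from the talk; the data and function names are invented for illustration.)

```ruby
# Miniature sketch of the roll-up cycle: the same raw timings aggregated
# at one-minute, ten-minute, and one-hour granularity, so recent data
# stays detailed while older data stays cheap to store and query.

def bucket_for(timestamp, width_seconds)
  timestamp - (timestamp % width_seconds)
end

# Average the durations that fall into each bucket of the given width.
def roll_up(requests, width_seconds)
  requests.group_by { |ts, _| bucket_for(ts, width_seconds) }
          .transform_values { |rs| rs.sum { |_, ms| ms } / rs.size }
end

# Fake request data: [unix_timestamp, duration_ms].
requests = [[60, 100], [90, 140], [660, 80], [3500, 200]]

minute_buckets = roll_up(requests, 60)    # high granularity, recent data
ten_minute     = roll_up(requests, 600)   # medium granularity
hourly         = roll_up(requests, 3600)  # coarse granularity, old data
```

In a real system each coarser tier would be computed from the tier below rather than from raw requests, which is what makes re-aggregating on a schedule so cheap.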
494 00:19:43,120 --> 00:19:45,760 T.D.: So obviously we can't provide a service, at 495 00:19:45,760 --> 00:19:48,480 least one that's not, we can't provide an affordable 496 00:19:48,480 --> 00:19:51,130 service, an accessible service, if we have to store 497 00:19:51,130 --> 00:19:53,190 terabytes or exabytes of data just to tell you 498 00:19:53,190 --> 00:19:53,990 how your app is running. 499 00:19:53,990 --> 00:19:56,630 Y.K.: And I think, also a problem, it's problematic 500 00:19:56,630 --> 00:19:58,429 if you store all the data in a database 501 00:19:58,429 --> 00:19:59,760 and then every single time someone wants to learn 502 00:19:59,760 --> 00:20:01,590 something about that, you have to do a query. 503 00:20:01,590 --> 00:20:03,159 Those queries can take a very long time. They 504 00:20:03,159 --> 00:20:04,700 can take minutes. And I think we really wanted 505 00:20:04,700 --> 00:20:07,159 to have something that would be very, that would, 506 00:20:07,159 --> 00:20:09,580 where the feedback loop would be fast. So we 507 00:20:09,580 --> 00:20:11,860 wanted to find algorithms that let us handle the 508 00:20:11,860 --> 00:20:13,940 data at, at real time, and then provide it 509 00:20:13,940 --> 00:20:16,020 to you at real time instead of these, like, 510 00:20:16,020 --> 00:20:18,309 dump the data somewhere and then do these complicated 511 00:20:18,309 --> 00:20:18,549 queries. 512 00:20:18,549 --> 00:20:20,820 T.D.: So, hold on. So this slide was not 513 00:20:20,820 --> 00:20:24,440 supposed to be here. It was supposed to be 514 00:20:24,440 --> 00:20:27,669 a Rails slide. So, whoa. I went too far. 515 00:20:27,669 --> 00:20:29,970 K. We'll watch that again. That's pretty cool. 
So 516 00:20:29,970 --> 00:20:32,460 then the last thing I want to say is, 517 00:20:32,460 --> 00:20:34,330 perhaps your take away from looking at this architecture 518 00:20:34,330 --> 00:20:36,750 diagram is, oh my gosh, these Rails guys completely- 519 00:20:36,750 --> 00:20:38,309 Y.K.: They really jumped the shark. 520 00:20:38,309 --> 00:20:40,870 T.D.: They jumped the shark. They ditched Rails. I 521 00:20:40,870 --> 00:20:42,330 saw, like, three Tweets yesterday - I wasn't here, 522 00:20:42,330 --> 00:20:43,450 I was in Portland yesterday, but I saw, like, 523 00:20:43,450 --> 00:20:44,340 three Tweets that were like, I'm at RailsConf and 524 00:20:44,340 --> 00:20:48,809 I haven't seen a single talk about, like, Rails. 525 00:20:48,809 --> 00:20:51,950 So that's true here, too. But, I want to 526 00:20:51,950 --> 00:20:54,940 assure you that we are only using this stack 527 00:20:54,940 --> 00:20:58,070 for the heavy computation. We started in Rails. We 528 00:20:58,070 --> 00:21:00,730 started, we were like, hey, what do we need. 529 00:21:00,730 --> 00:21:01,929 Ah, well, people probably need to authenticate and log 530 00:21:01,929 --> 00:21:05,090 in, and we probably need to do billing. And 531 00:21:05,090 --> 00:21:06,220 those are all things that Rails is really, really 532 00:21:06,220 --> 00:21:08,830 good at. So we started with Rails as, basically, 533 00:21:08,830 --> 00:21:11,110 the starting point, and then when we realized oh 534 00:21:11,110 --> 00:21:14,039 my gosh, computation is really slow. There's no way 535 00:21:14,039 --> 00:21:15,240 we're gonna be able to offer this service. OK. 536 00:21:15,240 --> 00:21:16,059 Now let's think about how we can do all 537 00:21:16,059 --> 00:21:16,270 of that. 
538 00:21:16,270 --> 00:21:18,510 Y.K.: And I think notably, a lot of people 539 00:21:18,510 --> 00:21:20,049 who look at Rails are like, there's a lot 540 00:21:20,049 --> 00:21:21,750 of companies that have built big stuff on Rails, 541 00:21:21,750 --> 00:21:24,090 and their attitude is, like, oh, this legacy terrible 542 00:21:24,090 --> 00:21:25,409 Rails app. I really wish we could get rid 543 00:21:25,409 --> 00:21:26,850 of it. If we could just write everything in 544 00:21:26,850 --> 00:21:30,760 Scala or Clojure or Go, everything would be amazing. 545 00:21:30,760 --> 00:21:31,500 That is definitely not our attitude. Our attitude is 546 00:21:31,500 --> 00:21:34,320 that Rails is really amazing, in particular, at the 547 00:21:34,320 --> 00:21:36,740 kinds of things that are really common across everyone's 548 00:21:36,740 --> 00:21:39,900 web applications - authentication, billing, et cetera. And we 549 00:21:39,900 --> 00:21:41,429 really want to be using Rails for the parts 550 00:21:41,429 --> 00:21:43,360 of our app- even things like error-tracking, we do 551 00:21:43,360 --> 00:21:45,039 through the Rails app. We want to be using 552 00:21:45,039 --> 00:21:47,470 Rails because it's very productive at doing those things. 553 00:21:47,470 --> 00:21:48,789 It happens to be very slow with doing data 554 00:21:48,789 --> 00:21:50,330 crunching, so we're gonna use a different tool for 555 00:21:50,330 --> 00:21:50,539 that. 556 00:21:50,539 --> 00:21:51,909 But I don't think you'll ever see me getting 557 00:21:51,909 --> 00:21:54,210 up and saying, ah, I really wish we had 558 00:21:54,210 --> 00:21:55,080 just started writing, you know, the Rails app in 559 00:21:55,080 --> 00:21:55,159 Rust. 560 00:21:55,159 --> 00:21:55,309 T.D.: Yeah. 561 00:21:55,309 --> 00:21:58,090 Y.K.: That would be terrible.
562 00:21:58,090 --> 00:22:02,429 T.D.: So that's number one, is, is, honest response 563 00:22:02,429 --> 00:22:04,390 times, which we're, which it turns out, seems like 564 00:22:04,390 --> 00:22:08,289 it should be easy, requires storing an insane amount of 565 00:22:08,289 --> 00:22:09,169 data. 566 00:22:09,169 --> 00:22:10,620 So the second thing that we realized when we 567 00:22:10,620 --> 00:22:12,059 were looking at a lot of these tools, is 568 00:22:12,059 --> 00:22:14,360 that most of them focus on data. They focus 569 00:22:14,360 --> 00:22:16,590 on giving you the raw data. But I'm not 570 00:22:16,590 --> 00:22:19,130 a machine. I'm not a computer. I don't enjoy 571 00:22:19,130 --> 00:22:21,320 sifting through data. That's what computers are good for. 572 00:22:21,320 --> 00:22:23,289 I would rather be drinking a beer. It's really 573 00:22:23,289 --> 00:22:24,830 nice in Portland, this time of year. 574 00:22:24,830 --> 00:22:27,409 So, we wanted to think about, if you're trying 575 00:22:27,409 --> 00:22:31,179 to solve the performance problems in your application, what 576 00:22:31,179 --> 00:22:32,840 are the things that you would suss out with 577 00:22:32,840 --> 00:22:35,760 the existing tools after spending, like, four hours depleting 578 00:22:35,760 --> 00:22:37,510 your ego to get there? 579 00:22:37,510 --> 00:22:38,929 Y.K.: And I think part of this is just 580 00:22:38,929 --> 00:22:42,260 people are actually very, people like to think that 581 00:22:42,260 --> 00:22:43,880 they're gonna use these tools, but when the tools 582 00:22:43,880 --> 00:22:45,320 require you to dig through a lot of data, 583 00:22:45,320 --> 00:22:47,090 people just don't use them very much.
So, the 584 00:22:47,090 --> 00:22:48,330 goal here was to build a tool that people 585 00:22:48,330 --> 00:22:50,809 actually use and actually like using, and not to 586 00:22:50,809 --> 00:22:54,870 build a tool that happens to provide a lot 587 00:22:54,870 --> 00:22:55,039 of data you can sift through. 588 00:22:55,039 --> 00:22:55,059 T.D.: Yes. 589 00:22:55,059 --> 00:22:55,929 So, probably the, one of the first things that 590 00:22:55,929 --> 00:22:58,529 we realized is that we don't want to provide. 591 00:22:58,529 --> 00:23:00,400 This is a trace of a request, you've probably 592 00:23:00,400 --> 00:23:04,070 seen similar UIs using other tools, using, for example, 593 00:23:04,070 --> 00:23:07,059 the inspector in, in like Chrome or Safari, and 594 00:23:07,059 --> 00:23:08,700 this is just showing basically, it's basically a visual 595 00:23:08,700 --> 00:23:10,830 stack trace of where your application is spending its 596 00:23:10,830 --> 00:23:11,600 time. 597 00:23:11,600 --> 00:23:13,950 But I think what was important for us is 598 00:23:13,950 --> 00:23:17,809 showing not just a single request, because your app 599 00:23:17,809 --> 00:23:20,570 handles, you know, hundreds of thousands of requests, or 600 00:23:20,570 --> 00:23:22,679 millions of requests. So looking at a single request 601 00:23:22,679 --> 00:23:24,630 statistically is complete, it's just noise. 602 00:23:24,630 --> 00:23:26,500 Y.K.: And it's especially bad if it's the worst 603 00:23:26,500 --> 00:23:28,659 request, because the worst request is, is really noise. 604 00:23:28,659 --> 00:23:30,850 It's like, a hiccup in the network, right. 605 00:23:30,850 --> 00:23:31,250 T.D.: It's the outlier. Yeah. 606 00:23:31,250 --> 00:23:32,150 Y.K.: It's literally the outlier. 607 00:23:32,150 --> 00:23:35,659 T.D.: It's literally the outlier. Yup. 
So, what we 608 00:23:35,659 --> 00:23:38,700 present in Skylight is something a little bit different, 609 00:23:38,700 --> 00:23:41,770 and it's something that we call the aggregate trace. 610 00:23:41,770 --> 00:23:46,260 So the aggregate trace is basically us taking all 611 00:23:46,260 --> 00:23:49,559 of your requests, averaging them out where each of 612 00:23:49,559 --> 00:23:51,750 these things spends their time, and then showing you 613 00:23:51,750 --> 00:23:54,899 that. So this is basically like, this is like, 614 00:23:54,899 --> 00:23:57,929 this is like the statue of David. It is 615 00:23:57,929 --> 00:24:00,880 the idealized form of the stack trace of how 616 00:24:00,880 --> 00:24:02,270 your application's behaving. 617 00:24:02,270 --> 00:24:05,330 But, of course, you have the same problem as 618 00:24:05,330 --> 00:24:07,500 before, which is, if this is all that we 619 00:24:07,500 --> 00:24:10,580 were showing you, it would be obscuring a lot 620 00:24:10,580 --> 00:24:12,870 of information. You want to actually be able to 621 00:24:12,870 --> 00:24:13,990 tell the difference between, OK, what's my stack trace 622 00:24:13,990 --> 00:24:16,419 look like for fast requests, and how does that 623 00:24:16,419 --> 00:24:18,539 differ from requests that are slower. 624 00:24:18,539 --> 00:24:20,860 So what we've got, I've got a little video 625 00:24:20,860 --> 00:24:22,320 here. You can see that when I move the 626 00:24:22,320 --> 00:24:26,490 slider, that this trace below it is actually updating 627 00:24:26,490 --> 00:24:29,130 in real time. As I move the slider around, 628 00:24:29,130 --> 00:24:31,770 you can see that the aggregate trace actually updates 629 00:24:31,770 --> 00:24:34,240 with it. And that's because we're collecting all this 630 00:24:34,240 --> 00:24:36,159 information. We're collecting, like I said, a lot of 631 00:24:36,159 --> 00:24:38,669 data. 
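(Editor's note: one plausible sketch of how such an aggregate trace can be reassembled: store an averaged trace plus a request count per histogram bucket, then merge whatever buckets the slider selects using a count-weighted average. The bucket labels, node names, and numbers below are invented for illustration.)

```ruby
# Hypothetical sketch of reassembling an aggregate trace from per-bucket
# data: each histogram bucket holds a request count and an averaged trace,
# and a selected range of buckets is merged with a count-weighted average.

BUCKETS = {
  "0-100ms"   => { count: 30, trace: { "controller" => 20, "db" => 50 } },
  "100-200ms" => { count: 10, trace: { "controller" => 30, "db" => 120 } },
}

def aggregate_trace(buckets, selected_keys)
  total = selected_keys.sum { |key| buckets[key][:count] }
  merged = Hash.new(0.0)
  selected_keys.each do |key|
    weight = buckets[key][:count].to_f / total
    buckets[key][:trace].each { |node, ms| merged[node] += ms * weight }
  end
  merged
end

all_requests = aggregate_trace(BUCKETS, BUCKETS.keys)  # whole histogram
fastest_only = aggregate_trace(BUCKETS, ["0-100ms"])   # slider selection
```

Because the merge is just arithmetic over precomputed buckets, moving the slider never needs a round trip to the raw request data, which is what makes the UI update in real time.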
We can recompute this aggregate trace on the 632 00:24:38,669 --> 00:24:38,909 fly. 633 00:24:38,909 --> 00:24:41,200 Basically, for each bucket, we're storing a different trace, 634 00:24:41,200 --> 00:24:42,880 and then on the client we're reassembling that. We'll 635 00:24:42,880 --> 00:24:43,899 go into that a little bit. 636 00:24:43,899 --> 00:24:45,799 Y.K.: And I think it's really important that you 637 00:24:45,799 --> 00:24:48,370 be able to do these experiments quickly. If every 638 00:24:48,370 --> 00:24:50,059 time you think, oh, I wonder what happens if 639 00:24:50,059 --> 00:24:52,260 I add another histogram bucket, if it requires a 640 00:24:52,260 --> 00:24:54,830 whole full page refresh. Then that would basically make 641 00:24:54,830 --> 00:24:56,309 people not want to use the tool. Not able 642 00:24:56,309 --> 00:24:58,580 to use the tool. So, actually building something which 643 00:24:58,580 --> 00:24:59,649 is real time and fast, gets the data as 644 00:24:59,649 --> 00:25:00,110 it comes, was really important to us. 645 00:25:00,110 --> 00:25:01,220 T.D.: So that's number one. 646 00:25:01,220 --> 00:25:04,850 And the second thing. So we built that, and 647 00:25:04,850 --> 00:25:07,929 we're like, OK, well what's next? And I think 648 00:25:07,929 --> 00:25:09,250 that the big problem with this is that you 649 00:25:09,250 --> 00:25:12,020 need to know that there's a problem before you 650 00:25:12,020 --> 00:25:14,429 go look at it, right. So we have been 651 00:25:14,429 --> 00:25:16,080 working for the past few months, and the Storm 652 00:25:16,080 --> 00:25:18,390 infrastructure that we built makes it pretty straight-forward to 653 00:25:18,390 --> 00:25:21,149 start building more abstractions on top of the data 654 00:25:21,149 --> 00:25:21,559 that we've already collected. 655 00:25:21,559 --> 00:25:24,120 It's a very declarative system. So we've been working 656 00:25:24,120 --> 00:25:26,679 on a feature called inspections. 
And what's cool about 657 00:25:26,679 --> 00:25:29,279 inspections is that we can look at this tremendous 658 00:25:29,279 --> 00:25:31,270 volume of data that we've collected from your app, 659 00:25:31,270 --> 00:25:33,840 and we can automatically tease out what the problems 660 00:25:33,840 --> 00:25:35,210 are. So the first one that we shipped, this 661 00:25:35,210 --> 00:25:37,399 is in beta right now. It's not, it's not 662 00:25:37,399 --> 00:25:39,840 out and enabled by default, but there, it's behind 663 00:25:39,840 --> 00:25:42,440 a feature flag that we've had some users turning 664 00:25:42,440 --> 00:25:42,730 on. 665 00:25:42,730 --> 00:25:44,419 And, and trying out. And so what we can 666 00:25:44,419 --> 00:25:46,450 do in this case, is because we have information 667 00:25:46,450 --> 00:25:48,730 about all of the database queries in your app, 668 00:25:48,730 --> 00:25:50,840 we can look and see if you have n 669 00:25:50,840 --> 00:25:52,390 plus one queries. Can you maybe explain what an 670 00:25:52,390 --> 00:25:53,250 n plus one query is? 671 00:25:53,250 --> 00:25:54,600 Y.K.: Yeah. So, I'm, people know, hopefully, what n 672 00:25:54,600 --> 00:25:56,770 plus one queries are. But the, it's the idea that 673 00:25:56,770 --> 00:25:59,260 you, by accident, for some reason, instead of making 674 00:25:59,260 --> 00:26:01,970 one query, you ask for like all the posts 675 00:26:01,970 --> 00:26:02,940 and then you iterated through all of them and 676 00:26:02,940 --> 00:26:04,940 got all the comments and now you, instead of 677 00:26:04,940 --> 00:26:08,679 having one query, you have one query per post, 678 00:26:08,679 --> 00:26:10,309 right. And you, what I, what I like to 679 00:26:10,309 --> 00:26:12,549 do is do eager loading, where you say include 680 00:26:12,549 --> 00:26:14,559 comments, right. But you have to know that you 681 00:26:14,559 --> 00:26:15,039 have to do that.
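(Editor's note: a toy version of how a monitor might flag this pattern automatically: normalize away the literal bind values so the repeated queries collapse to one shape, then report any shape that repeats past a threshold within a request. The regexes and the threshold here are illustrative, not Skylight's actual rules.)

```ruby
# Toy n + 1 detector: replace literal binds with "?" so repeated queries
# look identical, then flag query shapes that recur too often in one
# request. Regexes and threshold are illustrative only.

def normalize(sql)
  sql.gsub(/'[^']*'/, "?")  # string literals  -> ?
     .gsub(/\b\d+\b/, "?")  # numeric literals -> ?
end

def n_plus_one_candidates(queries, threshold: 3)
  queries.group_by { |sql| normalize(sql) }
         .select { |_, group| group.size >= threshold }
         .keys
end

request_queries = [
  "SELECT * FROM comments WHERE post_id = 1",
  "SELECT * FROM comments WHERE post_id = 2",
  "SELECT * FROM comments WHERE post_id = 3",
  "SELECT * FROM posts WHERE id = 1",
]

n_plus_one_candidates(request_queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```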
682 00:26:15,039 --> 00:26:16,899 So there's some tools that will run in development 683 00:26:16,899 --> 00:26:18,380 mode, if you happen to catch it, like 684 00:26:18,380 --> 00:26:20,460 Bullet. This is basically a tool that's looking at 685 00:26:20,460 --> 00:26:22,210 every single one of your classes and has some 686 00:26:22,210 --> 00:26:24,169 thresholds that, once we see that a bunch of 687 00:26:24,169 --> 00:26:27,429 your requests have the same exact query, so we 688 00:26:27,429 --> 00:26:29,549 do some work to pull out binds. So if 689 00:26:29,549 --> 00:26:32,200 it's, like, where something equals one, we will automatically 690 00:26:32,200 --> 00:26:34,110 pull out the one and replace it with a 691 00:26:34,110 --> 00:26:34,740 question mark. 692 00:26:34,740 --> 00:26:36,230 And then we basically take all those queries, if 693 00:26:36,230 --> 00:26:39,529 they're the exact same query repeated multiple times, subject 694 00:26:39,529 --> 00:26:41,390 to some thresholds, we'll start showing you hey, there's 695 00:26:41,390 --> 00:26:42,450 an n plus one query. 696 00:26:42,450 --> 00:26:43,799 And you can imagine this same sort of thing 697 00:26:43,799 --> 00:26:46,320 being done for things, like, are you missing an 698 00:26:46,320 --> 00:26:49,690 index, right. Or, are you using the Ruby version 699 00:26:49,690 --> 00:26:50,950 of JSON when you should be using the native 700 00:26:50,950 --> 00:26:52,179 version of JSON. These are all things that we 701 00:26:52,179 --> 00:26:55,140 can start detecting just because we're consuming an enormous 702 00:26:55,140 --> 00:26:57,510 amount of information, and we can start writing some 703 00:26:57,510 --> 00:26:59,330 heuristics for bubbling it up. 704 00:26:59,330 --> 00:27:02,330 So, third and final breakthrough, we realized that we 705 00:27:02,330 --> 00:27:05,289 really, really needed a lightning fast UI. Something really 706 00:27:05,289 --> 00:27:08,279 responsive.
So, in particular, the feedback loop is critical, 707 00:27:08,279 --> 00:27:09,929 right. You can imagine, if the way that you 708 00:27:09,929 --> 00:27:12,279 dug into data was you clicked and you wait 709 00:27:12,279 --> 00:27:14,320 an hour, and then you get your results, no 710 00:27:14,320 --> 00:27:15,730 one would do it. No one would ever do 711 00:27:15,730 --> 00:27:15,890 it. 712 00:27:15,890 --> 00:27:19,090 And the existing tools are OK, but you click 713 00:27:19,090 --> 00:27:20,470 and you wait. You look at it and you're 714 00:27:20,470 --> 00:27:21,730 like, oh, I want a different view, so then 715 00:27:21,730 --> 00:27:23,240 you go edit your query and then you click 716 00:27:23,240 --> 00:27:25,360 and you wait and it's just not a pleasant 717 00:27:25,360 --> 00:27:26,600 experience. 718 00:27:26,600 --> 00:27:28,850 So, so we use Ember, the, the UI that 719 00:27:28,850 --> 00:27:31,250 you're using when you log into Skylight. Even though 720 00:27:31,250 --> 00:27:33,289 it feels just like a regular website, it doesn't 721 00:27:33,289 --> 00:27:35,940 feel like a native app, is powered, all of 722 00:27:35,940 --> 00:27:37,679 the routing, all of the rendering, all of the 723 00:27:37,679 --> 00:27:40,769 decision making, is happening in, as an Ember.js app, 724 00:27:40,769 --> 00:27:43,049 and we pair that with D3. So all of 725 00:27:43,049 --> 00:27:44,830 the charts, the charts that you saw there in 726 00:27:44,830 --> 00:27:48,039 the aggregate trace, that is all Ember components powered 727 00:27:48,039 --> 00:27:48,970 by D3. 728 00:27:48,970 --> 00:27:52,860 So, this has actually significantly cleaned up our client-side 729 00:27:52,860 --> 00:27:55,679 code. It makes re-usability really, really awesome.
So to 730 00:27:55,679 --> 00:27:57,039 give you an example, this is from our billing 731 00:27:57,039 --> 00:27:58,789 page that I, the designer came and they had, 732 00:27:58,789 --> 00:28:01,260 they had a component that was like, the date 733 00:28:01,260 --> 00:28:01,809 component. 734 00:28:01,809 --> 00:28:02,919 And, the- 735 00:28:02,919 --> 00:28:05,899 T.D.: It seems really boring at first. 736 00:28:05,899 --> 00:28:06,799 Y.K.: It seemed really boring. But, this is the 737 00:28:06,799 --> 00:28:08,950 implementation, right. So you could copy and paste this 738 00:28:08,950 --> 00:28:11,059 code over and over again, everywhere you go. Just 739 00:28:11,059 --> 00:28:12,750 remember to format it correctly. If you forget to 740 00:28:12,750 --> 00:28:15,070 format it, it's not gonna look the same everywhere. 741 00:28:15,070 --> 00:28:17,460 But I was like, hey, we're using this all 742 00:28:17,460 --> 00:28:18,010 over the place. Why don't we bundle this up 743 00:28:18,010 --> 00:28:20,070 into a component? And so with Ember, it was 744 00:28:20,070 --> 00:28:22,230 super easy. We basically just said, OK, here's a new 745 00:28:22,230 --> 00:28:24,590 calendar date component. It has a property on it 746 00:28:24,590 --> 00:28:26,460 called date. Just set that to any JavaScript Date 747 00:28:26,460 --> 00:28:28,059 object. Just set, you don't have to remember about 748 00:28:28,059 --> 00:28:30,450 converting it or formatting it. Here's the component. Set 749 00:28:30,450 --> 00:28:31,840 the date and it will render the correct thing 750 00:28:31,840 --> 00:28:32,760 automatically.
751 00:28:32,760 --> 00:28:36,039 And, so the architecture of the Ember app looks 752 00:28:36,039 --> 00:28:37,640 a little bit, something like this, where you have 753 00:28:37,640 --> 00:28:39,919 many, many different components, most of them just driven 754 00:28:39,919 --> 00:28:42,370 by D3, and then they're plugged into the model 755 00:28:42,370 --> 00:28:43,480 and the controller. 756 00:28:43,480 --> 00:28:44,909 And the Ember app will go fetch those models 757 00:28:44,909 --> 00:28:46,750 from the cloud, and the cloud from the Java 758 00:28:46,750 --> 00:28:50,190 app, which just queries Cassandra, and render them. And 759 00:28:50,190 --> 00:28:53,429 what's neat about this model is turning on web 760 00:28:53,429 --> 00:28:56,360 sockets is super easy, right. Because all of these 761 00:28:56,360 --> 00:28:58,860 components are bound to a single place. So when 762 00:28:58,860 --> 00:29:00,890 the web socket says, hey, we have updated information 763 00:29:00,890 --> 00:29:02,630 for you to show, it just pushes it onto 764 00:29:02,630 --> 00:29:04,980 the model or onto the controller, and the whole 765 00:29:04,980 --> 00:29:06,159 UI updates automatically. 766 00:29:06,159 --> 00:29:06,890 It's like magic. 767 00:29:06,890 --> 00:29:07,230 And- 768 00:29:07,230 --> 00:29:08,250 Y.K.: Like magic. 769 00:29:08,250 --> 00:29:09,679 T.D.: It's like magic. And, and when debugging, this 770 00:29:09,679 --> 00:29:11,559 is especially awesome too, because, and I'll maybe show 771 00:29:11,559 --> 00:29:15,080 a demo of the Ember inspector. It's nice. 772 00:29:15,080 --> 00:29:17,830 So. Yeah. So, lightning fast UI. Reducing the feedback 773 00:29:17,830 --> 00:29:19,510 loop so that you can quickly play with your 774 00:29:19,510 --> 00:29:21,880 data, makes it go from a chore to something 775 00:29:21,880 --> 00:29:23,620 that actually feels kind of fun. 
776 00:29:23,620 --> 00:29:27,039 So, these were the breakthroughs that we had when 777 00:29:27,039 --> 00:29:28,440 we were building Skylight. The things that made us 778 00:29:28,440 --> 00:29:29,980 think, yes, this is actually a product that we 779 00:29:29,980 --> 00:29:31,940 think deserves to be on the market. So, one, 780 00:29:31,940 --> 00:29:33,860 honest response times. Collect data that no one else 781 00:29:33,860 --> 00:29:36,549 can collect. Focus on answers instead of just dumping 782 00:29:36,549 --> 00:29:38,289 data, and have a lightning fast UI to do 783 00:29:38,289 --> 00:29:38,409 it. 784 00:29:38,409 --> 00:29:40,100 So we like to think of Skylight as basically 785 00:29:40,100 --> 00:29:42,690 a smart profiler. It's a smart profiler that runs 786 00:29:42,690 --> 00:29:44,350 in production. It's like the profiler that you run 787 00:29:44,350 --> 00:29:47,230 on your local development machine, but instead of being 788 00:29:47,230 --> 00:29:49,179 on your local dev box which has nothing to 789 00:29:49,179 --> 00:29:51,610 do with the performance characteristics of what your users 790 00:29:51,610 --> 00:29:53,450 are experiencing, we're actually running in production. 791 00:29:53,450 --> 00:29:58,919 So, let me just give you guys a quick 792 00:29:58,919 --> 00:30:00,390 demo. 793 00:30:00,390 --> 00:30:03,120 So, this is what the Skylight, this is what 794 00:30:03,120 --> 00:30:07,610 Skylight looks like. What's under this? There we go. 795 00:30:07,610 --> 00:30:09,620 So, the first thing here is we've got the 796 00:30:09,620 --> 00:30:12,669 app dashboard. So this, it's like our, 95th 797 00:30:12,669 --> 00:30:15,500 responsile- 95th percentile response time has peaked. Maybe you're 798 00:30:15,500 --> 00:30:17,970 all hammering it right now. That would be nice.
799 00:30:17,970 --> 00:30:19,940 So, this is a graph of your response time 800 00:30:19,940 --> 00:30:22,010 over time, and then on the right, this is 801 00:30:22,010 --> 00:30:24,700 the graph of the RPMs, the requests per minute 802 00:30:24,700 --> 00:30:26,750 that your app is handling. So this is app-wide. 803 00:30:26,750 --> 00:30:29,440 And this is live. This updates every minute. 804 00:30:29,440 --> 00:30:31,039 Then down below, you have a list of the 805 00:30:31,039 --> 00:30:33,730 end points in your application. So you can see, 806 00:30:33,730 --> 00:30:35,700 actually, the top, the slowest ones for us were, 807 00:30:35,700 --> 00:30:37,789 we have an instrumentation API, and we've gone and 808 00:30:37,789 --> 00:30:39,929 instrumented our background workers. So we can see them 809 00:30:39,929 --> 00:30:42,010 here, and their response time plays in. So we 810 00:30:42,010 --> 00:30:44,220 can see that we have this reporting worker that's 811 00:30:44,220 --> 00:30:46,899 taking 95th percentile, thirteen seconds. 812 00:30:46,899 --> 00:30:48,880 Y.K.: So all that time used to be inside 813 00:30:48,880 --> 00:30:51,500 of some request somewhere, and we discovered that there 814 00:30:51,500 --> 00:30:52,840 was a lot of time being spent in things 815 00:30:52,840 --> 00:30:54,679 that we could push to the background. We probably 816 00:30:54,679 --> 00:30:56,789 need to update the agony index so that it 817 00:30:56,789 --> 00:30:59,190 doesn't make workers very high, because spending some time 818 00:30:59,190 --> 00:31:02,120 in your workers is not that big of a 819 00:31:02,120 --> 00:31:02,130 deal. 
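(Editor's note: the "95th percentile" figures on the dashboard can be defined with a naive nearest-rank sort, sketched below. At Skylight's data volumes a streaming approximation would be used instead of sorting every request, so treat this purely as a definition of the metric.)

```ruby
# Naive nearest-rank percentile, as a definition of the "95th percentile
# response time": the duration below which 95% of requests fall.

def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil - 1
  sorted[[rank, 0].max]  # clamp so tiny inputs don't index below zero
end

response_times = (1..100).to_a  # pretend durations in ms

percentile(response_times, 95)  # => 95
percentile(response_times, 50)  # => 50
```

The 95th percentile is generally more honest than the mean for response times, since a handful of very slow outliers can drag an average far away from what most users actually see.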
820 00:31:02,130 --> 00:31:03,000 T.D.: So, so then, if we dive into one 821 00:31:03,000 --> 00:31:05,299 of these, you can see that for this request, 822 00:31:05,299 --> 00:31:07,000 we've got the time explorer up above, and that 823 00:31:07,000 --> 00:31:10,429 shows a graph of response time at, again, 95th 824 00:31:10,429 --> 00:31:11,840 percentile, and you can, if you want to go 825 00:31:11,840 --> 00:31:13,549 back and look at historical data, you just drag 826 00:31:13,549 --> 00:31:15,250 it like this. And this has got a brush, 827 00:31:15,250 --> 00:31:16,980 so you can zoom in and out on different 828 00:31:16,980 --> 00:31:17,760 times. 829 00:31:17,760 --> 00:31:19,649 And every time you change the range, you can 830 00:31:19,649 --> 00:31:21,360 see that it's very responsive. It's never waiting for 831 00:31:21,360 --> 00:31:23,039 the server. But it is going back and fetching 832 00:31:23,039 --> 00:31:25,080 data from the server and then when the data 833 00:31:25,080 --> 00:31:29,210 comes back, you see the whole UI just updates. 834 00:31:29,210 --> 00:31:29,250 And we get that for free with Ember and 835 00:31:29,250 --> 00:31:31,190 And then down below, as we discussed, you actually 836 00:31:31,190 --> 00:31:33,760 have a real histogram. And this histogram, in this 837 00:31:33,760 --> 00:31:37,159 case, is showing. So this is for fifty-seven requests. 838 00:31:37,159 --> 00:31:39,019 And if we click and drag, we could just 839 00:31:39,019 --> 00:31:40,429 move this. And you can see that the aggregate 840 00:31:40,429 --> 00:31:43,360 trace below updates in response to us dragging this. 841 00:31:43,360 --> 00:31:44,919 And if we want to look at the fastest 842 00:31:44,919 --> 00:31:47,500 quartile, we just click faster and we'll just choose 843 00:31:47,500 --> 00:31:48,149 that range on the histogram. 844 00:31:48,149 --> 00:31:49,210 Y.K.: I think it's the fastest load. 845 00:31:49,210 --> 00:31:50,899 T.D.: The fastest load. 
And then if you click 846 00:31:50,899 --> 00:31:52,899 on Slower, you can see the slower requests. So 847 00:31:52,899 --> 00:31:54,669 this makes it really easy to compare and contrast: 848 00:31:54,669 --> 00:31:56,710 OK, why are certain requests fast and why are 849 00:31:56,710 --> 00:31:58,529 certain requests slow? 850 00:31:58,529 --> 00:32:00,779 You can see these blue areas. This 851 00:32:00,779 --> 00:32:03,559 is Ruby code. So, right now it's not super 852 00:32:03,559 --> 00:32:05,820 granular. It would be nice if you could actually 853 00:32:05,820 --> 00:32:07,940 know what was going on here. But it'll at 854 00:32:07,940 --> 00:32:09,940 least tell you where in your controller action this 855 00:32:09,940 --> 00:32:12,690 is happening, and then you can actually see which 856 00:32:12,690 --> 00:32:15,919 database queries are being executed, and what their durations 857 00:32:15,919 --> 00:32:16,080 are. 858 00:32:16,080 --> 00:32:17,889 And you can see that we actually extract the 859 00:32:17,889 --> 00:32:20,419 SQL and we normalize it, so you 860 00:32:20,419 --> 00:32:22,159 can see exactly 861 00:32:22,159 --> 00:32:24,019 what those queries are even if the values are 862 00:32:24,019 --> 00:32:24,820 totally different between them. 863 00:32:24,820 --> 00:32:27,649 Y.K.: Yeah. So the real query, since Rails doesn't 864 00:32:27,649 --> 00:32:29,730 yet support bind parameter extraction, is like, WHERE id 865 00:32:29,730 --> 00:32:32,169 equals one, or ten, or whatever. 866 00:32:32,169 --> 00:32:33,659 T.D.: Yup. So that's pretty cool. 867 00:32:33,659 --> 00:32:37,429 Y.K.: So one other thing is, initially we 868 00:32:37,429 --> 00:32:39,269 actually just showed the whole trace, but we discovered 869 00:32:39,269 --> 00:32:41,659 that, obviously, when you show whole traces you have 870 00:32:41,659 --> 00:32:43,639 information that doesn't really matter that much.
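The normalization Yehuda describes — so that queries differing only in their literal values group together — can be sketched in a few lines of Ruby. This is a simplified, hypothetical stand-in for whatever Skylight actually does; the method name and the regexes are illustrative only.

```ruby
# Hypothetical sketch of SQL normalization: replace literal values with "?"
# so queries that differ only in their values aggregate together.
# (Not Skylight's actual implementation.)
def normalize_sql(sql)
  sql
    .gsub(/'(?:[^']|'')*'/, "?")   # single-quoted string literals
    .gsub(/\b\d+(\.\d+)?\b/, "?")  # numeric literals
end

a = normalize_sql("SELECT * FROM users WHERE id = 1")
b = normalize_sql("SELECT * FROM users WHERE id = 10")
puts a  # => SELECT * FROM users WHERE id = ?
puts a == b  # the two queries now aggregate under one key
```

With this, "WHERE id = 1" and "WHERE id = 10" denormalize to the same string, which is exactly what lets the UI show one row per query shape.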
So we've 871 00:32:43,639 --> 00:32:47,340 recently started to collapse 872 00:32:47,340 --> 00:32:48,850 things that don't matter so much, so that you 873 00:32:48,850 --> 00:32:51,090 can expand or condense the trace. 874 00:32:51,090 --> 00:32:52,519 And we wanted to make it so you don't 875 00:32:52,519 --> 00:32:55,690 have to think about expanding or condensing individual areas; 876 00:32:55,690 --> 00:32:57,960 you just see what matters the most, and 877 00:32:57,960 --> 00:32:59,100 then you can drill into the trivial areas. 878 00:32:59,100 --> 00:33:02,179 T.D.: Yup. So that's the demo of Skylight. 879 00:33:02,179 --> 00:33:04,190 We'd really like it if you checked it out. 880 00:33:04,190 --> 00:33:05,899 There is one more thing I want to show 881 00:33:05,899 --> 00:33:07,720 you that is, like, really freaking cool. This is 882 00:33:07,720 --> 00:33:10,529 coming out of Tilde labs. Carl has 883 00:33:10,529 --> 00:33:13,730 been hacking on this, staying up past midnight, getting 884 00:33:13,730 --> 00:33:15,769 almost no sleep for the past month, trying to 885 00:33:15,769 --> 00:33:16,730 have this ready. 886 00:33:16,730 --> 00:33:19,090 I don't know how many of you know this, 887 00:33:19,090 --> 00:33:23,630 but Ruby 2.1 has a new 888 00:33:23,630 --> 00:33:27,950 stack sampling feature. So you can get really 889 00:33:27,950 --> 00:33:31,149 granular information about how your Ruby code is performing. 890 00:33:31,149 --> 00:33:33,450 So I want to show you. I just mentioned 891 00:33:33,450 --> 00:33:34,570 how it would be nice if we could get 892 00:33:34,570 --> 00:33:36,830 more information about what your Ruby code is 893 00:33:36,830 --> 00:33:38,760 doing. And now we can do that.
894 00:33:38,760 --> 00:33:42,039 Basically, every few milliseconds, this code that Carl wrote 895 00:33:42,039 --> 00:33:44,399 goes into MRI 896 00:33:44,399 --> 00:33:47,419 and takes a snapshot of the stack. 897 00:33:47,419 --> 00:33:50,769 And because this is built in, it's very low-impact. It's 898 00:33:50,769 --> 00:33:53,570 not allocating any new memory, and there's very little performance 899 00:33:53,570 --> 00:33:55,769 hit. Basically, you wouldn't even notice it. And so 900 00:33:55,769 --> 00:33:58,149 every few milliseconds it's sampling, and we take that 901 00:33:58,149 --> 00:34:00,260 information and we send it up to our servers. 902 00:34:00,260 --> 00:34:02,260 So it's almost like you're running a Ruby profiler on 903 00:34:02,260 --> 00:34:05,220 your local dev box, where you get extremely granular 904 00:34:05,220 --> 00:34:07,159 information about where your code is spending its time 905 00:34:07,159 --> 00:34:09,010 in Ruby, per method, all of those things. 906 00:34:09,010 --> 00:34:11,909 But it's happening in production. 907 00:34:11,909 --> 00:34:16,409 So this is it; we enabled 908 00:34:16,409 --> 00:34:18,399 it in staging. You can see that we've got 909 00:34:18,399 --> 00:34:19,600 some rendering bugs. It's still in beta. 910 00:34:19,600 --> 00:34:21,918 Y.K.: Yeah, and we haven't yet collapsed things that 911 00:34:21,918 --> 00:34:21,980 are not important- 912 00:34:21,980 --> 00:34:22,020 T.D.: Yes. 913 00:34:22,020 --> 00:34:23,270 Y.K.: -for this particular feature. 914 00:34:23,270 --> 00:34:24,170 T.D.: So we want to 915 00:34:24,170 --> 00:34:27,610 hide things like framework code, obviously. But this 916 00:34:27,610 --> 00:34:31,070 gives you an incredibly granular view of what 917 00:34:31,070 --> 00:34:35,659 your app is doing in production.
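The mechanism Tom is describing — periodically snapshotting a thread's stack and tallying which frames appear — can be approximated in plain Ruby with `Thread#backtrace`. The real agent uses MRI's internal C-level sampling API and is far cheaper; this sketch, with made-up method names, only illustrates the idea.

```ruby
# Toy sampling profiler: a background thread snapshots the target
# thread's stack every few milliseconds and tallies each frame.
# Skylight's agent does this via MRI's built-in stack sampling with
# much lower overhead; this is only an illustration.
def sample_profile(interval: 0.002)
  counts  = Hash.new(0)
  target  = Thread.current
  running = true
  sampler = Thread.new do
    while running
      snapshot = target.backtrace          # snapshot the stack
      snapshot&.each { |frame| counts[frame] += 1 }
      sleep interval
    end
  end
  yield                                    # run the code being profiled
  running = false
  sampler.join
  counts
end

# Some CPU-bound work to profile (hypothetical workload).
def busy_work
  200_000.times.reduce(0) { |sum, i| sum + i }
end

profile = sample_profile { 50.times { busy_work } }

# Frames from busy_work should dominate the tally.
profile.sort_by { |_, n| -n }.first(5).each { |frame, n| puts "#{n}  #{frame}" }
```

The key property is the one Tom names: the profiled code pays almost nothing per sample, yet the tally converges on where the time actually goes.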
918 00:34:35,659 --> 00:34:39,230 This is an API that's built into 919 00:34:39,230 --> 00:34:43,159 Ruby 2.1. Because our agent is running so low-level, 920 00:34:43,159 --> 00:34:44,659 because we wrote it in Rust, we have the 921 00:34:44,659 --> 00:34:47,409 ability to do things like this, and Carl thinks 922 00:34:47,409 --> 00:34:48,370 that we may be able to actually backport 923 00:34:48,370 --> 00:34:48,480 this to older Rubies, too. So if you're not 924 00:34:48,480 --> 00:34:50,130 on Ruby 2.1, we think that we can actually 925 00:34:50,130 --> 00:34:52,790 bring this to you. But that's TBD. 926 00:34:52,790 --> 00:34:55,480 Y.K.: Yeah, so I think the cool thing 927 00:34:55,480 --> 00:34:57,940 about this, in general, is when you run a 928 00:34:57,940 --> 00:34:59,430 sampling- so this is a sampling profiler, right? We 929 00:34:59,430 --> 00:35:01,260 don't want to be burdening every single thing that 930 00:35:01,260 --> 00:35:03,790 you do in your program with tracing, right? That 931 00:35:03,790 --> 00:35:05,380 would be very slow. 932 00:35:05,380 --> 00:35:06,920 So when you normally run a sampling profiler, you 933 00:35:06,920 --> 00:35:08,760 basically have to 934 00:35:08,760 --> 00:35:11,090 create a loop: run this code a million 935 00:35:11,090 --> 00:35:12,970 times and keep sampling it, and eventually you'll get enough 936 00:35:12,970 --> 00:35:15,030 samples to get the information. But it turns out 937 00:35:15,030 --> 00:35:17,280 that your production server is a loop. Your production 938 00:35:17,280 --> 00:35:20,560 server is serving tons and tons of requests. So, 939 00:35:20,560 --> 00:35:22,880 by simply taking a few microseconds 940 00:35:22,880 --> 00:35:25,580 out of every request and collecting a couple of 941 00:35:25,580 --> 00:35:27,210 samples, over time we can actually get this really 942 00:35:27,210 --> 00:35:29,700 high-fidelity picture with basically no cost.
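Yehuda's statistical argument — a couple of samples per request, aggregated across thousands of requests, converges on the true time distribution — can be shown with a tiny simulation. The phase names and the 70/30 split below are made up purely for illustration.

```ruby
# Sketch of the aggregation argument: each simulated request spends 70%
# of its time in :database and 30% in :view, and contributes just one
# random stack sample. Across many requests, the tally of where samples
# land converges on the true time split. (Illustrative numbers only.)
TIME_SPENT = { database: 0.7, view: 0.3 }

def sample_one_request
  # A uniformly random sample lands in a phase with probability
  # proportional to the time spent in that phase.
  rand < TIME_SPENT[:database] ? :database : :view
end

tally = Hash.new(0)
10_000.times { tally[sample_one_request] += 1 }

estimated = tally.transform_values { |n| n / 10_000.0 }
# estimated[:database] comes out near 0.7, estimated[:view] near 0.3,
# even though no single request was profiled in any depth.
```

Each individual request is barely touched, but the fleet-wide picture is high fidelity — which is the whole point of sampling in production rather than in a synthetic benchmark loop.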
943 00:35:29,700 --> 00:35:31,150 And that's pretty mind-blowing. And this is the kind 944 00:35:31,150 --> 00:35:34,650 of stuff that we can start doing by really 945 00:35:34,650 --> 00:35:37,250 caring about both the user experience and the 946 00:35:37,250 --> 00:35:40,830 implementation and getting really serious about it. And, 947 00:35:40,830 --> 00:35:42,700 honestly, this is a really exciting feature 948 00:35:42,700 --> 00:35:45,330 that really shows what we can do as we 949 00:35:45,330 --> 00:35:46,130 start building this out. 950 00:35:46,130 --> 00:35:47,140 T.D.: Once we've got that 951 00:35:47,140 --> 00:35:48,380 groundwork. 952 00:35:48,380 --> 00:35:49,820 So if you guys want to check it out: 953 00:35:49,820 --> 00:35:51,760 Skylight dot io. It's available today. It's no longer 954 00:35:51,760 --> 00:35:54,040 in private beta. Everyone can sign up. No invitation 955 00:35:54,040 --> 00:35:56,630 token necessary. And you can get a thirty-day free 956 00:35:56,630 --> 00:35:58,240 trial if you haven't started one already. So if 957 00:35:58,240 --> 00:35:59,620 you have any questions, please come see us right 958 00:35:59,620 --> 00:36:00,980 now, or we have a booth in the vendor 959 00:36:00,980 --> 00:36:03,140 hall. Thank you guys very much.