TOM DALE: Hey, you guys ready? Thank you guys so much for coming. This is awesome. When they were putting together the schedule, I said, make sure that you put us down in the Caves of Moria. So thank you guys for coming down and making it. I'm Tom. This is Yehuda.

YEHUDA KATZ: When people told me I was signed up to do a back-to-back talk, I don't know what I was thinking.

T.D.: Yup. So. We want to talk to you today about Skylight. But just before we talk about that, I want to talk about us a little bit.

So, in 2011 we started a company called Tilde. It's this shirt. It may have made me self-conscious, because this is actually a first edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center. One of the two.

So we started Tilde in 2011, and we had all just left a venture-backed company. That was a pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and sold to Facebook, and we really didn't want to repeat that experience.
So, we decided to start Tilde, and when we did, we decided to be bootstrapped. DHH and the other people at Basecamp were talking about, you know, being bootstrapped and proud. That was a message that really resonated with us, and we wanted to capture the same thing.

There's only one problem with being bootstrapped and proud, and that is, in order to be both of those things, you actually need money, it turns out. It's not like you just say it in a blog post and then all of a sudden you are in business.

So, we had to think a lot about, OK, well, how do we make money? How do we make a profitable and, most importantly, sustainable business? Because we didn't want to just flip to Facebook in a couple of years.

So, looking around, the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few million dollars, you know, build a bunch of business model, mostly prayer. But that's not really how we want to think about building open source communities. We don't really think that that necessarily leads to the best open source communities.
And if you're interested in more on that, I recommend Leia Silver, who is one of our co-founders. She's giving a talk this afternoon. Oh, sorry, Friday afternoon, about how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would really suggest you go check out her talk.

So, no. No Ember, Inc. Not allowed.

So, we really wanted to build something that leveraged the strengths we thought we had. One, I think most importantly, a really deep knowledge of open source and a deep knowledge of the Rails stack. And also Carl, it turns out, is really, really good at building highly scalable big data systems. Lots of Hadoop in there.

So, last year at RailsConf, we announced the private beta of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? OK. Many of you. Awesome.

So, Skylight is a tool for profiling and measuring the performance of your Rails applications in production. And, as a product, Skylight, I think, was built on three key breakthroughs.
We didn't want to ship a product that was incrementally better than the competition. We wanted to ship a product that was dramatically better. A quantum leap. An order of magnitude better.

And, in order to do that, we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. Delivering a product that does that is predicated on these three breakthroughs.

So, the first one I want to talk about is honest response times. So, DHH wrote a blog post on what was then the 37signals blog, now the Basecamp blog, called The problem with averages. How many of you have read this? Awesome.

For those of you that have not - how many of you hate raising your hands at presentations?

So, for those of you that-

Y.K.: Just put a button in every seat. Press this button-

T.D.: Press the button if you have. Yes. Great.

So, if you read this blog post, the way it opens is: "Our average response time for Basecamp right now is 87ms... That sounds fantastic."
"And it easily leads you to believe that all is well and that we wouldn't need to spend any more time optimizing performance.

But that's actually wrong. The average number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1000 requests that return in 5ms, and then you have 200 requests taking 2000ms, or two seconds, you can still report a respectable 170ms average. That's useless."

So what does DHH say that we need? DHH says the solution is histograms. So, for those of you like me who were sleeping through your statistics class in high school and college, a brief primer on histograms. A histogram is very simple. Basically, you have a series of buckets along some axis, and every time a value falls into a bucket, you increment that bar by one.

So, this is an example of a histogram of response times in a Rails application. You can see that there's a big cluster in the middle around 488ms, 500ms. This isn't a super speedy app, but it's not the worst thing in the world.
And they're all clustered, and then as you move to the right you can see that the response times get longer and longer, and as you move to the left, the response times get shorter and shorter.

So, why do you want a histogram? What's the most important thing about a histogram?

Y.K.: Well, I think it's because most requests don't actually look like this.

T.D.: Yes.

Y.K.: Most end points don't actually look like this.

T.D.: Right. If you think about what your Rails app is doing, it's a complicated beast, right? Turns out, in Ruby, frankly, you can do branching logic. You can do a lot of things.

And so what that means is that if you represent one end point with a single number, you are losing a lot of fidelity, to the point where it becomes, as DHH said, useless. So, for example, in a histogram, you can easily see: oh, here's a group of requests and response times where I'm hitting the cache, and here's another group where I'm missing it. And you can see that that cluster is significantly slower than the faster cache-hitting cluster.
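The bucketing described here takes only a few lines. A minimal sketch in Ruby - the 100ms bucket width and the sample times are invented for illustration, not Skylight's actual values:

```ruby
# Build a histogram of response times (in ms) with fixed-width buckets.
BUCKET_WIDTH = 100

def histogram(times_ms)
  times_ms.each_with_object(Hash.new(0)) do |t, buckets|
    # Integer division floors each time to the start of its bucket.
    buckets[(t / BUCKET_WIDTH) * BUCKET_WIDTH] += 1
  end
end

times = [45, 120, 480, 495, 510, 530, 2200]
histogram(times).sort.each do |bucket, count|
  puts format("%4d-%4dms | %s", bucket, bucket + BUCKET_WIDTH - 1, "#" * count)
end
```

Each bar is just a counter, which is why a histogram can be maintained incrementally as requests stream in.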
And the other thing that you get when you keep the whole distribution in the histogram is that you can look at this number at the 95th percentile. The way to think about the performance of your web application is not the average, because the average doesn't really tell you anything. You want to think about the 95th percentile, because that's not the average response time, that's the average worst response time that a user is likely to hit.

And the thing to keep in mind is that it's not as though a customer comes to your site, they issue one request, and then they're done. As someone is using your website, they're gonna be generating a lot of requests. And you need to look at the 95th percentile, because otherwise every request is basically you rolling the dice that they're not gonna hit one of those two second, three second, four second responses, close the tab, and go to your competitor.

So we look at this as - here's the crazy thing. Here's what I think is crazy. That blog post that DHH wrote, it's from 2009.
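The gap between the two numbers is easy to demonstrate. A sketch in Ruby, using a nearest-rank percentile and data shaped like DHH's example (the exact figures are invented):

```ruby
# Mean versus 95th percentile for a skewed set of response times.
def mean(times)
  times.sum.to_f / times.size
end

# Nearest-rank percentile: the smallest value such that p% of
# samples are at or below it.
def percentile(times, p)
  sorted = times.sort
  sorted[(p / 100.0 * sorted.size).ceil - 1]
end

times = [5] * 1000 + [2000] * 200  # many fast cache hits plus a slow tail
mean(times)            # => 337.5 - hides the slow tail entirely
percentile(times, 95)  # => 2000 - what an unlucky user actually sees
```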
It's been five years, and there's still no tool that does what DHH was asking for. So, frankly, we smelled money. We were like, holy crap.

Y.K.: Yeah, why isn't that slide green?

T.D.: Yeah. It should be green and dollars. I think Keynote has the dollars, the make-it-rain effect I should have used. So we smelled blood in the water. We were like, this is awesome. There's only one problem that we discovered, and that is, it turns out that building this thing is actually really, really freaky hard. Really, really hard.

So, we announced the private beta at RailsConf last year. Before doing that, we spent a year of research: spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized we had made a lot of errors when we were building this system.

So then, after RailsConf last year, we basically took six months to completely rewrite the backend from the ground up. And I think, tying into your keynote, Yehuda, we were like, oh, we clearly have a bespoke problem. No one else is doing this. So we rewrote our own custom backend.
And then we had all these problems, and we realized that they had actually already all been solved by the open source community. And so we benefited tremendously by having a shared solution.

Y.K.: Yeah. So our first release of this was really very bespoke, and the current release uses a tremendous amount of off-the-shelf open source projects that each solve a particular problem very effectively. None of which are as easy to use as Rails, but all of which solve really thorny problems very effectively.

T.D.: So, just for your own understanding, let's talk about how most performance monitoring tools work. The way that most of these work is that you run your Rails app, and running inside of your Rails app is some gem, some agent that you install. And every time the Rails app handles a request, it generates events, and those events, which include information about performance data, are passed into the agent.

And then the agent sends that data to some kind of centralized server. Now, it turns out that doing a running average is actually really simple. Which is why everyone does it.
Basically, you can do it in a single SQL query. All you do is have three columns in the database - the end point, the running average, and the number of requests - and the average and the count are the two things you need to keep a running average.

So keeping a running average is actually really simple from a technical point of view.

Y.K.: I don't think you could even do it in JavaScript, due to the lack of integers.

T.D.: Yes. You probably wouldn't want to do any math in JavaScript, it turns out. So, we took a little bit different approach. Yehuda, do you want to go over the next section?

Y.K.: Yeah, sure. So, when we first started, right at the beginning, we basically did a similar thing, where your app creates events. Most of those start off as ActiveSupport::Notifications, although it turns out that there's very limited use of ActiveSupport::Notifications, so we had to do some normalization work to get them sane, which we're gonna be upstreaming back into Rails.
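The "two things" are what make the update cheap: given the old average and the count, each new sample adjusts the average in constant space. A sketch in Ruby - the endpoint name is made up, and a real agent would keep this state in the database row described above:

```ruby
# Incremental running average per endpoint: one (average, count)
# pair per endpoint, updated in place as requests come in.
class RunningAverage
  def initialize
    @stats = Hash.new { |h, k| h[k] = { avg: 0.0, count: 0 } }
  end

  def record(endpoint, duration_ms)
    s = @stats[endpoint]
    s[:count] += 1
    # new_avg = old_avg + (sample - old_avg) / new_count
    s[:avg] += (duration_ms - s[:avg]) / s[:count]
  end

  def average(endpoint)
    @stats[endpoint][:avg]
  end
end

stats = RunningAverage.new
[100, 200, 300].each { |ms| stats.record("PostsController#index", ms) }
stats.average("PostsController#index")  # => 200.0
```

Exactly this simplicity is the trap: the whole distribution is collapsed into one number before it ever reaches the server.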
But one thing that's kind of unfortunate about having every single Rails app have an agent is that you end up having to do a lot of the same kind of work over and over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests. So now you have a queue of things that you're sending over HTTP in every single one of your Rails processes. And, of course, you probably don't notice this. People are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it suddenly starts taking twenty, thirty, forty, fifty more megabytes.

But we really wanted to keep the actual memory per process down to a small amount. So one of the very first things that we did - we even did it before last year - is that we pulled out all that shared logic into a separate process called the coordinator. And the agent is basically responsible simply for collecting the trace, and it's not responsible for actually talking to our server at all.
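The agent side of this - subscribing to events the app emits - follows the ActiveSupport::Notifications pattern. Here is a self-contained sketch of that pattern in plain Ruby; it mimics the API shape rather than using the real library, though "sql.active_record" is a real Rails event name:

```ruby
# A miniature publish/subscribe instrumenter in the style of
# ActiveSupport::Notifications: code instruments named events,
# and subscribers (like a profiling agent) receive them.
class Instrumenter
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(name, &block)
    @subscribers[name] << block
  end

  def instrument(name, payload = {})
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield if block_given?
    finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @subscribers[name].each { |sub| sub.call(name, start, finish, payload) }
    result
  end
end

events = Instrumenter.new
trace = []

# The agent's side: record every event it sees into a trace.
events.subscribe("sql.active_record") do |name, start, finish, payload|
  trace << { name: name, sql: payload[:sql], duration_ms: (finish - start) * 1000.0 }
end

# The app's side: instrument a (fake) query.
events.instrument("sql.active_record", sql: "SELECT 1") { :result }
trace.size  # => 1
```

In the design described here, that trace would be handed off to the coordinator process instead of accumulating inside the Rails process.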
And that means that the coordinator only has to keep this queue, this bunch of work, in one place, and it doesn't end up using as much memory. And I think this ended up being very effective for us.

T.D.: And I think that low overhead also allows us to just collect more information in general.

Y.K.: Yeah. Now, after our first attempt, we started getting a bunch of customers that were telling us - so, the separate coordinator turned out to be a good thing and a bad thing. On the one hand, there's only one of them, so it uses up only one set of memory. On the other hand, it's really easy for someone to go in and ps that process and see how many megabytes of memory it's using.

So, we got a lot of additional complaints that said, oh, your process is using a lot of memory. And I spent a few weeks - I know Ruby pretty well. I spent a couple of weeks. I actually wrote a gem called Allocation Counter that basically went in to try to pinpoint exactly where the allocations were coming from.
But it turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a regular expression in Ruby can allocate match objects that get put back on the stack.

And so I was able to pare this down to some degree. But I discovered pretty quickly that trying to keep a lid on the memory allocation by doing all this stuff in Ruby is mostly fine. But for our specific use case - where we wanna be telling you, you can run the agent on your box and it's not gonna use a lot of memory - we really needed something more efficient. And our first thought was, we'll use C++ or C. No problem. C is native. It's great. And Carl did the work. Carl is very smart. And then he said, Yehuda, it is now your turn. You need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all of your guys's boxes and not seg-fault. So that doesn't work for me.
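Those hidden allocations are easy to observe with core Ruby alone. A sketch - the counting helper below is invented for illustration and is not the Allocation Counter gem itself:

```ruby
# Count roughly how many object slots a block consumes, with GC
# paused so nothing is reclaimed mid-measurement.
def allocations_in
  GC.disable
  before = ObjectSpace.count_objects.dup
  yield
  after = ObjectSpace.count_objects
  GC.enable
  (after[:TOTAL] - after[:FREE]) - (before[:TOTAL] - before[:FREE])
end

# A regexp match quietly allocates objects (MatchData among them)
# on every iteration - exactly the kind of cost that is hard to spot.
n = allocations_in { 100.times { "hello world" =~ /world/ } }
puts n  # a surprisingly large positive number
```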
And so I noticed that Rust was coming along, and what Rust really gives you is the ability to write low-level code a la C or C++, with manual memory management that keeps your memory allocation low and keeps things speedy. Low resource utilization. While also giving you compile-time guarantees about not seg-faulting. So, again, if your processes randomly started seg-faulting because you installed the agent, I think you would stop being our customer very quickly. So having pretty much 100% guarantees about that was very important to us. And that's why we decided to use Rust.

I'll just keep going.

T.D.: Keep going.

Y.K.: So, we had this coordinator object. And basically the coordinator object is receiving events. So the events basically end up being these traces that describe what's happening in your application. And in our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have really big requests. So you may have a big request with a big SQL query in it, or a lot of big SQL queries in it.
Some people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs.

So, one of the first things that we did early on was we switched to using protobuf as the transport mechanism, and that shrunk down the payloads a lot. Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did, like, a weekend hack to just port it over to Java and got, like, 200x performance. And you don't always get 200x performance - if mostly what you're doing is database queries, you're not gonna get a huge performance swing.

But mostly what we're doing is math. And algorithms and data structures. And for that - Ruby could, in theory, one day have a good JIT or something, but today, writing that code in Java didn't end up being significantly more code, cause it's just, you know, algorithms and data structures.
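The size win from a binary encoding is easy to see without protobuf itself. A toy comparison in Ruby - this invented format is not Skylight's wire format, just the same flavor of compactness (protobuf's varints and field tags do better still):

```ruby
require "json"

# A toy trace: 200 nodes, each a named event with a duration.
trace = Array.new(200) { |i| { "name" => "sql.active_record", "duration_ms" => 100 + i } }

json_bytes = JSON.generate(trace).bytesize

# A compact binary encoding: intern event names once in a string
# table, then emit each node as (name index, duration) packed into
# 16-bit and 32-bit little-endian integers.
names = trace.map { |n| n["name"] }.uniq
body = trace.map { |n| [names.index(n["name"]), n["duration_ms"]].pack("s<l<") }.join
packed_bytes = names.join("\0").bytesize + body.bytesize

puts json_bytes    # JSON repeats every key and name in full
puts packed_bytes  # the string table amortizes them away
```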
T.D.: And I'll just note that standardizing on protobufs across our stack was actually a huge win. Because we realized, hey, browsers, as it turns out, are pretty powerful these days. They can allocate memory, they can do all these kinds of computation. And protobuf libraries exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical serialization format: you can move payloads around the entire stack and everything speaks the same language, so you save on serialization and deserialization.

Y.K.: And JavaScript is actually surprisingly effective at taking protobufs and converting them to the format that we need efficiently. So, we basically take this data. The Java collector is basically collecting all these protobufs, and pretty much it just turns around - and this is sort of where we got into bespoke territory before we started rolling our own - but we realized that when you write a big, distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about.
So, what we do is we basically take these payloads that are coming in, we convert them into batches, and we send the batches down into the Kafka queue. And the next thing that happens - so, Kafka's basically just a queue that allows you to throw things into it. I guess it might be considered similar to something like AMQP. It has some nice fault-tolerance properties and integrates well with Storm. But most important, it's just super, super high-throughput.

So we basically didn't want to put any barrier between you giving us the data and us getting it to disk as soon as possible.

T.D.: Yeah. Which we'll, I think, talk about in a bit.

Y.K.: So Kafka takes the data and starts sending it into Storm. And if you think about what has to happen in order to process a request: you have these requests - there's, you know, maybe traces that have a bunch of SQL queries - and our job is basically to take all those SQL queries and say, OK, I can see that in all of your requests, you had this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just a processing pipeline. You have these traces that come in one side, you start passing them through a bunch of processing steps, and then you end up on the other side with the data.

And Storm is actually a way of describing that processing pipeline in sort of a functional style. And then you tell it: OK, here's how many servers I need; here's how I'm gonna handle failures. And it basically deals with distribution and scaling and all that stuff for you. And part of that is because you wrote everything using functional style.

And so what happens is Kafka sends the data into the entry spout - which is terminology in Storm for these streams that get created - and they basically go into these processing things, which very cutely are called bolts.
This is 421 00:17:09,839 --> 00:17:12,980 definitely not the naming I would have used, but. 422 00:17:12,980 --> 00:17:15,130 So they're called bolts. And the idea is that 423 00:17:15,130 --> 00:17:16,970 basically every request may have several things. 424 00:17:16,970 --> 00:17:20,089 So, for example, we now automatically detect n + 425 00:17:20,089 --> 00:17:22,220 1 queries and that's sort of a different kind 426 00:17:22,220 --> 00:17:25,059 of processing from just, make a picture of the 427 00:17:25,059 --> 00:17:26,980 entire request. Or what is the 95th percentile across 428 00:17:26,980 --> 00:17:29,090 your entire app, right. These are all different kinds 429 00:17:29,090 --> 00:17:30,940 of processing. So we take the data and we 430 00:17:30,940 --> 00:17:33,580 send them into a bunch of bolts, and the 431 00:17:33,580 --> 00:17:35,750 cool thing about bolts is that, again, because they're 432 00:17:35,750 --> 00:17:38,890 just functional chaining, you can take the output from 433 00:17:38,890 --> 00:17:41,130 one bolt and feed it into another bolt. And 434 00:17:41,130 --> 00:17:43,510 that works, that works pretty well. And, and you 435 00:17:43,510 --> 00:17:44,730 don't have to worry about - I mean, you 436 00:17:44,730 --> 00:17:46,600 have to worry a little bit about things like 437 00:17:46,600 --> 00:17:49,960 fault tolerance, failure, idempotence. But you worry about 438 00:17:49,960 --> 00:17:52,230 them at, at the abstraction level, and then the 439 00:17:52,230 --> 00:17:54,159 operational part is handled for you. 440 00:17:54,159 --> 00:17:55,740 T.D.: So it's just like a very declarative way 441 00:17:55,740 --> 00:17:58,179 of describing how this computation works in, in a 442 00:17:58,179 --> 00:17:59,230 way that's easy to scale.
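(Editor's note: the pipeline idea described here can be sketched in plain Ruby. This is a toy illustration of the functional style Storm encourages — traces in one side, a chain of pure "bolt" steps, aggregated timings out the other — not Skylight's actual topology; all names and numbers are invented.)

```ruby
# Toy sketch of a Storm-style functional pipeline in plain Ruby.
# Traces enter on one side, flow through a chain of "bolts" (pure
# functions), and aggregated timings come out the other side.

# Each trace is a list of nodes: [name, duration_ms, parent_name].
TRACES = [
  [["ProcessAction", 120, nil], ["SELECT * FROM posts", 80, "ProcessAction"]],
  [["ProcessAction", 100, nil], ["SELECT * FROM posts", 60, "ProcessAction"]],
]

# Three "bolts": each one takes a stream and emits a transformed stream.
flatten_nodes  = ->(traces) { traces.flat_map { |trace| trace } }
group_by_name  = ->(nodes)  { nodes.group_by { |node| node[0] } }
average_timing = lambda do |groups|
  groups.transform_values do |nodes|
    durations = nodes.map { |node| node[1] }
    durations.sum.to_f / durations.size
  end
end

# Wiring bolts together is just function composition over the stream.
pipeline = [flatten_nodes, group_by_name, average_timing]
result = pipeline.reduce(TRACES) { |data, bolt| bolt.call(data) }
# result => average duration per node name, across every request
```

Because each step is a pure function of its input, a framework like Storm can distribute and retry the steps across machines without the steps themselves knowing about it.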
443 00:17:59,230 --> 00:18:01,860 Y.K.: And Carl actually talked about this at very 444 00:18:01,860 --> 00:18:04,909 high speed yesterday, and you, some of you may 445 00:18:04,909 --> 00:18:06,620 have been there. I would recommend watching the video 446 00:18:06,620 --> 00:18:09,020 when it comes out if you want to make 447 00:18:09,020 --> 00:18:11,159 use of this stuff in your own applications. 448 00:18:11,159 --> 00:18:13,250 And then when you're finally done with all the 449 00:18:13,250 --> 00:18:14,970 processing, you need to actually do something with it. 450 00:18:14,970 --> 00:18:16,289 You need to put it somewhere so that the 451 00:18:16,289 --> 00:18:18,360 web app can get access to it, and that 452 00:18:18,360 --> 00:18:21,350 is basically, we use Cassandra for this. And Cassandra 453 00:18:21,350 --> 00:18:24,929 again is mostly, it's a dumb database, but it 454 00:18:24,929 --> 00:18:27,780 has, it's, has high capacity. It has some of 455 00:18:27,780 --> 00:18:29,080 the fault-tolerance capabilities that we want. 456 00:18:29,080 --> 00:18:30,770 T.D.: We're very, we're just very, very write-heavy, right. 457 00:18:30,770 --> 00:18:32,820 Like, we tend to be writing more than we're 458 00:18:32,820 --> 00:18:33,730 ever reading. 459 00:18:33,730 --> 00:18:36,270 Y.K.: Yup. And then when we're done, when we're 460 00:18:36,270 --> 00:18:38,780 done with a particular batch, Cassandra basically kicks off 461 00:18:38,780 --> 00:18:40,720 the process over again. So we're basically doing these 462 00:18:40,720 --> 00:18:41,200 things as batches. 463 00:18:41,200 --> 00:18:42,919 T.D.: So these are, these are roll-ups, is what's 464 00:18:42,919 --> 00:18:45,360 happening here. So basically every minute, every ten minutes, 465 00:18:45,360 --> 00:18:48,140 and then at every hour, we reprocess and we 466 00:18:48,140 --> 00:18:49,970 re-aggregate, so that when you query us we know 467 00:18:49,970 --> 00:18:51,010 exactly what to give you.
468 00:18:51,010 --> 00:18:52,830 Y.K.: Yup. So we sort of have this cycle 469 00:18:52,830 --> 00:18:55,049 where we start off, obviously, in the first five 470 00:18:55,049 --> 00:18:56,890 seconds, the first minute, you really want high granularity. 471 00:18:56,890 --> 00:18:59,140 You want to see what's happening right now. But, 472 00:18:59,140 --> 00:19:00,110 if you want to go back and look at 473 00:19:00,110 --> 00:19:02,460 data from three months ago, you probably care about 474 00:19:02,460 --> 00:19:04,830 it, like the day granularity or maybe the hour 475 00:19:04,830 --> 00:19:07,490 granularity. So, we basically do these roll-ups and we 476 00:19:07,490 --> 00:19:09,200 cycle through the process. 477 00:19:09,200 --> 00:19:11,880 T.D.: So this, it turns out, building the system 478 00:19:11,880 --> 00:19:15,169 required an intense amount of work. Carl spent probably 479 00:19:15,169 --> 00:19:17,490 six months reading PhD thesises to find- 480 00:19:17,490 --> 00:19:18,140 Y.K.: Theses. 481 00:19:18,140 --> 00:19:23,850 T.D.: Theses. To find, to find data structures and 482 00:19:23,850 --> 00:19:25,870 algorithms that we could use. Because this is a 483 00:19:25,870 --> 00:19:28,340 huge amount of data. Like, I think even a 484 00:19:28,340 --> 00:19:31,049 few months after we were in private data, private 485 00:19:31,049 --> 00:19:34,179 beta, we were already handling over a billion requests 486 00:19:34,179 --> 00:19:36,200 per month. And obviously there's no way that we- 487 00:19:36,200 --> 00:19:37,970 Y.K.: Basically the number of requests that we handle 488 00:19:37,970 --> 00:19:39,770 is the sum of all of the requests that 489 00:19:39,770 --> 00:19:40,010 you handle. 490 00:19:40,010 --> 00:19:40,110 T.D.: Right. 491 00:19:40,110 --> 00:19:41,200 Y.K.: And all of our customers handle. 492 00:19:41,200 --> 00:19:41,909 T.D.: Right. Right. So. 493 00:19:41,909 --> 00:19:43,120 Y.K.: So that's a lot of requests.
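(Editor's note: the roll-up cycle just described might look something like this in miniature. The bucket widths come from the talk; the data and function names are invented for illustration.)

```ruby
# Miniature sketch of the roll-up cycle: the same raw timings aggregated
# at one-minute, ten-minute, and one-hour granularity, so recent data
# stays detailed while older data stays cheap to store and query.

def bucket_for(timestamp, width_seconds)
  timestamp - (timestamp % width_seconds)
end

# Average the durations that fall into each bucket of the given width.
def roll_up(requests, width_seconds)
  requests.group_by { |ts, _| bucket_for(ts, width_seconds) }
          .transform_values { |rs| rs.sum { |_, ms| ms } / rs.size }
end

# Fake request data: [unix_timestamp, duration_ms].
requests = [[60, 100], [90, 140], [660, 80], [3500, 200]]

minute_buckets = roll_up(requests, 60)    # high granularity, recent data
ten_minute     = roll_up(requests, 600)   # medium granularity
hourly         = roll_up(requests, 3600)  # coarse granularity, old data
```

In a real system each coarser tier would be computed from the tier below rather than from raw requests, which is what makes re-aggregating on a schedule so cheap.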
494 00:19:43,120 --> 00:19:45,760 T.D.: So obviously we can't provide a service, at 495 00:19:45,760 --> 00:19:48,480 least one that's not, we can't provide an affordable 496 00:19:48,480 --> 00:19:51,130 service, an accessible service, if we have to store 497 00:19:51,130 --> 00:19:53,190 terabytes or exabytes of data just to tell you 498 00:19:53,190 --> 00:19:53,990 how your app is running. 499 00:19:53,990 --> 00:19:56,630 Y.K.: And I think, also a problem, it's problematic 500 00:19:56,630 --> 00:19:58,429 if you store all the data in a database 501 00:19:58,429 --> 00:19:59,760 and then every single time someone wants to learn 502 00:19:59,760 --> 00:20:01,590 something about that, you have to do a query. 503 00:20:01,590 --> 00:20:03,159 Those queries can take a very long time. They 504 00:20:03,159 --> 00:20:04,700 can take minutes. And I think we really wanted 505 00:20:04,700 --> 00:20:07,159 to have something that would be very, that would, 506 00:20:07,159 --> 00:20:09,580 where the feedback loop would be fast. So we 507 00:20:09,580 --> 00:20:11,860 wanted to find algorithms that let us handle the 508 00:20:11,860 --> 00:20:13,940 data at, at real time, and then provide it 509 00:20:13,940 --> 00:20:16,020 to you at real time instead of these, like, 510 00:20:16,020 --> 00:20:18,309 dump the data somewhere and then do these complicated 511 00:20:18,309 --> 00:20:18,549 queries. 512 00:20:18,549 --> 00:20:20,820 T.D.: So, hold on. So this slide was not 513 00:20:20,820 --> 00:20:24,440 supposed to be here. It was supposed to be 514 00:20:24,440 --> 00:20:27,669 a Rails slide. So, whoa. I went too far. 515 00:20:27,669 --> 00:20:29,970 K. We'll watch that again. That's pretty cool. 
So 516 00:20:29,970 --> 00:20:32,460 then the last thing I want to say is, 517 00:20:32,460 --> 00:20:34,330 perhaps your take away from looking at this architecture 518 00:20:34,330 --> 00:20:36,750 diagram is, oh my gosh, these Rails guys completely- 519 00:20:36,750 --> 00:20:38,309 Y.K.: They really jumped the shark. 520 00:20:38,309 --> 00:20:40,870 T.D.: They jumped the shark. They ditched Rails. I 521 00:20:40,870 --> 00:20:42,330 saw, like, three Tweets yesterday - I wasn't here, 522 00:20:42,330 --> 00:20:43,450 I was in Portland yesterday, but I saw, like, 523 00:20:43,450 --> 00:20:44,340 three Tweets that were like, I'm at RailsConf and 524 00:20:44,340 --> 00:20:48,809 I haven't seen a single talk about, like, Rails. 525 00:20:48,809 --> 00:20:51,950 So that's true here, too. But, I want to 526 00:20:51,950 --> 00:20:54,940 assure you that we are only using this stack 527 00:20:54,940 --> 00:20:58,070 for the heavy computation. We started in Rails. We 528 00:20:58,070 --> 00:21:00,730 started, we were like, hey, what do we need. 529 00:21:00,730 --> 00:21:01,929 Ah, well, people probably need to authenticate and log 530 00:21:01,929 --> 00:21:05,090 in, and we probably need to do billing. And 531 00:21:05,090 --> 00:21:06,220 those are all things that Rails is really, really 532 00:21:06,220 --> 00:21:08,830 good at. So we started with Rails as, basically, 533 00:21:08,830 --> 00:21:11,110 the starting point, and then when we realized oh 534 00:21:11,110 --> 00:21:14,039 my gosh, computation is really slow. There's no way 535 00:21:14,039 --> 00:21:15,240 we're gonna be able to offer this service. OK. 536 00:21:15,240 --> 00:21:16,059 Now let's think about how we can do all 537 00:21:16,059 --> 00:21:16,270 of that. 
538 00:21:16,270 --> 00:21:18,510 Y.K.: And I think notably, a lot of people 539 00:21:18,510 --> 00:21:20,049 who look at Rails are like, there's a lot 540 00:21:20,049 --> 00:21:21,750 of companies that have built big stuff on Rails, 541 00:21:21,750 --> 00:21:24,090 and their attitude is, like, oh, this legacy terrible 542 00:21:24,090 --> 00:21:25,409 Rails app. I really wish we could get rid 543 00:21:25,409 --> 00:21:26,850 of it. If we could just write everything in 544 00:21:26,850 --> 00:21:30,760 Scala or Clojure or Go, everything would be amazing. 545 00:21:30,760 --> 00:21:31,500 That is definitely not our attitude. Our attitude is 546 00:21:31,500 --> 00:21:34,320 that Rails is really amazing, in particular, at the 547 00:21:34,320 --> 00:21:36,740 kinds of things that are really common across everyone's 548 00:21:36,740 --> 00:21:39,900 web applications - authentication, billing, et cetera. And we 549 00:21:39,900 --> 00:21:41,429 really want to be using Rails for the parts 550 00:21:41,429 --> 00:21:43,360 of our app- even things like error-tracking, we do 551 00:21:43,360 --> 00:21:45,039 through the Rails app. We want to be using 552 00:21:45,039 --> 00:21:47,470 Rails because it's very productive at doing those things. 553 00:21:47,470 --> 00:21:48,789 It happens to be very slow with doing data 554 00:21:48,789 --> 00:21:50,330 crunching, so we're gonna use a different tool for 555 00:21:50,330 --> 00:21:50,539 that. 556 00:21:50,539 --> 00:21:51,909 But I don't think you'll ever see me getting 557 00:21:51,909 --> 00:21:54,210 up and saying, ah, I really wish we had 558 00:21:54,210 --> 00:21:55,080 just started writing, you know, the Rails app in 559 00:21:55,080 --> 00:21:55,159 Rust. 560 00:21:55,159 --> 00:21:55,309 T.D.: Yeah. 561 00:21:55,309 --> 00:21:58,090 Y.K.: That would be terrible.
562 00:21:58,090 --> 00:22:02,429 T.D.: So that's number one, is, is, honest response 563 00:22:02,429 --> 00:22:04,390 times, which we're, which it turns out, seems like 564 00:22:04,390 --> 00:22:08,289 it should be easy, requires storing an insane amount of 565 00:22:08,289 --> 00:22:09,169 data. 566 00:22:09,169 --> 00:22:10,620 So the second thing that we realized when we 567 00:22:10,620 --> 00:22:12,059 were looking at a lot of these tools, is 568 00:22:12,059 --> 00:22:14,360 that most of them focus on data. They focus 569 00:22:14,360 --> 00:22:16,590 on giving you the raw data. But I'm not 570 00:22:16,590 --> 00:22:19,130 a machine. I'm not a computer. I don't enjoy 571 00:22:19,130 --> 00:22:21,320 sifting through data. That's what computers are good for. 572 00:22:21,320 --> 00:22:23,289 I would rather be drinking a beer. It's really 573 00:22:23,289 --> 00:22:24,830 nice in Portland, this time of year. 574 00:22:24,830 --> 00:22:27,409 So, we wanted to think about, if you're trying 575 00:22:27,409 --> 00:22:31,179 to solve the performance problems in your application, what 576 00:22:31,179 --> 00:22:32,840 are the things that you would suss out with 577 00:22:32,840 --> 00:22:35,760 the existing tools after spending, like, four hours depleting 578 00:22:35,760 --> 00:22:37,510 your ego to get there? 579 00:22:37,510 --> 00:22:38,929 Y.K.: And I think part of this is just 580 00:22:38,929 --> 00:22:42,260 people are actually very, people like to think that 581 00:22:42,260 --> 00:22:43,880 they're gonna use these tools, but when the tools 582 00:22:43,880 --> 00:22:45,320 require you to dig through a lot of data, 583 00:22:45,320 --> 00:22:47,090 people just don't use them very much.
So, the 584 00:22:47,090 --> 00:22:48,330 goal here was to build a tool that people 585 00:22:48,330 --> 00:22:50,809 actually use and actually like using, and not to 586 00:22:50,809 --> 00:22:54,870 build a tool that happens to provide a lot 587 00:22:54,870 --> 00:22:55,039 of data you can sift through. 588 00:22:55,039 --> 00:22:55,059 T.D.: Yes. 589 00:22:55,059 --> 00:22:55,929 So, probably the, one of the first things that 590 00:22:55,929 --> 00:22:58,529 we realized is that we don't want to provide. 591 00:22:58,529 --> 00:23:00,400 This is a trace of a request, you've probably 592 00:23:00,400 --> 00:23:04,070 seen similar UIs using other tools, using, for example, 593 00:23:04,070 --> 00:23:07,059 the inspector in, in like Chrome or Safari, and 594 00:23:07,059 --> 00:23:08,700 this is just showing basically, it's basically a visual 595 00:23:08,700 --> 00:23:10,830 stack trace of where your application is spending its 596 00:23:10,830 --> 00:23:11,600 time. 597 00:23:11,600 --> 00:23:13,950 But I think what was important for us is 598 00:23:13,950 --> 00:23:17,809 showing not just a single request, because your app 599 00:23:17,809 --> 00:23:20,570 handles, you know, hundreds of thousands of requests, or 600 00:23:20,570 --> 00:23:22,679 millions of requests. So looking at a single request 601 00:23:22,679 --> 00:23:24,630 statistically is complete, it's just noise. 602 00:23:24,630 --> 00:23:26,500 Y.K.: And it's especially bad if it's the worst 603 00:23:26,500 --> 00:23:28,659 request, because the worst request is, is really noise. 604 00:23:28,659 --> 00:23:30,850 It's like, a hiccup in the network, right. 605 00:23:30,850 --> 00:23:31,250 T.D.: It's the outlier. Yeah. 606 00:23:31,250 --> 00:23:32,150 Y.K.: It's literally the outlier. 607 00:23:32,150 --> 00:23:35,659 T.D.: It's literally the outlier. Yup. 
So, what we 608 00:23:35,659 --> 00:23:38,700 present in Skylight is something a little bit different, 609 00:23:38,700 --> 00:23:41,770 and it's something that we call the aggregate trace. 610 00:23:41,770 --> 00:23:46,260 So the aggregate trace is basically us taking all 611 00:23:46,260 --> 00:23:49,559 of your requests, averaging them out where each of 612 00:23:49,559 --> 00:23:51,750 these things spends their time, and then showing you 613 00:23:51,750 --> 00:23:54,899 that. So this is basically like, this is like, 614 00:23:54,899 --> 00:23:57,929 this is like the statue of David. It is 615 00:23:57,929 --> 00:24:00,880 the idealized form of the stack trace of how 616 00:24:00,880 --> 00:24:02,270 your application's behaving. 617 00:24:02,270 --> 00:24:05,330 But, of course, you have the same problem as 618 00:24:05,330 --> 00:24:07,500 before, which is, if this is all that we 619 00:24:07,500 --> 00:24:10,580 were showing you, it would be obscuring a lot 620 00:24:10,580 --> 00:24:12,870 of information. You want to actually be able to 621 00:24:12,870 --> 00:24:13,990 tell the difference between, OK, what's my stack trace 622 00:24:13,990 --> 00:24:16,419 look like for fast requests, and how does that 623 00:24:16,419 --> 00:24:18,539 differ from requests that are slower. 624 00:24:18,539 --> 00:24:20,860 So what we've got, I've got a little video 625 00:24:20,860 --> 00:24:22,320 here. You can see that when I move the 626 00:24:22,320 --> 00:24:26,490 slider, that this trace below it is actually updating 627 00:24:26,490 --> 00:24:29,130 in real time. As I move the slider around, 628 00:24:29,130 --> 00:24:31,770 you can see that the aggregate trace actually updates 629 00:24:31,770 --> 00:24:34,240 with it. And that's because we're collecting all this 630 00:24:34,240 --> 00:24:36,159 information. We're collecting, like I said, a lot of 631 00:24:36,159 --> 00:24:38,669 data. 
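(Editor's note: one plausible sketch of how such an aggregate trace can be reassembled: store an averaged trace plus a request count per histogram bucket, then merge whatever buckets the slider selects using a count-weighted average. The bucket labels, node names, and numbers below are invented for illustration.)

```ruby
# Hypothetical sketch of reassembling an aggregate trace from per-bucket
# data: each histogram bucket holds a request count and an averaged trace,
# and a selected range of buckets is merged with a count-weighted average.

BUCKETS = {
  "0-100ms"   => { count: 30, trace: { "controller" => 20, "db" => 50 } },
  "100-200ms" => { count: 10, trace: { "controller" => 30, "db" => 120 } },
}

def aggregate_trace(buckets, selected_keys)
  total = selected_keys.sum { |key| buckets[key][:count] }
  merged = Hash.new(0.0)
  selected_keys.each do |key|
    weight = buckets[key][:count].to_f / total
    buckets[key][:trace].each { |node, ms| merged[node] += ms * weight }
  end
  merged
end

all_requests = aggregate_trace(BUCKETS, BUCKETS.keys)  # whole histogram
fastest_only = aggregate_trace(BUCKETS, ["0-100ms"])   # slider selection
```

Because the merge is just arithmetic over precomputed buckets, moving the slider never needs a round trip to the raw request data, which is what makes the UI update in real time.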
We can recompute this aggregate trace on the 632 00:24:38,669 --> 00:24:38,909 fly. 633 00:24:38,909 --> 00:24:41,200 Basically, for each bucket, we're storing a different trace, 634 00:24:41,200 --> 00:24:42,880 and then on the client we're reassembling that. We'll 635 00:24:42,880 --> 00:24:43,899 go into that a little bit. 636 00:24:43,899 --> 00:24:45,799 Y.K.: And I think it's really important that you 637 00:24:45,799 --> 00:24:48,370 be able to do these experiments quickly. If every 638 00:24:48,370 --> 00:24:50,059 time you think, oh, I wonder what happens if 639 00:24:50,059 --> 00:24:52,260 I add another histogram bucket, if it requires a 640 00:24:52,260 --> 00:24:54,830 whole full page refresh. Then that would basically make 641 00:24:54,830 --> 00:24:56,309 people not want to use the tool. Not able 642 00:24:56,309 --> 00:24:58,580 to use the tool. So, actually building something which 643 00:24:58,580 --> 00:24:59,649 is real time and fast, gets the data as 644 00:24:59,649 --> 00:25:00,110 it comes, was really important to us. 645 00:25:00,110 --> 00:25:01,220 T.D.: So that's number one. 646 00:25:01,220 --> 00:25:04,850 And the second thing. So we built that, and 647 00:25:04,850 --> 00:25:07,929 we're like, OK, well what's next? And I think 648 00:25:07,929 --> 00:25:09,250 that the big problem with this is that you 649 00:25:09,250 --> 00:25:12,020 need to know that there's a problem before you 650 00:25:12,020 --> 00:25:14,429 go look at it, right. So we have been 651 00:25:14,429 --> 00:25:16,080 working for the past few months, and the Storm 652 00:25:16,080 --> 00:25:18,390 infrastructure that we built makes it pretty straight-forward to 653 00:25:18,390 --> 00:25:21,149 start building more abstractions on top of the data 654 00:25:21,149 --> 00:25:21,559 that we've already collected. 655 00:25:21,559 --> 00:25:24,120 It's a very declarative system. So we've been working 656 00:25:24,120 --> 00:25:26,679 on a feature called inspections. 
And what's cool about 657 00:25:26,679 --> 00:25:29,279 inspections is that we can look at this tremendous 658 00:25:29,279 --> 00:25:31,270 volume of data that we've collected from your app, 659 00:25:31,270 --> 00:25:33,840 and we can automatically tease out what the problems 660 00:25:33,840 --> 00:25:35,210 are. So the first one that we shipped, this 661 00:25:35,210 --> 00:25:37,399 is in beta right now. It's not, it's not 662 00:25:37,399 --> 00:25:39,840 out and enabled by default, but there, it's behind 663 00:25:39,840 --> 00:25:42,440 a feature flag that we've had some users turning 664 00:25:42,440 --> 00:25:42,730 on. 665 00:25:42,730 --> 00:25:44,419 And, and trying out. And so what we can 666 00:25:44,419 --> 00:25:46,450 do in this case, is because we have information 667 00:25:46,450 --> 00:25:48,730 about all of the database queries in your app, 668 00:25:48,730 --> 00:25:50,840 we can look and see if you have n 669 00:25:50,840 --> 00:25:52,390 plus one queries. Can you maybe explain what an 670 00:25:52,390 --> 00:25:53,250 n plus one query is? 671 00:25:53,250 --> 00:25:54,600 Y.K.: Yeah. So, I'm, people know, hopefully, what n 672 00:25:54,600 --> 00:25:56,770 plus one queries are. But the, it's the idea that 673 00:25:56,770 --> 00:25:59,260 you, by accident, for some reason, instead of making 674 00:25:59,260 --> 00:26:01,970 one query, you ask for like all the posts 675 00:26:01,970 --> 00:26:02,940 and then you iterated through all of them and 676 00:26:02,940 --> 00:26:04,940 got all the comments and now you, instead of 677 00:26:04,940 --> 00:26:08,679 having one query, you have one query per post, 678 00:26:08,679 --> 00:26:10,309 right. And you, what I, what I like to 679 00:26:10,309 --> 00:26:12,549 do is do eager loading, where you say include 680 00:26:12,549 --> 00:26:14,559 comments, right. But you have to know that you 681 00:26:14,559 --> 00:26:15,039 have to do that.
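(Editor's note: a toy version of how a monitor might flag this pattern automatically: normalize away the literal bind values so the repeated queries collapse to one shape, then report any shape that repeats past a threshold within a request. The regexes and the threshold here are illustrative, not Skylight's actual rules.)

```ruby
# Toy n + 1 detector: replace literal binds with "?" so repeated queries
# look identical, then flag query shapes that recur too often in one
# request. Regexes and threshold are illustrative only.

def normalize(sql)
  sql.gsub(/'[^']*'/, "?")  # string literals  -> ?
     .gsub(/\b\d+\b/, "?")  # numeric literals -> ?
end

def n_plus_one_candidates(queries, threshold: 3)
  queries.group_by { |sql| normalize(sql) }
         .select { |_, group| group.size >= threshold }
         .keys
end

request_queries = [
  "SELECT * FROM comments WHERE post_id = 1",
  "SELECT * FROM comments WHERE post_id = 2",
  "SELECT * FROM comments WHERE post_id = 3",
  "SELECT * FROM posts WHERE id = 1",
]

n_plus_one_candidates(request_queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```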
682 00:26:15,039 --> 00:26:16,899 So there's some tools that will run in development 683 00:26:16,899 --> 00:26:18,380 mode, if you happen to catch it, like 684 00:26:18,380 --> 00:26:20,460 Bullet. This is basically a tool that's looking at 685 00:26:20,460 --> 00:26:22,210 every single one of your classes and has some 686 00:26:22,210 --> 00:26:24,169 thresholds that, once we see that a bunch of 687 00:26:24,169 --> 00:26:27,429 your requests have the same exact query, so we 688 00:26:27,429 --> 00:26:29,549 do some work to pull out binds. So if 689 00:26:29,549 --> 00:26:32,200 it's, like, where something equals one, we will automatically 690 00:26:32,200 --> 00:26:34,110 pull out the one and replace it with a 691 00:26:34,110 --> 00:26:34,740 question mark. 692 00:26:34,740 --> 00:26:36,230 And then we basically take all those queries, if 693 00:26:36,230 --> 00:26:39,529 they're the exact same query repeated multiple times, subject 694 00:26:39,529 --> 00:26:41,390 to some thresholds, we'll start showing you hey, there's 695 00:26:41,390 --> 00:26:42,450 an n plus one query. 696 00:26:42,450 --> 00:26:43,799 And you can imagine this same sort of thing 697 00:26:43,799 --> 00:26:46,320 being done for things, like, are you missing an 698 00:26:46,320 --> 00:26:49,690 index, right. Or, are you using the Ruby version 699 00:26:49,690 --> 00:26:50,950 of JSON when you should be using the native 700 00:26:50,950 --> 00:26:52,179 version of JSON. These are all things that we 701 00:26:52,179 --> 00:26:55,140 can start detecting just because we're consuming an enormous 702 00:26:55,140 --> 00:26:57,510 amount of information, and we can start writing some 703 00:26:57,510 --> 00:26:59,330 heuristics for bubbling it up. 704 00:26:59,330 --> 00:27:02,330 So, third and final breakthrough, we realized that we 705 00:27:02,330 --> 00:27:05,289 really, really needed a lightning fast UI. Something really 706 00:27:05,289 --> 00:27:08,279 responsive.
So, in particular, the feedback loop is critical, 707 00:27:08,279 --> 00:27:09,929 right. You can imagine, if the way that you 708 00:27:09,929 --> 00:27:12,279 dug into data was you clicked and you wait 709 00:27:12,279 --> 00:27:14,320 an hour, and then you get your results, no 710 00:27:14,320 --> 00:27:15,730 one would do it. No one would ever do 711 00:27:15,730 --> 00:27:15,890 it. 712 00:27:15,890 --> 00:27:19,090 And the existing tools are OK, but you click 713 00:27:19,090 --> 00:27:20,470 and you wait. You look at it and you're 714 00:27:20,470 --> 00:27:21,730 like, oh, I want a different view, so then 715 00:27:21,730 --> 00:27:23,240 you go edit your query and then you click 716 00:27:23,240 --> 00:27:25,360 and you wait and it's just not a pleasant 717 00:27:25,360 --> 00:27:26,600 experience. 718 00:27:26,600 --> 00:27:28,850 So, so we use Ember, the, the UI that 719 00:27:28,850 --> 00:27:31,250 you're using when you log into Skylight. Even though 720 00:27:31,250 --> 00:27:33,289 it feels just like a regular website, it doesn't 721 00:27:33,289 --> 00:27:35,940 feel like a native app, is powered, all of 722 00:27:35,940 --> 00:27:37,679 the routing, all of the rendering, all of the 723 00:27:37,679 --> 00:27:40,769 decision making, is happening in, as an Ember.js app, 724 00:27:40,769 --> 00:27:43,049 and we pair that with D3. So all of 725 00:27:43,049 --> 00:27:44,830 the charts, the charts that you saw there in 726 00:27:44,830 --> 00:27:48,039 the aggregate trace, that is all Ember components powered 727 00:27:48,039 --> 00:27:48,970 by D3. 728 00:27:48,970 --> 00:27:52,860 So, this has actually significantly cleaned up our client-side 729 00:27:52,860 --> 00:27:55,679 code. It makes re-usability really, really awesome.
So to 730 00:27:55,679 --> 00:27:57,039 give you an example, this is from our billing 731 00:27:57,039 --> 00:27:58,789 page that I, the designer came and they had, 732 00:27:58,789 --> 00:28:01,260 they had a component that was like, the date 733 00:28:01,260 --> 00:28:01,809 component. 734 00:28:01,809 --> 00:28:02,919 And, the- 735 00:28:02,919 --> 00:28:05,899 T.D.: It seems really boring at first. 736 00:28:05,899 --> 00:28:06,799 Y.K.: It seemed really boring. But, this is the 737 00:28:06,799 --> 00:28:08,950 implementation, right. So you could copy and paste this 738 00:28:08,950 --> 00:28:11,059 code over and over again, everywhere you go. Just 739 00:28:11,059 --> 00:28:12,750 remember to format it correctly. If you forget to 740 00:28:12,750 --> 00:28:15,070 format it, it's not gonna look the same everywhere. 741 00:28:15,070 --> 00:28:17,460 But I was like, hey, we're using this all 742 00:28:17,460 --> 00:28:18,010 over the place. Why don't we bundle this up 743 00:28:18,010 --> 00:28:20,070 into a component? And so with Ember, it was 744 00:28:20,070 --> 00:28:22,230 super easy. We basically just said, OK, here's a new 745 00:28:22,230 --> 00:28:24,590 calendar date component. It has a property on it 746 00:28:24,590 --> 00:28:26,460 called date. Just set that to any JavaScript Date 747 00:28:26,460 --> 00:28:28,059 object. Just set, you don't have to remember about 748 00:28:28,059 --> 00:28:30,450 converting it or formatting it. Here's the component. Set 749 00:28:30,450 --> 00:28:31,840 the date and it will render the correct thing 750 00:28:31,840 --> 00:28:32,760 automatically.
751 00:28:32,760 --> 00:28:36,039 And, so the architecture of the Ember app looks 752 00:28:36,039 --> 00:28:37,640 a little bit, something like this, where you have 753 00:28:37,640 --> 00:28:39,919 many, many different components, most of them just driven 754 00:28:39,919 --> 00:28:42,370 by D3, and then they're plugged into the model 755 00:28:42,370 --> 00:28:43,480 and the controller. 756 00:28:43,480 --> 00:28:44,909 And the Ember app will go fetch those models 757 00:28:44,909 --> 00:28:46,750 from the cloud, and the cloud from the Java 758 00:28:46,750 --> 00:28:50,190 app, which just queries Cassandra, and render them. And 759 00:28:50,190 --> 00:28:53,429 what's neat about this model is turning on web 760 00:28:53,429 --> 00:28:56,360 sockets is super easy, right. Because all of these 761 00:28:56,360 --> 00:28:58,860 components are bound to a single place. So when 762 00:28:58,860 --> 00:29:00,890 the web socket says, hey, we have updated information 763 00:29:00,890 --> 00:29:02,630 for you to show, it just pushes it onto 764 00:29:02,630 --> 00:29:04,980 the model or onto the controller, and the whole 765 00:29:04,980 --> 00:29:06,159 UI updates automatically. 766 00:29:06,159 --> 00:29:06,890 It's like magic. 767 00:29:06,890 --> 00:29:07,230 And- 768 00:29:07,230 --> 00:29:08,250 Y.K.: Like magic. 769 00:29:08,250 --> 00:29:09,679 T.D.: It's like magic. And, and when debugging, this 770 00:29:09,679 --> 00:29:11,559 is especially awesome too, because, and I'll maybe show 771 00:29:11,559 --> 00:29:15,080 a demo of the Ember inspector. It's nice. 772 00:29:15,080 --> 00:29:17,830 So. Yeah. So, lightning fast UI. Reducing the feedback 773 00:29:17,830 --> 00:29:19,510 loop so that you can quickly play with your 774 00:29:19,510 --> 00:29:21,880 data, makes it go from a chore to something 775 00:29:21,880 --> 00:29:23,620 that actually feels kind of fun. 
776 00:29:23,620 --> 00:29:27,039 So, these were the breakthroughs that we had when 777 00:29:27,039 --> 00:29:28,440 we were building Skylight. The things that made us 778 00:29:28,440 --> 00:29:29,980 think, yes, this is actually a product that we 779 00:29:29,980 --> 00:29:31,940 think deserves to be on the market. So, one, 780 00:29:31,940 --> 00:29:33,860 honest response times. Collect data that no one else 781 00:29:33,860 --> 00:29:36,549 can collect. Focus on answers instead of just dumping 782 00:29:36,549 --> 00:29:38,289 data, and have a lightning fast UI to do 783 00:29:38,289 --> 00:29:38,409 it. 784 00:29:38,409 --> 00:29:40,100 So we like to think of Skylight as basically 785 00:29:40,100 --> 00:29:42,690 a smart profiler. It's a smart profiler that runs 786 00:29:42,690 --> 00:29:44,350 in production. It's like the profiler that you run 787 00:29:44,350 --> 00:29:47,230 on your local development machine, but instead of being 788 00:29:47,230 --> 00:29:49,179 on your local dev box which has nothing to 789 00:29:49,179 --> 00:29:51,610 do with the performance characteristics of what your users 790 00:29:51,610 --> 00:29:53,450 are experiencing, we're actually running in production. 791 00:29:53,450 --> 00:29:58,919 So, let me just give you guys a quick 792 00:29:58,919 --> 00:30:00,390 demo. 793 00:30:00,390 --> 00:30:03,120 So, this is what the Skylight, this is what 794 00:30:03,120 --> 00:30:07,610 Skylight looks like. What's under this? There we go. 795 00:30:07,610 --> 00:30:09,620 So, the first thing here is we've got the 796 00:30:09,620 --> 00:30:12,669 app dashboard. So this, it's like our, 95th 797 00:30:12,669 --> 00:30:15,500 responsile- 95th percentile response time has peaked. Maybe you're 798 00:30:15,500 --> 00:30:17,970 all hammering it right now. That would be nice.
799 00:30:17,970 --> 00:30:19,940 So, this is a graph of your response time 800 00:30:19,940 --> 00:30:22,010 over time, and then on the right, this is 801 00:30:22,010 --> 00:30:24,700 the graph of the RPMs, the requests per minute 802 00:30:24,700 --> 00:30:26,750 that your app is handling. So this is app-wide. 803 00:30:26,750 --> 00:30:29,440 And this is live. This updates every minute. 804 00:30:29,440 --> 00:30:31,039 Then down below, you have a list of the 805 00:30:31,039 --> 00:30:33,730 end points in your application. So you can see, 806 00:30:33,730 --> 00:30:35,700 actually, the top, the slowest ones for us were, 807 00:30:35,700 --> 00:30:37,789 we have an instrumentation API, and we've gone and 808 00:30:37,789 --> 00:30:39,929 instrumented our background workers. So we can see them 809 00:30:39,929 --> 00:30:42,010 here, and their response time plays in. So we 810 00:30:42,010 --> 00:30:44,220 can see that we have this reporting worker that's 811 00:30:44,220 --> 00:30:46,899 taking 95th percentile, thirteen seconds. 812 00:30:46,899 --> 00:30:48,880 Y.K.: So all that time used to be inside 813 00:30:48,880 --> 00:30:51,500 of some request somewhere, and we discovered that there 814 00:30:51,500 --> 00:30:52,840 was a lot of time being spent in things 815 00:30:52,840 --> 00:30:54,679 that we could push to the background. We probably 816 00:30:54,679 --> 00:30:56,789 need to update the agony index so that it 817 00:30:56,789 --> 00:30:59,190 doesn't make workers very high, because spending some time 818 00:30:59,190 --> 00:31:02,120 in your workers is not that big of a 819 00:31:02,120 --> 00:31:02,130 deal. 
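(Editor's note: the "95th percentile" figures on the dashboard can be defined with a naive nearest-rank sort, sketched below. At Skylight's data volumes a streaming approximation would be used instead of sorting every request, so treat this purely as a definition of the metric.)

```ruby
# Naive nearest-rank percentile, as a definition of the "95th percentile
# response time": the duration below which 95% of requests fall.

def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil - 1
  sorted[[rank, 0].max]  # clamp so tiny inputs don't index below zero
end

response_times = (1..100).to_a  # pretend durations in ms

percentile(response_times, 95)  # => 95
percentile(response_times, 50)  # => 50
```

The 95th percentile is generally more honest than the mean for response times, since a handful of very slow outliers can drag an average far away from what most users actually see.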
820 00:31:02,130 --> 00:31:03,000 T.D.: So, so then, if we dive into one 821 00:31:03,000 --> 00:31:05,299 of these, you can see that for this request, 822 00:31:05,299 --> 00:31:07,000 we've got the time explorer up above, and that 823 00:31:07,000 --> 00:31:10,429 shows a graph of response time at, again, 95th 824 00:31:10,429 --> 00:31:11,840 percentile, and you can, if you want to go 825 00:31:11,840 --> 00:31:13,549 back and look at historical data, you just drag 826 00:31:13,549 --> 00:31:15,250 it like this. And this has got a brush, 827 00:31:15,250 --> 00:31:16,980 so you can zoom in and out on different 828 00:31:16,980 --> 00:31:17,760 times. 829 00:31:17,760 --> 00:31:19,649 And every time you change the range, you can 830 00:31:19,649 --> 00:31:21,360 see that it's very responsive. It's never waiting for 831 00:31:21,360 --> 00:31:23,039 the server. But it is going back and fetching 832 00:31:23,039 --> 00:31:25,080 data from the server and then when the data 833 00:31:25,080 --> 00:31:29,210 comes back, you see the whole UI just updates. 834 00:31:29,210 --> 00:31:29,250 And we get that for free with Ember and 835 00:31:29,250 --> 00:31:31,190 And then down below, as we discussed, you actually 836 00:31:31,190 --> 00:31:33,760 have a real histogram. And this histogram, in this 837 00:31:33,760 --> 00:31:37,159 case, is showing. So this is for fifty-seven requests. 838 00:31:37,159 --> 00:31:39,019 And if we click and drag, we could just 839 00:31:39,019 --> 00:31:40,429 move this. And you can see that the aggregate 840 00:31:40,429 --> 00:31:43,360 trace below updates in response to us dragging this. 841 00:31:43,360 --> 00:31:44,919 And if we want to look at the fastest 842 00:31:44,919 --> 00:31:47,500 quartile, we just click faster and we'll just choose 843 00:31:47,500 --> 00:31:48,149 that range on the histogram. 844 00:31:48,149 --> 00:31:49,210 Y.K.: I think it's the fastest load. 845 00:31:49,210 --> 00:31:50,899 T.D.: The fastest load. 
And then if you click 846 00:31:50,899 --> 00:31:52,899 on Slower, you can see the slower requests. So 847 00:31:52,899 --> 00:31:54,669 this makes it really easy to compare and contrast: 848 00:31:54,669 --> 00:31:56,710 OK, why are certain requests fast and why are 849 00:31:56,710 --> 00:31:58,529 certain requests slow? 850 00:31:58,529 --> 00:32:00,779 You can see these blue areas. This 851 00:32:00,779 --> 00:32:03,559 is Ruby code. So, right now it's not super 852 00:32:03,559 --> 00:32:05,820 granular. It would be nice if you could actually 853 00:32:05,820 --> 00:32:07,940 know what was going on here. But it'll at 854 00:32:07,940 --> 00:32:09,940 least tell you where in your controller action this 855 00:32:09,940 --> 00:32:12,690 is happening, and then you can actually see which 856 00:32:12,690 --> 00:32:15,919 database queries are being executed, and what their durations 857 00:32:15,919 --> 00:32:16,080 are. 858 00:32:16,080 --> 00:32:17,889 And you can see that we actually extract the 859 00:32:17,889 --> 00:32:20,419 SQL and we normalize it, so you 860 00:32:20,419 --> 00:32:22,159 can see exactly 861 00:32:22,159 --> 00:32:24,019 what those queries are even if the values are 862 00:32:24,019 --> 00:32:24,820 totally different between them. 863 00:32:24,820 --> 00:32:27,649 Y.K.: Yeah. So the real query, since Rails doesn't 864 00:32:27,649 --> 00:32:29,730 yet support bind parameter extraction, is like, WHERE id 865 00:32:29,730 --> 00:32:32,169 equals one, or ten, or whatever. 866 00:32:32,169 --> 00:32:33,659 T.D.: Yup. So that's pretty cool. 867 00:32:33,659 --> 00:32:37,429 Y.K.: So one other thing is, initially we 868 00:32:37,429 --> 00:32:39,269 actually just showed the whole trace, but we discovered 869 00:32:39,269 --> 00:32:41,659 that, obviously, when you show whole traces you have 870 00:32:41,659 --> 00:32:43,639 information that doesn't really matter that much.
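The normalization Yehuda describes — so that queries differing only in their literal values group together — can be sketched in a few lines of Ruby. This is a simplified, hypothetical stand-in for whatever Skylight actually does; the method name and the regexes are illustrative only.

```ruby
# Hypothetical sketch of SQL normalization: replace literal values with "?"
# so queries that differ only in their values aggregate together.
# (Not Skylight's actual implementation.)
def normalize_sql(sql)
  sql
    .gsub(/'(?:[^']|'')*'/, "?")   # single-quoted string literals
    .gsub(/\b\d+(\.\d+)?\b/, "?")  # numeric literals
end

a = normalize_sql("SELECT * FROM users WHERE id = 1")
b = normalize_sql("SELECT * FROM users WHERE id = 10")
puts a  # => SELECT * FROM users WHERE id = ?
puts a == b  # the two queries now aggregate under one key
```

With this, "WHERE id = 1" and "WHERE id = 10" denormalize to the same string, which is exactly what lets the UI show one row per query shape.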
So we've 871 00:32:43,639 --> 00:32:47,340 recently started to collapse 872 00:32:47,340 --> 00:32:48,850 things that don't matter so much, so that you 873 00:32:48,850 --> 00:32:51,090 can expand or condense the trace. 874 00:32:51,090 --> 00:32:52,519 And we wanted to make it so you don't 875 00:32:52,519 --> 00:32:55,690 have to think about expanding or condensing individual areas; 876 00:32:55,690 --> 00:32:57,960 you just see what matters the most, and 877 00:32:57,960 --> 00:32:59,100 then you can drill into the trivial areas. 878 00:32:59,100 --> 00:33:02,179 T.D.: Yup. So that's the demo of Skylight. 879 00:33:02,179 --> 00:33:04,190 We'd really like it if you checked it out. 880 00:33:04,190 --> 00:33:05,899 There is one more thing I want to show 881 00:33:05,899 --> 00:33:07,720 you that is, like, really freaking cool. This is 882 00:33:07,720 --> 00:33:10,529 coming out of Tilde labs. Carl has 883 00:33:10,529 --> 00:33:13,730 been hacking on this, staying up past midnight, getting 884 00:33:13,730 --> 00:33:15,769 almost no sleep for the past month, trying to 885 00:33:15,769 --> 00:33:16,730 have this ready. 886 00:33:16,730 --> 00:33:19,090 I don't know how many of you know this, 887 00:33:19,090 --> 00:33:23,630 but Ruby 2.1 has a new 888 00:33:23,630 --> 00:33:27,950 stack sampling feature. So you can get really 889 00:33:27,950 --> 00:33:31,149 granular information about how your Ruby code is performing. 890 00:33:31,149 --> 00:33:33,450 So I want to show you. I just mentioned 891 00:33:33,450 --> 00:33:34,570 how it would be nice if we could get 892 00:33:34,570 --> 00:33:36,830 more information about what your Ruby code is 893 00:33:36,830 --> 00:33:38,760 doing. And now we can do that.
894 00:33:38,760 --> 00:33:42,039 Basically, every few milliseconds, this code that Carl wrote 895 00:33:42,039 --> 00:33:44,399 goes into MRI 896 00:33:44,399 --> 00:33:47,419 and takes a snapshot of the stack. 897 00:33:47,419 --> 00:33:50,769 And because this is built in, it's very low-impact. It's 898 00:33:50,769 --> 00:33:53,570 not allocating any new memory, and there's very little performance 899 00:33:53,570 --> 00:33:55,769 hit. Basically, you wouldn't even notice it. And so 900 00:33:55,769 --> 00:33:58,149 every few milliseconds it's sampling, and we take that 901 00:33:58,149 --> 00:34:00,260 information and we send it up to our servers. 902 00:34:00,260 --> 00:34:02,260 So it's almost like you're running a Ruby profiler on 903 00:34:02,260 --> 00:34:05,220 your local dev box, where you get extremely granular 904 00:34:05,220 --> 00:34:07,159 information about where your code is spending its time 905 00:34:07,159 --> 00:34:09,010 in Ruby, per method, all of those things. 906 00:34:09,010 --> 00:34:11,909 But it's happening in production. 907 00:34:11,909 --> 00:34:16,409 So this is it; we enabled 908 00:34:16,409 --> 00:34:18,399 it in staging. You can see that we've got 909 00:34:18,399 --> 00:34:19,600 some rendering bugs. It's still in beta. 910 00:34:19,600 --> 00:34:21,918 Y.K.: Yeah, and we haven't yet collapsed things that 911 00:34:21,918 --> 00:34:21,980 are not important- 912 00:34:21,980 --> 00:34:22,020 T.D.: Yes. 913 00:34:22,020 --> 00:34:23,270 Y.K.: -for this particular feature. 914 00:34:23,270 --> 00:34:24,170 T.D.: So we want to 915 00:34:24,170 --> 00:34:27,610 hide things like framework code, obviously. But this 916 00:34:27,610 --> 00:34:31,070 gives you an incredibly granular view of what 917 00:34:31,070 --> 00:34:35,659 your app is doing in production.
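The mechanism Tom is describing — periodically snapshotting a thread's stack and tallying which frames appear — can be approximated in plain Ruby with `Thread#backtrace`. The real agent uses MRI's internal C-level sampling API and is far cheaper; this sketch, with made-up method names, only illustrates the idea.

```ruby
# Toy sampling profiler: a background thread snapshots the target
# thread's stack every few milliseconds and tallies each frame.
# Skylight's agent does this via MRI's built-in stack sampling with
# much lower overhead; this is only an illustration.
def sample_profile(interval: 0.002)
  counts  = Hash.new(0)
  target  = Thread.current
  running = true
  sampler = Thread.new do
    while running
      snapshot = target.backtrace          # snapshot the stack
      snapshot&.each { |frame| counts[frame] += 1 }
      sleep interval
    end
  end
  yield                                    # run the code being profiled
  running = false
  sampler.join
  counts
end

# Some CPU-bound work to profile (hypothetical workload).
def busy_work
  200_000.times.reduce(0) { |sum, i| sum + i }
end

profile = sample_profile { 50.times { busy_work } }

# Frames from busy_work should dominate the tally.
profile.sort_by { |_, n| -n }.first(5).each { |frame, n| puts "#{n}  #{frame}" }
```

The key property is the one Tom names: the profiled code pays almost nothing per sample, yet the tally converges on where the time actually goes.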
918 00:34:35,659 --> 00:34:39,230 This is an API that's built into 919 00:34:39,230 --> 00:34:43,159 Ruby 2.1. Because our agent is running so low-level, 920 00:34:43,159 --> 00:34:44,659 because we wrote it in Rust, we have the 921 00:34:44,659 --> 00:34:47,409 ability to do things like this, and Carl thinks 922 00:34:47,409 --> 00:34:48,370 that we may be able to actually backport 923 00:34:48,370 --> 00:34:48,480 this to older Rubies, too. So if you're not 924 00:34:48,480 --> 00:34:50,130 on Ruby 2.1, we think that we can actually 925 00:34:50,130 --> 00:34:52,790 bring this to you. But that's TBD. 926 00:34:52,790 --> 00:34:55,480 Y.K.: Yeah, so I think the cool thing 927 00:34:55,480 --> 00:34:57,940 about this, in general, is when you run a 928 00:34:57,940 --> 00:34:59,430 sampling- so this is a sampling profiler, right? We 929 00:34:59,430 --> 00:35:01,260 don't want to be burdening every single thing that 930 00:35:01,260 --> 00:35:03,790 you do in your program with tracing, right? That 931 00:35:03,790 --> 00:35:05,380 would be very slow. 932 00:35:05,380 --> 00:35:06,920 So when you normally run a sampling profiler, you 933 00:35:06,920 --> 00:35:08,760 basically have to 934 00:35:08,760 --> 00:35:11,090 create a loop: run this code a million 935 00:35:11,090 --> 00:35:12,970 times and keep sampling it, and eventually you'll get enough 936 00:35:12,970 --> 00:35:15,030 samples to get the information. But it turns out 937 00:35:15,030 --> 00:35:17,280 that your production server is a loop. Your production 938 00:35:17,280 --> 00:35:20,560 server is serving tons and tons of requests. So, 939 00:35:20,560 --> 00:35:22,880 by simply taking a few microseconds 940 00:35:22,880 --> 00:35:25,580 out of every request and collecting a couple of 941 00:35:25,580 --> 00:35:27,210 samples, over time we can actually get this really 942 00:35:27,210 --> 00:35:29,700 high-fidelity picture with basically no cost.
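Yehuda's statistical argument — a couple of samples per request, aggregated across thousands of requests, converges on the true time distribution — can be shown with a tiny simulation. The phase names and the 70/30 split below are made up purely for illustration.

```ruby
# Sketch of the aggregation argument: each simulated request spends 70%
# of its time in :database and 30% in :view, and contributes just one
# random stack sample. Across many requests, the tally of where samples
# land converges on the true time split. (Illustrative numbers only.)
TIME_SPENT = { database: 0.7, view: 0.3 }

def sample_one_request
  # A uniformly random sample lands in a phase with probability
  # proportional to the time spent in that phase.
  rand < TIME_SPENT[:database] ? :database : :view
end

tally = Hash.new(0)
10_000.times { tally[sample_one_request] += 1 }

estimated = tally.transform_values { |n| n / 10_000.0 }
# estimated[:database] comes out near 0.7, estimated[:view] near 0.3,
# even though no single request was profiled in any depth.
```

Each individual request is barely touched, but the fleet-wide picture is high fidelity — which is the whole point of sampling in production rather than in a synthetic benchmark loop.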
943 00:35:29,700 --> 00:35:31,150 And that's pretty mind-blowing. And this is the kind 944 00:35:31,150 --> 00:35:34,650 of stuff that we can start doing by really 945 00:35:34,650 --> 00:35:37,250 caring about both the user experience and the 946 00:35:37,250 --> 00:35:40,830 implementation and getting really serious about it. And, 947 00:35:40,830 --> 00:35:42,700 honestly, this is a really exciting feature 948 00:35:42,700 --> 00:35:45,330 that really shows what we can do as we 949 00:35:45,330 --> 00:35:46,130 start building this out. 950 00:35:46,130 --> 00:35:47,140 T.D.: Once we've got that 951 00:35:47,140 --> 00:35:48,380 groundwork. 952 00:35:48,380 --> 00:35:49,820 So if you guys want to check it out: 953 00:35:49,820 --> 00:35:51,760 Skylight dot io. It's available today. It's no longer 954 00:35:51,760 --> 00:35:54,040 in private beta. Everyone can sign up. No invitation 955 00:35:54,040 --> 00:35:56,630 token necessary. And you can get a thirty-day free 956 00:35:56,630 --> 00:35:58,240 trial if you haven't started one already. So if 957 00:35:58,240 --> 00:35:59,620 you have any questions, please come see us right 958 00:35:59,620 --> 00:36:00,980 now, or we have a booth in the vendor 959 00:36:00,980 --> 00:36:03,140 hall. Thank you guys very much.