TOM DALE: Hey, you guys ready? Thank you guys so much for coming. This is awesome. I was really, I, when they were putting together the schedule, I said, make sure that you put us down in the Caves of Moria. So thank you guys for coming down and making it. I'm Tom. This is Yehuda.

YEHUDA KATZ: When people told me I was signed up to do a back-to-back talk, I don't know what I was thinking.

T.D.: Yup. So. We want to talk to you today about, about Skylight. So, just a little bit before we talk about that, I want to talk about us a little bit.

So, in 2011 we started a company called Tilde. It's this shirt. It may have made me self-conscious, because this is actually a first edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center. One of the two.

So we started Tilde in 2011, and we had all just left a venture-backed company, and that was a pretty traumatic experience for us, because we spent a lot of time building the company and then we ran out of money and sold to Facebook, and we really didn't want to repeat that experience. So, we decided to start Tilde, and when we did it, we decided to be.
DHH and the other people at Basecamp were talking about, you know, being bootstrapped and proud. And that was a message that really resonated with us, and so we wanted to capture the same thing.

There's only one problem with being bootstrapped and proud, and that is, in order to be both of those things, you actually need money, it turns out. It's not like you just say it in a blog post and then all of a sudden you are in business.

So, we had to think a lot about, OK, well, how do we make money? How do we make money? How do we make a profitable and, most importantly, sustainable business? Because we didn't want to just flip to Facebook in a couple years.

So, looking around, I think the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few million dollars, you know, build a business model that's mostly prayer. But that's not really how we want to think about building open source communities.

We don't really think that that necessarily leads to the best open source communities. And if you're interested more in that, I recommend Leah Silber, who is one of our co-founders. She's giving a talk this afternoon. Oh, sorry.
Friday afternoon, about how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would really suggest you go check out her talk.

So, no. So, no Ember, Inc. Not allowed.

So, we really wanted to build something that leveraged the strengths that we thought that we had. One, I think most importantly, a really deep knowledge of open source and a deep knowledge of the Rails stack. And also Carl, it turns out, is really, really good at building highly scalable big data sys- big data systems. Lots of Hadoop in there.

So, last year at RailsConf, we announced the private beta of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? OK. Many of you. Awesome.

So, so Skylight is a tool for profiling and measuring the performance of your Rails applications in production. And, as a product, Skylight, I think, was built on three really, three key breakthroughs. There were three key breakthroughs. We didn't want to ship a product that was incrementally better than the competition. We wanted to ship a product that was dramatically better. A quantum leap. An order of magnitude better.
And, in order to do that, we spent a lot of time thinking about it, about how we could solve most of the problems that we saw in the existing landscape. And so those, those breakthroughs are predicated- sorry, delivering a product that does that is predicated on these three breakthroughs.

So, the first one I want to talk about is honest response times. Honest response times. So, DHH wrote a blog post on what was then the 37signals blog, now the Basecamp blog, called The problem with averages. How many of you have read this? Awesome.

For those of you that have not, how many of you hate raising your hands at presentations? So, for those of you that-

Y.K.: Just put a button in every seat. Press this button-

T.D.: Press the button if you have. Yes. Great.

So, if you read this blog post, the way it opens is: Our average response time for Basecamp right now is 87ms... That sounds fantastic. And it easily leads you to believe that all is well and that we wouldn't need to spend any more time optimizing performance.

But that's actually wrong. The average number is completely skewed by tons of fast responses to feed requests and other cached replies.
If you have 1000 requests that return in 5ms, and then you have 200 requests taking 2000ms, or two seconds, you can still report an av- a respectable 170ms average. That's useless.

So what does DHH say that we need? DHH says the solution is histograms. So, for those of you like me who were sleeping through your statistics class in high school, and college, a brief primer on histograms. So a histogram is very simple. Basically, you have a series of numbers along some axis, and every time you're in that number, you're in that bucket, you basically increment that bar by one.

So, this is an example of a histogram of response times in a Rails application. So you can see that there's a big cluster in the middle around 488ms, 500ms. This isn't a super speedy app, but it's not the worst thing in the world. And they're all clustered, and then as you kind of move to the right you can see that the response times get longer and longer and longer, and as you move to the left, response times get shorter and shorter and shorter.

So, why do you want a histogram? What's the, what's the most important thing about a histogram?
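The skew DHH describes is easy to reproduce. A minimal Ruby sketch using the two cluster sizes from the quote (note that the exact mean of these illustrative numbers comes out higher than the 170ms quoted, but the conclusion is the same: the mean describes neither cluster):

```ruby
# Two clusters of response times: many fast cached responses and a
# smaller group of slow ones. The mean lands between the clusters and
# describes neither of them.
fast_ms = Array.new(1000, 5)     # 1000 requests at 5ms
slow_ms = Array.new(200, 2000)   # 200 requests at 2000ms
all_ms  = fast_ms + slow_ms

mean = all_ms.sum.to_f / all_ms.size
mean  # => 337.5 -- a "respectable" number, yet no request took ~340ms
```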
Y.K.: Well, I think it's because most requests don't actually look like this.

T.D.: Yes.

Y.K.: Most endpoints don't actually look like this.

T.D.: Right. If you think about what your Rails app is doing, it's a complicated beast, right. Turns out, Ruby, frankly, you can, you can do branching logic. You can do a lot of things.

And so what that means is that one endpoint, if you represent that with a single number, you are losing a lot of fidelity, to the point where it becomes, as DHH said, useless. So, for example, in a histogram, you can easily see, oh, here's a group of requests and response times where I'm hitting the cache, and here's another group where I'm missing it. And you can see that that cluster is significantly slower than the faster cache-hitting cluster.

And the other thing that you get when you have a, a distribution, when you keep the whole distribution in the histogram, is you can look at this number at the 95th percentile, right. So the right way to think about the performance of your web application is not the average, because the average doesn't really tell you anything.
You want to think about the 95th percentile, because that's not the average response time, that's the average worst response time that a user is likely to hit.

And the thing to keep in mind is that it's not as though a customer comes to your site, they issue one request, and then they're done, right. As someone is using your website, they're gonna be generating a lot of requests. And you need to look at the 95th percentile, because otherwise every request is basically you rolling the dice that they're not gonna hit one of those two second, three second, four second responses, close the tab and go to your competitor.

So we look at this as, here's the crazy thing. Here's what I think is crazy. That blog post that DHH wrote, it's from 2009. It's been five years, and there's still no tool that does what DHH was asking for. So, we, frankly, we smelled money. We were like, holy crap.

Y.K.: Yeah, why isn't that slide green?

T.D.: Yeah. It should be green and dollars. I think Keynote has the dollars, the make-it-rain effect I should have used. So we smelled blood in the water. We're like, this is awesome.
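The histogram-and-percentile idea above can be sketched in a few lines of Ruby (hypothetical response times; the bucket width is chosen arbitrarily):

```ruby
# Build a histogram by bucketing response times, then read off the
# 95th percentile: the response time that 95% of requests come in under.
times_ms = [12, 15, 18, 20, 22, 25, 30, 45, 480, 2100] * 10

bucket_width = 100
histogram = times_ms.group_by { |t| (t / bucket_width) * bucket_width }
                    .transform_values(&:size)
                    .sort.to_h
# e.g. { 0 => 80, 400 => 10, 2100 => 10 } -- the clusters stay visible

sorted = times_ms.sort
p95 = sorted[(0.95 * sorted.size).ceil - 1]
p95  # => 2100, the "average worst" time the mean would have hidden
```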
There's only one problem that we discovered, and that is, it turns out that building this thing is actually really, really freaky hard. Really, really hard.

So, we announced the private beta at RailsConf last year. Before doing that, we spent a year of research spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized, we made a lot of problems. We made a lot of errors when we were building this system.

So then, after RailsConf last year, we basically took six months to completely rewrite the backend from the ground up. And I think, tying into your keynote, Yehuda, we, we were like, oh, we clearly have a bespoke problem. No one else is doing this. So we rewrote our own custom backend. And then we had all these problems, and we realized that they had actually already all been solved by the open source community. And so we benefited tremendously by having a shared solution.

Y.K.: Yeah. So our first release of this was really very bespoke, and the current release uses a tremendous amount of very off-the-shelf open source projects that just solved the particular problem very effectively, very well.
None of which are as easy to use as Rails, but all of which solve really thorny problems very effectively.

T.D.: So, so let's just talk, just for your own understanding, let's talk about how most performance monitoring tools work. So the way that most of these work is that you run your Rails app, and running inside of your Rails app is some gem, some agent that you install. And every time the Rails app handles a request, it generates events, and those events, which include information about performance data, those events are passed into the agent.

And then the agent sends that data to some kind of centralized server. Now, it turns out that doing a running average is actually really simple. Which is why everyone does it. Basically you can do it in a single SQL query, right. All you do is you have three columns in the database: the endpoint, the running average, and the number of requests. And so those are the two things that you need to keep a running average, right.

So keeping a running average is actually really simple from a technical point of view.

Y.K.: I don't think you could even do this in JavaScript, due to the lack of integers.

T.D.: Yes.
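The single-query running average Tom describes needs only the current average and the request count per endpoint. A minimal sketch of the update rule (the endpoint name is hypothetical; in SQL this would be one UPDATE statement):

```ruby
# Running average: store (average, count) per endpoint and fold each
# new duration in. No history is kept, which is exactly why the
# distribution is lost.
Stat = Struct.new(:endpoint, :average, :count) do
  def record(duration_ms)
    self.average = (average * count + duration_ms) / (count + 1.0)
    self.count  += 1
  end
end

stat = Stat.new("PostsController#index", 0.0, 0)
[100, 200, 600].each { |ms| stat.record(ms) }
stat.average  # => 300.0, hiding that one request was 6x another
```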
You probably wouldn't want to do any math in JavaScript, it turns out. So, so we took a little bit different approach. Yehuda, do you want to go over the next section?

Y.K.: Yeah. Sure. So, when we first started, right at the beginning, we basically did a similar thing where we had a bunch - your app creates events. Most of those start off as being ActiveSupport::Notifications, although it turns out that there's very limited use of ActiveSupport::Notifications, so we had to do some normalization work to get them sane, which we're gonna be upstreaming back into, into Rails.

But, one thing that's kind of unfortunate about having every single Rails app have an agent is that you end up having to do a lot of the same kind of work over and over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests. So now you have a queue of things that you're sending over HTTP in every single one of your Rails processes. And, of course, you probably don't notice this. People are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it suddenly starts taking twenty, thirty, forty, fifty more megabytes.
But we really wanted to keep the actual memory per process down to a small amount. So one of the very first things that we did, we even did it before last year, is that we pulled out all that shared logic into a, a separate process called the coordinator. And the agent is basically responsible simply for collecting the, the trace, and it's not responsible for actually talking to our server at all. And that means that the coordinator only has to do this queue, this keeping a, a bunch of work in one place, and it doesn't end up using up as much memory.

And I think this, this ended up being very effective for us.

T.D.: And I think that low overhead also allows us to just collect more information, in general.

Y.K.: Yeah.

Now, after our first attempt, we started getting a bunch of customers that were telling us that even the separate - so the separate coordinator started as a good thing and a bad thing. On the one hand, there's only one of them, so it uses up only one set of memory. On the other hand, it's really easy for someone to go in and ps that process and see how many megabytes of memory it's using.
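The agent/coordinator split above can be sketched with threads standing in for Rails worker processes (the real coordinator is a separate OS process; this only shows the hand-off shape):

```ruby
# Each "agent" only hands traces to a shared queue; one coordinator
# drains the queue and batches traces for upload, so the buffering
# memory and the HTTP machinery exist once, not once per worker.
queue = Queue.new

workers = 3.times.map do |i|
  Thread.new do
    5.times { |n| queue << { worker: i, trace: "request-#{n}" } }
  end
end
workers.each(&:join)

# The coordinator's side: drain everything into one outbound batch.
batch = []
batch << queue.pop until queue.empty?
batch.size  # => 15 traces buffered in one place instead of three
```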
So, we got a lot of additional complaints that said, oh, your process is using a lot of memory. And, I spent a few weeks, I, I know Ruby pretty well. I spent a couple of weeks. I actually wrote a gem called Allocation Counter that basically went in to try to pinpoint exactly where the allocations were hap- coming from. But it turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a regular expression in Ruby can allocate match objects that get put back on the stack.

And so I was able to pare this down to some degree. But I really discovered quickly that trying to keep a lid on the memory allocation by doing all the stuff in Ruby is mostly fine. But for our specific use case, where we really wanna, we wanna be telling you, you can run the agent on your process, on your box, and it's not gonna use a lot of memory, we really needed something more efficient. And our first thought was, we'll use C++ or C. No problem. C is, is native. It's great. And Carl did the work. Carl is very smart. And then he said, Yehuda. It is now your turn. You need to start maintaining this.
And I said, I don't trust myself to write C++ code that's running in all of your guys's boxes and not seg-fault. So I don't think that, that doesn't work for me.

And so I, I noticed that Rust was coming along, and what Rust really gives you is it gives you the ability to write low-level code a la C or C++, with manual memory management that keeps your memory allocation low and keeps things speedy. Low resource utilization. While also giving you compile-time guarantees about not seg-faulting. So, again, if your processes randomly started seg-faulting because you installed the agent, I think you would stop being our customer very quickly. So having, what, pretty much 100% guarantees about that was very important to us. And so that's why we decided to use Rust.

I'll just keep going.

T.D.: Keep going.

Y.K.: So, we had this coordinator object. And basically the coordinator object is receiving events. So the events basically end up being these traces that describe what's happening in your application. And the next thing, I think in our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have really big requests.
So you may have a big request with a big SQL query in it, or a lot of big SQL queries in it. Some people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs.

So, one of the first things that we did early on was we switched to using protobuf as the transport mechanism, and that really shrunk, shrunk down the payloads a lot. Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did, like, a weekend hack to just port it over to Java and got, like, 200x performance. And you don't always get 200x performance; if mostly what you're doing is database queries, you're not gonna get a huge performance swing.

But mostly what we're doing is math. And algorithms and data structures. And for that, Ruby is, it could, in theory, one day, have a good JIT or something, but today, writing that code in Java didn't end up being significantly more code, cause it's just, you know, algorithms and data structures.
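The payload shrink from moving off JSON comes largely from how numbers and repeated keys are encoded. This sketch is not protobuf itself (protobuf uses tagged varints); it just contrasts JSON with a naive fixed-width binary packing to show the order of magnitude, with made-up durations:

```ruby
require "json"

# Trace payloads are mostly numbers. JSON repeats the key name for
# every value and spells each number out in ASCII; a binary encoding
# stores a few bytes per value and no key names at all.
durations_us = Array.new(100) { |i| 4_800 + i }  # 100 span durations

as_json   = JSON.generate(durations_us.map { |d| { "duration_us" => d } })
as_binary = durations_us.pack("V*")  # 4 little-endian bytes per value

as_json.bytesize    # => 2101
as_binary.bytesize  # => 400
```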
T.D.: And I'll just note, something about standardizing on protobufs in our, in our stack is actually a huge win, because we, we realized, hey, browsers, as it turns out, are pretty powerful these days. They've got, you know, they can allocate memory, they can do all these types of computation. So, and protobuf libraries exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical serialization form, and then you can move payloads around the entire stack and everything speaks the same language, so you've saved the serialization and deserialization.

Y.K.: And JavaScript is actually surprisingly effective at des- at taking protobufs and converting them to the format that we need efficiently. So, so we basically take this data. The Java collector is basically collecting all these protobufs, and pretty much it just turns around, and this is sort of where we got into bespoke territory before, when we started rolling our own. But we realized that when you write a big, distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about.

So, what we do is we basically take these, take these payloads that are coming in.
We convert them into batches and we send the batches down into the Kafka queue. And the, the next thing that happens, so the Kafka, sorry, Kafka's basically just a queue that allows you to throw things into, I guess it might be considered similar to, like, something like AMQP. It has some nice fault-tolerance properties and integrates well with Storm. But most important, it's just super, super high-throughput.

So we basically didn't want to put any barrier between you giving us the data and us getting it to disk as soon as possible.

T.D.: Yeah. Which we'll, I think, talk about in a bit.

Y.K.: So we, so then basically Kafka takes the data and starts sending it into Storm. And if you think about what has to happen in order to get some request. So, you have these requests. There's, you know, maybe traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, OK, I can see that in all of your requests, you had this SQL query and it took around this amount of time and it happened as a child of this other node. And the way to think about that is basically just a processing pipeline. Right. So you have these traces that come in one side.
You start passing them through a bunch of processing steps, and then you end up on the other side with the data.

And Storm is actually a way of describing that processing pipeline in sort of functional style, and then you tell it, OK, here's how many servers I need. Here's how, here's how I'm gonna handle failures. And it basically deals with distribution and scaling and all that stuff for you. And part of that is because you wrote everything using functional style.

And so what happens is Kafka sends the data into the entry spout, which is sort of terminology in, terminology in Storm for these streams that get created. And they basically go into these processing things, which very clever- cutely are called bolts. This is definitely not the naming I would have used, but. So they're called bolts. And the idea is that basically every request may have several things.

So, for example, we now automatically detect n + 1 queries, and that's sort of a different kind of processing from just, make a picture of the entire request. Or, what is the 95th percentile across your entire app, right. These are all different kinds of processing.
So we take the data and we 00:17:30.940 --> 00:17:33.580 send them into a bunch of bolts, and the 00:17:33.580 --> 00:17:35.750 cool thing about bolts is that, again, because they're 00:17:35.750 --> 00:17:38.890 just functional chaining, you can take the output from 00:17:38.890 --> 00:17:41.130 one bolt and feed it into another bolt. And 00:17:41.130 --> 00:17:43.510 that works, that works pretty well. And, and you 00:17:43.510 --> 00:17:44.730 don't have to worry about - I mean, you 00:17:44.730 --> 00:17:46.600 have to worry a little bit about things like 00:17:46.600 --> 00:17:49.960 fault tolerance, failure, idempotence. But you worry about 00:17:49.960 --> 00:17:52.230 them at, at the abstraction level, and then the 00:17:52.230 --> 00:17:54.159 operational part is handled for you. 00:17:54.159 --> 00:17:55.740 T.D.: So it's just like a very declarative way 00:17:55.740 --> 00:17:58.179 of describing how this computation works in, in a 00:17:58.179 --> 00:17:59.230 way that's easy to scale. 00:17:59.230 --> 00:18:01.860 Y.K.: And Carl actually talked about this at very 00:18:01.860 --> 00:18:04.909 high speed yesterday, and some of you may 00:18:04.909 --> 00:18:06.620 have been there. I would recommend watching the video 00:18:06.620 --> 00:18:09.020 when it comes out if you want to make 00:18:09.020 --> 00:18:11.159 use of this stuff in your own applications. 00:18:11.159 --> 00:18:13.250 And then when you're finally done with all the 00:18:13.250 --> 00:18:14.970 processing, you need to actually do something with it. 00:18:14.970 --> 00:18:16.289 You need to put it somewhere so that the 00:18:16.289 --> 00:18:18.360 web app can get access to it, and that 00:18:18.360 --> 00:18:21.350 is basically, we use Cassandra for this. And Cassandra, 00:18:21.350 --> 00:18:24.929 again, is mostly, it's a dumb database, but it 00:18:24.929 --> 00:18:27.780 has, it has high capacity.
It has some of 00:18:27.780 --> 00:18:29.080 the fault-tolerance capabilities that we want. 00:18:29.080 --> 00:18:30.770 T.D.: We're very, we're just very, very write-heavy, right. 00:18:30.770 --> 00:18:32.820 Like, we tend to be writing more than we're 00:18:32.820 --> 00:18:33.730 ever reading. 00:18:33.730 --> 00:18:36.270 Y.K.: Yup. And then when we're done, when we're 00:18:36.270 --> 00:18:38.780 done with a particular batch, Cassandra basically kicks off 00:18:38.780 --> 00:18:40.720 the process over again. So we're basically doing these 00:18:40.720 --> 00:18:41.200 things as batches. 00:18:41.200 --> 00:18:42.919 T.D.: So these are, these are roll-ups, is what's 00:18:42.919 --> 00:18:45.360 happening here. So basically every minute, every ten minutes, 00:18:45.360 --> 00:18:48.140 and then at every hour, we reprocess and we 00:18:48.140 --> 00:18:49.970 re-aggregate, so that when you query us we know 00:18:49.970 --> 00:18:51.010 exactly what to give you. 00:18:51.010 --> 00:18:52.830 Y.K.: Yup. So we sort of have this cycle 00:18:52.830 --> 00:18:55.049 where we start off, obviously, in the first five 00:18:55.049 --> 00:18:56.890 seconds, the first minute, you really want high granularity. 00:18:56.890 --> 00:18:59.140 You want to see what's happening right now. But, 00:18:59.140 --> 00:19:00.110 if you want to go back and look at 00:19:00.110 --> 00:19:02.460 data from three months ago, you probably care about 00:19:02.460 --> 00:19:04.830 it at, like, the day granularity or maybe the hour 00:19:04.830 --> 00:19:07.490 granularity. So, we basically do these roll-ups and we 00:19:07.490 --> 00:19:09.200 cycle through the process. 00:19:09.200 --> 00:19:11.880 T.D.: So this, it turns out, building the system 00:19:11.880 --> 00:19:15.169 required an intense amount of work. Carl spent probably 00:19:15.169 --> 00:19:17.490 six months reading PhD thesises to find- 00:19:17.490 --> 00:19:18.140 Y.K.: Theses. 00:19:18.140 --> 00:19:23.850 T.D.: Theses.
To find, to find data structures and 00:19:23.850 --> 00:19:25.870 algorithms that we could use. Because this is a 00:19:25.870 --> 00:19:28.340 huge amount of data. Like, I think even a 00:19:28.340 --> 00:19:31.049 few months after we were in private data, private 00:19:31.049 --> 00:19:34.179 beta, we were already handling over a billion requests 00:19:34.179 --> 00:19:36.200 per month. And obviously there's no way that we- 00:19:36.200 --> 00:19:37.970 Y.K.: Basically the number of requests that we handle 00:19:37.970 --> 00:19:39.770 is the sum of all of the requests that 00:19:39.770 --> 00:19:40.010 you handle. 00:19:40.010 --> 00:19:40.110 T.D.: Right. 00:19:40.110 --> 00:19:41.200 Y.K.: And all of our customers handle. 00:19:41.200 --> 00:19:41.909 T.D.: Right. Right. So. 00:19:41.909 --> 00:19:43.120 Y.K.: So that's a lot of requests. 00:19:43.120 --> 00:19:45.760 T.D.: So obviously we can't provide a service, at 00:19:45.760 --> 00:19:48.480 least one that's not, we can't provide an affordable 00:19:48.480 --> 00:19:51.130 service, an accessible service, if we have to store 00:19:51.130 --> 00:19:53.190 terabytes or exabytes of data just to tell you 00:19:53.190 --> 00:19:53.990 how your app is running. 00:19:53.990 --> 00:19:56.630 Y.K.: And I think, also a problem, it's problematic 00:19:56.630 --> 00:19:58.429 if you store all the data in a database 00:19:58.429 --> 00:19:59.760 and then every single time someone wants to learn 00:19:59.760 --> 00:20:01.590 something about that, you have to do a query. 00:20:01.590 --> 00:20:03.159 Those queries can take a very long time. They 00:20:03.159 --> 00:20:04.700 can take minutes. And I think we really wanted 00:20:04.700 --> 00:20:07.159 to have something that would be very, that would, 00:20:07.159 --> 00:20:09.580 where the feedback loop would be fast. 
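The minute/ten-minute/hour roll-ups described a moment ago can be sketched as a simple re-aggregation over time buckets: keep only counts and sums per bucket, and fold fine buckets into coarser ones as data ages. The bucket layout below is invented for illustration; it is not Skylight's actual Cassandra schema.

```ruby
# Sketch of time-series roll-ups: per-minute buckets are folded into
# per-hour buckets so old data stays cheap to store and fast to
# query, with no raw requests retained. Layout invented.

# Per-minute buckets: epoch-minute => { count:, total_ms: }
minute_buckets = {
  0  => { count: 10, total_ms: 1200.0 },
  1  => { count: 30, total_ms: 2400.0 },
  61 => { count: 20, total_ms: 1000.0 },
}

# Fold fine buckets into coarser ones (60 minutes per hour).
def roll_up(buckets, factor)
  out = Hash.new { |h, k| h[k] = { count: 0, total_ms: 0.0 } }
  buckets.each do |t, b|
    coarse = t / factor
    out[coarse][:count]    += b[:count]
    out[coarse][:total_ms] += b[:total_ms]
  end
  out
end

hour_buckets = roll_up(minute_buckets, 60)
# Mean response time for hour 0 is recoverable from the aggregates.
hour0_mean = hour_buckets[0][:total_ms] / hour_buckets[0][:count]
```

Because only sums and counts are stored, answering "what was the mean last month, by hour" never requires touching raw request data.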
So we 00:20:09.580 --> 00:20:11.860 wanted to find algorithms that let us handle the 00:20:11.860 --> 00:20:13.940 data at, at real time, and then provide it 00:20:13.940 --> 00:20:16.020 to you at real time instead of these, like, 00:20:16.020 --> 00:20:18.309 dump the data somewhere and then do these complicated 00:20:18.309 --> 00:20:18.549 queries. 00:20:18.549 --> 00:20:20.820 T.D.: So, hold on. So this slide was not 00:20:20.820 --> 00:20:24.440 supposed to be here. It was supposed to be 00:20:24.440 --> 00:20:27.669 a Rails slide. So, whoa. I went too far. 00:20:27.669 --> 00:20:29.970 K. We'll watch that again. That's pretty cool. So 00:20:29.970 --> 00:20:32.460 then the last thing I want to say is, 00:20:32.460 --> 00:20:34.330 perhaps your take away from looking at this architecture 00:20:34.330 --> 00:20:36.750 diagram is, oh my gosh, these Rails guys completely- 00:20:36.750 --> 00:20:38.309 Y.K.: They really jumped the shark. 00:20:38.309 --> 00:20:40.870 T.D.: They jumped the shark. They ditched Rails. I 00:20:40.870 --> 00:20:42.330 saw, like, three Tweets yesterday - I wasn't here, 00:20:42.330 --> 00:20:43.450 I was in Portland yesterday, but I saw, like, 00:20:43.450 --> 00:20:44.340 three Tweets that were like, I'm at RailsConf and 00:20:44.340 --> 00:20:48.809 I haven't seen a single talk about, like, Rails. 00:20:48.809 --> 00:20:51.950 So that's true here, too. But, I want to 00:20:51.950 --> 00:20:54.940 assure you that we are only using this stack 00:20:54.940 --> 00:20:58.070 for the heavy computation. We started in Rails. We 00:20:58.070 --> 00:21:00.730 started, we were like, hey, what do we need. 00:21:00.730 --> 00:21:01.929 Ah, well, people probably need to authenticate and log 00:21:01.929 --> 00:21:05.090 in, and we probably need to do billing. And 00:21:05.090 --> 00:21:06.220 those are all things that Rails is really, really 00:21:06.220 --> 00:21:08.830 good at. 
So we started with Rails as, basically, 00:21:08.830 --> 00:21:11.110 the starting point, and then when we realized oh 00:21:11.110 --> 00:21:14.039 my gosh, computation is really slow. There's no way 00:21:14.039 --> 00:21:15.240 we're gonna be able to offer this service. OK. 00:21:15.240 --> 00:21:16.059 Now let's think about how we can do all 00:21:16.059 --> 00:21:16.270 of that. 00:21:16.270 --> 00:21:18.510 Y.K.: And I think notably, a lot of people 00:21:18.510 --> 00:21:20.049 who look at Rails are like, there's a lot 00:21:20.049 --> 00:21:21.750 of companies that have built big stuff on Rails, 00:21:21.750 --> 00:21:24.090 and their attitude is, like, oh, this legacy terrible 00:21:24.090 --> 00:21:25.409 Rails app. I really wish we could get rid 00:21:25.409 --> 00:21:26.850 of it. If we could just write everything in 00:21:26.850 --> 00:21:30.760 Scala or Clojure or Go, everything would be amazing. 00:21:30.760 --> 00:21:31.500 That is definitely not our attitude. Our attitude is 00:21:31.500 --> 00:21:34.320 that Rails is really amazing, at particular, at the 00:21:34.320 --> 00:21:36.740 kinds of things that are really common across everyone's 00:21:36.740 --> 00:21:39.900 web applications - authentication, billing, et cetera. And we 00:21:39.900 --> 00:21:41.429 really want to be using Rails for the parts 00:21:41.429 --> 00:21:43.360 of our app- even things like error-tracking, we do 00:21:43.360 --> 00:21:45.039 through the Rails app. We want to be using 00:21:45.039 --> 00:21:47.470 Rails because it's very productive at doing those things. 00:21:47.470 --> 00:21:48.789 It happens to be very slow with doing data 00:21:48.789 --> 00:21:50.330 crunching, so we're gonna use a different tool for 00:21:50.330 --> 00:21:50.539 that. 
00:21:50.539 --> 00:21:51.909 But I don't think you'll ever see me getting 00:21:51.909 --> 00:21:54.210 up and saying, ah, I really wish we had 00:21:54.210 --> 00:21:55.080 just started writing, you know, the Rails app in 00:21:55.080 --> 00:21:55.159 rust. 00:21:55.159 --> 00:21:55.309 T.D.: Yeah. 00:21:55.309 --> 00:21:58.090 Y.K.: That would be terrible. 00:21:58.090 --> 00:22:02.429 T.D.: So that's number one, is, is, honest response 00:22:02.429 --> 00:22:04.390 times, which we're, which it turns out, seems like 00:22:04.390 --> 00:22:08.289 it should be easy, requires storing insane amount of 00:22:08.289 --> 00:22:09.169 data. 00:22:09.169 --> 00:22:10.620 So the second thing that we realized when we 00:22:10.620 --> 00:22:12.059 were looking at a lot of these tools, is 00:22:12.059 --> 00:22:14.360 that most of them focus on data. They focus 00:22:14.360 --> 00:22:16.590 on giving you the raw data. But I'm not 00:22:16.590 --> 00:22:19.130 a machine. I'm not a computer. I don't enjoy 00:22:19.130 --> 00:22:21.320 sifting through data. That's what computers are good for. 00:22:21.320 --> 00:22:23.289 I would rather be drinking a beer. It's really 00:22:23.289 --> 00:22:24.830 nice in Portland, this time of year. 00:22:24.830 --> 00:22:27.409 So, we wanted to think about, if you're trying 00:22:27.409 --> 00:22:31.179 to solve the performance problems in your application, what 00:22:31.179 --> 00:22:32.840 are the things that you would suss out with 00:22:32.840 --> 00:22:35.760 the existing tools after spending, like, four hours depleting 00:22:35.760 --> 00:22:37.510 your ego to get there? 00:22:37.510 --> 00:22:38.929 Y.K.: And I think part of this is just 00:22:38.929 --> 00:22:42.260 people are actually very, people like to think that 00:22:42.260 --> 00:22:43.880 they're gonna use these tools, but when the tools 00:22:43.880 --> 00:22:45.320 require you to dig through a lot of data, 00:22:45.320 --> 00:22:47.090 people just don't use them very much. 
So, the 00:22:47.090 --> 00:22:48.330 goal here was to build a tool that people 00:22:48.330 --> 00:22:50.809 actually use and actually like using, and not to 00:22:50.809 --> 00:22:54.870 build a tool that happens to provide a lot 00:22:54.870 --> 00:22:55.039 of data you can sift through. 00:22:55.039 --> 00:22:55.059 T.D.: Yes. 00:22:55.059 --> 00:22:55.929 So, probably the, one of the first things that 00:22:55.929 --> 00:22:58.529 we realized is that we don't want to provide. 00:22:58.529 --> 00:23:00.400 This is a trace of a request, you've probably 00:23:00.400 --> 00:23:04.070 seen similar UIs using other tools, using, for example, 00:23:04.070 --> 00:23:07.059 the inspector in, in like Chrome or Safari, and 00:23:07.059 --> 00:23:08.700 this is just showing basically, it's basically a visual 00:23:08.700 --> 00:23:10.830 stack trace of where your application is spending its 00:23:10.830 --> 00:23:11.600 time. 00:23:11.600 --> 00:23:13.950 But I think what was important for us is 00:23:13.950 --> 00:23:17.809 showing not just a single request, because your app 00:23:17.809 --> 00:23:20.570 handles, you know, hundreds of thousands of requests, or 00:23:20.570 --> 00:23:22.679 millions of requests. So looking at a single request 00:23:22.679 --> 00:23:24.630 statistically is complete, it's just noise. 00:23:24.630 --> 00:23:26.500 Y.K.: And it's especially bad if it's the worst 00:23:26.500 --> 00:23:28.659 request, because the worst request is, is really noise. 00:23:28.659 --> 00:23:30.850 It's like, a hiccup in the network, right. 00:23:30.850 --> 00:23:31.250 T.D.: It's the outlier. Yeah. 00:23:31.250 --> 00:23:32.150 Y.K.: It's literally the outlier. 00:23:32.150 --> 00:23:35.659 T.D.: It's literally the outlier. Yup. So, what we 00:23:35.659 --> 00:23:38.700 present in Skylight is something a little bit different, 00:23:38.700 --> 00:23:41.770 and it's something that we call the aggregate trace. 
00:23:41.770 --> 00:23:46.260 So the aggregate trace is basically us taking all 00:23:46.260 --> 00:23:49.559 of your requests, averaging them out where each of 00:23:49.559 --> 00:23:51.750 these things spends their time, and then showing you 00:23:51.750 --> 00:23:54.899 that. So this is basically like, this is like, 00:23:54.899 --> 00:23:57.929 this is like the statue of David. It is 00:23:57.929 --> 00:24:00.880 the idealized form of the stack trace of how 00:24:00.880 --> 00:24:02.270 your application's behaving. 00:24:02.270 --> 00:24:05.330 But, of course, you have the same problem as 00:24:05.330 --> 00:24:07.500 before, which is, if this is all that we 00:24:07.500 --> 00:24:10.580 were showing you, it would be obscuring a lot 00:24:10.580 --> 00:24:12.870 of information. You want to actually be able to 00:24:12.870 --> 00:24:13.990 tell the difference between, OK, what's my stack trace 00:24:13.990 --> 00:24:16.419 look like for fast requests, and how does that 00:24:16.419 --> 00:24:18.539 differ from requests that are slower. 00:24:18.539 --> 00:24:20.860 So what we've got, I've got a little video 00:24:20.860 --> 00:24:22.320 here. You can see that when I move the 00:24:22.320 --> 00:24:26.490 slider, that this trace below it is actually updating 00:24:26.490 --> 00:24:29.130 in real time. As I move the slider around, 00:24:29.130 --> 00:24:31.770 you can see that the aggregate trace actually updates 00:24:31.770 --> 00:24:34.240 with it. And that's because we're collecting all this 00:24:34.240 --> 00:24:36.159 information. We're collecting, like I said, a lot of 00:24:36.159 --> 00:24:38.669 data. We can recompute this aggregate trace on the 00:24:38.669 --> 00:24:38.909 fly. 00:24:38.909 --> 00:24:41.200 Basically, for each bucket, we're storing a different trace, 00:24:41.200 --> 00:24:42.880 and then on the client we're reassembling that. We'll 00:24:42.880 --> 00:24:43.899 go into that a little bit. 
00:24:43.899 --> 00:24:45.799 Y.K.: And I think it's really important that you 00:24:45.799 --> 00:24:48.370 be able to do these experiments quickly. If every 00:24:48.370 --> 00:24:50.059 time you think, oh, I wonder what happens if 00:24:50.059 --> 00:24:52.260 I add another histogram bucket, if it requires a 00:24:52.260 --> 00:24:54.830 whole full page refresh. Then that would basically make 00:24:54.830 --> 00:24:56.309 people not want to use the tool. Not able 00:24:56.309 --> 00:24:58.580 to use the tool. So, actually building something which 00:24:58.580 --> 00:24:59.649 is real time and fast, gets the data as 00:24:59.649 --> 00:25:00.110 it comes, was really important to us. 00:25:00.110 --> 00:25:01.220 T.D.: So that's number one. 00:25:01.220 --> 00:25:04.850 And the second thing. So we built that, and 00:25:04.850 --> 00:25:07.929 we're like, OK, well what's next? And I think 00:25:07.929 --> 00:25:09.250 that the big problem with this is that you 00:25:09.250 --> 00:25:12.020 need to know that there's a problem before you 00:25:12.020 --> 00:25:14.429 go look at it, right. So we have been 00:25:14.429 --> 00:25:16.080 working for the past few months, and the Storm 00:25:16.080 --> 00:25:18.390 infrastructure that we built makes it pretty straight-forward to 00:25:18.390 --> 00:25:21.149 start building more abstractions on top of the data 00:25:21.149 --> 00:25:21.559 that we've already collected. 00:25:21.559 --> 00:25:24.120 It's a very declarative system. So we've been working 00:25:24.120 --> 00:25:26.679 on a feature called inspections. And what's cool about 00:25:26.679 --> 00:25:29.279 inspections is that we can look at this tremendous 00:25:29.279 --> 00:25:31.270 volume of data that we've collected from your app, 00:25:31.270 --> 00:25:33.840 and we can automatically tease out what the problems 00:25:33.840 --> 00:25:35.210 are. So the first one that we shipped, this 00:25:35.210 --> 00:25:37.399 is in beta right now. 
It's not, it's not 00:25:37.399 --> 00:25:39.840 out and enabled by default, but it's behind 00:25:39.840 --> 00:25:42.440 a feature flag that we've had some users turning 00:25:42.440 --> 00:25:42.730 on. 00:25:42.730 --> 00:25:44.419 And, and trying out. And so what we can 00:25:44.419 --> 00:25:46.450 do in this case, is because we have information 00:25:46.450 --> 00:25:48.730 about all of the database queries in your app, 00:25:48.730 --> 00:25:50.840 we can look and see if you have n 00:25:50.840 --> 00:25:52.390 plus one queries. Can you maybe explain what an 00:25:52.390 --> 00:25:53.250 n plus one query is? 00:25:53.250 --> 00:25:54.600 Y.K.: Yeah. So, people know, hopefully, what n 00:25:54.600 --> 00:25:56.770 plus one queries are. But it's the idea that 00:25:56.770 --> 00:25:59.260 you, by accident, for some reason, instead of making 00:25:59.260 --> 00:26:01.970 one query, you asked for, like, all the posts 00:26:01.970 --> 00:26:02.940 and then you iterated through all of them and 00:26:02.940 --> 00:26:04.940 got all the comments and now you, instead of 00:26:04.940 --> 00:26:08.679 having one query, you have one query per post, 00:26:08.679 --> 00:26:10.309 right. And what I like to 00:26:10.309 --> 00:26:12.549 do is eager loading, where you say include 00:26:12.549 --> 00:26:14.559 comments, right. But you have to know that you 00:26:14.559 --> 00:26:15.039 have to do that. 00:26:15.039 --> 00:26:16.899 So there's some tools that will run in development 00:26:16.899 --> 00:26:18.380 mode, if you happen to catch it, like Bullet. 00:26:18.380 --> 00:26:20.460 This is basically a tool that's looking at 00:26:20.460 --> 00:26:22.210 every single one of your classes and has some 00:26:22.210 --> 00:26:24.169 thresholds that, once we see that a bunch of 00:26:24.169 --> 00:26:27.429 your requests have the same exact query, so we 00:26:27.429 --> 00:26:29.549 do some work to pull out binds.
So if 00:26:29.549 --> 00:26:32.200 it's, like, where something equals one, we will automatically 00:26:32.200 --> 00:26:34.110 pull out the one and replace it with a 00:26:34.110 --> 00:26:34.740 question mark. 00:26:34.740 --> 00:26:36.230 And then we basically take all those queries, if 00:26:36.230 --> 00:26:39.529 they're the exact same query repeated multiple times, subject 00:26:39.529 --> 00:26:41.390 to some thresholds, we'll start showing you hey, there's 00:26:41.390 --> 00:26:42.450 an n plus one query. 00:26:42.450 --> 00:26:43.799 And you can imagine this same sort of thing 00:26:43.799 --> 00:26:46.320 being done for things, like, are you missing an 00:26:46.320 --> 00:26:49.690 index, right. Or, are you using the Ruby version 00:26:49.690 --> 00:26:50.950 of JSON when you should be using the native 00:26:50.950 --> 00:26:52.179 version of JSON. These are all things that we 00:26:52.179 --> 00:26:55.140 can start detecting just because we're consuming an enormous 00:26:55.140 --> 00:26:57.510 amount of information, and we can start writing some 00:26:57.510 --> 00:26:59.330 heuristics for bubbling it up. 00:26:59.330 --> 00:27:02.330 So, third and final breakthrough, we realized that we 00:27:02.330 --> 00:27:05.289 really, really needed a lightning fast UI. Something really 00:27:05.289 --> 00:27:08.279 responsive. So, in particular, the feedback loop is critical, 00:27:08.279 --> 00:27:09.929 right. You can imagine, if the way that you 00:27:09.929 --> 00:27:12.279 dug into data was you clicked and you wait 00:27:12.279 --> 00:27:14.320 an hour, and then you get your results, no 00:27:14.320 --> 00:27:15.730 one would do it. No one would ever do 00:27:15.730 --> 00:27:15.890 it. 00:27:15.890 --> 00:27:19.090 And the existing tools are OK, but you click 00:27:19.090 --> 00:27:20.470 and you wait. 
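The bind extraction and n + 1 detection described above can be sketched as a naive normalize-and-tally pass: replace literal values with `?` so repeated queries collapse to one shape, then flag shapes that recur in a single request. The regexes and the threshold below are invented for illustration; a real implementation works from parsed queries, not regexes.

```ruby
# Naive sketch of bind extraction + n+1 detection.
N_PLUS_ONE_THRESHOLD = 5  # invented threshold for illustration

def normalize(sql)
  sql.gsub(/'[^']*'/, "?")   # replace string literals with ?
     .gsub(/\b\d+\b/, "?")   # replace numeric literals with ?
end

# Given all queries from one request, return the query shapes that
# repeat often enough to look like an n+1.
def n_plus_one_shapes(queries, threshold: N_PLUS_ONE_THRESHOLD)
  queries.map { |q| normalize(q) }
         .tally
         .select { |_, count| count >= threshold }
         .keys
end

queries = (1..6).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" } +
          ["SELECT * FROM posts"]
n_plus_one_shapes(queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```

The same normalized shapes are what make "this query appeared in all your requests and took around this long" possible in the aggregate trace.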
You look at it and you're 00:27:20.470 --> 00:27:21.730 like, oh, I want a different view, so then 00:27:21.730 --> 00:27:23.240 you go edit your query and then you click 00:27:23.240 --> 00:27:25.360 and you wait and it's just not a pleasant 00:27:25.360 --> 00:27:26.600 experience. 00:27:26.600 --> 00:27:28.850 So, so we use Ember. The, the UI that 00:27:28.850 --> 00:27:31.250 you're using when you log into Skylight, even though 00:27:31.250 --> 00:27:33.289 it feels just like a regular website, it doesn't 00:27:33.289 --> 00:27:35.940 feel like a native app, is powered, all of 00:27:35.940 --> 00:27:37.679 the routing, all of the rendering, all of the 00:27:37.679 --> 00:27:40.769 decision making, is happening in, as an Ember.js app, 00:27:40.769 --> 00:27:43.049 and we pair that with D3. So all of 00:27:43.049 --> 00:27:44.830 the charts, the charts that you saw there in 00:27:44.830 --> 00:27:48.039 the aggregate trace, that is all Ember components powered 00:27:48.039 --> 00:27:48.970 by D3. 00:27:48.970 --> 00:27:52.860 So, this has actually significantly cleaned up our client-side 00:27:52.860 --> 00:27:55.679 code. It makes re-usability really, really awesome. So to 00:27:55.679 --> 00:27:57.039 give you an example, this is from our billing 00:27:57.039 --> 00:27:58.789 page. The designer came and they had, 00:27:58.789 --> 00:28:01.260 they had a component that was like, the date 00:28:01.260 --> 00:28:01.809 component. 00:28:01.809 --> 00:28:02.919 And, the- 00:28:02.919 --> 00:28:05.899 T.D.: It seems really boring at first. 00:28:05.899 --> 00:28:06.799 Y.K.: It seemed really boring. But, this is the 00:28:06.799 --> 00:28:08.950 implementation, right. So you could copy and paste this 00:28:08.950 --> 00:28:11.059 code over and over again, everywhere you go. Just 00:28:11.059 --> 00:28:12.750 remember to format it correctly. If you forget to 00:28:12.750 --> 00:28:15.070 format it, it's not gonna look the same everywhere.
00:28:15.070 --> 00:28:17.460 But I was like, hey, we're using this all 00:28:17.460 --> 00:28:18.010 over the place. Why don't we bundle this up 00:28:18.010 --> 00:28:20.070 into a component? And so with Ember, it was 00:28:20.070 --> 00:28:22.230 super easy. We basically just said, OK, here's a new 00:28:22.230 --> 00:28:24.590 calendar date component. It has a property on it 00:28:24.590 --> 00:28:26.460 called date. Just set that to any JavaScript Date 00:28:26.460 --> 00:28:28.059 object. You don't have to remember about 00:28:28.059 --> 00:28:30.450 converting it or formatting it. Here's the component. Set 00:28:30.450 --> 00:28:31.840 the date and it will render the correct thing 00:28:31.840 --> 00:28:32.760 automatically. 00:28:32.760 --> 00:28:36.039 And, so the architecture of the Ember app looks 00:28:36.039 --> 00:28:37.640 a little bit, something like this, where you have 00:28:37.640 --> 00:28:39.919 many, many different components, most of them just driven 00:28:39.919 --> 00:28:42.370 by D3, and then they're plugged into the model 00:28:42.370 --> 00:28:43.480 and the controller. 00:28:43.480 --> 00:28:44.909 And the Ember app will go fetch those models 00:28:44.909 --> 00:28:46.750 from the cloud, and the cloud from the Java 00:28:46.750 --> 00:28:50.190 app, which just queries Cassandra, and render them. And 00:28:50.190 --> 00:28:53.429 what's neat about this model is turning on web 00:28:53.429 --> 00:28:56.360 sockets is super easy, right. Because all of these 00:28:56.360 --> 00:28:58.860 components are bound to a single place. So when 00:28:58.860 --> 00:29:00.890 the web socket says, hey, we have updated information 00:29:00.890 --> 00:29:02.630 for you to show, it just pushes it onto 00:29:02.630 --> 00:29:04.980 the model or onto the controller, and the whole 00:29:04.980 --> 00:29:06.159 UI updates automatically. 00:29:06.159 --> 00:29:06.890 It's like magic.
00:29:06.890 --> 00:29:07.230 And- 00:29:07.230 --> 00:29:08.250 Y.K.: Like magic. 00:29:08.250 --> 00:29:09.679 T.D.: It's like magic. And, and when debugging, this 00:29:09.679 --> 00:29:11.559 is especially awesome too, because, and I'll maybe show 00:29:11.559 --> 00:29:15.080 a demo of the Ember inspector. It's nice. 00:29:15.080 --> 00:29:17.830 So. Yeah. So, lightning fast UI. Reducing the feedback 00:29:17.830 --> 00:29:19.510 loop so that you can quickly play with your 00:29:19.510 --> 00:29:21.880 data makes it go from a chore to something 00:29:21.880 --> 00:29:23.620 that actually feels kind of fun. 00:29:23.620 --> 00:29:27.039 So, these were the breakthroughs that we had when 00:29:27.039 --> 00:29:28.440 we were building Skylight. The things that made us 00:29:28.440 --> 00:29:29.980 think, yes, this is actually a product that we 00:29:29.980 --> 00:29:31.940 think deserves to be on the market. So, one, 00:29:31.940 --> 00:29:33.860 honest response times. Collect data that no one else 00:29:33.860 --> 00:29:36.549 can collect. Focus on answers instead of just dumping 00:29:36.549 --> 00:29:38.289 data, and have a lightning fast UI to do 00:29:38.289 --> 00:29:38.409 it. 00:29:38.409 --> 00:29:40.100 So we like to think of Skylight as basically 00:29:40.100 --> 00:29:42.690 a smart profiler. It's a smart profiler that runs 00:29:42.690 --> 00:29:44.350 in production. It's like the profiler that you run 00:29:44.350 --> 00:29:47.230 on your local development machine, but instead of being 00:29:47.230 --> 00:29:49.179 on your local dev box, which has nothing to 00:29:49.179 --> 00:29:51.610 do with the performance characteristics of what your users 00:29:51.610 --> 00:29:53.450 are experiencing, we're actually running in production. 00:29:53.450 --> 00:29:58.919 So, let me just give you guys a quick 00:29:58.919 --> 00:30:00.390 demo.
00:30:00.390 --> 00:30:03.120 So, this is what the Skylight, this is what 00:30:03.120 --> 00:30:07.610 Skylight looks like. What's under this? There we go. 00:30:07.610 --> 00:30:09.620 So, the first thing here is we've got the 00:30:09.620 --> 00:30:12.669 app dash board. So this, it's like our, 95th 00:30:12.669 --> 00:30:15.500 responsile- 95th percentile response time has peaked. Maybe you're 00:30:15.500 --> 00:30:17.970 all hammering it right now. That would be nice. 00:30:17.970 --> 00:30:19.940 So, this is a graph of your response time 00:30:19.940 --> 00:30:22.010 over time, and then on the right, this is 00:30:22.010 --> 00:30:24.700 the graph of the RPMs, the requests per minute 00:30:24.700 --> 00:30:26.750 that your app is handling. So this is app-wide. 00:30:26.750 --> 00:30:29.440 And this is live. This updates every minute. 00:30:29.440 --> 00:30:31.039 Then down below, you have a list of the 00:30:31.039 --> 00:30:33.730 end points in your application. So you can see, 00:30:33.730 --> 00:30:35.700 actually, the top, the slowest ones for us were, 00:30:35.700 --> 00:30:37.789 we have an instrumentation API, and we've gone and 00:30:37.789 --> 00:30:39.929 instrumented our background workers. So we can see them 00:30:39.929 --> 00:30:42.010 here, and their response time plays in. So we 00:30:42.010 --> 00:30:44.220 can see that we have this reporting worker that's 00:30:44.220 --> 00:30:46.899 taking 95th percentile, thirteen seconds. 00:30:46.899 --> 00:30:48.880 Y.K.: So all that time used to be inside 00:30:48.880 --> 00:30:51.500 of some request somewhere, and we discovered that there 00:30:51.500 --> 00:30:52.840 was a lot of time being spent in things 00:30:52.840 --> 00:30:54.679 that we could push to the background. 
We probably 00:30:54.679 --> 00:30:56.789 need to update the agony index so that it 00:30:56.789 --> 00:30:59.190 doesn't rank workers very high, because spending some time 00:30:59.190 --> 00:31:02.120 in your workers is not that big of a 00:31:02.120 --> 00:31:02.130 deal. 00:31:02.130 --> 00:31:03.000 T.D.: So, so then, if we dive into one 00:31:03.000 --> 00:31:05.299 of these, you can see that for this request, 00:31:05.299 --> 00:31:07.000 we've got the time explorer up above, and that 00:31:07.000 --> 00:31:10.429 shows a graph of response time at, again, 95th 00:31:10.429 --> 00:31:11.840 percentile, and you can, if you want to go 00:31:11.840 --> 00:31:13.549 back and look at historical data, you just drag 00:31:13.549 --> 00:31:15.250 it like this. And this has got a brush, 00:31:15.250 --> 00:31:16.980 so you can zoom in and out on different 00:31:16.980 --> 00:31:17.760 times. 00:31:17.760 --> 00:31:19.649 And every time you change the range, you can 00:31:19.649 --> 00:31:21.360 see that it's very responsive. It's never waiting for 00:31:21.360 --> 00:31:23.039 the server. But it is going back and fetching 00:31:23.039 --> 00:31:25.080 data from the server, and then when the data 00:31:25.080 --> 00:31:29.210 comes back, you see the whole UI just updates. 00:31:29.210 --> 00:31:29.250 And we get that for free with Ember. 00:31:29.250 --> 00:31:31.190 And then down below, as we discussed, you actually 00:31:31.190 --> 00:31:33.760 have a real histogram. And this histogram, in this 00:31:33.760 --> 00:31:37.159 case, is showing. So this is for fifty-seven requests. 00:31:37.159 --> 00:31:39.019 And if we click and drag, we could just 00:31:39.019 --> 00:31:40.429 move this. And you can see that the aggregate 00:31:40.429 --> 00:31:43.360 trace below updates in response to us dragging this.
00:31:43.360 --> 00:31:44.919 And if we want to look at the fastest 00:31:44.919 --> 00:31:47.500 quartile, we just click faster and we'll just choose 00:31:47.500 --> 00:31:48.149 that range on the histogram. 00:31:48.149 --> 00:31:49.210 Y.K.: I think it's the fastest load. 00:31:49.210 --> 00:31:50.899 T.D.: The fastest load. And then if you click 00:31:50.899 --> 00:31:52.899 on slower, you can see the slower requests. So 00:31:52.899 --> 00:31:54.669 this makes it really easy to compare and contrast. 00:31:54.669 --> 00:31:56.710 OK. Why are certain requests faster and why are 00:31:56.710 --> 00:31:58.529 certain requests slow? 00:31:58.529 --> 00:32:00.779 You can see the blue, these blue areas. This 00:32:00.779 --> 00:32:03.559 is Ruby code. So, right now it's not super 00:32:03.559 --> 00:32:05.820 granular. It would be nice if you could actually 00:32:05.820 --> 00:32:07.940 know what was going on here. But, it'll at 00:32:07.940 --> 00:32:09.940 least tell you where in your controller action this 00:32:09.940 --> 00:32:12.690 is happening, and then you can actually see which 00:32:12.690 --> 00:32:15.919 database queries are being executed, and what their duration 00:32:15.919 --> 00:32:16.080 is. 00:32:16.080 --> 00:32:17.889 And you can see that we actually extract the 00:32:17.889 --> 00:32:20.419 SQL and we denormalize it so we, so you, 00:32:20.419 --> 00:32:22.159 or, we normalize it so you can see exactly 00:32:22.159 --> 00:32:24.019 what those requests are even if the values are 00:32:24.019 --> 00:32:24.820 totally different between them. 00:32:24.820 --> 00:32:27.649 Y.K.: Yeah. So the real query, courtesy of Rails, 00:32:27.649 --> 00:32:29.730 not yet supporting bind extraction is like, where id 00:32:29.730 --> 00:32:32.169 equals one or, ten or whatever. 00:32:32.169 --> 00:32:33.659 T.D.: Yup. So that's pretty cool. 
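The histogram and 95th-percentile views in the demo come down to simple statistics over the collected response times. A rough sketch; the nearest-rank percentile method and the 50 ms bucket width are illustrative choices, not Skylight's actual ones:

```ruby
# Sketch: 95th percentile and histogram bucketing over response
# times (in milliseconds), the two views the demo slider drives.

# Nearest-rank percentile on a sorted copy of the samples.
def percentile(times, p)
  sorted = times.sort
  rank = ((p / 100.0) * (sorted.size - 1)).round
  sorted[rank]
end

# Count samples per fixed-width bucket (bucket start => count).
def histogram(times, bucket_ms)
  times.group_by { |t| (t / bucket_ms) * bucket_ms }
       .transform_values(&:size)
end

times = [12, 15, 18, 22, 25, 31, 48, 95, 120, 400]
p95  = percentile(times, 95)  # dominated by the slowest requests
hist = histogram(times, 50)   # counts per 50 ms bucket
```

Selecting "faster" or "slower" in the UI then amounts to filtering requests to a range of these buckets before re-aggregating the trace.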
00:32:33.659 --> 00:32:37.429 Y.K.: So one, one other thing is, initially, we 00:32:37.429 --> 00:32:39.269 actually just showed the whole trace, but we discovered 00:32:39.269 --> 00:32:41.659 that, obviously when you show whole traces you have 00:32:41.659 --> 00:32:43.639 information that doesn't really matter that much. So we 00:32:43.639 --> 00:32:47.340 started off by, we've recently basically started to collapse 00:32:47.340 --> 00:32:48.850 things that don't matter so much so that you 00:32:48.850 --> 00:32:51.090 can basically expand or condense the trace. 00:32:51.090 --> 00:32:52.519 And we wanted to make it so you don't 00:32:52.519 --> 00:32:55.690 have to think about expanding or condensing individual areas, 00:32:55.690 --> 00:32:57.960 but just, you see what matters the most and 00:32:57.960 --> 00:32:59.100 then you can see the trivial areas. 00:32:59.100 --> 00:33:02.179 T.D.: Yup. So, so that's the demo of Skylight. 00:33:02.179 --> 00:33:04.190 We'd really like it if you checked it out. 00:33:04.190 --> 00:33:05.899 There is one more thing I want to show 00:33:05.899 --> 00:33:07.720 you that is, like, really freaking cool. This is 00:33:07.720 --> 00:33:10.529 coming out of Tilde labs. Carl was like, has 00:33:10.529 --> 00:33:13.730 been hacking, he's been up until past midnight, getting 00:33:13.730 --> 00:33:15.769 almost no sleep for the past month trying to 00:33:15.769 --> 00:33:16.730 have this ready. 00:33:16.730 --> 00:33:19.090 I don't know how many of you know this, 00:33:19.090 --> 00:33:23.630 but Ruby 2 point 1 has a new, a, 00:33:23.630 --> 00:33:27.950 a stack sampling feature. So you can get really 00:33:27.950 --> 00:33:31.149 granular information about how your Ruby code is performing.
00:33:31.149 --> 00:33:33.450 So I want to show you, I just mentioned 00:33:33.450 --> 00:33:34.570 how it would be nice if we could get 00:33:34.570 --> 00:33:36.830 more information out of what your Ruby code is 00:33:36.830 --> 00:33:38.760 doing. And now we can do that. 00:33:38.760 --> 00:33:42.039 Basically, every few milliseconds, this code that Carl wrote 00:33:42.039 --> 00:33:44.399 is going into the, to the Ruby, into MRI, 00:33:44.399 --> 00:33:47.419 and it's taking a snapshot of the stack. 00:33:47.419 --> 00:33:50.769 And because this is built-in, it's very low-impact. It's 00:33:50.769 --> 00:33:53.570 not allocating any new memory. There's very little performance 00:33:53.570 --> 00:33:55.769 hit. Basically you wouldn't even notice it. And so 00:33:55.769 --> 00:33:58.149 every few milliseconds it's sampling, and we take that 00:33:58.149 --> 00:34:00.260 information and we send it up to our servers. 00:34:00.260 --> 00:34:02.260 So it's almost like you're running a Ruby profiler on 00:34:02.260 --> 00:34:05.220 your local dev box, where you get extremely granular 00:34:05.220 --> 00:34:07.159 information about where your code is spending its time 00:34:07.159 --> 00:34:09.010 in Ruby, per method, per all of these things. 00:34:09.010 --> 00:34:11.909 But it's happening in production. 00:34:11.909 --> 00:34:16.409 So, this is, so this is a, we enabled 00:34:16.409 --> 00:34:18.399 it in staging. You can see that we've got 00:34:18.399 --> 00:34:19.600 some rendering bugs. It's still in beta. 00:34:19.600 --> 00:34:21.918 Y.K.: Yeah, and we haven't yet collapsed things that 00:34:21.918 --> 00:34:21.980 are not important- 00:34:21.980 --> 00:34:22.020 T.D.: Yes. 00:34:22.020 --> 00:34:23.270 Y.K.: -for this particular feature. 00:34:23.270 --> 00:34:24.170 T.D.: So we want to show, we want to 00:34:24.170 --> 00:34:27.610 hide things like, like framework code, obviously.
But this 00:34:27.610 --> 00:34:31.070 gives you an incredibly, incredibly granular view of what 00:34:31.070 --> 00:34:35.659 your app is doing in production. And we think. 00:34:35.659 --> 00:34:39.230 This is a, an API that's built into, into 00:34:39.230 --> 00:34:43.159 Ruby 2.1.1. Because our agent is running so low-level, 00:34:43.159 --> 00:34:44.659 because we wrote it in Rust, we have the 00:34:44.659 --> 00:34:47.409 ability to do things like this, and Carl thinks 00:34:47.409 --> 00:34:48.370 that we may be able to actually backport 00:34:48.370 --> 00:34:48.480 this to older Rubies, too. So if you're not 00:34:48.480 --> 00:34:50.130 on Ruby 2.1, we think that we can actually 00:34:50.130 --> 00:34:52.790 bring this. But that's TBD. 00:34:52.790 --> 00:34:55.480 Y.K.: Yeah, I- so I think the cool thing 00:34:55.480 --> 00:34:57.940 about this, in general, is when you run a 00:34:57.940 --> 00:34:59.430 sampling- so this is a sampling profiler, right, we 00:34:59.430 --> 00:35:01.260 don't want to be burdening every single thing that 00:35:01.260 --> 00:35:03.790 you do in your program with tracing, right. That 00:35:03.790 --> 00:35:05.380 would be very slow. 00:35:05.380 --> 00:35:06.920 So when you normally run a sampling profiler, you 00:35:06.920 --> 00:35:08.760 have to basically make a loop. You have to 00:35:08.760 --> 00:35:11.090 basically create a loop, run this code a million 00:35:11.090 --> 00:35:12.970 times and keep sampling it. Eventually we'll get enough 00:35:12.970 --> 00:35:15.030 samples to get the information. But it turns out 00:35:15.030 --> 00:35:17.280 that your production server is a loop. Your production 00:35:17.280 --> 00:35:20.560 server is serving tons and tons of requests.
So, 00:35:20.560 --> 00:35:22.880 by simply tak- you know, taking a few microseconds 00:35:22.880 --> 00:35:25.580 out of every request and collecting a couple of 00:35:25.580 --> 00:35:27.210 samples, over time we can actually get this really 00:35:27.210 --> 00:35:29.700 high fidelity picture with basically no cost. 00:35:29.700 --> 00:35:31.150 And that's pretty mind-blowing. And this is the kind 00:35:31.150 --> 00:35:34.650 of stuff that we can start doing by really 00:35:34.650 --> 00:35:37.250 caring about, about both the user experience and the 00:35:37.250 --> 00:35:40.830 implementation and getting really scary about it. And I'm 00:35:40.830 --> 00:35:42.700 really, like, honestly this is a really exciting feature 00:35:42.700 --> 00:35:45.330 that really shows what we can do as we 00:35:45.330 --> 00:35:46.130 start building this out. 00:35:46.130 --> 00:35:47.140 T.D.: Once we've got that, once we've got that 00:35:47.140 --> 00:35:48.380 groundwork. 00:35:48.380 --> 00:35:49.820 So if you guys want to check it out, 00:35:49.820 --> 00:35:51.760 Skylight dot io, it's available today. It's no longer 00:35:51.760 --> 00:35:54.040 in private beta. Everyone can sign up. No invitation 00:35:54.040 --> 00:35:56.630 token necessary. And you can get a thirty-day free 00:35:56.630 --> 00:35:58.240 trial if you haven't started one already. So if 00:35:58.240 --> 00:35:59.620 you have any questions, please come see us right 00:35:59.620 --> 00:36:00.980 now, or we have a booth in the vendor 00:36:00.980 --> 00:36:03.140 hall. Thank you guys very much.
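[Editor's note: the sampling approach described in the talk can be sketched in pure Ruby. The real agent hooks MRI's C-level frame-sampling support added in Ruby 2.1 (the `rb_profile_frames` API) and is far cheaper; `Thread#backtrace` here is only a stand-in to show the idea — snapshot the stack every few milliseconds and count how often each frame appears. The `sample` helper, its parameters, and the worker thread are all illustrative assumptions, not Skylight code.]

```ruby
# Pure-Ruby approximation of a sampling profiler: periodically snapshot
# a target thread's stack and tally frame occurrences. Frames that show
# up most often are where the thread spends its time.
def sample(target, interval: 0.005, duration: 0.5)
  counts = Hash.new(0)
  sampler = Thread.new do
    stop_at = Time.now + duration
    while Time.now < stop_at
      # Thread#backtrace returns the target's current stack as strings.
      target.backtrace&.each { |frame| counts[frame] += 1 }
      sleep interval
    end
  end
  sampler.join
  # Sort hottest frames first: [frame, sample_count] pairs.
  counts.sort_by { |_, n| -n }
end

worker = Thread.new { loop { Math.sqrt(rand) } }
hot_frames = sample(worker)
worker.kill
```

Because a production server is already "a loop," taking a couple of samples per request and aggregating them over thousands of requests yields the same high-fidelity picture without ever running a tight profiling loop yourself.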