WEBVTT 99:59:59.999 --> 99:59:59.999 Hi everyone, I'm Gil Tene. 99:59:59.999 --> 99:59:59.999 I'm going to be talking about this subject that I call "How NOT to Measure Latency". 99:59:59.999 --> 99:59:59.999 It's a subject that I've been talking about now for 3 years or so. 99:59:59.999 --> 99:59:59.999 I keep the title and change all the slides every time. 99:59:59.999 --> 99:59:59.999 A bunch of this stuff is new. 99:59:59.999 --> 99:59:59.999 So if you've seen any of my previous "How NOT to", you'll see only some things that are common. 99:59:59.999 --> 99:59:59.999 A nickname for the subject is this... 99:59:59.999 --> 99:59:59.999 Because I often will get that reaction from some people in the audience. 99:59:59.999 --> 99:59:59.999 Ever since I've told people that it's a nickname, 99:59:59.999 --> 99:59:59.999 they feel free to actually exclaim, "Oh S@%#!". 99:59:59.999 --> 99:59:59.999 And feel free to do that here in this talk. 99:59:59.999 --> 99:59:59.999 I'll prompt you in a couple of places where it is natural. 99:59:59.999 --> 99:59:59.999 But if you just have the urge, go ahead. 99:59:59.999 --> 99:59:59.999 So just a tiny bit about me. 99:59:59.999 --> 99:59:59.999 I am the co-founder of Azul Systems. 99:59:59.999 --> 99:59:59.999 I play around with garbage collection a lot. 99:59:59.999 --> 99:59:59.999 Here is some evidence of me playing around with garbage collection in my kitchen. 99:59:59.999 --> 99:59:59.999 That's a trash compactor. 99:59:59.999 --> 99:59:59.999 The compaction function wasn't working right, so I had to fix it. 99:59:59.999 --> 99:59:59.999 I thought it'd be funny to take a picture with a book. 99:59:59.999 --> 99:59:59.999 I've also built a lot of things. 99:59:59.999 --> 99:59:59.999 I've been playing with computers since the early 80's. 99:59:59.999 --> 99:59:59.999 I've built hardware. 99:59:59.999 --> 99:59:59.999 I've helped design chips. 99:59:59.999 --> 99:59:59.999 I've built software at many different levels. 
99:59:59.999 --> 99:59:59.999 Operating systems, drivers... JVMs, obviously. 99:59:59.999 --> 99:59:59.999 And lots of big systems at the system level. 99:59:59.999 --> 99:59:59.999 Built our own app server in the late 90's because WebLogic wasn't around yet. 99:59:59.999 --> 99:59:59.999 So, I've made a lot of mistakes, and I've learned from a few of them. 99:59:59.999 --> 99:59:59.999 This is actually a combination of a bunch of those mistakes looking at latency. 99:59:59.999 --> 99:59:59.999 I do have this hobby of depressing people by pulling the wool up from over your eyes, 99:59:59.999 --> 99:59:59.999 and this is what this talk is about. 99:59:59.999 --> 99:59:59.999 So, I need to give you a choice right here. 99:59:59.999 --> 99:59:59.999 There's the door. 99:59:59.999 --> 99:59:59.999 You can take the blue pill, and you can leave. 99:59:59.999 --> 99:59:59.999 Tomorrow you can keep believing whatever it is you want to believe. 99:59:59.999 --> 99:59:59.999 But if you stay here and take the red pill, I will show you a glimpse of how 99:59:59.999 --> 99:59:59.999 far down the rabbit hole goes, and it will never be the same again. 99:59:59.999 --> 99:59:59.999 Let's talk about latency. 99:59:59.999 --> 99:59:59.999 And when I say latency, I'm talking about latency, response time, any of those things 99:59:59.999 --> 99:59:59.999 where you measure time from 'here to here', and you're interested in how long it took. 99:59:59.999 --> 99:59:59.999 We do this all the time, but I see a lot of mish-mash in how people 99:59:59.999 --> 99:59:59.999 treat the data, or think about it. 99:59:59.999 --> 99:59:59.999 Latency is basically the time it took something to happen once. 99:59:59.999 --> 99:59:59.999 That one time, how long did it take? 99:59:59.999 --> 99:59:59.999 And when we measure stuff, like we did a million operations in the last hour, 99:59:59.999 --> 99:59:59.999 we have a million latencies. Not one, we have a million of them. 
99:59:59.999 --> 99:59:59.999 Our actual goal is to figure out how to describe that million. 99:59:59.999 --> 99:59:59.999 How did the million behave? 99:59:59.999 --> 99:59:59.999 For example, 'they're all really good, and they're all exactly the same', would be a 99:59:59.999 --> 99:59:59.999 behavior that you will never see, but that would be a great behavior. 99:59:59.999 --> 99:59:59.999 So we need to talk about how things behave, communicate, think, evaluate, 99:59:59.999 --> 99:59:59.999 set requirements for, talk to other people, but these are all common things around that. 99:59:59.999 --> 99:59:59.999 To do that, we have to describe the distribution, the set, the behavior, 99:59:59.999 --> 99:59:59.999 but not the one. 99:59:59.999 --> 99:59:59.999 For example, the behavior that says "the common case was x" is a piece of 99:59:59.999 --> 99:59:59.999 information about the behavior, but it's a tiny sliver. 99:59:59.999 --> 99:59:59.999 Usually the least relevant one. 99:59:59.999 --> 99:59:59.999 Well, there's some less relevant ones, but not a strongly relevant one, 99:59:59.999 --> 99:59:59.999 and one that people often focus on. 99:59:59.999 --> 99:59:59.999 To take a look at what we actually do with this stuff, almost on a daily basis, 99:59:59.999 --> 99:59:59.999 this is a snapshot from a monitoring system. 99:59:59.999 --> 99:59:59.999 A small dashboard on a big screen in a monitoring system. 99:59:59.999 --> 99:59:59.999 Where you're watching the response time of a system over time. 99:59:59.999 --> 99:59:59.999 This is a two hour window. 99:59:59.999 --> 99:59:59.999 These lines are the 95th, 90th, 75th, 50th, and 25th percentiles, 99:59:59.999 --> 99:59:59.999 you can look at how they behave over time. 99:59:59.999 --> 99:59:59.999 We're a small audience here, if you look at this picture, what draws your eye? 99:59:59.999 --> 99:59:59.999 What do you want to go investigate here or pay attention to? 
99:59:59.999 --> 99:59:59.999 It's the big red spike there, right? 99:59:59.999 --> 99:59:59.999 So we could look at the red spike, 'cause it's different, 99:59:59.999 --> 99:59:59.999 and say, "Woah, the 95th percentile shot up here. And look, the 90th percentile 99:59:59.999 --> 99:59:59.999 shot up at about the same time. 99:59:59.999 --> 99:59:59.999 The rest of them didn't shoot up, so maybe something happened here 99:59:59.999 --> 99:59:59.999 that affected that much, I should probably pay attention to it 99:59:59.999 --> 99:59:59.999 because it's a monitoring system, and I like things to be calm." 99:59:59.999 --> 99:59:59.999 You could go investigate the why. 99:59:59.999 --> 99:59:59.999 At this point, I've managed to waste about 90 seconds of your life, 99:59:59.999 --> 99:59:59.999 looking at a completely meaningless chart, which unfortunately you do 99:59:59.999 --> 99:59:59.999 every day, all the time. 99:59:59.999 --> 99:59:59.999 This chart is the chart you want to show somebody if you want to 99:59:59.999 --> 99:59:59.999 hide the truth from them. 99:59:59.999 --> 99:59:59.999 If you want to pull the wool over their eyes. 99:59:59.999 --> 99:59:59.999 This is the chart of the good stuff. 99:59:59.999 --> 99:59:59.999 What's not on this chart? 99:59:59.999 --> 99:59:59.999 The 5% worst things that happened during these two hours. 99:59:59.999 --> 99:59:59.999 They're not here. 99:59:59.999 --> 99:59:59.999 This is only the good things that happened during the things. 99:59:59.999 --> 99:59:59.999 And to get this spike, that 5% had to be so bad that it even pulled 99:59:59.999 --> 99:59:59.999 the 95th percentile all up. 99:59:59.999 --> 99:59:59.999 There is zero information here at all about what happened bad during these two hours, 99:59:59.999 --> 99:59:59.999 which makes it a bad fit for a monitoring system. 99:59:59.999 --> 99:59:59.999 It's a really good thing for a marketing system. 
99:59:59.999 --> 99:59:59.999 It's a great way to get the bonus from your boss, even though you didn't do the work. 99:59:59.999 --> 99:59:59.999 If you want to learn how to do that, we can do another talk about that. 99:59:59.999 --> 99:59:59.999 But this is not a good way to look at latency. 99:59:59.999 --> 99:59:59.999 It's the opposite of good. 99:59:59.999 --> 99:59:59.999 Unfortunately, this is one of the most common tools used for 99:59:59.999 --> 99:59:59.999 server monitoring on earth right now. 99:59:59.999 --> 99:59:59.999 That's where the snapshot is from, and this is what people look at. 99:59:59.999 --> 99:59:59.999 I find this chart to be a goldmine of information. 99:59:59.999 --> 99:59:59.999 When I first showed it in another talk like this, I had this really cool experience. 99:59:59.999 --> 99:59:59.999 Somebody came up to me and said, "Hey, as I was sitting here, I was texting one 99:59:59.999 --> 99:59:59.999 of our guys, and he was saying, 99:59:59.999 --> 99:59:59.999 'look, we have this issue with our 95th percentile'." 99:59:59.999 --> 99:59:59.999 And I got this chart from him! 99:59:59.999 --> 99:59:59.999 So I went and said, "Hey, what does the rest of the spectrum look like?" 99:59:59.999 --> 99:59:59.999 This is the actual chart they got. 99:59:59.999 --> 99:59:59.999 And when they look at the rest of the spectrum, it looked like that. 99:59:59.999 --> 99:59:59.999 That's what was hiding. 99:59:59.999 --> 99:59:59.999 I noticed the scales are a little different. 99:59:59.999 --> 99:59:59.999 That yellow line is that yellow line. 99:59:59.999 --> 99:59:59.999 So that's a much more representative number. 99:59:59.999 --> 99:59:59.999 Is it? Is that good enough? 99:59:59.999 --> 99:59:59.999 That's the 99th percentile. 99:59:59.999 --> 99:59:59.999 We still have another 1% of really bad stuff that's hiding above the blue line. 99:59:59.999 --> 99:59:59.999 I wonder how big that is? 
99:59:59.999 --> 99:59:59.999 I don't know because he didn't have the data. 99:59:59.999 --> 99:59:59.999 So a common problem that we have is that we only plot what's convenient. 99:59:59.999 --> 99:59:59.999 We only plot what gives us nice, colorful graphs. 99:59:59.999 --> 99:59:59.999 And often, when we have to choose between the stuff that hides the rest of the data, 99:59:59.999 --> 99:59:59.999 and the stuff that is noise, we choose the noise to display. 99:59:59.999 --> 99:59:59.999 I like to rant about latency. 99:59:59.999 --> 99:59:59.999 This is from a blog that I don't write enough in, but the format for it was simple. 99:59:59.999 --> 99:59:59.999 I tweet a single tweet about latency, latency tip of the day, 99:59:59.999 --> 99:59:59.999 and then I rant about my own tweet. 99:59:59.999 --> 99:59:59.999 As an example, this chart is a goldmine of information because it has so many 99:59:59.999 --> 99:59:59.999 different things that are wrong in it, but we won't get into all of them. 99:59:59.999 --> 99:59:59.999 You can read it online. 99:59:59.999 --> 99:59:59.999 Anyway, this is one to take away from what we just said. 99:59:59.999 --> 99:59:59.999 If you are not measuring and showing the maximum value, what is it you are hiding? 99:59:59.999 --> 99:59:59.999 And from whom? 99:59:59.999 --> 99:59:59.999 If your job is to hide the truth from others, this is a good way to do it. 99:59:59.999 --> 99:59:59.999 But if you actually are interested in what's going on, the number one indicator 99:59:59.999 --> 99:59:59.999 you should never get rid of is the maximum value. 99:59:59.999 --> 99:59:59.999 That is not noise, that is the signal. 99:59:59.999 --> 99:59:59.999 The rest of it is noise. 99:59:59.999 --> 99:59:59.999 Okay, let's look at this chart for some more cool stuff. 99:59:59.999 --> 99:59:59.999 I'm gonna zoom in to a small part of the chart, and ask you what that means. 
99:59:59.999 --> 99:59:59.999 What does the average of the 95th percentile over 2 hours mean? 99:59:59.999 --> 99:59:59.999 What is the math that does that? 99:59:59.999 --> 99:59:59.999 What does it do? 99:59:59.999 --> 99:59:59.999 Let's look at that, and I'll give you an example with another percentile. 99:59:59.999 --> 99:59:59.999 The 100th percentile. The max, right? 99:59:59.999 --> 99:59:59.999 Let's take a data set. 99:59:59.999 --> 99:59:59.999 Suppose this was the maximum every minute for 15 minutes. 99:59:59.999 --> 99:59:59.999 What does it mean to say that the average max over the last 15 minutes was 42? 99:59:59.999 --> 99:59:59.999 I specifically chose the data to make that happen. 99:59:59.999 --> 99:59:59.999 It's a meaningless statement. 99:59:59.999 --> 99:59:59.999 It's a completely meaningless statement. 99:59:59.999 --> 99:59:59.999 But when you see 95th percentile, average 184, you think that the 95th 99:59:59.999 --> 99:59:59.999 percentile for the last two hours was around 184. 99:59:59.999 --> 99:59:59.999 It makes you think that. 99:59:59.999 --> 99:59:59.999 Putting this on a piece of paper is not just noise and irrelevant, 99:59:59.999 --> 99:59:59.999 it's a way to mislead people. 99:59:59.999 --> 99:59:59.999 It's a way to mislead yourself, because you'll start to believe your own mistruths. 99:59:59.999 --> 99:59:59.999 This is true for any percentile. 99:59:59.999 --> 99:59:59.999 There is no percentile that you could do this math on. 99:59:59.999 --> 99:59:59.999 Another tip, you cannot average percentiles. 99:59:59.999 --> 99:59:59.999 That math doesn't happen. 99:59:59.999 --> 99:59:59.999 But percentiles do matter. You really want to know about them. 99:59:59.999 --> 99:59:59.999 And a common misperception is that we want to look at the main part of the spectrum, 99:59:59.999 --> 99:59:59.999 not those outliers and perfection stuff. 
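The "you cannot average percentiles" tip can be checked with a few lines of Python. This is a minimal sketch using invented latencies, not the data behind the chart; the `percentile` helper is a simple nearest-rank implementation written only for this illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Invented latencies in milliseconds for three one-minute intervals.
intervals = [
    [10] * 1000,                # quiet minute
    [10] * 1000,                # quiet minute
    [10] * 500 + [2000] * 500,  # half of this minute's requests stalled
]

# What a dashboard does when it averages a percentile line over time.
per_interval_p95 = [percentile(chunk, 95) for chunk in intervals]
average_of_p95 = sum(per_interval_p95) / len(per_interval_p95)

# The 95th percentile of the whole window, computed from all the samples.
pooled = [v for chunk in intervals for v in chunk]
window_p95 = percentile(pooled, 95)

print("per-interval p95:", per_interval_p95)
print("average of the p95s:", round(average_of_p95, 1))
print("p95 of the whole window:", window_p95)
```

In this made-up data set the averaged figure lands well below the window's real 95th percentile, and it does not correspond to any latency that a request actually saw.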
99:59:59.999 --> 99:59:59.999 Only people that actually bet their house every day, or the bank on it, 99:59:59.999 --> 99:59:59.999 need to know about the "five 9's", and all those. 99:59:59.999 --> 99:59:59.999 The 99th percentile is a pretty good number. 99:59:59.999 --> 99:59:59.999 Is 99% really rare? 99:59:59.999 --> 99:59:59.999 Let's look at some stuff, because we can ask questions like, "If I were looking 99:59:59.999 --> 99:59:59.999 at a webpage, what is the chance of me hitting the 99th percentile?" 99:59:59.999 --> 99:59:59.999 Of things like this: a search engine node, or a key value store, 99:59:59.999 --> 99:59:59.999 or a database, or a CDN, right? 99:59:59.999 --> 99:59:59.999 Because they will report their 99th percentile. 99:59:59.999 --> 99:59:59.999 They won't tell you anything above that, but how many of the 99:59:59.999 --> 99:59:59.999 webpages that we go to actually experience this? 99:59:59.999 --> 99:59:59.999 You want to say 1%, right? 99:59:59.999 --> 99:59:59.999 Well, I went to some webpages and I counted how many HTTP requests were generated 99:59:59.999 --> 99:59:59.999 by one click into that webpage, and here are the numbers. 99:59:59.999 --> 99:59:59.999 I did that about a year ago. 99:59:59.999 --> 99:59:59.999 They've probably gone up since then. 99:59:59.999 --> 99:59:59.999 Now that translates into this math. 99:59:59.999 --> 99:59:59.999 This is the likelihood of one click seeing the 99th percentile. 99:59:59.999 --> 99:59:59.999 And the only page where that is less than 50% is the clean Google search page. 99:59:59.999 --> 99:59:59.999 Where only a quarter will see the 99th percentile. 99:59:59.999 --> 99:59:59.999 The 99th percentile is the thing that most of your webpages will see. 99:59:59.999 --> 99:59:59.999 Most of them will be there. 99:59:59.999 --> 99:59:59.999 Now, we could look at other things. 99:59:59.999 --> 99:59:59.999 We can pick which things to focus on. 
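A rough sketch of that math: if each of N requests independently has a 1% chance of landing above the 99th percentile, the chance that one click sees at least one such response is 1 - 0.99^N. The request counts below are hypothetical stand-ins, not the numbers from the slide, and independence between requests is an assumption made here for simplicity.

```python
def chance_of_exceeding(percentile, requests):
    """Probability that at least one of `requests` independent responses
    is worse than the given percentile."""
    return 1.0 - (percentile / 100.0) ** requests

# Hypothetical request counts per page load (assumed for illustration only).
for n in (1, 30, 70, 200):
    pct = chance_of_exceeding(99, n) * 100
    print(f"{n:4d} requests -> {pct:.1f}% of page loads see worse than the 99th percentile")
```

With around 30 requests per click the result is roughly a quarter, which matches the order of magnitude quoted for the clean search page; a page with 70 or more requests crosses the 50% mark.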
99:59:59.999 --> 99:59:59.999 Let's say I had to pick between the 95th percentile, and the three 9's (99.9%). 99:59:59.999 --> 99:59:59.999 The three 9's is way into perfection mode for most people, or they think. 99:59:59.999 --> 99:59:59.999 Which one of those represents our community better? 99:59:59.999 --> 99:59:59.999 Our population? 99:59:59.999 --> 99:59:59.999 Our users? 99:59:59.999 --> 99:59:59.999 Our experience? 99:59:59.999 --> 99:59:59.999 Let's run a hypothetical. 99:59:59.999 --> 99:59:59.999 Suppose we don't have that many pages, and that many resources like we said before. 99:59:59.999 --> 99:59:59.999 We'll be much more conservative. 99:59:59.999 --> 99:59:59.999 A user session will only go through five clicks, and each click will only bring up 99:59:59.999 --> 99:59:59.999 up to 40 things. 99:59:59.999 --> 99:59:59.999 A lot less, and they're all as clean as the google page. 99:59:59.999 --> 99:59:59.999 How many of the users will not experience something worse than the 95th percentile? 99:59:59.999 --> 99:59:59.999 Because that's what the 95th percentile is good for, the people who see that. 99:59:59.999 --> 99:59:59.999 Anybody above that, is that. 99:59:59.999 --> 99:59:59.999 What are the chances of not seeing it? 99:59:59.999 --> 99:59:59.999 That's an interesting number. 99:59:59.999 --> 99:59:59.999 So you're watching a number that is relevant to 0.003% of your users. 99:59:59.999 --> 99:59:59.999 99.997% of your users are going to see worse than this number. 99:59:59.999 --> 99:59:59.999 Why are you looking at it? 99:59:59.999 --> 99:59:59.999 Why are you spending time thinking about it? 99:59:59.999 --> 99:59:59.999 In reverse, we could say how many people are going to see something 99:59:59.999 --> 99:59:59.999 worse than the three 9's (99.9%)? 99:59:59.999 --> 99:59:59.999 That's going to be 18%. 99:59:59.999 --> 99:59:59.999 In reverse, 82% of the people will see the three 9's (99.9%) or better. 
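The session numbers above can be reproduced from the stated scenario (5 clicks, 40 requests per click) plus one assumption the talk leaves implicit: each request's latency is independent of the others. The constant and function names are illustrative.

```python
CLICKS_PER_SESSION = 5
REQUESTS_PER_CLICK = 40
SESSION_REQUESTS = CLICKS_PER_SESSION * REQUESTS_PER_CLICK  # 200 requests

def share_never_worse_than(percentile, n):
    """Share of sessions in which none of n independent requests exceeds the percentile."""
    return (percentile / 100.0) ** n

# Users who never see anything worse than the 95th percentile.
never_worse_than_p95 = share_never_worse_than(95, SESSION_REQUESTS)

# Users who see at least one response worse than the 99.9th percentile.
worse_than_p999 = 1.0 - share_never_worse_than(99.9, SESSION_REQUESTS)

print(f"never worse than p95:   {never_worse_than_p95 * 100:.4f}% of users")
print(f"worse than p99.9 once:  {worse_than_p999 * 100:.1f}% of users")
```

Under these assumptions the first figure comes out at a few thousandths of a percent and the second at about 18%, in line with the numbers quoted in the talk.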
99:59:59.999 --> 99:59:59.999 That's a slightly better representation. 99:59:59.999 --> 99:59:59.999 Probably not good enough either. 99:59:59.999 --> 99:59:59.999 We could look at some more math with them, same kind of scenario. 99:59:59.999 --> 99:59:59.999 What percentile of http response time will be the thing that 95% 99:59:59.999 --> 99:59:59.999 of people experience in this scenario? 99:59:59.999 --> 99:59:59.999 It's the 99.97 percentile that 95% of people see. 99:59:59.999 --> 99:59:59.999 And if you want to know what 99% of the people see, 99:59:59.999 --> 99:59:59.999 that's four and a half 9's (99.995%). 99:59:59.999 --> 99:59:59.999 You want to know that number from Akamai if you want to predict what 1% 99:59:59.999 --> 99:59:59.999 of your users are going to experience. 99:59:59.999 --> 99:59:59.999 When you know the 99th percentile, you kind of know a tiny bit. 99:59:59.999 --> 99:59:59.999 So here's another tip. 99:59:59.999 --> 99:59:59.999 And this is not an exaggeration, by the way. 99:59:59.999 --> 99:59:59.999 The median, which is a much smaller percentile, has that minuscule a chance 99:59:59.999 --> 99:59:59.999 of ever being the number that anybody experiences. 99:59:59.999 --> 99:59:59.999 This is the chance of getting worse than the median. 99:59:59.999 --> 99:59:59.999 Which makes the median an irrelevant number to look at. 99:59:59.999 --> 99:59:59.999 Unfortunately, it's probably the most common one looked at. 99:59:59.999 --> 99:59:59.999 When people say "the typical", they look at the thing that 99:59:59.999 --> 99:59:59.999 everything will be worse than. 99:59:59.999 --> 99:59:59.999 Okay, I'm sorry about that part. 99:59:59.999 --> 99:59:59.999 We'll do some other parts. 99:59:59.999 --> 99:59:59.999 Now, why is it that when we look at these monitoring systems, we don't see 99:59:59.999 --> 99:59:59.999 data with a lot of 9's? 99:59:59.999 --> 99:59:59.999 Why do we stop at the 90, 95, 99th percentile? 
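The percentile figures quoted a moment ago (99.97%, 99.995%, and the median's minuscule chance) follow from inverting the same arithmetic. A sketch under the same assumptions as the scenario: 200 independent requests per session, with illustrative names.

```python
SESSION_REQUESTS = 5 * 40  # 5 clicks, 40 requests each

def required_percentile(share_of_sessions, n):
    """Per-request percentile that `share_of_sessions` of n-request sessions never exceed."""
    return 100.0 * share_of_sessions ** (1.0 / n)

p_for_95_percent_of_users = required_percentile(0.95, SESSION_REQUESTS)
p_for_99_percent_of_users = required_percentile(0.99, SESSION_REQUESTS)

# Share of sessions in which every single request is at or below the median.
sessions_at_or_below_median = 0.5 ** SESSION_REQUESTS

print(f"95% of users stay within the {p_for_95_percent_of_users:.3f}th percentile")
print(f"99% of users stay within the {p_for_99_percent_of_users:.4f}th percentile")
print(f"sessions that never exceed the median: {sessions_at_or_below_median:.2e}")
```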
99:59:59.999 --> 99:59:59.999 Why don't we look further? 99:59:59.999 --> 99:59:59.999 Now, some of it is because people think, "Well that's perfection, I don't need it." 99:59:59.999 --> 99:59:59.999 The other part is that it's hard. 99:59:59.999 --> 99:59:59.999 It's hard because you can't average percentiles. 99:59:59.999 --> 99:59:59.999 We already talked about that. 99:59:59.999 --> 99:59:59.999 But you also can't derive your five 9's (99.999%) out of a lot 99:59:59.999 --> 99:59:59.999 of 10 second samples of percentiles. 99:59:59.999 --> 99:59:59.999 And the reason for that is, "Hey, in 10 seconds, maybe I only had 1,000 things." 99:59:59.999 --> 99:59:59.999 I could take all the 10 seconds in the world, there's no way to say what the 99:59:59.999 --> 99:59:59.999 hour's five 9's (99.999%) were, what the minute's five 9's were 99:59:59.999 --> 99:59:59.999 if I'm collecting just this data. 99:59:59.999 --> 99:59:59.999 And unfortunately, the data being collected and reported to the back ends of monitoring 99:59:59.999 --> 99:59:59.999 is usually summarized at a second, 5 seconds, 10 seconds, etc. 99:59:59.999 --> 99:59:59.999 Basically throwing away all the good data, and leaving you with absolutely no way 99:59:59.999 --> 99:59:59.999 to compute large 9's for longer periods of time. 99:59:59.999 --> 99:59:59.999 So, this is where you might want to look at HDR Histogram. 99:59:59.999 --> 99:59:59.999 It's an open source thing I created a few years ago. 99:59:59.999 --> 99:59:59.999 I did it in Java, and now there are C, C#, Python, Erlang, 99:59:59.999 --> 99:59:59.999 and Go ports of this that I didn't create. 99:59:59.999 --> 99:59:59.999 And it lets you actually get an entire percentile spectrum. 99:59:59.999 --> 99:59:59.999 Some of you here I know are already using it. 99:59:59.999 --> 99:59:59.999 And you can look at all the percentiles. 
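To illustrate why per-interval percentile summaries cannot be recombined while per-interval histograms can, here is a small sketch. It uses a plain `collections.Counter` keyed by exact latency value, which is a stand-in chosen for brevity and is not the HdrHistogram API; the interval data is invented.

```python
import math
from collections import Counter

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile from a {latency_ms: count} histogram."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(pct / 100.0 * total))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

# One hour as 360 ten-second intervals of 1,000 requests each (invented data).
# 359 intervals are all 1 ms; one interval has 20 requests stalled at 5000 ms.
interval_histograms = [Counter({1: 1000}) for _ in range(359)]
interval_histograms.append(Counter({1: 980, 5000: 20}))

# Histograms add, so the hour can be rebuilt from the intervals.
hour = Counter()
for histogram in interval_histograms:
    hour.update(histogram)

hour_p99999 = percentile_from_counts(hour, 99.999)

# The summary-only alternative: keep one number per interval and average it.
averaged_p99999 = sum(
    percentile_from_counts(h, 99.999) for h in interval_histograms
) / len(interval_histograms)

print("five 9's from the merged histogram:", hour_p99999)
print("average of per-interval five 9's:", round(averaged_p99999, 1))
```

With only 1,000 samples, each interval's "five 9's" is just that interval's maximum, so averaging those values says nothing about the hour. The merged counts still contain enough information to answer the question.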
99:59:59.999 --> 99:59:59.999 Any number of 9's that's in the data, if you just keep it right and report it right, 99:59:59.999 --> 99:59:59.999 it's got a log format, you can store things forever. 99:59:59.999 --> 99:59:59.999 Well, for a long time. 99:59:59.999 --> 99:59:59.999 Okay, so it lets you have nice things. 99:59:59.999 --> 99:59:59.999 Enough for that advertisement. 99:59:59.999 --> 99:59:59.999 Now, latency... Well, I think this is slightly out of order. 99:59:59.999 --> 99:59:59.999 Yeah, sorry. 99:59:59.999 --> 99:59:59.999 This is the red/blue pill part, so I warn you, this is your last chance. 99:59:59.999 --> 99:59:59.999 There's a problem I call the coordinated omission problem. 99:59:59.999 --> 99:59:59.999 The coordinated omission problem is basically a conspiracy. 99:59:59.999 --> 99:59:59.999 It's a conspiracy that we're all part of. 99:59:59.999 --> 99:59:59.999 I don't think anybody actually meant to do it, but once I've noticed it, 99:59:59.999 --> 99:59:59.999 everywhere I look, there it is. 99:59:59.999 --> 99:59:59.999 Now, I've been using a specific way of showing you numbers so far. 99:59:59.999 --> 99:59:59.999 Has anybody here noticed how I spell percentile? 99:59:59.999 --> 99:59:59.999 (Audience Member): "You put lie at the end of the percent sign." 99:59:59.999 --> 99:59:59.999 Yeah, good. 99:59:59.999 --> 99:59:59.999 So the coordinated omission problem is the "lie" in %lies. 99:59:59.999 --> 99:59:59.999 And this is how it works. 99:59:59.999 --> 99:59:59.999 One common way to do this is to use a load generator. 99:59:59.999 --> 99:59:59.999 Pretty much all load generators have this problem. 99:59:59.999 --> 99:59:59.999 There are two that I know of that don't. 99:59:59.999 --> 99:59:59.999 What you do with a load generator, is you test. 99:59:59.999 --> 99:59:59.999 You issue requests, or send packets. 99:59:59.999 --> 99:59:59.999 And you measure how long something took. 
99:59:59.999 --> 99:59:59.999 And as long as the numbers go right, measure them, put them in a bucket, 99:59:59.999 --> 99:59:59.999 study them later, and get your percentiles from it. 99:59:59.999 --> 99:59:59.999 But what if the thing that you are measuring took longer than the time 99:59:59.999 --> 99:59:59.999 it would've taken until you send the next thing? 99:59:59.999 --> 99:59:59.999 You're supposed to send something every second, 99:59:59.999 --> 99:59:59.999 but this one took a second and a half. 99:59:59.999 --> 99:59:59.999 Well you've got to wait before you send the next one. 99:59:59.999 --> 99:59:59.999 You just avoided measuring something when the system was problematic. 99:59:59.999 --> 99:59:59.999 You've coordinated with it. 99:59:59.999 --> 99:59:59.999 You weren't looking at it then. 99:59:59.999 --> 99:59:59.999 That's common scenario A: You've backed off, and avoided measuring when it was bad. 99:59:59.999 --> 99:59:59.999 Another way, is you measure inside your code. 99:59:59.999 --> 99:59:59.999 We all do this. We all have to do this, 99:59:59.999 --> 99:59:59.999 where we measure time, do something, then measure time. 99:59:59.999 --> 99:59:59.999 The delta between them is how long it took. 99:59:59.999 --> 99:59:59.999 We can then put it in a stats bucket, and then do the percentiles in that. 99:59:59.999 --> 99:59:59.999 Unfortunately, if the system freezes right here, for any reason, 99:59:59.999 --> 99:59:59.999 an interrupt, a context switch, 99:59:59.999 --> 99:59:59.999 a cache buffer flushed to disk, 99:59:59.999 --> 99:59:59.999 a garbage collection, 99:59:59.999 --> 99:59:59.999 a re-indexing of your database, this is a database. 99:59:59.999 --> 99:59:59.999 This is Cassandra by the way, measuring itself. 99:59:59.999 --> 99:59:59.999 In any of the above, then you will have one bad report 99:59:59.999 --> 99:59:59.999 while 10,000 things are waiting in line. 
99:59:59.999 --> 99:59:59.999 And when they come in, they will look really, really good. 99:59:59.999 --> 99:59:59.999 Even though each one of them has had a really bad experience. 99:59:59.999 --> 99:59:59.999 It can even get worse, where maybe the freeze happened outside the timing, 99:59:59.999 --> 99:59:59.999 and you won't even know there was a freeze. 99:59:59.999 --> 99:59:59.999 Now these are examples of omitting data that is bad on a very selective basis. 99:59:59.999 --> 99:59:59.999 It's not random sampling. 99:59:59.999 --> 99:59:59.999 It's, "I don't like bad data", 99:59:59.999 --> 99:59:59.999 or "I couldn't handle it", 99:59:59.999 --> 99:59:59.999 or "I don't know about it", 99:59:59.999 --> 99:59:59.999 so we'll just talk about the good. 99:59:59.999 --> 99:59:59.999 What does that do to your data? 99:59:59.999 --> 99:59:59.999 Because it often makes people feel like, 99:59:59.999 --> 99:59:59.999 "Okay, yeah, I understand, but it's a little bit of noise." 99:59:59.999 --> 99:59:59.999 Let's run some hypotheticals, and I'll show you some real numbers. 99:59:59.999 --> 99:59:59.999 Imagine a perfect system. 99:59:59.999 --> 99:59:59.999 It's doing 100 requests a second, at exactly a millisecond each. 99:59:59.999 --> 99:59:59.999 But we go and freeze the system, after 100 seconds of perfect operations 99:59:59.999 --> 99:59:59.999 for 100 seconds, and then repeat. 99:59:59.999 --> 99:59:59.999 Now, I'm going to describe how the system behaves in terms that should mean something, 99:59:59.999 --> 99:59:59.999 and then we'll measure it. 99:59:59.999 --> 99:59:59.999 If we actually wanted to describe the system, 99:59:59.999 --> 99:59:59.999 on the left we have an average of one millisecond by definition, 99:59:59.999 --> 99:59:59.999 and on the right we have an average of 50 seconds. 99:59:59.999 --> 99:59:59.999 Why 50? 
Because if I randomly came in in that 100 seconds, 99:59:59.999 --> 99:59:59.999 I'll get anything from 0 to 100 with even distribution. 99:59:59.999 --> 99:59:59.999 The overall average over 200 seconds is 25 seconds. 99:59:59.999 --> 99:59:59.999 If I just came in here and said, "Surprise, how long did this take?" 99:59:59.999 --> 99:59:59.999 On average, it will be 25. 99:59:59.999 --> 99:59:59.999 I can also do the percentiles. 99:59:59.999 --> 99:59:59.999 50th percentile will be really good, and then it'll get really bad. 99:59:59.999 --> 99:59:59.999 The four 9's is terrible. 99:59:59.999 --> 99:59:59.999 This is a fair, honest description of this system if this is what it did. 99:59:59.999 --> 99:59:59.999 And you can make the system do that. 99:59:59.999 --> 99:59:59.999 That's what Control-Z is good for. 99:59:59.999 --> 99:59:59.999 You can make any of your systems do that. 99:59:59.999 --> 99:59:59.999 Now let's go measure this system with a load generator, 99:59:59.999 --> 99:59:59.999 or with a monitoring system. 99:59:59.999 --> 99:59:59.999 The common ones. 99:59:59.999 --> 99:59:59.999 The ones everybody does. 99:59:59.999 --> 99:59:59.999 On the left, we're going to get 10,000 results of one millisecond each. 99:59:59.999 --> 99:59:59.999 Great. 99:59:59.999 --> 99:59:59.999 And we're going to get one result of 100 seconds. 99:59:59.999 --> 99:59:59.999 Wow, really big response time. 99:59:59.999 --> 99:59:59.999 This is our data. 99:59:59.999 --> 99:59:59.999 This is OUR data. 99:59:59.999 --> 99:59:59.999 So now you go do math with it. 99:59:59.999 --> 99:59:59.999 The average of that is 10.9 milliseconds. 99:59:59.999 --> 99:59:59.999 A little less than 25 seconds. 99:59:59.999 --> 99:59:59.999 And here are the percentiles. 99:59:59.999 --> 99:59:59.999 Your load generator monitoring system will tell you that this system is perfect. 99:59:59.999 --> 99:59:59.999 You could go to production with it. 
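The hypothetical can be worked through numerically. The sketch below builds both data sets from the description: what callers would experience if requests kept arriving every 10 ms through the freeze, and what a generator that waits for each response actually records. The constant names and the nearest-rank `percentile` helper are illustrative choices, and times are kept in integer milliseconds.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

SERVICE_MS = 1              # each request takes exactly 1 ms
INTERVAL_MS = 10            # 100 requests per second
FREEZE_START_MS = 100_000   # 100 s of perfect operation...
FREEZE_END_MS = 200_000     # ...then a 100 s freeze

# What callers experience if requests keep arriving on schedule.
true_latencies = []
for arrival in range(0, FREEZE_END_MS, INTERVAL_MS):
    if arrival < FREEZE_START_MS:
        true_latencies.append(SERVICE_MS)
    else:
        # Stuck until the freeze ends, then served.
        true_latencies.append((FREEZE_END_MS - arrival) + SERVICE_MS)

# What a generator that waits for each response records:
# 10,000 good results, then a single 100 s result.
measured_latencies = [SERVICE_MS] * 10_000 + [100_000]

for name, data in (("true", true_latencies), ("measured", measured_latencies)):
    mean = sum(data) / len(data)
    print(f"{name:8s} mean={mean:10.1f} ms  "
          f"p50={percentile(data, 50)} ms  "
          f"p99.99={percentile(data, 99.99)} ms")
```

The true average comes out near 25 seconds while the measured one is about 11 milliseconds, and the measured 99.99th percentile is still 1 ms because the single bad sample is only one out of 10,001.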
99:59:59.999 --> 99:59:59.999 You like what you see. 99:59:59.999 --> 99:59:59.999 Look at that, four 9's. 99:59:59.999 --> 99:59:59.999 It is lying to you. 99:59:59.999 --> 99:59:59.999 To your face. 99:59:59.999 --> 99:59:59.999 And you can catch it doing that with a Control-Z test. 99:59:59.999 --> 99:59:59.999 But people tend to not want to do that, because then what are they going to do? 99:59:59.999 --> 99:59:59.999 If you just do that test, and calibrate your system, and you find it 99:59:59.999 --> 99:59:59.999 telling you that, about this, the next step should be to throw all the numbers away. 99:59:59.999 --> 99:59:59.999 Don't believe anything else it says. 99:59:59.999 --> 99:59:59.999 If it lies this big, what else did it do? 99:59:59.999 --> 99:59:59.999 Don't waste your time on numbers from uncalibrated systems. 99:59:59.999 --> 99:59:59.999 Now the problem here was, that if you want to measure the system, 99:59:59.999 --> 99:59:59.999 you have to measure at random rates, or same rates. 99:59:59.999 --> 99:59:59.999 If you measure 10,000 things in 100 seconds, there should be another 10,000 things here. 99:59:59.999 --> 99:59:59.999 If you measure them, you would've gotten all the right numbers. 99:59:59.999 --> 99:59:59.999 Coordinated omission is the simple act of erasing all that bad stuff. 99:59:59.999 --> 99:59:59.999 The conspiracy here is that we all do it without meaning to. 99:59:59.999 --> 99:59:59.999 I don't know who put that in our systems, but it happens to all of us. 99:59:59.999 --> 99:59:59.999 Now, I often get people saying, "Okay, I get it. All the numbers are wrong, 99:59:59.999 --> 99:59:59.999 but at least for my job where I tune performance, and I try to make things 99:59:59.999 --> 99:59:59.999 faster, I can use the numbers to figure out if I'm going in the right direction." 99:59:59.999 --> 99:59:59.999 Is it better, or is it worse? Let me dispel that for you for a second. 
99:59:59.999 --> 99:59:59.999 Suppose I went and took this system, and improved it dramatically. 99:59:59.999 --> 99:59:59.999 Rather than freezing for 100 seconds, it will now answer every question. 99:59:59.999 --> 99:59:59.999 It'll take a little longer, 5 milliseconds instead of one, 99:59:59.999 --> 99:59:59.999 but it's much better than freezing, right? 99:59:59.999 --> 99:59:59.999 So let's measure that system that we spent weeks and weeks improving, 99:59:59.999 --> 99:59:59.999 and see if it's better. 99:59:59.999 --> 99:59:59.999 That's the data. 99:59:59.999 --> 99:59:59.999 If we do the percentiles, it'll tell us that we just really hurt the four 9's. 99:59:59.999 --> 99:59:59.999 We made it go 5 times worse than before. 99:59:59.999 --> 99:59:59.999 We should revert this change, go back to that much better system we had before. 99:59:59.999 --> 99:59:59.999 So this is just to make sure that you don't think that you can have 99:59:59.999 --> 99:59:59.999 any intuition based on any of these numbers. 99:59:59.999 --> 99:59:59.999 They go backwards sometimes. 99:59:59.999 --> 99:59:59.999 You don't know which way is good or bad. 99:59:59.999 --> 99:59:59.999 And you'll never know which way is good or bad with a system that lies like that. 99:59:59.999 --> 99:59:59.999 The other cool technique is what I call "Cheating Twice". 99:59:59.999 --> 99:59:59.999 You have a constant load generator, and it needs to do 100 per second. 99:59:59.999 --> 99:59:59.999 When it woke up after 200 seconds, it says, 99:59:59.999 --> 99:59:59.999 "Woah, we're 9,999 behind. We've got to issue those requests." 99:59:59.999 --> 99:59:59.999 So it issues those requests. 99:59:59.999 --> 99:59:59.999 At this point, not only did it get rid of all the bad requests, 99:59:59.999 --> 99:59:59.999 it replaced every one of them with a perfect request. 
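Both effects can be seen in the recorded samples themselves. The sketch below writes down the three data sets as a coordinating load generator would log them, following the numbers in the talk; the variable names, the 1-second "bad" threshold, and the nearest-rank helper are assumptions made for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def share_of_bad(samples, threshold_ms=1_000):
    """Fraction of recorded samples slower than the threshold."""
    return sum(1 for s in samples if s > threshold_ms) / len(samples)

# Recorded latencies in milliseconds.
freezing_system = [1] * 10_000 + [100_000]                   # nothing logged during the stall
improved_system = [5] * 20_000                               # never freezes, 5 ms each
catch_up_generator = [1] * 10_000 + [100_000] + [1] * 9_999  # "cheating twice"

for name, data in (
    ("freezing", freezing_system),
    ("improved", improved_system),
    ("catch-up", catch_up_generator),
):
    print(f"{name:9s} p99.99={percentile(data, 99.99):>3} ms  "
          f"bad samples={share_of_bad(data) * 100:.4f}%")
```

By the recorded four 9's, the system that never freezes looks five times worse than the one that stalls for 100 seconds. With the catch-up requests added, the single bad sample shrinks to 1 in 20,000, so it only shows up above the 99.995th percentile.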
99:59:59.999 --> 99:59:59.999 Coining the four 9's (99.99%), all the way to four and a half 9's (99.995%), 99:59:59.999 --> 99:59:59.999 it's twice as wrong as dropping them. 99:59:59.999 --> 99:59:59.999 So these are all cool things that happen to you. 99:59:59.999 --> 99:59:59.999 I'm not going to spend much time on how to fix those and avoid those. 99:59:59.999 --> 99:59:59.999 There's a lot of other material that you can find with me 99:59:59.999 --> 99:59:59.999 talking about that, in longer talks. 99:59:59.999 --> 99:59:59.999 But this is pretty bad. 99:59:59.999 --> 99:59:59.999 And like I said... 99:59:59.999 --> 99:59:59.999 That should've been up there before. 99:59:59.999 --> 99:59:59.999 How did this repeat itself? 99:59:59.999 --> 99:59:59.999 Did I create a loop in the presentation somehow? 99:59:59.999 --> 99:59:59.999 I don't know how to do that. 99:59:59.999 --> 99:59:59.999 Let's see if I can get through here. 99:59:59.999 --> 99:59:59.999 Hopefully editing later will take it out. 99:59:59.999 --> 99:59:59.999 So we have the cheats twice. 99:59:59.999 --> 99:59:59.999 There, okay. 99:59:59.999 --> 99:59:59.999 So, after we look at coordinated omission that way, 99:59:59.999 --> 99:59:59.999 we should also look at response time, and service time. 99:59:59.999 --> 99:59:59.999 Coordinated omission, what it really is achieving for you, unfortunately, 99:59:59.999 --> 99:59:59.999 is that it makes something that you think is response time, and only shows you 99:59:59.999 --> 99:59:59.999 the service time component of latency. 99:59:59.999 --> 99:59:59.999 This is a simple depiction of what service time and response times are. 99:59:59.999 --> 99:59:59.999 This guy is taking a certain amount of time to take payment 99:59:59.999 --> 99:59:59.999 or make a cup of coffee. 99:59:59.999 --> 99:59:59.999 That's service time. 99:59:59.999 --> 99:59:59.999 How long does it take to do the work? 
99:59:59.999 --> 99:59:59.999 This person has experienced the response time, 99:59:59.999 --> 99:59:59.999 which includes the amount of time they have to wait before they 99:59:59.999 --> 99:59:59.999 get to the person that does the work. 99:59:59.999 --> 99:59:59.999 And the difference between those two is immense. 99:59:59.999 --> 99:59:59.999 The coordinated omission problem makes something that you think is 99:59:59.999 --> 99:59:59.999 response time, only measure the service time, 99:59:59.999 --> 99:59:59.999 and basically hide the fact that things stalled, waited in line, 99:59:59.999 --> 99:59:59.999 that this guy might've taken a lunch break, 99:59:59.999 --> 99:59:59.999 and now we have a line around the building three times. 99:59:59.999 --> 99:59:59.999 Service time stays the same. 99:59:59.999 --> 99:59:59.999 This is the backwards part... 99:59:59.999 --> 99:59:59.999 Now, let's look at what it actually looks like. 99:59:59.999 --> 99:59:59.999 In a load generator that I fixed, I measured both 99:59:59.999 --> 99:59:59.999 response time and service time, 99:59:59.999 --> 99:59:59.999 this happens to be Cassandra, 99:59:59.999 --> 99:59:59.999 at a very low load. 99:59:59.999 --> 99:59:59.999 And you can see that they're very very similar, at a very low load. 99:59:59.999 --> 99:59:59.999 Why? Because there's nobody in line. 99:59:59.999 --> 99:59:59.999 This thing is really fast. 99:59:59.999 --> 99:59:59.999 We're not asking for too much. 99:59:59.999 --> 99:59:59.999 Cassandra's pretty fast, so they're the same. 99:59:59.999 --> 99:59:59.999 But if I increase the load, we start seeing gaps. 99:59:59.999 --> 99:59:59.999 If I increase the load a little more, the gap grows. 99:59:59.999 --> 99:59:59.999 If I increase the load a little more, the gap grows. 99:59:59.999 --> 99:59:59.999 Now this is not the failure point yet.
99:59:59.999 --> 99:59:59.999 If I actually increase it all the way past the point where the system 99:59:59.999 --> 99:59:59.999 can't even do the work I want, service time stays the same, 99:59:59.999 --> 99:59:59.999 response time goes through the roof. 99:59:59.999 --> 99:59:59.999 This was when it was 100 and something milliseconds, now it's 7 and a half seconds. 99:59:59.999 --> 99:59:59.999 Why 7 and a half seconds? 99:59:59.999 --> 99:59:59.999 Cause you're waiting in line that long to go around the block. 99:59:59.999 --> 99:59:59.999 The guy just can't serve as many people as are showing up in line, you fall behind. 99:59:59.999 --> 99:59:59.999 This is a virtual world reaction to this. 99:59:59.999 --> 99:59:59.999 I really like this slide, it's where I came up with the notion of a blue/red pill. 99:59:59.999 --> 99:59:59.999 When you actually measure reality, people tend to have this reaction when 99:59:59.999 --> 99:59:59.999 they compare the two. 99:59:59.999 --> 99:59:59.999 And if we actually look at these on the two sides of a collapse point of a system, 99:59:59.999 --> 99:59:59.999 this specific system can only do 87,000 things a second. 99:59:59.999 --> 99:59:59.999 No matter how hard you press it, that's all it'll do. 99:59:59.999 --> 99:59:59.999 The service time on the two sides of the collapse looks virtually identical, 99:59:59.999 --> 99:59:59.999 which it would. 99:59:59.999 --> 99:59:59.999 But if you compare the response time, you have a very different picture. 99:59:59.999 --> 99:59:59.999 And I'm showing this picture so you get a feeling for what to look at 99:59:59.999 --> 99:59:59.999 on whether or not you're measuring the right one. 99:59:59.999 --> 99:59:59.999 Whenever you push, you try and push load beyond what the system can do, 99:59:59.999 --> 99:59:59.999 you are falling behind over time. 
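The gap between service time and response time described above can be reproduced with a toy single-server queue. This is a minimal sketch, not the load generator used in the talk: `simulate` and its parameters are hypothetical, times are integer microseconds to keep the arithmetic exact, and the "coffee shop" is modeled as one FIFO server with a fixed service time.

```python
def simulate(arrival_interval_us, service_us, n):
    """Single FIFO server; request i is *intended* to start at i * arrival_interval_us."""
    server_free_at = 0
    service_times, response_times = [], []
    for i in range(n):
        intended = i * arrival_interval_us
        start = max(intended, server_free_at)    # wait in line if the server is busy
        done = start + service_us
        server_free_at = done
        service_times.append(done - start)       # time spent doing the work
        response_times.append(done - intended)   # time the caller experienced
    return service_times, response_times

# Well below capacity: one request every 10 ms, each taking 1 ms to serve.
low_service, low_response = simulate(arrival_interval_us=10_000, service_us=1_000, n=1_000)

# Past capacity: one request every 0.9 ms, each still taking 1 ms to serve.
over_service, over_response = simulate(arrival_interval_us=900, service_us=1_000, n=1_000)

print("low load  : max service", max(low_service), "us, max response", max(low_response), "us")
print("overloaded: max service", max(over_service), "us, max response", max(over_response), "us")
```

In this model the service time is identical on both sides of the collapse point, while the overloaded run's response time grows with every request, which is the shape a correct measurement should show when pushed past capacity.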
99:59:59.999 --> 99:59:59.999 This is a 250 second run, 99:59:59.999 --> 99:59:59.999 where at the end of it you are waiting for 8 seconds in line. 99:59:59.999 --> 99:59:59.999 Why? Because for every second that goes by, there are 99:59:59.999 --> 99:59:59.999 3,000 more things that are added to the line. 99:59:59.999 --> 99:59:59.999 The interesting thing that happens when you cross the threshold limit, 99:59:59.999 --> 99:59:59.999 or capability of the system, is that response time grows over time linearly. 99:59:59.999 --> 99:59:59.999 It doesn't happen if you're below. 99:59:59.999 --> 99:59:59.999 Only if you're above. 99:59:59.999 --> 99:59:59.999 It's the point where that happens, and any load generator that doesn't show 99:59:59.999 --> 99:59:59.999 that line when you try pushing harder than you can, is lying to you. 99:59:59.999 --> 99:59:59.999 It's a simple sanity check. 99:59:59.999 --> 99:59:59.999 If your load generator shows you that, it didn't push. 99:59:59.999 --> 99:59:59.999 Or it pushed, but it didn't report correctly, 99:59:59.999 --> 99:59:59.999 whichever it is. 99:59:59.999 --> 99:59:59.999 If we draw that to scale... 99:59:59.999 --> 99:59:59.999 Just to make sure, this was not to scale, this is the scale, I just zoomed in 99:59:59.999 --> 99:59:59.999 so you could see that it was relatively stable. 99:59:59.999 --> 99:59:59.999 So... I don't know what happened to the order of the slides. 99:59:59.999 --> 99:59:59.999 It's like looping and randoming. 99:59:59.999 --> 99:59:59.999 There's some conspiracy going on there. 99:59:59.999 --> 99:59:59.999 Now, latency doesn't live on its own. 99:59:59.999 --> 99:59:59.999 You do need to look at latency in the context of load. 99:59:59.999 --> 99:59:59.999 'Cause as I showed you, as you're nearly idle, things are nearly perfect. 99:59:59.999 --> 99:59:59.999 Even these mistakes won't show up. 99:59:59.999 --> 99:59:59.999 But as you start pressing, things start cracking or behaving differently.
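The linear growth in the 250-second run can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures quoted in the talk (87,000 per second capacity, 90,000 per second offered, so 3,000 more in line each second); the `queue_wait` helper and the simple fluid approximation behind it are assumptions made for illustration.

```python
capacity = 87_000     # requests/second the system can complete (from the talk)
offered = 90_000      # requests/second the load generator attempts (from the talk)
run_seconds = 250     # length of the run (from the talk)

def queue_wait(t):
    """Seconds a request arriving t seconds into the run waits in line."""
    backlog = max(0, offered - capacity) * t   # requests already queued ahead of it
    return backlog / capacity                  # time needed to drain that backlog

for t in (50, 100, 150, 200, run_seconds):
    print(f"t = {t:3d} s  backlog = {(offered - capacity) * t:7d}  "
          f"wait = {queue_wait(t):.2f} s")
```

At the end of the run this gives a backlog of 750,000 requests and a wait of about 8.6 seconds, in the same ballpark as the roughly 8 seconds mentioned for the end of the run.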
99:59:59.999 --> 99:59:59.999 And usually when you want to know how much your system can handle, 99:59:59.999 --> 99:59:59.999 the answer is not 87,000 things a second, because nobody wants the 99:59:59.999 --> 99:59:59.999 response time that comes with that. 99:59:59.999 --> 99:59:59.999 It's how many things can I handle so that I don't get angry phone calls. 99:59:59.999 --> 99:59:59.999 So I do get my bonus, and so my company stays above ground. 99:59:59.999 --> 99:59:59.999 This is not sustainable speed. 99:59:59.999 --> 99:59:59.999 Running this experiment is really interesting with software, 99:59:59.999 --> 99:59:59.999 because it actually doesn't hurt, but spending the next 6 months of your time 99:59:59.999 --> 99:59:59.999 repeating this experiment, trying to change the shape of the bumper 99:59:59.999 --> 99:59:59.999 every time you hit the thing is a waste of your time. 99:59:59.999 --> 99:59:59.999 Your goal when you're trying to figure out sustainable speed throughput, 99:59:59.999 --> 99:59:59.999 whatever it is, is to see how fast you can go without this happening, 99:59:59.999 --> 99:59:59.999 and then to try and engineer to improve that. 99:59:59.999 --> 99:59:59.999 Meaning, can I make it go faster without this happening? 99:59:59.999 --> 99:59:59.999 Measuring what happens after you hit the pole is useless for that exercise. 99:59:59.999 --> 99:59:59.999 The only thing that matters about hitting the pole, is that you hit the pole. 99:59:59.999 --> 99:59:59.999 When you go and study the behavior of latency, at saturation, 99:59:59.999 --> 99:59:59.999 you are doing this. 99:59:59.999 --> 99:59:59.999 You're looking at this and saying, "That bumper, I don't like the shape of that. 99:59:59.999 --> 99:59:59.999 Let's measure it closely and do this 100 times to see if we can vary it." 
99:59:59.999 --> 99:59:59.999 That's what it means to look at latency at saturation, 99:59:59.999 --> 99:59:59.999 and repeat, and repeat, and change, and tune, and see if you can do it again. 99:59:59.999 --> 99:59:59.999 If you're pressing it to the wall, it should look like this. 99:59:59.999 --> 99:59:59.999 And it shouldn't be a surprise that it's a 7 and a half second response time. 99:59:59.999 --> 99:59:59.999 In fact, if it's not, something is terribly wrong with what you're measuring. 99:59:59.999 --> 99:59:59.999 You should look at that instead. 99:59:59.999 --> 99:59:59.999 So don't do this. 99:59:59.999 --> 99:59:59.999 Try to minimize the number of times that you actually run red cars 99:59:59.999 --> 99:59:59.999 into poles in your testing. 99:59:59.999 --> 99:59:59.999 I'm not saying don't do it, but use it to establish the end. 99:59:59.999 --> 99:59:59.999 And then you need to test all the speeds, and we need to see when you hit the pole. 99:59:59.999 --> 99:59:59.999 Maybe you hit the pole at 100 mph, but maybe you also hit the pole at 70 mph. 99:59:59.999 --> 99:59:59.999 Maybe you don't hit it at 20. 99:59:59.999 --> 99:59:59.999 We should find out how fast is safe. 99:59:59.999 --> 99:59:59.999 When you have data, you can compare it like this. 99:59:59.999 --> 99:59:59.999 This is what I would say a recommended way to look at it. 99:59:59.999 --> 99:59:59.999 Plot requirements, that's the hitting the pole. 99:59:59.999 --> 99:59:59.999 And some things hit the pole, and some things don't. 99:59:59.999 --> 99:59:59.999 And you run different scenarios, different loads, 99:59:59.999 --> 99:59:59.999 different configurations, 99:59:59.999 --> 99:59:59.999 different settings, 99:59:59.999 --> 99:59:59.999 and see what works, and what doesn't. 99:59:59.999 --> 99:59:59.999 Your goal is to stay here, and carry more while staying there. 99:59:59.999 --> 99:59:59.999 Usually. 
99:59:59.999 --> 99:59:59.999 It's very useful for figuring out how many machines I need to carry a certain thing. 99:59:59.999 --> 99:59:59.999 If you don't know this, you don't know how many machines to deploy. 99:59:59.999 --> 99:59:59.999 Okay, I'm going to run through some comparisons of 99:59:59.999 --> 99:59:59.999 latency or response time behaviors between different configurations 99:59:59.999 --> 99:59:59.999 to show you some of the places people look, and some of the 99:59:59.999 --> 99:59:59.999 intuitive and non-intuitive things to do with them. 99:59:59.999 --> 99:59:59.999 The common thing, 99:59:59.999 --> 99:59:59.999 and again, this is that Cassandra thing, 99:59:59.999 --> 99:59:59.999 comparing two systems, A and B. 99:59:59.999 --> 99:59:59.999 I'll let you guess which one is A, and which one is B. 99:59:59.999 --> 99:59:59.999 It's two systems, and saying which is better, what can I do with this? 99:59:59.999 --> 99:59:59.999 And we're measuring here at two throughputs, 85 and 90k. 99:59:59.999 --> 99:59:59.999 As I said in here, 90k is past the capability of the system. 99:59:59.999 --> 99:59:59.999 You can sort of see it here. 99:59:59.999 --> 99:59:59.999 See, 85 for both of them is here, and 90k is here. 99:59:59.999 --> 99:59:59.999 So you could look at this and say, 99:59:59.999 --> 99:59:59.999 "Look, when the car hits the pole, the blue system is better." 99:59:59.999 --> 99:59:59.999 It's half as bad, but that's just the wrong place to look. 99:59:59.999 --> 99:59:59.999 They both suck. 99:59:59.999 --> 99:59:59.999 You do not want to be doing this. 99:59:59.999 --> 99:59:59.999 The fact that this system is better than that system 99:59:59.999 --> 99:59:59.999 doesn't make you want to use it. 99:59:59.999 --> 99:59:59.999 This is the wrong place to measure. 99:59:59.999 --> 99:59:59.999 This is where latency is irrelevant. 99:59:59.999 --> 99:59:59.999 How they behave past this point doesn't matter.
99:59:59.999 --> 99:59:59.999 What we should be doing is saying, 99:59:59.999 --> 99:59:59.999 "Well, then don't measure here. Let's look there." 99:59:59.999 --> 99:59:59.999 So if we zoom just at the 85k's on these two systems, okay, they're different. 99:59:59.999 --> 99:59:59.999 And now... 99:59:59.999 --> 99:59:59.999 The red and the blue alternate here, whatever that is. 99:59:59.999 --> 99:59:59.999 And now you look at this, and okay, it's better. 99:59:59.999 --> 99:59:59.999 But we're still in the wrong place, because we are 1.5% from hitting the pole. 99:59:59.999 --> 99:59:59.999 It is not where you will be running in production. 99:59:59.999 --> 99:59:59.999 It's not the interesting place to study latency. 99:59:59.999 --> 99:59:59.999 That's the place that if you're anywhere close to that, you should be on the phone 99:59:59.999 --> 99:59:59.999 getting more servers now, rather than trying to figure out how the latency behaves. 99:59:59.999 --> 99:59:59.999 You know it's going to collapse if just a little bit of noise happens. 99:59:59.999 --> 99:59:59.999 What you should be doing is looking far away from the need, 99:59:59.999 --> 99:59:59.999 far away from that. 99:59:59.999 --> 99:59:59.999 For example, let's go to half the throughput that causes collapse, 99:59:59.999 --> 99:59:59.999 and see what things happen there. 99:59:59.999 --> 99:59:59.999 And here you can see, 99:59:59.999 --> 99:59:59.999 okay, these are two systems, and one of them does better. 99:59:59.999 --> 99:59:59.999 You can say that this percentile is better, 99:59:59.999 --> 99:59:59.999 that percentile, whatever these are. 99:59:59.999 --> 99:59:59.999 It is interesting, but what can we do with this? 99:59:59.999 --> 99:59:59.999 How do we tell our boss what this means? 99:59:59.999 --> 99:59:59.999 Or how do we translate this into, how many machines do I need?
99:59:59.999 --> 99:59:59.999 Now so far, I've been comparing things at the same throughput, 99:59:59.999 --> 99:59:59.999 and looking at latencies. 99:59:59.999 --> 99:59:59.999 And that's good for pass/fail kind of things, or getting quantitative things, 99:59:59.999 --> 99:59:59.999 but once you get to this point, you can start saying, 99:59:59.999 --> 99:59:59.999 "Wait, what if I do it at different throughputs?" 99:59:59.999 --> 99:59:59.999 How slow do I need to make this blue thing to make it look closer to the red thing, 99:59:59.999 --> 99:59:59.999 or the other way around. 99:59:59.999 --> 99:59:59.999 I don't want to move this fast to 3-L too, I want to move this to be there. 99:59:59.999 --> 99:59:59.999 For example, slow that one up by 4X, and look, 99:59:59.999 --> 99:59:59.999 the two 9's are actually starting to look similar. 99:59:59.999 --> 99:59:59.999 If you slow it by... 99:59:59.999 --> 99:59:59.999 So you can make a statement like this: 99:59:59.999 --> 99:59:59.999 The 99th percentile, if you had a goal like this, 99:59:59.999 --> 99:59:59.999 and now you've passed the goal, 99:59:59.999 --> 99:59:59.999 You'd say, "Both of them passed the goal, but system B does it at 4 times the load." 99:59:59.999 --> 99:59:59.999 That drives a choice, right? 99:59:59.999 --> 99:59:59.999 You can make a harsher goal, and say, 99:59:59.999 --> 99:59:59.999 I need the three 9's to be below 10 milliseconds, 99:59:59.999 --> 99:59:59.999 so you'll slow these down even further. 99:59:59.999 --> 99:59:59.999 At this point, you can make this statement: 99:59:59.999 --> 99:59:59.999 If you want those, one of them is 10 times better. 99:59:59.999 --> 99:59:59.999 Meaning, not that the system is 10 times faster, 99:59:59.999 --> 99:59:59.999 but I can carry 10 times the load before I fail, before I have to pull. 
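One way to turn "system B does it at 4 times the load" into a machine count is sketched below. Everything numeric here is invented for illustration (the per-load percentile figures, the 400,000 per second target, the names `measured_p999_ms` and `sustainable_load`); only the shape of the reasoning comes from the talk: pick a requirement, find the highest load that still meets it, and size the deployment from that.

```python
import math

# Hypothetical 99.9th-percentile response times (ms) at each tested load (req/s).
# These numbers are made up for the example; they are not data from the talk.
measured_p999_ms = {
    "A": {10_000: 4.0, 20_000: 9.0, 40_000: 35.0, 80_000: 400.0},
    "B": {10_000: 2.0, 20_000: 3.0, 40_000: 6.0, 80_000: 9.5},
}
REQUIREMENT_MS = 10.0    # e.g. "three 9's below 10 milliseconds"
TARGET_LOAD = 400_000    # hypothetical total requests/second to carry

def sustainable_load(results, limit_ms):
    """Highest tested load whose percentile still meets the requirement (0 if none)."""
    passing = [load for load, p in results.items() if p <= limit_ms]
    return max(passing, default=0)

plan = {}
for system, results in measured_p999_ms.items():
    load = sustainable_load(results, REQUIREMENT_MS)
    machines = math.ceil(TARGET_LOAD / load) if load else None
    plan[system] = (load, machines)
    print(f"system {system}: sustains {load} req/s per machine -> {machines} machines")
```

With these made-up inputs, system A passes the requirement up to 20,000 per second and system B up to 80,000, so B carries four times the load per machine and needs a quarter of the machines. A tighter or looser requirement would change that ratio, which is the point of setting the requirement first.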
99:59:59.999 --> 99:59:59.999 What I'm trying to demonstrate here, is that how much more, or not, 99:59:59.999 --> 99:59:59.999 you can get out of a system depends on your requirements, 99:59:59.999 --> 99:59:59.999 and whether or not you need to meet them. 99:59:59.999 --> 99:59:59.999 Without setting those requirements, looking at the percentile spectrum 99:59:59.999 --> 99:59:59.999 of response time, not service time, 99:59:59.999 --> 99:59:59.999 you'll never know how much you need or not. 99:59:59.999 --> 99:59:59.999 You can do a lot of other things, these are just demonstrations 99:59:59.999 --> 99:59:59.999 of how to look at data sets. 99:59:59.999 --> 99:59:59.999 You may measure at a lot of levels. 99:59:59.999 --> 99:59:59.999 You can look for systemic behaviors. 99:59:59.999 --> 99:59:59.999 For example, this is one system, but at varying levels. 99:59:59.999 --> 99:59:59.999 You can sort of see that as you increase the load, the percentiles move to the left. 99:59:59.999 --> 99:59:59.999 That's a good observation. 99:59:59.999 --> 99:59:59.999 It's not all systems that'll do it, but for this system it'll be that. 99:59:59.999 --> 99:59:59.999 You can also see that even though this didn't totally collapse, 99:59:59.999 --> 99:59:59.999 it's completely out of whack with the rest, 99:59:59.999 --> 99:59:59.999 so that kind of tells you let's not look there. 99:59:59.999 --> 99:59:59.999 So throw away the behavior... 99:59:59.999 --> 99:59:59.999 You just know not to go to 80. 99:59:59.999 --> 99:59:59.999 No need to study it much. 99:59:59.999 --> 99:59:59.999 Now that's the remaining set. 99:59:59.999 --> 99:59:59.999 You could look at that. 99:59:59.999 --> 99:59:59.999 You could look at the set from the other system and compare them. 99:59:59.999 --> 99:59:59.999 Maybe put them next to each other like this.
99:59:59.999 --> 99:59:59.999 Or if you actually can fit enough lines, with enough colors on a chart, 99:59:59.999 --> 99:59:59.999 you can try and do stuff like that. 99:59:59.999 --> 99:59:59.999 These are all good ways to actually look at latencies, 99:59:59.999 --> 99:59:59.999 actually study them. 99:59:59.999 --> 99:59:59.999 And notice that in all these cases, I didn't pick a number. 99:59:59.999 --> 99:59:59.999 "Oh, let's compare the 99.9 percentile," because I won't get 99:59:59.999 --> 99:59:59.999 any feeling for the shapes if I did that. 99:59:59.999 --> 99:59:59.999 You want to look at the entire spectrum. 99:59:59.999 --> 99:59:59.999 And that is what an HDR histogram is very good for. 99:59:59.999 --> 99:59:59.999 So, you know... You get those. 99:59:59.999 --> 99:59:59.999 Now... 99:59:59.999 --> 99:59:59.999 Wow, we're actually doing okay on time. 99:59:59.999 --> 99:59:59.999 Now, this is one of my favorite ways to depict things. 99:59:59.999 --> 99:59:59.999 Remember I told you that if you don't plot the max, what are you hiding? 99:59:59.999 --> 99:59:59.999 It turns out that if you plot the max, 99:59:59.999 --> 99:59:59.999 usually it's the number one signal to look at over time, 99:59:59.999 --> 99:59:59.999 these are just those two systems. 99:59:59.999 --> 99:59:59.999 And with a simple visual, you get a great intuition. 99:59:59.999 --> 99:59:59.999 Same load, one of them's noisy, one's not. 99:59:59.999 --> 99:59:59.999 You can look at the response time and service time, 99:59:59.999 --> 99:59:59.999 and all of the numbers of different samples of percentiles, 99:59:59.999 --> 99:59:59.999 but if you actually want to show a CEO something, 99:59:59.999 --> 99:59:59.999 this is a pretty good thing to show them. 99:59:59.999 --> 99:59:59.999 "Look what I did over the weekend." 99:59:59.999 --> 99:59:59.999 "Before the weekend it looked like that, and I fixed it." 99:59:59.999 --> 99:59:59.999 "I deserve a prize."
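The "plot the max over time" view can be produced from raw samples with very little code. This is a minimal sketch under stated assumptions: the samples are synthetic, timestamps are integer milliseconds, and `max_per_interval` is a hypothetical helper rather than part of any tool mentioned in the talk.

```python
from collections import defaultdict

def max_per_interval(samples, interval_ms=1000):
    """samples: iterable of (timestamp_ms, latency_ms).

    Returns {interval_start_ms: worst latency seen in that interval}.
    """
    worst = defaultdict(float)
    for ts_ms, latency_ms in samples:
        bucket = ts_ms // interval_ms * interval_ms
        worst[bucket] = max(worst[bucket], latency_ms)
    return dict(sorted(worst.items()))

# Synthetic data: one request every 10 ms for 5 seconds, all taking 1 ms,
# except a single 180 ms hiccup in the middle. Invented for illustration.
samples = [(i * 10, 180.0 if i == 250 else 1.0) for i in range(500)]

series = max_per_interval(samples)
for start_ms, worst_ms in series.items():
    print(f"{start_ms:5d} ms: max = {worst_ms:6.1f} ms")
```

An average over the same intervals would barely move for the interval containing the hiccup, whereas the max series makes it obvious which second was noisy.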
99:59:59.999 --> 99:59:59.999 With that, a simple thing to remember is that this is your load on system A, 99:59:59.999 --> 99:59:59.999 this is your load on system B. 99:59:59.999 --> 99:59:59.999 Any questions? 99:59:59.999 --> 99:59:59.999 This is from an anti-drug commercial in the 80's, 99:59:59.999 --> 99:59:59.999 I don't know if anybody can remember. 99:59:59.999 --> 99:59:59.999 So with that, we're ready for any questions. 99:59:59.999 --> 99:59:59.999 Any questions? 99:59:59.999 --> 99:59:59.999 Wow, that bad? 99:59:59.999 --> 99:59:59.999 (Laughing) Dreadful. 99:59:59.999 --> 99:59:59.999 Okay, I have one here, and one back there. 99:59:59.999 --> 99:59:59.999 Let's start with the back. 99:59:59.999 --> 99:59:59.999 (Audience Member): You said that there are all these tools that you could use 99:59:59.999 --> 99:59:59.999 that give you reasonable numbers, and reasonable answers as far as 99:59:59.999 --> 99:59:59.999 latency is concerned, so what are those tools that you use? 99:59:59.999 --> 99:59:59.999 So the question was, there are a couple of tools I mentioned that could give you 99:59:59.999 --> 99:59:59.999 better information, and I used some to chart here, 99:59:59.999 --> 99:59:59.999 let me see, there are a lot of tools. 99:59:59.999 --> 99:59:59.999 I used HDR histogram to plot all these charts 99:59:59.999 --> 99:59:59.999 with the continuous percentile curves. 99:59:59.999 --> 99:59:59.999 I highly recommend you look at using it. 99:59:59.999 --> 99:59:59.999 Just go to hdrhistogram.org and read stuff. 99:59:59.999 --> 99:59:59.999 Or google it. 99:59:59.999 --> 99:59:59.999 There's a bunch of people using it. 99:59:59.999 --> 99:59:59.999 And the basic thing it does, is that it gives you a tool that 99:59:59.999 --> 99:59:59.999 allows you a practical way to have this kind of 99:59:59.999 --> 99:59:59.999 fidelity, dynamic range, and resolution to even look at the shapes.
99:59:59.999 --> 99:59:59.999 The other way to do it is to keep all the data. 99:59:59.999 --> 99:59:59.999 You don't have to have histograms if you kept every single result, 99:59:59.999 --> 99:59:59.999 but in many places that's not practical, or makes it harder for the system to run. 99:59:59.999 --> 99:59:59.999 If you can do that, that's even better. 99:59:59.999 --> 99:59:59.999 And then run it through an HDR histogram for analysis. 99:59:59.999 --> 99:59:59.999 So that's as far as viewing things. 99:59:59.999 --> 99:59:59.999 If you have data viewing it. 99:59:59.999 --> 99:59:59.999 Unfortunately, HDR histogram is not going to make the data good. 99:59:59.999 --> 99:59:59.999 It's just going to show you the data you have. 99:59:59.999 --> 99:59:59.999 One of the things I would highly recommend you try to do, 99:59:59.999 --> 99:59:59.999 I'm going backwards, and hopefully I'll hit what I wanted. 99:59:59.999 --> 99:59:59.999 I highly recommend you look at your data sets, and remember 99:59:59.999 --> 99:59:59.999 that in this visual, 99:59:59.999 --> 99:59:59.999 one strong tip I will give you, is that any time you see a vertical 99:59:59.999 --> 99:59:59.999 rise like that, you have a 99.9% chance of looking at coordinated omission. 99:59:59.999 --> 99:59:59.999 This is what coordinated omission looks like. 99:59:59.999 --> 99:59:59.999 There's a couple of other things that can also look like that. 99:59:59.999 --> 99:59:59.999 I haven't seen them in a while, but I can make them artificially happen, 99:59:59.999 --> 99:59:59.999 so it's not conclusive that this is coordinated omission, 99:59:59.999 --> 99:59:59.999 but suspect it. 99:59:59.999 --> 99:59:59.999 Suspect it hard. 99:59:59.999 --> 99:59:59.999 So if you plot your data with coordinated omission, 99:59:59.999 --> 99:59:59.999 you will get a view of whether or not you have this other problem. 99:59:59.999 --> 99:59:59.999 But honestly, there's a much simpler way to do it.
99:59:59.999 --> 99:59:59.999 Run your Control-Z test, and see if you have the problem. 99:59:59.999 --> 99:59:59.999 This will just show you how it works. 99:59:59.999 --> 99:59:59.999 A non-omitted, a sane response time test, or latency test, 99:59:59.999 --> 99:59:59.999 tends to have these more smooth humps of curves transitioning between numbers. 99:59:59.999 --> 99:59:59.999 Any vertical rise tends to indicate omission. 99:59:59.999 --> 99:59:59.999 So that's one thing there. 99:59:59.999 --> 99:59:59.999 As far as the tools actually measuring correctly, 99:59:59.999 --> 99:59:59.999 remember I told you what the name of the talk is, 99:59:59.999 --> 99:59:59.999 so let me rattle off some tools. 99:59:59.999 --> 99:59:59.999 Actually, let's do this. 99:59:59.999 --> 99:59:59.999 You guys measure stuff here, I assume. 99:59:59.999 --> 99:59:59.999 Could you rattle off some tools that you use? 99:59:59.999 --> 99:59:59.999 What do you use for load generation and measurement right now? 99:59:59.999 --> 99:59:59.999 Volunteers? 99:59:59.999 --> 99:59:59.999 JMeter? 99:59:59.999 --> 99:59:59.999 Okay, JMeter. 99:59:59.999 --> 99:59:59.999 Gatling. 99:59:59.999 --> 99:59:59.999 Anybody else? 99:59:59.999 --> 99:59:59.999 Okay, anybody with Grinder, WRK, some of the commercial... 99:59:59.999 --> 99:59:59.999 Oh, well yeah. 99:59:59.999 --> 99:59:59.999 Gatling is the only tool I know of right now, that is an actual tool 99:59:59.999 --> 99:59:59.999 people use, not a demo, that has fixed a coordinated omission 99:59:59.999 --> 99:59:59.999 problem in its measurement. 99:59:59.999 --> 99:59:59.999 There was actually a bug filed against it, 99:59:59.999 --> 99:59:59.999 and the coordinated omission in it was fixed. 99:59:59.999 --> 99:59:59.999 It is actually possible to perfectly fix this.
99:59:59.999 --> 99:59:59.999 You don't have to correct your guess, you can actually correctly compute 99:59:59.999 --> 99:59:59.999 the exact response time in any load generator on earth, 99:59:59.999 --> 99:59:59.999 if you just do it right. 99:59:59.999 --> 99:59:59.999 All the other tools, 99:59:59.999 --> 99:59:59.999 JMeter, Grinder, WRK, 99:59:59.999 --> 99:59:59.999 the commercial tools that I won't mention, 99:59:59.999 --> 99:59:59.999 they all do this wrong, unfortunately.
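As a hedged illustration of what "doing it right" could look like in a constant-rate load generator, the sketch below times each request from the moment it was supposed to be sent, not from the moment it was actually sent. This is not Gatling's implementation or any tool's real code; `run_constant_rate`, `FakeClock`, and `flaky_operation` are hypothetical names, and a fake clock stands in for real time so the example is deterministic.

```python
import time

def run_constant_rate(operation, rate_per_sec, count,
                      clock=time.perf_counter, sleep=time.sleep):
    """Issue `count` calls at a fixed intended rate; return (service, response) times."""
    interval = 1.0 / rate_per_sec
    start = clock()
    service_times, response_times = [], []
    for i in range(count):
        intended = start + i * interval      # when this request *should* go out
        now = clock()
        if now < intended:
            sleep(intended - now)            # on schedule: wait for the slot
        sent = clock()                       # behind schedule: send immediately
        operation()
        done = clock()
        service_times.append(done - sent)        # what a naive tester reports
        response_times.append(done - intended)   # what the caller experienced
    return service_times, response_times

class FakeClock:
    """Deterministic stand-in for time.perf_counter / time.sleep."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

clock = FakeClock()
calls = {"n": 0}

def flaky_operation():
    calls["n"] += 1
    # Every call takes 1 ms, except the sixth, which stalls for a full second.
    clock.sleep(1.0 if calls["n"] == 6 else 0.001)

service, response = run_constant_rate(flaky_operation, rate_per_sec=100, count=200,
                                      clock=clock, sleep=clock.sleep)
slow_service = sum(1 for s in service if s > 0.5)
slow_response = sum(1 for r in response if r > 0.5)
print("requests slower than 0.5 s by service time :", slow_service)
print("requests slower than 0.5 s by response time:", slow_response)
```

Measured by service time, the one-second stall touches a single request; measured from the intended send time, dozens of requests that were due during and just after the stall show up as slow, which is what their callers would have experienced.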