WEBVTT 99:59:59.999 --> 99:59:59.999 Hi everyone, I'm Gil Tene. 99:59:59.999 --> 99:59:59.999 I'm going to be talking about this subject that I call "How NOT to Measure Latency". 99:59:59.999 --> 99:59:59.999 It's a subject that I've been talking about for 3 years or so now. 99:59:59.999 --> 99:59:59.999 I keep the title and change all the slides every time. 99:59:59.999 --> 99:59:59.999 A bunch of this stuff is new. 99:59:59.999 --> 99:59:59.999 So if you've seen any of my previous "How NOT to" talks, you'll see only some things that are common. 99:59:59.999 --> 99:59:59.999 A nickname for the subject is this... 99:59:59.999 --> 99:59:59.999 Because I often will get that reaction from some people in the audience. 99:59:59.999 --> 99:59:59.999 Ever since I've told people that it's a nickname, 99:59:59.999 --> 99:59:59.999 they feel free to actually exclaim, "Oh S@%#!". 99:59:59.999 --> 99:59:59.999 And feel free to do that here in this talk. 99:59:59.999 --> 99:59:59.999 I'll prompt you in a couple of places where it is natural. 99:59:59.999 --> 99:59:59.999 But if you just have the urge, go ahead. 99:59:59.999 --> 99:59:59.999 So just a tiny bit about me. 99:59:59.999 --> 99:59:59.999 I am the co-founder of Azul Systems. 99:59:59.999 --> 99:59:59.999 I play around with garbage collection a lot. 99:59:59.999 --> 99:59:59.999 Here is some evidence of me playing around with garbage collection in my kitchen. 99:59:59.999 --> 99:59:59.999 That's a trash compactor. 99:59:59.999 --> 99:59:59.999 The compaction function wasn't working right, so I had to fix it. 99:59:59.999 --> 99:59:59.999 I thought it'd be funny to take a picture with a book. 99:59:59.999 --> 99:59:59.999 I've also built a lot of things. 99:59:59.999 --> 99:59:59.999 I've been playing with computers since the early 80's. 99:59:59.999 --> 99:59:59.999 I've built hardware. 99:59:59.999 --> 99:59:59.999 I've helped design chips. 99:59:59.999 --> 99:59:59.999 I've built software at many different levels. 
99:59:59.999 --> 99:59:59.999 Operating systems, drivers... JVMs, obviously. 99:59:59.999 --> 99:59:59.999 And lots of big systems at the system level. 99:59:59.999 --> 99:59:59.999 Built our own app server in the late 90's because WebLogic wasn't around yet. 99:59:59.999 --> 99:59:59.999 So, I've made a lot of mistakes, and I've learned from a few of them. 99:59:59.999 --> 99:59:59.999 This is actually a combination of a bunch of those mistakes looking at latency. 99:59:59.999 --> 99:59:59.999 I do have this hobby of depressing people by pulling the wool up from over your eyes, 99:59:59.999 --> 99:59:59.999 and this is what this talk is about. 99:59:59.999 --> 99:59:59.999 So, I need to give you a choice right here. 99:59:59.999 --> 99:59:59.999 There's the door. 99:59:59.999 --> 99:59:59.999 You can take the blue pill, and you can leave. 99:59:59.999 --> 99:59:59.999 Tomorrow you can keep believing whatever it is you want to believe. 99:59:59.999 --> 99:59:59.999 But if you stay here and take the red pill, I will show you a glimpse of how 99:59:59.999 --> 99:59:59.999 far down the rabbit hole goes, and it will never be the same again. 99:59:59.999 --> 99:59:59.999 Let's talk about latency. 99:59:59.999 --> 99:59:59.999 And when I say latency, I'm talking about latency, response time, any of those things 99:59:59.999 --> 99:59:59.999 where you measure time from 'here to here', and you're interested in how long it took. 99:59:59.999 --> 99:59:59.999 We do this all the time, but I see a lot of mish-mash in how people 99:59:59.999 --> 99:59:59.999 treat the data, or think about it. 99:59:59.999 --> 99:59:59.999 Latency is basically the time it took something to happen once. 99:59:59.999 --> 99:59:59.999 That one time, how long did it take. 99:59:59.999 --> 99:59:59.999 And when we measure stuff, like we did a million operations in the last hour, 99:59:59.999 --> 99:59:59.999 we have a million latencies. Not one, we have a million of them. 
99:59:59.999 --> 99:59:59.999 Our actual goal is to figure out how to describe that million. 99:59:59.999 --> 99:59:59.999 How did the million behave? 99:59:59.999 --> 99:59:59.999 For example, 'they're all really good, and they're all exactly the same', would be a 99:59:59.999 --> 99:59:59.999 behavior that you will never see, but that would be a great behavior. 99:59:59.999 --> 99:59:59.999 So we need to talk about how things behave, communicate, think, evaluate, 99:59:59.999 --> 99:59:59.999 set requirements for, talk to other people, but these are all common things around that. 99:59:59.999 --> 99:59:59.999 To do that, we have to describe the distribution, the set, the behavior, 99:59:59.999 --> 99:59:59.999 but not the one. 99:59:59.999 --> 99:59:59.999 For example, the behavior that says "the common case was x" is a piece of 99:59:59.999 --> 99:59:59.999 information about the behavior, but it's a tiny sliver. 99:59:59.999 --> 99:59:59.999 Usually the least relevant one. 99:59:59.999 --> 99:59:59.999 Well, there are some less relevant ones, but not a strongly relevant one, 99:59:59.999 --> 99:59:59.999 and one that people often focus on. 99:59:59.999 --> 99:59:59.999 To take a look at what we actually do with this stuff, almost on a daily basis, 99:59:59.999 --> 99:59:59.999 this is a snapshot from a monitoring system. 99:59:59.999 --> 99:59:59.999 A small dashboard on a big screen in a monitoring system. 99:59:59.999 --> 99:59:59.999 Where you're watching the response time of a system over time. 99:59:59.999 --> 99:59:59.999 This is a two hour window. 99:59:59.999 --> 99:59:59.999 These lines are the 95th, 90th, 75th, 50th, and 25th percentiles, 99:59:59.999 --> 99:59:59.999 and you can look at how they behave over time. 99:59:59.999 --> 99:59:59.999 We're a small audience here, if you look at this picture, what draws your eye? 99:59:59.999 --> 99:59:59.999 What do you want to go investigate here or pay attention to? 
99:59:59.999 --> 99:59:59.999 It's the big red spike there, right? 99:59:59.999 --> 99:59:59.999 So we could look at the red spike, 'cause it's different, 99:59:59.999 --> 99:59:59.999 and say, "Whoa, the 95th percentile shot up here. And look, the 90th percentile 99:59:59.999 --> 99:59:59.999 shot up at about the same time. 99:59:59.999 --> 99:59:59.999 The rest of them didn't shoot up, so maybe something happened here 99:59:59.999 --> 99:59:59.999 that affected that much, I should probably pay attention to it 99:59:59.999 --> 99:59:59.999 because it's a monitoring system, and I like things to be calm." 99:59:59.999 --> 99:59:59.999 You could go investigate the why. 99:59:59.999 --> 99:59:59.999 At this point, I've managed to waste about 90 seconds of your life, 99:59:59.999 --> 99:59:59.999 looking at a completely meaningless chart, which unfortunately you do 99:59:59.999 --> 99:59:59.999 every day, all the time. 99:59:59.999 --> 99:59:59.999 This chart is the chart you want to show somebody if you want to 99:59:59.999 --> 99:59:59.999 hide the truth from them. 99:59:59.999 --> 99:59:59.999 If you want to pull the wool over their eyes. 99:59:59.999 --> 99:59:59.999 This is the chart of the good stuff. 99:59:59.999 --> 99:59:59.999 What's not on this chart? 99:59:59.999 --> 99:59:59.999 The 5% worst things that happened during these two hours. 99:59:59.999 --> 99:59:59.999 They're not here. 99:59:59.999 --> 99:59:59.999 This is only the good things that happened during those two hours. 99:59:59.999 --> 99:59:59.999 And to get this spike, that 5% had to be so bad that it even pulled 99:59:59.999 --> 99:59:59.999 the 95th percentile up. 99:59:59.999 --> 99:59:59.999 There is zero information here at all about the bad things that happened during these two hours, 99:59:59.999 --> 99:59:59.999 which makes it a bad fit for a monitoring system. 99:59:59.999 --> 99:59:59.999 It's a really good thing for a marketing system. 
99:59:59.999 --> 99:59:59.999 It's a great way to get the bonus from your boss, even though you didn't do the work. 99:59:59.999 --> 99:59:59.999 If you want to learn how to do that, we can do another talk about that. 99:59:59.999 --> 99:59:59.999 But this is not a good way to look at latency. 99:59:59.999 --> 99:59:59.999 It's the opposite of good. 99:59:59.999 --> 99:59:59.999 Unfortunately, this is one of the most common tools used for 99:59:59.999 --> 99:59:59.999 server monitoring on earth right now. 99:59:59.999 --> 99:59:59.999 That's where the snapshot is from, and this is what people look at. 99:59:59.999 --> 99:59:59.999 I find this chart to be a goldmine of information. 99:59:59.999 --> 99:59:59.999 When I first showed it in another talk like this, I had this really cool experience. 99:59:59.999 --> 99:59:59.999 Somebody came up to me and said, "Hey, as I was sitting here, I was texting one 99:59:59.999 --> 99:59:59.999 of our guys, and he was saying, 99:59:59.999 --> 99:59:59.999 'look, we have this issue with our 95th percentile'." 99:59:59.999 --> 99:59:59.999 And I got this chart from him! 99:59:59.999 --> 99:59:59.999 So I went and said, "Hey, what does the rest of the spectrum look like?" 99:59:59.999 --> 99:59:59.999 This is the actual chart they got. 99:59:59.999 --> 99:59:59.999 And when they looked at the rest of the spectrum, it looked like that. 99:59:59.999 --> 99:59:59.999 That's what was hiding. 99:59:59.999 --> 99:59:59.999 Notice the scales are a little different. 99:59:59.999 --> 99:59:59.999 That yellow line is that yellow line. 99:59:59.999 --> 99:59:59.999 So that's a much more representative number. 99:59:59.999 --> 99:59:59.999 Is it? Is that good enough? 99:59:59.999 --> 99:59:59.999 That's the 99th percentile. 99:59:59.999 --> 99:59:59.999 We still have another 1% of really bad stuff that's hiding above the blue line. 99:59:59.999 --> 99:59:59.999 I wonder how big that is? 
99:59:59.999 --> 99:59:59.999 I don't know because he didn't have the data. 99:59:59.999 --> 99:59:59.999 So a common problem that we have is that we only plot what's convenient. 99:59:59.999 --> 99:59:59.999 We only plot what gives us nice, colorful graphs. 99:59:59.999 --> 99:59:59.999 And often, when we have to choose between the stuff that hides the rest of the data, 99:59:59.999 --> 99:59:59.999 and the stuff that is noise, we choose the noise to display. 99:59:59.999 --> 99:59:59.999 I like to rant about latency. 99:59:59.999 --> 99:59:59.999 This is from a blog that I don't write enough in, but the format for it was simple. 99:59:59.999 --> 99:59:59.999 I tweet a single tweet about latency, latency tip of the day, 99:59:59.999 --> 99:59:59.999 and then I rant about my own tweet. 99:59:59.999 --> 99:59:59.999 As an example, this chart is a goldmine of information because it has so many 99:59:59.999 --> 99:59:59.999 different things that are wrong in it, but we won't get into all of them. 99:59:59.999 --> 99:59:59.999 You can read it online. 99:59:59.999 --> 99:59:59.999 Anyway, this is one to take away from what we just said. 99:59:59.999 --> 99:59:59.999 If you are not measuring and showing the maximum value, what is it you are hiding? 99:59:59.999 --> 99:59:59.999 And from whom? 99:59:59.999 --> 99:59:59.999 If your job is to hide the truth from others, this is a good way to do it. 99:59:59.999 --> 99:59:59.999 But if you are actually interested in what's going on, the number one indicator 99:59:59.999 --> 99:59:59.999 you should never get rid of is the maximum value. 99:59:59.999 --> 99:59:59.999 That is not noise, that is the signal. 99:59:59.999 --> 99:59:59.999 The rest of it is noise. 99:59:59.999 --> 99:59:59.999 Okay, let's look at this chart for some more cool stuff. 99:59:59.999 --> 99:59:59.999 I'm gonna zoom in to a small part of the chart, and ask you what that means. 
99:59:59.999 --> 99:59:59.999 What does the average of the 95th percentile over 2 hours mean? 99:59:59.999 --> 99:59:59.999 What is the math that does that? 99:59:59.999 --> 99:59:59.999 What does it do? 99:59:59.999 --> 99:59:59.999 Let's look at that, and I'll give you an example with another percentile. 99:59:59.999 --> 99:59:59.999 The 100th percentile. The max, right? 99:59:59.999 --> 99:59:59.999 Let's take a data set. 99:59:59.999 --> 99:59:59.999 Suppose this was the maximum every minute for 15 minutes. 99:59:59.999 --> 99:59:59.999 What does it mean to say that the average max over the last 15 minutes was 42? 99:59:59.999 --> 99:59:59.999 I specifically chose the data to make that happen. 99:59:59.999 --> 99:59:59.999 It's a meaningless statement. 99:59:59.999 --> 99:59:59.999 It's a completely meaningless statement. 99:59:59.999 --> 99:59:59.999 But when you see 95th percentile, average 184, you think that the 95th 99:59:59.999 --> 99:59:59.999 percentile for the last two hours was around 184. 99:59:59.999 --> 99:59:59.999 It makes you think that. 99:59:59.999 --> 99:59:59.999 Putting this on a piece of paper is not just noise and irrelevant, 99:59:59.999 --> 99:59:59.999 it's a way to mislead people. 99:59:59.999 --> 99:59:59.999 It's a way to mislead yourself, because you'll start to believe your own mistruths. 99:59:59.999 --> 99:59:59.999 This is true for any percentile. 99:59:59.999 --> 99:59:59.999 There is no percentile that you could do this math on. 99:59:59.999 --> 99:59:59.999 Another tip, you cannot average percentiles. 99:59:59.999 --> 99:59:59.999 That math doesn't happen. 99:59:59.999 --> 99:59:59.999 But percentiles do matter. You really want to know about them. 99:59:59.999 --> 99:59:59.999 And a common misperception is that we want to look at the main part of the spectrum, 99:59:59.999 --> 99:59:59.999 not those outliers and perfection stuff. 
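The point that percentiles cannot be averaged can be checked on made-up data. The sketch below is only an illustration: the window sizes and latency values are invented, and `percentile` is a plain nearest-rank helper, not the method of any particular monitoring tool.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]

# Hypothetical data: nine quiet windows and one busy, slow window.
quiet_windows = [list(range(1, 101)) for _ in range(9)]   # 100 samples each: 1..100 ms
busy_window = [1000 + i for i in range(10_000)]           # 10,000 samples: 1000..10999 ms
windows = quiet_windows + [busy_window]

# What a dashboard does when it averages per-window percentiles.
per_window_p95 = [percentile(w, 95) for w in windows]
average_of_p95 = sum(per_window_p95) / len(per_window_p95)

# What the 95th percentile of the whole period actually is.
true_p95 = percentile([v for w in windows for v in w], 95)

print(f"average of per-window 95th percentiles: {average_of_p95} ms")
print(f"95th percentile of all samples:         {true_p95} ms")
```

With these invented numbers the averaged figure lands roughly a factor of nine below the real 95th percentile, because the average gives a window with 100 samples the same weight as a window with 10,000.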
99:59:59.999 --> 99:59:59.999 Only people that actually bet their house every day, or the bank on it, 99:59:59.999 --> 99:59:59.999 need to know about the five 9's, and all those. 99:59:59.999 --> 99:59:59.999 The 99th percentile is a pretty good number. 99:59:59.999 --> 99:59:59.999 Is 99% really rare? 99:59:59.999 --> 99:59:59.999 Let's look at some stuff, because we can ask questions like, "If I were looking 99:59:59.999 --> 99:59:59.999 at a webpage, what is the chance of me hitting the 99th percentile?" 99:59:59.999 --> 99:59:59.999 Of things like this: a search engine node, or a key value store, 99:59:59.999 --> 99:59:59.999 or a database, or a CDN, right? 99:59:59.999 --> 99:59:59.999 Because they will report their 99th percentile. 99:59:59.999 --> 99:59:59.999 They won't tell you anything above that, but how many of the 99:59:59.999 --> 99:59:59.999 webpages that we go to actually experience this? 99:59:59.999 --> 99:59:59.999 You want to say 1%, right? 99:59:59.999 --> 99:59:59.999 Well, I went to some webpages and I counted how many HTTP requests were generated 99:59:59.999 --> 99:59:59.999 by one click into that webpage, and here are the numbers. 99:59:59.999 --> 99:59:59.999 I did that about a year ago. 99:59:59.999 --> 99:59:59.999 They've probably gone up since then. 99:59:59.999 --> 99:59:59.999 Now that translates into this math. 99:59:59.999 --> 99:59:59.999 This is the likelihood of one click seeing the 99th percentile. 99:59:59.999 --> 99:59:59.999 And the only page where that is less than 50% is the clean Google search page. 99:59:59.999 --> 99:59:59.999 Where only a quarter will see the 99th percentile. 99:59:59.999 --> 99:59:59.999 The 99th percentile is the thing that most of your webpages will see. 99:59:59.999 --> 99:59:59.999 Most of them will be there. 99:59:59.999 --> 99:59:59.999 Now, we could look at other things. 99:59:59.999 --> 99:59:59.999 We can pick which things to focus on. 
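The per-click likelihood can be sketched as follows, assuming each request's latency is independent of the others (a simplification). The request counts in the loop are hypothetical stand-ins, not the page measurements shown in the talk, and both function names are made up for this example.

```python
import math

def chance_of_hitting(percentile, requests):
    """Chance that at least one of `requests` independent requests is slower than `percentile`."""
    return 1 - (percentile / 100) ** requests

def requests_for_even_odds(percentile):
    """Smallest request count at which that chance reaches 50%."""
    return math.ceil(math.log(0.5) / math.log(percentile / 100))

# Hypothetical request counts per page load (assumed values for illustration).
for n in (30, 100, 200):
    print(f"{n:>3} requests per click: {chance_of_hitting(99, n):.1%} chance of seeing the 99th percentile")

print("requests needed for even odds:", requests_for_even_odds(99))
```

Under that independence assumption, a page needs only a few dozen requests per click before seeing the 99th percentile becomes more likely than not.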
99:59:59.999 --> 99:59:59.999 Let's say I had to pick between the 95th percentile, and the three 9's (99.9%). 99:59:59.999 --> 99:59:59.999 The three 9's is way into perfection mode for most people, or so they think. 99:59:59.999 --> 99:59:59.999 Which one of those represents our community better? 99:59:59.999 --> 99:59:59.999 Our population? 99:59:59.999 --> 99:59:59.999 Our users? 99:59:59.999 --> 99:59:59.999 Our experience? 99:59:59.999 --> 99:59:59.999 Let's run a hypothetical. 99:59:59.999 --> 99:59:59.999 Suppose we don't have that many pages, and that many resources like we said before. 99:59:59.999 --> 99:59:59.999 We'll be much more conservative. 99:59:59.999 --> 99:59:59.999 A user session will only go through five clicks, and each click will only bring up 99:59:59.999 --> 99:59:59.999 up to 40 things. 99:59:59.999 --> 99:59:59.999 A lot less, and they're all as clean as the Google page. 99:59:59.999 --> 99:59:59.999 How many of the users will not experience something worse than the 95th percentile? 99:59:59.999 --> 99:59:59.999 Because that's what the 95th percentile is good for, the people who see that. 99:59:59.999 --> 99:59:59.999 Anybody above that, is that. 99:59:59.999 --> 99:59:59.999 What are the chances of not seeing it? 99:59:59.999 --> 99:59:59.999 That's an interesting number. 99:59:59.999 --> 99:59:59.999 So you're watching a number that is relevant to 0.003% of your users. 99:59:59.999 --> 99:59:59.999 99.997% of your users are going to see worse than this number. 99:59:59.999 --> 99:59:59.999 Why are you looking at it? 99:59:59.999 --> 99:59:59.999 Why are you spending time thinking about it? 99:59:59.999 --> 99:59:59.999 In reverse, we could say how many people are going to see something 99:59:59.999 --> 99:59:59.999 worse than the three 9's (99.9%)? 99:59:59.999 --> 99:59:59.999 That's going to be 18%. 99:59:59.999 --> 99:59:59.999 In reverse, 82% of the people will see the three 9's (99.9%) or better. 
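The session arithmetic above can be reproduced in a few lines. This is a sketch that assumes all 200 requests in a session (5 clicks of 40 requests) are independent; the function name is made up for this example.

```python
CLICKS_PER_SESSION = 5
REQUESTS_PER_CLICK = 40
requests = CLICKS_PER_SESSION * REQUESTS_PER_CLICK   # 200 requests per session

def share_never_worse_than(percentile, requests):
    """Share of sessions in which every request is at or below `percentile`."""
    return (percentile / 100) ** requests

never_worse_than_p95 = share_never_worse_than(95, requests)
worse_than_p999 = 1 - share_never_worse_than(99.9, requests)

print(f"sessions that never see worse than the 95th percentile: {never_worse_than_p95:.4%}")
print(f"sessions that see something worse than the 99.9th:      {worse_than_p999:.1%}")
```

The first figure comes out at a few thousandths of a percent and the second at about 18%, in line with the numbers quoted in the talk.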
99:59:59.999 --> 99:59:59.999 That's a slightly better representation. 99:59:59.999 --> 99:59:59.999 Probably not good enough either. 99:59:59.999 --> 99:59:59.999 We could look at some more math with them, same kind of scenario. 99:59:59.999 --> 99:59:59.999 What percentile of http response time will be the thing that 95% 99:59:59.999 --> 99:59:59.999 of people experience in this scenario? 99:59:59.999 --> 99:59:59.999 It's the 99.97 percentile that 95% of people see. 99:59:59.999 --> 99:59:59.999 And if you want to know what 99% of the people see, 99:59:59.999 --> 99:59:59.999 that's four and a half 9's (99.995%). 99:59:59.999 --> 99:59:59.999 You want to know that number from Akamai if you want to predict what 1% 99:59:59.999 --> 99:59:59.999 of your users are going to experience. 99:59:59.999 --> 99:59:59.999 When you know the 99th percentile, you kind of know a tiny bit. 99:59:59.999 --> 99:59:59.999 So here's another tip. 99:59:59.999 --> 99:59:59.999 And this is not an exaggeration, by the way. 99:59:59.999 --> 99:59:59.999 The median, which is a much smaller percentile, has that minuscule a chance 99:59:59.999 --> 99:59:59.999 of ever being the number that anybody experiences. 99:59:59.999 --> 99:59:59.999 This is the chance of getting worse than the median. 99:59:59.999 --> 99:59:59.999 Which makes the median an irrelevant number to look at. 99:59:59.999 --> 99:59:59.999 Unfortunately, it's probably the most common one looked at. 99:59:59.999 --> 99:59:59.999 When people say "the typical", they look at the thing that 99:59:59.999 --> 99:59:59.999 everything will be worse than. 99:59:59.999 --> 99:59:59.999 Okay, I'm sorry about that part. 99:59:59.999 --> 99:59:59.999 We'll do some other parts. 99:59:59.999 --> 99:59:59.999 Now, why is it that when we look at these monitoring systems, we don't see 99:59:59.999 --> 99:59:59.999 data with a lot of 9's? 99:59:59.999 --> 99:59:59.999 Why do we stop at the 90, 95, 99th percentile? 
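The same scenario can be inverted to ask which per-request percentile a given share of users stays under. The sketch below assumes the 200-request session from the scenario above and independent requests; `percentile_needed` is a made-up name.

```python
REQUESTS_PER_SESSION = 200   # 5 clicks x 40 requests, as in the scenario above

def percentile_needed(share_of_users, requests):
    """Per-request percentile that `share_of_users` of sessions never exceed."""
    return 100 * share_of_users ** (1 / requests)

p_for_95_percent = percentile_needed(0.95, REQUESTS_PER_SESSION)
p_for_99_percent = percentile_needed(0.99, REQUESTS_PER_SESSION)

# Share of sessions in which every single request is at or below the median.
median_only_share = 0.5 ** REQUESTS_PER_SESSION

print(f"95% of users stay within the {p_for_95_percent:.3f}th percentile")
print(f"99% of users stay within the {p_for_99_percent:.3f}th percentile")
print(f"share of sessions that never see worse than the median: {median_only_share:.1e}")
```

The last figure is on the order of 10^-61, which is why the median describes essentially nobody's session in this scenario.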
99:59:59.999 --> 99:59:59.999 Why don't we look further? 99:59:59.999 --> 99:59:59.999 Now, some of it is because people think, "Well that's perfection, I don't need it." 99:59:59.999 --> 99:59:59.999 The other part is that it's hard. 99:59:59.999 --> 99:59:59.999 It's hard because you can't average percentiles. 99:59:59.999 --> 99:59:59.999 We already talked about that. 99:59:59.999 --> 99:59:59.999 But you also can't derive your five 9's (99.999%) out of a lot 99:59:59.999 --> 99:59:59.999 of 10 second samples of percentiles. 99:59:59.999 --> 99:59:59.999 And the reason for that is, "Hey, in 10 seconds, maybe I only had 1,000 things." 99:59:59.999 --> 99:59:59.999 I could take all the 10 seconds in the world, there's no way to say what the 99:59:59.999 --> 99:59:59.999 hour's five 9's (99.999%) were, what the minute's five 9's were 99:59:59.999 --> 99:59:59.999 if I'm collecting just this data. 99:59:59.999 --> 99:59:59.999 And unfortunately, the data being collected and reported to the back ends of monitoring 99:59:59.999 --> 99:59:59.999 is usually summarized at a second, 5 seconds, 10 seconds, etc. 99:59:59.999 --> 99:59:59.999 Basically throwing away all the good data, and leaving you with absolutely no way 99:59:59.999 --> 99:59:59.999 to compute large 9's for longer periods of time. 99:59:59.999 --> 99:59:59.999 So, this is where you might want to look at HDR Histogram. 99:59:59.999 --> 99:59:59.999 It's an open source thing I created a few years ago. 99:59:59.999 --> 99:59:59.999 I did it in Java, and now there are C, C#, Python, Erlang, 99:59:59.999 --> 99:59:59.999 and Go ports of this that I didn't create. 99:59:59.999 --> 99:59:59.999 And it lets you actually get an entire percentile spectrum. 99:59:59.999 --> 99:59:59.999 Some of you here I know are already using it. 99:59:59.999 --> 99:59:59.999 And you can look at all the percentiles. 
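Why 10-second percentile summaries cannot be rolled up into an hourly five 9's, while per-window histograms can, is easy to show with a toy. The sketch below uses a plain `collections.Counter` as a stand-in histogram; it is not the HDR Histogram API, and the two hypothetical hours of data are invented so that their per-window summaries are identical.

```python
from collections import Counter
from fractions import Fraction
import math

FIVE_NINES = "99.999"   # kept as a string so Fraction avoids float rounding

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile from a {latency_ms: count} histogram."""
    total = sum(counts.values())
    rank = max(1, math.ceil(Fraction(pct) * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

def hour_of_windows(slow_samples):
    """360 ten-second windows of 1,000 samples; one window has a few 2,000 ms samples."""
    windows = [Counter({1: 1000}) for _ in range(359)]
    windows.append(Counter({1: 1000 - slow_samples, 2000: slow_samples}))
    return windows

results = {}
for slow in (5, 3):   # two hypothetical hours that differ only in how many samples were slow
    windows = hour_of_windows(slow)
    summaries = [percentile_from_counts(w, FIVE_NINES) for w in windows]
    merged = sum(windows, Counter())          # histograms add; percentiles do not
    results[slow] = (summaries, percentile_from_counts(merged, FIVE_NINES))

print("per-window summaries identical:", results[5][0] == results[3][0])
print("hourly 99.999th, 5 slow samples:", results[5][1], "ms")
print("hourly 99.999th, 3 slow samples:", results[3][1], "ms")
```

Both hours report exactly the same list of per-window values, yet their hourly five 9's differ, so no function of the summaries alone can recover the right answer. Adding the counts together keeps the information needed to compute it.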
99:59:59.999 --> 99:59:59.999 Any number of 9's that's in the data, if you just keep it right and report it right, 99:59:59.999 --> 99:59:59.999 it's got a log format, you can store things forever. 99:59:59.999 --> 99:59:59.999 Well, for a long time. 99:59:59.999 --> 99:59:59.999 Okay, so it lets you have nice things. 99:59:59.999 --> 99:59:59.999 Enough for that advertisement. 99:59:59.999 --> 99:59:59.999 Now, latency... Well, I think this is slightly out of order. 99:59:59.999 --> 99:59:59.999 Yeah, sorry. 99:59:59.999 --> 99:59:59.999 This is the red/blue pill part, so I warn you, this is your last chance. 99:59:59.999 --> 99:59:59.999 There's a problem I call the coordinated omission problem. 99:59:59.999 --> 99:59:59.999 The coordinated omission problem is basically a conspiracy. 99:59:59.999 --> 99:59:59.999 It's a conspiracy that we're all part of. 99:59:59.999 --> 99:59:59.999 I don't think anybody actually meant to do it, but once I've noticed it, 99:59:59.999 --> 99:59:59.999 everywhere I look, there it is. 99:59:59.999 --> 99:59:59.999 Now, I've been using a specific way of showing you numbers so far. 99:59:59.999 --> 99:59:59.999 Has anybody here noticed how I spell percentile? 99:59:59.999 --> 99:59:59.999 (Audience Member): "You put lie at the end of the percent sign." 99:59:59.999 --> 99:59:59.999 Yeah, good. 99:59:59.999 --> 99:59:59.999 So the coordinated omission problem is the "lie" in %lies. 99:59:59.999 --> 99:59:59.999 And this is how it works. 99:59:59.999 --> 99:59:59.999 One common way to do this is to use a load generator. 99:59:59.999 --> 99:59:59.999 Pretty much all load generators have this problem. 99:59:59.999 --> 99:59:59.999 There are two that I know of that don't. 99:59:59.999 --> 99:59:59.999 What you do with a load generator is you test. 99:59:59.999 --> 99:59:59.999 You issue requests, or send packets. 99:59:59.999 --> 99:59:59.999 And you measure how long something took. 
99:59:59.999 --> 99:59:59.999 And as long as the numbers go right, measure them, put them in a bucket, 99:59:59.999 --> 99:59:59.999 study them later, and get your percentiles from it. 99:59:59.999 --> 99:59:59.999 But what if the thing that you are measuring took longer than the time 99:59:59.999 --> 99:59:59.999 it would've taken until you send the next thing? 99:59:59.999 --> 99:59:59.999 You're supposed to send something every second, 99:59:59.999 --> 99:59:59.999 but this one took a second and a half. 99:59:59.999 --> 99:59:59.999 Well you've got to wait before you send the next one. 99:59:59.999 --> 99:59:59.999 You just avoided measuring something when the system was problematic. 99:59:59.999 --> 99:59:59.999 You've coordinated with it. 99:59:59.999 --> 99:59:59.999 You weren't looking at it then. 99:59:59.999 --> 99:59:59.999 That's common scenario A: You've backed off, and avoided measuring when it was bad. 99:59:59.999 --> 99:59:59.999 Another way is you measure inside your code. 99:59:59.999 --> 99:59:59.999 We all do this. We all have to do this, 99:59:59.999 --> 99:59:59.999 where we measure time, do something, then measure time. 99:59:59.999 --> 99:59:59.999 The delta between them is how long it took. 99:59:59.999 --> 99:59:59.999 We can then put it in a stats bucket, and then do the percentiles in that. 99:59:59.999 --> 99:59:59.999 Unfortunately, if the system freezes right here, for any reason, 99:59:59.999 --> 99:59:59.999 an interrupt, a context switch, 99:59:59.999 --> 99:59:59.999 a cache buffer flushed to disk, 99:59:59.999 --> 99:59:59.999 a garbage collection, 99:59:59.999 --> 99:59:59.999 a re-indexing of your database, this is a database. 99:59:59.999 --> 99:59:59.999 This is Cassandra by the way, measuring itself. 99:59:59.999 --> 99:59:59.999 In any of the above, then you will have one bad report 99:59:59.999 --> 99:59:59.999 while 10,000 things are waiting in line. 
99:59:59.999 --> 99:59:59.999 And when they come in, they will look really, really good. 99:59:59.999 --> 99:59:59.999 Even though each one of them has had a really bad experience. 99:59:59.999 --> 99:59:59.999 It can even get worse, where maybe the freeze happened outside the timing, 99:59:59.999 --> 99:59:59.999 and you won't even know there was a freeze. 99:59:59.999 --> 99:59:59.999 Now these are examples of omitting data that is bad on a very selective basis. 99:59:59.999 --> 99:59:59.999 It's not random sampling. 99:59:59.999 --> 99:59:59.999 It's, "I don't like bad data", 99:59:59.999 --> 99:59:59.999 or "I couldn't handle it", 99:59:59.999 --> 99:59:59.999 or "I don't know about it", 99:59:59.999 --> 99:59:59.999 so we'll just talk about the good. 99:59:59.999 --> 99:59:59.999 What does that do to your data? 99:59:59.999 --> 99:59:59.999 Because it often makes people feel like, 99:59:59.999 --> 99:59:59.999 "Okay, yeah, I understand, but it's a little bit of noise." 99:59:59.999 --> 99:59:59.999 Let's run some hypotheticals, and I'll show you some real numbers. 99:59:59.999 --> 99:59:59.999 Imagine a perfect system. 99:59:59.999 --> 99:59:59.999 It's doing 100 requests a second, at exactly a millisecond each. 99:59:59.999 --> 99:59:59.999 But we go and freeze the system for 100 seconds after every 100 seconds 99:59:59.999 --> 99:59:59.999 of perfect operation, and then repeat. 99:59:59.999 --> 99:59:59.999 Now, I'm going to describe how the system behaves in terms that should mean something, 99:59:59.999 --> 99:59:59.999 and then we'll measure it. 99:59:59.999 --> 99:59:59.999 If we actually wanted to describe the system, 99:59:59.999 --> 99:59:59.999 on the left we have an average of one millisecond by definition, 99:59:59.999 --> 99:59:59.999 and on the right we have an average of 50 seconds. 99:59:59.999 --> 99:59:59.999 Why 50? 
Because if I randomly came in in that 100 seconds, 99:59:59.999 --> 99:59:59.999 I'll get anything from 0 to 100 with even distribution. 99:59:59.999 --> 99:59:59.999 The overall average over 200 seconds is 25 seconds. 99:59:59.999 --> 99:59:59.999 If I just came in here and said, "Surprise, how long did this take?" 99:59:59.999 --> 99:59:59.999 On average, it will be 25. 99:59:59.999 --> 99:59:59.999 I can also do the percentiles. 99:59:59.999 --> 99:59:59.999 50th percentile will be really good, and then it'll get really bad. 99:59:59.999 --> 99:59:59.999 The four 9's is terrible. 99:59:59.999 --> 99:59:59.999 This is a fair, honest description of this system if this is what it did. 99:59:59.999 --> 99:59:59.999 And you can make the system do that. 99:59:59.999 --> 99:59:59.999 That's what Control Z is good for. 99:59:59.999 --> 99:59:59.999 You can make any of your systems do that. 99:59:59.999 --> 99:59:59.999 Now let's go measure this system with a load generator, 99:59:59.999 --> 99:59:59.999 or with a monitoring system. 99:59:59.999 --> 99:59:59.999 The common ones. 99:59:59.999 --> 99:59:59.999 The ones everybody does. 99:59:59.999 --> 99:59:59.999 On the left, we're going to get 10,000 results of one millisecond each. 99:59:59.999 --> 99:59:59.999 Great. 99:59:59.999 --> 99:59:59.999 And we're going to get one result of 100 seconds. 99:59:59.999 --> 99:59:59.999 Wow, really big response time. 99:59:59.999 --> 99:59:59.999 This is our data. 99:59:59.999 --> 99:59:59.999 This is OUR data. 99:59:59.999 --> 99:59:59.999 So now you go do math with it. 99:59:59.999 --> 99:59:59.999 The average of that is 10.9 milliseconds. 99:59:59.999 --> 99:59:59.999 A little less than 25 seconds. 99:59:59.999 --> 99:59:59.999 And here are the percentiles. 99:59:59.999 --> 99:59:59.999 Your load generator monitoring system will tell you that this system is perfect. 99:59:59.999 --> 99:59:59.999 You could go to production with it. 
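The two descriptions of this hypothetical system can be rebuilt side by side. The sketch below makes one simplifying assumption that is not in the talk: a request arriving during the stall completes 1 ms after the stall ends, ignoring the time needed to drain the queue. All names are made up for this example.

```python
from fractions import Fraction
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct is a string such as "99.99" to avoid float rounding."""
    ordered = sorted(values)
    rank = max(1, math.ceil(Fraction(pct) * len(ordered) / 100))
    return ordered[rank - 1]

INTERVAL_MS = 10        # 100 requests per second
SERVICE_MS = 1          # each request takes exactly 1 ms when the system is running
FREEZE_MS = 100_000     # 100-second stall

# What every intended request would experience over one 200-second cycle.
good_phase = [SERVICE_MS] * 10_000
# Assumption: a request arriving `offset` ms into the stall waits out the rest of it,
# then is served in 1 ms.
stalled_phase = [FREEZE_MS - offset + SERVICE_MS for offset in range(0, FREEZE_MS, INTERVAL_MS)]
honest = good_phase + stalled_phase

# What a coordinating load generator records: it sends nothing while it waits.
measured = [SERVICE_MS] * 10_000 + [FREEZE_MS]

for name, data in (("honest", honest), ("measured", measured)):
    mean = sum(data) / len(data)
    print(f"{name:>8}: n={len(data)} mean={mean:.1f} ms "
          f"50%={percentile(data, '50')} ms 99.99%={percentile(data, '99.99')} ms "
          f"max={max(data)} ms")
```

The honest mean comes out at about 25 seconds, while the measured mean is just under 11 ms and the measured 99.99th percentile is still 1 ms, which is the gap the talk describes.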
99:59:59.999 --> 99:59:59.999 You like what you see. 99:59:59.999 --> 99:59:59.999 Look at that, four 9's. 99:59:59.999 --> 99:59:59.999 It is lying to you. 99:59:59.999 --> 99:59:59.999 To your face. 99:59:59.999 --> 99:59:59.999 And you can catch it doing that with a Control-Z test. 99:59:59.999 --> 99:59:59.999 But people tend to not want to do that, because then what are they going to do? 99:59:59.999 --> 99:59:59.999 If you just do that test, and calibrate your system, and you find it 99:59:59.999 --> 99:59:59.999 telling you that, about this, the next step should be to throw all the numbers away. 99:59:59.999 --> 99:59:59.999 Don't believe anything else it says. 99:59:59.999 --> 99:59:59.999 If it lies this big, what else did it do? 99:59:59.999 --> 99:59:59.999 Don't waste your time on numbers from uncalibrated systems. 99:59:59.999 --> 99:59:59.999 Now the problem here was, that if you want to measure the system, 99:59:59.999 --> 99:59:59.999 you have to measure at random rates, or same rates. 99:59:59.999 --> 99:59:59.999 If you measure 10,000 things in 100 seconds, there should be another 10,000 things here. 99:59:59.999 --> 99:59:59.999 If you measure them, you would've gotten all the right numbers. 99:59:59.999 --> 99:59:59.999 Coordinated omission is the simple act of erasing all that bad stuff. 99:59:59.999 --> 99:59:59.999 The conspiracy here is that we all do it without meaning to. 99:59:59.999 --> 99:59:59.999 I don't know who put that in our systems, but it happens to all of us. 99:59:59.999 --> 99:59:59.999 Now, I often get people saying, "Okay, I get it. All the numbers are wrong, 99:59:59.999 --> 99:59:59.999 but at least for my job where I tune performance, and I try to make things 99:59:59.999 --> 99:59:59.999 faster, I can use the numbers to figure out if I'm going in the right direction." 99:59:59.999 --> 99:59:59.999 Is it better, or is it worse? Let me dispel that for you for a second. 
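The remark that "there should be another 10,000 things here" suggests one simple correction: whenever a recorded latency exceeds the intended sending interval, fill in the samples the tester should have sent in the meantime. The sketch below is one assumed way to do that, with a made-up `backfill` helper; it is not presented as the fix used in any particular tool.

```python
def backfill(samples_ms, expected_interval_ms):
    """Add the samples a stalled, coordinating tester never sent.

    A request that took `value` ms implies requests that should have been issued
    one interval later, two intervals later, and so on, each waiting that much less.
    """
    corrected = []
    for value in samples_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected

# The coordinated measurement from the hypothetical: 10,000 x 1 ms plus one 100 s result.
measured = [1] * 10_000 + [100_000]
corrected = backfill(measured, expected_interval_ms=10)

print("samples before and after:", len(measured), len(corrected))
print("mean before:", round(sum(measured) / len(measured), 1), "ms")
print("mean after: ", sum(corrected) / len(corrected), "ms")
```

After the back-fill the data set has 20,000 samples and a mean of about 25 seconds, matching the honest description of the system rather than the flattering one.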
99:59:59.999 --> 99:59:59.999 Suppose I went and took this system, and improved it dramatically. 99:59:59.999 --> 99:59:59.999 Rather than freezing for 100 seconds, it will now answer every question. 99:59:59.999 --> 99:59:59.999 It'll take a little longer, 5 milliseconds instead of one, 99:59:59.999 --> 99:59:59.999 but it's much better than freezing, right? 99:59:59.999 --> 99:59:59.999 So let's measure that system that we spent weeks and weeks improving, 99:59:59.999 --> 99:59:59.999 and see if it's better. 99:59:59.999 --> 99:59:59.999 That's the data. 99:59:59.999 --> 99:59:59.999 If we do the percentiles, it'll tell us that we just really hurt the four 9's. 99:59:59.999 --> 99:59:59.999 We made it go 5 times worse than before. 99:59:59.999 --> 99:59:59.999 We should revert this change, go back to that much better system we had before. 99:59:59.999 --> 99:59:59.999 So this is just to make sure that you don't think that you can have 99:59:59.999 --> 99:59:59.999 any intuition based on any of these numbers. 99:59:59.999 --> 99:59:59.999 They go backwards sometimes. 99:59:59.999 --> 99:59:59.999 You don't know which way is good or bad. 99:59:59.999 --> 99:59:59.999 And you'll never know which way is good or bad with a system that lies like that. 99:59:59.999 --> 99:59:59.999 The other cool technique is what I call "Cheating Twice". 99:59:59.999 --> 99:59:59.999 You have a constant load generator, and it needs to do 100 per second. 99:59:59.999 --> 99:59:59.999 When it woke up after 200 seconds, it says, 99:59:59.999 --> 99:59:59.999 "Whoa, we're 9,999 behind. We've got to issue those requests." 99:59:59.999 --> 99:59:59.999 So it issues those requests. 99:59:59.999 --> 99:59:59.999 At this point, not only did it get rid of all the bad requests, 99:59:59.999 --> 99:59:59.999 it replaced every one of them with a perfect request. 
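Both effects, the "improved" system looking worse and the catch-up burst looking perfect, fall out of the recorded data directly. The sketch below replays the three data sets as a coordinating tester would record them; the variable names are made up, and the catch-up case assumes each late request completes in 1 ms.

```python
from fractions import Fraction
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct is a string such as "99.99" to avoid float rounding."""
    ordered = sorted(values)
    rank = max(1, math.ceil(Fraction(pct) * len(ordered) / 100))
    return ordered[rank - 1]

# Original system: 100 s of 1 ms answers, then one request stuck for the 100 s freeze.
frozen_measured = [1] * 10_000 + [100_000]
# "Improved" system: never freezes, but answers in 5 ms during the second 100 s.
improved_measured = [1] * 10_000 + [5] * 10_000
# "Cheating twice": after the freeze the tester fires the 9,999 requests it skipped.
cheating_twice = [1] * 10_000 + [100_000] + [1] * 9_999

for name, data in (("frozen", frozen_measured),
                   ("improved", improved_measured),
                   ("cheating twice", cheating_twice)):
    print(f"{name:>14}: 99.99%={percentile(data, '99.99')} ms "
          f"99.995%={percentile(data, '99.995')} ms max={max(data)} ms")
```

By the recorded four 9's, the system that freezes for 100 seconds scores 1 ms and the one that never freezes scores 5 ms, so the comparison points the wrong way. In the catch-up case even the 99.995th percentile reads 1 ms, and only the maximum still shows the stall.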
99:59:59.999 --> 99:59:59.999 Coining the four 9's (99.99%), all the way to four and a half 9's (99.995%), 99:59:59.999 --> 99:59:59.999 it's twice as wrong as dropping them. 99:59:59.999 --> 99:59:59.999 So these are all cool things that happen to you. 99:59:59.999 --> 99:59:59.999 I'm not going to spend much time on how to fix those and avoid those. 99:59:59.999 --> 99:59:59.999 There's a lot of other material that you can find with me 99:59:59.999 --> 99:59:59.999 talking about that, in longer talks. 99:59:59.999 --> 99:59:59.999 But this is pretty bad. 99:59:59.999 --> 99:59:59.999 And like I said... 99:59:59.999 --> 99:59:59.999 That should've been up there before. 99:59:59.999 --> 99:59:59.999 How did this repeat itself? 99:59:59.999 --> 99:59:59.999 Did I create a loop in the presentation somehow? 99:59:59.999 --> 99:59:59.999 I don't know how to do that. 99:59:59.999 --> 99:59:59.999 Let's see if I can get through here. 99:59:59.999 --> 99:59:59.999 Hopefully editing later will take it out. 99:59:59.999 --> 99:59:59.999 So we have the cheats twice. 99:59:59.999 --> 99:59:59.999 There, okay. 99:59:59.999 --> 99:59:59.999 So, after we look at coordinated omission that way, 99:59:59.999 --> 99:59:59.999 we should also look at response time, and service time. 99:59:59.999 --> 99:59:59.999 Coordinated omission, what it really is achieving for you, unfortunately, 99:59:59.999 --> 99:59:59.999 is that it makes something that you think is response time, and only shows you 99:59:59.999 --> 99:59:59.999 the service time component of latency. 99:59:59.999 --> 99:59:59.999 This is a simple depiction of what service time and response times are. 99:59:59.999 --> 99:59:59.999 This guy is taking a certain amount of time to take payment 99:59:59.999 --> 99:59:59.999 or make a cup of coffee. 99:59:59.999 --> 99:59:59.999 That's service time. 99:59:59.999 --> 99:59:59.999 How long does it take to do the work? 
99:59:59.999 --> 99:59:59.999 This person has experienced the response time, 99:59:59.999 --> 99:59:59.999 which includes the amount of time they have to wait before they 99:59:59.999 --> 99:59:59.999 get to the person that does the work. 99:59:59.999 --> 99:59:59.999 And the difference between those two is immense. 99:59:59.999 --> 99:59:59.999 The coordinated omission problem makes something that you think is 99:59:59.999 --> 99:59:59.999 response time, only measure the service time, 99:59:59.999 --> 99:59:59.999 and basically hide the fact that things stalled, waited in line, 99:59:59.999 --> 99:59:59.999 that this guy might've taken a lunch break, 99:59:59.999 --> 99:59:59.999 and now we have a line around the building three times. 99:59:59.999 --> 99:59:59.999 Service time stays the same. 99:59:59.999 --> 99:59:59.999 This is the backwards part... 99:59:59.999 --> 99:59:59.999 Now, let's look at what it actually looks like. 99:59:59.999 --> 99:59:59.999 In a load generator that I fixed, I measured both 99:59:59.999 --> 99:59:59.999 response time and service time, 99:59:59.999 --> 99:59:59.999 this happens to be Cassandra, 99:59:59.999 --> 99:59:59.999 at a very low load. 99:59:59.999 --> 99:59:59.999 And you can see that they're very very similar, at a very low load. 99:59:59.999 --> 99:59:59.999 Why? Because there's nobody in line. 99:59:59.999 --> 99:59:59.999 This thing is really fast. 99:59:59.999 --> 99:59:59.999 We're not asking for too much. 99:59:59.999 --> 99:59:59.999 Cassandra's pretty fast, so they're the same. 99:59:59.999 --> 99:59:59.999 But if I increase the load, we start seeing gaps. 99:59:59.999 --> 99:59:59.999 If I increase the load a little more, the gap grows. 99:59:59.999 --> 99:59:59.999 If I increase the load a little more, the gap grows. 99:59:59.999 --> 99:59:59.999 Now this is not the failure point yet.
99:59:59.999 --> 99:59:59.999 If I actually increase it all the way past the point where the system 99:59:59.999 --> 99:59:59.999 can't even do the work I want, service time stays the same, 99:59:59.999 --> 99:59:59.999 response time goes through the roof. 99:59:59.999 --> 99:59:59.999 This was when it was 100 and something milliseconds, now it's 7 and a half seconds. 99:59:59.999 --> 99:59:59.999 Why 7 and a half seconds? 99:59:59.999 --> 99:59:59.999 Cause you're waiting in line that long to go around the block. 99:59:59.999 --> 99:59:59.999 The guy just can't serve as many people as are showing up in line, you fall behind. 99:59:59.999 --> 99:59:59.999 This is a virtual world reaction to this. 99:59:59.999 --> 99:59:59.999 I really like this slide, it's where I came up with the notion of a blue/red pill. 99:59:59.999 --> 99:59:59.999 When you actually measure reality, people tend to have this reaction when 99:59:59.999 --> 99:59:59.999 they compare the two. 99:59:59.999 --> 99:59:59.999 And if we actually look at these on the two sides of a collapse point of a system, 99:59:59.999 --> 99:59:59.999 this specific system can only do 87,000 things a second. 99:59:59.999 --> 99:59:59.999 No matter how hard you press it, that's all it'll do. 99:59:59.999 --> 99:59:59.999 The service time on the two sides of the collapse looks virtually identical, 99:59:59.999 --> 99:59:59.999 which it would. 99:59:59.999 --> 99:59:59.999 But if you compare the response time, you have a very different picture. 99:59:59.999 --> 99:59:59.999 And I'm showing this picture so you get a feeling for what to look at 99:59:59.999 --> 99:59:59.999 on whether or not you're measuring the right one.
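[Editor's sketch] The difference between service time and response time on the two sides of a collapse point can be illustrated with a toy single-server queue in Python. This is a sketch under simplifying assumptions (one FIFO server, a fixed 1 ms service time, requests intended at a constant interval) and does not model the Cassandra measurements mentioned in the talk; the function name `simulate` and all numbers are made up for illustration.

```python
def simulate(arrival_interval_ms, service_ms, n_requests):
    """Return (service_times, response_times) in ms for one FIFO server.

    Service time is measured from when the server starts the work.
    Response time is measured from when the request was intended to start,
    so it includes the time spent waiting in line.
    """
    server_free_at = 0.0
    service_times, response_times = [], []
    for i in range(n_requests):
        intended_start = i * arrival_interval_ms
        actual_start = max(intended_start, server_free_at)
        finish = actual_start + service_ms
        server_free_at = finish
        service_times.append(finish - actual_start)
        response_times.append(finish - intended_start)
    return service_times, response_times

SERVICE_MS = 1.0   # the server can do at most 1,000 requests per second
N = 10_000

# Below capacity: one request every 2 ms (500 per second).
svc_low, resp_low = simulate(2.0, SERVICE_MS, N)

# Past capacity: one request every 0.5 ms (2,000 per second).
svc_over, resp_over = simulate(0.5, SERVICE_MS, N)

print("low load   max service:", max(svc_low), "ms, max response:", max(resp_low), "ms")
print("overloaded max service:", max(svc_over), "ms, max response:", max(resp_over), "ms")
```

In this toy model the service time is identical on both sides of the collapse point, while the response time past it keeps growing for as long as requests keep arriving faster than they can be served.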