9:59:59.000,9:59:59.000 Hi everyone, I'm Gil Tene. 9:59:59.000,9:59:59.000 I'm going to be talking about this subject[br]that I call "How NOT to Measure Latency". 9:59:59.000,9:59:59.000 It's a subject that I've been talking[br]about now for 3 years or so. 9:59:59.000,9:59:59.000 I keep the title and change all[br]the slides every time. 9:59:59.000,9:59:59.000 A bunch of this stuff is new. 9:59:59.000,9:59:59.000 So if you've seen any of my previous "How NOT to",[br]you'll see only some things that are common. 9:59:59.000,9:59:59.000 A nickname for the subject is this... 9:59:59.000,9:59:59.000 Because I often will get that reaction[br]from some people in the audience. 9:59:59.000,9:59:59.000 Ever since I've told people that it's a[br]nickname, 9:59:59.000,9:59:59.000 they feel free to actually exclaim,[br]"Oh S@%#!". 9:59:59.000,9:59:59.000 And feel free to do that here in this talk. 9:59:59.000,9:59:59.000 I'll prompt you in a couple of places[br]where it is natural. 9:59:59.000,9:59:59.000 But if you just have the urge, go ahead. 9:59:59.000,9:59:59.000 So just a tiny bit about me. 9:59:59.000,9:59:59.000 I am the co-founder of Azul Systems. 9:59:59.000,9:59:59.000 I play around with garbage collection a lot. 9:59:59.000,9:59:59.000 Here is some evidence of me playing around[br]with garbage collection in my kitchen. 9:59:59.000,9:59:59.000 That's a trash compactor. 9:59:59.000,9:59:59.000 The compaction function wasn't working right,[br]so I had to fix it. 9:59:59.000,9:59:59.000 I thought it'd be funny to take a picture[br]with a book. 9:59:59.000,9:59:59.000 I've also built a lot of things. 9:59:59.000,9:59:59.000 I've been playing with computers since[br]the early 80's. 9:59:59.000,9:59:59.000 I've built hardware. 9:59:59.000,9:59:59.000 I've helped design chips. 9:59:59.000,9:59:59.000 I've built software at many[br]different levels. 9:59:59.000,9:59:59.000 Operating systems, drivers...[br]JVMs obviously. 
9:59:59.000,9:59:59.000 And lots of big systems at the system level. 9:59:59.000,9:59:59.000 Built our own app server in the late 90's[br]because WebLogic wasn't around yet. 9:59:59.000,9:59:59.000 So, I've made a lot of mistakes,[br]and I've learned from a few of them. 9:59:59.000,9:59:59.000 This is actually a combination of a bunch[br]of those mistakes looking at latency. 9:59:59.000,9:59:59.000 I do have this hobby of depressing people[br]by pulling the wool up from over your eyes, 9:59:59.000,9:59:59.000 and this is what this talk is about. 9:59:59.000,9:59:59.000 So, I need to give you a choice right here. 9:59:59.000,9:59:59.000 There's the door. 9:59:59.000,9:59:59.000 You can take the blue pill,[br]and you can leave. 9:59:59.000,9:59:59.000 Tomorrow you can keep believing whatever[br]it is you want to believe. 9:59:59.000,9:59:59.000 But if you stay here and take the red pill,[br]I will show you a glimpse of how 9:59:59.000,9:59:59.000 far down the rabbit hole goes,[br]and it will never be the same again. 9:59:59.000,9:59:59.000 Let's talk about latency. 9:59:59.000,9:59:59.000 And when I say latency, I'm talking about[br]latency, response time, any of those things 9:59:59.000,9:59:59.000 where you measure time from 'here to here',[br]and you're interested in how long it took. 9:59:59.000,9:59:59.000 We do this all the time, but I see a lot[br]of mish-mash in how people 9:59:59.000,9:59:59.000 treat the data, or think about it. 9:59:59.000,9:59:59.000 Latency is basically the time it took[br]something to happen once. 9:59:59.000,9:59:59.000 That one time, how long did it take. 9:59:59.000,9:59:59.000 And when we measure stuff, like we did[br]a million operations in the last hour, 9:59:59.000,9:59:59.000 we have a million latencies. Not one,[br]we have a million of them. 9:59:59.000,9:59:59.000 Our actual goal is to figure out how to[br]describe that million. 9:59:59.000,9:59:59.000 How did the million behave? 
9:59:59.000,9:59:59.000 For example, 'they're all really good, and[br]they're all exactly the same', would be a 9:59:59.000,9:59:59.000 behavior that you will never see,[br]but that would be a great behavior. 9:59:59.000,9:59:59.000 So we need to talk about how things behave,[br]communicate, think, evaluate, 9:59:59.000,9:59:59.000 set requirements for, talk to other people,[br]but these are all common things around that. 9:59:59.000,9:59:59.000 To do that, we have to describe the[br]distribution, the set, the behavior, 9:59:59.000,9:59:59.000 but not the one. 9:59:59.000,9:59:59.000 For example, the behavior that says "the[br]common case was x" is a piece of 9:59:59.000,9:59:59.000 information about the behavior,[br]but it's a tiny sliver. 9:59:59.000,9:59:59.000 Usually the least relevant one. 9:59:59.000,9:59:59.000 Well, there's some less relevant ones,[br]but not a strongly relevant one, 9:59:59.000,9:59:59.000 and one that people often focus on. 9:59:59.000,9:59:59.000 To take a look at what we actually do[br]with this stuff, almost on a daily basis, 9:59:59.000,9:59:59.000 this is a snapshot from a monitoring system. 9:59:59.000,9:59:59.000 A small dashboard on a big screen[br]in a monitoring system. 9:59:59.000,9:59:59.000 Where you're watching the response time of[br]a system over time. 9:59:59.000,9:59:59.000 This is a two hour window. 9:59:59.000,9:59:59.000 These lines are the 95th,[br]90th, 75th, 50th, and 25th percentiles, 9:59:59.000,9:59:59.000 and you can look at how they behave over time. 9:59:59.000,9:59:59.000 We're a small audience here, if you look at[br]this picture, what draws your eye? 9:59:59.000,9:59:59.000 What do you want to go investigate here[br]or pay attention to? 9:59:59.000,9:59:59.000 It's the big red spike there, right? 9:59:59.000,9:59:59.000 So we could look at the red spike,[br]'cause it's different, 9:59:59.000,9:59:59.000 and say, "Woah, the 95th percentile shot up[br]here. 
And look, the 90th percentile 9:59:59.000,9:59:59.000 shot up at about the same time. 9:59:59.000,9:59:59.000 The rest of them didn't shoot up, [br]so maybe something happened here 9:59:59.000,9:59:59.000 that affected that much, I should probably[br]pay attention to it 9:59:59.000,9:59:59.000 because it's a monitoring system, and [br]I like things to be calm." 9:59:59.000,9:59:59.000 You could go investigate the why. 9:59:59.000,9:59:59.000 At this point, I've managed to waste [br]about 90 seconds of your life, 9:59:59.000,9:59:59.000 looking at a completely meaningless chart,[br]which unfortunately you do 9:59:59.000,9:59:59.000 every day, all the time. 9:59:59.000,9:59:59.000 This chart is the chart you want to show [br]somebody if you want to 9:59:59.000,9:59:59.000 hide the truth from them. 9:59:59.000,9:59:59.000 If you want to pull the wool [br]over their eyes. 9:59:59.000,9:59:59.000 This is the chart of the good stuff. 9:59:59.000,9:59:59.000 What's not on this chart? 9:59:59.000,9:59:59.000 The 5% worse things that happened during[br]this two hours. 9:59:59.000,9:59:59.000 They're not here. 9:59:59.000,9:59:59.000 This is only the good things that happened[br]during the things. 9:59:59.000,9:59:59.000 And to get this spike, that 5% had to be[br]so bad that it even pulled 9:59:59.000,9:59:59.000 the 95th percentile all up. 9:59:59.000,9:59:59.000 There is zero information here at all about[br]what happened bad during this two hours, 9:59:59.000,9:59:59.000 which makes it a bad fit for [br]a monitoring system. 9:59:59.000,9:59:59.000 It's a really good thing for [br]a marketing system. 9:59:59.000,9:59:59.000 It's a great way to get the bonus from your boss, even though you didn't do the work. 9:59:59.000,9:59:59.000 If you want to learn how to do that, [br]we can do another talk about that. 9:59:59.000,9:59:59.000 But this is not a good way to look at latency. 9:59:59.000,9:59:59.000 It's the opposite of good. 
9:59:59.000,9:59:59.000 Unfortunately, this is one of the most[br]common tools used for 9:59:59.000,9:59:59.000 server monitoring on earth right now. 9:59:59.000,9:59:59.000 That's where the snapshot is from,[br]and this is what people look at. 9:59:59.000,9:59:59.000 I find this chart to be a goldmine[br]of information. 9:59:59.000,9:59:59.000 When I first showed it in another talk [br]like this, I had this really cool experience. 9:59:59.000,9:59:59.000 Somebody came up to me and said, "Hey, [br]as I was sitting here, I was texting one 9:59:59.000,9:59:59.000 of our guys, and he was saying, 9:59:59.000,9:59:59.000 'look, we have this issue with [br]our 95th percentile'." 9:59:59.000,9:59:59.000 And I got this chart from him! 9:59:59.000,9:59:59.000 So I went and said, "Hey, what does the [br]rest of the spectrum look like?" 9:59:59.000,9:59:59.000 This is the actual chart they got. 9:59:59.000,9:59:59.000 And when they look at the rest of the[br]spectrum, it looked like that. 9:59:59.000,9:59:59.000 That's what was hiding. 9:59:59.000,9:59:59.000 I noticed the scales are a little different. 9:59:59.000,9:59:59.000 That yellow line is that yellow line. 9:59:59.000,9:59:59.000 So that's a much more representative number. 9:59:59.000,9:59:59.000 Is it? Is that good enough? 9:59:59.000,9:59:59.000 That's the 99th percentile. 9:59:59.000,9:59:59.000 We still have another 1% of really bad [br]stuff that's hiding above the blue line. 9:59:59.000,9:59:59.000 I wonder how big that is? 9:59:59.000,9:59:59.000 I don't know because he didn't have the data. 9:59:59.000,9:59:59.000 So a common problem that we have is that[br]we only plot what's convenient. 9:59:59.000,9:59:59.000 We only plot what gives us nice,[br]colorful graphs. 9:59:59.000,9:59:59.000 And often, when we have to choose between[br]the stuff that hides the rest of the data, 9:59:59.000,9:59:59.000 and the stuff that is noise, we choose [br]the noise to display. 
9:59:59.000,9:59:59.000 I like to rant about latency. 9:59:59.000,9:59:59.000 This is from a blog that I don't write[br]enough in, but the format for it was simple. 9:59:59.000,9:59:59.000 I tweet a single tweet about latency,[br]latency tip of the day, 9:59:59.000,9:59:59.000 and then I rant about my own tweet. 9:59:59.000,9:59:59.000 As an example, this chart is a goldmine[br]of information because it has so many 9:59:59.000,9:59:59.000 different things that are wrong in it,[br]but we won't get into all of them. 9:59:59.000,9:59:59.000 You can read it online. 9:59:59.000,9:59:59.000 Anyway, this is one to take away from[br]what we just said. 9:59:59.000,9:59:59.000 If you are not measuring and showing the[br]maximum value, what is it you are hiding? 9:59:59.000,9:59:59.000 And from whom? 9:59:59.000,9:59:59.000 If your job is to hide the truth from[br]others, this is a good way to do it. 9:59:59.000,9:59:59.000 But if you actually are interested in what's[br]going on, the number one indicator 9:59:59.000,9:59:59.000 you should never get rid of is the[br]maximum value. 9:59:59.000,9:59:59.000 That is not noise, that is the signal. 9:59:59.000,9:59:59.000 The rest of it is noise. 9:59:59.000,9:59:59.000 Okay, let's look at this chart for some[br]more cool stuff. 9:59:59.000,9:59:59.000 I'm gonna zoom in to a small part[br]of the chart, and ask you what that means. 9:59:59.000,9:59:59.000 What does the average of the 95th percentile[br]over 2 hours mean? 9:59:59.000,9:59:59.000 What is the math that does that? 9:59:59.000,9:59:59.000 What does it do? 9:59:59.000,9:59:59.000 Let's look at that, and I'll give you[br]an example with another percentile. 9:59:59.000,9:59:59.000 The 100th percentile. The max, right? 9:59:59.000,9:59:59.000 Let's take a data set. 9:59:59.000,9:59:59.000 Suppose this was the maximum every minute[br]for 15 minutes. 9:59:59.000,9:59:59.000 What does it mean to say that the average[br]max over the last 15 minutes was 42? 
9:59:59.000,9:59:59.000 I specifically chose the data to[br]make that happen. 9:59:59.000,9:59:59.000 It's a meaningless statement. 9:59:59.000,9:59:59.000 It's a completely meaningless statement. 9:59:59.000,9:59:59.000 But when you see 95th percentile,[br]average 184, you think that the 95th 9:59:59.000,9:59:59.000 percentile for the last two hours[br]was around 184. 9:59:59.000,9:59:59.000 It makes you think that. 9:59:59.000,9:59:59.000 Putting this on a piece of paper is not [br]just noise and irrelevant, 9:59:59.000,9:59:59.000 it's a way to mislead people. 9:59:59.000,9:59:59.000 It's a way to mislead yourself, because [br]you'll start to believe your own mistruths. 9:59:59.000,9:59:59.000 This is true for any percentile. 9:59:59.000,9:59:59.000 There is no percentile that you could do[br]this math on. 9:59:59.000,9:59:59.000 Another tip, you cannot average percentiles. 9:59:59.000,9:59:59.000 That math doesn't happen. 9:59:59.000,9:59:59.000 But percentiles do matter. You really[br]want to know about them. 9:59:59.000,9:59:59.000 And a common misperception is that we want[br]to look at the main part of the spectrum, 9:59:59.000,9:59:59.000 not those outliers and perfection stuff. 9:59:59.000,9:59:59.000 Only people that actually bet their house[br]every day, or the bank on it, 9:59:59.000,9:59:59.000 need to know about the "five-nine's", [br]and all those. 9:59:59.000,9:59:59.000 The 99th percentile is a pretty[br]good number. 9:59:59.000,9:59:59.000 Is 99% really rare? 9:59:59.000,9:59:59.000 Let's look at some stuff, because we can[br]ask questions like, "If I were looking 9:59:59.000,9:59:59.000 at a webpage, what is the chance of me[br]hitting the 99th percentile?" 9:59:59.000,9:59:59.000 Of things like this: a search engine node,[br]or a key value store, 9:59:59.000,9:59:59.000 or a database, or a CDN, right? 9:59:59.000,9:59:59.000 Because they will report their 99th percentile. 
9:59:59.000,9:59:59.000 They won't tell you anything above that,[br]but how many of the 9:59:59.000,9:59:59.000 webpages that we go to [br]actually experience this? [br] 9:59:59.000,9:59:59.000 You want to say 1%, right? 9:59:59.000,9:59:59.000 Well, I went to some webpages and I counted[br]how many "http" requests were generated 9:59:59.000,9:59:59.000 by one click into that webpage,[br]and here are the numbers. 9:59:59.000,9:59:59.000 I ended that about a year ago. 9:59:59.000,9:59:59.000 They've probably gone up since then. 9:59:59.000,9:59:59.000 Now that translates into this math. 9:59:59.000,9:59:59.000 This is the likelihood of one click seeing[br]the 99th percentile. 9:59:59.000,9:59:59.000 And the only page where that is less than[br]50% is the clean google search page. 9:59:59.000,9:59:59.000 Where only a quarter will see the[br]99th percentile. 9:59:59.000,9:59:59.000 The 99th percentile is the thing that most[br]of your webpages will see. 9:59:59.000,9:59:59.000 Most of them will be there. 9:59:59.000,9:59:59.000 Now, we could look at other things. 9:59:59.000,9:59:59.000 We can pick which things to focus on. 9:59:59.000,9:59:59.000 Let's say I had to pick between the 95th[br]percentile, and the three 9's (99.9%). 9:59:59.000,9:59:59.000 The three 9's is way into perfection mode[br]for most people, or they think. 9:59:59.000,9:59:59.000 Which one of those represents our [br]community better? 9:59:59.000,9:59:59.000 Our population? 9:59:59.000,9:59:59.000 Our users? 9:59:59.000,9:59:59.000 Our experience? 9:59:59.000,9:59:59.000 Let's run a hypothetical. 9:59:59.000,9:59:59.000 Suppose we don't have that many pages, [br]and that many resources like we said before.[br] 9:59:59.000,9:59:59.000 We'll be much more conservative. 9:59:59.000,9:59:59.000 A user session will only go through five[br]clicks, and each click will only bring up 9:59:59.000,9:59:59.000 up to 40 things. 9:59:59.000,9:59:59.000 A lot less, and they're all as clean[br]as the google page. 
9:59:59.000,9:59:59.000 How many of the users will not experience[br]something worse than the 95th percentile? 9:59:59.000,9:59:59.000 Because that's what the 95th percentile[br]is good for, the people who see that. 9:59:59.000,9:59:59.000 Anybody above that, is that. 9:59:59.000,9:59:59.000 What are the chances of not seeing it? 9:59:59.000,9:59:59.000 That's an interesting number. 9:59:59.000,9:59:59.000 So you're watching a number that is [br]relevant to 0.003% of your users. 9:59:59.000,9:59:59.000 99.997% of your users are going to [br]see worse than this number. 9:59:59.000,9:59:59.000 Why are you looking at it? 9:59:59.000,9:59:59.000 Why are you spending time[br]thinking about it? 9:59:59.000,9:59:59.000 In reverse, we could say how many people[br]are going to see something 9:59:59.000,9:59:59.000 worse than the three 9's (99.9%)? 9:59:59.000,9:59:59.000 That's going to be 18%. 9:59:59.000,9:59:59.000 In reverse, 82% of the people will see[br]the three 9's (99.9%) or better. 9:59:59.000,9:59:59.000 That's a slightly better representation. 9:59:59.000,9:59:59.000 Probably not good enough either. 9:59:59.000,9:59:59.000 We could look at some more math with them, [br]same kind of scenario. 9:59:59.000,9:59:59.000 What percentile of http response time [br]will be the thing that 95% 9:59:59.000,9:59:59.000 of people experience in this scenario? 9:59:59.000,9:59:59.000 It's the 99.97 percentile that 95% [br]of people see. 9:59:59.000,9:59:59.000 And if you want to know what 99%[br]of the people see, 9:59:59.000,9:59:59.000 that's four and a half 9's (99.995%). 9:59:59.000,9:59:59.000 You want to know that number from Akamai[br]if you want to predict what 1% 9:59:59.000,9:59:59.000 of your users are going to experience. 9:59:59.000,9:59:59.000 When you know the 99th percentile, [br]you kind of know a tiny bit. 9:59:59.000,9:59:59.000 So here's another tip. 9:59:59.000,9:59:59.000 And this is not an exaggeration,[br]by the way. 
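The arithmetic behind the session figures above can be sketched in a few lines. This is only an illustration, and it rests on an assumption the talk does not spell out: every request independently has the same chance of landing above a given percentile. The function names are made up for this sketch.

```python
# Assumption (not from the talk): requests are independent, so the chance that
# all N requests stay at or below a percentile p is p ** N.

REQUESTS_PER_SESSION = 5 * 40  # five clicks, 40 resources per click


def chance_of_seeing_worse(percentile: float, requests: int) -> float:
    """Chance that at least one of `requests` is slower than `percentile` (a fraction)."""
    return 1.0 - percentile ** requests


def percentile_needed(fraction_of_users: float, requests: int) -> float:
    """Per-request percentile that `fraction_of_users` of sessions never exceed."""
    return fraction_of_users ** (1.0 / requests)


# Share of sessions that never see anything worse than the 95th percentile.
never_above_p95 = 0.95 ** REQUESTS_PER_SESSION
# Share of sessions that see something worse than the three 9's.
above_three_nines = chance_of_seeing_worse(0.999, REQUESTS_PER_SESSION)
# Per-request percentiles that 95% and 99% of sessions stay within.
p_for_95_percent = percentile_needed(0.95, REQUESTS_PER_SESSION)
p_for_99_percent = percentile_needed(0.99, REQUESTS_PER_SESSION)

print(f"never worse than p95:   {never_above_p95:.4%}")
print(f"worse than p99.9:       {above_three_nines:.1%}")
print(f"percentile for 95%:     {p_for_95_percent:.4%}")
print(f"percentile for 99%:     {p_for_99_percent:.4%}")
```

If requests are correlated in practice (for example, one slow backend affecting a whole page), the real numbers will differ; the sketch only reproduces the independent-request reasoning.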
9:59:59.000,9:59:59.000 The median, which is a much smaller[br]percentile, has that minuscule a chance 9:59:59.000,9:59:59.000 of ever being the number that [br]anybody experiences. 9:59:59.000,9:59:59.000 This is the chance of getting worse[br]than the median. 9:59:59.000,9:59:59.000 Which makes the median an irrelevant [br]number to look at. 9:59:59.000,9:59:59.000 Unfortunately, it's probably the most [br]common one looked at. 9:59:59.000,9:59:59.000 When people say "the typical",[br]they look at the thing that 9:59:59.000,9:59:59.000 everything will be worse than. 9:59:59.000,9:59:59.000 Okay, I'm sorry about that part. 9:59:59.000,9:59:59.000 We'll do some other parts. 9:59:59.000,9:59:59.000 Now, why is it that when we look at these[br]monitoring systems, we don't see 9:59:59.000,9:59:59.000 data with a lot of 9's? 9:59:59.000,9:59:59.000 Why do we stop at the[br]90, 95, 99th percentile? 9:59:59.000,9:59:59.000 Why don't we look further? 9:59:59.000,9:59:59.000 Now, some of it is because people think, [br]"Well that's perfection, I don't need it." 9:59:59.000,9:59:59.000 The other part is that it's hard. 9:59:59.000,9:59:59.000 It's hard because you can't[br]average percentiles. 9:59:59.000,9:59:59.000 We already talked about that. 9:59:59.000,9:59:59.000 But you also can't derive your [br]five 9's (99.999%) out of a lot 9:59:59.000,9:59:59.000 of 10 second samples of percentiles. 9:59:59.000,9:59:59.000 And the reason for that is, "Hey, in 10 [br]seconds, maybe I only had 1,000 things." 9:59:59.000,9:59:59.000 I could take all the 10 seconds in the [br]world, there's no way to say what the 9:59:59.000,9:59:59.000 hour five 9's (99.999%) were, what the [br]minutes five 9's were 9:59:59.000,9:59:59.000 if I'm collecting just this data. 9:59:59.000,9:59:59.000 And unfortunately, the data being collected[br]and reported to the back ends of monitoring 9:59:59.000,9:59:59.000 is usually summarized at a second,[br]5 seconds, 10 seconds, etc. 
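To make the "you can't average percentiles" point concrete, here is a small sketch with made-up data: six reporting windows of 100 samples each, five of them at 1 ms and one fully stalled at 1000 ms. The data, the window sizes, and the nearest-rank `percentile` helper are all assumptions chosen for illustration, not anything taken from a real monitoring system.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: five quiet windows and one stalled window (values in ms).
windows = [[1.0] * 100 for _ in range(5)] + [[1000.0] * 100]

# What a summarizing agent would report: one 95th percentile per window.
per_window_p95 = [percentile(window, 95) for window in windows]
average_of_p95 = sum(per_window_p95) / len(per_window_p95)

# What you actually wanted: the 95th percentile of everything that happened.
all_samples = [value for window in windows for value in window]
true_p95 = percentile(all_samples, 95)

print("per-window p95:", per_window_p95)
print("average of p95:", average_of_p95)
print("true p95:      ", true_p95)
```

The averaged figure matches neither the quiet windows nor the stalled one, and nothing in the six summaries lets you recover the overall percentile; only the underlying samples do.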
9:59:59.000,9:59:59.000 Basically throwing away all the good data,[br]and leaving you with absolutely no way 9:59:59.000,9:59:59.000 to compute large 9's for longer[br]periods of time. 9:59:59.000,9:59:59.000 So, this is where you might want to look[br]at HDR Histogram. 9:59:59.000,9:59:59.000 It's an open source thing I created[br]a few years ago. 9:59:59.000,9:59:59.000 I did it in Java, and now there are[br]C, C#, Python, Erlang, 9:59:59.000,9:59:59.000 and Go ports of this that I didn't create. 9:59:59.000,9:59:59.000 And it lets you actually get an entire[br]percentile spectrum. 9:59:59.000,9:59:59.000 Some of you here I know are[br]already using it. 9:59:59.000,9:59:59.000 And you can look at all the percentiles. 9:59:59.000,9:59:59.000 Any number of 9's that's in the data, if[br]you just keep it right and report it right, 9:59:59.000,9:59:59.000 it's got a log format, you can[br]store things forever. 9:59:59.000,9:59:59.000 Well, for a long time. 9:59:59.000,9:59:59.000 Okay, so it lets you have nice things. 9:59:59.000,9:59:59.000 Enough of that advertisement. 9:59:59.000,9:59:59.000 Now, latency... Well, I think this is[br]slightly out of order. 9:59:59.000,9:59:59.000 Yeah, sorry. 9:59:59.000,9:59:59.000 This is the red/blue pill part, so I warn[br]you, this is your last chance. 9:59:59.000,9:59:59.000 There's a problem I call the[br]coordinated omission problem. 9:59:59.000,9:59:59.000 The coordinated omission problem is[br]basically a conspiracy. 9:59:59.000,9:59:59.000 It's a conspiracy that we're all part of. 9:59:59.000,9:59:59.000 I don't think anybody actually meant[br]to do it, but once I've noticed it, 9:59:59.000,9:59:59.000 everywhere I look, there it is. 9:59:59.000,9:59:59.000 Now, I've been using a specific way of[br]showing you numbers so far. 9:59:59.000,9:59:59.000 Has anybody here noticed how[br]I spell percentile? 9:59:59.000,9:59:59.000 (Audience Member): "You put lie at the[br]end of the percent sign." 
9:59:59.000,9:59:59.000 Yeah, good. 9:59:59.000,9:59:59.000 So the coordinated omission problem is the[br]"lie" in %lies. 9:59:59.000,9:59:59.000 And this is how it works. 9:59:59.000,9:59:59.000 One common way to do this is[br]to use a load generator. 9:59:59.000,9:59:59.000 Pretty much all load generators[br]have this problem. 9:59:59.000,9:59:59.000 There are two that I know of that don't. 9:59:59.000,9:59:59.000 What you do with a load generator,[br]is you test. 9:59:59.000,9:59:59.000 You issue requests, or send packets. 9:59:59.000,9:59:59.000 And you measure how long something took. 9:59:59.000,9:59:59.000 And as long as the numbers go right, you[br]measure them, put them in a bucket, 9:59:59.000,9:59:59.000 study them later, and get your[br]percentiles from it. 9:59:59.000,9:59:59.000 But what if the thing that you are[br]measuring took longer than the time 9:59:59.000,9:59:59.000 it would've taken until you send[br]the next thing? 9:59:59.000,9:59:59.000 You're supposed to send something[br]every second, 9:59:59.000,9:59:59.000 but this one took a second and a half. 9:59:59.000,9:59:59.000 Well you've got to wait before[br]you send the next one. 9:59:59.000,9:59:59.000 You just avoided measuring something[br]when the system was problematic. 9:59:59.000,9:59:59.000 You've coordinated with it. 9:59:59.000,9:59:59.000 You weren't looking at it then. 9:59:59.000,9:59:59.000 That's common scenario A: You've backed[br]off, and avoided measuring when it was bad. 9:59:59.000,9:59:59.000 Another way, is you measure inside your code. 9:59:59.000,9:59:59.000 We all do this. We all have to do this, 9:59:59.000,9:59:59.000 where we measure time, do something,[br]then measure time. 9:59:59.000,9:59:59.000 The delta between them is how long it took. 9:59:59.000,9:59:59.000 We can then put it in a stats bucket,[br]and then do the percentiles in that. 
9:59:59.000,9:59:59.000 Unfortunately, if the system freezes right[br]here, for any reason, 9:59:59.000,9:59:59.000 an interrupt, a context switch, 9:59:59.000,9:59:59.000 a cache buffer flushed to disk, 9:59:59.000,9:59:59.000 a garbage collection, 9:59:59.000,9:59:59.000 a re-indexing of your database,[br]this is a database. 9:59:59.000,9:59:59.000 This is Cassandra by the way,[br]measuring itself. 9:59:59.000,9:59:59.000 In any of the above, then you will[br]have one bad report 9:59:59.000,9:59:59.000 while 10,000 things are waiting in line. 9:59:59.000,9:59:59.000 And when they come in, they will look[br]really, really good. 9:59:59.000,9:59:59.000 Even though each one of them has had[br]a really bad experience. 9:59:59.000,9:59:59.000 It can even get worse, where maybe the[br]freeze happened outside the timing, 9:59:59.000,9:59:59.000 and you won't even know there was a freeze. 9:59:59.000,9:59:59.000 Now these are examples of omitting data[br]that is bad on a very selective basis. 9:59:59.000,9:59:59.000 It's not random sampling. 9:59:59.000,9:59:59.000 It's, "I don't like bad data", 9:59:59.000,9:59:59.000 or "I couldn't handle it", 9:59:59.000,9:59:59.000 or "I don't know about it", 9:59:59.000,9:59:59.000 so we'll just talk about the good. 9:59:59.000,9:59:59.000 What does that do to your data? 9:59:59.000,9:59:59.000 Because it often makes people feel like, 9:59:59.000,9:59:59.000 "Okay, yeah, I understand,[br]but it's a little bit of noise." 9:59:59.000,9:59:59.000 Let's run some hypotheticals,[br]and I'll show you some real numbers. 9:59:59.000,9:59:59.000 Imagine a perfect system. 9:59:59.000,9:59:59.000 It's doing 100 requests a second,[br]at exactly a millisecond each. 9:59:59.000,9:59:59.000 But we go and freeze the system,[br]after 100 seconds of perfect operations 9:59:59.000,9:59:59.000 for 100 seconds, and then repeat. 
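The hypothetical system just described can be modeled directly. The sketch below is one way to do it, with simplifying assumptions that are mine rather than the talk's: requests keep arriving at a steady 100 per second throughout the 200-second cycle, and every request that arrives during the freeze completes one service time after the freeze ends (any queueing behind the backlog is ignored). The `percentile` helper is a plain nearest-rank implementation.

```python
import math

RATE = 100            # requests per second
SERVICE_TIME = 0.001  # seconds per request when the system is healthy
GOOD = 100            # seconds of perfect operation
FREEZE = 100          # seconds frozen


def response_time(arrival: float) -> float:
    """Response time (seconds) for a request arriving `arrival` seconds into the cycle."""
    if arrival < GOOD:
        return SERVICE_TIME
    # Assumption: the request waits for the freeze to end, then is served at once.
    return (GOOD + FREEZE - arrival) + SERVICE_TIME


def percentile(sorted_values, p):
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# One request every 10 ms for the whole 200-second cycle, frozen or not.
arrivals = [i / RATE for i in range((GOOD + FREEZE) * RATE)]
responses = sorted(response_time(t) for t in arrivals)

average = sum(responses) / len(responses)
p50 = percentile(responses, 50)
p9999 = percentile(responses, 99.99)

print(f"requests: {len(responses)}")
print(f"average:  {average:.3f} s")
print(f"p50:      {p50:.3f} s")
print(f"p99.99:   {p9999:.3f} s")
print(f"max:      {responses[-1]:.3f} s")
```

The key design choice is that arrivals are generated from the intended schedule, not from when the previous request finished, so requests that land in the frozen half are counted like any others.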
9:59:59.000,9:59:59.000 Now, I'm going to describe how the system[br]behaves in terms that should mean something, 9:59:59.000,9:59:59.000 and then we'll measure it. 9:59:59.000,9:59:59.000 If we actually wanted to describe the[br]system, 9:59:59.000,9:59:59.000 on the left we have an average[br]of one millisecond by definition, 9:59:59.000,9:59:59.000 and on the right we have an[br]average of 50 seconds. 9:59:59.000,9:59:59.000 Why 50? Because if I randomly came in[br]in that 100 seconds, 9:59:59.000,9:59:59.000 I'll get anything from 0 to 100[br]with even distribution. 9:59:59.000,9:59:59.000 The overall average over 200 seconds[br]is 25 seconds. 9:59:59.000,9:59:59.000 If I just came in here and said,[br]"Surprise, how long did this take?" 9:59:59.000,9:59:59.000 On average, it will be 25. 9:59:59.000,9:59:59.000 I can also do the percentiles. 9:59:59.000,9:59:59.000 50th percentile will be really good,[br]and then it'll get really bad. 9:59:59.000,9:59:59.000 The four 9's is terrible. 9:59:59.000,9:59:59.000 This is a fair, honest description of[br]this system if this is what it did. 9:59:59.000,9:59:59.000 And you can make the system do that. 9:59:59.000,9:59:59.000 That's what Control-Z is good for. 9:59:59.000,9:59:59.000 You can make any of your systems do that. 9:59:59.000,9:59:59.000 Now let's go measure this system with[br]a load generator, 9:59:59.000,9:59:59.000 or with a monitoring system. 9:59:59.000,9:59:59.000 The common ones. 9:59:59.000,9:59:59.000 The ones everybody does. 9:59:59.000,9:59:59.000 On the left, we're going to get 10,000[br]results of one millisecond each. 9:59:59.000,9:59:59.000 Great. 9:59:59.000,9:59:59.000 And we're going to get one result of[br]100 seconds. 9:59:59.000,9:59:59.000 Wow, really big response time. 9:59:59.000,9:59:59.000 This is our data. 9:59:59.000,9:59:59.000 This is OUR data. 9:59:59.000,9:59:59.000 So now you go do math with it. 9:59:59.000,9:59:59.000 The average of that is 10.9 milliseconds. 
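The data set a coordinating measurer ends up with is easy to reproduce: 10,000 results of one millisecond plus a single 100-second result. The sketch below does the same math on it; the variable names and the nearest-rank `percentile` helper are illustrative assumptions, not taken from any particular load generator.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# What the measurer recorded (seconds): it sent nothing while it was stuck
# waiting on the one request that hit the freeze.
measured = [0.001] * 10_000 + [100.0]

naive_average_ms = sum(measured) / len(measured) * 1000
naive_p50 = percentile(measured, 50)
naive_p9999 = percentile(measured, 99.99)
naive_max = max(measured)

print(f"samples:  {len(measured)}")
print(f"average:  {naive_average_ms:.2f} ms")
print(f"p50:      {naive_p50 * 1000:.0f} ms")
print(f"p99.99:   {naive_p9999 * 1000:.0f} ms")
print(f"max:      {naive_max:.0f} s")
```

With only one bad sample out of 10,001, every percentile up to the four 9's lands on a one-millisecond result, and the average comes out just under 11 ms, in line with the figure quoted; only the maximum still shows that anything went wrong.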
9:59:59.000,9:59:59.000 A little less than 25 seconds. 9:59:59.000,9:59:59.000 And here are the percentiles. 9:59:59.000,9:59:59.000 Your load generator monitoring system[br]will tell you that this system is perfect. 9:59:59.000,9:59:59.000 You could go to production with it. 9:59:59.000,9:59:59.000 You like what you see. 9:59:59.000,9:59:59.000 Look at that, four 9's. 9:59:59.000,9:59:59.000 It is lying to you. 9:59:59.000,9:59:59.000 To your face. 9:59:59.000,9:59:59.000 And you can catch it doing that with a [br]Control Z-Test. 9:59:59.000,9:59:59.000 But people tend to not want to do that,[br]because then what are they going to do? 9:59:59.000,9:59:59.000 If you just do that test, and calibrate [br]your system, and you find it 9:59:59.000,9:59:59.000 telling you that, about this, the next [br]step should be to throw all the numbers away. 9:59:59.000,9:59:59.000 Don't believe anything else it says. 9:59:59.000,9:59:59.000 If it lies this big, what else did it do? 9:59:59.000,9:59:59.000 Don't waste your time on numbers[br]from uncalibrated systems. 9:59:59.000,9:59:59.000 Now the problem here was, that if you[br]want to measure the system, 9:59:59.000,9:59:59.000 you have to measure at random rates, [br]or same rates. 9:59:59.000,9:59:59.000 If you measure 10,000 things in 100 seconds,[br]there should be another 10,000 things here. 9:59:59.000,9:59:59.000 If you measure them, you would've gotten[br]all the right numbers. 9:59:59.000,9:59:59.000 Coordinated omission is the simple act of[br]erasing all that bad stuff. 9:59:59.000,9:59:59.000 The conspiracy here is that we all do it[br]without meaning to. 9:59:59.000,9:59:59.000 I don't know who put that in our systems,[br]but it happens to all of us . 9:59:59.000,9:59:59.000 Now, I often get people saying,[br]"Okay, I get it. 
All the numbers are wrong, 9:59:59.000,9:59:59.000 but at least for my job where I tune[br]performance, and I try to make things 9:59:59.000,9:59:59.000 faster, I can use the numbers to figure[br]out if I'm going in the right direction." 9:59:59.000,9:59:59.000 Is it better, or is it worse? Let me[br]dispel that for you for a second. 9:59:59.000,9:59:59.000 Suppose I went and took this system,[br]and improved it dramatically. 9:59:59.000,9:59:59.000 Rather than freezing for 100 seconds,[br]it will now answer every question. 9:59:59.000,9:59:59.000 It'll take a little longer,[br]5 milliseconds instead of one, 9:59:59.000,9:59:59.000 but it's much better than freezing, right? 9:59:59.000,9:59:59.000 So let's measure that system that we spent[br]weeks and weeks improving, 9:59:59.000,9:59:59.000 and see if it's better. 9:59:59.000,9:59:59.000 That's the data. 9:59:59.000,9:59:59.000 If we do the percentiles, it'll tell us[br]that we just really hurt the four 9's. 9:59:59.000,9:59:59.000 We made it go 5 times worse than before. 9:59:59.000,9:59:59.000 We should revert this change, go back to[br]that much better system we had before. 9:59:59.000,9:59:59.000 So this is just to make sure that you[br]don't think that you can have 9:59:59.000,9:59:59.000 any intuition based on any of these numbers. 9:59:59.000,9:59:59.000 They go backwards sometimes. 9:59:59.000,9:59:59.000 You don't know which way is good or bad. 9:59:59.000,9:59:59.000 And you'll never know which way is good[br]or bad with a system that lies like that. 9:59:59.000,9:59:59.000 The other cool technique is[br]what I call "Cheating Twice". 9:59:59.000,9:59:59.000 You have a constant load generator,[br]and it needs to do 100 per second. 9:59:59.000,9:59:59.000 When it woke up after 200 seconds,[br]it says, 9:59:59.000,9:59:59.000 "Woah, we're 9,999 behind.[br]We've got to issue those requests." 9:59:59.000,9:59:59.000 So it issues those requests. 
9:59:59.000,9:59:59.000 At this point, not only did it get rid of[br]all the bad requests, 9:59:59.000,9:59:59.000 it replaced every one of them with [br]a perfect request. 9:59:59.000,9:59:59.000 Coining the four 9's (99.99%), all the way[br]to four and a half 9's (99.995%), 9:59:59.000,9:59:59.000 it's twice as wrong as dropping them. 9:59:59.000,9:59:59.000 So these are all cool things that[br]happen to you. 9:59:59.000,9:59:59.000 I'm not going to spend much time on how[br]to fix those and avoid those. 9:59:59.000,9:59:59.000 There's a lot of other material that you[br]can find with me 9:59:59.000,9:59:59.000 talking about that, in longer talks. 9:59:59.000,9:59:59.000 But this is pretty bad. 9:59:59.000,9:59:59.000 And like I said... 9:59:59.000,9:59:59.000 That should've been up there before. 9:59:59.000,9:59:59.000 How did this repeat itself? 9:59:59.000,9:59:59.000 Did I create a loop in the[br]presentation somehow? 9:59:59.000,9:59:59.000 I don't know how to do that. 9:59:59.000,9:59:59.000 Let's see if I can get through here. 9:59:59.000,9:59:59.000 Hopefully editing later will take it out. 9:59:59.000,9:59:59.000 So we have the cheats twice. 9:59:59.000,9:59:59.000 There, okay. 9:59:59.000,9:59:59.000 So, after we look at coordinated[br]omission that way, 9:59:59.000,9:59:59.000 we should also look at response time, [br]and service time. 9:59:59.000,9:59:59.000 Coordinated omission, what it really is[br]achieving for you, unfortunately, 9:59:59.000,9:59:59.000 is that it makes something that you think[br]is response time, and only shows you 9:59:59.000,9:59:59.000 the service time component of latency. 9:59:59.000,9:59:59.000 This is a simple depiction of what service[br]time and response times are. 9:59:59.000,9:59:59.000 This guy is taking a certain amount of[br]time to take payment 9:59:59.000,9:59:59.000 or make a cup of coffee. 9:59:59.000,9:59:59.000 That's service time. 9:59:59.000,9:59:59.000 How long does it take to do the work? 
9:59:59.000,9:59:59.000 This person has experienced[br]the response time, 9:59:59.000,9:59:59.000 which includes the amount of time they [br]have to wait before they 9:59:59.000,9:59:59.000 get to the person that does the work. 9:59:59.000,9:59:59.000 And the difference between those[br]two is immense. 9:59:59.000,9:59:59.000 The coordinated omission problem makes[br]something that you think is 9:59:59.000,9:59:59.000 response time, only measure the [br]service time, 9:59:59.000,9:59:59.000 and basically hide the fact that things [br]stalled, waited in line, 9:59:59.000,9:59:59.000 that this guy might've taken a lunch break, 9:59:59.000,9:59:59.000 and now we have a line around the[br]building three times. 9:59:59.000,9:59:59.000 Service time stays the same. 9:59:59.000,9:59:59.000 This is the backwards part... 9:59:59.000,9:59:59.000 Now, let's look at what it[br]actually looks like. 9:59:59.000,9:59:59.000 In a load generator that I fixed,[br]I measured both 9:59:59.000,9:59:59.000 response time and service time, 9:59:59.000,9:59:59.000 this happens to be Cassandra, 9:59:59.000,9:59:59.000 at a very low load. 9:59:59.000,9:59:59.000 And you can see that they're very very [br]similar, at a very low load. 9:59:59.000,9:59:59.000 Why? Because there's nobody in line. 9:59:59.000,9:59:59.000 This thing is really fast. 9:59:59.000,9:59:59.000 We're not asking for too much. 9:59:59.000,9:59:59.000 Cassandra's pretty fast,[br]so they're the same. 9:59:59.000,9:59:59.000 But if I increase the load, we [br]start seeing gaps. 9:59:59.000,9:59:59.000 If I increase the load a little more,[br]the gap grows. 9:59:59.000,9:59:59.000 If I increase the load a little more,[br]the gap grows. 9:59:59.000,9:59:59.000 Now this is not the failure point yet.
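The widening gap between service time and response time described above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the fixed load generator or the Cassandra measurements from the talk: a single FIFO server, requests arriving on a fixed schedule, and an occasional slow request to create a queue. The function name `simulate` and all parameter values are hypothetical.

```python
def simulate(arrival_interval_ms, n=10_000, service_ms=1.0,
             slow_every=100, slow_ms=20.0):
    """Toy single-server FIFO queue with requests on a fixed schedule.

    Every `slow_every`-th request takes `slow_ms` instead of `service_ms`
    (an assumed stand-in for an occasional stall).
    Returns (service_times, response_times) in milliseconds.
    """
    server_free_at = 0.0
    service_times, response_times = [], []
    for i in range(n):
        intended_start = i * arrival_interval_ms   # when the request was due
        work = slow_ms if i % slow_every == slow_every - 1 else service_ms
        actual_start = max(intended_start, server_free_at)  # wait in line
        finish = actual_start + work
        server_free_at = finish
        service_times.append(finish - actual_start)      # work only
        response_times.append(finish - intended_start)   # work + waiting
    return service_times, response_times


def mean(xs):
    return sum(xs) / len(xs)


if __name__ == "__main__":
    # Smaller interval = higher load; all three stay below saturation.
    for interval in (10.0, 2.0, 1.25):
        svc, rsp = simulate(interval)
        print(f"interval={interval:>5} ms  "
              f"mean service={mean(svc):.2f} ms  "
              f"mean response={mean(rsp):.2f} ms")
```

In this model the mean service time is the same at every load, while the mean response time grows as the arrival interval shrinks, because more requests arrive while the server is still busy with a slow one.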
9:59:59.000,9:59:59.000 If I actually increase it all the way past[br]the point where the system 9:59:59.000,9:59:59.000 can't even do the work I want, [br]service time stays the same, 9:59:59.000,9:59:59.000 response time goes through the roof. 9:59:59.000,9:59:59.000 This was when it was 100 and something[br]milliseconds, now it's 7 and a half seconds. 9:59:59.000,9:59:59.000 Why 7 and a half seconds? 9:59:59.000,9:59:59.000 Cause you're waiting in line that long[br]to go around the block. 9:59:59.000,9:59:59.000 The guy just can't serve as many people[br]as are showing up in line, you fall behind. 9:59:59.000,9:59:59.000 This is a virtual world reaction to this. 9:59:59.000,9:59:59.000 I really like this slide, it's where I came[br]up with the notion of a blue/red pill. 9:59:59.000,9:59:59.000 When you actually measure reality, people[br]tend to have this reaction when 9:59:59.000,9:59:59.000 they compare the two. 9:59:59.000,9:59:59.000 And if we actually look at these on the[br]two sides of a collapse point of a system, 9:59:59.000,9:59:59.000 this specific system can only do 87,000 [br]things a second. 9:59:59.000,9:59:59.000 No matter how hard you press it,[br]that's all it'll do. 9:59:59.000,9:59:59.000 The service time on the two sides of [br]the collapse looks virtually identical, 9:59:59.000,9:59:59.000 which it would. 9:59:59.000,9:59:59.000 But if you compare the response time, [br]you have a very different picture. 9:59:59.000,9:59:59.000 And I'm showing this picture so you get [br]a feeling for what to look at 9:59:59.000,9:59:59.000 on whether or not you're measuring[br]the right one. 9:59:59.000,9:59:59.000 Whenever you push, you try and push load[br]beyond what the system can do, 9:59:59.000,9:59:59.000 you are falling behind over time. 9:59:59.000,9:59:59.000 This is a 250 second run, 9:59:59.000,9:59:59.000 where at the end of it[br]you are waiting for 8 seconds in line. 9:59:59.000,9:59:59.000 Why? 
Because for every second [br]that goes by, there are 9:59:59.000,9:59:59.000 3,000 more things that are[br]added to the line. 9:59:59.000,9:59:59.000 The interesting thing that happens when [br]you cross the threshold limit, 9:59:59.000,9:59:59.000 or capability of the system, is that[br]response time grows over time linearly. 9:59:59.000,9:59:59.000 It doesn't happen if you're below. 9:59:59.000,9:59:59.000 Only if you're above. 9:59:59.000,9:59:59.000 It's the point where that happens, and [br]any load generator that doesn't show 9:59:59.000,9:59:59.000 that line when you try pushing harder [br]than you can, is lying to you. 9:59:59.000,9:59:59.000 It's a simple sanity check. 9:59:59.000,9:59:59.000 If your load generator shows you that, [br]it didn't push. 9:59:59.000,9:59:59.000 Or it pushed, but it didn't[br]report correctly, 9:59:59.000,9:59:59.000 whichever it is. 9:59:59.000,9:59:59.000 If we draw that to scale... 9:59:59.000,9:59:59.000 Just to make sure, this was not to scale, [br]this is the scale, I just zoomed in 9:59:59.000,9:59:59.000 so you could see that it was [br]relatively stable. 9:59:59.000,9:59:59.000 So... I don't know what happened to the[br]order of the slides. 9:59:59.000,9:59:59.000 It's like looping and randoming. 9:59:59.000,9:59:59.000 There's some conspiracy going on there. 9:59:59.000,9:59:59.000 Now, latency doesn't live on its own. 9:59:59.000,9:59:59.000 You do need to look at latency in the [br]context of load. 9:59:59.000,9:59:59.000 Cause as I showed you, as you're nearly[br]idle, things are nearly perfect. 9:59:59.000,9:59:59.000 Even these mistakes won't show up. 9:59:59.000,9:59:59.000 But as you start pressing, things start[br]cracking or behaving differently. 9:59:59.000,9:59:59.000 And usually when you want to know how much[br]your system can handle, 9:59:59.000,9:59:59.000 the answer is not 87,000 things a second,[br]because nobody wants the 9:59:59.000,9:59:59.000 response time that comes with that.
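The linear growth past saturation described above can be checked with back-of-the-envelope arithmetic. This is a rough sketch, not the talk's measurement: it assumes an offered load of 90,000 requests per second, which is inferred from the stated capacity of 87,000 per second plus the 3,000 per second that pile up in line. The function name `queue_wait_seconds` is hypothetical.

```python
def queue_wait_seconds(elapsed_s, offered_per_s=90_000, capacity_per_s=87_000):
    """Approximate time a newly arriving request waits in line.

    Assumes a constant offered rate and a fixed capacity; the default
    offered rate (90,000/s) is inferred, not stated in the talk.
    """
    excess_per_s = max(0, offered_per_s - capacity_per_s)
    backlog = excess_per_s * elapsed_s      # requests still waiting in line
    return backlog / capacity_per_s         # time needed to drain that line


if __name__ == "__main__":
    for t in (0, 50, 100, 150, 200, 250):
        print(f"after {t:>3} s: about {queue_wait_seconds(t):.2f} s in line")
```

Under these assumptions the wait after a 250 second run comes out at a little over 8 seconds, in line with the figure quoted for that run, and it is zero at any offered rate at or below capacity.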
9:59:59.000,9:59:59.000 It's how many things can I handle so [br]that I don't get angry phone calls. 9:59:59.000,9:59:59.000 So I do get my bonus, and so my[br]company stays above ground. 9:59:59.000,9:59:59.000 This is not sustainable speed. 9:59:59.000,9:59:59.000 Running this experiment is really[br]interesting with software, 9:59:59.000,9:59:59.000 because it actually doesn't hurt, but[br]spending the next 6 months of your time 9:59:59.000,9:59:59.000 repeating this experiment, trying to[br]change the shape of the bumper 9:59:59.000,9:59:59.000 every time you hit the thing[br]is a waste of your time. 9:59:59.000,9:59:59.000 Your goal when you're trying to figure[br]out sustainable speed throughput, 9:59:59.000,9:59:59.000 whatever it is, is to see how fast you can[br]go without this happening, 9:59:59.000,9:59:59.000 and then to try and engineer[br]to improve that. 9:59:59.000,9:59:59.000 Meaning, can I make it go faster[br]without this happening? 9:59:59.000,9:59:59.000 Measuring what happens after you[br]hit the pole is useless for that exercise. 9:59:59.000,9:59:59.000 The only thing that matters about hitting[br]the pole, is that you hit the pole. 9:59:59.000,9:59:59.000 When you go and study the behavior[br]of latency, at saturation, 9:59:59.000,9:59:59.000 you are doing this. 9:59:59.000,9:59:59.000 You're looking at this and saying, "That[br]bumper, I don't like the shape of that. 9:59:59.000,9:59:59.000 Let's measure it closely and do this 100[br]times to see if we can vary it." 9:59:59.000,9:59:59.000 That's what it means to look at latency [br]at saturation, 9:59:59.000,9:59:59.000 and repeat, and repeat, and change,[br]and tune, and see if you can do it again. 9:59:59.000,9:59:59.000 If you're pressing it to the wall,[br]it should look like this. 9:59:59.000,9:59:59.000 And it shouldn't be a surprise that it's[br]a 7 and a half second response time. 
9:59:59.000,9:59:59.000 In fact, if it's not, something is[br]terribly wrong with what you're measuring. 9:59:59.000,9:59:59.000 You should look at that instead. 9:59:59.000,9:59:59.000 So don't do this. 9:59:59.000,9:59:59.000 Try to minimize the number of times[br]that you actually run red cars 9:59:59.000,9:59:59.000 into poles in your testing. 9:59:59.000,9:59:59.000 I'm not saying don't do it, but use it[br]to establish the end. 9:59:59.000,9:59:59.000 And then you need to test all the speeds,[br]and we need to see when you hit the pole. 9:59:59.000,9:59:59.000 Maybe you hit the pole at 100 mph,[br]but maybe you also hit the pole at 70 mph. 9:59:59.000,9:59:59.000 Maybe you don't hit it at 20. 9:59:59.000,9:59:59.000 We should find out how fast is safe. 9:59:59.000,9:59:59.000 When you have data, you can compare[br]it like this. 9:59:59.000,9:59:59.000 This is what I would say is a recommended [br]way to look at it. 9:59:59.000,9:59:59.000 Plot requirements, that's the hitting[br]the pole. 9:59:59.000,9:59:59.000 And some things hit the pole, [br]and some things don't. 9:59:59.000,9:59:59.000 And you run different scenarios, [br]different loads, 9:59:59.000,9:59:59.000 different configurations, 9:59:59.000,9:59:59.000 different settings, 9:59:59.000,9:59:59.000 and see what works, and what doesn't. 9:59:59.000,9:59:59.000 Your goal is to stay here, and carry [br]more while staying there. 9:59:59.000,9:59:59.000 Usually.