9:59:59.000,9:59:59.000 Hi everyone, I'm Gil Tene. 9:59:59.000,9:59:59.000 I'm going to be talking about this subject[br]that I call "How NOT to Measure Latency". 9:59:59.000,9:59:59.000 It's a subject that I've been talking[br]about for 3 years or so. 9:59:59.000,9:59:59.000 I keep the title and change all[br]the slides every time. 9:59:59.000,9:59:59.000 A bunch of this stuff is new. 9:59:59.000,9:59:59.000 So if you've seen any of my previous "How NOT to",[br]you'll see only some things that are common. 9:59:59.000,9:59:59.000 A nickname for the subject is this... 9:59:59.000,9:59:59.000 Because I often will get that reaction[br]from some people in the audience. 9:59:59.000,9:59:59.000 Ever since I've told people that it's a[br]nickname, 9:59:59.000,9:59:59.000 they feel free to actually exclaim,[br]"Oh S@%#!". 9:59:59.000,9:59:59.000 And feel free to do that here in this talk. 9:59:59.000,9:59:59.000 I'll prompt you in a couple of places[br]where it is natural. 9:59:59.000,9:59:59.000 But if you just have the urge, go ahead. 9:59:59.000,9:59:59.000 So just a tiny bit about me. 9:59:59.000,9:59:59.000 I am the co-founder of Azul Systems. 9:59:59.000,9:59:59.000 I play around with garbage collection a lot. 9:59:59.000,9:59:59.000 Here is some evidence of me playing around[br]with garbage collection in my kitchen. 9:59:59.000,9:59:59.000 That's a trash compactor. 9:59:59.000,9:59:59.000 The compaction function wasn't working right,[br]so I had to fix it. 9:59:59.000,9:59:59.000 I thought it'd be funny to take a picture[br]with a book. 9:59:59.000,9:59:59.000 I've also built a lot of things. 9:59:59.000,9:59:59.000 I've been playing with computers since[br]the early 80's. 9:59:59.000,9:59:59.000 I've built hardware. 9:59:59.000,9:59:59.000 I've helped design chips. 9:59:59.000,9:59:59.000 I've built software at many[br]different levels. 9:59:59.000,9:59:59.000 Operating systems, drivers...[br]JVMs obviously.
9:59:59.000,9:59:59.000 And lots of big systems at the system level. 9:59:59.000,9:59:59.000 Built our own app server in the late 90's[br]because WebLogic wasn't around yet. 9:59:59.000,9:59:59.000 So, I've made a lot of mistakes,[br]and I've learned from a few of them. 9:59:59.000,9:59:59.000 This is actually a combination of a bunch[br]of those mistakes looking at latency. 9:59:59.000,9:59:59.000 I do have this hobby of depressing people[br]by pulling the wool up from over your eyes, 9:59:59.000,9:59:59.000 and this is what this talk is about. 9:59:59.000,9:59:59.000 So, I need to give you a choice right here. 9:59:59.000,9:59:59.000 There's the door. 9:59:59.000,9:59:59.000 You can take the blue pill,[br]and you can leave. 9:59:59.000,9:59:59.000 Tomorrow you can keep believing whatever[br]it is you want to believe. 9:59:59.000,9:59:59.000 But if you stay here and take the red pill,[br]I will show you a glimpse of how 9:59:59.000,9:59:59.000 far down the rabbit hole goes,[br]and it will never be the same again. 9:59:59.000,9:59:59.000 Let's talk about latency. 9:59:59.000,9:59:59.000 And when I say latency, I'm talking about[br]latency, response time, any of those things 9:59:59.000,9:59:59.000 where you measure time from 'here to here',[br]and you're interested in how long it took. 9:59:59.000,9:59:59.000 We do this all the time, but I see a lot[br]of mish-mash in how people 9:59:59.000,9:59:59.000 treat the data, or think about it. 9:59:59.000,9:59:59.000 Latency is basically the time it took[br]something to happen once. 9:59:59.000,9:59:59.000 That one time, how long did it take. 9:59:59.000,9:59:59.000 And when we measure stuff, like we did[br]a million operations in the last hour, 9:59:59.000,9:59:59.000 we have a million latencies. Not one,[br]we have a million of them. 9:59:59.000,9:59:59.000 Our actual goal is to figure out how to[br]describe that million. 9:59:59.000,9:59:59.000 How did the million behave?
9:59:59.000,9:59:59.000 For example, 'they're all really good, and[br]they're all exactly the same', would be a 9:59:59.000,9:59:59.000 behavior that you will never see,[br]but that would be a great behavior. 9:59:59.000,9:59:59.000 So we need to talk about how things behave,[br]communicate, think, evaluate, 9:59:59.000,9:59:59.000 set requirements for, talk to other people,[br]but these are all common things around that. 9:59:59.000,9:59:59.000 To do that, we have to describe the[br]distribution, the set, the behavior, 9:59:59.000,9:59:59.000 but not the one. 9:59:59.000,9:59:59.000 For example, the behavior that says "the[br]common case was x" is a piece of 9:59:59.000,9:59:59.000 information about the behavior,[br]but it's a tiny sliver. 9:59:59.000,9:59:59.000 Usually the least relevant one. 9:59:59.000,9:59:59.000 Well, there's some less relevant ones,[br]but not a strongly relevant one, 9:59:59.000,9:59:59.000 and one that people often focus on. 9:59:59.000,9:59:59.000 To take a look at what we actually do[br]with this stuff, almost on a daily basis, 9:59:59.000,9:59:59.000 this is a snapshot from a monitoring system. 9:59:59.000,9:59:59.000 A small dashboard on a big screen[br]in a monitoring system. 9:59:59.000,9:59:59.000 Where you're watching the response time of[br]a system over time. 9:59:59.000,9:59:59.000 This is a two hour window. 9:59:59.000,9:59:59.000 These lines that are 95th percentile,[br]90, 75, 50, and 25th percentiles, 9:59:59.000,9:59:59.000 you can look at how they behave over time. 9:59:59.000,9:59:59.000 We're a small audience here, if you look at[br]this picture, what draws your eye? 9:59:59.000,9:59:59.000 What do you want to go investigate here[br]or pay attention to? 9:59:59.000,9:59:59.000 It's the big red spike there, right? 9:59:59.000,9:59:59.000 So we could look at the red spike,[br]cause it's different, 9:59:59.000,9:59:59.000 and say, "Woah, the 95th percentile shot up[br]here. 
And look, the 90th percentile 9:59:59.000,9:59:59.000 shot up at about the same time. 9:59:59.000,9:59:59.000 The rest of them didn't shoot up,[br]so maybe something happened here 9:59:59.000,9:59:59.000 that affected that much, I should probably[br]pay attention to it 9:59:59.000,9:59:59.000 because it's a monitoring system, and[br]I like things to be calm." 9:59:59.000,9:59:59.000 You could go investigate the why. 9:59:59.000,9:59:59.000 At this point, I've managed to waste[br]about 90 seconds of your life, 9:59:59.000,9:59:59.000 looking at a completely meaningless chart,[br]which unfortunately you do 9:59:59.000,9:59:59.000 every day, all the time. 9:59:59.000,9:59:59.000 This chart is the chart you want to show[br]somebody if you want to 9:59:59.000,9:59:59.000 hide the truth from them. 9:59:59.000,9:59:59.000 If you want to pull the wool[br]over their eyes. 9:59:59.000,9:59:59.000 This is the chart of the good stuff. 9:59:59.000,9:59:59.000 What's not on this chart? 9:59:59.000,9:59:59.000 The 5% worst things that happened during[br]this two hours. 9:59:59.000,9:59:59.000 They're not here. 9:59:59.000,9:59:59.000 This is only the good things that happened[br]during the things. 9:59:59.000,9:59:59.000 And to get this spike, that 5% had to be[br]so bad that it even pulled 9:59:59.000,9:59:59.000 the 95th percentile all up. 9:59:59.000,9:59:59.000 There is zero information here at all about[br]what happened bad during this two hours, 9:59:59.000,9:59:59.000 which makes it a bad fit for[br]a monitoring system. 9:59:59.000,9:59:59.000 It's a really good thing for[br]a marketing system. 9:59:59.000,9:59:59.000 It's a great way to get the bonus from your boss, even though you didn't do the work. 9:59:59.000,9:59:59.000 If you want to learn how to do that,[br]we can do another talk about that. 9:59:59.000,9:59:59.000 But this is not a good way to look at latency. 9:59:59.000,9:59:59.000 It's the opposite of good.
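A minimal sketch in Python of why a chart that tops out at the 95th percentile says nothing about the worst 5%. The data set is made up for illustration (950 fast requests and 50 very slow ones), and `percentile` is a hypothetical nearest-rank helper, not something taken from the monitoring tool in the slide:

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100:
    the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p/100 * n)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 95% are fast, 5% are terrible.
latencies = [10 + (i % 5) for i in range(950)] + [5000] * 50

# The lines a 25th-to-95th percentile chart would plot.
for p in (25, 50, 75, 90, 95):
    print(f"p{p}: {percentile(latencies, p)} ms")

# The number that chart leaves out.
print(f"max: {max(latencies)} ms")
```

Every percentile up to the 95th stays at or below 14 ms in this made-up data, while 50 requests took 5 seconds each. Only the maximum (or a higher percentile) reveals them.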
9:59:59.000,9:59:59.000 Unfortunately, this is one of the most[br]common tools used for 9:59:59.000,9:59:59.000 server monitoring on earth right now. 9:59:59.000,9:59:59.000 That's where the snapshot is from,[br]and this is what people look at. 9:59:59.000,9:59:59.000 I find this chart to be a goldmine[br]of information. 9:59:59.000,9:59:59.000 When I first showed it in another talk[br]like this, I had this really cool experience. 9:59:59.000,9:59:59.000 Somebody came up to me and said, "Hey,[br]as I was sitting here, I was texting one 9:59:59.000,9:59:59.000 of our guys, and he was saying, 9:59:59.000,9:59:59.000 'look, we have this issue with[br]our 95th percentile'." 9:59:59.000,9:59:59.000 And I got this chart from him! 9:59:59.000,9:59:59.000 So I went and said, "Hey, what does the[br]rest of the spectrum look like?" 9:59:59.000,9:59:59.000 This is the actual chart they got. 9:59:59.000,9:59:59.000 And when they looked at the rest of the[br]spectrum, it looked like that. 9:59:59.000,9:59:59.000 That's what was hiding. 9:59:59.000,9:59:59.000 Notice the scales are a little different. 9:59:59.000,9:59:59.000 That yellow line is that yellow line. 9:59:59.000,9:59:59.000 So that's a much more representative number. 9:59:59.000,9:59:59.000 Is it? Is that good enough? 9:59:59.000,9:59:59.000 That's the 99th percentile. 9:59:59.000,9:59:59.000 We still have another 1% of really bad[br]stuff that's hiding above the blue line. 9:59:59.000,9:59:59.000 I wonder how big that is? 9:59:59.000,9:59:59.000 I don't know because he didn't have the data. 9:59:59.000,9:59:59.000 So a common problem that we have is that[br]we only plot what's convenient. 9:59:59.000,9:59:59.000 We only plot what gives us nice,[br]colorful graphs. 9:59:59.000,9:59:59.000 And often, when we have to choose between[br]the stuff that hides the rest of the data, 9:59:59.000,9:59:59.000 and the stuff that is noise, we choose[br]the noise to display.
9:59:59.000,9:59:59.000 I like to rant about latency. 9:59:59.000,9:59:59.000 This is from a blog that I don't write[br]enough in, but the format for it was simple. 9:59:59.000,9:59:59.000 I tweet a single tweet about latency,[br]latency tip of the day, 9:59:59.000,9:59:59.000 and then I rant about my own tweet. 9:59:59.000,9:59:59.000 As an example, this chart is a goldmine[br]of information because it has so many 9:59:59.000,9:59:59.000 different things that are wrong in it,[br]but we won't get into all of them. 9:59:59.000,9:59:59.000 You can read it online. 9:59:59.000,9:59:59.000 Anyway, this is one to take away from[br]what we just said. 9:59:59.000,9:59:59.000 If you are not measuring and showing the[br]maximum value, what is it you are hiding? 9:59:59.000,9:59:59.000 And from whom? 9:59:59.000,9:59:59.000 If your job is to hide the truth from[br]others, this is a good way to do it. 9:59:59.000,9:59:59.000 But if you actually are interested in what's[br]going on, the number one indicator 9:59:59.000,9:59:59.000 you should never get rid of is the[br]maximum value. 9:59:59.000,9:59:59.000 That is not noise, that is the signal. 9:59:59.000,9:59:59.000 The rest of it is noise. 9:59:59.000,9:59:59.000 Okay, let's look at this chart for some[br]more cool stuff. 9:59:59.000,9:59:59.000 I'm gonna zoom in to a small part[br]of the chart, and ask you what that means. 9:59:59.000,9:59:59.000 What does the average of the 95th percentile[br]over 2 hours mean? 9:59:59.000,9:59:59.000 What is the math that does that? 9:59:59.000,9:59:59.000 What does it do? 9:59:59.000,9:59:59.000 Let's look at that, and I'll give you[br]an example with another percentile. 9:59:59.000,9:59:59.000 The 100th percentile. The max, right? 9:59:59.000,9:59:59.000 Let's take a data set. 9:59:59.000,9:59:59.000 Suppose this was the maximum every minute[br]for 15 minutes. 9:59:59.000,9:59:59.000 What does it mean to say that the average[br]max over the last 15 minutes was 42?
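To make the question concrete, here is a small Python sketch. The slide's actual numbers are not in the transcript, so the per-minute values below are invented, picked only so that their average comes out to 42:

```python
# Hypothetical maximum latency (ms) observed in each of 15 one-minute windows.
per_minute_max = [10] * 14 + [490]

# What the dashboard would report as the "average max".
average_of_maxes = sum(per_minute_max) / len(per_minute_max)

# The maximum over the whole window is simply the largest per-minute maximum.
true_max = max(per_minute_max)

print(f"average of per-minute maxes: {average_of_maxes} ms")
print(f"actual 15-minute max: {true_max} ms")
```

In this made-up data, 42 ms matches no minute and no request. The worst thing that happened in the window took 490 ms.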
9:59:59.000,9:59:59.000 I specifically chose the data to[br]make that happen. 9:59:59.000,9:59:59.000 It's a meaningless statement. 9:59:59.000,9:59:59.000 It's a completely meaningless statement. 9:59:59.000,9:59:59.000 But when you see 95th percentile,[br]average 184, you think that the 95th 9:59:59.000,9:59:59.000 percentile for the last two hours[br]was around 184. 9:59:59.000,9:59:59.000 It makes you think that. 9:59:59.000,9:59:59.000 Putting this on a piece of paper is not [br]just noise and irrelevant, 9:59:59.000,9:59:59.000 it's a way to mislead people. 9:59:59.000,9:59:59.000 It's a way to mislead yourself, because [br]you'll start to believe your own mistruths. 9:59:59.000,9:59:59.000 This is true for any percentile. 9:59:59.000,9:59:59.000 There is no percentile that you could do[br]this math on. 9:59:59.000,9:59:59.000 Another tip, you cannot average percentiles. 9:59:59.000,9:59:59.000 That math doesn't happen. 9:59:59.000,9:59:59.000 But percentiles do matter. You really[br]want to know about them. 9:59:59.000,9:59:59.000 And a common misperception is that we want[br]to look at the main part of the spectrum, 9:59:59.000,9:59:59.000 not those outliers and perfection stuff. 9:59:59.000,9:59:59.000 Only people that actually bet their house[br]every day, or the bank on it, 9:59:59.000,9:59:59.000 need to know about the "five-nine's", [br]and all those. 9:59:59.000,9:59:59.000 The 99th percentile is a pretty[br]good number. 9:59:59.000,9:59:59.000 Is 99% really rare? 9:59:59.000,9:59:59.000 Let's look at some stuff, because we can[br]ask questions like, "If I were looking 9:59:59.000,9:59:59.000 at a webpage, what is the chance of me[br]hitting the 99th percentile?" 9:59:59.000,9:59:59.000 Of things like this: a search engine node,[br]or a key value store, 9:59:59.000,9:59:59.000 or a database, or a CDN, right? 9:59:59.000,9:59:59.000 Because they will report their 99th percentile. 
9:59:59.000,9:59:59.000 They won't tell you anything above that,[br]but how many of the 9:59:59.000,9:59:59.000 webpages that we go to[br]actually experience this? 9:59:59.000,9:59:59.000 You want to say 1%, right?