1 99:59:59,999 --> 99:59:59,999 Hi everyone, I'm Gil Tene. 2 99:59:59,999 --> 99:59:59,999 I'm going to be talking about this subject that I call "How NOT to Measure Latency". 3 99:59:59,999 --> 99:59:59,999 It's a subject that I've been talking about now for 3 years or so. 4 99:59:59,999 --> 99:59:59,999 I keep the title and change all the slides every time. 5 99:59:59,999 --> 99:59:59,999 A bunch of this stuff is new. 6 99:59:59,999 --> 99:59:59,999 So if you've seen any of my previous "How NOT to", you'll see only some things that are common. 7 99:59:59,999 --> 99:59:59,999 A nickname for the subject is this... 8 99:59:59,999 --> 99:59:59,999 Because I often will get that reaction from some people in the audience. 9 99:59:59,999 --> 99:59:59,999 Ever since I've told people that it's a nickname, 10 99:59:59,999 --> 99:59:59,999 They feel free to actually exclaim, "Oh S@%#!". 11 99:59:59,999 --> 99:59:59,999 And feel free to do that here in this talk. 12 99:59:59,999 --> 99:59:59,999 I'll prompt you in a couple of places where it is natural. 13 99:59:59,999 --> 99:59:59,999 But if you just have the urge, go ahead. 14 99:59:59,999 --> 99:59:59,999 So just a tiny bit about me. 15 99:59:59,999 --> 99:59:59,999 I am the co-founder of Azul Systems. 16 99:59:59,999 --> 99:59:59,999 I play around with garbage collection a lot. 17 99:59:59,999 --> 99:59:59,999 Here is some evidence of me playing around with garbage collection in my kitchen. 18 99:59:59,999 --> 99:59:59,999 That's a trash compactor. 19 99:59:59,999 --> 99:59:59,999 The compaction function wasn't working right, so I had to fix it. 20 99:59:59,999 --> 99:59:59,999 I thought it'd be funny to take a picture with a book. 21 99:59:59,999 --> 99:59:59,999 I've also built a lot of things. 22 99:59:59,999 --> 99:59:59,999 I've been playing with computers since the early 80's. 23 99:59:59,999 --> 99:59:59,999 I've built hardware. 24 99:59:59,999 --> 99:59:59,999 I've helped design chips. 
25 99:59:59,999 --> 99:59:59,999 I've built software at many different levels. 26 99:59:59,999 --> 99:59:59,999 Operating systems, drivers... JVM's obviously. 27 99:59:59,999 --> 99:59:59,999 And lots of big systems at the system level. 28 99:59:59,999 --> 99:59:59,999 Built our own app server in the late 90's because WebLogic wasn't around yet. 29 99:59:59,999 --> 99:59:59,999 So, I've made a lot of mistakes, and I've learned from a few of them. 30 99:59:59,999 --> 99:59:59,999 This is actually a combination of a bunch of those mistakes looking at latency. 31 99:59:59,999 --> 99:59:59,999 I do have this hobby of depressing people by pulling the wool up from over your eyes, 32 99:59:59,999 --> 99:59:59,999 and this is what this talk is about. 33 99:59:59,999 --> 99:59:59,999 So, I need to give you a choice right here. 34 99:59:59,999 --> 99:59:59,999 There's the door. 35 99:59:59,999 --> 99:59:59,999 You can take the blue pill, and you can leave. 36 99:59:59,999 --> 99:59:59,999 Tomorrow you can keep believing whatever it is you want to believe. 37 99:59:59,999 --> 99:59:59,999 But if you stay here and take the red pill, I will show you a glimpse of how 38 99:59:59,999 --> 99:59:59,999 far down the rabbit hole goes, and it will never be the same again. 39 99:59:59,999 --> 99:59:59,999 Let's talk about latency. 40 99:59:59,999 --> 99:59:59,999 And when I say latency, I'm talking about latency, response time, any of those things 41 99:59:59,999 --> 99:59:59,999 where you measure time from 'here to here', and you're interested in how long it took. 42 99:59:59,999 --> 99:59:59,999 We do this all the time, but I see a lot of mish-mash in how people 43 99:59:59,999 --> 99:59:59,999 treat the data, or think about it. 44 99:59:59,999 --> 99:59:59,999 Latency is basically the time it took something to happen once. 45 99:59:59,999 --> 99:59:59,999 That one time, how long did it take. 
46 99:59:59,999 --> 99:59:59,999 And when we measure stuff, like we did a million operations in the last hour, 47 99:59:59,999 --> 99:59:59,999 we have a million latencies. Not one, we have a million of them. 48 99:59:59,999 --> 99:59:59,999 Our actual goal is to figure out how to describe that million. 49 99:59:59,999 --> 99:59:59,999 How did the million behave? 50 99:59:59,999 --> 99:59:59,999 For example, 'they're all really good, and they're all exactly the same', would be a 51 99:59:59,999 --> 99:59:59,999 behavior that you will never see, but that would be a great behavior. 52 99:59:59,999 --> 99:59:59,999 So we need to talk about how things behave, communicate, think, evaluate, 53 99:59:59,999 --> 99:59:59,999 set requirements for, talk to other people, but these are all common things around that. 54 99:59:59,999 --> 99:59:59,999 To do that, we have to describe the distribution, the set, the behavior, 55 99:59:59,999 --> 99:59:59,999 but not the one. 56 99:59:59,999 --> 99:59:59,999 For example, the behavior that says "the common case was x" is a piece of 57 99:59:59,999 --> 99:59:59,999 information about the behavior, but it's a tiny sliver. 58 99:59:59,999 --> 99:59:59,999 Usually the least relevant one. 59 99:59:59,999 --> 99:59:59,999 Well, there's some less relevant ones, but not a strongly relevant one, 60 99:59:59,999 --> 99:59:59,999 and one that people often focus on. 61 99:59:59,999 --> 99:59:59,999 To take a look at what we actually do with this stuff, almost on a daily basis, 62 99:59:59,999 --> 99:59:59,999 this is a snapshot from a monitoring system. 63 99:59:59,999 --> 99:59:59,999 A small dashboard on a big screen in a monitoring system. 64 99:59:59,999 --> 99:59:59,999 Where you're watching the response time of a system over time. 65 99:59:59,999 --> 99:59:59,999 This is a two hour window. 
66 99:59:59,999 --> 99:59:59,999 These lines that are 95th percentile, 90, 75, 50, and 25th percentiles, 67 99:59:59,999 --> 99:59:59,999 you can look at how they behave over time. 68 99:59:59,999 --> 99:59:59,999 We're a small audience here, if you look at this picture, what draws your eye? 69 99:59:59,999 --> 99:59:59,999 What do you want to go investigate here or pay attention to? 70 99:59:59,999 --> 99:59:59,999 It's the big red spike there, right? 71 99:59:59,999 --> 99:59:59,999 So we could look at the red spike, cause it's different, 72 99:59:59,999 --> 99:59:59,999 and say, "Woah, the 95th percentile shot up here. And look, the 90th percentile 73 99:59:59,999 --> 99:59:59,999 shot up at about the same time. 74 99:59:59,999 --> 99:59:59,999 The rest of them didn't shoot up, so maybe something happened here 75 99:59:59,999 --> 99:59:59,999 that affected that much, I should probably pay attention to it 76 99:59:59,999 --> 99:59:59,999 because it's a monitoring system, and I like things to be calm." 77 99:59:59,999 --> 99:59:59,999 You could go investigate the why. 78 99:59:59,999 --> 99:59:59,999 At this point, I've managed to waste about 90 seconds of your life, 79 99:59:59,999 --> 99:59:59,999 looking at a completely meaningless chart, which unfortunately you do 80 99:59:59,999 --> 99:59:59,999 every day, all the time. 81 99:59:59,999 --> 99:59:59,999 This chart is the chart you want to show somebody if you want to 82 99:59:59,999 --> 99:59:59,999 hide the truth from them. 83 99:59:59,999 --> 99:59:59,999 If you want to pull the wool over their eyes. 84 99:59:59,999 --> 99:59:59,999 This is the chart of the good stuff. 85 99:59:59,999 --> 99:59:59,999 What's not on this chart? 86 99:59:59,999 --> 99:59:59,999 The 5% worse things that happened during this two hours. 87 99:59:59,999 --> 99:59:59,999 They're not here. 88 99:59:59,999 --> 99:59:59,999 This is only the good things that happened during the things. 
89 99:59:59,999 --> 99:59:59,999 And to get this spike, that 5% had to be so bad that it even pulled 90 99:59:59,999 --> 99:59:59,999 the 95th percentile all up. 91 99:59:59,999 --> 99:59:59,999 There is zero information here at all about what happened bad during this two hours, 92 99:59:59,999 --> 99:59:59,999 which makes it a bad fit for a monitoring system. 93 99:59:59,999 --> 99:59:59,999 It's a really good thing for a marketing system. 94 99:59:59,999 --> 99:59:59,999 It's a great way to get the bonus from your boss, even though you didn't do the work. 95 99:59:59,999 --> 99:59:59,999 If you want to learn how to do that, we can do another talk about that. 96 99:59:59,999 --> 99:59:59,999 But this is not a good way to look at latency. 97 99:59:59,999 --> 99:59:59,999 It's the opposite of good. 98 99:59:59,999 --> 99:59:59,999 Unfortunately, this is one of the most common tools used for 99 99:59:59,999 --> 99:59:59,999 server monitoring on earth right now. 100 99:59:59,999 --> 99:59:59,999 That's where the snapshot is from, and this is what people look at. 101 99:59:59,999 --> 99:59:59,999 I find this chart to be a goldmine of information. 102 99:59:59,999 --> 99:59:59,999 When I first showed it in another talk like this, I had this really cool experience. 103 99:59:59,999 --> 99:59:59,999 Somebody came up to me and said, "Hey, as I was sitting here, I was texting one 104 99:59:59,999 --> 99:59:59,999 of our guys, and he was saying, 105 99:59:59,999 --> 99:59:59,999 'look, we have this issue with our 95th percentile'." 106 99:59:59,999 --> 99:59:59,999 And I got this chart from him! 107 99:59:59,999 --> 99:59:59,999 So I went and said, "Hey, what does the rest of the spectrum look like?" 108 99:59:59,999 --> 99:59:59,999 This is the actual chart they got. 109 99:59:59,999 --> 99:59:59,999 And when they look at the rest of the spectrum, it looked like that. 110 99:59:59,999 --> 99:59:59,999 That's what was hiding. 
111 99:59:59,999 --> 99:59:59,999 I noticed the scales are a little different. 112 99:59:59,999 --> 99:59:59,999 That yellow line is that yellow line. 113 99:59:59,999 --> 99:59:59,999 So that's a much more representative number. 114 99:59:59,999 --> 99:59:59,999 Is it? Is that good enough? 115 99:59:59,999 --> 99:59:59,999 That's the 99th percentile. 116 99:59:59,999 --> 99:59:59,999 We still have another 1% of really bad stuff that's hiding above the blue line. 117 99:59:59,999 --> 99:59:59,999 I wonder how big that is? 118 99:59:59,999 --> 99:59:59,999 I don't know because he didn't have the data. 119 99:59:59,999 --> 99:59:59,999 So a common problem that we have is that we only plot what's convenient. 120 99:59:59,999 --> 99:59:59,999 We only plot what gives us nice, colorful graphs. 121 99:59:59,999 --> 99:59:59,999 And often, when we have to choose between the stuff that hides the rest of the data, 122 99:59:59,999 --> 99:59:59,999 and the stuff that is noise, we choose the noise to display. 123 99:59:59,999 --> 99:59:59,999 I like to rant about latency. 124 99:59:59,999 --> 99:59:59,999 This is from a blog that I don't write enough in, but the format for it was simple. 125 99:59:59,999 --> 99:59:59,999 I tweet a single tweet about latency, latency tip of the day, 126 99:59:59,999 --> 99:59:59,999 and then I rant about my own tweet. 127 99:59:59,999 --> 99:59:59,999 As an example, this chart is a goldmine of information because it has so many 128 99:59:59,999 --> 99:59:59,999 different things that are wrong in it, but we won't get into all of them. 129 99:59:59,999 --> 99:59:59,999 You can read it online. 130 99:59:59,999 --> 99:59:59,999 Anyway, this is one to take away from what we just said. 131 99:59:59,999 --> 99:59:59,999 If you are not measuring and showing the maximum value, what is it you are hiding? 132 99:59:59,999 --> 99:59:59,999 And from whom? 
133 99:59:59,999 --> 99:59:59,999 If your job is to hide the truth from others, this is a good way to do it. 134 99:59:59,999 --> 99:59:59,999 But if you actually are interested in what's going on, the number one indicator 135 99:59:59,999 --> 99:59:59,999 you should never get rid of is the maximum value. 136 99:59:59,999 --> 99:59:59,999 That is not noise, that is the signal. 137 99:59:59,999 --> 99:59:59,999 The rest of it is noise. 138 99:59:59,999 --> 99:59:59,999 Okay, let's look at this chart for some more cool stuff. 139 99:59:59,999 --> 99:59:59,999 I'm gonna zoom in to a small part of the chart, and ask you what that means. 140 99:59:59,999 --> 99:59:59,999 What does the average of the 95th percentile over 2 hours mean? 141 99:59:59,999 --> 99:59:59,999 What is the math that does that? 142 99:59:59,999 --> 99:59:59,999 What does it do? 143 99:59:59,999 --> 99:59:59,999 Let's look at that, and I'll give you an example with another percentile. 144 99:59:59,999 --> 99:59:59,999 The 100th percentile. The max, right? 145 99:59:59,999 --> 99:59:59,999 Let's take a data set. 146 99:59:59,999 --> 99:59:59,999 Suppose this was the maximum every minute for 15 minutes. 147 99:59:59,999 --> 99:59:59,999 What does it mean to say that the average max over the last 15 minutes was 42? 148 99:59:59,999 --> 99:59:59,999 I specifically chose the data to make that happen. 149 99:59:59,999 --> 99:59:59,999 It's a meaningless statement. 150 99:59:59,999 --> 99:59:59,999 It's a completely meaningless statement. 151 99:59:59,999 --> 99:59:59,999 But when you see 95th percentile, average 184, you think that the 95th 152 99:59:59,999 --> 99:59:59,999 percentile for the last two hours was around 184. 153 99:59:59,999 --> 99:59:59,999 It makes you think that. 154 99:59:59,999 --> 99:59:59,999 Putting this on a piece of paper is not just noise and irrelevant, 155 99:59:59,999 --> 99:59:59,999 it's a way to mislead people. 
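A small sketch may make the point about averaging concrete. The latencies below are made up for illustration (they are not the data from the slide), and the `percentile` helper is a hypothetical nearest-rank implementation. It shows that the average of per-window 95th percentiles, like the average of per-window maxes, matches nothing that actually happened.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies (ms) for three one-minute windows; not data from the talk.
windows = [
    [10] * 100,              # quiet minute
    [10] * 100,              # quiet minute
    [10] * 80 + [500] * 20,  # minute with a stall
]

per_window_p95 = [percentile(w, 95) for w in windows]
avg_of_p95 = sum(per_window_p95) / len(per_window_p95)
true_p95 = percentile([v for w in windows for v in w], 95)

per_window_max = [max(w) for w in windows]
avg_of_max = sum(per_window_max) / len(per_window_max)
true_max = max(per_window_max)

print("average of per-window p95:", round(avg_of_p95, 1), "ms; p95 of all samples:", true_p95, "ms")
print("average of per-window max:", round(avg_of_max, 1), "ms; max of all samples:", true_max, "ms")
```

In this made-up data set the averaged figures come out near 173 ms, a latency that no request experienced, while the real 95th percentile and the real max are both 500 ms.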
156 99:59:59,999 --> 99:59:59,999 It's a way to mislead yourself, because you'll start to believe your own mistruths. 157 99:59:59,999 --> 99:59:59,999 This is true for any percentile. 158 99:59:59,999 --> 99:59:59,999 There is no percentile that you could do this math on. 159 99:59:59,999 --> 99:59:59,999 Another tip, you cannot average percentiles. 160 99:59:59,999 --> 99:59:59,999 That math doesn't happen. 161 99:59:59,999 --> 99:59:59,999 But percentiles do matter. You really want to know about them. 162 99:59:59,999 --> 99:59:59,999 And a common misperception is that we want to look at the main part of the spectrum, 163 99:59:59,999 --> 99:59:59,999 not those outliers and perfection stuff. 164 99:59:59,999 --> 99:59:59,999 Only people that actually bet their house every day, or the bank on it, 165 99:59:59,999 --> 99:59:59,999 need to know about the "five-nine's", and all those. 166 99:59:59,999 --> 99:59:59,999 The 99th percentile is a pretty good number. 167 99:59:59,999 --> 99:59:59,999 Is 99% really rare? 168 99:59:59,999 --> 99:59:59,999 Let's look at some stuff, because we can ask questions like, "If I were looking 169 99:59:59,999 --> 99:59:59,999 at a webpage, what is the chance of me hitting the 99th percentile?" 170 99:59:59,999 --> 99:59:59,999 Of things like this: a search engine node, or a key value store, 171 99:59:59,999 --> 99:59:59,999 or a database, or a CDN, right? 172 99:59:59,999 --> 99:59:59,999 Because they will report their 99th percentile. 173 99:59:59,999 --> 99:59:59,999 They won't tell you anything above that, but how many of the 174 99:59:59,999 --> 99:59:59,999 webpages that we go to actually experience this? 175 99:59:59,999 --> 99:59:59,999 You want to say 1%, right? 176 99:59:59,999 --> 99:59:59,999 Well, I went to some webpages and I counted how many "http" requests were generated 177 99:59:59,999 --> 99:59:59,999 by one click into that webpage, and here are the numbers. 
178 99:59:59,999 --> 99:59:59,999 I ended that about a year ago. 179 99:59:59,999 --> 99:59:59,999 They've probably gone up since then. 180 99:59:59,999 --> 99:59:59,999 Now that translates into this math. 181 99:59:59,999 --> 99:59:59,999 This is the likelihood of one click seeing the 99th percentile. 182 99:59:59,999 --> 99:59:59,999 And the only page where that is less than 50% is the clean google search page. 183 99:59:59,999 --> 99:59:59,999 Where only a quarter will see the 99th percentile. 184 99:59:59,999 --> 99:59:59,999 The 99th percentile is the thing that most of your webpages will see. 185 99:59:59,999 --> 99:59:59,999 Most of them will be there. 186 99:59:59,999 --> 99:59:59,999 Now, we could look at other things. 187 99:59:59,999 --> 99:59:59,999 We can pick which things to focus on. 188 99:59:59,999 --> 99:59:59,999 Let's say I had to pick between the 95th percentile, and the three 9's (99.9%). 189 99:59:59,999 --> 99:59:59,999 The three 9's is way into perfection mode for most people, or they think. 190 99:59:59,999 --> 99:59:59,999 Which one of those represents our community better? 191 99:59:59,999 --> 99:59:59,999 Our population? 192 99:59:59,999 --> 99:59:59,999 Our users? 193 99:59:59,999 --> 99:59:59,999 Our experience? 194 99:59:59,999 --> 99:59:59,999 Let's run a hypothetical. 195 99:59:59,999 --> 99:59:59,999 Suppose we don't have that many pages, and that many resources like we said before. 196 99:59:59,999 --> 99:59:59,999 We'll be much more conservative. 197 99:59:59,999 --> 99:59:59,999 A user session will only go through five clicks, and each click will only bring up 198 99:59:59,999 --> 99:59:59,999 up to 40 things. 199 99:59:59,999 --> 99:59:59,999 A lot less, and they're all as clean as the google page. 200 99:59:59,999 --> 99:59:59,999 How many of the users will not experience something worse than the 95th percentile? 201 99:59:59,999 --> 99:59:59,999 Because that's what the 95th percentile is good for, the people who see that. 
202 99:59:59,999 --> 99:59:59,999 Anybody above that, is that. 203 99:59:59,999 --> 99:59:59,999 What are the chances of not seeing it? 204 99:59:59,999 --> 99:59:59,999 That's an interesting number. 205 99:59:59,999 --> 99:59:59,999 So you're watching a number that is relevant to 0.003% of your users. 206 99:59:59,999 --> 99:59:59,999 99.997% of your users are going to see worse than this number. 207 99:59:59,999 --> 99:59:59,999 Why are you looking at it? 208 99:59:59,999 --> 99:59:59,999 Why are you spending time thinking about it? 209 99:59:59,999 --> 99:59:59,999 In reverse, we could say how many people are going to see something 210 99:59:59,999 --> 99:59:59,999 worse than the three 9's (99.9%)? 211 99:59:59,999 --> 99:59:59,999 That's going to be 18%. 212 99:59:59,999 --> 99:59:59,999 In reverse, 82% of the people will see the three 9's (99.9%) or better. 213 99:59:59,999 --> 99:59:59,999 That's a slightly better representation. 214 99:59:59,999 --> 99:59:59,999 Probably not good enough either. 215 99:59:59,999 --> 99:59:59,999 We could look at some more math with them, same kind of scenario. 216 99:59:59,999 --> 99:59:59,999 What percentile of http response time will be the thing that 95% 217 99:59:59,999 --> 99:59:59,999 of people experience in this scenario? 218 99:59:59,999 --> 99:59:59,999 It's the 99.97 percentile that 95% of people see. 219 99:59:59,999 --> 99:59:59,999 And if you want to know what 99% of the people see, 220 99:59:59,999 --> 99:59:59,999 that's four and a half 9's (99.995%). 221 99:59:59,999 --> 99:59:59,999 You want to know that number from Akamai if you want to predict what 1% 222 99:59:59,999 --> 99:59:59,999 of your users are going to experience. 223 99:59:59,999 --> 99:59:59,999 When you know the 99th percentile, you kind of know a tiny bit. 224 99:59:59,999 --> 99:59:59,999 So here's another tip. 225 99:59:59,999 --> 99:59:59,999 And this is not an exaggeration, by the way. 
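The numbers in this hypothetical can be checked with a few lines of arithmetic. This is a sketch under one stated assumption that the talk leaves implicit: each of the 200 requests in a session (5 clicks of 40 resources) is treated as an independent draw from the same latency distribution. The function names are illustrative.

```python
requests_per_session = 5 * 40  # 5 clicks x 40 resources each, per the hypothetical

def chance_all_within(pct, n):
    """Probability that all n independent requests land at or below the given percentile."""
    return (pct / 100.0) ** n

def percentile_needed(user_fraction, n):
    """Per-request percentile such that user_fraction of sessions never exceed it."""
    return 100.0 * user_fraction ** (1.0 / n)

p_never_worse_than_p95 = chance_all_within(95, requests_per_session)
p_worse_than_p999 = 1 - chance_all_within(99.9, requests_per_session)

print(f"sessions that never see worse than p95:  {p_never_worse_than_p95:.4%}")
print(f"sessions that see worse than p99.9:      {p_worse_than_p999:.1%}")
print(f"percentile that 95% of sessions stay in: {percentile_needed(0.95, requests_per_session):.2f}")
print(f"percentile that 99% of sessions stay in: {percentile_needed(0.99, requests_per_session):.3f}")
```

Under that independence assumption the results line up with the figures quoted above: roughly 0.003% of sessions stay within the 95th percentile, about 18% see worse than three 9's, and the percentiles that 95% and 99% of sessions stay within are about 99.97 and 99.995.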
226 99:59:59,999 --> 99:59:59,999 The median, which is a much smaller percentile, has that minuscule a chance 227 99:59:59,999 --> 99:59:59,999 of ever being the number that anybody experiences. 228 99:59:59,999 --> 99:59:59,999 This is the chance of getting worse than the median. 229 99:59:59,999 --> 99:59:59,999 Which makes the median an irrelevant number to look at. 230 99:59:59,999 --> 99:59:59,999 Unfortunately, it's probably the most common one looked at. 231 99:59:59,999 --> 99:59:59,999 When people say "the typical", they look at the thing that 232 99:59:59,999 --> 99:59:59,999 everything will be worse than. 233 99:59:59,999 --> 99:59:59,999 Okay, I'm sorry about that part. 234 99:59:59,999 --> 99:59:59,999 We'll do some other parts. 235 99:59:59,999 --> 99:59:59,999 Now, why is it that when we look at these monitoring systems, we don't see 236 99:59:59,999 --> 99:59:59,999 data with a lot of 9's? 237 99:59:59,999 --> 99:59:59,999 Why do we stop at the 90, 95, 99th percentile? 238 99:59:59,999 --> 99:59:59,999 Why don't we look further? 239 99:59:59,999 --> 99:59:59,999 Now, some of it is because people think, "Well that's perfection, I don't need it." 240 99:59:59,999 --> 99:59:59,999 The other part is that it's hard. 241 99:59:59,999 --> 99:59:59,999 It's hard because you can't average percentiles. 242 99:59:59,999 --> 99:59:59,999 We already talked about that. 243 99:59:59,999 --> 99:59:59,999 But you also can't derive your five 9's (99.999%) out of a lot 244 99:59:59,999 --> 99:59:59,999 of 10 second samples of percentiles. 245 99:59:59,999 --> 99:59:59,999 And the reason for that is, "Hey, in 10 seconds, maybe I only had 1,000 things." 246 99:59:59,999 --> 99:59:59,999 I could take all the 10 seconds in the world, there's no way to say what the 247 99:59:59,999 --> 99:59:59,999 hour five 9's (99.999%) were, what the minutes five 9's were 248 99:59:59,999 --> 99:59:59,999 if I'm collecting just this data. 
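To illustrate why per-interval percentile summaries cannot be combined later, here is a sketch with made-up data: one hundred 10-second windows of 1,000 requests each, one of which contains a stall. The `percentile_from_counts` helper and the window contents are assumptions for illustration only. The contrast is between keeping one percentile number per window and keeping the counts themselves, which can be merged.

```python
import math
from collections import Counter
from fractions import Fraction

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile computed from a {latency: count} histogram."""
    total = sum(counts.values())
    # Fraction(str(pct)) keeps 99.999 exact, so the rank is not off by one from float rounding.
    rank = max(1, math.ceil(Fraction(str(pct)) * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")

# Hypothetical 10-second windows of 1,000 requests each (latency in ms).
quiet = Counter({1: 1000})
stalled = Counter({1: 990, 2000: 10})
windows = [quiet] * 99 + [stalled]

# What a backend has if each window was reduced to a single percentile number.
per_window_p999 = [percentile_from_counts(w, 99.9) for w in windows]
avg_of_p999 = sum(per_window_p999) / len(per_window_p999)

# What it has if the counts were kept: histograms merge by adding counts.
merged = Counter()
for w in windows:
    merged.update(w)

print("average of per-window p99.9:", avg_of_p999, "ms")
print("p99.9 of merged counts:     ", percentile_from_counts(merged, 99.9), "ms")
print("p99.999 of merged counts:   ", percentile_from_counts(merged, 99.999), "ms")
```

With 1,000 samples per window, no single window can resolve five 9's at all, and the averaged per-window figure (20.99 ms here) is neither the overall 99.9th percentile (1 ms) nor the overall five 9's (2000 ms). Only the merged counts can answer both questions.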
249 99:59:59,999 --> 99:59:59,999 And unfortunately, the data being collected and reported to the back ends of monitoring 250 99:59:59,999 --> 99:59:59,999 is usually summarized at a second, 5 seconds, 10 seconds, etc. 251 99:59:59,999 --> 99:59:59,999 Basically throwing away all the good data, and leaving you with absolutely no way 252 99:59:59,999 --> 99:59:59,999 to compute large 9's for longer periods of time. 253 99:59:59,999 --> 99:59:59,999 So, this is where you might want to look at HDR Histogram. 254 99:59:59,999 --> 99:59:59,999 It's an open source thing I created a few years ago. 255 99:59:59,999 --> 99:59:59,999 I did it in Java, and now there are C, C#, Python, Erlang, 256 99:59:59,999 --> 99:59:59,999 and Go ports of this that I didn't create. 257 99:59:59,999 --> 99:59:59,999 And it lets you actually get an entire percentile spectrum. 258 99:59:59,999 --> 99:59:59,999 Some of you here I know are already using it. 259 99:59:59,999 --> 99:59:59,999 And you can look at all the percentiles. 260 99:59:59,999 --> 99:59:59,999 Any number of 9's that's in the data, if you just keep it right and report it right, 261 99:59:59,999 --> 99:59:59,999 it's got a log format, you can store things forever. 262 99:59:59,999 --> 99:59:59,999 Well, for a long time. 263 99:59:59,999 --> 99:59:59,999 Okay, so it lets you have nice things. 264 99:59:59,999 --> 99:59:59,999 Enough for that advertisement. 265 99:59:59,999 --> 99:59:59,999 Now, latency... Well, I think this is slightly out of order. 266 99:59:59,999 --> 99:59:59,999 Yeah, sorry. 267 99:59:59,999 --> 99:59:59,999 This is the red/blue pill part, so I warn you, this is your last chance. 268 99:59:59,999 --> 99:59:59,999 There's a problem I call the coordinated omission problem. 269 99:59:59,999 --> 99:59:59,999 The coordinated omission problem is basically a conspiracy. 270 99:59:59,999 --> 99:59:59,999 It's a conspiracy that we're all part of. 
271 99:59:59,999 --> 99:59:59,999 I don't think anybody actually meant to do it, but once I've noticed it, 272 99:59:59,999 --> 99:59:59,999 everywhere I look, there it is. 273 99:59:59,999 --> 99:59:59,999 Now, I've been using a specific way of showing you numbers so far. 274 99:59:59,999 --> 99:59:59,999 Has anybody here noticed how I spell percentile? 275 99:59:59,999 --> 99:59:59,999 (Audience Member): "You put lie at the end of the percent sign." 276 99:59:59,999 --> 99:59:59,999 Yeah, good. 277 99:59:59,999 --> 99:59:59,999 So the coordinated omission problem is the "lie" in %lies. 278 99:59:59,999 --> 99:59:59,999 And this is how it works. 279 99:59:59,999 --> 99:59:59,999 One common way to do this is to use a load generator. 280 99:59:59,999 --> 99:59:59,999 Pretty much all load generators have this problem. 281 99:59:59,999 --> 99:59:59,999 There are two that I know of that don't. 282 99:59:59,999 --> 99:59:59,999 What you do with a load generator, is you test. 283 99:59:59,999 --> 99:59:59,999 You issue requests, or send packets. 284 99:59:59,999 --> 99:59:59,999 And you measure how long something took. 285 99:59:59,999 --> 99:59:59,999 And as long as the numbers go right, measure them, put them in a bucket, 286 99:59:59,999 --> 99:59:59,999 study them later, and get your percentiles from it. 287 99:59:59,999 --> 99:59:59,999 But what if the thing that you are measuring took longer than the time 288 99:59:59,999 --> 99:59:59,999 it would've taken until you send the next thing? 289 99:59:59,999 --> 99:59:59,999 You're supposed to send something every second, 290 99:59:59,999 --> 99:59:59,999 but this one took a second and a half. 291 99:59:59,999 --> 99:59:59,999 Well you've got to wait before you send the next one. 292 99:59:59,999 --> 99:59:59,999 You just avoided measuring something when the system was problematic. 293 99:59:59,999 --> 99:59:59,999 You've coordinated with it. 294 99:59:59,999 --> 99:59:59,999 You weren't looking at it then. 
295 99:59:59,999 --> 99:59:59,999 That's common scenario A: You've backed off, and avoided measuring when it was bad. 296 99:59:59,999 --> 99:59:59,999 Another way, is you measure inside your code. 297 99:59:59,999 --> 99:59:59,999 We all do this. We all have to do this, 298 99:59:59,999 --> 99:59:59,999 where we measure time, do something, then measure time. 299 99:59:59,999 --> 99:59:59,999 The delta between them is how long it took. 300 99:59:59,999 --> 99:59:59,999 We can then put it in a stats bucket, and then do the percentiles in that. 301 99:59:59,999 --> 99:59:59,999 Unfortunately, if the system freezes right here, for any reason, 302 99:59:59,999 --> 99:59:59,999 an interrupt, a context switch, 303 99:59:59,999 --> 99:59:59,999 a cache buffer flushed to disk, 304 99:59:59,999 --> 99:59:59,999 a garbage collection, 305 99:59:59,999 --> 99:59:59,999 a re-indexing of your database, this is a database. 306 99:59:59,999 --> 99:59:59,999 This is Cassandra by the way, measuring itself. 307 99:59:59,999 --> 99:59:59,999 In any of the above, then you will have one bad report 308 99:59:59,999 --> 99:59:59,999 while 10,000 things are waiting in line. 309 99:59:59,999 --> 99:59:59,999 And when they come in, they will look really, really good. 310 99:59:59,999 --> 99:59:59,999 Even though each one of them has had a really bad experience. 311 99:59:59,999 --> 99:59:59,999 It can even get worse, where maybe the freeze happened outside the timing, 312 99:59:59,999 --> 99:59:59,999 and you won't even know there was a freeze. 313 99:59:59,999 --> 99:59:59,999 Now these are examples of omitting data that is bad on a very selective basis. 314 99:59:59,999 --> 99:59:59,999 It's not random sampling. 315 99:59:59,999 --> 99:59:59,999 It's, "I don't like bad data", 316 99:59:59,999 --> 99:59:59,999 or "I couldn't handle it", 317 99:59:59,999 --> 99:59:59,999 or "I don't know about it", 318 99:59:59,999 --> 99:59:59,999 so we'll just talk about the good. 
319 99:59:59,999 --> 99:59:59,999 What does that do to your data? 320 99:59:59,999 --> 99:59:59,999 Because it often makes people feel like, 321 99:59:59,999 --> 99:59:59,999 "Okay, yeah, I understand, but it's a little bit of noise." 322 99:59:59,999 --> 99:59:59,999 Let's run some hypotheticals, and I'll show you some real numbers. 323 99:59:59,999 --> 99:59:59,999 Imagine a perfect system. 324 99:59:59,999 --> 99:59:59,999 It's doing 100 requests a second, at exactly a millisecond each. 325 99:59:59,999 --> 99:59:59,999 But we go and freeze the system, after 100 seconds of perfect operations 326 99:59:59,999 --> 99:59:59,999 for 100 seconds, and then repeat. 327 99:59:59,999 --> 99:59:59,999 Now, I'm going to describe how the system behaves in terms that should mean something, 328 99:59:59,999 --> 99:59:59,999 and then we'll measure it. 329 99:59:59,999 --> 99:59:59,999 If we actually wanted to describe the system, 330 99:59:59,999 --> 99:59:59,999 on the left we have an average of one millisecond by the finish, 331 99:59:59,999 --> 99:59:59,999 and on the right we have an average of 50 seconds. 332 99:59:59,999 --> 99:59:59,999 Why 50? Because if I randomly came in in that 100 seconds, 333 99:59:59,999 --> 99:59:59,999 I'll get anything from 0 to 100 with even distribution. 334 99:59:59,999 --> 99:59:59,999 The overall average over 200 seconds is 25 seconds. 335 99:59:59,999 --> 99:59:59,999 If I just came in here and said, "Surprise, how long did this take?" 336 99:59:59,999 --> 99:59:59,999 On average, it will be 25. 337 99:59:59,999 --> 99:59:59,999 I can also do the percentiles. 338 99:59:59,999 --> 99:59:59,999 50th percentile will be really good, and then it'll get really bad. 339 99:59:59,999 --> 99:59:59,999 The four 9's is terrible. 340 99:59:59,999 --> 99:59:59,999 This is a fair honest description of this system if this is what it did. 341 99:59:59,999 --> 99:59:59,999 And you can make the system do that. 
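The honest description of this hypothetical system can be reproduced with a short simulation. This is a sketch under simplifying assumptions that go slightly beyond what is said above: requests arrive exactly every 10 ms (100 per second), and every request that arrives during the freeze is answered 1 ms after the freeze ends. Times are in integer milliseconds, and the `percentile` helper is a hypothetical nearest-rank implementation.

```python
import math
from fractions import Fraction

def percentile(values, pct):
    """Nearest-rank percentile; Fraction keeps pcts like 99.99 exact."""
    ordered = sorted(values)
    rank = max(1, math.ceil(Fraction(str(pct)) * len(ordered) / 100))
    return ordered[rank - 1]

INTERVAL_MS = 10          # 100 requests per second
GOOD_MS = 100_000         # 100 seconds of perfect operation
CYCLE_MS = 200_000        # followed by a 100 second freeze
SERVICE_MS = 1

true_response = []
for arrival in range(0, CYCLE_MS, INTERVAL_MS):
    if arrival < GOOD_MS:
        true_response.append(SERVICE_MS)
    else:
        # Assumption: the request waits out the rest of the freeze, then takes 1 ms.
        true_response.append(CYCLE_MS - arrival + SERVICE_MS)

mean_ms = sum(true_response) / len(true_response)
print("requests:", len(true_response))
print("mean response:", mean_ms / 1000, "s")
for pct in (50, 75, 99, 99.99):
    print(f"p{pct}:", percentile(true_response, pct) / 1000, "s")
print("max:", max(true_response) / 1000, "s")
```

Under these assumptions the mean comes out at about 25 seconds, the median is still 1 ms because exactly half of the requests arrive during the good period, and everything above the median climbs toward 100 seconds.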
342 99:59:59,999 --> 99:59:59,999 That's what Control Z is good for. 343 99:59:59,999 --> 99:59:59,999 You can make any of your systems do that. 344 99:59:59,999 --> 99:59:59,999 Now let's go measure this system with a load generator, 345 99:59:59,999 --> 99:59:59,999 or with a monitoring system. 346 99:59:59,999 --> 99:59:59,999 The common ones. 347 99:59:59,999 --> 99:59:59,999 The ones everybody does. 348 99:59:59,999 --> 99:59:59,999 On the left, we're going to get 10,000 results of one millisecond each. 349 99:59:59,999 --> 99:59:59,999 Great. 350 99:59:59,999 --> 99:59:59,999 And we're going to get one result of 100 seconds. 351 99:59:59,999 --> 99:59:59,999 Wow, really big response time. 352 99:59:59,999 --> 99:59:59,999 This is our data. 353 99:59:59,999 --> 99:59:59,999 This is OUR data. 354 99:59:59,999 --> 99:59:59,999 So now you go do math with it. 355 99:59:59,999 --> 99:59:59,999 The average of that is 10.9 milliseconds. 356 99:59:59,999 --> 99:59:59,999 A little less than 25 seconds. 357 99:59:59,999 --> 99:59:59,999 And here are the percentiles. 358 99:59:59,999 --> 99:59:59,999 Your load generator monitoring system will tell you that this system is perfect. 359 99:59:59,999 --> 99:59:59,999 You could go to production with it. 360 99:59:59,999 --> 99:59:59,999 You like what you see. 361 99:59:59,999 --> 99:59:59,999 Look at that, four 9's. 362 99:59:59,999 --> 99:59:59,999 It is lying to you. 363 99:59:59,999 --> 99:59:59,999 To your face. 364 99:59:59,999 --> 99:59:59,999 And you can catch it doing that with a Control Z-Test. 365 99:59:59,999 --> 99:59:59,999 But people tend to not want to do that, because then what are they going to do? 366 99:59:59,999 --> 99:59:59,999 If you just do that test, and calibrate your system, and you find it 367 99:59:59,999 --> 99:59:59,999 telling you that, about this, the next step should be to throw all the numbers away. 368 99:59:59,999 --> 99:59:59,999 Don't believe anything else it says. 
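The measured view described above is easy to reproduce. This sketch simply builds the data set as stated (10,000 results of 1 ms and a single result of 100 seconds) and does the usual math on it. The `percentile` helper is a hypothetical nearest-rank implementation, and times are in integer milliseconds.

```python
import math
from fractions import Fraction

def percentile(values, pct):
    """Nearest-rank percentile; Fraction keeps pcts like 99.99 exact."""
    ordered = sorted(values)
    rank = max(1, math.ceil(Fraction(str(pct)) * len(ordered) / 100))
    return ordered[rank - 1]

# What a coordinating load generator records: it stalls along with the system,
# so the whole 100 second freeze shows up as a single sample.
recorded = [1] * 10_000 + [100_000]

mean_ms = sum(recorded) / len(recorded)
print(f"mean: {mean_ms:.4f} ms")
for pct in (50, 99, 99.9, 99.99):
    print(f"p{pct}:", percentile(recorded, pct), "ms")
print("max:", max(recorded), "ms")
```

The mean works out to just under 11 ms and every percentile through four 9's reports 1 ms. Only the max still shows the freeze, which is one more reason not to drop it.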
369 99:59:59,999 --> 99:59:59,999 If it lies this big, what else did it do? 370 99:59:59,999 --> 99:59:59,999 Don't waste your time on numbers from uncalibrated systems. 371 99:59:59,999 --> 99:59:59,999 Now the problem here was, that if you want to measure the system, 372 99:59:59,999 --> 99:59:59,999 you have to measure at random rates, or same rates. 373 99:59:59,999 --> 99:59:59,999 If you measure 10,000 things in 100 seconds, there should be another 10,000 things here. 374 99:59:59,999 --> 99:59:59,999 If you measure them, you would've gotten all the right numbers. 375 99:59:59,999 --> 99:59:59,999 Coordinated omission is the simple act of erasing all that bad stuff. 376 99:59:59,999 --> 99:59:59,999 The conspiracy here is that we all do it without meaning to. 377 99:59:59,999 --> 99:59:59,999 I don't know who put that in our systems, but it happens to all of us. 378 99:59:59,999 --> 99:59:59,999 Now, I often get people saying, "Okay, I get it. All the numbers are wrong, 379 99:59:59,999 --> 99:59:59,999 but at least for my job where I tune performance, and I try to make things 380 99:59:59,999 --> 99:59:59,999 faster, I can use the numbers to figure out if I'm going in the right direction." 381 99:59:59,999 --> 99:59:59,999 Is it better, or is it worse? Let me dispel that for you for a second. 382 99:59:59,999 --> 99:59:59,999 Suppose I went and took this system, and improved it dramatically. 383 99:59:59,999 --> 99:59:59,999 Rather than freezing for 100 seconds, it will now answer every question. 384 99:59:59,999 --> 99:59:59,999 It'll take a little longer, 5 milliseconds instead of one, 385 99:59:59,999 --> 99:59:59,999 but it's much better than freezing, right? 386 99:59:59,999 --> 99:59:59,999 So let's measure that system that we spent weeks and weeks improving, 387 99:59:59,999 --> 99:59:59,999 and see if it's better. 388 99:59:59,999 --> 99:59:59,999 That's the data. 
389 99:59:59,999 --> 99:59:59,999 If we do the percentiles, it'll tell us that we just really hurt the four 9's. 390 99:59:59,999 --> 99:59:59,999 We made it go 5 times worse than before. 391 99:59:59,999 --> 99:59:59,999 We should revert this change, go back to that much better system we had before. 392 99:59:59,999 --> 99:59:59,999 So this is just to make sure that you don't think that you can have 393 99:59:59,999 --> 99:59:59,999 any intuition based on any of these numbers. 394 99:59:59,999 --> 99:59:59,999 They go backwards sometimes. 395 99:59:59,999 --> 99:59:59,999 You don't know which way is good or bad. 396 99:59:59,999 --> 99:59:59,999 And you'll never know which way is good or bad with a system that lies like that. 397 99:59:59,999 --> 99:59:59,999 The other cool technique is what I call "Cheating Twice". 398 99:59:59,999 --> 99:59:59,999 You have a constant load generator, and it needs to do 100 per second. 399 99:59:59,999 --> 99:59:59,999 When it woke up after 200 seconds, it says, 400 99:59:59,999 --> 99:59:59,999 "Whoa, we're 9,999 behind. We've got to issue those requests." 401 99:59:59,999 --> 99:59:59,999 So it issues those requests. 402 99:59:59,999 --> 99:59:59,999 At this point, not only did it get rid of all the bad requests, 403 99:59:59,999 --> 99:59:59,999 it replaced every one of them with a perfect request. 404 99:59:59,999 --> 99:59:59,999 Coining the four 9's (99.99%), all the way to four and a half 9's (99.995%), 405 99:59:59,999 --> 99:59:59,999 it's twice as wrong as dropping them. 406 99:59:59,999 --> 99:59:59,999 So these are all cool things that happen to you. 407 99:59:59,999 --> 99:59:59,999 I'm not going to spend much time on how to fix those and avoid those. 408 99:59:59,999 --> 99:59:59,999 There's a lot of other material that you can find with me 409 99:59:59,999 --> 99:59:59,999 talking about that, in longer talks. 410 99:59:59,999 --> 99:59:59,999 But this is pretty bad.
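[Editor's sketch] The comparison described above can be worked through numerically. This is a rough sketch under stated assumptions: the tester targets 100 requests per second, the system answers in 1 ms for 100 seconds and then freezes for 100 seconds, and a request scheduled during the freeze completes when the freeze ends (its own 1 ms of work is ignored). The names `frozen_recorded`, `frozen_on_schedule`, `improved` and `cheating_twice` are illustrative and not from the talk.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (epsilon guards against float noise)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100 - 1e-9))
    return ordered[rank - 1]

RATE_PER_SEC = 100
GOOD_SECONDS = 100
FREEZE_SECONDS = 100
interval_ms = 1000.0 / RATE_PER_SEC           # one request every 10 ms
good = [1.0] * (RATE_PER_SEC * GOOD_SECONDS)  # 10,000 results of 1 ms
missed = RATE_PER_SEC * FREEZE_SECONDS        # requests due during the freeze

# 1. What a tester records if it waits for each response before sending
#    the next request: the freeze shows up as a single sample.
frozen_recorded = good + [FREEZE_SECONDS * 1000.0]

# 2. The same system if every scheduled request is counted: the request
#    due k intervals into the freeze waits until the freeze ends.
frozen_on_schedule = good + [
    FREEZE_SECONDS * 1000.0 - k * interval_ms for k in range(missed)
]

# 3. The "improved" system: never freezes, every request takes 5 ms.
improved = [5.0] * (RATE_PER_SEC * (GOOD_SECONDS + FREEZE_SECONDS))

# 4. "Cheating twice": after waking up, the tester issues the 9,999
#    requests it is behind on, and each one gets a perfect 1 ms answer.
cheating_twice = frozen_recorded + [1.0] * (missed - 1)

for name, data, pct in (
    ("frozen, as recorded", frozen_recorded, 99.99),
    ("frozen, on schedule", frozen_on_schedule, 99.99),
    ("improved", improved, 99.99),
    ("cheating twice", cheating_twice, 99.995),
):
    avg = sum(data) / len(data)
    print(f"{name:20s} avg={avg:10.1f} ms  p{pct}={percentile(data, pct)} ms")
```

With these assumptions the recorded data makes the frozen system look five times better than the improved one at 99.99% (1 ms against 5 ms), while counting every scheduled request puts the frozen system's 99.99th percentile near 100 seconds and its average near 25 seconds. The catch-up variant reads 1 ms all the way to 99.995%.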
411 99:59:59,999 --> 99:59:59,999 And like I said... 412 99:59:59,999 --> 99:59:59,999 That should've been up there before. 413 99:59:59,999 --> 99:59:59,999 How did this repeat itself? 414 99:59:59,999 --> 99:59:59,999 Did I create a loop in the presentation somehow? 415 99:59:59,999 --> 99:59:59,999 I don't know how to do that. 416 99:59:59,999 --> 99:59:59,999 Let's see if I can get through here. 417 99:59:59,999 --> 99:59:59,999 Hopefully editing later will take it out. 418 99:59:59,999 --> 99:59:59,999 So we have the cheats twice. 419 99:59:59,999 --> 99:59:59,999 There, okay. 420 99:59:59,999 --> 99:59:59,999 So, after we look at coordinated omission that way, 421 99:59:59,999 --> 99:59:59,999 we should also look at response time, and service time. 422 99:59:59,999 --> 99:59:59,999 Coordinated omission, what it really is achieving for you, unfortunately, 423 99:59:59,999 --> 99:59:59,999 is that it makes something that you think is response time, and only shows you 424 99:59:59,999 --> 99:59:59,999 the service time component of latency. 425 99:59:59,999 --> 99:59:59,999 This is a simple depiction of what service time and response times are. 426 99:59:59,999 --> 99:59:59,999 This guy is taking a certain amount of time to take payment 427 99:59:59,999 --> 99:59:59,999 or make a cup of coffee. 428 99:59:59,999 --> 99:59:59,999 That's service time. 429 99:59:59,999 --> 99:59:59,999 How long does it take to do the work? 430 99:59:59,999 --> 99:59:59,999 This person has experienced the response time, 431 99:59:59,999 --> 99:59:59,999 which includes the amount of time they have to wait before they 432 99:59:59,999 --> 99:59:59,999 get to the person that does the work. 433 99:59:59,999 --> 99:59:59,999 And the difference between those two is immense. 
434 99:59:59,999 --> 99:59:59,999 The coordinated omission problem makes something that you think is 435 99:59:59,999 --> 99:59:59,999 response time, only measure the service time, 436 99:59:59,999 --> 99:59:59,999 and basically hide the fact that things stalled, waited in line, 437 99:59:59,999 --> 99:59:59,999 that this guy might've taken a lunch break, 438 99:59:59,999 --> 99:59:59,999 and now we have a line around the building three times. 439 99:59:59,999 --> 99:59:59,999 Service time stays the same. 440 99:59:59,999 --> 99:59:59,999 This is the backwards part... 441 99:59:59,999 --> 99:59:59,999 Now, let's look at what it actually looks like. 442 99:59:59,999 --> 99:59:59,999 In a load generator that I fixed, I measured both 443 99:59:59,999 --> 99:59:59,999 response time and service time, 444 99:59:59,999 --> 99:59:59,999 this happens to be Cassandra, 445 99:59:59,999 --> 99:59:59,999 at a very low load. 446 99:59:59,999 --> 99:59:59,999 And you can see that they're very very similar, at a very low load. 447 99:59:59,999 --> 99:59:59,999 Why? Because there's nobody in line. 448 99:59:59,999 --> 99:59:59,999 This thing is really fast. 449 99:59:59,999 --> 99:59:59,999 We're not asking for too much. 450 99:59:59,999 --> 99:59:59,999 Cassandra's pretty fast, so they're the same. 451 99:59:59,999 --> 99:59:59,999 But if I increase the load, we start seeing gaps. 452 99:59:59,999 --> 99:59:59,999 If I increase the load a little more, the gap grows. 453 99:59:59,999 --> 99:59:59,999 If I increase the load a little more, the gap grows. 454 99:59:59,999 --> 99:59:59,999 Now this is not the failure point yet. 455 99:59:59,999 --> 99:59:59,999 If I actually increase it all the way past the point where the system 456 99:59:59,999 --> 99:59:59,999 can't even do the work I want, service time stays the same, 457 99:59:59,999 --> 99:59:59,999 response time goes through the roof.
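[Editor's sketch] The gap between service time and response time can be illustrated with a toy single-server queue. This is not the Cassandra measurement from the talk: the 1 ms service time, the load levels, the random arrival pattern and the function name `simulate` are all hypothetical, chosen only to show the shape of the effect.

```python
import random

def simulate(arrivals_per_sec, service_time_s, n_requests, seed=1):
    """One first-come-first-served server. Returns
    (mean service time, mean response time), both in seconds."""
    rng = random.Random(seed)
    arrival = 0.0          # when the current request shows up
    server_free_at = 0.0   # when the server finishes its previous request
    total_service = 0.0
    total_response = 0.0
    for _ in range(n_requests):
        arrival += rng.expovariate(arrivals_per_sec)  # random gap between arrivals
        start = max(arrival, server_free_at)          # wait in line if the server is busy
        finish = start + service_time_s
        server_free_at = finish
        total_service += finish - start               # time spent doing the work
        total_response += finish - arrival            # time in line plus the work
    return total_service / n_requests, total_response / n_requests

SERVICE_TIME_S = 0.001  # hypothetical: the server can do 1,000 requests/sec

# Loads below, near, and past the server's capacity.
results = [(load, *simulate(load, SERVICE_TIME_S, 50_000))
           for load in (100, 500, 900, 1100)]

for load, service_s, response_s in results:
    print(f"load {load:5d}/s  service {service_s * 1000:7.3f} ms  "
          f"response {response_s * 1000:10.3f} ms")
```

In this sketch the service time stays at 1 ms at every load, while the mean response time sits close to it at low load, pulls away as load rises, and keeps growing once the offered load passes what the server can do.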
458 99:59:59,999 --> 99:59:59,999 This was when it was 100 and something milliseconds, now it's 7 and a half seconds. 459 99:59:59,999 --> 99:59:59,999 Why 7 and a half seconds? 460 99:59:59,999 --> 99:59:59,999 Cause you're waiting in line that long to go around the block. 461 99:59:59,999 --> 99:59:59,999 The guy just can't serve as many people as are showing up in line, you fall behind. 462 99:59:59,999 --> 99:59:59,999 This is a virtual world reaction to this. 463 99:59:59,999 --> 99:59:59,999 I really like this slide, it's where I came up with the notion of a blue/red pill. 464 99:59:59,999 --> 99:59:59,999 When you actually measure reality, people tend to have this reaction when 465 99:59:59,999 --> 99:59:59,999 they compare the two. 466 99:59:59,999 --> 99:59:59,999 And if we actually look at these on the two sides of a collapse point of a system, 467 99:59:59,999 --> 99:59:59,999 this specific system can only do 87,000 things a second. 468 99:59:59,999 --> 99:59:59,999 No matter how hard you press it, that's all it'll do. 469 99:59:59,999 --> 99:59:59,999 The service time on the two sides of the collapse looks virtually identical, 470 99:59:59,999 --> 99:59:59,999 which it would. 471 99:59:59,999 --> 99:59:59,999 But if you compare the response time, you have a very different picture. 472 99:59:59,999 --> 99:59:59,999 And I'm showing this picture so you get a feeling for what to look at 473 99:59:59,999 --> 99:59:59,999 on whether or not you're measuring the right one. 474 99:59:59,999 --> 99:59:59,999 Whenever you push, you try and push load beyond what the system can do, 475 99:59:59,999 --> 99:59:59,999 you are falling behind over time. 476 99:59:59,999 --> 99:59:59,999 This is a 250 second run, 477 99:59:59,999 --> 99:59:59,999 where at the end of it you are waiting for 8 seconds in line. 478 99:59:59,999 --> 99:59:59,999 Why? Because for every second that goes by, there are 479 99:59:59,999 --> 99:59:59,999 3,000 more things that are added to the line. 
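[Editor's sketch] The "falling behind" arithmetic can be written out directly. This is a back-of-the-envelope model: the 87,000 per second capacity, the 3,000 per second backlog growth and the 250 second run are taken from the talk, while the 90,000 per second offered load is inferred from those two numbers, and the function name `wait_in_line_s` is illustrative.

```python
CAPACITY_PER_SEC = 87_000   # what the system can actually complete
OFFERED_PER_SEC = 90_000    # assumption: capacity plus the 3,000/s added to the line
RUN_SECONDS = 250

def wait_in_line_s(elapsed_s):
    """Seconds a request arriving at elapsed_s waits before service starts,
    assuming the backlog drains at exactly the system's capacity."""
    backlog = max(0, OFFERED_PER_SEC - CAPACITY_PER_SEC) * elapsed_s
    return backlog / CAPACITY_PER_SEC

for t in range(0, RUN_SECONDS + 1, 50):
    backlog = (OFFERED_PER_SEC - CAPACITY_PER_SEC) * t
    print(f"t={t:3d}s  backlog={backlog:7d} requests  wait={wait_in_line_s(t):5.2f}s")
```

The wait grows in a straight line with elapsed time and reaches roughly 8.6 seconds by the end of the run, which is in the same range as the 8 seconds quoted above. Setting the offered load at or below capacity makes the wait zero at every point in this model.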
480 99:59:59,999 --> 99:59:59,999 The interesting thing that happens when you cross the threshold limit, 481 99:59:59,999 --> 99:59:59,999 or capability of the system, is that response time grows over time linearly. 482 99:59:59,999 --> 99:59:59,999 It doesn't happen if you're below. 483 99:59:59,999 --> 99:59:59,999 Only if you're above. 484 99:59:59,999 --> 99:59:59,999 It's the point where that happens, and any load generator that doesn't show 485 99:59:59,999 --> 99:59:59,999 that line when you try pushing harder than you can, is lying to you. 486 99:59:59,999 --> 99:59:59,999 It's a simple sanity check. 487 99:59:59,999 --> 99:59:59,999 If your load generator shows you that, it didn't push. 488 99:59:59,999 --> 99:59:59,999 Or it pushed, but it didn't report correctly, 489 99:59:59,999 --> 99:59:59,999 whichever it is. 490 99:59:59,999 --> 99:59:59,999 If we draw that to scale... 491 99:59:59,999 --> 99:59:59,999 Just to make sure, this was not to scale, this is the scale, I just zoomed in 492 99:59:59,999 --> 99:59:59,999 so you could see that it was relatively stable. 493 99:59:59,999 --> 99:59:59,999 So... I don't know what happened to the order of the slides. 494 99:59:59,999 --> 99:59:59,999 It's like looping and randoming. 495 99:59:59,999 --> 99:59:59,999 There's some conspiracy going on there. 496 99:59:59,999 --> 99:59:59,999 Now, latency doesn't live on its own. 497 99:59:59,999 --> 99:59:59,999 You do need to look at latency in the context of load. 498 99:59:59,999 --> 99:59:59,999 Cause as I showed you, as you're nearly idle, things are nearly perfect. 499 99:59:59,999 --> 99:59:59,999 Even these mistakes won't show up. 500 99:59:59,999 --> 99:59:59,999 But as you start pressing, things start cracking or behaving differently.
501 99:59:59,999 --> 99:59:59,999 And usually when you want to know how much your system can handle, 502 99:59:59,999 --> 99:59:59,999 the answer is not 87,000 things a second, because nobody wants the 503 99:59:59,999 --> 99:59:59,999 response time that comes with that. 504 99:59:59,999 --> 99:59:59,999 It's how many things can I handle so that I don't get angry phone calls. 505 99:59:59,999 --> 99:59:59,999 So I do get my bonus, and so my company stays above ground. 506 99:59:59,999 --> 99:59:59,999 This is not sustainable speed. 507 99:59:59,999 --> 99:59:59,999 Running this experiment is really interesting with software, 508 99:59:59,999 --> 99:59:59,999 because it actually doesn't hurt, but spending the next 6 months of your time 509 99:59:59,999 --> 99:59:59,999 repeating this experiment, trying to change the shape of the bumper 510 99:59:59,999 --> 99:59:59,999 every time you hit the thing is a waste of your time. 511 99:59:59,999 --> 99:59:59,999 Your goal when you're trying to figure out sustainable speed throughput, 512 99:59:59,999 --> 99:59:59,999 whatever it is, is to see how fast you can go without this happening, 513 99:59:59,999 --> 99:59:59,999 and then to try and engineer to improve that. 514 99:59:59,999 --> 99:59:59,999 Meaning, can I make it go faster without this happening? 515 99:59:59,999 --> 99:59:59,999 Measuring what happens after you hit the pole is useless for that exercise. 516 99:59:59,999 --> 99:59:59,999 The only thing that matters about hitting the pole, is that you hit the pole. 517 99:59:59,999 --> 99:59:59,999 When you go and study the behavior of latency, at saturation, 518 99:59:59,999 --> 99:59:59,999 you are doing this. 519 99:59:59,999 --> 99:59:59,999 You're looking at this and saying, "That bumper, I don't like the shape of that. 520 99:59:59,999 --> 99:59:59,999 Let's measure it closely and do this 100 times to see if we can vary it." 
521 99:59:59,999 --> 99:59:59,999 That's what it means to look at latency at saturation, 522 99:59:59,999 --> 99:59:59,999 and repeat, and repeat, and change, and tune, and see if you can do it again. 523 99:59:59,999 --> 99:59:59,999 If you're pressing it to the wall, it should look like this. 524 99:59:59,999 --> 99:59:59,999 And it shouldn't be a surprise that it's a 7 and a half second response time. 525 99:59:59,999 --> 99:59:59,999 In fact, if it's not, something is terribly wrong with what you're measuring. 526 99:59:59,999 --> 99:59:59,999 You should look at that instead. 527 99:59:59,999 --> 99:59:59,999 So don't do this. 528 99:59:59,999 --> 99:59:59,999 Try to minimize the number of times that you actually run red cars 529 99:59:59,999 --> 99:59:59,999 into poles in your testing. 530 99:59:59,999 --> 99:59:59,999 I'm not saying don't do it, but use it to establish the end. 531 99:59:59,999 --> 99:59:59,999 And then you need to test all the speeds, and we need to see when you hit the pole. 532 99:59:59,999 --> 99:59:59,999 Maybe you hit the pole at 100 mph, but maybe you also hit the pole at 70 mph. 533 99:59:59,999 --> 99:59:59,999 Maybe you don't hit it at 20. 534 99:59:59,999 --> 99:59:59,999 We should find out how fast is safe. 535 99:59:59,999 --> 99:59:59,999 When you have data, you can compare it like this. 536 99:59:59,999 --> 99:59:59,999 This is what I would say is a recommended way to look at it. 537 99:59:59,999 --> 99:59:59,999 Plot requirements, that's the hitting the pole. 538 99:59:59,999 --> 99:59:59,999 And some things hit the pole, and some things don't. 539 99:59:59,999 --> 99:59:59,999 And you run different scenarios, different loads, 540 99:59:59,999 --> 99:59:59,999 different configurations, 541 99:59:59,999 --> 99:59:59,999 different settings, 542 99:59:59,999 --> 99:59:59,999 and see what works, and what doesn't. 543 99:59:59,999 --> 99:59:59,999 Your goal is to stay here, and carry more while staying there.
544 99:59:59,999 --> 99:59:59,999 Usually. 545 99:59:59,999 --> 99:59:59,999 It's very useful for figuring out how many machines I need to carry a certain thing. 546 99:59:59,999 --> 99:59:59,999 If you don't know this, you don't know how many machines to deploy. 547 99:59:59,999 --> 99:59:59,999 Okay, I'm going to run through some comparisons of 548 99:59:59,999 --> 99:59:59,999 latency or response time behaviors between different configurations 549 99:59:59,999 --> 99:59:59,999 to show you some of the places people look, and some of the 550 99:59:59,999 --> 99:59:59,999 intuitive and non-intuitive things to do with them. 551 99:59:59,999 --> 99:59:59,999 The common thing, 552 99:59:59,999 --> 99:59:59,999 and again, this is that Cassandra thing, 553 99:59:59,999 --> 99:59:59,999 comparing two systems, A and B. 554 99:59:59,999 --> 99:59:59,999 I'll let you guess which one is A, and which one is B. 555 99:59:59,999 --> 99:59:59,999 It's two systems, and saying which is better, what can I do with this? 556 99:59:59,999 --> 99:59:59,999 And we're measuring here at two throughputs, 85 and 90k. 557 99:59:59,999 --> 99:59:59,999 As I said in here, 90k is past the capability of the system. 558 99:59:59,999 --> 99:59:59,999 You can sort of see it here. 559 99:59:59,999 --> 99:59:59,999 See, 85 for both of them is here, and 90k is here. 560 99:59:59,999 --> 99:59:59,999 So you could look at this and say, 561 99:59:59,999 --> 99:59:59,999 "Look, when the car hits the pole, the blue system is better. It's half as bad