Hi everyone, I'm Gil Tene. I'm going to be talking about this subject that I call "How NOT to Measure Latency". It's a subject I've been talking about for three years or so now. I keep the title and change all the slides every time, and a bunch of this stuff is new, so if you've seen any of my previous "How NOT to" talks, you'll see only some things that are common. A nickname for the subject is this... because I often get that reaction from some people in the audience. Ever since I've told people that it's a nickname, they feel free to actually exclaim, "Oh S@%#!". Feel free to do that here in this talk. I'll prompt you in a couple of places where it is natural, but if you just have the urge, go ahead.

So, just a tiny bit about me. I am the co-founder of Azul Systems, and I play around with garbage collection a lot. Here is some evidence of me playing around with garbage collection in my kitchen. That's a trash compactor. The compaction function wasn't working right, so I had to fix it, and I thought it would be funny to take a picture with a book. I've also built a lot of things. I've been playing with computers since the early 80's. I've built hardware, I've helped design chips, and I've built software at many different levels: operating systems, drivers, JVMs obviously, and lots of big systems at the system level. We built our own app server in the late 90's because WebLogic wasn't around yet. So I've made a lot of mistakes, and I've learned from a few of them. This talk is actually a combination of a bunch of those mistakes in looking at latency. I do have this hobby of depressing people by pulling the wool up from over their eyes, and that is what this talk is about.

So, I need to give you a choice right here. There's the door. You can take the blue pill, and you can leave; tomorrow you can keep believing whatever it is you want to believe. But if you stay here and take the red pill, I will show you a glimpse of how far down the rabbit hole goes, and it will never be the same again.

Let's talk about latency. When I say latency, I'm talking about latency, response time, any of those things where you measure time from 'here to here' and you're interested in how long it took. We do this all the time, but I see a lot of mish-mash in how people treat the data, or think about it. Latency is basically the time it took something to happen once. That one time, how long did it take? And when we measure stuff, like "we did a million operations in the last hour", we have a million latencies. Not one, we have a million of them. Our actual goal is to figure out how to describe that million. How did the million behave? For example, 'they're all really good, and they're all exactly the same' would be a behavior that you will never see, but it would be a great behavior. So we need to be able to talk about how things behave: communicate, think, evaluate, set requirements, talk to other people. These are all common things we do around that. To do that, we have to describe the distribution, the set, the behavior, not the one. For example, the statement "the common case was X" is a piece of information about the behavior, but it's a tiny sliver. Usually the least relevant one. Well, there are some less relevant ones, but it's not a strongly relevant one, and it's one that people often focus on.

To take a look at what we actually do with this stuff, almost on a daily basis, this is a snapshot from a monitoring system. A small dashboard on a big screen in a monitoring system,
where you're watching the response time of a system over time. This is a two-hour window. These lines are the 95th, 90th, 75th, 50th, and 25th percentiles, and you can look at how they behave over time. We're a small audience here: if you look at this picture, what draws your eye? What do you want to go investigate here, or pay attention to? It's the big red spike there, right? So we could look at the red spike, because it's different, and say, "Whoa, the 95th percentile shot up here. And look, the 90th percentile shot up at about the same time. The rest of them didn't shoot up, so maybe something happened here that affected that much of the spectrum. I should probably pay attention to it, because it's a monitoring system and I like things to be calm." You could go investigate the why.

At this point, I've managed to waste about 90 seconds of your life looking at a completely meaningless chart, which unfortunately you do every day, all the time. This chart is the chart you want to show somebody if you want to hide the truth from them. If you want to pull the wool over their eyes. This is the chart of the good stuff. What's not on this chart? The 5% worst things that happened during this two hours. They're not here. This is only the good things that happened during that time. And to get this spike, that 5% had to be so bad that it even pulled the 95th percentile up. There is zero information here at all about what happened that was bad during these two hours, which makes it a bad fit for a monitoring system. It's a really good thing for a marketing system. It's a great way to get the bonus from your boss even though you didn't do the work. If you want to learn how to do that, we can do another talk about that. But this is not a good way to look at latency; it's the opposite of good. Unfortunately, this is one of the most common tools used for server monitoring on earth right now. That's where the snapshot is from, and this is what people look at.

I find this chart to be a goldmine of information. When I first showed it in another talk like this, I had this really cool experience. Somebody came up to me and said, "Hey, as I was sitting here, I was texting one of our guys, and he was saying, 'look, we have this issue with our 95th percentile', and I got this chart from him!" So I went and said, "Hey, what does the rest of the spectrum look like?" This is the actual chart they got. And when they looked at the rest of the spectrum, it looked like that. That's what was hiding. Notice that the scales are a little different: that yellow line is that yellow line. So that's a much more representative number. Is it? Is that good enough? That's the 99th percentile. We still have another 1% of really bad stuff hiding above the blue line. I wonder how big that is? I don't know, because he didn't have the data.

So a common problem we have is that we only plot what's convenient. We only plot what gives us nice, colorful graphs. And often, when we have to choose between the stuff that hides the rest of the data and the stuff that is noise, we choose the noise to display. I like to rant about latency. This is from a blog that I don't write in enough, but the format for it is simple: I tweet a single tweet about latency, a "latency tip of the day", and then I rant about my own tweet. As an example, this chart is a goldmine of information because it has so many different things wrong in it, but we won't get into all of them; you can read it online. Anyway, this is the one to take away from what we just said.
If you are not measuring and showing the maximum value, what is it you are hiding? And from whom? If your job is to hide the truth from others, this is a good way to do it. But if you actually are interested in what's going on, the number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.

Okay, let's look at this chart for some more cool stuff. I'm going to zoom in to a small part of the chart, and ask you what that means. What does the average of the 95th percentile over two hours mean? What is the math that does that? What does it do? Let's look at that, and I'll give you an example with another percentile: the 100th percentile. The max, right? Let's take a data set. Suppose this was the maximum every minute for 15 minutes. What does it mean to say that the average max over the last 15 minutes was 42? I specifically chose the data to make that happen. It's a meaningless statement. A completely meaningless statement. But when you see "95th percentile, average 184", you think that the 95th percentile for the last two hours was around 184. It makes you think that. Putting this on a piece of paper is not just noise and irrelevant, it's a way to mislead people. It's a way to mislead yourself, because you'll start to believe your own mistruths. This is true for any percentile. There is no percentile that you can do this math on. Another tip: you cannot average percentiles. That math doesn't work.

But percentiles do matter. You really want to know about them. And a common misperception is that we want to look at the main part of the spectrum, not those outliers and perfection stuff; only people who actually bet their house, or the bank, on it every day need to know about the five 9's and all that. The 99th percentile is a pretty good number. Is 99% really rare? Let's look at some stuff, because we can ask questions like, "If I were looking at a webpage, what is the chance of me hitting the 99th percentile?" Of things like this: a search engine node, or a key-value store, or a database, or a CDN, right? Because they will report their 99th percentile. They won't tell you anything above that. But how many of the webpages we go to actually experience this? You want to say 1%, right? Well, I went to some webpages and I counted how many HTTP requests were generated by one click into that webpage, and here are the numbers. I did that count about a year ago; they've probably gone up since then. Now that translates into this math. This is the likelihood of one click seeing the 99th percentile, and the only page where that is less than 50% is the clean Google search page, where only a quarter will see the 99th percentile. The 99th percentile is the thing that most of your webpage views will see. Most of them will be there.

Now, we could look at other things. We can pick which things to focus on. Let's say I had to pick between the 95th percentile and the three 9's (99.9%). The three 9's is way into perfection mode for most people, or so they think. Which one of those represents our community better? Our population? Our users? Our experience? Let's run a hypothetical. Suppose we don't have that many pages, and not that many resources like we said before; we'll be much more conservative. A user session will only go through five clicks, and each click will only bring up at most 40 things. A lot less, and they're all as clean as the Google page. How many of the users will not experience something worse than the 95th percentile?
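Before giving the answer, it's worth seeing how little arithmetic is involved. Here is a minimal sketch (plain Java; the request and click counts are just the assumptions of the hypothetical above, and requests are treated as independent) that reproduces the numbers discussed next:

```java
public class PercentileOdds {
    public static void main(String[] args) {
        // Assumptions from the hypothetical: 5 clicks per session, up to 40 requests per click.
        int requestsPerClick = 40;
        int clicksPerSession = 5;
        int requestsPerSession = requestsPerClick * clicksPerSession;   // 200

        // Chance that at least one of a click's requests lands at or above the 99th percentile.
        double clickSeesP99 = 1.0 - Math.pow(0.99, requestsPerClick);
        System.out.printf("One %d-request click sees the 99th percentile: %.1f%%%n",
                requestsPerClick, clickSeesP99 * 100);

        // Chance a whole session never sees worse than the 95th / 99.9th percentile.
        double cleanAt95  = Math.pow(0.95,  requestsPerSession);
        double cleanAt999 = Math.pow(0.999, requestsPerSession);
        System.out.printf("Session never worse than the 95th percentile:   %.4f%%%n", cleanAt95 * 100);
        System.out.printf("Session never worse than the 99.9th percentile: %.1f%%%n", cleanAt999 * 100);

        // Which percentile do 95% (or 99%) of sessions stay within?
        double pctFor95 = Math.pow(0.95, 1.0 / requestsPerSession) * 100;
        double pctFor99 = Math.pow(0.99, 1.0 / requestsPerSession) * 100;
        System.out.printf("95%% of sessions stay within the %.2fth percentile%n", pctFor95);
        System.out.printf("99%% of sessions stay within the %.3fth percentile%n", pctFor99);
    }
}
```

Running it prints roughly the figures quoted next: about a third of 40-request clicks touch the 99th percentile, only a tiny fraction of a percent of 200-request sessions stay at or below the 95th percentile, and you have to go out past the 99.97th percentile before 95% of sessions are covered. Treating requests as independent is a simplification, but it is the same back-of-the-envelope assumption behind these figures.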
Because that's what the 95th percentile is good for: the people who see that or better. What are the chances of never seeing worse than it? That's an interesting number. So you're watching a number that is relevant to 0.003% of your users. 99.997% of your users are going to see worse than this number. Why are you looking at it? Why are you spending time thinking about it? In reverse, we could ask how many people are going to see something worse than the three 9's (99.9%). That's going to be 18%; put the other way, 82% of the people will see the three 9's (99.9%) or better. That's a slightly better representation. Probably not good enough either. We could look at some more math with this, same kind of scenario. What percentile of HTTP response time will be the thing that 95% of people experience in this scenario? It's the 99.97th percentile that 95% of people see. And if you want to know what 99% of the people see, that's four and a half 9's (99.995%). That's the number you want to know from Akamai if you want to predict what 1% of your users are going to experience. When you know the 99th percentile, you kind of know a tiny bit.

So here's another tip. And this is not an exaggeration, by the way. The median, which is a much lower percentile, has that minuscule a chance of ever being the worst thing anybody experiences. That number is the chance of not seeing something worse than the median, which makes the median an irrelevant number to look at. Unfortunately, it's probably the most commonly looked-at one. When people say "the typical", they're looking at the thing that everything will be worse than. Okay, I'm sorry about that part. We'll do some other parts.

Now, why is it that when we look at these monitoring systems, we don't see data with a lot of 9's? Why do we stop at the 90th, 95th, 99th percentile? Why don't we look further? Some of it is because people think, "Well, that's perfection, I don't need it." The other part is that it's hard. It's hard because you can't average percentiles; we already talked about that. But you also can't derive your five 9's (99.999%) out of a lot of 10-second samples of percentiles. And the reason for that is: hey, in 10 seconds, maybe I only had 1,000 things. I could take all the 10-second summaries in the world, and there's no way to say what the hour's five 9's (99.999%) were, or the minute's five 9's, if I'm collecting just this data. And unfortunately, the data being collected and reported to the back ends of monitoring systems is usually summarized at a second, 5 seconds, 10 seconds, etc., basically throwing away all the good data, and leaving you with absolutely no way to compute large 9's for longer periods of time.

So, this is where you might want to look at HdrHistogram. It's an open source thing I created a few years ago. I did it in Java, and now there are C, C#, Python, Erlang, and Go ports of it that I didn't create. It lets you actually get an entire percentile spectrum. Some of you here, I know, are already using it. You can look at all the percentiles, any number of 9's that's in the data, if you just keep it right and report it right. It's got a log format, so you can store things forever. Well, for a long time. Okay, so it lets you have nice things. Enough of that advertisement.

Now, latency... Well, I think this is slightly out of order. Yeah, sorry. This is the red/blue pill part, so I warn you, this is your last chance. There's a problem I call the coordinated omission problem. The coordinated omission problem is basically a conspiracy.
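Before getting into coordinated omission, here is a concrete illustration of the point about summaries versus histograms: 10-second percentile summaries cannot be combined into an hour's percentiles, but full histograms can simply be added together. A minimal sketch assuming HdrHistogram's Java API, with made-up latency values standing in for real measurements:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.HdrHistogram.Histogram;

public class CombiningHistograms {
    public static void main(String[] args) {
        // One histogram per 10-second reporting interval, one for the whole hour.
        // Bounds: track values up to 1 hour (in ns) at 3 significant decimal digits.
        Histogram interval = new Histogram(3_600_000_000_000L, 3);
        Histogram wholeRun = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 360; i++) {            // 360 ten-second intervals = 1 hour
            interval.reset();
            for (int j = 0; j < 1_000; j++) {      // made-up per-interval samples
                long latencyNs = ThreadLocalRandom.current().nextLong(1_000_000, 5_000_000);
                interval.recordValue(latencyNs);
            }
            // A 10-second percentile summary could NOT be merged into hour-level
            // percentiles, but the histograms themselves can simply be added.
            wholeRun.add(interval);
        }

        // Any number of 9's can now be read off the hour's worth of data.
        System.out.println("hour p99.999 (ms): "
                + wholeRun.getValueAtPercentile(99.999) / 1_000_000.0);

        // The full percentile spectrum is also available, e.g. for plotting.
        wholeRun.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}
```

The hour's five 9's fall out of the combined histogram because nothing was thrown away; the log format mentioned above is how the same per-interval histograms get stored and replayed later.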
The coordinated omission problem is a conspiracy that we're all part of. I don't think anybody actually meant to do it, but once I noticed it, everywhere I look, there it is. Now, I've been using a specific way of showing you numbers so far. Has anybody here noticed how I spell percentile? (Audience member): "You put 'lie' at the end of the percent sign." Yeah, good. The coordinated omission problem is the "lie" in %'lies. And this is how it works.

One common way to do this is to use a load generator. Pretty much all load generators have this problem; there are two that I know of that don't. What you do with a load generator is you test. You issue requests, or send packets, and you measure how long something took. And as long as the numbers come back fine, you measure them, put them in a bucket, study them later, and get your percentiles from them. But what if the thing you are measuring took longer than the time at which you were supposed to send the next thing? You're supposed to send something every second, but this one took a second and a half. Well, you've got to wait before you send the next one. You just avoided measuring something while the system was problematic. You've coordinated with it. You weren't looking at it then. That's common scenario A: you've backed off, and avoided measuring when it was bad.

Another way is to measure inside your code. We all do this. We all have to do this: we measure the time, do something, then measure the time again. The delta between them is how long it took. We can then put it in a stats bucket, and do the percentiles on that. Unfortunately, if the system freezes right here, for any reason, an interrupt, a context switch, a cache buffer flushed to disk, a garbage collection, a re-indexing of your database (this is a database, this is Cassandra by the way, measuring itself), then you will have one bad report while 10,000 things are waiting in line. And when they come in, they will look really, really good, even though each one of them has had a really bad experience. It can even get worse: maybe the freeze happened outside the timing window, and you won't even know there was a freeze.

Now, these are examples of omitting bad data on a very selective basis. It's not random sampling. It's "I don't like bad data", or "I couldn't handle it", or "I don't know about it", so we'll just talk about the good. What does that do to your data? Because people often feel like, "Okay, yeah, I understand, but it's a little bit of noise." Let's run some hypotheticals, and I'll show you some real numbers. Imagine a perfect system. It's doing 100 requests a second, at exactly one millisecond each. But we go and freeze the system for 100 seconds after every 100 seconds of perfect operation, and then repeat. Now, I'm going to describe how the system behaves in terms that should mean something, and then we'll measure it. If we actually wanted to describe the system: on the left, the first 100 seconds, we have an average of one millisecond, and on the right, the frozen 100 seconds, we have an average of 50 seconds. Why 50? Because if I randomly came in during that 100 seconds, I'd get anything from 0 to 100 with an even distribution. The overall average over the 200 seconds is 25 seconds. If I just came in here and said, "Surprise, how long did this take?", on average it will be 25. I can also do the percentiles. The 50th percentile will be really good, and then it gets really bad. The four 9's is terrible. This is a fair, honest description of this system, if this is what it did. And you can make the system do that.
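For concreteness, here is a minimal sketch of the self-timing pattern just described (hypothetical names; doWork() stands in for whatever is being measured, and HdrHistogram is used as the stats bucket). The comments mark the two failure modes: a stall inside the timed window produces one bad sample while everything queued behind it records as fast, and a stall between iterations never shows up at all:

```java
import org.HdrHistogram.Histogram;

public class NaiveSelfTiming {
    // Hypothetical stand-in for the operation being measured (a query, a request, ...).
    static void doWork() { /* ... */ }

    public static void main(String[] args) {
        // Track latencies up to 1 hour (in nanoseconds) at 3 significant digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            doWork();                       // a long freeze here becomes ONE bad sample,
                                            // and the requests stuck in line behind it are
                                            // recorded afterwards as if they were fast
            long end = System.nanoTime();
            histogram.recordValue(end - start);
            // a freeze HERE, outside the timed window, is never recorded at all;
            // a load generator that waits for this reply before issuing the next
            // request backs off in exactly the same way when the system is bad
        }

        System.out.println("reported p99.99 (ms): "
                + histogram.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```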
You can make any of your systems do that; that's what Ctrl-Z is good for. Now let's go measure this system with a load generator, or with a monitoring system. The common ones. The ones everybody uses. On the left, we're going to get 10,000 results of one millisecond each. Great. And we're going to get one result of 100 seconds. Wow, really big response time. This is our data. This is OUR data. So now you go do math with it. The average of that is 10.9 milliseconds. A little less than 25 seconds. And here are the percentiles. Your load generator or monitoring system will tell you that this system is perfect. You could go to production with it. You like what you see. Look at that, four 9's. It is lying to you. To your face. And you can catch it doing that with a Ctrl-Z test. But people tend not to want to do that, because then what are they going to do? If you just do that test, and calibrate your system, and you find it telling you that about a system like this, the next step should be to throw all the numbers away. Don't believe anything else it says. If it lies this big, what else did it do? Don't waste your time on numbers from uncalibrated systems.

Now, the problem here was that if you want to measure the system, you have to keep measuring at random times or at a steady rate, regardless of how it responds. If you measured 10,000 things in the first 100 seconds, there should be another 10,000 things here. If you measured them, you would have gotten all the right numbers. Coordinated omission is the simple act of erasing all that bad stuff. The conspiracy here is that we all do it without meaning to. I don't know who put that in our systems, but it happens to all of us.

Now, I often get people saying, "Okay, I get it. All the numbers are wrong, but at least for my job, where I tune performance and try to make things faster, I can use the numbers to figure out if I'm going in the right direction. Is it better, or is it worse?" Let me dispel that for you for a second. Suppose I went and took this system and improved it dramatically. Rather than freezing for 100 seconds, it will now answer every question. Each answer will take a little longer, 5 milliseconds instead of one, but that's much better than freezing, right? So let's measure the system that we spent weeks and weeks improving, and see if it's better. That's the data. If we do the percentiles, it'll tell us that we just really hurt the four 9's. We made it go 5 times worse than before. We should revert this change and go back to that much better system we had before. So this is just to make sure that you don't think you can have any intuition based on any of these numbers. They go backwards sometimes. You don't know which way is good or bad, and you'll never know which way is good or bad with a system that lies like that.

The other cool technique is what I call "cheating twice". You have a constant-rate load generator, and it needs to do 100 per second. When it wakes up after 200 seconds, it says, "Whoa, we're 9,999 behind. We've got to issue those requests." So it issues those requests. At this point, not only did it get rid of all the bad requests, it replaced every one of them with a perfect request, moving the four 9's (99.99%) all the way to four and a half 9's (99.995%). It's twice as wrong as just dropping them. So these are all cool things that happen to you. I'm not going to spend much time on how to fix these and avoid them; there's a lot of other material you can find with me talking about that, in longer talks. But this is pretty bad. And like I said... That should've been up there before.
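The talk defers the fixes to other material, but one partial mitigation is worth sketching, since it lives in the library already mentioned: HdrHistogram can back-fill the samples that coordinated omission erased, if you tell it the interval at which measurements were expected. A minimal sketch, with an assumed 10 ms expected interval and a single observed 2-second stall:

```java
import org.HdrHistogram.Histogram;

public class CorrectedRecording {
    public static void main(String[] args) {
        // Track up to 1 hour in nanoseconds, 3 significant digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        long expectedIntervalNs = 10_000_000L;    // assumption: one measurement every 10 ms
        long measuredLatencyNs  = 2_000_000_000L; // one observed 2-second stall

        // Records the 2 s value, plus the synthetic samples (2 s - 10 ms, 2 s - 20 ms, ...)
        // that a non-coordinating measurement would have seen while the system was stalled.
        histogram.recordValueWithExpectedInterval(measuredLatencyNs, expectedIntervalNs);

        System.out.println("p99.99 (ms): "
                + histogram.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```

This is only an approximation of what an uncoordinated measurement would have captured; measuring against the intended schedule, as sketched at the end of this section, is the more direct fix.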
How did this repeat itself? Did I create a loop in the presentation somehow? I don't know how to do that. Let's see if I can get through here. Hopefully editing later will take it out. So we have the cheating twice. There, okay.

So, after we look at coordinated omission that way, we should also look at response time and service time. What coordinated omission really achieves for you, unfortunately, is that it takes something you think is response time and only shows you the service time component of latency. This is a simple depiction of what service time and response time are. This guy is taking a certain amount of time to take payment or make a cup of coffee. That's service time: how long does it take to do the work? This person has experienced the response time, which includes the amount of time they had to wait before they got to the person who does the work. And the difference between those two is immense. The coordinated omission problem makes something that you think is response time only measure the service time, and basically hides the fact that things stalled and waited in line, that this guy might have taken a lunch break, and now we have a line around the building three times. The service time stays the same. This is the backwards part...

Now, let's look at what it actually looks like. In a load generator that I fixed, I measured both response time and service time. This happens to be Cassandra, at a very low load. And you can see that they're very, very similar at a very low load. Why? Because there's nobody in line. This thing is really fast, and we're not asking for too much. Cassandra's pretty fast, so they're the same. But if I increase the load, we start seeing gaps. If I increase the load a little more, the gap grows. If I increase it a little more, the gap grows again. Now, this is not the failure point yet. If I actually increase it all the way past the point where the system can't even do the work I'm asking for, service time stays the same and response time goes through the roof. This was 100-and-something milliseconds before; now it's 7 and a half seconds. Why 7 and a half seconds? Because you're waiting in line that long to get around the block. The guy just can't serve as many people as are showing up, so you fall behind. This is a virtual-world reaction to this. I really like this slide; it's where I came up with the notion of the blue/red pill. When you actually measure reality, people tend to have this reaction when they compare the two.

And if we actually look at these on the two sides of the collapse point of a system (this specific system can only do 87,000 things a second; no matter how hard you press it, that's all it'll do), the service time on the two sides of the collapse looks virtually identical, which it would. But if you compare the response time, you have a very different picture. And I'm showing this picture so you get a feeling for what to look at to tell whether or not you're measuring the right one.
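To tie the service-time/response-time distinction to code, here is a minimal sketch (hypothetical names, not the actual load generator from the talk) of a constant-rate tester that measures both. The sender works to a fixed schedule; service time is measured from when a request actually went out, while response time is measured from when it was supposed to go out, so queueing and stalls are charged to the results instead of silently deferring the next send:

```java
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

public class ConstantRateLoad {
    // Hypothetical stand-in for a synchronous request/response round trip.
    static void sendAndAwaitReply() { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        Histogram serviceTime  = new Histogram(3_600_000_000_000L, 3);
        Histogram responseTime = new Histogram(3_600_000_000_000L, 3);

        long intervalNs = 10_000_000L;                  // plan: 100 requests per second
        long intendedStart = System.nanoTime();

        for (int i = 0; i < 100_000; i++) {
            // Wait until the scheduled send time, but never push the schedule back.
            long now = System.nanoTime();
            if (intendedStart > now) {
                TimeUnit.NANOSECONDS.sleep(intendedStart - now);
            }

            long actualStart = System.nanoTime();
            sendAndAwaitReply();
            long done = System.nanoTime();

            serviceTime.recordValue(done - actualStart);    // how long the work itself took
            responseTime.recordValue(done - intendedStart);  // includes time spent "in line"

            intendedStart += intervalNs;                    // next slot on the fixed plan
        }

        System.out.println("service  p99.99 (ms): "
                + serviceTime.getValueAtPercentile(99.99) / 1_000_000.0);
        System.out.println("response p99.99 (ms): "
                + responseTime.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```

On a healthy system the two histograms look nearly identical; past the saturation point the response-time numbers keep growing while the service-time numbers barely move, which is exactly the gap shown in the last charts.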