Hi everyone, I'm Gil Tene. I'm going to be talking about this subject that I call "How NOT to Measure Latency". It's a subject that I've been talking about for 3 years or so. I keep the title and change all the slides every time. A bunch of this stuff is new. So if you've seen any of my previous "How NOT to" talks, you'll see only some things that are common. A nickname for the subject is this... Because I often will get that reaction from some people in the audience. Ever since I've told people that it's a nickname, they feel free to actually exclaim, "Oh S@%#!". And feel free to do that here in this talk. I'll prompt you in a couple of places where it is natural. But if you just have the urge, go ahead.

So just a tiny bit about me. I am the co-founder of Azul Systems. I play around with garbage collection a lot. Here is some evidence of me playing around with garbage collection in my kitchen. That's a trash compactor. The compaction function wasn't working right, so I had to fix it. I thought it'd be funny to take a picture with a book. I've also built a lot of things. I've been playing with computers since the early '80s. I've built hardware. I've helped design chips. I've built software at many different levels. Operating systems, drivers... JVMs, obviously. And lots of big systems at the system level. Built our own app server in the late '90s because WebLogic wasn't around yet. So, I've made a lot of mistakes, and I've learned from a few of them. This is actually a combination of a bunch of those mistakes looking at latency.

I do have this hobby of depressing people by pulling the wool up from over your eyes, and this is what this talk is about. So, I need to give you a choice right here. There's the door. You can take the blue pill, and you can leave. Tomorrow you can keep believing whatever it is you want to believe. But if you stay here and take the red pill, I will show you a glimpse of how far down the rabbit hole goes, and it will never be the same again.
Let's talk about latency. And when I say latency, I'm talking about latency, response time, any of those things where you measure time from 'here to here', and you're interested in how long it took. We do this all the time, but I see a lot of mish-mash in how people treat the data, or think about it. Latency is basically the time it took something to happen once. That one time, how long did it take. And when we measure stuff, like we did a million operations in the last hour, we have a million latencies. Not one, we have a million of them. Our actual goal is to figure out how to describe that million. How did the million behave? For example, 'they're all really good, and they're all exactly the same' would be a behavior that you will never see, but that would be a great behavior.

So we need to talk about how things behave: to communicate, think, evaluate, set requirements, and talk to other people about it. These are all common things we do around that. To do that, we have to describe the distribution, the set, the behavior, but not the one. For example, the behavior that says "the common case was x" is a piece of information about the behavior, but it's a tiny sliver. Usually the least relevant one. Well, there are some less relevant ones, but it's not a strongly relevant one, and it's one that people often focus on.

To take a look at what we actually do with this stuff, almost on a daily basis, this is a snapshot from a monitoring system. A small dashboard on a big screen in a monitoring system, where you're watching the response time of a system over time. This is a two hour window. These lines are the 95th, 90th, 75th, 50th, and 25th percentiles, and you can look at how they behave over time. We're a small audience here. If you look at this picture, what draws your eye? What do you want to go investigate here or pay attention to? It's the big red spike there, right? So we could look at the red spike, 'cause it's different, and say, "Whoa, the 95th percentile shot up here.
And look, the 90th percentile shot up at about the same time. The rest of them didn't shoot up, so maybe something happened here that affected just that much. I should probably pay attention to it, because it's a monitoring system, and I like things to be calm." You could go investigate the why.

At this point, I've managed to waste about 90 seconds of your life, looking at a completely meaningless chart, which unfortunately you do every day, all the time. This chart is the chart you want to show somebody if you want to hide the truth from them. If you want to pull the wool over their eyes. This is the chart of the good stuff. What's not on this chart? The 5% worst things that happened during these two hours. They're not here. This is only the good things that happened during those two hours. And to get this spike, that 5% had to be so bad that it even pulled the 95th percentile up.
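
To make the point about the missing 5% concrete, here is a minimal sketch in Python. It is not from the talk: the data is made up, the helper names (`percentile`, `summarize`) are hypothetical, and the nearest-rank percentile definition is an assumption chosen for simplicity. It computes the same percentile lines the dashboard shows for two invented workloads, one where 4% of operations stall for 5 seconds and one where 6% do.

```python
def percentile(ordered, pct):
    """Nearest-rank percentile of an already sorted list (pct is an integer 1..100)."""
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the dashboard's percentile lines plus the maximum it leaves out."""
    ordered = sorted(latencies_ms)
    summary = {f"p{p}": percentile(ordered, p) for p in (25, 50, 75, 90, 95)}
    summary["max"] = ordered[-1]
    return summary


def workload(total, stalled, stall_ms=5000):
    """Hypothetical data: fast operations take 1..10 ms, stalled ones take stall_ms."""
    fast = [1 + (i % 10) for i in range(total - stalled)]
    return fast + [stall_ms] * stalled


# 4% of operations stall for 5 seconds: every plotted line stays flat.
calm_looking = summarize(workload(total=1000, stalled=40))

# 6% stall: only now does the 95th percentile line spike.
spiky = summarize(workload(total=1000, stalled=60))

print(calm_looking)
print(spiky)
```

In the first workload, 40 out of every 1000 operations take 5 seconds, yet every line up to the 95th percentile stays at 10 ms or below; only the maximum, which the chart does not plot, gives it away. The 95th percentile line moves only once more than 5% of operations are bad, which is the situation the spike on the dashboard reflects.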