9:59:59.000,9:59:59.000
Hi everyone, I'm Gil Tene.

9:59:59.000,9:59:59.000
I'm going to be talking about this subject[br]that I call "How NOT to Measure Latency".

9:59:59.000,9:59:59.000
It's a subject that I've been talking[br]about now for 3 years or so.

9:59:59.000,9:59:59.000
I keep the title and change all[br]the slides every time.

9:59:59.000,9:59:59.000
A bunch of this stuff is new.

9:59:59.000,9:59:59.000
So if you've seen any of my previous "How NOT to",[br]you'll see only some things that are common.

9:59:59.000,9:59:59.000
A nickname for the subject is this...

9:59:59.000,9:59:59.000
Because I often will get that reaction[br]from some people in the audience.

9:59:59.000,9:59:59.000
Ever since I've told people that it's a[br]nickname,

9:59:59.000,9:59:59.000
they feel free to actually exclaim,[br]"Oh S@%#!".

9:59:59.000,9:59:59.000
And feel free to do that here in this talk.

9:59:59.000,9:59:59.000
I'll prompt you in a couple of places[br]where it is natural.

9:59:59.000,9:59:59.000
But if you just have the urge, go ahead.

9:59:59.000,9:59:59.000
So just a tiny bit about me.

9:59:59.000,9:59:59.000
I am the co-founder of Azul Systems.

9:59:59.000,9:59:59.000
I play around with garbage collection a lot.

9:59:59.000,9:59:59.000
Here is some evidence of me playing around[br]with garbage collection in my kitchen.

9:59:59.000,9:59:59.000
That's a trash compactor.

9:59:59.000,9:59:59.000
The compaction function wasn't working right,[br]so I had to fix it.

9:59:59.000,9:59:59.000
I thought it'd be funny to take a picture[br]with a book.

9:59:59.000,9:59:59.000
I've also built a lot of things.

9:59:59.000,9:59:59.000
I've been playing with computers since[br]the early 80's.

9:59:59.000,9:59:59.000
I've built hardware.

9:59:59.000,9:59:59.000
I've helped design chips.

9:59:59.000,9:59:59.000
I've built software at many [br]different levels.

9:59:59.000,9:59:59.000
Operating systems, drivers...[br]JVMs, obviously.

9:59:59.000,9:59:59.000
And lots of big systems at the system level.

9:59:59.000,9:59:59.000
Built our own app server in the late '90s[br]because WebLogic wasn't around yet.

9:59:59.000,9:59:59.000
So, I've made a lot of mistakes,[br]and I've learned from a few of them.

9:59:59.000,9:59:59.000
This is actually a combination of a bunch[br]of those mistakes looking at latency.

9:59:59.000,9:59:59.000
I do have this hobby of depressing people[br]by pulling the wool up from over your eyes,

9:59:59.000,9:59:59.000
and this is what this talk is about.

9:59:59.000,9:59:59.000
So, I need to give you a choice right here.

9:59:59.000,9:59:59.000
There's the door.

9:59:59.000,9:59:59.000
You can take the blue pill, [br]and you can leave.

9:59:59.000,9:59:59.000
Tomorrow you can keep believing whatever[br]it is you want to believe.

9:59:59.000,9:59:59.000
But if you stay here and take the red pill, [br]I will show you a glimpse of how

9:59:59.000,9:59:59.000
far down the rabbit hole goes, [br]and it will never be the same again.

9:59:59.000,9:59:59.000
Let's talk about latency.

9:59:59.000,9:59:59.000
And when I say latency, I'm talking about[br]latency, response time, any of those things

9:59:59.000,9:59:59.000
where you measure time from 'here to here',[br]and you're interested in how long it took.

9:59:59.000,9:59:59.000
We do this all the time, but I see a lot [br]of mish-mash in how people

9:59:59.000,9:59:59.000
treat the data, or think about it.

9:59:59.000,9:59:59.000
Latency is basically the time it took[br]something to happen once.

9:59:59.000,9:59:59.000
That one time, how long did it take.

9:59:59.000,9:59:59.000
And when we measure stuff, like we did [br]a million operations in the last hour,

9:59:59.000,9:59:59.000
we have a million latencies. Not one,[br]we have a million of them.

9:59:59.000,9:59:59.000
Our actual goal is to figure out how to[br]describe that million.

9:59:59.000,9:59:59.000
How did the million behave?

9:59:59.000,9:59:59.000
For example, 'they're all really good, and[br]they're all exactly the same', would be a

9:59:59.000,9:59:59.000
behavior that you will never see, [br]but that would be a great behavior.

9:59:59.000,9:59:59.000
So we need to talk about how things behave,[br]communicate, think, evaluate,

9:59:59.000,9:59:59.000
set requirements for, talk to other people,[br]but these are all common things around that.

9:59:59.000,9:59:59.000
To do that, we have to describe the [br]distribution, the set, the behavior,

9:59:59.000,9:59:59.000
but not the one.

9:59:59.000,9:59:59.000
For example, the behavior that says "the [br]common case was x" is a piece of

9:59:59.000,9:59:59.000
information about the behavior,[br]but it's a tiny sliver.

9:59:59.000,9:59:59.000
Usually the least relevant one.

9:59:59.000,9:59:59.000
Well, there's some less relevant ones, [br]but not a strongly relevant one,

9:59:59.000,9:59:59.000
and one that people often focus on.

9:59:59.000,9:59:59.000
To take a look at what we actually do [br]with this stuff, almost on a daily basis,

9:59:59.000,9:59:59.000
this is a snapshot from a monitoring system.

9:59:59.000,9:59:59.000
A small dashboard on a big screen [br]in a monitoring system.

9:59:59.000,9:59:59.000
Where you're watching the response time of[br]a system over time.

9:59:59.000,9:59:59.000
This is a two hour window.

9:59:59.000,9:59:59.000
These lines that are 95th percentile, [br]90, 75, 50, and 25th percentiles,

9:59:59.000,9:59:59.000
you can look at how they behave over time.

9:59:59.000,9:59:59.000
We're a small audience here, if you look at[br]this picture, what draws your eye?

9:59:59.000,9:59:59.000
What do you want to go investigate here[br]or pay attention to?

9:59:59.000,9:59:59.000
It's the big red spike there, right?

9:59:59.000,9:59:59.000
So we could look at the red spike,[br]cause it's different,

9:59:59.000,9:59:59.000
and say, "Woah, the 95th percentile shot up[br]here. And look, the 90th percentile

9:59:59.000,9:59:59.000
shot up at about the same time.

9:59:59.000,9:59:59.000
The rest of them didn't shoot up, [br]so maybe something happened here

9:59:59.000,9:59:59.000
that affected that much, I should probably[br]pay attention to it

9:59:59.000,9:59:59.000
because it's a monitoring system, and [br]I like things to be calm."

9:59:59.000,9:59:59.000
You could go investigate the why.

9:59:59.000,9:59:59.000
At this point, I've managed to waste [br]about 90 seconds of your life,

9:59:59.000,9:59:59.000
looking at a completely meaningless chart,[br]which unfortunately you do

9:59:59.000,9:59:59.000
every day, all the time.

9:59:59.000,9:59:59.000
This chart is the chart you want to show [br]somebody if you want to

9:59:59.000,9:59:59.000
hide the truth from them.

9:59:59.000,9:59:59.000
If you want to pull the wool [br]over their eyes.

9:59:59.000,9:59:59.000
This is the chart of the good stuff.

9:59:59.000,9:59:59.000
What's not on this chart?

9:59:59.000,9:59:59.000
The 5% worst things that happened during[br]these two hours.

9:59:59.000,9:59:59.000
They're not here.

9:59:59.000,9:59:59.000
This is only the good things that happened[br]during the things.

9:59:59.000,9:59:59.000
And to get this spike, that 5% had to be[br]so bad that it even pulled

9:59:59.000,9:59:59.000
the 95th percentile all up.

9:59:59.000,9:59:59.000
There is zero information here at all about[br]what happened bad during this two hours,

9:59:59.000,9:59:59.000
which makes it a bad fit for [br]a monitoring system.

9:59:59.000,9:59:59.000
It's a really good thing for [br]a marketing system.

9:59:59.000,9:59:59.000
It's a great way to get the bonus from your boss,[br]even though you didn't do the work.

9:59:59.000,9:59:59.000
If you want to learn how to do that, [br]we can do another talk about that.

9:59:59.000,9:59:59.000
But this is not a good way to look at latency.

9:59:59.000,9:59:59.000
It's the opposite of good.

9:59:59.000,9:59:59.000
Unfortunately, this is one of the most[br]common tools used for

9:59:59.000,9:59:59.000
server monitoring on earth right now.

9:59:59.000,9:59:59.000
That's where the snapshot is from,[br]and this is what people look at.

9:59:59.000,9:59:59.000
I find this chart to be a goldmine[br]of information.

9:59:59.000,9:59:59.000
When I first showed it in another talk [br]like this, I had this really cool experience.

9:59:59.000,9:59:59.000
Somebody came up to me and said, "Hey, [br]as I was sitting here, I was texting one

9:59:59.000,9:59:59.000
of our guys, and he was saying,

9:59:59.000,9:59:59.000
'look, we have this issue with [br]our 95th percentile'."

9:59:59.000,9:59:59.000
And I got this chart from him!

9:59:59.000,9:59:59.000
So I went and said, "Hey, what does the [br]rest of the spectrum look like?"

9:59:59.000,9:59:59.000
This is the actual chart they got.

9:59:59.000,9:59:59.000
And when they look at the rest of the[br]spectrum, it looked like that.

9:59:59.000,9:59:59.000
That's what was hiding.

9:59:59.000,9:59:59.000
Notice the scales are a little different.

9:59:59.000,9:59:59.000
That yellow line is that yellow line.

9:59:59.000,9:59:59.000
So that's a much more representative number.

9:59:59.000,9:59:59.000
Is it? Is that good enough?

9:59:59.000,9:59:59.000
That's the 99th percentile.

9:59:59.000,9:59:59.000
We still have another 1% of really bad [br]stuff that's hiding above the blue line.

9:59:59.000,9:59:59.000
I wonder how big that is?

9:59:59.000,9:59:59.000
I don't know because he didn't have the data.

9:59:59.000,9:59:59.000
So a common problem that we have is that[br]we only plot what's convenient.

9:59:59.000,9:59:59.000
We only plot what gives us nice,[br]colorful graphs.

9:59:59.000,9:59:59.000
And often, when we have to choose between[br]the stuff that hides the rest of the data,

9:59:59.000,9:59:59.000
and the stuff that is noise, we choose [br]the noise to display.

9:59:59.000,9:59:59.000
I like to rant about latency.

9:59:59.000,9:59:59.000
This is from a blog that I don't write [br]enough in, but the format for it was simple.

9:59:59.000,9:59:59.000
I tweet a single tweet about latency, [br]latency tip of the day,

9:59:59.000,9:59:59.000
and then I rant about my own tweet.

9:59:59.000,9:59:59.000
As an example, this chart is a goldmine[br]of information because it has so many

9:59:59.000,9:59:59.000
different things that are wrong in it, [br]but we won't get into all of them.

9:59:59.000,9:59:59.000
You can read it online.

9:59:59.000,9:59:59.000
Anyway, this is one to take away from [br]what we just said.

9:59:59.000,9:59:59.000
If you are not measuring and showing the[br]maximum value, what is it you are hiding?

9:59:59.000,9:59:59.000
And from whom?

9:59:59.000,9:59:59.000
If your job is to hide the truth from[br]others, this is a good way to do it.

9:59:59.000,9:59:59.000
But if you actually are interested in what's[br]going on, the number one indicator

9:59:59.000,9:59:59.000
you should never get rid of is the [br]maximum value.

9:59:59.000,9:59:59.000
That is not noise, that is the signal.

9:59:59.000,9:59:59.000
The rest of it is noise.

9:59:59.000,9:59:59.000
Okay, let's look at this chart for some[br]more cool stuff.

9:59:59.000,9:59:59.000
I'm gonna zoom in to a small part[br]of the chart, and ask you what that means.

9:59:59.000,9:59:59.000
What does the average of the 95th percentile[br]over 2 hours mean?

9:59:59.000,9:59:59.000
What is the math that does that?

9:59:59.000,9:59:59.000
What does it do?

9:59:59.000,9:59:59.000
Let's look at that, and I'll give you[br]an example with another percentile.

9:59:59.000,9:59:59.000
The 100th percentile. The max, right?

9:59:59.000,9:59:59.000
Let's take a data set.

9:59:59.000,9:59:59.000
Suppose this was the maximum every minute[br]for 15 minutes.

9:59:59.000,9:59:59.000
What does it mean to say that the average [br]max over the last 15 minutes was 42?

9:59:59.000,9:59:59.000
I specifically chose the data to[br]make that happen.

9:59:59.000,9:59:59.000
It's a meaningless statement.

9:59:59.000,9:59:59.000
It's a completely meaningless statement.

9:59:59.000,9:59:59.000
But when you see 95th percentile,[br]average 184, you think that the 95th

9:59:59.000,9:59:59.000
percentile for the last two hours[br]was around 184.

9:59:59.000,9:59:59.000
It makes you think that.

9:59:59.000,9:59:59.000
Putting this on a piece of paper is not [br]just noise and irrelevant,

9:59:59.000,9:59:59.000
it's a way to mislead people.

9:59:59.000,9:59:59.000
It's a way to mislead yourself, because [br]you'll start to believe your own mistruths.

9:59:59.000,9:59:59.000
This is true for any percentile.

9:59:59.000,9:59:59.000
There is no percentile that you could do[br]this math on.

9:59:59.000,9:59:59.000
Another tip, you cannot average percentiles.

9:59:59.000,9:59:59.000
That math doesn't happen.
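
As a minimal sketch of why that math doesn't work, assume three hypothetical one-minute intervals of latency samples (the numbers and names below are made up for illustration, not taken from the slides). Averaging the per-minute 95th percentiles gives a figure that matches neither any interval nor the 95th percentile of the whole period:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(round(pct * len(ordered) / 100.0, 6)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for three one-minute intervals.
quiet_minute_a = [10] * 1000
quiet_minute_b = [10] * 1000
bad_minute = [10] * 800 + [2000] * 200  # 20% of requests stall in this minute

intervals = [quiet_minute_a, quiet_minute_b, bad_minute]

# What a dashboard does when it "averages the 95th percentile".
per_interval_p95 = [percentile(minute, 95) for minute in intervals]
average_of_p95 = sum(per_interval_p95) / len(per_interval_p95)

# What the 95th percentile of the whole period actually is.
all_samples = [value for minute in intervals for value in minute]
true_p95 = percentile(all_samples, 95)

print("per-interval p95:", per_interval_p95)
print("average of p95s :", average_of_p95)
print("p95 of all data :", true_p95)
```

The only way to get the percentile of the longer period is to go back to the underlying samples (or a histogram of them), not to the already-summarized percentiles.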

9:59:59.000,9:59:59.000
But percentiles do matter. You really[br]want to know about them.

9:59:59.000,9:59:59.000
And a common misperception is that we want[br]to look at the main part of the spectrum,

9:59:59.000,9:59:59.000
not those outliers and perfection stuff.

9:59:59.000,9:59:59.000
Only people that actually bet their house[br]every day, or the bank on it,

9:59:59.000,9:59:59.000
need to know about the "five-nine's", [br]and all those.

9:59:59.000,9:59:59.000
The 99th percentile is a pretty[br]good number.

9:59:59.000,9:59:59.000
Is 99% really rare?

9:59:59.000,9:59:59.000
Let's look at some stuff, because we can[br]ask questions like, "If I were looking

9:59:59.000,9:59:59.000
at a webpage, what is the chance of me[br]hitting the 99th percentile?"

9:59:59.000,9:59:59.000
Of things like this: a search engine node,[br]or a key-value store,

9:59:59.000,9:59:59.000
or a database, or a CDN, right?

9:59:59.000,9:59:59.000
Because they will report their 99th percentile.

9:59:59.000,9:59:59.000
They won't tell you anything above that,[br]but how many of the

9:59:59.000,9:59:59.000
webpages that we go to [br]actually experience this?

9:59:59.000,9:59:59.000
You want to say 1%, right?

9:59:59.000,9:59:59.000
Well, I went to some webpages and I counted[br]how many HTTP requests were generated

9:59:59.000,9:59:59.000
by one click into that webpage,[br]and here are the numbers.

9:59:59.000,9:59:59.000
I did that about a year ago.

9:59:59.000,9:59:59.000
They've probably gone up since then.

9:59:59.000,9:59:59.000
Now that translates into this math.

9:59:59.000,9:59:59.000
This is the likelihood of one click seeing[br]the 99th percentile.

9:59:59.000,9:59:59.000
And the only page where that is less than[br]50% is the clean Google search page.

9:59:59.000,9:59:59.000
Where only a quarter will see the[br]99th percentile.

9:59:59.000,9:59:59.000
The 99th percentile is the thing that most[br]of your webpages will see.

9:59:59.000,9:59:59.000
Most of them will be there.
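
The math behind that can be sketched as follows, assuming each request's latency is independent of the others (an assumption of this sketch) and using made-up request counts rather than the ones on the slide. If a page load issues N requests, the chance that at least one of them lands beyond the 99th percentile is 1 - 0.99^N:

```python
def chance_of_seeing_tail(requests_per_page, percentile=99.0):
    """Probability that at least one of N independent requests is slower than the given percentile."""
    p_each_within = percentile / 100.0
    return 1.0 - p_each_within ** requests_per_page

# Hypothetical request counts per page load (illustrative only).
for n in (1, 30, 69, 200):
    print(f"{n:4d} requests -> {chance_of_seeing_tail(n):.1%} chance of hitting the 99th percentile")
```

With a single request the chance is 1%, but it grows quickly: in this model a page needs only about 69 requests before more than half of all page loads see the 99th percentile.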

9:59:59.000,9:59:59.000
Now, we could look at other things.

9:59:59.000,9:59:59.000
We can pick which things to focus on.

9:59:59.000,9:59:59.000
Let's say I had to pick between the 95th[br]percentile, and the three 9's (99.9%).

9:59:59.000,9:59:59.000
The three 9's is way into perfection mode[br]for most people, or they think.

9:59:59.000,9:59:59.000
Which one of those represents our [br]community better?

9:59:59.000,9:59:59.000
Our population?

9:59:59.000,9:59:59.000
Our users?

9:59:59.000,9:59:59.000
Our experience?

9:59:59.000,9:59:59.000
Let's run a hypothetical.

9:59:59.000,9:59:59.000
Suppose we don't have that many pages, [br]and that many resources like we said before.

9:59:59.000,9:59:59.000
We'll be much more conservative.

9:59:59.000,9:59:59.000
A user session will only go through five[br]clicks, and each click will only bring up

9:59:59.000,9:59:59.000
up to 40 things.

9:59:59.000,9:59:59.000
A lot less, and they're all as clean[br]as the Google page.

9:59:59.000,9:59:59.000
How many of the users will not experience[br]something worse than the 95th percentile?

9:59:59.000,9:59:59.000
Because that's what the 95th percentile[br]is good for, the people who see that.

9:59:59.000,9:59:59.000
Anybody above that, is that.

9:59:59.000,9:59:59.000
What are the chances of not seeing it?

9:59:59.000,9:59:59.000
That's an interesting number.

9:59:59.000,9:59:59.000
So you're watching a number that is [br]relevant to 0.003% of your users.

9:59:59.000,9:59:59.000
99.997% of your users are going to [br]see worse than this number.

9:59:59.000,9:59:59.000
Why are you looking at it?

9:59:59.000,9:59:59.000
Why are you spending time[br]thinking about it?

9:59:59.000,9:59:59.000
In reverse, we could say how many people[br]are going to see something

9:59:59.000,9:59:59.000
worse than the three 9's (99.9%)?

9:59:59.000,9:59:59.000
That's going to be 18%.

9:59:59.000,9:59:59.000
In reverse, 82% of the people will see[br]the three 9's (99.9%) or better.

9:59:59.000,9:59:59.000
That's a slightly better representation.

9:59:59.000,9:59:59.000
Probably not good enough either.

9:59:59.000,9:59:59.000
We could look at some more math with them, [br]same kind of scenario.

9:59:59.000,9:59:59.000
What percentile of HTTP response time [br]will be the thing that 95%

9:59:59.000,9:59:59.000
of people experience in this scenario?

9:59:59.000,9:59:59.000
It's the 99.97 percentile that 95% [br]of people see.

9:59:59.000,9:59:59.000
And if you want to know what 99%[br]of the people see,

9:59:59.000,9:59:59.000
that's four and a half 9's (99.995%).

9:59:59.000,9:59:59.000
You want to know that number from Akamai[br]if you want to predict what 1%

9:59:59.000,9:59:59.000
of your users are going to experience.
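
The numbers in this hypothetical can be reproduced with a short sketch, assuming 5 clicks of 40 requests each (200 requests per session, as described above) and independent request latencies. The function names are illustrative:

```python
CLICKS_PER_SESSION = 5
REQUESTS_PER_CLICK = 40
REQUESTS_PER_SESSION = CLICKS_PER_SESSION * REQUESTS_PER_CLICK  # 200

def share_of_users_within(percentile):
    """Fraction of sessions in which every request is at or better than the given percentile."""
    return (percentile / 100.0) ** REQUESTS_PER_SESSION

def percentile_experienced_by(share_of_users):
    """Per-request percentile that the given fraction of sessions never exceeds."""
    return 100.0 * share_of_users ** (1.0 / REQUESTS_PER_SESSION)

print(f"never worse than p95  : {share_of_users_within(95):.4%} of users")
print(f"worse than p99.9      : {1 - share_of_users_within(99.9):.0%} of users")
print(f"what 95% of users see : p{percentile_experienced_by(0.95):.2f}")
print(f"what 99% of users see : p{percentile_experienced_by(0.99):.3f}")
```

Under these assumptions the sketch gives roughly 0.003%, 18%, 99.97 and 99.995, which lines up with the figures quoted in the talk.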

9:59:59.000,9:59:59.000
When you know the 99th percentile, [br]you kind of know a tiny bit.

9:59:59.000,9:59:59.000
So here's another tip.

9:59:59.000,9:59:59.000
And this is not an exaggeration,[br]by the way.

9:59:59.000,9:59:59.000
The median, which is a much smaller[br]percentile, has that minuscule a chance

9:59:59.000,9:59:59.000
of ever being the number that [br]anybody experiences.

9:59:59.000,9:59:59.000
This is the chance of getting worse[br]than the median.

9:59:59.000,9:59:59.000
Which makes the median an irrelevant [br]number to look at.

9:59:59.000,9:59:59.000
Unfortunately, it's probably the most [br]common one looked at.

9:59:59.000,9:59:59.000
When people say "the typical",[br]they look at the thing that

9:59:59.000,9:59:59.000
everything will be worse than.

9:59:59.000,9:59:59.000
Okay, I'm sorry about that part.

9:59:59.000,9:59:59.000
We'll do some other parts.

9:59:59.000,9:59:59.000
Now, why is it that when we look at these[br]monitoring systems, we don't see

9:59:59.000,9:59:59.000
data with a lot of 9's?

9:59:59.000,9:59:59.000
Why do we stop at the[br]90, 95, 99th percentile?

9:59:59.000,9:59:59.000
Why don't we look further?

9:59:59.000,9:59:59.000
Now, some of it is because people think, [br]"Well that's perfection, I don't need it."

9:59:59.000,9:59:59.000
The other part is that it's hard.

9:59:59.000,9:59:59.000
It's hard because you can't[br]average percentiles.

9:59:59.000,9:59:59.000
We already talked about that.

9:59:59.000,9:59:59.000
But you also can't derive your [br]five 9's (99.999%) out of a lot

9:59:59.000,9:59:59.000
of 10 second samples of percentiles.

9:59:59.000,9:59:59.000
And the reason for that is, "Hey, in 10 [br]seconds, maybe I only had 1,000 things."

9:59:59.000,9:59:59.000
I could take all the 10 seconds in the [br]world, there's no way to say what the

9:59:59.000,9:59:59.000
hour's five 9's (99.999%) were, what the [br]minute's five 9's were

9:59:59.000,9:59:59.000
if I'm collecting just this data.

9:59:59.000,9:59:59.000
And unfortunately, the data being collected[br]and reported to the back ends of monitoring

9:59:59.000,9:59:59.000
is usually summarized at a second,[br]5 seconds, 10 seconds, etc.

9:59:59.000,9:59:59.000
Basically throwing away all the good data,[br]and leaving you with absolutely no way

9:59:59.000,9:59:59.000
to compute large 9's for longer[br]periods of time.

9:59:59.000,9:59:59.000
So, this is where you might want to look[br]at HDR Histogram.

9:59:59.000,9:59:59.000
It's an open source thing I created[br]a few years ago.

9:59:59.000,9:59:59.000
I did it in Java, and now there are[br]C, C#, Python, Erlang,

9:59:59.000,9:59:59.000
and Go ports of this that I didn't create.

9:59:59.000,9:59:59.000
And it lets you actually get an entire[br]percentile spectrum.

9:59:59.000,9:59:59.000
Some of you here I know are [br]already using it.

9:59:59.000,9:59:59.000
And you can look at all the percentiles.

9:59:59.000,9:59:59.000
Any number of 9's that's in the data, if [br]you just keep it right and report it right,

9:59:59.000,9:59:59.000
it's got a log format, you can [br]store things forever.

9:59:59.000,9:59:59.000
Well, for a long time.
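
The underlying idea can be sketched with the Python standard library alone. This is not HdrHistogram's API, just an illustration under made-up data: if each 10-second window keeps value counts instead of precomputed percentiles, the windows can be added together and any percentile computed afterwards for the longer period:

```python
import math
from collections import Counter

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile computed from a {value: count} histogram."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(round(pct * total / 100.0, 6)))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

# Hypothetical hour: 360 windows of 1,000 samples each (latencies in ms).
windows = [Counter({1: 1000}) for _ in range(359)]
windows.append(Counter({1: 990, 5000: 10}))  # one window contains ten slow requests

# Histograms are additive, so the hour is just the sum of its windows.
hour = Counter()
for window in windows:
    hour.update(window)

print("samples in the hour:", sum(hour.values()))
print("99.99th percentile :", percentile_from_counts(hour, 99.99))
print("99.999th percentile:", percentile_from_counts(hour, 99.999))
```

A single 1,000-sample window cannot resolve five 9's at all, but the merged histogram of 360,000 samples can. A real histogram library would bucket values to bound memory; the exact-value counts here are a simplification.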

9:59:59.000,9:59:59.000
Okay, so it lets you have nice things.

9:59:59.000,9:59:59.000
Enough of that advertisement.

9:59:59.000,9:59:59.000
Now, latency... Well, I think this is[br]slightly out of order.

9:59:59.000,9:59:59.000
Yeah, sorry.

9:59:59.000,9:59:59.000
This is the red/blue pill part, so I warn[br]you, this is your last chance.

9:59:59.000,9:59:59.000
There's a problem I call the [br]coordinated omission problem.

9:59:59.000,9:59:59.000
The coordinated omission problem is [br]basically a conspiracy.

9:59:59.000,9:59:59.000
It's a conspiracy that we're all part of.

9:59:59.000,9:59:59.000
I don't think anybody actually meant[br]to do it, but once I've noticed it,

9:59:59.000,9:59:59.000
everywhere I look, there it is.

9:59:59.000,9:59:59.000
Now, I've been using a specific way of[br]showing you numbers so far.

9:59:59.000,9:59:59.000
Has anybody here noticed how[br]I spell percentile?

9:59:59.000,9:59:59.000
(Audience Member): "You put lie at the[br]end of the percent sign."

9:59:59.000,9:59:59.000
Yeah, good.

9:59:59.000,9:59:59.000
So the coordinated omission problem is the[br]"lie" in %lies.

9:59:59.000,9:59:59.000
And this is how it works.

9:59:59.000,9:59:59.000
One common way to do this is[br]to use a load generator.

9:59:59.000,9:59:59.000
Pretty much all load generators[br]have this problem.

9:59:59.000,9:59:59.000
There are two that I know of that don't.

9:59:59.000,9:59:59.000
What you do with a load generator,[br]is you test.

9:59:59.000,9:59:59.000
You issue requests, or send packets.

9:59:59.000,9:59:59.000
And you measure how long something took.

9:59:59.000,9:59:59.000
And as long as the numbers go right, [br]measure them, put them in a bucket,

9:59:59.000,9:59:59.000
study them later, and get your [br]percentiles from it.

9:59:59.000,9:59:59.000
But what if the thing that you are[br]measuring took longer than the time

9:59:59.000,9:59:59.000
it would've taken until you send [br]the next thing?

9:59:59.000,9:59:59.000
You're supposed to send something [br]every second,

9:59:59.000,9:59:59.000
but this one took a second and a half.

9:59:59.000,9:59:59.000
Well you've got to wait before[br]you send the next one.

9:59:59.000,9:59:59.000
You just avoided measuring something [br]when the system was problematic.

9:59:59.000,9:59:59.000
You've coordinated with it.

9:59:59.000,9:59:59.000
You weren't looking at it then.

9:59:59.000,9:59:59.000
That's common scenario A: You've backed[br]off, and avoided measuring when it was bad.

9:59:59.000,9:59:59.000
Another way, is you measure inside your code.

9:59:59.000,9:59:59.000
We all do this. We all have to do this,

9:59:59.000,9:59:59.000
where we measure time, do something, [br]then measure time.

9:59:59.000,9:59:59.000
The delta between them is how long it took.

9:59:59.000,9:59:59.000
We can then put it in a stats bucket,[br]and then do the percentiles in that.

9:59:59.000,9:59:59.000
Unfortunately, if the system freezes right[br]here, for any reason,

9:59:59.000,9:59:59.000
an interrupt, a context switch,

9:59:59.000,9:59:59.000
a cache buffer flush to disk,

9:59:59.000,9:59:59.000
a garbage collection,

9:59:59.000,9:59:59.000
a re-indexing of your database,[br]this is a database.

9:59:59.000,9:59:59.000
This is Cassandra by the way, [br]measuring itself.

9:59:59.000,9:59:59.000
In any of the above, then you will[br]have one bad report

9:59:59.000,9:59:59.000
while 10,000 things are waiting in line.

9:59:59.000,9:59:59.000
And when they come in, they will look[br]really, really good.

9:59:59.000,9:59:59.000
Even though each one of them has had[br]a really bad experience.

9:59:59.000,9:59:59.000
It can even get worse, where maybe the[br]freeze happened outside the timing,

9:59:59.000,9:59:59.000
and you won't even know there was a freeze.

9:59:59.000,9:59:59.000
Now these are examples of omitting data[br]that is bad on a very selective basis.

9:59:59.000,9:59:59.000
It's not random sampling.

9:59:59.000,9:59:59.000
It's, "I don't like bad data",

9:59:59.000,9:59:59.000
or "I couldn't handle it",

9:59:59.000,9:59:59.000
or "I don't know about it",

9:59:59.000,9:59:59.000
so we'll just talk about the good.

9:59:59.000,9:59:59.000
What does that do to your data?

9:59:59.000,9:59:59.000
Because it often makes people feel like,

9:59:59.000,9:59:59.000
"Okay, yeah, I understand,[br]but it's a little bit of noise."

9:59:59.000,9:59:59.000
Let's run some hypotheticals, [br]and I'll show you some real numbers.

9:59:59.000,9:59:59.000
Imagine a perfect system.

9:59:59.000,9:59:59.000
It's doing 100 requests a second, [br]at exactly a millisecond each.

9:59:59.000,9:59:59.000
But we go and freeze the system, [br]after 100 seconds of perfect operations

9:59:59.000,9:59:59.000
for 100 seconds, and then repeat.

9:59:59.000,9:59:59.000
Now, I'm going to describe how the system[br]behaves in terms that should mean something,

9:59:59.000,9:59:59.000
and then we'll measure it.

9:59:59.000,9:59:59.000
If we actually wanted to describe the[br]system,

9:59:59.000,9:59:59.000
on the left we have an average[br]of one millisecond by definition,

9:59:59.000,9:59:59.000
and on the right we have an[br]average of 50 seconds.

9:59:59.000,9:59:59.000
Why 50? Because if I randomly came in [br]in that 100 seconds,

9:59:59.000,9:59:59.000
I'll get anything from 0 to 100[br]with even distribution.

9:59:59.000,9:59:59.000
The overall average over 200 seconds[br]is 25 seconds.

9:59:59.000,9:59:59.000
If I just came in here and said, [br]"Surprise, how long did this take?"

9:59:59.000,9:59:59.000
On average, it will be 25.

9:59:59.000,9:59:59.000
I can also do the percentiles.

9:59:59.000,9:59:59.000
50th percentile will be really good, [br]and then it'll get really bad.

9:59:59.000,9:59:59.000
The four 9's is terrible.

9:59:59.000,9:59:59.000
This is a fair honest description of[br]this system if this is what it did.

9:59:59.000,9:59:59.000
And you can make the system do that.

9:59:59.000,9:59:59.000
That's what Control-Z is good for.

9:59:59.000,9:59:59.000
You can make any of your systems do that.

9:59:59.000,9:59:59.000
Now let's go measure this system with [br]a load generator,

9:59:59.000,9:59:59.000
or with a monitoring system.

9:59:59.000,9:59:59.000
The common ones.

9:59:59.000,9:59:59.000
The ones everybody does.

9:59:59.000,9:59:59.000
On the left, we're going to get 10,000 [br]results of one millisecond each.

9:59:59.000,9:59:59.000
Great.

9:59:59.000,9:59:59.000
And we're going to get one result of[br]100 seconds.

9:59:59.000,9:59:59.000
Wow, really big response time.

9:59:59.000,9:59:59.000
This is our data.

9:59:59.000,9:59:59.000
This is OUR data.

9:59:59.000,9:59:59.000
So now you go do math with it.

9:59:59.000,9:59:59.000
The average of that is 10.9 milliseconds.

9:59:59.000,9:59:59.000
A little less than 25 seconds.

9:59:59.000,9:59:59.000
And here are the percentiles.

9:59:59.000,9:59:59.000
Your load generator monitoring system[br]will tell you that this system is perfect.

9:59:59.000,9:59:59.000
You could go to production with it.

9:59:59.000,9:59:59.000
You like what you see.

9:59:59.000,9:59:59.000
Look at that, four 9's.

9:59:59.000,9:59:59.000
It is lying to you.

9:59:59.000,9:59:59.000
To your face.

9:59:59.000,9:59:59.000
And you can catch it doing that with a [br]Control-Z test.

9:59:59.000,9:59:59.000
But people tend to not want to do that,[br]because then what are they going to do?

9:59:59.000,9:59:59.000
If you just do that test, and calibrate [br]your system, and you find it

9:59:59.000,9:59:59.000
telling you that, about this, the next [br]step should be to throw all the numbers away.

9:59:59.000,9:59:59.000
Don't believe anything else it says.

9:59:59.000,9:59:59.000
If it lies this big, what else did it do?

9:59:59.000,9:59:59.000
Don't waste your time on numbers[br]from uncalibrated systems.

9:59:59.000,9:59:59.000
Now the problem here was, that if you[br]want to measure the system,

9:59:59.000,9:59:59.000
you have to measure at random rates, [br]or same rates.

9:59:59.000,9:59:59.000
If you measure 10,000 things in 100 seconds,[br]there should be another 10,000 things here.

9:59:59.000,9:59:59.000
If you measured them, you would've gotten[br]all the right numbers.
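
The hypothetical system can be simulated in a few lines. This is a sketch under the stated assumptions (100 requests per second, 1 ms each, a 100-second freeze, and service time during the freeze ignored), comparing what a wait-for-the-response load generator records with what requests sent on schedule would have experienced:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(round(pct * len(ordered) / 100.0, 6)))
    return ordered[rank - 1]

REQUEST_INTERVAL_S = 0.01   # 100 requests per second
SERVICE_TIME_S = 0.001      # 1 ms per request while the system is healthy
FREEZE_S = 100.0            # the system then freezes for 100 seconds

good_phase = [SERVICE_TIME_S] * 10_000

# A generator that waits for each response records a single stalled request.
measured = good_phase + [FREEZE_S]

# Requests sent on schedule during the freeze each wait for the rest of it:
# 100 s, 99.99 s, ..., 0.01 s.
on_schedule = good_phase + [(10_000 - k) * REQUEST_INTERVAL_S for k in range(10_000)]

for name, data in (("measured", measured), ("on schedule", on_schedule)):
    mean = sum(data) / len(data)
    print(f"{name:12s} mean={mean:9.4f}s "
          f"p50={percentile(data, 50):.3f}s "
          f"p75={percentile(data, 75):.3f}s "
          f"p99.99={percentile(data, 99.99):.3f}s")
```

The measured data reports an average of about 11 ms and a perfect 99.99th percentile, while the on-schedule data reports an average of about 25 seconds and a 99.99th percentile of nearly 100 seconds.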

9:59:59.000,9:59:59.000
Coordinated omission is the simple act of[br]erasing all that bad stuff.

9:59:59.000,9:59:59.000
The conspiracy here is that we all do it[br]without meaning to.

9:59:59.000,9:59:59.000
I don't know who put that in our systems,[br]but it happens to all of us.

9:59:59.000,9:59:59.000
Now, I often get people saying,[br]"Okay, I get it. All the numbers are wrong,

9:59:59.000,9:59:59.000
but at least for my job where I tune [br]performance, and I try to make things

9:59:59.000,9:59:59.000
faster, I can use the numbers to figure[br]out if I'm going in the right direction."

9:59:59.000,9:59:59.000
Is it better, or is it worse? Let me [br]dispel that for you for a second.

9:59:59.000,9:59:59.000
Suppose I went and took this system,[br]and improved it dramatically.

9:59:59.000,9:59:59.000
Rather than freezing for 100 seconds, [br]it will now answer every question.

9:59:59.000,9:59:59.000
It'll take a little longer,[br]5 milliseconds instead of one,

9:59:59.000,9:59:59.000
but it's much better than freezing, right?

9:59:59.000,9:59:59.000
So let's measure that system that we spent[br]weeks and weeks improving,

9:59:59.000,9:59:59.000
and see if it's better.

9:59:59.000,9:59:59.000
That's the data.

9:59:59.000,9:59:59.000
If we do the percentiles, it'll tell us [br]that we just really hurt the four 9's.

9:59:59.000,9:59:59.000
We made it go 5 times worse than before.

9:59:59.000,9:59:59.000
We should revert this change, go back to[br]that much better system we had before.

9:59:59.000,9:59:59.000
So this is just to make sure that you[br]don't think that you can have

9:59:59.000,9:59:59.000
any intuition based on any of these numbers.

9:59:59.000,9:59:59.000
They go backwards sometimes.

9:59:59.000,9:59:59.000
You don't know which way is good or bad.

9:59:59.000,9:59:59.000
And you'll never know which way is good[br]or bad with a system that lies like that.
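
A minimal sketch of why the numbers go backwards. It assumes a tester paced at 100 requests per second, a system that answers in 1 ms and then freezes for 100 seconds, and an "improved" system that always answers in 5 ms. The names and the instant catch-up after the freeze are simplifying assumptions, not anything from a real load generator.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    n = len(sorted_values)
    # round() guards against float noise such as 19998.000000000004
    rank = max(1, math.ceil(round(pct / 100.0 * n, 9)))
    return sorted_values[rank - 1]

INTERVAL_MS = 10            # intended pace: one request every 10 ms
SERVICE_MS = 1              # normal service time
FREEZE_START_MS = 100_000   # the system freezes at t = 100 s
FREEZE_END_MS = 200_000     # and wakes up at t = 200 s

def finish_time(start_ms):
    """Completion time of a request that arrives at start_ms."""
    if FREEZE_START_MS <= start_ms < FREEZE_END_MS:
        return FREEZE_END_MS + SERVICE_MS  # stuck until the freeze ends
    return start_ms + SERVICE_MS

# Omitting tester: it waits for each reply before sending the next request,
# so it sends nothing at all while the system is frozen.
omitted = []
t = 0
while t < FREEZE_END_MS:
    done = finish_time(t)
    omitted.append(done - t)
    t = max(t + INTERVAL_MS, done)
omitted.sort()

# Honest measurement: every request is timed from when it SHOULD have started.
corrected = sorted(
    finish_time(t) - t for t in range(0, FREEZE_END_MS, INTERVAL_MS)
)

# The "improved" system never freezes and always takes 5 ms.
improved = [5] * (FREEZE_END_MS // INTERVAL_MS)

for name, data in (("frozen, omitted", omitted),
                   ("frozen, corrected", corrected),
                   ("improved", improved)):
    print(f"{name:18s} 99.99%: {percentile(data, 99.99)} ms")
```

In this sketch the omitted measurement reports 1 ms at four 9's for the freezing system and 5 ms for the improved one, so the fix looks like a regression. Only the corrected measurement shows the stall.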

9:59:59.000,9:59:59.000
The other cool technique is[br]what I call "Cheating Twice".

9:59:59.000,9:59:59.000
You have a constant load generator,[br]and it needs to do 100 per second.

9:59:59.000,9:59:59.000
When it woke up after 200 seconds, [br]it says,

9:59:59.000,9:59:59.000
"Whoa, we're 9,999 behind.[br]We've got to issue those requests."

9:59:59.000,9:59:59.000
So it issues those requests.

9:59:59.000,9:59:59.000
At this point, not only did it get rid of[br]all the bad requests,

9:59:59.000,9:59:59.000
it replaced every one of them with [br]a perfect request.

9:59:59.000,9:59:59.000
Coining the four 9's (99.99%), all the way[br]to four and a half 9's (99.995%),

9:59:59.000,9:59:59.000
it's twice as wrong as dropping them.
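
The arithmetic behind "twice as wrong" can be sketched as follows, under the same assumed scenario (100 requests per second, 100 good seconds, then a 100 second freeze). The function name is made up for illustration.

```python
def perfect_fraction(good, bad):
    """Fraction of recorded samples that look perfect."""
    return good / (good + bad)

# Honest: 10,000 good results and 10,000 results delayed by the freeze.
honest = perfect_fraction(10_000, 10_000)

# Dropping: the 9,999 requests that never got sent are simply missing.
dropped = perfect_fraction(10_000, 1)

# Cheating twice: the 9,999 missing requests are issued late as a burst,
# timed from when they were actually sent, so each one looks perfect.
catch_up = perfect_fraction(10_000 + 9_999, 1)

print(f"honest:   {honest:.4%} of samples look perfect")
print(f"dropped:  {dropped:.4%}")
print(f"catch-up: {catch_up:.4%}")
```

With dropping, the one bad result only shows up above roughly the 99.99th percentile. With the catch-up burst it is pushed out to the 99.995th.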

9:59:59.000,9:59:59.000
So these are all cool things that[br]happen to you.

9:59:59.000,9:59:59.000
I'm not going to spend much time on how[br]to fix those and avoid those.

9:59:59.000,9:59:59.000
There's a lot of other material that you[br]can find with me

9:59:59.000,9:59:59.000
talking about that, in longer talks.

9:59:59.000,9:59:59.000
But this is pretty bad.

9:59:59.000,9:59:59.000
And like I said...

9:59:59.000,9:59:59.000
That should've been up there before.

9:59:59.000,9:59:59.000
How did this repeat itself?

9:59:59.000,9:59:59.000
Did I create a loop in the[br]presentation somehow?

9:59:59.000,9:59:59.000
I don't know how to do that.

9:59:59.000,9:59:59.000
Let's see if I can get through here.

9:59:59.000,9:59:59.000
Hopefully editing later will take it out.

9:59:59.000,9:59:59.000
So we have the cheats twice.

9:59:59.000,9:59:59.000
There, okay.

9:59:59.000,9:59:59.000
So, after we look at coordinated[br]omission that way,

9:59:59.000,9:59:59.000
we should also look at response time, [br]and service time.

9:59:59.000,9:59:59.000
Coordinated omission, what it really is[br]achieving for you, unfortunately,

9:59:59.000,9:59:59.000
is that it makes something that you think[br]is response time, and only shows you

9:59:59.000,9:59:59.000
the service time component of latency.

9:59:59.000,9:59:59.000
This is a simple depiction of what service[br]time and response times are.

9:59:59.000,9:59:59.000
This guy is taking a certain amount of[br]time to take payment

9:59:59.000,9:59:59.000
or make a cup of coffee.

9:59:59.000,9:59:59.000
That's service time.

9:59:59.000,9:59:59.000
How long does it take to do the work?

9:59:59.000,9:59:59.000
This person has experienced[br]the response time,

9:59:59.000,9:59:59.000
which includes the amount of time they [br]have to wait before they

9:59:59.000,9:59:59.000
get to the person that does the work.

9:59:59.000,9:59:59.000
And the difference between those[br]two is immense.

9:59:59.000,9:59:59.000
The coordinated omission problem makes[br]something that you think is

9:59:59.000,9:59:59.000
response time, only measure the [br]service time,

9:59:59.000,9:59:59.000
and basically hide the fact that things [br]stalled, waited in line,

9:59:59.000,9:59:59.000
that this guy might've taken a lunch break,

9:59:59.000,9:59:59.000
and now we have a line around[br]the building three times.

9:59:59.000,9:59:59.000
Service time stays the same.

9:59:59.000,9:59:59.000
This is the backwards part...

9:59:59.000,9:59:59.000
Now, let's look at what it[br]actually looks like.

9:59:59.000,9:59:59.000
In a load generator that I fixed,[br]I measured both

9:59:59.000,9:59:59.000
response time and service time,

9:59:59.000,9:59:59.000
this happens to be Cassandra,

9:59:59.000,9:59:59.000
at a very low load.

9:59:59.000,9:59:59.000
And you can see that they're very very [br]similar, at a very low load.

9:59:59.000,9:59:59.000
Why? Because there's nobody in line.

9:59:59.000,9:59:59.000
This thing is really fast.

9:59:59.000,9:59:59.000
We're not asking for too much.

9:59:59.000,9:59:59.000
Cassandra's pretty fast,[br]so they're the same.

9:59:59.000,9:59:59.000
But if I increase the load, we [br]start seeing gaps.

9:59:59.000,9:59:59.000
If I increase the load a little more,[br]the gap grows.

9:59:59.000,9:59:59.000
If I increase the load a little more,[br]the gap grows.

9:59:59.000,9:59:59.000
Now this is not the failure point yet.

9:59:59.000,9:59:59.000
If I actually increase it all the way past[br]the point where the system

9:59:59.000,9:59:59.000
can't even do the work I want, [br]service time stays the same,

9:59:59.000,9:59:59.000
response time goes through the roof.

9:59:59.000,9:59:59.000
This was when it was 100 and something[br]milliseconds, now it's 7 and a half seconds.

9:59:59.000,9:59:59.000
Why 7 and a half seconds?

9:59:59.000,9:59:59.000
Cause you're waiting in line that long[br]to go around the block.

9:59:59.000,9:59:59.000
The guy just can't serve as many people[br]as are showing up in line, you fall behind.

9:59:59.000,9:59:59.000
This is a virtual world reaction to this.

9:59:59.000,9:59:59.000
I really like this slide, it's where I came[br]up with the notion of a blue/red pill.

9:59:59.000,9:59:59.000
When you actually measure reality, people[br]tend to have this reaction when

9:59:59.000,9:59:59.000
they compare the two.

9:59:59.000,9:59:59.000
And if we actually look at these on the[br]two sides of a collapse point of a system,

9:59:59.000,9:59:59.000
this specific system can only do 87,000 [br]things a second.

9:59:59.000,9:59:59.000
No matter how hard you press it,[br]that's all it'll do.

9:59:59.000,9:59:59.000
The service time on the two sides of [br]the collapse looks virtually identical,

9:59:59.000,9:59:59.000
which it would.

9:59:59.000,9:59:59.000
But if you compare the response time, [br]you have a very different picture.

9:59:59.000,9:59:59.000
And I'm showing this picture so you get [br]a feeling for what to look at

9:59:59.000,9:59:59.000
on whether or not you're measuring[br]the right one.

9:59:59.000,9:59:59.000
Whenever you push, you try and push load[br]beyond what the system can do,

9:59:59.000,9:59:59.000
you are falling behind over time.

9:59:59.000,9:59:59.000
This is a 250 second run,

9:59:59.000,9:59:59.000
where at the end of it[br]you are waiting for 8 seconds in line.

9:59:59.000,9:59:59.000
Why? Because for every second [br]that goes by, there are

9:59:59.000,9:59:59.000
3,000 more things that are[br]added to the line.

9:59:59.000,9:59:59.000
The interesting thing that happens when [br]you cross the threshold limit,

9:59:59.000,9:59:59.000
or capability of the system, is that[br]response time grows over time linearly.

9:59:59.000,9:59:59.000
It doesn't happen if you're below.

9:59:59.000,9:59:59.000
Only if you're above.
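
A small worked check of that linear growth, using the 87,000 per second limit and the 3,000 per second surplus mentioned above. The constant and function names are assumptions made for this sketch.

```python
CAPACITY_PER_S = 87_000   # the most this system can complete per second
SURPLUS_PER_S = 3_000     # extra requests arriving each second
RUN_SECONDS = 250

def wait_in_line(elapsed_s):
    """Seconds a request arriving at elapsed_s waits behind the backlog."""
    backlog = SURPLUS_PER_S * elapsed_s   # requests queued so far
    return backlog / CAPACITY_PER_S       # time needed to drain them

for t in (0, 50, 100, 150, 200, RUN_SECONDS):
    print(f"t = {t:3d} s  wait = {wait_in_line(t):.2f} s")
```

The wait grows by the same amount every second, which is the straight line to look for. By the end of the 250 second run it comes out between 8 and 9 seconds in this simplified model.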

9:59:59.000,9:59:59.000
It's the point where that happens, and [br]any load generator that doesn't show

9:59:59.000,9:59:59.000
that line when you try pushing harder [br]than you can, is lying to you.

9:59:59.000,9:59:59.000
It's a simple sanity check.

9:59:59.000,9:59:59.000
If your load generator shows you that, [br]it didn't push.

9:59:59.000,9:59:59.000
Or it pushed, but it didn't[br]report correctly,

9:59:59.000,9:59:59.000
whichever it is.

9:59:59.000,9:59:59.000
If we draw that to scale...

9:59:59.000,9:59:59.000
Just to make sure, this was not to scale, [br]this is the scale, I just zoomed in

9:59:59.000,9:59:59.000
so you could see that it was [br]relatively stable.

9:59:59.000,9:59:59.000
So... I don't know what happened to the[br]order of the slides.

9:59:59.000,9:59:59.000
It's like looping and randoming.

9:59:59.000,9:59:59.000
There's some conspiracy going on there.

9:59:59.000,9:59:59.000
Now, latency doesn't live on its own.

9:59:59.000,9:59:59.000
You do need to look at latency in the [br]context of load.

9:59:59.000,9:59:59.000
Cause as I showed you, as you're nearly[br]idle, things are nearly perfect.

9:59:59.000,9:59:59.000
Even these mistakes won't show up.

9:59:59.000,9:59:59.000
But as you start pressing, things start[br]cracking or behaving differently.

9:59:59.000,9:59:59.000
And usually when you want to know how much[br]your system can handle,

9:59:59.000,9:59:59.000
the answer is not 87,000 things a second,[br]because nobody wants the

9:59:59.000,9:59:59.000
response time that comes with that.

9:59:59.000,9:59:59.000
It's how many things can I handle so [br]that I don't get angry phone calls.

9:59:59.000,9:59:59.000
So I do get my bonus, and so my[br]company stays above ground.

9:59:59.000,9:59:59.000
This is not sustainable speed.

9:59:59.000,9:59:59.000
Running this experiment is really[br]interesting with software,

9:59:59.000,9:59:59.000
because it actually doesn't hurt, but[br]spending the next 6 months of your time

9:59:59.000,9:59:59.000
repeating this experiment, trying to[br]change the shape of the bumper

9:59:59.000,9:59:59.000
every time you hit the thing[br]is a waste of your time.

9:59:59.000,9:59:59.000
Your goal when you're trying to figure[br]out sustainable speed throughput,

9:59:59.000,9:59:59.000
whatever it is, is to see how fast you can[br]go without this happening,

9:59:59.000,9:59:59.000
and then to try and engineer[br]to improve that.

9:59:59.000,9:59:59.000
Meaning, can I make it go faster[br]without this happening?

9:59:59.000,9:59:59.000
Measuring what happens after you[br]hit the pole is useless for that exercise.

9:59:59.000,9:59:59.000
The only thing that matters about hitting[br]the pole, is that you hit the pole.

9:59:59.000,9:59:59.000
When you go and study the behavior[br]of latency, at saturation,

9:59:59.000,9:59:59.000
you are doing this.

9:59:59.000,9:59:59.000
You're looking at this and saying, "That[br]bumper, I don't like the shape of that.

9:59:59.000,9:59:59.000
Let's measure it closely and do this 100[br]times to see if we can vary it."

9:59:59.000,9:59:59.000
That's what it means to look at latency [br]at saturation,

9:59:59.000,9:59:59.000
and repeat, and repeat, and change,[br]and tune, and see if you can do it again.

9:59:59.000,9:59:59.000
If you're pressing it to the wall,[br]it should look like this.

9:59:59.000,9:59:59.000
And it shouldn't be a surprise that it's[br]a 7 and a half second response time.

9:59:59.000,9:59:59.000
In fact, if it's not, something is[br]terribly wrong with what you're measuring.

9:59:59.000,9:59:59.000
You should look at that instead.

9:59:59.000,9:59:59.000
So don't do this.

9:59:59.000,9:59:59.000
Try to minimize the number of times[br]that you actually run red cars

9:59:59.000,9:59:59.000
into poles in your testing.

9:59:59.000,9:59:59.000
I'm not saying don't do it, but use it[br]to establish the end.

9:59:59.000,9:59:59.000
And then you need to test all the speeds,[br]and we need to see when you hit the pole.

9:59:59.000,9:59:59.000
Maybe you hit the pole at 100 mph,[br]but maybe you also hit the pole at 70 mph.

9:59:59.000,9:59:59.000
Maybe you don't hit it at 20.

9:59:59.000,9:59:59.000
We should find out how fast is safe.

9:59:59.000,9:59:59.000
When you have data, you can compare[br]it like this.

9:59:59.000,9:59:59.000
This is what I would say is a recommended [br]way to look at it.

9:59:59.000,9:59:59.000
Plot requirements, that's the hitting[br]the pole.

9:59:59.000,9:59:59.000
And some things hit the pole, [br]and some things don't.

9:59:59.000,9:59:59.000
And you run different scenarios, [br]different loads,

9:59:59.000,9:59:59.000
different configurations,

9:59:59.000,9:59:59.000
different settings,

9:59:59.000,9:59:59.000
and see what works, and what doesn't.

9:59:59.000,9:59:59.000
Your goal is to stay here, and carry [br]more while staying there.

9:59:59.000,9:59:59.000
Usually.

9:59:59.000,9:59:59.000
It's very useful for figuring out how many[br]machines I need to carry a certain thing.

9:59:59.000,9:59:59.000
If you don't know this, you don't know[br]how many machines to deploy.

9:59:59.000,9:59:59.000
Okay, I'm going to run through[br]some comparisons of

9:59:59.000,9:59:59.000
latency or response time behaviors[br]between different configurations

9:59:59.000,9:59:59.000
to show you some of the places[br]people look, and some of the

9:59:59.000,9:59:59.000
intuitive and non-intuitive[br]things to do with them.

9:59:59.000,9:59:59.000
The common thing,

9:59:59.000,9:59:59.000
and again, this is that Cassandra thing,

9:59:59.000,9:59:59.000
comparing two systems, A and B.

9:59:59.000,9:59:59.000
I'll let you guess which one is A,[br]and which one is B.

9:59:59.000,9:59:59.000
It's two systems, and saying[br]which is better, what can I do with this?

9:59:59.000,9:59:59.000
And we're measuring here at two [br]throughputs, 85 and 90k.

9:59:59.000,9:59:59.000
As I said in here, 90k is past the[br]capability of the system.

9:59:59.000,9:59:59.000
You can sort of see it here.

9:59:59.000,9:59:59.000
See, 85 for both of them is here,[br]and 90k is here.

9:59:59.000,9:59:59.000
So you could look at this and say,

9:59:59.000,9:59:59.000
"Look, when the car hits the pole,[br]the blue system is better."

9:59:59.000,9:59:59.000
It's half as bad, but that's just[br]the wrong place to look.

9:59:59.000,9:59:59.000
They both suck.

9:59:59.000,9:59:59.000
You do not want to be doing this.

9:59:59.000,9:59:59.000
The fact that this system is better[br]than that system

9:59:59.000,9:59:59.000
doesn't make you want to use it.

9:59:59.000,9:59:59.000
This is the wrong place to measure.

9:59:59.000,9:59:59.000
This is where latency is irrelevant.

9:59:59.000,9:59:59.000
How they behave past this point[br]doesn't matter.

9:59:59.000,9:59:59.000
What we should be doing is saying,

9:59:59.000,9:59:59.000
"Well, then don't measure here. [br]Let's look there."

9:59:59.000,9:59:59.000
So if we zoom just at the 85k's on these[br]two systems, okay, they're different.

9:59:59.000,9:59:59.000
And now...

9:59:59.000,9:59:59.000
The red and the blue alternate here,[br]whatever that is.

9:59:59.000,9:59:59.000
And now you look at this,[br]and okay, it's better.

9:59:59.000,9:59:59.000
But we're still in the wrong place,[br]because we are 1.5% from hitting the pole.

9:59:59.000,9:59:59.000
It is not where you will be[br]running in production.

9:59:59.000,9:59:59.000
It's not the interesting place[br]to study latency.

9:59:59.000,9:59:59.000
That's the place that if you're anywhere[br]close to that, you should be on the phone

9:59:59.000,9:59:59.000
getting more servers now, rather than[br]trying to figure out how the latency behaves.

9:59:59.000,9:59:59.000
You know it's going to collapse if just[br]a little bit of noise happens.

9:59:59.000,9:59:59.000
What you should be doing is looking[br]far away from the need,

9:59:59.000,9:59:59.000
far away from that.

9:59:59.000,9:59:59.000
For example, let's go to half the[br]throughput that causes collapse,

9:59:59.000,9:59:59.000
and see what things happen there.

9:59:59.000,9:59:59.000
And here you can see,

9:59:59.000,9:59:59.000
okay, these are two systems, and one [br]of them does better.

9:59:59.000,9:59:59.000
You can say that this percentile is better,

9:59:59.000,9:59:59.000
that percentile, whatever these are.

9:59:59.000,9:59:59.000
It is interesting, but what[br]can we do with this?

9:59:59.000,9:59:59.000
How do we tell our boss what this means?

9:59:59.000,9:59:59.000
Or how do we translate this into, [br]how many machines do I need?

9:59:59.000,9:59:59.000
Now so far, I've been comparing[br]things at the same throughput,

9:59:59.000,9:59:59.000
and looking at latencies.

9:59:59.000,9:59:59.000
And that's good for pass/fail kind of[br]things, or getting quantitative things,

9:59:59.000,9:59:59.000
but once you get to this point,[br]you can start saying,

9:59:59.000,9:59:59.000
"Wait, what if I do it at[br]different throughputs?"

9:59:59.000,9:59:59.000
How slow do I need to make this blue thing[br]to make it look closer to the red thing,

9:59:59.000,9:59:59.000
or the other way around.

9:59:59.000,9:59:59.000
I don't want to move this fast to 3-L too,[br]I want to move this to be there.

9:59:59.000,9:59:59.000
For example, slow that one down by 4X, [br]and look,

9:59:59.000,9:59:59.000
the two 9's are actually starting[br]to look similar.

9:59:59.000,9:59:59.000
If you slow it by...

9:59:59.000,9:59:59.000
So you can make a statement like this:

9:59:59.000,9:59:59.000
The 99th percentile, if you had a goal[br]like this,

9:59:59.000,9:59:59.000
and now you've passed the goal,

9:59:59.000,9:59:59.000
You'd say, "Both of them passed the goal,[br]but system B does it at 4 times the load."

9:59:59.000,9:59:59.000
That drives a choice, right?

9:59:59.000,9:59:59.000
You can make a harsher goal, and say,

9:59:59.000,9:59:59.000
I need the three 9's to be below[br]10 milliseconds,

9:59:59.000,9:59:59.000
so you'll slow these down even further.

9:59:59.000,9:59:59.000
At this point, you can make this statement:

9:59:59.000,9:59:59.000
If you want those, one of them is [br]10 times better.

9:59:59.000,9:59:59.000
Meaning, not that the system is[br]10 times faster,

9:59:59.000,9:59:59.000
but I can carry 10 times the load[br]before I fail, before I have to pull.

9:59:59.000,9:59:59.000
What I'm trying to demonstrate here,[br]is that how much more, or not,

9:59:59.000,9:59:59.000
you can get out of a system depends[br]on your requirements,

9:59:59.000,9:59:59.000
and whether or not you need to meet them.

9:59:59.000,9:59:59.000
Without setting those requirements,[br]looking at the percentile spectrum

9:59:59.000,9:59:59.000
of response time, not service time,

9:59:59.000,9:59:59.000
you'll never know how much you[br]need or not.
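
One way to turn a requirement into a capacity number, as a rough sketch. The figures below are invented placeholders, NOT measurements from the talk, and the function name is hypothetical.

```python
# Invented placeholder numbers (NOT measurements from the talk):
# offered load in requests/second -> 99.9th percentile response time in ms.
SYSTEM_A = {10_000: 4, 20_000: 9, 40_000: 35, 80_000: 400}
SYSTEM_B = {10_000: 2, 20_000: 3, 40_000: 6, 80_000: 9}

def max_load_meeting(results, limit_ms):
    """Highest tested load whose percentile stays below limit_ms, else None."""
    passing = [load for load, p_ms in results.items() if p_ms < limit_ms]
    return max(passing) if passing else None

LIMIT_MS = 10  # requirement: three 9's below 10 milliseconds
a = max_load_meeting(SYSTEM_A, LIMIT_MS)
b = max_load_meeting(SYSTEM_B, LIMIT_MS)
print(f"A carries {a}/s and B carries {b}/s within {LIMIT_MS} ms")
```

The comparison is about how much load each system carries while still meeting the requirement, not about which one is faster at a single load.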

9:59:59.000,9:59:59.000
You can do a lot of other things, [br]these are just demonstrations

9:59:59.000,9:59:59.000
of how to look at data sets.

9:59:59.000,9:59:59.000
You can measure at a lot of levels.

9:59:59.000,9:59:59.000
You can look for systemic behaviors.

9:59:59.000,9:59:59.000
For example, this is one system, but[br]at varying levels.

9:59:59.000,9:59:59.000
You can sort of see that as you increase[br]the load, the percentiles move to the left.

9:59:59.000,9:59:59.000
That's a good observation.

9:59:59.000,9:59:59.000
It's not all systems that'll do it, but [br]for this system it'll be that.

9:59:59.000,9:59:59.000
You can also see that even though[br]this didn't totally collapse,

9:59:59.000,9:59:59.000
it's completely out of whack with[br]the rest,

9:59:59.000,9:59:59.000
so that kind of tells you let's not[br]look there.

9:59:59.000,9:59:59.000
So throw away the behavior...

9:59:59.000,9:59:59.000
You just know not to go to 80.

9:59:59.000,9:59:59.000
No need to study it much.

9:59:59.000,9:59:59.000
Now that's the remaining set.

9:59:59.000,9:59:59.000
You could look at that.

9:59:59.000,9:59:59.000
You could look at the set from the other[br]system and compare them.

9:59:59.000,9:59:59.000
Maybe put them next to[br]each other like this.

9:59:59.000,9:59:59.000
Or if you actually can fit enough lines,[br]with enough colors on a chart,

9:59:59.000,9:59:59.000
you can try and do stuff like that.

9:59:59.000,9:59:59.000
These are all good ways to actually[br]look at latencies,

9:59:59.000,9:59:59.000
actually study them.

9:59:59.000,9:59:59.000
And notice that in all these cases, [br]I didn't pick a number.

9:59:59.000,9:59:59.000
"Oh, let's compare the 99.9 percentile,"[br]because I won't get

9:59:59.000,9:59:59.000
any feeling for the shapes if I did that.

9:59:59.000,9:59:59.000
You want to look at the entire spectrum.

9:59:59.000,9:59:59.000
And that is what an HDR histogram [br]is very good for.

9:59:59.000,9:59:59.000
So, you know... You get those.

9:59:59.000,9:59:59.000
Now...

9:59:59.000,9:59:59.000
Wow, we're actually doing okay on time.

9:59:59.000,9:59:59.000
Now, this is one of my favorite ways to [br]depict things.

9:59:59.000,9:59:59.000
Remember I told you that if you don't plot[br]the max, what are you hiding?

9:59:59.000,9:59:59.000
It turns out that if you plot the max,

9:59:59.000,9:59:59.000
usually it's the number one[br]signal to look at over time,

9:59:59.000,9:59:59.000
these are just those two systems.

9:59:59.000,9:59:59.000
And with a simple visual, you get[br]a great intuition.

9:59:59.000,9:59:59.000
Same load, one of them's noisy, one's not.

9:59:59.000,9:59:59.000
You can look at the response time[br]and service time,

9:59:59.000,9:59:59.000
and all of the numbers of different[br]samples of percentiles,

9:59:59.000,9:59:59.000
but if you actually want to show a CEO[br]something,

9:59:59.000,9:59:59.000
this is a pretty good thing to show them.

9:59:59.000,9:59:59.000
"Look what I did over the weekend."

9:59:59.000,9:59:59.000
"Before the weekend it looked like that,[br]and I fixed it."

9:59:59.000,9:59:59.000
"I deserve a prize."

9:59:59.000,9:59:59.000
With that, a simple thing to remember[br]is that this is your load on system A,

9:59:59.000,9:59:59.000
this is your load on system B.

9:59:59.000,9:59:59.000
Any questions?

9:59:59.000,9:59:59.000
This is from an anti-drug commercial [br]in the 80's,

9:59:59.000,9:59:59.000
I don't know if anybody can remember.

9:59:59.000,9:59:59.000
So with that, we're ready[br]for any questions.

9:59:59.000,9:59:59.000
Any questions?

9:59:59.000,9:59:59.000
Wow, that bad?

9:59:59.000,9:59:59.000
(Laughing) Dreadful.

9:59:59.000,9:59:59.000
Okay, I have one here, and[br]one back there.

9:59:59.000,9:59:59.000
Let's start with the back.

9:59:59.000,9:59:59.000
(Audience Member): You said that there are[br]all these tools that you could use

9:59:59.000,9:59:59.000
that give you reasonable numbers, [br]and reasonable answers as far as

9:59:59.000,9:59:59.000
latency is concerned, so what are those[br]tools that you use?

9:59:59.000,9:59:59.000
So the question was, there are a couple of[br]tools I mentioned that could give you

9:59:59.000,9:59:59.000
better information, and I used some[br]to chart here,

9:59:59.000,9:59:59.000
let me see, there are a lot of tools.

9:59:59.000,9:59:59.000
I used HDR histogram to plot[br]all these charts

9:59:59.000,9:59:59.000
with the continuous percentile curves.

9:59:59.000,9:59:59.000
I highly recommend you look at using it.

9:59:59.000,9:59:59.000
Just go to hdrhistogram.org[br]and read stuff.

9:59:59.000,9:59:59.000
Or google it.

9:59:59.000,9:59:59.000
There's a bunch of people using it.

9:59:59.000,9:59:59.000
And the basic thing it does, is that it[br]gives you a tool that

9:59:59.000,9:59:59.000
allows you a practical way to have[br]this kind of

9:59:59.000,9:59:59.000
fidelity, dynamic range, and resolution[br]to even look at the shapes.

9:59:59.000,9:59:59.000
The other way to do it is to keep[br]all the data.

9:59:59.000,9:59:59.000
You don't have to have histograms[br]if you kept every single result,

9:59:59.000,9:59:59.000
but in many places that's not practical,[br]or makes it harder for the system to run.

9:59:59.000,9:59:59.000
If you can do that, that's even better.

9:59:59.000,9:59:59.000
And then run it through an HDR histogram[br]for analysis.

9:59:59.000,9:59:59.000
So that's as far as viewing things.

9:59:59.000,9:59:59.000
If you have data viewing it.

9:59:59.000,9:59:59.000
Unfortunately, HDR histogram is not[br]going to make the data good.

9:59:59.000,9:59:59.000
It's just going to show you[br]the data you have.

9:59:59.000,9:59:59.000
One of the things I would highly [br]recommend you try to do,

9:59:59.000,9:59:59.000
I'm going backwards, and hopefully[br]I'll hit what I wanted.

9:59:59.000,9:59:59.000
I highly recommend you look at your [br]data sets, and remember

9:59:59.000,9:59:59.000
that in this visual,

9:59:59.000,9:59:59.000
one strong tip I will give you, is that [br]any time you see a vertical

9:59:59.000,9:59:59.000
rise like that, you have a 99.9% chance[br]of looking at coordinated omission.

9:59:59.000,9:59:59.000
This is what coordinated omission[br]looks like.

9:59:59.000,9:59:59.000
There's a couple of other things that[br]can also look like that.

9:59:59.000,9:59:59.000
I haven't seen them in a while, but [br]I can make them artificially happen,

9:59:59.000,9:59:59.000
so it's not conclusive that this is [br]coordinated omission,

9:59:59.000,9:59:59.000
but suspect it.

9:59:59.000,9:59:59.000
Suspect it hard.

9:59:59.000,9:59:59.000
So if you plot your data with[br]coordinated omission,

9:59:59.000,9:59:59.000
you will get a view of whether or not[br]you have this other problem.

9:59:59.000,9:59:59.000
But honestly, there's a much simpler[br]way to do it.

9:59:59.000,9:59:59.000
Run your control Z test, and see [br]if you have the problem.

9:59:59.000,9:59:59.000
This will just show you how it works.

9:59:59.000,9:59:59.000
A non-omitted, a sane response time[br]test, or latency test,

9:59:59.000,9:59:59.000
tends to have these more smooth humps[br]of curves transitioning between numbers.

9:59:59.000,9:59:59.000
Any vertical rise tends to[br]indicate omission.

9:59:59.000,9:59:59.000
So that's one thing there.

9:59:59.000,9:59:59.000
As far as the tools actually[br]measuring correctly,

9:59:59.000,9:59:59.000
remember I told you what the name of [br]the talk is,

9:59:59.000,9:59:59.000
so let me rattle off some tools.

9:59:59.000,9:59:59.000
Actually, let's do this.

9:59:59.000,9:59:59.000
You guys measure stuff here,[br]I assume.

9:59:59.000,9:59:59.000
Could you rattle off some tools [br]that you use?

9:59:59.000,9:59:59.000
What do you use for load generation[br]and measurement right now?

9:59:59.000,9:59:59.000
Volunteers?

9:59:59.000,9:59:59.000
JMeter?

9:59:59.000,9:59:59.000
Okay, JMeter.

9:59:59.000,9:59:59.000
Gatling.

9:59:59.000,9:59:59.000
Anybody else?

9:59:59.000,9:59:59.000
Okay, anybody with Grinder, WRK, [br]some of the commercial...

9:59:59.000,9:59:59.000
Oh, well yeah.

9:59:59.000,9:59:59.000
Gatling is the only tool I know of[br]right now, that is an actual tool

9:59:59.000,9:59:59.000
people use, not a demo, that has fixed[br]a coordinated omission

9:59:59.000,9:59:59.000
problem in its measurement.

9:59:59.000,9:59:59.000
There was actually a bug filed against it,

9:59:59.000,9:59:59.000
and the control Z edition in it[br]was fixed.

9:59:59.000,9:59:59.000
It is actually possible to perfectly [br]fix this.

9:59:59.000,9:59:59.000
You don't have to correct your guess,[br]you can actually correctly compute

9:59:59.000,9:59:59.000
the exact response time in any load[br]generator on earth,

9:59:59.000,9:59:59.000
if you just do it right.

9:59:59.000,9:59:59.000
All the other tools,

9:59:59.000,9:59:59.000
JMeter, Grinder, WRK,

9:59:59.000,9:59:59.000
the commercial tools that I won't mention,

9:59:59.000,9:59:59.000
they all do this wrong, unfortunately.