Hi everyone, I'm Gil Tene. I'm going to be talking about this subject that I call "How NOT to Measure Latency". It's a subject I've been talking about for three years or so now. I keep the title and change all the slides every time, and a bunch of this stuff is new, so if you've seen any of my previous "How NOT to" talks, you'll see only some things that are common. A nickname for the subject is this... because I often get that reaction from some people in the audience. Ever since I've told people that it's a nickname, they feel free to actually exclaim, "Oh S@%#!". Feel free to do that here in this talk. I'll prompt you in a couple of places where it is natural, but if you just have the urge, go ahead.

So, just a tiny bit about me. I am the co-founder of Azul Systems, and I play around with garbage collection a lot. Here is some evidence of me playing around with garbage collection in my kitchen. That's a trash compactor. The compaction function wasn't working right, so I had to fix it, and I thought it would be funny to take a picture with a book. I've also built a lot of things. I've been playing with computers since the early 80's. I've built hardware, I've helped design chips, and I've built software at many different levels: operating systems, drivers, JVMs obviously, and lots of big systems at the system level. We built our own app server in the late 90's because WebLogic wasn't around yet. So I've made a lot of mistakes, and I've learned from a few of them. This talk is actually a combination of a bunch of those mistakes in looking at latency. I do have this hobby of depressing people by pulling the wool up from over their eyes, and that is what this talk is about.

So, I need to give you a choice right here. There's the door. You can take the blue pill, and you can leave; tomorrow you can keep believing whatever it is you want to believe. But if you stay here and take the red pill, I will show you a glimpse of how far down the rabbit hole goes, and it will never be the same again.

Let's talk about latency. When I say latency, I'm talking about latency, response time, any of those things where you measure time from 'here to here' and you're interested in how long it took. We do this all the time, but I see a lot of mish-mash in how people treat the data, or think about it. Latency is basically the time it took something to happen once. That one time, how long did it take? And when we measure stuff, like "we did a million operations in the last hour", we have a million latencies. Not one, we have a million of them. Our actual goal is to figure out how to describe that million. How did the million behave? For example, 'they're all really good, and they're all exactly the same' would be a behavior that you will never see, but it would be a great behavior. So we need to be able to talk about how things behave: communicate, think, evaluate, set requirements, talk to other people. These are all common things we do around that. To do that, we have to describe the distribution, the set, the behavior, not the one. For example, the statement "the common case was X" is a piece of information about the behavior, but it's a tiny sliver. Usually the least relevant one. Well, there are some less relevant ones, but it's not a strongly relevant one, and it's one that people often focus on.

To take a look at what we actually do with this stuff, almost on a daily basis, this is a snapshot from a monitoring system. A small dashboard on a big screen in a monitoring system,
where you're watching the response time of a system over time. This is a two-hour window. These lines are the 95th, 90th, 75th, 50th, and 25th percentiles, and you can look at how they behave over time. We're a small audience here: if you look at this picture, what draws your eye? What do you want to go investigate here, or pay attention to? It's the big red spike there, right? So we could look at the red spike, because it's different, and say, "Whoa, the 95th percentile shot up here. And look, the 90th percentile shot up at about the same time. The rest of them didn't shoot up, so maybe something happened here that affected that much of the spectrum. I should probably pay attention to it, because it's a monitoring system and I like things to be calm." You could go investigate the why.

At this point, I've managed to waste about 90 seconds of your life looking at a completely meaningless chart, which unfortunately you do every day, all the time. This chart is the chart you want to show somebody if you want to hide the truth from them. If you want to pull the wool over their eyes. This is the chart of the good stuff. What's not on this chart? The 5% worst things that happened during this two hours. They're not here. This is only the good things that happened during that time. And to get this spike, that 5% had to be so bad that it even pulled the 95th percentile up. There is zero information here at all about what happened that was bad during these two hours, which makes it a bad fit for a monitoring system. It's a really good thing for a marketing system. It's a great way to get the bonus from your boss even though you didn't do the work. If you want to learn how to do that, we can do another talk about that. But this is not a good way to look at latency; it's the opposite of good. Unfortunately, this is one of the most common tools used for server monitoring on earth right now. That's where the snapshot is from, and this is what people look at.

I find this chart to be a goldmine of information. When I first showed it in another talk like this, I had this really cool experience. Somebody came up to me and said, "Hey, as I was sitting here, I was texting one of our guys, and he was saying, 'look, we have this issue with our 95th percentile', and I got this chart from him!" So I went and said, "Hey, what does the rest of the spectrum look like?" This is the actual chart they got. And when they looked at the rest of the spectrum, it looked like that. That's what was hiding. Notice that the scales are a little different: that yellow line is that yellow line. So that's a much more representative number. Is it? Is that good enough? That's the 99th percentile. We still have another 1% of really bad stuff hiding above the blue line. I wonder how big that is? I don't know, because he didn't have the data.

So a common problem we have is that we only plot what's convenient. We only plot what gives us nice, colorful graphs. And often, when we have to choose between the stuff that hides the rest of the data and the stuff that is noise, we choose the noise to display. I like to rant about latency. This is from a blog that I don't write in enough, but the format for it is simple: I tweet a single tweet about latency, a "latency tip of the day", and then I rant about my own tweet. As an example, this chart is a goldmine of information because it has so many different things wrong in it, but we won't get into all of them; you can read it online. Anyway, this is the one to take away from what we just said.
If you are not measuring and showing the maximum value, what is it you are hiding? And from whom? If your job is to hide the truth from others, this is a good way to do it. But if you actually are interested in what's going on, the number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.

Okay, let's look at this chart for some more cool stuff. I'm going to zoom in to a small part of the chart, and ask you what that means. What does the average of the 95th percentile over two hours mean? What is the math that does that? What does it do? Let's look at that, and I'll give you an example with another percentile: the 100th percentile. The max, right? Let's take a data set. Suppose this was the maximum every minute for 15 minutes. What does it mean to say that the average max over the last 15 minutes was 42? I specifically chose the data to make that happen. It's a meaningless statement. A completely meaningless statement. But when you see "95th percentile, average 184", you think that the 95th percentile for the last two hours was around 184. It makes you think that. Putting this on a piece of paper is not just noise and irrelevant, it's a way to mislead people. It's a way to mislead yourself, because you'll start to believe your own mistruths. This is true for any percentile. There is no percentile that you can do this math on. Another tip: you cannot average percentiles. That math doesn't work.

But percentiles do matter. You really want to know about them. And a common misperception is that we want to look at the main part of the spectrum, not those outliers and perfection stuff; only people who actually bet their house, or the bank, on it every day need to know about the five 9's and all that. The 99th percentile is a pretty good number. Is 99% really rare? Let's look at some stuff, because we can ask questions like, "If I were looking at a webpage, what is the chance of me hitting the 99th percentile?" Of things like this: a search engine node, or a key-value store, or a database, or a CDN, right? Because they will report their 99th percentile. They won't tell you anything above that. But how many of the webpages we go to actually experience this? You want to say 1%, right? Well, I went to some webpages and I counted how many HTTP requests were generated by one click into that webpage, and here are the numbers. I did that count about a year ago; they've probably gone up since then. Now that translates into this math. This is the likelihood of one click seeing the 99th percentile, and the only page where that is less than 50% is the clean Google search page, where only a quarter will see the 99th percentile. The 99th percentile is the thing that most of your webpage views will see. Most of them will be there.

Now, we could look at other things. We can pick which things to focus on. Let's say I had to pick between the 95th percentile and the three 9's (99.9%). The three 9's is way into perfection mode for most people, or so they think. Which one of those represents our community better? Our population? Our users? Our experience? Let's run a hypothetical. Suppose we don't have that many pages, and not that many resources like we said before; we'll be much more conservative. A user session will only go through five clicks, and each click will only bring up at most 40 things. A lot less, and they're all as clean as the Google page. How many of the users will not experience something worse than the 95th percentile?
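Before giving the answer, it's worth seeing how little arithmetic is involved. Here is a minimal sketch (plain Java; the request and click counts are just the assumptions of the hypothetical above, and requests are treated as independent) that reproduces the numbers discussed next:

```java
public class PercentileOdds {
    public static void main(String[] args) {
        // Assumptions from the hypothetical: 5 clicks per session, up to 40 requests per click.
        int requestsPerClick = 40;
        int clicksPerSession = 5;
        int requestsPerSession = requestsPerClick * clicksPerSession;   // 200

        // Chance that at least one of a click's requests lands at or above the 99th percentile.
        double clickSeesP99 = 1.0 - Math.pow(0.99, requestsPerClick);
        System.out.printf("One %d-request click sees the 99th percentile: %.1f%%%n",
                requestsPerClick, clickSeesP99 * 100);

        // Chance a whole session never sees worse than the 95th / 99.9th percentile.
        double cleanAt95  = Math.pow(0.95,  requestsPerSession);
        double cleanAt999 = Math.pow(0.999, requestsPerSession);
        System.out.printf("Session never worse than the 95th percentile:   %.4f%%%n", cleanAt95 * 100);
        System.out.printf("Session never worse than the 99.9th percentile: %.1f%%%n", cleanAt999 * 100);

        // Which percentile do 95% (or 99%) of sessions stay within?
        double pctFor95 = Math.pow(0.95, 1.0 / requestsPerSession) * 100;
        double pctFor99 = Math.pow(0.99, 1.0 / requestsPerSession) * 100;
        System.out.printf("95%% of sessions stay within the %.2fth percentile%n", pctFor95);
        System.out.printf("99%% of sessions stay within the %.3fth percentile%n", pctFor99);
    }
}
```

Running it prints roughly the figures quoted next: about a third of 40-request clicks touch the 99th percentile, only a tiny fraction of a percent of 200-request sessions stay at or below the 95th percentile, and you have to go out past the 99.97th percentile before 95% of sessions are covered. Treating requests as independent is a simplification, but it is the same back-of-the-envelope assumption behind these figures.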
Because that's what the 95th percentile is good for: the people who see that or better. What are the chances of never seeing worse than it? That's an interesting number. So you're watching a number that is relevant to 0.003% of your users. 99.997% of your users are going to see worse than this number. Why are you looking at it? Why are you spending time thinking about it? In reverse, we could ask how many people are going to see something worse than the three 9's (99.9%). That's going to be 18%; put the other way, 82% of the people will see the three 9's (99.9%) or better. That's a slightly better representation. Probably not good enough either. We could look at some more math with this, same kind of scenario. What percentile of HTTP response time will be the thing that 95% of people experience in this scenario? It's the 99.97th percentile that 95% of people see. And if you want to know what 99% of the people see, that's four and a half 9's (99.995%). That's the number you want to know from Akamai if you want to predict what 1% of your users are going to experience. When you know the 99th percentile, you kind of know a tiny bit.

So here's another tip. And this is not an exaggeration, by the way. The median, which is a much lower percentile, has that minuscule a chance of ever being the worst thing anybody experiences. That number is the chance of not seeing something worse than the median, which makes the median an irrelevant number to look at. Unfortunately, it's probably the most commonly looked-at one. When people say "the typical", they're looking at the thing that everything will be worse than. Okay, I'm sorry about that part. We'll do some other parts.

Now, why is it that when we look at these monitoring systems, we don't see data with a lot of 9's? Why do we stop at the 90th, 95th, 99th percentile? Why don't we look further? Some of it is because people think, "Well, that's perfection, I don't need it." The other part is that it's hard. It's hard because you can't average percentiles; we already talked about that. But you also can't derive your five 9's (99.999%) out of a lot of 10-second samples of percentiles. And the reason for that is: hey, in 10 seconds, maybe I only had 1,000 things. I could take all the 10-second summaries in the world, and there's no way to say what the hour's five 9's (99.999%) were, or the minute's five 9's, if I'm collecting just this data. And unfortunately, the data being collected and reported to the back ends of monitoring systems is usually summarized at a second, 5 seconds, 10 seconds, etc., basically throwing away all the good data, and leaving you with absolutely no way to compute large 9's for longer periods of time.

So, this is where you might want to look at HdrHistogram. It's an open source thing I created a few years ago. I did it in Java, and now there are C, C#, Python, Erlang, and Go ports of it that I didn't create. It lets you actually get an entire percentile spectrum. Some of you here, I know, are already using it. You can look at all the percentiles, any number of 9's that's in the data, if you just keep it right and report it right. It's got a log format, so you can store things forever. Well, for a long time. Okay, so it lets you have nice things. Enough of that advertisement.

Now, latency... Well, I think this is slightly out of order. Yeah, sorry. This is the red/blue pill part, so I warn you, this is your last chance. There's a problem I call the coordinated omission problem. The coordinated omission problem is basically a conspiracy.
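Before getting into coordinated omission, here is a concrete illustration of the point about summaries versus histograms: 10-second percentile summaries cannot be combined into an hour's percentiles, but full histograms can simply be added together. A minimal sketch assuming HdrHistogram's Java API, with made-up latency values standing in for real measurements:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.HdrHistogram.Histogram;

public class CombiningHistograms {
    public static void main(String[] args) {
        // One histogram per 10-second reporting interval, one for the whole hour.
        // Bounds: track values up to 1 hour (in ns) at 3 significant decimal digits.
        Histogram interval = new Histogram(3_600_000_000_000L, 3);
        Histogram wholeRun = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 360; i++) {            // 360 ten-second intervals = 1 hour
            interval.reset();
            for (int j = 0; j < 1_000; j++) {      // made-up per-interval samples
                long latencyNs = ThreadLocalRandom.current().nextLong(1_000_000, 5_000_000);
                interval.recordValue(latencyNs);
            }
            // A 10-second percentile summary could NOT be merged into hour-level
            // percentiles, but the histograms themselves can simply be added.
            wholeRun.add(interval);
        }

        // Any number of 9's can now be read off the hour's worth of data.
        System.out.println("hour p99.999 (ms): "
                + wholeRun.getValueAtPercentile(99.999) / 1_000_000.0);

        // The full percentile spectrum is also available, e.g. for plotting.
        wholeRun.outputPercentileDistribution(System.out, 1_000_000.0);
    }
}
```

The hour's five 9's fall out of the combined histogram because nothing was thrown away; the log format mentioned above is how the same per-interval histograms get stored and replayed later.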
The coordinated omission problem is a conspiracy that we're all part of. I don't think anybody actually meant to do it, but once I noticed it, everywhere I look, there it is. Now, I've been using a specific way of showing you numbers so far. Has anybody here noticed how I spell percentile? (Audience member): "You put 'lie' at the end of the percent sign." Yeah, good. The coordinated omission problem is the "lie" in %'lies. And this is how it works.

One common way to do this is to use a load generator. Pretty much all load generators have this problem; there are two that I know of that don't. What you do with a load generator is you test. You issue requests, or send packets, and you measure how long something took. And as long as the numbers come back fine, you measure them, put them in a bucket, study them later, and get your percentiles from them. But what if the thing you are measuring took longer than the time at which you were supposed to send the next thing? You're supposed to send something every second, but this one took a second and a half. Well, you've got to wait before you send the next one. You just avoided measuring something while the system was problematic. You've coordinated with it. You weren't looking at it then. That's common scenario A: you've backed off, and avoided measuring when it was bad.

Another way is to measure inside your code. We all do this. We all have to do this: we measure the time, do something, then measure the time again. The delta between them is how long it took. We can then put it in a stats bucket, and do the percentiles on that. Unfortunately, if the system freezes right here, for any reason, an interrupt, a context switch, a cache buffer flushed to disk, a garbage collection, a re-indexing of your database (this is a database, this is Cassandra by the way, measuring itself), then you will have one bad report while 10,000 things are waiting in line. And when they come in, they will look really, really good, even though each one of them has had a really bad experience. It can even get worse: maybe the freeze happened outside the timing window, and you won't even know there was a freeze.

Now, these are examples of omitting bad data on a very selective basis. It's not random sampling. It's "I don't like bad data", or "I couldn't handle it", or "I don't know about it", so we'll just talk about the good. What does that do to your data? Because people often feel like, "Okay, yeah, I understand, but it's a little bit of noise." Let's run some hypotheticals, and I'll show you some real numbers. Imagine a perfect system. It's doing 100 requests a second, at exactly one millisecond each. But we go and freeze the system for 100 seconds after every 100 seconds of perfect operation, and then repeat. Now, I'm going to describe how the system behaves in terms that should mean something, and then we'll measure it. If we actually wanted to describe the system: on the left, the first 100 seconds, we have an average of one millisecond, and on the right, the frozen 100 seconds, we have an average of 50 seconds. Why 50? Because if I randomly came in during that 100 seconds, I'd get anything from 0 to 100 with an even distribution. The overall average over the 200 seconds is 25 seconds. If I just came in here and said, "Surprise, how long did this take?", on average it will be 25. I can also do the percentiles. The 50th percentile will be really good, and then it gets really bad. The four 9's is terrible. This is a fair, honest description of this system, if this is what it did. And you can make the system do that.
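For concreteness, here is a minimal sketch of the self-timing pattern just described (hypothetical names; doWork() stands in for whatever is being measured, and HdrHistogram is used as the stats bucket). The comments mark the two failure modes: a stall inside the timed window produces one bad sample while everything queued behind it records as fast, and a stall between iterations never shows up at all:

```java
import org.HdrHistogram.Histogram;

public class NaiveSelfTiming {
    // Hypothetical stand-in for the operation being measured (a query, a request, ...).
    static void doWork() { /* ... */ }

    public static void main(String[] args) {
        // Track latencies up to 1 hour (in nanoseconds) at 3 significant digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            doWork();                       // a long freeze here becomes ONE bad sample,
                                            // and the requests stuck in line behind it are
                                            // recorded afterwards as if they were fast
            long end = System.nanoTime();
            histogram.recordValue(end - start);
            // a freeze HERE, outside the timed window, is never recorded at all;
            // a load generator that waits for this reply before issuing the next
            // request backs off in exactly the same way when the system is bad
        }

        System.out.println("reported p99.99 (ms): "
                + histogram.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```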
You can make any of your systems do that; that's what Ctrl-Z is good for. Now let's go measure this system with a load generator, or with a monitoring system. The common ones. The ones everybody uses. On the left, we're going to get 10,000 results of one millisecond each. Great. And we're going to get one result of 100 seconds. Wow, really big response time. This is our data. This is OUR data. So now you go do math with it. The average of that is 10.9 milliseconds. A little less than 25 seconds. And here are the percentiles. Your load generator or monitoring system will tell you that this system is perfect. You could go to production with it. You like what you see. Look at that, four 9's. It is lying to you. To your face. And you can catch it doing that with a Ctrl-Z test. But people tend not to want to do that, because then what are they going to do? If you just do that test, and calibrate your system, and you find it telling you that about a system like this, the next step should be to throw all the numbers away. Don't believe anything else it says. If it lies this big, what else did it do? Don't waste your time on numbers from uncalibrated systems.

Now, the problem here was that if you want to measure the system, you have to keep measuring at random times or at a steady rate, regardless of how it responds. If you measured 10,000 things in the first 100 seconds, there should be another 10,000 things here. If you measured them, you would have gotten all the right numbers. Coordinated omission is the simple act of erasing all that bad stuff. The conspiracy here is that we all do it without meaning to. I don't know who put that in our systems, but it happens to all of us.

Now, I often get people saying, "Okay, I get it. All the numbers are wrong, but at least for my job, where I tune performance and try to make things faster, I can use the numbers to figure out if I'm going in the right direction. Is it better, or is it worse?" Let me dispel that for you for a second. Suppose I went and took this system and improved it dramatically. Rather than freezing for 100 seconds, it will now answer every question. Each answer will take a little longer, 5 milliseconds instead of one, but that's much better than freezing, right? So let's measure the system that we spent weeks and weeks improving, and see if it's better. That's the data. If we do the percentiles, it'll tell us that we just really hurt the four 9's. We made it go 5 times worse than before. We should revert this change and go back to that much better system we had before. So this is just to make sure that you don't think you can have any intuition based on any of these numbers. They go backwards sometimes. You don't know which way is good or bad, and you'll never know which way is good or bad with a system that lies like that.

The other cool technique is what I call "cheating twice". You have a constant-rate load generator, and it needs to do 100 per second. When it wakes up after 200 seconds, it says, "Whoa, we're 9,999 behind. We've got to issue those requests." So it issues those requests. At this point, not only did it get rid of all the bad requests, it replaced every one of them with a perfect request, moving the four 9's (99.99%) all the way to four and a half 9's (99.995%). It's twice as wrong as just dropping them. So these are all cool things that happen to you. I'm not going to spend much time on how to fix these and avoid them; there's a lot of other material you can find with me talking about that, in longer talks. But this is pretty bad. And like I said... That should've been up there before.
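The talk defers the fixes to other material, but one partial mitigation is worth sketching, since it lives in the library already mentioned: HdrHistogram can back-fill the samples that coordinated omission erased, if you tell it the interval at which measurements were expected. A minimal sketch, with an assumed 10 ms expected interval and a single observed 2-second stall:

```java
import org.HdrHistogram.Histogram;

public class CorrectedRecording {
    public static void main(String[] args) {
        // Track up to 1 hour in nanoseconds, 3 significant digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        long expectedIntervalNs = 10_000_000L;    // assumption: one measurement every 10 ms
        long measuredLatencyNs  = 2_000_000_000L; // one observed 2-second stall

        // Records the 2 s value, plus the synthetic samples (2 s - 10 ms, 2 s - 20 ms, ...)
        // that a non-coordinating measurement would have seen while the system was stalled.
        histogram.recordValueWithExpectedInterval(measuredLatencyNs, expectedIntervalNs);

        System.out.println("p99.99 (ms): "
                + histogram.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```

This is only an approximation of what an uncoordinated measurement would have captured; measuring against the intended schedule, as sketched at the end of this section, is the more direct fix.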
How did this repeat itself? Did I create a loop in the presentation somehow? I don't know how to do that. Let's see if I can get through here. Hopefully editing later will take it out. So we have the cheating twice. There, okay.

So, after we look at coordinated omission that way, we should also look at response time and service time. What coordinated omission really achieves for you, unfortunately, is that it takes something you think is response time and only shows you the service time component of latency. This is a simple depiction of what service time and response time are. This guy is taking a certain amount of time to take payment or make a cup of coffee. That's service time: how long does it take to do the work? This person has experienced the response time, which includes the amount of time they had to wait before they got to the person who does the work. And the difference between those two is immense. The coordinated omission problem makes something that you think is response time only measure the service time, and basically hides the fact that things stalled and waited in line, that this guy might have taken a lunch break, and now we have a line around the building three times. The service time stays the same. This is the backwards part...

Now, let's look at what it actually looks like. In a load generator that I fixed, I measured both response time and service time. This happens to be Cassandra, at a very low load. And you can see that they're very, very similar at a very low load. Why? Because there's nobody in line. This thing is really fast, and we're not asking for too much. Cassandra's pretty fast, so they're the same. But if I increase the load, we start seeing gaps. If I increase the load a little more, the gap grows. If I increase it a little more, the gap grows again. Now, this is not the failure point yet. If I actually increase it all the way past the point where the system can't even do the work I'm asking for, service time stays the same and response time goes through the roof. This was 100-and-something milliseconds before; now it's 7 and a half seconds. Why 7 and a half seconds? Because you're waiting in line that long to get around the block. The guy just can't serve as many people as are showing up, so you fall behind. This is a virtual-world reaction to this. I really like this slide; it's where I came up with the notion of the blue/red pill. When you actually measure reality, people tend to have this reaction when they compare the two.

And if we actually look at these on the two sides of the collapse point of a system (this specific system can only do 87,000 things a second; no matter how hard you press it, that's all it'll do), the service time on the two sides of the collapse looks virtually identical, which it would. But if you compare the response time, you have a very different picture. And I'm showing this picture so you get a feeling for what to look at to tell whether or not you're measuring the right one.
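To tie the service-time/response-time distinction to code, here is a minimal sketch (hypothetical names, not the actual load generator from the talk) of a constant-rate tester that measures both. The sender works to a fixed schedule; service time is measured from when a request actually went out, while response time is measured from when it was supposed to go out, so queueing and stalls are charged to the results instead of silently deferring the next send:

```java
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

public class ConstantRateLoad {
    // Hypothetical stand-in for a synchronous request/response round trip.
    static void sendAndAwaitReply() { /* ... */ }

    public static void main(String[] args) throws InterruptedException {
        Histogram serviceTime  = new Histogram(3_600_000_000_000L, 3);
        Histogram responseTime = new Histogram(3_600_000_000_000L, 3);

        long intervalNs = 10_000_000L;                  // plan: 100 requests per second
        long intendedStart = System.nanoTime();

        for (int i = 0; i < 100_000; i++) {
            // Wait until the scheduled send time, but never push the schedule back.
            long now = System.nanoTime();
            if (intendedStart > now) {
                TimeUnit.NANOSECONDS.sleep(intendedStart - now);
            }

            long actualStart = System.nanoTime();
            sendAndAwaitReply();
            long done = System.nanoTime();

            serviceTime.recordValue(done - actualStart);    // how long the work itself took
            responseTime.recordValue(done - intendedStart);  // includes time spent "in line"

            intendedStart += intervalNs;                    // next slot on the fixed plan
        }

        System.out.println("service  p99.99 (ms): "
                + serviceTime.getValueAtPercentile(99.99) / 1_000_000.0);
        System.out.println("response p99.99 (ms): "
                + responseTime.getValueAtPercentile(99.99) / 1_000_000.0);
    }
}
```

On a healthy system the two histograms look nearly identical; past the saturation point the response-time numbers keep growing while the service-time numbers barely move, which is exactly the gap shown in the last charts.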