Hi everyone, I'm Gil Tene.
I'm going to be talking about this subject
that I call "How NOT to Measure Latency".
It's a subject I've been talking
about for 3 years or so now.
I keep the title and change all
the slides every time.
A bunch of this stuff is new.
So if you've seen any of my previous "How NOT to",
you'll see only some things that are common.
A nickname for the subject is this...
Because I often will get that reaction
from some people in the audience.
Ever since I've told people that it's a
nickname,
they feel free to actually exclaim,
"Oh S@%#!".
And feel free to do that here in this talk.
I'll prompt you in a couple of places
where it is natural.
But if you just have the urge, go ahead.
So just a tiny bit about me.
I am the co-founder of Azul Systems.
I play around with garbage collection a lot.
Here is some evidence of me playing around
with garbage collection in my kitchen.
That's a trash compactor.
The compaction function wasn't working right,
so I had to fix it.
I thought it'd be funny to take a picture
with a book.
I've also built a lot of things.
I've been playing with computers since
the early 80's.
I've built hardware.
I've helped design chips.
I've built software at many
different levels.
Operating systems, drivers...
JVM's obviously.
And lots of big systems at the system level.
Built our own app server in the late 90's
because WebLogic wasn't around yet.
So, I've made a lot of mistakes,
and I've learned from a few of them.
This is actually a combination of a bunch
of those mistakes looking at latency.
I do have this hobby of depressing people
by pulling the wool up from over their eyes,
and that is what this talk is about.
So, I need to give you a choice right here.
There's the door.
You can take the blue pill,
and you can leave.
Tomorrow you can keep believing whatever
it is you want to believe.
But if you stay here and take the red pill,
I will show you a glimpse of how
far down the rabbit hole goes,
and it will never be the same again.
Let's talk about latency.
And when I say latency, I'm talking about
latency, response time, any of those things
where you measure time from 'here to here',
and you're interested in how long it took.
We do this all the time, but I see a lot
of mish-mash in how people
treat the data, or think about it.
Latency is basically the time it took
something to happen once.
That one time, how long did it take.
And when we measure stuff, like we did
a million operations in the last hour,
we have a million latencies. Not one,
we have a million of them.
Our actual goal is to figure out how to
describe that million.
How did the million behave?
For example, 'they're all really good, and
they're all exactly the same', would be a
behavior that you will never see,
but that would be a great behavior.
So we need to talk about how things behave:
how we communicate it, think about it,
evaluate it, set requirements for it,
and discuss it with other people.
These are all common things we do
around that behavior.
To do that, we have to describe the
distribution, the set, the behavior,
but not the one.
For example, the behavior that says "the
common case was x" is a piece of
information about the behavior,
but it's a tiny sliver.
Usually the least relevant one.
Well, there are some less relevant ones,
but it's not a strongly relevant one,
and it's the one that people often focus on.
To take a look at what we actually do
with this stuff, almost on a daily basis,
this is a snapshot from a monitoring system.
A small dashboard on a big screen
in a monitoring system.
Where you're watching the response time of
a system over time.
This is a two hour window.
These lines are the 95th, 90th, 75th,
50th, and 25th percentiles,
and you can look at how they behave over time.
We're a small audience here, if you look at
this picture, what draws your eye?
What do you want to go investigate here
or pay attention to?
It's the big red spike there, right?
So we could look at the red spike,
cause it's different,
and say, "Woah, the 95th percentile shot up
here. And look, the 90th percentile
shot up at about the same time.
The rest of them didn't shoot up,
so maybe something happened here
that affected that much, I should probably
pay attention to it
because it's a monitoring system, and
I like things to be calm."
You could go investigate the why.
At this point, I've managed to waste
about 90 seconds of your life,
looking at a completely meaningless chart,
which unfortunately you do
every day, all the time.
This chart is the chart you want to show
somebody if you want to
hide the truth from them.
If you want to pull the wool
over their eyes.
This is the chart of the good stuff.
What's not on this chart?
The worst 5% of things that happened during
these two hours.
They're not here.
This is only the good things that happened
during those two hours.
And to get this spike, that 5% had to be
so bad that it even pulled
the 95th percentile all up.
There is zero information here at all about
what happened bad during this two hours,
which makes it a bad fit for
a monitoring system.
It's a really good thing for
a marketing system.
It's a great way to get the bonus from your boss, even though you didn't do the work.
If you want to learn how to do that,
we can do another talk about that.
But this is not a good way to look at latency.
It's the opposite of good.
Unfortunately, this is one of the most
common tools used for
server monitoring on earth right now.
That's where the snapshot is from,
and this is what people look at.
I find this chart to be a goldmine
of information.
When I first showed it in another talk
like this, I had this really cool experience.
Somebody came up to me and said, "Hey,
as I was sitting here, I was texting one
of our guys, and he was saying,
'look, we have this issue with
our 95th percentile'."
And I got this chart from him!
So I went and said, "Hey, what does the
rest of the spectrum look like?"
And this is the actual chart they got.
When they looked at the rest of the
spectrum, it looked like that.
That's what was hiding.
I noticed the scales are a little different.
That yellow line is that yellow line.
So that's a much more representative number.
Is it? Is that good enough?
That's the 99th percentile.
We still have another 1% of really bad
stuff that's hiding above the blue line.
I wonder how big that is?
I don't know because he didn't have the data.
So a common problem that we have is that
we only plot what's convenient.
We only plot what gives us nice,
colorful graphs.
And often, when we have to choose,
we choose to display the stuff
that is noise,
and we hide the rest of the data.
I like to rant about latency.
This is from a blog that I don't write
enough in, but the format for it was simple.
I tweet a single tweet about latency,
latency tip of the day,
and then I rant about my own tweet.
As an example, this chart is a goldmine
of information because it has so many
different things that are wrong with it,
but we won't get into all of them.
You can read it online.
Anyway, this is one to take away from
what we just said.
If you are not measuring and showing the
maximum value, what is it you are hiding?
And from whom?
If your job is to hide the truth from
others, this is a good way to do it.
But if you actually are interested in what's
going on, the number one indicator
you should never get rid of is the
maximum value.
That is not noise, that is the signal.
The rest of it is noise.
Okay, let's look at this chart for some
more cool stuff.
I'm gonna zoom in to a small part
of the chart, and ask you what that means.
What does the average of the 95th percentile
over 2 hours mean?
What is the math that does that?
What does it do?
Let's look at that, and I'll give you
an example with another percentile.
The 100th percentile. The max, right?
Let's take a data set.
Suppose this was the maximum every minute
for 15 minutes.
What does it mean to say that the average
max over the last 15 minutes was 42?
I specifically chose the data to
make that happen.
It's a meaningless statement.
It's a completely meaningless statement.
But when you see 95th percentile,
average 184, you think that the 95th
percentile for the last two hours
was around 184.
It makes you think that.
Putting this on a piece of paper is not
just noise and irrelevant,
it's a way to mislead people.
It's a way to mislead yourself, because
you'll start to believe your own mistruths.
This is true for any percentile.
There is no percentile that you could do
this math on.
Another tip, you cannot average percentiles.
That math just doesn't work.
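To make that concrete, here is a minimal sketch in Java, with made-up numbers since the actual per-minute data behind that slide isn't shown: the average of per-minute maxima tells you nothing about the actual maximum, and the same failure applies to averaging any per-window percentile.

```java
import java.util.Arrays;

public class AveragingPercentilesIsMeaningless {
    public static void main(String[] args) {
        // Hypothetical per-minute latency samples (ms) for three one-minute intervals.
        long[][] minutes = {
            {1, 1, 2, 1, 3},       // quiet minute, max = 3 ms
            {2, 1, 2, 900, 2},     // one terrible outlier, max = 900 ms
            {1, 2, 1, 2, 2}        // quiet minute, max = 2 ms
        };

        double sumOfMaxima = 0;
        long trueMax = Long.MIN_VALUE;
        for (long[] minute : minutes) {
            long minuteMax = Arrays.stream(minute).max().getAsLong();
            sumOfMaxima += minuteMax;
            trueMax = Math.max(trueMax, minuteMax);
        }
        double averageOfMaxima = sumOfMaxima / minutes.length;

        // The "average max" says ~301.7 ms; the actual worst case was 900 ms.
        // The same mismatch happens for any percentile you try to average.
        System.out.printf("average of per-minute maxima: %.1f ms%n", averageOfMaxima);
        System.out.printf("actual maximum over the whole period: %d ms%n", trueMax);
    }
}
```

The honest way to report a percentile over a longer window is to merge the underlying data and read the percentile from the merged set, not to average the per-window percentiles.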
But percentiles do matter. You really
want to know about them.
And a common misperception is that we want
to look at the main part of the spectrum,
not those outliers and all that perfection stuff.
Only people who actually bet their house
on it every day, or the bank,
need to know about the "five 9's",
and all of those.
The 99th percentile is a pretty
good number.
Is 99% really rare?
Let's look at some stuff, because we can
ask questions like, "If I were looking
at a webpage, what is the chance of me
hitting the 99th percentile?"
Of things like this: a search engine node,
or a key value store,
or a database, or a CDN, right?
Because they will report their 99th percentile.
They won't tell you anything above that,
but how many of the
webpages that we go to
actually experience this?
You want to say 1%, right?
Well, I went to some webpages and I counted
how many HTTP requests were generated
by one click into that webpage,
and here are the numbers.
I did that count about a year ago.
They've probably gone up since then.
Now that translates into this math.
This is the likelihood of one click seeing
the 99th percentile.
And the only page where that is less than
50% is the clean Google search page,
where only about a quarter of clicks will
see the 99th percentile.
The 99th percentile is the thing that most
of your webpages will see.
Most of them will be there.
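The math behind that is just independent-trial probability: if one page load issues N requests, the chance that at least one of them lands at or beyond the 99th percentile is 1 - 0.99^N. Here is a small sketch; the request counts are illustrative, not the exact figures from the slide.

```java
public class ChanceOfSeeingP99 {
    // Probability that at least one of n requests falls at or beyond the given percentile.
    static double chanceOfSeeing(double percentile, int requestsPerPageLoad) {
        return 1.0 - Math.pow(percentile / 100.0, requestsPerPageLoad);
    }

    public static void main(String[] args) {
        int[] requestCounts = {30, 90, 190};   // hypothetical requests per page load
        for (int n : requestCounts) {
            System.out.printf("%3d requests -> %.1f%% chance of seeing the 99th percentile%n",
                    n, 100.0 * chanceOfSeeing(99.0, n));
        }
    }
}
```

Even the cleanest page here, at 30 requests, has roughly a one-in-four chance per click; the heavier pages are well past 50%.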
Now, we could look at other things.
We can pick which things to focus on.
Let's say I had to pick between the 95th
percentile and the three 9's (99.9%).
The three 9's is way into perfection mode
for most people, or so they think.
Which one of those represents our
community better?
Our population?
Our users?
Our experience?
Let's run a hypothetical.
Suppose we don't have as many pages
and as many resources as we saw before.
We'll be much more conservative.
A user session will only go through five
clicks, and each click will only bring
up 40 things.
A lot less, and they're all as clean
as the Google page.
How many of the users will not experience
something worse than the 95th percentile?
Because that's what the 95th percentile
is good for: the people who see that or better.
Anybody above that sees worse.
What are the chances of not seeing it?
That's an interesting number.
So you're watching a number that is
relevant to 0.003% of your users.
99.997% of your users are going to
see worse than this number.
Why are you looking at it?
Why are you spending time
thinking about it?
In reverse, we could say how many people
are going to see something
worse than the three 9's (99.9%)?
That's going to be 18%.
In reverse, 82% of the people will see
the three 9's (99.9%) or better.
That's a slightly better representation.
Probably not good enough either.
We could look at some more math with them,
same kind of scenario.
What percentile of HTTP response time
will be the thing that 95%
of people experience in this scenario?
It's the 99.97th percentile that 95%
of people see.
And if you want to know what 99%
of the people see,
that's four and a half 9's (99.995%).
You want to know that number from Akamai
if you want to predict what 1%
of your users are going to experience.
When you know the 99th percentile,
you kind of know a tiny bit.
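All of the numbers in this hypothetical come from one formula. With roughly 5 clicks times 40 resources, about 200 requests per session, the chance that a session never sees anything worse than a given percentile is that percentile (as a fraction) raised to the 200th power, and the percentile that a given fraction of sessions stays within is the 200th root of that fraction. A small sketch that reproduces the figures quoted above:

```java
public class SessionPercentileMath {
    static final int REQUESTS_PER_SESSION = 5 * 40;   // 5 clicks, up to 40 resources each

    // Chance a whole session never experiences anything worse than this percentile.
    static double chanceOfStayingWithin(double percentile) {
        return Math.pow(percentile / 100.0, REQUESTS_PER_SESSION);
    }

    // The percentile that the given fraction of sessions stays within.
    static double percentileSeenBy(double fractionOfSessions) {
        return 100.0 * Math.pow(fractionOfSessions, 1.0 / REQUESTS_PER_SESSION);
    }

    public static void main(String[] args) {
        // ~0.003% of users never see worse than the 95th percentile.
        System.out.printf("stay within p95:   %.4f%%%n", 100.0 * chanceOfStayingWithin(95.0));
        // ~82% of users never see worse than the 99.9th percentile.
        System.out.printf("stay within p99.9: %.1f%%%n", 100.0 * chanceOfStayingWithin(99.9));
        // 95% of users effectively experience the ~99.97th percentile.
        System.out.printf("percentile seen by 95%% of users: %.2f%n", percentileSeenBy(0.95));
        // 99% of users effectively experience the ~99.995th percentile.
        System.out.printf("percentile seen by 99%% of users: %.3f%n", percentileSeenBy(0.99));
    }
}
```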
So here's another tip.
And this is not an exaggeration,
by the way.
The median, which is a much smaller
percentile, has that minuscule a chance
of ever being the number that
anybody experiences.
This is the chance of getting worse
than the median.
Which makes the median an irrelevant
number to look at.
Unfortunately, it's probably the most
common one looked at.
When people say "the typical",
they look at the thing that
everything will be worse than.
Okay, I'm sorry about that part.
We'll do some other parts.
Now, why is it that when we look at these
monitoring systems, we don't see
data with a lot of 9's?
Why do we stop at the
90, 95, 99th percentile?
Why don't we look further?
Now, some of it is because people think,
"Well that's perfection, I don't need it."
The other part is that it's hard.
It's hard because you can't
average percentiles.
We already talked about that.
But you also can't derive your
five 9's (99.999%) out of a lot
of 10-second samples of percentiles.
And the reason for that is, "Hey, in 10
seconds, maybe I only had 1,000 things."
I could take all the 10-second summaries in
the world; there's no way to say what the
hour's five 9's (99.999%) were, or what the
minute's five 9's were,
if I'm collecting just this data.
And unfortunately, the data being collected
and reported to the back ends of monitoring
is usually summarized at a second,
5 seconds, 10 seconds, etc.
Basically throwing away all the good data,
and leaving you with absolutely no way
to compute large 9's for longer
periods of time.
So, this is where you might want to look
at HDR Histogram.
It's an open source thing I created
a few years ago.
I did it in Java, and I know there are
C, C#, Python, Erlang,
and Go ports of it that I didn't create.
And it lets you actually get an entire
percentile spectrum.
Some of you here I know are
already using it.
And you can look at all the percentiles.
Any number of 9's that's in the data, if
you just keep it right and report it right.
It's got a log format, so you can
store things forever.
Well, for a long time.
Okay, so it lets you have nice things.
Enough for that advertisement.
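For reference, here is roughly what recording into it and reading the full percentile spectrum looks like in Java. This is a minimal sketch against the org.HdrHistogram API as I know it; check the library's documentation for exact signatures.

```java
import org.HdrHistogram.Histogram;

public class HistogramExample {
    public static void main(String[] args) {
        // Track values from 1 ns up to 1 hour, with 3 significant decimal digits of precision.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        // Record each measured latency (in nanoseconds here) as it happens.
        for (int i = 0; i < 1_000_000; i++) {
            histogram.recordValue(doSomethingAndTimeIt());
        }

        // Read any percentile you like -- as many 9's as the data supports.
        System.out.println("p99     = " + histogram.getValueAtPercentile(99.0) + " ns");
        System.out.println("p99.999 = " + histogram.getValueAtPercentile(99.999) + " ns");
        System.out.println("max     = " + histogram.getMaxValue() + " ns");

        // Or dump the whole percentile spectrum, scaled to microseconds.
        histogram.outputPercentileDistribution(System.out, 1000.0);
    }

    static long doSomethingAndTimeIt() {
        long start = System.nanoTime();
        // ... the operation you are measuring ...
        return System.nanoTime() - start;
    }
}
```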
Now, latency... Well, I think this is
slightly out of order.
Yeah, sorry.
This is the red/blue pill part, so I warn
you, this is your last chance.
There's a problem I call the
coordinated omission problem.
The coordinated omission problem is
basically a conspiracy.
It's a conspiracy that we're all part of.
I don't think anybody actually meant
to do it, but once I've noticed it,
everywhere I look, there it is.
Now, I've been using a specific way of
showing you numbers so far.
Has anybody here noticed how
I spell percentile?
(Audience Member): "You put lie at the
end of the percent sign."
Yeah, good.
So the coordinated omission problem is the
"lie" in %lies.
And this is how it works.
One common way to do this is
to use a load generator.
Pretty much all load generators
have this problem.
There are two that I know of that don't.
What you do with a load generator,
is you test.
You issue requests, or send packets.
And you measure how long something took.
And as long as things go right, you
measure them, put them in a bucket,
study them later, and get your
percentiles from that.
But what if the thing that you are
measuring took longer than the time
it would've taken until you send
the next thing?
You're supposed to send something
every second,
but this one took a second and a half.
Well you've got to wait before
you send the next one.
You just avoided measuring something
when the system was problematic.
You've coordinated with it.
You weren't looking at it then.
That's common scenario A: You've backed
off, and avoided measuring when it was bad.
Another way, is you measure inside your code.
We all do this. We all have to do this,
where we measure time, do something,
then measure time.
The delta between them is how long it took.
We can then put it in a stats bucket,
and then do the percentiles in that.
Unfortunately, if the system freezes right
here, for any reason,
an interrupt, a context switch,
a cache buffer flushed to disk,
a garbage collection,
a re-indexing of your database,
and this is a database.
This is Cassandra, by the way,
measuring itself.
If any of the above happens, then you will
have one bad report
while 10,000 things are waiting in line.
And when they come in, they will look
really, really good.
Even though each one of them has had
a really bad experience.
It can even get worse, where maybe the
freeze happened outside the timing,
and you won't even know there was a freeze.
Now these are examples of omitting bad
data on a very selective basis.
It's not random sampling.
It's, "I don't like bad data",
or "I couldn't handle it",
or "I don't know about it",
so we'll just talk about the good.
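Here is a sketch of how that creeps into a load generator loop. This is hypothetical code, not any particular tool: it plans to send one request per second, but it times each request from the moment it was actually sent, and it waits for slow responses before sending the next one, so a long stall shows up as a single sample.

```java
import org.HdrHistogram.Histogram;

// A naive constant-rate load generator (hypothetical sketch, not any particular tool).
public class NaiveLoadGenerator {
    static final long INTERVAL_NANOS = 1_000_000_000L;   // intended: 1 request/second

    public static void main(String[] args) throws InterruptedException {
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 600; i++) {
            long sendTime = System.nanoTime();
            sendRequestAndWaitForResponse();               // if this stalls for 100 s...
            long latency = System.nanoTime() - sendTime;   // ...we record only ONE bad sample
            histogram.recordValue(latency);

            // Sleep until the next "scheduled" send. If we fell behind, the sends (and the
            // measurements) that should have happened while the system was misbehaving are
            // silently skipped. That is coordinated omission.
            long sleepNanos = INTERVAL_NANOS - latency;
            if (sleepNanos > 0) {
                Thread.sleep(sleepNanos / 1_000_000L, (int) (sleepNanos % 1_000_000L));
            }
        }
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);  // report in ms
    }

    static void sendRequestAndWaitForResponse() {
        // Stand-in for issuing a request against the system under test.
    }
}
```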
What does that do to your data?
Because it often makes people feel like,
"Okay, yeah, I understand,
but it's a little bit of noise."
Let's run some hypotheticals,
and I'll show you some real numbers.
Imagine a perfect system.
It's doing 100 requests a second,
at exactly a millisecond each.
But we go and freeze the system
for 100 seconds after 100 seconds
of perfect operation, and then repeat.
Now, I'm going to describe how the system
behaves in terms that should mean something,
and then we'll measure it.
If we actually wanted to describe the
system,
on the left we have an average
of one millisecond,
and on the right we have an
average of 50 seconds.
Why 50? Because if I randomly came in
during that 100 seconds,
I'd get anything from 0 to 100
with an even distribution.
The overall average over 200 seconds
is 25 seconds.
If I just came in here and said,
"Surprise, how long did this take?"
On average, it will be 25.
I can also do the percentiles.
50th percentile will be really good,
and then it'll get really bad.
The four 9's is terrible.
This is a fair honest description of
this system if this is what it did.
And you can make the system do that.
That's what Ctrl-Z is good for.
You can make any of your systems do that.
Now lets go measure this system with
a load generator,
or with a monitoring system.
The common ones.
The ones everybody does.
On the left, we're going to get 10,000
results of one millisecond each.
Great.
And we're going to get one result of
100 seconds.
Wow, really big response time.
This is our data.
This is OUR data.
So now you go do math with it.
The average of that is 10.9 milliseconds.
A little less than 25 seconds.
And here are the percentiles.
Your load generator monitoring system
will tell you that this system is perfect.
You could go to production with it.
You like what you see.
Look at that, four 9's.
It is lying to you.
To your face.
And you can catch it doing that with a
Ctrl-Z test.
But people tend to not want to do that,
because then what are they going to do?
If you just do that test to calibrate
your system, and you find it
telling you that about this, the next
step should be to throw all the numbers away.
Don't believe anything else it says.
If it lies this big, what else did it do?
Don't waste your time on numbers
from uncalibrated systems.
Now the problem here was that if you
want to measure the system,
you have to keep measuring at random times,
or at a constant rate.
If you measure 10,000 things in 100 seconds,
there should be another 10,000 things here.
If you measure them, you would've gotten
all the right numbers.
Coordinated omission is the simple act of
erasing all that bad stuff.
The conspiracy here is that we all do it
without meaning to.
I don't know who put that in our systems,
but it happens to all of us.
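That "fill in the measurements that should have been there" correction is something HdrHistogram can do at recording time, if you know the expected interval between samples. Here is a sketch using the freeze hypothetical above; the method name is from the library as I understand it, so treat the details as an assumption to verify.

```java
import org.HdrHistogram.Histogram;

public class CoordinatedOmissionCorrection {
    public static void main(String[] args) {
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);
        long expectedIntervalNanos = 10_000_000L;          // 100 requests/second -> 10 ms apart

        // 100 seconds of perfect operation: 10,000 results of 1 ms each.
        for (int i = 0; i < 10_000; i++) {
            histogram.recordValue(1_000_000L);
        }

        // The 100-second freeze. Recording it "with expected interval" also back-fills the
        // ~10,000 samples that should have been measured during the stall (99.99 s, 99.98 s, ...),
        // instead of letting them silently disappear.
        histogram.recordValueWithExpectedInterval(100_000_000_000L, expectedIntervalNanos);

        System.out.printf("mean   ~ %.1f s%n", histogram.getMean() / 1e9);               // ~25 s
        System.out.printf("p75    ~ %.1f s%n", histogram.getValueAtPercentile(75.0) / 1e9);
        System.out.printf("p99.99 ~ %.1f s%n", histogram.getValueAtPercentile(99.99) / 1e9);
    }
}
```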
Now, I often get people saying,
"Okay, I get it. All the numbers are wrong,
but at least for my job where I tune
performance, and I try to make things
faster, I can use the numbers to figure
out if I'm going in the right direction."
Is it better, or is it worse? Let me
dispel that for you for a second.
Suppose I went and took this system,
and improved it dramatically.
Rather than freezing for 100 seconds,
it will now answer every question.
It'll take a little longer,
5 milliseconds instead of one,
but it's much better than freezing, right?
So let's measure that system that we spent
weeks and weeks improving,
and see if it's better.
That's the data.
If we do the percentiles, it'll tell us
that we just really hurt the four 9's.
We made it go 5 times worse than before.
We should revert this change, go back to
that much better system we had before.
So this is just to make sure that you
don't think that you can have
any intuition based on any of these numbers.
They go backwards sometimes.
You don't know which way is good or bad.
And you'll never know which way is good
or bad with a system that lies like that.
The other cool technique is
what I call "Cheating Twice".
You have a constant load generator,
and it needs to do 100 per second.
When it woke up after 200 seconds,
it says,
"Woah, were 9,999 behind.
We've got to issue those requests."
So it issues those requests.
At this point, not only did it get rid of
all the bad requests,
it replaced every one of them with
a perfect request.
Pushing the four 9's (99.99%) all the way
out to four and a half 9's (99.995%),
it's twice as wrong as dropping them.
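The arithmetic: with the original 10,001 samples, the single 100-second result is the worst 1 in 10,001, so it sits right at the 99.99th percentile. Back-filling 9,999 extra perfect results doubles the sample count to 20,000, so the bad result now only shows up above the 99.995th percentile. A tiny sketch:

```java
public class CheatingTwice {
    public static void main(String[] args) {
        long honestSamples = 10_001;          // 10,000 good results + 1 bad one
        long paddedSamples = 10_001 + 9_999;  // after "catching up" with perfect results

        // The percentile at which the single bad sample first becomes visible.
        System.out.printf("bad sample visible at p%.3f before padding%n",
                100.0 * (honestSamples - 1) / honestSamples);
        System.out.printf("bad sample visible at p%.3f after padding%n",
                100.0 * (paddedSamples - 1) / paddedSamples);
    }
}
```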
So these are all cool things that
happen to you.
I'm not going to spend much time on how
to fix those and avoid those.
There's a lot of other material that you
can find with me
talking about that, in longer talks.
But this is pretty bad.
And like I said...
That should've been up there before.
How did this repeat itself?
Did I create a loop in the
presentation somehow?
I don't know how to do that.
Let's see if I can get through here.
Hopefully editing later will take it out.
So we have the "cheating twice" part.
There, okay.
So, after we look at coordinated
omission that way,
we should also look at response time,
and service time.
Coordinated omission, what it really
achieves for you, unfortunately,
is that it takes something that you think
is response time, and only shows you
the service time component of latency.
This is a simple depiction of what service
time and response times are.
This guy is taking a certain amount of
time to take payment
or make a cup of coffee.
That's service time.
How long does it take to do the work?
This person has experienced
the response time,
which includes the amount of time they
have to wait before they
get to the person that does the work.
And the difference between those
two is immense.
The coordinated omission problem makes
something that you think is
response time only measure the
service time,
and basically hides the fact that things
stalled, waited in line,
that this guy might've taken a lunch break,
and now the line wraps around
the building three times.
Service time stays the same.
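In load generator terms, the difference comes down to which timestamp you start the clock from. Here is a hypothetical sketch: service time is measured from when the request was actually sent; response time is measured from when the plan says it should have been sent, which includes all the time it spent waiting in line behind a stall.

```java
import org.HdrHistogram.Histogram;

// Hypothetical sketch of a load generator that records both service time and response time.
public class ServiceVsResponseTime {
    public static void main(String[] args) {
        Histogram serviceTime = new Histogram(3_600_000_000_000L, 3);
        Histogram responseTime = new Histogram(3_600_000_000_000L, 3);

        long intervalNanos = 10_000_000L;                // plan: 100 requests/second
        long start = System.nanoTime();

        for (int i = 0; i < 10_000; i++) {
            long intendedSendTime = start + i * intervalNanos;   // when the plan says to send
            while (System.nanoTime() < intendedSendTime) {
                // wait for the scheduled send time (crude pacing, fine for a sketch)
            }
            long actualSendTime = System.nanoTime();             // when we actually got to send

            sendRequestAndWaitForResponse();
            long done = System.nanoTime();

            serviceTime.recordValue(done - actualSendTime);      // how long the work took
            responseTime.recordValue(done - intendedSendTime);   // includes time spent in line
        }

        System.out.println("service  p99: " + serviceTime.getValueAtPercentile(99.0) + " ns");
        System.out.println("response p99: " + responseTime.getValueAtPercentile(99.0) + " ns");
    }

    static void sendRequestAndWaitForResponse() { /* stand-in for the real request */ }
}
```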
This is the backwards part...
Now, let's look at what it
actually looks like.
In a load generator that I fixed,
I measured both
response time and service time,
this happens to be Cassandra,
at a very low load.
And you can see that they're very very
similar, at a very low load.
Why? Because there's nobody in line.
This thing is really fast.
We're not asking for too much.
Cassandra's pretty fast,
so they're the same.
But if I increase the load, we
start seeing gaps.
If I increase the load a little more,
the gap grows.
If I increase the load a little more,
the gap grows.
Now this is not the failure point yet.
If I actually increase it all the way past
the point where the system
can't even do the work I want,
service time stays the same,
response time goes through the roof.
Before, it was 100-and-something
milliseconds; now it's 7 and a half seconds.
Why 7 and a half seconds?
Cause you're waiting in line that long
to go around the block.
The guy just can't serve as many people
as are showing up in line, so you fall behind.
This is a virtual world reaction to this.
I really like this slide, it's where I came
up with the notion of a blue/red pill.
When you actually measure reality, people
tend to have this reaction when
they compare the two.
And if we actually look at these on the
two sides of a collapse point of a system,
this specific system can only do 87,000
things a second.
No matter how hard you press it,
that's all it'll do.
The service time on the two sides of
the collapse looks virtually identical,
which it would.
But if you compare the response time,
you have a very different picture.
And I'm showing this picture so you get
a feeling for what to look at
on whether or not you're measuring
the right one.
Whenever you try to push load
beyond what the system can do,
you are falling further behind over time.
This is a 250 second run,
where at the end of it
you are waiting for 8 seconds in line.
Why? Because for every second
that goes by, there are
3,000 more things that are
added to the line.
The interesting thing that happens when
you cross the threshold limit,
or capability of the system, is that
response time grows over time linearly.
It doesn't happen if you're below.
Only if you're above.
It's the point where that happens, and
any load generator that doesn't show
that line when you try pushing harder
than you can, is lying to you.
It's a simple sanity check.
If your load generator doesn't show you
that, it didn't push.
Or it pushed, but it didn't
report correctly,
whichever it is.
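The arithmetic behind that linear growth is a useful sanity check. If load shows up about 3,000 per second faster than the roughly 87,000 per second the system can serve, the backlog, and therefore the time a new arrival spends in line, grows linearly for as long as you keep pushing. The arrival rate below is an assumption chosen to match the 3,000-per-second excess quoted above.

```java
public class BacklogGrowth {
    public static void main(String[] args) {
        double serviceRate = 87_000;          // what the system can actually do, per second
        double arrivalRate = 90_000;          // what we are trying to push, per second (assumed)
        double runSeconds = 250;

        double backlog = (arrivalRate - serviceRate) * runSeconds;   // requests queued by the end
        double waitSeconds = backlog / serviceRate;                  // time a new arrival waits

        // ~750,000 queued requests -> roughly 8-9 seconds of waiting in line by the end of the run.
        System.out.printf("backlog: %.0f requests, wait: %.1f s%n", backlog, waitSeconds);
    }
}
```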
If we draw that to scale...
Just to make sure, this was not to scale,
this is the scale, I just zoomed in
so you could see that it was
relatively stable.
So... I don't know what happened to the
order of the slides.
It's like looping and randoming.
There's some conspiracy going on there.
Now, latency doesn't live on its own.
You do need to look at latency in the
context of load.
Cause as I showed you, as you're nearly
idle, things are nearly perfect.
Even these mistakes won't show up.
But as you start pressing, things start
cracking or behaving differently.
And usually when you want to know how much
your system can handle,
the answer is not 87,000 things a second,
because nobody wants the
response time that comes with that.
It's how many things can I handle so
that I don't get angry phone calls.
So I do get my bonus, and so my
company stays above ground.
This is not sustainable speed.
Running this experiment is really
interesting with software,
because it actually doesn't hurt, but
spending the next 6 months of your time
repeating this experiment, trying to
change the shape of the bumper
every time you hit the thing
is a waste of your time.
Your goal when you're trying to figure
out sustainable speed throughput,
whatever it is, is to see how fast you can
go without this happening,
and then to try and engineer
to improve that.
Meaning, can I make it go faster
without this happening?
Measuring what happens after you
hit the pole is useless for that exercise.
The only thing that matters about hitting
the pole, is that you hit the pole.
When you go and study the behavior
of latency, at saturation,
you are doing this.
You're looking at this and saying, "That
bumper, I don't like the shape of that.
Let's measure it closely and do this 100
times to see if we can vary it."
That's what it means to look at latency
at saturation,
and repeat, and repeat, and change,
and tune, and see if you can do it again.
If you're pressing it to the wall,
it should look like this.
And it shouldn't be a surprise that it's
a 7 and a half second response time.
In fact, if it's not, something is
terribly wrong with what you're measuring.
You should look at that instead.
So don't do this.
Try to minimize the number of times
that you actually run red cars
into poles in your testing.
I'm not saying don't do it, but use it
to establish the end.
And then you need to test all the speeds,
and we need to see when you hit the pole.
Maybe you hit the pole at 100 mph,
but maybe you also hit the pole at 70 mph.
Maybe you don't hit it at 20.
We should find out how fast is safe.
When you have data, you can compare
it like this.
This is what I would say is a recommended
way to look at it.
Plot requirements, that's the hitting
the pole.
And some things hit the pole,
and some things don't.
And you run different scenarios,
different loads,
different configurations,
different settings,
and see what works, and what doesn't.
Your goal is to stay here, and carry
more while staying there.
Usually.
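As a closing sketch, here is one hypothetical way to run that kind of comparison: sweep a set of target rates, collect a coordinated-omission-free histogram at each, and check each run against the percentile requirements you actually care about, including the max. The requirement numbers and the runAtTargetRate helper are made up for illustration.

```java
import org.HdrHistogram.Histogram;

// Hypothetical sketch: find the highest throughput that still meets latency requirements.
public class SustainableThroughputSweep {
    public static void main(String[] args) {
        long p99RequirementMs = 20;     // hypothetical requirement: p99 under 20 ms
        long maxRequirementMs = 200;    // hypothetical requirement: worst case under 200 ms

        int[] targetRatesPerSecond = {10_000, 20_000, 40_000, 60_000, 80_000};
        for (int rate : targetRatesPerSecond) {
            Histogram h = runAtTargetRate(rate);   // hypothetical: a CO-free run at this rate

            long p99Ms = h.getValueAtPercentile(99.0) / 1_000_000;
            long maxMs = h.getMaxValue() / 1_000_000;
            boolean meetsRequirements = p99Ms <= p99RequirementMs && maxMs <= maxRequirementMs;

            System.out.printf("%,d/s: p99=%d ms, max=%d ms -> %s%n",
                    rate, p99Ms, maxMs, meetsRequirements ? "OK" : "hits the pole");
        }
    }

    // Stand-in for actually driving load at the given rate and recording response times (in ns).
    static Histogram runAtTargetRate(int requestsPerSecond) {
        return new Histogram(3_600_000_000_000L, 3);
    }
}
```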