-
Not Synced
Hi everyone, I'm Gil Tene.
-
Not Synced
I'm going to be talking about this subject
that I call "How NOT to Measure Latency".
-
Not Synced
It's a subject that I've been talking about
for 3 years or so now.
-
Not Synced
I keep the title and change all
the slides every time.
-
Not Synced
A bunch of this stuff is new.
-
Not Synced
So if you've seen any of my previous "How NOT to",
you'll see only some things that are common.
-
Not Synced
A nickname for the subject is this...
-
Not Synced
Because I often will get that reaction
from some people in the audience.
-
Not Synced
Ever since I've told people that it's a
nickname,
-
Not Synced
They feel free to actually exclaim,
"Oh S@%#!".
-
Not Synced
And feel free to do that here in this talk.
-
Not Synced
I'll prompt you in a couple of places
where it is natural.
-
Not Synced
But if you just have the urge, go ahead.
-
Not Synced
So just a tiny bit about me.
-
Not Synced
I am the co-founder of Azul Systems.
-
Not Synced
I play around with garbage collection a lot.
-
Not Synced
Here is some evidence of me playing around
with garbage collection in my kitchen.
-
Not Synced
That's a trash compactor.
-
Not Synced
The compaction function wasn't working right,
so I had to fix it.
-
Not Synced
I thought it'd be funny to take a picture
with a book.
-
Not Synced
I've also built a lot of things.
-
Not Synced
I've been playing with computers since
the early 80's.
-
Not Synced
I've built hardware.
-
Not Synced
I've helped design chips.
-
Not Synced
I've built software at many
different levels.
-
Not Synced
Operating systems, drivers...
JVMs obviously.
-
Not Synced
And lots of big systems at the system level.
-
Not Synced
Built our own app server in the late 90's
because WebLogic wasn't around yet.
-
Not Synced
So, I've made a lot of mistakes,
and I've learned from a few of them.
-
Not Synced
This is actually a combination of a bunch
of those mistakes looking at latency.
-
Not Synced
I do have this hobby of depressing people
by pulling the wool up from over your eyes,
-
Not Synced
and this is what this talk is about.
-
Not Synced
So, I need to give you a choice right here.
-
Not Synced
There's the door.
-
Not Synced
You can take the blue pill,
and you can leave.
-
Not Synced
Tomorrow you can keep believing whatever
it is you want to believe.
-
Not Synced
But if you stay here and take the red pill,
I will show you a glimpse of how
-
Not Synced
far down the rabbit hole goes,
and it will never be the same again.
-
Not Synced
Let's talk about latency.
-
Not Synced
And when I say latency, I'm talking about
latency, response time, any of those things
-
Not Synced
where you measure time from 'here to here',
and you're interested in how long it took.
-
Not Synced
We do this all the time, but I see a lot
of mish-mash in how people
-
Not Synced
treat the data, or think about it.
-
Not Synced
Latency is basically the time it took
something to happen once.
-
Not Synced
That one time, how long did it take.
-
Not Synced
And when we measure stuff, like we did
a million operations in the last hour,
-
Not Synced
we have a million latencies. Not one,
we have a million of them.
-
Not Synced
Our actual goal is to figure out how to
describe that million.
-
Not Synced
How did the million behave?
-
Not Synced
For example, 'they're all really good, and
they're all exactly the same', would be a
-
Not Synced
behavior that you will never see,
but that would be a great behavior.
-
Not Synced
So we need to talk about how things behave,
communicate, think, evaluate,
-
Not Synced
set requirements for, talk to other people,
but these are all common things around that.
-
Not Synced
To do that, we have to describe the
distribution, the set, the behavior,
-
Not Synced
but not the one.
-
Not Synced
For example, the behavior that says "the
common case was x" is a piece of
-
Not Synced
information about the behavior,
but it's a tiny sliver.
-
Not Synced
Usually the least relevant one.
-
Not Synced
Well, there's some less relevant ones,
but not a strongly relevant one,
-
Not Synced
and one that people often focus on.
-
Not Synced
To take a look at what we actually do
with this stuff, almost on a daily basis,
-
Not Synced
this is a snapshot from a monitoring system.
-
Not Synced
A small dashboard on a big screen
in a monitoring system.
-
Not Synced
Where you're watching the response time of
a system over time.
-
Not Synced
This is a two hour window.
-
Not Synced
These lines that are 95th percentile,
90, 75, 50, and 25th percentiles,
-
Not Synced
you can look at how they behave over time.
-
Not Synced
We're a small audience here, if you look at
this picture, what draws your eye?
-
Not Synced
What do you want to go investigate here
or pay attention to?
-
Not Synced
It's the big red spike there, right?
-
Not Synced
So we could look at the red spike,
'cause it's different,
-
Not Synced
and say, "Whoa, the 95th percentile shot up
here. And look, the 90th percentile
-
Not Synced
shot up at about the same time.
-
Not Synced
The rest of them didn't shoot up,
so maybe something happened here
-
Not Synced
that affected that much, I should probably
pay attention to it
-
Not Synced
because it's a monitoring system, and
I like things to be calm."
-
Not Synced
You could go investigate the why.
-
Not Synced
At this point, I've managed to waste
about 90 seconds of your life,
-
Not Synced
looking at a completely meaningless chart,
which unfortunately you do
-
Not Synced
every day, all the time.
-
Not Synced
This chart is the chart you want to show
somebody if you want to
-
Not Synced
hide the truth from them.
-
Not Synced
If you want to pull the wool
over their eyes.
-
Not Synced
This is the chart of the good stuff.
-
Not Synced
What's not on this chart?
-
Not Synced
The 5% worst things that happened during
this two hours.
-
Not Synced
They're not here.
-
Not Synced
This is only the good things that happened
during this time.
-
Not Synced
And to get this spike, that 5% had to be
so bad that it even pulled
-
Not Synced
the 95th percentile all up.
-
Not Synced
There is zero information here at all about
what happened bad during this two hours,
-
Not Synced
which makes it a bad fit for
a monitoring system.
-
Not Synced
It's a really good thing for
a marketing system.
-
Not Synced
It's a great way to get the bonus from your boss, even though you didn't do the work.
-
Not Synced
If you want to learn how to do that,
we can do another talk about that.
-
Not Synced
But this is not a good way to look at latency.
-
Not Synced
It's the opposite of good.
-
Not Synced
Unfortunately, this is one of the most
common tools used for
-
Not Synced
server monitoring on earth right now.
-
Not Synced
That's where the snapshot is from,
and this is what people look at.
-
Not Synced
I find this chart to be a goldmine
of information.
-
Not Synced
When I first showed it in another talk
like this, I had this really cool experience.
-
Not Synced
Somebody came up to me and said, "Hey,
as I was sitting here, I was texting one
-
Not Synced
of our guys, and he was saying,
-
Not Synced
'look, we have this issue with
our 95th percentile'."
-
Not Synced
And I got this chart from him!
-
Not Synced
So I went and said, "Hey, what does the
rest of the spectrum look like?"
-
Not Synced
This is the actual chart they got.
-
Not Synced
And when they looked at the rest of the
spectrum, it looked like that.
-
Not Synced
That's what was hiding.
-
Not Synced
Notice the scales are a little different.
-
Not Synced
That yellow line is that yellow line.
-
Not Synced
So that's a much more representative number.
-
Not Synced
Is it? Is that good enough?
-
Not Synced
That's the 99th percentile.
-
Not Synced
We still have another 1% of really bad
stuff that's hiding above the blue line.
-
Not Synced
I wonder how big that is?
-
Not Synced
I don't know because he didn't have the data.
-
Not Synced
So a common problem that we have is that
we only plot what's convenient.
-
Not Synced
We only plot what gives us nice,
colorful graphs.
-
Not Synced
And often, when we have to choose between
the stuff that hides the rest of the data,
-
Not Synced
and the stuff that is noise, we choose
the noise to display.
-
Not Synced
I like to rant about latency.
-
Not Synced
This is from a blog that I don't write
enough in, but the format for it was simple.
-
Not Synced
I tweet a single tweet about latency,
latency tip of the day,
-
Not Synced
and then I rant about my own tweet.
-
Not Synced
As an example, this chart is a goldmine
of information because it has so many
-
Not Synced
different things that are wrong in it,
but we won't get into all of them.
-
Not Synced
You can read it online.
-
Not Synced
Anyway, this is one to take away from
what we just said.
-
Not Synced
If you are not measuring and showing the
maximum value, what is it you are hiding?
-
Not Synced
And from whom?
-
Not Synced
If your job is to hide the truth from
others, this is a good way to do it.
-
Not Synced
But if you are actually interested in what's
going on, the number one indicator
-
Not Synced
you should never get rid of is the
maximum value.
-
Not Synced
That is not noise, that is the signal.
-
Not Synced
The rest of it is noise.
-
Not Synced
Okay, let's look at this chart for some
more cool stuff.
-
Not Synced
I'm gonna zoom in to a small part
of the chart, and ask you what that means.
-
Not Synced
What does the average of the 95th percentile
over 2 hours mean?
-
Not Synced
What is the math that does that?
-
Not Synced
What does it do?
-
Not Synced
Let's look at that, and I'll give you
an example with another percentile.
-
Not Synced
The 100th percentile. The max, right?
-
Not Synced
Let's take a data set.
-
Not Synced
Suppose this was the maximum every minute
for 15 minutes.
-
Not Synced
What does it mean to say that the average
max over the last 15 minutes was 42?
-
Not Synced
I specifically chose the data to
make that happen.
-
Not Synced
It's a meaningless statement.
-
Not Synced
It's a completely meaningless statement.
-
Not Synced
But when you see 95th percentile,
average 184, you think that the 95th
-
Not Synced
percentile for the last two hours
was around 184.
-
Not Synced
It makes you think that.
-
Not Synced
Putting this on a piece of paper is not
just noise and irrelevant,
-
Not Synced
it's a way to mislead people.
-
Not Synced
It's a way to mislead yourself, because
you'll start to believe your own mistruths.
-
Not Synced
This is true for any percentile.
-
Not Synced
There is no percentile that you could do
this math on.
-
Not Synced
Another tip, you cannot average percentiles.
-
Not Synced
That math doesn't happen.
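The point about averaging percentiles can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming made-up latencies for three one-minute intervals (the numbers are illustrative and are not the data from the slides) and a nearest-rank definition of percentile. It compares the average of the per-interval 95th percentiles with the 95th percentile of all the samples pooled together.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct% of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms for three one-minute intervals (assumed data).
intervals = [
    [10] * 99 + [20],        # quiet minute
    [10] * 99 + [20],        # quiet minute
    [10] * 90 + [500] * 10,  # bad minute: 10% of requests stall
]

per_interval_p95 = [percentile(iv, 95) for iv in intervals]
average_of_p95 = sum(per_interval_p95) / len(per_interval_p95)

pooled = [v for iv in intervals for v in iv]
true_p95 = percentile(pooled, 95)

average_of_max = sum(max(iv) for iv in intervals) / len(intervals)
true_max = max(pooled)

print("per-interval p95:", per_interval_p95)
print("average of p95s: ", round(average_of_p95, 1))
print("p95 of all data: ", true_p95)
print("average of maxes:", average_of_max, "true max:", true_max)
```

With this data the averaged figure matches neither the overall 95th percentile nor anything a request actually experienced, which is the sense in which the average of a percentile is meaningless.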
-
Not Synced
But percentiles do matter. You really
want to know about them.
-
Not Synced
And a common misperception is that we want
to look at the main part of the spectrum,
-
Not Synced
not those outliers and perfection stuff.
-
Not Synced
Only people that actually bet their house
every day, or the bank on it,
-
Not Synced
need to know about the "five-nine's",
and all those.
-
Not Synced
The 99th percentile is a pretty
good number.
-
Not Synced
Is 99% really rare?
-
Not Synced
Let's look at some stuff, because we can
ask questions like, "If I were looking
-
Not Synced
at a webpage, what is the chance of me
hitting the 99th percentile?"
-
Not Synced
Of things like this: a search engine node,
or a key value store,
-
Not Synced
or a database, or a CDN, right?
-
Not Synced
Because they will report their 99th percentile.
-
Not Synced
They won't tell you anything above that,
but how many of the
-
Not Synced
webpages that we go to
actually experience this?
-
Not Synced
You want to say 1%, right?
-
Not Synced
Well, I went to some webpages and I counted
how many HTTP requests were generated
-
Not Synced
by one click into that webpage,
and here are the numbers.
-
Not Synced
I did that about a year ago.
-
Not Synced
They've probably gone up since then.
-
Not Synced
Now that translates into this math.
-
Not Synced
This is the likelihood of one click seeing
the 99th percentile.
-
Not Synced
And the only page where that is less than
50% is the clean Google search page.
-
Not Synced
Where only a quarter will see the
99th percentile.
-
Not Synced
The 99th percentile is the thing that most
of your webpages will see.
-
Not Synced
Most of them will be there.
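The math behind that likelihood is short enough to sketch. Assuming each request independently has a 1% chance of being slower than the 99th percentile, the chance that a page load with N requests includes at least one such request is 1 - 0.99^N. The request counts below are hypothetical round numbers, not the counts from the slide.

```python
def chance_of_seeing_worse_than(percentile, requests):
    """Probability that at least one of `requests` independent requests
    is slower than the given percentile (e.g. 99 for the 99th)."""
    return 1 - (percentile / 100) ** requests

# Hypothetical request counts per page load (assumed, not measured).
for requests in (30, 70, 200):
    chance = chance_of_seeing_worse_than(99, requests)
    print(f"{requests:4d} requests -> {chance:.1%} of page loads see worse than p99")
```

Under the independence assumption, a page with about 30 requests already sees worse than the 99th percentile on roughly a quarter of its loads, and anything past about 70 requests sees it more often than not.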
-
Not Synced
Now, we could look at other things.
-
Not Synced
We can pick which things to focus on.
-
Not Synced
Let's say I had to pick between the 95th
percentile, and the three 9's (99.9%).
-
Not Synced
The three 9's is way into perfection mode
for most people, or they think.
-
Not Synced
Which one of those represents our
community better?
-
Not Synced
Our population?
-
Not Synced
Our users?
-
Not Synced
Our experience?
-
Not Synced
Let's run a hypothetical.
-
Not Synced
Suppose we don't have that many pages,
and that many resources like we said before.
-
Not Synced
We'll be much more conservative.
-
Not Synced
A user session will only go through five
clicks, and each click will only bring up
-
Not Synced
up to 40 things.
-
Not Synced
A lot less, and they're all as clean
as the Google page.
-
Not Synced
How many of the users will not experience
something worse than the 95th percentile?
-
Not Synced
Because that's what the 95th percentile
is good for, the people who see that.
-
Not Synced
Anybody above that, is that.
-
Not Synced
What are the chances of not seeing it?
-
Not Synced
That's an interesting number.
-
Not Synced
So you're watching a number that is
relevant to 0.003% of your users.
-
Not Synced
99.997% of your users are going to
see worse than this number.
-
Not Synced
Why are you looking at it?
-
Not Synced
Why are you spending time
thinking about it?
-
Not Synced
In reverse, we could say how many people
are going to see something
-
Not Synced
worse than the three 9's (99.9%)?
-
Not Synced
That's going to be 18%.
-
Not Synced
In reverse, 82% of the people will see
the three 9's (99.9%) or better.
-
Not Synced
That's a slightly better representation.
-
Not Synced
Probably not good enough either.
-
Not Synced
We could look at some more math with them,
same kind of scenario.
-
Not Synced
What percentile of HTTP response time
will be the thing that 95%
-
Not Synced
of people experience in this scenario?
-
Not Synced
It's the 99.97 percentile that 95%
of people see.
-
Not Synced
And if you want to know what 99%
of the people see,
-
Not Synced
that's four and a half 9's (99.995%).
-
Not Synced
You want to know that number from Akamai
if you want to predict what 1%
-
Not Synced
of your users are going to experience.
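The numbers in this hypothetical can be reproduced with a small sketch, assuming 5 clicks times 40 requests (200 requests per session) and that each request's latency is independent of the others. The function names are made up for illustration.

```python
REQUESTS_PER_SESSION = 5 * 40  # 5 clicks, 40 requests each, as in the hypothetical

def share_never_worse_than(percentile, requests=REQUESTS_PER_SESSION):
    """Fraction of sessions in which every request is at or below the percentile."""
    return (percentile / 100) ** requests

def percentile_bounding(share_of_users, requests=REQUESTS_PER_SESSION):
    """Per-request percentile that the given share of sessions never exceeds."""
    return 100 * share_of_users ** (1 / requests)

never_worse_than_p95 = share_never_worse_than(95)
worse_than_p999 = 1 - share_never_worse_than(99.9)
pct_for_95_of_users = percentile_bounding(0.95)
pct_for_99_of_users = percentile_bounding(0.99)

print(f"never worse than p95:    {never_worse_than_p95:.4%}")
print(f"worse than p99.9:        {worse_than_p999:.1%}")
print(f"percentile for 95% of users: {pct_for_95_of_users:.3f}")
print(f"percentile for 99% of users: {pct_for_99_of_users:.4f}")
```

The results line up with the figures quoted: about 0.003% of sessions stay at or below the 95th percentile, about 18% see worse than three 9's, and covering 95% and 99% of users takes roughly the 99.97th and 99.995th percentiles.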
-
Not Synced
When you know the 99th percentile,
you kind of know a tiny bit.
-
Not Synced
So here's another tip.
-
Not Synced
And this is not an exaggeration,
by the way.
-
Not Synced
The median, which is a much smaller
percentile, has that minuscule a chance
-
Not Synced
of ever being the number that
anybody experiences.
-
Not Synced
This is the chance of getting worse
than the median.
-
Not Synced
Which makes the median an irrelevant
number to look at.
-
Not Synced
Unfortunately, it's probably the most
common one looked at.
-
Not Synced
When people say "the typical",
they look at the thing that
-
Not Synced
everything will be worse than.
-
Not Synced
Okay, I'm sorry about that part.
-
Not Synced
We'll do some other parts.
-
Not Synced
Now, why is it that when we look at these
monitoring systems, we don't see
-
Not Synced
data with a lot of 9's?
-
Not Synced
Why do we stop at the
90, 95, 99th percentile?
-
Not Synced
Why don't we look further?
-
Not Synced
Now, some of it is because people think,
"Well that's perfection, I don't need it."
-
Not Synced
The other part is that it's hard.
-
Not Synced
It's hard because you can't
average percentiles.
-
Not Synced
We already talked about that.
-
Not Synced
But you also can't derive your
five 9's (99.999%) out of a lot
-
Not Synced
of 10 second samples of percentiles.
-
Not Synced
And the reason for that is, "Hey, in 10
seconds, maybe I only had 1,000 things."
-
Not Synced
I could take all the 10 seconds in the
world, there's no way to say what the
-
Not Synced
hour five 9's (99.999%) were, what the
minutes five 9's were
-
Not Synced
if I'm collecting just this data.
-
Not Synced
And unfortunately, the data being collected
and reported to the back ends of monitoring
-
Not Synced
is usually summarized at a second,
5 seconds, 10 seconds, etc.
-
Not Synced
Basically throwing away all the good data,
and leaving you with absolutely no way
-
Not Synced
to compute large 9's for longer
periods of time.
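A small sketch can show why the summaries lose the information while the raw counts do not. It assumes one hour of made-up traffic split into 10-second intervals of 1,000 requests each, with five slow requests hidden in a single interval. It uses a plain `collections.Counter` of exact values as the histogram, which is only a stand-in and not the HDR Histogram API.

```python
import math
from collections import Counter

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile computed from a value -> count mapping."""
    total = sum(counts.values())
    rank = max(1, math.ceil(pct * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value

# Hypothetical hour: 360 ten-second intervals, 1,000 requests each, all 1 ms,
# except that one interval contains five 2,000 ms requests (assumed data).
intervals = [Counter({1: 1000}) for _ in range(360)]
intervals[100] = Counter({1: 995, 2000: 5})

# What a typical monitoring back end keeps: one p99 number per interval.
per_interval_p99 = [percentile_from_counts(c, 99) for c in intervals]

# What you can do if you keep the counts: merge them, then ask any percentile.
merged = sum(intervals, Counter())
hour_p99999 = percentile_from_counts(merged, 99.999)

print("distinct per-interval p99 values:", set(per_interval_p99))
print("five 9's for the whole hour:     ", hour_p99999)
```

Every 10-second p99 reports 1 ms, so no arithmetic on those summaries can recover the hour's five 9's, whereas the merged counts report 2,000 ms directly.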
-
Not Synced
So, this is where you might want to look
at HDR Histogram.
-
Not Synced
It's an open source thing I created
a few years ago.
-
Not Synced
I did it in Java, and now there are
C, C#, Python, Erlang,
-
Not Synced
and Go ports of this that I didn't create.
-
Not Synced
And it lets you actually get an entire
percentile spectrum.
-
Not Synced
Some of you here I know are
already using it.
-
Not Synced
And you can look at all the percentiles.
-
Not Synced
Any number of 9's that's in the data, if
you just keep it right and report it right,
-
Not Synced
it's got a log format, you can
store things forever.
-
Not Synced
Well, for a long time.
-
Not Synced
Okay, so it lets you have nice things.
-
Not Synced
Enough for that advertisement.
-
Not Synced
Now, latency... Well, I think this is
slightly out of order.
-
Not Synced
Yeah, sorry.
-
Not Synced
This is the red/blue pill part, so I warn
you, this is your last chance.
-
Not Synced
There's a problem I call the
coordinated omission problem.
-
Not Synced
The coordinated omission problem is
basically a conspiracy.
-
Not Synced
It's a conspiracy that we're all part of.
-
Not Synced
I don't think anybody actually meant
to do it, but once I've noticed it,
-
Not Synced
everywhere I look, there it is.
-
Not Synced
Now, I've been using a specific way of
showing you numbers so far.
-
Not Synced
Has anybody here noticed how
I spell percentile?
-
Not Synced
(Audience Member): "You put lie at the
end of the percent sign."
-
Not Synced
Yeah, good.
-
Not Synced
So coordinated omission problem is the
"lie" in %lies.
-
Not Synced
And this is how it works.
-
Not Synced
One common way to do this is
to use a load generator.
-
Not Synced
Pretty much all load generators
have this problem.
-
Not Synced
There are two that I know of that don't.
-
Not Synced
What you do with a load generator,
is you test.
-
Not Synced
You issue requests, or send packets.
-
Not Synced
And you measure how long something took.
-
Not Synced
And as long as the numbers go right,
measure them, put them in a bucket,
-
Not Synced
study them later, and get your
percentiles from it.
-
Not Synced
But what if the thing that you are
measuring took longer than the time
-
Not Synced
it would've taken until you send
the next thing?
-
Not Synced
You're supposed to send something
every second,
-
Not Synced
but this one took a second and a half.
-
Not Synced
Well you've got to wait before
you send the next one.
-
Not Synced
You just avoided measuring something
when the system was problematic.
-
Not Synced
You've coordinated with it.
-
Not Synced
You weren't looking at it then.
-
Not Synced
That's common scenario A: You've backed
off, and avoided measuring when it was bad.
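Scenario A can be sketched as a small simulation with no real networking or sleeping. It assumes a hypothetical server that answers in 1 ms but freezes for one second, and a load generator that intends to send a request every 10 ms yet has to wait for each response before sending the next. The naive measurement times each request from when it was actually sent; the alternative times it from when it was supposed to be sent. All names and numbers here are illustrative assumptions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def service_completion(send_time, stall_start, stall_end, service_ms=1):
    """Hypothetical server: 1 ms per request, frozen during [stall_start, stall_end)."""
    if stall_start <= send_time < stall_end:
        return stall_end + service_ms
    return send_time + service_ms

def run(duration_ms=10_000, interval_ms=10, stall_start=5_000, stall_end=6_000):
    naive, from_intended = [], []
    now = 0
    intended = 0
    while intended < duration_ms:
        # A blocked generator cannot send on schedule; it sends when it is free.
        send_time = max(now, intended)
        done = service_completion(send_time, stall_start, stall_end)
        naive.append(done - send_time)         # what most tools record
        from_intended.append(done - intended)  # what a waiting user would feel
        now = done
        intended += interval_ms
    return naive, from_intended

naive, from_intended = run()
print("naive:         p99 =", percentile(naive, 99), "ms, max =", max(naive), "ms")
print("from intended: p99 =", percentile(from_intended, 99),
      "ms, max =", max(from_intended), "ms")
print("samples slower than 1 ms:",
      sum(v > 1 for v in naive), "vs", sum(v > 1 for v in from_intended))
```

In this sketch the naive numbers contain a single bad sample out of a thousand, so the one-second freeze barely shows up in the percentiles, while timing from the intended send time shows more than a hundred requests affected by it.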
-
Not Synced
Another way, is you measure inside your code.
-
Not Synced
We all do this. We all have to do this,
-
Not Synced
where we measure time, do something,
then measure time.
-
Not Synced
The delta between them is how long it took.
-
Not Synced
We can then put it in a stats bucket,
and then do the percentiles in that.
-
Not Synced
Unfortunately, if the system freezes right
here, for any reason,
-
Not Synced
an interrupt, a context switch,
-
Not Synced
a cache buffer flushed to disk