-
Not Synced
Hi everyone, I'm Gil Tene.
-
Not Synced
I'm going to be talking about this subject
that I call "How NOT to Measure Latency".
-
Not Synced
It's a subject that I've been talking
about for 3 years or so.
-
Not Synced
I keep the title and change all
the slides every time.
-
Not Synced
A bunch of this stuff is new.
-
Not Synced
So if you've seen any of my previous "How NOT to",
you'll see only some things that are common.
-
Not Synced
A nickname for the subject is this...
-
Not Synced
Because I often will get that reaction
from some people in the audience.
-
Not Synced
Ever since I've told people that it's a
nickname,
-
Not Synced
They feel free to actually exclaim,
"Oh S@%#!".
-
Not Synced
And feel free to do that here in this talk.
-
Not Synced
I'll prompt you in a couple of places
where it is natural.
-
Not Synced
But if you just have the urge, go ahead.
-
Not Synced
So just a tiny bit about me.
-
Not Synced
I am the co-founder of Azul Systems.
-
Not Synced
I play around with garbage collection a lot.
-
Not Synced
Here is some evidence of me playing around
with garbage collection in my kitchen.
-
Not Synced
That's a trash compactor.
-
Not Synced
The compaction function wasn't working right,
so I had to fix it.
-
Not Synced
I thought it'd be funny to take a picture
with a book.
-
Not Synced
I've also built a lot of things.
-
Not Synced
I've been playing with computers since
the early 80's.
-
Not Synced
I've built hardware.
-
Not Synced
I've helped design chips.
-
Not Synced
I've built software at many
different levels.
-
Not Synced
Operating systems, drivers...
JVMs, obviously.
-
Not Synced
And lots of big systems at the system level.
-
Not Synced
Built our own app server in the late 90's
because WebLogic wasn't around yet.
-
Not Synced
So, I've made a lot of mistakes,
and I've learned from a few of them.
-
Not Synced
This is actually a combination of a bunch
of those mistakes looking at latency.
-
Not Synced
I do have this hobby of depressing people
by pulling the wool up from over your eyes,
-
Not Synced
and this is what this talk is about.
-
Not Synced
So, I need to give you a choice right here.
-
Not Synced
There's the door.
-
Not Synced
You can take the blue pill,
and you can leave.
-
Not Synced
Tomorrow you can keep believing whatever
it is you want to believe.
-
Not Synced
But if you stay here and take the red pill,
I will show you a glimpse of how
-
Not Synced
far down the rabbit hole goes,
and it will never be the same again.
-
Not Synced
Let's talk about latency.
-
Not Synced
And when I say latency, I'm talking about
latency, response time, any of those things
-
Not Synced
where you measure time from 'here to here',
and you're interested in how long it took.
-
Not Synced
We do this all the time, but I see a lot
of mish-mash in how people
-
Not Synced
treat the data, or think about it.
-
Not Synced
Latency is basically the time it took
something to happen once.
-
Not Synced
That one time, how long did it take.
-
Not Synced
And when we measure stuff, like we did
a million operations in the last hour,
-
Not Synced
we have a million latencies. Not one,
we have a million of them.
-
Not Synced
Our actual goal is to figure out how to
describe that million.
-
Not Synced
How did the million behave?
-
Not Synced
For example, 'they're all really good, and
they're all exactly the same', would be a
-
Not Synced
behavior that you will never see,
but that would be a great behavior.
-
Not Synced
So we need to talk about how things behave,
communicate, think, evaluate,
-
Not Synced
set requirements for, talk to other people,
but these are all common things around that.
-
Not Synced
To do that, we have to describe the
distribution, the set, the behavior,
-
Not Synced
but not the one.
-
Not Synced
For example, the behavior that says "the
common case was x" is a piece of
-
Not Synced
information about the behavior,
but it's a tiny sliver.
-
Not Synced
Usually the least relevant one.
-
Not Synced
Well, there's some less relevant ones,
but not a strongly relevant one,
-
Not Synced
and one that people often focus on.
-
Not Synced
To take a look at what we actually do
with this stuff, almost on a daily basis,
-
Not Synced
this is a snapshot from a monitoring system.
-
Not Synced
A small dashboard on a big screen
in a monitoring system.
-
Not Synced
Where you're watching the response time of
a system over time.
-
Not Synced
This is a two hour window.
-
Not Synced
These lines are the 95th,
90th, 75th, 50th, and 25th percentiles;
-
Not Synced
you can look at how they behave over time.
-
Not Synced
We're a small audience here, if you look at
this picture, what draws your eye?
-
Not Synced
What do you want to go investigate here
or pay attention to?
-
Not Synced
It's the big red spike there, right?
-
Not Synced
So we could look at the red spike,
'cause it's different,
-
Not Synced
and say, "Woah, the 95th percentile shot up
here. And look, the 90th percentile
-
Not Synced
shot up at about the same time.
-
Not Synced
The rest of them didn't shoot up,
so maybe something happened here
-
Not Synced
that affected that much, I should probably
pay attention to it
-
Not Synced
because it's a monitoring system, and
I like things to be calm."
-
Not Synced
You could go investigate the why.
-
Not Synced
At this point, I've managed to waste
about 90 seconds of your life,
-
Not Synced
looking at a completely meaningless chart,
which unfortunately you do
-
Not Synced
every day, all the time.
-
Not Synced
This chart is the chart you want to show
somebody if you want to
-
Not Synced
hide the truth from them.
-
Not Synced
If you want to pull the wool
over their eyes.
-
Not Synced
This is the chart of the good stuff.
-
Not Synced
What's not on this chart?
-
Not Synced
The 5% worst things that happened during
this two hours.
-
Not Synced
They're not here.
-
Not Synced
This is only the good things that happened
during those two hours.
-
Not Synced
And to get this spike, that 5% had to be
so bad that it even pulled
-
Not Synced
the 95th percentile all up.
-
Not Synced
There is zero information here at all about
what happened bad during this two hours,
-
Not Synced
which makes it a bad fit for
a monitoring system.
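A minimal sketch of the point above, using made-up numbers rather than data from the chart on the slide. It assumes a hypothetical window of 1000 requests in which 4% stalled badly, and a simple nearest-rank `percentile` helper (an illustrative function, not part of any monitoring tool). Every line a dashboard like this would plot stays flat, and only the maximum reveals the stalls.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Hypothetical window: 960 fast requests and 40 that stalled for 5 seconds.
latencies_ms = sorted([20] * 960 + [5000] * 40)

# The percentiles a typical dashboard plots all look calm.
for pct in (25, 50, 75, 90, 95):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")

# The maximum is the only number here that shows the bad 4%.
print(f"max: {latencies_ms[-1]} ms")
```

With this data every plotted percentile is 20 ms while the maximum is 5000 ms, which is the information the chart leaves out.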
-
Not Synced
It's a really good thing for
a marketing system.
-
Not Synced
It's a great way to get the bonus from your boss, even though you didn't do the work.
-
Not Synced
If you want to learn how to do that,
we can do another talk about that.
-
Not Synced
But this is not a good way to look at latency.
-
Not Synced
It's the opposite of good.
-
Not Synced
Unfortunately, this is one of the most
common tools used for
-
Not Synced
server monitoring on earth right now.
-
Not Synced
That's where the snapshot is from,
and this is what people look at.
-
Not Synced
I find this chart to be a goldmine
of information.
-
Not Synced
When I first showed it in another talk
like this, I had this really cool experience.
-
Not Synced
Somebody came up to me and said, "Hey,
as I was sitting here, I was texting one
-
Not Synced
of our guys, and he was saying,
-
Not Synced
'look, we have this issue with
our 95th percentile'."
-
Not Synced
And I got this chart from him!
-
Not Synced
So I went and said, "Hey, what does the
rest of the spectrum look like?"
-
Not Synced
This is the actual chart they got.
-
Not Synced
And when they looked at the rest of the
spectrum, it looked like that.
-
Not Synced
That's what was hiding.
-
Not Synced
Notice the scales are a little different.
-
Not Synced
That yellow line is that yellow line.
-
Not Synced
So that's a much more representative number.
-
Not Synced
Is it? Is that good enough?
-
Not Synced
That's the 99th percentile.
-
Not Synced
We still have another 1% of really bad
stuff that's hiding above the blue line.
-
Not Synced
I wonder how big that is?
-
Not Synced
I don't know because he didn't have the data.
-
Not Synced
So a common problem that we have is that
we only plot what's convenient.
-
Not Synced
We only plot what gives us nice,
colorful graphs.
-
Not Synced
And often, when we have to choose between
the stuff that hides the rest of the data,
-
Not Synced
and the stuff that is noise, we choose
the noise to display.
-
Not Synced
I like to rant about latency.
-
Not Synced
This is from a blog that I don't write
enough in, but the format for it was simple.
-
Not Synced
I tweet a single tweet about latency,
latency tip of the day,
-
Not Synced
and then I rant about my own tweet.
-
Not Synced
As an example, this chart is a goldmine
of information because it has so many
-
Not Synced
different things that are wrong in it,
but we won't get into all of them.
-
Not Synced
You can read it online.
-
Not Synced
Anyway, this is one to take away from
what we just said.
-
Not Synced
If you are not measuring and showing the
maximum value, what is it you are hiding?
-
Not Synced
And from whom?
-
Not Synced
If your job is to hide the truth from
others, this is a good way to do it.
-
Not Synced
But if you actually are interested in what's
going on, the number one indicator
-
Not Synced
you should never get rid of is the
maximum value.
-
Not Synced
That is not noise, that is the signal.
-
Not Synced
The rest of it is noise.
-
Not Synced
Okay, let's look at this chart for some
more cool stuff.
-
Not Synced
I'm gonna zoom in to a small part
of the chart, and ask you what that means.
-
Not Synced
What does the average of the 95th percentile
over 2 hours mean?
-
Not Synced
What is the math that does that?
-
Not Synced
What does it do?
-
Not Synced
Let's look at that, and I'll give you
an example with another percentile.
-
Not Synced
The 100th percentile. The max, right?
-
Not Synced
Let's take a data set.
-
Not Synced
Suppose this was the maximum every minute
for 15 minutes.
-
Not Synced
What does it mean to say that the average
max over the last 15 minutes was 42?
-
Not Synced
I specifically chose the data to
make that happen.
-
Not Synced
It's a meaningless statement.
-
Not Synced
It's a completely meaningless statement.
-
Not Synced
But when you see 95th percentile,
average 184, you think that the 95th
-
Not Synced
percentile for the last two hours
was around 184.
-
Not Synced
It makes you think that.
-
Not Synced
Putting this on a piece of paper is not
just noise and irrelevant,
-
Not Synced
it's a way to mislead people.
-
Not Synced
It's a way to mislead yourself, because
you'll start to believe your own mistruths.
-
Not Synced
This is true for any percentile.
-
Not Synced
There is no percentile that you could do
this math on.
-
Not Synced
Another tip, you cannot average percentiles.
-
Not Synced
That math doesn't happen.
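A small sketch of why that math doesn't work, under assumed data: three hypothetical one-minute intervals of 100 samples each, with one bad minute. The `percentile` helper is an illustrative nearest-rank implementation, not any particular tool's method. Averaging the per-minute 95th percentiles gives a number that is not the 95th percentile of the whole window.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of an unsorted list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-minute latency samples in milliseconds.
minutes = [
    [10] * 100,               # quiet minute
    [10] * 100,               # quiet minute
    [10] * 80 + [1000] * 20,  # bad minute: 20% of requests stalled
]

per_minute_p95 = [percentile(m, 95) for m in minutes]
averaged_p95 = sum(per_minute_p95) / len(per_minute_p95)

# The real 95th percentile needs all the samples (or a mergeable histogram).
all_samples = [v for m in minutes for v in m]
true_p95 = percentile(all_samples, 95)

print("per-minute p95:", per_minute_p95)
print("average of p95s:", averaged_p95)
print("p95 of the whole window:", true_p95)
```

Here the averaged figure comes out at 340 ms while the 95th percentile of the full three minutes is 1000 ms. The maximum is the one summary that combines safely across intervals, and only by taking the max of the maxes, never by averaging them.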
-
Not Synced
But percentiles do matter. You really
want to know about them.
-
Not Synced
And a common misperception is that we want
to look at the main part of the spectrum,
-
Not Synced
not those outliers and perfection stuff.
-
Not Synced
Only people that actually bet their house
every day, or the bank on it,
-
Not Synced
need to know about the "five nines",
and all those.
-
Not Synced
The 99th percentile is a pretty
good number.
-
Not Synced
Is 99% really rare?
-
Not Synced
Let's look at some stuff, because we can
ask questions like, "If I were looking
-
Not Synced
at a webpage, what is the chance of me
hitting the 99th percentile?"
-
Not Synced
Of things like this: a search engine node,
or a key value store,
-
Not Synced
or a database, or a CDN, right?
-
Not Synced
Because they will report their 99th percentile.
-
Not Synced
They won't tell you anything above that,
but how many of the
-
Not Synced
webpages that we go to
actually experience this?
-
Not Synced
You want to say 1%, right?