-
Not Synced
Hi everyone, I'm Gil Tene.
-
Not Synced
I'm going to be talking about this subject
that I call "How NOT to Measure Latency".
-
Not Synced
It's a subject that I've been talking now
about for 3 years or so.
-
Not Synced
I keep the title and change all
the slides every time.
-
Not Synced
A bunch of this stuff is new.
-
Not Synced
So if you've seen any of my previous "How NOT to",
you'll see only some things that are common.
-
Not Synced
A nickname for the subject is this...
-
Not Synced
Because I often will get that reaction
from some people in the audience.
-
Not Synced
Ever since I've told people that it's a
nickname,
-
Not Synced
they feel free to actually exclaim,
"Oh S@%#!".
-
Not Synced
And feel free to do that here in this talk.
-
Not Synced
I'll prompt you in a couple of places
where it is natural.
-
Not Synced
But if you just have the urge, go ahead.
-
Not Synced
So just a tiny bit about me.
-
Not Synced
I am the co-founder of Azul Systems.
-
Not Synced
I play around with garbage collection a lot.
-
Not Synced
Here is some evidence of me playing around
with garbage collection in my kitchen.
-
Not Synced
That's a trash compactor.
-
Not Synced
The compaction function wasn't working right,
so I had to fix it.
-
Not Synced
I thought it'd be funny to take a picture
with a book.
-
Not Synced
I've also built a lot of things.
-
Not Synced
I've been playing with computers since
the early 80's.
-
Not Synced
I've built hardware.
-
Not Synced
I've helped design chips.
-
Not Synced
I've built software at many
different levels.
-
Not Synced
Operating systems, drivers...
JVMs, obviously.
-
Not Synced
And lots of big systems at the system level.
-
Not Synced
Built our own app server in the late 90's
because WebLogic wasn't around yet.
-
Not Synced
So, I've made a lot of mistakes,
and I've learned from a few of them.
-
Not Synced
This is actually a combination of a bunch
of those mistakes looking at latency.
-
Not Synced
I do have this hobby of depressing people
by pulling the wool up from over your eyes,
-
Not Synced
and this is what this talk is about.
-
Not Synced
So, I need to give you a choice right here.
-
Not Synced
There's the door.
-
Not Synced
You can take the blue pill,
and you can leave.
-
Not Synced
Tomorrow you can keep believing whatever
it is you want to believe.
-
Not Synced
But if you stay here and take the red pill,
I will show you a glimpse of how
-
Not Synced
far down the rabbit hole goes,
and it will never be the same again.
-
Not Synced
Let's talk about latency.
-
Not Synced
And when I say latency, I'm talking about
latency, response time, any of those things
-
Not Synced
where you measure time from 'here to here',
and you're interested in how long it took.
-
Not Synced
We do this all the time, but I see a lot
of mish-mash in how people
-
Not Synced
treat the data, or think about it.
-
Not Synced
Latency is basically the time it took
something to happen once.
-
Not Synced
That one time, how long did it take.
-
Not Synced
And when we measure stuff, like we did
a million operations in the last hour,
-
Not Synced
we have a million latencies. Not one,
we have a million of them.
-
Not Synced
Our actual goal is to figure out how to
describe that million.
-
Not Synced
How did the million behave?
-
Not Synced
For example, 'they're all really good, and
they're all exactly the same', would be a
-
Not Synced
behavior that you will never see,
but that would be a great behavior.
-
Not Synced
So we need to talk about how things behave,
communicate, think, evaluate,
-
Not Synced
set requirements for, talk to other people,
but these are all common things around that.
-
Not Synced
To do that, we have to describe the
distribution, the set, the behavior,
-
Not Synced
but not the one.
-
Not Synced
For example, the behavior that says "the
common case was x" is a piece of
-
Not Synced
information about the behavior,
but it's a tiny sliver.
-
Not Synced
Usually the least relevant one.
-
Not Synced
Well, there's some less relevant ones,
but not a strongly relevant one,
-
Not Synced
and one that people often focus on.
-
Not Synced
To take a look at what we actually do
with this stuff, almost on a daily basis,
-
Not Synced
this is a snapshot from a monitoring system.
-
Not Synced
A small dashboard on a big screen
in a monitoring system.
-
Not Synced
Where you're watching the response time of
a system over time.
-
Not Synced
This is a two hour window.
-
Not Synced
These lines that are 95th percentile,
90, 75, 50, and 25th percentiles,
-
Not Synced
you can look at how they behave over time.
-
Not Synced
We're a small audience here, if you look at
this picture, what draws your eye?
-
Not Synced
What do you want to go investigate here
or pay attention to?
-
Not Synced
It's the big red spike there, right?
-
Not Synced
So we could look at the red spike,
'cause it's different,
-
Not Synced
and say, "Woah, the 95th percentile shot up
here. And look, the 90th percentile
-
Not Synced
shot up at about the same time.
-
Not Synced
The rest of them didn't shoot up,
so maybe something happened here
-
Not Synced
that affected that much, I should probably
pay attention to it
-
Not Synced
because it's a monitoring system, and
I like things to be calm."
-
Not Synced
You could go investigate the why.
-
Not Synced
At this point, I've managed to waste
about 90 seconds of your life,
-
Not Synced
looking at a completely meaningless chart,
which unfortunately you do
-
Not Synced
every day, all the time.
-
Not Synced
This chart is the chart you want to show
somebody if you want to
-
Not Synced
hide the truth from them.
-
Not Synced
If you want to pull the wool
over their eyes.
-
Not Synced
This is the chart of the good stuff.
-
Not Synced
What's not on this chart?
-
Not Synced
The 5% worst things that happened during
this two hours.
-
Not Synced
They're not here.
-
Not Synced
This is only the good things that happened
during those two hours.
-
Not Synced
And to get this spike, that 5% had to be
so bad that it even pulled
-
Not Synced
the 95th percentile all up.
-
Not Synced
There is zero information here at all about
what happened bad during this two hours,
-
Not Synced
which makes it a bad fit for
a monitoring system.
-
Not Synced
It's a really good thing for
a marketing system.
-
Not Synced
It's a great way to get the bonus from your boss, even though you didn't do the work.
-
Not Synced
If you want to learn how to do that,
we can do another talk about that.
-
Not Synced
But this is not a good way to look at latency.
-
Not Synced
It's the opposite of good.
-
Not Synced
Unfortunately, this is one of the most
common tools used for
-
Not Synced
server monitoring on earth right now.
-
Not Synced
That's where the snapshot is from,
and this is what people look at.
-
Not Synced
I find this chart to be a goldmine
of information.
-
Not Synced
When I first showed it in another talk
like this, I had this really cool experience.
-
Not Synced
Somebody came up to me and said, "Hey,
as I was sitting here, I was texting one
-
Not Synced
of our guys, and he was saying,
-
Not Synced
'look, we have this issue with
our 95th percentile'."
-
Not Synced
And I got this chart from him!
-
Not Synced
So I went and said, "Hey, what does the
rest of the spectrum look like?"
-
Not Synced
This is the actual chart they got.
-
Not Synced
And when they look at the rest of the
spectrum, it looked like that.
-
Not Synced
That's what was hiding.
-
Not Synced
Notice the scales are a little different.
-
Not Synced
That yellow line is that yellow line.
-
Not Synced
So that's a much more representative number.
-
Not Synced
Is it? Is that good enough?
-
Not Synced
That's the 99th percentile.
-
Not Synced
We still have another 1% of really bad
stuff that's hiding above the blue line.
-
Not Synced
I wonder how big that is?
-
Not Synced
I don't know because he didn't have the data.
-
Not Synced
So a common problem that we have is that
we only plot what's convenient.
-
Not Synced
We only plot what gives us nice,
colorful graphs.
-
Not Synced
And often, when we have to choose between
the stuff that hides the rest of the data,
-
Not Synced
and the stuff that is noise, we choose
the noise to display.
-
Not Synced
I like to rant about latency.
-
Not Synced
This is from a blog that I don't write
enough in, but the format for it was simple.
-
Not Synced
I tweet a single tweet about latency,
latency tip of the day,
-
Not Synced
and then I rant about my own tweet.
-
Not Synced
As an example, this chart is a goldmine
of information because it has so many
-
Not Synced
different things that are wrong in it,
but we won't get into all of them.
-
Not Synced
You can read it online.
-
Not Synced
Anyway, this is one to take away from
what we just said.
-
Not Synced
If you are not measuring and showing the
maximum value, what is it you are hiding?
-
Not Synced
And from whom?
-
Not Synced
If your job is to hide the truth from
others, this is a good way to do it.
-
Not Synced
But if you actually are interested in what's
going on, the number one indicator
-
Not Synced
you should never get rid of is the
maximum value.
-
Not Synced
That is not noise, that is the signal.
-
Not Synced
The rest of it is noise.
-
Not Synced
Okay, let's look at this chart for some
more cool stuff.
-
Not Synced
I'm gonna zoom in to a small part
of the chart, and ask you what that means.
-
Not Synced
What does the average of the 95th percentile
over 2 hours mean?
-
Not Synced
What is the math that does that?
-
Not Synced
What does it do?
-
Not Synced
Let's look at that, and I'll give you
an example with another percentile.
-
Not Synced
The 100th percentile. The max, right?
-
Not Synced
Let's take a data set.
-
Not Synced
Suppose this was the maximum every minute
for 15 minutes.
-
Not Synced
What does it mean to say that the average
max over the last 15 minutes was 42?
-
Not Synced
I specifically chose the data to
make that happen.
-
Not Synced
It's a meaningless statement.
-
Not Synced
It's a completely meaningless statement.
-
Not Synced
But when you see 95th percentile,
average 184, you think that the 95th
-
Not Synced
percentile for the last two hours
was around 184.
-
Not Synced
It makes you think that.
-
Not Synced
Putting this on a piece of paper is not
just noise and irrelevant,
-
Not Synced
it's a way to mislead people.
-
Not Synced
It's a way to mislead yourself, because
you'll start to believe your own mistruths.
-
Not Synced
This is true for any percentile.
-
Not Synced
There is no percentile that you could do
this math on.
-
Not Synced
Another tip, you cannot average percentiles.
-
Not Synced
That math doesn't happen.
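A small illustration of why averaging percentiles misleads. This is a sketch with made-up data (nine quiet one-minute windows and one bad one, latencies in milliseconds), and the nearest-rank `percentile` helper is an assumption for illustration, not something from the talk:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 100 samples per window, latencies in milliseconds.
quiet_window = [10] * 100
bad_window = [10] * 50 + [2000] * 50
windows = [quiet_window] * 9 + [bad_window]

# What a dashboard does: take each window's p95, then average those numbers.
per_window_p95 = [percentile(w, 95) for w in windows]
average_of_p95 = sum(per_window_p95) / len(per_window_p95)

# What the 95th percentile of the whole period actually is.
all_samples = [v for w in windows for v in w]
true_p95 = percentile(all_samples, 95)

print("average of per-window p95:", average_of_p95)
print("p95 over all samples:", true_p95)
```

The averaged figure lands somewhere between the quiet windows and the bad one. It is not the percentile of anything that was actually measured.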
-
Not Synced
But percentiles do matter. You really
want to know about them.
-
Not Synced
And a common misperception is that we want
to look at the main part of the spectrum,
-
Not Synced
not those outliers and perfection stuff.
-
Not Synced
Only people that actually bet their house
every day, or the bank on it,
-
Not Synced
need to know about the "five-nine's",
and all those.
-
Not Synced
The 99th percentile is a pretty
good number.
-
Not Synced
Is 99% really rare?
-
Not Synced
Let's look at some stuff, because we can
ask questions like, "If I were looking
-
Not Synced
at a webpage, what is the chance of me
hitting the 99th percentile?"
-
Not Synced
Of things like this: a search engine node,
or a key value store,
-
Not Synced
or a database, or a CDN, right?
-
Not Synced
Because they will report their 99th percentile.
-
Not Synced
They won't tell you anything above that,
but how many of the
-
Not Synced
webpages that we go to
actually experience this?
-
Not Synced
You want to say 1%, right?
-
Not Synced
Well, I went to some webpages and I counted
how many HTTP requests were generated
-
Not Synced
by one click into that webpage,
and here are the numbers.
-
Not Synced
I did that about a year ago.
-
Not Synced
They've probably gone up since then.
-
Not Synced
Now that translates into this math.
-
Not Synced
This is the likelihood of one click seeing
the 99th percentile.
-
Not Synced
And the only page where that is less than
50% is the clean Google search page.
-
Not Synced
Where only a quarter will see the
99th percentile.
-
Not Synced
The 99th percentile is the thing that most
of your webpages will see.
-
Not Synced
Most of them will be there.
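The arithmetic behind that claim can be sketched in a few lines. It assumes every request on a page is independent and has the same 1% chance of landing above the 99th percentile. The request counts below are hypothetical, not the figures from the slide:

```python
def chance_of_seeing_worse(percentile_fraction, requests):
    """Probability that at least one of `requests` independent requests
    is slower than the given percentile (0.99 means the 99th)."""
    return 1.0 - percentile_fraction ** requests

# Hypothetical requests-per-page-load values.
for requests in (1, 30, 69, 200):
    chance = chance_of_seeing_worse(0.99, requests)
    print(f"{requests:4d} requests -> {100 * chance:.1f}% of page loads")
```

Under these assumptions, a page only needs 69 requests before more than half of its loads include at least one response worse than the 99th percentile.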
-
Not Synced
Now, we could look at other things.
-
Not Synced
We can pick which things to focus on.
-
Not Synced
Let's say I had to pick between the 95th
percentile, and the three 9's (99.9%).
-
Not Synced
The three 9's is way into perfection mode
for most people, or they think.
-
Not Synced
Which one of those represents our
community better?
-
Not Synced
Our population?
-
Not Synced
Our users?
-
Not Synced
Our experience?
-
Not Synced
Let's run a hypothetical.
-
Not Synced
Suppose we don't have that many pages,
and that many resources like we said before.
-
Not Synced
We'll be much more conservative.
-
Not Synced
A user session will only go through five
clicks, and each click will only bring up
-
Not Synced
up to 40 things.
-
Not Synced
A lot less, and they're all as clean
as the Google page.
-
Not Synced
How many of the users will not experience
something worse than the 95th percentile?
-
Not Synced
Because that's what the 95th percentile
is good for, the people who see that.
-
Not Synced
Anybody above that, is that.
-
Not Synced
What are the chances of not seeing it?
-
Not Synced
That's an interesting number.
-
Not Synced
So you're watching a number that is
relevant to 0.003% of your users.
-
Not Synced
99.997% of your users are going to
see worse than this number.
-
Not Synced
Why are you looking at it?
-
Not Synced
Why are you spending time
thinking about it?
-
Not Synced
In reverse, we could say how many people
are going to see something
-
Not Synced
worse than the three 9's (99.9%)?
-
Not Synced
That's going to be 18%.
-
Not Synced
In reverse, 82% of the people will see
the three 9's (99.9%) or better.
-
Not Synced
That's a slightly better representation.
-
Not Synced
Probably not good enough either.
-
Not Synced
We could look at some more math with them,
same kind of scenario.
-
Not Synced
What percentile of HTTP response time
will be the thing that 95%
-
Not Synced
of people experience in this scenario?
-
Not Synced
It's the 99.97 percentile that 95%
of people see.
-
Not Synced
And if you want to know what 99%
of the people see,
-
Not Synced
that's four and a half 9's (99.995%).
-
Not Synced
You want to know that number from Akamai
if you want to predict what 1%
-
Not Synced
of your users are going to experience.
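The numbers in this hypothetical can be reproduced with a short calculation. It is a sketch that assumes all 200 requests in a session (5 clicks times 40 resources) are independent, and the function names are made up for illustration:

```python
CLICKS_PER_SESSION = 5
RESOURCES_PER_CLICK = 40
REQUESTS = CLICKS_PER_SESSION * RESOURCES_PER_CLICK  # 200 requests per session

def share_never_worse_than(percentile_fraction, requests=REQUESTS):
    """Share of sessions in which no request is slower than the percentile."""
    return percentile_fraction ** requests

def percentile_needed(share_of_users, requests=REQUESTS):
    """Per-request percentile that `share_of_users` of sessions stay within."""
    return share_of_users ** (1.0 / requests)

print(f"never worse than p95:   {100 * share_never_worse_than(0.95):.4f}% of users")
print(f"never worse than p99.9: {100 * share_never_worse_than(0.999):.1f}% of users")
print(f"percentile covering 95% of users: {100 * percentile_needed(0.95):.3f}")
print(f"percentile covering 99% of users: {100 * percentile_needed(0.99):.4f}")
```

Each result follows from raising the per-request probability to the 200th power, or taking the 200th root to go the other way.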
-
Not Synced
When you know the 99th percentile,
you kind of know a tiny bit.
-
Not Synced
So here's another tip.
-
Not Synced
And this is not an exaggeration,
by the way.
-
Not Synced
The median, which is a much smaller
percentile, has that minuscule a chance
-
Not Synced
of ever being the number that
anybody experiences.
-
Not Synced
This is the chance of getting worse
than the median.
-
Not Synced
Which makes the median an irrelevant
number to look at.
-
Not Synced
Unfortunately, it's probably the most
common one looked at.
-
Not Synced
When people say "the typical",
they look at the thing that
-
Not Synced
everything will be worse than.
-
Not Synced
Okay, I'm sorry about that part.
-
Not Synced
We'll do some other parts.
-
Not Synced
Now, why is it that when we look at these
monitoring systems, we don't see
-
Not Synced
data with a lot of 9's?
-
Not Synced
Why do we stop at the
90, 95, 99th percentile?
-
Not Synced
Why don't we look further?
-
Not Synced
Now, some of it is because people think,
"Well that's perfection, I don't need it."
-
Not Synced
The other part is that it's hard.
-
Not Synced
It's hard because you can't
average percentiles.
-
Not Synced
We already talked about that.
-
Not Synced
But you also can't derive your
five 9's (99.999%) out of a lot
-
Not Synced
of 10 second samples of percentiles.
-
Not Synced
And the reason for that is, "Hey, in 10
seconds, maybe I only had 1,000 things."
-
Not Synced
I could take all the 10 seconds in the
world, there's no way to say what the
-
Not Synced
hour five 9's (99.999%) were, what the
minutes five 9's were
-
Not Synced
if I'm collecting just this data.
-
Not Synced
And unfortunately, the data being collected
and reported to the back ends of monitoring
-
Not Synced
is usually summarized at a second,
5 seconds, 10 seconds, etc.
-
Not Synced
Basically throwing away all the good data,
and leaving you with absolutely no way
-
Not Synced
to compute large 9's for longer
periods of time.
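To see why summarized percentiles lose the data while histograms keep it, here is a minimal sketch. It uses a plain `collections.Counter` with 1 ms buckets as a stand-in histogram. The data is invented (360 ten-second windows, 1,000 samples each, 5 slow ones per window), and none of this is the HdrHistogram API:

```python
from collections import Counter
import math

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile computed from a value -> count histogram."""
    total = sum(counts.values())
    rank = max(1, math.ceil(pct * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value
    raise ValueError("empty histogram")

# Hypothetical hour: every 10 s window has 995 samples at 1 ms and 5 at 500 ms.
windows = [Counter({1: 995, 500: 5}) for _ in range(360)]

# The usual summary: one p99 per window. Every one of them reports 1 ms.
per_window_p99 = [percentile_from_counts(w, 99) for w in windows]

# Histograms add up, so the hour can be rebuilt and queried at any percentile.
hour = Counter()
for w in windows:
    hour.update(w)  # Counter.update adds the counts together

print("distinct per-window p99 values:", set(per_window_p99))
print("p99.9 for the hour:", percentile_from_counts(hour, 99.9), "ms")
```

No arithmetic on the 360 per-window p99 values can recover the 500 ms figure, because the summaries never contained it. The merged counts still do.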
-
Not Synced
So, this is where you might want to look
at HDR Histogram.
-
Not Synced
It's an open source thing I've created
a few years ago.
-
Not Synced
I did it in Java, and now there are
C, C#, Python, Erlang,
-
Not Synced
and Go ports of this that I didn't create.
-
Not Synced
And it lets you actually get an entire
percentile spectrum.
-
Not Synced
Some of you here I know are
already using it.
-
Not Synced
And you can look at all the percentiles.
-
Not Synced
Any number of 9's that's in the data, if
you just keep it right and report it right,
-
Not Synced
it's got a log format, you can
store things forever.
-
Not Synced
Well, for a long time.
-
Not Synced
Okay, so it lets you have nice things.
-
Not Synced
Enough of that advertisement.
-
Not Synced
Now, latency... Well, I think this is
slightly out of order.
-
Not Synced
Yeah, sorry.
-
Not Synced
This is the red/blue pill part, so I warn
you, this is your last chance.
-
Not Synced
There's a problem I call the
coordinated omission problem.
-
Not Synced
The coordinated omission problem is
basically a conspiracy.
-
Not Synced
It's a conspiracy that we're all part of.
-
Not Synced
I don't think anybody actually meant
to do it, but once I've noticed it,
-
Not Synced
everywhere I look, there it is.
-
Not Synced
Now, I've been using a specific way of
showing you numbers so far.
-
Not Synced
Has anybody here noticed how
I spell percentile?
-
Not Synced
(Audience Member): "You put lie at the
end of the percent sign."
-
Not Synced
Yeah, good.
-
Not Synced
So coordinated omission problem is the
"lie" in %lies.
-
Not Synced
And this is how it works.
-
Not Synced
One common way to do this is
to use a load generator.
-
Not Synced
Pretty much all load generators
have this problem.
-
Not Synced
There are two that I know of that don't.
-
Not Synced
What you do with a load generator,
is you test.
-
Not Synced
You issue requests, or send packets.
-
Not Synced
And you measure how long something took.
-
Not Synced
And as long as the numbers go right,
measure them, put them in a bucket,
-
Not Synced
study them later, and get your
percentiles from it.
-
Not Synced
But what if the thing that you are
measuring took longer than the time
-
Not Synced
it would've taken until you send
the next thing?
-
Not Synced
You're supposed to send something
every second,
-
Not Synced
but this one took a second and a half.
-
Not Synced
Well you've got to wait before
you send the next one.
-
Not Synced
You just avoided measuring something
when the system was problematic.
-
Not Synced
You've coordinated with it.
-
Not Synced
You weren't looking at it then.
-
Not Synced
That's common scenario A: You've backed
off, and avoided measuring when it was bad.
-
Not Synced
Another way, is you measure inside your code.
-
Not Synced
We all do this. We all have to do this,
-
Not Synced
where we measure time, do something,
then measure time.
-
Not Synced
The delta between them is how long it took.
-
Not Synced
We can then put it in a stats bucket,
and then do the percentiles in that.
-
Not Synced
Unfortunately, if the system freezes right
here, for any reason,
-
Not Synced
an interrupt, a context switch,
-
Not Synced
a cache buffer flushed to disk,
-
Not Synced
a garbage collection,
-
Not Synced
a re-indexing of your database,
this is a database.
-
Not Synced
This is Cassandra by the way,
measuring itself.
-
Not Synced
In any of the above, then you will
have one bad report
-
Not Synced
while 10,000 things are waiting in line.
-
Not Synced
And when they come in, they will look
really, really good.
-
Not Synced
Even though each one of them has had
a really bad experience.
-
Not Synced
It can even get worse, where maybe the
freeze happened outside the timing,
-
Not Synced
and you won't even know there was a freeze.
-
Not Synced
Now these are examples of omitting data
that is bad on a very selective basis.
-
Not Synced
It's not random sampling.
-
Not Synced
It's, "I don't like bad data",
-
Not Synced
or "I couldn't handle it",
-
Not Synced
or "I don't know about it",
-
Not Synced
so we'll just talk about the good.
-
Not Synced
What does that do to your data?
-
Not Synced
Because it often makes people feel like,
-
Not Synced
"Okay, yeah, I understand,
but it's a little bit of noise."
-
Not Synced
Let's run some hypotheticals,
and I'll show you some real numbers.
-
Not Synced
Imagine a perfect system.
-
Not Synced
It's doing 100 requests a second,
at exactly a millisecond each.
-
Not Synced
But we go and freeze the system,
after 100 seconds of perfect operations
-
Not Synced
for 100 seconds, and then repeat.
-
Not Synced
Now, I'm going to describe how the system
behaves in terms that should mean something,
-
Not Synced
and then we'll measure it.
-
Not Synced
If we actually wanted to describe the
system,
-
Not Synced
on the left we have an average
of one millisecond by definition,
-
Not Synced
and on the right we have an
average of 50 seconds.
-
Not Synced
Why 50? Because if I randomly came in
in that 100 seconds,
-
Not Synced
I'll get anything from 0 to 100
with even distribution.
-
Not Synced
The overall average over 200 seconds
is 25 seconds.
-
Not Synced
If I just came in here and said,
"Surprise, how long did this take?"
-
Not Synced
On average, it will be 25.
-
Not Synced
I can also do the percentiles.
-
Not Synced
50th percentile will be really good,
and then it'll get really bad.
-
Not Synced
The four 9's is terrible.
-
Not Synced
This is a fair honest description of
this system if this is what it did.
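The honest description can be worked out directly. The sketch below assumes one request arrives every 10 ms across a full 200-second cycle, and that a request arriving during the freeze waits for the freeze to end and is then served in 1 ms (queueing behind other stalled requests is ignored). All names are made up for illustration:

```python
import math

RATE_PER_SEC = 100   # intended request rate
SERVICE_S = 0.001    # 1 ms per request when the system is healthy
GOOD_S = 100         # seconds of perfect operation
FREEZE_S = 100       # seconds of total freeze

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def true_response_times():
    """Response time for a request arriving every 10 ms over one cycle."""
    times = []
    for i in range((GOOD_S + FREEZE_S) * RATE_PER_SEC):
        arrival = i / RATE_PER_SEC
        if arrival < GOOD_S:
            times.append(SERVICE_S)
        else:
            # Arrived during the freeze: wait until it ends, then get served.
            times.append((GOOD_S + FREEZE_S - arrival) + SERVICE_S)
    return times

times = true_response_times()
print(f"average: {sum(times) / len(times):.2f} s")
for pct in (50, 75, 99, 99.99):
    print(f"p{pct}: {percentile(times, pct):.3f} s")
```

Half of the requests see 1 ms and the other half see anything up to 100 seconds, which is where the overall average of about 25 seconds comes from.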
-
Not Synced
And you can make the system do that.
-
Not Synced
That's what Control-Z is good for.
-
Not Synced
You can make any of your systems do that.
-
Not Synced
Now let's go measure this system with
a load generator,
-
Not Synced
or with a monitoring system.
-
Not Synced
The common ones.
-
Not Synced
The ones everybody does.
-
Not Synced
On the left, we're going to get 10,000
results of one millisecond each.
-
Not Synced
Great.
-
Not Synced
And we're going to get one result of
100 seconds.
-
Not Synced
Wow, really big response time.
-
Not Synced
This is our data.
-
Not Synced
This is OUR data.
-
Not Synced
So now you go do math with it.
-
Not Synced
The average of that is 10.9 milliseconds.
-
Not Synced
A little less than 25 seconds.
-
Not Synced
And here are the percentiles.
-
Not Synced
Your load generator or monitoring system
will tell you that this system is perfect.
-
Not Synced
You could go to production with it.
-
Not Synced
You like what you see.
-
Not Synced
Look at that, four 9's.
-
Not Synced
It is lying to you.
-
Not Synced
To your face.
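The same arithmetic on what the coordinating load generator actually recorded, as a sketch: 10,000 results of 1 ms and a single result of 100 seconds. The `percentile` helper is an illustrative nearest-rank implementation, not any particular tool's method:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# One 200 s cycle as recorded by a load generator that waits for each
# response before sending the next request (values in seconds).
measured = [0.001] * 10_000 + [100.0]

mean_ms = 1000 * sum(measured) / len(measured)
print(f"average: {mean_ms:.2f} ms")
for pct in (50, 99, 99.9, 99.99):
    print(f"p{pct}: {1000 * percentile(measured, pct):.0f} ms")
print(f"max: {max(measured):.0f} s")
```

The average works out to just under 11 ms, and every percentile up to four 9's reports 1 ms. Only the maximum still shows the freeze.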
-
Not Synced
And you can catch it doing that with a
Control-Z test.
-
Not Synced
But people tend to not want to do that,
because then what are they going to do?
-
Not Synced
If you just do that test, and calibrate
your system, and you find it
-
Not Synced
telling you that, about this, the next
step should be to throw all the numbers away.
-
Not Synced
Don't believe anything else it says.
-
Not Synced
If it lies this big, what else did it do?
-
Not Synced
Don't waste your time on numbers
from uncalibrated systems.
-
Not Synced
Now the problem here was, that if you
want to measure the system,
-
Not Synced
you have to measure at random rates,
or same rates.
-
Not Synced
If you measure 10,000 things in 100 seconds,
there should be another 10,000 things here.
-
Not Synced
If you measure them, you would've gotten
all the right numbers.
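One way to put the missing measurements back is to back-fill them from the long result, given the interval at which requests were supposed to be sent. The sketch below is an assumption-laden illustration of that idea (integer milliseconds, 10 ms expected interval, hypothetical function name). To my understanding, the Java HdrHistogram offers a similar correction through `recordValueWithExpectedInterval`:

```python
def backfill(samples_ms, expected_interval_ms):
    """For every result longer than the expected interval, add the results
    the skipped requests would have seen, one per missed send slot."""
    corrected = []
    for sample in samples_ms:
        corrected.append(sample)
        missing = sample - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected

# What the coordinating load generator recorded: 10,000 x 1 ms, 1 x 100 s.
measured_ms = [1] * 10_000 + [100_000]
corrected_ms = backfill(measured_ms, expected_interval_ms=10)

print("samples before:", len(measured_ms), "after:", len(corrected_ms))
print("average after back-fill:", sum(corrected_ms) / len(corrected_ms) / 1000, "s")
```

With the skipped requests restored, the data set has roughly 20,000 entries again and the average returns to about 25 seconds.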
-
Not Synced
Coordinated omission is the simple act of
erasing all that bad stuff.
-
Not Synced
The conspiracy here is that we all do it
without meaning to.
-
Not Synced
I don't know who put that in our systems,
but it happens to all of us.
-
Not Synced
Now, I often get people saying,
"Okay, I get it. All the numbers are wrong,
-
Not Synced
but at least for my job where I tune
performance, and I try to make things
-
Not Synced
faster, I can use the numbers to figure
out if I'm going in the right direction."
-
Not Synced
Is it better, or is it worse? Let me
dispel that for you for a second.
-
Not Synced
Suppose I went and took this system,
and improved it dramatically.
-
Not Synced
Rather than freezing for 100 seconds,
it will now answer every question.
-
Not Synced
It'll take a little longer,
5 milliseconds instead of one,
-
Not Synced
but it's much better than freezing, right?
-
Not Synced
So let's measure that system that we spent
weeks and weeks improving,
-
Not Synced
and see if it's better.
-
Not Synced
That's the data.
-
Not Synced
If we do the percentiles, it'll tell us
that we just really hurt the four 9's.
-
Not Synced
We made it go 5 times worse than before.
-
Not Synced
We should revert this change, go back to
that much better system we had before.
-
Not Synced
So this is just to make sure that you
don't think that you can have
-
Not Synced
any intuition based on any of these numbers.
-
Not Synced
They go backwards sometimes.
-
Not Synced
You don't know which way is good or bad.
-
Not Synced
And you'll never know which way is good
or bad with a system that lies like that.
-
Not Synced
The other cool technique is
what I call "Cheating Twice".
-
Not Synced
You have a constant load generator,
and it needs to do 100 per second.
-
Not Synced
When it woke up after 200 seconds,
it says,
-
Not Synced
"Woah, we're 9,999 behind.
We've got to issue those requests."
-
Not Synced
So it issues those requests.
-
Not Synced
At this point, not only did it get rid of
all the bad requests,
-
Not Synced
it replaced every one of them with
a perfect request.
-
Not Synced
Coining the four 9's (99.99%), all the way
to four and a half 9's (99.995%),
-
Not Synced
it's twice as wrong as dropping them.
-
Not Synced
So these are all cool things that
happen to you.
-
Not Synced
I'm not going to spend much time on how
to fix those and avoid those.
-
Not Synced
There's a lot of other material that you
can find with me
-
Not Synced
talking about that, in longer talks.
-
Not Synced
But this is pretty bad.
-
Not Synced
And like I said...
-
Not Synced
That should've been up there before.
-
Not Synced
How did this repeat itself?
-
Not Synced
Did I create a loop in the
presentation somehow?
-
Not Synced
I don't know how to do that.
-
Not Synced
Let's see if I can get through here.
-
Not Synced
Hopefully editing later will take it out.
-
Not Synced
So we have the cheats twice.
-
Not Synced
There, okay.
-
Not Synced
So, after we look at coordinated
omission that way,
-
Not Synced
we should also look at response time,
and service time.
-
Not Synced
Coordinated omission, what it really is
achieving for you, unfortunately,
-
Not Synced
is that it makes something that you think
is response time, and only shows you
-
Not Synced
the service time component of latency.
-
Not Synced
This is a simple depiction of what service
time and response times are.
-
Not Synced
This guy is taking a certain amount of
time to take payment
-
Not Synced
or make a cup of coffee.
-
Not Synced
That's service time.
-
Not Synced
How long does it take to do the work?
-
Not Synced
This person has experienced
the response time,
-
Not Synced
which includes the amount of time they
have to wait before they
-
Not Synced
get to the person that does the work.
-
Not Synced
And the difference between those
two is immense.
-
Not Synced
The coordinated omission problem makes
something that you think is
-
Not Synced
response time, only measure the
service time,
-
Not Synced
and basically hide the fact that things
stalled, waited in line,
-
Not Synced
that this guy might've taken a lunch break,
-
Not Synced
and now we have a line around the
building three times.
-
Not Synced
Service time stays the same.
-
Not Synced
This is the backwards part...
-
Not Synced
Now, let's look at what it
actually looks like.
-
Not Synced
In a load generator that I fixed,
I measured both
-
Not Synced
response time and service time,
-
Not Synced
this happens to be Cassandra,
-
Not Synced
at a very low load.
-
Not Synced
And you can see that they're very very
similar, at a very low load.
-
Not Synced
Why? Because there's nobody in line.
-
Not Synced
This thing is really fast.
-
Not Synced
We're not asking for too much.
-
Not Synced
Cassandra's pretty fast,
so they're the same.
-
Not Synced
But if I increase the load, we
start seeing gaps.
-
Not Synced
If I increase the load a little more,
the gap grows.
-
Not Synced
If I increase the load a little more,
the gap grows.
-
Not Synced
Now this is not the failure point yet.
-
Not Synced
If I actually increase it all the way past
the point where the system
-
Not Synced
can't even do the work I want,
service time stays the same.
-
Not Synced
Response time goes through the roof.
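The same effect shows up in a toy model. This is a sketch of a single server with a fixed 1 ms service time and randomly spaced arrivals. It is a simulation under invented parameters, not the Cassandra measurement from the slides:

```python
import random

def simulate(load, requests=20_000, service_time=0.001, seed=1):
    """Single FIFO server. `load` is the arrival rate as a fraction of what
    the server can handle. Returns (service_time, mean_response_time)."""
    rng = random.Random(seed)
    mean_gap = service_time / load
    arrival = 0.0
    server_free_at = 0.0
    response_total = 0.0
    for _ in range(requests):
        arrival += rng.expovariate(1.0 / mean_gap)  # random gap between arrivals
        start = max(arrival, server_free_at)        # wait in line if server is busy
        server_free_at = start + service_time
        response_total += server_free_at - arrival  # waiting time plus service time
    return service_time, response_total / requests

for load in (0.1, 0.5, 0.9, 1.2):
    service, response = simulate(load)
    print(f"load {load:.1f}: service {1000 * service:.1f} ms, "
          f"mean response {1000 * response:.2f} ms")
```

Service time is the same constant at every load level. The response time is what grows, slowly at first and without bound once the load passes what the server can handle.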