-
- [Narrator] So we have
nine students who recently
-
graduated from a small school
that has a class size of nine,
-
and they wanna figure out
what is the central tendency
-
for salaries one year after graduation?
-
And they also wanna have a
sense of the spread around
-
that central tendency one
year after graduation.
-
So they all agree to put
their salaries into a computer,
-
and so these are their salaries.
-
They're measured in thousands.
-
So one makes 35,000, three make
50,000, one makes 56,000,
-
two make 60,000, one makes
75,000, and one makes 250,000.
-
So she's doing very well for herself,
-
and the computer spits
out a bunch of parameters
-
based on this data here.
-
So it spits out two typical
measures of central tendency.
-
The mean is roughly 76.2.
-
The computer would calculate
it by adding up all of these
-
numbers, these nine numbers,
and then dividing by nine,
-
and the median is 56, and median
is quite easy to calculate.
-
You just order the numbers and you take
-
the middle number here which is 56.
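As a quick check of those two numbers, here is a minimal sketch using Python's standard `statistics` module. The variable names are just illustrative; the salaries are the nine values from the video, in thousands of dollars.

```python
from statistics import mean, median

# Salaries in thousands of dollars, as listed in the video.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

salary_mean = mean(salaries)      # sum of the nine values divided by nine
salary_median = median(salaries)  # middle value once the list is sorted

print(round(salary_mean, 1))  # 76.2
print(salary_median)          # 56
```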
-
Now what I want you to
do is pause this video
-
and think about for this data set,
-
for this population of
salaries,
-
which measure of central
tendency is a better measure?
-
All right, so let's think
about this a little bit.
-
I'm gonna plot it on a line here.
-
I'm gonna plot my data
so we get a better sense
-
and so we don't just see things
-
as numbers, but we see
where those numbers sit
-
relative to each other.
-
So let's say this is zero.
-
Let's say this is, let's see,
one, two, three, four, five.
-
So this would be 250, this
is 50, 100, 150, 200, 250,
-
and let's see.
-
Let's say if this is 50
then this would be roughly
-
40 right here, and I just wanna get rough.
-
So this would be about 60,
70, 80, 90, close enough.
-
I could draw this
a little bit neater,
-
but, 60, 70, 80, 90.
-
Actually, let me just clean
this up a little bit more too.
-
This one right over here would be
-
a little bit closer to this one.
-
Let me just put it right around here.
-
So that's 40, and then
this would be 30, 20, 10.
-
Okay, that's pretty good.
-
So let's plot this data.
-
So, one student makes 35,000,
so that is right over there.
-
Three make 50,000,
-
so one, two, and three.
-
I'll put it like that.
-
One makes 56,000 which would
put them right over here.
-
Two make 60,000,
-
so it's like that.
-
One makes 75,000, so
that's 60, 70, 75,000.
-
So it's gonna be right around there,
-
and then one makes 250,000.
-
So one's salary is all
the way around there,
-
and then when we
calculate the mean as 76.2
-
as our measure of central tendency,
-
76.2 is right over there.
-
So is this a good measure
of central tendency?
-
Well to me it doesn't feel that good,
-
because our measure of central
tendency is higher than all
-
of the data points except for
one, and the reason is that
-
our data is skewed
-
significantly by this
data point at $250,000.
-
It is so far from the
rest of the distribution
-
from the rest of the data
that it has skewed the mean,
-
and this is something
that you see in general.
-
If you have data that is skewed,
especially things like
-
salary data, where most people are making
-
50, 60, or $70,000 but someone
might make two million dollars,
-
that outlier will skew the
average, or skew the mean I should
-
say, when you add them all
up and divide by the number
-
of data points you have.
-
In this case, especially when
you have data points that
-
would skew the mean,
median is much more robust.
-
The median at 56 sits right
over here, which seems to be
-
much more indicative for central tendency.
-
And think about it.
-
Even if, instead of 250,000,
-
you made this 250,000
thousand, which would be 250
-
million dollars, which is
a ginormous amount of money
-
to make, it would
skew the mean incredibly,
-
but it actually would not
even change the median,
-
because the median, it doesn't matter
-
how high this number gets.
-
This could be a trillion dollars.
-
This could be a quadrillion dollars.
-
The median is going to stay the same.
-
So the median is much more robust
-
if you have a skewed data set.
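To see that robustness concretely, here is a small sketch that swaps the top salary for 250,000 thousand and recomputes both measures. The name `extreme` is a made-up label for the modified data set.

```python
from statistics import mean, median

# Original salaries, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Same data, but the top earner now makes 250,000 thousand (250 million dollars).
extreme = salaries[:-1] + [250_000]

# The median looks only at the middle of the sorted list, so it does not move.
print(median(salaries), median(extreme))  # 56 56

# The mean uses every value, so the single outlier drags it far upward.
print(round(mean(salaries), 1), round(mean(extreme), 1))  # 76.2 27826.2
```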
-
Mean makes a little bit more
sense if you have a symmetric
-
data set, or if
-
things are roughly balanced
above and below the mean,
-
or things aren't skewed
incredibly in one direction,
-
especially by a handful of data
-
points like we have right over here.
-
So in this example, the median is a much
-
better measure of central tendency.
-
And so what about spread?
-
Well you might say, well,
Sal you already told us
-
that the mean is not so good
-
and the standard deviation
is based on the mean.
-
You take each of these data
points, find their distance
-
from the mean, square that
number, add up those squared
-
distances, divide by the
number of data points if we're
-
taking the population standard
deviation, and then
-
you take the
square root of the whole thing.
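Those steps can be sketched directly in Python. This is only an illustration of the population standard deviation recipe just described, with `statistics.pstdev` from the standard library used as a cross-check; the variable names are assumptions.

```python
from math import sqrt
from statistics import pstdev

# Salaries in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

n = len(salaries)
mu = sum(salaries) / n                               # the mean
squared_distances = [(x - mu) ** 2 for x in salaries]  # squared distance from the mean
population_sd = sqrt(sum(squared_distances) / n)     # divide by n, then take the square root

print(round(population_sd, 1))  # 62.3

# The library function should agree with the hand-rolled version.
print(abs(population_sd - pstdev(salaries)) < 1e-9)  # True
```

A standard deviation of roughly 62 thousand is larger than the distance from the mean for eight of the nine salaries, which is the distortion the outlier introduces.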
-
And so since this is based on
the mean, which isn't a good
-
measure of central tendency
in this situation, this outlier
-
is also going to skew
that standard deviation.
-
The standard deviation comes out a lot larger
-
than what you'd want
-
as an indication of the spread of most of the data.
-
Yes, you have this one data
point that's way far away
-
from either the mean or
the median depending on how
-
you wanna think about it, but
most of the data points seem
-
much closer, and so for that situation,
-
not only are we using the median,
-
but the interquartile range
is once again more robust.
-
How do we calculate the
interquartile range?
-
Well, you take the median
and then you take the bottom
-
group of numbers and
calculate the median of those.
-
So that's 50 right over here
and then you take the top
-
group of numbers, the
upper group of numbers,
-
and the median there is halfway
between 60 and 75, so it's 67.5.
-
If this looks unfamiliar
we have many videos
-
on interquartile range and calculating
-
standard deviation and median and mean.
-
This is just a little bit of a review,
-
and then the difference
between these two is 17.5,
-
and notice, this distance
between these two, this 17.5,
-
this isn't going to change,
-
even if this is 250 billion dollars.
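Here is a minimal sketch of that interquartile range calculation, using the median-of-each-half method from the video and leaving the middle value out when the count is odd. The function name `iqr_by_halves` is hypothetical, and other quartile conventions (such as the ones statistics libraries use by default) can give slightly different numbers.

```python
from statistics import median

def iqr_by_halves(values):
    """IQR as median of the upper half minus median of the lower half.

    With an odd count, the middle value is excluded from both halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # bottom group of numbers
    upper = ordered[-half:]  # top group of numbers
    return median(upper) - median(lower)

# Salaries in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]
print(iqr_by_halves(salaries))  # 17.5

# Top salary replaced by 250 billion dollars (250,000,000 thousand).
huge = salaries[:-1] + [250_000_000]
print(iqr_by_halves(huge))  # 17.5
```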
-
So once again, both of
these measures are more robust
-
when you have a skewed data set.
-
So the big takeaway here is
mean and standard deviation,
-
they're not bad if you have
a roughly symmetric data set,
-
if you don't have any
significant outliers,
-
things that really skew the data set,
-
mean and standard deviation
can be quite solid.
-
But if you're looking at
something that could get really
-
skewed by a handful of data
points, the more robust choice might be
-
median and interquartile range:
median for central tendency,
-
interquartile range for spread
around that central tendency,
-
and that's why you'll see when
people talk about salaries
-
they'll often talk about
median, because you can have
-
some skewed salaries,
especially on the up side.
-
When we talk about things
like home prices you'll see
-
median reported
more often than mean,
-
because in a
neighborhood,
-
or in a city, a lot of the
houses might be in the $200,000 to
-
$300,000 range, but maybe
there's one ginormous mansion
-
that is 100 million dollars,
and if you calculated the mean,
-
that mansion would skew it and give a
false impression of the average
-
or the central tendency
of prices in that city.