-
This video, here, is a
groundbreaking video,
-
for multiple reasons.
-
One, I'm going to introduce
you to the variance
-
of a sample, which is
interesting in its own right.
-
And I'm attempting to
record this video in HD,
-
and hopefully you can see it
bigger and clearer than ever
-
before.
-
But we'll see how
all of that goes.
-
So this is a bit of an
experiment, so bear with me.
-
But so, just before we go
into the variance of a sample,
-
I think it's
instructive to review
-
the variance of the population, and
we can compare their formulas.
-
The variance of a
population-- and it's
-
this Greek letter, sigma.
-
Lowercase sigma squared.
-
That means variance.
-
I know it's weird that
a variable already
-
has a squared in it.
-
You're not squaring
the variable.
-
This is the variable.
-
Sigma squared means variance.
-
Actually, let me
write that down.
-
That equals variance.
-
And that is equal to--
you take each data point--
-
and we'll call them x sub i.
-
You take each data
point, find out
-
how far it is from the
mean of the population--
-
the mean of the population.
-
You square it, and then you take
the average of all of those.
-
So you take the average.
-
You sum them all up.
-
You go from i is equal to 1.
-
So from the first point all
the way to the N-th point.
-
And then to average,
you sum them all up,
-
and then you divide by N.
-
So the variance is the average
of the squared distances
-
of each point from the mean.
-
And just to give you
the intuition again,
-
it essentially says, on
average, roughly how far away
-
are each of the points
from the middle?
-
That's the best way to
think about the variance.
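-
Written out, the definition just described looks roughly like this.
The symbols follow the spoken description: x_i is each data point,
mu is the population mean, and N is the number of points.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```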
-
Now what if we're dealing-- this
was for a population, right?
-
And we said, if we
wanted to figure out
-
the variance of men's
heights in the country,
-
it would be very
hard to figure out
-
the variance for the population.
-
You would have to go
and, essentially, measure
-
everyone's height--
250 million people.
-
Or what if it's for some
population where it's just
-
completely impossible
to have the data,
-
or some random variable?
-
And we'll go more
into that later.
-
So a lot of times,
you actually want
-
to estimate this
variance by taking
-
the variance of a sample.
-
Same way that you could never
get the mean of a population,
-
but maybe you want
to estimate it
-
by getting the mean of a sample.
-
And we learned that
in that first video.
-
This is-- if that's the
whole population, that's
-
millions of data points-- or
even data points in the future
-
that you'll never
be able to get,
-
because it's a random variable.
-
So this is the population.
-
You might just want to estimate
things by looking at a sample.
-
And this is actually what
most of inferential statistics
-
is all about.
-
Figuring out descriptive
statistics about the sample,
-
and making inferences
about the population.
-
Let me try this
drug on 100 people.
-
And if it seems to have
statistically significant
-
results, this drug
will probably work
-
on the population as a whole.
-
So that's what it's all about.
-
So it's really
important to understand
-
this notion of a sample
versus a population,
-
and being able to
find statistics
-
on a sample that,
for the most part,
-
can describe the population or
help us estimate-- they call
-
them parameters for
the population.
-
So what's the mean of a-- let
me rewrite these definitions.
-
What's the mean of a population?
-
I'll do it in that purple--
purple for population.
-
The mean of a
population, you just
-
take each of the data points.
-
So you take each of the data
points in the population-- xi.
-
You sum them up.
-
You start with the
first data point,
-
and you go all the way
to the N-th data point,
-
and you divide by N.
You sum them all up
-
and divide them by
N. That's the mean.
-
So then you plug it
into this formula,
-
and you just can see
how far each point is
-
from that central
point, from that mean.
-
And you get the variance.
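-
Here is a minimal sketch of those two population formulas in Python.
The function names and the five heights are made up for illustration,
and the list is assumed to be the entire population.

```python
def population_mean(values):
    """Sum every point in the population and divide by N."""
    return sum(values) / len(values)


def population_variance(values):
    """Average squared distance of each point from the population mean."""
    mu = population_mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


# Hypothetical population of five heights in inches (made-up data).
heights = [66, 68, 70, 72, 74]

mu = population_mean(heights)                # 350 / 5
sigma_squared = population_variance(heights)  # (16 + 4 + 0 + 4 + 16) / 5

print(mu, sigma_squared)  # prints 70.0 8.0
```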
-
Now what happens if
we do it for a sample?
-
Well, if we want to estimate
the mean of a population
-
by somehow calculating a mean
for a sample, the best thing
-
I can think of--
and really these
-
are kind of engineered formulas.
-
These are human
beings saying, well,
-
what is the best
way to sample it?
-
Well, all we can do is really
take an average of our sample.
-
And that's the sample mean.
-
And we learned in
the first video
-
that notation-- the formula's
almost identical to this.
-
It's just the
notation is different.
-
Instead of writing mu, you
write x with a line over it.
-
Sample mean is equal to--
-
Once again, you take each of the
data points now in the sample,
-
not in the whole population.
-
You sum them up, from the first
one and then to the n-th one,
-
right?
-
They're saying that there are
n data points in this sample.
-
And then you divide it by the
number of data points you have.
-
Fair enough.
-
It's really the same formula.
-
The way I took the
mean of a population,
-
I said, well, if I
just have a sample,
-
let me just take the
mean the same way.
-
And it might-- it's
probably a good estimate
-
of the mean of the population.
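-
Side by side, the two means described so far can be written as follows.
Uppercase N is the size of the whole population and lowercase n is
the size of the sample, matching the usage above.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```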
-
Now, it gets interesting
when we talk about variance.
-
So your natural reaction
is, OK, I have this sample.
-
If I want to estimate the
variance of the population,
-
why don't I just apply this
same formula, essentially,
-
to the sample?
-
So I could say-- and this is
actually a sample variance.
-
They write it as s squared.
-
So sigma is kind of the
Greek-letter equivalent of s.
-
So now when we're
dealing with a sample,
-
we just write the s there.
-
So this is sample variance.
-
Let me write that down.
-
Sample variance.
-
So we might just say,
well, maybe a good way
-
to take the sample variance
is, do it the same way.
-
Let's take the distance of each
of the points in the sample,
-
find out how far it is
from our sample mean.
-
Right here, we used
the population mean.
-
But now we'll just
use the sample
-
mean, because that's
all we can have.
-
We don't know what
the population mean
-
is without looking at
the whole population.
-
Take the square of that.
-
That makes it positive.
-
It has other properties
which we'll go over later.
-
And then take the average of
all of these squared distances.
-
So you take it from--
you sum them all up.
-
And there's n of them to
sum up, right-- lowercase n.
-
And you divide by lowercase n.
-
You say, well, you know,
this is a good estimate.
-
Whatever this variance
is, that might
-
be a good estimate for
the population as a whole.
-
And actually this is what
some people often refer to,
-
when they talk about
sample variance.
-
And sometimes it'll actually
be referred to as this.
-
They'll put a little
lowercase n there.
-
And the reason why they do that
is because we divided by n.
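-
As a formula, this divide-by-n version of the sample variance
would look something like this, with x-bar as the sample mean
and lowercase n as the number of points in the sample.

```latex
s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
```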
-
So you say, Sal, what's
the problem here?
-
And then the problem--
and I'll give you
-
the intuition, because
this is actually
-
something that used
to boggle my mind.
-
And I'm still,
frankly, struggling
-
with the intuition behind it.
-
Well, I have the
intuition, but more
-
of kind of rigorously
proving it to myself,
-
that this is
definitely the case.
-
But think about this.
-
If I have a bunch of numbers--
and I'll draw a number line,
-
here.
-
If I draw a number
line here-- let's
-
say I have a bunch of
numbers in my population.
-
So let's say-- I'm just
going to randomly put
-
a bunch of numbers
in my population.
-
And the ones to the right
are bigger than the ones
-
to the left.
-
And if I were to take a
sample of them, right?
-
Maybe I take-- and the
sample, it's random.
-
You actually want to
take a random sample.
-
You don't want to be
skewed in any way.
-
So maybe I take this one, this
one, this one, and that one,
-
right?
-
And then if I were to take
the mean of that number,
-
that number, that
number, and that number,
-
it'll be someplace
in the middle.
-
It might be
someplace over there.
-
And then, if I wanted to figure
out the sample variance using
-
this formula, I'd
say, OK, this distance
-
squared plus this distance
squared plus this distance
-
squared plus that distance
squared, and average them
-
all out.
-
And then I would
get this number.
-
And that probably would be
a pretty good approximation
-
for the variance of
this entire population.
-
The mean of
the population is probably
-
going to be-- I don't know, it
might be pretty close to this.
-
If we actually took all of the
data points and averaged them,
-
maybe they're, like,
here someplace.
-
And then, if you figure
out the variance,
-
it probably would be pretty
close to the average of all
-
of these lines, right, of the
sample variance distances.
-
Fair enough.
-
So you say, hey, Sal, this
looks pretty good now.
-
But there's one little catch.
-
What if-- I mean, there's always
a probability that, instead
-
of picking these fairly
well-distributed numbers
-
in my sample, what
if I happened to pick
-
this number, this number, and
that number, and, let's say,
-
that number, as my sample?
-
Well, whatever your sample
is, your sample mean's
-
always going to be in
the middle of it, right?
-
So in this case, your sample
mean might be right here.
-
So all of these numbers,
you might say, OK,
-
this number's not too
far from that number.
-
That number's not too far.
-
And then that
number's not too far.
-
So your sample variance,
when you do it this way,
-
it might turn out
a little bit low.
-
Because all of these numbers,
they're almost, by definition,
-
going to be pretty close
to the mean of each other.
-
But in this case, your
sample is kind of skewed,
-
and the actual mean
of the population
-
is out here someplace.
-
So the actual variance
of the sample,
-
if you had actually
known the population mean--
-
I know this is all a little
confusing-- if you had actually
-
known the mean, you
would've said, oh, wow.
-
You would have found
these distances,
-
which would have
been a lot more.
-
The whole point
of what I'm saying
-
is, when you take
a sample, there's
-
some chance that
your sample mean is
-
pretty close to the
population mean.
-
Maybe your sample mean is
here, and your population mean
-
is here.
-
And then this formula
would probably
-
work out pretty well, at least
given your sample data points,
-
of figuring out what
the variance is.
-
But there's a reasonable
chance that your sample
-
mean-- your sample
mean is always
-
going to be within your
data sample, right?
-
It's always going to be the
center of your data sample.
-
But it's completely possible
that the population mean
-
is outside of your data sample.
-
It might have just
been-- you know,
-
you just happened
to pick ones that
-
don't contain the
actual population mean.
-
And then this sample
variance, calculated this way,
-
will actually underestimate
the actual population variance,
-
because they're always going
to be closer to their own mean
-
than they are to
the population mean.
-
And if you're understanding,
frankly, even 10% of this,
-
you are a very advanced
statistics student.
-
But I'm saying all of
this to just give you,
-
hopefully, some
intuition to realize
-
that this will
often underestimate.
-
This formula will
often underestimate
-
the actual population variance.
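-
To make that intuition concrete, here is a small sketch with made-up
numbers. The population, the skewed sample, and the helper name
mean_squared_distance are all assumptions for illustration. It measures
the same sample points against their own mean and against the
population mean.

```python
def mean_squared_distance(values, center):
    """Average squared distance of each value from a chosen center."""
    return sum((x - center) ** 2 for x in values) / len(values)


# Hypothetical population, and a sample that happens to land on one side.
population = [1, 2, 3, 4, 10, 11, 12, 13]
sample = [1, 2, 3, 4]

population_mean = sum(population) / len(population)  # 7.0
sample_mean = sum(sample) / len(sample)              # 2.5

# Distances measured from the sample's own mean are small...
around_sample_mean = mean_squared_distance(sample, sample_mean)

# ...while distances from the true population mean are much larger.
around_population_mean = mean_squared_distance(sample, population_mean)

print(around_sample_mean, around_population_mean)  # prints 1.25 21.5
```

The sum of squared distances is smallest when it is measured from the
sample's own mean, so the first number can never come out larger
than the second.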
-
And there's a formula--
and this is actually
-
proven more rigorously
than I'll do
-
it-- that is considered to be
a better-- or, they'll call it,
-
unbiased-- estimate of
the population variance,
-
or the unbiased sample variance.
-
And sometimes it's just
denoted by the s squared again.
-
Sometimes it's denoted by
this-- s n minus 1 squared--
-
and I'll show you why.
-
It's almost the same thing.
-
You take each of
the data points,
-
figure out how far they
are from the sample mean,
-
you square them, and then
you take the average of those
-
squared, except for
one slight difference.
-
i equals 1 to i equals n.
-
Instead of dividing
by n, you divide
-
by a slightly smaller number.
-
You divide by n minus 1.
-
So when you divide by n minus
1 instead of dividing by n,
-
you're going to get a
slightly larger number here.
-
And it turns out
that this is actually
-
a much better estimate.
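-
Written as a formula, the divide-by-n-minus-1 version would be
the following, using the same symbols as before.

```latex
s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```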
-
And one day, I'm going to
write a computer program
-
to at least prove it to
myself, experimentally,
-
that this is a better estimate
of the population variance.
-
And you would calculate
it the same way.
-
You just divide by n minus 1.
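-
One way such an experiment might look is sketched below. It is not
a proof, just a simulation under stated assumptions: a made-up
population of 10,000 uniform random numbers, samples of size 5 drawn
with replacement, and a fixed seed so the run is repeatable.
All names here are hypothetical.

```python
import random


def biased_variance(sample):
    """Sample variance dividing by n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def unbiased_variance(sample):
    """Sample variance dividing by n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


rng = random.Random(0)  # fixed seed, an arbitrary choice

# Made-up population whose true variance we can compute directly.
population = [rng.uniform(0, 100) for _ in range(10_000)]
mu = sum(population) / len(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

n = 5            # sample size (assumption)
trials = 20_000  # number of repeated samples (assumption)

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = rng.choices(population, k=n)  # draw with replacement
    biased_total += biased_variance(sample)
    unbiased_total += unbiased_variance(sample)

avg_biased = biased_total / trials
avg_unbiased = unbiased_total / trials

print("true variance:       ", true_variance)
print("average dividing by n:    ", avg_biased)
print("average dividing by n - 1:", avg_unbiased)
```

Under these assumptions, the divide-by-n average should land near
(n - 1)/n of the true variance, which is 80% here, while the
divide-by-n-minus-1 average should land close to the true variance.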
-
The other way to think about
it-- and actually-- no, no.
-
I'm all out of time.
-
I'll leave you there, now,
and then, in the next video,
-
we'll do a couple
calculations, just
-
so you don't get too
overwhelmed with these ideas,
-
because we're getting
a little bit abstract.
-
See you in the next video.