-
In this video, I want to
talk about what is easily
-
one of the most fundamental and
profound concepts in statistics
-
and maybe in all of mathematics.
-
And that's the
central limit theorem.
-
And what it tells us
is we can start off
-
with any distribution that
has a well-defined mean and
-
variance-- and if it has
a well-defined variance,
-
it has a well-defined
standard deviation.
-
And it could be a continuous
distribution or a discrete one.
-
I'll draw a discrete one,
just because it's easier
-
to imagine, at least for
the purposes of this video.
-
So let's say I have a discrete
probability distribution
-
function.
-
And I want to be
very careful not
-
to make it look anything close
to a normal distribution.
-
Because I want to show you
the power of the central limit
-
theorem.
-
So let's say I have
a distribution.
-
Let's say it could take
on values 1 through 6.
-
1, 2, 3, 4, 5, 6.
-
It's some kind of crazy die.
-
It's very likely to get a one.
-
Let's say it's
impossible-- well,
-
let me make that
a straight line.
-
You have a very high
likelihood of getting a 1.
-
Let's say it's
impossible to get a 2.
-
Let's say it's an OK likelihood
of getting a 3 or a 4.
-
Let's say it's
impossible to get a 5.
-
And let's say it's very
likely to get a 6 like that.
-
So that's my probability
distribution function.
-
If I were to draw a
mean-- this is symmetric,
-
so maybe the mean would
be something like that.
-
The mean would be halfway.
-
So that would be my
mean right there.
-
The standard
deviation maybe would
-
look-- it would be
that far and that
-
far above and below the mean.
-
But that's my discrete
probability distribution
-
function.
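-
As a rough sketch, this crazy distribution can be written down in Python. The exact probabilities below are assumed for illustration; the video only describes the outcomes as "very likely", "OK" and "impossible". The names `VALUES`, `PROBS` and `mean_and_std` are hypothetical.

```python
import math

# Outcomes of the "crazy die" and ASSUMED probabilities (not given in the video):
# 1 and 6 are very likely, 3 and 4 are "OK", 2 and 5 are impossible.
VALUES = [1, 2, 3, 4, 5, 6]
PROBS = [0.3, 0.0, 0.2, 0.2, 0.0, 0.3]


def mean_and_std(values, probs):
    """Return the mean and standard deviation of a discrete distribution."""
    mu = sum(v * p for v, p in zip(values, probs))
    variance = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
    return mu, math.sqrt(variance)


mu, sigma = mean_and_std(VALUES, PROBS)
print("mean:", mu)
print("standard deviation:", sigma)
```

Because the assumed probabilities are symmetric, the mean lands halfway between 1 and 6, matching the picture described above.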
-
Now what I'm going to do
here, instead of just taking
-
samples of this
random variable that's
-
described by this probability
distribution function,
-
I'm going to take samples of it.
-
But I'm going to
average the samples
-
and then look at
those samples and see
-
the frequency of the
averages that I get.
-
And when I say average,
I mean the mean.
-
Let me define something.
-
Let's say my sample size-- and
I could put any number here.
-
But let's say first off we try a
sample size of n is equal to 4.
-
And what that means is I'm going
to take four samples from this.
-
So let's say the first
time I take four samples--
-
so my sample size is
four-- let's say I get a 1.
-
Let's say I get another 1.
-
And let's say I get a 3.
-
And I get a 6.
-
So that right there is my
first sample of sample size 4.
-
I know the terminology
can get confusing.
-
Because this is the sample
that's made up of four samples.
-
But then when we talk about the
sample mean and the sampling
-
distribution of the
sample mean, which we're
-
going to talk more and more
about over the next few videos,
-
normally the sample refers
to the set of samples
-
from your distribution.
-
And the sample size tells
you how many you actually
-
took from your distribution.
-
But the terminology
can be very confusing,
-
because you could easily view
one of these as a sample.
-
But we're taking four
samples from here.
-
We have a sample size of four.
-
And what I'm going to do is
I'm going to average them.
-
So let's say the mean-- I
want to be very careful when
-
I say average.
-
The mean of this first
sample of size 4 is what?
-
1 plus 1 is 2.
-
2 plus 3 is 5.
-
5 plus 6 is 11.
-
11 divided by 4 is 2.75.
-
That is my first sample mean
for my first sample of size 4.
-
Let me do another one.
-
My second sample of size 4,
let's say that I get a 3, a 4.
-
Let's say I get another 3.
-
And let's say I get a 1.
-
I just didn't happen
to get a 6 that time.
-
And notice I can't
get a 2 or a 5.
-
It's impossible for
this distribution.
-
The chance of getting
a 2 or 5 is 0.
-
So I can't have any
2s or 5s over here.
-
So for the second
sample of sample size 4,
-
my second sample mean is
going to be 3 plus 4 is 7.
-
7 plus 3 is 10 plus 1 is 11.
-
11 divided by 4,
once again, is 2.75.
-
Let me do one more,
because I really
-
want to make it clear
what we're doing here.
-
So I do one more.
-
Actually, we're going
to do a gazillion more.
-
But let me just do
one more in detail.
-
So let's say my third
sample of sample size 4--
-
so I'm going to
literally take 4 samples.
-
So my sample is
made up of 4 samples
-
from this original
crazy distribution.
-
Let's say I get a 1,
a 1, and a 6 and a 6.
-
And so my third sample mean
is going to be 1 plus 1 is 2.
-
2 plus 6 is 8.
-
8 plus 6 is 14.
-
14 divided by 4 is 3 and 1/2.
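-
The arithmetic for the three samples above can be checked with a few lines of Python. This is only a sketch; the variable names are made up for illustration.

```python
# The three samples of size 4 drawn in the video.
samples = [
    [1, 1, 3, 6],  # first sample
    [3, 4, 3, 1],  # second sample
    [1, 1, 6, 6],  # third sample
]

# The sample mean is the sum of the draws divided by the sample size.
sample_means = [sum(s) / len(s) for s in samples]
print(sample_means)  # [2.75, 2.75, 3.5]
```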
-
And as I find each
of these sample
-
means-- so for each of my
samples of sample size 4,
-
I figure out a mean.
-
And as I do each
of them, I'm going
-
to plot it on a
frequency distribution.
-
And this is all going to
amaze you in a few seconds.
-
So I plot this all on a
frequency distribution.
-
So I say, OK, on
my first sample,
-
my first sample mean was 2.75.
-
So I'm plotting the actual
frequency of the sample
-
means I get for each sample.
-
So 2.75, I got it one time.
-
So I'll put a little plot there.
-
So that's from that
one right there.
-
And the next time,
I also got a 2.75.
-
That's a 2.75 there.
-
So I got it twice.
-
So I'll plot the
frequency right there.
-
Then I got a 3 and 1/2.
-
So all the possible values,
I could have a 3,
-
I could have a 3.25, I
could have a 3 and 1/2.
-
So then I have the 3 and 1/2,
so I'll plot it right there.
-
And what I'm going
to do is I'm going
-
to keep taking these samples.
-
Maybe I'll take 10,000 of them.
-
So I'm going to keep
taking these samples.
-
So I go all the way to S 10,000.
-
I just do a bunch of these.
-
And what it's going to look like
over time is each of these--
-
I'm going to make it
a dot, because I'm
-
going to have to zoom out.
-
So if I look at it like
this, over time-- it still
-
has all the values that it
might be able to take on,
-
2.75 might be here.
-
So this first dot is
going to be-- this one
-
right here is going
to be right there.
-
And that second one is
going to be right there.
-
Then that one at 3.5 is
going to look right there.
-
But I'm going to
do it 10,000 times.
-
Because I'm going
to have 10,000 dots.
-
And let's say as I do it, I'm
going to just keep plotting them.
-
I'm just going to keep
plotting the frequencies.
-
I'm just going to
keep plotting them
-
over and over and over again.
-
And what you're going
to see is, as I take
-
many, many samples
of size 4, I'm
-
going to have
something that's going
-
to start kind of approximating
a normal distribution.
-
So each of these dots represents
an instance of a sample mean.
-
So as I keep adding on
this column right here,
-
that means I kept getting
the sample mean 2.75.
-
So over time,
-
I'm going to have
something that's
-
starting to approximate
a normal distribution.
-
And that is a neat thing about
the central limit theorem.
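-
The experiment described here can be sketched as a small simulation: draw 10,000 samples of size 4, average each one, and count how often each sample mean shows up. The probabilities are assumed for illustration (the video gives no numbers), and `sample_means` is a hypothetical helper name.

```python
import random
from collections import Counter

VALUES = [1, 2, 3, 4, 5, 6]
PROBS = [0.3, 0.0, 0.2, 0.2, 0.0, 0.3]  # ASSUMED probabilities, not from the video


def sample_means(n, trials, seed=0):
    """Draw `trials` samples of size `n` and return the mean of each sample."""
    rng = random.Random(seed)
    return [sum(rng.choices(VALUES, weights=PROBS, k=n)) / n for _ in range(trials)]


means = sample_means(n=4, trials=10_000)
freq = Counter(means)

# Crude text version of the frequency plot: one star per 50 occurrences.
for value in sorted(freq):
    print(f"{value:5.2f} {'*' * (freq[value] // 50)}")
```

With a sample size of only 4 the plot is still lumpy, but the counts already pile up around the middle rather than at 1 and 6.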
-
So in orange, that's the
case for n is equal to 4.
-
This was a sample size of 4.
-
Now, if I did the same thing
with a sample size of maybe
-
20-- so in this case, instead
of just taking 4 samples
-
from my original crazy
distribution, every sample
-
I take 20 instances
of my random variable,
-
and I average those 20.
-
And then I plot the
sample mean on here.
-
So in that case,
I'm going to have
-
a distribution that
looks like this.
-
And we'll discuss
this in more videos.
-
But it turns out if I were
to plot 10,000 of the sample
-
means here, I'm going
to have something
-
that, two things-- it's going
to even more closely approximate
-
a normal distribution.
-
And we're going to
see in future videos,
-
it's actually going to
have a smaller-- well,
-
let me be clear.
-
It's going to have
the same mean.
-
So that's the mean.
-
This is going to
have the same mean.
-
But it's going to have a
smaller standard deviation.
-
Well, I should plot
these from the bottom
-
because you kind of stack it.
-
Once you get one, then another
instance and another instance.
-
But this is going to
more and more approach
-
a normal distribution.
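-
A sketch of that comparison, again with assumed probabilities: simulate the sample means for n = 4 and n = 20 and compare their centers and spreads. A standard result, not derived in this video, is that the standard deviation of the sample mean is the original standard deviation divided by the square root of n. The names `center` and `spread` are hypothetical.

```python
import random
import statistics

VALUES = [1, 2, 3, 4, 5, 6]
PROBS = [0.3, 0.0, 0.2, 0.2, 0.0, 0.3]  # ASSUMED probabilities, not from the video


def sample_means(n, trials, rng):
    """Draw `trials` samples of size `n` and return the mean of each sample."""
    return [sum(rng.choices(VALUES, weights=PROBS, k=n)) / n for _ in range(trials)]


rng = random.Random(0)
center = {}
spread = {}
for n in (4, 20):
    means = sample_means(n, 10_000, rng)
    center[n] = statistics.mean(means)    # should stay near the original mean
    spread[n] = statistics.pstdev(means)  # should shrink as n grows
    print(f"n={n:2d}  mean of sample means={center[n]:.3f}  std={spread[n]:.3f}")
```

Both runs should center near the same value, while the n = 20 run should be noticeably narrower.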
-
So this is what's super
cool about the central limit
-
theorem.
-
As your sample size
becomes larger--
-
or you could even say as
it approaches infinity.
-
But you really don't
have to get that close
-
to infinity to really get
close to a normal distribution.
-
Even if you have a
sample size of 10 or 20,
-
you're already getting very
close to a normal distribution,
-
in fact about as
good an approximation
-
as we see in our everyday life.
-
But what's cool is we can start
with some crazy distribution.
-
This has nothing to do
with a normal distribution.
-
This was n equals 4, but if
we have a sample size of n
-
equals 10 or n
equals 100, and we
-
were to take 100 of these,
instead of four here,
-
and average them and
then plot that average,
-
the frequency of it, then we
take 100 again, average them,
-
take the mean, plot
that again, and if we
-
do that a bunch
of times, in fact,
-
if we were to do that
an infinite number of times,
-
and especially
-
if we had an
infinite sample size,
-
we would find a perfect
normal distribution.
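-
Stated a little more formally, in a standard textbook form that is not written out in the video: assume the draws are independent and identically distributed with mean μ and finite, nonzero variance σ². Independence is an assumption added here; the video does not state it explicitly.

```latex
X_1, \dots, X_n \ \text{i.i.d.}, \quad
\mathbb{E}[X_i] = \mu, \quad
\operatorname{Var}(X_i) = \sigma^2 \in (0, \infty),
\qquad
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{\sigma}
\xrightarrow{\ d\ } \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

In other words, it is the sample size n that drives the approach to normality. Taking more and more samples of a fixed size only makes the frequency plot a sharper picture of the sampling distribution for that n.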
-
That's the crazy thing.
-
And it doesn't apply just
to taking the sample mean.
-
Here we took the
sample mean every time.
-
But you could have also
taken the sample sum.
-
The central limit theorem
would have still applied.
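-
One way to see why the sum works just as well: the sample sum is n times the sample mean, so its frequency plot has the same shape with a relabeled axis. A minimal sketch, with assumed probabilities and hypothetical variable names:

```python
import random
from collections import Counter

VALUES = [1, 2, 3, 4, 5, 6]
PROBS = [0.3, 0.0, 0.2, 0.2, 0.0, 0.3]  # ASSUMED probabilities, not from the video

rng = random.Random(0)
n = 4
draws = [rng.choices(VALUES, weights=PROBS, k=n) for _ in range(10_000)]

sums = [sum(d) for d in draws]   # sample sums
means = [s / n for s in sums]    # sample means

sum_freq = Counter(sums)
mean_freq = Counter(means)

# Each sum s occurs exactly as often as the mean s / n does.
same_shape = all(sum_freq[s] == mean_freq[s / n] for s in sum_freq)
print(same_shape)  # True
```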
-
But that's what's so
super useful about it.
-
Because in life, there's all
sorts of processes out there,
-
proteins bumping into
each other, people doing
-
crazy things, humans
interacting in weird ways.
-
And you don't know the
probability distribution
-
functions for any
of those things.
-
But what the central
limit theorem
-
tells us is if we add a
bunch of those actions
-
together, assuming that they
all have the same distribution,
-
or if we were to take the
mean of all of those actions
-
together, and if we were to plot
the frequency of those means,
-
we get approximately a normal distribution.
-
And that's frankly why the
normal distribution shows up
-
so much in statistics
and why, frankly, it's
-
a very good
approximation for the sum
-
or the means of a
lot of processes.
-
Normal distribution.
-
What I'm going to show you in
the next video is I'm actually
-
going to show you that this is
a reality, that as you increase
-
your sample size, as
you increase your n,
-
and as you take a
lot of sample means,
-
you're going to have a frequency
plot that looks very, very
-
close to a normal distribution.