-
In the last video we were able to
-
calculate the total sum of squares for these 9 data points right here.
-
These 9 data points are grouped into three different groups,
-
or, if you wanted to speak generally, into "m" different groups.
-
What I want to do in this video is to figure out how much of this total sum of squares
-
is due to variation within each group
-
versus variation between the actual groups.
-
So first let's figure out the total variation within the groups,
-
so let's call that the sum of squares within, I'll do that in yellow,
-
actually I've already used yellow so let's do this, I'm going to do blue.
-
So the sum of squares within.
-
Let me make that clear, that stands for within.
-
So we want to see how much of the variation is
-
due to how far each of these data points is from its central tendency,
-
from its respective group mean.
-
So this is going to be equal to-- let's start with these guys.
-
So instead of taking the distance between each data point and the mean of means
-
I'm going to find the distance between each data point and that group's mean
-
because we want the sum of the squared distances
-
between each data point and its respective group mean.
-
So: 3 minus the mean here, which is 2, squared,
-
+ 2 minus 2 squared,
-
+ 1 minus 2 squared.
-
I'm going to do this for all of the groups,
-
but for each group it's the distance between each data point and its group mean,
-
so + 5 minus 4 squared, + 3 minus 4 squared, + 4 minus 4 squared,
-
and finally we have the third group,
-
and we're finding the squared distances from each point to its central tendency
-
within that group, and we're going to add them all up.
-
So for the third group we have
-
5 minus 6 squared + 6 minus 6 squared, + 7 minus 6 squared.
-
And what is this going to equal?
-
So this is going to be equal to, so up here it is going to be 1 + 0 + 1,
-
that's going to be equal to 2,
-
+ this is going to be equal to 1 + 1 + 0, so another 2,
-
+ this is going to be equal to 1 + 0 + 1, so that's 2 over here.
-
Our total sum of squares within is 6.
-
So one way to think about it, our total variation was 30.
-
Based on that calculation 6 of that 30 comes from variation within these samples.
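-
As a quick check on that arithmetic, here is a minimal Python sketch. It is not from the video: the data values are read off the worked example above, and the function name sum_of_squares_within is made up for illustration.

```python
# The 9 data points from the example, arranged as three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def sum_of_squares_within(groups):
    """Add up the squared distance from each point to its own group's mean."""
    total = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        total += sum((x - group_mean) ** 2 for x in g)
    return total

ssw = sum_of_squares_within(groups)
print(ssw)  # 6.0
```

Each group contributes 2, which matches the 2 + 2 + 2 worked out by hand.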
-
Now the next thing I want to think about is
-
how many degrees of freedom do we have in this calculation
-
how many, kind of, independent data points do we actually have?
-
Well, for each of these groups over here, we have 'n' data points,
-
and in particular n is 3 here. But if you know
-
n minus one of them, you can always find the 'n'th one, if you know the actual sample mean.
-
So in this case for any of these groups if you know 2 of these data points,
-
you can always figure out the third.
-
If you know these two, you can always
-
figure out the third if you know the sample mean.
-
So in general let's figure out the degrees of freedom here.
-
You have, for each group, when you did this you had 'n' minus one degrees of freedom.
-
Remember 'n' is the number of data points you had in each group,
-
so you have n minus one degrees of freedom for each of these groups,
-
so it's n-1, n-1, n-1,
-
or, let me put it this way, you have 'n-1' for each of these groups,
-
and there are m groups.
-
So there are m times (n-1) degrees of freedom.
-
In this particular case, for each group, n-1 is 2,
-
so each group has 2 degrees of freedom,
-
and there are three groups, so there are 6 degrees of freedom.
-
In the future we may do a more detailed discussion of what degrees of freedom mean
-
and how to think about it mathematically.
-
But the simplest way to think about it is as the number of truly independent data points.
-
Assuming you knew, in this case, the central statistic
-
that we used to calculate the squared distances for each of these, the sample mean,
-
the third data point could actually be calculated from the other 2.
-
So we have 6 degrees of freedom over here.
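-
To make the "you can always figure out the third" idea concrete, here is a small Python sketch. It is only an illustration under the assumptions of this example: the helper name recover_last_point is hypothetical, and it relies on every group having the same size n.

```python
def recover_last_point(known_points, group_mean, n):
    """Given n-1 points and the group mean, solve for the missing point.

    The mean times n is the group total, so the missing point is
    whatever is left after subtracting the known points.
    """
    return group_mean * n - sum(known_points)

# First group: we know 3 and 2, and the sample mean is 2, so the third is 1.
print(recover_last_point([3, 2], 2, 3))  # 1

# Degrees of freedom within: n-1 free points per group, times m groups.
m, n = 3, 3
df_within = m * (n - 1)
print(df_within)  # 6
```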
-
Now that was how much of the total variation is due to variation within each sample.
-
Now think about how much of the variation is due to variation between the samples.
-
And to do that, we're going to calculate-- get a nice color here--
-
I think I've run out of all the colors--
-
we'll call it sum of squares between, the B stands for between.
-
So another way to think about it, how much of this total variation
-
is due to the variation between the means, between the central tendency
-
that's what we're going to calculate right now and
-
how much is due to variation from each data point to its mean.
-
Let's figure out how much is due to variation between these guys over here.
-
One way to think about it for each of these data points--
-
let's just think about this first group.
-
For this first group, how much variation for each of these guys is
-
due to the variation between this mean and the mean of means.
-
For the first guy up here-- I'll just write it all out explicitly--
-
the variation is going to be its sample mean, 2, minus the mean of means, squared.
-
And then for this guy, it's going to be the same thing.
-
His sample mean, 2, minus the mean of means, squared.
-
Plus same thing for this guy.
-
His sample mean, 2, minus the mean of means, squared.
-
Or another way to think about it, this is equal to 3 times (2-4) squared,
-
which is the same thing as 3 times 4. It's equal to 12.
-
I can do it for each of them. I actually want to find the total sum.
-
Let me just write it all out. I think that might be an easier thing to do.
-
For all of these guys combined
-
the sum of squares due to the differences between the samples.
-
So that's from the first sample, the contribution from the first sample.
-
And then from the second sample,
-
you have this guy here, five-- sorry, you don't want to calculate him.
-
For this data point, the amount of variation due to the difference between the means
-
is going to be (4-4) squared.
-
Same thing for this guy, it would be (4-4) squared.
-
We're not taking the data point itself into consideration. We're only taking its sample mean into consideration.
-
And then finally + (4-4) squared.
-
We're taking this
-
minus this squared for each of these data points.
-
And then finally we'll do that with the last group.
-
Sample mean is 6, so it's going to be (6-4) squared plus (6-4) squared plus (6-4) squared.
-
Now, let's think about
-
how many degrees of freedom we had in this calculation right over here.
-
Well, in general, I guess the easiest way to think about it is,
-
how much information do we have, assuming that we knew the mean of means?
-
If we know the mean of means, how much here is new information?
-
If you know the mean of the means and you know 2 of the sample means,
-
you can always figure out the third.
-
If you know this one and this one, you can figure out that one.
-
If you know that one and that one, you can figure out that one.
-
That's because this is the mean of these means over here.
-
So in general, if you have m groups, or if you have m means,
-
there are m-1 degrees of freedom here.
-
With that said, in this case m is 3.
-
So we could say, there's 2 degrees of freedom for this exact example.
-
Let's actually calculate the sum of squares between. So what is this going to be?
-
This is going to be equal to, this right here is, 2-4 is -2, squared is 4.
-
And then we have three fours over here, so three times four.
-
Plus 3 times 0, plus 3 times (6-4) squared, which is 3 times 4. So plus 3 times 4.
-
And we get 3 times 4 is 12 + 0 + 12, is equal to 24.
-
So the sum of squares between, or the variation due to
-
the differences between the groups, between the means, is 24.
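-
The same calculation can be sketched in Python. This is an illustration, not something shown in the video: the function name sum_of_squares_between is made up, and the grand mean of all points only equals the "mean of means" here because every group has the same number of points.

```python
# The 9 data points from the example, arranged as three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def sum_of_squares_between(groups):
    """For every data point, square (its group mean - the grand mean) and add up.

    Each point in a group contributes the same amount, so a group's
    contribution is its size times that squared difference.
    """
    all_points = [x for g in groups for x in g]
    grand_mean = sum(all_points) / len(all_points)
    return sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

ssb = sum_of_squares_between(groups)
print(ssb)  # 24.0
```

The three groups contribute 12, 0 and 12, the same terms that were added up by hand.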
-
Now let's put this all together. We said that
-
the total variation when you look at all 9 data points, is 30.
-
Let me write that over here.
-
So the total sum of squares is equal to 30.
-
We figured out the sum of squares between each data point and its central tendency, its sample
-
mean. We totaled it all up, and we got 6 for the sum of squares within.
-
The sum of squares within was equal to 6. In this case, it had 6 degrees of freedom.
-
If we wanted to write it generally, there were m times (n-1) degrees of freedom.
-
Actually, for the total, we figured out we had m times n, minus 1, degrees of freedom.
-
Let me write the degrees of freedom in this column over here.
-
In this case, the number turned out to be 8.
-
And then just now, we calculated the sum of squares between the samples.
-
The sum of squares between the samples is equal to 24
-
and we figured out that it had m-1 degrees of freedom which ended up being 2.
-
Now the interesting thing here-- this is why this analysis of variance all fits nicely together.
-
In future videos we will think about how we can actually test hypotheses
-
using some of the tools that we're thinking about right now--
-
is that the sum of squares within plus the sum of squares between
-
is equal to the total sum of squares.
-
So the way to think about it is that the total variation in this data right here
-
can be described as the sum of the variation within each of these groups
-
when you take that total
-
plus the sum of the variation between the groups.
-
And even the degrees of freedom work out.
-
The sum of squares between has 2 degrees of freedom.
-
The sum of squares within each of the groups had 6 degrees of freedom.
-
2+6 is 8.
-
That's the total degrees of freedom we have for all of the data combined.
-
It even works if you look at the more general case.
-
Our sum of squares between had m-1 degrees of freedom.
-
Our sum of squares within had m(n-1) degrees of freedom.
-
This is equal to m-1+mn-m.
-
The m and the -m cancel out, so this is equal to mn-1 degrees of freedom,
-
which is exactly the total degrees of freedom we have for the total sum of squares.
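-
Putting all three pieces side by side, here is a short Python sketch that checks both the sums of squares and the degrees of freedom add up for this example. It is a verification aid rather than anything from the video; the variable names are chosen for illustration and it assumes equal group sizes.

```python
# The 9 data points from the example, arranged as three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def mean(values):
    return sum(values) / len(values)

all_points = [x for g in groups for x in g]
grand_mean = mean(all_points)  # equals the mean of means for equal-sized groups

# Total, within-group, and between-group sums of squares.
sst = sum((x - grand_mean) ** 2 for x in all_points)
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

m = len(groups)       # number of groups
n = len(groups[0])    # points per group (assumed the same for every group)
df_between = m - 1
df_within = m * (n - 1)
df_total = m * n - 1

print(sst, ssw, ssb)                    # 30.0 6.0 24.0
print(df_total, df_within, df_between)  # 8 6 2
```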
-
So the whole point of the calculations that we did in the last and this video
-
is just to appreciate that this total variation over here
-
can be viewed as the sum of these two component variations,
-
how much variation there is within each of the samples
-
plus how much variation there is between the means of the samples.
-
Hopefully that's not too confusing.