WEBVTT 00:00:00.520 --> 00:00:02.170 In the last video we were able to 00:00:02.170 --> 00:00:05.970 calculate the total sum of squares for these 9 data points right here, 00:00:05.980 --> 00:00:10.030 these 9 data points are grouped into three different groups, 00:00:10.030 --> 00:00:12.800 or, if you wanted to speak generally, into "m" different groups. 00:00:12.900 --> 00:00:17.940 What I want to do in this video is to figure out how much of this total sum of squares, 00:00:17.940 --> 00:00:22.360 how much of this is due to variation within each group 00:00:22.380 --> 00:00:26.230 versus variation between the actual groups. 00:00:26.250 --> 00:00:29.970 So first let's figure out the total variation within the groups, 00:00:29.970 --> 00:00:36.200 so let's call that the sum of squares within, I'll do that in yellow, 00:00:36.490 --> 00:00:39.940 actually I've already used yellow so let's do this, I'm going to do blue. 00:00:40.180 --> 00:00:45.910 So the sum of squares within. 00:00:46.290 --> 00:00:50.850 Let me make that clear, that stands for within. 00:00:50.890 --> 00:00:53.710 So we want to see how much of the variation is 00:00:53.710 --> 00:00:57.960 due to how far each of these data points is from its central tendency, 00:00:57.960 --> 00:00:59.550 from its respective group mean. 00:00:59.550 --> 00:01:02.300 So this is going to be equal to-- let's start with these guys. 00:01:02.500 --> 00:01:07.220 So instead of taking the distance between each data point and the mean of means 00:01:07.220 --> 00:01:11.530 I'm going to find the distance between each data point and that group's mean 00:01:11.550 --> 00:01:16.550 because we want the total sum of the squared distances 00:01:16.550 --> 00:01:20.680 between each data point and its respective mean. 00:01:20.720 --> 00:01:25.740 3 minus the mean here, which is 2, squared, 00:01:25.760 --> 00:01:30.700 + 2 minus 2 squared, 00:01:30.940 --> 00:01:34.480 + 1 minus 2 squared.
00:01:34.700 --> 00:01:36.600 I'm going to do this for all of the groups, 00:01:36.600 --> 00:01:39.520 but for each group the distance between its data points and its mean, 00:01:39.560 --> 00:01:57.420 so + 5 minus 4 squared, + 3 minus 4 squared, + 4 minus 4 squared 00:01:57.420 --> 00:02:00.360 and finally we have the third group, 00:02:00.380 --> 00:02:04.910 and we're finding all of the sum of squares from each point to its central tendency 00:02:04.910 --> 00:02:06.640 within that group, we're going to add them all up. 00:02:07.140 --> 00:02:09.280 And then we find the third group so we have 00:02:09.280 --> 00:02:20.550 5 minus 6 squared + 6 minus 6 squared, + 7 minus 6 squared. 00:02:20.550 --> 00:02:22.390 And what is this going to equal? 00:02:22.420 --> 00:02:29.050 So this is going to be equal to, so up here it is going to be 1 + 0 + 1, 00:02:29.550 --> 00:02:31.510 that's going to be equal to 2, 00:02:31.850 --> 00:02:39.660 + this is going to be equal to 1 + 1 + 0, so another 2, 00:02:40.020 --> 00:02:51.130 + this is going to be equal to 1 + 0 + 1, so that's 2 over here. 00:02:51.540 --> 00:02:56.470 Our total sum of squares within is 6. 00:02:56.600 --> 00:03:00.870 So one way to think about it, our total variation was 30. 00:03:00.870 --> 00:03:08.660 Based on that calculation 6 of that 30 comes from variation within these samples. 00:03:09.020 --> 00:03:10.940 Now the next thing I want to think about is 00:03:11.180 --> 00:03:15.560 how many degrees of freedom do we have in this calculation, 00:03:15.560 --> 00:03:19.300 how many, kind of, independent data points do we actually have. 00:03:19.630 --> 00:03:27.610 Well, for each of these, over here, we have 'n' data points for each one, 00:03:27.610 --> 00:03:30.440 in particular n is 3 here, but if you know 00:03:30.710 --> 00:03:37.900 n minus one of them, you can always find the 'n'th one, if you know the actual sample mean.
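As a rough check on the within-group arithmetic above, here is a minimal Python sketch. The group values are the ones read out in the video (the second group is taken to be 5, 3, 4, which matches its squared distances of 1, 1 and 0). The names `groups`, `mean` and `sum_of_squares_within` are made up for this illustration.

```python
# Data from the video: three groups of three points each.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_squares_within(groups):
    """Add up squared distances from each point to its own group's mean."""
    total = 0.0
    for group in groups:
        group_mean = mean(group)
        total += sum((x - group_mean) ** 2 for x in group)
    return total


ssw = sum_of_squares_within(groups)
print(ssw)  # 6.0
```

Each group contributes 2, so the three groups together give the 6 found in the video.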
00:03:38.090 --> 00:03:42.130 So in this case for any of these groups if you know 2 of these data points, 00:03:42.130 --> 00:03:43.410 you can always figure out the third. 00:03:43.410 --> 00:03:44.550 If you know these two, you can always 00:03:44.550 --> 00:03:46.770 figure out the third if you know the sample mean. 00:03:47.130 --> 00:03:50.420 So in general let's figure out the degrees of freedom here. 00:03:50.420 --> 00:03:57.330 You have, for each group, when you did this you had 'n' minus one degrees of freedom. 00:03:57.370 --> 00:04:03.970 Remember 'n' is the number of data points you had in each group, 00:04:03.970 --> 00:04:09.310 so you have n minus one degrees of freedom for each of these groups, 00:04:09.350 --> 00:04:12.430 so it's n-1, n-1, n-1, 00:04:12.480 --> 00:04:19.210 or you have, let me put it this way, you have 'n-1' for each of these groups, 00:04:19.380 --> 00:04:21.660 and there are m groups. 00:04:21.660 --> 00:04:28.890 So there are m times (n-1) degrees of freedom. 00:04:28.910 --> 00:04:32.790 In this particular case, for each group, n-1 is 2, 00:04:32.790 --> 00:04:34.970 so for each group you have 2 degrees of freedom, 00:04:34.970 --> 00:04:45.680 and there are three groups, so there are 6 degrees of freedom. 00:04:46.100 --> 00:04:51.340 In the future we may do a more detailed discussion of what degrees of freedom mean 00:04:51.340 --> 00:04:54.380 and how to think about them mathematically. 00:04:54.380 --> 00:04:58.470 But the simplest way to think about it is as truly independent data points. 00:04:58.490 --> 00:05:01.180 Assuming you knew in this case the central statistic 00:05:01.180 --> 00:05:04.670 that we used to calculate the squared distances of each of these, if you know that already, 00:05:04.800 --> 00:05:08.230 the third data point actually could be calculated from the other 2. 00:05:08.230 --> 00:05:10.490 So we have 6 degrees of freedom over here.
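To make the "you can always figure out the third" idea concrete, here is a small Python sketch. It assumes, as in the video, that the group mean is already known. The function names `recover_last_point` and `degrees_of_freedom_within` are hypothetical, chosen just for this example.

```python
def recover_last_point(known_points, group_mean, n):
    """Given n - 1 points and the group mean, return the missing n-th point.

    The mean times n is the group's total, so the missing point is
    whatever is left after subtracting the known points.
    """
    return n * group_mean - sum(known_points)


def degrees_of_freedom_within(m, n):
    """m groups, each contributing n - 1 free data points."""
    return m * (n - 1)


# First group from the video: points 3 and 2 with mean 2 force the third to be 1.
print(recover_last_point([3, 2], group_mean=2, n=3))  # 1
print(degrees_of_freedom_within(m=3, n=3))            # 6
```

Because one point per group is pinned down by that group's mean, only n-1 points per group are free, which is where the m times (n-1) count comes from.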
00:05:10.720 --> 00:05:18.090 Now that was how much of the total variation is due to variation within each sample. 00:05:18.310 --> 00:05:23.800 Now think about how much of the variation is due to variation between the samples. 00:05:25.440 --> 00:05:29.380 And to do that, we're going to calculate-- get a nice color here-- 00:05:29.390 --> 00:05:30.750 I think I've run out of all the colors-- 00:05:30.750 --> 00:05:40.570 we'll call it sum of squares between, the B stands for between. 00:05:41.090 --> 00:05:44.560 So another way to think about it, how much of this total variation 00:05:44.560 --> 00:05:49.300 is due to the variation between the means, between the central tendencies, 00:05:49.380 --> 00:05:50.990 that's what we're going to calculate right now, and 00:05:50.990 --> 00:05:56.430 how much is due to variation from each data point to its mean. 00:05:56.740 --> 00:06:01.480 Let's figure out how much is due to variation between these guys over here. 00:06:01.500 --> 00:06:06.840 One way to think about it for each of these data points-- 00:06:06.850 --> 00:06:09.360 let's just think about this first group. 00:06:09.530 --> 00:06:12.850 For this first group, how much variation for each of these guys is 00:06:12.850 --> 00:06:18.230 due to the variation between this mean and the mean of means. 00:06:18.730 --> 00:06:23.200 For the first guy up here-- I'll just write it all out explicitly-- 00:06:23.600 --> 00:06:31.000 the variation is going to be its sample mean, 2, minus the mean of means, squared. 00:06:31.030 --> 00:06:33.010 And then for this guy, it's going to be the same thing. 00:06:33.010 --> 00:06:36.880 His sample mean, 2, minus the mean of means, squared. 00:06:37.650 --> 00:06:39.220 Plus same thing for this guy. 00:06:39.250 --> 00:06:41.920 His sample mean, 2, minus the mean of means, squared.
00:06:41.920 --> 00:06:52.200 Or another way to think about it, this is equal to 3 times (2-4) squared, 00:06:52.440 --> 00:07:02.650 which is the same thing as 3 times 4. It's equal to 12. 00:07:02.820 --> 00:07:05.810 I can do it for each of them. I actually want to find the total sum. 00:07:05.810 --> 00:07:08.640 Let me just write it all out. I think that might be an easier thing to do. 00:07:09.120 --> 00:07:13.230 For all of these guys combined, 00:07:13.230 --> 00:07:18.040 the sum of squares due to the differences between the samples. 00:07:18.040 --> 00:07:21.460 So that's from the first sample, the contribution from the first sample. 00:07:21.470 --> 00:07:23.130 And then from the second sample, 00:07:23.440 --> 00:07:28.760 you have this guy here, five-- sorry, you don't want to calculate him. 00:07:28.770 --> 00:07:33.040 For this data point, the amount of variation due to the difference between the means 00:07:33.040 --> 00:07:37.530 is going to be (4-4) squared. 00:07:37.770 --> 00:07:41.090 Same thing for this guy, it would be (4-4) squared. 00:07:41.100 --> 00:07:45.610 We're not taking it into consideration. We're only taking its sample mean into consideration. 00:07:45.920 --> 00:07:49.110 And then finally + (4-4) squared. 00:07:49.120 --> 00:07:50.370 We're taking this 00:07:50.370 --> 00:07:53.500 minus this squared for each of these data points. 00:07:53.500 --> 00:07:57.240 And then finally we'll do that with the last group. 00:07:57.550 --> 00:08:09.940 Sample mean is 6, so it's going to be (6-4) squared plus (6-4) squared plus (6-4) squared. 00:08:10.370 --> 00:08:12.070 Now, let's think about 00:08:12.070 --> 00:08:19.490 how many degrees of freedom we had in this calculation right over here. 00:08:19.940 --> 00:08:24.650 Well, in general, I guess the easiest way to think about it is, 00:08:24.650 --> 00:08:28.410 how much information do we have, assuming that we knew the mean of means?
00:08:28.410 --> 00:08:31.310 If we know the mean of means, how much here is new information? 00:08:31.920 --> 00:08:37.160 If you know the mean of the means and you know 2 of the sample means, 00:08:37.160 --> 00:08:38.470 you can always figure out the third. 00:08:38.470 --> 00:08:40.590 If you know this one and this one, you can figure out that one. 00:08:40.700 --> 00:08:42.710 If you know that one and that one, you can figure out that one. 00:08:42.710 --> 00:08:46.190 That's because this is the mean of these means over here. 00:08:46.360 --> 00:08:51.530 So in general, if you have m groups or if you have m means, 00:08:51.660 --> 00:09:05.880 there are m-1 degrees of freedom here. 00:09:05.910 --> 00:09:08.900 With that said, in this case m is 3. 00:09:08.900 --> 00:09:14.760 So we could say, there are 2 degrees of freedom for this exact example. 00:09:14.760 --> 00:09:18.670 Let's actually calculate the sum of squares between. So what is this going to be? 00:09:19.120 --> 00:09:29.340 This is going to be equal to, this right here is, 2-4 is -2, squared is 4. 00:09:29.350 --> 00:09:33.230 And then we have three fours over here, so three times four. 00:09:33.590 --> 00:09:51.070 Plus 3 times 0, plus 3 times (6-4) squared, which is 3 times 4. So plus 3 times 4. 00:09:51.280 --> 00:09:59.730 And we get 3 times 4 is 12 + 0 + 12, is equal to 24. 00:09:59.750 --> 00:10:03.960 So the sum of squares, or the variation due to 00:10:03.960 --> 00:10:08.690 the difference between the groups, between the means, is 24. 00:10:08.980 --> 00:10:11.570 Now let's put this all together. We said that 00:10:11.570 --> 00:10:17.820 the total variation when you look at all 9 data points, is 30. 00:10:17.820 --> 00:10:19.350 Let me write that over here. 00:10:19.800 --> 00:10:25.500 So the total sum of squares is equal to 30.
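The between-group calculation just described can be sketched in a few lines of Python. This is only an illustration under the video's setup (three groups of equal size, so the mean of means is also the mean of all the data). The names `groups`, `mean` and `sum_of_squares_between` are assumptions made for this example.

```python
# Data from the video: group means are 2, 4 and 6.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_squares_between(groups):
    """For every data point, square (its group mean - mean of means)."""
    group_means = [mean(group) for group in groups]
    mean_of_means = mean(group_means)
    # Each of the len(group) points in a group contributes the same term.
    return sum(
        len(group) * (group_mean - mean_of_means) ** 2
        for group, group_mean in zip(groups, group_means)
    )


ssb = sum_of_squares_between(groups)
print(ssb)              # 24.0
print(len(groups) - 1)  # 2, the m - 1 degrees of freedom
```

The three terms are 3 times 4, 3 times 0 and 3 times 4, matching the 12 + 0 + 12 worked out by hand.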
00:10:25.880 --> 00:10:32.590 We figured out the sum of squares between each data point and its central tendency, its sample 00:10:32.590 --> 00:10:39.640 mean, we figured it out, totaled it all up, and got 6 for the sum of squares within. 00:10:40.140 --> 00:10:48.800 The sum of squares within was equal to 6. In this case, it had 6 degrees of freedom. 00:10:48.810 --> 00:10:54.430 If we wanted to write it generally, there were m times (n-1) degrees of freedom. 00:10:54.650 --> 00:11:03.300 Actually for the total, we figured out we had m times n, minus 1, degrees of freedom. 00:11:03.320 --> 00:11:06.140 Let me write the degrees of freedom in this column over here. 00:11:06.240 --> 00:11:09.240 In this case, the number turned out to be 8. 00:11:09.240 --> 00:11:13.930 And then just now, we calculated the sum of squares between the samples. 00:11:14.180 --> 00:11:18.180 The sum of squares between the samples is equal to 24 00:11:18.180 --> 00:11:24.200 and we figured out that it had m-1 degrees of freedom which ended up being 2. 00:11:24.560 --> 00:11:31.210 Now the interesting thing here-- this is why this analysis of variance all fits nicely together. 00:11:31.230 --> 00:11:35.230 In future videos we will think about how we can actually test hypotheses 00:11:35.230 --> 00:11:38.040 using some of the tools that we're thinking about right now-- 00:11:38.300 --> 00:11:42.700 is that the sum of squares within plus the sum of squares between 00:11:42.700 --> 00:11:44.940 is equal to the total sum of squares. 00:11:45.040 --> 00:11:50.680 So the way to think about it is that the total variation in this data right here 00:11:50.680 --> 00:11:55.800 can be described as the sum of the variation within each of these groups 00:11:55.800 --> 00:11:57.800 when you take that total 00:11:58.130 --> 00:12:03.750 plus the sum of the variation between the groups. 00:12:03.770 --> 00:12:05.970 And even the degrees of freedom work out.
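The claim that the within and between pieces add up to the total can be checked numerically. The sketch below recomputes all three sums of squares from the video's 9 data points. The variable names (`sst`, `ssw`, `ssb`, `grand_mean`) are just labels chosen for this example.

```python
# Data from the video: three groups of three points each.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


all_points = [x for group in groups for x in group]
grand_mean = mean(all_points)  # 4.0, same as the mean of means here

# Total: every point against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_points)
# Within: every point against its own group's mean.
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
# Between: every point's group mean against the grand mean.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

print(sst, ssw, ssb)  # 30.0 6.0 24.0
assert ssw + ssb == sst
```

The exact equality check works here because all the means are whole numbers. With messier data a tolerance-based comparison such as `math.isclose` would be the safer choice.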
00:12:05.970 --> 00:12:08.900 The sum of squares between has 2 degrees of freedom. 00:12:08.960 --> 00:12:12.730 The sum of squares within each of the groups had 6 degrees of freedom. 00:12:12.740 --> 00:12:14.190 2+6 is 8. 00:12:14.230 --> 00:12:19.120 That's the total degrees of freedom we have for all of the data combined. 00:12:19.120 --> 00:12:22.910 It even works if you look at the more general. 00:12:22.930 --> 00:12:26.730 Our sum of squares between had m-1 degrees of freedom. 00:12:27.070 --> 00:12:33.140 Our sum of squares within had m(n-1) degrees of freedom. 00:12:33.310 --> 00:12:37.900 This is equal to m-1+mn-m. 00:12:38.280 --> 00:12:43.900 These guys cancel out. This is equal to mn-1 degrees of freedom, 00:12:43.920 --> 00:12:48.610 which is exactly the total degrees of freedom we have for the total sum of squares. 00:12:48.940 --> 00:12:53.660 So the whole point of the calculations that we did in the last and this video 00:12:53.670 --> 00:12:58.880 is just to appreciate that this total variation over here 00:12:58.880 --> 00:13:04.160 can be viewed as the sum of these two component variations, 00:13:04.400 --> 00:13:12.150 how much variation within each of the samples 00:13:12.250 --> 00:13:16.910 plus how much variation is there between the means of the samples. 00:13:16.910 --> 00:13:18.580 Hopefully that's not too confusing.
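For reference, the whole decomposition can be written out in one place. The notation is an assumption made for this summary and is not taken from the video's board: $x_{ij}$ is the $i$-th data point in group $j$, $\bar{x}_j$ is the mean of group $j$, and $\bar{\bar{x}}$ is the mean of means, with $m$ groups of $n$ points each.

```latex
\begin{align*}
SST &= \sum_{j=1}^{m}\sum_{i=1}^{n} \left(x_{ij} - \bar{\bar{x}}\right)^2 = 30,
  & \text{df} &= mn - 1 = 8 \\
SSW &= \sum_{j=1}^{m}\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2 = 6,
  & \text{df} &= m(n-1) = 6 \\
SSB &= \sum_{j=1}^{m} n\left(\bar{x}_j - \bar{\bar{x}}\right)^2 = 24,
  & \text{df} &= m - 1 = 2 \\[4pt]
SST &= SSW + SSB,
  & (m-1) + m(n-1) &= m - 1 + mn - m = mn - 1
\end{align*}
```

With $m = 3$ and $n = 3$ this gives $6 + 24 = 30$ for the sums of squares and $2 + 6 = 8$ for the degrees of freedom, the same numbers as in the video.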