1 00:00:00,680 --> 00:00:02,400 In this video and the next few videos, 2 00:00:02,400 --> 00:00:05,120 we're just really going to be doing a bunch of calculations 3 00:00:05,120 --> 00:00:07,640 about this data set right over here. 4 00:00:07,640 --> 00:00:09,850 And hopefully, just going through those calculations 5 00:00:09,850 --> 00:00:11,730 will give you an intuitive sense of what 6 00:00:11,730 --> 00:00:14,540 the analysis of variance is all about. 7 00:00:14,540 --> 00:00:17,060 Now, the first thing I want to do in this video 8 00:00:17,060 --> 00:00:19,940 is calculate the total sum of squares. 9 00:00:19,940 --> 00:00:22,740 So I'll call that SST. 10 00:00:22,740 --> 00:00:24,830 SS-- sum of squares total. 11 00:00:24,830 --> 00:00:27,170 And you could view it as really the numerator 12 00:00:27,170 --> 00:00:28,450 when you calculate variance. 13 00:00:28,450 --> 00:00:30,700 So you're just going to take the distance between each 14 00:00:30,700 --> 00:00:33,427 of these data points and the mean of all of these data 15 00:00:33,427 --> 00:00:35,260 points, square them, and just take that sum. 16 00:00:35,260 --> 00:00:37,660 We're not going to divide by the degrees of freedom, which 17 00:00:37,660 --> 00:00:39,535 you would normally do if you were calculating 18 00:00:39,535 --> 00:00:40,920 sample variance. 19 00:00:40,920 --> 00:00:42,180 Now, what is this going to be? 20 00:00:42,180 --> 00:00:43,680 Well, the first thing we need to do, 21 00:00:43,680 --> 00:00:46,920 we have to figure out the mean of all of this stuff over here. 22 00:00:46,920 --> 00:00:50,449 And I'm actually going to call that the grand mean. 23 00:00:50,449 --> 00:00:51,990 And I'm going to show you in a second 24 00:00:51,990 --> 00:00:55,140 that it's the same thing as the mean of the means of each 25 00:00:55,140 --> 00:00:56,750 of these data sets. 26 00:00:56,750 --> 00:00:58,920 So let's calculate the grand mean.
27 00:00:58,920 --> 00:01:09,990 So it's going to be 3 plus 2 plus 1 plus 5 plus 3 plus 4 28 00:01:09,990 --> 00:01:11,620 plus 5 plus 6 plus 7. 29 00:01:16,010 --> 00:01:20,430 And then we have nine data points here 30 00:01:20,430 --> 00:01:21,540 so we'll divide by 9. 31 00:01:21,540 --> 00:01:23,410 And what is this going to be equal to? 32 00:01:23,410 --> 00:01:26,460 3 plus 2 plus 1 is 6. 33 00:01:26,460 --> 00:01:28,160 6 plus-- let me just add. 34 00:01:28,160 --> 00:01:30,040 So these are 6. 35 00:01:30,040 --> 00:01:36,010 5 plus 3 plus 4 is 12. 36 00:01:36,010 --> 00:01:40,550 And then 5 plus 6 plus 7 is 18. 37 00:01:40,550 --> 00:01:44,860 And then 6 plus 12 is 18 plus another 18 is 36, divided by 9 38 00:01:44,860 --> 00:01:45,994 is equal to 4. 39 00:01:45,994 --> 00:01:48,160 And let me show you that that's the exact same thing 40 00:01:48,160 --> 00:01:49,940 as the mean of the means. 41 00:01:49,940 --> 00:01:52,580 So the mean of this group 1 over here-- 42 00:01:52,580 --> 00:01:54,580 let me do it in that same green-- 43 00:01:54,580 --> 00:01:58,350 the mean of group 1 over here is 3 plus 2 plus 1. 44 00:01:58,350 --> 00:02:00,970 That's that 6 right over here, divided by 3 data 45 00:02:00,970 --> 00:02:03,600 points so that will be equal to 2. 46 00:02:03,600 --> 00:02:08,574 The mean of group 2, the sum here is 12. 47 00:02:08,574 --> 00:02:09,740 We saw that right over here. 48 00:02:09,740 --> 00:02:13,250 5 plus 3 plus 4 is 12, divided by 3 49 00:02:13,250 --> 00:02:15,820 is 4 because we have three data points. 50 00:02:15,820 --> 00:02:21,000 And then the mean of group 3, 5 plus 6 51 00:02:21,000 --> 00:02:24,790 plus 7 is 18 divided by 3 is 6. 52 00:02:24,790 --> 00:02:27,100 So if you were to take the mean of the means, which 53 00:02:27,100 --> 00:02:30,120 is another way of viewing this grand mean, you have 2 plus 4 54 00:02:30,120 --> 00:02:34,040 plus 6, which is 12, divided by 3 means here. 
55 00:02:34,040 --> 00:02:35,530 And once again, you would get 4. 56 00:02:35,530 --> 00:02:36,946 So you could view this as the mean 57 00:02:36,946 --> 00:02:38,830 of all of the data in all of the groups 58 00:02:38,830 --> 00:02:41,505 or the mean of the means of each of these groups. 59 00:02:41,505 --> 00:02:43,380 But either way, now that we've calculated it, 60 00:02:43,380 --> 00:02:46,780 we can actually figure out the total sum of squares. 61 00:02:46,780 --> 00:02:48,590 So let's do that. 62 00:02:48,590 --> 00:02:53,950 So it's going to be equal to 3 minus 4-- 63 00:02:53,950 --> 00:02:59,730 the 4 is this 4 right over here-- squared plus 2 minus 4 64 00:02:59,730 --> 00:03:03,280 squared plus 1 minus 4 squared. 65 00:03:03,280 --> 00:03:05,490 Now, I'll do these guys over here in purple. 66 00:03:05,490 --> 00:03:15,270 Plus 5 minus 4 squared plus 3 minus 4 squared plus 4 minus 4 67 00:03:15,270 --> 00:03:15,770 squared. 68 00:03:15,770 --> 00:03:19,320 Let me scroll over a little bit. 69 00:03:19,320 --> 00:03:25,330 Now, we only have three left, plus 5 minus 4 squared 70 00:03:25,330 --> 00:03:31,210 plus 6 minus 4 squared plus 7 minus 4 squared. 71 00:03:31,210 --> 00:03:32,770 And what does this give us? 72 00:03:32,770 --> 00:03:36,620 So up here, this is going to be equal to 3 minus 4. 73 00:03:36,620 --> 00:03:37,490 Difference is 1. 74 00:03:37,490 --> 00:03:38,730 You square it. 75 00:03:38,730 --> 00:03:41,810 It's actually negative 1, but you square it, you get 1, 76 00:03:41,810 --> 00:03:48,320 plus you get negative 2 squared is 4, plus negative 3 squared. 77 00:03:48,320 --> 00:03:50,700 Negative 3 squared is 9. 78 00:03:50,700 --> 00:03:53,570 And then we have here in the magenta 5 minus 4 79 00:03:53,570 --> 00:03:55,640 is 1 squared is still 1. 80 00:03:55,640 --> 00:03:57,420 3 minus 4 squared is 1. 81 00:03:57,420 --> 00:03:59,320 You square it again, you still get 1. 82 00:03:59,320 --> 00:04:00,964 And then 4 minus 4 is just 0. 
83 00:04:00,964 --> 00:04:03,130 So we could-- well, I'll just write the 0 there just 84 00:04:03,130 --> 00:04:05,344 to show you that we actually calculated that. 85 00:04:05,344 --> 00:04:07,260 And then we have these last three data points. 86 00:04:07,260 --> 00:04:09,180 5 minus 4 squared. 87 00:04:09,180 --> 00:04:09,760 That's 1. 88 00:04:09,760 --> 00:04:11,800 6 minus 4 squared. 89 00:04:11,800 --> 00:04:13,330 That is 4, right? 90 00:04:13,330 --> 00:04:14,850 That's 2 squared. 91 00:04:14,850 --> 00:04:19,370 And then plus 7 minus 4 is 3 squared is 9. 92 00:04:19,370 --> 00:04:22,230 So what's this going to be equal to? 93 00:04:22,230 --> 00:04:27,760 So I have 1 plus 4 plus 9 right over here. 94 00:04:27,760 --> 00:04:29,110 That's 5 plus 9. 95 00:04:29,110 --> 00:04:33,360 This right over here is 14, right? 96 00:04:33,360 --> 00:04:35,110 5 plus-- yup, 14. 97 00:04:35,110 --> 00:04:37,290 And then we also have another 14 right over here 98 00:04:37,290 --> 00:04:39,270 because we have a 1 plus 4 plus 9. 99 00:04:39,270 --> 00:04:41,940 So that right over there is also 14. 100 00:04:41,940 --> 00:04:43,190 And then we have 2 over here. 101 00:04:43,190 --> 00:04:46,720 So it's going to be 28-- 14 times 2, 14 102 00:04:46,720 --> 00:04:50,738 plus 14 is 28-- plus 2 is 30. 103 00:04:50,738 --> 00:04:53,380 Is equal to 30. 104 00:04:53,380 --> 00:04:55,770 So our total sum of squares-- and actually, 105 00:04:55,770 --> 00:04:57,450 if we wanted the variance here, we 106 00:04:57,450 --> 00:04:59,760 would divide this by the degrees of freedom. 107 00:04:59,760 --> 00:05:02,300 And we've learned multiple times the degrees of freedom 108 00:05:02,300 --> 00:05:06,840 here so let's say that we have-- so we 109 00:05:06,840 --> 00:05:08,710 know that we have m groups over here. 
110 00:05:08,710 --> 00:05:10,800 So let me just write it as m and I'm not 111 00:05:10,800 --> 00:05:12,540 going to prove things rigorously here, 112 00:05:12,540 --> 00:05:14,510 but I want to show you where some 113 00:05:14,510 --> 00:05:17,730 of these strange formulas that show up in statistics books 114 00:05:17,730 --> 00:05:21,440 actually come from without proving it rigorously. 115 00:05:21,440 --> 00:05:22,780 More to give you the intuition. 116 00:05:22,780 --> 00:05:25,460 So we have m groups here. 117 00:05:25,460 --> 00:05:32,120 And each group here has n members. 118 00:05:32,120 --> 00:05:34,180 So how many total members do we have here? 119 00:05:34,180 --> 00:05:36,800 Well, we had m times n or 9, right? 120 00:05:36,800 --> 00:05:38,490 3 times 3 total members. 121 00:05:38,490 --> 00:05:41,510 So our degrees of freedom-- and remember, 122 00:05:41,510 --> 00:05:43,930 you have however many data points 123 00:05:43,930 --> 00:05:46,310 you had minus 1 degrees of freedom 124 00:05:46,310 --> 00:05:51,070 because if you know the mean of means, 125 00:05:51,070 --> 00:05:57,885 if you assume you knew that, then only 9 minus 1, 126 00:05:57,885 --> 00:06:00,260 only eight of these are going to give you new information 127 00:06:00,260 --> 00:06:03,005 because if you know that, you could calculate the last one. 128 00:06:03,005 --> 00:06:04,880 Or it really doesn't have to be the last one. 129 00:06:04,880 --> 00:06:07,629 If you have the other eight, you could calculate this one. 130 00:06:07,629 --> 00:06:09,420 If you have eight of them, you could always 131 00:06:09,420 --> 00:06:13,897 calculate the ninth one using the mean of means. 132 00:06:13,897 --> 00:06:15,730 So one way to think about it is that there's 133 00:06:15,730 --> 00:06:17,710 only eight independent measurements here. 
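The arithmetic up to this point can be checked with a short Python sketch. The group values are the ones read out in the video; the variable names (`groups`, `grand_mean`, `sst`, `dof`) are made up for illustration and are not from the video. The mean of the means matches the grand mean here because every group has the same number of members; with unequal group sizes the two would generally differ.

```python
# The three groups of three data points read out in the video.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

# Flatten the groups into one list of all nine data points.
all_points = [x for group in groups for x in group]

# Grand mean: the mean of every data point across all groups.
grand_mean = sum(all_points) / len(all_points)

# Mean of the group means (equal to the grand mean only because
# all groups here are the same size).
group_means = [sum(group) / len(group) for group in groups]
mean_of_means = sum(group_means) / len(group_means)

# Total sum of squares: squared distance of each point from the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_points)

# m groups with n members each give m * n - 1 degrees of freedom.
m = len(groups)
n = len(groups[0])
dof = m * n - 1

# Knowing the grand mean and any eight points pins down the ninth.
known_eight = all_points[:-1]
recovered_ninth = grand_mean * (m * n) - sum(known_eight)

print(grand_mean, mean_of_means, sst, dof, recovered_ninth)
```

The last two lines of the calculation illustrate the degrees-of-freedom argument: once the grand mean is fixed, the ninth value carries no new information.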
134 00:06:17,710 --> 00:06:22,470 Or if we want to talk generally, there 135 00:06:22,470 --> 00:06:27,810 are m times n-- so that tells us the total number of samples-- 136 00:06:27,810 --> 00:06:30,210 minus 1 degrees of freedom. 137 00:06:33,840 --> 00:06:37,920 And if we were actually calculating the variance here, 138 00:06:37,920 --> 00:06:41,640 we would just divide 30 by m times n minus 1 139 00:06:41,640 --> 00:06:44,960 or this is another way of saying eight degrees of freedom 140 00:06:44,960 --> 00:06:46,065 for this exact example. 141 00:06:46,065 --> 00:06:48,190 We would take 30 divided by 8 and we would actually 142 00:06:48,190 --> 00:06:50,110 have the variance for this entire group, 143 00:06:50,110 --> 00:06:52,964 for the group of nine when you combine them. 144 00:06:52,964 --> 00:06:54,380 I'll leave you here in this video. 145 00:06:54,380 --> 00:06:57,090 In the next video, we're going to try to figure out 146 00:06:57,090 --> 00:07:02,970 how much of this total variance, how much of this total 147 00:07:02,970 --> 00:07:06,400 squared sum, total variation comes 148 00:07:06,400 --> 00:07:09,850 from the variation within each of these groups 149 00:07:09,850 --> 00:07:13,790 versus the variation between the groups. 150 00:07:13,790 --> 00:07:15,290 And I think you get a sense of where 151 00:07:15,290 --> 00:07:17,354 this whole analysis of variance is coming from. 152 00:07:17,354 --> 00:07:18,770 It's the sense that, look, there's 153 00:07:18,770 --> 00:07:20,890 a variance of this entire sample of nine, 154 00:07:20,890 --> 00:07:23,430 but some of that variance-- if these groups are 155 00:07:23,430 --> 00:07:27,390 different in some way-- might come from the variation 156 00:07:27,390 --> 00:07:30,710 from being in different groups versus the variation from being 157 00:07:30,710 --> 00:07:31,350 within a group. 
158 00:07:31,350 --> 00:07:33,070 And we're going to calculate those two things 159 00:07:33,070 --> 00:07:34,528 and we're going to see that they're 160 00:07:34,528 --> 00:07:38,000 going to add up to the total squared sum variation.
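As a rough sketch of where this is heading, the Python snippet below computes the variance of the combined group of nine (30 divided by 8) and then checks, on this data set, the claim that the within-group and between-group pieces add up to the total sum of squares. The names `ssw` and `ssb` and the formulas used for them are assumptions, since the video has not defined them yet; they follow the usual textbook definitions (squared distances from each group's own mean, and group size times the squared distance of each group mean from the grand mean).

```python
# The three groups of three data points read out in the video.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
all_points = [x for group in groups for x in group]
grand_mean = sum(all_points) / len(all_points)

# Total sum of squares and the sample variance of all nine points.
sst = sum((x - grand_mean) ** 2 for x in all_points)
variance = sst / (len(all_points) - 1)

# Assumed definition: variation of each point around its own group mean.
ssw = sum(
    (x - sum(group) / len(group)) ** 2
    for group in groups
    for x in group
)

# Assumed definition: variation of each group mean around the grand mean,
# weighted by the number of members in the group.
ssb = sum(
    len(group) * (sum(group) / len(group) - grand_mean) ** 2
    for group in groups
)

print(variance, ssw, ssb, ssw + ssb == sst)
```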