Hello, in this video I want to talk about the standard error. This really extends our understanding of sampling distributions and the central limit theorem. So, let's talk about what a standard error is.

First of all, we'll go back to the penguin example. You've seen this distribution before; it's a uniform distribution of data. Like any distribution, it has descriptive statistics. It has a population mean: the average penguin is 5.04 meters from the edge of the ice sheet. You can calculate a standard deviation for it, and that's 2.88, a measure of the spread. And there were 5,000 penguins floating on this ice sheet; that's N, the population size.

We then discussed how, if you were to randomly select either five penguins at a time or 50 penguins at a time (let's pick n = 5 for now), you could calculate the average distance from the edge of the ice sheet for each sample of five penguins. If you were to do that over and over and over again (in this histogram, we did it 1,000 times), you would be able to generate what's called the sampling distribution.
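The repeated-sampling experiment described above can be sketched in a few lines of Python. This is only an illustrative reconstruction: the video does not give the penguins' actual distances, so a uniform(0, 10) population (which has a mean near 5 and a standard deviation near 2.89, close to the quoted 5.04 and 2.88) stands in for the real data, and all variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in population: 5,000 penguins at uniform distances 0-10 m from the ice edge.
population = rng.uniform(0, 10, size=5_000)

# Draw 1,000 samples of n = 5 penguins and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=5, replace=False).mean()
    for _ in range(1_000)
])

# The histogram of `sample_means` is the sampling distribution of the sample means.
print(round(population.mean(), 2))    # population mean, close to 5
print(round(sample_means.mean(), 2))  # mean of the sample means, also close to 5
```

Plotting `sample_means` as a histogram reproduces the bell-shaped picture from the video, even though the population itself is flat.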
And it's the sampling distribution of the sample means; that's what it is. I told you that we could calculate the average of those sample means across the 1,000 samples, and that's this value. The notation we use for it is mu with a subscript x bar, and that's the mean of the sample means. I've forgotten the exact value, but it's going to be very, very close, so I just put approximately equal to 5.05; if you go back, it's 5.04. It's going to approximate the population average, and you can do that for any sample size. That was sample size five, so let's look at sample size 50. Again, we have the mean of the sample means, and that is also going to be very close to 5.04. It might even be a little closer, because our sample size is larger.

Two other things to notice about these distributions. Number one, they approximate normal distributions, despite the fact that the original population distribution of penguins was a uniform distribution.
Second thing to notice: the sample size doesn't really affect the value of the mean of the sample means, but it does affect the standard deviation of the sample means. If this is a normal distribution, or we believe it approximates one, and this one also approximates a normal distribution, then it's clear that this distance here (let's just assume it's one standard deviation and that I've drawn it in the right place) is greater than the corresponding standard deviation over here. So, as the sample size gets larger, the spread of the sample means gets smaller; we can say the standard deviation gets smaller.

Now, does this standard deviation have any relationship at all to the standard deviation of the original population? The original population standard deviation was 2.88. Is there any relationship at all between these two standard deviations? Because it's not like the mean of the sample means, which is pretty much the same regardless of the sample size. It does get better with larger samples, but it's close either way, especially if you have enough of these samples.
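The claim that the spread of the sample means shrinks as n grows is easy to check numerically. This is a sketch under the same assumption as before: a uniform(0, 10) population stands in for the penguin data, and the helper name is my own.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.uniform(0, 10, size=5_000)  # stand-in penguin population

def sd_of_sample_means(n, reps=1_000):
    """Standard deviation of the sample means over `reps` samples of size n."""
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(reps)]
    return float(np.std(means))

print(round(sd_of_sample_means(5), 2))   # wider spread for small samples
print(round(sd_of_sample_means(50), 2))  # noticeably narrower spread
```

The n = 50 spread comes out well under the n = 5 spread, which is exactly the visual difference between the two histograms in the video.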
What's the relationship between these standard deviations? It's clear that when you change n, this value is going to change, so is there a relationship? It turns out that there is, and we're going to look into that.

This graph here just shows you that the approximation to a normal distribution becomes better and better the larger the sample size. It's a little tricky to see, but I just want to point out one or two things. So, this value here is in red: if I were just to pick one penguin at a time, a sample size of one, this red line is the distribution of the sample means, and it looks like the original population. So, for a sample size of one, you don't get a normal distribution of the sample means; you get whatever the original population was. Let's look at two. I've got to find it on here; it's the orange one, and I believe it's this one here. Yes, it is this one here. This is what it looks like when n is two. Again, not really a normal distribution. Now, let's skip to 50.
This is the 50 here, and you can see it's very normal; you don't need me to help you too much. Then we've got blue at 25 here, that's the 25 one, and so on: this is the ten, and this is the five. I wanted to show you this graph because even with very, very small sample sizes, like five, we already get very close to a normal distribution. It's only with ridiculously small sample sizes, like one or two, that we don't do a very good job. So, even with small sample sizes, we get close to the normal distribution of the sample means.

So, back to the problem I posed a moment ago. This is our original population, with its standard deviation, and whenever we take a sample (again, this is just a sample size of five), this is the distribution of sample means. Its mean is going to approximate the population mean, but what is the relationship of its standard deviation to that of the original population? What is the relationship? It must also be related to the sample size, because it changes with the sample size.
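The visual point (n = 1 reproduces the flat population, and the shape becomes more normal as n grows) can also be measured rather than eyeballed. One hedged way, not from the video, is excess kurtosis, which is 0 for a perfect normal distribution and about -1.2 for a uniform one; drawing directly from a uniform(0, 10) distribution here is my own simplification.

```python
import numpy as np

rng = np.random.default_rng(2)

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a normal distribution, about -1.2 for a uniform one."""
    x = np.asarray(x)
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3)

def mean_kurtosis(n, reps=2_000):
    """Excess kurtosis of the sample means for samples of size n."""
    means = rng.uniform(0, 10, size=(reps, n)).mean(axis=1)
    return excess_kurtosis(means)

for n in (1, 2, 5, 50):
    print(n, round(mean_kurtosis(n), 2))  # moves toward 0 as n grows
```

At n = 1 the value sits near -1.2 (the uniform population itself); by n = 5 it is already small, and at n = 50 it is essentially 0, matching the curves in the graph.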
And it's just a formula. We're not going to talk much at all about how it's derived, but this formula very neatly tells us the relationship. What we have here is the standard deviation of the sampling distribution of the sample means, so we call it sigma subscript x bar: sigma x bar. Just to really reiterate what we're looking at: this is the distribution of sample means, and we're looking for this value, its standard deviation. And technically, that's the notation for it. What is that standard deviation? What we do is take the population standard deviation from the original population and divide it by the square root of n, and that gives us this value, this standard deviation. Its technical name is the standard deviation of the sampling distribution of the sample means, which is an awful mouthful, so we just call it the standard error of the mean. So, this graph illustrates how the standard error of the mean changes with sample size.
If I just go back to this slide here, we were asking the question: what's this value for a sample size of 50 compared to this value for a sample size of five? That was the question, and maybe here I'll write it out. This is the formula: the standard error of the mean, or the standard deviation of the sampling distribution of the sample means, is equal to the original population standard deviation divided by the square root of n. So, when we had that sample size of five, which is this one up here, what we're really looking at is this: the original standard deviation was 2.88, and we divide by the square root of the sample size, which is five, so that equals about 1.3. So, the standard deviation here is 1.3; that standard error, as we call it, is 1.3. What this is saying is that this value here is 1.3 above the mean of the sample means, which was 5.04, so this value here is going to be about 6.34. That's one standard deviation above the mean of the sample means. But if we have a sample size of fifty, then the calculation becomes this.
It becomes the original standard deviation of the population divided by the square root of 50, which is equal to (I've written this down so I can check) about 0.4. So, back on this graph, this value is 0.4 and this value is 1.3, and the standard error gets smaller the bigger the sample size.

This graph here that I got to previously is actually showing us how the standard error changes with sample size. We just had a sample size of 50, which is approximately here; if we go across to this value on this axis, it tells us that's about 0.4 for a sample size of 50. And if we had a sample size of 5, which is approximately here (I'm drawing a line, not very well), it goes to about there: about 1.3. There's nothing really too much for you to take home from this graph other than that, for any population standard deviation we have, as the sample size increases, the standard error gets much smaller very rapidly. A sample size of 5 is still quite high up on this curve, but once you come down to sample sizes of 20 or 30 or more, we get a very, very small standard error. This is just to reiterate that point, so you can see what these values are on this graph.
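The two calculations above are just the same formula evaluated at different n. A quick check of the arithmetic (the numbers 2.88, 5, and 50 come straight from the example):

```python
import math

sigma = 2.88  # population standard deviation from the penguin example

se_5 = sigma / math.sqrt(5)    # standard error for n = 5
se_50 = sigma / math.sqrt(50)  # standard error for n = 50

print(round(se_5, 2))   # ~1.29, the "1.3" quoted in the video
print(round(se_50, 2))  # ~0.41, the "0.4" quoted in the video
```

Note that tenfolding the sample size does not tenfold the precision: the standard error shrinks with the square root of n, which is why the curve flattens out so quickly.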
So let's put together what we've just learned about the standard error with what we learned previously about the central limit theorem. What we have just been discussing is that we have an original population, and it could be any distribution; here's our uniform distribution. If we take many samples from it, we get our sampling distribution, in this case of the sample means, and it is normally distributed, or approximately normally distributed. And we know that the sampling distribution has a mean that is approximately equal to the population mean, and we've just learned that the standard deviation of this approximately normal distribution is the standard error. I'll write here, "standard error."

So we can actually write this in notation form. We say that this sampling distribution is approximately normal (that's what this tilde squiggle means), and it has a mean equal to the population mean, so I'll just write here that the mean is the population mean. And the standard deviation of that distribution (we're talking about this distribution down here) is the standard error; that's what we call it.
And it's approximately equal to the standard deviation of the original population divided by the square root of the sample size n. So, this is a key thing that we know. If we have a population of any type (I'll just write "uniform" in here, but it could be bimodal, it could be uniform, it could be skewed), we know that if we were to take thousands and thousands of samples, or just one thousand, or just a few hundred, the sample means we get from all those samples are going to approximate a normal distribution; and if our sample size is larger, they're going to approximate a normal distribution even more closely. And we can already determine what the shape of that distribution is going to be, because we know that the population mean is approximately equal to the mean of the sample means, and we know that the standard error is the standard deviation of the sampling distribution. Okay, so we can work that out.

But the thing is, what you're probably already thinking is, "why do you care?" And you may not care, and that's fine; there's no particular reason to. But it can be very, very helpful. I'm just going to float this idea, and we'll return to it in future videos.
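The notation being described can be written compactly; this is the standard statement of the result the video is building up to, using the symbols already introduced:

```latex
\bar{X} \;\sim\; N\!\left(\mu_{\bar{x}},\; \sigma_{\bar{x}}\right)
\quad \text{approximately, where} \quad
\mu_{\bar{x}} \approx \mu
\quad \text{and} \quad
\sigma_{\bar{x}} \approx \frac{\sigma}{\sqrt{n}} .
```

That is, the sampling distribution of the sample means is approximately normal, centered on the population mean, with the standard error as its standard deviation.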
Hopefully it's gone through your head: why is this strange person taking thousands of samples all the time? You're not going to go to this penguin ice sheet and just keep randomly picking 5 penguins at random 1,000 times. In science, and in other settings where we collect data, it doesn't work like that; we pretty much always collect just one sample of data. So, when we collect one sample of data: this here is the sampling distribution for n = 5 penguins, which is what we got when we did do it 1,000 times. But let's just say that we did it one time and got a value around about here, around about 7 meters. That was our sample; we just got one sample. If we just got one sample, we don't really know anything about how certain or uncertain we are that this sample mean truly reflects the population mean. We know that if we did this many, many times, the average of all the sample means would converge on the true population mean. And that's our ultimate goal: normally we don't know the population mean, and we're trying to estimate it. So in our one sample, we just got this value of 7, say. How confident are we that that is the population mean?
What we're able to do, by knowing that this value of 7 comes, in theory, from a sampling distribution that exists, with a standard deviation that we call the standard error, is understand how far this value of 7 (or whatever value we collected; it could have been some other value, but our one sample was 7 meters) is from the mean, in units of standard deviations, or, technically, with a sampling distribution, in standard errors. So we're going to come back to this topic, but really the value of the standard error is that it enables us, when we collect one sample, to work out how far away our value is from the population mean, and therefore how confident we are that it is a true representation of the population mean. We're going to come back to this in future videos.
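The idea floated here, measuring how far one observed sample mean sits from the population mean in units of standard errors, is a z-score. A minimal sketch using the video's numbers (the sample mean of 7, mu = 5.04, sigma = 2.88, and n = 5 all come from the example; treating mu and sigma as known is an idealization, since in practice we are estimating them):

```python
import math

mu = 5.04     # population mean (known here only because it's a toy example)
sigma = 2.88  # population standard deviation
n = 5         # the size of our single sample
x_bar = 7.0   # the one sample mean we actually observed

se = sigma / math.sqrt(n)  # standard error of the mean, ~1.29
z = (x_bar - mu) / se      # distance from mu in standard errors

print(round(se, 2))
print(round(z, 2))  # ~1.52 standard errors above the population mean
```

So a single sample mean of 7 meters sits about one and a half standard errors above the true mean, which is the kind of "how unusual is this one sample?" statement the later videos will build on.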