0:00:01.486,0:00:04.716 Hello. In this video, I would[br]like to introduce you 0:00:04.716,0:00:08.165 to the concept[br]of sampling distributions. 0:00:08.165,0:00:13.550 Now, sampling distributions are going[br]to be fundamental to understanding 0:00:13.550,0:00:18.598 how we are able to arrive at [br]certain knowledge about distributions. 0:00:18.598,0:00:21.227 So what -- Let's, first of all, [br]let's just recap 0:00:21.227,0:00:23.957 what a sample is [br]and what a population is. 0:00:23.957,0:00:26.107 So we've seen this diagram before. 0:00:26.107,0:00:29.767 A population is essentially all [br]the individuals we could possibly 0:00:29.767,0:00:35.343 ever get information on, [br]get data. And populations -- 0:00:35.343,0:00:38.161 let's just pick a very, [br]very boring example of height. 0:00:38.161,0:00:43.303 Maybe the population is [br]all UT students, all UT students. 0:00:43.303,0:00:45.643 Now, every single student [br]enrolled at UT, 0:00:45.643,0:00:50.500 if we were able to get all of their [br]heights, there would be some true 0:00:50.500,0:00:53.500 measure of the average height.[br]The average height would be 0:00:53.500,0:00:56.670 a population mean, and there would [br]be some value that that is. 0:00:56.670,0:01:00.530 We'd also be able to get some[br]other truth, we call these truths. 0:01:00.530,0:01:03.280 A standard deviation, if we were[br]able to measure every 0:01:03.280,0:01:05.143 however many thousand[br]UT students (inaudible). 0:01:05.143,0:01:09.387 If we were able to measure those things,[br]we'd be able to get those values, 0:01:09.387,0:01:11.967 and they're the population values,[br]and you may remember 0:01:11.967,0:01:14.663 we call them parameters. 0:01:14.947,0:01:18.417 Now, that's pretty much impossible.[br]We would never actually, in practice, 0:01:18.417,0:01:21.207 ever be able to measure [br]the height of all UT students. 
0:01:21.207,0:01:23.670 So what we do, if we were interested,[br]if this was something we were 0:01:23.670,0:01:27.270 interested in for some strange reason,[br]we'd be able to just collect a sample. 0:01:27.270,0:01:30.410 I don't know. Maybe we'd get [br]a sample of five people, 0:01:30.410,0:01:32.730 or maybe we would get [br]a sample of 50 people, 0:01:32.730,0:01:35.460 or maybe we would[br]get a sample of 500. 0:01:35.460,0:01:37.560 In future videos, [br]we'll talk about 0:01:37.560,0:01:39.840 what's an appropriate size [br]of sample to collect 0:01:39.840,0:01:41.520 when we're interested [br]in finding out something. 0:01:41.520,0:01:45.600 But for now, who knows?[br]Maybe we picked ten people, 0:01:45.600,0:01:47.865 but we pick a sample [br]from those individuals. 0:01:47.865,0:01:51.040 Maybe we measure, we measure[br]all their heights and we calculate 0:01:51.040,0:01:55.530 their average height and we call[br]that the sample mean, "x-bar." 0:01:55.702,0:02:00.418 We're also able to calculate their,[br]maybe their sample standard deviation. 0:02:00.418,0:02:03.369 We'd call that "s," [br]something like that. 0:02:03.369,0:02:09.341 These things we call statistics.[br]This is just recap stuff. 0:02:09.341,0:02:12.627 And the purpose of [br]calculating the statistics 0:02:12.627,0:02:16.617 is because we want[br]to estimate these parameters. 0:02:16.617,0:02:20.414 We want to make an estimation[br]of what the true value is. 0:02:20.699,0:02:25.673 Now the issue is, let's say we're[br]just collecting samples of five, 0:02:25.913,0:02:29.533 and then maybe another day,[br]we got a different sample of five, 0:02:29.533,0:02:32.828 and then on another day,[br]we're addicted to measuring 0:02:32.828,0:02:35.638 the heights of people, [br]we get another sample of five. 0:02:35.638,0:02:38.058 Every time we get [br]another sample of five, 0:02:38.058,0:02:40.838 let's, for now, just consider [br]the sample mean. 
0:02:40.838,0:02:44.381 We're not that interested in[br]the standard deviation of these. 0:02:44.381,0:02:47.751 We're just gonna get a different [br]average every time we collect a 0:02:47.751,0:02:50.956 sample of five. It'd be [br]very, very unusual 0:02:50.956,0:02:54.596 if we randomly sampled five individuals[br]from all UT students, 0:02:54.596,0:02:58.716 and it was the same five people.[br]It's almost unlikely 0:02:58.716,0:03:01.788 ever to be the same five people,[br]and so the average height 0:03:01.788,0:03:04.698 of the sample mean is going [br]to be different every single time. 0:03:04.698,0:03:07.653 And I don't know, let's say [br]another day we did another five, 0:03:07.653,0:03:08.783 we got another sample mean. 0:03:08.783,0:03:11.713 So we're trying [br]to estimate these things, 0:03:11.713,0:03:15.253 but our values are going [br]to generally be different, 0:03:15.253,0:03:17.391 and then they're never going [br]to be exactly -- well they could 0:03:17.391,0:03:20.358 actually be exactly the population [br]mean, but it's unlikely. 0:03:20.358,0:03:23.640 So what we've just done, by the way,[br]is a sampling distribution. 0:03:23.640,0:03:26.812 I kind of, by accident, have [br]introduced you to it. 0:03:26.812,0:03:30.162 A sampling distribution is these things: 0:03:30.162,0:03:33.742 It's when you collect lots, [br]and lots, and lots of samples. 0:03:33.742,0:03:36.242 You collect lots and lots[br]of samples, and then you 0:03:36.242,0:03:38.072 just look at all [br]the values that you got. 0:03:38.072,0:03:41.332 And then you tend to plot [br]them as a histogram. 0:03:41.332,0:03:44.022 So in the next couple of slides,[br]I'll just more formally go through this, 0:03:44.022,0:03:48.032 because it gets very slightly [br]more exciting than this. 0:03:48.032,0:03:50.037 Well this distribution, actually, 0:03:50.037,0:03:52.717 looks quite nice [br]when you collect the data. 
0:03:52.717,0:03:56.938 Okay, so in this case, I don't have --[br]my population is not all UT students. 0:03:56.938,0:04:00.528 My population is four people.[br]There's just four people. 0:04:00.528,0:04:05.038 Maybe my population is four people[br]that are roommates together, 0:04:05.331,0:04:10.111 and I ask them, "How many people[br]do you phone in one week?" 0:04:10.111,0:04:13.891 And one person said they phoned[br]four people, one person said seven, 0:04:13.891,0:04:16.231 one person said five, [br]one person said eight. 0:04:16.231,0:04:19.367 So, if you look at [br]our population mean, it's six. 0:04:19.367,0:04:23.796 That's just 4+5+7+8 divided by [br]the number of individuals, 0:04:23.796,0:04:26.956 which is four. Our mean is six. 0:04:26.956,0:04:30.189 So, I was really interested [br]in knowing how many, 0:04:30.189,0:04:33.315 this population of this one house, 0:04:33.315,0:04:36.577 I wanted to know what's[br]the average number of people 0:04:36.577,0:04:37.917 that they phone in a week. 0:04:37.917,0:04:41.037 But I thought collecting data on [br]four people was too much work, 0:04:41.037,0:04:44.253 so I decided, I'm just going [br]to collect samples of three people. 0:04:44.253,0:04:47.093 And so, I got a five, [br]a seven, and an eight. 0:04:47.093,0:04:49.111 They were the first three people [br]that I sampled. 0:04:49.111,0:04:51.840 And so my sample mean [br]here is 6.67, 0:04:51.840,0:04:56.167 which is 5+7+8 divided by [br]my sample size of three. 0:04:56.339,0:04:59.979 So I got one sample mean,[br]but this is not a sampling distribution 0:04:59.979,0:05:01.549 because it's just one sample. 0:05:01.549,0:05:04.563 For a sampling distribution,[br]I need to do this over and over 0:05:04.563,0:05:06.263 and over again. 
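As a quick check on the arithmetic above, here is a minimal Python sketch of the two means just described. The variable names are made up for illustration; the numbers are the ones given in the video.

```python
# Calls per week reported by the four roommates (the whole population).
population = [4, 7, 5, 8]

# Population mean: add up every value and divide by the population size.
population_mean = sum(population) / len(population)

# The first sample of three that was drawn in the video.
first_sample = [5, 7, 8]

# Sample mean (x-bar): same calculation, but only on the sampled values.
sample_mean = sum(first_sample) / len(first_sample)

print(f"population mean = {population_mean:.2f}")  # 6.00
print(f"sample mean     = {sample_mean:.2f}")      # 6.67
```

This one sample mean (6.67) is an estimate of the population mean (6), and it is only a single value, not yet a sampling distribution.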
0:05:06.263,0:05:09.077 So I can do something like this,[br]where I do, let's say, 0:05:09.894,0:05:13.771 one sample, two samples, three samples,[br]four samples, five samples, six samples, 0:05:13.771,0:05:17.841 and here are my values:[br]5,7,8; 4,5,8; 4,7,8; 4,8,8; 0:05:17.841,0:05:20.812 and for each of these, [br]I calculate the sample mean. 0:05:21.281,0:05:24.212 This is a sampling distribution.[br]I've now got many samples, 0:05:24.212,0:05:27.314 and I've collected [br]the sample mean for each of them. 0:05:28.031,0:05:30.601 One thing -- there are two things[br]you may have noticed about this: 0:05:30.601,0:05:34.341 one thing I want you to notice is that [br]I've actually sampled with replacement. 0:05:34.341,0:05:36.660 So in the UT example, I said[br]it would be really unlikely 0:05:36.660,0:05:39.447 to ever get [br]the same five people. 0:05:39.447,0:05:41.835 More than that, if you randomly[br]selected five people, 0:05:41.835,0:05:45.255 it's unlikely to get the same [br]individual twice in one sample. 0:05:45.255,0:05:47.785 But for sampling distributions,[br]we tend to actually say 0:05:47.785,0:05:49.563 that we will sample [br]with replacement. 0:05:49.563,0:05:51.907 That's just a -- it's not something[br]we actually need to worry 0:05:51.907,0:05:54.277 too much about right now.[br]I just want you to be aware of -- 0:05:54.277,0:05:57.297 in this particular example,[br]I sampled with replacement, 0:05:57.297,0:06:00.207 which means you could sample [br]the same individual twice, 0:06:00.207,0:06:03.170 or even three times, because [br]you'll notice the 64th sample 0:06:03.170,0:06:05.830 had the same individual[br]three times and I got 8,8,8, 0:06:05.830,0:06:08.805 and the mean of [br]that sample is 8. 0:06:08.805,0:06:11.385 The second thing you may [br]have noticed is that 0:06:11.385,0:06:14.605 the sample number [br]only goes up to 64 here. 
0:06:14.605,0:06:16.605 I've done 64 samples, [br]and that's because 0:06:16.605,0:06:20.855 when you collect three numbers[br]from four potential numbers, 0:06:21.213,0:06:25.233 there are only 64 combinations of the data,[br]so I just stopped at 64. 0:06:25.233,0:06:29.448 Okay, but what I want you to realize[br]is that I have collected 64 samples. 0:06:29.448,0:06:31.638 These dots just refer to --[br]I didn't put all the data 0:06:31.638,0:06:35.764 between 6 and 64.[br]I have 64 sample means. 0:06:35.764,0:06:40.269 So I have 64 of these things,[br]so what should I do with them? 0:06:40.269,0:06:42.699 Well, I could plot them on a histogram. 0:06:42.699,0:06:47.272 So, here's my histogram [br]of the 64 sample means, 0:06:47.609,0:06:50.999 and I wonder what you think about it. 0:06:50.999,0:06:55.179 Well, one thing is I've used a nice color,[br]I think, for the bars, but another thing 0:06:55.179,0:06:59.216 is maybe you see that there's[br]potentially a shape in this data. 0:06:59.365,0:07:04.639 So if I, if you allow me to[br]kind of draw a curve over it, 0:07:04.639,0:07:07.211 I haven't done a great job,[br]but there's a curve, 0:07:07.211,0:07:09.624 and it looks normal-ish. 0:07:09.624,0:07:11.946 I'm using the word, "ish," because[br]it's obviously not normal. 0:07:11.946,0:07:15.184 It's kind of getting there to be [br]normally distributed. 0:07:15.184,0:07:18.367 So this is just a histogram of all[br]the possible sample means. 0:07:18.367,0:07:23.252 And here is the one where we had 8,8,8,[br]and actually there's one down here 0:07:23.252,0:07:27.689 where we got 4,4,4, but there's all[br]the ones in between as well. 0:07:30.486,0:07:33.310 Now what we --[br]And I'll just put the curve back on it. 0:07:33.310,0:07:35.739 Maybe I'll actually this time [br]do a better job. 0:07:35.739,0:07:37.674 I'm not sur --[br]No, I didn't. 0:07:37.674,0:07:39.437 I'm gonna undo that because that[br]was a terrible job. 
0:07:39.437,0:07:41.211 Let's see if I can -- 0:07:41.211,0:07:44.501 I think going slow...[br]no, it will do. 0:07:44.501,0:07:45.899 Mmm... no it won't. 0:07:45.899,0:07:47.010 Let's do it again. 0:07:47.010,0:07:49.369 This time, third time's lucky. 0:07:50.478,0:07:51.983 Okay I'm happy with that. 0:07:51.983,0:07:55.000 This is the sampling distribution. 0:07:55.000,0:07:57.338 I just told you that,[br]but I want you to formally know 0:07:57.338,0:07:58.404 what it means. 0:07:58.404,0:08:03.836 It's the sampling distribution,[br]dot dot dot, for the sample mean, 0:08:03.836,0:08:05.111 that's what we collected. 0:08:05.111,0:08:08.201 We collected the sample mean,[br]so it's a sampling distribution of the 0:08:08.201,0:08:12.162 sample mean, and then we say[br]for n=3. 0:08:12.162,0:08:15.938 Because our sample size [br]could have been 2, n=2, 0:08:15.938,0:08:17.272 in which case we would have [br]had a different distribution. 0:08:17.272,0:08:19.782 It could've been 1. We could've [br]just been really lazy and said, 0:08:19.782,0:08:22.110 "I only want to just check [br]one person and ask them," 0:08:22.110,0:08:24.948 and calculated [br]the sample mean for n=1. 0:08:24.948,0:08:27.509 But in this sample -- [br]we did three people, 0:08:27.509,0:08:30.317 and so this is technically [br]the sampling distribution 0:08:30.317,0:08:33.085 for the sample mean for n=3. 0:08:34.833,0:08:39.363 Okay, so that's probably enough for now [br]on introducing sampling distributions. 0:08:39.363,0:08:42.585 I hope you understand[br]what it is. 0:08:42.585,0:08:44.152 It is the -- [br]you collect many samples, 0:08:44.152,0:08:46.358 and you collect some information [br]about each of those samples -- 0:08:46.358,0:08:50.606 in this case, it was the sample mean --[br]and then you plot them as a histogram, 0:08:50.606,0:08:53.177 and you are able to look at [br]the shape of the distribution. 0:08:53.415,0:08:55.441 And that's what we call [br]a sampling distribution. 
0:08:55.441,0:08:58.013 And we're going to extend [br]this idea in future videos.
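To make the 64-sample example from the video concrete, here is a rough Python sketch that rebuilds it. It assumes, as in the video, a population of 4, 5, 7 and 8 and samples of size 3 drawn with replacement, where the order of the draws counts (4 x 4 x 4 = 64 possible samples). The variable names and the text histogram are illustrative only and are not taken from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [4, 5, 7, 8]  # calls per week for the four roommates
n = 3                      # sample size used in the video

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Sample mean of each one, kept as exact fractions so equal means group cleanly.
sample_means = [Fraction(sum(s), n) for s in samples]

# Tally how often each sample mean occurs: this tally is the
# sampling distribution of the sample mean for n=3.
counts = Counter(sample_means)

print(f"number of samples: {len(samples)}")
for m in sorted(counts):
    # One '#' per sample with this mean, as a crude text histogram.
    print(f"{float(m):5.2f} {'#' * counts[m]}")
```

The lowest bar is the single 4,4,4 sample and the highest is the single 8,8,8 sample, with the other means piling up in between. Changing `n` to 2 or 1 gives the different distributions mentioned for other sample sizes.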