WEBVTT 00:00:00.550 --> 00:00:02.758 - [Narrator] So we have nine students who recently 00:00:02.758 --> 00:00:07.757 graduated from a small school that has a class size of nine, 00:00:07.757 --> 00:00:10.843 and they wanna figure out what is the central tendency 00:00:10.843 --> 00:00:14.297 for salaries one year after graduation? 00:00:14.297 --> 00:00:16.847 And they also wanna have a sense of the spread around 00:00:16.847 --> 00:00:20.246 that central tendency one year after graduation. 00:00:20.246 --> 00:00:23.885 So they all agree to put in their salaries into a computer, 00:00:23.885 --> 00:00:25.912 and so these are their salaries. 00:00:25.912 --> 00:00:27.342 They're measured in thousands. 00:00:27.342 --> 00:00:30.787 So one makes 35,000, 50,000, 50,000, 50,000, 56,000, 00:00:30.787 --> 00:00:34.598 two make 60,000, one makes 75,000, and one makes 250,000. 00:00:34.598 --> 00:00:37.048 So she's doing very well for herself, 00:00:37.048 --> 00:00:40.583 and the computer it spits out a bunch of parameters 00:00:40.583 --> 00:00:42.583 based on this data here. 00:00:43.441 --> 00:00:47.230 So it spits out two typical measures of central tendency. 00:00:47.230 --> 00:00:50.144 The mean is roughly 76.2. 00:00:50.144 --> 00:00:53.012 The computer would calculate it by adding up all of these 00:00:53.012 --> 00:00:55.849 numbers, these nine numbers, and then dividing by nine, 00:00:55.849 --> 00:00:59.646 and the median is 56, and median is quite easy to calculate. 00:00:59.646 --> 00:01:01.990 You just order the numbers and you take 00:01:01.990 --> 00:01:04.623 the middle number here which is 56. 00:01:04.623 --> 00:01:07.631 Now what I want you to do is pause this video 00:01:07.631 --> 00:01:10.085 and think about for this data set, 00:01:10.085 --> 00:01:14.276 for this population of salaries, which measure, 00:01:14.276 --> 00:01:19.242 which measure of central tendency is a better measure? 
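As a quick check of the two numbers the computer spits out, here is a minimal sketch using Python's standard `statistics` module. The variable names are just illustrative; the nine salaries, in thousands of dollars, are the ones read out above.

```python
from statistics import mean, median

# Salaries in thousands of dollars, as read out in the video.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

mean_salary = mean(salaries)      # add up the nine numbers, divide by nine
median_salary = median(salaries)  # order the numbers, take the middle one

print(f"mean   = {mean_salary:.1f}")  # roughly 76.2
print(f"median = {median_salary}")    # 56
```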
00:01:19.242 --> 00:01:21.172 All right, so let's think about this a little bit. 00:01:21.172 --> 00:01:23.837 I'm gonna plot it on a line here. 00:01:23.837 --> 00:01:26.054 I'm gonna plot my data so we get a better sense 00:01:26.054 --> 00:01:28.407 and we just don't see them, so we just don't see things 00:01:28.407 --> 00:01:31.136 as numbers, but we see where those numbers sit 00:01:31.136 --> 00:01:32.633 relative to each other. 00:01:32.633 --> 00:01:34.818 So let's say this is zero. 00:01:34.818 --> 00:01:38.985 Let's say this is, let's see, one, two, three, four, five. 00:01:41.679 --> 00:01:45.846 So this would be 250, this is 50, 100, 150, 200, 200, 00:01:51.564 --> 00:01:52.530 and let's see. 00:01:52.530 --> 00:01:56.370 Let's say if this is 50 then this would be roughly 00:01:56.370 --> 00:01:58.804 40 right here, and I just wanna get rough. 00:01:58.804 --> 00:02:03.664 So this would be about 60, 70, 80, 90, close enough. 00:02:03.664 --> 00:02:05.591 I'm, I could draw this a little bit neater, 00:02:05.591 --> 00:02:07.258 but, 60, 70, 80, 90. 00:02:08.953 --> 00:02:12.437 Actually, let me just clean this up a little bit more too. 00:02:12.437 --> 00:02:14.023 This one right over here would be 00:02:14.023 --> 00:02:16.690 a little bit closer to this one. 00:02:18.416 --> 00:02:22.049 Let me just put it right around here. 00:02:22.049 --> 00:02:26.049 So that's 40, and then this would be 30, 20, 10. 00:02:27.231 --> 00:02:28.686 Okay, that's pretty good. 00:02:28.686 --> 00:02:30.098 So let's plot this data. 00:02:30.098 --> 00:02:34.265 So, one student makes 35,000, so that is right over there. 00:02:35.567 --> 00:02:38.411 Two make 50,000, or three make 50,000, 00:02:38.411 --> 00:02:40.328 so one, two, and three. 00:02:42.278 --> 00:02:43.860 I'll put it like that. 00:02:43.860 --> 00:02:48.027 One makes 56,000 which would put them right over here. 
00:02:49.897 --> 00:02:53.287 One makes 60,000, or actually, two make 60,000, 00:02:53.287 --> 00:02:54.804 so it's like that. 00:02:54.804 --> 00:02:58.387 One makes 75,000, so that's 60, 70, 75,000. 00:03:00.245 --> 00:03:02.108 So it's gonna be right around there, 00:03:02.108 --> 00:03:04.173 and then one makes 250,000. 00:03:04.173 --> 00:03:07.669 So one's salary is all the way around there, 00:03:07.669 --> 00:03:11.262 and then when we calculate the mean as 76.2 00:03:11.262 --> 00:03:13.328 as our measure of central tendency, 00:03:13.328 --> 00:03:15.411 76.2 is right over there. 00:03:16.646 --> 00:03:21.137 So is this a good measure of central tendency? 00:03:21.137 --> 00:03:23.237 Well to me it doesn't feel that good, 00:03:23.237 --> 00:03:26.137 because our measure of central tendency is higher than all 00:03:26.137 --> 00:03:29.862 of the data points except for one, and the reason is is that 00:03:29.862 --> 00:03:33.560 you have this one that the, that our, our data is skewed 00:03:33.560 --> 00:03:37.310 significantly by this data point at $250,000. 00:03:38.508 --> 00:03:41.288 It is so far from the rest of the distribution 00:03:41.288 --> 00:03:44.593 from the rest of the data that it has skewed the mean, 00:03:44.593 --> 00:03:46.894 and this is something that you see in general. 00:03:46.894 --> 00:03:49.866 If you have data that is skewed, and especially things like 00:03:49.866 --> 00:03:52.740 salary data where someone might make, most people are making 00:03:52.740 --> 00:03:56.100 50, 60, $70,000, but someone might make two million dollars, 00:03:56.100 --> 00:03:59.659 and so that will skew the average or skew the mean I should 00:03:59.659 --> 00:04:01.979 say, when you add them all up and divide by the number 00:04:01.979 --> 00:04:03.251 of data points you have. 00:04:03.251 --> 00:04:06.401 In this case, especially when you have data points that 00:04:06.401 --> 00:04:09.776 would skew the mean, median is much more robust. 
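To see why the mean doesn't feel that good here, a small sketch can count how many salaries sit below it. This only restates the point made on the number line; the list and names are illustrative, with salaries in thousands.

```python
# Salaries in thousands of dollars, from the example above.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

mean_salary = sum(salaries) / len(salaries)  # roughly 76.2

# Every salary except the 250 outlier falls below the mean.
below_mean = [s for s in salaries if s < mean_salary]
above_mean = [s for s in salaries if s > mean_salary]

print(len(below_mean), "of", len(salaries), "salaries are below the mean")
print("above the mean:", above_mean)
```

With one point pulling the mean up this far, it ends up describing a salary that eight of the nine students don't reach.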
00:04:09.776 --> 00:04:14.162 The median at 56 sits right over here, which seems to be 00:04:14.162 --> 00:04:17.461 much more indicative for central tendency. 00:04:17.461 --> 00:04:18.513 And think about it. 00:04:18.513 --> 00:04:21.579 Even if you made this instead of 250,000 00:04:21.579 --> 00:04:25.805 if you made this 250,000 thousand, which would be 250 00:04:25.805 --> 00:04:29.137 million dollars, which is a ginormous amount of money 00:04:29.137 --> 00:04:32.722 to make, it wouldn't, it would skew the mean incredibly, 00:04:32.722 --> 00:04:35.530 but it actually would not even change the median, 00:04:35.530 --> 00:04:37.338 because the median, it doesn't matter 00:04:37.338 --> 00:04:38.546 how high this number gets. 00:04:38.546 --> 00:04:39.934 This could be a trillion dollars. 00:04:39.934 --> 00:04:41.689 This could be a quadrillion dollars. 00:04:41.689 --> 00:04:43.943 The median is going to stay the same. 00:04:43.943 --> 00:04:45.828 So the median is much more robust 00:04:45.828 --> 00:04:48.148 if you have a skewed data set. 00:04:48.148 --> 00:04:51.518 Mean makes a little bit more sense if you have a symmetric 00:04:51.518 --> 00:04:54.559 data set or if you have things that are, you know, where, 00:04:54.559 --> 00:04:56.556 where things are roughly above and below the mean, 00:04:56.556 --> 00:04:59.614 or things aren't skewed incredibly in one direction, 00:04:59.614 --> 00:05:01.244 especially by a handful of data 00:05:01.244 --> 00:05:03.559 points like we have right over here. 00:05:03.559 --> 00:05:06.596 So in this example, the median is a much 00:05:06.596 --> 00:05:09.812 better measure of central tendency. 00:05:09.812 --> 00:05:11.400 And so what about spread? 00:05:11.400 --> 00:05:13.848 Well you might say, well, Sal you already told us 00:05:13.848 --> 00:05:15.685 that the mean is not so good 00:05:15.685 --> 00:05:18.496 and the standard deviation is based on the mean. 
00:05:18.496 --> 00:05:22.190 You take each of these data points, find their distance 00:05:22.190 --> 00:05:24.991 from the mean, square that number, add up those squared 00:05:24.991 --> 00:05:27.775 distances, divide by the number of data points if we're 00:05:27.775 --> 00:05:31.127 taking the population standard deviation, and then you, 00:05:31.127 --> 00:05:34.556 and then you, you take the square root of the whole thing. 00:05:34.556 --> 00:05:37.829 And so since this is based on the mean, which isn't a good 00:05:37.829 --> 00:05:41.402 measure of central tendency in this situation, and this, 00:05:41.402 --> 00:05:44.959 this is also going to skew that standard deviation. 00:05:44.959 --> 00:05:47.938 This is going to be, this is a lot larger 00:05:47.938 --> 00:05:50.472 than if you look at the, the actual, 00:05:50.472 --> 00:05:53.448 if you wanted an indication of the spread. 00:05:53.448 --> 00:05:56.648 Yes, you have this one data point that's way far away 00:05:56.648 --> 00:05:59.617 from either the mean or the median depending on how 00:05:59.617 --> 00:06:02.500 you wanna think about it, but most of the data points seem 00:06:02.500 --> 00:06:04.935 much closer, and so for that situation, 00:06:04.935 --> 00:06:07.113 not only are we using the median, 00:06:07.113 --> 00:06:10.778 but the interquartile range is once again more robust. 00:06:10.778 --> 00:06:13.056 How do we calculate the interquartile range? 00:06:13.056 --> 00:06:15.325 Well, you take the median and then you take the bottom 00:06:15.325 --> 00:06:18.978 group of numbers and calculate the median of those. 00:06:18.978 --> 00:06:21.947 So that's 50 right over here and then you take the top 00:06:21.947 --> 00:06:24.880 group of numbers, the upper group of numbers, 00:06:24.880 --> 00:06:28.931 and the median there is 60 and 75, it's 67.5. 
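The two spread calculations just described can be sketched as follows. This assumes the quartile convention used in the video (leave the median out and take the median of each half); other quartile conventions exist and can give slightly different values. `pstdev` is the population standard deviation from Python's `statistics` module, and the variable names are illustrative.

```python
from statistics import median, pstdev

# Salaries in thousands of dollars, sorted.
salaries = sorted([35, 50, 50, 50, 56, 60, 60, 75, 250])
n = len(salaries)

# Population standard deviation: square root of the average
# squared distance from the mean (divide by n, not n - 1).
population_sd = pstdev(salaries)

# Interquartile range, excluding the middle value when n is odd.
lower_half = salaries[: n // 2]        # 35, 50, 50, 50
upper_half = salaries[(n + 1) // 2 :]  # 60, 60, 75, 250
q1 = median(lower_half)                # 50
q3 = median(upper_half)                # halfway between 60 and 75
iqr = q3 - q1                          # 17.5

print(f"population standard deviation = {population_sd:.1f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The standard deviation comes out at roughly 62, much larger than the 17.5 of the interquartile range, because the squared distance of the 250 point dominates the sum.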
00:06:28.931 --> 00:06:30.914 If this looks unfamiliar we have many videos 00:06:30.914 --> 00:06:32.828 on interquartile range and calculating 00:06:32.828 --> 00:06:34.512 standard deviation and median and mean. 00:06:34.512 --> 00:06:35.890 This is just a little bit of a review, 00:06:35.890 --> 00:06:39.208 and then the difference between these two is 17.5, 00:06:39.208 --> 00:06:43.185 and notice, this distance between these two, this 17.5, 00:06:43.185 --> 00:06:44.908 this isn't going to change, 00:06:44.908 --> 00:06:48.203 even if this is 250 billion dollars. 00:06:48.203 --> 00:06:51.972 So once again, it is both of these measures are more robust 00:06:51.972 --> 00:06:54.639 when you have a skewed data set. 00:06:56.064 --> 00:06:59.410 So the big takeaway here is mean and standard deviation, 00:06:59.410 --> 00:07:02.232 they're not bad if you have a roughly symmetric data set, 00:07:02.232 --> 00:07:05.050 if you don't have any significant outliers, 00:07:05.050 --> 00:07:07.193 things that really skew the data set, 00:07:07.193 --> 00:07:10.278 mean and standard deviation can be quite solid. 00:07:10.278 --> 00:07:12.585 But if you're looking at something that could get really 00:07:12.585 --> 00:07:15.828 skewed by a handful of data points median might be, 00:07:15.828 --> 00:07:19.090 median and interquartile range, median for central tendency, 00:07:19.090 --> 00:07:23.489 interquartile range for spread around that central tendency, 00:07:23.489 --> 00:07:26.313 and that's why you'll see when people talk about salaries 00:07:26.313 --> 00:07:28.285 they'll often talk about median, because you can have 00:07:28.285 --> 00:07:30.262 some skewed salaries, especially on the up side. 
00:07:30.262 --> 00:07:32.407 When we talk about things like home prices you'll see 00:07:32.407 --> 00:07:35.474 median often measured more typically than mean, 00:07:35.474 --> 00:07:38.998 because home prices in a neighborhood, a lot of, 00:07:38.998 --> 00:07:42.256 or in a city, a lot of the houses might be in the 200,000, 00:07:42.256 --> 00:07:45.629 $300,000 range, but maybe there's one ginormous mansion 00:07:45.629 --> 00:07:48.864 that is 100 million dollars, and if you calculated mean 00:07:48.864 --> 00:07:51.850 that would skew and give a false impression of the average 00:07:51.850 --> 00:07:55.767 or the central tendency of prices in that city.
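The robustness claim from the salary example can be checked with a small sketch: swap the top salary of 250 (thousand) for 250,000 (thousand), that is 250 million dollars, and compare the four measures. The helper `summarize` is a hypothetical name introduced only for this illustration, and it assumes the same median-of-halves quartile convention used in the video.

```python
from statistics import mean, median, pstdev

def summarize(values):
    """Return mean, population SD, median and IQR for a list of numbers.

    Hypothetical helper for illustration; quartiles are the medians of
    the lower and upper halves, leaving out the middle value when the
    count is odd.
    """
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    return {
        "mean": mean(data),
        "sd": pstdev(data),
        "median": median(data),
        "iqr": q3 - q1,
    }

# The eight lower salaries (in thousands), plus a varying top salary.
base = [35, 50, 50, 50, 56, 60, 60, 75]

original = summarize(base + [250])      # top salary of 250 thousand
extreme = summarize(base + [250_000])   # top salary of 250 million

for label, stats in (("250 thousand", original), ("250 million", extreme)):
    print(
        f"top = {label}: mean = {stats['mean']:.1f}, sd = {stats['sd']:.1f}, "
        f"median = {stats['median']}, iqr = {stats['iqr']}"
    )
```

The mean and standard deviation move enormously with the top salary, while the median stays at 56 and the interquartile range stays at 17.5.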