1 00:00:00,550 --> 00:00:02,758 - [Narrator] So we have nine students who recently 2 00:00:02,758 --> 00:00:07,757 graduated from a small school that has a class size of nine, 3 00:00:07,757 --> 00:00:10,843 and they wanna figure out what is the central tendency 4 00:00:10,843 --> 00:00:14,297 for salaries one year after graduation? 5 00:00:14,297 --> 00:00:16,847 And they also wanna have a sense of the spread around 6 00:00:16,847 --> 00:00:20,246 that central tendency one year after graduation. 7 00:00:20,246 --> 00:00:23,885 So they all agree to put in their salaries into a computer, 8 00:00:23,885 --> 00:00:25,912 and so these are their salaries. 9 00:00:25,912 --> 00:00:27,342 They're measured in thousands. 10 00:00:27,342 --> 00:00:30,787 So one makes 35,000, 50,000, 50,000, 50,000, 56,000, 11 00:00:30,787 --> 00:00:34,598 two make 60,000, one makes 75,000, and one makes 250,000. 12 00:00:34,598 --> 00:00:37,048 So she's doing very well for herself, 13 00:00:37,048 --> 00:00:40,583 and the computer it spits out a bunch of parameters 14 00:00:40,583 --> 00:00:42,583 based on this data here. 15 00:00:43,441 --> 00:00:47,230 So it spits out two typical measures of central tendency. 16 00:00:47,230 --> 00:00:50,144 The mean is roughly 76.2. 17 00:00:50,144 --> 00:00:53,012 The computer would calculate it by adding up all of these 18 00:00:53,012 --> 00:00:55,849 numbers, these nine numbers, and then dividing by nine, 19 00:00:55,849 --> 00:00:59,646 and the median is 56, and median is quite easy to calculate. 20 00:00:59,646 --> 00:01:01,990 You just order the numbers and you take 21 00:01:01,990 --> 00:01:04,623 the middle number here which is 56. 22 00:01:04,623 --> 00:01:07,631 Now what I want you to do is pause this video 23 00:01:07,631 --> 00:01:10,085 and think about for this data set, 24 00:01:10,085 --> 00:01:14,276 for this population of salaries, which measure, 25 00:01:14,276 --> 00:01:19,242 which measure of central tendency is a better measure? 
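If you want to check the computer's two numbers yourself, here is a minimal Python sketch. It assumes the nine salaries are entered in thousands of dollars exactly as read out above; the variable names are just for illustration, and it relies only on Python's standard `statistics` module.

```python
import statistics

# The nine salaries from the video, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Mean: add up all nine numbers (686) and divide by nine.
mean_salary = statistics.mean(salaries)

# Median: order the numbers and take the middle (fifth) one.
median_salary = statistics.median(salaries)

print(round(mean_salary, 1))  # 76.2
print(median_salary)          # 56
```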
26 00:01:19,242 --> 00:01:21,172 All right, so let's think about this a little bit. 27 00:01:21,172 --> 00:01:23,837 I'm gonna plot it on a line here. 28 00:01:23,837 --> 00:01:26,054 I'm gonna plot my data so we get a better sense, 29 00:01:26,054 --> 00:01:28,407 so we don't just see things 30 00:01:28,407 --> 00:01:31,136 as numbers, but we see where those numbers sit 31 00:01:31,136 --> 00:01:32,633 relative to each other. 32 00:01:32,633 --> 00:01:34,818 So let's say this is zero. 33 00:01:34,818 --> 00:01:38,985 Let's say this is, let's see, one, two, three, four, five. 34 00:01:41,679 --> 00:01:45,846 So this would be 250, this is 50, 100, 150, 200, 35 00:01:51,564 --> 00:01:52,530 and let's see. 36 00:01:52,530 --> 00:01:56,370 Let's say if this is 50 then this would be roughly 37 00:01:56,370 --> 00:01:58,804 40 right here, and I just wanna get rough. 38 00:01:58,804 --> 00:02:03,664 So this would be about 60, 70, 80, 90, close enough. 39 00:02:03,664 --> 00:02:05,591 I could draw this a little bit neater, 40 00:02:05,591 --> 00:02:07,258 but, 60, 70, 80, 90. 41 00:02:08,953 --> 00:02:12,437 Actually, let me just clean this up a little bit more too. 42 00:02:12,437 --> 00:02:14,023 This one right over here would be 43 00:02:14,023 --> 00:02:16,690 a little bit closer to this one. 44 00:02:18,416 --> 00:02:22,049 Let me just put it right around here. 45 00:02:22,049 --> 00:02:26,049 So that's 40, and then this would be 30, 20, 10. 46 00:02:27,231 --> 00:02:28,686 Okay, that's pretty good. 47 00:02:28,686 --> 00:02:30,098 So let's plot this data. 48 00:02:30,098 --> 00:02:34,265 So, one student makes 35,000, so that is right over there. 49 00:02:35,567 --> 00:02:38,411 Three make 50,000, 50 00:02:38,411 --> 00:02:40,328 so one, two, and three. 51 00:02:42,278 --> 00:02:43,860 I'll put it like that. 52 00:02:43,860 --> 00:02:48,027 One makes 56,000, which would put them right over here. 
53 00:02:49,897 --> 00:02:53,287 Two make 60,000, 54 00:02:53,287 --> 00:02:54,804 so it's like that. 55 00:02:54,804 --> 00:02:58,387 One makes 75,000, so that's 60, 70, 75,000. 56 00:03:00,245 --> 00:03:02,108 So it's gonna be right around there, 57 00:03:02,108 --> 00:03:04,173 and then one makes 250,000. 58 00:03:04,173 --> 00:03:07,669 So one's salary is all the way around there, 59 00:03:07,669 --> 00:03:11,262 and then when we calculate the mean as 76.2 60 00:03:11,262 --> 00:03:13,328 as our measure of central tendency, 61 00:03:13,328 --> 00:03:15,411 76.2 is right over there. 62 00:03:16,646 --> 00:03:21,137 So is this a good measure of central tendency? 63 00:03:21,137 --> 00:03:23,237 Well, to me it doesn't feel that good, 64 00:03:23,237 --> 00:03:26,137 because our measure of central tendency is higher than all 65 00:03:26,137 --> 00:03:29,862 of the data points except for one, and the reason is that 66 00:03:29,862 --> 00:03:33,560 our data is skewed 67 00:03:33,560 --> 00:03:37,310 significantly by this data point at $250,000. 68 00:03:38,508 --> 00:03:41,288 It is so far from the rest of the distribution, 69 00:03:41,288 --> 00:03:44,593 from the rest of the data, that it has skewed the mean, 70 00:03:44,593 --> 00:03:46,894 and this is something that you see in general. 71 00:03:46,894 --> 00:03:49,866 If you have data that is skewed, and especially things like 72 00:03:49,866 --> 00:03:52,740 salary data where most people are making 73 00:03:52,740 --> 00:03:56,100 $50,000, $60,000, or $70,000 but someone might make two million dollars, 74 00:03:56,100 --> 00:03:59,659 that will skew the average, or skew the mean I should 75 00:03:59,659 --> 00:04:01,979 say, when you add them all up and divide by the number 76 00:04:01,979 --> 00:04:03,251 of data points you have. 
77 00:04:03,251 --> 00:04:06,401 In this case, especially when you have data points that 78 00:04:06,401 --> 00:04:09,776 would skew the mean, median is much more robust. 79 00:04:09,776 --> 00:04:14,162 The median at 56 sits right over here, which seems to be 80 00:04:14,162 --> 00:04:17,461 much more indicative of central tendency. 81 00:04:17,461 --> 00:04:18,513 And think about it. 82 00:04:18,513 --> 00:04:21,579 Even if, instead of 250,000, 83 00:04:21,579 --> 00:04:25,805 you made this 250,000 thousand, which would be 250 84 00:04:25,805 --> 00:04:29,137 million dollars, which is a ginormous amount of money 85 00:04:29,137 --> 00:04:32,722 to make, it would skew the mean incredibly, 86 00:04:32,722 --> 00:04:35,530 but it actually would not even change the median, 87 00:04:35,530 --> 00:04:37,338 because for the median, it doesn't matter 88 00:04:37,338 --> 00:04:38,546 how high this number gets. 89 00:04:38,546 --> 00:04:39,934 This could be a trillion dollars. 90 00:04:39,934 --> 00:04:41,689 This could be a quadrillion dollars. 91 00:04:41,689 --> 00:04:43,943 The median is going to stay the same. 92 00:04:43,943 --> 00:04:45,828 So the median is much more robust 93 00:04:45,828 --> 00:04:48,148 if you have a skewed data set. 94 00:04:48,148 --> 00:04:51,518 Mean makes a little bit more sense if you have a symmetric 95 00:04:51,518 --> 00:04:54,559 data set, or if you have a data set 96 00:04:54,559 --> 00:04:56,556 where things are roughly evenly spread above and below the mean, 97 00:04:56,556 --> 00:04:59,614 where things aren't skewed incredibly in one direction, 98 00:04:59,614 --> 00:05:01,244 especially by a handful of data 99 00:05:01,244 --> 00:05:03,559 points like we have right over here. 100 00:05:03,559 --> 00:05:06,596 So in this example, the median is a much 101 00:05:06,596 --> 00:05:09,812 better measure of central tendency. 102 00:05:09,812 --> 00:05:11,400 And so what about spread? 
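To see that robustness claim in numbers, here is a small Python sketch. It assumes the same nine salaries in thousands and swaps the top one for 250,000 thousand (250 million dollars); the name `inflated` is just an illustrative label, not anything from the video.

```python
import statistics

# The nine salaries from the video, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Same data, but the top earner now makes 250,000 thousand
# (250 million dollars) instead of 250 thousand.
inflated = salaries[:-1] + [250_000]

median_before = statistics.median(salaries)
median_after = statistics.median(inflated)
mean_before = statistics.mean(salaries)
mean_after = statistics.mean(inflated)

print(median_before, median_after)   # 56 56
print(round(mean_before, 1))         # 76.2
print(round(mean_after, 1))          # 27826.2
```

The median stays at 56 because it only depends on which value sits in the middle of the ordered list, not on how large the biggest value is.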
103 00:05:11,400 --> 00:05:13,848 Well, you might say, Sal, you already told us 104 00:05:13,848 --> 00:05:15,685 that the mean is not so good, 105 00:05:15,685 --> 00:05:18,496 and the standard deviation is based on the mean. 106 00:05:18,496 --> 00:05:22,190 You take each of these data points, find their distance 107 00:05:22,190 --> 00:05:24,991 from the mean, square that number, add up those squared 108 00:05:24,991 --> 00:05:27,775 distances, divide by the number of data points if we're 109 00:05:27,775 --> 00:05:31,127 taking the population standard deviation, 110 00:05:31,127 --> 00:05:34,556 and then you take the square root of the whole thing. 111 00:05:34,556 --> 00:05:37,829 And so since this is based on the mean, which isn't a good 112 00:05:37,829 --> 00:05:41,402 measure of central tendency in this situation, 113 00:05:41,402 --> 00:05:44,959 this is also going to skew that standard deviation. 114 00:05:44,959 --> 00:05:47,938 It is going to be a lot larger 115 00:05:47,938 --> 00:05:50,472 than what you would want 116 00:05:50,472 --> 00:05:53,448 as an indication of the actual spread. 117 00:05:53,448 --> 00:05:56,648 Yes, you have this one data point that's way far away 118 00:05:56,648 --> 00:05:59,617 from either the mean or the median, depending on how 119 00:05:59,617 --> 00:06:02,500 you wanna think about it, but most of the data points seem 120 00:06:02,500 --> 00:06:04,935 much closer, and so for that situation, 121 00:06:04,935 --> 00:06:07,113 not only are we using the median, 122 00:06:07,113 --> 00:06:10,778 but the interquartile range is once again more robust. 123 00:06:10,778 --> 00:06:13,056 How do we calculate the interquartile range? 124 00:06:13,056 --> 00:06:15,325 Well, you take the median and then you take the bottom 125 00:06:15,325 --> 00:06:18,978 group of numbers and calculate the median of those. 
126 00:06:18,978 --> 00:06:21,947 So that's 50 right over here, and then you take the top 127 00:06:21,947 --> 00:06:24,880 group of numbers, the upper group of numbers, 128 00:06:24,880 --> 00:06:28,931 and the median there is halfway between 60 and 75, so it's 67.5. 129 00:06:28,931 --> 00:06:30,914 If this looks unfamiliar we have many videos 130 00:06:30,914 --> 00:06:32,828 on interquartile range and calculating 131 00:06:32,828 --> 00:06:34,512 standard deviation and median and mean. 132 00:06:34,512 --> 00:06:35,890 This is just a little bit of a review, 133 00:06:35,890 --> 00:06:39,208 and then the difference between these two is 17.5, 134 00:06:39,208 --> 00:06:43,185 and notice, this distance between these two, this 17.5, 135 00:06:43,185 --> 00:06:44,908 this isn't going to change, 136 00:06:44,908 --> 00:06:48,203 even if this is 250 billion dollars. 137 00:06:48,203 --> 00:06:51,972 So once again, both of these measures are more robust 138 00:06:51,972 --> 00:06:54,639 when you have a skewed data set. 139 00:06:56,064 --> 00:06:59,410 So the big takeaway here is mean and standard deviation, 140 00:06:59,410 --> 00:07:02,232 they're not bad if you have a roughly symmetric data set. 141 00:07:02,232 --> 00:07:05,050 If you don't have any significant outliers, 142 00:07:05,050 --> 00:07:07,193 things that really skew the data set, 143 00:07:07,193 --> 00:07:10,278 mean and standard deviation can be quite solid. 
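Here is a rough Python sketch of both spread measures for the same data, again assuming salaries in thousands. The helper name `iqr_excluding_median` is made up for this illustration. It follows the method described above (median of the bottom group, median of the top group, leaving the overall median out when the count is odd) and assumes at least two values; other quartile conventions exist and can give slightly different answers.

```python
import statistics

def iqr_excluding_median(values):
    """Return (q1, q3, iqr) using the 'median of each half' method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # bottom group of numbers
    upper = ordered[-half:]   # top group of numbers
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

# The nine salaries from the video, in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

# Population standard deviation: square root of the average
# squared distance from the mean (dividing by 9, not 8).
pop_sd = statistics.pstdev(salaries)
print(round(pop_sd, 1))  # roughly 62.3, pulled up by the 250

q1, q3, iqr = iqr_excluding_median(salaries)
print(q1, q3, iqr)  # 50.0 67.5 17.5

# Make the top salary 250 billion dollars (250,000,000 thousand):
# the quartiles, and so the IQR, do not move.
extreme = salaries[:-1] + [250_000_000]
q1_x, q3_x, iqr_x = iqr_excluding_median(extreme)
print(iqr_x)  # 17.5
```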
144 00:07:10,278 --> 00:07:12,585 But if you're looking at something that could get really 145 00:07:12,585 --> 00:07:15,828 skewed by a handful of data points, 146 00:07:15,828 --> 00:07:19,090 median and interquartile range might be better: median for central tendency, 147 00:07:19,090 --> 00:07:23,489 interquartile range for spread around that central tendency, 148 00:07:23,489 --> 00:07:26,313 and that's why you'll see when people talk about salaries 149 00:07:26,313 --> 00:07:28,285 they'll often talk about median, because you can have 150 00:07:28,285 --> 00:07:30,262 some skewed salaries, especially on the upside. 151 00:07:30,262 --> 00:07:32,407 When we talk about things like home prices you'll see 152 00:07:32,407 --> 00:07:35,474 median reported more often than mean, 153 00:07:35,474 --> 00:07:38,998 because in a neighborhood, 154 00:07:38,998 --> 00:07:42,256 or in a city, a lot of the houses might be in the $200,000 155 00:07:42,256 --> 00:07:45,629 to $300,000 range, but maybe there's one ginormous mansion 156 00:07:45,629 --> 00:07:48,864 that is 100 million dollars, and if you calculated mean 157 00:07:48,864 --> 00:07:51,850 that would skew it and give a false impression of the average 158 00:07:51,850 --> 00:07:55,767 or the central tendency of prices in that city.