0:00:04.660,0:00:10.710 In this video I'm going to talk about descriptive statistics. Our learning outcomes are, first, to know what a statistic is. 0:00:10.710,0:00:18.100 We talk about statistics; what is a statistic? Second, to identify whether the mean or the median is more appropriate for summarizing some data. 0:00:18.100,0:00:22.750 And also to be able to describe both the central tendency and the spread of a data series. 0:00:22.750,0:00:27.640 So a statistic is a value that is computed from a collection of data. 0:00:27.640,0:00:33.160 And it often summarizes a variable, or more particularly summarizes observations of a variable. 0:00:33.160,0:00:38.080 You've probably seen means, medians, et cetera, before. Those are all examples of statistics. 0:00:38.080,0:00:43.030 In other contexts 0:00:43.030,0:00:45.550 we can get additional statistics, 0:00:45.550,0:00:55.090 but each one is a single value that summarizes the observations of a variable, and it becomes useful for a variety of things. 0:00:55.090,0:01:04.000 So when we have a variable, when we have a set of observations, there are a few different questions we want to ask of it. 0:01:04.000,0:01:11.050 The readings talk about some conceptual questions, but these are getting at something more direct: 0:01:11.050,0:01:15.010 how are the actual data values themselves laid out? 0:01:15.010,0:01:19.690 So one is: where is the variable centered? This is called a measure of central tendency. 0:01:19.690,0:01:23.770 If it's a numeric variable, it measures how large its values tend to be. 0:01:23.770,0:01:27.910 We also want to ask how spread out it is around that value. 0:01:27.910,0:01:38.350 So, ways to do this. The mean of a data series is the sum of the values divided by the number of values. 0:01:38.350,0:01:41.770 And so we have some data points here. 
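As a minimal sketch of that definition (sum divided by count), the Python below uses an illustrative list of ten scores. The individual values are an assumption: they are chosen to be consistent with the figures quoted in the lecture (ten players, 87 points in total, a top score of 45) and are not copied from the slide. The helper name `mean` is likewise just for illustration.

```python
# Illustrative scores: hypothetical values chosen to match the totals quoted
# in the lecture (ten players, 87 points overall); not taken from the slide.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


total = sum(scores)   # the "resource" being shared out
n = len(scores)       # the number of instances it is shared among

print(f"total={total}, n={n}, mean={mean(scores)}")
```

Read this way, the mean is a points-per-player figure: what each player would have if the total were spread evenly.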
0:01:41.770,0:01:54.070 These are the scores of each of the players on the Chicago Bulls in Game 6 of the 1998 NBA Finals. 0:01:54.070,0:01:57.520 So if we add up all of the values, we get 87. There are ten of them, 0:01:57.520,0:02:02.680 and we have a mean of 8.7. This is often informally called an average. 0:02:02.680,0:02:06.700 When someone says average, they're usually talking about the mean. 0:02:06.700,0:02:12.730 But average itself is not a very specific term. It just means one of these measures of central tendency. 0:02:12.730,0:02:19.450 We often use it in informal discussion, but when we want to be precise, 0:02:19.450,0:02:25.690 average is not a good enough term. We need something like mean. What the mean measures is: 0:02:25.690,0:02:29.440 if every instance had the same value, what would it be? 0:02:29.440,0:02:37.710 If the total is some resource or quantity and it was evenly distributed among all the instances, how much would each one get? 0:02:37.710,0:02:44.370 So it's a points-per-player kind of value. And if we think back to the question 0:02:44.370,0:02:49.440 that I told you to ask in the first week, when we have a statistic or a metric: 0:02:49.440,0:02:54.540 How do I change this? Or, if one direction is defined as better, how do I improve it? 0:02:54.540,0:03:02.220 So how do you move the mean? Well, the way you move it is to increase the total: score more points. 0:03:02.220,0:03:08.700 But crucially, it does not matter where the total is increased among your data points. 0:03:08.700,0:03:17.180 One value can get all of the increase in total and still produce an increase in the mean. 0:03:17.180,0:03:21.830 So we see a strong skew: there's an outlier here, Michael Jordan scored 45 points. 0:03:21.830,0:03:27.800 He could score 10 more points and that would have the same effect on the mean as every player scoring 
0:03:27.800,0:03:36.710 one more point. So, when we want to measure how spread out the values are, one measure is the standard deviation. 0:03:36.710,0:03:44.240 This is the sample standard deviation, and here is what it is. We take the mean; 0:03:44.240,0:03:54.050 x̄ (x-bar) is the mean from the previous slide. We subtract the mean from each value and then we square it. Squaring does two things. 0:03:54.050,0:03:58.910 One, it makes everything positive. We want to make all the values positive. 0:03:58.910,0:04:06.290 There are usually two ways to do that: take the absolute value or take the square. But squaring also emphasizes larger values more, 0:04:06.290,0:04:08.840 which is one of the reasons it's really useful. 0:04:08.840,0:04:19.070 A third reason squaring is useful is that, in a variety of contexts you'll see in future classes, particularly around machine learning, 0:04:19.070,0:04:24.530 it's really useful to have differentiable statistics, and you can take the derivative of the square. 0:04:24.530,0:04:31.850 If we didn't have this square root, we would get the sample variance, s² (s squared). 0:04:31.850,0:04:40.160 What this measures is, roughly, the mean squared distance from the mean. 0:04:40.160,0:04:45.530 If we just tried to measure the mean distance without squaring, the total signed distance from the mean is 0:04:45.530,0:04:50.270 zero, because we're subtracting the mean from every value. 0:04:50.270,0:04:58.230 You can push through the algebra of that; I'll leave it as an exercise for you in order to better understand the algebra of these statistics. 0:04:58.230,0:05:04.400 But the sum of a bunch of values, each minus their mean, is zero. 0:05:04.400,0:05:12.590 So we need the terms to be positive. And what this measures is: if the mean is the center, or the expectation, of our values, 0:05:12.590,0:05:19.220 how far away do values tend to be? If the variance or the standard deviation is small, 
0:05:19.220,0:05:24.740 that means the values are tightly clustered around the mean. 0:05:24.740,0:05:29.660 And if it's large, that means they're spread out quite a bit around the mean. 0:05:29.660,0:05:37.850 Taking the square root of the variance means that the result is back in the original units rather than the square of the units. 0:05:37.850,0:05:42.920 And we can see the standard deviation here. In the plot we have the mean, 0:05:42.920,0:05:56.320 and then we've got the standard deviation: the points I've plotted are one standard deviation away from the mean in each direction. We 0:05:56.320,0:06:03.580 see that it's pretty wide in this particular data set, because we have quite a bit of spread. 0:06:03.580,0:06:09.900 There's a number of values that are clustered over here. But then we've got this value over here. 0:06:09.900,0:06:19.680 So to compute these, we use some of the methods that we talked about in the group-and-aggregate video: mean, std (the standard deviation), and var (the variance). 0:06:19.680,0:06:27.840 One note is that the NumPy versions of standard deviation and variance compute a slightly different statistic: 0:06:27.840,0:06:33.750 the population standard deviation and variance instead of the sample standard deviation and variance. 0:06:33.750,0:06:39.480 You can change this by passing a ddof option to them when you call them. 0:06:39.480,0:06:42.910 The difference is that they divide by n instead of n minus one. 0:06:42.910,0:06:49.000 But when we're computing the standard deviation or the variance of a set of data that we've collected, 0:06:49.000,0:06:56.340 that's a set of observations of the things we care about, we generally are going to want the sample standard deviation or variance. 
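The sample-versus-population distinction can be sketched in plain Python. This is a hand-written illustration, not NumPy's or pandas' implementation: the `ddof` argument here only mirrors the option of the same name mentioned above (NumPy defaults to `ddof=0`, pandas to `ddof=1`), and the scores are the same illustrative values assumed to match the lecture's totals.

```python
import math
import statistics

# Illustrative scores (assumed values consistent with the lecture's figures).
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def variance(values, ddof=1):
    """Sum of squared deviations from the mean, divided by (n - ddof).

    ddof=1 gives the sample variance, ddof=0 the population variance.
    """
    n = len(values)
    center = sum(values) / n
    squared_deviations = [(x - center) ** 2 for x in values]
    return sum(squared_deviations) / (n - ddof)


def std(values, ddof=1):
    """Standard deviation: square root of the variance, in the original units."""
    return math.sqrt(variance(values, ddof))


# Without squaring, the signed deviations cancel out (up to float rounding).
center = sum(scores) / len(scores)
total_deviation = sum(x - center for x in scores)

sample_sd = std(scores, ddof=1)       # divides by n - 1
population_sd = std(scores, ddof=0)   # divides by n

print(f"total deviation ~ {total_deviation:.1e}")
print(f"sample sd = {sample_sd:.3f}, population sd = {population_sd:.3f}")
```

The standard library's `statistics.stdev` and `statistics.pstdev` compute the same two quantities and make a convenient cross-check.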
0:06:56.340,0:07:04.980 So outliers are particularly large or small values. They draw the mean towards them, and they also affect the standard deviation; 0:07:04.980,0:07:07.590 that's one of the reasons why the standard deviation was so large. 0:07:07.590,0:07:16.530 So we've got Michael Jordan's score over here, and that outlier pulls the mean quite a bit. 0:07:16.530,0:07:21.360 So the red line, again, is our mean. The blue line 0:07:21.360,0:07:27.240 is the mean we would compute if we didn't have Michael Jordan's score of 45, 0:07:27.240,0:07:35.010 which is so much larger than everybody else's. So we can see that one value is pulling the mean quite a bit to the right. 0:07:35.010,0:07:40.560 This is one of the downsides of using the mean. 0:07:40.560,0:07:47.850 So we have a heavily skewed distribution here. Skew is what we call it when 0:07:47.850,0:07:51.870 there's a lot of data in one place and a few values that are way off in one direction. 0:07:51.870,0:07:57.450 We can have outliers without skew. We can have some very large and very small outliers. 0:07:57.450,0:08:03.870 If they're relatively balanced, they actually don't affect the mean too much, because one pulls it way high and the other pulls it way low. 0:08:03.870,0:08:10.320 It's when the outliers tend to skew off in one direction that the mean starts to become a problem. 0:08:10.320,0:08:19.360 So the median is another way of measuring where values tend to be. The way we compute it is that we sort the values and pick the middle one. 0:08:19.360,0:08:26.610 If there's an even number of values, we take the mean of the middle two. So we have 0:08:26.610,0:08:32.010 ten values, so right in here is going to be the dividing line between the two halves. 0:08:32.010,0:08:35.940 The middle two values are 2 and 7. The mean of 2 and 7 is 4.5. 
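To make the outlier effect and the sort-and-pick-the-middle rule concrete, here is a small sketch. The scores are again illustrative values assumed to match the lecture's figures (total 87, top score 45, middle values 2 and 7), and the `median` helper is written out by hand only to show the steps; `statistics.median` does the same job.

```python
import statistics

# Illustrative scores (assumed values consistent with the lecture's figures).
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def median(values):
    """Sort the values and take the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Drop the single largest score to see how far it pulls the mean.
without_outlier = sorted(scores)[:-1]

mean_all = statistics.mean(scores)
mean_without = statistics.mean(without_outlier)

print(f"mean with outlier:    {mean_all:.2f}")
print(f"mean without outlier: {mean_without:.2f}")
print(f"median (all scores):  {median(scores)}")
```

With these values the mean drops from 8.7 to about 4.67 once the top score is removed, which lands close to the median of 4.5.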
0:08:35.940,0:08:41.820 So the median, x̃ (x-tilde), of our series is 4.5. 0:08:41.820,0:08:43.650 Now, how do we move the median? 0:08:43.650,0:08:50.610 Remember, we can move the mean by just scoring more points, even if Michael Jordan is the one who scores all of the additional points. 0:08:50.610,0:08:55.380 But the only way to increase the median is to increase the small values. 0:08:55.380,0:08:59.700 And similarly, you decrease the median by decreasing the large values. 0:08:59.700,0:09:06.620 And so no matter how many more points Michael Jordan scores, 0:09:06.620,0:09:18.410 we don't move the median. We can only move the median by having the players who scored fewer points score more points. 0:09:18.410,0:09:24.290 If the player with 7 scored another point, we would move the median just a little bit. 0:09:24.290,0:09:29.070 But primarily, the smallest values 0:09:29.070,0:09:36.740 need to increase in order to increase our median, or at least the values in the smaller half of the distribution or right at its middle. 0:09:36.740,0:09:41.600 One really common use for medians is when we're talking about income and wealth 0:09:41.600,0:09:47.330 statistics. Mean income for a region is rarely what gets reported. 0:09:47.330,0:09:55.970 What you usually report is the median income, because income usually has a skewed distribution. 0:09:55.970,0:10:01.280 A few people have a very large income. A lot more people have a significantly smaller income. 0:10:01.280,0:10:07.250 And the mean would be pretty high 0:10:07.250,0:10:16.540 because you have these large incomes. The median wage ends up reflecting the typical experience 0:10:16.540,0:10:26.730 of people in the population when we have skewed values. So the standard deviation is a mean-based measure of spread. 0:10:26.730,0:10:34.790 A 
measure of spread more connected to the median is the range, which we often want to compute in general: 0:10:34.790,0:10:40.580 the maximum minus the minimum. And then there's the interquartile range. 0:10:40.580,0:10:46.390 So the median is the 50th percentile, or the 0.5 quantile. 0:10:46.390,0:10:50.290 The interquartile range is the distance between the first and third quartiles, 0:10:50.290,0:10:59.140 the 0.25 and 0.75 quantiles. If you split the data in half at the median, the quartiles are the medians of the two halves. 0:10:59.140,0:11:05.800 And so in our data set, we've got it split into two halves. Our lower quartile is 0. 0:11:05.800,0:11:13.660 Our upper one is 8. So the interquartile range is 8. This gives you the width of the middle 50 percent of the data, 0:11:13.660,0:11:22.060 and so it gives you a measure of how spread out the data is that is robust to outliers, like the median is. 0:11:22.060,0:11:27.740 So if we want to, we can get a quick summary of a data series with the describe method. 0:11:27.740,0:11:30.800 The describe method returns its results; it does not print them, 0:11:30.800,0:11:40.550 which makes for a few subtle differences in what you're going to see when you run it in a notebook. 0:11:40.550,0:11:44.660 But it gives us the count. It gives us the mean and standard deviation, 0:11:44.660,0:11:48.110 so we have our mean-based measures of central tendency and spread, 0:11:48.110,0:12:00.440 and then it gives us the min and the max, and the quartile points, so that we can see the median. 0:12:00.440,0:12:06.800 This is going to be our median. And then we can also see the interquartile range, from 2 to 36. 
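The range and the split-at-the-median quartile rule can be sketched as follows. The scores are illustrative values assumed to match the lecture's figures, and `quartiles_by_halves` is a hypothetical helper name. Library routines that interpolate between data points (pandas' quantile-based summaries use linear interpolation by default) can report slightly different quartiles than this halves rule on small data sets.

```python
import statistics

# Illustrative scores (assumed values consistent with the lecture's figures).
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def quartiles_by_halves(values):
    """Split the sorted data at the median and take the median of each half.

    For an odd number of values the middle value is left out of both halves.
    Expects at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)


q1, q3 = quartiles_by_halves(scores)
iqr = q3 - q1                             # width of the middle 50% of the data
value_range = max(scores) - min(scores)   # maximum minus minimum

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range={value_range}")
```

Note how the range (45 here) is driven entirely by the single top score, while the interquartile range ignores it.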
0:12:06.800,0:12:08.990 This particular data set is a heavily, 0:12:08.990,0:12:16.430 heavily skewed data set. One way we can see that it's skewed is that the median is significantly less than the mean. 0:12:16.430,0:12:20.570 That's an indication of skew: the mean is pulled way up. 0:12:20.570,0:12:25.340 Also, if you look, the 75th percentile is 36, 0:12:25.340,0:12:30.100 but the mean is 423. So 0:12:30.100,0:12:38.740 if you pick a movie at random, it's very unlikely to have the mean number of ratings or more. 0:12:38.740,0:12:43.480 I don't know exactly what quantile 423 is going to be at, 0:12:43.480,0:12:47.290 but probably somewhere over 80 percent, possibly even over 90. 0:12:47.290,0:12:51.240 So, 80 to 90 percent of the movies: 0:12:51.240,0:13:01.010 this is the number of ratings per movie, and roughly 80 to 90 percent of the movies have fewer than the mean number of ratings. 0:13:01.010,0:13:05.870 It really emphasizes this difference between the mean and the median. 0:13:05.870,0:13:12.740 So how do you pick? The mean works well for centered values, where there are no excessively large or small values, especially ones skewed in one direction. 0:13:12.740,0:13:15.830 In that case the mean is going to be approximately equal to the median. 0:13:15.830,0:13:23.630 A lot of other computations, as well as our ability to try to predict future values, really depend on the mean. 0:13:23.630,0:13:31.510 And also the mean is the measure of central tendency such that, if we take the total deviation from it, we get zero. 0:13:31.510,0:13:37.420 Now, the median is significantly more robust to outliers. 0:13:37.420,0:13:40.960 And so if we're just trying to describe data and we have a strong skew, 0:13:40.960,0:13:51.580 with outliers on one side of the data, then it gives us a statistic that is not as strongly affected by them. 
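One way to see that robustness is to go back to the basketball example and ask what moves each statistic. The sketch below uses the same illustrative scores (assumed values consistent with the lecture's figures) and a hypothetical helper, `bump_max`, that gives extra points to the top scorer only.

```python
import statistics

# Illustrative scores (assumed values consistent with the lecture's figures).
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def bump_max(values, amount):
    """Return a copy of the data with `amount` added to the largest value only."""
    bumped = list(values)
    bumped[bumped.index(max(bumped))] += amount
    return bumped


star_scores_more = bump_max(scores, 10)         # top scorer gets 10 more points
everyone_scores_more = [s + 1 for s in scores]  # each player gets 1 more point

for label, data in [
    ("original", scores),
    ("top scorer +10", star_scores_more),
    ("everyone +1", everyone_scores_more),
]:
    print(f"{label:>15}: mean={statistics.mean(data):.1f}, "
          f"median={statistics.median(data)}")
```

Both changes add ten points to the total, so they raise the mean by the same amount, but only the change that touches the lower-scoring players shifts the median.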
0:13:51.580,0:13:53.560 And, as I indicated, 0:13:53.560,0:14:05.660 if we think back to the question of how do I change this value, when we're using the statistic as our evaluation criterion, as our target or our goal, 0:14:05.660,0:14:13.580 that becomes very important. If our goal is to raise the mean, say, to raise the mean number of ratings, 0:14:13.580,0:14:17.480 we could do that by just getting a bunch more ratings for the most popular movies. 0:14:17.480,0:14:23.870 But if our goal is to raise the median number of ratings, we can only do that by getting more people watching 0:14:23.870,0:14:32.130 and rating less popular movies. That becomes a huge difference. The median divides high from low: the median value 0:14:32.130,0:14:41.190 is such that if you picked a random observation, it's equally likely to be greater or less than the median. 0:14:41.190,0:14:47.560 That's not true for the mean. But the median doesn't tell you, on its own, how far away the values are, 0:14:47.560,0:14:52.830 so it's limited in its ability to generate predictions. 0:14:52.830,0:14:58.650 So when we think about which one we want to use, it really comes down to the question we want to answer, 0:14:58.650,0:15:02.580 and then also the other things we're going to use it for. So the mean is answering: 0:15:02.580,0:15:13.980 if we distributed the points equally, how many would each player have? And the median gets at the distribution of players around the value: 0:15:13.980,0:15:20.020 we want to find a value such that players are equally likely to have more or less. 0:15:20.020,0:15:26.620 Another one, quickly, is the mode: it's the most common value. It doesn't really work for continuous variables. 0:15:26.620,0:15:32.440 It's really, really useful for categorical variables. If you've got a categorical variable that has, like, three codes, 0:15:32.440,0:15:36.070 the mode tells you which one 
is the most common, which is a super, super valuable thing. 0:15:36.070,0:15:45.400 It's also useful for integers and any other discrete variable. So, to wrap up: the mean and the median describe where a value tends to be. 0:15:45.400,0:15:50.740 Standard deviation, variance, range, and interquartile range measure how spread out it is. 0:15:50.740,0:15:54.880 The mean is very computationally useful. We're going to need it a lot. 0:15:54.880,0:16:05.440 But it's very sensitive to outliers. The median and the median-based statistics like the IQR are more robust to outliers. 0:16:05.440,0:16:11.740 One of the things we're also going to see later on is that we can do data transformations 0:16:11.740,0:16:15.700 to get data to be less skewed, and then we compute the mean in the transformed 0:16:15.700,0:16:21.550 space, and we can wind up with methods that give us the computational 0:16:21.550,0:16:32.235 benefits of the mean without the outliers causing as many problems.
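The lecture does not say which transformation it has in mind, so the following is only a sketch of the general idea under an assumption: it uses `log(1 + x)` (a common choice for non-negative, right-skewed counts because it tolerates zeros), takes the mean in that transformed space, and maps the result back. The scores are the same illustrative values used to match the lecture's figures, and `log_space_mean` is a hypothetical helper name.

```python
import math
import statistics

# Illustrative scores (assumed values consistent with the lecture's figures).
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]


def log_space_mean(values):
    """Mean computed on log(1 + x), then mapped back to the original scale.

    Assumes non-negative inputs; log1p/expm1 are used so zeros are allowed.
    """
    transformed = [math.log1p(x) for x in values]
    return math.expm1(statistics.mean(transformed))


# Give the top scorer 10 extra points and compare how far each summary moves.
bumped = [55 if s == 45 else s for s in scores]

raw_shift = statistics.mean(bumped) - statistics.mean(scores)
log_shift = log_space_mean(bumped) - log_space_mean(scores)

print(f"raw mean:       {statistics.mean(scores):.2f} (shift {raw_shift:.2f})")
print(f"log-space mean: {log_space_mean(scores):.2f} (shift {log_shift:.2f})")
```

On these values the log-space mean moves far less than the raw mean when the outlier grows, while still being computed as a mean.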