

  • 0:05 - 0:11
    In this video I'm going to talk about descriptive statistics. Our learning outcomes are, first, to know what a statistic is.
  • 0:11 - 0:18
    We talk about statistics, but what is a statistic? Then, to identify whether the mean or the median is more appropriate for summarizing some data.
  • 0:18 - 0:23
    And also to be able to describe both the central tendency and the spread of a data series.
  • 0:23 - 0:28
    So a statistic is a value that is computed from a collection of data.
  • 0:28 - 0:33
    And it often summarizes a variable or particularly summarizes observations of a variable.
  • 0:33 - 0:38
    You've probably seen means, medians, et cetera, before. Those are all examples of statistics.
  • 0:38 - 0:43
    In other contexts, we can get additional statistics. But it's this one.
  • 0:43 - 0:46
    Excuse me. We can get additional statistics.
  • 0:46 - 0:55
    But a statistic is one value that summarizes the observations of a variable, and it becomes useful for a variety of things.
  • 0:55 - 1:04
    So when we have a variable, if we have a set of observations, there are a few different questions we want to ask of it.
  • 1:04 - 1:11
    And the readings talk about some conceptual questions, but these are getting at something more direct:
  • 1:11 - 1:15
    How are the actual data values themselves laid out?
  • 1:15 - 1:20
    So one is: where is the variable centered? This is called a measure of central tendency.
  • 1:20 - 1:24
    If it's a numeric variable, it measures how large its values tend to be.
  • 1:24 - 1:28
    We also want to ask how spread out it is around that value.
  • 1:28 - 1:38
    So, ways to do this. The mean of a data series is the sum of the values divided by the number of values.
  • 1:38 - 1:42
    And so we have some data points here.
  • 1:42 - 1:54
    These are the scores of each of the players on the Chicago Bulls in game six of the 1998 NBA Finals.
  • 1:54 - 1:58
    So we add up all of the values and we get 87. There are ten of them.
  • 1:58 - 2:03
    And we have a mean of 8.7. This is often informally called an average.
  • 2:03 - 2:07
    When someone says average, they're usually talking about the mean.
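A minimal sketch of the mean computation just described, in Python. The `scores` list is an assumption: ten illustrative values chosen to be consistent with the figures quoted in this video (ten scores totaling 87), not values read from the slide.

```python
# Hypothetical scores (an assumption, not read from the slide):
# ten values that total 87, matching the figures quoted in the video.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

total = sum(scores)    # add up all of the values
count = len(scores)    # how many values there are
mean = total / count   # the sum divided by the number of values

print(total, count, mean)
```

With these assumed values the total is 87, the count is 10, and the mean is 8.7.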
  • 2:07 - 2:13
    But average itself is not a very specific term. It just means one of these measures of central tendency.
  • 2:13 - 2:19
    And so we want to be precise about it. We often use "average" in informal discussion, but when we want to be precise,
  • 2:19 - 2:26
    average is not a good enough term. We need something like mean. What the mean measures is:
  • 2:26 - 2:29
    If every instance had the same value, what would it be?
  • 2:29 - 2:38
    If the total is some resource or quantity, and it was evenly distributed among all the instances, how much would each one get?
  • 2:38 - 2:44
    So it's a points-per-player kind of value. And if we think back to the question
  • 2:44 - 2:49
    that I told you to ask in the first week, when we have a statistic, when we have a metric:
  • 2:49 - 2:55
    How do I change this? Or if I have a definition that's defined as better, how do I improve it?
  • 2:55 - 3:02
    But how do you move this? Well, the way you move this is you increase the total score, more points.
  • 3:02 - 3:09
    But crucially, it does not matter where the total is increased among your data points.
  • 3:09 - 3:17
    One value can get all of the increase in total to produce an increase in the mean.
  • 3:17 - 3:22
    So we see a strong skew; there's an outlier here: Michael Jordan scored 45 points.
  • 3:22 - 3:28
    He could score 10 more points, and that would have the same effect on the mean as every player scoring
  • 3:28 - 3:37
    one more point. So, we want to measure how spread out the values are. One measure is the standard deviation.
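The point about moving the mean can be checked with a short sketch. The `scores` list is an assumption (illustrative values consistent with the totals quoted in the video), and `mean` is a helper defined here only for illustration.

```python
# Hypothetical scores (an assumption): ten values totaling 87, top score 45.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

# One player (the top scorer) gets all 10 extra points.
one_player_more = [scores[0] + 10] + scores[1:]

# Each of the ten players gets one extra point.
everyone_more = [s + 1 for s in scores]

print(mean(scores), mean(one_player_more), mean(everyone_more))
```

Both changes add 10 to the total, so both move the mean by the same amount, from 8.7 to 9.7.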
  • 3:37 - 3:44
    So this is the sample standard deviation, and what it is: we take the mean.
  • 3:44 - 3:54
    x-bar is the mean from the previous slide. We subtract the mean from each value and then we square it, and squaring does two things.
  • 3:54 - 3:59
    One, it makes everything positive. We want to make all the values positive.
  • 3:59 - 4:06
    There are usually two ways to do it: take the absolute value or take the square. But also, squaring emphasizes larger values
  • 4:06 - 4:09
    more. That's one of the reasons it's really useful.
  • 4:09 - 4:19
    A third reason squaring is useful is that, in a variety of contexts you'll see in future classes, particularly around machine learning,
  • 4:19 - 4:25
    it's really useful to have differentiable statistics, and you can take the derivative of the square.
  • 4:25 - 4:32
    But if we didn't have this square root, we would get the sample variance, s squared.
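Written out, the quantities described here would look roughly like this. The notation is an assumption, since the slide is not reproduced in the transcript: x_1, ..., x_n are the observations and x-bar is their mean.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
s = \sqrt{s^2}
```

Here s^2 is the sample variance and s, its square root, is the sample standard deviation.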
  • 4:32 - 4:40
    But what this measures is the mean squared distance from the mean.
  • 4:40 - 4:46
    If we just wanted to measure the mean distance without squaring, the total signed distance from the mean is
  • 4:46 - 4:50
    zero, because we're subtracting the mean from every value.
  • 4:50 - 4:58
    And if you push through the algebra of that, I'll leave that as an exercise for you in order to better understand the algebra of these statistics.
  • 4:58 - 5:04
    But the sum of a bunch of values, each minus the mean, is zero.
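A numerical check, not a substitute for the algebra: the sketch below computes the deviations from the mean, confirms they sum to zero (up to floating-point rounding), and then builds the sample standard deviation by hand. The `scores` list is an assumption, chosen to be consistent with the figures quoted in the video.

```python
import math
import statistics

# Hypothetical scores (an assumption): ten values totaling 87.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

n = len(scores)
x_bar = sum(scores) / n

# Subtract the mean from every value.
deviations = [x - x_bar for x in scores]
total_deviation = sum(deviations)  # zero, up to floating-point rounding

# Square the deviations, divide by n - 1, then take the square root.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)
sample_std = math.sqrt(sample_variance)

print(abs(total_deviation) < 1e-9)
print(sample_variance, sample_std)
print(statistics.stdev(scores))  # the standard library's sample standard deviation
```

The hand-computed value agrees with `statistics.stdev`, which also divides by n - 1.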
  • 5:04 - 5:13
    So we need it to be positive. And what this measures is: if the mean is the center, or the expectation, of our values,
  • 5:13 - 5:19
    how far away do values tend to be? If the variance or the standard deviation is small,
  • 5:19 - 5:25
    that means the values are tightly clustered around the mean.
  • 5:25 - 5:30
    And if it's large, that means they're spread out quite a bit around the mean.
  • 5:30 - 5:38
    Taking the square root of the variance means that the result is back in the original units rather than the square of the units.
  • 5:38 - 5:43
    And we can see here the standard deviation. So in what I've plotted here, we have the mean.
  • 5:43 - 5:56
    And then we've got the standard deviation. I've plotted these points that are one standard deviation away from the mean in each direction. We
  • 5:56 - 6:04
    see that it's pretty spread out in this particular data set, because we have quite a bit of spread.
  • 6:04 - 6:10
    There's a number of values that are clustered over here. But then we've got this value over here.
  • 6:10 - 6:20
    So to compute these, we use some of the methods that we talked about in the group and aggregate video: mean, std for the standard deviation, and var for the variance.
  • 6:20 - 6:28
    One note is that the NumPy versions of standard deviation and variance compute a slightly different statistic:
  • 6:28 - 6:34
    The population standard deviation and variance instead of the sample standard deviation and variance.
  • 6:34 - 6:39
    You can change this by passing a ddof option to them when you call them.
  • 6:39 - 6:43
    The difference is that they divide by N instead of N minus one.
  • 6:43 - 6:49
    But when we're computing the standard deviation or the variance of a set of data that we've collected,
  • 6:49 - 6:56
    that's a set of observations of the things we care about. We generally are going to want the sample standard deviation or variance.
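The difference between dividing by N and by N minus one can be seen in a small sketch. It uses only the Python standard library, so the ddof option itself is not exercised here; in NumPy, std and var default to ddof=0 (divide by N), and passing ddof=1 gives the sample version. The `scores` list is an assumption, consistent with the figures quoted in the video.

```python
import math
import statistics

# Hypothetical scores (an assumption): ten values totaling 87.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

n = len(scores)
x_bar = sum(scores) / n
sum_sq = sum((x - x_bar) ** 2 for x in scores)

population_std = math.sqrt(sum_sq / n)       # divide by N
sample_std = math.sqrt(sum_sq / (n - 1))     # divide by N - 1

print(population_std, statistics.pstdev(scores))
print(sample_std, statistics.stdev(scores))
```

The sample version is always a little larger, and the gap shrinks as the number of observations grows.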
  • 6:56 - 7:05
    So outliers are particularly large or small values and they draw the mean towards them and they also affect the standard deviation,
  • 7:05 - 7:08
    that's one of the reasons why the standard deviation was so large.
  • 7:08 - 7:17
    So we've got Michael Jordan's score over here, and that pulls the mean quite a bit.
  • 7:17 - 7:21
    So the red line, again, is our mean. The blue line,
  • 7:21 - 7:27
    that's the mean we would compute if we didn't have Michael Jordan's score of 45,
  • 7:27 - 7:35
    which is so much larger than everybody else's. So we can see that one value is pulling the mean quite a bit to the right.
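A sketch of the comparison shown by the two lines, the mean with and without the largest value. The `scores` list is an assumption (illustrative values consistent with the figures quoted in the video), so the second number is only indicative of what the plot shows.

```python
# Hypothetical scores (an assumption): ten values totaling 87, top score 45.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Drop the single largest value (the outlier) and recompute.
without_outlier = sorted(scores)[:-1]

mean_all = mean(scores)
mean_without = mean(without_outlier)

print(mean_all, mean_without)
```

With these assumed values, removing one score out of ten drops the mean from 8.7 to about 4.67.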
  • 7:35 - 7:41
    This is one of the downsides of using the mean.
  • 7:41 - 7:48
    So, when we have a heavily skewed distribution like this: skew is what we call it when
  • 7:48 - 7:52
    there's a lot of stuff in one place and there are a few values that are way off in one direction.
  • 7:52 - 7:57
    We can have outliers without skew. We can have some very large and very small outliers.
  • 7:57 - 8:04
    If they're relatively balanced, they actually don't affect the mean too much, because one pulls it way high and the other pulls it way low.
  • 8:04 - 8:10
    It's when the outliers tend to skew off in one direction that the mean starts to become a problem.
  • 8:10 - 8:19
    So the median is another way of measuring where values tend to be, and the way we compute it is that we sort the values and pick the middle one.
  • 8:19 - 8:27
    If there's an even number of values, we take the mean of the middle two. So we have...
  • 8:27 - 8:32
    So we have ten values, so right in here is going to be the dividing line between them.
  • 8:32 - 8:36
    It's between 2 and 7. The mean of 2 and 7 is 4.5.
  • 8:36 - 8:42
    So the median, x-tilde, of our series is 4.5.
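The sort-and-pick-the-middle procedure can be sketched as below. The `median` function is written out here only for illustration (the standard library's `statistics.median` does the same thing), and the `scores` list is an assumption consistent with the figures quoted in the video.

```python
import statistics

# Hypothetical scores (an assumption): ten values whose middle two are 2 and 7.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

def median(values):
    """Sort the values and pick the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of middle two

print(sorted(scores))
print(median(scores), statistics.median(scores))
```

With these assumed values the two middle entries are 2 and 7, so the median is 4.5.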
  • 8:42 - 8:44
    Now if we ask how do we move the median?
  • 8:44 - 8:51
    So remember, we can move the mean by just scoring more points, even if Michael Jordan is the one who scores all of the more points.
  • 8:51 - 8:55
    But the only way to increase the median is to increase the small values.
  • 8:55 - 9:00
    And similarly, you decrease the median by decreasing the large values.
  • 9:00 - 9:07
    And so no matter how many more points Michael Jordan scores,
  • 9:07 - 9:18
    we don't move the median. We can only move the median by having the players who scored fewer points score more points.
  • 9:18 - 9:24
    If the player who scored 7 scored another point, then we would move the median just a little bit.
  • 9:24 - 9:29
    But primarily, the smallest values
  • 9:29 - 9:37
    need to increase in order to increase our median, or at least the values in the smaller half of the distribution.
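A sketch of what does and does not move the median. All four lists are assumptions: the first is an illustrative set of scores consistent with the figures quoted in the video, and the others are hypothetical variations on it.

```python
import statistics

# Hypothetical scores (an assumption): ten values, top score 45.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

# The top scorer gets 10 more points: the middle of the sorted list is untouched.
top_scorer_more = [55, 15, 8, 8, 7, 2, 2, 0, 0, 0]

# The player with 7 scores one more point: the upper middle value shifts a little.
seven_scores_again = [45, 15, 8, 8, 8, 2, 2, 0, 0, 0]

# A player in the lower half goes from 2 to 6: the median moves more.
low_scorer_more = [45, 15, 8, 8, 7, 6, 2, 0, 0, 0]

for name, data in [
    ("original", scores),
    ("top scorer +10", top_scorer_more),
    ("seven +1", seven_scores_again),
    ("low scorer +4", low_scorer_more),
]:
    print(name, statistics.median(data))
```

With these assumed values the median stays at 4.5 in the second case, moves to 5.0 in the third, and to 6.5 in the fourth.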
  • 9:37 - 9:42
    One really common use for medians is when we're talking about income and wealth
  • 9:42 - 9:47
    statistics. Mean income for a region is almost never reported.
  • 9:47 - 9:56
    What you usually report is the median income, because income is usually a skewed distribution.
  • 9:56 - 10:01
    A few people have a very large income. A lot more people have a significantly smaller income.
  • 10:01 - 10:07
    And the mean would be pretty high,
  • 10:07 - 10:17
    because you have these large incomes. The median wage ends up reflecting the typical experience
  • 10:17 - 10:27
    of people in the population when we have skewed values. So, the standard deviation is a mean-based measure of spread.
  • 10:27 - 10:35
    A measure of spread more connected to the median is the range, which we often want to compute in general:
  • 10:35 - 10:41
    what's the maximum minus the minimum? And then there's the interquartile range.
  • 10:41 - 10:46
    So the median is the 50th percentile, or the 0.5 quantile.
  • 10:46 - 10:50
    The interquartile range is the distance between the first and third quartiles,
  • 10:50 - 10:59
    the 0.25 and 0.75 positions. If you split the data in half at the median, the quartiles are the medians of the two halves.
  • 10:59 - 11:06
    And so in our data set, we've got it split into two halves. Our lower quartile value is 0.
  • 11:06 - 11:14
    Our upper one is 8. So the interquartile range is 8. And this gives you the width of the middle 50 percent of the data.
  • 11:14 - 11:22
    And so it gives you a measure of how spread out the data is that is similarly robust to outliers, like the median is.
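A sketch of the range and of the interquartile range computed the way it is described here, as the medians of the two halves. The `scores` list is an assumption consistent with the figures quoted in the video. Library quantile functions often interpolate between values instead, so they can report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical scores (an assumption): ten values, quartiles 0 and 8.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

ordered = sorted(scores)
half = len(ordered) // 2

# Split at the median; with an odd count this leaves the middle value out of both halves.
lower_half = ordered[:half]
upper_half = ordered[-half:]

data_range = ordered[-1] - ordered[0]   # maximum minus minimum
q1 = statistics.median(lower_half)      # first quartile: median of the lower half
q3 = statistics.median(upper_half)      # third quartile: median of the upper half
iqr = q3 - q1                           # width of the middle 50 percent

print(data_range, q1, q3, iqr)
```

With these assumed values the range is 45, driven entirely by the largest score, while the interquartile range is 8.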
  • 11:22 - 11:28
    So if we want to, we can get a quick summary of a data series with the describe method.
  • 11:28 - 11:31
    The describe method returns its results; it does not print them,
  • 11:31 - 11:41
    which makes for a few subtle differences in what you're going to see when you run it in a notebook.
  • 11:41 - 11:45
    But it gives us the count. It gives us the mean and standard deviation.
  • 11:45 - 11:48
    We have our mean-based metrics of central tendency and spread,
  • 11:48 - 12:00
    and then it gives us the min, the max, and the points of the quartiles, so that we can see the median.
  • 12:00 - 12:07
    This is gonna be our median. And then we can also see the interquartile range, 2 to 36.
  • 12:07 - 12:09
    This particular data set is a heavily,
  • 12:09 - 12:16
    heavily skewed data set. One way we can see that it's skewed is that the median is significantly less than the mean.
  • 12:16 - 12:21
    That's evidence, that's an indication of skew: the mean is pulled way up.
  • 12:21 - 12:25
    Also, if you look, the 75th percentile is 36.
  • 12:25 - 12:30
    But the mean is 423. So.
  • 12:30 - 12:39
    If you pick a movie at random, it's very unlikely to have the mean number of ratings or more.
  • 12:39 - 12:43
    I don't know exactly what quantile 423 is going to be at,
  • 12:43 - 12:47
    but probably somewhere over 80 percent, possibly even over 90.
  • 12:47 - 12:51
    So, 80 to 90 percent of the movies.
  • 12:51 - 13:01
    So this is the number of ratings per movie. 80 to 90 percent of the movies have less than the mean number of ratings.
  • 13:01 - 13:06
    It really emphasizes this difference between the mean and the median.
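The movie-ratings data is not available in this transcript, so the sketch below illustrates the same effect on a different, assumed data set: the illustrative basketball scores used to match the figures quoted earlier in the video. It counts what share of the values fall below the mean and below the median.

```python
import statistics

# Hypothetical scores (an assumption), standing in for any right-skewed data.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

mean = statistics.mean(scores)
median = statistics.median(scores)

# Share of observations strictly below each measure of central tendency.
below_mean = sum(1 for s in scores if s < mean) / len(scores)
below_median = sum(1 for s in scores if s < median) / len(scores)

print(mean, median)
print(below_mean, below_median)
```

With these assumed values, 80 percent of the scores are below the mean, while exactly half are below the median.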
  • 13:06 - 13:13
    So how do you pick? The mean works well for centered values, where there are no excessively large or small values, especially skewed in one direction.
  • 13:13 - 13:16
    The mean is going to be approximately equal to the median.
  • 13:16 - 13:24
    A lot of other computations, as well as our ability to try to predict future values, really depend on the mean.
  • 13:24 - 13:32
    And also, the mean is the measure of central tendency such that, if we take the total deviation from it, we get zero.
  • 13:32 - 13:37
    Now, the median is significantly more robust to outliers.
  • 13:37 - 13:41
    And so if we're just trying to describe data and we have a strong skew,
  • 13:41 - 13:52
    if we have outliers on one side of the data, then it gives us a statistic that is not as strongly affected by them.
  • 13:52 - 13:54
    And, as I indicated,
  • 13:54 - 14:06
    if we think back to the question of how do I change this value, when we're using the statistic as our evaluation criterion, as our target or our goal,
  • 14:06 - 14:14
    that becomes very important, because if our goal is to raise the mean, say, the mean number of ratings,
  • 14:14 - 14:17
    we could do that by just getting a bunch more ratings for the most popular movies.
  • 14:17 - 14:24
    But if our goal is to raise the median number of ratings, we can only do that by getting more people watching
  • 14:24 - 14:32
    and rating less popular movies. That becomes a huge difference. The median divides high from low: the median value
  • 14:32 - 14:41
    is such that if you picked a random observation, it's equally likely to be greater or less than the median.
  • 14:41 - 14:48
    That's not true for the mean. But it doesn't tell you, like, how far away the values are on its own.
  • 14:48 - 14:53
    So it's limited in its ability to generate predictions.
  • 14:53 - 14:59
    So when we think about what we want to do, it really comes down to the question we want to answer,
  • 14:59 - 15:03
    and then also the other things we're going to use it for. So the mean is answering:
  • 15:03 - 15:14
    If we distributed the points equally, how many would each player have? And the median gets at the distribution of players around the value:
  • 15:14 - 15:20
    we want to find a value such that players are equally likely to have more or less.
  • 15:20 - 15:27
    Another one, quickly, is the mode: it's the most common value. It doesn't really work for continuous variables.
  • 15:27 - 15:32
    It's really, really useful for categorical variables. If you've got a categorical variable that has, like, three codes,
  • 15:32 - 15:36
    the mode, knowing which one is the most common, is a super, super valuable thing.
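A minimal sketch of finding the mode of a categorical variable with three codes. The `positions` list and its labels are hypothetical, made up purely for illustration.

```python
import statistics
from collections import Counter

# Hypothetical categorical data (an assumption): three possible codes.
positions = ["guard", "forward", "guard", "center", "forward", "guard"]

# Count how often each code occurs, then take the most common one.
counts = Counter(positions)
mode_value, mode_count = counts.most_common(1)[0]

print(counts)
print(mode_value, mode_count)
print(statistics.mode(positions))  # the standard library gives the same answer
```

Here "guard" occurs three times, more than any other code, so it is the mode.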
  • 15:36 - 15:45
    It's also useful for integers and any other discrete variable. So, to wrap up: the mean and the median describe where a value tends to be.
  • 15:45 - 15:51
    Standard deviation, variance, range, and interquartile range measure how spread out it is.
  • 15:51 - 15:55
    The mean is very computationally useful. We're going to need it a lot.
  • 15:55 - 16:05
    But it's very sensitive to outliers. The median and the median-based statistics like the IQR are more robust to outliers.
  • 16:05 - 16:12
    One of the things we're also going to see later on is that we can do data transformations
  • 16:12 - 16:16
    to get data to be less skewed, and then we compute the mean in the transformed
  • 16:16 - 16:22
    space, and we can wind up with methods that are going to give us the computational
  • 16:22 - 16:32
    benefits of the mean while also not having the outliers causing as many problems.
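A rough sketch of the idea of computing the mean in a transformed space. The video does not say which transformation will be used; the logarithm here (via `log1p`, so that zeros are allowed) is an assumption chosen as one common option, and the `scores` list is likewise an assumption consistent with the figures quoted earlier.

```python
import math

# Hypothetical scores (an assumption): right-skewed, with one large value.
scores = [45, 15, 8, 8, 7, 2, 2, 0, 0, 0]

raw_mean = sum(scores) / len(scores)

# Assumed transformation: log(1 + x), which compresses large values.
logged = [math.log1p(s) for s in scores]
mean_logged = sum(logged) / len(logged)

# Map the mean back to the original units with the inverse, exp(x) - 1.
back_transformed = math.expm1(mean_logged)

print(raw_mean, mean_logged, back_transformed)
```

With these assumed values, the back-transformed mean lands well below the raw mean of 8.7, because the one large score has much less pull after the transformation.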
Video Language:

English subtitles