0:00:04.660,0:00:10.710
This video I'm going to talk about, descriptive statistics are learning outcomes are first no, what a statistic is.

0:00:10.710,0:00:18.100
We talk about statistics. What is a statistic? Identify whether the mean or the median is more appropriate for summarizing some data.

0:00:18.100,0:00:22.750
And also to be able to strive both the central tendency and the spread of a data series.

0:00:22.750,0:00:27.640
So a statistic is a value that collect computed from a collection of data.

0:00:27.640,0:00:33.160
And it often summarizes a variable or particularly summarizes observations of a variable.

0:00:33.160,0:00:38.080
You've probably seen means medians, et cetera, before. Those are all examples of statistics.

0:00:38.080,0:00:43.030
In other contexts. We can get an additional statistic. But it's this one.

0:00:43.030,0:00:45.550
Excuse me. We can get additional statistics.

0:00:45.550,0:00:55.090
But if there's one value that summarizes the observations of a variable and it becomes useful for a variety of things.

0:00:55.090,0:01:04.000
So when we have a we have a variable. We have if we have a set of observations, there's a few different questions we want to ask of it.

0:01:04.000,0:01:11.050
And the readings talk about some conceptual questions, but these are getting at some just direct.

0:01:11.050,0:01:15.010
How are is the actual data values themselves laid out?

0:01:15.010,0:01:19.690
So one is where's the variables centered? This is called a measure of central tendency, a measure.

0:01:19.690,0:01:23.770
If it's a numeric variable, it measures how large is of value tend to be.

0:01:23.770,0:01:27.910
We also want to ask how spread out it is around that value.

0:01:27.910,0:01:38.350
So ways to do this. The mean of a data series is the sum divided by the number.

0:01:38.350,0:01:41.770
And so we have some data points here.

0:01:41.770,0:01:54.070
These are the scores of each of the players on the Chicago Bulls in the nineteen ninety eight game six of the NBA finals.

0:01:54.070,0:01:57.520
So we add up all of the values we're going to get. Eighty seven. There's ten of them.

0:01:57.520,0:02:02.680
And we we have a mean of eight point seven. This is often informally called an average.

0:02:02.680,0:02:06.700
When someone says average, they're usually talking about the mean.

0:02:06.700,0:02:12.730
But average itself is not a very specific term. It just means one of these measures of central tendency.

0:02:12.730,0:02:19.450
And so we want to be precise about. We often use it in informal discussion when we want to be precise.

0:02:19.450,0:02:25.690
Average is not a good enough term. We need something like mean, but the mean measures.

0:02:25.690,0:02:29.440
If every instance had the same value, what would it be?

0:02:29.440,0:02:37.710
If if the total is some resource or quantity, how much if it was evenly distributed among all the instances, how much would it be?

0:02:37.710,0:02:44.370
So all points per player kind of value. And if we go back, we think back to the question.

0:02:44.370,0:02:49.440
That I I told you to ask in the first week when we have a statistic. We have a metric.

0:02:49.440,0:02:54.540
How do I change this? Or if I have a definition that's defined as better, how do I improve it?

0:02:54.540,0:03:02.220
But how do you move this? Well, the way you move this is you increase the total score, more points.

0:03:02.220,0:03:08.700
But crucially, it does not matter where the total is increased among your data points.

0:03:08.700,0:03:17.180
One one value can get all of the increase in total to produce an increase in mean.

0:03:17.180,0:03:21.830
So we see a strong suit like there's an outlier here, Michael Jordan scored forty five points.

0:03:21.830,0:03:27.800
He could score 10 more points and that would have the same effect on the mean as every player scoring.

0:03:27.800,0:03:36.710
One more point. So. We want to measure how spread out the values are, one measure is the standard deviation.

0:03:36.710,0:03:44.240
So this is sample standard media. So the sample standard deviation and what it is, we take the mean.

0:03:44.240,0:03:54.050
X Bar is the main river from the previous slide. We subtract the mean from each value and then we square it and squaring does two things.

0:03:54.050,0:03:58.910
One. It makes it makes everything positive. We want to make all the values positive.

0:03:58.910,0:04:06.290
There's usually two ways to do it. Take the absolute value or take the square. But also squaring emphasizes larger values.

0:04:06.290,0:04:08.840
More one of the reasons it's really useful.

0:04:08.840,0:04:19.070
A third reason squaring is useful that does is that in a variety of contexts you'll see in future classes, particularly around machine learning.

0:04:19.070,0:04:24.530
It's really useful to have differentiable statistics and you can take the derivative of the square of the square.

0:04:24.530,0:04:31.850
But if we didn't have this square root, we would get the sample variance as squared.

0:04:31.850,0:04:40.160
But this is a measure of what this measures is the mean squared difference from distance from the mean.

0:04:40.160,0:04:45.530
If we just wanted to measure the mean distance, the total distance from the mean is.

0:04:45.530,0:04:50.270
Zero. Because we're subtracting the mean from every value.

0:04:50.270,0:04:58.230
And if you push through the algebra of that, I'll leave that as an exercise for you in order to better understand the algebra of these statistics.

0:04:58.230,0:05:04.400
But the sum of a bunch of values, minus the mean is zero.

0:05:04.400,0:05:12.590
So we needed to be positive. And what this measures is if the mean is the center or the expectation of our values.

0:05:12.590,0:05:19.220
How far away do values tend to be are if if it if the variance or the standard deviation is small?

0:05:19.220,0:05:24.740
That means the value is tightly clustered around the mean.

0:05:24.740,0:05:29.660
And if it's large, that means they're spread out quite a bit around the mean,

0:05:29.660,0:05:37.850
the state taking the square root of the variance means that the result is back in the original units rather than the square of the units.

0:05:37.850,0:05:42.920
And we can see here the standard DeeVee. So I've plotted here we have the mean.

0:05:42.920,0:05:56.320
And then we've got the standard deviation. I've plotted these points are one standard deviation away from the mean and each direction we

0:05:56.320,0:06:03.580
see that's pretty spread out in this particular data set because we have quite a bit of spread.

0:06:03.580,0:06:09.900
There's a number of values that are clustered over here. But then we've got this value over here.

0:06:09.900,0:06:19.680
So to compute these, we use some of the methods that we talked about in the group and aggregate video mean SDD is the standard deviation vare.

0:06:19.680,0:06:27.840
One note is that the number high versions of standard deviation and variance compute a slightly different statistic.

0:06:27.840,0:06:33.750
The population standard deviation and variance instead of the sample standard deviation and variance.

0:06:33.750,0:06:39.480
You can change this by passing a d o f option to them when you call them.

0:06:39.480,0:06:42.910
The difference is that they divide by N instead of N minus one.

0:06:42.910,0:06:49.000
And we'll but when we're computing the standard deviation or the variance of a set of data that we've collected,

0:06:49.000,0:06:56.340
that's a set of observations of the things we care about. We generally are going to want the sample standard deviation or variance.

0:06:56.340,0:07:04.980
So outliers are particularly large or small values and they draw the mean towards them and they also affect the standard deviation,

0:07:04.980,0:07:07.590
one of the reasons why the standard deviation was so large.

0:07:07.590,0:07:16.530
So we've got Michael Jordan's score over here and that that pulls the outlier or that pulls the mean quite a bit.

0:07:16.530,0:07:21.360
So the red the red line, again, is our mean, the blue line.

0:07:21.360,0:07:27.240
That's what the mean we would compute if we didn't have Michael Jordan's score of forty five.

0:07:27.240,0:07:35.010
That's so much larger than everybody else's. So we can see that one value is pulling the mean quite a bit to the right.

0:07:35.010,0:07:40.560
This is one of the downsides of using the mean.

0:07:40.560,0:07:47.850
So when we have a heavily skewed distribution like this, the skew is what we call it, when there's I mean,

0:07:47.850,0:07:51.870
there's a lot of stuff in one place and there's a few values that are way off in one direction.

0:07:51.870,0:07:57.450
We can have outliers without skew. We can have some very large and very small outliers.

0:07:57.450,0:08:03.870
If we if they're relatively balanced, they actually don't affect the mean too much because one pulls it way high, the other polls it way low.

0:08:03.870,0:08:10.320
It's when we were outliers tend to skew off in one direction that the mean starts to become a problem.

0:08:10.320,0:08:19.360
So the median is another way of measuring how where values tend to be and the way we computed that we sort the values and pick the middle one.

0:08:19.360,0:08:26.610
If there's an even number of values, we take the mean of the middle, too. So we have.

0:08:26.610,0:08:32.010
So we have five, ten values. So right in here is going to be the dividing line between them.

0:08:32.010,0:08:35.940
It's two and seven. The mean of two and seven is four point five.

0:08:35.940,0:08:41.820
So the median X tildy of our of our series is 4.5.

0:08:41.820,0:08:43.650
Now if we ask how do we move the median?

0:08:43.650,0:08:50.610
So remember, we can move the mean by just scoring more points, even if Michael Jordan is the one who scores all of the more points.

0:08:50.610,0:08:55.380
But the only way to increase the median is to increase the small values.

0:08:55.380,0:08:59.700
And similarly, you decrease the median by decreasing the large values.

0:08:59.700,0:09:06.620
And so no matter how many score points, more points, Michael Jordan scores.

0:09:06.620,0:09:18.410
We don't move the median. We can only move the median by having the players who scored the fewest points, scored more points and the.

0:09:18.410,0:09:24.290
We could have that that seven scored another point than we would. We would move the median just a little bit.

0:09:24.290,0:09:29.070
But primarily, we need to have the smallest values,

0:09:29.070,0:09:36.740
need to increase in order to increase our median and or at least the values in the smaller half of the distribution.

0:09:36.740,0:09:41.600
One can't really common use for medians as it is when we're talking about income and wealth.

0:09:41.600,0:09:47.330
Statistics mean income for a region is almost never reported.

0:09:47.330,0:09:55.970
What you report, you usually report the median income because income is usually a skewed distribution.

0:09:55.970,0:10:01.280
A few people have a very large income. A lot more people have a significantly smaller income.

0:10:01.280,0:10:07.250
And the mean would would be pretty high if the mean would be pretty high,

0:10:07.250,0:10:16.540
because you have these large incomes, the median wage ends up being reflecting the typical experience.

0:10:16.540,0:10:26.730
Of of people in the population when we have skewed values, so like standard deviation is a mean based measure of spread.

0:10:26.730,0:10:34.790
A. Measure of spread more connected to the median is the range which we often want to compute in general.

0:10:34.790,0:10:40.580
What's the maximum minus the minimum? And then the inter cortical range and the inter cortical range.

0:10:40.580,0:10:46.390
So the median is the point is the fiftieth percentile or the point five court quanti all.

0:10:46.390,0:10:50.290
The inter cortical range is the distance between the first and third quartiles.

0:10:50.290,0:10:59.140
The point, seven point two five point seventy five positions. If you split the data in half at the median, it's the medians of the two halves.

0:10:59.140,0:11:05.800
And so in our data set, we've got it split into two halves are lower, quartile is zero, value is zero.

0:11:05.800,0:11:13.660
Our upper one is eight. So the intercourse range is eight. And this gives you it gives you the width of the middle 50 percent of the data.

0:11:13.660,0:11:22.060
And so it gives you a measure of how spread out the data is. That is similarly robust outliers like the median is.

0:11:22.060,0:11:27.740
So if we want to, we can get a quick summary of a data series with the described method.

0:11:27.740,0:11:30.800
The described method prints its results, it does not return,

0:11:30.800,0:11:40.550
it makes for a few subtle differences in what you're doing in the room or what you're going to see in the room when you run it in a notebook.

0:11:40.550,0:11:44.660
But it gives us the count. It gives us the mean and standard deviation.

0:11:44.660,0:11:48.110
We have our main based metrics of central tendency and spread,

0:11:48.110,0:12:00.440
and then it gives us the minimum maximum and gives us the men the max and then the the points of off of the quartiles so that we can see the median.

0:12:00.440,0:12:06.800
This is gonna be our median. And then we can also see the inter cortical range, two to thirty six.

0:12:06.800,0:12:08.990
This particular data set, this is a heavily,

0:12:08.990,0:12:16.430
heavily skewed data set because one way we can see that it's skewed is that the median is significantly less than the mean.

0:12:16.430,0:12:20.570
That's evident. That's indication of scale. That mean is pulled way up.

0:12:20.570,0:12:25.340
Also, if you look at the seventy fifth percentile is thirty six.

0:12:25.340,0:12:30.100
But the mean is four hundred and twenty three. So.

0:12:30.100,0:12:38.740
If you pick a movie at random, it's very unlikely to have the mean or larger number of ratings.

0:12:38.740,0:12:43.480
I don't know exactly where. What what quantity of 423 is going to be at.

0:12:43.480,0:12:47.290
But probably somewhere over 80 percent, possibly even over 90.

0:12:47.290,0:12:51.240
So. 80, 90 percent of the movies.

0:12:51.240,0:13:01.010
So this is the number of ratings per movie. 80 to 90 percent of the movies have less than the mean number of ratings.

0:13:01.010,0:13:05.870
It's really emphasizes this difference between the mean and the median.

0:13:05.870,0:13:12.740
So how do you pick the mean works well for centered values. There's no excessively large or small values, especially skewed in one direction.

0:13:12.740,0:13:15.830
The mean is going to be approximately equal to the median.

0:13:15.830,0:13:23.630
A lot of other computations, as well as our ability to try to predict future values, really depend on the mean.

0:13:23.630,0:13:31.510
And also the mean is the central tendency so that if we take the total deviation from it, we get zero.

0:13:31.510,0:13:37.420
Now, the median is significantly more robust to outliers.

0:13:37.420,0:13:40.960
And so we're just trying to describe data and we have a strong skew.

0:13:40.960,0:13:51.580
We have outliers on one side of the data. Then we're it's of gives us a statistic that is not as strongly affected by them.

0:13:51.580,0:13:53.560
And if we think that as a as I indicated,

0:13:53.560,0:14:05.660
as we think back to the question of how do I change this value if we're using the statistic as our evaluation criteria, as our target or our goal?

0:14:05.660,0:14:13.580
That becomes very important because if our goal is to raise the mean, say, if our goal is to raise the main number of ratings,

0:14:13.580,0:14:17.480
we could do that by just getting a bunch of more ratings for the most popular movies.

0:14:17.480,0:14:23.870
But if our goal is to raise the median number of ratings, we can only do that by getting more people watching.

0:14:23.870,0:14:32.130
And rating less popular movies becomes a huge difference. It divides high into low the median value.

0:14:32.130,0:14:41.190
Is such that if you picked a random random observation, it's equally likely to be greater or less than the median.

0:14:41.190,0:14:47.560
That's not true for the mean. But it doesn't tell you, like, how far away the values are on its own.

0:14:47.560,0:14:52.830
So it's limited in its ability to generate predictions.

0:14:52.830,0:14:58.650
So when we when we think about what's when we want to do it really comes down to the question we want to answer.

0:14:58.650,0:15:02.580
And then also the other things are going to use it for. So the mean is answering.

0:15:02.580,0:15:13.980
If we distributed the points equally, how many would each player have? And the median gets to the distribution of players about the around the value.

0:15:13.980,0:15:20.020
We want to find one that players are equally likely to have more or less.

0:15:20.020,0:15:26.620
Another one, quick, is the mode, it's the most common value. It doesn't work for continuous, doesn't really work for continuous variables.

0:15:26.620,0:15:32.440
It's really, really useful for categorical variables. If you got a categorical variable that has like three codes.

0:15:32.440,0:15:36.070
The mode want to know which one. The common is super, super valuable thing.

0:15:36.070,0:15:45.400
It's also useful for integers in any other discrete variable. So wrap up the mean and the median, describe where a value tends to be stick.

0:15:45.400,0:15:50.740
Standard deviation, variance range and enter cortical range. Measure how spread out it is.

0:15:50.740,0:15:54.880
The mean is very computationally useful. We're going to need it a lot.

0:15:54.880,0:16:05.440
But it's very sensitive to outliers, median based. The median and the median based statistics like the ICU are are more robust outliers.

0:16:05.440,0:16:11.740
One of the things we're also going to see later on is we we can do data transformations

0:16:11.740,0:16:15.700
to get data to be less skewed and then we compute the mean and the transform

0:16:15.700,0:16:21.550
space and we can wind up with methods that are going to be give us the computational

0:16:21.550,0:16:32.235
benefits of the mean while also not having the outliers causing as many problems.