-
In this video I'm going to talk with you about how to describe a distribution.
-
I said one of the learning outcomes for this week is that we can describe the distribution of a variable both numerically and graphically.
-
And we saw some of that in the descriptive statistics in the previous video.
-
We're going to go deeper into that in this video. The learning outcomes for this particular video are to be able to provide
-
numerical descriptions of the distribution and visualize the distribution of a variable.
-
So when we ask, how is a variable distributed,
-
we want to break that down into four specific questions.
-
First, what is the average value: mean or median, whichever is appropriate?
-
Second, how spread out is the data? These two things really give us a lot.
-
Where is the data, and how spread out is it? Those tell us a lot about it.
-
But then we want to look at: is it skewed? Do values tend high,
-
or do they tend low, with respect to that average value?
-
And then what does the data actually look like visually?
-
So for the numeric descriptions, the previous video's descriptive statistics give us quite a few.
-
The describe method gives us a quick way to generate several of them.
-
So I generated ten thousand random numbers, and they've got a mean and a median both of about five.
-
Their spread, the standard deviation, is about one.
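
As a rough sketch of that step: the transcript doesn't say which distribution the numbers were drawn from, so this assumes a normal distribution with mean 5 and standard deviation 1. The seed and the variable names are also just for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: normally distributed values; the seed is arbitrary.
rng = np.random.default_rng(20200831)
numbers = pd.Series(rng.normal(loc=5, scale=1, size=10_000))

# describe() reports count, mean, std, min, the quartiles, and max.
# The "50%" row is the median.
summary = numbers.describe()
print(summary)
```

The mean and the "50%" row should both land close to five, and std close to one.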
-
We can visualize these numbers in what we call a histogram, and a histogram shows how common different values are.
-
The x axis is the values themselves, and the y axis is how frequent that value is.
-
It can be either a count, the number of observations that are about that value, or it can be relative:
-
the percent or fraction of observations. This is a count one. We can see
-
the y axis goes up to twelve hundred. When we have a continuous variable, the number of bins controls the division points.
-
So I said bins equals twenty five. That divides
-
the range of the variable, as we've observed it, into twenty five bins of equal width.
-
And so for each bin it's showing how many values fall into that bin.
-
So the largest bin is just over twelve hundred, centered right at five.
-
This data is also symmetrical. There's no skew to it.
-
There's a little bit of variation between the two sides, just from the way things fall into the bins.
-
But there's no significant skew. It's evenly distributed around the mean, the mean and the median are equal reflecting that.
-
This function, the hist function, comes from the library matplotlib.
-
So the import for that is import matplotlib.pyplot. The typical name for that is plt.
-
And that gives us functions to do plotting. Matplotlib is one of the fundamental plotting libraries for Python.
-
A lot of other plotting libraries, such as Seaborn, plotnine, et cetera, are built on top of matplotlib.
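
A minimal sketch of a histogram like the one described, again assuming normally distributed stand-in data since the original numbers aren't in the transcript. The axis labels are my own choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumption: stand-in data, normal with mean 5 and standard deviation 1.
rng = np.random.default_rng(42)
numbers = rng.normal(loc=5, scale=1, size=10_000)

fig, ax = plt.subplots()
# bins=25 splits the observed range (min to max) into 25 equal-width bins.
# hist returns the count in each bin and the 26 bin edges.
counts, edges, _ = ax.hist(numbers, bins=25)
ax.set_xlabel("value")
ax.set_ylabel("count")
```

In a notebook the figure displays on its own; in a script you would call plt.show() or fig.savefig(...) afterwards.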
-
So if we want to look at some real data, we can look at the average rating for movies.
-
So remember, we have this movie data set. We've computed each movie's average rating.
-
Well, how are those average ratings distributed? So, the mean is three point oh seven, the median is three point one five.
-
There's a little bit of what we call a left skew. The direction of the skew is the direction in which
-
the longer tail goes. So left skew means we have a longer tail on the left and the data is more bunched up on the right.
-
We have a longer tail on the left of the histogram. We can see that. That gives us a slight left skew.
-
The mean is slightly less than the median. That's also an indicator of left skew.
-
But it's not very skewed. The mean and the median are pretty close in this case.
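
To make the mean-versus-median check concrete, here is a small sketch on made-up data, not the movie data. It builds one left-skewed and one right-skewed sample and compares the mean, the median, and pandas' sample skewness from Series.skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical left-skewed data: 5 minus an exponential has its long tail on the left.
left = pd.Series(5 - rng.exponential(scale=1.0, size=10_000))
# Mirror case: a plain exponential has its long tail on the right.
right = pd.Series(rng.exponential(scale=1.0, size=10_000))

for name, s in [("left", left), ("right", right)]:
    # Left skew: mean < median, negative skewness. Right skew: the reverse.
    print(name, "mean:", s.mean(), "median:", s.median(), "skew:", s.skew())
```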
-
We can also look at the distribution of the movie rating count.
-
So for each movie, we've got these two variables: its average rating and the number of people who rated it.
-
And so we can look at the distribution of the movie rating count. This distribution is very, very heavily skewed.
-
A very strong right skew. The mean is much greater than the median: four twenty three versus six.
-
This means most movies are going to have far fewer ratings than the mean.
-
The histogram shows us that it's skewed and shows us this huge spike at the small values,
-
but it's hard to really see what's happening with the distribution here.
-
So there's an alternate way that we can plot the distribution.
-
The word histogram just means the tabulation of how frequent each value is.
-
So we can compute the frequency of each value.
-
And the pandas method value_counts does that. What it computes
-
is a series where the index is the values of the original series, and the value is the number of times that value appeared.
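
A tiny illustration of value_counts on made-up numbers (not the movie data):

```python
import pandas as pd

# Hypothetical "number of ratings" for ten movies.
rating_counts = pd.Series([1, 1, 1, 2, 2, 5, 1, 3, 2, 1])

# The index holds the distinct values; the values hold how often each appeared.
# Results are sorted from most to least frequent by default.
freq = rating_counts.value_counts()
print(freq)
```

Here the value 1 appears five times and 2 appears three times, so 1 comes first in the result.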
-
And so if we do that, we can then make a scatterplot, and a scatterplot is where we plot some value on the x versus some value on the y.
-
The index is an array, so our x axis in the scatterplot is the index, which is the actual values themselves.
-
And the y axis is the values in the series, which is how many times each value appeared in the original array.
-
And then we're going to do something else: we're going to rescale it using what we call a log scale.
-
So you'll notice the x and the y axes are labeled ten to the zero, ten to the one, ten to the two, ten to the three.
-
Rather than the values being evenly spaced,
-
the logarithms are evenly spaced. And effectively, since we're using base ten logarithms here,
-
what this does is it shows us, rather than the values themselves, the order of magnitude of each of the values.
-
And so we can see one, ten to the zero, is the mode.
-
It's the most frequent one; that's the top point. And also we can see that the first part of it is almost a line.
-
So we have a plot like this where the x axis is our value, the number of ratings,
-
and the y axis is how frequent that value is:
-
in this case, how many movies have that many ratings. And it's on a log-log scale.
-
When we see a line on a log-log scale in this kind of chart, that indicates that what we're looking at is
-
what's called a power law distribution, or something that's close to a power law distribution.
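
Here is a sketch of that log-log scatterplot. The movie data isn't available here, so it uses Zipf-distributed random integers as a hypothetical stand-in for "number of ratings per movie"; the exponent 2.0 and all the names are assumptions for illustration. A power law, frequency = C * x^(-a), turns into log(frequency) = log(C) - a * log(x), which is why it shows up as a straight line with slope -a on log-log axes.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for ratings-per-movie: Zipf (power law) integers.
n_ratings = pd.Series(rng.zipf(a=2.0, size=50_000))

# Index: number of ratings. Value: how many "movies" have that many.
freq = n_ratings.value_counts()

fig, ax = plt.subplots()
ax.scatter(freq.index, freq.values)
ax.set_xscale("log")  # base-10 log scale by default
ax.set_yscale("log")
ax.set_xlabel("number of ratings")
ax.set_ylabel("number of movies")

# Fit a line to the first part of the plot in log-log space.
head = freq[freq.index <= 10]
slope, intercept = np.polyfit(
    np.log10(head.index.to_numpy()), np.log10(head.to_numpy()), 1
)
print("log-log slope over the first ten values:", slope)
```

With this stand-in data the fitted slope should come out near -2, matching the exponent used to generate it.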
-
And this is a common distribution that arises when we're talking about the popularity of various results of human activity.
-
It shows up in how frequently different words are used. It shows up in a lot of different human activity contexts.
-
If you look at, say, a social network. And you look at the popularity of different accounts.
-
You have accounts like Lady Gaga and Beyoncé that have very, very popular Twitter accounts.
-
I have a moderately popular but much less popular Twitter account.
-
And a lot of people are down around 100 or 200 followers.
-
It's very common for distributions of that kind of activity to look like this.
-
And this kind of a chart, where we have the scatterplot with log x and y axes, is a good way
-
to see it and a good way to get a handle on what the data is actually looking like.
-
And we have this strong power law skew. So here's one of the artifacts of this. I'm replotting our mean ratings here, except with more bins,
-
50 bins instead of the default 10. We see the same basic shape.
-
We also see that a few values are much, much more common. Those values are one, two, three, four, five, two point five, three point five.
-
These are exact rating values. Three point five is an exact rating value:
-
you can rate a movie three point five. You can't rate a movie three point seven eight.
-
And so here's the reason: look at how many movies have one rating
-
or two ratings. And if a movie only has one rating, its mean rating is going to be that rating.
-
And so, since one rating is by far the most common popularity level of a movie,
-
we're going to have a lot of movies where their mean is exactly one of the possible rating values, like three.
-
And so we see these spikes here in the distribution when we look at it in a more fine grained way,
-
just because there are so many movies that don't have very many ratings.
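
A small made-up example of why that happens. The ratings table below is hypothetical, and it assumes ratings come in half-star steps from 0.5 to 5.0, which fits the values mentioned above but isn't stated outright.

```python
import pandas as pd

# Hypothetical ratings: movies A and B have one rating each, C has three, D has two.
ratings = pd.DataFrame({
    "movie": ["A", "B", "C", "C", "C", "D", "D"],
    "rating": [3.5, 4.0, 3.0, 4.5, 4.0, 2.0, 3.5],
})

# Per-movie mean rating and number of ratings.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])

# Assumption: allowed rating values are 0.5, 1.0, ..., 5.0.
allowed = [x / 2 for x in range(1, 11)]
stats["exact"] = stats["mean"].isin(allowed)
print(stats)
```

The single-rating movies A and B land exactly on an allowed value, while C (about 3.83) and D (2.75) fall in between. With many single-rating movies, those exact values become spikes in a fine-grained histogram.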
-
So we've seen numerical distributions, whether they're continuous, whether they're integer or other kinds of count data.
-
For a categorical distribution, our go-to is what's called a bar chart.
-
And so I'm again using value_counts here, on the penguin dataset from the earlier video,
-
to count the number of penguins of each species.
-
And then I'm plotting a bar chart that's showing us the number of penguins of each species.
-
And we can see that the Adelie penguin is the most common here.
-
But the bar chart is going to be a really simple way to view the distribution of a categorical variable.
-
Seaborn, which is an additional library we're going to see later, provides really convenient ways to do this.
-
But here I'm showing you the raw matplotlib code, so that you can see how the chart actually gets generated.
-
It gets generated by counting, doing the value count, counting how many times each species appears.
-
And then we plot a bar chart whose x axis is the species and y axis is the number of times that species appears in the data set.
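
A sketch of that two-step recipe. The actual penguin data isn't loaded here, so the species column below is a small made-up stand-in with invented counts; only the species names come from the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the penguins' species column (counts are made up).
species = pd.Series(["Adelie"] * 5 + ["Gentoo"] * 4 + ["Chinstrap"] * 2)

# Step 1: count how many times each species appears.
spec_counts = species.value_counts()

# Step 2: bar chart with species on the x axis and the counts on the y axis.
fig, ax = plt.subplots()
bars = ax.bar(spec_counts.index, spec_counts.values)
ax.set_xlabel("species")
ax.set_ylabel("number of penguins")
```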