https:/.../5b6fb252-cd60-478b-aaf7-ad9000efd8b9-da4db524-f4a8-4f36-9570-ad9101277730.mp4?invocationId=7aa5eb4c-3009-ec11-a9e9-0a1a827ad0ec


0:05 - 0:09

In this video I'm going to talk with you about how to describe a distribution,
0:09 - 0:18

so I said one of the learning outcomes for this week is that we can describe the distribution of a variable both numerically and graphically.
0:18 - 0:24

And we saw some of that in the descriptive statistics in the previous video.
0:24 - 0:30

We're going to go deeper into that in this video. So the learning outcomes for this particular video are to be able to provide
0:30 - 0:35

numerical descriptions of the distribution and visualize the distribution of a variable.
0:35 - 0:41

So when we ask how a variable is distributed, there are a few things we want to know.
0:41 - 0:46

We want to break that down into four specific questions.
0:46 - 0:50

First, what is the average value: mean or median, whichever is appropriate?
0:50 - 0:54

How spread out is the data? These two things really give us a lot.
0:54 - 0:57

Where the data is and how spread out it is tell us a lot about it.
0:57 - 1:03

But then we want to look at whether it is skewed. Do values tend high?
1:03 - 1:08

Or do they tend low with respect to that average value?
1:08 - 1:13

And then what does the data actually look like visually?
1:13 - 1:18

So for the numeric descriptions, the previous video's descriptive statistics give us quite a few.
1:18 - 1:22

The describe method gives us a quick way to generate several of them.
1:22 - 1:31

So I generated ten thousand random numbers. And they've got a mean and a median, both of about five.
1:31 - 1:35

Their spread, excuse me, their standard deviation, is about one.
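A minimal sketch of numbers like the ones just described, assuming they were drawn from a normal distribution with mean 5 and standard deviation 1. The video does not say which generator was used, and the seed and variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: normally distributed values; the seed is arbitrary.
rng = np.random.default_rng(20200912)
numbers = pd.Series(rng.normal(loc=5, scale=1, size=10_000))

# describe() reports count, mean, std, min, the quartiles, and max.
# The "50%" row is the median.
summary = numbers.describe()
print(summary)
```

The mean, the "50%" row, and the std row should come out close to 5, 5, and 1 respectively, though the exact digits depend on the seed.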
1:35 - 1:44

We can visualize these numbers in what we call a histogram, and a histogram shows how common different values are.
1:44 - 1:50

The x axis is the values themselves, and the y axis is how frequent each value is.
1:50 - 1:58

It can be either a count, the number of observations that are about that value, or it can be relative:
1:58 - 2:03

the percent or fraction of observations. This is a count one. We can see
2:03 - 2:10

the y axis goes up to twelve hundred. When we have a continuous variable, the number of bins controls the division points.
2:10 - 2:15

So I said bins equals twenty-five. That divides
2:15 - 2:22

the range of the variable, as we've observed it, into twenty-five bins of equal width.
2:22 - 2:28

And so for each bin it's showing how many values fall into that bin.
2:28 - 2:33

So the largest bin is just over twelve hundred, centered right at five.
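The binning described above can be sketched as follows. The data is again an assumed normal sample, not the video's actual numbers; the point is what bins=25 does to the observed range.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: synthetic normal data with an arbitrary seed.
rng = np.random.default_rng(42)
numbers = rng.normal(loc=5, scale=1, size=10_000)

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn bar patches.
counts, edges, _ = ax.hist(numbers, bins=25)
ax.set_xlabel("value")
ax.set_ylabel("count")

# 25 bins need 26 edges, equally spaced between the smallest
# and largest observed values.
print(len(counts), len(edges))
print(np.diff(edges)[:3])  # every bin has the same width
```

Passing density=True to hist would switch the y axis from counts to a relative scale.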
2:33 - 2:37

This data is also symmetrical. There's no skew to it.
2:37 - 2:42

There's a little bit of variation in the shapes of the two sides, just from the way things fall into the bins.
2:42 - 2:50

But there's no significant skew. It's evenly distributed around the mean, and the mean and the median are equal, reflecting that.
2:50 - 2:55

This function, the hist function, comes from the library matplotlib.
2:55 - 3:01

So the import for that is import matplotlib.pyplot. The typical name for that is plt.
3:01 - 3:10

And that gives us functions to do plotting. Matplotlib is one of the fundamental plotting libraries for Python.
3:10 - 3:18

A lot of other plotting libraries, such as Seaborn, plotnine, et cetera, are built on top of matplotlib.
3:18 - 3:23

So if we want to look at some real data, we can look at the average rating for movies.
3:23 - 3:27

So remember, we have this movie data set. We've computed each movie's average rating.
3:27 - 3:36

Well, how are those average ratings distributed? So the mean is three point oh seven, the median is three point one five.
3:36 - 3:43

There's a little bit of what we call a left skew. The direction of the skew is the direction in which
3:43 - 3:51

the longer tail goes. So left skew means we have a longer tail on the left and the data is more bunched up on the right.
3:51 - 3:56

We have a longer tail on the left of the histogram. We can see that. That gives us a slight left skew.
3:56 - 4:00

The mean is slightly less than the median. That's also an indicator of left skew.
4:00 - 4:05

But it's not very skewed. The mean and the median are pretty close in this case.
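As a rough illustration of the mean-versus-median indicator, here is a sketch on made-up left-skewed data (5 minus an exponential draw, which is an assumption for demonstration and not the movie data):

```python
import numpy as np
import pandas as pd

# Assumption: synthetic data whose long tail points left.
rng = np.random.default_rng(0)
left_skewed = pd.Series(5 - rng.exponential(scale=1.0, size=10_000))

mean = left_skewed.mean()
median = left_skewed.median()
# Series.skew() is negative for a left skew and positive for a right skew.
skewness = left_skewed.skew()

print(f"mean={mean:.2f} median={median:.2f} skew={skewness:.2f}")
```

The long left tail pulls the mean below the median, and the skewness statistic comes out negative.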
4:05 - 4:09

We can also look at the movie rating count. So this is the distribution. So we have a variable.
4:09 - 4:14

So for each movie, we've got these two variables, its average rating and the number of people who rated it.
4:14 - 4:22

And so we can look at the distribution of the movie rating count. This distribution is very, very heavily skewed.
4:22 - 4:30

A very strong right skew. The mean is much greater than the median: four twenty-three versus six.
4:30 - 4:35

This means most movies are going to have far fewer ratings than the mean.
4:35 - 4:40

The histogram shows us that it's skewed and shows us this huge spike at the small values.
4:40 - 4:44

But it's hard to really see what's happening with the distribution here.
4:44 - 4:49

So there's an alternate way that we can plot the distribution.
4:49 - 4:54

The word histogram just means the tabulation of how frequent each value is.
4:54 - 4:59

So we can compute the frequency of each value.
4:59 - 5:05

And the pandas method value_counts does that. And so what it computes
5:05 - 5:13

is a series where the index is the values of the original series, and the value is the number of times that value appeared.
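A small sketch of what value_counts returns, using a made-up series rather than the movie data:

```python
import pandas as pd

# Hypothetical values: pretend each entry is one movie's rating count.
rating_counts = pd.Series([1, 1, 1, 2, 2, 5])

# The result's index holds the distinct original values; its values hold
# how many times each one appeared, sorted most frequent first.
freq = rating_counts.value_counts()
print(freq)
```

Here the value 1 appears three times, 2 appears twice, and 5 appears once.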
5:13 - 5:21

And so if we do that, we can then make a scatterplot, and a scatterplot is where we plot some value on the x axis versus some value on the y axis.
5:21 - 5:27

And the index is an array. So our x axis in the scatterplot is the index, which is the actual values themselves.
5:27 - 5:36

And the y axis is the values in the array, which is how many times that value appeared in the original array.
5:36 - 5:41

And then we're going to do something: we're going to rescale it using what we call a log scale.
5:41 - 5:47

So you'll notice the x and the y axes are labeled ten to the zero, ten to the one, ten to the two, ten to the three.
5:47 - 5:50

Rather than the values being evenly spaced,
5:50 - 5:57

the logarithms are evenly spaced. And effectively, since we're using base ten logarithms here,
5:57 - 6:06

what this does is it shows us, rather than the values themselves, the order of magnitude of each of the values.
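The value-count scatterplot on log axes can be sketched like this. The data is an assumption: Zipf-distributed integers stand in for the per-movie rating counts, and the names are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: synthetic heavy-tailed counts, not the real movie data.
rng = np.random.default_rng(1)
rating_counts = pd.Series(rng.zipf(a=2.0, size=10_000))

# Frequency of each distinct count value.
freq = rating_counts.value_counts()

fig, ax = plt.subplots()
# x: the values themselves (the index); y: how often each value appears.
ax.scatter(freq.index, freq.values)
# Log scales space the axes by order of magnitude.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("number of ratings")
ax.set_ylabel("number of movies")
```

In a script, calling plt.show() afterwards displays the figure; in a notebook it renders on its own.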
6:06 - 6:11

And so we can see one, ten to the zero, is the mode.
6:11 - 6:22

It's the most frequent one; that's the top point. And also we can see that the first part of it is almost a line.
6:22 - 6:27

And when we have a plot like this, where the x axis is the number of ratings,
6:27 - 6:33

excuse me, the x axis is our value and the y axis is how frequent that value is,
6:33 - 6:38

in this case, how many movies have that many ratings, and it's on a logarithmic,
6:38 - 6:50

on a log-log scale: when we see a line on a log-log scale in this chart, that indicates that what we're looking at is
6:50 - 6:56

what's called a power law distribution, or something that's close to a power law distribution.
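To see why a straight line on a log-log plot points to a power law, suppose the frequency f(x) of a value x follows a power law with some constant C and exponent α. Both symbols are introduced here for illustration and do not come from the video. Taking base ten logarithms of both sides gives:

```latex
f(x) = C\,x^{-\alpha}
\quad\Longrightarrow\quad
\log_{10} f(x) = \log_{10} C - \alpha \log_{10} x
```

So plotting log f(x) against log x gives a straight line with slope -α and intercept log C.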
6:56 - 7:03

And this is a common distribution that arises when we're talking about the popularity of various results of human activity.
7:03 - 7:10

It shows up in how frequently different words are used. It shows up in a lot of different human activity contexts.
7:10 - 7:16

If you look at, say, a social network, and you look at the popularity of different accounts,
7:16 - 7:22

you have people like Lady Gaga and Beyoncé, who have very, very popular Twitter accounts.
7:22 - 7:26

I have a moderately popular but much less popular Twitter account.
7:26 - 7:31

And a lot of people are down around 100 or 200 followers.
7:31 - 7:38

It's very common for distributions of that kind of activity to look like this.
7:38 - 7:44

And this kind of a chart, where we have the scatterplot with log x and y axes, is a good way
7:44 - 7:47

to see it and a good way to get a handle on what the data is actually looking like.
7:47 - 7:55

And we have this strong power law skew. So one of the artifacts of this: I'm replotting our mean ratings here, except with more bins,
7:55 - 8:00

50 bins instead of the default 10. We see the same basic shape.
8:00 - 8:08

We also see that a few values are much, much more common. Those values are one, two, three, four, five, two point five, three point five.
8:08 - 8:14

And the reason for this is that these are exact rating values. Three point five is an exact rating value.
8:14 - 8:19

You can rate a movie three point five. You can't rate a movie three point seven eight.
8:19 - 8:25

And so the reason is: look at how many movies have one rating
8:25 - 8:32

or two ratings. And if a movie only has one rating, its mean rating is going to be that rating.
8:32 - 8:43

And so, since one rating is by far the most common popularity level of a movie,
8:43 - 8:50

we're going to have a lot of movies where their mean is exactly one of the possible rating values, like three.
8:50 - 8:56

And so we see these spikes here in the distribution when we look at it in a more fine grained way,
8:56 - 9:00

just because there are so many movies that don't have very many ratings.
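A tiny sketch of why those spikes appear, using a made-up ratings table. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: movie "A" has one rating, movie "B" has three.
ratings = pd.DataFrame({
    "movie": ["A", "B", "B", "B"],
    "rating": [3.5, 4.0, 3.5, 4.0],
})

# Per-movie mean rating and number of ratings.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])
print(stats)
```

Movie A's mean is exactly 3.5, one of the allowed rating values, because it is just its single rating. Movie B's mean falls between allowed values.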
9:00 - 9:08

So we've seen numerical distributions, whether they're continuous, whether they're integer or other kinds of count data.
9:08 - 9:13

We've seen continuous distributions. For a categorical distribution, our go-to is what's called a bar chart.
9:13 - 9:17

And so I'm again using value counts here to count the number of penguins.
9:17 - 9:22

So from the earlier video's penguin dataset, I count the number of penguins of each species.
9:22 - 9:32

And then I'm plotting a bar chart that's showing us the number of penguins of each species.
9:32 - 9:35

And we can see that the Adelie penguin is the most common here.
9:35 - 9:44

But the bar chart is going to be a really simple way to view the distribution of a categorical variable.
9:44 - 9:49

Seaborn, which is an additional library we're going to see later, provides really convenient ways to do this.
9:49 - 9:58

But here I'm showing you the raw matplotlib code, so that you can see how the chart actually gets generated.
9:58 - 10:03

It gets generated by counting, doing the value count, counting how many times each species appears.
10:03 - 10:17

And then we plot a bar chart whose x axis is the species and y axis is the number of times that species appears in the data set.
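The two steps just described, counting and then plotting, might look like this in raw matplotlib. The species list is a tiny made-up sample, an assumption for illustration; the real penguin dataset has many more rows.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: a small hand-written sample standing in for the species column.
species = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]
)

# Step 1: count how many times each species appears.
spec_counts = species.value_counts()

# Step 2: bar chart with species on the x axis and counts on the y axis.
fig, ax = plt.subplots()
bars = ax.bar(spec_counts.index, spec_counts.values)
ax.set_xlabel("species")
ax.set_ylabel("number of penguins")
```

Because value_counts sorts by frequency, the most common species ends up as the leftmost bar.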

Title:: https:/.../5b6fb252-cd60-478b-aaf7-ad9000efd8b9-da4db524-f4a8-4f36-9570-ad9101277730.mp4?invocationId=7aa5eb4c-3009-ec11-a9e9-0a1a827ad0ec
Video Language:: English
Duration:: 10:16
