https:/.../5b6fb252-cd60-478b-aaf7-ad9000efd8b9-da4db524-f4a8-4f36-9570-ad9101277730.mp4?invocationId=7aa5eb4c-3009-ec11-a9e9-0a1a827ad0ec

  • 0:05 - 0:09
    In this video I'm going to talk with you about how to describe a distribution.
  • 0:09 - 0:18
    I said one of the learning outcomes for this week is that we can describe the distribution of a variable both numerically and graphically.
  • 0:18 - 0:24
    And we saw some of that in the descriptive statistics in the previous video.
  • 0:24 - 0:30
    We're going to go deeper into that in this video. So the learning outcomes for this particular video are to be able to provide
  • 0:30 - 0:35
    numerical descriptions of the distribution and visualize the distribution of a variable.
  • 0:35 - 0:41
    So when we ask, how is a variable distributed, there are a few things we want to know.
  • 0:41 - 0:46
    We want to break that down into four specific questions.
  • 0:46 - 0:50
    First, what is the average value: mean or median, whichever is appropriate?
  • 0:50 - 0:54
    How spread out is the data? These two things really give us a lot.
  • 0:54 - 0:57
    Where the data is and how spread out it is tell us a lot about it.
  • 0:57 - 1:03
    But then we want to look at: is it skewed? Do values tend high,
  • 1:03 - 1:08
    or do they tend low with respect to that average value?
  • 1:08 - 1:13
    And then what does the data actually look like visually?
  • 1:13 - 1:18
    So for the numeric descriptions, the previous video's descriptive statistics give us quite a few.
  • 1:18 - 1:22
    The describe method gives us a quick way to generate several of them.
  • 1:22 - 1:31
    So I generated ten thousand random numbers. And they've got a mean and a median, both of about five.
  • 1:31 - 1:35
    Their spread is a standard deviation of about one.
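A minimal sketch of that describe step. The video does not say how the ten thousand numbers were drawn, so the normal distribution, the seed, and the variable names here are assumptions chosen to give a mean near five and a standard deviation near one.

```python
import numpy as np
import pandas as pd

# Assumption: normally distributed values; the video only says "random numbers".
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=5.0, scale=1.0, size=10_000))

# describe() reports count, mean, std, min, quartiles (50% is the median), and max.
summary = values.describe()
print(summary)
```

The `50%` row of the output is the median, so the mean and median can be compared directly from this one table.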
  • 1:35 - 1:44
    We can visualize these numbers in what we call a histogram, and a histogram shows how common different values are.
  • 1:44 - 1:50
    The x axis is the values themselves, and the y axis is how frequent each value is.
  • 1:50 - 1:58
    It can be either a count, the number of observations that are about that value, or it can be relative:
  • 1:58 - 2:03
    the percent or fraction of observations. This is a count one. We can see it goes up to twelve;
  • 2:03 - 2:10
    the y axis goes up to twelve hundred. When we have a continuous variable, the number of bins controls the division points.
  • 2:10 - 2:15
    So I said bins equals twenty five. That divides
  • 2:15 - 2:22
    the range of the variable, as we've observed it, into twenty five bins of equal width.
  • 2:22 - 2:28
    And so for each bin it's showing how many values fall into that bin.
  • 2:28 - 2:33
    So the largest bin is just over twelve hundred, centered right at five.
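A sketch of a count histogram with twenty five equal-width bins. The data is synthetic (normal, mean five, standard deviation one, which is an assumption), so the exact bar heights will differ from the chart in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Assumption: synthetic stand-in for the video's ten thousand random numbers.
rng = np.random.default_rng(42)
values = rng.normal(loc=5.0, scale=1.0, size=10_000)

fig, ax = plt.subplots()
# bins=25 splits the observed range into 25 equal-width bins;
# hist returns the per-bin counts and the 26 bin edges.
counts, edges, _ = ax.hist(values, bins=25)
ax.set_xlabel("value")
ax.set_ylabel("count")

print(len(counts), "bins; tallest bin holds", int(counts.max()), "values")
```

Passing `density=True` to `hist` would switch the y axis from counts to a relative scale.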
  • 2:33 - 2:37
    This data is also symmetrical. There's no skew to it.
  • 2:37 - 2:42
    There's a little bit of variation in the shapes of the two sides, just from the way things fall into the bins.
  • 2:42 - 2:50
    But there's no significant skew. It's evenly distributed around the mean; the mean and the median are equal, reflecting that.
  • 2:50 - 2:55
    This function, the hist function, comes from the library matplotlib.
  • 2:55 - 3:01
    So the import for that is import matplotlib.pyplot. The typical name for that is plt.
  • 3:01 - 3:10
    And that gives us functions to do plotting. Matplotlib is one of the fundamental plotting libraries for Python.
  • 3:10 - 3:18
    A lot of other plotting libraries, such as Seaborn, plotnine, et cetera, are built on top of matplotlib.
  • 3:18 - 3:23
    So if we want to look at some real data, we can look at the average rating for movies.
  • 3:23 - 3:27
    So remember, we have this movie data set. We've computed each movie's average rating.
  • 3:27 - 3:36
    Well, how are those average ratings distributed? The mean is three point oh seven; the median is three point one five.
  • 3:36 - 3:43
    There's a little bit of what we call a left skew. The direction of the skew is the side where
  • 3:43 - 3:51
    the longer tail goes. So left skew means we have a longer tail on the left and the data is more bunched up on the right.
  • 3:51 - 3:56
    We have a longer tail on the left of the histogram. We can see that. That gives us a slight left skew.
  • 3:56 - 4:00
    The mean is slightly less than the median. That's also an indicator of left skew.
  • 4:00 - 4:05
    But it's not very skewed. The mean and the median are pretty close in this case.
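A small sketch of the mean-versus-median check for skew, using only the standard library. The exponential sample and its mirror image are made-up data for illustration, not the movie ratings.

```python
import random
import statistics

# Assumption: synthetic data. An exponential sample has a long right tail;
# mirroring it (10 - x) moves the long tail to the left.
rng = random.Random(0)
right_skewed = [rng.expovariate(0.5) for _ in range(10_000)]
left_skewed = [10.0 - x for x in right_skewed]

def mean_minus_median(data):
    """Positive suggests right skew, negative suggests left skew."""
    return statistics.mean(data) - statistics.median(data)

right_gap = mean_minus_median(right_skewed)
left_gap = mean_minus_median(left_skewed)
print(f"right-skewed sample: mean - median = {right_gap:.3f}")
print(f"left-skewed sample:  mean - median = {left_gap:.3f}")
```

The sign of the gap gives the direction of the skew, and a gap that is small relative to the spread matches the "not very skewed" case described here.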
  • 4:05 - 4:09
    We can also look at the movie rating count. So this is the distribution. So we have a variable.
  • 4:09 - 4:14
    So for each movie, we've got these two variables, its average rating and the number of people who rated it.
  • 4:14 - 4:22
    And so we can look at the distribution of the movie rating count. This distribution is very, very heavily skewed.
  • 4:22 - 4:30
    A very strong right skew. The mean is much greater than the median: four hundred twenty-three versus six.
  • 4:30 - 4:35
    This means most movies are going to have far fewer ratings than the mean. The
  • 4:35 - 4:40
    histogram shows us that it's skewed and shows us this huge spike at the small values.
  • 4:40 - 4:44
    But it's hard to really see what's happening with the distribution here.
  • 4:44 - 4:49
    So there's an alternate way that we can plot the distribution.
  • 4:49 - 4:54
    The word histogram just means the tabulation of how frequent each value is.
  • 4:54 - 4:59
    So we can compute the frequency of each value.
  • 4:59 - 5:05
    And the pandas method value_counts does that. And so what it computes
  • 5:05 - 5:13
    is a series, where the index is the values of the original series and the value is the number of times that value appeared.
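A short sketch of `value_counts` on a tiny made-up series (the numbers are hypothetical rating counts, not the movie data), showing that the result is indexed by the original values.

```python
import pandas as pd

# Hypothetical data: number of ratings for ten movies.
rating_counts = pd.Series([1, 1, 1, 2, 2, 5, 1, 3, 2, 1])

# Index = distinct original values; values = how many times each appeared.
# By default the result is sorted with the most frequent value first.
freq = rating_counts.value_counts()
print(freq)
```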
  • 5:13 - 5:21
    And so if we do that, we can then make a scatterplot, and a scatterplot is where we plot some value on the x versus some value on the y.
  • 5:21 - 5:27
    And the index is an array. So our x axis in the scatterplot is the index, which is the actual values themselves.
  • 5:27 - 5:36
    And the y axis is the value in the array, which is how many times that value appeared in the original array.
  • 5:36 - 5:41
    And then we're going to do something else: we're going to rescale it using what we call a log scale.
  • 5:41 - 5:47
    So you'll notice the x and the y axes go ten to the zero, ten to the one, ten to the two, ten to the three.
  • 5:47 - 5:50
    Rather than the values being evenly spaced,
  • 5:50 - 5:57
    the logarithms are evenly spaced. And effectively, since we're using base ten logarithms here,
  • 5:57 - 6:06
    what this does is show us, rather than the values themselves, the order of magnitude of each of the values.
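A sketch of the frequency scatterplot on log-scaled axes. The Zipf-distributed sample is an assumption standing in for the per-movie rating counts, which are not available here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: heavy-tailed synthetic counts in place of the movie rating counts.
rng = np.random.default_rng(1)
n_ratings = pd.Series(rng.zipf(a=2.0, size=20_000))

# Tabulate how many "movies" have each count.
freq = n_ratings.value_counts()

fig, ax = plt.subplots()
ax.scatter(freq.index, freq.values)  # x = the value, y = how often it appears
ax.set_xscale("log")                 # ticks at 10^0, 10^1, 10^2, ...
ax.set_yscale("log")
ax.set_xlabel("number of ratings")
ax.set_ylabel("number of movies")
```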
  • 6:06 - 6:11
    And so we can see one, ten to the zero, is the mode.
  • 6:11 - 6:22
    It's the most frequent one; that's the top point. And also we can see that the first part of it is almost a line.
  • 6:22 - 6:27
    And we have a plot like this where
  • 6:27 - 6:33
    the x axis is our value and the y axis is how frequent that value is,
  • 6:33 - 6:38
    in this case, how many movies have that many ratings. And it's on a
  • 6:38 - 6:50
    log-log scale. When we see a line on a log-log scale in this chart, that indicates that what we're looking at is
  • 6:50 - 6:56
    what's called a power law distribution, or something that's close to a power law distribution.
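One way to see why a straight line on log-log axes points to a power law. The symbols are not from the video and are introduced here as assumptions: $x$ is the value, $p(x)$ is how frequent it is, and $C$ and $\alpha$ are positive constants.

```latex
\begin{align}
p(x) &= C\,x^{-\alpha} \\
\log_{10} p(x) &= \log_{10} C - \alpha \log_{10} x
\end{align}
```

Taking logarithms of both axes turns the power law into a line with slope $-\alpha$ and intercept $\log_{10} C$.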
  • 6:56 - 7:03
    And this is a common distribution that arises when we're talking about the popularity of various results of human activity.
  • 7:03 - 7:10
    It shows up in how frequently different words are used. It shows up in a lot of different human activity contexts.
  • 7:10 - 7:16
    If you look at, say, a social network, and you look at the popularity of different accounts,
  • 7:16 - 7:22
    you have accounts like Lady Gaga and Beyoncé, who have very, very popular Twitter accounts.
  • 7:22 - 7:26
    I have a moderately popular but much less popular Twitter account.
  • 7:26 - 7:31
    And a lot of people are down around 100 or 200 followers.
  • 7:31 - 7:38
    It's very common for distributions of that kind of activity to look like this.
  • 7:38 - 7:44
    And this kind of chart, where we have the scatterplot with log x and y axes, is a good way
  • 7:44 - 7:47
    to see it and a good way to get a handle on what this data is actually looking like.
  • 7:47 - 7:55
    And we have this strong power law skew. So, one of the artifacts of this: I'm replotting our mean ratings here, except with more bins,
  • 7:55 - 8:00
    50 bins instead of the default 10. We see the same basic shape.
  • 8:00 - 8:08
    We also see that a few values are much, much more common. Those values are one, two, three, four, five, two point five, three point five.
  • 8:08 - 8:14
    And the reason for this: these are exact rating values. Three point five is an exact rating value.
  • 8:14 - 8:19
    You can rate a movie three point five. You can't rate a movie three point seven eight.
  • 8:19 - 8:25
    And so the reason is: look at how many movies have one rating
  • 8:25 - 8:32
    or two ratings. And if a movie only has one rating, its mean rating is going to be that rating.
  • 8:32 - 8:43
    And so, since one rating is by far the most common popularity level of a movie,
  • 8:43 - 8:50
    we're going to have a lot of movies where their mean is exactly one of the possible rating values, like three.
  • 8:50 - 8:56
    And so we see these spikes here in the distribution when we look at it in a more fine grained way,
  • 8:56 - 9:00
    just because there are so many movies that don't have very many ratings.
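A sketch of why those spikes appear, on a tiny made-up ratings table. The column names, movie labels, and the half-star rating grid are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ratings on a half-star scale: movies B and C have one rating each.
ratings = pd.DataFrame({
    "movie":  ["A", "A", "A", "B", "C", "D", "D"],
    "rating": [4.0, 3.5, 4.0, 3.5, 5.0, 2.0, 2.5],
})

# Per-movie mean rating and number of ratings.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])

# A mean sits on the half-star grid when twice the mean is a whole number.
stats["on_grid"] = (stats["mean"] * 2) % 1 == 0

# Movies with a single rating: their mean is exactly that rating.
single = stats[stats["count"] == 1]
print(stats)
```

Here the single-rating movies land exactly on allowed rating values, while the movies with several ratings average out to values in between.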
  • 9:00 - 9:08
    So we've seen numerical distributions, whether they're continuous, whether they're integer or other kinds of count data.
  • 9:08 - 9:13
    We've seen continuous distributions. For a categorical distribution, our go-to is what's called a bar chart.
  • 9:13 - 9:17
    And so I'm again using value_counts here to count the number of penguins.
  • 9:17 - 9:22
    So from the earlier video, the penguin dataset: count the number of penguins of each species.
  • 9:22 - 9:32
    And then I'm plotting a bar chart that's showing us the number of penguins of each species.
  • 9:32 - 9:35
    And we can see that the Adelie penguin is the most common here.
  • 9:35 - 9:44
    But the bar chart is going to be a really simple way to view the distribution of a categorical variable.
  • 9:44 - 9:49
    Seaborn, which is an additional library we're going to see later, provides really convenient ways to do this.
  • 9:49 - 9:58
    But here I'm showing you the raw matplotlib code, so that you can see how the chart actually gets generated.
  • 9:58 - 10:03
    It gets generated by doing the value count, counting how many times each species appears.
  • 10:03 - 10:17
    And then we plot a bar chart whose x axis is the species and y axis is the number of times that species appears in the data set.
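A sketch of that value-count-then-bar-chart step in plain matplotlib. The six-row species list is made up for illustration; it is not the penguin dataset from the earlier video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the species column of the penguin data.
species = pd.Series(
    ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"]
)

# Count how many times each species appears.
counts = species.value_counts()

fig, ax = plt.subplots()
# x positions are the species names, bar heights are the counts.
bars = ax.bar(counts.index, counts.values)
ax.set_xlabel("species")
ax.set_ylabel("number of penguins")
```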
Video Language:
English
Duration:
10:16