https:/.../4c5b7bbb-dc57-4484-b160-ad9000efd740-94330698-4916-4a28-9b48-ad9101273819.mp4?invocationId=8cd5deeb-2f09-ec11-a9e9-0a1a827ad0ec

  • 0:05 - 0:10
    In this video, I'm going to show you how to do grouping and aggregate operations in pandas. The learning
  • 0:10 - 0:15
    outcomes are for you to be able to compute aggregate values from a pandas series,
  • 0:15 - 0:22
    compute grouped aggregate values from a pandas data frame, and also be able to order a data frame and
  • 0:22 - 0:26
    pick the rows with the largest values for some series.
  • 0:26 - 0:32
    And then finally join two pandas data frames to get context for the results that we just computed in the first part.
  • 0:32 - 0:39
    So we have a data frame. This is the MovieLens data that we used in some of the earlier videos.
  • 0:39 - 0:48
    So we have the data frame, and we've got this ratings table that has the user ID, movie ID, rating, and timestamp columns.
  • 0:48 - 0:53
    It's twenty-five million rows by four columns. So, an aggregate:
  • 0:53 - 0:59
    suppose we want to ask the question, what is the mean rating? Of all the rating values users have ever given,
  • 0:59 - 1:06
    what's the mean value? And this is the code we would use to do this.
  • 1:06 - 1:12
    And there are a few pieces.
  • 1:12 - 1:26
    We're using this data frame,
  • 1:26 - 1:32
    and we're then selecting a column.
  • 1:32 - 1:50
    Remember, this is how we select a column. And then the result of that operation, this whole operation here, is a series.
  • 1:50 - 1:58
    And so then we call the mean method on the series and we get the mean rating.
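
The select-then-aggregate step described above can be sketched as follows. This is a minimal illustration, not the notebook from the video: the tiny `ratings` frame and its values are made up, and the column names `userId`, `movieId`, `rating`, and `timestamp` are assumed to follow the MovieLens convention.

```python
import pandas as pd

# Tiny stand-in for the MovieLens ratings table (made-up values,
# assumed column names).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.0, 5.0, 2.0],
    "timestamp": [100, 101, 102, 103],
})

# Selecting one column with square brackets yields a Series.
rating_series = ratings["rating"]

# Calling an aggregate method on the Series reduces it to a single value.
mean_rating = rating_series.mean()
print(mean_rating)
```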
  • 1:58 - 2:05
    Think a little bit about the previous video, and take a moment to think about what the conceptual problem here is. It's a common operation,
  • 2:05 - 2:09
    but there is a little conceptual problem with it in terms of what it actually means.
  • 2:09 - 2:14
    There's a variety of different aggregate functions that we have in pandas.
  • 2:14 - 2:18
    We've got mean, median, and mode. We've got the minimum and the maximum.
  • 2:18 - 2:23
    You can sum, you can count. You can compute standard deviation and variance.
  • 2:23 - 2:27
    There are several others as well.
  • 2:27 - 2:30
    These are all methods on a pandas series. If you have a series, this is a method.
  • 2:30 - 2:38
    You've got the series object, a dot, and then this function with parentheses to call it, and you're going to compute that aggregate statistic.
  • 2:38 - 2:52
    So let's see these in action. I'm going to compute the mean rating and we get 3.5.
  • 2:52 - 2:59
    I can compute the sum. There's an alternate form. All of these are also available as functions in the NumPy module
  • 2:59 - 3:04
    that take an array, and a series is a kind of array. For some of the functions,
  • 3:04 - 3:13
    there are slight differences between the pandas versions and the NumPy versions, but mean and sum are the same.
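
As a rough sketch of the aggregate methods listed above and their NumPy counterparts, assuming a small made-up series of ratings. The standard deviation comparison at the end is one example of a pandas/NumPy difference (different default `ddof`); the video does not say which differences it has in mind.

```python
import numpy as np
import pandas as pd

# Made-up ratings, only for illustration.
rating = pd.Series([4.0, 3.0, 5.0, 2.0, 3.0])

# Aggregate methods on the Series itself.
print(rating.mean(), rating.median(), rating.min(), rating.max())
print(rating.sum(), rating.count(), rating.std(), rating.var())
print(rating.mode())  # mode returns a Series, since ties are possible

# The same aggregates as NumPy functions that take an array-like.
np_mean = np.mean(rating)
np_sum = np.sum(rating)

# One difference: pandas defaults to the sample standard deviation
# (ddof=1), while NumPy on a plain array defaults to ddof=0.
pandas_std = rating.std()
numpy_std = np.std(rating.to_numpy())
```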
  • 3:13 - 3:20
    So if we want to get the size of a series, there's a couple of different ways. We can ask the series for its size or call len on it.
  • 3:20 - 3:26
    Those are the same operation and they will give us the total length of the series, including missing values.
  • 3:26 - 3:29
    If we've got a series that has missing values (we haven't seen missing values yet,
  • 3:29 - 3:35
    but they're going to come up later) and we want to count how many values we actually have,
  • 3:35 - 3:41
    that's what the series count method does. So we can see those.
  • 3:41 - 3:51
    The size and the len, those are the same. Also, a series is an array, and arrays in the NumPy world have a shape. We can get shape,
  • 3:51 - 3:56
    which is the same as the size except as a tuple, because arrays can have more than one dimension.
  • 3:56 - 4:02
    This weird syntax here, where we have a number with a comma after it inside parentheses,
  • 4:02 - 4:07
    is the Python syntax for a tuple consisting of exactly one value.
  • 4:07 - 4:11
    It's a little bit of a weird syntax, but it comes up in a few places. But that's what that means.
  • 4:11 - 4:21
    It's a tuple with exactly one value. Then we can count the number of ratings and since we don't have any missing ratings, it returns the same number.
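
A small sketch of the size-related operations above. The series here is made up and deliberately includes one missing value (unlike the ratings data in the video) so that `count` differs from `size`.

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value (NaN).
s = pd.Series([4.0, np.nan, 3.5, 5.0])

n_size = s.size      # total length, including missing values
n_len = len(s)       # same as size
shape = s.shape      # the length as a one-element tuple
n_valid = s.count()  # number of non-missing values only

# A trailing comma inside parentheses makes a tuple with exactly one value.
one_tuple = (4,)
print(n_size, n_len, shape, n_valid)
```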
  • 4:21 - 4:32
    So another thing we can do that's a form of an aggregate is to get a quantile, and the quantile method takes a parameter that is the fraction.
  • 4:32 - 4:38
    What it does is this: the parameter is a fraction, and we want to find the value
  • 4:38 - 4:44
    we would reach if we sorted the series from smallest to largest
  • 4:44 - 4:51
    and went that fraction along it. So 0.5 would be the middle, the median value. We're going to see median in the next video.
  • 4:51 - 4:56
    What's the value that's there? So we can see those run.
  • 4:56 - 5:06
    The median rating is 3.5. If we ask for the quantile 0.2,
  • 5:06 - 5:10
    we get 3.0.
  • 5:10 - 5:16
    And what this means is that 0.2 of the way across, so 80 percent of the ratings are 3.0 or higher
  • 5:16 - 5:22
    on a five-star scale. So think a little bit about why that might be.
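
To make the quantile idea concrete, here is a minimal sketch on a made-up series of eleven ratings. The values were chosen so that the 0.5 and 0.2 quantiles land exactly on data points; they are not the MovieLens data.

```python
import pandas as pd

# Made-up ratings in arbitrary order; quantile sorts internally.
rating = pd.Series([5.0, 3.0, 1.0, 4.0, 3.5, 3.0, 4.5, 2.0, 5.0, 3.0, 4.0])

# 0.5 of the way along the sorted values is the median.
median = rating.quantile(0.5)

# 0.2 of the way along: roughly 80 percent of values are at or above this.
q20 = rating.quantile(0.2)

# Fraction of ratings at or above the 0.2 quantile.
share_at_or_above = (rating >= q20).mean()
print(median, q20, share_at_or_above)
```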
  • 5:22 - 5:26
    So far we've seen aggregates that work over a single series: we take the series,
  • 5:26 - 5:33
    we get one value. But sometimes we want to be able to group and compute aggregates per group.
  • 5:33 - 5:37
    So remember, this data frame has movies.
  • 5:37 - 5:45
    The ratings are for movies, and they're provided by users. So rather than just the overall mean rating,
  • 5:45 - 5:50
    maybe what we want to do is find the number of ratings per movie.
  • 5:50 - 5:56
    This would give us a measure of popularity. We could say, well, the movie that's rated the most frequently is the most popular.
  • 5:56 - 6:02
    We could also look at the average rating per movie. And so we can do this with groupby.
  • 6:02 - 6:10
    So groupby returns an object that allows us to perform grouped operations on a data frame.
  • 6:10 - 6:17
    And so we give it the column name that we want to group by, in this case movie ID. We can group by more than one column at a time;
  • 6:17 - 6:25
    we're only doing one for now.
  • 6:25 - 6:31
    Then, in this groupby, we're going to say we only want to work on one column.
  • 6:31 - 6:36
    Otherwise, it's going to count the ratings and the timestamps and so on, and they're going to be the same count.
  • 6:36 - 6:40
    So we're going to group by movie ID, and then we're going to say that within each group
  • 6:40 - 6:46
    we only want to work with the rating. And then what we want to do is count it. All of the aggregate
  • 6:46 - 6:52
    functions that we've seen before work on a per-group basis as well.
  • 6:52 - 6:59
    Do note, though, that we are grouping the whole data frame by movie ID before we select the column.
  • 6:59 - 7:02
    If we did it the other way around and selected rating first, well,
  • 7:02 - 7:06
    now we don't have a movie ID to group by, because we've pulled the rating out of the frame.
  • 7:06 - 7:11
    This order is important. So we group by movie ID, which is another column in the frame, and we use the rating column.
  • 7:11 - 7:23
    So let's see this in action. We want to count the number of ratings per movie, and what it gives us is a series whose index
  • 7:23 - 7:30
    is the movie ID and whose value is the number of ratings for that movie.
  • 7:30 - 7:34
    We haven't really seen indexes yet. We haven't really worked with them much yet.
  • 7:34 - 7:39
    But that's what it's doing here. We're indexing the data frame. It's resulting in a series that's indexed.
  • 7:39 - 7:48
    And this is the thing a series adds on top of a normal NumPy array: we have this index that tells us, oh, this is for movie 1.
  • 7:48 - 7:53
    This is for movie 2. This is for movie 209,171.
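
The grouped count described above can be sketched like this, again on a tiny made-up `ratings` frame with assumed MovieLens-style column names. Note the order: group the whole frame by `movieId` first, then select the `rating` column, then aggregate.

```python
import pandas as pd

# Tiny stand-in for the ratings table (made-up values).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.0],
    "timestamp": [100, 101, 102, 103, 104],
})

# Group by movie, keep only the rating column, count ratings per group.
rating_counts = ratings.groupby("movieId")["rating"].count()

# The result is a Series whose index holds the movie IDs.
print(rating_counts)
print(rating_counts.index.name)
```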
  • 7:53 - 7:57
    We can also compute multiple aggregates at the same time.
  • 7:57 - 8:11
    So there's an agg function that allows you to specify multiple aggregate functions to call. You specify them by name.
  • 8:11 - 8:18
    So here I'm doing the groupby that we did before. And then I'm
  • 8:18 - 8:21
    calling agg to say I want to aggregate the values of this column,
  • 8:21 - 8:27
    but I'm giving it a list of two different aggregation functions, mean and count.
  • 8:27 - 8:39
    And when I run this, I get a data frame that's indexed by movie ID, but then it has two columns, and the columns are named after the functions.
  • 8:39 - 8:43
    So we have a mean column that's the result of mean, and a count column
  • 8:43 - 8:50
    that's the result of count. And because I know that I did this on the rating column, I know these are the mean and the count of the ratings.
  • 8:50 - 8:55
    So we can see that this movie ID has a mean rating of 3.89,
  • 8:55 - 9:02
    and the number of ratings is 57,309.
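
A minimal sketch of computing several aggregates at once with `agg`, using a made-up `ratings` frame. The name `movie_stats` matches how the result is referred to later in the video; the data and column names are assumptions.

```python
import pandas as pd

# Tiny stand-in for the ratings table (made-up values).
ratings = pd.DataFrame({
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.0],
})

# Pass a list of aggregate names; each becomes a column in the result.
movie_stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])

# The result is a DataFrame indexed by movieId with columns "mean" and "count".
print(movie_stats)
```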
  • 9:02 - 9:09
    So sometimes you want to sort data. sort_values will re-sort an entire data frame
  • 9:09 - 9:13
    by a specific column; you give it a column name.
  • 9:13 - 9:18
    We could re-sort this whole data frame by, say, the number of ratings.
  • 9:18 - 9:22
    Sometimes we also want to just get the largest or smallest. Sometimes the reason we want to sort
  • 9:22 - 9:28
    is that I want to know the five movies with the most ratings,
  • 9:28 - 9:33
    in which case we don't necessarily need to sort the entire thing. nlargest and nsmallest
  • 9:33 - 9:40
    let us just get the rows with the n largest or smallest values for a particular column.
  • 9:40 - 9:43
    So if I go over and do this.
  • 9:43 - 9:51
    So I want to get the 10 movies with the most ratings. I can call nlargest and tell it I want 10 and I want to do it by count.
  • 9:51 - 9:57
    And it gives me the 10 movies with the most ratings, sorted in decreasing order of count.
  • 9:57 - 10:07
    And we see movie ID 356 has 81,000 ratings with a mean of 4.05.
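
The two ways of ordering described above can be sketched as follows. The `movie_stats` frame here is built by hand with made-up numbers so the example stands alone; in the video it comes from the grouped aggregate.

```python
import pandas as pd

# Hand-built stand-in for the per-movie statistics (made-up values).
movie_stats = pd.DataFrame(
    {"mean": [3.9, 3.2, 4.1, 2.5], "count": [50, 7, 81, 12]},
    index=pd.Index([1, 2, 356, 4], name="movieId"),
)

# Re-sort the whole frame by one column, largest first.
by_count = movie_stats.sort_values("count", ascending=False)

# Or just take the n rows with the largest values in that column.
top2 = movie_stats.nlargest(2, "count")
bottom1 = movie_stats.nsmallest(1, "count")
print(top2)
```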
  • 10:07 - 10:13
    But this doesn't tell us what movie that is. What we can do.
  • 10:13 - 10:19
    Remember, we have this movies table too that gives us the movie titles. We can join the tables together.
  • 10:19 - 10:27
    And the simplest way to join is to join on a common index. There's a set_index method that lets you set a column as the index.
  • 10:27 - 10:32
    You can also specify columns to join by. We're going to see more of this later; in particular,
  • 10:32 - 10:37
    I'm going to make a notebook that walks you through the different indexing operations.
  • 10:37 - 10:42
    And you can also read about them in more detail in the textbook.
  • 10:42 - 10:47
    But if we want to see the titles now:
  • 10:47 - 10:54
    I'm going to take our movies frame, which has a movie ID column, and I'm going to join it with movie stats. And movie stats,
  • 10:54 - 11:01
    remember, is the result of our aggregate; its index is the movie ID. And so when I call join,
  • 11:01 - 11:08
    I'm going to tell it I want to join movies with movie stats, and I'm going to tell it on movie ID. Movies doesn't have a useful index;
  • 11:08 - 11:15
    its index is just the positions. But when I use the on keyword in join, what it does
  • 11:15 - 11:19
    is tell it to take
  • 11:19 - 11:26
    the left table, movies, use its movie ID column, and join it with the index in the other table.
  • 11:26 - 11:35
    So movie stats has an index, and it expects the movie ID column in movies to match up with the index in movie stats.
  • 11:35 - 11:41
    And so the resulting frame has our title and our genres,
  • 11:41 - 11:45
    and then it has the mean and the count for each of these movies.
  • 11:45 - 11:56
    So now if I call nlargest on this movie info frame, I see that the most frequently rated movie, with 81,000 ratings, is Forrest Gump.
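
A minimal sketch of the join just described, assuming made-up `movies` and `movie_stats` frames. The titles and genres are placeholders, and the column names `movieId`, `title`, and `genres` are assumed to follow the MovieLens convention.

```python
import pandas as pd

# Left table: positional index, with the key in an ordinary column.
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],  # placeholder titles
    "genres": ["Drama", "Comedy", "Action"],
})

# Right table: indexed by movieId, as produced by a grouped aggregate.
movie_stats = pd.DataFrame(
    {"mean": [4.0, 3.0, 2.0], "count": [3, 1, 5]},
    index=pd.Index([10, 20, 30], name="movieId"),
)

# on="movieId" matches the left column against the right table's index.
movie_info = movies.join(movie_stats, on="movieId")

# Now the top rows carry titles as context.
top = movie_info.nlargest(1, "count")
print(top)
```

An alternative with the same matching logic is to call `movies.set_index("movieId").join(movie_stats)`, which joins index to index.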
  • 11:56 - 12:00
    So another thing you can do: the movie-level rating statistics that we computed, this count,
  • 12:00 - 12:03
    this mean, those are just more variables.
  • 12:03 - 12:10
    Remember, earlier we talked about how some of the variables you might observe are actually aggregates from other things.
  • 12:10 - 12:17
    Well, these are just more variables. So if we have an observation of a movie, it has an ID, it has a title,
  • 12:17 - 12:22
    it has genres, and it has the number of people who've rated it and the mean rating.
  • 12:22 - 12:31
    These also can be aggregated. So in the downloads for this video, you're going to find the notebook that I was just using for practice.
  • 12:31 - 12:36
    What I'd like you to do is to go in and compute the mean number of ratings per movie.
  • 12:36 - 12:38
    Maybe do some additional exploration as well.
  • 12:38 - 12:47
    But that's going to let you start to see how we can build from these aggregates into additional structures,
  • 12:47 - 12:52
    and also emphasize that a data frame is just a data frame.
  • 12:52 - 12:59
    We give it meaning in terms of observations, but the fact that a data frame resulted from an aggregate doesn't make it special in any way.
  • 12:59 - 13:03
    We can aggregate the results of an aggregate because it's just another data frame.
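
As a sketch of aggregating the result of an aggregate, assuming the same kind of made-up `ratings` frame as before: the per-movie statistics are an ordinary data frame, so their `count` column can itself be averaged.

```python
import pandas as pd

# Tiny stand-in for the ratings table (made-up values).
ratings = pd.DataFrame({
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.0],
})

# First aggregate: one row per movie.
movie_stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])

# Second aggregate over the first: mean number of ratings per movie.
mean_ratings_per_movie = movie_stats["count"].mean()
print(mean_ratings_per_movie)
```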
  • 13:03 - 13:10
    Everything's a data frame or a series in pandas. So to wrap up: aggregates combine a series or array into a single value.
  • 13:10 - 13:14
    That's what it means to aggregate. We can do this over an entire series.
  • 13:14 - 13:17
    We can also do this on a group-by-group basis
  • 13:17 - 13:27
    if we have another column that provides us with grouping information. So we can compute the mean, or a sum, or whatever, per group.
  • 13:27 - 13:34
    You might have this if you have records of financial transactions. You might want to compute,
  • 13:34 - 13:39
    well, what was our total profit in each month?
  • 13:39 - 13:42
    So you could group by year, maybe you group by month,
  • 13:42 - 13:50
    maybe group by year and month, and take a sum of the profit margin on each of your transactions.
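
The financial example above can be sketched as follows. The `transactions` frame, its column names (`year`, `month`, `profit`), and its values are all hypothetical; the point is only that `groupby` accepts a list of columns.

```python
import pandas as pd

# Hypothetical transaction records.
transactions = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021],
    "month": [1, 1, 2, 1],
    "profit": [10.0, 5.0, 7.0, 3.0],
})

# Group by two columns at once and sum profit within each (year, month).
monthly_profit = transactions.groupby(["year", "month"])["profit"].sum()

# The result is a Series indexed by (year, month) pairs.
print(monthly_profit)
```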
  • 13:50 - 13:58
    And then finally, join combines frames. We can start to put two frames together in order to get context for values.
  • 13:58 - 14:00
    We're going to see a lot of other uses for join later.
  • 14:00 - 14:12
    But in this context, it lets us get context for understanding what's going on in a value.
Video Language:
English
Duration:
14:11