-
This video, I'm going to show you how to do grouping and aggregate operations and pandas sort of learning
-
outcomes are few to be able to compute an aggregate aggregate values from a panda series,
-
compute grouped aggregate values from a PANDAS data frame, and also be able to order a data frame,
-
pick the larger the rows with the largest values for some series.
-
And then finally join two pandas data frames to get context for the results that we just computed in the first part.
-
So we have a data frame, so this is the movie lends data that we used and some of the earlier videos.
-
So we have the data frame and we've got this ratings table that has the user I.D., movie, I.D. rating and timestamp columns.
-
It's twenty five million rose by four columns. So an aggregate.
-
If we want to ask the question, what does the mean rating? So all of the rating values users has ever given.
-
What's the mean value? And this is this is the code we would use to do this.
-
And there's a few pieces. There's. So we're using this data frame.
-
There's a few pieces. We're using this data frame. There's a few pieces.
-
We're using this data frame. And we're we're then selecting a column.
-
Remember, this is the court. This is how we select a column. And then the result of that operation, this whole operation here is a series.
-
And so then we call the mean method on the series and we get the mean we get the mean rating.
-
Think a little bit about the previous video to think a moment about what the conceptual problem here is, the common operation.
-
But there is a little conceptual problem with it in terms of what it actually means.
-
There's a variety of different aggregate functions that we have in pandas.
-
We've got mean median mode. We've got the minimum and the maximum.
-
You can some you can count. You can compute standard deviation and variance.
-
There is there are several others as well.
-
These are all methods on a PANDAS series. If you have a series, this is a method.
-
You've got the serious object dot. And then this function parentheses to call it, and you're going to compute that aggregate statistics.
-
So let's see these in action. So I'm going to compute the mean rating and we get three point five.
-
I can compute the sum. There's an alternate form. All of these are also available as functions in the num pi module.
-
That ticker, an array and a series is a kind of array for some of the functions.
-
There are slight differences between the panda's versions and the num pi versions, but mean and some are the same.
-
So if we want to get the size of a series, there's a couple of different ways so we can ask the series for its size or do a line on it.
-
Those are the same operation and they will give us the total length of the series, including missing values.
-
If we've got a series that has missing values and we haven't seen missing values yet,
-
but they're going to come up later and we want to count how many values we actually have.
-
That's what the series count does count. Method does. So we can see those.
-
The size, the land, those are the same. Also, we can get a series is an array and a raise in the number PI world have a shape we can get shape,
-
which is the same as the size except as a tuple, because arrays can have more than one dimension.
-
This weird syntax here where we have a number with a comma after it inside parentheses.
-
That's the python syntax for a tuple consisting of exactly one value.
-
It's a little bit of a weird syntax, but it comes up in a few places. But that's what that means.
-
It's a tuple with exactly one value. Then we can count the number of ratings and since we don't have any missing ratings, it returns the same number.
-
So. Another thing we can do that. That's a form of an aggregate is to get a Quantrell and the quanti all takes a parameter that is the fraction.
-
And what it does is, is this the parameter as a fraction? We want to find the value.
-
If we sorted the if we sort of the series from low smallest to largest.
-
And we went that fraction along it, so point five would be the middle. The median value, we're gonna see median in the next video.
-
What's the value that's there? So we can go we can see those run.
-
The median rating is three point five. If we ask for the Quanti or point two, we're going to get the.
-
We're going to if we ask for quanti or point two, we get 3.0.
-
And what this means, it's it's point to the way across 80 percent of the ratings are 3.0 or higher.
-
On a five star scale. So think a little bit about why that might be.
-
We've seen so far aggregates that work over a single serious evalu to take the series.
-
We get one value, but sometimes we want to be able to group and compute aggregates per group.
-
So remember this. This data frame has movie.
-
The ratings are for movies. And they're provided by users, so maybe we want to get rather than just the mean overall rating.
-
Maybe you won't want to do is we want to find the average number of ratings per movie.
-
This would give us a measure of popularity. We could say, well, the movie that's rated the most frequently is the most popular.
-
We could also look at the average rating per movie. And so we can do this with the group by.
-
So group by. Returns an object that allows us to perform grouped operations on a data frame.
-
And so we give it the column name that we want to group by. In this case, movie I.D., we can group by more than one column at a time.
-
We're only doing one for now. Then we can we can do the.
-
We then in this group are we're going to say we only want to work on one column.
-
And otherwise, it's going to count the ratings and the time stamps and back, so they're going to be the same count.
-
So we're gonna say we're gonna to group by movie idea. They're going to say within each group.
-
We only want to work with the rating. And then we want to do is we want to count it all of the aggregate values.
-
The functions that we've seen before work on a per group basis as well. And.
-
Do note, though, that we are grouping by grouping the whole data frame by movie I.D. before we select the column.
-
If we did it the other way around, we were okay, select red and well,
-
now we don't have a movie idea to group by because we've pulled the rating out of the movie.
-
This order is important. So we group by movie I.D. That's another column in the frame and we use the rating column.
-
So let's see this in action. So we want to count the number of ratings per movie and what it gives us is a series whose index?
-
Is the movie I.D.? And whose value is the number of movies for that movie?
-
We haven't really seen indexes yet. We haven't really worked with them much yet.
-
But that's what it's doing here. We're indexing the data front. It's resulting in a series that's indexed.
-
And this is the thing. Serious ads on top of a normal non pie array is that we have this index that tells us, oh, this is for movie one.
-
This is for movie two thousand two hundred nine thousand one hundred and seventy one.
-
We can also compute multiple aggregates at the same time.
-
So the the AG, there's an AG function that allows you to to specify movies, to aggregate functions, to call you, specify them by name.
-
So here I'm doing the group by that we did before. And then I'm AG.
-
Calling AG to say I want to aggregate the values values this column.
-
But I'm giving it a list of two different aggregation functions, mean and count.
-
And when I run this, I get a data frame. That's indexed by movie I.D., but then it has two columns and the columns are named after the function,
-
so have a mean function that's the result of mean and account function.
-
That's the result of count. And because I know that I did this on the rating column, I know these are the mean and the count to the ratings.
-
So we can see that movie I.D. has a mean rating of three point eighty nine and.
-
And the number of ratings is fifty seven thousand three hundred and nine.
-
So sometimes you want to sort out data. So sort values will resource an entire data frame.
-
And by a specific column, you get a column numbers.
-
We could resource this whole data frame by by, say, the number of ratings.
-
Sometimes we also want to just get the largest or small. Sometimes the reason we want to sort.
-
Is I want I'd want to know the five movies with the most ratings.
-
In which case, we don't necessarily need to sort the entire thing. And largest. And then smallest.
-
Let us just get the rose with the with the end largest or smallest values for a particular column.
-
So if I go over and do this.
-
So I want to get the 10 movies with the most ratings I can call and largest and tell it, I want 10 and I want to do it by count.
-
And it gives me the 10 movies with the with the most ratings sorted in decreasing order of count.
-
And we see movie ads. Three hundred and fifty six gives has eighty one thousand movies with a mean a 4.0 five.
-
But this doesn't tell us what movie that is. What we can do.
-
Remember, we have this movie is table two that gives us the movie titles. We can join the tables together.
-
And the simplest way to join is to join on a common index. There's a set index method that lets you set a column is the index.
-
You can also specify columns to join by. We're going to see more of this later, particularly.
-
I'm going to make a note book that walks you through the different indexing operations.
-
And you can also read about them in more detail in the text book.
-
But. If we want to see, so I'm going to say.
-
So I'm going to take our movie's frame that as a movie column and I'm going to join it with movie stats and movie stats.
-
Remember, it's the result of our aggregate its index. Is the movie I.D. and so on when to call?
-
I'm going to tell it. I want to join movies on movie stats and I'm going to tell it on movie I.D. movies doesn't have a useful index.
-
Its index is just the positions. But on whereas when I use the on keyword in join what it does.
-
Is it. It tells it to join the left feet.
-
The left table movies, to use that color movie ideas column and join it with the index in the other table.
-
So movie starts has an index and it expects the movie idea, column and movies to match up with the index in movie stats.
-
And so the resulting frame. Has our title in our genre does.
-
And then it has the mean and the count for each of these movies.
-
So now if I say and largest of this movie info frame, I see that the most frequently rated movie with 81000 movie ratings is Forrest Gump.
-
So another thing you can do is so the movie level rating statistics to be computed, this count.
-
This mean those are just more variables.
-
Remember, in the earlier we talked about, we can make some of the variables you might observe are actually aggregates from other things.
-
Well, these are just more variables. So now we have sweet. So if we have an observation of a movie, it has an I.D., it has a title.
-
It has on Rose. And it has the number of people who've rated it in the mean rating.
-
These also can be aggregated. So in the downloads for this video, you're going to find the notebook that I was just using for practice.
-
What I'd like you to do is to go in and compute the mean number of ratings per movie.
-
Maybe use some additional exploration as well.
-
But that's going to let you start to see how we can build from these aggregates into into additional structures.
-
And also emphasize that. A data frame is just a data frame.
-
Mike, we give it meaning in terms of observations. But the fact that a data frame resulted from an aggregate doesn't make it special in any way.
-
We can aggregate the results of of of an aggregate because it's just another data frame.
-
Everything's a data frame or a series and pandas. So to wrap up aggregates, combine a series or array into a single value.
-
That's what it means to aggregate. We can do this over an entire series.
-
We can also do this on a group by group basis.
-
If we have another column that provides us with grouping information so we can compute the average Beacon computer Mean Asama or whatever per group,
-
you might have this like if you have if you have records of financial transactions, you might want to compute.
-
Well, what was what was our total profit in each month.
-
So you could group by year, maybe as you grew by month,
-
maybe group by year and month and take a some of the of the profit margin on each of your transactions.
-
And then finally join Combine's frames, we can start to put two frames together in order to get context for values.
-
We're going to see a lot of other uses for join later.
-
But this that's in this context that lets us get context for understanding what's going on in a value.