In this video I'm going to introduce loading a data file into pandas and actually starting to see what the shape and the structure of the data is.
The learning outcomes of this video are for you to be able to import Python libraries,
load a data file into pandas, examine the size and data types of a data frame,
and understand the relationship, particularly between a data frame and a series.
And we're going to start introducing the concept of an index. We're going to see a lot more about that in a later video.
So most useful Python functions are in modules.
There are some functions that are just built in, but most of the time we're going to need functions in various modules.
And before we can use a module, we have to import it.
So there are some standard imports; basically every one of our notebooks is going to have imports at the top.
We import NumPy and we import pandas. These give us our basic scientific computing facilities.
And common practice is to import these with aliases, np and pd, so that we can refer to them later by shorter names.
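As a minimal sketch, the standard imports described here would sit at the top of a notebook like this. The aliases np and pd are the convention mentioned above; the print line is only there to show that each alias is just a shorter name for its module.

```python
# Standard imports at the top of a notebook.
import numpy as np   # NumPy: array-based scientific computing
import pandas as pd  # pandas: data frames and series

# Each alias refers to the full module under a shorter name.
print(np.__name__, pd.__name__)  # numpy pandas
```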
A lot of the files we've been working with, particularly early on,
are distributed in a format called comma-separated values, or CSV.
A comma-separated values file consists of one line per record, and the values are separated by commas.
That's where it gets its name. So you've got a comma between the different values.
Also, sometimes the file will have a header, and in this case it does.
So the first row is the names of the columns. It doesn't always have a header, but often it does.
And it's very convenient when it does. So pandas lets us read a CSV file through a function called read_csv.
We call read_csv and give it the path ml-25m/movies.csv,
which is from the data set that I had you download in the information for the week.
We get a data frame, and following the convention that I showed you in a video last week,
I then just write the variable I saved it in, so that we can immediately see it. Jupyter
formats a pandas data frame nicely.
It shows us the first five rows and the last five rows, with an ellipsis in the middle indicating that there's a lot more data.
It also tells us how big it is: about sixty-two thousand rows and three columns.
So we immediately get the answer to one of the questions I said we want to ask: how much data do we have?
Right here, we already have that answer. We've got about sixty-two thousand rows and three columns.
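Here is a small sketch of that loading step. Since the real file may not be on hand, it reads a tiny in-memory CSV instead; the three rows and their titles are made up for illustration, and only the column names follow the movies file. The commented-out line shows what the call on the real file would look like, assuming it sits at the path used in the lecture.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: a header line, then one record per line.
csv_text = """movieId,title,genres
1,Example Movie A (2000),Comedy
2,Example Movie B (2001),Drama
3,Example Movie C (2002),Comedy|Drama
"""

# read_csv accepts a path or a file-like object; the header row becomes the column names.
movies = pd.read_csv(io.StringIO(csv_text))

# With the downloaded data set, the call would instead be:
# movies = pd.read_csv('ml-25m/movies.csv')

print(movies.shape)  # (rows, columns): here (3, 3)
movies               # as the last line of a notebook cell, this displays the frame
```

The shape attribute gives the same row and column counts that Jupyter prints under the displayed frame.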
So another way we can look at the data frame is we can use the info method and the info method will print out information about the data frame.
In particular, it tells us what the index is: we have a RangeIndex. It also tells us about the different columns.
The RangeIndex goes from 0 to 62,422, for 62,423 entries.
A range index just means we're looking up the data by position, index 0 through n minus 1.
We have three columns. One of them, movieId, is an int64 and the other two are object.
These store strings; object is the data type pandas uses to store a string.
It can't just store the string directly in the column. The NumPy arrays we talked about last week
store data compactly, but for strings the array has to store a pointer to the string.
And so pandas uses an array that stores pointers to objects, which can be any object.
We happen to know they're strings in this case.
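A quick sketch of inspecting a frame this way, using a made-up three-row frame with the same column names as the movies file. The lecture's output shows the string columns as object; depending on your pandas version they may be reported under a dedicated string type instead, so treat the exact dtype labels as an assumption.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the movies file.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example Movie A (2000)", "Example Movie B (2001)", "Example Movie C (2002)"],
    "genres": ["Comedy", "Drama", "Comedy|Drama"],
})

movies.info()         # prints the index type, the columns, their dtypes, and memory use
print(movies.index)   # RangeIndex(start=0, stop=3, step=1)
print(movies.dtypes)  # one data type per column
```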
And we have, as I said, 62,423 rows. That answers our initial question:
how much data do we have? 62,423 rows, each with three columns.
And what kinds of data do we have?
We have a movie ID that's an integer, and we have title and genres that are strings.
What is the data about? The data is about movies, and each row is a movie.
The Datasheets for Datasets paper talks about these as instances: each row is an instance, and it represents something.
So in this case, the row is information about a movie and it represents a movie.
So each column of the data frame is a series.
As we mentioned last week, a series is an array that has an index associated with it.
We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns.
And so we can get the title column out of the movies data frame.
It shows us the titles, and at the bottom it says this series has the name title and length 62,423.
It's indexed from 0 to 62,422, and it has a dtype of object.
We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index.
All columns in the data frame share the same index. That's an important link between the different columns.
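To make the column and series relationship concrete, here is a minimal sketch on a made-up three-row frame (the rows are hypothetical). It pulls a column out with dictionary-style access and checks that the column carries the same index as the frame.

```python
import pandas as pd

# Hypothetical small frame with the same column names as the movies file.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example Movie A (2000)", "Example Movie B (2001)", "Example Movie C (2002)"],
    "genres": ["Comedy", "Drama", "Comedy|Drama"],
})

titles = movies["title"]          # dictionary-style access returns a Series
print(type(titles).__name__)      # Series
print(titles.name, len(titles))   # title 3

# Every column shares the frame's index.
print(titles.index.equals(movies.index))            # True
print(movies["genres"].index.equals(movies.index))  # True
```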
So let's load up another frame, the ratings frame.
We can look at its info. It has four columns.
It has 25 million instances. This is why this is called the MovieLens
25M data set. It contains 25 million ratings.
Just over twenty five million. Twenty five million. Ninety four.
Each row contains a userId that's an integer, int64; a movieId that's also an integer; a rating that's a floating point value, float64,
which is double-precision floating point; and a timestamp, which is also an integer of type int64.
The whole thing takes 762.9 megabytes of memory. Notice there's not a plus here.
If you remember, the movies frame had a plus after the memory figure. That's because by default, it just measures the memory taken up by the pandas data frame itself.
If a column has an object type, it does not measure the size of the objects.
So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings.
But here we don't have any strings. We don't have any other object types. It's just ints and floats.
So it can tell us directly: this data frame takes 762.9 megabytes.
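The memory figure can be sanity-checked with a little arithmetic. This is a sketch under a few assumptions: the small frame and its row count are made up, the 25 million row count is rounded, and the megabyte figure is taken to be in units of 2**20 bytes. It also shows the deep option of memory_usage, which counts the string objects that the default measurement leaves out.

```python
import numpy as np
import pandas as pd

n = 1000  # hypothetical row count for a small stand-in ratings frame
ratings = pd.DataFrame({
    "userId": np.arange(n, dtype="int64"),
    "movieId": np.arange(n, dtype="int64"),
    "rating": np.full(n, 3.5, dtype="float64"),
    "timestamp": np.arange(n, dtype="int64"),
})

# Each int64 or float64 value takes 8 bytes, so four columns take n * 4 * 8 bytes.
column_bytes = int(ratings.memory_usage(index=False).sum())
print(column_bytes, n * 4 * 8)  # 32000 32000

# The same arithmetic at roughly 25 million rows, in units of 2**20 bytes.
approx_mb = 25_000_000 * 4 * 8 / 2**20
print(round(approx_mb, 1))  # 762.9

# For string columns, deep=True also counts the string objects themselves.
names = pd.DataFrame({"title": ["Example Movie A", "Example Movie B", "Example Movie C"]})
shallow = int(names.memory_usage(index=False).sum())
deep = int(names.memory_usage(index=False, deep=True).sum())
print(shallow, deep)
```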
So data can also refer to other data.
In this ratings file we just loaded, ratings are instances themselves, but each connects a user to a movie.
So we have the rating, but it also references two other kinds of entities or objects: users and movies.
The rating doesn't just exist on its own, but it's provided by a user for a movie.
These work a lot like foreign keys in relational databases. We're going to see later how to do a merge so that we can actually, say,
link ratings to the movie information that they're associated with.
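As a preview only, here is a rough sketch of what such a link could look like using the merge method on two tiny made-up frames. The rows are hypothetical, and the later video may well do this differently.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Example Movie A (2000)", "Example Movie B (2001)"],
})
ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# movieId acts like a foreign key: each rating row points at one movie row.
rated = ratings.merge(movies, on="movieId")
print(rated[["userId", "title", "rating"]])
```

Each rating row picks up the title of the movie whose movieId it references.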
So to wrap up: a data frame consists of columns. Each column is a series, an array with an index.
We can quickly find out how many rows there are in a data frame, that is, how many
instances of the data. We can find out what columns there are and what data types those columns have.
We'll be talking later this week about more things we can do with that, and also more about understanding what the data being stored in these types is.