0:00:03.860,0:00:14.120
In this video I'm going to introduce loading a data file into pandas and starting to see what the shape and the structure of the data is.

0:00:14.120,0:00:22.140
The learning outcomes of this video are for you to be able to import Python libraries,

0:00:22.140,0:00:27.170
load a data file into pandas, examine the size and data types of a data frame,

0:00:27.170,0:00:30.950
and understand the relationship, particularly between a data frame and a series.

0:00:30.950,0:00:37.060
And we're going to start introducing the concept of an index. We're going to see a lot more about that in a later video.

0:00:37.060,0:00:41.530
So most useful Python functions are in modules.

0:00:41.530,0:00:46.480
There are some functions that are just built in, but most of the time we're going to need functions in various modules.

0:00:46.480,0:00:50.500
And before we can use a module, we have to import it.

0:00:50.500,0:00:58.830
So some of the standard imports, basically every one of our notebooks is going to have imports at the top.

0:00:58.830,0:01:03.460
We import NumPy, and we import pandas. We've got our basic scientific computing facilities.

0:01:03.460,0:01:16.810
And common practice is to import these with aliases, np and pd, so that we can refer to them later by shorter names.
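
A minimal sketch of the standard imports described above. The aliases np and pd are the common convention; the tiny array and frame are made up purely to show the aliases in use.

```python
# Standard imports that go at the top of a notebook.
import numpy as np   # basic array / scientific computing facilities
import pandas as pd  # data frames and series

# The aliases let us write short names such as np.array and pd.DataFrame.
values = np.array([1, 2, 3])
frame = pd.DataFrame({"x": values})
print(frame.shape)
```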

0:01:16.810,0:01:19.750
A lot of the files we've been working with, particularly early on,

0:01:19.750,0:01:28.850
are distributed in a format called comma-separated values, or CSV. A comma-separated value file

0:01:28.850,0:01:32.960
consists of one line per record, and the values are separated by commas.

0:01:32.960,0:01:39.080
That's where it gets its name. So you've got a comma between the different values.

0:01:39.080,0:01:44.000
Also, sometimes, and in this case it does, the file will have a header.

0:01:44.000,0:01:49.940
So the first row is the names of the columns. It doesn't always have a header, but often it does.
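
To make the file layout described above concrete, here is a small sketch using Python's built-in csv module. The column names follow the movies file discussed in this video; the rows themselves are made up for illustration.

```python
import csv
import io

# Illustrative CSV text (made-up rows): a header line, then one line per record,
# with commas between the values.
sample = (
    "movieId,title,genres\n"
    "1,Example Movie (1999),Comedy\n"
    "2,Another Film (2005),Drama|Romance\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]
print(header)        # column names taken from the first row
print(len(records))  # number of data records
```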

0:01:49.940,0:01:58.010
And it's very convenient when it does. So pandas lets us read a CSV file through a function called read_csv.

0:01:58.010,0:02:02.570
And so we call read_csv and we give it the ml-25m/movies.csv

0:02:02.570,0:02:11.960
file, which is from the data set that I have you download in the information for the week.

0:02:11.960,0:02:20.400
We get a data frame, and, following the convention that I showed you in a video last week,

0:02:20.400,0:02:27.590
I then just write the variable I saved it in, so that we can immediately see it printed. Jupyter

0:02:27.590,0:02:35.370
formats a pandas data frame nicely, and it shows us the first five rows.

0:02:35.370,0:02:43.410
It shows us the first five rows and the last five rows. We've got an ellipsis in the middle indicating that there's elided data.

0:02:43.410,0:02:49.650
It also tells us how big it is. So sixty two thousand rows, three columns.

0:02:49.650,0:02:54.000
So we immediately get one of the things I said we want to know, which is how much data we have.

0:02:54.000,0:03:01.180
Right here, we already have that answer. We've got sixty-two thousand rows and three columns.
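
The loading step above might look roughly like the following sketch. In the video the call reads the real movies file from disk; here a small in-memory stand-in with made-up rows is used so that the example runs without the download. The variable name movies is an assumption.

```python
import io
import pandas as pd

# Stand-in for the real file: with the data set downloaded, you would pass the
# file path to pd.read_csv instead of this in-memory text (made-up rows).
csv_text = """movieId,title,genres
1,Example Movie (1999),Comedy
2,Another Film (2005),Drama|Romance
3,Third Picture (2010),Action
"""
movies = pd.read_csv(io.StringIO(csv_text))

# In Jupyter, ending a cell with just `movies` renders the table;
# in a plain script we print the first rows instead.
print(movies.head())

n_rows, n_cols = movies.shape  # (number of rows, number of columns)
print(n_rows, n_cols)
```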

0:03:01.180,0:03:10.750
So another way we can look at the data frame is we can use the info method and the info method will print out information about the data frame.

0:03:10.750,0:03:18.690
And in particular, it tells us what the index is.

0:03:18.690,0:03:25.670
We have a RangeIndex. It also tells us information about the different columns.

0:03:25.670,0:03:33.910
And so we've said that we have a RangeIndex. The range index goes from

0:03:33.910,0:03:37.390
zero to sixty-two thousand.

0:03:37.390,0:03:44.020
So we've got zero to sixty-two thousand four hundred and twenty-two, for sixty-two thousand four hundred and twenty-three entries.

0:03:44.020,0:03:47.710
A range index just means we're looking up the data by

0:03:47.710,0:03:56.560
index zero through n minus one. We have three columns. One of them, movieId, is an int64, and the other two are object.
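
As a small illustration of the range index described above, the sketch below builds a made-up three-row frame and inspects its default index. The frame contents are hypothetical; only the index behavior is the point.

```python
import pandas as pd

# Made-up stand-in frame; with no index specified, pandas gives it a RangeIndex.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example Movie (1999)", "Another Film (2005)", "Third Picture (2010)"],
})

idx = movies.index
print(idx)                     # RangeIndex(start=0, stop=3, step=1)
print(idx[0], idx[-1])         # first and last labels: 0 and n - 1
print(movies.loc[0, "title"])  # look up a row by its index label
```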

0:03:56.560,0:04:02.770
These store strings. Object is the data type pandas uses to store a string.

0:04:02.770,0:04:08.410
It can't just store the string directly in the column. The NumPy arrays we talked about last week,

0:04:08.410,0:04:13.600
they store data compactly, but for strings it has to store a pointer to the string.

0:04:13.600,0:04:18.260
And so pandas uses an array that stores pointers to objects, which can be any object.

0:04:18.260,0:04:20.700
We happen to know they're strings in this case.
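
The difference between compact numeric storage and object storage can be sketched directly in NumPy. The example values are made up; the point is only that an object array holds references to arbitrary Python objects rather than fixed-width values.

```python
import numpy as np

# A numeric array stores the values themselves, packed at a fixed width.
ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.dtype, ints.itemsize)  # int64, 8 bytes per value

# An object array stores references to Python objects instead, so its
# elements can be strings of any length, or any other kind of object.
objs = np.array(["Toy", "A much longer title", 42], dtype=object)
print(objs.dtype)
print([type(o).__name__ for o in objs])
```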

0:04:20.700,0:04:25.750
And we have, as I said, sixty-two thousand four hundred and twenty-three rows. That answers our initial question:

0:04:25.750,0:04:30.820
How much data do we have? Sixty two thousand four hundred twenty three rows each with three columns.

0:04:30.820,0:04:34.540
And what kinds of data do we have?

0:04:34.540,0:04:39.400
We have a movie ID that's an integer, and we have title and genres that are strings.
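
A rough sketch of answering "how much data" and "what kinds of data" in code, again on a made-up in-memory stand-in rather than the real file. Note as an assumption about your environment: depending on the pandas version, the string columns may be reported as object or as a dedicated string type.

```python
import io
import pandas as pd

# Made-up stand-in for the movies file.
csv_text = """movieId,title,genres
1,Example Movie (1999),Comedy
2,Another Film (2005),Drama|Romance
3,Third Picture (2010),Action
"""
movies = pd.read_csv(io.StringIO(csv_text))

movies.info()         # prints the index type, columns, dtypes, and memory use
print(movies.dtypes)  # just the data type of each column
print(len(movies))    # number of rows (instances)
```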

0:04:39.400,0:04:48.160
What is the data about? The data is about movies, and each row is a movie. The Datasheets for Datasets

0:04:48.160,0:04:57.760
paper talks about these in terms of instances: each row is an instance, and it represents something.

0:04:57.760,0:05:04.070
So in this case, the row is information about a movie and it represents a movie.

0:05:04.070,0:05:08.810
So each column of the data frame is a series.

0:05:08.810,0:05:13.520
As we mentioned last week, a series is an array that has an index associated with it.

0:05:13.520,0:05:21.650
We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns.

0:05:21.650,0:05:27.530
And so we can get the title column out of the movies data frame.

0:05:27.530,0:05:34.970
And it shows us the titles. At the bottom, it says this series has the name title and length sixty-two thousand four hundred twenty-three.

0:05:34.970,0:05:40.430
It's indexed from zero to sixty-two thousand four hundred twenty-two. And it has a dtype of object.

0:05:40.430,0:05:46.160
We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index.
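
The dictionary-style column access described above can be sketched as follows, using a made-up three-row frame in place of the real movies data.

```python
import pandas as pd

# Made-up stand-in for the movies frame.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example Movie (1999)", "Another Film (2005)", "Third Picture (2010)"],
    "genres": ["Comedy", "Drama|Romance", "Action"],
})

titles = movies["title"]         # dictionary-style access returns a Series
print(type(titles).__name__)     # Series
print(titles.name, len(titles))  # the column name and the number of entries

# Each column carries the same index as the frame it came from.
print(titles.index.equals(movies["genres"].index))
```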

0:05:46.160,0:05:51.710
All columns in the data frame share the same index. That's an important link between the different columns

0:05:51.710,0:05:58.210
that are in a data frame. So let's load another frame, the ratings frame.

0:05:58.210,0:06:02.860
We can look at its info. It has four columns.

0:06:02.860,0:06:06.970
It has twenty-five million instances. This is why this is called the MovieLens

0:06:06.970,0:06:10.660
25 Million data set. It contains twenty-five million ratings.

0:06:10.660,0:06:14.660
Just over twenty-five million: twenty-five million ninety-four.

0:06:14.660,0:06:24.860
And each row contains a user ID that's an integer, int64; a movie ID that's also an integer; a rating that's a floating-point value, float64,

0:06:24.860,0:06:32.660
which is double-precision floating point; and a timestamp, which is also an integer of type int64.

0:06:32.660,0:06:38.750
The whole thing is about seven hundred and sixty-three megabytes of memory. Notice there's not a plus here.

0:06:38.750,0:06:47.780
If you remember, the movies had a plus after that. That's because by default, it just measures the memory taken up by the pandas data frame itself.

0:06:47.780,0:06:52.040
If a column has an object type, it does not measure the size of the objects.

0:06:52.040,0:06:59.390
So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings.

0:06:59.390,0:07:03.980
But here we don't have any strings. We don't have any other object types. It's just ints and floats.

0:07:03.980,0:07:11.700
So it can tell us directly. This data frame takes seven hundred and sixty-two point nine megabytes.
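
The memory figure above can be checked with simple arithmetic: four 8-byte columns per row. The sketch below uses a made-up three-row ratings frame, then repeats the arithmetic assuming roughly 25 million rows. It also shows memory_usage with deep=True, which is how pandas can be asked to count the string objects as well.

```python
import io
import pandas as pd

# Made-up stand-in for the ratings file.
ratings_text = """userId,movieId,rating,timestamp
1,1,4.5,1100000000
1,2,3.0,1100000100
2,1,5.0,1200000000
"""
ratings = pd.read_csv(io.StringIO(ratings_text))

# Four 8-byte columns (int64 / float64): column memory is rows * 4 * 8 bytes.
column_bytes = ratings.memory_usage(index=False).sum()
print(column_bytes, len(ratings) * 4 * 8)

# The same arithmetic at roughly the scale in the video (about 25 million rows),
# expressed in units of 2**20 bytes.
approx_mb = 25_000_000 * 4 * 8 / 2**20
print(round(approx_mb, 1))

# For a frame with string columns, deep=True also measures the strings themselves.
movies = pd.DataFrame({"title": ["Example Movie (1999)", "Another Film (2005)"]})
shallow = movies.memory_usage(index=False).sum()
deep = movies.memory_usage(index=False, deep=True).sum()
print(shallow, deep)
```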

0:07:11.700,0:07:17.480
So data can also refer to other data. So ratings are instances.

0:07:17.480,0:07:26.190
In this ratings file we just loaded, ratings are instances themselves, but each connects a user to a movie.

0:07:26.190,0:07:32.490
So we have the rating, but it also references two other kinds of entities or objects: users and movies.

0:07:32.490,0:07:37.890
The rating doesn't just exist on its own, but it's provided by a user for a movie.

0:07:37.890,0:07:44.640
These work a lot like foreign keys in relational databases. We're going to see later how to do a merge so that we can actually, say,

0:07:44.640,0:07:55.200
link ratings to the movie information that they're associated with.
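
As a preview only, linking ratings to movie information might look like the sketch below. The frames and their values are made up, and the choice of a left merge on movieId is an assumption for illustration, not necessarily the approach used later in the course.

```python
import pandas as pd

# Made-up stand-ins for the two frames.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Example Movie (1999)", "Another Film (2005)"],
})
ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.5, 3.0, 5.0],
})

# Match each rating to its movie through the shared movieId column,
# much like following a foreign key in a relational database.
rated = ratings.merge(movies, on="movieId", how="left")
print(rated[["userId", "title", "rating"]])
```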

0:07:55.200,0:08:00.960
So to wrap up: a data frame consists of columns. Each column is a series, an array with an index.

0:08:00.960,0:08:06.660
We can quickly find out how many rows there are in a data frame,

0:08:06.660,0:08:11.760
the instances of the data. We can find out what columns there are and what data types those columns

0:08:11.760,0:08:26.500
have. We'll be talking later this week about more things we can do with that, and also more about understanding what the data being stored in these types is.