WEBVTT 00:00:03.860 --> 00:00:14.120 This video I'm going to introduce loading a data file into pandas and actually starting to see what the shape and the structure of the data is. 00:00:14.120 --> 00:00:22.140 So a lot of the data files that we're going to the learning outcomes of this video are for you to be able to import Python libraries, 00:00:22.140 --> 00:00:27.170 little data file on the pandas, examine the size and data types of a data frame, 00:00:27.170 --> 00:00:30.950 and understand the relationship, particularly between a data frame and a series. 00:00:30.950 --> 00:00:37.060 And we're going to start introducing the concept of an index. Going to see a lot more about that in a later video. 00:00:37.060 --> 00:00:41.530 So most useful Python Modu functions are in modules. 00:00:41.530 --> 00:00:46.480 There are some functions that are just built in, but most of the time we're going to need functions in various modules. 00:00:46.480 --> 00:00:50.500 And before we can use a module, we have to import it. 00:00:50.500 --> 00:00:58.830 So some of the standard imports, basically every one of our notebooks is going to have imports at the top. 00:00:58.830 --> 00:01:03.460 Not to import, not umpired pandas. We've got our basic scientific computing facilities. 00:01:03.460 --> 00:01:16.810 And common practice is to import these with aliases, NDP and PD, so that we can reach refer to them later by shorter names. 00:01:16.810 --> 00:01:19.750 A lot of the files we've been working with, particularly early on, 00:01:19.750 --> 00:01:28.850 are distributed in a format called comma separated value and comma separated value file is. 00:01:28.850 --> 00:01:32.960 It's consists of one line per record and the values are separated by commas. 00:01:32.960 --> 00:01:39.080 That's where it gets its name. So you've got a comma between the different values. 00:01:39.080 --> 00:01:44.000 Also, sometimes in this case it does. The file will have a header. 00:01:44.000 --> 00:01:49.940 So the first row is the names of the columns. It doesn't always have a header, but often it does. 00:01:49.940 --> 00:01:58.010 And it's very convenient when it does. So Pan does lets us read a CSB file through a function called Read CSP. 00:01:58.010 --> 00:02:02.570 And so we call reads ESV and we give it the Amelle twenty five M slash movies. 00:02:02.570 --> 00:02:11.960 That CSB file, which is from the data set that I have you download in the information for the week. 00:02:11.960 --> 00:02:20.400 We get the. We get. A data frame, and as the convention that I showed you in video last week, I, 00:02:20.400 --> 00:02:27.590 I then put just right the variables right, the variable I saves it ends that we can immediately see at Pynt Jupiter. 00:02:27.590 --> 00:02:35.370 Format's a panda's data frame nicely and it shows us the first five rows. 00:02:35.370 --> 00:02:43.410 Shows us the first five rows, the last five rows. We've got an ellipsis in the middle indicating that there's a lead of data. 00:02:43.410 --> 00:02:49.650 It also tells us how big it is. So sixty two thousand rows, three columns. 00:02:49.650 --> 00:02:54.000 So they immediately get one of the questions I said, do we want to know is how much data we have. 00:02:54.000 --> 00:03:01.180 Right here. We already have that answer. We've got. Sixty two thousand rows and three columns. 00:03:01.180 --> 00:03:10.750 So another way we can look at the data frame is we can use the info method and the info method will print out information about the data frame. 00:03:10.750 --> 00:03:18.690 And particularly we're going to see. The it tells us what the index is. 00:03:18.690 --> 00:03:25.670 We have a range index, it tells us the information about the different columns. 00:03:25.670 --> 00:03:33.910 And so we've said that we have a range index from a range index goes from. 00:03:33.910 --> 00:03:37.390 Zero to sixty two thousand. 00:03:37.390 --> 00:03:44.020 So we've got zero to sixty two thousand four hundred and twenty two for sixty two thousand four hundred and twenty three entries, 00:03:44.020 --> 00:03:47.710 a range index just means we're looking up the data. Bye bye. 00:03:47.710 --> 00:03:56.560 Index zero three minus one. We have three columns. One of them movie I.D. is an inch sixty 64 and the other two are object. 00:03:56.560 --> 00:04:02.770 These store strings objects is how so Panda's data types to store a string. 00:04:02.770 --> 00:04:08.410 It can't just store the string directly in the column number. The the num pi arrays we talked about last week. 00:04:08.410 --> 00:04:13.600 They store data compactly, but strings it has to store a pointer to the string. 00:04:13.600 --> 00:04:18.260 And so PANDAS uses that array of extorts pointers to objects that can be any object. 00:04:18.260 --> 00:04:20.700 We happen to know their strengths in this case. 00:04:20.700 --> 00:04:25.750 And we have, as I said, sixty two thousand four hundred and twenty three rows answers to our initial question. 00:04:25.750 --> 00:04:30.820 How much data do we have? Sixty two thousand four hundred twenty three rows each with three columns. 00:04:30.820 --> 00:04:34.540 And we also have. We have. And what kinds of data do we have? 00:04:34.540 --> 00:04:39.400 We have a movie idea that's an integer and we have title and genres that are strings. 00:04:39.400 --> 00:04:48.160 What is the data about? The data is about movies and each row is a movie and the data sheets for data, streets for data sets. 00:04:48.160 --> 00:04:57.760 Paper talks about these as terms of these are instances and they represent speech in each row is an instance and it represents something. 00:04:57.760 --> 00:05:04.070 So in this case, the row is information about a movie and it represents a movie. 00:05:04.070 --> 00:05:08.810 So each column of the data frame is a series. 00:05:08.810 --> 00:05:13.520 As we mentioned last week, a series is an array that has an index associated with it. 00:05:13.520 --> 00:05:21.650 We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns. 00:05:21.650 --> 00:05:27.530 And so we can get the movie, we can we can get the title column out of the movies data frame. 00:05:27.530 --> 00:05:34.970 And it shows us the titles, the bottom. It says this series has a name title length sixty two thousand four hundred twenty three. 00:05:34.970 --> 00:05:40.430 It's indexed from zero to sixty two thousand four hundred twenty two. And it has a D type of object. 00:05:40.430 --> 00:05:46.160 We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index. 00:05:46.160 --> 00:05:51.710 All columns, the data frame share the same index. That's an important link between the different columns. 00:05:51.710 --> 00:05:58.210 There enter data from. So let's load another frame up the ratings frame. 00:05:58.210 --> 00:06:02.860 We can look down at its info. It has four columns. 00:06:02.860 --> 00:06:06.970 It has twenty five million instances. This is why this is called the movie lens. 00:06:06.970 --> 00:06:10.660 Twenty five million data set. It contains twenty five million ratings. 00:06:10.660 --> 00:06:14.660 Just over twenty five million. Twenty five million. Ninety four. 00:06:14.660 --> 00:06:24.860 And each row contains a user I.D. that's an integer and 60 for a movie I.D. that's also an integer, a rating that's a floating point value float 64, 00:06:24.860 --> 00:06:32.660 which is double precision floating point and a timestamp, which is which is also an integer of type and 64. 00:06:32.660 --> 00:06:38.750 The whole thing's a six hundred and twenty three megabytes of memory. Remember, there's not a plus here. 00:06:38.750 --> 00:06:47.780 If you remember, the movies had a plus after that. That's because by default, it just measures the memory taken up by the panda's data frame itself. 00:06:47.780 --> 00:06:52.040 If a column has an object type, it does not measure the size of the objects. 00:06:52.040 --> 00:06:59.390 So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings. 00:06:59.390 --> 00:07:03.980 But here we don't have any strings. We don't have any other object types. It's just insane floats. 00:07:03.980 --> 00:07:11.700 So it can tell us directly. This data frames take seven hundred and sixty two point nine megabytes. 00:07:11.700 --> 00:07:17.480 So data can also refer to other data. So ratings are instances. 00:07:17.480 --> 00:07:26.190 This rating file we just loaded. Ratings are instances themselves. But but each connects a user to a movie. 00:07:26.190 --> 00:07:32.490 So we have the rating. But it also references to other kinds of entities or objects, users and movies. 00:07:32.490 --> 00:07:37.890 The rating doesn't just exist on its own, but it's provided by a user for a movie. 00:07:37.890 --> 00:07:44.640 Is work a lot like foreign keys and relational databases? We're going to see later how to do a merge so that we can actually, say, 00:07:44.640 --> 00:07:55.200 link ratings to the to the the movie information that they're associated with. 00:07:55.200 --> 00:08:00.960 So to wrap up a data frame consists of columns. Each column is a series, an array with an index. 00:08:00.960 --> 00:08:06.660 We can quickly find out how many rows, how many there are, and that in a data frame. 00:08:06.660 --> 00:08:11.760 The instances of the data, we can find out what columns there are, what data types those columns. 00:08:11.760 --> 00:08:26.500 Have we talking later this week about more things we can do with that and also more about understanding what the data being stored in these types is.