[Script Info] Title: [Events] Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text Dialogue: 0,0:00:03.86,0:00:14.12,Default,,0000,0000,0000,,In this video I'm going to introduce loading a data file into pandas and starting to see what the shape and structure of the data is. Dialogue: 0,0:00:14.12,0:00:22.14,Default,,0000,0000,0000,,The learning outcomes of this video are for you to be able to import Python libraries, Dialogue: 0,0:00:22.14,0:00:27.17,Default,,0000,0000,0000,,load a data file with pandas, examine the size and data types of a data frame, Dialogue: 0,0:00:27.17,0:00:30.95,Default,,0000,0000,0000,,and understand the relationship between a data frame and a series. Dialogue: 0,0:00:30.95,0:00:37.06,Default,,0000,0000,0000,,We're also going to start introducing the concept of an index. We're going to see a lot more about that in a later video. Dialogue: 0,0:00:37.06,0:00:41.53,Default,,0000,0000,0000,,So most useful Python functions are in modules. Dialogue: 0,0:00:41.53,0:00:46.48,Default,,0000,0000,0000,,There are some functions that are just built in, but most of the time we're going to need functions from various modules. Dialogue: 0,0:00:46.48,0:00:50.50,Default,,0000,0000,0000,,And before we can use a module, we have to import it. Dialogue: 0,0:00:50.50,0:00:58.83,Default,,0000,0000,0000,,So there are some standard imports; basically every one of our notebooks is going to have imports at the top. Dialogue: 0,0:00:58.83,0:01:03.46,Default,,0000,0000,0000,,We import NumPy and we import pandas. That gives us our basic scientific computing facilities. Dialogue: 0,0:01:03.46,0:01:16.81,Default,,0000,0000,0000,,Common practice is to import these with aliases, np and pd, so that we can refer to them later by shorter names. 
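A minimal sketch of the standard imports described above, assuming NumPy and pandas are installed. The `values` and `series` names are only illustrative and do not come from the video.

```python
# Standard notebook imports, using the conventional aliases.
import numpy as np
import pandas as pd

# The aliases let us refer to the modules by shorter names.
values = np.arange(3)         # NumPy array holding 0, 1, 2
series = pd.Series(values)    # pandas Series built from that array
print(series.sum())
```

Any alias would work, but `np` and `pd` are the names most readers will expect.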
Dialogue: 0,0:01:16.81,0:01:19.75,Default,,0000,0000,0000,,A lot of the files we'll be working with, particularly early on, Dialogue: 0,0:01:19.75,0:01:28.85,Default,,0000,0000,0000,,are distributed in a format called comma-separated values, or CSV. Dialogue: 0,0:01:28.85,0:01:32.96,Default,,0000,0000,0000,,A CSV file consists of one line per record, and the values are separated by commas. Dialogue: 0,0:01:32.96,0:01:39.08,Default,,0000,0000,0000,,That's where it gets its name: you've got a comma between the different values. Dialogue: 0,0:01:39.08,0:01:44.00,Default,,0000,0000,0000,,Also, sometimes, and in this case it does, the file will have a header. Dialogue: 0,0:01:44.00,0:01:49.94,Default,,0000,0000,0000,,So the first row is the names of the columns. It doesn't always have a header, but often it does. Dialogue: 0,0:01:49.94,0:01:58.01,Default,,0000,0000,0000,,And it's very convenient when it does. Pandas lets us read a CSV file through a function called read_csv. Dialogue: 0,0:01:58.01,0:02:02.57,Default,,0000,0000,0000,,So we call read_csv and we give it the path ml-25m/movies.csv, Dialogue: 0,0:02:02.57,0:02:11.96,Default,,0000,0000,0000,,a file from the data set that I have you download in the information for the week. Dialogue: 0,0:02:11.96,0:02:20.40,Default,,0000,0000,0000,,We get a data frame, and following the convention that I showed you in a video last week, Dialogue: 0,0:02:20.40,0:02:27.59,Default,,0000,0000,0000,,I then just write the variable I saved it in, so that we can immediately see it. Jupyter Dialogue: 0,0:02:27.59,0:02:35.37,Default,,0000,0000,0000,,formats a pandas data frame nicely. Dialogue: 0,0:02:35.37,0:02:43.41,Default,,0000,0000,0000,,It shows us the first five rows and the last five rows, with an ellipsis in the middle indicating that there's a lot more data. 
Dialogue: 0,0:02:43.41,0:02:49.65,Default,,0000,0000,0000,,It also tells us how big it is: about sixty-two thousand rows and three columns. Dialogue: 0,0:02:49.65,0:02:54.00,Default,,0000,0000,0000,,So we immediately get the answer to one of the questions I said we want to ask: how much data do we have? Dialogue: 0,0:02:54.00,0:03:01.18,Default,,0000,0000,0000,,Right here, we already have that answer. We've got about sixty-two thousand rows and three columns. Dialogue: 0,0:03:01.18,0:03:10.75,Default,,0000,0000,0000,,Another way we can look at the data frame is to use the info method, which prints out information about the data frame. Dialogue: 0,0:03:10.75,0:03:18.69,Default,,0000,0000,0000,,In particular, it tells us what the index is. Dialogue: 0,0:03:18.69,0:03:25.67,Default,,0000,0000,0000,,We have a range index, and it tells us information about the different columns. Dialogue: 0,0:03:25.67,0:03:33.91,Default,,0000,0000,0000,,So we've said that we have a range index. A range index goes from Dialogue: 0,0:03:33.91,0:03:37.39,Default,,0000,0000,0000,,zero up to about sixty-two thousand. Dialogue: 0,0:03:37.39,0:03:44.02,Default,,0000,0000,0000,,So we've got zero to 62,422, for 62,423 entries. Dialogue: 0,0:03:44.02,0:03:47.71,Default,,0000,0000,0000,,A range index just means we look up the data by Dialogue: 0,0:03:47.71,0:03:56.56,Default,,0000,0000,0000,,position, zero through n minus one. We have three columns. One of them, movieId, is an int64, and the other two are object. Dialogue: 0,0:03:56.56,0:04:02.77,Default,,0000,0000,0000,,These store strings. Object is the data type pandas uses to store a string. Dialogue: 0,0:04:02.77,0:04:08.41,Default,,0000,0000,0000,,It can't just store the string directly in the column. Recall the NumPy arrays we talked about last week. 
Dialogue: 0,0:04:08.41,0:04:13.60,Default,,0000,0000,0000,,They store data compactly, but for strings it has to store a pointer to the string. Dialogue: 0,0:04:13.60,0:04:18.26,Default,,0000,0000,0000,,And so pandas uses an array that stores pointers to objects, which can be any object. Dialogue: 0,0:04:18.26,0:04:20.70,Default,,0000,0000,0000,,We happen to know they're strings in this case. Dialogue: 0,0:04:20.70,0:04:25.75,Default,,0000,0000,0000,,And we have, as I said, 62,423 rows, which answers our initial question. Dialogue: 0,0:04:25.75,0:04:30.82,Default,,0000,0000,0000,,How much data do we have? 62,423 rows, each with three columns. Dialogue: 0,0:04:30.82,0:04:34.54,Default,,0000,0000,0000,,And what kinds of data do we have? Dialogue: 0,0:04:34.54,0:04:39.40,Default,,0000,0000,0000,,We have a movie ID that's an integer, and we have title and genres that are strings. Dialogue: 0,0:04:39.40,0:04:48.16,Default,,0000,0000,0000,,What is the data about? The data is about movies, and each row is a movie. The Datasheets for Datasets Dialogue: 0,0:04:48.16,0:04:57.76,Default,,0000,0000,0000,,paper talks about these in terms of instances: each row is an instance, and it represents something. Dialogue: 0,0:04:57.76,0:05:04.07,Default,,0000,0000,0000,,So in this case, the row is information about a movie and it represents a movie. Dialogue: 0,0:05:04.07,0:05:08.81,Default,,0000,0000,0000,,Each column of the data frame is a series. Dialogue: 0,0:05:08.81,0:05:13.52,Default,,0000,0000,0000,,As we mentioned last week, a series is an array that has an index associated with it. Dialogue: 0,0:05:13.52,0:05:21.65,Default,,0000,0000,0000,,We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns. 
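A minimal sketch of dictionary-style column access, assuming pandas is installed. The in-memory CSV is a stand-in for movies.csv: it uses the same column names, but the three rows are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for movies.csv: same column names, made-up rows.
csv_text = (
    "movieId,title,genres\n"
    "1,Movie A,Comedy\n"
    "2,Movie B,Drama|Romance\n"
    "3,Movie C,Action\n"
)
movies = pd.read_csv(io.StringIO(csv_text))

# Dictionary-style access returns the column as a Series.
titles = movies["title"]
print(type(titles).__name__)
print(titles.name, len(titles))

# The column shares its index with the data frame it came from.
print(titles.index.equals(movies.index))
```

With the real file, the call would take the path to movies.csv instead of the `io.StringIO` object.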
Dialogue: 0,0:05:21.65,0:05:27.53,Default,,0000,0000,0000,,And so we can get the title column out of the movies data frame. Dialogue: 0,0:05:27.53,0:05:34.97,Default,,0000,0000,0000,,It shows us the titles, and at the bottom it says this series has the name title and length 62,423. Dialogue: 0,0:05:34.97,0:05:40.43,Default,,0000,0000,0000,,It's indexed from zero to 62,422, and it has a dtype of object. Dialogue: 0,0:05:40.43,0:05:46.16,Default,,0000,0000,0000,,We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index. Dialogue: 0,0:05:46.16,0:05:51.71,Default,,0000,0000,0000,,All columns in the data frame share the same index. That's an important link between the different columns Dialogue: 0,0:05:51.71,0:05:58.21,Default,,0000,0000,0000,,in a data frame. So let's load another frame, the ratings frame. Dialogue: 0,0:05:58.21,0:06:02.86,Default,,0000,0000,0000,,We can look at its info. It has four columns. Dialogue: 0,0:06:02.86,0:06:06.97,Default,,0000,0000,0000,,It has twenty-five million instances. This is why this is called the MovieLens Dialogue: 0,0:06:06.97,0:06:10.66,Default,,0000,0000,0000,,25 Million data set: it contains twenty-five million ratings. Dialogue: 0,0:06:10.66,0:06:14.66,Default,,0000,0000,0000,,Just over twenty-five million: twenty-five million and ninety-four. Dialogue: 0,0:06:14.66,0:06:24.86,Default,,0000,0000,0000,,Each row contains a user ID that's an integer, an int64; a movie ID that's also an integer; a rating that's a floating-point value, float64, Dialogue: 0,0:06:24.86,0:06:32.66,Default,,0000,0000,0000,,which is double-precision floating point; and a timestamp, which is also an integer of type int64. Dialogue: 0,0:06:32.66,0:06:38.75,Default,,0000,0000,0000,,The whole thing takes 762.9 megabytes of memory. 
Notice there's not a plus here. Dialogue: 0,0:06:38.75,0:06:47.78,Default,,0000,0000,0000,,If you remember, the memory usage for movies had a plus after it. That's because by default, it just measures the memory taken up by the pandas data frame itself. Dialogue: 0,0:06:47.78,0:06:52.04,Default,,0000,0000,0000,,If a column has an object type, it does not measure the size of the objects. Dialogue: 0,0:06:52.04,0:06:59.39,Default,,0000,0000,0000,,So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings. Dialogue: 0,0:06:59.39,0:07:03.98,Default,,0000,0000,0000,,But here we don't have any strings. We don't have any other object types. It's just ints and floats. Dialogue: 0,0:07:03.98,0:07:11.70,Default,,0000,0000,0000,,So it can tell us directly: this data frame takes 762.9 megabytes. Dialogue: 0,0:07:11.70,0:07:17.48,Default,,0000,0000,0000,,Data can also refer to other data. So ratings are instances. Dialogue: 0,0:07:17.48,0:07:26.19,Default,,0000,0000,0000,,In this ratings file we just loaded, ratings are instances themselves, but each connects a user to a movie. Dialogue: 0,0:07:26.19,0:07:32.49,Default,,0000,0000,0000,,So we have the rating, but it also references two other kinds of entities or objects: users and movies. Dialogue: 0,0:07:32.49,0:07:37.89,Default,,0000,0000,0000,,The rating doesn't just exist on its own; it's provided by a user for a movie. Dialogue: 0,0:07:37.89,0:07:44.64,Default,,0000,0000,0000,,These work a lot like foreign keys in relational databases. We're going to see later how to do a merge so that we can actually, say, Dialogue: 0,0:07:44.64,0:07:55.20,Default,,0000,0000,0000,,link ratings to the movie information that they're associated with. Dialogue: 0,0:07:55.20,0:08:00.96,Default,,0000,0000,0000,,So to wrap up: a data frame consists of columns. Each column is a series, an array with an index. 
Dialogue: 0,0:08:00.96,0:08:06.66,Default,,0000,0000,0000,,We can quickly find out how many rows there are in a data frame, Dialogue: 0,0:08:06.66,0:08:11.76,Default,,0000,0000,0000,,the instances of the data. We can find out what columns there are and what data types those columns Dialogue: 0,0:08:11.76,0:08:26.50,Default,,0000,0000,0000,,have. We'll be talking later this week about more things we can do with that, and also more about understanding what the data being stored in these types is.
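The inspection steps recapped above can be sketched as follows, assuming pandas is installed. The in-memory CSV is a stand-in for the ratings file: it uses the same four column names, but the three rows are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the ratings file: same column names, made-up rows.
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,10,4.5,1000000000\n"
    "1,20,3.0,1000000100\n"
    "2,10,5.0,1000000200\n"
)
ratings = pd.read_csv(io.StringIO(csv_text))

# How much data do we have? shape is (rows, columns).
print(ratings.shape)

# What kinds of data do we have? One dtype per column.
print(ratings.dtypes)

# info() prints the index, the columns with their dtypes, and memory use.
ratings.info()

# deep=True also counts the contents of object columns, such as strings.
total_bytes = ratings.memory_usage(deep=True).sum()
print(total_bytes)
```

For a frame that holds only ints and floats, like this one, the default and deep memory figures should agree; they differ when object columns hold strings.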