https:/.../d6054cf9-e3dc-4f1e-8d4e-ad9000efd604-07520140-80a4-4365-81c8-ad9101202ca1.mp4?invocationId=d9304d24-2f09-ec11-a9e9-0a1a827ad0ec

  • 0:04 - 0:14
    In this video I'm going to introduce loading a data file into pandas and starting to see what the shape and the structure of the data are.
  • 0:14 - 0:22
    The learning outcomes of this video are for you to be able to import Python libraries,
  • 0:22 - 0:27
    load a data file into pandas, examine the size and data types of a data frame,
  • 0:27 - 0:31
    and understand the relationship, particularly between a data frame and a series.
  • 0:31 - 0:37
    And we're going to start introducing the concept of an index. We're going to see a lot more about that in a later video.
  • 0:37 - 0:42
    So most useful Python functions are in modules.
  • 0:42 - 0:46
    There are some functions that are just built in, but most of the time we're going to need functions in various modules.
  • 0:46 - 0:50
    And before we can use a module, we have to import it.
  • 0:50 - 0:59
    So, some of the standard imports: basically every one of our notebooks is going to have imports at the top.
  • 0:59 - 1:03
    We import NumPy and import pandas. We've got our basic scientific computing facilities.
  • 1:03 - 1:17
    And common practice is to import these with aliases, np and pd, so that we can refer to them later by shorter names.
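A minimal sketch of the import convention described here. The small array and series are only there to show that the aliases refer to the same modules; they are illustrative and not from the video.

```python
# Conventional aliases for the two libraries used throughout the notebooks.
import numpy as np
import pandas as pd

# The aliases are just shorter names for the same modules.
values = np.array([1, 2, 3])
series = pd.Series(values)
print(series.sum())
```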
  • 1:17 - 1:20
    A lot of the files we've been working with, particularly early on,
  • 1:20 - 1:29
    are distributed in a format called comma-separated values. A comma-separated values file
  • 1:29 - 1:33
    consists of one line per record, and the values are separated by commas.
  • 1:33 - 1:39
    That's where it gets its name. So you've got a comma between the different values.
  • 1:39 - 1:44
    Also, sometimes, and in this case it does, the file will have a header.
  • 1:44 - 1:50
    So the first row is the names of the columns. It doesn't always have a header, but often it does.
  • 1:50 - 1:58
    And it's very convenient when it does. So pandas lets us read a CSV file through a function called read_csv.
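To make the file format concrete, here is a small sketch using Python's standard csv module rather than pandas. The sample text is made up for illustration; the column names (including the exact spelling `movieId`) are assumptions modeled on the columns mentioned in the video.

```python
import csv
import io

# Illustrative sample in the shape of a CSV file with a header row.
sample = """movieId,title,genres
1,Example Movie A (1995),Comedy
2,Example Movie B (1996),Drama|Romance
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)      # first line: the column names
records = list(reader)     # remaining lines: one record each

print(header)
for record in records:
    print(dict(zip(header, record)))
```

When a file has no header line, the first row is data, and the column names have to be supplied some other way.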
  • 1:58 - 2:03
    And so we call read_csv and we give it the ml-25m/movies.csv
  • 2:03 - 2:12
    file, which is from the data set that I have you download in the information for the week.
  • 2:12 - 2:20
    We get a data frame, and following the convention that I showed you in a video last week,
  • 2:20 - 2:28
    I then just write the variable I saved it in so that we can immediately see it printed. Jupyter
  • 2:28 - 2:35
    formats a pandas data frame nicely and shows us the first five rows
  • 2:35 - 2:43
    and the last five rows. We've got an ellipsis in the middle indicating that there's elided data.
  • 2:43 - 2:50
    It also tells us how big it is. So sixty two thousand rows, three columns.
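A hedged sketch of loading a CSV into a data frame and checking its size. So that it runs without the data set, it reads a tiny made-up sample from memory; the rows and the column spelling `movieId` are assumptions, not the real file contents.

```python
import io
import pandas as pd

sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Example Movie A (1995),Comedy\n"
    "2,Example Movie B (1996),Drama|Romance\n"
    "3,Example Movie C (1997),Action\n"
)

# read_csv accepts a path or a file-like object; with the real data set
# this would be the path to the movies file mentioned in the video.
movies = pd.read_csv(sample)

n_rows, n_cols = movies.shape   # (rows, columns)
print(n_rows, n_cols)
print(len(movies))              # number of rows
print(movies.head(2))           # first rows, similar to the notebook display
```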
  • 2:50 - 2:54
    So we immediately get one of the things I said we want to know, which is how much data we have.
  • 2:54 - 3:01
    Right here, we already have that answer. We've got sixty two thousand rows and three columns.
  • 3:01 - 3:11
    So another way we can look at the data frame is we can use the info method and the info method will print out information about the data frame.
  • 3:11 - 3:19
    And in particular, it tells us what the index is.
  • 3:19 - 3:26
    We have a range index, it tells us the information about the different columns.
  • 3:26 - 3:34
    And so it says that we have a range index. The range index goes from
  • 3:34 - 3:37
    zero to sixty two thousand.
  • 3:37 - 3:44
    So we've got zero to sixty two thousand four hundred and twenty two, for sixty two thousand four hundred and twenty three entries.
  • 3:44 - 3:48
    A range index just means we're looking up the data by
  • 3:48 - 3:57
    index, zero through n minus one. We have three columns. One of them, movie ID, is an int64 and the other two are object.
  • 3:57 - 4:03
    These store strings; object is the data type pandas uses to store a string.
  • 4:03 - 4:08
    It can't just store the string directly in the column. The NumPy arrays we talked about last week
  • 4:08 - 4:14
    store data compactly, but for strings, the array has to store a pointer to the string.
  • 4:14 - 4:18
    And so pandas uses an array that stores pointers to objects, which can be any object.
  • 4:18 - 4:21
    We happen to know they're strings in this case.
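A small sketch of inspecting the index and column types. The frame is a made-up stand-in, and the dtypes are set explicitly so the result matches what the video describes; some newer pandas versions may infer a dedicated string dtype for text columns instead of object.

```python
import pandas as pd

# Stand-in frame with hypothetical rows; astype pins the dtypes discussed
# in the video (int64 for the ID, object for the text columns).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example A", "Example B", "Example C"],
    "genres": ["Comedy", "Drama", "Action"],
}).astype({"movieId": "int64", "title": object, "genres": object})

movies.info()              # index type, per-column dtype, memory usage
print(movies.index)        # a RangeIndex counting rows from 0 to n - 1
print(movies.dtypes)

# An object column holds references to Python objects; here they are str.
print(type(movies["title"].iloc[0]))
```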
  • 4:21 - 4:26
    And we have, as I said, sixty two thousand four hundred and twenty three rows. That answers our initial question:
  • 4:26 - 4:31
    How much data do we have? Sixty two thousand four hundred twenty three rows each with three columns.
  • 4:31 - 4:35
    And what kinds of data do we have?
  • 4:35 - 4:39
    We have a movie ID that's an integer, and we have title and genres that are strings.
  • 4:39 - 4:48
    What is the data about? The data is about movies, and each row is a movie. The Datasheets for Datasets
  • 4:48 - 4:58
    paper talks about these in terms of instances: each row is an instance, and it represents something.
  • 4:58 - 5:04
    So in this case, the row is information about a movie and it represents a movie.
  • 5:04 - 5:09
    So each column of the data frame is a series.
  • 5:09 - 5:14
    As we mentioned last week, a series is an array that has an index associated with it.
  • 5:14 - 5:22
    We get a column by accessing the data frame like a dictionary. You can treat a data frame basically as a dictionary that contains columns.
  • 5:22 - 5:28
    And so we can get the title column out of the movies data frame.
  • 5:28 - 5:35
    And it shows us the titles. At the bottom, it says this series has the name title and length sixty two thousand four hundred twenty three.
  • 5:35 - 5:40
    It's indexed from zero to sixty two thousand four hundred twenty two. And it has a dtype of object.
  • 5:40 - 5:46
    We're going to learn a lot more about indexes in another video. But a series, as I said, is an array with an index.
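The dictionary-style column access can be sketched as follows, using a tiny made-up frame (the rows and the column names are illustrative assumptions).

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Example A", "Example B", "Example C"],
})

titles = movies["title"]        # dictionary-style lookup by column name
print(type(titles).__name__)    # the column comes back as a Series
print(titles.name, len(titles))
print(list(movies.keys()))      # column names, like dictionary keys

# The column carries the same index as the frame it came from.
print(titles.index.equals(movies.index))
```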
  • 5:46 - 5:52
    All columns of the data frame share the same index. That's an important link between the different columns
  • 5:52 - 5:58
    in a data frame. So let's load up another frame, the ratings frame.
  • 5:58 - 6:03
    We can look down at its info. It has four columns.
  • 6:03 - 6:07
    It has twenty five million instances. This is why this is called the MovieLens
  • 6:07 - 6:11
    25M data set. It contains twenty five million ratings.
  • 6:11 - 6:15
    Just over twenty five million: twenty five million ninety four.
  • 6:15 - 6:25
    And each row contains a user ID that's an integer, int64; a movie ID that's also an integer; a rating that's a floating point value, float64,
  • 6:25 - 6:33
    which is double precision floating point; and a timestamp, which is also an integer, of type int64.
  • 6:33 - 6:39
    The whole thing takes seven hundred and sixty two point nine megabytes of memory. Notice there's not a plus here.
  • 6:39 - 6:48
    If you remember, the movies had a plus after that. That's because by default, it just measures the memory taken up by the pandas data frame itself.
  • 6:48 - 6:52
    If a column has an object type, it does not measure the size of the objects.
  • 6:52 - 6:59
    So for movies, that was an underestimate because we have all these strings. It was not measuring how much memory is taken up by the strings.
  • 6:59 - 7:04
    But here we don't have any strings. We don't have any other object types. It's just ints and floats.
  • 7:04 - 7:12
    So it can tell us directly. This data frame takes seven hundred and sixty two point nine megabytes.
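A rough sketch of where the memory figure comes from, and of why object columns are undercounted by default. The frame here is a small synthetic stand-in with assumed column names, not the real ratings file.

```python
import numpy as np
import pandas as pd

n = 1000
ratings = pd.DataFrame({
    "userId": np.arange(n, dtype="int64"),
    "movieId": np.arange(n, dtype="int64"),
    "rating": np.full(n, 3.5, dtype="float64"),
    "timestamp": np.arange(n, dtype="int64"),
})

# Four 8-byte columns: the column data takes n * 4 * 8 bytes.
numeric_bytes = ratings.memory_usage(index=False).sum()
print(numeric_bytes, n * 4 * 8)

# Same arithmetic at roughly twenty five million rows, in units of 2**20 bytes.
approx_mb = 25_000_000 * 4 * 8 / 2**20
print(round(approx_mb, 1))

# For an object column, the default count covers only the references;
# deep=True also counts the string objects they refer to.
titles = pd.Series(["Example title %d" % i for i in range(n)], dtype=object)
shallow = titles.memory_usage(index=False)
deep = titles.memory_usage(index=False, deep=True)
print(shallow, deep)
```

The same deeper count is available for a whole frame by passing `memory_usage="deep"` to `info`, at the cost of a slower scan.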
  • 7:12 - 7:17
    So data can also refer to other data. So ratings are instances.
  • 7:17 - 7:26
    This ratings file we just loaded: ratings are instances themselves, but each connects a user to a movie.
  • 7:26 - 7:32
    So we have the rating, but it also references two other kinds of entities or objects: users and movies.
  • 7:32 - 7:38
    The rating doesn't just exist on its own, but it's provided by a user for a movie.
  • 7:38 - 7:45
    These work a lot like foreign keys in relational databases. We're going to see later how to do a merge so that we can actually, say,
  • 7:45 - 7:55
    link ratings to the movie information that they're associated with.
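As a preview only, a minimal sketch of what such a link could look like with the pandas `merge` method. The two small frames and their column names are made up for illustration; the later video may approach this differently.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Example A", "Example B"],
})
ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Each rating's movieId refers to a row in movies, much like a foreign key.
linked = ratings.merge(movies, on="movieId", how="left")
print(linked)
```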
  • 7:55 - 8:01
    So to wrap up: a data frame consists of columns. Each column is a series, an array with an index.
  • 8:01 - 8:07
    We can quickly find out how many rows there are in a data frame,
  • 8:07 - 8:12
    the instances of the data. We can find out what columns there are and what data types those columns
  • 8:12 - 8:26
    have. We'll be talking later this week about more things we can do with that, and also more about understanding what the data being stored in these types is.
Title:
https:/.../d6054cf9-e3dc-4f1e-8d4e-ad9000efd604-07520140-80a4-4365-81c8-ad9101202ca1.mp4?invocationId=d9304d24-2f09-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
08:26
