https:/.../7ee3707b-0e2d-4a18-8de1-ad9601830ce7-e6d4df35-320f-4d95-ab81-ad9c006bb089.mp4?invocationId=fcdaf66c-a50f-ec11-a9e9-0a1a827ad0ec

  • Not Synced
    1
    00:00:04,570 --> 00:00:09,490
    Hello. In this video, I want to talk with you about basic operations for manipulating data.
  • Not Synced
    2
    00:00:09,490 --> 00:00:15,100
    The learning outcomes for this video are for you to know key data reshaping operations and the corresponding pandas functions, and to
  • Not Synced
    3
    00:00:15,100 --> 00:00:21,010
    think about the process of transforming data in steps. This is a tour guide to the corresponding notebook.
  • Not Synced
    4
    00:00:21,010 --> 00:00:22,750
    I'm not showing the actual code in the video,
  • Not Synced
    5
    00:00:22,750 --> 00:00:27,670
    but you're going to see in the notebook how you actually implement different versions of each of these steps.
  • Not Synced
    6
    00:00:27,670 --> 00:00:31,300
    So we think about the shape of our data. We have rows and we have columns.
  • Not Synced
    7
    00:00:31,300 --> 00:00:36,610
    We have a certain number of rows and columns, and each column has a type and a name.
  • Not Synced
    8
    00:00:36,610 --> 00:00:42,460
    The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
  • Not Synced
    9
    00:00:42,460 --> 00:00:46,210
    So our data are well organized. In each row we have the variables.
  • Not Synced
    10
    00:00:46,210 --> 00:00:51,670
    And that's gonna be one type of thing. So if you have a data frame of movies, each row represents one movie.
  • Not Synced
    11
    00:00:51,670 --> 00:00:55,180
    If we have data that's not in that kind of a format, we're going to talk about that later,
  • Not Synced
    12
    00:00:55,180 --> 00:00:59,410
    in sourcing and cleaning data: how do we get data into this kind of a tidy format?
  • Not Synced
    13
    00:00:59,410 --> 00:01:08,860
    For now, we're going to assume we have data in this format. This kind of layout is what the R ecosystem calls tidy data.
  • Not Synced
    14
    00:01:08,860 --> 00:01:13,480
    Now, each of these methods returns a new frame. A few of them are going to return a series.
  • Not Synced
    15
    00:01:13,480 --> 00:01:17,620
    But in general, we're gonna be transforming data frames to data frames here.
  • Not Synced
    16
    00:01:17,620 --> 00:01:22,270
    And so if our input is a data frame, each row is another observation of the same kind of thing.
  • Not Synced
    17
    00:01:22,270 --> 00:01:26,200
    The output will be a data frame. Each row is an observation of the same kind of thing.
  • Not Synced
    18
    00:01:26,200 --> 00:01:30,820
    It might be an observation of the same kind of thing as the input. It might be an observation of a different kind of thing.
  • Not Synced
    19
    00:01:30,820 --> 00:01:34,570
    But these are the different operations that we're going to be talking about here.
  • Not Synced
    20
    00:01:34,570 --> 00:01:38,110
    So if we want to select columns, we have a frame and we want the same frame
  • Not Synced
    21
    00:01:38,110 --> 00:01:43,570
    but with fewer columns. We have a few options. We can pick one column by treating the frame as a dictionary.
  • Not Synced
    22
    00:01:43,570 --> 00:01:50,230
    We can pick multiple columns by passing in a list of column names, the same way we pick one column.
  • Not Synced
    23
    00:01:50,230 --> 00:01:55,420
    One column will yield a series. Multiple columns will yield a frame. If we want to remove a column,
  • Not Synced
    24
    00:01:55,420 --> 00:02:02,020
    so we want to keep all of the columns except one or two or however many that we name, the drop method
  • Not Synced
    25
    00:02:02,020 --> 00:02:09,170
    returns a frame with all the columns of the original frame except the ones you tell it to drop.
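As a minimal sketch of the column operations just described (the `movies` frame and its values are made up for illustration and are not the notebook's data):

```python
import pandas as pd

# Hypothetical movie data, invented for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Up"],
    "year": [1979, 1995, 2009],
    "genre": ["Horror", "Crime", "Animation"],
})

# One column, dictionary-style: the result is a Series.
titles = movies["title"]

# A list of column names: the result is a DataFrame.
title_year = movies[["title", "year"]]

# drop returns a new frame without the named columns;
# the original frame is left unchanged.
no_genre = movies.drop(columns=["genre"])
```

Note the double brackets in the second selection: the inner pair is the Python list of column names.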
  • Not Synced
    26
    00:02:09,170 --> 00:02:13,840
    We want to select rows. We have a frame. We want the same frame, but a subset of the rows.
  • Not Synced
    27
    00:02:13,840 --> 00:02:19,840
    There are a few common ways to do that. One is to select by a boolean mask. We set up a pandas series that has boolean values
  • Not Synced
    28
    00:02:19,840 --> 00:02:24,370
    that are true for all the data positions we want to keep.
  • Not Synced
    29
    00:02:24,370 --> 00:02:29,410
    And then we select with that mask. So this is really good if we want to select by column values.
  • Not Synced
    30
    00:02:29,410 --> 00:02:36,100
    So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
  • Not Synced
    31
    00:02:36,100 --> 00:02:40,450
    And then we can select using that boolean mask.
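A small sketch of selecting rows with a boolean mask, again on made-up data (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical movie data, invented for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Scream", "Up"],
    "genre": ["Horror", "Crime", "Horror", "Animation"],
})

# The comparison produces a boolean Series, one value per row.
is_horror = movies["genre"] == "Horror"

# Indexing the frame with the mask keeps only the rows where it is True.
horror = movies[is_horror]
```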
  • Not Synced
    32
    00:02:40,450 --> 00:02:45,220
    We can select by position in the frame, starting with zero.
  • Not Synced
    33
    00:02:45,220 --> 00:02:54,520
    We can do that no matter what the index is, with iloc. So loc is the location
  • Not Synced
    34
    00:02:54,520 --> 00:03:01,780
    accessor for pandas data frames. iloc always accesses by integer position; loc indexes by index keys,
  • Not Synced
    35
    00:03:01,780 --> 00:03:07,240
    if we have the index keys we want. If we've just loaded the data frame from a CSV file,
  • Not Synced
    36
    00:03:07,240 --> 00:03:11,110
    we haven't specified any index options, and it's using the default range index,
  • Not Synced
    37
    00:03:11,110 --> 00:03:15,220
    then selecting by position and by index key are the same thing.
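To make the difference between `iloc` and `loc` concrete, here is a short sketch on an invented frame. With the default range index the two agree; after setting a different index they do not.

```python
import pandas as pd

# Hypothetical data, invented for this example.
ratings = pd.DataFrame({
    "title": ["Alien", "Heat", "Up"],
    "rating": [4.1, 4.0, 4.2],
})

# Default RangeIndex: position 0 and index key 0 are the same row.
first_by_pos = ratings.iloc[0]
first_by_key = ratings.loc[0]

# With titles as the index, iloc still counts positions from zero,
# while loc looks rows up by their index key.
by_title = ratings.set_index("title")
second_row = by_title.iloc[1]
up_rating = by_title.loc["Up", "rating"]
```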
  • Not Synced
    38
    00:03:15,220 --> 00:03:19,750
    If we have a data frame where we've got our observations and we've got
  • Not Synced
    39
    00:03:19,750 --> 00:03:25,470
    a column that identifies some kind of a group that each observation is in (maybe it's ratings
  • Not Synced
    40
    00:03:25,470 --> 00:03:29,950
    and it's the movie; maybe it's movies and it's the actor or the genre),
  • Not Synced
    41
    00:03:29,950 --> 00:03:35,710
    and what we want is a frame or a series that has one row per group of the original data,
  • Not Synced
    42
    00:03:35,710 --> 00:03:42,550
    and it's computing some kind of a statistic from a value in all of the rows within that group,
  • Not Synced
    43
    00:03:42,550 --> 00:03:46,900
    then we want a group-by and aggregate, like we saw in the videos last week.
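A minimal group-by and aggregate sketch, assuming a made-up ratings frame with one row per rating:

```python
import pandas as pd

# Hypothetical ratings, one row per rating; invented for this example.
ratings = pd.DataFrame({
    "movie": ["Alien", "Alien", "Heat", "Heat", "Heat"],
    "rating": [4.0, 5.0, 3.0, 4.0, 5.0],
})

# One statistic per group: a Series indexed by movie.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several statistics per group: a DataFrame with one row per movie.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])
```

The input has one row per rating; both outputs have one row per movie, so the kind of thing each row observes has changed.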
  • Not Synced
    44
    00:03:46,900 --> 00:03:50,980
    A couple more transformations are to think about tall versus wide data.
  • Not Synced
    45
    00:03:50,980 --> 00:03:54,610
    So wide data has a column per variable.
  • Not Synced
    46
    00:03:54,610 --> 00:04:04,600
    So in this case, this is data of the average speed for each of four different stages of a cycling race.
  • Not Synced
    47
    00:04:04,600 --> 00:04:12,450
    And so we've got a column for each of the four different stages, and our rows are for each cyclist.
  • Not Synced
    48
    00:04:12,450 --> 00:04:19,560
    In its simplest form, tall data, or long data, has three columns.
  • Not Synced
    49
    00:04:19,560 --> 00:04:23,670
    We have the identifier. We have the variable name.
  • Not Synced
    50
    00:04:23,670 --> 00:04:27,990
    And we have the variable value. Sometimes these will just be called id, variable, and value.
  • Not Synced
    51
    00:04:27,990 --> 00:04:33,420
    But it's often useful to give the variable and value columns meaningful names.
  • Not Synced
    52
    00:04:33,420 --> 00:04:36,570
    We could also have more than one ID column if we need to.
  • Not Synced
    53
    00:04:36,570 --> 00:04:43,740
    But the idea here is that rather than having the stages in different columns, we split them out into different rows.
  • Not Synced
    54
    00:04:43,740 --> 00:04:48,420
    So cyclist one has four rows, one for each of the four stages.
  • Not Synced
    55
    00:04:48,420 --> 00:04:54,660
    We still call this an observation for one thing and for the same kind of thing.
  • Not Synced
    56
    00:04:54,660 --> 00:05:02,370
    It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages.
  • Not Synced
    57
    00:05:02,370 --> 00:05:06,900
    Whereas in the long data, each observation is for a cyclist in a stage:
  • Not Synced
    58
    00:05:06,900 --> 00:05:13,140
    one cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
  • Not Synced
    59
    00:05:13,140 --> 00:05:18,830
    Tall data is useful for plotting and grouping, because a lot of our
  • Not Synced
    60
    00:05:18,830 --> 00:05:27,430
    plotting utility functions are going to want to deal with a categorical variable that we use to determine maybe the x axis,
  • Not Synced
    61
    00:05:27,430 --> 00:05:33,900
    maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
  • Not Synced
    62
    00:05:33,900 --> 00:05:43,230
    If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, use the pivot or pivot_table methods in pandas.
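A sketch of the wide-to-tall and tall-to-wide round trip for the cycling example. The speeds and the column names (`cyclist`, `stage1` and so on) are invented here, not taken from the slides.

```python
import pandas as pd

# Hypothetical wide data: one row per cyclist, one column per stage.
wide = pd.DataFrame({
    "cyclist": ["A", "B"],
    "stage1": [40.1, 39.5],
    "stage2": [38.2, 38.9],
    "stage3": [41.0, 40.2],
    "stage4": [37.5, 37.9],
})

# melt: wide -> tall, with meaningful names for the variable and value columns.
tall = wide.melt(id_vars="cyclist", var_name="stage", value_name="speed")

# pivot: tall -> wide again, one column per stage, indexed by cyclist.
wide_again = tall.pivot(index="cyclist", columns="stage", values="speed")
```

The tall frame has one row per cyclist and stage (eight rows here), which is the shape most plotting helpers expect when a categorical column drives the x axis or color.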
  • Not Synced
    63
    00:05:43,230 --> 00:05:50,610
    You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists
  • Not Synced
    64
    00:05:50,610 --> 00:05:58,020
    (we haven't seen any data with this so far, except the genres column in the MovieLens data),
  • Not Synced
    65
    00:05:58,020 --> 00:06:05,480
    if we have a data frame where one column contains lists, and what we want is one row per list element,
  • Not Synced
    66
    00:06:05,480 --> 00:06:10,110
    so we want to take this list that's in a column and split it out so that each element gets another row, and it's going to go ahead and
  • Not Synced
    67
    00:06:10,110 --> 00:06:18,510
    duplicate the rest of the columns, so they're going to have their values repeated once for each of the elements of
  • Not Synced
    68
    00:06:18,510 --> 00:06:25,830
    this list, the pandas explode method will do that. Then finally, to convert between series and data frame.
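The explode step described above can be sketched as follows, on a made-up frame whose `genres` column already holds Python lists (real data may need to be split into lists first):

```python
import pandas as pd

# Hypothetical data with a list-valued column, invented for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Up"],
    "genres": [["Horror", "Sci-Fi"], ["Animation"]],
})

# One row per list element; the other columns (and the index labels)
# are repeated for each element of the list.
per_genre = movies.explode("genres")
```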
  • Not Synced
    69
    00:06:25,830 --> 00:06:31,170
    So if we have a data frame and we want to get a series, we just select the column from the data frame.
  • Not Synced
    70
    00:06:31,170 --> 00:06:35,400
    We saw that at the beginning. If we have a series and we want to get a data frame,
  • Not Synced
    71
    00:06:35,400 --> 00:06:41,790
    we can just create a single-column frame with to_frame, the to_frame method on the series object. It also
  • Not Synced
    72
    00:06:41,790 --> 00:06:45,270
    lets you give it a name, so that you have a name in the resulting data frame.
  • Not Synced
    73
    00:06:45,270 --> 00:06:49,650
    Also, if you want to create a multi-column data frame where you've got a column for the value and
  • Not Synced
    74
    00:06:49,650 --> 00:06:53,910
    you have a column for the index of the original series,
  • Not Synced
    75
    00:06:53,910 --> 00:07:00,630
    the pandas series reset_index method will pop that index out into a data frame column.
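A brief sketch of both series-to-frame conversions, using an invented series of mean ratings indexed by movie:

```python
import pandas as pd

# Hypothetical series with a named index, invented for this example.
mean_rating = pd.Series(
    [4.5, 4.0],
    index=pd.Index(["Alien", "Heat"], name="movie"),
)

# to_frame: a single-column frame; the index stays the index.
one_col = mean_rating.to_frame(name="mean_rating")

# reset_index: the index becomes an ordinary column next to the values.
two_col = mean_rating.reset_index(name="mean_rating")
```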
  • Not Synced
    76
    00:07:00,630 --> 00:07:11,340
    And then finally, if you have a series with multiple levels to its index (we haven't seen those yet, but we're going to see them from time to time),
  • Not Synced
    77
    00:07:11,340 --> 00:07:18,190
    the unstack method will turn the innermost index labels into column labels,
  • Not Synced
    78
    00:07:18,190 --> 00:07:22,450
    to turn the series into a data frame. So, to think about strategy.
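For the unstack case mentioned just before the strategy discussion, a small sketch, assuming an invented series with a two-level (cyclist, stage) index:

```python
import pandas as pd

# Hypothetical tall data, invented for this example.
tall = pd.DataFrame({
    "cyclist": ["A", "A", "B", "B"],
    "stage": ["stage1", "stage2", "stage1", "stage2"],
    "speed": [40.1, 38.2, 39.5, 38.9],
})

# A Series whose index has two levels: cyclist (outer) and stage (inner).
speed = tall.set_index(["cyclist", "stage"])["speed"]

# unstack moves the innermost level (stage) into the column labels.
wide = speed.unstack()
```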
  • Not Synced
    79
    00:07:22,450 --> 00:07:31,140
    Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
  • Not Synced
    80
    00:07:31,140 --> 00:07:36,000
    And so what I recommend is that you decide what you want the end product to look like.
  • Not Synced
    81
    00:07:36,000 --> 00:07:40,410
    If you're going to draw a chart or you're going to do an analysis or an inference,
  • Not Synced
    82
    00:07:40,410 --> 00:07:46,710
    what are the observations and the variables that you need for that chart or inference?
  • Not Synced
    83
    00:07:46,710 --> 00:07:54,210
    And then once you've figured that out, you can plot a path from your current data to your end product.
  • Not Synced
    84
    00:07:54,210 --> 00:08:10,770
    So, if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select
  • Not Synced
    85
    00:08:10,770 --> 00:08:15,300
    only the rows for the movies that are in the horror genre. You can select that.
  • Not Synced
    86
    00:08:15,300 --> 00:08:23,460
    You're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out.
  • Not Synced
    87
    00:08:23,460 --> 00:08:33,720
    And once you've filtered it down (OK, these are the horror movies), then you need to get the ratings.
  • Not Synced
    88
    00:08:33,720 --> 00:08:37,720
    You need to get their
  • Not Synced
    89
    00:08:37,720 --> 00:08:45,950
    average ratings, and you need to combine those with the movies, as we've seen the ability to do in a previous video.
  • Not Synced
    90
    00:08:45,950 --> 00:08:52,750
    And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
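The horror-movie path just described might look something like this. It is only a sketch under assumed table layouts: the three frames, their column names, and their values are invented here, and the real data may be laid out differently.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alien", "Heat", "Scream"],
})
genres = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "genre": ["Horror", "Sci-Fi", "Crime", "Horror"],
})
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Step 1: select the genre rows for horror and keep only the movie IDs.
horror_ids = genres.loc[genres["genre"] == "Horror", ["movie_id"]]

# Step 2: join to the movie table so that each row is one horror movie.
horror_movies = movies.merge(horror_ids, on="movie_id")

# Step 3: group and aggregate the ratings to one mean rating per movie.
mean_ratings = (
    ratings.groupby("movie_id")["rating"].mean().reset_index(name="mean_rating")
)

# Step 4: combine the mean ratings with the horror movies.
horror_stats = horror_movies.merge(mean_ratings, on="movie_id")
```

Each row of `horror_stats` is one horror movie with its mean rating, which is the set of observations a distribution plot of that column would need.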
  • Not Synced
    91
    00:08:52,750 --> 00:08:58,350
    In this example, I've referenced some joins. We saw joins very, very briefly last week.
  • Not Synced
    92
    00:08:58,350 --> 00:09:05,310
    We're going to see them again in more detail in the notebooks. So to wrap up, pandas has many tools for reshaping data.
  • Not Synced
    93
    00:09:05,310 --> 00:09:09,300
    You want to start with the end in mind, work from what you have to what you need.
  • Not Synced
    94
    00:09:09,300 --> 00:09:18,167
    Read the tutorial notebooks for a lot more details.
Title:
https:/.../7ee3707b-0e2d-4a18-8de1-ad9601830ce7-e6d4df35-320f-4d95-ab81-ad9c006bb089.mp4?invocationId=fcdaf66c-a50f-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
09:18