1 00:00:04,570 --> 00:00:09,490 Blow in this video. I want to talk with you about basic operations for manipulating data, 2 00:00:09,490 --> 00:00:15,100 learning outcomes for this video are for you to know key data reshaping operations and the corresponding Penders function. 3 00:00:15,100 --> 00:00:21,010 Think about the process of transforming data in steps. This is a tour guide to the corresponding notebook. 4 00:00:21,010 --> 00:00:22,750 I'm not showing the actual code in the video, 5 00:00:22,750 --> 00:00:27,670 but you're going to see in the notebook how you actually implement different versions of each of these steps. 6 00:00:27,670 --> 00:00:31,300 So we think about the shape of our data. We have rows and we have columns. 7 00:00:31,300 --> 00:00:36,610 We have a certain number of rows and columns, each of a number, a type in a name. 8 00:00:36,610 --> 00:00:42,460 The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing. 9 00:00:42,460 --> 00:00:46,210 So our data are well organized. Each row we have the variables. 10 00:00:46,210 --> 00:00:51,670 And that's gonna be one type of thing. So if you have a data frame of movies, each row represents one movie. 11 00:00:51,670 --> 00:00:55,180 If we have data that's not in that kind of a format, we're going to talk about that later. 12 00:00:55,180 --> 00:00:59,410 And sourcing and cleaning data. How do we get data in this kind of a tidy format? 13 00:00:59,410 --> 00:01:08,860 For now, we're going to assume we have data in this format. This kind of a layout are the are Eco-System calls this tidy Vaida data. 14 00:01:08,860 --> 00:01:13,480 Now, each of these methods return in new frame. A few of them are going to return a series. 15 00:01:13,480 --> 00:01:17,620 But in general, we're gonna be transforming data frames to data frames here. 16 00:01:17,620 --> 00:01:22,270 And so if our input is a data frame, each row is another observation of the same kind of thing. 17 00:01:22,270 --> 00:01:26,200 The output will be a data frame. Each row is an observation of the same kind of thing. 18 00:01:26,200 --> 00:01:30,820 It might be an observation, the same kind of thing as the input. It might be an observation of a different kind of thing. 19 00:01:30,820 --> 00:01:34,570 But these are the different operations that we're going to be talking about here. 20 00:01:34,570 --> 00:01:38,110 So if we want to select calls, we have a frame and we want the same frame. 21 00:01:38,110 --> 00:01:43,570 But with fewer columns, we have few options. We can pick one column by treating the frame as a dictionary. 22 00:01:43,570 --> 00:01:50,230 We can pick multiple columns bypassing in the list of column names to the same way we pick one column. 23 00:01:50,230 --> 00:01:55,420 One column will yield a series. Multiple columns will yield a frame. If we want to remove a column. 24 00:01:55,420 --> 00:02:02,020 So we want to keep all of the columns except one or two or however many that we name the drop method. 25 00:02:02,020 --> 00:02:09,170 Ream returns a frame with all the columns of the original frame except the ones you tell it to drop. 26 00:02:09,170 --> 00:02:13,840 We want to select rows. We have a frame. We want the same frame, but a subset of the rows. 27 00:02:13,840 --> 00:02:19,840 A few common ways to do that or to select by a boolean mask. We set up a PAN, the series that has boolean values. 28 00:02:19,840 --> 00:02:24,370 That's true. And all the for all the data positions we want to keep. 29 00:02:24,370 --> 00:02:29,410 And then we select and then we. So this is really good if we want to select by column values. 30 00:02:29,410 --> 00:02:36,100 So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value. 31 00:02:36,100 --> 00:02:40,450 And then we can select, we can select using that boolean mask. 32 00:02:40,450 --> 00:02:45,220 We can select by position in the in the frame, starting with zero. 33 00:02:45,220 --> 00:02:54,520 We can do that no matter what the index is with. I lock that is that's so lock is the location. 34 00:02:54,520 --> 00:03:01,780 Accessor for Panda's data frames, I lock Access's by integer position, always lock indexes by index keys. 35 00:03:01,780 --> 00:03:07,240 If we have the index keys we want, if we just load it. So we just loaded the data frame from a CSP file. 36 00:03:07,240 --> 00:03:11,110 We haven't specified any index options. It's using the default range index. 37 00:03:11,110 --> 00:03:15,220 Then selecting that position and index key are the same thing. 38 00:03:15,220 --> 00:03:19,750 If we have a data frame with, call with, whereas we've got our observations and we've got color, 39 00:03:19,750 --> 00:03:25,470 a column that identifies what some kind of a group that each observation is, then maybe it's ratings. 40 00:03:25,470 --> 00:03:29,950 It's the movie. Maybe it's movies. And it's the actor, the genre. 41 00:03:29,950 --> 00:03:35,710 And what we want is a frame or a series that has one row per group of the original data. 42 00:03:35,710 --> 00:03:42,550 And it's computing some kind of a statistic from a value and all the rows, all of the rows within that group. 43 00:03:42,550 --> 00:03:46,900 Then we want a group by an aggregate like we saw in the videos last week. 44 00:03:46,900 --> 00:03:50,980 A couple more transformations are to think about tall versus wide data. 45 00:03:50,980 --> 00:03:54,610 So why the data has a column per variable. 46 00:03:54,610 --> 00:04:04,600 So in this case, if this is data, this is data of the of this average speed for each of four different stages of a cycling race. 47 00:04:04,600 --> 00:04:12,450 And so we've got a column for each of the four different stages and end our rows or for each cyclist, total data. 48 00:04:12,450 --> 00:04:19,560 Has its simplest form toll data or long data has three columns. 49 00:04:19,560 --> 00:04:23,670 We have the road. We have the identifier. We have the variable name. 50 00:04:23,670 --> 00:04:27,990 And we have the variable value. Sometimes this will just be called idee, variable and value. 51 00:04:27,990 --> 00:04:33,420 But often it's often useful to give the variable and value name columns, meaningful names. 52 00:04:33,420 --> 00:04:36,570 We could also have more than one idea. Call it if we need to. 53 00:04:36,570 --> 00:04:43,740 But the idea here is that rather than having the stages in different columns, we split them out into a different row. 54 00:04:43,740 --> 00:04:48,420 So cycle one cyclist one has four rows, one for each of the four stages. 55 00:04:48,420 --> 00:04:54,660 We still call this an observation for one thing and for the same kind of thing. 56 00:04:54,660 --> 00:05:02,370 It's just in the wide data. Each of our observations is for a cyclist and it's an observation of their speed for all four stages. 57 00:05:02,370 --> 00:05:06,900 Whereas in the long data, each observation is for a cyclist. 58 00:05:06,900 --> 00:05:13,140 One cyclist in one particular stage. So each cyclist will have four observations, one for each stage. 59 00:05:13,140 --> 00:05:18,830 Total data is useful for plotting and grouping because a lot of our plotting function, 60 00:05:18,830 --> 00:05:27,430 YouTube plotting utility functions are going to want to deal with a categorical variable that we use to determine maybe the x axis. 61 00:05:27,430 --> 00:05:33,900 Maybe the color. And so often we're going to need tall data, especially when we're going to be plotting. 62 00:05:33,900 --> 00:05:43,230 If you want to term why data in the tall, you use melt. If you want to turn tall data into why use the pivot with a pivot table methods and pandas. 63 00:05:43,230 --> 00:05:50,610 You can also create tall data from a list. So if we have a data frame when one of the columns actually contains lists, 64 00:05:50,610 --> 00:05:58,020 we haven't seen any data with this so far except the John Rós column in the Waj Movieland data. 65 00:05:58,020 --> 00:06:05,480 But if we have a if we have a data frame or one column contains lists and we have and what we want is one row per list element. 66 00:06:05,480 --> 00:06:10,110 So we want to take this list that's in a column and split it out so that each element gets another row where it's going to go ahead, 67 00:06:10,110 --> 00:06:18,510 duplicate the rest of the column. So they're going to have their values repeated, whatever we're doing once for each of the elements. 68 00:06:18,510 --> 00:06:25,830 This list, the pandas explode method. We'll do that. Then finally, to convert between series and data frame. 69 00:06:25,830 --> 00:06:31,170 So if we have if we have a data frame and we want to get a series, we just select the column from the data. 70 00:06:31,170 --> 00:06:35,400 We saw that the beginning. If we have a series and we want to get a data frame, 71 00:06:35,400 --> 00:06:41,790 we can just create a single column frame with two frame and the two frame method on the serious object also. 72 00:06:41,790 --> 00:06:45,270 But to give it a names that you have a name and the resulting data frame, 73 00:06:45,270 --> 00:06:49,650 you can also if you want to create a multi column data frame where you've got a column for the value end, 74 00:06:49,650 --> 00:06:53,910 you have a column for the index of the original of the original series. 75 00:06:53,910 --> 00:07:00,630 The Pandas, the series Freeze Reset Index Method or pop that index out into a data frame column. 76 00:07:00,630 --> 00:07:11,340 And then finally, if you have a series with multiple levels to its index, we haven't seen those yet, but we're going to see them from time to time. 77 00:07:11,340 --> 00:07:18,190 The unstamped method will turn the inner most index labels in the column labels. 78 00:07:18,190 --> 00:07:22,450 To turn it series into a data. So to think about strategy. 79 00:07:22,450 --> 00:07:31,140 Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want. 80 00:07:31,140 --> 00:07:36,000 And so what I recommend is that you decide what you want the end product to look like. 81 00:07:36,000 --> 00:07:40,410 If you're going to draw a chart or you're going to do an analysis or an inference, 82 00:07:40,410 --> 00:07:46,710 what are the observations and the variables that you need for that chart or inference? 83 00:07:46,710 --> 00:07:54,210 And then once you've figured that out, you can plot a path from your current data to your end product. 84 00:07:54,210 --> 00:08:10,770 So. If you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select. 85 00:08:10,770 --> 00:08:15,300 The rows that have the movies only that are in the horror genre, you can select that. 86 00:08:15,300 --> 00:08:23,460 You're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out. 87 00:08:23,460 --> 00:08:33,720 And once you've filtered it down, OK, these are the horror movies that you need to get the ratings and you need to you need to be able. 88 00:08:33,720 --> 00:08:37,720 You need to get there. 89 00:08:37,720 --> 00:08:45,950 You need to have the average ratings, you need to combine those with the movies, as we've seen the ability to do in a previous video. 90 00:08:45,950 --> 00:08:52,750 And then you have the observations that you want. You need to be able to plot this kind of a path and what you have at the end product. 91 00:08:52,750 --> 00:08:58,350 In this example, I've reference some Joynes. We saw joints very, very briefly last week. 92 00:08:58,350 --> 00:09:05,310 We're going to see them again in more detail in the notebooks. So to wrap up, Penders has many tools for reshaping data. 93 00:09:05,310 --> 00:09:09,300 You want to start with the end in mind, work from what you have to what you need. 94 00:09:09,300 --> 00:09:18,167 Read the tutorial notebooks for a lot more details.