Hello. In this video I want to talk with you about basic operations for manipulating data. The learning outcomes for this video are for you to know key data reshaping operations and the corresponding pandas functions, and to think about the process of transforming data in steps. This video is a tour guide to the corresponding notebook. I'm not showing the actual code in the video, but you're going to see in the notebook how you actually implement different versions of each of these steps.

So we think about the shape of our data. We have rows and we have columns: a certain number of rows and columns, each column with a type and a name. The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing, so our data are well organized. In each row we have the variables, and each row is one instance of one type of thing; if you have a data frame of movies, each row represents one movie. If we have data that's not in that kind of format, we're going to talk later, in sourcing and cleaning data, about how we get data into this kind of tidy format. For now, we're going to assume we have data in this format. This kind of layout is what the R ecosystem calls tidy data.

Now, each of these methods returns a new frame; a few of them are going to return a series. But in general, we're going to be transforming data frames into data frames here. So if our input is a data frame where each row is another observation of the same kind of thing, the output will be a data frame where each row is an observation of the same kind of thing. It might be an observation of the same kind of thing as the input, or it might be an observation of a different kind of thing. These are the different operations that we're going to be talking about here.

So if we want to select columns, we have a frame and we want the same frame but with fewer columns. We have a few options. We can pick one column by treating the frame as a dictionary, and we can pick multiple columns by passing a list of column names the same way we pick one column. One column will yield a series; multiple columns will yield a frame.
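As a quick sketch of what that column selection looks like in pandas (the frame and column names here are made up, just for illustration):

```python
import pandas as pd

# A small example frame: one row per movie
movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'year': [1979, 2009, 1995],
    'genre': ['Horror', 'Animation', 'Crime'],
})

titles = movies['title']             # one column name -> Series
subset = movies[['title', 'year']]   # list of column names -> DataFrame
```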
If we want to remove a column, keeping all of the columns except one or two or however many we name, the drop method returns a frame with all the columns of the original frame except the ones you tell it to drop.

If we want to select rows, we have a frame and we want the same frame but with a subset of the rows. A few common ways to do that: we can select by a boolean mask, where we set up a pandas series that has boolean values, true for all the positions we want to keep, and then we select with it. This is really good if we want to select by column values, because we can use a comparison operator to create a mask where all the values for one column are equal to a particular value, and then select using that boolean mask. We can also select by position in the frame, starting with zero, no matter what the index is, with iloc. loc is the location accessor for pandas data frames; iloc accesses by integer position, whereas loc indexes by index keys, if we have the index keys we want. If we just loaded the data frame from a CSV file and haven't specified any index options, it's using the default range index, and then selecting by position and by index key are the same thing.

If we have a data frame where we've got our observations and we've got a column that identifies some kind of group each observation belongs to, maybe it's ratings and the group is the movie, maybe it's movies and the group is the actor or the genre, and what we want is a frame or a series that has one row per group of the original data, computing some kind of statistic from a value across all of the rows within that group, then we want a group-by and aggregate, like we saw in the videos last week.
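A minimal sketch of those row and column operations, continuing with the hypothetical frame from above and a made-up ratings frame for the group-by:

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'year': [1979, 2009, 1995],
    'genre': ['Horror', 'Animation', 'Crime'],
})

# Drop a column: everything except 'year'
no_year = movies.drop(columns=['year'])

# Boolean mask: rows where 'genre' equals a particular value
mask = movies['genre'] == 'Horror'
horror = movies[mask]

# Positional vs. key-based selection
first_two = movies.iloc[0:2]   # by integer position
by_key = movies.loc[[0, 2]]    # by index key (same as position for a default range index)

# Group-by and aggregate: one row per group, mean of a value column
ratings = pd.DataFrame({
    'title': ['Alien', 'Alien', 'Up'],
    'rating': [4.5, 4.0, 5.0],
})
mean_ratings = ratings.groupby('title')['rating'].mean()
```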
A couple more transformations come from thinking about tall versus wide data. Wide data has a column per variable. So in this case, if this is data of the average speed for each of four different stages of a cycling race, we've got a column for each of the four stages, and our rows are one per cyclist. Tall data, or long data, in its simplest form has three columns: the row identifier, the variable name, and the variable value. Sometimes these will just be called id, variable, and value, but it's often useful to give the variable and value columns meaningful names. We could also have more than one id column if we need to. The idea here is that rather than having the stages in different columns, we split them out into different rows, so cyclist one has four rows, one for each of the four stages.

We still call each row an observation of one thing, and of the same kind of thing. It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages, whereas in the long data, each observation is for one cyclist in one particular stage, so each cyclist will have four observations, one for each stage. Tall data is useful for plotting and grouping, because a lot of our plotting utility functions are going to want a categorical variable that we use to determine maybe the x axis, maybe the color. So often we're going to need tall data, especially when we're going to be plotting.

If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas. You can also create tall data from a list. If we have a data frame where one of the columns actually contains lists, and we haven't seen any data like this so far except the genres column in the MovieLens data, and what we want is one row per list element, we want to take the list that's in a column and split it out so that each element gets its own row. The pandas explode method will do that, and it will duplicate the rest of the columns, so their values are repeated once for each element of the list.
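A sketch of melt, pivot, and explode, using a made-up version of the cycling example (the column names are assumptions for illustration):

```python
import pandas as pd

# Wide: one column per stage, one row per cyclist
wide = pd.DataFrame({
    'cyclist': ['A', 'B'],
    'stage1': [38.2, 36.9],
    'stage2': [40.1, 39.4],
})

# Wide -> tall: one row per (cyclist, stage) pair
tall = wide.melt(id_vars='cyclist', var_name='stage', value_name='speed')

# Tall -> wide again
wide_again = tall.pivot(index='cyclist', columns='stage', values='speed')

# Explode: a column containing lists becomes one row per list element,
# with the other columns' values repeated
films = pd.DataFrame({
    'title': ['Alien', 'Up'],
    'genres': [['Horror', 'SciFi'], ['Animation']],
})
one_per_genre = films.explode('genres')
```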
Then finally, to convert between series and data frames: if we have a data frame and we want a series, we just select the column from the data frame, as we saw at the beginning. If we have a series and we want a data frame, we can create a single-column frame with the to_frame method on the series object, but you'll want to give the series a name so that the column has a name in the resulting data frame. If you want to create a multi-column data frame, with one column for the values and one column for the index of the original series, the pandas series's reset_index method will pop that index out into a data frame column. And then finally, if you have a series with multiple levels to its index, which we haven't seen yet but are going to see from time to time, the unstack method will turn the innermost index labels into column labels, turning the series into a data frame.
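A quick sketch of those conversions, with made-up data for illustration:

```python
import pandas as pd

# A named series: the name becomes the column name in the frame
ratings = pd.Series([4.5, 5.0], index=['Alien', 'Up'], name='rating')

# Series -> single-column DataFrame
frame = ratings.to_frame()

# Series -> DataFrame with the index pulled out into its own column
with_index = ratings.reset_index()

# Series with a two-level index: unstack moves the inner level into columns
multi = pd.Series(
    [4.5, 4.0, 5.0],
    index=pd.MultiIndex.from_tuples(
        [('Alien', 'user1'), ('Alien', 'user2'), ('Up', 'user1')],
        names=['title', 'user'],
    ),
)
wide = multi.unstack()
```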
So, to think about strategy: each of these operations is an individual little building block, and we need to put them together to get from the data that we have to the data that we want. What I recommend is that you decide what you want the end product to look like. If you're going to draw a chart, or you're going to do an analysis or an inference, what are the observations and the variables that you need for that chart or inference? Once you've figured that out, you can plot a path from your current data to your end product.

So if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select only the rows for the movies that are in the horror genre. You're probably going to have a join as well, in order to get the genre table and the movie table connected, depending on how your data is laid out. Once you've filtered it down to the horror movies, you need to get the ratings: you need to compute the average ratings and combine those with the movies, as we've seen the ability to do in a previous video, and then you have the observations that you want. You need to be able to plot this kind of path from what you have to your end product. In this example I've referenced some joins; we saw joins very briefly last week, and we're going to see them again in more detail in the notebooks.

So, to wrap up: pandas has many tools for reshaping data. You want to start with the end in mind and work from what you have to what you need. Read the tutorial notebooks for a lot more details.
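To make the horror-ratings example above concrete, here is a rough sketch of what that path might look like in pandas, assuming hypothetical movies, genres, and ratings frames with made-up columns (not the actual course data):

```python
import pandas as pd

# Hypothetical tables, just to illustrate the shape of the pipeline
movies = pd.DataFrame({'movie_id': [1, 2], 'title': ['Alien', 'Up']})
genres = pd.DataFrame({'movie_id': [1, 1, 2], 'genre': ['Horror', 'SciFi', 'Animation']})
ratings = pd.DataFrame({'movie_id': [1, 1, 2], 'rating': [4.5, 4.0, 5.0]})

# Join movies to genres, then keep only the horror rows
horror = movies.merge(genres, on='movie_id')
horror = horror[horror['genre'] == 'Horror']

# Average rating per movie, then combine with the horror movies
mean_ratings = ratings.groupby('movie_id')['rating'].mean().rename('mean_rating')
result = horror.merge(mean_ratings.reset_index(), on='movie_id')
```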