Hello. In this video I want to talk with you about basic operations for manipulating data. The learning outcomes for this video are for you to know key data reshaping operations and the corresponding pandas functions, and to think about the process of transforming data in steps. This video is a tour guide to the corresponding notebook. I'm not showing the actual code in the video, but you're going to see in the notebook how you actually implement different versions of each of these steps.

So we think about the shape of our data. We have rows and we have columns: a certain number of rows and columns, each column with a type and a name. The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing, so our data are well organized. In each row we have the variables, and each row is one instance of one type of thing; if you have a data frame of movies, each row represents one movie. If we have data that's not in that kind of format, we're going to talk later, in sourcing and cleaning data, about how we get data into this kind of tidy format. For now, we're going to assume we have data in this format. This kind of layout is what the R ecosystem calls tidy data.

Now, each of these methods returns a new frame; a few of them are going to return a series. But in general, we're going to be transforming data frames into data frames here. So if our input is a data frame where each row is another observation of the same kind of thing, the output will be a data frame where each row is an observation of the same kind of thing. It might be an observation of the same kind of thing as the input, or it might be an observation of a different kind of thing. These are the different operations that we're going to be talking about here.

So if we want to select columns, we have a frame and we want the same frame but with fewer columns. We have a few options. We can pick one column by treating the frame as a dictionary, and we can pick multiple columns by passing a list of column names the same way we pick one column. One column will yield a series; multiple columns will yield a frame.
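As a quick sketch of what that column selection looks like in pandas (the frame and column names here are made up, just for illustration):

```python
import pandas as pd

# A small example frame: one row per movie
movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'year': [1979, 2009, 1995],
    'genre': ['Horror', 'Animation', 'Crime'],
})

titles = movies['title']             # one column name -> Series
subset = movies[['title', 'year']]   # list of column names -> DataFrame
```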
If we want to remove a column, keeping all of the columns except one or two or however many we name, the drop method returns a frame with all the columns of the original frame except the ones you tell it to drop.

If we want to select rows, we have a frame and we want the same frame but with a subset of the rows. A few common ways to do that: we can select by a boolean mask, where we set up a pandas series that has boolean values, true for all the positions we want to keep, and then we select with it. This is really good if we want to select by column values, because we can use a comparison operator to create a mask where all the values for one column are equal to a particular value, and then select using that boolean mask. We can also select by position in the frame, starting with zero, no matter what the index is, with iloc. loc is the location accessor for pandas data frames; iloc accesses by integer position, whereas loc indexes by index keys, if we have the index keys we want. If we just loaded the data frame from a CSV file and haven't specified any index options, it's using the default range index, and then selecting by position and by index key are the same thing.

If we have a data frame where we've got our observations and we've got a column that identifies some kind of group each observation belongs to, maybe it's ratings and the group is the movie, maybe it's movies and the group is the actor or the genre, and what we want is a frame or a series that has one row per group of the original data, computing some kind of statistic from a value across all of the rows within that group, then we want a group-by and aggregate, like we saw in the videos last week.
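A minimal sketch of those row and column operations, continuing with the hypothetical frame from above and a made-up ratings frame for the group-by:

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'year': [1979, 2009, 1995],
    'genre': ['Horror', 'Animation', 'Crime'],
})

# Drop a column: everything except 'year'
no_year = movies.drop(columns=['year'])

# Boolean mask: rows where 'genre' equals a particular value
mask = movies['genre'] == 'Horror'
horror = movies[mask]

# Positional vs. key-based selection
first_two = movies.iloc[0:2]   # by integer position
by_key = movies.loc[[0, 2]]    # by index key (same as position for a default range index)

# Group-by and aggregate: one row per group, mean of a value column
ratings = pd.DataFrame({
    'title': ['Alien', 'Alien', 'Up'],
    'rating': [4.5, 4.0, 5.0],
})
mean_ratings = ratings.groupby('title')['rating'].mean()
```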
A couple more transformations come from thinking about tall versus wide data. Wide data has a column per variable. So in this case, if this is data of the average speed for each of four different stages of a cycling race, we've got a column for each of the four stages, and our rows are one per cyclist. Tall data, or long data, in its simplest form has three columns: the row identifier, the variable name, and the variable value. Sometimes these will just be called id, variable, and value, but it's often useful to give the variable and value columns meaningful names. We could also have more than one id column if we need to. The idea here is that rather than having the stages in different columns, we split them out into different rows, so cyclist one has four rows, one for each of the four stages.

We still call each row an observation of one thing, and of the same kind of thing. It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages, whereas in the long data, each observation is for one cyclist in one particular stage, so each cyclist will have four observations, one for each stage. Tall data is useful for plotting and grouping, because a lot of our plotting utility functions are going to want a categorical variable that we use to determine maybe the x axis, maybe the color. So often we're going to need tall data, especially when we're going to be plotting.

If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas. You can also create tall data from a list. If we have a data frame where one of the columns actually contains lists, and we haven't seen any data like this so far except the genres column in the MovieLens data, and what we want is one row per list element, we want to take the list that's in a column and split it out so that each element gets its own row. The pandas explode method will do that, and it will duplicate the rest of the columns, so their values are repeated once for each element of the list.
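A sketch of melt, pivot, and explode, using a made-up version of the cycling example (the column names are assumptions for illustration):

```python
import pandas as pd

# Wide: one column per stage, one row per cyclist
wide = pd.DataFrame({
    'cyclist': ['A', 'B'],
    'stage1': [38.2, 36.9],
    'stage2': [40.1, 39.4],
})

# Wide -> tall: one row per (cyclist, stage) pair
tall = wide.melt(id_vars='cyclist', var_name='stage', value_name='speed')

# Tall -> wide again
wide_again = tall.pivot(index='cyclist', columns='stage', values='speed')

# Explode: a column containing lists becomes one row per list element,
# with the other columns' values repeated
films = pd.DataFrame({
    'title': ['Alien', 'Up'],
    'genres': [['Horror', 'SciFi'], ['Animation']],
})
one_per_genre = films.explode('genres')
```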
Then finally, to convert between series and data frames: if we have a data frame and we want a series, we just select the column from the data frame, as we saw at the beginning. If we have a series and we want a data frame, we can create a single-column frame with the to_frame method on the series object, but you'll want to give the series a name so that the column has a name in the resulting data frame. If you want to create a multi-column data frame, with one column for the values and one column for the index of the original series, the pandas series's reset_index method will pop that index out into a data frame column. And then finally, if you have a series with multiple levels to its index, which we haven't seen yet but are going to see from time to time, the unstack method will turn the innermost index labels into column labels, turning the series into a data frame.
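A quick sketch of those conversions, with made-up data for illustration:

```python
import pandas as pd

# A named series: the name becomes the column name in the frame
ratings = pd.Series([4.5, 5.0], index=['Alien', 'Up'], name='rating')

# Series -> single-column DataFrame
frame = ratings.to_frame()

# Series -> DataFrame with the index pulled out into its own column
with_index = ratings.reset_index()

# Series with a two-level index: unstack moves the inner level into columns
multi = pd.Series(
    [4.5, 4.0, 5.0],
    index=pd.MultiIndex.from_tuples(
        [('Alien', 'user1'), ('Alien', 'user2'), ('Up', 'user1')],
        names=['title', 'user'],
    ),
)
wide = multi.unstack()
```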
So, to think about strategy: each of these operations is an individual little building block, and we need to put them together to get from the data that we have to the data that we want. What I recommend is that you decide what you want the end product to look like. If you're going to draw a chart, or you're going to do an analysis or an inference, what are the observations and the variables that you need for that chart or inference? Once you've figured that out, you can plot a path from your current data to your end product.

So if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select only the rows for the movies that are in the horror genre. You're probably going to have a join as well, in order to get the genre table and the movie table connected, depending on how your data is laid out. Once you've filtered it down to the horror movies, you need to get the ratings: you need to compute the average ratings and combine those with the movies, as we've seen the ability to do in a previous video, and then you have the observations that you want. You need to be able to plot this kind of path from what you have to your end product. In this example I've referenced some joins; we saw joins very briefly last week, and we're going to see them again in more detail in the notebooks.

So, to wrap up: pandas has many tools for reshaping data. You want to start with the end in mind and work from what you have to what you need. Read the tutorial notebooks for a lot more details.
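To make the horror-ratings example above concrete, here is a rough sketch of what that path might look like in pandas, assuming hypothetical movies, genres, and ratings frames with made-up columns (not the actual course data):

```python
import pandas as pd

# Hypothetical tables, just to illustrate the shape of the pipeline
movies = pd.DataFrame({'movie_id': [1, 2], 'title': ['Alien', 'Up']})
genres = pd.DataFrame({'movie_id': [1, 1, 2], 'genre': ['Horror', 'SciFi', 'Animation']})
ratings = pd.DataFrame({'movie_id': [1, 1, 2], 'rating': [4.5, 4.0, 5.0]})

# Join movies to genres, then keep only the horror rows
horror = movies.merge(genres, on='movie_id')
horror = horror[horror['genre'] == 'Horror']

# Average rating per movie, then combine with the horror movies
mean_ratings = ratings.groupby('movie_id')['rating'].mean().rename('mean_rating')
result = horror.merge(mean_ratings.reset_index(), on='movie_id')
```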