[Script Info]
Title: 
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,1\N00:00:04,570 --&gt; 00:00:09,490\NBlow in this video. I want to talk with you about basic operations for manipulating data,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,2\N00:00:09,490 --&gt; 00:00:15,100\Nlearning outcomes for this video are for you to know key data reshaping operations and the corresponding Penders function.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,3\N00:00:15,100 --&gt; 00:00:21,010\NThink about the process of transforming data in steps. This is a tour guide to the corresponding notebook.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,4\N00:00:21,010 --&gt; 00:00:22,750\NI'm not showing the actual code in the video,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,5\N00:00:22,750 --&gt; 00:00:27,670\Nbut you're going to see in the notebook how you actually implement different versions of each of these steps.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,6\N00:00:27,670 --&gt; 00:00:31,300\NSo we think about the shape of our data. We have rows and we have columns.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,7\N00:00:31,300 --&gt; 00:00:36,610\NWe have a certain number of rows and columns, each of a number, a type in a name.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,8\N00:00:36,610 --&gt; 00:00:42,460\NThe assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,9\N00:00:42,460 --&gt; 00:00:46,210\NSo our data are well organized. Each row we have the variables.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,10\N00:00:46,210 --&gt; 00:00:51,670\NAnd that's gonna be one type of thing. So if you have a data frame of movies, each row represents one movie.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,11\N00:00:51,670 --&gt; 00:00:55,180\NIf we have data that's not in that kind of a format, we're going to talk about that later.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,12\N00:00:55,180 --&gt; 00:00:59,410\NAnd sourcing and cleaning data. How do we get data in this kind of a tidy format?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,13\N00:00:59,410 --&gt; 00:01:08,860\NFor now, we're going to assume we have data in this format. This kind of a layout are the are Eco-System calls this tidy Vaida data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,14\N00:01:08,860 --&gt; 00:01:13,480\NNow, each of these methods return in new frame. A few of them are going to return a series.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,15\N00:01:13,480 --&gt; 00:01:17,620\NBut in general, we're gonna be transforming data frames to data frames here.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,16\N00:01:17,620 --&gt; 00:01:22,270\NAnd so if our input is a data frame, each row is another observation of the same kind of thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,17\N00:01:22,270 --&gt; 00:01:26,200\NThe output will be a data frame. Each row is an observation of the same kind of thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,18\N00:01:26,200 --&gt; 00:01:30,820\NIt might be an observation, the same kind of thing as the input. It might be an observation of a different kind of thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,19\N00:01:30,820 --&gt; 00:01:34,570\NBut these are the different operations that we're going to be talking about here.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,20\N00:01:34,570 --&gt; 00:01:38,110\NSo if we want to select calls, we have a frame and we want the same frame.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,21\N00:01:38,110 --&gt; 00:01:43,570\NBut with fewer columns, we have few options. We can pick one column by treating the frame as a dictionary.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,22\N00:01:43,570 --&gt; 00:01:50,230\NWe can pick multiple columns bypassing in the list of column names to the same way we pick one column.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,23\N00:01:50,230 --&gt; 00:01:55,420\NOne column will yield a series. Multiple columns will yield a frame. If we want to remove a column.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,24\N00:01:55,420 --&gt; 00:02:02,020\NSo we want to keep all of the columns except one or two or however many that we name the drop method.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,25\N00:02:02,020 --&gt; 00:02:09,170\NReam returns a frame with all the columns of the original frame except the ones you tell it to drop.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,26\N00:02:09,170 --&gt; 00:02:13,840\NWe want to select rows. We have a frame. We want the same frame, but a subset of the rows.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,27\N00:02:13,840 --&gt; 00:02:19,840\NA few common ways to do that or to select by a boolean mask. We set up a PAN, the series that has boolean values.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,28\N00:02:19,840 --&gt; 00:02:24,370\NThat's true. And all the for all the data positions we want to keep.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,29\N00:02:24,370 --&gt; 00:02:29,410\NAnd then we select and then we. So this is really good if we want to select by column values.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,30\N00:02:29,410 --&gt; 00:02:36,100\NSo we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,31\N00:02:36,100 --&gt; 00:02:40,450\NAnd then we can select, we can select using that boolean mask.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,32\N00:02:40,450 --&gt; 00:02:45,220\NWe can select by position in the in the frame, starting with zero.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,33\N00:02:45,220 --&gt; 00:02:54,520\NWe can do that no matter what the index is with. I lock that is that's so lock is the location.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,34\N00:02:54,520 --&gt; 00:03:01,780\NAccessor for Panda's data frames, I lock Access's by integer position, always lock indexes by index keys.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,35\N00:03:01,780 --&gt; 00:03:07,240\NIf we have the index keys we want, if we just load it. So we just loaded the data frame from a CSP file.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,36\N00:03:07,240 --&gt; 00:03:11,110\NWe haven't specified any index options. It's using the default range index.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,37\N00:03:11,110 --&gt; 00:03:15,220\NThen selecting that position and index key are the same thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,38\N00:03:15,220 --&gt; 00:03:19,750\NIf we have a data frame with, call with, whereas we've got our observations and we've got color,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,39\N00:03:19,750 --&gt; 00:03:25,470\Na column that identifies what some kind of a group that each observation is, then maybe it's ratings.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,40\N00:03:25,470 --&gt; 00:03:29,950\NIt's the movie. Maybe it's movies. And it's the actor, the genre.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,41\N00:03:29,950 --&gt; 00:03:35,710\NAnd what we want is a frame or a series that has one row per group of the original data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,42\N00:03:35,710 --&gt; 00:03:42,550\NAnd it's computing some kind of a statistic from a value and all the rows, all of the rows within that group.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,43\N00:03:42,550 --&gt; 00:03:46,900\NThen we want a group by an aggregate like we saw in the videos last week.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,44\N00:03:46,900 --&gt; 00:03:50,980\NA couple more transformations are to think about tall versus wide data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,45\N00:03:50,980 --&gt; 00:03:54,610\NSo why the data has a column per variable.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,46\N00:03:54,610 --&gt; 00:04:04,600\NSo in this case, if this is data, this is data of the of this average speed for each of four different stages of a cycling race.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,47\N00:04:04,600 --&gt; 00:04:12,450\NAnd so we've got a column for each of the four different stages and end our rows or for each cyclist, total data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,48\N00:04:12,450 --&gt; 00:04:19,560\NHas its simplest form toll data or long data has three columns.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,49\N00:04:19,560 --&gt; 00:04:23,670\NWe have the road. We have the identifier. We have the variable name.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,50\N00:04:23,670 --&gt; 00:04:27,990\NAnd we have the variable value. Sometimes this will just be called idee, variable and value.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,51\N00:04:27,990 --&gt; 00:04:33,420\NBut often it's often useful to give the variable and value name columns, meaningful names.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,52\N00:04:33,420 --&gt; 00:04:36,570\NWe could also have more than one idea. Call it if we need to.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,53\N00:04:36,570 --&gt; 00:04:43,740\NBut the idea here is that rather than having the stages in different columns, we split them out into a different row.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,54\N00:04:43,740 --&gt; 00:04:48,420\NSo cycle one cyclist one has four rows, one for each of the four stages.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,55\N00:04:48,420 --&gt; 00:04:54,660\NWe still call this an observation for one thing and for the same kind of thing.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,56\N00:04:54,660 --&gt; 00:05:02,370\NIt's just in the wide data. Each of our observations is for a cyclist and it's an observation of their speed for all four stages.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,57\N00:05:02,370 --&gt; 00:05:06,900\NWhereas in the long data, each observation is for a cyclist.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,58\N00:05:06,900 --&gt; 00:05:13,140\NOne cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,59\N00:05:13,140 --&gt; 00:05:18,830\NTotal data is useful for plotting and grouping because a lot of our plotting function,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,60\N00:05:18,830 --&gt; 00:05:27,430\NYouTube plotting utility functions are going to want to deal with a categorical variable that we use to determine maybe the x axis.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,61\N00:05:27,430 --&gt; 00:05:33,900\NMaybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,62\N00:05:33,900 --&gt; 00:05:43,230\NIf you want to term why data in the tall, you use melt. If you want to turn tall data into why use the pivot with a pivot table methods and pandas.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,63\N00:05:43,230 --&gt; 00:05:50,610\NYou can also create tall data from a list. So if we have a data frame when one of the columns actually contains lists,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,64\N00:05:50,610 --&gt; 00:05:58,020\Nwe haven't seen any data with this so far except the John Rós column in the Waj Movieland data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,65\N00:05:58,020 --&gt; 00:06:05,480\NBut if we have a if we have a data frame or one column contains lists and we have and what we want is one row per list element.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,66\N00:06:05,480 --&gt; 00:06:10,110\NSo we want to take this list that's in a column and split it out so that each element gets another row where it's going to go ahead,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,67\N00:06:10,110 --&gt; 00:06:18,510\Nduplicate the rest of the column. So they're going to have their values repeated, whatever we're doing once for each of the elements.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,68\N00:06:18,510 --&gt; 00:06:25,830\NThis list, the pandas explode method. We'll do that. Then finally, to convert between series and data frame.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,69\N00:06:25,830 --&gt; 00:06:31,170\NSo if we have if we have a data frame and we want to get a series, we just select the column from the data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,70\N00:06:31,170 --&gt; 00:06:35,400\NWe saw that the beginning. If we have a series and we want to get a data frame,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,71\N00:06:35,400 --&gt; 00:06:41,790\Nwe can just create a single column frame with two frame and the two frame method on the serious object also.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,72\N00:06:41,790 --&gt; 00:06:45,270\NBut to give it a names that you have a name and the resulting data frame,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,73\N00:06:45,270 --&gt; 00:06:49,650\Nyou can also if you want to create a multi column data frame where you've got a column for the value end,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,74\N00:06:49,650 --&gt; 00:06:53,910\Nyou have a column for the index of the original of the original series.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,75\N00:06:53,910 --&gt; 00:07:00,630\NThe Pandas, the series Freeze Reset Index Method or pop that index out into a data frame column.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,76\N00:07:00,630 --&gt; 00:07:11,340\NAnd then finally, if you have a series with multiple levels to its index, we haven't seen those yet, but we're going to see them from time to time.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,77\N00:07:11,340 --&gt; 00:07:18,190\NThe unstamped method will turn the inner most index labels in the column labels.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,78\N00:07:18,190 --&gt; 00:07:22,450\NTo turn it series into a data. So to think about strategy.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,79\N00:07:22,450 --&gt; 00:07:31,140\NEach of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,80\N00:07:31,140 --&gt; 00:07:36,000\NAnd so what I recommend is that you decide what you want the end product to look like.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,81\N00:07:36,000 --&gt; 00:07:40,410\NIf you're going to draw a chart or you're going to do an analysis or an inference,
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,82\N00:07:40,410 --&gt; 00:07:46,710\Nwhat are the observations and the variables that you need for that chart or inference?
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,83\N00:07:46,710 --&gt; 00:07:54,210\NAnd then once you've figured that out, you can plot a path from your current data to your end product.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,84\N00:07:54,210 --&gt; 00:08:10,770\NSo. If you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,85\N00:08:10,770 --&gt; 00:08:15,300\NThe rows that have the movies only that are in the horror genre, you can select that.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,86\N00:08:15,300 --&gt; 00:08:23,460\NYou're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,87\N00:08:23,460 --&gt; 00:08:33,720\NAnd once you've filtered it down, OK, these are the horror movies that you need to get the ratings and you need to you need to be able.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,88\N00:08:33,720 --&gt; 00:08:37,720\NYou need to get there.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,89\N00:08:37,720 --&gt; 00:08:45,950\NYou need to have the average ratings, you need to combine those with the movies, as we've seen the ability to do in a previous video.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,90\N00:08:45,950 --&gt; 00:08:52,750\NAnd then you have the observations that you want. You need to be able to plot this kind of a path and what you have at the end product.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,91\N00:08:52,750 --&gt; 00:08:58,350\NIn this example, I've reference some Joynes. We saw joints very, very briefly last week.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,92\N00:08:58,350 --&gt; 00:09:05,310\NWe're going to see them again in more detail in the notebooks. So to wrap up, Penders has many tools for reshaping data.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,93\N00:09:05,310 --&gt; 00:09:09,300\NYou want to start with the end in mind, work from what you have to what you need.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,94\N00:09:09,300 --&gt; 00:09:18,167\NRead the tutorial notebooks for a lot more details.
Dialogue: 0,9:59:59.99,9:59:59.99,Default,,0000,0000,0000,,