1
00:00:04,570 --> 00:00:09,490
Hello. In this video, I want to talk with you about basic operations for manipulating data.
2
00:00:09,490 --> 00:00:15,100
The learning outcomes for this video are for you to know the key data reshaping operations and the corresponding pandas functions.
3
00:00:15,100 --> 00:00:21,010
Think about the process of transforming data in steps. This video is a tour guide to the corresponding notebook.
4
00:00:21,010 --> 00:00:22,750
I'm not showing the actual code in the video,
5
00:00:22,750 --> 00:00:27,670
but you're going to see in the notebook how you actually implement different versions of each of these steps.
6
00:00:27,670 --> 00:00:31,300
So we think about the shape of our data. We have rows and we have columns.
7
00:00:31,300 --> 00:00:36,610
We have a certain number of rows and columns, and each column has a type and a name.
8
00:00:36,610 --> 00:00:42,460
The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
9
00:00:42,460 --> 00:00:46,210
So our data are well organized: each row has a value for each of the variables,
10
00:00:46,210 --> 00:00:51,670
and each row is going to be one instance of that kind of thing. So if you have a data frame of movies, each row represents one movie.
11
00:00:51,670 --> 00:00:55,180
If we have data that's not in that kind of a format, we're going to talk about that later.
12
00:00:55,180 --> 00:00:59,410
when we cover sourcing and cleaning data: how do we get data into this kind of a tidy format?
13
00:00:59,410 --> 00:01:08,860
For now, we're going to assume we have data in this format. The R ecosystem calls this kind of layout tidy data.
14
00:01:08,860 --> 00:01:13,480
Now, each of these methods returns a new frame. A few of them are going to return a series.
15
00:01:13,480 --> 00:01:17,620
But in general, we're gonna be transforming data frames to data frames here.
16
00:01:17,620 --> 00:01:22,270
And so if our input is a data frame, each row is another observation of the same kind of thing.
17
00:01:22,270 --> 00:01:26,200
The output will be a data frame. Each row is an observation of the same kind of thing.
18
00:01:26,200 --> 00:01:30,820
It might be an observation of the same kind of thing as the input, or it might be an observation of a different kind of thing.
19
00:01:30,820 --> 00:01:34,570
But these are the different operations that we're going to be talking about here.
20
00:01:34,570 --> 00:01:38,110
So if we want to select columns: we have a frame, and we want the same frame.
21
00:01:38,110 --> 00:01:43,570
But with fewer columns. We have a few options. We can pick one column by treating the frame as a dictionary.
22
00:01:43,570 --> 00:01:50,230
We can pick multiple columns by passing in a list of column names, the same way we pick one column.
23
00:01:50,230 --> 00:01:55,420
One column will yield a series. Multiple columns will yield a frame. If we want to remove a column.
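A quick sketch of that column selection, using a made-up frame (the data here is invented for illustration, not from the notebook):

```python
import pandas as pd

# Made-up movie data, just for illustration.
movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'year': [1979, 2009, 1995],
    'rating': [8.5, 8.3, 8.3],
})

# One column name, dictionary-style: yields a Series.
titles = movies['title']

# A list of column names: yields a DataFrame with just those columns.
subset = movies[['title', 'year']]
```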
24
00:01:55,420 --> 00:02:02,020
So we want to keep all of the columns except one, or two, or however many we name. The drop method
25
00:02:02,020 --> 00:02:09,170
returns a frame with all the columns of the original frame except the ones you tell it to drop.
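For example (again with invented data), dropping a column looks like this:

```python
import pandas as pd

# Made-up frame for illustration.
movies = pd.DataFrame({
    'title': ['Alien', 'Up'],
    'year': [1979, 2009],
    'rating': [8.5, 8.3],
})

# drop returns a new frame without the named columns;
# the original frame is left unchanged.
no_rating = movies.drop(columns=['rating'])
```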
26
00:02:09,170 --> 00:02:13,840
If we want to select rows: we have a frame and we want the same frame, but with a subset of the rows.
27
00:02:13,840 --> 00:02:19,840
There are a few common ways to do that. One is to select by a boolean mask: we set up a pandas series that has boolean values,
28
00:02:19,840 --> 00:02:24,370
true for all the positions we want to keep,
29
00:02:24,370 --> 00:02:29,410
and then we select with it. This is really good if we want to select by column values.
30
00:02:29,410 --> 00:02:36,100
So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
31
00:02:36,100 --> 00:02:40,450
And then we can select using that boolean mask.
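A minimal mask-selection sketch, with made-up data:

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Alien', 'Up', 'Heat'],
    'genre': ['Horror', 'Animation', 'Crime'],
})

# The comparison produces a boolean Series: the mask.
mask = movies['genre'] == 'Horror'

# Indexing with the mask keeps only the rows where it is True.
horror = movies[mask]
```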
32
00:02:40,450 --> 00:02:45,220
We can also select by position in the frame, starting with zero.
33
00:02:45,220 --> 00:02:54,520
We can do that, no matter what the index is, with iloc. loc is the location
34
00:02:54,520 --> 00:03:01,780
accessor for pandas data frames: iloc accesses by integer position, while loc indexes by index keys.
35
00:03:01,780 --> 00:03:07,240
We use loc if we have the index keys we want. Suppose we just loaded the data frame from a CSV file.
36
00:03:07,240 --> 00:03:11,110
We haven't specified any index options, so it's using the default range index.
37
00:03:11,110 --> 00:03:15,220
Then selecting by position and selecting by index key are the same thing.
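A small sketch of the iloc/loc distinction, with invented frames:

```python
import pandas as pd

# With the default range index, position and index key coincide.
df = pd.DataFrame({'x': [10, 20, 30]})
same = df.iloc[1]['x'] == df.loc[1]['x']

# With a non-default index they differ: iloc is positional,
# loc looks up the index key.
df2 = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
first_by_position = df2.iloc[0]['x']
b_by_key = df2.loc['b']['x']
```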
38
00:03:15,220 --> 00:03:19,750
Now, suppose we have a data frame where we've got our observations, and we've got
39
00:03:19,750 --> 00:03:25,470
a column that identifies some kind of group that each observation is in. Maybe the observations are ratings
40
00:03:25,470 --> 00:03:29,950
and the group is the movie; maybe the observations are movies and the group is the actor or the genre.
41
00:03:29,950 --> 00:03:35,710
And what we want is a frame or a series that has one row per group of the original data.
42
00:03:35,710 --> 00:03:42,550
And it computes some kind of statistic from a value across all of the rows within that group.
43
00:03:42,550 --> 00:03:46,900
Then we want a group-by and aggregate, like we saw in the videos last week.
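The group-by-and-aggregate pattern, sketched with made-up ratings:

```python
import pandas as pd

# Made-up ratings: one row per individual rating.
ratings = pd.DataFrame({
    'movie': ['Alien', 'Alien', 'Up'],
    'rating': [5, 4, 3],
})

# groupby + aggregate: one row per group, one statistic per group.
mean_ratings = ratings.groupby('movie')['rating'].mean()
```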
44
00:03:46,900 --> 00:03:50,980
A couple more transformations are to think about tall versus wide data.
45
00:03:50,980 --> 00:03:54,610
Wide data has a column per variable.
46
00:03:54,610 --> 00:04:04,600
So in this case, this is data of the average speed for each of four different stages of a cycling race,
47
00:04:04,600 --> 00:04:12,450
and so we've got a column for each of the four different stages, and our rows are one per cyclist.
48
00:04:12,450 --> 00:04:19,560
Tall data, or long data, in its simplest form has three columns.
49
00:04:19,560 --> 00:04:23,670
We have the row identifier. We have the variable name.
50
00:04:23,670 --> 00:04:27,990
And we have the variable value. Sometimes these will just be called id, variable, and value,
51
00:04:27,990 --> 00:04:33,420
but it's often useful to give the variable and value columns meaningful names.
52
00:04:33,420 --> 00:04:36,570
We could also have more than one ID column if we need to.
53
00:04:36,570 --> 00:04:43,740
But the idea here is that rather than having the stages in different columns, we split them out into different rows.
54
00:04:43,740 --> 00:04:48,420
So cyclist one has four rows, one for each of the four stages.
55
00:04:48,420 --> 00:04:54,660
We still call each row an observation of one thing, and all the rows are observations of the same kind of thing.
56
00:04:54,660 --> 00:05:02,370
It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages.
57
00:05:02,370 --> 00:05:06,900
Whereas in the long data, each observation is for a cyclist.
58
00:05:06,900 --> 00:05:13,140
One cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
59
00:05:13,140 --> 00:05:18,830
Tall data is useful for plotting and grouping, because a lot of our plotting functions,
60
00:05:18,830 --> 00:05:27,430
our plotting utility functions, are going to want a categorical variable that we use to determine maybe the x-axis,
61
00:05:27,430 --> 00:05:33,900
Maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
62
00:05:33,900 --> 00:05:43,230
If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas.
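A round-trip sketch of melt and pivot, with invented stage speeds:

```python
import pandas as pd

# Wide: one row per cyclist, one column per stage (made-up speeds).
wide = pd.DataFrame({
    'cyclist': ['A', 'B'],
    'stage1': [40.1, 38.9],
    'stage2': [41.5, 39.2],
})

# melt: wide -> tall (id column, variable name, variable value).
tall = wide.melt(id_vars='cyclist', var_name='stage', value_name='speed')

# pivot: tall -> wide again.
wide_again = tall.pivot(index='cyclist', columns='stage', values='speed')
```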
63
00:05:43,230 --> 00:05:50,610
You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists,
64
00:05:50,610 --> 00:05:58,020
we haven't seen any data like this so far except the genres column in the MovieLens data.
65
00:05:58,020 --> 00:06:05,480
But if we have a data frame where one column contains lists, and what we want is one row per list element:
66
00:06:05,480 --> 00:06:10,110
so we want to take the list that's in a column and split it out so that each element gets its own row, and it's going to
67
00:06:10,110 --> 00:06:18,510
duplicate the rest of the columns, so the other values are repeated once for each of the elements
68
00:06:18,510 --> 00:06:25,830
in the list. The pandas explode method will do that. Then finally, to convert between series and data frame:
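The explode step, sketched with a made-up list column:

```python
import pandas as pd

# Made-up frame where the 'genres' column holds lists.
movies = pd.DataFrame({
    'title': ['Alien', 'Up'],
    'genres': [['Horror', 'SciFi'], ['Animation']],
})

# explode: one row per list element; the other columns are repeated.
by_genre = movies.explode('genres')
```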
69
00:06:25,830 --> 00:06:31,170
If we have a data frame and we want to get a series, we just select the column from the data frame.
70
00:06:31,170 --> 00:06:35,400
We saw that at the beginning. If we have a series and we want to get a data frame,
71
00:06:35,400 --> 00:06:41,790
we can just create a single-column frame with the to_frame method on the series object.
72
00:06:41,790 --> 00:06:45,270
to_frame also lets you give it a name, so that you have a named column in the resulting data frame.
73
00:06:45,270 --> 00:06:49,650
You can also, if you want, create a two-column data frame where you've got a column for the value and
74
00:06:49,650 --> 00:06:53,910
a column for the index of the original series.
75
00:06:53,910 --> 00:07:00,630
The pandas series's reset_index method will pop that index out into a data frame column.
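Both conversions in one made-up sketch:

```python
import pandas as pd

s = pd.Series([4.5, 3.0], index=['Alien', 'Up'], name='rating')

# to_frame: single-column frame; the series name becomes the column name.
one_col = s.to_frame()

# reset_index: the index becomes an ordinary column next to the values.
two_col = s.reset_index()
```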
76
00:07:00,630 --> 00:07:11,340
And then finally, if you have a series with multiple levels to its index, we haven't seen those yet, but we're going to see them from time to time.
77
00:07:11,340 --> 00:07:18,190
The unstack method will turn the innermost index labels into column labels,
78
00:07:18,190 --> 00:07:22,450
turning the series into a data frame. So, to think about strategy:
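An unstack sketch, with an invented two-level index:

```python
import pandas as pd

# A series with a two-level index (made-up speeds).
idx = pd.MultiIndex.from_tuples(
    [('A', 'stage1'), ('A', 'stage2'), ('B', 'stage1'), ('B', 'stage2')],
    names=['cyclist', 'stage'])
speeds = pd.Series([40.1, 41.5, 38.9, 39.2], index=idx)

# unstack moves the innermost index level into the columns,
# turning the series into a data frame.
frame = speeds.unstack()
```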
79
00:07:22,450 --> 00:07:31,140
Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
80
00:07:31,140 --> 00:07:36,000
And so what I recommend is that you decide what you want the end product to look like.
81
00:07:36,000 --> 00:07:40,410
If you're going to draw a chart or you're going to do an analysis or an inference,
82
00:07:40,410 --> 00:07:46,710
what are the observations and the variables that you need for that chart or inference?
83
00:07:46,710 --> 00:07:54,210
And then once you've figured that out, you can plot a path from your current data to your end product.
84
00:07:54,210 --> 00:08:10,770
So, if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select
85
00:08:10,770 --> 00:08:15,300
the rows that have only the movies that are in the horror genre.
86
00:08:15,300 --> 00:08:23,460
You're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out.
87
00:08:23,460 --> 00:08:33,720
And once you've filtered it down, OK, these are the horror movies, then you need to get the ratings.
88
00:08:33,720 --> 00:08:37,720
You need to get there.
89
00:08:37,720 --> 00:08:45,950
You need to have the average ratings, and you need to combine those with the movies, as we've seen how to do in a previous video.
90
00:08:45,950 --> 00:08:52,750
And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
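One way that path could look in code. The table layouts and column names here (movie_id, genre, rating) are hypothetical, just to sketch the filter, join, and aggregate steps; the real data may be laid out differently:

```python
import pandas as pd

# Hypothetical movie, genre, and ratings tables.
movies = pd.DataFrame({'movie_id': [1, 2], 'title': ['Alien', 'Up']})
genres = pd.DataFrame({'movie_id': [1, 1, 2],
                       'genre': ['Horror', 'SciFi', 'Animation']})
ratings = pd.DataFrame({'movie_id': [1, 1, 2], 'rating': [5, 4, 3]})

# Step 1: select the horror movies (filter the genre table, then subset).
horror_ids = genres[genres['genre'] == 'Horror']['movie_id']
horror = movies[movies['movie_id'].isin(horror_ids)]

# Step 2: compute the mean rating per movie, then join it onto the movies.
mean_ratings = (ratings.groupby('movie_id', as_index=False)['rating']
                .mean().rename(columns={'rating': 'mean_rating'}))
result = horror.merge(mean_ratings, on='movie_id')
```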
91
00:08:52,750 --> 00:08:58,350
In this example, I've referenced some joins. We saw joins very, very briefly last week.
92
00:08:58,350 --> 00:09:05,310
We're going to see them again in more detail in the notebooks. So to wrap up: pandas has many tools for reshaping data.
93
00:09:05,310 --> 00:09:09,300
You want to start with the end in mind and work from what you have to what you need.
94
00:09:09,300 --> 00:09:18,167
Read the tutorial notebooks for a lot more details.