-
Not Synced
1
00:00:04,570 --> 00:00:09,490
Hello. In this video, I want to talk with you about basic operations for manipulating data.
-
Not Synced
2
00:00:09,490 --> 00:00:15,100
The learning outcomes for this video are for you to know key data reshaping operations and the corresponding pandas functions,
-
Not Synced
3
00:00:15,100 --> 00:00:21,010
and to think about the process of transforming data in steps. This is a tour guide to the corresponding notebook.
-
Not Synced
4
00:00:21,010 --> 00:00:22,750
I'm not showing the actual code in the video,
-
Not Synced
5
00:00:22,750 --> 00:00:27,670
but you're going to see in the notebook how you actually implement different versions of each of these steps.
-
Not Synced
6
00:00:27,670 --> 00:00:31,300
So we think about the shape of our data. We have rows and we have columns.
-
Not Synced
7
00:00:31,300 --> 00:00:36,610
We have a certain number of rows and columns, and each column has a type and a name.
-
Not Synced
8
00:00:36,610 --> 00:00:42,460
The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.
-
Not Synced
9
00:00:42,460 --> 00:00:46,210
So our data are well organized. In each row we have the variables.
-
Not Synced
10
00:00:46,210 --> 00:00:51,670
And that's gonna be one type of thing. So if you have a data frame of movies, each row represents one movie.
-
Not Synced
11
00:00:51,670 --> 00:00:55,180
If we have data that's not in that kind of a format, we're going to talk about that later,
-
Not Synced
12
00:00:55,180 --> 00:00:59,410
in sourcing and cleaning data: how do we get data into this kind of a tidy format?
-
Not Synced
13
00:00:59,410 --> 00:01:08,860
For now, we're going to assume we have data in this format. This kind of a layout is what the R ecosystem calls tidy data.
-
Not Synced
14
00:01:08,860 --> 00:01:13,480
Now, each of these methods returns a new frame. A few of them are going to return a series.
-
Not Synced
15
00:01:13,480 --> 00:01:17,620
But in general, we're gonna be transforming data frames to data frames here.
-
Not Synced
16
00:01:17,620 --> 00:01:22,270
And so if our input is a data frame where each row is another observation of the same kind of thing,
-
Not Synced
17
00:01:22,270 --> 00:01:26,200
the output will be a data frame where each row is an observation of the same kind of thing.
-
Not Synced
18
00:01:26,200 --> 00:01:30,820
It might be an observation of the same kind of thing as the input. It might be an observation of a different kind of thing.
-
Not Synced
19
00:01:30,820 --> 00:01:34,570
But these are the different operations that we're going to be talking about here.
-
Not Synced
20
00:01:34,570 --> 00:01:38,110
So if we want to select columns, we have a frame and we want the same frame,
-
Not Synced
21
00:01:38,110 --> 00:01:43,570
but with fewer columns, we have a few options. We can pick one column by treating the frame as a dictionary.
-
Not Synced
22
00:01:43,570 --> 00:01:50,230
We can pick multiple columns by passing in a list of column names, the same way we pick one column.
-
Not Synced
23
00:01:50,230 --> 00:01:55,420
One column will yield a series. Multiple columns will yield a frame. If we want to remove a column,
-
Not Synced
24
00:01:55,420 --> 00:02:02,020
so we want to keep all of the columns except one or two or however many that we name, the drop method
-
Not Synced
25
00:02:02,020 --> 00:02:09,170
returns a frame with all the columns of the original frame except the ones you tell it to drop.
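As a rough sketch of the column operations just described (the movie data here is invented for illustration and is not from the course notebook), selecting and dropping columns might look like this:

```python
import pandas as pd

# Hypothetical movie data, made up for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Up"],
    "year": [1979, 1995, 2009],
    "runtime": [117, 170, 96],
})

# One column name, dictionary-style: the result is a Series.
one_col = movies["title"]

# A list of column names: the result is a DataFrame.
two_cols = movies[["title", "year"]]

# drop returns a new frame without the named columns;
# the original frame is left unchanged.
no_runtime = movies.drop(columns=["runtime"])

print(type(one_col).__name__, type(two_cols).__name__)
print(list(no_runtime.columns))
```

Note that drop does not modify movies here; it hands back a new frame, which fits the frame-in, frame-out pattern described earlier.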
-
Not Synced
26
00:02:09,170 --> 00:02:13,840
We want to select rows. We have a frame. We want the same frame, but a subset of the rows.
-
Not Synced
27
00:02:13,840 --> 00:02:19,840
A few common ways to do that: we can select by a boolean mask. We set up a pandas series that has boolean values
-
Not Synced
28
00:02:19,840 --> 00:02:24,370
that are true for all the data positions we want to keep,
-
Not Synced
29
00:02:24,370 --> 00:02:29,410
and then we select with it. So this is really good if we want to select by column values.
-
Not Synced
30
00:02:29,410 --> 00:02:36,100
So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.
-
Not Synced
31
00:02:36,100 --> 00:02:40,450
And then we can select using that boolean mask.
-
Not Synced
32
00:02:40,450 --> 00:02:45,220
We can select by position in the frame, starting with zero.
-
Not Synced
33
00:02:45,220 --> 00:02:54,520
We can do that no matter what the index is with iloc. So loc is the location
-
Not Synced
34
00:02:54,520 --> 00:03:01,780
accessor for pandas data frames; iloc accesses by integer position, always, and loc indexes by index keys,
-
Not Synced
35
00:03:01,780 --> 00:03:07,240
if we have the index keys we want. If we just loaded the data frame from a CSV file,
-
Not Synced
36
00:03:07,240 --> 00:03:11,110
we haven't specified any index options, and it's using the default range index,
-
Not Synced
37
00:03:11,110 --> 00:03:15,220
then selecting by position and by index key are the same thing.
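Here is a minimal sketch of the row selection options described above, using a small made-up frame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data with the default RangeIndex (0, 1, 2, 3).
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Up", "Scream"],
    "genre": ["Horror", "Crime", "Animation", "Horror"],
})

# A comparison produces a boolean Series; indexing with it keeps the True rows.
is_horror = movies["genre"] == "Horror"
horror = movies[is_horror]

# With the default RangeIndex, position 2 and index key 2 name the same row.
by_pos = movies.iloc[2]
by_key = movies.loc[2]

# Once the index holds other keys, loc looks up by key and iloc still
# counts positions from zero.
by_title = movies.set_index("title")
heat_by_key = by_title.loc["Heat"]
heat_by_pos = by_title.iloc[1]

print(horror)
```

The second half is there to show why the distinction matters: after set_index, loc and iloc take different arguments to reach the same row.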
-
Not Synced
38
00:03:15,220 --> 00:03:19,750
If we have a data frame where we've got our observations and we've got
-
Not Synced
39
00:03:19,750 --> 00:03:25,470
a column that identifies some kind of a group that each observation is in, maybe it's ratings
-
Not Synced
40
00:03:25,470 --> 00:03:29,950
and it's the movie. Maybe it's movies, and it's the actor or the genre.
-
Not Synced
41
00:03:29,950 --> 00:03:35,710
And what we want is a frame or a series that has one row per group of the original data.
-
Not Synced
42
00:03:35,710 --> 00:03:42,550
And it's computing some kind of a statistic from a value in all of the rows within that group.
-
Not Synced
43
00:03:42,550 --> 00:03:46,900
Then we want a group-by and aggregate, like we saw in the videos last week.
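A small sketch of that group-and-aggregate step, assuming a hypothetical ratings frame with one row per rating (names and numbers are invented):

```python
import pandas as pd

# Hypothetical ratings: each row is one rating of one movie.
ratings = pd.DataFrame({
    "movie": ["Alien", "Alien", "Heat", "Heat", "Heat"],
    "rating": [5.0, 4.0, 3.0, 4.0, 5.0],
})

# One statistic per group: a Series indexed by movie.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several statistics per group: a DataFrame with one row per movie.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])

print(mean_rating)
print(stats)
```

The input has one row per rating and the output has one row per movie, so this is a case where the observations in the result are a different kind of thing from the observations in the input.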
-
Not Synced
44
00:03:46,900 --> 00:03:50,980
A couple more transformations are to think about tall versus wide data.
-
Not Synced
45
00:03:50,980 --> 00:03:54,610
So wide data has a column per variable.
-
Not Synced
46
00:03:54,610 --> 00:04:04,600
So in this case, this is data of the average speed for each of four different stages of a cycling race.
-
Not Synced
47
00:04:04,600 --> 00:04:12,450
And so we've got a column for each of the four different stages, and our rows are for each cyclist. Tall data,
-
Not Synced
48
00:04:12,450 --> 00:04:19,560
in its simplest form, tall data or long data, has three columns.
-
Not Synced
49
00:04:19,560 --> 00:04:23,670
We have the row identifier. We have the variable name.
-
Not Synced
50
00:04:23,670 --> 00:04:27,990
And we have the variable value. Sometimes these will just be called id, variable, and value.
-
Not Synced
51
00:04:27,990 --> 00:04:33,420
But it's often useful to give the variable and value columns meaningful names.
-
Not Synced
52
00:04:33,420 --> 00:04:36,570
We could also have more than one id column if we need to.
-
Not Synced
53
00:04:36,570 --> 00:04:43,740
But the idea here is that rather than having the stages in different columns, we split them out into different rows.
-
Not Synced
54
00:04:43,740 --> 00:04:48,420
So cyclist one has four rows, one for each of the four stages.
-
Not Synced
55
00:04:48,420 --> 00:04:54,660
We still call this an observation for one thing and for the same kind of thing.
-
Not Synced
56
00:04:54,660 --> 00:05:02,370
It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages,
-
Not Synced
57
00:05:02,370 --> 00:05:06,900
whereas in the long data, each observation is for a cyclist:
-
Not Synced
58
00:05:06,900 --> 00:05:13,140
one cyclist in one particular stage. So each cyclist will have four observations, one for each stage.
-
Not Synced
59
00:05:13,140 --> 00:05:18,830
Tall data is useful for plotting and grouping because a lot of our plotting functions,
-
Not Synced
60
00:05:18,830 --> 00:05:27,430
our plotting utility functions, are going to want to deal with a categorical variable that we use to determine maybe the x axis,
-
Not Synced
61
00:05:27,430 --> 00:05:33,900
maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.
-
Not Synced
62
00:05:33,900 --> 00:05:43,230
If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas.
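To make the wide and tall layouts concrete, here is a sketch with invented cycling data. It uses two stages rather than four to keep it short, and the column names stage and speed are my own choices, not anything fixed by pandas:

```python
import pandas as pd

# Hypothetical wide data: one row per cyclist, one column per stage.
wide = pd.DataFrame({
    "cyclist": ["A", "B"],
    "stage1": [40.1, 39.5],
    "stage2": [38.7, 39.0],
})

# Wide to tall: one row per (cyclist, stage) pair.
# var_name and value_name replace the default names "variable" and "value".
tall = wide.melt(id_vars="cyclist", var_name="stage", value_name="speed")

# Tall back to wide: stage labels become the columns again.
wide_again = tall.pivot(index="cyclist", columns="stage", values="speed")

print(tall)
print(wide_again)
```

pivot expects each index and column combination to appear at most once; pivot_table is the variant to reach for when duplicates need to be aggregated.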
-
Not Synced
63
00:05:43,230 --> 00:05:50,610
You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists,
-
Not Synced
64
00:05:50,610 --> 00:05:58,020
we haven't seen any data with this so far except the genres column in the raw MovieLens data.
-
Not Synced
65
00:05:58,020 --> 00:06:05,480
But if we have a data frame where one column contains lists, and what we want is one row per list element,
-
Not Synced
66
00:06:05,480 --> 00:06:10,110
so we want to take this list that's in a column and split it out so that each element gets another row, where it's going to go ahead and
-
Not Synced
67
00:06:10,110 --> 00:06:18,510
duplicate the rest of the columns, so they're going to have their values repeated once for each of the elements
-
Not Synced
68
00:06:18,510 --> 00:06:25,830
of this list, the pandas explode method will do that. Then finally, to convert between series and data frame.
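Before moving on to that, here is a sketch of the explode step. It assumes a genres column stored as pipe-separated strings; the titles and genres are made up, and the splitting step is only needed if the column holds strings rather than actual lists:

```python
import pandas as pd

# Hypothetical movies with a pipe-separated genres string (assumed format).
movies = pd.DataFrame({
    "title": ["Alien", "Up"],
    "genres": ["Horror|Sci-Fi", "Animation|Adventure|Comedy"],
})

# Turn each string into a list of genre names.
movies["genres"] = movies["genres"].str.split("|")

# One row per list element; the other columns are repeated for each element.
by_genre = movies.explode("genres")

print(by_genre)
```

The index values are repeated along with the other columns, so a reset_index(drop=True) afterwards can be useful if you need unique row labels.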
-
Not Synced
69
00:06:25,830 --> 00:06:31,170
So if we have a data frame and we want to get a series, we just select the column from the data frame.
-
Not Synced
70
00:06:31,170 --> 00:06:35,400
We saw that at the beginning. If we have a series and we want to get a data frame,
-
Not Synced
71
00:06:35,400 --> 00:06:41,790
we can just create a single-column frame with to_frame. The to_frame method on the series object also
-
Not Synced
72
00:06:41,790 --> 00:06:45,270
lets you give it a name, so that you have a name in the resulting data frame.
-
Not Synced
73
00:06:45,270 --> 00:06:49,650
You can also, if you want to create a multi-column data frame where you've got a column for the value and
-
Not Synced
74
00:06:49,650 --> 00:06:53,910
you have a column for the index of the original series,
-
Not Synced
75
00:06:53,910 --> 00:07:00,630
use the pandas series reset_index method, which will pop that index out into a data frame column.
-
Not Synced
76
00:07:00,630 --> 00:07:11,340
And then finally, if you have a series with multiple levels to its index, we haven't seen those yet, but we're going to see them from time to time,
-
Not Synced
77
00:07:11,340 --> 00:07:18,190
the unstack method will turn the innermost index labels into column labels,
-
Not Synced
78
00:07:18,190 --> 00:07:22,450
to turn the series into a data frame. So, to think about strategy.
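Before the strategy discussion, here is a sketch of the three series-to-frame conversions just mentioned. The series contents and the name mean_rating are invented for illustration:

```python
import pandas as pd

# Hypothetical series of mean ratings, indexed by movie.
mean_rating = pd.Series(
    [4.5, 4.0],
    index=pd.Index(["Alien", "Heat"], name="movie"),
)

# Single-column frame; the index stays as the index.
single = mean_rating.to_frame(name="mean_rating")

# Two-column frame; the index is popped out into a regular column.
flat = mean_rating.reset_index(name="mean_rating")

# A series with a two-level index: (cyclist, stage).
speeds = pd.Series(
    [40.1, 38.7, 39.5, 39.0],
    index=pd.MultiIndex.from_product(
        [["A", "B"], ["stage1", "stage2"]], names=["cyclist", "stage"]
    ),
)

# unstack moves the innermost level (stage) into the column labels.
wide = speeds.unstack()

print(flat)
print(wide)
```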
-
Not Synced
79
00:07:22,450 --> 00:07:31,140
Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.
-
Not Synced
80
00:07:31,140 --> 00:07:36,000
And so what I recommend is that you decide what you want the end product to look like.
-
Not Synced
81
00:07:36,000 --> 00:07:40,410
If you're going to draw a chart or you're going to do an analysis or an inference,
-
Not Synced
82
00:07:40,410 --> 00:07:46,710
what are the observations and the variables that you need for that chart or inference?
-
Not Synced
83
00:07:46,710 --> 00:07:54,210
And then once you've figured that out, you can plot a path from your current data to your end product.
-
Not Synced
84
00:07:54,210 --> 00:08:10,770
So, if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select
-
Not Synced
85
00:08:10,770 --> 00:08:15,300
the rows that have only the movies that are in the horror genre. You can select that.
-
Not Synced
86
00:08:15,300 --> 00:08:23,460
You're probably going to have a join as well in order to get the genre table and the movie table connected, depending on how your data's laid out.
-
Not Synced
87
00:08:23,460 --> 00:08:33,720
And once you've filtered it down, OK, these are the horror movies, then you need to get the ratings.
-
Not Synced
88
00:08:33,720 --> 00:08:37,720
You need to get their averages.
-
Not Synced
89
00:08:37,720 --> 00:08:45,950
You need to have the average ratings, and you need to combine those with the movies, as we've seen how to do in a previous video.
-
Not Synced
90
00:08:45,950 --> 00:08:52,750
And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
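One possible path for the horror example, sketched under the assumption that the data come as three separate tables keyed by a movie_id column. The table layout, column names, and values are all hypothetical; your data may be laid out differently:

```python
import pandas as pd

# Hypothetical tables, invented for this sketch.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alien", "Heat", "Scream"],
})
movie_genres = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "genre": ["Horror", "Sci-Fi", "Crime", "Horror"],
})
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [5.0, 4.0, 3.0, 2.0, 4.0],
})

# Step 1: select the rows for the horror genre, keeping only the id column.
horror_ids = movie_genres.loc[movie_genres["genre"] == "Horror", ["movie_id"]]

# Step 2: group and aggregate to get one mean rating per movie.
mean_ratings = (
    ratings.groupby("movie_id")["rating"].mean().reset_index(name="mean_rating")
)

# Step 3: join the horror ids with the movie titles and the mean ratings.
horror_movies = horror_ids.merge(movies, on="movie_id").merge(
    mean_ratings, on="movie_id"
)

print(horror_movies)
```

The end product has one row per horror movie with its mean rating, which is the set of observations a distribution plot of mean_rating would need.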
-
Not Synced
91
00:08:52,750 --> 00:08:58,350
In this example, I've referenced some joins. We saw joins very, very briefly last week.
-
Not Synced
92
00:08:58,350 --> 00:09:05,310
We're going to see them again in more detail in the notebooks. So to wrap up, pandas has many tools for reshaping data.
-
Not Synced
93
00:09:05,310 --> 00:09:09,300
You want to start with the end in mind, work from what you have to what you need.
-
Not Synced
94
00:09:09,300 --> 00:09:18,167
Read the tutorial notebooks for a lot more details.
-
Not Synced