https:/.../7ee3707b-0e2d-4a18-8de1-ad9601830ce7-e6d4df35-320f-4d95-ab81-ad9c006bb089.mp4?invocationId=fcdaf66c-a50f-ec11-a9e9-0a1a827ad0ec

1
00:00:04,570 --> 00:00:09,490
Hello. In this video, I want to talk with you about basic operations for manipulating data.

2
00:00:09,490 --> 00:00:15,100
The learning outcomes for this video are for you to know key data reshaping operations and the corresponding pandas functions,

3
00:00:15,100 --> 00:00:21,010
and to think about the process of transforming data in steps. This is a tour guide to the corresponding notebook.

4
00:00:21,010 --> 00:00:22,750
I'm not showing the actual code in the video,

5
00:00:22,750 --> 00:00:27,670
but you're going to see in the notebook how you actually implement different versions of each of these steps.

6
00:00:27,670 --> 00:00:31,300
So we think about the shape of our data. We have rows and we have columns.

7
00:00:31,300 --> 00:00:36,610
We have a certain number of rows and columns, and each column has a number, a type, and a name.

8
00:00:36,610 --> 00:00:42,460
The assumption we're going to make throughout these operations is that each row is another observation of the same kind of thing.

9
00:00:42,460 --> 00:00:46,210
So our data are well organized. In each row, we have the variables,

10
00:00:46,210 --> 00:00:51,670
and that's gonna be one type of thing. So if you have a data frame of movies, each row represents one movie.

11
00:00:51,670 --> 00:00:55,180
If we have data that's not in that kind of a format, we're going to talk about that later,

12
00:00:55,180 --> 00:00:59,410
in sourcing and cleaning data: how do we get data into this kind of a tidy format?

13
00:00:59,410 --> 00:01:08,860
For now, we're going to assume we have data in this format. This kind of a layout is what the R ecosystem calls tidy data.

14
00:01:08,860 --> 00:01:13,480
Now, each of these methods returns a new frame. A few of them are going to return a series.

15
00:01:13,480 --> 00:01:17,620
But in general, we're gonna be transforming data frames to data frames here.

16
00:01:17,620 --> 00:01:22,270
And so if our input is a data frame where each row is another observation of the same kind of thing,

17
00:01:22,270 --> 00:01:26,200
the output will be a data frame where each row is an observation of the same kind of thing.

18
00:01:26,200 --> 00:01:30,820
It might be an observation of the same kind of thing as the input. It might be an observation of a different kind of thing.

19
00:01:30,820 --> 00:01:34,570
But these are the different operations that we're going to be talking about here.

20
00:01:34,570 --> 00:01:38,110
So if we want to select columns, we have a frame and we want the same frame,

21
00:01:38,110 --> 00:01:43,570
but with fewer columns, we have a few options. We can pick one column by treating the frame as a dictionary.

22
00:01:43,570 --> 00:01:50,230
We can pick multiple columns by passing in a list of column names, the same way we pick one column.

23
00:01:50,230 --> 00:01:55,420
One column will yield a series. Multiple columns will yield a frame. If we want to remove a column,

24
00:01:55,420 --> 00:02:02,020
so we want to keep all of the columns except one or two or however many that we name, the drop method

25
00:02:02,020 --> 00:02:09,170
returns a frame with all the columns of the original frame except the ones you tell it to drop.
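
As a rough sketch of the column selection and drop operations just described: the `movies` frame and its values below are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical movie data, invented for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Clue"],
    "year": [1979, 1995, 1985],
    "genre": ["Horror", "Crime", "Comedy"],
})

# One column, dictionary-style: the result is a Series.
titles = movies["title"]

# A list of column names: the result is a DataFrame.
title_year = movies[["title", "year"]]

# drop returns a new frame without the named columns; movies itself is unchanged.
no_genre = movies.drop(columns=["genre"])
```

Note the double brackets in the second selection: the outer pair indexes the frame, and the inner pair is the list of column names.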

26
00:02:09,170 --> 00:02:13,840
We want to select rows. We have a frame. We want the same frame, but a subset of the rows.

27
00:02:13,840 --> 00:02:19,840
There are a few common ways to do that. One is to select by a boolean mask. We set up a pandas series that has boolean values

28
00:02:19,840 --> 00:02:24,370
that are true for all the data positions we want to keep.

29
00:02:24,370 --> 00:02:29,410
And then we select with that mask. This is really good if we want to select by column values.

30
00:02:29,410 --> 00:02:36,100
So we can use a comparison operator to create a mask where all the values for one column are equal to a particular value.

31
00:02:36,100 --> 00:02:40,450
And then we can select using that boolean mask.
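
A minimal sketch of selecting rows with a boolean mask, as described above. The data is hypothetical and only meant to show the two steps: build the mask, then index with it.

```python
import pandas as pd

# Hypothetical movie data, invented for this example.
movies = pd.DataFrame({
    "title": ["Alien", "Heat", "Clue", "Scream"],
    "genre": ["Horror", "Crime", "Comedy", "Horror"],
})

# The comparison produces a boolean Series: True where the genre matches.
is_horror = movies["genre"] == "Horror"

# Indexing the frame with the mask keeps only the rows where it is True.
horror = movies[is_horror]
```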

32
00:02:40,450 --> 00:02:45,220
We can select by position in the frame, starting with zero.

33
00:02:45,220 --> 00:02:54,520
We can do that no matter what the index is, with iloc. So loc is the location

34
00:02:54,520 --> 00:03:01,780
accessor for pandas data frames. iloc always accesses by integer position; loc indexes by index keys,

35
00:03:01,780 --> 00:03:07,240
if we have the index keys we want. If we've just loaded the data frame from a CSV file

36
00:03:07,240 --> 00:03:11,110
and we haven't specified any index options, it's using the default range index.

37
00:03:11,110 --> 00:03:15,220
Then selecting by position and by index key are the same thing.
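
To make the difference between position and index key concrete, here is a small sketch. The `speeds` frame and the cyclist names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with the default RangeIndex (0, 1, 2).
speeds = pd.DataFrame({
    "cyclist": ["Ana", "Ben", "Cy"],
    "speed": [41.2, 39.8, 40.5],
})

# With the default index, position 1 and index key 1 refer to the same row.
pos_name = speeds.iloc[1]["cyclist"]
key_name = speeds.loc[1, "cyclist"]

# After setting a different index, the two accessors take different arguments.
by_name = speeds.set_index("cyclist")
second_row = by_name.iloc[1]   # by integer position, whatever the index is
ben_row = by_name.loc["Ben"]   # by index key
```

Here `second_row` and `ben_row` happen to be the same row, reached by position in one case and by key in the other.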

38
00:03:15,220 --> 00:03:19,750
Suppose we have a data frame where we've got our observations and we've got

39
00:03:19,750 --> 00:03:25,470
a column that identifies some kind of a group that each observation is in. Maybe it's ratings

40
00:03:25,470 --> 00:03:29,950
and it's the movie. Maybe it's movies and it's the actor or the genre.

41
00:03:29,950 --> 00:03:35,710
And what we want is a frame or a series that has one row per group of the original data,

42
00:03:35,710 --> 00:03:42,550
computing some kind of a statistic from a value in all of the rows within that group.

43
00:03:42,550 --> 00:03:46,900
Then we want a group by and aggregate, like we saw in the videos last week.
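
The group-and-aggregate step might look something like this. The `ratings` frame is hypothetical; the pattern is one row per group in the output, with a statistic computed over that group's rows.

```python
import pandas as pd

# Hypothetical ratings: each row is one rating of one movie.
ratings = pd.DataFrame({
    "movie": ["Alien", "Alien", "Heat", "Heat", "Heat"],
    "rating": [5.0, 4.0, 3.0, 4.0, 5.0],
})

# One row per movie: a Series indexed by the group key.
mean_rating = ratings.groupby("movie")["rating"].mean()

# Several statistics at once: a DataFrame with one column per statistic.
stats = ratings.groupby("movie")["rating"].agg(["mean", "count"])
```

Each input row is an observation of a rating; each output row is an observation of a movie, which is the "different kind of thing" mentioned earlier.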

44
00:03:46,900 --> 00:03:50,980
A couple more transformations are to think about tall versus wide data.

45
00:03:50,980 --> 00:03:54,610
So wide data has a column per variable.

46
00:03:54,610 --> 00:04:04,600
So in this case, this is data of the average speed for each of four different stages of a cycling race.

47
00:04:04,600 --> 00:04:12,450
And so we've got a column for each of the four different stages, and our rows are for each cyclist.

48
00:04:12,450 --> 00:04:19,560
Tall data, or long data, in its simplest form has three columns.

49
00:04:19,560 --> 00:04:23,670
We have the row identifier. We have the variable name.

50
00:04:23,670 --> 00:04:27,990
And we have the variable value. Sometimes these will just be called id, variable, and value.

51
00:04:27,990 --> 00:04:33,420
But it's often useful to give the variable and value columns meaningful names.

52
00:04:33,420 --> 00:04:36,570
We could also have more than one id column if we need to.

53
00:04:36,570 --> 00:04:43,740
But the idea here is that rather than having the stages in different columns, we split them out into different rows.

54
00:04:43,740 --> 00:04:48,420
So cyclist one has four rows, one for each of the four stages.

55
00:04:48,420 --> 00:04:54,660
We still call this an observation for one thing and for the same kind of thing.

56
00:04:54,660 --> 00:05:02,370
It's just that in the wide data, each of our observations is for a cyclist, and it's an observation of their speed for all four stages,

57
00:05:02,370 --> 00:05:06,900
whereas in the long data, each observation is for

58
00:05:06,900 --> 00:05:13,140
one cyclist in one particular stage. So each cyclist will have four observations, one for each stage.

59
00:05:13,140 --> 00:05:18,830
Tall data is useful for plotting and grouping because a lot of our plotting functions,

60
00:05:18,830 --> 00:05:27,430
a lot of plotting utility functions, are going to want to deal with a categorical variable that we use to determine maybe the x axis,

61
00:05:27,430 --> 00:05:33,900
maybe the color. And so often we're going to need tall data, especially when we're going to be plotting.

62
00:05:33,900 --> 00:05:43,230
If you want to turn wide data into tall, you use melt. If you want to turn tall data into wide, you use the pivot or pivot_table methods in pandas.
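
A small sketch of going from wide to tall and back again. The cyclist names and speeds are made up, and only two stages are used instead of four to keep the example short.

```python
import pandas as pd

# Hypothetical wide data: one row per cyclist, one column per stage.
wide = pd.DataFrame({
    "cyclist": ["Ana", "Ben"],
    "stage1": [41.0, 39.5],
    "stage2": [38.0, 40.0],
})

# melt: wide to tall. The id column is kept; the stage columns become
# (stage, speed) pairs. var_name and value_name give meaningful column names.
tall = wide.melt(id_vars="cyclist", var_name="stage", value_name="speed")

# pivot: tall back to wide, one column per distinct stage value.
wide_again = tall.pivot(index="cyclist", columns="stage", values="speed")
```

The tall frame has one row per cyclist per stage, so two cyclists and two stages give four rows. `pivot` expects each (index, columns) pair to appear only once; `pivot_table` is the variant that aggregates when there are repeats.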

63
00:05:43,230 --> 00:05:50,610
You can also create tall data from a list. So if we have a data frame where one of the columns actually contains lists,

64
00:05:50,610 --> 00:05:58,020
we haven't seen any data like this so far, except the genres column in the MovieLens data.

65
00:05:58,020 --> 00:06:05,480
But if we have a data frame where one column contains lists, and what we want is one row per list element,

66
00:06:05,480 --> 00:06:10,110
so we want to take this list that's in a column and split it out so that each element gets another row, it's going to go ahead and

67
00:06:10,110 --> 00:06:18,510
duplicate the rest of the columns. So they're going to have their values repeated once for each of the elements

68
00:06:18,510 --> 00:06:25,830
of this list. The pandas explode method will do that. Then finally, we can convert between series and data frames.
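
To illustrate the explode operation just described, here is a minimal sketch. The frame is hypothetical, with a `genres` column that holds Python lists.

```python
import pandas as pd

# Hypothetical data: the genres column contains a list per movie.
movies = pd.DataFrame({
    "title": ["Alien", "Clue"],
    "genres": [["Horror", "Sci-Fi"], ["Comedy", "Mystery", "Crime"]],
})

# explode gives each list element its own row and repeats the other
# columns (and the original index label) for every element.
exploded = movies.explode("genres")
```

The repeated index labels show which original row each new row came from.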

69
00:06:25,830 --> 00:06:31,170
So if we have a data frame and we want to get a series, we just select the column from the data frame.

70
00:06:31,170 --> 00:06:35,400
We saw that at the beginning. If we have a series and we want to get a data frame,

71
00:06:35,400 --> 00:06:41,790
we can just create a single-column frame with to_frame, the to_frame method on the series object. It also

72
00:06:41,790 --> 00:06:45,270
lets you give it a name, so that you have a name in the resulting data frame.

73
00:06:45,270 --> 00:06:49,650
Also, if you want to create a multi-column data frame where you've got a column for the value and

74
00:06:49,650 --> 00:06:53,910
you have a column for the index of the original series,

75
00:06:53,910 --> 00:07:00,630
the pandas series reset_index method will pop that index out into a data frame column.

76
00:07:00,630 --> 00:07:11,340
And then finally, if you have a series with multiple levels to its index, we haven't seen those yet, but we're going to see them from time to time,

77
00:07:11,340 --> 00:07:18,190
the unstack method will turn the innermost index labels into column labels.
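
The three series-to-frame conversions mentioned here can be sketched as follows. The series, their index names, and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical series of mean ratings, indexed by movie title.
mean_rating = pd.Series(
    [4.5, 4.0],
    index=pd.Index(["Alien", "Heat"], name="movie"),
)

# to_frame: a single-column DataFrame; name sets the column label.
frame = mean_rating.to_frame(name="mean_rating")

# reset_index: the index becomes an ordinary column next to the values.
flat = mean_rating.reset_index(name="mean_rating")

# A hypothetical series with a two-level index (cyclist, stage).
speeds = pd.Series(
    [41.0, 38.0, 39.5, 40.0],
    index=pd.MultiIndex.from_product(
        [["Ana", "Ben"], ["stage1", "stage2"]],
        names=["cyclist", "stage"],
    ),
)

# unstack: the innermost level (stage) becomes the column labels.
wide = speeds.unstack()
```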

78
00:07:18,190 --> 00:07:22,450
That turns the series into a data frame. So, to think about strategy.

79
00:07:22,450 --> 00:07:31,140
Each of these is an individual little building block. And we need to put them together to get from the data that we have to the data that we want.

80
00:07:31,140 --> 00:07:36,000
And so what I recommend is that you decide what you want the end product to look like.

81
00:07:36,000 --> 00:07:40,410
If you're going to draw a chart or you're going to do an analysis or an inference,

82
00:07:40,410 --> 00:07:46,710
what are the observations and the variables that you need for that chart or inference?

83
00:07:46,710 --> 00:07:54,210
And then once you've figured that out, you can plot a path from your current data to your end product.

84
00:07:54,210 --> 00:08:10,770
So, if you want to show a distribution of the mean ratings for all of the movies in the horror genre, then you're going to need to select

85
00:08:10,770 --> 00:08:15,300
the rows that have only the movies that are in the horror genre. You can select that.

86
00:08:15,300 --> 00:08:23,460
You're probably going to have a join as well, in order to get the genre table and the movie table connected, depending on how your data's laid out.

87
00:08:23,460 --> 00:08:33,720
And once you've filtered it down, so that these are the horror movies, then you need to get the ratings.

88
00:08:33,720 --> 00:08:37,720
You need to have the average ratings,

89
00:08:37,720 --> 00:08:45,950
and you need to combine those with the movies, as we've seen how to do in a previous video.

90
00:08:45,950 --> 00:08:52,750
And then you have the observations that you want. You need to be able to plot this kind of a path from what you have to the end product.
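
One possible path for the horror-movie example, as a sketch only. The three tables, their column names, and their values are all hypothetical; your actual layout may need different joins, as noted above.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alien", "Heat", "Scream"],
})
genres = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "genre": ["Horror", "Sci-Fi", "Crime", "Horror"],
})
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3],
    "rating": [5.0, 4.0, 3.0, 2.0, 4.0],
})

# Step 1: select the horror rows from the genre table, then join to movies.
horror_ids = genres.loc[genres["genre"] == "Horror", ["movie_id"]]
horror_movies = movies.merge(horror_ids, on="movie_id")

# Step 2: group and aggregate to get one mean rating per movie.
mean_ratings = (
    ratings.groupby("movie_id")["rating"].mean().reset_index(name="mean_rating")
)

# Step 3: combine the mean ratings with the horror movies.
horror_means = horror_movies.merge(mean_ratings, on="movie_id")
```

The end product has one row per horror movie with its mean rating, which is the shape a histogram or other distribution plot of `mean_rating` would need.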

91
00:08:52,750 --> 00:08:58,350
In this example, I've referenced some joins. We saw joins very, very briefly last week.

92
00:08:58,350 --> 00:09:05,310
We're going to see them again in more detail in the notebooks. So to wrap up, pandas has many tools for reshaping data.

93
00:09:05,310 --> 00:09:09,300
You want to start with the end in mind, work from what you have to what you need.

94
00:09:09,300 --> 00:09:18,167
Read the tutorial notebooks for a lot more details.

Title: https:/.../7ee3707b-0e2d-4a18-8de1-ad9601830ce7-e6d4df35-320f-4d95-ab81-ad9c006bb089.mp4?invocationId=fcdaf66c-a50f-ec11-a9e9-0a1a827ad0ec
Video Language: English
Duration: 09:18

janetlayne edited English subtitles for https:/.../7ee3707b-0e2d-4a18-8de1-ad9601830ce7-e6d4df35-320f-4d95-ab81-ad9c006bb089.mp4?invocationId=fcdaf66c-a50f-ec11-a9e9-0a1a827ad0ec

Sep 7, 2021, 6:33 AM
