1 00:00:04,610 --> 00:00:10,490 Hello. In this video, I want to talk to you about how to build up a chart from the ground up as we think 2 00:00:10,490 --> 00:00:15,260 about the question it's going to try to answer and the pieces that need to go into it. 3 00:00:15,260 --> 00:00:21,350 So the learning outcome for this video is for us to be able to design a chart by thinking first of the questions, 4 00:00:21,350 --> 00:00:28,070 the goals, and the data that are going to go into and come from the chart. 5 00:00:28,070 --> 00:00:34,580 So a good chart answers a question, and the guiding principle for how we design and 6 00:00:34,580 --> 00:00:40,940 how we lay out our chart is to illuminate the question that we want to answer. 7 00:00:40,940 --> 00:00:45,230 And this depends on a few things. We need to know what question we want to answer in the first place. 8 00:00:45,230 --> 00:00:52,550 We also need to know precisely how we operationalize that question so we can use that to inform the chart layout. 9 00:00:52,550 --> 00:00:59,370 And we need to know what data we're using, specifically what variables we're using as a part of this chart. 10 00:00:59,370 --> 00:01:05,250 For example, there's a data set, you'll see it in the notebook that goes with this video, of passengers on the 11 00:01:05,250 --> 00:01:09,800 Titanic, and suppose we wanted to examine whether passengers in a higher fare class, 12 00:01:09,800 --> 00:01:15,450 say, first class, are more likely to survive than passengers in lower fare classes. 
13 00:01:15,450 --> 00:01:19,860 In this analysis, we have an outcome variable, zero or one: 14 00:01:19,860 --> 00:01:29,590 whether or not the passenger survived the Titanic sinking. And a lot of charts are going to have an outcome variable. 15 00:01:29,590 --> 00:01:37,140 We have some outcome variable and we want to see how it responds to, or how it differs with, some other variable, 16 00:01:37,140 --> 00:01:43,170 which we call the explanatory variable, in this case the passage class. Our outcome is survival, 17 00:01:43,170 --> 00:01:52,500 and we want to see how it changes as the passenger's passage class, the explanatory variable, changes. 18 00:01:52,500 --> 00:01:57,510 The outcome variable is also called the response variable or the dependent variable, 19 00:01:57,510 --> 00:02:01,770 because it's what we're trying to measure that's responding to the condition we're trying to analyze. 20 00:02:01,770 --> 00:02:06,990 And the explanatory variable is sometimes called the independent variable because it's changing, 21 00:02:06,990 --> 00:02:11,820 but it's not changing as a function of the other variables, in theory. 22 00:02:11,820 --> 00:02:20,160 So we can do this with a bar chart. In this bar chart, the x axis is our passage class: first, 23 00:02:20,160 --> 00:02:29,520 second, and third, and the y axis is the fraction of passengers in that class who survived. 24 00:02:29,520 --> 00:02:35,460 We also see some error bars. We're going to see later what those mean and how to compute them. 
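As a rough sketch of the aggregation behind a chart like that one: the rows below are made up for illustration and are not the notebook's Titanic data, and the function names are hypothetical. It turns the zero-or-one outcome into a per-class rate, then optionally draws it with the explanatory variable on x and the outcome on y.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the Titanic data:
# (passage class, survived as 0 or 1). Not the real data set.
passengers = [
    (1, 1), (1, 1), (1, 0), (1, 1),
    (2, 1), (2, 0), (2, 1), (2, 0),
    (3, 0), (3, 0), (3, 1), (3, 0),
]


def survival_rate_by_class(rows):
    """Turn a 0/1 outcome into a per-class rate (the mean of the 0/1 values)."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for pclass, survived in rows:
        totals[pclass] += 1
        survivors[pclass] += survived
    return {pclass: survivors[pclass] / totals[pclass] for pclass in sorted(totals)}


def plot_rates(rates):
    """Bar chart with the explanatory variable on x and the outcome on y."""
    import matplotlib.pyplot as plt  # local import: only needed for drawing

    fig, ax = plt.subplots()
    ax.bar([str(pclass) for pclass in rates], list(rates.values()))
    ax.set_xlabel("Passage class")
    ax.set_ylabel("Fraction surviving")
    ax.set_ylim(0, 1)  # keep the bar baseline at zero
    return fig


rates = survival_rate_by_class(passengers)
print(rates)
```

The aggregation is kept apart from the drawing on purpose: the question decides which statistic to compute, and the plot only displays it. This sketch leaves out the error bars.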
25 00:02:35,460 --> 00:02:47,550 But this lets us see how the outcome, survival, changes with the different passage classes of the passengers. 26 00:02:47,550 --> 00:02:56,490 And one of the things to note here is that we have our explanatory variable on the x axis and the outcome variable on the y axis. 27 00:02:56,490 --> 00:03:02,100 That is the general convention. There are some cases where we might want to flip it, 28 00:03:02,100 --> 00:03:09,510 so we've got a horizontal bar chart where the explanatory is on the y and the outcome is on the x, 29 00:03:09,510 --> 00:03:13,920 particularly if it makes the labels more readable. 30 00:03:13,920 --> 00:03:25,380 But the standard convention for most types of charts is to put the explanatory variable on the x axis, the horizontal axis, and the outcome variable on the y axis. 31 00:03:25,380 --> 00:03:29,370 And this chart shows a relationship. Many charts show a relationship; 32 00:03:29,370 --> 00:03:39,010 most of the plots that we're going to be drawing in this class show how some kind of a numeric variable, either continuous 33 00:03:39,010 --> 00:03:51,610 or integer, changes between different values of one or more other variables. And in this case, even though our response was zero-one, logical, 34 00:03:51,610 --> 00:03:59,480 when we converted it into a rate per passage class, it became a continuous variable. 35 00:03:59,480 --> 00:04:05,620 And so when we do this, we need to identify a few key things to design our plots. 
36 00:04:05,620 --> 00:04:12,490 We need to identify what variable we want to show. In a lot of plots, that'll be on our y axis. 37 00:04:12,490 --> 00:04:18,520 When it's not, it'll usually be on x. And identifying that variable is, 38 00:04:18,520 --> 00:04:22,660 if anything, probably the most important thing in designing a plot. 39 00:04:22,660 --> 00:04:27,160 We then need to identify what we want to show about this variable. 40 00:04:27,160 --> 00:04:32,380 Do we want to show its value for different data points? Do we want to show a statistic? 41 00:04:32,380 --> 00:04:36,370 Do we want to show, for example, the mean or the rate? 42 00:04:36,370 --> 00:04:44,220 In the previous example, the Titanic example, we showed a statistic, the rate of survival. 43 00:04:44,220 --> 00:04:47,660 Do we want to show its distribution? 44 00:04:47,660 --> 00:04:57,410 And then how do we want to compare that between values of the explanatory variable? Particularly, do we want to look at absolute differences? 45 00:04:57,410 --> 00:05:08,410 Do we want to look at relative or proportional differences? And even histograms follow this kind of a design, because they have an outcome, 46 00:05:08,410 --> 00:05:15,640 which is the frequency or the count or the proportion or the density, depending on precisely what kind of histogram we're showing. 47 00:05:15,640 --> 00:05:20,500 And then they have the explanatory variable, which is the value or the bin. 
48 00:05:20,500 --> 00:05:33,850 So we've got a histogram and we've got some bins. And the response variable is how many values are in that bin, and the explanatory is the bin. 49 00:05:33,850 --> 00:05:40,900 So identifying these then informs the entire pipeline of producing our chart. There's the data processing at the beginning: 50 00:05:40,900 --> 00:05:47,110 we're going to do grouping, aggregation, and transformation that gets us to the final values we can actually plot. 51 00:05:47,110 --> 00:05:55,210 It's going to affect our choice of plot type, and it's going to affect our choice of axis labels, colors, facets, the other aspects of the plot. 52 00:05:55,210 --> 00:06:04,500 So the type of the variable has a significant impact. 53 00:06:04,500 --> 00:06:07,560 The response is often numeric, or can be transformed to be. 54 00:06:07,560 --> 00:06:15,840 If we're talking about a categorical value, we usually want the relative frequency of different values of that. 55 00:06:15,840 --> 00:06:22,320 So we're doing it like a histogram, and we're going to transform it so that we're showing just the distribution. 56 00:06:22,320 --> 00:06:29,010 We're going to transform it so that the explanatory becomes the value of the categorical 57 00:06:29,010 --> 00:06:33,870 and the response is how many, or what fraction, have it. For a logical, 58 00:06:33,870 --> 00:06:40,870 or if it's a two-level categorical, we might turn it into a fraction, just the fraction that have one of the levels versus the other. As for the explanatory variable: 
59 00:06:40,870 --> 00:06:46,920 It can be anything. We're going to see how to use numeric explanatory variables and categorical explanatory variables. Ordinal 60 00:06:46,920 --> 00:06:48,420 works just like categorical, 61 00:06:48,420 --> 00:06:57,180 except that it's a discrete axis that preserves order, and we need to make sure that the order in ordinal data is being preserved. 62 00:06:57,180 --> 00:07:05,270 If you're using the pandas ordered category type, it automatically preserves order for you when you're doing the plot. 63 00:07:05,270 --> 00:07:12,050 So if we just have one explanatory variable, this is the easiest case. If our explanatory variable is continuous, 64 00:07:12,050 --> 00:07:16,820 we usually want a scatterplot or a line plot for showing individual values. 65 00:07:16,820 --> 00:07:22,890 Sometimes we'll flip the response and explanatory on a scatterplot, or both might be explanatory: 66 00:07:22,890 --> 00:07:25,790 we want to show where points lie in a two-dimensional space. 67 00:07:25,790 --> 00:07:35,120 But generally, if we've got a continuous explanatory variable, 68 00:07:35,120 --> 00:07:39,530 whether we're trying to show values 69 00:07:39,530 --> 00:07:45,380 or trying to show statistics like a mean at each value of the explanatory variable, 70 00:07:45,380 --> 00:07:56,490 we're going to use a scatterplot or a line plot. 
If the explanatory variable is discrete, then we're going to use a bar chart to show a statistic 71 00:07:56,490 --> 00:08:05,400 if we want to estimate the relative difference. We want to be able to compare values relative to each other, 72 00:08:05,400 --> 00:08:08,640 because one bar will be twice as high as another. 73 00:08:08,640 --> 00:08:15,390 And a point plot shows a statistic or an individual value, and it emphasizes absolute difference. 74 00:08:15,390 --> 00:08:19,740 You don't have a whole bar in order to compare heights. 75 00:08:19,740 --> 00:08:28,140 You just have the point. And then if we want to show a distribution, we usually use a box or a violin plot with this discrete explanatory variable. 76 00:08:28,140 --> 00:08:32,670 We don't have great ways to show distributions with continuous explanatory variables. 77 00:08:32,670 --> 00:08:36,480 You can show a variance with an error bar or a ribbon, 78 00:08:36,480 --> 00:08:44,220 but that's about it. For two explanatory variables, 79 00:08:44,220 --> 00:08:53,160 we have a couple of options. One is we can do a pseudo-3D display where we do a contour plot or a heat map. 80 00:08:53,160 --> 00:08:57,000 And I'm going to show both of these here. 81 00:08:57,000 --> 00:09:01,860 The left one is a contour plot, and it reads like a topographical map. 
82 00:09:01,860 --> 00:09:09,180 If you envision your two explanatory variables, in this case we're showing a two-dimensional distribution. 83 00:09:09,180 --> 00:09:18,990 So one explanatory variable is the score given to a movie by its critics, and another explanatory variable is the score given by its audience. 84 00:09:18,990 --> 00:09:23,930 And then the response variable is how many movies have that combination. 85 00:09:23,930 --> 00:09:30,320 And so we can see here, this is the peak. A contour plot is really good for showing us the peak. 86 00:09:30,320 --> 00:09:43,510 It's going to be that innermost circle. And it also shows us the shape, because each of these rings is a level of 87 00:09:43,510 --> 00:09:52,090 decreasing height in this map. So if we envision that the response variable is height and we're looking at a two-dimensional map, 88 00:09:52,090 --> 00:09:56,110 the rings show us the contours around the mountains of that height. 89 00:09:56,110 --> 00:10:01,150 It's good for showing shape. The other one is the heat map, which uses color. 90 00:10:01,150 --> 00:10:07,240 And so it's usually going to go from a cool color like, say, black here, to a hot orange, 91 00:10:07,240 --> 00:10:18,230 or sometimes you have a bidirectional one, which goes blue to red. And it lets us see the highest density is here, 92 00:10:18,230 --> 00:10:23,390 and as you go out from there, you get lower and lower densities. 
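The response variable here, how many movies have each combination of scores, has to be computed before a heat map can draw it. A minimal sketch follows, assuming made-up scores on a 0 to 10 critic scale and a 0 to 5 audience scale. The scales, bin widths, and function names are assumptions for illustration, not the data behind the chart in the video.

```python
from collections import Counter

# Made-up (critic score, audience score) pairs. The 0-10 and 0-5 scales
# and the bin widths below are assumptions for illustration only.
movies = [
    (8.5, 4.2), (8.8, 4.4), (8.6, 4.3), (9.0, 4.1),
    (4.0, 2.6), (4.2, 2.9), (6.0, 3.5),
]


def count_cells(pairs, critic_width=1.0, audience_width=0.5):
    """Count how many movies fall in each (critic bin, audience bin) cell."""
    return Counter(
        (int(critic // critic_width), int(audience // audience_width))
        for critic, audience in pairs
    )


def to_grid(cells, n_critic_bins=11, n_audience_bins=11):
    """Lay the counts out as rows (audience bins) by columns (critic bins)."""
    grid = [[0] * n_critic_bins for _ in range(n_audience_bins)]
    for (critic_bin, audience_bin), n in cells.items():
        grid[audience_bin][critic_bin] = n
    return grid


def plot_heatmap(grid):
    """Draw the binned counts as a heat map, dark for low and bright for high."""
    import matplotlib.pyplot as plt  # local import: only needed for drawing

    fig, ax = plt.subplots()
    image = ax.imshow(grid, origin="lower", cmap="magma")
    ax.set_xlabel("Critic score bin")
    ax.set_ylabel("Audience score bin")
    fig.colorbar(image, ax=ax, label="Number of movies")
    return fig


cells = count_cells(movies)
peak_cell, peak_count = cells.most_common(1)[0]
print(peak_cell, peak_count)
```

The two bin indices play the role of the two explanatory variables, and the count in each cell is the response that gets mapped to color.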
93 00:10:23,390 --> 00:10:27,740 Either one can work. For a heat map of continuous variables, you often have to bin them in order to draw it. 94 00:10:27,740 --> 00:10:36,890 This is a discretized heat map where we have binned everything in bins of half 95 00:10:36,890 --> 00:10:41,510 a star on the audience score and a full star on the critic score, because they're on different scales. 96 00:10:41,510 --> 00:10:46,850 But heat maps also work well for categorical and ordinal data. 97 00:10:46,850 --> 00:10:55,180 Another way we can do it is we can use other aesthetics for secondary variables, such as color or shape or size. 98 00:10:55,180 --> 00:10:58,300 Sometimes we'll use that to indicate a second response variable, 99 00:10:58,300 --> 00:11:06,820 like you might have a scatterplot where the size of the point is a second response variable, but often it's for multiple explanatory variables. 100 00:11:06,820 --> 00:11:15,640 So this shows us how we can do that. If we wanted to break down Titanic survival rates by both class and sex, 101 00:11:15,640 --> 00:11:22,240 we keep our class on the x axis like we did before, 102 00:11:22,240 --> 00:11:32,290 and then we use color for the passenger's sex, so we can see significantly higher survival rates for women across all three classes. 103 00:11:32,290 --> 00:11:36,550 I'm also showing you here the difference between a bar chart and a point plot. 104 00:11:36,550 --> 00:11:41,590 So the left is the bar chart. The right is the point plot. 
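Here is one way that breakdown by class and sex might be sketched. The rows are invented, not the real Titanic data, and the helper names are hypothetical. The first part computes the rate for each combination and prints both the ratio (what a bar chart makes easy to read) and the gap (what a point plot makes easy to read). The plotting function assumes pandas and seaborn are installed and uses seaborn's `catplot` with `kind="bar"` and `kind="point"`.

```python
from collections import defaultdict


def make_rows(pclass, sex, outcomes):
    """Build (class, sex, survived) rows from a list of 0/1 outcomes."""
    return [(pclass, sex, outcome) for outcome in outcomes]


# Invented toy data; the real survival rates differ.
records = (
    make_rows(1, "female", [1, 1, 1, 1]) + make_rows(1, "male", [1, 1, 0, 0])
    + make_rows(2, "female", [1, 1, 1, 0]) + make_rows(2, "male", [1, 0, 0, 0])
    + make_rows(3, "female", [1, 1, 0, 0]) + make_rows(3, "male", [1, 0, 0, 0])
)


def rate_by_group(rows):
    """Survival rate for each (class, sex) combination."""
    outcomes = defaultdict(list)
    for pclass, sex, survived in rows:
        outcomes[(pclass, sex)].append(survived)
    return {key: sum(values) / len(values) for key, values in outcomes.items()}


def plot_bar_and_point(rows):
    """Same data twice: class on x, sex on color, as bars and as points."""
    import pandas as pd  # local imports: only needed for drawing
    import seaborn as sns

    df = pd.DataFrame(rows, columns=["pclass", "sex", "survived"])
    bars = sns.catplot(data=df, x="pclass", y="survived", hue="sex", kind="bar")
    points = sns.catplot(data=df, x="pclass", y="survived", hue="sex", kind="point")
    return bars, points


rates = rate_by_group(records)
for pclass in (1, 2, 3):
    female, male = rates[(pclass, "female")], rates[(pclass, "male")]
    print(pclass, "ratio:", female / male, "gap:", female - male)
```

In this toy data the first and third classes have the same ratio but different gaps, so the same numbers can look alike in one comparison and different in the other.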
105 00:11:41,590 --> 00:11:46,030 And the bar chart lets us compare the heights of the bars. Note that it starts at zero. 106 00:11:46,030 --> 00:11:55,590 Bar charts always start at zero. So it lets us compare the height of the bars, and we can see, 107 00:11:55,590 --> 00:12:08,290 it's easy to see just using our vision, that the female passenger first class bar is more than twice as tall 108 00:12:08,290 --> 00:12:12,370 as the male passenger first class bar, 109 00:12:12,370 --> 00:12:18,370 and the male passenger first class bar is twice as tall as the male passenger second class bar. 110 00:12:18,370 --> 00:12:23,410 So it lets us make relative comparisons between the different values. 111 00:12:23,410 --> 00:12:29,230 This is why it always starts at zero, because the natural thing to do with a bar is compare its height. 112 00:12:29,230 --> 00:12:34,990 If your bar chart does not start at zero, suppose our bar chart started at 0.1, 113 00:12:34,990 --> 00:12:40,300 then the comparison of height would exaggerate the difference relative to the value. 114 00:12:40,300 --> 00:12:45,850 And what looks twice as tall isn't actually twice as tall, because we cut off a bunch of the bottom. 115 00:12:45,850 --> 00:12:52,010 So always start at zero. The point plot does not, and it makes it hard to compare relative difference. 116 00:12:52,010 --> 00:12:55,790 It's difficult for us to tell that visually. 
117 00:12:55,790 --> 00:12:57,500 We can tell if we look at the numbers, 118 00:12:57,500 --> 00:13:05,090 but it's difficult to visually tell that the survival rate of women in first class is twice as high as the survival rate of men. 119 00:13:05,090 --> 00:13:11,390 But what it does let us see is the absolute difference between these values, 120 00:13:11,390 --> 00:13:16,550 and it makes it easy to compare the difference in the gaps across the three classes. 121 00:13:16,550 --> 00:13:31,670 We can see that the survival rate by sex is much closer in the third class than it is in the first or in the second class. 122 00:13:31,670 --> 00:13:39,020 So your choice of plot really guides the user to see different things. Your choice of plot 123 00:13:39,020 --> 00:13:42,650 allows you to emphasize different things, and you need to decide. 124 00:13:42,650 --> 00:13:51,440 You need to choose and design your plot in such a way that's going to tell the story that you need to tell from the data. 125 00:13:51,440 --> 00:13:59,570 We can also have more than two explanatory variables. It's difficult to have more than one that's numeric, or two if we're doing a contour plot. 126 00:13:59,570 --> 00:14:05,270 We can bin variables, which is then going to let us use some more techniques, such as faceting. 127 00:14:05,270 --> 00:14:08,270 So we might want to break down by more categorical variables, 128 00:14:08,270 --> 00:14:12,320 or we might want to break down by many more variables. 
129 00:14:12,320 --> 00:14:18,140 Let's say we also want to look at age. And so we're going to keep sex on the color. 130 00:14:18,140 --> 00:14:24,050 We're going to now use age as the x axis. Since it's numeric, it really works better on an axis. 131 00:14:24,050 --> 00:14:28,970 I have binned it into bins of ten so that you only have one point for every decade. 132 00:14:28,970 --> 00:14:36,530 But then we use a facet, and the facet means we draw a different chart for each of the three classes. 133 00:14:36,530 --> 00:14:43,400 The charts all share a y axis so we can directly compare across the row of charts, and it lets us see 134 00:14:43,400 --> 00:14:55,400 particularly how survival as a function of age changes between different passenger classes, 135 00:14:55,400 --> 00:15:01,460 for example. And so it lets us start to build up. 136 00:15:01,460 --> 00:15:05,720 And if we had a fourth, we could use rows and columns in the faceted plot. 137 00:15:05,720 --> 00:15:10,310 So we have these mechanisms of building up: we have our x axis and our y axis. 138 00:15:10,310 --> 00:15:18,380 We can use aesthetics of the lines or the points, particularly color, size, shape, 139 00:15:18,380 --> 00:15:26,730 and then we can use facets to build up even more variables into our plot. 140 00:15:26,730 --> 00:15:32,370 To do faceting, there's a couple of things you can do. It's built into some of the seaborn plotting functions. 
141 00:15:32,370 --> 00:15:37,230 The relplot and catplot functions can both do faceting on their own. 142 00:15:37,230 --> 00:15:42,240 They let you control the statistic. They're very, very flexible functions for a wide range of plots. 143 00:15:42,240 --> 00:15:48,810 The general-purpose FacetGrid allows you to facet any kind of plot by writing some more Python code on your own. 144 00:15:48,810 --> 00:15:53,190 Very useful if you want to facet something that doesn't support faceting built in. 145 00:15:53,190 --> 00:15:59,940 And if you're using plotnine or the R ggplot2 package, facet_grid and facet_wrap control faceting. 146 00:15:59,940 --> 00:16:07,680 When you build that faceted plot, you need to pay attention to what variables go where. Your choice of which variables are going to be on color, 147 00:16:07,680 --> 00:16:16,530 which variables are going to be facets, and which variables are going to be on your axes really affects how the reader is going to interpret and understand 148 00:16:16,530 --> 00:16:22,680 your plot, and you need to choose them strategically to tell the story that addresses your question. 149 00:16:22,680 --> 00:16:28,950 You also need to do it, though, in a way that is honest and does not mislead your readers. 150 00:16:28,950 --> 00:16:38,810 The chart needs to honestly show the readers what it is that you learned from the data, and show that clearly. 
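A minimal sketch of that faceted layout follows, with age binned by decade on the x axis, sex on color, and one panel per class. The rows and the helper names `decade` and `facet_by_class` are invented for illustration. The plotting function assumes pandas and seaborn are installed and uses `catplot` with its `col` argument.

```python
def decade(age):
    """Lower edge of the ten-year bin containing age, so 37 falls in bin 30."""
    return int(age // 10) * 10


# Invented rows: (passage class, sex, age in years, survived as 0 or 1).
rows = [
    (1, "female", 29, 1), (1, "male", 36, 0),
    (2, "female", 8, 1), (2, "male", 41, 0),
    (3, "female", 22, 1), (3, "male", 25, 0),
]

# Bin the numeric variable first so there is one x position per decade.
binned = [(pclass, sex, decade(age), survived) for pclass, sex, age, survived in rows]


def facet_by_class(binned_rows):
    """One panel per class, sharing a y axis so panels compare directly."""
    import pandas as pd  # local imports: only needed for drawing
    import seaborn as sns

    df = pd.DataFrame(binned_rows, columns=["pclass", "sex", "age_bin", "survived"])
    return sns.catplot(
        data=df, x="age_bin", y="survived", hue="sex",
        col="pclass", kind="point", sharey=True,
    )


print(binned)
```

Swapping which variable goes to `x`, `hue`, and `col` changes which comparison the reader sees first, so the assignment is worth trying more than one way.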
151 00:16:38,810 --> 00:16:43,610 Another thing we can do to build up a chart, especially if we have more categorical variables: 152 00:16:43,610 --> 00:16:47,180 if we've got a categorical response variable with more than two levels, 153 00:16:47,180 --> 00:16:56,390 and we want to show particularly how the proportion in different categories changes in response to another variable, 154 00:16:56,390 --> 00:17:04,250 a stacked chart can be very good. It lets us see the differences in composition, to see how the parts of a whole change. 155 00:17:04,250 --> 00:17:05,390 And so this chart, 156 00:17:05,390 --> 00:17:16,060 this is a stacked bar chart, and it's a horizontal bar chart where I put the explanatory variable on the x axis, excuse me, on the y axis, 157 00:17:16,060 --> 00:17:22,150 just in part to make the labels easier to read. And so our explanatory variable is what data set 158 00:17:22,150 --> 00:17:28,210 something came from: Locke, M.D. Gry. What those are doesn't matter for our purposes right now. 159 00:17:28,210 --> 00:17:34,540 The response variable is the distribution of genders. In this case, 160 00:17:34,540 --> 00:17:39,970 these are data sets of books, and it's the genders of the authors of those books in the data set. 161 00:17:39,970 --> 00:17:47,260 And so we have female, we've got male, and we also have codes for when it's ambiguous or unknown or we didn't have data. 162 00:17:47,260 --> 00:17:59,470 And so we can see, for example, the GYŐRI data set has a higher fraction of women and a significantly lower fraction of men. 
163 00:17:59,470 --> 00:18:12,900 And we can see quite a few more books where we don't know the gender. And so the order on this chart is very strategic. 164 00:18:12,900 --> 00:18:22,230 How I ordered these levels is very strategic. I batched all of the various kinds of "we don't know" together so that 165 00:18:22,230 --> 00:18:27,180 you can look at that whole block and see the various types of 166 00:18:27,180 --> 00:18:33,720 "we don't know the gender of the book's author" together, but you can also see how they're broken down into individual things. 167 00:18:33,720 --> 00:18:41,790 You can see that unlinked is a very, very large fraction of that increase in books where we don't know the author's gender. 168 00:18:41,790 --> 00:18:46,560 So you need to think about all of these different things in order 169 00:18:46,560 --> 00:18:51,630 to be able to generate a chart that's going to clearly and unambiguously communicate. 170 00:18:51,630 --> 00:18:54,930 You can show raw values in a stacked bar chart, and then the bars 171 00:18:54,930 --> 00:18:58,690 don't all have to be the same height. Or you can show fractions, in which case they will be. 172 00:18:58,690 --> 00:19:07,080 I chose to show fractions in this chart. The code that generates this using raw matplotlib is linked in the notes for the video. 173 00:19:07,080 --> 00:19:09,330 Sometimes we're also going to transform our charts. 
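This is not the code linked in the notes, just a small sketch of how a stacked horizontal bar chart of fractions can be laid out with raw matplotlib. The data set names, the counts, and the function names are invented. The idea is that each category's segment starts where the previous one ended, and that the category order is chosen so the "we don't know" codes sit next to each other.

```python
# Invented counts of author-gender codes per data set; names and numbers
# are hypothetical and only illustrate the layout.
counts = {
    "data set A": {"female": 30, "male": 50, "ambiguous": 5, "unknown": 15},
    "data set B": {"female": 40, "male": 20, "ambiguous": 10, "unknown": 30},
}

# Deliberate order: the "we don't know" codes are adjacent at the end.
order = ["female", "male", "ambiguous", "unknown"]


def stack_layout(counts, order):
    """For each category, the left edges and widths of its segments, as fractions."""
    names = list(counts)
    lefts = [0.0] * len(names)
    layout = {}
    for category in order:
        widths = [counts[name][category] / sum(counts[name].values()) for name in names]
        layout[category] = (list(lefts), widths)
        # The next category starts where this one ends.
        lefts = [left + width for left, width in zip(lefts, widths)]
    return layout


def plot_stacked(counts, order):
    """Horizontal stacked bars: data set on y, fraction of books on x."""
    import matplotlib.pyplot as plt  # local import: only needed for drawing

    names = list(counts)
    fig, ax = plt.subplots()
    for category, (lefts, widths) in stack_layout(counts, order).items():
        ax.barh(names, widths, left=lefts, label=category)
    ax.set_xlabel("Fraction of books")
    ax.legend()
    return fig


layout = stack_layout(counts, order)
print(layout["unknown"])
```

Dividing by each data set's total is what makes every bar the same length. Plotting the raw counts instead would keep the same stacking logic but let the bar lengths differ.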
174 00:19:09,330 --> 00:19:15,330 We might transform the axis, such as doing a log10 scale. When we transform the axis, 175 00:19:15,330 --> 00:19:20,070 the labels are still in their original values. It's just that they're spaced out logarithmically. 176 00:19:20,070 --> 00:19:24,780 We generally won't do this for bars. Reading a bar on a log scale, 177 00:19:24,780 --> 00:19:30,840 you can draw it, but you have to be really, really careful in order to make sure that your readers are going to accurately interpret it. 178 00:19:30,840 --> 00:19:35,580 But for line and scatter plots, log transforms are a lot more common. 179 00:19:35,580 --> 00:19:41,580 Sometimes, though, we're actually going to transform the data itself, and we're going to plot a log or a square root or some other rescaling. 180 00:19:41,580 --> 00:19:49,160 And another common transformation is to bin the data, to somehow discretize it into fixed bins 181 00:19:49,160 --> 00:19:54,530 by some mechanism or another. So the key decisions that you need to make when you're making one of these charts 182 00:19:54,530 --> 00:20:00,920 are: you need to pick the variables and how you're doing their transformations. You need to pick what's called the aesthetics, 183 00:20:00,920 --> 00:20:06,680 how you're going to map the different variables you're looking at to chart features: your x and y axes, 184 00:20:06,680 --> 00:20:10,700 your facets, row and column, your color, your point marker style. 185 00:20:10,700 --> 00:20:14,690 If you're doing a joint plot, often it's useful to put 
186 00:20:14,690 --> 00:20:25,370 the same variable on both color and style. That way, if you have a reader who's colorblind, they still get different point styles, 187 00:20:25,370 --> 00:20:30,050 even if they can't tell the colors apart, or if someone's printing it on a black-and-white printer. 188 00:20:30,050 --> 00:20:34,340 And then you need the type of the chart: line chart, bar, point, box, et cetera. 189 00:20:34,340 --> 00:20:41,750 So you have to make all of these decisions when you're drawing this chart, and they're driven by what variables and data you have, and what 190 00:20:41,750 --> 00:20:50,210 question you're trying to answer, and what story you're trying to tell about that. You do need to be careful to avoid excessive complexity. 191 00:20:50,210 --> 00:20:58,550 We can put a different variable on every conceivable aesthetic, and it's often going to result in a chart that's very difficult to read. 192 00:20:58,550 --> 00:21:03,410 We also have to be careful with color, because it's easy to make a chart that has differences 193 00:21:03,410 --> 00:21:07,580 that are difficult for the human eye to distinguish or that get obscured by printers, 194 00:21:07,580 --> 00:21:17,990 low-quality displays, etc. It's also important to note that a good graphic reveals the data and does not distort or obscure the data. 195 00:21:17,990 --> 00:21:24,530 It's easy to create a graphic that manipulates the data to tell a story that's not very well supported. 196 00:21:24,530 --> 00:21:30,080 And we want to avoid that when we're doing data science with honesty and integrity. 
197 00:21:30,080 --> 00:21:35,120 So to wrap up: you need to identify the variables and relationships that you want to highlight in your chart. 198 00:21:35,120 --> 00:21:38,030 You want to design a plot that illustrates them, 199 00:21:38,030 --> 00:21:43,760 and you're going to need to spend some time studying your plotting library's API and the plotting library's gallery. 200 00:21:43,760 --> 00:21:50,180 Any plotting library usually has a gallery of a bunch of different plots and the code that was used to generate them. 201 00:21:50,180 --> 00:21:52,880 Seaborn has this, matplotlib has this. 202 00:21:52,880 --> 00:22:00,650 And so spend some time with that, looking: oh, this looks like the kind of plot that might display my data well. 203 00:22:00,650 --> 00:22:14,167 And then click on it and see what code they used to generate it, and borrow it.