1
00:00:04,610 --> 00:00:10,490
Hello. And this video, I want to talk to you about how to build up a chart from the ground up as we think
2
00:00:10,490 --> 00:00:15,260
about the question it's going to try to answer and the pieces that need to go into it.
3
00:00:15,260 --> 00:00:21,350
So the learning outcome for this video is for us to be able to design a chart by thinking first of the questions,
4
00:00:21,350 --> 00:00:28,070
the goals and the data that are going to be in and from the chart.
5
00:00:28,070 --> 00:00:34,580
So a good chart answers a question and the guiding principle for how we design and
6
00:00:34,580 --> 00:00:40,940
how we lay out our chart is to illuminate the question that we want to answer.
7
00:00:40,940 --> 00:00:45,230
And this depends. We need to know what question we want to answer in the first place.
8
00:00:45,230 --> 00:00:52,550
We also need to know precisely how we operationalize that question so we can use that to then inform how we're going into the chart layout.
9
00:00:52,550 --> 00:00:59,370
And we need to know what data that we're using, specifically what variables we're using as a part of this chart.
10
00:00:59,370 --> 00:01:05,250
For example, there's a data set, you'll see it in the notebook that goes with this video for of passengers on the
11
00:01:05,250 --> 00:01:09,800
Titanic and supposedly wanted to examine whether passengers in a higher fare class,
12
00:01:09,800 --> 00:01:15,450
say, first class or more likely to survive than passengers in lower fare classes.
13
00:01:15,450 --> 00:01:19,860
In this analysis, we have an outcome variable zero one,
14
00:01:19,860 --> 00:01:29,590
whether or not the passenger survived the Titanic sinking and a lot of charts are going to have an outcome variable.
15
00:01:29,590 --> 00:01:37,140
We want to we have some outcome variable and we want to see how it responds to or how it differs with some other variable,
16
00:01:37,140 --> 00:01:43,170
which we call the explanatory variable, in this case, the passage class where outcome is survival.
17
00:01:43,170 --> 00:01:52,500
And we want to see how it changes as the the the passengers passage class, the explanatory variable changes.
18
00:01:52,500 --> 00:01:57,510
The outcome variable is also called the response variable or the dependent variable,
19
00:01:57,510 --> 00:02:01,770
because it's what we're trying to measure that's responding to the condition we're trying to analyze.
20
00:02:01,770 --> 00:02:06,990
And the explanatory variable is sometimes called the independent variable because it's changing,
21
00:02:06,990 --> 00:02:11,820
but it's not changing as a function of the other variables in theory.
22
00:02:11,820 --> 00:02:20,160
So we can do this with a bar chart and this bar chart shows the x axis is our steerage class through our passage class first,
23
00:02:20,160 --> 00:02:29,520
second and third, and the Y axis is the average is the fraction of passengers in that class who survive.
24
00:02:29,520 --> 00:02:35,460
We also see some error bars. We're going to see later what those mean and how to how to compute them.
25
00:02:35,460 --> 00:02:47,550
But this lets us see how the outcome survival changes as we age, as the pass or with the different passage classes of the passengers.
26
00:02:47,550 --> 00:02:56,490
And one of the things to note here is that we have our explanatory variable on the X axis and the outcome variable on the Y axis.
27
00:02:56,490 --> 00:03:02,100
That is the general convention. There are some cases where we might want to flip it.
28
00:03:02,100 --> 00:03:09,510
So we've we've got a horizontal bar chart where the explanatory is on the Y and the outcome is on the X,
29
00:03:09,510 --> 00:03:13,920
particularly if we if if it makes the labels more readable.
30
00:03:13,920 --> 00:03:25,380
But the standard convention for most types of charts is to put explanatory on x axis, the horizontal axis and the outcome variable on the Y axis.
31
00:03:25,380 --> 00:03:29,370
And this chart shows the relationship, many charts or relationship,
32
00:03:29,370 --> 00:03:39,010
most of the plot that we're going to be drawing in this class show how some kind of a numeric variable either continues or.
33
00:03:39,010 --> 00:03:51,610
Or integer changes between different values of one or more other variables, and in this case, even though our response was zero one logical.
34
00:03:51,610 --> 00:03:59,480
When we convert it into a rate per per passage class, it became a continuous variable.
35
00:03:59,480 --> 00:04:05,620
And so when we do this, we need to identify a few key things to design our plots.
36
00:04:05,620 --> 00:04:12,490
We need to identify what variable we want to show. That's going to guide a lot of plots that'll be on our y axis.
37
00:04:12,490 --> 00:04:18,520
When it's not, it'll usually be on X and it's going to identifying that variable is,
38
00:04:18,520 --> 00:04:22,660
if anything, probably the most important thing in designing a plot.
39
00:04:22,660 --> 00:04:27,160
We then need to identify what we want to show about this variable.
40
00:04:27,160 --> 00:04:32,380
Do we want to show its value for different data points? Do we want to show a statistic?
41
00:04:32,380 --> 00:04:36,370
The do we want to show, for example, the the mean or the rate?
42
00:04:36,370 --> 00:04:44,220
In the previous when we showed a statistic, the Titanic example, we showed a statistic, the rate of of survival.
43
00:04:44,220 --> 00:04:47,660
Do we want to show its distribution?
44
00:04:47,660 --> 00:04:57,410
And then how do we want to compare that between values of the explanatory variable, particularly, do we want to look at absolute differences?
45
00:04:57,410 --> 00:05:08,410
Do we want to look at relative or proportional differences? And even the histograms follow this kind of a design because they have an outcome,
46
00:05:08,410 --> 00:05:15,640
which is the frequency or the count of the abortion or the density, depending on precisely what kind of histogram we're showing.
47
00:05:15,640 --> 00:05:20,500
And then they have the explanatory variable, which is the value or the.
48
00:05:20,500 --> 00:05:33,850
So we've got a histogram and we've got some beans. And the response variable is how how many values are in that bend and the explanatory is the.
49
00:05:33,850 --> 00:05:40,900
So identifying these then informs the entire pipeline of producing our chart, the data processing the beginning.
50
00:05:40,900 --> 00:05:47,110
We're going to do group aggregation transformation that gets us to the final values we can actually plot.
51
00:05:47,110 --> 00:05:55,210
It's going to affect our choice of plot type and it's going to affect our choice of axis labels, colors, facets, the other aspects of the plot.
52
00:05:55,210 --> 00:06:04,500
So. The type of the variable has a significant impact if the response is numeric or can be transformed,
53
00:06:04,500 --> 00:06:07,560
the response is often numeric or can be transformed to be.
54
00:06:07,560 --> 00:06:15,840
If we're talking about a of categorical value, we usually want the relative frequency of different of different values of that.
55
00:06:15,840 --> 00:06:22,320
So either we're doing it like a histogram and the we're going to transform it so that we're showing just the distribution.
56
00:06:22,320 --> 00:06:29,010
We're going to transform it, so that we're showing that the explanatory becomes the value of the categorical.
57
00:06:29,010 --> 00:06:33,870
And the response is how many or what fraction have it in a logical.
58
00:06:33,870 --> 00:06:40,870
It might if it's a two level categorical, we might turn it into a fraction, just a fraction to have one of the levels versus the other explanatory.
59
00:06:40,870 --> 00:06:46,920
It can be anything. We're going to see how to use numeric explanatory variable is categorical explanatory variables ordinal.
60
00:06:46,920 --> 00:06:48,420
We were just like categorical,
61
00:06:48,420 --> 00:06:57,180
except that the that it's a discrete axis that preserves order and we need to make sure that the order and ordinal data is being preserved.
62
00:06:57,180 --> 00:07:05,270
If you're using pandas ordered category type, it automatically preserves order when you're doing the plot for you.
63
00:07:05,270 --> 00:07:12,050
So if we just have one explanatory variable, this is the easiest case, if our explanatory variable is continuous,
64
00:07:12,050 --> 00:07:16,820
we usually want to scatterplot or align plot for showing individual values.
65
00:07:16,820 --> 00:07:22,890
Sometimes we'll flip the response and explanatory on a scatterplot or will or both might be explanatory.
66
00:07:22,890 --> 00:07:25,790
We want to show where points lie in a two dimensional space.
67
00:07:25,790 --> 00:07:35,120
But generally, if we've got an explanatory, a continuous explanatory variable and we've got a and we're trying to show values,
68
00:07:35,120 --> 00:07:39,530
we're going to use a scatterplot or a line excuse me, we're going to try to show values.
69
00:07:39,530 --> 00:07:45,380
We're going to try to show statistics like a mean at each at each value of the explanatory variable.
70
00:07:45,380 --> 00:07:56,490
We're going to use a scatterplot or a line plott. If the explanatory variable is discrete, then we're going to use a bar chart to show a statistic.
71
00:07:56,490 --> 00:08:05,400
If we want to estimate the relative difference, we want to be able to compare the relative value relatively compared to values,
72
00:08:05,400 --> 00:08:08,640
because a bar, one bar will be twice as high as another.
73
00:08:08,640 --> 00:08:15,390
And a point plot shows a statistic or an individual value, and it emphasizes absolute difference.
74
00:08:15,390 --> 00:08:19,740
You don't have a whole bar in order to to compare heights.
75
00:08:19,740 --> 00:08:28,140
You just have the point. And then if we want to show a distribution, we usually use a box or a violin plot with this discrete explanatory variable.
76
00:08:28,140 --> 00:08:32,670
We don't have great ways to show distributions with continuous explanatory variables.
77
00:08:32,670 --> 00:08:36,480
You can show a variance with an error bar, but that's about where a ribbon.
78
00:08:36,480 --> 00:08:44,220
But that's about it. For too explanatory variables we get into.
79
00:08:44,220 --> 00:08:53,160
Too explanatory variables, we have a couple of options. One is we can do a three, a pseudo 3D display where we do a contour plot or a heat map.
80
00:08:53,160 --> 00:08:57,000
And I'm going to show both of these here. So this is a contour plot.
81
00:08:57,000 --> 00:09:01,860
The left one is a contour plot and it reads like a topographical map.
82
00:09:01,860 --> 00:09:09,180
If you envision your your two explanatory variables in this case, we're going to we're showing a two dimensional distribution.
83
00:09:09,180 --> 00:09:18,990
So one explanatory variable is the score given to a movie by its critics, and another explanatory variable is the score given by its audience.
84
00:09:18,990 --> 00:09:23,930
And then the response variable is how many movies have that combination?
85
00:09:23,930 --> 00:09:30,320
And so we can see here, this is the peak, a contour plot is really good for showing us the peak.
86
00:09:30,320 --> 00:09:43,510
It's going to be that innermost circle and it also shows us the shape because each of these rings is a a a level of decreasing.
87
00:09:43,510 --> 00:09:52,090
Decreasing height in this map, so if the response if we envision that the response variable is this height and we're looking at a two dimensional map,
88
00:09:52,090 --> 00:09:56,110
the rings show us the contours around the mountains of that height.
89
00:09:56,110 --> 00:10:01,150
Good for showing, good for showing shape. The other one of the heat map which uses color.
90
00:10:01,150 --> 00:10:07,240
And so it's usually going to be from a cool color like, say, black here to to a hot orange,
91
00:10:07,240 --> 00:10:18,230
or it's going to be sometimes if you have a bidirectional one, which goes blue to red and it lets us see the highest density is here.
92
00:10:18,230 --> 00:10:23,390
And the as you go out from there, you get lower and lower densities.
93
00:10:23,390 --> 00:10:27,740
Either one can work for a continuous variable heat map, you often have to it in order to.
94
00:10:27,740 --> 00:10:36,890
This is a descriptivist heat map where we have been everything in in bins of of a half a star or a half
95
00:10:36,890 --> 00:10:41,510
a star on the audience score and a four star on the credit score because they're on different scales.
96
00:10:41,510 --> 00:10:46,850
But heat maps also work well for categorical ordinal data.
97
00:10:46,850 --> 00:10:55,180
So. Another way we can do it is we can use other esthetics for secondary variables such as color or shape or size,
98
00:10:55,180 --> 00:10:58,300
sometimes we'll use that to indicate a second response variable,
99
00:10:58,300 --> 00:11:06,820
like you might have a scatterplot where the size of the point is a second response variable, but often it's for multiple explanatory variables.
100
00:11:06,820 --> 00:11:15,640
So this shows us how we can do that. So if we wanted to break down Titanic's survival rates by both class and sex,
101
00:11:15,640 --> 00:11:22,240
we can see we can use we keep our class on the X axis like we did before,
102
00:11:22,240 --> 00:11:32,290
and then we use color for the passenger sex so we can see significantly higher survival rates for women across all three classes.
103
00:11:32,290 --> 00:11:36,550
I'm also showing you here the difference between a bar chart and a point plot.
104
00:11:36,550 --> 00:11:41,590
So the left is the bar chart. The right is the point plot.
105
00:11:41,590 --> 00:11:46,030
And the bar chart lets us compare the heights of the bars. Note that it starts at zero.
106
00:11:46,030 --> 00:11:55,590
Bar charts always start at zero. And because so it lets us compare the height of the bars and we can see that.
107
00:11:55,590 --> 00:12:08,290
It's easy to see from just using our vision that the female passenger first class bar is almost is more than twice as tall as the.
108
00:12:08,290 --> 00:12:12,370
As the female or the male passenger first class bar,
109
00:12:12,370 --> 00:12:18,370
the male passenger first class bar is twice as tall as the male passenger passenger second class bar.
110
00:12:18,370 --> 00:12:23,410
So it lets us compare make relative comparisons between the different values.
111
00:12:23,410 --> 00:12:29,230
This is why it always starts at zero, because the natural thing to do with the bar is compare its height.
112
00:12:29,230 --> 00:12:34,990
If your bar chart does not start at zero, suppose our bar chart started at point one,
113
00:12:34,990 --> 00:12:40,300
then the comparison of height would exaggerate the difference relative to the value.
114
00:12:40,300 --> 00:12:45,850
And what looks twice as tall isn't actually twice as tall because we cut off a bunch of the bottom.
115
00:12:45,850 --> 00:12:52,010
So always start at zero. The point plot. Does not it makes it hard to compare relative difference.
116
00:12:52,010 --> 00:12:55,790
We can't it's difficult for us to tell that the survival rate visually tell.
117
00:12:55,790 --> 00:12:57,500
We can tell if we look at the numbers,
118
00:12:57,500 --> 00:13:05,090
but it's difficult to visually tell that the survival rate of women in first classes is twice as high as the survival rate of men.
119
00:13:05,090 --> 00:13:11,390
But what it does literacy is it lets us see the absolute, absolute difference between these values,
120
00:13:11,390 --> 00:13:16,550
and it makes it easy to compare the difference in the gaps across the three classes.
121
00:13:16,550 --> 00:13:31,670
We can see that the the survival rate by by sex is much higher or is much closer in the third class than it is in the first or in the second class.
122
00:13:31,670 --> 00:13:39,020
So your choice of plot really guides the user to see different things in your choice of plot,
123
00:13:39,020 --> 00:13:42,650
allows you to emphasize different things and you need to decide.
124
00:13:42,650 --> 00:13:51,440
You need to choose and design your plot in such a way that's going to tell the story that you need to tell from the data.
125
00:13:51,440 --> 00:13:59,570
We can also have more than two explanatory variables. It's difficult to have more than one that's numeric or two for doing a contour plot.
126
00:13:59,570 --> 00:14:05,270
We can bend variables that are then going to let us use some more techniques, such as FaceTime.
127
00:14:05,270 --> 00:14:08,270
So if we want to break down by more categorical variables,
128
00:14:08,270 --> 00:14:12,320
so we want let's say we also want to look at a or we want to break down many more variables.
129
00:14:12,320 --> 00:14:18,140
Let's say we also want to look at age. And so we're going to keep sex on the color.
130
00:14:18,140 --> 00:14:24,050
We're going to now use age as the x axis. Since this numeric, it really works better on an axis.
131
00:14:24,050 --> 00:14:28,970
I have bend it into bins of tens that you only have one point for every decade.
132
00:14:28,970 --> 00:14:36,530
But then we use a fassett and the fassett means we draw a different chart for each of the three classes.
133
00:14:36,530 --> 00:14:43,400
The charts all share a y axis so we can directly compare across the row of charts and we can see it lets us see
134
00:14:43,400 --> 00:14:55,400
particularly how does the survival as a function of age change between different different passenger classes,
135
00:14:55,400 --> 00:15:01,460
for example? And so it is, but it lets us start to build up.
136
00:15:01,460 --> 00:15:05,720
And if we had a fourth, we could use rows and columns in the faceted plot.
137
00:15:05,720 --> 00:15:10,310
So we have these mechanisms of building up and we have our x axis or y axis.
138
00:15:10,310 --> 00:15:18,380
We can use esthetics of the lines of the points, particularly color, size, shape,
139
00:15:18,380 --> 00:15:26,730
and then we can use facets to build up even more variables into our plot.
140
00:15:26,730 --> 00:15:32,370
To do fascinating, there's a couple of things you can do, it's built into some of the seabourne row plotting functions.
141
00:15:32,370 --> 00:15:37,230
The plot and cat plot function functions can both do fascinating on their own.
142
00:15:37,230 --> 00:15:42,240
They let you control the statistic. They're very, very flexible functions for a wide range of plot.
143
00:15:42,240 --> 00:15:48,810
The general purpose Fassett Grid allows you to fassett any kind of plot by writing some more Python code on your own.
144
00:15:48,810 --> 00:15:53,190
Very useful if you want to fassett something that doesn't support Facetune built in.
145
00:15:53,190 --> 00:15:59,940
And if you're using Plot nine or the R.G. plot to package Fassett Grid and Fassett wrap a control fassett,
146
00:15:59,940 --> 00:16:07,680
you build that faceted plot you need to pay attention to what variables go where your choice of which variables are going to be on color,
147
00:16:07,680 --> 00:16:16,530
what variables are going to be facets, which variables are going to be on your axes really affect how the reader is going to interpret and understand
148
00:16:16,530 --> 00:16:22,680
your plot and you need to choose them strategically to tell the story that addresses your question.
149
00:16:22,680 --> 00:16:28,950
You also need to do it, though, in a way that is honest and does not mislead your user, your readers.
150
00:16:28,950 --> 00:16:38,810
The chart needs to honestly show the readers what it is that you learned from the data and show that clearly.
151
00:16:38,810 --> 00:16:43,610
Another thing we can do to build up a chart, especially if we have more categorical variables,
152
00:16:43,610 --> 00:16:47,180
if we've got a categorical response variable with more than two levels,
153
00:16:47,180 --> 00:16:56,390
and we want to show how particularly how the the proportion in different categories changes the response to another variable,
154
00:16:56,390 --> 00:17:04,250
a stack chart can be very good. Let's see the differences in composition to see how the parts of a hole change.
155
00:17:04,250 --> 00:17:05,390
And so this chart,
156
00:17:05,390 --> 00:17:16,060
this is a stacked bar chart and it's a horizontal bar chart where I put the explanatory variable on the x axis excuse me, on the Y axis.
157
00:17:16,060 --> 00:17:22,150
Just in part to make the labels easier to read and so are explanatory variable is what data set.
158
00:17:22,150 --> 00:17:28,210
Something came from Locke, M.D. Gry. What those are don't matter for our purposes right now.
159
00:17:28,210 --> 00:17:34,540
The response variable is the distribution of gender's in this case.
160
00:17:34,540 --> 00:17:39,970
These are data sets of books, the genders of the authors of those books in the data set.
161
00:17:39,970 --> 00:17:47,260
And so we have female, we've got mail and we also have codes for we it's ambiguous or unknown or we didn't have data.
162
00:17:47,260 --> 00:17:59,470
And so we can see, for example, the GYŐRI data set has a higher fraction of women and a significantly lower fraction of men.
163
00:17:59,470 --> 00:18:12,900
And we can see quite a few more. Books that we don't know what gender on, and so this the order on this chart is very strategic.
164
00:18:12,900 --> 00:18:22,230
I observed these levels is very strategic. I bunch I batched all of the various kinds of we don't know together so that
165
00:18:22,230 --> 00:18:27,180
you can look at that whole block and see the and see the various types of.
166
00:18:27,180 --> 00:18:33,720
We don't know the gender of the book's author together, but you can also see how they're broken down into individual things.
167
00:18:33,720 --> 00:18:41,790
You can see that UNlinked is a very, very large fraction of of that increase in books where we don't know the author's gender.
168
00:18:41,790 --> 00:18:46,560
So you need to think you need to think about all of these different things in order
169
00:18:46,560 --> 00:18:51,630
to be able to generate a chart that's going to clearly and unambiguously communicate.
170
00:18:51,630 --> 00:18:54,930
You can show either you can show raw values in a stack bar chart at the bars.
171
00:18:54,930 --> 00:18:58,690
Don't all have to be the same height you can show fractions, in which case they will be.
172
00:18:58,690 --> 00:19:07,080
I chose to show fractions in this chart. The code that generates this using raw matplotlib is linked in the notes for the video.
173
00:19:07,080 --> 00:19:09,330
Sometimes we're also going to transform our charts.
174
00:19:09,330 --> 00:19:15,330
We might transform the axis such as doing a log ten scale, in which case the label would transform the axis.
175
00:19:15,330 --> 00:19:20,070
The labels are still in their original value. It's just they're spaced out logarithmically.
176
00:19:20,070 --> 00:19:24,780
We generally won't do this for bars. Reading a bar on a large scale.
177
00:19:24,780 --> 00:19:30,840
You can draw it, but you have to be really, really careful in order to make sure that your readers are going to accurately interpret it.
178
00:19:30,840 --> 00:19:35,580
But for line and scatter plots, log transforms are a lot more common.
179
00:19:35,580 --> 00:19:41,580
Sometimes, though, we're actually going to transform the data itself and we're going to plot a log or a square root or some other rescaling.
180
00:19:41,580 --> 00:19:49,160
And another kargman transformation is to be in the data, somehow democratize it into fixed bins.
181
00:19:49,160 --> 00:19:54,530
By some mechanism or another, so the key decisions that you need to make when you're making one of these charts
182
00:19:54,530 --> 00:20:00,920
are you need to pick the variables and how you're doing their transformations. You need to pick that what's called the esthetics,
183
00:20:00,920 --> 00:20:06,680
how you're going to map the different variables you're looking at to chart features your X and Y axes,
184
00:20:06,680 --> 00:20:10,700
your facets row and column your color, your point marker style.
185
00:20:10,700 --> 00:20:14,690
If you're doing a joint plot, often it's useful to put.
186
00:20:14,690 --> 00:20:25,370
The same esthetic on both color and style, and that way, if you have a reader who's colorblind, they still get different point styles,
187
00:20:25,370 --> 00:20:30,050
even if they can't tell the colors apart or if someone's putting it on a black and white printer.
188
00:20:30,050 --> 00:20:34,340
And then you need the type of the chart line, chart, bar, point box, et cetera.
189
00:20:34,340 --> 00:20:41,750
So you have to make all of these decisions when you're drawing this chart and they're driven by what variables and data you have and what
190
00:20:41,750 --> 00:20:50,210
question you're trying to answer and what story you're trying to tell about that you do need to be careful to avoid excessive complexity.
191
00:20:50,210 --> 00:20:58,550
We can put a different variable on every conceivable esthetic and it's often going to result in a chart that's very difficult to read.
192
00:20:58,550 --> 00:21:03,410
We also have to be careful with color because it's easy to make a chart that has differences
193
00:21:03,410 --> 00:21:07,580
that are difficult for the human eye to distinguish or get obscured by printers,
194
00:21:07,580 --> 00:21:17,990
low quality displays, etc. It's also important to note a good graphic reveals the data and does not distort or obscure the data.
195
00:21:17,990 --> 00:21:24,530
It's easy to create a graphic that manipulates the data to tell a story that's not very well supported.
196
00:21:24,530 --> 00:21:30,080
And we want to avoid that when we're doing data science with honesty and integrity.
197
00:21:30,080 --> 00:21:35,120
So wrap up. You need to identify the variables and relationships that you want to highlight in your chart.
198
00:21:35,120 --> 00:21:38,030
You want to design a plot that illustrates them,
199
00:21:38,030 --> 00:21:43,760
and you're going to need to spend some time studying your plodding library APIs and the Plotting Libraries Gallery.
200
00:21:43,760 --> 00:21:50,180
Any plotting library usually has a gallery of a bunch of different plots and the code that was used to generate them.
201
00:21:50,180 --> 00:21:52,880
Seabourne has this, matplotlib has this.
202
00:21:52,880 --> 00:22:00,650
And so you spending some time with that looking, oh, this looks like this looks like the kind of plot that might display my data well.
203
00:22:00,650 --> 00:22:14,167
And then look and click on it and see what code they use to generate it and borrow it.