https:/.../bb921f27-23a5-4103-afff-ad970182fbb2-a99a3566-e841-4af0-9607-ad9c006e1504.mp4?invocationId=5b7bb097-a60f-ec11-a9e9-0a1a827ad0ec

  • Not Synced
    1
    00:00:04,610 --> 00:00:10,490
    Hello. In this video, I want to talk to you about how to build a chart from the ground up as we think
  • Not Synced
    2
    00:00:10,490 --> 00:00:15,260
    about the question it's going to try to answer and the pieces that need to go into it.
  • Not Synced
    3
    00:00:15,260 --> 00:00:21,350
    So the learning outcome for this video is for us to be able to design a chart by thinking first of the questions,
  • Not Synced
    4
    00:00:21,350 --> 00:00:28,070
    the goals and the data that are going to be in and from the chart.
  • Not Synced
    5
    00:00:28,070 --> 00:00:34,580
    So a good chart answers a question and the guiding principle for how we design and
  • Not Synced
    6
    00:00:34,580 --> 00:00:40,940
    how we lay out our chart is to illuminate the question that we want to answer.
  • Not Synced
    7
    00:00:40,940 --> 00:00:45,230
    And this depends. We need to know what question we want to answer in the first place.
  • Not Synced
    8
    00:00:45,230 --> 00:00:52,550
    We also need to know precisely how we operationalize that question so we can use that to then inform how we lay out the chart.
  • Not Synced
    9
    00:00:52,550 --> 00:00:59,370
    And we need to know what data we're using, specifically what variables we're using as a part of this chart.
  • Not Synced
    10
    00:00:59,370 --> 00:01:05,250
    For example, there's a data set (you'll see it in the notebook that goes with this video) of passengers on the
  • Not Synced
    11
    00:01:05,250 --> 00:01:09,800
    Titanic, and suppose we wanted to examine whether passengers in a higher fare class,
  • Not Synced
    12
    00:01:09,800 --> 00:01:15,450
    say, first class, are more likely to survive than passengers in lower fare classes.
  • Not Synced
    13
    00:01:15,450 --> 00:01:19,860
    In this analysis, we have an outcome variable, zero or one:
  • Not Synced
    14
    00:01:19,860 --> 00:01:29,590
    whether or not the passenger survived the Titanic sinking and a lot of charts are going to have an outcome variable.
  • Not Synced
    15
    00:01:29,590 --> 00:01:37,140
    We have some outcome variable, and we want to see how it responds to, or how it differs with, some other variable,
  • Not Synced
    16
    00:01:37,140 --> 00:01:43,170
    which we call the explanatory variable: in this case, the passage class. Our outcome is survival,
  • Not Synced
    17
    00:01:43,170 --> 00:01:52,500
    and we want to see how it changes as the passenger's passage class, the explanatory variable, changes.
  • Not Synced
    18
    00:01:52,500 --> 00:01:57,510
    The outcome variable is also called the response variable or the dependent variable,
  • Not Synced
    19
    00:01:57,510 --> 00:02:01,770
    because it's what we're trying to measure that's responding to the condition we're trying to analyze.
  • Not Synced
    20
    00:02:01,770 --> 00:02:06,990
    And the explanatory variable is sometimes called the independent variable because it's changing,
  • Not Synced
    21
    00:02:06,990 --> 00:02:11,820
    but, in theory, it's not changing as a function of the other variables.
  • Not Synced
    22
    00:02:11,820 --> 00:02:20,160
    So we can do this with a bar chart. In this bar chart, the x axis is our passage class: first,
  • Not Synced
    23
    00:02:20,160 --> 00:02:29,520
    second, and third, and the y axis is the fraction of passengers in that class who survived.
  • Not Synced
    24
    00:02:29,520 --> 00:02:35,460
    We also see some error bars. We're going to see later what those mean and how to compute them.
  • Not Synced
    25
    00:02:35,460 --> 00:02:47,550
    But this lets us see how the outcome, survival, changes with the different passage classes of the passengers.
  • Not Synced
    26
    00:02:47,550 --> 00:02:56,490
    And one of the things to note here is that we have our explanatory variable on the X axis and the outcome variable on the Y axis.
  • Not Synced
    27
    00:02:56,490 --> 00:03:02,100
    That is the general convention. There are some cases where we might want to flip it.
  • Not Synced
    28
    00:03:02,100 --> 00:03:09,510
    So we've got a horizontal bar chart where the explanatory is on the y and the outcome is on the x,
  • Not Synced
    29
    00:03:09,510 --> 00:03:13,920
    particularly if it makes the labels more readable.
  • Not Synced
    30
    00:03:13,920 --> 00:03:25,380
    But the standard convention for most types of charts is to put the explanatory variable on the x axis, the horizontal axis, and the outcome variable on the y axis.
  • Not Synced
    31
    00:03:25,380 --> 00:03:29,370
    And this chart shows a relationship. Many charts show a relationship;
  • Not Synced
    32
    00:03:29,370 --> 00:03:39,010
    most of the plots that we're going to be drawing in this class show how some kind of numeric variable, either continuous or
  • Not Synced
    33
    00:03:39,010 --> 00:03:51,610
    integer, changes between different values of one or more other variables. In this case, even though our response was a zero-one logical,
  • Not Synced
    34
    00:03:51,610 --> 00:03:59,480
    when we converted it into a rate per passage class, it became a continuous variable.
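A minimal sketch of that conversion, assuming made-up records rather than the real Titanic data. The `records` list and the `survival_rate_by_class` helper are hypothetical names used only for illustration; the notebook that goes with the video may do this differently, for example with a pandas group-by.

```python
from collections import defaultdict

# Made-up (passage class, survived) records for illustration only;
# these are NOT the real Titanic data.
records = [
    (1, 1), (1, 1), (1, 0), (1, 1),
    (2, 1), (2, 0), (2, 0), (2, 1),
    (3, 0), (3, 0), (3, 1), (3, 0),
]


def survival_rate_by_class(rows):
    """Return {class: fraction of rows in that class whose outcome is 1}."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for pclass, survived in rows:
        totals[pclass] += 1
        survivors[pclass] += survived  # the outcome is 0 or 1, so this counts survivors
    return {c: survivors[c] / totals[c] for c in sorted(totals)}


rates = survival_rate_by_class(records)
print(rates)
```

The zero-one outcome goes in, and one continuous value per class comes out; those per-class values are what the bar heights would encode.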
  • Not Synced
    35
    00:03:59,480 --> 00:04:05,620
    And so when we do this, we need to identify a few key things to design our plots.
  • Not Synced
    36
    00:04:05,620 --> 00:04:12,490
    We need to identify what variable we want to show. In a lot of plots, that'll be on our y axis.
  • Not Synced
    37
    00:04:12,490 --> 00:04:18,520
    When it's not, it'll usually be on x. Identifying that variable is,
  • Not Synced
    38
    00:04:18,520 --> 00:04:22,660
    if anything, probably the most important thing in designing a plot.
  • Not Synced
    39
    00:04:22,660 --> 00:04:27,160
    We then need to identify what we want to show about this variable.
  • Not Synced
    40
    00:04:27,160 --> 00:04:32,380
    Do we want to show its value for different data points? Do we want to show a statistic?
  • Not Synced
    41
    00:04:32,380 --> 00:04:36,370
    Do we want to show, for example, the mean or the rate?
  • Not Synced
    42
    00:04:36,370 --> 00:04:44,220
    In the previous Titanic example, we showed a statistic: the rate of survival.
  • Not Synced
    43
    00:04:44,220 --> 00:04:47,660
    Do we want to show its distribution?
  • Not Synced
    44
    00:04:47,660 --> 00:04:57,410
    And then how do we want to compare that between values of the explanatory variable, particularly, do we want to look at absolute differences?
  • Not Synced
    45
    00:04:57,410 --> 00:05:08,410
    Do we want to look at relative or proportional differences? Even histograms follow this kind of design, because they have an outcome,
  • Not Synced
    46
    00:05:08,410 --> 00:05:15,640
    which is the frequency: the count or the proportion or the density, depending on precisely what kind of histogram we're showing.
  • Not Synced
    47
    00:05:15,640 --> 00:05:20,500
    And then they have the explanatory variable, which is the value or the bin.
  • Not Synced
    48
    00:05:20,500 --> 00:05:33,850
    So we've got a histogram and we've got some bins. The response variable is how many values are in that bin, and the explanatory is the bin.
  • Not Synced
    49
    00:05:33,850 --> 00:05:40,900
    So identifying these then informs the entire pipeline of producing our chart: the data processing at the beginning.
  • Not Synced
    50
    00:05:40,900 --> 00:05:47,110
    We're going to do grouping, aggregation, and transformation that get us to the final values we can actually plot.
  • Not Synced
    51
    00:05:47,110 --> 00:05:55,210
    It's going to affect our choice of plot type and it's going to affect our choice of axis labels, colors, facets, the other aspects of the plot.
  • Not Synced
    52
    00:05:55,210 --> 00:06:04,500
    So the type of the variable has a significant impact.
  • Not Synced
    53
    00:06:04,500 --> 00:06:07,560
    The response is often numeric, or can be transformed to be.
  • Not Synced
    54
    00:06:07,560 --> 00:06:15,840
    If we're talking about a categorical variable, we usually want the relative frequency of the different values of that variable.
  • Not Synced
    55
    00:06:15,840 --> 00:06:22,320
    So we're doing it like a histogram: we're going to transform it so that we're showing just the distribution.
  • Not Synced
    56
    00:06:22,320 --> 00:06:29,010
    We're going to transform it so that the explanatory becomes the value of the categorical.
  • Not Synced
    57
    00:06:29,010 --> 00:06:33,870
    And the response is how many, or what fraction, have it. For a logical,
  • Not Synced
    58
    00:06:33,870 --> 00:06:40,870
    or if it's a two-level categorical, we might turn it into a fraction: just the fraction that have one of the levels versus the other. The explanatory
  • Not Synced
    59
    00:06:40,870 --> 00:06:46,920
    can be anything. We're going to see how to use numeric explanatory variables and categorical explanatory variables. Ordinal
  • Not Synced
    60
    00:06:46,920 --> 00:06:48,420
    works just like categorical,
  • Not Synced
    61
    00:06:48,420 --> 00:06:57,180
    except that it's a discrete axis that preserves order, and we need to make sure that the order in ordinal data is being preserved.
  • Not Synced
    62
    00:06:57,180 --> 00:07:05,270
    If you're using the pandas ordered categorical type, the order is automatically preserved for you when you draw the plot.
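As a small sketch of what declaring that order looks like, assuming hypothetical age-group labels that are not part of the Titanic data. It only shows the ordering behavior of `pd.CategoricalDtype(..., ordered=True)`, not the plot itself.

```python
import pandas as pd

# Hypothetical ordinal labels chosen because alphabetical order is wrong for them.
levels = ["child", "teen", "adult", "senior"]
ages = pd.Series(["adult", "child", "senior", "teen", "adult"])

# Plain strings sort alphabetically, which scrambles the ordinal meaning.
alphabetical = sorted(ages.unique())

# An ordered categorical sorts by the declared level order instead.
age_group = ages.astype(pd.CategoricalDtype(categories=levels, ordered=True))
by_level = age_group.sort_values().unique().tolist()

print(alphabetical)
print(by_level)
```

With the ordered type in place, sorting and comparisons follow `levels` rather than the alphabet, which is the property a discrete axis needs.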
  • Not Synced
    63
    00:07:05,270 --> 00:07:12,050
    So if we just have one explanatory variable, this is the easiest case. If our explanatory variable is continuous,
  • Not Synced
    64
    00:07:12,050 --> 00:07:16,820
    we usually want a scatterplot or a line plot for showing individual values.
  • Not Synced
    65
    00:07:16,820 --> 00:07:22,890
    Sometimes we'll flip the response and explanatory on a scatterplot, or both might be explanatory:
  • Not Synced
    66
    00:07:22,890 --> 00:07:25,790
    we want to show where points lie in a two-dimensional space.
  • Not Synced
    67
    00:07:25,790 --> 00:07:35,120
    But generally, if we've got a continuous explanatory variable and we're trying to show values,
  • Not Synced
    68
    00:07:35,120 --> 00:07:39,530
    we're going to use a scatterplot or a line plot. And if
  • Not Synced
    69
    00:07:39,530 --> 00:07:45,380
    we're trying to show a statistic, like a mean, at each value of the explanatory variable,
  • Not Synced
    70
    00:07:45,380 --> 00:07:56,490
    we're also going to use a scatterplot or a line plot. If the explanatory variable is discrete, then we're going to use a bar chart to show a statistic.
  • Not Synced
    71
    00:07:56,490 --> 00:08:05,400
    If we want to estimate relative differences, we want to be able to compare values relative to one another,
  • Not Synced
    72
    00:08:05,400 --> 00:08:08,640
    because one bar will be twice as high as another.
  • Not Synced
    73
    00:08:08,640 --> 00:08:15,390
    And a point plot shows a statistic or an individual value, and it emphasizes absolute difference.
  • Not Synced
    74
    00:08:15,390 --> 00:08:19,740
    You don't have a whole bar in order to compare heights.
  • Not Synced
    75
    00:08:19,740 --> 00:08:28,140
    You just have the point. And then if we want to show a distribution, we usually use a box or a violin plot with a discrete explanatory variable.
  • Not Synced
    76
    00:08:28,140 --> 00:08:32,670
    We don't have great ways to show distributions with continuous explanatory variables.
  • Not Synced
    77
    00:08:32,670 --> 00:08:36,480
    You can show a variance with an error bar or a ribbon,
  • Not Synced
    78
    00:08:36,480 --> 00:08:44,220
    but that's about it.
  • Not Synced
    79
    00:08:44,220 --> 00:08:53,160
    For two explanatory variables, we have a couple of options. One is a pseudo-3D display, where we do a contour plot or a heat map.
  • Not Synced
    80
    00:08:53,160 --> 00:08:57,000
    And I'm going to show both of these here. So this is a contour plot.
  • Not Synced
    81
    00:08:57,000 --> 00:09:01,860
    The left one is a contour plot and it reads like a topographical map.
  • Not Synced
    82
    00:09:01,860 --> 00:09:09,180
    Think about your two explanatory variables. In this case, we're showing a two-dimensional distribution.
  • Not Synced
    83
    00:09:09,180 --> 00:09:18,990
    So one explanatory variable is the score given to a movie by its critics, and another explanatory variable is the score given by its audience.
  • Not Synced
    84
    00:09:18,990 --> 00:09:23,930
    And then the response variable is how many movies have that combination?
  • Not Synced
    85
    00:09:23,930 --> 00:09:30,320
    And so we can see here, this is the peak. A contour plot is really good for showing us the peak.
  • Not Synced
    86
    00:09:30,320 --> 00:09:43,510
    It's going to be that innermost circle. It also shows us the shape, because each of these rings is a level of decreasing
  • Not Synced
    87
    00:09:43,510 --> 00:09:52,090
    height in this map. So if we envision that the response variable is the height and we're looking at a two-dimensional map,
  • Not Synced
    88
    00:09:52,090 --> 00:09:56,110
    the rings show us the contours around the mountains of that height.
  • Not Synced
    89
    00:09:56,110 --> 00:10:01,150
    Good for showing shape. The other one is the heat map, which uses color.
  • Not Synced
    90
    00:10:01,150 --> 00:10:07,240
    And so it's usually going to be from a cool color like, say, black here, to a hot orange,
  • Not Synced
    91
    00:10:07,240 --> 00:10:18,230
    or sometimes you have a bidirectional one, which goes blue to red. It lets us see that the highest density is here,
  • Not Synced
    92
    00:10:18,230 --> 00:10:23,390
    and as you go out from there, you get lower and lower densities.
  • Not Synced
    93
    00:10:23,390 --> 00:10:27,740
    Either one can work for a continuous variable. For a heat map, you often have to bin it.
  • Not Synced
    94
    00:10:27,740 --> 00:10:36,890
    This is a discretized heat map where we have binned everything into bins of half
  • Not Synced
    95
    00:10:36,890 --> 00:10:41,510
    a star on the audience score and a four star on the critic score, because they're on different scales.
  • Not Synced
    96
    00:10:41,510 --> 00:10:46,850
    But heat maps also work well for categorical and ordinal data.
  • Not Synced
    97
    00:10:46,850 --> 00:10:55,180
    So another way we can do it is to use other esthetics for secondary variables, such as color or shape or size.
  • Not Synced
    98
    00:10:55,180 --> 00:10:58,300
    Sometimes we'll use that to indicate a second response variable,
  • Not Synced
    99
    00:10:58,300 --> 00:11:06,820
    like you might have a scatterplot where the size of the point is a second response variable, but often it's for multiple explanatory variables.
  • Not Synced
    100
    00:11:06,820 --> 00:11:15,640
    So this shows us how we can do that. If we wanted to break down Titanic survival rates by both class and sex,
  • Not Synced
    101
    00:11:15,640 --> 00:11:22,240
    we keep our class on the x axis like we did before,
  • Not Synced
    102
    00:11:22,240 --> 00:11:32,290
    and then we use color for the passenger's sex, so we can see significantly higher survival rates for women across all three classes.
  • Not Synced
    103
    00:11:32,290 --> 00:11:36,550
    I'm also showing you here the difference between a bar chart and a point plot.
  • Not Synced
    104
    00:11:36,550 --> 00:11:41,590
    So the left is the bar chart. The right is the point plot.
  • Not Synced
    105
    00:11:41,590 --> 00:11:46,030
    And the bar chart lets us compare the heights of the bars. Note that it starts at zero.
  • Not Synced
    106
    00:11:46,030 --> 00:11:55,590
    Bar charts always start at zero. So it lets us compare the heights of the bars, and
  • Not Synced
    107
    00:11:55,590 --> 00:12:08,290
    it's easy to see, just using our vision, that the female passenger first-class bar is more than twice as tall as
  • Not Synced
    108
    00:12:08,290 --> 00:12:12,370
    the male passenger first-class bar, and
  • Not Synced
    109
    00:12:12,370 --> 00:12:18,370
    the male passenger first-class bar is twice as tall as the male passenger second-class bar.
  • Not Synced
    110
    00:12:18,370 --> 00:12:23,410
    So it lets us make relative comparisons between the different values.
  • Not Synced
    111
    00:12:23,410 --> 00:12:29,230
    This is why it always starts at zero, because the natural thing to do with the bar is compare its height.
  • Not Synced
    112
    00:12:29,230 --> 00:12:34,990
    If your bar chart does not start at zero (suppose our bar chart started at 0.1),
  • Not Synced
    113
    00:12:34,990 --> 00:12:40,300
    then the comparison of height would exaggerate the difference relative to the value.
  • Not Synced
    114
    00:12:40,300 --> 00:12:45,850
    And what looks twice as tall isn't actually twice as tall because we cut off a bunch of the bottom.
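To put numbers on that, here is a minimal arithmetic sketch. The two rates and the `apparent_ratio` helper are hypothetical, chosen only to make the effect easy to follow; they are not values read off the chart.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Hypothetical rates, not taken from the Titanic chart.
high, low = 0.4, 0.2

true_ratio = apparent_ratio(high, low)               # axis starts at zero
cut_ratio = apparent_ratio(high, low, baseline=0.1)  # axis starts at 0.1

print(round(true_ratio, 2), round(cut_ratio, 2))
```

With the axis starting at zero the taller bar is drawn twice as tall, matching the data; with the bottom 0.1 cut off, it is drawn about three times as tall even though the underlying values have not changed.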
  • Not Synced
    115
    00:12:45,850 --> 00:12:52,010
    So always start at zero. The point plot does not, and it makes it hard to compare relative differences.
  • Not Synced
    116
    00:12:52,010 --> 00:12:55,790
    It's difficult for us to tell visually.
  • Not Synced
    117
    00:12:55,790 --> 00:12:57,500
    We can tell if we look at the numbers,
  • Not Synced
    118
    00:12:57,500 --> 00:13:05,090
    but it's difficult to visually tell that the survival rate of women in first class is twice as high as the survival rate of men.
  • Not Synced
    119
    00:13:05,090 --> 00:13:11,390
    But what it does let us see is the absolute difference between these values,
  • Not Synced
    120
    00:13:11,390 --> 00:13:16,550
    and it makes it easy to compare the difference in the gaps across the three classes.
  • Not Synced
    121
    00:13:16,550 --> 00:13:31,670
    We can see that the survival rate by sex is much closer in the third class than it is in the first or in the second class.
  • Not Synced
    122
    00:13:31,670 --> 00:13:39,020
    So your choice of plot really guides the user to see different things. Your choice of plot
  • Not Synced
    123
    00:13:39,020 --> 00:13:42,650
    allows you to emphasize different things, and you need to decide:
  • Not Synced
    124
    00:13:42,650 --> 00:13:51,440
    you need to choose and design your plot in a way that's going to tell the story that you need to tell from the data.
  • Not Synced
    125
    00:13:51,440 --> 00:13:59,570
    We can also have more than two explanatory variables. It's difficult to have more than one that's numeric, or two if we're doing a contour plot.
  • Not Synced
    126
    00:13:59,570 --> 00:14:05,270
    We can bin variables, which then lets us use some more techniques, such as faceting.
  • Not Synced
    127
    00:14:05,270 --> 00:14:08,270
    So suppose we want to break down by more categorical variables,
  • Not Synced
    128
    00:14:08,270 --> 00:14:12,320
    or we want to break down by many more variables.
  • Not Synced
    129
    00:14:12,320 --> 00:14:18,140
    Let's say we also want to look at age. And so we're going to keep sex on the color.
  • Not Synced
    130
    00:14:18,140 --> 00:14:24,050
    We're going to now use age as the x axis. Since this is numeric, it really works better on an axis.
  • Not Synced
    131
    00:14:24,050 --> 00:14:28,970
    I have binned it into bins of ten so that you only have one point for every decade.
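One way to sketch that binning step, assuming made-up passengers and a hypothetical `decade_bin` helper. The notebook may well use a pandas function for this instead; this is just the idea in plain Python.

```python
from collections import defaultdict


def decade_bin(age):
    """Map an age to the lower edge of its ten-year bin, e.g. 37 -> 30."""
    return int(age // 10) * 10


# Made-up (age, survived) pairs for illustration only.
passengers = [(4, 1), (8, 1), (23, 0), (27, 1), (35, 0), (38, 0), (61, 0)]

outcomes = defaultdict(list)
for age, survived in passengers:
    outcomes[decade_bin(age)].append(survived)

# One value per decade: the fraction of passengers in that bin who survived.
rate_by_decade = {b: sum(v) / len(v) for b, v in sorted(outcomes.items())}
print(rate_by_decade)
```

Each decade ends up with a single point, which is what keeps the line readable once we also split by sex and class.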
  • Not Synced
    132
    00:14:28,970 --> 00:14:36,530
    But then we use a facet, and the facet means we draw a different chart for each of the three classes.
  • Not Synced
    133
    00:14:36,530 --> 00:14:43,400
    The charts all share a y axis, so we can directly compare across the row of charts. It lets us see,
  • Not Synced
    134
    00:14:43,400 --> 00:14:55,400
    particularly, how survival as a function of age changes between different passenger classes,
  • Not Synced
    135
    00:14:55,400 --> 00:15:01,460
    for example. And so it lets us start to build up.
  • Not Synced
    136
    00:15:01,460 --> 00:15:05,720
    And if we had a fourth, we could use rows and columns in the faceted plot.
  • Not Synced
    137
    00:15:05,720 --> 00:15:10,310
    So we have these mechanisms of building up: we have our x axis and our y axis.
  • Not Synced
    138
    00:15:10,310 --> 00:15:18,380
    We can use esthetics of the lines or the points, particularly color, size, shape,
  • Not Synced
    139
    00:15:18,380 --> 00:15:26,730
    and then we can use facets to build up even more variables into our plot.
  • Not Synced
    140
    00:15:26,730 --> 00:15:32,370
    To do faceting, there are a couple of things you can do. It's built into some of the seaborn plotting functions.
  • Not Synced
    141
    00:15:32,370 --> 00:15:37,230
    The relplot and catplot functions can both do faceting on their own.
  • Not Synced
    142
    00:15:37,230 --> 00:15:42,240
    They let you control the statistic. They're very, very flexible functions for a wide range of plots.
  • Not Synced
    143
    00:15:42,240 --> 00:15:48,810
    The general-purpose FacetGrid allows you to facet any kind of plot by writing some more Python code on your own.
  • Not Synced
    144
    00:15:48,810 --> 00:15:53,190
    Very useful if you want to facet something that doesn't support faceting built in.
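To make the mechanics of a facet concrete, here is a hand-rolled sketch in plain matplotlib rather than seaborn's built-in faceting. All of the numbers are made up for illustration, and the variable names are hypothetical. It draws one panel per class with a shared y axis, and puts sex on both color and marker style.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up survival rates by age decade, sex, and class; NOT the real Titanic data.
decades = [0, 10, 20, 30, 40]
rates = {
    "First": {"female": [1.0, 0.9, 0.95, 0.9, 0.9], "male": [0.8, 0.4, 0.4, 0.35, 0.3]},
    "Second": {"female": [1.0, 0.9, 0.9, 0.85, 0.8], "male": [0.9, 0.2, 0.1, 0.1, 0.1]},
    "Third": {"female": [0.5, 0.5, 0.5, 0.4, 0.3], "male": [0.35, 0.15, 0.15, 0.15, 0.1]},
}
markers = {"female": "o", "male": "s"}

# One panel per class; sharey=True gives every panel the same y axis.
fig, axes = plt.subplots(1, len(rates), sharey=True, figsize=(9, 3))
for ax, (pclass, by_sex) in zip(axes, rates.items()):
    for sex, ys in by_sex.items():
        # Lines are drawn in the same order in every panel, so the default
        # color cycle assigns each sex the same color throughout.
        ax.plot(decades, ys, marker=markers[sex], label=sex)
    ax.set_title(f"{pclass} class")
    ax.set_xlabel("Age (decade)")

axes[0].set_ylabel("Fraction surviving")
axes[0].set_ylim(0, 1)  # propagates to the other panels because y is shared
axes[0].legend()
```

Because the y axis is shared, a point at the same height means the same survival rate in every panel, which is what makes the comparison across the row direct.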
  • Not Synced
    145
    00:15:53,190 --> 00:15:59,940
    And if you're using plotnine or the R ggplot2 package, facet_grid and facet_wrap control faceting.
  • Not Synced
    146
    00:15:59,940 --> 00:16:07,680
    When you build that faceted plot, you need to pay attention to what variables go where. Your choice of which variables are going to be on color,
  • Not Synced
    147
    00:16:07,680 --> 00:16:16,530
    what variables are going to be facets, which variables are going to be on your axes really affects how the reader is going to interpret and understand
  • Not Synced
    148
    00:16:16,530 --> 00:16:22,680
    your plot and you need to choose them strategically to tell the story that addresses your question.
  • Not Synced
    149
    00:16:22,680 --> 00:16:28,950
    You also need to do it, though, in a way that is honest and does not mislead your user, your readers.
  • Not Synced
    150
    00:16:28,950 --> 00:16:38,810
    The chart needs to honestly show the readers what it is that you learned from the data and show that clearly.
  • Not Synced
    151
    00:16:38,810 --> 00:16:43,610
    Another thing we can do to build up a chart, especially if we have more categorical variables,
  • Not Synced
    152
    00:16:43,610 --> 00:16:47,180
    if we've got a categorical response variable with more than two levels,
  • Not Synced
    153
    00:16:47,180 --> 00:16:56,390
    and we want to show, particularly, how the proportion in different categories changes in response to another variable,
  • Not Synced
    154
    00:16:56,390 --> 00:17:04,250
    a stacked chart can be very good. It lets us see the differences in composition, to see how the parts of a whole change.
  • Not Synced
    155
    00:17:04,250 --> 00:17:05,390
    And so this chart,
  • Not Synced
    156
    00:17:05,390 --> 00:17:16,060
    this is a stacked bar chart, and it's a horizontal bar chart where I put the explanatory variable on the y axis,
  • Not Synced
    157
    00:17:16,060 --> 00:17:22,150
    in part just to make the labels easier to read. And so our explanatory variable is what data set
  • Not Synced
    158
    00:17:22,150 --> 00:17:28,210
    something came from: Locke, M.D. Gry. What those are doesn't matter for our purposes right now.
  • Not Synced
    159
    00:17:28,210 --> 00:17:34,540
    The response variable is the distribution of genders. In this case,
  • Not Synced
    160
    00:17:34,540 --> 00:17:39,970
    these are data sets of books, so it's the genders of the authors of those books in the data set.
  • Not Synced
    161
    00:17:39,970 --> 00:17:47,260
    And so we have female, we've got male, and we also have codes for when it's ambiguous or unknown or we didn't have data.
  • Not Synced
    162
    00:17:47,260 --> 00:17:59,470
    And so we can see, for example, the GYŐRI data set has a higher fraction of women and a significantly lower fraction of men.
  • Not Synced
    163
    00:17:59,470 --> 00:18:12,900
    And we can see quite a few more books that we don't know the gender on. And so the order on this chart is very strategic.
  • Not Synced
    164
    00:18:12,900 --> 00:18:22,230
    How I ordered these levels is very strategic. I batched all of the various kinds of "we don't know" together so that
  • Not Synced
    165
    00:18:22,230 --> 00:18:27,180
    you can look at that whole block and see the various types of
  • Not Synced
    166
    00:18:27,180 --> 00:18:33,720
    "we don't know the gender of the book's author" together, but you can also see how they're broken down into individual things.
  • Not Synced
    167
    00:18:33,720 --> 00:18:41,790
    You can see that unlinked is a very, very large fraction of that increase in books where we don't know the author's gender.
  • Not Synced
    168
    00:18:41,790 --> 00:18:46,560
    So you need to think you need to think about all of these different things in order
  • Not Synced
    169
    00:18:46,560 --> 00:18:51,630
    to be able to generate a chart that's going to clearly and unambiguously communicate.
  • Not Synced
    170
    00:18:51,630 --> 00:18:54,930
    You can show raw values in a stacked bar chart, in which case the bars
  • Not Synced
    171
    00:18:54,930 --> 00:18:58,690
    don't all have to be the same height, or you can show fractions, in which case they will be.
  • Not Synced
    172
    00:18:58,690 --> 00:19:07,080
    I chose to show fractions in this chart. The code that generates this using raw matplotlib is linked in the notes for the video.
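This is not the linked matplotlib code, just a minimal sketch of the arithmetic behind a fraction-based stacked bar. It assumes made-up counts and placeholder data set names `A` and `B`, and `stacked_segments` is a hypothetical helper.

```python
# Made-up counts per data set and category; illustration only.
counts = {
    "A": {"female": 30, "male": 50, "unknown": 20},
    "B": {"female": 45, "male": 30, "unknown": 75},
}


def stacked_segments(row):
    """Return [(category, left_edge, width)] with widths that sum to 1."""
    total = sum(row.values())
    segments, left = [], 0.0
    for category, n in row.items():
        width = n / total  # fraction of the whole bar
        segments.append((category, left, width))
        left += width  # the next segment starts where this one ends
    return segments


for name, row in counts.items():
    print(name, [(c, round(l, 2), round(w, 2)) for c, l, w in stacked_segments(row)])
```

Each tuple is what a horizontal bar call would need: where the segment starts and how wide it is. The order of the categories in the dictionary is the stacking order, so that is where the strategic ordering of levels comes in.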
  • Not Synced
    173
    00:19:07,080 --> 00:19:09,330
    Sometimes we're also going to transform our charts.
  • Not Synced
    174
    00:19:09,330 --> 00:19:15,330
    We might transform the axis, such as using a log-10 scale. When we transform the axis,
  • Not Synced
    175
    00:19:15,330 --> 00:19:20,070
    the labels are still in their original values. It's just they're spaced out logarithmically.
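A tiny sketch of what "spaced out logarithmically" means, using illustrative tick values that are not taken from any chart in the video:

```python
import math

ticks = [1, 10, 100, 1000]  # the labels keep their original values
positions = [math.log10(t) for t in ticks]  # where a log-10 axis places them
gaps = [b - a for a, b in zip(positions, positions[1:])]

print(positions)
print(gaps)
```

Each factor of ten takes up the same distance along the axis, so equal spacing on the chart corresponds to equal ratios in the data rather than equal differences.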
  • Not Synced
    176
    00:19:20,070 --> 00:19:24,780
    We generally won't do this for bars. A bar on a log scale:
  • Not Synced
    177
    00:19:24,780 --> 00:19:30,840
    you can draw it, but you have to be really, really careful in order to make sure that your readers are going to accurately interpret it.
  • Not Synced
    178
    00:19:30,840 --> 00:19:35,580
    But for line and scatter plots, log transforms are a lot more common.
  • Not Synced
    179
    00:19:35,580 --> 00:19:41,580
    Sometimes, though, we're actually going to transform the data itself and we're going to plot a log or a square root or some other rescaling.
  • Not Synced
    180
    00:19:41,580 --> 00:19:49,160
    And another common transformation is to bin the data, to somehow discretize it into fixed bins
  • Not Synced
    181
    00:19:49,160 --> 00:19:54,530
    by some mechanism or another. So the key decisions that you need to make when you're making one of these charts
  • Not Synced
    182
    00:19:54,530 --> 00:20:00,920
    are: you need to pick the variables and their transformations. You need to pick what's called the esthetics:
  • Not Synced
    183
    00:20:00,920 --> 00:20:06,680
    how you're going to map the different variables you're looking at to chart features: your x and y axes,
  • Not Synced
    184
    00:20:06,680 --> 00:20:10,700
    your facet rows and columns, your color, your point marker style.
  • Not Synced
    185
    00:20:10,700 --> 00:20:14,690
    If you're doing a joint plot, often it's useful to put
  • Not Synced
    186
    00:20:14,690 --> 00:20:25,370
    the same variable on both color and style, and that way, if you have a reader who's colorblind, they still get different point styles,
  • Not Synced
    187
    00:20:25,370 --> 00:20:30,050
    even if they can't tell the colors apart or if someone's putting it on a black and white printer.
  • Not Synced
    188
    00:20:30,050 --> 00:20:34,340
    And then you need the type of the chart: line chart, bar, point, box, et cetera.
  • Not Synced
    189
    00:20:34,340 --> 00:20:41,750
    So you have to make all of these decisions when you're drawing this chart and they're driven by what variables and data you have and what
  • Not Synced
    190
    00:20:41,750 --> 00:20:50,210
    question you're trying to answer and what story you're trying to tell about that. You do need to be careful to avoid excessive complexity.
  • Not Synced
    191
    00:20:50,210 --> 00:20:58,550
    We can put a different variable on every conceivable esthetic and it's often going to result in a chart that's very difficult to read.
  • Not Synced
    192
    00:20:58,550 --> 00:21:03,410
    We also have to be careful with color because it's easy to make a chart that has differences
  • Not Synced
    193
    00:21:03,410 --> 00:21:07,580
    that are difficult for the human eye to distinguish or get obscured by printers,
  • Not Synced
    194
    00:21:07,580 --> 00:21:17,990
    low quality displays, etc. It's also important to note a good graphic reveals the data and does not distort or obscure the data.
  • Not Synced
    195
    00:21:17,990 --> 00:21:24,530
    It's easy to create a graphic that manipulates the data to tell a story that's not very well supported.
  • Not Synced
    196
    00:21:24,530 --> 00:21:30,080
    And we want to avoid that when we're doing data science with honesty and integrity.
  • Not Synced
    197
    00:21:30,080 --> 00:21:35,120
    So to wrap up: you need to identify the variables and relationships that you want to highlight in your chart.
  • Not Synced
    198
    00:21:35,120 --> 00:21:38,030
    You want to design a plot that illustrates them,
  • Not Synced
    199
    00:21:38,030 --> 00:21:43,760
    and you're going to need to spend some time studying your plotting library's APIs and the plotting library's gallery.
  • Not Synced
    200
    00:21:43,760 --> 00:21:50,180
    Any plotting library usually has a gallery of a bunch of different plots and the code that was used to generate them.
  • Not Synced
    201
    00:21:50,180 --> 00:21:52,880
    Seaborn has this, matplotlib has this.
  • Not Synced
    202
    00:21:52,880 --> 00:22:00,650
    And so spend some time with that, looking: oh, this looks like the kind of plot that might display my data well.
  • Not Synced
    203
    00:22:00,650 --> 00:22:14,167
    And then look and click on it and see what code they use to generate it and borrow it.
  • Not Synced
Title:
https:/.../bb921f27-23a5-4103-afff-ad970182fbb2-a99a3566-e841-4af0-9607-ad9c006e1504.mp4?invocationId=5b7bb097-a60f-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
22:14
