https:/.../9077db8e-b57d-4825-a472-ad9601830d80-6177ebff-599e-4e9f-bc3d-ad9c006c9db8.mp4?invocationId=994e94ce-a50f-ec11-a9e9-0a1a827ad0ec

  • Not Synced
    1
    00:00:04,510 --> 00:00:05,680
    Welcome back. In this video,
  • Not Synced
    2
    00:00:05,680 --> 00:00:10,480
    I'm going to walk you through some of the different types of charts that we're going to be learning how to create. Outcomes are:
  • Not Synced
    3
    00:00:10,480 --> 00:00:16,720
    be able to identify the appropriate type of chart for data and a question, and understand key rules to avoid common errors.
  • Not Synced
    4
    00:00:16,720 --> 00:00:20,880
    I'm not going to be showing the detailed code for these chart types in the video.
  • Not Synced
    5
    00:00:20,880 --> 00:00:24,640
    You're going to be able to find that in the documentation linked from here. And also,
  • Not Synced
    6
    00:00:24,640 --> 00:00:28,600
    I'm going to be preparing a notebook that demonstrates several of these chart
  • Not Synced
    7
    00:00:28,600 --> 00:00:34,330
    types with the actual code to create them using the software we're discussing.
  • Not Synced
    8
    00:00:34,330 --> 00:00:44,470
    So common software for this is seaborn and matplotlib; those are going to be the primary ones that we're working with this semester.
  • Not Synced
    9
    00:00:44,470 --> 00:00:49,330
    When I'm showing the function names, seaborn is commonly imported as sns.
  • Not Synced
    10
    00:00:49,330 --> 00:00:52,870
    So an sns function is going to be a seaborn function, a plt
  • Not Synced
    11
    00:00:52,870 --> 00:00:59,860
    function is going to be a matplotlib function, and I'm also showing the function you can use in plotnine or R's ggplot2,
  • Not Synced
    12
    00:00:59,860 --> 00:01:04,600
    if you want to use those instead. I often use plotnine for a lot of my graphics.
  • Not Synced
    13
    00:01:04,600 --> 00:01:10,060
    That's just for reference though. We're not going to be getting into much detail on plotnine in this course.
  • Not Synced
    14
    00:01:10,060 --> 00:01:16,130
    So there's a variety of different types of charts. Some of them are showing relative proportions.
  • Not Synced
    15
    00:01:16,130 --> 00:01:23,860
    Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space.
  • Not Synced
    16
    00:01:23,860 --> 00:01:30,070
    A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable.
  • Not Synced
    17
    00:01:30,070 --> 00:01:31,780
    Sometimes they're grouped by a numeric variable as well.
  • Not Synced
    18
    00:01:31,780 --> 00:01:37,600
    But usually our x axis is a categorical variable of some kind. It's best with a moderate number of categories.
  • Not Synced
    19
    00:01:37,600 --> 00:01:41,950
    We can use a second categorical variable to, say, color the bars.
  • Not Synced
    20
    00:01:41,950 --> 00:01:50,170
    So this chart shows the survival rates of Titanic passengers, where the x axis is the passenger class: first, second, or third class.
  • Not Synced
    21
    00:01:50,170 --> 00:01:55,660
    And then the bars are colored based on the gender of the passenger.
  • Not Synced
    22
    00:01:55,660 --> 00:01:58,000
    And so we can see the different survival rates.
  • Not Synced
    23
    00:01:58,000 --> 00:02:06,250
    The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables.
  • Not Synced
    24
    00:02:06,250 --> 00:02:12,970
    Sometimes these will be horizontal. So in the horizontal bar chart, the categorical variable is on the y axis and the bars run horizontally.
  • Not Synced
    25
    00:02:12,970 --> 00:02:17,080
    This also shows some whiskers that come from a confidence interval.
  • Not Synced
    26
    00:02:17,080 --> 00:02:24,760
    It's very easy to generate a default, relatively good confidence interval with seaborn. So to plot
  • Not Synced
    27
    00:02:24,760 --> 00:02:30,730
    these, seaborn has the countplot function, which does a quick,
  • Not Synced
    28
    00:02:30,730 --> 00:02:36,970
    basically categorical histogram: how many observations are in each category.
  • Not Synced
    29
    00:02:36,970 --> 00:02:41,830
    The catplot function will plot, by default, a mean value for each category.
  • Not Synced
    30
    00:02:41,830 --> 00:02:47,860
    And if you have it do the mean plotting, it will also compute 95 percent confidence intervals.
  • Not Synced
    31
    00:02:47,860 --> 00:02:53,350
    That's what's being shown in this plot here.
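
A minimal sketch of how a 95 percent interval around a category mean can be computed. It assumes a percentile bootstrap, which is my understanding of how seaborn derives its default error bars for means; this is not seaborn's own code. The helper name `bootstrap_ci` and the ratings are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, level=0.95, n_boot=1000, seed=0):
    """Percentile-bootstrap interval for the mean of `values` (illustrative helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lo_idx = round(tail * n_boot)
    hi_idx = round((1.0 - tail) * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical ratings for one category (not the data from the video).
ratings = [3.1, 2.8, 3.5, 4.0, 2.9, 3.3, 3.7, 3.0]
mean = statistics.fmean(ratings)
low, high = bootstrap_ci(ratings)
print(f"mean={mean:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

The whiskers on a mean bar chart would then run from `low` to `high` for each category.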
  • Not Synced
    32
    00:02:53,350 --> 00:03:00,050
    And then you can also use the matplotlib bar function or the plotnine geom_bar.
  • Not Synced
    33
    00:03:00,050 --> 00:03:04,340
    So a few rules about bar charts. First is: never start the y axis on a bar chart at
  • Not Synced
    34
    00:03:04,340 --> 00:03:13,320
    anything but zero. And so the reason for this, we can see here.
  • Not Synced
    35
    00:03:13,320 --> 00:03:17,270
    So the top one. These are looking at the mean average ratings.
  • Not Synced
    36
    00:03:17,270 --> 00:03:22,770
    We take each movie's mean rating, and then we compute the mean of the average ratings within a genre.
  • Not Synced
    37
    00:03:22,770 --> 00:03:33,570
    What is that? So if we look here, the difference between horror and IMAX is a notable difference, but it's a difference of about 0.5 or so.
  • Not Synced
    38
    00:03:33,570 --> 00:03:40,080
    The difference between sci fi and short is a difference of a little under one, probably.
  • Not Synced
    39
    00:03:40,080 --> 00:03:46,170
    But when we start the Y axis at 2.5 instead of zero,
  • Not Synced
    40
    00:03:46,170 --> 00:03:51,840
    what happens is the differences look much larger than they are, because of how the human eye works.
  • Not Synced
    41
    00:03:51,840 --> 00:04:01,050
    We not only want to see the difference; it's also very natural for us to compare the difference to the bar length, because these are bars.
  • Not Synced
    42
    00:04:01,050 --> 00:04:06,620
    They have length, they have an area. Since they're all the same width, the length is proportional to the area.
  • Not Synced
    43
    00:04:06,620 --> 00:04:12,660
    Breaking length-area proportionality is a good way to confuse your readers.
  • Not Synced
    44
    00:04:12,660 --> 00:04:20,910
    It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't.
  • Not Synced
    45
    00:04:20,910 --> 00:04:25,770
    It's really a shift from about 2.8 to 3.3 or 3.4.
  • Not Synced
    46
    00:04:25,770 --> 00:04:33,090
    And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are.
  • Not Synced
    47
    00:04:33,090 --> 00:04:39,210
    So when I talked about integrity and avoiding deception, when I was introducing statistical graphics,
  • Not Synced
    48
    00:04:39,210 --> 00:04:45,750
    this is what I was talking about. The difference is there; it's just not as big as it looks like it is
  • Not Synced
    49
    00:04:45,750 --> 00:04:47,460
    when we truncate our bar charts.
  • Not Synced
    50
    00:04:47,460 --> 00:04:55,050
    So the general rule here, to generalize beyond bar charts, is: if something has a length that varies based on the data,
  • Not Synced
    51
    00:04:55,050 --> 00:05:04,470
    that length needs to actually represent the value, not the value minus something because you started the axis somewhere else.
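
The arithmetic behind that rule can be sketched as follows. The two ratings are the approximate values quoted above (about 2.8 and about 3.3); the function names `drawn_length` and `apparent_ratio` are illustrative only.

```python
def drawn_length(value, axis_start=0.0):
    """Length of a bar as drawn when the axis starts at `axis_start`."""
    return value - axis_start


def apparent_ratio(a, b, axis_start=0.0):
    """How many times longer the bar for `b` looks than the bar for `a`."""
    return drawn_length(b, axis_start) / drawn_length(a, axis_start)


horror, imax = 2.8, 3.3  # approximate mean ratings read off the chart

honest = apparent_ratio(horror, imax, axis_start=0.0)
truncated = apparent_ratio(horror, imax, axis_start=2.5)
print(f"axis from 0:   IMAX bar is {honest:.2f}x the horror bar")
print(f"axis from 2.5: IMAX bar is {truncated:.2f}x the horror bar")
```

With the axis at zero the IMAX bar is a bit under 1.2 times as long as the horror bar; starting at 2.5 makes it look more than twice as long, even though the data did not change.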
  • Not Synced
    52
    00:05:04,470 --> 00:05:06,960
    Also, if you're including whiskers, like I did in the previous chart,
  • Not Synced
    53
    00:05:06,960 --> 00:05:14,040
    define how they're computed. And also, one thing to just be careful of with seaborn's catplot and countplot:
  • Not Synced
    54
    00:05:14,040 --> 00:05:21,270
    If you aren't using the color for a second variable, they will just make every bar a different color for no particular reason,
  • Not Synced
    55
    00:05:21,270 --> 00:05:25,010
    which creates something that's different when it doesn't need to be. So
  • Not Synced
    56
    00:05:25,010 --> 00:05:29,520
    it causes the reader to look for a difference that isn't actually there. Best avoided.
  • Not Synced
    57
    00:05:29,520 --> 00:05:32,640
    You can fix that by just specifying the color.
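
A small sketch of the single-color fix, written against matplotlib directly so the counting step is visible. The helper names and the passenger classes are hypothetical. In seaborn, passing one `color=` value to `countplot` should have the same effect, and whether bars get different colors by default may depend on the seaborn version.

```python
from collections import Counter


def category_counts(values):
    """Count observations per category, sorted by category label."""
    return dict(sorted(Counter(values).items()))


def draw_count_bars(counts, color="steelblue"):
    """Draw one bar per category, all in the same color."""
    # Imported here so the counting above works without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()), color=color)
    ax.set_ylim(bottom=0)  # bar lengths should start from zero
    ax.set_ylabel("count")
    return fig, ax


# Hypothetical passenger classes, not the real Titanic data.
classes = ["third", "first", "third", "second", "third", "first"]
counts = category_counts(classes)
print(counts)
```

Calling `draw_count_bars(counts)` then gives bars that differ only in height, so the reader is not sent looking for a meaning in the colors.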
  • Not Synced
    58
    00:05:32,640 --> 00:05:40,050
    We saw histograms last week. A histogram is a bar chart, but the categorical is bins, or ranges, of a numerical value.
  • Not Synced
    59
    00:05:40,050 --> 00:05:46,800
    Also, though, if we have a bar chart that's showing the relative frequency of the values of a categorical variable, that can also be called a histogram.
  • Not Synced
    60
    00:05:46,800 --> 00:05:50,730
    The y axis is either the number or the fraction of occurrences in this case.
  • Not Synced
    61
    00:05:50,730 --> 00:05:58,170
    The key thing, though, is that the different heights of the bars let us see visually the relative frequency of different values.
  • Not Synced
    62
    00:05:58,170 --> 00:06:01,680
    So it really makes it visually clear how the data is shaped.
  • Not Synced
    63
    00:06:01,680 --> 00:06:07,950
    We can see skews and things like that. It's one way to graphically describe a distribution.
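
To make the binning concrete, here is a minimal sketch that counts values in equal-width bins and converts the counts to fractions. The function `histogram` and the ratings are illustrative; plotting libraries do this step for you.

```python
def histogram(values, n_bins, lo, hi):
    """Count values in `n_bins` equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value exactly on the top edge belongs to the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical ratings on a 0 to 5 scale.
ratings = [0.5, 1.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0]
counts = histogram(ratings, n_bins=5, lo=0.0, hi=5.0)
fractions = [c / len(ratings) for c in counts]

# A text rendering: longer rows mean more observations in that range.
for i, c in enumerate(counts):
    print(f"[{i}, {i + 1}) " + "#" * c)
```

Most of the marks pile up in the top two bins, which is the kind of skew a histogram makes visible at a glance.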
  • Not Synced
    64
    00:06:07,950 --> 00:06:12,120
    A scatterplot shows two numeric variables. So each observation is a dot.
  • Not Synced
    65
    00:06:12,120 --> 00:06:16,800
    Each observation has two numeric variables. And we put one variable on the x axis,
  • Not Synced
    66
    00:06:16,800 --> 00:06:22,290
    the other variable on the y axis, and put the dot where its variable values would intersect.
  • Not Synced
    67
    00:06:22,290 --> 00:06:26,700
    This is really useful for seeing how two variables relate. Does one increase with the other?
  • Not Synced
    68
    00:06:26,700 --> 00:06:30,840
    Do points clump or cluster in an interesting way? Other interesting patterns.
  • Not Synced
    69
    00:06:30,840 --> 00:06:43,050
    It helps us find outliers. So this scatterplot is showing the tip versus the total bill for a bunch of restaurant bills.
  • Not Synced
    70
    00:06:43,050 --> 00:06:45,540
    And each observation is a bill.
  • Not Synced
    71
    00:06:45,540 --> 00:06:55,380
    And then the x axis is the total bill, and the y axis is the tip that the customer added to the bill.
  • Not Synced
    72
    00:06:55,380 --> 00:07:02,790
    And there are a couple of refinements we can do here. We can color or change the point type by a categorical variable.
  • Not Synced
    73
    00:07:02,790 --> 00:07:08,140
    So on this one, we've changed it so that the points are different colors.
  • Not Synced
    74
    00:07:08,140 --> 00:07:13,310
    So the dinners are blue circles and the lunches are orange Xs.
  • Not Synced
    75
    00:07:13,310 --> 00:07:19,130
    We could also add a trend line or some other kind of a line to show some context, for example, on this chart,
  • Not Synced
    76
    00:07:19,130 --> 00:07:27,170
    we might want to plot a line that shows the 20 percent point, and that lets us easily see where we're going over 20 percent,
  • Not Synced
    77
    00:07:27,170 --> 00:07:33,650
    how the tips are distributed relative to a 20 percent mark.
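
A hedged sketch of that 20 percent reference line. The bills and tips are made up rather than taken from the dataset in the video, and the function names are illustrative. The drawing function uses plain matplotlib calls (`scatter`, `plot`).

```python
def share_above_rate(total_bills, tips, rate=0.20):
    """Fraction of bills whose tip exceeds `rate` times the total bill."""
    above = sum(1 for bill, tip in zip(total_bills, tips) if tip > rate * bill)
    return above / len(total_bills)


def draw_tips(total_bills, tips, rate=0.20):
    """Scatter tip against total bill and overlay the `rate` reference line."""
    # Imported here so the numeric part runs without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(total_bills, tips)
    xs = [0, max(total_bills)]
    ax.plot(xs, [rate * x for x in xs], linestyle="--", color="gray",
            label=f"{rate:.0%} tip")
    ax.set_xlabel("total bill")
    ax.set_ylabel("tip")
    ax.legend()
    return fig, ax


# Hypothetical bills and tips.
bills = [10.0, 20.0, 25.0, 40.0, 50.0]
tips = [2.5, 3.0, 4.0, 9.0, 7.5]
share = share_above_rate(bills, tips)
print(f"{share:.0%} of bills tipped above 20 percent")
```

Points above the dashed line in `draw_tips(bills, tips)` are the bills that tipped more than 20 percent.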
  • Not Synced
    78
    00:07:33,650 --> 00:07:41,170
    Also, x can be a categorical variable. When that happens, we call this a point plot or a strip plot.
  • Not Synced
    79
    00:07:41,170 --> 00:07:47,020
    Functions for doing this are scatter, scatterplot, and then plotnine's geom_point.
  • Not Synced
    80
    00:07:47,020 --> 00:07:50,800
    The seaborn documentation has more examples of these. A line plot:
  • Not Synced
    81
    00:07:50,800 --> 00:07:59,350
    It's like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next
  • Not Synced
    82
    00:07:59,350 --> 00:08:04,930
    by connecting them with a line. It really works best when we have one y per x value that we want to plot.
  • Not Synced
    83
    00:08:04,930 --> 00:08:11,260
    If we've got more than one, it really starts getting very, very jagged. It's very common for time series.
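
One way to get back to a single y per x is to aggregate before plotting. The sketch below collapses repeated x values to their mean; the helper name `one_y_per_x` and the measurements are hypothetical. As far as I know, seaborn's lineplot does a similar mean aggregation by default when x values repeat.

```python
from collections import defaultdict
from statistics import fmean


def one_y_per_x(xs, ys):
    """Collapse repeated x values to a single mean y, sorted by x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return [(x, fmean(groups[x])) for x in sorted(groups)]


# Hypothetical measurements: two readings at each time point.
times = [0, 0, 1, 1, 2, 2]
values = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0]
points = one_y_per_x(times, values)
print(points)
```

Plotting `points` gives one smooth segment per time step instead of a line that zigzags between the two readings.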
  • Not Synced
    84
    00:08:11,260 --> 00:08:15,430
    So this is another example from the seaborn tutorial, not labeled super well.
  • Not Synced
    85
    00:08:15,430 --> 00:08:22,510
    I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time and it's going negative.
  • Not Synced
    86
    00:08:22,510 --> 00:08:29,710
    Note that zero on the y axis is at the top, and the values otherwise are negative. Functions to create
  • Not Synced
    87
    00:08:29,710 --> 00:08:36,370
    these are lineplot from seaborn, plot from matplotlib, and geom_line from plotnine.
  • Not Synced
    88
    00:08:36,370 --> 00:08:41,380
    A box plot shows the distribution of a numeric variable grouped by a categorical.
  • Not Synced
    89
    00:08:41,380 --> 00:08:46,630
    So the bar chart just showed us, say, the average value, maybe with a confidence interval.
  • Not Synced
    90
    00:08:46,630 --> 00:08:51,820
    The box plot actually shows us the distribution and it does so in a way that's based on the median.
  • Not Synced
    91
    00:08:51,820 --> 00:09:01,050
    So the horizontal line in the middle of the box is the median value. The top and bottom of the box
  • Not Synced
    92
    00:09:01,050 --> 00:09:06,750
    are the first and third quartiles: the bottom is the first quartile and the top is the third quartile.
  • Not Synced
    93
    00:09:06,750 --> 00:09:12,630
    And what that means is twenty five percent of the values are below the bottom of the box.
  • Not Synced
    94
    00:09:12,630 --> 00:09:17,700
    Twenty-five percent are in the bottom half of the box, twenty-five percent here, and then twenty-five percent above.
  • Not Synced
    95
    00:09:17,700 --> 00:09:21,390
    We then show these whiskers that extend out to the minimum
  • Not Synced
    96
    00:09:21,390 --> 00:09:27,390
    and maximum of the data, and a number of plotting packages will do some kind of outlier detection.
  • Not Synced
    97
    00:09:27,390 --> 00:09:37,470
    This is using seaborn's default outlier detection, for when the max is very high. The rule it uses by default is this:
  • Not Synced
    98
    00:09:37,470 --> 00:09:44,460
    you've got the IQR, the interquartile range. That's the height of the box. It allows the whisker to be 1.5 times that tall.
  • Not Synced
    99
    00:09:44,460 --> 00:09:50,880
    And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers.
  • Not Synced
    100
    00:09:50,880 --> 00:09:57,930
    You can change it so that the whisker goes all the way up to the max. But it lets you quickly see and compare between different groups
  • Not Synced
    101
    00:09:57,930 --> 00:10:03,210
    the median, the first and third quartiles, and the min and the max of the data.
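
The numbers a box plot draws can be sketched with the standard library. This assumes linearly interpolated quartiles (`method="inclusive"`) and the 1.5 times IQR rule described above, with each whisker stopping at the farthest data point inside its fence. The helper name `box_stats` and the data are illustrative, and a plotting library's exact quartile method may differ slightly.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Median, quartiles, whisker ends, and outliers for one group."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points still inside the fences.
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical data with one value far above the rest.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the upper whisker stops at 9 and the value 30 is reported separately, which is what shows up as an individual point above the box.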
  • Not Synced
    102
    00:10:03,210 --> 00:10:11,520
    Very useful for comparing observations of a variable when they're grouped by some categorical. Functions
  • Not Synced
    103
    00:10:11,520 --> 00:10:19,680
    for doing this are boxplot from both seaborn and matplotlib, and then geom_boxplot from plotnine.
  • Not Synced
    104
    00:10:19,680 --> 00:10:25,710
    A few more plots. A violin plot: it's like a box plot, except it's based around the mean and has curved sides.
  • Not Synced
    105
    00:10:25,710 --> 00:10:30,120
    The swarm plot is another kind of categorical scatterplot.
  • Not Synced
    106
    00:10:30,120 --> 00:10:38,860
    It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily.
  • Not Synced
    107
    00:10:38,860 --> 00:10:46,620
    Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint.
  • Not Synced
    108
    00:10:46,620 --> 00:10:57,750
    But even a pie chart, just because human perception is not super great at accurately comparing angular areas.
  • Not Synced
    109
    00:10:57,750 --> 00:11:00,120
    So usually a bar chart or a
  • Not Synced
    110
    00:11:00,120 --> 00:11:07,680
    stacked bar chart is going to be a better option than a pie chart. Or a donut chart is sometimes a better option, where you've got a circle.
  • Not Synced
    111
    00:11:07,680 --> 00:11:14,250
    This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions.
  • Not Synced
    112
    00:11:14,250 --> 00:11:19,170
    I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you
  • Not Synced
    113
    00:11:19,170 --> 00:11:25,230
    want to show multiple proportions, or relative proportions within different categories.
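
A minimal sketch of the numbers behind a stacked bar of relative proportions: counts are normalized within each category, and each segment starts where the previous one ended. The function names and the meal counts are hypothetical.

```python
def within_category_proportions(counts):
    """Turn {category: {part: count}} into shares that sum to 1 per category."""
    result = {}
    for category, parts in counts.items():
        total = sum(parts.values())
        result[category] = {part: n / total for part, n in parts.items()}
    return result


def stacked_segments(shares):
    """List (part, bottom, height) for the segments of one stacked bar."""
    segments, bottom = [], 0.0
    for part, share in shares.items():
        segments.append((part, bottom, share))
        bottom += share  # the next segment sits on top of this one
    return segments


# Hypothetical counts of meals by part of the week.
meals = {
    "weekday": {"lunch": 3, "dinner": 1},
    "weekend": {"lunch": 1, "dinner": 3},
}
shares = within_category_proportions(meals)
weekday_bar = stacked_segments(shares["weekday"])
print(weekday_bar)
```

Each `(bottom, height)` pair maps directly onto a bar call with a `bottom` offset, and every stacked bar ends at 1.0, so the categories stay comparable.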
  • Not Synced
    114
    00:11:25,230 --> 00:11:29,670
    There's another kind of plot that's not a plot on its own, but it's combined with other kinds of plots.
  • Not Synced
    115
    00:11:29,670 --> 00:11:34,830
    That's a rug plot, useful for just displaying distributions at a margin.
  • Not Synced
    116
    00:11:34,830 --> 00:11:40,500
    So to learn more: I've taken a whirlwind tour through a number of different plot types. The class readings:
  • Not Synced
    117
    00:11:40,500 --> 00:11:46,170
    the paper that I assigned you to read talks through the use cases for a number of different plot types.
  • Not Synced
    118
    00:11:46,170 --> 00:11:50,160
    I'm going to be providing tutorial notebooks that walk you through different plot types.
  • Not Synced
    119
    00:11:50,160 --> 00:11:54,660
    The textbook talks about graph plotting and data visualization.
  • Not Synced
    120
    00:11:54,660 --> 00:11:58,190
    The seaborn and matplotlib docs are extensive. And
  • Not Synced
    121
    00:11:58,190 --> 00:12:04,230
    if you're using another plotting library, its documentation as well. Most plotting libraries also have a gallery.
  • Not Synced
    122
    00:12:04,230 --> 00:12:10,740
    Go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data.
  • Not Synced
    123
    00:12:10,740 --> 00:12:14,580
    Click on it and they'll give you the code to show you how they made that plot.
  • Not Synced
    124
    00:12:14,580 --> 00:12:22,650
    You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library
  • Not Synced
    125
    00:12:22,650 --> 00:12:27,630
    and figure out how to make it show you the data in the way you really want it to.
  • Not Synced
    126
    00:12:27,630 --> 00:12:32,730
    Learning one plotting library really deeply is useful. A lot of the Python ones,
  • Not Synced
    127
    00:12:32,730 --> 00:12:37,350
    especially the ones that are oriented towards static charts, are built on top of matplotlib.
  • Not Synced
    128
    00:12:37,350 --> 00:12:41,910
    So seaborn is a convenience API on top of matplotlib. If you're using seaborn,
  • Not Synced
    129
    00:12:41,910 --> 00:12:49,500
    you're also going to need to use matplotlib calls a lot of the time, when seaborn gets you 90 percent of the way there,
  • Not Synced
    130
    00:12:49,500 --> 00:12:55,470
    but not quite all the way. So to wrap up, there are many different types of charts that have different use cases.
  • Not Synced
    131
    00:12:55,470 --> 00:13:01,020
    Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing.
  • Not Synced
    132
    00:13:01,020 --> 00:13:05,730
    Take some of the examples from, say, the seaborn gallery.
  • Not Synced
    133
    00:13:05,730 --> 00:13:11,670
    Play with them, play with them with some data that I'm giving you, play with them with some data that you have elsewhere.
  • Not Synced
    134
    00:13:11,670 --> 00:13:30,957
    But it takes time and practice, so spend some time with the galleries of the plotting libraries you're using.
  • Not Synced
Title:
https:/.../9077db8e-b57d-4825-a472-ad9601830d80-6177ebff-599e-4e9f-bc3d-ad9c006c9db8.mp4?invocationId=994e94ce-a50f-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
13:30