Welcome back. In this video, I'm going to walk you through some of the different types of charts that we're going to be learning how to create. The learning outcomes are for you to be able to identify the appropriate type of chart for the data and question at hand, and to understand key rules for avoiding common errors.

I'm not going to be showing the detailed code for these chart types in the video. You're going to be able to find that in the documentation linked from here. I'm also going to be preparing a notebook that demonstrates a number of these chart types with the actual code to create them, using the software we're discussing.

The common software for this is Seaborn and Matplotlib; those are going to be the primary ones that we're working with this semester. When I'm showing the function names, Seaborn is commonly imported as sns, so an sns function is going to be a Seaborn function and a plt function is going to be a Matplotlib function. I'm also showing the function you can use in plotnine or R's ggplot2, if you want to use those instead. I often use plotnine for a lot of my graphics. That's just for reference, though.
We're not going to be getting into much detail on plotnine over the course of this class.

So there's a variety of different types of charts. Some of them show relative proportions. Some of them show how different amounts relate to each other. Some of them show positions in an x-y coordinate space.

A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable. Sometimes they're grouped by a numeric variable as well, but usually our x axis is a categorical variable of some kind. It works best with a moderate number of categories. We can use a second categorical variable to, say, color the bars. So this chart shows the survival rates of Titanic passengers, where the x axis is the passenger class: first, second, or third class. The bars are colored based on the gender of the passenger, and so we can see the different survival rates. The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables. Sometimes these will be horizontal. In a horizontal bar chart, the categorical variable is on the y axis and the bars run horizontally. This chart also shows some whiskers that come from a confidence interval.
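
The grouped bar chart described above could be drawn roughly as follows. This is a minimal sketch, not code from the lecture: the passenger records are invented (they are not the real Titanic data), the column names are hypothetical, and sns.barplot is chosen for this sketch as one Seaborn function that plots a mean per group with confidence-interval whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented passenger records (NOT the real Titanic data): 1 = survived, 0 = did not.
passengers = pd.DataFrame({
    "pclass": ["First"] * 4 + ["Second"] * 4 + ["Third"] * 4,
    "sex": ["female", "female", "male", "male"] * 3,
    "survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

fig, ax = plt.subplots()

# Categorical variable on x, numeric variable on y, second categorical as bar color.
# barplot draws the mean of y within each (pclass, sex) group, plus whiskers.
sns.barplot(data=passengers, x="pclass", y="survived", hue="sex", ax=ax)

ax.set_ylim(bottom=0)  # bars are measured from zero
ax.set_ylabel("Survival rate")

# The same group means, computed directly so they can be checked against the bars.
rates = passengers.groupby(["pclass", "sex"])["survived"].mean()
print(rates)
```

Swapping the x and y arguments should give the horizontal version, with the categorical variable on the y axis.
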
It's very easy to generate a default, relatively good confidence interval with Seaborn. To plot these, Seaborn has the countplot function, which does a quick, basically categorical histogram: how many observations are in each category. The catplot function, when you ask it for bars (kind="bar"), will plot a mean value for each category by default. And if you have it do the mean plotting, it will also compute 95 percent confidence intervals. That's what's being shown in this plot here. You can also use the Matplotlib bar function or plotnine's geom_bar.

So, a few rules about bar charts. The first is: never start the y axis on a bar chart at anything but zero. The reason for this we can see here. These charts are looking at mean average ratings. We take each movie's mean rating, and then we compute the mean of those average ratings within a genre. If we look at the top one, the difference between Horror and IMAX is a notable difference, but it's a difference of about 0.5 or so. The difference between Sci-Fi and Short is a little under one, probably.
But when we start the y axis at 2.5 instead of zero, what happens is that the differences look much larger than they are. The human eye doesn't only see the difference; it's very natural for us to compare the difference to the bar length. Because these are bars, they have a length and they have an area, and since they're all the same width, the length is proportional to the area. Breaking length-area proportionality is a good way to confuse your readers. It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't. It's really a shift from about 2.8 to 3.3 or 3.4. So it creates a distortion: it highlights the differences, but it makes the differences look larger than they are. When I talked about integrity and avoiding deception while introducing statistical graphics, this is what I was talking about. The difference is there; it's just not as big as it looks when we truncate our bar charts.
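
The size of that distortion can be worked out with a little arithmetic. The sketch below is an illustration rather than anything from the lecture: apparent_ratio is a hypothetical helper, and 2.8 and 3.3 are the approximate ratings quoted above, not exact values.

```python
def apparent_ratio(smaller, larger, baseline=0.0):
    """Ratio of the drawn bar lengths when the y axis starts at `baseline`."""
    return (larger - baseline) / (smaller - baseline)

# Approximate values quoted in the lecture, not exact data.
horror, imax = 2.8, 3.3

true_ratio = apparent_ratio(horror, imax)                      # axis starts at zero
truncated_ratio = apparent_ratio(horror, imax, baseline=2.5)   # axis starts at 2.5

print(f"axis from 0:   longer bar is {true_ratio:.2f}x the shorter one")
print(f"axis from 2.5: longer bar is {truncated_ratio:.2f}x the shorter one")
```

With the axis starting at zero the bar lengths differ by under 20 percent; starting at 2.5, the longer bar is drawn more than twice as long, even though the data have not changed.
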
So the general rule here, to generalize beyond bar charts, is: if something has a length that varies based on the data, that length needs to actually represent the value, not the value minus something because you started the axis somewhere else.

Also, if you're including whiskers, like I did in the previous chart, define how they're computed. One other thing to be careful of with Seaborn's catplot and countplot: if you aren't using the color for a second variable, they will just make every bar a different color for no particular reason, which creates something that's different when it doesn't need to be. It causes the reader to look for a difference that isn't actually there, so it's best avoided. You can fix that by just specifying the color.

We saw histograms last week. A histogram is a bar chart where the categorical variable is bins, or ranges, of a numerical value. Also, though, if we have a bar chart that's showing the relative frequency of the values of a categorical variable, that can also be called a histogram. The y axis is either the number or the fraction of occurrences in this case. The key thing, though, is that the different heights of the bars let us see visually the relative frequency of different values.
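
Both senses of histogram come down to counting. The following sketch uses only the Python standard library and invented data; the names genres, ages, and bin_width are hypothetical. It computes the two kinds of bar heights mentioned above, counts and fractions, for a categorical variable, and then counts per bin for a numeric one.

```python
from collections import Counter

# Invented categorical data: one genre label per movie.
genres = ["Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy", "Sci-Fi", "Drama"]

counts = Counter(genres)                       # y axis as number of occurrences
total = sum(counts.values())
fractions = {g: n / total for g, n in counts.items()}   # y axis as fraction

for genre, n in counts.most_common():
    # A crude text bar chart: bar length is proportional to the count.
    print(f"{genre:<7} {n:>2} {fractions[genre]:.3f} {'#' * n}")

# Invented numeric data, grouped into bins of width 10 (0-9, 10-19, ...).
ages = [3, 17, 22, 25, 31, 38, 45, 52]
bin_width = 10
binned = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(binned):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * binned[start]}")
```
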
So it really makes it visually clear how the data is shaped. We can see skew and things like that. It's one way to graphically describe a distribution.

A scatterplot shows two numeric variables. Each observation is a dot. Each observation has two numeric variables, and we put one variable on the x axis, the other variable on the y axis, and put the dot where its two values intersect. This is really useful for seeing how two variables relate. Does one increase with the other? Do points clump or cluster in an interesting way? Are there other interesting patterns? It also helps us find outliers.

This scatterplot is showing the tip versus the total bill for a bunch of restaurant bills. Each observation is a bill. The x axis is the total bill, and the y axis is the tip that the customer added to the bill. There are a couple of refinements we can make here. We can change the color or the point type by a categorical variable. On this one, we've changed it so that the points differ by meal: the dinners are blue circles and the lunches are orange Xs.
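
A scatterplot of this kind, with color and marker set by a categorical variable, might look like the following in plain Matplotlib. This is a sketch under assumptions: the bills are invented rather than taken from the dataset shown in the video, and drawing one scatter call per category is just one simple way to get a different marker for each group.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented restaurant bills: (total bill, tip, meal).
bills = [
    (12.50, 2.00, "Lunch"), (18.20, 3.00, "Lunch"), (9.75, 1.50, "Lunch"),
    (24.60, 4.50, "Dinner"), (31.10, 5.00, "Dinner"), (45.80, 9.00, "Dinner"),
    (16.40, 2.50, "Dinner"),
]

# One color and marker per category, matching the description above.
styles = {
    "Dinner": {"color": "tab:blue", "marker": "o"},
    "Lunch": {"color": "tab:orange", "marker": "x"},
}

fig, ax = plt.subplots()
for meal, style in styles.items():
    totals = [total for total, tip, m in bills if m == meal]
    tips = [tip for total, tip, m in bills if m == meal]
    ax.scatter(totals, tips, label=meal, **style)  # one dot per bill

ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
ax.legend(title="Meal")
```
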
We could also add a trend line or some other kind of line to show some context. For example, on this chart, we might want to plot a line that shows the 20 percent point. That would let us easily see where we're going over 20 percent and how the tips are distributed relative to a 20 percent mark.

X can also be a categorical variable. When that happens, we call this a point plot or a strip plot. Functions for doing this are scatter from Matplotlib, scatterplot from Seaborn, and plotnine's geom_point. The Seaborn documentation has some examples of more of these.

A line plot is like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next by connecting them with a line. It really works best when we have one y per x value that we want to plot. If we've got more than one, it starts getting very, very jagged. It's very common for time series. So this is another example from the Seaborn tutorial. It's not labeled super well, so I don't know what the value actually is, but it shows that we have some kind of value that's changing over time, and it's going negative. Zero on the y axis is at the top, and the values otherwise are negative.
Functions to create line plots are lineplot from Seaborn, plot from Matplotlib, and geom_line from plotnine.

A box plot shows the distribution of a numeric variable grouped by a categorical variable. The bar chart just showed us, say, the average value, maybe with a confidence interval. The box plot actually shows us the distribution, and it does so in a way that's based on the median. The horizontal line in the middle of the box is the median value. The bottom and top of the box are the first and third quartiles: the bottom is the first quartile and the top is the third quartile. What that means is that 25 percent of the values are below the bottom of the box, 25 percent are in the bottom half of the box, 25 percent are in the top half, and 25 percent are above the box.

We then show these whiskers that extend out to the minimum and maximum of the data, and a number of plotting packages will do some kind of outlier detection. This is using Seaborn's default outlier detection. The rule it uses by default works like this: you've got the IQR, the interquartile range, which is the height of the box. It allows the whisker to be 1.5 times that tall.
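
The numbers behind a box plot can be computed by hand. The sketch below uses only the Python standard library; box_plot_summary is a hypothetical helper and the data are invented. Quartile conventions differ slightly between libraries, so the "inclusive" method used here is an assumption and a given plotting library may produce slightly different values.

```python
import statistics

def box_plot_summary(values, whisker_factor=1.5):
    """Median, quartiles, and whisker ends using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                          # height of the box
    low_limit = q1 - whisker_factor * iqr  # furthest a whisker may reach
    high_limit = q3 + whisker_factor * iqr
    inside = [v for v in data if low_limit <= v <= high_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # whiskers stop at the last point within the limits
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_limit or v > high_limit],
    }

# Invented data with one unusually large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in summary.items():
    print(f"{name:<13}{value}")
```

Here the box runs from 3.25 to 7.75, so the upper whisker may reach at most 7.75 + 1.5 × 4.5 = 14.5. It stops at 9, and the value 30 lies beyond the limit.
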
And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers. You can change it so that the whisker goes all the way up to the max. Either way, it lets you quickly see and compare, between different groups, the median, the first and third quartiles, and the min and max of the data. It's very useful for comparing observations of a variable when you've grouped by some categorical variable. Functions for doing this are boxplot from both Seaborn and Matplotlib, and geom_boxplot from plotnine.

A few more plots. A violin plot is like a box plot, except that it has curved sides that show the estimated density of the values. The swarm plot is another kind of categorical scatterplot.

It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily. Don't try to go make a 3D chart; they're almost always more confusing, especially the 3D bars that you get from vintage PowerPoint. But that goes even for a flat pie chart, just because human perception is not super great at accurately comparing angular areas.
So usually a bar chart or a stacked bar chart is going to be a better option than a pie chart. A donut chart is sometimes a better option where you've got to have a circle. This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions. I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you want to show relative proportions within different categories.

There's another kind of plot that's not a plot on its own, but is combined with other kinds of plots. That's a rug plot, which is useful for displaying distributions at a margin.

To learn more: I've taken a whirlwind tour through a number of different plot types. The class reading, the paper that I assigned you, talks through the use cases for a number of different plot types. I'm going to be providing tutorial notebooks that walk you through different plot types. The textbook talks about graph plotting and data visualization. The Seaborn and Matplotlib docs are extensive.
If you're using another plotting library, look at its documentation as well. Most plotting libraries also have a gallery. Go through the gallery and look for a plot that has a feature you want in your plot, or that you think might be useful for displaying your data. Click on it, and they'll give you the code to show you how they made that plot. You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library and figure out how to make it show you the data in the way you really want it to.

Learning one plotting library really deeply is useful. A lot of the Python ones, especially the ones that are oriented towards static charts, are built on top of Matplotlib. Seaborn is a convenience API on top of Matplotlib. If you're using Seaborn, you're also going to need to use Matplotlib calls a lot of the time, when Seaborn gets you 90 percent of the way there but not quite all the way.

So to wrap up, there are many different types of charts that have different use cases. Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing.
Take some of the examples from, say, the Seaborn gallery. Play with them: play with them with some data that I'm giving you, and play with them with some data that you have elsewhere. It takes time and practice, so spend some time with the galleries of the plotting libraries you're using.
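
As a closing illustration of the point that Seaborn sits on top of Matplotlib, here is a minimal sketch that is not taken from the lecture. One Seaborn call draws the chart and ordinary Matplotlib calls finish it. The genre data, the labels, and the color name are all invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented data: one row per movie.
movies = pd.DataFrame(
    {"genre": ["Drama", "Comedy", "Drama", "Horror", "Drama", "Comedy"]}
)

fig, ax = plt.subplots(figsize=(5, 3))

# Seaborn counts the rows per genre and draws the bars onto the Matplotlib axes.
# Passing a single color keeps every bar the same color.
sns.countplot(data=movies, x="genre", color="steelblue", ax=ax)

# Matplotlib calls handle the remaining details.
ax.set_ylim(bottom=0)
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
ax.set_title("Movies per genre (made-up data)")
fig.tight_layout()
```

Because the Seaborn function draws onto an ordinary Matplotlib axes object, anything Matplotlib can do to an axes (limits, labels, titles, extra lines) remains available afterwards.
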