1 00:00:04,510 --> 00:00:05,680 Welcome back. In this video, 2 00:00:05,680 --> 00:00:10,480 I'm going to walk you through some of the different types of charts that we're going to be learning how to create. The learning outcomes are to 3 00:00:10,480 --> 00:00:16,720 be able to identify the appropriate type of chart for data and a question, and to understand key rules to avoid common errors. 4 00:00:16,720 --> 00:00:20,880 I'm not going to be showing the detailed code for these chart types in the video. 5 00:00:20,880 --> 00:00:24,640 You're going to be able to find that in the documentation linked from here. And also, 6 00:00:24,640 --> 00:00:28,600 I'm going to be preparing a notebook that demonstrates various of these chart 7 00:00:28,600 --> 00:00:34,330 types with the actual code to create them using the software we're discussing. 8 00:00:34,330 --> 00:00:44,470 So common software for this are seaborn and matplotlib; those are going to be the primary ones that we're working with this semester. 9 00:00:44,470 --> 00:00:49,330 When I'm showing the function names, seaborn is commonly imported as sns, 10 00:00:49,330 --> 00:00:52,870 so a function starting with sns is going to be a seaborn function, and a function starting with plt 11 00:00:52,870 --> 00:00:59,860 is going to be a matplotlib function. I'm also showing the function you can use in plotnine or R's ggplot2, 12 00:00:59,860 --> 00:01:04,600 if you want to use those instead. I often use plotnine for a lot of my graphics. 13 00:01:04,600 --> 00:01:10,060 That's just for reference though. We're not going to be getting into much detail on plotnine in this course. 14 00:01:10,060 --> 00:01:16,130 So there's a variety of different types of charts. Some of them are showing relative proportions. 15 00:01:16,130 --> 00:01:23,860 Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space.
16 00:01:23,860 --> 00:01:30,070 A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable. 17 00:01:30,070 --> 00:01:31,780 Sometimes they're grouped by a numeric variable as well. 18 00:01:31,780 --> 00:01:37,600 But usually our x axis is a categorical variable of some kind, best with a moderate number of categories. 19 00:01:37,600 --> 00:01:41,950 We can use a second categorical variable to, say, color the bars. 20 00:01:41,950 --> 00:01:50,170 So this chart shows the survival rates of Titanic passengers, where the x axis is the passenger class: first, second, or third class. 21 00:01:50,170 --> 00:01:55,660 And then the bars are colored based on the gender of the passenger. 22 00:01:55,660 --> 00:01:58,000 And so we can see the different survival rates. 23 00:01:58,000 --> 00:02:06,250 The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables. 24 00:02:06,250 --> 00:02:12,970 Sometimes these will be horizontal. So in the horizontal bar chart, the categorical is on the y and the bars run horizontally. 25 00:02:12,970 --> 00:02:17,080 This also shows some whiskers that come from a confidence interval. 26 00:02:17,080 --> 00:02:24,760 It's very easy to generate a default, relatively good confidence interval with seaborn. So to plot 27 00:02:24,760 --> 00:02:30,730 these, seaborn has the countplot function, which does a quick, 28 00:02:30,730 --> 00:02:36,970 basically categorical histogram: how many observations are in each category. 29 00:02:36,970 --> 00:02:41,830 The catplot function, with kind set to bar, will plot by default a mean value for each category. 30 00:02:41,830 --> 00:02:47,860 And if you have it do the mean plotting, it will also compute 95 percent confidence intervals. 31 00:02:47,860 --> 00:02:53,350 That's what's being shown in this plot here.
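A minimal sketch of the two seaborn bar chart styles described here, using the usual sns and plt import names. The passenger table is hypothetical data made up for illustration (it is not the real Titanic dataset), and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical passenger records, invented for illustration only.
passengers = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "sex": ["female", "male"] * 6,
    "survived": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(9, 4))

# countplot: a categorical histogram, one bar per class showing the number of rows.
# A single explicit color is used because color encodes nothing in this panel.
sns.countplot(data=passengers, x="pclass", color="steelblue", ax=ax_count)

# barplot: plots the mean of y within each group by default and draws
# confidence-interval whiskers; hue adds a second categorical variable as bar color.
sns.barplot(data=passengers, x="pclass", y="survived", hue="sex", ax=ax_mean)
ax_mean.set_ylabel("survival rate")

# The same group means computed directly, to check what the bars represent.
survival_rate = passengers.groupby(["pclass", "sex"])["survived"].mean()
print(survival_rate)
```

The barplot call is the axes-level equivalent of catplot with kind="bar"; it is used here only so both charts can share one figure.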
32 00:02:53,350 --> 00:03:00,050 And then you can also use the plt.bar function or the plotnine geom_bar. 33 00:03:00,050 --> 00:03:04,340 So a few rules about bar charts. First is never start the y axis on a bar chart at 34 00:03:04,340 --> 00:03:13,320 anything but zero. And so the reason for this we can see here. 35 00:03:13,320 --> 00:03:17,270 So the top one. These are looking at the mean average ratings. 36 00:03:17,270 --> 00:03:22,770 We take each movie's mean rating, and then we compute the mean of the average ratings within a genre. 37 00:03:22,770 --> 00:03:33,570 So if we look here, the difference between Horror and IMAX, it's a notable difference, but it's a difference of about 0.5 or so. 38 00:03:33,570 --> 00:03:40,080 The difference between Sci-Fi and Short is a difference of a little under one, probably. 39 00:03:40,080 --> 00:03:46,170 But when we start the y axis at 2.5 instead of zero, 40 00:03:46,170 --> 00:03:51,840 what happens is the differences look much larger than they are, because with the human eye, naturally, 41 00:03:51,840 --> 00:04:01,050 we not only want to see the difference, but it's very natural for us to compare the difference to the bar length, because these are bars. 42 00:04:01,050 --> 00:04:06,620 They have length, they have an area; since they're all the same width, the length is proportional to the area. 43 00:04:06,620 --> 00:04:12,660 Breaking length-area proportionality is a good way to confuse your readers. 44 00:04:12,660 --> 00:04:20,910 It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't. 45 00:04:20,910 --> 00:04:25,770 It's really a shift from about 2.8 to 3.3 or 3.4. 46 00:04:25,770 --> 00:04:33,090 And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are.
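The distortion described here can be put into numbers. The sketch below uses the approximate ratings quoted above (2.8 and 3.3, with the axis cut at 2.5) as illustrative values, not exact figures from the chart, and compares the true ratio of the two values with the ratio of the visible bar lengths on a truncated axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Approximate mean ratings quoted in the lecture (illustrative values).
ratings = {"Horror": 2.8, "IMAX": 3.3}
truncated_baseline = 2.5

# What the data says: IMAX is only about 1.18 times Horror.
true_ratio = ratings["IMAX"] / ratings["Horror"]

# What the eye sees when the axis starts at 2.5: the ratio of visible bar lengths.
visible_ratio = (ratings["IMAX"] - truncated_baseline) / (
    ratings["Horror"] - truncated_baseline
)

fig, (ax_zero, ax_cut) = plt.subplots(1, 2, figsize=(8, 3.5))
panels = [
    (ax_zero, 0, "y axis starts at 0"),
    (ax_cut, truncated_baseline, "y axis starts at 2.5 (misleading)"),
]
for ax, y_min, title in panels:
    ax.bar(list(ratings.keys()), list(ratings.values()), color="steelblue")
    ax.set_ylim(y_min, 4)
    ax.set_title(title)
    ax.set_ylabel("mean rating")

print(f"true ratio: {true_ratio:.2f}, visible ratio when truncated: {visible_ratio:.2f}")
```

With these numbers the truncated chart makes one bar look well over twice as long as the other, even though the underlying values differ by less than 20 percent.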
47 00:04:33,090 --> 00:04:39,210 So when I talked about integrity and avoiding deception, when I was introducing statistical graphics, 48 00:04:39,210 --> 00:04:45,750 this is what I was talking about. The difference is there; it's just not as big as it looks like it is 49 00:04:45,750 --> 00:04:47,460 when we truncate our bar charts. 50 00:04:47,460 --> 00:04:55,050 So the general rule here, to generalize beyond bar charts, is if something has a length that varies based on the data, 51 00:04:55,050 --> 00:05:04,470 that length needs to actually represent the value, not the value minus something because you started the axis somewhere else. 52 00:05:04,470 --> 00:05:06,960 Also, if you're including whiskers, like I did in the previous chart, 53 00:05:06,960 --> 00:05:14,040 define how they're computed. And also, one thing to just be careful of with seaborn's catplot and countplot: 54 00:05:14,040 --> 00:05:21,270 if you aren't using the color for a second variable, they will just make every bar a different color for no particular reason, 55 00:05:21,270 --> 00:05:25,010 which creates something that's different when it doesn't need to be. So 56 00:05:25,010 --> 00:05:29,520 it causes the reader to look for a difference that isn't actually there. Best avoided. 57 00:05:29,520 --> 00:05:32,640 You can fix that by just specifying the color. 58 00:05:32,640 --> 00:05:40,050 We saw histograms last week, and a histogram is a bar chart, but the categorical is bins or ranges of a numerical value. 59 00:05:40,050 --> 00:05:46,800 Also, though, if we have a bar chart that's showing the relative frequency of categorical variables, that can also be called a histogram. 60 00:05:46,800 --> 00:05:50,730 The y axis is either the number or the fraction of occurrences in this case. 61 00:05:50,730 --> 00:05:58,170 So we can see that. The key thing, though, is the different heights of the bars let us see visually the relative frequency of different values.
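As a rough sketch of the histogram idea, the example below bins a small made-up set of values and shows both choices for the y axis: the number of observations per bin and the fraction per bin. The data and bin edges are assumptions chosen for illustration, and every bar is given the same explicit color since color carries no meaning here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up numeric observations and bin edges (illustrative only).
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])
bin_edges = [0, 2, 4, 6, 8, 10]

# Count how many observations fall in each bin, then convert to fractions.
counts, _ = np.histogram(values, bins=bin_edges)
fractions = counts / counts.sum()

fig, (ax_count, ax_frac) = plt.subplots(1, 2, figsize=(8, 3.5))

# Left: y axis is the number of observations in each bin.
ax_count.hist(values, bins=bin_edges, color="steelblue", edgecolor="white")
ax_count.set_ylabel("number of observations")

# Right: weighting each observation by 1/n makes the bar heights sum to 1,
# so the y axis is the fraction of observations in each bin.
weights = np.full(len(values), 1 / len(values))
ax_frac.hist(values, bins=bin_edges, weights=weights, color="steelblue", edgecolor="white")
ax_frac.set_ylabel("fraction of observations")

print(counts, fractions)
```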
62 00:05:58,170 --> 00:06:01,680 So it really makes it visually clear how the data is shaped. 63 00:06:01,680 --> 00:06:07,950 We can see skews and things like that. It's one way to graphically describe a distribution. 64 00:06:07,950 --> 00:06:12,120 A scatterplot shows two numeric variables. So each observation is a dot. 65 00:06:12,120 --> 00:06:16,800 Each observation has two numeric variables, and we put the one variable on the x axis, 66 00:06:16,800 --> 00:06:22,290 the other variable on the y axis, and put the dot where its variable values would intersect. 67 00:06:22,290 --> 00:06:26,700 This is really useful for seeing how two variables relate. Does one increase with the other? 68 00:06:26,700 --> 00:06:30,840 Do points clump or cluster in an interesting way? Are there other interesting patterns? 69 00:06:30,840 --> 00:06:43,050 It helps us find outliers. So this scatterplot is showing the tip versus the total bill for a bunch of restaurant bills. 70 00:06:43,050 --> 00:06:45,540 And each observation is a bill. 71 00:06:45,540 --> 00:06:55,380 And then the x axis is the total bill, and the y axis is the tip that the customer added to the bill. 72 00:06:55,380 --> 00:07:02,790 And there are a couple of refinements we can do here. We can color or change the point type by a categorical variable. 73 00:07:02,790 --> 00:07:08,140 So on this one, we've changed it so that the points are different colors. 74 00:07:08,140 --> 00:07:13,310 So dinners are blue circles and the lunches are orange Xs. 75 00:07:13,310 --> 00:07:19,130 We could also add a trend line or some other kind of a line to show some context. For example, on this chart, 76 00:07:19,130 --> 00:07:27,170 we might want to plot a line that shows the 20 percent point, and that would let us easily see where we're going over 20 percent, 77 00:07:27,170 --> 00:07:33,650 how the tips are distributed relative to a 20 percent mark.
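The scatterplot refinements above can be sketched as follows. The bills table is a small hypothetical stand-in for the restaurant data (the values and column names are assumptions), with the meal time mapped to both color and marker, and a dashed reference line drawn at a 20 percent tip.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical restaurant bills, invented for illustration.
bills = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 15.0, 25.0],
    "tip": [1.5, 5.0, 4.5, 10.0, 3.5, 4.0],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
})

fig, ax = plt.subplots(figsize=(5, 4))

# One dot per bill; hue changes the color and style changes the marker by meal time.
sns.scatterplot(data=bills, x="total_bill", y="tip", hue="time", style="time", ax=ax)

# Reference line: the tip that would equal 20 percent of the bill.
max_bill = bills["total_bill"].max()
ax.plot([0, max_bill], [0, 0.20 * max_bill], color="gray", linestyle="--")
ax.text(max_bill, 0.20 * max_bill, "20% tip", ha="right", va="bottom")

# Which bills sit above the 20 percent line?
above_20 = bills["tip"] > 0.20 * bills["total_bill"]
print(int(above_20.sum()), "of", len(bills), "bills tipped more than 20 percent")
```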
78 00:07:33,650 --> 00:07:41,170 X can also be a categorical variable. When that happens, we call this a point plot or a strip plot. 79 00:07:41,170 --> 00:07:47,020 Functions for doing this are scatter, scatterplot, and then plotnine's geom_point. 80 00:07:47,020 --> 00:07:50,800 The seaborn documentation has some examples of more of these. A line plot: 81 00:07:50,800 --> 00:07:59,350 it's like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next 82 00:07:59,350 --> 00:08:04,930 by connecting them with a line. It really works best when we have one y per x value that we want to plot. 83 00:08:04,930 --> 00:08:11,260 If we've got more than one, it really starts getting very, very jagged. It's very common for time series. 84 00:08:11,260 --> 00:08:15,430 So this is another example from the seaborn tutorial, not labeled super well. 85 00:08:15,430 --> 00:08:22,510 I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time and it's going negative. 86 00:08:22,510 --> 00:08:29,710 That is, zero on the y axis is at the top and the values otherwise are negative. Functions to create 87 00:08:29,710 --> 00:08:36,370 these are lineplot from seaborn, plot from matplotlib, and geom_line from plotnine. 88 00:08:36,370 --> 00:08:41,380 A box plot shows the distribution of a numeric variable grouped by a categorical. 89 00:08:41,380 --> 00:08:46,630 So the bar chart just showed us, say, the average value, maybe with a confidence interval. 90 00:08:46,630 --> 00:08:51,820 The box plot actually shows us the distribution, and it does so in a way that's based on the median. 91 00:08:51,820 --> 00:09:01,050 So the horizontal line in the middle of the box is the median value; the top and bottom of the box
92 00:09:01,050 --> 00:09:06,750 are the first and third quartiles: the bottom is the first quartile and the top is the third quartile. 93 00:09:06,750 --> 00:09:12,630 And what that means is 25 percent of the values are below the bottom of the box, 94 00:09:12,630 --> 00:09:17,700 25 percent in the bottom half, 25 percent here, and then 25 percent above. 95 00:09:17,700 --> 00:09:21,390 We then show these whiskers that extend out to the minimum 96 00:09:21,390 --> 00:09:27,390 and maximum of the data, and a number of plotting packages will do some kind of an outlier detection. 97 00:09:27,390 --> 00:09:37,470 This is using seaborn's default outlier detection. So if the max is very high, the rule it uses by default is: 98 00:09:37,470 --> 00:09:44,460 you've got the IQR, the interquartile range. That's the height of the box. It allows the whisker to be 1.5 times that tall. 99 00:09:44,460 --> 00:09:50,880 And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers. 100 00:09:50,880 --> 00:09:57,930 You can change it so that the whisker goes all the way up to the max, but it lets you quickly see and compare between different groups 101 00:09:57,930 --> 00:10:03,210 the median, the first and third quartiles, and the min and the max of the data. 102 00:10:03,210 --> 00:10:11,520 Very useful for comparing observations of a variable when you're grouping by some categorical. Functions 103 00:10:11,520 --> 00:10:19,680 for doing this are boxplot from both seaborn and matplotlib, and then geom_boxplot from plotnine. 104 00:10:19,680 --> 00:10:25,710 A few more plots. A violin plot: it's like a box plot, except it's based around a density estimate and has curved sides. 105 00:10:25,710 --> 00:10:30,120 The swarm plot is another kind of a categorical scatterplot.
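The 1.5 times IQR whisker rule can be worked through by hand. The sketch below uses a made-up sample with one extreme value, computes the quartiles and whisker limits with NumPy, and then checks the result against the outlier points matplotlib's boxplot draws with its default settings. The sample values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample with one very large value (illustrative only).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Whiskers may reach at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker stops at the most extreme data point still inside its fence;
# anything beyond the fences is drawn as an individual outlier point.
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

fig, ax = plt.subplots(figsize=(3, 4))
# Default whis=1.5 applies the same rule; passing whis=(0, 100) would instead
# extend the whiskers all the way to the minimum and maximum.
box = ax.boxplot(values)
flier_values = box["fliers"][0].get_ydata()

print(q1, median, q3, whisker_low, whisker_high, outliers)
```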
106 00:10:30,120 --> 00:10:38,860 It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily. 107 00:10:38,860 --> 00:10:46,620 Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint. 108 00:10:46,620 --> 00:10:57,750 But even a pie chart is a problem, just because human perception is not super great at accurately comparing angular areas. 109 00:10:57,750 --> 00:11:00,120 So usually a bar chart or 110 00:11:00,120 --> 00:11:07,680 stacked bar chart is going to be a better option than a pie chart. A donut chart is sometimes a better option where you've got a circle. 111 00:11:07,680 --> 00:11:14,250 This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions. 112 00:11:14,250 --> 00:11:19,170 I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you 113 00:11:19,170 --> 00:11:25,230 want to show relative proportions within different categories. 114 00:11:25,230 --> 00:11:29,670 There's another kind of plot that's not a plot on its own, but it's combined with other kinds of plots. 115 00:11:29,670 --> 00:11:34,830 That's a rug plot, useful for just displaying distributions at a margin. 116 00:11:34,830 --> 00:11:40,500 So to learn more: I've taken a whirlwind tour through a number of different plot types. There are the class readings. 117 00:11:40,500 --> 00:11:46,170 So the paper that I assigned you to read talks through the use cases for a number of different plot types. 118 00:11:46,170 --> 00:11:50,160 I'm going to be providing tutorial notebooks that walk you through different plot types. 119 00:11:50,160 --> 00:11:54,660 The textbook talks about graph plotting and data visualization.
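One way to sketch the stacked bar chart suggested above as an alternative to pie charts: convert counts to proportions within each category, then stack the bars so every category sums to 1. The group names, answers, and counts are hypothetical, and plain matplotlib is used here only to keep the stacking explicit.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical counts of two answers within three groups (illustrative only).
categories = ["Group A", "Group B", "Group C"]
counts = {
    "yes": np.array([30, 10, 25]),
    "no": np.array([10, 30, 25]),
}

# Convert counts to proportions within each group.
totals = sum(counts.values())
proportions = {answer: c / totals for answer, c in counts.items()}

fig, ax = plt.subplots(figsize=(5, 4))
bottom = np.zeros(len(categories))
for answer, share in proportions.items():
    # Each segment starts where the previous one ended, so bar length still
    # represents the value and every full bar reaches exactly 1.
    ax.bar(categories, share, bottom=bottom, label=answer)
    bottom = bottom + share

ax.set_ylim(0, 1)
ax.set_ylabel("proportion within group")
ax.legend(title="answer")
```

Lengths along a common axis are easier to compare across groups than the angles of several separate pies, which is the reason given above for preferring this form.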
120 00:11:54,660 --> 00:11:58,190 The seaborn and matplotlib docs are extensive. And 121 00:11:58,190 --> 00:12:04,230 if you're using another plotting library, its documentation as well. Most plotting libraries also have a gallery. 122 00:12:04,230 --> 00:12:10,740 Go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data. 123 00:12:10,740 --> 00:12:14,580 Click on it and they'll give you the code to show you how they made that plot. 124 00:12:14,580 --> 00:12:22,650 You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library 125 00:12:22,650 --> 00:12:27,630 and figure out how to make it show you the data in the way you really want it to. 126 00:12:27,630 --> 00:12:32,730 Learning one plotting library really deeply is useful. A lot of the Python ones, 127 00:12:32,730 --> 00:12:37,350 especially the ones that are oriented towards static charts, are built on top of matplotlib. 128 00:12:37,350 --> 00:12:41,910 So seaborn is a convenience API on top of matplotlib. If you're using seaborn, 129 00:12:41,910 --> 00:12:49,500 you're also going to need to use matplotlib calls a lot of the time, when seaborn gets you 90 percent of the way there 130 00:12:49,500 --> 00:12:55,470 but not quite all the way. So to wrap up, there are many different types of charts that have different use cases. 131 00:12:55,470 --> 00:13:01,020 Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing. 132 00:13:01,020 --> 00:13:05,730 Take some of the examples from, say, the seaborn gallery. 133 00:13:05,730 --> 00:13:11,670 Play with them, play with them with some data that I'm giving you, play with them with some data that you have elsewhere.
134 00:13:11,670 --> 00:13:30,957 But it takes time and practice, so spend some time with the galleries of the plotting libraries you're using.