9:59:59.000,9:59:59.000 1[br]00:00:04,510 --> 00:00:05,680[br]Welcome back. In this video, 9:59:59.000,9:59:59.000 2[br]00:00:05,680 --> 00:00:10,480[br]I'm going to walk you through some of the different types of charts that we're going to be learning how to create. The learning outcomes are to 9:59:59.000,9:59:59.000 3[br]00:00:10,480 --> 00:00:16,720[br]be able to identify the appropriate type of chart for data and a question, and understand key rules to avoid common errors. 9:59:59.000,9:59:59.000 4[br]00:00:16,720 --> 00:00:20,880[br]I'm not going to be showing the detailed code for these chart types in the video. 9:59:59.000,9:59:59.000 5[br]00:00:20,880 --> 00:00:24,640[br]You're going to be able to find that in the documentation linked from here. And also, 9:59:59.000,9:59:59.000 6[br]00:00:24,640 --> 00:00:28,600[br]I'm going to be preparing a notebook that demonstrates a number of these chart 9:59:59.000,9:59:59.000 7[br]00:00:28,600 --> 00:00:34,330[br]types with the actual code to create them using the software we're discussing. 9:59:59.000,9:59:59.000 8[br]00:00:34,330 --> 00:00:44,470[br]So common software for this is seaborn and matplotlib; those are going to be the primary ones that we're working with this semester. 9:59:59.000,9:59:59.000 9[br]00:00:44,470 --> 00:00:49,330[br]When I'm showing the function names, seaborn is commonly imported as sns. 9:59:59.000,9:59:59.000 10[br]00:00:49,330 --> 00:00:52,870[br]So an sns.-prefixed function is going to be a seaborn function, a plt.-prefixed 9:59:59.000,9:59:59.000 11[br]00:00:52,870 --> 00:00:59,860[br]function is going to be a matplotlib function, and I'm also showing the function you can use in plotnine or R's ggplot2, 9:59:59.000,9:59:59.000 12[br]00:00:59,860 --> 00:01:04,600[br]if you want to use those instead. I often use plotnine for a lot of my graphics. 9:59:59.000,9:59:59.000 13[br]00:01:04,600 --> 00:01:10,060[br]That's just for reference, though. 
We're not going to be getting into much detail on plotnine in this course. 9:59:59.000,9:59:59.000 14[br]00:01:10,060 --> 00:01:16,130[br]So there's a variety of different types of charts. Some of them are showing relative proportions. 9:59:59.000,9:59:59.000 15[br]00:01:16,130 --> 00:01:23,860[br]Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space. 9:59:59.000,9:59:59.000 16[br]00:01:23,860 --> 00:01:30,070[br]A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable. 9:59:59.000,9:59:59.000 17[br]00:01:30,070 --> 00:01:31,780[br]Sometimes they're grouped by a numeric variable as well. 9:59:59.000,9:59:59.000 18[br]00:01:31,780 --> 00:01:37,600[br]But usually our x axis is a categorical variable of some kind. It works best with a moderate number of categories. 9:59:59.000,9:59:59.000 19[br]00:01:37,600 --> 00:01:41,950[br]We can use a second categorical variable to, say, color the bars. 9:59:59.000,9:59:59.000 20[br]00:01:41,950 --> 00:01:50,170[br]So this chart shows the survival rates of Titanic passengers, where the x axis is the passenger class: first, second, or third class. 9:59:59.000,9:59:59.000 21[br]00:01:50,170 --> 00:01:55,660[br]And then the bars are colored based on the gender of the passenger. 9:59:59.000,9:59:59.000 22[br]00:01:55,660 --> 00:01:58,000[br]And so we can see the different survival rates. 9:59:59.000,9:59:59.000 23[br]00:01:58,000 --> 00:02:06,250[br]The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables. 9:59:59.000,9:59:59.000 24[br]00:02:06,250 --> 00:02:12,970[br]Sometimes these will be horizontal. So in a horizontal bar chart, the categorical variable is on the y axis and the bars run horizontally. 9:59:59.000,9:59:59.000 25[br]00:02:12,970 --> 00:02:17,080[br]This also shows some whiskers that come from a confidence interval. 
9:59:59.000,9:59:59.000 26[br]00:02:17,080 --> 00:02:24,760[br]It's very easy to generate a default, relatively good confidence interval with seaborn. So to plot 9:59:59.000,9:59:59.000 27[br]00:02:24,760 --> 00:02:30,730[br]these, seaborn has the countplot function, which does a quick, 9:59:59.000,9:59:59.000 28[br]00:02:30,730 --> 00:02:36,970[br]basically categorical histogram: how many observations are in each category. 9:59:59.000,9:59:59.000 29[br]00:02:36,970 --> 00:02:41,830[br]The catplot function, with kind="bar", will plot by default a mean value for each category. 9:59:59.000,9:59:59.000 30[br]00:02:41,830 --> 00:02:47,860[br]And if you have it do the mean plotting, it will also compute 95 percent confidence intervals. 9:59:59.000,9:59:59.000 31[br]00:02:47,860 --> 00:02:53,350[br]That's what's being shown in this plot here. 9:59:59.000,9:59:59.000 32[br]00:02:53,350 --> 00:03:00,050[br]And then you can also use the matplotlib bar function or the plotnine geom_bar. 9:59:59.000,9:59:59.000 33[br]00:03:00,050 --> 00:03:04,340[br]So a few rules about bar charts. The first is: never start the y axis on a bar chart at 9:59:59.000,9:59:59.000 34[br]00:03:04,340 --> 00:03:13,320[br]anything but zero. And the reason for this, we can see here. 9:59:59.000,9:59:59.000 35[br]00:03:13,320 --> 00:03:17,270[br]So the top one. So these are looking at the mean average ratings. 9:59:59.000,9:59:59.000 36[br]00:03:17,270 --> 00:03:22,770[br]We take each movie's mean rating, and then we compute the mean of the average ratings within a genre. 9:59:59.000,9:59:59.000 37[br]00:03:22,770 --> 00:03:33,570[br]So if we look here, the difference between horror and IMAX, it's a notable difference, but it's a difference of about 0.5 or so. 9:59:59.000,9:59:59.000 38[br]00:03:33,570 --> 00:03:40,080[br]The difference between sci-fi and short is a difference of a little under one, probably. 
9:59:59.000,9:59:59.000 39[br]00:03:40,080 --> 00:03:46,170[br]But when we start the y axis at 2.5 instead of zero, 9:59:59.000,9:59:59.000 40[br]00:03:46,170 --> 00:03:51,840[br]what happens is the differences look much larger than they are, because with the human eye, naturally, 9:59:59.000,9:59:59.000 41[br]00:03:51,840 --> 00:04:01,050[br]we not only want to see the difference, but it's very natural for us to compare the difference to the bar length, because these are bars. 9:59:59.000,9:59:59.000 42[br]00:04:01,050 --> 00:04:06,620[br]They have length, they have an area; since they're all the same width, the length is proportional to the area. 9:59:59.000,9:59:59.000 43[br]00:04:06,620 --> 00:04:12,660[br]Breaking length-area proportionality is a good way to confuse your readers. 9:59:59.000,9:59:59.000 44[br]00:04:12,660 --> 00:04:20,910[br]It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't. 9:59:59.000,9:59:59.000 45[br]00:04:20,910 --> 00:04:25,770[br]It's really a shift from about 2.8 to 3.3 or 3.4. 9:59:59.000,9:59:59.000 46[br]00:04:25,770 --> 00:04:33,090[br]And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are. 9:59:59.000,9:59:59.000 47[br]00:04:33,090 --> 00:04:39,210[br]So when I talked about integrity and avoiding deception, when I was introducing statistical graphics, 9:59:59.000,9:59:59.000 48[br]00:04:39,210 --> 00:04:45,750[br]this is what I was talking about. The difference is there; it's just not as big as it looks like it is 9:59:59.000,9:59:59.000 49[br]00:04:45,750 --> 00:04:47,460[br]when we truncate our bar charts. 
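To put rough numbers on the truncated-axis distortion just described, here is a minimal sketch. It assumes the approximate values mentioned in the video (about 2.8 for horror, about 3.4 for IMAX) rather than the real data, and `bar_height_ratio` is a hypothetical helper name used only for illustration. It compares how many times longer one bar is drawn than the other when the axis starts at zero and when it starts at 2.5.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn lengths of two bars (b over a) when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


# Approximate values quoted in the lecture, not exact data.
horror, imax = 2.8, 3.4

honest = bar_height_ratio(horror, imax)                      # y axis starts at 0
truncated = bar_height_ratio(horror, imax, baseline=2.5)     # y axis starts at 2.5

print(f"axis from 0:   IMAX bar is {honest:.2f}x the horror bar")
print(f"axis from 2.5: IMAX bar is {truncated:.2f}x the horror bar")
```

The underlying values are the same in both cases; only the drawn bar lengths change, which is why the truncated chart overstates the gap.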
9:59:59.000,9:59:59.000 50[br]00:04:47,460 --> 00:04:55,050[br]So the general rule here, to generalize beyond bar charts, is: if something has a length that varies based on the data, 9:59:59.000,9:59:59.000 51[br]00:04:55,050 --> 00:05:04,470[br]that length needs to actually represent the value, not the value minus something because you started the axis somewhere else. 9:59:59.000,9:59:59.000 52[br]00:05:04,470 --> 00:05:06,960[br]So if you're including whiskers, like I did in the previous chart, 9:59:59.000,9:59:59.000 53[br]00:05:06,960 --> 00:05:14,040[br]define how they're computed. And also, one thing to just be careful of with seaborn's catplot and countplot: 9:59:59.000,9:59:59.000 54[br]00:05:14,040 --> 00:05:21,270[br]if you aren't using the color for a second variable, they will just make every bar a different color for no particular reason, 9:59:59.000,9:59:59.000 55[br]00:05:21,270 --> 00:05:25,010[br]which creates something that's different when it doesn't need to be. So 9:59:59.000,9:59:59.000 56[br]00:05:25,010 --> 00:05:29,520[br]it causes the reader to look for a difference that isn't actually there. Best avoided. 9:59:59.000,9:59:59.000 57[br]00:05:29,520 --> 00:05:32,640[br]You can fix that by just specifying the color. 9:59:59.000,9:59:59.000 58[br]00:05:32,640 --> 00:05:40,050[br]We saw histograms last week. A histogram is a bar chart, but the categorical variable is bins, or ranges, of a numerical value. 9:59:59.000,9:59:59.000 59[br]00:05:40,050 --> 00:05:46,800[br]Also, though, if we have a bar chart that's showing the relative frequency of the values of a categorical variable, that can also be called a histogram. 9:59:59.000,9:59:59.000 60[br]00:05:46,800 --> 00:05:50,730[br]The y axis is either the number or the fraction of occurrences in this case. 9:59:59.000,9:59:59.000 61[br]00:05:50,730 --> 00:05:58,170[br]The key thing, though, is that the different heights of the bars let us see, visually, the relative frequency of different values. 
9:59:59.000,9:59:59.000 62[br]00:05:58,170 --> 00:06:01,680[br]So it really makes it visually clear how the data is shaped. 9:59:59.000,9:59:59.000 63[br]00:06:01,680 --> 00:06:07,950[br]We can see skew and things like that. It's one way to graphically describe a distribution. 9:59:59.000,9:59:59.000 64[br]00:06:07,950 --> 00:06:12,120[br]A scatterplot shows two numeric variables. So each observation is a dot. 9:59:59.000,9:59:59.000 65[br]00:06:12,120 --> 00:06:16,800[br]Each observation has two numeric variables. And we put one variable on the x axis, 9:59:59.000,9:59:59.000 66[br]00:06:16,800 --> 00:06:22,290[br]the other variable on the y axis, and put the dot where its variable values would intersect. 9:59:59.000,9:59:59.000 67[br]00:06:22,290 --> 00:06:26,700[br]This is really useful for seeing how two variables relate. Does one increase with the other? 9:59:59.000,9:59:59.000 68[br]00:06:26,700 --> 00:06:30,840[br]Do points clump or cluster in an interesting way? Other interesting patterns. 9:59:59.000,9:59:59.000 69[br]00:06:30,840 --> 00:06:43,050[br]It helps us find outliers. So this scatterplot is showing the tip versus the total bill for a bunch of restaurant bills. 9:59:59.000,9:59:59.000 70[br]00:06:43,050 --> 00:06:45,540[br]And each observation is a bill. 9:59:59.000,9:59:59.000 71[br]00:06:45,540 --> 00:06:55,380[br]And then the x axis is the total bill, and the y axis is the tip that the customer added to the bill. 9:59:59.000,9:59:59.000 72[br]00:06:55,380 --> 00:07:02,790[br]A couple of refinements we can do here: we can color or change the point type by a categorical variable. 9:59:59.000,9:59:59.000 73[br]00:07:02,790 --> 00:07:08,140[br]So on this one, we've changed it so that the points are a different color, 9:59:59.000,9:59:59.000 74[br]00:07:08,140 --> 00:07:13,310[br]so the dinners are blue circles and the lunches are orange Xs. 
9:59:59.000,9:59:59.000 75[br]00:07:13,310 --> 00:07:19,130[br]We could also add a trend line or some other kind of a line to show some context. For example, on this chart, 9:59:59.000,9:59:59.000 76[br]00:07:19,130 --> 00:07:27,170[br]we might want to plot a line that shows the 20 percent point, and that lets us easily see where we're going over 20 percent, 9:59:59.000,9:59:59.000 77[br]00:07:27,170 --> 00:07:33,650[br]how the tips are distributed relative to a 20 percent mark. 9:59:59.000,9:59:59.000 78[br]00:07:33,650 --> 00:07:41,170[br]Also, x can be a categorical variable. When that happens, we call this a point plot or a strip plot. 9:59:59.000,9:59:59.000 79[br]00:07:41,170 --> 00:07:47,020[br]Functions for doing this are scatter, scatterplot, and then plotnine's geom_point. 9:59:59.000,9:59:59.000 80[br]00:07:47,020 --> 00:07:50,800[br]The seaborn documentation has some more examples of these. A line plot: 9:59:59.000,9:59:59.000 81[br]00:07:50,800 --> 00:07:59,350[br]it's like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next 9:59:59.000,9:59:59.000 82[br]00:07:59,350 --> 00:08:04,930[br]by connecting them with a line. It really works best when we have one y per x value that we want to plot. 9:59:59.000,9:59:59.000 83[br]00:08:04,930 --> 00:08:11,260[br]If we've got more than one, it really starts getting very, very jagged. It's very common for time series. 9:59:59.000,9:59:59.000 84[br]00:08:11,260 --> 00:08:15,430[br]So this is another example from the seaborn tutorial, not labeled super well. 9:59:59.000,9:59:59.000 85[br]00:08:15,430 --> 00:08:22,510[br]I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time and it's going negative. 9:59:59.000,9:59:59.000 86[br]00:08:22,510 --> 00:08:29,710[br]Note that zero on the y axis is at the top, and the values otherwise are negative. Functions to create 
9:59:59.000,9:59:59.000 87[br]00:08:29,710 --> 00:08:36,370[br]these are lineplot from seaborn, plot from matplotlib, and geom_line from plotnine. 9:59:59.000,9:59:59.000 88[br]00:08:36,370 --> 00:08:41,380[br]A box plot shows the distribution of a numeric variable grouped by a categorical. 9:59:59.000,9:59:59.000 89[br]00:08:41,380 --> 00:08:46,630[br]So the bar chart just showed us, say, the average value, maybe with a confidence interval. 9:59:59.000,9:59:59.000 90[br]00:08:46,630 --> 00:08:51,820[br]The box plot actually shows us the distribution and it does so in a way that's based on the median. 9:59:59.000,9:59:59.000 91[br]00:08:51,820 --> 00:09:01,050[br]So the horizontal line in the middle of the box is the median value. The top and bottom of the box 9:59:59.000,9:59:59.000 92[br]00:09:01,050 --> 00:09:06,750[br]are the first and third quartiles: the bottom is the first quartile and the top is the third quartile. 9:59:59.000,9:59:59.000 93[br]00:09:06,750 --> 00:09:12,630[br]And what that means is twenty-five percent of the values are below the bottom of the box, 9:59:59.000,9:59:59.000 94[br]00:09:12,630 --> 00:09:17,700[br]twenty-five percent are in the bottom half of the box, twenty-five percent here, and then twenty-five percent above. 9:59:59.000,9:59:59.000 95[br]00:09:17,700 --> 00:09:21,390[br]We then show these whiskers that extend out to the minimum 9:59:59.000,9:59:59.000 96[br]00:09:21,390 --> 00:09:27,390[br]and maximum of the data, and a number of plotting packages will do some kind of an outlier detection. 9:59:59.000,9:59:59.000 97[br]00:09:27,390 --> 00:09:37,470[br]This is using seaborn's default outlier detection, for when the max is very high. The rule it uses by default is this. 9:59:59.000,9:59:59.000 98[br]00:09:37,470 --> 00:09:44,460[br]So you've got the IQR, the interquartile range. That's the height of the box. It allows the whisker to be 1.5 times that tall. 
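The quartile and whisker rule just described can be worked through by hand. This is a minimal sketch using only the Python standard library, not the code seaborn or matplotlib actually runs: `box_plot_summary` is a hypothetical helper, the sample data is made up, and the "inclusive" quartile method is one of several conventions, so a plotting library may report slightly different quartiles. Each whisker is placed at the most extreme data point that still lies within 1.5 times the IQR of the box, and anything beyond that is reported separately.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    # One common quartile convention; libraries may interpolate differently.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),   # furthest point still within the fence
        "whisker_high": max(inside),
        "outliers": outliers,         # drawn as individual points
    }


# Made-up data; 30 is a deliberate outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```

Setting `whisker` to a very large number in this sketch makes the whiskers run all the way to the minimum and maximum, so no points are flagged.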
9:59:59.000,9:59:59.000 99[br]00:09:44,460 --> 00:09:50,880[br]And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers. 9:59:59.000,9:59:59.000 100[br]00:09:50,880 --> 00:09:57,930[br]You can change it so that the whisker goes all the way up to the max. But it lets you quickly see and compare between different groups 9:59:59.000,9:59:59.000 101[br]00:09:57,930 --> 00:10:03,210[br]the median, the first and third quartiles, and the min and the max of the data. 9:59:59.000,9:59:59.000 102[br]00:10:03,210 --> 00:10:11,520[br]Very useful for comparing observations of a variable when they're grouped by some categorical. Functions 9:59:59.000,9:59:59.000 103[br]00:10:11,520 --> 00:10:19,680[br]for doing this are boxplot from both seaborn and matplotlib, and then geom_boxplot from plotnine. 9:59:59.000,9:59:59.000 104[br]00:10:19,680 --> 00:10:25,710[br]A few more plots. A violin plot: it's like a box plot, except it's based around the mean and has curved sides. 9:59:59.000,9:59:59.000 105[br]00:10:25,710 --> 00:10:30,120[br]The swarm plot is another kind of categorical scatterplot. 9:59:59.000,9:59:59.000 106[br]00:10:30,120 --> 00:10:38,860[br]It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily. 9:59:59.000,9:59:59.000 107[br]00:10:38,860 --> 00:10:46,620[br]Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint. 9:59:59.000,9:59:59.000 108[br]00:10:46,620 --> 00:10:57,750[br]But even a pie chart, just because human perception is not super great at accurately comparing angular areas. 
9:59:59.000,9:59:59.000 109[br]00:10:57,750 --> 00:11:00,120[br]So usually a bar chart, 9:59:59.000,9:59:59.000 110[br]00:11:00,120 --> 00:11:07,680[br]or a stacked bar chart, is going to be a better option than a pie chart. Or a donut chart is sometimes a better option, where you've got a circle. 9:59:59.000,9:59:59.000 111[br]00:11:07,680 --> 00:11:14,250[br]This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions. 9:59:59.000,9:59:59.000 112[br]00:11:14,250 --> 00:11:19,170[br]I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you 9:59:59.000,9:59:59.000 113[br]00:11:19,170 --> 00:11:25,230[br]want to show relative proportions within different categories. 9:59:59.000,9:59:59.000 114[br]00:11:25,230 --> 00:11:29,670[br]There's another kind of plot that's not a plot on its own, but it's combined with other kinds of plots. 9:59:59.000,9:59:59.000 115[br]00:11:29,670 --> 00:11:34,830[br]That's a rug plot, useful for just displaying distributions at a margin. 9:59:59.000,9:59:59.000 116[br]00:11:34,830 --> 00:11:40,500[br]So to learn more: I've taken a whirlwind tour through a number of different plot types. The class readings, 9:59:59.000,9:59:59.000 117[br]00:11:40,500 --> 00:11:46,170[br]so the paper that I assigned you to read, it talks through the use cases for a number of different plot types. 9:59:59.000,9:59:59.000 118[br]00:11:46,170 --> 00:11:50,160[br]I'm going to be providing tutorial notebooks that walk you through different plot types. 9:59:59.000,9:59:59.000 119[br]00:11:50,160 --> 00:11:54,660[br]The textbook talks about graph plotting and data visualization. 9:59:59.000,9:59:59.000 120[br]00:11:54,660 --> 00:11:58,190[br]The seaborn and matplotlib docs are extensive. And 
9:59:59.000,9:59:59.000 121[br]00:11:58,190 --> 00:12:04,230[br]if you're using another plotting library, its documentation as well. Most plotting libraries also have a gallery, so you can 9:59:59.000,9:59:59.000 122[br]00:12:04,230 --> 00:12:10,740[br]go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data. 9:59:59.000,9:59:59.000 123[br]00:12:10,740 --> 00:12:14,580[br]Click on it and they'll give you the code to show you how they made that plot. 9:59:59.000,9:59:59.000 124[br]00:12:14,580 --> 00:12:22,650[br]You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library 9:59:59.000,9:59:59.000 125[br]00:12:22,650 --> 00:12:27,630[br]and figure out how to make it show you the data in the way you really want it to. 9:59:59.000,9:59:59.000 126[br]00:12:27,630 --> 00:12:32,730[br]Learning one plotting library really deeply is useful. For a lot of the Python ones, 9:59:59.000,9:59:59.000 127[br]00:12:32,730 --> 00:12:37,350[br]especially the ones that are oriented towards static charts, they're built on top of matplotlib. 9:59:59.000,9:59:59.000 128[br]00:12:37,350 --> 00:12:41,910[br]So seaborn is a convenience API on top of matplotlib. If you're using seaborn, 9:59:59.000,9:59:59.000 129[br]00:12:41,910 --> 00:12:49,500[br]you're also going to need to use matplotlib calls a lot of the time, when seaborn gets you 90 percent of the way there, 9:59:59.000,9:59:59.000 130[br]00:12:49,500 --> 00:12:55,470[br]but not quite all the way. So to wrap up, there are many different types of charts that have different use cases. 9:59:59.000,9:59:59.000 131[br]00:12:55,470 --> 00:13:01,020[br]Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing. 
9:59:59.000,9:59:59.000 132[br]00:13:01,020 --> 00:13:05,730[br]Take some of the examples from, say, the seaborn gallery. 9:59:59.000,9:59:59.000 133[br]00:13:05,730 --> 00:13:11,670[br]Play with them, play with them with some data that I'm giving you, play with them with some data that you have elsewhere. 9:59:59.000,9:59:59.000 134[br]00:13:11,670 --> 00:13:30,957[br]But it takes time and practice, so spend some time with the galleries of the plotting libraries you're using. 9:59:59.000,9:59:59.000