-
Not Synced
1
00:00:04,510 --> 00:00:05,680
Welcome back. In this video,
-
Not Synced
2
00:00:05,680 --> 00:00:10,480
I'm going to walk you through some of the different types of charts that we're going to be learning how to create. The learning outcomes are to
-
Not Synced
3
00:00:10,480 --> 00:00:16,720
be able to identify the appropriate type of chart for a given data set and question, and understand key rules to avoid common errors.
-
Not Synced
4
00:00:16,720 --> 00:00:20,880
I'm not going to be showing the detailed code for these chart types in the video.
-
Not Synced
5
00:00:20,880 --> 00:00:24,640
You're going to be able to find that in the documentation linked from here. And also,
-
Not Synced
6
00:00:24,640 --> 00:00:28,600
I'm going to be preparing a notebook that demonstrates various of these charting
-
Not Synced
7
00:00:28,600 --> 00:00:34,330
types with the actual code to create them using the software we're discussing.
-
Not Synced
8
00:00:34,330 --> 00:00:44,470
So common software for this is seaborn and matplotlib; those are going to be the primary ones that we're working with this semester.
-
Not Synced
9
00:00:44,470 --> 00:00:49,330
When I'm showing the function names, seaborn is commonly imported as sns.
-
Not Synced
10
00:00:49,330 --> 00:00:52,870
So an sns.* function is going to be a seaborn function, a plt.*
-
Not Synced
11
00:00:52,870 --> 00:00:59,860
function is going to be a matplotlib function, and I'm also showing the function you can use in plotnine or R's ggplot2,
-
Not Synced
12
00:00:59,860 --> 00:01:04,600
if you want to use those instead. I often use plotnine for a lot of my graphics.
-
Not Synced
13
00:01:04,600 --> 00:01:10,060
That's just for reference, though. We're not going to be getting into much detail on plotnine in this course.
-
Not Synced
14
00:01:10,060 --> 00:01:16,130
So there's a variety of different types of charts. Some of them are showing relative proportions.
-
Not Synced
15
00:01:16,130 --> 00:01:23,860
Some of them are showing how different amounts relate to each other. Some of them are showing positions in an x-y coordinate space.
-
Not Synced
16
00:01:23,860 --> 00:01:30,070
A bar chart is a very common type of chart that shows numeric values grouped by a categorical or ordinal variable.
-
Not Synced
17
00:01:30,070 --> 00:01:31,780
Sometimes they're grouped by a numeric variable as well.
-
Not Synced
18
00:01:31,780 --> 00:01:37,600
But usually our x axis is a categorical variable of some kind. It works best with a moderate number of categories.
-
Not Synced
19
00:01:37,600 --> 00:01:41,950
We can use a second categorical variable to, say, color the bars.
-
Not Synced
20
00:01:41,950 --> 00:01:50,170
So this chart shows the survival rates of Titanic passengers, where the x axis is the passenger class: first, second, or third class.
-
Not Synced
21
00:01:50,170 --> 00:01:55,660
And then the bars are colored based on the gender of the passenger.
-
Not Synced
22
00:01:55,660 --> 00:01:58,000
And so we can see the different survival rates.
-
Not Synced
23
00:01:58,000 --> 00:02:06,250
The y axis on a bar chart is often a mean or a sum or a count within the group determined by our categorical variables.
-
Not Synced
24
00:02:06,250 --> 00:02:12,970
Sometimes these will be horizontal. So in a horizontal bar chart, the categorical variable is on the y axis and the bars run horizontally.
-
Not Synced
25
00:02:12,970 --> 00:02:17,080
This also shows some whiskers that come from a confidence interval.
-
Not Synced
26
00:02:17,080 --> 00:02:24,760
It's very easy to generate a default, relatively good confidence interval with seaborn. So, to plot
-
Not Synced
27
00:02:24,760 --> 00:02:30,730
these, seaborn has the countplot function, which does a quick,
-
Not Synced
28
00:02:30,730 --> 00:02:36,970
basically categorical histogram: how many observations are in each category.
-
Not Synced
29
00:02:36,970 --> 00:02:41,830
The catplot function with kind="bar" will plot by default a mean value for each category.
-
Not Synced
30
00:02:41,830 --> 00:02:47,860
And if you have it do the mean plotting, it will also compute 95 percent confidence intervals.
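As a rough illustration of where whiskers like these can come from, here is a minimal sketch of a percentile bootstrap for a 95 percent confidence interval of a mean, using only the Python standard library. The ratings list, the seed, and the helper name bootstrap_ci are made up for this example. Seaborn's default interval is also bootstrap-based, but its implementation differs in the details, so treat this as a sketch of the idea rather than a reproduction of what the library does.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (1 - level) / 2
    lo = boot_means[int(tail * n_boot)]
    hi = boot_means[int((1 - tail) * n_boot) - 1]
    return lo, hi

# Hypothetical movie ratings for one genre (not data from the lecture)
ratings = [2.5, 3.0, 3.5, 4.0, 3.0, 2.0, 4.5, 3.5, 3.0, 4.0]
low, high = bootstrap_ci(ratings)
print(f"mean={mean(ratings):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Whatever method you use, say so in the caption, so that the reader knows what the whiskers mean.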
-
Not Synced
31
00:02:47,860 --> 00:02:53,350
That's what's being shown in this plot here.
-
Not Synced
32
00:02:53,350 --> 00:03:00,050
And then you can also use the plt.bar function or plotnine's geom_bar.
-
Not Synced
33
00:03:00,050 --> 00:03:04,340
So, a few rules about bar charts. The first is: never start the y axis on a bar chart at
-
Not Synced
34
00:03:04,340 --> 00:03:13,320
anything but zero. And the reason for this we can see here.
-
Not Synced
35
00:03:13,320 --> 00:03:17,270
So, the top one: these are looking at the mean average ratings.
-
Not Synced
36
00:03:17,270 --> 00:03:22,770
We take each movie's mean rating, and then we compute the mean of the average ratings within a genre.
-
Not Synced
37
00:03:22,770 --> 00:03:33,570
So if we look here, the difference between horror and IMAX is a notable difference, but it's a difference of about 0.5 or so.
-
Not Synced
38
00:03:33,570 --> 00:03:40,080
The difference between sci-fi and short is a difference of a little under one, probably.
-
Not Synced
39
00:03:40,080 --> 00:03:46,170
But when we start the Y axis at 2.5 instead of zero,
-
Not Synced
40
00:03:46,170 --> 00:03:51,840
what happens is the differences look much larger than they are, because with the human eye,
-
Not Synced
41
00:03:51,840 --> 00:04:01,050
we not only see the difference, but it's very natural for us to compare the difference to the bar length, because these are bars.
-
Not Synced
42
00:04:01,050 --> 00:04:06,620
They have a length, they have an area, and since they're all the same width, the length is proportional to the area.
-
Not Synced
43
00:04:06,620 --> 00:04:12,660
Breaking length-area proportionality is a good way to confuse your readers.
-
Not Synced
44
00:04:12,660 --> 00:04:20,910
It looks like IMAX movies have twice as high an average rating as horror movies because the bar is twice as high, but they don't.
-
Not Synced
45
00:04:20,910 --> 00:04:25,770
It's really a shift from about 2.8 to 3.3 or 3.4.
-
Not Synced
46
00:04:25,770 --> 00:04:33,090
And so it creates a distortion: it highlights the differences, but it makes the differences look larger than they are.
-
Not Synced
47
00:04:33,090 --> 00:04:39,210
So when I talked about integrity and avoiding deception, when I was introducing statistical graphics,
-
Not Synced
48
00:04:39,210 --> 00:04:45,750
this is what I was talking about. The difference is there; it's just not as big as it looks like it is
-
Not Synced
49
00:04:45,750 --> 00:04:47,460
when we truncate our bar charts.
-
Not Synced
50
00:04:47,460 --> 00:04:55,050
So the general rule here, to generalize beyond bar charts, is that if something has a length that varies based on the data,
-
Not Synced
51
00:04:55,050 --> 00:05:04,470
that length needs to actually represent the value, not the value minus something because you started the axis somewhere else.
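To put numbers on that rule, here is a small sketch of the arithmetic. It compares the ratio of the drawn bar lengths when the axis starts at zero with the ratio when it starts at 2.5. The two ratings are the approximate values mentioned above (about 2.8 for horror and about 3.3 for IMAX), and the helper name apparent_ratio is made up for this example.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar lengths for values a and b when the axis starts at baseline."""
    return (b - baseline) / (a - baseline)

horror, imax = 2.8, 3.3  # approximate mean ratings, read off the chart

honest = apparent_ratio(horror, imax)                   # axis starts at zero
truncated = apparent_ratio(horror, imax, baseline=2.5)  # axis starts at 2.5

print(f"zero baseline: IMAX bar is {honest:.2f}x the horror bar")    # 1.18x
print(f"2.5 baseline:  IMAX bar is {truncated:.2f}x the horror bar")  # 2.67x
```

With these values, the truncated axis draws the IMAX bar more than twice as long as the horror bar, even though the rating itself is only about 18 percent higher.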
-
Not Synced
52
00:05:04,470 --> 00:05:06,960
Also, if you're including whiskers, like I did in the previous chart,
-
Not Synced
53
00:05:06,960 --> 00:05:14,040
define how they're computed. And also, one thing to be careful of with seaborn's catplot and countplot.
-
Not Synced
54
00:05:14,040 --> 00:05:21,270
If you aren't using the color for a second variable, they will just make every bar a different color for no particular reason,
-
Not Synced
55
00:05:21,270 --> 00:05:25,010
which creates something that's different when it doesn't need to be.
-
Not Synced
56
00:05:25,010 --> 00:05:29,520
It causes the reader to look for a difference that isn't actually there, which is best avoided.
-
Not Synced
57
00:05:29,520 --> 00:05:32,640
You can fix that by just specifying the color.
-
Not Synced
58
00:05:32,640 --> 00:05:40,050
We saw histograms last week. A histogram is a bar chart, but the categorical variable is bins, or ranges, of a numerical value.
-
Not Synced
59
00:05:40,050 --> 00:05:46,800
Also, though, if we have a bar chart that's showing the relative frequency of categorical values, that can also be called a histogram;
-
Not Synced
60
00:05:46,800 --> 00:05:50,730
the Y axis is either the number or the fraction of occurrences in this case.
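The difference between those two choices of y axis is only a normalization. As a minimal sketch, assuming a made-up list of passenger classes (not the actual Titanic data), the counts and the fractions can be computed like this:

```python
from collections import Counter

# Hypothetical categorical observations, for illustration only
classes = ["third", "first", "third", "second", "third", "first"]

counts = Counter(classes)  # number of occurrences of each value
total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}  # relative frequency

for label in counts:
    print(label, counts[label], round(fractions[label], 3))
```

The bars have the same shape either way; only the scale of the y axis changes.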
-
Not Synced
61
00:05:50,730 --> 00:05:58,170
The key thing, though, is that the different heights of the bars let us see visually the relative frequency of different values.
-
Not Synced
62
00:05:58,170 --> 00:06:01,680
So it really makes it visually clear how the data is shaped.
-
Not Synced
63
00:06:01,680 --> 00:06:07,950
We can see skew and things like that. It's one way to graphically describe a distribution.
-
Not Synced
64
00:06:07,950 --> 00:06:12,120
A scatterplot shows two numeric variables. So each observation is a dot.
-
Not Synced
65
00:06:12,120 --> 00:06:16,800
Each observation has two numeric variables. We put one variable on the x axis,
-
Not Synced
66
00:06:16,800 --> 00:06:22,290
the other variable on the y axis, and put the dot where its variable values would intersect.
-
Not Synced
67
00:06:22,290 --> 00:06:26,700
This is really useful for seeing how two variables relate. Does one increase with the other?
-
Not Synced
68
00:06:26,700 --> 00:06:30,840
Do points clump or cluster in an interesting way? Other interesting patterns.
-
Not Synced
69
00:06:30,840 --> 00:06:43,050
It helps us find outliers. So this scatterplot is showing the tip versus the total bill for a bunch of restaurant bills.
-
Not Synced
70
00:06:43,050 --> 00:06:45,540
And each observation is a bill.
-
Not Synced
71
00:06:45,540 --> 00:06:55,380
And then the x axis is the total bill, and the y axis is the tip that the customer added to the bill.
-
Not Synced
72
00:06:55,380 --> 00:07:02,790
There are a couple of refinements we can do here. We can color or change the point type by a categorical variable.
-
Not Synced
73
00:07:02,790 --> 00:07:08,140
So on this one, we've changed it so that the points are different colors.
-
Not Synced
74
00:07:08,140 --> 00:07:13,310
So the dinners are blue circles and the lunches are orange X's.
-
Not Synced
75
00:07:13,310 --> 00:07:19,130
We could also add a trend line or some other kind of line to show some context. For example, on this chart,
-
Not Synced
76
00:07:19,130 --> 00:07:27,170
we might want to plot a line that shows the 20 percent point, and that would let us easily see where we're going over 20 percent,
-
Not Synced
77
00:07:27,170 --> 00:07:33,650
and how the tips are distributed relative to a 20 percent mark.
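A reference line like that could be sketched as follows, assuming matplotlib is installed. The bill and tip values are made up for this example (they are not the data set shown on the slide), and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Made-up (total bill, tip) pairs, for illustration only
bills = [10.0, 20.0, 25.0, 40.0, 50.0]
tips = [1.5, 4.5, 4.0, 9.0, 7.5]

fig, ax = plt.subplots()
ax.scatter(bills, tips, label="bills")

# Reference line: the tip that would be exactly 20 percent of the bill
max_bill = max(bills)
(ref_line,) = ax.plot([0, max_bill], [0, 0.2 * max_bill],
                      linestyle="--", color="gray", label="20% tip")

ax.set_xlabel("Total bill")
ax.set_ylabel("Tip")
ax.legend()

# Points above the line are tips of more than 20 percent
n_above = sum(t > 0.2 * b for b, t in zip(bills, tips))
print(f"{n_above} of {len(bills)} tips are above 20 percent")
```

Points above the dashed line are tips over 20 percent, and points below it are tips under 20 percent.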
-
Not Synced
78
00:07:33,650 --> 00:07:41,170
Also, x can be a categorical variable. When that happens, we call this a point plot or a strip plot.
-
Not Synced
79
00:07:41,170 --> 00:07:47,020
Functions for doing this are plt.scatter, sns.scatterplot, and then plotnine's geom_point.
-
Not Synced
80
00:07:47,020 --> 00:07:50,800
The seaborn documentation has some more examples of these. Next, a line plot.
-
Not Synced
81
00:07:50,800 --> 00:07:59,350
It's like a scatterplot in that we have two numeric variables. However, it emphasizes the progression or continuity from one value to the next
-
Not Synced
82
00:07:59,350 --> 00:08:04,930
by connecting them with a line. It really works best when we have one y per x value that we want to plot.
-
Not Synced
83
00:08:04,930 --> 00:08:11,260
If we've got more than one, it really starts getting very, very jagged. It's very common for time series.
-
Not Synced
84
00:08:11,260 --> 00:08:15,430
So this is another example from the seaborn tutorial, not labeled super well.
-
Not Synced
85
00:08:15,430 --> 00:08:22,510
I don't know what the value actually is, but it shows that we have some kind of a value that's changing over time and it's going negative.
-
Not Synced
86
00:08:22,510 --> 00:08:29,710
Zero on the y axis is at the top, and the values otherwise are negative. Functions to create
-
Not Synced
87
00:08:29,710 --> 00:08:36,370
these are lineplot from seaborn, plot from matplotlib, and geom_line from plotnine.
-
Not Synced
88
00:08:36,370 --> 00:08:41,380
A box plot shows the distribution of a numeric variable grouped by a categorical variable.
-
Not Synced
89
00:08:41,380 --> 00:08:46,630
So the bar chart just showed us, say, the average value, maybe with a confidence interval.
-
Not Synced
90
00:08:46,630 --> 00:08:51,820
The box plot actually shows us the distribution and it does so in a way that's based on the median.
-
Not Synced
91
00:08:51,820 --> 00:09:01,050
So the horizontal line in the middle of the box is the median value. The top and bottom of the box
-
Not Synced
92
00:09:01,050 --> 00:09:06,750
are the first and third quartiles: the bottom is the first quartile and the top is the third quartile.
-
Not Synced
93
00:09:06,750 --> 00:09:12,630
And what that means is twenty five percent of the values are below the bottom of the box.
-
Not Synced
94
00:09:12,630 --> 00:09:17,700
Twenty five percent are in the bottom half of the box, twenty five percent are in the top half, and then twenty five percent are above.
-
Not Synced
95
00:09:17,700 --> 00:09:21,390
We then show these whiskers that extend out to the minimum
-
Not Synced
96
00:09:21,390 --> 00:09:27,390
and maximum of the data, and a number of plotting packages will do some kind of outlier detection.
-
Not Synced
97
00:09:27,390 --> 00:09:37,470
This is using seaborn's default outlier detection, for when the max is very high. The rule it uses by default is this:
-
Not Synced
98
00:09:37,470 --> 00:09:44,460
you've got the IQR, the interquartile range. That's the height of the box. It allows the whisker to be 1.5 times that tall.
-
Not Synced
99
00:09:44,460 --> 00:09:50,880
And if you have any data points that are further away than that, it plots them as individual points, which makes it easy to see outliers.
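The 1.5 times IQR rule can be sketched with the Python standard library alone. The data and the helper name box_stats are made up for this example, and the "inclusive" quartile method is an assumption on my part: plotting libraries can use slightly different quartile conventions, so the exact numbers may differ a little from what a given library draws.

```python
from statistics import quantiles

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule (sketch)."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    # Whiskers stop at the most extreme data points still inside the fences
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": inside[0], "whisker_high": inside[-1],
            "outliers": outliers}

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up data with one large value
print(stats)
```

Here the box runs from 3.25 to 7.75, so the IQR is 4.5 and the upper limit for the whisker is 14.5. The whisker stops at 9, the largest value inside that limit, and 30 is drawn as an individual point.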
-
Not Synced
100
00:09:50,880 --> 00:09:57,930
You can change it so that the whisker goes all the way up to the max. But it lets you quickly see and compare, between different groups,
-
Not Synced
101
00:09:57,930 --> 00:10:03,210
the median, the first and third quartiles, and the min and the max of the data.
-
Not Synced
102
00:10:03,210 --> 00:10:11,520
Very useful for comparing observations of a variable when they're grouped by some categorical variable. Functions
-
Not Synced
103
00:10:11,520 --> 00:10:19,680
for doing this are boxplot from both seaborn and matplotlib, and then geom_boxplot from plotnine.
-
Not Synced
104
00:10:19,680 --> 00:10:25,710
A few more plots. A violin plot is like a box plot, except it's based around the mean and has curved sides.
-
Not Synced
105
00:10:25,710 --> 00:10:30,120
The swarm plot is another kind of categorical scatterplot.
-
Not Synced
106
00:10:30,120 --> 00:10:38,860
It's usually best to avoid pie charts, especially 3D pie charts. A lot of our software is not going to produce 3D charts very easily.
-
Not Synced
107
00:10:38,860 --> 00:10:46,620
Don't try to go make a 3D chart. They're almost always more confusing, especially like the 3D bars that you have from vintage PowerPoint.
-
Not Synced
108
00:10:46,620 --> 00:10:57,750
But even a plain pie chart is best avoided, just because human perception is not super great at accurately comparing angular areas.
-
Not Synced
109
00:10:57,750 --> 00:11:00,120
So usually a bar chart
-
Not Synced
110
00:11:00,120 --> 00:11:07,680
or stacked bar chart is going to be a better option than a pie chart. A donut chart is sometimes a better option where you've got a circle.
-
Not Synced
111
00:11:07,680 --> 00:11:14,250
This is one place where I disagree with the reading. The reading that I gave you recommends pie charts for showing relative proportions.
-
Not Synced
112
00:11:14,250 --> 00:11:19,170
I recommend usually avoiding those. Use a bar chart, or a stacked bar chart if you
-
Not Synced
113
00:11:19,170 --> 00:11:25,230
want to show relative proportions within different categories.
-
Not Synced
114
00:11:25,230 --> 00:11:29,670
There's another kind of plot that's not a plot on its own, but it's combined with other kinds of plots.
-
Not Synced
115
00:11:29,670 --> 00:11:34,830
That's a rug plot, useful for just displaying distributions at a margin.
-
Not Synced
116
00:11:34,830 --> 00:11:40,500
I've taken a whirlwind tour through a number of different plot types. To learn more, there are the class readings.
-
Not Synced
117
00:11:40,500 --> 00:11:46,170
The paper that I assigned you to read talks through the use cases for a number of different plot types.
-
Not Synced
118
00:11:46,170 --> 00:11:50,160
I'm going to be providing tutorial notebooks that walk you through different plot types.
-
Not Synced
119
00:11:50,160 --> 00:11:54,660
The textbook talks about graph plotting and data visualization.
-
Not Synced
120
00:11:54,660 --> 00:11:58,190
The seaborn and matplotlib docs are extensive.
-
Not Synced
121
00:11:58,190 --> 00:12:04,230
If you're using another plotting library, see its documentation as well. Most plotting libraries also have a gallery.
-
Not Synced
122
00:12:04,230 --> 00:12:10,740
Go through the gallery, look for a plot that has a feature you want in your plot or that you think might be useful for displaying your data.
-
Not Synced
123
00:12:10,740 --> 00:12:14,580
Click on it and they'll give you the code to show you how they made that plot.
-
Not Synced
124
00:12:14,580 --> 00:12:22,650
You might want to combine pieces from multiple plots. In practice, it takes a lot of trial and error to really get the hang of your plotting library
-
Not Synced
125
00:12:22,650 --> 00:12:27,630
and figure out how to make it show you the data in the way you really want it to.
-
Not Synced
126
00:12:27,630 --> 00:12:32,730
Learning one plotting library really deeply is useful. A lot of the Python ones,
-
Not Synced
127
00:12:32,730 --> 00:12:37,350
especially the ones that are oriented towards static charts, are built on top of matplotlib.
-
Not Synced
128
00:12:37,350 --> 00:12:41,910
So seaborn is a convenience API on top of matplotlib. If you're using seaborn,
-
Not Synced
129
00:12:41,910 --> 00:12:49,500
you're also going to need to use matplotlib calls a lot of the time, when seaborn gets you 90 percent of the way there,
-
Not Synced
130
00:12:49,500 --> 00:12:55,470
but not quite all the way. So to wrap up, there are many different types of charts that have different use cases.
-
Not Synced
131
00:12:55,470 --> 00:13:01,020
Learning graphics techniques takes time and practice. Take some of the example notebooks that I'm providing.
-
Not Synced
132
00:13:01,020 --> 00:13:05,730
Take some of the examples from, say, the seaborn gallery.
-
Not Synced
133
00:13:05,730 --> 00:13:11,670
Play with them with some data that I'm giving you, or with some data that you have elsewhere.
-
Not Synced
134
00:13:11,670 --> 00:13:30,957
But it takes time and practice, so spend some time with the galleries of the plotting libraries you're using.
-
Not Synced