-
Not Synced
1
00:00:04,540 --> 00:00:10,180
Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
-
Not Synced
2
00:00:10,180 --> 00:00:13,660
I want you to be able to understand the value of graphics for presenting data,
-
Not Synced
3
00:00:13,660 --> 00:00:19,540
identify the parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid.
-
Not Synced
4
00:00:19,540 --> 00:00:25,300
So here's an example of a chart, and there's a variety of different pieces of this chart.
-
Not Synced
5
00:00:25,300 --> 00:00:31,900
We have an x axis. That's the horizontal x axis.
-
Not Synced
6
00:00:31,900 --> 00:00:41,820
We have a y axis, the vertical axis. Each of these axes has a label.
-
Not Synced
7
00:00:41,820 --> 00:00:52,610
Task two, task one. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
-
Not Synced
8
00:00:52,610 --> 00:00:59,660
And it says that this graph is showing the number of queries per task, with query count distributions in the margins.
-
Not Synced
9
00:00:59,660 --> 00:01:04,070
And each dot is one participant. So it tells us, for a data point in it:
-
Not Synced
10
00:01:04,070 --> 00:01:09,020
What is that? It tells us what it is that we're charting: number of queries per task.
-
Not Synced
11
00:01:09,020 --> 00:01:14,760
We then see from the axis labels that we have task one and we have task two.
-
Not Synced
12
00:01:14,760 --> 00:01:19,500
Those two together give us the context to understand that, oh, we have two tasks.
-
Not Synced
13
00:01:19,500 --> 00:01:24,000
And each dot is one participant.
-
Not Synced
14
00:01:24,000 --> 00:01:28,620
And they're appearing at the point where they have their task one count and their task two count.
-
Not Synced
15
00:01:28,620 --> 00:01:34,530
OK, this allows us to see if there's any relationship between
-
Not Synced
16
00:01:34,530 --> 00:01:40,800
how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins.
-
Not Synced
17
00:01:40,800 --> 00:01:46,050
So this is a compound plot, with distributions in the x and y margins.
-
Not Synced
18
00:01:46,050 --> 00:01:52,530
We have the distribution, a histogram, of the x axis, task one.
-
Not Synced
19
00:01:52,530 --> 00:01:59,560
We have a histogram of the y axis, task two. These histograms don't have axes themselves because,
-
Not Synced
20
00:01:59,560 --> 00:02:04,680
well, we just wanted to show a distribution, particularly for our purposes here.
-
Not Synced
21
00:02:04,680 --> 00:02:16,980
The exact number in each bin is not so important. The key thing is just being able to see, relatively:
-
Not Synced
22
00:02:16,980 --> 00:02:21,150
Where is the mass on the two different task counts?
-
Not Synced
23
00:02:21,150 --> 00:02:27,620
And we can see that both of them have a right skew.
-
Not Synced
24
00:02:27,620 --> 00:02:32,780
They're bulked up towards the low end of the scale.
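Editor's sketch: the marginal histograms and the right skew described above can be illustrated in a few lines. This is a minimal example under stated assumptions, not the lecture's own code or data: the query counts are made up, and `bin_counts` is a hypothetical helper name. It bins the values into equal-width bins and checks that the long tail pulls the mean above the median, which is one common sign of a right skew.

```python
from statistics import mean, median


def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs in the last bin, not one past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical per-participant query counts for one task (illustrative only).
task1_queries = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 9, 17]

counts = bin_counts(task1_queries, n_bins=4)
for i, c in enumerate(counts):
    # A text-mode histogram: one '#' per participant in the bin.
    print(f"bin {i}: {'#' * c}")

# With a right skew, the long tail pulls the mean above the median.
print(mean(task1_queries) > median(task1_queries))
```

Most of the mass lands in the first bin, with a thin tail out to the right, which is the shape the margins of the chart are meant to convey without needing exact axis values.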
-
Not Synced
25
00:02:32,780 --> 00:02:41,120
And then we have all of the individual data points scattered on the chart.
-
Not Synced
26
00:02:41,120 --> 00:02:46,670
This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
-
Not Synced
27
00:02:46,670 --> 00:02:55,400
And particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces.
-
Not Synced
28
00:02:55,400 --> 00:02:59,300
What is your x axis? What is your y axis?
-
Not Synced
29
00:02:59,300 --> 00:03:08,600
Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart?
-
Not Synced
30
00:03:08,600 --> 00:03:16,280
So charts are really useful for revealing a variety of things. They can reveal patterns, or the lack thereof.
-
Not Synced
31
00:03:16,280 --> 00:03:19,200
In this chart, there's really not much of a pattern.
-
Not Synced
32
00:03:19,200 --> 00:03:26,700
And we can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern.
-
Not Synced
33
00:03:26,700 --> 00:03:34,560
The participant with the most queries for task one
-
Not Synced
34
00:03:34,560 --> 00:03:38,760
has a middling to low number of queries for task two.
-
Not Synced
35
00:03:38,760 --> 00:03:47,790
And the one who has the most queries for task two is in the upper end of the queries per task on task one.
-
Not Synced
36
00:03:47,790 --> 00:03:53,520
They're not at all the highest. So we can see there's not very much of a relationship here.
-
Not Synced
37
00:03:53,520 --> 00:03:59,490
At least, it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars.
-
Not Synced
38
00:03:59,490 --> 00:04:08,220
We can see where points lie. We can see, like in that chart that we just saw, that the highest number of counts for one task
-
Not Synced
39
00:04:08,220 --> 00:04:12,210
and the highest number of counts for another task are different. We can also see trends.
-
Not Synced
40
00:04:12,210 --> 00:04:14,610
We can see if a line looks like it goes up or down,
-
Not Synced
41
00:04:14,610 --> 00:04:22,950
or wiggles around. So they can reveal a lot of these kinds of things, and they can really leverage our human perception,
-
Not Synced
42
00:04:22,950 --> 00:04:25,080
particularly our human visual senses,
-
Not Synced
43
00:04:25,080 --> 00:04:35,820
to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart,
-
Not Synced
44
00:04:35,820 --> 00:04:41,520
We need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart,
-
Not Synced
45
00:04:41,520 --> 00:04:45,720
They need to be able to understand what each point in the chart is going to be.
-
Not Synced
46
00:04:45,720 --> 00:04:56,530
They need to understand what values are plotted on the axes.
-
Not Synced
47
00:04:56,530 --> 00:05:01,260
Often this is done in an axis label. In the chart I showed you,
-
Not Synced
48
00:05:01,260 --> 00:05:06,040
the caption said what the values were, and the axis labels said which version of them they were.
-
Not Synced
49
00:05:06,040 --> 00:05:08,560
If there are units, that needs to be clear.
-
Not Synced
50
00:05:08,560 --> 00:05:19,840
So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart,
-
Not Synced
51
00:05:19,840 --> 00:05:29,320
either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart, such as a histogram.
-
Not Synced
52
00:05:29,320 --> 00:05:32,440
Say you've got a fraction or a percentage on the left-hand side.
-
Not Synced
53
00:05:32,440 --> 00:05:41,260
It's standard convention that we're talking about the fraction of the values that are in each bin,
-
Not Synced
54
00:05:41,260 --> 00:05:49,930
at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about
-
Not Synced
55
00:05:49,930 --> 00:05:57,090
what a value or an axis label is, or any doubt that the reader will understand what it is,
-
Not Synced
56
00:05:57,090 --> 00:06:04,230
Be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
-
Not Synced
57
00:06:04,230 --> 00:06:08,640
You can assume a reasonable level of background. You have to know your audience for this.
-
Not Synced
58
00:06:08,640 --> 00:06:16,200
But someone should be able to just look at the chart with its immediately surrounding description,
-
Not Synced
59
00:06:16,200 --> 00:06:20,670
the labels and the caption, and have a pretty good idea of what's going on.
-
Not Synced
60
00:06:20,670 --> 00:06:25,990
Then there's the surrounding text, the text that references the chart if you're writing a document.
-
Not Synced
61
00:06:25,990 --> 00:06:30,190
That can have your observations, that can provide more context and clarity.
-
Not Synced
62
00:06:30,190 --> 00:06:35,950
But someone just looking at the chart should be able to figure out basically what's going on and
-
Not Synced
63
00:06:35,950 --> 00:06:41,080
not be too far off. This is particularly important.
-
Not Synced
64
00:06:41,080 --> 00:06:46,150
Whether this is a good or a bad practice, we can debate, but there are a lot of people who, when they're reading a paper,
-
Not Synced
65
00:06:46,150 --> 00:06:52,000
they focus on the charts and look at the key charts first to see what it is that's going on in the paper.
-
Not Synced
66
00:06:52,000 --> 00:07:01,990
And if our charts are self-explanatory and clear, that makes it a lot easier for people to glance at our work,
-
Not Synced
67
00:07:01,990 --> 00:07:09,370
see what it's doing and decide whether they are going to pay it further attention.
-
Not Synced
68
00:07:09,370 --> 00:07:12,970
So if you're putting a chart in a written document or a paper,
-
Not Synced
69
00:07:12,970 --> 00:07:18,640
each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
-
Not Synced
70
00:07:18,640 --> 00:07:18,790
Like,
-
Not Synced
71
00:07:18,790 --> 00:07:26,410
it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
-
Not Synced
72
00:07:26,410 --> 00:07:31,800
what precisely some of the computations are, etc. In that context, we often don't need a title for the chart.
-
Not Synced
73
00:07:31,800 --> 00:07:37,900
So if we have a caption, we need to label our axes, but we don't always need a title for the chart itself.
-
Not Synced
74
00:07:37,900 --> 00:07:40,840
It doesn't hurt, but often it's redundant with the caption.
-
Not Synced
75
00:07:40,840 --> 00:07:46,690
In other contexts, though, we often do need a title such as when we have a chart that's going in a presentation.
-
Not Synced
76
00:07:46,690 --> 00:07:50,230
Or when we have a chart in one of our notebooks. A title is often helpful in notebooks.
-
Not Synced
77
00:07:50,230 --> 00:07:52,180
The surrounding text may be sufficient,
-
Not Synced
78
00:07:52,180 --> 00:07:59,920
but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
-
Not Synced
79
00:07:59,920 --> 00:08:04,570
So there are a few pitfalls to be aware of when we're thinking about statistical graphics.
-
Not Synced
80
00:08:04,570 --> 00:08:10,930
One is distorting the distances or the differences that are happening.
-
Not Synced
81
00:08:10,930 --> 00:08:13,600
We need to make sure if something has a length,
-
Not Synced
82
00:08:13,600 --> 00:08:21,040
anything that has a length, that that length accurately represents quantities. As for position:
-
Not Synced
83
00:08:21,040 --> 00:08:26,800
If you have two dots, their relative position is what's important. But if we have a bar, it has a length.
-
Not Synced
84
00:08:26,800 --> 00:08:30,700
It also has an area. We need to make sure those accurately represent quantities.
-
Not Synced
85
00:08:30,700 --> 00:08:37,140
One really common way to violate this is having a bar chart whose axis starts at something other than zero.
-
Not Synced
86
00:08:37,140 --> 00:08:41,670
The software we're using doesn't do that by default. Excel does.
-
Not Synced
87
00:08:41,670 --> 00:08:46,860
But your bar chart always needs to start at zero.
-
Not Synced
88
00:08:46,860 --> 00:08:51,540
People don't look at the relative position of the bar. People see the whole height of the bar.
-
Not Synced
89
00:08:51,540 --> 00:08:59,400
And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
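Editor's sketch: the distortion described here is simple arithmetic, and a small example may make it concrete. This is a minimal illustration under stated assumptions: the two values and the helper name `apparent_ratio` are hypothetical, and the function only compares the drawn lengths of two bars for a given axis baseline.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b when the axis starts at baseline."""
    return (a - baseline) / (b - baseline)


# Hypothetical values: two bars whose true values differ by 5%.
a, b = 105.0, 100.0

honest = apparent_ratio(a, b)                  # axis starts at 0
truncated = apparent_ratio(a, b, baseline=95)  # axis starts at 95

print(f"axis from 0:  bar A is drawn {honest:.2f}x as long as bar B")
print(f"axis from 95: bar A is drawn {truncated:.2f}x as long as bar B")
```

With the axis starting at zero, the bars differ in length by the same 5% as the values. With the axis starting at 95, bar A is drawn twice as long as bar B, even though the underlying values have not changed.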
-
Not Synced
90
00:08:59,400 --> 00:09:01,860
There are also ways in which we can violate conventions.
-
Not Synced
91
00:09:01,860 --> 00:09:09,210
So in the first video I showed you a chart that violated the convention that the x axis goes in order.
-
Not Synced
92
00:09:09,210 --> 00:09:16,320
If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
-
Not Synced
93
00:09:16,320 --> 00:09:24,240
Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
-
Not Synced
94
00:09:24,240 --> 00:09:32,570
like you assimilate how to read written text. And if those expectations are violated, that can
-
Not Synced
95
00:09:32,570 --> 00:09:37,490
lead the user to incorrect conclusions from our charts, from our presentation.
-
Not Synced
96
00:09:37,490 --> 00:09:42,920
A key thing to remember here that also applies to all of our presentations.
-
Not Synced
97
00:09:42,920 --> 00:09:52,350
Research isn't a mystery novel. You don't have to worry about spoiling the surprise or the ending. The goal here is
-
Not Synced
98
00:09:52,350 --> 00:09:56,100
not to subvert tropes or present shocking new presentations.
-
Not Synced
99
00:09:56,100 --> 00:10:00,470
We might have shocking new evidence, but from a presentation perspective,
-
Not Synced
100
00:10:00,470 --> 00:10:08,040
we want it to fit within conventions and not violate readers' expectations unnecessarily so that
-
Not Synced
101
00:10:08,040 --> 00:10:14,370
they can read it and be confident that they've correctly understood what it is that you're saying.
-
Not Synced
102
00:10:14,370 --> 00:10:18,990
Another thing to be aware of is that graphics can illustrate an effect.
-
Not Synced
103
00:10:18,990 --> 00:10:27,060
They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
-
Not Synced
104
00:10:27,060 --> 00:10:33,660
We have to be careful about that. We'll talk about some of the pitfalls later.
-
Not Synced
105
00:10:33,660 --> 00:10:39,420
We can't combine exploratory and what's called confirmatory analysis, but they can help us.
-
Not Synced
106
00:10:39,420 --> 00:10:45,990
Visualizing data can help us look for possible effects and get ideas for what to go look for next.
-
Not Synced
107
00:10:45,990 --> 00:10:53,530
But they're not conclusive proof of an effect. We need the numeric results,
-
Not Synced
108
00:10:53,530 --> 00:11:00,030
the raw numbers as well as the results of inferential techniques that let us
-
Not Synced
109
00:11:00,030 --> 00:11:06,090
estimate how big an effect is and whether it's significant in order to come to any conclusions.
-
Not Synced
110
00:11:06,090 --> 00:11:13,920
So in this chart I want to show you, for example, if we look at the chart closely,
-
Not Synced
111
00:11:13,920 --> 00:11:20,790
we have these two data points, the little blue Xs. The FunkSVD X
-
Not Synced
112
00:11:20,790 --> 00:11:26,160
is to the left of the item-item X. For this particular metric,
-
Not Synced
113
00:11:26,160 --> 00:11:32,900
lower is a better value, because it's an error metric. Root mean squared error is what RMSE stands for.
-
Not Synced
114
00:11:32,900 --> 00:11:38,520
So it looks like, okay, this is a little bit better. But that's not sufficient evidence for us
-
Not Synced
115
00:11:38,520 --> 00:11:44,220
to conclude that FunkSVD is better than item-item on the per-user RMSE metric.
-
Not Synced
116
00:11:44,220 --> 00:11:48,900
Exactly what all these things are is a topic for another day.
-
Not Synced
117
00:11:48,900 --> 00:11:55,230
But the fact that we see the thing to the left means that, if the effect is real, this illustrates it.
-
Not Synced
118
00:11:55,230 --> 00:12:02,220
But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy;
-
Not Synced
119
00:12:02,220 --> 00:12:07,120
it's a relatively small difference. So they help us see.
-
Not Synced
120
00:12:07,120 --> 00:12:11,620
They help us communicate. They're not definitive and conclusive proof.
-
Not Synced
121
00:12:11,620 --> 00:12:16,660
A couple of other things I want to highlight that are going on in this graph: I've introduced two different kinds of symbols here.
-
Not Synced
122
00:12:16,660 --> 00:12:19,930
In the earlier graph, we just had one kind of symbol: we had dots.
-
Not Synced
123
00:12:19,930 --> 00:12:24,580
Here we have two different kinds, with a legend that says red circles are
-
Not Synced
124
00:12:24,580 --> 00:12:30,930
a thing called global RMSE and blue Xs are a thing called per-user RMSE.
-
Not Synced
125
00:12:30,930 --> 00:12:37,920
You don't have to understand what those are. But the point is, I'm using different colors and shapes in order to communicate,
-
Not Synced
126
00:12:37,920 --> 00:12:45,750
to show different versions of a thing in the same chart. Using different shapes
-
Not Synced
127
00:12:45,750 --> 00:12:50,430
in addition to different colors is useful because, if it's printed on a black and white printer
-
Not Synced
128
00:12:50,430 --> 00:12:57,270
or if the reader has some form of color blindness, it helps make the differences clearer.
-
Not Synced
129
00:12:57,270 --> 00:13:05,070
In addition, I've got my y axis, which is indicating the different things that I'm plotting here.
-
Not Synced
130
00:13:05,070 --> 00:13:11,430
I also have grouped them just to make it easier for the user to see.
-
Not Synced
131
00:13:11,430 --> 00:13:15,330
These are the same kind of thing. These first ones are all single algorithms.
-
Not Synced
132
00:13:15,330 --> 00:13:20,920
And then we have a blend and a few other things. The details
-
Not Synced
133
00:13:20,920 --> 00:13:27,220
aren't important here, but the grouping helps guide the user to understand the structure. We have these group breakdowns.
-
Not Synced
134
00:13:27,220 --> 00:13:32,680
It also helps save space in the paper because I can present all of these different things in one place.
-
Not Synced
135
00:13:32,680 --> 00:13:39,730
It's easy to compare the different stages, even though I have to split them out in the discussion in the paper.
-
Not Synced
136
00:13:39,730 --> 00:13:46,540
But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
-
Not Synced
137
00:13:46,540 --> 00:13:51,880
So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it.
-
Not Synced
138
00:13:51,880 --> 00:13:54,160
They don't replace our numerical analysis,
-
Not Synced
139
00:13:54,160 --> 00:14:01,570
but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
-
Not Synced
140
00:14:01,570 --> 00:14:06,340
We do, however, always need to make sure that we clearly label and describe our graphics so that
-
Not Synced
141
00:14:06,340 --> 00:14:15,300
readers can understand them and they can draw correct conclusions from them.
-
Not Synced