1
00:00:04,540 --> 00:00:10,180
Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
2
00:00:10,180 --> 00:00:13,660
I want you to be able to understand the value of graphics for presenting data,
3
00:00:13,660 --> 00:00:19,540
identify the parts of a statistical graphic, and understand some pitfalls in graphics that we want to try to avoid.
4
00:00:19,540 --> 00:00:25,300
So here's an example of a chart, and there's a variety of different pieces of this chart.
5
00:00:25,300 --> 00:00:31,900
We have an x axis. That's the horizontal axis.
6
00:00:31,900 --> 00:00:41,820
We have a y axis, the vertical axis. Each of these axes has a label.
7
00:00:41,820 --> 00:00:52,610
Task two, task one. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
8
00:00:52,610 --> 00:00:59,660
And it says that this graph is showing the number of queries per task, with query count distributions in the margins.
9
00:00:59,660 --> 00:01:04,070
And each dot is one participant. So it tells us what each data point is.
10
00:01:04,070 --> 00:01:09,020
It also tells us what it is that we're charting: the number of queries per task.
11
00:01:09,020 --> 00:01:14,760
We then see from the axis labels that we have task one and we have task two.
12
00:01:14,760 --> 00:01:19,500
Those two together give us the context to understand that, oh, we have two tasks.
13
00:01:19,500 --> 00:01:24,000
And each dot is one participant.
14
00:01:24,000 --> 00:01:28,620
And they appear at the point given by their task one count and their task two count.
15
00:01:28,620 --> 00:01:34,530
OK, this allows us to see if there's any relationship between
16
00:01:34,530 --> 00:01:40,800
how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins.
17
00:01:40,800 --> 00:01:46,050
So this is a compound plot, with extra plots in the x and y margins.
18
00:01:46,050 --> 00:01:52,530
We have the distribution, a histogram, of the x axis, task one.
19
00:01:52,530 --> 00:01:59,560
We have a histogram of the y axis, task two. These histograms don't have axes themselves because,
20
00:01:59,560 --> 00:02:04,680
well, we just wanted to show a distribution, particularly for our purposes here.
21
00:02:04,680 --> 00:02:16,980
The exact number in each bin is not so important. The key thing is just being able to see, relatively,
22
00:02:16,980 --> 00:02:21,150
where the mass is on the two different task counts.
23
00:02:21,150 --> 00:02:27,620
And we can see that both of them have a right skew.
24
00:02:27,620 --> 00:02:32,780
They're bulked up toward the low end of the scale.
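One rough way to check a right skew numerically, alongside looking at the histogram, is to compare the mean with the median and compute a moment-based skewness. This is only a sketch: the `query_counts` values and the `sample_skewness` helper are made up for illustration and are not the data behind the chart.

```python
from statistics import mean, median

# Hypothetical query counts for one task (illustrative only, not the study's data).
query_counts = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 20]

def sample_skewness(values):
    """Moment-based skewness; a positive value means a longer tail to the right."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

print(f"mean={mean(query_counts):.2f}, median={median(query_counts)}")
print(f"skewness={sample_skewness(query_counts):.2f}")
```

With right-skewed data like this, the mean sits above the median and the skewness is positive, which matches the "bulked up at the low end" shape in the margins.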
25
00:02:32,780 --> 00:02:41,120
And then we have all of the individual data points scattered on the chart.
26
00:02:41,120 --> 00:02:46,670
This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
27
00:02:46,670 --> 00:02:55,400
And particularly as you go to refine a chart, what you're going to need to do is specify what's happening in each of these pieces.
28
00:02:55,400 --> 00:02:59,300
What is your x axis? What is your y axis?
29
00:02:59,300 --> 00:03:08,600
Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart?
30
00:03:08,600 --> 00:03:16,280
So charts are really useful for revealing a variety of things. They can reveal patterns, or the lack thereof.
31
00:03:16,280 --> 00:03:19,200
In this chart, there's really not much of a pattern.
32
00:03:19,200 --> 00:03:26,700
We can see that it's bulked up at the low end, but particularly as we get out to larger numbers of queries, there's not a lot of pattern.
33
00:03:26,700 --> 00:03:34,560
The participant with the most queries for task one
34
00:03:34,560 --> 00:03:38,760
has a middling-to-low number of queries for task two.
35
00:03:38,760 --> 00:03:47,790
And the one who has the most queries for task two: while they're in the upper end of the queries on task one,
36
00:03:47,790 --> 00:03:53,520
they're not at all the highest. So we can see there's not very much of a relationship here.
37
00:03:53,520 --> 00:03:59,490
At least, it doesn't look like there is one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars.
38
00:03:59,490 --> 00:04:08,220
We can see where points lie. We can see, in that chart we just saw, that the participant with the highest count for one task
39
00:04:08,220 --> 00:04:12,210
and the one with the highest count for the other task are different. We can also see trends.
40
00:04:12,210 --> 00:04:14,610
We can see if a line looks like it goes up or down,
41
00:04:14,610 --> 00:04:22,950
or wiggles around. So they can reveal a lot of these kinds of things, and they can really leverage our human perception,
42
00:04:22,950 --> 00:04:25,080
particularly our human visual senses,
43
00:04:25,080 --> 00:04:35,820
to be able to quickly internalize and understand what is going on in a set of data.
44
00:04:35,820 --> 00:04:41,520
When we're creating a chart, we need to clearly document a few things. We need to clearly state what is being presented.
45
00:04:41,520 --> 00:04:45,720
When someone looks at a chart, they need to be able to understand what each point in the chart is.
46
00:04:45,720 --> 00:04:56,530
They need to understand what values are plotted on the axes.
47
00:04:56,530 --> 00:05:01,260
Often this is done in an axis label. In the chart I showed you,
48
00:05:01,260 --> 00:05:06,040
the caption said what the values were, and the axis labels said which version of them they were.
49
00:05:06,040 --> 00:05:08,560
If there are units, that needs to be clear.
50
00:05:08,560 --> 00:05:19,840
So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart,
51
00:05:19,840 --> 00:05:29,320
either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart. Say you have a histogram,
52
00:05:29,320 --> 00:05:32,440
and you've got a fraction or a percentage on the left-hand side.
53
00:05:32,440 --> 00:05:41,260
It's standard convention that we're talking about the fraction of the values that are in each bin,
54
00:05:41,260 --> 00:05:49,930
at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about
55
00:05:49,930 --> 00:05:57,090
what a value or an axis label is, or any doubt that the reader will understand what it is,
56
00:05:57,090 --> 00:06:04,230
be explicit: explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
57
00:06:04,230 --> 00:06:08,640
You can assume a reasonable level of background. You have to know your audience for this.
58
00:06:08,640 --> 00:06:16,200
But someone should be able to just look at the chart with its immediately surrounding description,
59
00:06:16,200 --> 00:06:20,670
the labels and the caption, and have a pretty good idea of what's going on.
60
00:06:20,670 --> 00:06:25,990
The surrounding text, the text that references the chart if you're writing a document,
61
00:06:25,990 --> 00:06:30,190
can have your observations, and can provide more context and clarity.
62
00:06:30,190 --> 00:06:35,950
But someone just looking at the chart should be able to figure out basically what's going on and
63
00:06:35,950 --> 00:06:41,080
not be too far off. This is particularly important because there are a lot of people,
64
00:06:41,080 --> 00:06:46,150
whether this is a good or a bad practice we can debate, but there are a lot of people who, when they're reading a paper,
65
00:06:46,150 --> 00:06:52,000
focus on the charts and look at the key charts first to see what it is that's going on in the paper.
66
00:06:52,000 --> 00:07:01,990
And if our charts are self-explanatory and clear, it makes it a lot easier for people to glance at our work,
67
00:07:01,990 --> 00:07:09,370
see what it's doing and decide whether they are going to pay it further attention.
68
00:07:09,370 --> 00:07:12,970
So if you're putting a chart in a written document or a paper,
69
00:07:12,970 --> 00:07:18,640
each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
70
00:07:18,640 --> 00:07:18,790
Like,
71
00:07:18,790 --> 00:07:26,410
it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
72
00:07:26,410 --> 00:07:31,800
what precisely some of the computations are, etc. In a paper, we often don't need a title for the chart.
73
00:07:31,800 --> 00:07:37,900
If we have a caption, we still need to label our axes, but we don't always need a title for the chart itself.
74
00:07:37,900 --> 00:07:40,840
It doesn't hurt, but often it's redundant with the caption.
75
00:07:40,840 --> 00:07:46,690
In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation
76
00:07:46,690 --> 00:07:50,230
or a chart in one of our notebooks. A title is often helpful. In notebooks,
77
00:07:50,230 --> 00:07:52,180
the surrounding text may be sufficient,
78
00:07:52,180 --> 00:07:59,920
but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
79
00:07:59,920 --> 00:08:04,570
So, a few pitfalls to be aware of when we're thinking about statistical graphics.
80
00:08:04,570 --> 00:08:10,930
One is distorting the distances or the differences in the data. In particular,
81
00:08:10,930 --> 00:08:13,600
we need to make sure that if something has a length,
82
00:08:13,600 --> 00:08:21,040
that length accurately represents a quantity. With position, it's relative position that matters:
83
00:08:21,040 --> 00:08:26,800
if you have two dots, their relative position is what's important. But if we have a bar, it has a length.
84
00:08:26,800 --> 00:08:30,700
It also has an area. We need to make sure those accurately represent quantities.
85
00:08:30,700 --> 00:08:37,140
One really common way to violate this is having a bar chart whose axis starts at something other than zero.
86
00:08:37,140 --> 00:08:41,670
The software we're using doesn't do that by default. Excel does.
87
00:08:41,670 --> 00:08:46,860
But your bar chart always needs to start at zero, because
88
00:08:46,860 --> 00:08:51,540
people don't look at the relative position of the top of the bar. People see the whole height of the bar.
89
00:08:51,540 --> 00:08:59,400
And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
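To put a number on that distortion, here is a small sketch. The `apparent_ratio` helper and the values 50 and 55 are made up for illustration; the function just compares the drawn bar lengths for a given axis baseline.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths (b over a) when the axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (b - baseline) / (a - baseline)

# Hypothetical values (illustrative only).
a, b = 50.0, 55.0
true_ratio = apparent_ratio(a, b)             # axis starts at zero
truncated_ratio = apparent_ratio(a, b, 45.0)  # axis starts at 45

print(f"zero baseline: bar b is {true_ratio:.2f}x the length of bar a")       # 1.10x
print(f"baseline at 45: bar b is {truncated_ratio:.2f}x the length of bar a")  # 2.00x
```

The underlying values differ by 10 percent, but with the axis starting at 45 the second bar is drawn twice as long as the first.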
90
00:08:59,400 --> 00:09:01,860
There are also ways in which we can violate conventions.
91
00:09:01,860 --> 00:09:09,210
So in the first video I showed you a chart that violated the convention that the x axis goes in order.
92
00:09:09,210 --> 00:09:16,320
If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
93
00:09:16,320 --> 00:09:24,240
Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
94
00:09:24,240 --> 00:09:32,570
like you assimilate how to read written text. And if those expectations are violated, that can
95
00:09:32,570 --> 00:09:37,490
lead the user to incorrect conclusions from our charts, from our presentation.
96
00:09:37,490 --> 00:09:42,920
A key thing to remember here, which also applies to all of our presentations:
97
00:09:42,920 --> 00:09:52,350
Research isn't a mystery novel. You don't have to worry about spoiling the surprise ending. The goal here is
98
00:09:52,350 --> 00:09:56,100
not to subvert tropes or present shocking new presentations.
99
00:09:56,100 --> 00:10:00,470
We might have shocking new evidence, but from a presentation perspective,
100
00:10:00,470 --> 00:10:08,040
we want it to fit within conventions and not violate readers' expectations unnecessarily, so that
101
00:10:08,040 --> 00:10:14,370
they can read it and be confident that they've correctly understood what it is that you're saying.
102
00:10:14,370 --> 00:10:18,990
Another thing to be aware of is that graphics can illustrate an effect.
103
00:10:18,990 --> 00:10:27,060
They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
104
00:10:27,060 --> 00:10:33,660
We have to be careful about that. We'll talk about some of the pitfalls. We have to be careful about combining them:
105
00:10:33,660 --> 00:10:39,420
we can't combine exploratory and what's called confirmatory analysis. But they can help us.
106
00:10:39,420 --> 00:10:45,990
Visualizing data can help us look for possible effects and get ideas for what to go look for next.
107
00:10:45,990 --> 00:10:53,530
But they're not conclusive proof of an effect. We need the numeric results:
108
00:10:53,530 --> 00:11:00,030
the raw numbers, as well as the results of inferential techniques that let us
109
00:11:00,030 --> 00:11:06,090
estimate how big an effect is and whether it's significant in order to come to any conclusions.
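As one illustration of such an inferential technique, here is a rough sketch of a percentile bootstrap confidence interval for a mean paired difference. It is not the analysis behind any chart in this video: the error values, the names `errors_a` and `errors_b`, and the `bootstrap_ci` helper are all made up for the example.

```python
import random
from statistics import mean

# Hypothetical paired per-user error scores for two algorithms (made-up numbers).
errors_a = [0.91, 0.85, 1.02, 0.78, 0.95, 0.88, 1.10, 0.83, 0.97, 0.90]
errors_b = [0.89, 0.86, 0.99, 0.80, 0.93, 0.88, 1.07, 0.84, 0.95, 0.91]

def bootstrap_ci(diffs, n_resamples=5000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

diffs = [a - b for a, b in zip(errors_a, errors_b)]
low, high = bootstrap_ci(diffs)
print(f"mean difference = {mean(diffs):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
# If the interval includes zero, a gap that is visible in a chart may just be noise.
```

The chart shows that one mark sits to the side of another; an interval like this is what tells us how big the gap plausibly is and whether it could be zero.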
110
00:11:06,090 --> 00:11:13,920
So in this chart, for example, if we look closely,
111
00:11:13,920 --> 00:11:20,790
we have these two data points, the little blue Xs, and the FunkSVD X
112
00:11:20,790 --> 00:11:26,160
is to the left of the item-item X. For this particular metric,
113
00:11:26,160 --> 00:11:32,900
lower is better, because it's an error metric; root mean squared error is what RMSE stands for.
114
00:11:32,900 --> 00:11:38,520
So it looks like, OK, this is a little bit better. But that's not sufficient evidence for us
115
00:11:38,520 --> 00:11:44,220
to conclude that FunkSVD is better than item-item on the per-user RMSE metric.
116
00:11:44,220 --> 00:11:48,900
Exactly what all these things are is a topic for another day.
117
00:11:48,900 --> 00:11:55,230
But the fact that we see the thing to the left means that, if the effect is real, this illustrates it.
118
00:11:55,230 --> 00:12:02,220
But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy;
119
00:12:02,220 --> 00:12:07,120
it's a relatively small difference. So, graphics help us see.
120
00:12:07,120 --> 00:12:11,620
They help us communicate. They're not definitive and conclusive proof.
121
00:12:11,620 --> 00:12:16,660
A couple of other things I want to highlight that are going on in this graph. I've introduced two different kinds of symbols here.
122
00:12:16,660 --> 00:12:19,930
So in the earlier graph, we just had one kind of symbol. We had dots. Here,
123
00:12:19,930 --> 00:12:24,580
we have two different kinds, with a legend that says red circles
124
00:12:24,580 --> 00:12:30,930
are a thing called global RMSE and blue Xs are a thing called per-user RMSE.
125
00:12:30,930 --> 00:12:37,920
You don't have to understand what those are. But the point is, I'm using different colors and shapes in order
126
00:12:37,920 --> 00:12:45,750
to show different versions of a thing in the same chart. Using different shapes
127
00:12:45,750 --> 00:12:50,430
in addition to different colors is useful because, if it's printed on a black-and-white printer
128
00:12:50,430 --> 00:12:57,270
or if the reader has some form of color blindness, it helps make the differences clearer.
129
00:12:57,270 --> 00:13:05,070
In addition, I've got my y axis, which indicates the different things that I'm plotting here.
130
00:13:05,070 --> 00:13:11,430
I also have grouped them, just to make it easier for the user to see that
131
00:13:11,430 --> 00:13:15,330
these are the same kind of thing. These first ones are all single algorithms.
132
00:13:15,330 --> 00:13:20,920
And then we have a blend and a few other things. The details
133
00:13:20,920 --> 00:13:27,220
aren't important here, but it helps guide the user to understand the structure: we have these group breakdowns.
134
00:13:27,220 --> 00:13:32,680
It also helps save space in the paper because I can present all of these different things in one place.
135
00:13:32,680 --> 00:13:39,730
It's easy to compare the different stages, even though I have to split them up in the discussion in the paper.
136
00:13:39,730 --> 00:13:46,540
But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
137
00:13:46,540 --> 00:13:51,880
So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it.
138
00:13:51,880 --> 00:13:54,160
They don't replace our numerical analysis,
139
00:13:54,160 --> 00:14:01,570
but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
140
00:14:01,570 --> 00:14:06,340
We do, however, always need to make sure that we clearly label and describe our graphics so that
141
00:14:06,340 --> 00:14:15,300
readers can understand them and they can draw correct conclusions from them.