- 
Not Synced 1
 00:00:04,540 --> 00:00:10,180
 Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
 
- 
Not Synced 2
 00:00:10,180 --> 00:00:13,660
 I want you to be able to understand the value of graphics for presenting data,
 
- 
Not Synced 3
 00:00:13,660 --> 00:00:19,540
 identify parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid.
 
- 
Not Synced 4
 00:00:19,540 --> 00:00:25,300
 So here's an example of a chart, and there's a variety of different pieces of this chart.
 
- 
Not Synced 5
 00:00:25,300 --> 00:00:31,900
 We have an x axis. That's the horizontal x axis.
 
- 
Not Synced 6
 00:00:31,900 --> 00:00:41,820
 We have a y axis, the vertical axis. Each of these axes has a label.
 
- 
Not Synced 7
 00:00:41,820 --> 00:00:52,610
 Task 2 and Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
 
- 
Not Synced 8
 00:00:52,610 --> 00:00:59,660
 And it says that this graph is showing the number of queries per task, with query count distributions in the margins.
 
- 
Not Synced 9
 00:00:59,660 --> 00:01:04,070
 And each dot is one participant. So it tells us what each data point in it represents.
 
- 
Not Synced 10
 00:01:04,070 --> 00:01:09,020
 It also tells us what it is that we're charting: the number of queries per task.
 
- 
Not Synced 11
 00:01:09,020 --> 00:01:14,760
 We then see in the axis labels that we have Task 1 and we have Task 2.
 
- 
Not Synced 12
 00:01:14,760 --> 00:01:19,500
 Those two together give us the context to understand that, oh, we have two tasks.
 
- 
Not Synced 13
 00:01:19,500 --> 00:01:24,000
 And each of these dots is one participant.
 
- 
Not Synced 14
 00:01:24,000 --> 00:01:28,620
 And they're appearing at the point given by their Task 1 count and their Task 2 count.
 
- 
Not Synced 15
 00:01:28,620 --> 00:01:34,530
 OK, this allows us to see if there's any relationship between how long it took...
 
- 
Not Synced 16
 00:01:34,530 --> 00:01:40,800
 How many queries it took to complete the two different tasks. It then says we have query count distributions in the margins.
 
- 
Not Synced 17
 00:01:40,800 --> 00:01:46,050
 So this is a compound plot, with extra plots in the margins of the x and y axes.
 
- 
Not Synced 18
 00:01:46,050 --> 00:01:52,530
 We have the distribution, a histogram of the x axis, Task 1.
 
- 
Not Synced 19
 00:01:52,530 --> 00:01:59,560
 We have a histogram of the y axis, Task 2. These histograms don't have axes themselves because,
 
- 
Not Synced 20
 00:01:59,560 --> 00:02:04,680
 well, we just wanted to show a distribution, particularly for our purposes here.
 
- 
Not Synced 21
 00:02:04,680 --> 00:02:16,980
 The exact number in each bin is not so important. The key thing is just being able to see relatively where the mass is.
 
- 
Not Synced 22
 00:02:16,980 --> 00:02:21,150
 Where is the mass on the two different task counts?
 
- 
Not Synced 23
 00:02:21,150 --> 00:02:27,620
 And we can see that both of them have a right skew.
 
- 
Not Synced 24
 00:02:27,620 --> 00:02:32,780
 They're bulked up towards the low end of the scale.
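
As a rough sketch of what a marginal histogram summarizes, the following Python bins a small set of query counts and reports the fraction of values in each bin. The counts are made up for illustration and are not the data in the chart; the helper name `bin_fractions` and the bin width of 4 are also assumptions.

```python
from collections import Counter
from statistics import mean, median


def bin_fractions(values, width):
    """Return {bin_start: fraction of values in [bin_start, bin_start + width)}."""
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    return {start: counts[start] / total for start in sorted(counts)}


# Hypothetical query counts for one task (made up for illustration).
task_counts = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14]

fractions = bin_fractions(task_counts, width=4)
for start, frac in fractions.items():
    # Draw a crude text bar whose length is proportional to the fraction.
    print(f"{start:>2}-{start + 3:<2} {'#' * round(frac * 20)} {frac:.2f}")

# With a right skew, the long tail pulls the mean above the median.
print("mean:", mean(task_counts), "median:", median(task_counts))
```

Most of the mass lands in the lowest bin, and the mean sits above the median, which is the numeric counterpart of seeing the histogram bulked up at the low end.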
 
- 
Not Synced 25
 00:02:32,780 --> 00:02:41,120
 And then we have all of the individual data points scattered on the chart.
 
- 
Not Synced 26
 00:02:41,120 --> 00:02:46,670
 This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
 
- 
Not Synced 27
 00:02:46,670 --> 00:02:55,400
 And particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces.
 
- 
Not Synced 28
 00:02:55,400 --> 00:02:59,300
 What is your x axis? What is your y axis?
 
- 
Not Synced 29
 00:02:59,300 --> 00:03:08,600
 Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart?
 
- 
Not Synced 30
 00:03:08,600 --> 00:03:16,280
 So charts are really useful for revealing a variety of things. They can reveal patterns, or the lack thereof.
 
- 
Not Synced 31
 00:03:16,280 --> 00:03:19,200
 In this chart, there's really not much of a pattern.
 
- 
Not Synced 32
 00:03:19,200 --> 00:03:26,700
 And we can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern.
 
- 
Not Synced 33
 00:03:26,700 --> 00:03:34,560
 The participant with the most queries for Task 1
 
- 
Not Synced 34
 00:03:34,560 --> 00:03:38,760
 Has a middling to low number of queries for Task 2.
 
- 
Not Synced 35
 00:03:38,760 --> 00:03:47,790
 And the one who has the most queries for Task 2, while they're in the upper end of the queries per task on Task 1, isn't the highest there.
 
- 
Not Synced 36
 00:03:47,790 --> 00:03:53,520
 They're not at all the highest. So we can see there's not very much of a relationship here.
 
- 
Not Synced 37
 00:03:53,520 --> 00:03:59,490
 At least, it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars.
 
- 
Not Synced 38
 00:03:59,490 --> 00:04:08,220
 We can see where points lie. We can see, in that chart that we just saw, that the highest number of counts for one task
 
- 
Not Synced 39
 00:04:08,220 --> 00:04:12,210
 and the highest number of counts for the other task are different. We can also see trends.
 
- 
Not Synced 40
 00:04:12,210 --> 00:04:14,610
 We can see if a line looks like it goes up or down,
 
- 
Not Synced 41
 00:04:14,610 --> 00:04:22,950
 or wiggles around. So they can reveal a lot of these kinds of things, and they can really leverage our human perception,
 
- 
Not Synced 42
 00:04:22,950 --> 00:04:25,080
 particularly our human visual senses,
 
- 
Not Synced 43
 00:04:25,080 --> 00:04:35,820
 to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart,
 
- 
Not Synced 44
 00:04:35,820 --> 00:04:41,520
 we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart,
 
- 
Not Synced 45
 00:04:41,520 --> 00:04:45,720
 they need to be able to understand what each point in the chart is going to be.
 
- 
Not Synced 46
 00:04:45,720 --> 00:04:56,530
 They need to understand what values are plotted on the axes.
 
- 
Not Synced 47
 00:04:56,530 --> 00:05:01,260
 Often this is done in an axis label. In the chart I showed you,
 
- 
Not Synced 48
 00:05:01,260 --> 00:05:06,040
 the caption said what the values were, and the axis labels said which version of them they were.
 
- 
Not Synced 49
 00:05:06,040 --> 00:05:08,560
 If there are units, those need to be clear.
 
- 
Not Synced 50
 00:05:08,560 --> 00:05:19,840
 So if you've got something that's in millimeters, pounds, megabytes, whatever, you need to specify the units in your chart,
 
- 
Not Synced 51
 00:05:19,840 --> 00:05:29,320
 either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart, such as a histogram.
 
- 
Not Synced 52
 00:05:29,320 --> 00:05:32,440
 If you've got a fraction or a percentage on the left-hand side,
 
- 
Not Synced 53
 00:05:32,440 --> 00:05:41,260
 it's standard convention that we're talking about the fraction of the values that are in each bin,
 
- 
Not Synced 54
 00:05:41,260 --> 00:05:49,930
 at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about
 
- 
Not Synced 55
 00:05:49,930 --> 00:05:57,090
 what a value or an axis label is, or there's any doubt that the reader will understand what it is,
 
- 
Not Synced 56
 00:05:57,090 --> 00:06:04,230
 be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
 
- 
Not Synced 57
 00:06:04,230 --> 00:06:08,640
 You can assume a reasonable level of background; you have to know your audience for this.
 
- 
Not Synced 58
 00:06:08,640 --> 00:06:16,200
 But someone should be able to just look at the chart with its immediately surrounding description,
 
- 
Not Synced 59
 00:06:16,200 --> 00:06:20,670
 the labels, the caption, and have a pretty good idea of what's going on.
 
- 
Not Synced 60
 00:06:20,670 --> 00:06:25,990
 The surrounding text, the text that references the chart if you're writing a document,
 
- 
Not Synced 61
 00:06:25,990 --> 00:06:30,190
 that can have your observations; that can provide more context and clarity.
 
- 
Not Synced 62
 00:06:30,190 --> 00:06:35,950
 But someone just looking at the chart should be able to figure out basically what's going on and
 
- 
Not Synced 63
 00:06:35,950 --> 00:06:41,080
 not be too far off. This is particularly important because there's a lot of people,
 
- 
Not Synced 64
 00:06:41,080 --> 00:06:46,150
 whether this is a good or a bad practice, we can debate. But there's a lot of people who, when they're reading a paper,
 
- 
Not Synced 65
 00:06:46,150 --> 00:06:52,000
 they focus on the charts and look at the key charts first to see what it is that's going on in the paper.
 
- 
Not Synced 66
 00:06:52,000 --> 00:07:01,990
 And if our charts are self-explanatory and clear, that makes it a lot easier for people to glance at our work,
 
- 
Not Synced 67
 00:07:01,990 --> 00:07:09,370
 see what it's doing and decide whether they are going to pay it further attention.
 
- 
Not Synced 68
 00:07:09,370 --> 00:07:12,970
 So if you're putting a chart in a written document or a paper,
 
- 
Not Synced 69
 00:07:12,970 --> 00:07:18,640
 each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
 
- 
Not Synced 70
 00:07:18,640 --> 00:07:18,790
 Like,
 
- 
Not Synced 71
 00:07:18,790 --> 00:07:26,410
 it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
 
- 
Not Synced 72
 00:07:26,410 --> 00:07:31,800
 what precisely some of the computations are, etc. In this context, we often don't need a title for the chart.
 
- 
Not Synced 73
 00:07:31,800 --> 00:07:37,900
 So if we have a caption, we need to label our axes, but we don't need a title for the chart itself all the time.
 
- 
Not Synced 74
 00:07:37,900 --> 00:07:40,840
 It doesn't hurt, but often it's redundant with the caption.
 
- 
Not Synced 75
 00:07:40,840 --> 00:07:46,690
 In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation,
 
- 
Not Synced 76
 00:07:46,690 --> 00:07:50,230
 or we have a chart in one of our notebooks. A title is often helpful. In notebooks,
 
- 
Not Synced 77
 00:07:50,230 --> 00:07:52,180
 the surrounding text may be sufficient,
 
- 
Not Synced 78
 00:07:52,180 --> 00:07:59,920
 but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
 
- 
Not Synced 79
 00:07:59,920 --> 00:08:04,570
 So, there are a few pitfalls to be aware of when we're thinking about statistical graphics.
 
- 
Not Synced 80
 00:08:04,570 --> 00:08:10,930
 One is distorting the distances or the differences that are happening. In particular,
 
- 
Not Synced 81
 00:08:10,930 --> 00:08:13,600
 we need to make sure that if something has a length,
 
- 
Not Synced 82
 00:08:13,600 --> 00:08:21,040
 anything that has a length, that length should accurately represent quantities. With position, it's relative position.
 
- 
Not Synced 83
 00:08:21,040 --> 00:08:26,800
 If you have two dots, their relative position is what's important. But if we have a length, if we have a bar, it has a length.
 
- 
Not Synced 84
 00:08:26,800 --> 00:08:30,700
 It also has an area. We need to make sure those accurately represent quantities.
 
- 
Not Synced 85
 00:08:30,700 --> 00:08:37,140
 One really common way to violate this is having a bar chart whose axis starts at something other than zero.
 
- 
Not Synced 86
 00:08:37,140 --> 00:08:41,670
 The software we're using doesn't do that by default. Excel does.
 
- 
Not Synced 87
 00:08:41,670 --> 00:08:46,860
 But your bar chart always needs to start at zero, because
 
- 
Not Synced 88
 00:08:46,860 --> 00:08:51,540
 people don't look at the relative position of the bar. People see the whole height of the bar.
 
- 
Not Synced 89
 00:08:51,540 --> 00:08:59,400
 And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
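
The distortion can be put into numbers with a little arithmetic. The sketch below compares the ratio of the drawn bar heights when the value axis starts at zero with the ratio when it starts somewhere else. The function name `apparent_ratio` and the two bar values are hypothetical, chosen only for illustration.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights when the value axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical values for two bars (made up for illustration).
low, high = 90.0, 100.0

print(f"true ratio: {apparent_ratio(high, low):.2f}")
print(f"ratio with axis starting at 80: {apparent_ratio(high, low, baseline=80.0):.2f}")
```

With these made-up values the true ratio is 1.11, but with the axis starting at 80 the taller bar is drawn twice as high as the shorter one (2.00), so a difference of about 11 percent looks like a doubling.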
 
- 
Not Synced 90
 00:08:59,400 --> 00:09:01,860
 There are also ways in which we can violate conventions.
 
- 
Not Synced 91
 00:09:01,860 --> 00:09:09,210
 So in the first video I showed you a chart that violated the convention that the x axis goes in order.
 
- 
Not Synced 92
 00:09:09,210 --> 00:09:16,320
 If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
 
- 
Not Synced 93
 00:09:16,320 --> 00:09:24,240
 Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
 
- 
Not Synced 94
 00:09:24,240 --> 00:09:32,570
 like you assimilate how to read written text. And if those expectations are violated, that can
 
- 
Not Synced 95
 00:09:32,570 --> 00:09:37,490
 lead the user to incorrect conclusions from our charts, from our presentation.
 
- 
Not Synced 96
 00:09:37,490 --> 00:09:42,920
 A key thing to remember here, that also applies to all of our presentations:
 
- 
Not Synced 97
 00:09:42,920 --> 00:09:52,350
 Research isn't a mystery novel. You don't have to worry about spoiling the surprise or the ending. The goal here is
 
- 
Not Synced 98
 00:09:52,350 --> 00:09:56,100
 not to subvert tropes or present shocking new presentations.
 
- 
Not Synced 99
 00:09:56,100 --> 00:10:00,470
 We might have shocking new evidence, but from a presentation perspective,
 
- 
Not Synced 100
 00:10:00,470 --> 00:10:08,040
 we want it to fit within conventions and not violate readers' expectations unnecessarily, so that
 
- 
Not Synced 101
 00:10:08,040 --> 00:10:14,370
 they can read it and be confident that they've correctly understood what it is that you're saying.
 
- 
Not Synced 102
 00:10:14,370 --> 00:10:18,990
 Another thing to be aware of is that graphics can illustrate an effect.
 
- 
Not Synced 103
 00:10:18,990 --> 00:10:27,060
 They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
 
- 
Not Synced 104
 00:10:27,060 --> 00:10:33,660
 We have to be careful about that. We'll talk about some of the pitfalls. We have to be careful when we're combining:
 
- 
Not Synced 105
 00:10:33,660 --> 00:10:39,420
 We can't combine exploratory and what's called confirmatory analysis, but they can help us.
 
- 
Not Synced 106
 00:10:39,420 --> 00:10:45,990
 Visualizing data can help us look for possible effects and get ideas for what to go look for next.
 
- 
Not Synced 107
 00:10:45,990 --> 00:10:53,530
 But they're not conclusive proof of an effect. We need the numeric results:
 
- 
Not Synced 108
 00:10:53,530 --> 00:11:00,030
 the raw numbers, as well as the results of inferential techniques that let us
 
- 
Not Synced 109
 00:11:00,030 --> 00:11:06,090
 estimate how big an effect is and whether it's significant in order to come to any conclusions.
 
- 
Not Synced 110
 00:11:06,090 --> 00:11:13,920
 So in this chart I want to show you, for example, if we look at the chart closely,
 
- 
Not Synced 111
 00:11:13,920 --> 00:11:20,790
 we have these two data points, the little blue Xs, and the FunkSVD X
 
- 
Not Synced 112
 00:11:20,790 --> 00:11:26,160
 is to the left of the Item-Item X. So for this particular metric,
 
- 
Not Synced 113
 00:11:26,160 --> 00:11:32,900
 lower is a better value, because it's an error metric. Root mean squared error is what RMSE stands for.
 
- 
Not Synced 114
 00:11:32,900 --> 00:11:38,520
 So it looks like, OK, this is a little bit better. But that's not sufficient evidence for us
 
- 
Not Synced 115
 00:11:38,520 --> 00:11:44,220
 to conclude that FunkSVD is better than Item-Item on the per-user RMSE metric.
 
- 
Not Synced 116
 00:11:44,220 --> 00:11:48,900
 Exactly what all these things are is a topic for another day.
 
- 
Not Synced 117
 00:11:48,900 --> 00:11:55,230
 But we see the thing to the left, and if the effect is real, this illustrates it.
 
- 
Not Synced 118
 00:11:55,230 --> 00:12:02,220
 But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy;
 
- 
Not Synced 119
 00:12:02,220 --> 00:12:07,120
 it's a relatively small difference. So, they help us see.
 
- 
Not Synced 120
 00:12:07,120 --> 00:12:11,620
 They help us communicate. They're not definitive and conclusive proof.
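
To make the distinction concrete, here is a minimal sketch of one possible inferential check, a sign-flip permutation test on paired differences. This is an assumption chosen for illustration, not the technique used in the lecture or the paper, and the per-user error values and the names `algo_a`, `algo_b`, and `paired_sign_flip_test` are all made up.

```python
import random
from statistics import mean


def paired_sign_flip_test(xs, ys, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    Returns (observed mean difference, approximate p-value).
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(xs, ys)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the labels within each pair are
        # exchangeable, so each difference is equally likely to have either sign.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical per-user error values for two algorithms (made up for illustration).
algo_a = [0.91, 0.85, 1.02, 0.78, 0.95, 0.88, 1.10, 0.80]
algo_b = [0.93, 0.84, 1.05, 0.80, 0.94, 0.91, 1.12, 0.79]

observed, p_value = paired_sign_flip_test(algo_a, algo_b)
print(f"mean difference: {observed:.4f}, approximate p-value: {p_value:.3f}")
```

The chart would show the first algorithm slightly to the left of the second; the test asks how often a gap at least that large would show up if the two were really interchangeable.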
 
- 
Not Synced 121
 00:12:11,620 --> 00:12:16,660
 A couple of other things I want to highlight that are going on in this graph. I've introduced two different kinds of symbols here.
 
- 
Not Synced 122
 00:12:16,660 --> 00:12:19,930
 So in the earlier graph, we just had one kind of symbol. We had dots. Here,
 
- 
Not Synced 123
 00:12:19,930 --> 00:12:24,580
 we have two different kinds, with a legend that says red circles
 
- 
Not Synced 124
 00:12:24,580 --> 00:12:30,930
 are a thing called global RMSE and blue Xs are a thing called per-user RMSE.
 
- 
Not Synced 125
 00:12:30,930 --> 00:12:37,920
 You don't have to understand what those are. But the point is, I'm using different colors and shapes in order to communicate,
 
- 
Not Synced 126
 00:12:37,920 --> 00:12:45,750
 to show different versions of a thing in the same chart. Using different shapes
 
- 
Not Synced 127
 00:12:45,750 --> 00:12:50,430
 in addition to different colors is useful because, if someone prints it on a black and white printer,
 
- 
Not Synced 128
 00:12:50,430 --> 00:12:57,270
 or if they have some form of color blindness, it helps make the differences clearer.
 
- 
Not Synced 129
 00:12:57,270 --> 00:13:05,070
 In addition, I've got my y axis, which is indicating the different things that I'm plotting here.
 
- 
Not Synced 130
 00:13:05,070 --> 00:13:11,430
 I also have grouped them just to make it easier for the user to see.
 
- 
Not Synced 131
 00:13:11,430 --> 00:13:15,330
 These are the same kind; these first ones are all single algorithms.
 
- 
Not Synced 132
 00:13:15,330 --> 00:13:20,920
 And then we have a blend and a few other things. The details
 
- 
Not Synced 133
 00:13:20,920 --> 00:13:27,220
 aren't important for this illustration, but it helps guide the user to understand the structure. So we have these group breakdowns.
 
- 
Not Synced 134
 00:13:27,220 --> 00:13:32,680
 It also helps save space in the paper because I can present all of these different things in one place.
 
- 
Not Synced 135
 00:13:32,680 --> 00:13:39,730
 It's easy to compare the different stages, even though I have to split them out in the discussion in the paper.
 
- 
Not Synced 136
 00:13:39,730 --> 00:13:46,540
 But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
 
- 
Not Synced 137
 00:13:46,540 --> 00:13:51,880
 So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it.
 
- 
Not Synced 138
 00:13:51,880 --> 00:13:54,160
 They don't replace our numerical analysis,
 
- 
Not Synced 139
 00:13:54,160 --> 00:14:01,570
 but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
 
- 
Not Synced 140
 00:14:01,570 --> 00:14:06,340
 We do, however, always need to make sure that we clearly label and describe our graphics so that
 
- 
Not Synced 141
 00:14:06,340 --> 00:14:15,300
 readers can understand them and they can draw correct conclusions from them.
 