1 00:00:04,540 --> 00:00:10,180 Welcome in this video, I'm going to start introducing the basic concepts of statistical graphics. 2 00:00:10,180 --> 00:00:13,660 I want you to be able to understand the value of graphics for presenting data, 3 00:00:13,660 --> 00:00:19,540 identify parts of a statistical image, and understand some pitfalls and graphics that we want to try to avoid. 4 00:00:19,540 --> 00:00:25,300 So here's an example of a chart, and there's a variety of different pieces of this chart. 5 00:00:25,300 --> 00:00:31,900 We have an x axis. That's the horizontal x axis. 6 00:00:31,900 --> 00:00:41,820 We have a y axis, the vertical axis. Each of these axes has a label. 7 00:00:41,820 --> 00:00:52,610 Task to task one. We have a caption up at the top that explains what's going on in the image, provides us with the context to understand it. 8 00:00:52,610 --> 00:00:59,660 And it says that this graph is showing the number of queries per task with query account distributions in the margins. 9 00:00:59,660 --> 00:01:04,070 And each dot is one participant. So it tells us we have a data point in it. 10 00:01:04,070 --> 00:01:09,020 What is that? It tells us what it is that we're charting. Number of queries per task. 11 00:01:09,020 --> 00:01:14,760 When we then see the axis labels that we have task one and we have task to. 12 00:01:14,760 --> 00:01:19,500 Those two together, give us the context, understand that. Oh, we have two tasks. 13 00:01:19,500 --> 00:01:24,000 And this is why of one participant. 14 00:01:24,000 --> 00:01:28,620 And they're appearing at the point where they have their task, one count on their task, two counts. 15 00:01:28,620 --> 00:01:34,530 OK, this allows us to to see if there's any relationship between how long it took to. 16 00:01:34,530 --> 00:01:40,800 How many queries it took to complete the two different tasks. It then says we have query count distribution than the margin. 17 00:01:40,800 --> 00:01:46,050 So this is a compound plot. And in the left and right, margins are in the X and Y margins. 18 00:01:46,050 --> 00:01:52,530 We have the distribution, a histogram of the X axis, the task one. 19 00:01:52,530 --> 00:01:59,560 We have a histogram of the Y axis task to these histograms don't have axes themselves because, 20 00:01:59,560 --> 00:02:04,680 well, we just wanted to show a distribution the exact particularly for our purposes here. 21 00:02:04,680 --> 00:02:16,980 The exact number in each bean is not so important. The key thing is just see being able to see relatively where is the mass of the different? 22 00:02:16,980 --> 00:02:21,150 Where is the mass on the two different task counts? 23 00:02:21,150 --> 00:02:27,620 And we can see that both of them have a a right skew. 24 00:02:27,620 --> 00:02:32,780 They're bulked up towards towards the low end of the scale. 25 00:02:32,780 --> 00:02:41,120 And then we have the all the we have all of the individual data points scat on the chart. 26 00:02:41,120 --> 00:02:46,670 This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify. 27 00:02:46,670 --> 00:02:55,400 And when you go particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces. 28 00:02:55,400 --> 00:02:59,300 What is your x axis? What is your y axis? 29 00:02:59,300 --> 00:03:08,600 Before you even start the chart, you need to set up your data so that we have what is the data point that I'm going to be plotting on this chart? 30 00:03:08,600 --> 00:03:16,280 So charts can are really useful for revealing a variety of things that can reveal patterns or lack there of. 31 00:03:16,280 --> 00:03:19,200 In this chart, there's really not much of a pattern. 32 00:03:19,200 --> 00:03:26,700 And we can see that it's booked up, but particularly if we get out to that larger number of tasks, there's not a lot of pattern. 33 00:03:26,700 --> 00:03:34,560 The one with the participant, with the most tasks and with the most tasks or queries for task one. 34 00:03:34,560 --> 00:03:38,760 Has a middling to low number of queries for Task two. 35 00:03:38,760 --> 00:03:47,790 And the the one who has the most queries for task to while they're in the upper end of of the queries per task on task one, they're biased. 36 00:03:47,790 --> 00:03:53,520 They're not at all the highest. So we can see there's not a not very much of a relationship here. 37 00:03:53,520 --> 00:03:59,490 At least that doesn't look like one. They can be useful for comparisons. If we've got a bar chart, we can compare to bars. 38 00:03:59,490 --> 00:04:08,220 We can see where points lay. We can see like we can see in that chart that we just saw that the the highest number of counts for one task, 39 00:04:08,220 --> 00:04:12,210 the highest number of counts for another task or different. We can also see trends. 40 00:04:12,210 --> 00:04:14,610 We can see if a line looks like it goes up or down, 41 00:04:14,610 --> 00:04:22,950 wiggles around so they can reveal a lot of these kinds of things and they can really leverage our human perception and our human, 42 00:04:22,950 --> 00:04:25,080 particularly our human visual senses, 43 00:04:25,080 --> 00:04:35,820 to be able to quickly internalize and understand what is going on in in a set of data when we're creating a chart. 44 00:04:35,820 --> 00:04:41,520 We need to clearly document a few things. We you clearly state what is being presented when someone looks at a chart. 45 00:04:41,520 --> 00:04:45,720 They need to be able to understand what each point in the chart is going to be. 46 00:04:45,720 --> 00:04:56,530 They need to understand what values are plotted on the axis. They need to understand what values are plotted on the axes. 47 00:04:56,530 --> 00:05:01,260 Often this is done in an axis label in our in the chart I showed you, 48 00:05:01,260 --> 00:05:06,040 it said the values in the caption in the axis labels said which version of them they were. 49 00:05:06,040 --> 00:05:08,560 If there are units, that needs to be clear. 50 00:05:08,560 --> 00:05:19,840 So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your in your chart, 51 00:05:19,840 --> 00:05:29,320 either in the Axis label or in the caption, some of these things can sometimes be implicit in the type of chart, such as a histogram. 52 00:05:29,320 --> 00:05:32,440 And you've got a fraction or a percentage in the left hand side. 53 00:05:32,440 --> 00:05:41,260 It's standard convention that we're talking about, the fraction of the values that are in each bin, 54 00:05:41,260 --> 00:05:49,930 at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about. 55 00:05:49,930 --> 00:05:57,090 What a value, what an axis label is. Or there's any doubt that the reader will understand what it is. 56 00:05:57,090 --> 00:06:04,230 Be explicit, explicitly, say what's going on in your chart. That also the chart in the caption should be interpretable on their own. 57 00:06:04,230 --> 00:06:08,640 You can assume a reasonable level. You have to know your audience for this. 58 00:06:08,640 --> 00:06:16,200 But someone should be able to just look at the chart with its immediately surrounding description, 59 00:06:16,200 --> 00:06:20,670 the labels, the caption and understand have a pretty good idea of what's going on. 60 00:06:20,670 --> 00:06:25,990 The surrounding text with the text that references the chart if you're writing a document. 61 00:06:25,990 --> 00:06:30,190 That can have your observations, that can provide more context and clarity. 62 00:06:30,190 --> 00:06:35,950 But someone just looking at the chart should be able to figure out basically what's going on and 63 00:06:35,950 --> 00:06:41,080 not be too far off of this is particularly important because there's a there's a lot of people, 64 00:06:41,080 --> 00:06:46,150 whether this is a good or a bad practice, we can debate. But there's a lot of people who, when they're reading a paper, 65 00:06:46,150 --> 00:06:52,000 they focus on the charts and look at the key charts first to see what it is that's going on in the paper. 66 00:06:52,000 --> 00:07:01,990 And if our if our charts are self-explanatory and are clear that it makes a lot easier for people to glance at our work, 67 00:07:01,990 --> 00:07:09,370 see what it's doing and decide whether they are going to pay it further attention. 68 00:07:09,370 --> 00:07:12,970 So in a paper, if you're putting a chart in a docket, a written document or a paper, 69 00:07:12,970 --> 00:07:18,640 each figure should have a caption and the caption can it labels the figure and it can also provide interpretive guidance. 70 00:07:18,640 --> 00:07:18,790 Like, 71 00:07:18,790 --> 00:07:26,410 it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology, 72 00:07:26,410 --> 00:07:31,800 what precisely some of the computations are, etc. In other contexts, we often need a title for the charts. 73 00:07:31,800 --> 00:07:37,900 So if we have a caption we don't, we need to label our axes, but we don't need a title for the chart itself all the time. 74 00:07:37,900 --> 00:07:40,840 It doesn't hurt, but often it's redundant with the caption. 75 00:07:40,840 --> 00:07:46,690 In other contexts, though, we often do need a title such as when we have a chart that's going in a presentation. 76 00:07:46,690 --> 00:07:50,230 We have a chart in one of our notebooks. A title is often helpful in notebooks. 77 00:07:50,230 --> 00:07:52,180 The surrounding text may be sufficient, 78 00:07:52,180 --> 00:07:59,920 but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart. 79 00:07:59,920 --> 00:08:04,570 So a few pitfalls to be aware of when we're thinking about statistical graphics is 80 00:08:04,570 --> 00:08:10,930 one is distorting the distances or the differences that are happening particularly. 81 00:08:10,930 --> 00:08:13,600 We need to make sure if something has a length, 82 00:08:13,600 --> 00:08:21,040 anything that has a length that length should accurately represent quantities, position, relative position. 83 00:08:21,040 --> 00:08:26,800 If you have two dots, their relative position is what's important. But if we have a length of it, if we have a bar, it has a length. 84 00:08:26,800 --> 00:08:30,700 It also is an area we need to make sure those accurately represent quantities. 85 00:08:30,700 --> 00:08:37,140 One really common way to violate this is having a bar chart whose access starts at something other than zero. 86 00:08:37,140 --> 00:08:41,670 The software we're using doesn't do that by default. Excel does. 87 00:08:41,670 --> 00:08:46,860 But your bar chart always needs to start at zero because people are beat. 88 00:08:46,860 --> 00:08:51,540 People don't look at the relative position of the bar. People see the whole height of the bar. 89 00:08:51,540 --> 00:08:59,400 And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is. 90 00:08:59,400 --> 00:09:01,860 There's also ways in which we can violate conventions. 91 00:09:01,860 --> 00:09:09,210 So in the first video I showed you the chart that violated the convention, that the x axis goes in order. 92 00:09:09,210 --> 00:09:16,320 If we violate the user's expectations, they they'll either be confused by the chart or read it wrong. 93 00:09:16,320 --> 00:09:24,240 Statistical graphics in each particular type of chart have conventions that people who read a lot of them assimilate by long patterns of reading, 94 00:09:24,240 --> 00:09:32,570 like you assimilate how to read written text. And if those expect expectations are violated, that can. 95 00:09:32,570 --> 00:09:37,490 Lead the user to incorrect conclusions from our charts, from our presentation. 96 00:09:37,490 --> 00:09:42,920 A key thing to remember here that also applies to all of our presentations. 97 00:09:42,920 --> 00:09:52,350 Research isn't a mystery novel. You don't have to worry about spoiling the surprise or you end the goal here is not to present it, 98 00:09:52,350 --> 00:09:56,100 not to subvert tropes or present shocking new presentations. 99 00:09:56,100 --> 00:10:00,470 We might have shocking new evidence, but from a presentation perspective, 100 00:10:00,470 --> 00:10:08,040 we want it to fit within conventions and not violate readers expectations unnecessarily so that 101 00:10:08,040 --> 00:10:14,370 they can read it and be confident that they've correctly understood what it is that you're saying. 102 00:10:14,370 --> 00:10:18,990 Another thing to be aware of is that graphics can illustrate an effect. 103 00:10:18,990 --> 00:10:27,060 They can also help you find an effect. Like more exploring data. We can look at the graphics to see what effects we might be looking for. 104 00:10:27,060 --> 00:10:33,660 We have to be careful about that. We'll talk about some of the pitfalls of we have to be careful, more combining. 105 00:10:33,660 --> 00:10:39,420 We can't combine exploratory and what's called confirmatory analysis, but they can help us. 106 00:10:39,420 --> 00:10:45,990 Visualizing data can help us look for possible effects and get ideas for what to go look for next. 107 00:10:45,990 --> 00:10:53,530 But they're not conclusive proof of an effect. We need the numeric results, just the raw numerical. 108 00:10:53,530 --> 00:11:00,030 The raw numbers as well as the numeric result are the results of inferential techniques that let us 109 00:11:00,030 --> 00:11:06,090 estimate how big an effect is and whether it's significant in order to come to any conclusions. 110 00:11:06,090 --> 00:11:13,920 So in this chart, I want to show you, for example, we have if we look at the chart closely, 111 00:11:13,920 --> 00:11:20,790 we have these two data points and the little blue Xs in them, the func SVOD axis. 112 00:11:20,790 --> 00:11:26,160 As to the left of the item, item X. So it looks like for this particular metric, 113 00:11:26,160 --> 00:11:32,900 lower is a better value for it because it's an error metric root mean squared errors with RMX he stands for. 114 00:11:32,900 --> 00:11:38,520 But it looks like. Okay. This is a little bit better, but that's not sufficient evidence for us to include. 115 00:11:38,520 --> 00:11:44,220 To conclude that func SVOD is better than item item on the per user are masc metric. 116 00:11:44,220 --> 00:11:48,900 Exactly what all these things are is, is a topic for another day. 117 00:11:48,900 --> 00:11:55,230 But the fact that we see the thing to the left, that s that if the effect is real, this illustrates it. 118 00:11:55,230 --> 00:12:02,220 But seeing it's not enough for us to conclude that it outperforms because it might be a fluke of our experimental strategy, 119 00:12:02,220 --> 00:12:07,120 it's a relatively small difference. So. They help us see. 120 00:12:07,120 --> 00:12:11,620 They help us communicate. They're not definitive and conclusive proof. 121 00:12:11,620 --> 00:12:16,660 Couple of other things I want to highlight. They're going on in this graph. I've introduced two different kinds of symbols here. 122 00:12:16,660 --> 00:12:19,930 So the earlier graph, we just had one kind of symbol. We had dots here. 123 00:12:19,930 --> 00:12:24,580 We have two different kinds with a legend that says red circles are global. 124 00:12:24,580 --> 00:12:30,930 Oremus are a thing called global are masc and blue Xs are a thing called per user are MSE. 125 00:12:30,930 --> 00:12:37,920 Don't have to understand what those are. But the point is, I'm using different colors and shapes in order to communicate, 126 00:12:37,920 --> 00:12:45,750 to show different versions of a thing in the same chart, using different shapes. 127 00:12:45,750 --> 00:12:50,430 In addition to different colors is useful because it's so imprinted on a black and white printer. 128 00:12:50,430 --> 00:12:57,270 If they if they have some form of color blindness, it helps it make the differences clearer. 129 00:12:57,270 --> 00:13:05,070 I've also in addition so I've got my Y at my at Y axis, which is indicating different things that I'm plotting here. 130 00:13:05,070 --> 00:13:11,430 I also have grouped them just to make it easier for the user to see. 131 00:13:11,430 --> 00:13:15,330 These are the same like these. These first ones are all single algorithms. 132 00:13:15,330 --> 00:13:20,920 And then we have a blend and a few other things. The details are. 133 00:13:20,920 --> 00:13:27,220 Aren't important for illustrating them, but it helps guide the user to understand his structures to we have these group breakdowns. 134 00:13:27,220 --> 00:13:32,680 It also helps save space in the paper because I can present all of these different things in one place. 135 00:13:32,680 --> 00:13:39,730 It's easy to compare the different stages, even though I have to split the mountain, the discussion in the paper. 136 00:13:39,730 --> 00:13:46,540 But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart. 137 00:13:46,540 --> 00:13:51,880 So to wrap up graphics can make data clearer and they let us leverage human perception to understand it. 138 00:13:51,880 --> 00:13:54,160 They don't replace our numerical analysis, 139 00:13:54,160 --> 00:14:01,570 but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it. 140 00:14:01,570 --> 00:14:06,340 We do, however, always need to make sure that we clearly label and describe our graphics so that 141 00:14:06,340 --> 00:14:15,300 readers can understand them and they can draw correct conclusions from them.