WEBVTT 99:59:59.999 --> 99:59:59.999 1 00:00:04,540 --> 00:00:10,180 Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics. 99:59:59.999 --> 99:59:59.999 2 00:00:10,180 --> 00:00:13,660 I want you to be able to understand the value of graphics for presenting data, 99:59:59.999 --> 99:59:59.999 3 00:00:13,660 --> 00:00:19,540 identify parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid. 99:59:59.999 --> 99:59:59.999 4 00:00:19,540 --> 00:00:25,300 So here's an example of a chart, and there's a variety of different pieces of this chart. 99:59:59.999 --> 99:59:59.999 5 00:00:25,300 --> 00:00:31,900 We have an x axis. That's the horizontal axis. 99:59:59.999 --> 99:59:59.999 6 00:00:31,900 --> 00:00:41,820 We have a y axis, the vertical axis. Each of these axes has a label: 99:59:59.999 --> 99:59:59.999 7 00:00:41,820 --> 00:00:52,610 Task 2, Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it. 99:59:59.999 --> 99:59:59.999 8 00:00:52,610 --> 00:00:59,660 And it says that this graph is showing the number of queries per task, with query count distributions in the margins. 99:59:59.999 --> 99:59:59.999 9 00:00:59,660 --> 00:01:04,070 And each dot is one participant. So it tells us what a data point is. 99:59:59.999 --> 99:59:59.999 10 00:01:04,070 --> 00:01:09,020 It tells us what it is that we're charting: number of queries per task. 99:59:59.999 --> 99:59:59.999 11 00:01:09,020 --> 00:01:14,760 Then we see in the axis labels that we have Task 1 and we have Task 2. 99:59:59.999 --> 99:59:59.999 12 00:01:14,760 --> 00:01:19,500 Those two together give us the context to understand that, oh, we have two tasks. 99:59:59.999 --> 99:59:59.999 13 00:01:19,500 --> 00:01:24,000 And each dot is one participant.
99:59:59.999 --> 99:59:59.999 14 00:01:24,000 --> 00:01:28,620 And they appear at the point given by their Task 1 count and their Task 2 count. 99:59:59.999 --> 99:59:59.999 15 00:01:28,620 --> 00:01:34,530 OK, this allows us to see if there's any relationship between 99:59:59.999 --> 99:59:59.999 16 00:01:34,530 --> 00:01:40,800 how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins. 99:59:59.999 --> 99:59:59.999 17 00:01:40,800 --> 00:01:46,050 So this is a compound plot, and in the x and y margins 99:59:59.999 --> 99:59:59.999 18 00:01:46,050 --> 00:01:52,530 we have the distribution: a histogram of the x axis, Task 1. 99:59:59.999 --> 99:59:59.999 19 00:01:52,530 --> 00:01:59,560 We have a histogram of the y axis, Task 2. These histograms don't have axes themselves because, 99:59:59.999 --> 99:59:59.999 20 00:01:59,560 --> 00:02:04,680 well, we just wanted to show a distribution. Particularly for our purposes here, 99:59:59.999 --> 99:59:59.999 21 00:02:04,680 --> 00:02:16,980 the exact number in each bin is not so important. The key thing is just being able to see, relatively, where the mass is. 99:59:59.999 --> 99:59:59.999 22 00:02:16,980 --> 00:02:21,150 Where is the mass on the two different task counts? 99:59:59.999 --> 99:59:59.999 23 00:02:21,150 --> 00:02:27,620 And we can see that both of them have a right skew. 99:59:59.999 --> 99:59:59.999 24 00:02:27,620 --> 00:02:32,780 They're bulked up towards the low end of the scale. 99:59:59.999 --> 99:59:59.999 25 00:02:32,780 --> 00:02:41,120 And then we have all of the individual data points scattered on the chart. 99:59:59.999 --> 99:59:59.999 26 00:02:41,120 --> 00:02:46,670 This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
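The scatterplot with marginal histograms described above can be sketched in terms of the data behind it. This is only an illustration: the `points` list is made-up data, not the participants from the chart, and `margin` and `is_right_skewed` are hypothetical helper names. It shows that each dot is one (Task 1, Task 2) pair, that each margin is a count of one coordinate, and that a right skew shows up as a mean pulled above the median.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (Task 1, Task 2) query counts, one pair per participant.
# Each pair is one dot on the scatterplot.
points = [(1, 2), (2, 1), (2, 3), (3, 2), (1, 1), (4, 2), (9, 3), (3, 8)]


def margin(values):
    """Count how many participants fall on each value (a marginal histogram)."""
    return dict(sorted(Counter(values).items()))


def is_right_skewed(values):
    """Rough check only: a long right tail pulls the mean above the median."""
    return mean(values) > median(values)


task1 = [x for x, _ in points]  # x axis values
task2 = [y for _, y in points]  # y axis values

for name, values in (("Task 1", task1), ("Task 2", task2)):
    print(name, margin(values), "right-skewed:", is_right_skewed(values))
```

Comparing mean and median is a quick heuristic rather than a formal skewness measure, but it matches what the marginal histograms show visually: most of the mass at the low end and a few large values.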
99:59:59.999 --> 99:59:59.999 27 00:02:46,670 --> 00:02:55,400 And particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces. 99:59:59.999 --> 99:59:59.999 28 00:02:55,400 --> 00:02:59,300 What is your x axis? What is your y axis? 99:59:59.999 --> 99:59:59.999 29 00:02:59,300 --> 00:03:08,600 Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart? 99:59:59.999 --> 99:59:59.999 30 00:03:08,600 --> 00:03:16,280 So charts are really useful for revealing a variety of things. They can reveal patterns, or the lack thereof. 99:59:59.999 --> 99:59:59.999 31 00:03:16,280 --> 00:03:19,200 In this chart, there's really not much of a pattern. 99:59:59.999 --> 99:59:59.999 32 00:03:19,200 --> 00:03:26,700 We can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern. 99:59:59.999 --> 99:59:59.999 33 00:03:26,700 --> 00:03:34,560 The participant with the most queries for Task 1 99:59:59.999 --> 99:59:59.999 34 00:03:34,560 --> 00:03:38,760 has a middling to low number of queries for Task 2. 99:59:59.999 --> 99:59:59.999 35 00:03:38,760 --> 00:03:47,790 And the one who has the most queries for Task 2, while they're in the upper end of the queries per task on Task 1, 99:59:59.999 --> 99:59:59.999 36 00:03:47,790 --> 00:03:53,520 they're not at all the highest. So we can see there's not very much of a relationship here. 99:59:59.999 --> 99:59:59.999 37 00:03:53,520 --> 00:03:59,490 At least it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars. 99:59:59.999 --> 99:59:59.999 38 00:03:59,490 --> 00:04:08,220 We can see where points lie.
We can see, as in the chart we just saw, that the highest count for one task and 99:59:59.999 --> 99:59:59.999 39 00:04:08,220 --> 00:04:12,210 the highest count for the other task come from different participants. We can also see trends. 99:59:59.999 --> 99:59:59.999 40 00:04:12,210 --> 00:04:14,610 We can see if a line looks like it goes up or down, 99:59:59.999 --> 99:59:59.999 41 00:04:14,610 --> 00:04:22,950 or wiggles around. So charts can reveal a lot of these kinds of things, and they can really leverage our human perception, 99:59:59.999 --> 99:59:59.999 42 00:04:22,950 --> 00:04:25,080 particularly our human visual senses, 99:59:59.999 --> 99:59:59.999 43 00:04:25,080 --> 00:04:35,820 to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart, 99:59:59.999 --> 99:59:59.999 44 00:04:35,820 --> 00:04:41,520 we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart, 99:59:59.999 --> 99:59:59.999 45 00:04:41,520 --> 00:04:45,720 they need to be able to understand what each point in the chart is. 99:59:59.999 --> 99:59:59.999 46 00:04:45,720 --> 00:04:56,530 They need to understand what values are plotted on the axes. 99:59:59.999 --> 99:59:59.999 47 00:04:56,530 --> 00:05:01,260 Often this is done in an axis label. In the chart I showed you, 99:59:59.999 --> 99:59:59.999 48 00:05:01,260 --> 00:05:06,040 the caption said what the values were, and the axis labels said which version of them they were. 99:59:59.999 --> 99:59:59.999 49 00:05:06,040 --> 00:05:08,560 If there are units, that needs to be clear.
99:59:59.999 --> 99:59:59.999 50 00:05:08,560 --> 00:05:19,840 So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart, 99:59:59.999 --> 99:59:59.999 51 00:05:19,840 --> 00:05:29,320 either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart. For example, if you have a histogram 99:59:59.999 --> 99:59:59.999 52 00:05:29,320 --> 00:05:32,440 and you've got a fraction or a percentage on the left-hand side, 99:59:59.999 --> 99:59:59.999 53 00:05:32,440 --> 00:05:41,260 it's standard convention that we're talking about the fraction of the values that are in each bin, 99:59:59.999 --> 99:59:59.999 54 00:05:41,260 --> 00:05:49,930 at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about 99:59:59.999 --> 99:59:59.999 55 00:05:49,930 --> 00:05:57,090 what a value or an axis label is, or any doubt that the reader will understand what it is, 99:59:59.999 --> 99:59:59.999 56 00:05:57,090 --> 00:06:04,230 be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own. 99:59:59.999 --> 99:59:59.999 57 00:06:04,230 --> 00:06:08,640 You can assume a reasonable level of background. You have to know your audience for this. 99:59:59.999 --> 99:59:59.999 58 00:06:08,640 --> 00:06:16,200 But someone should be able to just look at the chart with its immediately surrounding description, 99:59:59.999 --> 99:59:59.999 59 00:06:16,200 --> 00:06:20,670 the labels, the caption, and have a pretty good idea of what's going on. 99:59:59.999 --> 99:59:59.999 60 00:06:20,670 --> 00:06:25,990 The surrounding text, the text that references the chart if you're writing a document, 99:59:59.999 --> 99:59:59.999 61 00:06:25,990 --> 00:06:30,190 can have your observations and can provide more context and clarity.
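The histogram convention mentioned above, where the vertical axis is the fraction of values in each bin, can be made concrete with a small sketch. Everything here is assumed for illustration: `bin_fractions` is a hypothetical helper, and the file sizes, bin edges, and label strings are made up. The labels show one way to state both the quantity and its units explicitly.

```python
def bin_fractions(values, edges):
    """Return the fraction of values falling in each bin.

    Bins are [edges[i], edges[i + 1]), except the last bin, which also
    includes its right edge. Values outside the edges land in no bin.
    """
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


sizes_mb = [1, 2, 2, 3, 4, 4, 5, 9]  # hypothetical file sizes, in megabytes
edges = [0, 3, 6, 9]                 # hypothetical bin edges, in megabytes

fractions = bin_fractions(sizes_mb, edges)

# Explicit labels: the unit goes on the value axis, and the other axis
# says that it is a fraction rather than a raw count.
x_label = "File size (MB)"
y_label = "Fraction of files"
print(x_label, edges)
print(y_label, fractions)
```

When every value lies inside the bin edges, the fractions sum to one, which is a quick sanity check that the axis really is showing a distribution.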
99:59:59.999 --> 99:59:59.999 62 00:06:30,190 --> 00:06:35,950 But someone just looking at the chart should be able to figure out basically what's going on and 99:59:59.999 --> 99:59:59.999 63 00:06:35,950 --> 00:06:41,080 not be too far off. This is particularly important because there are a lot of people, 99:59:59.999 --> 99:59:59.999 64 00:06:41,080 --> 00:06:46,150 and whether this is a good or a bad practice we can debate, but there are a lot of people who, when they're reading a paper, 99:59:59.999 --> 99:59:59.999 65 00:06:46,150 --> 00:06:52,000 focus on the charts and look at the key charts first to see what it is that's going on in the paper. 99:59:59.999 --> 99:59:59.999 66 00:06:52,000 --> 00:07:01,990 And if our charts are self-explanatory and clear, it makes it a lot easier for people to glance at our work, 99:59:59.999 --> 99:59:59.999 67 00:07:01,990 --> 00:07:09,370 see what it's doing, and decide whether they are going to pay it further attention. 99:59:59.999 --> 99:59:59.999 68 00:07:09,370 --> 00:07:12,970 So if you're putting a chart in a written document or a paper, 99:59:59.999 --> 99:59:59.999 69 00:07:12,970 --> 00:07:18,640 each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance. 99:59:59.999 --> 99:59:59.999 70 00:07:18,640 --> 00:07:18,790 For example, 99:59:59.999 --> 99:59:59.999 71 00:07:18,790 --> 00:07:26,410 it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology, 99:59:59.999 --> 99:59:59.999 72 00:07:26,410 --> 00:07:31,800 what precisely some of the computations are, etc. In other contexts, we often need a title for the charts. 99:59:59.999 --> 99:59:59.999 73 00:07:31,800 --> 00:07:37,900 So if we have a caption, we need to label our axes, but we don't need a title for the chart itself all the time.
99:59:59.999 --> 99:59:59.999 74 00:07:37,900 --> 00:07:40,840 It doesn't hurt, but often it's redundant with the caption. 99:59:59.999 --> 99:59:59.999 75 00:07:40,840 --> 00:07:46,690 In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation 99:59:59.999 --> 99:59:59.999 76 00:07:46,690 --> 00:07:50,230 or a chart in one of our notebooks. A title is often helpful in notebooks. 99:59:59.999 --> 99:59:59.999 77 00:07:50,230 --> 00:07:52,180 The surrounding text may be sufficient, 99:59:59.999 --> 99:59:59.999 78 00:07:52,180 --> 00:07:59,920 but a title is often a good idea so that someone who's quickly scanning the notebook can understand what's going on in the chart. 99:59:59.999 --> 99:59:59.999 79 00:07:59,920 --> 00:08:04,570 So, there are a few pitfalls to be aware of when we're thinking about statistical graphics. 99:59:59.999 --> 99:59:59.999 80 00:08:04,570 --> 00:08:10,930 One is distorting the distances or the differences. In particular, 99:59:59.999 --> 99:59:59.999 81 00:08:10,930 --> 00:08:13,600 we need to make sure that if something has a length, 99:59:59.999 --> 99:59:59.999 82 00:08:13,600 --> 00:08:21,040 that length accurately represents quantities. With position, it's relative position: 99:59:59.999 --> 99:59:59.999 83 00:08:21,040 --> 00:08:26,800 if you have two dots, their relative position is what's important. But if we have a bar, it has a length. 99:59:59.999 --> 99:59:59.999 84 00:08:26,800 --> 00:08:30,700 It also has an area. We need to make sure those accurately represent quantities. 99:59:59.999 --> 99:59:59.999 85 00:08:30,700 --> 00:08:37,140 One really common way to violate this is having a bar chart whose axis starts at something other than zero. 99:59:59.999 --> 99:59:59.999 86 00:08:37,140 --> 00:08:41,670 The software we're using doesn't do that by default. Excel does.
99:59:59.999 --> 99:59:59.999 87 00:08:41,670 --> 00:08:46,860 But your bar chart always needs to start at zero, because 99:59:59.999 --> 99:59:59.999 88 00:08:46,860 --> 00:08:51,540 people don't look at the relative position of the bar. People see the whole height of the bar. 99:59:59.999 --> 99:59:59.999 89 00:08:51,540 --> 00:08:59,400 And so if it doesn't start at zero, it looks like the difference between bars is much larger relative to the bar size than it actually is. 99:59:59.999 --> 99:59:59.999 90 00:08:59,400 --> 00:09:01,860 There are also ways in which we can violate conventions. 99:59:59.999 --> 99:59:59.999 91 00:09:01,860 --> 00:09:09,210 So in the first video I showed you a chart that violated the convention that the x axis goes in order. 99:59:59.999 --> 99:59:59.999 92 00:09:09,210 --> 00:09:16,320 If we violate the user's expectations, they'll either be confused by the chart or read it wrong. 99:59:59.999 --> 99:59:59.999 93 00:09:16,320 --> 00:09:24,240 Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading, 99:59:59.999 --> 99:59:59.999 94 00:09:24,240 --> 00:09:32,570 like you assimilate how to read written text. And if those expectations are violated, that can 99:59:59.999 --> 99:59:59.999 95 00:09:32,570 --> 00:09:37,490 lead the user to incorrect conclusions from our charts, from our presentation. 99:59:59.999 --> 99:59:59.999 96 00:09:37,490 --> 00:09:42,920 A key thing to remember here, which also applies to all of our presentations: 99:59:59.999 --> 99:59:59.999 97 00:09:42,920 --> 00:09:52,350 research isn't a mystery novel. You don't have to worry about spoiling the surprise or the ending. The goal here is 99:59:59.999 --> 99:59:59.999 98 00:09:52,350 --> 00:09:56,100 not to subvert tropes or present shocking new presentations.
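The zero-baseline point about bar charts made above can be checked with a little arithmetic. This is a sketch with made-up numbers, and `drawn_ratio` is a hypothetical helper name: it computes how much longer one bar is drawn than another when the value axis starts at a given baseline.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths when the value axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


a, b = 52.0, 50.0  # hypothetical values for two bars

# With a zero baseline, the bars are drawn in the same proportion as the values.
print(drawn_ratio(a, b))               # 1.04

# With the axis starting at 48, the first bar is drawn twice as long as the
# second, even though the underlying values differ by only 4 percent.
print(drawn_ratio(a, b, baseline=48))  # 2.0
```

The data has not changed between the two calls. Only the baseline has, which is why a truncated axis distorts what readers take away from bar lengths.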
99:59:59.999 --> 99:59:59.999 99 00:09:56,100 --> 00:10:00,470 We might have shocking new evidence, but from a presentation perspective, 99:59:59.999 --> 99:59:59.999 100 00:10:00,470 --> 00:10:08,040 we want it to fit within conventions and not violate readers' expectations unnecessarily, so that 99:59:59.999 --> 99:59:59.999 101 00:10:08,040 --> 00:10:14,370 they can read it and be confident that they've correctly understood what it is that you're saying. 99:59:59.999 --> 99:59:59.999 102 00:10:14,370 --> 00:10:18,990 Another thing to be aware of is that graphics can illustrate an effect. 99:59:59.999 --> 99:59:59.999 103 00:10:18,990 --> 00:10:27,060 They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for. 99:59:59.999 --> 99:59:59.999 104 00:10:27,060 --> 00:10:33,660 We have to be careful about that. We'll talk about some of the pitfalls. We have to be careful when we're combining analyses: 99:59:59.999 --> 99:59:59.999 105 00:10:33,660 --> 00:10:39,420 we can't combine exploratory and what's called confirmatory analysis. But 99:59:59.999 --> 99:59:59.999 106 00:10:39,420 --> 00:10:45,990 visualizing data can help us look for possible effects and get ideas for what to go look for next. 99:59:59.999 --> 99:59:59.999 107 00:10:45,990 --> 00:10:53,530 But graphics are not conclusive proof of an effect. We need the numeric results, both 99:59:59.999 --> 99:59:59.999 108 00:10:53,530 --> 00:11:00,030 the raw numbers and the results of inferential techniques that let us 99:59:59.999 --> 99:59:59.999 109 00:11:00,030 --> 00:11:06,090 estimate how big an effect is and whether it's significant, in order to come to any conclusions.
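As one illustration of the kind of inferential technique mentioned above, here is a minimal sketch of an exact paired permutation (sign-flip) test. It is not necessarily the method used in the lecture or the paper: the function name `paired_permutation_p` and the error values are assumptions made up for this example. The test asks how often a difference at least as large as the observed one would appear if the labels within each pair were arbitrary.

```python
from itertools import product
from statistics import mean


def paired_permutation_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates every way of flipping the sign of each paired difference,
    so it is only practical for a small number of pairs.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed:
            hits += 1
        total += 1
    return hits / total


# Hypothetical per-user error values for two algorithms (lower is better).
alg_a = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93, 0.89, 0.92]
alg_b = [0.92, 0.87, 0.96, 0.92, 0.86, 0.95, 0.90, 0.91]

p_value = paired_permutation_p(alg_a, alg_b)
print("mean difference:", mean(x - y for x, y in zip(alg_a, alg_b)))
print("p-value:", p_value)
```

A chart of these two algorithms might show one slightly ahead of the other. The p-value is what tells us whether a gap of that size is distinguishable from chance under this test's assumptions.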
99:59:59.999 --> 99:59:59.999 110 00:11:06,090 --> 00:11:13,920 So in this chart, I want to show you, for example, if we look at the chart closely, 99:59:59.999 --> 99:59:59.999 111 00:11:13,920 --> 00:11:20,790 we have these two data points, the little blue Xs. The FunkSVD X 99:59:59.999 --> 99:59:59.999 112 00:11:20,790 --> 00:11:26,160 is to the left of the Item-Item X. For this particular metric, 99:59:59.999 --> 99:59:59.999 113 00:11:26,160 --> 00:11:32,900 lower is better because it's an error metric. Root mean squared error is what RMSE stands for. 99:59:59.999 --> 99:59:59.999 114 00:11:32,900 --> 00:11:38,520 So it looks like, okay, this is a little bit better. But that's not sufficient evidence for us 99:59:59.999 --> 99:59:59.999 115 00:11:38,520 --> 00:11:44,220 to conclude that FunkSVD is better than Item-Item on the per-user RMSE metric. 99:59:59.999 --> 99:59:59.999 116 00:11:44,220 --> 00:11:48,900 Exactly what all these things are is a topic for another day. 99:59:59.999 --> 99:59:59.999 117 00:11:48,900 --> 00:11:55,230 But the fact that we see the X to the left means that, if the effect is real, this illustrates it. 99:59:59.999 --> 99:59:59.999 118 00:11:55,230 --> 00:12:02,220 But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy. 99:59:59.999 --> 99:59:59.999 119 00:12:02,220 --> 00:12:07,120 It's a relatively small difference. So graphics help us see. 99:59:59.999 --> 99:59:59.999 120 00:12:07,120 --> 00:12:11,620 They help us communicate. They're not definitive and conclusive proof. 99:59:59.999 --> 99:59:59.999 121 00:12:11,620 --> 00:12:16,660 There are a couple of other things I want to highlight that are going on in this graph. I've introduced two different kinds of symbols here. 99:59:59.999 --> 99:59:59.999 122 00:12:16,660 --> 00:12:19,930 In the earlier graph, we just had one kind of symbol.
We had dots. Here 99:59:59.999 --> 99:59:59.999 123 00:12:19,930 --> 00:12:24,580 we have two different kinds, with a legend that says red circles are 99:59:59.999 --> 99:59:59.999 124 00:12:24,580 --> 00:12:30,930 a thing called global RMSE and blue Xs are a thing called per-user RMSE. 99:59:59.999 --> 99:59:59.999 125 00:12:30,930 --> 00:12:37,920 You don't have to understand what those are. But the point is, I'm using different colors and shapes in order 99:59:59.999 --> 99:59:59.999 126 00:12:37,920 --> 00:12:45,750 to show different versions of a thing in the same chart. Using different shapes 99:59:59.999 --> 99:59:59.999 127 00:12:45,750 --> 00:12:50,430 in addition to different colors is useful because if the chart is printed on a black-and-white printer, 99:59:59.999 --> 99:59:59.999 128 00:12:50,430 --> 00:12:57,270 or if the reader has some form of color blindness, it helps make the differences clearer. 99:59:59.999 --> 99:59:59.999 129 00:12:57,270 --> 00:13:05,070 In addition, I've got my y axis, which indicates the different things that I'm plotting here. 99:59:59.999 --> 99:59:59.999 130 00:13:05,070 --> 00:13:11,430 I also have grouped them just to make it easier for the user to see. 99:59:59.999 --> 99:59:59.999 131 00:13:11,430 --> 00:13:15,330 These first ones are all single algorithms. 99:59:59.999 --> 99:59:59.999 132 00:13:15,330 --> 00:13:20,920 And then we have a blend and a few other things. The details 99:59:59.999 --> 99:59:59.999 133 00:13:20,920 --> 00:13:27,220 aren't important for illustrating this, but the grouping helps guide the user to understand the structure: we have these group breakdowns. 99:59:59.999 --> 99:59:59.999 134 00:13:27,220 --> 00:13:32,680 It also helps save space in the paper because I can present all of these different things in one place.
99:59:59.999 --> 99:59:59.999 135 00:13:32,680 --> 00:13:39,730 It's easy to compare the different stages, even though I have to split them out in the discussion in the paper. 99:59:59.999 --> 99:59:59.999 136 00:13:39,730 --> 00:13:46,540 But it gives you one place to compare them, and it concisely shows the key results of the entire paper in one chart. 99:59:59.999 --> 99:59:59.999 137 00:13:46,540 --> 00:13:51,880 So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it. 99:59:59.999 --> 99:59:59.999 138 00:13:51,880 --> 00:13:54,160 They don't replace our numerical analysis, 99:59:59.999 --> 99:59:59.999 139 00:13:54,160 --> 00:14:01,570 but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it. 99:59:59.999 --> 99:59:59.999 140 00:14:01,570 --> 00:14:06,340 We do, however, always need to make sure that we clearly label and describe our graphics so that 99:59:59.999 --> 99:59:59.999 141 00:14:06,340 --> 00:14:15,300 readers can understand them and draw correct conclusions from them.