9:59:59.000,9:59:59.000 1[br]00:00:04,540 --> 00:00:10,180[br]Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics. 9:59:59.000,9:59:59.000 2[br]00:00:10,180 --> 00:00:13,660[br]I want you to be able to understand the value of graphics for presenting data, 9:59:59.000,9:59:59.000 3[br]00:00:13,660 --> 00:00:19,540[br]identify parts of a statistical image, and understand some pitfalls in graphics that we want to try to avoid. 9:59:59.000,9:59:59.000 4[br]00:00:19,540 --> 00:00:25,300[br]So here's an example of a chart, and there are a variety of different pieces to this chart. 9:59:59.000,9:59:59.000 5[br]00:00:25,300 --> 00:00:31,900[br]We have an x axis. That's the horizontal axis. 9:59:59.000,9:59:59.000 6[br]00:00:31,900 --> 00:00:41,820[br]We have a y axis, the vertical axis. Each of these axes has a label: 9:59:59.000,9:59:59.000 7[br]00:00:41,820 --> 00:00:52,610[br]Task 2 and Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it. 9:59:59.000,9:59:59.000 8[br]00:00:52,610 --> 00:00:59,660[br]And it says that this graph is showing the number of queries per task, with query count distributions in the margins. 9:59:59.000,9:59:59.000 9[br]00:00:59,660 --> 00:01:04,070[br]And each dot is one participant. So it tells us, if we have a data point in it, 9:59:59.000,9:59:59.000 10[br]00:01:04,070 --> 00:01:09,020[br]what that is. It tells us what it is that we're charting: number of queries per task. 9:59:59.000,9:59:59.000 11[br]00:01:09,020 --> 00:01:14,760[br]We then see in the axis labels that we have Task 1 and we have Task 2. 9:59:59.000,9:59:59.000 12[br]00:01:14,760 --> 00:01:19,500[br]Those two together give us the context to understand that, oh, we have two tasks, 9:59:59.000,9:59:59.000 13[br]00:01:19,500 --> 00:01:24,000[br]and each point is one participant. 
9:59:59.000,9:59:59.000 14[br]00:01:24,000 --> 00:01:28,620[br]And they're appearing at the point where they have their Task 1 count and their Task 2 count. 9:59:59.000,9:59:59.000 15[br]00:01:28,620 --> 00:01:34,530[br]OK, this allows us to see if there's any relationship between 9:59:59.000,9:59:59.000 16[br]00:01:34,530 --> 00:01:40,800[br]how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins. 9:59:59.000,9:59:59.000 17[br]00:01:40,800 --> 00:01:46,050[br]So this is a compound plot. And in the left and right margins, or the x and y margins, 9:59:59.000,9:59:59.000 18[br]00:01:46,050 --> 00:01:52,530[br]we have the distribution, a histogram of the x axis, Task 1. 9:59:59.000,9:59:59.000 19[br]00:01:52,530 --> 00:01:59,560[br]We have a histogram of the y axis, Task 2. These histograms don't have axes themselves because, 9:59:59.000,9:59:59.000 20[br]00:01:59,560 --> 00:02:04,680[br]well, we just wanted to show a distribution. Particularly for our purposes here, 9:59:59.000,9:59:59.000 21[br]00:02:04,680 --> 00:02:16,980[br]the exact number in each bin is not so important. The key thing is just being able to see, relatively, 9:59:59.000,9:59:59.000 22[br]00:02:16,980 --> 00:02:21,150[br]where the mass is on the two different task counts. 9:59:59.000,9:59:59.000 23[br]00:02:21,150 --> 00:02:27,620[br]And we can see that both of them have a right skew. 9:59:59.000,9:59:59.000 24[br]00:02:27,620 --> 00:02:32,780[br]They're bulked up towards the low end of the scale. 9:59:59.000,9:59:59.000 25[br]00:02:32,780 --> 00:02:41,120[br]And then we have all of the individual data points scattered on the chart. 9:59:59.000,9:59:59.000 26[br]00:02:41,120 --> 00:02:46,670[br]This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify. 
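The compound plot just described (a scatterplot with unlabeled histograms in the margins) could be sketched roughly as follows. This is only an illustration, assuming matplotlib and NumPy are available; it is not the code behind the chart in the video, and the function name `scatter_with_marginals`, the bin count, and the axis label wording are all made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def scatter_with_marginals(task1, task2, bins=10):
    """Scatterplot of per-participant query counts with marginal histograms.

    Hypothetical helper for illustration; not the code used for the lecture chart.
    """
    fig = plt.figure(figsize=(6, 6))
    # 2x2 grid: large scatter panel bottom-left, thin histogram panels on top and right.
    grid = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                            wspace=0.05, hspace=0.05)
    ax_main = fig.add_subplot(grid[1, 0])
    ax_top = fig.add_subplot(grid[0, 0], sharex=ax_main)
    ax_right = fig.add_subplot(grid[1, 1], sharey=ax_main)

    # Each dot is one participant: x is the Task 1 count, y is the Task 2 count.
    ax_main.scatter(task1, task2)
    ax_main.set_xlabel("Task 1 (queries)")
    ax_main.set_ylabel("Task 2 (queries)")

    # Marginal histograms only show where the mass is, so their axes are hidden.
    ax_top.hist(task1, bins=bins)
    ax_right.hist(task2, bins=bins, orientation="horizontal")
    ax_top.axis("off")
    ax_right.axis("off")
    return fig, ax_main, ax_top, ax_right
```

Calling it with two equal-length sequences of counts and then `fig.savefig("queries.png")` would write the figure to disk; the data you pass in decides what "one dot" means, which is why the data point needs to be settled before the chart is drawn.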
9:59:59.000,9:59:59.000 27[br]00:02:46,670 --> 00:02:55,400[br]And particularly as you go to refine a chart, what you're going to need to do is specify what's happening on each of these pieces. 9:59:59.000,9:59:59.000 28[br]00:02:55,400 --> 00:02:59,300[br]What is your x axis? What is your y axis? 9:59:59.000,9:59:59.000 29[br]00:02:59,300 --> 00:03:08,600[br]Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart? 9:59:59.000,9:59:59.000 30[br]00:03:08,600 --> 00:03:16,280[br]So charts are really useful for revealing a variety of things. They can reveal patterns or the lack thereof. 9:59:59.000,9:59:59.000 31[br]00:03:16,280 --> 00:03:19,200[br]In this chart, there's really not much of a pattern. 9:59:59.000,9:59:59.000 32[br]00:03:19,200 --> 00:03:26,700[br]We can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern. 9:59:59.000,9:59:59.000 33[br]00:03:26,700 --> 00:03:34,560[br]The participant with the most queries for Task 1 9:59:59.000,9:59:59.000 34[br]00:03:34,560 --> 00:03:38,760[br]has a middling to low number of queries for Task 2. 9:59:59.000,9:59:59.000 35[br]00:03:38,760 --> 00:03:47,790[br]And the one who has the most queries for Task 2, while they're in the upper end of the queries per task on Task 1, 9:59:59.000,9:59:59.000 36[br]00:03:47,790 --> 00:03:53,520[br]they're not at all the highest. So we can see there's not very much of a relationship here, 9:59:59.000,9:59:59.000 37[br]00:03:53,520 --> 00:03:59,490[br]at least it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars. 9:59:59.000,9:59:59.000 38[br]00:03:59,490 --> 00:04:08,220[br]We can see where points lie. 
We can see, like in that chart we just saw, that the highest count for one task and 9:59:59.000,9:59:59.000 39[br]00:04:08,220 --> 00:04:12,210[br]the highest count for the other task are different. We can also see trends. 9:59:59.000,9:59:59.000 40[br]00:04:12,210 --> 00:04:14,610[br]We can see if a line looks like it goes up or down, 9:59:59.000,9:59:59.000 41[br]00:04:14,610 --> 00:04:22,950[br]or wiggles around. So charts can reveal a lot of these kinds of things, and they can really leverage our human perception, 9:59:59.000,9:59:59.000 42[br]00:04:22,950 --> 00:04:25,080[br]particularly our human visual senses, 9:59:59.000,9:59:59.000 43[br]00:04:25,080 --> 00:04:35,820[br]to be able to quickly internalize and understand what is going on in a set of data. When we're creating a chart, 9:59:59.000,9:59:59.000 44[br]00:04:35,820 --> 00:04:41,520[br]we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart, 9:59:59.000,9:59:59.000 45[br]00:04:41,520 --> 00:04:45,720[br]they need to be able to understand what each point in the chart is. 9:59:59.000,9:59:59.000 46[br]00:04:45,720 --> 00:04:56,530[br]They need to understand what values are plotted on the axes. 9:59:59.000,9:59:59.000 47[br]00:04:56,530 --> 00:05:01,260[br]Often this is done in an axis label. In the chart I showed you, 9:59:59.000,9:59:59.000 48[br]00:05:01,260 --> 00:05:06,040[br]the caption said what the values were, and the axis labels said which version of them they were. 9:59:59.000,9:59:59.000 49[br]00:05:06,040 --> 00:05:08,560[br]If there are units, that needs to be clear. 
9:59:59.000,9:59:59.000 50[br]00:05:08,560 --> 00:05:19,840[br]So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart, 9:59:59.000,9:59:59.000 51[br]00:05:19,840 --> 00:05:29,320[br]either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart. If it's a histogram 9:59:59.000,9:59:59.000 52[br]00:05:29,320 --> 00:05:32,440[br]and you've got a fraction or a percentage on the left-hand side, 9:59:59.000,9:59:59.000 53[br]00:05:32,440 --> 00:05:41,260[br]it's standard convention that we're talking about the fraction of the values that are in each bin, 9:59:59.000,9:59:59.000 54[br]00:05:41,260 --> 00:05:49,930[br]at least if you label it as a histogram or as a chart showing the distribution. But when in doubt, if there's any doubt about 9:59:59.000,9:59:59.000 55[br]00:05:49,930 --> 00:05:57,090[br]what a value or an axis label is, or there's any doubt that the reader will understand what it is, 9:59:59.000,9:59:59.000 56[br]00:05:57,090 --> 00:06:04,230[br]be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own. 9:59:59.000,9:59:59.000 57[br]00:06:04,230 --> 00:06:08,640[br]You can assume a reasonable level of background; you have to know your audience for this. 9:59:59.000,9:59:59.000 58[br]00:06:08,640 --> 00:06:16,200[br]But someone should be able to just look at the chart with its immediately surrounding description, 9:59:59.000,9:59:59.000 59[br]00:06:16,200 --> 00:06:20,670[br]the labels and the caption, and have a pretty good idea of what's going on. 9:59:59.000,9:59:59.000 60[br]00:06:20,670 --> 00:06:25,990[br]The surrounding text, the text that references the chart if you're writing a document, 9:59:59.000,9:59:59.000 61[br]00:06:25,990 --> 00:06:30,190[br]can have your observations and can provide more context and clarity. 
9:59:59.000,9:59:59.000 62[br]00:06:30,190 --> 00:06:35,950[br]But someone just looking at the chart should be able to figure out basically what's going on and 9:59:59.000,9:59:59.000 63[br]00:06:35,950 --> 00:06:41,080[br]not be too far off. This is particularly important because there are a lot of people, 9:59:59.000,9:59:59.000 64[br]00:06:41,080 --> 00:06:46,150[br]and whether this is a good or a bad practice we can debate, but there are a lot of people who, when they're reading a paper, 9:59:59.000,9:59:59.000 65[br]00:06:46,150 --> 00:06:52,000[br]focus on the charts and look at the key charts first to see what it is that's going on in the paper. 9:59:59.000,9:59:59.000 66[br]00:06:52,000 --> 00:07:01,990[br]And if our charts are self-explanatory and clear, it makes it a lot easier for people to glance at our work, 9:59:59.000,9:59:59.000 67[br]00:07:01,990 --> 00:07:09,370[br]see what it's doing, and decide whether they are going to pay it further attention. 9:59:59.000,9:59:59.000 68[br]00:07:09,370 --> 00:07:12,970[br]So if you're putting a chart in a written document or a paper, 9:59:59.000,9:59:59.000 69[br]00:07:12,970 --> 00:07:18,640[br]each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance. 9:59:59.000,9:59:59.000 70[br]00:07:18,640 --> 00:07:18,790[br]For example, 9:59:59.000,9:59:59.000 71[br]00:07:18,790 --> 00:07:26,410[br]it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology, 9:59:59.000,9:59:59.000 72[br]00:07:26,410 --> 00:07:31,800[br]what precisely some of the computations are, etc. 9:59:59.000,9:59:59.000 73[br]00:07:31,800 --> 00:07:37,900[br]If we have a caption, we still need to label our axes, but we don't always need a title for the chart itself. 
9:59:59.000,9:59:59.000 74[br]00:07:37,900 --> 00:07:40,840[br]It doesn't hurt, but often it's redundant with the caption. 9:59:59.000,9:59:59.000 75[br]00:07:40,840 --> 00:07:46,690[br]In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation. 9:59:59.000,9:59:59.000 76[br]00:07:46,690 --> 00:07:50,230[br]When we have a chart in one of our notebooks, a title is often helpful. In notebooks, 9:59:59.000,9:59:59.000 77[br]00:07:50,230 --> 00:07:52,180[br]the surrounding text may be sufficient, 9:59:59.000,9:59:59.000 78[br]00:07:52,180 --> 00:07:59,920[br]but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart. 9:59:59.000,9:59:59.000 79[br]00:07:59,920 --> 00:08:04,570[br]So there are a few pitfalls to be aware of when we're thinking about statistical graphics. 9:59:59.000,9:59:59.000 80[br]00:08:04,570 --> 00:08:10,930[br]One is distorting the distances or the differences. In particular, 9:59:59.000,9:59:59.000 81[br]00:08:10,930 --> 00:08:13,600[br]we need to make sure that if something has a length, 9:59:59.000,9:59:59.000 82[br]00:08:13,600 --> 00:08:21,040[br]that length accurately represents the quantity. For position, it's relative position: 9:59:59.000,9:59:59.000 83[br]00:08:21,040 --> 00:08:26,800[br]if you have two dots, their relative position is what's important. But if we have a bar, it has a length. 9:59:59.000,9:59:59.000 84[br]00:08:26,800 --> 00:08:30,700[br]It also has an area. We need to make sure those accurately represent quantities. 9:59:59.000,9:59:59.000 85[br]00:08:30,700 --> 00:08:37,140[br]One really common way to violate this is having a bar chart whose axis starts at something other than zero. 9:59:59.000,9:59:59.000 86[br]00:08:37,140 --> 00:08:41,670[br]The software we're using doesn't do that by default. Excel does. 
9:59:59.000,9:59:59.000 87[br]00:08:41,670 --> 00:08:46,860[br]But your bar chart always needs to start at zero, because 9:59:59.000,9:59:59.000 88[br]00:08:46,860 --> 00:08:51,540[br]people don't look at the relative position of the bar. People see the whole height of the bar. 9:59:59.000,9:59:59.000 89[br]00:08:51,540 --> 00:08:59,400[br]And so if it doesn't start at zero, it looks like the difference between bars is much larger relative to the bar size than it actually is. 9:59:59.000,9:59:59.000 90[br]00:08:59,400 --> 00:09:01,860[br]There are also ways in which we can violate conventions. 9:59:59.000,9:59:59.000 91[br]00:09:01,860 --> 00:09:09,210[br]In the first video, I showed you a chart that violated the convention that the x axis goes in order. 9:59:59.000,9:59:59.000 92[br]00:09:09,210 --> 00:09:16,320[br]If we violate the user's expectations, they'll either be confused by the chart or read it wrong. 9:59:59.000,9:59:59.000 93[br]00:09:16,320 --> 00:09:24,240[br]Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate through long patterns of reading, 9:59:59.000,9:59:59.000 94[br]00:09:24,240 --> 00:09:32,570[br]like you assimilate how to read written text. And if those expectations are violated, that can 9:59:59.000,9:59:59.000 95[br]00:09:32,570 --> 00:09:37,490[br]lead the user to incorrect conclusions from our charts, from our presentation. 9:59:59.000,9:59:59.000 96[br]00:09:37,490 --> 00:09:42,920[br]A key thing to remember here, which also applies to all of our presentations: 9:59:59.000,9:59:59.000 97[br]00:09:42,920 --> 00:09:52,350[br]research isn't a mystery novel. You don't have to worry about spoiling the surprise or the ending. The goal here is 9:59:59.000,9:59:59.000 98[br]00:09:52,350 --> 00:09:56,100[br]not to subvert tropes or present shocking new presentations. 
9:59:59.000,9:59:59.000 99[br]00:09:56,100 --> 00:10:00,470[br]We might have shocking new evidence, but from a presentation perspective, 9:59:59.000,9:59:59.000 100[br]00:10:00,470 --> 00:10:08,040[br]we want it to fit within conventions and not violate readers' expectations unnecessarily, so that 9:59:59.000,9:59:59.000 101[br]00:10:08,040 --> 00:10:14,370[br]they can read it and be confident that they've correctly understood what it is that you're saying. 9:59:59.000,9:59:59.000 102[br]00:10:14,370 --> 00:10:18,990[br]Another thing to be aware of is that graphics can illustrate an effect. 9:59:59.000,9:59:59.000 103[br]00:10:18,990 --> 00:10:27,060[br]They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for. 9:59:59.000,9:59:59.000 104[br]00:10:27,060 --> 00:10:33,660[br]We have to be careful about that. We'll talk about some of the pitfalls; we have to be careful when we're combining them. 9:59:59.000,9:59:59.000 105[br]00:10:33,660 --> 00:10:39,420[br]We can't combine exploratory and what's called confirmatory analysis, but graphics can help us. 9:59:59.000,9:59:59.000 106[br]00:10:39,420 --> 00:10:45,990[br]Visualizing data can help us look for possible effects and get ideas for what to go look for next. 9:59:59.000,9:59:59.000 107[br]00:10:45,990 --> 00:10:53,530[br]But they're not conclusive proof of an effect. We need the numeric results: 9:59:59.000,9:59:59.000 108[br]00:10:53,530 --> 00:11:00,030[br]the raw numbers as well as the results of inferential techniques that let us 9:59:59.000,9:59:59.000 109[br]00:11:00,030 --> 00:11:06,090[br]estimate how big an effect is and whether it's significant in order to come to any conclusions. 
9:59:59.000,9:59:59.000 110[br]00:11:06,090 --> 00:11:13,920[br]So in this chart I want to show you, for example, if we look at the chart closely, 9:59:59.000,9:59:59.000 111[br]00:11:13,920 --> 00:11:20,790[br]we have these two data points, the little blue Xs, and the FunkSVD X 9:59:59.000,9:59:59.000 112[br]00:11:20,790 --> 00:11:26,160[br]is to the left of the Item-Item X. For this particular metric, 9:59:59.000,9:59:59.000 113[br]00:11:26,160 --> 00:11:32,900[br]lower is better, because it's an error metric; root mean squared error is what RMSE stands for. 9:59:59.000,9:59:59.000 114[br]00:11:32,900 --> 00:11:38,520[br]So it looks like, OK, this is a little bit better. But that's not sufficient evidence for us 9:59:59.000,9:59:59.000 115[br]00:11:38,520 --> 00:11:44,220[br]to conclude that FunkSVD is better than Item-Item on the per-user RMSE metric. 9:59:59.000,9:59:59.000 116[br]00:11:44,220 --> 00:11:48,900[br]Exactly what all these things are is a topic for another day. 9:59:59.000,9:59:59.000 117[br]00:11:48,900 --> 00:11:55,230[br]But the fact that we see one to the left of the other: if the effect is real, this illustrates it. 9:59:59.000,9:59:59.000 118[br]00:11:55,230 --> 00:12:02,220[br]But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy; 9:59:59.000,9:59:59.000 119[br]00:12:02,220 --> 00:12:07,120[br]it's a relatively small difference. So graphics help us see. 9:59:59.000,9:59:59.000 120[br]00:12:07,120 --> 00:12:11,620[br]They help us communicate. They're not definitive and conclusive proof. 9:59:59.000,9:59:59.000 121[br]00:12:11,620 --> 00:12:16,660[br]A couple of other things I want to highlight that are going on in this graph: I've introduced two different kinds of symbols here. 9:59:59.000,9:59:59.000 122[br]00:12:16,660 --> 00:12:19,930[br]In the earlier graph, we just had one kind of symbol. 
We had dots. Here, 9:59:59.000,9:59:59.000 123[br]00:12:19,930 --> 00:12:24,580[br]we have two different kinds, with a legend that says red circles 9:59:59.000,9:59:59.000 124[br]00:12:24,580 --> 00:12:30,930[br]are a thing called global RMSE and blue Xs are a thing called per-user RMSE. 9:59:59.000,9:59:59.000 125[br]00:12:30,930 --> 00:12:37,920[br]You don't have to understand what those are. But the point is, I'm using different colors and shapes in order to communicate, 9:59:59.000,9:59:59.000 126[br]00:12:37,920 --> 00:12:45,750[br]to show different versions of a thing in the same chart. Using different shapes 9:59:59.000,9:59:59.000 127[br]00:12:45,750 --> 00:12:50,430[br]in addition to different colors is useful because, if someone prints it on a black-and-white printer 9:59:59.000,9:59:59.000 128[br]00:12:50,430 --> 00:12:57,270[br]or if they have some form of color blindness, it helps make the differences clearer. 9:59:59.000,9:59:59.000 129[br]00:12:57,270 --> 00:13:05,070[br]In addition, I've got my y axis, which is indicating the different things that I'm plotting here. 9:59:59.000,9:59:59.000 130[br]00:13:05,070 --> 00:13:11,430[br]I also have grouped them, just to make it easier for the user to see. 9:59:59.000,9:59:59.000 131[br]00:13:11,430 --> 00:13:15,330[br]These first ones are all single algorithms, 9:59:59.000,9:59:59.000 132[br]00:13:15,330 --> 00:13:20,920[br]and then we have a blend and a few other things. The details 9:59:59.000,9:59:59.000 133[br]00:13:20,920 --> 00:13:27,220[br]aren't important for illustrating this, but it helps guide the user to understand the structure: we have these group breakdowns. 9:59:59.000,9:59:59.000 134[br]00:13:27,220 --> 00:13:32,680[br]It also helps save space in the paper because I can present all of these different things in one place. 
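The idea of encoding a series with both color and marker shape, with a legend to decode them, might look like this in code. It is a rough sketch assuming matplotlib; the function name `plot_two_metrics` is hypothetical, and any numbers passed to it in an example are invented placeholders rather than results from the paper shown in the video.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt


def plot_two_metrics(algorithms, global_rmse, per_user_rmse):
    """Plot two error metrics per algorithm, distinguished by color AND marker shape.

    Hypothetical helper for illustration only.
    """
    fig, ax = plt.subplots(figsize=(6, 3))
    # Red circles vs. blue Xs: the shape difference survives grayscale printing
    # and does not rely on the reader distinguishing the two colors.
    ax.scatter(global_rmse, algorithms, color="red", marker="o", label="Global RMSE")
    ax.scatter(per_user_rmse, algorithms, color="blue", marker="x", label="Per-user RMSE")
    ax.set_xlabel("RMSE (lower is better)")
    ax.set_ylabel("Algorithm")
    ax.legend()
    return fig, ax
```

For instance, `plot_two_metrics(["ItemItem", "FunkSVD"], [0.90, 0.88], [0.95, 0.93])` draws one row per algorithm with both markers on it (again, made-up values). Seeing one X to the left of another in such a plot is still only a prompt for a numerical comparison, not a conclusion.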
9:59:59.000,9:59:59.000 135[br]00:13:32,680 --> 00:13:39,730[br]It's easy to compare the different stages, even though I have to split them out in the discussion in the paper. 9:59:59.000,9:59:59.000 136[br]00:13:39,730 --> 00:13:46,540[br]But it gives you one place to compare them, and it concisely shows the key results of the entire paper in one chart. 9:59:59.000,9:59:59.000 137[br]00:13:46,540 --> 00:13:51,880[br]So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it. 9:59:59.000,9:59:59.000 138[br]00:13:51,880 --> 00:13:54,160[br]They don't replace our numerical analysis, 9:59:59.000,9:59:59.000 139[br]00:13:54,160 --> 00:14:01,570[br]but they give it context, and they help us more clearly communicate what it is that we're learning from the data and what's going on in it. 9:59:59.000,9:59:59.000 140[br]00:14:01,570 --> 00:14:06,340[br]We do, however, always need to make sure that we clearly label and describe our graphics so that 9:59:59.000,9:59:59.000 141[br]00:14:06,340 --> 00:14:15,300[br]readers can understand them and draw correct conclusions from them. 9:59:59.000,9:59:59.000
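To put a number on the bar-chart baseline pitfall mentioned earlier in the video: because readers see the whole drawn height of a bar, the apparent ratio between two bars depends on where the axis starts. The small function below is an illustrative sketch in plain Python; the name `apparent_ratio` and the example values are made up.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b.

    `baseline` is the value at which the bar axis starts; with baseline=0
    the drawn ratio equals the true ratio a / b.
    """
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    # A bar for value v is drawn with height (v - baseline).
    return (a - baseline) / (b - baseline)
```

With a zero baseline, `apparent_ratio(105, 100)` is the true ratio of about 1.05. If the axis starts at 95 instead, `apparent_ratio(105, 100, baseline=95)` is 2.0: a 5% difference is drawn as one bar twice as tall as the other.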