https:/.../f87fc754-ecbd-4970-ad3b-ad9601830c4f-3d7efd1b-793d-4716-bef3-ad9c006aebdc.mp4?invocationId=ea100906-a50f-ec11-a9e9-0a1a827ad0ec

  • Not Synced
    1
    00:00:04,540 --> 00:00:10,180
    Welcome. In this video, I'm going to start introducing the basic concepts of statistical graphics.
  • Not Synced
    2
    00:00:10,180 --> 00:00:13,660
    I want you to be able to understand the value of graphics for presenting data,
  • Not Synced
    3
    00:00:13,660 --> 00:00:19,540
    identify the parts of a statistical graphic, and understand some pitfalls in graphics that we want to try to avoid.
  • Not Synced
    4
    00:00:19,540 --> 00:00:25,300
    So here's an example of a chart, and there's a variety of different pieces of this chart.
  • Not Synced
    5
    00:00:25,300 --> 00:00:31,900
    We have an x-axis. That's the horizontal axis.
  • Not Synced
    6
    00:00:31,900 --> 00:00:41,820
    We have a y axis, the vertical axis. Each of these axes has a label.
  • Not Synced
    7
    00:00:41,820 --> 00:00:52,610
    Task 2, Task 1. We have a caption up at the top that explains what's going on in the image and provides us with the context to understand it.
  • Not Synced
    8
    00:00:52,610 --> 00:00:59,660
    And it says that this graph is showing the number of queries per task, with query count distributions in the margins.
  • Not Synced
    9
    00:00:59,660 --> 00:01:04,070
    And each dot is one participant. So it tells us what a data point is.
  • Not Synced
    10
    00:01:04,070 --> 00:01:09,020
    It also tells us what it is that we're charting: number of queries per task.
  • Not Synced
    11
    00:01:09,020 --> 00:01:14,760
    We then see from the axis labels that we have Task 1 and we have Task 2.
  • Not Synced
    12
    00:01:14,760 --> 00:01:19,500
    Those two together give us the context to understand that, oh, we have two tasks.
  • Not Synced
    13
    00:01:19,500 --> 00:01:24,000
    And this is one participant.
  • Not Synced
    14
    00:01:24,000 --> 00:01:28,620
    And they appear at the point given by their Task 1 count and their Task 2 count.
  • Not Synced
    15
    00:01:28,620 --> 00:01:34,530
    OK, this allows us to see if there's any relationship between
  • Not Synced
    16
    00:01:34,530 --> 00:01:40,800
    how many queries it took to complete the two different tasks. It then says we have query count distributions in the margins.
  • Not Synced
    17
    00:01:40,800 --> 00:01:46,050
    So this is a compound plot, with distributions in the x and y margins.
  • Not Synced
    18
    00:01:46,050 --> 00:01:52,530
    We have the distribution, a histogram, of the x-axis variable, Task 1.
  • Not Synced
    19
    00:01:52,530 --> 00:01:59,560
    We have a histogram of the y-axis variable, Task 2. These histograms don't have axes themselves because,
  • Not Synced
    20
    00:01:59,560 --> 00:02:04,680
    well, we just wanted to show a distribution. Particularly for our purposes here,
  • Not Synced
    21
    00:02:04,680 --> 00:02:16,980
    the exact number in each bin is not so important. The key thing is just being able to see, relatively,
  • Not Synced
    22
    00:02:16,980 --> 00:02:21,150
    where the mass is on the two different task counts.
  • Not Synced
    23
    00:02:21,150 --> 00:02:27,620
    And we can see that both of them have a right skew.
  • Not Synced
    24
    00:02:27,620 --> 00:02:32,780
    They're bulked up toward the low end of the scale.
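
A rough way to check for a right skew numerically, as a sketch only: in a right-skewed distribution the long tail of large values usually pulls the mean above the median. The function name `skew_direction` and the query counts below are made up for illustration; they are not the data from the chart, and the mean-versus-median comparison is a rule of thumb rather than a formal definition of skewness.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew by comparing the mean with the median (a rough heuristic)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"


# Hypothetical query counts: most participants issue few queries, a few issue many.
query_counts = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
print(skew_direction(query_counts))
```

Here most of the values sit at the low end and a couple of large counts stretch the tail to the right, which is the same shape the marginal histograms show.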
  • Not Synced
    25
    00:02:32,780 --> 00:02:41,120
    And then we have all of the individual data points scattered on the chart.
  • Not Synced
    26
    00:02:41,120 --> 00:02:46,670
    This is called a scatterplot. We have these different pieces of the chart that we want to be able to identify.
  • Not Synced
    27
    00:02:46,670 --> 00:02:55,400
    And particularly as you go to refine a chart, what you're going to need to do is specify what's happening in each of these pieces.
  • Not Synced
    28
    00:02:55,400 --> 00:02:59,300
    What is your x axis? What is your y axis?
  • Not Synced
    29
    00:02:59,300 --> 00:03:08,600
    Before you even start the chart, you need to set up your data so that you know: what is the data point that I'm going to be plotting on this chart?
  • Not Synced
    30
    00:03:08,600 --> 00:03:16,280
    So charts are really useful for revealing a variety of things. They can reveal patterns or the lack thereof.
  • Not Synced
    31
    00:03:16,280 --> 00:03:19,200
    In this chart, there's really not much of a pattern.
  • Not Synced
    32
    00:03:19,200 --> 00:03:26,700
    And we can see that it's bulked up, but particularly if we get out to the larger numbers of queries, there's not a lot of pattern.
  • Not Synced
    33
    00:03:26,700 --> 00:03:34,560
    The participant with the most queries for Task 1
  • Not Synced
    34
    00:03:34,560 --> 00:03:38,760
    has a middling-to-low number of queries for Task 2.
  • Not Synced
    35
    00:03:38,760 --> 00:03:47,790
    And the one who has the most queries for Task 2, while they're in the upper end of the queries per task on Task 1,
  • Not Synced
    36
    00:03:47,790 --> 00:03:53,520
    they're not at all the highest. So we can see there's not very much of a relationship here.
  • Not Synced
    37
    00:03:53,520 --> 00:03:59,490
    At least it doesn't look like one. Charts can also be useful for comparisons. If we've got a bar chart, we can compare two bars.
  • Not Synced
    38
    00:03:59,490 --> 00:04:08,220
    We can see where points lie. We can see in that chart we just saw that the participant with the highest count for one task
  • Not Synced
    39
    00:04:08,220 --> 00:04:12,210
    and the participant with the highest count for the other task are different. We can also see trends.
  • Not Synced
    40
    00:04:12,210 --> 00:04:14,610
    We can see if a line looks like it goes up or down,
  • Not Synced
    41
    00:04:14,610 --> 00:04:22,950
    or wiggles around. So charts can reveal a lot of these kinds of things, and they can really leverage our human perception,
  • Not Synced
    42
    00:04:22,950 --> 00:04:25,080
    particularly our human visual senses,
  • Not Synced
    43
    00:04:25,080 --> 00:04:35,820
    to be able to quickly internalize and understand what is going on in a set of data.
  • Not Synced
    44
    00:04:35,820 --> 00:04:41,520
    When we're creating a chart, we need to clearly document a few things. We need to clearly state what is being presented. When someone looks at a chart,
  • Not Synced
    45
    00:04:41,520 --> 00:04:45,720
    they need to be able to understand what each point in the chart is.
  • Not Synced
    46
    00:04:45,720 --> 00:04:56,530
    They need to understand what values are plotted on the axes.
  • Not Synced
    47
    00:04:56,530 --> 00:05:01,260
    Often this is done in an axis label. In the chart I showed you,
  • Not Synced
    48
    00:05:01,260 --> 00:05:06,040
    the caption said what the values were, and the axis labels said which version of them they were.
  • Not Synced
    49
    00:05:06,040 --> 00:05:08,560
    If there are units, that needs to be clear.
  • Not Synced
    50
    00:05:08,560 --> 00:05:19,840
    So if you've got something that's millimeters, that's pounds, that's megabytes, whatever, you need to specify the units in your chart,
  • Not Synced
    51
    00:05:19,840 --> 00:05:29,320
    either in the axis label or in the caption. Some of these things can sometimes be implicit in the type of chart, such as a histogram.
  • Not Synced
    52
    00:05:29,320 --> 00:05:32,440
    If you've got a fraction or a percentage on the left-hand side,
  • Not Synced
    53
    00:05:32,440 --> 00:05:41,260
    it's standard convention that we're talking about the fraction of the values that are in each bin,
  • Not Synced
    54
    00:05:41,260 --> 00:05:49,930
    at least if you label it as a histogram or as a chart showing the distribution. But if there's any doubt about
  • Not Synced
    55
    00:05:49,930 --> 00:05:57,090
    what a value or an axis label is, or any doubt that the reader will understand what it is,
  • Not Synced
    56
    00:05:57,090 --> 00:06:04,230
    be explicit. Explicitly say what's going on in your chart. Also, the chart and the caption should be interpretable on their own.
  • Not Synced
    57
    00:06:04,230 --> 00:06:08,640
    You can assume a reasonable level of background; you have to know your audience for this.
  • Not Synced
    58
    00:06:08,640 --> 00:06:16,200
    But someone should be able to just look at the chart with its immediately surrounding description,
  • Not Synced
    59
    00:06:16,200 --> 00:06:20,670
    the labels and the caption, and have a pretty good idea of what's going on.
  • Not Synced
    60
    00:06:20,670 --> 00:06:25,990
    The surrounding text, the text that references the chart if you're writing a document,
  • Not Synced
    61
    00:06:25,990 --> 00:06:30,190
    can have your observations and can provide more context and clarity.
  • Not Synced
    62
    00:06:30,190 --> 00:06:35,950
    But someone just looking at the chart should be able to figure out basically what's going on and
  • Not Synced
    63
    00:06:35,950 --> 00:06:41,080
    not be too far off. This is particularly important because
  • Not Synced
    64
    00:06:41,080 --> 00:06:46,150
    (whether this is a good or a bad practice, we can debate) there are a lot of people who, when they're reading a paper,
  • Not Synced
    65
    00:06:46,150 --> 00:06:52,000
    focus on the charts and look at the key charts first to see what it is that's going on in the paper.
  • Not Synced
    66
    00:06:52,000 --> 00:07:01,990
    And if our charts are self-explanatory and clear, that makes it a lot easier for people to glance at our work,
  • Not Synced
    67
    00:07:01,990 --> 00:07:09,370
    see what it's doing and decide whether they are going to pay it further attention.
  • Not Synced
    68
    00:07:09,370 --> 00:07:12,970
    So if you're putting a chart in a written document or a paper,
  • Not Synced
    69
    00:07:12,970 --> 00:07:18,640
    each figure should have a caption. The caption labels the figure, and it can also provide interpretive guidance.
  • Not Synced
    70
    00:07:18,640 --> 00:07:18,790
    Like,
  • Not Synced
    71
    00:07:18,790 --> 00:07:26,410
    it's not uncommon for a caption to be two or three sentences saying things about what's going on in the chart and describing some of the methodology,
  • Not Synced
    72
    00:07:26,410 --> 00:07:31,800
    what precisely some of the computations are, etc. In a paper, we often don't need a title for the chart.
  • Not Synced
    73
    00:07:31,800 --> 00:07:37,900
    So if we have a caption, we still need to label our axes, but we don't always need a title for the chart itself.
  • Not Synced
    74
    00:07:37,900 --> 00:07:40,840
    It doesn't hurt, but often it's redundant with the caption.
  • Not Synced
    75
    00:07:40,840 --> 00:07:46,690
    In other contexts, though, we often do need a title, such as when we have a chart that's going in a presentation
  • Not Synced
    76
    00:07:46,690 --> 00:07:50,230
    or a chart in one of our notebooks. A title is often helpful. In notebooks,
  • Not Synced
    77
    00:07:50,230 --> 00:07:52,180
    the surrounding text may be sufficient,
  • Not Synced
    78
    00:07:52,180 --> 00:07:59,920
    but a title is often a good idea for someone who's quickly scanning the notebook to be able to understand what's going on in the chart.
  • Not Synced
    79
    00:07:59,920 --> 00:08:04,570
    So there are a few pitfalls to be aware of when we're thinking about statistical graphics.
  • Not Synced
    80
    00:08:04,570 --> 00:08:10,930
    One is distorting the distances or the differences in the data. In particular,
  • Not Synced
    81
    00:08:10,930 --> 00:08:13,600
    we need to make sure that if something has a length,
  • Not Synced
    82
    00:08:13,600 --> 00:08:21,040
    that length accurately represents the quantity. With position, it's the relative position that matters:
  • Not Synced
    83
    00:08:21,040 --> 00:08:26,800
    if you have two dots, their relative position is what's important. But if we have a bar, it has a length.
  • Not Synced
    84
    00:08:26,800 --> 00:08:30,700
    It also has an area. We need to make sure those accurately represent quantities.
  • Not Synced
    85
    00:08:30,700 --> 00:08:37,140
    One really common way to violate this is having a bar chart whose axis starts at something other than zero.
  • Not Synced
    86
    00:08:37,140 --> 00:08:41,670
    The software we're using doesn't do that by default. Excel does.
  • Not Synced
    87
    00:08:41,670 --> 00:08:46,860
    But your bar chart always needs to start at zero, because
  • Not Synced
    88
    00:08:46,860 --> 00:08:51,540
    people don't look at the relative position of the bar. People see the whole height of the bar.
  • Not Synced
    89
    00:08:51,540 --> 00:08:59,400
    And so if it doesn't start at zero, it looks like the difference between bars is much higher relative to the bar size than it actually is.
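
To put a number on that distortion, here is a small sketch. The function name `apparent_ratio` and the two values are hypothetical, chosen only for illustration; the idea is simply that the drawn height of a bar is its value minus wherever the axis starts.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical values: 105 is only 5% larger than 100.
true_ratio = apparent_ratio(105, 100)                     # axis starts at 0
truncated_ratio = apparent_ratio(105, 100, baseline=95)   # axis starts at 95
print(true_ratio, truncated_ratio)
```

With the axis starting at zero, the taller bar is drawn 1.05 times the height of the shorter one. With the axis starting at 95, the same two values are drawn as bars of height 10 and 5, so one looks twice as tall as the other.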
  • Not Synced
    90
    00:08:59,400 --> 00:09:01,860
    There are also ways in which we can violate conventions.
  • Not Synced
    91
    00:09:01,860 --> 00:09:09,210
    So in the first video I showed you a chart that violated the convention that the x-axis goes in order.
  • Not Synced
    92
    00:09:09,210 --> 00:09:16,320
    If we violate the user's expectations, they'll either be confused by the chart or read it wrong.
  • Not Synced
    93
    00:09:16,320 --> 00:09:24,240
    Statistical graphics, and each particular type of chart, have conventions that people who read a lot of them assimilate by long patterns of reading,
  • Not Synced
    94
    00:09:24,240 --> 00:09:32,570
    like you assimilate how to read written text. And if those expectations are violated, that can
  • Not Synced
    95
    00:09:32,570 --> 00:09:37,490
    lead the user to incorrect conclusions from our charts, from our presentation.
  • Not Synced
    96
    00:09:37,490 --> 00:09:42,920
    A key thing to remember here, which also applies to all of our presentations:
  • Not Synced
    97
    00:09:42,920 --> 00:09:52,350
    Research isn't a mystery novel. You don't have to worry about spoiling the surprise or the ending. The goal here is
  • Not Synced
    98
    00:09:52,350 --> 00:09:56,100
    not to subvert tropes or offer shocking new presentations.
  • Not Synced
    99
    00:09:56,100 --> 00:10:00,470
    We might have shocking new evidence, but from a presentation perspective,
  • Not Synced
    100
    00:10:00,470 --> 00:10:08,040
    we want it to fit within conventions and not violate readers' expectations unnecessarily so that
  • Not Synced
    101
    00:10:08,040 --> 00:10:14,370
    they can read it and be confident that they've correctly understood what it is that you're saying.
  • Not Synced
    102
    00:10:14,370 --> 00:10:18,990
    Another thing to be aware of is that graphics can illustrate an effect.
  • Not Synced
    103
    00:10:18,990 --> 00:10:27,060
    They can also help you find an effect. When we're exploring data, we can look at the graphics to see what effects we might be looking for.
  • Not Synced
    104
    00:10:27,060 --> 00:10:33,660
    We have to be careful about that. We'll talk about some of the pitfalls:
  • Not Synced
    105
    00:10:33,660 --> 00:10:39,420
    we can't combine exploratory and what's called confirmatory analysis. But
  • Not Synced
    106
    00:10:39,420 --> 00:10:45,990
    visualizing data can help us look for possible effects and get ideas for what to go look for next.
  • Not Synced
    107
    00:10:45,990 --> 00:10:53,530
    But graphics are not conclusive proof of an effect. We need the numeric results:
  • Not Synced
    108
    00:10:53,530 --> 00:11:00,030
    the raw numbers, as well as the results of inferential techniques that let us
  • Not Synced
    109
    00:11:00,030 --> 00:11:06,090
    estimate how big an effect is and whether it's significant in order to come to any conclusions.
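
The lecture does not say which inferential technique to use. As one possible illustration, here is a sketch of a paired sign-flip permutation test: it reports the size of a mean difference and estimates how often a difference at least that large would show up if the sign of each paired difference were arbitrary. The function name `paired_permutation_p` and the per-user differences are hypothetical, made up for this example.

```python
import random


def paired_permutation_p(diffs, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test for a mean paired difference."""
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-user error differences (method A minus method B).
diffs = [-0.02, 0.01, -0.03, 0.02, -0.01, 0.00, -0.02, 0.03, -0.01, -0.02]
p = paired_permutation_p(diffs)
print(f"mean difference = {sum(diffs) / len(diffs):.3f}, p = {p:.3f}")
```

The mean difference is the effect size, and the p-value speaks to whether a gap that small could plausibly be noise. A chart can show the gap, but a calculation along these lines is what supports a conclusion about it.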
  • Not Synced
    110
    00:11:06,090 --> 00:11:13,920
    So in this chart I want to show you, for example, if we look at the chart closely,
  • Not Synced
    111
    00:11:13,920 --> 00:11:20,790
    we have these two data points, the little blue Xs. The FunkSVD X
  • Not Synced
    112
    00:11:20,790 --> 00:11:26,160
    is to the left of the Item-Item X. For this particular metric,
  • Not Synced
    113
    00:11:26,160 --> 00:11:32,900
    lower is a better value because it's an error metric; root mean squared error is what RMSE stands for.
  • Not Synced
    114
    00:11:32,900 --> 00:11:38,520
    So it looks like, OK, this is a little bit better. But that's not sufficient evidence for us
  • Not Synced
    115
    00:11:38,520 --> 00:11:44,220
    to conclude that FunkSVD is better than Item-Item on the per-user RMSE metric.
  • Not Synced
    116
    00:11:44,220 --> 00:11:48,900
    Exactly what all these things are is a topic for another day.
  • Not Synced
    117
    00:11:48,900 --> 00:11:55,230
    But the fact that we see the point to the left means that, if the effect is real, this illustrates it.
  • Not Synced
    118
    00:11:55,230 --> 00:12:02,220
    But seeing it is not enough for us to conclude that it outperforms, because it might be a fluke of our experimental strategy;
  • Not Synced
    119
    00:12:02,220 --> 00:12:07,120
    it's a relatively small difference. So graphics help us see.
  • Not Synced
    120
    00:12:07,120 --> 00:12:11,620
    They help us communicate. They're not definitive and conclusive proof.
  • Not Synced
    121
    00:12:11,620 --> 00:12:16,660
    A couple of other things I want to highlight that are going on in this graph: I've introduced two different kinds of symbols here.
  • Not Synced
    122
    00:12:16,660 --> 00:12:19,930
    In the earlier graph, we just had one kind of symbol: we had dots. Here,
  • Not Synced
    123
    00:12:19,930 --> 00:12:24,580
    we have two different kinds, with a legend that says red circles
  • Not Synced
    124
    00:12:24,580 --> 00:12:30,930
    are a thing called global RMSE and blue Xs are a thing called per-user RMSE.
  • Not Synced
    125
    00:12:30,930 --> 00:12:37,920
    You don't have to understand what those are. But the point is, I'm using different colors and shapes
  • Not Synced
    126
    00:12:37,920 --> 00:12:45,750
    to show different versions of a thing in the same chart. Using different shapes
  • Not Synced
    127
    00:12:45,750 --> 00:12:50,430
    in addition to different colors is useful because, if someone prints it on a black-and-white printer
  • Not Synced
    128
    00:12:50,430 --> 00:12:57,270
    or if they have some form of color blindness, it helps make the differences clearer.
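
One way to make that pairing of color and shape systematic, sketched under assumptions: the helper `series_styles` and the two palettes are hypothetical. The marker codes follow matplotlib's naming ("o" for a circle, "x" for a cross), so each resulting dictionary could be passed as keyword arguments to a call such as matplotlib's `Axes.scatter`, but nothing here depends on a plotting library.

```python
# Assumed palettes; marker codes follow matplotlib's naming ("o" circle, "x" cross).
COLORS = ["red", "blue", "green", "black"]
MARKERS = ["o", "x", "s", "^"]


def series_styles(names):
    """Give each series both a color and a marker, so either channel alone separates them."""
    if len(names) > min(len(COLORS), len(MARKERS)):
        raise ValueError("not enough distinct colors/markers for this many series")
    return {n: {"color": c, "marker": m} for n, c, m in zip(names, COLORS, MARKERS)}


styles = series_styles(["Global RMSE", "Per-user RMSE"])
print(styles)
```

Because every series differs in marker as well as color, the chart still reads correctly in grayscale, where the colors collapse but the shapes do not.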
  • Not Synced
    129
    00:12:57,270 --> 00:13:05,070
    In addition, I've got my y-axis, which indicates the different things that I'm plotting here.
  • Not Synced
    130
    00:13:05,070 --> 00:13:11,430
    I have also grouped them, just to make it easier for the user to see which ones
  • Not Synced
    131
    00:13:11,430 --> 00:13:15,330
    are the same kind. These first ones are all single algorithms.
  • Not Synced
    132
    00:13:15,330 --> 00:13:20,920
    And then we have a blend and a few other things. The details
  • Not Synced
    133
    00:13:20,920 --> 00:13:27,220
    aren't important for this illustration, but grouping helps guide the user to understand the structure: we have these group breakdowns.
  • Not Synced
    134
    00:13:27,220 --> 00:13:32,680
    It also helps save space in the paper because I can present all of these different things in one place.
  • Not Synced
    135
    00:13:32,680 --> 00:13:39,730
    It's easy to compare the different stages, even though I have to split them out in the discussion in the paper.
  • Not Synced
    136
    00:13:39,730 --> 00:13:46,540
    But it gives you one place to compare them and it concisely shows the key results of the entire paper in one chart.
  • Not Synced
    137
    00:13:46,540 --> 00:13:51,880
    So to wrap up: graphics can make data clearer, and they let us leverage human perception to understand it.
  • Not Synced
    138
    00:13:51,880 --> 00:13:54,160
    They don't replace our numerical analysis,
  • Not Synced
    139
    00:13:54,160 --> 00:14:01,570
    but they give it context and they help us more clearly communicate what it is that we're learning from the data and what's going on in it.
  • Not Synced
    140
    00:14:01,570 --> 00:14:06,340
    We do, however, always need to make sure that we clearly label and describe our graphics so that
  • Not Synced
    141
    00:14:06,340 --> 00:14:15,300
    readers can understand them and they can draw correct conclusions from them.
  • Not Synced
Title:
https:/.../f87fc754-ecbd-4970-ad3b-ad9601830c4f-3d7efd1b-793d-4716-bef3-ad9c006aebdc.mp4?invocationId=ea100906-a50f-ec11-a9e9-0a1a827ad0ec
Video Language:
English
Duration:
14:15