https:/.../824f231f-7cda-42a5-917c-ad9000efd981-17a11971-cdab-48ab-8f77-ad9101278d7e.mp4?invocationId=36100777-3009-ec11-a9e9-0a1a827ad0ec

  • 0:05 - 0:10
    In this video, I want to talk about data sources, where our data is coming from, and particularly
  • 0:10 - 0:15
    introduce the concept of bias and start to talk about where biases can come from in our data.
  • 0:15 - 0:23
    So our learning outcomes are to understand what bias means and start to identify the sources of bias in observations of a variable.
  • 0:23 - 0:31
    So one of the goals of a lot of our data science work, especially as we develop more sophisticated tools, is going to be to estimate things.
  • 0:31 - 0:36
    And in statistical terminology, what we say is that we're estimating the value of a parameter.
  • 0:36 - 0:40
    So I introduced the term statistic in a previous video.
  • 0:40 - 0:46
    But a parameter is some property
  • 0:46 - 0:54
    of the world or of the population that we're trying to study. And our goal is to estimate that with some statistic.
  • 0:54 - 0:58
    So if we have our data pipeline, we have the things that we're trying to study,
  • 0:58 - 1:09
    we have observable phenomena or experimental results that come out of those, which become raw and then processed data.
  • 1:09 - 1:11
    The goal is to be able to use the data,
  • 1:11 - 1:20
    the processed data, to compute a statistic that allows us to estimate the value of the parameter back in the world.
  • 1:20 - 1:27
    For example, suppose we want to understand the approval of our company,
  • 1:27 - 1:33
    and we want to estimate a parameter: either the net approval, that is, the number of people who
  • 1:33 - 1:36
    approve of our company minus the number who disapprove,
  • 1:36 - 1:45
    or maybe the percentage of the citizens or residents of the society who have a positive opinion of our company.
  • 1:45 - 1:46
    We could compute a statistic.
  • 1:46 - 1:53
    We could take a sample of people and look at the percentage of that sample
  • 1:53 - 2:01
    that has a positive opinion of our company. And the goal of this process is that the statistic is approximately the parameter.
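A minimal sketch of the parameter/statistic distinction described above. The population, its 60% approval rate, and the sample size are all made-up assumptions for illustration; in real work the parameter is not known.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = positive opinion, 0 = negative opinion.
population = [1] * 60_000 + [0] * 40_000

# The parameter is a property of the whole population (normally unknown).
parameter = sum(population) / len(population)

# The statistic is computed from the sample we actually observe.
sample = random.sample(population, 1_000)
statistic = sum(sample) / len(sample)

# Net approval in the sample, expressed as a share: (approve - disapprove) / n.
approve = sum(sample)
disapprove = len(sample) - approve
net_approval = (approve - disapprove) / len(sample)

print(f"parameter (population share positive): {parameter:.3f}")
print(f"statistic (sample share positive):     {statistic:.3f}")
print(f"sample net approval:                   {net_approval:+.3f}")
```

With a simple random sample like this one, the statistic lands near the parameter but is not expected to match it exactly.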
  • 2:01 - 2:09
    And bias is when the statistic systematically differs from the parameter.
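In standard statistical notation (the symbols are not used in the video and are introduced here only for illustration), with $\theta$ the parameter and $\hat{\theta}$ the statistic used to estimate it, this definition of bias is usually written as:

```latex
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
```

"Systematically" corresponds to the expectation: a single sample can miss in either direction by chance, but a biased statistic misses in the same direction on average over repeated samples.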
  • 2:09 - 2:19
    And there are a few sources of this. One is selection bias, where some people are more likely to be contacted than others in our survey.
  • 2:19 - 2:22
    And if the people who are
  • 2:22 - 2:29
    more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted,
  • 2:29 - 2:36
    that's a source of bias. Response bias is when some people are more likely to respond.
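Going back to selection bias for a moment, here is a rough simulation sketch. Everything in it is a hypothetical assumption: the two groups, their approval rates, and the idea that one group is three times as likely to be contacted. It compares surveys that contact everyone with equal probability against surveys with skewed contact probabilities.

```python
import random
from itertools import accumulate

random.seed(1)  # fixed seed so the sketch is repeatable

def make_group(size, approval_rate, contact_weight):
    """Return (approves, contact_weight) pairs for one hypothetical group."""
    n_approve = round(size * approval_rate)
    return [(1, contact_weight)] * n_approve + [(0, contact_weight)] * (size - n_approve)

# Assumption: customers approve more often AND are 3x as likely to be contacted.
population = make_group(2_000, 0.70, 3.0) + make_group(8_000, 0.40, 1.0)
parameter = sum(approves for approves, _ in population) / len(population)

# Cumulative weights for equal-probability and skewed contact.
cum_uniform = list(accumulate(1.0 for _ in population))
cum_skewed = list(accumulate(weight for _, weight in population))

def survey(cum_weights, n=500):
    """Contact n people (with replacement) and return the share who approve."""
    sample = random.choices(population, cum_weights=cum_weights, k=n)
    return sum(approves for approves, _ in sample) / n

# Average over many repeated surveys to separate bias from sampling noise.
mean_uniform = sum(survey(cum_uniform) for _ in range(200)) / 200
mean_skewed = sum(survey(cum_skewed) for _ in range(200)) / 200

print(f"parameter:              {parameter:.3f}")
print(f"equal contact, average: {mean_uniform:.3f}")
print(f"skewed contact, average: {mean_skewed:.3f}")
```

Under these assumptions, averaging many equal-contact surveys converges on the parameter, while averaging many skewed-contact surveys converges on a different number. Repeating a biased survey does not remove the bias.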
  • 2:36 - 2:43
    So one survey method is called random digit dialing, where you dial random phone numbers;
  • 2:43 - 2:46
    if some people are more likely to pick up the phone than others,
  • 2:46 - 2:52
    or if some people are more likely than others, once they find out what the call is, to respond to the survey,
  • 2:52 - 3:03
    that's also going to induce a bias. And then measurement bias is when the way that we measure the results skews one way or another.
  • 3:03 - 3:09
    And in our example here, where this could arise is if the way that we frame the question
  • 3:09 - 3:16
    biases the approval positively or negatively, that is, what people say or how they respond.
  • 3:16 - 3:26
    Then we have the response, they're going to answer our questions, but there's a bias in how their opinion translates into data.
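The response bias and measurement bias described above can be sketched in the same style. The true approval rate, the response probabilities, and the "leading question" effect are invented numbers, and `run_survey` is a hypothetical helper rather than anything from the video.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

TRUE_APPROVAL = 0.46  # hypothetical parameter

def run_survey(n_contacted, respond_if_approve, respond_if_disapprove, leading_shift):
    """Simulate a survey and return the share of recorded positive answers.

    respond_if_*  : probability that a contacted person answers at all
    leading_shift : probability that a disapprover is nudged into a
                    positive answer by how the question is framed
    """
    answers = []
    for _ in range(n_contacted):
        approves = random.random() < TRUE_APPROVAL
        p_respond = respond_if_approve if approves else respond_if_disapprove
        if random.random() >= p_respond:
            continue  # this person never makes it into the data
        says_positive = approves or (random.random() < leading_shift)
        answers.append(says_positive)
    return sum(answers) / len(answers)

baseline = run_survey(100_000, 0.5, 0.5, 0.0)          # no bias modeled
response_biased = run_survey(100_000, 0.6, 0.3, 0.0)   # approvers answer more often
measurement_biased = run_survey(100_000, 0.5, 0.5, 0.2)  # leading question

print(f"baseline:           {baseline:.3f}")
print(f"response bias:      {response_biased:.3f}")
print(f"measurement bias:   {measurement_biased:.3f}")
```

In this toy model, both biased surveys overstate approval even though everyone is equally likely to be contacted: one because of who ends up in the data, the other because of how opinions are turned into recorded answers.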
  • 3:26 - 3:36
    These biases at these stages of the pipeline can come up in almost any kind of data collection process.
  • 3:36 - 3:48
    Controlling for them and counteracting them is a significant field of study; reputable political pollsters and
  • 3:48 - 3:57
    reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias.
  • 3:57 - 4:03
    But along the way from the population of people we're trying to study, or the objects
  • 4:03 - 4:11
    we're trying to study, through to the data that we actually get, these are the places where bias can come into the process.
  • 4:11 - 4:14
    Bias also may not affect all groups equally.
  • 4:14 - 4:20
    We may have a group that shows up more frequently in the data than they do in the population, or less frequently.
  • 4:20 - 4:28
    There may be a measurement skew so that the way that we're measuring our data
  • 4:28 - 4:33
    responds to the thing we're trying to measure differently between different groups.
  • 4:33 - 4:43
    So one example of this: standardized tests like the SAT and the ACT are intended to measure your academic preparedness for college.
  • 4:43 - 4:48
    But there are two things that go into how well you're going to do on the SAT or
  • 4:48 - 4:53
    the ACT. One is your raw academic preparedness:
  • 4:53 - 4:58
    how good you are at engaging with the kind of material that they're testing your ability to engage with,
  • 4:58 - 5:07
    and the other is your preparedness for the test itself. And there are a lot of test preparation resources that help you prepare for the test.
  • 5:07 - 5:13
    Then there are other things, like just how much time you have available to study.
  • 5:13 - 5:25
    And one of the outcomes of that is that socioeconomic status becomes a very strong indicator, a very strong factor, in standardized test scores.
  • 5:25 - 5:33
    So suppose you have two students who, given the same situation, the same economic situation,
  • 5:33 - 5:40
    the same level of stress, the same level of preparedness, would be able to engage equally well with the material.
  • 5:40 - 5:48
    And that, ideally, is what you want to test if you're, say, seeing if someone is going to be an effective college student.
  • 5:48 - 5:54
    The one who has more economic security doesn't have to work as many hours that take time from their studies.
  • 5:54 - 5:58
    They have the ability to afford more test prep resources.
  • 5:58 - 6:06
    They're going to score higher on the standardized test than the person who, because of their social situation,
  • 6:06 - 6:11
    because of their economic situation, because of their background, goes into the test less prepared.
  • 6:11 - 6:15
    Given this, if you swapped these students' circumstances, the scores would swap.
  • 6:15 - 6:22
    There's no difference in the students' academic ability to engage with the material and to do the work.
  • 6:22 - 6:32
    The measurement instrument, the standardized test, is responding differently to the thing it wants to measure based on the
  • 6:32 - 6:37
    socioeconomic status and surrounding circumstances of the student we're trying to measure.
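A toy model of an instrument that responds differently across groups, as in the standardized test example above. The score scale, the points-per-prep-hour effect, and the prep hours are all invented for illustration and are not estimates of any real test.

```python
import random
from statistics import mean

random.seed(3)  # fixed seed so the sketch is repeatable

POINTS_PER_PREP_HOUR = 2.0  # hypothetical effect of test-specific preparation

def expected_score(ability, prep_hours):
    """Score the instrument reports on average: ability plus a prep effect."""
    return ability + POINTS_PER_PREP_HOUR * prep_hours

def observed_score(ability, prep_hours):
    """One test sitting: the expected score plus random noise."""
    return expected_score(ability, prep_hours) + random.gauss(0, 10)

# Two students with identical ability but different circumstances.
student_a = expected_score(ability=500, prep_hours=40)
student_b = expected_score(ability=500, prep_hours=10)
print(f"same ability, different prep time: {student_a:.0f} vs {student_b:.0f}")

# Two groups with the SAME ability distribution, differing only in prep time.
abilities = [random.gauss(500, 50) for _ in range(5_000)]
more_time = [observed_score(a, prep_hours=40) for a in abilities]
less_time = [observed_score(a, prep_hours=10) for a in abilities]
gap = mean(more_time) - mean(less_time)
print(f"average score gap with no ability gap: {gap:.1f}")
```

Because the two students have the same ability, swapping their prep hours swaps their expected scores, which is the point made in the transcript: the gap comes from the instrument and the circumstances, not from the quantity the test is meant to measure.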
  • 6:37 - 6:42
    So one of the things that we need to do immediately about this, in line with our theme this week
  • 6:42 - 6:48
    of describing data, is to clearly and fully document the data collection process.
  • 6:48 - 6:54
    This is a major focus of the datasheets reading, and it does a few things. First,
  • 6:54 - 6:59
    it forces us to think about it if we're creating the data; or if we're using an existing dataset,
  • 6:59 - 7:04
    we're trying to find the answers to these questions. It then enables further and future reuses of the data,
  • 7:04 - 7:13
    because if we've carefully documented the collection process, the data processing, etc., that results in the data,
  • 7:13 - 7:21
    then other people who come across the data, future users who may want to reproduce our analysis or apply the data to a different problem,
  • 7:21 - 7:31
    will have the information they need to assess what the likely biases are and if those biases are likely to affect their problem.
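One possible way to keep this kind of documentation next to the data is a small structured record. This is only a sketch: the `CollectionRecord` class, its field names, and the example values are hypothetical and are not taken from the datasheets reading.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class CollectionRecord:
    """Hypothetical minimal record of how a dataset was collected."""
    target_population: str
    sampling_method: str
    contact_mode: str
    response_rate: float
    question_wording: str
    processing_steps: List[str] = field(default_factory=list)
    known_bias_risks: List[str] = field(default_factory=list)

# Example values are invented for illustration.
record = CollectionRecord(
    target_population="adult residents",
    sampling_method="random digit dialing",
    contact_mode="phone",
    response_rate=0.12,
    question_wording="Do you have a positive or negative opinion of the company?",
    processing_steps=["dropped incomplete interviews"],
    known_bias_risks=[
        "selection: people without phones cannot be contacted",
        "response: most contacted people did not answer",
    ],
)

# Serialize so the record can be stored alongside the dataset.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

The value of a record like this is that a later user can check each stage (who was contacted, who responded, how answers were measured) against the needs of their own problem.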
  • 7:31 - 7:37
    It also creates a basis for the future: if we discover, through research, additional potential biases,
  • 7:37 - 7:42
    it lets us go back and see, based on the documentation of how this data was collected,
  • 7:42 - 7:47
    how likely it is for that to be a problem for this data as well. So the takeaway I want you to have:
  • 7:47 - 7:54
    I want you to start thinking about how bias can affect our data.
  • 7:54 - 8:03
    From a statistical perspective, bias is the systematic deviation of our estimate from the thing we're trying to estimate.
  • 8:03 - 8:09
    And document your data. Look for the documentation of the data that you're using.
  • 8:09 - 8:11
    So to wrap up, the goal
  • 8:11 - 8:19
    is for our data to accurately reflect the population and for the statistics we compute from it to accurately and reliably approximate parameters.
  • 8:19 - 8:22
    They're never going to exactly equal the quantity of interest.
  • 8:22 - 8:29
    But hopefully they're pretty close, and hopefully there are no systematic differences in one direction or another.
  • 8:29 - 8:46
    But there are various sources of bias: sampling bias, response bias, and measurement bias, just to name three.
Video Language:
English
Duration:
08:45
