1 00:00:04,580 --> 00:00:09,890 This video, I want to talk some about data sources where our data is coming from and particularly 2 00:00:09,890 --> 00:00:15,230 introduce the concept of bias and start to talk about where biases can come from in our data. 3 00:00:15,230 --> 00:00:23,030 So our learning outcomes are to understand what bias means and start to identify the sources of bias and observations of a variable. 4 00:00:23,030 --> 00:00:31,120 So one of the goals of a lot of our data science work is festively as we develop more sophisticated tools is going to be to estimate things. 5 00:00:31,120 --> 00:00:36,260 And in statistical terminology, what we say is that we're estimating the value of a parameter. 6 00:00:36,260 --> 00:00:40,190 So I introduced the term statistic in the previous in a previous video. 7 00:00:40,190 --> 00:00:46,430 But a parameter is some property of. 8 00:00:46,430 --> 00:00:53,570 Of the world or of the population that we're trying to study. And our goal is to estimate that with some statistic. 9 00:00:53,570 --> 00:00:58,040 So if we have our data pipeline, we have the things that we're trying to study, 10 00:00:58,040 --> 00:01:08,920 we can we have observable phenomenon or experimental results that come out of those that become raw and then processed data. 11 00:01:08,920 --> 00:01:11,350 The goal is to be able to use the data, 12 00:01:11,350 --> 00:01:20,230 the processed data to estimate to computers a statistic that allows us to estimate the value, the parameter back in the world. 13 00:01:20,230 --> 00:01:27,160 For example, if we want to understand the approval of our company. 14 00:01:27,160 --> 00:01:32,920 And we want to estimate the parameter of either the net approval, like the number of people who agree, 15 00:01:32,920 --> 00:01:36,490 minus the number of feet who approve of our company, minus disapprove. 16 00:01:36,490 --> 00:01:44,670 Or maybe the percentage of the citizens of residents of the society who have a positive opinion of our company. 17 00:01:44,670 --> 00:01:45,790 We could computer statistic. 18 00:01:45,790 --> 00:01:52,750 We could take a sample of of people and look at the percentage that of that population that has about half of that sample, 19 00:01:52,750 --> 00:02:00,610 that has a positive opinion of our company. And the goal of this process is that the statistic is approximately the parameter. 20 00:02:00,610 --> 00:02:09,190 And what bias is bias is when the statistics systematically differs from the parameter. 21 00:02:09,190 --> 00:02:18,950 And there are a few sources of this. One is selection bias, where some people are more likely to be contacted than others in our survey. 22 00:02:18,950 --> 00:02:21,680 And it and if the people are poor, 23 00:02:21,680 --> 00:02:29,180 more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted. 24 00:02:29,180 --> 00:02:35,700 That's a source of bias. Response bias is some people are more likely to respond. 25 00:02:35,700 --> 00:02:43,290 So if one survey method is called random digit dialing, where you dial random phone numbers, 26 00:02:43,290 --> 00:02:45,510 if some people are more likely to pick up the phone than others, 27 00:02:45,510 --> 00:02:52,290 or if some people are more likely once they find out what the call is to respond to the survey than others. 28 00:02:52,290 --> 00:03:03,300 That is that's going to also induce a bias. And then measurement bias is when the way that we measure the results skews one way or another. 29 00:03:03,300 --> 00:03:09,360 And in our example here where this could arise is if the way that we frame the question. 30 00:03:09,360 --> 00:03:15,620 Bias is the approval positive? What people say positively or negatively or how they respond? 31 00:03:15,620 --> 00:03:26,210 Then that we have the response, they're going to answer our questions. But we've changed how there's a bias in how their opinion translates into data. 32 00:03:26,210 --> 00:03:36,450 These biases can come up at the biases that these stages of the pipeline can come up and almost any data collection kind of process. 33 00:03:36,450 --> 00:03:48,360 Controlling for them and counteracting them is a significant field of study where reputable, reputable political pollsters, 34 00:03:48,360 --> 00:03:57,480 reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias. 35 00:03:57,480 --> 00:04:02,820 But it's a way when we have our from the population of people, we're trying to study objects. 36 00:04:02,820 --> 00:04:11,440 We're trying to study through to the data that we actually get. It's the places where we're bias can come into the process. 37 00:04:11,440 --> 00:04:14,500 Bias also may not affect all groups equally. 38 00:04:14,500 --> 00:04:20,410 We may have a group that shows up more frequently in the data than than they are in the population less frequently. 39 00:04:20,410 --> 00:04:28,060 There may be a measurement skew so that the the way that we're measuring our data 40 00:04:28,060 --> 00:04:32,530 responds to the thing we're trying to measure differently between different groups. 41 00:04:32,530 --> 00:04:43,000 So one is one example of this is standardized tests like the S.A.T. and the ACTC are intended to measure your academic preparedness for college. 42 00:04:43,000 --> 00:04:47,680 But there's two things that go into how well you're going to do in the essay tier. 43 00:04:47,680 --> 00:04:52,720 The ACTC one is your raw economic or academic preparedness. 44 00:04:52,720 --> 00:04:58,450 How good are you were engaging with the kind of material that they're testing your ability to engage on, 45 00:04:58,450 --> 00:05:07,420 and the other is your preparedness for the test itself. And there are a lot of test preparation resources that help you prepare for the test. 46 00:05:07,420 --> 00:05:12,670 Then there's the other things of just how much time do you have available to study and things like that. 47 00:05:12,670 --> 00:05:24,850 And one of the outcomes of that is that socio economic status becomes a very strong indicator in a very strong factor in standardized test scores. 48 00:05:24,850 --> 00:05:32,920 So if you have two students who given the same situation and the same economics, the same economic situation, 49 00:05:32,920 --> 00:05:39,520 the same level of stress, the same level of preparedness would be able to equally well engage with the material. 50 00:05:39,520 --> 00:05:47,930 And that ideally is what you want to test if you're say seeing if someone is going to be a an effective college student. 51 00:05:47,930 --> 00:05:54,440 The one who has more economic security, they don't have to work as many hours that take from their studies. 52 00:05:54,440 --> 00:05:58,160 They have the ability to, four, afford more test prep resources. 53 00:05:58,160 --> 00:06:05,540 They're going to score higher on the standardized test than the person who, because of their social situation, 54 00:06:05,540 --> 00:06:11,420 because of their economic situation, because of their background, is goes into the test less prepared. 55 00:06:11,420 --> 00:06:15,230 These students, given this, if you swapped their circumstances, the scores would swap. 56 00:06:15,230 --> 00:06:22,070 There's no difference in the student's academic ability to engage with the material and to do the work. 57 00:06:22,070 --> 00:06:32,240 The system is responding. The measurement instrument, the standardized test is responding differently to the thing it wants to measure based on the 58 00:06:32,240 --> 00:06:37,490 socio economic status and surrounding circumstances of the student we're trying to measure. 59 00:06:37,490 --> 00:06:41,990 So one of the things immediately that we need to do with this in line with our theme this week 60 00:06:41,990 --> 00:06:47,600 of describing data is that we need to clearly and fully document the data collection process. 61 00:06:47,600 --> 00:06:54,220 This is a major focus of the data sheets reading because and this this does a few things at first. 62 00:06:54,220 --> 00:06:59,150 It forces us to think about it if we're creating the data or if we're using an existing dataset. 63 00:06:59,150 --> 00:07:04,250 We're trying to find the answers to these questions. It then enables further and future reuses of the data, 64 00:07:04,250 --> 00:07:12,560 because if we've carefully documented the collection process, the data processing, etc., that results in the data. 65 00:07:12,560 --> 00:07:21,380 Then other people who come across the data, future users that may want to reproduce our analysis, may want to apply the data to a different problem. 66 00:07:21,380 --> 00:07:30,920 They'll have the information they need to assess what the likely biases are and if those biases are likely to be to affect their problem. 67 00:07:30,920 --> 00:07:37,400 It also creates the basis for as potential if we discover in the future through research, additional potential biases. 68 00:07:37,400 --> 00:07:42,140 It lets us go back and see well, based on the documentation of how this data is collected, 69 00:07:42,140 --> 00:07:47,210 how likely is it for that to be a problem for this data as well? So the takeaway I want you to have, right. 70 00:07:47,210 --> 00:07:53,680 I want you to start thinking about how bias can affect our data. And is this a bias? 71 00:07:53,680 --> 00:08:03,160 Is is the systematic from a statistical perspective? Bias is the systematic deviation of our estimate from the thing we're trying to estimate. 72 00:08:03,160 --> 00:08:09,370 But document your data. Look for the documentation of the data that you're using. 73 00:08:09,370 --> 00:08:11,380 So to wrap up the goal, 74 00:08:11,380 --> 00:08:19,480 as for our data to accurately reflect the population and for the statistics we compute from it to accurately and reliably approximate parameters, 75 00:08:19,480 --> 00:08:22,030 they're never going to exactly equal the quantity of interest. 76 00:08:22,030 --> 00:08:29,170 But hopefully they're pretty close and hopefully there's not systemic or systematic differences in one way or another. 77 00:08:29,170 --> 00:08:45,676 But various sources of bias, sampling, bias, response, bias and measurement bias just for three.