[Script Info]
Title:
[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:04.58,0:00:09.89,Default,,0000,0000,0000,,In this video, I want to talk some about data sources, where our data is coming from, and particularly
Dialogue: 0,0:00:09.89,0:00:15.23,Default,,0000,0000,0000,,introduce the concept of bias and start to talk about where biases can come from in our data.
Dialogue: 0,0:00:15.23,0:00:23.03,Default,,0000,0000,0000,,So our learning outcomes are to understand what bias means and start to identify the sources of bias in observations of a variable.
Dialogue: 0,0:00:23.03,0:00:31.12,Default,,0000,0000,0000,,So one of the goals of a lot of our data science work, especially as we develop more sophisticated tools, is going to be to estimate things.
Dialogue: 0,0:00:31.12,0:00:36.26,Default,,0000,0000,0000,,And in statistical terminology, what we say is that we're estimating the value of a parameter.
Dialogue: 0,0:00:36.26,0:00:40.19,Default,,0000,0000,0000,,So I introduced the term statistic in a previous video.
Dialogue: 0,0:00:40.19,0:00:46.43,Default,,0000,0000,0000,,But a parameter is some property
Dialogue: 0,0:00:46.43,0:00:53.57,Default,,0000,0000,0000,,of the world or of the population that we're trying to study. And our goal is to estimate that with some statistic.
Dialogue: 0,0:00:53.57,0:00:58.04,Default,,0000,0000,0000,,So if we have our data pipeline, we have the things that we're trying to study,
Dialogue: 0,0:00:58.04,0:01:08.92,Default,,0000,0000,0000,,we have observable phenomena or experimental results that come out of those, that become raw and then processed data.
Dialogue: 0,0:01:08.92,0:01:11.35,Default,,0000,0000,0000,,The goal is to be able to use the data,
Dialogue: 0,0:01:11.35,0:01:20.23,Default,,0000,0000,0000,,the processed data, to compute a statistic that allows us to estimate the value of the parameter back in the world.
Dialogue: 0,0:01:20.23,0:01:27.16,Default,,0000,0000,0000,,For example, if we want to understand the approval of our company,
Dialogue: 0,0:01:27.16,0:01:32.92,Default,,0000,0000,0000,,and we want to estimate the parameter of either the net approval, like the number of people who approve
Dialogue: 0,0:01:32.92,0:01:36.49,Default,,0000,0000,0000,,of our company minus the number who disapprove,
Dialogue: 0,0:01:36.49,0:01:44.67,Default,,0000,0000,0000,,or maybe the percentage of the citizens or residents of the society who have a positive opinion of our company,
Dialogue: 0,0:01:44.67,0:01:45.79,Default,,0000,0000,0000,,we could compute a statistic.
Dialogue: 0,0:01:45.79,0:01:52.75,Default,,0000,0000,0000,,We could take a sample of people and look at the percentage of that sample
Dialogue: 0,0:01:52.75,0:02:00.61,Default,,0000,0000,0000,,that has a positive opinion of our company. And the goal of this process is that the statistic is approximately the parameter.
Dialogue: 0,0:02:00.61,0:02:09.19,Default,,0000,0000,0000,,And what bias is, is when the statistic systematically differs from the parameter.
Dialogue: 0,0:02:09.19,0:02:18.95,Default,,0000,0000,0000,,And there are a few sources of this. One is selection bias, where some people are more likely to be contacted than others in our survey.
Dialogue: 0,0:02:18.95,0:02:21.68,Default,,0000,0000,0000,,And if the people who are
Dialogue: 0,0:02:21.68,0:02:29.18,Default,,0000,0000,0000,,more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted,
Dialogue: 0,0:02:29.18,0:02:35.70,Default,,0000,0000,0000,,that's a source of bias. Response bias is when some people are more likely to respond.
Dialogue: 0,0:02:35.70,0:02:43.29,Default,,0000,0000,0000,,So one survey method is called random digit dialing, where you dial random phone numbers.
Dialogue: 0,0:02:43.29,0:02:45.51,Default,,0000,0000,0000,,If some people are more likely to pick up the phone than others,
Dialogue: 0,0:02:45.51,0:02:52.29,Default,,0000,0000,0000,,or if some people are more likely, once they find out what the call is, to respond to the survey than others,
Dialogue: 0,0:02:52.29,0:03:03.30,Default,,0000,0000,0000,,that's going to also induce a bias. And then measurement bias is when the way that we measure the results skews one way or another.
Dialogue: 0,0:03:03.30,0:03:09.36,Default,,0000,0000,0000,,And in our example here, where this could arise is if the way that we frame the question
Dialogue: 0,0:03:09.36,0:03:15.62,Default,,0000,0000,0000,,biases the approval that people report, positively or negatively, or how they respond.
Dialogue: 0,0:03:15.62,0:03:26.21,Default,,0000,0000,0000,,Then we have the response, they're going to answer our questions, but there's a bias in how their opinion translates into data.
Dialogue: 0,0:03:26.21,0:03:36.45,Default,,0000,0000,0000,,These biases can come up at these stages of the pipeline, and in almost any kind of data collection process.
Dialogue: 0,0:03:36.45,0:03:48.36,Default,,0000,0000,0000,,Controlling for them and counteracting them is a significant field of study, where reputable political pollsters,
Dialogue: 0,0:03:48.36,0:03:57.48,Default,,0000,0000,0000,,reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias.
Dialogue: 0,0:03:57.48,0:04:02.82,Default,,0000,0000,0000,,But along the way, from the population of people we're trying to study, or the objects
Dialogue: 0,0:04:02.82,0:04:11.44,Default,,0000,0000,0000,,we're trying to study, through to the data that we actually get, these are the places where bias can come into the process.
Dialogue: 0,0:04:11.44,0:04:14.50,Default,,0000,0000,0000,,Bias also may not affect all groups equally.
Dialogue: 0,0:04:14.50,0:04:20.41,Default,,0000,0000,0000,,We may have a group that shows up more frequently in the data than they are in the population, or less frequently.
Dialogue: 0,0:04:20.41,0:04:28.06,Default,,0000,0000,0000,,There may be a measurement skew, so that the way that we're measuring our data
Dialogue: 0,0:04:28.06,0:04:32.53,Default,,0000,0000,0000,,responds to the thing we're trying to measure differently between different groups.
Dialogue: 0,0:04:32.53,0:04:43.00,Default,,0000,0000,0000,,So one example of this is standardized tests. The SAT and the ACT are intended to measure your academic preparedness for college.
Dialogue: 0,0:04:43.00,0:04:47.68,Default,,0000,0000,0000,,But there are two things that go into how well you're going to do on the SAT or
Dialogue: 0,0:04:47.68,0:04:52.72,Default,,0000,0000,0000,,the ACT. One is your raw academic preparedness:
Dialogue: 0,0:04:52.72,0:04:58.45,Default,,0000,0000,0000,,how good are you at engaging with the kind of material that they're testing your ability to engage on?
Dialogue: 0,0:04:58.45,0:05:07.42,Default,,0000,0000,0000,,And the other is your preparedness for the test itself. And there are a lot of test preparation resources that help you prepare for the test.
Dialogue: 0,0:05:07.42,0:05:12.67,Default,,0000,0000,0000,,Then there are the other things, like just how much time you have available to study.
Dialogue: 0,0:05:12.67,0:05:24.85,Default,,0000,0000,0000,,And one of the outcomes of that is that socioeconomic status becomes a very strong factor in standardized test scores.
Dialogue: 0,0:05:24.85,0:05:32.92,Default,,0000,0000,0000,,So suppose you have two students who, given the same situation, the same economic situation,
Dialogue: 0,0:05:32.92,0:05:39.52,Default,,0000,0000,0000,,the same level of stress, the same level of preparedness, would be able to equally well engage with the material.
Dialogue: 0,0:05:39.52,0:05:47.93,Default,,0000,0000,0000,,And that ideally is what you want to test if you're, say, seeing if someone is going to be an effective college student.
Dialogue: 0,0:05:47.93,0:05:54.44,Default,,0000,0000,0000,,The one who has more economic security, they don't have to work as many hours that take away from their studies.
Dialogue: 0,0:05:54.44,0:05:58.16,Default,,0000,0000,0000,,They have the ability to afford more test prep resources.
Dialogue: 0,0:05:58.16,0:06:05.54,Default,,0000,0000,0000,,They're going to score higher on the standardized test than the person who, because of their social situation,
Dialogue: 0,0:06:05.54,0:06:11.42,Default,,0000,0000,0000,,because of their economic situation, because of their background, goes into the test less prepared.
Dialogue: 0,0:06:11.42,0:06:15.23,Default,,0000,0000,0000,,If you swapped these students' circumstances, the scores would swap.
Dialogue: 0,0:06:15.23,0:06:22.07,Default,,0000,0000,0000,,There's no difference in the students' academic ability to engage with the material and to do the work.
Dialogue: 0,0:06:22.07,0:06:32.24,Default,,0000,0000,0000,,The system, the measurement instrument, the standardized test, is responding differently to the thing it wants to measure based on the
Dialogue: 0,0:06:32.24,0:06:37.49,Default,,0000,0000,0000,,socioeconomic status and surrounding circumstances of the student we're trying to measure.
Dialogue: 0,0:06:37.49,0:06:41.99,Default,,0000,0000,0000,,So one of the things that we immediately need to do with this, in line with our theme this week
Dialogue: 0,0:06:41.99,0:06:47.60,Default,,0000,0000,0000,,of describing data, is that we need to clearly and fully document the data collection process.
Dialogue: 0,0:06:47.60,0:06:54.22,Default,,0000,0000,0000,,This is a major focus of the data sheets reading, and this does a few things. First,
Dialogue: 0,0:06:54.22,0:06:59.15,Default,,0000,0000,0000,,it forces us to think about it if we're creating the data, or if we're using an existing dataset,
Dialogue: 0,0:06:59.15,0:07:04.25,Default,,0000,0000,0000,,we're trying to find the answers to these questions. It then enables further and future reuses of the data,
Dialogue: 0,0:07:04.25,0:07:12.56,Default,,0000,0000,0000,,because if we've carefully documented the collection process, the data processing, etc., that results in the data,
Dialogue: 0,0:07:12.56,0:07:21.38,Default,,0000,0000,0000,,then other people who come across the data, future users that may want to reproduce our analysis or may want to apply the data to a different problem,
Dialogue: 0,0:07:21.38,0:07:30.92,Default,,0000,0000,0000,,they'll have the information they need to assess what the likely biases are and if those biases are likely to affect their problem.
Dialogue: 0,0:07:30.92,0:07:37.40,Default,,0000,0000,0000,,It also creates a basis so that if we discover, in the future, through research, additional potential biases,
Dialogue: 0,0:07:37.40,0:07:42.14,Default,,0000,0000,0000,,it lets us go back and see, well, based on the documentation of how this data is collected,
Dialogue: 0,0:07:42.14,0:07:47.21,Default,,0000,0000,0000,,how likely is it for that to be a problem for this data as well? So the takeaway I want you to have:
Dialogue: 0,0:07:47.21,0:07:53.68,Default,,0000,0000,0000,,I want you to start thinking about how bias can affect our data. And bias,
Dialogue: 0,0:07:53.68,0:08:03.16,Default,,0000,0000,0000,,from a statistical perspective, is the systematic deviation of our estimate from the thing we're trying to estimate.
Dialogue: 0,0:08:03.16,0:08:09.37,Default,,0000,0000,0000,,But document your data. Look for the documentation of the data that you're using.
Dialogue: 0,0:08:09.37,0:08:11.38,Default,,0000,0000,0000,,So to wrap up, the goal
Dialogue: 0,0:08:11.38,0:08:19.48,Default,,0000,0000,0000,,is for our data to accurately reflect the population and for the statistics we compute from it to accurately and reliably approximate parameters.
Dialogue: 0,0:08:19.48,0:08:22.03,Default,,0000,0000,0000,,They're never going to exactly equal the quantity of interest.
Dialogue: 0,0:08:22.03,0:08:29.17,Default,,0000,0000,0000,,But hopefully they're pretty close, and hopefully there aren't systemic or systematic differences in one way or another.
Dialogue: 0,0:08:29.17,0:08:45.68,Default,,0000,0000,0000,,But there are various sources of bias: sampling bias, response bias, and measurement bias, just for three.