In this video, I want to talk about data sources, where our data is coming from, and in particular introduce the concept of bias and start to talk about where biases can come from in our data. Our learning outcomes are to understand what bias means and to start identifying the sources of bias in observations of a variable.

One of the goals of a lot of our data science work, especially as we develop more sophisticated tools, is going to be to estimate things. In statistical terminology, we say that we're estimating the value of a parameter. I introduced the term statistic in a previous video. A parameter is some property of the world, or of the population, that we're trying to study, and our goal is to estimate it with some statistic. So in our data pipeline, we have the things that we're trying to study; observable phenomena or experimental results come out of those and become raw and then processed data. The goal is to use the processed data to compute a statistic that lets us estimate the value of the parameter back in the world.

For example, suppose we want to understand the approval of our company. We want to estimate a parameter: either the net approval, that is, the number of people who approve of our company minus the number who disapprove, or maybe the percentage of the residents of the society who have a positive opinion of our company. We could compute a statistic: we could take a sample of people and look at the percentage of that sample that has a positive opinion of our company. The goal of this process is that the statistic is approximately equal to the parameter. Bias is when the statistic systematically differs from the parameter, and there are a few sources of this.
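To make the parameter and statistic distinction concrete, here is a minimal simulation sketch of the company approval example. Everything in it is an assumption for illustration: the population size, the 40% approval rate, the sample size, and the helper name `approval_rate` are made up, and a simple random sample stands in for the survey.

```python
import random


def approval_rate(opinions):
    """Fraction of people with a positive opinion (True means positive)."""
    return sum(opinions) / len(opinions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 people, exactly 40% with a positive opinion.
population = [i < 40_000 for i in range(100_000)]

# The parameter is a property of the whole population.
parameter = approval_rate(population)

# The statistic is computed from one simple random sample of 1,000 people.
sample = rng.sample(population, 1_000)
statistic = approval_rate(sample)

# Repeating the survey many times shows whether the statistic is
# systematically off (biased) or just noisy around the parameter.
estimates = [approval_rate(rng.sample(population, 1_000)) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)

print(f"parameter:            {parameter:.3f}")
print(f"one sample statistic: {statistic:.3f}")
print(f"mean of 200 surveys:  {mean_estimate:.3f}")
```

In this sketch any single statistic misses the parameter by a little, but the misses do not lean in one direction, so the average over many surveys lands close to the parameter. A biased process would keep missing on the same side.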
One source is selection bias, where some people are more likely to be contacted than others in our survey. If the people who are more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted, that's a source of bias. Response bias is when some people are more likely to respond. One survey method is called random digit dialing, where you dial random phone numbers. If some people are more likely to pick up the phone than others, or if some people are more likely than others to respond to the survey once they find out what the call is about, that is also going to induce a bias. And then measurement bias is when the way that we measure the results skews one way or another. In our example, this could arise if the way that we frame the question biases whether people respond positively or negatively. Then we have the response, they are answering our questions, but there's a bias in how their opinion translates into data.

These biases can come up at these stages of the pipeline in almost any kind of data collection process. Controlling for them and counteracting them is a significant field of study, and reputable political pollsters and reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias. But along the way from the population of people or objects we're trying to study through to the data that we actually get, these are the places where bias can come into the process.

Bias also may not affect all groups equally. We may have a group that shows up more frequently, or less frequently, in the data than it does in the population. There may also be a measurement skew, so that the way we're measuring our data responds to the thing we're trying to measure differently between different groups.
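Selection bias and response bias as described above can be sketched with a small simulation. This is only an illustration under made-up assumptions: the 40% approval rate, the contact and response probabilities, and the function name `run_survey` are all hypothetical, and opinion is the only thing that affects who gets contacted or who responds.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: each person is positive with probability 0.4.
population = [rng.random() < 0.4 for _ in range(200_000)]
parameter = sum(population) / len(population)


def run_survey(population, p_contact, p_respond, rng):
    """Approval rate among people who were both contacted and responded.

    p_contact and p_respond map an opinion (True or False) to a probability.
    """
    answers = [
        opinion
        for opinion in population
        if rng.random() < p_contact[opinion] and rng.random() < p_respond[opinion]
    ]
    return sum(answers) / len(answers)


even_contact = {True: 0.05, False: 0.05}
even_response = {True: 0.5, False: 0.5}

# Baseline: everyone is equally likely to be contacted and to respond.
fair = run_survey(population, even_contact, even_response, rng)

# Selection bias: people with a positive opinion are contacted more often.
selection = run_survey(population, {True: 0.08, False: 0.04}, even_response, rng)

# Response bias: people with a positive opinion are more willing to answer.
response = run_survey(population, even_contact, {True: 0.6, False: 0.3}, rng)

print(f"parameter:      {parameter:.3f}")
print(f"fair survey:    {fair:.3f}")
print(f"selection bias: {selection:.3f}")
print(f"response bias:  {response:.3f}")
```

With these assumed probabilities, both biased surveys overestimate approval, and collecting more responses would not fix it, because the error comes from who ends up in the data rather than from how many people do.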
One example of measurement that responds differently between groups is standardized tests. Tests like the SAT and the ACT are intended to measure your academic preparedness for college. But there are two things that go into how well you're going to do on the SAT or the ACT. One is your raw academic preparedness: how good are you at engaging with the kind of material that they're testing your ability to engage with? The other is your preparedness for the test itself. There are a lot of test preparation resources that help you prepare for the test, and then there are other things, like how much time you have available to study. One of the outcomes of that is that socioeconomic status becomes a very strong factor in standardized test scores.

So suppose you have two students who, given the same economic situation, the same level of stress, and the same level of preparedness, would be able to engage with the material equally well. That, ideally, is what you want to test if you're, say, seeing whether someone is going to be an effective college student. The one who has more economic security, who doesn't have to work as many hours that take away from their studies and who can afford more test prep resources, is going to score higher on the standardized test than the one who, because of their social situation, their economic situation, or their background, goes into the test less prepared. If you swapped these students' circumstances, the scores would swap. There's no difference in the students' academic ability to engage with the material and to do the work. The measurement instrument, the standardized test, is responding differently to the thing it wants to measure based on the socioeconomic status and surrounding circumstances of the student we're trying to measure.
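The standardized test example can be sketched as a toy measurement model. This is an illustration under stated assumptions, not a model of any real test: the score scale, the size of the test prep effect, the noise level, and the names `observed_score` and `PREP_BOOST` are all invented for the example.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Assumed effect of circumstances on the score (hypothetical values).
PREP_BOOST = {"more_resources": 40.0, "fewer_resources": 0.0}


def observed_score(ability, circumstances, rng):
    """Toy instrument: score = ability + prep effect + random noise."""
    return ability + PREP_BOOST[circumstances] + rng.gauss(0, 5)


def mean(values):
    return sum(values) / len(values)


# Both groups draw ability from the same distribution, so the thing the
# test is supposed to measure does not differ between them.
ability_a = [rng.gauss(500, 50) for _ in range(5_000)]
ability_b = [rng.gauss(500, 50) for _ in range(5_000)]

scores_a = [observed_score(a, "more_resources", rng) for a in ability_a]
scores_b = [observed_score(a, "fewer_resources", rng) for a in ability_b]

ability_gap = mean(ability_a) - mean(ability_b)
score_gap = mean(scores_a) - mean(scores_b)

print(f"gap in ability between groups: {ability_gap:6.1f}")
print(f"gap in scores between groups:  {score_gap:6.1f}")
```

In this sketch the gap in scores comes almost entirely from the assumed prep effect, not from ability. Swapping the `circumstances` argument between the two groups would swap the direction of the score gap, which mirrors the point about swapping the students' circumstances.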
So one of the things we need to do about this right away, in line with our theme this week of describing data, is to clearly and fully document the data collection process. This is a major focus of the datasheets reading, and it does a few things. First, it forces us to think about it: if we're creating the data, we have to answer these questions, and if we're using an existing dataset, we're trying to find the answers to them. It then enables further and future reuse of the data. If we've carefully documented the collection process, the data processing, and so on that result in the data, then other people who come across it, future users who may want to reproduce our analysis or apply the data to a different problem, will have the information they need to assess what the likely biases are and whether those biases are likely to affect their problem. It also creates a basis for dealing with additional potential biases that we discover in the future through research. It lets us go back and ask: based on the documentation of how this data was collected, how likely is that to be a problem for this data as well?

So the takeaway: I want you to start thinking about how bias can affect our data. From a statistical perspective, bias is the systematic deviation of our estimate from the thing we're trying to estimate. And document your data, and look for the documentation of the data that you're using.

To wrap up: the goal is for our data to accurately reflect the population, and for the statistics we compute from it to accurately and reliably approximate parameters. They're never going to exactly equal the quantity of interest, but hopefully they're pretty close, and hopefully there aren't systematic differences in one direction or another. But various sources of bias can come in: sampling bias, response bias, and measurement bias, just to name three.