0:00:04.580,0:00:09.890
In this video, I want to talk about data sources, where our data is coming from, and particularly

0:00:09.890,0:00:15.230
introduce the concept of bias and start to talk about where biases can come from in our data.

0:00:15.230,0:00:23.030
So our learning outcomes are to understand what bias means and start to identify the sources of bias in observations of a variable.

0:00:23.030,0:00:31.120
So one of the goals of a lot of our data science work, especially as we develop more sophisticated tools, is going to be to estimate things.

0:00:31.120,0:00:36.260
And in statistical terminology, what we say is that we're estimating the value of a parameter.

0:00:36.260,0:00:40.190
So I introduced the term statistic in a previous video.

0:00:40.190,0:00:46.430
But a parameter is some property

0:00:46.430,0:00:53.570
of the world or of the population that we're trying to study. And our goal is to estimate that with some statistic.

0:00:53.570,0:00:58.040
So if we have our data pipeline, we have the things that we're trying to study,

0:00:58.040,0:01:08.920
we have observable phenomena or experimental results that come out of those, which become raw and then processed data.

0:01:08.920,0:01:11.350
The goal is to be able to use the data,

0:01:11.350,0:01:20.230
the processed data, to compute a statistic that allows us to estimate the value of the parameter back in the world.

0:01:20.230,0:01:27.160
For example, suppose we want to understand the approval of our company,

0:01:27.160,0:01:32.920
and we want to estimate the parameter of either the net approval, that is, the number of people who approve

0:01:32.920,0:01:36.490
of our company minus the number who disapprove,

0:01:36.490,0:01:44.670
or maybe the percentage of the residents of the society who have a positive opinion of our company.

0:01:44.670,0:01:45.790
We could compute a statistic.

0:01:45.790,0:01:52.750
We could take a sample of people and look at the percentage of that sample

0:01:52.750,0:02:00.610
that has a positive opinion of our company. And the goal of this process is that the statistic is approximately the parameter.

0:02:00.610,0:02:09.190
Bias is when the statistic systematically differs from the parameter.

0:02:09.190,0:02:18.950
And there are a few sources of this. One is selection bias, where some people are more likely to be contacted than others in our survey.

0:02:18.950,0:02:21.680
And if the people who are

0:02:21.680,0:02:29.180
more likely to be contacted are either more or less likely to have a positive opinion than those who aren't contacted,

0:02:29.180,0:02:35.700
that's a source of bias. Response bias is when some people are more likely to respond than others.

0:02:35.700,0:02:43.290
One survey method is called random digit dialing, where you dial random phone numbers.

0:02:43.290,0:02:45.510
If some people are more likely to pick up the phone than others,

0:02:45.510,0:02:52.290
or if some people are more likely, once they find out what the call is, to respond to the survey than others,

0:02:52.290,0:03:03.300
that's also going to induce a bias. And then measurement bias is when the way that we measure the results skews one way or another.

0:03:03.300,0:03:09.360
In our example here, this could arise if the way that we frame the question

0:03:09.360,0:03:15.620
biases the approval, what people say, positively or negatively, or how they respond.

0:03:15.620,0:03:26.210
Then we have the response, and they're going to answer our questions, but there's a bias in how their opinion translates into data.
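
[Illustrative note, not part of the spoken lecture] The selection and response bias ideas above can be sketched as a small simulation. This is only a rough sketch under made-up assumptions: the 40% true approval rate, the "approvers are twice as likely to be contacted" weights, and the 0.6 and 0.3 response probabilities are all hypothetical numbers chosen for illustration, and `simulate_survey` is a hypothetical helper name.

```python
import random


def simulate_survey(n_population=100_000, n_contacted=5_000, seed=0):
    """Compare an unbiased sample with selection-biased and response-biased ones."""
    rng = random.Random(seed)

    # Hypothetical population: each person approves with probability 0.40.
    population = [rng.random() < 0.40 for _ in range(n_population)]
    parameter = sum(population) / n_population  # the value we want to estimate

    # Simple random sample: everyone is equally likely to be contacted.
    srs = rng.sample(population, n_contacted)
    srs_stat = sum(srs) / len(srs)

    # Selection bias (assumed): approvers are twice as likely to be contacted.
    # Note that choices() samples with replacement, which is fine for a sketch.
    weights = [2.0 if approves else 1.0 for approves in population]
    contacted = rng.choices(population, weights=weights, k=n_contacted)
    selection_stat = sum(contacted) / len(contacted)

    # Response bias (assumed): contact is fair, but approvers answer the
    # survey with probability 0.6 and everyone else with probability 0.3.
    responded = [a for a in srs if rng.random() < (0.6 if a else 0.3)]
    response_stat = sum(responded) / len(responded)

    return {
        "parameter": parameter,
        "simple_random_sample": srs_stat,
        "selection_biased": selection_stat,
        "response_biased": response_stat,
    }


if __name__ == "__main__":
    for name, value in simulate_survey().items():
        print(f"{name:22s} {value:.3f}")
```

Under these assumptions, the simple random sample should land close to the parameter, while the other two statistics are pushed well above it no matter how many people are surveyed. That is the sense in which the difference is systematic rather than random.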

0:03:26.210,0:03:36.450
These biases, at these stages of the pipeline, can come up in almost any kind of data collection process.

0:03:36.450,0:03:48.360
Controlling for them and counteracting them is a significant field of study, where reputable political pollsters and

0:03:48.360,0:03:57.480
reputable survey organizations have very good mechanisms for quantifying and reducing these sources of bias.

0:03:57.480,0:04:02.820
But on the way from the population of people or objects we're trying to study

0:04:02.820,0:04:11.440
through to the data that we actually get, these are the places where bias can come into the process.

0:04:11.440,0:04:14.500
Bias also may not affect all groups equally.

0:04:14.500,0:04:20.410
We may have a group that shows up more frequently in the data than they do in the population, or less frequently.

0:04:20.410,0:04:28.060
There may be a measurement skew, so that the way that we're measuring our data

0:04:28.060,0:04:32.530
responds to the thing we're trying to measure differently between different groups.

0:04:32.530,0:04:43.000
So one example of this is standardized tests. Tests like the SAT and the ACT are intended to measure your academic preparedness for college.

0:04:43.000,0:04:47.680
But there are two things that go into how well you're going to do on the SAT or

0:04:47.680,0:04:52.720
the ACT. One is your raw academic preparedness:

0:04:52.720,0:04:58.450
how good you are at engaging with the kind of material that they're testing your ability to engage with,

0:04:58.450,0:05:07.420
and the other is your preparedness for the test itself. And there are a lot of test preparation resources that help you prepare for the test.

0:05:07.420,0:05:12.670
Then there are other things, like how much time you have available to study.

0:05:12.670,0:05:24.850
And one of the outcomes of that is that socioeconomic status becomes a very strong indicator and a very strong factor in standardized test scores.

0:05:24.850,0:05:32.920
So suppose you have two students who, given the same situation, the same economic situation,

0:05:32.920,0:05:39.520
the same level of stress, and the same level of preparedness, would be able to engage equally well with the material.

0:05:39.520,0:05:47.930
And that, ideally, is what you want to test if you're, say, seeing whether someone is going to be an effective college student.

0:05:47.930,0:05:54.440
The one who has more economic security doesn't have to work as many hours that take time from their studies.

0:05:54.440,0:05:58.160
They have the ability to afford more test prep resources.

0:05:58.160,0:06:05.540
They're going to score higher on the standardized test than the person who, because of their social situation,

0:06:05.540,0:06:11.420
because of their economic situation, because of their background, goes into the test less prepared.

0:06:11.420,0:06:15.230
If you swapped these students' circumstances, the scores would swap.

0:06:15.230,0:06:22.070
There's no difference in the students' academic ability to engage with the material and to do the work.

0:06:22.070,0:06:32.240
The measurement instrument, the standardized test, is responding differently to the thing it wants to measure based on the

0:06:32.240,0:06:37.490
socioeconomic status and surrounding circumstances of the student we're trying to measure.
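
[Illustrative note, not part of the spoken lecture] The standardized test example can be sketched the same way. This is a toy model under stated assumptions, not a claim about real test data: the score scale, the 80-point `prep_boost`, the noise level, and the group labels are all hypothetical. Both groups draw academic ability from the same distribution, and only the measurement responds differently.

```python
import random
import statistics


def simulate_scores(n=20_000, prep_boost=80.0, seed=1):
    """Same ability distribution in both groups; the test adds a group-dependent shift."""
    rng = random.Random(seed)
    summary = {}
    for group in ("more_secure", "less_secure"):
        abilities, scores = [], []
        for _ in range(n):
            # Assumed: ability is distributed identically in both groups.
            ability = rng.gauss(1000, 100)
            # Assumed: more time and test prep resources add a fixed boost.
            boost = prep_boost if group == "more_secure" else 0.0
            noise = rng.gauss(0, 30)  # ordinary test-day randomness
            abilities.append(ability)
            scores.append(ability + boost + noise)
        summary[group] = {
            "mean_ability": statistics.fmean(abilities),
            "mean_score": statistics.fmean(scores),
        }
    return summary


if __name__ == "__main__":
    for group, values in simulate_scores().items():
        print(group, {k: round(v, 1) for k, v in values.items()})
```

In this sketch the mean abilities of the two groups are nearly identical, but the mean scores differ by roughly the assumed boost. The instrument is measuring circumstances along with ability, which is the group-dependent measurement skew described above.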

0:06:37.490,0:06:41.990
So one of the things we immediately need to do with this, in line with our theme this week

0:06:41.990,0:06:47.600
of describing data, is to clearly and fully document the data collection process.

0:06:47.600,0:06:54.220
This is a major focus of the datasheets reading, and it does a few things. First,

0:06:54.220,0:06:59.150
it forces us to think about it if we're creating the data. Or, if we're using an existing dataset,

0:06:59.150,0:07:04.250
we're trying to find the answers to these questions. It then enables further and future reuses of the data,

0:07:04.250,0:07:12.560
because if we've carefully documented the collection process, the data processing, etc., that results in the data,

0:07:12.560,0:07:21.380
then other people who come across the data, future users who may want to reproduce our analysis or apply the data to a different problem,

0:07:21.380,0:07:30.920
will have the information they need to assess what the likely biases are and whether those biases are likely to affect their problem.

0:07:30.920,0:07:37.400
It also creates a basis for when we discover, in the future through research, additional potential biases.

0:07:37.400,0:07:42.140
It lets us go back and ask: based on the documentation of how this data was collected,

0:07:42.140,0:07:47.210
how likely is it for that to be a problem for this data as well? So, the takeaway I want you to have:

0:07:47.210,0:07:53.680
I want you to start thinking about how bias can affect our data.

0:07:53.680,0:08:03.160
From a statistical perspective, bias is the systematic deviation of our estimate from the thing we're trying to estimate.

0:08:03.160,0:08:09.370
But document your data. Look for the documentation of the data that you're using.

0:08:09.370,0:08:11.380
So, to wrap up, the goal

0:08:11.380,0:08:19.480
is for our data to accurately reflect the population and for the statistics we compute from it to accurately and reliably approximate parameters.

0:08:19.480,0:08:22.030
They're never going to exactly equal the quantity of interest,

0:08:22.030,0:08:29.170
but hopefully they're pretty close, and hopefully there are no systematic differences in one way or another.

0:08:29.170,0:08:45.676
But there are various sources of bias: sampling bias, response bias, and measurement bias, just to name three.