Techniques for random sampling and avoiding bias

  • 0:00 - 0:01
    - [Instructor] Let's
    say that we run a school
  • 0:01 - 0:04
    and in that school there is a population
  • 0:04 - 0:07
    of students right over here.
  • 0:07 - 0:09
    And that is our population.
  • 0:09 - 0:11
    And we want to get a sense of how
  • 0:11 - 0:15
    these students feel about the
    quality of math instruction
  • 0:15 - 0:18
    at the school, so we construct a survey,
  • 0:18 - 0:21
    and we just need to decide
    who are we going to get
  • 0:21 - 0:24
    to actually answer this survey.
  • 0:24 - 0:27
    One option is to just go to
    every member of the population,
  • 0:27 - 0:29
    but let's just say it's
    a really large school.
  • 0:29 - 0:31
    Let's say we're a college
  • 0:31 - 0:33
    and there's 10,000 people in the college.
  • 0:33 - 0:35
    We say, well, we can't
    just talk to everyone.
  • 0:35 - 0:38
    So instead, we say, let's
    sample this population
  • 0:38 - 0:41
    to get an indication of how
    the entire school feels.
  • 0:41 - 0:44
    So we are going to sample it.
  • 0:44 - 0:47
    We are going to sample that population.
  • 0:47 - 0:50
    Now in order to avoid having bias
  • 0:51 - 0:53
    in our response, in order for it to
  • 0:53 - 0:56
    have the best chance of
    it being indicative of the
  • 0:56 - 1:01
    entire population, we want
    our sample to be random.
  • 1:01 - 1:04
    So our sample could either be random,
  • 1:04 - 1:06
    random, or not random.
  • 1:08 - 1:09
    Not random.
  • 1:10 - 1:13
    And it might seem, at first,
    pretty straightforward
  • 1:13 - 1:16
    to do a random sample, but when
    you actually get down to it,
  • 1:16 - 1:20
    it's not always as straightforward
    as you would think.
  • 1:20 - 1:24
    So one type of random sample
    is just a simple random sample.
  • 1:24 - 1:26
    So, simple, simple,
  • 1:28 - 1:30
    random, random, sample,
  • 1:32 - 1:35
    and this is saying, alright, let me
  • 1:36 - 1:40
    maybe assign a number to
    every person in the school,
  • 1:40 - 1:42
    maybe they already have
    a student ID number,
  • 1:42 - 1:44
    and I'm just going to get a computer,
  • 1:44 - 1:46
    a random number generator, to generate the
  • 1:46 - 1:49
    100 people, the 100 students,
  • 1:49 - 1:51
    so let's say there's a
    sample of 100 students,
  • 1:51 - 1:52
    that I'm going to apply the survey to,
  • 1:52 - 1:55
    so that would be a simple random sample.
  • 1:55 - 1:58
    We are just going into this
    whole population and randomly,
  • 1:58 - 2:00
    let me just draw this.
  • 2:00 - 2:03
    So this is the population,
    we are just randomly
  • 2:03 - 2:05
    picking people out, and we
    know it's random because
  • 2:05 - 2:08
    a random number generator, or
    we have a string of numbers
  • 2:08 - 2:10
    or something like that,
    that is allowing us
  • 2:10 - 2:12
    to pick the students.
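
As a minimal sketch of the simple random sample described here, the snippet below assumes student ID numbers 1 through 10,000 and lets a random number generator pick 100 of them. The function name `simple_random_sample`, the ID range, and the `seed` parameter are illustrative assumptions, not something from the video.

```python
import random


def simple_random_sample(student_ids, sample_size, seed=None):
    """Pick sample_size distinct students, each equally likely to be chosen."""
    # The seed only makes this sketch reproducible; a real survey would not need it.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no student is picked twice.
    return rng.sample(student_ids, sample_size)


# Hypothetical ID numbers for a college of 10,000 people.
student_ids = list(range(1, 10_001))
surveyed = simple_random_sample(student_ids, 100, seed=0)
print(len(surveyed), "students selected")
```

Every student has the same chance of ending up in `surveyed`, which is what makes the sample a simple random one.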
  • 2:12 - 2:14
    Now that's pretty good, it's
    unlikely that you're going
  • 2:14 - 2:18
    to have bias from this
    sample, but there is some
  • 2:18 - 2:20
    probability that, just by chance,
  • 2:23 - 2:25
    your random number generator
    just happened to select
  • 2:25 - 2:28
    maybe a disproportionate
    number of boys over girls,
  • 2:28 - 2:31
    or a disproportionate number of freshmen,
  • 2:31 - 2:33
    or a disproportionate
    number of engineering majors
  • 2:33 - 2:36
    versus English majors,
    and that's a possibility.
  • 2:36 - 2:39
    So even though you are
    taking a simple random sample
  • 2:39 - 2:42
    that is truly random, once
    again, there's some probability
  • 2:42 - 2:46
    that it's not indicative
    of the entire population.
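
To put a rough number on "some probability", here is a small sketch that computes the exact chance of a lopsided simple random sample. It assumes, hypothetically, that exactly 5,000 of the 10,000 students are boys, and it calls a sample of 100 lopsided when it contains 40 or fewer, or 60 or more, boys. Neither the 50/50 split at this point nor the thresholds come from the video, and `prob_lopsided` is an illustrative name.

```python
from math import comb


def prob_lopsided(population, group_size, sample_size, low, high):
    """Exact chance that a simple random sample (drawn without replacement)
    contains <= low or >= high members of one group."""
    total = comb(population, sample_size)  # all equally likely samples
    lopsided = sum(
        comb(group_size, k) * comb(population - group_size, sample_size - k)
        for k in range(sample_size + 1)
        if k <= low or k >= high
    )
    return lopsided / total


# Assumed: 10,000 students, 5,000 boys, sample of 100, lopsided means 40/60 or worse.
p = prob_lopsided(10_000, 5_000, 100, low=40, high=60)
print(f"chance of a lopsided sample: {p:.3f}")
```

Under these assumptions the chance comes out to a few percent: small, but not zero.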
  • 2:46 - 2:49
    And so to mitigate that,
    there are other techniques
  • 2:49 - 2:50
    at our disposal.
  • 2:50 - 2:53
    One technique is a stratified sample.
  • 2:55 - 2:55
    Stratified.
  • 2:57 - 3:00
    And so this is the idea of
    taking our entire population
  • 3:00 - 3:01
    and essentially stratifying it.
  • 3:01 - 3:05
    So let's say we want to, we
    take that same population,
  • 3:05 - 3:07
    we take that same
    population, I'll draw it as a
  • 3:07 - 3:09
    square here just for convenience,
  • 3:09 - 3:11
    and we're gonna stratify it by,
  • 3:11 - 3:14
    let's say we're concerned that
    we get an appropriate sample
  • 3:14 - 3:17
    of freshmen, sophomores,
    juniors, and seniors.
  • 3:17 - 3:21
    So we'll stratify it by
    freshmen, sophomores,
  • 3:21 - 3:25
    juniors, and seniors, and then we sample
  • 3:25 - 3:28
    25 from each of these groups.
  • 3:28 - 3:29
    So these are the stratifications.
  • 3:29 - 3:33
    These are freshmen, sophomores,
    juniors, and seniors,
  • 3:34 - 3:38
    and instead of just sampling
    100 out of the entire pool,
  • 3:38 - 3:40
    we sample 25 from each of these.
  • 3:42 - 3:44
    So just like that.
  • 3:44 - 3:47
    And so that makes sure that you are
  • 3:47 - 3:49
    getting indicative responses from
  • 3:49 - 3:53
    at least all of the different age groups
  • 3:54 - 3:56
    or levels within your university.
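
The stratified sample above can be sketched roughly as follows. The roster is hypothetical: it assumes 2,500 student IDs in each class year, and `stratified_sample` is an illustrative name. The only detail taken from the video is drawing 25 students from each of the four groups.

```python
import random


def stratified_sample(students_by_year, per_stratum, seed=None):
    """Draw a simple random sample of per_stratum students from every stratum."""
    rng = random.Random(seed)  # seed is only for reproducibility of the sketch
    return {
        year: rng.sample(ids, per_stratum)
        for year, ids in students_by_year.items()
    }


# Hypothetical roster: 2,500 consecutive ID numbers per class year.
years = ["freshmen", "sophomores", "juniors", "seniors"]
students_by_year = {
    year: list(range(i * 2_500 + 1, (i + 1) * 2_500 + 1))
    for i, year in enumerate(years)
}

sample = stratified_sample(students_by_year, 25, seed=0)
print({year: len(ids) for year, ids in sample.items()})
```

Each class year is guaranteed exactly 25 respondents, so no year can be left out by chance. If the class years were very different in size, the count drawn from each might need to differ as well.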
  • 3:56 - 3:58
    Now there might be another
    issue where you say,
  • 3:58 - 4:01
    well, I'm actually more
    concerned that we have
  • 4:01 - 4:05
    accurate representation of
    males and females in the school,
  • 4:05 - 4:07
    and there is some probability,
  • 4:07 - 4:09
    you know, if I do 100
    random people, it's very
  • 4:09 - 4:11
    likely that it's close to
    50/50, but there's some chance,
  • 4:11 - 4:14
    just due to randomness,
    that it's disproportionately male
  • 4:14 - 4:16
    or disproportionately female.
  • 4:16 - 4:19
    And that's even possible
    in the stratified case.
  • 4:19 - 4:20
    And so what you might say is,
  • 4:20 - 4:22
    well, you know what I'm gonna do?
  • 4:22 - 4:26
    I'm going to, there's a technique
    called a clustered sample.
  • 4:26 - 4:29
    Let me write this right
    over here, clustered,
  • 4:29 - 4:34
    a clustered sample, and what
    we do is we sample groups.
  • 4:34 - 4:36
    Each of those groups we feel confident has
  • 4:36 - 4:38
    a good balance of males and females.
  • 4:38 - 4:41
    So, for example, we might,
  • 4:41 - 4:44
    instead of sampling individuals
    from the entire population,
  • 4:44 - 4:47
    we might say, look, you know,
  • 4:47 - 4:50
    on Tuesdays and Thursdays, and this, well,
  • 4:50 - 4:51
    even there as you can tell this is not a
  • 4:51 - 4:56
    trivial thing to do, let's
    just say that we can split,
  • 4:56 - 4:59
    let's say we can split our population
  • 5:00 - 5:02
    into groups, maybe these are classrooms,
  • 5:02 - 5:06
    and each of these classrooms
    has an even distribution
  • 5:06 - 5:11
    of males and females, or pretty
    close to an even distribution.
  • 5:11 - 5:14
    And so what we do is we
    sample the actual classrooms,
  • 5:14 - 5:16
    so that's why it's called
    cluster, or cluster technique,
  • 5:16 - 5:20
    or clustered random
    sample, because we're going
  • 5:20 - 5:24
    to randomly sample our
    classrooms, each of which has a
  • 5:24 - 5:27
    close or maybe an exact
    balance of males and females
  • 5:27 - 5:29
    so we know that we're gonna
    get good representation,
  • 5:29 - 5:32
    but we are still sampling,
    we are sampling from
  • 5:32 - 5:33
    the clusters, but then we're gonna survey
  • 5:33 - 5:36
    every single person in
    each of these clusters,
  • 5:36 - 5:39
    every single person in
    one of these classrooms.
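
A minimal sketch of the clustered sample, assuming a hypothetical college split into 400 classrooms of 25 students each. The room labels, the classroom sizes, and the name `cluster_sample` are assumptions for illustration. The random draw picks whole classrooms, and then everyone in a chosen classroom is surveyed.

```python
import random


def cluster_sample(classrooms, n_clusters, seed=None):
    """Randomly choose n_clusters classrooms, then survey every student in them."""
    rng = random.Random(seed)  # seed is only for reproducibility of the sketch
    # The unit being sampled is the classroom, not the individual student.
    chosen = rng.sample(sorted(classrooms), n_clusters)
    surveyed = [student for room in chosen for student in classrooms[room]]
    return chosen, surveyed


# Hypothetical layout: 400 classrooms, 25 consecutive ID numbers in each.
classrooms = {
    f"room-{r:03d}": list(range(r * 25 + 1, (r + 1) * 25 + 1))
    for r in range(400)
}

chosen, surveyed = cluster_sample(classrooms, 4, seed=0)
print(chosen, len(surveyed))
```

With four classrooms of 25, the survey still reaches 100 students, but the balance of males and females depends on each classroom being balanced to begin with.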
  • 5:39 - 5:44
    So, once again, these are
    all forms of random surveys,
  • 5:44 - 5:46
    or random samples, you have
    the simple random sample,
  • 5:46 - 5:49
    you can stratify, or
    you can cluster and then
  • 5:49 - 5:51
    randomly pick the clusters and then survey
  • 5:51 - 5:54
    everyone in that cluster.
  • 5:54 - 5:57
    Now if these are all random samples,
  • 5:57 - 6:00
    what are the non-random things like?
  • 6:00 - 6:02
    Well, one case of
    non-random, you could have a
  • 6:02 - 6:04
    voluntary survey,
  • 6:07 - 6:09
    or voluntary sample,
    and this might just be
  • 6:09 - 6:10
    you tell every student at the school,
  • 6:10 - 6:13
    "Hey, here's a web address.
  • 6:13 - 6:15
    "If you're interested, come
    and fill out this survey."
  • 6:15 - 6:18
    And that's likely to
    introduce bias because
  • 6:18 - 6:20
    you might have maybe the
    students who really like
  • 6:20 - 6:23
    the math instruction at their school
  • 6:23 - 6:25
    more likely to fill it out,
    maybe the students who really
  • 6:25 - 6:27
    don't like it are more
    likely to fill it out,
  • 6:27 - 6:29
    maybe it's just the
    kids who have more time
  • 6:29 - 6:30
    more likely to fill it out.
  • 6:30 - 6:34
    So this has a good chance
    of introducing bias.
  • 6:34 - 6:37
    The students who fill out the survey
  • 6:37 - 6:41
    might be just more skewed
    one way or the other because,
  • 6:41 - 6:44
    you know, they volunteered for it.
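
To see how volunteering can skew results, here is a small arithmetic sketch with made-up numbers. Both the satisfaction counts and the response rates below are hypothetical, chosen only so that students with strong opinions (especially unhappy ones) are more likely to fill out the survey. It compares the true average score with the average you would expect among volunteers.

```python
# Hypothetical number of students at each satisfaction score
# (1 = very unhappy with the math instruction, 5 = very happy).
population = {1: 1_000, 2: 2_000, 3: 4_000, 4: 2_000, 5: 1_000}

# Hypothetical chance that a student with each score volunteers a response.
response_rate = {1: 0.30, 2: 0.05, 3: 0.02, 4: 0.05, 5: 0.10}


def mean_score(counts):
    """Average score, weighting each score by how many students hold it."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


# Expected number of volunteers at each score.
expected_respondents = {
    score: population[score] * response_rate[score] for score in population
}

population_mean = mean_score(population)
voluntary_mean = mean_score(expected_respondents)
print(f"whole school: {population_mean:.2f}, volunteers: {voluntary_mean:.2f}")
```

With these invented rates the volunteers look less satisfied than the school as a whole. Different rates could push the result the other way, which is the point: the skew comes from who chooses to respond.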
  • 6:45 - 6:48
    Another non-random sample would be called
  • 6:52 - 6:54
    a convenience sample; you're
    introducing bias because of convenience,
  • 6:54 - 6:56
    which is the term that's often used,
  • 6:56 - 6:58
    and this might say, well,
    let's just sample the first
  • 6:58 - 7:00
    100 students who show up at school.
  • 7:00 - 7:02
    And that's just convenient for me because
  • 7:02 - 7:03
    I didn't have to use random numbers,
  • 7:03 - 7:07
    or do the stratification, or
    do any of this clustering,
  • 7:07 - 7:10
    but you can understand how
    this also would introduce bias,
  • 7:10 - 7:12
    because the first 100 students
    who show up at school,
  • 7:12 - 7:14
    maybe those are the
    most diligent students,
  • 7:14 - 7:18
    maybe they all take an
    early math class that has
  • 7:18 - 7:20
    a very good instructor where
    they're all happy about it.
  • 7:20 - 7:21
    Or it might go the other way,
  • 7:21 - 7:23
    the instructor there
    isn't the best one, and so
  • 7:23 - 7:25
    it might introduce bias the other way.
  • 7:25 - 7:28
    So if you let people
    volunteer or you just say,
  • 7:28 - 7:30
    "Oh, let me do the first N students."
  • 7:30 - 7:32
    Or you say, "Hey, let me just
    talk to all of the students
  • 7:32 - 7:35
    "who happen to be in
    front of me right now."
  • 7:35 - 7:37
    They might be in front of
    you out of convenience,
  • 7:37 - 7:41
    but they might not be
    a true random sample.
  • 7:42 - 7:45
    Now there are other reasons
    why you might introduce bias,
  • 7:45 - 7:47
    and it might not be
    because of the sampling.
  • 7:47 - 7:49
    You might introduce bias because of the
  • 7:49 - 7:51
    wording of your survey.
  • 7:51 - 7:53
    You could imagine a survey that says,
  • 7:53 - 7:57
    do you consider yourself
    lucky to get a math education
  • 7:57 - 8:00
    that very few other people
    in the world have access to?
  • 8:00 - 8:01
    Well, that might bias you to say,
  • 8:01 - 8:03
    "Well, yeah, I guess I feel lucky."
  • 8:03 - 8:06
    Well, if the wording was,
  • 8:06 - 8:08
    do you like the fact
    that disproportionately
  • 8:08 - 8:12
    more students at your school tend to fail
  • 8:12 - 8:15
    algebra than at surrounding schools?
  • 8:15 - 8:17
    Well, that might bias you negatively.
  • 8:17 - 8:20
    So the wording really, really,
    really matters in surveys,
  • 8:20 - 8:23
    and there is a lot that
    would go into this.
  • 8:23 - 8:26
    And the other one is just people's,
  • 8:26 - 8:27
    you know, it's called response bias.
  • 8:27 - 8:29
    And, once again, this isn't about...
  • 8:29 - 8:31
    Response bias.
  • 8:32 - 8:36
    And this is just people not
    wanting to tell the truth
  • 8:36 - 8:37
    or maybe not wanting to respond at all.
  • 8:37 - 8:39
    Maybe they're afraid that somehow
  • 8:39 - 8:41
    their response is gonna show up
  • 8:41 - 8:43
    in front of their math
    teacher or the administrators,
  • 8:43 - 8:44
    or if they're too negative,
  • 8:44 - 8:46
    it might be taken out on them in some way.
  • 8:46 - 8:48
    And because of that, they
    might not be truthful,
  • 8:48 - 8:51
    and so they might be overly positive
  • 8:51 - 8:53
    or not fill it out at all.
  • 8:53 - 8:57
    So anyway, this is a
    very high level overview
  • 8:57 - 8:59
    of how you could think about sampling.
  • 8:59 - 9:01
    You want to go random
    because it lowers the
  • 9:01 - 9:04
    probability of
    introducing some bias into it.
  • 9:04 - 9:06
    And then these are some techniques.
  • 9:06 - 9:07
    And also think about
    whether you're falling
  • 9:07 - 9:09
    into some of these pitfalls
    that have a good chance
  • 9:09 - 9:11
    of introducing bias.
Video Language:
English
Team:
Khan Academy
Duration:
09:13
