
Statistics: Sample Variance

  • 0:01 - 0:03
    This video, here, is a
    groundbreaking video,
  • 0:03 - 0:06
    for multiple reasons.
  • 0:06 - 0:09
    One, I'm going to introduce
    you to the variance
  • 0:09 - 0:12
    of a sample, which is
    interesting in its own right.
  • 0:12 - 0:14
    And I'm attempting to
    record this video in HD,
  • 0:14 - 0:17
    and hopefully you can see it
    bigger and clearer than ever
  • 0:17 - 0:17
    before.
  • 0:17 - 0:19
    But we'll see how
    all of that goes.
  • 0:19 - 0:22
    So this is a bit of an
    experiment, so bear with me.
  • 0:22 - 0:25
    But just before we go
    into the variance of a sample,
  • 0:25 - 0:28
    I think it's
    instructive to review
  • 0:28 - 0:32
    the variance of the population, and
    we can compare their formulas.
  • 0:32 - 0:34
    The variance of a
    population-- and it's
  • 0:34 - 0:36
    this Greek letter, sigma.
  • 0:36 - 0:38
    Lowercase sigma squared.
  • 0:38 - 0:39
    That means variance.
  • 0:39 - 0:41
    I know it's weird that
    a variable already
  • 0:41 - 0:42
    has a squared in it.
  • 0:42 - 0:43
    You're not squaring
    the variable.
  • 0:43 - 0:44
    This is the variable.
  • 0:44 - 0:46
    Sigma squared means variance.
  • 0:46 - 0:47
    Actually, let me
    write that down.
  • 0:47 - 0:48
    That equals variance.
  • 0:52 - 0:55
    And that is equal to--
    you take each data point--
  • 0:55 - 0:59
    and we'll call them x sub i.
  • 0:59 - 1:00
    You take each data
    point, find out
  • 1:00 - 1:04
    how far it is from the
    mean of the population--
  • 1:04 - 1:06
    the mean of the population.
  • 1:06 - 1:11
    You square it, and then you take
    the average of all of those.
  • 1:11 - 1:12
    So you take the average.
  • 1:12 - 1:13
    You sum them all up.
  • 1:13 - 1:14
    You go from i is equal to 1.
  • 1:14 - 1:18
    So from the first point all
    the way to the N-th point.
  • 1:18 - 1:20
    And then to average,
    you sum them all up,
  • 1:20 - 1:22
    and then you divide by N.
  • 1:22 - 1:26
    So the variance is the average
    of the squared distances
  • 1:26 - 1:27
    of each point from the mean.
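
For reference, the formula being described here, with mu as the population mean and capital N as the number of data points in the population, is:

\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\]
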
  • 1:27 - 1:29
    And just to give you
    the intuition again,
  • 1:29 - 1:32
    it essentially says, on
    average, roughly how far away
  • 1:32 - 1:34
    are each of the points
    from the middle?
  • 1:34 - 1:36
    That's the best way to
    think about the variance.
  • 1:36 - 1:39
    Now what if we're dealing-- this
    was for a population, right?
  • 1:39 - 1:41
    And we said, if we
    wanted to figure out
  • 1:41 - 1:44
    the variance of men's
    heights in the country,
  • 1:44 - 1:45
    it would be very
    hard to figure out
  • 1:45 - 1:47
    the variance for the population.
  • 1:47 - 1:49
    You would have to go
    and, essentially, measure
  • 1:49 - 1:51
    everyone's height--
    250 million people.
  • 1:51 - 1:55
    Or what if it's for some
    population where it's just
  • 1:55 - 1:57
    completely impossible
    to have the data,
  • 1:57 - 1:58
    or some random variable?
  • 1:58 - 1:59
    And we'll go more
    into that later.
  • 1:59 - 2:00
    So a lot of times,
    you actually want
  • 2:00 - 2:03
    to estimate this
    variance by taking
  • 2:03 - 2:05
    the variance of a sample.
  • 2:05 - 2:08
    Same way that you could never
    get the mean of a population,
  • 2:08 - 2:09
    but maybe you want
    to estimate it
  • 2:09 - 2:11
    by getting the mean of a sample.
  • 2:11 - 2:14
    And we learned that
    in that first video.
  • 2:14 - 2:18
    This is-- if that's the
    whole population, that's
  • 2:18 - 2:21
    millions of data points-- or
    even data points in the future
  • 2:21 - 2:22
    that you'll never
    be able to get,
  • 2:22 - 2:23
    because it's a random variable.
  • 2:23 - 2:25
    So this is the population.
  • 2:27 - 2:33
    You might just want to estimate
    things by looking at a sample.
  • 2:33 - 2:36
    And this is actually what
    most of inferential statistics
  • 2:36 - 2:36
    is all about.
  • 2:36 - 2:39
    Figuring out descriptive
    statistics about the sample,
  • 2:39 - 2:41
    and making inferences
    about the population.
  • 2:41 - 2:44
    Let me try this
    drug on 100 people.
  • 2:44 - 2:46
    And if it seems to have
    statistically significant
  • 2:46 - 2:47
    results, this drug
    will probably work
  • 2:47 - 2:49
    on the population as a whole.
  • 2:49 - 2:50
    So that's what it's all about.
  • 2:50 - 2:52
    So it's really
    important to understand
  • 2:52 - 2:54
    this notion of a sample
    versus a population,
  • 2:54 - 2:56
    and being able to
    find statistics
  • 2:56 - 2:58
    on a sample that,
    for the most part,
  • 2:58 - 3:02
    can describe the population or
    help us estimate what they
  • 3:02 - 3:04
    call parameters for
    the population.
  • 3:04 - 3:07
    So what's the mean of a-- let
    me rewrite these definitions.
  • 3:07 - 3:09
    What's the mean of a population?
  • 3:09 - 3:12
    I'll do it in that purple--
    purple for population.
  • 3:12 - 3:14
    The mean of a
    population, you just
  • 3:14 - 3:16
    take each of the data points.
  • 3:16 - 3:20
    So you take each of the data
    points in the population-- xi.
  • 3:20 - 3:22
    You sum them up.
  • 3:22 - 3:23
    You start with the
    first data point,
  • 3:23 - 3:25
    and you go all the way
    to the N-th data point,
  • 3:25 - 3:27
    and you divide by N.
    You sum them all up
  • 3:27 - 3:29
    and divide them by
    N. That's the mean.
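
In symbols, that population mean is:

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\]
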
  • 3:29 - 3:30
    So then you plug it
    into this formula,
  • 3:30 - 3:32
    and you just can see
    how far each point is
  • 3:32 - 3:35
    from that central
    point, from that mean.
  • 3:35 - 3:36
    And you get the variance.
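
As a minimal sketch of these two population formulas in plain Python (the function names and the toy data are just for illustration):

```python
def population_mean(data):
    # mu = (1/N) * (sum of all data points)
    return sum(data) / len(data)

def population_variance(data):
    # sigma^2 = (1/N) * (sum of squared distances from the mean)
    mu = population_mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

heights = [70, 65, 72, 68, 66]       # a toy "population" of heights
print(population_mean(heights))      # 68.2
print(population_variance(heights))  # 6.56
```
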
  • 3:36 - 3:40
    Now what happens if
    we do it for a sample?
  • 3:40 - 3:43
    Well, if we want to estimate
    the mean of a population
  • 3:43 - 3:46
    by somehow calculating a mean
    for a sample, the best thing
  • 3:46 - 3:47
    I can think of--
    and really these
  • 3:47 - 3:49
    are kind of engineered formulas.
  • 3:49 - 3:51
    These are human
    beings saying, well,
  • 3:51 - 3:52
    what is the best
    way to sample it?
  • 3:52 - 3:55
    Well, all we can do is really
    take an average of our sample.
  • 3:55 - 3:57
    And that's the sample mean.
  • 3:57 - 3:58
    And we learned in
    the first video
  • 3:58 - 4:00
    that notation-- the formula's
    almost identical to this.
  • 4:00 - 4:02
    It's just the
    notation is different.
  • 4:02 - 4:05
    Instead of writing mu, you
    write x with a line over it.
  • 4:05 - 4:07
    Sample mean is equal to--
  • 4:07 - 4:10
    Once again, you take each of the
    data points now in the sample,
  • 4:10 - 4:12
    not in the whole population.
  • 4:12 - 4:17
    You sum them up, from the first
    one and then to the n-th one,
  • 4:17 - 4:17
    right?
  • 4:17 - 4:21
    They're saying that there are
    n data points in this sample.
  • 4:21 - 4:23
    And then you divide it by the
    number of data points you have.
  • 4:23 - 4:25
    Fair enough.
  • 4:25 - 4:26
    It's really the same formula.
  • 4:26 - 4:27
    The way I took the
    mean of a population,
  • 4:27 - 4:28
    I said, well, if I
    just have a sample,
  • 4:28 - 4:30
    let me just take the
    mean the same way.
  • 4:30 - 4:32
    And it might-- it's
    probably a good estimate
  • 4:32 - 4:34
    of the mean of the population.
  • 4:34 - 4:36
    Now, it gets interesting
    when we talk about variance.
  • 4:36 - 4:39
    So your natural reaction
    is, OK, I have this sample.
  • 4:39 - 4:43
    If I want to estimate the
    variance of the population,
  • 4:43 - 4:45
    why don't I just apply this
    same formula, essentially,
  • 4:45 - 4:46
    to the sample?
  • 4:46 - 4:50
    So I could say-- and this is
    actually a sample variance.
  • 4:50 - 4:55
    They use the notation s squared.
  • 4:55 - 4:58
    So sigma is kind of the
    Greek-letter equivalent of s.
  • 4:58 - 5:00
    So now when we're
    dealing with a sample,
  • 5:00 - 5:01
    we just write the s there.
  • 5:01 - 5:02
    So this is sample variance.
  • 5:02 - 5:03
    Let me write that down.
  • 5:03 - 5:04
    Sample variance.
  • 5:13 - 5:15
    So we might just say,
    well, maybe a good way
  • 5:15 - 5:18
    to take the sample variance
    is, do it the same way.
  • 5:18 - 5:24
    Let's take the distance of each
    of the points in the sample,
  • 5:24 - 5:26
    find out how far it is
    from our sample mean.
  • 5:26 - 5:28
    Right here, we used
    the population mean.
  • 5:28 - 5:30
    But now we'll just
    use the sample
  • 5:30 - 5:31
    mean, because that's
    all we can have.
  • 5:31 - 5:33
    We don't know what
    the population mean
  • 5:33 - 5:36
    is without looking at
    the whole population.
  • 5:36 - 5:37
    Take the square of that.
  • 5:37 - 5:38
    That makes it positive.
  • 5:38 - 5:40
    It has other properties
    which we'll go over later.
  • 5:40 - 5:43
    And then take the average of
    all of these squared distances.
  • 5:43 - 5:45
    So you take it from--
    you sum them all up.
  • 5:45 - 5:48
    And there's n of them to
    sum up, right-- lowercase n.
  • 5:48 - 5:51
    And you divide by lowercase n.
  • 5:51 - 5:53
    You say, well, you know,
    this is a good estimate.
  • 5:53 - 5:55
    Whatever this variance
    is, that might
  • 5:55 - 5:57
    be a good estimate for
    the population as a whole.
  • 5:57 - 6:00
    And actually this is what
    some people often refer to,
  • 6:00 - 6:02
    when they talk about
    sample variance.
  • 6:02 - 6:05
    And sometimes it'll actually
    be referred to as this.
  • 6:05 - 6:08
    They'll put a little
    lowercase n there.
  • 6:08 - 6:10
    And the reason why I do that
    is because we divided by n.
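
Written out, this divide-by-n version of the sample variance (the little subscript n marking the divisor) is:

\[
s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
\]
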
  • 6:10 - 6:12
    So you say, Sal, what's
    the problem here?
  • 6:12 - 6:13
    And then the problem--
    and I'll give you
  • 6:13 - 6:14
    the intuition, because
    this is actually
  • 6:14 - 6:16
    something that used
    to boggle my mind.
  • 6:16 - 6:19
    And I'm still,
    frankly, struggling
  • 6:19 - 6:21
    with the intuition behind it.
  • 6:21 - 6:23
    Well, I have the
    intuition, but more
  • 6:23 - 6:25
    of kind of rigorously
    proving it to myself,
  • 6:25 - 6:27
    that this is
    definitely the case.
  • 6:27 - 6:28
    But think about this.
  • 6:28 - 6:31
    If I have a bunch of numbers--
    and I'll draw a number line,
  • 6:31 - 6:33
    here.
  • 6:33 - 6:36
    If I draw a number
    line here-- let's
  • 6:36 - 6:39
    say I have a bunch of
    numbers in my population.
  • 6:39 - 6:41
    So let's say-- I'm just
    going to randomly put
  • 6:41 - 6:44
    a bunch of numbers
    in my population.
  • 6:44 - 6:46
    And the ones to the right
    are bigger than the ones
  • 6:46 - 6:46
    to the left.
  • 6:49 - 6:52
    And if I were to take a
    sample of them, right?
  • 6:52 - 6:55
    Maybe I take-- and the
    sample, it's random.
  • 6:55 - 6:56
    You actually want to
    take a random sample.
  • 6:56 - 6:58
    You don't want to be
    skewed in any way.
  • 6:58 - 7:04
    So maybe I take this one, this
    one, this one, and that one,
  • 7:04 - 7:05
    right?
  • 7:05 - 7:07
    And then if I were to take
    the mean of that number,
  • 7:07 - 7:08
    that number, that
    number, and that number,
  • 7:08 - 7:10
    it'll be someplace
    in the middle.
  • 7:10 - 7:11
    It might be
    someplace over there.
  • 7:11 - 7:14
    And then, if I wanted to figure
    out the sample variance using
  • 7:14 - 7:16
    this formula, I'd
    say, OK, this distance
  • 7:16 - 7:19
    squared plus this distance
    squared plus this distance
  • 7:19 - 7:23
    squared plus that distance
    squared, and average them
  • 7:23 - 7:23
    all out.
  • 7:23 - 7:25
    And then I would
    get this number.
  • 7:25 - 7:28
    And that probably would be
    a pretty good approximation
  • 7:28 - 7:30
    for the variance of
    this entire population.
  • 7:30 - 7:32
    The mean of the
    population is probably
  • 7:32 - 7:35
    going to be-- I don't know, it
    might be pretty close to this.
  • 7:35 - 7:37
    If we actually took all of the
    data points and averaged them,
  • 7:37 - 7:39
    maybe they're, like,
    here someplace.
  • 7:39 - 7:40
    And then, if you figure
    out the variance,
  • 7:40 - 7:42
    it probably would be pretty
    close to the average of all
  • 7:42 - 7:47
    of these squared distances, right--
    the sample variance distances.
  • 7:47 - 7:47
    Fair enough.
  • 7:47 - 7:50
    So you say, hey, Sal, this
    looks pretty good now.
  • 7:50 - 7:52
    But there's one little catch.
  • 7:52 - 7:55
    What if-- I mean, there's always
    a probability that, instead
  • 7:55 - 7:57
    of picking these fairly
    well-distributed numbers
  • 7:57 - 7:59
    in my sample, what
    if I happened to pick
  • 7:59 - 8:03
    this number, this number, and
    that number, and, let's say,
  • 8:03 - 8:06
    that number, as my sample?
  • 8:06 - 8:08
    Well, whatever your sample
    is, your sample mean's
  • 8:08 - 8:10
    always going to be in
    the middle of it, right?
  • 8:10 - 8:13
    So in this case, your sample
    mean might be right here.
  • 8:13 - 8:14
    So all of these numbers,
    you might say, OK,
  • 8:14 - 8:16
    this number's not too
    far from that number.
  • 8:16 - 8:18
    That number's not too far.
  • 8:18 - 8:19
    And then that
    number's not too far.
  • 8:19 - 8:22
    So your sample variance,
    when you do it this way,
  • 8:22 - 8:24
    it might turn out
    a little bit low.
  • 8:24 - 8:28
    Because all of these numbers,
    they're almost, by definition,
  • 8:28 - 8:30
    going to be pretty close
    to their own sample mean.
  • 8:30 - 8:34
    But in this case, your
    sample is kind of skewed,
  • 8:34 - 8:36
    and the actual mean
    of the population
  • 8:36 - 8:38
    is out here someplace.
  • 8:38 - 8:40
    So the actual variance
    of the sample,
  • 8:40 - 8:42
    if you had actually
    known the mean--
  • 8:42 - 8:44
    I know this is all a little
    confusing-- if you had actually
  • 8:44 - 8:47
    known the mean, you
    would've said, oh, wow.
  • 8:47 - 8:48
    You would have found
    these distances,
  • 8:48 - 8:51
    which would have
    been a lot more.
  • 8:51 - 8:53
    The whole point
    of what I'm saying
  • 8:53 - 8:55
    is, when you take
    a sample, there's
  • 8:55 - 8:58
    some chance that
    your sample mean is
  • 8:58 - 9:00
    pretty close to the
    population mean.
  • 9:00 - 9:03
    Maybe your sample mean is
    here, and your population mean
  • 9:03 - 9:03
    is here.
  • 9:03 - 9:05
    And then this formula
    would probably
  • 9:05 - 9:07
    work out pretty well, at least
    given your sample data points,
  • 9:07 - 9:09
    of figuring out what
    the variance is.
  • 9:09 - 9:14
    But there's a reasonable
    chance that it isn't.
  • 9:14 - 9:15
    Your sample
    mean is always
  • 9:15 - 9:17
    going to be within your
    data sample, right?
  • 9:17 - 9:19
    It's always going to be the
    center of your data sample.
  • 9:19 - 9:21
    But it's completely possible
    that the population mean
  • 9:21 - 9:23
    is outside of your data sample.
  • 9:23 - 9:24
    It might have just
    been-- you know,
  • 9:24 - 9:25
    you just happened
    to pick ones that
  • 9:25 - 9:28
    don't contain the
    actual population mean.
  • 9:28 - 9:32
    And then this sample
    variance, calculated this way,
  • 9:32 - 9:36
    will actually underestimate
    the actual population variance,
  • 9:36 - 9:38
    because they're always going
    to be closer to their own mean
  • 9:38 - 9:40
    than they are to
    the population mean.
  • 9:40 - 9:44
    And if you're understanding,
    frankly, even 10% of this,
  • 9:44 - 9:46
    you are a very advanced
    statistics student.
  • 9:46 - 9:48
    But I'm saying all of
    this to just give you,
  • 9:48 - 9:51
    hopefully, some
    intuition to realize
  • 9:51 - 9:54
    that this will
    often underestimate.
  • 9:54 - 9:57
    This formula will
    often underestimate
  • 9:57 - 9:59
    the actual population variance.
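
Here is a tiny numerical illustration of that intuition, using made-up numbers: a sample's squared distances from its own sample mean, summed up, are never larger than its squared distances from the true population mean (or from any other point).

```python
# Made-up numbers: a tightly clustered sample, and a population mean that
# happens to lie outside the cluster (both chosen purely for illustration).
sample = [2.0, 2.5, 3.0, 3.5]
true_population_mean = 5.0

sample_mean = sum(sample) / len(sample)  # 2.75

around_sample_mean = sum((x - sample_mean) ** 2 for x in sample) / len(sample)
around_true_mean = sum((x - true_population_mean) ** 2 for x in sample) / len(sample)

print(around_sample_mean)  # 0.3125 -- the divide-by-n variance, built on the sample mean
print(around_true_mean)    # 5.375  -- what we'd get if we actually knew the real mean
```
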
  • 9:59 - 10:01
    And there's a formula--
    and this is actually
  • 10:01 - 10:02
    proven more rigorously
    than I'll do
  • 10:02 - 10:06
    it-- that is considered to be
    a better-- or, as they call it,
  • 10:06 - 10:09
    unbiased-- estimate of
    the population variance,
  • 10:09 - 10:11
    or the unbiased sample variance.
  • 10:11 - 10:14
    And sometimes it's just
    denoted by the s squared again.
  • 10:14 - 10:19
    Sometimes it's denoted by
    this-- s n minus 1 squared--
  • 10:19 - 10:21
    and I'll show you why.
  • 10:21 - 10:22
    It's almost the same thing.
  • 10:22 - 10:24
    You take each of
    the data points,
  • 10:24 - 10:28
    figure out how far they
    are from the sample mean,
  • 10:28 - 10:30
    you square them, and then
    you take the average of those
  • 10:30 - 10:34
    squares, except for
    one slight difference.
  • 10:34 - 10:36
    i equals 1 to i equals n.
  • 10:36 - 10:38
    Instead of dividing
    by n, you divide
  • 10:38 - 10:42
    by a slightly smaller number.
  • 10:42 - 10:44
    You divide by n minus 1.
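
Written out, the unbiased version is almost identical, just with the smaller divisor:

\[
s^2 = s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\]
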
  • 10:44 - 10:48
    So when you divide by n minus
    1 instead of dividing by n,
  • 10:48 - 10:50
    you're going to get a
    slightly larger number here.
  • 10:50 - 10:51
    And it turns out
    that this is actually
  • 10:51 - 10:52
    a much better estimate.
  • 10:52 - 10:54
    And one day, I'm going to
    write a computer program
  • 10:54 - 10:57
    to at least prove it to
    myself, experimentally,
  • 10:57 - 11:02
    that this is a better estimate
    of the population variance.
  • 11:02 - 11:03
    And you would calculate
    it the same way.
  • 11:03 - 11:05
    You just divide by n minus 1.
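
That experiment is easy to sketch. The following is not from the video-- just one way such a program might look, using only Python's standard library: draw many small samples from a population whose variance we know, and compare which divisor, n or n minus 1, lands closer on average.

```python
import random

random.seed(0)
# Build a large "population" with a known spread (true variance is about 100).
population = [random.gauss(0, 10) for _ in range(100_000)]
pop_mean = sum(population) / len(population)
true_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

n, trials = 5, 20_000
divide_by_n = divide_by_n_minus_1 = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)   # draw a small random sample
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # squared distances from x-bar
    divide_by_n += ss / n
    divide_by_n_minus_1 += ss / (n - 1)

print("true variance:     ", round(true_var, 1))
print("average using n:   ", round(divide_by_n / trials, 1))
print("average using n-1: ", round(divide_by_n_minus_1 / trials, 1))
```

With enough trials, the divide-by-n average settles noticeably below the true variance (by roughly a factor of (n - 1)/n), while the divide-by-(n - 1) average sits close to it.
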
  • 11:05 - 11:08
    The other way to think about
    it-- and actually-- no, no.
  • 11:08 - 11:09
    I'm all out of time.
  • 11:09 - 11:10
    I'll leave you there, now,
    and then, in the next video,
  • 11:10 - 11:11
    we'll do a couple
    calculations, just
  • 11:11 - 11:13
    so you don't get too
    overwhelmed with these ideas,
  • 11:13 - 11:15
    because we're getting
    a little bit abstract.
  • 11:15 - 11:17
    See you in the next video.