Pearson's Chi Square Test (Goodness of Fit)

  • 0:01 - 0:03
    I'm thinking about
    buying a restaurant,
  • 0:03 - 0:04
    so I go and ask
    the current owner,
  • 0:04 - 0:07
    what is the distribution
    of the number of customers
  • 0:07 - 0:08
    you get each day?
  • 0:08 - 0:10
    And he says, oh, I've
    already figured that out.
  • 0:10 - 0:11
    And he gives me
    this distribution
  • 0:11 - 0:14
    over here, which essentially
    says 10% of his customers come
  • 0:14 - 0:17
    in on Monday, 10% on
    Tuesday, 15% on Wednesday,
  • 0:17 - 0:18
    so forth, and so on.
  • 0:18 - 0:20
    They're closed on Sunday.
  • 0:20 - 0:22
    So this is 100% of the
    customers for a week.
  • 0:22 - 0:23
    If you add that
    up, you get 100%.
  • 0:23 - 0:25
    I obviously am a
    little bit suspicious,
  • 0:25 - 0:30
    so I decide to see how well
    this distribution that he's
  • 0:30 - 0:32
    describing actually
    fits observed data.
  • 0:32 - 0:35
    So I actually observe the number
    of customers, when they come in
  • 0:35 - 0:37
    during the week,
    and this is what
  • 0:37 - 0:39
    I get from my observed data.
  • 0:39 - 0:43
    So to figure out whether
    I want to accept or reject
  • 0:43 - 0:44
    his hypothesis right
    here, I'm going
  • 0:44 - 0:47
    to do a little bit
    of a hypothesis test.
  • 0:47 - 0:58
    So I'll make the null hypothesis
    that the owner's distribution--
  • 0:58 - 1:00
    so that's this thing
    right here-- is correct.
  • 1:03 - 1:07
    And then the
    alternative hypothesis
  • 1:07 - 1:10
    is going to be that
    it is not correct,
  • 1:10 - 1:12
    that it is not a
    correct distribution,
  • 1:12 - 1:15
    that I should not feel
    reasonably OK relying on this.
  • 1:15 - 1:17
    It's not the correct--
    I should reject
  • 1:17 - 1:19
    the owner's distribution.
  • 1:19 - 1:24
    And I want to do this with
    a significance level of 5%.
  • 1:27 - 1:28
    Or another way of
    thinking about it,
  • 1:28 - 1:31
    I'm going to calculate a
    statistic based on this data
  • 1:31 - 1:32
    right here.
  • 1:32 - 1:35
    And it's going to be
    a chi-square statistic.
  • 1:35 - 1:37
    Or another way to view
    it is that the statistic
  • 1:37 - 1:40
    that I'm going to
    calculate has approximately
  • 1:40 - 1:42
    a chi-square distribution.
  • 1:42 - 1:44
    And given that it does have
    a chi-square distribution
  • 1:44 - 1:46
    with a certain number
    of degrees of freedom
  • 1:46 - 1:49
    and we're going to calculate
    that, what I want to see
  • 1:49 - 1:52
    is whether the probability of
    getting this result,
  • 1:52 - 1:55
    or getting a result like
    this or a result more extreme,
  • 1:55 - 1:57
    is less than 5%.
  • 1:57 - 2:00
    If the probability of getting
    a result like this or something
  • 2:00 - 2:03
    less likely than
    this is less than 5%,
  • 2:03 - 2:07
    then I'm going to reject
    the null hypothesis, which
  • 2:07 - 2:11
    is essentially just rejecting
    the owner's distribution.
  • 2:11 - 2:14
    If I don't get
    that, if I say, hey,
  • 2:14 - 2:17
    the probability of getting
    a chi-square statistic that
  • 2:17 - 2:22
    is this extreme or more
    is greater than my alpha,
  • 2:22 - 2:25
    than my significance level,
    then I'm not going to reject it.
  • 2:25 - 2:26
    I'm going to say,
    well, I have no reason
  • 2:26 - 2:28
    to really assume
    that he's lying.
  • 2:28 - 2:30
    So let's do that.
  • 2:30 - 2:33
    So to calculate the chi-square
    statistic, what I'm going to do
  • 2:33 - 2:36
    is-- so here we're assuming
    the owner's distribution is
  • 2:36 - 2:37
    correct.
  • 2:41 - 2:43
    So assuming the
    owner's distribution
  • 2:43 - 2:48
    was correct, what would have
    been the expected observed?
  • 2:48 - 2:50
    So we have expected
    percentage here,
  • 2:50 - 2:52
    but what would have been
    the expected observed?
  • 2:52 - 2:53
    So let me write this right here.
  • 2:53 - 2:54
    Expected.
  • 2:54 - 2:57
    I'll add another row, Expected.
  • 2:57 - 3:00
    So we would have expected
    10% of the total customers
  • 3:00 - 3:01
    in that week to
    come in on Monday,
  • 3:01 - 3:03
    10% of the total
    customers of that week
  • 3:03 - 3:06
    to come in on Tuesday, 15%
    to come in on Wednesday.
  • 3:06 - 3:08
    Now to figure out what
    the actual number is,
  • 3:08 - 3:11
    we need to figure out the
    total number of customers.
  • 3:11 - 3:14
    So let's add up these
    numbers right here.
  • 3:14 - 3:18
    So we have-- I'll get
    the calculator out.
  • 3:18 - 3:27
    So we have 30 plus 14 plus
    34 plus 45 plus 57 plus 20.
  • 3:27 - 3:28
    So there's a total
    of 200 customers who
  • 3:28 - 3:31
    came into the
    restaurant that week.
  • 3:31 - 3:32
    So let me write this down.
  • 3:32 - 3:38
    So this is equal to-- so I
    wrote the total over here.
  • 3:38 - 3:39
    Ignore this right here.
  • 3:39 - 3:41
    I had 200 customers
    come in for the week.
  • 3:41 - 3:44
    So what was the expected
    number on Monday?
  • 3:44 - 3:47
    Well, on Monday, we would
    have expected 10% of the 200
  • 3:47 - 3:47
    to come in.
  • 3:47 - 3:51
    So this would have been 20
    customers, 10% times 200.
  • 3:51 - 3:53
    On Tuesday, another 10%.
  • 3:53 - 3:55
    So we would have
    expected 20 customers.
  • 3:55 - 3:59
    Wednesday, 15% of 200,
    that's 30 customers.
  • 3:59 - 4:03
    On Thursday, we would have
    expected 20% of 200 customers,
  • 4:03 - 4:05
    so that would have
    been 40 customers.
  • 4:05 - 4:09
    Then on Friday, 30%, that
    would have been 60 customers.
  • 4:09 - 4:11
    And then on Saturday 15% again.
  • 4:11 - 4:14
    15% of 200 would have
    been 30 customers.
  • 4:14 - 4:16
    So if this distribution
    is correct,
  • 4:16 - 4:21
    this is the actual number
    that I would have expected.
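The expected row can be reproduced with a short Python sketch. The day labels and the variable names `claimed_pct`, `observed`, and `expected` are illustrative choices, not anything from the video; the percentages and counts are the ones read out above, with the sixth day taken to be Saturday because the restaurant is closed on Sunday.

```python
# Owner's claimed share of weekly customers per day, in percent.
claimed_pct = {"Mon": 10, "Tue": 10, "Wed": 15, "Thu": 20, "Fri": 30, "Sat": 15}

# Customers actually counted during the observed week.
observed = {"Mon": 30, "Tue": 14, "Wed": 34, "Thu": 45, "Fri": 57, "Sat": 20}

# Total customers for the week; the expected counts must add up to this too.
total = sum(observed.values())

# Expected count per day if the owner's distribution is correct.
expected = {day: pct * total / 100 for day, pct in claimed_pct.items()}

for day in observed:
    print(f"{day}: observed {observed[day]:3d}, expected {expected[day]:5.1f}")
```

Keeping the percentages as whole numbers and dividing by 100 at the end avoids small floating-point rounding in the expected counts.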
  • 4:21 - 4:24
    Now to calculate the
    chi-square statistic,
  • 4:24 - 4:27
    we essentially just take--
    let me just show it to you,
  • 4:27 - 4:29
    and instead of
    writing chi, I'm going
  • 4:29 - 4:30
    to write capital X squared.
  • 4:30 - 4:33
    Sometimes someone will write the
    actual Greek letter chi here.
  • 4:33 - 4:36
    But I'll write the
    x squared here.
  • 4:36 - 4:37
    And let me write it this way.
  • 4:37 - 4:45
    This is our
    chi-square statistic,
  • 4:45 - 4:48
    but I'm going to write it with
    a capital X instead of a chi
  • 4:48 - 4:50
    because this is going
    to have approximately
  • 4:50 - 4:52
    a chi-squared distribution.
  • 4:52 - 4:54
    I can't assume
    that it's exactly,
  • 4:54 - 4:56
    so this is where we're dealing
    with approximations right here.
  • 4:56 - 4:59
    But it's fairly
    straightforward to calculate.
  • 4:59 - 5:01
    For each of the days,
    we take the difference
  • 5:01 - 5:03
    between the observed
    and expected.
  • 5:03 - 5:08
    So it's going to
    be 30 minus 20--
  • 5:08 - 5:12
    I'll do the first one
    color coded-- squared
  • 5:12 - 5:14
    divided by the expected.
  • 5:14 - 5:16
    So we're essentially
    taking the square
  • 5:16 - 5:19
    of what you could almost call
    the error between what
  • 5:19 - 5:22
    we observed and expected, or
    the difference between what
  • 5:22 - 5:24
    we observed and expected, and
    we're kind of normalizing it
  • 5:24 - 5:26
    by the expected right over here.
  • 5:26 - 5:28
    But we want to take the
    sum of all of these.
  • 5:28 - 5:31
    So I'll just do all
    of those in yellow.
  • 5:31 - 5:45
    So plus 14 minus 20 squared
    over 20 plus 34 minus 30 squared
  • 5:45 - 5:54
    over 30 plus-- I'll continue
    over here-- 45 minus 40 squared
  • 5:54 - 6:05
    over 40 plus 57 minus
    60 squared over 60,
  • 6:05 - 6:13
    and then finally, plus 20
    minus 30 squared over 30.
  • 6:13 - 6:15
    I just took the observed
    minus the expected
  • 6:15 - 6:16
    squared over the expected.
  • 6:16 - 6:18
    I took the sum of
    it, and this is
  • 6:18 - 6:20
    what gives us our
    chi-square statistic.
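The sum just described can be written compactly as a formula. The symbols are assumptions introduced here for illustration: O_i is the observed count on day i, E_i is the expected count on day i, and n = 6 is the number of days.

```latex
X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
    = \frac{(30-20)^2}{20} + \frac{(14-20)^2}{20} + \frac{(34-30)^2}{30}
    + \frac{(45-40)^2}{40} + \frac{(57-60)^2}{60} + \frac{(20-30)^2}{30}
```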
  • 6:20 - 6:24
    Now let's just calculate what
    this number is going to be.
  • 6:24 - 6:27
    So this is going to be equal
    to-- I'll do it over here
  • 6:27 - 6:28
    so I don't run out of space.
  • 6:28 - 6:30
    So we'll do this in a new color.
  • 6:30 - 6:31
    We'll do it in orange.
  • 6:31 - 6:34
    This is going to be
    equal to 30 minus 20
  • 6:34 - 6:41
    is 10 squared, which is 100
    divided by 20, which is 5.
  • 6:41 - 6:43
    I might not be able to do all
    of them in my head like this.
  • 6:43 - 6:45
    Plus, actually, let me
    just write it this way
  • 6:45 - 6:48
    just so you can
    see what I'm doing.
  • 6:48 - 6:53
    This right here is 100
    over 20 plus 14 minus 20
  • 6:53 - 6:56
    is negative 6 squared
    is positive 36.
  • 6:56 - 7:00
    So plus 36 over 20.
  • 7:00 - 7:04
    Plus 34 minus 30 is
    4, squared is 16.
  • 7:04 - 7:07
    So plus 16 over 30.
  • 7:07 - 7:11
    Plus 45 minus 40
    is 5 squared is 25.
  • 7:11 - 7:15
    So plus 25 over 40.
  • 7:15 - 7:18
    Plus the difference
    here is 3 squared is 9,
  • 7:18 - 7:20
    so it's 9 over 60.
  • 7:20 - 7:27
    Plus we have a difference of
    10 squared is plus 100 over 30.
  • 7:27 - 7:30
    And this is equal to-- and I'll
    just get the calculator out
  • 7:30 - 7:36
    for this-- this is
    equal to, we have
  • 7:36 - 7:42
    100 divided by 20
    plus 36 divided by 20
  • 7:42 - 7:49
    plus 16 divided by 30
    plus 25 divided by 40
  • 7:49 - 8:02
    plus 9 divided by 60 plus 100
    divided by 30 gives us 11.44.
  • 8:02 - 8:03
    So let me write that down.
  • 8:03 - 8:10
    So this right here
    is going to be 11.44.
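The same arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation done on the calculator above; the list names `observed`, `expected`, and `terms` are illustrative.

```python
# Observed and expected customer counts, Monday through Saturday.
observed = [30, 14, 34, 45, 57, 20]
expected = [20, 20, 30, 40, 60, 30]

# Each day's contribution: squared difference divided by the expected count.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# The chi-square statistic is the sum of the six contributions.
chi_square = sum(terms)

print([round(t, 3) for t in terms])
print(round(chi_square, 2))
```

Printing the individual terms makes it easy to see which days drive the statistic: Monday and Saturday contribute the most.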
  • 8:10 - 8:12
    This is my chi-square
    statistic, or we
  • 8:12 - 8:14
    could call it a big
    capital X squared.
  • 8:14 - 8:17
    Sometimes you'll have it
    written as a chi-square,
  • 8:17 - 8:20
    but this statistic is
    going to have approximately
  • 8:20 - 8:22
    a chi-square distribution.
  • 8:22 - 8:24
    Anyway, with that
    said, let's figure out,
  • 8:24 - 8:28
    if we assume that it has roughly
    a chi-square distribution, what
  • 8:28 - 8:33
    is the probability of getting a
    result this extreme or at least
  • 8:33 - 8:36
    this extreme, I guess is another
    way of thinking about it.
  • 8:36 - 8:40
    Or another way of saying it: is
    this a more extreme result
  • 8:40 - 8:42
    than the critical
    chi-square value,
  • 8:42 - 8:45
    the value where there's a 5% chance of
    getting a result at least that extreme?
  • 8:45 - 8:46
    So let's do it that way.
  • 8:46 - 8:49
    Let's figure out the
    critical chi-square value.
  • 8:49 - 8:50
    And if this is more
    extreme than that,
  • 8:50 - 8:53
    then we will reject
    our null hypothesis.
  • 8:53 - 8:57
    So let's figure out our
    critical chi-square values.
  • 8:57 - 8:58
    So we have an alpha of 5%.
  • 8:58 - 9:00
    And actually the other
    thing we have to figure out
  • 9:00 - 9:03
    is the degrees of freedom.
  • 9:03 - 9:07
    The degrees of freedom, we're
    taking one, two, three, four,
  • 9:07 - 9:09
    five, six sums, so
    you might be tempted
  • 9:09 - 9:11
    to say the degrees
    of freedom are six.
  • 9:11 - 9:13
    But one thing to
    realize is that if you
  • 9:13 - 9:15
    had all of this
    information over here,
  • 9:15 - 9:20
    you could actually figure out
    this last piece of information,
  • 9:20 - 9:22
    so you actually have
    five degrees of freedom.
  • 9:22 - 9:24
    When you have just kind of
    n data points like this,
  • 9:24 - 9:27
    and you're measuring kind of
    the observed versus expected,
  • 9:27 - 9:29
    your degrees of freedom
    are going to be n minus 1,
  • 9:29 - 9:31
    because you could figure
    out that nth data point just
  • 9:31 - 9:33
    based on everything
    else that you have,
  • 9:33 - 9:35
    all of the other information.
  • 9:35 - 9:37
    So our degrees of freedom
    here are going to be 5.
  • 9:37 - 9:40
    It's n minus 1.
  • 9:40 - 9:43
    So our significance level is 5%.
  • 9:43 - 9:48
    And our degrees of freedom is
    also going to be equal to 5.
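A small Python sketch of why one degree of freedom is lost: once the weekly total and five of the six daily counts are known, the sixth count is forced. The variable names are illustrative.

```python
# Observed customer counts, Monday through Saturday.
observed = [30, 14, 34, 45, 57, 20]
total = sum(observed)

# Knowing the total and the first five counts pins down the last one,
# so only five of the six counts are free to vary.
implied_last = total - sum(observed[:-1])

# Degrees of freedom for this goodness-of-fit test: categories minus one.
degrees_of_freedom = len(observed) - 1

print(implied_last, degrees_of_freedom)
```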
  • 9:48 - 9:51
    So let's look at our
    chi-square distribution.
  • 9:51 - 9:53
    We have a degree
    of freedom of 5.
  • 9:56 - 9:59
    We have a significance
    level of 5%.
  • 9:59 - 10:04
    And so the critical
    chi-square value is 11.07.
  • 10:04 - 10:05
    So let's go with this chart.
  • 10:05 - 10:07
    So we have a
    chi-squared distribution
  • 10:07 - 10:09
    with a degree of freedom of 5.
  • 10:09 - 10:12
    So that's this distribution
    over here in magenta.
  • 10:12 - 10:16
    And we care about a
    critical value of 11.07.
  • 10:16 - 10:17
    So this is right here.
  • 10:17 - 10:19
    Oh, you actually can't
    even see it on this.
  • 10:19 - 10:21
    So if I were to keep drawing
    this magenta thing all
  • 10:21 - 10:27
    the way over here, if the
    magenta line just kept going,
  • 10:27 - 10:29
    over here, you'd have 8.
  • 10:29 - 10:30
    Over here you'd have 10.
  • 10:30 - 10:32
    Over here, you'd have 12.
  • 10:32 - 10:36
    11.07 is maybe some
    place right over there.
  • 10:36 - 10:38
    So what it's saying
    is the probability
  • 10:38 - 10:50
    of getting a result at least
    as extreme as 11.07 is 5%.
  • 10:50 - 10:52
    So we could write it even here.
  • 10:52 - 10:58
    Our critical chi-square value is
    equal to-- we just saw-- 11.07.
  • 10:58 - 11:00
    Let me look at the chart again.
  • 11:00 - 11:07
    11.07.
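The table value can be cross-checked without a chart. The sketch below is an assumption-laden illustration, not the method used in the video: it relies on a closed-form upper-tail formula that holds only for 5 degrees of freedom, and it finds the critical value by bisection. The function names `chi2_sf_df5` and `critical_value` are made up for this example; with SciPy installed, `scipy.stats.chi2.ppf(0.95, 5)` would be the usual shortcut.

```python
import math


def chi2_sf_df5(x: float) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with
    exactly 5 degrees of freedom (closed form valid for df = 5 only)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return tail + correction


def critical_value(alpha: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Find x with chi2_sf_df5(x) == alpha by bisection.

    The upper-tail probability decreases as x grows, so if the tail at the
    midpoint is still larger than alpha, the answer lies to the right.
    """
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_df5(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


crit = critical_value(0.05)
print(round(crit, 2))
```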
  • 11:07 - 11:09
    The result we got
    for our statistic
  • 11:09 - 11:13
    is even less likely than that.
  • 11:13 - 11:16
    The probability is less
    than our significance level.
  • 11:16 - 11:19
    So then we are going to reject.
  • 11:19 - 11:21
    So the probability
    of getting that is--
  • 11:21 - 11:27
    let me put it this
    way-- 11.44 is
  • 11:27 - 11:31
    more extreme than our
    critical chi-square level.
  • 11:31 - 11:36
    So a result like this would be unlikely
    if this distribution were true.
  • 11:36 - 11:42
    So we will reject
    what he's telling us.
  • 11:42 - 11:44
    We will reject
    this distribution.
  • 11:44 - 11:48
    It's not a good fit based
    on this significance level.
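The whole test can be pulled together in one short Python sketch that also reports the p-value, which the video compares only indirectly through the critical value. As before, `chi2_sf_df5` is a hypothetical helper that uses a closed form valid only for 5 degrees of freedom; with SciPy installed, `scipy.stats.chisquare(f_obs, f_exp)` should give the same statistic and p-value.

```python
import math


def chi2_sf_df5(x: float) -> float:
    """Upper-tail probability for a chi-square variable with 5 degrees
    of freedom (closed form valid for df = 5 only)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return tail + correction


observed = [30, 14, 34, 45, 57, 20]   # Monday through Saturday
expected = [20, 20, 30, 40, 60, 30]   # owner's percentages times 200
alpha = 0.05                          # significance level

# Chi-square statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Probability of a statistic at least this large if the owner is right.
p_value = chi2_sf_df5(statistic)

# Reject the owner's distribution when the p-value falls below alpha.
reject = p_value < alpha

print(f"statistic = {statistic:.2f}, p-value = {p_value:.4f}, reject = {reject}")
```

The p-value comes out a little below 5%, which matches the conclusion above: the statistic sits just past the critical value, so the rejection is a narrow one.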
Video Language:
English
Team:
Khan Academy
Duration:
11:48
