Judging outliers in a dataset

  • 0:00 - 0:03
    - [Instructor] We have a
    list of 15 numbers here,
  • 0:03 - 0:06
    and what I want to do is
    think about the outliers.
  • 0:06 - 0:10
    And to help us with that,
    let's actually visualize this,
  • 0:10 - 0:12
    the distribution of actual numbers.
  • 0:12 - 0:14
    So let us do that.
  • 0:15 - 0:16
    So here, on a number line,
  • 0:16 - 0:19
    I have all the numbers from one to 19.
  • 0:20 - 0:24
    And let's see, we have two ones.
  • 0:24 - 0:28
    So I could say that's one
    one and then two ones.
  • 0:28 - 0:29
    We have one six.
  • 0:29 - 0:32
    So let's put that six there.
  • 0:32 - 0:33
    We have got a 13,
  • 0:35 - 0:36
    or we have two 13s.
  • 0:36 - 0:40
    So we're gonna go up
    here, one 13 and two 13s.
  • 0:41 - 0:44
    Let's see, we have three 14s.
  • 0:45 - 0:46
    So 14,
  • 0:47 - 0:48
    14,
  • 0:48 - 0:49
    and 14.
  • 0:50 - 0:53
    We have a couple of 15s, 15, 15.
  • 0:53 - 0:54
    So 15,
  • 0:55 - 0:56
    15.
  • 0:56 - 0:58
    We have one 16.
  • 0:58 - 1:01
    So that's our 16 there.
  • 1:01 - 1:03
    We have three 18s.
  • 1:03 - 1:05
    One, two, three.
  • 1:05 - 1:06
    So one,
  • 1:07 - 1:08
    two,
  • 1:08 - 1:10
    and then three.
  • 1:10 - 1:13
    And then we have a 19.
  • 1:13 - 1:15
    Then we have a 19.
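The tallies above can be checked with a short Python sketch. The `data` list is an assumption: it is rebuilt from the counts read out in the video (the original list is not shown in this transcript), and with a single 19 it comes to the 15 numbers mentioned at the start. The helper name `dot_plot` is hypothetical.

```python
from collections import Counter

# Assumed data set, rebuilt from the tallies read out in the transcript.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

# How many times each value occurs.
counts = Counter(data)


def dot_plot(values, start, stop):
    """Return one text row per integer from start to stop, one mark per occurrence."""
    tally = Counter(values)
    return [f"{n:>2} | " + "*" * tally[n] for n in range(start, stop + 1)]


for row in dot_plot(data, 1, 19):
    print(row)
```

Values that never occur, such as 2 through 5, print as empty rows, which makes the gap between the ones and the rest of the data easy to see.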
  • 1:15 - 1:16
    So when you look,
  • 1:16 - 1:18
    when you look visually at
    the distribution of numbers,
  • 1:18 - 1:21
    it looks like the meat of the
    distribution, so to speak,
  • 1:21 - 1:24
    is in this area, right over here.
  • 1:24 - 1:25
    And so some people might say,
  • 1:25 - 1:27
    "Okay, we have three outliers.
  • 1:27 - 1:28
    "There are these two ones and the six."
  • 1:28 - 1:29
    Some people might say,
  • 1:29 - 1:31
    "Well, the six is kinda close enough.
  • 1:31 - 1:34
    "Maybe only these two ones are outliers."
  • 1:34 - 1:38
    And those would actually be
    both reasonable things to say.
  • 1:38 - 1:41
    Now to get on the same page,
  • 1:41 - 1:45
    statisticians will use a rule sometimes.
  • 1:45 - 1:47
    We say, well, anything that is more than
  • 1:47 - 1:49
    one and a half times
    the interquartile range
  • 1:49 - 1:52
    below Q-one or above Q-three,
  • 1:52 - 1:55
    well, those are going to be outliers.
  • 1:55 - 1:56
    Well, what am I talking about?
  • 1:56 - 1:58
    Well, let's actually, let's
    figure out the median,
  • 1:58 - 2:00
    Q-one and Q-three here.
  • 2:00 - 2:02
    Then we can figure out
    the interquartile range.
  • 2:02 - 2:04
    And then we can figure
    out by that definition,
  • 2:04 - 2:06
    what is going to be an outlier?
  • 2:06 - 2:08
    And if that all made sense to you so far,
  • 2:08 - 2:09
    I encourage you to pause this video
  • 2:09 - 2:10
    and try to work through it on your own,
  • 2:10 - 2:13
    or I'll do it for you right now.
  • 2:13 - 2:16
    All right, so what's the median here?
  • 2:16 - 2:18
    Well, the median is the middle number.
  • 2:18 - 2:21
    We have 15 numbers, so the
    middle number is going to be
  • 2:21 - 2:23
    whatever number has seven on either side.
  • 2:23 - 2:24
    So it's gonna be the eighth number.
  • 2:24 - 2:28
    One, two, three, four, five, six, seven.
  • 2:28 - 2:29
    Is that right?
  • 2:29 - 2:33
    Yep, six, seven, so that's the median.
  • 2:33 - 2:36
    And then you have one, two,
    three, four, five, six, seven
  • 2:36 - 2:37
    numbers on the right side too.
  • 2:37 - 2:41
    So that is the median,
    sometimes called Q-two.
  • 2:41 - 2:43
    That is our median.
  • 2:43 - 2:45
    Now what is Q-one?
  • 2:45 - 2:48
    Well, Q-one is going to be the
    middle of this first group.
  • 2:48 - 2:50
    This first group has seven numbers in it.
  • 2:50 - 2:53
    And so the middle is going
    to be the fourth number.
  • 2:53 - 2:55
    It has three and three,
  • 2:55 - 2:57
    three to the left, three to the right.
  • 2:57 - 2:58
    So that is Q-one.
  • 3:00 - 3:01
    And then Q-three is going
  • 3:01 - 3:03
    to be the middle of this upper group.
  • 3:03 - 3:04
    Well, that also has seven numbers in it.
  • 3:04 - 3:06
    So the middle is going
    to be right over there.
  • 3:06 - 3:08
    It has three on either side.
  • 3:08 - 3:10
    So that is Q-three.
  • 3:12 - 3:14
    Now what is the interquartile
    range going to be?
  • 3:14 - 3:16
    Interquartile range
  • 3:17 - 3:19
    is going to be equal to
  • 3:19 - 3:21
    Q-three
  • 3:21 - 3:22
    minus Q-one,
  • 3:23 - 3:25
    the difference between 18 and 13.
  • 3:25 - 3:26
    Between 18 and 13,
  • 3:27 - 3:30
    well, that is going to be 18 minus 13,
  • 3:30 - 3:32
    which is equal to five.
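The median, Q-one, Q-three, and interquartile range worked out above can be sketched in Python. This follows the approach used in the video (the median of each half, leaving out the middle value when the count is odd); other quartile conventions exist and can give different values. The `data` list is an assumption rebuilt from the tallies in the video, and the function names are hypothetical.

```python
# Assumed data set, rebuilt from the tallies read out in the transcript.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]


def median(sorted_values):
    """Middle value of an already sorted list (mean of the two middle values if even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method from the video."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]                                   # values below the median
    upper = s[half + 1:] if n % 2 == 1 else s[half:]   # values above the median
    return median(lower), median(s), median(upper)


q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 13 14 18 5
```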
  • 3:32 - 3:34
    Now to figure out outliers,
  • 3:34 - 3:36
    well, outliers are gonna
    be anything that is below.
  • 3:36 - 3:37
    So outliers,
  • 3:39 - 3:40
    outliers,
  • 3:40 - 3:42
    are going to be less than
  • 3:42 - 3:43
    our Q-one
  • 3:45 - 3:45
    minus 1.5,
  • 3:47 - 3:49
    times our interquartile range.
  • 3:51 - 3:53
    And this, once again, this
    isn't some rule of the universe.
  • 3:53 - 3:54
    This is something that statisticians
  • 3:54 - 3:55
    have kind of said, well,
  • 3:55 - 3:57
    if we want to have a better
    definition for outliers,
  • 3:57 - 3:59
    let's just agree that
    it's something that's
  • 3:59 - 4:00
    more than one and a half times
  • 4:00 - 4:03
    the interquartile range below Q-one.
  • 4:03 - 4:04
    Or,
  • 4:04 - 4:08
    or an outlier could be
    greater than Q-three
  • 4:08 - 4:12
    plus one and a half times
    the interquartile range,
  • 4:12 - 4:14
    interquartile range.
  • 4:14 - 4:15
    And once again, this is somewhat,
  • 4:15 - 4:17
    you know, people just
    decided it felt right.
  • 4:17 - 4:18
    One could argue it should be 1.6.
  • 4:18 - 4:22
    Or one could argue it should
    be one, or two, or whatever.
  • 4:22 - 4:25
    But this is what people
    have tended to agree on.
  • 4:25 - 4:27
    So let's think about
    what these numbers are.
  • 4:27 - 4:28
    Q-one we already know.
  • 4:28 - 4:30
    So this is going to be 13
  • 4:30 - 4:34
    minus 1.5 times our interquartile range.
  • 4:34 - 4:37
    Our interquartile range here is five.
  • 4:37 - 4:40
    So it's 1.5 times five, which is 7.5.
  • 4:43 - 4:44
    So this is 7.5.
  • 4:46 - 4:48
    13 minus 7.5 is what?
  • 4:49 - 4:51
    13 minus seven is six,
  • 4:51 - 4:54
    and then you subtract another .5, is 5.5.
  • 4:54 - 4:56
    So we have outliers,
  • 4:56 - 4:57
    outliers.
  • 4:57 - 4:58
    Outliers
  • 4:59 - 5:01
    would be less than 5.5.
  • 5:03 - 5:04
    Or
  • 5:04 - 5:06
    the Q-three is 18,
  • 5:06 - 5:08
    this is, once again, 7.5.
  • 5:10 - 5:11
    18 plus 7.5
  • 5:12 - 5:13
    is 25.5,
  • 5:14 - 5:15
    or outliers,
  • 5:17 - 5:19
    outliers greater than 25,
  • 5:21 - 5:22
    25.5.
  • 5:23 - 5:24
    So based on this, we have a,
  • 5:24 - 5:26
    kind of a numerical definition
    for what's an outlier.
  • 5:26 - 5:28
    We're not just subjectively saying,
  • 5:28 - 5:30
    well, this feels right
    or that feels right.
  • 5:30 - 5:33
    And based on this, we
    only have two outliers,
  • 5:33 - 5:37
    that only these two
    ones are less than 5.5.
  • 5:37 - 5:40
    Only these two ones are less than 5.5.
  • 5:40 - 5:43
    This is the cutoff, right over here.
  • 5:43 - 5:45
    So this dot just happened to make it.
  • 5:45 - 5:48
    And we don't have any
    outliers on the high side.
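The two cutoffs and the resulting outliers can be reproduced with a few lines of Python. This is a minimal sketch: Q-one and Q-three are taken as 13 and 18 from the video, the `data` list is an assumption rebuilt from the tallies, and the names `lower_fence` and `upper_fence` are hypothetical.

```python
# Assumed data set, rebuilt from the tallies read out in the transcript.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

q1, q3 = 13, 18            # quartiles as worked out in the video
iqr = q3 - q1              # 18 - 13 = 5

lower_fence = q1 - 1.5 * iqr   # 13 - 7.5 = 5.5
upper_fence = q3 + 1.5 * iqr   # 18 + 7.5 = 25.5

# Anything strictly beyond either cutoff counts as an outlier under this rule.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # 5.5 25.5 [1, 1]
```

The six sits just above the lower cutoff of 5.5, so it is kept, and nothing in the data comes close to 25.5.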
  • 5:48 - 5:50
    Now another thing to think about
  • 5:50 - 5:52
    is drawing box-and-whiskers plots
  • 5:52 - 5:54
    based on Q-one, our median, our range,
  • 5:54 - 5:56
    all the range of numbers.
  • 5:56 - 5:57
    And you could do it either
  • 5:57 - 5:58
    taking into consideration your outliers
  • 5:58 - 6:02
    or not taking into
    consideration your outliers.
  • 6:02 - 6:05
    So there's a couple of
    ways that we can do it.
  • 6:05 - 6:09
    So let me actually clear,
    let me clear all of this.
  • 6:09 - 6:12
    We've figured out all of this stuff.
  • 6:12 - 6:14
    So let me clear all of that out.
  • 6:14 - 6:18
    And let's actually draw
    a box-and-whiskers plot.
  • 6:18 - 6:19
    So I'll put another,
  • 6:22 - 6:24
    another, actually let me do two here.
  • 6:24 - 6:25
    That's one,
  • 6:26 - 6:29
    and then let me put
    another one down there.
  • 6:29 - 6:31
    And then this is another.
  • 6:31 - 6:32
    Now if we were to just draw
  • 6:32 - 6:35
    a classic box-and-whiskers plot here,
  • 6:35 - 6:38
    we would say, all right,
    our median's at 14.
  • 6:38 - 6:39
    And actually, I'll do it both ways.
  • 6:39 - 6:40
    Our median's at 14.
  • 6:40 - 6:42
    Median's at 14.
  • 6:42 - 6:44
    Q-one's at 13.
  • 6:44 - 6:45
    Q-one's at 13,
  • 6:45 - 6:47
    and Q-one's at 13.
  • 6:47 - 6:48
    Q-three is at 18.
  • 6:48 - 6:50
    Q-three is at 18,
  • 6:50 - 6:51
    Q-three is 18.
  • 6:51 - 6:53
    So that's the box part.
  • 6:53 - 6:55
    Now let me draw that as an actual,
  • 6:55 - 6:58
    let me actually draw that as a box.
  • 6:58 - 6:59
    So my best attempt,
  • 7:01 - 7:02
    there you go.
  • 7:02 - 7:03
    That's the box.
  • 7:03 - 7:06
    And this is also a box.
  • 7:06 - 7:08
    So far, I'm doing the exact same thing.
  • 7:08 - 7:10
    Now if we don't want to consider outliers,
  • 7:10 - 7:11
    we would say, well, what's
    the entire range here?
  • 7:11 - 7:14
    Well, we have things that go
    from one all the way to 19.
  • 7:14 - 7:16
    So one way to do it is to, hey,
  • 7:16 - 7:18
    we start at one.
  • 7:18 - 7:20
    And so our entire range, we go,
  • 7:20 - 7:22
    actually let me draw it a
    little bit better than that.
  • 7:22 - 7:25
    We're going all the way,
  • 7:25 - 7:26
    all the way from one
  • 7:28 - 7:29
    to 19.
  • 7:30 - 7:32
    Now in this one, we're
    including everything.
  • 7:32 - 7:35
    We're including even these two outliers.
  • 7:35 - 7:37
    But if we don't want to
    include those outliers,
  • 7:37 - 7:39
    we want to make it clear
    that they're outliers,
  • 7:39 - 7:40
    well, let's not include them.
  • 7:40 - 7:43
    And what we can do instead is say,
  • 7:43 - 7:46
    all right, including
    (chuckles) our non-outliers,
  • 7:46 - 7:48
    we would start at six
  • 7:48 - 7:50
    'cause six we're saying
    is in our data set,
  • 7:50 - 7:52
    but it is not an outlier.
  • 7:52 - 7:54
    Let me make this look better.
  • 7:54 - 7:56
    So we're gonna,
  • 7:56 - 7:57
    we are going to
  • 7:59 - 8:02
    start at six and go all the way to 19.
  • 8:04 - 8:06
    And then to say that
    we have these outliers,
  • 8:06 - 8:09
    we would put this, we
    have outliers over there.
  • 8:09 - 8:12
    So once again, this is
    a box-and-whiskers plot
  • 8:12 - 8:14
    of the same data set without marking outliers.
  • 8:14 - 8:16
    And this is one where we make specific,
  • 8:16 - 8:20
    we make it clear where
    the outliers actually are.
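The difference between the two plots comes down to where the whiskers end. The Python sketch below computes the whisker endpoints for both versions; it does no drawing. The `data` list is an assumption rebuilt from the tallies in the video, and the function name `whiskers` and its parameters are hypothetical.

```python
# Assumed data set, rebuilt from the tallies read out in the transcript.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]


def whiskers(values, q1, q3, mark_outliers, k=1.5):
    """Return (low whisker end, high whisker end, points drawn separately as outliers)."""
    s = sorted(values)
    if not mark_outliers:
        # Classic version: whiskers span the entire range of the data.
        return s[0], s[-1], []
    iqr = q3 - q1
    low_cut, high_cut = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in s if low_cut <= x <= high_cut]
    outside = [x for x in s if x < low_cut or x > high_cut]
    # Whiskers stop at the most extreme values that are not outliers.
    return inside[0], inside[-1], outside


print(whiskers(data, 13, 18, mark_outliers=False))  # (1, 19, [])
print(whiskers(data, 13, 18, mark_outliers=True))   # (6, 19, [1, 1])
```

In both cases the box itself is unchanged: it runs from 13 to 18 with the median line at 14.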
Video Language:
English
Team:
Khan Academy
Duration:
08:21
