< Return to Video

Lecture 10 | Machine Learning (Stanford)

  • 0:12 - 0:16
    This presentation is delivered by the Stanford Center for Professional
  • 0:16 - 0:23
    Development.
  • 0:24 - 0:28
    So what I want to do today in this lecture is
  • 0:28 - 0:31
    talk a little bit more about learning theory. In particular, I'll
  • 0:31 - 0:33
    talk about VC dimension
  • 0:33 - 0:36
    and
  • 0:36 - 0:40
    building on the issues of bias variance tradeoffs of under
  • 0:40 - 0:43
    fitting and over fitting; that we've been seeing in the previous lecture, and then we'll see in
  • 0:43 - 0:44
    this one.
  • 0:44 - 0:49
    I then want to talk about model selection algorithms for automatically
  • 0:49 - 0:50
    making
  • 0:50 - 0:52
    decisions for this
  • 0:52 - 0:55
    bias variance tradeoff, that we started to talk about in the previous lecture.
  • 0:55 - 0:59
    And depending on how much time, I actually may not get to Bayesian, [inaudible]. But if
  • 0:59 - 1:01
    I don't get to this today,
  • 1:01 - 1:07
    I'll get to this in next week's lecture.
  • 1:07 - 1:10
    To recap:
  • 1:10 - 1:10
    the result we
  • 1:10 - 1:14
    proved in the previous lecture was that
  • 1:14 - 1:15
    if you have a finite
  • 1:15 - 1:16
    hypothesis class if
  • 1:16 - 1:20
    h is a set of k hypotheses,
  • 1:20 - 1:24
  • 1:24 - 1:26
    and suppose you have some
  • 1:26 - 1:33
    fixed parameters, gamma and delta,
  • 1:33 - 1:34
    then
  • 1:34 - 1:37
  • 1:37 - 1:41
    in order to guarantee
  • 1:41 - 1:48
    that this holds,
  • 1:48 - 1:51
    with probability at least one minus delta.
  • 1:51 - 1:56
    It suffices
  • 1:56 - 2:03
    that m is greater than or equal to that; okay?
  • 2:05 - 2:07
    And using big-O notations,
  • 2:07 - 2:14
    just dropping the constants, I can also write this as that; okay?
  • 2:14 - 2:18
    So just to quickly remind you of what all of the notation means,
  • 2:18 - 2:21
    we talked about empirical risk minimization,
  • 2:21 - 2:22
    which was
  • 2:22 - 2:25
    the simplified model of machine learning
  • 2:25 - 2:26
    that
  • 2:26 - 2:29
    has a hypothesis class of script h.
  • 2:29 - 2:33
    And what the empirical risk minimization-learning algorithm does is it just chooses the
  • 2:33 - 2:34
    hypothesis
  • 2:34 - 2:38
    that attains the smallest error on the training set.
  • 2:38 - 2:39
    And so
  • 2:39 - 2:43
    this symbol, epsilon, just denoted generalization error; right? This is the
  • 2:43 - 2:44
    probability
  • 2:44 - 2:49
    of a hypothesis h [inaudible] misclassifying a new example drawn from the same
  • 2:49 - 2:51
    distribution as the training set.
  • 2:51 - 2:53
    And so this says that
  • 3:02 - 3:06
    in order to guarantee that the generalization error of the hypothesis h
  • 3:06 - 3:09
    [inaudible] output by empirical risk minimization
  • 3:09 - 3:11
    that this is less than or equal to
  • 3:11 - 3:14
    the best possible generalization error
  • 3:15 - 3:20
    you could get in your hypothesis class, plus two times gamma, two times this error threshold.
  • 3:20 - 3:24
    We want to guarantee that this holds with probability at least one minus delta.
  • 3:24 - 3:26
    We show that
  • 3:26 - 3:29
    it suffices for your training set size m
  • 3:29 - 3:33
    to be greater than or equal to this; okay? One over two gamma squared, log two k over
  • 3:33 - 3:34
    delta;
  • 3:34 - 3:39
    where again, k is the size of your hypothesis class.
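    [The board equations aren't captured in the transcript; reconstructing them from the surrounding description, the recap is:]

\[
\varepsilon(\hat{h}) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\gamma
\quad\text{with probability at least } 1-\delta,
\qquad\text{provided}\quad
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2k}{\delta} \;=\; O\!\Big(\frac{1}{\gamma^2}\log\frac{k}{\delta}\Big).
\]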
  • 3:39 - 3:43
    And so this is a sample complexity result because it gives us a bound on the
  • 3:43 - 3:45
    number of training examples we need
  • 3:45 - 3:47
    in order to give a guarantee
  • 3:47 - 3:49
    on something on the error; okay? So
  • 3:49 - 3:55
    this is a sample complexity result. So
  • 3:55 - 3:59
    what I want to do now is take this result, and try to generalize it to the
  • 3:59 - 4:03
    case of infinite hypothesis classes. So here,
  • 4:03 - 4:07
    we said that the set script h is sort of just k
  • 4:07 - 4:09
    specific functions,
  • 4:09 - 4:12
    whereas in practice you may want to use a model like logistic regression, which is actually
  • 4:12 - 4:14
    parameterized by real numbers.
  • 4:14 - 4:15
    So
  • 4:15 - 4:17
    I'm actually first going to give an argument that's sort of
  • 4:17 - 4:21
    formally broken just sort of technically somewhat broken, but conveys
  • 4:21 - 4:25
    useful intuition. And then I'll give the more correct argument, but
  • 4:25 - 4:30
    without proving it, because the full proof is somewhat involved. So here's the somewhat broken argument. Let's say I want
  • 4:30 - 4:31
    to
  • 4:31 - 4:35
    apply this result to analyze logistic regression.
  • 4:36 - 4:37
    So let's say
  • 4:39 - 4:45
    your hypothesis class is the class of all linear decision boundaries; right? So say
  • 4:45 - 4:47
    script h is
  • 4:47 - 4:52
    parameterized
  • 4:52 - 4:58
    by d
  • 4:58 - 5:00
    real numbers; okay? So for example, if
  • 5:00 - 5:03
    you're applying logistic regression with
  • 5:03 - 5:07
    over [inaudible], then d would be n plus one; with logistic regression you'd find the
  • 5:07 - 5:09
    linear decision boundary,
  • 5:09 - 5:10
    parameterized by
  • 5:10 - 5:14
    n plus one real numbers.
  • 5:14 - 5:18
    When you think about how your hypothesis class is really represented in a computer
  • 5:18 - 5:19
    computers use
  • 5:19 - 5:22
    zero one bits to represent real numbers.
  • 5:22 - 5:23
    And so if you
  • 5:23 - 5:24
    use like
  • 5:24 - 5:29
    a normal standard computer, it normally will represent real numbers by what's
  • 5:29 - 5:31
    called double precision floating point numbers.
  • 5:31 - 5:33
    And what that means is that each real number
  • 5:33 - 5:37
    is represented by a 64-bit representation;
  • 5:37 - 5:38
    right? So really
  • 5:38 - 5:42
    if you know what floating point is in a computer, a 64-bit floating point representation
  • 5:42 - 5:43
    is what almost all of
  • 5:43 - 5:45
    us use routinely.
  • 5:45 - 5:48
    And so this parameterized by d real numbers, that's really
  • 5:48 - 5:51
    as if it's parameterized by 64
  • 5:51 - 5:53
    times d
  • 5:53 - 5:54
    bits.
  • 5:54 - 5:57
    Computers can't represent real numbers. They only represent
  • 5:57 - 5:59
    things using a discrete set of bits.
  • 5:59 - 6:01
    And so the size of
  • 6:01 - 6:03
    your
  • 6:03 - 6:07
    hypothesis class in your computer representation
  • 6:07 - 6:11
    you have 64 times d bits that you can flip.
  • 6:11 - 6:15
    And so the number of possible values for your 64 d bits is
  • 6:15 - 6:18
    really just
  • 6:18 - 6:21
    two to the power of 64 d; okay?
  • 6:21 - 6:23
    Because that's the number of ways you can
  • 6:23 - 6:26
    flip the 64 d bits.
  • 6:26 - 6:29
    And so
  • 6:29 - 6:34
    this is why it's important that we had log k there; right? So k is
  • 6:34 - 6:37
    therefore, two to the 64 d.
  • 6:37 - 6:40
    And if I plug it into this equation over here,
  • 6:40 - 6:43
    what you find is that
  • 6:43 - 6:47
    in order to get this sort of guarantee,
  • 6:47 - 6:51
    it suffices that m
  • 6:51 - 6:53
    is
  • 6:53 - 6:56
    greater than or equal to
  • 6:56 - 6:59
    on the order of one over
  • 6:59 - 7:02
    gamma squared, log of
  • 7:02 - 7:03
    two to the 64 d
  • 7:03 - 7:05
    over delta,
  • 7:05 - 7:12
    which is that; okay? So
  • 7:15 - 7:18
    just to be clear, in order to guarantee that
  • 7:18 - 7:24
    this holds, this is really just the same complexity result as we had before, so the question is:
  • 7:24 - 7:27
    suppose you want a guarantee that the hypothesis returned by empirical risk
  • 7:27 - 7:29
    minimization will
  • 7:29 - 7:34
    have a generalization error that's within two gamma of the best hypothesis in your hypothesis
  • 7:34 - 7:36
    class.
  • 7:36 - 7:39
    Then what this result suggests is that, you know, in order to give that sort of
  • 7:39 - 7:41
    error bound guarantee,
  • 7:41 - 7:43
    it suffices that m is greater than or equal to this.
  • 7:43 - 7:45
    In other words, that
  • 7:45 - 7:47
    your number of training examples has to be
  • 7:47 - 7:52
    on the order of d over gamma squared, times log one over delta. Okay? And the
  • 7:53 - 7:54
    intuition
  • 7:54 - 7:58
    that this conveys is actually, roughly right. This says, that the number of
  • 7:58 - 7:59
    training examples you need
  • 7:59 - 8:01
    is roughly linear
  • 8:01 - 8:05
    in the number of parameters of your hypothesis class. That
  • 8:05 - 8:09
    m has [inaudible] on the order of something linear, [inaudible]. That intuition is actually, roughly right.
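    [To spell out the arithmetic of this admittedly broken argument, plugging k = 2^{64d} into the earlier bound gives:]

\[
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2\cdot 2^{64d}}{\delta}
\;=\; O\!\Big(\frac{1}{\gamma^2}\big(d + \log\tfrac{1}{\delta}\big)\Big)
\;=\; O_{\gamma,\delta}(d).
\]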
  • 8:09 - 8:10
  • 8:10 - 8:12
    I'll say more about
  • 8:17 - 8:21
    this
  • 8:22 - 8:26
    later. This result is clearly, slightly broken, in the sense that it relies on a
  • 8:26 - 8:29
    64-bit representation of floating-point numbers.
  • 8:29 - 8:32
    So let me actually go ahead and outline
  • 8:32 - 8:35
    the "right way" to show this more formally; all
  • 8:35 - 8:39
    right? And it turns out the "right way" to show this more formally
  • 8:39 - 8:43
    involves a much longer argument, because the
  • 8:43 - 8:46
    proof is extremely involved, so I'm just actually going to state the result, and not
  • 8:46 - 8:50
    prove it. This is the core result behind learning theory bounds for infinite hypothesis
  • 8:50 - 8:54
    classes.
  • 8:56 - 9:03
    So here's a definition: given a set s of d points,
  • 9:19 - 9:23
    we say,
  • 9:23 - 9:27
    a hypothesis class h
  • 9:27 - 9:29
    shatters
  • 9:29 - 9:35
    the set s, if
  • 9:35 - 9:42
    h can realize any labeling on it; okay?
  • 9:47 - 9:47
    And what
  • 9:47 - 9:51
    I mean by realizing any labeling on it the informal way of thinking about this
  • 9:51 - 9:53
    is:
  • 9:53 - 9:57
    if a hypothesis class has shattered the set s, what that means is that I can take these
  • 9:57 - 9:59
    d points,
  • 9:59 - 10:02
    and I can associate these d points with any
  • 10:02 - 10:04
    set of labels y; right?
  • 10:04 - 10:09
    So choose any set of labeling y for each of these d
  • 10:09 - 10:10
    points.
  • 10:10 - 10:14
    And if your hypothesis class shatters s, then that means that
  • 10:14 - 10:15
    there will be
  • 10:15 - 10:19
    a hypothesis that labels those d examples perfectly;
  • 10:19 - 10:24
    okay? That's what shattering means. So let me just illustrate those in an example.
  • 10:24 - 10:31
    So let's say h is the class of all linear classifiers
  • 10:31 - 10:36
    in 2-D,
  • 10:36 - 10:38
    and let's say that
  • 10:38 - 10:40
    s is this
  • 10:40 - 10:44
    [inaudible] comprising two points; okay?
  • 10:44 - 10:46
    So there are
  • 10:46 - 10:50
    four possible labelings
  • 10:50 - 10:53
    for these two points. You can choose
  • 10:53 - 10:56
    to label both positive; one positive, one negative; one negative, one positive; or you can
  • 10:56 - 10:59
    label both of them negative.
  • 10:59 - 11:01
    And if
  • 11:01 - 11:05
    the hypothesis class h is the class of all linear classifiers in 2-D, then for each
  • 11:05 - 11:08
    of these training sets, I can sort of find a
  • 11:08 - 11:10
    linear classifier
  • 11:10 - 11:14
    that attains zero training error on
  • 11:14 - 11:18
    each of these, on all possible labelings of this set of two points.
  • 11:18 - 11:20
    And so I'll say
  • 11:20 - 11:23
    that the hypothesis class script h shatters
  • 11:23 - 11:24
    this set s
  • 11:24 - 11:29
    of two points; okay? One
  • 11:29 - 11:36
    more example; let me show
  • 11:37 - 11:39
    you
  • 11:39 - 11:42
    a larger example. Suppose my set s is now
  • 11:42 - 11:45
    this set of three points;
  • 11:45 - 11:52
    right? Then, I now have eight possible labelings for these three points;
  • 12:11 - 12:15
    okay? And so for these three points, I now have eight possible labelings.
  • 12:15 - 12:17
    And once again, I can for each
  • 12:17 - 12:22
    of these labelings, I can find the hypothesis in the hypothesis class
  • 12:22 - 12:23
    that labels
  • 12:23 - 12:26
    these examples correctly.
  • 12:26 - 12:30
    And so once again, I see that by definition, say, that my hypothesis
  • 12:30 - 12:34
    class also shatters this set s. Student:Right. Instructor (Andrew Ng):And
  • 12:34 - 12:38
    then that that terminology h can realize any labeling on s. That's
  • 12:38 - 12:40
    obviously [inaudible].
  • 12:40 - 12:44
    Give it any set of labels and you can find a hypothesis that perfectly
  • 12:44 - 12:48
    separates the positive and negative examples;
  • 12:48 - 12:49
    okay?
  • 12:49 - 12:51
    So
  • 12:51 - 12:55
    how about this
  • 12:55 - 12:57
    set?
  • 12:57 - 13:01
    Suppose s is now this set of four points, then,
  • 13:01 - 13:05
    you know, there are lots of labels. There are now 16 labelings we can choose on this; right?
  • 13:05 - 13:06
    That's one for instance,
  • 13:06 - 13:09
    and this is another one;
  • 13:09 - 13:13
    right? And so I can realize some labelings. But there's no linear decision
  • 13:13 - 13:16
    boundary that can realize this labeling,
  • 13:16 - 13:18
    and so h does not shatter
  • 13:18 - 13:20
    this set of four points; okay?
  • 13:20 - 13:22
    And
  • 13:22 - 13:25
    I'm not really going to prove it here, but it turns out that you can show that
  • 13:25 - 13:30
    in two dimensions, there is no set of four points that the
  • 13:30 - 13:37
    class of all linear classifiers can shatter; okay?
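    [A rough numerical illustration of shattering, not from the lecture; it assumes numpy and scikit-learn are available and uses a large-C linear SVM as a practical separability check rather than a proof:]

```python
# Sketch: check numerically whether the class of linear classifiers in 2-D
# shatters a given set of points, by trying every possible labeling.
import itertools
import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """Return True if some linear classifier realizes every labeling of `points`."""
    points = np.asarray(points, dtype=float)
    for bits in itertools.product([0, 1], repeat=len(points)):
        y = np.array(bits)
        if y.min() == y.max():
            continue  # an all-positive or all-negative labeling is always realizable
        clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard-margin separator
        clf.fit(points, y)
        if (clf.predict(points) != y).any():
            return False  # found a labeling no linear classifier can realize
    return True

print(shatters([(0, 0), (1, 0), (0, 1)]))          # True: three points in general position
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR-style labeling is not realizable
```

    This mirrors the board examples: three points in general position are shattered, while no set of four points in 2-D is.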
  • 13:46 - 13:49
    So here's another definition. When I say that the well,
  • 13:49 - 13:50
    it's called the
  • 13:50 - 13:53
    VC dimension. These
  • 13:53 - 14:00
    two people, Vapnik and Chervonenkis
  • 14:07 - 14:10
    so given a hypothesis class,
  • 14:10 - 14:14
    the Vapnik and Chervonenkis
  • 14:14 - 14:18
    dimension of h, which we usually write as VC of script h,
  • 14:18 - 14:25
    is the size of the
  • 14:34 - 14:37
    largest
  • 14:37 - 14:40
    set that is shattered by h.
  • 14:40 - 14:41
    And
  • 14:41 - 14:45
    if a hypothesis class can shatter arbitrarily large sets, then the VC dimension
  • 14:45 - 14:48
    is infinite.
  • 14:48 - 14:51
    So just as a kind of good example: if h
  • 14:51 - 14:52
    is the class of all linear
  • 14:52 - 14:58
    classifiers
  • 14:58 - 15:03
    in 2-D, then the
  • 15:03 - 15:04
    VC dimension of this class
  • 15:04 - 15:05
    is equal to
  • 15:05 - 15:09
    three because we saw just now that there is a size of
  • 15:09 - 15:13
    there was a set s of size three that it could shatter,
  • 15:13 - 15:17
    and I won't really prove it, but it turns out there is no set of size four
  • 15:17 - 15:18
    that it can
  • 15:18 - 15:20
    shatter. And therefore,
  • 15:20 - 15:21
    the VC dimension of this is three. Yeah? Student:But there are sets of size three that it cannot shatter; right? [Inaudible] was your
  • 15:21 - 15:23
    point. Instructor
  • 15:23 - 15:28
    (Andrew
  • 15:28 - 15:30
    Ng):Yes, absolutely. So it turns out that
  • 15:30 - 15:32
    if
  • 15:32 - 15:36
    I choose a set like this it's actually set s,
  • 15:36 - 15:39
    then there are labelings on this that it cannot realize. And so,
  • 15:39 - 15:41
    h cannot shatter this set.
  • 15:41 - 15:44
    But that's okay because
  • 15:44 - 15:47
    there exists some other set of size three that can be shattered. So the VC
  • 15:47 - 15:48
    dimension is
  • 15:48 - 15:55
    three. And then there is no set of size four that can shatter. Yeah? Student:[Inaudible]. Instructor
  • 15:56 - 15:58
    (Andrew Ng):Not according to this definition. No.
  • 15:58 - 16:01
    Right. So again, let's see,
  • 16:01 - 16:04
    I can choose my set s to be
  • 16:04 - 16:08
    to be a set of three points that are all overlapping. Three points in exactly
  • 16:08 - 16:10
    the same place.
  • 16:10 - 16:12
    And clearly, I can't shatter this set,
  • 16:12 - 16:15
    but that's okay. And I can't shatter this set, either,
  • 16:15 - 16:19
    but that's okay because there are some other sets of size three that I can
  • 16:19 - 16:25
    shatter.
  • 16:25 - 16:28
    And it turns out this result holds true
  • 16:28 - 16:32
    more generally,
  • 16:32 - 16:34
  • 16:34 - 16:41
    in
  • 16:41 - 16:48
    n dimensions the VC dimension of the class of linear classifiers in
  • 16:51 - 16:54
    n dimensions is equal to n plus one.
  • 16:54 - 16:59
    Okay? So this is in [inaudible], and if you have linear classifiers over an n-dimensional feature space, the VC
  • 16:59 - 17:02
    dimension of linear classifiers in n dimensions
  • 17:02 - 17:04
    is equal to n plus one.
  • 17:04 - 17:11
  • 17:12 - 17:14
    So maybe you wanna write it down:
  • 17:14 - 17:16
  • 17:16 - 17:21
    what is arguably the best-known result in all of learning theory, I guess;
  • 17:21 - 17:26
  • 17:26 - 17:29
    which is that.
  • 17:29 - 17:36
    Let a hypothesis class be given,
  • 17:38 - 17:41
    and let the VC dimension of h be
  • 17:41 - 17:46
    equal to d. Then we have,
  • 17:46 - 17:48
    with probability at least
  • 17:48 - 17:50
    one minus delta.
  • 17:50 - 17:57
    We have that
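    [The board formula here is not captured in the transcript; in the usual statement of this uniform convergence theorem, what follows "we have that" is, for all h in the hypothesis class H:]

\[
\big|\varepsilon(h) - \hat{\varepsilon}(h)\big|
\;\le\;
O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} \;+\; \frac{1}{m}\log\frac{1}{\delta}}\right).
\]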
  • 18:32 - 18:35
    the formula on the right looks a bit complicated, but don't worry about it. I'll point
  • 18:35 - 18:37
    out the essential aspects of it later.
  • 18:37 - 18:41
    But the key to this result is that if you have a hypothesis class with VC dimension d, and
  • 18:41 - 18:45
    now this can be an infinite hypothesis
  • 18:45 - 18:47
    class,
  • 18:47 - 18:49
    what
  • 18:49 - 18:51
    Vapnik and Chervonenkis show is that
  • 18:51 - 18:54
    with probability at least one minus delta,
  • 18:54 - 18:55
    You enjoy this
  • 18:55 - 18:58
    sort of uniform convergence result;
  • 18:58 - 19:00
    okay? We have
  • 19:00 - 19:03
  • 19:03 - 19:07
    that for all hypotheses h that for all the hypotheses in your
  • 19:07 - 19:08
    hypothesis class,
  • 19:08 - 19:13
    you have that the generalization error of h
  • 19:13 - 19:14
    minus
  • 19:14 - 19:19
    the training error of h. So the difference between these two things is bounded above
  • 19:19 - 19:24
    by some complicated formula like this; okay?
  • 19:24 - 19:30
    And thus,
  • 19:30 - 19:37
    with probability one minus delta, we also have the
  • 19:45 - 19:52
    same thing; okay?
  • 19:57 - 20:01
    And going from this step to this step; right? Going from this step
  • 20:01 - 20:03
    to this step is actually something that you saw yourself;
  • 20:03 - 20:06
    that we actually proved earlier. Because
  • 20:06 - 20:09
    you remember, in the previous lecture we proved that if you have uniform
  • 20:09 - 20:11
    convergence,
  • 20:11 - 20:13
    then that implies that
  • 20:13 - 20:18
    it appears actually that we showed that if generalization error and training error are close to
  • 20:18 - 20:21
    each other; within gamma of each other,
  • 20:21 - 20:24
    then the generalization error of the hypotheses you pick will be within
  • 20:24 - 20:26
    two gamma
  • 20:26 - 20:28
    of the best generalization error.
  • 20:28 - 20:30
    So this is really
  • 20:30 - 20:31
    generalization error of
  • 20:31 - 20:35
    h [inaudible] best possible generalization error plus two times gamma. And just the
  • 20:35 - 20:37
    two constants in front here
  • 20:37 - 20:44
    that I've absorbed into the big-O notation.
  • 20:44 - 20:47
    So that formula is slightly more complicated. Let
  • 20:47 - 20:54
    me just rewrite this as a corollary, which is that
  • 20:55 - 20:59
    in order to guarantee that
  • 20:59 - 21:04
    this holds,
  • 21:04 - 21:09
    with probability of one minus delta.
  • 21:09 - 21:11
    With probability at least one minus delta, I should say. It
  • 21:11 - 21:17
    suffices
  • 21:17 - 21:20
    that I'm gonna write this this way: I'm gonna write m equals
  • 21:20 - 21:22
    big-O of
  • 21:22 - 21:23
    d,
  • 21:23 - 21:26
    and I'm going to put
  • 21:26 - 21:33
    gamma and delta in as a subscript there to denote that.
  • 21:33 - 21:37
    Let's see, if we treat gamma and delta as constants, that allows me to absorb terms that
  • 21:37 - 21:40
    depend on gamma and delta into the big-O notation,
  • 21:40 - 21:42
    then
  • 21:42 - 21:45
    in order to guarantee this holds, it suffices that m
  • 21:45 - 21:46
    is on the order of the
  • 21:46 - 21:53
    VC dimension of the hypothesis class; okay?
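    [Written out, the corollary being stated is roughly:]

\[
\varepsilon(\hat{h}) \;\le\; \min_{h \in \mathcal{H}} \varepsilon(h) + 2\gamma
\quad\text{holds with probability at least } 1-\delta
\quad\text{provided}\quad m = O_{\gamma,\delta}(d), \text{ where } d = \mathrm{VC}(\mathcal{H}).
\]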
  • 21:53 - 21:56
    So
  • 21:56 - 21:59
    let's see.
  • 21:59 - 22:02
    So what we conclude from this is that
  • 22:02 - 22:06
    if you have a learning algorithm that tries to for empirical risk minimization
  • 22:06 - 22:09
    algorithms in other words, less formally,
  • 22:09 - 22:13
    for learning algorithms that try to minimize training error the
  • 22:13 - 22:15
    intuition to take away from this is that
  • 22:15 - 22:18
    the number of training examples you need is therefore,
  • 22:18 - 22:19
    roughly, linear
  • 22:19 - 22:24
    in the VC dimension of the hypotheses class.
  • 22:24 - 22:28
    And more formally, this shows that sample complexity
  • 22:28 - 22:30
    is upper bounded
  • 22:30 - 22:34
    by the VC dimension; okay?
  • 22:34 - 22:38
    It turns out that for most reasonable hypothesis classes, it turns out that
  • 22:38 - 22:41
    the VC dimension is sort
  • 22:41 - 22:42
  • 22:42 - 22:45
    of very similar, I guess, to
  • 22:45 - 22:47
    the number of parameters of your model.
  • 22:47 - 22:49
    So for example,
  • 22:49 - 22:52
    say you model with logistic regression, a linear classifier.
  • 22:52 - 22:56
    In n dimensions, logistic regression has n plus
  • 22:56 - 22:57
    one parameters.
  • 22:57 - 23:02
    And the VC dimension of the class of linear classifiers is also n plus one.
  • 23:02 - 23:04
    So it turns out that
  • 23:04 - 23:07
    for most reasonable hypothesis classes, the
  • 23:07 - 23:11
    VC dimension is usually linear in the number of parameters of your model.
  • 23:11 - 23:15
    or, at most, some low-order polynomial
  • 23:15 - 23:17
    in the number of parameters of your model.
  • 23:17 - 23:19
    And so this
  • 23:19 - 23:21
    the takeaway intuition from this is that
  • 23:21 - 23:25
    the number of training examples you need to fit those models is going to be, let's say,
  • 23:25 - 23:30
    roughly, linear in the number of parameters in your model; okay?
  • 23:30 - 23:33
    There are some somewhat strange examples where what I just said is not true. There are some
  • 23:33 - 23:37
    strange examples where you have very few parameters, but the VC dimension is
  • 23:37 - 23:37
    enormous.
  • 23:37 - 23:39
    But actually,
  • 23:39 - 23:42
    all of the examples I know of that fall into that regime are somewhat
  • 23:42 - 23:44
    strange and
  • 23:44 - 23:45
    degenerate. So somewhat unusual, and
  • 23:45 - 23:50
    not the sort of learning algorithms you usually use.
  • 23:50 - 23:52
    Let's see,
  • 23:52 - 23:55
    just other things. It turns out that
  • 23:55 - 23:58
    so this result shows the sample complexity is upper bounded by VC
  • 23:58 - 24:01
    dimension. But if you have a number of training examples that
  • 24:01 - 24:05
    are on the order of the VC dimension, then you find
  • 24:05 - 24:08
    it turns out that in the worst case
  • 24:08 - 24:10
  • 24:10 - 24:12
  • 24:12 - 24:16
    sample complexity is also lower bounded by the VC dimension.
  • 24:16 - 24:20
    And what that means is that if you have a perfectly nasty learning
  • 24:20 - 24:23
    problem, say, then if
  • 24:23 - 24:27
    the number of training examples you have is less than on the order of the VC
  • 24:27 - 24:27
    dimension;
  • 24:27 - 24:30
    then
  • 24:30 - 24:34
    it is not possible to prove this bound. So I guess in the worst case,
  • 24:34 - 24:37
    sample complexity, the number of training examples you need, is upper bounded and lower bounded
  • 24:37 - 24:42
    by the VC dimension. Let's
  • 24:42 - 24:48
    see,
  • 24:48 - 24:55
    questions about this? Student:Does
  • 24:56 - 25:00
  • 25:00 - 25:00
    the
  • 25:00 - 25:05
    proof of this assume any sort of finiteness, like finite [inaudible], like you have to just [inaudible] real numbers and [inaudible]? Instructor (Andrew Ng):Let's see. The proof does not, no. I've actually stated the
  • 25:05 - 25:07
    entirety of the theorem. This is true. It
  • 25:07 - 25:10
    turns out in the proof well,
  • 25:10 - 25:13
    somewhere in the proof there's a construction called an
  • 25:13 - 25:16
    epsilon net, which is a very clever [inaudible]. It' sort of in regardless of the
  • 25:16 - 25:20
    proof, it is not an assumption that you need.
  • 25:20 - 25:24
    In someway that sort of proof that's one-step that uses a very
  • 25:24 - 25:27
    clever [inaudible] to prove
  • 25:27 - 25:32
    this. But that's not needed; it's an assumption. I'd like to say, back when
  • 25:32 - 25:36
    I was a Ph.D. student, when I was working through this proof,
  • 25:36 - 25:38
    there was sort of a solid week where I
  • 25:38 - 25:42
    would wake up, and go to the office at 9:00 a.m. Then I'd start reading
  • 25:42 - 25:45
    the book that led up to this proof. And then
  • 25:45 - 25:49
    I'd read from 9:00 a.m. to 6:00 p.m. And then I'd go home, and then the next day, I'd pick up where I left
  • 25:49 - 25:52
    off. And it sort of took me a whole week that way,
  • 25:52 - 25:53
    to understand this proof, so
  • 25:53 - 26:00
    I thought I would inflict that on you.
  • 26:03 - 26:07
    Just to tie a couple of loose ends:
  • 26:07 - 26:13
    what I'm about to do is, I'm about to just mention a few things that will
  • 26:13 - 26:18
    maybe, feel a little bit like random facts. But I'm just gonna tie up just a couple of loose ends.
  • 26:18 - 26:21
    And so
  • 26:21 - 26:24
    let's see,
  • 26:24 - 26:27
    it turns out that
  • 26:27 - 26:30
    just so this will resonate more strongly with you: this bound was
  • 26:30 - 26:31
    proved
  • 26:31 - 26:35
    for an algorithm that uses empirical risk minimization, for an algorithm that
  • 26:35 - 26:36
    minimizes
  • 26:36 - 26:40
    0-1 training error. So
  • 26:40 - 26:46
    one
  • 26:46 - 26:49
    question that some of
  • 26:49 - 26:52
    you ask is
  • 26:52 - 26:55
    how about support vector machines; right? How come SVM's don't over
  • 26:55 - 26:57
    fit?
  • 26:57 - 27:01
    And remember, in our discussion on support vector machines, we said that
  • 27:01 - 27:06
    you use kernels, and map the features to an infinite dimensional feature space.
  • 27:06 - 27:10
    And so it seems like the VC dimension should be infinite; n plus
  • 27:10 - 27:12
    one and n is infinite.
  • 27:12 - 27:15
    So it turns out that
  • 27:15 - 27:20
    the class of linear separators with large margin actually has low VC dimension.
  • 27:20 - 27:23
    I wanna say this very quickly, and informally.
  • 27:23 - 27:28
    It's actually, not very important for you to understand the details, but I'm going
  • 27:28 - 27:32
    to say it very informally. It turns out that I will give you a set of points.
  • 27:32 - 27:33
    And
  • 27:33 - 27:38
    if I ask you to consider only the class of lines that separate these points with a
  • 27:38 - 27:39
    large margin [inaudible],
  • 27:39 - 27:40
    so
  • 27:40 - 27:44
    my hypothesis class will comprise only
  • 27:44 - 27:46
    the linear decision boundaries
  • 27:46 - 27:49
    that separate the points with a large margin. Say with a
  • 27:49 - 27:50
    margin,
  • 27:50 - 27:52
    at least gamma; okay.
  • 27:52 - 27:53
    And so
  • 27:53 - 27:56
    I won't allow a point that comes closer. Like,
  • 27:56 - 28:00
    I won't allow that line because it comes too close to one of my points.
  • 28:00 - 28:04
    It turns out that
  • 28:04 - 28:11
    if I consider my
  • 28:11 - 28:15
    data points all
  • 28:15 - 28:19
    lie within some sphere of radius r,
  • 28:19 - 28:23
    and if I consider only the class of linear separators that separate the data
  • 28:23 - 28:27
    with a margin of at least gamma,
  • 28:27 - 28:30
    then the VC dimension of this
  • 28:30 - 28:34
    class is less than or equal to r squared over four
  • 28:34 - 28:36
    gamma squared
  • 28:36 - 28:38
    plus one; okay?
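    [In symbols, the bound being described, with r the radius of the sphere containing the data and gamma the margin, is:]

\[
\mathrm{VC}(\mathcal{H}_{\text{margin-}\gamma}) \;\le\; \left\lceil \frac{r^2}{4\gamma^2} \right\rceil + 1.
\]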
  • 28:38 - 28:40
    So
  • 28:40 - 28:43
    this funny symbol here, that just means rounding up.
  • 28:43 - 28:47
    This is a ceiling symbol; it means rounding up x.
  • 28:47 - 28:48
    And it
  • 28:48 - 28:51
    turns out you can prove and there are some strange things
  • 28:51 - 28:53
    about this result that I'm deliberately not
  • 28:53 - 28:55
    gonna talk about but it turns out you
  • 28:55 - 28:59
    can prove that the VC dimension of the class of linear classifiers with large
  • 28:59 - 29:01
    margins is actually bounded.
  • 29:01 - 29:03
    The surprising thing about this is that
  • 29:03 - 29:06
    this is a bound on the VC dimension that has no dependence
  • 29:06 - 29:08
    on the dimension of the points x. So
  • 29:08 - 29:09
    in other words,
  • 29:09 - 29:13
    your data points x can be in an infinite dimensional space,
  • 29:13 - 29:16
    but so long as you restrict attention to the class of
  • 29:16 - 29:18
    linear separators with large margin, the
  • 29:18 - 29:20
    VC dimension is bounded.
  • 29:20 - 29:22
    And so
  • 29:22 - 29:26
    in trying to find a large margin separator
  • 29:26 - 29:29
    in trying to find the line that separates your positive and
  • 29:29 - 29:30
    your negative examples
  • 29:30 - 29:32
    with large margin,
  • 29:32 - 29:35
    it turns out therefore, that
  • 29:35 - 29:38
    the support vector machine is automatically trying to find
  • 29:38 - 29:40
    a hypothesis class
  • 29:40 - 29:42
    with small VC dimension. And therefore, it does not over fit. Alex? Student:What is
  • 29:42 - 29:49
    the [inaudible]? Instructor (Andrew
  • 29:49 - 29:50
    Ng):It
  • 29:50 - 29:52
    is actually defined the
  • 29:52 - 29:56
    same way as in finite dimensional spaces. So you
  • 29:56 - 29:57
    know, suppose you have infinite actually,
  • 29:57 - 30:01
    these are constantly infinite dimensional vectors; not [inaudible] to the infinite
  • 30:01 - 30:04
    dimensional
  • 30:04 - 30:06
    vectors. Normally, the norm of x, the two norm,
  • 30:06 - 30:07
    squared
  • 30:07 - 30:10
    squared is equal to the sum from i equals 1 to infinity of
  • 30:10 - 30:12
    xi squared, so if x
  • 30:12 - 30:17
    is infinite dimensional, you just define it like
  • 30:17 - 30:20
    that. [Inaudible]. [Crosstalk] [Inaudible]. Now, say that again.
  • 30:20 - 30:23
    [Inaudible].
  • 30:23 - 30:27
    Yes. Although, I assume that this is bounded by r. Oh. It's
  • 30:27 - 30:31
    a yeah so this ensures that it converges.
  • 30:31 - 30:32
  • 30:32 - 30:35
    So just
  • 30:35 - 30:38
    something people sometimes wonder about. And
  • 30:38 - 30:40
    lastly, to actually
  • 30:40 - 30:44
    tie empirical risk minimization back a little more strongly to the sorts of algorithms
  • 30:44 - 30:47
    we've talked about.
  • 30:47 - 30:48
    It turns out
  • 30:48 - 30:55
    that
  • 31:10 - 31:15
    so the theory so far was really for empirical risk minimization. So with that in mind,
  • 31:15 - 31:18
  • 31:18 - 31:21
    so we focus on just one training example.
  • 31:21 - 31:23
    Let me draw a
  • 31:23 - 31:24
    function, you know, a
  • 31:24 - 31:25
    zero here jumps
  • 31:25 - 31:26
    to one,
  • 31:26 - 31:28
    and it looks like that.
  • 31:28 - 31:30
    And so this for
  • 31:30 - 31:32
    once, this training example,
  • 31:32 - 31:36
    this may be
  • 31:36 - 31:43
    indicator
  • 31:43 - 31:47
    h where [inaudible] is d
  • 31:47 - 31:49
    equals theta transpose x;
  • 31:49 - 31:53
    okay? But one
  • 31:53 - 31:56
    training example your training example will be positive or negative. And
  • 31:56 - 31:59
    depending on what the value of this theta transpose x is, you either get it right
  • 31:59 - 32:02
    or wrong. And so
  • 32:02 - 32:06
    you know, I guess if your training example if you have a positive example,
  • 32:06 - 32:08
    then when z is positive,
  • 32:08 - 32:10
    you get it
  • 32:10 - 32:11
    right.
  • 32:11 - 32:18
  • 32:19 - 32:23
    Suppose you have a negative example, so y equals 0; right?
  • 32:23 - 32:27
    Then if z, which is theta transpose x if this is positive,
  • 32:27 - 32:30
    then you will get this example wrong;
  • 32:30 - 32:33
    whereas, if z is negative then you'd get this example right.
  • 32:33 - 32:37
    And so this is a plot of indicator h subscript [inaudible] x not equals y; okay? You
  • 32:37 - 32:43
    know, it's equal
  • 32:43 - 32:50
    to g of theta transpose
  • 32:50 - 32:51
  • 32:51 - 32:54
    x; okay? And so it turns out that
  • 32:54 - 32:57
    so what you'd really like to do is choose parameters theta so as to
  • 32:57 - 33:00
    minimize this step function; right?
  • 33:00 - 33:02
    You'd like to choose parameters theta, so that
  • 33:02 - 33:05
    you end up with a
  • 33:05 - 33:08
    correct classification on setting your training example, and so you'd like
  • 33:08 - 33:09
    indicator h
  • 33:09 - 33:11
    of x not equal y.
  • 33:11 - 33:15
    You'd like this indicator function to be 0. It
  • 33:15 - 33:19
    turns out this step function is clearly a non-convex function. And so
  • 33:19 - 33:22
    it turns out that even just for linear classifiers
  • 33:22 - 33:25
    minimizing the training error is an NP-hard problem. It
  • 33:25 - 33:30
    turns out that both logistic regression, and support vector machines can be viewed as
  • 33:30 - 33:31
    using a convex
  • 33:31 - 33:33
    approximation for this problem.
  • 33:33 - 33:36
    And in particular
  • 33:36 - 33:40
    and draw a function like that
  • 33:41 - 33:48
    it turns out that
  • 33:51 - 33:55
    logistic regression is trying to maximize likelihood. And so it's trying to minimize
  • 33:55 - 33:57
    the minus of the log likelihood.
  • 33:57 - 34:00
    And if you plot the minus of the log likelihood, it actually turns out it'll be a function that
  • 34:00 - 34:03
    looks like this. And
  • 34:03 - 34:07
    this line that I just drew, you can think of it as a rough approximation to this step function; which is
  • 34:07 - 34:11
    maybe what you're really trying to minimize,
  • 34:11 - 34:15
    so you want to minimize training error. So you can actually think of logistic regression as trying to approximate
  • 34:15 - 34:16
    empirical risk minimization.
  • 34:16 - 34:20
    Where instead of using this step function, which is non-convex, and gives you a hard
  • 34:20 - 34:21
    optimization problem, it
  • 34:21 - 34:25
    uses this line above this curve above. So approximate it,
  • 34:25 - 34:28
    so you have a convex optimization problem you can
  • 34:28 - 34:30
    find the
  • 34:30 - 34:34
    maximum likelihood estimate of the parameters for logistic regression.
  • 34:34 - 34:39
    And it turns out, the support vector machine also can be viewed as approximating this function,
  • 34:39 - 34:41
    too, only a little bit differently. Let's
  • 34:41 - 34:45
    see, support vector machine turns out, can be viewed as trying to approximate this
  • 34:45 - 34:46
    step function too,
  • 34:46 - 34:48
    over different
  • 34:48 - 34:51
    approximation that's linear, and then
  • 34:51 - 34:54
    that sort of [inaudible] linear that
  • 34:54 - 34:56
    our results goes this [inaudible] there, and then it goes up as a
  • 34:56 - 34:58
    linear function there. And that's that
  • 34:58 - 35:00
    is called the hinge loss. And so you
  • 35:00 - 35:04
    can think of logistic regression and the support vector machine as
  • 35:04 - 35:07
    different approximations to try to minimize this
  • 35:07 - 35:11
    step function;
  • 35:11 - 35:12
    okay?
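    [A small sketch, not from the lecture and assuming numpy, comparing the 0-1 loss with the two convex surrogates just described, written as functions of the margin z = y * theta^T x with labels y in {-1, +1}:]

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)     # the step function: 1 if the example is misclassified

def logistic_loss(z):
    return np.log1p(np.exp(-z))       # minus the log likelihood for logistic regression (up to scaling)

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)   # the SVM's hinge loss

z = np.linspace(-2, 2, 5)
for name, loss in [("0-1", zero_one_loss), ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, np.round(loss(z), 2))
```

    Both surrogates upper-bound the step function and are convex, which is why the resulting optimization problems are tractable.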
  • 35:12 - 35:13
    And that's why I guess,
  • 35:13 - 35:16
    all the theory we developed
  • 35:16 - 35:19
    even though SVM's and logistic regression aren't exactly doing
  • 35:19 - 35:22
    empirical risk minimization,
  • 35:22 - 35:25
    the theory we develop often gives the
  • 35:25 - 35:32
    completely appropriate intuitions for SVM's, and logistic regression; okay.
  • 35:33 - 35:36
    So that was the last of the loose ends.
  • 35:36 - 35:39
    And if you didn't get this, don't worry too much about it. It's a high-level message. It's
  • 35:39 - 35:40
    just that
  • 35:40 - 35:44
    SVM's and logistic regression are reasonable to think of as approximations to
  • 35:44 - 35:48
    empirical risk minimization algorithms.
  • 35:48 - 35:52
    What I want to do next is move on to talk about model selection. Before I do that, let me just
  • 35:52 - 35:59
    check for questions about
  • 36:49 - 36:53
    this. Okay. Cool. Okay. So in the theory that we started to develop in the previous lecture, and that
  • 36:53 - 36:53
    we
  • 36:53 - 36:55
    sort of wrapped up with a
  • 36:55 - 36:57
    discussion on VC dimension,
  • 36:57 - 37:01
    we saw that there's often a trade-off between bias and variance. And in
  • 37:01 - 37:02
    particular, so
  • 37:02 - 37:07
    it is important not to choose a hypothesis that's either too simple or too
  • 37:07 - 37:08
    complex. So
  • 37:08 - 37:11
    if your data has sort of a quadratic structure to it,
  • 37:11 - 37:14
    then if you choose a linear
  • 37:14 - 37:18
    function to try to approximate it, then you would under fit. So you have a
  • 37:18 - 37:20
    hypothesis with high bias.
  • 37:20 - 37:24
    And conversely, we choose a hypothesis that's too complex, and you have high variance.
  • 37:24 - 37:26
    And you'll also
  • 37:26 - 37:32
    fail to fit. Then you would over fit the data, and you'd also fail to generalize well.
  • 37:32 - 37:36
    So model selection algorithms
  • 37:36 - 37:40
    provide a class of methods to automatically make these tradeoffs
  • 37:40 - 37:42
    between bias
  • 37:42 - 37:43
    and variance; right? So remember the
  • 37:43 - 37:45
    cartoon I drew last time
  • 37:45 - 37:52
    of generalization error?
  • 37:54 - 37:57
    I drew this last time. Where on the x-axis
  • 37:57 - 38:03
    was model complexity, meaning the number of
  • 38:03 - 38:08
    the degree of the polynomial; the [inaudible] regression function
  • 38:08 - 38:09
    or whatever.
  • 38:09 - 38:12
    And if you have too simple a model, you have high generalization error, those under
  • 38:12 - 38:14
    fitting.
  • 38:14 - 38:16
    And if you have too complex a model,
  • 38:16 - 38:20
    like a 15th or 14th-degree polynomial fit to five data points, then you also have high
  • 38:20 - 38:24
    generalization error, and you're over fitting.
  • 38:24 - 38:29
    So what I wanna do now is actually just talk about model selection in the abstract;
  • 38:29 - 38:30
    all right?
  • 38:30 - 38:33
    Some examples of model selection problems will include
  • 38:33 - 38:34
    well, I'll run the example
  • 38:34 - 38:36
    of
  • 38:36 - 38:37
    let's say you're
  • 38:37 - 38:44
    trying to choose the degree of a polynomial;
  • 38:45 - 38:49
    right? What degree polynomial do you want to choose?
  • 38:49 - 38:52
    Another example of a model selection problem would be if you're trying to choose
  • 38:52 - 38:54
    the parameter [inaudible],
  • 38:54 - 39:00
    which was the bandwidth parameter
  • 39:00 - 39:04
    in locally weighted linear regression
  • 39:04 - 39:08
    or in some sort of locally weighted
  • 39:08 - 39:08
    regression.
  • 39:08 - 39:10
    Yet, another model selection problem
  • 39:10 - 39:14
    is if you're trying to choose the parameter c [inaudible] and as the
  • 39:14 - 39:16
    [inaudible];
  • 39:16 - 39:22
    right? And so in the l1 norm soft margin SVM,
  • 39:22 - 39:29
    we had this optimization objective; right?
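    [For reference, the l1-norm soft margin SVM objective from the earlier lectures, which is what's being referred to here, is:]

\[
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\xi_i
\qquad\text{s.t.}\quad y^{(i)}\big(w^{\top}x^{(i)} + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0.
\]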
  • 39:29 - 39:31
    And the parameter c
  • 39:31 - 39:33
    controls the tradeoff between
  • 39:33 - 39:34
    how much you want
  • 39:34 - 39:38
    a large margin versus how much you want to penalize
  • 39:38 - 39:40
  • 39:40 - 39:44
    misclassified examples. So these are three specific examples of model selection
  • 39:44 - 39:45
    problems.
  • 39:45 - 39:48
    And
  • 39:48 - 39:55
    let's come up with a method for automatically choosing among them.
  • 40:14 - 40:18
    Let's say you have some finite set of models, and let's write these as m1, m2, m3,
  • 40:18 - 40:20
    and
  • 40:20 - 40:25
    so on. For example, this may be the linear
  • 40:25 - 40:30
    classifier or this may be the quadratic classifier,
  • 40:30 - 40:30
    and so on;
  • 40:30 - 40:32
    okay? Or this
  • 40:32 - 40:34
    may also be
  • 40:34 - 40:37
    you may also take the bandwidth parameter [inaudible]
  • 40:37 - 40:41
    and discretize it into a range of values, and you're trying to choose from this
  • 40:41 - 40:43
    discrete set of values.
  • 40:43 - 40:48
    So let's talk about how you would select an appropriate model; all right? Well,
  • 40:48 - 40:52
    one thing you could do is you can pick all of these models, and train them on
  • 40:52 - 40:54
    your training set.
  • 40:54 - 40:57
    And then see which model has the lowest training
  • 40:57 - 41:04
    error. So that's a terrible idea, and why's that?
  • 41:05 - 41:11
    Right. Cool. Because of over fitting; right. And that's why some of you are laughing that
  • 41:11 - 41:14
    I asked that. So that'd be a terrible idea to choose a model
  • 41:14 - 41:17
    by looking at your training set because well, obviously, you end up choosing the most
  • 41:17 - 41:19
    complex model; right? And you
  • 41:19 - 41:26
    choose a 10th degree polynomial because that's what fits the training set. So we
  • 41:26 - 41:29
    come to model selection. There are
  • 41:29 - 41:32
    several standard procedures to do this. One is hold out cross
  • 41:32 - 41:39
    validation,
  • 41:39 - 41:41
    and in hold out cross validation,
  • 41:41 - 41:46
    we take a training set. And we randomly split
  • 41:46 - 41:51
    the training set into two subsets. We
  • 41:51 - 41:51
    call it
  • 41:51 - 41:54
    subset take all the data you have
  • 41:54 - 42:01
    and randomly split it into two subsets. And we'll call it the training set, and the
  • 42:01 - 42:05
    hold out cross validation subset.
  • 42:05 - 42:07
    And then,
  • 42:07 - 42:12
    you know, you train each model
  • 42:12 - 42:19
    on just the training subset of it, and test it
  • 42:19 - 42:22
    on your hold out cross validation set.
  • 42:22 - 42:25
    And you pick the model
  • 42:25 - 42:32
    with the lowest error
  • 42:33 - 42:35
    on the hold out cross validation
  • 42:35 - 42:37
    subset; okay? So this is sort of a
  • 42:37 - 42:40
    relatively straightforward procedure, and it's commonly used
  • 42:40 - 42:42
    where you train on 70 percent of the data.
  • 42:42 - 42:45
    Then test all of your models on the 30 percent, and you can take
  • 42:45 - 42:46
    whatever has the smallest
  • 42:46 - 42:51
    hold out cross validation error.
  • 42:51 - 42:55
    And after this you actually have a choice. You can actually
  • 42:55 - 42:59
    having taken all of these hypotheses trained on 70 percent of the data,
  • 42:59 - 43:03
    you can actually just output the hypothesis that has the lowest error on your hold out
  • 43:03 - 43:04
    cross validation set.
  • 43:04 - 43:09
    And optionally, you can actually take the model that you selected
  • 43:09 - 43:13
    and go back, and retrain it on all 100 percent of the data;
  • 43:13 - 43:16
    okay? So both versions are actually done and used really often. You can
  • 43:16 - 43:16
    either,
  • 43:16 - 43:20
    you know, just take the best hypothesis that was trained on 70 percent of the data,
  • 43:20 - 43:22
    and just output that as your final hypothesis
  • 43:22 - 43:24
    or you can use this to
  • 43:24 - 43:27
    say, having chosen the degree of the polynomial you want to fit, you can then go
  • 43:27 - 43:31
    back and retrain the model on the entire 100 percent of your data.
  • 43:31 - 43:38
    And both of these are commonly done.
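    [A minimal sketch of hold out cross validation for choosing a polynomial degree; not from the lecture, and it assumes numpy with x and y as equal-length 1-D numpy arrays:]

```python
import numpy as np

def holdout_select_degree(x, y, degrees, train_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    split = int(train_frac * len(x))
    train_idx, cv_idx = idx[:split], idx[split:]
    best_degree, best_err = None, np.inf
    for d in degrees:
        coeffs = np.polyfit(x[train_idx], y[train_idx], d)               # train on ~70% of the data
        err = np.mean((np.polyval(coeffs, x[cv_idx]) - y[cv_idx]) ** 2)  # error on the ~30% hold out set
        if err < best_err:
            best_degree, best_err = d, err
    final_coeffs = np.polyfit(x, y, best_degree)  # optionally retrain the chosen model on all the data
    return best_degree, final_coeffs
```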
  • 43:40 - 43:47
  • 43:50 - 43:54
    So hold out cross validation does sort of work straightforwardly.
  • 43:54 - 43:56
    And
  • 43:56 - 43:58
    sometimes we're working
  • 43:58 - 43:59
  • 43:59 - 44:04
    with a company or application or something. In
  • 44:04 - 44:06
    many machine-learning applications we have
  • 44:06 - 44:08
    very little data or where, you
  • 44:08 - 44:10
    know, every training example you have was
  • 44:10 - 44:15
    painfully acquired at great cost; right? Sometimes your data is
  • 44:15 - 44:16
    acquired by
  • 44:16 - 44:19
    medical experiments, and each of these each
  • 44:19 - 44:24
    training example represents a significant amount of physical human pain or something. So
  • 44:24 - 44:27
    we talk and say, well, I'm going to hold out 30 percent of your data set, just to select
  • 44:27 - 44:28
    my model.
  • 44:28 - 44:33
    sometimes that causes unhappiness, and so maybe you wanna
  • 44:33 - 44:37
    not have to leave out 30 percent of your data just to do model
  • 44:37 - 44:39
    selection.
  • 44:39 - 44:43
    So there are a couple of other variations on hold out cross validation that make
  • 44:43 - 44:48
    sometimes, slightly more efficient use of the data.
  • 44:48 - 44:52
    And one is called k-fold cross validation.
  • 44:52 - 44:54
    And here's the idea: I'm gonna
  • 44:54 - 44:56
    take all of
  • 44:56 - 44:57
    my data s;
  • 44:57 - 45:01
    so imagine, I'm gonna draw this
  • 45:01 - 45:04
    box s, to denote the entirety of all the data I have.
  • 45:04 - 45:07
    And I'll then divide it
  • 45:07 - 45:13
    into k pieces, and this is five pieces in what I've drawn.
  • 45:13 - 45:17
    Then what'll I'll do is I will repeatedly
  • 45:17 - 45:23
    train on k minus one pieces.
  • 45:23 - 45:28
    Test on the remaining one
  • 45:28 - 45:31
    test on the remaining piece, I guess;
  • 45:31 - 45:36
    right? And then you average
  • 45:36 - 45:41
    over the k result.
  • 45:41 - 45:44
    So another way, we'll just hold out I will
  • 45:44 - 45:48
    hold
  • 45:48 - 45:51
    out say, just 1/5 of my data
  • 45:51 - 45:55
    and I'll train on the remaining 4/5, and I'll test on the first one.
  • 45:55 - 45:58
    And then I'll then go and hold out the second 1/5 from my [inaudible] for the
  • 45:58 - 46:03
    remaining pieces test on this, you remove the third piece,
  • 46:03 - 46:06
    train on the 4/5; I'm gonna do this five times.
  • 46:06 - 46:09
    And then I'll take the five error measures I
  • 46:09 - 46:11
    have and I'll
  • 46:11 - 46:16
    average them. And this then gives me an estimate of the generalization error of my model; okay?
  • 46:16 - 46:18
    And then, again, when you do
  • 46:18 - 46:20
    k-fold cross validation,
  • 46:20 - 46:22
    usually you then go back and
  • 46:22 - 46:24
    retrain the model you selected
  • 46:24 - 46:28
    on the entirety of your training set.
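    [A minimal sketch of the k-fold procedure just described; not from the lecture, and it assumes numpy with x and y as equal-length 1-D numpy arrays and, for concreteness, a polynomial model:]

```python
import numpy as np

def kfold_error(x, y, degree, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)  # train on the other k - 1 pieces
        errors.append(np.mean((np.polyval(coeffs, x[test_idx]) - y[test_idx]) ** 2))  # test on the held out piece
    return np.mean(errors)  # average the k error estimates

# Leave one out cross validation is the special case k = len(x).
```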
  • 46:28 - 46:34
    So I drew five pieces here because that was easier for me to draw, but k equals 10 is
  • 46:34 - 46:39
    very
  • 46:39 - 46:42
    common; okay? I should say
  • 46:42 - 46:48
    k equals 10 is the fairly common choice to do 10 fold cross validation.
  • 46:48 - 46:52
    And the advantage over hold out cross validation is that you split the data into ten pieces.
  • 46:52 - 46:54
    Then each time you're only holding out
  • 46:54 - 46:56
    1/10 of your data, rather than,
  • 46:56 - 46:59
    you know, say, 30 percent of your data. I
  • 46:59 - 47:03
    must say, in standard hold out in simple hold out cross validation,
  • 47:03 - 47:06
    a 30 70 split is fairly common.
  • 47:06 - 47:08
    Sometimes like 2/3 1/3 or
  • 47:08 - 47:10
    a 70 30 split is fairly common.
  • 47:10 - 47:14
    And if you use k-fold cross validation, k equals 5 or, more commonly, k equals 10 is the most
  • 47:14 - 47:16
    common choice.
  • 47:16 - 47:20
    The disadvantage of k-fold cross validation is that it can be much more
  • 47:20 - 47:21
    computationally expensive.
  • 47:21 - 47:23
    In particular,
  • 47:23 - 47:26
    to validate your model, you now need to train your model ten times,
  • 47:26 - 47:29
    instead of just once. And so you need
  • 47:29 - 47:33
    to, say, fit logistic regression ten times per model, rather than just once. And so this is
  • 47:33 - 47:35
    computationally more expensive.
  • 47:35 - 47:38
    But k equals ten works great.
  • 47:38 - 47:40
    And then,
  • 47:40 - 47:44
    finally, in
  • 47:44 - 47:48
    there's actually a version of this that you can take even further,
  • 47:48 - 47:50
    which is when your set k
  • 47:50 - 47:52
    equals m.
  • 47:52 - 47:57
    And so that's when you take your training set, and you split it into as many pieces as you have training
  • 47:57 - 47:59
    examples.
  • 47:59 - 48:06
    And this procedure is called leave one out cross validation.
  • 48:06 - 48:07
    And
  • 48:07 - 48:09
    what you do is you then
  • 48:09 - 48:12
    take out the first training example, train on the rest, and test on the first
  • 48:12 - 48:13
    example.
  • 48:13 - 48:16
    Then you take out the second training example,
  • 48:16 - 48:19
    train on the rest, and test on the second example. Then
  • 48:19 - 48:22
    you take out the third example, train on everything, but your third example. Test on
  • 48:22 - 48:24
    the third example, and so on.
  • 48:24 - 48:25
    And so
  • 48:25 - 48:28
    with this many pieces you are now making,
  • 48:28 - 48:31
    maybe even more effective use of your data than k-fold cross
  • 48:31 - 48:32
    validation. But
  • 48:32 - 48:36
    leave one out cross validation is computationally very expensive
  • 48:36 - 48:37
    because now
  • 48:37 - 48:42
    you need to repeatedly leave one example out, and then run your learning
  • 48:42 - 48:46
    algorithm on m minus one training examples. You need to do this a lot of times, and so
  • 48:46 - 48:49
    this is computationally very expensive.
  • 48:49 - 48:55
    And typically, this is done only when you're extremely data scarce. So if you
  • 48:55 - 48:59
    have a learning problem where you have, say, 15 training examples or something,
  • 48:59 - 49:01
    then if you have very few training examples, leave one
  • 49:01 - 49:04
    out cross validation
  • 49:04 - 49:04
    is
  • 49:04 - 49:11
    maybe preferred.
  • 49:16 - 49:18
    Yeah?
  • 49:18 - 49:23
    Student:You know, that time you proved that the difference between the generalized [inaudible] by
  • 49:23 - 49:27
    number of examples in your training
  • 49:27 - 49:28
    set and VC
  • 49:28 - 49:31
    dimension. So
  • 49:31 - 49:33
    maybe [inaudible] examples into
  • 49:33 - 49:35
    different groups, we can use that for [inaudible]. Instructor (Andrew Ng):Yeah, I
  • 49:35 - 49:37
    mean Student:- compute the training error, and use that for computing [inaudible] for a generalized error. Instructor (Andrew Ng):Yeah, that's done, but
  • 49:37 - 49:40
    yeah, in practice, I personally tend not to do that. It
  • 49:40 - 49:43
    tends not to be the VC
  • 49:43 - 49:47
    dimension bounds are somewhat loose bounds. And so
  • 49:47 - 49:48
    there are people
  • 49:48 - 49:51
    in structural risk minimization who propose that you do that, but I personally tend not to do
  • 49:51 - 49:55
    that,
  • 50:09 - 50:16
    though. Questions for cross validation?
  • 50:34 - 50:36
    Yeah. This
  • 50:36 - 50:39
    is kind of far from there because when we spend all this time [inaudible] but how many data points do you sort of need to go into your certain marginal [inaudible]?
  • 50:39 - 50:40
    Right.
  • 50:40 - 50:42
    So it seems like when I'd be
  • 50:42 - 50:44
    able to use that
  • 50:44 - 50:47
  • 50:47 - 50:48
    instead
  • 50:48 - 50:50
    of do this; more analytically, I guess. I mean Yeah. [Inaudible]. Instructor
  • 50:50 - 50:53
    (Andrew Ng):No, okay. So it turns out that when you're proving learning theory bounds, very often
  • 50:53 - 50:58
    the bounds will be extremely loose because you're sort of proving the worst case upper bound that
  • 50:58 - 50:59
    holds true
  • 50:59 - 51:03
    even for very bad what is
  • 51:03 - 51:06
    it
  • 51:06 - 51:10
    so the bounds that I proved just now; right? That holds true for absolutely any
  • 51:10 - 51:12
    probability distribution over training
  • 51:12 - 51:13
    examples; right?
  • 51:13 - 51:17
    So we just assume the training examples were drawn iid
  • 51:17 - 51:19
    from some distribution script d,
  • 51:19 - 51:23
    and the bounds I proved hold true for absolutely any probability
  • 51:23 - 51:27
    distribution over script
  • 51:27 - 51:29
    d. And chances are
  • 51:29 - 51:33
    whatever real life distribution you get over, you know, houses and their
  • 51:33 - 51:36
    prices or whatever, is probably not as bad as
  • 51:36 - 51:39
    the very worst one you
  • 51:39 - 51:41
    could've gotten; okay?
  • 51:41 - 51:42
    And so it turns out that if you
  • 51:42 - 51:46
    actually plug in the constants of learning theory bounds, you often get
  • 51:46 - 51:51
    extremely large numbers. Take logistic regression logistic
  • 51:51 - 51:53
    regression you have ten parameters
  • 51:53 - 51:57
    and 0.01 error, and with 95 percent probability. How many training
  • 51:57 - 51:58
    examples do I
  • 51:58 - 52:02
    need? If you actually plug in actual constants into the text for learning theory bounds,
  • 52:02 - 52:06
    you often get extremely pessimistic estimates with the number of examples you need. You end
  • 52:06 - 52:08
    up with
  • 52:08 - 52:12
    some ridiculously large numbers. You would need 10,000 training examples to fit
  • 52:12 - 52:14
    ten parameters.
  • 52:14 - 52:17
    So
  • 52:17 - 52:20
    a good way to think of these learning theory bounds is
  • 52:20 - 52:25
    and this is why, also, when I write papers on learning theory bounds, I
  • 52:25 - 52:26
    quite often
  • 52:26 - 52:30
    use big-O notation to just absolutely just ignore the constant factors because
  • 52:30 - 52:33
    the bounds seem to be very loose.
  • 52:33 - 52:39
    There are some attempts to use these bounds to give guidelines as to
  • 52:39 - 52:40
    what model
  • 52:40 - 52:42
    to choose, and so on.
  • 52:42 - 52:47
    But I personally tend to use the bounds again,
  • 52:47 - 52:47
    intuition
  • 52:47 - 52:49
    about
  • 52:49 - 52:53
    for example, what are the number of training examples you need
  • 52:53 - 52:56
    grows linearly in the number of parameters, or whether it grows exponentially in the number of parameters; whether it goes
  • 52:56 - 53:00
    quadratic parameters?
  • 53:00 - 53:02
    So it's quite often the shape of the bounds. The fact that
  • 53:02 - 53:07
    the number of training examples the fact that sample complexity is linear in the VC dimension,
  • 53:07 - 53:09
    that's sort of a useful intuition you can get from
  • 53:09 - 53:11
    these theories. But the
  • 53:11 - 53:14
    actual magnitude of the bound will tend to be much looser than what
  • 53:14 - 53:18
    will hold true for a particular problem you are working
  • 53:18 - 53:25
    on. So did that
  • 53:26 - 53:28
    answer your question? Student:Uh-huh. Instructor (Andrew Ng):Yeah. And it turns out, by the
  • 53:28 - 53:32
    way, for myself, a rule of thumb that I often use is if
  • 53:32 - 53:35
    you're trying to fit a logistic regression model,
  • 53:35 - 53:36
    if you have
  • 53:36 - 53:39
    n parameters or n plus one parameters;
  • 53:39 - 53:42
    if the number of training examples is ten times your number of
  • 53:42 - 53:45
    parameters, then you're probably in good shape.
  • 53:45 - 53:49
    And if your number of training examples is many times the number of parameters, then
  • 53:49 - 53:52
    you're probably perfectly fine fitting that model.
  • 53:52 - 53:55
    So those are the sorts of intuitions
  • 53:55 - 54:02
    that you can get from these bounds. Student:In cross validation do
  • 54:03 - 54:07
    we assume these examples are chosen randomly? Instructor (Andrew Ng):Yes. So by convention we usually split the training
  • 54:07 - 54:14
    and test sets randomly.
  • 54:23 - 54:27
    One more thing I want to talk about for model selection is that there's actually a special case of model selection,
  • 54:27 - 54:34
    called the feature selection problem.
  • 54:38 - 54:41
    And so here's the intuition:
  • 54:41 - 54:45
    for many machine-learning problems you may have a very high dimensional
  • 54:45 - 54:48
    feature space with very high dimensional
  • 54:48 - 54:52
    you have these [inaudible] feature vectors x.
  • 54:52 - 54:56
    So for example, for text classification, and I wanna talk about this text classification example of
  • 54:56 - 54:58
    spam versus
  • 54:58 - 55:00
    non-spam. You may easily have on
  • 55:00 - 55:04
    the order of 30,000 or 50,000 features. I think I used 50,000 in
  • 55:04 - 55:07
    my early examples. So if you have
  • 55:07 - 55:11
    so many features you have 50,000 features,
  • 55:11 - 55:14
    depending on what learning algorithm you use, there may be a real
  • 55:14 - 55:16
    risk of over fitting.
  • 55:16 - 55:17
    And so
  • 55:17 - 55:20
    if you can reduce the number of features, maybe
  • 55:20 - 55:23
    you can reduce the variance of your learning algorithm, and reduce the risk of
  • 55:23 - 55:25
    over fitting. And for
  • 55:25 - 55:28
    the specific case of text classification, if
  • 55:28 - 55:32
    you imagine that maybe there's a small number of relevant features, so
  • 55:32 - 55:34
    there are all these English words.
  • 55:34 - 55:37
    And many of these English words probably don't tell you anything at all about
  • 55:37 - 55:40
    whether the email is spam or non-spam. You
  • 55:40 - 55:44
    know, English function words like the, of, a, and;
  • 55:44 - 55:48
    these are probably words that don't tell you anything about whether the email is spam or non-spam. So
  • 55:48 - 55:51
    in contrast, there will be a much smaller number of
  • 55:51 - 55:54
    features that are truly relevant to the learning problem.
  • 55:54 - 55:57
    So for example, if you see the word buy or Viagra, those are words that
  • 55:57 - 56:02
    are very useful signals of spam versus non-spam. Or you see the word
  • 56:02 - 56:03
    Stanford or
  • 56:03 - 56:06
    machine learning or your own personal name. These are other words that are useful for
  • 56:06 - 56:11
    telling you whether something is spam or non-spam. So
  • 56:11 - 56:16
    in feature selection, we would like to select a subset of the features
  • 56:16 - 56:20
    that are maybe, or hopefully, the most relevant ones for a specific learning
  • 56:20 - 56:20
    problem, so
  • 56:20 - 56:25
    as to give ourselves a simpler hypothesis class to choose
  • 56:25 - 56:28
    from. And then therefore, reduce the risk of over fitting.
  • 56:28 - 56:30
    Even when we
  • 56:30 - 56:37
    may have had 50,000 features originally. So
  • 56:37 - 56:41
    how do you do this? Well, if you
  • 56:41 - 56:44
    have n
  • 56:44 - 56:47
    features, then there are
  • 56:47 - 56:50
    two to the n possible subsets;
  • 56:50 - 56:54
  • 56:54 - 56:56
    right? Because, you know,
  • 56:56 - 57:00
    each of your n features can either be included or excluded. So there are two to the n
  • 57:00 - 57:01
    possibilities.
  • 57:01 - 57:04
    And this is a huge space.
  • 57:04 - 57:09
    So in feature selection, what we most commonly do is use various search heuristics
  • 57:09 - 57:10
    sort of
  • 57:10 - 57:11
    simple search algorithms
  • 57:11 - 57:15
    to try to search through this space of two to the n possible subsets of features;
  • 57:15 - 57:17
    to try to find a good
  • 57:17 - 57:18
    subset of features. This is
  • 57:18 - 57:23
    too large a number to enumerate all possible feature subsets.
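To get a feel for why enumerating subsets is hopeless, here is a tiny illustrative calculation (the 50,000-feature figure matches the spam example above; working in log space is just to avoid printing a 15,000-digit number):

```python
import math

# The number of possible feature subsets is 2^n; print its order of magnitude for a few n.
for n in [10, 100, 1_000, 50_000]:
    digits = n * math.log10(2)                     # log10(2^n)
    print(f"n = {n:>6} features -> about 10^{digits:,.0f} possible subsets")
```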
  • 57:23 - 57:26
    And as a concrete example,
  • 57:26 - 57:27
  • 57:27 - 57:31
    this is the forward search algorithm; it's also called the forward selection algorithm.
  • 57:31 - 57:33
  • 57:33 - 57:35
    It's actually pretty simple, but I'll just
  • 57:35 - 57:36
    write it out.
  • 57:36 - 57:42
    My writing it out will make it look more complicated than it really
  • 57:42 - 57:45
    is, but
  • 57:45 - 57:46
    it starts with
  • 57:46 - 57:48
    initialize the set script f
  • 57:48 - 57:52
    to be the empty set,
  • 57:52 - 57:53
    and then
  • 57:53 - 57:55
    repeat
  • 57:55 - 57:57
    for
  • 57:57 - 57:59
    i equals one
  • 57:59 - 58:00
    to n;
  • 58:05 - 58:08
    try adding
  • 58:08 - 58:11
    feature i
  • 58:11 - 58:14
    to the set script
  • 58:14 - 58:16
    f, and evaluate the
  • 58:16 - 58:20
    model
  • 58:20 - 58:26
    using cross validation.
  • 58:26 - 58:30
    And by cross validation, I mean any of the three flavors, be it simple hold out cross
  • 58:30 - 58:35
    validation or k-fold cross validation or leave one out cross
  • 58:35 - 58:36
    validation.
  • 58:36 - 58:40
    And then, you know, set f to be
  • 58:40 - 58:41
    equal to
  • 58:41 - 58:43
    f union
  • 58:43 - 58:49
    the best single feature
  • 58:49 - 58:56
    found in that step; okay?
  • 58:58 - 59:00
    And finally, you would
  • 59:00 - 59:07
    output the best hypothesis found; okay? So
  • 59:07 - 59:14
  • 59:17 - 59:20
    the forward selection procedure is: start with the empty set of features.
  • 59:20 - 59:22
    And then on each
  • 59:22 - 59:26
    iteration, you take each of
  • 59:26 - 59:29
    your features that isn't already in your set script f and you try adding that
  • 59:29 - 59:30
    feature to your set. Then
  • 59:30 - 59:34
    you train them all, and evaluate them all, using
  • 59:34 - 59:35
    cross validation.
  • 59:35 - 59:39
    And basically, figure out what is the best single feature to add to your
  • 59:39 - 59:41
    set
  • 59:41 - 59:44
    script f. In step two here, you go ahead and add
  • 59:44 - 59:47
    that feature to your set script f; right?
  • 59:47 - 59:51
    And when I say best feature or best model here by best, I really mean the
  • 59:51 - 59:54
    best model according to hold out cross validation.
  • 59:54 - 59:56
    By best, I really mean
  • 59:56 - 60:01
    the single feature addition that results in the lowest hold out
  • 60:01 - 60:02
    cross validation error or the
  • 60:02 - 60:03
    lowest cross validation error.
  • 60:03 - 60:04
    So you do this
  • 60:04 - 60:08
    adding one feature at a time.
  • 60:08 - 60:12
    As for when you terminate this: let's see, if you've
  • 60:12 - 60:15
    added all the features to f, so f is now
  • 60:15 - 60:18
    the entire set of features; you can terminate this.
  • 60:18 - 60:19
    Or if
  • 60:19 - 60:22
    by some rule of thumb, you know that you
  • 60:22 - 60:24
    probably don't ever want more than
  • 60:24 - 60:27
    k features, you can also terminate
  • 60:27 - 60:30
    this if f has already exceeded some threshold number of features. So maybe
  • 60:30 - 60:31
    if
  • 60:31 - 60:34
    you have 100 training examples, and you're fitting logistic regression,
  • 60:34 - 60:39
    you probably know you won't want more than 100 features. And so
  • 60:39 - 60:41
    you stop after you have
  • 60:41 - 60:46
    100 features added to set f; okay?
  • 60:46 - 60:50
    And then finally, having done this, output the best hypothesis found; again, by
  • 60:50 - 60:52
    best, I mean,
  • 60:52 - 60:55
    when running this algorithm, you'd be seeing lots of hypotheses. You'd be training lots of
  • 60:55 - 60:58
    hypotheses, and testing them using cross validation.
  • 60:58 - 61:02
    So when I say output best hypothesis found, I mean
  • 61:02 - 61:05
    of all of the hypotheses you've seen during this entire procedure,
  • 61:05 - 61:07
    pick the one with the lowest
  • 61:07 - 61:10
    cross validation error that you saw; okay?
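Here is a minimal sketch of that forward-search loop in Python. The use of scikit-learn's LogisticRegression and cross_val_score, and the optional cap on the number of features, are my own illustrative choices; any learning algorithm and any of the cross-validation flavors mentioned above could be swapped in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None, cv=5):
    """Greedy forward search over feature subsets, scored by k-fold cross validation."""
    n_features = X.shape[1]
    if max_features is None:
        max_features = n_features
    selected = []                        # the set "script F" from the lecture
    best_overall = (-np.inf, [])         # (CV score, feature subset) of best model seen

    while len(selected) < max_features:
        candidates = []
        for i in range(n_features):
            if i in selected:
                continue
            trial = selected + [i]       # try adding feature i to F
            model = LogisticRegression(max_iter=1000)
            score = cross_val_score(model, X[:, trial], y, cv=cv).mean()
            candidates.append((score, i))
        if not candidates:
            break
        best_score, best_feature = max(candidates)   # best single feature to add
        selected.append(best_feature)                # F := F union {that feature}
        if best_score > best_overall[0]:
            best_overall = (best_score, list(selected))

    # output the best feature subset (hypothesis) seen during the entire search
    return best_overall[1], best_overall[0]
```

For instance, forward_selection(X, y, max_features=100) would implement the stop-after-100-features rule of thumb mentioned above.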
  • 61:10 - 61:17
    So that's forward selection. So
  • 61:35 - 61:40
    let's see, just to give this a name, this is an instance of what's called
  • 61:40 - 61:43
  • 61:43 - 61:46
    wrapper feature
  • 61:46 - 61:48
    selection.
  • 61:48 - 61:51
    And the term wrapper comes
  • 61:51 - 61:52
    from the fact that
  • 61:52 - 61:55
    this feature selection algorithm that I just described, forward selection or forward
  • 61:55 - 61:57
    search,
  • 61:57 - 61:59
    is a piece of software that
  • 61:59 - 62:02
    you write that wraps around your learning algorithm. In the sense that
  • 62:02 - 62:04
    to perform forward selection,
  • 62:04 - 62:08
    you need to repeatedly make calls to your learning algorithm
  • 62:08 - 62:09
    to train
  • 62:09 - 62:12
    your model,
  • 62:12 - 62:15
    using different subsets of features; okay? So this is called wrapper model
  • 62:15 - 62:17
    feature selection.
  • 62:17 - 62:21
    And it tends to be somewhat computationally expensive because as you're
  • 62:21 - 62:23
    performing the search process,
  • 62:23 - 62:26
    you're repeatedly training your learning algorithm over and over and over on all of
  • 62:26 - 62:30
    these different subsets of features. Let's
  • 62:30 - 62:37
    just mention also, there is a variation of this called backward search or
  • 62:39 - 62:41
    backward selection,
  • 62:41 - 62:44
    which is where you start with
  • 62:44 - 62:47
    f equals
  • 62:47 - 62:54
    the entire set of features, and you delete features one at a time;
  • 62:59 - 63:01
    okay?
  • 63:01 - 63:04
  • 63:04 - 63:07
    So that's backward search or backward selection.
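And a correspondingly minimal sketch of backward search, under the same illustrative assumptions (scikit-learn, accuracy scored by k-fold cross validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def backward_selection(X, y, min_features=1, cv=5):
    """Start with all features and greedily delete one feature at a time."""
    selected = list(range(X.shape[1]))
    start_score = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, selected], y, cv=cv).mean()
    best_overall = (start_score, list(selected))

    while len(selected) > min_features:
        candidates = []
        for i in selected:
            trial = [j for j in selected if j != i]   # try deleting feature i
            model = LogisticRegression(max_iter=1000)
            candidates.append((cross_val_score(model, X[:, trial], y, cv=cv).mean(), i))
        best_score, feature_to_drop = max(candidates)
        selected.remove(feature_to_drop)              # F := F minus {that feature}
        if best_score > best_overall[0]:
            best_overall = (best_score, list(selected))

    return best_overall[1], best_overall[0]
```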
  • 63:07 - 63:08
    And
  • 63:08 - 63:13
    this is another feature selection algorithm that you might use. Part of
  • 63:13 - 63:15
    whether this makes sense is
  • 63:15 - 63:17
    really there will be problems where
  • 63:17 - 63:21
    it really doesn't even make sense to initialize f to be the set of all features.
  • 63:21 - 63:24
    So if you have 100 training examples
  • 63:24 - 63:26
    and 10,000 features,
  • 63:26 - 63:29
    which may well happen
  • 63:29 - 63:31
    100 emails and 10,000 features
  • 63:31 - 63:35
    per email, and only 100 training examples, then
  • 63:35 - 63:38
    depending on the learning algorithm you're using, it may or may not make sense to
  • 63:38 - 63:39
    initialize
  • 63:39 - 63:43
    the set f to be all features, and train a model by using all the features. And if it
  • 63:43 - 63:46
    doesn't make sense to train a model by using all the features, then
  • 63:46 - 63:52
    forward selection would be more common.
  • 63:52 - 63:54
    So
  • 63:54 - 63:57
    let's see. Wrapper model feature selection algorithms tend to work
  • 63:57 - 63:58
    well.
  • 63:58 - 64:02
    And in particular, they actually often work better than a different class of algorithms I'm gonna talk
  • 64:02 - 64:06
    about now. But their main disadvantage is that they're computationally very
  • 64:06 - 64:09
    expensive.
  • 64:09 - 64:16
    Do you have any questions about this
  • 64:16 - 64:19
    before I talk about the other? Yeah? Student:[Inaudible]. Instructor (Andrew Ng):Yeah yes,
  • 64:19 - 64:23
    you're actually right. So forward search and backward search, both of these are search heuristics,
  • 64:23 - 64:24
    and
  • 64:24 - 64:28
    for either of these, you cannot guarantee they'll find the best
  • 64:28 - 64:29
    subset of features. It
  • 64:29 - 64:31
    actually turns out that
  • 64:31 - 64:35
    under many formalizations of the feature selection problem, it actually turns out to be an NP-
  • 64:35 - 64:40
    hard problem to find the best subset of features.
  • 64:40 - 64:44
    But in practice, forward selection and backward selection work fine,
  • 64:44 - 64:47
    and you can also envision other search algorithms where you sort of
  • 64:47 - 64:50
    have other methods to search through the space of two to the n possible feature
  • 64:50 - 64:57
    subsets. So
  • 65:01 - 65:04
    let's see.
  • 65:04 - 65:06
    Wrapper feature selection
  • 65:06 - 65:10
    tends to work well when you can afford to do it
  • 65:10 - 65:12
    computationally.
  • 65:12 - 65:14
    But for problems
  • 65:14 - 65:18
    such as text classification, it turns out, for text classification specifically,
  • 65:18 - 65:22
    because you have so many features, and can easily have 50,000 features,
  • 65:22 - 65:26
    forward selection would be very, very expensive.
  • 65:26 - 65:28
    So there's a different class of algorithms that
  • 65:28 - 65:31
  • 65:31 - 65:35
    tends not to do as well in the sense of generalization error. So you tend to learn the
  • 65:35 - 65:37
    hypothesis that works less well,
  • 65:37 - 65:40
    but is computationally much less expensive.
  • 65:40 - 65:44
    And these are called
  • 65:44 - 65:47
    the
  • 65:47 - 65:50
    filter feature selection methods.
  • 65:50 - 65:53
    And the basic idea is
  • 65:53 - 65:54
  • 65:54 - 66:01
    that for each feature i, we will
  • 66:01 - 66:04
    compute
  • 66:04 - 66:11
    some measure
  • 66:11 - 66:17
    of how informative
  • 66:17 - 66:20
    xi
  • 66:20 - 66:25
    is about y; okay?
  • 66:25 - 66:29
    And to do this, we'll use some simple
  • 66:29 - 66:33
    heuristics; for every feature we'll just try to compute some rough estimate or
  • 66:33 - 66:40
    compute
  • 66:41 - 66:48
    some measure of how informative
  • 66:50 - 66:54
    xi is about
  • 66:54 - 66:59
    y. So there are many ways you can do this. One way you can choose is to just compute the
  • 66:59 - 67:01
    correlation between xi and y. And just for each of
  • 67:01 - 67:05
    your features just see how correlated this is with
  • 67:05 - 67:07
    your class label y.
  • 67:07 - 67:13
    And then you just pick the top k most correlated features.
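A minimal sketch of that correlation-based filter; scoring binary labels by absolute Pearson correlation is one reasonable choice of heuristic, not the only one:

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Filter feature selection: rank features by |correlation with the label|, keep top k."""
    Xc = X - X.mean(axis=0)                 # center each feature column
    yc = y - y.mean()                       # center the labels
    # Pearson correlation of each column of X with y (epsilon guards constant columns)
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(-np.abs(corr))[:k]    # indices of the k most correlated features
```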
  • 67:13 - 67:20
    Another way
  • 67:46 - 67:52
    to do this
  • 67:52 - 67:55
    for the case of text classification, there's one other method, which especially
  • 67:55 - 67:57
    for these kinds of features, I guess
  • 67:57 - 67:59
    there's one other
  • 67:59 - 68:01
    informative measure
  • 68:01 - 68:04
    that's used very commonly, which is called mutual information. I'm going to
  • 68:04 - 68:11
    tell you
  • 68:11 - 68:15
    some of these ideas in problem sets, but I'll
  • 68:15 - 68:18
    just say this very briefly.
  • 68:18 - 68:22
    So the mutual information between feature xi and y
  • 68:22 - 68:28
    I'll just write out the definition, I guess.
  • 68:28 - 68:32
    Let's say this is text classification, so x can take on two values, 0, 1;
  • 68:32 - 68:36
    the mutual information between xi and y is defined as a sum over all the possible values of
    xi, and a sum over all the possible values of y, of the joint distribution times a log ratio:
    MI(xi, y) = sum over xi, sum over y of p(xi, y) log [ p(xi, y) / ( p(xi) p(y) ) ]
  • 68:46 - 68:52
  • 68:52 - 68:53
    okay?
  • 68:53 - 68:54
    Where
  • 69:00 - 69:04
    all of these distributions, so the joint distribution over xi and y,
  • 69:04 - 69:11
    you would estimate from your training data
  • 69:13 - 69:14
    all of these things you would use, as well. You would estimate
  • 69:14 - 69:15
    from
  • 69:15 - 69:20
    the training data what is the probability that x is 0, what's the probability that x is one, what's the
  • 69:20 - 69:27
    probability that x is 0, y is 0, x is one; y is 0, and so on. So it
  • 69:28 - 69:31
    turns out
  • 69:31 - 69:35
    there's a standard information theoretic measure of how different
  • 69:35 - 69:37
    probability distributions are.
  • 69:37 - 69:40
    And I'm not gonna prove this here. But it turns out that
  • 69:40 - 69:46
    this mutual information is
  • 69:46 - 69:52
    actually
  • 69:52 - 69:54
    so the standard
  • 69:54 - 69:58
    measure of how different distributions are, which is called the K-L divergence.
  • 69:58 - 70:00
    When you take a class in information theory,
  • 70:00 - 70:03
    you may have seen the concepts of mutual information and the K-L divergence, but if you
  • 70:03 - 70:06
    haven't, don't worry about it. Just
  • 70:06 - 70:09
    the intuition is there's something called K-L divergence that's a formal measure
  • 70:09 - 70:13
    of how different two probability distributions
  • 70:13 - 70:17
    are. And mutual information is a measure for how different
  • 70:17 - 70:20
    the joint distribution is
  • 70:20 - 70:22
    of x and y;
  • 70:22 - 70:23
    from the distribution
  • 70:23 - 70:26
    you would get if you were to assume they were independent;
  • 70:26 - 70:27
    okay? So
  • 70:27 - 70:29
    if x and y were independent, then
  • 70:29 - 70:34
    p of x, y would be equal to p of x times p of y.
  • 70:34 - 70:37
    And so you know, this distribution and this distribution would be identical, and the
  • 70:37 - 70:40
    K-L divergence would be 0.
  • 70:40 - 70:44
    In contrast, if x and y were very non-independent in
  • 70:44 - 70:47
    other words, if x and y are very informative about each other,
  • 70:47 - 70:48
    then this K-L divergence
  • 70:48 - 70:50
    will be large.
  • 70:50 - 70:52
    And so mutual information is a formal measure of
  • 70:52 - 70:55
    how non-independent x and y are.
  • 70:55 - 70:58
    And if x and y are highly non-independent
  • 70:58 - 70:59
    then that means that
  • 70:59 - 71:02
    x will presumably tell you something about y,
  • 71:02 - 71:06
    and so they'll have large mutual information. And
  • 71:06 - 71:10
    this measure of information will tell you x might be a good feature. And you
  • 71:10 - 71:12
    get to play with some of these ideas
  • 71:12 - 71:15
    more in the problem sets. So I won't say much more about it.
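As a concrete sketch, here is one way to estimate that mutual information score for a single binary feature from training data, using empirical counts for all the probabilities; the small additive smoothing constant is an assumption to avoid taking the log of zero:

```python
import numpy as np

def mutual_information(xi, y, eps=1e-12):
    """MI(xi, y) = sum_{a,b} p(xi=a, y=b) * log( p(xi=a, y=b) / (p(xi=a) * p(y=b)) ),
    with every probability estimated by counting over the training set (xi, y in {0, 1})."""
    xi = np.asarray(xi)
    y = np.asarray(y)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (y == b)) + eps   # joint probability estimate
            p_a = np.mean(xi == a) + eps                 # marginal of the feature
            p_b = np.mean(y == b) + eps                  # marginal of the label
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```

Ranking all the features by this score and keeping the top k is exactly the filter procedure being described.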
  • 71:17 - 71:19
    And what you do then is having chosen
  • 71:19 - 71:24
    some measure like correlation or mutual information or something else, you
  • 71:24 - 71:29
    then pick the top
  • 71:29 - 71:34
    k features; meaning that you compute correlation between xi and y for all the
  • 71:34 - 71:37
    features, or the mutual information between xi and y for all the features.
  • 71:37 - 71:40
    And then you include in your learning algorithm the
  • 71:40 - 71:43
    k features with the largest correlation with the label, or
  • 71:43 - 71:46
    the largest mutual information with the label, whatever.
  • 71:46 - 71:52
    And to choose
  • 71:52 - 71:55
    k,
  • 71:55 - 71:58
    you can actually use cross validation, as well; okay?
  • 71:58 - 71:59
    So you would
  • 71:59 - 72:03
    take all your features, and sort them in decreasing order of mutual
  • 72:03 - 72:04
    information.
  • 72:04 - 72:08
    And then you'd try using just the top one feature, the top two features, the top
  • 72:08 - 72:10
    three features, and so on.
  • 72:10 - 72:14
    And you decide how many features to include
  • 72:14 - 72:17
    using cross validation; okay? Or you
  • 72:17 - 72:24
    can sometimes you can just choose this by hand, as well. Okay. Questions about
  • 72:24 - 72:31
    this? Okay.
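Putting the pieces together, here is a minimal sketch of choosing k by cross validation over the mutual-information ranking. It reuses the illustrative mutual_information helper sketched above, and simple hold-out splitting via scikit-learn's train_test_split is just one of the cross-validation flavors you could use here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def choose_k_by_cv(X, y, candidate_ks=(1, 2, 5, 10, 50, 100)):
    """Sort features by mutual information with y, then pick k by hold-out cross validation."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    ranking = np.argsort(-scores)                  # features in decreasing order of MI
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    best_k, best_acc = None, -np.inf
    for k in candidate_ks:
        top = ranking[:k]                          # keep only the top k features
        model = LogisticRegression(max_iter=1000).fit(X_train[:, top], y_train)
        acc = model.score(X_val[:, top], y_val)    # hold-out validation accuracy
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, ranking[:best_k]
```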
  • 72:35 - 72:36
    Cool. Great. So
  • 72:36 - 72:41
    next lecture I'll continue - I'll wrap up with Bayesian model selection - but let's close
  • 72:41 - 72:41
    here; we're at the end.
Title:
Lecture 10 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng continues his lecture on learning theory by discussing VC dimension and model selection.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:12:56