Lecture 7 | Machine Learning (Stanford)

  • 0:12 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development.
  • 0:24 - 0:26
    So welcome back.
  • 0:26 - 0:31
    And what I wanna do today is continue our discussion on support vector machines. And in
  • 0:31 - 0:34
    particular, I wanna talk about the optimal margin classifier.
  • 0:34 - 0:38
    Then I wanna take a brief digression and talk about primal and dual optimization problems, and
  • 0:38 - 0:39
    in particular,
  • 0:39 - 0:42
    what's called the KKT conditions.
  • 0:42 - 0:45
    And then we'll derive the dual of the optimization problem that I
  • 0:45 - 0:47
    had posed earlier.
  • 0:47 - 0:51
    And that will lead us into a discussion of kernels, which I won't really -
  • 0:51 - 0:54
    which we'll just get to say a couple of words about, but which I'll do probably
  • 0:54 - 0:58
    only in the next lecture.
  • 0:58 - 1:02
    And as part of today's lecture, I'll spend some time talking about optimization problems.
  • 1:02 - 1:04
    And in
  • 1:04 - 1:08
    the little time I have today, I won't really be able to do this topic justice.
  • 1:08 - 1:13
    I wanna talk about convex optimization and do that topic justice.
  • 1:13 - 1:18
    And so at this week's discussion session, the TAs will have more
  • 1:18 - 1:22
    time and will teach a discussion session focused on convex optimization
  • 1:22 - 1:25
    - sort of very beautiful and useful theory.
  • 1:25 - 1:32
    So if you want to learn more about that, listen to this Friday's discussion session.
  • 1:33 - 1:37
    Just to recap what we did in the previous lecture,
  • 1:37 - 1:39
    as
  • 1:39 - 1:42
    we were beginning to develop support vector machines, I
  • 1:42 - 1:44
    said that a
  • 1:44 - 1:49
    hypothesis is represented as h sub w, b of x, which is g of
  • 1:49 - 1:53
    w transpose x + b, where
  • 1:53 - 1:55
  • 1:55 - 1:56
    g
  • 1:56 - 2:03
    will be
  • 2:04 - 2:09
    +1 or -1, depending on whether z is greater than
  • 2:09 - 2:10
    0.
  • 2:10 - 2:13
    And I said that in our development of support vector machines,
  • 2:13 - 2:17
    we'll change the convention to letting y be +1 or -1 to denote
  • 2:17 - 2:20
    the class labels.
  • 2:20 - 2:24
    So last time,
  • 2:24 - 2:28
    we also talked about the functional margin, which was this thing, gamma
  • 2:28 - 2:34
    hat i.
  • 2:34 - 2:36
  • 2:36 - 2:39
    And so we had the intuition that
  • 2:39 - 2:42
    if the functional margin is a large positive number,
  • 2:42 - 2:47
    then that means that we are classifying a training example correctly and very
  • 2:47 - 2:48
    confidently. So
  • 2:48 - 2:50
    if yi is +1, we
  • 2:50 - 2:53
    would like w transpose xi + b to be very large.
  • 2:53 - 2:55
    And if
  • 2:55 - 2:57
    yi is -1,
  • 2:57 - 2:59
    then we'd like w transpose xi + b to be a large negative number. So
  • 2:59 - 3:03
    we'd sort of like functional margins to be large. We
  • 3:03 - 3:06
    also said the functional margin has a strange property -
  • 3:06 - 3:10
    that you can increase functional margin just by, say, taking your parameters,
  • 3:10 - 3:11
    w and b,
  • 3:11 - 3:14
    and multiplying them by
  • 3:14 - 3:16
    2.
  • 3:16 - 3:18
    And then we also
  • 3:18 - 3:21
    defined the geometric margin,
  • 3:21 - 3:23
    which
  • 3:23 - 3:25
  • 3:25 - 3:32
    was
  • 3:33 - 3:37
    that we just - essentially, the functional margin
  • 3:37 - 3:38
    divided by
  • 3:38 - 3:40
    the norm of w.
  • 3:40 - 3:43
    And so the geometric margin had
  • 3:43 - 3:47
    the interpretation as being - I'll give
  • 3:47 - 3:49
    you a few examples. The
  • 3:49 - 3:52
    geometric margin, for example, is - has
  • 3:52 - 3:56
    the interpretation as a distance between a training example
  • 3:56 - 3:57
    and a hyperplane.
  • 3:57 - 4:01
    And it'll actually be a signed distance, so that this distance will be positive
  • 4:01 - 4:05
    if you're classifying the example correctly. And if you misclassify the example, this
  • 4:05 - 4:08
    distance - it'll be the minus of the distance,
  • 4:08 - 4:10
    between the point, the training example,
  • 4:10 - 4:12
    and your separating hyperplane,
  • 4:12 - 4:16
    where your separating hyperplane is defined by the equation w transpose x
  • 4:16 - 4:18
    +
  • 4:18 - 4:21
    b =
  • 4:21 - 4:24
    0.
  • 4:24 - 4:28
    So - oh,
  • 4:28 - 4:32
    well, and I guess
  • 4:32 - 4:37
    I also defined the functional and geometric margins
  • 4:37 - 4:38
    with respect to a training set, which I defined as
  • 4:38 - 4:45
    the worst-case, or the minimum, functional or geometric margin. So in our
  • 4:49 - 4:53
    development of the optimal margin classifier,
  • 4:53 - 4:56
    our learning algorithm would choose parameters w and b so as to maximize
  • 4:56 - 5:00
    the geometric margin. So our goal is to find the separating hyperplane
  • 5:00 - 5:05
    that separates the positive and negative examples with as large a distance as possible
  • 5:05 - 5:06
    between the hyperplane
  • 5:06 - 5:11
    and the positive and negative examples. And if you
  • 5:11 - 5:15
    go to choose parameters w and b to maximize this, [inaudible] one property of
  • 5:15 - 5:17
    the geometric margin
  • 5:17 - 5:18
    is that you
  • 5:18 - 5:20
    can actually scale w
  • 5:20 - 5:24
    and b arbitrarily. So you look at this definition for the geometric margin.
  • 5:24 - 5:25
  • 5:25 - 5:28
    I can choose to multiply my parameters w and b
  • 5:28 - 5:32
    by 2 or by 10 or any other constant.
  • 5:32 - 5:34
    And it doesn't change
  • 5:34 - 5:36
    my geometric margin.
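A minimal sketch of this scaling property, assuming NumPy and a made-up toy example (the values of w, b, x, and y below are hypothetical, chosen only so the arithmetic is easy to follow). It shows that doubling w and b doubles the functional margin but leaves the geometric margin unchanged:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """gamma hat = y * (w^T x + b) for a single training example."""
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """gamma = functional margin divided by the norm of w."""
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Hypothetical toy values (not from the lecture).
w = np.array([3.0, 4.0])   # norm of w is 5
b = -1.0
x = np.array([1.0, 2.0])
y = 1

f1 = functional_margin(w, b, x, y)          # 3 + 8 - 1 = 10
f2 = functional_margin(2 * w, 2 * b, x, y)  # scaling by 2 doubles it
g1 = geometric_margin(w, b, x, y)           # 10 / 5 = 2
g2 = geometric_margin(2 * w, 2 * b, x, y)   # unchanged by the scaling

print(f1, f2, g1, g2)
```

Because the geometric margin ignores the scale of (w, b), any convenient normalization can be imposed without changing which hyperplane is optimal.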
  • 5:36 - 5:38
    And one way of interpreting that is you're looking
  • 5:38 - 5:39
    at
  • 5:39 - 5:43
    just the separating hyperplane. You look at this line that's separating the positive and negative training examples. If
  • 5:43 - 5:45
    I
  • 5:45 - 5:46
    scale w
  • 5:46 - 5:47
    and b,
  • 5:47 - 5:49
    that doesn't change the position of this plane, though
  • 5:49 - 5:52
    because the equation w transpose x +
  • 5:52 - 5:59
    b = 0 is the same as the equation 2 w transpose x + 2b = 0. So it's the same straight
  • 5:59 - 6:03
    line. And what that means is that I can actually choose whatever scaling
  • 6:03 - 6:06
    for w and b is convenient for me.
  • 6:06 - 6:08
    And in particular,
  • 6:08 - 6:09
    as we'll use in a minute,
  • 6:09 - 6:14
    I can impose a constraint like, say, that the norm of w equals 1
  • 6:14 - 6:17
    because this means that you can find a solution to w and b.
  • 6:17 - 6:22
    And then by rescaling the parameters, you can easily meet this condition, this rescaled w
  • 6:22 - 6:24
    [inaudible] 1. And so I can
  • 6:24 - 6:28
    add the condition like this and then essentially not change the problem. Or I
  • 6:28 - 6:32
    can add other conditions. I can actually add a
  • 6:32 - 6:35
    condition
  • 6:35 - 6:38
    that - excuse me, the absolute value of w1 = 1. I can have
  • 6:38 - 6:41
    only one of these conditions right now [inaudible]. And adding condition to the absolute
  • 6:41 - 6:42
    value - the
  • 6:42 - 6:45
    first component of w must be equal to 1. And again,
  • 6:45 - 6:49
    you can find the optimal solution and just rescale w and meet this
  • 6:49 - 6:51
    condition.
  • 6:51 - 6:53
    And I can have other,
  • 6:53 - 6:56
    more esoteric conditions like that
  • 6:56 - 6:57
    because again,
  • 6:57 - 7:01
    this is a condition that you can solve for the optimal margin, and then just
  • 7:01 - 7:03
    by scaling,
  • 7:03 - 7:05
    you have w up and down. You can - you can then
  • 7:05 - 7:09
    ensure you meet this condition as well. So
  • 7:09 - 7:13
    again, [inaudible] one of these conditions right now, not all of them.
  • 7:13 - 7:15
    And so our ability to choose
  • 7:15 - 7:19
    any scaling condition on w that's convenient to us
  • 7:19 - 7:24
    will be useful again in a second. All right.
  • 7:24 - 7:29
    So let's go ahead and write down the optimization problem. And again, my goal is to choose
  • 7:29 - 7:30
    parameters w and b
  • 7:30 - 7:34
    so as to maximize the geometric margin.
  • 7:34 - 7:39
    Here's my first attempt at writing down the optimization problem. Actually wrote this one down
  • 7:39 - 7:40
    right at
  • 7:40 - 7:43
    the end of the previous
  • 7:43 - 7:47
    lecture. Maximize gamma over the parameters gamma, w, and b
  • 7:47 - 7:48
    such that
  • 7:48 - 7:49
  • 7:49 - 7:54
    -
  • 7:54 - 7:57
    that [inaudible] i
  • 7:57 - 8:02
    for all training examples.
  • 8:02 - 8:02
    Let's say I
  • 8:02 - 8:06
    choose to add this normalization condition.
  • 8:06 - 8:11
    So the norm condition, that the norm of w is equal to 1, just makes
  • 8:11 - 8:15
    the geometric and the functional margin the same.
  • 8:15 - 8:18
    And so I'm saying I want to find a value -
  • 8:18 - 8:23
    I want to find a value for gamma as big as possible
  • 8:23 - 8:26
    so that all of my training examples have functional margin
  • 8:26 - 8:29
    greater than or equal to gamma,
  • 8:29 - 8:30
    and
  • 8:30 - 8:34
    with the constraint that the norm of w equals 1,
  • 8:34 - 8:36
    functional margin and geometric margin are the same.
  • 8:36 - 8:38
    So it's the same.
  • 8:38 - 8:40
    Find the value for gamma so that
  • 8:40 - 8:43
    all the values - all the geometric margins are greater or equal to
  • 8:43 - 8:46
    gamma.
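Written out, this first attempt can be sketched as follows. The symbol m for the number of training examples is an assumption made here for illustration; everything else follows the spoken description (maximize gamma, require every functional margin to be at least gamma, and normalize the norm of w to 1):

```latex
\begin{aligned}
\max_{\gamma, w, b} \quad & \gamma \\
\text{subject to} \quad & y_i \left( w^T x_i + b \right) \ge \gamma, \quad i = 1, \dots, m \\
& \lVert w \rVert = 1
\end{aligned}
```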
  • 8:46 - 8:51
    So you solve this optimization problem, then you have derived
  • 8:51 - 8:55
    the optimal margin classifier -
  • 8:55 - 8:57
    except that
  • 8:57 - 9:00
    this is not a very nice optimization problem because this is a
  • 9:00 - 9:04
    nasty, nonconvex constraint. And [inaudible] is asking that you
  • 9:04 - 9:06
    solve for parameters w
  • 9:06 - 9:11
    that lie on the surface of a unit sphere, lie on this [inaudible].
  • 9:11 - 9:15
    It lies on a unit circle - a unit sphere.
  • 9:15 - 9:16
  • 9:16 - 9:18
    And so
  • 9:18 - 9:21
    if we can come up with a convex optimization problem, then
  • 9:21 - 9:25
    we'd be guaranteed that our [inaudible] descent or other local [inaudible] will
  • 9:25 - 9:25
    not have
  • 9:25 - 9:26
    local optima. And
  • 9:26 - 9:31
    it turns out this is an example of a nonconvex constraint. This is a nasty constraint
  • 9:31 - 9:38
    that I would like to get rid of. So
  • 9:41 - 9:43
    let's change the optimization problem
  • 9:43 - 9:48
    one more time.
  • 9:48 - 9:52
    Now, let me
  • 9:52 - 9:57
    pose a slightly different optimization problem. Let
  • 9:57 - 10:03
    me maximize the functional margin divided by the norm of w
  • 10:03 - 10:05
    subject
  • 10:05 - 10:12
    to yi times (w transpose xi + b) being greater than or equal to gamma hat.
  • 10:13 - 10:15
    So in other words, I want to find
  • 10:15 - 10:17
    a number, gamma hat,
  • 10:17 - 10:20
    so that every one of my training examples has functional margin greater
  • 10:20 - 10:23
    than or equal to gamma hat,
  • 10:23 - 10:27
    and my optimization objective is I want to maximize gamma hat divided by the norm of
  • 10:27 - 10:28
  • 10:28 - 10:31
    w. And so I wanna maximize the
  • 10:31 - 10:36
    functional margin divided by the norm of w. And we saw previously the functional
  • 10:36 - 10:38
    margin divided by the norm of
  • 10:38 - 10:42
    w is just the geometric margin, and so this is a different way of posing
  • 10:42 - 10:45
    the same
  • 10:45 - 10:47
    optimization problem. [Inaudible] confused, though. Are there questions
  • 10:47 - 10:54
    about this?
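As a sketch, the second formulation just posed can be written as below. The symbol m (the number of training examples) is notation assumed here; gamma hat is treated as a free variable alongside w and b, and the nonconvexity has moved from the constraint into the objective:

```latex
\begin{aligned}
\max_{\hat{\gamma}, w, b} \quad & \frac{\hat{\gamma}}{\lVert w \rVert} \\
\text{subject to} \quad & y_i \left( w^T x_i + b \right) \ge \hat{\gamma}, \quad i = 1, \dots, m
\end{aligned}
```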
  • 10:54 - 10:54
  • 10:54 - 10:57
    Student:[Inaudible] the second statement has to be made of the functional margin y
  • 10:57 - 10:59
    divided by - why don't you just have it
  • 10:59 - 11:02
    the geometric
  • 11:02 - 11:05
    margin? Why do
  • 11:05 - 11:09
    you [inaudible]? Instructor (Andrew Ng):[Inaudible] say it again? Student:For the second statement, where we're saying the data of the functional margin is divided [inaudible]. Instructor (Andrew
  • 11:09 - 11:11
    Ng):Oh, I see, yes. Student:[Inaudible]
  • 11:11 - 11:13
  • 11:13 - 11:17
    is that [inaudible]? Instructor (Andrew Ng):So let's see, this is the functional margin, right? This is not the geometric margin.
  • 11:17 - 11:19
    Student:Yeah.
  • 11:19 - 11:26
    Instructor (Andrew Ng):So - oh, I want to divide by the norm of w in my optimization objective. Student:I'm just wondering how come you end up dividing also under the second stage [inaudible] the functional
  • 11:26 - 11:28
    margin.
  • 11:28 - 11:32
    Why are you dividing there by the norm of w? Instructor (Andrew Ng):Let's see. I'm not sure I get the question. Let me
  • 11:32 - 11:36
    try saying
  • 11:36 - 11:40
    this again. So here's my goal. My - I want [inaudible]. So
  • 11:40 - 11:42
    let's see,
  • 11:42 - 11:47
    the parameters of this optimization problem are gamma hat, w, and b - so
  • 11:47 - 11:50
    the convex optimization software
  • 11:50 - 11:52
    solves this problem for some set of parameters gamma
  • 11:52 - 11:54
    w and b.
  • 11:54 - 11:57
    And I'm imposing the constraint that
  • 11:57 - 11:59
    whatever values it comes up with,
  • 11:59 - 12:04
    yi times (w transpose xi + b) must be greater than or equal to gamma hat.
  • 12:04 - 12:06
    And so this means that
  • 12:06 - 12:09
    the functional margin of every example had
  • 12:09 - 12:09
    better be
  • 12:09 - 12:11
    greater than or equal to gamma hat. So
  • 12:11 - 12:14
    there's a constraint to the function margin and a constraint to the gamma hat.
  • 12:14 - 12:18
    But what I care about is not really maximizing the functional margin. What I
  • 12:18 - 12:19
    really care about -
  • 12:19 - 12:21
    in other words, my optimization objective -
  • 12:21 - 12:25
    is maximizing gamma hat divided by the norm of w,
  • 12:25 - 12:28
    which is the geometric margin.
  • 12:28 - 12:32
    So in other words, my optimization [inaudible] is I want to maximize the functional margin
  • 12:32 - 12:35
    divided by the norm of
  • 12:35 - 12:38
    w, subject to the constraint that every example must have
  • 12:38 - 12:40
    functional margin of at least gamma hat. Does that make
  • 12:40 - 12:46
    sense now? Student:[Inaudible] when you
  • 12:46 - 12:48
    said that to maximize gamma
  • 12:48 - 12:51
    or gamma hat, respect to gamma w and with
  • 12:51 - 12:52
    respect to gamma hat
  • 12:52 - 12:57
    so that
  • 12:57 - 13:01
    [inaudible] gamma hat are no
  • 13:01 - 13:05
    longer [inaudible]? Instructor (Andrew Ng):So this is the -
  • 13:05 - 13:07
    so it turns out -
  • 13:07 - 13:11
    so this is how I write down the - this is how I write down an optimization problem
  • 13:11 - 13:13
    in order to solve for
  • 13:13 - 13:16
    the geometric margin. What is it -
  • 13:16 - 13:18
    so it turns out that
  • 13:18 - 13:21
    the question is this - is gamma hat a function of w and b? And it turns out
  • 13:21 - 13:22
    that
  • 13:22 - 13:25
    in my previous mathematical definition, it was,
  • 13:25 - 13:30
    but the way I'm going to pose this as an optimization problem is
  • 13:30 - 13:35
    I'm going to ask the convex optimization solvers - and this [inaudible] software - unless you have software for solving
  • 13:35 - 13:38
    convex optimization problems -
  • 13:38 - 13:40
    then I'm going to
  • 13:40 - 13:44
    pretend that these are independent variables and ask my convex optimization software
  • 13:44 - 13:46
    to find me values for gamma, w, and b,
  • 13:46 - 13:51
    to make this value as big as possible and subject to this constraint.
  • 13:51 - 13:53
    And it'll turn out that
  • 13:53 - 13:55
    when it does that, it will choose - or
  • 13:55 - 13:59
    obviously, it will choose for gamma to be as big as possible
  • 13:59 - 14:01
    because optimization objective is this:
  • 14:01 - 14:03
    You're trying to maximize gamma hat.
  • 14:03 - 14:06
    So for a fixed value of w and b,
  • 14:06 - 14:09
    my software will choose to make gamma hat as big as possible -
  • 14:09 - 14:13
    well, but how big can we make gamma hat? Well, it's limited by these
  • 14:13 - 14:15
    constraints. It says that every training example
  • 14:15 - 14:17
    must have functional margin
  • 14:17 - 14:20
    greater than or equal to gamma hat.
  • 14:20 - 14:24
    And so the biggest you can make gamma hat
  • 14:24 - 14:28
    will be the value of the smallest functional margin.
  • 14:28 - 14:30
    And so when you solve this optimization problem,
  • 14:30 - 14:34
    the value of gamma hat you get out will be, indeed,
  • 14:34 - 14:39
    the minimum of the functional margins of your training set. Okay, so
  • 14:39 - 14:41
    Justin? Student:Yeah, I was just
  • 14:41 - 14:42
    wondering, I
  • 14:42 - 14:46
    guess I'm a little confused because it's like, okay, you have two classes of data. And you can say, "Okay, please draw me a line
  • 14:46 - 14:50
    such that you maximize the distance between - the smallest distance that [inaudible] between the line and
  • 14:50 - 14:52
    the
  • 14:52 - 14:54
    data points."
  • 14:54 - 14:56
    And it seems like that's kind of what we're doing, but it's - it seems like
  • 14:56 - 15:00
    this is more complicated than that. And I guess I'm wondering what
  • 15:00 - 15:04
    is the difference. Instructor (Andrew Ng):I see. So I mean, this is - the question is [inaudible]. Two classes of data - trying to find a separating hyperplane. And
  • 15:04 - 15:09
    this seems
  • 15:09 - 15:12
    more complicated than trying to find a line [inaudible]. So I'm
  • 15:12 - 15:18
    just repeating the questions in case - since I'm not sure how all the audio catches it.
  • 15:18 - 15:21
    So the answer is this is actually exactly that problem. This is exactly that problem
  • 15:21 - 15:23
    of
  • 15:23 - 15:27
    given two classes of data, positive and negative examples,
  • 15:27 - 15:30
    this is exactly the formalization of the problem
  • 15:30 - 15:32
    where my goal is to find
  • 15:32 - 15:36
    a line that separates the two - the positive and negative examples,
  • 15:36 - 15:43
    maximizing the worst-case distance between the [inaudible] point and this line. Okay? Yeah, [Inaudible]? Student:So why do you care about the worst-case
  • 15:43 - 15:44
    distance [inaudible]?
  • 15:44 - 15:46
    Instructor (Andrew Ng):Yeah, let me - for now, why do we care about the worst-case distance? For now,
  • 15:46 - 15:48
  • 15:48 - 15:53
    let's just say - let's just care about the worst-case distance for now. We'll come back,
  • 15:53 - 15:55
    and we'll fix that later. We'll - that's a -
  • 15:55 - 15:58
    caring about the worst case is just -
  • 15:58 - 16:01
    is just a nice way to formulate this optimization problem. I'll come back, and I'll
  • 16:01 - 16:04
    change that later. Okay,
  • 16:04 - 16:10
    raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool.
  • 16:10 - 16:14
    Great. So let's see -
  • 16:14 - 16:16
    so this is just a different way of posing
  • 16:16 - 16:18
    the same optimization problem. And
  • 16:18 - 16:22
    on the one hand, I've gotten rid of this nasty, nonconvex constraint, while on
  • 16:22 - 16:24
    the other hand, I've now
  • 16:24 - 16:25
    added
  • 16:25 - 16:30
    a nasty, nonconvex objective. In particular, this is not a convex function in parameters
  • 16:30 - 16:31
    w.
  • 16:31 - 16:33
    And so you can't -
  • 16:33 - 16:37
    you don't have the usual guarantees like if you [inaudible]
  • 16:37 - 16:38
    global minimum.
  • 16:38 - 16:40
    At least that
  • 16:40 - 16:45
    does not follow immediately from this because this is nonconvex.
  • 16:45 - 16:48
  • 16:48 - 16:51
    So what
  • 16:51 - 16:55
    I'm going to do is,
  • 16:55 - 16:57
    earlier, I said that I can impose
  • 16:57 - 16:58
    any of
  • 16:58 - 17:03
    a number of even fairly bizarre scaling constraints on w. So you can
  • 17:03 - 17:06
    choose any scaling constraint like this, and things are still fine.
  • 17:06 - 17:08
    And so here's
  • 17:08 - 17:14
    the scaling I'm going to choose to add. Again, I'm
  • 17:14 - 17:18
    gonna assume for the purposes of today's lecture, I'm gonna assume that these examples are linearly
  • 17:18 - 17:20
    separable, that
  • 17:20 - 17:23
    you can actually separate the positive and negative classes, and that we'll come back and
  • 17:23 - 17:26
    fix this later as well.
  • 17:26 - 17:29
    But here's the scaling constraint I want to impose on w. I want
  • 17:29 - 17:34
    to impose a constraint that
  • 17:34 - 17:38
    the functional margin is equal to 1.
  • 17:38 - 17:40
    And another way of writing that
  • 17:40 - 17:44
    is that I want to impose a constraint that min
  • 17:44 - 17:51
    over i, yi -
  • 17:52 - 17:55
    that the worst-case functional margin is equal to 1.
  • 17:55 - 17:59
    And clearly, this is a scaling constraint because if
  • 17:59 - 18:02
  • 18:02 - 18:06
    you solve for w and b, and you find that your worst-case functional margin is actually 10 or
  • 18:06 - 18:07
    whatever,
  • 18:07 - 18:10
    then by dividing through w and b by a factor of 10, I can get
  • 18:10 - 18:13
    my functional margin to be equal to 1.
  • 18:13 - 18:16
    So this is a scaling constraint [inaudible] would
  • 18:16 - 18:17
    imply. And this is
  • 18:17 - 18:19
    just more compactly written
  • 18:19 - 18:23
    as follows. This is imposing a constraint that the functional margin be
  • 18:23 - 18:26
    equal to 1.
  • 18:26 - 18:29
    And so if we just take
  • 18:29 - 18:32
    what I wrote down as No. 2 of our previous optimization problem and add the
  • 18:32 - 18:35
    scaling constraint,
  • 18:35 - 18:38
    we then get the following optimization problem:
  • 18:38 - 18:45
    min over w, b.
  • 18:57 - 19:01
    I guess previously, we had a maximization
  • 19:01 - 19:06
    over gamma hat divided by the norm of w. So that's maximizing
  • 19:06 - 19:10
    1 over the norm of w, but that's the same as minimizing the norm of w squared. So maximizing 1 over the
  • 19:10 - 19:11
    norm of w
  • 19:11 - 19:15
    is minimizing the norm of w squared. And then these
  • 19:15 - 19:16
    are our constraints.
  • 19:16 - 19:21
    Since I've added the constraint, the functional margin
  • 19:21 - 19:24
    is equal to 1. And this is actually
  • 19:24 - 19:27
    my
  • 19:27 - 19:30
    final - well,
  • 19:30 - 19:37
    final formulation of the optimal margin classifier problem, at least for now.
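As a sketch of the step just described: once the scaling constraint fixes the worst-case functional margin gamma hat at 1, the objective gamma hat over the norm of w becomes 1 over the norm of w, and maximizing that is equivalent to minimizing the norm of w squared. Here m (the number of training examples) is notation assumed for illustration; a factor of 1/2 is often placed in front of the objective for convenience and does not change the minimizer.

```latex
\max_{w, b} \; \frac{1}{\lVert w \rVert}
\quad \Longleftrightarrow \quad
\min_{w, b} \; \lVert w \rVert^2
\qquad \text{subject to} \quad
y_i \left( w^T x_i + b \right) \ge 1, \quad i = 1, \dots, m
```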
  • 19:37 - 19:38
    So the picture
  • 19:38 - 19:41
    to
  • 19:41 - 19:45
    keep in mind for this, I guess, is that our
  • 19:45 - 19:48
    optimization objective is that you want to minimize the norm of w squared. And so our
  • 19:48 - 19:50
    optimization objective
  • 19:50 - 19:52
    is just the [inaudible] quadratic function.
  • 19:52 - 19:55
    And [inaudible] those pictures [inaudible] can draw
  • 19:55 - 19:57
    it. So it -
  • 19:57 - 19:59
    if [inaudible] is w1 and w2, and you
  • 19:59 - 20:03
    want to minimize the quadratic function like this - so quadratic function
  • 20:03 - 20:06
    just has [inaudible] that look like this.
  • 20:06 - 20:10
    And moreover, you have a number of linear constraints in your parameters, so you may have linear
  • 20:10 - 20:12
    constraints that eliminate that half space or
  • 20:12 - 20:17
    a linear constraint that eliminates that half space [inaudible]. So there's that half space
  • 20:17 - 20:20
  • 20:20 - 20:23
    and so on.
  • 20:23 - 20:26
    And so the picture is you have a quadratic function,
  • 20:26 - 20:27
    and you're ruling out
  • 20:27 - 20:29
  • 20:29 - 20:34
    various half spaces with each of these linear constraints. And I hope - if
  • 20:34 - 20:38
    you can picture this in 3D, I guess [inaudible] kinda draw our own 3D, hope you can
  • 20:38 - 20:43
    convince yourself that this is a convex problem that has no local optima. So if you
  • 20:43 - 20:46
    run gradient
  • 20:46 - 20:50
    [inaudible] within this set of points that hasn't been ruled out, then you converge to the
  • 20:50 - 20:53
    global optimum.
  • 20:53 - 20:56
    And so that's a convex optimization problem.
  • 20:56 - 20:59
    The - does this [inaudible] nice and
  • 20:59 - 21:02
    [inaudible].
  • 21:02 - 21:09
    Questions about this?
  • 21:17 - 21:24
    Actually, just raise your hand if this makes sense. Okay, cool.
  • 21:26 - 21:30
    So this gives you the optimal margin classifier algorithm.
  • 21:30 - 21:32
    And it turns out that
  • 21:32 - 21:34
    this is a convex optimization problem,
  • 21:34 - 21:37
    so you can actually take this formulation of the problem
  • 21:37 - 21:41
    and throw it at off-the-shelf software - what's called a QP or quadratic
  • 21:41 - 21:45
    program software. This [inaudible] optimization is called a quadratic program,
  • 21:45 - 21:49
    where you have a convex quadratic objective function and [inaudible] constraints -
  • 21:49 - 21:54
    so you can actually download software to solve these optimization problems for you.
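A minimal sketch of handing this formulation to off-the-shelf software, assuming NumPy and SciPy are available. The toy data set is made up for illustration, and SciPy's general-purpose SLSQP method stands in here for a dedicated QP package; the factor of 1/2 in the objective is a common convenience and is not required by the lecture's formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data (not from the lecture).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    """Half the squared norm of w; params packs (w, b) with b last."""
    w = params[:-1]
    return 0.5 * np.dot(w, w)

def margin_constraints(params):
    """SLSQP 'ineq' constraints must be >= 0: y_i (w^T x_i + b) - 1 >= 0."""
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0

# A feasible starting point for this particular toy data set.
x0 = np.array([1.0, 1.0, 0.0])
result = minimize(
    objective,
    x0,
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],
)

w_opt, b_opt = result.x[:-1], result.x[-1]
geometric_margin = 1.0 / np.linalg.norm(w_opt)
print(w_opt, b_opt, geometric_margin)
```

For this data the closest points to the boundary are (1, 1) and (-1, -1), so the solver should land near w = (0.5, 0.5) and b = 0, giving a geometric margin of about the square root of 2.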
  • 21:54 - 21:56
    Usually, as you wanna use the -
  • 21:56 - 21:57
    use
  • 21:57 - 22:00
    [inaudible] because you have constraints like these, although you could actually modify [inaudible] work with
  • 22:00 - 22:04
    this, too.
  • 22:04 - 22:06
    So we could just declare success and say that we're done with this formulation of the
  • 22:06 - 22:08
    problem. But
  • 22:08 - 22:11
    what I'm going to do now is take a
  • 22:11 - 22:15
    digression to talk about primal and dual optimization problems.
  • 22:15 - 22:16
    And in particular, I'm going to
  • 22:16 - 22:18
    -
  • 22:18 - 22:22
    later, I'm going to come back and derive yet another very different form
  • 22:22 - 22:25
    of this optimization problem.
  • 22:25 - 22:29
    And the reason we'll do that is because it turns out this optimization problem has
  • 22:29 - 22:33
    certain properties that make it amenable to very efficient algorithms.
  • 22:33 - 22:35
    And moreover, I'll be deriving
  • 22:35 - 22:37
    what's called the dual formulation of this
  • 22:37 - 22:40
    that allows us to
  • 22:40 - 22:43
    apply the optimal margin classifier even in
  • 22:43 - 22:47
    very high-dimensional feature spaces - even in sometimes infinite
  • 22:47 - 22:50
    dimensional feature spaces. So we
  • 22:50 - 22:52
    can come back to that later.
  • 22:52 - 22:53
    But
  • 22:53 - 22:54
    let me now,
  • 22:54 - 23:01
    since I'm talking about
  • 23:03 - 23:06
    convex optimization. So how many here is - how many of you, from, I don't
  • 23:06 - 23:08
    know, calculus,
  • 23:08 - 23:12
    remember the method of Lagrange multipliers for
  • 23:12 - 23:14
    solving an optimization problem
  • 23:14 - 23:18
    like a minimization or maximization problem subject to some constraint?
  • 23:18 - 23:23
    How many of you remember the method of Lagrange multipliers for that? Oh, okay,
  • 23:23 - 23:24
    cool. Some of you, yeah.
  • 23:24 - 23:28
    So if you don't remember, don't worry. I - I'll describe that briefly
  • 23:28 - 23:28
    here
  • 23:28 - 23:32
    as well, but what I'm really gonna do is talk about the generalization of this method
  • 23:32 - 23:34
    of Lagrange multipliers
  • 23:34 - 23:38
    that you may or may not have seen in some calculus classes. But if you haven't
  • 23:38 - 23:40
    seen it before, don't worry about it.
  • 23:40 - 23:43
  • 23:43 - 23:49
    So the method of Lagrange multipliers is - was - well, suppose there's some
  • 23:49 - 23:53
    function you want to minimize, so minimize f of w,
  • 23:53 - 23:56
    subject to
  • 23:56 - 24:03
    some set of constraints that each hi of w must equal 0 - for i = 1 [inaudible] l. And
  • 24:03 - 24:07
    given this constraint, I'll actually usually write it in vectorial
  • 24:07 - 24:10
    form in which I write h of w
  • 24:10 - 24:11
    as this
  • 24:11 - 24:17
    vector-valued function.
  • 24:17 - 24:21
    So that is equal to 0, where the 0 has an arrow on top. I use
  • 24:21 - 24:25
    that to denote the vector of
  • 24:25 - 24:27
    all 0s. So you want to solve this optimization problem.
  • 24:27 - 24:29
  • 24:29 - 24:32
    Some of you have seen method of Lagrange multipliers where
  • 24:32 - 24:35
    you construct this
  • 24:35 - 24:39
    [inaudible] Lagrangian,
  • 24:39 - 24:46
  • 24:46 - 24:50
    which is the original optimization objective plus some [inaudible] Lagrange multipliers times the
  • 24:50 - 24:52
    hi constraints.
  • 24:52 - 24:55
    And these parameters, the beta i's,
  • 24:55 - 25:02
    we call the Lagrange multipliers.
  • 25:02 - 25:06
    And so the way you actually solve the optimization problem is
  • 25:06 - 25:11
    you take the partial derivative of this with respect to the original parameters
  • 25:11 - 25:14
    and set that to 0. So
  • 25:14 - 25:20
    the partial derivative with respect to your Lagrange multipliers [inaudible], and set that to 0.
  • 25:20 - 25:24
    And then the same as theorem through [inaudible], I guess [inaudible]
  • 25:24 - 25:25
    Lagrange
  • 25:25 - 25:31
    was that for w - for some value w star to be a
  • 25:31 - 25:32
    solution,
  • 25:32 - 25:35
    it
  • 25:35 - 25:42
    is necessary
  • 25:43 - 25:48
    that - can this
  • 25:48 - 25:49
    be
  • 25:49 - 25:50
  • 25:50 - 25:54
    the star? Student:Right. Instructor (Andrew Ng):The backwards e - there exists. So there exists
  • 25:54 - 25:55
    beta star
  • 25:55 - 26:00
  • 26:00 - 26:07
  • 26:17 - 26:20
    such that those partial derivatives are
  • 26:20 - 26:22
    equal to
  • 26:22 - 26:22
    0.
  • 26:22 - 26:24
    So the
  • 26:24 - 26:25
    method
  • 26:25 - 26:29
    of Lagrange multipliers is to solve this problem,
  • 26:29 - 26:31
    you construct a Lagrangian,
  • 26:31 - 26:33
    take the derivative with respect to
  • 26:33 - 26:35
    the original parameters b, the original
  • 26:35 - 26:40
    parameters w, and with respect to the Lagrange multipliers beta.
  • 26:40 - 26:42
    Set the partial derivatives equal to 0, and solve
  • 26:42 - 26:45
    for our solutions. And then you check each of the solutions to see if it is indeed a minimum.
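A small worked sketch of this recipe, assuming NumPy. The problem itself is made up for illustration: minimize f(w) = w1 squared + w2 squared subject to h(w) = w1 + w2 - 1 = 0. The Lagrangian is f(w) + beta times h(w), and because f is quadratic and h is linear, setting the partial derivatives to zero gives a linear system that can be solved directly:

```python
import numpy as np

# Hypothetical example problem (not from the lecture):
#   minimize   f(w) = w1^2 + w2^2
#   subject to h(w) = w1 + w2 - 1 = 0
# Lagrangian:  L(w, beta) = w1^2 + w2^2 + beta * (w1 + w2 - 1)
#
# Setting the partial derivatives to zero:
#   dL/dw1   = 2*w1 + beta = 0
#   dL/dw2   = 2*w2 + beta = 0
#   dL/dbeta = w1 + w2 - 1 = 0
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
])
rhs = np.array([0.0, 0.0, 1.0])

w1, w2, beta = np.linalg.solve(A, rhs)
print(w1, w2, beta)
```

The stationary point is w = (0.5, 0.5) with beta = -1. Since f is convex and the constraint is linear, this candidate is the constrained minimum; in general each candidate still has to be checked.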
  • 26:45 - 26:48
    Great.
  • 26:48 - 26:52
    So great
  • 26:52 - 26:57
    -
  • 26:57 - 26:59
    so what I'm going to do is actually
  • 26:59 - 27:00
    write
  • 27:00 - 27:03
    down the generalization of this. And
  • 27:03 - 27:06
    if you haven't seen that before, don't worry about it. This is [inaudible].
  • 27:06 - 27:11
    So what I'm going to do is actually write down the generalization of this to solve
  • 27:11 - 27:16
    a slightly more difficult type of constrained optimization problem,
  • 27:16 - 27:20
    which is suppose you want to minimize f of w
  • 27:20 - 27:24
    subject to the constraint that gi of w,
  • 27:24 - 27:26
    excuse me,
  • 27:26 - 27:31
    is less than or equal to
  • 27:31 - 27:35
    0, and that hi of w is equal to 0.
  • 27:35 - 27:38
    And
  • 27:38 - 27:43
    again, using my vector notation, I'll write this as g of w is less than or equal to 0. And h of w is equal to 0.
  • 27:43 - 27:45
    So
  • 27:45 - 27:49
    in [inaudible]'s
  • 27:49 - 27:52
    case, we now have inequality constraints as well as
  • 27:52 - 27:59
    equality constraints.
  • 28:02 - 28:03
    I then have
  • 28:03 - 28:07
    a Lagrangian, or it's actually still - called - say generalized
  • 28:07 - 28:08
    Lagrangian,
  • 28:08 - 28:13
    which is now a function of my original optimization parameters w,
  • 28:13 - 28:18
    as well as two sets of Lagrange multipliers, alpha and beta.
  • 28:18 - 28:20
    And so this will be
  • 28:20 - 28:27
    f of w.
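The formula written on the board at this point is not captured in the transcript. A standard form of the generalized Lagrangian, consistent with the two sets of multipliers alpha and beta named above, would be the following. The counts k and l of inequality and equality constraints are notation assumed here, not taken from the lecture audio.

```latex
\mathcal{L}(w, \alpha, \beta)
  = f(w) + \sum_{i=1}^{k} \alpha_i \, g_i(w) + \sum_{i=1}^{l} \beta_i \, h_i(w)
```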
  • 28:29 - 28:33
    Now,
  • 28:33 - 28:37
  • 28:37 - 28:39
    here's a
  • 28:39 - 28:40
    cool part.
  • 28:40 - 28:43
    I'm going to define theta
  • 28:43 - 28:44
  • 28:44 - 28:46
    subscript p of
  • 28:46 - 28:47
    w
  • 28:47 - 28:49
    to be equal to
  • 28:49 - 28:52
    max over alpha, beta, subject to the
  • 28:52 - 28:59
    constraints that the alphas are greater than or equal
  • 28:59 - 29:06
    to 0, of the Lagrangian.
  • 29:10 - 29:15
    And so
  • 29:15 - 29:18
    I want you to consider
  • 29:18 - 29:22
    the optimization problem
  • 29:22 - 29:24
    min over w of
  • 29:24 - 29:31
    max over alpha, beta, such that the alphas are greater than or equal to 0, of the Lagrangian. And
  • 29:32 - 29:36
    that's just equal to min over w,
  • 29:36 - 29:42
    theta p of
  • 29:42 - 29:48
    w. And just to give us a name, the [inaudible] - the subscript p here stands for the
  • 29:48 - 29:50
    primal problem.
  • 29:50 - 29:57
    And that refers to this entire thing.
  • 29:57 - 30:01
    This optimization problem that I've written down is called the primal problem. This means
  • 30:01 - 30:04
    it's the original optimization problem which [inaudible] solving.
  • 30:04 - 30:08
    And later on, I'll derive another version of this, but that's what p
  • 30:08 - 30:13
    stands for. It's a - this is a primal problem.
  • 30:13 - 30:18
    Now, I want you to look at - consider theta p again. And in particular, I wanna
  • 30:18 - 30:22
    consider the problem of what happens if you minimize w
  • 30:22 - 30:24
    - minimize as a function of w
  • 30:24 - 30:27
    this quantity theta p.
  • 30:27 - 30:34
  • 30:34 - 30:38
    So let's look at what theta p of w is. Notice
  • 30:38 - 30:40
    that
  • 30:40 - 30:42
    if
  • 30:42 - 30:43
    gi of w
  • 30:43 - 30:47
    is greater than 0, so let's pick a value of w.
  • 30:47 - 30:50
    And let's ask what is theta p of w?
  • 30:50 - 30:57
    So if w violates one of your primal problem's constraints,
  • 30:57 - 31:00
  • 31:00 - 31:04
    then theta p of w would be infinity.
  • 31:04 - 31:06
    Why is that?
  • 31:06 - 31:09
    [Inaudible] p [inaudible] second.
  • 31:09 - 31:13
    Suppose I pick a value of w that violates one of these constraints. So gi of
  • 31:13 - 31:17
    w is positive.
  • 31:17 - 31:22
    Then - well, theta p is this - maximize this function of alpha and beta - the Lagrangian. So if
  • 31:22 - 31:26
    one of these gi of w's is positive,
  • 31:26 - 31:30
    then by setting the corresponding alpha i to plus infinity, I can make this
  • 31:30 - 31:32
    arbitrarily large.
  • 31:32 - 31:35
    And so if w violates one of my
  • 31:35 - 31:38
    primal problem's constraints in one of the gis, then
  • 31:38 - 31:42
    max over alpha of this Lagrangian will be plus
  • 31:42 - 31:49
    infinity. There's some of - and in the same way - I guess in a similar way,
  • 31:50 - 31:55
    if hi of w is not equal to
  • 31:55 - 31:58
    0,
  • 31:58 - 32:03
    then theta p of w will also be infinity for a very similar reason because
  • 32:03 - 32:06
    if hi of w is not equal to 0 for some value of i,
  • 32:06 - 32:10
    then in my Lagrangian, I had a beta i times hi term.
  • 32:10 - 32:13
    And so by setting beta i to be plus infinity or minus
  • 32:13 - 32:16
    infinity depending on the sign of hi,
  • 32:16 - 32:20
    I can make this plus infinity as well.
  • 32:20 - 32:22
    And otherwise,
  • 32:22 - 32:29
  • 32:30 - 32:34
  • 32:34 - 32:38
    theta p of w is just equal to f of w. Turns
  • 32:38 - 32:40
    out if
  • 32:40 - 32:42
    I had a value of w that
  • 32:42 - 32:45
    satisfies all of the gi and the hi constraints,
  • 32:45 - 32:48
    then when we maximize in terms of alpha and beta,
  • 32:48 - 32:49
    the maximum over all the Lagrange multiplier
  • 32:49 - 32:54
    terms will actually be obtained by
  • 32:54 - 32:58
    setting all the Lagrange multiplier terms to be 0,
  • 32:58 - 32:59
    and so theta p
  • 32:59 - 33:04
    is just left with f of w. Thus, theta
  • 33:04 - 33:06
    p
  • 33:06 - 33:11
    of w is equal to
  • 33:11 - 33:14
    f of w
  • 33:14 - 33:17
    if constraints are
  • 33:17 - 33:18
    satisfied
  • 33:18 - 33:21
    [inaudible]
  • 33:21 - 33:23
    the gi and
  • 33:23 - 33:24
    hi constraints,
  • 33:24 - 33:28
    and is equal to plus infinity
  • 33:28 - 33:32
    otherwise.
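The case analysis just described can be summarized as a piecewise definition. This is a sketch in standard notation; the symbol p* for the optimal primal value is used later in the lecture.

```latex
\theta_P(w) = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)
  = \begin{cases}
      f(w)    & \text{if } g_i(w) \le 0 \text{ and } h_i(w) = 0 \text{ for all } i, \\
      +\infty & \text{otherwise,}
    \end{cases}
\qquad
p^* = \min_w \theta_P(w)
```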
  • 33:32 - 33:34
    So the problem I
  • 33:34 - 33:39
    wrote down, to minimize as a function of w -
  • 33:39 - 33:45
    theta p of w - this is [inaudible] problem. That's
  • 33:45 - 33:48
    just
  • 33:48 - 33:51
    exactly the same problem as my original primal problem
  • 33:51 - 33:55
    because if you choose a value of w that violates the constraints, you get infinity.
  • 33:55 - 33:58
    And if you satisfy the constraints, you get f of w.
  • 33:58 - 34:00
    So this is really just the same as - well,
  • 34:00 - 34:01
    we'll say,
  • 34:01 - 34:05
    "Satisfy the constraints, and minimize f of w." That's really what
  • 34:05 - 34:09
    minimizing theta p
  • 34:09 - 34:16
    of w is. Raise your hand if this makes sense. Yeah, okay, cool. So all right.
  • 34:25 - 34:28
    I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly
  • 34:28 - 34:31
    the same thing we started with. So here's
  • 34:31 - 34:33
    the cool part.
  • 34:33 - 34:40
    Let me now define the dual problem. Define
  • 34:40 - 34:42
    theta
  • 34:42 - 34:45
    d
  • 34:45 - 34:47
    and d [inaudible] dual,
  • 34:47 - 34:50
    and this is a function of alpha and beta. It's a function of the Lagrange
  • 34:50 - 34:53
    multipliers. It's not a function of w.
  • 34:53 - 34:58
    Define this to be the min over w of
  • 34:58 - 35:05
    my generalized Lagrangian. And
  • 35:06 - 35:13
    my dual problem is this. So in
  • 35:17 - 35:20
    other words, this is max over
  • 35:20 - 35:22
    that.
  • 35:22 - 35:23
  • 35:23 - 35:28
  • 35:28 - 35:33
    And so this is my dual optimization problem. To maximize over alpha and
  • 35:33 - 35:34
    beta, theta
  • 35:34 - 35:35
    d of alpha
  • 35:35 - 35:38
    and beta. So this optimization problem, I
  • 35:38 - 35:41
    guess, is my dual problem. I
  • 35:41 - 35:45
    want you to compare this to our previous primal optimization problem.
  • 35:45 - 35:47
    The only difference is that
  • 35:47 - 35:52
    I took the max and min, and I switched the order around with the max and
  • 35:52 - 35:56
    min. That's the difference in the primal and the dual optimization [inaudible].
  • 35:56 - 36:00
    And it turns out that
  • 36:00 - 36:02
  • 36:02 - 36:05
    it's a - it's sort of - it's a fact - it's
  • 36:05 - 36:06
    true, generally, that d
  • 36:06 - 36:10
    star is less than [inaudible] p star. In other words, I think I defined
  • 36:10 - 36:16
    p star previously. P star was the value of the primal optimization problem.
  • 36:16 - 36:20
    And in other words, that it's just generally true
  • 36:20 - 36:25
    that the max of the min of something is less than or equal to the min of the max
  • 36:25 - 36:28
    of something.
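Written out, the dual function, the dual problem, and the inequality just stated look as follows. This is a sketch in standard notation; d* and p* denote the optimal dual and primal values as in the lecture.

```latex
\theta_D(\alpha, \beta) = \min_w \mathcal{L}(w, \alpha, \beta),
\qquad
d^* = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \; \min_w \mathcal{L}(w, \alpha, \beta)
\;\le\;
\min_w \; \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = p^*
```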
  • 36:28 - 36:29
    And this is a
  • 36:29 - 36:30
    general fact.
  • 36:30 - 36:33
    And just as a concrete example, the
  • 36:33 - 36:37
    max over y in the set {0, 1} -
  • 36:37 - 36:40
    oh, excuse me, of the min over x in the
  • 36:40 - 36:44
    set {0, 1}
  • 36:44 - 36:48
    of indicator x =
  • 36:48 - 36:51
    y - this is
  • 36:51 - 36:58
    [inaudible] equal to
  • 37:03 - 37:10
    min.
  • 37:10 - 37:14
    So this equality - this inequality actually holds true for any
  • 37:14 - 37:16
    function you might put in here.
  • 37:16 - 37:18
    And this is one specific example
  • 37:18 - 37:21
    where
  • 37:21 - 37:24
    the min over xy - excuse me, min over x of [inaudible] equals y -
  • 37:24 - 37:27
    this is always equal to 0
  • 37:27 - 37:31
    because whatever y is, you can choose x to be something different. So
  • 37:31 - 37:33
    this is always 0,
  • 37:33 - 37:34
    whereas if
  • 37:34 - 37:38
    I exchange the order of the min
  • 37:38 - 37:40
    and max, then the thing here is always equal to
  • 37:40 - 37:42
    1. So 0 [inaudible] to
  • 37:42 - 37:45
    1. And more generally, this min/max - excuse
  • 37:45 - 37:51
    me, this max/min being less than or equal to the min/max holds true for any function you might put in
  • 37:51 - 37:52
    there.
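The indicator example above is small enough to check by brute force. The following is a minimal sketch; the names `indicator`, `values`, `max_min` and `min_max` are illustrative and not from the lecture.

```python
def indicator(x, y):
    """Return 1 if x equals y, else 0."""
    return 1 if x == y else 0

values = (0, 1)

# max over y of (min over x of 1{x == y}):
# for any fixed y, x can be chosen different from y, so the inner min is 0.
max_min = max(min(indicator(x, y) for x in values) for y in values)

# min over x of (max over y of 1{x == y}):
# for any fixed x, y can be chosen equal to x, so the inner max is 1.
min_max = min(max(indicator(x, y) for y in values) for x in values)

print(max_min, min_max)
```

Swapping the order of the two operators changes the value here, which is why equality of the primal and dual values needs extra conditions.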
  • 37:52 - 37:58
    But it turns out that sometimes under certain conditions,
  • 37:58 - 37:59
  • 37:59 - 38:01
  • 38:01 - 38:02
  • 38:02 - 38:03
  • 38:03 - 38:07
    these two optimization problems have the same value. Sometimes under certain
  • 38:07 - 38:08
    conditions,
  • 38:08 - 38:10
    the primal and the dual problems
  • 38:10 - 38:12
    have the same value.
  • 38:12 - 38:15
    And so
  • 38:15 - 38:16
    you might be able to solve
  • 38:16 - 38:20
    the dual problem rather than the primal problem.
  • 38:20 - 38:22
    And the reason to do that is that
  • 38:22 - 38:26
    sometimes, which we'll see in the optimal margin classifier problem, the support vector machine problem,
  • 38:26 - 38:31
    the dual problem turns out to be much easier, and it often has many useful properties that
  • 38:31 - 38:34
    will make it easier to use
  • 38:34 - 38:41
    compared to the primal. So for the sake of -
  • 38:42 - 38:48
    so
  • 38:48 - 38:55
    what
  • 39:02 - 39:05
    I'm going to do now is write down formally the certain conditions under which that's
  • 39:05 - 39:09
    true - where the primal and the dual problems are equivalent.
  • 39:09 - 39:14
    And so our strategy for working out the [inaudible] of support vector machine algorithm
  • 39:14 - 39:18
    will be that we'll write down the primal optimization problem, which we did
  • 39:18 - 39:21
    previously, for the maximum margin classifier.
  • 39:21 - 39:22
    And then we'll
  • 39:22 - 39:24
    derive the dual optimization problem for that.
  • 39:24 - 39:26
    And then we'll solve the dual problem.
  • 39:26 - 39:29
    And by modifying that a little bit, that's how we'll derive this support vector machine. But let me ask you - for
  • 39:29 - 39:30
  • 39:30 - 39:34
    now, let me just first, for
  • 39:34 - 39:37
    the sake of completeness, I just write down the conditions under which the primal
  • 39:37 - 39:42
    and the dual optimization problems give you the same solutions. So let f
  • 39:42 - 39:45
    be convex. If you're not
  • 39:45 - 39:47
  • 39:47 - 39:50
    sure what convex means, for the purposes of this class, you can take it to
  • 39:50 - 39:53
    mean that the
  • 39:53 - 39:56
    Hessian, H, is positive semidefinite, so it just means it's
  • 39:56 - 39:58
    a [inaudible] function like that.
  • 39:58 - 39:59
    And
  • 39:59 - 40:01
    to learn more about optimization
  • 40:01 - 40:07
    - again, please come to this week's discussion session taught by the TAs.
  • 40:07 - 40:09
    Then
  • 40:09 - 40:11
    suppose
  • 40:11 - 40:13
    hi
  • 40:13 - 40:18
    - the hi constraints are affine, and what that means is that hi of w equals a i
  • 40:18 - 40:25
    transpose w plus b i. This actually
  • 40:25 - 40:29
    means the same thing as linear. Without the term b here, we say
  • 40:29 - 40:30
    that hi is linear,
  • 40:30 - 40:34
    whereas when we have a constant intercept as well, this is technically called affine rather than
  • 40:34 - 40:37
    linear.
  • 40:37 - 40:40
    And let's suppose
  • 40:40 - 40:43
  • 40:43 - 40:50
    that gi's are strictly feasible.
  • 40:51 - 40:54
    And
  • 40:54 - 40:57
    what that means is that
  • 40:57 - 41:01
    there exists a value of w such that
  • 41:01 - 41:04
    for all i,
  • 41:04 - 41:07
    gi of w is less
  • 41:07 - 41:11
    than 0. Don't worry too much [inaudible]. I'm writing these things down for the sake of completeness, but don't worry too much about all the
  • 41:11 - 41:13
    technical details.
  • 41:13 - 41:16
    Strictly feasible, which just means that there's a value of w such that
  • 41:16 - 41:19
    all of these constraints are satisfied with strict inequality rather than with
  • 41:19 - 41:22
    less than or equal to.
  • 41:22 - 41:22
  • 41:22 - 41:25
    Under these conditions,
  • 41:25 - 41:26
  • 41:26 - 41:30
    there will exist w star,
  • 41:30 - 41:32
    alpha
  • 41:32 - 41:34
    star, beta
  • 41:34 - 41:40
    star such that w star solves the primal problem.
  • 41:40 - 41:42
    And alpha star
  • 41:42 - 41:49
    and beta star, the Lagrange multipliers, solve the dual problem.
  • 41:52 - 41:54
  • 41:54 - 41:59
    And the value of the primal problem will be equal to the value of the dual problem will
  • 41:59 - 42:04
    be equal to the value of your Lagrange multiplier - excuse
  • 42:04 - 42:06
    me, will be equal to the value of your generalized Lagrangian, evaluated
  • 42:06 - 42:09
    at w star, alpha star, beta star.
  • 42:09 - 42:13
    In other words, you can solve either the primal or the dual problem. You get
  • 42:13 - 42:15
    the same
  • 42:15 - 42:20
    solution. Further,
  • 42:20 - 42:23
    your parameters will satisfy
  • 42:23 - 42:24
  • 42:24 - 42:27
  • 42:27 - 42:32
    these conditions. The partial derivatives with respect to the parameters will be
  • 42:32 - 42:33
    0. And
  • 42:33 - 42:37
    actually, to keep this equation in mind, we'll actually use this in a second
  • 42:37 - 42:39
    when we take the Lagrangian of
  • 42:39 - 42:43
    our support vector machine problem, and take a derivative with respect to w to solve a -
  • 42:43 - 42:45
    to solve our -
  • 42:45 - 42:47
    to derive our dual problem. We'll actually
  • 42:47 - 42:50
    perform this step ourselves in a second.
  • 42:50 - 42:52
  • 42:52 - 42:58
    Partial derivative with respect to the Lagrange multiplier beta is
  • 42:58 - 43:02
  • 43:02 - 43:05
    equal
  • 43:05 - 43:06
    to
  • 43:06 - 43:08
    0.
  • 43:08 - 43:10
    Turns out this will hold true,
  • 43:10 - 43:12
    too.
  • 43:12 - 43:14
    This is called
  • 43:14 - 43:19
  • 43:19 - 43:21
    the -
  • 43:21 - 43:27
    well
  • 43:27 - 43:29
    -
  • 43:29 - 43:32
    this is called the KKT complementary condition.
  • 43:32 - 43:38
    KKT stands for Karush-Kuhn-Tucker, who were the authors of this theorem.
  • 43:38 - 43:42
    Well, and by tradition, usually this [inaudible] KKT conditions.
  • 43:42 - 43:49
    But the other two are
  • 43:50 - 43:53
    - just so the [inaudible] is greater than 0, which we had
  • 43:53 - 43:55
    previously
  • 43:55 - 44:02
    and that your constraints are actually satisfied.
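Collected in one place, the conditions listed above are usually written as follows. This is a sketch in standard notation; the index ranges n, k and l are assumptions for the number of parameters, inequality constraints and equality constraints.

```latex
\begin{aligned}
\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) &= 0, & i &= 1, \dots, n \\
\frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) &= 0, & i &= 1, \dots, l \\
\alpha_i^* \, g_i(w^*) &= 0, & i &= 1, \dots, k \\
g_i(w^*) &\le 0, & i &= 1, \dots, k \\
\alpha_i^* &\ge 0, & i &= 1, \dots, k
\end{aligned}
```

The third line is the complementary condition discussed next.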
  • 44:03 - 44:10
    So let's see. [Inaudible] All
  • 44:22 - 44:29
    right.
  • 44:30 - 44:34
    So let's take those and apply this to our
  • 44:34 - 44:37
    optimal margin
  • 44:37 - 44:41
    optimization problem that we had previously. I
  • 44:41 - 44:44
    was gonna say one word about this,
  • 44:44 - 44:47
    which is -
  • 44:47 - 44:49
    was gonna say one word about this
  • 44:49 - 44:51
    KKT
  • 44:51 - 44:52
    complementary condition is
  • 44:52 - 44:54
    that a condition that is a -
  • 44:54 - 45:01
    at your solution, you must have that alpha star i times gi of w star is equal to 0.
  • 45:01 - 45:03
    So
  • 45:03 - 45:04
    let's see.
  • 45:04 - 45:09
    So the product of two numbers is equal to 0. That means that
  • 45:09 - 45:09
  • 45:09 - 45:12
    at least one of these things must be equal to
  • 45:12 - 45:16
    0. For the product of two things to be equal to 0, well, that's just saying either alpha
  • 45:16 - 45:17
    i
  • 45:17 - 45:21
    or gi is equal to 0.
  • 45:21 - 45:26
    So what that implies is that the - just Karush-Kuhn-Tucker
  • 45:26 - 45:32
  • 45:32 - 45:38
  • 45:38 - 45:43
    - most people just say KKT, but we wanna show you the right
  • 45:43 - 45:46
    spelling
  • 45:46 - 45:47
    of their names. So
  • 45:47 - 45:48
    KKT
  • 45:48 - 45:52
    complementary condition implies that if alpha
  • 45:52 - 45:55
    i is not 0,
  • 45:55 - 46:01
    that necessarily implies that
  • 46:01 - 46:03
    gi of w star
  • 46:03 - 46:06
    is equal to
  • 46:06 - 46:11
    0.
  • 46:11 - 46:13
    And
  • 46:13 - 46:15
    usually,
  • 46:15 - 46:21
  • 46:21 - 46:24
  • 46:24 - 46:26
    it turns out - so
  • 46:26 - 46:29
    all the KKT condition guarantees is that
  • 46:29 - 46:33
    at least one of them is 0. It may actually be the case that both
  • 46:33 - 46:35
    alpha and gi are both equal to 0.
  • 46:35 - 46:39
    But in practice, when you solve this optimization problem, you find that
  • 46:39 - 46:43
    for the most part, alpha i star is not equal to 0 if and only if gi of
  • 46:43 - 46:45
    w star is equal to 0.
  • 46:45 - 46:50
    This is not strictly true because it's possible that both of these may be 0.
  • 46:50 - 46:52
    But in practice,
  • 46:52 - 46:55
    when we - because when we solve problems like these, you're, for the most part,
  • 46:55 - 46:59
    usually exactly one of these will be non-0.
  • 46:59 - 47:03
    And also, when this holds true, when gi of w star is equal to 0,
  • 47:03 - 47:06
    we say that
  • 47:06 - 47:07
    gi
  • 47:07 - 47:14
    - gi of w, I guess, is an active
  • 47:16 - 47:18
    constraint
  • 47:18 - 47:22
    because we call a constraint - our constraint was that gi of w must be less than or equal to
  • 47:22 - 47:23
    0.
  • 47:23 - 47:26
    And so if it is equal to 0, then
  • 47:26 - 47:33
    we say that this is an active constraint.
  • 47:34 - 47:37
    Once we talk about [inaudible], we come back and [inaudible] and just extend this
  • 47:37 - 47:43
    idea a little bit more.
  • 47:43 - 47:48
    [Inaudible] board. [Inaudible]
  • 47:48 - 47:52
    turn
  • 47:52 - 47:59
    to this board in a second, but -
  • 48:07 - 48:11
    so let's go back and work out one of the primal and the dual optimization problems for
  • 48:11 - 48:12
  • 48:12 - 48:17
    our optimal margin classifier for the optimization problem that we worked on just now. As
  • 48:17 - 48:20
    a point of notation,
  • 48:20 - 48:25
    in whatever I've been writing down so far in deriving the KKT conditions,
  • 48:25 - 48:28
  • 48:28 - 48:32
    the Lagrange multipliers were alpha i and beta i,
  • 48:32 - 48:36
    it turns out that when applied as [inaudible]
  • 48:36 - 48:37
    dm,
  • 48:37 - 48:41
    turns out we only have one set of Lagrange multipliers alpha i.
  • 48:41 - 48:44
    And also,
  • 48:44 - 48:47
    as I was working out the KKT conditions, I used w
  • 48:47 - 48:52
    to denote the parameters of my primal optimization problem. [Inaudible] I wanted to
  • 48:52 - 48:54
    minimize f of w. In my
  • 48:54 - 48:58
    very first optimization problem, I had
  • 48:58 - 48:59
    that optimization problem [inaudible] finding the parameters
  • 48:59 - 49:02
    w. In my SVM problem, I'm
  • 49:02 - 49:05
    actually gonna have two sets of parameters, w and b. So
  • 49:05 - 49:08
    this is just a - keep that
  • 49:08 - 49:15
    sort of slight notation change in mind. So
  • 49:16 - 49:20
    problem we worked out previously was we want to minimize the norm of w squared and just add a
  • 49:20 - 49:21
    half there
  • 49:21 - 49:26
    by convention because it makes other work - math work a little
  • 49:26 - 49:27
    nicer.
  • 49:27 - 49:33
    And subject to the constraint that yi times (w transpose xi + b) must be greater than or equal to 1.
  • 49:41 - 49:46
    And so let me just take this constraint, and I'll rewrite it as a constraint. It's gi of w,
  • 49:46 - 49:49
    b. Again, previously,
  • 49:49 - 49:53
    I had gi
  • 49:53 - 49:56
    of w, but now I have parameters w and b. So
  • 49:56 - 50:03
    gi of w, b defined
  • 50:03 - 50:10
    as 1.
  • 50:10 - 50:16
    So
  • 50:16 - 50:20
    let's look at the implications of this in terms
  • 50:20 - 50:25
    of the KKT dual complementary condition again.
  • 50:25 - 50:29
    So we have that if alpha i is strictly greater than
  • 50:29 - 50:32
    0, that necessarily implies that
  • 50:32 - 50:35
    gi of w, b
  • 50:35 - 50:42
    is equal to 0. In other words, this is an active constraint.
  • 50:47 - 50:54
    And what does this mean? It means that it actually turns out gi of w, b equal to 0 -
  • 50:54 - 51:01
    is - that means exactly that the training example xi, yi
  • 51:01 - 51:08
    has functional margin
  • 51:09 - 51:10
    equal to 1.
  • 51:10 - 51:11
    Because
  • 51:11 - 51:15
    this constraint was that
  • 51:15 - 51:19
    the functional margin of every example has to be greater than or equal to
  • 51:19 - 51:22
    1. And so if this is an active constraint, it just -
  • 51:22 - 51:24
    the inequality holds with equality.
  • 51:24 - 51:27
    That means that my training example i
  • 51:27 - 51:30
    must have functional margin equal to exactly 1. And so - actually, yeah,
  • 51:30 - 51:34
    right now, I'll
  • 51:34 - 51:40
    do this on a different board, I guess.
  • 51:40 - 51:47
  • 51:57 - 51:59
    So in pictures,
  • 51:59 - 52:04
    what that means is that, you have some
  • 52:04 - 52:11
    training sets, and you'll
  • 52:11 - 52:15
    have some separating hyperplane.
  • 52:15 - 52:19
    And so the examples with functional margin equal to 1
  • 52:19 - 52:23
    will be exactly those which are -
  • 52:23 - 52:26
    so they're closest
  • 52:26 - 52:28
    to my separating hyperplane. So
  • 52:28 - 52:31
    that's
  • 52:31 - 52:34
    my equation. [Inaudible] equal to 0.
  • 52:34 - 52:39
    And so in this - in this cartoon example that I've done, it'll be
  • 52:39 - 52:45
    exactly
  • 52:45 - 52:50
    - these three
  • 52:50 - 52:53
    examples that have functional margin
  • 52:53 - 52:54
    equal to 1,
  • 52:54 - 52:59
    and all of the other examples, being further away than these
  • 52:59 - 53:06
    three will have functional margin that is strictly greater than 1.
  • 53:06 - 53:08
    And
  • 53:08 - 53:11
    the examples with functional margin equal to 1 will usually correspond to the
  • 53:11 - 53:15
  • 53:15 - 53:22
  • 53:22 - 53:23
    ones where
  • 53:23 - 53:27
    the corresponding Lagrange multipliers are also not equal to 0. And again, it may not
  • 53:27 - 53:29
    hold true. It may be the case that
  • 53:29 - 53:33
    gi and alpha i are both equal to 0. But usually, when gi's not -
  • 53:33 - 53:36
    is 0, alpha i will be non-0.
  • 53:36 - 53:39
    And so the examples with functional margin equal to 1 will be the ones where alpha i is not equal
  • 53:39 - 53:46
    to 0. One
  • 53:47 - 53:49
    useful property of this is that
  • 53:49 - 53:53
    as suggested by this picture and it's true in general as well, it
  • 53:53 - 53:57
    turns out that when we find a solution to this - to the optimization problem,
  • 53:57 - 54:02
    you find that relatively few training examples have functional margin equal to 1. In
  • 54:02 - 54:03
    this picture I've drawn,
  • 54:03 - 54:06
    there are three examples with functional margin equal to 1. There
  • 54:06 - 54:09
    are just a few examples at this minimum possible distance to your separating hyperplane.
  • 54:09 - 54:11
  • 54:11 - 54:14
    And these are three -
  • 54:14 - 54:18
    these examples with functional margin equal to 1 - they
  • 54:18 - 54:21
    are what we're going to call
  • 54:21 - 54:26
    the support vectors. And
  • 54:26 - 54:30
    this leads to the name support vector machine. There'll be these three points with functional margin
  • 54:30 - 54:31
    1
  • 54:31 - 54:35
    that we're calling support vectors.
  • 54:35 - 54:37
    And
  • 54:37 - 54:40
    the fact that there are relatively few support vectors also means that
  • 54:40 - 54:41
    usually,
  • 54:41 - 54:43
    most of the alpha i's are equal to
  • 54:43 - 54:45
    0. So with alpha i equal
  • 54:45 - 54:52
    to
  • 54:53 - 54:54
    0,
  • 54:54 - 55:01
    for examples that are not support vectors.
  • 55:03 - 55:07
    Let's go ahead and work out the actual
  • 55:07 - 55:08
    optimization problem.
  • 55:08 - 55:12
  • 55:12 - 55:19
  • 55:28 - 55:29
    So
  • 55:29 - 55:32
    we have a [inaudible] margin
  • 55:32 - 55:33
    optimization problem.
  • 55:33 - 55:34
  • 55:34 - 55:39
    So let's go and write down the Lagrangian,
  • 55:39 - 55:39
    and
  • 55:39 - 55:44
    because we only have inequality constraints - we really have gi-style
  • 55:44 - 55:47
    constraints, no hi-style constraints. We have
  • 55:47 - 55:50
    inequality constraints and no equality constraints,
  • 55:50 - 55:52
    I'll only have
  • 55:52 - 55:53
    Lagrange multipliers of type
  • 55:53 - 55:57
    alpha - no betas in my generalized Lagrangian. But
  • 55:57 - 56:01
  • 56:01 - 56:03
    my Lagrangian will be
  • 56:03 - 56:10
    one-half w squared minus.
  • 56:15 - 56:18
  • 56:18 - 56:20
    That's my
  • 56:20 - 56:27
    Lagrangian.
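The expression written on the board is cut off in the transcript after "minus". Rewriting the margin constraint in the g-less-than-or-equal-to-0 form used earlier, the Lagrangian for the optimal margin problem would be the following. This is a reconstruction from the constraint stated above, with m denoting the number of training examples.

```latex
g_i(w, b) = 1 - y_i \left( w^{T} x_i + b \right) \le 0,
\qquad
\mathcal{L}(w, b, \alpha)
  = \frac{1}{2} \lVert w \rVert^2
    - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 \right]
```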
  • 56:27 - 56:29
    And so let's work out what the dual problem is.
  • 56:29 - 56:33
    And to do that, I need to figure out what theta d of alpha - and I know again, there are no betas here
  • 56:33 - 56:37
    - so what theta d of alpha
  • 56:37 - 56:44
    is min with
  • 56:44 - 56:48
    respect to w, b of L of w, b, alpha. So the dual problem is to maximize theta d as a function of alpha.
  • 56:48 - 56:55
    So let's work out what theta d is, and then that'll give us our dual problem.
  • 56:55 - 56:58
    So then to work out what this is, what do you need to do? We need to
  • 56:58 - 57:02
    take the Lagrangian and minimize it as a function of w and b
  • 57:02 - 57:03
    so - and what is this? How do you
  • 57:03 - 57:05
    minimize the Lagrangian? So in order to
  • 57:05 - 57:06
  • 57:06 - 57:09
    minimize the Lagrangian as a function of w and b,
  • 57:09 - 57:11
    we do the usual thing. We
  • 57:11 - 57:13
    take the derivatives of the -
  • 57:13 - 57:17
    Lagrangian with respect to w and b. And we set that to 0. That's how we
  • 57:17 - 57:20
    minimize the Lagrangian with respect
  • 57:20 - 57:26
    to w and b. So take the derivative with respect to w of the Lagrangian.
  • 57:26 - 57:28
    And
  • 57:28 - 57:35
    I'll just write down the answer. You know how to do calculus like this.
  • 57:38 - 57:41
    So I wanna minimize this function of w, so I take the derivative and set it
  • 57:41 - 57:43
    to 0. And
  • 57:43 - 57:45
    I get that. And
  • 57:45 - 57:52
    then so this implies that w
  • 57:56 - 57:59
    must be that.
  • 57:59 - 58:01
    And so
  • 58:01 - 58:05
    w, therefore, is actually a linear combination of your input feature vectors xi.
  • 58:05 - 58:07
    This is
  • 58:07 - 58:10
    sum of your various weights given by the alpha i's and times
  • 58:10 - 58:13
    the xi's, which are your examples in your training set.
  • 58:13 - 58:16
    And this will be useful later. The
  • 58:16 - 58:23
    other equation we have is - here, partial derivative of
  • 58:23 - 58:28
    the Lagrangian with respect to b is equal to minus sum
  • 58:28 - 58:34
    from i equals 1 to m of yi alpha
  • 58:34 - 58:35
    i.
  • 58:35 - 58:37
    And so I'll just set that to equal to
  • 58:37 - 58:40
    0. And so these are my two constraints.
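The two equations referred to as "that" on the board can be reconstructed by differentiating the SVM Lagrangian, one-half norm of w squared minus the sum of alpha i times (yi (w transpose xi + b) minus 1). A sketch of the two results:

```latex
\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0
\;\;\Longrightarrow\;\;
w = \sum_{i=1}^{m} \alpha_i y_i x_i,
\qquad
\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = - \sum_{i=1}^{m} \alpha_i y_i = 0
```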
  • 58:40 - 58:43
    And so
  • 58:43 - 58:48
  • 58:48 - 58:50
    [inaudible].
  • 58:50 - 58:54
    So what I'm going to do is I'm actually going to take these two constraints,
  • 58:54 - 58:58
    and well, I'm going to take whatever I thought to be the value for w.
  • 58:58 - 59:00
    And I'm
  • 59:00 - 59:01
    gonna
  • 59:01 - 59:04
    take what I've worked out to be the
  • 59:04 - 59:05
    value for w, and
  • 59:05 - 59:07
    I'll plug it back in there
  • 59:07 - 59:09
    to figure out what the Lagrangian really is
  • 59:09 - 59:14
    when I minimize with respect to w. [Inaudible] and I'll
  • 59:14 - 59:21
    deal with b in a second.
  • 59:28 - 59:35
    So
  • 59:38 - 59:41
    let's see. So my Lagrangian is 1/2
  • 59:41 - 59:44
    w transpose w minus.
  • 59:44 - 59:51
  • 59:51 - 59:58
  • 59:58 - 60:04
    So this first term, w transpose w
  • 60:04 - 60:06
    - this becomes
  • 60:06 - 60:11
    sum from i equals one to m of alpha i, yi,
  • 60:11 - 60:18
  • 60:21 - 60:25
    xi transpose. This is just putting in the value for w that I worked out previously.
  • 60:25 - 60:28
    But since this is w transpose w -
  • 60:28 - 60:32
    and so when I expand out this quadratic function, and when I plug in w
  • 60:32 - 60:34
    over there as well,
  • 60:34 - 60:36
    I
  • 60:36 - 60:39
    find
  • 60:39 - 60:41
  • 60:41 - 60:43
  • 60:43 - 60:50
    that
  • 60:50 - 60:55
  • 60:55 - 60:58
    I
  • 60:58 - 61:00
    have
  • 61:00 - 61:04
  • 61:04 - 61:09
  • 61:09 - 61:11
  • 61:11 - 61:18
    that.
  • 61:21 - 61:22
    Oh,
  • 61:22 - 61:29
    where I'm using these angle brackets to denote inner product, so this
  • 61:29 - 61:35
    thing here, it just means the inner product, xi transpose
  • 61:35 - 61:36
    xj. And
  • 61:36 - 61:40
    the first and second terms are actually the same except for the minus one half. So this
  • 61:40 - 61:42
    simplifies to be
  • 61:42 - 61:45
    equal to
  • 61:45 - 61:51
  • 61:51 - 61:58
  • 62:08 - 62:13
    that.
  • 62:13 - 62:16
    So
  • 62:16 - 62:19
    let me go ahead and
  • 62:19 - 62:22
    call this W of alpha.
  • 62:22 - 62:23
  • 62:23 - 62:29
  • 62:29 - 62:36
  • 62:47 - 62:52
    My dual problem is, therefore, the following. I want to maximize W
  • 62:52 - 62:56
    of alpha, which is that [inaudible].
  • 62:56 - 62:58
    And
  • 62:58 - 63:02
    I want to the - I realize the notation is somewhat
  • 63:02 - 63:07
    unfortunate. I'm using capital W of alpha to denote that formula I wrote down earlier.
  • 63:07 - 63:14
    And then we also had our lowercase w. The original [inaudible] is the primal
  • 63:14 - 63:16
    problem. Lowercase w transpose xi. So
  • 63:16 - 63:18
    uppercase and lowercase w
  • 63:18 - 63:21
    are totally different
  • 63:21 - 63:27
    things, so unfortunately, the notation is standard as well, as far as I know,
  • 63:27 - 63:30
    so. So the dual problem is
  • 63:30 - 63:33
    that subject to the alpha i being greater than or equal to 0,
  • 63:33 - 63:37
    and
  • 63:37 - 63:39
    we also have that the
  • 63:39 - 63:41
    sum of i,
  • 63:41 - 63:44
    yi, alpha i is equal to 0.
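The board formulas for W of alpha and the dual problem are not in the transcript. Substituting w equal to the sum of alpha i yi xi back into the Lagrangian, and using the angle-bracket inner product notation mentioned above, the result would be as follows. This is a reconstruction; the b term drops out because the sum of alpha i yi is 0.

```latex
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i
   - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \\
\text{s.t.} \quad & \alpha_i \ge 0, \quad i = 1, \dots, m \\
 & \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}
```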
  • 63:44 - 63:46
    That last constraint
  • 63:46 - 63:50
    was the constraint I got from
  • 63:50 - 63:52
    this - the
  • 63:52 - 63:55
    sum of i - sum of i, yi alpha i equals to 0. But that's where
  • 63:55 - 64:00
    that [inaudible] came
  • 64:00 - 64:02
    from. Let
  • 64:02 - 64:02
    me just
  • 64:02 - 64:06
    - I think in previous years that I taught this,
  • 64:06 - 64:09
    where this constraint comes from is just - is slightly confusing. So let
  • 64:09 - 64:13
    me just take two minutes to say what the real interpretation of that is. And if you
  • 64:13 - 64:16
    don't understand it, it's
  • 64:16 - 64:18
    not a big deal, I guess.
  • 64:18 - 64:22
    So when we took the partial derivative of the Lagrange with
  • 64:22 - 64:23
    respect to b,
  • 64:23 - 64:28
    we end up with this constraint that sum of i, yi, alpha i must be equal to 0.
  • 64:28 - 64:34
    The interpretation of that, it turns out, is that if sum of i, yi, alpha i
  • 64:34 - 64:37
    is not equal to
  • 64:37 - 64:41
    0, then
  • 64:41 - 64:43
  • 64:43 - 64:50
  • 64:51 - 64:53
    theta d of wb
  • 64:53 - 64:56
    is -
  • 64:56 - 65:00
    actually, excuse me.
  • 65:00 - 65:02
  • 65:02 - 65:03
  • 65:03 - 65:06
    Then theta d of alpha is equal to
  • 65:06 - 65:08
  • 65:08 - 65:13
    minus infinity for minimizing.
  • 65:13 - 65:16
    So in other words, it turns out my Lagrangian is
  • 65:16 - 65:20
    actually a linear function of my parameter b. And so the interpretation of
  • 65:20 - 65:23
    that constraint we worked out previously was that if sum of i of yi, alpha i
  • 65:23 - 65:26
    is not equal to 0, then
  • 65:26 - 65:29
    theta d of alpha is equal to minus infinity.
  • 65:29 - 65:32
    And so if your goal is to
  • 65:32 - 65:37
    maximize as a function of alpha, theta
  • 65:37 - 65:38
    d of alpha,
  • 65:38 - 65:45
    then you've gotta choose values of alpha for which sum of yi alpha i is equal to 0.
  • 65:45 - 65:51
    And then when sum of yi alpha i is equal to 0, then
  • 65:51 - 65:54
  • 65:54 - 65:56
  • 65:56 - 66:01
    theta d of
  • 66:01 - 66:04
    alpha is equal to capital W of alpha.
  • 66:04 - 66:09
    And so that's why we ended up deciding to maximize capital W of alpha subject to
  • 66:09 - 66:14
    that sum of yi alpha i is equal to 0.
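For reference, the dual problem and the reading of the constraint described here can be written out as below. This is a sketch in the notation of the CS 229 lecture notes, assuming the formula on the board called capital W of alpha is the standard dual objective. The symbol m (number of training examples), the superscript indexing, and the angle-bracket inner product are notational assumptions, not something read off the transcript.

```latex
% Dual of the optimal margin classifier
\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
    y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.}\quad & \alpha_i \ge 0, \qquad i = 1, \dots, m, \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 .
\end{aligned}

% The Lagrangian is linear in b through the term -b \sum_i \alpha_i y^{(i)},
% so minimizing over b gives minus infinity unless that sum vanishes:
\theta_D(\alpha) = \min_{w,b} \mathcal{L}(w, b, \alpha) =
\begin{cases}
W(\alpha) & \text{if } \sum_{i=1}^{m} \alpha_i y^{(i)} = 0, \\
-\infty   & \text{otherwise.}
\end{cases}
```

Maximizing theta d of alpha is therefore the same as maximizing capital W of alpha over the alphas that satisfy the equality constraint.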
  • 66:14 - 66:19
    Yeah, the - unfortunately, the fact of that d would be [inaudible]
  • 66:19 - 66:22
    adds just a little bit of extra notation in our
  • 66:22 - 66:23
    derivation of the dual. But
  • 66:23 - 66:27
    by the way, and [inaudible] all the action of the optimization problem is with w
  • 66:27 - 66:33
    because b is just one parameter.
  • 66:33 - 66:40
    So let's check. Are there any questions about this? Okay, cool.
  • 66:47 - 66:49
    So
  • 66:49 - 66:54
    we've derived the dual optimization problem - and really, don't worry about this
  • 66:54 - 66:56
    if you're not quite sure where this was. Just think of this as
  • 66:56 - 67:00
    we worked out this constraint, and we worked out, and we took partial derivative with
  • 67:00 - 67:01
    respect to b,
  • 67:01 - 67:06
    that this constraint has the [inaudible] and so I just copied that over here. But so - worked out
  • 67:06 - 67:11
    the dual of the optimization problem,
  • 67:11 - 67:16
    so our approach to finding - to deriving the optimal margin classifier or support vector
  • 67:16 - 67:17
    machine
  • 67:17 - 67:19
    will be that we'll solve
  • 67:19 - 67:26
    this dual optimization problem for the parameters alpha
  • 67:28 - 67:29
    star.
  • 67:29 - 67:31
    And then
  • 67:31 - 67:34
    if you want, you can then - this is the equation that we worked out on
  • 67:34 - 67:36
    the previous board. We said that
  • 67:36 - 67:43
    w - this [inaudible] alpha - w must be equal to
  • 67:44 - 67:46
    that.
  • 67:46 - 67:48
    And so
  • 67:48 - 67:53
    once you solve for alpha, you can then go back and quickly derive
  • 67:53 - 67:57
    w, the parameters of your primal problem. And we worked this out earlier.
  • 67:57 - 67:58
  • 67:58 - 68:00
    And moreover,
  • 68:00 - 68:05
    once you solve for alpha and w,
  • 68:05 - 68:07
  • 68:07 - 68:10
    it's really easy to solve for b, so
  • 68:10 - 68:14
  • 68:14 - 68:15
    that b gives us the interpretation of [inaudible]
  • 68:15 - 68:20
    training set, and you found the direction for w. So you know where your separating
  • 68:20 - 68:22
    hyperplane's direction is. You know it's got to be
  • 68:22 - 68:25
    one of these things.
  • 68:25 - 68:28
    And you know the orientation of the separating hyperplane. You just have to
  • 68:28 - 68:31
    decide where to place
  • 68:31 - 68:33
    this hyperplane. And that's what solving for b is.
  • 68:33 - 68:37
    So once you solve for alpha and w, it's really easy to solve for b.
  • 68:37 - 68:40
    You can plug alpha and w back into the
  • 68:40 - 68:43
    primal optimization problem
  • 68:43 - 68:46
  • 68:46 - 68:50
  • 68:50 - 68:57
    and solve for b.
  • 69:02 - 69:05
    And I just wrote it down
  • 69:05 - 69:12
    for the sake of completeness,
  • 69:15 - 69:20
  • 69:20 - 69:21
  • 69:21 - 69:23
    but
  • 69:23 - 69:28
  • 69:28 - 69:30
    - and the
  • 69:30 - 69:36
    intuition behind this formula is just to find the worst positive
  • 69:36 - 69:37
    [inaudible] and the
  • 69:37 - 69:42
    worst negative example. Let's say
  • 69:42 - 69:45
    this one and this one - say [inaudible] and [inaudible] the difference between them. And
  • 69:45 - 69:46
    that tells you where you should
  • 69:46 - 69:48
    set the threshold for
  • 69:48 - 69:53
    where to place the separating hyperplane.
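The recipe just described (solve for alpha, recover w, then place the hyperplane) can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes the standard CS 229 formulas, w as the sum of alpha i, yi, xi, and b as minus the average of the largest score among negative examples and the smallest score among positive examples. The function name recover_w_b and the two-point toy data set, with alphas worked out by hand, are hypothetical.

```python
import numpy as np

def recover_w_b(alpha, X, y):
    """Recover primal parameters (w, b) from a dual solution alpha.

    X has one training example per row; y holds labels in {-1, +1}.
    """
    # w = sum_i alpha_i * y_i * x_i
    w = (alpha * y) @ X
    scores = X @ w
    # "Worst" negative example: the one with the largest w^T x.
    # "Worst" positive example: the one with the smallest w^T x.
    # The threshold b sits halfway between them.
    b = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0
    return w, b

# Hypothetical toy problem: one positive point and one negative point.
X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
# Dual solution worked out by hand for this toy set (an assumption of the
# example, not the output of a solver); note sum_i alpha_i * y_i = 0.
alpha = np.array([0.25, 0.25])

w, b = recover_w_b(alpha, X, y)
print(w, b)
```

On this toy set both points end up with functional margin exactly 1, which is what the optimal margin classifier requires of its support vectors.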
  • 69:53 - 69:53
    And then
  • 69:53 - 69:55
    that's the -
  • 69:55 - 69:58
    this is the optimal margin classifier. This is also called a support vector
  • 69:58 - 70:02
    machine. If you do not use one y [inaudible], it's called kernels. And I'll say a few words
  • 70:02 - 70:04
    about that. But I
  • 70:04 - 70:08
    hope the process is clear. It's a dual problem. We're going to solve the dual
  • 70:08 - 70:10
    problem for the alpha i's.
  • 70:10 - 70:11
    That gives us w, and that gives
  • 70:11 - 70:14
    us b.
  • 70:14 - 70:17
    So
  • 70:17 - 70:22
    there's just one more thing I wanna point out as I lead into the next lecture,
  • 70:22 - 70:23
    which is that - I'll just
  • 70:23 - 70:28
    write this out again,
  • 70:28 - 70:29
    I guess -
  • 70:29 - 70:31
    which is that it turns out
  • 70:31 - 70:34
    we can take the entire algorithm,
  • 70:34 - 70:38
    and we can express the entire algorithm in terms of inner products. And here's what I
  • 70:38 - 70:41
    mean by that. So
  • 70:41 - 70:42
    say that the parameters w
  • 70:42 - 70:45
    is the sum of your input examples.
  • 70:45 - 70:49
    And so we need to make a prediction.
  • 70:49 - 70:55
    Someone gives you a new value of x. You want a value of the hypothesis on the value of x.
  • 70:55 - 70:58
    That's given by g of w transpose x plus b,
  • 70:58 - 71:03
    where g was this threshold function that outputs minus 1 or plus 1.
  • 71:03 - 71:06
    And so you need to compute w transpose x plus b.
  • 71:06 - 71:11
    And that is equal to alpha i,
  • 71:11 - 71:18
    yi.
  • 71:20 - 71:24
    And that can be expressed as a sum of these inner products between
  • 71:24 - 71:26
    your training examples
  • 71:26 - 71:33
    and this new value of x [inaudible] value [inaudible]. And this will
  • 71:34 - 71:39
    lead into our next lecture, which is the idea of kernels.
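To make the inner-product form concrete, here is a small sketch of prediction that never builds w explicitly. It assumes the identity w transpose x plus b equals the sum over i of alpha i, yi, times the inner product of xi with x, plus b, which follows from substituting w as the sum of alpha i, yi, xi. The function names decision_value and predict, and the toy values of alpha and b, are hypothetical.

```python
import numpy as np

def decision_value(alpha, X, y, b, x_new):
    """Compute w^T x_new + b using only inner products <x_i, x_new>."""
    inner_products = X @ x_new  # one inner product per training example
    return float(np.sum(alpha * y * inner_products) + b)

def predict(alpha, X, y, b, x_new):
    """Threshold function g: outputs +1 or -1."""
    return 1 if decision_value(alpha, X, y, b, x_new) >= 0 else -1

# Hypothetical toy problem with hand-computed dual solution and threshold.
X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = -1.0

label_pos = predict(alpha, X, y, b, np.array([3.0, 3.0]))
label_neg = predict(alpha, X, y, b, np.array([-1.0, -1.0]))
print(label_pos, label_neg)
```

The training examples enter only through `X @ x_new`, so any routine that returns those inner products could be swapped in at that one line.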
  • 71:39 - 71:40
    And
  • 71:40 - 71:44
    it turns out that in the sorts of feature spaces where we use support vector
  • 71:44 - 71:45
    machines -
  • 71:45 - 71:52
    it turns out that sometimes your training examples may be very high-dimensional. It may even be the case
  • 71:54 - 71:58
    that the features that you want to use
  • 71:58 - 72:00
    are
  • 72:00 - 72:04
    infinite-dimensional feature vectors.
  • 72:04 - 72:11
    But despite this, it'll turn out that there'll be an interesting representation that
  • 72:11 - 72:13
    you can use
  • 72:13 - 72:15
    that will allow you
  • 72:15 - 72:19
    to compute inner products like these efficiently.
  • 72:19 - 72:26
  • 72:26 - 72:29
    And this holds true only for certain feature spaces. It doesn't hold true for arbitrary sets
  • 72:29 - 72:30
    of features.
  • 72:30 - 72:34
    But when we talk about the idea of
  • 72:34 - 72:35
    kernels in the next lecture, we'll
  • 72:35 - 72:38
    see examples where
  • 72:38 - 72:41
    even though you have extremely high-dimensional feature vectors, you can compute
  • 72:41 - 72:46
    - you may never want to represent xi, x plus [inaudible] infinite-dimensional
  • 72:46 - 72:48
    feature vector. You can't even store it in computer memory.
  • 72:48 - 72:52
    But you will nonetheless be able to compute inner products between different
  • 72:52 - 72:54
    [inaudible] feature vectors very efficiently.
  • 72:54 - 72:57
    And so you can - for example, you can make predictions by making use of these inner
  • 72:57 - 72:59
    products.
  • 72:59 - 73:02
    This is just xi
  • 73:02 - 73:04
    transpose.
  • 73:04 - 73:09
    You will compute these inner products very efficiently and, therefore, make predictions.
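As a small illustration of computing such inner products without building the feature vectors, the sketch below uses one standard example that is not taken from this lecture: the feature map listing all pairwise products of the input coordinates. The names phi and quadratic_kernel are hypothetical. For an n-dimensional input the explicit feature vector has n squared entries, yet its inner product equals the square of the ordinary inner product, which costs only on the order of n operations.

```python
import numpy as np

def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j (n^2 entries)."""
    return np.outer(x, x).ravel()

def quadratic_kernel(x, z):
    """Same inner product, computed without forming phi(x) or phi(z)."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

explicit = float(phi(x) @ phi(z))   # works in the 9-dimensional feature space
implicit = quadratic_kernel(x, z)   # works in the original 3-dimensional space
print(explicit, implicit)
```

Both numbers agree because the sum over i and j of x_i x_j z_i z_j factors into the square of the sum over i of x_i z_i.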
  • 73:09 - 73:11
    And this points to the other
  • 73:11 - 73:15
    reason we derived the dual, which was because
  • 73:15 - 73:18
    on this board, when we worked out what capital W of alpha is, capital W of alpha
  • 73:18 - 73:24
    actually has the same property - capital W of alpha is again
  • 73:24 - 73:27
    written in terms of these inner products.
  • 73:27 - 73:29
    And so if you
  • 73:29 - 73:33
    actually look at the dual optimization problem and step - for all the steps of the
  • 73:33 - 73:34
    algorithm,
  • 73:34 - 73:37
    you'll find that you actually do everything you want - learn the parameters of
  • 73:37 - 73:38
    alpha. So
  • 73:38 - 73:42
    suppose you do an optimization problem, go into parameters alpha, and you do everything you want
  • 73:42 - 73:47
    without ever needing to represent xi directly. And all you need to do
  • 73:47 - 73:54
    is compute inner products between your feature vectors like these. Well,
  • 73:54 - 73:56
    one last property of
  • 73:56 - 73:58
    this algorithm that's kinda nice is that
  • 73:58 - 74:00
    I said previously
  • 74:00 - 74:01
    that
  • 74:01 - 74:07
    the alpha i's are non-zero only for the support vectors,
  • 74:07 - 74:09
    only for the vectors
  • 74:09 - 74:10
    that have functional margin equal to 1.
  • 74:10 - 74:14
    And in practice, there are usually fairly few of them.
  • 74:14 - 74:17
    And so what this means is that if you're representing w this way,
  • 74:17 - 74:18
    then
  • 74:18 - 74:23
    w is represented in terms of a fairly small fraction of the training examples
  • 74:23 - 74:25
    because most of the alpha i's are 0 -
  • 74:25 - 74:27
    and so when you're summing up
  • 74:27 - 74:28
    the sum,
  • 74:28 - 74:32
    you need to compute inner products only with the support vectors, which is
  • 74:32 - 74:36
    usually a small fraction of your training set. So that's another nice [inaudible]
  • 74:36 - 74:39
    because [inaudible] alpha is
  • 74:39 - 74:41
    0. And well,
  • 74:41 - 74:45
    much of this will make much more sense when we talk about kernels. [Inaudible] quick
  • 74:45 - 74:52
    questions
  • 74:52 - 74:53
  • 74:53 - 74:58
    before I close? Yeah. Student: It seems that for anything we've done to work, the point file has to be really well
  • 74:58 - 75:00
    behaved, and if any of the points are kinda on the wrong side - Instructor (Andrew Ng): No, oh, yeah, so again, today's lecture assumes that
  • 75:00 - 75:04
    the data is linearly separable - that you can actually get perfect
  • 75:04 - 75:11
    [inaudible]. I'll fix this in the next lecture as well. But excellent assumption. Yes? Student:So can't we assume that [inaudible]
  • 75:11 - 75:13
    point [inaudible], so [inaudible]
  • 75:13 - 75:18
    have
  • 75:18 - 75:19
    [inaudible]? Instructor (Andrew Ng):Yes, so unless I - says that
  • 75:19 - 75:23
    there are ways to generalize this to multiple classes that I probably won't [inaudible] -
  • 75:23 - 75:26
    but yeah, that's generalization [inaudible].
  • 75:26 - 75:28
    Okay. Let's close for today, then.
  • 75:28 - 75:29
    We'll talk about kernels in our next lecture.
Title:
Lecture 7 | Machine Learning (Stanford)
Description:

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on optimal margin classifiers, KKT conditions, and SVM duals.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:
English
Duration:
01:15:45
N. Ueda edited English subtitles for Lecture 7 | Machine Learning (Stanford) Apr 14, 2013, 5:14 AM
jhprks2 added a translation Aug 9, 2012, 2:51 AM
