Lecture 7 | Machine Learning (Stanford)
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:22Development.
-
0:24 - 0:26So welcome back.
-
0:26 - 0:31And what I wanna do today is continue our discussion on support vector machines. And in
-
0:31 - 0:34particular, I wanna talk about the optimal margin classifier.
-
0:34 - 0:38Then I wanna take a brief digression and talk about primal and dual optimization problems, and
-
0:38 - 0:39in particular,
-
0:39 - 0:42what's called the KKT conditions.
-
0:42 - 0:45And then we'll derive the dual of the optimization problem that I
-
0:45 - 0:47had posed earlier.
-
0:47 - 0:51And that will lead us into a discussion of kernels, which I won't really get to -
-
0:51 - 0:54which we'll just get to say a couple of words about, but which I'll probably do
-
0:54 - 0:58properly only in the next lecture.
-
0:58 - 1:02And as part of today's lecture, I'll spend some time talking about optimization problems.
-
1:02 - 1:04And in
-
1:04 - 1:08the little time I have today, I won't really be able to do this topic justice.
-
1:08 - 1:13I wanna talk about convex optimization, but I can't do that topic justice today.
-
1:13 - 1:18And so at this week's discussion session, the TAs will have more
-
1:18 - 1:22time - they'll teach a discussion session focused on convex optimization
-
1:22 - 1:25- a sort of very beautiful and useful theory.
-
1:25 - 1:32So if you want to learn more about that, listen to this Friday's discussion session.
-
1:33 - 1:37Just to recap what we did in the previous lecture,
-
1:37 - 1:39as
-
1:39 - 1:42we were beginning to develop support vector machines, I
-
1:42 - 1:44said that a
-
1:44 - 1:49hypothesis is represented as h sub wb of x equals g of
-
1:49 - 1:53w transpose x + b, where
-
1:53 - 1:55
-
1:55 - 1:56g of z
-
1:56 - 2:03will be
-
2:04 - 2:09+1 or -1, depending on whether z is greater than
-
2:09 - 2:100.
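As a concrete aside (not part of the spoken lecture), the hypothesis just described can be sketched in a few lines of Python; the names are illustrative, and taking g(0) = +1 is an assumed convention, since the lecture only says "depending on whether z is greater than 0":

```python
import numpy as np

def g(z):
    """Threshold function: +1 if z >= 0, else -1 (g(0) = +1 by assumed convention)."""
    return 1 if z >= 0 else -1

def hypothesis(w, b, x):
    """h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

h = hypothesis(np.array([1.0, 1.0]), -1.0, np.array([2.0, 0.0]))  # w^T x + b = 1, so h = +1
```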
-
2:10 - 2:13And I said that in our development of support vector machines,
-
2:13 - 2:17we'll use - we'll change the convention to letting y be +1, -1 to denote
-
2:17 - 2:20the class labels.
-
2:20 - 2:24So last time,
-
2:24 - 2:28we also talked about the functional margin, which was this thing, gamma
-
2:28 - 2:34hat i.
-
2:34 - 2:36
-
2:36 - 2:39And so we had the intuition that if
-
2:39 - 2:42the functional margin is a large positive number,
-
2:42 - 2:47then that means that we are classifying a training example correctly and very
-
2:47 - 2:48confidently. So if
-
2:48 - 2:50yi is +1, we
-
2:50 - 2:53would like w transpose xi + b to be very large.
-
2:53 - 2:55And - if, excuse me, if
-
2:55 - 2:57yi is -1,
-
2:57 - 2:59then we'd like w transpose xi + b to be a large negative number. So
-
2:59 - 3:03we'd sort of like functional margins to be large. We
-
3:03 - 3:06also said the functional margin has a strange property -
-
3:06 - 3:10that you can increase functional margin just by, say, taking your parameters,
-
3:10 - 3:11w and b,
-
3:11 - 3:14and multiplying them by
-
3:14 - 3:162.
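A quick numerical sketch of this scaling property (toy numbers, not from the lecture): doubling w and b doubles every functional margin, even though the classifier itself is unchanged.

```python
import numpy as np

def functional_margin(w, b, x, y):
    """gamma_hat_i = y_i * (w^T x_i + b)."""
    return y * (np.dot(w, x) + b)

# Made-up example point and parameters, for illustration only.
w, b = np.array([1.0, 2.0]), -0.5
x, y = np.array([3.0, 1.0]), 1

m1 = functional_margin(w, b, x, y)          # 3 + 2 - 0.5 = 4.5
m2 = functional_margin(2 * w, 2 * b, x, y)  # doubles to 9.0, with the same classifier
```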
-
3:16 - 3:18And then we also
-
3:18 - 3:21defined the geometric margin,
-
3:21 - 3:23which
-
3:23 - 3:25
-
3:25 - 3:32was
-
3:33 - 3:37essentially just the functional margin
-
3:37 - 3:38divided by
-
3:38 - 3:40the norm of w.
-
3:40 - 3:43And so the geometric margin had
-
3:43 - 3:47the interpretation as being - I'll give
-
3:47 - 3:49you a few examples. The
-
3:49 - 3:52geometric margin, for example, is - has
-
3:52 - 3:56the interpretation as a distance between a training example
-
3:56 - 3:57and a hyperplane.
-
3:57 - 4:01And it'll actually be a signed distance, so that this distance will be positive
-
4:01 - 4:05if you're classifying the example correctly. And if you misclassify the example, this
-
4:05 - 4:08distance will be the negative of the distance
-
4:08 - 4:10between the training example
-
4:10 - 4:12and your separating hyperplane,
-
4:12 - 4:16where your separating hyperplane is defined by the equation w transpose x
-
4:16 - 4:18+
-
4:18 - 4:21b =
-
4:21 - 4:240.
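The geometric margin, by contrast, is invariant to rescaling w and b; a small sketch with made-up numbers:

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """Signed distance from x to the hyperplane w^T x + b = 0;
    positive exactly when the example is classified correctly."""
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

# Illustrative parameters and point (not from the lecture).
w, b = np.array([3.0, 4.0]), -5.0            # ||w|| = 5
x, y = np.array([3.0, 1.0]), 1

g1 = geometric_margin(w, b, x, y)            # (9 + 4 - 5) / 5 = 1.6
g2 = geometric_margin(10 * w, 10 * b, x, y)  # unchanged by rescaling w and b
```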
-
4:24 - 4:28So - oh,
-
4:28 - 4:32well, and I guess I
-
4:32 - 4:37also defined the functional margin and geometric margin with
-
4:37 - 4:38respect to a training set, which I defined as
-
4:38 - 4:45the worst case, or the minimum, of the functional or geometric margins over the examples. So in our
-
4:49 - 4:53development of the optimal margin classifier,
-
4:53 - 4:56our learning algorithm would choose parameters w and b so as to maximize
-
4:56 - 5:00the geometric margin. So our goal is to find the separating hyperplane
-
5:00 - 5:05that separates the positive and negative examples with as large a distance as possible
-
5:05 - 5:06between hyperplane
-
5:06 - 5:11and the positive and negative examples. And if you
-
5:11 - 5:15go to choose parameters w and b to maximize this, one property of
-
5:15 - 5:17the geometric margin
-
5:17 - 5:18is that you
-
5:18 - 5:20can actually scale w
-
5:20 - 5:24and b arbitrarily. So you look at this definition for the geometric margin.
-
5:24 - 5:25
-
5:25 - 5:28I can choose to multiply my parameters w and b
-
5:28 - 5:32by 2 or by 10 or any other constant.
-
5:32 - 5:34And it doesn't change
-
5:34 - 5:36my geometric margin.
-
5:36 - 5:38And one way of interpreting that is you're looking
-
5:38 - 5:39at
-
5:39 - 5:43just the separating hyperplane. You look at this line separating the positive and negative training examples. If
-
5:43 - 5:45I
-
5:45 - 5:46scale w
-
5:46 - 5:47and b,
-
5:47 - 5:49that doesn't change the position of this plane,
-
5:49 - 5:52because the equation w transpose x +
-
5:52 - 5:59b = 0 is the same as the equation 2 w transpose x + 2b = 0. So it's the same straight
-
5:59 - 6:03line. And what that means is that I can actually choose whatever scaling
-
6:03 - 6:06for w and b is convenient for me.
-
6:06 - 6:08And in particular,
-
6:08 - 6:09as we'll use in a minute,
-
6:09 - 6:14I can impose a constraint like that the norm of w equals 1,
-
6:14 - 6:17because this means that, given any solution for w and b,
-
6:17 - 6:22then by rescaling the parameters, you can easily meet this condition - the rescaled w
-
6:22 - 6:24has norm 1. And so I can
-
6:24 - 6:28add the condition like this and then essentially not change the problem. Or I
-
6:28 - 6:32can add other conditions. I can actually add a
-
6:32 - 6:35condition
-
6:35 - 6:38that - excuse me, the absolute value of w1 = 1. I can have
-
6:38 - 6:41only one of these conditions at a time. And that's adding a condition that the absolute
-
6:41 - 6:42value of the
-
6:42 - 6:45first component of w must be 1. And again,
-
6:45 - 6:49you can find a solution and just rescale w to meet this
-
6:49 - 6:51condition.
-
6:51 - 6:53And I can have other,
-
6:53 - 6:56more esoteric conditions like that,
-
6:56 - 6:57because again,
-
6:57 - 7:01this is a condition where you can solve for the optimal margin, and then just
-
7:01 - 7:03by scaling
-
7:03 - 7:05w up and down, you can - you can then
-
7:05 - 7:09ensure you meet this condition as well. So
-
7:09 - 7:13again, we impose one of these conditions at a time, not all of them.
-
7:13 - 7:15And so our ability to choose
-
7:15 - 7:19any scaling condition on w that's convenient to us
-
7:19 - 7:24will be useful again in a second. All right.
-
7:24 - 7:29So let's go ahead and break down the optimization problem. And again, my goal is to choose
-
7:29 - 7:30parameters w and b
-
7:30 - 7:34so as to maximize the geometric margin.
-
7:34 - 7:39Here's my first attempt at writing down the optimization problem. I actually wrote this one down
-
7:39 - 7:40right at
-
7:40 - 7:43the end of the previous
-
7:43 - 7:47lecture. We want to solve for the parameters gamma, w, and b
-
7:47 - 7:48such that
-
7:48 - 7:49gamma is maximized, subject to
-
7:49 - 7:54the constraint
-
7:54 - 7:57that yi times w transpose xi + b is greater than or equal to gamma
-
7:57 - 8:02for all i in the training examples.
-
8:02 - 8:02Let's say I
-
8:02 - 8:06choose to add this normalization condition.
-
8:06 - 8:11So the condition that the norm of w is equal to 1 just makes
-
8:11 - 8:15the geometric and the functional margin the same.
-
8:15 - 8:18And so I'm saying I want to find a value -
-
8:18 - 8:23I want to find a value for gamma as big as possible
-
8:23 - 8:26so that all of my training examples have functional margin
-
8:26 - 8:29greater than or equals gamma,
-
8:29 - 8:30and
-
8:30 - 8:34with the constraint that normal w equals 1,
-
8:34 - 8:36functional margin and geometric margin are the same.
-
8:36 - 8:38So it's the same.
-
8:38 - 8:40Find the value for gamma so that
-
8:40 - 8:43all the values - all the geometric margins are greater or equal to
-
8:43 - 8:46gamma.
-
8:46 - 8:51So you solve this optimization problem, then you have derived
-
8:51 - 8:55the optimal margin classifier -
-
8:55 - 8:57except that
-
8:57 - 9:00this is not a very nice optimization problem because this is a
-
9:00 - 9:04nasty, nonconvex constraint. It's asking that you
-
9:04 - 9:06solve for parameters w
-
9:06 - 9:11that lie on the surface of a unit sphere -
-
9:11 - 9:15that lie on a unit circle - a unit sphere.
-
9:15 - 9:16
-
9:16 - 9:18And so
-
9:18 - 9:21if we can come up with a convex optimization problem, then
-
9:21 - 9:25we'd be guaranteed that gradient descent or other local search procedures will
-
9:25 - 9:25not get stuck in
-
9:25 - 9:26local optima. And
-
9:26 - 9:31it turns out this is an example of a nonconvex constraint. This is a nasty constraint
-
9:31 - 9:38that I would like to get rid of. So
-
9:41 - 9:43let's change the optimization problem
-
9:43 - 9:48one more time.
-
9:48 - 9:52Now, let me
-
9:52 - 9:57pose a slightly different optimization problem. Let
-
9:57 - 10:03me maximize the functional margin gamma hat divided by the norm of w,
-
10:03 - 10:05subject
-
10:05 - 10:12to yi times w transpose xi + b being greater than or equal to gamma hat.
-
10:13 - 10:15So in other words, once you find
-
10:15 - 10:17a number, gamma hat,
-
10:17 - 10:20so that every one of my training examples has functional margin greater
-
10:20 - 10:23than the gamma hat,
-
10:23 - 10:27and my optimization objective is I want to maximize gamma hat divided by the norm of
-
10:27 - 10:28
-
10:28 - 10:31w. And so I wanna maximize the
-
10:31 - 10:36functional margin divided by the norm of w. And we saw previously the functional
-
10:36 - 10:38margin divided by the norm of
-
10:38 - 10:42w is just the geometric margin, and so this is a different way of posing
-
10:42 - 10:45the same
-
10:45 - 10:47optimization problem. [Inaudible] confused, though. Are there questions
-
10:47 - 10:54about this?
-
10:54 - 10:54
-
10:54 - 10:57Student:[Inaudible] the second statement has to be made of the functional margin y
-
10:57 - 10:59divided by - why don't you just have it
-
10:59 - 11:02the geometric
-
11:02 - 11:05margin? Why do
-
11:05 - 11:09you [inaudible]? Instructor (Andrew Ng):[Inaudible] say it again? Student:For the second statement, where we're saying the data of the functional margin is divided [inaudible]. Instructor (Andrew
-
11:09 - 11:11Ng):Oh, I see, yes. Student:[Inaudible]
-
11:11 - 11:13
-
11:13 - 11:17is that [inaudible]? Instructor (Andrew Ng):So let's see, this is the function margin, right? This is not the geometric margin.
-
11:17 - 11:19Student:Yeah.
-
11:19 - 11:26Instructor (Andrew Ng):So - oh, I want to divide by the norm of w in my optimization objective. Student:I'm just wondering how come you end up dividing also under the second stage [inaudible] the functional
-
11:26 - 11:28margin.
-
11:28 - 11:32Why are you dividing there by the norm of w? Instructor (Andrew Ng):Let's see. I'm not sure I get the question. Let me
-
11:32 - 11:36try saying
-
11:36 - 11:40this again. So here's my goal. My - I want [inaudible]. So
-
11:40 - 11:42let's see,
-
11:42 - 11:47the parameters of this optimization problem are gamma hat, w, and b - so
-
11:47 - 11:50the convex optimization software
-
11:50 - 11:52solves this problem for some set of parameters gamma
-
11:52 - 11:54w and b.
-
11:54 - 11:57And I'm imposing the constraint that
-
11:57 - 11:59whatever values it comes up with,
-
11:59 - 12:04yi times w transpose xi + b must be greater than or equal to gamma hat.
-
12:04 - 12:06And so this means that
-
12:06 - 12:09the functional margin of every example had
-
12:09 - 12:09better be
-
12:09 - 12:11greater than equal to gamma hat. So
-
12:11 - 12:14there's a constraint to the function margin and a constraint to the gamma hat.
-
12:14 - 12:18But what I care about is not really maximizing the functional margin. What I
-
12:18 - 12:19really care about -
-
12:19 - 12:21in other words, in optimization objective,
-
12:21 - 12:25is maximizing gamma hat divided by the norm of w,
-
12:25 - 12:28which is the geometric margin.
-
12:28 - 12:32So in other words, my optimization objective is I want to maximize the functional margin
-
12:32 - 12:35divided by the norm of
-
12:35 - 12:38w. Subject to that, every example must have
-
12:38 - 12:40functional margin of at least gamma hat. Does that make
-
12:40 - 12:46sense now? Student:[Inaudible] when you
-
12:46 - 12:48said that to maximize gamma
-
12:48 - 12:51or gamma hat, respect to gamma w and with
-
12:51 - 12:52respect to gamma hat
-
12:52 - 12:57so that
-
12:57 - 13:01[inaudible] gamma hat are no
-
13:01 - 13:05longer [inaudible]? Instructor (Andrew Ng):So this is the -
-
13:05 - 13:07so it turns out -
-
13:07 - 13:11so this is how I write down the - this is how I write down an optimization problem
-
13:11 - 13:13in order to solve for
-
13:13 - 13:16the geometric margin. What is it -
-
13:16 - 13:18so it turns out that
-
13:18 - 13:21the question is - is gamma hat a function of w and b? And it turns out
-
13:21 - 13:22that
-
13:22 - 13:25in my previous mathematical definition, it was,
-
13:25 - 13:30but the way I'm going to pose this as an optimization problem is
-
13:30 - 13:35I'm going to ask the convex optimization solver - and there's software - let's say you have software for solving
-
13:35 - 13:38convex optimization problems -
-
13:38 - 13:40then I'm going to
-
13:40 - 13:44pretend that these are independent variables and ask my convex optimization software
-
13:44 - 13:46to find me values for gamma, w, and b,
-
13:46 - 13:51to make this value as big as possible and subject to this constraint.
-
13:51 - 13:53And it'll turn out that
-
13:53 - 13:55when it does that, it will choose - or
-
13:55 - 13:59obviously, it will choose for gamma to be as big as possible
-
13:59 - 14:01because optimization objective is this:
-
14:01 - 14:03You're trying to maximize gamma hat.
-
14:03 - 14:06So for a fixed value of w and b,
-
14:06 - 14:09my software will choose to make gamma hat as big as possible -
-
14:09 - 14:13well, but how big can we make gamma hat? Well, it's limited by these
-
14:13 - 14:15constraints. It says that every training example
-
14:15 - 14:17must have functional margin
-
14:17 - 14:20greater than equal to gamma hat.
-
14:20 - 14:24And so the biggest you can make gamma hat
-
14:24 - 14:28will be the value of the smallest functional margin.
-
14:28 - 14:30And so when you solve this optimization problem,
-
14:30 - 14:34the value of gamma hat you get out will be, indeed,
-
14:34 - 14:39the minimum of the functional margins of your training set. Okay, so
-
14:39 - 14:41Justin? Student:Yeah, I was just
-
14:41 - 14:42wondering, I
-
14:42 - 14:46guess I'm a little confused because it's like, okay, you have two classes of data. And you can say, "Okay, please draw me a line
-
14:46 - 14:50such that you maximize the distance between - the smallest distance that [inaudible] between the line and
-
14:50 - 14:52the
-
14:52 - 14:54data points."
-
14:54 - 14:56And it seems like that's kind of what we're doing, but it's - it seems like
-
14:56 - 15:00this is more complicated than that. And I guess I'm wondering what
-
15:00 - 15:04is the difference. Instructor (Andrew Ng):I see. So I mean, this is - the question is: two classes of data - trying to find a separating hyperplane. And
-
15:04 - 15:09this seems
-
15:09 - 15:12more complicated than trying to find a line [inaudible]. So I'm
-
15:12 - 15:18just repeating the questions in case - since I'm not sure how all the audio catches it.
-
15:18 - 15:21So the answer is this is actually exactly that problem. This is exactly that problem
-
15:21 - 15:23of
-
15:23 - 15:27given the two classes of data, positive and negative examples,
-
15:27 - 15:30this is exactly the formalization of the problem
-
15:30 - 15:32where I go is to find
-
15:32 - 15:36a line that separates the two - the positive and negative examples,
-
15:36 - 15:43maximizing the worst-case distance between the [inaudible] point and this line. Okay? Yeah, [Inaudible]? Student:So why do you care about the worst-case
-
15:43 - 15:44distance [inaudible]?
-
15:44 - 15:46Instructor (Andrew Ng):Yeah, let me - for now, why do we care about the worst-case distance? For now,
-
15:46 - 15:48
-
15:48 - 15:53let's just say - let's just care about the worst-case distance for now. We'll come back,
-
15:53 - 15:55and we'll fix that later. That's a -
-
15:55 - 15:58caring about the worst case is just -
-
15:58 - 16:01is just a nice way to formulate this optimization problem. I'll come back, and I'll
-
16:01 - 16:04change that later. Okay,
-
16:04 - 16:10raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool.
-
16:10 - 16:14Great. So let's see -
-
16:14 - 16:16so this is just a different way of posing
-
16:16 - 16:18the same optimization problem. And
-
16:18 - 16:22on the one hand, I've got to get rid of this nasty, nonconvex constraint, while on
-
16:22 - 16:24the other hand, I've now
-
16:24 - 16:25added
-
16:25 - 16:30a nasty, nonconvex objective. In particular, this is not a convex function in parameters
-
16:30 - 16:31w.
-
16:31 - 16:33And so you can't -
-
16:33 - 16:37you don't have the usual guarantees, like that if you find a local minimum, it's a
-
16:37 - 16:38global minimum.
-
16:38 - 16:40At least that
-
16:40 - 16:45does not follow immediately from this because this is nonconvex.
-
16:45 - 16:48
-
16:48 - 16:51So what
-
16:51 - 16:55I'm going to do is,
-
16:55 - 16:57earlier, I said that you can pose
-
16:57 - 16:58any of
-
16:58 - 17:03a number of even fairly bizarre scaling constraints on w. So you can
-
17:03 - 17:06choose any scaling constraint like this, and things are still fine.
-
17:06 - 17:08And so here's
-
17:08 - 17:14the scaling I'm going to choose to add. Again, I'm
-
17:14 - 17:18gonna assume, for the purposes of today's lecture, that these examples are linearly
-
17:18 - 17:20separable, that
-
17:20 - 17:23you can actually separate the positive and negative classes, and that we'll come back and
-
17:23 - 17:26fix this later as well.
-
17:26 - 17:29But here's the scaling constraint I want to impose on w. I want
-
17:29 - 17:34to impose a constraint that
-
17:34 - 17:38the functional margin is equal to 1.
-
17:38 - 17:40And another way of writing that
-
17:40 - 17:44is that I want to impose a constraint that min
-
17:44 - 17:51over i of yi times w transpose xi + b equals 1 -
-
17:52 - 17:55that in the worst case, the functional margin is equal to 1.
-
17:55 - 17:59And clearly, this is a scaling constraint because if
-
17:59 - 18:02
-
18:02 - 18:06you solve for w and b, and you find that your worst-case function margin is actually 10 or
-
18:06 - 18:07whatever,
-
18:07 - 18:10then by dividing through w and b by a factor of 10, I can get
-
18:10 - 18:13my functional margin to be equal to 1.
-
18:13 - 18:16So this is a scaling constraint [inaudible] would
-
18:16 - 18:17imply. And this is
-
18:17 - 18:19just more compactly written
-
18:19 - 18:23as follows. This is imposing a constraint that the functional margin be
-
18:23 - 18:26equal to 1.
-
18:26 - 18:29And so if we just take
-
18:29 - 18:32what I wrote down as No. 2 of our previous optimization problem and add the
-
18:32 - 18:35scaling constraint,
-
18:35 - 18:38we then get the following optimization problem:
-
18:38 - 18:45min over w, b of one-half the norm of w squared.
-
18:57 - 19:01I guess previously, we had a maximization
-
19:01 - 19:06of gamma hat divided by the norm of w. With gamma hat fixed to 1, we maximize
-
19:06 - 19:101 over the norm of w, and that's the same as minimizing the norm of w squared. Maximizing
-
19:10 - 19:111 over the norm of w
-
19:11 - 19:15is the same as minimizing the norm of w squared. And then these
-
19:15 - 19:16are our constraints,
-
19:16 - 19:21since I've added the constraint that the functional margin
-
19:21 - 19:24is at least 1. And this is actually
-
19:24 - 19:27my
-
19:27 - 19:30final - well,
-
19:30 - 19:37final formulation of the optimal margin classifier problem, at least for now.
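To make this final formulation concrete, here is a hedged sketch that hands min (1/2)||w||^2, subject to yi times (w transpose xi + b) >= 1, to a generic solver. It uses scipy's SLSQP method rather than a dedicated QP package, and the tiny dataset is made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):              # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * np.dot(w, w)  # (1/2) ||w||^2

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": (lambda p, xi=xi, yi=yi: yi * (np.dot(p[:2], xi) + p[2]) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)  # every functional margin should come out >= 1
```

For this toy data the solver should recover roughly w = (0.5, 0.5) and b = -1, with the points (2, 2) and (0, 0) sitting exactly on the margin as support vectors.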
-
19:37 - 19:38So the picture
-
19:38 - 19:41to
-
19:41 - 19:45keep in mind for this, I guess, is that our
-
19:45 - 19:48optimization objective is one where you minimize the norm of w. And so our
-
19:48 - 19:50optimization objective
-
19:50 - 19:52is just this quadratic function.
-
19:52 - 19:55And for those pictures, I can draw
-
19:55 - 19:57it. So -
-
19:57 - 19:59if the axes are w1 and w2, and you
-
19:59 - 20:03want to minimize a quadratic function like this - a quadratic function
-
20:03 - 20:06just has contours that look like this.
-
20:06 - 20:10And moreover, you have a number of linear constraints on your parameters, so you may have a linear
-
20:10 - 20:12constraint that eliminates that half space, or another
-
20:12 - 20:17linear constraint that eliminates that half space. So there's that half space
-
20:17 - 20:20
-
20:20 - 20:23and so on.
-
20:23 - 20:26And so the picture is you have a quadratic function,
-
20:26 - 20:27and you're ruling out
-
20:27 - 20:29
-
20:29 - 20:34various half spaces with each of these linear constraints. And I hope - if
-
20:34 - 20:38you can picture this in 3D - I guess I can't really draw in 3D - I hope you can
-
20:38 - 20:43convince yourself that this is a convex problem that has no local optima. And if you
-
20:43 - 20:46run gradient
-
20:46 - 20:50descent within the set of points that hasn't been ruled out, then you'll converge to the
-
20:50 - 20:53global optimum.
-
20:53 - 20:56And so that's the convex optimization problem.
-
20:56 - 20:59The - does this [inaudible] nice and
-
20:59 - 21:02[inaudible].
-
21:02 - 21:09Questions about this?
-
21:17 - 21:24Actually, just raise your hand if this makes sense. Okay, cool.
-
21:26 - 21:30So this gives you the optimal margin classifier algorithm.
-
21:30 - 21:32And it turns out that
-
21:32 - 21:34this is a convex optimization problem,
-
21:34 - 21:37so you can actually take this formulation of the problem
-
21:37 - 21:41and throw it at off-the-shelf software - what's called QP, or quadratic
-
21:41 - 21:45program, software. This sort of optimization problem is called a quadratic program,
-
21:45 - 21:49with a convex quadratic objective function and linear constraints -
-
21:49 - 21:54so you can actually download software to solve these optimization problems for you.
-
21:54 - 21:56Usually, you wanna use a
-
21:56 - 21:57QP solver
-
21:57 - 22:00because you have constraints like these, although you could actually modify other solvers to work with
-
22:00 - 22:04this, too.
-
22:04 - 22:06So we could just declare success and say that we're done with this formulation of the
-
22:06 - 22:08problem. But
-
22:08 - 22:11what I'm going to do now is take a
-
22:11 - 22:15digression to talk about primal and dual optimization problems.
-
22:15 - 22:16And in particular, I'm going to
-
22:16 - 22:18-
-
22:18 - 22:22later, I'm going to come back and derive yet another very different form
-
22:22 - 22:25of this optimization problem.
-
22:25 - 22:29And the reason we'll do that is because it turns out this optimization problem has
-
22:29 - 22:33certain properties that make it amenable to very efficient algorithms.
-
22:33 - 22:35And moreover, I'll be deriving
-
22:35 - 22:37what's called the dual formulation of this
-
22:37 - 22:40that allows us to
-
22:40 - 22:43apply the optimal margin classifier even in
-
22:43 - 22:47very high-dimensional feature spaces - even in sometimes infinite
-
22:47 - 22:50dimensional feature spaces. So we
-
22:50 - 22:52can come back to that later.
-
22:52 - 22:53But
-
22:53 - 22:54let me now -
-
22:54 - 23:01since I'm talking about
-
23:03 - 23:06convex optimization. So how many here is - how many of you, from, I don't
-
23:06 - 23:08know, calculus,
-
23:08 - 23:12remember the method of Lagrange multipliers for
-
23:12 - 23:14solving an optimization problem
-
23:14 - 23:18like minimum - minimization, maximization problem subject to some constraint?
-
23:18 - 23:23How many of you remember the method of Lagrange multipliers for that? Oh, okay,
-
23:23 - 23:24cool. Some of you, yeah.
-
23:24 - 23:28So if you don't remember, don't worry. I - I'll describe that briefly
-
23:28 - 23:28here
-
23:28 - 23:32as well, but what I'm really gonna do is talk about the generalization of this method
-
23:32 - 23:34of Lagrange multipliers
-
23:34 - 23:38that you may or may not have seen in some calculus classes. But if you haven't
-
23:38 - 23:40seen it before, don't worry about it.
-
23:40 - 23:43
-
23:43 - 23:49So the method of Lagrange multipliers was - well, suppose there's some
-
23:49 - 23:53function you want to minimize - minimize f of w -
-
23:53 - 23:56subject to
-
23:56 - 24:03some set of constraints that h i of w must equal 0, for i = 1 through l. And
-
24:03 - 24:07given this constraint, I'll actually usually write it in vectorial
-
24:07 - 24:10form in which I write h of w
-
24:10 - 24:11as this
-
24:11 - 24:17vector-valued function.
-
24:17 - 24:21So that is equal to 0, where 0 is the arrow on top. I used
-
24:21 - 24:25that to denote the vector of
-
24:25 - 24:27all 0s. So you want to solve this optimization problem.
-
24:27 - 24:29
-
24:29 - 24:32Some of you have seen the method of Lagrange multipliers, where
-
24:32 - 24:35you construct this
-
24:35 - 24:39Lagrangian,
-
24:39 - 24:46L of w, beta = f of w plus the sum over i of beta i times h i of w,
-
24:46 - 24:50which is the original optimization objective plus a sum of Lagrange multipliers times
-
24:50 - 24:52the constraints.
-
24:52 - 24:55And these parameters beta i
-
24:55 - 25:02we call the Lagrange multipliers.
-
25:02 - 25:06And so the way you actually solve the optimization problem is
-
25:06 - 25:11you take the partial derivative of this with respect to the original parameters
-
25:11 - 25:14and set that to 0, and also take
-
25:14 - 25:20the partial derivative with respect to your Lagrange multipliers beta, and set that to 0.
-
25:20 - 25:24And then the theorem, due, I guess, to
-
25:24 - 25:25Lagrange,
-
25:25 - 25:31was that for w - for some value w star to be a
-
25:31 - 25:32solution,
-
25:32 - 25:35it
-
25:35 - 25:42is necessary
-
25:43 - 25:48that - can this
-
25:48 - 25:49be
-
25:49 - 25:50
-
25:50 - 25:54the star? Student:Right. Instructor (Andrew Ng):The backwards e - there exists. So there exists
-
25:54 - 25:55beta star
-
25:55 - 26:00
-
26:00 - 26:07
-
26:17 - 26:20such that those partial derivatives are
-
26:20 - 26:22equal to
-
26:22 - 26:220.
-
26:22 - 26:24So the
-
26:24 - 26:25method
-
26:25 - 26:29of Lagrange multipliers to solve this problem is:
-
26:29 - 26:31you construct a Lagrangian,
-
26:31 - 26:33take the derivatives with respect to
-
26:33 - 26:35the original
-
26:35 - 26:40parameters w, and with respect to the Lagrange multipliers beta.
-
26:40 - 26:42Set the partial derivatives equal to 0, and solve
-
26:42 - 26:45for our solutions. And then you check each of the solutions to see if it is indeed a minimum.
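For the equality-constrained case just described, the stationarity conditions of a quadratic objective with a linear constraint form a linear system, so a small worked example (a made-up toy problem, not one from the lecture) can be solved directly:

```python
import numpy as np

# Toy problem: minimize f(w) = w1^2 + w2^2 subject to h(w) = w1 + w2 - 1 = 0.
# Lagrangian: L(w, beta) = w1^2 + w2^2 + beta * (w1 + w2 - 1).
# Setting the partial derivatives to zero gives a linear system:
#   dL/dw1   = 2*w1 + beta  = 0
#   dL/dw2   = 2*w2 + beta  = 0
#   dL/dbeta = w1 + w2 - 1  = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
w1, w2, beta = np.linalg.solve(A, rhs)  # w1 = w2 = 0.5, beta = -1
```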
-
26:45 - 26:48Great.
-
26:48 - 26:52So great
-
26:52 - 26:57-
-
26:57 - 26:59so what I'm going to do is actually
-
26:59 - 27:00write
-
27:00 - 27:03down the generalization of this. And
-
27:03 - 27:06if you haven't seen that before, don't worry about it. This is [inaudible].
-
27:06 - 27:11So what I'm going to do is actually write down the generalization of this to solve
-
27:11 - 27:16a slightly more difficult type of constraint optimization problem,
-
27:16 - 27:20which is suppose you want to minimize f of w
-
27:20 - 27:24subject to the constraint that gi of w,
-
27:24 - 27:26excuse me,
-
27:26 - 27:31is less than equal to
-
27:31 - 27:350, and that hi of w is equal to 0.
-
27:35 - 27:38And
-
27:38 - 27:43again, using my vector notation, I'll write this as g of w less than or equal to 0, and h of w equal to 0.
-
27:43 - 27:45So
-
27:45 - 27:49in this
-
27:49 - 27:52case, we now have inequality constraints as well as
-
27:52 - 27:59equality constraints.
-
28:02 - 28:03I then have
-
28:03 - 28:07a Lagrangian - or it's actually called the generalized
-
28:07 - 28:08Lagrangian -
-
28:08 - 28:13which is now a function of my original optimization parameters w,
-
28:13 - 28:18as well as two sets of Lagrange multipliers, alpha and beta.
-
28:18 - 28:20And so this will be
-
28:20 - 28:27f of w plus the sum over i of alpha i times g i of w, plus the sum over i of beta i times h i of w.
-
28:29 - 28:33Now,
-
28:33 - 28:37
-
28:37 - 28:39here's a
-
28:39 - 28:40cool part.
-
28:40 - 28:43I'm going to define theta
-
28:43 - 28:44
-
28:44 - 28:46subscript p of
-
28:46 - 28:47w
-
28:47 - 28:49to be equal to
-
28:49 - 28:52the max over alpha, beta - subject to the
-
28:52 - 28:59constraint that the alpha i are greater than or equal
-
28:59 - 29:06to 0 - of the Lagrangian.
-
29:10 - 29:15And so
-
29:15 - 29:18I want you to consider
-
29:18 - 29:22the optimization problem
-
29:22 - 29:24min over w of
-
29:24 - 29:31max over alpha, beta - such that the alpha i are greater than or equal to 0 - of the Lagrangian. And
-
29:32 - 29:36that's just equal to min over w,
-
29:36 - 29:42theta p of
-
29:42 - 29:48w. And just to give this a name, the subscript p here stands for
-
29:48 - 29:50primal problem.
-
29:50 - 29:57And that refers to this entire thing.
-
29:57 - 30:01This optimization problem that I've written down is called the primal problem. This
-
30:01 - 30:04is the original optimization problem which we're solving.
-
30:04 - 30:08And later on, I'll derive in another version of this, but that's what p
-
30:08 - 30:13stands for. It's a - this is a primal problem.
-
30:13 - 30:18Now, I want you to look at - consider theta p again. And in particular, I wanna
-
30:18 - 30:22consider the problem of what happens if you minimize w
-
30:22 - 30:24- minimize as a function of w
-
30:24 - 30:27this quantity theta p.
-
30:27 - 30:34
-
30:34 - 30:38So let's look at what theta p of w is. Notice
-
30:38 - 30:40that
-
30:40 - 30:42if
-
30:42 - 30:43gi of w
-
30:43 - 30:47is greater than 0, so let's pick the value of w.
-
30:47 - 30:50And let's ask, what is theta p of w?
-
30:50 - 30:57So if w violates one of your primal problem's constraints,
-
30:57 - 31:00
-
31:00 - 31:04then theta p of w would be infinity.
-
31:04 - 31:06Why is that?
-
31:09 - 31:13Let's look at theta p for a second.
-
31:09 - 31:13Suppose I pick a value of w that violates one of these constraints. So gi of
-
31:13 - 31:17w is positive.
-
31:17 - 31:22Then - well, theta p is this - maximize this function of alpha and beta - the Lagrangian. So
-
31:22 - 31:26if one of these gi of w's is positive,
-
31:26 - 31:30then by setting the corresponding alpha i to plus infinity, I can make this
-
31:30 - 31:32arbitrarily large.
-
31:32 - 31:35And so if w violates one of my
-
31:35 - 31:38primal problem's constraints in one of the gis, then
-
31:38 - 31:42max over alpha of this Lagrangian will be plus
-
31:42 - 31:49infinity. And in the same way - I guess in a similar way,
-
31:50 - 31:55if hi of w is not equal to
-
31:55 - 31:580,
-
31:58 - 32:03then theta p of w will also be infinity, for a very similar reason, because
-
32:03 - 32:06if hi of w is not equal to 0 for some value of i,
-
32:06 - 32:10then in my Lagrangian, I had a beta i times h i of w term.
-
32:10 - 32:13And so by setting beta i to be plus infinity or minus
-
32:13 - 32:16infinity depending on the sign of hi,
-
32:16 - 32:20I can make this plus infinity as well.
-
32:20 - 32:22And otherwise,
-
32:22 - 32:29
-
32:30 - 32:34
-
32:34 - 32:38theta p of w is just equal to f of w. Turns
-
32:38 - 32:40out if
-
32:40 - 32:42I had a value of w that
-
32:42 - 32:45satisfies all of the gi and the hi constraints,
-
32:45 - 32:48then the maximum over alpha and beta
-
32:48 - 32:49- all the Lagrange multiplier
-
32:49 - 32:54terms - will actually be obtained by
-
32:54 - 32:58setting all the Lagrange multipliers to be 0,
-
32:58 - 32:59and so theta p
-
32:59 - 33:04is just left with f of w. Thus, theta
-
33:04 - 33:06p
-
33:06 - 33:11of w is equal to
-
33:11 - 33:14f of w
-
33:14 - 33:17if w satisfies
-
33:17 - 33:18the constraints
-
33:18 - 33:21- that is,
-
33:21 - 33:23the gi and
-
33:23 - 33:24hi constraints -
-
33:24 - 33:28and is equal to plus infinity
-
33:28 - 33:32otherwise.
-
33:32 - 33:34So the problem I
-
33:34 - 33:39wrote down that minimizes as a function of w -
-
33:39 - 33:45theta p of w - this is the primal problem. That's
-
33:45 - 33:48just
-
33:48 - 33:51exactly the same problem as my original primal problem
-
33:51 - 33:55because if you choose a value of w that violates the constraints, you get infinity.
-
33:55 - 33:58And if you satisfy the constraints, you get f of w.
-
33:58 - 34:00So this is really just the same as - well,
-
34:00 - 34:01we'll say,
-
34:01 - 34:05"Satisfy the constraints, and minimize f of w." That's really what
-
34:05 - 34:09minimizing theta p
-
34:09 - 34:16of w is. Raise your hand if this makes sense. Yeah, okay, cool. So all right.
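As a quick sanity check, this behavior of theta p is easy to reproduce in code. The following is only an illustrative sketch - the toy problem (minimize f(w) = w squared subject to g(w) = 1 - w less than or equal to 0) and all the names in it are mine, not from the lecture:

```python
import numpy as np

def theta_p(w, f, gs, hs):
    """theta_p(w) = max over alpha >= 0, beta of the Lagrangian.

    If any g_i(w) > 0 or h_i(w) != 0, the max is +infinity (drive the
    offending multiplier to infinity); otherwise every multiplier term
    is maximized at 0, so theta_p(w) is just f(w)."""
    if any(g(w) > 0 for g in gs) or any(h(w) != 0 for h in hs):
        return np.inf
    return f(w)

# Toy primal problem: minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.
f = lambda w: w ** 2
g = lambda w: 1.0 - w

print(theta_p(0.5, f, [g], []))  # inf: w = 0.5 violates the constraint
print(theta_p(2.0, f, [g], []))  # 4.0: feasible, so theta_p = f(2)
```

So minimizing theta_p over w recovers exactly the original constrained problem.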
-
34:25 - 34:28I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly
-
34:28 - 34:31the same thing we started with. So here's
-
34:31 - 34:33the cool part.
-
34:33 - 34:40Let me now define the dual problem. To define
-
34:40 - 34:42theta
-
34:42 - 34:45d
-
34:45 - 34:47- and the d stands for dual -
-
34:47 - 34:50and this is a function of alpha and beta. It's a function of the Lagrange
-
34:50 - 34:53multipliers. It's not a function of w.
-
34:53 - 34:58To find this, we minimize over w of
-
34:58 - 35:05my generalized Lagrangian. And
-
35:06 - 35:13my dual problem is this. So in
-
35:17 - 35:20other words, this is max over
-
35:20 - 35:22that.
-
35:22 - 35:23
-
35:23 - 35:28
-
35:28 - 35:33And so this is my dual optimization problem. To maximize over alpha and
-
35:33 - 35:34beta, theta
-
35:34 - 35:35d over alpha
-
35:35 - 35:38and beta. So this optimization problem, I
-
35:38 - 35:41guess, is my dual problem. I
-
35:41 - 35:45want you to compare this to our previous prime optimization problem.
-
35:45 - 35:47The only difference is that
-
35:47 - 35:52I took the max and min, and I switched the order around with the max and
-
35:52 - 35:56min. That's the difference between the primal and the dual optimization problems.
-
35:56 - 36:00And it turns out that
-
36:00 - 36:02
-
36:02 - 36:05it's a - it's sort of - it's a fact - it's
-
36:05 - 36:06true, generally, that d
-
36:06 - 36:10star is less than or equal to p star. In other words, I think I defined
-
36:10 - 36:16p star previously. P star was the value of the primal optimization problem.
-
36:16 - 36:20And in other words, that it's just generally true
-
36:20 - 36:25that the max of the min of something is less than or equal to the min of the max
-
36:25 - 36:28of something.
-
36:28 - 36:29And this is a
-
36:29 - 36:30general fact.
-
36:30 - 36:33And just as a concrete example, the
-
36:33 - 36:37max over y in the set {0, 1}
-
36:37 - 36:40of the min over x in the
-
36:40 - 36:44set {0, 1}
-
36:44 - 36:48of the indicator of x =
-
36:48 - 36:51y - this is
-
36:51 - 36:58less than or equal to the
-
37:03 - 37:10corresponding min over x of the max over y.
-
37:10 - 37:14So this equality - this inequality actually holds true for any
-
37:14 - 37:16function you might find in here.
-
37:16 - 37:18And this is one specific example
-
37:18 - 37:21where
-
37:21 - 37:24the min over x of the indicator x equals y -
-
37:24 - 37:27this is always equal to 0
-
37:27 - 37:31because whatever y is, you can choose x to be something different. So
-
37:31 - 37:33this is always 0,
-
37:33 - 37:34whereas if
-
37:34 - 37:38I exchange the order to min
-
37:38 - 37:40and max, then the thing here is always equal to
-
37:40 - 37:421. So 0 is less than or equal to
-
37:42 - 37:451. And more generally, this max/min
-
37:45 - 37:51being less than or equal to the min/max holds true for any function you might put in
-
37:51 - 37:52there.
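That concrete indicator example is small enough to check directly; here is a sketch (my own code, with x and y each ranging over {0, 1}):

```python
# max over y of (min over x of 1{x = y}) versus the swapped order.
indicator = lambda x, y: 1 if x == y else 0

# For any y, we can pick x != y, so the inner min is always 0.
max_min = max(min(indicator(x, y) for x in (0, 1)) for y in (0, 1))

# For any x, we can pick y = x, so the inner max is always 1.
min_max = min(max(indicator(x, y) for y in (0, 1)) for x in (0, 1))

print(max_min, min_max)  # 0 1 -- so max-min <= min-max here
```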
-
37:52 - 37:58But it turns out that sometimes under certain conditions,
-
37:58 - 37:59
-
37:59 - 38:01
-
38:01 - 38:02
-
38:02 - 38:03
-
38:03 - 38:07these two optimization problems have the same value. Sometimes under certain
-
38:07 - 38:08conditions,
-
38:08 - 38:10the primal and the dual problems
-
38:10 - 38:12have the same value.
-
38:12 - 38:15And so
-
38:15 - 38:16you might be able to solve
-
38:16 - 38:20the dual problem rather than the primal problem.
-
38:20 - 38:22And the reason to do that is that
-
38:22 - 38:26sometimes, which we'll see in the optimal margin classifier problem, the support vector machine problem,
-
38:26 - 38:31the dual problem turns out to be much easier to solve - it often has many useful properties
-
38:31 - 38:34that will make it easier to work with
-
38:34 - 38:41compared to the primal. So for the sake of -
-
38:42 - 38:48so
-
38:48 - 38:55what
-
39:02 - 39:05I'm going to do now is write down formally the certain conditions under which that's
-
39:05 - 39:09true - where the primal and the dual problems are equivalent.
-
39:09 - 39:14And so our strategy for working out the derivation of the support vector machine algorithm
-
39:14 - 39:18will be that we'll write down the primal optimization problem, which we did
-
39:18 - 39:21previously for the optimal margin classifier.
-
39:21 - 39:22And then we'll
-
39:22 - 39:24derive the dual optimization problem for that.
-
39:24 - 39:26And then we'll solve the dual problem.
-
39:26 - 39:29And by modifying that a little bit, that's how we'll derive this support vector machine. But let me ask you - for
-
39:29 - 39:30
-
39:30 - 39:34now, let me just first, for
-
39:34 - 39:37the sake of completeness, I just write down the conditions under which the primal
-
39:37 - 39:42and the dual optimization problems give you the same solutions. So let f
-
39:42 - 39:45be convex. If you're not
-
39:45 - 39:47
-
39:47 - 39:50sure what convex means, for the purposes of this class, you can take it to
-
39:50 - 39:53mean that the
-
39:53 - 39:56Hessian, H, is positive semi-definite, so it just means it's
-
39:56 - 39:58a bowl-shaped function like that.
-
39:58 - 39:59And
-
39:59 - 40:01once you learn more about optimization
-
40:01 - 40:07- again, please come to this week's discussion session taught by the TAs.
-
40:07 - 40:09Then
-
40:09 - 40:11suppose
-
40:11 - 40:13hi
-
40:13 - 40:18- the hi constraints are affine, and what that means is that hi of w equals a i
-
40:18 - 40:25transpose w plus b i. This actually
-
40:25 - 40:29means almost the same thing as linear. Without the term b i here, we'd say
-
40:29 - 40:30that hi is linear;
-
40:30 - 40:34when we have a constant intercept as well, this is technically called affine rather than
-
40:34 - 40:37linear.
-
40:37 - 40:40And let's suppose
-
40:40 - 40:43
-
40:43 - 40:50that gi's are strictly feasible.
-
40:51 - 40:54And
-
40:54 - 40:57what that means is that
-
40:57 - 41:01there exists a value of w such that
-
41:01 - 41:04for all i,
-
41:04 - 41:07gi of w is strictly less
-
41:07 - 41:11than 0. Don't worry too much [inaudible]. I'm writing these things down for the sake of completeness, but don't worry too much about all the
-
41:11 - 41:13technical details.
-
41:13 - 41:16Strictly feasible, which just means that there's a value of w such that
-
41:16 - 41:19all of these constraints are satisfied with strict inequality rather than with
-
41:19 - 41:22less than or equal to.
-
41:22 - 41:22
-
41:22 - 41:25Under these conditions,
-
41:25 - 41:26
-
41:26 - 41:30there will exist w star,
-
41:30 - 41:32alpha
-
41:32 - 41:34star, beta
-
41:34 - 41:40star such that w star solves the primal problem.
-
41:40 - 41:42And alpha star
-
41:42 - 41:49and beta star, the Lagrange multipliers, solve the dual problem.
-
41:52 - 41:54
-
41:54 - 41:59And the value of the primal problem will be equal to the value of the dual problem will
-
41:59 - 42:04be equal to the value of your Lagrange multiplier - excuse
-
42:04 - 42:06me, will be equal to the value of your generalized Lagrangian, the value
-
42:06 - 42:09of that w star, alpha star, beta star.
-
42:09 - 42:13In other words, you can solve either the primal or the dual problem. You get
-
42:13 - 42:15the same
-
42:15 - 42:20solution. Further,
-
42:20 - 42:23your parameters will satisfy
-
42:23 - 42:24
-
42:24 - 42:27
-
42:27 - 42:32these conditions. The partial derivatives with respect to the parameters will be
-
42:32 - 42:330. And
-
42:33 - 42:37actually, to keep this equation in mind, we'll actually use this in a second
-
42:37 - 42:39when we take the Lagrangian, and we - and
-
42:39 - 42:43our support vector machine problem, and take a derivative with respect to w to solve a -
-
42:43 - 42:45to solve our -
-
42:45 - 42:47to derive our dual problem. We'll actually
-
42:47 - 42:50perform this step ourselves in a second.
-
42:50 - 42:52
-
42:52 - 42:58Partial derivative with respect to the Lagrange multiplier beta is
-
42:58 - 43:02
-
43:02 - 43:05equal
-
43:05 - 43:06to
-
43:06 - 43:080.
-
43:08 - 43:10Turns out this will hold true,
-
43:10 - 43:12too.
-
43:12 - 43:14This is called
-
43:14 - 43:19
-
43:19 - 43:21the -
-
43:21 - 43:27well
-
43:27 - 43:29-
-
43:29 - 43:32this is called the KKT complementary condition.
-
43:32 - 43:38KKT stands for Karush, Kuhn, and Tucker, who were the authors of this theorem.
-
43:38 - 43:42Well, and by tradition, these are usually called the KKT conditions.
-
43:42 - 43:49But the other two are
-
43:50 - 43:53- just that the alpha i's are greater than or equal to 0, which we had
-
43:53 - 43:55previously
-
43:55 - 44:02and that your constraints are actually satisfied.
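To make these KKT conditions concrete, here's a hedged numerical check on a one-dimensional toy problem (the problem, its solution w* = 1, alpha* = 2, and the variable names are all mine, not from the lecture):

```python
# Toy problem: minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.
# Lagrangian: L(w, alpha) = w^2 + alpha * (1 - w).
# Solving dL/dw = 2w - alpha = 0 together with g(w) = 0 gives:
w_star, alpha_star = 1.0, 2.0

grad_L = 2 * w_star - alpha_star   # stationarity: dL/dw = 0
g_val = 1.0 - w_star               # primal feasibility: g(w*) <= 0
comp = alpha_star * g_val          # complementary condition: alpha* g(w*) = 0

print(grad_L, g_val, comp)  # 0.0 0.0 0.0 -- all KKT conditions hold
```

Note the constraint is active (g = 0) while its multiplier is nonzero, matching the complementary condition just stated.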
-
44:03 - 44:10So let's see. [Inaudible] All
-
44:22 - 44:29right.
-
44:30 - 44:34So let's take those and apply this to our
-
44:34 - 44:37optimal margin
-
44:37 - 44:41optimization problem that we had previously. I
-
44:41 - 44:44was gonna say one word about this,
-
44:44 - 44:47which is -
-
44:47 - 44:49was gonna say one word about this
-
44:49 - 44:51KKT
-
44:51 - 44:52complementary condition is
-
44:52 - 44:54that a condition that is a -
-
44:54 - 45:01at your solution, you must have that alpha star i times gi of w star is equal to 0.
-
45:01 - 45:03So
-
45:03 - 45:04let's see.
-
45:04 - 45:09So the product of two numbers is equal to 0. That means that
-
45:09 - 45:09
-
45:09 - 45:12at least one of these things must be equal to
-
45:12 - 45:160. For the product of two things to be equal to 0, well, that's just saying either alpha
-
45:16 - 45:17i
-
45:17 - 45:21or gi is equal to 0.
-
45:21 - 45:26So what that implies is that the - just Karush-Kuhn-Tucker
-
45:26 - 45:32
-
45:32 - 45:38
-
45:38 - 45:43- most people just say KKT, but we wanna show you the right
-
45:43 - 45:46spelling
-
45:46 - 45:47of their names. So
-
45:47 - 45:48KKT
-
45:48 - 45:52complementary condition implies that if alpha
-
45:52 - 45:55i is not 0,
-
45:55 - 46:01that necessarily implies that
-
46:01 - 46:03gi of w star
-
46:03 - 46:06is equal to
-
46:06 - 46:110.
-
46:11 - 46:13And
-
46:13 - 46:15usually,
-
46:15 - 46:21
-
46:21 - 46:24
-
46:24 - 46:26it turns out - so
-
46:26 - 46:29all the KKT condition guarantees is that
-
46:29 - 46:33at least one of them is 0. It may actually be the case that both
-
46:33 - 46:35alpha and gi are both equal to 0.
-
46:35 - 46:39But in practice, when you solve this optimization problem, you find that
-
46:39 - 46:43to a large part, alpha i star is not equal to 0 if and only if gi of
-
46:43 - 46:45w star is equal to 0.
-
46:45 - 46:50This is not strictly true because it's possible that both of these may be 0.
-
46:50 - 46:52But in practice,
-
46:52 - 46:55when we - because when we solve problems like these, you're, for the most part,
-
46:55 - 46:59usually exactly one of these will be non-0.
-
46:59 - 47:03And also, when this holds true, when gi of w star is equal to 0,
-
47:03 - 47:06we say that
-
47:06 - 47:07gi
-
47:07 - 47:14- gi of w, I guess, is an active
-
47:16 - 47:18constraint
-
47:18 - 47:22because we call a constraint - our constraint was a gi of w must be less than or equal to
-
47:22 - 47:230.
-
47:23 - 47:26And so if it is equal to 0, then
-
47:26 - 47:33we say that this is an active constraint.
-
47:34 - 47:37Once we talk about [inaudible], we come back and [inaudible] and just extend this
-
47:37 - 47:43idea a little bit more.
-
47:43 - 47:48[Inaudible] board. [Inaudible]
-
47:48 - 47:52turn
-
47:52 - 47:59to this board in a second, but -
-
48:07 - 48:11so let's go back and work out the primal and the dual optimization problems for
-
48:11 - 48:12
-
48:12 - 48:17our optimal margin classifier for the optimization problem that we worked on just now. As
-
48:17 - 48:20a point of notation,
-
48:20 - 48:25in whatever I've been writing down so far in deriving the KKT conditions,
-
48:25 - 48:28
-
48:28 - 48:32when Lagrange multipliers were alpha i and beta i,
-
48:32 - 48:36it turns out that when applied to the
-
48:36 - 48:37SVM,
-
48:37 - 48:41turns out we only have one set of Lagrange multipliers alpha i.
-
48:41 - 48:44And also,
-
48:44 - 48:47as I was working out the KKT conditions, I used w
-
48:47 - 48:52to denote the parameters of my primal optimization problem. [Inaudible] I wanted to
-
48:52 - 48:54minimize f of w. In my
-
48:54 - 48:58very first optimization problem, I had
-
48:58 - 48:59that optimization problem [inaudible] finding the parameters
-
49:02 - 49:05w. In my SVM problem, I'm
-
49:02 - 49:05actually gonna have two sets of parameters, w and b. So
-
49:05 - 49:08this is just a - keep that
-
49:08 - 49:15sort of slight notation change in mind. So
-
49:16 - 49:20problem we worked out previously was we want to minimize the norm of w squared, and just add a
-
49:20 - 49:21half there
-
49:21 - 49:26by convention because it makes the math work a little
-
49:26 - 49:27nicer.
-
49:27 - 49:33And subject to the constraint that yi times w transpose xi + b must be greater than or equal to 1.
-
49:41 - 49:46And so let me just take this constraint, and I'll rewrite it as a constraint. It's gi of w,
-
49:46 - 49:49b. Again, previously,
-
49:49 - 49:53I had gi
-
49:53 - 49:56of w, but now I have parameters w and b. So
-
49:56 - 50:03gi of w, b is defined
-
50:03 - 50:10as 1 minus yi times w transpose xi + b, which must be less than or equal to 0.
-
50:10 - 50:16So
-
50:16 - 50:20let's look at the implications of this in terms
-
50:20 - 50:25of the KKT dual complementary condition again.
-
50:25 - 50:29So we have that if alpha i is not equal to
-
50:29 - 50:320, that necessarily implies that
-
50:32 - 50:35gi of w, b
-
50:35 - 50:42is equal to 0. In other words, this is an active constraint.
-
50:47 - 50:54And what does this mean? It turns out gi of w, b equal to 0 - that
-
50:54 - 51:01is - that means exactly that the training example xi, yi
-
51:01 - 51:08has functional margin
-
51:09 - 51:10equal to 1.
-
51:10 - 51:11Because
-
51:11 - 51:15this constraint was that
-
51:15 - 51:19the functional margin of every example has to be greater equal to
-
51:19 - 51:221. And so if this is an active constraint, it just -
-
51:22 - 51:24the inequality holds with equality.
-
51:24 - 51:27That means that my training example i
-
51:27 - 51:30must have functional margin equal to exactly 1. And so - actually, yeah,
-
51:30 - 51:34right now, I'll
-
51:34 - 51:40do this on a different board, I guess.
-
51:40 - 51:47
-
51:57 - 51:59So in pictures,
-
51:59 - 52:04what that means is that, you have some
-
52:04 - 52:11training sets, and you'll
-
52:11 - 52:15have some separating hyperplane.
-
52:15 - 52:19And so the examples with functional margin equal to 1
-
52:19 - 52:23will be exactly those which are -
-
52:23 - 52:26so they're closest
-
52:26 - 52:28to my separating hyperplane. So
-
52:28 - 52:31that's
-
52:31 - 52:34my equation. [Inaudible] equal to 0.
-
52:34 - 52:39And so in this - in this cartoon example that I've done, it'll be
-
52:39 - 52:45exactly
-
52:45 - 52:50- these three
-
52:50 - 52:53examples that have functional margin
-
52:53 - 52:54equal to 1,
-
52:54 - 52:59and all of the other examples as being further away than these
-
52:59 - 53:06three will have functional margin that is strictly greater than 1.
-
53:06 - 53:08And
-
53:08 - 53:11the examples with functional margin equal to 1 will usually correspond to the
-
53:11 - 53:15
-
53:15 - 53:22
-
53:22 - 53:23ones where
-
53:23 - 53:27the corresponding Lagrange multipliers are also not equal to 0. And again, it may not
-
53:27 - 53:29hold true. It may be the case that
-
53:29 - 53:33gi and alpha i are both equal to 0. But usually, when gi
-
53:33 - 53:36is 0, alpha i will be non-0.
-
53:36 - 53:39And so the examples of functional margin equal to 1 will be the ones where alpha i is not equal
-
53:39 - 53:46to 0. One
-
53:47 - 53:49useful property of this is that
-
53:49 - 53:53as suggested by this picture and so true in general as well, it
-
53:53 - 53:57turns out that we find a solution to this - to the optimization problem,
-
53:57 - 54:02you find that relatively few training examples have functional margin equal to 1. In
-
54:02 - 54:03this picture I've drawn,
-
54:03 - 54:06there are three examples with functional margin equal to 1. There
-
54:06 - 54:09are just few examples of this minimum possible distance to your separating hyperplane.
-
54:09 - 54:11
-
54:11 - 54:14And these are three -
-
54:14 - 54:18these examples of functional margin equal to 1 - they
-
54:18 - 54:21are what we're going to call
-
54:21 - 54:26the support vectors. And
-
54:26 - 54:30this is where the name support vector machine comes from. There'll be these three points with functional margin
-
54:30 - 54:311
-
54:31 - 54:35that we're calling support vectors.
-
54:35 - 54:37And
-
54:37 - 54:40the fact that they're relatively few support vectors also means that
-
54:40 - 54:41usually,
-
54:41 - 54:43most of the alpha i's are equal to
-
54:40 - 54:430. So alpha i is equal
-
54:45 - 54:52to
-
54:53 - 54:540
-
54:54 - 55:01for the examples that are not support vectors.
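In code, identifying support vectors from a solved classifier looks like the following sketch (the training set and the values of w and b here are invented for illustration, not taken from the lecture):

```python
import numpy as np

# Functional margin of example i is y_i * (w . x_i + b); support
# vectors are exactly the examples where it equals 1 (active constraints).
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([0.5, 0.5]), -1.0

margins = y * (X @ w + b)
support = np.isclose(margins, 1.0)

print(margins)  # [1. 1. 2.] -- the third point is strictly farther away
print(support)  # [ True  True False]
```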
-
55:03 - 55:07Let's go ahead and work out the actual
-
55:07 - 55:08optimization problem.
-
55:08 - 55:12
-
55:12 - 55:19
-
55:28 - 55:29So
-
55:29 - 55:32we have the optimal margin
-
55:32 - 55:33optimization problem.
-
55:33 - 55:34
-
55:34 - 55:39So there we go and write down the Lagrangian,
-
55:39 - 55:39and
-
55:39 - 55:44because we only have inequality constraints - we really have gi
-
55:44 - 55:47constraints, no hi constraints. We have
-
55:47 - 55:50inequality constraints and no equality constraints,
-
55:50 - 55:52I'll only have
-
55:52 - 55:53Lagrange multipliers of type
-
55:53 - 55:57alpha - no betas in my generalized Lagrangian. But
-
55:57 - 56:01
-
56:01 - 56:03my Lagrangian will be
-
56:03 - 56:10one-half the norm of w squared, minus the sum over i of alpha i times the quantity yi times w transpose xi + b, minus 1.
-
56:15 - 56:18
-
56:18 - 56:20That's my
-
56:20 - 56:27Lagrange.
-
56:27 - 56:29And so let's work out what the dual problem is.
-
56:29 - 56:33And to do that, I need to figure out what theta d of alpha is - and again, no betas here
-
56:33 - 56:37- so what theta d of alpha
-
56:37 - 56:44is the min with
-
56:44 - 56:48respect to w, b of L of w, b, alpha. So the dual problem is to maximize theta d as a function of alpha.
-
56:48 - 56:55So we'll work out what theta d is, and then that'll give us our dual problem.
-
56:55 - 56:58So then to work out what this is, what do you need to do? We need to
-
56:58 - 57:02take the Lagrangian and minimize it as a function of w and b
-
57:02 - 57:03so - and what is this? How do you
-
57:03 - 57:05minimize the Lagrangian? So in order to
-
57:05 - 57:06
-
57:06 - 57:09minimize the Lagrangian as a function of w and b,
-
57:09 - 57:11we do the usual thing. We
-
57:11 - 57:13take the derivatives of the
-
57:13 - 57:17Lagrangian with respect to w and b. And we set them to 0. That's how we
-
57:17 - 57:20minimize the Lagrangian with respect
-
57:20 - 57:26to w and b. So take the derivative with respect to w of the Lagrangian.
-
57:26 - 57:28And
-
57:28 - 57:35I'll just write down the answer. You know how to do calculus like this.
-
57:38 - 57:41So I wanna minimize this function of w, so I take the derivative and set it
-
57:41 - 57:43to 0. And
-
57:43 - 57:45I get that. And
-
57:45 - 57:52then so this implies that w
-
57:56 - 57:59must be the sum over i of alpha i times yi times xi.
-
57:59 - 58:01And so
-
58:01 - 58:05w, therefore, is actually a linear combination of your input feature vectors xi.
-
58:05 - 58:07This is
-
58:07 - 58:10a sum with weights given by the alpha i's times
-
58:10 - 58:13the xi's, which are your examples in your training set.
-
58:13 - 58:16And this will be useful later. The
-
58:16 - 58:23other equation we have is the partial derivative of the
-
58:23 - 58:28Lagrangian with respect to b, which is equal to minus the sum
-
58:28 - 58:34of i equals 1 to m of alpha i times
-
58:34 - 58:35yi.
-
58:35 - 58:37And so I'll just set that equal to
-
58:37 - 58:400. And so these are my two constraints.
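As a sketch of what those two conditions say numerically (the tiny data set and the alpha values below are made up for illustration - they're not a solved SVM):

```python
import numpy as np

# Invented training set and dual variables, for illustration only.
X = np.array([[2.0, 2.0], [0.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.25, 0.25, 0.0])

# dL/dw = 0 gives w as a linear combination of the training inputs:
#   w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(w)  # [0.5 0.5]

# dL/db = 0 gives the constraint sum_i alpha_i * y_i = 0:
print(alpha @ y)  # 0.0
```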
-
58:40 - 58:43And so
-
58:43 - 58:48
-
58:48 - 58:50[inaudible].
-
58:50 - 58:54So what I'm going to do is I'm actually going to take these two constraints,
-
58:54 - 58:58and well, I'm going to take whatever I thought to be the value for w.
-
58:58 - 59:00And I'm
-
59:00 - 59:01gonna
-
59:01 - 59:04take what I've worked out to be the
-
59:04 - 59:05value for w, and
-
59:05 - 59:07I'll plug it back in there
-
59:07 - 59:09to figure out what the Lagrangian really is
-
59:09 - 59:14when I minimize with respect to w. [Inaudible] and I'll
-
59:14 - 59:21deal with b in a second.
-
59:28 - 59:35So
-
59:38 - 59:41let's see. So my Lagrangian is 1/2
-
59:41 - 59:44w transpose w minus.
-
59:44 - 59:51
-
59:51 - 59:58
-
59:58 - 60:04So this first term, w transpose w
-
60:04 - 60:06- this becomes
-
60:06 - 60:11sum of i equals 1 to m, alpha i, yi,
-
60:11 - 60:18
-
60:21 - 60:25xi transpose. This is just putting in the value for w that I worked out previously.
-
60:25 - 60:28But since this is w transpose w -
-
60:28 - 60:32and so when they expand out of this quadratic function, and when I plug in w
-
60:32 - 60:34over there as well,
-
60:34 - 60:36I
-
60:36 - 60:39find
-
60:39 - 60:41
-
60:41 - 60:43
-
60:43 - 60:50that
-
60:50 - 60:55
-
60:55 - 60:58I
-
60:58 - 61:00have
-
61:00 - 61:04
-
61:04 - 61:09
-
61:09 - 61:11
-
61:11 - 61:18that.
-
61:21 - 61:22Oh,
-
61:22 - 61:29where I'm using these angle brackets to denote inner product, so this
-
61:29 - 61:35thing here, it just means the inner product, xi transpose
-
61:35 - 61:36xj. And
-
61:36 - 61:40the first and second terms are actually the same except for the minus one half. So this
-
61:40 - 61:42simplifies to be
-
61:42 - 61:45equal to
-
61:45 - 61:51
-
61:51 - 61:58
-
62:08 - 62:13that.
-
62:13 - 62:16So
-
62:16 - 62:19let me go ahead and
-
62:19 - 62:22call this w of alpha.
-
62:22 - 62:23
-
62:23 - 62:29
-
62:29 - 62:36
-
62:47 - 62:52My dual problem is, therefore, the following. I want to maximize w
-
62:52 - 62:56of alpha, which is that [inaudible].
-
62:56 - 62:58And
-
62:58 - 63:02I want to the - I realize the notation is somewhat
-
63:02 - 63:07unfortunate. I'm using capital W of alpha to denote that formula I wrote down earlier.
-
63:07 - 63:14And then we also had our lowercase w, the original parameter of the primal
-
63:14 - 63:16problem - lowercase w transpose xi. So
-
63:16 - 63:18uppercase and lowercase w
-
63:18 - 63:21are totally different
-
63:21 - 63:27things, so unfortunately, the notation is standard as well, as far as I know,
-
63:27 - 63:30so. So the dual problem is
-
63:30 - 63:33that, subject to the alpha i's being greater than or equal to 0,
-
63:33 - 63:37and
-
63:37 - 63:39we also have that the
-
63:39 - 63:41sum of i,
-
63:41 - 63:44yi, alpha i is equal to 0.
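The dual objective W(alpha) is easy to write down in code, and doing so makes the key point visible: the data enter only through inner products. This is an illustrative sketch with invented numbers, not from the lecture:

```python
import numpy as np

def W(alpha, X, y):
    """Dual objective: sum_i alpha_i
    - 1/2 * sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>."""
    K = X @ X.T  # Gram matrix of all pairwise inner products
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])  # satisfies alpha >= 0 and sum alpha_i y_i = 0

print(W(alpha, X, y))  # 0.25
```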
-
63:44 - 63:46That last constraint
-
63:46 - 63:50was the constraint I got from
-
63:50 - 63:52this - the
-
63:52 - 63:55sum of i - sum of i, yi alpha i equals to 0. But that's where
-
63:55 - 64:00that [inaudible] came
-
64:00 - 64:02from. Let
-
64:02 - 64:02me just
-
64:02 - 64:06- I think in previous years that I taught this,
-
64:06 - 64:09where this constraint comes from is just - is slightly confusing. So let
-
64:09 - 64:13me just take two minutes to say what the real interpretation of that is. And if you
-
64:13 - 64:16don't understand it, it's
-
64:16 - 64:18not a big deal, I guess.
-
64:18 - 64:22So when we took the partial derivative of the Lagrangian with
-
64:22 - 64:23respect to b,
-
64:23 - 64:28we end up with this constraint that sum of i, yi, alpha i must be equal to 0.
-
64:28 - 64:34The interpretation of that, it turns out, is that if sum of i, yi, alpha i
-
64:34 - 64:37is not equal to
-
64:37 - 64:410, then
-
64:41 - 64:43
-
64:43 - 64:50
-
64:51 - 64:53theta d of wb
-
64:53 - 64:56is -
-
64:56 - 65:00actually, excuse me.
-
65:00 - 65:02
-
65:02 - 65:03
-
65:03 - 65:06Then theta d of alpha is equal to
-
65:06 - 65:08
-
65:08 - 65:13minus infinity, from the minimization over b.
-
65:13 - 65:16So in other words, it turns out my Lagrangian is
-
65:16 - 65:20actually a linear function of my parameter b. And so the interpretation of
-
65:20 - 65:23that constraint we worked out previously was that if sum over i of yi, alpha i
-
65:23 - 65:26is not equal to 0, then
-
65:26 - 65:29theta d of alpha is equal to minus infinity.
-
65:29 - 65:32And so if your goal is to
-
65:32 - 65:37maximize as a function of alpha, theta
-
65:37 - 65:38d of alpha,
-
65:38 - 65:45then you've gotta choose values of alpha for which sum of yi alpha is equal to 0.
-
65:45 - 65:51And then when sum of yi alpha is equal to 0, then
-
65:51 - 65:54
-
65:54 - 65:56
-
65:56 - 66:01theta d of
-
66:01 - 66:04alpha is equal to w of alpha.
-
66:04 - 66:09And so that's why we ended up deciding to maximize w of alpha subject to
-
66:09 - 66:14that sum of yi alpha is equal to 0.
-
66:14 - 66:19Yeah, the - unfortunately, the fact of that d would be [inaudible]
-
66:19 - 66:22adds just a little bit of extra notation in our
-
66:22 - 66:23derivation of the dual. But
-
66:23 - 66:27by the way, and [inaudible] all the action of the optimization problem is with w
-
66:27 - 66:33because b is just one parameter.
-
66:33 - 66:40So let's check. Are there any questions about this? Okay, cool.
-
66:47 - 66:49So
-
66:49 - 66:54we derived the dual optimization problem - and really, don't worry about this
-
66:54 - 66:56if you're not quite sure where this came from. Just think of this as
-
66:56 - 67:00we worked out this constraint, and we worked out, and we took partial derivative with
-
67:00 - 67:01respect to b,
-
67:01 - 67:06that this constraint has the [inaudible] and so I just copied that over here. But so - worked out
-
67:06 - 67:11the dual of the optimization problem,
-
67:11 - 67:16so our approach to finding - to deriving the optimal margin classifier or support vector
-
67:16 - 67:17machine
-
67:17 - 67:19will be that we'll solve
-
67:19 - 67:26this dual optimization problem for the parameters alpha
-
67:28 - 67:29star.
-
67:29 - 67:31And then
-
67:31 - 67:34if you want, you can then - this is the equation that we worked out on
-
67:34 - 67:36the previous board. We said that
-
67:36 - 67:43w - this [inaudible] alpha - w must be equal to
-
67:44 - 67:46that.
-
67:46 - 67:48And so
-
67:48 - 67:53once you solve for alpha, you can then go back and quickly derive
-
67:53 - 67:57w, the parameters of your primal problem. And we worked this out earlier.
-
67:57 - 67:58
-
67:58 - 68:00And moreover,
-
68:00 - 68:05once you solve alpha and w, you can then focus back into your - once you solve for alpha and w,
-
68:05 - 68:07
-
68:07 - 68:10it's really easy to solve for b, so
-
68:10 - 68:14
-
68:14 - 68:15that b gives us - the interpretation is that given the
-
68:15 - 68:20training set, you've found the direction for w. So you know what your separating
-
68:20 - 68:22hyperplane's direction is. You know it's got to be
-
68:22 - 68:25one of these things.
-
68:25 - 68:28And you know the orientation of the separating hyperplane. You just have to
-
68:28 - 68:31decide where to place
-
68:31 - 68:33this hyperplane. And that's what solving b is.
-
68:33 - 68:37So once you solve for alpha and w, it's really easy to solve b.
-
68:37 - 68:40You can plug alpha and w back into the
-
68:40 - 68:43primal optimization problem
-
68:43 - 68:46
-
68:46 - 68:50
-
68:50 - 68:57and solve for b.
-
69:02 - 69:05And I just wrote it down
-
69:05 - 69:12for the sake of completeness,
-
69:15 - 69:20
-
69:20 - 69:21
-
69:21 - 69:23but
-
69:23 - 69:28
-
69:28 - 69:30- and the
-
69:30 - 69:36intuition behind this formula is just: find the worst positive
-
69:36 - 69:37example and the
-
69:37 - 69:42worst negative example. Let's say
-
69:42 - 69:45this one and this one - and split the difference between them. And
-
69:45 - 69:46that tells you where you should
-
69:46 - 69:48set the threshold for
-
69:48 - 69:53where to place the separating hyperplane.
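A minimal sketch of that intercept computation (the data and the value of w are my own toy numbers): take the closest negative example and the closest positive example as measured by w transpose x, and split the difference.

```python
import numpy as np

# b = -(max over negatives of w.x  +  min over positives of w.x) / 2
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.5, 0.5])

scores = X @ w
b = -(scores[y == -1].max() + scores[y == 1].min()) / 2
print(b)  # -1.0

# With this b, the closest example on each side gets functional margin 1.
print(y * (scores + b))  # [1. 1. 2.]
```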
-
69:53 - 69:53And then
-
69:53 - 69:55that's the -
-
69:55 - 69:58this is the optimal margin classifier. This is also called a support vector
-
69:58 - 70:02machine - modulo one idea that's called kernels. And I'll say a few words
-
70:02 - 70:04about that. But I
-
70:04 - 70:08hope the process is clear. It's a dual problem. We're going to solve the dual
-
70:08 - 70:10problem for the alpha i's.
-
70:10 - 70:11That gives us w, and that gives
-
70:11 - 70:14us b.
-
70:14 - 70:17So
-
70:17 - 70:22there's just one more thing I wanna point out as I lead into the next lecture,
-
70:22 - 70:23which is that - I'll just
-
70:23 - 70:28write this out again,
-
70:28 - 70:29I guess -
-
70:29 - 70:31which is that it turns out
-
70:31 - 70:34we can take the entire algorithm,
-
70:34 - 70:38and we can express the entire algorithm in terms of inner products. And here's what I
-
70:38 - 70:41mean by that. So
-
70:41 - 70:42say that the parameter w
-
70:42 - 70:45is a sum over your input examples.
-
70:45 - 70:49And so we need to make a prediction.
-
70:49 - 70:55Someone gives you a new value of x. You want the value of the hypothesis at x.
-
70:55 - 70:58That's given by g of w transpose x plus b, or
-
70:58 - 71:03where g was this threshold function that outputs minus 1 or plus 1.
-
71:03 - 71:06And so you need to compute w transpose x plus b.
-
71:06 - 71:11And that is equal to the sum over i of alpha i,
-
71:11 - 71:18yi, times the inner product of xi with x, plus b.
-
71:20 - 71:24And that can be expressed as a sum of these inner products between
-
71:24 - 71:26your training examples
-
71:26 - 71:33and this new value of x [inaudible] value [inaudible]. And this will
-
71:34 - 71:39lead into our next lecture, which is the idea of kernels.
-
71:39 - 71:40And
-
71:40 - 71:44it turns out that in the sorts of feature spaces where we use support vector
-
71:44 - 71:45machines -
-
71:45 - 71:52it turns out that sometimes your training examples may be very high-dimensional. It may even be the case
-
71:54 - 71:58that the features that you want to use
-
71:58 - 72:00are
-
72:00 - 72:04infinite-dimensional feature vectors.
-
72:04 - 72:11But despite this, it'll turn out that there'll be an interesting representation that
-
72:11 - 72:13you can use
-
72:13 - 72:15that will allow you
-
72:15 - 72:19to compute inner products like these efficiently.
-
72:19 - 72:26
-
72:26 - 72:29And this holds true only for certain feature spaces. It doesn't hold true for arbitrary sets
-
72:29 - 72:30of features.
-
72:30 - 72:34But we talk about the idea of
-
72:34 - 72:35kernels. In the next lecture, we'll
-
72:35 - 72:38see examples where
-
72:38 - 72:41even though you have extremely high-dimensional feature vectors, you can compute
-
72:41 - 72:46- you may never want to represent xi explicitly as an infinite-dimensional
-
72:46 - 72:48feature vector. You can't even store it in computer memory.
-
72:48 - 72:52But you will nonetheless be able to compute inner products between different
-
72:52 - 72:54[inaudible] feature vectors very efficiently.
-
72:54 - 72:57And so you can - for example, you can make predictions by making use of these inner
-
72:57 - 72:59products.
-
72:59 - 73:02This is just xi
-
73:02 - 73:04transpose x.
-
73:04 - 73:09You will compute these inner products very efficiently and, therefore, make predictions.
-
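To preview why this matters, here is the classic example (my own illustration, ahead of the kernels lecture): the degree-2 polynomial kernel (x transpose z + 1) squared equals the inner product of two explicitly expanded feature vectors phi(x) and phi(z), but costs O(n) to evaluate instead of the O(n^2) cost of forming phi:

```python
import numpy as np

def phi(x):
    """Explicit (and expensive) feature expansion corresponding to the
    degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2 in two dimensions.
    For n-dimensional inputs this expansion has O(n^2) entries."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

def poly_kernel(x, z):
    """Compute <phi(x), phi(z)> without ever forming phi:
    one inner product in the original space, then square - O(n) work."""
    return (np.dot(x, z) + 1.0) ** 2
```

The point is that `poly_kernel(x, z)` and `np.dot(phi(x), phi(z))` agree exactly, so an algorithm written purely in terms of inner products never needs the expanded vectors.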
73:09 - 73:11And this points to the other
-
73:11 - 73:15reason we derived the dual: because
-
73:15 - 73:18on this board, when we worked out what W of alpha is, W of alpha
-
73:18 - 73:24has exactly the same property - W of alpha is again
-
73:24 - 73:27written in terms of these inner products.
-
73:27 - 73:29And so if you
-
73:29 - 73:33actually look at the dual optimization problem and step through all the steps of the
-
73:33 - 73:34algorithm,
-
73:34 - 73:37you'll find that you can actually do everything you want - learn the parameters
-
73:37 - 73:38alpha. So
-
73:38 - 73:42suppose you solve the optimization problem to find the parameters alpha; you can do everything you want
-
73:42 - 73:47without ever needing to represent xi directly. All you need to do
-
73:47 - 73:54is compute inner products with your feature vectors like these. Well,
-
73:54 - 73:56one last property of
-
73:56 - 73:58this algorithm that's kinda nice is that
-
73:58 - 74:00I said previously
-
74:00 - 74:01that
-
74:01 - 74:07the alpha i's are non-zero only for the support vectors,
-
74:07 - 74:09only for the vectors
-
74:09 - 74:10whose functional margin is equal to 1.
-
74:10 - 74:14And in practice, there are usually fairly few of them.
-
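Because most alpha i's are exactly zero, a dual-form prediction only ever has to touch the support vectors. A sketch of this (illustrative names, assuming alpha comes from solving the dual):

```python
import numpy as np

def svm_predict_sparse(x_new, X_train, y, alpha, b, tol=1e-8):
    """Dual-form prediction summing only over the support vectors
    (examples with alpha_i > 0). Since most alpha_i are exactly 0
    at the optimum, this touches only a small fraction of the
    training set."""
    sv = alpha > tol                       # mask selecting support vectors
    inner = X_train[sv] @ x_new            # inner products with SVs only
    score = np.sum(alpha[sv] * y[sv] * inner) + b
    return 1 if score >= 0 else -1
```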
74:14 - 74:17And so what this means is that if you're representing w this way,
-
74:17 - 74:18then
-
74:18 - 74:23w is represented as a sum over a fairly small fraction of the training examples,
-
74:23 - 74:25because most of the alpha i's are 0 -
-
74:25 - 74:27and so when you're summing up
-
74:27 - 74:28the sum,
-
74:28 - 74:32you need to compute inner products only with the support vectors, which are
-
74:32 - 74:36usually a small fraction of your training set. So that's another nice property,
-
74:36 - 74:39because most of the
-
74:39 - 74:41alpha i's are 0. And well,
-
74:41 - 74:45much of this will make much more sense when we talk about kernels. [Inaudible] quick
-
74:45 - 74:52questions
-
74:52 - 74:53
-
74:53 - 74:58before I close? Yeah. Student:It seems that for any of this to work, the data set has to be really well
-
74:58 - 75:00behaved, and if any of the points are kinda on the wrong side - Instructor (Andrew Ng):No, oh, yeah, so again, today's lecture assumes that
-
75:00 - 75:04the data is linearly separable - that you can actually get perfect
-
75:04 - 75:11[inaudible]. I'll fix this in the next lecture as well. But excellent assumption. Yes? Student:So can't we assume that [inaudible]
-
75:11 - 75:13point [inaudible], so [inaudible]
-
75:13 - 75:18have
-
75:18 - 75:19[inaudible]? Instructor (Andrew Ng):Yes - let me just say that
-
75:19 - 75:23there are ways to generalize this to multiple classes that I probably won't [inaudible] -
-
75:23 - 75:26but yeah, that's generalization [inaudible].
-
75:26 - 75:28Okay. Let's close for today, then.
-
75:28 - 75:29We'll talk about kernels in our next lecture.
- Title:
- Lecture 7 | Machine Learning (Stanford)
- Description:
-
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on optimal margin classifiers, KKT conditions, and SVM duals.
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599
CS 229 Course Website:
http://www.stanford.edu/class/cs229/
Stanford University:
http://www.stanford.edu/
Stanford University Channel on YouTube:
http://www.youtube.com/stanford
- Video Language:
- English
- Duration:
- 01:15:45