This presentation is delivered by the Stanford Center for Professional Development.

Instructor (Andrew Ng): So welcome back. What I want to do today is continue our discussion of support vector machines, and in particular, I want to talk about the optimal margin classifier. Then I want to take a brief digression and talk about primal and dual optimization problems, and in particular, what are called the KKT conditions. And then we'll derive the dual of the optimization problem that I had posed earlier. And that will lead us into a discussion of kernels, which we'll just get to say a couple of words about today, but which I'll do properly only in the next lecture.

And as part of today's lecture, I'll spend some time talking about optimization problems. And in the little time I have today, I won't really be able to do this topic justice - I want to talk about convex optimization and do that topic justice. And so at this week's discussion session, the TAs will have more time: they'll teach a discussion session focused on convex optimization, a very beautiful and useful theory. So if you want to learn more about that, listen to this Friday's discussion session.

Just to recap what we did in the previous lecture: as we were beginning to develop support vector machines, I said that a hypothesis is represented as h sub w,b of x equals g of (w transpose x + b), where g of z will be +1 or -1, depending on whether z is greater than or equal to 0. And I said that in our development of support vector machines, we'll change conventions and let y be +1 or -1 to denote the class labels.

So last time, we also talked about the functional margin, which was this thing, gamma hat i. And we had the intuition that if the functional margin is a large positive number, then that means that we are classifying a training example correctly and very confidently. So if yi is +1, we would like w transpose xi + b to be very large; and if, excuse me, yi is -1, then we'd like w transpose xi + b to be a large negative number.
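Written out, the definitions being recapped here are, in the notation of the lecture notes (the superscript (i) indexes training examples):

```latex
h_{w,b}(x) = g(w^\top x + b), \qquad
g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{otherwise,} \end{cases}
\qquad
\hat{\gamma}^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big).
```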
So we'd sort of like functional margins to be large. We also said that the functional margin has a strange property: you can increase the functional margin just by, say, taking your parameters w and b and multiplying them by 2.

And then we also defined the geometric margin, which was, essentially, the functional margin divided by the norm of w. And so the geometric margin has the interpretation - I'll give you a few examples - the geometric margin, for example, has the interpretation as the distance between a training example and the hyperplane. And it'll actually be a signed distance, so that this distance will be positive if you're classifying the example correctly; and if you misclassify the example, this distance will be the negative of the distance between the point - the training example - and your separating hyperplane, where your separating hyperplane is defined by the equation w transpose x + b = 0.

And I guess I also defined the functional and geometric margins with respect to a training set: the functional or geometric margin with respect to a training set I defined as the worst case, that is, the minimum, functional or geometric margin over the training examples.

So in our development of the optimal margin classifier, our learning algorithm will choose parameters w and b so as to maximize the geometric margin. So our goal is to find the separating hyperplane that separates the positive and negative examples with as large a distance as possible between the hyperplane and the positive and negative examples.

And if you go to choose parameters w and b to maximize this, one property of the geometric margin is that you can actually scale w and b arbitrarily. So if you look at this definition of the geometric margin, I can choose to multiply my parameters w and b by 2 or by 10 or any other constant, and it doesn't change my geometric margin. And one way of interpreting that is in terms of the separating hyperplane itself.
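Here is a minimal NumPy sketch of these two margin definitions and the scaling property just described (the toy data and the particular w, b are made up for illustration):

```python
import numpy as np

# A tiny toy training set: two positive and two negative examples in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def functional_margins(w, b):
    # gamma-hat_i = y_i * (w . x_i + b)
    return y * (X @ w + b)

def geometric_margins(w, b):
    # gamma_i = gamma-hat_i / ||w||: the signed distance to the hyperplane
    return functional_margins(w, b) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -0.5
print(functional_margins(w, b))       # [3.5 3.5 2.5 2.5]
print(geometric_margins(w, b))
# Rescaling (w, b) -> (2w, 2b) doubles every functional margin ...
print(functional_margins(2 * w, 2 * b))
# ... but leaves every geometric margin (and the hyperplane) unchanged.
print(geometric_margins(2 * w, 2 * b))
```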
So look at this line separating the positive and negative training examples. If I scale w and b, that doesn't change the position of this plane, because the equation w transpose x + b = 0 is the same as the equation 2w transpose x + 2b = 0; it defines the same straight line. And what that means is that I can actually choose whatever scaling for w and b is convenient for me.

And in particular - we'll use this in a minute - I can impose a constraint that the norm of w equals 1, because once you find a solution for w and b, then by rescaling the parameters you can easily meet this condition: you rescale w so its norm is 1. And so I can add a condition like this and essentially not change the problem.

Or I can add other conditions. I can actually add a condition that - excuse me - the absolute value of w1 equals 1. We can have only one of these conditions at a time, mind you. That's adding the condition that the absolute value of the first component of w must be 1. And again, you can find the optimal solution and just rescale w to meet this condition.

And you can have other, more esoteric conditions like that, because, again, each of these is a condition such that you can solve for the optimal margin and then, just by scaling w up or down, ensure you meet the condition as well. So again, we'll use just one of these conditions at a time, not all of them. And this ability to choose any scaling condition on w that's convenient to us will be useful again in a second.

All right. So let's go ahead and write down the optimization problem. And again, my goal is to choose parameters w and b so as to maximize the geometric margin. Here's my first attempt at writing down the optimization problem - I actually wrote this one down right at the end of the previous lecture: maximize gamma over the parameters gamma, w, and b, such that yi times (w transpose xi + b) is greater than or equal to gamma for every training example i. And let's say I choose to add this normalization condition: the norm of w equals 1.
So the norm condition - that the norm of w is equal to 1 - just makes the geometric and the functional margin the same. And so I'm saying I want to find a value for gamma as big as possible so that all of my training examples have functional margin greater than or equal to gamma; and with the constraint that the norm of w equals 1, the functional margin and the geometric margin are the same. So it's the same as: find the value for gamma so that all the geometric margins are greater than or equal to gamma.

So if you solve this optimization problem, then you will have derived the optimal margin classifier. But this is not a very nice optimization problem, because this is a nasty, nonconvex constraint. It's asking that you solve for parameters w that lie on the surface of a unit sphere. And so if we can come up with a convex optimization problem, then we'd be guaranteed that gradient descent and other local search algorithms will not get stuck in local optima. And it turns out this is an example of a nonconvex constraint - a nasty constraint that I would like to get rid of.

So let's change the optimization problem one more time. Now, let me pose a slightly different optimization problem: let me maximize the functional margin divided by the norm of w, subject to yi times (w transpose xi + b) being greater than or equal to gamma hat. So in other words, I want to find a number, gamma hat, so that every one of my training examples has functional margin greater than or equal to gamma hat, and my optimization objective is that I want to maximize gamma hat divided by the norm of w. And we saw previously that the functional margin divided by the norm of w is just the geometric margin, and so this is a different way of posing the same optimization problem. Some of you look confused, though. Are there questions about this?
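In symbols, the two problems just posed are, first,

```latex
\max_{\gamma,\, w,\, b}\ \ \gamma
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \gamma,\ \ i = 1,\dots,m,
\qquad \|w\| = 1,
```

and then, with the nonconvex norm constraint dropped,

```latex
\max_{\hat{\gamma},\, w,\, b}\ \ \frac{\hat{\gamma}}{\|w\|}
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge \hat{\gamma},\ \ i = 1,\dots,m.
```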
Student: For the second statement - where you have the functional margin divided by the norm of w - why don't you just have it be the geometric margin? Why do you [inaudible]?

Instructor (Andrew Ng): [Inaudible] say it again?

Student: For the second statement, where we're saying the functional margin is divided [inaudible].

Instructor (Andrew Ng): Oh, I see, yes.

Student: [Inaudible] is that [inaudible]?

Instructor (Andrew Ng): So let's see, this is the functional margin, right? This is not the geometric margin.

Student: Yeah.

Instructor (Andrew Ng): So - oh, I want to divide by the norm of w in my optimization objective.

Student: I'm just wondering how come you end up also dividing, in the second stage, the functional margin. Why are you dividing there by the norm of w?

Instructor (Andrew Ng): Let's see, I'm not sure I get the question. Let me try saying this again. So here's my goal. The parameters of this optimization problem are gamma hat, w, and b - so the convex optimization software solves this problem for some set of parameters gamma hat, w, and b. And I'm imposing the constraint that, whatever values it comes up with, yi times (w transpose xi + b) must be greater than or equal to gamma hat. And so this means that the functional margin of every example had better be greater than or equal to gamma hat. So there's a constraint tying the functional margin to gamma hat.

But what I care about is not really maximizing the functional margin. What I really care about - in other words, my optimization objective - is maximizing gamma hat divided by the norm of w, which is the geometric margin. So in other words, my optimization objective is: I want to maximize the functional margin divided by the norm of w, subject to the constraint that every example must have functional margin at least gamma hat. Does that make sense now?

Student: [Inaudible] when you said to maximize gamma or gamma hat, with respect to gamma, w, and with respect to gamma hat, so that [inaudible] gamma hat are no longer [inaudible]?
Instructor (Andrew Ng): So it turns out - so this is how I write down an optimization problem in order to solve for the geometric margin. So the question is: is gamma hat a function of w and b? And it turns out that in my previous mathematical definition, it was. But the way I'm going to pose this as an optimization problem is, assuming you have software for solving convex optimization problems, I'm going to pretend that these are independent variables and ask my convex optimization software to find me values for gamma hat, w, and b, to make this value as big as possible, subject to this constraint.

And it'll turn out that when it does that, it will choose - obviously, it will choose - gamma hat to be as big as possible, because the optimization objective is this: you're trying to maximize gamma hat. So for a fixed value of w and b, my software will choose to make gamma hat as big as possible. Well, but how big can we make gamma hat? Well, it's limited by these constraints. They say that every training example must have functional margin greater than or equal to gamma hat. And so the biggest you can make gamma hat will be the value of the smallest functional margin. And so when you solve this optimization problem, the value of gamma hat you get out will be, indeed, the minimum of the functional margins of your training set. Okay, so Justin?

Student: Yeah, I was just wondering, I guess I'm a little confused, because it's like, okay, you have two classes of data. And you can say, "Okay, please draw me a line such that you maximize the smallest distance between the line and the data points." And it seems like that's kind of what we're doing, but it seems like this is more complicated than that. And I guess I'm wondering what is the difference.

Instructor (Andrew Ng): I see. So the question is: given two classes of data, we're trying to find a separating hyperplane, and this seems more complicated than just trying to find a line that separates them.
So I'm just repeating the questions, since I'm not sure how well the audio catches them. So the answer is: this is actually exactly that problem. Given the two classes of data, positive and negative examples, this is exactly the formalization of the problem where my goal is to find a line that separates the positive and negative examples, maximizing the worst-case distance between any point and this line. Okay? Yeah, [inaudible]?

Student: So why do you care about the worst-case distance [inaudible]?

Instructor (Andrew Ng): Yeah - for now, why do we care about the worst-case distance? For now, let's just care about the worst-case distance. We'll come back, and we'll fix that later. Caring about the worst case is just a nice way to formulate this optimization problem. I'll come back and change that later. Okay, raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool. Great.

So let's see - so this is just a different way of posing the same optimization problem. And on the one hand, I've gotten rid of this nasty, nonconvex constraint, while on the other hand, I've now added a nasty, nonconvex objective. In particular, this is not a convex function in the parameters w. And so you don't have the usual guarantees - for example, that a local minimum you find is a global minimum. At least, that does not follow immediately from this, because this is nonconvex.

So what I'm going to do is - earlier, I said that we can pose any of a number of even fairly bizarre scaling constraints on w. You can choose any scaling constraint like this, and things are still fine. And so here's the scaling I'm going to choose to add. Again, for the purposes of today's lecture, I'm going to assume that these examples are linearly separable - that you can actually separate the positive and negative classes - and we'll come back and fix this later as well. But here's the scaling constraint I want to impose on w: I want to impose the constraint that the functional margin is equal to 1.
And another way of writing that is that I want to impose the constraint that the min over i of yi times (w transpose xi + b) - that is, the worst-case functional margin - is 1. And clearly, this is a scaling constraint, because if you solve for w and b and you find that your worst-case functional margin is actually 10 or whatever, then by dividing w and b through by a factor of 10, I can get my functional margin to be 1. So this is a constraint that rescaling w and b can always achieve, and it's just more compactly written as follows: this imposes the constraint that the worst-case functional margin be equal to 1.

And so if we just take what I wrote down as No. 2 of our previous optimization problems and add the scaling constraint, we then get the following optimization problem: min over w, b of one half the norm of w squared. I guess previously we had a maximization over gamma hat divided by the norm of w; with gamma hat scaled to 1, that's maximizing 1 over the norm of w, which is the same as minimizing the norm of w squared. And then these are our constraints: since I've added the scaling constraint, the functional margin of every example is greater than or equal to 1. And this is actually my final formulation of the optimal margin classifier problem, at least for now.

So the picture to keep in mind for this, I guess, is that our optimization objective is to minimize the norm of w, and so our optimization objective is just this quadratic function. And in pictures, if the axes are w1 and w2 and you want to minimize a quadratic function like this - a quadratic function just has contours that look like this. And moreover, you have a number of linear constraints in your parameters: so one linear constraint eliminates this half space, another linear constraint eliminates that half space, and so on. And so the picture is that you have a quadratic function, and you're ruling out various half spaces, one for each of these linear constraints.
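Written out - with the factor of one half that, as mentioned later in the lecture, is just a convention that makes the math a little nicer - this final formulation is:

```latex
\min_{w,\, b}\ \ \frac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1,\ \ i = 1,\dots,m.
```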
And I hope - if you can picture this in 3D; I guess I can't quite draw in 3D, but I hope you can convince yourself - that this is a convex problem that has no local optima: if you run gradient descent within the set of points that hasn't been ruled out, then you'll converge to the global optimum. And so that's a convex optimization problem. Questions about this? Actually, just raise your hand if this makes sense. Okay, cool.

So this gives you the optimal margin classifier algorithm. And it turns out that this is a convex optimization problem, so you can actually take this formulation of the problem and throw it at off-the-shelf software - what's called QP, or quadratic programming, software. This class of optimization problem is called a quadratic program: a quadratic convex objective function with linear constraints. So you can actually download software to solve these optimization problems for you. Usually you'd want to use a QP solver because you have constraints like these, although you could actually modify gradient descent to work with this, too.

So we could just declare success and say that we're done with this formulation of the problem. But what I'm going to do now is take a digression to talk about primal and dual optimization problems. And in particular, later, I'm going to come back and derive yet another, very different form of this optimization problem. And the reason we'll do that is because it turns out this optimization problem has certain properties that make it amenable to very efficient algorithms. And moreover, I'll be deriving what's called the dual formulation of this, which allows us to apply the optimal margin classifier even in very high-dimensional feature spaces - even in sometimes infinite-dimensional feature spaces. We'll come back to that later.
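To make the "off-the-shelf QP software" remark concrete, here is a minimal sketch of that route. It assumes the cvxopt package (not something used in the lecture; any solver exposing the standard QP form min (1/2) z'Pz + q'z subject to Gz <= h would do):

```python
import numpy as np
from cvxopt import matrix, solvers  # pip install cvxopt

solvers.options['show_progress'] = False

def fit_max_margin(X, y):
    """Solve min_{w,b} (1/2)||w||^2  s.t.  y_i (w.x_i + b) >= 1
    as a QP in the stacked variable z = (w, b)."""
    m, n = X.shape
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)       # (1/2) z'Pz = (1/2)||w||^2; b is unpenalized,
    q = np.zeros(n + 1)         # so P is only positive semidefinite
    # y_i (w.x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]          # w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit_max_margin(X, y)
print(w, b, y * (X @ w + b))    # every functional margin comes out >= 1
```

Each training example contributes one row of linear constraints, which is exactly the "quadratic objective, linear constraints" structure of a QP.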
But let me ask, since I'm talking about convex optimization: how many of you - from, I don't know, calculus - remember the method of Lagrange multipliers for solving an optimization problem - a minimization or maximization problem - subject to some constraint? How many of you remember the method of Lagrange multipliers for that? Oh, okay, cool. Some of you, yeah. So if you don't remember, don't worry; I'll describe that briefly here as well. But what I'm really going to do is talk about the generalization of this method of Lagrange multipliers that you may or may not have seen in some calculus classes. But if you haven't seen it before, don't worry about it.

So the method of Lagrange multipliers is: suppose there's some function you want to minimize - minimize f of w - subject to some set of constraints that hi of w must equal 0, for i = 1 through l. And given this constraint, I'll actually usually write it in vectorial form, in which I write h of w as a vector-valued function, so that it is equal to 0, where the 0 has an arrow on top; I use that to denote the vector of all 0s.

So you want to solve this optimization problem. Some of you have seen the method of Lagrange multipliers, where you construct this thing called the Lagrangian, which is the original optimization objective plus a sum of Lagrange multipliers times the constraints. And these parameters, beta i, we call the Lagrange multipliers.

And so the way you actually solve the optimization problem is you take the partial derivatives of this with respect to the original parameters and set them to 0, and the partial derivatives with respect to your Lagrange multipliers, and set those to 0. And then the theorem, due to, I guess, Lagrange, was that for some value w star to be a solution, it is necessary that - can this be the star? Student: Right. Instructor (Andrew Ng): The backwards E - there exists.
So, there exists a beta star such that those partial derivatives are equal to 0. So the method of Lagrange multipliers, to solve this problem: you construct a Lagrangian, take the derivatives with respect to the original parameters w and with respect to the Lagrange multipliers beta, set the partial derivatives equal to 0, and solve for your solutions. And then you check each of the solutions to see if it is indeed a minimum. Great.

So what I'm going to do is actually write down the generalization of this. And if you haven't seen this before, don't worry about it. So what I'm going to do is write down the generalization of this to solve a slightly more difficult type of constrained optimization problem, which is: suppose you want to minimize f of w subject to the constraints that gi of w, excuse me, is less than or equal to 0, and that hi of w is equal to 0. And again, using my vector notation, I'll write this as: g of w is less than or equal to 0, and h of w is equal to 0. So in this case, we now have inequality constraints as well as equality constraints.

I then have a Lagrangian - it's actually called the generalized Lagrangian - which is now a function of my original optimization parameters w, as well as two sets of Lagrange multipliers, alpha and beta. And so this will be f of w plus the sum of alpha i times gi of w plus the sum of beta i times hi of w.

Now, here's the cool part. I'm going to define theta subscript P of w to be equal to the max over alpha and beta, subject to the constraint that the alpha i are greater than or equal to 0, of the Lagrangian. And so I want you to consider the optimization problem: min over w of max over alpha and beta, such that the alpha i are greater than or equal to 0, of the Lagrangian.
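On the board, the generalized Lagrangian and this theta-P are:

```latex
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i\, g_i(w) + \sum_{i=1}^{l} \beta_i\, h_i(w),
\qquad
\theta_{\mathcal{P}}(w) = \max_{\alpha,\, \beta\, :\, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta).
```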
And that's just equal to: min over w of theta P of w. And just to give this a name, the subscript P here stands for "primal problem," and that refers to this entire thing. This optimization problem I've written down is called the primal problem; that means it's the original optimization problem that we're solving. Later on, I'll derive another version of this, but that's what P stands for: this is the primal problem.

Now, I want you to consider theta P again. And in particular, I want to consider the problem of what happens if you minimize, as a function of w, this quantity theta P.

So let's look at what theta P of w is. Notice that if gi of w is greater than 0 - so let's pick a value of w and ask: what is theta P of w? If w violates one of your primal problem's constraints, then theta P of w will be infinity. Why is that? Suppose I pick a value of w that violates one of these constraints, so gi of w is positive. Then - well, theta P maximizes this function of alpha and beta, the Lagrangian. So if one of these gi of w's is positive, then by setting the corresponding alpha i to plus infinity, I can make this arbitrarily large. And so if w violates one of my primal problem's constraints in one of the gi's, then the max over alpha of this Lagrangian will be plus infinity.

And in a similar way, if hi of w is not equal to 0, then theta P of w will also be infinity, for a very similar reason: because if hi of w is not equal to 0 for some value of i, then in my Lagrangian, I have a beta i times hi term. And so by setting beta i to be plus infinity or minus infinity, depending on the sign of hi, I can make this plus infinity as well. And otherwise, theta P of w is just equal to f of w.
It turns out that if I have a value of w that satisfies all of the gi and hi constraints, then the max with respect to alpha and beta of all the Lagrange multiplier terms will actually be attained by setting all the Lagrange multipliers to 0, and so theta P is just left with f of w. Thus, theta P of w is equal to f of w if the constraints are satisfied - both the gi and hi constraints - and is equal to plus infinity otherwise.

So the problem I wrote down, minimizing as a function of w theta P of w: that's just exactly the same problem as my original primal problem, because if you choose a value of w that violates the constraints, you get infinity, and if you satisfy the constraints, you get f of w. So this is really just the same as saying, "satisfy the constraints, and minimize f of w." That's really what minimizing theta P of w is. Raise your hand if this makes sense. Yeah, okay, cool. So, all right - I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly the same thing we started with.

So here's the cool part. Let me now define theta D - the D will stand for dual - and this is a function of the Lagrange multipliers alpha and beta; it's not a function of w. To define this, we minimize over w my generalized Lagrangian. And my dual problem is this: max over alpha and beta, such that the alpha i are greater than or equal to 0, of theta D of alpha and beta. So this optimization problem, I guess, is my dual problem.

I want you to compare this to our previous primal optimization problem. The only difference is that I took the max and the min, and I switched the order around of the max and the min.
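Side by side, what's on the board is:

```latex
\theta_{\mathcal{P}}(w) =
\begin{cases}
f(w) & \text{if } w \text{ satisfies the primal constraints}\\
+\infty & \text{otherwise,}
\end{cases}
\qquad
\theta_{\mathcal{D}}(\alpha, \beta) = \min_{w} \mathcal{L}(w, \alpha, \beta),
```

```latex
p^* = \min_{w}\ \max_{\alpha,\,\beta\,:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)
\ \ \text{(primal)},
\qquad
d^* = \max_{\alpha,\,\beta\,:\,\alpha_i \ge 0}\ \min_{w} \mathcal{L}(w, \alpha, \beta)
\ \ \text{(dual)}.
```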
That's the difference between the primal and the dual optimization problems. And it turns out that it's a fact - it's true, generally - that d star is less than or equal to p star. I think I defined p star previously: p star was the value of the primal optimization problem. In other words, it's just generally true that the max of the min of something is less than or equal to the min of the max of something. And this is a general fact.

And just as a concrete example: the max over y in the set {0, 1} of the min over x in the set {0, 1} of the indicator that x = y is equal to 0, which is less than or equal to 1, which is the min over x of the max over y of that same indicator. So this inequality actually holds true for any function you might put in here, and this is one specific example where the min over x of the indicator that x = y is always equal to 0, because whatever y is, you can choose x to be something different; so this is always 0. Whereas if I exchange the order of the min and the max, the thing here is always equal to 1. So 0 is less than or equal to 1. And more generally, this max/min being less than or equal to the min/max holds true for any function you might put in there.

But it turns out that sometimes, under certain conditions, these two optimization problems have the same value: sometimes, under certain conditions, the primal and the dual problems have the same value. And so you might be able to solve the dual problem rather than the primal problem. And the reason to do that is that sometimes - and we'll see it in the optimal margin classifier problem, the support vector machine problem - the dual problem turns out to be much easier, and often has many useful properties that make it nicer to work with compared to the primal.
So what I'm going to do now is write down formally the certain conditions under which that's true - under which the primal and the dual problems are equivalent. And so our strategy for working out the derivation of the support vector machine algorithm will be: we'll write down the primal optimization problem, which we did previously - the optimal margin classifier - and then we'll derive the dual optimization problem for that. And then we'll solve the dual problem, and by modifying that a little bit, that's how we'll derive the support vector machine.

But for now, let me just first, for the sake of completeness, write down the conditions under which the primal and the dual optimization problems give you the same solution. So: let f be convex. If you're not sure what convex means, for the purposes of this class you can take it to mean that the Hessian, H, is positive semidefinite - so it's just a bowl-shaped function like that. And once you want to learn more about optimization - again, please come to this week's discussion session taught by the TAs.

Then suppose the hi constraints are affine, and what that means is that hi of w equals ai transpose w plus bi. This means almost the same thing as linear: without the constant term bi here, we'd say that hi is linear; when we have a constant intercept as well, this is technically called affine rather than linear.

And let's suppose that the gi's are strictly feasible. And what that means is that there is a value of w such that, for all i, gi of w is less than 0. Don't worry too much about the technical details; I'm writing these things down for the sake of completeness. Strictly feasible just means that there's a value of w such that all of these constraints are satisfied with strict inequality rather than less-than-or-equal-to.
Under these conditions, there exist w star, alpha star, beta star such that w star solves the primal problem, and alpha star and beta star, the Lagrange multipliers, solve the dual problem. And the value of the primal problem will be equal to the value of the dual problem, which will be equal to the value of your generalized Lagrangian evaluated at w star, alpha star, beta star. In other words, you can solve either the primal or the dual problem; you get the same solution.

Further, your parameters will satisfy these conditions: the partial derivatives with respect to the parameters w will be 0. And actually, keep this equation in mind - we'll use it in a second, when we take the Lagrangian of our support vector machine problem and take its derivative with respect to w to derive our dual problem; we'll perform this step ourselves in a second. The partial derivatives with respect to the Lagrange multipliers beta are equal to 0. And it turns out this will hold true, too: alpha star i times gi of w star equals 0. This is called the KKT complementary condition. KKT stands for Karush-Kuhn-Tucker, who were the authors of this theorem; by tradition, these are usually just called the KKT conditions. And the other two conditions are that the alpha i's are greater than or equal to 0, which we had previously, and that your constraints are actually satisfied.

So let's see. All right. So let's take those and apply them to our optimal margin optimization problem that we had previously.
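Collected in one place, these KKT conditions read:

```latex
\frac{\partial}{\partial w_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad
\frac{\partial}{\partial \beta_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad
\alpha_i^*\, g_i(w^*) = 0, \qquad
g_i(w^*) \le 0, \qquad
\alpha_i^* \ge 0.
```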
I was going to say one word about this KKT complementary condition, which is that at your solution, you must have that alpha star i times gi of w star is equal to 0. So let's see: the product of two numbers is equal to 0. That means that at least one of these things must be equal to 0; for the product of two things to be equal to 0, well, that's just saying either alpha i or gi is equal to 0.

So what that implies is that the Karush-Kuhn-Tucker - most people just say KKT, but I wanted to show you the right spelling of their names - the KKT complementary condition implies that if alpha i is not 0, that necessarily implies that gi of w star is equal to 0.

And usually, it turns out - so all the KKT condition guarantees is that at least one of them is 0. It may actually be the case that alpha i and gi are both equal to 0. But in practice, when you solve this optimization problem, you find that, to a large extent, alpha star i is not equal to 0 if and only if gi of w star is equal to 0. This is not strictly true, because it's possible that both of these are 0. But in practice, when we solve problems like these, for the most part, usually exactly one of them will be non-0.

And also, when gi of w star is equal to 0, we say that gi is an active constraint, because our constraint was that gi of w must be less than or equal to 0; and so if it is equal to 0, then we say that this is an active constraint. Once we talk about [inaudible], we'll come back and extend this idea a little bit more.
[Inaudible] turn to this board in a second. So let's go back and work out the primal and the dual optimization problems for our optimal margin classifier — for the optimization problem that we worked on just now. As a point of notation: in whatever I've been writing down so far in deriving the KKT conditions, the Lagrange multipliers were α_i and β_i. It turns out that when this is applied to our problem, we only have one set of Lagrange multipliers, the α_i. And also, as I was working out the KKT conditions, I used w to denote the parameters of my primal optimization problem — I wanted to minimize f(w). In my SVM problem, I'm actually going to have two sets of parameters, w and b. So just keep that slight notation change in mind. So the problem we worked out previously was: we want to minimize one-half the norm of w squared — we just add the half there by convention, because it makes the rest of the math work out a little nicer — subject to the constraint that y_i(wᵀx_i + b) must be greater than or equal to 1. And so let me just take this constraint and rewrite it as a constraint g_i(w, b) ≤ 0. Again, previously I had g_i(w), but now I have parameters w and b. So g_i(w, b) is defined as 1 − y_i(wᵀx_i + b). So let's look at the implications of this in terms of the KKT dual complementarity condition again. We have that if α_i is not equal to 0, that necessarily implies that g_i(w, b) is equal to 0 — in other words, this is an active constraint. And what does this mean? It turns out that g_i(w, b) = 0 means exactly that the training example (x_i, y_i) has functional margin equal to 1. Because this constraint was that the functional margin of every example has to be greater than or equal to 1, and so if this is an active constraint, the inequality holds with equality.
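To keep the problem in view, written out the way the course notes write it, the primal problem and the rewritten constraint are:

```latex
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\ i = 1,\dots,m,
\qquad\text{i.e.}\qquad
g_i(w,b) = 1 - y_i\,(w^\top x_i + b) \le 0.
```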
That means that my training example i must have functional margin equal to exactly 1. Actually, I'll do this on a different board, I guess. So in pictures, what that means is this: you have some training set, and you'll have some separating hyperplane. And the examples with functional margin equal to 1 will be exactly those which are closest to my separating hyperplane — that's my equation, wᵀx + b = 0. And so in this cartoon example that I've drawn, it'll be exactly these three examples that have functional margin equal to 1, and all of the other examples, being farther away than these three, will have functional margin that is strictly greater than 1. And the examples with functional margin equal to 1 will usually correspond to the ones whose corresponding Lagrange multipliers are also not equal to 0. And again, this may not hold true — it may be the case that both g_i and α_i are equal to 0 — but usually, when g_i is 0, α_i will be non-zero. And so the examples with functional margin equal to 1 will be the ones where α_i is not equal to 0. One useful property of this, suggested by this picture and true in general as well, is that when you find the solution to this optimization problem, you find that relatively few training examples have functional margin equal to 1. In this picture I've drawn, there are three examples with functional margin equal to 1 — there are just a few examples at this minimum possible distance to your separating hyperplane. And these examples with functional margin equal to 1 are what we're going to call the support vectors. And this gives us the name support vector machine — there'll be these three points with functional margin 1 that we're calling support vectors. And the fact that there are relatively few support vectors also means that usually, most of the α_i's are equal to 0.
So α_i is equal to 0 for the examples that are not support vectors. Let's go ahead and work out the actual optimization problem. So we have our optimal margin optimization problem. We go ahead and write down the Lagrangian, and because we only have inequality constraints — we really have g_i constraints and no h_i constraints — I'll only have Lagrange multipliers of type α; there are no βs in my generalized Lagrangian. My Lagrangian will be one-half the norm of w squared, minus the sum of the α_i times the constraint terms. That's my Lagrangian. And so let's work out what the dual problem is. To do that, I need to figure out θ_D(α) — again, no βs here — which is the minimum with respect to w and b of L(w, b, α). The dual problem is then to maximize θ_D as a function of α. So let's work out what θ_D is, and that'll give us our dual problem. To work out what this is, what do we need to do? We need to take the Lagrangian and minimize it as a function of w and b. And how do you minimize the Lagrangian? We do the usual thing: we take the derivatives of the Lagrangian with respect to w and b, and we set them to 0. That's how we minimize the Lagrangian with respect to w and b. So take the derivative with respect to w of the Lagrangian — I'll just write down the answer; you know how to do calculus like this. I want to minimize this function of w, so I take the derivative, set it to 0, and I get that. And so this implies that w must be Σ_i α_i y_i x_i. And so w, therefore, is actually a linear combination of your input feature vectors x_i — a sum with weights given by the α_i's times the y_i's, where the x_i's are the examples in your training set. And this will be useful later.
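On the board, that step is:

```latex
\mathcal{L}(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m}\alpha_i\big[y_i\,(w^\top x_i + b) - 1\big],
\qquad
\nabla_w\mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0
\;\Rightarrow\;
w = \sum_{i=1}^{m}\alpha_i y_i x_i.
```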
The other equation we have is the partial derivative of the Lagrangian with respect to b, which is equal to minus the sum from i = 1 to m of α_i y_i. And I'll just set that equal to 0 as well. So these are my two conditions. And so what I'm going to do is take these two conditions — I'm going to take what I've worked out to be the value for w, and I'll plug it back in there to figure out what the Lagrangian really is when I minimize with respect to w. I'll deal with b in a second. So let's see. My Lagrangian is ½ wᵀw minus the sum of constraint terms. This first term, wᵀw — this becomes (Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j); this is just putting in the value for w that I worked out previously. And when I expand out this quadratic function, and plug in w over there as well, I find that I have W(α) = Σ_i α_i − ½ Σ_i Σ_j y_i y_j α_i α_j ⟨x_i, x_j⟩. Oh, and I'm using these angle brackets to denote inner product, so ⟨x_i, x_j⟩ just means the inner product x_iᵀx_j. And the first and second terms are actually the same except for the minus one-half, so it simplifies to be equal to that. So let me go ahead and call this W(α). My dual problem is, therefore, the following: I want to maximize W(α), which is that formula. And I realize the notation is somewhat unfortunate — I'm using capital W(α) to denote the formula I wrote down earlier, and then we also have our lowercase w.
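Before going on, here is the expansion the lecture glosses over ("when I expand out this quadratic... I find that") — substitute w = Σ_i α_i y_i x_i into the Lagrangian and use Σ_i α_i y_i = 0 to kill the b term:

```latex
\begin{aligned}
w^\top w &= \sum_{i,j=1}^{m} \alpha_i\alpha_j\, y_i y_j\, \langle x_i, x_j\rangle,\\
\mathcal{L} &= \tfrac{1}{2}\,w^\top w \;-\; \underbrace{\sum_i \alpha_i y_i\, w^\top x_i}_{=\;w^\top w} \;-\; b\underbrace{\sum_i \alpha_i y_i}_{=\;0} \;+\; \sum_i \alpha_i\\
&= \sum_{i=1}^{m}\alpha_i \;-\; \tfrac{1}{2}\sum_{i,j=1}^{m} y_i y_j\,\alpha_i\alpha_j\,\langle x_i, x_j\rangle \;=\; W(\alpha).
\end{aligned}
```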
The lowercase w is what appears in the original primal problem — lowercase w transpose x_i. So uppercase and lowercase w are totally different things; unfortunately, the notation is standard as well, as far as I know. So the dual problem is to maximize W(α), subject to the α_i being greater than or equal to 0, and we also have that Σ_i y_i α_i is equal to 0. That last constraint was the constraint I got from the partial derivative with respect to b — the sum of y_i α_i equal to 0; that's where that constraint came from. Let me just — I think in previous years that I taught this, where this constraint comes from was slightly confusing, so let me take two minutes to say what the real interpretation of it is. And if you don't understand it, it's not a big deal, I guess. So when we took the partial derivative of the Lagrangian with respect to b, we ended up with this constraint that Σ_i y_i α_i must be equal to 0. The interpretation of that, it turns out, is that if Σ_i y_i α_i is not equal to 0, then θ_D(α) is equal to minus infinity under the minimization. In other words, it turns out my Lagrangian is actually a linear function of my parameter b. And so the interpretation of the constraint we worked out previously is that if Σ_i y_i α_i is not equal to 0, then θ_D(α) is equal to minus infinity. And so if your goal is to maximize θ_D(α) as a function of α, then you've got to choose values of α for which Σ_i y_i α_i is equal to 0. And then, when Σ_i y_i α_i is equal to 0, θ_D(α) is equal to W(α). And so that's why we ended up deciding to maximize W(α) subject to Σ_i y_i α_i = 0.
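To make the dual concrete, here is a minimal numerical sketch of my own (toy data and a generic optimizer, not anything from the lecture): it maximizes W(α) subject to α_i ≥ 0 and Σ_i y_i α_i = 0, then recovers w = Σ_i α_i y_i x_i.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
K = X @ X.T  # Gram matrix of inner products <x_i, x_j>

def neg_W(a):
    # W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(
    neg_W,
    x0=np.ones(m),
    bounds=[(0.0, None)] * m,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i y_i alpha_i = 0
)
alpha = res.x
w = (alpha * y) @ X  # w = sum_i alpha_i y_i x_i
print("alpha:", alpha.round(3), "w:", w.round(3))
```

A generic solver like this is only for illustration; in practice you'd use a dedicated QP solver or a special-purpose algorithm for this problem. The structure, though, is exactly the dual we just derived.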
Yeah — unfortunately, the fact that b is there adds just a little bit of extra notation to our derivation of the dual. But by the way, all the action in the optimization problem is with w, because b is just one parameter. So let's check — are there any questions about this? Okay, cool. So we've derived a dual optimization problem — and really, don't worry about this if you're not quite sure where it came from; just think of it as: we worked out this constraint when we took the partial derivative with respect to b, and I just copied it over here. So having worked out the dual optimization problem, our approach to deriving the optimal margin classifier, or support vector machine, will be that we'll solve this dual optimization problem for the parameters α*. And then, if you want — this is the equation that we worked out on the previous board — we said that w must be equal to Σ_i α_i y_i x_i. And so once you solve for α, you can then go back and quickly derive w, the parameters of your primal problem; we worked this out earlier. And moreover, once you solve for α and w, it's really easy to solve for b. The interpretation is: given the training set, once you've found the direction of w, you know your separating hyperplane's direction — you know it's got to be one of these parallel hyperplanes, so you know the orientation of the separating hyperplane — and you just have to decide where to place it. And that's what solving for b is. So once you solve for α and w, it's really easy to solve for b: you can plug α and w back into the primal optimization problem and solve for b.
And I just wrote the formula for b down for the sake of completeness: b = −(max_{i: y_i = −1} wᵀx_i + min_{i: y_i = 1} wᵀx_i) / 2. The intuition behind this formula is just: find the worst positive example and the worst negative example — say, this one and this one — and take the point halfway between them. That tells you where you should set the threshold for where to place the separating hyperplane. And then that's the optimal margin classifier. This is also called a support vector machine when you add one more idea, which is called kernels, and I'll say a few words about that. But I hope the process is clear: it's a dual problem. We're going to solve the dual problem for the α_i's. That gives us w, and that gives us b. So there's just one more thing I want to point out as I lead into the next lecture — I'll just write this out again, I guess — which is that it turns out we can take the entire algorithm and express the entire algorithm in terms of inner products. And here's what I mean by that. We said that the parameter w is a sum over your input examples, w = Σ_i α_i y_i x_i. So say we need to make a prediction: someone gives you a new value of x, and you want the value of the hypothesis at x. That's given by g(wᵀx + b), where g was this threshold function that outputs −1 or +1. And so you need to compute wᵀx + b, and that is equal to Σ_i α_i y_i ⟨x_i, x⟩ + b. So it can be expressed as a sum of these inner products between your training examples and this new value of x. And this will lead into our next lecture, which is the idea of kernels. It turns out that in the sorts of feature spaces where we use support vector machines, sometimes your training examples may be very high-dimensional. It may even be the case that the features you want to use are infinite-dimensional feature vectors.
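Continuing the earlier toy sketch (the names alpha, y, X, w are the hypothetical ones from that sketch), both the intercept formula and the prediction step only ever touch the data through inner products, and only the support-vector terms contribute:

```python
import numpy as np

def intercept(w, X, y):
    # b = -( max_{i: y_i = -1} w^T x_i + min_{i: y_i = +1} w^T x_i ) / 2
    scores = X @ w
    return -(scores[y == -1].max() + scores[y == 1].min()) / 2.0

def predict(x_new, alpha, y, X, b, tol=1e-6):
    # h(x) = g(w^T x + b), with w^T x + b = sum_i alpha_i y_i <x_i, x> + b.
    # Only support vectors (alpha_i > tol) contribute to the sum.
    sv = alpha > tol
    s = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b
    return 1.0 if s >= 0 else -1.0

# Usage with the variables from the previous sketch:
# b = intercept(w, X, y)
# print(predict(np.array([1.0, 1.5]), alpha, y, X, b))
```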
But despite this high dimensionality, it'll turn out that there's an interesting representation that you can use that will allow you to compute inner products like these efficiently. And this holds true only for certain feature spaces; it doesn't hold true for arbitrary sets of features. But when we talk about the idea of kernels in the next lecture, we'll see examples where, even though you have extremely high-dimensional feature vectors, you may never want to represent x_i explicitly — you couldn't even store such an infinite-dimensional feature vector in computer memory — but you will nonetheless be able to compute inner products between different feature vectors very efficiently. And so you can, for example, make predictions by making use of these inner products — this is just x_i transpose x — you compute these inner products very efficiently and, therefore, make predictions. And this points to the other reason we derived the dual: on this board, when we worked out what W(α) is, it has the same property — W(α) is again written in terms of these inner products. And so if you look at the dual optimization problem and all the steps of the algorithm, you'll find that you can actually do everything you want — solve the optimization problem, learn the parameters α — without ever needing to represent x_i directly. All you need to do is compute inner products between feature vectors like these. Well, one last property of this algorithm that's kind of nice is that, as I said previously, the α_i's are non-zero only for the support vectors — only for the vectors with functional margin exactly 1 — and in practice, there are usually fairly few of them.
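As a small preview of next lecture's point (my own illustration, not an example from this lecture): the kernel K(x, z) = (xᵀz)² equals the inner product ⟨φ(x), φ(z)⟩ for the feature map φ(x) listing all n² products x_i x_j, yet it costs O(n) to evaluate instead of the O(n²) it would take to build the feature vectors explicitly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

phi = lambda v: np.outer(v, v).ravel()  # explicit n^2-dimensional feature map
lhs = phi(x) @ phi(z)                   # inner product in feature space: O(n^2)
rhs = (x @ z) ** 2                      # kernel evaluation: O(n)
assert np.isclose(lhs, rhs)             # same number, computed two ways
```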
And so what that sparsity means is that if you're representing w this way, then w is represented as a sum over a fairly small fraction of the training examples, because most of the α_i's are 0. And so when you're computing that sum, you need to compute inner products only with the support vectors, which is usually a small fraction of your training set. So that's another nice property, because most of the α's are 0. And well, much of this will make much more sense when we talk about kernels. Any quick questions before I close? Yeah. Student: It seems that for any of this to work, the data has to be really well behaved, and if any of the points are kind of on the wrong side — Instructor (Andrew Ng): No — oh, yeah. So again, today's lecture assumes that the data is linearly separable — that you can actually classify it perfectly. I'll fix this in the next lecture as well. But excellent question. Yes? Student: So can't we assume that [inaudible] point [inaudible], so [inaudible] have [inaudible]? Instructor (Andrew Ng): Yes — let's just say that there are ways to generalize this to multiple classes that I probably won't [inaudible], but yeah, that's a generalization [inaudible]. Okay. Let's close for today, then. We'll talk about kernels in our next lecture.